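Recommended by AI Skill Hub: Document Crawler and Converter (docpull) is a solid MCP tool. It has an AI composite score of 7.2 and performs steadily among comparable tools. If you are looking for a dependable MCP tool, it is worth a closer look.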
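Document Crawler and Converter is an AI tool extension that follows the MCP (Model Context Protocol) standard. Through MCP, AI clients such as Claude and Cursor can access and operate external tools, data sources, and services directly. File operations, database queries, and API calls can all be triggered in natural language from within an AI conversation, which can improve productivity.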
# Option 1: one-step install through the Claude Code CLI
claude skill install https://github.com/raintree-technology/docpull
# Option 2: configure claude_desktop_config.json manually
{
"mcpServers": {
"--------": {
"command": "npx",
"args": ["-y", "docpull"]
}
}
}
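# Config file locations
# macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
# Windows: %APPDATA%/Claude/claude_desktop_config.json
# After installation, use it directly in a Claude conversation
# Example:
# User: Please use Document Crawler and Converter to carry out the following task...
# Claude: [automatically calls the Document Crawler and Converter MCP tool to handle the request]
# List the available tools
# In Claude, type: "List all available MCP tools"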
// claude_desktop_config.json example configuration
{
"mcpServers": {
"________": {
"command": "npx",
"args": ["-y", "docpull"],
"env": {
// "API_KEY": "your-api-key-here"
}
}
}
}
// Save the file, then restart Claude Desktop for the change to take effect
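The manual configuration step above can also be scripted. The following is a minimal sketch, not part of docpull or Claude Desktop: the helper name `add_mcp_server` is hypothetical, and it assumes the config file is plain JSON without comments. It adds one entry under `mcpServers` while leaving any existing entries untouched.

```python
import json
from pathlib import Path


def add_mcp_server(config_path: Path, name: str, command: str, args: list[str]) -> dict:
    """Add or overwrite one entry under "mcpServers", keeping other entries.

    Hypothetical helper for illustration; assumes the file is comment-free JSON.
    """
    config: dict = {}
    if config_path.exists():
        text = config_path.read_text(encoding="utf-8")
        if text.strip():
            config = json.loads(text)

    # Create the "mcpServers" object if it is missing, then set our entry.
    servers = config.setdefault("mcpServers", {})
    servers[name] = {"command": command, "args": list(args)}

    # Write the merged configuration back with stable formatting.
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

A call such as `add_mcp_server(path, "docpull", "npx", ["-y", "docpull"])` would reproduce the entry shown in the listing, where `path` points at one of the config file locations given above and the key `"docpull"` is an assumed server name.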
Security-hardened, browser-free crawler that turns static documentation sites into clean, AI-ready Markdown — fast.
docpull uses async HTTP (not Playwright) to fetch server-rendered pages, extracts main content, and writes clean Markdown with source-URL frontmatter — in seconds, with a small install footprint. It won't render JavaScript, but for the large class of docs that don't need it (API references, Python/Go stdlib, most dev-tool docs, OpenAPI specs, Next.js and Docusaurus builds), it is a fast, auditable, sandbox-friendly way to pipe documentation into an LLM context, a RAG index, or an offline archive. SSRF, XXE, DNS-rebinding, and CRLF-injection protections are on by default — a necessity when an AI agent is choosing the URLs.
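To make the SSRF and CRLF-injection points concrete, here is a small illustrative check. It is a sketch under stated assumptions, not docpull's actual implementation: the function name `url_is_fetchable` is hypothetical, and the caller is assumed to have already resolved the hostname and to connect only to the addresses it validated (which is the usual way to blunt DNS rebinding).

```python
import ipaddress
from urllib.parse import urlsplit


def url_is_fetchable(url: str, resolved_addresses: list[str]) -> bool:
    """Return True only if the URL and every resolved address look safe to fetch.

    Hypothetical guard for illustration; not taken from the docpull source.
    """
    # CR/LF characters could be used to smuggle extra headers into a request.
    if "\r" in url or "\n" in url:
        return False
    try:
        parts = urlsplit(url)
    except ValueError:
        return False
    # Only plain web schemes with a host are allowed (no file://, ftp://, ...).
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    # Reject when resolution produced nothing at all.
    if not resolved_addresses:
        return False
    # Every address must be globally routable: this rejects loopback,
    # private ranges, and link-local metadata endpoints.
    for address in resolved_addresses:
        try:
            ip = ipaddress.ip_address(address)
        except ValueError:
            return False
        if not ip.is_global:
            return False
    return True
```

Requiring every resolved address to pass, rather than any one of them, is a deliberate choice here: a hostname that resolves to both a public and a private address is treated as unsafe.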
The mcp/ directory at the repo root is a separate TypeScript + Bun MCP server backed by PostgreSQL with pgvector for semantic search. It is not the Python MCP server shipped in the docpull package described above — that one is the right choice for almost every user and is installed with pip install 'docpull[mcp]'. The mcp/ tree is mirrored to its own repo at raintree-technology/docpull-mcp; unless you specifically need pgvector-backed semantic search, ignore it and use docpull mcp.
- --single: fetch a single URL without discovery. Designed for tool loops.
- --stream: NDJSON one-record-per-line, flushed on every page, pipeable.
- --max-tokens-per-file N: split each page into token-bounded chunks on heading boundaries (exact counts with tiktoken, estimate without).
- --emit-chunks: write one file or record per chunk instead of per page.
- --strict-js-required: hard-fail on JS-only pages instead of silently skipping.
- --extractor trafilatura: swap in trafilatura for sites where the default heuristics struggle.
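The idea behind token-bounded chunking on heading boundaries can be sketched as follows. This is an illustration of the general technique, not docpull's code: the function names are hypothetical, the four-characters-per-token estimate is an assumption standing in for a real tokenizer, and the heading detection is deliberately naive (it would also match `# ` lines inside fenced code).

```python
import re

# Matches the start of an ATX heading line such as "# Title" or "### Sub".
HEADING = re.compile(r"^#{1,6} ", re.MULTILINE)


def estimate_tokens(text: str) -> int:
    """Rough estimate: about four characters per token (an assumption)."""
    return max(1, len(text) // 4)


def split_on_headings(markdown: str) -> list[str]:
    """Split Markdown into sections, each starting at a heading line."""
    starts = [m.start() for m in HEADING.finditer(markdown)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any text that precedes the first heading
    bounds = starts + [len(markdown)]
    return [markdown[a:b] for a, b in zip(bounds, bounds[1:]) if markdown[a:b].strip()]


def chunk_markdown(markdown: str, max_tokens: int) -> list[str]:
    """Greedily pack whole sections into chunks of at most max_tokens (estimated).

    A single section larger than the budget is emitted as its own oversized chunk.
    """
    chunks: list[str] = []
    current = ""
    for section in split_on_headings(markdown):
        if current and estimate_tokens(current + section) > max_tokens:
            chunks.append(current)
            current = section
        else:
            current += section
    if current:
        chunks.append(current)
    return chunks
```

Because sections are never cut in the middle, each chunk begins at a heading (or at the start of the page), which tends to keep a chunk self-describing when it is later retrieved on its own.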
pip install 'docpull[mcp]'
```bash
pip install docpull
```
…
NDJSON (one record per page or chunk):
```json
{"url": "...", "title": "...", "content": "...", "hash": "...", "token_count": 842, "chunk_index": 0}
```
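A consumer of this NDJSON output might look like the sketch below. The helper names are hypothetical; only the field names (`url`, `title`, `content`, `hash`, `token_count`, `chunk_index`) come from the record shown above, and `token_count` is treated as optional in case a record omits it.

```python
import json
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty NDJSON line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def total_tokens(lines: Iterable[str]) -> int:
    """Sum the token_count field across all records (missing counts are 0)."""
    return sum(record.get("token_count", 0) for record in iter_records(lines))
```

Because both helpers accept any iterable of strings, they work equally on a list of lines, an open file, or `sys.stdin` when the crawler's output is piped in.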
```bash
pip install 'docpull[llm]'          # tiktoken for token-accurate chunking
pip install 'docpull[trafilatura]'  # alternative extractor for noisy pages
pip install 'docpull[mcp]'          # run as an MCP server for AI agents
pip install 'docpull[all]'          # everything above
```
Run docpull --help for the full list. Highlights:
Core:
--profile {rag,mirror,quick,llm,custom}
--single Fetch one URL (no crawl)
--format {markdown,json,ndjson,sqlite}
--stream Stream NDJSON to stdout
LLM / chunking:
--max-tokens-per-file N
--tokenizer NAME tiktoken encoding (default cl100k_base)
--emit-chunks One file/record per chunk
Content extraction:
--extractor {default,trafilatura}
--no-special-cases Disable framework extractors
--strict-js-required Error on JS-only pages
Cache:
--cache Enable incremental updates
--cache-dir DIR
--cache-ttl DAYS
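When driving the CLI from another program, it can help to assemble the argument list in one place. The sketch below only builds the command and does not run it. The helper name `build_docpull_command` is hypothetical, and whether these particular flags may be combined in a single invocation is an assumption to confirm against `docpull --help`.

```python
import shlex
from typing import Optional


def build_docpull_command(
    url: str,
    *,
    single: bool = False,
    stream: bool = False,
    max_tokens_per_file: Optional[int] = None,
    extractor: Optional[str] = None,
) -> list[str]:
    """Build an argv list for the docpull CLI using flags from the listing above.

    Hypothetical helper; flag compatibility is not verified here.
    """
    cmd = ["docpull", url]
    if single:
        cmd.append("--single")
    if stream:
        cmd.append("--stream")
    if max_tokens_per_file is not None:
        cmd += ["--max-tokens-per-file", str(max_tokens_per_file)]
    if extractor is not None:
        cmd += ["--extractor", extractor]
    return cmd


def as_shell_string(cmd: list[str]) -> str:
    """Render the argv list as a safely quoted shell command for logging."""
    return shlex.join(cmd)
```

Passing the list form to `subprocess.run` avoids shell interpolation of the URL, which matters when the URL is chosen by an agent rather than typed by a person.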
from docpull import fetch_one
ctx = fetch_one("https://docs.python.org/3/library/asyncio.html")
print(ctx.title, ctx.source_type)
print(ctx.markdown[:500])
Async streaming:
import asyncio
from docpull import Fetcher, DocpullConfig, ProfileName, EventType

async def main():
    cfg = DocpullConfig(
        url="https://docs.example.com",
        profile=ProfileName.LLM,  # chunked NDJSON output
    )
    async with Fetcher(cfg) as fetcher:
        async for event in fetcher.run():
            if event.type == EventType.FETCH_PROGRESS:
                print(f"{event.current}/{event.total}: {event.url}")
        print(f"Done: {fetcher.stats.pages_fetched} pages")

asyncio.run(main())
Single-page from an agent tool:
from docpull import Fetcher, DocpullConfig

async def tool_call(url: str) -> str:
    async with Fetcher(DocpullConfig(url=url)) as f:
        ctx = await f.fetch_one(url, save=False)
        return ctx.markdown or ctx.error or ""
docpull --doctor # Check installation
docpull URL --verbose # Verbose output
docpull URL --dry-run # Test without downloading
docpull URL --preview-urls # List URLs without fetching
This is a secure, browser-free crawler that converts static documentation sites into clean, AI-ready Markdown files. It is fast and easy to use.
The tool provides a Python API for programmatic access.
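A practical MCP tool focused on converting web pages to Markdown, with an efficient asynchronous design. However, it has few stars and its community activity remains to be seen. It suits specialized data-collection scenarios.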
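AI Skill Hub is a third-party content aggregation platform. The information on this page is compiled from public data and does not constitute any legal endorsement of the tool's functionality or quality.
Test the tool thoroughly in a sandbox or test environment before deploying it to production, and carry out the necessary security assessment.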
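✅ MIT License: one of the most permissive open-source licenses. Commercial use, modification, and distribution are all permitted, provided the copyright and license notice is retained.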
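Overall, Document Crawler and Converter is a good-quality MCP tool that is reasonably competitive among comparable tools. AI Skill Hub will keep tracking its updates. Consider bookmarking it and adopting it when it fits your use case.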
| Field | Value |
| --- | --- |
| Original name | docpull |
| Original description | Open-source MCP tool: Crawl any website and convert it to clean, AI-ready Markdown - async Python CLI. ⭐21 · Python |
| Topics | web crawler, Markdown conversion, AI data, document processing, async tools, CLI tools |
| GitHub | https://github.com/raintree-technology/docpull |
| License | MIT |
| Language | Python |
Listed: 2026-05-20 · Updated: 2026-05-21 · License: MIT · AI Skill Hub provides no legal endorsement of the accuracy of third-party content.