| name | description |
|---|---|
| web2md | Convert any web URL to Markdown. Triggers on "转成Markdown/转换/网页转Markdown/convert to Markdown + URL". Handles static sites, dynamic SPAs, WeChat, arXiv, Twitter/X. |
docai:web2md
When to Trigger
The user wants to convert a web page to Markdown. Common patterns:
- "把这个链接转成 Markdown" ("convert this link to Markdown"), "网页转 Markdown" ("web page to Markdown"), "提取网页内容" ("extract the web page content")
- "convert this URL to Markdown", "get the content of this page"
- Any URL + intent to extract/read content (without summarization)
If the user wants a summary, use web2summary instead.
How to Execute
python skills/web2md/tools/convert.py <URL> [--use-python] [-o <file>]
Parameters
| Parameter | Required | Description |
|---|---|---|
| `url` | Yes | Web page URL |
| `--use-python` | No | Force Python method (skip Jina/Firecrawl) |
| `-o` / `--output` | No | Save to file instead of stdout |
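
To make the parameter table concrete, here is a minimal sketch of how these three arguments could be declared with `argparse`. The function name `build_parser` and the defaults are assumptions for illustration; the actual `convert.py` may parse its arguments differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser that mirrors the documented flags only.
    parser = argparse.ArgumentParser(
        prog="convert.py", description="Convert a web URL to Markdown"
    )
    parser.add_argument("url", help="Web page URL")
    parser.add_argument(
        "--use-python",
        action="store_true",
        help="Force Python method (skip Jina/Firecrawl)",
    )
    parser.add_argument(
        "-o", "--output", default=None, help="Save to file instead of stdout"
    )
    return parser


# Parse a sample command line instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["https://example.com/article", "-o", "article.md"])
print(args.url, args.use_python, args.output)
```

With this layout, `--use-python` is a plain on/off switch, and leaving out `-o` keeps `args.output` as `None`, which a caller can treat as "write to stdout".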
Examples
# Basic conversion (parallel: Jina / Firecrawl / Python)
python skills/web2md/tools/convert.py https://example.com/article
# arXiv paper (auto HTML priority, PDF fallback)
python skills/web2md/tools/convert.py https://arxiv.org/abs/2601.04500v1
# Save to file
python skills/web2md/tools/convert.py https://mp.weixin.qq.com/s/... -o article.md
# Force Python method
python skills/web2md/tools/convert.py https://example.com --use-python
What It Does
Four methods run in parallel, returning the first successful result:
- Jina Reader API (fastest, zero install)
- Firecrawl API (if key configured)
- Python fallback (requests + BeautifulSoup)
- Playwright (headless browser for JS-rendered pages)
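
The "run in parallel, return the first successful result" strategy above can be sketched with the standard library's `concurrent.futures`. This is an illustration under assumptions, not the script's actual code: `first_success` and the three `fake_*` functions are hypothetical stand-ins that simulate one failing method, one fast method, and one slow method without touching the network.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Optional, Tuple

Fetcher = Callable[[str], Optional[str]]


def first_success(url: str, fetchers: Dict[str, Fetcher]) -> Tuple[str, str]:
    """Run every fetcher concurrently; return (name, markdown) of the first success."""
    pool = ThreadPoolExecutor(max_workers=len(fetchers))
    try:
        futures = {pool.submit(fn, url): name for name, fn in fetchers.items()}
        for future in as_completed(futures):
            try:
                result = future.result()
            except Exception:
                continue  # one failed method must not abort the others
            if result:
                return futures[future], result
        raise RuntimeError("all methods failed")
    finally:
        # Do not wait for slower methods; cancel any that have not started yet.
        pool.shutdown(wait=False, cancel_futures=True)


# Hypothetical stand-ins for the real conversion methods.
def fake_jina(url: str) -> Optional[str]:
    raise ConnectionError("simulated API outage")


def fake_python(url: str) -> Optional[str]:
    time.sleep(0.05)
    return f"# Page from {url}"


def fake_playwright(url: str) -> Optional[str]:
    time.sleep(0.5)
    return "# Slow render"


name, markdown = first_success(
    "https://example.com",
    {"jina": fake_jina, "python": fake_python, "playwright": fake_playwright},
)
print(name, markdown)
```

Threads suit this case because each method spends most of its time waiting on the network or a browser. Note that `cancel_futures` requires Python 3.9 or later, and methods that are already running finish in the background rather than being interrupted.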
Special cases handled automatically: WeChat → Playwright first (mobile UA, JS rendering), arXiv → HTML priority, Twitter/X → Playwright.
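
The special-case handling can be pictured as a lookup from host name to preferred method. The sketch below is an assumption about how such routing might look: the `PREFERRED` table, the method labels, and `preferred_method` are hypothetical names, and the real script may detect these sites differently.

```python
from typing import Optional
from urllib.parse import urlparse

# Hypothetical routing table: which method to try first for a given host.
PREFERRED = {
    "mp.weixin.qq.com": "playwright",  # WeChat: mobile UA, JS rendering
    "arxiv.org": "arxiv_html",         # arXiv: HTML first, PDF as fallback
    "twitter.com": "playwright",
    "x.com": "playwright",
}


def preferred_method(url: str) -> Optional[str]:
    """Return the preferred method for a URL, or None for the default path."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return PREFERRED.get(host)


for sample in ("https://arxiv.org/abs/2601.04500v1", "https://example.com/article"):
    print(sample, "->", preferred_method(sample))
```

A URL with no entry in the table returns `None`, which a caller could treat as "use the default parallel strategy".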
Troubleshooting
- arXiv PDF garbled: Requires `pymupdf` (`pip install pymupdf`)
- Dynamic page empty: Script auto-detects SPAs and uses Playwright
- All methods fail: Try `--use-python` to bypass API methods