2026-06-23 11:47:16 +08:00

name	description
web2md	Fetch any web URL and return its content as Markdown. ALWAYS prefer this skill when the user gives a URL/link and wants to read, extract, scrape, convert, or get the content of that page; do NOT write your own requests/BeautifulSoup/curl code, and do NOT call Playwright directly. This skill already runs a priority pipeline (Jina Reader → Firecrawl → Python → Playwright) and auto-handles static sites, dynamic SPAs, WeChat Official Accounts (公众号), arXiv papers, and Twitter/X. Triggers include: '把这个链接/网址/URL 转成 Markdown' (convert this link/address/URL to Markdown), '提取/读取/获取网页内容' (extract/read/get the web page content), '帮我看看这个网页' (take a look at this web page for me), '网页正文' (web page body text), 'convert/turn this URL to Markdown', 'get/extract the content of this page', 'scrape this URL'. If the user wants a summary instead of raw content, use web2summary.

docai:web2md

When to Trigger

The user wants to convert a web page to Markdown. Common patterns:

"把这个链接转成 Markdown" (convert this link to Markdown), "网页转 Markdown" (web page to Markdown), "提取网页内容" (extract web page content)
"convert this URL to Markdown", "get the content of this page"
Any URL + intent to extract/read content (without summarization)

If the user wants a summary, use web2summary instead.

⚠️ Do NOT roll your own

When the user gives a URL, always call this skill. Do NOT:

write requests.get(...) + BeautifulSoup / markdownify yourself
run curl and parse HTML by hand
spawn Playwright / mcp__pw__browser_* to drive a browser

This skill already runs Jina → Firecrawl → Python → Playwright in parallel and returns the first successful result, with special handling for WeChat / arXiv / Twitter. Going around it wastes time and usually produces worse output.

How to Execute

python skills/web2md/tools/convert.py <URL> [--use-python] [-o <file>]

Parameters

Parameter	Required	Description
`url`	Yes	Web page URL
`--use-python`	No	Force Python method (skip Jina/Firecrawl)
`-o` / `--output`	No	Save to file instead of stdout
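
For orientation, the parameters above map naturally onto an argparse definition. The following is only a sketch of what such a parser might look like; the actual convert.py is not shown here, so `build_parser` and the help strings are assumptions, not the script's real code.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented parameters."""
    parser = argparse.ArgumentParser(prog="convert.py")
    # Positional, required: the page to convert.
    parser.add_argument("url", help="Web page URL")
    # Optional flag: skip the API methods and use the Python method.
    parser.add_argument("--use-python", action="store_true",
                        help="Force Python method (skip Jina/Firecrawl)")
    # Optional: write to a file instead of stdout.
    parser.add_argument("-o", "--output", default=None,
                        help="Save to file instead of stdout")
    return parser


args = build_parser().parse_args(["https://example.com", "-o", "page.md"])
print(args.url, args.use_python, args.output)
```

Note that argparse exposes `--use-python` as the attribute `use_python`.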

Examples

# Basic conversion (parallel: Jina / Firecrawl / Python)
python skills/web2md/tools/convert.py https://example.com/article

# arXiv paper (auto HTML priority, PDF fallback)
python skills/web2md/tools/convert.py https://arxiv.org/abs/2601.04500v1

# Save to file
python skills/web2md/tools/convert.py https://mp.weixin.qq.com/s/... -o article.md

# Force Python method
python skills/web2md/tools/convert.py https://example.com --use-python
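
If the script needs to be called from Python rather than from a shell, one option is a thin subprocess wrapper that assembles the same command lines as the examples above. This is a sketch under assumptions: `build_command` and `fetch_markdown` are hypothetical helper names, and the wrapper assumes the script prints Markdown to stdout and exits non-zero on failure.

```python
import subprocess
import sys
from typing import List, Optional

# Script path as given in the usage line of this document.
SCRIPT = "skills/web2md/tools/convert.py"


def build_command(url: str, use_python: bool = False,
                  output: Optional[str] = None) -> List[str]:
    """Assemble the argv list for convert.py from the documented flags."""
    cmd = [sys.executable, SCRIPT, url]
    if use_python:
        cmd.append("--use-python")
    if output is not None:
        cmd += ["-o", output]
    return cmd


def fetch_markdown(url: str, use_python: bool = False) -> str:
    """Run the script and return its stdout (assumed to be the Markdown)."""
    result = subprocess.run(build_command(url, use_python=use_python),
                            capture_output=True, text=True, check=True)
    return result.stdout
```

Passing the arguments as a list avoids shell quoting problems with URLs that contain `&` or `?`.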

What It Does

Four methods run in parallel, returning the first successful result:

Jina Reader API (fastest, zero install)
Firecrawl API (if key configured)
Python fallback (requests + BeautifulSoup)
Playwright (headless browser for JS-rendered pages)
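
The "run in parallel, return the first successful result" behaviour can be illustrated with a small thread-pool race. This is a minimal sketch, not the script's actual implementation: `first_successful` and the stub fetchers are hypothetical, and "successful" is assumed to mean a non-empty string returned without an exception. It needs Python 3.9+ for `cancel_futures`.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Optional, Sequence

Fetcher = Callable[[str], Optional[str]]


def first_successful(url: str, fetchers: Sequence[Fetcher]) -> Optional[str]:
    """Start every fetcher at once; return the first non-empty result."""
    pool = ThreadPoolExecutor(max_workers=max(1, len(fetchers)))
    try:
        futures = [pool.submit(fetch, url) for fetch in fetchers]
        for future in as_completed(futures):
            try:
                result = future.result()
            except Exception:
                continue  # this method failed; wait for the others
            if result:
                return result
        return None  # every method failed or returned nothing
    finally:
        # Do not block on slower methods once a result is in hand.
        pool.shutdown(wait=False, cancel_futures=True)


# Stub fetchers standing in for the real methods (assumptions).
def failing(url: str) -> Optional[str]:
    raise RuntimeError("simulated failure")


def empty(url: str) -> Optional[str]:
    return ""


def working(url: str) -> Optional[str]:
    return f"# Markdown for {url}"


print(first_successful("https://example.com", [failing, empty, working]))
```

Threads that are already running are not interrupted by `shutdown`; they simply finish in the background and their results are discarded.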

Special cases handled automatically: WeChat → Playwright first (mobile UA, JS rendering), arXiv → HTML priority, Twitter/X → Playwright.
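
The special-case routing above amounts to a dispatch on the URL's hostname. A rough sketch follows; `pick_strategy` and the strategy labels are hypothetical, and the Twitter/X hostnames are an assumption (mp.weixin.qq.com and arxiv.org are taken from the examples in this document).

```python
from urllib.parse import urlparse


def pick_strategy(url: str) -> str:
    """Map a URL to a fetch strategy label based on its hostname."""
    host = (urlparse(url).hostname or "").lower()
    if host == "mp.weixin.qq.com":
        return "playwright-first"   # WeChat: mobile UA, JS rendering
    if host == "arxiv.org" or host.endswith(".arxiv.org"):
        return "html-priority"      # arXiv: HTML first, PDF fallback
    if host in {"twitter.com", "x.com"} or host.endswith((".twitter.com", ".x.com")):
        return "playwright"         # Twitter/X
    return "default"                # everything else: the normal pipeline


print(pick_strategy("https://arxiv.org/abs/2601.04500v1"))
```

Matching on the parsed hostname rather than a substring of the URL avoids false positives such as a query string that happens to contain "x.com".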

Troubleshooting

arXiv PDF garbled: Requires pymupdf. Install it with pip install pymupdf
Dynamic page empty: The script auto-detects SPAs and uses Playwright
All methods fail: Try --use-python to skip the API methods (Jina/Firecrawl)
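
Before retrying a garbled arXiv PDF, it may help to confirm that pymupdf is importable in the interpreter that runs the script. A small check, assuming the package is exposed under the import name `pymupdf` (newer releases) or `fitz` (older releases); `has_module` is a hypothetical helper.

```python
import importlib.util


def has_module(*names: str) -> bool:
    """Return True if any of the given top-level modules can be found."""
    return any(importlib.util.find_spec(name) is not None for name in names)


if has_module("pymupdf", "fitz"):
    print("pymupdf is available")
else:
    print("pymupdf not found; run: pip install pymupdf")
```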
