add web2md web2summary

朱潮 2026-06-23 10:12:55 +08:00
parent ede5960b5a
commit 62c0f62134
5 changed files with 1145 additions and 0 deletions

@ -0,0 +1,194 @@
# docai-web2md
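A standalone Python tool that converts web pages to Markdown using a **priority-based architecture**.
> 📖 **Documentation guide**
> - **SKILL.md** - Claude Code usage guide (how to invoke this skill)
> - **README.md** - this document (tool features and standalone usage)
> - **tools/convert.py** - the conversion code itself
> - **Shared reference**: [web-sources.md](../../shared/references/web-sources.md) - platform support matrix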
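## Key Features
- ✅ **Jina Reader API first** - zero install, the fastest and simplest path (except for WeChat Official Account articles)
- ✅ **Firecrawl API support** - for advanced crawling needs
- ✅ **Smart Python fallback** - takes over automatically when the methods above fail
- ✅ **arXiv HTML first** - fetches the HTML version first and falls back to the PDF on failure
- ✅ **Multi-platform support** - WeChat Official Accounts (handled directly in Python), static blogs, dynamic pages, and more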
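## Priority Strategy
```
# Conversion flow
Input URL
arXiv? → convert to the HTML URL
WeChat Official Account? → go straight to the Python path ⭐
Try the Jina Reader API (fast)
↓ (on failure)
Try the Firecrawl API (requires an API key)
↓ (on failure)
Python method (fallback)
arXiv? → download the PDF and extract its text
```
**Special handling for WeChat Official Accounts**: Jina Reader handles WeChat articles poorly, so these URLs go straight to the Python path for the best results.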
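The routing described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the tool's actual code: the function name `choose_route` and the route labels are hypothetical, and host matching is simplified to exact hostnames.

```python
from urllib.parse import urlparse


def choose_route(url: str) -> list[str]:
    """Return the ordered list of methods to try for a URL (illustrative only)."""
    host = (urlparse(url).hostname or "").lower()
    # WeChat Official Account articles skip the hosted APIs entirely.
    if host == "mp.weixin.qq.com":
        return ["python"]
    route = ["jina", "firecrawl", "python"]
    # arXiv gets one extra last resort: download the PDF and extract its text.
    if host == "arxiv.org" or host.endswith(".arxiv.org"):
        route.append("arxiv-pdf")
    return route


print(choose_route("https://mp.weixin.qq.com/s/1LfkYdbzymoWxdvdnKeLnA"))
print(choose_route("https://arxiv.org/abs/2601.04500v1"))
```

Returning an ordered list keeps the routing decision separate from the code that performs each fetch, which makes the priority order easy to test in isolation.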
## Quick Start
### Option 1: uv (recommended)
```bash
# 1. Install uv (if it is not installed yet)
curl -LsSf https://astral.sh/uv/install.sh | sh
# 2. Set up the environment in the docai-skills directory
cd docai-skills
uv sync
# 3. Run the script (no need to activate the environment)
uv run python skills/docai-web2md/tools/convert.py https://www.breezedeus.com/article/ai-agent-context-engineering
```
### Option 2: pip (traditional)
```bash
# 1. Create a virtual environment (optional but recommended)
python -m venv .venv
source .venv/bin/activate
# 2. Install the dependencies
pip install requests beautifulsoup4 markdownify pymupdf
# 3. Run the script
python skills/docai-web2md/tools/convert.py https://www.breezedeus.com/article/ai-agent-context-engineering
```
### Option 3: Jina Reader API (no install)
```bash
# Call the API directly; no dependencies required
curl https://r.jina.ai/https://www.breezedeus.com/article/ai-agent-context-engineering
```
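The same call can be made from Python with only the standard library. The sketch below assumes the endpoint behaves as in the curl example (the target URL is appended to `https://r.jina.ai/`); the helper names `jina_reader_url` and `fetch_markdown` and the 8-second default timeout are illustrative choices, and the service may treat the default `urllib` client differently from curl.

```python
import urllib.request

JINA_READER_BASE = "https://r.jina.ai"


def jina_reader_url(target_url: str) -> str:
    """Prefix the target URL with the Jina Reader endpoint, as in the curl call."""
    return f"{JINA_READER_BASE}/{target_url.strip()}"


def fetch_markdown(target_url: str, timeout: float = 8.0) -> str:
    """Fetch the Markdown rendering of a page (needs network access)."""
    with urllib.request.urlopen(jina_reader_url(target_url), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Only builds and prints the request URL; call fetch_markdown() to go online.
    print(jina_reader_url("https://www.breezedeus.com/article/ai-agent-context-engineering"))
```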
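### ⚠️ Claude Code Skill Integration
**Important**: Claude Code runs Skills with the system Python, so the dependencies must also be installed there:
```bash
# Install into the system Python with uv (the project virtual environment is unaffected)
uv pip install --system requests beautifulsoup4 markdownify pymupdf
# Or use pip
pip install requests beautifulsoup4 markdownify pymupdf
```
**See also**: [UV_ENVIRONMENT.md](../../UV_ENVIRONMENT.md)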
## Command-Line Usage
```bash
# Basic usage (automatic priority order)
python skills/docai-web2md/tools/convert.py https://www.breezedeus.com/article/ai-agent-context-engineering
# Save to a file
python skills/docai-web2md/tools/convert.py https://www.breezedeus.com/article/ai-agent-context-engineering -o article.md
# Plain-text mode
python skills/docai-web2md/tools/convert.py https://www.breezedeus.com/article/ai-agent-context-engineering --pure-text
# Force the Python method (skip Jina/Firecrawl)
python skills/docai-web2md/tools/convert.py https://www.breezedeus.com/article/ai-agent-context-engineering --use-python
```
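## Priority Architecture
```
Input URL
arXiv? → convert to the HTML URL
Jina Reader API (⭐ zero install)
↓ on failure
Firecrawl API (requires an API key)
↓ on failure
Python implementation (catch-all fallback)
arXiv? → download the PDF and extract its text
Regular web page → HTML parsing
```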
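The fallback chain in the diagram amounts to "try each method in order and keep the first usable result". The sketch below shows that pattern with stand-in fetchers; `first_success`, `failing`, `empty`, and `fallback` are hypothetical names used for illustration and are not part of `convert.py`.

```python
from typing import Callable, Optional, Sequence

Fetcher = Callable[[str], Optional[str]]


def first_success(url: str, fetchers: Sequence[Fetcher]) -> Optional[str]:
    """Call each fetcher in priority order; return the first non-empty result."""
    for fetch in fetchers:
        try:
            result = fetch(url)
        except Exception:
            # A failing method is skipped so the next one can take over.
            continue
        if result and result.strip():
            return result
    return None


# Stand-in fetchers for demonstration; real ones would call Jina, Firecrawl, etc.
def failing(url: str) -> Optional[str]:
    raise RuntimeError("service unavailable")


def empty(url: str) -> Optional[str]:
    return ""


def fallback(url: str) -> Optional[str]:
    return f"# Converted\n\nSource: {url}"


print(first_success("https://example.com", [failing, empty, fallback]))
```

Treating an empty string the same as a failure matters here: a hosted API can return HTTP 200 with no useful body, and the chain should still move on.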
## Usage Examples
```bash
# Static blog (Jina Reader)
python skills/docai-web2md/tools/convert.py https://www.breezedeus.com/article/ai-agent-context-engineering
# arXiv paper (HTML first, PDF fallback)
python skills/docai-web2md/tools/convert.py https://arxiv.org/abs/2601.04500v1
# WeChat Official Account (handled directly in Python, skipping Jina Reader)
python skills/docai-web2md/tools/convert.py https://mp.weixin.qq.com/s/1LfkYdbzymoWxdvdnKeLnA
# X.com/Twitter (Python dynamic rendering)
python skills/docai-web2md/tools/convert.py https://x.com/user/status/123
```
## Python API
```python
from skills.docai_web2md.tools.convert import WebToMarkdown
converter = WebToMarkdown()
# Automatic priority order (recommended)
markdown = converter.convert("https://www.breezedeus.com/article/ai-agent-context-engineering")
# arXiv is handled automatically (HTML → PDF)
paper = converter.convert("https://arxiv.org/abs/2601.04500v1")
# Force the Python method
markdown = converter.convert("https://www.breezedeus.com/article/ai-agent-context-engineering", use_python=True)
# Plain-text output
text = converter.convert("https://www.breezedeus.com/article/ai-agent-context-engineering", pure_text=True)
```
## Dependencies
| Method | Dependency | Notes |
|------|------|------|
| **Jina Reader** | None | Only a network connection is needed |
| **Firecrawl** | `FIRECRAWL_API_KEY` | Environment variable |
| **Python fallback** | `requests`, `beautifulsoup4`, `markdownify` | Core dependencies |
| **PDF support** | `pymupdf` | arXiv PDF extraction |
| **Dynamic pages** | `playwright` | React/Vue SPAs |
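Before running the tool it can help to check which of these dependencies are actually available. The following is a small sketch using only the standard library; the function name `check_environment` is hypothetical, and the import names (`bs4` for `beautifulsoup4`, `fitz` for `pymupdf`) are the ones these packages are commonly imported under.

```python
import importlib.util
import os

# Import names differ from package names: beautifulsoup4 -> bs4, pymupdf -> fitz.
OPTIONAL_MODULES = {
    "requests": "requests",
    "beautifulsoup4": "bs4",
    "markdownify": "markdownify",
    "pymupdf": "fitz",
    "playwright": "playwright",
}


def check_environment() -> dict[str, bool]:
    """Report which packages are importable and whether the Firecrawl key is set."""
    status = {
        package: importlib.util.find_spec(module) is not None
        for package, module in OPTIONAL_MODULES.items()
    }
    status["FIRECRAWL_API_KEY"] = bool(os.environ.get("FIRECRAWL_API_KEY"))
    return status


for name, available in check_environment().items():
    print(f"{name:20s} {'ok' if available else 'missing'}")
```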
## Performance Reference
- **Jina Reader**: ~1-2 s
- **Firecrawl**: ~2-5 s
- **Python (static pages)**: ~1-2 s
- **Python (dynamic pages)**: ~5-10 s
- **arXiv PDF**: ~2-5 s
## Relationship to the Skill
- **SKILL.md**: tells Claude how to use this tool
- **tools/convert.py**: the code that performs the conversion
- **README.md**: this document (tool usage)
## Testing
```bash
# Test against the breezedeus.com blog
python skills/docai-web2md/tools/convert.py https://www.breezedeus.com/article/ai-agent-context-engineering
```
## License
MIT

@ -0,0 +1,55 @@
---
name: web2md
description: Convert any web URL to Markdown. Triggers on "转成Markdown/转换/网页转Markdown/convert to Markdown + URL". Handles static sites, dynamic SPAs, WeChat, arXiv, Twitter/X.
---
# docai:web2md
## When to Trigger
User wants to convert a web page to Markdown. Common patterns:
- "把这个链接转成 Markdown"、"网页转 Markdown"、"提取网页内容"
- "convert this URL to Markdown"、"get the content of this page"
- Any URL + intent to extract/read content (without summarization)
If user wants summary, use web2summary instead.
## How to Execute
```bash
python skills/web2md/tools/convert.py <URL> [--pure-text] [--use-python] [-o <file>]
```
### Parameters
| Parameter | Required | Description |
|-----------|----------|-------------|
| `url` | Yes | Web page URL |
| `--pure-text` | No | Output plain text without Markdown formatting |
| `--use-python` | No | Force Python method (skip Jina/Firecrawl) |
| `-o` / `--output` | No | Save to file instead of stdout |
### Examples
```bash
# Basic conversion (parallel: Jina / Firecrawl / Python)
python skills/web2md/tools/convert.py https://example.com/article
# arXiv paper (auto HTML priority, PDF fallback)
python skills/web2md/tools/convert.py https://arxiv.org/abs/2601.04500v1
# Save to file
python skills/web2md/tools/convert.py https://mp.weixin.qq.com/s/... -o article.md
# Force Python method
python skills/web2md/tools/convert.py https://example.com --use-python
```
## What It Does
Four methods run in parallel, returning the first successful result:
1. Jina Reader API (fastest, zero install)
2. Firecrawl API (if key configured)
3. Python fallback (requests + BeautifulSoup)
4. Playwright (headless browser for JS-rendered pages)
Special cases handled automatically: WeChat → WeSpy first, then Playwright (mobile UA, JS rendering), then the plain Python path; arXiv → HTML first with PDF fallback; Twitter/X → URL rewritten to fxtwitter/fixupx before conversion.
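The "run in parallel, return the first successful result" behavior can be sketched with `concurrent.futures`. This is an illustration under assumptions rather than the skill's exact code: `first_completed_success` and the three stand-in methods are hypothetical, and `cancel_futures=True` requires Python 3.9 or later.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Optional, Sequence


def first_completed_success(
    url: str, methods: Sequence[Callable[[str], Optional[str]]]
) -> Optional[str]:
    """Run all methods concurrently; return the first truthy result to finish."""
    executor = ThreadPoolExecutor(max_workers=max(1, len(methods)))
    try:
        futures = [executor.submit(method, url) for method in methods]
        for future in as_completed(futures):
            try:
                result = future.result()
            except Exception:
                continue  # a failed method does not stop the others
            if result:
                return result
        return None
    finally:
        # Return without waiting for slower methods that are still running.
        executor.shutdown(wait=False, cancel_futures=True)


def slow(url: str) -> str:
    time.sleep(0.5)
    return "slow result"


def fast(url: str) -> str:
    time.sleep(0.05)
    return "fast result"


def broken(url: str) -> str:
    raise RuntimeError("method failed")


print(first_completed_success("https://example.com", [slow, broken, fast]))
```

Note the explicit `shutdown(wait=False, ...)`: using the executor as a `with` block would wait for every running task on exit, so the caller would only be as fast as the slowest method.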
## Troubleshooting
- **arXiv PDF garbled**: Requires `pymupdf` (`pip install pymupdf`)
- **Dynamic page empty**: Script auto-detects SPAs and uses Playwright
- **All methods fail**: Try `--use-python` to bypass API methods

@ -0,0 +1,683 @@
#!/usr/bin/env python3
"""
Web to Markdown Converter Tool

Methods by priority (non-Python methods first):
1. Jina Reader API - zero install, converts with a single URL
2. Firecrawl API - requires an API key
3. Python implementation - fallback when the methods above fail

arXiv special handling:
- Input: https://arxiv.org/abs/2601.04500v1
- Converted to: https://arxiv.org/html/2601.04500v1
- Jina Reader first; on failure, Python downloads the PDF

WeChat Official Account special handling:
- WeSpy first
- Falls back to Playwright on failure
- Falls back to the Python method last

Usage:
    python convert.py <url> [--pure-text] [--use-python] [--output <file>]

Examples:
    python convert.py https://www.breezedeus.com/article/ai-agent-context-engineering
    python convert.py https://arxiv.org/abs/2601.04500v1 --output paper.md
    python convert.py https://x.com/user/status/123 --pure-text
"""
import sys
import argparse
import logging
from pathlib import Path
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from bs4 import BeautifulSoup
from markdownify import markdownify as md
import tempfile
from urllib.parse import urlparse
from concurrent.futures import ThreadPoolExecutor, as_completed
import re
import os
logger = logging.getLogger(__name__)
class WebToMarkdown:
    """Web page to Markdown converter (parallel, priority-ordered methods)."""

    # Timeout constants (seconds)
    TIMEOUT_HEAD = 3
    TIMEOUT_JINA = 8
    TIMEOUT_FIRECRAWL = 10
    TIMEOUT_REQUESTS = 15
    TIMEOUT_PLAYWRIGHT = 15000  # milliseconds

    def __init__(self):
        self.session = requests.Session()
        self.session.headers.update(
            {"User-Agent": "Mozilla/5.0 (compatible; DocAI-Converter/1.0)"}
        )
        # Retry policy: status-based retries only for 429/5xx, at most 2 retries,
        # exponential backoff
        retry = Retry(
            total=2,
            backoff_factor=1,
            status_forcelist=[429, 500, 502, 503, 504],
            allowed_methods=["HEAD", "GET", "POST"],
        )
        adapter = HTTPAdapter(max_retries=retry)
        self.session.mount("https://", adapter)
        self.session.mount("http://", adapter)
        # Read the Firecrawl API key from the environment
        self.firecrawl_api_key = os.environ.get("FIRECRAWL_API_KEY")

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.session.close()

    def convert(self, url, pure_text=False, use_python=False):
        """Convert a URL to Markdown (parallel, priority-ordered methods).

        Launches Jina Reader / Firecrawl / Python / Playwright in parallel and
        takes the first successful result. WeChat Official Account URLs and
        --use-python mode take a direct path instead.

        Args:
            url: Web page URL
            pure_text: Return unformatted plain text
            use_python: Force the Python method

        Returns:
            str: Markdown or plain-text content, or None if every method fails
        """
        url = url.strip()
        # URL validation
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"Invalid URL: {url}")
        # arXiv special case: convert to the HTML URL
        if self._is_arxiv(url):
            url = self._convert_arxiv_to_html(url)
        # WeChat Official Account: try WeSpy first, then fall back to Playwright / Python
        if self._is_wechat(url):
            result = self._try_wespy(url, pure_text)
            if result:
                return result
            result = self._try_playwright(url, pure_text)
            if result:
                return result
            return self._python_convert(url, pure_text)
        # Twitter / X.com special case: rewrite twitter/x.com URLs to
        # fxtwitter/fixupx to get content rendered from metadata
        if self._is_twitter(url):
            url = self._convert_twitter_to_proxy(url)
        # Forced Python mode
        if use_python:
            if self._is_arxiv(url):
                return self._handle_arxiv(url, pure_text)
            return self._python_convert(url, pure_text)
        # Launch several methods in parallel and take the first success
        result = self._parallel_convert(url, pure_text)
        if result:
            return result
        # Every parallel method failed: for arXiv, try the PDF fallback
        if self._is_arxiv(url):
            return self._handle_arxiv(url, pure_text)
        return None

    def _parallel_convert(self, url, pure_text):
        """Try several methods in parallel and return the first successful result."""
        futures = {}
        executor = ThreadPoolExecutor(max_workers=4)
        try:
            futures[executor.submit(self._try_jina_reader, url, pure_text)] = "jina"
            if self.firecrawl_api_key:
                futures[executor.submit(self._try_firecrawl, url, pure_text)] = (
                    "firecrawl"
                )
            futures[executor.submit(self._python_convert, url, pure_text)] = "python"
            futures[executor.submit(self._try_playwright, url, pure_text)] = (
                "playwright"
            )
            for future in as_completed(futures):
                try:
                    result = future.result()
                except Exception:
                    continue
                if result:
                    return result
            return None
        finally:
            # Do not block on slower methods: a `with` block would wait for every
            # running task on exit, and Future.cancel() cannot stop a task that
            # has already started. cancel_futures requires Python 3.9+.
            executor.shutdown(wait=False, cancel_futures=True)

    def _try_jina_reader(self, url, pure_text):
        """Try the Jina Reader API.

        Usage: https://r.jina.ai/https://www.breezedeus.com/article/ai-agent-context-engineering
        """
        jina_base_urls = ["https://r.jinaai.cn", "https://r.jina.ai"]
        try:
            for jina_base_url in jina_base_urls:
                jina_url = f"{jina_base_url}/{url}"
                try:
                    response = self.session.get(jina_url, timeout=self.TIMEOUT_JINA)
                    response.raise_for_status()
                    content = response.text
                    # Make sure the response carries real content
                    if content and len(content.strip()) > 50:
                        if pure_text:
                            return content
                        # Jina already returns decent Markdown; a light cleanup is enough
                        return self._clean_jina_markdown(content)
                except Exception as e:
                    logger.warning("Jina Reader failed (%s): %s", jina_base_url, e)
        except Exception as e:
            logger.warning("Jina Reader failed: %s", e)
        return None

    def _try_firecrawl(self, url, pure_text):
        """Try the Firecrawl API."""
        if not self.firecrawl_api_key:
            logger.info("Firecrawl API key is not set (FIRECRAWL_API_KEY)")
            return None
        try:
            response = self.session.post(
                "https://api.firecrawl.dev/v0/scrape",
                headers={"Authorization": f"Bearer {self.firecrawl_api_key}"},
                json={"url": url, "formats": ["markdown"]},
                timeout=self.TIMEOUT_FIRECRAWL,
            )
            if response.status_code == 200:
                data = response.json()
                if data.get("success") and data.get("data", {}).get("markdown"):
                    markdown = data["data"]["markdown"]
                    if pure_text:
                        # Extract plain text from the Markdown
                        return re.sub(r"[\*\#\`\[\]\(\)]", "", markdown)
                    return markdown
            else:
                logger.warning("Firecrawl error: %s", response.status_code)
        except Exception as e:
            logger.warning("Firecrawl failed: %s", e)
        return None

    def _try_playwright(self, url, pure_text):
        """Try fetching a dynamic page with Playwright."""
        try:
            content = self._get_with_playwright(url)
            if not content or len(content.strip()) < 50:
                return None
            if pure_text:
                return self._to_plain_text(content)
            return self._to_markdown(content)
        except Exception as e:
            logger.warning("Playwright failed: %s", e)
            return None

    def _try_wespy(self, url, pure_text):
        """Try fetching a WeChat Official Account article with WeSpy."""
        try:
            from wespy import ArticleFetcher
        except ImportError as e:
            logger.warning("WeSpy is not installed: %s", e)
            return None
        try:
            fetcher = ArticleFetcher()
            with tempfile.TemporaryDirectory() as output_dir:
                article_info = fetcher.fetch_article(
                    url=url,
                    output_dir=output_dir,
                    save_markdown=True,
                    save_html=False,
                    save_json=False,
                )
                if not article_info:
                    return None
                # Read the result before the temporary directory is removed
                markdown = self._read_wespy_markdown(output_dir, article_info)
            if not markdown:
                return None
            if pure_text:
                return self._markdown_to_plain_text(markdown)
            return markdown.strip()
        except Exception as e:
            logger.warning("WeSpy failed: %s", e)
            return None

    def _python_convert(self, url, pure_text):
        """Python implementation (fallback method)."""
        # Detect automatically whether a browser is needed
        use_browser = self._needs_browser(url)
        if use_browser:
            content = self._get_with_playwright(url)
            is_pdf = False
        else:
            content, is_pdf = self._get_with_requests(url)
        if is_pdf:
            return self._process_pdf(content, pure_text)
        # HTML conversion
        if pure_text:
            return self._to_plain_text(content)
        else:
            return self._to_markdown(content)

    def _handle_arxiv(self, url, pure_text):
        """arXiv Python fallback: switch from the HTML URL to a PDF download."""
        try:
            pdf_url = self._convert_arxiv_to_pdf(url)
            logger.info("arXiv Python fallback: downloading PDF %s", pdf_url)
            pdf_content, _ = self._get_with_requests(pdf_url)
            return self._process_pdf(pdf_content, pure_text)
        except Exception as e:
            logger.error("arXiv PDF failed: %s", e)
            return None

    def _clean_jina_markdown(self, markdown):
        """Clean up the Markdown returned by Jina Reader."""
        # Collapse runs of blank lines
        markdown = re.sub(r"\n{3,}", "\n\n", markdown)
        # Strip trailing spaces
        markdown = re.sub(r" +\n", "\n", markdown)
        return markdown.strip()

    def _markdown_to_plain_text(self, markdown):
        """Extract plain text from Markdown."""
        # Drop images, then keep only the link text of links
        text = re.sub(r"!\[[^\]]*\]\([^\)]*\)", "", markdown)
        text = re.sub(r"\[([^\]]+)\]\([^\)]*\)", r"\1", text)
        # Strip heading, quote, and list markers at the start of lines
        text = re.sub(r"^#{1,6}\s*", "", text, flags=re.MULTILINE)
        text = re.sub(r"^[>*-]\s*", "", text, flags=re.MULTILINE)
        # Strip inline emphasis and code markers
        text = re.sub(r"[`*_~]", "", text)
        text = re.sub(r"\n{3,}", "\n\n", text)
        return text.strip()

    def _read_wespy_markdown(self, output_dir, article_info):
        """Read Markdown from the WeSpy return value or its output directory."""
        if isinstance(article_info, dict):
            for key in ("markdown", "markdown_content", "content"):
                value = article_info.get(key)
                if isinstance(value, str) and value.strip():
                    return value.strip()
        markdown_files = sorted(Path(output_dir).rglob("*.md"))
        if not markdown_files:
            return None
        return markdown_files[0].read_text(encoding="utf-8").strip()
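
    def _is_twitter(self, url):
        """Check whether the URL points to Twitter/X."""
        host = (urlparse(url).hostname or "").lower()
        # Match the domain or its subdomains only; a substring test would also
        # match unrelated hosts such as netflix.com or fxtwitter.com
        return any(
            host == d or host.endswith("." + d) for d in ("twitter.com", "x.com")
        )

    def _convert_twitter_to_proxy(self, url):
        """Rewrite a Twitter/X URL to fxtwitter or fixupx, which serve metadata previews."""
        parsed = urlparse(url)
        host = (parsed.hostname or "").lower()
        # twitter.com -> fxtwitter.com, x.com -> fixupx.com
        for domain, proxy in (("twitter.com", "fxtwitter.com"), ("x.com", "fixupx.com")):
            if host == domain or host.endswith("." + domain):
                new_host = host[: len(host) - len(domain)] + proxy
                new_netloc = parsed.netloc.lower().replace(host, new_host, 1)
                return parsed._replace(netloc=new_netloc).geturl()
        return url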

    def _is_arxiv(self, url):
        """Check whether the URL is an arXiv link."""
        return "arxiv.org" in url and (
            "/abs/" in url or "/pdf/" in url or "/html/" in url
        )

    def _is_wechat(self, url):
        """Check whether the URL is a WeChat Official Account link."""
        return "weixin.qq.com" in url

    def _convert_arxiv_to_html(self, url):
        """Convert an arXiv link to its HTML URL."""
        if "/html/" in url:
            return url
        if "/pdf/" in url:
            paper_id = url.split("/pdf/")[-1].split("?")[0].replace(".pdf", "")
            return f"https://arxiv.org/html/{paper_id}"
        paper_id = url.split("/abs/")[-1].split("?")[0]
        return f"https://arxiv.org/html/{paper_id}"

    def _convert_arxiv_to_pdf(self, url):
        """Convert an arXiv link to its PDF URL."""
        if "/pdf/" in url:
            return url
        if "/html/" in url:
            paper_id = url.split("/html/")[-1].split("?")[0]
            return f"https://arxiv.org/pdf/{paper_id}.pdf"
        paper_id = url.split("/abs/")[-1].split("?")[0]
        return f"https://arxiv.org/pdf/{paper_id}.pdf"

    def _is_known_dynamic_site(self, url):
        """Check whether the URL belongs to a known dynamic site (pure function, no network calls).

        Returns True or False for known sites, and None when probing is needed.
        """
        parsed = urlparse(url)
        domain = parsed.netloc.lower()
        dynamic_domains = [
            "x.com",
            "twitter.com",
            "medium.com",
            "substack.com",
            "github.com",
            "reddit.com",
        ]
        for dynamic in dynamic_domains:
            if domain.endswith(dynamic):
                return True
        if "weixin.qq.com" in domain:
            return False
        return None  # unknown; needs probing

    def _probe_for_spa(self, url):
        """Probe with a HEAD request to guess whether the page is an SPA (makes a network call)."""
        try:
            response = self.session.head(
                url, timeout=self.TIMEOUT_HEAD, allow_redirects=True
            )
            content_type = response.headers.get("content-type", "").lower()
            if "application/json" in content_type:
                return True
            server = response.headers.get("server", "").lower()
            if any(s in server for s in ["nextjs", "vercel", "vite"]):
                return True
        except Exception:
            # If the probe itself fails, assume a browser is needed
            return True
        return False

    def _needs_browser(self, url):
        """Detect automatically whether browser rendering is needed."""
        known = self._is_known_dynamic_site(url)
        if known is not None:
            return known
        return self._probe_for_spa(url)

    def _get_with_requests(self, url):
        """Fetch a static page or a PDF with requests.

        Returns:
            tuple: (content, is_pdf) - content is bytes for a PDF and str for HTML
        """
        response = self.session.get(url, timeout=self.TIMEOUT_REQUESTS)
        response.raise_for_status()
        content_type = response.headers.get("content-type", "").lower()
        is_pdf = "application/pdf" in content_type or url.lower().endswith(".pdf")
        if is_pdf:
            return response.content, True
        return response.text, False

    def _process_pdf(self, pdf_content, pure_text=False):
        """Process PDF content and return Markdown or plain text (opens the document only once)."""
        try:
            import fitz  # PyMuPDF
        except ImportError:
            raise ImportError("PDF processing requires PyMuPDF.\n" "Run: pip install pymupdf")
        try:
            doc = fitz.open(stream=pdf_content, filetype="pdf")
            # Extract the full text
            text = ""
            for page_num, page in enumerate(doc):
                page_text = page.get_text()
                if page_text.strip():
                    text += f"--- Page {page_num + 1} ---\n\n"
                    text += page_text + "\n\n"
            text = text.strip()
            if pure_text:
                return text
            # Extract the title: metadata first, else the first two lines of page 1
            title = None
            metadata = doc.metadata
            if metadata and metadata.get("title"):
                title = metadata["title"]
            elif doc.page_count > 0:
                first_page = doc[0]
                lines = [
                    line.strip()
                    for line in first_page.get_text().split("\n")
                    if line.strip()
                ]
                if lines:
                    title = " ".join(lines[:2])
            if title:
                return f"# {title}\n\n{text}"
            return text
        except Exception as e:
            # Chain the original error so the traceback is preserved
            raise RuntimeError(f"PDF processing failed: {e}") from e

    def _get_with_playwright(self, url):
        """Fetch a dynamic page with Playwright."""
        try:
            from playwright.sync_api import sync_playwright
        except ImportError:
            raise ImportError(
                "Playwright is not installed.\n"
                "Run: pip install playwright && playwright install chromium"
            )
        with sync_playwright() as p:
            browser = p.chromium.launch()
            page = browser.new_page()
            # Use a mobile UA for WeChat Official Account pages
            if "weixin.qq.com" in url:
                page.set_extra_http_headers(
                    {
                        "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 14_0) AppleWebKit/605.1.15"
                    }
                )
            try:
                page.goto(
                    url, wait_until="networkidle", timeout=self.TIMEOUT_PLAYWRIGHT
                )
                # Give late-loading scripts a moment to finish rendering
                page.wait_for_timeout(2000)
                content = page.content()
            finally:
                browser.close()
        return content

    def _to_markdown(self, html):
        """Convert HTML to Markdown."""
        soup = BeautifulSoup(html, "html.parser")
        # Extract the title (WeChat Official Accounts and others)
        title = None
        # Try several title sources
        title_selectors = ["title", "h1#activity-name", ".rich_media_title", "h1"]
        for selector in title_selectors:
            title_elem = soup.select_one(selector)
            if title_elem:
                title = title_elem.get_text(strip=True)
                if title:
                    break
        # Locate the main content (in priority order)
        content_elem = None
        content_selectors = [
            "#js_content",  # WeChat Official Account
            ".rich_media_content",  # WeChat Official Account
            "#activity-detail",  # WeChat Official Account
            "article",  # standard article
            "main",  # standard main content
            ".post-content",  # blog
            ".article-content",  # blog
        ]
        for selector in content_selectors:
            elem = soup.select_one(selector)
            if elem and elem.get_text(strip=True):
                content_elem = elem
                break
        # Fall back to the body when no specific container is found
        if not content_elem:
            content_elem = soup.body or soup
        # Remove noise elements
        for tag in content_elem(
            ["script", "style", "nav", "footer", "header", "iframe", "aside"]
        ):
            tag.decompose()
        # Remove ads and interactive elements
        for tag in content_elem.find_all(
            class_=lambda x: x
            and any(
                w in x.lower()
                for w in [
                    "ad",
                    "banner",
                    "cookie",
                    "consent",
                    "popup",
                    "modal",
                    "share",
                    "like",
                    "comment",
                ]
            )
        ):
            tag.decompose()
        # Remove buttons and link areas
        for tag in content_elem.find_all(
            class_=lambda x: x
            and any(w in x.lower() for w in ["btn", "button", "share", "reward"])
        ):
            tag.decompose()
        # Remove empty paragraphs
        for tag in content_elem.find_all("p"):
            if not tag.get_text(strip=True):
                tag.decompose()
        # Build the final content
        if title:
            markdown = f"# {title}\n\n"
        else:
            markdown = ""
        cleaned_html = str(content_elem)
        markdown += md(cleaned_html, heading_style="ATX")
        # Clean up extra whitespace
        markdown = re.sub(r"\n{3,}", "\n\n", markdown)
        markdown = re.sub(r" +\n", "\n", markdown)  # trailing spaces
        return markdown.strip()

    def _to_plain_text(self, html):
        """Extract plain text."""
        soup = BeautifulSoup(html, "html.parser")
        main = soup.find("main") or soup.find("article") or soup.body
        if not main:
            return soup.get_text(separator="\n\n", strip=True)
        for tag in main(["script", "style", "nav", "footer", "header", "aside"]):
            tag.decompose()
        text = main.get_text(separator="\n\n", strip=True)
        text = re.sub(r"\n{3,}", "\n\n", text)
        return text.strip()


def main():
    """Command-line entry point."""
    logging.basicConfig(
        level=logging.INFO,
        format="%(levelname)s: %(message)s",
        stream=sys.stderr,
    )
    parser = argparse.ArgumentParser(
        description="Convert a web page to Markdown (non-Python methods are tried first)",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""Methods by priority:
  1. Jina Reader API (https://r.jina.ai/URL) - zero install
  2. Firecrawl API (requires FIRECRAWL_API_KEY)
  3. Python implementation (fallback)
Examples:
  %(prog)s https://www.breezedeus.com/article/ai-agent-context-engineering
  %(prog)s https://arxiv.org/abs/2601.04500v1 --output paper.md
  %(prog)s https://x.com/user/status/123 --pure-text
  %(prog)s https://www.breezedeus.com/article/ai-agent-context-engineering --use-python  # force the Python method
""",
    )
    parser.add_argument("url", help="URL of the web page to convert")
    parser.add_argument(
        "--pure-text", action="store_true", help="Output plain text (no Markdown formatting)"
    )
    parser.add_argument(
        "--use-python",
        action="store_true",
        help="Force the Python method (skip Jina/Firecrawl)",
    )
    parser.add_argument("--output", "-o", help="Write the output to a file")
    args = parser.parse_args()
    try:
        with WebToMarkdown() as converter:
            result = converter.convert(
                args.url, pure_text=args.pure_text, use_python=args.use_python
            )
        if result is None:
            logger.error("Conversion failed: no method succeeded")
            sys.exit(1)
        if args.output:
            with open(args.output, "w", encoding="utf-8") as f:
                f.write(result)
            print(f"✓ Saved to: {args.output}")
        else:
            print(result)
    except Exception as e:
        logger.error("Error: %s", e)
        sys.exit(1)


if __name__ == "__main__":
    main()

@ -0,0 +1,45 @@
# docai-web2summary
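An AI Skill that produces a structured summary of any web page URL.
## Workflow
1. **Fetch the content**: call `docai-web2md` to convert the URL to Markdown
2. **AI summary**: the AI writes the summary directly, following the rules in [SKILL.md](SKILL.md); no external script is needed
3. **Info card (optional)**: to generate an image card, use [info-card-designer](https://github.com/joeseesun/info-card-designer)
## Usage
Just tell the AI what you need:
> Summarize this link: https://www.breezedeus.com/article/ai-agent-context-engineering
The AI will automatically:
- Identify the content type (paper / news / tutorial / product / AI news / general)
- Apply the matching summary structure
- Output Markdown in a unified format
## Output Format Example
```markdown
# **A Dashboard for Claude Code: An In-Depth Review of the claude-hud Plugin**
✔ One-sentence summary: a dashboard plugin that turns Claude Code from a black box into something transparent...
**Core insight**: Claude Code's biggest pain point is not missing features but the "black box" experience...
**Technical details**: built on Claude Code's native statusline API...
**Use cases**: complex task refactoring, CI/CD debugging, long-running project development...
**Source:** https://mp.weixin.qq.com/s/XClh6xJmXoXbyBC9lKzPdA
```
## Dependencies
- `docai-web2md` (fetches the web page content)
- [info-card-designer](https://github.com/joeseesun/info-card-designer) (optional, generates info-card images)
## License
MIT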

@ -0,0 +1,168 @@
---
name: web2summary
description: Summarize any web URL. Triggers on "summarize/总结/概括/摘要 + URL". Auto-detects content type (paper, news, tutorial, product, AI news) and generates adaptive structured summary.
---
# docai:web2summary
## When to Trigger
User wants to summarize a web page. Common patterns:
- "总结这个链接"、"帮我总结一下"、"概括这篇文章"、"给个摘要"
- "summarize this URL"、"give me a summary of"
- Any URL + intent to understand/extract key points
## How to Execute
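### Step 1: Fetch the page content
Use the `web2md` skill to convert the URL to Markdown:
```bash
python skills/web2md/tools/convert.py <URL>
```
### Step 2: Summarize directly (you do this; no external AI call)
Once you have the Markdown, **you (the AI agent) write the summary directly, following the summary rules below**. Do not call any other script or API.
### Step 3: Generate an info card (optional)
If the user wants an info-card image, generate it with the [info-card-designer](https://github.com/joeseesun/info-card-designer) skill.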
---
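## Summary Rules
### Formatting Requirements
**Heading format**
- Headings at every level must be bold: `# **Heading**`, `## **Heading**`, `### **Heading**`
- If the content comes from a well-known organization, append it to the level-1 heading: `# **Title | Organization**`
- Leave one blank line between a heading and the content before it
**Bold and punctuation**
- Bold markers `**` go inside punctuation marks, not outside
- ✅ `「**更聪明地激活**」` ❌ `**「更聪明地激活」**`
**Link handling**
- The summary must end with the source link: `**Source:** <link>`
- Remove the query parameters after `?` in the URL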
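The query-stripping rule can be made concrete with the standard library. The following is a minimal sketch: the helper name `strip_query` is hypothetical, and keeping a `#fragment` is an assumption, since the rule only mentions the query parameters.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url: str) -> str:
    """Remove the query string that follows `?`, keeping the rest of the URL."""
    parts = urlsplit(url)
    # Rebuild the URL with an empty query component.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", parts.fragment))


print(strip_query("https://mp.weixin.qq.com/s/abc?utm_source=feed&id=1"))
```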
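**List format**
- Use ✔ instead of `-` / `*` for unordered lists, with a blank line after each item
**Content constraints**
- Use only information found on the page; do not add your own inferences
- Do not output LaTeX math formulas; do not include indexes or citations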
---
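### Content Type Detection and Structure
Determine the content type first, then output using the matching structure.
---
#### 🔬 Type A: Technical Paper / Research
Applies to: academic papers, technical reports, arXiv papers, algorithm write-ups, etc.
Structure (no more than 1000 characters overall; drop any section that does not apply):
✔ One-sentence summary (opening): conveys the core breakthrough of the research and must be compelling
✔ Core insight: what problem does it solve? What new idea does it propose?
✔ Technical details / architectural innovation: key methods, model structure, algorithm design
✔ Performance data / experimental results: quantitative metrics, baseline comparisons, key figures
✔ Use cases: where can this technique be applied?
✔ Long-term significance: why is it worth attention? What is its impact on the field?
✔ Source link (at the end)
---
#### 📰 Type B: News Report
Applies to: industry news, company updates, policy announcements, event coverage, etc.
Structure (no more than 800 characters overall; drop any section that does not apply):
✔ One-sentence summary (opening): sums up the core event and highlights its news value
✔ Core event: what happened? Key details
✔ Key people / organizations: who is driving it? Who is affected?
✔ Background and impact: why does it matter? Impact on the industry or society
✔ Outlook: what might happen next?
✔ Source link (at the end)
---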
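#### 📚 Type C: Tutorial / Guide
Applies to: programming tutorials, operation guides, how-to articles, best practices, etc.
Structure (no more than 1000 characters overall; drop any section that does not apply):
✔ One-sentence summary (opening): what does this tutorial teach, and who is it for?
✔ Learning goals: what will the reader be able to do afterwards?
✔ Prerequisites: what background or tools are needed?
✔ Key steps in brief: a distilled version of the core flow (not a step-by-step retelling)
✔ Caveats / common pitfalls: error-prone points or best practices the author mentions
✔ Source link (at the end)
---
#### 🚀 Type D: Product Launch / Review
Applies to: product launches, feature updates, product reviews, tool recommendations, etc.
Structure (no more than 800 characters overall; drop any section that does not apply):
✔ One-sentence summary (opening): the core selling point
✔ Product positioning: what problem does it solve, and for whom?
✔ Core features / highlights: the features most worth attention
✔ Comparison with competitors: what advantages does it have over existing options? (if mentioned in the article)
✔ Target audience: who should pay the most attention?
✔ Pricing / availability: how can it be obtained or used? (if mentioned in the article)
✔ Source link (at the end)
---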
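#### 🤖 Type E: AI Industry News
Applies to: AI news roundups, model releases, industry trend analysis, AI newsletters, etc.
Structure (no more than 1000 characters overall; drop any section that does not apply):
✔ One-sentence summary (opening): the signal most worth attention in this issue
✔ Core updates: the 2-3 most important items and why they matter
✔ Technical points: the key technologies or methods involved (if any)
✔ Industry impact: what does it mean for developers, companies, and users?
✔ Signals to watch: which trends are taking shape?
✔ Source link (at the end)
---
#### 📄 Type F: General
Applies to: personal blogs, opinion pieces, essays, interviews, and other types
Structure (no more than 800 characters overall; drop any section that does not apply):
✔ One-sentence summary (opening): the core value of this content
✔ Core content: what is the author saying? Main arguments or story
✔ Key takeaways: the 2-3 points most worth remembering
✔ Value and inspiration: what does the reader gain?
✔ Source link (at the end)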
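
The per-type length limits above can be collected into a small lookup for checking a finished summary. This is only a sketch: the dictionary keys and the function `within_limit` are hypothetical names, and counting with `len()` assumes that one character counts as one unit.

```python
# Maximum summary length per content type, taken from the structures above.
# The keys are illustrative labels, not identifiers defined by the skill.
MAX_CHARS = {
    "paper": 1000,     # Type A
    "news": 800,       # Type B
    "tutorial": 1000,  # Type C
    "product": 800,    # Type D
    "ai_news": 1000,   # Type E
    "general": 800,    # Type F
}


def within_limit(summary: str, content_type: str) -> bool:
    """Return True when the summary fits the limit for its content type."""
    return len(summary) <= MAX_CHARS[content_type]


print(within_limit("A short summary.", "news"))
```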