add web2md web2summary
parent
ede5960b5a
commit
62c0f62134
194
skills/common/web2md/README.md
Normal file
@ -0,0 +1,194 @@
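# docai-web2md

A standalone Python tool that converts web pages to Markdown, built on a **priority-based architecture**.

> 📖 **Documentation map**
> - **SKILL.md** - Claude Code usage guide (how to invoke this skill)
> - **README.md** - this document (tool features and standalone usage)
> - **tools/convert.py** - the conversion code itself
> - **Shared reference**: [web-sources.md](../../shared/references/web-sources.md) - platform support matrix

## Core features

- ✅ **Jina Reader API first** - zero install, the fastest and simplest option (except for WeChat Official Accounts)
- ✅ **Firecrawl API support** - for advanced crawling needs
- ✅ **Smart Python fallback** - switches over automatically when the methods above fail
- ✅ **arXiv HTML first** - fetches the HTML version first and falls back to the PDF on failure
- ✅ **Multi-platform support** - WeChat Official Accounts (direct Python path), static blogs, dynamic pages, and more

## Priority strategy
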
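```text
# Conversion flow
Input URL
    ↓
arXiv? → rewrite to the HTML URL
    ↓
WeChat Official Account? → go straight to the Python path ⭐
    ↓
Try the Jina Reader API (fast)
    ↓ (on failure)
Try the Firecrawl API (requires a key)
    ↓ (on failure)
Python method (fallback)
    ↓
arXiv? → download the PDF and extract the text
```

**Special handling for WeChat Official Accounts**: Jina Reader handles WeChat articles poorly, so the tool skips the hosted APIs and uses the Python path directly (in `tools/convert.py`: WeSpy first, then Playwright, then the requests-based converter).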
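
The documented order can be sketched as a small dispatch function. This is a minimal illustration of the flow described above only, not the code in `tools/convert.py`; `plan_methods` and the method labels it returns are hypothetical names introduced for the example.

```python
from urllib.parse import urlparse


def plan_methods(url: str, has_firecrawl_key: bool = False) -> list[str]:
    """Return the ordered list of methods to try for a URL (illustrative only)."""
    host = (urlparse(url).hostname or "").lower()

    # WeChat articles skip the hosted APIs and go straight to the Python path.
    if host == "mp.weixin.qq.com":
        return ["python"]

    methods = ["jina"]
    if has_firecrawl_key:
        # Firecrawl is only worth trying when an API key is configured.
        methods.append("firecrawl")
    methods.append("python")

    # arXiv gets one extra last-resort step: download the PDF and extract text.
    if host == "arxiv.org" or host.endswith(".arxiv.org"):
        methods.append("arxiv_pdf")
    return methods


if __name__ == "__main__":
    print(plan_methods("https://arxiv.org/abs/2601.04500v1", has_firecrawl_key=True))
```

Matching on the parsed hostname rather than on a substring of the whole URL keeps unrelated URLs that merely mention `arxiv.org` or `weixin.qq.com` in their path out of the special cases.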

## Quick start

### Option 1: uv (recommended)

```bash
# 1. Install uv (if it is not installed yet)
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. Set up the environment in the docai-skills directory
cd docai-skills
uv sync

# 3. Run the script (no need to activate the environment)
uv run python skills/docai-web2md/tools/convert.py https://www.breezedeus.com/article/ai-agent-context-engineering
```

### Option 2: pip (traditional)

```bash
# 1. Create a virtual environment (optional but recommended)
python -m venv .venv
source .venv/bin/activate

# 2. Install the dependencies
pip install requests beautifulsoup4 markdownify pymupdf

# 3. Run the script
python skills/docai-web2md/tools/convert.py https://www.breezedeus.com/article/ai-agent-context-engineering
```

### Option 3: Jina Reader API (no installation)

```bash
# Call the API directly; no dependencies are required
curl https://r.jina.ai/https://www.breezedeus.com/article/ai-agent-context-engineering
```
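
For callers who prefer Python over `curl`, the same request can be sketched with the standard library alone. This is a hedged example rather than the tool's implementation: `jina_reader_url` and `fetch_markdown` are hypothetical helper names, and the timeout and `User-Agent` values are assumptions.

```python
from urllib.request import Request, urlopen

JINA_READER_BASE = "https://r.jina.ai"


def jina_reader_url(target_url: str) -> str:
    """Prefix the target URL with the Jina Reader endpoint, as the curl call does."""
    return f"{JINA_READER_BASE}/{target_url.strip()}"


def fetch_markdown(target_url: str, timeout: float = 10.0) -> str:
    """Fetch the Markdown returned by Jina Reader (performs a network request)."""
    request = Request(
        jina_reader_url(target_url),
        headers={"User-Agent": "web2md-example/0.1"},  # assumed value
    )
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(jina_reader_url("https://www.breezedeus.com/article/ai-agent-context-engineering"))
```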

### ⚠️ Claude Code Skill integration

**Important**: Claude Code runs Skills with the system Python, so the dependencies must also be installed there:

```bash
# Install into the system Python with uv (the project virtual environment is unaffected)
uv pip install --system requests beautifulsoup4 markdownify pymupdf

# Or use pip
pip install requests beautifulsoup4 markdownify pymupdf
```

**See**: [UV_ENVIRONMENT.md](../../UV_ENVIRONMENT.md)

## Command-line usage

```bash
# Basic usage (automatic priority)
python skills/docai-web2md/tools/convert.py https://www.breezedeus.com/article/ai-agent-context-engineering

# Save to a file
python skills/docai-web2md/tools/convert.py https://www.breezedeus.com/article/ai-agent-context-engineering -o article.md

# Plain-text mode
python skills/docai-web2md/tools/convert.py https://www.breezedeus.com/article/ai-agent-context-engineering --pure-text

# Force the Python method (skip Jina/Firecrawl)
python skills/docai-web2md/tools/convert.py https://www.breezedeus.com/article/ai-agent-context-engineering --use-python
```

## Priority architecture

```
Input URL
    ↓
arXiv? → rewrite to the HTML URL
    ↓
Jina Reader API (⭐ zero install)
    ↓ on failure
Firecrawl API (requires a key)
    ↓ on failure
Python implementation (catch-all fallback)
    ↓
arXiv? → download the PDF and extract the text
    ↓
Regular web page → HTML parsing
```

## Usage examples

```bash
# Static blog (Jina Reader)
python skills/docai-web2md/tools/convert.py https://www.breezedeus.com/article/ai-agent-context-engineering

# arXiv paper (HTML first, PDF fallback)
python skills/docai-web2md/tools/convert.py https://arxiv.org/abs/2601.04500v1

# WeChat Official Account (skips Jina, uses the Python path directly)
python skills/docai-web2md/tools/convert.py https://mp.weixin.qq.com/s/1LfkYdbzymoWxdvdnKeLnA

# X.com/Twitter (Python dynamic rendering)
python skills/docai-web2md/tools/convert.py https://x.com/user/status/123
```
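
The arXiv case relies on rewriting an `/abs/` or `/pdf/` link to its `/html/` counterpart and, as a fallback, to the PDF link. A minimal sketch of that URL handling follows; the helper names are hypothetical, and the `.pdf` suffix on the fallback URL mirrors the form used in `tools/convert.py`.

```python
import re
from urllib.parse import urlparse

# /abs/<id>, /pdf/<id>[.pdf] or /html/<id>, with an optional trailing slash.
_ARXIV_PATH = re.compile(r"^/(?:abs|pdf|html)/(?P<id>.+?)(?:\.pdf)?/?$")


def arxiv_paper_id(url):
    """Return the paper id (e.g. '2601.04500v1'), or None for non-arXiv URLs."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host != "arxiv.org" and not host.endswith(".arxiv.org"):
        return None
    # parsed.path already excludes the query string and fragment.
    match = _ARXIV_PATH.match(parsed.path)
    return match.group("id") if match else None


def arxiv_html_url(url):
    """HTML rendering of the paper, tried first."""
    paper_id = arxiv_paper_id(url)
    return f"https://arxiv.org/html/{paper_id}" if paper_id else None


def arxiv_pdf_url(url):
    """PDF of the paper, used when the HTML version cannot be converted."""
    paper_id = arxiv_paper_id(url)
    return f"https://arxiv.org/pdf/{paper_id}.pdf" if paper_id else None


if __name__ == "__main__":
    print(arxiv_html_url("https://arxiv.org/abs/2601.04500v1"))
```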

## Python API

```python
from skills.docai_web2md.tools.convert import WebToMarkdown

converter = WebToMarkdown()

# Automatic priority (recommended)
markdown = converter.convert("https://www.breezedeus.com/article/ai-agent-context-engineering")

# arXiv is handled automatically (HTML → PDF)
paper = converter.convert("https://arxiv.org/abs/2601.04500v1")

# Force the Python method
markdown = converter.convert("https://www.breezedeus.com/article/ai-agent-context-engineering", use_python=True)

# Plain-text output
text = converter.convert("https://www.breezedeus.com/article/ai-agent-context-engineering", pure_text=True)
```

## Dependencies

| Method | Dependency | Notes |
|------|------|------|
| **Jina Reader** | None | Only needs network access |
| **Firecrawl** | `FIRECRAWL_API_KEY` | Environment variable |
| **Python fallback** | `requests`, `beautifulsoup4`, `markdownify` | Base dependencies |
| **PDF support** | `pymupdf` | arXiv PDF extraction |
| **Dynamic pages** | `playwright` | React/Vue SPAs |
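
Before running the Python fallback, it can help to check which of these packages are importable. The sketch below is an assumption-labelled helper, not part of the tool: `missing_packages` is a hypothetical name, and the mapping reflects that `beautifulsoup4` is imported as `bs4` and `pymupdf` as `fitz`.

```python
from importlib.util import find_spec

# pip package name -> module name used in `import` statements.
IMPORT_NAMES = {
    "requests": "requests",
    "beautifulsoup4": "bs4",
    "markdownify": "markdownify",
    "pymupdf": "fitz",
    "playwright": "playwright",
}


def missing_packages(packages=None):
    """Return the pip names of packages whose module cannot be found."""
    packages = IMPORT_NAMES if packages is None else packages
    return sorted(
        pip_name
        for pip_name, module_name in packages.items()
        if find_spec(module_name) is None
    )


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages: pip install " + " ".join(missing))
    else:
        print("All optional and required packages are importable.")
```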

## Performance reference

- **Jina Reader**: ~1-2 s
- **Firecrawl**: ~2-5 s
- **Python, static pages**: ~1-2 s
- **Python, dynamic pages**: ~5-10 s
- **arXiv PDF**: ~2-5 s

## Relationship to the Skill

- **SKILL.md**: tells Claude how to use this tool
- **tools/convert.py**: the code that performs the conversion
- **README.md**: this document (tool usage)

## Testing

```bash
# Test with a breezedeus.com blog post
python skills/docai-web2md/tools/convert.py https://www.breezedeus.com/article/ai-agent-context-engineering
```

## License

MIT

55
skills/common/web2md/SKILL.md
Normal file
@ -0,0 +1,55 @@
---
name: web2md
description: Convert any web URL to Markdown. Triggers on "转成Markdown/转换/网页转Markdown/convert to Markdown + URL". Handles static sites, dynamic SPAs, WeChat, arXiv, Twitter/X.
---

# docai:web2md

## When to Trigger
User wants to convert a web page to Markdown. Common patterns:
- "把这个链接转成 Markdown", "网页转 Markdown", "提取网页内容"
- "convert this URL to Markdown", "get the content of this page"
- Any URL + intent to extract/read content (without summarization)

If the user wants a summary, use web2summary instead.

## How to Execute
```bash
python skills/web2md/tools/convert.py <URL> [--pure-text] [--use-python] [-o <file>]
```

### Parameters
| Parameter | Required | Description |
|-----------|----------|-------------|
| `url` | Yes | Web page URL |
| `--pure-text` | No | Output plain text instead of Markdown |
| `--use-python` | No | Force Python method (skip Jina/Firecrawl) |
| `-o` / `--output` | No | Save to file instead of stdout |

### Examples
```bash
# Basic conversion (parallel: Jina / Firecrawl / Python / Playwright)
python skills/web2md/tools/convert.py https://example.com/article

# arXiv paper (auto HTML priority, PDF fallback)
python skills/web2md/tools/convert.py https://arxiv.org/abs/2601.04500v1

# Save to file
python skills/web2md/tools/convert.py https://mp.weixin.qq.com/s/... -o article.md

# Force Python method
python skills/web2md/tools/convert.py https://example.com --use-python
```

## What It Does
Up to four methods run in parallel, and the first successful result is returned:
1. Jina Reader API (fastest, zero install)
2. Firecrawl API (only if `FIRECRAWL_API_KEY` is configured)
3. Python fallback (requests + BeautifulSoup)
4. Playwright (headless browser for JS-rendered pages)

Special cases handled automatically: WeChat → WeSpy first, then Playwright (mobile UA, JS rendering), then the Python fallback; arXiv → HTML first, PDF fallback; Twitter/X → URL rewritten to fxtwitter.com / fixupx.com before conversion.
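
The "first successful result" pattern can be sketched with `concurrent.futures`. This is an illustrative sketch rather than the skill's code: `first_success` is a hypothetical helper, the task names are placeholders, and `cancel_futures` assumes Python 3.9 or later.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def first_success(tasks):
    """Run zero-argument callables in parallel; return (name, result) of the
    first one that finishes with a truthy result, or None if none does."""
    executor = ThreadPoolExecutor(max_workers=max(1, len(tasks)))
    futures = {executor.submit(fn): name for name, fn in tasks.items()}
    try:
        for future in as_completed(futures):
            try:
                result = future.result()
            except Exception:
                continue  # a failed method simply drops out of the race
            if result:
                return futures[future], result
        return None
    finally:
        # Return without waiting for slower workers. Queued tasks are
        # cancelled; tasks that are already running finish in the background.
        executor.shutdown(wait=False, cancel_futures=True)


if __name__ == "__main__":
    def broken():
        raise RuntimeError("service unavailable")

    print(first_success({"broken": broken, "ok": lambda: "# Title"}))
```

Calling `shutdown(wait=False)` explicitly matters here: leaving a `with ThreadPoolExecutor(...)` block waits for every submitted task, so an early `return` inside the block would still be held up by the slowest method.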

## Troubleshooting
- **arXiv PDF garbled**: requires `pymupdf` (`pip install pymupdf`)
- **Dynamic page empty**: the script auto-detects SPAs and uses Playwright
- **All methods fail**: try `--use-python` to bypass the API methods

683
skills/common/web2md/tools/convert.py
Normal file
@ -0,0 +1,683 @@
#!/usr/bin/env python3
"""
Web to Markdown Converter Tool

Methods (hosted services preferred; all run in parallel, first success wins):
1. Jina Reader API - zero install, converts a URL in one request
2. Firecrawl API - requires an API key
3. Python implementation - fallback when the services above fail

arXiv handling:
- Input:        https://arxiv.org/abs/2601.04500v1
- Rewritten to: https://arxiv.org/html/2601.04500v1
- Jina Reader is tried first; on failure, Python downloads the PDF

WeChat Official Account handling:
- Try WeSpy first
- Fall back to Playwright on failure
- Finally fall back to the Python method

Usage:
    python convert.py <url> [--pure-text] [--use-python] [--output <file>]

Examples:
    python convert.py https://www.breezedeus.com/article/ai-agent-context-engineering
    python convert.py https://arxiv.org/abs/2601.04500v1 --output paper.md
    python convert.py https://x.com/user/status/123 --pure-text
"""

import argparse
import logging
import os
import re
import sys
import tempfile
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from urllib.parse import urlparse

import requests
from bs4 import BeautifulSoup
from markdownify import markdownify as md
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

logger = logging.getLogger(__name__)


class WebToMarkdown:
    """Web page to Markdown converter (parallel, priority-based methods)."""

    # Timeouts (seconds unless noted otherwise)
    TIMEOUT_HEAD = 3
    TIMEOUT_JINA = 8
    TIMEOUT_FIRECRAWL = 10
    TIMEOUT_REQUESTS = 15
    TIMEOUT_PLAYWRIGHT = 15000  # milliseconds

    def __init__(self):
        self.session = requests.Session()
        self.session.headers.update(
            {"User-Agent": "Mozilla/5.0 (compatible; DocAI-Converter/1.0)"}
        )
        # Retry policy: 429/5xx responses, at most 2 retries, exponential backoff
        retry = Retry(
            total=2,
            backoff_factor=1,
            status_forcelist=[429, 500, 502, 503, 504],
            allowed_methods=["HEAD", "GET", "POST"],
        )
        adapter = HTTPAdapter(max_retries=retry)
        self.session.mount("https://", adapter)
        self.session.mount("http://", adapter)
        # Read the Firecrawl API key from the environment
        self.firecrawl_api_key = os.environ.get("FIRECRAWL_API_KEY")

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.session.close()

    def convert(self, url, pure_text=False, use_python=False):
        """Convert a URL to Markdown (parallel, priority-based methods).

        Jina Reader, Firecrawl, the Python path and Playwright are started in
        parallel and the fastest successful result is returned. WeChat URLs
        and --use-python mode take a direct path instead.

        Args:
            url: web page URL
            pure_text: return plain text (no formatting)
            use_python: force the Python method

        Returns:
            str | None: Markdown or plain text, or None if every method failed
        """
        url = url.strip()

        # Validate the URL
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"Invalid URL: {url}")

        # arXiv: rewrite to the HTML URL
        if self._is_arxiv(url):
            url = self._convert_arxiv_to_html(url)

        # WeChat: try WeSpy first, then fall back to Playwright / Python
        if self._is_wechat(url):
            result = self._try_wespy(url, pure_text)
            if result:
                return result
            result = self._try_playwright(url, pure_text)
            if result:
                return result
            return self._python_convert(url, pure_text)

        # Twitter/X: rewrite twitter.com / x.com to fxtwitter / fixupx so the
        # content rendered into the page metadata can be fetched
        if self._is_twitter(url):
            url = self._convert_twitter_to_proxy(url)

        # Forced Python mode
        if use_python:
            if self._is_arxiv(url):
                return self._handle_arxiv(url, pure_text)
            return self._python_convert(url, pure_text)

        # Start several methods in parallel and take the fastest success
        result = self._parallel_convert(url, pure_text)
        if result:
            return result

        # Every parallel method failed; for arXiv, try the PDF fallback
        if self._is_arxiv(url):
            return self._handle_arxiv(url, pure_text)

        return None

    def _parallel_convert(self, url, pure_text):
        """Try several methods in parallel and return the fastest successful result."""
        executor = ThreadPoolExecutor(max_workers=4)
        futures = {}
        futures[executor.submit(self._try_jina_reader, url, pure_text)] = "jina"
        if self.firecrawl_api_key:
            futures[executor.submit(self._try_firecrawl, url, pure_text)] = "firecrawl"
        futures[executor.submit(self._python_convert, url, pure_text)] = "python"
        futures[executor.submit(self._try_playwright, url, pure_text)] = "playwright"

        try:
            for future in as_completed(futures):
                try:
                    result = future.result()
                except Exception as e:
                    logger.warning("%s failed: %s", futures[future], e)
                    continue
                if result:
                    return result
            return None
        finally:
            # A `with ThreadPoolExecutor(...)` block waits for every worker on
            # exit, which defeats "return the fastest result". Shut down without
            # waiting instead; running workers cannot be cancelled and finish in
            # the background. cancel_futures requires Python 3.9+.
            executor.shutdown(wait=False, cancel_futures=True)

    def _try_jina_reader(self, url, pure_text):
        """Try the Jina Reader API.

        Usage: https://r.jina.ai/https://www.breezedeus.com/article/ai-agent-context-engineering
        """
        jina_base_urls = ["https://r.jinaai.cn", "https://r.jina.ai"]
        for jina_base_url in jina_base_urls:
            jina_url = f"{jina_base_url}/{url}"
            try:
                response = self.session.get(jina_url, timeout=self.TIMEOUT_JINA)
                response.raise_for_status()

                content = response.text
                if content and len(content.strip()) > 50:  # make sure there is real content
                    if pure_text:
                        return content
                    # Jina already returns decent Markdown; a light cleanup is enough
                    return self._clean_jina_markdown(content)
            except Exception as e:
                logger.warning("Jina Reader failed (%s): %s", jina_base_url, e)
        return None

    def _try_firecrawl(self, url, pure_text):
        """Try the Firecrawl API."""
        if not self.firecrawl_api_key:
            logger.info("Firecrawl API key is not set (FIRECRAWL_API_KEY)")
            return None

        try:
            response = self.session.post(
                "https://api.firecrawl.dev/v0/scrape",
                headers={"Authorization": f"Bearer {self.firecrawl_api_key}"},
                json={"url": url, "formats": ["markdown"]},
                timeout=self.TIMEOUT_FIRECRAWL,
            )

            if response.status_code == 200:
                data = response.json()
                if data.get("success") and data.get("data", {}).get("markdown"):
                    markdown = data["data"]["markdown"]
                    if pure_text:
                        # Strip Markdown punctuation to get plain text
                        return re.sub(r"[\*\#\`\[\]\(\)]", "", markdown)
                    return markdown
            else:
                logger.warning("Firecrawl error: %s", response.status_code)
        except Exception as e:
            logger.warning("Firecrawl failed: %s", e)
        return None

    def _try_playwright(self, url, pure_text):
        """Try to fetch a dynamic page with Playwright."""
        try:
            content = self._get_with_playwright(url)
            if not content or len(content.strip()) < 50:
                return None
            if pure_text:
                return self._to_plain_text(content)
            return self._to_markdown(content)
        except Exception as e:
            logger.warning("Playwright failed: %s", e)
            return None

    def _try_wespy(self, url, pure_text):
        """Try to fetch a WeChat Official Account article with WeSpy."""
        try:
            from wespy import ArticleFetcher
        except ImportError as e:
            logger.warning("WeSpy is not installed: %s", e)
            return None

        try:
            fetcher = ArticleFetcher()
            with tempfile.TemporaryDirectory() as output_dir:
                article_info = fetcher.fetch_article(
                    url=url,
                    output_dir=output_dir,
                    save_markdown=True,
                    save_html=False,
                    save_json=False,
                )
                if not article_info:
                    return None

                markdown = self._read_wespy_markdown(output_dir, article_info)
                if not markdown:
                    return None

                if pure_text:
                    return self._markdown_to_plain_text(markdown)
                return markdown.strip()
        except Exception as e:
            logger.warning("WeSpy failed: %s", e)
            return None

    def _python_convert(self, url, pure_text):
        """Python implementation (fallback method)."""
        # Detect automatically whether a browser is needed
        use_browser = self._needs_browser(url)

        if use_browser:
            content = self._get_with_playwright(url)
            is_pdf = False
        else:
            content, is_pdf = self._get_with_requests(url)

        if is_pdf:
            return self._process_pdf(content, pure_text)

        # HTML conversion
        if pure_text:
            return self._to_plain_text(content)
        else:
            return self._to_markdown(content)

    def _handle_arxiv(self, url, pure_text):
        """arXiv Python fallback: turn the HTML URL into a PDF download."""
        try:
            pdf_url = self._convert_arxiv_to_pdf(url)
            logger.info("arXiv Python fallback: downloading PDF %s", pdf_url)
            pdf_content, _ = self._get_with_requests(pdf_url)
            return self._process_pdf(pdf_content, pure_text)
        except Exception as e:
            logger.error("arXiv PDF failed: %s", e)
            return None

    def _clean_jina_markdown(self, markdown):
        """Clean up the Markdown returned by Jina Reader."""
        # Collapse runs of blank lines
        markdown = re.sub(r"\n{3,}", "\n\n", markdown)
        # Strip trailing spaces
        markdown = re.sub(r" +\n", "\n", markdown)
        return markdown.strip()

    def _markdown_to_plain_text(self, markdown):
        """Extract plain text from Markdown."""
        # Drop images, keep link text, then strip heading, quote, list and emphasis markers
        text = re.sub(r"!\[[^\]]*\]\([^\)]*\)", "", markdown)
        text = re.sub(r"\[([^\]]+)\]\([^\)]*\)", r"\1", text)
        text = re.sub(r"^#{1,6}\s*", "", text, flags=re.MULTILINE)
        text = re.sub(r"^[>*-]\s*", "", text, flags=re.MULTILINE)
        text = re.sub(r"[`*_~]", "", text)
        text = re.sub(r"\n{3,}", "\n\n", text)
        return text.strip()

    def _read_wespy_markdown(self, output_dir, article_info):
        """Read the Markdown from WeSpy's return value or its output directory."""
        if isinstance(article_info, dict):
            for key in ("markdown", "markdown_content", "content"):
                value = article_info.get(key)
                if isinstance(value, str) and value.strip():
                    return value.strip()

        markdown_files = sorted(Path(output_dir).rglob("*.md"))
        if not markdown_files:
            return None

        return markdown_files[0].read_text(encoding="utf-8").strip()

    def _is_twitter(self, url):
        """Return True for Twitter/X URLs (including subdomains)."""
        host = (urlparse(url).hostname or "").lower()
        # Compare whole host labels: a substring test such as `"x.com" in host`
        # also matches unrelated hosts like netflix.com or fxtwitter.com.
        return any(
            host == domain or host.endswith("." + domain)
            for domain in ("twitter.com", "x.com")
        )

    def _convert_twitter_to_proxy(self, url):
        """Rewrite a Twitter/X URL to fxtwitter or fixupx, which expose metadata previews."""
        parsed = urlparse(url)
        netloc = parsed.netloc.lower()

        # twitter.com -> fxtwitter.com, x.com -> fixupx.com (any subdomain is kept)
        for domain, proxy in (("twitter.com", "fxtwitter.com"), ("x.com", "fixupx.com")):
            if netloc == domain:
                return parsed._replace(netloc=proxy).geturl()
            if netloc.endswith("." + domain):
                new_netloc = netloc[: -len(domain)] + proxy
                return parsed._replace(netloc=new_netloc).geturl()

        return url

    def _is_arxiv(self, url):
        """Return True for arXiv links."""
        return "arxiv.org" in url and (
            "/abs/" in url or "/pdf/" in url or "/html/" in url
        )

    def _is_wechat(self, url):
        """Return True for WeChat Official Account links."""
        return "weixin.qq.com" in url

    def _convert_arxiv_to_html(self, url):
        """Convert an arXiv link to its HTML URL."""
        if "/html/" in url:
            return url
        if "/pdf/" in url:
            paper_id = url.split("/pdf/")[-1].split("?")[0].replace(".pdf", "")
            return f"https://arxiv.org/html/{paper_id}"
        paper_id = url.split("/abs/")[-1].split("?")[0]
        return f"https://arxiv.org/html/{paper_id}"

    def _convert_arxiv_to_pdf(self, url):
        """Convert an arXiv link to its PDF URL."""
        if "/pdf/" in url:
            return url
        if "/html/" in url:
            paper_id = url.split("/html/")[-1].split("?")[0]
            return f"https://arxiv.org/pdf/{paper_id}.pdf"
        paper_id = url.split("/abs/")[-1].split("?")[0]
        return f"https://arxiv.org/pdf/{paper_id}.pdf"

    def _is_known_dynamic_site(self, url):
        """Check against known dynamic sites (pure function, no network calls).

        Returns True / False for known sites and None when the site is unknown.
        """
        parsed = urlparse(url)
        domain = parsed.netloc.lower()

        dynamic_domains = [
            "x.com",
            "twitter.com",
            "medium.com",
            "substack.com",
            "github.com",
            "reddit.com",
        ]

        for dynamic in dynamic_domains:
            # Match the domain itself or a subdomain; a bare endswith("x.com")
            # would also match hosts such as netflix.com.
            if domain == dynamic or domain.endswith("." + dynamic):
                return True

        if "weixin.qq.com" in domain:
            return False

        return None  # unknown, needs probing

    def _probe_for_spa(self, url):
        """Probe with a HEAD request to guess whether the page is an SPA (network call)."""
        try:
            response = self.session.head(
                url, timeout=self.TIMEOUT_HEAD, allow_redirects=True
            )
            content_type = response.headers.get("content-type", "").lower()

            if "application/json" in content_type:
                return True

            server = response.headers.get("server", "").lower()
            if any(s in server for s in ["nextjs", "vercel", "vite"]):
                return True
        except Exception:
            # If the probe itself fails, assume a browser is needed
            return True

        return False

    def _needs_browser(self, url):
        """Detect automatically whether browser rendering is needed."""
        known = self._is_known_dynamic_site(url)
        if known is not None:
            return known
        return self._probe_for_spa(url)

    def _get_with_requests(self, url):
        """Fetch a static page or a PDF with requests.

        Returns:
            tuple: (content, is_pdf) - content is bytes (PDF) or str (HTML)
        """
        response = self.session.get(url, timeout=self.TIMEOUT_REQUESTS)
        response.raise_for_status()
        content_type = response.headers.get("content-type", "").lower()
        is_pdf = "application/pdf" in content_type or url.lower().endswith(".pdf")
        if is_pdf:
            return response.content, True
        return response.text, False

    def _process_pdf(self, pdf_content, pure_text=False):
        """Turn PDF bytes into Markdown or plain text (the document is opened only once)."""
        try:
            import fitz  # PyMuPDF
        except ImportError:
            raise ImportError("PDF processing requires PyMuPDF.\nRun: pip install pymupdf")

        try:
            doc = fitz.open(stream=pdf_content, filetype="pdf")

            # Extract the full text, one labelled section per non-empty page
            text = ""
            for page_num, page in enumerate(doc):
                page_text = page.get_text()
                if page_text.strip():
                    text += f"--- Page {page_num + 1} ---\n\n"
                    text += page_text + "\n\n"
            text = text.strip()

            if pure_text:
                return text

            # Extract the title: metadata first, else the first two lines of page 1
            title = None
            metadata = doc.metadata
            if metadata and metadata.get("title"):
                title = metadata["title"]
            elif doc.page_count > 0:
                first_page = doc[0]
                lines = [
                    line.strip()
                    for line in first_page.get_text().split("\n")
                    if line.strip()
                ]
                if lines:
                    title = " ".join(lines[:2])

            if title:
                return f"# {title}\n\n{text}"
            return text
        except Exception as e:
            # Chain the original exception so the root cause stays visible
            raise RuntimeError(f"PDF processing failed: {e}") from e

    def _get_with_playwright(self, url):
        """Fetch a dynamic page with Playwright."""
        try:
            from playwright.sync_api import sync_playwright
        except ImportError:
            raise ImportError(
                "Playwright is not installed.\n"
                "Run: pip install playwright && playwright install chromium"
            )

        with sync_playwright() as p:
            browser = p.chromium.launch()
            page = browser.new_page()

            # Use a mobile User-Agent for WeChat Official Account pages
            if "weixin.qq.com" in url:
                page.set_extra_http_headers(
                    {
                        "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 14_0) AppleWebKit/605.1.15"
                    }
                )

            try:
                page.goto(
                    url, wait_until="networkidle", timeout=self.TIMEOUT_PLAYWRIGHT
                )
                page.wait_for_timeout(2000)
                content = page.content()
            finally:
                browser.close()

        return content

    def _to_markdown(self, html):
        """Convert HTML to Markdown."""
        soup = BeautifulSoup(html, "html.parser")

        # Extract the title (WeChat Official Accounts and others),
        # trying several possible sources in order
        title = None
        title_selectors = ["title", "h1#activity-name", ".rich_media_title", "h1"]
        for selector in title_selectors:
            title_elem = soup.select_one(selector)
            if title_elem:
                title = title_elem.get_text(strip=True)
                if title:
                    break

        # Locate the main content (in priority order)
        content_elem = None
        content_selectors = [
            "#js_content",  # WeChat Official Account
            ".rich_media_content",  # WeChat Official Account
            "#activity-detail",  # WeChat Official Account
            "article",  # standard article
            "main",  # standard main content
            ".post-content",  # blog
            ".article-content",  # blog
        ]

        for selector in content_selectors:
            elem = soup.select_one(selector)
            if elem and elem.get_text(strip=True):
                content_elem = elem
                break

        # Fall back to <body> if no specific container was found
        if not content_elem:
            content_elem = soup.body or soup

        # Remove noise elements
        for tag in content_elem(
            ["script", "style", "nav", "footer", "header", "iframe", "aside"]
        ):
            tag.decompose()

        # Remove ads, interactive widgets, buttons and share/reward areas.
        # Class names are split into words and compared whole: a plain substring
        # test for "ad" would also remove classes such as "heading" or "loading".
        noise_words = {
            "ad", "ads", "banner", "cookie", "consent", "popup", "modal",
            "share", "like", "comment", "comments", "btn", "button", "reward",
        }

        def is_noise_class(value):
            if not value:
                return False
            return any(part in noise_words for part in re.split(r"[\s_-]+", value.lower()))

        for tag in content_elem.find_all(class_=is_noise_class):
            tag.decompose()

        # Remove empty paragraphs
        for tag in content_elem.find_all("p"):
            if not tag.get_text(strip=True):
                tag.decompose()

        # Build the final content
        if title:
            markdown = f"# {title}\n\n"
        else:
            markdown = ""

        cleaned_html = str(content_elem)
        markdown += md(cleaned_html, heading_style="ATX")

        # Clean up extra whitespace
        markdown = re.sub(r"\n{3,}", "\n\n", markdown)
        markdown = re.sub(r" +\n", "\n", markdown)  # trailing spaces

        return markdown.strip()

    def _to_plain_text(self, html):
        """Extract plain text."""
        soup = BeautifulSoup(html, "html.parser")

        main = soup.find("main") or soup.find("article") or soup.body
        if not main:
            return soup.get_text(separator="\n\n", strip=True)

        for tag in main(["script", "style", "nav", "footer", "header", "aside"]):
            tag.decompose()

        text = main.get_text(separator="\n\n", strip=True)

        text = re.sub(r"\n{3,}", "\n\n", text)

        return text.strip()


def main():
    """Command-line entry point."""
    logging.basicConfig(
        level=logging.INFO,
        format="%(levelname)s: %(message)s",
        stream=sys.stderr,
    )

    parser = argparse.ArgumentParser(
        description="Convert a web page to Markdown (non-Python methods are preferred)",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""Methods, in order of preference:
  1. Jina Reader API (https://r.jina.ai/URL) - zero install
  2. Firecrawl API (requires FIRECRAWL_API_KEY)
  3. Python implementation (fallback)

Examples:
  %(prog)s https://www.breezedeus.com/article/ai-agent-context-engineering
  %(prog)s https://arxiv.org/abs/2601.04500v1 --output paper.md
  %(prog)s https://x.com/user/status/123 --pure-text
  %(prog)s https://www.breezedeus.com/article/ai-agent-context-engineering --use-python  # force the Python method
""",
    )

    parser.add_argument("url", help="URL of the web page to convert")
    parser.add_argument(
        "--pure-text", action="store_true", help="output plain text (no Markdown formatting)"
    )
    parser.add_argument(
        "--use-python",
        action="store_true",
        help="force the Python method (skip Jina/Firecrawl)",
    )
    parser.add_argument("--output", "-o", help="write the output to a file")

    args = parser.parse_args()

    try:
        with WebToMarkdown() as converter:
            result = converter.convert(
                args.url, pure_text=args.pure_text, use_python=args.use_python
            )

        if result is None:
            logger.error("Conversion failed: no method succeeded")
            sys.exit(1)

        if args.output:
            with open(args.output, "w", encoding="utf-8") as f:
                f.write(result)
            print(f"✓ Saved to: {args.output}")
        else:
            print(result)

    except Exception as e:
        logger.error("Error: %s", e)
        sys.exit(1)


if __name__ == "__main__":
    main()
45
skills/common/web2summary/README.md
Normal file
@ -0,0 +1,45 @@
# docai-web2summary

An AI Skill that generates a structured summary for any web URL.

## Workflow

1. **Fetch the content**: call `docai-web2md` to convert the URL to Markdown
2. **AI summary**: the AI writes the summary directly, following the spec in [SKILL.md](SKILL.md); no external script is needed
3. **Info card (optional)**: to produce an image card, use [info-card-designer](https://github.com/joeseesun/info-card-designer)

## Usage

Just tell the AI what you need:

> Summarize this link: https://www.breezedeus.com/article/ai-agent-context-engineering

The AI will automatically:
- Identify the content type (paper / news / tutorial / product / AI news / general)
- Apply the matching summary structure
- Output Markdown in a consistent format

## Sample output

```markdown
# **A dashboard for Claude Code: an in-depth review of the claude-hud plugin**

✔ One-sentence summary: a dashboard plugin that turns Claude Code from a black box into something transparent...

✔ **Core insight**: the biggest pain point of Claude Code is not missing features but the "black box" experience...

✔ **Technical details**: built on Claude Code's native statusline API...

✔ **Use cases**: refactoring in complex tasks, CI/CD debugging, long-running project development...

**Original:** https://mp.weixin.qq.com/s/XClh6xJmXoXbyBC9lKzPdA
```

## Dependencies

- `docai-web2md` (fetches the page content)
- [info-card-designer](https://github.com/joeseesun/info-card-designer) (optional, generates info-card images)

## License

MIT

168
skills/common/web2summary/SKILL.md
Normal file
@ -0,0 +1,168 @@
---
name: web2summary
description: Summarize any web URL. Triggers on "summarize/总结/概括/摘要 + URL". Auto-detects content type (paper, news, tutorial, product, AI news) and generates adaptive structured summary.
---

# docai:web2summary

## When to Trigger
User wants to summarize a web page. Common patterns:
- "总结这个链接", "帮我总结一下", "概括这篇文章", "给个摘要"
- "summarize this URL", "give me a summary of"
- Any URL + intent to understand/extract key points

## How to Execute

### Step 1: Fetch the page content
Use the `web2md` skill to convert the URL to Markdown:
```bash
python skills/web2md/tools/convert.py <URL>
```

### Step 2: Summarize directly (you do it; do not call an external AI)
Once you have the Markdown, **you (the AI agent) write the summary yourself, following the summary spec below**. No further scripts or API calls are needed.

### Step 3: Generate an info card (optional)
If the user wants an info-card image, generate it with the [info-card-designer](https://github.com/joeseesun/info-card-designer) skill.

---

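## Summary spec

### Formatting requirements

**Headings**
- Headings at every level must be bold: `# **Title**`, `## **Title**`, `### **Title**`
- If the content comes from a well-known organization, append its name to the level-1 heading: `# **Title | Organization**`
- Leave one blank line between a heading and the content before it

**Bold and punctuation**
- Put the bold markers `**` inside the punctuation, not outside it
- ✅ `「**更聪明地激活**」` ❌ `**「更聪明地激活」**`
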
**Links**
- The summary must end with the link to the original: `**Original:** <link>`
- Strip the query parameters after `?` from the URL
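
The query-stripping rule is simple enough to express in one line. The sketch below is only an illustration of the rule as written; `strip_query` is a hypothetical helper name, not part of the skill.

```python
def strip_query(url: str) -> str:
    """Drop everything from the first '?' onward, as the link rule requires."""
    return url.split("?", 1)[0]


if __name__ == "__main__":
    print(strip_query("https://example.com/post?utm_source=newsletter&id=7"))
```

One caveat worth keeping in mind: applied blindly, the rule also removes parameters that identify the page itself, so URLs whose content is addressed through the query string would stop resolving.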

**Lists**
- Use ✔ instead of `-` / `*` for unordered lists, and leave a blank line after each item
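
As a rough illustration of the list rule, the following sketch rewrites `-` / `*` bullets into ✔ items separated by blank lines. It is an assumption-laden helper (`to_check_list` is a hypothetical name) that treats every bullet as top-level and leaves horizontal rules and bold text untouched.

```python
import re

# A bullet is "-" or "*" followed by whitespace; "---" and "**bold**" do not match.
_BULLET = re.compile(r"^\s*[-*]\s+(?P<text>.+)$")


def to_check_list(markdown: str) -> str:
    """Replace unordered-list markers with ✔ and put a blank line after each item."""
    out = []
    for line in markdown.splitlines():
        match = _BULLET.match(line)
        if match:
            out.append(f"✔ {match.group('text').strip()}")
            out.append("")  # blank line after every item
        else:
            out.append(line)
    # Drop trailing blank lines, then collapse any doubled blank lines.
    while out and out[-1] == "":
        out.pop()
    return re.sub(r"\n{3,}", "\n\n", "\n".join(out))


if __name__ == "__main__":
    print(to_check_list("- first point\n* second point"))
```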
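
**Content constraints**
- Use only information found on the page; do not add your own inferences
- Do not output LaTeX math, and do not include indexes or citations

---

### Content types and structures

First identify the content type, then write the summary using the matching structure.

---
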
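#### 🔬 Type A: Technical paper / research
Applies to: academic papers, technical reports, arXiv papers, algorithm write-ups, etc.

Structure (no more than 1,000 characters overall; drop any section that does not apply):
✔ One-sentence summary (opening): captures the core breakthrough of the research and must be compelling

✔ Core insight: what problem does it solve? What new idea does it propose?

✔ Technical details / architectural innovation: key methods, model structure, algorithm design

✔ Performance data / experimental results: quantitative metrics, baselines compared, key numbers

✔ Use cases: where can this technology be applied?

✔ Long-term significance: why is it worth following? What is its impact on the field?

✔ Link to the original (at the end)

---

#### 📰 Type B: News report
Applies to: industry news, company updates, policy announcements, event coverage, etc.

Structure (no more than 800 characters overall; drop any section that does not apply):
✔ One-sentence summary (opening): sums up the core event and highlights its news value

✔ Core event: what happened? Key details

✔ Key people / organizations: who is driving it? Who is affected?

✔ Background and impact: why does it matter? What is the impact on the industry or society?

✔ Outlook: what might happen next?

✔ Link to the original (at the end)

---

#### 📚 Type C: Tutorial / guide
Applies to: programming tutorials, how-to guides, how-to articles, best practices, etc.

Structure (no more than 1,000 characters overall; drop any section that does not apply):
✔ One-sentence summary (opening): what does this tutorial teach, and who is it for?

✔ Learning goals: what will the reader be able to do afterwards?

✔ Prerequisites: what background or tools are needed?

✔ Key steps in brief: a condensed extract of the core workflow (not a step-by-step retelling)

✔ Caveats / common pitfalls: error-prone points or best practices mentioned by the author

✔ Link to the original (at the end)

---
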
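#### 🚀 Type D: Product launch / review
Applies to: product launches, feature updates, product reviews, tool recommendations, etc.

Structure (no more than 800 characters overall; drop any section that does not apply):
✔ One-sentence summary (opening): the core selling point

✔ Product positioning: what problem does it solve? Who is it for?

✔ Core features / highlights: the features most worth attention

✔ Comparison with competitors: what advantages does it have over existing solutions? (if mentioned in the article)

✔ Target audience: who should pay the most attention?

✔ Price / availability: how can it be obtained or used? (if mentioned in the article)

✔ Link to the original (at the end)

---

#### 🤖 Type E: AI industry news
Applies to: AI news roundups, model releases, industry trend analysis, AI newsletters, etc.

Structure (no more than 1,000 characters overall; drop any section that does not apply):
✔ One-sentence summary (opening): the signal most worth attention in this issue

✔ Core news: the 2-3 most important items and why they matter

✔ Technical points: the key technologies or methods involved (if any)

✔ Industry impact: what does it mean for developers, companies and users?

✔ Signals to watch: which trends are taking shape?

✔ Link to the original (at the end)

---

#### 📄 Type F: General
Applies to: personal blogs, opinion pieces, essays, interviews, and anything else

Structure (no more than 800 characters overall; drop any section that does not apply):
✔ One-sentence summary (opening): the core value of the piece

✔ Core content: what is the author saying? Main arguments or story

✔ Key points: the 2-3 points most worth remembering

✔ Value and takeaways: what does the reader gain from it?

✔ Link to the original (at the end)
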