From 62c0f62134b37f7229c61194a234d7452a1f6ea9 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=E6=9C=B1=E6=BD=AE?=
Date: Tue, 23 Jun 2026 10:12:55 +0800
Subject: [PATCH] add web2md web2summary

---
 skills/common/web2md/README.md        | 194 ++++++++
 skills/common/web2md/SKILL.md         |  55 +++
 skills/common/web2md/tools/convert.py | 683 ++++++++++++++++++++++++++
 skills/common/web2summary/README.md   |  45 ++
 skills/common/web2summary/SKILL.md    | 168 +++++++
 5 files changed, 1145 insertions(+)
 create mode 100644 skills/common/web2md/README.md
 create mode 100644 skills/common/web2md/SKILL.md
 create mode 100644 skills/common/web2md/tools/convert.py
 create mode 100644 skills/common/web2summary/README.md
 create mode 100644 skills/common/web2summary/SKILL.md

diff --git a/skills/common/web2md/README.md b/skills/common/web2md/README.md
new file mode 100644
index 0000000..7f8dcb2
--- /dev/null
+++ b/skills/common/web2md/README.md
@@ -0,0 +1,194 @@
+# docai-web2md
+
+A standalone Python tool that converts web pages to Markdown using a **priority-based architecture**.
+
+> 📖 **Documentation map**
+> - **SKILL.md** - Claude Code usage guide (how to invoke this skill)
+> - **README.md** - this document (tool features and standalone usage)
+> - **tools/convert.py** - the actual conversion code
+> - **Shared reference**: [web-sources.md](../../shared/references/web-sources.md) - platform support matrix
+
+## Key Features
+
+- ✅ **Jina Reader API first** - zero install, fastest and simplest (except for WeChat Official Accounts)
+- ✅ **Firecrawl API support** - for advanced crawling needs
+- ✅ **Smart Python fallback** - switches over automatically when the methods above fail
+- ✅ **arXiv HTML first** - fetches the HTML version first and falls back to the PDF on failure
+- ✅ **Multi-platform support** - WeChat Official Accounts (direct Python), static blogs, dynamic pages, and more
+
+## Priority Strategy
+
+```text
+# Conversion flow
+Input URL
+    ↓
+arXiv? → convert to HTML URL
+    ↓
+WeChat Official Account? → direct Python method ⭐
+    ↓
+Try Jina Reader API (fast)
+    ↓ (failure)
+Try Firecrawl API (requires a key)
+    ↓ (failure)
+Python method (fallback)
+    ↓
+arXiv? → download the PDF and extract
+```
+
+**WeChat Official Account special handling**: Jina Reader handles WeChat Official Account pages poorly, so the tool goes straight to the Python method to get the best result.
+
+## Quick Start
+
+### Option 1: uv (recommended)
+
+```bash
+# 1. Install uv (if not already installed)
+curl -LsSf https://astral.sh/uv/install.sh | sh
+
+# 2. Initialize the environment in the docai-skills directory
+cd docai-skills
+uv sync
+
+# 3. Run the script (no need to activate the environment)
+uv run python skills/docai-web2md/tools/convert.py https://www.breezedeus.com/article/ai-agent-context-engineering
+```
+
+### Option 2: pip (traditional)
+
+```bash
+# 1. Create a virtual environment (optional but recommended)
+python -m venv .venv
+source .venv/bin/activate
+
+# 2. Install dependencies
+pip install requests beautifulsoup4 markdownify pymupdf
+
+# 3. Run the script
+python skills/docai-web2md/tools/convert.py https://www.breezedeus.com/article/ai-agent-context-engineering
+```
+
+### Option 3: Jina Reader API (no install)
+
+```bash
+# Call the API directly, with no dependencies at all
+curl https://r.jina.ai/https://www.breezedeus.com/article/ai-agent-context-engineering
+```
+
+### ⚠️ Claude Code Skill Integration
+
+**Important**: Claude Code uses the system Python when it invokes a Skill, so extra setup is required:
+
+```bash
+# Install into the system Python with uv (does not affect the project virtual environment)
+uv pip install --system requests beautifulsoup4 markdownify pymupdf
+
+# Or use pip
+pip install requests beautifulsoup4 markdownify pymupdf
+```
+
+**See**: [UV_ENVIRONMENT.md](../../UV_ENVIRONMENT.md)
+
+## Command-Line Usage
+
+```bash
+# Basic usage (automatic priority)
+python skills/docai-web2md/tools/convert.py https://www.breezedeus.com/article/ai-agent-context-engineering
+
+# Save to a file
+python skills/docai-web2md/tools/convert.py https://www.breezedeus.com/article/ai-agent-context-engineering -o article.md
+
+# Plain-text mode
+python skills/docai-web2md/tools/convert.py https://www.breezedeus.com/article/ai-agent-context-engineering --pure-text
+
+# Force the Python method (skip Jina/Firecrawl)
+python skills/docai-web2md/tools/convert.py https://www.breezedeus.com/article/ai-agent-context-engineering --use-python
+```
+
+## Priority Architecture
+
+```
+Input URL
+    ↓
+arXiv? → convert to HTML URL
+    ↓
+Jina Reader API (⭐ zero install)
+    ↓ failure
+Firecrawl API (requires a key)
+    ↓ failure
+Python implementation (catch-all fallback)
+    ↓
+arXiv? → download the PDF and extract
+    ↓
+Regular web page → HTML parsing
+```
+
+## Usage Examples
+
+```bash
+# Static blog (Jina Reader)
+python skills/docai-web2md/tools/convert.py https://www.breezedeus.com/article/ai-agent-context-engineering
+
+# arXiv paper (HTML first, PDF fallback)
+python skills/docai-web2md/tools/convert.py https://arxiv.org/abs/2601.04500v1
+
+# WeChat Official Account (Jina → Python fallback)
+python skills/docai-web2md/tools/convert.py https://mp.weixin.qq.com/s/1LfkYdbzymoWxdvdnKeLnA
+
+# X.com/Twitter (Python dynamic rendering)
+python skills/docai-web2md/tools/convert.py https://x.com/user/status/123
+```
+
+## Python API
+
+```python
+from skills.docai_web2md.tools.convert import WebToMarkdown
+
+converter = WebToMarkdown()
+
+# Automatic priority (recommended)
+markdown = converter.convert("https://www.breezedeus.com/article/ai-agent-context-engineering")
+
+# arXiv is handled automatically (HTML → PDF)
+paper = converter.convert("https://arxiv.org/abs/2601.04500v1")
+
+# Force the Python method
+markdown = converter.convert("https://www.breezedeus.com/article/ai-agent-context-engineering", use_python=True)
+
+# Plain-text output
+text = converter.convert("https://www.breezedeus.com/article/ai-agent-context-engineering", pure_text=True)
+```
+
+## Dependencies
+
+| Method | Dependency | Notes |
+|------|------|------|
+| **Jina Reader** | None | Only needs a network connection |
+| **Firecrawl** | `FIRECRAWL_API_KEY` | Environment variable |
+| **Python fallback** | `requests`, `beautifulsoup4`, `markdownify` | Base dependencies |
+| **PDF support** | `pymupdf` | arXiv PDF extraction |
+| **Dynamic pages** | `playwright` | React/Vue SPAs |
+
+## Performance Reference
+
+- **Jina Reader**: ~1-2 s
+- **Firecrawl**: ~2-5 s
+- **Python static**: ~1-2 s
+- **Python dynamic**: ~5-10 s
+- **arXiv PDF**: ~2-5 s
+
+## Relationship to the Skill
+
+- **SKILL.md**: tells Claude how to use this tool
+- **tools/convert.py**: the code that actually performs the conversion
+- **README.md**: this document (tool usage)
+
+## Testing
+
+```bash
+# Test against the breezedeus.com blog
+python skills/docai-web2md/tools/convert.py https://www.breezedeus.com/article/ai-agent-context-engineering
+```
+
+## License
+
+MIT
diff --git a/skills/common/web2md/SKILL.md b/skills/common/web2md/SKILL.md
new file mode 100644
index 0000000..639ce16
--- /dev/null
+++ b/skills/common/web2md/SKILL.md
@@ -0,0 +1,55 @@
+---
+name: web2md
+description: Convert any web URL to Markdown. Triggers on "转成Markdown/转换/网页转Markdown/convert to Markdown + URL". Handles static sites, dynamic SPAs, WeChat, arXiv, Twitter/X.
+---
+
+# docai:web2md
+
+## When to Trigger
+User wants to convert a web page to Markdown. Common patterns:
+- "把这个链接转成 Markdown", "网页转 Markdown", "提取网页内容" (Chinese for "convert this link to Markdown", "web page to Markdown", "extract the page content")
+- "convert this URL to Markdown", "get the content of this page"
+- Any URL + intent to extract/read content (without summarization)
+
+If the user wants a summary, use web2summary instead.
+
+## How to Execute
+```bash
+python skills/web2md/tools/convert.py <url> [--use-python] [-o <file>]
+```
+
+### Parameters
+| Parameter | Required | Description |
+|-----------|----------|-------------|
+| `url` | Yes | Web page URL |
+| `--use-python` | No | Force Python method (skip Jina/Firecrawl) |
+| `-o` / `--output` | No | Save to file instead of stdout |
+
+### Examples
+```bash
+# Basic conversion (parallel: Jina / Firecrawl / Python)
+python skills/web2md/tools/convert.py https://example.com/article
+
+# arXiv paper (auto HTML priority, PDF fallback)
+python skills/web2md/tools/convert.py https://arxiv.org/abs/2601.04500v1
+
+# Save to file
+python skills/web2md/tools/convert.py https://mp.weixin.qq.com/s/... -o article.md
+
+# Force Python method
+python skills/web2md/tools/convert.py https://example.com --use-python
+```
+
+## What It Does
+Four methods run in parallel, returning the first successful result:
+1. Jina Reader API (fastest, zero install)
+2. Firecrawl API (if key configured)
+3. Python fallback (requests + BeautifulSoup)
+4. Playwright (headless browser for JS-rendered pages)
+
+Special cases handled automatically: WeChat → Playwright first (mobile UA, JS rendering), arXiv → HTML priority, Twitter/X → Playwright.
+
+## Troubleshooting
+- **arXiv PDF garbled**: Requires `pymupdf`: `pip install pymupdf`
+- **Dynamic page empty**: Script auto-detects SPAs and uses Playwright
+- **All methods fail**: Try `--use-python` to bypass API methods
diff --git a/skills/common/web2md/tools/convert.py b/skills/common/web2md/tools/convert.py
new file mode 100644
index 0000000..ba56dba
--- /dev/null
+++ b/skills/common/web2md/tools/convert.py
@@ -0,0 +1,683 @@
+#!/usr/bin/env python3
+"""
+Web to Markdown Converter Tool
+
+Priority methods (non-Python first):
+1. Jina Reader API - zero install, one-line URL conversion
+2. Firecrawl API - requires an API key
+3. Python implementation - fallback when the methods above fail
+
+arXiv special handling:
+- Input: https://arxiv.org/abs/2601.04500v1
+- Converted to: https://arxiv.org/html/2601.04500v1
+- Jina Reader first; on failure, Python downloads the PDF
+
+WeChat Official Account special handling:
+- WeSpy first
+- On failure, fall back to Playwright
+- Finally, fall back to the Python method
+
+Usage:
+    python convert.py <url> [--pure-text] [--output <file>]
+
+Examples:
+    python convert.py https://www.breezedeus.com/article/ai-agent-context-engineering
+    python convert.py https://arxiv.org/abs/2601.04500v1 --output paper.md
+    python convert.py https://x.com/user/status/123 --pure-text
+"""
+
+import sys
+import argparse
+import logging
+from pathlib import Path
+import requests
+from requests.adapters import HTTPAdapter
+from urllib3.util.retry import Retry
+from bs4 import BeautifulSoup
+from markdownify import markdownify as md
+import tempfile
+from urllib.parse import urlparse
+from concurrent.futures import ThreadPoolExecutor, as_completed
+import re
+import os
+
+logger = logging.getLogger(__name__)
+
+
+class WebToMarkdown:
+    """Web page to Markdown converter (parallel priority methods)."""
+
+    # Timeout constants (seconds)
+    TIMEOUT_HEAD = 3
+    TIMEOUT_JINA = 8
+    TIMEOUT_FIRECRAWL = 10
+    TIMEOUT_REQUESTS = 15
+    TIMEOUT_PLAYWRIGHT = 15000  # milliseconds
+
+    def __init__(self):
+        self.session = requests.Session()
+        self.session.headers.update(
+            {"User-Agent": "Mozilla/5.0 (compatible; DocAI-Converter/1.0)"}
+        )
+        # Retry policy: only for 429/5xx, at most 2 retries, exponential backoff
+        retry = Retry(
+            total=2,
+            backoff_factor=1,
+            status_forcelist=[429, 500, 502, 503, 504],
+            allowed_methods=["HEAD", "GET", "POST"],
+        )
+        adapter = HTTPAdapter(max_retries=retry)
+        self.session.mount("https://", adapter)
+        self.session.mount("http://", adapter)
+        # Read the Firecrawl API key from the environment
+        self.firecrawl_api_key = os.environ.get("FIRECRAWL_API_KEY")
+
+    def __enter__(self):
+        return self
+
+    def __exit__(self, exc_type, exc_val, exc_tb):
+        self.session.close()
+
+    def convert(self, url, pure_text=False, use_python=False):
+        """Convert a URL to Markdown (parallel priority methods).
+
+        Launches Jina Reader / Firecrawl / Python in parallel and takes the
+        fastest successful result. WeChat Official Account URLs and
+        --use-python mode take a direct path.
+
+        Args:
+            url: web page URL
+            pure_text: whether to return plain text (no formatting)
+            use_python: force the Python method
+
+        Returns:
+            str: Markdown or plain-text content
+        """
+        url = url.strip()
+
+        # URL validation
+        parsed = urlparse(url)
+        if parsed.scheme not in ("http", "https") or not parsed.netloc:
+            raise ValueError(f"Invalid URL: {url}")
+
+        # arXiv special handling: convert to the HTML URL
+        if self._is_arxiv(url):
+            url = self._convert_arxiv_to_html(url)
+
+        # WeChat Official Account: WeSpy first, then fall back to Playwright / Python
+        if self._is_wechat(url):
+            result = self._try_wespy(url, pure_text)
+            if result:
+                return result
+            result = self._try_playwright(url, pure_text)
+            if result:
+                return result
+            return self._python_convert(url, pure_text)
+
+        # Twitter/X special handling: rewrite twitter.com/x.com URLs to
+        # fxtwitter/fixupx to get content rendered from metadata
+        if self._is_twitter(url):
+            url = self._convert_twitter_to_proxy(url)
+
+        # Forced Python mode
+        if use_python:
+            if self._is_arxiv(url):
+                return self._handle_arxiv(url, pure_text)
+            return self._python_convert(url, pure_text)
+
+        # Launch several methods in parallel and take the fastest success
+        result = self._parallel_convert(url, pure_text)
+        if result:
+            return result
+
+        # All parallel methods failed; for arXiv, try the PDF fallback
+        if self._is_arxiv(url):
+            return self._handle_arxiv(url, pure_text)
+
+        return None
+
+    def _parallel_convert(self, url, pure_text):
+        """Try several methods in parallel and return the fastest successful result."""
+        futures = {}
+        with ThreadPoolExecutor(max_workers=4) as executor:
+            futures[executor.submit(self._try_jina_reader, url, pure_text)] = "jina"
+
+            if self.firecrawl_api_key:
+                futures[executor.submit(self._try_firecrawl, url, pure_text)] = (
+                    "firecrawl"
+                )
+
+            futures[executor.submit(self._python_convert, url, pure_text)] = "python"
+            futures[executor.submit(self._try_playwright, url, pure_text)] = (
+                "playwright"
+            )
+
+            for future in as_completed(futures):
+                try:
+                    result = future.result()
+                except Exception:
+                    continue
+                if result:
+                    for f in futures:
+                        f.cancel()
+                    return result
+        return None
+
+    def _try_jina_reader(self, url, pure_text):
+        """Try the Jina Reader API.
+
+        Usage: https://r.jina.ai/https://www.breezedeus.com/article/ai-agent-context-engineering
+        """
+        jina_base_urls = ["https://r.jinaai.cn", "https://r.jina.ai"]
+        try:
+            for jina_base_url in jina_base_urls:
+                jina_url = f"{jina_base_url}/{url}"
+                try:
+                    response = self.session.get(jina_url, timeout=self.TIMEOUT_JINA)
+                    response.raise_for_status()
+
+                    content = response.text
+                    if content and len(content.strip()) > 50:  # make sure there is content
+                        if pure_text:
+                            return content
+                        # Jina already returns decent Markdown; a light cleanup is enough
+                        return self._clean_jina_markdown(content)
+                except Exception as e:
+                    logger.warning("Jina Reader failed (%s): %s", jina_base_url, e)
+        except Exception as e:
+            logger.warning("Jina Reader failed: %s", e)
+        return None
+
+    def _try_firecrawl(self, url, pure_text):
+        """Try the Firecrawl API."""
+        if not self.firecrawl_api_key:
+            logger.info("Firecrawl API key not set (FIRECRAWL_API_KEY)")
+            return None
+
+        try:
+            response = self.session.post(
+                "https://api.firecrawl.dev/v0/scrape",
+                headers={"Authorization": f"Bearer {self.firecrawl_api_key}"},
+                json={"url": url, "formats": ["markdown"]},
+                timeout=self.TIMEOUT_FIRECRAWL,
+            )
+
+            if response.status_code == 200:
+                data = response.json()
+                if data.get("success") and data.get("data", {}).get("markdown"):
+                    markdown = data["data"]["markdown"]
+                    if pure_text:
+                        # Extract plain text from the Markdown
+                        return re.sub(r"[\*\#\`\[\]\(\)]", "", markdown)
+                    return markdown
+            else:
+                logger.warning("Firecrawl error: %s", response.status_code)
+        except Exception as e:
+            logger.warning("Firecrawl failed: %s", e)
+        return None
+
+    def _try_playwright(self, url, pure_text):
+        """Try fetching a dynamic page with Playwright."""
+        try:
+            content = self._get_with_playwright(url)
+            if not content or len(content.strip()) < 50:
+                return None
+            if pure_text:
+                return self._to_plain_text(content)
+            return self._to_markdown(content)
+        except Exception as e:
+            logger.warning("Playwright failed: %s", e)
+        return None
+
+    def _try_wespy(self, url, pure_text):
+        """Try fetching WeChat Official Account content with WeSpy."""
+        try:
+            from wespy import ArticleFetcher
+        except ImportError as e:
+            logger.warning("WeSpy is not installed: %s", e)
+            return None
+
+        try:
+            fetcher = ArticleFetcher()
+            with tempfile.TemporaryDirectory() as output_dir:
+                article_info = fetcher.fetch_article(
+                    url=url,
+                    output_dir=output_dir,
+                    save_markdown=True,
+                    save_html=False,
+                    save_json=False,
+                )
+                if not article_info:
+                    return None
+
+                markdown = self._read_wespy_markdown(output_dir, article_info)
+                if not markdown:
+                    return None
+
+                if pure_text:
+                    return self._markdown_to_plain_text(markdown)
+                return markdown.strip()
+        except Exception as e:
+            logger.warning("WeSpy failed: %s", e)
+        return None
+
+    def _python_convert(self, url, pure_text):
+        """Python implementation (fallback method)."""
+        # Auto-detect whether a browser is needed
+        use_browser = self._needs_browser(url)
+
+        if use_browser:
+            content = self._get_with_playwright(url)
+            is_pdf = False
+        else:
+            content, is_pdf = self._get_with_requests(url)
+
+        if is_pdf:
+            return self._process_pdf(content, pure_text)
+
+        # HTML conversion
+        if pure_text:
+            return self._to_plain_text(content)
+        else:
+            return self._to_markdown(content)
+
+    def _handle_arxiv(self, url, pure_text):
+        """arXiv Python fallback: switch from the HTML URL to a PDF download."""
+        try:
+            pdf_url = self._convert_arxiv_to_pdf(url)
+            logger.info("arXiv Python fallback: downloading PDF %s", pdf_url)
+            pdf_content, _ = self._get_with_requests(pdf_url)
+            return self._process_pdf(pdf_content, pure_text)
+        except Exception as e:
+            logger.error("arXiv PDF failed: %s", e)
+            return None
+
+    def _clean_jina_markdown(self, markdown):
+        """Clean up the Markdown returned by Jina Reader."""
+        # Collapse extra blank lines
+        markdown = re.sub(r"\n{3,}", "\n\n", markdown)
+        # Strip trailing spaces
+        markdown = re.sub(r" +\n", "\n", markdown)
+        return markdown.strip()
+
+    def _markdown_to_plain_text(self, markdown):
+        """Extract plain text from Markdown."""
+        text = re.sub(r"!\[[^\]]*\]\([^\)]*\)", "", markdown)
+        text = re.sub(r"\[([^\]]+)\]\([^\)]*\)", r"\1", text)
+        text = re.sub(r"^#{1,6}\s*", "", text, flags=re.MULTILINE)
+        text = re.sub(r"^[>*-]\s*", "", text, flags=re.MULTILINE)
+        text = re.sub(r"[`*_~]", "", text)
+        text = re.sub(r"\n{3,}", "\n\n", text)
+        return text.strip()
+
+    def _read_wespy_markdown(self, output_dir, article_info):
+        """Read Markdown from the WeSpy return value or its output directory."""
+        if isinstance(article_info, dict):
+            for key in ("markdown", "markdown_content", "content"):
+                value = article_info.get(key)
+                if isinstance(value, str) and value.strip():
+                    return value.strip()
+
+        markdown_files = sorted(Path(output_dir).rglob("*.md"))
+        if not markdown_files:
+            return None
+
+        return markdown_files[0].read_text(encoding="utf-8").strip()
+
+    def _is_twitter(self, url):
+        """Check whether the URL is a Twitter/X URL."""
+        netloc = urlparse(url).netloc.lower()
+        # Match the exact host or a subdomain; a plain substring test would
+        # also match unrelated hosts such as dropbox.com.
+        return any(
+            netloc == domain or netloc.endswith("." + domain)
+            for domain in ("twitter.com", "x.com")
+        )
+
+    def _convert_twitter_to_proxy(self, url):
+        """Rewrite a Twitter/X URL to fxtwitter or fixupx, which support metadata previews."""
+        parsed = urlparse(url)
+        netloc = parsed.netloc.lower()
+
+        # Replace twitter.com with fxtwitter.com and x.com with fixupx.com
+        if "twitter.com" in netloc:
+            new_netloc = netloc.replace("twitter.com", "fxtwitter.com")
+        elif "x.com" in netloc:
+            new_netloc = netloc.replace("x.com", "fixupx.com")
+        else:
+            return url
+
+        return url.replace(netloc, new_netloc)
+
+    def _is_arxiv(self, url):
+        """Detect whether the URL is an arXiv link."""
+        return "arxiv.org" in url and (
+            "/abs/" in url or "/pdf/" in url or "/html/" in url
+        )
+
+    def _is_wechat(self, url):
+        """Detect whether the URL is a WeChat Official Account link."""
+        return "weixin.qq.com" in url
+
+    def _convert_arxiv_to_html(self, url):
+        """Convert an arXiv link to its HTML URL."""
+        if "/html/" in url:
+            return url
+        if "/pdf/" in url:
+            paper_id = url.split("/pdf/")[-1].split("?")[0].replace(".pdf", "")
+            return f"https://arxiv.org/html/{paper_id}"
+        paper_id = url.split("/abs/")[-1].split("?")[0]
+        return f"https://arxiv.org/html/{paper_id}"
+
+    def _convert_arxiv_to_pdf(self, url):
+        """Convert an arXiv link to its PDF URL."""
+        if "/pdf/" in url:
+            return url
+        if "/html/" in url:
+            paper_id = url.split("/html/")[-1].split("?")[0]
+            return f"https://arxiv.org/pdf/{paper_id}.pdf"
+        paper_id = url.split("/abs/")[-1].split("?")[0]
+        return f"https://arxiv.org/pdf/{paper_id}.pdf"
+
+    def _is_known_dynamic_site(self, url):
+        """Detect whether the URL is a known dynamic site (pure function, no network calls)."""
+        parsed = urlparse(url)
+        domain = parsed.netloc.lower()
+
+        dynamic_domains = [
+            "x.com",
+            "twitter.com",
+            "medium.com",
+            "substack.com",
+            "github.com",
+            "reddit.com",
+        ]
+
+        for dynamic in dynamic_domains:
+            # Exact host or subdomain only, so "x.com" does not match dropbox.com
+            if domain == dynamic or domain.endswith("." + dynamic):
+                return True
+
+        if "weixin.qq.com" in domain:
+            return False
+
+        return None  # unknown, needs probing
+
+    def _probe_for_spa(self, url):
+        """Probe with a HEAD request to see whether the page is an SPA (makes a network call)."""
+        try:
+            response = self.session.head(
+                url, timeout=self.TIMEOUT_HEAD, allow_redirects=True
+            )
+            content_type = response.headers.get("content-type", "").lower()
+
+            if "application/json" in content_type:
+                return True
+
+            server = response.headers.get("server", "").lower()
+            if any(s in server for s in ["nextjs", "vercel", "vite"]):
+                return True
+        except Exception:
+            return True
+
+        return False
+
+    def _needs_browser(self, url):
+        """Auto-detect whether browser rendering is needed."""
+        known = self._is_known_dynamic_site(url)
+        if known is not None:
+            return known
+        return self._probe_for_spa(url)
+
+    def _get_with_requests(self, url):
+        """Fetch a static page or PDF with requests.
+
+        Returns:
+            tuple: (content, is_pdf) - content is bytes (PDF) or str (HTML)
+        """
+        response = self.session.get(url, timeout=self.TIMEOUT_REQUESTS)
+        response.raise_for_status()
+        content_type = response.headers.get("content-type", "").lower()
+        is_pdf = "application/pdf" in content_type or url.lower().endswith(".pdf")
+        if is_pdf:
+            return response.content, True
+        return response.text, False
+
+    def _process_pdf(self, pdf_content, pure_text=False):
+        """Process PDF content and return Markdown or plain text (opens the document only once)."""
+        try:
+            import fitz  # PyMuPDF
+        except ImportError:
+            raise ImportError("PDF processing requires PyMuPDF.\n" "Run: pip install pymupdf")
+
+        try:
+            doc = fitz.open(stream=pdf_content, filetype="pdf")
+
+            # Extract the full text
+            text = ""
+            for page_num, page in enumerate(doc):
+                page_text = page.get_text()
+                if page_text.strip():
+                    text += f"--- Page {page_num + 1} ---\n\n"
+                    text += page_text + "\n\n"
+            text = text.strip()
+
+            if pure_text:
+                return text
+
+            # Extract the title
+            title = None
+            metadata = doc.metadata
+            if metadata and metadata.get("title"):
+                title = metadata["title"]
+            elif doc.page_count > 0:
+                first_page = doc[0]
+                lines = [
+                    line.strip()
+                    for line in first_page.get_text().split("\n")
+                    if line.strip()
+                ]
+                if lines:
+                    title = " ".join(lines[:2])
+
+            if title:
+                return f"# {title}\n\n{text}"
+            return text
+        except Exception as e:
+            # Chain the original exception so the traceback is preserved
+            raise Exception(f"PDF processing failed: {e}") from e
+
+    def _get_with_playwright(self, url):
+        """Fetch a dynamic page with Playwright."""
+        try:
+            from playwright.sync_api import sync_playwright
+        except ImportError:
+            raise ImportError(
+                "Playwright is not installed.\n"
+                "Run: pip install playwright && playwright install chromium"
+            )
+
+        with sync_playwright() as p:
+            browser = p.chromium.launch()
+            page = browser.new_page()
+
+            # Use a mobile UA for WeChat Official Accounts
+            if "weixin.qq.com" in url:
+                page.set_extra_http_headers(
+                    {
+                        "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 14_0) AppleWebKit/605.1.15"
+                    }
+                )
+
+            try:
+                page.goto(
+                    url, wait_until="networkidle", timeout=self.TIMEOUT_PLAYWRIGHT
+                )
+                page.wait_for_timeout(2000)
+                content = page.content()
+            finally:
+                browser.close()
+
+        return content
+
+    def _to_markdown(self, html):
+        """Convert HTML to Markdown."""
+        soup = BeautifulSoup(html, "html.parser")
+
+        # Extract the title (WeChat Official Accounts and others)
+        title = None
+        # Try several title sources
+        title_selectors = ["title", "h1#activity-name", ".rich_media_title", "h1"]
+        for selector in title_selectors:
+            title_elem = soup.select_one(selector)
+            if title_elem:
+                title = title_elem.get_text(strip=True)
+                if title:
+                    break
+
+        # Find the main content (in priority order)
+        content_elem = None
+        content_selectors = [
+            "#js_content",  # WeChat Official Account
+            ".rich_media_content",  # WeChat Official Account
+            "#activity-detail",  # WeChat Official Account
+            "article",  # standard article
+            "main",  # standard main content
+            ".post-content",  # blog
+            ".article-content",  # blog
+        ]
+
+        for selector in content_selectors:
+            elem = soup.select_one(selector)
+            if elem and elem.get_text(strip=True):
+                content_elem = elem
+                break
+
+        # If no specific content was found, use body
+        if not content_elem:
+            content_elem = soup.body or soup
+
+        # Remove noise elements
+        for tag in content_elem(
+            ["script", "style", "nav", "footer", "header", "iframe", "aside"]
+        ):
+            tag.decompose()
+
+        # Remove ads and interactive elements
+        for tag in content_elem.find_all(
+            class_=lambda x: x
+            and any(
+                w in x.lower()
+                for w in [
+                    "ad",
+                    "banner",
+                    "cookie",
+                    "consent",
+                    "popup",
+                    "modal",
+                    "share",
+                    "like",
+                    "comment",
+                ]
+            )
+        ):
+            tag.decompose()
+
+        # Remove button and link areas
+        for tag in content_elem.find_all(
+            class_=lambda x: x
+            and any(w in x.lower() for w in ["btn", "button", "share", "reward"])
+        ):
+            tag.decompose()
+
+        # Remove empty paragraphs
+        for tag in content_elem.find_all("p"):
+            if not tag.get_text(strip=True):
+                tag.decompose()
+
+        # Build the final content
+        if title:
+            markdown = f"# {title}\n\n"
+        else:
+            markdown = ""
+
+        cleaned_html = str(content_elem)
+        markdown += md(cleaned_html, heading_style="ATX")
+
+        # Clean up extra whitespace
+        markdown = re.sub(r"\n{3,}", "\n\n", markdown)
+        markdown = re.sub(r" +\n", "\n", markdown)  # trailing spaces
+
+        return markdown.strip()
+
+    def _to_plain_text(self, html):
+        """Extract plain text."""
+        soup = BeautifulSoup(html, "html.parser")
+
+        main = soup.find("main") or soup.find("article") or soup.body
+        if not main:
+            return soup.get_text(separator="\n\n", strip=True)
+
+        for tag in main(["script", "style", "nav", "footer", "header", "aside"]):
+            tag.decompose()
+
+        text = main.get_text(separator="\n\n", strip=True)
+
+        text = re.sub(r"\n{3,}", "\n\n", text)
+
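+        return text.strip()
+
+
+def main():
+    """Command-line entry point."""
+    logging.basicConfig(
+        level=logging.INFO,
+        format="%(levelname)s: %(message)s",
+        stream=sys.stderr,
+    )
+
+    parser = argparse.ArgumentParser(
+        description="Convert a web page to Markdown (non-Python methods first)",
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+        epilog="""Priority methods:
+  1. Jina Reader API (https://r.jina.ai/URL) - zero install
+  2. Firecrawl API (requires FIRECRAWL_API_KEY)
+  3. Python implementation (fallback)
+
+Examples:
+  %(prog)s https://www.breezedeus.com/article/ai-agent-context-engineering
+  %(prog)s https://arxiv.org/abs/2601.04500v1 --output paper.md
+  %(prog)s https://x.com/user/status/123 --pure-text
+  %(prog)s https://www.breezedeus.com/article/ai-agent-context-engineering --use-python  # force the Python method
+        """,
+    )
+
+    parser.add_argument("url", help="URL of the web page to convert")
+    parser.add_argument(
+        "--pure-text", action="store_true", help="Output plain text (no Markdown formatting)"
+    )
+    parser.add_argument(
+        "--use-python",
+        action="store_true",
+        help="Force the Python method (skip Jina/Firecrawl)",
+    )
+    parser.add_argument("--output", "-o", help="Write output to a file")
+
+    args = parser.parse_args()
+
+    try:
+        with WebToMarkdown() as converter:
+            result = converter.convert(
+                args.url, pure_text=args.pure_text, use_python=args.use_python
+            )
+
+        if result is None:
+            logger.error("Conversion failed: no method succeeded")
+            sys.exit(1)
+
+        if args.output:
+            with open(args.output, "w", encoding="utf-8") as f:
+                f.write(result)
+            print(f"✓ Saved to: {args.output}")
+        else:
+            print(result)
+
+    except Exception as e:
+        logger.error("Error: %s", e)
+        sys.exit(1)
+
+
+if __name__ == "__main__":
+    main()
diff --git a/skills/common/web2summary/README.md b/skills/common/web2summary/README.md
new file mode 100644
index 0000000..f75cbf5
--- /dev/null
+++ b/skills/common/web2summary/README.md
@@ -0,0 +1,45 @@
+# docai-web2summary
+
+An AI Skill that generates a structured summary for any web URL.
+
+## Workflow
+
+1. **Fetch the content**: call `docai-web2md` to convert the URL to Markdown
+2. **AI summary**: the AI writes the summary directly, following the specification in [SKILL.md](SKILL.md); no external script is needed
+3. **Info card (optional)**: to generate an image card, use [info-card-designer](https://github.com/joeseesun/info-card-designer)
+
+## Usage
+
+Just tell the AI what you need:
+
+> Summarize this link: https://www.breezedeus.com/article/ai-agent-context-engineering
+
+The AI will automatically:
+- Determine the content type (paper / news / tutorial / product / AI news / general)
+- Apply the matching summary structure
+- Output Markdown in a uniform format
+
+## Output Format Example
+
+```markdown
+# **A Dashboard for Claude Code: An In-Depth Review of the claude-hud Plugin**
+
+✔ One-sentence summary: a dashboard plugin that turns Claude Code from a black box into something transparent...
+
+✔ **Key insight**: Claude Code's biggest pain point is not missing features but the "black box" experience...
+
+✔ **Technical details**: built on Claude Code's native statusline API...
+
+✔ **Use cases**: refactoring complex tasks, CI/CD debugging, long-running project development...
+
+**Source:** https://mp.weixin.qq.com/s/XClh6xJmXoXbyBC9lKzPdA
+```
+
+## Dependencies
+
+- `docai-web2md` (fetches the page content)
+- [info-card-designer](https://github.com/joeseesun/info-card-designer) (optional, generates info card images)
+
+## License
+
+MIT
\ No newline at end of file
diff --git a/skills/common/web2summary/SKILL.md b/skills/common/web2summary/SKILL.md
new file mode 100644
index 0000000..0fb589d
--- /dev/null
+++ b/skills/common/web2summary/SKILL.md
@@ -0,0 +1,168 @@
+---
+name: web2summary
+description: Summarize any web URL. Triggers on "summarize/总结/概括/摘要 + URL". Auto-detects content type (paper, news, tutorial, product, AI news) and generates adaptive structured summary.
+---
+
+# docai:web2summary
+
+## When to Trigger
+User wants to summarize a web page. Common patterns:
+- "总结这个链接", "帮我总结一下", "概括这篇文章", "给个摘要" (Chinese for "summarize this link", "summarize this for me", "outline this article", "give me an abstract")
+- "summarize this URL", "give me a summary of"
+- Any URL + intent to understand/extract key points
+
+## How to Execute
+
+### Step 1: Fetch the page content
+Use the `web2md` skill to convert the URL to Markdown:
+```bash
+python skills/web2md/tools/convert.py <url>
+```
+
+### Step 2: Summarize directly (you do this; no external AI call)
+Once you have the Markdown content, **you (the AI agent) write the summary directly, following the summary specification below**. Do not call any further script or API.
+
+### Step 3: Generate an info card (optional)
+If the user wants an info card image, generate it with the [info-card-designer](https://github.com/joeseesun/info-card-designer) skill.
+
+---
+
+## Summary Specification
+
+### Format Requirements
+
+**Heading format**
+- Headings at every level must be bold: `# **Heading**`, `## **Heading**`, `### **Heading**`
+- If the content comes from a well-known organization, append it to the level-1 heading: `# **Title | Organization**`
+- Leave a blank line between a heading and the content before it
+
+**Bold and punctuation**
+- Put the bold markers `**` inside the punctuation, not outside it
+- ✅ `「**activate more intelligently**」` ❌ `**「activate more intelligently」**`
+
+**Link handling**
+- The summary must end with the source link: `**Source:** <link>`
+- Remove the query parameters after `?` from the URL
+
+**List format**
+- Use ✔ instead of `-` / `*` for unordered lists, with a blank line after each item
+
+**Content constraints**
+- Use only information from the web page; do not infer anything on your own
+- Do not output LaTeX math formulas, and do not include indexes or citations
+
+---
+
+### Content Type Detection and Structure
+
+Determine the content type first, then output using the matching structure.
+
+---
+
+#### 🔬 Type A: Technical paper / research
+Applies to: academic papers, technical reports, arXiv papers, algorithm write-ups, etc.
+
+Structure (no more than 1000 characters overall; drop any section that does not apply):
+✔ One-sentence summary (opening): capture the core breakthrough of the research; it must be engaging
+
+✔ Key insight: what problem does it solve? What new idea does it propose?
+
+✔ Technical details / architectural innovation: key methods, model structure, algorithm design
+
+✔ Performance data / experimental results: quantitative metrics, baselines compared, key numbers
+
+✔ Use cases: where can this technique be applied?
+
+✔ Long-term significance: why is it worth attention? What is its impact on the field?
+
+✔ Source link (at the end)
+
+---
+
+#### 📰 Type B: News report
+Applies to: industry news, company updates, policy announcements, event coverage, etc.
+
+Structure (no more than 800 characters overall; drop any section that does not apply):
+✔ One-sentence summary (opening): sum up the core event and highlight its news value
+
+✔ Core event: what happened? Key details
+
+✔ Key people / organizations: who is driving it? Who is affected?
+
+✔ Background and impact: why does it matter? What is the impact on the industry or society?
+
+✔ Outlook: what might happen next?
+
+✔ Source link (at the end)
+
+---
+
+#### 📚 Type C: Tutorial / guide
+Applies to: programming tutorials, how-to guides, How-to articles, best practices, etc.
+
+Structure (no more than 1000 characters overall; drop any section that does not apply):
+✔ One-sentence summary (opening): what does this tutorial teach, and who is it for?
+
+✔ Learning goals: what will the reader be able to do afterwards?
+
+✔ Prerequisites: what background or tools are needed?
+
+✔ Key steps in brief: a condensed extract of the core flow (not a step-by-step retelling)
+
+✔ Caveats / common pitfalls: error-prone points or best practices the author mentions
+
+✔ Source link (at the end)
+
+---
+
+#### 🚀 Type D: Product launch / review
+Applies to: product launches, feature updates, product reviews, tool recommendations, etc.
+
+Structure (no more than 800 characters overall; drop any section that does not apply):
+✔ One-sentence summary (opening): the core selling point
+
+✔ Product positioning: what problem does it solve? Who is it for?
+
+✔ Core features / highlights: the features most worth attention
+
+✔ Comparison with competitors: what advantages does it have over existing options? (if mentioned in the article)
+
+✔ Target audience: who should pay the most attention?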
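+
+✔ Price / availability: how can it be obtained or used? (if mentioned in the article)
+
+✔ Source link (at the end)
+
+---
+
+#### 🤖 Type E: AI industry news
+Applies to: AI news roundups, model releases, industry trend analysis, AI newsletters, etc.
+
+Structure (no more than 1000 characters overall; drop any section that does not apply):
+✔ One-sentence summary (opening): the signal most worth attention in this issue
+
+✔ Core updates: the 2-3 most important items and what they mean
+
+✔ Technical points: the key technologies or methods involved (if any)
+
+✔ Industry impact: what does it mean for developers, companies, and users?
+
+✔ Signals to watch: which trends are taking shape?
+
+✔ Source link (at the end)
+
+---
+
+#### 📄 Type F: General
+Applies to: personal blogs, opinion pieces, essays, interviews, and other types
+
+Structure (no more than 800 characters overall; drop any section that does not apply):
+✔ One-sentence summary (opening): the core value of this content
+
+✔ Core content: what is the author saying? Main points or story
+
+✔ Key takeaways: the 2-3 points most worth remembering
+
+✔ Value and inspiration: what does the reader gain?
+
+✔ Source link (at the end)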
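
The web2summary specification is written for an AI agent and ships no code of its own. As a rough illustration only, two of its mechanical rules (dropping everything after `?` in the source URL, and the per-type length limits) could be sketched in Python as below. The names `LENGTH_LIMITS`, `strip_query`, and `source_line`, the dictionary keys, and the `Source` label are all assumptions made for this sketch; none of them appear in the patch.

```python
from urllib.parse import urlsplit, urlunsplit

# Length limits per content type, copied from the specification.
# The dictionary name and its keys are illustrative, not part of the skill.
LENGTH_LIMITS = {
    "paper": 1000,     # Type A: technical paper / research
    "news": 800,       # Type B: news report
    "tutorial": 1000,  # Type C: tutorial / guide
    "product": 800,    # Type D: product launch / review
    "ai_news": 1000,   # Type E: AI industry news
    "general": 800,    # Type F: general
}


def strip_query(url: str) -> str:
    """Return the URL without its query string or fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def source_line(url: str, label: str = "Source") -> str:
    """Build the closing source-link line; the label text is an assumption."""
    return f"**{label}:** {strip_query(url)}"


print(source_line("https://mp.weixin.qq.com/s/XClh6xJmXoXbyBC9lKzPdA?from=timeline"))
```

Using `urlsplit` rather than cutting the string at the first `?` keeps the scheme, host, and path intact and also discards any fragment, which by definition follows the query string.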