add web2md web2summary

parent ede5960b5a
commit 62c0f62134

skills/common/web2md/README.md (normal file, 194 lines added)
@@ -0,0 +1,194 @@
# docai-web2md

A standalone Python tool that converts web pages to Markdown using a **priority-based architecture**.

> 📖 **Documentation guide**
> - **SKILL.md** - Claude Code usage guide (how to invoke this skill)
> - **README.md** - this document (tool features and standalone usage)
> - **tools/convert.py** - the conversion implementation
> - **Shared reference**: [web-sources.md](../../shared/references/web-sources.md) - platform support matrix

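## Core Features

- ✅ **Jina Reader API first** - zero install, the fastest and simplest option (except for WeChat official accounts)
- ✅ **Firecrawl API support** - for advanced crawling needs
- ✅ **Smart Python fallback** - switches over automatically when the methods above fail
- ✅ **arXiv HTML first** - fetches the HTML version first and falls back to the PDF on failure
- ✅ **Multi-platform support** - WeChat official accounts (direct Python), static blogs, dynamic pages, and more
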
## Priority Strategy

```
# Conversion flow
Input URL
↓
Is it arXiv? → convert to the HTML URL
↓
Is it a WeChat official account? → direct Python method ⭐
↓
Try the Jina Reader API (fast)
↓ (on failure)
Try the Firecrawl API (requires an API key)
↓ (on failure)
Python method (fallback)
↓
arXiv? → download the PDF and extract the text
```

**Special handling for WeChat official accounts**: Jina Reader handles WeChat articles poorly, so they go straight to the Python method for the best results.

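The flow above amounts to a fallback chain: try each method in priority order and keep the first usable result. The following is a minimal sketch of that idea, not the tool's actual implementation; `convert_with_fallback` and the three stand-in fetchers are hypothetical names introduced only for illustration.

```python
from typing import Callable, Optional, Sequence

# A fetcher takes a URL and returns Markdown, or None/"" when it has nothing.
Fetcher = Callable[[str], Optional[str]]


def convert_with_fallback(url: str, fetchers: Sequence[Fetcher]) -> Optional[str]:
    """Try each fetcher in priority order; return the first non-empty result."""
    for fetch in fetchers:
        try:
            result = fetch(url)
        except Exception:
            # A failing method must not abort the chain; move on to the next one.
            continue
        if result and result.strip():
            return result
    return None


# Hypothetical stand-ins for the real methods (Jina Reader, Firecrawl, Python).
def failing_api(url: str) -> Optional[str]:
    raise RuntimeError("service unavailable")


def empty_api(url: str) -> Optional[str]:
    return ""


def python_fallback(url: str) -> Optional[str]:
    return f"# Converted\n\nSource: {url}"


chain_result = convert_with_fallback(
    "https://example.com/post", [failing_api, empty_api, python_fallback]
)
```

Here the first stand-in raises and the second returns an empty string, so the result comes from the last one. If every fetcher fails, the function returns `None` and the caller decides how to report it.
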
## Quick Start

### Option 1: uv (recommended)

```bash
# 1. Install uv (if it is not installed yet)
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. Set up the environment in the docai-skills directory
cd docai-skills
uv sync

# 3. Run the script (no need to activate the environment)
uv run python skills/docai-web2md/tools/convert.py https://www.breezedeus.com/article/ai-agent-context-engineering
```

### Option 2: pip (traditional)

```bash
# 1. Create a virtual environment (optional but recommended)
python -m venv .venv
source .venv/bin/activate

# 2. Install the dependencies
pip install requests beautifulsoup4 markdownify pymupdf

# 3. Run the script
python skills/docai-web2md/tools/convert.py https://www.breezedeus.com/article/ai-agent-context-engineering
```

### Option 3: Jina Reader API (no installation)

```bash
# Call the API directly; no dependencies are required
curl https://r.jina.ai/https://www.breezedeus.com/article/ai-agent-context-engineering
```

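The same Jina Reader call can be made from Python with only the standard library. This is a small sketch under the assumption that the endpoint simply takes the target URL appended to `https://r.jina.ai/`, as in the `curl` example; `jina_reader_url` and `fetch_markdown` are hypothetical helper names, not part of this tool.

```python
from urllib.request import urlopen

JINA_READER_BASE = "https://r.jina.ai"


def jina_reader_url(target_url: str) -> str:
    """Prefix the target URL with the Jina Reader endpoint, as the curl example does."""
    return f"{JINA_READER_BASE}/{target_url.strip()}"


def fetch_markdown(target_url: str, timeout: float = 8.0) -> str:
    """Fetch the Markdown rendering of a page (performs a network request)."""
    with urlopen(jina_reader_url(target_url), timeout=timeout) as response:
        return response.read().decode("utf-8")


reader_url = jina_reader_url(
    "https://www.breezedeus.com/article/ai-agent-context-engineering"
)
```

Building the URL is separated from fetching so the former can be checked without network access.
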
### ⚠️ Claude Code Skill Integration

**Important**: Claude Code uses the system Python when it invokes a skill, so extra setup is required:

```bash
# Install into the system Python with uv (does not affect the project's virtual environment)
uv pip install --system requests beautifulsoup4 markdownify pymupdf

# Or use pip
pip install requests beautifulsoup4 markdownify pymupdf
```

**See also**: [UV_ENVIRONMENT.md](../../UV_ENVIRONMENT.md)

## Command-Line Usage

```bash
# Basic usage (automatic priority)
python skills/docai-web2md/tools/convert.py https://www.breezedeus.com/article/ai-agent-context-engineering

# Save to a file
python skills/docai-web2md/tools/convert.py https://www.breezedeus.com/article/ai-agent-context-engineering -o article.md

# Plain-text mode
python skills/docai-web2md/tools/convert.py https://www.breezedeus.com/article/ai-agent-context-engineering --pure-text

# Force the Python method (skip Jina/Firecrawl)
python skills/docai-web2md/tools/convert.py https://www.breezedeus.com/article/ai-agent-context-engineering --use-python
```

## Priority Architecture

```
Input URL
↓
arXiv? → convert to the HTML URL
↓
Jina Reader API (⭐ zero install)
↓ on failure
Firecrawl API (requires an API key)
↓ on failure
Python implementation (catch-all fallback)
↓
arXiv? → download the PDF and extract the text
↓
Regular web page → HTML parsing
```

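The two arXiv steps in the diagram are URL rewrites around the paper ID: `/abs/` or `/pdf/` links become `/html/` links first, and the PDF link is derived only when the HTML route fails. The sketch below illustrates that mapping; the helper names are hypothetical and it is not a copy of the code in `tools/convert.py`.

```python
from urllib.parse import urlparse


def arxiv_paper_id(url: str) -> str:
    """Extract the paper ID from an arXiv /abs/, /pdf/, or /html/ URL."""
    path = urlparse(url).path  # the path excludes any query string or fragment
    for prefix in ("/abs/", "/pdf/", "/html/"):
        if path.startswith(prefix):
            paper_id = path[len(prefix):]
            # PDF links may carry a ".pdf" suffix that is not part of the ID.
            return paper_id[:-4] if paper_id.endswith(".pdf") else paper_id
    raise ValueError(f"not an arXiv paper URL: {url}")


def arxiv_html_url(url: str) -> str:
    """URL of the HTML rendering, which is tried first."""
    return f"https://arxiv.org/html/{arxiv_paper_id(url)}"


def arxiv_pdf_url(url: str) -> str:
    """URL of the PDF, used as the fallback."""
    return f"https://arxiv.org/pdf/{arxiv_paper_id(url)}.pdf"


html_url = arxiv_html_url("https://arxiv.org/abs/2601.04500v1")
pdf_url = arxiv_pdf_url(html_url)
```

Routing every form through a single `arxiv_paper_id` helper keeps the two rewrites consistent with each other.
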
## Usage Examples

```bash
# Static blog (Jina Reader)
python skills/docai-web2md/tools/convert.py https://www.breezedeus.com/article/ai-agent-context-engineering

# arXiv paper (HTML first, PDF fallback)
python skills/docai-web2md/tools/convert.py https://arxiv.org/abs/2601.04500v1

# WeChat official account (skips Jina Reader; uses the direct WeChat path)
python skills/docai-web2md/tools/convert.py https://mp.weixin.qq.com/s/1LfkYdbzymoWxdvdnKeLnA

# X.com/Twitter (Python dynamic rendering)
python skills/docai-web2md/tools/convert.py https://x.com/user/status/123
```

## Python API

```python
from skills.docai_web2md.tools.convert import WebToMarkdown

converter = WebToMarkdown()

# Automatic priority (recommended)
markdown = converter.convert("https://www.breezedeus.com/article/ai-agent-context-engineering")

# arXiv is handled automatically (HTML → PDF)
paper = converter.convert("https://arxiv.org/abs/2601.04500v1")

# Force the Python method
markdown = converter.convert("https://www.breezedeus.com/article/ai-agent-context-engineering", use_python=True)

# Plain-text output
text = converter.convert("https://www.breezedeus.com/article/ai-agent-context-engineering", pure_text=True)
```

## Dependencies

| Method | Dependency | Notes |
|------|------|------|
| **Jina Reader** | None | Only a network connection is needed |
| **Firecrawl** | `FIRECRAWL_API_KEY` | Environment variable |
| **Python fallback** | `requests`, `beautifulsoup4`, `markdownify` | Core dependencies |
| **PDF support** | `pymupdf` | arXiv PDF extraction |
| **Dynamic pages** | `playwright` | React/Vue SPAs |

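Because several rows in the table are optional, it can help to check which ones are available before running a conversion. The following sketch uses only the standard library; the feature-to-module mapping is an assumption derived from the table (note that `beautifulsoup4` is imported as `bs4` and `pymupdf` as `fitz`), and `feature_report` is a hypothetical helper, not part of the tool.

```python
from importlib.util import find_spec

# Import names for the packages listed in the table above.
OPTIONAL_FEATURES = {
    "Python fallback": ["requests", "bs4", "markdownify"],
    "PDF support": ["fitz"],
    "Dynamic pages": ["playwright"],
}


def missing_modules(modules):
    """Return the top-level modules from the list that cannot be imported."""
    return [name for name in modules if find_spec(name) is None]


def feature_report():
    """Map each feature to the modules it still needs (empty list = ready)."""
    return {
        feature: missing_modules(modules)
        for feature, modules in OPTIONAL_FEATURES.items()
    }


if __name__ == "__main__":
    for feature, missing in feature_report().items():
        status = "ok" if not missing else "missing: " + ", ".join(missing)
        print(f"{feature}: {status}")
```

`find_spec` only locates a module without importing it, so the check stays cheap even for heavy packages such as Playwright.
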
## Performance Reference

- **Jina Reader**: ~1-2 s
- **Firecrawl**: ~2-5 s
- **Python (static)**: ~1-2 s
- **Python (dynamic)**: ~5-10 s
- **arXiv PDF**: ~2-5 s

## Relationship to the Skill

- **SKILL.md**: tells Claude how to use this tool
- **tools/convert.py**: the code that performs the conversion
- **README.md**: this document (tool usage)

## Testing

```bash
# Test against a breezedeus.com blog post
python skills/docai-web2md/tools/convert.py https://www.breezedeus.com/article/ai-agent-context-engineering
```

## License

MIT

skills/common/web2md/SKILL.md (normal file, 55 lines added)
@@ -0,0 +1,55 @@
---
name: web2md
description: Convert any web URL to Markdown. Triggers on "转成Markdown/转换/网页转Markdown/convert to Markdown + URL". Handles static sites, dynamic SPAs, WeChat, arXiv, Twitter/X.
---

# docai:web2md

## When to Trigger
The user wants to convert a web page to Markdown. Common patterns:
- "把这个链接转成 Markdown", "网页转 Markdown", "提取网页内容"
- "convert this URL to Markdown", "get the content of this page"
- Any URL plus an intent to extract or read its content (without summarization)

If the user wants a summary, use web2summary instead.

## How to Execute
```bash
python skills/web2md/tools/convert.py <URL> [--pure-text] [--use-python] [-o <file>]
```

### Parameters
| Parameter | Required | Description |
|-----------|----------|-------------|
| `url` | Yes | Web page URL |
| `--pure-text` | No | Output plain text without Markdown formatting |
| `--use-python` | No | Force Python method (skip Jina/Firecrawl) |
| `-o` / `--output` | No | Save to file instead of stdout |

### Examples
```bash
# Basic conversion (parallel: Jina / Firecrawl / Python)
python skills/web2md/tools/convert.py https://example.com/article

# arXiv paper (auto HTML priority, PDF fallback)
python skills/web2md/tools/convert.py https://arxiv.org/abs/2601.04500v1

# Save to file
python skills/web2md/tools/convert.py https://mp.weixin.qq.com/s/... -o article.md

# Force Python method
python skills/web2md/tools/convert.py https://example.com --use-python
```

## What It Does
Up to four methods run in parallel, and the first successful result is returned:
1. Jina Reader API (fastest, zero install)
2. Firecrawl API (only if a key is configured)
3. Python fallback (requests + BeautifulSoup)
4. Playwright (headless browser for JS-rendered pages)

Special cases handled automatically: WeChat → WeSpy first, then Playwright (mobile UA, JS rendering), then the Python method; arXiv → HTML priority; Twitter/X → Playwright.

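"First successful result" can be sketched with `concurrent.futures`. The example below is illustrative only: `first_success` and the three stand-in methods are hypothetical, and it assumes Python 3.9+ for `cancel_futures`. It deliberately avoids a `with ThreadPoolExecutor(...)` block, because leaving such a block waits for every worker, which would make the slowest method set the overall latency.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def first_success(tasks):
    """Run zero-argument callables concurrently.

    Returns (name, result) for the first task that finishes with a truthy
    result, or None if every task fails or returns nothing.
    """
    executor = ThreadPoolExecutor(max_workers=max(1, len(tasks)))
    try:
        futures = {executor.submit(fn): name for name, fn in tasks.items()}
        for future in as_completed(futures):
            try:
                result = future.result()
            except Exception:
                continue  # a failed method does not stop the others
            if result:
                return futures[future], result
        return None
    finally:
        # Drop queued tasks and return without waiting for running ones.
        executor.shutdown(wait=False, cancel_futures=True)


# Hypothetical stand-ins for the real conversion methods.
def slow_method():
    time.sleep(0.5)
    return "slow result"


def fast_method():
    time.sleep(0.05)
    return "fast result"


def broken_method():
    raise RuntimeError("method unavailable")


winner = first_success(
    {"slow": slow_method, "fast": fast_method, "broken": broken_method}
)
```

Threads that are already running cannot be interrupted, so the slower methods still run to completion in the background; only their results are discarded.
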
## Troubleshooting
- **arXiv PDF garbled**: Requires `pymupdf`; install it with `pip install pymupdf`
- **Dynamic page empty**: The script auto-detects SPAs and uses Playwright, which must be installed (`pip install playwright && playwright install chromium`)
- **All methods fail**: Try `--use-python` to bypass the API methods

skills/common/web2md/tools/convert.py (normal file, 683 lines added)
@@ -0,0 +1,683 @@
#!/usr/bin/env python3
"""
Web to Markdown Converter Tool

Prioritized methods (non-Python methods first):
1. Jina Reader API - zero install, one-line URL conversion
2. Firecrawl API - requires an API key
3. Python implementation - fallback when the methods above fail

arXiv special handling:
- Input: https://arxiv.org/abs/2601.04500v1
- Converted to: https://arxiv.org/html/2601.04500v1
- Jina Reader first; on failure, Python downloads the PDF

WeChat official account special handling:
- WeSpy first
- Fall back to Playwright on failure
- Fall back to the Python method last

Usage:
    python convert.py <url> [--pure-text] [--use-python] [--output <file>]

Examples:
    python convert.py https://www.breezedeus.com/article/ai-agent-context-engineering
    python convert.py https://arxiv.org/abs/2601.04500v1 --output paper.md
    python convert.py https://x.com/user/status/123 --pure-text
"""

import argparse
import logging
import os
import re
import sys
import tempfile
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from urllib.parse import urlparse

import requests
from bs4 import BeautifulSoup
from markdownify import markdownify as md
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

logger = logging.getLogger(__name__)


class WebToMarkdown:
    """Web page to Markdown converter (parallel prioritized methods)."""

    # Timeout constants (seconds)
    TIMEOUT_HEAD = 3
    TIMEOUT_JINA = 8
    TIMEOUT_FIRECRAWL = 10
    TIMEOUT_REQUESTS = 15
    TIMEOUT_PLAYWRIGHT = 15000  # milliseconds

    def __init__(self):
        self.session = requests.Session()
        self.session.headers.update(
            {"User-Agent": "Mozilla/5.0 (compatible; DocAI-Converter/1.0)"}
        )
        # Retry policy: at most 2 retries with exponential backoff;
        # status-based retries apply to 429/5xx responses
        retry = Retry(
            total=2,
            backoff_factor=1,
            status_forcelist=[429, 500, 502, 503, 504],
            allowed_methods=["HEAD", "GET", "POST"],
        )
        adapter = HTTPAdapter(max_retries=retry)
        self.session.mount("https://", adapter)
        self.session.mount("http://", adapter)
        # Read the Firecrawl API key from the environment
        self.firecrawl_api_key = os.environ.get("FIRECRAWL_API_KEY")

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.session.close()

    def convert(self, url, pure_text=False, use_python=False):
        """Convert a URL to Markdown (parallel prioritized methods).

        Runs Jina Reader / Firecrawl / Python / Playwright in parallel and
        takes the fastest successful result. WeChat official accounts and
        --use-python mode take a direct path instead.

        Args:
            url: Web page URL
            pure_text: Whether to return plain text (no formatting)
            use_python: Force the Python method

        Returns:
            str: Markdown or plain-text content, or None if every method fails

        Raises:
            ValueError: If the URL is not a valid http(s) URL
        """
        url = url.strip()

        # Validate the URL
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"Invalid URL: {url}")

        # arXiv special handling: convert to the HTML URL
        if self._is_arxiv(url):
            url = self._convert_arxiv_to_html(url)

        # WeChat official accounts: WeSpy first, then Playwright, then Python
        if self._is_wechat(url):
            result = self._try_wespy(url, pure_text)
            if result:
                return result
            result = self._try_playwright(url, pure_text)
            if result:
                return result
            return self._python_convert(url, pure_text)

        # Twitter/X special handling: rewrite twitter.com/x.com URLs to
        # fxtwitter/fixupx to get content rendered from metadata
        if self._is_twitter(url):
            url = self._convert_twitter_to_proxy(url)

        # Forced Python mode
        if use_python:
            if self._is_arxiv(url):
                return self._handle_arxiv(url, pure_text)
            return self._python_convert(url, pure_text)

        # Start several methods in parallel and take the fastest success
        result = self._parallel_convert(url, pure_text)
        if result:
            return result

        # Every parallel method failed; for arXiv, try the PDF fallback
        if self._is_arxiv(url):
            return self._handle_arxiv(url, pure_text)

        return None

    def _parallel_convert(self, url, pure_text):
        """Try several methods in parallel and return the fastest successful result."""
        # Avoid "with ThreadPoolExecutor(...)": leaving that block waits for
        # every worker, so the slowest method would decide the latency.
        executor = ThreadPoolExecutor(max_workers=4)
        try:
            futures = {}
            futures[executor.submit(self._try_jina_reader, url, pure_text)] = "jina"

            if self.firecrawl_api_key:
                futures[executor.submit(self._try_firecrawl, url, pure_text)] = (
                    "firecrawl"
                )

            futures[executor.submit(self._python_convert, url, pure_text)] = "python"
            futures[executor.submit(self._try_playwright, url, pure_text)] = (
                "playwright"
            )

            for future in as_completed(futures):
                try:
                    result = future.result()
                except Exception as e:
                    logger.warning("%s failed: %s", futures[future], e)
                    continue
                if result:
                    return result
            return None
        finally:
            # Cancel queued tasks and return without waiting for running ones
            # (cancel_futures requires Python 3.9+).
            executor.shutdown(wait=False, cancel_futures=True)

    def _try_jina_reader(self, url, pure_text):
        """Try the Jina Reader API.

        Usage: https://r.jina.ai/https://www.breezedeus.com/article/ai-agent-context-engineering
        """
        jina_base_urls = ["https://r.jinaai.cn", "https://r.jina.ai"]
        for jina_base_url in jina_base_urls:
            jina_url = f"{jina_base_url}/{url}"
            try:
                response = self.session.get(jina_url, timeout=self.TIMEOUT_JINA)
                response.raise_for_status()

                content = response.text
                if content and len(content.strip()) > 50:  # make sure there is real content
                    if pure_text:
                        return content
                    # Jina already returns decent Markdown; light cleanup is enough
                    return self._clean_jina_markdown(content)
            except Exception as e:
                logger.warning("Jina Reader failed (%s): %s", jina_base_url, e)
        return None

    def _try_firecrawl(self, url, pure_text):
        """Try the Firecrawl API."""
        if not self.firecrawl_api_key:
            logger.info("Firecrawl API key is not set (FIRECRAWL_API_KEY)")
            return None

        try:
            response = self.session.post(
                "https://api.firecrawl.dev/v0/scrape",
                headers={"Authorization": f"Bearer {self.firecrawl_api_key}"},
                json={"url": url, "formats": ["markdown"]},
                timeout=self.TIMEOUT_FIRECRAWL,
            )

            if response.status_code == 200:
                data = response.json()
                if data.get("success") and data.get("data", {}).get("markdown"):
                    markdown = data["data"]["markdown"]
                    if pure_text:
                        # Strip Markdown syntax characters to get plain text
                        return re.sub(r"[\*\#\`\[\]\(\)]", "", markdown)
                    return markdown
            else:
                logger.warning("Firecrawl error: %s", response.status_code)
        except Exception as e:
            logger.warning("Firecrawl failed: %s", e)
        return None

    def _try_playwright(self, url, pure_text):
        """Try fetching a dynamic page with Playwright."""
        try:
            content = self._get_with_playwright(url)
            if not content or len(content.strip()) < 50:
                return None
            if pure_text:
                return self._to_plain_text(content)
            return self._to_markdown(content)
        except Exception as e:
            logger.warning("Playwright failed: %s", e)
        return None

    def _try_wespy(self, url, pure_text):
        """Try fetching a WeChat official account article with WeSpy."""
        try:
            from wespy import ArticleFetcher
        except ImportError as e:
            logger.warning("WeSpy is not installed: %s", e)
            return None

        try:
            fetcher = ArticleFetcher()
            with tempfile.TemporaryDirectory() as output_dir:
                article_info = fetcher.fetch_article(
                    url=url,
                    output_dir=output_dir,
                    save_markdown=True,
                    save_html=False,
                    save_json=False,
                )
                if not article_info:
                    return None

                markdown = self._read_wespy_markdown(output_dir, article_info)
                if not markdown:
                    return None

                if pure_text:
                    return self._markdown_to_plain_text(markdown)
                return markdown.strip()
        except Exception as e:
            logger.warning("WeSpy failed: %s", e)
        return None

    def _python_convert(self, url, pure_text):
        """Python implementation (fallback method)."""
        # Detect automatically whether a browser is needed
        use_browser = self._needs_browser(url)

        if use_browser:
            content = self._get_with_playwright(url)
            is_pdf = False
        else:
            content, is_pdf = self._get_with_requests(url)

        if is_pdf:
            return self._process_pdf(content, pure_text)

        # HTML conversion
        if pure_text:
            return self._to_plain_text(content)
        else:
            return self._to_markdown(content)

    def _handle_arxiv(self, url, pure_text):
        """arXiv Python fallback: switch from the HTML URL to a PDF download."""
        try:
            pdf_url = self._convert_arxiv_to_pdf(url)
            logger.info("arXiv Python fallback: downloading PDF %s", pdf_url)
            pdf_content, _ = self._get_with_requests(pdf_url)
            return self._process_pdf(pdf_content, pure_text)
        except Exception as e:
            logger.error("arXiv PDF failed: %s", e)
        return None

    def _clean_jina_markdown(self, markdown):
        """Clean up the Markdown returned by Jina Reader."""
        # Collapse runs of blank lines
        markdown = re.sub(r"\n{3,}", "\n\n", markdown)
        # Strip trailing spaces
        markdown = re.sub(r" +\n", "\n", markdown)
        return markdown.strip()

    def _markdown_to_plain_text(self, markdown):
        """Extract plain text from Markdown."""
        # Drop images, keep link text, then strip heading, quote, list, and emphasis markers
        text = re.sub(r"!\[[^\]]*\]\([^\)]*\)", "", markdown)
        text = re.sub(r"\[([^\]]+)\]\([^\)]*\)", r"\1", text)
        text = re.sub(r"^#{1,6}\s*", "", text, flags=re.MULTILINE)
        text = re.sub(r"^[>*-]\s*", "", text, flags=re.MULTILINE)
        text = re.sub(r"[`*_~]", "", text)
        text = re.sub(r"\n{3,}", "\n\n", text)
        return text.strip()

    def _read_wespy_markdown(self, output_dir, article_info):
        """Read Markdown from the WeSpy return value or its output directory."""
        if isinstance(article_info, dict):
            for key in ("markdown", "markdown_content", "content"):
                value = article_info.get(key)
                if isinstance(value, str) and value.strip():
                    return value.strip()

        markdown_files = sorted(Path(output_dir).rglob("*.md"))
        if not markdown_files:
            return None

        return markdown_files[0].read_text(encoding="utf-8").strip()

    def _is_twitter(self, url):
        """Check whether the URL is a Twitter/X URL."""
        host = (urlparse(url).hostname or "").lower()
        # Match the domain itself or a subdomain; a plain substring test would
        # also match hosts such as netflix.com.
        return any(
            host == domain or host.endswith("." + domain)
            for domain in ("twitter.com", "x.com")
        )

    def _convert_twitter_to_proxy(self, url):
        """Rewrite a Twitter/X URL to fxtwitter or fixupx, which serve metadata previews."""
        if not self._is_twitter(url):
            return url

        parsed = urlparse(url)
        netloc = parsed.netloc.lower()

        # Replace twitter.com with fxtwitter.com, and x.com with fixupx.com
        if "twitter.com" in netloc:
            new_netloc = netloc.replace("twitter.com", "fxtwitter.com")
        else:
            new_netloc = netloc.replace("x.com", "fixupx.com")

        # Rebuild the URL from its parsed parts so that only the host changes
        return parsed._replace(netloc=new_netloc).geturl()

    def _is_arxiv(self, url):
        """Check whether the URL is an arXiv link."""
        return "arxiv.org" in url and (
            "/abs/" in url or "/pdf/" in url or "/html/" in url
        )

    def _is_wechat(self, url):
        """Check whether the URL is a WeChat official account link."""
        return "weixin.qq.com" in url

    def _convert_arxiv_to_html(self, url):
        """Convert an arXiv link to its HTML URL."""
        if "/html/" in url:
            return url
        if "/pdf/" in url:
            paper_id = url.split("/pdf/")[-1].split("?")[0].replace(".pdf", "")
            return f"https://arxiv.org/html/{paper_id}"
        paper_id = url.split("/abs/")[-1].split("?")[0]
        return f"https://arxiv.org/html/{paper_id}"

    def _convert_arxiv_to_pdf(self, url):
        """Convert an arXiv link to its PDF URL."""
        if "/pdf/" in url:
            return url
        if "/html/" in url:
            paper_id = url.split("/html/")[-1].split("?")[0]
            return f"https://arxiv.org/pdf/{paper_id}.pdf"
        paper_id = url.split("/abs/")[-1].split("?")[0]
        return f"https://arxiv.org/pdf/{paper_id}.pdf"

    def _is_known_dynamic_site(self, url):
        """Check whether the URL belongs to a known dynamic site (pure function, no network calls).

        Returns True or False for known sites, and None when probing is needed.
        """
        parsed = urlparse(url)
        domain = parsed.netloc.lower()

        dynamic_domains = [
            "x.com",
            "twitter.com",
            "medium.com",
            "substack.com",
            "github.com",
            "reddit.com",
        ]

        for dynamic in dynamic_domains:
            # Match the domain itself or a subdomain, not any host that merely
            # ends with the same letters (e.g. netflix.com vs x.com)
            if domain == dynamic or domain.endswith("." + dynamic):
                return True

        if "weixin.qq.com" in domain:
            return False

        return None  # unknown; needs probing

    def _probe_for_spa(self, url):
        """Probe with a HEAD request to guess whether the page is an SPA (makes a network call)."""
        try:
            response = self.session.head(
                url, timeout=self.TIMEOUT_HEAD, allow_redirects=True
            )
            content_type = response.headers.get("content-type", "").lower()

            if "application/json" in content_type:
                return True

            server = response.headers.get("server", "").lower()
            if any(s in server for s in ["nextjs", "vercel", "vite"]):
                return True
        except Exception:
            # If the probe itself fails, assume a browser is needed
            return True

        return False

    def _needs_browser(self, url):
        """Detect automatically whether browser rendering is needed."""
        known = self._is_known_dynamic_site(url)
        if known is not None:
            return known
        return self._probe_for_spa(url)

    def _get_with_requests(self, url):
        """Fetch a static page or a PDF with requests.

        Returns:
            tuple: (content, is_pdf) - content is bytes (PDF) or str (HTML)
        """
        response = self.session.get(url, timeout=self.TIMEOUT_REQUESTS)
        response.raise_for_status()
        content_type = response.headers.get("content-type", "").lower()
        is_pdf = "application/pdf" in content_type or url.lower().endswith(".pdf")
        if is_pdf:
            return response.content, True
        return response.text, False

    def _process_pdf(self, pdf_content, pure_text=False):
        """Process PDF content and return Markdown or plain text (opens the document only once)."""
        try:
            import fitz  # PyMuPDF
        except ImportError:
            raise ImportError("PDF processing requires PyMuPDF.\nRun: pip install pymupdf")

        try:
            doc = fitz.open(stream=pdf_content, filetype="pdf")

            # Extract the full text, page by page
            text = ""
            for page_num, page in enumerate(doc):
                page_text = page.get_text()
                if page_text.strip():
                    text += f"--- Page {page_num + 1} ---\n\n"
                    text += page_text + "\n\n"
            text = text.strip()

            if pure_text:
                return text

            # Extract the title: metadata first, then the first lines of page 1
            title = None
            metadata = doc.metadata
            if metadata and metadata.get("title"):
                title = metadata["title"]
            elif doc.page_count > 0:
                first_page = doc[0]
                lines = [
                    line.strip()
                    for line in first_page.get_text().split("\n")
                    if line.strip()
                ]
                if lines:
                    title = " ".join(lines[:2])

            if title:
                return f"# {title}\n\n{text}"
            return text
        except Exception as e:
            # Chain the original exception so the root cause stays visible
            raise RuntimeError(f"PDF processing failed: {e}") from e

    def _get_with_playwright(self, url):
        """Fetch a dynamic page with Playwright."""
        try:
            from playwright.sync_api import sync_playwright
        except ImportError:
            raise ImportError(
                "Playwright is not installed.\n"
                "Run: pip install playwright && playwright install chromium"
            )

        with sync_playwright() as p:
            browser = p.chromium.launch()
            page = browser.new_page()

            # Use a mobile UA for WeChat official accounts
            if "weixin.qq.com" in url:
                page.set_extra_http_headers(
                    {
                        "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 14_0) AppleWebKit/605.1.15"
                    }
                )

            try:
                page.goto(
                    url, wait_until="networkidle", timeout=self.TIMEOUT_PLAYWRIGHT
                )
                page.wait_for_timeout(2000)
                content = page.content()
            finally:
                browser.close()

        return content

    def _to_markdown(self, html):
        """Convert HTML to Markdown."""
        soup = BeautifulSoup(html, "html.parser")

        # Extract the title (WeChat official accounts, etc.)
        title = None
        # Try several title sources
        title_selectors = ["title", "h1#activity-name", ".rich_media_title", "h1"]
        for selector in title_selectors:
            title_elem = soup.select_one(selector)
            if title_elem:
                title = title_elem.get_text(strip=True)
                if title:
                    break

        # Find the main content (in priority order)
        content_elem = None
        content_selectors = [
            "#js_content",  # WeChat official account
            ".rich_media_content",  # WeChat official account
            "#activity-detail",  # WeChat official account
            "article",  # standard article
            "main",  # standard main content
            ".post-content",  # blog
            ".article-content",  # blog
        ]

        for selector in content_selectors:
            elem = soup.select_one(selector)
            if elem and elem.get_text(strip=True):
                content_elem = elem
                break

        # Fall back to body if no specific content container was found
        if not content_elem:
            content_elem = soup.body or soup

        # Remove noise elements
        for tag in content_elem(
            ["script", "style", "nav", "footer", "header", "iframe", "aside"]
        ):
            tag.decompose()

        def is_noise_class(value):
            """Return True for class names that mark ads or interactive widgets."""
            if not value:
                return False
            value = value.lower()
            # "ad" must be a whole token: as a bare substring it would also
            # match unrelated classes such as "loading" or "shadow"
            if re.search(r"(?<![a-z])ad(?![a-z])", value):
                return True
            return any(
                w in value
                for w in [
                    "banner",
                    "cookie",
                    "consent",
                    "popup",
                    "modal",
                    "share",
                    "like",
                    "comment",
                ]
            )

        # Remove ads and interactive elements
        for tag in content_elem.find_all(class_=is_noise_class):
            tag.decompose()

        # Remove buttons and link areas
        for tag in content_elem.find_all(
            class_=lambda x: x
            and any(w in x.lower() for w in ["btn", "button", "share", "reward"])
        ):
            tag.decompose()

        # Remove empty paragraphs
        for tag in content_elem.find_all("p"):
            if not tag.get_text(strip=True):
                tag.decompose()

        # Build the final content
        if title:
            markdown = f"# {title}\n\n"
        else:
            markdown = ""

        cleaned_html = str(content_elem)
        markdown += md(cleaned_html, heading_style="ATX")

        # Clean up extra whitespace
        markdown = re.sub(r"\n{3,}", "\n\n", markdown)
        markdown = re.sub(r" +\n", "\n", markdown)  # trailing spaces

        return markdown.strip()

    def _to_plain_text(self, html):
        """Extract plain text."""
        soup = BeautifulSoup(html, "html.parser")

        main = soup.find("main") or soup.find("article") or soup.body
        if not main:
            return soup.get_text(separator="\n\n", strip=True)

        for tag in main(["script", "style", "nav", "footer", "header", "aside"]):
            tag.decompose()

        text = main.get_text(separator="\n\n", strip=True)

        text = re.sub(r"\n{3,}", "\n\n", text)

        return text.strip()


def main():
    """Command-line entry point."""
    logging.basicConfig(
        level=logging.INFO,
        format="%(levelname)s: %(message)s",
        stream=sys.stderr,
    )

    parser = argparse.ArgumentParser(
        description="Convert a web page to Markdown (non-Python methods are preferred)",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""Prioritized methods:
  1. Jina Reader API (https://r.jina.ai/URL) - zero install
  2. Firecrawl API (requires FIRECRAWL_API_KEY)
  3. Python implementation (fallback)

Examples:
  %(prog)s https://www.breezedeus.com/article/ai-agent-context-engineering
  %(prog)s https://arxiv.org/abs/2601.04500v1 --output paper.md
  %(prog)s https://x.com/user/status/123 --pure-text
  %(prog)s https://www.breezedeus.com/article/ai-agent-context-engineering --use-python  # force the Python method
""",
    )

    parser.add_argument("url", help="URL of the web page to convert")
    parser.add_argument(
        "--pure-text", action="store_true", help="Output plain text (no Markdown formatting)"
    )
    parser.add_argument(
        "--use-python",
        action="store_true",
        help="Force the Python method (skip Jina/Firecrawl)",
    )
    parser.add_argument("--output", "-o", help="Write the output to a file")

    args = parser.parse_args()

    try:
        with WebToMarkdown() as converter:
            result = converter.convert(
                args.url, pure_text=args.pure_text, use_python=args.use_python
            )

        if result is None:
            logger.error("Conversion failed: no method was able to fetch the page")
            sys.exit(1)

        if args.output:
            with open(args.output, "w", encoding="utf-8") as f:
                f.write(result)
            print(f"✓ Saved to: {args.output}")
        else:
            print(result)

    except Exception as e:
        logger.error("Error: %s", e)
        sys.exit(1)


if __name__ == "__main__":
    main()

45
skills/common/web2summary/README.md
Normal file
@ -0,0 +1,45 @@
# docai-web2summary

An AI Skill that generates a structured summary for any web page URL.

## Workflow

1. **Fetch content**: call `docai-web2md` to convert the URL to Markdown
2. **AI summary**: the AI writes the summary directly, following the rules in [SKILL.md](SKILL.md); no external script is needed
3. **Info card (optional)**: to generate an image card, use [info-card-designer](https://github.com/joeseesun/info-card-designer)

## Usage

Just tell the AI what you need:

> Summarize this link: https://www.breezedeus.com/article/ai-agent-context-engineering

The AI will automatically:
- Determine the content type (paper / news / tutorial / product / AI news / general)
- Apply the matching summary structure
- Output Markdown in a unified format

## Example Output

```markdown
# **A Dashboard for Claude Code: An In-Depth Review of the claude-hud Plugin**

✔ One-sentence summary: a dashboard plugin that turns Claude Code from a black box into something transparent...

✔ **Core insight**: Claude Code's biggest pain point is not a lack of features but the "black box" experience...

✔ **Technical details**: built on Claude Code's native statusline API...

✔ **Application scenarios**: complex refactoring tasks, CI/CD debugging, long-running project development...

**Original:** https://mp.weixin.qq.com/s/XClh6xJmXoXbyBC9lKzPdA
```

## Dependencies

- `docai-web2md` (fetches the web page content)
- [info-card-designer](https://github.com/joeseesun/info-card-designer) (optional, generates info-card images)

## License

MIT
168
skills/common/web2summary/SKILL.md
Normal file
@ -0,0 +1,168 @@
---
name: web2summary
description: Summarize any web URL. Triggers on "summarize/总结/概括/摘要 + URL". Auto-detects content type (paper, news, tutorial, product, AI news) and generates an adaptive structured summary.
---

# docai:web2summary

## When to Trigger
The user wants to summarize a web page. Common patterns:
- "总结这个链接" (summarize this link), "帮我总结一下" (summarize this for me), "概括这篇文章" (outline this article), "给个摘要" (give me an abstract)
- "summarize this URL", "give me a summary of"
- Any URL + intent to understand/extract key points

## How to Execute

### Step 1: Fetch the web page content
Use the `web2md` skill to convert the URL to Markdown:
```bash
python skills/web2md/tools/convert.py <URL>
```
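
When Step 1 has to be driven from a script rather than typed by hand, the call can be wrapped roughly as follows. This is a minimal sketch, not part of the skill: `build_command` and `fetch_markdown` are hypothetical helper names, the script path is simply copied from the command above (adjust it to wherever `convert.py` lives in your checkout), and the flags are the ones `convert.py` declares in its argument parser.

```python
import subprocess
import sys
from typing import List, Optional

# Assumption: path copied from the Step 1 command; it may differ in your checkout.
CONVERT_SCRIPT = "skills/web2md/tools/convert.py"


def build_command(
    url: str,
    output: Optional[str] = None,
    pure_text: bool = False,
    use_python: bool = False,
    script: str = CONVERT_SCRIPT,
) -> List[str]:
    """Assemble the argv list for convert.py (hypothetical helper)."""
    cmd = [sys.executable, script, url]
    if pure_text:
        cmd.append("--pure-text")
    if use_python:
        cmd.append("--use-python")
    if output:
        cmd += ["--output", output]
    return cmd


def fetch_markdown(url: str, timeout: float = 120.0) -> str:
    """Run convert.py and return whatever Markdown it prints to stdout."""
    result = subprocess.run(
        build_command(url),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # a non-zero exit status raises CalledProcessError
    )
    return result.stdout
```

Passing the arguments as a list (instead of a shell string) avoids quoting problems with URLs that contain `&` or `?`.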

### Step 2: Summarize directly (you do this; no external AI call)
Once you have the Markdown content, **you (the AI agent) write the summary directly, following the summary rules below**; there is no need to call any other script or API.

### Step 3: Generate an info card (optional)
If the user wants an info-card image, generate it with the [info-card-designer](https://github.com/joeseesun/info-card-designer) skill.

---

## Summary Rules

### Format Requirements

**Heading format**
- Headings at every level must be bold: `# **Title**`, `## **Title**`, `### **Title**`
- If the content comes from a well-known organization, append its name to the level-1 heading: `# **Title | Organization**`
- Leave one blank line between a heading and the content before it
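
The heading rule can also be checked or enforced after the fact. The sketch below is one possible post-processing step, not something the skill ships: `bold_headings` is a hypothetical helper that wraps the text of every ATX heading in `**` unless it is already bold. It does not skip fenced code blocks, so a `# comment` line inside a fence would be wrapped as well.

```python
import re

# Matches ATX headings such as "## Title" (1 to 6 '#' followed by whitespace).
_HEADING = re.compile(r"^(#{1,6})[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def bold_headings(markdown: str) -> str:
    """Wrap every heading's text in ** unless it is already bold."""

    def _wrap(match: "re.Match[str]") -> str:
        hashes, title = match.group(1), match.group(2)
        already_bold = (
            len(title) > 4 and title.startswith("**") and title.endswith("**")
        )
        if already_bold:
            return f"{hashes} {title}"
        return f"{hashes} **{title}**"

    return _HEADING.sub(_wrap, markdown)
```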

**Bold and punctuation**
- Put the bold markers `**` inside the surrounding punctuation marks, not outside them
- ✅ `「**更聪明地激活**」` ❌ `**「更聪明地激活」**`
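
To make the ✅ / ❌ pair concrete, here is a small sketch that rewrites the wrong form into the right one. `move_bold_inside` and the `PAIRS` table are assumptions made for illustration; the source only shows the `「」` case, and the other bracket pairs are guesses at what "punctuation" covers.

```python
import re

# Assumed set of paired punctuation; only the 「」 pair appears in the rule above.
PAIRS = {"「": "」", "『": "』", "“": "”", "《": "》", "(": ")"}

_OPENERS = "".join(re.escape(ch) for ch in PAIRS)
_CLOSERS = "".join(re.escape(ch) for ch in PAIRS.values())
_OUTSIDE_BOLD = re.compile(
    r"\*\*([" + _OPENERS + r"])(.+?)([" + _CLOSERS + r"])\*\*"
)


def move_bold_inside(text: str) -> str:
    """Rewrite **「x」** as 「**x**」 for every known bracket pair."""

    def _swap(match: "re.Match[str]") -> str:
        opener, inner, closer = match.groups()
        if PAIRS[opener] != closer:
            return match.group(0)  # mismatched pair: leave untouched
        return f"{opener}**{inner}**{closer}"

    return _OUTSIDE_BOLD.sub(_swap, text)
```

Text that already follows the rule is left unchanged, because the pattern only fires when `**` is immediately followed by an opening bracket.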

**Link handling**
- The summary must end with the original link: `**Original:** <link>`
- Remove the query parameters after `?` in the URL
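
Stripping the query string is straightforward with the standard library. The following is a minimal sketch; `strip_query` is a hypothetical helper, and dropping the `#fragment` along with the query is an assumption (the rule itself only mentions what follows `?`).

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url: str) -> str:
    """Return the URL without its query string (and, by assumption, its fragment)."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
```

For example, `strip_query("https://mp.weixin.qq.com/s/XClh6xJmXoXbyBC9lKzPdA?from=timeline")` yields the bare article URL. Note that applying the rule literally would also break URLs whose query string identifies the page itself, so it may be worth checking the cleaned link still resolves.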

**List format**
- Use ✔ instead of `-` / `*` for unordered lists, and leave a blank line after each item
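
As an illustration of the list rule, the sketch below converts ordinary Markdown bullets into the ✔ style. `to_check_list` is a hypothetical helper; it flattens nested bullets, collapses consecutive blank lines, and would misread a spaced horizontal rule such as `- - -` as a bullet, so treat it as a starting point only.

```python
import re

# A bullet is '-' or '*' followed by whitespace; '---' and '**bold**' do not match.
_BULLET = re.compile(r"^[ \t]*[-*][ \t]+(.*)$")


def to_check_list(markdown: str) -> str:
    """Replace '-' / '*' bullets with '✔' and put a blank line after each item."""
    lines = []
    for line in markdown.splitlines():
        match = _BULLET.match(line)
        if match:
            lines.append(f"✔ {match.group(1)}")
            lines.append("")  # the blank line required after every item
        else:
            lines.append(line)

    # Collapse consecutive blank lines so existing spacing is not doubled.
    cleaned = []
    for line in lines:
        if line == "" and cleaned and cleaned[-1] == "":
            continue
        cleaned.append(line)
    return "\n".join(cleaned).rstrip("\n")
```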

**Content constraints**
- Use only information found on the web page; do not add your own inferences
- Do not output LaTeX math formulas, and do not include indexes or citations

---

### Content Type Detection and Structure

First determine the content type, then write the summary using the matching structure.

---

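#### 🔬 Type A: Technical Paper / Research
Applies to: academic papers, technical reports, arXiv papers, algorithm introductions, etc.

Structure (no more than 1000 characters overall; omit any section the source has no content for):
✔ One-sentence summary (opening): conveys the core breakthrough of the research; it must be compelling

✔ Core insight: what problem does it solve? What new idea does it propose?

✔ Technical details / architectural innovations: key methods, model structure, algorithm design

✔ Performance data / experimental results: quantitative metrics, comparison baselines, key figures

✔ Application scenarios: where can this technique be used?

✔ Long-term significance: why is it worth attention? Its impact on the field

✔ Original link (at the end)

---
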
#### 📰 Type B: News Report
Applies to: industry news, company updates, policy announcements, event coverage, etc.

Structure (no more than 800 characters overall; omit any section the source has no content for):
✔ One-sentence summary (opening): sums up the core event and highlights its news value

✔ Core event: what happened? Key details

✔ Key people / organizations: who is driving it? Who is affected?

✔ Background and impact: why does it matter? Its impact on the industry / society

✔ Outlook: what might happen next?

✔ Original link (at the end)

---

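#### 📚 Type C: Tutorial / Guide
Applies to: programming tutorials, operating guides, how-to articles, best practices, etc.

Structure (no more than 1000 characters overall; omit any section the source has no content for):
✔ One-sentence summary (opening): what does this tutorial teach, and who is it for?

✔ Learning goals: what will the reader be able to do afterwards?

✔ Prerequisites: what background or tools are needed?

✔ Key steps in brief: a condensed extract of the core workflow (not a step-by-step retelling)

✔ Caveats / common pitfalls: error-prone points or best practices mentioned by the author

✔ Original link (at the end)

---
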
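#### 🚀 Type D: Product Launch / Review
Applies to: product launches, feature updates, product reviews, tool recommendations, etc.

Structure (no more than 800 characters overall; omit any section the source has no content for):
✔ One-sentence summary (opening): the core selling point

✔ Product positioning: what problem does it solve? Who is it for?

✔ Core features / highlights: the features most worth attention

✔ Comparison with competitors: what advantages does it have over existing solutions? (if mentioned in the article)

✔ Target audience: who should pay the most attention?

✔ Pricing / availability: how can it be obtained or used? (if mentioned in the article)

✔ Original link (at the end)

---
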
#### 🤖 Type E: AI Industry News
Applies to: AI news roundups, model releases, industry trend analyses, AI newsletters, etc.

Structure (no more than 1000 characters overall; omit any section the source has no content for):
✔ One-sentence summary (opening): the signal most worth attention in this issue

✔ Core updates: the 2-3 most important items and what they mean

✔ Technical points: the key technologies or methods involved (if any)

✔ Industry impact: what does it mean for developers / companies / users?

✔ Signals to watch: which trends are taking shape?

✔ Original link (at the end)

---

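#### 📄 Type F: General
Applies to: personal blogs, opinion pieces, essays, interviews, and other types

Structure (no more than 800 characters overall; omit any section the source has no content for):
✔ One-sentence summary (opening): the core value of this piece

✔ Core content: what is the author saying? The main points or story

✔ Key takeaways: the 2-3 points most worth remembering

✔ Value and inspiration: what does the reader gain from it?

✔ Original link (at the end)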
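
The per-type length limits listed above (A: 1000, B: 800, C: 1000, D: 800, E: 1000, F: 800) lend themselves to a quick automated check. The sketch below is illustrative only: `SummaryType`, `SUMMARY_TYPES`, `count_chars`, and `within_limit` are hypothetical names, and counting non-whitespace characters is just one assumed reading of the limit, since the spec does not say how length is measured.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class SummaryType:
    name: str
    max_chars: int  # overall length limit stated for this type


# Limits copied from the type definitions A-F above.
SUMMARY_TYPES: Dict[str, SummaryType] = {
    "A": SummaryType("Technical paper / research", 1000),
    "B": SummaryType("News report", 800),
    "C": SummaryType("Tutorial / guide", 1000),
    "D": SummaryType("Product launch / review", 800),
    "E": SummaryType("AI industry news", 1000),
    "F": SummaryType("General", 800),
}


def count_chars(summary: str) -> int:
    """Count non-whitespace characters (an assumed way to measure length)."""
    return sum(1 for ch in summary if not ch.isspace())


def within_limit(summary: str, type_key: str) -> bool:
    """Check a finished summary against the limit for its content type."""
    return count_chars(summary) <= SUMMARY_TYPES[type_key].max_chars
```

A check like this only covers length; whether the summary picked the right type and structure still has to be judged by reading it.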