Merge branch 'developing' into bot_manager

This commit is contained in:
朱潮 2026-06-07 10:26:43 +08:00
commit d009411360
49 changed files with 5372 additions and 9 deletions

View File

@ -0,0 +1,55 @@
---
feature: "memory"
scope: "Agent 长期记忆能力(基于 Mem0 + pgvector跨会话回忆与事实提取存储"
updated_at: "2026-06-01"
status: active
---
# Memory记忆功能
## 当前状态
Agent 的长期记忆能力,底层使用 **Mem0** 库 + **pgvector**PostgreSQL 向量存储)。
在 agent 执行前 `recall` 相关记忆并注入 system prompt在执行后于后台线程异步提取并存储新事实。
`(user_id, agent_id)` 多租户隔离,每个 `agent_id` 一张 `mem0_{agent_id}` 集合表。
> 注意API/配置字段历史上叫 `memori`,为兼容性保留命名,内部实际用的是 **Mem0**
## 配置开关
| 层级 | 字段 | 默认 | 位置 |
|------|------|------|------|
| 全局总开关 | `MEM0_ENABLED` (env) | `true` | `utils/settings.py:80` |
| Agent 配置 | `enable_memori: bool` | `False` | `agent/agent_config.py:47` |
| API 请求 | `enable_memory: bool` | `False` | `utils/api_models.py:56` |
| 召回数量 | `memori_semantic_search_top_k: int` | `20` | `agent/agent_config.py:48` |
| 召回数量(env) | `MEM0_SEMANTIC_SEARCH_TOP_K` | `20` | `utils/settings.py:84` |
| 连接池大小 | `MEM0_POOL_SIZE` (env) | `50` | `utils/settings.py:61` |
开启路径V1 走请求体 `enable_memory`V2 走 bot 配置 `enable_memory`;两者都受全局 `MEM0_ENABLED` 限制。
中间件注册在 `agent/deep_assistant.py:270``if config.enable_memori:`)。
## 核心文件
- `agent/mem0_manager.py` — Mem0 客户端管理器:实例创建/LRU 缓存(最多 50、连接池管理、`recall_memories` / `add_memory` / `delete_all`、多租户隔离、`CustomMem0Embedding`、`json_repair` 补丁
- `agent/mem0_middleware.py` — 中间件:`before_agent` 召回并写入 `config._mem0_context`(行 114/155`after_agent` 后台异步提取存储
- `agent/mem0_config.py` — Mem0 配置类user/agent/session id、记忆提示模板、自定义提取 prompt 加载(`PreMemoryPrompt` hook
- `routes/memory.py` — 内存管理 APIGET/POST/DELETE供前端管理用户记忆
- `drop_mem0_tables.py` — 清理脚本,删除所有 `mem0_*` 表(重置/清脏数据)
## 数据流
**写入**User+Assistant 消息 → `after_agent`(后台线程)→ `add_memory``Mem0.add()`LLM 提取事实)→ pgvector 向量化存入 `mem0_{agent_id}`
**读取**User query → `before_agent``recall_memories``Mem0.search()`(向量相似 top_k→ 格式化后写入 `config._mem0_context` → 注入 system prompt也供思考功能 [[../thinking/MEMORY|thinking]] 使用)。
## 关键设计决策
- 复用项目已加载的 embedding 模型(`CustomMem0Embedding`),避免 Mem0 重复加载 SentenceTransformer → `decisions/2026-06-custom-embedding.md`
- 连接池主动释放 + LRU 缓存实例,防连接池耗尽 → `decisions/2026-06-connection-pool.md`
## Gotchas开发必读
- **命名陷阱**:配置叫 `enable_memori`(无 yAPI 叫 `enable_memory`,内部实现是 Mem0三个名字别混。
- **连接池耗尽**Mem0 PGVector `__init__` 取连接、`__del__` 释放;必须在每次操作后主动 `_release_connection()`,否则高并发会打满 `MEM0_POOL_SIZE`
- **JSON 脆弱**LLM 提取事实返回的 JSON 常有尾逗号/单引号,已 monkey patch 成 `json_repair.loads`,不要改回原生解析。
- **表膨胀**:每个 `agent_id` 一张表,多 bot 长期运行会产生大量表,定期用 `drop_mem0_tables.py` 清理。
- **Embedding 维度**`paraphrase-multilingual-MiniLM-L12-v2`384 维;换模型需同步 pgvector 列维度,否则写入报错。
## 索引
- 设计决策:`decisions/`
- 变更历史:`changelog/`
- 相关文档:`docs/`

View File

@ -0,0 +1,6 @@
# Changelog 2026 Q2 — Memory
## 2026-06-01
- 初始化 feature memory 文档。
- 记录现状Mem0 + pgvector 长期记忆,`before_agent` 召回注入 / `after_agent` 后台提取存储。
- 归档设计决策:自定义 embedding 复用custom-embedding、连接池主动释放 + LRUconnection-pool

View File

@ -0,0 +1,25 @@
---
date: "2026-06-01"
status: adopted
topic: "connection-pool"
impact: [memory, performance, stability]
---
# 连接池主动释放 + Mem0 实例 LRU 缓存
## 背景
Mem0 的 PGVector 后端在实例 `__init__` 时从连接池取一个连接,理论上在 `__del__` 时归还。
但 Python GC 时机不确定,高并发下连接迟迟不归还会迅速打满 `MEM0_POOL_SIZE`(默认 50导致后续请求阻塞。
同时若为每个 `(user_id, agent_id)` 都新建 Mem0 实例且不回收,也会无限占用连接。
## 决策
1. `Mem0Manager``OrderedDict` 维护最多 50 个 Mem0 实例的 LRU 缓存,超出淘汰最旧的。
2. 每次记忆操作recall/add后调用 `_release_connection()` 立即把连接归还连接池,不等 GC。
## 影响
- 连接池不再被慢 GC 拖垮,高并发稳定。
- 实例数量有上界,内存可控。
## Gotchas
- 不要在操作链路里持有 Mem0 实例的连接跨多个 await会绕过释放逻辑。
- LRU 上限50`MEM0_POOL_SIZE`50相关联调整其一时需一并评估。

View File

@ -0,0 +1,22 @@
---
date: "2026-06-01"
status: adopted
topic: "custom-embedding"
impact: [memory, performance]
---
# 复用项目 embedding 模型而非 Mem0 自带 SentenceTransformer
## 背景
Mem0 默认会自行加载一个 SentenceTransformer 做 embedding。项目本身已经通过 `GlobalModelManager`
加载了 `paraphrase-multilingual-MiniLM-L12-v2`384 维)。若放任 Mem0 自加载,会出现同一模型在内存中加载两份,浪费显存/内存。
## 决策
`agent/mem0_manager.py` 实现 `CustomMem0Embedding`,把 Mem0 的 embedder 接到项目已加载的全局模型上,复用同一份权重。
## 影响
- 内存占用显著下降(不重复加载模型)。
- embedding 维度固定为 384与项目主模型一致换模型时 pgvector 列维度必须同步调整。
## 备注
相关连接池/实例缓存策略见 [[2026-06-connection-pool]]。

View File

View File

@ -0,0 +1,52 @@
---
feature: "thinking"
scope: "Agent 思考功能(基于 GuidelineMiddleware 的前置辅助推理),在主回答前生成一次 <think> 内容"
updated_at: "2026-06-01"
status: active
---
# Thinking思考功能
## 当前状态
思考功能通过自定义的 **`GuidelineMiddleware`** 实现:在主 agent 执行前,先用业务指引 prompt 调一次模型做"思考"
把结果包成 `<think>...</think>` 标签并打上 `message_tag: "THINK"` 元数据,供前端识别/折叠展示。
> 重要:这是"主请求前的一次辅助请求"**不是** Qwen 模型内置的 reasoning/extended-thinking 模式,因此与具体模型无关,任何 LLM 都能用。对标 OpenAI o1 / Claude thinking但实现更轻。
## 配置开关
| 层级 | 字段 | 默认 | 位置 |
|------|------|------|------|
| Agent 配置 | `enable_thinking: bool` | `False` | `agent/agent_config.py:26` |
| API 请求 | `enable_thinking: bool` | `False` | `utils/api_models.py:54` |
开启路径V1 走请求体 `enable_thinking`V2 走 bot 配置 `enable_thinking`
中间件注册在 `agent/deep_assistant.py:294``if config.enable_thinking: middleware.append(GuidelineMiddleware(...))`。
## 核心文件
- `agent/guideline_middleware.py` — 思考主逻辑。`get_guideline_prompt`(行 53+)组装指引 prompt`before_agent`/`abefore_agent` 调模型生成思考,包 `<think>` 标签并标 `THINK`(行 120-124 / 146-149
- `agent/deep_assistant.py:294-295` — 按 `enable_thinking` 注册中间件。
## 数据流
1. `before_agent` 加载指引system prompt 中的 Guidelines 块)。
2. 从 system prompt 提取 guidelines / tool_description / scenarios / terms_list。
3. 组装 `guideline_prompt` = 业务规则 + 聊天历史 + **记忆上下文** + 工具描述 + 场景 + 术语分析。
4. 调模型一次:`SystemMessage(guideline_prompt)` + 用户最后一条消息 → 得到思考内容。
5. 内容包成 `<think>...</think>``additional_kwargs["message_tag"] = "THINK"`。
6. 追加一条空 `HumanMessage`(兼容"最后必须是 user 消息"的模型)。
7. 主 agent 继续执行,产出正式回答。
## 与记忆功能的耦合
`guideline_middleware.py:63` 读取 `config._mem0_context`(由 [[../memory/MEMORY|memory]] 的 `before_agent` 写入)。
即:思考阶段会把已召回的长期记忆纳入指引 prompt从而基于记忆做更好的分析。
**顺序依赖**memory 中间件需在 thinking 之前执行,`_mem0_context` 才有值。
## Gotchas开发必读
- **思考是非流式的**:思考内容在 `before_agent` 一次性完整生成,只有正式回答才流式输出。前端靠 `<think>` 标签 + `message_tag:"THINK"` 折叠展示。
- **额外一次模型调用**:每次开启都多打一次 LLM 请求,增加延迟和成本,按场景权衡。
- **不是模型原生 reasoning**:别误以为依赖 `enable_thinking` 透传给 Qwen它是中间件层的自定义实现。
- **空 HumanMessage 收尾**:思考消息后会补一条空 user 消息,改消息列表处理逻辑时勿误删。
- **依赖记忆上下文顺序**:若调整中间件注册顺序,确认 memory 仍在 thinking 之前。
## 索引
- 设计决策:`decisions/`
- 变更历史:`changelog/`

View File

@ -0,0 +1,7 @@
# Changelog 2026 Q2 — Thinking
## 2026-06-01
- 初始化 feature memory 文档。
- 记录现状:`GuidelineMiddleware` 在 `before_agent` 生成 `<think>` 思考内容,标 `message_tag:"THINK"`
- 归档设计决策:用中间件实现而非模型原生 reasoningmiddleware-thinking
- 记录与 memory 功能的顺序耦合(依赖 `_mem0_context`)。

View File

@ -0,0 +1,28 @@
---
date: "2026-06-01"
status: adopted
topic: "middleware-thinking"
impact: [thinking, model-compat]
---
# 用中间件实现思考,而非依赖模型原生 reasoning
## 背景
"思考功能"可以有两种实现:
A. 透传 `enable_thinking` 给底层模型,依赖模型自带的 reasoning/extended-thinking 能力。
B. 在主请求前自己加一次"指引思考"的辅助 LLM 调用。
模型 A 路线要求底层模型支持原生 reasoning且不同模型行为/输出格式不一致,难以统一前端处理。
## 决策
采用 B实现 `GuidelineMiddleware`,在 `before_agent` 阶段用业务指引 prompt 调一次模型生成思考,
统一包成 `<think>...</think>` + `message_tag:"THINK"`
## 影响
- 与具体模型解耦,任何 LLMOpenAI/Claude/Qwen都能用。
- 思考阶段可注入业务规则、工具描述、术语分析、记忆上下文,可控性强。
- 代价:每次多一次 LLM 调用(延迟 + 成本);思考内容非流式。
## Gotchas
- 思考依赖 `config._mem0_context`,需保证 memory 中间件先于本中间件执行。
- 思考后补空 `HumanMessage` 以兼容"末条须为 user"的模型,勿删。

View File

@ -314,7 +314,12 @@ async def handle_request(request: Dict[str, Any]) -> Dict[str, Any]:
top_k = arguments.get("top_k", 100)
if not query:
return create_error_response(request_id, -32602, "Missing required parameter: query")
return create_success_response(request_id, {
"content": [{
"type": "text",
"text": "Error: missing required parameter 'query'. Please call this tool again with a non-empty 'query' argument describing what you want to retrieve."
}]
})
result = rag_retrieve(query, top_k, trace_id)
@ -328,7 +333,12 @@ async def handle_request(request: Dict[str, Any]) -> Dict[str, Any]:
query = arguments.get("query", "")
if not query:
return create_error_response(request_id, -32602, "Missing required parameter: query")
return create_success_response(request_id, {
"content": [{
"type": "text",
"text": "Error: missing required parameter 'query'. Please call this tool again with a non-empty 'query' argument describing what you want to retrieve."
}]
})
result = table_rag_retrieve(query, trace_id)

View File

@ -314,7 +314,12 @@ async def handle_request(request: Dict[str, Any]) -> Dict[str, Any]:
top_k = arguments.get("top_k", 100)
if not query:
return create_error_response(request_id, -32602, "Missing required parameter: query")
return create_success_response(request_id, {
"content": [{
"type": "text",
"text": "Error: missing required parameter 'query'. Please call this tool again with a non-empty 'query' argument describing what you want to retrieve."
}]
})
result = rag_retrieve(query, top_k, trace_id)
@ -328,7 +333,12 @@ async def handle_request(request: Dict[str, Any]) -> Dict[str, Any]:
query = arguments.get("query", "")
if not query:
return create_error_response(request_id, -32602, "Missing required parameter: query")
return create_success_response(request_id, {
"content": [{
"type": "text",
"text": "Error: missing required parameter 'query'. Please call this tool again with a non-empty 'query' argument describing what you want to retrieve."
}]
})
result = table_rag_retrieve(query, trace_id)

View File

@ -1,7 +1,7 @@
{
"name": "data-dashboard",
"description": "Renders data as an interactive dashboard card UI using the mcp-ui protocol.",
"category": "Data & Retrieval",
"category": "Interactive UI",
"hooks": {
"PrePrompt": [
{

View File

@ -19,5 +19,5 @@
]
}
},
"category": "Data & Retrieval"
"category": "Interactive UI"
}

View File

@ -0,0 +1,49 @@
---
name: mineru
description: An AI-Native skill for parsing PDF / Office / image files into Markdown with MinerU — a fast, zero-config document parser for AI agents. Works with NO token via the Agent API and auto-upgrades to the Standard API (token) for large files, batches, and DOCX/HTML/LaTeX export. Use when converting PDF/Word/PPT/Excel/image documents, extracting text/tables/formulas, running OCR, or batch processing.
category: Document Processing
metadata:
author: Nebutra
version: "3.3.1"
argument-hint: <pdf-file-or-url>
---
# MinerU PDF Parser
Parse PDF, Office, and image documents into structured Markdown via the MinerU API.
## Quick Start
```bash
# Zero-config: no token, no install (free Agent API)
python3 "${CLAUDE_PLUGIN_ROOT}/scripts/mineru.py" ./document.pdf --output ./output/
# Pipe Markdown back to an agent
python3 "${CLAUDE_PLUGIN_ROOT}/scripts/mineru.py" ./document.pdf --stdout
# Power mode: token unlocks large files / batch / extra formats
export MINERU_TOKEN="..." # https://mineru.net/apiManage/token
python3 "${CLAUDE_PLUGIN_ROOT}/scripts/mineru.py" ./pdfs/ --output ./output/ --workers 8 --resume
```
## Features
- **Auto-routing**: free Agent API by default, auto-upgrades to the Standard API (token) for large/batch/extra-format jobs
- **Multi-modal**: PDF, images, Word, PPT, Excel, HTML
- **High-performance OCR**: `--ocr` with language selection (`--lang`)
- **Formula & table recognition**: LaTeX formulas, structured tables
- **Multi-format export**: Markdown (default), plus DOCX / HTML / LaTeX
- **AI-Native output**: `--stdout` (Markdown) and `--json` (machine status)
- **Batch + resume**: parallel workers with `--resume`
- **Zero dependencies**: standard library only
## Authentication
A token is **optional** — the Agent API works without one. Set a token to unlock
the Standard API (≤ 200 MB / ≤ 200 pages, batch, DOCX/HTML/LaTeX):
```bash
export MINERU_TOKEN="your-token-here" # https://mineru.net/apiManage/token
```
Official API docs: https://mineru.net/apiManage/docs

View File

@ -0,0 +1,170 @@
# MinerU API Reference
Official docs: https://mineru.net/apiManage/docs · Token: https://mineru.net/apiManage/token
MinerU exposes **two** document-parsing APIs. This skill auto-routes between them.
| | 🎯 Standard API | ⚡ Agent API (lightweight) |
|---|---|---|
| Base URL | `https://mineru.net/api/v4` | `https://mineru.net/api/v1/agent` |
| Token | **required** (`Bearer`) | **none** (IP rate-limited) |
| Models | `pipeline` / `vlm` / `MinerU-HTML` | fixed lightweight `pipeline` |
| File size | ≤ 200 MB | ≤ 10 MB |
| Pages | ≤ 200 | ≤ 20 |
| Batch | ≤ 50 per request | single file only |
| Output | zip (Markdown + JSON, optional DOCX/HTML/LaTeX) | Markdown only (CDN link) |
| Designed for | high-accuracy / complex / batch | AI-agent / quick / no-login |
Free Standard-API quota: **1000 pages/day at highest priority** (overflow is lower priority).
---
## Authentication (Standard API)
```
Authorization: Bearer YOUR_API_TOKEN
```
Get a token at https://mineru.net/apiManage/token.
> **Response envelopes.** Business endpoints return `{"code":0,"data":{…},"msg":"ok"}`.
> The auth/gateway layer returns a *different* shape on failure:
> `{"success":false,"msgCode":"A0202","msg":"user authenticate failed"}`.
> Clients must handle both — this skill maps `msgCode` to the same error hints.
---
## Standard API endpoints (`/api/v4`)
### Single URL — `POST /extract/task`
```json
{
"url": "https://example.com/doc.pdf",
"model_version": "vlm",
"is_ocr": false,
"enable_formula": true,
"enable_table": true,
"language": "ch",
"page_ranges": "1-10",
"extra_formats": ["docx", "html"],
"data_id": "my-document"
}
```
Response → `{ "code": 0, "data": { "task_id": "…" } }`. HTML inputs require `model_version: "MinerU-HTML"`.
### Get task result — `GET /extract/task/{task_id}`
```json
{ "code": 0, "data": { "task_id": "…", "state": "done", "full_zip_url": "https://…", "err_msg": "" } }
```
### Batch local upload — `POST /file-urls/batch`
Returns signed upload URLs; PUT each file (no `Content-Type`). Up to **50** files / request.
```json
{ "files": [ { "name": "doc.pdf", "data_id": "doc" } ], "model_version": "vlm" }
```
Response → `{ "code": 0, "data": { "batch_id": "…", "file_urls": ["https://…"] } }`.
### Batch URL — `POST /extract/task/batch`
```json
{ "files": [ { "url": "https://…/doc.pdf", "data_id": "doc" } ], "model_version": "vlm" }
```
### Batch results — `GET /extract-results/batch/{batch_id}`
```json
{ "code": 0, "data": { "batch_id": "…", "extract_result": [
{ "file_name": "doc.pdf", "state": "done", "full_zip_url": "https://…" }
] } }
```
---
## Agent API endpoints (`/api/v1/agent`) — no token
### URL — `POST /parse/url`
```json
{ "url": "https://…/doc.pdf", "language": "ch", "enable_table": true, "is_ocr": false, "enable_formula": true, "page_range": "1-10" }
```
`page_range` accepts `from-to` or a single page only (no commas). Returns `{ "code": 0, "data": { "task_id": "…" } }`.
### File — `POST /parse/file`
```json
{ "file_name": "doc.pdf", "language": "ch" }
```
Response → `{ "data": { "task_id": "…", "file_url": "https://oss…" } }`; PUT the file to `file_url`.
### Result — `GET /parse/{task_id}`
```json
{ "code": 0, "data": { "task_id": "…", "state": "done", "markdown_url": "https://cdn…/full.md" } }
```
---
## Task states
`pending` (queued) · `running` (parsing) · `converting` (format conversion) ·
`uploading` (downloading source, Agent) · `waiting-file` (awaiting upload) ·
`done` (complete) · `failed` (error).
---
## Parameters
| Parameter | Type | Default | Notes |
|-----------|------|---------|-------|
| `model_version` | string | `pipeline` | `pipeline`, `vlm` (recommended), `MinerU-HTML` (HTML only) |
| `is_ocr` | bool | `false` | OCR for scanned docs (pipeline/vlm) |
| `enable_formula` | bool | `true` | Formula recognition |
| `enable_table` | bool | `true` | Table recognition |
| `language` | string | `ch` | OCR language (see official `language` table) |
| `page_ranges` | string | all | Standard: `"2,4-6"`; Agent `page_range`: `"1-10"` only |
| `extra_formats` | array | `[]` | `docx` / `html` / `latex` (Standard only) |
| `data_id` | string | | `[A-Za-z0-9_.-]`, ≤ 128 chars |
| `no_cache` | bool | `false` | Bypass URL cache (Standard) |
| `cache_tolerance` | int | `900` | Cache TTL seconds (Standard) |
---
## Limits
| | Standard | Agent |
|---|---|---|
| File size | 200 MB | 10 MB |
| Pages | 200 | 20 |
| Batch | 50 / request | 1 |
| Quota | 1000 pages/day priority | IP rate-limited (HTTP 429) |
Supported types: PDF, images (png/jpg/jpeg/jp2/webp/gif/bmp), Doc(x), Ppt(x), Xls(x); HTML is Standard-only.
---
## Error codes
| Code | Meaning |
|------|---------|
| `A0202` | Invalid token |
| `A0211` | Token expired |
| `-500` | Parameter error |
| `-10001` / `-10002` | Service error / invalid params |
| `-60002` | Unsupported file format |
| `-60003` / `-60004` | File read failed / empty file |
| `-60005` | File too large (> 200 MB) |
| `-60006` | Too many pages (> 200) |
| `-60008` | File read timeout (URL unreachable) |
| `-60010` | Parse failed |
| `-60015` / `-60016` | File / format conversion failed |
| `-60018` | Daily quota reached |
| `-60022` | Web page read failed (rate-limited) |
| **Agent API** | |
| `-30001` | Exceeds Agent 10 MB limit → use Standard API |
| `-30002` | Unsupported file type for Agent |
| `-30003` | Exceeds Agent 20-page limit → use Standard API or `--pages` |
| `-30004` | Invalid request parameters |

View File

@ -0,0 +1,193 @@
<!-- Web-researched competitive comparison (45 tools, 6 categories, adversarially fact-checked). Last researched 2026-05-31. Star counts / versions are point-in-time. -->
# MinerU Skill — Competitive Comparison Reference
This document gives an honest, sourced, per-tool breakdown of how **MinerU Skill** compares to the document-parsing landscape. Read the framing first: it determines how to interpret every "we win / they win" below.
## What MinerU Skill actually is (and is not)
MinerU Skill is a **zero-config, zero-dependency, agent-native convenience layer over [MinerU](https://github.com/opendatalab/MinerU)'s cloud API**, plus 17 turnkey delivery integrations to note/knowledge/content tools. Concretely (verified in this repo):
- Core script `scripts/mineru.py` is **~54KB / ~1,350 lines of pure Python standard library** — no `requests`/`aiohttp`, no model weights.
- A **genuinely token-free** default: the free **Agent API** path (`agent_parse` → `_agent_poll`) sends **no `Authorization` header** (the Bearer header is set only when a token is present). Files ≤10MB / ≤20 pages.
- **Auto-routing**: with a token, large/batched/extra-format jobs use the **Standard API** (≤200MB / ≤200 pages); the Agent path **auto-escalates** to Standard on size/page limits.
- **17 delivery sinks** (16 sink modules + `local.py` registering both `obsidian` and `logseq`): obsidian, logseq, siyuan, notion, confluence, onenote, coda, yuque, feishu, slack, dingtalk, wecom, ticktick, linear, airtable — all zero-dependency — plus **roam** (needs `roam-client`) and **wps** (needs `html-for-docx`) which lazy-load one library only when used.
- `--resume` dedup, parallel `--workers` (ThreadPoolExecutor), `--stdout`/`--json` agent output.
**Critical dependency:** our accuracy is **entirely downstream of, and capped by, what MinerU's cloud serves.** We own no models. Therefore:
- We have **no quality edge** over any other cloud wrapper that hits the same MinerU API — OCR/table/formula output is **identical**.
- Self-hosting the MinerU engine gives the **same or better** accuracy (version-controllable, no upload caps).
**Hard limits we cannot exceed:** 10MB/20-page free Agent tier, 200MB/200-page Standard tier, plus IP rate limits. Self-hosted tools have no such caps (only hardware).
**Our benchmark is latency-only.** `tests/test_live.py` measures end-to-end cloud round-trip latency (~1314s for the official demo PDF). It is **not** an accuracy benchmark; we have no OmniDocBench/olmOCR-Bench numbers of our own.
### A note on the speed claim
Our ~1314s/doc cloud round-trip is **not** a clean win over self-hosted GPU engines. A normal self-host with a GPU runs at ~0.18s/page (Marker) or ~2.12 pages/sec (MinerU on A100) — far faster at any real scale. We only out-run **slow Apple-Silicon-CPU local runs of small docs** (e.g., M4 VLM at 32148s/page). Do not frame "faster wall-clock" as a general win.
### A note on benchmarks
No single benchmark is authoritative. Different benchmarks favor different tools:
- **OmniDocBench** (v1.5/v1.6): MinerU2.5 **90.67** (v1.5), MinerU2.5-Pro **95.69** (v1.6) — leads, beating Gemini 2.5 Pro / GPT-4o / Qwen2.5-VL-72B on text/table/formula. Source: arXiv 2509.22186.
- **olmOCR-Bench** (Ai2, Oct 2025): olmOCR-2 **82.4** > Marker **76.1** > **MinerU 75.8**. Here MinerU **trails** — this is a real olmOCR win and must stay visible.
- **RD-TableBench**: Reducto 90.2% on complex tables — but Reducto authored this benchmark (vendor-biased).
- Mathpix is the de-facto formula-OCR standard (BLEU/edit-distance studies), though a PaddleOCR-VL-based tool claims to beat it on OmniDocBench v1.0 formula recognition, so the very top is contested.
> Star counts / versions below (e.g. MinerU "65.7k / v3.2.1") are point-in-time and not independently re-verified.
---
## Category 1 — Self-hosted / open-source parsing engines
These are the tools that close our single biggest gap: **fully offline / air-gapped / no cloud / no upload caps.**
### MinerU engine (opendatalab) — the engine we wrap
- **Source:** https://github.com/opendatalab/MinerU · arXiv 2509.22186 · https://huggingface.co/opendatalab/MinerU2.5-Pro-2604-1.2B
- **Strengths:** Owns the SOTA models (OmniDocBench 90.67 / 95.69-Pro v1.6). 109-language OCR, handwriting, cross-page table merge, formula→LaTeX (the source of *our* LaTeX). Fully self-hostable → offline, air-gappable, zero per-page cost, no caps. Pipeline backend runs pure CPU; VLM needs 8GB+ VRAM. Native MCP, Python/Go/TS SDKs, LangChain/LlamaIndex/Dify/FastGPT.
- **Weaknesses vs us:** Heavy install (multi-GB torch/vLLM + weights, 16GB RAM / 20GB disk floor); slow on Apple Silicon; no note/PKM delivery sinks; library/CLI rather than zero-config.
- **Verdict:** **Beats us** on offline, privacy, caps, accuracy ceiling, ecosystem. **We beat it** only on zero-install/zero-config and built-in delivery.
### Marker (datalab-to / VikParuchuri)
- **Source:** https://github.com/datalab-to/marker · https://allenai.org/blog/olmocr-2
- **Strengths:** Fully offline; very high batch throughput (~122 pages/sec/H100, 0.18s/page GPU); broad formats incl. EPUB; optional local-LLM (Ollama) quality boost with no data leaving the machine; ~35k+ stars, active.
- **Weaknesses:** **GPL-3.0** code + model weights under a modified RAIL-M (free only under ~$2M funding+revenue; commercial above that needs a Datalab license). olmOCR-Bench **76.1** — below olmOCR-2 and MinerU's OmniDocBench standing.
- **Verdict:** Beats us on offline/throughput; we beat it on zero-install and 17 delivery sinks. License gate is a real friction it has and we don't.
### Docling (IBM / DS4SD)
- **Source:** https://github.com/docling-project/docling · https://huggingface.co/ibm-granite/granite-docling-258M · arXiv 2408.09869
- **Strengths:** **Widest input modality set** (PDF/DOCX/PPTX/XLSX/HTML/AsciiDoc/LaTeX/CSV/images + **audio via ASR** + USPTO/JATS/XBRL). Tiny 258M Granite-Docling VLM runs on CPU/modest GPU. **MIT code + Apache-2.0 weights.** Deep framework ecosystem (LangChain/LlamaIndex/Haystack + official MCP), IBM-backed, 60k+ stars. Air-gapped by design.
- **Weaknesses:** Absolute accuracy lags MinerU on OmniDocBench/olmOCR-Bench; library-first (not a zero-config CLI); targets framework ingestion, not file delivery to note tools.
- **Verdict:** Beats us on offline, modality breadth, permissive license, ecosystem; we beat it on zero-install and note/PKM delivery. **Do not over-rank its MIT as uniquely best** — olmOCR's Apache-2.0 on *both* code and 7B weights is at least as commercially valuable.
### olmOCR (allenai)
- **Source:** https://github.com/allenai/olmocr · https://allenai.org/blog/olmocr-2 · https://huggingface.co/datasets/allenai/olmOCR-bench
- **Strengths:** **Leads Ai2's olmOCR-Bench (82.4 vs MinerU 75.8)** — a benchmark where MinerU trails. **Apache-2.0 on code AND the olmOCR-2-7B weights** (most commercial-friendly model reuse here). Built for million-page LLM-training linearization. Offline.
- **Weaknesses:** **PDF/image only** (no Office/HTML); **English-primary**, filters non-English (MinerU does 109-lang); **requires a 12GB+ NVIDIA GPU, no CPU mode at all**.
- **Verdict:** Beats us on offline, that-benchmark accuracy, license, scale. We beat it on modality breadth, multilingual, no-GPU, delivery, zero-install. **Keep the olmOCR-Bench lead visible — do not cherry-pick only OmniDocBench.**
### Nougat (facebookresearch / Meta AI)
- **Source:** https://github.com/facebookresearch/nougat · arXiv 2308.13418
- **Strengths:** Strong LaTeX/math on arXiv-style scientific PDFs (its trained niche). Offline.
- **Weaknesses:** **PDF + English/Latin-script only** (no CJK); **CC-BY-NC weights (non-commercial)**; effectively **unmaintained** (last release Aug 2023); known repetition/hallucination/[MISSING_PAGE] failures off-distribution.
- **Verdict:** Offline + niche math is its only edge; we beat it on general-purpose, multilingual, maintenance, commercial license, delivery.
### PyMuPDF4LLM (pymupdf / Artifex)
- **Source:** https://github.com/pymupdf/pymupdf4llm · https://pymupdf.io/blog/pymupdf-layout-10-faster-pdf-parsing-without-gpus
- **Strengths:** **Far faster and lighter than any ML tool on born-digital PDFs** (~hundreds of pages/sec on plain CPU; a C-optimized variant claims ~520 pages/sec). Lowest dependency/hardware footprint. Offline, no cloud, no caps. Ideal for huge clean-PDF corpora where speed > fidelity.
- **Weaknesses:** No ML → no real formula/LaTeX, weak complex tables, poor scanned/handwritten; slow external OCR; **AGPL-3.0 OR Artifex commercial**; Office formats need paid **PyMuPDF Pro**.
- **Verdict:** A genuine win for the speed-over-fidelity, clean-PDF use case. We beat it on hard-doc quality (MinerU's VLM), multilingual OCR, and delivery — but acknowledge its speed/footprint advantage honestly.
### Zerox (getomni-ai)
- **Source:** https://github.com/getomni-ai/zerox
- **Strengths:** Trivial provider-flexibility (OpenAI/Azure/Bedrock-Claude/Gemini/Vertex); JSON-Schema structured extraction (Node SDK); MIT code.
- **Weaknesses:** **NOT offline and NOT token-free** — mandates a paid cloud vision-LLM key; needs graphicsmagick+ghostscript; **no published benchmarks**; per-page LLM cost can exceed MinerU on large jobs.
- **Verdict:** We beat it on token-free start, benchmarked accuracy, dedicated formula/table models, system-dep footprint, and delivery. It beats us on provider-swap flexibility and typed JSON extraction.
---
## Category 2 — Commercial cloud document-parsing APIs
Mostly **stronger than us** on enterprise accuracy, SLAs, structured extraction, and RAG/MCP ecosystems. Our honest edges are narrow: token-free + zero-install hosted default, clean Markdown/LaTeX of academic PDFs, and 17 delivery sinks none of them offer.
### LlamaParse (LlamaIndex / LlamaCloud)
- **Source:** https://www.llamaindex.ai/pricing · LlamaCloud MCP docs
- **Beats us:** Official hosted **MCP server**; deep native RAG stack (parse→index→LlamaExtract/LlamaAgents); steerable NL parsing with frontier LLMs (GPT-4.1/Gemini 2.5 Pro); richer outputs (per-page JSON, XLSX, HTML tables, annotated PDF); enterprise SLAs; mature Python+TS SDKs.
- **We beat:** Token-free start (it needs a LlamaCloud key from page one); zero runtime deps; 17 note/PKM sinks (it delivers to RAG indexes, not note tools); built-in `--resume`/parallel batch CLI.
### Mathpix (Convert API)
- **Source:** https://mathpix.com/pricing/api · https://mathpix.com/image-to-latex
- **Beats us:** **Best-in-class formula/equation OCR (printed AND handwritten) → clean LaTeX — clearly better than MinerU for pure math fidelity; concede this, do not imply parity.** Mature Snip ecosystem + Overleaf workflows; very low per-image cost at scale.
- **We beat:** Token-free start (Mathpix API requires a paid PAYG account, **$19.99 setup fee**, card on file; **no recurring free monthly allowance** — only a one-time $29 test credit; the consumer Snip app's free quota does **not** apply to the API); general-purpose multi-modal Office parsing; 17 delivery sinks; built-in batch CLI.
### Unstructured.io
- **Source:** https://unstructured.io/pricing · https://github.com/Unstructured-IO/unstructured
- **Beats us:** **Apache-2.0 core library is fully self-hostable → 100% offline** (we cannot); official MCP + huge connector ecosystem (S3/SharePoint/vector DBs); built-in chunking+embedding (RAG-ready); 25+ file types; permissive license for product embedding.
- **We beat:** Token-free hosted default with zero install (its hosted API needs a key; self-host means running infra); cleaner human-readable Markdown out of the box (its primary output is JSON "elements"); 17 note/PKM sinks (it targets vector DBs/storage). *On parsing quality:* VLM parsing is generally stronger for complex layout/formula, but this is **not a benchmarked head-to-head** — state it as a tendency, not a measured win.
### Reducto
- **Source:** https://reducto.ai/pricing
- **Beats us:** **Best complex/financial table extraction (90.2% RD-TableBench — vendor-authored but the strongest public evidence)**; agentic multi-pass OCR; SOC2/HIPAA, on-prem/VPC/air-gapped, enterprise SLAs; schema-based extraction with bounding boxes/citations.
- **We beat:** Token-free start (it needs a key + credits); zero-install plain CLI; 17 delivery sinks; auto-routing/--resume/parallel batch.
### Chunkr (and similar RAG-native APIs)
- **Beats us:** Self-hostable (offline option we lack); RAG-native chunking + broad export (DOCX/HTML/LaTeX).
- **We beat:** Token-free start; zero-install; 17 note/PKM sinks.
- **Caveat (fact-check):** Do **not** claim "stronger VLM Markdown for formulas" — Chunkr cloud uses its own proprietary models and we have **no head-to-head benchmark**. Drop the quality claim; keep only the export-breadth and offline framing.
---
## Category 3 — Other MinerU wrappers, skills & MCP servers (our direct peers)
**Every cloud-backed wrapper here hits the same MinerU API we do, so its OCR/table/formula output is IDENTICAL to ours.** We have **no quality edge** over them — only DX differences. Claims of "better OCR/formula/Markdown" vs these are **invalid** and must not appear.
### Official MinerU MCP server (mineru-open-mcp / MinerU-Ecosystem)
- **Source:** https://github.com/opendatalab/MinerU-Ecosystem · https://pypi.org/project/mineru-open-mcp/
- **Beats us:** **Official, first-party** — tracks API/format changes day-one; native **MCP server** (stdio + streamable-http) in Claude Desktop/Cursor/Windsurf with zero glue; full ecosystem (Python/Go/TS SDKs, LangChain/LlamaIndex/Dify/FastGPT). **Same free no-token Flash tier as us** — our "free zero-token" edge is fully matched by the first party.
- **We beat:** Zero runtime deps (vs pip/uvx install); auto-routing Agent⇄Standard with auto-escalation; 17 delivery sinks; `--resume`/parallel batch; usable as a plain CLI outside any MCP host.
### MinerU-Document-Explorer (official, opendatalab)
- **Source:** https://github.com/opendatalab/MinerU-Document-Explorer
- **Beats us:** Different, **larger** value prop — a local agent-native **knowledge engine** (BM25/vector/hybrid retrieval + deep-reading + LLM-wiki) with 15 MCP tools; runs 100% locally for its core; MIT, 568 stars.
- **We beat:** We're a focused zero-dep converter; broader conversion modalities; 17 delivery sinks (it keeps content in its own index/wiki); no Node/local-model download.
### linxule/mineru-mcp (Node, cloud)
- **Source:** https://github.com/linxule/mineru-mcp
- **Beats us:** Native MCP server with 6 granular tools (explicit status-polling + batch-status pagination); first-class for Node/JS MCP stacks; batch up to 200 URLs/request.
- **We beat:** **Free no-token path** (it **requires** a token always); zero runtime deps (vs Node 18+); broader modalities (Excel/HTML); 17 delivery sinks; usable as plain CLI outside MCP.
### mineru-converter-mcp-server (AvatarGanymede/MinerU-MCP)
- **Source:** https://pypi.org/project/mineru-converter-mcp-server/
- **Beats us:** **Auto-splits PDFs >200MB and segments >600-page docs by page range — gracefully exceeding the 200MB/200-page cap we are bound by.** Turnkey Smithery + Render deploy (per-user key); explicit HTML input.
- **We beat:** Free no-token default (it requires a key); zero runtime deps; plain CLI (no MCP host/Render/Smithery needed); 17 sinks; auto-routing.
### grimoire-skill (LeoLin990405)
- **Source:** https://github.com/LeoLin990405/grimoire-skill
- **Beats us:** Higher-level knowledge-capture ("parse once, share twice" → Obsidian notes + reusable skill packs); ingests **video** (YouTube/Bilibili) + subtitles (modalities we don't touch); cross-agent skill management; content-aware Obsidian auto-filing.
- **We beat:** Free no-token default (it needs a token + `--cloud-ok` for local files); zero runtime deps (vs bash+jq+awk + optional yt-dlp/ffmpeg); 17 sinks vs primarily Obsidian; broader Office/HTML; cross-platform single-file portability.
### kesslerio/mineru-pdf-parser (openclaw/ClawHub skill, local CPU)
- **Source:** openclaw/skills · SKILL.md
- **Beats us:** **Fully local/offline (pure CPU, cross-platform)** — no cloud/token/caps; handles privacy-sensitive docs; native Markdown + JSON.
- **We beat:** Zero install (it needs a full local MinerU install + weights + shell wrapper); no GPU/heavy runtime; faster wall-clock **only vs slow local CPU**; broader modalities; 17 sinks; `--stdout`/`--json`; better docs.
### nilecui/mineru-parser-skills (Claude Agent SDK, cloud)
- **Source:** https://github.com/nilecui/mineru-parser-skills
- **Beats us:** Built directly on the Claude Agent SDK (slots into Agent-SDK apps). Honestly little else — it's a thinner cloud wrapper.
- **We beat:** Accepts local files/dirs **and** URLs (it is **URL-only** — cannot parse a local PDF); free no-token default; zero runtime deps; batch/`--resume`/parallel; 17 sinks; broader modalities; mature/documented vs a 4-commit, no-license repo. *Caveat:* our "benchmarked" claim means **latency-measured**, not accuracy-benchmarked.
### TINKPA/mcp-mineru (local MLX, Apple Silicon)
- **Source:** https://github.com/TINKPA/mcp-mineru
- **Beats us:** **Fully offline/local** via MinerU running on-device (MLX accel); no cloud/token/caps; data never leaves the Mac.
- **We beat:** Zero install/no weights/no GPU; **faster wall-clock only for typical multi-page docs vs its slow local inference (32148s/page on M4)** — not a general speed win; broader modalities; batch/`--resume`/17 sinks; more active/documented; usable as plain CLI.
---
## Summary of mandatory concessions (do not bury these)
1. **Offline / air-gapped is our single biggest gap.** MinerU engine, Marker, Docling, olmOCR, Nougat, PyMuPDF4LLM, TINKPA, kesslerio, MinerU-Document-Explorer, and self-hostable Unstructured/Chunkr all run with **zero cloud dependency**. We are cloud-only and **cannot handle confidential/regulated/air-gapped content at all.**
2. **Data privacy:** every self-hosted competitor keeps documents on the machine; we **upload every file** to MinerU's cloud — a hard disqualifier for many regulated users.
3. **Accuracy is downstream of, and capped by, MinerU's cloud.** Self-hosting MinerU2.5-Pro gives the same-or-better accuracy with no caps. Same-backend wrappers yield **identical** quality to us.
4. **Hard caps:** 10MB/20-page (Agent), 200MB/200-page (Standard), IP rate limits. mineru-converter exceeds them via auto-split/segmentation.
5. **Mathpix beats us on formula/LaTeX OCR (incl. handwriting).**
6. **Reducto leads complex/financial tables; olmOCR leads olmOCR-Bench (82.4 vs MinerU 75.8).** Different benchmarks favor different tools — never cherry-pick only OmniDocBench.
7. **Official first-party advantage:** the official MinerU MCP/Document-Explorer + ecosystem track changes day-one and match our free tier; we are third-party, can lag, and ship **no MCP server**.
8. **Permissive-license wins we lack:** olmOCR (Apache-2.0 code + 7B weights), Docling (MIT + Apache-2.0 weights), Unstructured (Apache-2.0 core).
9. **PyMuPDF4LLM is far faster/lighter on born-digital PDFs** (clean-text corpora, speed > fidelity).
## Sources
- MinerU engine: https://github.com/opendatalab/MinerU · arXiv 2509.22186 · https://huggingface.co/opendatalab/MinerU2.5-Pro-2604-1.2B · https://neurohive.io/en/state-of-the-art/mineru2-5-open-source-1-2b-model-for-pdf-parsing-outperforms-gemini-2-5-pro-on-benchmarks/
- Official MCP / ecosystem: https://github.com/opendatalab/MinerU-Ecosystem · https://pypi.org/project/mineru-open-mcp/ · https://github.com/opendatalab/MinerU-Document-Explorer
- Marker: https://github.com/datalab-to/marker · https://allenai.org/blog/olmocr-2
- Docling: https://github.com/docling-project/docling · arXiv 2408.09869 · https://huggingface.co/ibm-granite/granite-docling-258M
- olmOCR: https://github.com/allenai/olmocr · https://allenai.org/blog/olmocr-2 · https://huggingface.co/datasets/allenai/olmOCR-bench
- Nougat: https://github.com/facebookresearch/nougat · arXiv 2308.13418
- PyMuPDF4LLM: https://github.com/pymupdf/pymupdf4llm · https://pymupdf.io/blog/pymupdf-layout-10-faster-pdf-parsing-without-gpus
- Zerox: https://github.com/getomni-ai/zerox
- LlamaParse: https://www.llamaindex.ai/pricing
- Mathpix: https://mathpix.com/pricing/api · https://mathpix.com/image-to-latex
- Unstructured: https://unstructured.io/pricing · https://github.com/Unstructured-IO/unstructured
- Reducto: https://reducto.ai/pricing
- Other wrappers: https://github.com/linxule/mineru-mcp · https://pypi.org/project/mineru-converter-mcp-server/ · https://github.com/LeoLin990405/grimoire-skill · https://github.com/nilecui/mineru-parser-skills · https://github.com/TINKPA/mcp-mineru

View File

@ -0,0 +1,59 @@
# Delivery Integrations (`--to`)
After parsing, MinerU Skill can deliver the Markdown straight into your content
tools using each tool's **official ingestion path** — no fragile generic block
converters. Targets are pluggable sinks; select one or more with `--to NAME`
(repeatable). List them live with `python3 scripts/mineru.py --list-sinks`.
```bash
# Parse and fan out to several destinations at once
python3 scripts/mineru.py paper.pdf --to obsidian --to notion --to slack
```
Each sink reads its configuration from **environment variables** so an AI agent
can run it non-interactively. Delivery results appear in `--json` output under
each result's `sinks` array.
## Support matrix
| Target | `--to` | Native path | Auth / config (env) | Markdown fidelity | Images |
|--------|--------|-------------|---------------------|-------------------|--------|
| **Obsidian** | `obsidian` (`ob`) | filesystem write + YAML frontmatter | `OBSIDIAN_VAULT`, `OBSIDIAN_SUBDIR?` | full | ✅ copied to `<note>.assets/` |
| **Logseq** | `logseq` | filesystem write, outline + `key:: value` | `LOGSEQ_GRAPH` | full (outline transform) | ✅ copied to `assets/` |
| **SiYuan** | `siyuan` | kernel `createDocWithMd` | `SIYUAN_TOKEN`, `SIYUAN_API_URL?`, `SIYUAN_NOTEBOOK?` | full (GFM) | ✅ `asset/upload` |
| **Notion** | `notion` | `POST /v1/pages` (blocks) | `NOTION_API_KEY`, `NOTION_PARENT_PAGE_ID`, `NOTION_VERSION?` | structure (headings/lists/code/quote) | ⚠️ text only¹ |
| **Linear** | `linear` | GraphQL `issueCreate` | `LINEAR_API_KEY`, `LINEAR_TEAM_ID` | full (Markdown-native) | ✅ base64-inlined |
| **Yuque 语雀** | `yuque` (`语雀`) | open API create doc | `YUQUE_TOKEN`, `YUQUE_NAMESPACE` | full (Markdown-native) | ⚠️ host publicly² |
| **Coda** | `coda` | page canvas `format:markdown` | `CODA_API_TOKEN`, `CODA_DOC_ID?` | full (Markdown-native) | ⚠️ public URL² |
| **Slack** | `slack` | external-upload `.md` file | `SLACK_BOT_TOKEN`, `SLACK_CHANNEL` | full (raw file) | ⚠️ not embedded |
| **Lark 飞书** | `feishu` (`lark`, `飞书`) | Drive `import_tasks` → Docx | `FEISHU_APP_ID`, `FEISHU_APP_SECRET`, `FEISHU_FOLDER_TOKEN?` | full (server-converted) | ⚠️ public URL² |
| **Confluence** | `confluence` | `POST /wiki/api/v2/pages` (storage) | `CONFLUENCE_BASE_URL`, `CONFLUENCE_EMAIL`, `CONFLUENCE_API_TOKEN`, `CONFLUENCE_SPACE_ID` | MD→HTML | ⚠️ not attached |
| **OneNote** | `onenote` | Graph `sections/{id}/pages` | `ONENOTE_TOKEN`³, `ONENOTE_SECTION_ID` | MD→HTML | ⚠️ remote only |
| **TickTick 滴答** | `ticktick` (`dida`, `滴答清单`) | `POST /open/v1/task` | `TICKTICK_TOKEN`, `TICKTICK_PROJECT_ID?` | task note | ❌ unsupported |
| **DingTalk 钉钉** | `dingtalk` (`钉钉`) | robot markdown webhook | `DINGTALK_WEBHOOK`, `DINGTALK_SECRET?` | markdown message | ⚠️ public URL only |
| **Airtable** | `airtable` | `POST /v0/{base}/{table}` record | `AIRTABLE_API_KEY`, `AIRTABLE_BASE_ID`, `AIRTABLE_TABLE`, `AIRTABLE_TITLE_FIELD?`, `AIRTABLE_BODY_FIELD?` | record field⁴ | ❌ not uploaded |
| **WeCom 企业微信** | `wecom` (`企业微信`) | app `message/send` markdown | `WECOM_CORPID`, `WECOM_CORPSECRET`, `WECOM_AGENTID`, `WECOM_TOUSER?` | message (subset, ≤2 KB)⁵ | ❌ unsupported |
| **Roam Research** ⁶ | `roam` | `batch-actions` block tree | `ROAM_API_TOKEN`, `ROAM_GRAPH_NAME` | full (Markdown→outline) | ⚠️ public URL |
| **WPS 金山文档** ⁶ | `wps` (`kdocs`, `金山`) | Markdown→DOCX → kdocs upload | `WPS_APP_ID`, `WPS_APP_SECRET`, `WPS_PARENT_PATH?` | DOCX (via html-for-docx) | embedded in DOCX |
Notes:
1. **Notion** images need a separate `file_uploads` upload-then-reference dance; v1 delivers text + structure and notes the count of un-embedded local images. (Roadmap: image upload.)
2. Hosted services that ingest Markdown by value but have no first-class CLI asset upload — local images must be hosted at a public URL to render. The Markdown is delivered intact; image links that are already URLs work.
3. **OneNote** `ONENOTE_TOKEN` is a Microsoft Graph access token (delegated, scope `Notes.Create`). Obtain it via the device-code OAuth flow; the sink itself stays non-interactive.
4. **Airtable** is a database, not a document store — the doc is stored as one record (title + body fields). A good "save this doc as a row" target, not a document publisher.
5. **WeCom** markdown messages are a limited subset (≤2048 bytes, no images/tables, not rendered in the workbench). Best as a notification/summary; for a full document deliver via Lark/Notion and send the link.
6. **Optional-dependency sinks** — these two rely on a third-party library that the sink lazy-imports only when used, so the core and the other 15 sinks stay zero-dependency. If the library is absent, the sink returns a clear `pip install …` hint. They are implemented to the official specs but, being credential/desktop-gated, are best-effort until validated against live accounts.
## Optional-dependency sinks (`[roam]`, `[wps]`)
```bash
pip install "mineru-skill[wps]" # html-for-docx (Markdown → DOCX)
pip install "mineru-skill[roam]" # official roam-client SDK (git, needs Python ≥3.11)
# roam-client is git-only; equivalently:
pip install "roam-client @ git+https://github.com/Roam-Research/backend-sdks.git#subdirectory=python"
```
- **Roam** — no library ingests Markdown into Roam, but the official `roam-client` SDK handles the genuinely error-prone transport (307/308 peer-host redirect, dual `Authorization`/`x-authorization` Bearer headers, `/write`). We depend on it for transport and build only the Markdown→outline tree, delivering the whole document in one `batch-actions` request. Images must be public URLs.
- **WPS / 金山文档** — Markdown→DOCX uses the maintained pure-pip `html-for-docx` (reusing this project's Markdown→HTML); the kdocs upload signs requests with the documented WPS-2 scheme (plain SHA-1) using only the standard library. Requires an approved kdocs developer app + provisioned appspace.
Adding more targets is a single small module — see `scripts/sinks/base.py`. PRs welcome.

View File

@ -0,0 +1 @@
"""Importable package for MinerU Skill console entry points."""

View File

@ -0,0 +1,88 @@
"""Heading-aware Markdown chunking for RAG pipelines (zero-dependency).
``chunk_markdown`` splits a parsed Markdown document into retrieval-sized chunks
that preserve heading context matching the RAG-friendliness of LlamaParse /
Unstructured without any dependency.
"""
from __future__ import annotations
import re
_HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
def _slug(text: str) -> str:
text = (text or "doc").strip().lower()
text = re.sub(r"[^a-z0-9]+", "-", text).strip("-")
return text or "doc"
def _split_by_size(text: str, max_chars: int) -> list:
"""Split text into <= max_chars pieces on paragraph boundaries (hard-split if needed)."""
if len(text) <= max_chars:
return [text]
pieces: list = []
current = ""
for para in text.split("\n\n"):
if len(para) > max_chars:
if current:
pieces.append(current)
current = ""
for i in range(0, len(para), max_chars):
pieces.append(para[i:i + max_chars])
elif not current:
current = para
elif len(current) + len(para) + 2 <= max_chars:
current = f"{current}\n\n{para}"
else:
pieces.append(current)
current = para
if current:
pieces.append(current)
return pieces
def chunk_markdown(markdown: str, *, max_chars: int = 2000, source: str = "") -> list:
"""Chunk Markdown by heading, size-splitting long sections.
Returns ``[{id, index, heading, text, chars, source}, ...]`` where ``heading``
is the ``H1 > H2 > H3`` breadcrumb for the chunk.
"""
lines = markdown.replace("\r\n", "\n").split("\n")
chunks: list = []
stack: list = [] # (level, text) heading breadcrumb
buf: list = []
base = _slug(source)
def breadcrumb() -> str:
return " > ".join(t for _, t in stack)
def flush():
text = "\n".join(buf).strip()
buf.clear()
if not text:
return
head = breadcrumb()
for piece in _split_by_size(text, max_chars):
idx = len(chunks)
chunks.append({
"id": f"{base}-{idx}",
"index": idx,
"heading": head,
"text": piece,
"chars": len(piece),
"source": source,
})
for line in lines:
match = _HEADING.match(line.strip())
if match:
flush() # close the previous section under its own breadcrumb
level = len(match.group(1))
while stack and stack[-1][0] >= level:
stack.pop()
stack.append((level, match.group(2)))
buf.append(line)
flush()
return chunks

View File

@ -0,0 +1,59 @@
"""Optional fully-offline parsing backend for born-digital PDFs.
Our single biggest honest gap is being cloud-only. ``--engine local`` parses a
PDF **entirely offline** with the optional, lightweight ``pymupdf4llm`` library
(no GPU, no cloud, no upload caps) ideal for confidential or born-digital PDFs
where MinerU's cloud VLM is overkill. Scanned/complex docs still want the cloud
engine, so ``--engine auto`` only uses local when the PDF has real text.
pip install "mineru-skill[local]" # i.e. pip install pymupdf4llm
"""
from __future__ import annotations
from pathlib import Path
_HINT = (
"--engine local needs pymupdf4llm — pip install 'mineru-skill[local]' "
"(i.e. pip install pymupdf4llm)"
)
class LocalEngineError(Exception):
"""Raised when local parsing is requested but cannot be performed."""
def available() -> bool:
try:
import pymupdf4llm # noqa: F401
return True
except ImportError:
return False
def is_born_digital(path, min_chars: int = 200) -> bool:
"""True if the PDF has extractable text (so local parsing is appropriate)."""
try:
import pymupdf
except ImportError:
return False
doc = pymupdf.open(str(path))
total = 0
for page in doc:
total += len(page.get_text().strip())
if total >= min_chars:
return True
return total >= min_chars
def parse_local(path, output_dir=None) -> str:
"""Parse a PDF to Markdown fully offline. Returns the Markdown string."""
try:
import pymupdf4llm
except ImportError as exc:
raise LocalEngineError(_HINT) from exc
if output_dir is not None:
images = Path(output_dir) / "images"
images.mkdir(parents=True, exist_ok=True)
return pymupdf4llm.to_markdown(str(path), write_images=True, image_path=str(images))
return pymupdf4llm.to_markdown(str(path))

File diff suppressed because it is too large Load Diff

View File

@ -0,0 +1,178 @@
#!/usr/bin/env python3
"""Zero-dependency MCP server (stdio) for MinerU Skill.
Speaks newline-delimited JSON-RPC 2.0 over stdin/stdout using only the standard
library, so an MCP host (Claude, Cursor, Windsurf, ...) can call MinerU. Register:
{"command": "python3", "args": ["scripts/mineru_mcp.py"]}
Tools: ``mineru_parse``, ``mineru_parse_to``, ``mineru_list_sinks``.
"""
from __future__ import annotations
import json
import os
import sys
from pathlib import Path
sys.path.insert(0, str(Path(__file__).resolve().parent))
import mineru # noqa: E402
PROTOCOL_VERSION = "2024-11-05"
SERVER_INFO = {"name": "mineru", "version": mineru.__version__}
TOOLS = [
{
"name": "mineru_parse",
"description": "Parse a PDF / Office / image file or URL into clean Markdown via MinerU.",
"inputSchema": {
"type": "object",
"properties": {
"input": {"type": "string", "description": "Local file path or http(s) URL"},
"output_dir": {"type": "string", "description": "Where to write output (default ./output)"},
"api": {"type": "string", "enum": ["auto", "agent", "standard"]},
"engine": {"type": "string", "enum": ["cloud", "local", "auto"]},
"ocr": {"type": "boolean"},
"lang": {"type": "string"},
},
"required": ["input"],
},
},
{
"name": "mineru_parse_to",
"description": "Parse a document and deliver the Markdown into content tools (Obsidian, Notion, Feishu, ...).",
"inputSchema": {
"type": "object",
"properties": {
"input": {"type": "string"},
"sinks": {"type": "array", "items": {"type": "string"}, "description": "Sink names, e.g. ['obsidian','notion']"},
"output_dir": {"type": "string"},
},
"required": ["input", "sinks"],
},
},
{
"name": "mineru_list_sinks",
"description": "List available delivery targets and their required environment variables.",
"inputSchema": {"type": "object", "properties": {}},
},
]
class MethodNotFound(Exception):
pass
def _text_result(text: str, is_error: bool = False) -> dict:
return {"content": [{"type": "text", "text": text}], "isError": is_error}
def _tool_parse(args: dict) -> dict:
opts = mineru.ParseOptions(is_ocr=bool(args.get("ocr")), language=args.get("lang", "ch"))
token = os.environ.get("MINERU_TOKEN")
output_dir = Path(args.get("output_dir") or "./output")
res = mineru.process_one(
args["input"], opts, token=token, output_dir=output_dir,
api=args.get("api", "auto"), engine=args.get("engine", "cloud"),
)
if res.state == "done":
return _text_result(res.markdown or "")
return _text_result(f"Parse failed: {res.error}", is_error=True)
def _tool_parse_to(args: dict) -> dict:
opts = mineru.ParseOptions()
token = os.environ.get("MINERU_TOKEN")
output_dir = Path(args.get("output_dir") or "./output")
res = mineru.process_one(args["input"], opts, token=token, output_dir=output_dir)
if res.state != "done":
return _text_result(f"Parse failed: {res.error}", is_error=True)
sinks = mineru._load_sinks()
if sinks is None:
return _text_result("Sinks package unavailable.", is_error=True)
doc = sinks.ParsedDoc(title=res.name, markdown=res.markdown, source=res.source,
modality=res.modality, markdown_path=res.markdown_path)
outcomes = [o.to_status() for o in sinks.deliver_all(doc, args["sinks"])]
any_fail = any(not o["ok"] for o in outcomes)
return _text_result(json.dumps({"name": res.name, "deliveries": outcomes}, ensure_ascii=False, indent=2),
is_error=any_fail)
def _tool_list_sinks(_args: dict) -> dict:
sinks = mineru._load_sinks()
if sinks is None:
return _text_result("Sinks package unavailable.", is_error=True)
listing = [{"name": n, "label": sinks.get_sink(n).label, "requires": list(sinks.get_sink(n).requires)}
for n in sinks.sink_names()]
return _text_result(json.dumps(listing, ensure_ascii=False, indent=2))
_TOOL_HANDLERS = {
"mineru_parse": _tool_parse,
"mineru_parse_to": _tool_parse_to,
"mineru_list_sinks": _tool_list_sinks,
}
def _route(method: str, params: dict):
if method == "initialize":
return {"protocolVersion": PROTOCOL_VERSION, "capabilities": {"tools": {}}, "serverInfo": SERVER_INFO}
if method == "tools/list":
return {"tools": TOOLS}
if method == "tools/call":
name = params.get("name")
handler = _TOOL_HANDLERS.get(name)
if handler is None:
return _text_result(f"Unknown tool: {name}", is_error=True)
try:
return handler(params.get("arguments") or {})
except Exception as exc: # noqa: BLE001 - report as a tool error, never crash the server
return _text_result(f"{type(exc).__name__}: {exc}", is_error=True)
raise MethodNotFound(method)
def dispatch(request: dict):
"""Handle one JSON-RPC request dict; return a response dict, or None for notifications."""
is_notification = "id" not in request
req_id = request.get("id")
try:
result = _route(request.get("method"), request.get("params") or {})
except MethodNotFound as exc:
if is_notification:
return None
return {"jsonrpc": "2.0", "id": req_id, "error": {"code": -32601, "message": f"Method not found: {exc}"}}
except Exception as exc: # noqa: BLE001
if is_notification:
return None
return {"jsonrpc": "2.0", "id": req_id, "error": {"code": -32603, "message": str(exc)}}
if is_notification:
return None
return {"jsonrpc": "2.0", "id": req_id, "result": result}
def serve(stdin=None, stdout=None) -> None:
"""Read newline-delimited JSON-RPC from stdin, write responses to stdout."""
stdin = stdin or sys.stdin
stdout = stdout or sys.stdout
for line in stdin:
line = line.strip()
if not line:
continue
try:
request = json.loads(line)
except ValueError:
continue
response = dispatch(request)
if response is not None:
stdout.write(json.dumps(response, ensure_ascii=False) + "\n")
stdout.flush()
def main() -> int:
serve()
return 0
if __name__ == "__main__":
sys.exit(main())

View File

@ -0,0 +1,75 @@
"""Pluggable delivery sinks for parsed Markdown.
Each submodule registers one or more :class:`Sink` implementations that deliver a
:class:`ParsedDoc` into a content tool using that tool's official ingestion path.
Importing this package populates the registry; a sink module that fails to import
is recorded in :data:`IMPORT_ERRORS` rather than breaking the others.
"""
from __future__ import annotations
import importlib
import sys
from .base import ( # noqa: F401
ParsedDoc,
Sink,
SinkError,
SinkResult,
get_sink,
sink_names,
REGISTRY,
)
# Sink modules to load. Order is cosmetic.
_MODULES = [
"local", # obsidian, logseq (filesystem)
"siyuan",
"notion",
"linear",
"yuque",
"coda",
"ticktick",
"dingtalk",
"airtable",
"wecom",
"slack",
"feishu",
"confluence",
"onenote",
"roam", # optional dependency (roam-client)
"wps", # optional dependency (html-for-docx)
]
IMPORT_ERRORS: dict = {}
for _name in _MODULES:
try:
importlib.import_module(f"{__name__}.{_name}")
except Exception as exc: # noqa: BLE001 - a bad sink shouldn't break the rest
IMPORT_ERRORS[_name] = f"{type(exc).__name__}: {exc}"
print(f"[sinks] failed to load {_name}: {exc}", file=sys.stderr)
def deliver_all(doc: ParsedDoc, names) -> list:
"""Deliver ``doc`` to each named sink, returning a list of :class:`SinkResult`."""
results = []
for name in names:
sink = get_sink(name)
if sink is None:
results.append(SinkResult(sink=name, ok=False, error=f"unknown sink '{name}'"))
continue
missing = sink.missing_config()
if missing:
results.append(SinkResult(
sink=sink.name, ok=False,
error=f"missing config: {', '.join(missing)}",
))
continue
try:
results.append(sink.deliver(doc))
except SinkError as exc:
results.append(SinkResult(sink=sink.name, ok=False, error=str(exc)))
except Exception as exc: # noqa: BLE001 - surface but never crash the run
results.append(SinkResult(sink=sink.name, ok=False, error=f"{type(exc).__name__}: {exc}"))
return results

View File

@ -0,0 +1,72 @@
"""Zero-dependency HTTP helpers shared by all sinks (stdlib urllib only).
``http_request`` is the single seam tests monkeypatch.
"""
from __future__ import annotations
import json
import mimetypes
import urllib.error
import urllib.request
from typing import Optional
USER_AGENT = "MinerU-Skill-sink/1.0"
def http_request(method, url, *, headers=None, data=None, timeout=60):
"""Perform one HTTP request. Returns ``(status_code, body_bytes)``."""
req = urllib.request.Request(url, data=data, method=method, headers=headers or {})
req.add_header("User-Agent", USER_AGENT)
try:
with urllib.request.urlopen(req, timeout=timeout) as resp:
return resp.getcode(), resp.read()
except urllib.error.HTTPError as exc:
body = exc.read() if hasattr(exc, "read") else b""
return exc.code, body
def request_json(method, url, *, headers=None, payload=None, timeout=60):
"""JSON request helper. Returns ``(status_code, parsed_json_or_empty_dict)``."""
hdrs = dict(headers or {})
body = None
if payload is not None:
hdrs.setdefault("Content-Type", "application/json")
body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
status, raw = http_request(method, url, headers=hdrs, data=body, timeout=timeout)
parsed: dict = {}
if raw:
try:
parsed = json.loads(raw.decode("utf-8"))
except (ValueError, UnicodeDecodeError):
parsed = {}
return status, parsed
def encode_multipart(fields=None, files=None):
"""Build a ``multipart/form-data`` body with stdlib only.
``fields``: dict of str -> str. ``files``: list of (field_name, filename, bytes).
Returns ``(content_type, body_bytes)``.
"""
boundary = "----MinerUSinkBoundary7MA4YWxkTrZu0gW"
crlf = b"\r\n"
parts = []
for name, value in (fields or {}).items():
parts.append(b"--" + boundary.encode())
parts.append(f'Content-Disposition: form-data; name="{name}"'.encode())
parts.append(b"")
parts.append(str(value).encode("utf-8"))
for field_name, filename, content in files or []:
ctype = mimetypes.guess_type(filename)[0] or "application/octet-stream"
parts.append(b"--" + boundary.encode())
parts.append(
f'Content-Disposition: form-data; name="{field_name}"; filename="{filename}"'.encode()
)
parts.append(f"Content-Type: {ctype}".encode())
parts.append(b"")
parts.append(content)
parts.append(b"--" + boundary.encode() + b"--")
parts.append(b"")
body = crlf.join(parts)
return f"multipart/form-data; boundary={boundary}", body

View File

@ -0,0 +1,244 @@
"""Small, dependency-free Markdown utilities used by sinks.
These are intentionally pragmatic, not a full CommonMark implementation: they
cover the constructs MinerU emits (headings, emphasis, code, lists, tables,
blockquotes, links, images) well enough to deliver faithful content to tools
that require HTML (Confluence, OneNote) or an outline (Logseq).
"""
from __future__ import annotations
import html
import re
from pathlib import Path
from typing import Optional
_IMAGE_RE = re.compile(r"!\[(?P<alt>[^\]]*)\]\((?P<ref>[^)\s]+)(?:\s+\"[^\"]*\")?\)")
_ILLEGAL_FS = re.compile(r'[\\/:*?"<>|#^\[\]]+')
def slugify(text: str, default: str = "document") -> str:
"""Filesystem/URL-safe slug."""
text = text.strip().lower()
text = re.sub(r"[\s_]+", "-", text)
text = re.sub(r"[^a-z0-9\-]+", "", text)
text = re.sub(r"-{2,}", "-", text).strip("-")
return text or default
def safe_filename(title: str, default: str = "document") -> str:
"""Clean a title into a safe note filename (keeps unicode, drops illegal chars)."""
name = _ILLEGAL_FS.sub(" ", title).strip()
name = re.sub(r"\s{2,}", " ", name)
return name[:120] or default
def is_remote(ref: str) -> bool:
return ref.startswith("http://") or ref.startswith("https://") or ref.startswith("data:")
def find_local_images(markdown: str, base_dir) -> list:
"""Return ``[(alt, ref, Path)]`` for image refs that point at existing local files."""
base = Path(base_dir) if base_dir else None
found = []
seen = set()
for match in _IMAGE_RE.finditer(markdown):
ref = match.group("ref")
if is_remote(ref) or ref in seen:
continue
path = Path(ref)
if not path.is_absolute() and base is not None:
path = base / ref
if path.is_file():
found.append((match.group("alt"), ref, path))
seen.add(ref)
return found
def rewrite_images(markdown: str, mapping: dict) -> str:
"""Rewrite local image refs using ``{old_ref: new_ref}``."""
def repl(match):
ref = match.group("ref")
if ref in mapping:
return f"![{match.group('alt')}]({mapping[ref]})"
return match.group(0)
return _IMAGE_RE.sub(repl, markdown)
def yaml_frontmatter(props: dict) -> str:
"""Render a YAML frontmatter block. List values become ``- item`` lines."""
lines = ["---"]
for key, value in props.items():
if value is None or value == "" or value == []:
continue
if isinstance(value, (list, tuple)):
lines.append(f"{key}:")
for item in value:
lines.append(f" - {item}")
else:
lines.append(f"{key}: {value}")
lines.append("---")
return "\n".join(lines)
# --------------------------------------------------------------------------- #
# Inline + block Markdown -> HTML (pragmatic, XHTML-safe)
# --------------------------------------------------------------------------- #
def _inline(text: str) -> str:
"""Convert inline Markdown to HTML on already-escaped text."""
# images first, then links
text = _IMAGE_RE.sub(
lambda m: f'<img src="{html.escape(m.group("ref"), quote=True)}" alt="{m.group("alt")}" />',
text,
)
text = re.sub(r"\[([^\]]+)\]\(([^)\s]+)\)",
lambda m: f'<a href="{html.escape(m.group(2), quote=True)}">{m.group(1)}</a>', text)
text = re.sub(r"`([^`]+)`", r"<code>\1</code>", text)
text = re.sub(r"\*\*([^*]+)\*\*", r"<strong>\1</strong>", text)
text = re.sub(r"(?<!\*)\*(?!\*)([^*]+)\*(?!\*)", r"<em>\1</em>", text)
return text
def md_to_html(markdown: str) -> str:
"""Convert a Markdown document to a pragmatic, XHTML-safe HTML fragment."""
out = []
lines = markdown.replace("\r\n", "\n").split("\n")
i = 0
n = len(lines)
in_code = False
code_buf: list = []
list_stack: list = [] # 'ul' / 'ol'
def close_lists():
while list_stack:
out.append(f"</{list_stack.pop()}>")
while i < n:
line = lines[i]
fence = line.strip().startswith("```")
if fence and not in_code:
close_lists()
in_code = True
code_buf = []
i += 1
continue
if fence and in_code:
out.append("<pre><code>" + html.escape("\n".join(code_buf)) + "</code></pre>")
in_code = False
i += 1
continue
if in_code:
code_buf.append(line)
i += 1
continue
stripped = line.strip()
if not stripped:
close_lists()
i += 1
continue
# table block
if "|" in stripped and i + 1 < n and re.match(r"^\s*\|?[\s:|-]+\|?\s*$", lines[i + 1]):
close_lists()
header = [c.strip() for c in stripped.strip("|").split("|")]
rows = []
i += 2
while i < n and "|" in lines[i] and lines[i].strip():
rows.append([c.strip() for c in lines[i].strip().strip("|").split("|")])
i += 1
out.append("<table><thead><tr>"
+ "".join(f"<th>{_inline(html.escape(c))}</th>" for c in header)
+ "</tr></thead><tbody>")
for row in rows:
out.append("<tr>" + "".join(f"<td>{_inline(html.escape(c))}</td>" for c in row) + "</tr>")
out.append("</tbody></table>")
continue
heading = re.match(r"^(#{1,6})\s+(.*)$", stripped)
if heading:
close_lists()
level = len(heading.group(1))
out.append(f"<h{level}>{_inline(html.escape(heading.group(2)))}</h{level}>")
i += 1
continue
if stripped.startswith(">"):
close_lists()
out.append(f"<blockquote>{_inline(html.escape(stripped[1:].strip()))}</blockquote>")
i += 1
continue
if re.match(r"^([-*+])\s+", stripped):
if not list_stack or list_stack[-1] != "ul":
close_lists()
list_stack.append("ul")
out.append("<ul>")
item = re.sub(r"^([-*+])\s+", "", stripped)
out.append(f"<li>{_inline(html.escape(item))}</li>")
i += 1
continue
if re.match(r"^\d+\.\s+", stripped):
if not list_stack or list_stack[-1] != "ol":
close_lists()
list_stack.append("ol")
out.append("<ol>")
item = re.sub(r"^\d+\.\s+", "", stripped)
out.append(f"<li>{_inline(html.escape(item))}</li>")
i += 1
continue
if re.match(r"^([-*_])\1{2,}$", stripped):
close_lists()
out.append("<hr />")
i += 1
continue
close_lists()
out.append(f"<p>{_inline(html.escape(stripped))}</p>")
i += 1
if in_code:
out.append("<pre><code>" + html.escape("\n".join(code_buf)) + "</code></pre>")
close_lists()
return "\n".join(out)
# --------------------------------------------------------------------------- #
# Markdown -> Logseq outline
# --------------------------------------------------------------------------- #
def md_to_logseq(markdown: str, properties: Optional[dict] = None) -> str:
"""Convert flat Markdown into a Logseq outline.
Every line becomes a ``- `` block. Headings are top-level blocks; the content
that follows a heading nests one level beneath it. Page properties
(``key:: value``) go on the first block, as Logseq requires.
"""
out = []
if properties:
prop_lines = []
for key, value in properties.items():
if not value:
continue
if isinstance(value, (list, tuple)):
value = ", ".join(str(v) for v in value)
prop_lines.append(f"{key}:: {value}")
if prop_lines:
out.append("- " + prop_lines[0])
out.extend(f" {p}" for p in prop_lines[1:])
have_heading = False
for raw in markdown.replace("\r\n", "\n").split("\n"):
line = raw.strip()
if not line:
continue
if re.match(r"^#{1,6}\s+", line):
out.append(f"- {line}")
have_heading = True
elif have_heading:
out.append(f"\t- {line}")
else:
out.append(f"- {line}")
return "\n".join(out)

View File

@ -0,0 +1,50 @@
"""Airtable sink — store parsed Markdown as a record in a base/table.
Airtable is a database, not a document tool: the native ingestion path is a
record whose fields hold the title and the Markdown body. Field names are
configurable to match an existing table schema.
Docs: https://airtable.com/developers/web/api/create-records
(POST /v0/{baseId}/{tableIdOrName}).
"""
from __future__ import annotations
import urllib.parse
from . import _http
from .base import ParsedDoc, Sink, SinkError, SinkResult, register
API_BASE = "https://api.airtable.com/v0"
@register
class AirtableSink(Sink):
name = "airtable"
requires = ("AIRTABLE_API_KEY", "AIRTABLE_BASE_ID", "AIRTABLE_TABLE")
label = "Airtable record (database)"
def deliver(self, doc: ParsedDoc) -> SinkResult:
api_key = self.env("AIRTABLE_API_KEY")
base = self.env("AIRTABLE_BASE_ID")
table = self.env("AIRTABLE_TABLE")
title_field = self.env("AIRTABLE_TITLE_FIELD", "Title")
body_field = self.env("AIRTABLE_BODY_FIELD", "Notes")
url = f"{API_BASE}/{base}/{urllib.parse.quote(table)}"
headers = {"Authorization": f"Bearer {api_key}"}
payload = {"fields": {title_field: doc.title, body_field: doc.markdown}}
status, parsed = _http.request_json("POST", url, headers=headers, payload=payload)
if parsed.get("error") or status >= 400:
raise SinkError(str(parsed.get("error") or f"HTTP {status}"))
if not parsed.get("id"):
raise SinkError(f"Airtable returned no record id: {parsed}")
return SinkResult(
sink=self.name,
ok=True,
url=None,
detail="stored as a database record (Airtable is a DB, not a doc)",
)

View File

@ -0,0 +1,101 @@
"""Core types and the sink registry for delivering parsed Markdown to content tools.
A *sink* takes a :class:`ParsedDoc` (Markdown + local images + metadata) and
delivers it into one destination (Obsidian, Notion, Slack, Feishu, ...) using
that tool's OFFICIAL native ingestion path. Sinks read their configuration from
environment variables so an AI agent can run them without interactive prompts.
"""
from __future__ import annotations
import os
from dataclasses import dataclass, field
from typing import Optional
@dataclass
class ParsedDoc:
"""A parsed document ready for delivery."""
title: str
markdown: str
images: tuple = () # absolute paths to local image files
source: str = ""
modality: str = "unknown"
markdown_path: Optional[str] = None
@dataclass
class SinkResult:
"""Outcome of delivering a :class:`ParsedDoc` to one sink."""
sink: str
ok: bool
url: Optional[str] = None
detail: Optional[str] = None
error: Optional[str] = None
def to_status(self) -> dict:
return {
"sink": self.sink,
"ok": self.ok,
"url": self.url,
"detail": self.detail,
"error": self.error,
}
class SinkError(Exception):
"""Raised by a sink when delivery fails for a known reason."""
class Sink:
"""Base class for a delivery target.
Subclasses set ``name``/``aliases``/``requires`` and implement
:meth:`deliver`. ``requires`` lists the environment variables that must be
present for the sink to be usable.
"""
name: str = "base"
aliases: tuple = ()
requires: tuple = () # required env vars
label: str = "" # human description
local: bool = False # filesystem-only, no network/auth
def env(self, key: str, default: Optional[str] = None) -> Optional[str]:
value = os.environ.get(key, default)
return value.strip() if isinstance(value, str) else value
def missing_config(self) -> list:
return [k for k in self.requires if not self.env(k)]
def is_configured(self) -> bool:
return not self.missing_config()
def deliver(self, doc: ParsedDoc) -> SinkResult: # pragma: no cover - abstract
raise NotImplementedError
# --------------------------------------------------------------------------- #
# Registry
# --------------------------------------------------------------------------- #
REGISTRY: dict = {}
def register(cls):
"""Class decorator that instantiates a sink and registers it by name+aliases."""
inst = cls()
REGISTRY[inst.name] = inst
for alias in inst.aliases:
REGISTRY[alias] = inst
return cls
def get_sink(name: str) -> Optional[Sink]:
return REGISTRY.get(name.lower())
def sink_names() -> list:
"""Canonical sink names (no aliases), sorted."""
return sorted({s.name for s in REGISTRY.values()})

View File

@ -0,0 +1,72 @@
"""Coda sink: deliver Markdown as a page, into an existing doc or a new one.
Coda's API (``https://coda.io/apis/v1``) authenticates with a Bearer token.
Markdown is delivered as canvas page content. If ``CODA_DOC_ID`` is set, a new
page is added to that doc; otherwise a new doc is created with the content as its
initial page.
Coda canvas content embeds images by URL only, so local image refs are left
untouched host images at a public URL for them to render.
"""
from __future__ import annotations
from pathlib import Path
from . import _http, _md
from .base import ParsedDoc, Sink, SinkError, SinkResult, register
API = "https://coda.io/apis/v1"
def _canvas(markdown: str) -> dict:
return {"type": "canvas", "canvasContent": {"format": "markdown", "content": markdown}}
@register
class CodaSink(Sink):
name = "coda"
requires = ("CODA_API_TOKEN",)
label = "Coda page (REST API)"
def deliver(self, doc: ParsedDoc) -> SinkResult:
token = self.env("CODA_API_TOKEN")
doc_id = self.env("CODA_DOC_ID")
headers = {
"Authorization": f"Bearer {token}",
"Content-Type": "application/json",
}
base_dir = Path(doc.markdown_path).parent if doc.markdown_path else None
n_images = len(_md.find_local_images(doc.markdown, base_dir))
if doc_id:
status, parsed = _http.request_json(
"POST", f"{API}/docs/{doc_id}/pages", headers=headers, payload={
"name": doc.title,
"pageContent": _canvas(doc.markdown),
},
)
else:
status, parsed = _http.request_json(
"POST", f"{API}/docs", headers=headers, payload={
"title": doc.title,
"initialPage": {
"name": doc.title,
"pageContent": _canvas(doc.markdown),
},
},
)
if status >= 400:
raise SinkError(parsed.get("message") or f"HTTP {status}")
if n_images:
detail = f"text only ({n_images} local image(s); Coda embeds images by URL)"
else:
detail = "text only"
return SinkResult(
sink=self.name, ok=True,
url=parsed.get("browserLink"),
detail=detail,
)

View File

@ -0,0 +1,66 @@
"""Confluence sink: create a page from the parsed Markdown via the Cloud REST API.
Confluence Cloud ingests content as *storage-format* HTML. Delivery converts the
Markdown to HTML and creates a page with the v2 REST API
(``POST /wiki/api/v2/pages``) using Basic auth (email + API token).
Local images are not attached Confluence storage HTML references attachments by
filename, which would require a separate upload step.
"""
from __future__ import annotations
import base64
from . import _http, _md
from .base import ParsedDoc, Sink, SinkError, SinkResult, register
@register
class ConfluenceSink(Sink):
name = "confluence"
requires = (
"CONFLUENCE_BASE_URL",
"CONFLUENCE_EMAIL",
"CONFLUENCE_API_TOKEN",
"CONFLUENCE_SPACE_ID",
)
label = "Confluence Cloud page (storage HTML)"
def deliver(self, doc: ParsedDoc) -> SinkResult:
base = self.env("CONFLUENCE_BASE_URL").rstrip("/")
email = self.env("CONFLUENCE_EMAIL")
token = self.env("CONFLUENCE_API_TOKEN")
space = self.env("CONFLUENCE_SPACE_ID")
auth = base64.b64encode(f"{email}:{token}".encode("utf-8")).decode("ascii")
headers = {
"Authorization": f"Basic {auth}",
"Content-Type": "application/json",
}
html = _md.md_to_html(doc.markdown)
status, parsed = _http.request_json(
"POST",
f"{base}/wiki/api/v2/pages",
headers=headers,
payload={
"spaceId": space,
"status": "current",
"title": doc.title,
"body": {"representation": "storage", "value": html},
},
)
if status >= 400:
raise SinkError(
parsed.get("title")
or parsed.get("message")
or f"Confluence HTTP {status}"
)
webui = (parsed.get("_links") or {}).get("webui")
url = base + webui if webui else None
return SinkResult(
sink=self.name, ok=True, url=url,
detail="converted Markdown->storage HTML (local images not attached)",
)

View File

@ -0,0 +1,65 @@
"""DingTalk (钉钉) sink — push parsed Markdown as a robot markdown message.
A DingTalk custom robot accepts a ``markdown`` message type. The official native
ingestion path is therefore a webhook POST carrying the document title and body.
When a signing secret is configured the request is HMAC-SHA256 signed per
DingTalk's spec. DingTalk's markdown renderer only fetches images over public
URLs, so local images won't render.
Docs: https://open.dingtalk.com/document/robots/custom-robot-access.
"""
from __future__ import annotations
import base64
import hashlib
import hmac
import time
import urllib.parse
from . import _http
from .base import ParsedDoc, Sink, SinkError, SinkResult, register
@register
class DingTalkSink(Sink):
name = "dingtalk"
aliases = ("钉钉",)
requires = ("DINGTALK_WEBHOOK",)
label = "DingTalk robot markdown (钉钉)"
def _build_url(self) -> str:
webhook = self.env("DINGTALK_WEBHOOK")
if webhook.startswith("http"):
url = webhook
else:
url = f"https://oapi.dingtalk.com/robot/send?access_token={webhook}"
secret = self.env("DINGTALK_SECRET")
if secret:
timestamp = str(round(time.time() * 1000))
string_to_sign = f"{timestamp}\n{secret}"
hmac_code = hmac.new(
secret.encode(), string_to_sign.encode(), hashlib.sha256
).digest()
sign = urllib.parse.quote_plus(base64.b64encode(hmac_code))
url += f"&timestamp={timestamp}&sign={sign}"
return url
def deliver(self, doc: ParsedDoc) -> SinkResult:
url = self._build_url()
payload = {
"msgtype": "markdown",
"markdown": {"title": doc.title, "text": doc.markdown},
}
status, parsed = _http.request_json("POST", url, payload=payload)
if parsed.get("errcode") not in (0, None):
raise SinkError(parsed.get("errmsg") or f"DingTalk HTTP {status}: {parsed}")
return SinkResult(
sink=self.name,
ok=True,
url=None,
detail="robot markdown message (local images won't render; host publicly)",
)

View File

@ -0,0 +1,124 @@
"""Feishu / Lark sink: import the parsed Markdown as a Docx document.
Feishu (飞书) / Lark ingests Markdown through its Drive import pipeline. Delivery
follows that official path:
1. ``tenant_access_token/internal`` exchange the app id/secret for a tenant
access token.
2. ``drive/v1/medias/upload_all`` upload the ``.md`` bytes as an import medium
and obtain a ``file_token``.
3. ``drive/v1/import_tasks`` kick off an import task converting the medium to a
Docx, returning a ``ticket``.
4. Poll ``drive/v1/import_tasks/{ticket}`` until the job finishes, surfacing the
resulting document URL.
Local images are not uploaded they would need public URLs to render in Docx.
"""
from __future__ import annotations
import json
import time
from . import _http, _md
from .base import ParsedDoc, Sink, SinkError, SinkResult, register
@register
class FeishuSink(Sink):
name = "feishu"
aliases = ("lark", "飞书")
requires = ("FEISHU_APP_ID", "FEISHU_APP_SECRET")
label = "Feishu / Lark Docx (Drive import)"
def deliver(self, doc: ParsedDoc) -> SinkResult:
app_id = self.env("FEISHU_APP_ID")
app_secret = self.env("FEISHU_APP_SECRET")
folder_token = self.env("FEISHU_FOLDER_TOKEN")
# Step 1: tenant access token.
status, parsed = _http.request_json(
"POST",
"https://open.feishu.cn/open-apis/auth/v3/tenant_access_token/internal",
payload={"app_id": app_id, "app_secret": app_secret},
)
token = parsed.get("tenant_access_token")
if parsed.get("code") not in (0, None) or not token:
raise SinkError(parsed.get("msg") or f"Feishu auth failed (HTTP {status})")
headers = {"Authorization": f"Bearer {token}"}
# Step 2: upload the Markdown bytes as an import medium.
content = doc.markdown.encode("utf-8")
fname = _md.safe_filename(doc.title) + ".md"
ctype, body = _http.encode_multipart(
fields={
"file_name": fname,
"parent_type": "ccm_import_open",
"size": str(len(content)),
"extra": json.dumps({"obj_type": "docx", "file_extension": "md"}),
},
files=[("file", fname, content)],
)
up_status, raw = _http.http_request(
"POST",
"https://open.feishu.cn/open-apis/drive/v1/medias/upload_all",
headers={**headers, "Content-Type": ctype},
data=body,
)
parsed = _parse_json(raw)
if parsed.get("code") not in (0, None):
raise SinkError(parsed.get("msg") or f"Feishu media upload failed (HTTP {up_status})")
file_token = (parsed.get("data") or {}).get("file_token")
if not file_token:
raise SinkError("Feishu did not return a file_token")
# Step 3: create the import task.
status, parsed = _http.request_json(
"POST",
"https://open.feishu.cn/open-apis/drive/v1/import_tasks",
headers=headers,
payload={
"file_extension": "md",
"file_token": file_token,
"type": "docx",
"file_name": doc.title,
"point": {"mount_type": 1, "mount_key": folder_token or ""},
},
)
if parsed.get("code") not in (0, None):
raise SinkError(parsed.get("msg") or f"Feishu import task failed (HTTP {status})")
ticket = (parsed.get("data") or {}).get("ticket")
if not ticket:
raise SinkError("Feishu did not return an import ticket")
# Step 4: poll until the import job completes.
url = None
for _attempt in range(20):
status, parsed = _http.request_json(
"GET",
f"https://open.feishu.cn/open-apis/drive/v1/import_tasks/{ticket}",
headers=headers,
)
res = (parsed.get("data") or {}).get("result") or {}
job_status = res.get("job_status")
if job_status == 0:
url = res.get("url")
break
if job_status in (1, 2):
time.sleep(1)
continue
raise SinkError(res.get("job_error_msg") or "Feishu import failed")
return SinkResult(
sink=self.name, ok=True, url=url,
detail="imported to Feishu Docx (local images need public URLs)",
)
def _parse_json(raw):
if not raw:
return {}
try:
return json.loads(raw.decode("utf-8"))
except (ValueError, UnicodeDecodeError):
return {}

View File

@ -0,0 +1,75 @@
"""Linear sink: create an issue from Markdown via the GraphQL API.
Linear's API is GraphQL at ``https://api.linear.app/graphql`` and authenticates
with a raw API key in the ``Authorization`` header (no ``Bearer`` prefix). The
issue description is Markdown; Linear renders inline ``data:`` image URIs, so
local images are read and embedded as base64 data URIs before delivery.
"""
from __future__ import annotations
import base64
from pathlib import Path
from . import _http, _md
from .base import ParsedDoc, Sink, SinkError, SinkResult, register
API = "https://api.linear.app/graphql"
_MUTATION = (
"mutation IssueCreate($input: IssueCreateInput!)"
"{issueCreate(input:$input){success issue{id url identifier}}}"
)
_MIME = {
".png": "image/png",
".jpg": "image/jpeg",
".jpeg": "image/jpeg",
".gif": "image/gif",
".webp": "image/webp",
}
def _data_uri(path: Path) -> str:
mime = _MIME.get(path.suffix.lower(), "image/png")
b64 = base64.b64encode(path.read_bytes()).decode("ascii")
return f"data:{mime};base64,{b64}"
@register
class LinearSink(Sink):
name = "linear"
requires = ("LINEAR_API_KEY", "LINEAR_TEAM_ID")
label = "Linear issue (GraphQL API)"
def deliver(self, doc: ParsedDoc) -> SinkResult:
key = self.env("LINEAR_API_KEY")
team = self.env("LINEAR_TEAM_ID")
headers = {"Authorization": key, "Content-Type": "application/json"}
base_dir = Path(doc.markdown_path).parent if doc.markdown_path else None
images = _md.find_local_images(doc.markdown, base_dir)
mapping = {ref: _data_uri(path) for _alt, ref, path in images}
body = _md.rewrite_images(doc.markdown, mapping)
status, parsed = _http.request_json("POST", API, headers=headers, payload={
"query": _MUTATION,
"variables": {"input": {
"teamId": team,
"title": doc.title,
"description": body,
}},
})
if parsed.get("errors"):
raise SinkError(str(parsed["errors"]))
result = ((parsed.get("data") or {}).get("issueCreate")) or {}
if not result.get("success"):
raise SinkError(f"Linear did not create the issue (HTTP {status})")
issue = result.get("issue") or {}
return SinkResult(
sink=self.name, ok=True,
url=issue.get("url"),
detail=f"{len(mapping)} image(s) inlined",
)

View File

@ -0,0 +1,105 @@
"""Local-first sinks: Obsidian and Logseq (filesystem writes, no auth).
Both tools are folders of Markdown files. The native ingestion is a filesystem
write following each tool's conventions:
* Obsidian a flat note with YAML frontmatter; images in a per-note assets
folder, referenced with relative Markdown embeds.
* Logseq an outline (every line a ``- `` block) with ``key:: value`` page
properties on the first block; images in ``assets/`` referenced as
``![](../assets/x.png)``.
"""
from __future__ import annotations
from pathlib import Path
from . import _md
from .base import ParsedDoc, Sink, SinkError, SinkResult, register
def _copy_images(doc: ParsedDoc, dest_dir: Path, ref_prefix: str) -> dict:
"""Copy referenced local images into ``dest_dir``; return ``{old_ref: new_ref}``."""
base = Path(doc.markdown_path).parent if doc.markdown_path else None
mapping = {}
images = _md.find_local_images(doc.markdown, base)
if images:
dest_dir.mkdir(parents=True, exist_ok=True)
for _alt, ref, path in images:
target = dest_dir / path.name
target.write_bytes(path.read_bytes())
mapping[ref] = f"{ref_prefix}{path.name}"
return mapping
@register
class ObsidianSink(Sink):
name = "obsidian"
aliases = ("ob",)
requires = ("OBSIDIAN_VAULT",)
label = "Obsidian vault (local Markdown)"
local = True
def deliver(self, doc: ParsedDoc) -> SinkResult:
vault = Path(self.env("OBSIDIAN_VAULT")).expanduser()
if not vault.is_dir():
raise SinkError(f"Obsidian vault not found: {vault}")
subdir = self.env("OBSIDIAN_SUBDIR", "") or ""
note_dir = vault / subdir if subdir else vault
note_dir.mkdir(parents=True, exist_ok=True)
stem = _md.safe_filename(doc.title)
assets = note_dir / f"{stem}.assets"
mapping = _copy_images(doc, assets, f"{stem}.assets/")
body = _md.rewrite_images(doc.markdown, mapping)
front = _md.yaml_frontmatter({
"title": doc.title,
"source": doc.source,
"modality": doc.modality,
"tags": ["mineru", "parsed"],
})
note_path = note_dir / f"{stem}.md"
note_path.write_text(f"{front}\n\n{body}\n", encoding="utf-8")
return SinkResult(sink=self.name, ok=True, url=str(note_path),
detail=f"{len(mapping)} image(s)")
@register
class LogseqSink(Sink):
name = "logseq"
requires = ("LOGSEQ_GRAPH",)
label = "Logseq graph (local outline)"
local = True
def deliver(self, doc: ParsedDoc) -> SinkResult:
graph = Path(self.env("LOGSEQ_GRAPH")).expanduser()
if not graph.is_dir():
raise SinkError(f"Logseq graph not found: {graph}")
pages = graph / "pages"
assets = graph / "assets"
pages.mkdir(parents=True, exist_ok=True)
stem = _md.safe_filename(doc.title)
# Namespace asset names by page slug to avoid collisions in the shared assets/.
prefix = _md.slugify(doc.title)
mapping = {}
base = Path(doc.markdown_path).parent if doc.markdown_path else None
images = _md.find_local_images(doc.markdown, base)
if images:
assets.mkdir(parents=True, exist_ok=True)
for _alt, ref, path in images:
new_name = f"{prefix}-{path.name}"
(assets / new_name).write_bytes(path.read_bytes())
mapping[ref] = f"../assets/{new_name}"
body = _md.rewrite_images(doc.markdown, mapping)
outline = _md.md_to_logseq(body, properties={
"title": doc.title,
"source": doc.source,
"tags": "mineru, parsed",
})
page_path = pages / f"{stem}.md"
page_path.write_text(outline + "\n", encoding="utf-8")
return SinkResult(sink=self.name, ok=True, url=str(page_path),
detail=f"{len(mapping)} image(s)")

View File

@ -0,0 +1,130 @@
"""Notion sink: create a page under a parent page from Markdown blocks.
Notion's native ingestion is the block API: each Markdown line becomes a typed
block (heading, quote, code, list item, paragraph). A page is created with up to
100 children inline; any remainder is appended in 100-block chunks via the
``/blocks/{id}/children`` PATCH endpoint.
Notion has no inline image-from-bytes path (images must be uploaded or hosted
separately), so local image refs are intentionally left untouched.
"""
from __future__ import annotations
from pathlib import Path
from . import _http, _md
from .base import ParsedDoc, Sink, SinkError, SinkResult, register
API = "https://api.notion.com/v1"
MAX_BLOCKS = 100
MAX_TEXT = 2000
def _rich(text: str) -> list:
return [{"type": "text", "text": {"content": text[:MAX_TEXT]}}]
def _block(block_type: str, text: str, **extra) -> dict:
inner = {"rich_text": _rich(text)}
inner.update(extra)
return {"object": "block", "type": block_type, block_type: inner}
def _is_numbered(text: str) -> bool:
head = text.split(".", 1)
return len(head) == 2 and head[0].isdigit() and head[1].startswith(" ")
def _blocks(markdown: str) -> list:
"""Convert flat Markdown lines into a list of Notion block dicts."""
blocks = []
in_code = False
code_buf: list = []
for raw in markdown.replace("\r\n", "\n").split("\n"):
stripped = raw.strip()
if stripped.startswith("```"):
if in_code:
blocks.append(_block("code", "\n".join(code_buf), language="plain text"))
in_code = False
code_buf = []
else:
in_code = True
code_buf = []
continue
if in_code:
code_buf.append(raw)
continue
if not stripped:
continue
if stripped.startswith("# "):
blocks.append(_block("heading_1", stripped[2:].strip()))
elif stripped.startswith("## "):
blocks.append(_block("heading_2", stripped[3:].strip()))
elif stripped.startswith("### "):
blocks.append(_block("heading_3", stripped[4:].strip()))
elif stripped.startswith("> "):
blocks.append(_block("quote", stripped[2:].strip()))
elif stripped.startswith("- ") or stripped.startswith("* "):
blocks.append(_block("bulleted_list_item", stripped[2:].strip()))
elif _is_numbered(stripped):
blocks.append(_block("numbered_list_item", stripped.split(".", 1)[1].strip()))
else:
blocks.append(_block("paragraph", stripped))
if in_code:
blocks.append(_block("code", "\n".join(code_buf), language="plain text"))
return blocks
@register
class NotionSink(Sink):
name = "notion"
requires = ("NOTION_API_KEY", "NOTION_PARENT_PAGE_ID")
label = "Notion page (blocks API)"
def deliver(self, doc: ParsedDoc) -> SinkResult:
key = self.env("NOTION_API_KEY")
parent = self.env("NOTION_PARENT_PAGE_ID")
version = self.env("NOTION_VERSION", "2022-06-28") or "2022-06-28"
headers = {
"Authorization": f"Bearer {key}",
"Notion-Version": version,
"Content-Type": "application/json",
}
# Count local images for the detail note (refs are left as-is).
base_dir = Path(doc.markdown_path).parent if doc.markdown_path else None
n_images = len(_md.find_local_images(doc.markdown, base_dir))
blocks = _blocks(doc.markdown)
status, parsed = _http.request_json("POST", f"{API}/pages", headers=headers, payload={
"parent": {"page_id": parent},
"properties": {"title": {"title": [{"text": {"content": doc.title}}]}},
"children": blocks[:MAX_BLOCKS],
})
if parsed.get("object") == "error":
raise SinkError(parsed.get("message") or f"Notion API error (HTTP {status})")
created_id = parsed.get("id")
if not created_id:
raise SinkError(f"Notion did not return a page id (HTTP {status})")
page_url = parsed.get("url")
for start in range(MAX_BLOCKS, len(blocks), MAX_BLOCKS):
chunk = blocks[start:start + MAX_BLOCKS]
ch_status, ch_parsed = _http.request_json(
"PATCH", f"{API}/blocks/{created_id}/children",
headers=headers, payload={"children": chunk},
)
if ch_parsed.get("object") == "error":
raise SinkError(ch_parsed.get("message")
or f"Notion block append failed (HTTP {ch_status})")
if n_images:
detail = (f"text+structure ({n_images} local images not embedded; "
f"Notion needs file upload)")
else:
detail = "text+structure"
return SinkResult(sink=self.name, ok=True, url=page_url, detail=detail)

View File

@ -0,0 +1,66 @@
"""OneNote sink: create a page from the parsed Markdown via Microsoft Graph.
OneNote pages are created by POSTing an HTML document to a section's ``pages``
endpoint with a pre-obtained Microsoft Graph access token (OAuth). Delivery
converts the Markdown to a full HTML document and creates the page.
Only remote images render Graph fetches ``<img src>`` URLs, so local image
paths emitted by MinerU would need to be public URLs.
"""
from __future__ import annotations
import html
import json
from . import _http, _md
from .base import ParsedDoc, Sink, SinkError, SinkResult, register
@register
class OneNoteSink(Sink):
name = "onenote"
aliases = ("msonenote",)
requires = ("ONENOTE_TOKEN", "ONENOTE_SECTION_ID")
label = "OneNote section page (Microsoft Graph)"
def deliver(self, doc: ParsedDoc) -> SinkResult:
token = self.env("ONENOTE_TOKEN")
section = self.env("ONENOTE_SECTION_ID")
body_html = _md.md_to_html(doc.markdown)
page = (
"<!DOCTYPE html><html><head>"
f"<title>{html.escape(doc.title)}</title>"
f"</head><body>{body_html}</body></html>"
)
status, raw = _http.http_request(
"POST",
f"https://graph.microsoft.com/v1.0/me/onenote/sections/{section}/pages",
headers={
"Authorization": f"Bearer {token}",
"Content-Type": "text/html",
},
data=page.encode("utf-8"),
)
if status >= 400:
preview = raw.decode("utf-8", "replace") if raw else ""
raise SinkError(f"OneNote HTTP {status}: {preview[:200]}")
if status != 201:
raise SinkError(f"OneNote unexpected response (HTTP {status})")
parsed = {}
if raw:
try:
parsed = json.loads(raw.decode("utf-8"))
except (ValueError, UnicodeDecodeError):
parsed = {}
links = parsed.get("links") or {}
web = links.get("oneNoteWebUrl") or {}
url = web.get("href")
return SinkResult(
sink=self.name, ok=True, url=url,
detail="converted Markdown->HTML (remote images only; OAuth token required)",
)

View File

@ -0,0 +1,106 @@
"""Roam Research sink — optional dependency.
There is no library that ingests a Markdown document into Roam, but the official
``roam-client`` SDK correctly handles the parts that are easy to get wrong the
307/308 peer-host redirect, the dual ``Authorization`` / ``x-authorization``
Bearer headers, and the ``/write`` plumbing. So we lazily depend on it for
transport and only build the Markdown block-tree ourselves, delivering the whole
document in a single ``batch-actions`` request (one HTTP round-trip).
Install the SDK (git-only, not on PyPI; needs Python 3.11):
pip install "roam-client @ git+https://github.com/Roam-Research/backend-sdks.git#subdirectory=python"
Config: ``ROAM_API_TOKEN`` (graph edit token), ``ROAM_GRAPH_NAME``.
"""
from __future__ import annotations
import itertools
import re
from .base import ParsedDoc, Sink, SinkError, SinkResult, register
_HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
_INSTALL_HINT = (
'Roam sink needs the official SDK — pip install '
'"roam-client @ git+https://github.com/Roam-Research/backend-sdks.git#subdirectory=python"'
)
def md_to_roam_tree(markdown: str) -> list:
"""Convert Markdown into a nested Roam block tree.
Headings become parent blocks (``heading`` 13); the lines under a heading
nest beneath it. Returns ``[{"string", "heading"?, "children": [...]}, ...]``.
"""
roots: list = []
stack: list = [] # [(heading_level, node)]
for raw in markdown.replace("\r\n", "\n").split("\n"):
line = raw.strip()
if not line:
continue
match = _HEADING.match(line)
if match:
level = len(match.group(1))
node = {"string": match.group(2), "heading": min(level, 3), "children": []}
while stack and stack[-1][0] >= level:
stack.pop()
(stack[-1][1]["children"] if stack else roots).append(node)
stack.append((level, node))
else:
node = {"string": line, "children": []}
(stack[-1][1]["children"] if stack else roots).append(node)
return roots
def tree_to_actions(children: list, parent_uid: str, uidgen) -> list:
"""Flatten a block tree into ``create-block`` actions for one batch request."""
actions: list = []
for order, node in enumerate(children):
uid = uidgen()
block = {"string": node["string"], "uid": uid}
if node.get("heading"):
block["heading"] = node["heading"]
actions.append({
"action": "create-block",
"location": {"parent-uid": parent_uid, "order": order},
"block": block,
})
actions.extend(tree_to_actions(node.get("children", []), uid, uidgen))
return actions
@register
class RoamSink(Sink):
name = "roam"
aliases = ("roamresearch",)
requires = ("ROAM_API_TOKEN", "ROAM_GRAPH_NAME")
label = "Roam Research (batch-actions, optional dep)"
def deliver(self, doc: ParsedDoc) -> SinkResult:
try:
from roam_client.client import create_page, initialize_graph
except ImportError as exc: # pragma: no cover - exercised via SinkError path
raise SinkError(_INSTALL_HINT) from exc
token = self.env("ROAM_API_TOKEN")
graph = self.env("ROAM_GRAPH_NAME")
client = initialize_graph({"token": token, "graph": graph})
create_page(client, {"page": {"title": doc.title}})
counter = itertools.count(1)
actions = tree_to_actions(
md_to_roam_tree(doc.markdown), doc.title, lambda: f"mu{next(counter):07d}"
)
if actions:
client.call(
f"/api/graph/{graph}/write", "POST",
{"action": "batch-actions", "actions": actions},
)
return SinkResult(
sink=self.name, ok=True,
url=f"https://roamresearch.com/#/app/{graph}",
detail=f"{len(actions)} block(s) via batch-actions (images need public URLs)",
)

View File

@ -0,0 +1,111 @@
"""SiYuan sink: create a new document from Markdown via the local kernel API.
SiYuan (思源笔记) exposes a kernel HTTP API (default ``http://127.0.0.1:6806``)
authenticated with an API token. Delivery follows SiYuan's native ingestion path:
1. Resolve the target notebook (``SIYUAN_NOTEBOOK`` or the first listed notebook).
2. Upload each referenced local image via ``/api/asset/upload`` and rewrite the
Markdown to point at the returned ``assets/...`` paths.
3. Create the document with ``/api/filetree/createDocWithMd``.
Every kernel response wraps its payload as ``{"code": 0, "msg": "", "data": ...}``;
a non-zero ``code`` is an error.
"""
from __future__ import annotations
import json
from pathlib import Path
from . import _http, _md
from .base import ParsedDoc, Sink, SinkError, SinkResult, register
@register
class SiYuanSink(Sink):
name = "siyuan"
requires = ("SIYUAN_TOKEN",)
label = "SiYuan notebook (local kernel API)"
def _json_post(self, base: str, path: str, headers: dict, payload: dict):
"""POST JSON; return ``data`` after verifying ``code == 0``."""
try:
status, parsed = _http.request_json("POST", f"{base}{path}",
headers=headers, payload=payload)
except Exception as exc: # noqa: BLE001
raise self._unreachable(base, exc) from exc
return self._unwrap(base, status, parsed)
def _upload_post(self, base: str, headers: dict, content_type: str, body: bytes):
"""POST a multipart body; return ``data`` after verifying ``code == 0``."""
hdrs = dict(headers)
hdrs["Content-Type"] = content_type
try:
status, raw = _http.http_request("POST", f"{base}/api/asset/upload",
headers=hdrs, data=body)
except Exception as exc: # noqa: BLE001
raise self._unreachable(base, exc) from exc
parsed: dict = {}
if raw:
try:
parsed = json.loads(raw.decode("utf-8"))
except (ValueError, UnicodeDecodeError):
parsed = {}
return self._unwrap(base, status, parsed)
@staticmethod
def _unreachable(base: str, exc=None) -> SinkError:
suffix = f" ({exc})" if exc else ""
return SinkError(
f"SiYuan kernel not reachable at {base} — start SiYuan and enable "
f"the API token{suffix}"
)
def _unwrap(self, base: str, status: int, parsed: dict):
if status == 0:
raise self._unreachable(base)
if parsed.get("code") != 0:
raise SinkError(parsed.get("msg") or f"SiYuan API error (HTTP {status})")
return parsed.get("data")
def deliver(self, doc: ParsedDoc) -> SinkResult:
base = (self.env("SIYUAN_API_URL", "http://127.0.0.1:6806")
or "http://127.0.0.1:6806").rstrip("/")
token = self.env("SIYUAN_TOKEN")
headers = {"Authorization": f"Token {token}"}
notebook = self.env("SIYUAN_NOTEBOOK")
if not notebook:
data = self._json_post(base, "/api/notebook/lsNotebooks", headers, {})
notebooks = (data or {}).get("notebooks") or []
if not notebooks:
raise SinkError("SiYuan has no notebooks — create one before delivering")
notebook = notebooks[0]["id"]
base_dir = Path(doc.markdown_path).parent if doc.markdown_path else None
images = _md.find_local_images(doc.markdown, base_dir)
mapping = {}
for _alt, ref, path in images:
content_type, body = _http.encode_multipart(
fields={"assetsDirPath": "/assets/"},
files=[("file[]", path.name, path.read_bytes())],
)
data = self._upload_post(base, headers, content_type, body)
succ_map = (data or {}).get("succMap") or {}
if path.name in succ_map:
mapping[ref] = succ_map[path.name]
body_md = _md.rewrite_images(doc.markdown, mapping)
docid = self._json_post(base, "/api/filetree/createDocWithMd", headers, {
"notebook": notebook,
"path": "/" + _md.safe_filename(doc.title),
"markdown": body_md,
})
if not docid:
raise SinkError("SiYuan did not return a document id")
return SinkResult(
sink=self.name, ok=True,
url=f"siyuan://blocks/{docid}",
detail=f"{len(mapping)} image(s)",
)

View File

@ -0,0 +1,95 @@
"""Slack sink: upload the parsed Markdown as a file via the external-upload flow.
Slack deprecated ``files.upload`` (retired) in favour of a three-step external
upload. Delivery follows that official path:
1. ``files.getUploadURLExternal`` reserve an upload URL + file id for the
given filename and byte length.
2. ``POST`` the raw bytes to the returned upload URL.
3. ``files.completeUploadExternal`` finalize the upload, attach it to the
target channel, and post an initial comment.
Images are *not* embedded: Markdown is uploaded as a single ``.md`` file.
"""
from __future__ import annotations
import urllib.parse
from . import _http, _md
from .base import ParsedDoc, Sink, SinkError, SinkResult, register
@register
class SlackSink(Sink):
name = "slack"
requires = ("SLACK_BOT_TOKEN", "SLACK_CHANNEL")
label = "Slack channel (file upload)"
def deliver(self, doc: ParsedDoc) -> SinkResult:
token = self.env("SLACK_BOT_TOKEN")
channel = self.env("SLACK_CHANNEL")
auth = {"Authorization": f"Bearer {token}"}
content = doc.markdown.encode("utf-8")
filename = _md.slugify(doc.title) + ".md"
# Step 1: reserve an external upload URL + file id. This endpoint wants
# form-encoded data, so use http_request and parse the JSON response.
form = urllib.parse.urlencode({
"filename": filename,
"length": len(content),
}).encode("utf-8")
status, raw = _http.http_request(
"POST",
"https://slack.com/api/files.getUploadURLExternal",
headers={**auth, "Content-Type": "application/x-www-form-urlencoded"},
data=form,
)
parsed = _parse_json(raw)
if not parsed.get("ok"):
raise SinkError(parsed.get("error") or f"Slack getUploadURLExternal failed (HTTP {status})")
upload_url = parsed.get("upload_url")
file_id = parsed.get("file_id")
if not upload_url or not file_id:
raise SinkError("Slack did not return an upload URL / file id")
# Step 2: upload the raw bytes to the reserved URL.
up_status, _up_body = _http.http_request(
"POST", upload_url,
headers={"Content-Type": "application/octet-stream"},
data=content,
)
if up_status != 200:
raise SinkError(f"Slack file upload failed (HTTP {up_status})")
# Step 3: finalize the upload into the channel.
status, parsed = _http.request_json(
"POST",
"https://slack.com/api/files.completeUploadExternal",
headers=auth,
payload={
"files": [{"id": file_id, "title": doc.title}],
"channel_id": channel,
"initial_comment": f"Parsed: {doc.title}",
},
)
if not parsed.get("ok"):
raise SinkError(parsed.get("error") or f"Slack completeUploadExternal failed (HTTP {status})")
files = parsed.get("files") or [{}]
url = files[0].get("permalink")
return SinkResult(
sink=self.name, ok=True, url=url,
detail="uploaded .md file (images not embedded)",
)
def _parse_json(raw):
import json
if not raw:
return {}
try:
return json.loads(raw.decode("utf-8"))
except (ValueError, UnicodeDecodeError):
return {}

View File

@ -0,0 +1,48 @@
"""TickTick (滴答清单) sink — create a task from parsed Markdown.
TickTick's Open API exposes a task object whose ``content`` field holds the body
text. The official native ingestion path for arbitrary Markdown is therefore a
task: the document title becomes the task title and the Markdown becomes the
task content. Tasks have no attachment/inline-image surface, so local images are
not delivered.
Docs: https://developer.ticktick.com/docs (POST /open/v1/task).
"""
from __future__ import annotations
from . import _http
from .base import ParsedDoc, Sink, SinkError, SinkResult, register
API_URL = "https://api.ticktick.com/open/v1/task"
@register
class TickTickSink(Sink):
name = "ticktick"
aliases = ("dida", "滴答清单")
requires = ("TICKTICK_TOKEN",)
label = "TickTick task (滴答清单)"
def deliver(self, doc: ParsedDoc) -> SinkResult:
token = self.env("TICKTICK_TOKEN")
project_id = self.env("TICKTICK_PROJECT_ID")
payload = {"title": doc.title, "content": doc.markdown}
if project_id:
payload["projectId"] = project_id
headers = {"Authorization": f"Bearer {token}"}
status, parsed = _http.request_json("POST", API_URL, headers=headers, payload=payload)
if status >= 400:
raise SinkError(f"TickTick HTTP {status}: {parsed}")
if not parsed.get("id"):
raise SinkError(f"TickTick returned no task id: {parsed}")
return SinkResult(
sink=self.name,
ok=True,
url=None,
detail="task content (no inline images supported by TickTick)",
)

View File

@ -0,0 +1,60 @@
"""WeCom (企业微信 / WeChat Work) sink — send parsed Markdown as an app message.
WeCom apps deliver content via the message-send API. The native ingestion path
is a ``markdown`` message from a self-built app: first an access token is fetched
with the corp id + secret, then the message is posted. WeCom's markdown is a
limited subset with a 2048-byte content cap and no inline images, so the body is
truncated to fit.
Docs: https://developer.work.weixin.qq.com/document/path/90236 (message/send),
https://developer.work.weixin.qq.com/document/path/91039 (gettoken).
"""
from __future__ import annotations
from . import _http
from .base import ParsedDoc, Sink, SinkError, SinkResult, register
TOKEN_URL = "https://qyapi.weixin.qq.com/cgi-bin/gettoken"
SEND_URL = "https://qyapi.weixin.qq.com/cgi-bin/message/send"
@register
class WeComSink(Sink):
name = "wecom"
aliases = ("企业微信", "wechatwork")
requires = ("WECOM_CORPID", "WECOM_CORPSECRET", "WECOM_AGENTID")
label = "WeCom app markdown (企业微信)"
def deliver(self, doc: ParsedDoc) -> SinkResult:
corpid = self.env("WECOM_CORPID")
secret = self.env("WECOM_CORPSECRET")
agentid = self.env("WECOM_AGENTID")
touser = self.env("WECOM_TOUSER", "@all")
# Step 1: fetch an access token.
token_url = f"{TOKEN_URL}?corpid={corpid}&corpsecret={secret}"
status, parsed = _http.request_json("GET", token_url)
if parsed.get("errcode") not in (0, None) or not parsed.get("access_token"):
raise SinkError(parsed.get("errmsg") or f"WeCom token fetch failed: {parsed}")
token = parsed["access_token"]
# Step 2: send the markdown message.
send_url = f"{SEND_URL}?access_token={token}"
payload = {
"touser": touser,
"msgtype": "markdown",
"agentid": int(agentid),
"markdown": {"content": doc.markdown[:2048]},
}
status, parsed = _http.request_json("POST", send_url, payload=payload)
if parsed.get("errcode") not in (0, None):
raise SinkError(parsed.get("errmsg") or f"WeCom send failed: {parsed}")
return SinkResult(
sink=self.name,
ok=True,
url=None,
detail="markdown notification (WeCom markdown is a limited subset, "
"2048-byte cap, no inline images)",
)

View File

@ -0,0 +1,104 @@
"""WPS / 金山文档 (Kingsoft kdocs) sink — optional dependency.
The native ingestion path is: Markdown ``.docx`` upload to the kdocs cloud
appspace. There is no official Python SDK, so:
* MarkdownDOCX uses the maintained, pure-pip ``html-for-docx`` package
(reusing this project's Markdown→HTML), lazily imported so the core stays
zero-dependency. Install with ``pip install mineru-skill[wps]``.
* The kdocs WPS-2 request signing (plain SHA-1) and multipart upload are done
with the standard library small and fully documented.
Cloud upload requires an approved kdocs developer app (``WPS_APP_ID`` /
``WPS_APP_SECRET``) and a provisioned appspace; it is opt-in and surfaces the
raw kdocs error on failure. Docs: https://developer.kdocs.cn/server/guide/signature.html
"""
from __future__ import annotations
import email.utils
import hashlib
import io
import json
from . import _http, _md
from .base import ParsedDoc, Sink, SinkError, SinkResult, register
KDOCS_UPLOAD = "https://developer.kdocs.cn/api/v1/openapi/appspace/files/upload"
def _markdown_to_docx_bytes(markdown: str) -> bytes:
"""Convert Markdown → HTML → DOCX bytes via the optional html-for-docx lib."""
try:
from html4docx import HtmlToDocx # pip install html-for-docx
except ImportError as exc: # pragma: no cover - exercised via SinkError path
raise SinkError(
"WPS sink needs a Markdown→DOCX converter — "
"pip install 'mineru-skill[wps]' (i.e. pip install html-for-docx)"
) from exc
html = _md.md_to_html(markdown)
document = HtmlToDocx().parse_html_string(html)
buf = io.BytesIO()
document.save(buf)
return buf.getvalue()
def _wps2_headers(app_id: str, app_secret: str, body: bytes, content_type: str) -> dict:
"""Build kdocs WPS-2 auth headers.
signature = sha1(app_secret + content_md5 + content_type + date) hex.
Content-Md5 / Content-Type must match the exact wire body and header sent.
"""
content_md5 = hashlib.md5(body).hexdigest()
date = email.utils.formatdate(usegmt=True) # RFC1123 GMT
signature = hashlib.sha1(
(app_secret + content_md5 + content_type + date).encode("utf-8")
).hexdigest()
return {
"Date": date,
"Content-Md5": content_md5,
"Content-Type": content_type,
"Authorization": f"WPS-2:{app_id}:{signature}",
}
@register
class WpsSink(Sink):
name = "wps"
aliases = ("kdocs", "金山文档", "金山")
requires = ("WPS_APP_ID", "WPS_APP_SECRET")
label = "WPS / 金山文档 (Markdown→DOCX upload, optional dep)"
def deliver(self, doc: ParsedDoc) -> SinkResult:
app_id = self.env("WPS_APP_ID")
app_secret = self.env("WPS_APP_SECRET")
docx_bytes = _markdown_to_docx_bytes(doc.markdown)
filename = _md.safe_filename(doc.title) + ".docx"
fields = {}
parent_path = self.env("WPS_PARENT_PATH")
parent_token = self.env("WPS_PARENT_TOKEN")
if parent_path:
fields["parent_path"] = parent_path
if parent_token:
fields["parent_token"] = parent_token
content_type, body = _http.encode_multipart(
fields=fields, files=[("file", filename, docx_bytes)]
)
headers = _wps2_headers(app_id, app_secret, body, content_type)
status, raw = _http.http_request("POST", KDOCS_UPLOAD, headers=headers, data=body)
try:
parsed = json.loads(raw.decode("utf-8")) if raw else {}
except (ValueError, UnicodeDecodeError):
parsed = {}
if status >= 400 or parsed.get("code") not in (0, None):
raise SinkError(parsed.get("message") or parsed.get("msg") or f"kdocs HTTP {status}")
file_token = (parsed.get("data") or {}).get("file_token")
return SinkResult(
sink=self.name, ok=True, url=file_token,
detail="Markdown→DOCX uploaded to 金山文档 (experimental; needs a provisioned appspace)",
)

View File

@ -0,0 +1,65 @@
"""Yuque (语雀) sink: create a Markdown doc in a repository via the open API.
Yuque's open API (``https://www.yuque.com/api/v2``) authenticates with an
``X-Auth-Token`` header and creates docs under a repository namespace. The body
is posted as raw Markdown.
Yuque's open API has no asset-upload endpoint, so local image refs are left
untouched host images at a public URL for them to render.
"""
from __future__ import annotations
from pathlib import Path
from . import _http, _md
from .base import ParsedDoc, Sink, SinkError, SinkResult, register
API = "https://www.yuque.com/api/v2"
@register
class YuqueSink(Sink):
name = "yuque"
aliases = ("语雀",)
requires = ("YUQUE_TOKEN", "YUQUE_NAMESPACE")
label = "Yuque doc (open API)"
def deliver(self, doc: ParsedDoc) -> SinkResult:
token = self.env("YUQUE_TOKEN")
namespace = self.env("YUQUE_NAMESPACE")
headers = {
"X-Auth-Token": token,
"User-Agent": "MinerU-Skill/3.0",
"Content-Type": "application/json",
}
base_dir = Path(doc.markdown_path).parent if doc.markdown_path else None
n_images = len(_md.find_local_images(doc.markdown, base_dir))
status, parsed = _http.request_json(
"POST", f"{API}/repos/{namespace}/docs", headers=headers, payload={
"title": doc.title,
"slug": _md.slugify(doc.title),
"public": 0,
"format": "markdown",
"body": doc.markdown,
},
)
data = parsed.get("data")
if not data:
if status >= 400 or parsed.get("message"):
raise SinkError(parsed.get("message") or f"HTTP {status}")
raise SinkError(f"Yuque returned no doc data (HTTP {status})")
slug = data.get("slug")
if n_images:
detail = f"text only ({n_images} local image(s); host images publicly to embed)"
else:
detail = "text only"
return SinkResult(
sink=self.name, ok=True,
url=f"https://www.yuque.com/{namespace}/{slug}",
detail=detail,
)

View File

@ -0,0 +1,64 @@
"""Split oversized PDFs into cap-sized parts so they clear the MinerU API limits.
The MinerU cloud caps at 20 pages (free Agent API) / 200 pages (Standard API).
``--split`` slices a larger PDF into parts locally, each is parsed, and the
Markdown is merged back so we are no longer bound by those page caps (the same
trick mineru-converter uses). Uses the optional ``pypdf`` library, lazily
imported, so the core stays zero-dependency.
pip install "mineru-skill[split]" # i.e. pip install pypdf
"""
from __future__ import annotations
from pathlib import Path
class SplitError(Exception):
"""Raised when splitting is requested but cannot be performed."""
def _load_pypdf():
try:
import pypdf # noqa: F401
return pypdf
except ImportError as exc:
raise SplitError(
"--split needs the pypdf library — pip install 'mineru-skill[split]' "
"(i.e. pip install pypdf)"
) from exc
def pdf_page_count(path) -> int:
"""Return the page count of a local PDF (requires pypdf)."""
pypdf = _load_pypdf()
return len(pypdf.PdfReader(str(path)).pages)
def split_pdf(path, max_pages: int, out_dir) -> list:
"""Slice ``path`` into ``max_pages``-page parts under ``out_dir``.
Returns the list of part paths (a single-element list pointing at the original
file if it already fits).
"""
if max_pages < 1:
raise SplitError("max_pages must be >= 1")
pypdf = _load_pypdf()
reader = pypdf.PdfReader(str(path))
total = len(reader.pages)
if total <= max_pages:
return [Path(path)]
out_dir = Path(out_dir)
out_dir.mkdir(parents=True, exist_ok=True)
stem = Path(path).stem
parts = []
for part_index, start in enumerate(range(0, total, max_pages), start=1):
writer = pypdf.PdfWriter()
for page in range(start, min(start + max_pages, total)):
writer.add_page(reader.pages[page])
part_path = out_dir / f"{stem}__part{part_index:03d}.pdf"
with open(part_path, "wb") as handle:
writer.write(handle)
parts.append(part_path)
return parts

View File

@ -167,7 +167,12 @@ async def handle_request(request: Dict[str, Any]) -> Dict[str, Any]:
top_k = arguments.get("top_k", 100)
if not query:
return create_error_response(request_id, -32602, "Missing required parameter: query")
return create_success_response(request_id, {
"content": [{
"type": "text",
"text": "Error: missing required parameter 'query'. Please call this tool again with a non-empty 'query' argument describing what you want to retrieve."
}]
})
result = rag_retrieve(query, top_k)

View File

@ -0,0 +1,137 @@
---
name: table-query
description: Query structured spreadsheet/table data (Excel/CSV) to answer questions about values, prices, quantities, inventory, specifications, rankings, comparisons, summaries, aggregations, lists, or any numeric/tabular lookup. Use this skill whenever the answer likely comes from uploaded tables. You locate tables, read their schema, author SQLite SQL yourself, and run it — the backend does no LLM work, so it is fast.
category: Data & Retrieval
---
# Table Query
Answer table/spreadsheet questions by authoring and running SQLite SQL against the
bot's uploaded Excel data. The backend is a thin, fast SQL executor — **you** do the
thinking (rewrite the question, pick tables, write SQL). Row-level citations
(`__src`) are produced for you.
## When to use
Use `table-query` for: values, prices, quantities, inventory, specifications,
rankings, comparisons, summaries, aggregations (sum/avg/count), lists, person /
project / product lookups, monthly/period totals, or any question whose answer
comes from structured tables. For pure concept / definition / policy / explanation
questions, use the `rag_retrieve` document tool instead.
## Workflow (do this in order, once)
1. **search-tables** — rewrite the user's question into a retrieval query (core
entity + attributes + synonyms), then locate candidate tables. Call this **once**.
2. **get-schemas** — for the relevant subset of returned tables, fetch their
`CREATE TABLE` schema and sample rows. Never write SQL without seeing the schema.
3. **author SQL** — write a SQLite query plan as JSON (see below).
4. **run-sql** — execute the plan. It returns CSV with an `__src` column and a
`file_ref_table` mapping plus citation instructions.
5. **answer + cite** — write the answer and add `<CITATION ... />` tags built from
`__src` + `file_ref_table`. Never print the `__src` column to the user.
### Anti-waste rules
- Call **search-tables at most once** per question. Do not re-locate tables you
already have schemas for.
- If `run-sql` returns an error, fix the SQL and call **run-sql** again (at most ~2
tries). Do **NOT** restart from search-tables.
- If `search-tables` finds nothing, fall back to the `rag_retrieve` document tool.
## Commands
```bash
# 1. locate tables
python {SKILL_DIR}/scripts/table_query.py search-tables --query "2025 April May June sales total" --top-k 20
# 2. read schema + sample rows for the tables you picked
python {SKILL_DIR}/scripts/table_query.py get-schemas --tables "sales_2025,customers"
# 3. run your authored plan — pipe the JSON plan via stdin (no temp file needed)
python {SKILL_DIR}/scripts/table_query.py run-sql <<'PLAN'
{"queries":[{"step":1,"sql":"CREATE TEMP TABLE \"final_table_step1\" AS SELECT \"month\", SUM(\"amount\") AS \"total\" FROM \"sales_2025\" GROUP BY \"month\"","source_table_names":["sales_2025"],"destine_table_name":"final_table_step1","destine_table_type":"final","destine_table_description":"Monthly totals"}]}
PLAN
```
## Authoring the SQL plan
The plan is a JSON object `{ "queries": [ ... ] }` that you pass to `run-sql` **on
stdin via a quoted heredoc** (`<<'PLAN' ... PLAN`). The quoted delimiter keeps all
the double quotes, single quotes and `$` in your SQL intact — no shell escaping.
(You may instead write it to a file and use `--plan-file path.json` if a plan is very
large, but stdin is the default and needs no extra step.)
Each query is one SQL step:
```json
{
"queries": [
{
"step": 1,
"sql": "CREATE TEMP TABLE \"final_table_step1\" AS SELECT \"month\", SUM(\"amount\") AS \"total\" FROM \"sales_2025\" WHERE \"month\" IN ('2025-04','2025-05','2025-06') GROUP BY \"month\"",
"source_table_names": ["sales_2025"],
"destine_table_name": "final_table_step1",
"destine_table_type": "final",
"destine_table_description": "Monthly sales totals for Apr-Jun 2025"
}
]
}
```
Field meaning:
- `step`: 1-based execution order.
- `sql`: a SQLite statement, normally `CREATE TEMP TABLE "..." AS SELECT ...`.
- `source_table_names`: tables this step reads (original tables, or earlier steps'
`destine_table_name` for multi-step plans).
- `destine_table_name`: the temp table this step creates. Convention:
`intermediate_table_stepN` or `final_table_stepN`.
- `destine_table_type`: `"final"` for results the user should see, `"intermediate"`
for helper steps. **At least one `final` is required.**
- `destine_table_description`: short human description of the result.
### SQL rules (important)
- **Quote every identifier** with double quotes: `"column name"`, `"table name"`.
- String literals use single quotes; escape `'` as `''`.
- Prefer **one logical result per `final` table**. For multiple separate results,
emit multiple `final` tables (e.g. step1, step2) — do **NOT** `UNION` unrelated results.
- For row-level citations to be precise, keep `final` steps as simple single-table
`SELECT`s (no `JOIN` / `GROUP BY` / aggregation). Aggregations still work but the
citation degrades to file+sheet level (`F1S2`) instead of an exact row (`F1S2R5`).
- Multi-step plans run in `step` order: build `intermediate_table_stepN` first, then
read it in a later step. Don't reference a temp table before it is created.
- **Sample rows are a format hint only** — never assume they represent the full data
or the row count. Your SQL must scan the whole table. Use `LIKE '%value%'` for free
text and `=` for enums/codes.
## Result handling & citations
- `run-sql` output begins with citation instructions, then `file_ref_table`, then the
result CSV (with `__src`).
- Parse `__src` (`F1S2R5` = file_ref F1, sheet 2, row 5) and `file_ref_table` to build
`<CITATION file="..." filename="..." sheet=N rows=[...] />`.
- Put citations on their own line **after** the list/table that uses the data; combine
same-(file,sheet) rows into one citation.
- If the result hint says rows were truncated (`Only the first N rows ...; the
remaining M ...`), tell the user the total (`N+M`), shown (`N`), and omitted (`M`).
- Never expose the `__src` column itself to the user.
### Controlling truncation
`run-sql` truncates results by default (total rows and per-cell characters) to keep
the context manageable. If a result comes back truncated and you genuinely need more,
re-run with higher limits — do **not** re-run search-tables:
```bash
python {SKILL_DIR}/scripts/table_query.py run-sql --max-rows 500 --cell-max 4000 <<'PLAN'
{"queries":[ ... ]}
PLAN
```
- `--max-rows`: max total rows across all `final` tables (default from backend config,
hard ceiling 2000). Prefer writing an aggregate query (SUM/COUNT/GROUP BY) over
pulling thousands of detail rows.
- `--cell-max`: max characters per cell before it is truncated with `..` (default from
backend config, hard ceiling 10000). Raise this when a long-text column (e.g. a
description/spec field) is getting cut off.

View File

@ -0,0 +1,213 @@
#!/usr/bin/env python3
"""
table-query CLI.
Fast, LLM-free table querying. Talks to the felo-mygpt table_query endpoints:
- search-tables : POST /v1/table_query/search_tables/{bot_id}
- get-schemas : POST /v1/table_query/get_schemas/{bot_id}
- run-sql : POST /v1/table_query/run_sql/{bot_id}
The agent drives the orchestration (rewrite -> locate -> author SQL -> run);
the backend only does cheap work, so each call returns in seconds.
"""
import argparse
import hashlib
import json
import os
import sys
try:
import requests
except ImportError:
print("Error: requests module is required. Please install it with: pip install requests")
sys.exit(1)
DEFAULT_BACKEND_HOST = os.getenv("BACKEND_HOST", "https://api-dev.gptbase.ai")
DEFAULT_MASTERKEY = os.getenv("MASTERKEY", "master")
# Same citation contract the legacy table_rag_retrieve used, so the agent's
# <CITATION ... /> behaviour is unchanged.
TABLE_CITATION_INSTRUCTIONS = """<CITATION_INSTRUCTIONS>
When using the retrieved table knowledge below, you MUST add XML citation tags for factual claims.
Format: `<CITATION file="file_id" filename="name.xlsx" sheet=1 rows=[2, 4] />`
- Parse `__src`: `F1S2R5` = file_ref F1, sheet 2, row 5
- Look up file_id in `file_ref_table`
- Combine same-sheet rows into one citation: `rows=[2, 4, 6]`
- MANDATORY: Create SEPARATE citation for EACH (file, sheet) combination
- NEVER put <CITATION> on the same line as a bullet point or table row
- Citations MUST be on separate lines AFTER the complete list/table
- NEVER include the `__src` column in your response - it is internal metadata only
- Citations MUST appear IMMEDIATELY AFTER the paragraph or bullet list that uses the knowledge
- NEVER collect all citations and place them at the end of your response
</CITATION_INSTRUCTIONS>
"""
def load_config() -> dict:
"""Load robot_config.json from the robot project root (3 levels up from scripts/)."""
config_path = os.path.join(os.path.dirname(__file__), '..', '..', '..', 'robot_config.json')
if os.path.exists(config_path):
try:
with open(config_path, 'r', encoding='utf-8') as f:
return json.load(f)
except (json.JSONDecodeError, IOError) as e:
print(f"Warning: failed to load robot_config.json: {e}", file=sys.stderr)
return {}
def _resolve_bot_id(cli_bot_id: str) -> str:
if cli_bot_id:
return cli_bot_id
return load_config().get('bot_id') or os.getenv("BOT_ID") or os.getenv("ASSISTANT_ID")
def _post(path: str, bot_id: str, payload: dict) -> dict:
url = f"{DEFAULT_BACKEND_HOST}/v1/table_query/{path}/{bot_id}"
auth_token = hashlib.md5(f"{DEFAULT_MASTERKEY}:{bot_id}".encode()).hexdigest()
headers = {
"content-type": "application/json",
"authorization": f"Bearer {auth_token}",
}
trace_id = os.getenv("TRACE_ID") or os.getenv("X_REQUEST_ID")
if trace_id:
headers["X-Request-ID"] = trace_id
resp = requests.post(url, json=payload, headers=headers, timeout=30)
if resp.status_code != 200:
raise RuntimeError(f"API {path} returned {resp.status_code}: {resp.text}")
return resp.json()
def cmd_search_tables(args, bot_id: str) -> str:
res = _post("search_tables", bot_id, {"query": args.query, "top_k": args.top_k})
tables = res.get("tables", [])
if not tables:
return ("No matching tables found. If the question may be answered from documents "
"instead of spreadsheets, fall back to the rag_retrieve document tool.")
lines = [f"Found {len(tables)} candidate table(s). Pick the relevant ones and call "
f"`get-schemas` for them next.\n"]
for t in tables:
lines.append(
f"- table_name: {t['table_name']}\n"
f" file: {t.get('file_name','')} | sheet: {t.get('sheet_name','')} "
f"| score: {round(t.get('score', 0), 3)}\n"
f" description: {t.get('table_description','')}"
)
return "\n".join(lines)
def cmd_get_schemas(args, bot_id: str) -> str:
table_names = [t.strip() for t in args.tables.split(',') if t.strip()]
res = _post("get_schemas", bot_id,
{"table_names": table_names, "sample_rows": args.sample_rows})
schemas = res.get("schemas", [])
missing = res.get("missing_tables", [])
if not schemas:
return f"No schemas resolved. Missing tables: {missing}"
blocks = []
for s in schemas:
block = [f"### Table: {s['table_name']}",
f"File: {s.get('file_name','')} | Sheet: {s.get('sheet_name','')}",
"```sql", s.get('sql_create', ''), "```"]
sample = s.get('sample_rows') or []
if sample:
block.append("Sample rows (format hint only, NOT the row count):")
block.append("```csv")
for row in sample:
block.append(",".join('"' + str(c).replace('"', '""') + '"' for c in row))
block.append("```")
blocks.append("\n".join(block))
out = "\n\n".join(blocks)
if missing:
out += f"\n\nNote: these requested tables were not found: {missing}"
out += ("\n\nNow author a SQLite plan and run it by piping the JSON to run-sql on stdin:\n"
" run-sql <<'PLAN'\n"
" {\"queries\": [{\"step\": 1, \"sql\": \"CREATE TEMP TABLE \\\"final_table_step1\\\" "
"AS SELECT ...\", \"source_table_names\": [\"...\"], "
"\"destine_table_name\": \"final_table_step1\", \"destine_table_type\": \"final\"}]}\n"
" PLAN\n"
"Quote all identifiers with double quotes.")
return out
def cmd_run_sql(args, bot_id: str) -> str:
# Read the plan from --plan-file if given, otherwise from stdin (heredoc).
try:
if args.plan_file:
with open(args.plan_file, 'r', encoding='utf-8') as f:
raw = f.read()
else:
raw = sys.stdin.read()
if not raw.strip():
return ("Error: no plan provided. Pipe the JSON plan via stdin, e.g.\n"
" python scripts/table_query.py run-sql <<'PLAN'\n"
" {\"queries\": [...]}\n"
" PLAN")
plan = json.loads(raw)
except (json.JSONDecodeError, IOError) as e:
return f"Error: failed to read SQL plan: {e}"
# accept either {"queries": [...]} or a bare [...] list
queries = plan.get("queries") if isinstance(plan, dict) else plan
if not queries:
return "Error: the plan must contain a non-empty `queries` list."
payload = {"queries": queries}
if args.max_rows is not None:
payload["max_rows"] = args.max_rows
if args.cell_max is not None:
payload["cell_max"] = args.cell_max
res = _post("run_sql", bot_id, payload)
if not res.get("success"):
return (f"SQL execution failed: {res.get('error')}\n"
"Fix your SQL and call run-sql again. Do NOT restart from search-tables.")
parts = [TABLE_CITATION_INSTRUCTIONS]
if res.get("instruction"):
parts.append(res["instruction"])
if res.get("knowledge"):
parts.append(res["knowledge"])
if res.get("extra_goal"):
parts.append(res["extra_goal"])
return "\n".join(parts)
def main():
parser = argparse.ArgumentParser(description="table-query: fast LLM-free table querying")
parser.add_argument("--bot-id", default=None, help="Bot id (defaults to robot_config.json)")
sub = parser.add_subparsers(dest="command", required=True)
p_search = sub.add_parser("search-tables", help="Vector-locate relevant tables")
p_search.add_argument("--query", "-q", required=True, help="Rewritten retrieval query")
p_search.add_argument("--top-k", "-k", type=int, default=20)
p_schemas = sub.add_parser("get-schemas", help="Fetch CREATE TABLE schema + sample rows")
p_schemas.add_argument("--tables", "-t", required=True, help="Comma-separated table names")
p_schemas.add_argument("--sample-rows", type=int, default=3)
p_run = sub.add_parser("run-sql", help="Execute an authored SQL plan (JSON via stdin or file)")
p_run.add_argument("--plan-file", "-f", default=None,
help="Path to plan JSON file (optional; defaults to reading stdin)")
p_run.add_argument("--max-rows", type=int, default=None,
help="Max total result rows (raise if a result came back truncated)")
p_run.add_argument("--cell-max", type=int, default=None,
help="Max characters per cell before truncation")
args = parser.parse_args()
bot_id = _resolve_bot_id(args.bot_id)
if not bot_id:
print("Error: bot_id is required (robot_config.json / --bot-id / BOT_ID env)")
sys.exit(1)
try:
if args.command == "search-tables":
print(cmd_search_tables(args, bot_id))
elif args.command == "get-schemas":
print(cmd_get_schemas(args, bot_id))
elif args.command == "run-sql":
print(cmd_run_sql(args, bot_id))
except Exception as e:
print(f"Error: {str(e)}")
sys.exit(1)
if __name__ == "__main__":
main()

View File

@ -0,0 +1,25 @@
name: table-query
version: 1.0.0
description: Fast LLM-free table querying. Locate tables, fetch schema, author SQLite SQL, and run it with row-level citations.
author:
name: sparticle
email: support@gbase.ai
license: MIT
tags:
- table
- sql
- excel
- retrieval
- citation
runtime:
python: ">=3.7"
dependencies:
- requests
entry_point: scripts/table_query.py
commands:
search-tables:
description: Vector-locate relevant tables for a query
get-schemas:
description: Fetch CREATE TABLE schema + sample rows for given tables
run-sql:
description: Execute an authored SQLite plan and return CSV with __src citations

View File

@ -0,0 +1,67 @@
#!/usr/bin/env bash
#
# Manual verification for the new table_query endpoints.
# Run this against an environment where the feature/table-query-split branch is
# deployed (e.g. dev). It checks the 3 fast endpoints and diffs run_sql output
# against the legacy table_rag_retrieve for parity.
#
# Usage:
# HOST=https://api-dev.gptbase.ai BOT_ID=<bot> MASTERKEY=master ./verify_table_query.sh
#
set -euo pipefail
HOST="${HOST:-https://api-dev.gptbase.ai}"
# bot from the slow-request log (has the 案1_売上明細 xlsx). Override as needed.
BOT_ID="${BOT_ID:-c1fa021b-6c41-41d5-b1e6-adfb8896aaaa}"
MASTERKEY="${MASTERKEY:-master}"
QUERY="${QUERY:-2025年4月〜6月の売上実績}"
# auth token = MD5(masterkey:bot_id)
TOKEN=$(python3 -c "import hashlib,sys;print(hashlib.md5(f'{sys.argv[1]}:{sys.argv[2]}'.encode()).hexdigest())" "$MASTERKEY" "$BOT_ID")
AUTH="authorization: Bearer ${TOKEN}"
CT="content-type: application/json"
echo "=== HOST=$HOST BOT_ID=$BOT_ID ==="
echo
echo "### 1) search_tables ###"
curl -s --request POST "$HOST/v1/table_query/search_tables/$BOT_ID" \
--header "$AUTH" --header "$CT" \
--data "{\"query\": \"$QUERY\", \"top_k\": 20}" | python3 -m json.tool
echo
echo "### 2) get_schemas (EDIT --data table_names with names from step 1) ###"
echo "curl -s --request POST \"$HOST/v1/table_query/get_schemas/$BOT_ID\" \\"
echo " --header \"$AUTH\" --header \"$CT\" \\"
echo " --data '{\"table_names\": [\"<TABLE_NAME_FROM_STEP_1>\"], \"sample_rows\": 3}' | python3 -m json.tool"
echo
echo "### 3) run_sql (EDIT the sql to match the real table/columns from step 2) ###"
cat > /tmp/tq_plan.json <<'JSON'
{
"queries": [
{
"step": 1,
"sql": "CREATE TEMP TABLE \"final_table_step1\" AS SELECT \"計上日\", \"得意先名\", \"売上金額\" FROM \"<TABLE_NAME>\" LIMIT 10",
"source_table_names": ["<TABLE_NAME>"],
"destine_table_name": "final_table_step1",
"destine_table_type": "final",
"destine_table_description": "sample rows"
}
]
}
JSON
echo "Edit /tmp/tq_plan.json (replace <TABLE_NAME>), then:"
echo "curl -s --request POST \"$HOST/v1/table_query/run_sql/$BOT_ID\" \\"
echo " --header \"$AUTH\" --header \"$CT\" \\"
echo " --data @/tmp/tq_plan.json | python3 -m json.tool"
echo
echo "ASSERT: run_sql output 'knowledge' contains a '__src' column and 'file_ref_table'."
echo
echo "### 4) legacy table_rag_retrieve (parity reference, same question) ###"
echo "curl -s --request POST \"$HOST/v1/table_rag_retrieve/$BOT_ID\" \\"
echo " --header \"$AUTH\" --header \"$CT\" \\"
echo " --data '{\"query\": \"$QUERY\"}' | python3 -m json.tool"
echo
echo "Compare the __src tokens / result rows between #3 and #4 for the same SQL intent."

View File

@ -193,7 +193,12 @@ async def handle_request(request: Dict[str, Any]) -> Dict[str, Any]:
top_k = arguments.get("top_k", 100)
if not query:
return create_error_response(request_id, -32602, "Missing required parameter: query")
return create_success_response(request_id, {
"content": [{
"type": "text",
"text": "Error: missing required parameter 'query'. Please call this tool again with a non-empty 'query' argument describing what you want to retrieve."
}]
})
result = rag_retrieve(query, top_k)

View File

@ -193,7 +193,12 @@ async def handle_request(request: Dict[str, Any]) -> Dict[str, Any]:
top_k = arguments.get("top_k", 100)
if not query:
return create_error_response(request_id, -32602, "Missing required parameter: query")
return create_success_response(request_id, {
"content": [{
"type": "text",
"text": "Error: missing required parameter 'query'. Please call this tool again with a non-empty 'query' argument describing what you want to retrieve."
}]
})
result = rag_retrieve(query, top_k)