add mineru

This commit is contained in:
朱潮 2026-06-05 14:35:17 +08:00
parent 22b9ad4877
commit b618cb12d2
30 changed files with 4691 additions and 0 deletions

View File

@ -0,0 +1,49 @@
---
name: mineru
description: An AI-Native skill for parsing PDF / Office / image files into Markdown with MinerU — a fast, zero-config document parser for AI agents. Works with NO token via the Agent API and auto-upgrades to the Standard API (token) for large files, batches, and DOCX/HTML/LaTeX export. Use when converting PDF/Word/PPT/Excel/image documents, extracting text/tables/formulas, running OCR, or batch processing.
category: Document Processing
metadata:
author: Nebutra
version: "3.3.1"
argument-hint: <pdf-file-or-url>
---
# MinerU PDF Parser
Parse PDF, Office, and image documents into structured Markdown via the MinerU API.
## Quick Start
```bash
# Zero-config: no token, no install (free Agent API)
python3 "${CLAUDE_PLUGIN_ROOT}/scripts/mineru.py" ./document.pdf --output ./output/
# Pipe Markdown back to an agent
python3 "${CLAUDE_PLUGIN_ROOT}/scripts/mineru.py" ./document.pdf --stdout
# Power mode: token unlocks large files / batch / extra formats
export MINERU_TOKEN="..." # https://mineru.net/apiManage/token
python3 "${CLAUDE_PLUGIN_ROOT}/scripts/mineru.py" ./pdfs/ --output ./output/ --workers 8 --resume
```
## Features
- **Auto-routing**: free Agent API by default, auto-upgrades to the Standard API (token) for large/batch/extra-format jobs
- **Multi-modal**: PDF, images, Word, PPT, Excel, HTML
- **High-performance OCR**: `--ocr` with language selection (`--lang`)
- **Formula & table recognition**: LaTeX formulas, structured tables
- **Multi-format export**: Markdown (default), plus DOCX / HTML / LaTeX
- **AI-Native output**: `--stdout` (Markdown) and `--json` (machine status)
- **Batch + resume**: parallel workers with `--resume`
- **Zero dependencies**: standard library only
## Authentication
A token is **optional** — the Agent API works without one. Set a token to unlock
the Standard API (≤ 200 MB / ≤ 200 pages, batch, DOCX/HTML/LaTeX):
```bash
export MINERU_TOKEN="your-token-here" # https://mineru.net/apiManage/token
```
Official API docs: https://mineru.net/apiManage/docs

View File

@ -0,0 +1,170 @@
# MinerU API Reference
Official docs: https://mineru.net/apiManage/docs · Token: https://mineru.net/apiManage/token
MinerU exposes **two** document-parsing APIs. This skill auto-routes between them.
| | 🎯 Standard API | ⚡ Agent API (lightweight) |
|---|---|---|
| Base URL | `https://mineru.net/api/v4` | `https://mineru.net/api/v1/agent` |
| Token | **required** (`Bearer`) | **none** (IP rate-limited) |
| Models | `pipeline` / `vlm` / `MinerU-HTML` | fixed lightweight `pipeline` |
| File size | ≤ 200 MB | ≤ 10 MB |
| Pages | ≤ 200 | ≤ 20 |
| Batch | ≤ 50 per request | single file only |
| Output | zip (Markdown + JSON, optional DOCX/HTML/LaTeX) | Markdown only (CDN link) |
| Designed for | high-accuracy / complex / batch | AI-agent / quick / no-login |
Free Standard-API quota: **1000 pages/day at highest priority** (overflow is lower priority).
---
## Authentication (Standard API)
```
Authorization: Bearer YOUR_API_TOKEN
```
Get a token at https://mineru.net/apiManage/token.
> **Response envelopes.** Business endpoints return `{"code":0,"data":{…},"msg":"ok"}`.
> The auth/gateway layer returns a *different* shape on failure:
> `{"success":false,"msgCode":"A0202","msg":"user authenticate failed"}`.
> Clients must handle both — this skill maps `msgCode` to the same error hints.
---
## Standard API endpoints (`/api/v4`)
### Single URL — `POST /extract/task`
```json
{
"url": "https://example.com/doc.pdf",
"model_version": "vlm",
"is_ocr": false,
"enable_formula": true,
"enable_table": true,
"language": "ch",
"page_ranges": "1-10",
"extra_formats": ["docx", "html"],
"data_id": "my-document"
}
```
Response → `{ "code": 0, "data": { "task_id": "…" } }`. HTML inputs require `model_version: "MinerU-HTML"`.
### Get task result — `GET /extract/task/{task_id}`
```json
{ "code": 0, "data": { "task_id": "…", "state": "done", "full_zip_url": "https://…", "err_msg": "" } }
```
### Batch local upload — `POST /file-urls/batch`
Returns signed upload URLs; PUT each file (no `Content-Type`). Up to **50** files / request.
```json
{ "files": [ { "name": "doc.pdf", "data_id": "doc" } ], "model_version": "vlm" }
```
Response → `{ "code": 0, "data": { "batch_id": "…", "file_urls": ["https://…"] } }`.
### Batch URL — `POST /extract/task/batch`
```json
{ "files": [ { "url": "https://…/doc.pdf", "data_id": "doc" } ], "model_version": "vlm" }
```
### Batch results — `GET /extract-results/batch/{batch_id}`
```json
{ "code": 0, "data": { "batch_id": "…", "extract_result": [
{ "file_name": "doc.pdf", "state": "done", "full_zip_url": "https://…" }
] } }
```
---
## Agent API endpoints (`/api/v1/agent`) — no token
### URL — `POST /parse/url`
```json
{ "url": "https://…/doc.pdf", "language": "ch", "enable_table": true, "is_ocr": false, "enable_formula": true, "page_range": "1-10" }
```
`page_range` accepts `from-to` or a single page only (no commas). Returns `{ "code": 0, "data": { "task_id": "…" } }`.
### File — `POST /parse/file`
```json
{ "file_name": "doc.pdf", "language": "ch" }
```
Response → `{ "data": { "task_id": "…", "file_url": "https://oss…" } }`; PUT the file to `file_url`.
### Result — `GET /parse/{task_id}`
```json
{ "code": 0, "data": { "task_id": "…", "state": "done", "markdown_url": "https://cdn…/full.md" } }
```
---
## Task states
`pending` (queued) · `running` (parsing) · `converting` (format conversion) ·
`uploading` (downloading source, Agent) · `waiting-file` (awaiting upload) ·
`done` (complete) · `failed` (error).
---
## Parameters
| Parameter | Type | Default | Notes |
|-----------|------|---------|-------|
| `model_version` | string | `pipeline` | `pipeline`, `vlm` (recommended), `MinerU-HTML` (HTML only) |
| `is_ocr` | bool | `false` | OCR for scanned docs (pipeline/vlm) |
| `enable_formula` | bool | `true` | Formula recognition |
| `enable_table` | bool | `true` | Table recognition |
| `language` | string | `ch` | OCR language (see official `language` table) |
| `page_ranges` | string | all | Standard: `"2,4-6"`; Agent `page_range`: `"1-10"` only |
| `extra_formats` | array | `[]` | `docx` / `html` / `latex` (Standard only) |
| `data_id` | string | | `[A-Za-z0-9_.-]`, ≤ 128 chars |
| `no_cache` | bool | `false` | Bypass URL cache (Standard) |
| `cache_tolerance` | int | `900` | Cache TTL seconds (Standard) |
---
## Limits
| | Standard | Agent |
|---|---|---|
| File size | 200 MB | 10 MB |
| Pages | 200 | 20 |
| Batch | 50 / request | 1 |
| Quota | 1000 pages/day priority | IP rate-limited (HTTP 429) |
Supported types: PDF, images (png/jpg/jpeg/jp2/webp/gif/bmp), Doc(x), Ppt(x), Xls(x); HTML is Standard-only.
---
## Error codes
| Code | Meaning |
|------|---------|
| `A0202` | Invalid token |
| `A0211` | Token expired |
| `-500` | Parameter error |
| `-10001` / `-10002` | Service error / invalid params |
| `-60002` | Unsupported file format |
| `-60003` / `-60004` | File read failed / empty file |
| `-60005` | File too large (> 200 MB) |
| `-60006` | Too many pages (> 200) |
| `-60008` | File read timeout (URL unreachable) |
| `-60010` | Parse failed |
| `-60015` / `-60016` | File / format conversion failed |
| `-60018` | Daily quota reached |
| `-60022` | Web page read failed (rate-limited) |
| **Agent API** | |
| `-30001` | Exceeds Agent 10 MB limit → use Standard API |
| `-30002` | Unsupported file type for Agent |
| `-30003` | Exceeds Agent 20-page limit → use Standard API or `--pages` |
| `-30004` | Invalid request parameters |

View File

@ -0,0 +1,193 @@
<!-- Web-researched competitive comparison (45 tools, 6 categories, adversarially fact-checked). Last researched 2026-05-31. Star counts / versions are point-in-time. -->
# MinerU Skill — Competitive Comparison Reference
This document gives an honest, sourced, per-tool breakdown of how **MinerU Skill** compares to the document-parsing landscape. Read the framing first: it determines how to interpret every "we win / they win" below.
## What MinerU Skill actually is (and is not)
MinerU Skill is a **zero-config, zero-dependency, agent-native convenience layer over [MinerU](https://github.com/opendatalab/MinerU)'s cloud API**, plus 17 turnkey delivery integrations to note/knowledge/content tools. Concretely (verified in this repo):
- Core script `scripts/mineru.py` is **~54KB / ~1,350 lines of pure Python standard library** — no `requests`/`aiohttp`, no model weights.
- A **genuinely token-free** default: the free **Agent API** path (`agent_parse` → `_agent_poll`) sends **no `Authorization` header** (the Bearer header is set only when a token is present). Files ≤10MB / ≤20 pages.
- **Auto-routing**: with a token, large/batched/extra-format jobs use the **Standard API** (≤200MB / ≤200 pages); the Agent path **auto-escalates** to Standard on size/page limits.
- **17 delivery sinks** (16 sink modules + `local.py` registering both `obsidian` and `logseq`): obsidian, logseq, siyuan, notion, confluence, onenote, coda, yuque, feishu, slack, dingtalk, wecom, ticktick, linear, airtable — all zero-dependency — plus **roam** (needs `roam-client`) and **wps** (needs `html-for-docx`) which lazy-load one library only when used.
- `--resume` dedup, parallel `--workers` (ThreadPoolExecutor), `--stdout`/`--json` agent output.
**Critical dependency:** our accuracy is **entirely downstream of, and capped by, what MinerU's cloud serves.** We own no models. Therefore:
- We have **no quality edge** over any other cloud wrapper that hits the same MinerU API — OCR/table/formula output is **identical**.
- Self-hosting the MinerU engine gives the **same or better** accuracy (version-controllable, no upload caps).
**Hard limits we cannot exceed:** 10MB/20-page free Agent tier, 200MB/200-page Standard tier, plus IP rate limits. Self-hosted tools have no such caps (only hardware).
**Our benchmark is latency-only.** `tests/test_live.py` measures end-to-end cloud round-trip latency (~1314s for the official demo PDF). It is **not** an accuracy benchmark; we have no OmniDocBench/olmOCR-Bench numbers of our own.
### A note on the speed claim
Our ~1314s/doc cloud round-trip is **not** a clean win over self-hosted GPU engines. A normal self-host with a GPU runs at ~0.18s/page (Marker) or ~2.12 pages/sec (MinerU on A100) — far faster at any real scale. We only out-run **slow Apple-Silicon-CPU local runs of small docs** (e.g., M4 VLM at 32148s/page). Do not frame "faster wall-clock" as a general win.
### A note on benchmarks
No single benchmark is authoritative. Different benchmarks favor different tools:
- **OmniDocBench** (v1.5/v1.6): MinerU2.5 **90.67** (v1.5), MinerU2.5-Pro **95.69** (v1.6) — leads, beating Gemini 2.5 Pro / GPT-4o / Qwen2.5-VL-72B on text/table/formula. Source: arXiv 2509.22186.
- **olmOCR-Bench** (Ai2, Oct 2025): olmOCR-2 **82.4** > Marker **76.1** > **MinerU 75.8**. Here MinerU **trails** — this is a real olmOCR win and must stay visible.
- **RD-TableBench**: Reducto 90.2% on complex tables — but Reducto authored this benchmark (vendor-biased).
- Mathpix is the de-facto formula-OCR standard (BLEU/edit-distance studies), though a PaddleOCR-VL-based tool claims to beat it on OmniDocBench v1.0 formula recognition, so the very top is contested.
> Star counts / versions below (e.g. MinerU "65.7k / v3.2.1") are point-in-time and not independently re-verified.
---
## Category 1 — Self-hosted / open-source parsing engines
These are the tools that close our single biggest gap: **fully offline / air-gapped / no cloud / no upload caps.**
### MinerU engine (opendatalab) — the engine we wrap
- **Source:** https://github.com/opendatalab/MinerU · arXiv 2509.22186 · https://huggingface.co/opendatalab/MinerU2.5-Pro-2604-1.2B
- **Strengths:** Owns the SOTA models (OmniDocBench 90.67 / 95.69-Pro v1.6). 109-language OCR, handwriting, cross-page table merge, formula→LaTeX (the source of *our* LaTeX). Fully self-hostable → offline, air-gappable, zero per-page cost, no caps. Pipeline backend runs pure CPU; VLM needs 8GB+ VRAM. Native MCP, Python/Go/TS SDKs, LangChain/LlamaIndex/Dify/FastGPT.
- **Weaknesses vs us:** Heavy install (multi-GB torch/vLLM + weights, 16GB RAM / 20GB disk floor); slow on Apple Silicon; no note/PKM delivery sinks; library/CLI rather than zero-config.
- **Verdict:** **Beats us** on offline, privacy, caps, accuracy ceiling, ecosystem. **We beat it** only on zero-install/zero-config and built-in delivery.
### Marker (datalab-to / VikParuchuri)
- **Source:** https://github.com/datalab-to/marker · https://allenai.org/blog/olmocr-2
- **Strengths:** Fully offline; very high batch throughput (~122 pages/sec/H100, 0.18s/page GPU); broad formats incl. EPUB; optional local-LLM (Ollama) quality boost with no data leaving the machine; ~35k+ stars, active.
- **Weaknesses:** **GPL-3.0** code + model weights under a modified RAIL-M (free only under ~$2M funding+revenue; commercial above that needs a Datalab license). olmOCR-Bench **76.1** — below olmOCR-2 and MinerU's OmniDocBench standing.
- **Verdict:** Beats us on offline/throughput; we beat it on zero-install and 17 delivery sinks. License gate is a real friction it has and we don't.
### Docling (IBM / DS4SD)
- **Source:** https://github.com/docling-project/docling · https://huggingface.co/ibm-granite/granite-docling-258M · arXiv 2408.09869
- **Strengths:** **Widest input modality set** (PDF/DOCX/PPTX/XLSX/HTML/AsciiDoc/LaTeX/CSV/images + **audio via ASR** + USPTO/JATS/XBRL). Tiny 258M Granite-Docling VLM runs on CPU/modest GPU. **MIT code + Apache-2.0 weights.** Deep framework ecosystem (LangChain/LlamaIndex/Haystack + official MCP), IBM-backed, 60k+ stars. Air-gapped by design.
- **Weaknesses:** Absolute accuracy lags MinerU on OmniDocBench/olmOCR-Bench; library-first (not a zero-config CLI); targets framework ingestion, not file delivery to note tools.
- **Verdict:** Beats us on offline, modality breadth, permissive license, ecosystem; we beat it on zero-install and note/PKM delivery. **Do not over-rank its MIT as uniquely best** — olmOCR's Apache-2.0 on *both* code and 7B weights is at least as commercially valuable.
### olmOCR (allenai)
- **Source:** https://github.com/allenai/olmocr · https://allenai.org/blog/olmocr-2 · https://huggingface.co/datasets/allenai/olmOCR-bench
- **Strengths:** **Leads Ai2's olmOCR-Bench (82.4 vs MinerU 75.8)** — a benchmark where MinerU trails. **Apache-2.0 on code AND the olmOCR-2-7B weights** (most commercial-friendly model reuse here). Built for million-page LLM-training linearization. Offline.
- **Weaknesses:** **PDF/image only** (no Office/HTML); **English-primary**, filters non-English (MinerU does 109-lang); **requires a 12GB+ NVIDIA GPU, no CPU mode at all**.
- **Verdict:** Beats us on offline, that-benchmark accuracy, license, scale. We beat it on modality breadth, multilingual, no-GPU, delivery, zero-install. **Keep the olmOCR-Bench lead visible — do not cherry-pick only OmniDocBench.**
### Nougat (facebookresearch / Meta AI)
- **Source:** https://github.com/facebookresearch/nougat · arXiv 2308.13418
- **Strengths:** Strong LaTeX/math on arXiv-style scientific PDFs (its trained niche). Offline.
- **Weaknesses:** **PDF + English/Latin-script only** (no CJK); **CC-BY-NC weights (non-commercial)**; effectively **unmaintained** (last release Aug 2023); known repetition/hallucination/[MISSING_PAGE] failures off-distribution.
- **Verdict:** Offline + niche math is its only edge; we beat it on general-purpose, multilingual, maintenance, commercial license, delivery.
### PyMuPDF4LLM (pymupdf / Artifex)
- **Source:** https://github.com/pymupdf/pymupdf4llm · https://pymupdf.io/blog/pymupdf-layout-10-faster-pdf-parsing-without-gpus
- **Strengths:** **Far faster and lighter than any ML tool on born-digital PDFs** (~hundreds of pages/sec on plain CPU; a C-optimized variant claims ~520 pages/sec). Lowest dependency/hardware footprint. Offline, no cloud, no caps. Ideal for huge clean-PDF corpora where speed > fidelity.
- **Weaknesses:** No ML → no real formula/LaTeX, weak complex tables, poor scanned/handwritten; slow external OCR; **AGPL-3.0 OR Artifex commercial**; Office formats need paid **PyMuPDF Pro**.
- **Verdict:** A genuine win for the speed-over-fidelity, clean-PDF use case. We beat it on hard-doc quality (MinerU's VLM), multilingual OCR, and delivery — but acknowledge its speed/footprint advantage honestly.
### Zerox (getomni-ai)
- **Source:** https://github.com/getomni-ai/zerox
- **Strengths:** Trivial provider-flexibility (OpenAI/Azure/Bedrock-Claude/Gemini/Vertex); JSON-Schema structured extraction (Node SDK); MIT code.
- **Weaknesses:** **NOT offline and NOT token-free** — mandates a paid cloud vision-LLM key; needs graphicsmagick+ghostscript; **no published benchmarks**; per-page LLM cost can exceed MinerU on large jobs.
- **Verdict:** We beat it on token-free start, benchmarked accuracy, dedicated formula/table models, system-dep footprint, and delivery. It beats us on provider-swap flexibility and typed JSON extraction.
---
## Category 2 — Commercial cloud document-parsing APIs
Mostly **stronger than us** on enterprise accuracy, SLAs, structured extraction, and RAG/MCP ecosystems. Our honest edges are narrow: token-free + zero-install hosted default, clean Markdown/LaTeX of academic PDFs, and 17 delivery sinks none of them offer.
### LlamaParse (LlamaIndex / LlamaCloud)
- **Source:** https://www.llamaindex.ai/pricing · LlamaCloud MCP docs
- **Beats us:** Official hosted **MCP server**; deep native RAG stack (parse→index→LlamaExtract/LlamaAgents); steerable NL parsing with frontier LLMs (GPT-4.1/Gemini 2.5 Pro); richer outputs (per-page JSON, XLSX, HTML tables, annotated PDF); enterprise SLAs; mature Python+TS SDKs.
- **We beat:** Token-free start (it needs a LlamaCloud key from page one); zero runtime deps; 17 note/PKM sinks (it delivers to RAG indexes, not note tools); built-in `--resume`/parallel batch CLI.
### Mathpix (Convert API)
- **Source:** https://mathpix.com/pricing/api · https://mathpix.com/image-to-latex
- **Beats us:** **Best-in-class formula/equation OCR (printed AND handwritten) → clean LaTeX — clearly better than MinerU for pure math fidelity; concede this, do not imply parity.** Mature Snip ecosystem + Overleaf workflows; very low per-image cost at scale.
- **We beat:** Token-free start (Mathpix API requires a paid PAYG account, **$19.99 setup fee**, card on file; **no recurring free monthly allowance** — only a one-time $29 test credit; the consumer Snip app's free quota does **not** apply to the API); general-purpose multi-modal Office parsing; 17 delivery sinks; built-in batch CLI.
### Unstructured.io
- **Source:** https://unstructured.io/pricing · https://github.com/Unstructured-IO/unstructured
- **Beats us:** **Apache-2.0 core library is fully self-hostable → 100% offline** (we cannot); official MCP + huge connector ecosystem (S3/SharePoint/vector DBs); built-in chunking+embedding (RAG-ready); 25+ file types; permissive license for product embedding.
- **We beat:** Token-free hosted default with zero install (its hosted API needs a key; self-host means running infra); cleaner human-readable Markdown out of the box (its primary output is JSON "elements"); 17 note/PKM sinks (it targets vector DBs/storage). *On parsing quality:* VLM parsing is generally stronger for complex layout/formula, but this is **not a benchmarked head-to-head** — state it as a tendency, not a measured win.
### Reducto
- **Source:** https://reducto.ai/pricing
- **Beats us:** **Best complex/financial table extraction (90.2% RD-TableBench — vendor-authored but the strongest public evidence)**; agentic multi-pass OCR; SOC2/HIPAA, on-prem/VPC/air-gapped, enterprise SLAs; schema-based extraction with bounding boxes/citations.
- **We beat:** Token-free start (it needs a key + credits); zero-install plain CLI; 17 delivery sinks; auto-routing/--resume/parallel batch.
### Chunkr (and similar RAG-native APIs)
- **Beats us:** Self-hostable (offline option we lack); RAG-native chunking + broad export (DOCX/HTML/LaTeX).
- **We beat:** Token-free start; zero-install; 17 note/PKM sinks.
- **Caveat (fact-check):** Do **not** claim "stronger VLM Markdown for formulas" — Chunkr cloud uses its own proprietary models and we have **no head-to-head benchmark**. Drop the quality claim; keep only the export-breadth and offline framing.
---
## Category 3 — Other MinerU wrappers, skills & MCP servers (our direct peers)
**Every cloud-backed wrapper here hits the same MinerU API we do, so its OCR/table/formula output is IDENTICAL to ours.** We have **no quality edge** over them — only DX differences. Claims of "better OCR/formula/Markdown" vs these are **invalid** and must not appear.
### Official MinerU MCP server (mineru-open-mcp / MinerU-Ecosystem)
- **Source:** https://github.com/opendatalab/MinerU-Ecosystem · https://pypi.org/project/mineru-open-mcp/
- **Beats us:** **Official, first-party** — tracks API/format changes day-one; native **MCP server** (stdio + streamable-http) in Claude Desktop/Cursor/Windsurf with zero glue; full ecosystem (Python/Go/TS SDKs, LangChain/LlamaIndex/Dify/FastGPT). **Same free no-token Flash tier as us** — our "free zero-token" edge is fully matched by the first party.
- **We beat:** Zero runtime deps (vs pip/uvx install); auto-routing Agent⇄Standard with auto-escalation; 17 delivery sinks; `--resume`/parallel batch; usable as a plain CLI outside any MCP host.
### MinerU-Document-Explorer (official, opendatalab)
- **Source:** https://github.com/opendatalab/MinerU-Document-Explorer
- **Beats us:** Different, **larger** value prop — a local agent-native **knowledge engine** (BM25/vector/hybrid retrieval + deep-reading + LLM-wiki) with 15 MCP tools; runs 100% locally for its core; MIT, 568 stars.
- **We beat:** We're a focused zero-dep converter; broader conversion modalities; 17 delivery sinks (it keeps content in its own index/wiki); no Node/local-model download.
### linxule/mineru-mcp (Node, cloud)
- **Source:** https://github.com/linxule/mineru-mcp
- **Beats us:** Native MCP server with 6 granular tools (explicit status-polling + batch-status pagination); first-class for Node/JS MCP stacks; batch up to 200 URLs/request.
- **We beat:** **Free no-token path** (it **requires** a token always); zero runtime deps (vs Node 18+); broader modalities (Excel/HTML); 17 delivery sinks; usable as plain CLI outside MCP.
### mineru-converter-mcp-server (AvatarGanymede/MinerU-MCP)
- **Source:** https://pypi.org/project/mineru-converter-mcp-server/
- **Beats us:** **Auto-splits PDFs >200MB and segments >600-page docs by page range — gracefully exceeding the 200MB/200-page cap we are bound by.** Turnkey Smithery + Render deploy (per-user key); explicit HTML input.
- **We beat:** Free no-token default (it requires a key); zero runtime deps; plain CLI (no MCP host/Render/Smithery needed); 17 sinks; auto-routing.
### grimoire-skill (LeoLin990405)
- **Source:** https://github.com/LeoLin990405/grimoire-skill
- **Beats us:** Higher-level knowledge-capture ("parse once, share twice" → Obsidian notes + reusable skill packs); ingests **video** (YouTube/Bilibili) + subtitles (modalities we don't touch); cross-agent skill management; content-aware Obsidian auto-filing.
- **We beat:** Free no-token default (it needs a token + `--cloud-ok` for local files); zero runtime deps (vs bash+jq+awk + optional yt-dlp/ffmpeg); 17 sinks vs primarily Obsidian; broader Office/HTML; cross-platform single-file portability.
### kesslerio/mineru-pdf-parser (openclaw/ClawHub skill, local CPU)
- **Source:** openclaw/skills · SKILL.md
- **Beats us:** **Fully local/offline (pure CPU, cross-platform)** — no cloud/token/caps; handles privacy-sensitive docs; native Markdown + JSON.
- **We beat:** Zero install (it needs a full local MinerU install + weights + shell wrapper); no GPU/heavy runtime; faster wall-clock **only vs slow local CPU**; broader modalities; 17 sinks; `--stdout`/`--json`; better docs.
### nilecui/mineru-parser-skills (Claude Agent SDK, cloud)
- **Source:** https://github.com/nilecui/mineru-parser-skills
- **Beats us:** Built directly on the Claude Agent SDK (slots into Agent-SDK apps). Honestly little else — it's a thinner cloud wrapper.
- **We beat:** Accepts local files/dirs **and** URLs (it is **URL-only** — cannot parse a local PDF); free no-token default; zero runtime deps; batch/`--resume`/parallel; 17 sinks; broader modalities; mature/documented vs a 4-commit, no-license repo. *Caveat:* our "benchmarked" claim means **latency-measured**, not accuracy-benchmarked.
### TINKPA/mcp-mineru (local MLX, Apple Silicon)
- **Source:** https://github.com/TINKPA/mcp-mineru
- **Beats us:** **Fully offline/local** via MinerU running on-device (MLX accel); no cloud/token/caps; data never leaves the Mac.
- **We beat:** Zero install/no weights/no GPU; **faster wall-clock only for typical multi-page docs vs its slow local inference (32148s/page on M4)** — not a general speed win; broader modalities; batch/`--resume`/17 sinks; more active/documented; usable as plain CLI.
---
## Summary of mandatory concessions (do not bury these)
1. **Offline / air-gapped is our single biggest gap.** MinerU engine, Marker, Docling, olmOCR, Nougat, PyMuPDF4LLM, TINKPA, kesslerio, MinerU-Document-Explorer, and self-hostable Unstructured/Chunkr all run with **zero cloud dependency**. We are cloud-only and **cannot handle confidential/regulated/air-gapped content at all.**
2. **Data privacy:** every self-hosted competitor keeps documents on the machine; we **upload every file** to MinerU's cloud — a hard disqualifier for many regulated users.
3. **Accuracy is downstream of, and capped by, MinerU's cloud.** Self-hosting MinerU2.5-Pro gives the same-or-better accuracy with no caps. Same-backend wrappers yield **identical** quality to us.
4. **Hard caps:** 10MB/20-page (Agent), 200MB/200-page (Standard), IP rate limits. mineru-converter exceeds them via auto-split/segmentation.
5. **Mathpix beats us on formula/LaTeX OCR (incl. handwriting).**
6. **Reducto leads complex/financial tables; olmOCR leads olmOCR-Bench (82.4 vs MinerU 75.8).** Different benchmarks favor different tools — never cherry-pick only OmniDocBench.
7. **Official first-party advantage:** the official MinerU MCP/Document-Explorer + ecosystem track changes day-one and match our free tier; we are third-party, can lag, and ship **no MCP server**.
8. **Permissive-license wins we lack:** olmOCR (Apache-2.0 code + 7B weights), Docling (MIT + Apache-2.0 weights), Unstructured (Apache-2.0 core).
9. **PyMuPDF4LLM is far faster/lighter on born-digital PDFs** (clean-text corpora, speed > fidelity).
## Sources
- MinerU engine: https://github.com/opendatalab/MinerU · arXiv 2509.22186 · https://huggingface.co/opendatalab/MinerU2.5-Pro-2604-1.2B · https://neurohive.io/en/state-of-the-art/mineru2-5-open-source-1-2b-model-for-pdf-parsing-outperforms-gemini-2-5-pro-on-benchmarks/
- Official MCP / ecosystem: https://github.com/opendatalab/MinerU-Ecosystem · https://pypi.org/project/mineru-open-mcp/ · https://github.com/opendatalab/MinerU-Document-Explorer
- Marker: https://github.com/datalab-to/marker · https://allenai.org/blog/olmocr-2
- Docling: https://github.com/docling-project/docling · arXiv 2408.09869 · https://huggingface.co/ibm-granite/granite-docling-258M
- olmOCR: https://github.com/allenai/olmocr · https://allenai.org/blog/olmocr-2 · https://huggingface.co/datasets/allenai/olmOCR-bench
- Nougat: https://github.com/facebookresearch/nougat · arXiv 2308.13418
- PyMuPDF4LLM: https://github.com/pymupdf/pymupdf4llm · https://pymupdf.io/blog/pymupdf-layout-10-faster-pdf-parsing-without-gpus
- Zerox: https://github.com/getomni-ai/zerox
- LlamaParse: https://www.llamaindex.ai/pricing
- Mathpix: https://mathpix.com/pricing/api · https://mathpix.com/image-to-latex
- Unstructured: https://unstructured.io/pricing · https://github.com/Unstructured-IO/unstructured
- Reducto: https://reducto.ai/pricing
- Other wrappers: https://github.com/linxule/mineru-mcp · https://pypi.org/project/mineru-converter-mcp-server/ · https://github.com/LeoLin990405/grimoire-skill · https://github.com/nilecui/mineru-parser-skills · https://github.com/TINKPA/mcp-mineru

View File

@ -0,0 +1,59 @@
# Delivery Integrations (`--to`)
After parsing, MinerU Skill can deliver the Markdown straight into your content
tools using each tool's **official ingestion path** — no fragile generic block
converters. Targets are pluggable sinks; select one or more with `--to NAME`
(repeatable). List them live with `python3 scripts/mineru.py --list-sinks`.
```bash
# Parse and fan out to several destinations at once
python3 scripts/mineru.py paper.pdf --to obsidian --to notion --to slack
```
Each sink reads its configuration from **environment variables** so an AI agent
can run it non-interactively. Delivery results appear in `--json` output under
each result's `sinks` array.
## Support matrix
| Target | `--to` | Native path | Auth / config (env) | Markdown fidelity | Images |
|--------|--------|-------------|---------------------|-------------------|--------|
| **Obsidian** | `obsidian` (`ob`) | filesystem write + YAML frontmatter | `OBSIDIAN_VAULT`, `OBSIDIAN_SUBDIR?` | full | ✅ copied to `<note>.assets/` |
| **Logseq** | `logseq` | filesystem write, outline + `key:: value` | `LOGSEQ_GRAPH` | full (outline transform) | ✅ copied to `assets/` |
| **SiYuan** | `siyuan` | kernel `createDocWithMd` | `SIYUAN_TOKEN`, `SIYUAN_API_URL?`, `SIYUAN_NOTEBOOK?` | full (GFM) | ✅ `asset/upload` |
| **Notion** | `notion` | `POST /v1/pages` (blocks) | `NOTION_API_KEY`, `NOTION_PARENT_PAGE_ID`, `NOTION_VERSION?` | structure (headings/lists/code/quote) | ⚠️ text only¹ |
| **Linear** | `linear` | GraphQL `issueCreate` | `LINEAR_API_KEY`, `LINEAR_TEAM_ID` | full (Markdown-native) | ✅ base64-inlined |
| **Yuque 语雀** | `yuque` (`语雀`) | open API create doc | `YUQUE_TOKEN`, `YUQUE_NAMESPACE` | full (Markdown-native) | ⚠️ host publicly² |
| **Coda** | `coda` | page canvas `format:markdown` | `CODA_API_TOKEN`, `CODA_DOC_ID?` | full (Markdown-native) | ⚠️ public URL² |
| **Slack** | `slack` | external-upload `.md` file | `SLACK_BOT_TOKEN`, `SLACK_CHANNEL` | full (raw file) | ⚠️ not embedded |
| **Lark 飞书** | `feishu` (`lark`, `飞书`) | Drive `import_tasks` → Docx | `FEISHU_APP_ID`, `FEISHU_APP_SECRET`, `FEISHU_FOLDER_TOKEN?` | full (server-converted) | ⚠️ public URL² |
| **Confluence** | `confluence` | `POST /wiki/api/v2/pages` (storage) | `CONFLUENCE_BASE_URL`, `CONFLUENCE_EMAIL`, `CONFLUENCE_API_TOKEN`, `CONFLUENCE_SPACE_ID` | MD→HTML | ⚠️ not attached |
| **OneNote** | `onenote` | Graph `sections/{id}/pages` | `ONENOTE_TOKEN`³, `ONENOTE_SECTION_ID` | MD→HTML | ⚠️ remote only |
| **TickTick 滴答** | `ticktick` (`dida`, `滴答清单`) | `POST /open/v1/task` | `TICKTICK_TOKEN`, `TICKTICK_PROJECT_ID?` | task note | ❌ unsupported |
| **DingTalk 钉钉** | `dingtalk` (`钉钉`) | robot markdown webhook | `DINGTALK_WEBHOOK`, `DINGTALK_SECRET?` | markdown message | ⚠️ public URL only |
| **Airtable** | `airtable` | `POST /v0/{base}/{table}` record | `AIRTABLE_API_KEY`, `AIRTABLE_BASE_ID`, `AIRTABLE_TABLE`, `AIRTABLE_TITLE_FIELD?`, `AIRTABLE_BODY_FIELD?` | record field⁴ | ❌ not uploaded |
| **WeCom 企业微信** | `wecom` (`企业微信`) | app `message/send` markdown | `WECOM_CORPID`, `WECOM_CORPSECRET`, `WECOM_AGENTID`, `WECOM_TOUSER?` | message (subset, ≤2 KB)⁵ | ❌ unsupported |
| **Roam Research** ⁶ | `roam` | `batch-actions` block tree | `ROAM_API_TOKEN`, `ROAM_GRAPH_NAME` | full (Markdown→outline) | ⚠️ public URL |
| **WPS 金山文档** ⁶ | `wps` (`kdocs`, `金山`) | Markdown→DOCX → kdocs upload | `WPS_APP_ID`, `WPS_APP_SECRET`, `WPS_PARENT_PATH?` | DOCX (via html-for-docx) | embedded in DOCX |
Notes:
1. **Notion** images need a separate `file_uploads` upload-then-reference dance; v1 delivers text + structure and notes the count of un-embedded local images. (Roadmap: image upload.)
2. Hosted services that ingest Markdown by value but have no first-class CLI asset upload — local images must be hosted at a public URL to render. The Markdown is delivered intact; image links that are already URLs work.
3. **OneNote** `ONENOTE_TOKEN` is a Microsoft Graph access token (delegated, scope `Notes.Create`). Obtain it via the device-code OAuth flow; the sink itself stays non-interactive.
4. **Airtable** is a database, not a document store — the doc is stored as one record (title + body fields). A good "save this doc as a row" target, not a document publisher.
5. **WeCom** markdown messages are a limited subset (≤2048 bytes, no images/tables, not rendered in the workbench). Best as a notification/summary; for a full document deliver via Lark/Notion and send the link.
6. **Optional-dependency sinks** — these two rely on a third-party library that the sink lazy-imports only when used, so the core and the other 15 sinks stay zero-dependency. If the library is absent, the sink returns a clear `pip install …` hint. They are implemented to the official specs but, being credential/desktop-gated, are best-effort until validated against live accounts.
## Optional-dependency sinks (`[roam]`, `[wps]`)
```bash
pip install "mineru-skill[wps]" # html-for-docx (Markdown → DOCX)
pip install "mineru-skill[roam]" # official roam-client SDK (git, needs Python ≥3.11)
# roam-client is git-only; equivalently:
pip install "roam-client @ git+https://github.com/Roam-Research/backend-sdks.git#subdirectory=python"
```
- **Roam** — no library ingests Markdown into Roam, but the official `roam-client` SDK handles the genuinely error-prone transport (307/308 peer-host redirect, dual `Authorization`/`x-authorization` Bearer headers, `/write`). We depend on it for transport and build only the Markdown→outline tree, delivering the whole document in one `batch-actions` request. Images must be public URLs.
- **WPS / 金山文档** — Markdown→DOCX uses the maintained pure-pip `html-for-docx` (reusing this project's Markdown→HTML); the kdocs upload signs requests with the documented WPS-2 scheme (plain SHA-1) using only the standard library. Requires an approved kdocs developer app + provisioned appspace.
Adding more targets is a single small module — see `scripts/sinks/base.py`. PRs welcome.

View File

@ -0,0 +1 @@
"""Importable package for MinerU Skill console entry points."""

View File

@ -0,0 +1,88 @@
"""Heading-aware Markdown chunking for RAG pipelines (zero-dependency).
``chunk_markdown`` splits a parsed Markdown document into retrieval-sized chunks
that preserve heading context matching the RAG-friendliness of LlamaParse /
Unstructured without any dependency.
"""
from __future__ import annotations
import re
_HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
def _slug(text: str) -> str:
text = (text or "doc").strip().lower()
text = re.sub(r"[^a-z0-9]+", "-", text).strip("-")
return text or "doc"
def _split_by_size(text: str, max_chars: int) -> list:
"""Split text into <= max_chars pieces on paragraph boundaries (hard-split if needed)."""
if len(text) <= max_chars:
return [text]
pieces: list = []
current = ""
for para in text.split("\n\n"):
if len(para) > max_chars:
if current:
pieces.append(current)
current = ""
for i in range(0, len(para), max_chars):
pieces.append(para[i:i + max_chars])
elif not current:
current = para
elif len(current) + len(para) + 2 <= max_chars:
current = f"{current}\n\n{para}"
else:
pieces.append(current)
current = para
if current:
pieces.append(current)
return pieces
def chunk_markdown(markdown: str, *, max_chars: int = 2000, source: str = "") -> list:
"""Chunk Markdown by heading, size-splitting long sections.
Returns ``[{id, index, heading, text, chars, source}, ...]`` where ``heading``
is the ``H1 > H2 > H3`` breadcrumb for the chunk.
"""
lines = markdown.replace("\r\n", "\n").split("\n")
chunks: list = []
stack: list = [] # (level, text) heading breadcrumb
buf: list = []
base = _slug(source)
def breadcrumb() -> str:
return " > ".join(t for _, t in stack)
def flush():
text = "\n".join(buf).strip()
buf.clear()
if not text:
return
head = breadcrumb()
for piece in _split_by_size(text, max_chars):
idx = len(chunks)
chunks.append({
"id": f"{base}-{idx}",
"index": idx,
"heading": head,
"text": piece,
"chars": len(piece),
"source": source,
})
for line in lines:
match = _HEADING.match(line.strip())
if match:
flush() # close the previous section under its own breadcrumb
level = len(match.group(1))
while stack and stack[-1][0] >= level:
stack.pop()
stack.append((level, match.group(2)))
buf.append(line)
flush()
return chunks

View File

@ -0,0 +1,59 @@
"""Optional fully-offline parsing backend for born-digital PDFs.
Our single biggest honest gap is being cloud-only. ``--engine local`` parses a
PDF **entirely offline** with the optional, lightweight ``pymupdf4llm`` library
(no GPU, no cloud, no upload caps) ideal for confidential or born-digital PDFs
where MinerU's cloud VLM is overkill. Scanned/complex docs still want the cloud
engine, so ``--engine auto`` only uses local when the PDF has real text.
pip install "mineru-skill[local]" # i.e. pip install pymupdf4llm
"""
from __future__ import annotations
from pathlib import Path
_HINT = (
"--engine local needs pymupdf4llm — pip install 'mineru-skill[local]' "
"(i.e. pip install pymupdf4llm)"
)
class LocalEngineError(Exception):
"""Raised when local parsing is requested but cannot be performed."""
def available() -> bool:
try:
import pymupdf4llm # noqa: F401
return True
except ImportError:
return False
def is_born_digital(path, min_chars: int = 200) -> bool:
"""True if the PDF has extractable text (so local parsing is appropriate)."""
try:
import pymupdf
except ImportError:
return False
doc = pymupdf.open(str(path))
total = 0
for page in doc:
total += len(page.get_text().strip())
if total >= min_chars:
return True
return total >= min_chars
def parse_local(path, output_dir=None) -> str:
"""Parse a PDF to Markdown fully offline. Returns the Markdown string."""
try:
import pymupdf4llm
except ImportError as exc:
raise LocalEngineError(_HINT) from exc
if output_dir is not None:
images = Path(output_dir) / "images"
images.mkdir(parents=True, exist_ok=True)
return pymupdf4llm.to_markdown(str(path), write_images=True, image_path=str(images))
return pymupdf4llm.to_markdown(str(path))

File diff suppressed because it is too large Load Diff

View File

@ -0,0 +1,178 @@
#!/usr/bin/env python3
"""Zero-dependency MCP server (stdio) for MinerU Skill.
Speaks newline-delimited JSON-RPC 2.0 over stdin/stdout using only the standard
library, so an MCP host (Claude, Cursor, Windsurf, ...) can call MinerU. Register:
{"command": "python3", "args": ["scripts/mineru_mcp.py"]}
Tools: ``mineru_parse``, ``mineru_parse_to``, ``mineru_list_sinks``.
"""
from __future__ import annotations
import json
import os
import sys
from pathlib import Path
sys.path.insert(0, str(Path(__file__).resolve().parent))
import mineru # noqa: E402
PROTOCOL_VERSION = "2024-11-05"
SERVER_INFO = {"name": "mineru", "version": mineru.__version__}
TOOLS = [
{
"name": "mineru_parse",
"description": "Parse a PDF / Office / image file or URL into clean Markdown via MinerU.",
"inputSchema": {
"type": "object",
"properties": {
"input": {"type": "string", "description": "Local file path or http(s) URL"},
"output_dir": {"type": "string", "description": "Where to write output (default ./output)"},
"api": {"type": "string", "enum": ["auto", "agent", "standard"]},
"engine": {"type": "string", "enum": ["cloud", "local", "auto"]},
"ocr": {"type": "boolean"},
"lang": {"type": "string"},
},
"required": ["input"],
},
},
{
"name": "mineru_parse_to",
"description": "Parse a document and deliver the Markdown into content tools (Obsidian, Notion, Feishu, ...).",
"inputSchema": {
"type": "object",
"properties": {
"input": {"type": "string"},
"sinks": {"type": "array", "items": {"type": "string"}, "description": "Sink names, e.g. ['obsidian','notion']"},
"output_dir": {"type": "string"},
},
"required": ["input", "sinks"],
},
},
{
"name": "mineru_list_sinks",
"description": "List available delivery targets and their required environment variables.",
"inputSchema": {"type": "object", "properties": {}},
},
]
class MethodNotFound(Exception):
pass
def _text_result(text: str, is_error: bool = False) -> dict:
return {"content": [{"type": "text", "text": text}], "isError": is_error}
def _tool_parse(args: dict) -> dict:
opts = mineru.ParseOptions(is_ocr=bool(args.get("ocr")), language=args.get("lang", "ch"))
token = os.environ.get("MINERU_TOKEN")
output_dir = Path(args.get("output_dir") or "./output")
res = mineru.process_one(
args["input"], opts, token=token, output_dir=output_dir,
api=args.get("api", "auto"), engine=args.get("engine", "cloud"),
)
if res.state == "done":
return _text_result(res.markdown or "")
return _text_result(f"Parse failed: {res.error}", is_error=True)
def _tool_parse_to(args: dict) -> dict:
opts = mineru.ParseOptions()
token = os.environ.get("MINERU_TOKEN")
output_dir = Path(args.get("output_dir") or "./output")
res = mineru.process_one(args["input"], opts, token=token, output_dir=output_dir)
if res.state != "done":
return _text_result(f"Parse failed: {res.error}", is_error=True)
sinks = mineru._load_sinks()
if sinks is None:
return _text_result("Sinks package unavailable.", is_error=True)
doc = sinks.ParsedDoc(title=res.name, markdown=res.markdown, source=res.source,
modality=res.modality, markdown_path=res.markdown_path)
outcomes = [o.to_status() for o in sinks.deliver_all(doc, args["sinks"])]
any_fail = any(not o["ok"] for o in outcomes)
return _text_result(json.dumps({"name": res.name, "deliveries": outcomes}, ensure_ascii=False, indent=2),
is_error=any_fail)
def _tool_list_sinks(_args: dict) -> dict:
sinks = mineru._load_sinks()
if sinks is None:
return _text_result("Sinks package unavailable.", is_error=True)
listing = [{"name": n, "label": sinks.get_sink(n).label, "requires": list(sinks.get_sink(n).requires)}
for n in sinks.sink_names()]
return _text_result(json.dumps(listing, ensure_ascii=False, indent=2))
_TOOL_HANDLERS = {
"mineru_parse": _tool_parse,
"mineru_parse_to": _tool_parse_to,
"mineru_list_sinks": _tool_list_sinks,
}
def _route(method: str, params: dict):
if method == "initialize":
return {"protocolVersion": PROTOCOL_VERSION, "capabilities": {"tools": {}}, "serverInfo": SERVER_INFO}
if method == "tools/list":
return {"tools": TOOLS}
if method == "tools/call":
name = params.get("name")
handler = _TOOL_HANDLERS.get(name)
if handler is None:
return _text_result(f"Unknown tool: {name}", is_error=True)
try:
return handler(params.get("arguments") or {})
except Exception as exc: # noqa: BLE001 - report as a tool error, never crash the server
return _text_result(f"{type(exc).__name__}: {exc}", is_error=True)
raise MethodNotFound(method)
def dispatch(request: dict):
"""Handle one JSON-RPC request dict; return a response dict, or None for notifications."""
is_notification = "id" not in request
req_id = request.get("id")
try:
result = _route(request.get("method"), request.get("params") or {})
except MethodNotFound as exc:
if is_notification:
return None
return {"jsonrpc": "2.0", "id": req_id, "error": {"code": -32601, "message": f"Method not found: {exc}"}}
except Exception as exc: # noqa: BLE001
if is_notification:
return None
return {"jsonrpc": "2.0", "id": req_id, "error": {"code": -32603, "message": str(exc)}}
if is_notification:
return None
return {"jsonrpc": "2.0", "id": req_id, "result": result}
def serve(stdin=None, stdout=None) -> None:
"""Read newline-delimited JSON-RPC from stdin, write responses to stdout."""
stdin = stdin or sys.stdin
stdout = stdout or sys.stdout
for line in stdin:
line = line.strip()
if not line:
continue
try:
request = json.loads(line)
except ValueError:
continue
response = dispatch(request)
if response is not None:
stdout.write(json.dumps(response, ensure_ascii=False) + "\n")
stdout.flush()
def main() -> int:
serve()
return 0
if __name__ == "__main__":
sys.exit(main())

View File

@ -0,0 +1,75 @@
"""Pluggable delivery sinks for parsed Markdown.
Each submodule registers one or more :class:`Sink` implementations that deliver a
:class:`ParsedDoc` into a content tool using that tool's official ingestion path.
Importing this package populates the registry; a sink module that fails to import
is recorded in :data:`IMPORT_ERRORS` rather than breaking the others.
"""
from __future__ import annotations
import importlib
import sys
from .base import ( # noqa: F401
ParsedDoc,
Sink,
SinkError,
SinkResult,
get_sink,
sink_names,
REGISTRY,
)
# Sink modules to load. Order is cosmetic.
_MODULES = [
"local", # obsidian, logseq (filesystem)
"siyuan",
"notion",
"linear",
"yuque",
"coda",
"ticktick",
"dingtalk",
"airtable",
"wecom",
"slack",
"feishu",
"confluence",
"onenote",
"roam", # optional dependency (roam-client)
"wps", # optional dependency (html-for-docx)
]
IMPORT_ERRORS: dict = {}
for _name in _MODULES:
try:
importlib.import_module(f"{__name__}.{_name}")
except Exception as exc: # noqa: BLE001 - a bad sink shouldn't break the rest
IMPORT_ERRORS[_name] = f"{type(exc).__name__}: {exc}"
print(f"[sinks] failed to load {_name}: {exc}", file=sys.stderr)
def deliver_all(doc: ParsedDoc, names) -> list:
"""Deliver ``doc`` to each named sink, returning a list of :class:`SinkResult`."""
results = []
for name in names:
sink = get_sink(name)
if sink is None:
results.append(SinkResult(sink=name, ok=False, error=f"unknown sink '{name}'"))
continue
missing = sink.missing_config()
if missing:
results.append(SinkResult(
sink=sink.name, ok=False,
error=f"missing config: {', '.join(missing)}",
))
continue
try:
results.append(sink.deliver(doc))
except SinkError as exc:
results.append(SinkResult(sink=sink.name, ok=False, error=str(exc)))
except Exception as exc: # noqa: BLE001 - surface but never crash the run
results.append(SinkResult(sink=sink.name, ok=False, error=f"{type(exc).__name__}: {exc}"))
return results

View File

@ -0,0 +1,72 @@
"""Zero-dependency HTTP helpers shared by all sinks (stdlib urllib only).
``http_request`` is the single seam tests monkeypatch.
"""
from __future__ import annotations
import json
import mimetypes
import urllib.error
import urllib.request
from typing import Optional
USER_AGENT = "MinerU-Skill-sink/1.0"
def http_request(method, url, *, headers=None, data=None, timeout=60):
"""Perform one HTTP request. Returns ``(status_code, body_bytes)``."""
req = urllib.request.Request(url, data=data, method=method, headers=headers or {})
req.add_header("User-Agent", USER_AGENT)
try:
with urllib.request.urlopen(req, timeout=timeout) as resp:
return resp.getcode(), resp.read()
except urllib.error.HTTPError as exc:
body = exc.read() if hasattr(exc, "read") else b""
return exc.code, body
def request_json(method, url, *, headers=None, payload=None, timeout=60):
"""JSON request helper. Returns ``(status_code, parsed_json_or_empty_dict)``."""
hdrs = dict(headers or {})
body = None
if payload is not None:
hdrs.setdefault("Content-Type", "application/json")
body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
status, raw = http_request(method, url, headers=hdrs, data=body, timeout=timeout)
parsed: dict = {}
if raw:
try:
parsed = json.loads(raw.decode("utf-8"))
except (ValueError, UnicodeDecodeError):
parsed = {}
return status, parsed
def encode_multipart(fields=None, files=None):
"""Build a ``multipart/form-data`` body with stdlib only.
``fields``: dict of str -> str. ``files``: list of (field_name, filename, bytes).
Returns ``(content_type, body_bytes)``.
"""
boundary = "----MinerUSinkBoundary7MA4YWxkTrZu0gW"
crlf = b"\r\n"
parts = []
for name, value in (fields or {}).items():
parts.append(b"--" + boundary.encode())
parts.append(f'Content-Disposition: form-data; name="{name}"'.encode())
parts.append(b"")
parts.append(str(value).encode("utf-8"))
for field_name, filename, content in files or []:
ctype = mimetypes.guess_type(filename)[0] or "application/octet-stream"
parts.append(b"--" + boundary.encode())
parts.append(
f'Content-Disposition: form-data; name="{field_name}"; filename="{filename}"'.encode()
)
parts.append(f"Content-Type: {ctype}".encode())
parts.append(b"")
parts.append(content)
parts.append(b"--" + boundary.encode() + b"--")
parts.append(b"")
body = crlf.join(parts)
return f"multipart/form-data; boundary={boundary}", body

View File

@ -0,0 +1,244 @@
"""Small, dependency-free Markdown utilities used by sinks.
These are intentionally pragmatic, not a full CommonMark implementation: they
cover the constructs MinerU emits (headings, emphasis, code, lists, tables,
blockquotes, links, images) well enough to deliver faithful content to tools
that require HTML (Confluence, OneNote) or an outline (Logseq).
"""
from __future__ import annotations
import html
import re
from pathlib import Path
from typing import Optional
_IMAGE_RE = re.compile(r"!\[(?P<alt>[^\]]*)\]\((?P<ref>[^)\s]+)(?:\s+\"[^\"]*\")?\)")
_ILLEGAL_FS = re.compile(r'[\\/:*?"<>|#^\[\]]+')
def slugify(text: str, default: str = "document") -> str:
"""Filesystem/URL-safe slug."""
text = text.strip().lower()
text = re.sub(r"[\s_]+", "-", text)
text = re.sub(r"[^a-z0-9\-]+", "", text)
text = re.sub(r"-{2,}", "-", text).strip("-")
return text or default
def safe_filename(title: str, default: str = "document") -> str:
"""Clean a title into a safe note filename (keeps unicode, drops illegal chars)."""
name = _ILLEGAL_FS.sub(" ", title).strip()
name = re.sub(r"\s{2,}", " ", name)
return name[:120] or default
def is_remote(ref: str) -> bool:
return ref.startswith("http://") or ref.startswith("https://") or ref.startswith("data:")
def find_local_images(markdown: str, base_dir) -> list:
"""Return ``[(alt, ref, Path)]`` for image refs that point at existing local files."""
base = Path(base_dir) if base_dir else None
found = []
seen = set()
for match in _IMAGE_RE.finditer(markdown):
ref = match.group("ref")
if is_remote(ref) or ref in seen:
continue
path = Path(ref)
if not path.is_absolute() and base is not None:
path = base / ref
if path.is_file():
found.append((match.group("alt"), ref, path))
seen.add(ref)
return found
def rewrite_images(markdown: str, mapping: dict) -> str:
"""Rewrite local image refs using ``{old_ref: new_ref}``."""
def repl(match):
ref = match.group("ref")
if ref in mapping:
return f"![{match.group('alt')}]({mapping[ref]})"
return match.group(0)
return _IMAGE_RE.sub(repl, markdown)
def yaml_frontmatter(props: dict) -> str:
"""Render a YAML frontmatter block. List values become ``- item`` lines."""
lines = ["---"]
for key, value in props.items():
if value is None or value == "" or value == []:
continue
if isinstance(value, (list, tuple)):
lines.append(f"{key}:")
for item in value:
lines.append(f" - {item}")
else:
lines.append(f"{key}: {value}")
lines.append("---")
return "\n".join(lines)
# --------------------------------------------------------------------------- #
# Inline + block Markdown -> HTML (pragmatic, XHTML-safe)
# --------------------------------------------------------------------------- #
def _inline(text: str) -> str:
"""Convert inline Markdown to HTML on already-escaped text."""
# images first, then links
text = _IMAGE_RE.sub(
lambda m: f'<img src="{html.escape(m.group("ref"), quote=True)}" alt="{m.group("alt")}" />',
text,
)
text = re.sub(r"\[([^\]]+)\]\(([^)\s]+)\)",
lambda m: f'<a href="{html.escape(m.group(2), quote=True)}">{m.group(1)}</a>', text)
text = re.sub(r"`([^`]+)`", r"<code>\1</code>", text)
text = re.sub(r"\*\*([^*]+)\*\*", r"<strong>\1</strong>", text)
text = re.sub(r"(?<!\*)\*(?!\*)([^*]+)\*(?!\*)", r"<em>\1</em>", text)
return text
def md_to_html(markdown: str) -> str:
"""Convert a Markdown document to a pragmatic, XHTML-safe HTML fragment."""
out = []
lines = markdown.replace("\r\n", "\n").split("\n")
i = 0
n = len(lines)
in_code = False
code_buf: list = []
list_stack: list = [] # 'ul' / 'ol'
def close_lists():
while list_stack:
out.append(f"</{list_stack.pop()}>")
while i < n:
line = lines[i]
fence = line.strip().startswith("```")
if fence and not in_code:
close_lists()
in_code = True
code_buf = []
i += 1
continue
if fence and in_code:
out.append("<pre><code>" + html.escape("\n".join(code_buf)) + "</code></pre>")
in_code = False
i += 1
continue
if in_code:
code_buf.append(line)
i += 1
continue
stripped = line.strip()
if not stripped:
close_lists()
i += 1
continue
# table block
if "|" in stripped and i + 1 < n and re.match(r"^\s*\|?[\s:|-]+\|?\s*$", lines[i + 1]):
close_lists()
header = [c.strip() for c in stripped.strip("|").split("|")]
rows = []
i += 2
while i < n and "|" in lines[i] and lines[i].strip():
rows.append([c.strip() for c in lines[i].strip().strip("|").split("|")])
i += 1
out.append("<table><thead><tr>"
+ "".join(f"<th>{_inline(html.escape(c))}</th>" for c in header)
+ "</tr></thead><tbody>")
for row in rows:
out.append("<tr>" + "".join(f"<td>{_inline(html.escape(c))}</td>" for c in row) + "</tr>")
out.append("</tbody></table>")
continue
heading = re.match(r"^(#{1,6})\s+(.*)$", stripped)
if heading:
close_lists()
level = len(heading.group(1))
out.append(f"<h{level}>{_inline(html.escape(heading.group(2)))}</h{level}>")
i += 1
continue
if stripped.startswith(">"):
close_lists()
out.append(f"<blockquote>{_inline(html.escape(stripped[1:].strip()))}</blockquote>")
i += 1
continue
if re.match(r"^([-*+])\s+", stripped):
if not list_stack or list_stack[-1] != "ul":
close_lists()
list_stack.append("ul")
out.append("<ul>")
item = re.sub(r"^([-*+])\s+", "", stripped)
out.append(f"<li>{_inline(html.escape(item))}</li>")
i += 1
continue
if re.match(r"^\d+\.\s+", stripped):
if not list_stack or list_stack[-1] != "ol":
close_lists()
list_stack.append("ol")
out.append("<ol>")
item = re.sub(r"^\d+\.\s+", "", stripped)
out.append(f"<li>{_inline(html.escape(item))}</li>")
i += 1
continue
if re.match(r"^([-*_])\1{2,}$", stripped):
close_lists()
out.append("<hr />")
i += 1
continue
close_lists()
out.append(f"<p>{_inline(html.escape(stripped))}</p>")
i += 1
if in_code:
out.append("<pre><code>" + html.escape("\n".join(code_buf)) + "</code></pre>")
close_lists()
return "\n".join(out)
# --------------------------------------------------------------------------- #
# Markdown -> Logseq outline
# --------------------------------------------------------------------------- #
def md_to_logseq(markdown: str, properties: Optional[dict] = None) -> str:
"""Convert flat Markdown into a Logseq outline.
Every line becomes a ``- `` block. Headings are top-level blocks; the content
that follows a heading nests one level beneath it. Page properties
(``key:: value``) go on the first block, as Logseq requires.
"""
out = []
if properties:
prop_lines = []
for key, value in properties.items():
if not value:
continue
if isinstance(value, (list, tuple)):
value = ", ".join(str(v) for v in value)
prop_lines.append(f"{key}:: {value}")
if prop_lines:
out.append("- " + prop_lines[0])
out.extend(f" {p}" for p in prop_lines[1:])
have_heading = False
for raw in markdown.replace("\r\n", "\n").split("\n"):
line = raw.strip()
if not line:
continue
if re.match(r"^#{1,6}\s+", line):
out.append(f"- {line}")
have_heading = True
elif have_heading:
out.append(f"\t- {line}")
else:
out.append(f"- {line}")
return "\n".join(out)

View File

@ -0,0 +1,50 @@
"""Airtable sink — store parsed Markdown as a record in a base/table.
Airtable is a database, not a document tool: the native ingestion path is a
record whose fields hold the title and the Markdown body. Field names are
configurable to match an existing table schema.
Docs: https://airtable.com/developers/web/api/create-records
(POST /v0/{baseId}/{tableIdOrName}).
"""
from __future__ import annotations
import urllib.parse
from . import _http
from .base import ParsedDoc, Sink, SinkError, SinkResult, register
API_BASE = "https://api.airtable.com/v0"
@register
class AirtableSink(Sink):
name = "airtable"
requires = ("AIRTABLE_API_KEY", "AIRTABLE_BASE_ID", "AIRTABLE_TABLE")
label = "Airtable record (database)"
def deliver(self, doc: ParsedDoc) -> SinkResult:
api_key = self.env("AIRTABLE_API_KEY")
base = self.env("AIRTABLE_BASE_ID")
table = self.env("AIRTABLE_TABLE")
title_field = self.env("AIRTABLE_TITLE_FIELD", "Title")
body_field = self.env("AIRTABLE_BODY_FIELD", "Notes")
url = f"{API_BASE}/{base}/{urllib.parse.quote(table)}"
headers = {"Authorization": f"Bearer {api_key}"}
payload = {"fields": {title_field: doc.title, body_field: doc.markdown}}
status, parsed = _http.request_json("POST", url, headers=headers, payload=payload)
if parsed.get("error") or status >= 400:
raise SinkError(str(parsed.get("error") or f"HTTP {status}"))
if not parsed.get("id"):
raise SinkError(f"Airtable returned no record id: {parsed}")
return SinkResult(
sink=self.name,
ok=True,
url=None,
detail="stored as a database record (Airtable is a DB, not a doc)",
)

View File

@ -0,0 +1,101 @@
"""Core types and the sink registry for delivering parsed Markdown to content tools.
A *sink* takes a :class:`ParsedDoc` (Markdown + local images + metadata) and
delivers it into one destination (Obsidian, Notion, Slack, Feishu, ...) using
that tool's OFFICIAL native ingestion path. Sinks read their configuration from
environment variables so an AI agent can run them without interactive prompts.
"""
from __future__ import annotations
import os
from dataclasses import dataclass, field
from typing import Optional
@dataclass
class ParsedDoc:
"""A parsed document ready for delivery."""
title: str
markdown: str
images: tuple = () # absolute paths to local image files
source: str = ""
modality: str = "unknown"
markdown_path: Optional[str] = None
@dataclass
class SinkResult:
"""Outcome of delivering a :class:`ParsedDoc` to one sink."""
sink: str
ok: bool
url: Optional[str] = None
detail: Optional[str] = None
error: Optional[str] = None
def to_status(self) -> dict:
return {
"sink": self.sink,
"ok": self.ok,
"url": self.url,
"detail": self.detail,
"error": self.error,
}
class SinkError(Exception):
"""Raised by a sink when delivery fails for a known reason."""
class Sink:
"""Base class for a delivery target.
Subclasses set ``name``/``aliases``/``requires`` and implement
:meth:`deliver`. ``requires`` lists the environment variables that must be
present for the sink to be usable.
"""
name: str = "base"
aliases: tuple = ()
requires: tuple = () # required env vars
label: str = "" # human description
local: bool = False # filesystem-only, no network/auth
def env(self, key: str, default: Optional[str] = None) -> Optional[str]:
value = os.environ.get(key, default)
return value.strip() if isinstance(value, str) else value
def missing_config(self) -> list:
return [k for k in self.requires if not self.env(k)]
def is_configured(self) -> bool:
return not self.missing_config()
def deliver(self, doc: ParsedDoc) -> SinkResult: # pragma: no cover - abstract
raise NotImplementedError
# --------------------------------------------------------------------------- #
# Registry
# --------------------------------------------------------------------------- #
REGISTRY: dict = {}
def register(cls):
"""Class decorator that instantiates a sink and registers it by name+aliases."""
inst = cls()
REGISTRY[inst.name] = inst
for alias in inst.aliases:
REGISTRY[alias] = inst
return cls
def get_sink(name: str) -> Optional[Sink]:
return REGISTRY.get(name.lower())
def sink_names() -> list:
"""Canonical sink names (no aliases), sorted."""
return sorted({s.name for s in REGISTRY.values()})

View File

@ -0,0 +1,72 @@
"""Coda sink: deliver Markdown as a page, into an existing doc or a new one.
Coda's API (``https://coda.io/apis/v1``) authenticates with a Bearer token.
Markdown is delivered as canvas page content. If ``CODA_DOC_ID`` is set, a new
page is added to that doc; otherwise a new doc is created with the content as its
initial page.
Coda canvas content embeds images by URL only, so local image refs are left
untouched host images at a public URL for them to render.
"""
from __future__ import annotations
from pathlib import Path
from . import _http, _md
from .base import ParsedDoc, Sink, SinkError, SinkResult, register
API = "https://coda.io/apis/v1"
def _canvas(markdown: str) -> dict:
return {"type": "canvas", "canvasContent": {"format": "markdown", "content": markdown}}
@register
class CodaSink(Sink):
name = "coda"
requires = ("CODA_API_TOKEN",)
label = "Coda page (REST API)"
def deliver(self, doc: ParsedDoc) -> SinkResult:
token = self.env("CODA_API_TOKEN")
doc_id = self.env("CODA_DOC_ID")
headers = {
"Authorization": f"Bearer {token}",
"Content-Type": "application/json",
}
base_dir = Path(doc.markdown_path).parent if doc.markdown_path else None
n_images = len(_md.find_local_images(doc.markdown, base_dir))
if doc_id:
status, parsed = _http.request_json(
"POST", f"{API}/docs/{doc_id}/pages", headers=headers, payload={
"name": doc.title,
"pageContent": _canvas(doc.markdown),
},
)
else:
status, parsed = _http.request_json(
"POST", f"{API}/docs", headers=headers, payload={
"title": doc.title,
"initialPage": {
"name": doc.title,
"pageContent": _canvas(doc.markdown),
},
},
)
if status >= 400:
raise SinkError(parsed.get("message") or f"HTTP {status}")
if n_images:
detail = f"text only ({n_images} local image(s); Coda embeds images by URL)"
else:
detail = "text only"
return SinkResult(
sink=self.name, ok=True,
url=parsed.get("browserLink"),
detail=detail,
)

View File

@ -0,0 +1,66 @@
"""Confluence sink: create a page from the parsed Markdown via the Cloud REST API.
Confluence Cloud ingests content as *storage-format* HTML. Delivery converts the
Markdown to HTML and creates a page with the v2 REST API
(``POST /wiki/api/v2/pages``) using Basic auth (email + API token).
Local images are not attached Confluence storage HTML references attachments by
filename, which would require a separate upload step.
"""
from __future__ import annotations
import base64
from . import _http, _md
from .base import ParsedDoc, Sink, SinkError, SinkResult, register
@register
class ConfluenceSink(Sink):
name = "confluence"
requires = (
"CONFLUENCE_BASE_URL",
"CONFLUENCE_EMAIL",
"CONFLUENCE_API_TOKEN",
"CONFLUENCE_SPACE_ID",
)
label = "Confluence Cloud page (storage HTML)"
def deliver(self, doc: ParsedDoc) -> SinkResult:
base = self.env("CONFLUENCE_BASE_URL").rstrip("/")
email = self.env("CONFLUENCE_EMAIL")
token = self.env("CONFLUENCE_API_TOKEN")
space = self.env("CONFLUENCE_SPACE_ID")
auth = base64.b64encode(f"{email}:{token}".encode("utf-8")).decode("ascii")
headers = {
"Authorization": f"Basic {auth}",
"Content-Type": "application/json",
}
html = _md.md_to_html(doc.markdown)
status, parsed = _http.request_json(
"POST",
f"{base}/wiki/api/v2/pages",
headers=headers,
payload={
"spaceId": space,
"status": "current",
"title": doc.title,
"body": {"representation": "storage", "value": html},
},
)
if status >= 400:
raise SinkError(
parsed.get("title")
or parsed.get("message")
or f"Confluence HTTP {status}"
)
webui = (parsed.get("_links") or {}).get("webui")
url = base + webui if webui else None
return SinkResult(
sink=self.name, ok=True, url=url,
detail="converted Markdown->storage HTML (local images not attached)",
)

View File

@ -0,0 +1,65 @@
"""DingTalk (钉钉) sink — push parsed Markdown as a robot markdown message.
A DingTalk custom robot accepts a ``markdown`` message type. The official native
ingestion path is therefore a webhook POST carrying the document title and body.
When a signing secret is configured the request is HMAC-SHA256 signed per
DingTalk's spec. DingTalk's markdown renderer only fetches images over public
URLs, so local images won't render.
Docs: https://open.dingtalk.com/document/robots/custom-robot-access.
"""
from __future__ import annotations
import base64
import hashlib
import hmac
import time
import urllib.parse
from . import _http
from .base import ParsedDoc, Sink, SinkError, SinkResult, register
@register
class DingTalkSink(Sink):
name = "dingtalk"
aliases = ("钉钉",)
requires = ("DINGTALK_WEBHOOK",)
label = "DingTalk robot markdown (钉钉)"
def _build_url(self) -> str:
webhook = self.env("DINGTALK_WEBHOOK")
if webhook.startswith("http"):
url = webhook
else:
url = f"https://oapi.dingtalk.com/robot/send?access_token={webhook}"
secret = self.env("DINGTALK_SECRET")
if secret:
timestamp = str(round(time.time() * 1000))
string_to_sign = f"{timestamp}\n{secret}"
hmac_code = hmac.new(
secret.encode(), string_to_sign.encode(), hashlib.sha256
).digest()
sign = urllib.parse.quote_plus(base64.b64encode(hmac_code))
url += f"&timestamp={timestamp}&sign={sign}"
return url
def deliver(self, doc: ParsedDoc) -> SinkResult:
url = self._build_url()
payload = {
"msgtype": "markdown",
"markdown": {"title": doc.title, "text": doc.markdown},
}
status, parsed = _http.request_json("POST", url, payload=payload)
if parsed.get("errcode") not in (0, None):
raise SinkError(parsed.get("errmsg") or f"DingTalk HTTP {status}: {parsed}")
return SinkResult(
sink=self.name,
ok=True,
url=None,
detail="robot markdown message (local images won't render; host publicly)",
)

View File

@ -0,0 +1,124 @@
"""Feishu / Lark sink: import the parsed Markdown as a Docx document.
Feishu (飞书) / Lark ingests Markdown through its Drive import pipeline. Delivery
follows that official path:
1. ``tenant_access_token/internal`` exchange the app id/secret for a tenant
access token.
2. ``drive/v1/medias/upload_all`` upload the ``.md`` bytes as an import medium
and obtain a ``file_token``.
3. ``drive/v1/import_tasks`` kick off an import task converting the medium to a
Docx, returning a ``ticket``.
4. Poll ``drive/v1/import_tasks/{ticket}`` until the job finishes, surfacing the
resulting document URL.
Local images are not uploaded they would need public URLs to render in Docx.
"""
from __future__ import annotations
import json
import time
from . import _http, _md
from .base import ParsedDoc, Sink, SinkError, SinkResult, register
@register
class FeishuSink(Sink):
name = "feishu"
aliases = ("lark", "飞书")
requires = ("FEISHU_APP_ID", "FEISHU_APP_SECRET")
label = "Feishu / Lark Docx (Drive import)"
def deliver(self, doc: ParsedDoc) -> SinkResult:
app_id = self.env("FEISHU_APP_ID")
app_secret = self.env("FEISHU_APP_SECRET")
folder_token = self.env("FEISHU_FOLDER_TOKEN")
# Step 1: tenant access token.
status, parsed = _http.request_json(
"POST",
"https://open.feishu.cn/open-apis/auth/v3/tenant_access_token/internal",
payload={"app_id": app_id, "app_secret": app_secret},
)
token = parsed.get("tenant_access_token")
if parsed.get("code") not in (0, None) or not token:
raise SinkError(parsed.get("msg") or f"Feishu auth failed (HTTP {status})")
headers = {"Authorization": f"Bearer {token}"}
# Step 2: upload the Markdown bytes as an import medium.
content = doc.markdown.encode("utf-8")
fname = _md.safe_filename(doc.title) + ".md"
ctype, body = _http.encode_multipart(
fields={
"file_name": fname,
"parent_type": "ccm_import_open",
"size": str(len(content)),
"extra": json.dumps({"obj_type": "docx", "file_extension": "md"}),
},
files=[("file", fname, content)],
)
up_status, raw = _http.http_request(
"POST",
"https://open.feishu.cn/open-apis/drive/v1/medias/upload_all",
headers={**headers, "Content-Type": ctype},
data=body,
)
parsed = _parse_json(raw)
if parsed.get("code") not in (0, None):
raise SinkError(parsed.get("msg") or f"Feishu media upload failed (HTTP {up_status})")
file_token = (parsed.get("data") or {}).get("file_token")
if not file_token:
raise SinkError("Feishu did not return a file_token")
# Step 3: create the import task.
status, parsed = _http.request_json(
"POST",
"https://open.feishu.cn/open-apis/drive/v1/import_tasks",
headers=headers,
payload={
"file_extension": "md",
"file_token": file_token,
"type": "docx",
"file_name": doc.title,
"point": {"mount_type": 1, "mount_key": folder_token or ""},
},
)
if parsed.get("code") not in (0, None):
raise SinkError(parsed.get("msg") or f"Feishu import task failed (HTTP {status})")
ticket = (parsed.get("data") or {}).get("ticket")
if not ticket:
raise SinkError("Feishu did not return an import ticket")
# Step 4: poll until the import job completes.
url = None
for _attempt in range(20):
status, parsed = _http.request_json(
"GET",
f"https://open.feishu.cn/open-apis/drive/v1/import_tasks/{ticket}",
headers=headers,
)
res = (parsed.get("data") or {}).get("result") or {}
job_status = res.get("job_status")
if job_status == 0:
url = res.get("url")
break
if job_status in (1, 2):
time.sleep(1)
continue
raise SinkError(res.get("job_error_msg") or "Feishu import failed")
return SinkResult(
sink=self.name, ok=True, url=url,
detail="imported to Feishu Docx (local images need public URLs)",
)
def _parse_json(raw):
if not raw:
return {}
try:
return json.loads(raw.decode("utf-8"))
except (ValueError, UnicodeDecodeError):
return {}

View File

@ -0,0 +1,75 @@
"""Linear sink: create an issue from Markdown via the GraphQL API.
Linear's API is GraphQL at ``https://api.linear.app/graphql`` and authenticates
with a raw API key in the ``Authorization`` header (no ``Bearer`` prefix). The
issue description is Markdown; Linear renders inline ``data:`` image URIs, so
local images are read and embedded as base64 data URIs before delivery.
"""
from __future__ import annotations
import base64
from pathlib import Path
from . import _http, _md
from .base import ParsedDoc, Sink, SinkError, SinkResult, register
API = "https://api.linear.app/graphql"
_MUTATION = (
"mutation IssueCreate($input: IssueCreateInput!)"
"{issueCreate(input:$input){success issue{id url identifier}}}"
)
_MIME = {
".png": "image/png",
".jpg": "image/jpeg",
".jpeg": "image/jpeg",
".gif": "image/gif",
".webp": "image/webp",
}
def _data_uri(path: Path) -> str:
mime = _MIME.get(path.suffix.lower(), "image/png")
b64 = base64.b64encode(path.read_bytes()).decode("ascii")
return f"data:{mime};base64,{b64}"
@register
class LinearSink(Sink):
name = "linear"
requires = ("LINEAR_API_KEY", "LINEAR_TEAM_ID")
label = "Linear issue (GraphQL API)"
def deliver(self, doc: ParsedDoc) -> SinkResult:
key = self.env("LINEAR_API_KEY")
team = self.env("LINEAR_TEAM_ID")
headers = {"Authorization": key, "Content-Type": "application/json"}
base_dir = Path(doc.markdown_path).parent if doc.markdown_path else None
images = _md.find_local_images(doc.markdown, base_dir)
mapping = {ref: _data_uri(path) for _alt, ref, path in images}
body = _md.rewrite_images(doc.markdown, mapping)
status, parsed = _http.request_json("POST", API, headers=headers, payload={
"query": _MUTATION,
"variables": {"input": {
"teamId": team,
"title": doc.title,
"description": body,
}},
})
if parsed.get("errors"):
raise SinkError(str(parsed["errors"]))
result = ((parsed.get("data") or {}).get("issueCreate")) or {}
if not result.get("success"):
raise SinkError(f"Linear did not create the issue (HTTP {status})")
issue = result.get("issue") or {}
return SinkResult(
sink=self.name, ok=True,
url=issue.get("url"),
detail=f"{len(mapping)} image(s) inlined",
)

View File

@ -0,0 +1,105 @@
"""Local-first sinks: Obsidian and Logseq (filesystem writes, no auth).
Both tools are folders of Markdown files. The native ingestion is a filesystem
write following each tool's conventions:
* Obsidian a flat note with YAML frontmatter; images in a per-note assets
folder, referenced with relative Markdown embeds.
* Logseq an outline (every line a ``- `` block) with ``key:: value`` page
properties on the first block; images in ``assets/`` referenced as
``![](../assets/x.png)``.
"""
from __future__ import annotations
from pathlib import Path
from . import _md
from .base import ParsedDoc, Sink, SinkError, SinkResult, register
def _copy_images(doc: ParsedDoc, dest_dir: Path, ref_prefix: str) -> dict:
"""Copy referenced local images into ``dest_dir``; return ``{old_ref: new_ref}``."""
base = Path(doc.markdown_path).parent if doc.markdown_path else None
mapping = {}
images = _md.find_local_images(doc.markdown, base)
if images:
dest_dir.mkdir(parents=True, exist_ok=True)
for _alt, ref, path in images:
target = dest_dir / path.name
target.write_bytes(path.read_bytes())
mapping[ref] = f"{ref_prefix}{path.name}"
return mapping
@register
class ObsidianSink(Sink):
name = "obsidian"
aliases = ("ob",)
requires = ("OBSIDIAN_VAULT",)
label = "Obsidian vault (local Markdown)"
local = True
def deliver(self, doc: ParsedDoc) -> SinkResult:
vault = Path(self.env("OBSIDIAN_VAULT")).expanduser()
if not vault.is_dir():
raise SinkError(f"Obsidian vault not found: {vault}")
subdir = self.env("OBSIDIAN_SUBDIR", "") or ""
note_dir = vault / subdir if subdir else vault
note_dir.mkdir(parents=True, exist_ok=True)
stem = _md.safe_filename(doc.title)
assets = note_dir / f"{stem}.assets"
mapping = _copy_images(doc, assets, f"{stem}.assets/")
body = _md.rewrite_images(doc.markdown, mapping)
front = _md.yaml_frontmatter({
"title": doc.title,
"source": doc.source,
"modality": doc.modality,
"tags": ["mineru", "parsed"],
})
note_path = note_dir / f"{stem}.md"
note_path.write_text(f"{front}\n\n{body}\n", encoding="utf-8")
return SinkResult(sink=self.name, ok=True, url=str(note_path),
detail=f"{len(mapping)} image(s)")
@register
class LogseqSink(Sink):
name = "logseq"
requires = ("LOGSEQ_GRAPH",)
label = "Logseq graph (local outline)"
local = True
def deliver(self, doc: ParsedDoc) -> SinkResult:
graph = Path(self.env("LOGSEQ_GRAPH")).expanduser()
if not graph.is_dir():
raise SinkError(f"Logseq graph not found: {graph}")
pages = graph / "pages"
assets = graph / "assets"
pages.mkdir(parents=True, exist_ok=True)
stem = _md.safe_filename(doc.title)
# Namespace asset names by page slug to avoid collisions in the shared assets/.
prefix = _md.slugify(doc.title)
mapping = {}
base = Path(doc.markdown_path).parent if doc.markdown_path else None
images = _md.find_local_images(doc.markdown, base)
if images:
assets.mkdir(parents=True, exist_ok=True)
for _alt, ref, path in images:
new_name = f"{prefix}-{path.name}"
(assets / new_name).write_bytes(path.read_bytes())
mapping[ref] = f"../assets/{new_name}"
body = _md.rewrite_images(doc.markdown, mapping)
outline = _md.md_to_logseq(body, properties={
"title": doc.title,
"source": doc.source,
"tags": "mineru, parsed",
})
page_path = pages / f"{stem}.md"
page_path.write_text(outline + "\n", encoding="utf-8")
return SinkResult(sink=self.name, ok=True, url=str(page_path),
detail=f"{len(mapping)} image(s)")

View File

@ -0,0 +1,130 @@
"""Notion sink: create a page under a parent page from Markdown blocks.
Notion's native ingestion is the block API: each Markdown line becomes a typed
block (heading, quote, code, list item, paragraph). A page is created with up to
100 children inline; any remainder is appended in 100-block chunks via the
``/blocks/{id}/children`` PATCH endpoint.
Notion has no inline image-from-bytes path (images must be uploaded or hosted
separately), so local image refs are intentionally left untouched.
"""
from __future__ import annotations
from pathlib import Path
from . import _http, _md
from .base import ParsedDoc, Sink, SinkError, SinkResult, register
API = "https://api.notion.com/v1"
MAX_BLOCKS = 100
MAX_TEXT = 2000
def _rich(text: str) -> list:
return [{"type": "text", "text": {"content": text[:MAX_TEXT]}}]
def _block(block_type: str, text: str, **extra) -> dict:
inner = {"rich_text": _rich(text)}
inner.update(extra)
return {"object": "block", "type": block_type, block_type: inner}
def _is_numbered(text: str) -> bool:
head = text.split(".", 1)
return len(head) == 2 and head[0].isdigit() and head[1].startswith(" ")
def _blocks(markdown: str) -> list:
"""Convert flat Markdown lines into a list of Notion block dicts."""
blocks = []
in_code = False
code_buf: list = []
for raw in markdown.replace("\r\n", "\n").split("\n"):
stripped = raw.strip()
if stripped.startswith("```"):
if in_code:
blocks.append(_block("code", "\n".join(code_buf), language="plain text"))
in_code = False
code_buf = []
else:
in_code = True
code_buf = []
continue
if in_code:
code_buf.append(raw)
continue
if not stripped:
continue
if stripped.startswith("# "):
blocks.append(_block("heading_1", stripped[2:].strip()))
elif stripped.startswith("## "):
blocks.append(_block("heading_2", stripped[3:].strip()))
elif stripped.startswith("### "):
blocks.append(_block("heading_3", stripped[4:].strip()))
elif stripped.startswith("> "):
blocks.append(_block("quote", stripped[2:].strip()))
elif stripped.startswith("- ") or stripped.startswith("* "):
blocks.append(_block("bulleted_list_item", stripped[2:].strip()))
elif _is_numbered(stripped):
blocks.append(_block("numbered_list_item", stripped.split(".", 1)[1].strip()))
else:
blocks.append(_block("paragraph", stripped))
if in_code:
blocks.append(_block("code", "\n".join(code_buf), language="plain text"))
return blocks
@register
class NotionSink(Sink):
name = "notion"
requires = ("NOTION_API_KEY", "NOTION_PARENT_PAGE_ID")
label = "Notion page (blocks API)"
def deliver(self, doc: ParsedDoc) -> SinkResult:
key = self.env("NOTION_API_KEY")
parent = self.env("NOTION_PARENT_PAGE_ID")
version = self.env("NOTION_VERSION", "2022-06-28") or "2022-06-28"
headers = {
"Authorization": f"Bearer {key}",
"Notion-Version": version,
"Content-Type": "application/json",
}
# Count local images for the detail note (refs are left as-is).
base_dir = Path(doc.markdown_path).parent if doc.markdown_path else None
n_images = len(_md.find_local_images(doc.markdown, base_dir))
blocks = _blocks(doc.markdown)
status, parsed = _http.request_json("POST", f"{API}/pages", headers=headers, payload={
"parent": {"page_id": parent},
"properties": {"title": {"title": [{"text": {"content": doc.title}}]}},
"children": blocks[:MAX_BLOCKS],
})
if parsed.get("object") == "error":
raise SinkError(parsed.get("message") or f"Notion API error (HTTP {status})")
created_id = parsed.get("id")
if not created_id:
raise SinkError(f"Notion did not return a page id (HTTP {status})")
page_url = parsed.get("url")
for start in range(MAX_BLOCKS, len(blocks), MAX_BLOCKS):
chunk = blocks[start:start + MAX_BLOCKS]
ch_status, ch_parsed = _http.request_json(
"PATCH", f"{API}/blocks/{created_id}/children",
headers=headers, payload={"children": chunk},
)
if ch_parsed.get("object") == "error":
raise SinkError(ch_parsed.get("message")
or f"Notion block append failed (HTTP {ch_status})")
if n_images:
detail = (f"text+structure ({n_images} local images not embedded; "
f"Notion needs file upload)")
else:
detail = "text+structure"
return SinkResult(sink=self.name, ok=True, url=page_url, detail=detail)

View File

@ -0,0 +1,66 @@
"""OneNote sink: create a page from the parsed Markdown via Microsoft Graph.
OneNote pages are created by POSTing an HTML document to a section's ``pages``
endpoint with a pre-obtained Microsoft Graph access token (OAuth). Delivery
converts the Markdown to a full HTML document and creates the page.
Only remote images render Graph fetches ``<img src>`` URLs, so local image
paths emitted by MinerU would need to be public URLs.
"""
from __future__ import annotations
import html
import json
from . import _http, _md
from .base import ParsedDoc, Sink, SinkError, SinkResult, register
@register
class OneNoteSink(Sink):
name = "onenote"
aliases = ("msonenote",)
requires = ("ONENOTE_TOKEN", "ONENOTE_SECTION_ID")
label = "OneNote section page (Microsoft Graph)"
def deliver(self, doc: ParsedDoc) -> SinkResult:
token = self.env("ONENOTE_TOKEN")
section = self.env("ONENOTE_SECTION_ID")
body_html = _md.md_to_html(doc.markdown)
page = (
"<!DOCTYPE html><html><head>"
f"<title>{html.escape(doc.title)}</title>"
f"</head><body>{body_html}</body></html>"
)
status, raw = _http.http_request(
"POST",
f"https://graph.microsoft.com/v1.0/me/onenote/sections/{section}/pages",
headers={
"Authorization": f"Bearer {token}",
"Content-Type": "text/html",
},
data=page.encode("utf-8"),
)
if status >= 400:
preview = raw.decode("utf-8", "replace") if raw else ""
raise SinkError(f"OneNote HTTP {status}: {preview[:200]}")
if status != 201:
raise SinkError(f"OneNote unexpected response (HTTP {status})")
parsed = {}
if raw:
try:
parsed = json.loads(raw.decode("utf-8"))
except (ValueError, UnicodeDecodeError):
parsed = {}
links = parsed.get("links") or {}
web = links.get("oneNoteWebUrl") or {}
url = web.get("href")
return SinkResult(
sink=self.name, ok=True, url=url,
detail="converted Markdown->HTML (remote images only; OAuth token required)",
)

View File

@ -0,0 +1,106 @@
"""Roam Research sink — optional dependency.
There is no library that ingests a Markdown document into Roam, but the official
``roam-client`` SDK correctly handles the parts that are easy to get wrong the
307/308 peer-host redirect, the dual ``Authorization`` / ``x-authorization``
Bearer headers, and the ``/write`` plumbing. So we lazily depend on it for
transport and only build the Markdown block-tree ourselves, delivering the whole
document in a single ``batch-actions`` request (one HTTP round-trip).
Install the SDK (git-only, not on PyPI; needs Python 3.11):
pip install "roam-client @ git+https://github.com/Roam-Research/backend-sdks.git#subdirectory=python"
Config: ``ROAM_API_TOKEN`` (graph edit token), ``ROAM_GRAPH_NAME``.
"""
from __future__ import annotations
import itertools
import re
from .base import ParsedDoc, Sink, SinkError, SinkResult, register
_HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
_INSTALL_HINT = (
'Roam sink needs the official SDK — pip install '
'"roam-client @ git+https://github.com/Roam-Research/backend-sdks.git#subdirectory=python"'
)
def md_to_roam_tree(markdown: str) -> list:
"""Convert Markdown into a nested Roam block tree.
Headings become parent blocks (``heading`` 13); the lines under a heading
nest beneath it. Returns ``[{"string", "heading"?, "children": [...]}, ...]``.
"""
roots: list = []
stack: list = [] # [(heading_level, node)]
for raw in markdown.replace("\r\n", "\n").split("\n"):
line = raw.strip()
if not line:
continue
match = _HEADING.match(line)
if match:
level = len(match.group(1))
node = {"string": match.group(2), "heading": min(level, 3), "children": []}
while stack and stack[-1][0] >= level:
stack.pop()
(stack[-1][1]["children"] if stack else roots).append(node)
stack.append((level, node))
else:
node = {"string": line, "children": []}
(stack[-1][1]["children"] if stack else roots).append(node)
return roots
def tree_to_actions(children: list, parent_uid: str, uidgen) -> list:
"""Flatten a block tree into ``create-block`` actions for one batch request."""
actions: list = []
for order, node in enumerate(children):
uid = uidgen()
block = {"string": node["string"], "uid": uid}
if node.get("heading"):
block["heading"] = node["heading"]
actions.append({
"action": "create-block",
"location": {"parent-uid": parent_uid, "order": order},
"block": block,
})
actions.extend(tree_to_actions(node.get("children", []), uid, uidgen))
return actions
@register
class RoamSink(Sink):
name = "roam"
aliases = ("roamresearch",)
requires = ("ROAM_API_TOKEN", "ROAM_GRAPH_NAME")
label = "Roam Research (batch-actions, optional dep)"
def deliver(self, doc: ParsedDoc) -> SinkResult:
try:
from roam_client.client import create_page, initialize_graph
except ImportError as exc: # pragma: no cover - exercised via SinkError path
raise SinkError(_INSTALL_HINT) from exc
token = self.env("ROAM_API_TOKEN")
graph = self.env("ROAM_GRAPH_NAME")
client = initialize_graph({"token": token, "graph": graph})
create_page(client, {"page": {"title": doc.title}})
counter = itertools.count(1)
actions = tree_to_actions(
md_to_roam_tree(doc.markdown), doc.title, lambda: f"mu{next(counter):07d}"
)
if actions:
client.call(
f"/api/graph/{graph}/write", "POST",
{"action": "batch-actions", "actions": actions},
)
return SinkResult(
sink=self.name, ok=True,
url=f"https://roamresearch.com/#/app/{graph}",
detail=f"{len(actions)} block(s) via batch-actions (images need public URLs)",
)

View File

@ -0,0 +1,111 @@
"""SiYuan sink: create a new document from Markdown via the local kernel API.
SiYuan (思源笔记) exposes a kernel HTTP API (default ``http://127.0.0.1:6806``)
authenticated with an API token. Delivery follows SiYuan's native ingestion path:
1. Resolve the target notebook (``SIYUAN_NOTEBOOK`` or the first listed notebook).
2. Upload each referenced local image via ``/api/asset/upload`` and rewrite the
Markdown to point at the returned ``assets/...`` paths.
3. Create the document with ``/api/filetree/createDocWithMd``.
Every kernel response wraps its payload as ``{"code": 0, "msg": "", "data": ...}``;
a non-zero ``code`` is an error.
"""
from __future__ import annotations
import json
from pathlib import Path
from . import _http, _md
from .base import ParsedDoc, Sink, SinkError, SinkResult, register
@register
class SiYuanSink(Sink):
name = "siyuan"
requires = ("SIYUAN_TOKEN",)
label = "SiYuan notebook (local kernel API)"
def _json_post(self, base: str, path: str, headers: dict, payload: dict):
"""POST JSON; return ``data`` after verifying ``code == 0``."""
try:
status, parsed = _http.request_json("POST", f"{base}{path}",
headers=headers, payload=payload)
except Exception as exc: # noqa: BLE001
raise self._unreachable(base, exc) from exc
return self._unwrap(base, status, parsed)
def _upload_post(self, base: str, headers: dict, content_type: str, body: bytes):
"""POST a multipart body; return ``data`` after verifying ``code == 0``."""
hdrs = dict(headers)
hdrs["Content-Type"] = content_type
try:
status, raw = _http.http_request("POST", f"{base}/api/asset/upload",
headers=hdrs, data=body)
except Exception as exc: # noqa: BLE001
raise self._unreachable(base, exc) from exc
parsed: dict = {}
if raw:
try:
parsed = json.loads(raw.decode("utf-8"))
except (ValueError, UnicodeDecodeError):
parsed = {}
return self._unwrap(base, status, parsed)
@staticmethod
def _unreachable(base: str, exc=None) -> SinkError:
suffix = f" ({exc})" if exc else ""
return SinkError(
f"SiYuan kernel not reachable at {base} — start SiYuan and enable "
f"the API token{suffix}"
)
def _unwrap(self, base: str, status: int, parsed: dict):
if status == 0:
raise self._unreachable(base)
if parsed.get("code") != 0:
raise SinkError(parsed.get("msg") or f"SiYuan API error (HTTP {status})")
return parsed.get("data")
def deliver(self, doc: ParsedDoc) -> SinkResult:
base = (self.env("SIYUAN_API_URL", "http://127.0.0.1:6806")
or "http://127.0.0.1:6806").rstrip("/")
token = self.env("SIYUAN_TOKEN")
headers = {"Authorization": f"Token {token}"}
notebook = self.env("SIYUAN_NOTEBOOK")
if not notebook:
data = self._json_post(base, "/api/notebook/lsNotebooks", headers, {})
notebooks = (data or {}).get("notebooks") or []
if not notebooks:
raise SinkError("SiYuan has no notebooks — create one before delivering")
notebook = notebooks[0]["id"]
base_dir = Path(doc.markdown_path).parent if doc.markdown_path else None
images = _md.find_local_images(doc.markdown, base_dir)
mapping = {}
for _alt, ref, path in images:
content_type, body = _http.encode_multipart(
fields={"assetsDirPath": "/assets/"},
files=[("file[]", path.name, path.read_bytes())],
)
data = self._upload_post(base, headers, content_type, body)
succ_map = (data or {}).get("succMap") or {}
if path.name in succ_map:
mapping[ref] = succ_map[path.name]
body_md = _md.rewrite_images(doc.markdown, mapping)
docid = self._json_post(base, "/api/filetree/createDocWithMd", headers, {
"notebook": notebook,
"path": "/" + _md.safe_filename(doc.title),
"markdown": body_md,
})
if not docid:
raise SinkError("SiYuan did not return a document id")
return SinkResult(
sink=self.name, ok=True,
url=f"siyuan://blocks/{docid}",
detail=f"{len(mapping)} image(s)",
)

View File

@ -0,0 +1,95 @@
"""Slack sink: upload the parsed Markdown as a file via the external-upload flow.
Slack deprecated ``files.upload`` (retired) in favour of a three-step external
upload. Delivery follows that official path:
1. ``files.getUploadURLExternal`` reserve an upload URL + file id for the
given filename and byte length.
2. ``POST`` the raw bytes to the returned upload URL.
3. ``files.completeUploadExternal`` finalize the upload, attach it to the
target channel, and post an initial comment.
Images are *not* embedded: Markdown is uploaded as a single ``.md`` file.
"""
from __future__ import annotations
import urllib.parse
from . import _http, _md
from .base import ParsedDoc, Sink, SinkError, SinkResult, register
@register
class SlackSink(Sink):
name = "slack"
requires = ("SLACK_BOT_TOKEN", "SLACK_CHANNEL")
label = "Slack channel (file upload)"
def deliver(self, doc: ParsedDoc) -> SinkResult:
token = self.env("SLACK_BOT_TOKEN")
channel = self.env("SLACK_CHANNEL")
auth = {"Authorization": f"Bearer {token}"}
content = doc.markdown.encode("utf-8")
filename = _md.slugify(doc.title) + ".md"
# Step 1: reserve an external upload URL + file id. This endpoint wants
# form-encoded data, so use http_request and parse the JSON response.
form = urllib.parse.urlencode({
"filename": filename,
"length": len(content),
}).encode("utf-8")
status, raw = _http.http_request(
"POST",
"https://slack.com/api/files.getUploadURLExternal",
headers={**auth, "Content-Type": "application/x-www-form-urlencoded"},
data=form,
)
parsed = _parse_json(raw)
if not parsed.get("ok"):
raise SinkError(parsed.get("error") or f"Slack getUploadURLExternal failed (HTTP {status})")
upload_url = parsed.get("upload_url")
file_id = parsed.get("file_id")
if not upload_url or not file_id:
raise SinkError("Slack did not return an upload URL / file id")
# Step 2: upload the raw bytes to the reserved URL.
up_status, _up_body = _http.http_request(
"POST", upload_url,
headers={"Content-Type": "application/octet-stream"},
data=content,
)
if up_status != 200:
raise SinkError(f"Slack file upload failed (HTTP {up_status})")
# Step 3: finalize the upload into the channel.
status, parsed = _http.request_json(
"POST",
"https://slack.com/api/files.completeUploadExternal",
headers=auth,
payload={
"files": [{"id": file_id, "title": doc.title}],
"channel_id": channel,
"initial_comment": f"Parsed: {doc.title}",
},
)
if not parsed.get("ok"):
raise SinkError(parsed.get("error") or f"Slack completeUploadExternal failed (HTTP {status})")
files = parsed.get("files") or [{}]
url = files[0].get("permalink")
return SinkResult(
sink=self.name, ok=True, url=url,
detail="uploaded .md file (images not embedded)",
)
def _parse_json(raw):
import json
if not raw:
return {}
try:
return json.loads(raw.decode("utf-8"))
except (ValueError, UnicodeDecodeError):
return {}

View File

@ -0,0 +1,48 @@
"""TickTick (滴答清单) sink — create a task from parsed Markdown.
TickTick's Open API exposes a task object whose ``content`` field holds the body
text. The official native ingestion path for arbitrary Markdown is therefore a
task: the document title becomes the task title and the Markdown becomes the
task content. Tasks have no attachment/inline-image surface, so local images are
not delivered.
Docs: https://developer.ticktick.com/docs (POST /open/v1/task).
"""
from __future__ import annotations
from . import _http
from .base import ParsedDoc, Sink, SinkError, SinkResult, register
API_URL = "https://api.ticktick.com/open/v1/task"
@register
class TickTickSink(Sink):
name = "ticktick"
aliases = ("dida", "滴答清单")
requires = ("TICKTICK_TOKEN",)
label = "TickTick task (滴答清单)"
def deliver(self, doc: ParsedDoc) -> SinkResult:
token = self.env("TICKTICK_TOKEN")
project_id = self.env("TICKTICK_PROJECT_ID")
payload = {"title": doc.title, "content": doc.markdown}
if project_id:
payload["projectId"] = project_id
headers = {"Authorization": f"Bearer {token}"}
status, parsed = _http.request_json("POST", API_URL, headers=headers, payload=payload)
if status >= 400:
raise SinkError(f"TickTick HTTP {status}: {parsed}")
if not parsed.get("id"):
raise SinkError(f"TickTick returned no task id: {parsed}")
return SinkResult(
sink=self.name,
ok=True,
url=None,
detail="task content (no inline images supported by TickTick)",
)

View File

@ -0,0 +1,60 @@
"""WeCom (企业微信 / WeChat Work) sink — send parsed Markdown as an app message.
WeCom apps deliver content via the message-send API. The native ingestion path
is a ``markdown`` message from a self-built app: first an access token is fetched
with the corp id + secret, then the message is posted. WeCom's markdown is a
limited subset with a 2048-byte content cap and no inline images, so the body is
truncated to fit.
Docs: https://developer.work.weixin.qq.com/document/path/90236 (message/send),
https://developer.work.weixin.qq.com/document/path/91039 (gettoken).
"""
from __future__ import annotations
from . import _http
from .base import ParsedDoc, Sink, SinkError, SinkResult, register
TOKEN_URL = "https://qyapi.weixin.qq.com/cgi-bin/gettoken"
SEND_URL = "https://qyapi.weixin.qq.com/cgi-bin/message/send"
@register
class WeComSink(Sink):
name = "wecom"
aliases = ("企业微信", "wechatwork")
requires = ("WECOM_CORPID", "WECOM_CORPSECRET", "WECOM_AGENTID")
label = "WeCom app markdown (企业微信)"
def deliver(self, doc: ParsedDoc) -> SinkResult:
corpid = self.env("WECOM_CORPID")
secret = self.env("WECOM_CORPSECRET")
agentid = self.env("WECOM_AGENTID")
touser = self.env("WECOM_TOUSER", "@all")
# Step 1: fetch an access token.
token_url = f"{TOKEN_URL}?corpid={corpid}&corpsecret={secret}"
status, parsed = _http.request_json("GET", token_url)
if parsed.get("errcode") not in (0, None) or not parsed.get("access_token"):
raise SinkError(parsed.get("errmsg") or f"WeCom token fetch failed: {parsed}")
token = parsed["access_token"]
# Step 2: send the markdown message.
send_url = f"{SEND_URL}?access_token={token}"
payload = {
"touser": touser,
"msgtype": "markdown",
"agentid": int(agentid),
"markdown": {"content": doc.markdown[:2048]},
}
status, parsed = _http.request_json("POST", send_url, payload=payload)
if parsed.get("errcode") not in (0, None):
raise SinkError(parsed.get("errmsg") or f"WeCom send failed: {parsed}")
return SinkResult(
sink=self.name,
ok=True,
url=None,
detail="markdown notification (WeCom markdown is a limited subset, "
"2048-byte cap, no inline images)",
)

View File

@ -0,0 +1,104 @@
"""WPS / 金山文档 (Kingsoft kdocs) sink — optional dependency.
The native ingestion path is: Markdown ``.docx`` upload to the kdocs cloud
appspace. There is no official Python SDK, so:
* MarkdownDOCX uses the maintained, pure-pip ``html-for-docx`` package
(reusing this project's Markdown→HTML), lazily imported so the core stays
zero-dependency. Install with ``pip install mineru-skill[wps]``.
* The kdocs WPS-2 request signing (plain SHA-1) and multipart upload are done
with the standard library small and fully documented.
Cloud upload requires an approved kdocs developer app (``WPS_APP_ID`` /
``WPS_APP_SECRET``) and a provisioned appspace; it is opt-in and surfaces the
raw kdocs error on failure. Docs: https://developer.kdocs.cn/server/guide/signature.html
"""
from __future__ import annotations
import email.utils
import hashlib
import io
import json
from . import _http, _md
from .base import ParsedDoc, Sink, SinkError, SinkResult, register
KDOCS_UPLOAD = "https://developer.kdocs.cn/api/v1/openapi/appspace/files/upload"
def _markdown_to_docx_bytes(markdown: str) -> bytes:
"""Convert Markdown → HTML → DOCX bytes via the optional html-for-docx lib."""
try:
from html4docx import HtmlToDocx # pip install html-for-docx
except ImportError as exc: # pragma: no cover - exercised via SinkError path
raise SinkError(
"WPS sink needs a Markdown→DOCX converter — "
"pip install 'mineru-skill[wps]' (i.e. pip install html-for-docx)"
) from exc
html = _md.md_to_html(markdown)
document = HtmlToDocx().parse_html_string(html)
buf = io.BytesIO()
document.save(buf)
return buf.getvalue()
def _wps2_headers(app_id: str, app_secret: str, body: bytes, content_type: str) -> dict:
"""Build kdocs WPS-2 auth headers.
signature = sha1(app_secret + content_md5 + content_type + date) hex.
Content-Md5 / Content-Type must match the exact wire body and header sent.
"""
content_md5 = hashlib.md5(body).hexdigest()
date = email.utils.formatdate(usegmt=True) # RFC1123 GMT
signature = hashlib.sha1(
(app_secret + content_md5 + content_type + date).encode("utf-8")
).hexdigest()
return {
"Date": date,
"Content-Md5": content_md5,
"Content-Type": content_type,
"Authorization": f"WPS-2:{app_id}:{signature}",
}
@register
class WpsSink(Sink):
name = "wps"
aliases = ("kdocs", "金山文档", "金山")
requires = ("WPS_APP_ID", "WPS_APP_SECRET")
label = "WPS / 金山文档 (Markdown→DOCX upload, optional dep)"
def deliver(self, doc: ParsedDoc) -> SinkResult:
app_id = self.env("WPS_APP_ID")
app_secret = self.env("WPS_APP_SECRET")
docx_bytes = _markdown_to_docx_bytes(doc.markdown)
filename = _md.safe_filename(doc.title) + ".docx"
fields = {}
parent_path = self.env("WPS_PARENT_PATH")
parent_token = self.env("WPS_PARENT_TOKEN")
if parent_path:
fields["parent_path"] = parent_path
if parent_token:
fields["parent_token"] = parent_token
content_type, body = _http.encode_multipart(
fields=fields, files=[("file", filename, docx_bytes)]
)
headers = _wps2_headers(app_id, app_secret, body, content_type)
status, raw = _http.http_request("POST", KDOCS_UPLOAD, headers=headers, data=body)
try:
parsed = json.loads(raw.decode("utf-8")) if raw else {}
except (ValueError, UnicodeDecodeError):
parsed = {}
if status >= 400 or parsed.get("code") not in (0, None):
raise SinkError(parsed.get("message") or parsed.get("msg") or f"kdocs HTTP {status}")
file_token = (parsed.get("data") or {}).get("file_token")
return SinkResult(
sink=self.name, ok=True, url=file_token,
detail="Markdown→DOCX uploaded to 金山文档 (experimental; needs a provisioned appspace)",
)

View File

@ -0,0 +1,65 @@
"""Yuque (语雀) sink: create a Markdown doc in a repository via the open API.
Yuque's open API (``https://www.yuque.com/api/v2``) authenticates with an
``X-Auth-Token`` header and creates docs under a repository namespace. The body
is posted as raw Markdown.
Yuque's open API has no asset-upload endpoint, so local image refs are left
untouched host images at a public URL for them to render.
"""
from __future__ import annotations
from pathlib import Path
from . import _http, _md
from .base import ParsedDoc, Sink, SinkError, SinkResult, register
API = "https://www.yuque.com/api/v2"
@register
class YuqueSink(Sink):
name = "yuque"
aliases = ("语雀",)
requires = ("YUQUE_TOKEN", "YUQUE_NAMESPACE")
label = "Yuque doc (open API)"
def deliver(self, doc: ParsedDoc) -> SinkResult:
token = self.env("YUQUE_TOKEN")
namespace = self.env("YUQUE_NAMESPACE")
headers = {
"X-Auth-Token": token,
"User-Agent": "MinerU-Skill/3.0",
"Content-Type": "application/json",
}
base_dir = Path(doc.markdown_path).parent if doc.markdown_path else None
n_images = len(_md.find_local_images(doc.markdown, base_dir))
status, parsed = _http.request_json(
"POST", f"{API}/repos/{namespace}/docs", headers=headers, payload={
"title": doc.title,
"slug": _md.slugify(doc.title),
"public": 0,
"format": "markdown",
"body": doc.markdown,
},
)
data = parsed.get("data")
if not data:
if status >= 400 or parsed.get("message"):
raise SinkError(parsed.get("message") or f"HTTP {status}")
raise SinkError(f"Yuque returned no doc data (HTTP {status})")
slug = data.get("slug")
if n_images:
detail = f"text only ({n_images} local image(s); host images publicly to embed)"
else:
detail = "text only"
return SinkResult(
sink=self.name, ok=True,
url=f"https://www.yuque.com/{namespace}/{slug}",
detail=detail,
)

View File

@ -0,0 +1,64 @@
"""Split oversized PDFs into cap-sized parts so they clear the MinerU API limits.
The MinerU cloud caps at 20 pages (free Agent API) / 200 pages (Standard API).
``--split`` slices a larger PDF into parts locally, each is parsed, and the
Markdown is merged back so we are no longer bound by those page caps (the same
trick mineru-converter uses). Uses the optional ``pypdf`` library, lazily
imported, so the core stays zero-dependency.
pip install "mineru-skill[split]" # i.e. pip install pypdf
"""
from __future__ import annotations
from pathlib import Path
class SplitError(Exception):
"""Raised when splitting is requested but cannot be performed."""
def _load_pypdf():
try:
import pypdf # noqa: F401
return pypdf
except ImportError as exc:
raise SplitError(
"--split needs the pypdf library — pip install 'mineru-skill[split]' "
"(i.e. pip install pypdf)"
) from exc
def pdf_page_count(path) -> int:
"""Return the page count of a local PDF (requires pypdf)."""
pypdf = _load_pypdf()
return len(pypdf.PdfReader(str(path)).pages)
def split_pdf(path, max_pages: int, out_dir) -> list:
"""Slice ``path`` into ``max_pages``-page parts under ``out_dir``.
Returns the list of part paths (a single-element list pointing at the original
file if it already fits).
"""
if max_pages < 1:
raise SplitError("max_pages must be >= 1")
pypdf = _load_pypdf()
reader = pypdf.PdfReader(str(path))
total = len(reader.pages)
if total <= max_pages:
return [Path(path)]
out_dir = Path(out_dir)
out_dir.mkdir(parents=True, exist_ok=True)
stem = Path(path).stem
parts = []
for part_index, start in enumerate(range(0, total, max_pages), start=1):
writer = pypdf.PdfWriter()
for page in range(start, min(start + max_pages, total)):
writer.add_page(reader.pages[page])
part_path = out_dir / f"{stem}__part{part_index:03d}.pdf"
with open(part_path, "wb") as handle:
writer.write(handle)
parts.append(part_path)
return parts