add mineru
parent
22b9ad4877
commit
b618cb12d2
49
skills/developing/mineru/SKILL.md
Normal file
@ -0,0 +1,49 @@
---
name: mineru
description: An AI-Native skill for parsing PDF / Office / image files into Markdown with MinerU — a fast, zero-config document parser for AI agents. Works with NO token via the Agent API and auto-upgrades to the Standard API (token) for large files, batches, and DOCX/HTML/LaTeX export. Use when converting PDF/Word/PPT/Excel/image documents, extracting text/tables/formulas, running OCR, or batch processing.
category: Document Processing
metadata:
author: Nebutra
version: "3.3.1"
argument-hint: <pdf-file-or-url>
---

# MinerU PDF Parser

Parse PDF, Office, and image documents into structured Markdown via the MinerU API.

## Quick Start

```bash
# Zero-config: no token, no install (free Agent API)
python3 "${CLAUDE_PLUGIN_ROOT}/scripts/mineru.py" ./document.pdf --output ./output/

# Pipe Markdown back to an agent
python3 "${CLAUDE_PLUGIN_ROOT}/scripts/mineru.py" ./document.pdf --stdout

# Power mode: token unlocks large files / batch / extra formats
export MINERU_TOKEN="..." # https://mineru.net/apiManage/token
python3 "${CLAUDE_PLUGIN_ROOT}/scripts/mineru.py" ./pdfs/ --output ./output/ --workers 8 --resume
```
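
When an agent drives the script from Python rather than a shell, the same invocations can be assembled as an argument list. This is a minimal sketch, not part of the skill: the helper names `build_mineru_command` and `parse_to_markdown` are hypothetical, and only the flags shown in the Quick Start (`--stdout`, `--output`) are used.

```python
import os
import subprocess
import sys


def build_mineru_command(source, stdout=False, output_dir=None, plugin_root=None):
    """Assemble the argv list for scripts/mineru.py (hypothetical helper)."""
    # Fall back to the CLAUDE_PLUGIN_ROOT variable used in the Quick Start.
    root = plugin_root or os.environ.get("CLAUDE_PLUGIN_ROOT", ".")
    script = os.path.join(root, "scripts", "mineru.py")
    cmd = [sys.executable, script, source]
    if stdout:
        cmd.append("--stdout")
    elif output_dir:
        cmd += ["--output", output_dir]
    return cmd


def parse_to_markdown(source):
    """Run the parser and return whatever Markdown it prints to stdout."""
    result = subprocess.run(
        build_mineru_command(source, stdout=True),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

Passing a list instead of a shell string avoids quoting problems with file names that contain spaces.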

## Features

- **Auto-routing**: free Agent API by default, auto-upgrades to the Standard API (token) for large/batch/extra-format jobs
- **Multi-modal**: PDF, images, Word, PPT, Excel, HTML
- **High-performance OCR**: `--ocr` with language selection (`--lang`)
- **Formula & table recognition**: LaTeX formulas, structured tables
- **Multi-format export**: Markdown (default), plus DOCX / HTML / LaTeX
- **AI-Native output**: `--stdout` (Markdown) and `--json` (machine status)
- **Batch + resume**: parallel workers with `--resume`
- **Zero dependencies**: standard library only

## Authentication

A token is **optional** — the Agent API works without one. Set a token to unlock
the Standard API (≤ 200 MB / ≤ 200 pages, batch, DOCX/HTML/LaTeX):

```bash
export MINERU_TOKEN="your-token-here" # https://mineru.net/apiManage/token
```

Official API docs: https://mineru.net/apiManage/docs
170
skills/developing/mineru/references/api_reference.md
Normal file
@ -0,0 +1,170 @@
# MinerU API Reference

Official docs: https://mineru.net/apiManage/docs · Token: https://mineru.net/apiManage/token

MinerU exposes **two** document-parsing APIs. This skill auto-routes between them.

| | 🎯 Standard API | ⚡ Agent API (lightweight) |
|---|---|---|
| Base URL | `https://mineru.net/api/v4` | `https://mineru.net/api/v1/agent` |
| Token | **required** (`Bearer`) | **none** (IP rate-limited) |
| Models | `pipeline` / `vlm` / `MinerU-HTML` | fixed lightweight `pipeline` |
| File size | ≤ 200 MB | ≤ 10 MB |
| Pages | ≤ 200 | ≤ 20 |
| Batch | ≤ 50 per request | single file only |
| Output | zip (Markdown + JSON, optional DOCX/HTML/LaTeX) | Markdown only (CDN link) |
| Designed for | high-accuracy / complex / batch | AI-agent / quick / no-login |

Free Standard-API quota: **1000 pages/day at highest priority** (overflow is lower priority).

---

## Authentication (Standard API)

```
Authorization: Bearer YOUR_API_TOKEN
```

Get a token at https://mineru.net/apiManage/token.

> **Response envelopes.** Business endpoints return `{"code":0,"data":{…},"msg":"ok"}`.
> The auth/gateway layer returns a *different* shape on failure:
> `{"success":false,"msgCode":"A0202","msg":"user authenticate failed"}`.
> Clients must handle both — this skill maps `msgCode` to the same error hints.
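
A client can fold both envelope shapes into one result before any error handling runs. The sketch below is illustrative only: `normalize_envelope` is a hypothetical name, not a function from this skill, and it assumes the two shapes are distinguishable by the presence of `success` or `msgCode`.

```python
def normalize_envelope(payload):
    """Return (ok, code, message, data) for either response shape.

    Assumption: gateway failures carry "success"/"msgCode", while business
    responses carry an integer "code" where 0 means success.
    """
    if "success" in payload or "msgCode" in payload:
        # Auth/gateway shape: {"success": false, "msgCode": "A0202", "msg": "..."}
        ok = bool(payload.get("success", False))
        return ok, payload.get("msgCode"), payload.get("msg", ""), None
    # Business shape: {"code": 0, "data": {...}, "msg": "ok"}
    code = payload.get("code")
    return code == 0, code, payload.get("msg", ""), payload.get("data")
```

With a single tuple shape, the caller can look up `code` in one error table regardless of which layer produced the failure.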

---

## Standard API endpoints (`/api/v4`)

### Single URL — `POST /extract/task`

```json
{
  "url": "https://example.com/doc.pdf",
  "model_version": "vlm",
  "is_ocr": false,
  "enable_formula": true,
  "enable_table": true,
  "language": "ch",
  "page_ranges": "1-10",
  "extra_formats": ["docx", "html"],
  "data_id": "my-document"
}
```
Response → `{ "code": 0, "data": { "task_id": "…" } }`. HTML inputs require `model_version: "MinerU-HTML"`.

### Get task result — `GET /extract/task/{task_id}`

```json
{ "code": 0, "data": { "task_id": "…", "state": "done", "full_zip_url": "https://…", "err_msg": "" } }
```

### Batch local upload — `POST /file-urls/batch`

Returns signed upload URLs; PUT each file (no `Content-Type`). Up to **50** files / request.

```json
{ "files": [ { "name": "doc.pdf", "data_id": "doc" } ], "model_version": "vlm" }
```
Response → `{ "code": 0, "data": { "batch_id": "…", "file_urls": ["https://…"] } }`.
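
Because a batch request accepts at most 50 files, a larger directory has to be split into several request bodies. The following is a rough sketch of that chunking; `build_batch_payloads` is a hypothetical helper, and deriving `data_id` from the file stem is an assumption (the stem must already satisfy the `data_id` character rules).

```python
from pathlib import Path

MAX_BATCH = 50  # per-request file limit stated above


def build_batch_payloads(paths, model_version="vlm"):
    """Split local paths into POST /file-urls/batch bodies of <= 50 files."""
    payloads = []
    for start in range(0, len(paths), MAX_BATCH):
        chunk = paths[start:start + MAX_BATCH]
        files = [
            # Assumption: the file stem is an acceptable data_id.
            {"name": Path(p).name, "data_id": Path(p).stem}
            for p in chunk
        ]
        payloads.append({"files": files, "model_version": model_version})
    return payloads
```

Each returned dictionary maps to one request. For the follow-up PUT, note that `urllib.request` adds a default `Content-Type` header to requests that carry a body, so a lower-level client such as `http.client` may be the safer choice when the header must be absent.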

### Batch URL — `POST /extract/task/batch`

```json
{ "files": [ { "url": "https://…/doc.pdf", "data_id": "doc" } ], "model_version": "vlm" }
```

### Batch results — `GET /extract-results/batch/{batch_id}`

```json
{ "code": 0, "data": { "batch_id": "…", "extract_result": [
  { "file_name": "doc.pdf", "state": "done", "full_zip_url": "https://…" }
] } }
```

---

## Agent API endpoints (`/api/v1/agent`) — no token

### URL — `POST /parse/url`

```json
{ "url": "https://…/doc.pdf", "language": "ch", "enable_table": true, "is_ocr": false, "enable_formula": true, "page_range": "1-10" }
```
`page_range` accepts `from-to` or a single page only (no commas). Returns `{ "code": 0, "data": { "task_id": "…" } }`.

### File — `POST /parse/file`

```json
{ "file_name": "doc.pdf", "language": "ch" }
```
Response → `{ "data": { "task_id": "…", "file_url": "https://oss…" } }`; PUT the file to `file_url`.
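
The local-file path on the Agent API is a two-step handshake: request an upload slot, then PUT the bytes to the returned URL. The sketch below shows only that ordering. The transport callables `post_json` and `put_bytes` are assumptions injected by the caller, and `start_agent_file_task` is a hypothetical name rather than the skill's own function.

```python
def start_agent_file_task(file_name, file_bytes, post_json, put_bytes, language="ch"):
    """Request an upload slot, upload the file, and return the task_id.

    post_json(path, body) -> dict and put_bytes(url, data) are supplied by
    the caller (for example thin wrappers around an HTTP client).
    """
    reply = post_json("/parse/file", {"file_name": file_name, "language": language})
    data = reply["data"]
    # The file itself goes to the signed URL, not to the API host.
    put_bytes(data["file_url"], file_bytes)
    return data["task_id"]
```

Injecting the transport keeps the flow easy to exercise without network access; the returned `task_id` is then polled via `GET /parse/{task_id}`.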

### Result — `GET /parse/{task_id}`

```json
{ "code": 0, "data": { "task_id": "…", "state": "done", "markdown_url": "https://cdn…/full.md" } }
```

---

## Task states

`pending` (queued) · `running` (parsing) · `converting` (format conversion) ·
`uploading` (downloading source, Agent) · `waiting-file` (awaiting upload) ·
`done` (complete) · `failed` (error).
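
Of the states listed, only `done` and `failed` are terminal, so a client typically polls until it sees one of them. This is a minimal polling sketch under stated assumptions: `fetch_state` is a caller-supplied function returning the `data` object of a result response, and the interval and timeout defaults are arbitrary illustrative values.

```python
import time

TERMINAL_STATES = {"done", "failed"}


def poll_until_terminal(fetch_state, interval=2.0, timeout=300.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Call fetch_state() until the task reaches a terminal state.

    Raises TimeoutError if no terminal state is seen before the deadline.
    """
    deadline = clock() + timeout
    while True:
        data = fetch_state()
        if data["state"] in TERMINAL_STATES:
            return data
        if clock() >= deadline:
            raise TimeoutError("task still in state %r" % data["state"])
        sleep(interval)
```

The `sleep` and `clock` parameters exist only so the loop can be tested without waiting.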

---

## Parameters

| Parameter | Type | Default | Notes |
|-----------|------|---------|-------|
| `model_version` | string | `pipeline` | `pipeline`, `vlm` (recommended), `MinerU-HTML` (HTML only) |
| `is_ocr` | bool | `false` | OCR for scanned docs (pipeline/vlm) |
| `enable_formula` | bool | `true` | Formula recognition |
| `enable_table` | bool | `true` | Table recognition |
| `language` | string | `ch` | OCR language (see official `language` table) |
| `page_ranges` | string | all | Standard: `"2,4-6"`; Agent `page_range`: `"1-10"` only |
| `extra_formats` | array | `[]` | `docx` / `html` / `latex` (Standard only) |
| `data_id` | string | – | `[A-Za-z0-9_.-]`, ≤ 128 chars |
| `no_cache` | bool | `false` | Bypass URL cache (Standard) |
| `cache_tolerance` | int | `900` | Cache TTL seconds (Standard) |
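
Two rows in the table carry format constraints that are cheap to check before sending a request: the `data_id` character set and length, and the stricter Agent `page_range` form. The validators below are a sketch; the function names are hypothetical and the regular expressions are one reading of the rules as written in the table.

```python
import re

# data_id: characters from [A-Za-z0-9_.-], at most 128 of them.
DATA_ID_RE = re.compile(r"[A-Za-z0-9_.-]{1,128}")
# Agent page_range: a single page ("3") or one from-to span ("1-10"), no commas.
AGENT_RANGE_RE = re.compile(r"[0-9]+(-[0-9]+)?")


def is_valid_data_id(value):
    """True when value satisfies the data_id rule from the table."""
    return DATA_ID_RE.fullmatch(value) is not None


def is_agent_page_range(value):
    """True when value is acceptable as an Agent API page_range."""
    return AGENT_RANGE_RE.fullmatch(value) is not None
```

A Standard-style value such as `"2,4-6"` fails `is_agent_page_range`, which is one signal that a job needs the Standard API.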

---

## Limits

| | Standard | Agent |
|---|---|---|
| File size | 200 MB | 10 MB |
| Pages | 200 | 20 |
| Batch | 50 / request | 1 |
| Quota | 1000 pages/day priority | IP rate-limited (HTTP 429) |

Supported types: PDF, images (png/jpg/jpeg/jp2/webp/gif/bmp), Doc(x), Ppt(x), Xls(x); HTML is Standard-only.
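
The limits above are enough to express a simple routing rule: stay on the Agent API unless the job exceeds one of its caps or needs a Standard-only feature. The sketch below is not the skill's actual routing code; `choose_api` is a hypothetical function, and treating 1 MB as 1024 × 1024 bytes is an assumption.

```python
AGENT_MAX_BYTES = 10 * 1024 * 1024  # assumption: MB counted as MiB
AGENT_MAX_PAGES = 20


def choose_api(size_bytes, pages=None, file_count=1, extra_formats=(), has_token=False):
    """Return "agent" or "standard" for a job, based on the limits table."""
    needs_standard = (
        size_bytes > AGENT_MAX_BYTES
        or (pages is not None and pages > AGENT_MAX_PAGES)
        or file_count > 1          # Agent API handles a single file only
        or bool(extra_formats)     # DOCX/HTML/LaTeX export is Standard-only
    )
    if not needs_standard:
        return "agent"
    if not has_token:
        raise ValueError("job exceeds Agent API limits and no token is set")
    return "standard"
```

Raising when no token is available makes the failure explicit instead of sending a request that the Agent API would reject.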

---

## Error codes

| Code | Meaning |
|------|---------|
| `A0202` | Invalid token |
| `A0211` | Token expired |
| `-500` | Parameter error |
| `-10001` / `-10002` | Service error / invalid params |
| `-60002` | Unsupported file format |
| `-60003` / `-60004` | File read failed / empty file |
| `-60005` | File too large (> 200 MB) |
| `-60006` | Too many pages (> 200) |
| `-60008` | File read timeout (URL unreachable) |
| `-60010` | Parse failed |
| `-60015` / `-60016` | File / format conversion failed |
| `-60018` | Daily quota reached |
| `-60022` | Web page read failed (rate-limited) |
| **Agent API** | |
| `-30001` | Exceeds Agent 10 MB limit → use Standard API |
| `-30002` | Unsupported file type for Agent |
| `-30003` | Exceeds Agent 20-page limit → use Standard API or `--pages` |
| `-30004` | Invalid request parameters |
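
A lookup table is one straightforward way to turn these codes into hints and to flag the two Agent errors that suggest retrying on the Standard API. This is a sketch over a subset of the rows: `describe_error` is a hypothetical name, and it assumes numeric codes arrive as integers while gateway codes such as `A0202` arrive as strings.

```python
# Subset of the table above; hint text is copied from the "Meaning" column.
ERROR_HINTS = {
    "A0202": "Invalid token",
    "A0211": "Token expired",
    -60005: "File too large (> 200 MB)",
    -60006: "Too many pages (> 200)",
    -60018: "Daily quota reached",
    -30001: "Exceeds Agent 10 MB limit",
    -30003: "Exceeds Agent 20-page limit",
}

# Agent-side size/page errors where the table points to the Standard API.
ESCALATE_TO_STANDARD = {-30001, -30003}


def describe_error(code):
    """Return a small dict with a hint and an escalation flag for a code."""
    return {
        "code": code,
        "hint": ERROR_HINTS.get(code, "Unknown error code"),
        "escalate": code in ESCALATE_TO_STANDARD,
    }
```

Unknown codes fall through to a generic hint rather than raising, so new server-side codes do not break the client.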
193
skills/developing/mineru/references/comparison.md
Normal file
@ -0,0 +1,193 @@
<!-- Web-researched competitive comparison (45 tools, 6 categories, adversarially fact-checked). Last researched 2026-05-31. Star counts / versions are point-in-time. -->

# MinerU Skill — Competitive Comparison Reference

This document gives an honest, sourced, per-tool breakdown of how **MinerU Skill** compares to the document-parsing landscape. Read the framing first: it determines how to interpret every "we win / they win" below.

## What MinerU Skill actually is (and is not)

MinerU Skill is a **zero-config, zero-dependency, agent-native convenience layer over [MinerU](https://github.com/opendatalab/MinerU)'s cloud API**, plus 17 turnkey delivery integrations to note/knowledge/content tools. Concretely (verified in this repo):

- Core script `scripts/mineru.py` is **~54KB / ~1,350 lines of pure Python standard library** — no `requests`/`aiohttp`, no model weights.
- A **genuinely token-free** default: the free **Agent API** path (`agent_parse` → `_agent_poll`) sends **no `Authorization` header** (the Bearer header is set only when a token is present). Files ≤10MB / ≤20 pages.
- **Auto-routing**: with a token, large/batched/extra-format jobs use the **Standard API** (≤200MB / ≤200 pages); the Agent path **auto-escalates** to Standard on size/page limits.
- **17 delivery sinks** (16 sink modules + `local.py` registering both `obsidian` and `logseq`): obsidian, logseq, siyuan, notion, confluence, onenote, coda, yuque, feishu, slack, dingtalk, wecom, ticktick, linear, airtable — all zero-dependency — plus **roam** (needs `roam-client`) and **wps** (needs `html-for-docx`) which lazy-load one library only when used.
- `--resume` dedup, parallel `--workers` (ThreadPoolExecutor), `--stdout`/`--json` agent output.

**Critical dependency:** our accuracy is **entirely downstream of, and capped by, what MinerU's cloud serves.** We own no models. Therefore:

- We have **no quality edge** over any other cloud wrapper that hits the same MinerU API — OCR/table/formula output is **identical**.
- Self-hosting the MinerU engine gives the **same or better** accuracy (version-controllable, no upload caps).

**Hard limits we cannot exceed:** 10MB/20-page free Agent tier, 200MB/200-page Standard tier, plus IP rate limits. Self-hosted tools have no such caps (only hardware).

**Our benchmark is latency-only.** `tests/test_live.py` measures end-to-end cloud round-trip latency (~13–14s for the official demo PDF). It is **not** an accuracy benchmark; we have no OmniDocBench/olmOCR-Bench numbers of our own.

### A note on the speed claim

Our ~13–14s/doc cloud round-trip is **not** a clean win over self-hosted GPU engines. A normal self-host with a GPU runs at ~0.18s/page (Marker) or ~2.12 pages/sec (MinerU on A100) — far faster at any real scale. We only out-run **slow Apple-Silicon-CPU local runs of small docs** (e.g., M4 VLM at 32–148s/page). Do not frame "faster wall-clock" as a general win.
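
To make the scale argument concrete, the figures quoted above can be turned into a rough page count: how many pages a self-hosted engine finishes in the time one cloud round-trip takes. This is back-of-the-envelope arithmetic only. It assumes the 13 to 14 s round-trip is a flat per-document cost and that the quoted throughputs hold steadily; `pages_in_window` is an illustrative helper.

```python
CLOUD_ROUND_TRIP_S = (13.0, 14.0)       # per-document latency quoted above
MARKER_PAGES_PER_S = 1 / 0.18           # ~0.18 s/page on GPU
MINERU_A100_PAGES_PER_S = 2.12          # pages/sec on A100


def pages_in_window(window_s, pages_per_s):
    """Pages an engine completes within window_s at a steady rate."""
    return window_s * pages_per_s


for window in CLOUD_ROUND_TRIP_S:
    marker = pages_in_window(window, MARKER_PAGES_PER_S)
    a100 = pages_in_window(window, MINERU_A100_PAGES_PER_S)
    print(f"{window:.0f}s window: Marker ~{marker:.0f} pages, MinerU/A100 ~{a100:.0f} pages")
```

Under these assumptions a GPU self-host processes on the order of tens of pages in the time a single cloud request returns, which is why the latency figure should not be presented as a throughput advantage.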

### A note on benchmarks

No single benchmark is authoritative. Different benchmarks favor different tools:
- **OmniDocBench** (v1.5/v1.6): MinerU2.5 **90.67** (v1.5), MinerU2.5-Pro **95.69** (v1.6) — leads, beating Gemini 2.5 Pro / GPT-4o / Qwen2.5-VL-72B on text/table/formula. Source: arXiv 2509.22186.
- **olmOCR-Bench** (Ai2, Oct 2025): olmOCR-2 **82.4** > Marker **76.1** > **MinerU 75.8**. Here MinerU **trails** — this is a real olmOCR win and must stay visible.
- **RD-TableBench**: Reducto 90.2% on complex tables — but Reducto authored this benchmark (vendor-biased).
- Mathpix is the de-facto formula-OCR standard (BLEU/edit-distance studies), though a PaddleOCR-VL-based tool claims to beat it on OmniDocBench v1.0 formula recognition, so the very top is contested.

> Star counts / versions below (e.g. MinerU "65.7k / v3.2.1") are point-in-time and not independently re-verified.

---

## Category 1 — Self-hosted / open-source parsing engines

These are the tools that close our single biggest gap: **fully offline / air-gapped / no cloud / no upload caps.**

### MinerU engine (opendatalab) — the engine we wrap
- **Source:** https://github.com/opendatalab/MinerU · arXiv 2509.22186 · https://huggingface.co/opendatalab/MinerU2.5-Pro-2604-1.2B
- **Strengths:** Owns the SOTA models (OmniDocBench 90.67 / 95.69-Pro v1.6). 109-language OCR, handwriting, cross-page table merge, formula→LaTeX (the source of *our* LaTeX). Fully self-hostable → offline, air-gappable, zero per-page cost, no caps. Pipeline backend runs pure CPU; VLM needs 8GB+ VRAM. Native MCP, Python/Go/TS SDKs, LangChain/LlamaIndex/Dify/FastGPT.
- **Weaknesses vs us:** Heavy install (multi-GB torch/vLLM + weights, 16GB RAM / 20GB disk floor); slow on Apple Silicon; no note/PKM delivery sinks; library/CLI rather than zero-config.
- **Verdict:** **Beats us** on offline, privacy, caps, accuracy ceiling, ecosystem. **We beat it** only on zero-install/zero-config and built-in delivery.

### Marker (datalab-to / VikParuchuri)
- **Source:** https://github.com/datalab-to/marker · https://allenai.org/blog/olmocr-2
- **Strengths:** Fully offline; very high batch throughput (~122 pages/sec/H100, 0.18s/page GPU); broad formats incl. EPUB; optional local-LLM (Ollama) quality boost with no data leaving the machine; ~35k+ stars, active.
- **Weaknesses:** **GPL-3.0** code + model weights under a modified RAIL-M (free only under ~$2M funding+revenue; commercial above that needs a Datalab license). olmOCR-Bench **76.1** — below olmOCR-2 and MinerU's OmniDocBench standing.
- **Verdict:** Beats us on offline/throughput; we beat it on zero-install and 17 delivery sinks. License gate is a real friction it has and we don't.

### Docling (IBM / DS4SD)
- **Source:** https://github.com/docling-project/docling · https://huggingface.co/ibm-granite/granite-docling-258M · arXiv 2408.09869
- **Strengths:** **Widest input modality set** (PDF/DOCX/PPTX/XLSX/HTML/AsciiDoc/LaTeX/CSV/images + **audio via ASR** + USPTO/JATS/XBRL). Tiny 258M Granite-Docling VLM runs on CPU/modest GPU. **MIT code + Apache-2.0 weights.** Deep framework ecosystem (LangChain/LlamaIndex/Haystack + official MCP), IBM-backed, 60k+ stars. Air-gapped by design.
- **Weaknesses:** Absolute accuracy lags MinerU on OmniDocBench/olmOCR-Bench; library-first (not a zero-config CLI); targets framework ingestion, not file delivery to note tools.
- **Verdict:** Beats us on offline, modality breadth, permissive license, ecosystem; we beat it on zero-install and note/PKM delivery. **Do not over-rank its MIT as uniquely best** — olmOCR's Apache-2.0 on *both* code and 7B weights is at least as commercially valuable.

### olmOCR (allenai)
- **Source:** https://github.com/allenai/olmocr · https://allenai.org/blog/olmocr-2 · https://huggingface.co/datasets/allenai/olmOCR-bench
- **Strengths:** **Leads Ai2's olmOCR-Bench (82.4 vs MinerU 75.8)** — a benchmark where MinerU trails. **Apache-2.0 on code AND the olmOCR-2-7B weights** (most commercial-friendly model reuse here). Built for million-page LLM-training linearization. Offline.
- **Weaknesses:** **PDF/image only** (no Office/HTML); **English-primary**, filters non-English (MinerU does 109-lang); **requires a 12GB+ NVIDIA GPU, no CPU mode at all**.
- **Verdict:** Beats us on offline, accuracy on that benchmark, license, scale. We beat it on modality breadth, multilingual, no-GPU, delivery, zero-install. **Keep the olmOCR-Bench lead visible — do not cherry-pick only OmniDocBench.**

### Nougat (facebookresearch / Meta AI)
- **Source:** https://github.com/facebookresearch/nougat · arXiv 2308.13418
- **Strengths:** Strong LaTeX/math on arXiv-style scientific PDFs (its trained niche). Offline.
- **Weaknesses:** **PDF + English/Latin-script only** (no CJK); **CC-BY-NC weights (non-commercial)**; effectively **unmaintained** (last release Aug 2023); known repetition/hallucination/[MISSING_PAGE] failures off-distribution.
- **Verdict:** Offline + niche math is its only edge; we beat it on general-purpose, multilingual, maintenance, commercial license, delivery.

### PyMuPDF4LLM (pymupdf / Artifex)
- **Source:** https://github.com/pymupdf/pymupdf4llm · https://pymupdf.io/blog/pymupdf-layout-10-faster-pdf-parsing-without-gpus
- **Strengths:** **Far faster and lighter than any ML tool on born-digital PDFs** (~hundreds of pages/sec on plain CPU; a C-optimized variant claims ~520 pages/sec). Lowest dependency/hardware footprint. Offline, no cloud, no caps. Ideal for huge clean-PDF corpora where speed > fidelity.
- **Weaknesses:** No ML → no real formula/LaTeX, weak complex tables, poor scanned/handwritten; slow external OCR; **AGPL-3.0 OR Artifex commercial**; Office formats need paid **PyMuPDF Pro**.
- **Verdict:** A genuine win for the speed-over-fidelity, clean-PDF use case. We beat it on hard-doc quality (MinerU's VLM), multilingual OCR, and delivery — but acknowledge its speed/footprint advantage honestly.

### Zerox (getomni-ai)
- **Source:** https://github.com/getomni-ai/zerox
- **Strengths:** Trivial provider-flexibility (OpenAI/Azure/Bedrock-Claude/Gemini/Vertex); JSON-Schema structured extraction (Node SDK); MIT code.
- **Weaknesses:** **NOT offline and NOT token-free** — mandates a paid cloud vision-LLM key; needs graphicsmagick+ghostscript; **no published benchmarks**; per-page LLM cost can exceed MinerU on large jobs.
- **Verdict:** We beat it on token-free start, benchmarked accuracy, dedicated formula/table models, system-dep footprint, and delivery. It beats us on provider-swap flexibility and typed JSON extraction.

---

## Category 2 — Commercial cloud document-parsing APIs

Mostly **stronger than us** on enterprise accuracy, SLAs, structured extraction, and RAG/MCP ecosystems. Our honest edges are narrow: token-free + zero-install hosted default, clean Markdown/LaTeX of academic PDFs, and 17 delivery sinks none of them offer.

### LlamaParse (LlamaIndex / LlamaCloud)
- **Source:** https://www.llamaindex.ai/pricing · LlamaCloud MCP docs
- **Beats us:** Official hosted **MCP server**; deep native RAG stack (parse→index→LlamaExtract/LlamaAgents); steerable NL parsing with frontier LLMs (GPT-4.1/Gemini 2.5 Pro); richer outputs (per-page JSON, XLSX, HTML tables, annotated PDF); enterprise SLAs; mature Python+TS SDKs.
- **We beat:** Token-free start (it needs a LlamaCloud key from page one); zero runtime deps; 17 note/PKM sinks (it delivers to RAG indexes, not note tools); built-in `--resume`/parallel batch CLI.

### Mathpix (Convert API)
- **Source:** https://mathpix.com/pricing/api · https://mathpix.com/image-to-latex
- **Beats us:** **Best-in-class formula/equation OCR (printed AND handwritten) → clean LaTeX — clearly better than MinerU for pure math fidelity; concede this, do not imply parity.** Mature Snip ecosystem + Overleaf workflows; very low per-image cost at scale.
- **We beat:** Token-free start (Mathpix API requires a paid PAYG account, **$19.99 setup fee**, card on file; **no recurring free monthly allowance** — only a one-time $29 test credit; the consumer Snip app's free quota does **not** apply to the API); general-purpose multi-modal Office parsing; 17 delivery sinks; built-in batch CLI.

### Unstructured.io
- **Source:** https://unstructured.io/pricing · https://github.com/Unstructured-IO/unstructured
- **Beats us:** **Apache-2.0 core library is fully self-hostable → 100% offline** (we cannot); official MCP + huge connector ecosystem (S3/SharePoint/vector DBs); built-in chunking+embedding (RAG-ready); 25+ file types; permissive license for product embedding.
- **We beat:** Token-free hosted default with zero install (its hosted API needs a key; self-host means running infra); cleaner human-readable Markdown out of the box (its primary output is JSON "elements"); 17 note/PKM sinks (it targets vector DBs/storage). *On parsing quality:* VLM parsing is generally stronger for complex layout/formula, but this is **not a benchmarked head-to-head** — state it as a tendency, not a measured win.

### Reducto
- **Source:** https://reducto.ai/pricing
- **Beats us:** **Best complex/financial table extraction (90.2% RD-TableBench — vendor-authored but the strongest public evidence)**; agentic multi-pass OCR; SOC2/HIPAA, on-prem/VPC/air-gapped, enterprise SLAs; schema-based extraction with bounding boxes/citations.
- **We beat:** Token-free start (it needs a key + credits); zero-install plain CLI; 17 delivery sinks; auto-routing/--resume/parallel batch.

### Chunkr (and similar RAG-native APIs)
- **Beats us:** Self-hostable (offline option we lack); RAG-native chunking + broad export (DOCX/HTML/LaTeX).
- **We beat:** Token-free start; zero-install; 17 note/PKM sinks.
- **Caveat (fact-check):** Do **not** claim "stronger VLM Markdown for formulas" — Chunkr cloud uses its own proprietary models and we have **no head-to-head benchmark**. Drop the quality claim; keep only the export-breadth and offline framing.

---

## Category 3 — Other MinerU wrappers, skills & MCP servers (our direct peers)

**Every cloud-backed wrapper here hits the same MinerU API we do, so its OCR/table/formula output is IDENTICAL to ours.** We have **no quality edge** over them — only DX differences. Claims of "better OCR/formula/Markdown" vs these are **invalid** and must not appear.

### Official MinerU MCP server (mineru-open-mcp / MinerU-Ecosystem)
- **Source:** https://github.com/opendatalab/MinerU-Ecosystem · https://pypi.org/project/mineru-open-mcp/
- **Beats us:** **Official, first-party** — tracks API/format changes day-one; native **MCP server** (stdio + streamable-http) in Claude Desktop/Cursor/Windsurf with zero glue; full ecosystem (Python/Go/TS SDKs, LangChain/LlamaIndex/Dify/FastGPT). **Same free no-token Flash tier as us** — our "free zero-token" edge is fully matched by the first party.
- **We beat:** Zero runtime deps (vs pip/uvx install); auto-routing Agent⇄Standard with auto-escalation; 17 delivery sinks; `--resume`/parallel batch; usable as a plain CLI outside any MCP host.

### MinerU-Document-Explorer (official, opendatalab)
- **Source:** https://github.com/opendatalab/MinerU-Document-Explorer
- **Beats us:** Different, **larger** value prop — a local agent-native **knowledge engine** (BM25/vector/hybrid retrieval + deep-reading + LLM-wiki) with 15 MCP tools; runs 100% locally for its core; MIT, 568 stars.
- **We beat:** We're a focused zero-dep converter; broader conversion modalities; 17 delivery sinks (it keeps content in its own index/wiki); no Node/local-model download.

### linxule/mineru-mcp (Node, cloud)
- **Source:** https://github.com/linxule/mineru-mcp
- **Beats us:** Native MCP server with 6 granular tools (explicit status-polling + batch-status pagination); first-class for Node/JS MCP stacks; batch up to 200 URLs/request.
- **We beat:** **Free no-token path** (it **requires** a token always); zero runtime deps (vs Node 18+); broader modalities (Excel/HTML); 17 delivery sinks; usable as plain CLI outside MCP.

### mineru-converter-mcp-server (AvatarGanymede/MinerU-MCP)
- **Source:** https://pypi.org/project/mineru-converter-mcp-server/
- **Beats us:** **Auto-splits PDFs >200MB and segments >600-page docs by page range — gracefully exceeding the 200MB/200-page cap we are bound by.** Turnkey Smithery + Render deploy (per-user key); explicit HTML input.
- **We beat:** Free no-token default (it requires a key); zero runtime deps; plain CLI (no MCP host/Render/Smithery needed); 17 sinks; auto-routing.

### grimoire-skill (LeoLin990405)
- **Source:** https://github.com/LeoLin990405/grimoire-skill
- **Beats us:** Higher-level knowledge-capture ("parse once, share twice" → Obsidian notes + reusable skill packs); ingests **video** (YouTube/Bilibili) + subtitles (modalities we don't touch); cross-agent skill management; content-aware Obsidian auto-filing.
- **We beat:** Free no-token default (it needs a token + `--cloud-ok` for local files); zero runtime deps (vs bash+jq+awk + optional yt-dlp/ffmpeg); 17 sinks vs primarily Obsidian; broader Office/HTML; cross-platform single-file portability.

### kesslerio/mineru-pdf-parser (openclaw/ClawHub skill, local CPU)
- **Source:** openclaw/skills · SKILL.md
- **Beats us:** **Fully local/offline (pure CPU, cross-platform)** — no cloud/token/caps; handles privacy-sensitive docs; native Markdown + JSON.
- **We beat:** Zero install (it needs a full local MinerU install + weights + shell wrapper); no GPU/heavy runtime; faster wall-clock **only vs slow local CPU**; broader modalities; 17 sinks; `--stdout`/`--json`; better docs.

### nilecui/mineru-parser-skills (Claude Agent SDK, cloud)
- **Source:** https://github.com/nilecui/mineru-parser-skills
- **Beats us:** Built directly on the Claude Agent SDK (slots into Agent-SDK apps). Honestly little else — it's a thinner cloud wrapper.
- **We beat:** Accepts local files/dirs **and** URLs (it is **URL-only** — cannot parse a local PDF); free no-token default; zero runtime deps; batch/`--resume`/parallel; 17 sinks; broader modalities; mature/documented vs a 4-commit, no-license repo. *Caveat:* our "benchmarked" claim means **latency-measured**, not accuracy-benchmarked.

### TINKPA/mcp-mineru (local MLX, Apple Silicon)
- **Source:** https://github.com/TINKPA/mcp-mineru
- **Beats us:** **Fully offline/local** via MinerU running on-device (MLX accel); no cloud/token/caps; data never leaves the Mac.
- **We beat:** Zero install/no weights/no GPU; **faster wall-clock only for typical multi-page docs vs its slow local inference (32–148s/page on M4)** — not a general speed win; broader modalities; batch/`--resume`/17 sinks; more active/documented; usable as plain CLI.

---

## Summary of mandatory concessions (do not bury these)

1. **Offline / air-gapped is our single biggest gap.** MinerU engine, Marker, Docling, olmOCR, Nougat, PyMuPDF4LLM, TINKPA, kesslerio, MinerU-Document-Explorer, and self-hostable Unstructured/Chunkr all run with **zero cloud dependency**. We are cloud-only and **cannot handle confidential/regulated/air-gapped content at all.**
2. **Data privacy:** every self-hosted competitor keeps documents on the machine; we **upload every file** to MinerU's cloud — a hard disqualifier for many regulated users.
3. **Accuracy is downstream of, and capped by, MinerU's cloud.** Self-hosting MinerU2.5-Pro gives the same-or-better accuracy with no caps. Same-backend wrappers yield **identical** quality to us.
4. **Hard caps:** 10MB/20-page (Agent), 200MB/200-page (Standard), IP rate limits. mineru-converter exceeds them via auto-split/segmentation.
5. **Mathpix beats us on formula/LaTeX OCR (incl. handwriting).**
6. **Reducto leads complex/financial tables; olmOCR leads olmOCR-Bench (82.4 vs MinerU 75.8).** Different benchmarks favor different tools — never cherry-pick only OmniDocBench.
7. **Official first-party advantage:** the official MinerU MCP/Document-Explorer + ecosystem track changes day-one and match our free tier; we are third-party, can lag, and ship **no MCP server**.
8. **Permissive-license wins we lack:** olmOCR (Apache-2.0 code + 7B weights), Docling (MIT + Apache-2.0 weights), Unstructured (Apache-2.0 core).
9. **PyMuPDF4LLM is far faster/lighter on born-digital PDFs** (clean-text corpora, speed > fidelity).

## Sources

- MinerU engine: https://github.com/opendatalab/MinerU · arXiv 2509.22186 · https://huggingface.co/opendatalab/MinerU2.5-Pro-2604-1.2B · https://neurohive.io/en/state-of-the-art/mineru2-5-open-source-1-2b-model-for-pdf-parsing-outperforms-gemini-2-5-pro-on-benchmarks/
- Official MCP / ecosystem: https://github.com/opendatalab/MinerU-Ecosystem · https://pypi.org/project/mineru-open-mcp/ · https://github.com/opendatalab/MinerU-Document-Explorer
- Marker: https://github.com/datalab-to/marker · https://allenai.org/blog/olmocr-2
- Docling: https://github.com/docling-project/docling · arXiv 2408.09869 · https://huggingface.co/ibm-granite/granite-docling-258M
- olmOCR: https://github.com/allenai/olmocr · https://allenai.org/blog/olmocr-2 · https://huggingface.co/datasets/allenai/olmOCR-bench
- Nougat: https://github.com/facebookresearch/nougat · arXiv 2308.13418
- PyMuPDF4LLM: https://github.com/pymupdf/pymupdf4llm · https://pymupdf.io/blog/pymupdf-layout-10-faster-pdf-parsing-without-gpus
- Zerox: https://github.com/getomni-ai/zerox
- LlamaParse: https://www.llamaindex.ai/pricing
- Mathpix: https://mathpix.com/pricing/api · https://mathpix.com/image-to-latex
- Unstructured: https://unstructured.io/pricing · https://github.com/Unstructured-IO/unstructured
- Reducto: https://reducto.ai/pricing
- Other wrappers: https://github.com/linxule/mineru-mcp · https://pypi.org/project/mineru-converter-mcp-server/ · https://github.com/LeoLin990405/grimoire-skill · https://github.com/nilecui/mineru-parser-skills · https://github.com/TINKPA/mcp-mineru
59
skills/developing/mineru/references/integrations.md
Normal file
@ -0,0 +1,59 @@
# Delivery Integrations (`--to`)

After parsing, MinerU Skill can deliver the Markdown straight into your content
tools using each tool's **official ingestion path**, with no fragile generic block
converters. Targets are pluggable sinks; select one or more with `--to NAME`
(repeatable). List them live with `python3 scripts/mineru.py --list-sinks`.

```bash
# Parse and fan out to several destinations at once
python3 scripts/mineru.py paper.pdf --to obsidian --to notion --to slack
```

Each sink reads its configuration from **environment variables** so an AI agent
can run it non-interactively. Delivery results appear in `--json` output under
each result's `sinks` array.
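
Because every sink is configured through environment variables, an agent can check a target's settings before it parses anything. The sketch below only illustrates that idea: `REQUIRED_ENV` and `missing_env` are hypothetical names, and the variable lists are copied by hand from the support matrix in this document rather than read from the skill's own registry.

```python
import os

# Hypothetical table: required variables per sink, copied from the support matrix.
# Optional variables (marked `?` in the matrix) are deliberately left out.
REQUIRED_ENV = {
    "obsidian": ("OBSIDIAN_VAULT",),
    "notion": ("NOTION_API_KEY", "NOTION_PARENT_PAGE_ID"),
    "slack": ("SLACK_BOT_TOKEN", "SLACK_CHANNEL"),
}


def missing_env(sink, environ=None):
    """Return the required variables that are unset or empty for `sink`."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_ENV.get(sink, ()) if not env.get(name)]


if __name__ == "__main__":
    for sink in ("obsidian", "notion", "slack"):
        gaps = missing_env(sink)
        status = "ready" if not gaps else "missing " + ", ".join(gaps)
        print(f"{sink}: {status}")
```

Passing a plain dict as `environ` keeps the check easy to exercise without touching the real process environment.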

## Support matrix

| Target | `--to` | Native path | Auth / config (env) | Markdown fidelity | Images |
|--------|--------|-------------|---------------------|-------------------|--------|
| **Obsidian** | `obsidian` (`ob`) | filesystem write + YAML frontmatter | `OBSIDIAN_VAULT`, `OBSIDIAN_SUBDIR?` | full | ✅ copied to `<note>.assets/` |
| **Logseq** | `logseq` | filesystem write, outline + `key:: value` | `LOGSEQ_GRAPH` | full (outline transform) | ✅ copied to `assets/` |
| **SiYuan** | `siyuan` | kernel `createDocWithMd` | `SIYUAN_TOKEN`, `SIYUAN_API_URL?`, `SIYUAN_NOTEBOOK?` | full (GFM) | ✅ `asset/upload` |
| **Notion** | `notion` | `POST /v1/pages` (blocks) | `NOTION_API_KEY`, `NOTION_PARENT_PAGE_ID`, `NOTION_VERSION?` | structure (headings/lists/code/quote) | ⚠️ text only¹ |
| **Linear** | `linear` | GraphQL `issueCreate` | `LINEAR_API_KEY`, `LINEAR_TEAM_ID` | full (Markdown-native) | ✅ base64-inlined |
| **Yuque 语雀** | `yuque` (`语雀`) | open API create doc | `YUQUE_TOKEN`, `YUQUE_NAMESPACE` | full (Markdown-native) | ⚠️ host publicly² |
| **Coda** | `coda` | page canvas `format:markdown` | `CODA_API_TOKEN`, `CODA_DOC_ID?` | full (Markdown-native) | ⚠️ public URL² |
| **Slack** | `slack` | external-upload `.md` file | `SLACK_BOT_TOKEN`, `SLACK_CHANNEL` | full (raw file) | ⚠️ not embedded |
| **Lark 飞书** | `feishu` (`lark`, `飞书`) | Drive `import_tasks` → Docx | `FEISHU_APP_ID`, `FEISHU_APP_SECRET`, `FEISHU_FOLDER_TOKEN?` | full (server-converted) | ⚠️ public URL² |
| **Confluence** | `confluence` | `POST /wiki/api/v2/pages` (storage) | `CONFLUENCE_BASE_URL`, `CONFLUENCE_EMAIL`, `CONFLUENCE_API_TOKEN`, `CONFLUENCE_SPACE_ID` | MD→HTML | ⚠️ not attached |
| **OneNote** | `onenote` | Graph `sections/{id}/pages` | `ONENOTE_TOKEN`³, `ONENOTE_SECTION_ID` | MD→HTML | ⚠️ remote only |
| **TickTick 滴答** | `ticktick` (`dida`, `滴答清单`) | `POST /open/v1/task` | `TICKTICK_TOKEN`, `TICKTICK_PROJECT_ID?` | task note | ❌ unsupported |
| **DingTalk 钉钉** | `dingtalk` (`钉钉`) | robot markdown webhook | `DINGTALK_WEBHOOK`, `DINGTALK_SECRET?` | markdown message | ⚠️ public URL only |
| **Airtable** | `airtable` | `POST /v0/{base}/{table}` record | `AIRTABLE_API_KEY`, `AIRTABLE_BASE_ID`, `AIRTABLE_TABLE`, `AIRTABLE_TITLE_FIELD?`, `AIRTABLE_BODY_FIELD?` | record field⁴ | ❌ not uploaded |
| **WeCom 企业微信** | `wecom` (`企业微信`) | app `message/send` markdown | `WECOM_CORPID`, `WECOM_CORPSECRET`, `WECOM_AGENTID`, `WECOM_TOUSER?` | message (subset, ≤2 KB)⁵ | ❌ unsupported |
| **Roam Research** ⁶ | `roam` | `batch-actions` block tree | `ROAM_API_TOKEN`, `ROAM_GRAPH_NAME` | full (Markdown→outline) | ⚠️ public URL |
| **WPS 金山文档** ⁶ | `wps` (`kdocs`, `金山`) | Markdown→DOCX → kdocs upload | `WPS_APP_ID`, `WPS_APP_SECRET`, `WPS_PARENT_PATH?` | DOCX (via html-for-docx) | embedded in DOCX |

A trailing `?` marks an optional variable.

Notes:
1. **Notion** images need a separate `file_uploads` upload-then-reference step; v1 delivers text and structure and reports the count of local images it could not embed. (Roadmap: image upload.)
2. These hosted services ingest Markdown by value but have no first-class CLI asset upload, so local images must be hosted at a public URL to render. The Markdown is delivered intact; image links that are already URLs work.
3. **OneNote** `ONENOTE_TOKEN` is a Microsoft Graph access token (delegated, scope `Notes.Create`). Obtain it via the device-code OAuth flow; the sink itself stays non-interactive.
4. **Airtable** is a database, not a document store: the doc is stored as one record (title + body fields). A good "save this doc as a row" target, not a document publisher.
5. **WeCom** markdown messages are a limited subset (≤2048 bytes, no images/tables, not rendered in the workbench). Best as a notification/summary; for a full document, deliver via Lark/Notion and send the link.
6. **Optional-dependency sinks**: these two rely on a third-party library that the sink lazy-imports only when used, so the core and the other 15 sinks stay zero-dependency. If the library is absent, the sink returns a clear `pip install …` hint. They are implemented to the official specs but, being credential/desktop-gated, are best-effort until validated against live accounts.
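
Note 5 gives WeCom messages a 2048-byte budget, which is a byte limit rather than a character limit, so CJK-heavy summaries hit it sooner than their length suggests. The following is a minimal sketch of a UTF-8-safe cut, assuming a hypothetical helper named `truncate_utf8`; how the WeCom sink actually shortens messages is not shown in this document.

```python
def truncate_utf8(text, limit=2048, suffix="..."):
    """Shorten `text` so its UTF-8 encoding fits in `limit` bytes.

    Hypothetical helper for illustration; not part of the skill.
    """
    raw = text.encode("utf-8")
    if len(raw) <= limit:
        return text
    budget = max(0, limit - len(suffix.encode("utf-8")))
    # errors="ignore" drops a multi-byte character that the cut split in half,
    # so the result is always valid text.
    head = raw[:budget].decode("utf-8", errors="ignore")
    return head + suffix


if __name__ == "__main__":
    summary = "解析完成 " * 400  # well over 2048 bytes once encoded
    message = truncate_utf8(summary)
    print(len(message.encode("utf-8")))
```

Cutting the encoded bytes and decoding with `errors="ignore"` avoids counting characters, which would undercount for non-ASCII text.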

## Optional-dependency sinks (`[roam]`, `[wps]`)

```bash
pip install "mineru-skill[wps]"    # html-for-docx (Markdown → DOCX)
pip install "mineru-skill[roam]"   # official roam-client SDK (git, needs Python ≥3.11)
# roam-client is git-only; equivalently:
pip install "roam-client @ git+https://github.com/Roam-Research/backend-sdks.git#subdirectory=python"
```

- **Roam**: no library ingests Markdown into Roam, but the official `roam-client` SDK handles the genuinely error-prone transport (307/308 peer-host redirect, dual `Authorization`/`x-authorization` Bearer headers, `/write`). We depend on it for transport and build only the Markdown→outline tree, delivering the whole document in one `batch-actions` request. Images must be public URLs.
- **WPS / 金山文档**: Markdown→DOCX uses the maintained pure-pip `html-for-docx` (reusing this project's Markdown→HTML); the kdocs upload signs requests with the documented WPS-2 scheme (plain SHA-1) using only the standard library. Requires an approved kdocs developer app + provisioned appspace.

Adding more targets is a single small module; see `scripts/sinks/base.py`. PRs welcome.
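
To give a feel for what "a single small module" might look like, here is a self-contained sketch of a new sink. `scripts/sinks/base.py` is not shown in this document, so the `ParsedDoc`, `SinkResult`, `Sink`, and `register` definitions below are stand-ins inferred from how the other modules use them; the `url` field, the `FolderSink` class, and the `FOLDER_SINK_DIR` variable are purely hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Optional

# Stand-ins for what scripts/sinks/base.py exports. The real definitions are not
# shown here, so every field and method below is an assumption.
REGISTRY = {}


@dataclass
class ParsedDoc:
    title: str
    markdown: str
    source: str = ""


@dataclass
class SinkResult:
    sink: str
    ok: bool
    error: Optional[str] = None
    url: Optional[str] = None  # hypothetical: where the document ended up


class Sink:
    name = ""
    label = ""
    requires = ()

    def missing_config(self):
        """Required environment variables that are unset or empty."""
        return [key for key in self.requires if not os.environ.get(key)]

    def deliver(self, doc):
        raise NotImplementedError


def register(cls):
    """Class decorator: instantiate the sink and index it by name."""
    REGISTRY[cls.name] = cls()
    return cls


@register
class FolderSink(Sink):
    """Hypothetical target: write the Markdown into a plain folder."""

    name = "folder"
    label = "Plain folder (example)"
    requires = ("FOLDER_SINK_DIR",)

    def deliver(self, doc):
        target = os.path.join(os.environ["FOLDER_SINK_DIR"], f"{doc.title}.md")
        with open(target, "w", encoding="utf-8") as handle:
            handle.write(doc.markdown)
        return SinkResult(sink=self.name, ok=True, url=target)
```

The shape to notice is the three class attributes plus one `deliver` method; configuration checks stay in the base class so each target only describes what it needs.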
1
skills/developing/mineru/scripts/__init__.py
Normal file
@ -0,0 +1 @@
"""Importable package for MinerU Skill console entry points."""
88
skills/developing/mineru/scripts/chunking.py
Normal file
@ -0,0 +1,88 @@
"""Heading-aware Markdown chunking for RAG pipelines (zero-dependency).

``chunk_markdown`` splits a parsed Markdown document into retrieval-sized chunks
that preserve heading context, matching the RAG-friendliness of LlamaParse /
Unstructured without any dependency.
"""

from __future__ import annotations

import re

_HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def _slug(text: str) -> str:
    text = (text or "doc").strip().lower()
    text = re.sub(r"[^a-z0-9]+", "-", text).strip("-")
    return text or "doc"


def _split_by_size(text: str, max_chars: int) -> list:
    """Split text into <= max_chars pieces on paragraph boundaries (hard-split if needed)."""
    if len(text) <= max_chars:
        return [text]
    pieces: list = []
    current = ""
    for para in text.split("\n\n"):
        if len(para) > max_chars:
            # Flush what we have, then hard-split the oversized paragraph.
            if current:
                pieces.append(current)
                current = ""
            for i in range(0, len(para), max_chars):
                pieces.append(para[i:i + max_chars])
        elif not current:
            current = para
        elif len(current) + len(para) + 2 <= max_chars:
            current = f"{current}\n\n{para}"
        else:
            pieces.append(current)
            current = para
    if current:
        pieces.append(current)
    return pieces


def chunk_markdown(markdown: str, *, max_chars: int = 2000, source: str = "") -> list:
    """Chunk Markdown by heading, size-splitting long sections.

    Returns ``[{id, index, heading, text, chars, source}, ...]`` where ``heading``
    is the ``H1 > H2 > H3`` breadcrumb for the chunk.
    """
    lines = markdown.replace("\r\n", "\n").split("\n")
    chunks: list = []
    stack: list = []  # (level, text) heading breadcrumb
    buf: list = []
    base = _slug(source)

    def breadcrumb() -> str:
        return " > ".join(t for _, t in stack)

    def flush():
        text = "\n".join(buf).strip()
        buf.clear()
        if not text:
            return
        head = breadcrumb()
        for piece in _split_by_size(text, max_chars):
            idx = len(chunks)
            chunks.append({
                "id": f"{base}-{idx}",
                "index": idx,
                "heading": head,
                "text": piece,
                "chars": len(piece),
                "source": source,
            })

    in_fence = False
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("```"):
            # Lines such as "# comment" inside fenced code are not headings.
            in_fence = not in_fence
        match = None if in_fence else _HEADING.match(stripped)
        if match:
            flush()  # close the previous section under its own breadcrumb
            level = len(match.group(1))
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, match.group(2)))
        buf.append(line)
    flush()
    return chunks
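
The breadcrumb handling in `chunk_markdown` is easier to follow in isolation. The sketch below is a reduced restatement of that stack logic only, with no chunking and no size limits; `heading_breadcrumbs` is a hypothetical helper written for this illustration and is not part of the module.

```python
import re

_HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def heading_breadcrumbs(markdown):
    """Yield the `H1 > H2 > H3` breadcrumb in effect after each heading line."""
    stack = []  # (level, text) pairs, outermost heading first
    for line in markdown.splitlines():
        match = _HEADING.match(line.strip())
        if not match:
            continue
        level = len(match.group(1))
        # A new heading closes every open heading at the same or a deeper level.
        while stack and stack[-1][0] >= level:
            stack.pop()
        stack.append((level, match.group(2)))
        yield " > ".join(text for _, text in stack)


if __name__ == "__main__":
    doc = "# Guide\n## Install\n### Linux\n## Usage\n"
    for crumb in heading_breadcrumbs(doc):
        print(crumb)
```

Popping while the top of the stack is at the same or a deeper level is what lets `## Usage` replace `## Install` and its `### Linux` child in one step.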
59
skills/developing/mineru/scripts/local_engine.py
Normal file
@ -0,0 +1,59 @@
"""Optional fully-offline parsing backend for born-digital PDFs.

Our single biggest honest gap is being cloud-only. ``--engine local`` parses a
PDF **entirely offline** with the optional, lightweight ``pymupdf4llm`` library
(no GPU, no cloud, no upload caps), ideal for confidential or born-digital PDFs
where MinerU's cloud VLM is overkill. Scanned/complex docs still want the cloud
engine, so ``--engine auto`` only uses local when the PDF has real text.

    pip install "mineru-skill[local]"   # i.e. pip install pymupdf4llm
"""

from __future__ import annotations

from pathlib import Path

_HINT = (
    "--engine local needs pymupdf4llm — pip install 'mineru-skill[local]' "
    "(i.e. pip install pymupdf4llm)"
)


class LocalEngineError(Exception):
    """Raised when local parsing is requested but cannot be performed."""


def available() -> bool:
    try:
        import pymupdf4llm  # noqa: F401
        return True
    except ImportError:
        return False


def is_born_digital(path, min_chars: int = 200) -> bool:
    """True if the PDF has extractable text (so local parsing is appropriate)."""
    try:
        import pymupdf
    except ImportError:
        return False
    total = 0
    # The context manager closes the document, including on the early return.
    with pymupdf.open(str(path)) as doc:
        for page in doc:
            total += len(page.get_text().strip())
            if total >= min_chars:
                return True
    return total >= min_chars


def parse_local(path, output_dir=None) -> str:
    """Parse a PDF to Markdown fully offline. Returns the Markdown string."""
    try:
        import pymupdf4llm
    except ImportError as exc:
        raise LocalEngineError(_HINT) from exc
    if output_dir is not None:
        images = Path(output_dir) / "images"
        images.mkdir(parents=True, exist_ok=True)
        return pymupdf4llm.to_markdown(str(path), write_images=True, image_path=str(images))
    return pymupdf4llm.to_markdown(str(path))
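
The module docstring says `--engine auto` only goes local when the PDF has real text, but the routing itself lives in `scripts/mineru.py`, whose diff is suppressed. The sketch below is therefore a guess at the decision rule, not the project's code: `choose_engine` and its keyword arguments are hypothetical, with `local_available` standing for `available()` and `born_digital` for `is_born_digital(path)`.

```python
def choose_engine(requested, *, local_available, born_digital):
    """Resolve an `--engine` value to "cloud" or "local" (hypothetical rule)."""
    if requested == "local":
        # An explicit request should fail loudly rather than silently upload.
        if not local_available:
            raise RuntimeError("local engine requested but pymupdf4llm is not installed")
        return "local"
    if requested == "auto" and local_available and born_digital:
        return "local"
    # "cloud", or "auto" for scanned PDFs / missing library.
    return "cloud"


if __name__ == "__main__":
    print(choose_engine("auto", local_available=True, born_digital=False))
```

Keeping the rule a pure function of three values makes it testable without a PDF or the optional dependency installed.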
1996
skills/developing/mineru/scripts/mineru.py
Normal file
File diff suppressed because it is too large
178
skills/developing/mineru/scripts/mineru_mcp.py
Normal file
@ -0,0 +1,178 @@
#!/usr/bin/env python3
"""Zero-dependency MCP server (stdio) for MinerU Skill.

Speaks newline-delimited JSON-RPC 2.0 over stdin/stdout using only the standard
library, so an MCP host (Claude, Cursor, Windsurf, ...) can call MinerU. Register:

    {"command": "python3", "args": ["scripts/mineru_mcp.py"]}

Tools: ``mineru_parse``, ``mineru_parse_to``, ``mineru_list_sinks``.
"""

from __future__ import annotations

import json
import os
import sys
from pathlib import Path

sys.path.insert(0, str(Path(__file__).resolve().parent))
import mineru  # noqa: E402

PROTOCOL_VERSION = "2024-11-05"
SERVER_INFO = {"name": "mineru", "version": mineru.__version__}

TOOLS = [
    {
        "name": "mineru_parse",
        "description": "Parse a PDF / Office / image file or URL into clean Markdown via MinerU.",
        "inputSchema": {
            "type": "object",
            "properties": {
                "input": {"type": "string", "description": "Local file path or http(s) URL"},
                "output_dir": {"type": "string", "description": "Where to write output (default ./output)"},
                "api": {"type": "string", "enum": ["auto", "agent", "standard"]},
                "engine": {"type": "string", "enum": ["cloud", "local", "auto"]},
                "ocr": {"type": "boolean"},
                "lang": {"type": "string"},
            },
            "required": ["input"],
        },
    },
    {
        "name": "mineru_parse_to",
        "description": "Parse a document and deliver the Markdown into content tools (Obsidian, Notion, Feishu, ...).",
        "inputSchema": {
            "type": "object",
            "properties": {
                "input": {"type": "string"},
                "sinks": {"type": "array", "items": {"type": "string"}, "description": "Sink names, e.g. ['obsidian','notion']"},
                "output_dir": {"type": "string"},
            },
            "required": ["input", "sinks"],
        },
    },
    {
        "name": "mineru_list_sinks",
        "description": "List available delivery targets and their required environment variables.",
        "inputSchema": {"type": "object", "properties": {}},
    },
]


class MethodNotFound(Exception):
    pass


def _text_result(text: str, is_error: bool = False) -> dict:
    return {"content": [{"type": "text", "text": text}], "isError": is_error}


def _tool_parse(args: dict) -> dict:
    opts = mineru.ParseOptions(is_ocr=bool(args.get("ocr")), language=args.get("lang", "ch"))
    token = os.environ.get("MINERU_TOKEN")
    output_dir = Path(args.get("output_dir") or "./output")
    res = mineru.process_one(
        args["input"], opts, token=token, output_dir=output_dir,
        api=args.get("api", "auto"), engine=args.get("engine", "cloud"),
    )
    if res.state == "done":
        return _text_result(res.markdown or "")
    return _text_result(f"Parse failed: {res.error}", is_error=True)


def _tool_parse_to(args: dict) -> dict:
    opts = mineru.ParseOptions()
    token = os.environ.get("MINERU_TOKEN")
    output_dir = Path(args.get("output_dir") or "./output")
    res = mineru.process_one(args["input"], opts, token=token, output_dir=output_dir)
    if res.state != "done":
        return _text_result(f"Parse failed: {res.error}", is_error=True)
    sinks = mineru._load_sinks()
    if sinks is None:
        return _text_result("Sinks package unavailable.", is_error=True)
    doc = sinks.ParsedDoc(title=res.name, markdown=res.markdown, source=res.source,
                          modality=res.modality, markdown_path=res.markdown_path)
    outcomes = [o.to_status() for o in sinks.deliver_all(doc, args["sinks"])]
    any_fail = any(not o["ok"] for o in outcomes)
    return _text_result(json.dumps({"name": res.name, "deliveries": outcomes}, ensure_ascii=False, indent=2),
                        is_error=any_fail)


def _tool_list_sinks(_args: dict) -> dict:
    sinks = mineru._load_sinks()
    if sinks is None:
        return _text_result("Sinks package unavailable.", is_error=True)
    listing = [{"name": n, "label": sinks.get_sink(n).label, "requires": list(sinks.get_sink(n).requires)}
               for n in sinks.sink_names()]
    return _text_result(json.dumps(listing, ensure_ascii=False, indent=2))


_TOOL_HANDLERS = {
    "mineru_parse": _tool_parse,
    "mineru_parse_to": _tool_parse_to,
    "mineru_list_sinks": _tool_list_sinks,
}


def _route(method: str, params: dict):
    if method == "initialize":
        return {"protocolVersion": PROTOCOL_VERSION, "capabilities": {"tools": {}}, "serverInfo": SERVER_INFO}
    if method == "tools/list":
        return {"tools": TOOLS}
    if method == "tools/call":
        name = params.get("name")
        handler = _TOOL_HANDLERS.get(name)
        if handler is None:
            return _text_result(f"Unknown tool: {name}", is_error=True)
        try:
            return handler(params.get("arguments") or {})
        except Exception as exc:  # noqa: BLE001 - report as a tool error, never crash the server
            return _text_result(f"{type(exc).__name__}: {exc}", is_error=True)
    raise MethodNotFound(method)


def dispatch(request: dict):
    """Handle one JSON-RPC request dict; return a response dict, or None for notifications."""
    if not isinstance(request, dict):
        # Batches (lists) and bare scalars are valid JSON but not supported here;
        # reject them instead of letting an AttributeError/TypeError kill the loop.
        return {"jsonrpc": "2.0", "id": None, "error": {"code": -32600, "message": "Invalid Request"}}
    is_notification = "id" not in request
    req_id = request.get("id")
    try:
        result = _route(request.get("method"), request.get("params") or {})
    except MethodNotFound as exc:
        if is_notification:
            return None
        return {"jsonrpc": "2.0", "id": req_id, "error": {"code": -32601, "message": f"Method not found: {exc}"}}
    except Exception as exc:  # noqa: BLE001
        if is_notification:
            return None
        return {"jsonrpc": "2.0", "id": req_id, "error": {"code": -32603, "message": str(exc)}}
    if is_notification:
        return None
    return {"jsonrpc": "2.0", "id": req_id, "result": result}


def serve(stdin=None, stdout=None) -> None:
    """Read newline-delimited JSON-RPC from stdin, write responses to stdout."""
    stdin = stdin or sys.stdin
    stdout = stdout or sys.stdout
    for line in stdin:
        line = line.strip()
        if not line:
            continue
        try:
            request = json.loads(line)
        except ValueError:
            continue  # skip lines that are not valid JSON
        response = dispatch(request)
        if response is not None:
            stdout.write(json.dumps(response, ensure_ascii=False) + "\n")
            stdout.flush()


def main() -> int:
    serve()
    return 0


if __name__ == "__main__":
    sys.exit(main())
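
For anyone testing the server by hand, the wire format is one JSON object per line. The sketch below builds a short session a host might send; `rpc_line` and `SESSION` are hypothetical names for this illustration, the `initialize` params are left empty because this server does not read them (a real MCP host sends more), and `./document.pdf` is a placeholder path.

```python
import json


def rpc_line(method, params=None, req_id=None):
    """Serialize one JSON-RPC 2.0 message as a single newline-terminated line."""
    message = {"jsonrpc": "2.0", "method": method}
    if params is not None:
        message["params"] = params
    if req_id is not None:  # omitting "id" makes the message a notification
        message["id"] = req_id
    return json.dumps(message, ensure_ascii=False) + "\n"


# Hypothetical session: handshake, tool discovery, then one tool call.
SESSION = [
    rpc_line("initialize", {}, req_id=1),
    rpc_line("notifications/initialized"),
    rpc_line("tools/list", req_id=2),
    rpc_line(
        "tools/call",
        {"name": "mineru_parse", "arguments": {"input": "./document.pdf"}},
        req_id=3,
    ),
]

if __name__ == "__main__":
    import sys

    sys.stdout.writelines(SESSION)
```

Saving this as a script and piping its output into `python3 scripts/mineru_mcp.py` should produce one response line per message that carries an `id`; the notification gets no reply.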
75
skills/developing/mineru/scripts/sinks/__init__.py
Normal file
@ -0,0 +1,75 @@
"""Pluggable delivery sinks for parsed Markdown.

Each submodule registers one or more :class:`Sink` implementations that deliver a
:class:`ParsedDoc` into a content tool using that tool's official ingestion path.
Importing this package populates the registry; a sink module that fails to import
is recorded in :data:`IMPORT_ERRORS` rather than breaking the others.
"""

from __future__ import annotations

import importlib
import sys

from .base import (  # noqa: F401
    ParsedDoc,
    Sink,
    SinkError,
    SinkResult,
    get_sink,
    sink_names,
    REGISTRY,
)

# Sink modules to load. Order is cosmetic.
_MODULES = [
    "local",  # obsidian, logseq (filesystem)
    "siyuan",
    "notion",
    "linear",
    "yuque",
    "coda",
    "ticktick",
    "dingtalk",
    "airtable",
    "wecom",
    "slack",
    "feishu",
    "confluence",
    "onenote",
    "roam",  # optional dependency (roam-client)
    "wps",  # optional dependency (html-for-docx)
]

IMPORT_ERRORS: dict = {}

for _name in _MODULES:
    try:
        importlib.import_module(f"{__name__}.{_name}")
    except Exception as exc:  # noqa: BLE001 - a bad sink shouldn't break the rest
        IMPORT_ERRORS[_name] = f"{type(exc).__name__}: {exc}"
        print(f"[sinks] failed to load {_name}: {exc}", file=sys.stderr)


def deliver_all(doc: ParsedDoc, names) -> list:
    """Deliver ``doc`` to each named sink, returning a list of :class:`SinkResult`."""
    results = []
    for name in names:
        sink = get_sink(name)
        if sink is None:
            results.append(SinkResult(sink=name, ok=False, error=f"unknown sink '{name}'"))
            continue
        missing = sink.missing_config()
        if missing:
            results.append(SinkResult(
                sink=sink.name, ok=False,
                error=f"missing config: {', '.join(missing)}",
            ))
            continue
        try:
            results.append(sink.deliver(doc))
        except SinkError as exc:
            results.append(SinkResult(sink=sink.name, ok=False, error=str(exc)))
        except Exception as exc:  # noqa: BLE001 - surface but never crash the run
            results.append(SinkResult(sink=sink.name, ok=False, error=f"{type(exc).__name__}: {exc}"))
    return results
72
skills/developing/mineru/scripts/sinks/_http.py
Normal file
@ -0,0 +1,72 @@
"""Zero-dependency HTTP helpers shared by all sinks (stdlib urllib only).

``http_request`` is the single seam tests monkeypatch.
"""

from __future__ import annotations

import json
import mimetypes
import urllib.error
import urllib.request
from typing import Optional

USER_AGENT = "MinerU-Skill-sink/1.0"


def http_request(method, url, *, headers=None, data=None, timeout=60):
    """Perform one HTTP request. Returns ``(status_code, body_bytes)``."""
    req = urllib.request.Request(url, data=data, method=method, headers=headers or {})
    req.add_header("User-Agent", USER_AGENT)
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.getcode(), resp.read()
    except urllib.error.HTTPError as exc:
        # 4xx/5xx are returned to the caller as a status code, not raised.
        body = exc.read() if hasattr(exc, "read") else b""
        return exc.code, body


def request_json(method, url, *, headers=None, payload=None, timeout=60):
    """JSON request helper. Returns ``(status_code, parsed_json_or_empty_dict)``."""
    hdrs = dict(headers or {})
    body = None
    if payload is not None:
        hdrs.setdefault("Content-Type", "application/json")
        body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    status, raw = http_request(method, url, headers=hdrs, data=body, timeout=timeout)
    parsed: dict = {}
    if raw:
        try:
            parsed = json.loads(raw.decode("utf-8"))
        except (ValueError, UnicodeDecodeError):
            parsed = {}
    return status, parsed


def encode_multipart(fields=None, files=None):
    """Build a ``multipart/form-data`` body with stdlib only.

    ``fields``: dict of str -> str. ``files``: list of (field_name, filename, bytes).
    Returns ``(content_type, body_bytes)``.
    """
    boundary = "----MinerUSinkBoundary7MA4YWxkTrZu0gW"
    crlf = b"\r\n"
    parts = []
    for name, value in (fields or {}).items():
        parts.append(b"--" + boundary.encode())
        parts.append(f'Content-Disposition: form-data; name="{name}"'.encode())
        parts.append(b"")
        parts.append(str(value).encode("utf-8"))
    for field_name, filename, content in files or []:
        ctype = mimetypes.guess_type(filename)[0] or "application/octet-stream"
        parts.append(b"--" + boundary.encode())
        parts.append(
            f'Content-Disposition: form-data; name="{field_name}"; filename="{filename}"'.encode()
        )
        parts.append(f"Content-Type: {ctype}".encode())
        parts.append(b"")
        parts.append(content)
    parts.append(b"--" + boundary.encode() + b"--")
    parts.append(b"")
    body = crlf.join(parts)
    return f"multipart/form-data; boundary={boundary}", body
244
skills/developing/mineru/scripts/sinks/_md.py
Normal file
@ -0,0 +1,244 @@
"""Small, dependency-free Markdown utilities used by sinks.

These are intentionally pragmatic, not a full CommonMark implementation: they
cover the constructs MinerU emits (headings, emphasis, code, lists, tables,
blockquotes, links, images) well enough to deliver faithful content to tools
that require HTML (Confluence, OneNote) or an outline (Logseq).
"""

from __future__ import annotations

import html
import re
from pathlib import Path
from typing import Optional

_IMAGE_RE = re.compile(r"!\[(?P<alt>[^\]]*)\]\((?P<ref>[^)\s]+)(?:\s+\"[^\"]*\")?\)")
_ILLEGAL_FS = re.compile(r'[\\/:*?"<>|#^\[\]]+')


def slugify(text: str, default: str = "document") -> str:
    """Filesystem/URL-safe slug."""
    text = text.strip().lower()
    text = re.sub(r"[\s_]+", "-", text)
    text = re.sub(r"[^a-z0-9\-]+", "", text)
    text = re.sub(r"-{2,}", "-", text).strip("-")
    return text or default


def safe_filename(title: str, default: str = "document") -> str:
    """Clean a title into a safe note filename (keeps unicode, drops illegal chars)."""
    name = _ILLEGAL_FS.sub(" ", title).strip()
    name = re.sub(r"\s{2,}", " ", name)
    return name[:120] or default


def is_remote(ref: str) -> bool:
    return ref.startswith("http://") or ref.startswith("https://") or ref.startswith("data:")


def find_local_images(markdown: str, base_dir) -> list:
    """Return ``[(alt, ref, Path)]`` for image refs that point at existing local files."""
    base = Path(base_dir) if base_dir else None
    found = []
    seen = set()
    for match in _IMAGE_RE.finditer(markdown):
        ref = match.group("ref")
        if is_remote(ref) or ref in seen:
            continue
        path = Path(ref)
        if not path.is_absolute() and base is not None:
            path = base / ref
        if path.is_file():
            found.append((match.group("alt"), ref, path))
            seen.add(ref)
    return found


def rewrite_images(markdown: str, mapping: dict) -> str:
    """Rewrite local image refs using ``{old_ref: new_ref}``."""
    def repl(match):
        ref = match.group("ref")
        if ref in mapping:
            # Re-emit the image with its original alt text and the new target.
            return f"![{match.group('alt')}]({mapping[ref]})"
        return match.group(0)

    return _IMAGE_RE.sub(repl, markdown)


def yaml_frontmatter(props: dict) -> str:
    """Render a YAML frontmatter block. List values become ``- item`` lines."""
    lines = ["---"]
    for key, value in props.items():
        if value is None or value == "" or value == []:
            continue
        if isinstance(value, (list, tuple)):
            lines.append(f"{key}:")
            for item in value:
                lines.append(f"  - {item}")
        else:
            lines.append(f"{key}: {value}")
    lines.append("---")
    return "\n".join(lines)


# --------------------------------------------------------------------------- #
# Inline + block Markdown -> HTML (pragmatic, XHTML-safe)
# --------------------------------------------------------------------------- #
def _inline(text: str) -> str:
    """Convert inline Markdown to HTML on already-escaped text."""
    # The caller has already run html.escape() on `text`, so refs and alt text
    # are inserted as-is; escaping them a second time would turn `&amp;` into
    # `&amp;amp;` and corrupt URLs that carry query strings.
    # images first, then links
    text = _IMAGE_RE.sub(
        lambda m: f'<img src="{m.group("ref")}" alt="{m.group("alt")}" />',
        text,
    )
    text = re.sub(r"\[([^\]]+)\]\(([^)\s]+)\)",
                  lambda m: f'<a href="{m.group(2)}">{m.group(1)}</a>', text)
    text = re.sub(r"`([^`]+)`", r"<code>\1</code>", text)
    text = re.sub(r"\*\*([^*]+)\*\*", r"<strong>\1</strong>", text)
    text = re.sub(r"(?<!\*)\*(?!\*)([^*]+)\*(?!\*)", r"<em>\1</em>", text)
    return text


def md_to_html(markdown: str) -> str:
    """Convert a Markdown document to a pragmatic, XHTML-safe HTML fragment."""
    out = []
    lines = markdown.replace("\r\n", "\n").split("\n")
    i = 0
    n = len(lines)
    in_code = False
    code_buf: list = []
    list_stack: list = []  # 'ul' / 'ol'

    def close_lists():
        while list_stack:
            out.append(f"</{list_stack.pop()}>")

    while i < n:
        line = lines[i]
        fence = line.strip().startswith("```")
        if fence and not in_code:
            close_lists()
            in_code = True
            code_buf = []
            i += 1
            continue
        if fence and in_code:
            out.append("<pre><code>" + html.escape("\n".join(code_buf)) + "</code></pre>")
            in_code = False
            i += 1
            continue
        if in_code:
            code_buf.append(line)
            i += 1
            continue

        stripped = line.strip()
        if not stripped:
            close_lists()
            i += 1
            continue

        # table block
        if "|" in stripped and i + 1 < n and re.match(r"^\s*\|?[\s:|-]+\|?\s*$", lines[i + 1]):
            close_lists()
            header = [c.strip() for c in stripped.strip("|").split("|")]
            rows = []
            i += 2
            while i < n and "|" in lines[i] and lines[i].strip():
                rows.append([c.strip() for c in lines[i].strip().strip("|").split("|")])
                i += 1
            out.append("<table><thead><tr>"
                       + "".join(f"<th>{_inline(html.escape(c))}</th>" for c in header)
                       + "</tr></thead><tbody>")
            for row in rows:
                out.append("<tr>" + "".join(f"<td>{_inline(html.escape(c))}</td>" for c in row) + "</tr>")
            out.append("</tbody></table>")
            continue

        heading = re.match(r"^(#{1,6})\s+(.*)$", stripped)
        if heading:
            close_lists()
            level = len(heading.group(1))
            out.append(f"<h{level}>{_inline(html.escape(heading.group(2)))}</h{level}>")
            i += 1
            continue

        if stripped.startswith(">"):
            close_lists()
            out.append(f"<blockquote>{_inline(html.escape(stripped[1:].strip()))}</blockquote>")
            i += 1
            continue

        if re.match(r"^([-*+])\s+", stripped):
            if not list_stack or list_stack[-1] != "ul":
                close_lists()
                list_stack.append("ul")
                out.append("<ul>")
            item = re.sub(r"^([-*+])\s+", "", stripped)
            out.append(f"<li>{_inline(html.escape(item))}</li>")
            i += 1
            continue

        if re.match(r"^\d+\.\s+", stripped):
            if not list_stack or list_stack[-1] != "ol":
                close_lists()
                list_stack.append("ol")
                out.append("<ol>")
            item = re.sub(r"^\d+\.\s+", "", stripped)
            out.append(f"<li>{_inline(html.escape(item))}</li>")
            i += 1
            continue

        if re.match(r"^([-*_])\1{2,}$", stripped):
            close_lists()
            out.append("<hr />")
            i += 1
            continue

        close_lists()
        out.append(f"<p>{_inline(html.escape(stripped))}</p>")
        i += 1

    if in_code:
        # Unterminated fence: emit what was collected rather than dropping it.
        out.append("<pre><code>" + html.escape("\n".join(code_buf)) + "</code></pre>")
    close_lists()
    return "\n".join(out)
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------- #
|
||||
# Markdown -> Logseq outline
|
||||
# --------------------------------------------------------------------------- #
|
||||
def md_to_logseq(markdown: str, properties: Optional[dict] = None) -> str:
|
||||
"""Convert flat Markdown into a Logseq outline.
|
||||
|
||||
Every line becomes a ``- `` block. Headings are top-level blocks; the content
|
||||
that follows a heading nests one level beneath it. Page properties
|
||||
(``key:: value``) go on the first block, as Logseq requires.
|
||||
"""
|
||||
out = []
|
||||
if properties:
|
||||
prop_lines = []
|
||||
for key, value in properties.items():
|
||||
if not value:
|
||||
continue
|
||||
if isinstance(value, (list, tuple)):
|
||||
value = ", ".join(str(v) for v in value)
|
||||
prop_lines.append(f"{key}:: {value}")
|
||||
if prop_lines:
|
||||
out.append("- " + prop_lines[0])
|
||||
out.extend(f" {p}" for p in prop_lines[1:])
|
||||
|
||||
have_heading = False
|
||||
for raw in markdown.replace("\r\n", "\n").split("\n"):
|
||||
line = raw.strip()
|
||||
if not line:
|
||||
continue
|
||||
if re.match(r"^#{1,6}\s+", line):
|
||||
out.append(f"- {line}")
|
||||
have_heading = True
|
||||
elif have_heading:
|
||||
out.append(f"\t- {line}")
|
||||
else:
|
||||
out.append(f"- {line}")
|
||||
return "\n".join(out)
|
||||
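To make the nesting rule in `md_to_logseq` concrete, the following is a minimal standalone sketch of the same idea. The function name `outline` and the sample input are illustrative assumptions, page properties are left out, and it is not the module's own code.

```python
import re


def outline(markdown: str) -> str:
    """Turn flat Markdown into a Logseq-style outline (illustrative sketch)."""
    out = []
    have_heading = False
    for raw in markdown.replace("\r\n", "\n").split("\n"):
        line = raw.strip()
        if not line:
            continue  # blank lines do not produce a block
        if re.match(r"^#{1,6}\s+", line):
            out.append(f"- {line}")    # headings are top-level blocks
            have_heading = True
        elif have_heading:
            out.append(f"\t- {line}")  # content nests one level under a heading
        else:
            out.append(f"- {line}")    # content before any heading stays top-level
    return "\n".join(out)


sample = "intro\n\n# Title\nbody text\n## Sub\nmore"
result = outline(sample)
print(result)
```

Note that nesting is a single level deep regardless of heading depth: a `##` heading is still emitted as a top-level block, which matches the behaviour described in the docstring.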
50
skills/developing/mineru/scripts/sinks/airtable.py
Normal file
@ -0,0 +1,50 @@
"""Airtable sink: store parsed Markdown as a record in a base/table.

Airtable is a database, not a document tool: the native ingestion path is a
record whose fields hold the title and the Markdown body. Field names are
configurable to match an existing table schema.

Docs: https://airtable.com/developers/web/api/create-records
(POST /v0/{baseId}/{tableIdOrName}).
"""

from __future__ import annotations

import urllib.parse

from . import _http
from .base import ParsedDoc, Sink, SinkError, SinkResult, register

API_BASE = "https://api.airtable.com/v0"


@register
class AirtableSink(Sink):
    name = "airtable"
    requires = ("AIRTABLE_API_KEY", "AIRTABLE_BASE_ID", "AIRTABLE_TABLE")
    label = "Airtable record (database)"

    def deliver(self, doc: ParsedDoc) -> SinkResult:
        api_key = self.env("AIRTABLE_API_KEY")
        base = self.env("AIRTABLE_BASE_ID")
        table = self.env("AIRTABLE_TABLE")
        title_field = self.env("AIRTABLE_TITLE_FIELD", "Title")
        body_field = self.env("AIRTABLE_BODY_FIELD", "Notes")

        url = f"{API_BASE}/{base}/{urllib.parse.quote(table)}"
        headers = {"Authorization": f"Bearer {api_key}"}
        payload = {"fields": {title_field: doc.title, body_field: doc.markdown}}

        status, parsed = _http.request_json("POST", url, headers=headers, payload=payload)

        if parsed.get("error") or status >= 400:
            raise SinkError(str(parsed.get("error") or f"HTTP {status}"))
        if not parsed.get("id"):
            raise SinkError(f"Airtable returned no record id: {parsed}")

        return SinkResult(
            sink=self.name,
            ok=True,
            url=None,
            detail="stored as a database record (Airtable is a DB, not a doc)",
        )
101
skills/developing/mineru/scripts/sinks/base.py
Normal file
@ -0,0 +1,101 @@
"""Core types and the sink registry for delivering parsed Markdown to content tools.

A *sink* takes a :class:`ParsedDoc` (Markdown + local images + metadata) and
delivers it into one destination (Obsidian, Notion, Slack, Feishu, ...) using
that tool's OFFICIAL native ingestion path. Sinks read their configuration from
environment variables so an AI agent can run them without interactive prompts.
"""

from __future__ import annotations

import os
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ParsedDoc:
    """A parsed document ready for delivery."""

    title: str
    markdown: str
    images: tuple = ()  # absolute paths to local image files
    source: str = ""
    modality: str = "unknown"
    markdown_path: Optional[str] = None


@dataclass
class SinkResult:
    """Outcome of delivering a :class:`ParsedDoc` to one sink."""

    sink: str
    ok: bool
    url: Optional[str] = None
    detail: Optional[str] = None
    error: Optional[str] = None

    def to_status(self) -> dict:
        return {
            "sink": self.sink,
            "ok": self.ok,
            "url": self.url,
            "detail": self.detail,
            "error": self.error,
        }


class SinkError(Exception):
    """Raised by a sink when delivery fails for a known reason."""


class Sink:
    """Base class for a delivery target.

    Subclasses set ``name``/``aliases``/``requires`` and implement
    :meth:`deliver`. ``requires`` lists the environment variables that must be
    present for the sink to be usable.
    """

    name: str = "base"
    aliases: tuple = ()
    requires: tuple = ()  # required env vars
    label: str = ""  # human description
    local: bool = False  # filesystem-only, no network/auth

    def env(self, key: str, default: Optional[str] = None) -> Optional[str]:
        value = os.environ.get(key, default)
        return value.strip() if isinstance(value, str) else value

    def missing_config(self) -> list:
        return [k for k in self.requires if not self.env(k)]

    def is_configured(self) -> bool:
        return not self.missing_config()

    def deliver(self, doc: ParsedDoc) -> SinkResult:  # pragma: no cover - abstract
        raise NotImplementedError


# --------------------------------------------------------------------------- #
# Registry
# --------------------------------------------------------------------------- #
REGISTRY: dict = {}


def register(cls):
    """Class decorator that instantiates a sink and registers it by name+aliases."""
    inst = cls()
    REGISTRY[inst.name] = inst
    for alias in inst.aliases:
        REGISTRY[alias] = inst
    return cls


def get_sink(name: str) -> Optional[Sink]:
    return REGISTRY.get(name.lower())


def sink_names() -> list:
    """Canonical sink names (no aliases), sorted."""
    return sorted({s.name for s in REGISTRY.values()})
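As a rough illustration of how the registry in `base.py` behaves, the sketch below re-creates the decorator and lookup helpers in a standalone form. `EchoSink` and its alias are hypothetical names introduced only for this example; the real sinks live in their own modules and also implement `deliver`.

```python
from typing import Optional

REGISTRY: dict = {}


class Sink:
    name: str = "base"
    aliases: tuple = ()


def register(cls):
    """Instantiate the class once and index that instance by name and aliases."""
    inst = cls()
    REGISTRY[inst.name] = inst
    for alias in inst.aliases:
        REGISTRY[alias] = inst
    return cls


@register
class EchoSink(Sink):  # hypothetical sink, for illustration only
    name = "echo"
    aliases = ("print",)


def get_sink(name: str) -> Optional[Sink]:
    # Lookups are case-insensitive because the key is lowercased first.
    return REGISTRY.get(name.lower())


def sink_names() -> list:
    # Aliases point at the same instance, so the set collapses them.
    return sorted({s.name for s in REGISTRY.values()})
```

Because the decorator stores one shared instance under every key, `get_sink("ECHO")` and `get_sink("print")` resolve to the same object, and `sink_names()` reports only the canonical name.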
72
skills/developing/mineru/scripts/sinks/coda.py
Normal file
@ -0,0 +1,72 @@
"""Coda sink: deliver Markdown as a page, into an existing doc or a new one.

Coda's API (``https://coda.io/apis/v1``) authenticates with a Bearer token.
Markdown is delivered as canvas page content. If ``CODA_DOC_ID`` is set, a new
page is added to that doc; otherwise a new doc is created with the content as its
initial page.

Coda canvas content embeds images by URL only, so local image refs are left
untouched; host images at a public URL for them to render.
"""

from __future__ import annotations

from pathlib import Path

from . import _http, _md
from .base import ParsedDoc, Sink, SinkError, SinkResult, register

API = "https://coda.io/apis/v1"


def _canvas(markdown: str) -> dict:
    return {"type": "canvas", "canvasContent": {"format": "markdown", "content": markdown}}


@register
class CodaSink(Sink):
    name = "coda"
    requires = ("CODA_API_TOKEN",)
    label = "Coda page (REST API)"

    def deliver(self, doc: ParsedDoc) -> SinkResult:
        token = self.env("CODA_API_TOKEN")
        doc_id = self.env("CODA_DOC_ID")
        headers = {
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        }

        base_dir = Path(doc.markdown_path).parent if doc.markdown_path else None
        n_images = len(_md.find_local_images(doc.markdown, base_dir))

        if doc_id:
            status, parsed = _http.request_json(
                "POST", f"{API}/docs/{doc_id}/pages", headers=headers, payload={
                    "name": doc.title,
                    "pageContent": _canvas(doc.markdown),
                },
            )
        else:
            status, parsed = _http.request_json(
                "POST", f"{API}/docs", headers=headers, payload={
                    "title": doc.title,
                    "initialPage": {
                        "name": doc.title,
                        "pageContent": _canvas(doc.markdown),
                    },
                },
            )

        if status >= 400:
            raise SinkError(parsed.get("message") or f"HTTP {status}")

        if n_images:
            detail = f"text only ({n_images} local image(s); Coda embeds images by URL)"
        else:
            detail = "text only"
        return SinkResult(
            sink=self.name, ok=True,
            url=parsed.get("browserLink"),
            detail=detail,
        )
66
skills/developing/mineru/scripts/sinks/confluence.py
Normal file
@ -0,0 +1,66 @@
"""Confluence sink: create a page from the parsed Markdown via the Cloud REST API.

Confluence Cloud ingests content as *storage-format* HTML. Delivery converts the
Markdown to HTML and creates a page with the v2 REST API
(``POST /wiki/api/v2/pages``) using Basic auth (email + API token).

Local images are not attached: Confluence storage HTML references attachments by
filename, which would require a separate upload step.
"""

from __future__ import annotations

import base64

from . import _http, _md
from .base import ParsedDoc, Sink, SinkError, SinkResult, register


@register
class ConfluenceSink(Sink):
    name = "confluence"
    requires = (
        "CONFLUENCE_BASE_URL",
        "CONFLUENCE_EMAIL",
        "CONFLUENCE_API_TOKEN",
        "CONFLUENCE_SPACE_ID",
    )
    label = "Confluence Cloud page (storage HTML)"

    def deliver(self, doc: ParsedDoc) -> SinkResult:
        base = self.env("CONFLUENCE_BASE_URL").rstrip("/")
        email = self.env("CONFLUENCE_EMAIL")
        token = self.env("CONFLUENCE_API_TOKEN")
        space = self.env("CONFLUENCE_SPACE_ID")

        auth = base64.b64encode(f"{email}:{token}".encode("utf-8")).decode("ascii")
        headers = {
            "Authorization": f"Basic {auth}",
            "Content-Type": "application/json",
        }

        html = _md.md_to_html(doc.markdown)
        status, parsed = _http.request_json(
            "POST",
            f"{base}/wiki/api/v2/pages",
            headers=headers,
            payload={
                "spaceId": space,
                "status": "current",
                "title": doc.title,
                "body": {"representation": "storage", "value": html},
            },
        )
        if status >= 400:
            raise SinkError(
                parsed.get("title")
                or parsed.get("message")
                or f"Confluence HTTP {status}"
            )

        webui = (parsed.get("_links") or {}).get("webui")
        url = base + webui if webui else None
        return SinkResult(
            sink=self.name, ok=True, url=url,
            detail="converted Markdown->storage HTML (local images not attached)",
        )
65
skills/developing/mineru/scripts/sinks/dingtalk.py
Normal file
@ -0,0 +1,65 @@
"""DingTalk (钉钉) sink: push parsed Markdown as a robot markdown message.

A DingTalk custom robot accepts a ``markdown`` message type. The official native
ingestion path is therefore a webhook POST carrying the document title and body.
When a signing secret is configured the request is HMAC-SHA256 signed per
DingTalk's spec. DingTalk's markdown renderer only fetches images over public
URLs, so local images won't render.

Docs: https://open.dingtalk.com/document/robots/custom-robot-access.
"""

from __future__ import annotations

import base64
import hashlib
import hmac
import time
import urllib.parse

from . import _http
from .base import ParsedDoc, Sink, SinkError, SinkResult, register


@register
class DingTalkSink(Sink):
    name = "dingtalk"
    aliases = ("钉钉",)
    requires = ("DINGTALK_WEBHOOK",)
    label = "DingTalk robot markdown (钉钉)"

    def _build_url(self) -> str:
        webhook = self.env("DINGTALK_WEBHOOK")
        if webhook.startswith("http"):
            url = webhook
        else:
            url = f"https://oapi.dingtalk.com/robot/send?access_token={webhook}"

        secret = self.env("DINGTALK_SECRET")
        if secret:
            timestamp = str(round(time.time() * 1000))
            string_to_sign = f"{timestamp}\n{secret}"
            hmac_code = hmac.new(
                secret.encode(), string_to_sign.encode(), hashlib.sha256
            ).digest()
            sign = urllib.parse.quote_plus(base64.b64encode(hmac_code))
            url += f"&timestamp={timestamp}&sign={sign}"
        return url

    def deliver(self, doc: ParsedDoc) -> SinkResult:
        url = self._build_url()
        payload = {
            "msgtype": "markdown",
            "markdown": {"title": doc.title, "text": doc.markdown},
        }
        status, parsed = _http.request_json("POST", url, payload=payload)

        if parsed.get("errcode") not in (0, None):
            raise SinkError(parsed.get("errmsg") or f"DingTalk HTTP {status}: {parsed}")

        return SinkResult(
            sink=self.name,
            ok=True,
            url=None,
            detail="robot markdown message (local images won't render; host publicly)",
        )
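The signing step in `_build_url` can be tried in isolation. The sketch below mirrors the same HMAC-SHA256 recipe with a fixed timestamp so the result is reproducible; the helper names `dingtalk_sign` and `signed_url`, the token, and the secret are placeholders introduced for illustration, not values from the module.

```python
import base64
import hashlib
import hmac
import urllib.parse


def dingtalk_sign(secret: str, timestamp_ms: str) -> str:
    """Return the URL-encoded signature for a millisecond timestamp."""
    # The string to sign is "<timestamp>\n<secret>", keyed with the secret itself.
    string_to_sign = f"{timestamp_ms}\n{secret}"
    digest = hmac.new(
        secret.encode("utf-8"), string_to_sign.encode("utf-8"), hashlib.sha256
    ).digest()
    # Base64 output may contain "+", "/" and "=", so it is percent-encoded.
    return urllib.parse.quote_plus(base64.b64encode(digest))


def signed_url(base_url: str, secret: str, timestamp_ms: str) -> str:
    """Append the timestamp and signature as query parameters."""
    sign = dingtalk_sign(secret, timestamp_ms)
    return f"{base_url}&timestamp={timestamp_ms}&sign={sign}"


example = signed_url(
    "https://oapi.dingtalk.com/robot/send?access_token=TOKEN",  # placeholder token
    "SECdemo",        # placeholder secret
    "1700000000000",  # fixed timestamp so the output is stable
)
print(example)
```

In real use the timestamp comes from the current clock, as in the sink, since a stale timestamp is expected to be rejected by the service.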
124
skills/developing/mineru/scripts/sinks/feishu.py
Normal file
@ -0,0 +1,124 @@
"""Feishu / Lark sink: import the parsed Markdown as a Docx document.

Feishu (飞书) / Lark ingests Markdown through its Drive import pipeline. Delivery
follows that official path:

1. ``tenant_access_token/internal``: exchange the app id/secret for a tenant
   access token.
2. ``drive/v1/medias/upload_all``: upload the ``.md`` bytes as an import medium
   and obtain a ``file_token``.
3. ``drive/v1/import_tasks``: kick off an import task converting the medium to a
   Docx, returning a ``ticket``.
4. Poll ``drive/v1/import_tasks/{ticket}`` until the job finishes, surfacing the
   resulting document URL.

Local images are not uploaded; they would need public URLs to render in Docx.
"""

from __future__ import annotations

import json
import time

from . import _http, _md
from .base import ParsedDoc, Sink, SinkError, SinkResult, register


@register
class FeishuSink(Sink):
    name = "feishu"
    aliases = ("lark", "飞书")
    requires = ("FEISHU_APP_ID", "FEISHU_APP_SECRET")
    label = "Feishu / Lark Docx (Drive import)"

    def deliver(self, doc: ParsedDoc) -> SinkResult:
        app_id = self.env("FEISHU_APP_ID")
        app_secret = self.env("FEISHU_APP_SECRET")
        folder_token = self.env("FEISHU_FOLDER_TOKEN")

        # Step 1: tenant access token.
        status, parsed = _http.request_json(
            "POST",
            "https://open.feishu.cn/open-apis/auth/v3/tenant_access_token/internal",
            payload={"app_id": app_id, "app_secret": app_secret},
        )
        token = parsed.get("tenant_access_token")
        if parsed.get("code") not in (0, None) or not token:
            raise SinkError(parsed.get("msg") or f"Feishu auth failed (HTTP {status})")
        headers = {"Authorization": f"Bearer {token}"}

        # Step 2: upload the Markdown bytes as an import medium.
        content = doc.markdown.encode("utf-8")
        fname = _md.safe_filename(doc.title) + ".md"
        ctype, body = _http.encode_multipart(
            fields={
                "file_name": fname,
                "parent_type": "ccm_import_open",
                "size": str(len(content)),
                "extra": json.dumps({"obj_type": "docx", "file_extension": "md"}),
            },
            files=[("file", fname, content)],
        )
        up_status, raw = _http.http_request(
            "POST",
            "https://open.feishu.cn/open-apis/drive/v1/medias/upload_all",
            headers={**headers, "Content-Type": ctype},
            data=body,
        )
        parsed = _parse_json(raw)
        if parsed.get("code") not in (0, None):
            raise SinkError(parsed.get("msg") or f"Feishu media upload failed (HTTP {up_status})")
        file_token = (parsed.get("data") or {}).get("file_token")
        if not file_token:
            raise SinkError("Feishu did not return a file_token")

        # Step 3: create the import task.
        status, parsed = _http.request_json(
            "POST",
            "https://open.feishu.cn/open-apis/drive/v1/import_tasks",
            headers=headers,
            payload={
                "file_extension": "md",
                "file_token": file_token,
                "type": "docx",
                "file_name": doc.title,
                "point": {"mount_type": 1, "mount_key": folder_token or ""},
            },
        )
        if parsed.get("code") not in (0, None):
            raise SinkError(parsed.get("msg") or f"Feishu import task failed (HTTP {status})")
        ticket = (parsed.get("data") or {}).get("ticket")
        if not ticket:
            raise SinkError("Feishu did not return an import ticket")

        # Step 4: poll until the import job completes.
        url = None
        for _attempt in range(20):
            status, parsed = _http.request_json(
                "GET",
                f"https://open.feishu.cn/open-apis/drive/v1/import_tasks/{ticket}",
                headers=headers,
            )
            res = (parsed.get("data") or {}).get("result") or {}
            job_status = res.get("job_status")
            if job_status == 0:
                url = res.get("url")
                break
            if job_status in (1, 2):
                time.sleep(1)
                continue
            raise SinkError(res.get("job_error_msg") or "Feishu import failed")

        return SinkResult(
            sink=self.name, ok=True, url=url,
            detail="imported to Feishu Docx (local images need public URLs)",
        )


def _parse_json(raw):
    if not raw:
        return {}
    try:
        return json.loads(raw.decode("utf-8"))
    except (ValueError, UnicodeDecodeError):
        return {}
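The polling step (step 4) is the part of the Feishu flow that is easiest to get subtly wrong, so here is a minimal sketch of the same loop with the HTTP call replaced by an injected callable. The function name `poll_import`, the `fetch` parameter, and the fake responses are assumptions made for this example; the status codes follow the sink's own handling (0 means done, 1 and 2 mean still running).

```python
import time
from typing import Callable, Optional


def poll_import(fetch: Callable[[], dict], attempts: int = 20,
                delay: float = 1.0) -> Optional[str]:
    """Poll an import job; return its URL, or None if attempts run out."""
    for _ in range(attempts):
        result = fetch()
        job_status = result.get("job_status")
        if job_status == 0:
            return result.get("url")          # finished successfully
        if job_status in (1, 2):
            time.sleep(delay)                 # still initialising or processing
            continue
        # Any other status is treated as a failure, as in the sink.
        raise RuntimeError(result.get("job_error_msg") or "import failed")
    return None  # attempts exhausted; the caller decides whether that is an error


# Fake responses standing in for the HTTP call.
_responses = iter([
    {"job_status": 1},
    {"job_status": 2},
    {"job_status": 0, "url": "https://example.invalid/docx/abc"},
])
doc_url = poll_import(lambda: next(_responses), delay=0)
print(doc_url)
```

Returning `None` on exhaustion makes the timeout case explicit, so a caller can choose between reporting success without a URL and raising an error.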
75
skills/developing/mineru/scripts/sinks/linear.py
Normal file
@ -0,0 +1,75 @@
"""Linear sink: create an issue from Markdown via the GraphQL API.

Linear's API is GraphQL at ``https://api.linear.app/graphql`` and authenticates
with a raw API key in the ``Authorization`` header (no ``Bearer`` prefix). The
issue description is Markdown; Linear renders inline ``data:`` image URIs, so
local images are read and embedded as base64 data URIs before delivery.
"""

from __future__ import annotations

import base64
from pathlib import Path

from . import _http, _md
from .base import ParsedDoc, Sink, SinkError, SinkResult, register

API = "https://api.linear.app/graphql"

_MUTATION = (
    "mutation IssueCreate($input: IssueCreateInput!)"
    "{issueCreate(input:$input){success issue{id url identifier}}}"
)

_MIME = {
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".gif": "image/gif",
    ".webp": "image/webp",
}


def _data_uri(path: Path) -> str:
    mime = _MIME.get(path.suffix.lower(), "image/png")
    b64 = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime};base64,{b64}"


@register
class LinearSink(Sink):
    name = "linear"
    requires = ("LINEAR_API_KEY", "LINEAR_TEAM_ID")
    label = "Linear issue (GraphQL API)"

    def deliver(self, doc: ParsedDoc) -> SinkResult:
        key = self.env("LINEAR_API_KEY")
        team = self.env("LINEAR_TEAM_ID")
        headers = {"Authorization": key, "Content-Type": "application/json"}

        base_dir = Path(doc.markdown_path).parent if doc.markdown_path else None
        images = _md.find_local_images(doc.markdown, base_dir)
        mapping = {ref: _data_uri(path) for _alt, ref, path in images}
        body = _md.rewrite_images(doc.markdown, mapping)

        status, parsed = _http.request_json("POST", API, headers=headers, payload={
            "query": _MUTATION,
            "variables": {"input": {
                "teamId": team,
                "title": doc.title,
                "description": body,
            }},
        })
        if parsed.get("errors"):
            raise SinkError(str(parsed["errors"]))

        result = ((parsed.get("data") or {}).get("issueCreate")) or {}
        if not result.get("success"):
            raise SinkError(f"Linear did not create the issue (HTTP {status})")
        issue = result.get("issue") or {}

        return SinkResult(
            sink=self.name, ok=True,
            url=issue.get("url"),
            detail=f"{len(mapping)} image(s) inlined",
        )
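The image-inlining idea used by the Linear sink can be sketched on its own. In the example below `data_uri` follows the same recipe as `_data_uri`, while `inline_images` is a hypothetical, simplified stand-in for the rewrite step (the module's `_md.rewrite_images` is not shown in this diff, so its exact behaviour is not assumed here). The file contents are placeholder bytes, not a real image.

```python
import base64
import tempfile
from pathlib import Path

MIME = {
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".gif": "image/gif",
    ".webp": "image/webp",
}


def data_uri(path: Path) -> str:
    """Encode a local file as a base64 data URI, defaulting to PNG."""
    mime = MIME.get(path.suffix.lower(), "image/png")
    b64 = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime};base64,{b64}"


def inline_images(markdown: str, mapping: dict) -> str:
    """Hypothetical rewrite step: swap each image ref for its data URI."""
    for ref, uri in mapping.items():
        markdown = markdown.replace(f"]({ref})", f"]({uri})")
    return markdown


with tempfile.TemporaryDirectory() as tmp:
    img = Path(tmp) / "fig.JPG"
    img.write_bytes(b"abc")  # placeholder bytes
    uri = data_uri(img)
    body = inline_images("![fig](images/fig.JPG)", {"images/fig.JPG": uri})

print(body)
```

Base64 inflates the payload by roughly a third, so this approach suits documents with a handful of small images better than image-heavy scans.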
105
skills/developing/mineru/scripts/sinks/local.py
Normal file
@ -0,0 +1,105 @@
"""Local-first sinks: Obsidian and Logseq (filesystem writes, no auth).

Both tools are folders of Markdown files. The native ingestion is a filesystem
write following each tool's conventions:

* Obsidian: a flat note with YAML frontmatter; images in a per-note assets
  folder, referenced with relative Markdown embeds.
* Logseq: an outline (every line a ``- `` block) with ``key:: value`` page
  properties on the first block; images in ``assets/`` referenced as
  ``![alt](../assets/<name>)``.
"""

from __future__ import annotations

from pathlib import Path

from . import _md
from .base import ParsedDoc, Sink, SinkError, SinkResult, register


def _copy_images(doc: ParsedDoc, dest_dir: Path, ref_prefix: str) -> dict:
    """Copy referenced local images into ``dest_dir``; return ``{old_ref: new_ref}``."""
    base = Path(doc.markdown_path).parent if doc.markdown_path else None
    mapping = {}
    images = _md.find_local_images(doc.markdown, base)
    if images:
        dest_dir.mkdir(parents=True, exist_ok=True)
    for _alt, ref, path in images:
        target = dest_dir / path.name
        target.write_bytes(path.read_bytes())
        mapping[ref] = f"{ref_prefix}{path.name}"
    return mapping


@register
class ObsidianSink(Sink):
    name = "obsidian"
    aliases = ("ob",)
    requires = ("OBSIDIAN_VAULT",)
    label = "Obsidian vault (local Markdown)"
    local = True

    def deliver(self, doc: ParsedDoc) -> SinkResult:
        vault = Path(self.env("OBSIDIAN_VAULT")).expanduser()
        if not vault.is_dir():
            raise SinkError(f"Obsidian vault not found: {vault}")
        subdir = self.env("OBSIDIAN_SUBDIR", "") or ""
        note_dir = vault / subdir if subdir else vault
        note_dir.mkdir(parents=True, exist_ok=True)

        stem = _md.safe_filename(doc.title)
        assets = note_dir / f"{stem}.assets"
        mapping = _copy_images(doc, assets, f"{stem}.assets/")
        body = _md.rewrite_images(doc.markdown, mapping)

        front = _md.yaml_frontmatter({
            "title": doc.title,
            "source": doc.source,
            "modality": doc.modality,
            "tags": ["mineru", "parsed"],
        })
        note_path = note_dir / f"{stem}.md"
        note_path.write_text(f"{front}\n\n{body}\n", encoding="utf-8")
        return SinkResult(sink=self.name, ok=True, url=str(note_path),
                          detail=f"{len(mapping)} image(s)")


@register
class LogseqSink(Sink):
    name = "logseq"
    requires = ("LOGSEQ_GRAPH",)
    label = "Logseq graph (local outline)"
    local = True

    def deliver(self, doc: ParsedDoc) -> SinkResult:
        graph = Path(self.env("LOGSEQ_GRAPH")).expanduser()
        if not graph.is_dir():
            raise SinkError(f"Logseq graph not found: {graph}")
        pages = graph / "pages"
        assets = graph / "assets"
        pages.mkdir(parents=True, exist_ok=True)

        stem = _md.safe_filename(doc.title)
        # Namespace asset names by page slug to avoid collisions in the shared assets/.
        prefix = _md.slugify(doc.title)
        mapping = {}
        base = Path(doc.markdown_path).parent if doc.markdown_path else None
        images = _md.find_local_images(doc.markdown, base)
        if images:
            assets.mkdir(parents=True, exist_ok=True)
        for _alt, ref, path in images:
            new_name = f"{prefix}-{path.name}"
            (assets / new_name).write_bytes(path.read_bytes())
            mapping[ref] = f"../assets/{new_name}"
        body = _md.rewrite_images(doc.markdown, mapping)

        outline = _md.md_to_logseq(body, properties={
            "title": doc.title,
            "source": doc.source,
            "tags": "mineru, parsed",
        })
        page_path = pages / f"{stem}.md"
        page_path.write_text(outline + "\n", encoding="utf-8")
        return SinkResult(sink=self.name, ok=True, url=str(page_path),
                          detail=f"{len(mapping)} image(s)")
130
skills/developing/mineru/scripts/sinks/notion.py
Normal file
@ -0,0 +1,130 @@
"""Notion sink: create a page under a parent page from Markdown blocks.

Notion's native ingestion is the block API: each Markdown line becomes a typed
block (heading, quote, code, list item, paragraph). A page is created with up to
100 children inline; any remainder is appended in 100-block chunks via the
``/blocks/{id}/children`` PATCH endpoint.

Notion has no inline image-from-bytes path (images must be uploaded or hosted
separately), so local image refs are intentionally left untouched.
"""

from __future__ import annotations

from pathlib import Path

from . import _http, _md
from .base import ParsedDoc, Sink, SinkError, SinkResult, register

API = "https://api.notion.com/v1"
MAX_BLOCKS = 100
MAX_TEXT = 2000


def _rich(text: str) -> list:
    return [{"type": "text", "text": {"content": text[:MAX_TEXT]}}]


def _block(block_type: str, text: str, **extra) -> dict:
    inner = {"rich_text": _rich(text)}
    inner.update(extra)
    return {"object": "block", "type": block_type, block_type: inner}


def _is_numbered(text: str) -> bool:
    head = text.split(".", 1)
    return len(head) == 2 and head[0].isdigit() and head[1].startswith(" ")


def _blocks(markdown: str) -> list:
    """Convert flat Markdown lines into a list of Notion block dicts."""
    blocks = []
    in_code = False
    code_buf: list = []
    for raw in markdown.replace("\r\n", "\n").split("\n"):
        stripped = raw.strip()

        if stripped.startswith("```"):
            if in_code:
                blocks.append(_block("code", "\n".join(code_buf), language="plain text"))
                in_code = False
                code_buf = []
            else:
                in_code = True
                code_buf = []
            continue
        if in_code:
            code_buf.append(raw)
            continue

        if not stripped:
            continue
        if stripped.startswith("# "):
            blocks.append(_block("heading_1", stripped[2:].strip()))
        elif stripped.startswith("## "):
            blocks.append(_block("heading_2", stripped[3:].strip()))
        elif stripped.startswith("### "):
            blocks.append(_block("heading_3", stripped[4:].strip()))
        elif stripped.startswith("> "):
            blocks.append(_block("quote", stripped[2:].strip()))
        elif stripped.startswith("- ") or stripped.startswith("* "):
            blocks.append(_block("bulleted_list_item", stripped[2:].strip()))
        elif _is_numbered(stripped):
            blocks.append(_block("numbered_list_item", stripped.split(".", 1)[1].strip()))
        else:
            blocks.append(_block("paragraph", stripped))

    if in_code:
        blocks.append(_block("code", "\n".join(code_buf), language="plain text"))
    return blocks


@register
class NotionSink(Sink):
    name = "notion"
    requires = ("NOTION_API_KEY", "NOTION_PARENT_PAGE_ID")
    label = "Notion page (blocks API)"

    def deliver(self, doc: ParsedDoc) -> SinkResult:
        key = self.env("NOTION_API_KEY")
        parent = self.env("NOTION_PARENT_PAGE_ID")
        version = self.env("NOTION_VERSION", "2022-06-28") or "2022-06-28"
        headers = {
            "Authorization": f"Bearer {key}",
            "Notion-Version": version,
            "Content-Type": "application/json",
        }

        # Count local images for the detail note (refs are left as-is).
        base_dir = Path(doc.markdown_path).parent if doc.markdown_path else None
        n_images = len(_md.find_local_images(doc.markdown, base_dir))

        blocks = _blocks(doc.markdown)
        status, parsed = _http.request_json("POST", f"{API}/pages", headers=headers, payload={
            "parent": {"page_id": parent},
            "properties": {"title": {"title": [{"text": {"content": doc.title}}]}},
            "children": blocks[:MAX_BLOCKS],
        })
        if parsed.get("object") == "error":
            raise SinkError(parsed.get("message") or f"Notion API error (HTTP {status})")
        created_id = parsed.get("id")
        if not created_id:
            raise SinkError(f"Notion did not return a page id (HTTP {status})")
        page_url = parsed.get("url")

        for start in range(MAX_BLOCKS, len(blocks), MAX_BLOCKS):
            chunk = blocks[start:start + MAX_BLOCKS]
            ch_status, ch_parsed = _http.request_json(
                "PATCH", f"{API}/blocks/{created_id}/children",
                headers=headers, payload={"children": chunk},
            )
            if ch_parsed.get("object") == "error":
                raise SinkError(ch_parsed.get("message")
                                or f"Notion block append failed (HTTP {ch_status})")

        if n_images:
            detail = (f"text+structure ({n_images} local images not embedded; "
                      f"Notion needs file upload)")
        else:
            detail = "text+structure"
        return SinkResult(sink=self.name, ok=True, url=page_url, detail=detail)
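The 100-block limit handling in the Notion sink comes down to simple slicing. The sketch below isolates that step; `split_for_notion` is a hypothetical helper name, and the blocks are placeholder strings rather than real Notion block dicts.

```python
MAX_BLOCKS = 100


def split_for_notion(blocks: list, size: int = MAX_BLOCKS):
    """Return (inline_children, follow_up_chunks) for a list of blocks."""
    # The first `size` blocks travel with the page-creation request.
    first = blocks[:size]
    # Everything after that is appended in chunks of at most `size`.
    rest = [blocks[start:start + size] for start in range(size, len(blocks), size)]
    return first, rest


blocks = [f"block-{i}" for i in range(250)]  # placeholder blocks
first, rest = split_for_notion(blocks)
print(len(first), [len(chunk) for chunk in rest])
```

For 250 blocks this yields one create call carrying 100 children followed by two append calls, so a document costs one request per started hundred blocks.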
66
skills/developing/mineru/scripts/sinks/onenote.py
Normal file
@ -0,0 +1,66 @@
|
||||
"""OneNote sink: create a page from the parsed Markdown via Microsoft Graph.
|
||||
|
||||
OneNote pages are created by POSTing an HTML document to a section's ``pages``
|
||||
endpoint with a pre-obtained Microsoft Graph access token (OAuth). Delivery
|
||||
converts the Markdown to a full HTML document and creates the page.
|
||||
|
||||
Only remote images render — Graph fetches ``<img src>`` URLs, so local image
|
||||
paths emitted by MinerU would need to be public URLs.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import html
|
||||
import json
|
||||
|
||||
from . import _http, _md
|
||||
from .base import ParsedDoc, Sink, SinkError, SinkResult, register
|
||||
|
||||
|
||||
@register
|
||||
class OneNoteSink(Sink):
|
||||
name = "onenote"
|
||||
aliases = ("msonenote",)
|
||||
requires = ("ONENOTE_TOKEN", "ONENOTE_SECTION_ID")
|
||||
label = "OneNote section page (Microsoft Graph)"
|
||||
|
||||
def deliver(self, doc: ParsedDoc) -> SinkResult:
|
||||
token = self.env("ONENOTE_TOKEN")
|
||||
section = self.env("ONENOTE_SECTION_ID")
|
||||
|
||||
body_html = _md.md_to_html(doc.markdown)
|
||||
page = (
|
||||
"<!DOCTYPE html><html><head>"
|
||||
f"<title>{html.escape(doc.title)}</title>"
|
||||
f"</head><body>{body_html}</body></html>"
|
||||
)
|
||||
|
||||
status, raw = _http.http_request(
|
||||
"POST",
|
||||
f"https://graph.microsoft.com/v1.0/me/onenote/sections/{section}/pages",
|
||||
headers={
|
||||
"Authorization": f"Bearer {token}",
|
||||
"Content-Type": "text/html",
|
||||
},
|
||||
data=page.encode("utf-8"),
|
||||
)
|
||||
if status >= 400:
|
||||
preview = raw.decode("utf-8", "replace") if raw else ""
|
||||
raise SinkError(f"OneNote HTTP {status}: {preview[:200]}")
|
||||
if status != 201:
|
||||
raise SinkError(f"OneNote unexpected response (HTTP {status})")
|
||||
|
||||
parsed = {}
|
||||
if raw:
|
||||
try:
|
||||
parsed = json.loads(raw.decode("utf-8"))
|
||||
except (ValueError, UnicodeDecodeError):
|
||||
parsed = {}
|
||||
links = parsed.get("links") or {}
|
||||
web = links.get("oneNoteWebUrl") or {}
|
||||
url = web.get("href")
|
||||
|
||||
return SinkResult(
|
||||
sink=self.name, ok=True, url=url,
|
||||
detail="converted Markdown->HTML (remote images only; OAuth token required)",
|
||||
)
|
||||
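The OneNote sink reads the page link out of the creation response and tolerates an empty or non-JSON body. The same lookup can be sketched as a standalone function; `onenote_web_url` is a hypothetical name and the sample payload is invented for illustration, not a captured Graph response:

```python
from __future__ import annotations

import json


def onenote_web_url(raw: bytes | None) -> str | None:
    """Pull links.oneNoteWebUrl.href out of a page-creation response body."""
    if not raw:
        return None
    try:
        parsed = json.loads(raw.decode("utf-8"))
    except (ValueError, UnicodeDecodeError):
        return None
    if not isinstance(parsed, dict):
        return None
    web = (parsed.get("links") or {}).get("oneNoteWebUrl") or {}
    return web.get("href")


sample = b'{"links": {"oneNoteWebUrl": {"href": "https://example.invalid/page"}}}'
print(onenote_web_url(sample))
print(onenote_web_url(b"not json"))
```

Returning `None` instead of raising matches the sink's behaviour of reporting success with an empty URL when the link is missing.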
106
skills/developing/mineru/scripts/sinks/roam.py
Normal file
@ -0,0 +1,106 @@
|
||||
"""Roam Research sink — optional dependency.
|
||||
|
||||
There is no library that ingests a Markdown document into Roam, but the official
|
||||
``roam-client`` SDK correctly handles the parts that are easy to get wrong — the
|
||||
307/308 peer-host redirect, the dual ``Authorization`` / ``x-authorization``
|
||||
Bearer headers, and the ``/write`` plumbing. So we lazily depend on it for
|
||||
transport and only build the Markdown → block-tree ourselves, delivering the whole
|
||||
document in a single ``batch-actions`` request (one HTTP round-trip).
|
||||
|
||||
Install the SDK (git-only, not on PyPI; needs Python ≥ 3.11):
|
||||
|
||||
pip install "roam-client @ git+https://github.com/Roam-Research/backend-sdks.git#subdirectory=python"
|
||||
|
||||
Config: ``ROAM_API_TOKEN`` (graph edit token), ``ROAM_GRAPH_NAME``.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import itertools
|
||||
import re
|
||||
|
||||
from .base import ParsedDoc, Sink, SinkError, SinkResult, register
|
||||
|
||||
_HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
|
||||
_INSTALL_HINT = (
|
||||
'Roam sink needs the official SDK — pip install '
|
||||
'"roam-client @ git+https://github.com/Roam-Research/backend-sdks.git#subdirectory=python"'
|
||||
)
|
||||
|
||||
|
||||
def md_to_roam_tree(markdown: str) -> list:
|
||||
"""Convert Markdown into a nested Roam block tree.
|
||||
|
||||
Headings become parent blocks (``heading`` 1–3); the lines under a heading
|
||||
nest beneath it. Returns ``[{"string", "heading"?, "children": [...]}, ...]``.
|
||||
"""
|
||||
roots: list = []
|
||||
stack: list = [] # [(heading_level, node)]
|
||||
for raw in markdown.replace("\r\n", "\n").split("\n"):
|
||||
line = raw.strip()
|
||||
if not line:
|
||||
continue
|
||||
match = _HEADING.match(line)
|
||||
if match:
|
||||
level = len(match.group(1))
|
||||
node = {"string": match.group(2), "heading": min(level, 3), "children": []}
|
||||
while stack and stack[-1][0] >= level:
|
||||
stack.pop()
|
||||
(stack[-1][1]["children"] if stack else roots).append(node)
|
||||
stack.append((level, node))
|
||||
else:
|
||||
node = {"string": line, "children": []}
|
||||
(stack[-1][1]["children"] if stack else roots).append(node)
|
||||
return roots
|
||||
|
||||
|
||||
def tree_to_actions(children: list, parent_uid: str, uidgen) -> list:
|
||||
"""Flatten a block tree into ``create-block`` actions for one batch request."""
|
||||
actions: list = []
|
||||
for order, node in enumerate(children):
|
||||
uid = uidgen()
|
||||
block = {"string": node["string"], "uid": uid}
|
||||
if node.get("heading"):
|
||||
block["heading"] = node["heading"]
|
||||
actions.append({
|
||||
"action": "create-block",
|
||||
"location": {"parent-uid": parent_uid, "order": order},
|
||||
"block": block,
|
||||
})
|
||||
actions.extend(tree_to_actions(node.get("children", []), uid, uidgen))
|
||||
return actions
|
||||
|
||||
|
||||
@register
|
||||
class RoamSink(Sink):
|
||||
name = "roam"
|
||||
aliases = ("roamresearch",)
|
||||
requires = ("ROAM_API_TOKEN", "ROAM_GRAPH_NAME")
|
||||
label = "Roam Research (batch-actions, optional dep)"
|
||||
|
||||
def deliver(self, doc: ParsedDoc) -> SinkResult:
|
||||
try:
|
||||
from roam_client.client import create_page, initialize_graph
|
||||
except ImportError as exc: # pragma: no cover - exercised via SinkError path
|
||||
raise SinkError(_INSTALL_HINT) from exc
|
||||
|
||||
token = self.env("ROAM_API_TOKEN")
|
||||
graph = self.env("ROAM_GRAPH_NAME")
|
||||
client = initialize_graph({"token": token, "graph": graph})
|
||||
|
||||
create_page(client, {"page": {"title": doc.title}})
|
||||
|
||||
counter = itertools.count(1)
|
||||
actions = tree_to_actions(
|
||||
md_to_roam_tree(doc.markdown), doc.title, lambda: f"mu{next(counter):07d}"
|
||||
)
|
||||
if actions:
|
||||
client.call(
|
||||
f"/api/graph/{graph}/write", "POST",
|
||||
{"action": "batch-actions", "actions": actions},
|
||||
)
|
||||
return SinkResult(
|
||||
sink=self.name, ok=True,
|
||||
url=f"https://roamresearch.com/#/app/{graph}",
|
||||
detail=f"{len(actions)} block(s) via batch-actions (images need public URLs)",
|
||||
)
|
||||
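To see what `tree_to_actions` produces, it helps to run it on a tiny tree. The sketch below repeats the function from the Roam sink and feeds it a hand-built tree; the `"page-uid"` parent and the `u1`, `u2`, ... generator are placeholders chosen for readability, not values the sink uses:

```python
import itertools


def tree_to_actions(children, parent_uid, uidgen):
    """Pre-order walk: a parent's create-block is emitted before its children's."""
    actions = []
    for order, node in enumerate(children):
        uid = uidgen()
        block = {"string": node["string"], "uid": uid}
        if node.get("heading"):
            block["heading"] = node["heading"]
        actions.append({
            "action": "create-block",
            "location": {"parent-uid": parent_uid, "order": order},
            "block": block,
        })
        actions.extend(tree_to_actions(node.get("children", []), uid, uidgen))
    return actions


# Hand-built tree for the Markdown lines "# A", "text", "## B", "more".
tree = [{"string": "A", "heading": 1, "children": [
    {"string": "text", "children": []},
    {"string": "B", "heading": 2, "children": [
        {"string": "more", "children": []},
    ]},
]}]

counter = itertools.count(1)
actions = tree_to_actions(tree, "page-uid", lambda: f"u{next(counter)}")
rows = [
    (a["block"]["uid"], a["location"]["parent-uid"], a["location"]["order"], a["block"]["string"])
    for a in actions
]
for row in rows:
    print(row)
```

Because the walk is pre-order, every `parent-uid` in the list refers either to the starting parent or to a block created earlier in the same list, which is presumably what lets the whole document go out as one batch.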
111
skills/developing/mineru/scripts/sinks/siyuan.py
Normal file
@ -0,0 +1,111 @@
|
||||
"""SiYuan sink: create a new document from Markdown via the local kernel API.
|
||||
|
||||
SiYuan (思源笔记) exposes a kernel HTTP API (default ``http://127.0.0.1:6806``)
|
||||
authenticated with an API token. Delivery follows SiYuan's native ingestion path:
|
||||
|
||||
1. Resolve the target notebook (``SIYUAN_NOTEBOOK`` or the first listed notebook).
|
||||
2. Upload each referenced local image via ``/api/asset/upload`` and rewrite the
|
||||
Markdown to point at the returned ``assets/...`` paths.
|
||||
3. Create the document with ``/api/filetree/createDocWithMd``.
|
||||
|
||||
Every kernel response wraps its payload as ``{"code": 0, "msg": "", "data": ...}``;
|
||||
a non-zero ``code`` is an error.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
from pathlib import Path
|
||||
|
||||
from . import _http, _md
|
||||
from .base import ParsedDoc, Sink, SinkError, SinkResult, register
|
||||
|
||||
|
||||
@register
|
||||
class SiYuanSink(Sink):
|
||||
name = "siyuan"
|
||||
requires = ("SIYUAN_TOKEN",)
|
||||
label = "SiYuan notebook (local kernel API)"
|
||||
|
||||
def _json_post(self, base: str, path: str, headers: dict, payload: dict):
|
||||
"""POST JSON; return ``data`` after verifying ``code == 0``."""
|
||||
try:
|
||||
status, parsed = _http.request_json("POST", f"{base}{path}",
|
||||
headers=headers, payload=payload)
|
||||
except Exception as exc: # noqa: BLE001
|
||||
raise self._unreachable(base, exc) from exc
|
||||
return self._unwrap(base, status, parsed)
|
||||
|
||||
def _upload_post(self, base: str, headers: dict, content_type: str, body: bytes):
|
||||
"""POST a multipart body; return ``data`` after verifying ``code == 0``."""
|
||||
hdrs = dict(headers)
|
||||
hdrs["Content-Type"] = content_type
|
||||
try:
|
||||
status, raw = _http.http_request("POST", f"{base}/api/asset/upload",
|
||||
headers=hdrs, data=body)
|
||||
except Exception as exc: # noqa: BLE001
|
||||
raise self._unreachable(base, exc) from exc
|
||||
parsed: dict = {}
|
||||
if raw:
|
||||
try:
|
||||
parsed = json.loads(raw.decode("utf-8"))
|
||||
except (ValueError, UnicodeDecodeError):
|
||||
parsed = {}
|
||||
return self._unwrap(base, status, parsed)
|
||||
|
||||
@staticmethod
|
||||
def _unreachable(base: str, exc=None) -> SinkError:
|
||||
suffix = f" ({exc})" if exc else ""
|
||||
return SinkError(
|
||||
f"SiYuan kernel not reachable at {base} — start SiYuan and enable "
|
||||
f"the API token{suffix}"
|
||||
)
|
||||
|
||||
def _unwrap(self, base: str, status: int, parsed: dict):
|
||||
if status == 0:
|
||||
raise self._unreachable(base)
|
||||
if parsed.get("code") != 0:
|
||||
raise SinkError(parsed.get("msg") or f"SiYuan API error (HTTP {status})")
|
||||
return parsed.get("data")
|
||||
|
||||
def deliver(self, doc: ParsedDoc) -> SinkResult:
|
||||
base = (self.env("SIYUAN_API_URL", "http://127.0.0.1:6806")
|
||||
or "http://127.0.0.1:6806").rstrip("/")
|
||||
token = self.env("SIYUAN_TOKEN")
|
||||
headers = {"Authorization": f"Token {token}"}
|
||||
|
||||
notebook = self.env("SIYUAN_NOTEBOOK")
|
||||
if not notebook:
|
||||
data = self._json_post(base, "/api/notebook/lsNotebooks", headers, {})
|
||||
notebooks = (data or {}).get("notebooks") or []
|
||||
if not notebooks:
|
||||
raise SinkError("SiYuan has no notebooks — create one before delivering")
|
||||
notebook = notebooks[0]["id"]
|
||||
|
||||
base_dir = Path(doc.markdown_path).parent if doc.markdown_path else None
|
||||
images = _md.find_local_images(doc.markdown, base_dir)
|
||||
mapping = {}
|
||||
for _alt, ref, path in images:
|
||||
content_type, body = _http.encode_multipart(
|
||||
fields={"assetsDirPath": "/assets/"},
|
||||
files=[("file[]", path.name, path.read_bytes())],
|
||||
)
|
||||
data = self._upload_post(base, headers, content_type, body)
|
||||
succ_map = (data or {}).get("succMap") or {}
|
||||
if path.name in succ_map:
|
||||
mapping[ref] = succ_map[path.name]
|
||||
body_md = _md.rewrite_images(doc.markdown, mapping)
|
||||
|
||||
docid = self._json_post(base, "/api/filetree/createDocWithMd", headers, {
|
||||
"notebook": notebook,
|
||||
"path": "/" + _md.safe_filename(doc.title),
|
||||
"markdown": body_md,
|
||||
})
|
||||
if not docid:
|
||||
raise SinkError("SiYuan did not return a document id")
|
||||
|
||||
return SinkResult(
|
||||
sink=self.name, ok=True,
|
||||
url=f"siyuan://blocks/{docid}",
|
||||
detail=f"{len(mapping)} image(s)",
|
||||
)
|
||||
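Step 2 of the SiYuan flow rewrites image references to the uploaded `assets/...` paths. The project's `_md.rewrite_images` is not shown in this excerpt, so the following is only a guess at the idea, written as a hypothetical standalone function that handles plain `![alt](target)` and `![alt](target "title")` forms:

```python
import re

# Captures: prefix "![alt](", the target, and the optional title plus ")".
_IMAGE = re.compile(r'(!\[[^\]]*\]\()([^)\s]+)((?:\s+"[^"]*")?\))')


def rewrite_images(markdown: str, mapping: dict) -> str:
    """Swap image targets found in ``mapping``; leave every other reference untouched."""
    def swap(match):
        ref = match.group(2)
        return match.group(1) + mapping.get(ref, ref) + match.group(3)
    return _IMAGE.sub(swap, markdown)


md = "![fig](images/a.png) and ![remote](https://example.invalid/b.png)"
rewritten = rewrite_images(md, {"images/a.png": "assets/a-20240101-abc.png"})
print(rewritten)
```

References missing from the mapping (remote URLs, or uploads that SiYuan did not report in `succMap`) pass through unchanged, which mirrors how the sink only maps images that were actually accepted.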
95
skills/developing/mineru/scripts/sinks/slack.py
Normal file
@ -0,0 +1,95 @@
|
||||
"""Slack sink: upload the parsed Markdown as a file via the external-upload flow.
|
||||
|
||||
Slack has retired ``files.upload`` in favour of a three-step external
|
||||
upload. Delivery follows that official path:
|
||||
|
||||
1. ``files.getUploadURLExternal`` — reserve an upload URL + file id for the
|
||||
given filename and byte length.
|
||||
2. ``POST`` the raw bytes to the returned upload URL.
|
||||
3. ``files.completeUploadExternal`` — finalize the upload, attach it to the
|
||||
target channel, and post an initial comment.
|
||||
|
||||
Images are *not* embedded: Markdown is uploaded as a single ``.md`` file.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import urllib.parse
|
||||
|
||||
from . import _http, _md
|
||||
from .base import ParsedDoc, Sink, SinkError, SinkResult, register
|
||||
|
||||
|
||||
@register
|
||||
class SlackSink(Sink):
|
||||
name = "slack"
|
||||
requires = ("SLACK_BOT_TOKEN", "SLACK_CHANNEL")
|
||||
label = "Slack channel (file upload)"
|
||||
|
||||
def deliver(self, doc: ParsedDoc) -> SinkResult:
|
||||
token = self.env("SLACK_BOT_TOKEN")
|
||||
channel = self.env("SLACK_CHANNEL")
|
||||
auth = {"Authorization": f"Bearer {token}"}
|
||||
|
||||
content = doc.markdown.encode("utf-8")
|
||||
filename = _md.slugify(doc.title) + ".md"
|
||||
|
||||
# Step 1: reserve an external upload URL + file id. This endpoint wants
|
||||
# form-encoded data, so use http_request and parse the JSON response.
|
||||
form = urllib.parse.urlencode({
|
||||
"filename": filename,
|
||||
"length": len(content),
|
||||
}).encode("utf-8")
|
||||
status, raw = _http.http_request(
|
||||
"POST",
|
||||
"https://slack.com/api/files.getUploadURLExternal",
|
||||
headers={**auth, "Content-Type": "application/x-www-form-urlencoded"},
|
||||
data=form,
|
||||
)
|
||||
parsed = _parse_json(raw)
|
||||
if not parsed.get("ok"):
|
||||
raise SinkError(parsed.get("error") or f"Slack getUploadURLExternal failed (HTTP {status})")
|
||||
upload_url = parsed.get("upload_url")
|
||||
file_id = parsed.get("file_id")
|
||||
if not upload_url or not file_id:
|
||||
raise SinkError("Slack did not return an upload URL / file id")
|
||||
|
||||
# Step 2: upload the raw bytes to the reserved URL.
|
||||
up_status, _up_body = _http.http_request(
|
||||
"POST", upload_url,
|
||||
headers={"Content-Type": "application/octet-stream"},
|
||||
data=content,
|
||||
)
|
||||
if up_status != 200:
|
||||
raise SinkError(f"Slack file upload failed (HTTP {up_status})")
|
||||
|
||||
# Step 3: finalize the upload into the channel.
|
||||
status, parsed = _http.request_json(
|
||||
"POST",
|
||||
"https://slack.com/api/files.completeUploadExternal",
|
||||
headers=auth,
|
||||
payload={
|
||||
"files": [{"id": file_id, "title": doc.title}],
|
||||
"channel_id": channel,
|
||||
"initial_comment": f"Parsed: {doc.title}",
|
||||
},
|
||||
)
|
||||
if not parsed.get("ok"):
|
||||
raise SinkError(parsed.get("error") or f"Slack completeUploadExternal failed (HTTP {status})")
|
||||
|
||||
files = parsed.get("files") or [{}]
|
||||
url = files[0].get("permalink")
|
||||
return SinkResult(
|
||||
sink=self.name, ok=True, url=url,
|
||||
detail="uploaded .md file (images not embedded)",
|
||||
)
|
||||
|
||||
|
||||
def _parse_json(raw):
|
||||
import json
|
||||
if not raw:
|
||||
return {}
|
||||
try:
|
||||
return json.loads(raw.decode("utf-8"))
|
||||
except (ValueError, UnicodeDecodeError):
|
||||
return {}
|
||||
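In step 1 of the Slack flow the `length` field is the size of the upload in bytes, which differs from the character count as soon as the Markdown contains non-ASCII text. A small sketch of building that form body, with `reserve_form` as a hypothetical helper name:

```python
import urllib.parse


def reserve_form(filename: str, content: bytes) -> bytes:
    """Form-encode the step-1 fields; ``length`` is the byte count of the upload."""
    fields = {"filename": filename, "length": len(content)}
    return urllib.parse.urlencode(fields).encode("utf-8")


text = "解析结果 ✓"
content = text.encode("utf-8")
form = reserve_form("report.md", content)
print(form)
print(len(text), len(content))
```

Encoding the document once and reusing the same bytes for both `length` and the step-2 upload, as the sink does, keeps the two values consistent.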
48
skills/developing/mineru/scripts/sinks/ticktick.py
Normal file
@ -0,0 +1,48 @@
|
||||
"""TickTick (滴答清单) sink — create a task from parsed Markdown.
|
||||
|
||||
TickTick's Open API exposes a task object whose ``content`` field holds the body
|
||||
text. The native ingestion path for arbitrary Markdown is therefore a
|
||||
task: the document title becomes the task title and the Markdown becomes the
|
||||
task content. Tasks have no attachment/inline-image surface, so local images are
|
||||
not delivered.
|
||||
|
||||
Docs: https://developer.ticktick.com/docs (POST /open/v1/task).
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from . import _http
|
||||
from .base import ParsedDoc, Sink, SinkError, SinkResult, register
|
||||
|
||||
API_URL = "https://api.ticktick.com/open/v1/task"
|
||||
|
||||
|
||||
@register
|
||||
class TickTickSink(Sink):
|
||||
name = "ticktick"
|
||||
aliases = ("dida", "滴答清单")
|
||||
requires = ("TICKTICK_TOKEN",)
|
||||
label = "TickTick task (滴答清单)"
|
||||
|
||||
def deliver(self, doc: ParsedDoc) -> SinkResult:
|
||||
token = self.env("TICKTICK_TOKEN")
|
||||
project_id = self.env("TICKTICK_PROJECT_ID")
|
||||
|
||||
payload = {"title": doc.title, "content": doc.markdown}
|
||||
if project_id:
|
||||
payload["projectId"] = project_id
|
||||
|
||||
headers = {"Authorization": f"Bearer {token}"}
|
||||
status, parsed = _http.request_json("POST", API_URL, headers=headers, payload=payload)
|
||||
|
||||
if status >= 400:
|
||||
raise SinkError(f"TickTick HTTP {status}: {parsed}")
|
||||
if not parsed.get("id"):
|
||||
raise SinkError(f"TickTick returned no task id: {parsed}")
|
||||
|
||||
return SinkResult(
|
||||
sink=self.name,
|
||||
ok=True,
|
||||
url=None,
|
||||
detail="task content (no inline images supported by TickTick)",
|
||||
)
|
||||
60
skills/developing/mineru/scripts/sinks/wecom.py
Normal file
@ -0,0 +1,60 @@
|
||||
"""WeCom (企业微信 / WeChat Work) sink — send parsed Markdown as an app message.
|
||||
|
||||
WeCom apps deliver content via the message-send API. The native ingestion path
|
||||
is a ``markdown`` message from a self-built app: first an access token is fetched
|
||||
with the corp id + secret, then the message is posted. WeCom's markdown is a
|
||||
limited subset with a 2048-byte content cap and no inline images, so the body is
|
||||
truncated to fit.
|
||||
|
||||
Docs: https://developer.work.weixin.qq.com/document/path/90236 (message/send),
|
||||
https://developer.work.weixin.qq.com/document/path/91039 (gettoken).
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from . import _http
|
||||
from .base import ParsedDoc, Sink, SinkError, SinkResult, register
|
||||
|
||||
TOKEN_URL = "https://qyapi.weixin.qq.com/cgi-bin/gettoken"
|
||||
SEND_URL = "https://qyapi.weixin.qq.com/cgi-bin/message/send"
|
||||
|
||||
|
||||
@register
|
||||
class WeComSink(Sink):
|
||||
name = "wecom"
|
||||
aliases = ("企业微信", "wechatwork")
|
||||
requires = ("WECOM_CORPID", "WECOM_CORPSECRET", "WECOM_AGENTID")
|
||||
label = "WeCom app markdown (企业微信)"
|
||||
|
||||
def deliver(self, doc: ParsedDoc) -> SinkResult:
|
||||
corpid = self.env("WECOM_CORPID")
|
||||
secret = self.env("WECOM_CORPSECRET")
|
||||
agentid = self.env("WECOM_AGENTID")
|
||||
touser = self.env("WECOM_TOUSER", "@all")
|
||||
|
||||
# Step 1: fetch an access token.
|
||||
token_url = f"{TOKEN_URL}?corpid={corpid}&corpsecret={secret}"
|
||||
status, parsed = _http.request_json("GET", token_url)
|
||||
if parsed.get("errcode") not in (0, None) or not parsed.get("access_token"):
|
||||
raise SinkError(parsed.get("errmsg") or f"WeCom token fetch failed: {parsed}")
|
||||
token = parsed["access_token"]
|
||||
|
||||
# Step 2: send the markdown message.
|
||||
send_url = f"{SEND_URL}?access_token={token}"
|
||||
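payload = {
|
||||
"touser": touser,
|
||||
"msgtype": "markdown",
|
||||
"agentid": int(agentid),
|
||||
# The cap is 2048 bytes of UTF-8, not 2048 characters; drop any split trailing character.
"markdown": {"content": doc.markdown.encode("utf-8")[:2048].decode("utf-8", "ignore")},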
|
||||
}
|
||||
status, parsed = _http.request_json("POST", send_url, payload=payload)
|
||||
if parsed.get("errcode") not in (0, None):
|
||||
raise SinkError(parsed.get("errmsg") or f"WeCom send failed: {parsed}")
|
||||
|
||||
return SinkResult(
|
||||
sink=self.name,
|
||||
ok=True,
|
||||
url=None,
|
||||
detail="markdown notification (WeCom markdown is a limited subset, "
|
||||
"2048-byte cap, no inline images)",
|
||||
)
|
||||
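The WeCom cap is stated in bytes, so truncation has to be measured on the UTF-8 encoding rather than on characters; CJK text takes three bytes per character. A minimal sketch of a byte-safe cut, using a hypothetical helper `truncate_utf8`:

```python
def truncate_utf8(text: str, max_bytes: int = 2048) -> str:
    """Cut ``text`` so its UTF-8 encoding fits in ``max_bytes`` without splitting a character."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # errors="ignore" drops the incomplete multi-byte sequence left at the cut point.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


body = "文" * 1000  # 3000 bytes once encoded as UTF-8
cut = truncate_utf8(body)
print(len(cut), len(cut.encode("utf-8")))
```

A plain `text[:2048]` slice on the same input would keep all 1000 characters, which is 3000 bytes and over the limit.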
104
skills/developing/mineru/scripts/sinks/wps.py
Normal file
@ -0,0 +1,104 @@
|
||||
"""WPS / 金山文档 (Kingsoft kdocs) sink — optional dependency.
|
||||
|
||||
The native ingestion path is: Markdown → ``.docx`` → upload to the kdocs cloud
|
||||
appspace. There is no official Python SDK, so:
|
||||
|
||||
* Markdown→DOCX uses the maintained, pure-pip ``html-for-docx`` package
|
||||
(reusing this project's Markdown→HTML), lazily imported so the core stays
|
||||
zero-dependency. Install with ``pip install mineru-skill[wps]``.
|
||||
* The kdocs WPS-2 request signing (plain SHA-1) and multipart upload are done
|
||||
with the standard library — small and fully documented.
|
||||
|
||||
Cloud upload requires an approved kdocs developer app (``WPS_APP_ID`` /
|
||||
``WPS_APP_SECRET``) and a provisioned appspace; it is opt-in and surfaces the
|
||||
raw kdocs error on failure. Docs: https://developer.kdocs.cn/server/guide/signature.html
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import email.utils
|
||||
import hashlib
|
||||
import io
|
||||
import json
|
||||
|
||||
from . import _http, _md
|
||||
from .base import ParsedDoc, Sink, SinkError, SinkResult, register
|
||||
|
||||
KDOCS_UPLOAD = "https://developer.kdocs.cn/api/v1/openapi/appspace/files/upload"
|
||||
|
||||
|
||||
def _markdown_to_docx_bytes(markdown: str) -> bytes:
|
||||
"""Convert Markdown → HTML → DOCX bytes via the optional html-for-docx lib."""
|
||||
try:
|
||||
from html4docx import HtmlToDocx # pip install html-for-docx
|
||||
except ImportError as exc: # pragma: no cover - exercised via SinkError path
|
||||
raise SinkError(
|
||||
"WPS sink needs a Markdown→DOCX converter — "
|
||||
"pip install 'mineru-skill[wps]' (i.e. pip install html-for-docx)"
|
||||
) from exc
|
||||
html = _md.md_to_html(markdown)
|
||||
document = HtmlToDocx().parse_html_string(html)
|
||||
buf = io.BytesIO()
|
||||
document.save(buf)
|
||||
return buf.getvalue()
|
||||
|
||||
|
||||
def _wps2_headers(app_id: str, app_secret: str, body: bytes, content_type: str) -> dict:
|
||||
"""Build kdocs WPS-2 auth headers.
|
||||
|
||||
signature = sha1(app_secret + content_md5 + content_type + date) hex.
|
||||
Content-Md5 / Content-Type must match the exact wire body and header sent.
|
||||
"""
|
||||
content_md5 = hashlib.md5(body).hexdigest()
|
||||
date = email.utils.formatdate(usegmt=True) # RFC1123 GMT
|
||||
signature = hashlib.sha1(
|
||||
(app_secret + content_md5 + content_type + date).encode("utf-8")
|
||||
).hexdigest()
|
||||
return {
|
||||
"Date": date,
|
||||
"Content-Md5": content_md5,
|
||||
"Content-Type": content_type,
|
||||
"Authorization": f"WPS-2:{app_id}:{signature}",
|
||||
}
|
||||
|
||||
|
||||
@register
|
||||
class WpsSink(Sink):
|
||||
name = "wps"
|
||||
aliases = ("kdocs", "金山文档", "金山")
|
||||
requires = ("WPS_APP_ID", "WPS_APP_SECRET")
|
||||
label = "WPS / 金山文档 (Markdown→DOCX upload, optional dep)"
|
||||
|
||||
def deliver(self, doc: ParsedDoc) -> SinkResult:
|
||||
app_id = self.env("WPS_APP_ID")
|
||||
app_secret = self.env("WPS_APP_SECRET")
|
||||
|
||||
docx_bytes = _markdown_to_docx_bytes(doc.markdown)
|
||||
filename = _md.safe_filename(doc.title) + ".docx"
|
||||
|
||||
fields = {}
|
||||
parent_path = self.env("WPS_PARENT_PATH")
|
||||
parent_token = self.env("WPS_PARENT_TOKEN")
|
||||
if parent_path:
|
||||
fields["parent_path"] = parent_path
|
||||
if parent_token:
|
||||
fields["parent_token"] = parent_token
|
||||
|
||||
content_type, body = _http.encode_multipart(
|
||||
fields=fields, files=[("file", filename, docx_bytes)]
|
||||
)
|
||||
headers = _wps2_headers(app_id, app_secret, body, content_type)
|
||||
|
||||
status, raw = _http.http_request("POST", KDOCS_UPLOAD, headers=headers, data=body)
|
||||
try:
|
||||
parsed = json.loads(raw.decode("utf-8")) if raw else {}
|
||||
except (ValueError, UnicodeDecodeError):
|
||||
parsed = {}
|
||||
if status >= 400 or parsed.get("code") not in (0, None):
|
||||
raise SinkError(parsed.get("message") or parsed.get("msg") or f"kdocs HTTP {status}")
|
||||
|
||||
file_token = (parsed.get("data") or {}).get("file_token")
|
||||
return SinkResult(
|
||||
sink=self.name, ok=True, url=file_token,
|
||||
detail="Markdown→DOCX uploaded to 金山文档 (experimental; needs a provisioned appspace)",
|
||||
)
|
||||
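Because the WPS-2 signature covers the MD5 of the exact request body, the multipart body has to be fully built before signing, and any change to it invalidates the signature. The sketch below restates the formula from `_wps2_headers` with the date passed in so the result is reproducible; the secret, content type, and timestamp are made-up values:

```python
from __future__ import annotations

import hashlib


def wps2_signature(app_secret: str, body: bytes, content_type: str, date: str) -> tuple[str, str]:
    """Return (content_md5, signature) as computed by the sink's _wps2_headers."""
    content_md5 = hashlib.md5(body).hexdigest()
    signature = hashlib.sha1(
        (app_secret + content_md5 + content_type + date).encode("utf-8")
    ).hexdigest()
    return content_md5, signature


DATE = "Mon, 01 Jan 2024 00:00:00 GMT"  # fixed, illustrative timestamp
md5_empty, sig_empty = wps2_signature("secret", b"", "text/plain", DATE)
md5_other, sig_other = wps2_signature("secret", b"x", "text/plain", DATE)
print(md5_empty, sig_empty)
print(sig_empty == sig_other)
```

The same reasoning applies to `Content-Type`: the multipart boundary is part of the signed string, so the header sent on the wire must be the exact value used here.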
65
skills/developing/mineru/scripts/sinks/yuque.py
Normal file
@ -0,0 +1,65 @@
|
||||
"""Yuque (语雀) sink: create a Markdown doc in a repository via the open API.
|
||||
|
||||
Yuque's open API (``https://www.yuque.com/api/v2``) authenticates with an
|
||||
``X-Auth-Token`` header and creates docs under a repository namespace. The body
|
||||
is posted as raw Markdown.
|
||||
|
||||
Yuque's open API has no asset-upload endpoint, so local image refs are left
|
||||
untouched — host images at a public URL for them to render.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from pathlib import Path
|
||||
|
||||
from . import _http, _md
|
||||
from .base import ParsedDoc, Sink, SinkError, SinkResult, register
|
||||
|
||||
API = "https://www.yuque.com/api/v2"
|
||||
|
||||
|
||||
@register
|
||||
class YuqueSink(Sink):
|
||||
name = "yuque"
|
||||
aliases = ("语雀",)
|
||||
requires = ("YUQUE_TOKEN", "YUQUE_NAMESPACE")
|
||||
label = "Yuque doc (open API)"
|
||||
|
||||
def deliver(self, doc: ParsedDoc) -> SinkResult:
|
||||
token = self.env("YUQUE_TOKEN")
|
||||
namespace = self.env("YUQUE_NAMESPACE")
|
||||
headers = {
|
||||
"X-Auth-Token": token,
|
||||
"User-Agent": "MinerU-Skill/3.0",
|
||||
"Content-Type": "application/json",
|
||||
}
|
||||
|
||||
base_dir = Path(doc.markdown_path).parent if doc.markdown_path else None
|
||||
n_images = len(_md.find_local_images(doc.markdown, base_dir))
|
||||
|
||||
status, parsed = _http.request_json(
|
||||
"POST", f"{API}/repos/{namespace}/docs", headers=headers, payload={
|
||||
"title": doc.title,
|
||||
"slug": _md.slugify(doc.title),
|
||||
"public": 0,
|
||||
"format": "markdown",
|
||||
"body": doc.markdown,
|
||||
},
|
||||
)
|
||||
|
||||
data = parsed.get("data")
|
||||
if not data:
|
||||
if status >= 400 or parsed.get("message"):
|
||||
raise SinkError(parsed.get("message") or f"Yuque HTTP {status}")
|
||||
raise SinkError(f"Yuque returned no doc data (HTTP {status})")
|
||||
|
||||
slug = data.get("slug")
|
||||
if n_images:
|
||||
detail = f"text only ({n_images} local image(s); host images publicly to embed)"
|
||||
else:
|
||||
detail = "text only"
|
||||
return SinkResult(
|
||||
sink=self.name, ok=True,
|
||||
url=f"https://www.yuque.com/{namespace}/{slug}",
|
||||
detail=detail,
|
||||
)
|
||||
64
skills/developing/mineru/scripts/splitter.py
Normal file
@ -0,0 +1,64 @@
|
||||
"""Split oversized PDFs into cap-sized parts so they clear the MinerU API limits.
|
||||
|
||||
The MinerU cloud caps documents at 20 pages (free Agent API) or 200 pages (Standard
|
||||
API). ``--split`` slices a larger PDF into parts locally, parses each part, and
|
||||
merges the Markdown back together, so those page caps no longer apply (the same
|
||||
trick mineru-converter uses). It relies on the optional ``pypdf`` library, lazily
|
||||
imported, so the core stays zero-dependency.
|
||||
|
||||
pip install "mineru-skill[split]" # i.e. pip install pypdf
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from pathlib import Path
|
||||
|
||||
|
||||
class SplitError(Exception):
|
||||
"""Raised when splitting is requested but cannot be performed."""
|
||||
|
||||
|
||||
def _load_pypdf():
|
||||
try:
|
||||
import pypdf # noqa: F401
|
||||
return pypdf
|
||||
except ImportError as exc:
|
||||
raise SplitError(
|
||||
"--split needs the pypdf library — pip install 'mineru-skill[split]' "
|
||||
"(i.e. pip install pypdf)"
|
||||
) from exc
|
||||
|
||||
|
||||
def pdf_page_count(path) -> int:
|
||||
"""Return the page count of a local PDF (requires pypdf)."""
|
||||
pypdf = _load_pypdf()
|
||||
return len(pypdf.PdfReader(str(path)).pages)
|
||||
|
||||
|
||||
def split_pdf(path, max_pages: int, out_dir) -> list:
|
||||
"""Slice ``path`` into ``max_pages``-page parts under ``out_dir``.
|
||||
|
||||
Returns the list of part paths (a single-element list pointing at the original
|
||||
file if it already fits).
|
||||
"""
|
||||
if max_pages < 1:
|
||||
raise SplitError("max_pages must be >= 1")
|
||||
pypdf = _load_pypdf()
|
||||
reader = pypdf.PdfReader(str(path))
|
||||
total = len(reader.pages)
|
||||
if total <= max_pages:
|
||||
return [Path(path)]
|
||||
|
||||
out_dir = Path(out_dir)
|
||||
out_dir.mkdir(parents=True, exist_ok=True)
|
||||
stem = Path(path).stem
|
||||
parts = []
|
||||
for part_index, start in enumerate(range(0, total, max_pages), start=1):
|
||||
writer = pypdf.PdfWriter()
|
||||
for page in range(start, min(start + max_pages, total)):
|
||||
writer.add_page(reader.pages[page])
|
||||
part_path = out_dir / f"{stem}__part{part_index:03d}.pdf"
|
||||
with open(part_path, "wb") as handle:
|
||||
writer.write(handle)
|
||||
parts.append(part_path)
|
||||
return parts
|
||||
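The page arithmetic in `split_pdf` can be checked without `pypdf`. This sketch computes the half-open page ranges each part would cover; `part_ranges` is a hypothetical helper that mirrors the loop bounds used by the splitter:

```python
def part_ranges(total: int, max_pages: int) -> list:
    """Return (start, end) page ranges, end exclusive, one per part."""
    if max_pages < 1:
        raise ValueError("max_pages must be >= 1")
    return [
        (start, min(start + max_pages, total))
        for start in range(0, total, max_pages)
    ]


ranges = part_ranges(45, 20)
print(ranges)
```

A 45-page PDF against the 20-page cap yields two full parts and a 5-page remainder; a document that already fits produces a single range, which is the case where `split_pdf` returns the original path untouched.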