qwen_agent/skills/linggan/mineru/references/comparison.md
2026-06-22 20:30:08 +08:00

194 lines
20 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

<!-- Web-researched competitive comparison (45 tools, 6 categories, adversarially fact-checked). Last researched 2026-05-31. Star counts / versions are point-in-time. -->
# MinerU Skill — Competitive Comparison Reference
This document gives an honest, sourced, per-tool breakdown of how **MinerU Skill** compares to the document-parsing landscape. Read the framing first: it determines how to interpret every "we win / they win" below.
## What MinerU Skill actually is (and is not)
MinerU Skill is a **zero-config, zero-dependency, agent-native convenience layer over [MinerU](https://github.com/opendatalab/MinerU)'s cloud API**, plus 17 turnkey delivery integrations to note/knowledge/content tools. Concretely (verified in this repo):
- Core script `scripts/mineru.py` is **~54KB / ~1,350 lines of pure Python standard library** — no `requests`/`aiohttp`, no model weights.
- A **genuinely token-free** default: the free **Agent API** path (`agent_parse` → `_agent_poll`) sends **no `Authorization` header** (the Bearer header is set only when a token is present). Files ≤10MB / ≤20 pages.
- **Auto-routing**: with a token, large/batched/extra-format jobs use the **Standard API** (≤200MB / ≤200 pages); the Agent path **auto-escalates** to Standard on size/page limits.
- **17 delivery sinks** (16 sink modules + `local.py` registering both `obsidian` and `logseq`): obsidian, logseq, siyuan, notion, confluence, onenote, coda, yuque, feishu, slack, dingtalk, wecom, ticktick, linear, airtable — all zero-dependency — plus **roam** (needs `roam-client`) and **wps** (needs `html-for-docx`) which lazy-load one library only when used.
- `--resume` dedup, parallel `--workers` (ThreadPoolExecutor), `--stdout`/`--json` agent output.
**Critical dependency:** our accuracy is **entirely downstream of, and capped by, what MinerU's cloud serves.** We own no models. Therefore:
- We have **no quality edge** over any other cloud wrapper that hits the same MinerU API — OCR/table/formula output is **identical**.
- Self-hosting the MinerU engine gives the **same or better** accuracy (version-controllable, no upload caps).
**Hard limits we cannot exceed:** 10MB/20-page free Agent tier, 200MB/200-page Standard tier, plus IP rate limits. Self-hosted tools have no such caps (only hardware).
**Our benchmark is latency-only.** `tests/test_live.py` measures end-to-end cloud round-trip latency (~1314s for the official demo PDF). It is **not** an accuracy benchmark; we have no OmniDocBench/olmOCR-Bench numbers of our own.
### A note on the speed claim
Our ~1314s/doc cloud round-trip is **not** a clean win over self-hosted GPU engines. A normal self-host with a GPU runs at ~0.18s/page (Marker) or ~2.12 pages/sec (MinerU on A100) — far faster at any real scale. We only out-run **slow Apple-Silicon-CPU local runs of small docs** (e.g., M4 VLM at 32148s/page). Do not frame "faster wall-clock" as a general win.
### A note on benchmarks
No single benchmark is authoritative. Different benchmarks favor different tools:
- **OmniDocBench** (v1.5/v1.6): MinerU2.5 **90.67** (v1.5), MinerU2.5-Pro **95.69** (v1.6) — leads, beating Gemini 2.5 Pro / GPT-4o / Qwen2.5-VL-72B on text/table/formula. Source: arXiv 2509.22186.
- **olmOCR-Bench** (Ai2, Oct 2025): olmOCR-2 **82.4** > Marker **76.1** > **MinerU 75.8**. Here MinerU **trails** — this is a real olmOCR win and must stay visible.
- **RD-TableBench**: Reducto 90.2% on complex tables — but Reducto authored this benchmark (vendor-biased).
- Mathpix is the de-facto formula-OCR standard (BLEU/edit-distance studies), though a PaddleOCR-VL-based tool claims to beat it on OmniDocBench v1.0 formula recognition, so the very top is contested.
> Star counts / versions below (e.g. MinerU "65.7k / v3.2.1") are point-in-time and not independently re-verified.
---
## Category 1 — Self-hosted / open-source parsing engines
These are the tools that close our single biggest gap: **fully offline / air-gapped / no cloud / no upload caps.**
### MinerU engine (opendatalab) — the engine we wrap
- **Source:** https://github.com/opendatalab/MinerU · arXiv 2509.22186 · https://huggingface.co/opendatalab/MinerU2.5-Pro-2604-1.2B
- **Strengths:** Owns the SOTA models (OmniDocBench 90.67 / 95.69-Pro v1.6). 109-language OCR, handwriting, cross-page table merge, formula→LaTeX (the source of *our* LaTeX). Fully self-hostable → offline, air-gappable, zero per-page cost, no caps. Pipeline backend runs pure CPU; VLM needs 8GB+ VRAM. Native MCP, Python/Go/TS SDKs, LangChain/LlamaIndex/Dify/FastGPT.
- **Weaknesses vs us:** Heavy install (multi-GB torch/vLLM + weights, 16GB RAM / 20GB disk floor); slow on Apple Silicon; no note/PKM delivery sinks; library/CLI rather than zero-config.
- **Verdict:** **Beats us** on offline, privacy, caps, accuracy ceiling, ecosystem. **We beat it** only on zero-install/zero-config and built-in delivery.
### Marker (datalab-to / VikParuchuri)
- **Source:** https://github.com/datalab-to/marker · https://allenai.org/blog/olmocr-2
- **Strengths:** Fully offline; very high batch throughput (~122 pages/sec/H100, 0.18s/page GPU); broad formats incl. EPUB; optional local-LLM (Ollama) quality boost with no data leaving the machine; ~35k+ stars, active.
- **Weaknesses:** **GPL-3.0** code + model weights under a modified RAIL-M (free only under ~$2M funding+revenue; commercial above that needs a Datalab license). olmOCR-Bench **76.1** — below olmOCR-2 and MinerU's OmniDocBench standing.
- **Verdict:** Beats us on offline/throughput; we beat it on zero-install and 17 delivery sinks. License gate is a real friction it has and we don't.
### Docling (IBM / DS4SD)
- **Source:** https://github.com/docling-project/docling · https://huggingface.co/ibm-granite/granite-docling-258M · arXiv 2408.09869
- **Strengths:** **Widest input modality set** (PDF/DOCX/PPTX/XLSX/HTML/AsciiDoc/LaTeX/CSV/images + **audio via ASR** + USPTO/JATS/XBRL). Tiny 258M Granite-Docling VLM runs on CPU/modest GPU. **MIT code + Apache-2.0 weights.** Deep framework ecosystem (LangChain/LlamaIndex/Haystack + official MCP), IBM-backed, 60k+ stars. Air-gapped by design.
- **Weaknesses:** Absolute accuracy lags MinerU on OmniDocBench/olmOCR-Bench; library-first (not a zero-config CLI); targets framework ingestion, not file delivery to note tools.
- **Verdict:** Beats us on offline, modality breadth, permissive license, ecosystem; we beat it on zero-install and note/PKM delivery. **Do not over-rank its MIT as uniquely best** — olmOCR's Apache-2.0 on *both* code and 7B weights is at least as commercially valuable.
### olmOCR (allenai)
- **Source:** https://github.com/allenai/olmocr · https://allenai.org/blog/olmocr-2 · https://huggingface.co/datasets/allenai/olmOCR-bench
- **Strengths:** **Leads Ai2's olmOCR-Bench (82.4 vs MinerU 75.8)** — a benchmark where MinerU trails. **Apache-2.0 on code AND the olmOCR-2-7B weights** (most commercial-friendly model reuse here). Built for million-page LLM-training linearization. Offline.
- **Weaknesses:** **PDF/image only** (no Office/HTML); **English-primary**, filters non-English (MinerU does 109-lang); **requires a 12GB+ NVIDIA GPU, no CPU mode at all**.
- **Verdict:** Beats us on offline, that-benchmark accuracy, license, scale. We beat it on modality breadth, multilingual, no-GPU, delivery, zero-install. **Keep the olmOCR-Bench lead visible — do not cherry-pick only OmniDocBench.**
### Nougat (facebookresearch / Meta AI)
- **Source:** https://github.com/facebookresearch/nougat · arXiv 2308.13418
- **Strengths:** Strong LaTeX/math on arXiv-style scientific PDFs (its trained niche). Offline.
- **Weaknesses:** **PDF + English/Latin-script only** (no CJK); **CC-BY-NC weights (non-commercial)**; effectively **unmaintained** (last release Aug 2023); known repetition/hallucination/[MISSING_PAGE] failures off-distribution.
- **Verdict:** Offline + niche math is its only edge; we beat it on general-purpose, multilingual, maintenance, commercial license, delivery.
### PyMuPDF4LLM (pymupdf / Artifex)
- **Source:** https://github.com/pymupdf/pymupdf4llm · https://pymupdf.io/blog/pymupdf-layout-10-faster-pdf-parsing-without-gpus
- **Strengths:** **Far faster and lighter than any ML tool on born-digital PDFs** (~hundreds of pages/sec on plain CPU; a C-optimized variant claims ~520 pages/sec). Lowest dependency/hardware footprint. Offline, no cloud, no caps. Ideal for huge clean-PDF corpora where speed > fidelity.
- **Weaknesses:** No ML → no real formula/LaTeX, weak complex tables, poor scanned/handwritten; slow external OCR; **AGPL-3.0 OR Artifex commercial**; Office formats need paid **PyMuPDF Pro**.
- **Verdict:** A genuine win for the speed-over-fidelity, clean-PDF use case. We beat it on hard-doc quality (MinerU's VLM), multilingual OCR, and delivery — but acknowledge its speed/footprint advantage honestly.
### Zerox (getomni-ai)
- **Source:** https://github.com/getomni-ai/zerox
- **Strengths:** Trivial provider-flexibility (OpenAI/Azure/Bedrock-Claude/Gemini/Vertex); JSON-Schema structured extraction (Node SDK); MIT code.
- **Weaknesses:** **NOT offline and NOT token-free** — mandates a paid cloud vision-LLM key; needs graphicsmagick+ghostscript; **no published benchmarks**; per-page LLM cost can exceed MinerU on large jobs.
- **Verdict:** We beat it on token-free start, benchmarked accuracy, dedicated formula/table models, system-dep footprint, and delivery. It beats us on provider-swap flexibility and typed JSON extraction.
---
## Category 2 — Commercial cloud document-parsing APIs
Mostly **stronger than us** on enterprise accuracy, SLAs, structured extraction, and RAG/MCP ecosystems. Our honest edges are narrow: token-free + zero-install hosted default, clean Markdown/LaTeX of academic PDFs, and 17 delivery sinks none of them offer.
### LlamaParse (LlamaIndex / LlamaCloud)
- **Source:** https://www.llamaindex.ai/pricing · LlamaCloud MCP docs
- **Beats us:** Official hosted **MCP server**; deep native RAG stack (parse→index→LlamaExtract/LlamaAgents); steerable NL parsing with frontier LLMs (GPT-4.1/Gemini 2.5 Pro); richer outputs (per-page JSON, XLSX, HTML tables, annotated PDF); enterprise SLAs; mature Python+TS SDKs.
- **We beat:** Token-free start (it needs a LlamaCloud key from page one); zero runtime deps; 17 note/PKM sinks (it delivers to RAG indexes, not note tools); built-in `--resume`/parallel batch CLI.
### Mathpix (Convert API)
- **Source:** https://mathpix.com/pricing/api · https://mathpix.com/image-to-latex
- **Beats us:** **Best-in-class formula/equation OCR (printed AND handwritten) → clean LaTeX — clearly better than MinerU for pure math fidelity; concede this, do not imply parity.** Mature Snip ecosystem + Overleaf workflows; very low per-image cost at scale.
- **We beat:** Token-free start (Mathpix API requires a paid PAYG account, **$19.99 setup fee**, card on file; **no recurring free monthly allowance** — only a one-time $29 test credit; the consumer Snip app's free quota does **not** apply to the API); general-purpose multi-modal Office parsing; 17 delivery sinks; built-in batch CLI.
### Unstructured.io
- **Source:** https://unstructured.io/pricing · https://github.com/Unstructured-IO/unstructured
- **Beats us:** **Apache-2.0 core library is fully self-hostable → 100% offline** (we cannot); official MCP + huge connector ecosystem (S3/SharePoint/vector DBs); built-in chunking+embedding (RAG-ready); 25+ file types; permissive license for product embedding.
- **We beat:** Token-free hosted default with zero install (its hosted API needs a key; self-host means running infra); cleaner human-readable Markdown out of the box (its primary output is JSON "elements"); 17 note/PKM sinks (it targets vector DBs/storage). *On parsing quality:* VLM parsing is generally stronger for complex layout/formula, but this is **not a benchmarked head-to-head** — state it as a tendency, not a measured win.
### Reducto
- **Source:** https://reducto.ai/pricing
- **Beats us:** **Best complex/financial table extraction (90.2% RD-TableBench — vendor-authored but the strongest public evidence)**; agentic multi-pass OCR; SOC2/HIPAA, on-prem/VPC/air-gapped, enterprise SLAs; schema-based extraction with bounding boxes/citations.
- **We beat:** Token-free start (it needs a key + credits); zero-install plain CLI; 17 delivery sinks; auto-routing/--resume/parallel batch.
### Chunkr (and similar RAG-native APIs)
- **Beats us:** Self-hostable (offline option we lack); RAG-native chunking + broad export (DOCX/HTML/LaTeX).
- **We beat:** Token-free start; zero-install; 17 note/PKM sinks.
- **Caveat (fact-check):** Do **not** claim "stronger VLM Markdown for formulas" — Chunkr cloud uses its own proprietary models and we have **no head-to-head benchmark**. Drop the quality claim; keep only the export-breadth and offline framing.
---
## Category 3 — Other MinerU wrappers, skills & MCP servers (our direct peers)
**Every cloud-backed wrapper here hits the same MinerU API we do, so its OCR/table/formula output is IDENTICAL to ours.** We have **no quality edge** over them — only DX differences. Claims of "better OCR/formula/Markdown" vs these are **invalid** and must not appear.
### Official MinerU MCP server (mineru-open-mcp / MinerU-Ecosystem)
- **Source:** https://github.com/opendatalab/MinerU-Ecosystem · https://pypi.org/project/mineru-open-mcp/
- **Beats us:** **Official, first-party** — tracks API/format changes day-one; native **MCP server** (stdio + streamable-http) in Claude Desktop/Cursor/Windsurf with zero glue; full ecosystem (Python/Go/TS SDKs, LangChain/LlamaIndex/Dify/FastGPT). **Same free no-token Flash tier as us** — our "free zero-token" edge is fully matched by the first party.
- **We beat:** Zero runtime deps (vs pip/uvx install); auto-routing Agent⇄Standard with auto-escalation; 17 delivery sinks; `--resume`/parallel batch; usable as a plain CLI outside any MCP host.
### MinerU-Document-Explorer (official, opendatalab)
- **Source:** https://github.com/opendatalab/MinerU-Document-Explorer
- **Beats us:** Different, **larger** value prop — a local agent-native **knowledge engine** (BM25/vector/hybrid retrieval + deep-reading + LLM-wiki) with 15 MCP tools; runs 100% locally for its core; MIT, 568 stars.
- **We beat:** We're a focused zero-dep converter; broader conversion modalities; 17 delivery sinks (it keeps content in its own index/wiki); no Node/local-model download.
### linxule/mineru-mcp (Node, cloud)
- **Source:** https://github.com/linxule/mineru-mcp
- **Beats us:** Native MCP server with 6 granular tools (explicit status-polling + batch-status pagination); first-class for Node/JS MCP stacks; batch up to 200 URLs/request.
- **We beat:** **Free no-token path** (it **requires** a token always); zero runtime deps (vs Node 18+); broader modalities (Excel/HTML); 17 delivery sinks; usable as plain CLI outside MCP.
### mineru-converter-mcp-server (AvatarGanymede/MinerU-MCP)
- **Source:** https://pypi.org/project/mineru-converter-mcp-server/
- **Beats us:** **Auto-splits PDFs >200MB and segments >600-page docs by page range — gracefully exceeding the 200MB/200-page cap we are bound by.** Turnkey Smithery + Render deploy (per-user key); explicit HTML input.
- **We beat:** Free no-token default (it requires a key); zero runtime deps; plain CLI (no MCP host/Render/Smithery needed); 17 sinks; auto-routing.
### grimoire-skill (LeoLin990405)
- **Source:** https://github.com/LeoLin990405/grimoire-skill
- **Beats us:** Higher-level knowledge-capture ("parse once, share twice" → Obsidian notes + reusable skill packs); ingests **video** (YouTube/Bilibili) + subtitles (modalities we don't touch); cross-agent skill management; content-aware Obsidian auto-filing.
- **We beat:** Free no-token default (it needs a token + `--cloud-ok` for local files); zero runtime deps (vs bash+jq+awk + optional yt-dlp/ffmpeg); 17 sinks vs primarily Obsidian; broader Office/HTML; cross-platform single-file portability.
### kesslerio/mineru-pdf-parser (openclaw/ClawHub skill, local CPU)
- **Source:** openclaw/skills · SKILL.md
- **Beats us:** **Fully local/offline (pure CPU, cross-platform)** — no cloud/token/caps; handles privacy-sensitive docs; native Markdown + JSON.
- **We beat:** Zero install (it needs a full local MinerU install + weights + shell wrapper); no GPU/heavy runtime; faster wall-clock **only vs slow local CPU**; broader modalities; 17 sinks; `--stdout`/`--json`; better docs.
### nilecui/mineru-parser-skills (Claude Agent SDK, cloud)
- **Source:** https://github.com/nilecui/mineru-parser-skills
- **Beats us:** Built directly on the Claude Agent SDK (slots into Agent-SDK apps). Honestly little else — it's a thinner cloud wrapper.
- **We beat:** Accepts local files/dirs **and** URLs (it is **URL-only** — cannot parse a local PDF); free no-token default; zero runtime deps; batch/`--resume`/parallel; 17 sinks; broader modalities; mature/documented vs a 4-commit, no-license repo. *Caveat:* our "benchmarked" claim means **latency-measured**, not accuracy-benchmarked.
### TINKPA/mcp-mineru (local MLX, Apple Silicon)
- **Source:** https://github.com/TINKPA/mcp-mineru
- **Beats us:** **Fully offline/local** via MinerU running on-device (MLX accel); no cloud/token/caps; data never leaves the Mac.
- **We beat:** Zero install/no weights/no GPU; **faster wall-clock only for typical multi-page docs vs its slow local inference (32148s/page on M4)** — not a general speed win; broader modalities; batch/`--resume`/17 sinks; more active/documented; usable as plain CLI.
---
## Summary of mandatory concessions (do not bury these)
1. **Offline / air-gapped is our single biggest gap.** MinerU engine, Marker, Docling, olmOCR, Nougat, PyMuPDF4LLM, TINKPA, kesslerio, MinerU-Document-Explorer, and self-hostable Unstructured/Chunkr all run with **zero cloud dependency**. We are cloud-only and **cannot handle confidential/regulated/air-gapped content at all.**
2. **Data privacy:** every self-hosted competitor keeps documents on the machine; we **upload every file** to MinerU's cloud — a hard disqualifier for many regulated users.
3. **Accuracy is downstream of, and capped by, MinerU's cloud.** Self-hosting MinerU2.5-Pro gives the same-or-better accuracy with no caps. Same-backend wrappers yield **identical** quality to us.
4. **Hard caps:** 10MB/20-page (Agent), 200MB/200-page (Standard), IP rate limits. mineru-converter exceeds them via auto-split/segmentation.
5. **Mathpix beats us on formula/LaTeX OCR (incl. handwriting).**
6. **Reducto leads complex/financial tables; olmOCR leads olmOCR-Bench (82.4 vs MinerU 75.8).** Different benchmarks favor different tools — never cherry-pick only OmniDocBench.
7. **Official first-party advantage:** the official MinerU MCP/Document-Explorer + ecosystem track changes day-one and match our free tier; we are third-party, can lag, and ship **no MCP server**.
8. **Permissive-license wins we lack:** olmOCR (Apache-2.0 code + 7B weights), Docling (MIT + Apache-2.0 weights), Unstructured (Apache-2.0 core).
9. **PyMuPDF4LLM is far faster/lighter on born-digital PDFs** (clean-text corpora, speed > fidelity).
## Sources
- MinerU engine: https://github.com/opendatalab/MinerU · arXiv 2509.22186 · https://huggingface.co/opendatalab/MinerU2.5-Pro-2604-1.2B · https://neurohive.io/en/state-of-the-art/mineru2-5-open-source-1-2b-model-for-pdf-parsing-outperforms-gemini-2-5-pro-on-benchmarks/
- Official MCP / ecosystem: https://github.com/opendatalab/MinerU-Ecosystem · https://pypi.org/project/mineru-open-mcp/ · https://github.com/opendatalab/MinerU-Document-Explorer
- Marker: https://github.com/datalab-to/marker · https://allenai.org/blog/olmocr-2
- Docling: https://github.com/docling-project/docling · arXiv 2408.09869 · https://huggingface.co/ibm-granite/granite-docling-258M
- olmOCR: https://github.com/allenai/olmocr · https://allenai.org/blog/olmocr-2 · https://huggingface.co/datasets/allenai/olmOCR-bench
- Nougat: https://github.com/facebookresearch/nougat · arXiv 2308.13418
- PyMuPDF4LLM: https://github.com/pymupdf/pymupdf4llm · https://pymupdf.io/blog/pymupdf-layout-10-faster-pdf-parsing-without-gpus
- Zerox: https://github.com/getomni-ai/zerox
- LlamaParse: https://www.llamaindex.ai/pricing
- Mathpix: https://mathpix.com/pricing/api · https://mathpix.com/image-to-latex
- Unstructured: https://unstructured.io/pricing · https://github.com/Unstructured-IO/unstructured
- Reducto: https://reducto.ai/pricing
- Other wrappers: https://github.com/linxule/mineru-mcp · https://pypi.org/project/mineru-converter-mcp-server/ · https://github.com/LeoLin990405/grimoire-skill · https://github.com/nilecui/mineru-parser-skills · https://github.com/TINKPA/mcp-mineru