qwen_agent/skills/developing/mineru/references/api_reference.md
2026-06-05 14:35:17 +08:00

# MinerU API Reference
Official docs: https://mineru.net/apiManage/docs · Token: https://mineru.net/apiManage/token
MinerU exposes **two** document-parsing APIs. This skill auto-routes between them.
| | 🎯 Standard API | ⚡ Agent API (lightweight) |
|---|---|---|
| Base URL | `https://mineru.net/api/v4` | `https://mineru.net/api/v1/agent` |
| Token | **required** (`Bearer`) | **none** (IP rate-limited) |
| Models | `pipeline` / `vlm` / `MinerU-HTML` | fixed lightweight `pipeline` |
| File size | ≤ 200 MB | ≤ 10 MB |
| Pages | ≤ 200 | ≤ 20 |
| Batch | ≤ 50 per request | single file only |
| Output | zip (Markdown + JSON, optional DOCX/HTML/LaTeX) | Markdown only (CDN link) |
| Designed for | high-accuracy / complex / batch | AI-agent / quick / no-login |
Free Standard-API quota: **1000 pages/day at highest priority**; pages beyond the daily quota are still processed, but at lower priority.
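
The routing decision implied by the table above can be sketched as follows. This is an illustration only, not the skill's actual routing code: `choose_api` and its parameters are hypothetical names, the thresholds are copied from the table, and 1 MB is assumed to mean 1024 × 1024 bytes. Whether a token is available is deliberately left out of the decision.

```python
# Limits copied from the comparison table; the byte conversion is an assumption.
AGENT_MAX_BYTES = 10 * 1024 * 1024
AGENT_MAX_PAGES = 20


def choose_api(size_bytes, pages, n_files=1, needs_extra_formats=False, is_html=False):
    """Return "agent" when the job fits the Agent API limits, else "standard"."""
    fits_agent = (
        n_files == 1                      # Agent API handles a single file only
        and size_bytes <= AGENT_MAX_BYTES
        and pages <= AGENT_MAX_PAGES
        and not needs_extra_formats       # DOCX/HTML/LaTeX output is Standard-only
        and not is_html                   # HTML input needs the MinerU-HTML model
    )
    return "agent" if fits_agent else "standard"
```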
---
## Authentication (Standard API)
```
Authorization: Bearer YOUR_API_TOKEN
```
Get a token at https://mineru.net/apiManage/token.
> **Response envelopes.** Business endpoints return `{"code":0,"data":{…},"msg":"ok"}`.
> The auth/gateway layer returns a *different* shape on failure:
> `{"success":false,"msgCode":"A0202","msg":"user authenticate failed"}`.
> Clients must handle both — this skill maps `msgCode` to the same error hints.
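
One way to handle both envelope shapes is to normalize them into a single tuple before any error handling. The sketch below uses a hypothetical helper name, `normalize_response`, and assumes the body has already been decoded from JSON into a `dict`; it is not the skill's own implementation.

```python
def normalize_response(body):
    """Return (ok, code, message, data) for either envelope shape."""
    if "success" in body:
        # Auth/gateway envelope: {"success": false, "msgCode": "...", "msg": "..."}
        return bool(body["success"]), body.get("msgCode"), body.get("msg", ""), None
    # Business envelope: {"code": 0, "data": {...}, "msg": "ok"}
    code = body.get("code")
    return code == 0, code, body.get("msg", ""), body.get("data")
```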
---
## Standard API endpoints (`/api/v4`)
### Single URL — `POST /extract/task`
```json
{
  "url": "https://example.com/doc.pdf",
  "model_version": "vlm",
  "is_ocr": false,
  "enable_formula": true,
  "enable_table": true,
  "language": "ch",
  "page_ranges": "1-10",
  "extra_formats": ["docx", "html"],
  "data_id": "my-document"
}
```
Response → `{ "code": 0, "data": { "task_id": "…" } }`. HTML inputs require `model_version: "MinerU-HTML"`.
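
As a rough sketch, the request above could be assembled with Python's standard library as shown below. `build_extract_request` is a hypothetical helper, and sending the body as `application/json` is an assumption (the reference only shows a JSON payload). The function builds the request without sending it, so it can be inspected first.

```python
import json
import urllib.request

BASE_URL = "https://mineru.net/api/v4"


def build_extract_request(token, url, model_version="vlm", **options):
    """Build (but do not send) a POST /extract/task request."""
    payload = {"url": url, "model_version": model_version, **options}
    return urllib.request.Request(
        f"{BASE_URL}/extract/task",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",  # assumed, not stated in this reference
        },
        method="POST",
    )


req = build_extract_request(
    "YOUR_API_TOKEN", "https://example.com/doc.pdf", page_ranges="1-10"
)
# Sending is left to the caller, e.g. urllib.request.urlopen(req, timeout=30)
```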
### Get task result — `GET /extract/task/{task_id}`
```json
{ "code": 0, "data": { "task_id": "…", "state": "done", "full_zip_url": "https://…", "err_msg": "" } }
```
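
Once `state` is `done`, the archive behind `full_zip_url` can be downloaded and unpacked. The sketch below only covers the unpacking step and makes two assumptions: the Markdown files inside the zip end in `.md`, and they are UTF-8 encoded. The archive's internal layout is not described in this reference, and `read_markdown_members` is a hypothetical helper name.

```python
import io
import zipfile


def read_markdown_members(zip_bytes):
    """Return {member_name: text} for every .md file in an in-memory zip."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return {
            name: archive.read(name).decode("utf-8")  # UTF-8 is an assumption
            for name in archive.namelist()
            if name.lower().endswith(".md")
        }
```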
### Batch local upload — `POST /file-urls/batch`
Returns one signed upload URL per file; PUT each file to its URL without sending a `Content-Type` header. Up to **50** files per request.
```json
{ "files": [ { "name": "doc.pdf", "data_id": "doc" } ], "model_version": "vlm" }
```
Response → `{ "code": 0, "data": { "batch_id": "…", "file_urls": ["https://…"] } }`.
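
The upload step might look like the sketch below. It uses `http.client` rather than `urllib.request`, because `urllib.request` adds a default `Content-Type` header to any request that carries a body. Both helper names are hypothetical, and pairing files with URLs by position assumes that `file_urls` comes back in the same order as the `files` array, which this reference does not state.

```python
import http.client
from urllib.parse import urlsplit

MAX_BATCH = 50  # per-request limit from this reference


def pair_uploads(paths, file_urls):
    """Pair local paths with signed URLs by position (ordering is an assumption)."""
    if len(paths) > MAX_BATCH:
        raise ValueError(f"at most {MAX_BATCH} files per request")
    if len(paths) != len(file_urls):
        raise ValueError("number of signed URLs does not match number of files")
    return list(zip(paths, file_urls))


def put_file(path, signed_url, timeout=60):
    """PUT one file to its signed URL without a Content-Type header."""
    parts = urlsplit(signed_url)
    target = parts.path + (f"?{parts.query}" if parts.query else "")
    conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    try:
        with open(path, "rb") as fh:
            # http.client sets Content-Length for a bytes body but no Content-Type.
            conn.request("PUT", target, body=fh.read())
        return conn.getresponse().status
    finally:
        conn.close()
```

A caller would loop over `pair_uploads(...)` and call `put_file` for each pair, then poll the batch results endpoint with the returned `batch_id`.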
### Batch URL — `POST /extract/task/batch`
```json
{ "files": [ { "url": "https://…/doc.pdf", "data_id": "doc" } ], "model_version": "vlm" }
```
### Batch results — `GET /extract-results/batch/{batch_id}`
```json
{ "code": 0, "data": { "batch_id": "…", "extract_result": [
{ "file_name": "doc.pdf", "state": "done", "full_zip_url": "https://…" }
] } }
```
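
Because each file in a batch finishes independently, a client typically sorts the entries by state on every poll. A minimal sketch, assuming the response shape shown above and using a hypothetical helper name:

```python
def split_batch_results(body):
    """Split a batch-results response into (done, failed, pending)."""
    results = body["data"]["extract_result"]
    # .get() is used because unfinished entries may not carry a zip URL yet.
    done = {r["file_name"]: r.get("full_zip_url") for r in results if r["state"] == "done"}
    failed = [r["file_name"] for r in results if r["state"] == "failed"]
    pending = [r["file_name"] for r in results if r["state"] not in ("done", "failed")]
    return done, failed, pending
```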
---
## Agent API endpoints (`/api/v1/agent`) — no token
### URL — `POST /parse/url`
```json
{ "url": "https://…/doc.pdf", "language": "ch", "enable_table": true, "is_ocr": false, "enable_formula": true, "page_range": "1-10" }
```
`page_range` accepts `from-to` or a single page only (no commas). Returns `{ "code": 0, "data": { "task_id": "…" } }`.
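
A client can reject an invalid `page_range` before calling the Agent API. The check below is a sketch with a hypothetical function name; it additionally assumes that pages are numbered from 1 and that `from` must not exceed `to`, neither of which is spelled out here.

```python
import re

# A single page ("5") or one from-to range ("1-10"); commas are not allowed.
_AGENT_PAGE_RANGE = re.compile(r"(\d+)(?:-(\d+))?")


def is_valid_agent_page_range(value):
    """Return True if value is a single page or a single from-to range."""
    match = _AGENT_PAGE_RANGE.fullmatch(value)
    if match is None:
        return False
    start = int(match.group(1))
    end = int(match.group(2)) if match.group(2) else start
    return 1 <= start <= end
```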
### File — `POST /parse/file`
```json
{ "file_name": "doc.pdf", "language": "ch" }
```
Response → `{ "data": { "task_id": "…", "file_url": "https://oss…" } }`; PUT the file to `file_url`.
### Result — `GET /parse/{task_id}`
```json
{ "code": 0, "data": { "task_id": "…", "state": "done", "markdown_url": "https://cdn…/full.md" } }
```
---
## Task states
`pending` (queued) · `running` (parsing) · `converting` (format conversion) ·
`uploading` (downloading source, Agent) · `waiting-file` (awaiting upload) ·
`done` (complete) · `failed` (error).
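
Only `done` and `failed` are terminal; every other state means the client should keep polling. The loop below is a sketch: `wait_for_task` and its parameters are hypothetical, the interval and timeout defaults are arbitrary, and `fetch_state` stands for any callable that returns the `data` object of a result endpoint.

```python
import time

TERMINAL_STATES = {"done", "failed"}


def wait_for_task(fetch_state, interval=2.0, timeout=300.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_state() until the task reaches a terminal state or times out."""
    deadline = clock() + timeout
    while True:
        data = fetch_state()
        if data["state"] in TERMINAL_STATES:
            return data
        if clock() >= deadline:
            raise TimeoutError(f"task still {data['state']!r} after {timeout} s")
        sleep(interval)
```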
---
## Parameters
| Parameter | Type | Default | Notes |
|-----------|------|---------|-------|
| `model_version` | string | `pipeline` | `pipeline`, `vlm` (recommended), `MinerU-HTML` (HTML only) |
| `is_ocr` | bool | `false` | OCR for scanned docs (pipeline/vlm) |
| `enable_formula` | bool | `true` | Formula recognition |
| `enable_table` | bool | `true` | Table recognition |
| `language` | string | `ch` | OCR language (see official `language` table) |
| `page_ranges` | string | all | Standard: `"2,4-6"`; Agent `page_range`: `"1-10"` only |
| `extra_formats` | array | `[]` | `docx` / `html` / `latex` (Standard only) |
| `data_id` | string | | `[A-Za-z0-9_.-]`, ≤ 128 chars |
| `no_cache` | bool | `false` | Bypass URL cache (Standard) |
| `cache_tolerance` | int | `900` | Cache TTL seconds (Standard) |
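
The `data_id` constraint in the table translates directly into a regular expression. A small sketch with a hypothetical function name; treating the empty string as invalid is an assumption.

```python
import re

# Allowed characters and the 128-character cap come from the table above.
_DATA_ID = re.compile(r"[A-Za-z0-9_.-]{1,128}")


def is_valid_data_id(value):
    """Return True if value uses only allowed characters and is 1 to 128 long."""
    return _DATA_ID.fullmatch(value) is not None
```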
---
## Limits
| | Standard | Agent |
|---|---|---|
| File size | 200 MB | 10 MB |
| Pages | 200 | 20 |
| Batch | 50 / request | 1 |
| Quota | 1000 pages/day priority | IP rate-limited (HTTP 429) |
Supported types: PDF, images (png/jpg/jpeg/jp2/webp/gif/bmp), Doc(x), Ppt(x), Xls(x); HTML is Standard-only.
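
A pre-flight check on the file extension could follow the supported-types list above. In this sketch the function name is hypothetical, "Doc(x), Ppt(x), Xls(x)" is expanded to both the legacy and the `x` extensions, and only `.html` is treated as an HTML input; all three are assumptions.

```python
from pathlib import PurePath

# Extensions accepted by both APIs, expanded from the supported-types list.
COMMON_EXTS = {
    "pdf", "png", "jpg", "jpeg", "jp2", "webp", "gif", "bmp",
    "doc", "docx", "ppt", "pptx", "xls", "xlsx",
}
STANDARD_ONLY_EXTS = {"html"}


def apis_supporting(filename):
    """Return the set of APIs that accept this file, judged by extension only."""
    ext = PurePath(filename).suffix.lower().lstrip(".")
    if ext in COMMON_EXTS:
        return {"standard", "agent"}
    if ext in STANDARD_ONLY_EXTS:
        return {"standard"}
    return set()
```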
---
## Error codes
| Code | Meaning |
|------|---------|
| `A0202` | Invalid token |
| `A0211` | Token expired |
| `-500` | Parameter error |
| `-10001` / `-10002` | Service error / invalid params |
| `-60002` | Unsupported file format |
| `-60003` / `-60004` | File read failed / empty file |
| `-60005` | File too large (> 200 MB) |
| `-60006` | Too many pages (> 200) |
| `-60008` | File read timeout (URL unreachable) |
| `-60010` | Parse failed |
| `-60015` / `-60016` | File / format conversion failed |
| `-60018` | Daily quota reached |
| `-60022` | Web page read failed (rate-limited) |
| **Agent API** | |
| `-30001` | Exceeds Agent 10 MB limit → use Standard API |
| `-30002` | Unsupported file type for Agent |
| `-30003` | Exceeds Agent 20-page limit → use Standard API or `--pages` |
| `-30004` | Invalid request parameters |
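
The table above can be turned into a lookup with a retry hint for the two Agent limits that the Standard API can absorb. This is a sketch, not the skill's mapping: the helper names are hypothetical, the dictionary is abridged, and numeric codes are assumed to arrive as JSON integers while gateway codes such as `A0202` arrive as strings.

```python
# Abridged copy of the error table; extend as needed.
ERROR_HINTS = {
    "A0202": "Invalid token",
    "A0211": "Token expired",
    -60005: "File too large (> 200 MB)",
    -60006: "Too many pages (> 200)",
    -60018: "Daily quota reached",
    -30001: "Exceeds Agent 10 MB limit",
    -30003: "Exceeds Agent 20-page limit",
}

# Agent errors that the table says can be resolved by switching APIs.
AGENT_FALLBACK_CODES = {-30001, -30003}


def hint_for(code):
    """Return a human-readable hint for a known code, or a generic message."""
    return ERROR_HINTS.get(code, f"Unknown error code {code}")


def should_fall_back_to_standard(code):
    """Return True when an Agent failure suggests retrying on the Standard API."""
    return code in AGENT_FALLBACK_CODES
```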