171 lines
5.4 KiB
Markdown
171 lines
5.4 KiB
Markdown
# MinerU API Reference
|
||
|
||
Official docs: https://mineru.net/apiManage/docs · Token: https://mineru.net/apiManage/token
|
||
|
||
MinerU exposes **two** document-parsing APIs. This skill auto-routes between them.
|
||
|
||
| | 🎯 Standard API | ⚡ Agent API (lightweight) |
|
||
|---|---|---|
|
||
| Base URL | `https://mineru.net/api/v4` | `https://mineru.net/api/v1/agent` |
|
||
| Token | **required** (`Bearer`) | **none** (IP rate-limited) |
|
||
| Models | `pipeline` / `vlm` / `MinerU-HTML` | fixed lightweight `pipeline` |
|
||
| File size | ≤ 200 MB | ≤ 10 MB |
|
||
| Pages | ≤ 200 | ≤ 20 |
|
||
| Batch | ≤ 50 per request | single file only |
|
||
| Output | zip (Markdown + JSON, optional DOCX/HTML/LaTeX) | Markdown only (CDN link) |
|
||
| Designed for | high-accuracy / complex / batch | AI-agent / quick / no-login |
|
||
|
||
Free Standard-API quota: **1000 pages/day at highest priority** (overflow is lower priority).
|
||
|
||
---
|
||
|
||
## Authentication (Standard API)
|
||
|
||
```
|
||
Authorization: Bearer YOUR_API_TOKEN
|
||
```
|
||
|
||
Get a token at https://mineru.net/apiManage/token.
|
||
|
||
> **Response envelopes.** Business endpoints return `{"code":0,"data":{…},"msg":"ok"}`.
|
||
> The auth/gateway layer returns a *different* shape on failure:
|
||
> `{"success":false,"msgCode":"A0202","msg":"user authenticate failed"}`.
|
||
> Clients must handle both — this skill maps `msgCode` to the same error hints.
|
||
|
||
---
|
||
|
||
## Standard API endpoints (`/api/v4`)
|
||
|
||
### Single URL — `POST /extract/task`
|
||
|
||
```json
|
||
{
|
||
"url": "https://example.com/doc.pdf",
|
||
"model_version": "vlm",
|
||
"is_ocr": false,
|
||
"enable_formula": true,
|
||
"enable_table": true,
|
||
"language": "ch",
|
||
"page_ranges": "1-10",
|
||
"extra_formats": ["docx", "html"],
|
||
"data_id": "my-document"
|
||
}
|
||
```
|
||
Response → `{ "code": 0, "data": { "task_id": "…" } }`. HTML inputs require `model_version: "MinerU-HTML"`.
|
||
|
||
### Get task result — `GET /extract/task/{task_id}`
|
||
|
||
```json
|
||
{ "code": 0, "data": { "task_id": "…", "state": "done", "full_zip_url": "https://…", "err_msg": "" } }
|
||
```
|
||
|
||
### Batch local upload — `POST /file-urls/batch`
|
||
|
||
Returns signed upload URLs; PUT each file (no `Content-Type`). Up to **50** files / request.
|
||
|
||
```json
|
||
{ "files": [ { "name": "doc.pdf", "data_id": "doc" } ], "model_version": "vlm" }
|
||
```
|
||
Response → `{ "code": 0, "data": { "batch_id": "…", "file_urls": ["https://…"] } }`.
|
||
|
||
### Batch URL — `POST /extract/task/batch`
|
||
|
||
```json
|
||
{ "files": [ { "url": "https://…/doc.pdf", "data_id": "doc" } ], "model_version": "vlm" }
|
||
```
|
||
|
||
### Batch results — `GET /extract-results/batch/{batch_id}`
|
||
|
||
```json
|
||
{ "code": 0, "data": { "batch_id": "…", "extract_result": [
|
||
{ "file_name": "doc.pdf", "state": "done", "full_zip_url": "https://…" }
|
||
] } }
|
||
```
|
||
|
||
---
|
||
|
||
## Agent API endpoints (`/api/v1/agent`) — no token
|
||
|
||
### URL — `POST /parse/url`
|
||
|
||
```json
|
||
{ "url": "https://…/doc.pdf", "language": "ch", "enable_table": true, "is_ocr": false, "enable_formula": true, "page_range": "1-10" }
|
||
```
|
||
`page_range` accepts `from-to` or a single page only (no commas). Returns `{ "code": 0, "data": { "task_id": "…" } }`.
|
||
|
||
### File — `POST /parse/file`
|
||
|
||
```json
|
||
{ "file_name": "doc.pdf", "language": "ch" }
|
||
```
|
||
Response → `{ "data": { "task_id": "…", "file_url": "https://oss…" } }`; PUT the file to `file_url`.
|
||
|
||
### Result — `GET /parse/{task_id}`
|
||
|
||
```json
|
||
{ "code": 0, "data": { "task_id": "…", "state": "done", "markdown_url": "https://cdn…/full.md" } }
|
||
```
|
||
|
||
---
|
||
|
||
## Task states
|
||
|
||
`pending` (queued) · `running` (parsing) · `converting` (format conversion) ·
|
||
`uploading` (downloading source, Agent) · `waiting-file` (awaiting upload) ·
|
||
`done` (complete) · `failed` (error).
|
||
|
||
---
|
||
|
||
## Parameters
|
||
|
||
| Parameter | Type | Default | Notes |
|
||
|-----------|------|---------|-------|
|
||
| `model_version` | string | `pipeline` | `pipeline`, `vlm` (recommended), `MinerU-HTML` (HTML only) |
|
||
| `is_ocr` | bool | `false` | OCR for scanned docs (pipeline/vlm) |
|
||
| `enable_formula` | bool | `true` | Formula recognition |
|
||
| `enable_table` | bool | `true` | Table recognition |
|
||
| `language` | string | `ch` | OCR language (see official `language` table) |
|
||
| `page_ranges` | string | all | Standard: `"2,4-6"`; Agent `page_range`: `"1-10"` only |
|
||
| `extra_formats` | array | `[]` | `docx` / `html` / `latex` (Standard only) |
|
||
| `data_id` | string | – | `[A-Za-z0-9_.-]`, ≤ 128 chars |
|
||
| `no_cache` | bool | `false` | Bypass URL cache (Standard) |
|
||
| `cache_tolerance` | int | `900` | Cache TTL seconds (Standard) |
|
||
|
||
---
|
||
|
||
## Limits
|
||
|
||
| | Standard | Agent |
|
||
|---|---|---|
|
||
| File size | 200 MB | 10 MB |
|
||
| Pages | 200 | 20 |
|
||
| Batch | 50 / request | 1 |
|
||
| Quota | 1000 pages/day priority | IP rate-limited (HTTP 429) |
|
||
|
||
Supported types: PDF, images (png/jpg/jpeg/jp2/webp/gif/bmp), Doc(x), Ppt(x), Xls(x); HTML is Standard-only.
|
||
|
||
---
|
||
|
||
## Error codes
|
||
|
||
| Code | Meaning |
|
||
|------|---------|
|
||
| `A0202` | Invalid token |
|
||
| `A0211` | Token expired |
|
||
| `-500` | Parameter error |
|
||
| `-10001` / `-10002` | Service error / invalid params |
|
||
| `-60002` | Unsupported file format |
|
||
| `-60003` / `-60004` | File read failed / empty file |
|
||
| `-60005` | File too large (> 200 MB) |
|
||
| `-60006` | Too many pages (> 200) |
|
||
| `-60008` | File read timeout (URL unreachable) |
|
||
| `-60010` | Parse failed |
|
||
| `-60015` / `-60016` | File / format conversion failed |
|
||
| `-60018` | Daily quota reached |
|
||
| `-60022` | Web page read failed (rate-limited) |
|
||
| **Agent API** | |
|
||
| `-30001` | Exceeds Agent 10 MB limit → use Standard API |
|
||
| `-30002` | Unsupported file type for Agent |
|
||
| `-30003` | Exceeds Agent 20-page limit → use Standard API or `--pages` |
|
||
| `-30004` | Invalid request parameters |
|