Compare commits
15 Commits
4b50a75a3d
...
f45f55b50a
| Author | SHA1 | Date | |
|---|---|---|---|
| | f45f55b50a | | |
| | f18d966123 | | |
| | 8466b0e710 | | |
| | d009411360 | | |
| | bb74aee41b | | |
| | ecf332add5 | | |
| | b618cb12d2 | | |
| | 22b9ad4877 | | |
| | 1925de0355 | | |
| | 96585886f8 | | |
| | 5594eab520 | | |
| | 022781b145 | | |
| | 3ada55af40 | | |
| | 96ded5e598 | | |
| | 203dcf4a4e | | |
55
.features/memory/MEMORY.md
Normal file
@ -0,0 +1,55 @@
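---
feature: "memory"
scope: "Agent long-term memory (Mem0 + pgvector): cross-session recall, fact extraction, and storage"
updated_at: "2026-06-01"
status: active
---

# Memory

## Current state
The agent's long-term memory is built on the **Mem0** library with **pgvector** (PostgreSQL vector storage).
Before the agent runs, the middleware calls `recall` to fetch relevant memories and injects them into the system prompt; after the agent runs, a background thread extracts and stores new facts asynchronously.
Tenants are isolated by `(user_id, agent_id)`, and each `agent_id` has its own `mem0_{agent_id}` collection table.

> Note: the API and config fields were historically named `memori` and keep that name for compatibility; the implementation underneath is **Mem0**.
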
## Configuration switches
| Level | Field | Default | Location |
|------|------|------|------|
| Global master switch | `MEM0_ENABLED` (env) | `true` | `utils/settings.py:80` |
| Agent config | `enable_memori: bool` | `False` | `agent/agent_config.py:47` |
| API request | `enable_memory: bool` | `False` | `utils/api_models.py:56` |
| Recall count | `memori_semantic_search_top_k: int` | `20` | `agent/agent_config.py:48` |
| Recall count (env) | `MEM0_SEMANTIC_SEARCH_TOP_K` | `20` | `utils/settings.py:84` |
| Connection pool size | `MEM0_POOL_SIZE` (env) | `50` | `utils/settings.py:61` |

How to enable: V1 uses `enable_memory` in the request body; V2 uses `enable_memory` in the bot config. Both are gated by the global `MEM0_ENABLED` switch.
The middleware is registered at `agent/deep_assistant.py:270` (`if config.enable_memori:`).

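## Core files
- `agent/mem0_manager.py`: Mem0 client manager. Handles instance creation with an LRU cache (at most 50 instances), connection pool management, `recall_memories` / `add_memory` / `delete_all`, multi-tenant isolation, `CustomMem0Embedding`, and the `json_repair` patch.
- `agent/mem0_middleware.py`: the middleware. `before_agent` recalls memories and writes them to `config._mem0_context` (lines 114/155); `after_agent` extracts and stores facts asynchronously in the background.
- `agent/mem0_config.py`: Mem0 config class covering user/agent/session ids, the memory prompt template, and loading of the custom extraction prompt (`PreMemoryPrompt` hook).
- `routes/memory.py`: memory management API (GET/POST/DELETE) that the frontend uses to manage user memories.
- `drop_mem0_tables.py`: cleanup script that drops all `mem0_*` tables (for resets or clearing dirty data).
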
## Data flow
**Write**: user + assistant messages → `after_agent` (background thread) → `add_memory` → `Mem0.add()` (the LLM extracts facts) → vectorized and stored in the pgvector table `mem0_{agent_id}`.
**Read**: user query → `before_agent` → `recall_memories` → `Mem0.search()` (top_k by vector similarity) → formatted and written to `config._mem0_context` → injected into the system prompt (also used by the thinking feature, [[../thinking/MEMORY|thinking]]).

||||
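The read and write paths can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's middleware: `FakeMemoryStore`, `MemoryMiddlewareSketch`, and the plain `context` dict are hypothetical stand-ins for the Mem0 client, the real middleware class, and `config._mem0_context`.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class FakeMemoryStore:
    """Hypothetical stand-in for the Mem0 client (not the real Mem0 API)."""
    facts: list = field(default_factory=list)

    def search(self, query: str, top_k: int) -> list:
        # Naive keyword match in place of vector similarity search.
        words = query.lower().split()
        hits = [f for f in self.facts if any(w in f.lower() for w in words)]
        return hits[:top_k]

    def add(self, text: str) -> None:
        # The real pipeline asks an LLM to extract facts; here the text is stored as-is.
        self.facts.append(text)


class MemoryMiddlewareSketch:
    def __init__(self, store, top_k=20):
        self.store = store
        self.top_k = top_k

    def before_agent(self, context: dict, user_query: str) -> None:
        # Read path: recall, format, and expose the result for prompt injection.
        memories = self.store.search(user_query, self.top_k)
        context["mem0_context"] = "\n".join(f"- {m}" for m in memories)

    def after_agent(self, user_msg: str, assistant_msg: str) -> threading.Thread:
        # Write path: store in a background thread so the response is not delayed.
        text = f"user: {user_msg}\nassistant: {assistant_msg}"
        worker = threading.Thread(target=self.store.add, args=(text,), daemon=True)
        worker.start()
        return worker
```

Returning the worker thread is a choice made for this sketch so a caller or a test can `join()` it; the source only states that storage happens on a background thread.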
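## Key design decisions
- Reuse the embedding model the project has already loaded (`CustomMem0Embedding`) so that Mem0 does not load a second SentenceTransformer → `decisions/2026-06-custom-embedding.md`
- Release pooled connections proactively and keep Mem0 instances in an LRU cache to prevent connection pool exhaustion → `decisions/2026-06-connection-pool.md`
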
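## Gotchas (required reading for developers)
- **Naming trap**: the config field is `enable_memori` (no "y"), the API field is `enable_memory`, and the implementation is Mem0. Do not mix up the three names.
- **Connection pool exhaustion**: Mem0's PGVector takes a connection in `__init__` and releases it in `__del__`. Call `_release_connection()` after every operation, otherwise high concurrency will saturate `MEM0_POOL_SIZE`.
- **Fragile JSON**: the JSON the LLM returns during fact extraction often contains trailing commas or single quotes. Parsing is monkey-patched to `json_repair.loads`; do not revert to the native parser.
- **Table bloat**: there is one table per `agent_id`, so long-running deployments with many bots accumulate many tables. Clean up periodically with `drop_mem0_tables.py`.
- **Embedding dimension**: the model is `paraphrase-multilingual-MiniLM-L12-v2` with 384 dimensions. Switching models requires changing the pgvector column dimension as well, otherwise writes fail.
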
## Index
- Design decisions: `decisions/`
- Changelog: `changelog/`
- Related docs: `docs/`
6
.features/memory/changelog/2026-Q2.md
Normal file
@ -0,0 +1,6 @@
# Changelog 2026 Q2: Memory

## 2026-06-01
- Initialized the memory feature documentation.
- Recorded the current state: long-term memory on Mem0 + pgvector, with recall and injection in `before_agent` and background extraction and storage in `after_agent`.
- Archived design decisions: custom embedding reuse (custom-embedding) and proactive connection release plus LRU cache (connection-pool).
25
.features/memory/decisions/2026-06-connection-pool.md
Normal file
@ -0,0 +1,25 @@
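---
date: "2026-06-01"
status: adopted
topic: "connection-pool"
impact: [memory, performance, stability]
---

# Proactive connection release + LRU cache of Mem0 instances

## Background
Mem0's PGVector backend takes one connection from the pool in the instance's `__init__` and, in theory, returns it in `__del__`.
Python's garbage collection timing is not deterministic, however, so under high concurrency connections that are returned late can quickly saturate `MEM0_POOL_SIZE` (default 50) and block subsequent requests.
In addition, creating a new Mem0 instance for every `(user_id, agent_id)` without ever reclaiming it would hold connections without bound.

## Decision
1. `Mem0Manager` keeps an LRU cache of at most 50 Mem0 instances in an `OrderedDict` and evicts the least recently used instance once the limit is exceeded.
2. After every memory operation (recall/add), it calls `_release_connection()` to return the connection to the pool immediately instead of waiting for GC.
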
||||
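The two decisions can be sketched as a bounded `OrderedDict` cache plus a release-in-`finally` helper. This is an illustrative sketch only: `InstanceLRU`, `with_release`, the `factory` callable, and the public `release_connection()` method are hypothetical names that mirror, but are not, the project's `Mem0Manager` and `_release_connection()`.

```python
from collections import OrderedDict


class InstanceLRU:
    """Bounded cache keyed by (user_id, agent_id); evicts the least recently used entry."""

    def __init__(self, factory, max_size=50):
        self._factory = factory          # hypothetical callable that builds a client
        self._max_size = max_size
        self._items = OrderedDict()

    def get(self, user_id, agent_id):
        key = (user_id, agent_id)
        if key in self._items:
            self._items.move_to_end(key)     # mark as most recently used
            return self._items[key]
        instance = self._factory(user_id, agent_id)
        self._items[key] = instance
        if len(self._items) > self._max_size:
            self._items.popitem(last=False)  # drop the oldest entry
        return instance

    def __len__(self):
        return len(self._items)


def with_release(instance, operation):
    """Run operation(instance) and always hand the connection back afterwards."""
    try:
        return operation(instance)
    finally:
        instance.release_connection()  # assumed method name, for illustration only
```

Releasing in `finally` means the connection goes back to the pool even when the memory operation raises, which is the case that garbage-collection-based release handles least predictably.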
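## Impact
- The connection pool is no longer dragged down by slow GC and stays stable under high concurrency.
- The number of instances is bounded, so memory usage stays under control.

## Gotchas
- Do not hold a Mem0 instance's connection across multiple `await` points within an operation chain; doing so bypasses the release logic.
- The LRU limit (50) and `MEM0_POOL_SIZE` (50) are related; when changing one, re-evaluate the other.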
22
.features/memory/decisions/2026-06-custom-embedding.md
Normal file
@ -0,0 +1,22 @@
---
date: "2026-06-01"
status: adopted
topic: "custom-embedding"
impact: [memory, performance]
---

# Reuse the project's embedding model instead of Mem0's own SentenceTransformer

## Background
By default, Mem0 loads its own SentenceTransformer for embeddings. The project already loads `paraphrase-multilingual-MiniLM-L12-v2` (384 dimensions) through `GlobalModelManager`. If Mem0 were left to load its own copy, the same model would sit in memory twice, wasting GPU memory and RAM.

## Decision
`CustomMem0Embedding`, implemented in `agent/mem0_manager.py`, connects Mem0's embedder to the project's already loaded global model so that both share one set of weights.

||||
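The adapter idea can be illustrated with a small wrapper. This is a sketch under assumptions, not the project's `CustomMem0Embedding`: the `embed` method name is an assumed embedder interface, and `FakeModel` is a hypothetical stand-in for the globally loaded model, exposing an `encode` method.

```python
class SharedModelEmbedder:
    """Adapter that lets a memory library reuse an embedding model that is already loaded."""

    def __init__(self, shared_model, dims=384):
        self._model = shared_model   # the same object the rest of the app uses
        self.dims = dims

    def embed(self, text):
        vector = list(self._model.encode(text))
        # Fail early if the model no longer matches the vector column dimension.
        if len(vector) != self.dims:
            raise ValueError(f"expected {self.dims} dims, got {len(vector)}")
        return vector


class FakeModel:
    """Hypothetical stand-in for the shared model; returns a constant-valued vector."""

    def __init__(self, dims=384):
        self._dims = dims

    def encode(self, text):
        return [float(len(text))] * self._dims
```

The dimension check is an addition of this sketch; it surfaces a model swap at embedding time rather than at write time.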
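## Impact
- Memory usage drops significantly because the model is not loaded twice.
- The embedding dimension is fixed at 384, matching the project's main model; when switching models, the pgvector column dimension must be changed in step.

## Notes
For the related connection pool and instance cache strategy, see [[2026-06-connection-pool]].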
0
.features/memory/docs/.gitkeep
Normal file
52
.features/thinking/MEMORY.md
Normal file
@ -0,0 +1,52 @@
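---
feature: "thinking"
scope: "Agent thinking feature (pre-answer auxiliary reasoning via GuidelineMiddleware): generates <think> content once before the main answer"
updated_at: "2026-06-01"
status: active
---

# Thinking

## Current state
The thinking feature is implemented by a custom **`GuidelineMiddleware`**: before the main agent runs, it calls the model once with a business-guideline prompt to "think",
then wraps the result in `<think>...</think>` tags and attaches `message_tag: "THINK"` metadata so the frontend can recognize it and render it collapsed.

> Important: this is "one auxiliary request before the main request", **not** the built-in reasoning/extended-thinking mode of Qwen models, so it is model-agnostic and works with any LLM. It serves the same purpose as OpenAI o1 / Claude thinking, with a lighter implementation.
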
## Configuration switches
| Level | Field | Default | Location |
|------|------|------|------|
| Agent config | `enable_thinking: bool` | `False` | `agent/agent_config.py:26` |
| API request | `enable_thinking: bool` | `False` | `utils/api_models.py:54` |

How to enable: V1 uses `enable_thinking` in the request body; V2 uses `enable_thinking` in the bot config.
The middleware is registered at `agent/deep_assistant.py:294`: `if config.enable_thinking: middleware.append(GuidelineMiddleware(...))`.

## Core files
- `agent/guideline_middleware.py`: main thinking logic. `get_guideline_prompt` (line 53 onward) assembles the guideline prompt; `before_agent`/`abefore_agent` call the model to generate the thought, wrap it in `<think>` tags, and tag it `THINK` (lines 120-124 / 146-149).
- `agent/deep_assistant.py:294-295`: registers the middleware when `enable_thinking` is set.

## Data flow
1. `before_agent` loads the guidelines (the Guidelines block in the system prompt).
2. It extracts guidelines / tool_description / scenarios / terms_list from the system prompt.
3. It assembles `guideline_prompt` = business rules + chat history + **memory context** + tool descriptions + scenarios + term analysis.
4. It calls the model once with `SystemMessage(guideline_prompt)` plus the user's last message to obtain the thought.
5. The content is wrapped in `<think>...</think>`, with `additional_kwargs["message_tag"] = "THINK"`.
6. An empty `HumanMessage` is appended (for models that require the last message to be a user message).
7. The main agent continues and produces the final answer.

||||
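Steps 3, 5, and 6 can be sketched with plain dictionaries. This is an illustration under assumptions: the dicts stand in for LangChain message objects, and `assemble_guideline_prompt` and `build_think_messages` are hypothetical helper names, not functions from `guideline_middleware.py`.

```python
def assemble_guideline_prompt(rules, history, memory_context, tools, scenarios, terms):
    """Join the prompt parts in the order listed in step 3, skipping empty ones."""
    parts = [rules, history, memory_context, tools, scenarios, terms]
    return "\n\n".join(p for p in parts if p)


def build_think_messages(thought: str) -> list:
    """Wrap a thought as a tagged assistant message, followed by an empty user message."""
    think_message = {
        "role": "assistant",
        "content": f"<think>{thought}</think>",
        "additional_kwargs": {"message_tag": "THINK"},
    }
    # Some models require the last message in the list to come from the user.
    trailing_user_message = {"role": "user", "content": ""}
    return [think_message, trailing_user_message]
```

Skipping empty parts is a choice of this sketch; it keeps the prompt well-formed when, for example, no memories were recalled.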
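## Coupling with the memory feature
`guideline_middleware.py:63` reads `config._mem0_context`, which is written by `before_agent` of [[../memory/MEMORY|memory]].
In other words, the thinking stage includes the recalled long-term memories in the guideline prompt, so its analysis can take them into account.
**Ordering dependency**: the memory middleware must run before the thinking middleware, otherwise `_mem0_context` has no value yet.
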
||||
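One way to guard the ordering dependency is a small check run when the middleware list is built. The function below is a hypothetical sketch; the string names `"memory"` and `"thinking"` are placeholders for however the project identifies its middleware.

```python
def assert_memory_before_thinking(middleware_names):
    """Raise if the thinking middleware would run before the memory middleware."""
    if "thinking" not in middleware_names or "memory" not in middleware_names:
        return  # nothing to check when either feature is disabled
    if middleware_names.index("memory") > middleware_names.index("thinking"):
        raise ValueError("memory middleware must be registered before thinking")
```

A check like this turns a silent loss of memory context into an immediate startup error.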
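## Gotchas (required reading for developers)
- **Thinking is not streamed**: the thought is generated in full in `before_agent`; only the final answer is streamed. The frontend relies on the `<think>` tag plus `message_tag:"THINK"` to render it collapsed.
- **One extra model call**: every request with thinking enabled issues an additional LLM request, which adds latency and cost. Weigh this per use case.
- **Not native model reasoning**: do not assume `enable_thinking` is passed through to Qwen; it is a custom implementation at the middleware layer.
- **Trailing empty HumanMessage**: an empty user message is appended after the thinking message. Do not remove it by accident when changing message-list handling.
- **Depends on memory-context ordering**: if the middleware registration order changes, confirm that memory still comes before thinking.

## Index
- Design decisions: `decisions/`
- Changelog: `changelog/`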
7
.features/thinking/changelog/2026-Q2.md
Normal file
@ -0,0 +1,7 @@
# Changelog 2026 Q2: Thinking

## 2026-06-01
- Initialized the thinking feature documentation.
- Recorded the current state: `GuidelineMiddleware` generates the `<think>` content in `before_agent` and tags it `message_tag:"THINK"`.
- Archived design decision: implement thinking in middleware instead of native model reasoning (middleware-thinking).
- Recorded the ordering coupling with the memory feature (dependency on `_mem0_context`).
@ -0,0 +1,28 @@
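---
date: "2026-06-01"
status: adopted
topic: "middleware-thinking"
impact: [thinking, model-compat]
---

# Implement thinking in middleware instead of relying on native model reasoning

## Background
The "thinking feature" can be implemented in two ways:
A. Pass `enable_thinking` through to the underlying model and rely on its built-in reasoning/extended-thinking capability.
B. Add our own auxiliary "guided thinking" LLM call before the main request.

Option A requires the underlying model to support native reasoning, and behavior and output format differ across models, which makes unified frontend handling difficult.

## Decision
Adopt B: implement `GuidelineMiddleware`, which calls the model once with a business-guideline prompt during `before_agent` to generate the thought,
and wraps it uniformly as `<think>...</think>` + `message_tag:"THINK"`.

## Impact
- Decoupled from any specific model; works with any LLM (OpenAI/Claude/Qwen).
- Business rules, tool descriptions, term analysis, and memory context can all be injected at the thinking stage, which gives fine-grained control.
- Cost: one extra LLM call per request (latency + cost); the thought is not streamed.

## Gotchas
- Thinking depends on `config._mem0_context`, so the memory middleware must run before this middleware.
- The empty `HumanMessage` appended after the thought exists for models that require the last message to come from the user; do not remove it.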
133
routes/chat.py
@@ -18,8 +18,10 @@ from utils.fastapi_utils import (
     process_messages,
     create_project_directory, extract_api_key_from_auth, generate_v2_auth_token, fetch_bot_config, fetch_bot_config_from_db,
     call_preamble_llm,
-    create_stream_chunk
+    create_stream_chunk,
+    detect_provider, sanitize_model_kwargs
 )
+from langchain.chat_models import init_chat_model
 from langchain_core.messages import AIMessageChunk, ToolMessage, AIMessage, HumanMessage
 from utils.settings import MAX_OUTPUT_TOKENS
 from agent.agent_config import AgentConfig
@@ -968,6 +970,135 @@ async def chat_completions_v3(request: ChatRequestV3, authorization: Optional[st
         raise HTTPException(status_code=500, detail=f"Internal server error: {str(e)}")
 
 
+async def build_llm_from_bot_config(bot_id: str, user_identifier: Optional[str] = None):
+    """Build a direct LLM client from a bot's database config.
+
+    Reuses the v3 config-loading chain to resolve model / api_key / model_server,
+    then constructs a LangChain chat model without any agent logic.
+
+    Returns:
+        tuple: (llm_instance, model_name)
+    """
+    bot_config = await fetch_bot_config_from_db(bot_id, user_identifier)
+
+    model_name = bot_config.get("model", "")
+    api_key = bot_config.get("api_key", "")
+    model_server = bot_config.get("model_server", "")
+
+    if not model_name:
+        raise HTTPException(status_code=400, detail=f"No model configured for bot '{bot_id}'")
+
+    # Detect provider and sanitize kwargs (same as the agent path)
+    model_provider, base_url = detect_provider(model_name, model_server)
+    model_kwargs, _, _ = sanitize_model_kwargs(
+        model_name=model_name,
+        model_provider=model_provider,
+        base_url=base_url,
+        api_key=api_key,
+        generate_cfg={},
+        source="llm_passthrough"
+    )
+
+    llm = init_chat_model(**model_kwargs)
+    return llm, model_name
+
+
+@router.post("/api/v3/llm/chat/completions")
+async def llm_passthrough_v3(request: ChatRequestV3, authorization: Optional[str] = Header(None)):
+    """LLM passthrough API - direct LLM call, bypassing all agent logic.
+
+    Only model / api_key / model_server are read from the bot's database config
+    (resolved via bot_id). Messages are forwarded to the LLM as-is.
+
+    Required Parameters:
+        - bot_id: str - target bot id (used to look up LLM config from db)
+        - messages: List[Message] - conversation messages, passed through directly
+
+    Optional Parameters:
+        - stream: bool - whether to stream the output, default false
+        - user_identifier: str - used to resolve the api_key owner
+
+    Authentication:
+        - Authorization header is required: Bearer <token>
+        - token = md5(MASTERKEY:bot_id), same scheme as the v2 API
+
+    Returns:
+        Union[dict, StreamingResponse]: OpenAI-compatible completion or stream
+    """
+    try:
+        bot_id = request.bot_id
+        if not bot_id:
+            raise HTTPException(status_code=400, detail="bot_id is required")
+
+        # Authentication validation (same auth logic as v2: token = md5(MASTERKEY:bot_id))
+        expected_token = generate_v2_auth_token(bot_id)
+        provided_token = extract_api_key_from_auth(authorization)
+
+        if not provided_token:
+            raise HTTPException(
+                status_code=401,
+                detail="Authorization header is required"
+            )
+
+        if provided_token != expected_token:
+            # Do not echo any part of the expected token back to the caller.
+            raise HTTPException(
+                status_code=403,
+                detail="Invalid authentication token"
+            )
+
+        # Build the LLM client from db config
+        llm, model_name = await build_llm_from_bot_config(bot_id, request.user_identifier)
+
+        # Forward messages as-is (pure passthrough, no agent processing)
+        lc_messages = [{"role": msg.role, "content": msg.content} for msg in request.messages]
+
+        chunk_id = f"chatcmpl-{int(time.time())}"
+
+        # Streaming response
+        if request.stream:
+            async def generate():
+                try:
+                    async for chunk in llm.astream(lc_messages):
+                        content = chunk.content if isinstance(chunk.content, str) else str(chunk.content)
+                        if content:
+                            data = create_stream_chunk(chunk_id, model_name, content=content)
+                            yield f"data: {json.dumps(data, ensure_ascii=False)}\n\n"
+                    # Final chunk with finish_reason
+                    done = create_stream_chunk(chunk_id, model_name, finish_reason="stop")
+                    yield f"data: {json.dumps(done, ensure_ascii=False)}\n\n"
+                    yield "data: [DONE]\n\n"
+                except Exception as stream_error:
+                    logger.error(f"Error in LLM passthrough stream: {stream_error}")
+                    err = {"error": {"message": str(stream_error), "type": "internal_error"}}
+                    yield f"data: {json.dumps(err, ensure_ascii=False)}\n\n"
+
+            return StreamingResponse(generate(), media_type="text/event-stream")
+
+        # Non-streaming response
+        response = await llm.ainvoke(lc_messages)
+        content = response.content if isinstance(response.content, str) else str(response.content)
+
+        return {
+            "id": chunk_id,
+            "object": "chat.completion",
+            "created": int(time.time()),
+            "model": model_name,
+            "choices": [{
+                "index": 0,
+                "message": {"role": "assistant", "content": content},
+                "finish_reason": "stop"
+            }]
+        }
+
+    except HTTPException:
+        raise
+    except Exception as e:
+        error_details = traceback.format_exc()
+        logger.error(f"Error in llm_passthrough_v3: {str(e)}")
+        logger.error(f"Full traceback: {error_details}")
+        raise HTTPException(status_code=500, detail=f"Internal server error: {str(e)}")
+
+
 # ============================================================================
 # Chat history query endpoints
 # ============================================================================
||||
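The authentication scheme described in the endpoint docstring can be sketched as follows. This is a minimal illustration, assuming the token is the MD5 hex digest of the literal string `MASTERKEY:bot_id`; the real `generate_v2_auth_token` is not shown in this diff, so `make_bot_token` and `token_is_valid` are hypothetical names. The sketch compares tokens with `hmac.compare_digest`, a constant-time comparison that the hunk itself does not use.

```python
import hashlib
import hmac


def make_bot_token(master_key: str, bot_id: str) -> str:
    """MD5 hex digest of "<master_key>:<bot_id>" (assumed scheme, per the docstring)."""
    return hashlib.md5(f"{master_key}:{bot_id}".encode("utf-8")).hexdigest()


def token_is_valid(provided: str, master_key: str, bot_id: str) -> bool:
    expected = make_bot_token(master_key, bot_id)
    # Compare as bytes so non-ASCII input cannot raise, and in constant time
    # so the check does not reveal how many leading characters matched.
    return hmac.compare_digest(provided.encode("utf-8"), expected.encode("utf-8"))
```

A caller would then send the token as `Authorization: Bearer <token>` alongside the `bot_id` in the request body.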
@ -22,6 +22,7 @@ class SkillItem(BaseModel):
|
||||
name: str
|
||||
description: str
|
||||
user_skill: bool = False
|
||||
category: str = "other"
|
||||
|
||||
|
||||
class SkillListResponse(BaseModel):
|
||||
@ -35,6 +36,7 @@ class SkillValidationResult:
|
||||
valid: bool
|
||||
name: Optional[str] = None
|
||||
description: Optional[str] = None
|
||||
category: Optional[str] = None
|
||||
error_message: Optional[str] = None
|
||||
|
||||
|
||||
@ -267,7 +269,8 @@ def parse_plugin_json(plugin_json_path: str) -> SkillValidationResult:
|
||||
return SkillValidationResult(
|
||||
valid=True,
|
||||
name=plugin_config['name'],
|
||||
description=plugin_config['description']
|
||||
description=plugin_config['description'],
|
||||
category=plugin_config.get('category'),
|
||||
)
|
||||
|
||||
except json.JSONDecodeError as e:
|
||||
@ -334,7 +337,8 @@ def parse_skill_frontmatter(skill_md_path: str) -> SkillValidationResult:
|
||||
return SkillValidationResult(
|
||||
valid=True,
|
||||
name=metadata['name'],
|
||||
description=metadata['description']
|
||||
description=metadata['description'],
|
||||
category=metadata.get('category'),
|
||||
)
|
||||
|
||||
except yaml.YAMLError as e:
|
||||
@ -410,10 +414,13 @@ def get_skill_metadata_legacy(skill_path: str) -> Optional[dict]:
|
||||
"""
|
||||
result = get_skill_metadata(skill_path)
|
||||
if result.valid:
|
||||
return {
|
||||
ret = {
|
||||
'name': result.name,
|
||||
'description': result.description
|
||||
'description': result.description,
|
||||
}
|
||||
if result.category:
|
||||
ret['category'] = result.category
|
||||
return ret
|
||||
return None
|
||||
|
||||
|
||||
@ -456,7 +463,8 @@ def get_official_skills(base_dir: str) -> List[SkillItem]:
|
||||
skills.append(SkillItem(
|
||||
name=metadata['name'],
|
||||
description=metadata['description'],
|
||||
user_skill=False
|
||||
user_skill=False,
|
||||
category=metadata.get('category', 'other'),
|
||||
))
|
||||
skill_names.add(skill_name)
|
||||
logger.debug(f"Found official skill: {metadata['name']} from {official_skills_dir}")
|
||||
@ -489,7 +497,8 @@ def get_user_skills(base_dir: str, bot_id: str) -> List[SkillItem]:
|
||||
skills.append(SkillItem(
|
||||
name=metadata['name'],
|
||||
description=metadata['description'],
|
||||
user_skill=True
|
||||
user_skill=True,
|
||||
category=metadata.get('category', 'custom'),
|
||||
))
|
||||
logger.debug(f"Found user skill: {metadata['name']}")
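The category fallback introduced by these hunks can be summarized in a few lines. The sketch below is illustrative only: `SkillItemSketch` and `to_skill_item` are hypothetical names, and a plain dataclass stands in for the project's Pydantic `SkillItem`.

```python
from dataclasses import dataclass


@dataclass
class SkillItemSketch:
    name: str
    description: str
    user_skill: bool = False
    category: str = "other"


def to_skill_item(metadata: dict, user_skill: bool) -> SkillItemSketch:
    # Official skills fall back to "other" and user skills to "custom",
    # matching the defaults passed to metadata.get() in the hunks above.
    default_category = "custom" if user_skill else "other"
    return SkillItemSketch(
        name=metadata["name"],
        description=metadata["description"],
        user_skill=user_skill,
        category=metadata.get("category", default_category),
    )
```

Because the legacy metadata helper only sets `category` when it is non-empty, a skill without a category reaches this fallback through a missing key rather than through `None`.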
|
||||
|
||||
|
||||
@ -18,5 +18,6 @@
|
||||
"{bot_id}"
|
||||
]
|
||||
}
|
||||
}
|
||||
},
|
||||
"category": "Data & Retrieval"
|
||||
}
|
||||
|
||||
@ -314,7 +314,12 @@ async def handle_request(request: Dict[str, Any]) -> Dict[str, Any]:
|
||||
top_k = arguments.get("top_k", 100)
|
||||
|
||||
if not query:
|
||||
return create_error_response(request_id, -32602, "Missing required parameter: query")
|
||||
return create_success_response(request_id, {
|
||||
"content": [{
|
||||
"type": "text",
|
||||
"text": "Error: missing required parameter 'query'. Please call this tool again with a non-empty 'query' argument describing what you want to retrieve."
|
||||
}]
|
||||
})
|
||||
|
||||
result = rag_retrieve(query, top_k, trace_id)
|
||||
|
||||
@ -328,7 +333,12 @@ async def handle_request(request: Dict[str, Any]) -> Dict[str, Any]:
|
||||
query = arguments.get("query", "")
|
||||
|
||||
if not query:
|
||||
return create_error_response(request_id, -32602, "Missing required parameter: query")
|
||||
return create_success_response(request_id, {
|
||||
"content": [{
|
||||
"type": "text",
|
||||
"text": "Error: missing required parameter 'query'. Please call this tool again with a non-empty 'query' argument describing what you want to retrieve."
|
||||
}]
|
||||
})
|
||||
|
||||
result = table_rag_retrieve(query, trace_id)
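The change in these hunks replaces a JSON-RPC error with a successful response whose text content tells the caller how to retry. A rough sketch of that pattern follows; the envelope shape of `create_success_response` is an assumption (the project's helper is not shown in this diff), and `missing_query_response` is a hypothetical name.

```python
def create_success_response(request_id, result):
    """JSON-RPC 2.0 success envelope (assumed shape of the project's helper)."""
    return {"jsonrpc": "2.0", "id": request_id, "result": result}


def missing_query_response(request_id):
    # Returning the message as tool content lets the calling model read it and retry,
    # instead of the call failing at the protocol level.
    text = (
        "Error: missing required parameter 'query'. "
        "Please call this tool again with a non-empty 'query' argument."
    )
    return create_success_response(request_id, {"content": [{"type": "text", "text": text}]})
```

The trade-off is that protocol-level clients no longer see a `-32602` error for this case, so anything that counted such errors would need to inspect the content text instead.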
|
||||
|
||||
|
||||
@ -18,5 +18,6 @@
|
||||
"{bot_id}"
|
||||
]
|
||||
}
|
||||
}
|
||||
},
|
||||
"category": "Data & Retrieval"
|
||||
}
|
||||
|
||||
@ -314,7 +314,12 @@ async def handle_request(request: Dict[str, Any]) -> Dict[str, Any]:
|
||||
top_k = arguments.get("top_k", 100)
|
||||
|
||||
if not query:
|
||||
return create_error_response(request_id, -32602, "Missing required parameter: query")
|
||||
return create_success_response(request_id, {
|
||||
"content": [{
|
||||
"type": "text",
|
||||
"text": "Error: missing required parameter 'query'. Please call this tool again with a non-empty 'query' argument describing what you want to retrieve."
|
||||
}]
|
||||
})
|
||||
|
||||
result = rag_retrieve(query, top_k, trace_id)
|
||||
|
||||
@ -328,7 +333,12 @@ async def handle_request(request: Dict[str, Any]) -> Dict[str, Any]:
|
||||
query = arguments.get("query", "")
|
||||
|
||||
if not query:
|
||||
return create_error_response(request_id, -32602, "Missing required parameter: query")
|
||||
return create_success_response(request_id, {
|
||||
"content": [{
|
||||
"type": "text",
|
||||
"text": "Error: missing required parameter 'query'. Please call this tool again with a non-empty 'query' argument describing what you want to retrieve."
|
||||
}]
|
||||
})
|
||||
|
||||
result = table_rag_retrieve(query, trace_id)
|
||||
|
||||
|
||||
@ -1,6 +1,7 @@
|
||||
{
|
||||
"name": "data-dashboard",
|
||||
"description": "Renders data as an interactive dashboard card UI using the mcp-ui protocol.",
|
||||
"category": "Interactive UI",
|
||||
"hooks": {
|
||||
"PrePrompt": [
|
||||
{
|
||||
|
||||
@ -2,6 +2,7 @@
|
||||
name: docx
|
||||
description: "Comprehensive document creation, editing, and analysis with support for tracked changes, comments, formatting preservation, and text extraction. When Claude needs to work with professional documents (.docx files) for: (1) Creating new documents, (2) Modifying or editing content, (3) Working with tracked changes, (4) Adding comments, or any other document tasks"
|
||||
license: Proprietary. LICENSE.txt has complete terms
|
||||
category: Document Processing
|
||||
---
|
||||
|
||||
# DOCX creation, editing, and analysis
|
||||
|
||||
@ -16,6 +16,7 @@ metadata:
|
||||
- node
|
||||
- npm
|
||||
primaryEnv: SMTP_PASS
|
||||
category: Communication
|
||||
---
|
||||
|
||||
# IMAP/SMTP Email Tool
|
||||
|
||||
@ -13,7 +13,11 @@
|
||||
"mcp_ui": {
|
||||
"transport": "stdio",
|
||||
"command": "python",
|
||||
"args": ["./ui_render_server.py", "{bot_id}"]
|
||||
"args": [
|
||||
"./ui_render_server.py",
|
||||
"{bot_id}"
|
||||
]
|
||||
}
|
||||
}
|
||||
},
|
||||
"category": "Interactive UI"
|
||||
}
|
||||
|
||||
@ -2,6 +2,7 @@
|
||||
name: pdf
|
||||
description: Comprehensive PDF manipulation toolkit for extracting text and tables, creating new PDFs, merging/splitting documents, and handling forms. When Claude needs to fill in a PDF form or programmatically process, generate, or analyze PDF documents at scale.
|
||||
license: Proprietary. LICENSE.txt has complete terms
|
||||
category: Document Processing
|
||||
---
|
||||
|
||||
# PDF Processing Guide
|
||||
|
||||
@ -2,6 +2,7 @@
|
||||
name: pptx
|
||||
description: "Presentation creation, editing, and analysis. When Claude needs to work with presentations (.pptx files) for: (1) Creating new presentations, (2) Modifying or editing content, (3) Working with layouts, (4) Adding comments or speaker notes, or any other presentation tasks"
|
||||
license: Proprietary. LICENSE.txt has complete terms
|
||||
category: Document Processing
|
||||
---
|
||||
|
||||
# PPTX creation, editing, and analysis
|
||||
|
||||
@ -5,6 +5,7 @@ compatibility: Requires Python 3.8+ and PyYAML. Uses AWS SigV4 signing (no exter
|
||||
metadata:
|
||||
author: foundra
|
||||
version: "2.1"
|
||||
category: Web Services
|
||||
---
|
||||
|
||||
# R2 Upload
|
||||
|
||||
@ -8,5 +8,6 @@
|
||||
"command": "python scripts/schedule_manager.py list --format brief"
|
||||
}
|
||||
]
|
||||
}
|
||||
},
|
||||
"category": "Task Scheduling"
|
||||
}
|
||||
|
||||
@ -1,6 +1,7 @@
|
||||
---
|
||||
name: schedule-job
|
||||
description: Scheduled Task Management - Create, manage, and view scheduled tasks for users (supports cron recurring tasks and one-time tasks)
|
||||
category: Task Scheduling
|
||||
---
|
||||
|
||||
# Schedule Job - Scheduled Task Management
|
||||
|
||||
@ -1,6 +1,7 @@
|
||||
---
|
||||
name: skill-creator
|
||||
description: Create new skills, modify and improve existing skills, and measure skill performance. Use when users want to create a skill from scratch, update or optimize an existing skill, run evals to test a skill, benchmark skill performance with variance analysis, or optimize a skill's description for better triggering accuracy.
|
||||
category: Developer Tools
|
||||
---
|
||||
|
||||
# Skill Creator
|
||||
|
||||
@ -2,6 +2,7 @@
|
||||
name: xlsx
|
||||
description: "Comprehensive spreadsheet creation, editing, and analysis with support for formulas, formatting, data analysis, and visualization. When Claude needs to work with spreadsheets (.xlsx, .xlsm, .csv, .tsv, etc) for: (1) Creating new spreadsheets with formulas and formatting, (2) Reading or analyzing data, (3) Modify existing spreadsheets while preserving formulas, (4) Data analysis and visualization in spreadsheets, or (5) Recalculating formulas"
|
||||
license: Proprietary. LICENSE.txt has complete terms
|
||||
category: Document Processing
|
||||
---
|
||||
|
||||
# Requirements for Outputs
|
||||
|
||||
@ -2,6 +2,7 @@
|
||||
name: ai-ppt-generator
|
||||
description: Generate PPT with Baidu AI. Smart template selection based on content.
|
||||
metadata: { "openclaw": { "emoji": "📑", "requires": { "bins": ["python3"], "env":["BAIDU_API_KEY"]},"primaryEnv":"BAIDU_API_KEY" } }
|
||||
category: Document Processing
|
||||
---
|
||||
|
||||
# AI PPT Generator
|
||||
@ -82,4 +83,4 @@ python3 scripts/random_ppt_theme.py --query "企业年度总结" --category "企
|
||||
- **API integration**: Fetches real style_id from Baidu API for each template
|
||||
- **Error handling**: If template not found, falls back to random selection
|
||||
- **Timeout**: Generation takes 2-5 minutes, set sufficient timeout
|
||||
- **Streaming**: Uses streaming API, wait for `is_end: true` before considering complete
|
||||
- **Streaming**: Uses streaming API, wait for `is_end: true` before considering complete
|
||||
|
||||
@ -8,5 +8,6 @@
|
||||
},
|
||||
"skills": [
|
||||
"./skills/catalog-search-agent"
|
||||
]
|
||||
],
|
||||
"category": "Data & Retrieval"
|
||||
}
|
||||
|
||||
@ -13,7 +13,11 @@
|
||||
"ecommerce_storefront": {
|
||||
"transport": "stdio",
|
||||
"command": "python",
|
||||
"args": ["./ecommerce_server.py", "{bot_id}"]
|
||||
"args": [
|
||||
"./ecommerce_server.py",
|
||||
"{bot_id}"
|
||||
]
|
||||
}
|
||||
}
|
||||
},
|
||||
"category": "Developer Tools"
|
||||
}
|
||||
|
||||
195
skills/developing/expense-approval-reviewer/SKILL.md
Normal file
@ -0,0 +1,195 @@
|
||||
---
|
||||
name: expense-approval-reviewer
|
||||
description: 对报销单据(差旅/餐饮/办公等费用报销)做合规性与真实性审核,逐项检查发票、金额、费用类型、事由、预算、重复/拆分报销等风险点,给出明确的审核结论和建议动作。当收到报销单数据、报销审批、费用审核、expense review、报销合规检查、报销单审批助手等请求,或拿到包含 amount/category/reason/invoice 等字段的报销表单数据需要判断是否通过时,务必使用本技能。只输出结构化文本,不要输出 JSON。
|
||||
category: Compliance & Security
|
||||
---
|
||||
|
||||
# 报销审批审核助手(Expense Approval Reviewer)
|
||||
|
||||
## Overview
|
||||
|
||||
本技能面向企业 OA 报销流程,对一张报销单据做**自动初审**,识别合规与真实性风险,并给出一个清晰的审核结论:**通过 / 需关注 / 审批不通过**。
|
||||
|
||||
定位说明:
|
||||
|
||||
- 你是**初审 agent**,只负责审核并**输出文本结论**。
|
||||
- 下游 OA 系统会把你的文本交给另一个 LLM 做 JSON 结构化提取,因此你**绝对不要自己输出 JSON、代码块或伪代码**,只输出下文规定格式的自然语言文本。
|
||||
- 你的结论用于决定单据是被**退回发起人修改**(审批不通过)还是**进入人工审批**(通过/需关注),所以判定要稳定、可解释。
|
||||
|
||||
## Triggering Cues
|
||||
|
||||
出现以下任一情况就使用本技能:
|
||||
|
||||
- 中文:报销审批、报销审核、费用审核、报销单初审、报销合规检查、发票审核、差旅报销审核
|
||||
- 英文:expense review, reimbursement approval, expense compliance check, invoice review
|
||||
- 收到一段报销表单数据(含 `amount` 金额、`category` 费用类型、`reason` 事由、`invoice_img`/发票 等字段),要求判断是否可以通过审批。
|
||||
|
||||
## 输入(Input)
|
||||
|
||||
通常会收到一张报销单的字段数据,常见字段:
|
||||
|
||||
| 字段 key | 含义 | 说明 |
|
||||
|---|---|---|
|
||||
| `amount` | 报销金额(元) | 必填 |
|
||||
| `category` | 费用类型 | travel(差旅) / meal(餐饮) / office(办公用品) / other(其他) |
|
||||
| `reason` | 报销事由 | 自由文本 |
|
||||
| `invoice_img` | 发票照片/凭证 | URL 或附件标识,**为空表示未上传发票** |
|
||||
| `date` / `occurred_at` | 费用发生日期 | 可能没有 |
|
||||
| `creator` / `dept` | 发起人/部门 | 可能没有 |
|
||||
|
||||
字段缺失时:必填项(金额、发票)缺失要明确指出;可选上下文(日期、部门、历史记录)缺失时不要因此卡住,按“信息不足”温和提示即可。
|
||||
|
||||
## 审核要点清单(核心)
|
||||
|
||||
逐项检查以下 8 类。每条标注:**检查什么 → 何时判为问题 → 严重度**。严重度分 `高/中/低`。
|
||||
|
||||
### 1. 发票与凭证完整性 —— 字段 `invoice_img`
|
||||
- 检查:是否提供发票/凭证。
|
||||
- 异常:**未上传发票** → 无法核验真实性。
|
||||
- 严重度:**高**(硬性缺陷,通常直接“审批不通过”退回补票)。
|
||||
|
||||
### 2. 金额合规性 —— 字段 `amount`
|
||||
- 检查:单笔金额是否超过限额、是否为 0 或负数、是否明显异常。
|
||||
- 异常:
|
||||
- 单笔 > 10000 元 → 超单笔阈值,需补充审批说明(**中**,进人工关注)。
|
||||
- 金额 ≤ 0 或非数字 → 数据无效(**高**)。
|
||||
- 严重度:高 / 中(按上述)。
|
||||
|
||||
### 3. 费用类型与事由一致性 —— 字段 `category` × `reason`
|
||||
- 检查:`category` 与 `reason` 描述是否吻合。
|
||||
- 异常:类型为“差旅”但事由是“团队聚餐”等明显不符(**中**)。
|
||||
- 严重度:**中**。
|
||||
|
||||
### 4. 事由充分性 —— 字段 `reason`
|
||||
- 检查:事由是否具体、能说明用途。
|
||||
- 异常:事由过于简略(少于约 4 个有效字、或仅写“报销”“费用”等)→ 无法判断用途(**低**)。
|
||||
- 严重度:**低**。
|
||||
|
||||
### 5. 金额与事由的合理性(真实性嗅探)—— `amount` × `reason` × `category`
|
||||
- 检查:金额相对事由/类型是否离谱。
|
||||
- 异常:如“一次工作午餐”报销上万元、办公用品报销远超常识 → 疑似异常(**中/高**,视偏离程度)。
|
||||
- 严重度:中 / 高。
|
||||
|
||||
### 6. 重复报销 / 拆分报销嫌疑 —— `amount` × 阈值
|
||||
- 检查:金额是否“恰好卡在阈值下方”、是否疑似把大额拆成多笔规避审批。
|
||||
- 异常:金额逼近且略低于限额(如 9800、9900)且无合理说明 → 拆分嫌疑(**中**)。
|
||||
- 提示:若提供了历史报销上下文,检查是否与近期单据重复(同金额同日期同事由 → **高**)。
|
||||
- 严重度:中 / 高。
|
||||
|
||||
### 7. 时效性 —— 字段 `date`/`occurred_at`(若有)
|
||||
- 检查:费用发生日期距今是否超期(如超过 90 天)。
|
||||
- 异常:明显超期且无说明(**低/中**)。
|
||||
- 严重度:低 / 中。无该字段则跳过,不报问题。
|
||||
|
||||
### 8. 发票抬头/税务信息 —— 若数据中含发票明细
|
||||
- 检查:抬头是否为公司主体、是否个人抬头、税号是否缺失。
|
||||
- 异常:个人抬头报销公司费用、关键税务字段缺失(**中**)。
|
||||
- 严重度:**中**。无相关数据则跳过。
|
||||
|
||||
> 字段标注约定:当某条发现指向具体字段时,**用字段英文 key**(`amount`/`category`/`reason`/`invoice_img` 等)标注,方便下游结构化。
|
||||
|
||||
## 判定与结论规则
|
||||
|
||||
综合所有发现,给出**总体判定**(三选一),映射关系如下(下游会据此路由):
|
||||
|
||||
| 总体判定 | 触发条件 | 下游含义 |
|
||||
|---|---|---|
|
||||
| **审批不通过** | 存在任一**高**级硬性缺陷(如缺发票、金额无效、疑似重复/造假) | 退回发起人修改后重新提交 |
|
||||
| **需关注** | 无硬性缺陷,但存在一个或多个**中**级风险 | 进入人工审批,提醒审批人重点关注 |
|
||||
| **通过** | 无风险,或仅有**低**级提示 | 建议人工审批通过 |
|
||||
|
||||
置信度:根据信息完整度与判断确定性给出 `高/中/低`(或 0–100% 区间)。信息缺失越多、判断越主观,置信度越低。
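The mapping in the table above is mechanical, so it can be expressed as a small function, which is useful for regression-testing the skill's outputs. This is a hypothetical sketch for maintainers, not something the skill itself should emit; `overall_verdict` is an illustrative name, and the severity and verdict strings are copied from the table.

```python
def overall_verdict(severities):
    """Map the severities of all findings ("高" / "中" / "低") to the overall verdict."""
    if "高" in severities:
        return "审批不通过"   # any high-severity finding rejects the claim
    if "中" in severities:
        return "需关注"       # medium-severity findings go to manual review
    return "通过"             # only low-severity findings, or none at all
```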
|
||||
|
||||
## 输出格式(Output Format)
|
||||
|
||||
**只输出下面这种结构化中文文本,不要输出 JSON、不要用代码块包裹。** 按固定小标题组织,便于下游 LLM 抽取:
|
||||
|
||||
```
|
||||
【审核结论】通过 / 需关注 / 审批不通过(三选一)
|
||||
【一句话摘要】用一句话说明结论原因。
|
||||
【置信度】高 / 中 / 低(或百分比)
|
||||
【建议动作】建议通过审批 / 建议进入人工审批并关注以下风险 / 建议退回发起人修改
|
||||
|
||||
【风险发现】
|
||||
1. 字段:invoice_img | 严重度:高 | 问题:未上传发票,无法核验报销真实性 | 建议:补充发票凭证后重新提交
|
||||
2. 字段:amount | 严重度:中 | 问题:报销金额 15000 元超过单笔 10000 元阈值 | 建议:拆分报销或补充审批说明
|
||||
...(无风险时写“无”)
|
||||
|
||||
【说明与假设】列出做判断时假设的前提、缺失的上下文(如未提供历史报销记录、无费用日期等)。
|
||||
```
|
||||
|
||||
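Requirements:
- Each `【风险发现】` entry has exactly four segments, `字段 | 严重度 | 问题 | 建议`, separated by the half-width vertical bar `|`, as in the examples.
- Severity is only ever `高/中/低`.
- The verdict must be consistent with the findings: any "高" means "审批不通过"; a highest severity of "中" means "需关注"; all "低" or no findings means "通过".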
|
||||
|
||||
## Workflow
|
||||
|
||||
1. 解析输入的报销字段(amount/category/reason/invoice_img 等)。
|
||||
2. 按“审核要点清单”8 类逐项检查,记录命中的风险(字段、严重度、问题、建议)。
|
||||
3. 按“判定与结论规则”汇总出总体判定与建议动作。
|
||||
4. 评估置信度。
|
||||
5. 按“输出格式”输出结构化文本,**不要输出 JSON**。
|
||||
|
||||
## Examples
|
||||
|
||||
### 示例 1:缺发票 → 审批不通过
|
||||
输入:
|
||||
```
|
||||
amount=800, category=meal, reason=团队聚餐, invoice_img=(空)
|
||||
```
|
||||
输出风格:
|
||||
```
|
||||
【审核结论】审批不通过
|
||||
【一句话摘要】未上传发票,无法核验报销真实性,需补票后重新提交。
|
||||
【置信度】高
|
||||
【建议动作】建议退回发起人修改
|
||||
|
||||
【风险发现】
|
||||
1. 字段:invoice_img | 严重度:高 | 问题:未上传发票照片,无法核验报销真实性 | 建议:补充发票凭证后重新提交
|
||||
|
||||
【说明与假设】未提供历史报销记录,仅基于当前单据字段判断。
|
||||
```
|
||||
|
||||
### 示例 2:超额但有发票 → 需关注(进人工)
|
||||
输入:
|
||||
```
|
||||
amount=15000, category=travel, reason=出差北京参加客户签约会议, invoice_img=http://img/inv.png
|
||||
```
|
||||
输出风格:
|
||||
```
|
||||
【审核结论】需关注
|
||||
【一句话摘要】金额超单笔阈值但发票齐全、事由清晰,建议进入人工审批并关注金额。
|
||||
【置信度】中
|
||||
【建议动作】建议进入人工审批并关注以下风险
|
||||
|
||||
【风险发现】
|
||||
1. 字段:amount | 严重度:中 | 问题:报销金额 15000 元超过单笔 10000 元阈值 | 建议:补充超额审批说明或按规定拆分
|
||||
|
||||
【说明与假设】差旅标准与部门预算未提供,金额合理性以常识判断。
|
||||
```
|
||||
|
||||
### Example 3: ordinary small amount → 通过
|
||||
Input:
|
||||
```
|
||||
amount=120, category=office, reason=采购办公用打印纸, invoice_img=http://img/2.png
|
||||
```
|
||||
Output style:
|
||||
```
|
||||
【审核结论】通过
|
||||
【一句话摘要】金额小、类型与事由一致、发票齐全,无明显风险。
|
||||
【置信度】高
|
||||
【建议动作】建议通过审批
|
||||
|
||||
【风险发现】无
|
||||
|
||||
【说明与假设】基于当前单据字段判断,未发现异常。
|
||||
```
|
||||
|
||||
## Guidelines
|
||||
|
||||
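- **Output text only.** Never output JSON, code, or a Markdown table as the final result (the code blocks in the examples only show the layout; the actual reply is plain text).
- Verdicts must be **stable and reproducible**: the same input should produce the same verdict, which supports downstream extraction and regression testing.
- When optional context is missing (history, budget standards, dates), state that under `【说明与假设】`; do not invent data, and do not refuse to give a verdict because of it.
- This is a **first-pass review aid** and does not replace the final judgment of finance or audit; use wording such as 疑似 (suspected), 建议 (recommended), and 需关注 (needs attention) rather than absolute conclusions.
- Severity and overall verdict must agree (see "判定与结论规则"), to avoid contradictions such as passing a request that has a high-severity risk.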
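Because the reply is plain text with fixed `【…】` labels, a downstream consumer could extract it with a few lines of code. The sketch below is one possible parser, not the extraction logic the project actually uses; `parse_review` is a hypothetical name, and it assumes the full-width colon and the ` | ` separator shown in the template.

```python
import re

# Matches a section label such as 【审核结论】 and captures the title.
SECTION_RE = re.compile(r"【(.+?)】")


def parse_review(text):
    """Split a review reply into {title: body} and parse the risk entries."""
    parts = SECTION_RE.split(text)
    # parts is [prefix, title1, body1, title2, body2, ...]
    sections = {parts[i]: parts[i + 1].strip() for i in range(1, len(parts) - 1, 2)}

    findings = []
    for line in sections.get("风险发现", "").splitlines():
        match = re.match(r"\s*\d+\.\s*(.+)", line)
        if not match:
            continue  # skips blank lines and the "无" placeholder
        fields = {}
        for segment in match.group(1).split(" | "):
            key, _, value = segment.partition(":")  # full-width colon
            fields[key.strip()] = value.strip()
        findings.append(fields)

    sections["风险发现"] = findings
    return sections
```

Keeping the parser this strict is deliberate: a reply that drifts from the template yields missing keys, which makes format regressions easy to detect.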
|
||||
@ -1,6 +1,7 @@
|
||||
---
|
||||
name: managing-scripts
|
||||
description: Manages shared scripts repository for reusable data analysis tools. Check scripts/README.md before writing, design generalized scripts with parameters, and keep documentation in sync.
|
||||
category: Data & Retrieval
|
||||
---
|
||||
|
||||
# Managing Scripts
|
||||
|
||||
49
skills/developing/mineru/SKILL.md
Normal file
@ -0,0 +1,49 @@
|
||||
---
|
||||
name: mineru
|
||||
description: An AI-Native skill for parsing PDF / Office / image files into Markdown with MinerU — a fast, zero-config document parser for AI agents. Works with NO token via the Agent API and auto-upgrades to the Standard API (token) for large files, batches, and DOCX/HTML/LaTeX export. Use when converting PDF/Word/PPT/Excel/image documents, extracting text/tables/formulas, running OCR, or batch processing.
|
||||
category: Document Processing
|
||||
metadata:
|
||||
author: Nebutra
|
||||
version: "3.3.1"
|
||||
argument-hint: <pdf-file-or-url>
|
||||
---
|
||||
|
||||
# MinerU PDF Parser
|
||||
|
||||
Parse PDF, Office, and image documents into structured Markdown via the MinerU API.
|
||||
|
||||
## Quick Start
|
||||
|
||||
```bash
|
||||
# Zero-config: no token, no install (free Agent API)
|
||||
python3 "${CLAUDE_PLUGIN_ROOT}/scripts/mineru.py" ./document.pdf --output ./output/
|
||||
|
||||
# Pipe Markdown back to an agent
|
||||
python3 "${CLAUDE_PLUGIN_ROOT}/scripts/mineru.py" ./document.pdf --stdout
|
||||
|
||||
# Power mode: token unlocks large files / batch / extra formats
|
||||
export MINERU_TOKEN="..." # https://mineru.net/apiManage/token
|
||||
python3 "${CLAUDE_PLUGIN_ROOT}/scripts/mineru.py" ./pdfs/ --output ./output/ --workers 8 --resume
|
||||
```
|
||||
|
||||
## Features
|
||||
|
||||
- **Auto-routing**: free Agent API by default, auto-upgrades to the Standard API (token) for large/batch/extra-format jobs
|
||||
- **Multi-modal**: PDF, images, Word, PPT, Excel, HTML
|
||||
- **High-performance OCR**: `--ocr` with language selection (`--lang`)
|
||||
- **Formula & table recognition**: LaTeX formulas, structured tables
|
||||
- **Multi-format export**: Markdown (default), plus DOCX / HTML / LaTeX
|
||||
- **AI-Native output**: `--stdout` (Markdown) and `--json` (machine status)
|
||||
- **Batch + resume**: parallel workers with `--resume`
|
||||
- **Zero dependencies**: standard library only
|
||||
|
||||
## Authentication
|
||||
|
||||
A token is **optional**: the Agent API works without one. Set a token to unlock the Standard API (≤ 200 MB / ≤ 200 pages, batch, DOCX/HTML/LaTeX):
|
||||
|
||||
```bash
|
||||
export MINERU_TOKEN="your-token-here" # https://mineru.net/apiManage/token
|
||||
```
|
||||
|
||||
Official API docs: https://mineru.net/apiManage/docs
|
||||
170
skills/developing/mineru/references/api_reference.md
Normal file
@ -0,0 +1,170 @@
|
||||
# MinerU API Reference
|
||||
|
||||
Official docs: https://mineru.net/apiManage/docs · Token: https://mineru.net/apiManage/token
|
||||
|
||||
MinerU exposes **two** document-parsing APIs. This skill auto-routes between them.
|
||||
|
||||
| | 🎯 Standard API | ⚡ Agent API (lightweight) |
|---|---|---|
| Base URL | `https://mineru.net/api/v4` | `https://mineru.net/api/v1/agent` |
| Token | **required** (`Bearer`) | **none** (IP rate-limited) |
| Models | `pipeline` / `vlm` / `MinerU-HTML` | fixed lightweight `pipeline` |
| File size | ≤ 200 MB | ≤ 10 MB |
| Pages | ≤ 200 | ≤ 20 |
| Batch | ≤ 50 per request | single file only |
| Output | zip (Markdown + JSON, optional DOCX/HTML/LaTeX) | Markdown only (CDN link) |
| Designed for | high-accuracy / complex / batch | AI-agent / quick / no-login |
|
||||
|
||||
Free Standard-API quota: **1000 pages/day at highest priority** (overflow is lower priority).
|
||||
|
||||
---
|
||||
|
||||
## Authentication (Standard API)
|
||||
|
||||
```
|
||||
Authorization: Bearer YOUR_API_TOKEN
|
||||
```
|
||||
|
||||
Get a token at https://mineru.net/apiManage/token.
|
||||
|
||||
> **Response envelopes.** Business endpoints return `{"code":0,"data":{…},"msg":"ok"}`.
> The auth/gateway layer returns a *different* shape on failure:
> `{"success":false,"msgCode":"A0202","msg":"user authenticate failed"}`.
> Clients must handle both; this skill maps `msgCode` to the same error hints.
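A client can fold the two envelope shapes into one result before any further handling. The following is a minimal sketch, assuming the gateway shape can be recognized by the presence of a `success` key; `normalize_envelope` is a hypothetical helper, not a function from `scripts/mineru.py`.

```python
def normalize_envelope(payload):
    """Return (ok, code, message, data) for either response shape."""
    if "success" in payload:
        # Auth/gateway shape: {"success": false, "msgCode": "...", "msg": "..."}
        ok = bool(payload["success"])
        return ok, payload.get("msgCode"), payload.get("msg", ""), None
    # Business shape: {"code": 0, "data": {...}, "msg": "ok"}
    code = payload.get("code")
    return code == 0, code, payload.get("msg", ""), payload.get("data")
```

With a single tuple shape, the same error-hint lookup can be keyed on the second element whether it is a numeric `code` or a string `msgCode`.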
|
||||
|
||||
---
|
||||
|
||||
## Standard API endpoints (`/api/v4`)
|
||||
|
||||
### Single URL — `POST /extract/task`
|
||||
|
||||
```json
|
||||
{
|
||||
"url": "https://example.com/doc.pdf",
|
||||
"model_version": "vlm",
|
||||
"is_ocr": false,
|
||||
"enable_formula": true,
|
||||
"enable_table": true,
|
||||
"language": "ch",
|
||||
"page_ranges": "1-10",
|
||||
"extra_formats": ["docx", "html"],
|
||||
"data_id": "my-document"
|
||||
}
|
||||
```
|
||||
Response → `{ "code": 0, "data": { "task_id": "…" } }`. HTML inputs require `model_version: "MinerU-HTML"`.
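As an illustration of the request above, the sketch below builds (but does not send) a `POST /extract/task` call with the standard library. The helper names are hypothetical, and the `Content-Type: application/json` header is an assumption for a JSON body rather than something this reference states.

```python
import json
import urllib.request

STANDARD_BASE = "https://mineru.net/api/v4"


def build_headers(token=None):
    """Attach the Bearer header only when a token is available."""
    headers = {"Content-Type": "application/json"}  # assumed for JSON bodies
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return headers


def build_extract_request(url, token, **options):
    """Build the Standard API single-URL request without sending it."""
    body = {"url": url, **options}
    return urllib.request.Request(
        f"{STANDARD_BASE}/extract/task",
        data=json.dumps(body).encode("utf-8"),
        headers=build_headers(token),
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` would submit the task; the token would typically come from the `MINERU_TOKEN` environment variable.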
|
||||
|
||||
### Get task result — `GET /extract/task/{task_id}`
|
||||
|
||||
```json
|
||||
{ "code": 0, "data": { "task_id": "…", "state": "done", "full_zip_url": "https://…", "err_msg": "" } }
|
||||
```
|
||||
|
||||
### Batch local upload — `POST /file-urls/batch`
|
||||
|
||||
Returns signed upload URLs; PUT each file to its URL without a `Content-Type` header. Up to **50** files / request.
|
||||
|
||||
```json
|
||||
{ "files": [ { "name": "doc.pdf", "data_id": "doc" } ], "model_version": "vlm" }
|
||||
```
|
||||
Response → `{ "code": 0, "data": { "batch_id": "…", "file_urls": ["https://…"] } }`.
|
||||
|
||||
### Batch URL — `POST /extract/task/batch`
|
||||
|
||||
```json
|
||||
{ "files": [ { "url": "https://…/doc.pdf", "data_id": "doc" } ], "model_version": "vlm" }
|
||||
```
|
||||
|
||||
### Batch results — `GET /extract-results/batch/{batch_id}`
|
||||
|
||||
```json
|
||||
{ "code": 0, "data": { "batch_id": "…", "extract_result": [
|
||||
{ "file_name": "doc.pdf", "state": "done", "full_zip_url": "https://…" }
|
||||
] } }
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Agent API endpoints (`/api/v1/agent`) — no token
|
||||
|
||||
### URL — `POST /parse/url`
|
||||
|
||||
```json
|
||||
{ "url": "https://…/doc.pdf", "language": "ch", "enable_table": true, "is_ocr": false, "enable_formula": true, "page_range": "1-10" }
|
||||
```
|
||||
`page_range` accepts `from-to` or a single page only (no commas). Returns `{ "code": 0, "data": { "task_id": "…" } }`.
|
||||
|
||||
### File — `POST /parse/file`
|
||||
|
||||
```json
|
||||
{ "file_name": "doc.pdf", "language": "ch" }
|
||||
```
|
||||
Response → `{ "data": { "task_id": "…", "file_url": "https://oss…" } }`; PUT the file to `file_url`.
|
||||
|
||||
### Result — `GET /parse/{task_id}`
|
||||
|
||||
```json
|
||||
{ "code": 0, "data": { "task_id": "…", "state": "done", "markdown_url": "https://cdn…/full.md" } }
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Task states
|
||||
|
||||
`pending` (queued) · `running` (parsing) · `converting` (format conversion) ·
|
||||
`uploading` (downloading source, Agent) · `waiting-file` (awaiting upload) ·
|
||||
`done` (complete) · `failed` (error).
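Since only `done` and `failed` are final, a client typically polls the result endpoint until one of them appears. The following sketch shows that loop under stated assumptions: `fetch_state` is a caller-supplied function returning the `data` object of a result response, and the interval and timeout values are arbitrary defaults, not documented recommendations.

```python
import time

TERMINAL_STATES = {"done", "failed"}


def poll_until_terminal(fetch_state, interval=2.0, timeout=300.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Call fetch_state() until the task reaches a terminal state."""
    deadline = clock() + timeout
    while True:
        data = fetch_state()
        if data.get("state") in TERMINAL_STATES:
            return data  # caller inspects "state" and any result URL
        if clock() >= deadline:
            raise TimeoutError(f"task still {data.get('state')!r} after {timeout}s")
        sleep(interval)
```

Injecting `sleep` and `clock` keeps the loop testable without real waiting or network access.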
|
||||
|
||||
---
|
||||
|
||||
## Parameters
|
||||
|
||||
| Parameter | Type | Default | Notes |
|-----------|------|---------|-------|
| `model_version` | string | `pipeline` | `pipeline`, `vlm` (recommended), `MinerU-HTML` (HTML only) |
| `is_ocr` | bool | `false` | OCR for scanned docs (pipeline/vlm) |
| `enable_formula` | bool | `true` | Formula recognition |
| `enable_table` | bool | `true` | Table recognition |
| `language` | string | `ch` | OCR language (see official `language` table) |
| `page_ranges` | string | all | Standard: `"2,4-6"`; Agent `page_range`: `"1-10"` only |
| `extra_formats` | array | `[]` | `docx` / `html` / `latex` (Standard only) |
| `data_id` | string | – | `[A-Za-z0-9_.-]`, ≤ 128 chars |
| `no_cache` | bool | `false` | Bypass URL cache (Standard) |
| `cache_tolerance` | int | `900` | Cache TTL seconds (Standard) |
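The page-range and `data_id` rules in the table lend themselves to client-side validation before a request is sent. This is a sketch with hypothetical helper names; it assumes no whitespace is allowed inside a range string, which the table does not state either way.

```python
import re

# Standard API: comma-separated pages or ranges, e.g. "2,4-6".
_STANDARD_RANGES = re.compile(r"\d+(-\d+)?(,\d+(-\d+)?)*")
# Agent API: a single page or one from-to range, e.g. "1-10".
_AGENT_RANGE = re.compile(r"\d+(-\d+)?")
# data_id: letters, digits, underscore, dot, hyphen; at most 128 chars.
_DATA_ID = re.compile(r"[A-Za-z0-9_.-]{1,128}")


def valid_page_ranges(value, api="standard"):
    """Check a page-range string against the rules of the chosen API."""
    pattern = _STANDARD_RANGES if api == "standard" else _AGENT_RANGE
    return pattern.fullmatch(value) is not None


def valid_data_id(value):
    """Check the data_id character set and length limit."""
    return _DATA_ID.fullmatch(value) is not None
```

Rejecting a comma-separated range before calling the Agent API avoids a round-trip that would end in a parameter error.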
|
||||
|
||||
---
|
||||
|
||||
## Limits
|
||||
|
||||
| | Standard | Agent |
|---|---|---|
| File size | 200 MB | 10 MB |
| Pages | 200 | 20 |
| Batch | 50 / request | 1 |
| Quota | 1000 pages/day priority | IP rate-limited (HTTP 429) |
|
||||
|
||||
Supported types: PDF, images (png/jpg/jpeg/jp2/webp/gif/bmp), Doc(x), Ppt(x), Xls(x); HTML is Standard-only.
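The limits above imply a simple routing decision between the two APIs. The sketch below is one way to express it and is not the routing code in `scripts/mineru.py`; `choose_api` is a hypothetical name, and treating 1 MB as 1024 × 1024 bytes is an assumption.

```python
MB = 1024 * 1024  # assumption: binary megabytes
AGENT_MAX_BYTES, AGENT_MAX_PAGES = 10 * MB, 20
STANDARD_MAX_BYTES, STANDARD_MAX_PAGES = 200 * MB, 200


def choose_api(size_bytes, pages, file_count=1, extra_formats=(), has_token=False):
    """Pick "agent" or "standard" for a job, or raise if neither can take it."""
    if size_bytes > STANDARD_MAX_BYTES or pages > STANDARD_MAX_PAGES:
        raise ValueError("job exceeds the Standard API limits")
    needs_standard = (
        size_bytes > AGENT_MAX_BYTES
        or pages > AGENT_MAX_PAGES
        or file_count > 1          # Agent API is single-file only
        or bool(extra_formats)     # DOCX/HTML/LaTeX export is Standard-only
    )
    if not needs_standard:
        return "agent"             # free default path, no token needed
    if not has_token:
        raise RuntimeError("this job needs the Standard API, which requires a token")
    return "standard"
```

For example, a 5 MB, 10-page PDF stays on the Agent API, while the same file with `extra_formats=["docx"]` moves to the Standard API when a token is set.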
|
||||
|
||||
---
|
||||
|
||||
## Error codes
|
||||
|
||||
| Code | Meaning |
|------|---------|
| `A0202` | Invalid token |
| `A0211` | Token expired |
| `-500` | Parameter error |
| `-10001` / `-10002` | Service error / invalid params |
| `-60002` | Unsupported file format |
| `-60003` / `-60004` | File read failed / empty file |
| `-60005` | File too large (> 200 MB) |
| `-60006` | Too many pages (> 200) |
| `-60008` | File read timeout (URL unreachable) |
| `-60010` | Parse failed |
| `-60015` / `-60016` | File / format conversion failed |
| `-60018` | Daily quota reached |
| `-60022` | Web page read failed (rate-limited) |
| **Agent API** | |
| `-30001` | Exceeds Agent 10 MB limit → use Standard API |
| `-30002` | Unsupported file type for Agent |
| `-30003` | Exceeds Agent 20-page limit → use Standard API or `--pages` |
| `-30004` | Invalid request parameters |
|
||||
193
skills/developing/mineru/references/comparison.md
Normal file
@ -0,0 +1,193 @@
|
||||
<!-- Web-researched competitive comparison (45 tools, 6 categories, adversarially fact-checked). Last researched 2026-05-31. Star counts / versions are point-in-time. -->
|
||||
|
||||
# MinerU Skill — Competitive Comparison Reference
|
||||
|
||||
This document gives an honest, sourced, per-tool breakdown of how **MinerU Skill** compares to the document-parsing landscape. Read the framing first: it determines how to interpret every "we win / they win" below.
|
||||
|
||||
## What MinerU Skill actually is (and is not)
|
||||
|
||||
MinerU Skill is a **zero-config, zero-dependency, agent-native convenience layer over [MinerU](https://github.com/opendatalab/MinerU)'s cloud API**, plus 17 turnkey delivery integrations to note/knowledge/content tools. Concretely (verified in this repo):
|
||||
|
||||
- Core script `scripts/mineru.py` is **~54KB / ~1,350 lines of pure Python standard library** — no `requests`/`aiohttp`, no model weights.
|
||||
- A **genuinely token-free** default: the free **Agent API** path (`agent_parse` → `_agent_poll`) sends **no `Authorization` header** (the Bearer header is set only when a token is present). Files ≤10MB / ≤20 pages.
|
||||
- **Auto-routing**: with a token, large/batched/extra-format jobs use the **Standard API** (≤200MB / ≤200 pages); the Agent path **auto-escalates** to Standard on size/page limits.
|
||||
- **17 delivery sinks** (16 sink modules, one of which, `local.py`, registers both `obsidian` and `logseq`): obsidian, logseq, siyuan, notion, confluence, onenote, coda, yuque, feishu, slack, dingtalk, wecom, ticktick, linear, airtable (all zero-dependency), plus **roam** (needs `roam-client`) and **wps** (needs `html-for-docx`), each of which lazy-loads one library only when used.
|
||||
- `--resume` dedup, parallel `--workers` (ThreadPoolExecutor), `--stdout`/`--json` agent output.
|
||||
|
||||
**Critical dependency:** our accuracy is **entirely downstream of, and capped by, what MinerU's cloud serves.** We own no models. Therefore:
|
||||
|
||||
- We have **no quality edge** over any other cloud wrapper that hits the same MinerU API — OCR/table/formula output is **identical**.
|
||||
- Self-hosting the MinerU engine gives the **same or better** accuracy (version-controllable, no upload caps).
|
||||
|
||||
**Hard limits we cannot exceed:** 10MB/20-page free Agent tier, 200MB/200-page Standard tier, plus IP rate limits. Self-hosted tools have no such caps (only hardware).
|
||||
|
||||
**Our benchmark is latency-only.** `tests/test_live.py` measures end-to-end cloud round-trip latency (~13–14s for the official demo PDF). It is **not** an accuracy benchmark; we have no OmniDocBench/olmOCR-Bench numbers of our own.
|
||||
|
||||
### A note on the speed claim
|
||||
|
||||
Our ~13–14s/doc cloud round-trip is **not** a clean win over self-hosted GPU engines. A typical GPU self-host runs at ~0.18s/page (Marker) or ~2.12 pages/sec (MinerU on A100, roughly 0.47s/page), which is far faster at any real scale. We only out-run **slow local runs of small docs on CPU or Apple Silicon** (e.g., M4 VLM at 32–148s/page). Do not frame "faster wall-clock" as a general win.
|
||||
|
||||
### A note on benchmarks
|
||||
|
||||
No single benchmark is authoritative. Different benchmarks favor different tools:
|
||||
- **OmniDocBench** (v1.5/v1.6): MinerU2.5 **90.67** (v1.5), MinerU2.5-Pro **95.69** (v1.6) — leads, beating Gemini 2.5 Pro / GPT-4o / Qwen2.5-VL-72B on text/table/formula. Source: arXiv 2509.22186.
|
||||
- **olmOCR-Bench** (Ai2, Oct 2025): olmOCR-2 **82.4** > Marker **76.1** > **MinerU 75.8**. Here MinerU **trails** — this is a real olmOCR win and must stay visible.
|
||||
- **RD-TableBench**: Reducto 90.2% on complex tables — but Reducto authored this benchmark (vendor-biased).
|
||||
- Mathpix is the de-facto formula-OCR standard (BLEU/edit-distance studies), though a PaddleOCR-VL-based tool claims to beat it on OmniDocBench v1.0 formula recognition, so the very top is contested.
|
||||
|
||||
> Star counts / versions below (e.g. MinerU "65.7k / v3.2.1") are point-in-time and not independently re-verified.
|
||||
|
||||
---
|
||||
|
||||
## Category 1 — Self-hosted / open-source parsing engines
|
||||
|
||||
These are the tools that close our single biggest gap: **fully offline / air-gapped / no cloud / no upload caps.**
|
||||
|
||||
### MinerU engine (opendatalab) — the engine we wrap
|
||||
- **Source:** https://github.com/opendatalab/MinerU · arXiv 2509.22186 · https://huggingface.co/opendatalab/MinerU2.5-Pro-2604-1.2B
|
||||
- **Strengths:** Owns the SOTA models (OmniDocBench 90.67 / 95.69-Pro v1.6). 109-language OCR, handwriting, cross-page table merge, formula→LaTeX (the source of *our* LaTeX). Fully self-hostable → offline, air-gappable, zero per-page cost, no caps. Pipeline backend runs pure CPU; VLM needs 8GB+ VRAM. Native MCP, Python/Go/TS SDKs, LangChain/LlamaIndex/Dify/FastGPT.
|
||||
- **Weaknesses vs us:** Heavy install (multi-GB torch/vLLM + weights, 16GB RAM / 20GB disk floor); slow on Apple Silicon; no note/PKM delivery sinks; library/CLI rather than zero-config.
|
||||
- **Verdict:** **Beats us** on offline, privacy, caps, accuracy ceiling, ecosystem. **We beat it** only on zero-install/zero-config and built-in delivery.
|
||||
|
||||
### Marker (datalab-to / VikParuchuri)
|
||||
- **Source:** https://github.com/datalab-to/marker · https://allenai.org/blog/olmocr-2
|
||||
- **Strengths:** Fully offline; very high batch throughput (~122 pages/sec/H100, 0.18s/page GPU); broad formats incl. EPUB; optional local-LLM (Ollama) quality boost with no data leaving the machine; ~35k+ stars, active.
|
||||
- **Weaknesses:** **GPL-3.0** code + model weights under a modified RAIL-M (free only under ~$2M funding+revenue; commercial above that needs a Datalab license). olmOCR-Bench **76.1**: below olmOCR-2 (82.4) and only marginally above MinerU (75.8) on that benchmark; not directly comparable with MinerU's OmniDocBench scores.
|
||||
- **Verdict:** Beats us on offline/throughput; we beat it on zero-install and 17 delivery sinks. License gate is a real friction it has and we don't.
|
||||
|
||||
### Docling (IBM / DS4SD)
|
||||
- **Source:** https://github.com/docling-project/docling · https://huggingface.co/ibm-granite/granite-docling-258M · arXiv 2408.09869
|
||||
- **Strengths:** **Widest input modality set** (PDF/DOCX/PPTX/XLSX/HTML/AsciiDoc/LaTeX/CSV/images + **audio via ASR** + USPTO/JATS/XBRL). Tiny 258M Granite-Docling VLM runs on CPU/modest GPU. **MIT code + Apache-2.0 weights.** Deep framework ecosystem (LangChain/LlamaIndex/Haystack + official MCP), IBM-backed, 60k+ stars. Air-gapped by design.
|
||||
- **Weaknesses:** Absolute accuracy lags MinerU on OmniDocBench/olmOCR-Bench; library-first (not a zero-config CLI); targets framework ingestion, not file delivery to note tools.
|
||||
- **Verdict:** Beats us on offline, modality breadth, permissive license, ecosystem; we beat it on zero-install and note/PKM delivery. **Do not over-rank its MIT as uniquely best** — olmOCR's Apache-2.0 on *both* code and 7B weights is at least as commercially valuable.
|
||||
|
||||
### olmOCR (allenai)
|
||||
- **Source:** https://github.com/allenai/olmocr · https://allenai.org/blog/olmocr-2 · https://huggingface.co/datasets/allenai/olmOCR-bench
|
||||
- **Strengths:** **Leads Ai2's olmOCR-Bench (82.4 vs MinerU 75.8)** — a benchmark where MinerU trails. **Apache-2.0 on code AND the olmOCR-2-7B weights** (most commercial-friendly model reuse here). Built for million-page LLM-training linearization. Offline.
|
||||
- **Weaknesses:** **PDF/image only** (no Office/HTML); **English-primary**, filters non-English (MinerU does 109-lang); **requires a 12GB+ NVIDIA GPU, no CPU mode at all**.
|
||||
- **Verdict:** Beats us on offline, that-benchmark accuracy, license, scale. We beat it on modality breadth, multilingual, no-GPU, delivery, zero-install. **Keep the olmOCR-Bench lead visible — do not cherry-pick only OmniDocBench.**
|
||||
|
||||
### Nougat (facebookresearch / Meta AI)
|
||||
- **Source:** https://github.com/facebookresearch/nougat · arXiv 2308.13418
|
||||
- **Strengths:** Strong LaTeX/math on arXiv-style scientific PDFs (its trained niche). Offline.
|
||||
- **Weaknesses:** **PDF + English/Latin-script only** (no CJK); **CC-BY-NC weights (non-commercial)**; effectively **unmaintained** (last release Aug 2023); known repetition/hallucination/[MISSING_PAGE] failures off-distribution.
|
||||
- **Verdict:** Offline + niche math is its only edge; we beat it on general-purpose, multilingual, maintenance, commercial license, delivery.
|
||||
|
||||
### PyMuPDF4LLM (pymupdf / Artifex)
|
||||
- **Source:** https://github.com/pymupdf/pymupdf4llm · https://pymupdf.io/blog/pymupdf-layout-10-faster-pdf-parsing-without-gpus
|
||||
- **Strengths:** **Far faster and lighter than any ML tool on born-digital PDFs** (~hundreds of pages/sec on plain CPU; a C-optimized variant claims ~520 pages/sec). Lowest dependency/hardware footprint. Offline, no cloud, no caps. Ideal for huge clean-PDF corpora where speed > fidelity.
|
||||
- **Weaknesses:** No ML → no real formula/LaTeX, weak complex tables, poor scanned/handwritten; slow external OCR; **AGPL-3.0 OR Artifex commercial**; Office formats need paid **PyMuPDF Pro**.
|
||||
- **Verdict:** A genuine win for the speed-over-fidelity, clean-PDF use case. We beat it on hard-doc quality (MinerU's VLM), multilingual OCR, and delivery — but acknowledge its speed/footprint advantage honestly.
|
||||
|
||||
### Zerox (getomni-ai)
|
||||
- **Source:** https://github.com/getomni-ai/zerox
|
||||
- **Strengths:** Trivial provider-flexibility (OpenAI/Azure/Bedrock-Claude/Gemini/Vertex); JSON-Schema structured extraction (Node SDK); MIT code.
|
||||
- **Weaknesses:** **NOT offline and NOT token-free** — mandates a paid cloud vision-LLM key; needs graphicsmagick+ghostscript; **no published benchmarks**; per-page LLM cost can exceed MinerU on large jobs.
|
||||
- **Verdict:** We beat it on token-free start, benchmarked accuracy, dedicated formula/table models, system-dep footprint, and delivery. It beats us on provider-swap flexibility and typed JSON extraction.
|
||||
|
||||
---
|
||||
|
||||
## Category 2 — Commercial cloud document-parsing APIs
|
||||
|
||||
Mostly **stronger than us** on enterprise accuracy, SLAs, structured extraction, and RAG/MCP ecosystems. Our honest edges are narrow: token-free + zero-install hosted default, clean Markdown/LaTeX of academic PDFs, and 17 delivery sinks none of them offer.
|
||||
|
||||
### LlamaParse (LlamaIndex / LlamaCloud)
|
||||
- **Source:** https://www.llamaindex.ai/pricing · LlamaCloud MCP docs
|
||||
- **Beats us:** Official hosted **MCP server**; deep native RAG stack (parse→index→LlamaExtract/LlamaAgents); steerable NL parsing with frontier LLMs (GPT-4.1/Gemini 2.5 Pro); richer outputs (per-page JSON, XLSX, HTML tables, annotated PDF); enterprise SLAs; mature Python+TS SDKs.
|
||||
- **We beat:** Token-free start (it needs a LlamaCloud key from page one); zero runtime deps; 17 note/PKM sinks (it delivers to RAG indexes, not note tools); built-in `--resume`/parallel batch CLI.
|
||||
|
||||
### Mathpix (Convert API)
|
||||
- **Source:** https://mathpix.com/pricing/api · https://mathpix.com/image-to-latex
|
||||
- **Beats us:** **Best-in-class formula/equation OCR (printed AND handwritten) → clean LaTeX — clearly better than MinerU for pure math fidelity; concede this, do not imply parity.** Mature Snip ecosystem + Overleaf workflows; very low per-image cost at scale.
|
||||
- **We beat:** Token-free start (Mathpix API requires a paid PAYG account, **$19.99 setup fee**, card on file; **no recurring free monthly allowance** — only a one-time $29 test credit; the consumer Snip app's free quota does **not** apply to the API); general-purpose multi-modal Office parsing; 17 delivery sinks; built-in batch CLI.
|
||||
|
||||
### Unstructured.io
|
||||
- **Source:** https://unstructured.io/pricing · https://github.com/Unstructured-IO/unstructured
|
||||
- **Beats us:** **Apache-2.0 core library is fully self-hostable → 100% offline** (we cannot); official MCP + huge connector ecosystem (S3/SharePoint/vector DBs); built-in chunking+embedding (RAG-ready); 25+ file types; permissive license for product embedding.
|
||||
- **We beat:** Token-free hosted default with zero install (its hosted API needs a key; self-host means running infra); cleaner human-readable Markdown out of the box (its primary output is JSON "elements"); 17 note/PKM sinks (it targets vector DBs/storage). *On parsing quality:* VLM parsing is generally stronger for complex layout/formula, but this is **not a benchmarked head-to-head** — state it as a tendency, not a measured win.
|
||||
|
||||
### Reducto
|
||||
- **Source:** https://reducto.ai/pricing
|
||||
- **Beats us:** **Best complex/financial table extraction (90.2% RD-TableBench — vendor-authored but the strongest public evidence)**; agentic multi-pass OCR; SOC2/HIPAA, on-prem/VPC/air-gapped, enterprise SLAs; schema-based extraction with bounding boxes/citations.
|
||||
- **We beat:** Token-free start (it needs a key + credits); zero-install plain CLI; 17 delivery sinks; auto-routing/--resume/parallel batch.
|
||||
|
||||
### Chunkr (and similar RAG-native APIs)
|
||||
- **Beats us:** Self-hostable (offline option we lack); RAG-native chunking + broad export (DOCX/HTML/LaTeX).
|
||||
- **We beat:** Token-free start; zero-install; 17 note/PKM sinks.
|
||||
- **Caveat (fact-check):** Do **not** claim "stronger VLM Markdown for formulas" — Chunkr cloud uses its own proprietary models and we have **no head-to-head benchmark**. Drop the quality claim; keep only the export-breadth and offline framing.
|
||||
|
||||
---
|
||||
|
||||
## Category 3 — Other MinerU wrappers, skills & MCP servers (our direct peers)
|
||||
|
||||
**Every cloud-backed wrapper here hits the same MinerU API we do, so its OCR/table/formula output is IDENTICAL to ours.** We have **no quality edge** over them — only DX differences. Claims of "better OCR/formula/Markdown" vs these are **invalid** and must not appear.
|
||||
|
||||
### Official MinerU MCP server (mineru-open-mcp / MinerU-Ecosystem)
|
||||
- **Source:** https://github.com/opendatalab/MinerU-Ecosystem · https://pypi.org/project/mineru-open-mcp/
|
||||
- **Beats us:** **Official, first-party** — tracks API/format changes day-one; native **MCP server** (stdio + streamable-http) in Claude Desktop/Cursor/Windsurf with zero glue; full ecosystem (Python/Go/TS SDKs, LangChain/LlamaIndex/Dify/FastGPT). **Same free no-token Flash tier as us** — our "free zero-token" edge is fully matched by the first party.
|
||||
- **We beat:** Zero runtime deps (vs pip/uvx install); auto-routing Agent⇄Standard with auto-escalation; 17 delivery sinks; `--resume`/parallel batch; usable as a plain CLI outside any MCP host.
|
||||
|
||||
### MinerU-Document-Explorer (official, opendatalab)
|
||||
- **Source:** https://github.com/opendatalab/MinerU-Document-Explorer
|
||||
- **Beats us:** Different, **larger** value prop — a local agent-native **knowledge engine** (BM25/vector/hybrid retrieval + deep-reading + LLM-wiki) with 15 MCP tools; runs 100% locally for its core; MIT, 568 stars.
|
||||
- **We beat:** We're a focused zero-dep converter; broader conversion modalities; 17 delivery sinks (it keeps content in its own index/wiki); no Node/local-model download.
|
||||
|
||||
### linxule/mineru-mcp (Node, cloud)
|
||||
- **Source:** https://github.com/linxule/mineru-mcp
|
||||
- **Beats us:** Native MCP server with 6 granular tools (explicit status-polling + batch-status pagination); first-class for Node/JS MCP stacks; batch up to 200 URLs/request.
|
||||
- **We beat:** **Free no-token path** (it **requires** a token always); zero runtime deps (vs Node 18+); broader modalities (Excel/HTML); 17 delivery sinks; usable as plain CLI outside MCP.
|
||||
|
||||
### mineru-converter-mcp-server (AvatarGanymede/MinerU-MCP)
|
||||
- **Source:** https://pypi.org/project/mineru-converter-mcp-server/
|
||||
- **Beats us:** **Auto-splits PDFs >200MB and segments >600-page docs by page range — gracefully exceeding the 200MB/200-page cap we are bound by.** Turnkey Smithery + Render deploy (per-user key); explicit HTML input.
|
||||
- **We beat:** Free no-token default (it requires a key); zero runtime deps; plain CLI (no MCP host/Render/Smithery needed); 17 sinks; auto-routing.
|
||||
|
||||
### grimoire-skill (LeoLin990405)
|
||||
- **Source:** https://github.com/LeoLin990405/grimoire-skill
|
||||
- **Beats us:** Higher-level knowledge-capture ("parse once, share twice" → Obsidian notes + reusable skill packs); ingests **video** (YouTube/Bilibili) + subtitles (modalities we don't touch); cross-agent skill management; content-aware Obsidian auto-filing.
|
||||
- **We beat:** Free no-token default (it needs a token + `--cloud-ok` for local files); zero runtime deps (vs bash+jq+awk + optional yt-dlp/ffmpeg); 17 sinks vs primarily Obsidian; broader Office/HTML; cross-platform single-file portability.
|
||||
|
||||
### kesslerio/mineru-pdf-parser (openclaw/ClawHub skill, local CPU)
|
||||
- **Source:** openclaw/skills · SKILL.md
|
||||
- **Beats us:** **Fully local/offline (pure CPU, cross-platform)** — no cloud/token/caps; handles privacy-sensitive docs; native Markdown + JSON.
|
||||
- **We beat:** Zero install (it needs a full local MinerU install + weights + shell wrapper); no GPU/heavy runtime; faster wall-clock **only vs slow local CPU**; broader modalities; 17 sinks; `--stdout`/`--json`; better docs.
|
||||
|
||||
### nilecui/mineru-parser-skills (Claude Agent SDK, cloud)
|
||||
- **Source:** https://github.com/nilecui/mineru-parser-skills
|
||||
- **Beats us:** Built directly on the Claude Agent SDK (slots into Agent-SDK apps). Honestly little else — it's a thinner cloud wrapper.
|
||||
- **We beat:** Accepts local files/dirs **and** URLs (it is **URL-only** — cannot parse a local PDF); free no-token default; zero runtime deps; batch/`--resume`/parallel; 17 sinks; broader modalities; mature/documented vs a 4-commit, no-license repo. *Caveat:* our "benchmarked" claim means **latency-measured**, not accuracy-benchmarked.
|
||||
|
||||
### TINKPA/mcp-mineru (local MLX, Apple Silicon)
|
||||
- **Source:** https://github.com/TINKPA/mcp-mineru
|
||||
- **Beats us:** **Fully offline/local** via MinerU running on-device (MLX accel); no cloud/token/caps; data never leaves the Mac.
|
||||
- **We beat:** Zero install/no weights/no GPU; **faster wall-clock only for typical multi-page docs vs its slow local inference (32–148s/page on M4)** — not a general speed win; broader modalities; batch/`--resume`/17 sinks; more active/documented; usable as plain CLI.
|
||||
|
||||
---
|
||||
|
||||
## Summary of mandatory concessions (do not bury these)
|
||||
|
||||
1. **Offline / air-gapped is our single biggest gap.** MinerU engine, Marker, Docling, olmOCR, Nougat, PyMuPDF4LLM, TINKPA, kesslerio, MinerU-Document-Explorer, and self-hostable Unstructured/Chunkr all run with **zero cloud dependency**. We are cloud-only and **cannot handle confidential/regulated/air-gapped content at all.**
|
||||
2. **Data privacy:** every self-hosted competitor keeps documents on the machine; we **upload every file** to MinerU's cloud — a hard disqualifier for many regulated users.
|
||||
3. **Accuracy is downstream of, and capped by, MinerU's cloud.** Self-hosting MinerU2.5-Pro gives the same-or-better accuracy with no caps. Same-backend wrappers yield **identical** quality to us.
|
||||
4. **Hard caps:** 10MB/20-page (Agent), 200MB/200-page (Standard), IP rate limits. mineru-converter exceeds them via auto-split/segmentation.
|
||||
5. **Mathpix beats us on formula/LaTeX OCR (incl. handwriting).**
|
||||
6. **Reducto leads complex/financial tables; olmOCR leads olmOCR-Bench (82.4 vs MinerU 75.8).** Different benchmarks favor different tools — never cherry-pick only OmniDocBench.
|
||||
7. **Official first-party advantage:** the official MinerU MCP/Document-Explorer + ecosystem track changes day-one and match our free tier; we are third-party, can lag, and ship **no MCP server**.
|
||||
8. **Permissive-license wins we lack:** olmOCR (Apache-2.0 code + 7B weights), Docling (MIT + Apache-2.0 weights), Unstructured (Apache-2.0 core).
|
||||
9. **PyMuPDF4LLM is far faster/lighter on born-digital PDFs** (clean-text corpora, speed > fidelity).

## Sources

- MinerU engine: https://github.com/opendatalab/MinerU · arXiv 2509.22186 · https://huggingface.co/opendatalab/MinerU2.5-Pro-2604-1.2B · https://neurohive.io/en/state-of-the-art/mineru2-5-open-source-1-2b-model-for-pdf-parsing-outperforms-gemini-2-5-pro-on-benchmarks/
- Official MCP / ecosystem: https://github.com/opendatalab/MinerU-Ecosystem · https://pypi.org/project/mineru-open-mcp/ · https://github.com/opendatalab/MinerU-Document-Explorer
- Marker: https://github.com/datalab-to/marker · https://allenai.org/blog/olmocr-2
- Docling: https://github.com/docling-project/docling · arXiv 2408.09869 · https://huggingface.co/ibm-granite/granite-docling-258M
- olmOCR: https://github.com/allenai/olmocr · https://allenai.org/blog/olmocr-2 · https://huggingface.co/datasets/allenai/olmOCR-bench
- Nougat: https://github.com/facebookresearch/nougat · arXiv 2308.13418
- PyMuPDF4LLM: https://github.com/pymupdf/pymupdf4llm · https://pymupdf.io/blog/pymupdf-layout-10-faster-pdf-parsing-without-gpus
- Zerox: https://github.com/getomni-ai/zerox
- LlamaParse: https://www.llamaindex.ai/pricing
- Mathpix: https://mathpix.com/pricing/api · https://mathpix.com/image-to-latex
- Unstructured: https://unstructured.io/pricing · https://github.com/Unstructured-IO/unstructured
- Reducto: https://reducto.ai/pricing
- Other wrappers: https://github.com/linxule/mineru-mcp · https://pypi.org/project/mineru-converter-mcp-server/ · https://github.com/LeoLin990405/grimoire-skill · https://github.com/nilecui/mineru-parser-skills · https://github.com/TINKPA/mcp-mineru
59
skills/developing/mineru/references/integrations.md
Normal file
@ -0,0 +1,59 @@
# Delivery Integrations (`--to`)

After parsing, MinerU Skill can deliver the Markdown straight into your content
tools using each tool's **official ingestion path** — no fragile generic block
converters. Targets are pluggable sinks; select one or more with `--to NAME`
(repeatable). List them live with `python3 scripts/mineru.py --list-sinks`.

```bash
# Parse and fan out to several destinations at once
python3 scripts/mineru.py paper.pdf --to obsidian --to notion --to slack
```

Each sink reads its configuration from **environment variables** so an AI agent
can run it non-interactively. Delivery results appear in `--json` output under
each result's `sinks` array.
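
Because every sink is configured through environment variables, an agent can check for missing settings before it runs the CLI. The following is a minimal sketch of such a preflight check, not the skill's own code: `REQUIRED_ENV` and `missing_env` are hypothetical names, and the variable lists cover only three sinks, copied from the required (non-`?`) entries of the support matrix in this document.

```python
import os

# Required variables per sink, copied from the support matrix (optional `?` entries omitted).
# This mapping is illustrative; the skill keeps its own requirements inside each sink.
REQUIRED_ENV = {
    "obsidian": ("OBSIDIAN_VAULT",),
    "notion": ("NOTION_API_KEY", "NOTION_PARENT_PAGE_ID"),
    "slack": ("SLACK_BOT_TOKEN", "SLACK_CHANNEL"),
}


def missing_env(sink, env=None):
    """Return the required variables for `sink` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV.get(sink, ()) if not env.get(name)]
```

Passing a plain dict as `env` keeps the check easy to test; with no argument it reads the real process environment.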

## Support matrix

| Target | `--to` | Native path | Auth / config (env) | Markdown fidelity | Images |
|--------|--------|-------------|---------------------|-------------------|--------|
| **Obsidian** | `obsidian` (`ob`) | filesystem write + YAML frontmatter | `OBSIDIAN_VAULT`, `OBSIDIAN_SUBDIR?` | full | ✅ copied to `<note>.assets/` |
| **Logseq** | `logseq` | filesystem write, outline + `key:: value` | `LOGSEQ_GRAPH` | full (outline transform) | ✅ copied to `assets/` |
| **SiYuan** | `siyuan` | kernel `createDocWithMd` | `SIYUAN_TOKEN`, `SIYUAN_API_URL?`, `SIYUAN_NOTEBOOK?` | full (GFM) | ✅ `asset/upload` |
| **Notion** | `notion` | `POST /v1/pages` (blocks) | `NOTION_API_KEY`, `NOTION_PARENT_PAGE_ID`, `NOTION_VERSION?` | structure (headings/lists/code/quote) | ⚠️ text only¹ |
| **Linear** | `linear` | GraphQL `issueCreate` | `LINEAR_API_KEY`, `LINEAR_TEAM_ID` | full (Markdown-native) | ✅ base64-inlined |
| **Yuque 语雀** | `yuque` (`语雀`) | open API create doc | `YUQUE_TOKEN`, `YUQUE_NAMESPACE` | full (Markdown-native) | ⚠️ host publicly² |
| **Coda** | `coda` | page canvas `format:markdown` | `CODA_API_TOKEN`, `CODA_DOC_ID?` | full (Markdown-native) | ⚠️ public URL² |
| **Slack** | `slack` | external-upload `.md` file | `SLACK_BOT_TOKEN`, `SLACK_CHANNEL` | full (raw file) | ⚠️ not embedded |
| **Lark 飞书** | `feishu` (`lark`, `飞书`) | Drive `import_tasks` → Docx | `FEISHU_APP_ID`, `FEISHU_APP_SECRET`, `FEISHU_FOLDER_TOKEN?` | full (server-converted) | ⚠️ public URL² |
| **Confluence** | `confluence` | `POST /wiki/api/v2/pages` (storage) | `CONFLUENCE_BASE_URL`, `CONFLUENCE_EMAIL`, `CONFLUENCE_API_TOKEN`, `CONFLUENCE_SPACE_ID` | MD→HTML | ⚠️ not attached |
| **OneNote** | `onenote` | Graph `sections/{id}/pages` | `ONENOTE_TOKEN`³, `ONENOTE_SECTION_ID` | MD→HTML | ⚠️ remote only |
| **TickTick 滴答** | `ticktick` (`dida`, `滴答清单`) | `POST /open/v1/task` | `TICKTICK_TOKEN`, `TICKTICK_PROJECT_ID?` | task note | ❌ unsupported |
| **DingTalk 钉钉** | `dingtalk` (`钉钉`) | robot markdown webhook | `DINGTALK_WEBHOOK`, `DINGTALK_SECRET?` | markdown message | ⚠️ public URL only |
| **Airtable** | `airtable` | `POST /v0/{base}/{table}` record | `AIRTABLE_API_KEY`, `AIRTABLE_BASE_ID`, `AIRTABLE_TABLE`, `AIRTABLE_TITLE_FIELD?`, `AIRTABLE_BODY_FIELD?` | record field⁴ | ❌ not uploaded |
| **WeCom 企业微信** | `wecom` (`企业微信`) | app `message/send` markdown | `WECOM_CORPID`, `WECOM_CORPSECRET`, `WECOM_AGENTID`, `WECOM_TOUSER?` | message (subset, ≤2 KB)⁵ | ❌ unsupported |
| **Roam Research** ⁶ | `roam` | `batch-actions` block tree | `ROAM_API_TOKEN`, `ROAM_GRAPH_NAME` | full (Markdown→outline) | ⚠️ public URL |
| **WPS 金山文档** ⁶ | `wps` (`kdocs`, `金山`) | Markdown→DOCX → kdocs upload | `WPS_APP_ID`, `WPS_APP_SECRET`, `WPS_PARENT_PATH?` | DOCX (via html-for-docx) | embedded in DOCX |

Notes:
1. **Notion** images need a separate `file_uploads` upload-then-reference dance; v1 delivers text + structure and notes the count of un-embedded local images. (Roadmap: image upload.)
2. Hosted services that ingest Markdown by value but have no first-class CLI asset upload — local images must be hosted at a public URL to render. The Markdown is delivered intact; image links that are already URLs work.
3. **OneNote** `ONENOTE_TOKEN` is a Microsoft Graph access token (delegated, scope `Notes.Create`). Obtain it via the device-code OAuth flow; the sink itself stays non-interactive.
4. **Airtable** is a database, not a document store — the doc is stored as one record (title + body fields). A good "save this doc as a row" target, not a document publisher.
5. **WeCom** markdown messages are a limited subset (≤2048 bytes, no images/tables, not rendered in the workbench). Best as a notification/summary; for a full document deliver via Lark/Notion and send the link.
6. **Optional-dependency sinks** — these two rely on a third-party library that the sink lazy-imports only when used, so the core and the other 15 sinks stay zero-dependency. If the library is absent, the sink returns a clear `pip install …` hint. They are implemented to the official specs but, being credential/desktop-gated, are best-effort until validated against live accounts.
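
The 2048-byte WeCom limit in note 5 is a byte limit, not a character limit, so CJK text reaches it roughly three times sooner than ASCII. A minimal sketch of byte-safe truncation follows; `truncate_utf8` is a hypothetical helper written for this illustration, and the WeCom sink's actual trimming strategy is not shown in this document.

```python
def truncate_utf8(text: str, max_bytes: int = 2048, ellipsis: str = "...") -> str:
    """Trim `text` so its UTF-8 encoding fits in `max_bytes`, never splitting a character.

    Hypothetical helper for illustration; the 2048 default is the limit quoted in note 5.
    """
    data = text.encode("utf-8")
    if len(data) <= max_bytes:
        return text
    budget = max_bytes - len(ellipsis.encode("utf-8"))
    if budget <= 0:
        raise ValueError("max_bytes is too small to hold the ellipsis")
    # errors="ignore" drops a trailing partial multi-byte sequence left by the byte slice.
    return data[:budget].decode("utf-8", errors="ignore") + ellipsis
```

Slicing the encoded bytes and decoding with `errors="ignore"` avoids emitting half of a multi-byte character at the cut point.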

## Optional-dependency sinks (`[roam]`, `[wps]`)

```bash
pip install "mineru-skill[wps]"   # html-for-docx (Markdown → DOCX)
pip install "mineru-skill[roam]"  # official roam-client SDK (git, needs Python ≥3.11)
# roam-client is git-only; equivalently:
pip install "roam-client @ git+https://github.com/Roam-Research/backend-sdks.git#subdirectory=python"
```

- **Roam** — no library ingests Markdown into Roam, but the official `roam-client` SDK handles the genuinely error-prone transport (307/308 peer-host redirect, dual `Authorization`/`x-authorization` Bearer headers, `/write`). We depend on it for transport and build only the Markdown→outline tree, delivering the whole document in one `batch-actions` request. Images must be public URLs.
- **WPS / 金山文档** — Markdown→DOCX uses the maintained pure-pip `html-for-docx` (reusing this project's Markdown→HTML); the kdocs upload signs requests with the documented WPS-2 scheme (plain SHA-1) using only the standard library. Requires an approved kdocs developer app + provisioned appspace.
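
The lazy-import behaviour described for these two sinks can be sketched as below. This is an assumption-laden illustration rather than the sinks' real code: `lazy_import` is a hypothetical helper, and the hint text simply reuses the `pip install "mineru-skill[...]"` form shown in the commands above.

```python
import importlib


def lazy_import(module: str, extra: str):
    """Import `module` only when called; return (module, None) or (None, install_hint).

    Hypothetical helper for illustration; the real sinks may structure this differently.
    """
    try:
        return importlib.import_module(module), None
    except ImportError:
        return None, f'pip install "mineru-skill[{extra}]"'
```

Deferring the import to call time is what lets the core and the remaining sinks load without the optional library installed.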

Adding more targets is a single small module — see `scripts/sinks/base.py`. PRs welcome.
1
skills/developing/mineru/scripts/__init__.py
Normal file
@ -0,0 +1 @@
"""Importable package for MinerU Skill console entry points."""
88
skills/developing/mineru/scripts/chunking.py
Normal file
@ -0,0 +1,88 @@
"""Heading-aware Markdown chunking for RAG pipelines (zero-dependency).

``chunk_markdown`` splits a parsed Markdown document into retrieval-sized chunks
that preserve heading context — matching the RAG-friendliness of LlamaParse /
Unstructured without any dependency.
"""

from __future__ import annotations

import re

_HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def _slug(text: str) -> str:
    text = (text or "doc").strip().lower()
    text = re.sub(r"[^a-z0-9]+", "-", text).strip("-")
    return text or "doc"


def _split_by_size(text: str, max_chars: int) -> list:
    """Split text into <= max_chars pieces on paragraph boundaries (hard-split if needed)."""
    if len(text) <= max_chars:
        return [text]
    pieces: list = []
    current = ""
    for para in text.split("\n\n"):
        if len(para) > max_chars:
            if current:
                pieces.append(current)
                current = ""
            for i in range(0, len(para), max_chars):
                pieces.append(para[i:i + max_chars])
        elif not current:
            current = para
        elif len(current) + len(para) + 2 <= max_chars:
            current = f"{current}\n\n{para}"
        else:
            pieces.append(current)
            current = para
    if current:
        pieces.append(current)
    return pieces


def chunk_markdown(markdown: str, *, max_chars: int = 2000, source: str = "") -> list:
    """Chunk Markdown by heading, size-splitting long sections.

    Returns ``[{id, index, heading, text, chars, source}, ...]`` where ``heading``
    is the ``H1 > H2 > H3`` breadcrumb for the chunk.
    """
    lines = markdown.replace("\r\n", "\n").split("\n")
    chunks: list = []
    stack: list = []  # (level, text) heading breadcrumb
    buf: list = []
    base = _slug(source)

    def breadcrumb() -> str:
        return " > ".join(t for _, t in stack)

    def flush():
        text = "\n".join(buf).strip()
        buf.clear()
        if not text:
            return
        head = breadcrumb()
        for piece in _split_by_size(text, max_chars):
            idx = len(chunks)
            chunks.append({
                "id": f"{base}-{idx}",
                "index": idx,
                "heading": head,
                "text": piece,
                "chars": len(piece),
                "source": source,
            })

    for line in lines:
        match = _HEADING.match(line.strip())
        if match:
            flush()  # close the previous section under its own breadcrumb
            level = len(match.group(1))
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, match.group(2)))
        buf.append(line)
    flush()
    return chunks
59
skills/developing/mineru/scripts/local_engine.py
Normal file
@ -0,0 +1,59 @@
"""Optional fully-offline parsing backend for born-digital PDFs.

Our single biggest honest gap is being cloud-only. ``--engine local`` parses a
PDF **entirely offline** with the optional, lightweight ``pymupdf4llm`` library
(no GPU, no cloud, no upload caps) — ideal for confidential or born-digital PDFs
where MinerU's cloud VLM is overkill. Scanned/complex docs still want the cloud
engine, so ``--engine auto`` only uses local when the PDF has real text.

    pip install "mineru-skill[local]"   # i.e. pip install pymupdf4llm
"""

from __future__ import annotations

from pathlib import Path

_HINT = (
    "--engine local needs pymupdf4llm — pip install 'mineru-skill[local]' "
    "(i.e. pip install pymupdf4llm)"
)


class LocalEngineError(Exception):
    """Raised when local parsing is requested but cannot be performed."""


def available() -> bool:
    try:
        import pymupdf4llm  # noqa: F401
        return True
    except ImportError:
        return False


def is_born_digital(path, min_chars: int = 200) -> bool:
    """True if the PDF has extractable text (so local parsing is appropriate)."""
    try:
        import pymupdf
    except ImportError:
        return False
    # The context manager closes the document on every exit path, including the early return.
    with pymupdf.open(str(path)) as doc:
        total = 0
        for page in doc:
            total += len(page.get_text().strip())
            if total >= min_chars:
                return True
    return total >= min_chars


def parse_local(path, output_dir=None) -> str:
    """Parse a PDF to Markdown fully offline. Returns the Markdown string."""
    try:
        import pymupdf4llm
    except ImportError as exc:
        raise LocalEngineError(_HINT) from exc
    if output_dir is not None:
        images = Path(output_dir) / "images"
        images.mkdir(parents=True, exist_ok=True)
        return pymupdf4llm.to_markdown(str(path), write_images=True, image_path=str(images))
    return pymupdf4llm.to_markdown(str(path))
1996
skills/developing/mineru/scripts/mineru.py
Normal file
File diff suppressed because it is too large
178
skills/developing/mineru/scripts/mineru_mcp.py
Normal file
@ -0,0 +1,178 @@
#!/usr/bin/env python3
"""Zero-dependency MCP server (stdio) for MinerU Skill.

Speaks newline-delimited JSON-RPC 2.0 over stdin/stdout using only the standard
library, so an MCP host (Claude, Cursor, Windsurf, ...) can call MinerU. Register:

    {"command": "python3", "args": ["scripts/mineru_mcp.py"]}

Tools: ``mineru_parse``, ``mineru_parse_to``, ``mineru_list_sinks``.
"""

from __future__ import annotations

import json
import os
import sys
from pathlib import Path

sys.path.insert(0, str(Path(__file__).resolve().parent))
import mineru  # noqa: E402

PROTOCOL_VERSION = "2024-11-05"
SERVER_INFO = {"name": "mineru", "version": mineru.__version__}

TOOLS = [
    {
        "name": "mineru_parse",
        "description": "Parse a PDF / Office / image file or URL into clean Markdown via MinerU.",
        "inputSchema": {
            "type": "object",
            "properties": {
                "input": {"type": "string", "description": "Local file path or http(s) URL"},
                "output_dir": {"type": "string", "description": "Where to write output (default ./output)"},
                "api": {"type": "string", "enum": ["auto", "agent", "standard"]},
                "engine": {"type": "string", "enum": ["cloud", "local", "auto"]},
                "ocr": {"type": "boolean"},
                "lang": {"type": "string"},
            },
            "required": ["input"],
        },
    },
    {
        "name": "mineru_parse_to",
        "description": "Parse a document and deliver the Markdown into content tools (Obsidian, Notion, Feishu, ...).",
        "inputSchema": {
            "type": "object",
            "properties": {
                "input": {"type": "string"},
                "sinks": {"type": "array", "items": {"type": "string"}, "description": "Sink names, e.g. ['obsidian','notion']"},
                "output_dir": {"type": "string"},
            },
            "required": ["input", "sinks"],
        },
    },
    {
        "name": "mineru_list_sinks",
        "description": "List available delivery targets and their required environment variables.",
        "inputSchema": {"type": "object", "properties": {}},
    },
]


class MethodNotFound(Exception):
    pass


def _text_result(text: str, is_error: bool = False) -> dict:
    return {"content": [{"type": "text", "text": text}], "isError": is_error}


def _tool_parse(args: dict) -> dict:
    opts = mineru.ParseOptions(is_ocr=bool(args.get("ocr")), language=args.get("lang", "ch"))
    token = os.environ.get("MINERU_TOKEN")
    output_dir = Path(args.get("output_dir") or "./output")
    res = mineru.process_one(
        args["input"], opts, token=token, output_dir=output_dir,
        api=args.get("api", "auto"), engine=args.get("engine", "cloud"),
    )
    if res.state == "done":
        return _text_result(res.markdown or "")
    return _text_result(f"Parse failed: {res.error}", is_error=True)


def _tool_parse_to(args: dict) -> dict:
    opts = mineru.ParseOptions()
    token = os.environ.get("MINERU_TOKEN")
    output_dir = Path(args.get("output_dir") or "./output")
    res = mineru.process_one(args["input"], opts, token=token, output_dir=output_dir)
    if res.state != "done":
        return _text_result(f"Parse failed: {res.error}", is_error=True)
    sinks = mineru._load_sinks()
    if sinks is None:
        return _text_result("Sinks package unavailable.", is_error=True)
    doc = sinks.ParsedDoc(title=res.name, markdown=res.markdown, source=res.source,
                          modality=res.modality, markdown_path=res.markdown_path)
    outcomes = [o.to_status() for o in sinks.deliver_all(doc, args["sinks"])]
    any_fail = any(not o["ok"] for o in outcomes)
    return _text_result(json.dumps({"name": res.name, "deliveries": outcomes}, ensure_ascii=False, indent=2),
                        is_error=any_fail)


def _tool_list_sinks(_args: dict) -> dict:
    sinks = mineru._load_sinks()
    if sinks is None:
        return _text_result("Sinks package unavailable.", is_error=True)
    listing = [{"name": n, "label": sinks.get_sink(n).label, "requires": list(sinks.get_sink(n).requires)}
               for n in sinks.sink_names()]
    return _text_result(json.dumps(listing, ensure_ascii=False, indent=2))


_TOOL_HANDLERS = {
    "mineru_parse": _tool_parse,
    "mineru_parse_to": _tool_parse_to,
    "mineru_list_sinks": _tool_list_sinks,
}


def _route(method: str, params: dict):
    if method == "initialize":
        return {"protocolVersion": PROTOCOL_VERSION, "capabilities": {"tools": {}}, "serverInfo": SERVER_INFO}
    if method == "tools/list":
        return {"tools": TOOLS}
    if method == "tools/call":
        name = params.get("name")
        handler = _TOOL_HANDLERS.get(name)
        if handler is None:
            return _text_result(f"Unknown tool: {name}", is_error=True)
        try:
            return handler(params.get("arguments") or {})
        except Exception as exc:  # noqa: BLE001 - report as a tool error, never crash the server
            return _text_result(f"{type(exc).__name__}: {exc}", is_error=True)
    raise MethodNotFound(method)


def dispatch(request: dict):
    """Handle one JSON-RPC request dict; return a response dict, or None for notifications."""
    is_notification = "id" not in request
    req_id = request.get("id")
    try:
        result = _route(request.get("method"), request.get("params") or {})
    except MethodNotFound as exc:
        if is_notification:
            return None
        return {"jsonrpc": "2.0", "id": req_id, "error": {"code": -32601, "message": f"Method not found: {exc}"}}
    except Exception as exc:  # noqa: BLE001
        if is_notification:
            return None
        return {"jsonrpc": "2.0", "id": req_id, "error": {"code": -32603, "message": str(exc)}}
    if is_notification:
        return None
    return {"jsonrpc": "2.0", "id": req_id, "result": result}


def serve(stdin=None, stdout=None) -> None:
    """Read newline-delimited JSON-RPC from stdin, write responses to stdout."""
    stdin = stdin or sys.stdin
    stdout = stdout or sys.stdout
    for line in stdin:
        line = line.strip()
        if not line:
            continue
        try:
            request = json.loads(line)
        except ValueError:
            continue
        if not isinstance(request, dict):
            # dispatch() calls request.get(); skip arrays/scalars instead of crashing the server.
            continue
        response = dispatch(request)
        if response is not None:
            stdout.write(json.dumps(response, ensure_ascii=False) + "\n")
            stdout.flush()


def main() -> int:
    serve()
    return 0


if __name__ == "__main__":
    sys.exit(main())
75
skills/developing/mineru/scripts/sinks/__init__.py
Normal file
@ -0,0 +1,75 @@
"""Pluggable delivery sinks for parsed Markdown.

Each submodule registers one or more :class:`Sink` implementations that deliver a
:class:`ParsedDoc` into a content tool using that tool's official ingestion path.
Importing this package populates the registry; a sink module that fails to import
is recorded in :data:`IMPORT_ERRORS` rather than breaking the others.
"""

from __future__ import annotations

import importlib
import sys

from .base import (  # noqa: F401
    ParsedDoc,
    Sink,
    SinkError,
    SinkResult,
    get_sink,
    sink_names,
    REGISTRY,
)

# Sink modules to load. Order is cosmetic.
_MODULES = [
    "local",  # obsidian, logseq (filesystem)
    "siyuan",
    "notion",
    "linear",
    "yuque",
    "coda",
    "ticktick",
    "dingtalk",
    "airtable",
    "wecom",
    "slack",
    "feishu",
    "confluence",
    "onenote",
    "roam",  # optional dependency (roam-client)
    "wps",  # optional dependency (html-for-docx)
]

IMPORT_ERRORS: dict = {}

for _name in _MODULES:
    try:
        importlib.import_module(f"{__name__}.{_name}")
    except Exception as exc:  # noqa: BLE001 - a bad sink shouldn't break the rest
        IMPORT_ERRORS[_name] = f"{type(exc).__name__}: {exc}"
        print(f"[sinks] failed to load {_name}: {exc}", file=sys.stderr)


def deliver_all(doc: ParsedDoc, names) -> list:
    """Deliver ``doc`` to each named sink, returning a list of :class:`SinkResult`."""
    results = []
    for name in names:
        sink = get_sink(name)
        if sink is None:
            results.append(SinkResult(sink=name, ok=False, error=f"unknown sink '{name}'"))
            continue
        missing = sink.missing_config()
        if missing:
            results.append(SinkResult(
                sink=sink.name, ok=False,
                error=f"missing config: {', '.join(missing)}",
            ))
            continue
        try:
            results.append(sink.deliver(doc))
        except SinkError as exc:
            results.append(SinkResult(sink=sink.name, ok=False, error=str(exc)))
        except Exception as exc:  # noqa: BLE001 - surface but never crash the run
            results.append(SinkResult(sink=sink.name, ok=False, error=f"{type(exc).__name__}: {exc}"))
    return results
72
skills/developing/mineru/scripts/sinks/_http.py
Normal file
@ -0,0 +1,72 @@
"""Zero-dependency HTTP helpers shared by all sinks (stdlib urllib only).

``http_request`` is the single seam tests monkeypatch.
"""

from __future__ import annotations

import json
import mimetypes
import urllib.error
import urllib.request
from typing import Optional

USER_AGENT = "MinerU-Skill-sink/1.0"


def http_request(method, url, *, headers=None, data=None, timeout=60):
    """Perform one HTTP request. Returns ``(status_code, body_bytes)``."""
    req = urllib.request.Request(url, data=data, method=method, headers=headers or {})
    req.add_header("User-Agent", USER_AGENT)
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.getcode(), resp.read()
    except urllib.error.HTTPError as exc:
        body = exc.read() if hasattr(exc, "read") else b""
        return exc.code, body


def request_json(method, url, *, headers=None, payload=None, timeout=60):
    """JSON request helper. Returns ``(status_code, parsed_json_or_empty_dict)``."""
    hdrs = dict(headers or {})
    body = None
    if payload is not None:
        hdrs.setdefault("Content-Type", "application/json")
        body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    status, raw = http_request(method, url, headers=hdrs, data=body, timeout=timeout)
    parsed: dict = {}
    if raw:
        try:
            parsed = json.loads(raw.decode("utf-8"))
        except (ValueError, UnicodeDecodeError):
            parsed = {}
    return status, parsed


def encode_multipart(fields=None, files=None):
    """Build a ``multipart/form-data`` body with stdlib only.

    ``fields``: dict of str -> str. ``files``: list of (field_name, filename, bytes).
    Returns ``(content_type, body_bytes)``.
    """
    boundary = "----MinerUSinkBoundary7MA4YWxkTrZu0gW"
    crlf = b"\r\n"
    parts = []
    for name, value in (fields or {}).items():
        parts.append(b"--" + boundary.encode())
        parts.append(f'Content-Disposition: form-data; name="{name}"'.encode())
        parts.append(b"")
        parts.append(str(value).encode("utf-8"))
    for field_name, filename, content in files or []:
        ctype = mimetypes.guess_type(filename)[0] or "application/octet-stream"
        parts.append(b"--" + boundary.encode())
        parts.append(
            f'Content-Disposition: form-data; name="{field_name}"; filename="{filename}"'.encode()
        )
        parts.append(f"Content-Type: {ctype}".encode())
        parts.append(b"")
        parts.append(content)
    parts.append(b"--" + boundary.encode() + b"--")
    parts.append(b"")
    body = crlf.join(parts)
    return f"multipart/form-data; boundary={boundary}", body
244
skills/developing/mineru/scripts/sinks/_md.py
Normal file
@ -0,0 +1,244 @@
|
||||
"""Small, dependency-free Markdown utilities used by sinks.
|
||||
|
||||
These are intentionally pragmatic, not a full CommonMark implementation: they
|
||||
cover the constructs MinerU emits (headings, emphasis, code, lists, tables,
|
||||
blockquotes, links, images) well enough to deliver faithful content to tools
|
||||
that require HTML (Confluence, OneNote) or an outline (Logseq).
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import html
|
||||
import re
|
||||
from pathlib import Path
|
||||
from typing import Optional
|
||||
|
||||
_IMAGE_RE = re.compile(r"!\[(?P<alt>[^\]]*)\]\((?P<ref>[^)\s]+)(?:\s+\"[^\"]*\")?\)")
|
||||
_ILLEGAL_FS = re.compile(r'[\\/:*?"<>|#^\[\]]+')
|
||||
|
||||
|
||||
def slugify(text: str, default: str = "document") -> str:
|
||||
"""Filesystem/URL-safe slug."""
|
||||
text = text.strip().lower()
|
||||
text = re.sub(r"[\s_]+", "-", text)
|
||||
text = re.sub(r"[^a-z0-9\-]+", "", text)
|
||||
text = re.sub(r"-{2,}", "-", text).strip("-")
|
||||
return text or default
|
||||
|
||||
|
||||
def safe_filename(title: str, default: str = "document") -> str:
|
||||
"""Clean a title into a safe note filename (keeps unicode, drops illegal chars)."""
|
||||
name = _ILLEGAL_FS.sub(" ", title).strip()
|
||||
name = re.sub(r"\s{2,}", " ", name)
|
||||
return name[:120] or default
|
||||
|
||||
|
||||
def is_remote(ref: str) -> bool:
|
||||
return ref.startswith("http://") or ref.startswith("https://") or ref.startswith("data:")
|
||||
|
||||
|
||||
def find_local_images(markdown: str, base_dir) -> list:
|
||||
"""Return ``[(alt, ref, Path)]`` for image refs that point at existing local files."""
|
||||
base = Path(base_dir) if base_dir else None
|
||||
found = []
|
||||
seen = set()
|
||||
for match in _IMAGE_RE.finditer(markdown):
|
||||
ref = match.group("ref")
|
||||
if is_remote(ref) or ref in seen:
|
||||
continue
|
||||
path = Path(ref)
|
||||
if not path.is_absolute() and base is not None:
|
||||
path = base / ref
|
||||
if path.is_file():
|
||||
found.append((match.group("alt"), ref, path))
|
||||
seen.add(ref)
|
||||
return found
|
||||
|
||||
|
||||
def rewrite_images(markdown: str, mapping: dict) -> str:
|
||||
"""Rewrite local image refs using ``{old_ref: new_ref}``."""
|
||||
def repl(match):
|
||||
ref = match.group("ref")
|
||||
if ref in mapping:
|
||||
return f""
|
||||
return match.group(0)
|
||||
|
||||
return _IMAGE_RE.sub(repl, markdown)
|
||||
|
||||
|
||||
def yaml_frontmatter(props: dict) -> str:
|
||||
"""Render a YAML frontmatter block. List values become ``- item`` lines."""
|
||||
lines = ["---"]
|
||||
for key, value in props.items():
|
||||
if value is None or value == "" or value == []:
|
||||
continue
|
||||
if isinstance(value, (list, tuple)):
|
||||
lines.append(f"{key}:")
|
||||
for item in value:
|
||||
lines.append(f" - {item}")
|
||||
else:
|
||||
lines.append(f"{key}: {value}")
|
||||
lines.append("---")
|
||||
return "\n".join(lines)
|
||||
|
||||
|
||||
# --------------------------------------------------------------------------- #
|
||||
# Inline + block Markdown -> HTML (pragmatic, XHTML-safe)
|
||||
# --------------------------------------------------------------------------- #
|
||||
def _inline(text: str) -> str:
|
||||
"""Convert inline Markdown to HTML on already-escaped text."""
|
||||
# images first, then links
|
||||
text = _IMAGE_RE.sub(
|
||||
lambda m: f'<img src="{html.escape(m.group("ref"), quote=True)}" alt="{m.group("alt")}" />',
|
||||
text,
|
||||
)
|
||||
text = re.sub(r"\[([^\]]+)\]\(([^)\s]+)\)",
|
||||
lambda m: f'<a href="{html.escape(m.group(2), quote=True)}">{m.group(1)}</a>', text)
|
||||
text = re.sub(r"`([^`]+)`", r"<code>\1</code>", text)
|
||||
text = re.sub(r"\*\*([^*]+)\*\*", r"<strong>\1</strong>", text)
|
||||
text = re.sub(r"(?<!\*)\*(?!\*)([^*]+)\*(?!\*)", r"<em>\1</em>", text)
|
||||
return text
|
||||
|
||||
|
||||
def md_to_html(markdown: str) -> str:
|
||||
"""Convert a Markdown document to a pragmatic, XHTML-safe HTML fragment."""
|
||||
out = []
|
||||
lines = markdown.replace("\r\n", "\n").split("\n")
|
||||
i = 0
|
||||
n = len(lines)
|
||||
in_code = False
|
||||
code_buf: list = []
|
||||
list_stack: list = [] # 'ul' / 'ol'
|
||||
|
||||
def close_lists():
|
||||
while list_stack:
|
||||
out.append(f"</{list_stack.pop()}>")
|
||||
|
||||
while i < n:
|
||||
line = lines[i]
|
||||
fence = line.strip().startswith("```")
|
||||
if fence and not in_code:
|
||||
close_lists()
|
||||
in_code = True
|
||||
code_buf = []
|
||||
i += 1
|
||||
            continue
        if fence and in_code:
            out.append("<pre><code>" + html.escape("\n".join(code_buf)) + "</code></pre>")
            in_code = False
            i += 1
            continue
        if in_code:
            code_buf.append(line)
            i += 1
            continue

        stripped = line.strip()
        if not stripped:
            close_lists()
            i += 1
            continue

        # table block
        if "|" in stripped and i + 1 < n and re.match(r"^\s*\|?[\s:|-]+\|?\s*$", lines[i + 1]):
            close_lists()
            header = [c.strip() for c in stripped.strip("|").split("|")]
            rows = []
            i += 2
            while i < n and "|" in lines[i] and lines[i].strip():
                rows.append([c.strip() for c in lines[i].strip().strip("|").split("|")])
                i += 1
            out.append("<table><thead><tr>"
                       + "".join(f"<th>{_inline(html.escape(c))}</th>" for c in header)
                       + "</tr></thead><tbody>")
            for row in rows:
                out.append("<tr>" + "".join(f"<td>{_inline(html.escape(c))}</td>" for c in row) + "</tr>")
            out.append("</tbody></table>")
            continue

        heading = re.match(r"^(#{1,6})\s+(.*)$", stripped)
        if heading:
            close_lists()
            level = len(heading.group(1))
            out.append(f"<h{level}>{_inline(html.escape(heading.group(2)))}</h{level}>")
            i += 1
            continue

        if stripped.startswith(">"):
            close_lists()
            out.append(f"<blockquote>{_inline(html.escape(stripped[1:].strip()))}</blockquote>")
            i += 1
            continue

        if re.match(r"^([-*+])\s+", stripped):
            if not list_stack or list_stack[-1] != "ul":
                close_lists()
                list_stack.append("ul")
                out.append("<ul>")
            item = re.sub(r"^([-*+])\s+", "", stripped)
            out.append(f"<li>{_inline(html.escape(item))}</li>")
            i += 1
            continue

        if re.match(r"^\d+\.\s+", stripped):
            if not list_stack or list_stack[-1] != "ol":
                close_lists()
                list_stack.append("ol")
                out.append("<ol>")
            item = re.sub(r"^\d+\.\s+", "", stripped)
            out.append(f"<li>{_inline(html.escape(item))}</li>")
            i += 1
            continue

        if re.match(r"^([-*_])\1{2,}$", stripped):
            close_lists()
            out.append("<hr />")
            i += 1
            continue

        close_lists()
        out.append(f"<p>{_inline(html.escape(stripped))}</p>")
        i += 1

    if in_code:
        out.append("<pre><code>" + html.escape("\n".join(code_buf)) + "</code></pre>")
    close_lists()
    return "\n".join(out)


# --------------------------------------------------------------------------- #
# Markdown -> Logseq outline
# --------------------------------------------------------------------------- #
def md_to_logseq(markdown: str, properties: Optional[dict] = None) -> str:
    """Convert flat Markdown into a Logseq outline.

    Every line becomes a ``- `` block. Headings are top-level blocks; the content
    that follows a heading nests one level beneath it. Page properties
    (``key:: value``) go on the first block, as Logseq requires.
    """
    out = []
    if properties:
        prop_lines = []
        for key, value in properties.items():
            if not value:
                continue
            if isinstance(value, (list, tuple)):
                value = ", ".join(str(v) for v in value)
            prop_lines.append(f"{key}:: {value}")
        if prop_lines:
            out.append("- " + prop_lines[0])
            out.extend(f" {p}" for p in prop_lines[1:])

    have_heading = False
    for raw in markdown.replace("\r\n", "\n").split("\n"):
        line = raw.strip()
        if not line:
            continue
        if re.match(r"^#{1,6}\s+", line):
            out.append(f"- {line}")
            have_heading = True
        elif have_heading:
            out.append(f"\t- {line}")
        else:
            out.append(f"- {line}")
    return "\n".join(out)
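To make the nesting rule of `md_to_logseq` concrete, here is a minimal standalone sketch that reproduces only the heading/child logic (page properties are left out). The name `outline` and the sample text are assumptions made for this illustration; they are not part of the module.

```python
import re


def outline(markdown: str) -> str:
    """Simplified stand-in for md_to_logseq: headings top-level, other lines nested."""
    out = []
    have_heading = False
    for raw in markdown.replace("\r\n", "\n").split("\n"):
        line = raw.strip()
        if not line:
            continue  # blank lines produce no block
        if re.match(r"^#{1,6}\s+", line):
            out.append(f"- {line}")
            have_heading = True
        elif have_heading:
            out.append(f"\t- {line}")  # nest under the most recent heading
        else:
            out.append(f"- {line}")  # text before any heading stays top-level
    return "\n".join(out)


sample = "intro\n\n# Title\nbody text\n## Sub\nmore"
print(outline(sample))
```

Note that every heading, whatever its level, becomes a top-level block in this scheme; only non-heading lines are indented.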
50
skills/developing/mineru/scripts/sinks/airtable.py
Normal file
@ -0,0 +1,50 @@
"""Airtable sink: store parsed Markdown as a record in a base/table.

Airtable is a database, not a document tool: the native ingestion path is a
record whose fields hold the title and the Markdown body. Field names are
configurable to match an existing table schema.

Docs: https://airtable.com/developers/web/api/create-records
(POST /v0/{baseId}/{tableIdOrName}).
"""

from __future__ import annotations

import urllib.parse

from . import _http
from .base import ParsedDoc, Sink, SinkError, SinkResult, register

API_BASE = "https://api.airtable.com/v0"


@register
class AirtableSink(Sink):
    name = "airtable"
    requires = ("AIRTABLE_API_KEY", "AIRTABLE_BASE_ID", "AIRTABLE_TABLE")
    label = "Airtable record (database)"

    def deliver(self, doc: ParsedDoc) -> SinkResult:
        api_key = self.env("AIRTABLE_API_KEY")
        base = self.env("AIRTABLE_BASE_ID")
        table = self.env("AIRTABLE_TABLE")
        title_field = self.env("AIRTABLE_TITLE_FIELD", "Title")
        body_field = self.env("AIRTABLE_BODY_FIELD", "Notes")

        # safe="" also escapes "/" so a table name cannot add path segments.
        url = f"{API_BASE}/{base}/{urllib.parse.quote(table, safe='')}"
        headers = {"Authorization": f"Bearer {api_key}"}
        payload = {"fields": {title_field: doc.title, body_field: doc.markdown}}

        status, parsed = _http.request_json("POST", url, headers=headers, payload=payload)

        if parsed.get("error") or status >= 400:
            raise SinkError(str(parsed.get("error") or f"HTTP {status}"))
        if not parsed.get("id"):
            raise SinkError(f"Airtable returned no record id: {parsed}")

        return SinkResult(
            sink=self.name,
            ok=True,
            url=None,
            detail="stored as a database record (Airtable is a DB, not a doc)",
        )
101
skills/developing/mineru/scripts/sinks/base.py
Normal file
@ -0,0 +1,101 @@
"""Core types and the sink registry for delivering parsed Markdown to content tools.

A *sink* takes a :class:`ParsedDoc` (Markdown + local images + metadata) and
delivers it into one destination (Obsidian, Notion, Slack, Feishu, ...) using
that tool's OFFICIAL native ingestion path. Sinks read their configuration from
environment variables so an AI agent can run them without interactive prompts.
"""

from __future__ import annotations

import os
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ParsedDoc:
    """A parsed document ready for delivery."""

    title: str
    markdown: str
    images: tuple = ()  # absolute paths to local image files
    source: str = ""
    modality: str = "unknown"
    markdown_path: Optional[str] = None


@dataclass
class SinkResult:
    """Outcome of delivering a :class:`ParsedDoc` to one sink."""

    sink: str
    ok: bool
    url: Optional[str] = None
    detail: Optional[str] = None
    error: Optional[str] = None

    def to_status(self) -> dict:
        return {
            "sink": self.sink,
            "ok": self.ok,
            "url": self.url,
            "detail": self.detail,
            "error": self.error,
        }


class SinkError(Exception):
    """Raised by a sink when delivery fails for a known reason."""


class Sink:
    """Base class for a delivery target.

    Subclasses set ``name``/``aliases``/``requires`` and implement
    :meth:`deliver`. ``requires`` lists the environment variables that must be
    present for the sink to be usable.
    """

    name: str = "base"
    aliases: tuple = ()
    requires: tuple = ()  # required env vars
    label: str = ""  # human description
    local: bool = False  # filesystem-only, no network/auth

    def env(self, key: str, default: Optional[str] = None) -> Optional[str]:
        value = os.environ.get(key, default)
        return value.strip() if isinstance(value, str) else value

    def missing_config(self) -> list:
        return [k for k in self.requires if not self.env(k)]

    def is_configured(self) -> bool:
        return not self.missing_config()

    def deliver(self, doc: ParsedDoc) -> SinkResult:  # pragma: no cover - abstract
        raise NotImplementedError


# --------------------------------------------------------------------------- #
# Registry
# --------------------------------------------------------------------------- #
REGISTRY: dict = {}


def register(cls):
    """Class decorator that instantiates a sink and registers it by name+aliases."""
    inst = cls()
    REGISTRY[inst.name] = inst
    for alias in inst.aliases:
        REGISTRY[alias] = inst
    return cls


def get_sink(name: str) -> Optional[Sink]:
    return REGISTRY.get(name.lower())


def sink_names() -> list:
    """Canonical sink names (no aliases), sorted."""
    return sorted({s.name for s in REGISTRY.values()})
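The registry above can be exercised in isolation. The sketch below re-creates a trimmed-down `Sink`, `register`, `get_sink` and `sink_names` so it runs on its own, and registers a hypothetical `EchoSink` (a name invented for this example) to show how a canonical name and an alias resolve to the same instance.

```python
REGISTRY = {}


class Sink:
    """Trimmed-down base class: only the attributes the registry reads."""
    name = "base"
    aliases = ()


def register(cls):
    """Instantiate the class once and index the instance by name and aliases."""
    inst = cls()
    REGISTRY[inst.name] = inst
    for alias in inst.aliases:
        REGISTRY[alias] = inst
    return cls


def get_sink(name):
    return REGISTRY.get(name.lower())


def sink_names():
    # A set comprehension collapses alias entries onto the canonical name.
    return sorted({s.name for s in REGISTRY.values()})


@register
class EchoSink(Sink):  # hypothetical sink, for illustration only
    name = "echo"
    aliases = ("print",)

    def deliver(self, doc):
        return {"sink": self.name, "ok": True, "detail": doc}


print(sink_names(), get_sink("PRINT").deliver("hello"))
```

Because `register` stores one shared instance, sinks in this design are effectively singletons; per-delivery state belongs in local variables rather than on `self`.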
72
skills/developing/mineru/scripts/sinks/coda.py
Normal file
@ -0,0 +1,72 @@
"""Coda sink: deliver Markdown as a page, into an existing doc or a new one.

Coda's API (``https://coda.io/apis/v1``) authenticates with a Bearer token.
Markdown is delivered as canvas page content. If ``CODA_DOC_ID`` is set, a new
page is added to that doc; otherwise a new doc is created with the content as its
initial page.

Coda canvas content embeds images by URL only, so local image refs are left
untouched; host images at a public URL for them to render.
"""

from __future__ import annotations

from pathlib import Path

from . import _http, _md
from .base import ParsedDoc, Sink, SinkError, SinkResult, register

API = "https://coda.io/apis/v1"


def _canvas(markdown: str) -> dict:
    return {"type": "canvas", "canvasContent": {"format": "markdown", "content": markdown}}


@register
class CodaSink(Sink):
    name = "coda"
    requires = ("CODA_API_TOKEN",)
    label = "Coda page (REST API)"

    def deliver(self, doc: ParsedDoc) -> SinkResult:
        token = self.env("CODA_API_TOKEN")
        doc_id = self.env("CODA_DOC_ID")
        headers = {
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        }

        # Local images are only counted for the detail note; refs stay as-is.
        base_dir = Path(doc.markdown_path).parent if doc.markdown_path else None
        n_images = len(_md.find_local_images(doc.markdown, base_dir))

        if doc_id:
            status, parsed = _http.request_json(
                "POST", f"{API}/docs/{doc_id}/pages", headers=headers, payload={
                    "name": doc.title,
                    "pageContent": _canvas(doc.markdown),
                },
            )
        else:
            status, parsed = _http.request_json(
                "POST", f"{API}/docs", headers=headers, payload={
                    "title": doc.title,
                    "initialPage": {
                        "name": doc.title,
                        "pageContent": _canvas(doc.markdown),
                    },
                },
            )

        if status >= 400:
            raise SinkError(parsed.get("message") or f"HTTP {status}")

        if n_images:
            detail = f"text only ({n_images} local image(s); Coda embeds images by URL)"
        else:
            detail = "text only"
        return SinkResult(
            sink=self.name, ok=True,
            url=parsed.get("browserLink"),
            detail=detail,
        )
66
skills/developing/mineru/scripts/sinks/confluence.py
Normal file
@ -0,0 +1,66 @@
"""Confluence sink: create a page from the parsed Markdown via the Cloud REST API.

Confluence Cloud ingests content as *storage-format* HTML. Delivery converts the
Markdown to HTML and creates a page with the v2 REST API
(``POST /wiki/api/v2/pages``) using Basic auth (email + API token).

Local images are not attached: Confluence storage HTML references attachments by
filename, which would require a separate upload step.
"""

from __future__ import annotations

import base64

from . import _http, _md
from .base import ParsedDoc, Sink, SinkError, SinkResult, register


@register
class ConfluenceSink(Sink):
    name = "confluence"
    requires = (
        "CONFLUENCE_BASE_URL",
        "CONFLUENCE_EMAIL",
        "CONFLUENCE_API_TOKEN",
        "CONFLUENCE_SPACE_ID",
    )
    label = "Confluence Cloud page (storage HTML)"

    def deliver(self, doc: ParsedDoc) -> SinkResult:
        base = self.env("CONFLUENCE_BASE_URL").rstrip("/")
        email = self.env("CONFLUENCE_EMAIL")
        token = self.env("CONFLUENCE_API_TOKEN")
        space = self.env("CONFLUENCE_SPACE_ID")

        # Basic auth: base64("email:token").
        auth = base64.b64encode(f"{email}:{token}".encode("utf-8")).decode("ascii")
        headers = {
            "Authorization": f"Basic {auth}",
            "Content-Type": "application/json",
        }

        html = _md.md_to_html(doc.markdown)
        status, parsed = _http.request_json(
            "POST",
            f"{base}/wiki/api/v2/pages",
            headers=headers,
            payload={
                "spaceId": space,
                "status": "current",
                "title": doc.title,
                "body": {"representation": "storage", "value": html},
            },
        )
        if status >= 400:
            raise SinkError(
                parsed.get("title")
                or parsed.get("message")
                or f"Confluence HTTP {status}"
            )

        webui = (parsed.get("_links") or {}).get("webui")
        url = base + webui if webui else None
        return SinkResult(
            sink=self.name, ok=True, url=url,
            detail="converted Markdown->storage HTML (local images not attached)",
        )
65
skills/developing/mineru/scripts/sinks/dingtalk.py
Normal file
@ -0,0 +1,65 @@
"""DingTalk (钉钉) sink: push parsed Markdown as a robot markdown message.

A DingTalk custom robot accepts a ``markdown`` message type. The official native
ingestion path is therefore a webhook POST carrying the document title and body.
When a signing secret is configured the request is HMAC-SHA256 signed per
DingTalk's spec. DingTalk's markdown renderer only fetches images over public
URLs, so local images won't render.

Docs: https://open.dingtalk.com/document/robots/custom-robot-access.
"""

from __future__ import annotations

import base64
import hashlib
import hmac
import time
import urllib.parse

from . import _http
from .base import ParsedDoc, Sink, SinkError, SinkResult, register


@register
class DingTalkSink(Sink):
    name = "dingtalk"
    aliases = ("钉钉",)
    requires = ("DINGTALK_WEBHOOK",)
    label = "DingTalk robot markdown (钉钉)"

    def _build_url(self) -> str:
        # DINGTALK_WEBHOOK may be a full webhook URL or just the access token.
        webhook = self.env("DINGTALK_WEBHOOK")
        if webhook.startswith("http"):
            url = webhook
        else:
            url = f"https://oapi.dingtalk.com/robot/send?access_token={webhook}"

        secret = self.env("DINGTALK_SECRET")
        if secret:
            # Sign "<timestamp>\n<secret>" with HMAC-SHA256, keyed by the secret.
            timestamp = str(round(time.time() * 1000))
            string_to_sign = f"{timestamp}\n{secret}"
            hmac_code = hmac.new(
                secret.encode(), string_to_sign.encode(), hashlib.sha256
            ).digest()
            sign = urllib.parse.quote_plus(base64.b64encode(hmac_code))
            url += f"&timestamp={timestamp}&sign={sign}"
        return url

    def deliver(self, doc: ParsedDoc) -> SinkResult:
        url = self._build_url()
        payload = {
            "msgtype": "markdown",
            "markdown": {"title": doc.title, "text": doc.markdown},
        }
        status, parsed = _http.request_json("POST", url, payload=payload)

        if parsed.get("errcode") not in (0, None):
            raise SinkError(parsed.get("errmsg") or f"DingTalk HTTP {status}: {parsed}")

        return SinkResult(
            sink=self.name,
            ok=True,
            url=None,
            detail="robot markdown message (local images won't render; host publicly)",
        )
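The signing step in `_build_url` can be tried on its own. The following sketch isolates it into a pure function so the timestamp can be fixed for testing; `signed_query`, the secret and the timestamp are made-up values for illustration, not real credentials or part of the sink.

```python
import base64
import hashlib
import hmac
import urllib.parse


def signed_query(secret: str, timestamp_ms: int) -> str:
    """Return the '&timestamp=...&sign=...' suffix for a signed robot webhook."""
    string_to_sign = f"{timestamp_ms}\n{secret}"
    digest = hmac.new(
        secret.encode("utf-8"), string_to_sign.encode("utf-8"), hashlib.sha256
    ).digest()
    # Base64 output can contain '+', '/' and '=', so it is URL-encoded.
    sign = urllib.parse.quote_plus(base64.b64encode(digest))
    return f"&timestamp={timestamp_ms}&sign={sign}"


suffix = signed_query("SEC-example", 1700000000000)  # made-up secret and timestamp
print(suffix)
```

Passing the timestamp in as an argument, rather than calling `time.time()` inside, keeps the function deterministic and easy to check.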
124
skills/developing/mineru/scripts/sinks/feishu.py
Normal file
@ -0,0 +1,124 @@
"""Feishu / Lark sink: import the parsed Markdown as a Docx document.

Feishu (飞书) / Lark ingests Markdown through its Drive import pipeline. Delivery
follows that official path:

1. ``tenant_access_token/internal``: exchange the app id/secret for a tenant
   access token.
2. ``drive/v1/medias/upload_all``: upload the ``.md`` bytes as an import medium
   and obtain a ``file_token``.
3. ``drive/v1/import_tasks``: kick off an import task converting the medium to a
   Docx, returning a ``ticket``.
4. Poll ``drive/v1/import_tasks/{ticket}`` until the job finishes, surfacing the
   resulting document URL.

Local images are not uploaded; they would need public URLs to render in Docx.
"""

from __future__ import annotations

import json
import time

from . import _http, _md
from .base import ParsedDoc, Sink, SinkError, SinkResult, register


@register
class FeishuSink(Sink):
    name = "feishu"
    aliases = ("lark", "飞书")
    requires = ("FEISHU_APP_ID", "FEISHU_APP_SECRET")
    label = "Feishu / Lark Docx (Drive import)"

    def deliver(self, doc: ParsedDoc) -> SinkResult:
        app_id = self.env("FEISHU_APP_ID")
        app_secret = self.env("FEISHU_APP_SECRET")
        folder_token = self.env("FEISHU_FOLDER_TOKEN")

        # Step 1: tenant access token.
        status, parsed = _http.request_json(
            "POST",
            "https://open.feishu.cn/open-apis/auth/v3/tenant_access_token/internal",
            payload={"app_id": app_id, "app_secret": app_secret},
        )
        token = parsed.get("tenant_access_token")
        if parsed.get("code") not in (0, None) or not token:
            raise SinkError(parsed.get("msg") or f"Feishu auth failed (HTTP {status})")
        headers = {"Authorization": f"Bearer {token}"}

        # Step 2: upload the Markdown bytes as an import medium.
        content = doc.markdown.encode("utf-8")
        fname = _md.safe_filename(doc.title) + ".md"
        ctype, body = _http.encode_multipart(
            fields={
                "file_name": fname,
                "parent_type": "ccm_import_open",
                "size": str(len(content)),
                "extra": json.dumps({"obj_type": "docx", "file_extension": "md"}),
            },
            files=[("file", fname, content)],
        )
        up_status, raw = _http.http_request(
            "POST",
            "https://open.feishu.cn/open-apis/drive/v1/medias/upload_all",
            headers={**headers, "Content-Type": ctype},
            data=body,
        )
        parsed = _parse_json(raw)
        if parsed.get("code") not in (0, None):
            raise SinkError(parsed.get("msg") or f"Feishu media upload failed (HTTP {up_status})")
        file_token = (parsed.get("data") or {}).get("file_token")
        if not file_token:
            raise SinkError("Feishu did not return a file_token")

        # Step 3: create the import task.
        status, parsed = _http.request_json(
            "POST",
            "https://open.feishu.cn/open-apis/drive/v1/import_tasks",
            headers=headers,
            payload={
                "file_extension": "md",
                "file_token": file_token,
                "type": "docx",
                "file_name": doc.title,
                "point": {"mount_type": 1, "mount_key": folder_token or ""},
            },
        )
        if parsed.get("code") not in (0, None):
            raise SinkError(parsed.get("msg") or f"Feishu import task failed (HTTP {status})")
        ticket = (parsed.get("data") or {}).get("ticket")
        if not ticket:
            raise SinkError("Feishu did not return an import ticket")

        # Step 4: poll until the import job completes.
        url = None
        for _attempt in range(20):
            status, parsed = _http.request_json(
                "GET",
                f"https://open.feishu.cn/open-apis/drive/v1/import_tasks/{ticket}",
                headers=headers,
            )
            res = (parsed.get("data") or {}).get("result") or {}
            job_status = res.get("job_status")
            if job_status == 0:  # finished
                url = res.get("url")
                break
            if job_status in (1, 2):  # still initializing / processing
                time.sleep(1)
                continue
            raise SinkError(res.get("job_error_msg") or "Feishu import failed")
        else:
            # Every poll saw a pending job; do not report success without a result.
            raise SinkError("Feishu import did not finish within 20 polls")

        return SinkResult(
            sink=self.name, ok=True, url=url,
            detail="imported to Feishu Docx (local images need public URLs)",
        )


def _parse_json(raw):
    """Decode a raw JSON response body; return ``{}`` when empty or invalid."""
    if not raw:
        return {}
    try:
        return json.loads(raw.decode("utf-8"))
    except (ValueError, UnicodeDecodeError):
        return {}
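Step 4 of the Feishu flow is a bounded polling loop. The sketch below pulls that loop into a standalone helper that takes the status lookup as a callable, so it can be run against canned responses instead of the network. `poll_import`, its parameters and the example URL are assumptions of this sketch; the `job_status` meanings (0 done, 1 and 2 pending) follow how the sink code treats them, and raising `TimeoutError` when polls run out is a design choice made here.

```python
import time
from typing import Callable, Optional


def poll_import(fetch: Callable[[], dict], attempts: int = 20, delay: float = 1.0) -> Optional[str]:
    """Poll until job_status is 0 (done); 1 and 2 mean the job is still running."""
    for _ in range(attempts):
        result = fetch()
        status = result.get("job_status")
        if status == 0:
            return result.get("url")
        if status in (1, 2):
            time.sleep(delay)
            continue
        # Any other status is treated as a failed import.
        raise RuntimeError(result.get("job_error_msg") or "import failed")
    raise TimeoutError(f"import still pending after {attempts} polls")


# Canned responses standing in for the HTTP call: two pending polls, then success.
responses = iter([
    {"job_status": 1},
    {"job_status": 2},
    {"job_status": 0, "url": "https://example.invalid/doc"},
])
print(poll_import(lambda: next(responses), delay=0))
```

Injecting `fetch` keeps the retry policy separate from the transport, which makes the timeout and failure branches straightforward to test.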
75
skills/developing/mineru/scripts/sinks/linear.py
Normal file
@ -0,0 +1,75 @@
"""Linear sink: create an issue from Markdown via the GraphQL API.

Linear's API is GraphQL at ``https://api.linear.app/graphql`` and authenticates
with a raw API key in the ``Authorization`` header (no ``Bearer`` prefix). The
issue description is Markdown; Linear renders inline ``data:`` image URIs, so
local images are read and embedded as base64 data URIs before delivery.
"""

from __future__ import annotations

import base64
from pathlib import Path

from . import _http, _md
from .base import ParsedDoc, Sink, SinkError, SinkResult, register

API = "https://api.linear.app/graphql"

_MUTATION = (
    "mutation IssueCreate($input: IssueCreateInput!)"
    "{issueCreate(input:$input){success issue{id url identifier}}}"
)

_MIME = {
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".gif": "image/gif",
    ".webp": "image/webp",
}


def _data_uri(path: Path) -> str:
    # Unknown extensions fall back to image/png.
    mime = _MIME.get(path.suffix.lower(), "image/png")
    b64 = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime};base64,{b64}"


@register
class LinearSink(Sink):
    name = "linear"
    requires = ("LINEAR_API_KEY", "LINEAR_TEAM_ID")
    label = "Linear issue (GraphQL API)"

    def deliver(self, doc: ParsedDoc) -> SinkResult:
        key = self.env("LINEAR_API_KEY")
        team = self.env("LINEAR_TEAM_ID")
        headers = {"Authorization": key, "Content-Type": "application/json"}

        # Replace each local image ref with an inline base64 data URI.
        base_dir = Path(doc.markdown_path).parent if doc.markdown_path else None
        images = _md.find_local_images(doc.markdown, base_dir)
        mapping = {ref: _data_uri(path) for _alt, ref, path in images}
        body = _md.rewrite_images(doc.markdown, mapping)

        status, parsed = _http.request_json("POST", API, headers=headers, payload={
            "query": _MUTATION,
            "variables": {"input": {
                "teamId": team,
                "title": doc.title,
                "description": body,
            }},
        })
        if parsed.get("errors"):
            raise SinkError(str(parsed["errors"]))

        result = ((parsed.get("data") or {}).get("issueCreate")) or {}
        if not result.get("success"):
            raise SinkError(f"Linear did not create the issue (HTTP {status})")
        issue = result.get("issue") or {}

        return SinkResult(
            sink=self.name, ok=True,
            url=issue.get("url"),
            detail=f"{len(mapping)} image(s) inlined",
        )
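The image inlining used by the Linear sink boils down to building a base64 `data:` URI per file. A standalone sketch of that step follows; it writes placeholder bytes to a temporary file so it runs anywhere. The function name `data_uri`, the file name and the placeholder bytes are assumptions made for this example.

```python
import base64
import tempfile
from pathlib import Path

_MIME = {
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".gif": "image/gif",
    ".webp": "image/webp",
}


def data_uri(path: Path) -> str:
    """Encode a file as a data URI; unknown extensions fall back to image/png."""
    mime = _MIME.get(path.suffix.lower(), "image/png")
    b64 = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime};base64,{b64}"


with tempfile.TemporaryDirectory() as tmp:
    img = Path(tmp) / "figure.JPG"  # upper-case suffix exercises the lower() call
    img.write_bytes(b"not-a-real-image")  # placeholder bytes, not a valid JPEG
    uri = data_uri(img)

print(uri)
```

Base64 encodes every 3 bytes as 4 characters, so inlined images make the issue description roughly a third larger than the files themselves; that is worth keeping in mind for image-heavy documents.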
105
skills/developing/mineru/scripts/sinks/local.py
Normal file
@ -0,0 +1,105 @@
"""Local-first sinks: Obsidian and Logseq (filesystem writes, no auth).

Both tools are folders of Markdown files. The native ingestion is a filesystem
write following each tool's conventions:

* Obsidian: a flat note with YAML frontmatter; images in a per-note assets
  folder, referenced with relative Markdown embeds.
* Logseq: an outline (every line a ``- `` block) with ``key:: value`` page
  properties on the first block; images in ``assets/`` referenced as
  ``![alt](../assets/<name>)``.
"""

from __future__ import annotations

from pathlib import Path

from . import _md
from .base import ParsedDoc, Sink, SinkError, SinkResult, register


def _copy_images(doc: ParsedDoc, dest_dir: Path, ref_prefix: str) -> dict:
    """Copy referenced local images into ``dest_dir``; return ``{old_ref: new_ref}``."""
    base = Path(doc.markdown_path).parent if doc.markdown_path else None
    mapping = {}
    images = _md.find_local_images(doc.markdown, base)
    if images:
        dest_dir.mkdir(parents=True, exist_ok=True)
    for _alt, ref, path in images:
        target = dest_dir / path.name
        target.write_bytes(path.read_bytes())
        mapping[ref] = f"{ref_prefix}{path.name}"
    return mapping


@register
class ObsidianSink(Sink):
    name = "obsidian"
    aliases = ("ob",)
    requires = ("OBSIDIAN_VAULT",)
    label = "Obsidian vault (local Markdown)"
    local = True

    def deliver(self, doc: ParsedDoc) -> SinkResult:
        vault = Path(self.env("OBSIDIAN_VAULT")).expanduser()
        if not vault.is_dir():
            raise SinkError(f"Obsidian vault not found: {vault}")
        subdir = self.env("OBSIDIAN_SUBDIR", "") or ""
        note_dir = vault / subdir if subdir else vault
        note_dir.mkdir(parents=True, exist_ok=True)

        # Images go into a per-note "<stem>.assets" folder next to the note.
        stem = _md.safe_filename(doc.title)
        assets = note_dir / f"{stem}.assets"
        mapping = _copy_images(doc, assets, f"{stem}.assets/")
        body = _md.rewrite_images(doc.markdown, mapping)

        front = _md.yaml_frontmatter({
            "title": doc.title,
            "source": doc.source,
            "modality": doc.modality,
            "tags": ["mineru", "parsed"],
        })
        note_path = note_dir / f"{stem}.md"
        note_path.write_text(f"{front}\n\n{body}\n", encoding="utf-8")
        return SinkResult(sink=self.name, ok=True, url=str(note_path),
                          detail=f"{len(mapping)} image(s)")


@register
class LogseqSink(Sink):
    name = "logseq"
    requires = ("LOGSEQ_GRAPH",)
    label = "Logseq graph (local outline)"
    local = True

    def deliver(self, doc: ParsedDoc) -> SinkResult:
        graph = Path(self.env("LOGSEQ_GRAPH")).expanduser()
        if not graph.is_dir():
            raise SinkError(f"Logseq graph not found: {graph}")
        pages = graph / "pages"
        assets = graph / "assets"
        pages.mkdir(parents=True, exist_ok=True)

        stem = _md.safe_filename(doc.title)
        # Namespace asset names by page slug to avoid collisions in the shared assets/.
        prefix = _md.slugify(doc.title)
        mapping = {}
        base = Path(doc.markdown_path).parent if doc.markdown_path else None
        images = _md.find_local_images(doc.markdown, base)
        if images:
            assets.mkdir(parents=True, exist_ok=True)
        for _alt, ref, path in images:
            new_name = f"{prefix}-{path.name}"
            (assets / new_name).write_bytes(path.read_bytes())
            mapping[ref] = f"../assets/{new_name}"
        body = _md.rewrite_images(doc.markdown, mapping)

        outline = _md.md_to_logseq(body, properties={
            "title": doc.title,
            "source": doc.source,
            "tags": "mineru, parsed",
        })
        page_path = pages / f"{stem}.md"
        page_path.write_text(outline + "\n", encoding="utf-8")
        return SinkResult(sink=self.name, ok=True, url=str(page_path),
                          detail=f"{len(mapping)} image(s)")
130
skills/developing/mineru/scripts/sinks/notion.py
Normal file
@ -0,0 +1,130 @@
"""Notion sink: create a page under a parent page from Markdown blocks.

Notion's native ingestion is the block API: each Markdown line becomes a typed
block (heading, quote, code, list item, paragraph). A page is created with up to
100 children inline; any remainder is appended in 100-block chunks via the
``/blocks/{id}/children`` PATCH endpoint.

Notion has no inline image-from-bytes path (images must be uploaded or hosted
separately), so local image refs are intentionally left untouched.
"""

from __future__ import annotations

from pathlib import Path

from . import _http, _md
from .base import ParsedDoc, Sink, SinkError, SinkResult, register

API = "https://api.notion.com/v1"
MAX_BLOCKS = 100
MAX_TEXT = 2000


def _rich(text: str) -> list:
    # Text is truncated to MAX_TEXT characters per rich-text object.
    return [{"type": "text", "text": {"content": text[:MAX_TEXT]}}]


def _block(block_type: str, text: str, **extra) -> dict:
    inner = {"rich_text": _rich(text)}
    inner.update(extra)
    return {"object": "block", "type": block_type, block_type: inner}


def _is_numbered(text: str) -> bool:
    head = text.split(".", 1)
    return len(head) == 2 and head[0].isdigit() and head[1].startswith(" ")


def _blocks(markdown: str) -> list:
    """Convert flat Markdown lines into a list of Notion block dicts."""
    blocks = []
    in_code = False
    code_buf: list = []
    for raw in markdown.replace("\r\n", "\n").split("\n"):
        stripped = raw.strip()

        if stripped.startswith("```"):
            if in_code:
                blocks.append(_block("code", "\n".join(code_buf), language="plain text"))
                in_code = False
                code_buf = []
            else:
                in_code = True
                code_buf = []
            continue
        if in_code:
            code_buf.append(raw)
            continue

        if not stripped:
            continue
        if stripped.startswith("# "):
            blocks.append(_block("heading_1", stripped[2:].strip()))
        elif stripped.startswith("## "):
            blocks.append(_block("heading_2", stripped[3:].strip()))
        elif stripped.startswith("### "):
            blocks.append(_block("heading_3", stripped[4:].strip()))
        elif stripped.startswith("> "):
            blocks.append(_block("quote", stripped[2:].strip()))
        elif stripped.startswith("- ") or stripped.startswith("* "):
            blocks.append(_block("bulleted_list_item", stripped[2:].strip()))
        elif _is_numbered(stripped):
            blocks.append(_block("numbered_list_item", stripped.split(".", 1)[1].strip()))
        else:
            blocks.append(_block("paragraph", stripped))

    # An unterminated fence is flushed as a code block.
    if in_code:
        blocks.append(_block("code", "\n".join(code_buf), language="plain text"))
    return blocks


@register
class NotionSink(Sink):
    name = "notion"
    requires = ("NOTION_API_KEY", "NOTION_PARENT_PAGE_ID")
    label = "Notion page (blocks API)"

    def deliver(self, doc: ParsedDoc) -> SinkResult:
        key = self.env("NOTION_API_KEY")
        parent = self.env("NOTION_PARENT_PAGE_ID")
        version = self.env("NOTION_VERSION", "2022-06-28") or "2022-06-28"
        headers = {
            "Authorization": f"Bearer {key}",
            "Notion-Version": version,
            "Content-Type": "application/json",
        }

        # Count local images for the detail note (refs are left as-is).
        base_dir = Path(doc.markdown_path).parent if doc.markdown_path else None
        n_images = len(_md.find_local_images(doc.markdown, base_dir))

        blocks = _blocks(doc.markdown)
        status, parsed = _http.request_json("POST", f"{API}/pages", headers=headers, payload={
            "parent": {"page_id": parent},
            "properties": {"title": {"title": [{"text": {"content": doc.title}}]}},
            "children": blocks[:MAX_BLOCKS],
        })
        if parsed.get("object") == "error":
            raise SinkError(parsed.get("message") or f"Notion API error (HTTP {status})")
        created_id = parsed.get("id")
        if not created_id:
            raise SinkError(f"Notion did not return a page id (HTTP {status})")
        page_url = parsed.get("url")

        # Append the remaining blocks in MAX_BLOCKS-sized chunks.
        for start in range(MAX_BLOCKS, len(blocks), MAX_BLOCKS):
            chunk = blocks[start:start + MAX_BLOCKS]
            ch_status, ch_parsed = _http.request_json(
                "PATCH", f"{API}/blocks/{created_id}/children",
                headers=headers, payload={"children": chunk},
            )
            if ch_parsed.get("object") == "error":
                raise SinkError(ch_parsed.get("message")
                                or f"Notion block append failed (HTTP {ch_status})")

        if n_images:
            detail = (f"text+structure ({n_images} local images not embedded; "
                      f"Notion needs file upload)")
        else:
            detail = "text+structure"
        return SinkResult(sink=self.name, ok=True, url=page_url, detail=detail)
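The 100-block limit handling in the Notion sink is easy to check in isolation. This sketch separates the slicing from the HTTP calls: the first slice would go inline with the page-create request and each later chunk would be one append request. `split_children` is a hypothetical helper name used only here, and plain integers stand in for block dicts.

```python
MAX_BLOCKS = 100


def split_children(blocks: list, limit: int = MAX_BLOCKS):
    """Return (inline, chunks): blocks for page creation, then follow-up appends."""
    inline = blocks[:limit]
    chunks = [blocks[start:start + limit] for start in range(limit, len(blocks), limit)]
    return inline, chunks


inline, chunks = split_children(list(range(250)))
print(len(inline), [len(c) for c in chunks])  # 100 [100, 50]
```

A document with 100 blocks or fewer yields an empty `chunks` list, so no append requests would be made for short documents.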
66
skills/developing/mineru/scripts/sinks/onenote.py
Normal file
@ -0,0 +1,66 @@
"""OneNote sink: create a page from the parsed Markdown via Microsoft Graph.

OneNote pages are created by POSTing an HTML document to a section's ``pages``
endpoint with a pre-obtained Microsoft Graph access token (OAuth). Delivery
converts the Markdown to a full HTML document and creates the page.

Only remote images render — Graph fetches ``<img src>`` URLs, so local image
paths emitted by MinerU would need to be public URLs.
"""

from __future__ import annotations

import html
import json

from . import _http, _md
from .base import ParsedDoc, Sink, SinkError, SinkResult, register


@register
class OneNoteSink(Sink):
    name = "onenote"
    aliases = ("msonenote",)
    requires = ("ONENOTE_TOKEN", "ONENOTE_SECTION_ID")
    label = "OneNote section page (Microsoft Graph)"

    def deliver(self, doc: ParsedDoc) -> SinkResult:
        token = self.env("ONENOTE_TOKEN")
        section = self.env("ONENOTE_SECTION_ID")

        body_html = _md.md_to_html(doc.markdown)
        page = (
            "<!DOCTYPE html><html><head>"
            f"<title>{html.escape(doc.title)}</title>"
            f"</head><body>{body_html}</body></html>"
        )

        status, raw = _http.http_request(
            "POST",
            f"https://graph.microsoft.com/v1.0/me/onenote/sections/{section}/pages",
            headers={
                "Authorization": f"Bearer {token}",
                "Content-Type": "text/html",
            },
            data=page.encode("utf-8"),
        )
        if status >= 400:
            preview = raw.decode("utf-8", "replace") if raw else ""
            raise SinkError(f"OneNote HTTP {status}: {preview[:200]}")
        if status != 201:
            raise SinkError(f"OneNote unexpected response (HTTP {status})")

        parsed = {}
        if raw:
            try:
                parsed = json.loads(raw.decode("utf-8"))
            except (ValueError, UnicodeDecodeError):
                parsed = {}
        links = parsed.get("links") or {}
        web = links.get("oneNoteWebUrl") or {}
        url = web.get("href")

        return SinkResult(
            sink=self.name, ok=True, url=url,
            detail="converted Markdown->HTML (remote images only; OAuth token required)",
        )
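The page body sent to Graph is a plain HTML document whose `<title>` carries the page name, so the title has to be escaped while the already-rendered body fragment is passed through. A small sketch of that step on its own, using a hypothetical helper name `build_onenote_page` (not part of the sink):

```python
import html


def build_onenote_page(title: str, body_html: str) -> bytes:
    """Wrap an HTML fragment in a full document, escaping only the title."""
    page = (
        "<!DOCTYPE html><html><head>"
        f"<title>{html.escape(title)}</title>"
        f"</head><body>{body_html}</body></html>"
    )
    # Encoded as UTF-8 bytes, matching what the sink passes as the request body.
    return page.encode("utf-8")


if __name__ == "__main__":
    payload = build_onenote_page("Q3 <draft> & notes", "<p>Hello</p>")
    print(payload.decode("utf-8"))
```

The body fragment is trusted here because it comes from the project's own Markdown-to-HTML conversion; only the free-text title needs escaping.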
106
skills/developing/mineru/scripts/sinks/roam.py
Normal file
@ -0,0 +1,106 @@
"""Roam Research sink — optional dependency.

There is no library that ingests a Markdown document into Roam, but the official
``roam-client`` SDK correctly handles the parts that are easy to get wrong — the
307/308 peer-host redirect, the dual ``Authorization`` / ``x-authorization``
Bearer headers, and the ``/write`` plumbing. So we lazily depend on it for
transport and only build the Markdown → block-tree ourselves, delivering the whole
document in a single ``batch-actions`` request (one HTTP round-trip).

Install the SDK (git-only, not on PyPI; needs Python ≥ 3.11):

    pip install "roam-client @ git+https://github.com/Roam-Research/backend-sdks.git#subdirectory=python"

Config: ``ROAM_API_TOKEN`` (graph edit token), ``ROAM_GRAPH_NAME``.
"""

from __future__ import annotations

import itertools
import re

from .base import ParsedDoc, Sink, SinkError, SinkResult, register

_HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
_INSTALL_HINT = (
    'Roam sink needs the official SDK — pip install '
    '"roam-client @ git+https://github.com/Roam-Research/backend-sdks.git#subdirectory=python"'
)


def md_to_roam_tree(markdown: str) -> list:
    """Convert Markdown into a nested Roam block tree.

    Headings become parent blocks (``heading`` 1–3); the lines under a heading
    nest beneath it. Returns ``[{"string", "heading"?, "children": [...]}, ...]``.
    """
    roots: list = []
    stack: list = []  # [(heading_level, node)]
    for raw in markdown.replace("\r\n", "\n").split("\n"):
        line = raw.strip()
        if not line:
            continue
        match = _HEADING.match(line)
        if match:
            level = len(match.group(1))
            node = {"string": match.group(2), "heading": min(level, 3), "children": []}
            while stack and stack[-1][0] >= level:
                stack.pop()
            (stack[-1][1]["children"] if stack else roots).append(node)
            stack.append((level, node))
        else:
            node = {"string": line, "children": []}
            (stack[-1][1]["children"] if stack else roots).append(node)
    return roots


def tree_to_actions(children: list, parent_uid: str, uidgen) -> list:
    """Flatten a block tree into ``create-block`` actions for one batch request."""
    actions: list = []
    for order, node in enumerate(children):
        uid = uidgen()
        block = {"string": node["string"], "uid": uid}
        if node.get("heading"):
            block["heading"] = node["heading"]
        actions.append({
            "action": "create-block",
            "location": {"parent-uid": parent_uid, "order": order},
            "block": block,
        })
        actions.extend(tree_to_actions(node.get("children", []), uid, uidgen))
    return actions


@register
class RoamSink(Sink):
    name = "roam"
    aliases = ("roamresearch",)
    requires = ("ROAM_API_TOKEN", "ROAM_GRAPH_NAME")
    label = "Roam Research (batch-actions, optional dep)"

    def deliver(self, doc: ParsedDoc) -> SinkResult:
        try:
            from roam_client.client import create_page, initialize_graph
        except ImportError as exc:  # pragma: no cover - exercised via SinkError path
            raise SinkError(_INSTALL_HINT) from exc

        token = self.env("ROAM_API_TOKEN")
        graph = self.env("ROAM_GRAPH_NAME")
        client = initialize_graph({"token": token, "graph": graph})

        create_page(client, {"page": {"title": doc.title}})

        counter = itertools.count(1)
        actions = tree_to_actions(
            md_to_roam_tree(doc.markdown), doc.title, lambda: f"mu{next(counter):07d}"
        )
        if actions:
            client.call(
                f"/api/graph/{graph}/write", "POST",
                {"action": "batch-actions", "actions": actions},
            )
        return SinkResult(
            sink=self.name, ok=True,
            url=f"https://roamresearch.com/#/app/{graph}",
            detail=f"{len(actions)} block(s) via batch-actions (images need public URLs)",
        )
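The heading-stack nesting used by `md_to_roam_tree` is easier to follow on a tiny input. The sketch below is a simplified, standalone restatement of the same idea (it omits the `heading` key and the Roam action flattening); the names `nest_by_heading` and `outline` are hypothetical and exist only for this illustration.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def nest_by_heading(markdown: str) -> list:
    """Nest each non-empty line under the closest preceding heading."""
    roots, stack = [], []  # stack holds (heading_level, node) pairs
    for raw in markdown.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = HEADING.match(line)
        node = {"string": match.group(2) if match else line, "children": []}
        if match:
            level = len(match.group(1))
            # A new heading closes every open heading at the same or deeper level.
            while stack and stack[-1][0] >= level:
                stack.pop()
        (stack[-1][1]["children"] if stack else roots).append(node)
        if match:
            stack.append((level, node))
    return roots


def outline(nodes: list, depth: int = 0) -> list:
    """Render the tree as indented lines, two spaces per nesting level."""
    lines = []
    for node in nodes:
        lines.append("  " * depth + node["string"])
        lines.extend(outline(node["children"], depth + 1))
    return lines


if __name__ == "__main__":
    sample = "# A\nintro\n## B\ndetail\n# C\n"
    print("\n".join(outline(nest_by_heading(sample))))
```

In the sample, `intro` and the `B` heading both nest under `A`, `detail` nests under `B`, and `C` returns to the top level because a level-1 heading pops every open heading.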
111
skills/developing/mineru/scripts/sinks/siyuan.py
Normal file
@ -0,0 +1,111 @@
"""SiYuan sink: create a new document from Markdown via the local kernel API.

SiYuan (思源笔记) exposes a kernel HTTP API (default ``http://127.0.0.1:6806``)
authenticated with an API token. Delivery follows SiYuan's native ingestion path:

1. Resolve the target notebook (``SIYUAN_NOTEBOOK`` or the first listed notebook).
2. Upload each referenced local image via ``/api/asset/upload`` and rewrite the
   Markdown to point at the returned ``assets/...`` paths.
3. Create the document with ``/api/filetree/createDocWithMd``.

Every kernel response wraps its payload as ``{"code": 0, "msg": "", "data": ...}``;
a non-zero ``code`` is an error.
"""

from __future__ import annotations

import json
from pathlib import Path

from . import _http, _md
from .base import ParsedDoc, Sink, SinkError, SinkResult, register


@register
class SiYuanSink(Sink):
    name = "siyuan"
    requires = ("SIYUAN_TOKEN",)
    label = "SiYuan notebook (local kernel API)"

    def _json_post(self, base: str, path: str, headers: dict, payload: dict):
        """POST JSON; return ``data`` after verifying ``code == 0``."""
        try:
            status, parsed = _http.request_json("POST", f"{base}{path}",
                                                headers=headers, payload=payload)
        except Exception as exc:  # noqa: BLE001
            raise self._unreachable(base, exc) from exc
        return self._unwrap(base, status, parsed)

    def _upload_post(self, base: str, headers: dict, content_type: str, body: bytes):
        """POST a multipart body; return ``data`` after verifying ``code == 0``."""
        hdrs = dict(headers)
        hdrs["Content-Type"] = content_type
        try:
            status, raw = _http.http_request("POST", f"{base}/api/asset/upload",
                                             headers=hdrs, data=body)
        except Exception as exc:  # noqa: BLE001
            raise self._unreachable(base, exc) from exc
        parsed: dict = {}
        if raw:
            try:
                parsed = json.loads(raw.decode("utf-8"))
            except (ValueError, UnicodeDecodeError):
                parsed = {}
        return self._unwrap(base, status, parsed)

    @staticmethod
    def _unreachable(base: str, exc=None) -> SinkError:
        suffix = f" ({exc})" if exc else ""
        return SinkError(
            f"SiYuan kernel not reachable at {base} — start SiYuan and enable "
            f"the API token{suffix}"
        )

    def _unwrap(self, base: str, status: int, parsed: dict):
        if status == 0:
            raise self._unreachable(base)
        if parsed.get("code") != 0:
            raise SinkError(parsed.get("msg") or f"SiYuan API error (HTTP {status})")
        return parsed.get("data")

    def deliver(self, doc: ParsedDoc) -> SinkResult:
        base = (self.env("SIYUAN_API_URL", "http://127.0.0.1:6806")
                or "http://127.0.0.1:6806").rstrip("/")
        token = self.env("SIYUAN_TOKEN")
        headers = {"Authorization": f"Token {token}"}

        notebook = self.env("SIYUAN_NOTEBOOK")
        if not notebook:
            data = self._json_post(base, "/api/notebook/lsNotebooks", headers, {})
            notebooks = (data or {}).get("notebooks") or []
            if not notebooks:
                raise SinkError("SiYuan has no notebooks — create one before delivering")
            notebook = notebooks[0]["id"]

        base_dir = Path(doc.markdown_path).parent if doc.markdown_path else None
        images = _md.find_local_images(doc.markdown, base_dir)
        mapping = {}
        for _alt, ref, path in images:
            content_type, body = _http.encode_multipart(
                fields={"assetsDirPath": "/assets/"},
                files=[("file[]", path.name, path.read_bytes())],
            )
            data = self._upload_post(base, headers, content_type, body)
            succ_map = (data or {}).get("succMap") or {}
            if path.name in succ_map:
                mapping[ref] = succ_map[path.name]
        body_md = _md.rewrite_images(doc.markdown, mapping)

        docid = self._json_post(base, "/api/filetree/createDocWithMd", headers, {
            "notebook": notebook,
            "path": "/" + _md.safe_filename(doc.title),
            "markdown": body_md,
        })
        if not docid:
            raise SinkError("SiYuan did not return a document id")

        return SinkResult(
            sink=self.name, ok=True,
            url=f"siyuan://blocks/{docid}",
            detail=f"{len(mapping)} image(s)",
        )
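Every kernel call in this sink goes through the same envelope check: `code == 0` means success and `data` holds the payload, anything else is an error described by `msg`. A minimal standalone sketch of that check, with a hypothetical `KernelError` standing in for the sink's `SinkError` and made-up sample envelopes:

```python
class KernelError(Exception):
    """Hypothetical stand-in for the sink's SinkError."""


def unwrap(envelope: dict):
    """Return ``data`` from a ``{"code", "msg", "data"}`` envelope or raise."""
    if envelope.get("code") != 0:
        # A missing "code" is treated as failure too, as in the sink.
        raise KernelError(envelope.get("msg") or "kernel returned a non-zero code")
    return envelope.get("data")


if __name__ == "__main__":
    ok = {"code": 0, "msg": "", "data": {"notebooks": [{"id": "nb-1"}]}}
    print(unwrap(ok)["notebooks"][0]["id"])

    try:
        unwrap({"code": -1, "msg": "token invalid", "data": None})
    except KernelError as exc:
        print(f"error: {exc}")
```

Note that an empty or unparseable response (`{}`) also fails the check, which is why the sink can fall back to `parsed = {}` on a JSON decode error without a separate branch.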
95
skills/developing/mineru/scripts/sinks/slack.py
Normal file
@ -0,0 +1,95 @@
"""Slack sink: upload the parsed Markdown as a file via the external-upload flow.

Slack deprecated ``files.upload`` (retired) in favour of a three-step external
upload. Delivery follows that official path:

1. ``files.getUploadURLExternal`` — reserve an upload URL + file id for the
   given filename and byte length.
2. ``POST`` the raw bytes to the returned upload URL.
3. ``files.completeUploadExternal`` — finalize the upload, attach it to the
   target channel, and post an initial comment.

Images are *not* embedded: Markdown is uploaded as a single ``.md`` file.
"""

from __future__ import annotations

import urllib.parse

from . import _http, _md
from .base import ParsedDoc, Sink, SinkError, SinkResult, register


@register
class SlackSink(Sink):
    name = "slack"
    requires = ("SLACK_BOT_TOKEN", "SLACK_CHANNEL")
    label = "Slack channel (file upload)"

    def deliver(self, doc: ParsedDoc) -> SinkResult:
        token = self.env("SLACK_BOT_TOKEN")
        channel = self.env("SLACK_CHANNEL")
        auth = {"Authorization": f"Bearer {token}"}

        content = doc.markdown.encode("utf-8")
        filename = _md.slugify(doc.title) + ".md"

        # Step 1: reserve an external upload URL + file id. This endpoint wants
        # form-encoded data, so use http_request and parse the JSON response.
        form = urllib.parse.urlencode({
            "filename": filename,
            "length": len(content),
        }).encode("utf-8")
        status, raw = _http.http_request(
            "POST",
            "https://slack.com/api/files.getUploadURLExternal",
            headers={**auth, "Content-Type": "application/x-www-form-urlencoded"},
            data=form,
        )
        parsed = _parse_json(raw)
        if not parsed.get("ok"):
            raise SinkError(parsed.get("error") or f"Slack getUploadURLExternal failed (HTTP {status})")
        upload_url = parsed.get("upload_url")
        file_id = parsed.get("file_id")
        if not upload_url or not file_id:
            raise SinkError("Slack did not return an upload URL / file id")

        # Step 2: upload the raw bytes to the reserved URL.
        up_status, _up_body = _http.http_request(
            "POST", upload_url,
            headers={"Content-Type": "application/octet-stream"},
            data=content,
        )
        if up_status != 200:
            raise SinkError(f"Slack file upload failed (HTTP {up_status})")

        # Step 3: finalize the upload into the channel.
        status, parsed = _http.request_json(
            "POST",
            "https://slack.com/api/files.completeUploadExternal",
            headers=auth,
            payload={
                "files": [{"id": file_id, "title": doc.title}],
                "channel_id": channel,
                "initial_comment": f"Parsed: {doc.title}",
            },
        )
        if not parsed.get("ok"):
            raise SinkError(parsed.get("error") or f"Slack completeUploadExternal failed (HTTP {status})")

        files = parsed.get("files") or [{}]
        url = files[0].get("permalink")
        return SinkResult(
            sink=self.name, ok=True, url=url,
            detail="uploaded .md file (images not embedded)",
        )


def _parse_json(raw):
    import json
    if not raw:
        return {}
    try:
        return json.loads(raw.decode("utf-8"))
    except (ValueError, UnicodeDecodeError):
        return {}
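Step 1 of the upload flow declares the file's length in bytes, which differs from the character count as soon as the Markdown contains non-ASCII text. The sketch below isolates the form-encoding step; `build_upload_form` is a hypothetical helper name and the sample content is invented for illustration.

```python
import urllib.parse


def build_upload_form(filename: str, content: bytes) -> bytes:
    """Form-encode the fields sent to files.getUploadURLExternal."""
    return urllib.parse.urlencode({
        "filename": filename,
        "length": len(content),  # byte length of the encoded file, not len(str)
    }).encode("utf-8")


if __name__ == "__main__":
    text = "# 报告\n"
    content = text.encode("utf-8")
    print(len(text), len(content))  # character count vs. byte count
    print(build_upload_form("report.md", content))
```

Encoding the Markdown once and reusing the same bytes for both the declared length and the upload body keeps the two values consistent.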
48
skills/developing/mineru/scripts/sinks/ticktick.py
Normal file
@ -0,0 +1,48 @@
"""TickTick (滴答清单) sink — create a task from parsed Markdown.

TickTick's Open API exposes a task object whose ``content`` field holds the body
text. The official native ingestion path for arbitrary Markdown is therefore a
task: the document title becomes the task title and the Markdown becomes the
task content. Tasks have no attachment/inline-image surface, so local images are
not delivered.

Docs: https://developer.ticktick.com/docs (POST /open/v1/task).
"""

from __future__ import annotations

from . import _http
from .base import ParsedDoc, Sink, SinkError, SinkResult, register

API_URL = "https://api.ticktick.com/open/v1/task"


@register
class TickTickSink(Sink):
    name = "ticktick"
    aliases = ("dida", "滴答清单")
    requires = ("TICKTICK_TOKEN",)
    label = "TickTick task (滴答清单)"

    def deliver(self, doc: ParsedDoc) -> SinkResult:
        token = self.env("TICKTICK_TOKEN")
        project_id = self.env("TICKTICK_PROJECT_ID")

        payload = {"title": doc.title, "content": doc.markdown}
        if project_id:
            payload["projectId"] = project_id

        headers = {"Authorization": f"Bearer {token}"}
        status, parsed = _http.request_json("POST", API_URL, headers=headers, payload=payload)

        if status >= 400:
            raise SinkError(f"TickTick HTTP {status}: {parsed}")
        if not parsed.get("id"):
            raise SinkError(f"TickTick returned no task id: {parsed}")

        return SinkResult(
            sink=self.name,
            ok=True,
            url=None,
            detail="task content (no inline images supported by TickTick)",
        )
60
skills/developing/mineru/scripts/sinks/wecom.py
Normal file
@ -0,0 +1,60 @@
"""WeCom (企业微信 / WeChat Work) sink — send parsed Markdown as an app message.

WeCom apps deliver content via the message-send API. The native ingestion path
is a ``markdown`` message from a self-built app: first an access token is fetched
with the corp id + secret, then the message is posted. WeCom's markdown is a
limited subset with a 2048-byte content cap and no inline images, so the body is
truncated to fit.

Docs: https://developer.work.weixin.qq.com/document/path/90236 (message/send),
      https://developer.work.weixin.qq.com/document/path/91039 (gettoken).
"""

from __future__ import annotations

from . import _http
from .base import ParsedDoc, Sink, SinkError, SinkResult, register

TOKEN_URL = "https://qyapi.weixin.qq.com/cgi-bin/gettoken"
SEND_URL = "https://qyapi.weixin.qq.com/cgi-bin/message/send"


@register
class WeComSink(Sink):
    name = "wecom"
    aliases = ("企业微信", "wechatwork")
    requires = ("WECOM_CORPID", "WECOM_CORPSECRET", "WECOM_AGENTID")
    label = "WeCom app markdown (企业微信)"

    def deliver(self, doc: ParsedDoc) -> SinkResult:
        corpid = self.env("WECOM_CORPID")
        secret = self.env("WECOM_CORPSECRET")
        agentid = self.env("WECOM_AGENTID")
        touser = self.env("WECOM_TOUSER", "@all")

        # Step 1: fetch an access token.
        token_url = f"{TOKEN_URL}?corpid={corpid}&corpsecret={secret}"
        status, parsed = _http.request_json("GET", token_url)
        if parsed.get("errcode") not in (0, None) or not parsed.get("access_token"):
            raise SinkError(parsed.get("errmsg") or f"WeCom token fetch failed: {parsed}")
        token = parsed["access_token"]

        # Step 2: send the markdown message. The 2048 cap is counted in bytes, so
        # truncate the UTF-8 encoding rather than the character count; "ignore"
        # drops a trailing character that the cut split in half.
        content = doc.markdown.encode("utf-8")[:2048].decode("utf-8", "ignore")
        send_url = f"{SEND_URL}?access_token={token}"
        payload = {
            "touser": touser,
            "msgtype": "markdown",
            "agentid": int(agentid),
            "markdown": {"content": content},
        }
        status, parsed = _http.request_json("POST", send_url, payload=payload)
        if parsed.get("errcode") not in (0, None):
            raise SinkError(parsed.get("errmsg") or f"WeCom send failed: {parsed}")

        return SinkResult(
            sink=self.name,
            ok=True,
            url=None,
            detail="markdown notification (WeCom markdown is a limited subset, "
                   "2048-byte cap, no inline images)",
        )
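Because the WeCom cap is stated in bytes, a character-count slice can overshoot for Chinese or other multi-byte text (one CJK character is three bytes in UTF-8). A minimal sketch of byte-aware truncation, assuming UTF-8 is the encoding that counts toward the cap; `truncate_utf8` is a hypothetical helper name.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Cut ``text`` so that its UTF-8 encoding fits in ``max_bytes`` bytes."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # errors="ignore" drops a trailing character that the cut split in half,
    # so the result is always valid text and never exceeds the cap.
    return encoded[:max_bytes].decode("utf-8", "ignore")


if __name__ == "__main__":
    body = "企" * 1000  # 1000 characters, 3000 bytes
    cut = truncate_utf8(body, 2048)
    print(len(cut), len(cut.encode("utf-8")))
```

A plain `body[:2048]` would leave this sample untouched at 3000 bytes, which is over the cap even though it is under 2048 characters.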
104
skills/developing/mineru/scripts/sinks/wps.py
Normal file
@ -0,0 +1,104 @@
"""WPS / 金山文档 (Kingsoft kdocs) sink — optional dependency.

The native ingestion path is: Markdown → ``.docx`` → upload to the kdocs cloud
appspace. There is no official Python SDK, so:

* Markdown→DOCX uses the maintained, pure-pip ``html-for-docx`` package
  (reusing this project's Markdown→HTML), lazily imported so the core stays
  zero-dependency. Install with ``pip install mineru-skill[wps]``.
* The kdocs WPS-2 request signing (plain SHA-1) and multipart upload are done
  with the standard library — small and fully documented.

Cloud upload requires an approved kdocs developer app (``WPS_APP_ID`` /
``WPS_APP_SECRET``) and a provisioned appspace; it is opt-in and surfaces the
raw kdocs error on failure. Docs: https://developer.kdocs.cn/server/guide/signature.html
"""

from __future__ import annotations

import email.utils
import hashlib
import io
import json

from . import _http, _md
from .base import ParsedDoc, Sink, SinkError, SinkResult, register

KDOCS_UPLOAD = "https://developer.kdocs.cn/api/v1/openapi/appspace/files/upload"


def _markdown_to_docx_bytes(markdown: str) -> bytes:
    """Convert Markdown → HTML → DOCX bytes via the optional html-for-docx lib."""
    try:
        from html4docx import HtmlToDocx  # pip install html-for-docx
    except ImportError as exc:  # pragma: no cover - exercised via SinkError path
        raise SinkError(
            "WPS sink needs a Markdown→DOCX converter — "
            "pip install 'mineru-skill[wps]' (i.e. pip install html-for-docx)"
        ) from exc
    html = _md.md_to_html(markdown)
    document = HtmlToDocx().parse_html_string(html)
    buf = io.BytesIO()
    document.save(buf)
    return buf.getvalue()


def _wps2_headers(app_id: str, app_secret: str, body: bytes, content_type: str) -> dict:
    """Build kdocs WPS-2 auth headers.

    signature = sha1(app_secret + content_md5 + content_type + date) hex.
    Content-Md5 / Content-Type must match the exact wire body and header sent.
    """
    content_md5 = hashlib.md5(body).hexdigest()
    date = email.utils.formatdate(usegmt=True)  # RFC1123 GMT
    signature = hashlib.sha1(
        (app_secret + content_md5 + content_type + date).encode("utf-8")
    ).hexdigest()
    return {
        "Date": date,
        "Content-Md5": content_md5,
        "Content-Type": content_type,
        "Authorization": f"WPS-2:{app_id}:{signature}",
    }


@register
class WpsSink(Sink):
    name = "wps"
    aliases = ("kdocs", "金山文档", "金山")
    requires = ("WPS_APP_ID", "WPS_APP_SECRET")
    label = "WPS / 金山文档 (Markdown→DOCX upload, optional dep)"

    def deliver(self, doc: ParsedDoc) -> SinkResult:
        app_id = self.env("WPS_APP_ID")
        app_secret = self.env("WPS_APP_SECRET")

        docx_bytes = _markdown_to_docx_bytes(doc.markdown)
        filename = _md.safe_filename(doc.title) + ".docx"

        fields = {}
        parent_path = self.env("WPS_PARENT_PATH")
        parent_token = self.env("WPS_PARENT_TOKEN")
        if parent_path:
            fields["parent_path"] = parent_path
        if parent_token:
            fields["parent_token"] = parent_token

        content_type, body = _http.encode_multipart(
            fields=fields, files=[("file", filename, docx_bytes)]
        )
        headers = _wps2_headers(app_id, app_secret, body, content_type)

        status, raw = _http.http_request("POST", KDOCS_UPLOAD, headers=headers, data=body)
        try:
            parsed = json.loads(raw.decode("utf-8")) if raw else {}
        except (ValueError, UnicodeDecodeError):
            parsed = {}
        if status >= 400 or parsed.get("code") not in (0, None):
            raise SinkError(parsed.get("message") or parsed.get("msg") or f"kdocs HTTP {status}")

        file_token = (parsed.get("data") or {}).get("file_token")
        return SinkResult(
            sink=self.name, ok=True, url=file_token,
            detail="Markdown→DOCX uploaded to 金山文档 (experimental; needs a provisioned appspace)",
        )
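The signing recipe in `_wps2_headers` depends on the current time, which makes it awkward to test. The sketch below restates the same formula, exactly as the docstring above gives it, with the date passed in so the result is repeatable. The function name `wps2_signature` and the demo secret, app id and body are hypothetical; whether kdocs accepts a given signature still depends on the server-side rules in its own documentation.

```python
import email.utils
import hashlib


def wps2_signature(app_secret: str, body: bytes, content_type: str, date: str) -> str:
    """sha1(app_secret + md5(body) + content_type + date), hex-encoded."""
    content_md5 = hashlib.md5(body).hexdigest()
    to_sign = app_secret + content_md5 + content_type + date
    return hashlib.sha1(to_sign.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # A fixed timestamp (the Unix epoch) keeps the demo output stable.
    date = email.utils.formatdate(0, usegmt=True)
    sig = wps2_signature("demo-secret", b"hello", "application/json", date)
    print(date)
    print(f"WPS-2:demo-app:{sig}")
```

Because the body hash and content type are part of the signed string, the headers must be computed from the exact bytes and `Content-Type` value that go on the wire, including the multipart boundary.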
65
skills/developing/mineru/scripts/sinks/yuque.py
Normal file
@ -0,0 +1,65 @@
"""Yuque (语雀) sink: create a Markdown doc in a repository via the open API.

Yuque's open API (``https://www.yuque.com/api/v2``) authenticates with an
``X-Auth-Token`` header and creates docs under a repository namespace. The body
is posted as raw Markdown.

Yuque's open API has no asset-upload endpoint, so local image refs are left
untouched — host images at a public URL for them to render.
"""

from __future__ import annotations

from pathlib import Path

from . import _http, _md
from .base import ParsedDoc, Sink, SinkError, SinkResult, register

API = "https://www.yuque.com/api/v2"


@register
class YuqueSink(Sink):
    name = "yuque"
    aliases = ("语雀",)
    requires = ("YUQUE_TOKEN", "YUQUE_NAMESPACE")
    label = "Yuque doc (open API)"

    def deliver(self, doc: ParsedDoc) -> SinkResult:
        token = self.env("YUQUE_TOKEN")
        namespace = self.env("YUQUE_NAMESPACE")
        headers = {
            "X-Auth-Token": token,
            "User-Agent": "MinerU-Skill/3.0",
            "Content-Type": "application/json",
        }

        base_dir = Path(doc.markdown_path).parent if doc.markdown_path else None
        n_images = len(_md.find_local_images(doc.markdown, base_dir))

        status, parsed = _http.request_json(
            "POST", f"{API}/repos/{namespace}/docs", headers=headers, payload={
                "title": doc.title,
                "slug": _md.slugify(doc.title),
                "public": 0,
                "format": "markdown",
                "body": doc.markdown,
            },
        )

        data = parsed.get("data")
        if not data:
            if status >= 400 or parsed.get("message"):
                raise SinkError(parsed.get("message") or f"HTTP {status}")
            raise SinkError(f"Yuque returned no doc data (HTTP {status})")

        slug = data.get("slug")
        if n_images:
            detail = f"text only ({n_images} local image(s); host images publicly to embed)"
        else:
            detail = "text only"
        return SinkResult(
            sink=self.name, ok=True,
            url=f"https://www.yuque.com/{namespace}/{slug}",
            detail=detail,
        )
64
skills/developing/mineru/scripts/splitter.py
Normal file
@ -0,0 +1,64 @@
"""Split oversized PDFs into cap-sized parts so they clear the MinerU API limits.

The MinerU cloud caps at 20 pages (free Agent API) / 200 pages (Standard API).
``--split`` slices a larger PDF into parts locally, each is parsed, and the
Markdown is merged back — so we are no longer bound by those page caps (the same
trick mineru-converter uses). Uses the optional ``pypdf`` library, lazily
imported, so the core stays zero-dependency.

    pip install "mineru-skill[split]"  # i.e. pip install pypdf
"""

from __future__ import annotations

from pathlib import Path


class SplitError(Exception):
    """Raised when splitting is requested but cannot be performed."""


def _load_pypdf():
    try:
        import pypdf  # noqa: F401
        return pypdf
    except ImportError as exc:
        raise SplitError(
            "--split needs the pypdf library — pip install 'mineru-skill[split]' "
            "(i.e. pip install pypdf)"
        ) from exc


def pdf_page_count(path) -> int:
    """Return the page count of a local PDF (requires pypdf)."""
    pypdf = _load_pypdf()
    return len(pypdf.PdfReader(str(path)).pages)


def split_pdf(path, max_pages: int, out_dir) -> list:
    """Slice ``path`` into ``max_pages``-page parts under ``out_dir``.

    Returns the list of part paths (a single-element list pointing at the original
    file if it already fits).
    """
    if max_pages < 1:
        raise SplitError("max_pages must be >= 1")
    pypdf = _load_pypdf()
    reader = pypdf.PdfReader(str(path))
    total = len(reader.pages)
    if total <= max_pages:
        return [Path(path)]

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = Path(path).stem
    parts = []
    for part_index, start in enumerate(range(0, total, max_pages), start=1):
        writer = pypdf.PdfWriter()
        for page in range(start, min(start + max_pages, total)):
            writer.add_page(reader.pages[page])
        part_path = out_dir / f"{stem}__part{part_index:03d}.pdf"
        with open(part_path, "wb") as handle:
            writer.write(handle)
        parts.append(part_path)
    return parts
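The page arithmetic in `split_pdf` can be checked without `pypdf` or a real PDF. The sketch below computes only the half-open page ranges each part would cover; `part_ranges` is a hypothetical helper, not a function in `splitter.py`.

```python
from typing import List, Tuple


def part_ranges(total_pages: int, max_pages: int) -> List[Tuple[int, int]]:
    """Return half-open (start, stop) page ranges, each at most ``max_pages`` long."""
    if max_pages < 1:
        raise ValueError("max_pages must be >= 1")
    return [
        (start, min(start + max_pages, total_pages))
        for start in range(0, total_pages, max_pages)
    ]


if __name__ == "__main__":
    # A 45-page document against the 20-page cap mentioned in the docstring.
    for index, (start, stop) in enumerate(part_ranges(45, 20), start=1):
        print(f"part{index:03d}: pages {start + 1}-{stop}")
```

Only the last part can be shorter than the cap, and a document that already fits yields a single range, which mirrors the early return in `split_pdf`.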
@ -1,6 +1,7 @@
 ---
 name: nfc-medicine-lookup
 description: 药品检索技能,通过NFC芯片ID或药品名称查询药品信息。当用户提交NFC芯片ID、扫描药品标签、提到药品名称想了解用法、或提到"NFC"+"药"相关词汇时使用此技能。以语音助手身份向老人介绍药名、用途和用法用量。
+category: Developer Tools
 ---
 
 # NFC 药品检索
@ -8,5 +8,6 @@
         "command": "python hooks/pre_prompt.py"
       }
     ]
-  }
+  },
+  "category": "Developer Tools"
 }
@ -17,5 +17,6 @@
         "./pmda_server.py"
       ]
     }
-  }
+  },
+  "category": "Developer Tools"
 }
@ -1,6 +1,8 @@
 ---
 name: ppt-outline
 description: "PPT outline and HTML presentation generator. PPT大纲、PPT模板、演示文稿、presentation、PowerPoint、幻灯片、slides、HTML演示文稿、HTML slides、浏览器演示、商业路演、pitch deck、BP商业计划书、business plan、工作汇报PPT、培训课件、课件大纲、产品介绍PPT、产品发布、keynote、演讲稿、述职PPT、答辩PPT、竞品分析PPT、毕业答辩、论文答辩、项目复盘、迭代复盘。Generate PPT outlines and standalone HTML presentations (open directly in browser, no dependencies). Use when: (1) creating PPT/presentation outlines, (2) building pitch deck/BP structures, (3) preparing work report slides, (4) designing training course outlines, (5) creating thesis defense PPT outlines, (6) building project review/retrospective PPTs, (7) generating HTML slide decks for browser-based presentations, (8) any PowerPoint/Keynote/Google Slides planning. 适用场景:做PPT大纲、写路演BP、汇报PPT结构、培训课件大纲、毕业答辩PPT、项目复盘PPT、述职答辩PPT、生成HTML演示文稿(浏览器直接打开,支持dark/light/tech/minimal四种风格)。"
+
+category: Document Processing
 ---
 
 # ppt-outline
@ -1,6 +1,7 @@
 ---
 name: rag-retrieve
 description: RAG retrieval skill for querying and retrieving relevant documents from knowledge base. Use this skill when users need to search documentation, retrieve knowledge base articles, or get context from a vector database. Supports semantic search with configurable top-k results.
+category: Data & Retrieval
 ---
 
 # RAG Retrieve
@ -18,5 +18,6 @@
         "{bot_id}"
       ]
     }
-  }
+  },
+  "category": "Data & Retrieval"
 }
@ -167,7 +167,12 @@ async def handle_request(request: Dict[str, Any]) -> Dict[str, Any]:
     top_k = arguments.get("top_k", 100)
 
     if not query:
-        return create_error_response(request_id, -32602, "Missing required parameter: query")
+        return create_success_response(request_id, {
+            "content": [{
+                "type": "text",
+                "text": "Error: missing required parameter 'query'. Please call this tool again with a non-empty 'query' argument describing what you want to retrieve."
+            }]
+        })
 
     result = rag_retrieve(query, top_k)
 
@ -1,6 +1,7 @@
 ---
 name: static-hosting
 description: Serve static HTML/CSS/JS/images from robot project directories via the built-in FastAPI static file server. Use when generating web pages, reports, or interactive content for a bot.
+category: Web Services
 ---
 
 # Static Hosting
137
skills/developing/table-query/SKILL.md
Normal file
@ -0,0 +1,137 @@
---
name: table-query
description: Query structured spreadsheet/table data (Excel/CSV) to answer questions about values, prices, quantities, inventory, specifications, rankings, comparisons, summaries, aggregations, lists, or any numeric/tabular lookup. Use this skill whenever the answer likely comes from uploaded tables. You locate tables, read their schema, author SQLite SQL yourself, and run it; the backend does no LLM work, so it is fast.
category: Data & Retrieval
---

# Table Query

Answer table/spreadsheet questions by authoring and running SQLite SQL against the
bot's uploaded Excel data. The backend is a thin, fast SQL executor: **you** do the
thinking (rewrite the question, pick tables, write SQL). Row-level citations
(`__src`) are produced for you.

## When to use

Use `table-query` for: values, prices, quantities, inventory, specifications,
rankings, comparisons, summaries, aggregations (sum/avg/count), lists, person /
project / product lookups, monthly/period totals, or any question whose answer
comes from structured tables. For pure concept / definition / policy / explanation
questions, use the `rag_retrieve` document tool instead.

## Workflow (do this in order, once)

1. **search-tables**: rewrite the user's question into a retrieval query (core
   entity + attributes + synonyms), then locate candidate tables. Call this **once**.
2. **get-schemas**: for the relevant subset of returned tables, fetch their
   `CREATE TABLE` schema and sample rows. Never write SQL without seeing the schema.
3. **author SQL**: write a SQLite query plan as JSON (see below).
4. **run-sql**: execute the plan. It returns CSV with an `__src` column and a
   `file_ref_table` mapping plus citation instructions.
5. **answer + cite**: write the answer and add `<CITATION ... />` tags built from
   `__src` + `file_ref_table`. Never print the `__src` column to the user.

### Anti-waste rules

- Call **search-tables at most once** per question. Do not re-locate tables you
  already have schemas for.
- If `run-sql` returns an error, fix the SQL and call **run-sql** again (at most ~2
  tries). Do **NOT** restart from search-tables.
- If `search-tables` finds nothing, fall back to the `rag_retrieve` document tool.

## Commands

```bash
# 1. locate tables
python {SKILL_DIR}/scripts/table_query.py search-tables --query "2025 April May June sales total" --top-k 20

# 2. read schema + sample rows for the tables you picked
python {SKILL_DIR}/scripts/table_query.py get-schemas --tables "sales_2025,customers"

# 3. run your authored plan: pipe the JSON plan via stdin (no temp file needed)
python {SKILL_DIR}/scripts/table_query.py run-sql <<'PLAN'
{"queries":[{"step":1,"sql":"CREATE TEMP TABLE \"final_table_step1\" AS SELECT \"month\", SUM(\"amount\") AS \"total\" FROM \"sales_2025\" GROUP BY \"month\"","source_table_names":["sales_2025"],"destine_table_name":"final_table_step1","destine_table_type":"final","destine_table_description":"Monthly totals"}]}
PLAN
```

## Authoring the SQL plan

The plan is a JSON object `{ "queries": [ ... ] }` that you pass to `run-sql` **on
stdin via a quoted heredoc** (`<<'PLAN' ... PLAN`). The quoted delimiter keeps all
the double quotes, single quotes and `$` in your SQL intact, so no shell escaping is needed.
(You may instead write it to a file and use `--plan-file path.json` if a plan is very
large, but stdin is the default and needs no extra step.)

Each query is one SQL step:

```json
{
  "queries": [
    {
      "step": 1,
      "sql": "CREATE TEMP TABLE \"final_table_step1\" AS SELECT \"month\", SUM(\"amount\") AS \"total\" FROM \"sales_2025\" WHERE \"month\" IN ('2025-04','2025-05','2025-06') GROUP BY \"month\"",
      "source_table_names": ["sales_2025"],
      "destine_table_name": "final_table_step1",
      "destine_table_type": "final",
      "destine_table_description": "Monthly sales totals for Apr-Jun 2025"
    }
  ]
}
```

Field meaning:
- `step`: 1-based execution order.
- `sql`: a SQLite statement, normally `CREATE TEMP TABLE "..." AS SELECT ...`.
- `source_table_names`: tables this step reads (original tables, or earlier steps'
  `destine_table_name` for multi-step plans).
- `destine_table_name`: the temp table this step creates. Convention:
  `intermediate_table_stepN` or `final_table_stepN`.
- `destine_table_type`: `"final"` for results the user should see, `"intermediate"`
  for helper steps. **At least one `final` is required.**
- `destine_table_description`: short human description of the result.
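
The plan structure above can be tried out locally before sending it to `run-sql`. The following is a minimal sketch, assuming Python's built-in `sqlite3` as a stand-in for the backend executor (the real endpoint's internals are not shown in this diff). `validate_plan` and `run_plan_locally` are hypothetical helpers, and the `sales_2025` rows are made-up sample data.

```python
import sqlite3

def validate_plan(plan):
    """Return the plan's steps sorted by `step`, enforcing the rules listed above."""
    queries = plan.get("queries") or []
    if not queries:
        raise ValueError("plan must contain a non-empty `queries` list")
    if not any(q.get("destine_table_type") == "final" for q in queries):
        raise ValueError("at least one step must be of type 'final'")
    return sorted(queries, key=lambda q: q["step"])

def run_plan_locally(conn, plan):
    """Run every step in order; return {final table name: rows}."""
    results = {}
    for q in validate_plan(plan):
        conn.execute(q["sql"])
        if q["destine_table_type"] == "final":
            name = q["destine_table_name"].replace('"', '""')
            results[q["destine_table_name"]] = conn.execute(
                f'SELECT * FROM "{name}"').fetchall()
    return results

# Made-up sample data standing in for an uploaded sheet (assumption).
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "sales_2025" ("month" TEXT, "amount" REAL)')
conn.executemany(
    'INSERT INTO "sales_2025" VALUES (?, ?)',
    [("2025-04", 100.0), ("2025-04", 50.0), ("2025-05", 70.0), ("2025-07", 10.0)],
)

plan = {"queries": [
    {"step": 1,
     "sql": 'CREATE TEMP TABLE "intermediate_table_step1" AS '
            'SELECT "month", "amount" FROM "sales_2025" '
            "WHERE \"month\" IN ('2025-04','2025-05','2025-06')",
     "source_table_names": ["sales_2025"],
     "destine_table_name": "intermediate_table_step1",
     "destine_table_type": "intermediate",
     "destine_table_description": "Apr-Jun rows only"},
    {"step": 2,
     "sql": 'CREATE TEMP TABLE "final_table_step2" AS '
            'SELECT "month", SUM("amount") AS "total" '
            'FROM "intermediate_table_step1" GROUP BY "month" ORDER BY "month"',
     "source_table_names": ["intermediate_table_step1"],
     "destine_table_name": "final_table_step2",
     "destine_table_type": "final",
     "destine_table_description": "Monthly totals for Apr-Jun 2025"},
]}

results = run_plan_locally(conn, plan)
print(results)
```

Only the `final` table is returned; the `intermediate` step exists solely so that step 2 can read it, which mirrors the ordering rule for multi-step plans.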

### SQL rules (important)

- **Quote every identifier** with double quotes: `"column name"`, `"table name"`.
- String literals use single quotes; escape `'` as `''`.
- Prefer **one logical result per `final` table**. For multiple separate results,
  emit multiple `final` tables (e.g. step1, step2); do **NOT** `UNION` unrelated results.
- For row-level citations to be precise, keep `final` steps as simple single-table
  `SELECT`s (no `JOIN` / `GROUP BY` / aggregation). Aggregations still work but the
  citation degrades to file+sheet level (`F1S2`) instead of an exact row (`F1S2R5`).
- Multi-step plans run in `step` order: build `intermediate_table_stepN` first, then
  read it in a later step. Don't reference a temp table before it is created.
- **Sample rows are a format hint only**: never assume they represent the full data
  or the row count. Your SQL must scan the whole table. Use `LIKE '%value%'` for free
  text and `=` for enums/codes.
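
The quoting rules above can be captured in two small helpers. This is an illustrative sketch, not part of the skill: `quote_ident`, `quote_literal` and `build_lookup` are hypothetical names, the `price list` table is made-up data, and doubling an embedded `"` inside an identifier is standard SQLite behaviour rather than something this document states. Note that `%` and `_` inside the search text still act as `LIKE` wildcards here.

```python
import sqlite3

def quote_ident(name):
    """Wrap an identifier in double quotes, doubling any embedded double quote."""
    return '"' + name.replace('"', '""') + '"'

def quote_literal(text):
    """Wrap a string literal in single quotes, escaping ' as ''."""
    return "'" + text.replace("'", "''") + "'"

def build_lookup(table, column, needle, exact=False):
    """Use `=` for enums/codes and LIKE '%value%' for free text."""
    if exact:
        cond = f"{quote_ident(column)} = {quote_literal(needle)}"
    else:
        cond = f"{quote_ident(column)} LIKE {quote_literal('%' + needle + '%')}"
    return f"SELECT * FROM {quote_ident(table)} WHERE {cond}"

# Made-up table whose names contain spaces and whose data contains a quote.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "price list" ("customer name" TEXT, "unit price" REAL)')
conn.executemany('INSERT INTO "price list" VALUES (?, ?)',
                 [("O'Brien Ltd", 9.5), ("Acme", 3.0)])

sql = build_lookup("price list", "customer name", "O'Brien")
rows = conn.execute(sql).fetchall()
print(sql)
print(rows)
```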

## Result handling & citations

- `run-sql` output begins with citation instructions, then `file_ref_table`, then the
  result CSV (with `__src`).
- Parse `__src` (`F1S2R5` = file_ref F1, sheet 2, row 5) and `file_ref_table` to build
  `<CITATION file="..." filename="..." sheet=N rows=[...] />`.
- Put citations on their own line **after** the list/table that uses the data; combine
  same-(file,sheet) rows into one citation.
- If the result hint says rows were truncated (`Only the first N rows ...; the
  remaining M ...`), tell the user the total (`N+M`), shown (`N`), and omitted (`M`).
- Never expose the `__src` column itself to the user.
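
The grouping rule for citations can be sketched as follows. This is a hedged illustration only: `parse_src` and `build_citations` are hypothetical helpers, the shape of `file_ref_table` (a mapping from `F1` to a `file_id` and `filename`) is an assumption because the real payload format is not shown here, and omitting `rows` for sheet-level tokens such as `F1S2` is likewise an assumption.

```python
import re

# F<file ref>S<sheet>[R<row>], e.g. F1S2R5 or the sheet-level F1S2.
SRC_PATTERN = re.compile(r"^(F\d+)S(\d+)(?:R(\d+))?$")

def parse_src(token):
    """Split an `__src` token into (file_ref, sheet, row or None)."""
    m = SRC_PATTERN.match(token.strip())
    if m is None:
        raise ValueError(f"unrecognised __src token: {token!r}")
    row = int(m.group(3)) if m.group(3) is not None else None
    return m.group(1), int(m.group(2)), row

def build_citations(src_tokens, file_ref_table):
    """Emit one <CITATION> tag per (file, sheet), combining its rows."""
    grouped = {}  # (file_ref, sheet) -> set of rows, kept in first-seen order
    for token in src_tokens:
        file_ref, sheet, row = parse_src(token)
        rows = grouped.setdefault((file_ref, sheet), set())
        if row is not None:
            rows.add(row)
    tags = []
    for (file_ref, sheet), rows in grouped.items():
        info = file_ref_table[file_ref]
        tag = f'<CITATION file="{info["file_id"]}" filename="{info["filename"]}" sheet={sheet}'
        if rows:
            tag += f" rows={sorted(rows)}"
        tags.append(tag + " />")
    return tags

# Hypothetical file_ref_table and tokens, for illustration only.
file_ref_table = {
    "F1": {"file_id": "abc123", "filename": "sales.xlsx"},
    "F2": {"file_id": "def456", "filename": "stock.xlsx"},
}
tags = build_citations(["F1S2R5", "F1S2R2", "F2S1R3", "F1S2R5"], file_ref_table)
print("\n".join(tags))
```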

### Controlling truncation

`run-sql` truncates results by default (total rows and per-cell characters) to keep
the context manageable. If a result comes back truncated and you genuinely need more,
re-run with higher limits; do **not** re-run search-tables:

```bash
python {SKILL_DIR}/scripts/table_query.py run-sql --max-rows 500 --cell-max 4000 <<'PLAN'
{"queries":[ ... ]}
PLAN
```

- `--max-rows`: max total rows across all `final` tables (default from backend config,
  hard ceiling 2000). Prefer writing an aggregate query (SUM/COUNT/GROUP BY) over
  pulling thousands of detail rows.
- `--cell-max`: max characters per cell before it is truncated with `..` (default from
  backend config, hard ceiling 10000). Raise this when a long-text column (e.g. a
  description/spec field) is getting cut off.
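
As a small illustration of the limits described above, the sketch below assembles the `run-sql` flags while staying under the documented ceilings, and phrases the truncation summary a user should see. `run_sql_args` and `truncation_note` are hypothetical helpers; whether the backend clamps or rejects values above the ceiling is not stated here, so clamping on the client side is an assumption.

```python
MAX_ROWS_CEILING = 2000    # hard ceiling quoted for --max-rows
CELL_MAX_CEILING = 10000   # hard ceiling quoted for --cell-max

def run_sql_args(max_rows=None, cell_max=None):
    """Build the argument list for run-sql, capping each limit at its ceiling."""
    args = ["run-sql"]
    if max_rows is not None:
        args += ["--max-rows", str(min(max_rows, MAX_ROWS_CEILING))]
    if cell_max is not None:
        args += ["--cell-max", str(min(cell_max, CELL_MAX_CEILING))]
    return args

def truncation_note(shown, omitted):
    """Report total (N+M), shown (N) and omitted (M) rows."""
    total = shown + omitted
    return f"{total} rows matched; {shown} are shown and {omitted} were omitted."

print(run_sql_args(max_rows=5000, cell_max=4000))
print(truncation_note(100, 250))
```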
213
skills/developing/table-query/scripts/table_query.py
Executable file
@ -0,0 +1,213 @@
#!/usr/bin/env python3
"""
table-query CLI.

Fast, LLM-free table querying. Talks to the felo-mygpt table_query endpoints:
  - search-tables : POST /v1/table_query/search_tables/{bot_id}
  - get-schemas : POST /v1/table_query/get_schemas/{bot_id}
  - run-sql : POST /v1/table_query/run_sql/{bot_id}

The agent drives the orchestration (rewrite -> locate -> author SQL -> run);
the backend only does cheap work, so each call returns in seconds.
"""

import argparse
import hashlib
import json
import os
import sys

try:
    import requests
except ImportError:
    print("Error: requests module is required. Please install it with: pip install requests")
    sys.exit(1)

DEFAULT_BACKEND_HOST = os.getenv("BACKEND_HOST", "https://api-dev.gptbase.ai")
DEFAULT_MASTERKEY = os.getenv("MASTERKEY", "master")

# Same citation contract the legacy table_rag_retrieve used, so the agent's
# <CITATION ... /> behaviour is unchanged.
TABLE_CITATION_INSTRUCTIONS = """<CITATION_INSTRUCTIONS>
When using the retrieved table knowledge below, you MUST add XML citation tags for factual claims.

Format: `<CITATION file="file_id" filename="name.xlsx" sheet=1 rows=[2, 4] />`
- Parse `__src`: `F1S2R5` = file_ref F1, sheet 2, row 5
- Look up file_id in `file_ref_table`
- Combine same-sheet rows into one citation: `rows=[2, 4, 6]`
- MANDATORY: Create SEPARATE citation for EACH (file, sheet) combination
- NEVER put <CITATION> on the same line as a bullet point or table row
- Citations MUST be on separate lines AFTER the complete list/table
- NEVER include the `__src` column in your response - it is internal metadata only
- Citations MUST appear IMMEDIATELY AFTER the paragraph or bullet list that uses the knowledge
- NEVER collect all citations and place them at the end of your response
</CITATION_INSTRUCTIONS>
"""


def load_config() -> dict:
    """Load robot_config.json from the robot project root (3 levels up from scripts/)."""
    config_path = os.path.join(os.path.dirname(__file__), '..', '..', '..', 'robot_config.json')
    if os.path.exists(config_path):
        try:
            with open(config_path, 'r', encoding='utf-8') as f:
                return json.load(f)
        except (json.JSONDecodeError, IOError) as e:
            print(f"Warning: failed to load robot_config.json: {e}", file=sys.stderr)
    return {}


def _resolve_bot_id(cli_bot_id: str) -> str:
    if cli_bot_id:
        return cli_bot_id
    return load_config().get('bot_id') or os.getenv("BOT_ID") or os.getenv("ASSISTANT_ID")


def _post(path: str, bot_id: str, payload: dict) -> dict:
    url = f"{DEFAULT_BACKEND_HOST}/v1/table_query/{path}/{bot_id}"
    auth_token = hashlib.md5(f"{DEFAULT_MASTERKEY}:{bot_id}".encode()).hexdigest()
    headers = {
        "content-type": "application/json",
        "authorization": f"Bearer {auth_token}",
    }
    trace_id = os.getenv("TRACE_ID") or os.getenv("X_REQUEST_ID")
    if trace_id:
        headers["X-Request-ID"] = trace_id
    resp = requests.post(url, json=payload, headers=headers, timeout=30)
    if resp.status_code != 200:
        raise RuntimeError(f"API {path} returned {resp.status_code}: {resp.text}")
    return resp.json()


def cmd_search_tables(args, bot_id: str) -> str:
    res = _post("search_tables", bot_id, {"query": args.query, "top_k": args.top_k})
    tables = res.get("tables", [])
    if not tables:
        return ("No matching tables found. If the question may be answered from documents "
                "instead of spreadsheets, fall back to the rag_retrieve document tool.")
    lines = [f"Found {len(tables)} candidate table(s). Pick the relevant ones and call "
             f"`get-schemas` for them next.\n"]
    for t in tables:
        lines.append(
            f"- table_name: {t['table_name']}\n"
            f" file: {t.get('file_name','')} | sheet: {t.get('sheet_name','')} "
            f"| score: {round(t.get('score', 0), 3)}\n"
            f" description: {t.get('table_description','')}"
        )
    return "\n".join(lines)


def cmd_get_schemas(args, bot_id: str) -> str:
    table_names = [t.strip() for t in args.tables.split(',') if t.strip()]
    res = _post("get_schemas", bot_id,
                {"table_names": table_names, "sample_rows": args.sample_rows})
    schemas = res.get("schemas", [])
    missing = res.get("missing_tables", [])
    if not schemas:
        return f"No schemas resolved. Missing tables: {missing}"
    blocks = []
    for s in schemas:
        block = [f"### Table: {s['table_name']}",
                 f"File: {s.get('file_name','')} | Sheet: {s.get('sheet_name','')}",
                 "```sql", s.get('sql_create', ''), "```"]
        sample = s.get('sample_rows') or []
        if sample:
            block.append("Sample rows (format hint only, NOT the row count):")
            block.append("```csv")
            for row in sample:
                block.append(",".join('"' + str(c).replace('"', '""') + '"' for c in row))
            block.append("```")
        blocks.append("\n".join(block))
    out = "\n\n".join(blocks)
    if missing:
        out += f"\n\nNote: these requested tables were not found: {missing}"
    out += ("\n\nNow author a SQLite plan and run it by piping the JSON to run-sql on stdin:\n"
            " run-sql <<'PLAN'\n"
            " {\"queries\": [{\"step\": 1, \"sql\": \"CREATE TEMP TABLE \\\"final_table_step1\\\" "
            "AS SELECT ...\", \"source_table_names\": [\"...\"], "
            "\"destine_table_name\": \"final_table_step1\", \"destine_table_type\": \"final\"}]}\n"
            " PLAN\n"
            "Quote all identifiers with double quotes.")
    return out


def cmd_run_sql(args, bot_id: str) -> str:
    # Read the plan from --plan-file if given, otherwise from stdin (heredoc).
    try:
        if args.plan_file:
            with open(args.plan_file, 'r', encoding='utf-8') as f:
                raw = f.read()
        else:
            raw = sys.stdin.read()
        if not raw.strip():
            return ("Error: no plan provided. Pipe the JSON plan via stdin, e.g.\n"
                    " python scripts/table_query.py run-sql <<'PLAN'\n"
                    " {\"queries\": [...]}\n"
                    " PLAN")
        plan = json.loads(raw)
    except (json.JSONDecodeError, IOError) as e:
        return f"Error: failed to read SQL plan: {e}"
    # accept either {"queries": [...]} or a bare [...] list
    queries = plan.get("queries") if isinstance(plan, dict) else plan
    if not queries:
        return "Error: the plan must contain a non-empty `queries` list."
    payload = {"queries": queries}
    if args.max_rows is not None:
        payload["max_rows"] = args.max_rows
    if args.cell_max is not None:
        payload["cell_max"] = args.cell_max
    res = _post("run_sql", bot_id, payload)
    if not res.get("success"):
        return (f"SQL execution failed: {res.get('error')}\n"
                "Fix your SQL and call run-sql again. Do NOT restart from search-tables.")
    parts = [TABLE_CITATION_INSTRUCTIONS]
    if res.get("instruction"):
        parts.append(res["instruction"])
    if res.get("knowledge"):
        parts.append(res["knowledge"])
    if res.get("extra_goal"):
        parts.append(res["extra_goal"])
    return "\n".join(parts)


def main():
    parser = argparse.ArgumentParser(description="table-query: fast LLM-free table querying")
    parser.add_argument("--bot-id", default=None, help="Bot id (defaults to robot_config.json)")
    sub = parser.add_subparsers(dest="command", required=True)

    p_search = sub.add_parser("search-tables", help="Vector-locate relevant tables")
    p_search.add_argument("--query", "-q", required=True, help="Rewritten retrieval query")
    p_search.add_argument("--top-k", "-k", type=int, default=20)

    p_schemas = sub.add_parser("get-schemas", help="Fetch CREATE TABLE schema + sample rows")
    p_schemas.add_argument("--tables", "-t", required=True, help="Comma-separated table names")
    p_schemas.add_argument("--sample-rows", type=int, default=3)

    p_run = sub.add_parser("run-sql", help="Execute an authored SQL plan (JSON via stdin or file)")
    p_run.add_argument("--plan-file", "-f", default=None,
                       help="Path to plan JSON file (optional; defaults to reading stdin)")
    p_run.add_argument("--max-rows", type=int, default=None,
                       help="Max total result rows (raise if a result came back truncated)")
    p_run.add_argument("--cell-max", type=int, default=None,
                       help="Max characters per cell before truncation")

    args = parser.parse_args()
    bot_id = _resolve_bot_id(args.bot_id)
    if not bot_id:
        print("Error: bot_id is required (robot_config.json / --bot-id / BOT_ID env)")
        sys.exit(1)

    try:
        if args.command == "search-tables":
            print(cmd_search_tables(args, bot_id))
        elif args.command == "get-schemas":
            print(cmd_get_schemas(args, bot_id))
        elif args.command == "run-sql":
            print(cmd_run_sql(args, bot_id))
    except Exception as e:
        print(f"Error: {str(e)}")
        sys.exit(1)


if __name__ == "__main__":
    main()
25
skills/developing/table-query/skill.yaml
Normal file
@ -0,0 +1,25 @@
name: table-query
version: 1.0.0
description: Fast LLM-free table querying. Locate tables, fetch schema, author SQLite SQL, and run it with row-level citations.
author:
  name: sparticle
  email: support@gbase.ai
license: MIT
tags:
  - table
  - sql
  - excel
  - retrieval
  - citation
runtime:
  python: ">=3.7"
  dependencies:
    - requests
entry_point: scripts/table_query.py
commands:
  search-tables:
    description: Vector-locate relevant tables for a query
  get-schemas:
    description: Fetch CREATE TABLE schema + sample rows for given tables
  run-sql:
    description: Execute an authored SQLite plan and return CSV with __src citations
67
skills/developing/table-query/verify_table_query.sh
Executable file
@ -0,0 +1,67 @@
#!/usr/bin/env bash
#
# Manual verification for the new table_query endpoints.
# Run this against an environment where the feature/table-query-split branch is
# deployed (e.g. dev). It calls search_tables directly, then prints ready-to-edit
# curl commands for get_schemas, run_sql and the legacy table_rag_retrieve, so the
# run_sql output can be compared by hand against the legacy endpoint for parity.
#
# Usage:
#   HOST=https://api-dev.gptbase.ai BOT_ID=<bot> MASTERKEY=master ./verify_table_query.sh
#
set -euo pipefail

HOST="${HOST:-https://api-dev.gptbase.ai}"
# bot from the slow-request log (has the 案1_売上明細 xlsx). Override as needed.
BOT_ID="${BOT_ID:-c1fa021b-6c41-41d5-b1e6-adfb8896aaaa}"
MASTERKEY="${MASTERKEY:-master}"
QUERY="${QUERY:-2025年4月〜6月の売上実績}"

# auth token = MD5(masterkey:bot_id)
TOKEN=$(python3 -c "import hashlib,sys;print(hashlib.md5(f'{sys.argv[1]}:{sys.argv[2]}'.encode()).hexdigest())" "$MASTERKEY" "$BOT_ID")
AUTH="authorization: Bearer ${TOKEN}"
CT="content-type: application/json"

echo "=== HOST=$HOST BOT_ID=$BOT_ID ==="

echo
echo "### 1) search_tables ###"
curl -s --request POST "$HOST/v1/table_query/search_tables/$BOT_ID" \
  --header "$AUTH" --header "$CT" \
  --data "{\"query\": \"$QUERY\", \"top_k\": 20}" | python3 -m json.tool

echo
echo "### 2) get_schemas (EDIT --data table_names with names from step 1) ###"
echo "curl -s --request POST \"$HOST/v1/table_query/get_schemas/$BOT_ID\" \\"
echo " --header \"$AUTH\" --header \"$CT\" \\"
echo " --data '{\"table_names\": [\"<TABLE_NAME_FROM_STEP_1>\"], \"sample_rows\": 3}' | python3 -m json.tool"

echo
echo "### 3) run_sql (EDIT the sql to match the real table/columns from step 2) ###"
cat > /tmp/tq_plan.json <<'JSON'
{
  "queries": [
    {
      "step": 1,
      "sql": "CREATE TEMP TABLE \"final_table_step1\" AS SELECT \"計上日\", \"得意先名\", \"売上金額\" FROM \"<TABLE_NAME>\" LIMIT 10",
      "source_table_names": ["<TABLE_NAME>"],
      "destine_table_name": "final_table_step1",
      "destine_table_type": "final",
      "destine_table_description": "sample rows"
    }
  ]
}
JSON
echo "Edit /tmp/tq_plan.json (replace <TABLE_NAME>), then:"
echo "curl -s --request POST \"$HOST/v1/table_query/run_sql/$BOT_ID\" \\"
echo " --header \"$AUTH\" --header \"$CT\" \\"
echo " --data @/tmp/tq_plan.json | python3 -m json.tool"
echo
echo "ASSERT: run_sql output 'knowledge' contains a '__src' column and 'file_ref_table'."

echo
echo "### 4) legacy table_rag_retrieve (parity reference, same question) ###"
echo "curl -s --request POST \"$HOST/v1/table_rag_retrieve/$BOT_ID\" \\"
echo " --header \"$AUTH\" --header \"$CT\" \\"
echo " --data '{\"query\": \"$QUERY\"}' | python3 -m json.tool"
echo
echo "Compare the __src tokens / result rows between #3 and #4 for the same SQL intent."
@ -30,8 +30,11 @@
   "mcpServers": {
     "user-context-example": {
       "command": "echo",
-      "args": ["Example MCP server for user context loader"],
+      "args": [
+        "Example MCP server for user context loader"
+      ],
       "comment": "这是一个示例 MCP 配置,实际使用时替换为真实的 MCP 服务器"
     }
-  }
+  },
+  "category": "Developer Tools"
 }
@ -8,6 +8,7 @@ metadata:
|
||||
bins:
|
||||
- python3
|
||||
- google-chrome
|
||||
category: Creative Generation
|
||||
---
|
||||
|
||||
# z-card-image
|
||||
|
||||
@ -2,12 +2,13 @@
   "name": "baidu-search",
   "description": "百度搜索服务",
   "mcpServers": {
-    "web-search-mcp-server": {
-      "transport": "http",
-      "url": "https://qianfan.baidubce.com/v2/tools/web-search/mcp",
-      "headers": {
-        "Authorization": "Bearer {BAIDU_API_KEY}"
-      }
+    "web-search-mcp-server": {
+      "transport": "http",
+      "url": "https://qianfan.baidubce.com/v2/tools/web-search/mcp",
+      "headers": {
+        "Authorization": "Bearer {BAIDU_API_KEY}"
+      }
     }
-  }
+  },
+  "category": "Search & Intelligence"
 }
@ -2,6 +2,7 @@
|
||||
name: baidu-search
|
||||
description: Search the web using Baidu AI Search Engine (BDSE). Use for live information, documentation, or research topics.
|
||||
metadata: { "openclaw": { "emoji": "🔍︎", "requires": { "bins": ["python3"], "env":["BAIDU_API_KEY"]},"primaryEnv":"BAIDU_API_KEY" } }
|
||||
category: Search & Intelligence
|
||||
---
|
||||
|
||||
# Baidu Search
|
||||
|
||||
@ -9,6 +9,7 @@ description: |
|
||||
- 用户要求 bot 安装、启用、禁用或卸载技能时(如"帮我装上这个技能包"、"把 XX 技能关掉") → 管理技能列表
|
||||
- 用户要求 bot 配置 API 密钥或运行参数时(如"把 JINA_API_KEY 设置成 xxx") → 修改环境变量
|
||||
- bot 需要自主进化、动态调整自身能力边界的自动化场景
|
||||
category: Developer Tools
|
||||
---
|
||||
|
||||
# Bot Self-Modifier
|
||||
|
||||
@ -13,6 +13,7 @@ metadata:
|
||||
"primaryEnv": "CAIYUN_WEATHER_API_TOKEN",
|
||||
},
|
||||
}
|
||||
category: Weather
|
||||
---
|
||||
|
||||
# 彩云天气 (Caiyun Weather)
|
||||
|
||||
@ -1,6 +1,7 @@
|
||||
---
|
||||
name: competitor-news-intel
|
||||
description: Research competitor news, organize developments by company and theme, and produce actionable competitive intelligence with impact assessment and follow-up recommendations. Use when the user asks for competitor monitoring, competitor news tracking, market watch summaries, or business intelligence from external updates. 中文触发词包括:竞品跟踪、竞对情报、竞品新闻、市场监听、舆情观察、竞品周报、最近竞品有什么动作。
|
||||
category: Search & Intelligence
|
||||
---
|
||||
|
||||
# Competitor News Intelligence
|
||||
|
||||
@ -1,6 +1,7 @@
|
||||
---
|
||||
name: contract-document-generator
|
||||
description: Draft contracts and formal business documents, rewrite clauses, identify risks, and organize negotiation-ready language. Use when the user asks for contract drafting, clause revision, legal-style document generation, formal agreement structuring, or document-ready policy and terms content. 中文触发词包括:合同起草、协议生成、条款修改、风险审查、保密协议、正式文档撰写。
|
||||
category: Writing & Reporting
|
||||
---
|
||||
|
||||
# Contract & Document Generator
|
||||
|
||||
@ -1,6 +1,7 @@
|
||||
---
|
||||
name: financial-report-generator
|
||||
description: Generate management-friendly financial reporting outputs from structured financial data, including KPI summaries, variance analysis, risk notes, and reporting narratives. Use when the user asks for financial reports, management reporting, monthly or quarterly performance summaries, or finance-oriented document generation. 中文触发词包括:财务月报、财务季报、经营分析、管理层汇报、董事会报告、财务简报。
|
||||
category: Writing & Reporting
|
||||
---
|
||||
|
||||
# Financial Report Generator
|
||||
|
||||
@ -1,6 +1,7 @@
|
||||
---
|
||||
name: market-academic-insight
|
||||
description: Generate structured market research and academic insight briefs with clear evidence, trends, risks, and opportunities. Use when the user asks for industry research, market trends, literature review, academic progress tracking, or evidence-based insight synthesis. 中文触发词包括:行业洞察、市场研究、学术综述、论文进展、趋势分析、研究简报。
|
||||
category: Search & Intelligence
|
||||
---
|
||||
|
||||
# Market & Academic Insight
|
||||
|
||||
@ -18,5 +18,6 @@
|
||||
"X-Dataset-Ids": "{dataset_ids}"
|
||||
}
|
||||
}
|
||||
}
|
||||
},
|
||||
"category": "Data & Retrieval"
|
||||
}
|
||||
|
||||
@ -1,6 +1,7 @@
|
||||
---
|
||||
name: sales-decision-report
|
||||
description: Analyze sales data and produce decision-oriented reports with KPI summaries, anomaly explanation, channel and region analysis, and HTML-ready report structure. Use when the user asks for sales analysis, management dashboards, sales summaries, or decision reports from business data. 中文触发词包括:销售分析、经营分析、销售周报、销售月报、数据决策报告、HTML 报表。
|
||||
category: Writing & Reporting
|
||||
---
|
||||
|
||||
# Sales Decision Report
|
||||
|
||||
@ -1,6 +1,7 @@
|
||||
---
|
||||
name: seedream
|
||||
description: 使用火山引擎 Seedream/Seedance API 生成高质量图片和视频。适用于文生图、图生图、文生视频、图生视频以及生成关联组图的场景。
|
||||
category: Creative Generation
|
||||
---
|
||||
|
||||
# Seedream
|
||||
|
||||
@ -1,6 +1,7 @@
|
||||
---
|
||||
name: static-hosting
|
||||
description: Serve static HTML/CSS/JS/images from robot project directories via the built-in FastAPI static file server. Use when generating web pages, reports, or interactive content for a bot.
|
||||
category: Web Services
|
||||
---
|
||||
|
||||
# Static Hosting
|
||||
|
||||
@ -17,6 +17,7 @@ triggers:
|
||||
- 从服务器下载
|
||||
- 浏览服务器文件
|
||||
- 读取服务器文件
|
||||
category: Web Services
|
||||
---
|
||||
|
||||
# Static Site Deploy
|
||||
|
||||
@ -1,6 +1,7 @@
|
||||
---
|
||||
name: voice-notification
|
||||
description: Voice Notification - Push voice broadcast messages to active voice sessions for real-time TTS playback
|
||||
category: Communication
|
||||
---
|
||||
|
||||
# Voice Notification - Voice Broadcast
|
||||
|
||||
@ -5,6 +5,7 @@ version: 1.0.2
|
||||
tags: [weather, china, forecast, chinese, weather-cn, life-index, 7day-forecast]
|
||||
metadata: {"openclaw":{"emoji":"🌤️","requires":{"bins":["python3"]}}}
|
||||
allowed-tools: [exec]
|
||||
category: Weather
|
||||
---
|
||||
|
||||
# 中国天气预报查询 (China Weather)
|
||||
|
||||
@ -1,6 +1,7 @@
|
||||
---
|
||||
name: kfs-answer
|
||||
description: Primary skill for answering ALL questions about the datasets knowledge base. Search files, run queries (SQL / markdown), and return answers with citations. MUST be used first for any data-related question.
|
||||
category: Data & Retrieval
|
||||
---
|
||||
|
||||
# kfs-answer
|
||||
|
||||
@ -18,5 +18,6 @@
|
||||
"{bot_id}"
|
||||
]
|
||||
}
|
||||
}
|
||||
},
|
||||
"category": "Data & Retrieval"
|
||||
}
|
||||
|
||||
@ -193,7 +193,12 @@ async def handle_request(request: Dict[str, Any]) -> Dict[str, Any]:
     top_k = arguments.get("top_k", 100)

     if not query:
-        return create_error_response(request_id, -32602, "Missing required parameter: query")
+        return create_success_response(request_id, {
+            "content": [{
+                "type": "text",
+                "text": "Error: missing required parameter 'query'. Please call this tool again with a non-empty 'query' argument describing what you want to retrieve."
+            }]
+        })

     result = rag_retrieve(query, top_k)
@ -1,6 +1,7 @@
|
||||
---
|
||||
name: board-meeting-pack-helper
|
||||
description: Assemble board-meeting materials into a coherent pack with agenda logic, board-level KPIs, strategic risks, governance context, and decision-ready content. Use this whenever users ask for board materials, board pack, board meeting agenda, governance updates, director pre-read, 取締役会資料, or resolution-ready content for executive or board review; use it for board-level governance materials, not for generic executive one-pagers.
|
||||
category: Writing & Reporting
|
||||
---
|
||||
|
||||
# Board Meeting Pack Helper
|
||||
|
||||
@ -1,6 +1,7 @@
|
||||
---
|
||||
name: customer-reply-tone
|
||||
description: Rewrite customer-facing replies in the right tone while preserving factual accuracy, accountability, and clear next steps across sensitive support, delivery, and account situations. Use this whenever users ask to soften, professionalize, de-escalate, polish, or reframe a customer email or chat response, including complaint reply, support response polish, or クレーム返信; use it for reply rewriting and de-escalation, not for sales follow-up or general Japanese business writing.
|
||||
category: Writing & Reporting
|
||||
---
|
||||
|
||||
# Customer Reply Tone
|
||||
|
||||
@ -1,6 +1,7 @@
|
||||
---
|
||||
name: exec-brief-1pager
|
||||
description: Turn complex business, product, and operational topics into a one-page executive brief with decision-ready insights, options, and recommended actions. Use this whenever users ask for an executive summary, leadership brief, one-pager, decision memo, CEO brief, or key points at a glance for senior leadership; use it for one-page decision support, not for recurring status updates or board meeting packs.
|
||||
category: Writing & Reporting
|
||||
---
|
||||
|
||||
# Exec Brief 1Pager
|
||||
|
||||
@ -1,6 +1,7 @@
|
||||
---
|
||||
name: incident-postmortem-ja
|
||||
description: Create structured postmortems and 障害報告書 for incidents, outages, and service failures with clear timelines, root-cause analysis, and preventive actions. Use this whenever users ask for an incident report, postmortem, RCA, incident review, 障害報告, 障害報告書, 振り返り, or 再発防止計画 focused on system and process improvement; use it for formal incident analysis, not for routine status updates or personal blame.
|
||||
category: Writing & Reporting
|
||||
---
|
||||
|
||||
# Incident Postmortem JA
|
||||
|
||||
@ -1,6 +1,7 @@
|
||||
---
|
||||
name: japan-compliance-checker
|
||||
description: Review Japan-specific compliance risks in business text, campaign copy, contracts, and operating processes with clear, practical screening guidance. Use this whenever users ask for Japan compliance review, legal review, regulatory check, 法務チェック, コンプラ確認, 契約レビュー, or 広告審査 within the v1 scope of APPI, 景品表示法, and 下請法; use it for risk screening rather than drafting, anonymization, or legal advice.
|
||||
category: Compliance & Security
|
||||
---
|
||||
|
||||
# Japan Compliance Checker
|
||||
|
||||
@ -1,6 +1,7 @@
|
||||
---
|
||||
name: japanese-business-writer
|
||||
description: Draft and polish formal Japanese business writing for emails, notices, request letters, cover notes, and workplace communication with clear structure and appropriate 敬語. Use this whenever users ask for Japanese business writing, formal JP writing, 敬語 polishing, 文面添削, 依頼メール, 案内文, 送付状, or 社内通知; use it for writing quality and business tone, not for compliance review or complaint de-escalation.
|
||||
category: Writing & Reporting
|
||||
---
|
||||
|
||||
# Japanese Business Writer
|
||||
|
||||
@ -1,6 +1,7 @@
|
||||
---
|
||||
name: japanese-pii-redactor
|
||||
description: Redact, anonymize, and de-identify personal information in Japanese-language or mixed-language text and tabular data while preserving analytical usefulness. Use this whenever users ask for PII redaction, PII scrub, de-identification, 個人情報匿名化, 匿名加工, 仮名化, 秘匿化, or マスキング; use it for executing anonymization rules, not for legal interpretation or general writing polish.
|
||||
category: Compliance & Security
|
||||
---
|
||||
|
||||
# Japanese PII Redactor
|
||||
|
||||
@ -1,6 +1,7 @@
|
||||
---
|
||||
name: kfs-answer
|
||||
description: Primary skill for answering ALL questions about the datasets knowledge base. Search files, run queries (SQL / markdown), and return answers with citations. MUST be used first for any data-related question.
|
||||
category: Data & Retrieval
|
||||
---
|
||||
|
||||
# kfs-answer
|
||||
|
||||
Some files were not shown because too many files have changed in this diff.