Merge branch 'feature/multimodal-image-input' into developing

This commit is contained in:
朱潮 2026-06-18 12:54:31 +08:00
commit aabb0ad072
6 changed files with 183 additions and 12 deletions

View File

@ -1,6 +1,7 @@
# Skill 功能 # Skill 功能
> 负责范围:技能包管理服务 - 核心实现 > 负责范围:技能包管理服务 - 核心实现
> 最后更新2026-06-07
> 最后更新2026-06-02 > 最后更新2026-06-02
> 最后更新2026-05-26 > 最后更新2026-05-26
> 最后更新2026-05-23 > 最后更新2026-05-23
@ -33,6 +34,9 @@ MCP UI 类 skill 已按 MCP Apps 模式改造:工具返回数据,静态 HTML
## 最近重要事项 ## 最近重要事项
- [2026-06-07](changelog/2026-Q2.md): 新增 `table-query` skill——`skills/developing/table-query/`SQLite 表查询三步式工作流search-tables → get-schemas → run-sqlrun-sql 接收 JSON plan via stdin heredoc 避免 shell 转义;`category: Data & Retrieval``bb74aee`
- [2026-06-05](changelog/2026-Q2.md): 新增 `mineru` skill——`skills/developing/mineru/`PDF/Office/图片转 Markdown自动路由免 token 的 Agent API 与需 token 的 Standard API附 15 个目标 sinknotion/feishu/slack/...`category: Document Processing``b618cb1`
- [2026-06-05](changelog/2026-Q2.md): `rag-retrieve` 系列空 query 由 JSON-RPC error 改为 success-with-error-text——5 个 server 变体autoload/onprem、autoload/support、developing/rag-retrieve-no-citation、onprem/rag-retrieve-only、support/rag-retrieve-only统一改用 `create_success_response` 携带 `"Error: missing required parameter 'query'..."` 文案,让 agent 在 tool output 里看到错误并自纠错重试(`ecf332a`
- [2026-06-02](changelog/2026-Q2.md): 修复 deepagents 落盘 backend grep 扫描全盘问题——`create_custom_cli_agent` 中 `large_results_backend` / `conversation_history_backend``FilesystemBackend``virtual_mode=False` 改为 `True`,避免 `CompositeBackend` 剥前缀后把 `"/"` 转发给 grep 时扫到真实根目录,单次 grep 从 45152s 回到毫秒级(`6bccd89` - [2026-06-02](changelog/2026-Q2.md): 修复 deepagents 落盘 backend grep 扫描全盘问题——`create_custom_cli_agent` 中 `large_results_backend` / `conversation_history_backend``FilesystemBackend``virtual_mode=False` 改为 `True`,避免 `CompositeBackend` 剥前缀后把 `"/"` 转发给 grep 时扫到真实根目录,单次 grep 从 45152s 回到毫秒级(`6bccd89`
- [2026-05-29](changelog/2026-Q2.md): 新增 `ToolMetricsMiddleware`——通过 `wrap_tool_call` / `awrap_tool_call` 对每次 tool 调用计时并 emit `catalog_agent.tool_call` 结构化指标(成功/失败/取消三态、含 `tool_name`/`trace_id`/`bot_id`/`duration_ms`/`error_type`);插在 `init_agent` 中间件链的 `EmptyResponseRetryMiddleware` 之后、`ToolUseCleanupMiddleware` 之前(`9f0ae25` - [2026-05-29](changelog/2026-Q2.md): 新增 `ToolMetricsMiddleware`——通过 `wrap_tool_call` / `awrap_tool_call` 对每次 tool 调用计时并 emit `catalog_agent.tool_call` 结构化指标(成功/失败/取消三态、含 `tool_name`/`trace_id`/`bot_id`/`duration_ms`/`error_type`);插在 `init_agent` 中间件链的 `EmptyResponseRetryMiddleware` 之后、`ToolUseCleanupMiddleware` 之前(`9f0ae25`
- [2026-05-26](changelog/2026-Q2.md): skill 引入 `category` 字段——`routes/skill_manager.py` 在 `SkillItem` / `SkillValidationResult` 增加 `category`,从 `plugin.json``SKILL.md` frontmatter 解析official skill 默认 `"other"`、user skill 默认 `"custom"`;并通过 batch 给 common/developing/onprem/support 路径下大量 skill 元数据补 `category``data-dashboard` / `mcp-ui` 归类 `Interactive UI``203dcf4`, `3ada55a`, `9658588` - [2026-05-26](changelog/2026-Q2.md): skill 引入 `category` 字段——`routes/skill_manager.py` 在 `SkillItem` / `SkillValidationResult` 增加 `category`,从 `plugin.json``SKILL.md` frontmatter 解析official skill 默认 `"other"`、user skill 默认 `"custom"`;并通过 batch 给 common/developing/onprem/support 路径下大量 skill 元数据补 `category``data-dashboard` / `mcp-ui` 归类 `Interactive UI``203dcf4`, `3ada55a`, `9658588`
@ -91,6 +95,7 @@ MCP UI 类 skill 已按 MCP Apps 模式改造:工具返回数据,静态 HTML
- ⚠️ **Daytona shell_env 是文件注入而非 process env**`init_agent` 通过 `cat > $REMOTE_BASH_ENV_PATH` 写入 `export VAR=...` 行,沙箱内必须由 shellbash`BASH_ENV` 加载才能生效;非 daytona 模式或不走 bash 启动的脚本拿不到这些变量。扩展注入项需直接改 `init_agent` 里的 `_shell_env` 字典。 - ⚠️ **Daytona shell_env 是文件注入而非 process env**`init_agent` 通过 `cat > $REMOTE_BASH_ENV_PATH` 写入 `export VAR=...` 行,沙箱内必须由 shellbash`BASH_ENV` 加载才能生效;非 daytona 模式或不走 bash 启动的脚本拿不到这些变量。扩展注入项需直接改 `init_agent` 里的 `_shell_env` 字典。
- ⚠️ **`CompositeBackend` 路由下的落盘 backend 必须 `virtual_mode=True`**`create_custom_cli_agent` 中 `large_results_backend` / `conversation_history_backend` 都用独立 `tempfile.mkdtemp()` 做根目录,但 `CompositeBackend` 在路由时会剥掉前缀、可能把 `"/"` 转发给 grep`virtual_mode=False` 会把 `"/"` 解析为真实根目录并扫到 `/usr`、`/var`、其他会话的 tmp 目录(单次 45152s。`virtual_mode=True` 才会把所有路径锚定到 `root_dir` 并过滤越界结果。后续新增"只服务本次会话"的落盘 backend 一律走 `virtual_mode=True`,真实 workspace backend 仍保持 `False` - ⚠️ **`CompositeBackend` 路由下的落盘 backend 必须 `virtual_mode=True`**`create_custom_cli_agent` 中 `large_results_backend` / `conversation_history_backend` 都用独立 `tempfile.mkdtemp()` 做根目录,但 `CompositeBackend` 在路由时会剥掉前缀、可能把 `"/"` 转发给 grep`virtual_mode=False` 会把 `"/"` 解析为真实根目录并扫到 `/usr`、`/var`、其他会话的 tmp 目录(单次 45152s。`virtual_mode=True` 才会把所有路径锚定到 `root_dir` 并过滤越界结果。后续新增"只服务本次会话"的落盘 backend 一律走 `virtual_mode=True`,真实 workspace backend 仍保持 `False`
- ⚠️ **`ToolMetricsMiddleware` 必须在重试中间件之后、其他工具中间件之前**`init_agent` 中顺序约定为 `EmptyResponseRetryMiddleware → ToolMetricsMiddleware → ToolUseCleanupMiddleware → ...`,这样统计到的 `duration_ms` 才包含全部后续 tool 处理开销并自然覆盖重试边界。指标 emit 自身的异常被吞掉只打 logger.exception所以指标缺失不会触发 agent 报错,必须在指标后端做独立告警。 - ⚠️ **`ToolMetricsMiddleware` 必须在重试中间件之后、其他工具中间件之前**`init_agent` 中顺序约定为 `EmptyResponseRetryMiddleware → ToolMetricsMiddleware → ToolUseCleanupMiddleware → ...`,这样统计到的 `duration_ms` 才包含全部后续 tool 处理开销并自然覆盖重试边界。指标 emit 自身的异常被吞掉只打 logger.exception所以指标缺失不会触发 agent 报错,必须在指标后端做独立告警。
- ⚠️ **`rag-retrieve` 系列空 query 返回 success 而非 JSON-RPC error**5 个 server 变体autoload/onprem、autoload/support、developing/rag-retrieve-no-citation、onprem/rag-retrieve-only、support/rag-retrieve-only从 2026-06-05 起,缺 `query` 参数时返回 `create_success_response` 携带 `content[].text` 前缀 `"Error: missing required parameter 'query'..."`,故意让 LLM agent 在 tool output 里看到错误并自纠错重试。下游/中间层不能再用 JSON-RPC `error.code == -32602` 判断缺参,需要从 text 内容解析;新 MCP server 设计错误路径时也应区分"需要 agent 自纠错的语义错误(走 success-with-error-text"与"不可恢复错误(走 error response"。
## Skill 目录结构 ## Skill 目录结构

View File

@ -4,6 +4,92 @@
--- ---
## 2026-06-07: 新增 `table-query` skillSQLite 表查询)
**类型**:新功能
**背景**bot 上传的 Excel/CSV 表格此前只能走 `rag_retrieve` 语义检索回答问题,对"价格/数量/库存/排名/聚合"这类**结构化**问题精度差、答非所问。需要一条"快速、不走 LLM 的 SQL 查询路径"。
**改动**
- 新增 `skills/developing/table-query/`
- `SKILL.md`:定义 search-tables → get-schemas → run-sql 三步工作流;明确"backend 不做 LLM 推理,由 agent 自己写 SQLite SQL"`category: Data & Retrieval`。
- `scripts/table_query.py`CLI 入口 + run-sql plan 执行器plan 通过 stdin heredoc 传递,无需 shell 转义)。
- `skill.yaml`:元数据。
- `verify_table_query.sh`:自检脚本。
- 工作流要求search-tables **每问最多 1 次**run-sql 出错重试 ≤ 2 次,且不回退到 search-tablessearch-tables 无果时降级到 `rag_retrieve`
**根因**N/A新功能
**影响**
- agent 可直接对上传表数据做 SUM/AVG/COUNT 等结构化查询,并通过 `__src` 列 + `file_ref_table` 输出行级 citation。
- 同时引入 plan-on-stdin 的调用约定(`<<'PLAN' ... PLAN`),后续类似 SQL/JSON 入参的 skill 可参考此模式以避免 argv 转义问题。
**相关文件**
- `skills/developing/table-query/SKILL.md`
- `skills/developing/table-query/scripts/table_query.py`
- `skills/developing/table-query/skill.yaml`
- `skills/developing/table-query/verify_table_query.sh`
**Commit/PR**`bb74aee`
---
## 2026-06-05: 新增 `mineru` skillPDF/Office/图片 → Markdown 解析)
**类型**:新功能
**背景**:缺少统一的"文档转 Markdown"管道PDF/Word/PPT/Excel/图片需要走不同工具,且 OCR / 公式 / 表格识别能力不一致。
**改动**
- 新增 `skills/developing/mineru/`(约 4700 行30 文件):
- `SKILL.md` + `references/`api_reference / comparison / integrations
- `scripts/mineru.py`:核心 CLI自动路由 Agent API无 token与 Standard API`MINERU_TOKEN`,支持大文件/批量/DOCX/HTML/LaTeX 导出)。
- `scripts/mineru_mcp.py`MCP server 包装。
- `scripts/sinks/`airtable / coda / confluence / dingtalk / feishu / linear / local / notion / onenote / roam / siyuan / slack / ticktick / wecom / wps / yuque 等多目标写出。
- `scripts/chunking.py` / `splitter.py` / `local_engine.py`:分块、切分、本地引擎。
- `category: Document Processing`,标准依赖(仅标准库)。
**根因**N/A新功能
**影响**
- 提供"零 token 起步、有 token 升级"的渐进式解析路径,降低部署门槛。
- 多 sink 设计可作为后续"采集 → 结构化 → 多目的地分发"类 skill 的参考骨架。
**相关文件**
- `skills/developing/mineru/SKILL.md`
- `skills/developing/mineru/scripts/mineru.py`
- `skills/developing/mineru/scripts/mineru_mcp.py`
- `skills/developing/mineru/scripts/sinks/*.py`15 个目标)
**Commit/PR**`b618cb1`
---
## 2026-06-05: `rag-retrieve` 系列空 query 改返回 success 文案(替代 JSON-RPC error
**类型**:行为变更(兼容性)
**背景**`rag_retrieve` / `table_rag_retrieve` 工具在缺少 `query` 参数时返回 JSON-RPC `-32602` error。MCP/agent 链路上 error 容易被上层吞掉agent 看不到"为什么失败",无法自我修复。
**改动**:把 5 个 rag-retrieve server 变体的"missing query"分支由 `create_error_response(-32602, ...)` 改为 `create_success_response(...)` 携带 `content[].text = "Error: missing required parameter 'query'. Please call this tool again with a non-empty 'query' argument describing what you want to retrieve."`
- `skills/autoload/onprem/rag-retrieve/rag_retrieve_server.py`
- `skills/autoload/support/rag-retrieve/rag_retrieve_server.py`
- `skills/developing/rag-retrieve-no-citation/rag_retrieve_server.py`
- `skills/onprem/rag-retrieve-only/rag_retrieve_server.py`
- `skills/support/rag-retrieve-only/rag_retrieve_server.py`
**根因**MCP 协议层的 error 在多数 agent 框架下不会作为"tool 返回结果"传给 LLM从而无法触发重试改成 success-with-error-text 让 agent 把错误文本当 tool output 读到,并在下一轮自然带上 query 重试。
**影响**
- 客户端**不能再依赖 JSON-RPC error code = -32602 判定缺参**,必须从 `content[].text` 前缀 `"Error:"` 解析;任何在 success 路径上做强校验/落盘的中间层需要兼容这种"假成功"形态。
- 新 MCP server 出错路径如果是"需要 agent 自纠错"的语义错误,应走同样的 success-with-error-text 模式;底层崩溃 / 不可恢复错误仍走 error response。
**相关文件**:见上。
**Commit/PR**`ecf332a`
---
## 2026-06-02: 修复 deepagents 落盘 backend grep 扫描全盘问题virtual_mode=True ## 2026-06-02: 修复 deepagents 落盘 backend grep 扫描全盘问题virtual_mode=True
**类型**Bug 修复 **类型**Bug 修复

View File

@ -90,6 +90,8 @@ async def prepare_checkpoint_message(config, checkpointer):
last_user_msg = next((m for m in reversed(config.messages) if m.get('role') == 'user'), None) last_user_msg = next((m for m in reversed(config.messages) if m.get('role') == 'user'), None)
if last_user_msg: if last_user_msg:
config.messages = [last_user_msg] config.messages = [last_user_msg]
logger.info(f"Has history, sending last user message: {last_user_msg.get('content', '')[:50]}...") from utils.fastapi_utils import extract_text_from_content
preview = extract_text_from_content(last_user_msg.get('content', ''))
logger.info(f"Has history, sending last user message: {preview[:50]}...")
else: else:
logger.info(f"No history, sending all {len(config.messages)} messages") logger.info(f"No history, sending all {len(config.messages)} messages")

View File

@ -18,7 +18,8 @@ from utils.fastapi_utils import (
process_messages, process_messages,
create_project_directory, extract_api_key_from_auth, generate_v2_auth_token, fetch_bot_config, create_project_directory, extract_api_key_from_auth, generate_v2_auth_token, fetch_bot_config,
call_preamble_llm, call_preamble_llm,
create_stream_chunk create_stream_chunk,
extract_text_from_content
) )
from langchain_core.messages import AIMessageChunk, ToolMessage, AIMessage, HumanMessage from langchain_core.messages import AIMessageChunk, ToolMessage, AIMessage, HumanMessage
from utils.settings import MAX_OUTPUT_TOKENS from utils.settings import MAX_OUTPUT_TOKENS
@ -355,9 +356,9 @@ async def create_agent_and_generate_response(
"finish_reason": "stop" "finish_reason": "stop"
}], }],
usage={ usage={
"prompt_tokens": sum(len(msg.get("content", "")) for msg in config.messages), "prompt_tokens": sum(len(extract_text_from_content(msg.get("content", ""))) for msg in config.messages),
"completion_tokens": len(response_text), "completion_tokens": len(response_text),
"total_tokens": sum(len(msg.get("content", "")) for msg in config.messages) + len(response_text) "total_tokens": sum(len(extract_text_from_content(msg.get("content", ""))) for msg in config.messages) + len(response_text)
} }
) )
@ -391,6 +392,9 @@ async def _save_user_messages(config: AgentConfig) -> None:
if isinstance(msg, dict): if isinstance(msg, dict):
role = msg.get("role", "") role = msg.get("role", "")
content = msg.get("content", "") content = msg.get("content", "")
# Flatten multimodal list content to plain text before persisting,
# so base64 image data is not stored in chat history.
content = extract_text_from_content(content)
if role == "user" and content: if role == "user" and content:
# ============ Execute PreSave hooks ============ # ============ Execute PreSave hooks ============
processed_content = await execute_hooks('PreSave', config, content=content, role=role) processed_content = await execute_hooks('PreSave', config, content=content, role=role)

View File

@ -3,12 +3,16 @@
API data models and response schemas. API data models and response schemas.
""" """
from typing import Dict, List, Optional, Any, AsyncGenerator from typing import Dict, List, Optional, Any, AsyncGenerator, Union
from pydantic import BaseModel, Field, field_validator, ConfigDict from pydantic import BaseModel, Field, field_validator, ConfigDict
class Message(BaseModel): class Message(BaseModel):
role: str role: str
content: str # content can be a plain string, or a list of content blocks for multimodal
# input (e.g. text + image). Both OpenAI-style ({"type": "image_url", ...})
# and LangChain standard blocks ({"type": "image", ...}) are accepted; they
# are normalized later in process_messages.
content: Union[str, List[Dict[str, Any]]]
class DatasetRequest(BaseModel): class DatasetRequest(BaseModel):

View File

@ -232,6 +232,55 @@ def create_stream_chunk(chunk_id: str, model_name: str, content: str = None, fin
# return full_text # return full_text
def normalize_content_blocks(content: Union[str, List[Dict[str, Any]]]) -> Union[str, List[Dict[str, Any]]]:
"""Normalize multimodal content blocks into LangChain standard content blocks.
Accepts both OpenAI-style blocks ({"type": "image_url", "image_url": {"url": ...}})
and LangChain standard blocks ({"type": "image", "base64"/"url": ...}), and emits
LangChain standard blocks so the provider's block_translator can auto-convert for
either OpenAI or Anthropic. Plain string content is returned unchanged.
"""
if not isinstance(content, list):
return content
normalized: List[Dict[str, Any]] = []
for block in content:
if not isinstance(block, dict):
# Treat a bare string inside the list as a text block.
if isinstance(block, str):
normalized.append({"type": "text", "text": block})
continue
block_type = block.get("type")
if block_type == "text":
normalized.append({"type": "text", "text": block.get("text", "")})
elif block_type == "image_url":
# OpenAI-style image block: {"type": "image_url", "image_url": {"url": ...}}
image_url = block.get("image_url")
url = image_url.get("url") if isinstance(image_url, dict) else image_url
if not url:
continue
if isinstance(url, str) and url.startswith("data:"):
# data:<mime_type>;base64,<data>
try:
header, data = url.split(",", 1)
mime_type = header.split(";", 1)[0].removeprefix("data:") or "image/jpeg"
normalized.append({"type": "image", "base64": data, "mime_type": mime_type})
except ValueError:
logger.warning("Skipping malformed data URL in image_url block")
else:
normalized.append({"type": "image", "url": url})
elif block_type == "image":
# Already a LangChain standard image block; pass through.
normalized.append(block)
else:
# Unknown block type; pass through untouched.
normalized.append(block)
return normalized
def process_messages(messages: List[Dict], language: Optional[str] = None) -> List[Dict[str, str]]: def process_messages(messages: List[Dict], language: Optional[str] = None) -> List[Dict[str, str]]:
"""Process message list, including [TOOL_CALL]|[TOOL_RESPONSE]|[ANSWER] splitting and language directive addition. """Process message list, including [TOOL_CALL]|[TOOL_RESPONSE]|[ANSWER] splitting and language directive addition.
@ -255,7 +304,7 @@ def process_messages(messages: List[Dict], language: Optional[str] = None) -> Li
# Process each message # Process each message
for i, msg in enumerate(messages): for i, msg in enumerate(messages):
if msg.role == ASSISTANT: if msg.role == ASSISTANT and isinstance(msg.content, str):
# Determine the position of this ASSISTANT message among all ASSISTANT messages (0-indexed) # Determine the position of this ASSISTANT message among all ASSISTANT messages (0-indexed)
assistant_position = assistant_indices.index(i) assistant_position = assistant_indices.index(i)
@ -315,14 +364,16 @@ def process_messages(messages: List[Dict], language: Optional[str] = None) -> Li
# If processed content is empty, use original content # If processed content is empty, use original content
processed_messages.append({"role": msg.role, "content": msg.content}) processed_messages.append({"role": msg.role, "content": msg.content})
else: else:
processed_messages.append({"role": msg.role, "content": msg.content}) # User/other messages (or assistant messages carrying multimodal list
# content) pass through; normalize multimodal blocks to LangChain standard.
processed_messages.append({"role": msg.role, "content": normalize_content_blocks(msg.content)})
# Inverse operation: reassemble messages containing [THINK|TOOL_RESPONSE] back into # Inverse operation: reassemble messages containing [THINK|TOOL_RESPONSE] back into
# msg['role'] == 'function' and msg.get('function_call') format. # msg['role'] == 'function' and msg.get('function_call') format.
# This is the inverse of get_content_from_messages. # This is the inverse of get_content_from_messages.
final_messages = [] final_messages = []
for msg in processed_messages: for msg in processed_messages:
if msg["role"] == ASSISTANT: if msg["role"] == ASSISTANT and isinstance(msg["content"], str):
# Split message content # Split message content
parts = re.split(r'\[(THINK|PREAMBLE|TOOL_CALL|TOOL_RESPONSE|ANSWER)\]', msg["content"]) parts = re.split(r'\[(THINK|PREAMBLE|TOOL_CALL|TOOL_RESPONSE|ANSWER)\]', msg["content"])
@ -401,13 +452,32 @@ def process_messages(messages: List[Dict], language: Optional[str] = None) -> Li
return final_messages return final_messages
def get_user_last_message_content(messages: list) -> Optional[dict]: def extract_text_from_content(content: Union[str, List[Dict[str, Any]]]) -> str:
"""Get the last message content from a message list.""" """Extract plain text from message content that may be a multimodal block list."""
if isinstance(content, str):
return content
if isinstance(content, list):
texts = []
for block in content:
if isinstance(block, dict) and block.get("type") == "text":
texts.append(block.get("text", ""))
elif isinstance(block, str):
texts.append(block)
return "\n".join(texts)
return ""
def get_user_last_message_content(messages: list) -> Optional[str]:
"""Get the last user message's plain text content from a message list.
Multimodal list content is flattened to text so downstream consumers
(e.g. terms embedding) always receive a string.
"""
if not messages or len(messages) == 0: if not messages or len(messages) == 0:
return "" return ""
last_message = messages[-1] last_message = messages[-1]
if last_message and last_message.get('role') == 'user': if last_message and last_message.get('role') == 'user':
return last_message["content"] return extract_text_from_content(last_message.get("content", ""))
return "" return ""
def format_messages_to_chat_history(messages: List[Dict[str, str]]) -> str: def format_messages_to_chat_history(messages: List[Dict[str, str]]) -> str: