Compare commits
72 Commits
4a5d8d05cf
...
cb649d83ee
| Author | SHA1 | Date | |
|---|---|---|---|
| | cb649d83ee | | |
| | 065403223d | | |
| | fabb14c66a | | |
| | 77079539c1 | | |
| | e6b28818bf | | |
| | 955a064ee3 | | |
| | 0983623a75 | | |
| | 0af1890493 | | |
| | 93e79f1a0e | | |
| | 66cde77117 | | |
| | 6bccd89e9a | | |
| | 2f5a8e204c | | |
| | 2205d830e1 | | |
| | 26fc9e5226 | | |
| | 9f0ae25233 | | |
| | 667054f39f | | |
| | ecc2687f7b | | |
| | 5173ca13b4 | | |
| | 6ff9555e91 | | |
| | c5dbabf558 | | |
| | 63f1f836c4 | | |
| | 0260663aa9 | | |
| | 8c4659f590 | | |
| | be44c243fd | | |
| | 091659d693 | | |
| | 667fdb8a3b | | |
| | f05bb34e1e | | |
| | 9a496b955c | | |
| | fb5562f977 | | |
| | a4b32aad7f | | |
| | 68c2472d82 | | |
| | 1dd45107af | | |
| | dc2a212f35 | | |
| | 5129fdcc05 | | |
| | e38fd17b97 | | |
| | 45cf140472 | | |
| | 51f88c8c2d | | |
| | 3ebd006b6e | | |
| | c14a22bbd1 | | |
| | 44ac8103d3 | | |
| | 881901c504 | | |
| | b54f0848cd | | |
| | 37025e4ce6 | | |
| | 5b378fcdf6 | | |
| | 86c58ffccf | | |
| | 3fd12e9fc6 | | |
| | 46cf0933a0 | | |
| | a1fffa311b | | |
| | ae99d7a177 | | |
| | 2f4ad22293 | | |
| | 3e142b22e5 | | |
| | 89e9f9b6d6 | | |
| | 76f04d9b24 | | |
| | 9f9c78548d | | |
| | 538be7da2c | | |
| | a0d285db28 | | |
| | 035b21bb43 | | |
| | adbe5b5c65 | | |
| | 73c2051490 | | |
| | fa5ddf0a4f | | |
| | c830a0d6de | | |
| | d0480cca1b | | |
| | bace14838b | | |
| | 60dd9c9332 | | |
| | b9f750be96 | | |
| | 781f1af3f6 | | |
| | 100007f66b | | |
| | 1c90ab7956 | | |
| | d2bb915191 | | |
| | 77367673e5 | | |
| | 28755b3383 | | |
| | 4de489e69e | | |
@ -1,15 +1,13 @@
|
||||
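# Skill feature

> Scope: skill package management service, core implementation

> Last updated: 2026-05-20
> Last updated: 2026-05-26

## Current status

The skill system supports two sources: official skills (`./skills/`) and user skills (`projects/uploads/{bot_id}/skills/`). It supports the hook system and MCP server configuration, with metadata defined through SKILL.md or plugin.json.

A batch of **pure `SKILL.md` business skill MVPs** has been added for research, summarization, reporting, and intelligence orchestration; low-level file processing and external retrieval continue to reuse existing skills.
Since 2026-04, a skill package can define sub-agents under `agents/*.md` (loaded by `SubAgentMiddleware`); when the Daytona sandbox is enabled, the skill loading path becomes `/workspace/skills` inside the sandbox.

MCP UI skills have been reworked to the MCP Apps pattern: tools return data, and a static HTML App is loaded by the host and then receives the data via postMessage for rendering.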
|
||||
|
||||
@ -24,10 +22,29 @@ MCP UI 类 skill 已按 MCP Apps 模式改造:工具返回数据,静态 HTML
|
||||
- `routes/mcp_resources.py` - MCP App 静态 HTML resource REST 入口
|
||||
- `skills/` - 官方 skills 目录
|
||||
- `skills_developing/` - 开发中 skills
|
||||
- `agent/subagent_loader.py` - 扫描 skill `agents/*.md` 加载子 agent(2026-05 引入)
|
||||
- `agent/mcp_trace_meta.py` - 对 `ClientSession.call_tool` 做 monkey-patch,向 `rag_retrieve` / `table_rag_retrieve` 的 MCP `_meta` 注入 `trace_id`(2026-05 引入)
|
||||
|
||||
## 最近重要事项
|
||||
|
||||
- [2026-05-26](changelog/2026-Q2.md): skill 引入 `category` 字段——`routes/skill_manager.py` 在 `SkillItem` / `SkillValidationResult` 增加 `category`,从 `plugin.json` 与 `SKILL.md` frontmatter 解析,official skill 默认 `"other"`、user skill 默认 `"custom"`;并通过 batch 给 common/developing/onprem/support 路径下大量 skill 元数据补 `category`,`data-dashboard` / `mcp-ui` 归类 `Interactive UI`(`203dcf4`, `3ada55a`, `9658588`)
|
||||
- [2026-05-26](changelog/2026-Q2.md): developing 分支大合并新增多个 skill:`ai-ppt-generator`(百度 AI PPT)、`nfc-medicine-lookup`(NFC 药品检索)、`ppt-outline`(PPT 大纲 / HTML 演示文稿)、`z-card-image`(配图 / 卡片图),同时 `skills/linggan/*` 系列 skill 经合并回归(`3ada55a`)
|
||||
- [2026-05-23](changelog/2026-Q2.md): 新增 MCP App 型 `skills/developing/ecommerce-storefront/`——含 `product-list` / `order-confirm` 两个 HTML App + 自带 `ecommerce_server.py` MCP server;同时落地 `docs/mcp-app-training.md`(约 1063 行)作为 MCP App 培训材料(`9d001c8`)
|
||||
- [2026-05-21](changelog/2026-Q2.md): Daytona 沙箱模式下 `init_agent` 在沙箱内写入 `BASH_ENV` 文件,注入 `ASSISTANT_ID` / `USER_IDENTIFIER` / `TRACE_ID` / `ENABLE_SELF_KNOWLEDGE` 与 `config.shell_env` 的 shell 环境变量(`776acc2`)
|
||||
- [2026-05-12](changelog/2026-Q2.md): 跨 6→10 个 skill 变体批量精修 `retrieval-policy*.md`,统一 onprem/support/autoload 各路径下的 policy 口径(`be96f24`, `7b4f03d`)
|
||||
- [2026-05-11](changelog/2026-Q2.md): 新增子 agent (SubAgent) 支持——skill 包通过 `agents/*.md` 暴露子 agent,由 `SubAgentMiddleware` 加载;附 `pmda-drug-info` skill 示例(`5b634bc`)
|
||||
- [2026-05-11](changelog/2026-Q2.md): `pmda-drug-info` 的 `pmda_server.py` 大改为 mock 实现(`a92096a`)
|
||||
- [2026-05-11](changelog/2026-Q2.md): `retrieval-policy.md` 跨 4 个 skill 变体内容同步更新(`e6d1698`)
|
||||
- [2026-05-08](changelog/2026-Q2.md): 通过 monkey-patch `ClientSession.call_tool`,把 trace_id 透传到 `rag_retrieve` / `table_rag_retrieve` 的 MCP `_meta`(`1f06450`)
|
||||
- [2026-05-06](changelog/2026-Q2.md): support 分支新增 `kfs-answer` skill(`a9227b8`)
|
||||
- [2026-05-06](changelog/2026-Q2.md): 修复 Daytona 沙箱增量同步漏掉符号链接的问题——`find -type f` 不覆盖 symlink、`tar.add` 默认不 dereference 导致悬空软链;并统一 dataset 路径为复数 `datasets/`(`3c0fa49`)
|
||||
- [2026-04-24](changelog/2026-Q2.md): 非流式响应路径上 `_execute_post_agent_hooks` 改为 `asyncio.create_task` 非阻塞执行;同时临时注释停用 `ToolOutputLengthMiddleware`(`45a9494`)
|
||||
- [2026-04-23](changelog/2026-Q2.md): PrePrompt hook 内容改为通过 `{hook_content}` 占位符注入系统提示词模板,不再在 prompt 末尾追加(`51fbf01`)
|
||||
- [2026-04-23](changelog/2026-Q2.md): Daytona 沙箱接入——`init_agent` 并行加载 + 返回元组增加 `sandbox` 字段;`skills_sources` 在沙箱模式下变为 `/workspace/skills`,`agent_dir_path` 变为 `/workspace`(`c9e0789`, `8446dab`)
|
||||
- [2026-04-22](changelog/2026-Q2.md): `skills/developing/` 下新增 `rag-retrieve-no-citation` 与 `novare-context` 两个开发中 skill(`7a30e52`)
|
||||
|
||||
- 2026-05-20: `mcp-ui` 和 `data-dashboard` 改为 MCP Apps 标准模式,App HTML 放在 skill 的 `apps/` 目录,由 host 加载后 postMessage 数据
|
||||
|
||||
- 2026-04-20: 为 `rag-retrieve` 新增 `retrieval-policy-forbidden-self-knowledge.md`,禁止知识问答场景使用模型自身知识补全答案,要求严格基于检索证据作答
|
||||
- 2026-04-19: 环境变量 `SKILLS_SUBDIR` 重命名为 `PROJECT_NAME`,用于选择 `skills/{PROJECT_NAME}` 和 `skills/autoload/{PROJECT_NAME}` 目录
|
||||
- 2026-04-19: `create_robot_project` 的 autoload 去重和 stale 清理补强,autoload 目录也纳入 managed 清理,避免 `rag-retrieve-only` 场景下旧的 `rag-retrieve` 残留
|
||||
@ -59,6 +76,16 @@ MCP UI 类 skill 已按 MCP Apps 模式改造:工具返回数据,静态 HTML
|
||||
- ⚠️ 上传大小限制:50MB(ZIP),解压后最大 500MB
|
||||
- ⚠️ 压缩比例检查:最大 100:1(防止 zip 炸弹)
|
||||
- ⚠️ 符号链接检查:禁止解压包含符号链接的文件
|
||||
- ⚠️ **子 agent 同名静默 last-wins**:`subagent_loader._parse_agent_md` 跨 skill 扫描 `agents/*.md` 时,按 `name` 字段去重,**后扫描到的覆盖先扫描的**,只打 warning 不报错。多 skill 都暴露子 agent 时需自觉错开命名。
|
||||
- ⚠️ **`SubAgentMiddleware` 中间件顺序**:必须插在 `CustomFilesystemMiddleware` 之后、`AnthropicPromptCachingMiddleware` 之前——这是匹配 `deepagents.create_deep_agent` 的官方顺序,调整 `create_custom_cli_agent` 中的中间件顺序时不能随意挪动这一段。
|
||||
- ⚠️ **Daytona 模式下 skill 路径不同**:`DAYTONA_ENABLED=true` 时 `enable_skills` 的 `skills_sources` 是 `/workspace/skills`(沙箱内),同时 system prompt 的 `agent_dir_path` 是 `/workspace`;写死本地路径的 hook / 脚本需要兼容两种环境。
|
||||
- ⚠️ **PostAgent hook 非流式分支已 fire-and-forget**:`routes/chat.py` 用 `asyncio.create_task` 启动 hook,调用方不会等待也不会感知到 hook 的异常——hook 失败只会被自己的 logger 捕获。
|
||||
- ⚠️ **MCP `_meta.trace_id` 是全局 monkey-patch 注入**:`agent/mcp_trace_meta.patch_mcp_client_session_trace_meta()` 在 `get_tools_from_mcp()` 入口调用一次后,会把 `mcp.ClientSession.call_tool` 永久包装;仅对工具名在 `{"rag_retrieve", "table_rag_retrieve"}` 集合内的调用注入 `_meta.trace_id`,扩展白名单要直接改 `_TRACE_META_TOOL_NAMES` 常量。
|
||||
- ⚠️ **PrePrompt hook 内容位置由模板决定**:自 2026-04-23 起 hook 产出通过 `{hook_content}` 占位符注入 `prompt/system_prompt.md`,不再追加在 prompt 末尾;自定义模板必须包含 `{hook_content}` 占位符否则 hook 内容会丢失。
|
||||
- ⚠️ **`init_agent` 返回值已变 3 元素**:Daytona 改造后 `init_agent` 返回 `(agent, checkpointer, sandbox)`;调用方解构必须更新。
|
||||
- ⚠️ **skill `category` 默认值**:API 返回的 `SkillItem.category`——official skill fallback 为 `"other"`、user skill fallback 为 `"custom"`;前端做分类视图时需要同时识别这两个 sentinel,不要假设官方/用户 skill 用同一套缺省值。
|
||||
- ⚠️ **`category` 字段双入口**:同一 skill 可以同时在 `.claude-plugin/plugin.json` 和 `SKILL.md` frontmatter 写 `category`;`get_skill_metadata` 优先走 `parse_plugin_json`,若 skill 包没有 plugin.json 才回落到 `parse_skill_frontmatter`——两者写不一致时以 plugin.json 为准。
|
||||
- ⚠️ **Daytona shell_env 是文件注入而非 process env**:`init_agent` 通过 `cat > $REMOTE_BASH_ENV_PATH` 写入 `export VAR=...` 行,沙箱内必须由 shell(bash)的 `BASH_ENV` 加载才能生效;非 daytona 模式或不走 bash 启动的脚本拿不到这些变量。扩展注入项需直接改 `init_agent` 里的 `_shell_env` 字典。
|
||||
|
||||
## Skill 目录结构
|
||||
|
||||
|
||||
@ -1,5 +1,392 @@
|
||||
# 2026-Q2 Skill Changelog
|
||||
|
||||
按时间倒序记录本季度的重要变更。
|
||||
|
||||
---
|
||||
|
||||
## 2026-05-26: skill `category` field fully wired in

**Type**: New feature

**Background**: The number of skills keeps growing (dozens across the common / developing / onprem / support / linggan / autoload paths). The list API needs to let the frontend group skills by category, but the metadata layer had no `category` field.

**Changes**:
- `routes/skill_manager.py`:
  - The `SkillItem` model gains `category: str = "other"`.
  - The `SkillValidationResult` dataclass gains an optional `category: Optional[str]`.
  - `parse_plugin_json` reads `plugin_config.get('category')`; `parse_skill_frontmatter` reads `metadata.get('category')` from the frontmatter.
  - `get_official_skills` falls back to `"other"`; `get_user_skills` falls back to `"custom"`.
  - `get_skill_metadata_legacy` writes `category` into the returned dict when it is non-empty (backward compatible).
- Batch-added the `category` field to the `.claude-plugin/plugin.json` and `SKILL.md` frontmatter of many skills under common / developing / onprem / support.
- Changed the `category` of `data-dashboard` and `mcp-ui` from `"Data & Retrieval"` to `"Interactive UI"` (a better fit for MCP App rendering semantics).

**Root cause**: N/A (new feature)

**Impact**:
- Items returned by `GET /api/v1/skill/list` now include a `category` field; the frontend can group or filter by category.
- The skill metadata convention is extended: new skills should declare `category` in plugin.json or the SKILL.md frontmatter, otherwise they fall back to `"other"` / `"custom"`.
- When both `plugin.json.category` and `SKILL.md.category` exist, the former wins (`get_skill_metadata` prefers plugin.json).

**Related files**:
- `routes/skill_manager.py`
- `skills/common/data-dashboard/.claude-plugin/plugin.json`
- `skills/common/mcp-ui/.claude-plugin/plugin.json`
- plus a batch of `skills/{common,developing,onprem,support}/*/SKILL.md` and `.claude-plugin/plugin.json` files

**Commit/PR**: `203dcf4`, `3ada55a`, `9658588`
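
The precedence and fallback rules in this entry can be sketched as below. This is a minimal illustration, not the code in `routes/skill_manager.py`: `resolve_category` and its parameters are hypothetical names, and the sketch assumes that a skill whose plugin.json exists but lacks `category` goes straight to the default rather than consulting the frontmatter, which the changelog does not state explicitly.

```python
from typing import Optional


def resolve_category(
    plugin_json: Optional[dict],
    frontmatter: Optional[dict],
    is_official: bool,
) -> str:
    """Pick a skill's category (hypothetical helper, for illustration only).

    plugin.json wins whenever the file exists; the SKILL.md frontmatter is
    consulted only when there is no plugin.json at all (an assumption).
    """
    if plugin_json is not None:
        category = plugin_json.get("category")
    else:
        category = (frontmatter or {}).get("category")
    # Official skills default to "other", user skills to "custom".
    return category or ("other" if is_official else "custom")
```

A frontend that groups by category would therefore need to treat both `"other"` and `"custom"` as "uncategorized" sentinels.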
|
||||
|
||||
---
|
||||
|
||||
## 2026-05-26: developing 分支批量新增多类 skill
|
||||
|
||||
**类型**:新功能
|
||||
|
||||
**背景**:[待补充]——经 developing→staging 合并集中落地一批新 skill 与 linggan 系列 skill 回归。
|
||||
|
||||
**改动**:
|
||||
- 新增 `skills/developing/ai-ppt-generator/`:调用百度 AI 生成 PPT,按 topic 自动选模板(商务/科技/教育/创意/中国风等);`category: Document Processing`。
|
||||
- 新增 `skills/developing/nfc-medicine-lookup/`:通过 NFC 芯片 ID 或药品名称查询药品信息,面向老年用户的语音助手交互口径;`category: Developer Tools`。
|
||||
- 新增 `skills/developing/ppt-outline/`:PPT 大纲与独立 HTML 演示文稿生成(dark/light/tech/minimal 四种风格);`category: Document Processing`。
|
||||
- 新增 `skills/developing/z-card-image/`:生成配图、封面图、卡片图、社媒帖子分享图等;依赖 `python3` + `google-chrome`。
|
||||
- `skills/developing/static-hosting/SKILL.md` 由 1 行说明扩展为完整 80 行 skill;同时一批已有 SKILL.md / plugin.json 补 `category`。
|
||||
- `skills/linggan/*` 系列 skill(baidu-search / bot-self-modifier / caiyun-weather / competitor-news-intel / contract-document-generator / financial-report-generator / market-academic-insight / ragflow-loader / sales-decision-report / seedream / static-hosting / static-site-deploy / voice-notification / weather-china)经合并回归 staging。
|
||||
|
||||
**根因**:N/A
|
||||
|
||||
**影响**:
|
||||
- developing skill 池扩张约 5 个新业务 skill;linggan 系列重新出现在 staging。
|
||||
- 新 skill 多为 SKILL.md 型业务 skill,符合"workflow + 模板"的纯 markdown 模式;其中 `ai-ppt-generator`、`z-card-image` 依赖外部 `BAIDU_API_KEY` 或 `google-chrome` 二进制。
|
||||
|
||||
**相关文件**:
|
||||
- `skills/developing/ai-ppt-generator/SKILL.md`
|
||||
- `skills/developing/nfc-medicine-lookup/SKILL.md`
|
||||
- `skills/developing/ppt-outline/SKILL.md`
|
||||
- `skills/developing/z-card-image/SKILL.md`
|
||||
- `skills/developing/static-hosting/SKILL.md`
|
||||
- `skills/linggan/**`(回归)
|
||||
|
||||
**Commit/PR**:`3ada55a`
|
||||
|
||||
---
|
||||
|
||||
## 2026-05-23: 新增 ecommerce-storefront skill(MCP App 型)+ MCP App 培训文档
|
||||
|
||||
**类型**:新功能
|
||||
|
||||
**背景**:MCP App 模式(host 加载静态 HTML + postMessage 传数据)已经在 `mcp-ui`、`data-dashboard` 上跑通,需要一个面向电商场景的样例 skill,演示产品浏览 / 选购 / 下单确认这类多步交互的 App 渲染;同时沉淀一份 MCP App 开发指南。
|
||||
|
||||
**改动**:
|
||||
- 新增 `skills/developing/ecommerce-storefront/`:
|
||||
- `apps/product-list.html`(288 行)与 `apps/order-confirm.html`(233 行)两个静态 App。
|
||||
- `ecommerce_server.py`(213 行)作为自带 MCP server,`ecommerce_tools.json` 定义工具 schema。
|
||||
- `hooks/ecommerce_guide.md` + `hooks/pre_prompt.py` 注入 skill 使用指引到 system prompt。
|
||||
- `mcp_common.py`(252 行)复用 MCP 通用工具基类。
|
||||
- `.claude-plugin/plugin.json` 配置 PrePrompt hook 与 stdio MCP server,`category: Developer Tools`。
|
||||
- 新增 `docs/mcp-app-training.md`(约 1063 行):MCP App 模式的开发培训材料。
|
||||
|
||||
**根因**:N/A
|
||||
|
||||
**影响**:
|
||||
- developing skill 池新增一个 MCP App 型 skill,体例对齐 `mcp-ui` / `data-dashboard`。
|
||||
- MCP App 开发者有完整培训材料可参考。
|
||||
|
||||
**相关文件**:
|
||||
- `skills/developing/ecommerce-storefront/**`
|
||||
- `docs/mcp-app-training.md`
|
||||
|
||||
**Commit/PR**:`9d001c8`
|
||||
|
||||
---
|
||||
|
||||
## 2026-05-21: Daytona sandbox injects shell_env into BASH_ENV

**Type**: New feature

**Background**: Skill scripts inside the Daytona sandbox need to read runtime context such as `ASSISTANT_ID` / `USER_IDENTIFIER` / `TRACE_ID`, but the host process env cannot be passed into the sandbox directly.

**Changes**:
- `agent/deep_assistant.py` `init_agent`: when `sandbox is not None and sandbox_type == "daytona"`, build the `_shell_env` dict (`ASSISTANT_ID` / `USER_IDENTIFIER` / `TRACE_ID` / `ENABLE_SELF_KNOWLEDGE` plus `config.shell_env`), compose `cd {REMOTE_WORKSPACE_ROOT}\n` followed by `export VAR="..."` lines, and write them into the sandbox via `sandbox.execute("cat > $REMOTE_BASH_ENV_PATH << 'ENVEOF' ... ENVEOF")`.
- `utils/daytona_sync.py` provides the constants `REMOTE_BASH_ENV_PATH` / `REMOTE_WORKSPACE_ROOT`.
- `AgentConfig` gains `shell_env: Optional[Dict[str, str]]` (callers can append custom env).

**Root cause**: N/A

**Impact**:
- Skill scripts launched through bash in the sandbox can read the runtime context, e.g. `os.environ.get("ASSISTANT_ID")`.
- Only effective in Daytona sandbox mode; local processes, or processes not started through bash, do not receive the variables injected via `BASH_ENV`.
- Adding more fixed injected variables requires editing the `_shell_env` dict in `init_agent` directly.

**Related files**:
- `agent/deep_assistant.py`
- `utils/daytona_sync.py`

**Commit/PR**: `776acc2`
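
The file-based injection described in this entry can be sketched as follows. This is an illustration under assumptions, not the code in `init_agent`: the helper names are hypothetical, the `/workspace` value for `REMOTE_WORKSPACE_ROOT` is assumed from the paths mentioned elsewhere in this document, and the sketch quotes values with `shlex.quote` instead of the `export VAR="..."` form the changelog mentions.

```python
import shlex
from typing import Dict

# Assumed value; the real constant lives in utils/daytona_sync.py.
REMOTE_WORKSPACE_ROOT = "/workspace"


def build_bash_env_script(
    shell_env: Dict[str, str],
    workspace_root: str = REMOTE_WORKSPACE_ROOT,
) -> str:
    """Render the text of a BASH_ENV file: a `cd` line plus one export per variable."""
    lines = [f"cd {shlex.quote(workspace_root)}"]
    for name, value in shell_env.items():
        if not name.isidentifier():
            raise ValueError(f"invalid environment variable name: {name!r}")
        # shlex.quote keeps spaces and quotes in the value from breaking the line.
        lines.append(f"export {name}={shlex.quote(str(value))}")
    return "\n".join(lines) + "\n"


def build_write_command(script: str, remote_path: str) -> str:
    """Wrap the script in a quoted heredoc so the shell writes it verbatim."""
    return f"cat > {shlex.quote(remote_path)} << 'ENVEOF'\n{script}ENVEOF"
```

Because the variables live in a file, only shells that honor `BASH_ENV` (non-interactive bash) will see them; a process started without bash would need the values passed some other way.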
|
||||
|
||||
---
|
||||
|
||||
## 2026-05-12: 批量精修 retrieval policy 文案
|
||||
|
||||
**类型**:内容调整
|
||||
|
||||
**背景**:[待补充]
|
||||
|
||||
**改动**:
|
||||
- `be96f24`: 跨 6 个 skill 变体调整 `retrieval-policy-forbidden-self-knowledge.md` 的措辞(onprem / support / autoload-onprem / autoload-onprem-rag-only / autoload-support-rag-only 路径下的版本及一份 `retrieval-policy.md`)。
|
||||
- `7b4f03d`: 在更广的 10 个文件范围内同步更新 `retrieval-policy.md` 与 `retrieval-policy-forbidden-self-knowledge.md` 两套 policy,使各 skill 变体的策略口径保持一致。
|
||||
|
||||
**根因**:N/A(非 Bug)
|
||||
|
||||
**影响**:所有使用 `rag-retrieve` / `rag-retrieve-only` 这两个 hook 的 skill 在策略行为上保持一致;同时影响 onprem 与 support 两个发布分支的部署。
|
||||
|
||||
**相关文件**:
|
||||
- `skills/onprem/rag-retrieve/hooks/retrieval-policy*.md`
|
||||
- `skills/support/rag-retrieve/hooks/retrieval-policy*.md`
|
||||
- `skills/autoload/onprem/rag-retrieve/hooks/retrieval-policy*.md`
|
||||
- `skills/autoload/onprem/rag-retrieve-only/hooks/retrieval-policy*.md`
|
||||
- `skills/autoload/support/rag-retrieve-only/hooks/retrieval-policy*.md`
|
||||
|
||||
**Commit/PR**:`be96f24`, `7b4f03d`
|
||||
|
||||
---
|
||||
|
||||
## 2026-05-11: Sub-agent (SubAgent) support + pmda-drug-info skill

**Type**: New feature

**Background**: A single skill needs to carry several dedicated sub-agents alongside the main agent, isolating context and tool sets by purpose (for example, the four dedicated agents single-drug / interaction / adverse-event / patient-specific in the PMDA drug information scenario).

**Changes**:
- Added `agent/subagent_loader.py`: scans `agents/*.md` under skill directories and parses the YAML frontmatter fields `name` / `description` / `tools` into `SubAgent` dicts; deduplicates by `name`, and **later-scanned entries override earlier ones** (last-wins).
- `agent/deep_assistant.py`: `init_agent` calls `load_subagents()`; if any exist, `SubAgentMiddleware` (from `deepagents.middleware.subagents`) is inserted after `CustomFilesystemMiddleware` and before `AnthropicPromptCachingMiddleware`, matching the order used by `create_deep_agent`.
- Added `skills/developing/pmda-drug-info/`: a complete skill package containing `.claude-plugin/plugin.json`, `hooks/pre_prompt.py` + `hooks/pmda-instructions.md`, four `agents/*.md`, its own `pmda_server.py` MCP server + `pmda_tools.json`, and the `mcp_common.py` tool base class.

**Root cause**: N/A

**Impact**:
- New skill package convention: markdown files under `agents/*.md` are loaded as sub-agents.
- The skill loading flow adds one directory scan inside `init_agent`; skills without `agents/` are unaffected.
- When skills are shared across bots there is a risk of sub-agent name collisions: a duplicate name does not raise an error, it is overridden by the later-scanned one.

**Related files**:
- `agent/subagent_loader.py` (new)
- `agent/deep_assistant.py` (wiring)
- `skills/developing/pmda-drug-info/` (new skill)

**Commit/PR**: `5b634bc`
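
The last-wins deduplication described in this entry can be sketched as below. This is an illustration only: `parse_frontmatter` and `merge_subagents` are hypothetical names, the parser handles only flat `key: value` frontmatter rather than full YAML, and the real `subagent_loader` may differ in every detail.

```python
import logging
from typing import Dict, Iterable, Tuple

logger = logging.getLogger("subagent_sketch")


def parse_frontmatter(text: str) -> Dict[str, str]:
    """Read flat `key: value` pairs between the leading '---' fences."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields: Dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def merge_subagents(entries: Iterable[Tuple[str, str]]) -> Dict[str, dict]:
    """Merge (source, markdown_text) pairs keyed by sub-agent name.

    A later entry with the same name replaces the earlier one; the
    collision is only logged as a warning, never raised.
    """
    merged: Dict[str, dict] = {}
    for source, text in entries:
        meta = parse_frontmatter(text)
        name = meta.get("name")
        if not name:
            continue  # files without a name are skipped in this sketch
        if name in merged:
            logger.warning(
                "sub-agent %r from %s overrides %s",
                name, source, merged[name]["source"],
            )
        merged[name] = {
            "name": name,
            "description": meta.get("description", ""),
            "source": source,
        }
    return merged
```

Prefixing sub-agent names with the skill name is one way to avoid the silent override when several skills expose sub-agents.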
|
||||
|
||||
---
|
||||
|
||||
## 2026-05-11: pmda-drug-info MCP server 重写为 mock 实现
|
||||
|
||||
**类型**:内部改造
|
||||
|
||||
**背景**:[待补充]
|
||||
|
||||
**改动**:`skills/developing/pmda-drug-info/pmda_server.py` 大幅替换(+322 / -385),保留接口面向 agent 的契约,内部替换为 mock 数据实现。
|
||||
|
||||
**根因**:N/A
|
||||
|
||||
**影响**:pmda-drug-info skill 当前不再依赖外部真实 PMDA 数据源,便于开发期联调。
|
||||
|
||||
**相关文件**:
|
||||
- `skills/developing/pmda-drug-info/pmda_server.py`
|
||||
|
||||
**Commit/PR**:`a92096a`
|
||||
|
||||
---
|
||||
|
||||
## 2026-05-11: retrieval-policy.md 内容更新
|
||||
|
||||
**类型**:内容调整
|
||||
|
||||
**背景**:[待补充]
|
||||
|
||||
**改动**:在 onprem / support / autoload-onprem-rag-only / autoload-support-rag-only 四个版本的 `retrieval-policy.md` 上做了同步内容更新。
|
||||
|
||||
**根因**:N/A
|
||||
|
||||
**影响**:与同月 12 日的 policy 批量精修配套,使 rag-retrieve hook 策略保持一致。
|
||||
|
||||
**相关文件**:
|
||||
- `skills/onprem/rag-retrieve/hooks/retrieval-policy.md`
|
||||
- `skills/support/rag-retrieve/hooks/retrieval-policy.md`
|
||||
- `skills/autoload/onprem/rag-retrieve-only/hooks/retrieval-policy.md`
|
||||
- `skills/autoload/support/rag-retrieve-only/hooks/retrieval-policy.md`
|
||||
|
||||
**Commit/PR**:`e6d1698`
|
||||
|
||||
---
|
||||
|
||||
## 2026-05-08: Pass trace_id to RAG tools through MCP `_meta`

**Type**: New feature

**Background**: The catalog-agent trace_id needs to be passed to the MCP-side `rag_retrieve` / `table_rag_retrieve` services for cross-process tracing.

**Changes**:
- Added `agent/mcp_trace_meta.py`: `patch_mcp_client_session_trace_meta()` applies an idempotent monkey-patch to `mcp.ClientSession.call_tool`. On each call, if the tool name is in the set `{"rag_retrieve", "table_rag_retrieve"}` and the current request context has a `trace_id`, it is injected into `kwargs["meta"]["trace_id"]`. `_call_tool_with_meta_compat` provides compatibility with older MCP SDKs (when the `meta=` keyword is not accepted, it falls back to building `CallToolRequestParams._meta` manually).
- `agent/deep_assistant.py`: installs the patch once at the entry of `get_tools_from_mcp()`.
- Adjusted `skills/onprem/rag-retrieve/rag_retrieve_server.py` and `skills/support/rag-retrieve/rag_retrieve_server.py` to receive and use `_meta.trace_id`.

**Root cause**: N/A

**Impact**:
- `rag_retrieve` / `table_rag_retrieve` calls now always carry `trace_id` in MCP `_meta` (when the context has one).
- This is a global monkey-patch: once `get_tools_from_mcp()` has been called, every `ClientSession.call_tool` call is wrapped.

**Related files**:
- `agent/mcp_trace_meta.py` (new)
- `agent/deep_assistant.py`
- `skills/onprem/rag-retrieve/rag_retrieve_server.py`
- `skills/support/rag-retrieve/rag_retrieve_server.py`

**Commit/PR**: `1f06450`
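
The idempotent patch described in this entry can be sketched against a stand-in class. Everything here is hypothetical scaffolding: `FakeClientSession` replaces `mcp.ClientSession`, `current_trace_id` replaces the project's request context, and the compatibility fallback for older SDKs is omitted.

```python
import asyncio
import contextvars
import functools

# Stand-in for the project's request context (an assumption of this sketch).
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)
TRACE_META_TOOL_NAMES = {"rag_retrieve", "table_rag_retrieve"}


class FakeClientSession:
    """Stand-in for mcp.ClientSession; echoes what it was called with."""

    async def call_tool(self, name, arguments=None, meta=None):
        return {"name": name, "meta": meta}


def patch_call_tool(session_cls) -> None:
    """Wrap session_cls.call_tool once; repeated calls are no-ops."""
    original = session_cls.call_tool
    if getattr(original, "_trace_meta_patched", False):
        return

    @functools.wraps(original)
    async def wrapper(self, name, *args, **kwargs):
        trace_id = current_trace_id.get()
        if name in TRACE_META_TOOL_NAMES and trace_id:
            meta = dict(kwargs.get("meta") or {})  # copy, do not mutate caller's dict
            meta["trace_id"] = trace_id
            kwargs["meta"] = meta
        return await original(self, name, *args, **kwargs)

    wrapper._trace_meta_patched = True
    session_cls.call_tool = wrapper


async def demo():
    """Call one allow-listed tool and one other tool with a trace id set."""
    current_trace_id.set("t-1")
    session = FakeClientSession()
    tagged = await session.call_tool("rag_retrieve", {"q": "x"})
    untouched = await session.call_tool("other_tool", {})
    return tagged, untouched
```

The marker attribute on the wrapper is what makes a second `patch_call_tool` call harmless; without it, each install would add another layer of wrapping.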
|
||||
|
||||
---
|
||||
|
||||
## 2026-05-06: 新增 kfs-answer skill (support 分支)
|
||||
|
||||
**类型**:新功能
|
||||
|
||||
**背景**:[待补充] - 为 support 分支补齐 kfs-answer 能力(onprem 分支此前已有同名 skill)。
|
||||
|
||||
**改动**:新增 `skills/support/kfs-answer/`,包括 `SKILL.md` 与 `scripts/` 下的 `query.py` / `search.py` / `detail.py` / `query_db.py` / `format_answer.py` / `merge_citations.py` / `_session.py` 共 7 个脚本(约 1809 行)。
|
||||
|
||||
**根因**:N/A
|
||||
|
||||
**影响**:support 部署版本获得 kfs-answer 能力。
|
||||
|
||||
**相关文件**:
|
||||
- `skills/support/kfs-answer/**`
|
||||
|
||||
**Commit/PR**:`a9227b8`
|
||||
|
||||
---
|
||||
|
||||
## 2026-05-06: Daytona sandbox incremental sync missed symbolic links

**Type**: Bug fix

**Background**: Datasets are mounted through symbolic links, but the incremental sync used `find -type f`, which only matches regular files, so dataset symlinks were neither detected nor packed and synced to the Daytona sandbox. In addition, `tar.add()` does not dereference by default, so what got packed was a dangling symlink pointing at a host path.

**Changes**:
- `utils/daytona_sync._list_local_changed_files`: matches both files and symlinks (`-type f -o -type l`).
- `utils/daytona_sync._tar_workspace_entries`: `tar.add(dereference=True)`, dereferencing symlinks so the actual content is packed.
- `skills/onprem/kfs-answer/SKILL.md` and `prompt/system_prompt_deep_agent.md`: unified the dataset path to the plural form `datasets/`.

**Root cause**: The default behavior of `find -type f` and `tar.add()` does not handle symbolic links well.

**Impact**: In Daytona mode, skills that depend on dataset symlinks, such as kfs-answer, can use the data inside the sandbox normally; the paths in the prompt and in SKILL.md are now consistent.

**Related files**:
- `utils/daytona_sync.py`
- `skills/onprem/kfs-answer/SKILL.md`
- `prompt/system_prompt_deep_agent.md`

**Commit/PR**: `3c0fa49`
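
The difference dereferencing makes can be reproduced with the standard library alone. The sketch below is not the project's `_tar_workspace_entries`; the helper names are hypothetical, and it sets `dereference` through `tarfile.open`, which is where the standard `tarfile` module exposes that option. It assumes a platform where `os.symlink` is permitted.

```python
import io
import os
import tarfile
import tempfile


def pack_dir(root: str, dereference: bool) -> bytes:
    """Pack `root` into an in-memory tar, optionally following symlinks."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", dereference=dereference) as tar:
        tar.add(root, arcname="workspace")
    return buf.getvalue()


def member_types(archive: bytes) -> dict:
    """Map each member name to 'symlink', 'file', or 'dir'."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r") as tar:
        return {
            m.name: "symlink" if m.issym() else "file" if m.isfile() else "dir"
            for m in tar.getmembers()
        }


def demo() -> tuple:
    """Pack a workspace whose only entry is a symlink to a file outside it."""
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "host_data.txt")
        with open(target, "w", encoding="utf-8") as fh:
            fh.write("rows")
        root = os.path.join(tmp, "ws")
        os.mkdir(root)
        os.symlink(target, os.path.join(root, "dataset.txt"))
        return (
            member_types(pack_dir(root, dereference=False)),
            member_types(pack_dir(root, dereference=True)),
        )
```

Without dereferencing, the archive carries only the link, whose host-side target does not exist in the sandbox; with it, the file content itself is packed.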
|
||||
|
||||
---
|
||||
|
||||
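## 2026-04-24: PostAgent hooks run non-blocking + ToolOutputLengthMiddleware temporarily disabled

**Type**: Performance optimization / temporary adjustment

**Background**: On the non-streaming response path, `_execute_post_agent_hooks` was awaited synchronously, which blocked the response from returning.

**Changes**:
- `routes/chat.py`: the non-streaming branch changes `await _execute_post_agent_hooks(...)` to `asyncio.create_task(_execute_post_agent_hooks(...))`; hooks run in the background and do not block the response.
- `agent/deep_assistant.py`: the whole `ToolOutputLengthMiddleware` block is commented out (not deleted, can be restored).
- `utils/settings.py`: switched the commented lines for `DAYTONA_API_KEY` / `DAYTONA_SERVER_URL` (self-hosted Daytona enabled, SaaS lines commented out).

**Root cause**: N/A (mainly a performance optimization)

**Impact**:
- The non-streaming API response no longer waits for PostAgent hooks to finish, so failures or exceptions in a hook **are only caught by the logger inside the task**; the caller gets no error feedback.
- Tool output length is temporarily no longer truncated, so overly long output may overflow the context (the middleware is commented out, not removed).

**Related files**:
- `routes/chat.py`
- `agent/deep_assistant.py`
- `utils/settings.py`

**Commit/PR**: `45a9494`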
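
The fire-and-forget behavior described in this entry can be sketched as below. The names are hypothetical and the hook always fails on purpose, to show that the caller never sees the error. Keeping a reference to the task in a set follows the general `asyncio` recommendation and is not claimed to be what `routes/chat.py` does.

```python
import asyncio
import logging

logger = logging.getLogger("hooks_sketch")
_background_tasks = set()  # strong references so pending tasks are not garbage-collected
hook_errors = []           # demo-only record of what the hook logged


async def post_agent_hook(payload: str) -> None:
    """A hook that fails; the failure is handled inside the task only."""
    try:
        await asyncio.sleep(0)
        raise RuntimeError(f"hook failed for {payload}")
    except Exception as exc:
        logger.warning("post-agent hook error: %s", exc)
        hook_errors.append(str(exc))


async def handle_request(payload: str) -> str:
    """Return the response immediately; the hook runs in the background."""
    task = asyncio.create_task(post_agent_hook(payload))
    _background_tasks.add(task)
    task.add_done_callback(_background_tasks.discard)
    return f"response for {payload}"


async def demo():
    response = await handle_request("req-1")
    errors_at_return = list(hook_errors)       # hook has not run yet
    await asyncio.gather(*_background_tasks)   # let the background task finish
    return response, errors_at_return, list(hook_errors)
```

If a caller does need to know about hook failures, the error has to be surfaced through a side channel (metrics, a status table, a callback), since the response has already been sent.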
|
||||
|
||||
---
|
||||
|
||||
## 2026-04-23: PrePrompt hook content injected through a template placeholder

**Type**: Refactor

**Background**: Previously the PrePrompt hook output was appended to the end of the prompt after `system_prompt_default.format(...)`. The hook content had a fixed, late position in the prompt, and the template had little visibility into it.

**Changes**: `agent/prompt_loader.load_system_prompt_async` first runs `execute_hooks('PrePrompt', config)` to obtain `hook_content`, then passes it to `system_prompt_default.format(...)` through the new `{hook_content}` placeholder; the template `prompt/system_prompt.md` adds the matching placeholder.

**Root cause**: N/A (structured injection is more controllable)

**Impact**: Skills that write PrePrompt hooks depend on the position of the `{hook_content}` placeholder in the template; with an old template that has not been upgraded, hook content no longer appears in the final system prompt.

**Related files**:
- `agent/prompt_loader.py`
- `prompt/system_prompt.md`

**Commit/PR**: `51fbf01`
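
Why an old template silently loses hook content follows from how `str.format` treats unused keyword arguments: they are ignored without error. The sketch below illustrates this; the function names and the `bot_name` field are hypothetical and not taken from `agent/prompt_loader.py`.

```python
import string


def render_system_prompt(template: str, hook_content: str, **fields: str) -> str:
    """Fill the template; hook_content is dropped if the placeholder is absent."""
    return template.format(hook_content=hook_content, **fields)


def template_accepts_hook_content(template: str) -> bool:
    """Check whether the template contains a {hook_content} placeholder."""
    return any(
        field_name == "hook_content"
        for _, field_name, _, _ in string.Formatter().parse(template)
    )


NEW_TEMPLATE = "You are {bot_name}.\n{hook_content}\nBe concise."
OLD_TEMPLATE = "You are {bot_name}.\nBe concise."
```

A check like `template_accepts_hook_content` could run at load time for custom templates, turning a silent omission into an explicit warning.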
|
||||
|
||||
---
|
||||
|
||||
## 2026-04-23: Daytona 沙箱接入
|
||||
|
||||
**类型**:新功能
|
||||
|
||||
**背景**:技能脚本需要在隔离沙箱中执行(Daytona),避免直接污染宿主机。
|
||||
|
||||
**改动**:
|
||||
- `agent/deep_assistant.py`:
|
||||
- 在 `init_agent` 中读取 `DAYTONA_ENABLED` / `DAYTONA_API_KEY` / `DAYTONA_SERVER_URL`,启用时创建 `DaytonaSandbox`;并将 `sandbox` / `sandbox_type` 传到 `create_custom_cli_agent` / `agent.invoke_config`。
|
||||
- 重构为并行加载:`load_system_prompt_async` 与 `load_mcp_settings_async` 用 `asyncio.gather` 并行;`get_tools_from_mcp` 与 `asyncio.to_thread(init_daytona_sandbox, ...)` 并行;`init_agent` 现在返回 `(agent, checkpointer, sandbox)`(多了 sandbox)。
|
||||
- `enable_skills` 时 `skills_sources` 从 `"/skills"` 改为 `"/workspace/skills"`(指向沙箱内的路径)。
|
||||
- `agent/prompt_loader.py`:`agent_dir_path` 在 `DAYTONA_ENABLED=True` 时改为 `/workspace`,否则保持本地路径。
|
||||
- `utils/daytona_sync.py` 新增(204 行):沙箱与本地 workspace 双向同步。
|
||||
- `pyproject.toml` / `poetry.lock` / `requirements.txt`:新增 `daytona`、`langchain_daytona` 依赖。
|
||||
- `utils/settings.py`:新增 `DAYTONA_API_KEY` / `DAYTONA_SERVER_URL` / `DAYTONA_ENABLED` 配置。
|
||||
|
||||
**根因**:N/A
|
||||
|
||||
**影响**:
|
||||
- `init_agent` 返回元组从 2 元素变为 3 元素 (`agent, checkpointer, sandbox`)——**调用方必须同步更新解构**。
|
||||
- skill 在沙箱模式下的根路径与本地模式不同,所有写死路径的 hook / 脚本需要兼容两种环境。
|
||||
|
||||
**相关文件**:
|
||||
- `agent/deep_assistant.py`
|
||||
- `agent/prompt_loader.py`
|
||||
- `utils/daytona_sync.py`(新)
|
||||
- `utils/settings.py`
|
||||
- `pyproject.toml`, `poetry.lock`, `requirements.txt`
|
||||
|
||||
**Commit/PR**:`c9e0789`, `8446dab`
|
||||
|
||||
---
|
||||
|
||||
## 2026-04-22: 新增 rag-retrieve-no-citation 与 novare-context 两个开发中 skill
|
||||
|
||||
**类型**:新功能
|
||||
|
||||
**背景**:[待补充]
|
||||
|
||||
**改动**:
|
||||
- `skills/developing/rag-retrieve-no-citation/`:完整 skill 包,含 `.claude-plugin/plugin.json`、`README.md`、`hooks/pre_prompt.py`、`hooks/retrieval-policy.md` 与 `hooks/retrieval-policy-forbidden-self-knowledge.md`、独立 `rag_retrieve_server.py` + `rag_retrieve_tools.json` + `mcp_common.py`。
|
||||
- `skills/developing/novare-context/`:包含 `.claude-plugin/plugin.json`、`README.md`、`hooks/pre_prompt.py`。
|
||||
|
||||
**根因**:N/A
|
||||
|
||||
**影响**:开发中 skill 集合扩张,可作为后续正式版本的母版。
|
||||
|
||||
**相关文件**:
|
||||
- `skills/developing/rag-retrieve-no-citation/**`
|
||||
- `skills/developing/novare-context/**`
|
||||
|
||||
**Commit/PR**:`7a30e52`
|
||||
|
||||
---
|
||||
### 2026-05-20
|
||||
- **变更**: `mcp-ui` 和 `data-dashboard` 从自定义 `uri + data` 工具协议改为 MCP Apps 模式
|
||||
- **说明**: 静态 HTML App 放在各 skill 的 `apps/` 目录,host 通过 resource URI 加载 iframe,再用 postMessage 传递工具数据
|
||||
|
||||
47
.features/skill/decisions/2026-05-subagent-support.md
Normal file
@ -0,0 +1,47 @@
|
||||
---
|
||||
date: "2026-05-11"
|
||||
status: pending
|
||||
topic: "subagent-support"
|
||||
impact: [skill, agent, deep_assistant]
|
||||
---
|
||||
|
||||
# Sub-Agent 在 skill 内的承载方式
|
||||
|
||||
## 背景
|
||||
|
||||
业务方 (pmda-drug-info) 需要在单个 skill 内同时承载若干面向不同子任务的专用 agent
|
||||
(single-drug / interaction / adverse-event / patient-specific),每个子 agent
|
||||
需要独立的 system prompt 和工具白名单,但应与主 agent 复用同一组 MCP 工具实例
|
||||
与同一份 LLM。
|
||||
|
||||
[待补充]:是否考虑过用 skill-per-subagent 的方式(每个子 agent 一个独立 skill)。
|
||||
|
||||
## 选项
|
||||
|
||||
### 选项 A:skill 内 `agents/*.md` + 全局 `SubAgentMiddleware`(已实现)
|
||||
- 优点:
|
||||
- skill 包自洽,子 agent 定义与 hook / MCP 同包发布。
|
||||
- 复用 `deepagents.middleware.subagents.SubAgentMiddleware`,无需自研路由层。
|
||||
- 工具按 `tools` 字段白名单过滤,统一以 MCP tool name 引用。
|
||||
- 缺点:
|
||||
- 跨 skill 子 agent **同名时静默 last-wins 覆盖**,仅有 warning,无强校验。
|
||||
- 中间件位置耦合:`SubAgentMiddleware` 必须插在
|
||||
`CustomFilesystemMiddleware` 之后、`AnthropicPromptCachingMiddleware` 之前
|
||||
(与 `create_deep_agent` 顺序匹配),改动中间件顺序时容易踩坑。
|
||||
|
||||
### 选项 B:每个子 agent 单独建一个 skill
|
||||
- 优点:天然隔离,命名冲突由 skill 加载层处理。
|
||||
- 缺点:同一业务的多个子 agent 在 skill 列表里散落,部署 / autoload 配置复杂;
|
||||
pmda-drug-info 的 4 个子 agent 强相关,作为同一 skill 更自然。
|
||||
|
||||
## 决策
|
||||
|
||||
选择 **选项 A**(已落地)。
|
||||
|
||||
## 影响
|
||||
|
||||
- 需要改动:调用方知道 `init_agent` 返回元组现已包含 sandbox(与 daytona 改动叠加)。
|
||||
- 风险:sub-agent 同名静默覆盖;未来如多 skill 都暴露 sub-agent,需要增加冲突检测。
|
||||
- 后续任务:
|
||||
1. 沉淀 sub-agent 编写规范(`agents/*.md` frontmatter 字段 + 工具白名单约定)。
|
||||
2. 跨 skill sub-agent 命名冲突的检测策略——是否升级为 error / 加 skill 名前缀。
|
||||
13
Dockerfile
@ -9,7 +9,7 @@ ENV PYTHONPATH=/app
|
||||
ENV PYTHONUNBUFFERED=1
|
||||
|
||||
# 安装系统依赖(含 LibreOffice 和 sharp 所需的 libvips)
|
||||
RUN apt-get update && apt-get install -y \
|
||||
RUN apt-get update && apt-get install -y --no-install-recommends \
|
||||
curl \
|
||||
wget \
|
||||
gnupg2 \
|
||||
@ -26,7 +26,8 @@ RUN apt-get update && apt-get install -y \
|
||||
|
||||
# 安装Node.js (支持npx命令)
|
||||
RUN curl -fsSL https://deb.nodesource.com/setup_20.x | bash - && \
|
||||
apt-get install -y nodejs
|
||||
apt-get install -y --no-install-recommends nodejs && \
|
||||
rm -rf /var/lib/apt/lists/*
|
||||
|
||||
# 安装uv (Python包管理器)
|
||||
RUN curl -LsSf https://astral.sh/uv/install.sh | sh
|
||||
@ -36,7 +37,10 @@ ENV PATH="/root/.cargo/bin:$PATH"
|
||||
|
||||
# 复制requirements文件并安装Python依赖
|
||||
COPY requirements.txt .
|
||||
RUN pip install --no-cache-dir -r requirements.txt
|
||||
RUN grep -Ev '^(torch|triton|nvidia-[^=]+|sentence-transformers|transformers|tokenizers|safetensors|scikit-learn|scipy|huggingface-hub|hf-xet)==' requirements.txt > /tmp/requirements.runtime.txt && \
|
||||
! grep -E '^(torch|triton|nvidia-[^=]+|sentence-transformers|transformers|tokenizers|safetensors|scikit-learn|scipy|huggingface-hub|hf-xet)==' /tmp/requirements.runtime.txt && \
|
||||
pip install --no-cache-dir -r /tmp/requirements.runtime.txt && \
|
||||
rm -f /tmp/requirements.runtime.txt
|
||||
|
||||
# 安装 Playwright 并下载 Chromium
|
||||
RUN pip install --no-cache-dir playwright && \
|
||||
@ -49,9 +53,6 @@ RUN mkdir -p /app/projects
|
||||
RUN mkdir -p /app/public
|
||||
RUN mkdir -p /app/models
|
||||
|
||||
# 下载sentence-transformers模型到models目录
|
||||
RUN python -c "from sentence_transformers import SentenceTransformer; model = SentenceTransformer('TaylorAI/gte-tiny'); model.save('/app/models/gte-tiny')"
|
||||
|
||||
FROM base AS bytecode-builder
|
||||
|
||||
# 复制应用代码,仅在构建阶段编译为字节码
|
||||
|
||||
@ -10,7 +10,8 @@ ENV PYTHONUNBUFFERED=1
|
||||
|
||||
# 安装系统依赖(含 LibreOffice 和 sharp 所需的 libvips)
|
||||
RUN sed -i 's|http://deb.debian.org|http://mirrors.aliyun.com|g' /etc/apt/sources.list.d/debian.sources && \
|
||||
apt-get update && apt-get install -y \
|
||||
apt-get -o Acquire::Retries=3 update && \
|
||||
apt-get -o Acquire::Retries=3 install -y --no-install-recommends \
|
||||
curl \
|
||||
wget \
|
||||
gnupg2 \
|
||||
@ -26,7 +27,8 @@ RUN sed -i 's|http://deb.debian.org|http://mirrors.aliyun.com|g' /etc/apt/source
|
||||
|
||||
# 安装Node.js (支持npx命令)
|
||||
RUN curl -fsSL https://deb.nodesource.com/setup_20.x | bash - && \
|
||||
apt-get install -y nodejs
|
||||
apt-get -o Acquire::Retries=3 install -y --no-install-recommends nodejs && \
|
||||
rm -rf /var/lib/apt/lists/*
|
||||
|
||||
# 安装uv (Python包管理器)
|
||||
RUN curl -LsSf https://astral.sh/uv/install.sh | sh
|
||||
@ -36,7 +38,10 @@ ENV PATH="/root/.cargo/bin:$PATH"
|
||||
|
||||
# 复制requirements文件并安装Python依赖
|
||||
COPY requirements.txt .
|
||||
RUN pip install --no-cache-dir -i https://mirrors.aliyun.com/pypi/simple/ -r requirements.txt
|
||||
RUN grep -Ev '^(torch|triton|nvidia-[^=]+|sentence-transformers|transformers|tokenizers|safetensors|scikit-learn|scipy|huggingface-hub|hf-xet)==' requirements.txt > /tmp/requirements.runtime.txt && \
|
||||
! grep -E '^(torch|triton|nvidia-[^=]+|sentence-transformers|transformers|tokenizers|safetensors|scikit-learn|scipy|huggingface-hub|hf-xet)==' /tmp/requirements.runtime.txt && \
|
||||
pip install --no-cache-dir -i https://mirrors.aliyun.com/pypi/simple/ -r /tmp/requirements.runtime.txt && \
|
||||
rm -f /tmp/requirements.runtime.txt
|
||||
|
||||
# 安装 Playwright 并下载 Chromium
|
||||
RUN pip install --no-cache-dir -i https://mirrors.aliyun.com/pypi/simple/ playwright && \
|
||||
|
||||
@ -23,6 +23,7 @@ from utils.fastapi_utils import detect_provider, sanitize_model_kwargs
|
||||
from .guideline_middleware import GuidelineMiddleware
|
||||
from .tool_output_length_middleware import ToolOutputLengthMiddleware
|
||||
from .tool_use_cleanup_middleware import ToolUseCleanupMiddleware
|
||||
from .tool_metrics_middleware import ToolMetricsMiddleware
|
||||
from .filepath_fix_middleware import FilePathFixMiddleware
|
||||
from .mcp_trace_meta import patch_mcp_client_session_trace_meta
|
||||
from utils.settings import (
|
||||
@ -256,6 +257,7 @@ async def init_agent(config: AgentConfig):
|
||||
# Build the middleware list
|
||||
middleware = []
|
||||
middleware.append(EmptyResponseRetryMiddleware())
|
||||
middleware.append(ToolMetricsMiddleware(config))
|
||||
middleware.append(ToolUseCleanupMiddleware())
|
||||
# tool_output_middleware = ToolOutputLengthMiddleware(
|
||||
# max_length=(getattr(config.generate_cfg, 'tool_output_max_length', None) if config.generate_cfg else None) or TOOL_OUTPUT_MAX_LENGTH,
|
||||
@ -480,13 +482,19 @@ def create_custom_cli_agent(
|
||||
     backend = FilesystemBackend(root_dir=workspace_root, virtual_mode=False)

     # Set up composite backend with routing based on the new implementation
+    # NOTE: virtual_mode=True anchors all paths to root_dir. This is required for
+    # these offload-only backends: CompositeBackend strips the route prefix and
+    # forwards "/" to grep, so virtual_mode=False would resolve "/" to the real
+    # filesystem root and scan the whole disk (hitting /usr, /var, other sessions'
+    # temp dirs), causing 45-152s grep calls. virtual_mode=True confines grep to
+    # the temp dir and filters out-of-root results.
     large_results_backend = FilesystemBackend(
         root_dir=tempfile.mkdtemp(prefix="deepagents_large_results_"),
-        virtual_mode=False,
+        virtual_mode=True,
     )
     conversation_history_backend = FilesystemBackend(
         root_dir=tempfile.mkdtemp(prefix="deepagents_conversation_history_"),
-        virtual_mode=False,
+        virtual_mode=True,
     )
     composite_backend = CompositeBackend(
         default=backend,
|
||||
|
||||
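The NOTE in the hunk above explains why the offload backends must anchor every path to their temp directory. A minimal sketch of that confinement idea follows, using only `pathlib`. `resolve_virtual` is a hypothetical helper written for illustration; it is not the `FilesystemBackend` implementation and makes no claim about how deepagents resolves paths internally.

```python
import tempfile
from pathlib import Path


def resolve_virtual(root_dir: str, virtual_path: str) -> Path:
    """Map a virtual path such as "/" onto root_dir (hypothetical helper)."""
    root = Path(root_dir).resolve()
    # Treat the incoming path as relative to root, even if it starts with "/".
    candidate = (root / virtual_path.lstrip("/")).resolve()
    # Reject anything that escapes the root, for example via "..".
    if candidate != root and root not in candidate.parents:
        raise ValueError(f"path escapes root: {virtual_path}")
    return candidate


root = tempfile.mkdtemp(prefix="large_results_demo_")
# "/" now refers to the temp directory, not the real filesystem root.
print(resolve_virtual(root, "/"))
```

With this kind of anchoring, a search that starts at "/" only walks the temp directory, which is the behaviour the NOTE attributes to `virtual_mode=True`.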
@ -5,14 +5,18 @@ Responsible for creating, caching, and managing the lifecycle of Mem0 client ins
|
||||
|
||||
import logging
|
||||
import asyncio
|
||||
import threading
|
||||
import concurrent.futures
|
||||
from typing import Any, Dict, List, Optional, Literal
|
||||
from collections import OrderedDict
|
||||
from embedding.manager import GlobalModelManager, get_model_manager
|
||||
from urllib.parse import unquote, urlparse
|
||||
from embedding.manager import get_model_manager
|
||||
import json_repair
|
||||
from psycopg2 import pool
|
||||
from utils.settings import (
|
||||
CHECKPOINT_DB_URL,
|
||||
EMBEDDING_API_KEY,
|
||||
EMBEDDING_BASE_URL,
|
||||
EMBEDDING_DIMENSIONS,
|
||||
EMBEDDING_MODEL_NAME,
|
||||
MEM0_POOL_SIZE
|
||||
)
|
||||
from .mem0_config import Mem0Config
|
||||
@ -27,15 +31,9 @@ logger = logging.getLogger("app")
|
||||
|
||||
class CustomMem0Embedding:
|
||||
"""
|
||||
Custom Mem0 embedding class that directly uses the project's existing GlobalModelManager
|
||||
|
||||
This prevents Mem0 from loading the same model again and saves memory
|
||||
Custom Mem0 embedding class backed by the external embedding API.
|
||||
"""
|
||||
|
||||
_model = None # Class variable caching the model instance
|
||||
_lock = threading.Lock() # Thread-safe lock
|
||||
_executor = None # Thread pool executor
|
||||
|
||||
def __init__(self, config: Optional[Any] = None):
|
||||
"""Initialize the custom embedding."""
|
||||
# Create a simple config object compatible with Mem0 telemetry code
|
||||
@ -46,42 +44,7 @@ class CustomMem0Embedding:
|
||||
@property
|
||||
def embedding_dims(self):
|
||||
"""Get the embedding dimension."""
|
||||
return 384 # Dimension of gte-tiny
|
||||
|
||||
def _get_model_sync(self):
|
||||
"""Synchronously get the model without using asyncio.run()."""
|
||||
# First try to get an already-loaded model from the manager
|
||||
manager = get_model_manager()
|
||||
model = manager.get_model_sync()
|
||||
|
||||
if model is not None:
|
||||
# Cache the model
|
||||
CustomMem0Embedding._model = model
|
||||
return model
|
||||
|
||||
# If the model is not loaded, run async initialization in a thread pool
|
||||
if CustomMem0Embedding._executor is None:
|
||||
CustomMem0Embedding._executor = concurrent.futures.ThreadPoolExecutor(
|
||||
max_workers=1,
|
||||
thread_name_prefix="mem0_embed"
|
||||
)
|
||||
|
||||
# Run async code in a dedicated thread
|
||||
def run_async_in_thread():
|
||||
loop = asyncio.new_event_loop()
|
||||
asyncio.set_event_loop(loop)
|
||||
try:
|
||||
result = loop.run_until_complete(manager.get_model())
|
||||
return result
|
||||
finally:
|
||||
loop.close()
|
||||
|
||||
future = CustomMem0Embedding._executor.submit(run_async_in_thread)
|
||||
model = future.result(timeout=30) # 30-second timeout
|
||||
|
||||
# Cache the model
|
||||
CustomMem0Embedding._model = model
|
||||
return model
|
||||
return EMBEDDING_DIMENSIONS
|
||||
|
||||
def embed(self, text, memory_action: Optional[Literal["add", "search", "update"]] = None):
|
||||
"""
|
||||
@ -94,15 +57,11 @@ class CustomMem0Embedding:
|
||||
Returns:
|
||||
list: Embedding vector
|
||||
"""
|
||||
# Retrieve the model in a thread-safe manner
|
||||
if CustomMem0Embedding._model is None:
|
||||
with CustomMem0Embedding._lock:
|
||||
if CustomMem0Embedding._model is None:
|
||||
self._get_model_sync()
|
||||
|
||||
model = CustomMem0Embedding._model
|
||||
embeddings = model.encode(text, convert_to_numpy=True)
|
||||
return embeddings.tolist()
|
||||
manager = get_model_manager()
|
||||
input_texts = text if isinstance(text, list) else [text]
|
||||
embeddings = manager.encode_texts_sync(input_texts, batch_size=1)
|
||||
result = embeddings.tolist()
|
||||
return result if isinstance(text, list) else result[0]
|
||||
|
||||
# Monkey patch: replace mem0's remove_code_blocks with json_repair
|
||||
def _remove_code_blocks_with_repair(content: str) -> str:
|
||||
@ -233,27 +192,68 @@ class Mem0Manager:
|
||||
mem0_instance: Mem0 Memory instance
|
||||
"""
|
||||
try:
|
||||
# Mem0 Memory instances have a vector_store attribute of type PGVector
|
||||
if hasattr(mem0_instance, 'vector_store'):
|
||||
vector_store = mem0_instance.vector_store
|
||||
# PGVector has conn and connection_pool attributes
|
||||
if hasattr(vector_store, 'conn') and hasattr(vector_store, 'connection_pool'):
|
||||
if vector_store.conn is not None and vector_store.connection_pool is not None:
|
||||
try:
|
||||
# Close the cursor first
|
||||
if hasattr(vector_store, 'cur') and vector_store.cur:
|
||||
vector_store.cur.close()
|
||||
vector_store.cur = None
|
||||
# Return the connection to the pool
|
||||
vector_store.connection_pool.putconn(vector_store.conn)
|
||||
# Mark as cleaned up to prevent __del__ from releasing it again
|
||||
vector_store.conn = None
|
||||
logger.debug("Successfully released Mem0 database connection back to pool")
|
||||
except Exception as e:
|
||||
logger.warning(f"Error releasing Mem0 connection: {e}")
|
||||
vector_store = getattr(mem0_instance, 'vector_store', None)
|
||||
if vector_store is not None and getattr(vector_store, 'conn', None) is not None:
|
||||
try:
|
||||
if getattr(vector_store, 'cur', None):
|
||||
vector_store.cur.close()
|
||||
vector_store.cur = None
|
||||
connection_pool = getattr(vector_store, 'connection_pool', None)
|
||||
if connection_pool is not None:
|
||||
connection_pool.putconn(vector_store.conn)
|
||||
logger.debug("Successfully released Mem0 database connection back to pool")
|
||||
else:
|
||||
vector_store.conn.close()
|
||||
logger.debug("Successfully closed Mem0 database connection")
|
||||
vector_store.conn = None
|
||||
except Exception as e:
|
||||
logger.warning(f"Error releasing Mem0 connection: {e}")
|
||||
except Exception as e:
|
||||
logger.warning(f"Error cleaning up Mem0 instance: {e}")
|
||||
|
||||
def _build_pgvector_config(self, agent_id: str) -> Dict[str, Any]:
|
||||
"""Build Mem0 PGVector config using only fields accepted by mem0."""
|
||||
parsed_url = urlparse(CHECKPOINT_DB_URL)
|
||||
if parsed_url.scheme not in ("postgresql", "postgres"):
|
||||
raise ValueError(f"Unsupported CHECKPOINT_DB_URL scheme: {parsed_url.scheme}")
|
||||
|
||||
return {
|
||||
"dbname": unquote(parsed_url.path.lstrip("/") or "postgres"),
|
||||
"user": unquote(parsed_url.username or ""),
|
||||
"password": unquote(parsed_url.password or ""),
|
||||
"host": parsed_url.hostname or "localhost",
|
||||
"port": parsed_url.port or 5432,
|
||||
"collection_name": f"mem0_{agent_id}".replace("-", "_")[:50],
|
||||
"embedding_model_dims": EMBEDDING_DIMENSIONS,
|
||||
}
|
||||
|
||||
def _attach_pool_to_vector_store(self, mem0_instance: Any) -> None:
|
||||
"""Move Mem0's runtime vector store onto the shared psycopg2 pool."""
|
||||
vector_store = getattr(mem0_instance, 'vector_store', None)
|
||||
if vector_store is None:
|
||||
return
|
||||
|
||||
if getattr(vector_store, 'cur', None):
|
||||
vector_store.cur.close()
|
||||
vector_store.cur = None
|
||||
if getattr(vector_store, 'conn', None) is not None:
|
||||
vector_store.conn.close()
|
||||
vector_store.conn = None
|
||||
vector_store.connection_pool = self._sync_pool
|
||||
|
||||
def _close_telemetry_vector_store(self, mem0_instance: Any) -> None:
|
||||
"""Close Mem0's migration telemetry vector-store connection after init."""
|
||||
vector_store = getattr(mem0_instance, '_telemetry_vector_store', None)
|
||||
if vector_store is None:
|
||||
return
|
||||
|
||||
if getattr(vector_store, 'cur', None):
|
||||
vector_store.cur.close()
|
||||
vector_store.cur = None
|
||||
if getattr(vector_store, 'conn', None) is not None:
|
||||
vector_store.conn.close()
|
||||
vector_store.conn = None
|
||||
|
||||
def _ensure_connection(self, mem0_instance: Any) -> None:
|
||||
"""Ensure a Mem0 instance has a database connection before use.
|
||||
|
||||
@ -268,8 +268,7 @@ class Mem0Manager:
|
||||
if hasattr(vs, 'conn') and vs.conn is None and self._sync_pool:
|
||||
vs.conn = self._sync_pool.getconn()
|
||||
vs.cur = vs.conn.cursor()
|
||||
# Ensure the connection_pool reference exists for later return
|
||||
if hasattr(vs, 'connection_pool') and vs.connection_pool is None:
|
||||
if not hasattr(vs, 'connection_pool') or vs.connection_pool is None:
|
||||
vs.connection_pool = self._sync_pool
|
||||
logger.debug("Re-acquired Mem0 database connection from pool")
|
||||
except Exception as e:
|
||||
@ -292,8 +291,11 @@ class Mem0Manager:
|
||||
if hasattr(vs, 'cur') and vs.cur:
|
||||
vs.cur.close()
|
||||
vs.cur = None
|
||||
if hasattr(vs, 'connection_pool') and vs.connection_pool is not None:
|
||||
vs.connection_pool.putconn(vs.conn)
|
||||
connection_pool = getattr(vs, 'connection_pool', None)
|
||||
if connection_pool is not None:
|
||||
connection_pool.putconn(vs.conn)
|
||||
else:
|
||||
vs.conn.close()
|
||||
vs.conn = None
|
||||
logger.debug("Released Mem0 database connection back to pool")
|
||||
except Exception as e:
|
||||
@ -376,28 +378,25 @@ class Mem0Manager:
|
||||
if not connection_pool:
|
||||
raise ValueError("Database connection pool not available")
|
||||
|
||||
# Create a custom embedder that reuses the shared model to avoid duplicate loading
|
||||
# Create a custom embedder backed by the external embedding API.
|
||||
custom_embedder = CustomMem0Embedding()
|
||||
|
||||
# Configure Mem0 to use Pgvector
|
||||
# Note: use huggingface_base_url here to bypass local model loading
|
||||
# Set a dummy base_url so HuggingFaceEmbedding does not load SentenceTransformer
|
||||
|
||||
# Configure Mem0 to use Pgvector.
|
||||
# Mem0 validates this config strictly, so connection_pool is attached after creation.
|
||||
pgvector_config = self._build_pgvector_config(agent_id)
|
||||
config_dict = {
|
||||
"vector_store": {
|
||||
"provider": "pgvector",
|
||||
"config": {
|
||||
"connection_pool": connection_pool,
|
||||
"collection_name": f"mem0_{agent_id}".replace("-", "_")[:50], # Isolate by agent_id
|
||||
"embedding_model_dims": 384, # Dimension of paraphrase-multilingual-MiniLM-L12-v2
|
||||
}
|
||||
"config": pgvector_config,
|
||||
},
|
||||
# Use huggingface_base_url to bypass model loading; it will later be replaced with the custom embedder
|
||||
# The embedder is replaced immediately after Memory is created.
|
||||
"embedder": {
|
||||
"provider": "huggingface",
|
||||
"provider": "openai",
|
||||
"config": {
|
||||
"huggingface_base_url": "http://dummy-url-that-will-be-replaced",
|
||||
"api_key": "dummy-key" # Placeholder to prevent OpenAI client validation failure
|
||||
"api_key": EMBEDDING_API_KEY,
|
||||
"openai_base_url": EMBEDDING_BASE_URL,
|
||||
"model": EMBEDDING_MODEL_NAME,
|
||||
"embedding_dims": EMBEDDING_DIMENSIONS,
|
||||
}
|
||||
}
|
||||
}
|
||||
@ -432,6 +431,8 @@ class Mem0Manager:
|
||||
|
||||
# Create the Mem0 instance
|
||||
mem = Memory.from_config(config_dict)
|
||||
self._attach_pool_to_vector_store(mem)
|
||||
self._close_telemetry_vector_store(mem)
|
||||
logger.debug(f"Original embedder type: {type(mem.embedding_model).__name__}")
|
||||
logger.debug(f"Original embedder.embedding_dims: {getattr(mem.embedding_model, 'embedding_dims', 'N/A')}")
|
||||
|
||||
|
||||
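The `_build_pgvector_config` method added above splits `CHECKPOINT_DB_URL` into the discrete fields Mem0 accepts. The same parsing can be sketched as a standalone function, assuming a made-up URL and agent id for illustration; `parse_postgres_url` and `collection_name_for` are hypothetical names, not part of the project.

```python
from urllib.parse import unquote, urlparse


def parse_postgres_url(db_url: str) -> dict:
    """Split a PostgreSQL URL into connection fields (illustrative helper)."""
    parsed = urlparse(db_url)
    if parsed.scheme not in ("postgresql", "postgres"):
        raise ValueError(f"Unsupported scheme: {parsed.scheme}")
    return {
        # Percent-encoded characters in credentials are decoded with unquote().
        "dbname": unquote(parsed.path.lstrip("/") or "postgres"),
        "user": unquote(parsed.username or ""),
        "password": unquote(parsed.password or ""),
        "host": parsed.hostname or "localhost",
        "port": parsed.port or 5432,
    }


def collection_name_for(agent_id: str) -> str:
    """Same naming rule as the diff: prefix, dashes to underscores, 50-char cap."""
    return f"mem0_{agent_id}".replace("-", "_")[:50]


# Example URL and agent id are invented for this sketch.
cfg = parse_postgres_url("postgresql://app:p%40ss@db.internal:6543/agents")
print(cfg["host"], cfg["port"], cfg["dbname"])
print(collection_name_for("bot-123"))
```

Passing discrete fields rather than a pool object matches the comment in the diff that Mem0 validates this config strictly, so the pool is attached after creation.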
100
agent/tool_metrics_middleware.py
Normal file
@ -0,0 +1,100 @@
|
||||
"""Structured metrics for agent tool calls."""

import asyncio
import logging
import time
from typing import Any, Callable

from langchain.agents.middleware import AgentMiddleware
from langchain.tools.tool_node import ToolCallRequest

from agent.agent_config import AgentConfig
from utils.structured_log import emit_question_metric

logger = logging.getLogger("app")


class ToolMetricsMiddleware(AgentMiddleware):
    """Emit structured timing metrics for every tool call."""

    def __init__(self, config: AgentConfig):
        self.config = config

    def _emit_tool_metric(
        self,
        request: ToolCallRequest,
        *,
        started_at: float,
        status: str,
        error_type: str | None = None,
    ) -> None:
        tool_call = request.tool_call or {}
        tool_name = tool_call.get("name") or "unknown_tool"
        tool_call_id = tool_call.get("id")
        duration_ms = max(int((time.monotonic() - started_at) * 1000), 0)

        try:
            emit_question_metric(
                stage="catalog_agent.tool_call",
                status=status,
                duration_ms=duration_ms,
                trace_id=self.config.trace_id,
                ai_id=self.config.bot_id,
                session_id=self.config.session_id,
                robot_type="agent",
                model=self.config.model_name,
                stream=self.config.stream,
                error_type=error_type,
                extra={
                    "bot_id": self.config.bot_id,
                    "tool_name": tool_name,
                    "tool_call_id": tool_call_id,
                    "tool_response": self.config.tool_response,
                    "enable_thinking": self.config.enable_thinking,
                },
            )
        except Exception:
            logger.exception("Failed to emit tool metric for tool_name=%s", tool_name)

    def wrap_tool_call(
        self,
        request: ToolCallRequest,
        handler: Callable[[ToolCallRequest], Any],
    ) -> Any:
        started_at = time.monotonic()
        try:
            result = handler(request)
        except Exception as exc:
            self._emit_tool_metric(
                request,
                started_at=started_at,
                status="error",
                error_type=type(exc).__name__,
            )
            raise

        self._emit_tool_metric(request, started_at=started_at, status="success")
        return result

    async def awrap_tool_call(
        self,
        request: ToolCallRequest,
        handler: Callable[[ToolCallRequest], Any],
    ) -> Any:
        started_at = time.monotonic()
        try:
            result = await handler(request)
        except asyncio.CancelledError:
            self._emit_tool_metric(request, started_at=started_at, status="cancel")
            raise
        except Exception as exc:
            self._emit_tool_metric(
                request,
                started_at=started_at,
                status="error",
                error_type=type(exc).__name__,
            )
            raise

        self._emit_tool_metric(request, started_at=started_at, status="success")
        return result
|
||||
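The middleware above times each tool call and emits exactly one record per call, whether the call succeeds or raises. The pattern can be sketched without LangChain as follows. `timed_call` and `_record` are hypothetical names for illustration, and the record is a plain dict rather than the project's `emit_question_metric` call.

```python
import time
from typing import Any, Callable, Dict, List, Optional


def _record(started_at: float, status: str, error_type: Optional[str]) -> Dict[str, Any]:
    # Clamp to zero, as the middleware does, so a duration is never negative.
    duration_ms = max(int((time.monotonic() - started_at) * 1000), 0)
    return {"status": status, "error_type": error_type, "duration_ms": duration_ms}


def timed_call(handler: Callable[[], Any], emit: Callable[[Dict[str, Any]], None]) -> Any:
    """Run handler() and emit one metric record with status and duration."""
    started_at = time.monotonic()  # monotonic clock, unaffected by wall-clock changes
    try:
        result = handler()
    except Exception as exc:
        emit(_record(started_at, "error", type(exc).__name__))
        raise  # re-raise so the caller still sees the original failure
    emit(_record(started_at, "success", None))
    return result


records: List[Dict[str, Any]] = []
timed_call(lambda: "ok", records.append)
try:
    timed_call(lambda: 1 / 0, records.append)
except ZeroDivisionError:
    pass
print([r["status"] for r in records])
```

Emitting inside the `except` branch and then re-raising keeps the metric a pure side effect: the wrapped call behaves the same with or without the wrapper.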
168
db_manager.py
@ -1,168 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
SQLite task status database management tool
|
||||
"""
|
||||
|
||||
import sqlite3
|
||||
import json
|
||||
import time
|
||||
from task_queue.task_status import task_status_store
|
||||
|
||||
def view_database():
|
||||
"""View database contents"""
|
||||
print("SQLite task status database contents")
|
||||
print("=" * 40)
|
||||
print(f"Database path: {task_status_store.db_path}")
|
||||
|
||||
# Connect to the database
|
||||
conn = sqlite3.connect(task_status_store.db_path)
|
||||
cursor = conn.cursor()
|
||||
|
||||
# View table schema
|
||||
print(f"\nTable schema:")
|
||||
cursor.execute("PRAGMA table_info(task_status)")
|
||||
columns = cursor.fetchall()
|
||||
for col in columns:
|
||||
print(f" {col[1]} ({col[2]})")
|
||||
|
||||
# View all records
|
||||
print(f"\nAll records:")
|
||||
cursor.execute("SELECT * FROM task_status ORDER BY updated_at DESC")
|
||||
rows = cursor.fetchall()
|
||||
|
||||
if not rows:
|
||||
print(" (empty database)")
|
||||
else:
|
||||
print(f" Total {len(rows)} records:")
|
||||
for i, row in enumerate(rows):
|
||||
task_id, unique_id, status, created_at, updated_at, result, error = row
|
||||
created_str = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(created_at))
|
||||
updated_str = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(updated_at))
|
||||
|
||||
print(f" {i+1}. {task_id}")
|
||||
print(f" Project ID: {unique_id}")
|
||||
print(f" Status: {status}")
|
||||
print(f" Created: {created_str}")
|
||||
print(f" Updated: {updated_str}")
|
||||
if result:
|
||||
try:
|
||||
result_data = json.loads(result)
|
||||
print(f" Result: {result_data.get('message', 'N/A')}")
|
||||
except:
|
||||
print(f" Result: {result[:50]}...")
|
||||
if error:
|
||||
print(f" Error: {error}")
|
||||
print()
|
||||
|
||||
conn.close()
|
||||
|
||||
def run_query(sql_query: str):
|
||||
"""Run a custom query"""
|
||||
print(f"Running query: {sql_query}")
|
||||
|
||||
try:
|
||||
conn = sqlite3.connect(task_status_store.db_path)
|
||||
conn.row_factory = sqlite3.Row
|
||||
cursor = conn.cursor()
|
||||
cursor.execute(sql_query)
|
||||
rows = cursor.fetchall()
|
||||
|
||||
if not rows:
|
||||
print(" (no results)")
|
||||
else:
|
||||
print(f" {len(rows)} results:")
|
||||
for row in rows:
|
||||
print(f" {dict(row)}")
|
||||
|
||||
conn.close()
|
||||
|
||||
except Exception as e:
|
||||
print(f"Query failed: {e}")
|
||||
|
||||
def interactive_shell():
|
||||
"""Interactive database management"""
|
||||
print("\n🖥️ Interactive database management")
|
||||
print("Type 'help' to view available commands, or 'quit' to exit")
|
||||
|
||||
while True:
|
||||
try:
|
||||
command = input("\n> ").strip()
|
||||
|
||||
if command.lower() in ['quit', 'exit', 'q']:
|
||||
break
|
||||
elif command.lower() == 'help':
|
||||
print("""
|
||||
Available commands:
|
||||
view - View all records
|
||||
stats - View statistics
|
||||
pending - View pending tasks
|
||||
completed - View completed tasks
|
||||
failed - View failed tasks
|
||||
sql <query> - Run an SQL query
|
||||
cleanup <days> - Clean up records older than N days
|
||||
count - Count total tasks
|
||||
help - Show help
|
||||
quit/exit/q - Exit
|
||||
""")
|
||||
elif command.lower() == 'view':
|
||||
view_database()
|
||||
elif command.lower() == 'stats':
|
||||
stats = task_status_store.get_statistics()
|
||||
print(f"Statistics:")
|
||||
print(f" Total tasks: {stats['total_tasks']}")
|
||||
print(f" Status breakdown: {stats['status_breakdown']}")
|
||||
print(f" Last 24 hours: {stats['recent_24h']}")
|
||||
elif command.lower() == 'pending':
|
||||
tasks = task_status_store.search_tasks(status="pending")
|
||||
print(f"Pending tasks ({len(tasks)}):")
|
||||
for task in tasks:
|
||||
print(f" - {task['task_id']}: {task['unique_id']}")
|
||||
elif command.lower() == 'completed':
|
||||
tasks = task_status_store.search_tasks(status="completed")
|
||||
print(f"Completed tasks ({len(tasks)}):")
|
||||
for task in tasks:
|
||||
print(f" - {task['task_id']}: {task['unique_id']}")
|
||||
elif command.lower() == 'failed':
|
||||
tasks = task_status_store.search_tasks(status="failed")
|
||||
print(f"Failed tasks ({len(tasks)}):")
|
||||
for task in tasks:
|
||||
print(f" - {task['task_id']}: {task['unique_id']}")
|
||||
elif command.lower().startswith('sql '):
|
||||
sql_query = command[4:]
|
||||
run_query(sql_query)
|
||||
elif command.lower().startswith('cleanup '):
|
||||
try:
|
||||
days = int(command[8:])
|
||||
count = task_status_store.cleanup_old_tasks(days)
|
||||
print(f"Cleaned up {count} records older than {days} days")
|
||||
except ValueError:
|
||||
print("Please enter a valid number of days")
|
||||
elif command.lower() == 'count':
|
||||
all_tasks = task_status_store.list_all()
|
||||
print(f"Total tasks: {len(all_tasks)}")
|
||||
else:
|
||||
print("Unknown command. Type 'help' for help")
|
||||
|
||||
except KeyboardInterrupt:
|
||||
print("\nGoodbye!")
|
||||
break
|
||||
except Exception as e:
|
||||
print(f"Execution error: {e}")
|
||||
|
||||
def main():
|
||||
"""Main function"""
|
||||
import sys
|
||||
|
||||
if len(sys.argv) > 1:
|
||||
if sys.argv[1] == 'view':
|
||||
view_database()
|
||||
elif sys.argv[1] == 'interactive':
|
||||
interactive_shell()
|
||||
else:
|
||||
print("Usage: python db_manager.py [view|interactive]")
|
||||
else:
|
||||
view_database()
|
||||
interactive_shell()
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
@ -4,128 +4,93 @@ Model pool manager and cache system
|
||||
Support high-concurrency embedding retrieval services
|
||||
"""
|
||||
|
||||
import os
|
||||
import asyncio
|
||||
import time
|
||||
import pickle
|
||||
import hashlib
|
||||
import logging
|
||||
from typing import Dict, List, Optional, Any, Tuple
|
||||
from dataclasses import dataclass
|
||||
from collections import OrderedDict
|
||||
from utils.settings import SENTENCE_TRANSFORMER_MODEL
|
||||
import threading
|
||||
import psutil
|
||||
from typing import Dict, List, Any
|
||||
from utils.settings import (
|
||||
EMBEDDING_API_KEY,
|
||||
EMBEDDING_BASE_URL,
|
||||
EMBEDDING_DIMENSIONS,
|
||||
EMBEDDING_MODEL_NAME,
|
||||
EMBEDDING_TIMEOUT,
|
||||
)
|
||||
import numpy as np
|
||||
import requests
|
||||
|
||||
from sentence_transformers import SentenceTransformer
|
||||
import logging
|
||||
|
||||
logger = logging.getLogger('app')
|
||||
|
||||
|
||||
class GlobalModelManager:
|
||||
"""Global model manager"""
|
||||
"""OpenAI-compatible embedding API manager."""
|
||||
|
||||
def __init__(self, model_name: str = 'TaylorAI/gte-tiny'):
|
||||
self.model_name = model_name
|
||||
self.local_model_path = "./models/gte-tiny"
|
||||
self._model: Optional[SentenceTransformer] = None
|
||||
self._lock = asyncio.Lock()
|
||||
self._load_time = 0
|
||||
self._device = 'cpu'
|
||||
def __init__(self):
|
||||
self.external_model_name = EMBEDDING_MODEL_NAME
|
||||
self.external_base_url = EMBEDDING_BASE_URL.rstrip("/")
|
||||
self.external_api_key = EMBEDDING_API_KEY
|
||||
self.external_dimensions = EMBEDDING_DIMENSIONS
|
||||
self.external_timeout = EMBEDDING_TIMEOUT
|
||||
|
||||
logger.info(f"GlobalModelManager initialized: {model_name}")
|
||||
|
||||
async def get_model(self) -> SentenceTransformer:
|
||||
"""Get the model instance with lazy loading"""
|
||||
if self._model is not None:
|
||||
return self._model
|
||||
|
||||
async with self._lock:
|
||||
# Double-check
|
||||
if self._model is not None:
|
||||
return self._model
|
||||
|
||||
try:
|
||||
start_time = time.time()
|
||||
|
||||
# Check the local model
|
||||
model_path = self.local_model_path if os.path.exists(self.local_model_path) else self.model_name
|
||||
|
||||
# Get device configuration
|
||||
self._device = os.environ.get('SENTENCE_TRANSFORMER_DEVICE', 'cpu')
|
||||
if self._device not in ['cpu', 'cuda', 'mps']:
|
||||
self._device = 'cpu'
|
||||
|
||||
logger.info(f"Loading model: {model_path} (device: {self._device})")
|
||||
|
||||
# Run blocking operations in the event loop executor
|
||||
loop = asyncio.get_event_loop()
|
||||
self._model = await loop.run_in_executor(
|
||||
None,
|
||||
lambda: SentenceTransformer(
|
||||
model_path,
|
||||
device=self._device
|
||||
)
|
||||
)
|
||||
|
||||
self._load_time = time.time() - start_time
|
||||
logger.info(f"Model loading completed: {self._load_time:.2f}s")
|
||||
|
||||
return self._model
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Model loading failed: {e}")
|
||||
raise
|
||||
logger.info(f"GlobalModelManager initialized: external_model={self.external_model_name}")
|
||||
|
||||
async def encode_texts(self, texts: List[str], batch_size: int = 32) -> np.ndarray:
|
||||
"""Encode texts into vectors"""
|
||||
"""Encode texts into vectors through the external embedding API."""
|
||||
if not texts:
|
||||
return np.array([])
|
||||
|
||||
model = await self.get_model()
|
||||
|
||||
try:
|
||||
# Run blocking operations in the event loop executor
|
||||
loop = asyncio.get_event_loop()
|
||||
embeddings = await loop.run_in_executor(
|
||||
None,
|
||||
lambda: model.encode(texts, batch_size=batch_size, show_progress_bar=False)
|
||||
|
||||
loop = asyncio.get_event_loop()
|
||||
return await loop.run_in_executor(
|
||||
None,
|
||||
lambda: self._encode_texts_external(texts)
|
||||
)
|
||||
|
||||
def encode_texts_sync(self, texts: List[str], batch_size: int = 32) -> np.ndarray:
|
||||
"""Synchronously encode texts. Used by synchronous integrations such as Mem0."""
|
||||
if not texts:
|
||||
return np.array([])
|
||||
|
||||
return self._encode_texts_external(texts)
|
||||
|
||||
def _encode_texts_external(self, texts: List[str]) -> np.ndarray:
|
||||
if not self.external_base_url:
|
||||
raise RuntimeError("EMBEDDING_BASE_URL is required for embedding API calls")
|
||||
|
||||
endpoint = f"{self.external_base_url}/embeddings"
|
||||
headers = {"Content-Type": "application/json"}
|
||||
if self.external_api_key:
|
||||
headers["Authorization"] = f"Bearer {self.external_api_key}"
|
||||
|
||||
payload: Dict[str, Any] = {
|
||||
"model": self.external_model_name,
|
||||
"input": texts,
|
||||
}
|
||||
if self.external_dimensions and self.external_model_name not in ("text-embedding-ada-002", "local-embedding"):
|
||||
payload["dimensions"] = self.external_dimensions
|
||||
|
||||
response = requests.post(
|
||||
endpoint,
|
||||
json=payload,
|
||||
headers=headers,
|
||||
timeout=self.external_timeout,
|
||||
)
|
||||
if response.status_code != 200:
|
||||
raise RuntimeError(f"External embedding API failed: {response.status_code} - {response.text}")
|
||||
|
||||
data = response.json()
|
||||
embeddings = [item["embedding"] for item in data.get("data", [])]
|
||||
if len(embeddings) != len(texts):
|
||||
raise RuntimeError(
|
||||
f"External embedding API returned {len(embeddings)} embeddings for {len(texts)} texts"
|
||||
)
|
||||
|
||||
# Ensure a NumPy array is returned
|
||||
if hasattr(embeddings, 'cpu'):
|
||||
embeddings = embeddings.cpu().numpy()
|
||||
elif hasattr(embeddings, 'numpy'):
|
||||
embeddings = embeddings.numpy()
|
||||
elif not isinstance(embeddings, np.ndarray):
|
||||
embeddings = np.array(embeddings)
|
||||
|
||||
return embeddings
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Text encoding failed: {e}")
|
||||
raise
|
||||
|
||||
def get_model_sync(self) -> Optional[SentenceTransformer]:
|
||||
"""Synchronously get the model instance for synchronous contexts
|
||||
|
||||
If the model is not loaded, return None. The caller should ensure the model is initialized via the async API first.
|
||||
|
||||
Returns:
|
||||
The loaded SentenceTransformer model, or None
|
||||
"""
|
||||
return self._model
|
||||
return np.array(embeddings)
|
||||
|
||||
def get_model_info(self) -> Dict[str, Any]:
|
||||
"""Get model information"""
|
||||
return {
|
||||
"model_name": self.model_name,
|
||||
"local_model_path": self.local_model_path,
|
||||
"device": self._device,
|
||||
"is_loaded": self._model is not None,
|
||||
"load_time": self._load_time
|
||||
"provider": "openai_compatible",
|
||||
"base_url": self.external_base_url,
|
||||
"model_name": self.external_model_name,
|
||||
"dimensions": self.external_dimensions,
|
||||
}
|
||||
|
||||
|
||||
@ -137,5 +102,5 @@ def get_model_manager() -> GlobalModelManager:
|
||||
"""Get the model manager instance"""
|
||||
global _model_manager
|
||||
if _model_manager is None:
|
||||
_model_manager = GlobalModelManager(SENTENCE_TRANSFORMER_MODEL)
|
||||
_model_manager = GlobalModelManager()
|
||||
return _model_manager
|
||||
|
||||
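The rewritten `GlobalModelManager` posts texts to an OpenAI-compatible `/embeddings` endpoint and checks that one vector comes back per input. The request-building and response-checking steps can be sketched without any network call. The function names, model name, and fake response below are assumptions for illustration only.

```python
from typing import Any, Dict, List, Optional

# Models for which the diff leaves the "dimensions" field out of the request.
MODELS_WITHOUT_DIMENSIONS = ("text-embedding-ada-002", "local-embedding")


def build_embedding_payload(model: str, texts: List[str], dimensions: Optional[int]) -> Dict[str, Any]:
    """Build the JSON body for an OpenAI-compatible embeddings request."""
    payload: Dict[str, Any] = {"model": model, "input": texts}
    if dimensions and model not in MODELS_WITHOUT_DIMENSIONS:
        payload["dimensions"] = dimensions
    return payload


def parse_embedding_response(data: Dict[str, Any], expected_count: int) -> List[List[float]]:
    """Extract vectors and verify there is exactly one per input text."""
    embeddings = [item["embedding"] for item in data.get("data", [])]
    if len(embeddings) != expected_count:
        raise RuntimeError(
            f"API returned {len(embeddings)} embeddings for {expected_count} texts"
        )
    return embeddings


# "example-embedding-model" and the response body are invented for this sketch.
payload = build_embedding_payload("example-embedding-model", ["hello", "world"], 1024)
fake_response = {"data": [{"embedding": [0.1, 0.2]}, {"embedding": [0.3, 0.4]}]}
vectors = parse_embedding_response(fake_response, expected_count=2)
print(payload)
print(vectors)
```

The count check matters because a silently short response would otherwise misalign vectors with their source texts further downstream.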
54
poetry.lock
generated
@ -1837,21 +1837,6 @@ files = [
|
||||
{file = "httpx_sse-0.4.3.tar.gz", hash = "sha256:9b1ed0127459a66014aec3c56bebd93da3c1bc8bb6618c8082039a44889a755d"},
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "huey"
|
||||
version = "2.6.0"
|
||||
description = "a little task queue"
|
||||
optional = false
|
||||
python-versions = "*"
|
||||
groups = ["main"]
|
||||
files = [
|
||||
{file = "huey-2.6.0-py3-none-any.whl", hash = "sha256:1b9df9d370b49c6d5721ba8a01ac9a787cf86b3bdc584e4679de27b920395c3f"},
|
||||
{file = "huey-2.6.0.tar.gz", hash = "sha256:8d11f8688999d65266af1425b831f6e3773e99415027177b8734b0ffd5e251f6"},
|
||||
]
|
||||
|
||||
[package.extras]
|
||||
backends = ["redis (>=3.0.0)"]
|
||||
|
||||
[[package]]
|
||||
name = "huggingface-hub"
|
||||
version = "0.36.2"
|
||||
@ -2848,26 +2833,6 @@ cli = ["python-dotenv (>=1.0.0)", "typer (>=0.16.0)"]
|
||||
rich = ["rich (>=13.9.4)"]
|
||||
ws = ["websockets (>=15.0.1)"]
|
||||
|
||||
[[package]]
|
||||
name = "mcp-ui-server"
|
||||
version = "1.0.0"
|
||||
description = "mcp-ui Server SDK for Python"
|
||||
optional = false
|
||||
python-versions = ">=3.10"
|
||||
groups = ["main"]
|
||||
files = [
|
||||
{file = "mcp_ui_server-1.0.0-py3-none-any.whl", hash = "sha256:85f53b2e4300fbd175f1fbb7c40f2566b1f4a4ad03a1f33647867c82a3159dcc"},
|
||||
{file = "mcp_ui_server-1.0.0.tar.gz", hash = "sha256:5ab8f17b93bf794966af7c35e9a575e4f21a9ba2bab3d316cfc107a15f88a3c9"},
|
||||
]
|
||||
|
||||
[package.dependencies]
|
||||
mcp = ">=1.0.0"
|
||||
pydantic = ">=2.0.0"
|
||||
typing-extensions = ">=4.0.0"
|
||||
|
||||
[package.extras]
|
||||
dev = ["pyright (>=1.1.0)", "pytest (>=7.0.0)", "ruff (>=0.1.0)"]
|
||||
|
||||
[[package]]
|
||||
name = "mdit-py-plugins"
|
||||
version = "0.6.1"
|
||||
@ -5049,6 +5014,23 @@ files = [
|
||||
beartype = ">=0.20.0,<1.0.0"
|
||||
requests = ">=2.30.0,<3.0.0"
|
||||
|
||||
[[package]]
|
||||
name = "redis"
|
||||
version = "6.4.0"
|
||||
description = "Python client for Redis database and key-value store"
|
||||
optional = false
|
||||
python-versions = ">=3.9"
|
||||
groups = ["main"]
|
||||
files = [
|
||||
{file = "redis-6.4.0-py3-none-any.whl", hash = "sha256:f0544fa9604264e9464cdf4814e7d4830f74b165d52f2a330a760a88dd248b7f"},
|
||||
{file = "redis-6.4.0.tar.gz", hash = "sha256:b01bc7282b8444e28ec36b261df5375183bb47a07eb9c603f284e89cbc5ef010"},
|
||||
]
|
||||
|
||||
[package.extras]
|
||||
hiredis = ["hiredis (>=3.2.0)"]
|
||||
jwt = ["pyjwt (>=2.9.0)"]
|
||||
ocsp = ["cryptography (>=36.0.1)", "pyopenssl (>=20.0.1)", "requests (>=2.31.0)"]
|
||||
|
||||
[[package]]
|
||||
name = "referencing"
|
||||
version = "0.37.0"
|
||||
@ -7484,4 +7466,4 @@ cffi = ["cffi (>=1.17,<2.0) ; platform_python_implementation != \"PyPy\" and pyt
|
||||
[metadata]
|
||||
lock-version = "2.1"
|
||||
python-versions = ">=3.12,<3.15"
|
||||
content-hash = "ad25328ad4a88f9a9dd9d34d0f9a097079b837325bf05183fd429e0f37cbc0ed"
|
||||
content-hash = "ba8491ec2ecd7c783fac68f66e7994279d51f6a09fdc1ec435941c1af52db0cb"
|
||||
|
||||
@ -81,6 +81,24 @@
|
||||
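- Tell the user the result was found by searching the "3階執務スペース" scope, and confirm whether to operate it
**Response**: "「3階執務スペース、フォーラム側窓側」では見つかりませんでしたが、3階執務スペースエリアで照明が見つかりました。こちらの照明を操作しますか?"

### Contact-information lookup scenarios (CRITICAL - lookups must be distinguished from sending)
**User**: "プランニンググループの連絡先はわかる?" (Do you know the planning group's contact information?)
- rag_retrieve(query="プランニンググループ 連絡先", top_k=100) → query the knowledge base first
- If the knowledge base returns results: give the contact information directly (phone, email, etc.)
- If the knowledge base returns nothing: tell the user the lookup failed and offer an alternative
**Response**: "プランニンググループの連絡先を調べますね。" → [after the lookup, give the specific contact information or offer an alternative]
**Forbidden**: do not call wowtalk_send_message_to_member in this scenario (it is a message-sending tool, not a contact lookup)

**User**: "PUの誰に連絡していいかわからない" (I don't know who in PU to contact)
- rag_retrieve(query="PU 連絡先 組織図", top_k=100) → query the knowledge base first
- find_employee_location(name="[relevant person]") → if a specific person's location is needed
- Give the user the relevant contact information or the person's location
**Response**: "PUの連絡先を確認しますね。" → [after the lookup, give the contact information or personnel information]

**Key distinction**:
- 「連絡先を知りたい」「連絡先はわかる」「連絡方法を教えて」→ **lookup scenario**: query the knowledge base with rag_retrieve
- 「連絡して」「通知して」「メッセージを送って」→ **send scenario**: use wowtalk_send_message_to_member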
|
||||
</scenarios>
|
||||
|
||||
|
||||
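The lookup-versus-send distinction in the scenarios above can be illustrated with a small keyword router. This is only a sketch: in the actual system the model makes this decision from the prompt, and the function name, the marker lists, and the extra marker 「誰に連絡」 are assumptions added for illustration.

```python
# Phrases that signal a lookup; 「誰に連絡」 is an added assumption so that
# "who should I contact" questions are caught before the send check.
LOOKUP_MARKERS = ("連絡先", "連絡方法", "電話番号", "メールアドレス", "組織図", "誰に連絡")
# Phrases that signal an explicit request to send a message.
SEND_MARKERS = ("連絡して", "通知して", "メッセージを送って")


def route_contact_intent(utterance: str) -> str:
    """Return the tool name this sketch would pick for an utterance."""
    # Lookup markers are checked first: 「誰に連絡していいか」 contains the
    # substring 「連絡して」 and would otherwise be misrouted to sending.
    if any(marker in utterance for marker in LOOKUP_MARKERS):
        return "rag_retrieve"
    if any(marker in utterance for marker in SEND_MARKERS):
        return "wowtalk_send_message_to_member"
    return "other"


print(route_contact_intent("プランニンググループの連絡先はわかる?"))
print(route_contact_intent("PUの誰に連絡していいかわからない"))
print(route_contact_intent("田中さんに通知して"))
```

The second example shows why plain substring matching is fragile here, which is one reason the prompt spells out the distinction explicitly instead of relying on keywords alone.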
@ -206,6 +224,30 @@
|
||||
- **条件**:用户意图为闲聊、问候、感谢、赞美等非实质性对话。
|
||||
- **动作**:给予简洁、友好、拟人化的自然回复。
|
||||
|
||||
9. 联系方式/组织图查询(CRITICAL - 与消息通知区分)
|
||||
- **条件**:用户意图为查询联系方式、组织架构、部门电话等(关键词:連絡先、連絡方法、電話番号、メールアドレス、組織図、誰に連絡すれば、etc.)
|
||||
- **动作**:
|
||||
1. **优先**调用【知识库检索】工具查询知识库(rag_retrieve,top_k=100)
|
||||
2. 若知识库有结果:直接告知用户查询到的联系方式(电话、邮箱、组织架构等)
|
||||
3. 若知识库无结果:调用【人员检索】工具查找相关人员(find_employee_location),告知用户人员位置信息
|
||||
4. **降级回复**(工具失败或无结果时):提供替代方案,避免空循环
|
||||
- **与消息通知的区别**:
|
||||
- 「連絡先を知りたい」「連絡先はわかる」「連絡方法を教えて」→ 查询场景,使用 rag_retrieve
|
||||
- 「連絡して」「通知して」「メッセージを送って」→ 发送场景,使用 wowtalk_send_message_to_member
|
||||
- **禁止行为**:
|
||||
- 禁止在用户查询联系方式时调用 wowtalk_send_message_to_member(这是发送消息工具)
|
||||
- 禁止回复「もう一度試してみましょうか?」(空循环),必须提供降级方案
|
||||
|
||||
10. 工具失败时的降级回复(CRITICAL - 避免空循环)
|
||||
- **条件**:当工具调用失败或返回空结果时。
|
||||
- **动作**:提供降级回复或替代方案,避免「もう一度試してみましょうか?」的空循环。
|
||||
- **降级回复示例**:
|
||||
- 联系方式查询失败:"申し訳ございません、連絡先の確認ができませんでした。社内Wikiの組織図をご確認いただくか、総務担当にお問い合わせいただけますでしょうか?"
|
||||
- 人员位置查询失败:"申し訳ございません、現在の人の居場所を確認することができません。後でもう一度お試しいただくか、直接 WowTalk で連絡してみていただけますでしょうか?"
|
||||
- 知识库查询失败:"申し訳ございません、情報の検索に失敗しました。別の言葉で質問いただくか、後ほど再度お試しいただけますでしょうか?"
|
||||
- 设备操作失败:"申し訳ございません、設備の操作ができませんでした。しばらく待ってから再度お試しいただくか、設備担当にお問い合わせいただけますでしょうか?"
|
||||
- **绝对禁止**:「もう一度試してみましょうか?」这种会导致空循环的回复。
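
The query-versus-send distinction and the fallback order in items 9 and 10 can be sketched in Python. This is only an illustration under stated assumptions: in the real system the routing is done by the LLM following this prompt, not by keyword matching. The function names `classify_intent` and `lookup_contact`, the cue `誰に連絡`, and the callable signatures for `rag_retrieve` and `find_employee_location` are hypothetical; only the tool names, `top_k=100`, and the fallback sentence come from the text above.

```python
from typing import Callable, Sequence

# Cues taken from the prompt text; "誰に連絡" is an assumed shortening of
# "誰に連絡すれば" so that "誰に連絡していいかわからない" is also covered.
QUERY_CUES = ("連絡先", "連絡方法", "電話番号", "メールアドレス", "組織図", "誰に連絡")
SEND_CUES = ("連絡して", "通知して", "メッセージを送って")

FALLBACK_REPLY = (
    "申し訳ございません、連絡先の確認ができませんでした。"
    "社内Wikiの組織図をご確認いただくか、総務担当にお問い合わせいただけますでしょうか?"
)


def classify_intent(message: str) -> str:
    """Return "query", "send", or "other" for a user message."""
    # Query cues are checked first: "誰に連絡していいか" contains the
    # send cue "連絡して" as a substring but is a lookup request.
    if any(cue in message for cue in QUERY_CUES):
        return "query"
    if any(cue in message for cue in SEND_CUES):
        return "send"
    return "other"


def lookup_contact(
    query: str,
    rag_retrieve: Callable[..., Sequence[str]],
    find_employee_location: Callable[..., Sequence[str]],
) -> str:
    """Knowledge base first, people search second, fixed fallback last."""
    try:
        hits = rag_retrieve(query=query, top_k=100)
        if hits:
            return "\n".join(hits)
        people = find_employee_location(name=query)
        if people:
            return "\n".join(people)
    except Exception:
        # A failed tool call is treated like an empty result.
        pass
    # Never answer with "もう一度試してみましょうか?"; offer an alternative instead.
    return FALLBACK_REPLY
```

Checking query cues before send cues matters because several lookup phrasings contain a send cue as a substring.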
|
||||
|
||||
|
||||
## 设备控制确认机制
|
||||
|
||||
|
||||
@ -19,7 +19,7 @@ dependencies = [
|
||||
"numpy<2",
|
||||
"aiohttp",
|
||||
"aiofiles",
|
||||
"huey (>=2.5.3,<3.0.0)",
|
||||
"redis (>=4.0,<7.0)",
|
||||
"pandas>=1.5.0",
|
||||
"openpyxl>=3.0.0",
|
||||
"xlrd>=2.0.0",
|
||||
|
||||
@ -58,7 +58,6 @@ hpack==4.1.0 ; python_version >= "3.12" and python_version < "3.15"
|
||||
httpcore==1.0.9 ; python_version >= "3.12" and python_version < "3.15"
|
||||
httpx-sse==0.4.3 ; python_version >= "3.12" and python_version < "3.15"
|
||||
httpx==0.28.1 ; python_version >= "3.12" and python_version < "3.15"
|
||||
huey==2.6.0 ; python_version >= "3.12" and python_version < "3.15"
|
||||
huggingface-hub==0.36.2 ; python_version >= "3.12" and python_version < "3.15"
|
||||
hyperframe==6.1.0 ; python_version >= "3.12" and python_version < "3.15"
|
||||
idna==3.15 ; python_version >= "3.12" and python_version < "3.15"
|
||||
@ -96,7 +95,6 @@ linkify-it-py==2.1.0 ; python_version >= "3.12" and python_version < "3.15"
|
||||
markdown-it-py==4.2.0 ; python_version >= "3.12" and python_version < "3.15"
|
||||
markdownify==1.2.2 ; python_version >= "3.12" and python_version < "3.15"
|
||||
markupsafe==3.0.3 ; python_version >= "3.12" and python_version < "3.15"
|
||||
mcp-ui-server==1.0.0 ; python_version >= "3.12" and python_version < "3.15"
|
||||
mcp==1.12.4 ; python_version >= "3.12" and python_version < "3.15"
|
||||
mdit-py-plugins==0.6.1 ; python_version >= "3.12" and python_version < "3.15"
|
||||
mdurl==0.1.2 ; python_version >= "3.12" and python_version < "3.15"
|
||||
@ -166,6 +164,7 @@ pyyaml==6.0.3 ; python_version >= "3.12" and python_version < "3.15"
|
||||
qdrant-client==1.12.1 ; python_version >= "3.13" and python_version < "3.15"
|
||||
qdrant-client==1.18.0 ; python_version == "3.12"
|
||||
ragflow-sdk==0.23.1 ; python_version >= "3.12" and python_version < "3.15"
|
||||
redis==6.4.0 ; python_version >= "3.12" and python_version < "3.15"
|
||||
referencing==0.37.0 ; python_version >= "3.12" and python_version < "3.15"
|
||||
regex==2026.5.9 ; python_version >= "3.12" and python_version < "3.15"
|
||||
requests-toolbelt==1.0.0 ; python_version >= "3.12" and python_version < "3.15"
|
||||
|
||||
377
routes/files.py
@ -1,273 +1,18 @@
|
||||
import os
|
||||
import uuid
|
||||
import shutil
|
||||
import zipfile
|
||||
from datetime import datetime
|
||||
from typing import Optional, List
|
||||
from fastapi import APIRouter, HTTPException, Header, UploadFile, File, Form
|
||||
from pydantic import BaseModel
|
||||
from typing import Optional
|
||||
from fastapi import APIRouter, HTTPException, UploadFile, File, Form
|
||||
import logging
|
||||
|
||||
logger = logging.getLogger('app')
|
||||
|
||||
from utils import (
|
||||
DatasetRequest, QueueTaskRequest, IncrementalTaskRequest, QueueTaskResponse,
|
||||
load_processed_files_log, remove_file_or_directory, remove_dataset_directory_by_key
|
||||
)
|
||||
from utils.fastapi_utils import get_versioned_filename
|
||||
from task_queue.manager import queue_manager
|
||||
from task_queue.integration_tasks import process_files_async, process_files_incremental_async, cleanup_project_async
|
||||
from task_queue.task_status import task_status_store
|
||||
|
||||
router = APIRouter()
|
||||
|
||||
|
||||
@router.post("/api/v1/files/process/async")
|
||||
async def process_files_async_endpoint(request: QueueTaskRequest, authorization: Optional[str] = Header(None)):
|
||||
"""
|
||||
Queue-based API for asynchronous file processing.
|
||||
Same functionality as /api/v1/files/process, but processed asynchronously through the queue.
|
||||
|
||||
Args:
|
||||
request: QueueTaskRequest containing dataset_id, files, system_prompt, mcp_settings, and queue options
|
||||
authorization: Authorization header containing API key (Bearer <API_KEY>)
|
||||
|
||||
Returns:
|
||||
QueueTaskResponse: Processing result with task ID for tracking
|
||||
"""
|
||||
try:
|
||||
dataset_id = request.dataset_id
|
||||
if not dataset_id:
|
||||
raise HTTPException(status_code=400, detail="dataset_id is required")
|
||||
|
||||
# Estimate processing time (based on file count)
|
||||
estimated_time = 0
|
||||
if request.upload_folder:
|
||||
# For upload_folder, file count cannot be estimated in advance, so use the default time
|
||||
estimated_time = 120 # Default: 2 minutes
|
||||
elif request.files:
|
||||
total_files = sum(len(file_list) for file_list in request.files.values())
|
||||
estimated_time = max(30, total_files * 10) # Estimated 10 seconds per file, minimum 30 seconds
|
||||
|
||||
# Create task status record
|
||||
import uuid
|
||||
task_id = str(uuid.uuid4())
|
||||
task_status_store.set_status(
|
||||
task_id=task_id,
|
||||
unique_id=dataset_id,
|
||||
status="pending"
|
||||
)
|
||||
|
||||
# Submit async task
|
||||
task = process_files_async(
|
||||
dataset_id=dataset_id,
|
||||
files=request.files,
|
||||
upload_folder=request.upload_folder,
|
||||
task_id=task_id
|
||||
)
|
||||
|
||||
# Build a more detailed message
|
||||
message = f"File processing task has been submitted to the queue, project ID: {dataset_id}"
|
||||
if request.upload_folder:
|
||||
group_count = len(request.upload_folder)
|
||||
message += f", files will be scanned automatically from {group_count} uploaded folders"
|
||||
elif request.files:
|
||||
total_files = sum(len(file_list) for file_list in request.files.values())
|
||||
message += f", including {total_files} files"
|
||||
|
||||
return QueueTaskResponse(
|
||||
success=True,
|
||||
message=message,
|
||||
dataset_id=dataset_id,
|
||||
task_id=task_id, # Use our own task_id
|
||||
task_status="pending",
|
||||
estimated_processing_time=estimated_time
|
||||
)
|
||||
|
||||
except HTTPException:
|
||||
raise
|
||||
except Exception as e:
|
||||
logger.error(f"Error submitting async file processing task: {str(e)}")
|
||||
raise HTTPException(status_code=500, detail=f"Internal server error: {str(e)}")
|
||||
|
||||
|
||||
@router.post("/api/v1/files/process/incremental")
|
||||
async def process_files_incremental_endpoint(request: IncrementalTaskRequest, authorization: Optional[str] = Header(None)):
|
||||
"""
|
||||
Queue-based API for incremental file processing, supporting file additions and deletions.
|
||||
|
||||
Args:
|
||||
request: IncrementalTaskRequest containing dataset_id, files_to_add, files_to_remove, system_prompt, mcp_settings, and queue options
|
||||
authorization: Authorization header containing API key (Bearer <API_KEY>)
|
||||
|
||||
Returns:
|
||||
QueueTaskResponse: Processing result with task ID for tracking
|
||||
"""
|
||||
try:
|
||||
dataset_id = request.dataset_id
|
||||
if not dataset_id:
|
||||
raise HTTPException(status_code=400, detail="dataset_id is required")
|
||||
|
||||
# Validate that there is at least one add or delete operation
|
||||
if not request.files_to_add and not request.files_to_remove:
|
||||
raise HTTPException(status_code=400, detail="At least one of files_to_add or files_to_remove must be provided")
|
||||
|
||||
# Estimate processing time (based on file count)
|
||||
estimated_time = 0
|
||||
total_add_files = sum(len(file_list) for file_list in (request.files_to_add or {}).values())
|
||||
total_remove_files = sum(len(file_list) for file_list in (request.files_to_remove or {}).values())
|
||||
total_files = total_add_files + total_remove_files
|
||||
estimated_time = max(30, total_files * 10) # Estimated 10 seconds per file, minimum 30 seconds
|
||||
|
||||
# Create task status record
|
||||
import uuid
|
||||
task_id = str(uuid.uuid4())
|
||||
task_status_store.set_status(
|
||||
task_id=task_id,
|
||||
unique_id=dataset_id,
|
||||
status="pending"
|
||||
)
|
||||
|
||||
# Submit incremental async task
|
||||
task = process_files_incremental_async(
|
||||
dataset_id=dataset_id,
|
||||
files_to_add=request.files_to_add,
|
||||
files_to_remove=request.files_to_remove,
|
||||
system_prompt=request.system_prompt,
|
||||
mcp_settings=request.mcp_settings,
|
||||
task_id=task_id
|
||||
)
|
||||
|
||||
return QueueTaskResponse(
|
||||
success=True,
|
||||
message=f"Incremental file processing task has been submitted to the queue - added {total_add_files} files, removed {total_remove_files} files, project ID: {dataset_id}",
|
||||
dataset_id=dataset_id,
|
||||
task_id=task_id,
|
||||
task_status="pending",
|
||||
estimated_processing_time=estimated_time
|
||||
)
|
||||
|
||||
except HTTPException:
|
||||
raise
|
||||
except Exception as e:
|
||||
logger.error(f"Error submitting incremental file processing task: {str(e)}")
|
||||
raise HTTPException(status_code=500, detail=f"Internal server error: {str(e)}")
|
||||
|
||||
|
||||
@router.get("/api/v1/files/{dataset_id}/status")
|
||||
async def get_files_processing_status(dataset_id: str):
|
||||
"""Get the file processing status for the project."""
|
||||
try:
|
||||
# Load processed files log
|
||||
processed_log = load_processed_files_log(dataset_id)
|
||||
|
||||
# Get project directory info
|
||||
project_dir = os.path.join("projects", "data", dataset_id)
|
||||
project_exists = os.path.exists(project_dir)
|
||||
|
||||
# Collect document.txt files
|
||||
document_files = []
|
||||
if project_exists:
|
||||
for root, dirs, files in os.walk(project_dir):
|
||||
for file in files:
|
||||
if file == "document.txt":
|
||||
document_files.append(os.path.join(root, file))
|
||||
|
||||
return {
|
||||
"dataset_id": dataset_id,
|
||||
"project_exists": project_exists,
|
||||
"processed_files_count": len(processed_log),
|
||||
"processed_files": processed_log,
|
||||
"document_files_count": len(document_files),
|
||||
"document_files": document_files,
|
||||
"log_file_exists": os.path.exists(os.path.join("projects", "data", dataset_id, "processed_files.json"))
|
||||
}
|
||||
except Exception as e:
|
||||
raise HTTPException(status_code=500, detail=f"Failed to retrieve file processing status: {str(e)}")
|
||||
|
||||
|
||||
@router.post("/api/v1/files/{dataset_id}/reset")
|
||||
async def reset_files_processing(dataset_id: str):
|
||||
"""Reset the project's file processing status by deleting the processing log and all files."""
|
||||
try:
|
||||
project_dir = os.path.join("projects", "data", dataset_id)
|
||||
log_file = os.path.join("projects", "data", dataset_id, "processed_files.json")
|
||||
|
||||
# Load processed log to know what files to remove
|
||||
processed_log = load_processed_files_log(dataset_id)
|
||||
|
||||
removed_files = []
|
||||
# Remove all processed files and their dataset directories
|
||||
for file_hash, file_info in processed_log.items():
|
||||
# Remove local file in files directory
|
||||
if 'local_path' in file_info:
|
||||
if remove_file_or_directory(file_info['local_path']):
|
||||
removed_files.append(file_info['local_path'])
|
||||
|
||||
# Handle new key-based structure first
|
||||
if 'key' in file_info:
|
||||
# Remove dataset directory by key
|
||||
key = file_info['key']
|
||||
if remove_dataset_directory_by_key(dataset_id, key):
|
||||
removed_files.append(f"dataset/{key}")
|
||||
elif 'filename' in file_info:
|
||||
# Fallback to old filename-based structure
|
||||
filename_without_ext = os.path.splitext(file_info['filename'])[0]
|
||||
dataset_dir = os.path.join("projects", "data", dataset_id, "datasets", filename_without_ext)
|
||||
if remove_file_or_directory(dataset_dir):
|
||||
removed_files.append(dataset_dir)
|
||||
|
||||
# Also remove any specific dataset path if exists (fallback)
|
||||
if 'dataset_path' in file_info:
|
||||
if remove_file_or_directory(file_info['dataset_path']):
|
||||
removed_files.append(file_info['dataset_path'])
|
||||
|
||||
# Remove the log file
|
||||
if remove_file_or_directory(log_file):
|
||||
removed_files.append(log_file)
|
||||
|
||||
# Remove the entire files directory
|
||||
files_dir = os.path.join(project_dir, "files")
|
||||
if remove_file_or_directory(files_dir):
|
||||
removed_files.append(files_dir)
|
||||
|
||||
# Also remove the entire dataset directory (clean up any remaining files)
|
||||
dataset_dir = os.path.join(project_dir, "datasets")
|
||||
if remove_file_or_directory(dataset_dir):
|
||||
removed_files.append(dataset_dir)
|
||||
|
||||
# Remove README.md if exists
|
||||
readme_file = os.path.join(project_dir, "README.md")
|
||||
if remove_file_or_directory(readme_file):
|
||||
removed_files.append(readme_file)
|
||||
|
||||
return {
|
||||
"message": f"File processing status reset successfully: {dataset_id}",
|
||||
"removed_files_count": len(removed_files),
|
||||
"removed_files": removed_files
|
||||
}
|
||||
except Exception as e:
|
||||
raise HTTPException(status_code=500, detail=f"Failed to reset file processing status: {str(e)}")
|
||||
|
||||
|
||||
@router.post("/api/v1/files/{dataset_id}/cleanup/async")
|
||||
async def cleanup_project_async_endpoint(dataset_id: str, remove_all: bool = False):
|
||||
"""Asynchronously clean up project files."""
|
||||
try:
|
||||
task = cleanup_project_async(dataset_id=dataset_id, remove_all=remove_all)
|
||||
|
||||
return {
|
||||
"success": True,
|
||||
"message": f"Project cleanup task has been submitted to the queue, project ID: {dataset_id}",
|
||||
"dataset_id": dataset_id,
|
||||
"task_id": task.id,
|
||||
"action": "remove_all" if remove_all else "cleanup_logs"
|
||||
}
|
||||
except Exception as e:
|
||||
logger.error(f"Error submitting cleanup task: {str(e)}")
|
||||
raise HTTPException(status_code=500, detail=f"Failed to submit cleanup task: {str(e)}")
|
||||
|
||||
|
||||
@router.post("/api/v1/upload")
|
||||
async def upload_file(file: UploadFile = File(...), folder: Optional[str] = Form(None)):
|
||||
"""
|
||||
@ -348,121 +93,3 @@ async def upload_file(file: UploadFile = File(...), folder: Optional[str] = Form
|
||||
except Exception as e:
|
||||
logger.error(f"Error uploading file: {str(e)}")
|
||||
raise HTTPException(status_code=500, detail=f"File upload failed: {str(e)}")
|
||||
|
||||
|
||||
# Task management routes that are related to file processing
|
||||
@router.get("/api/v1/task/{task_id}/status")
|
||||
async def get_task_status(task_id: str):
|
||||
"""Get task status - simple and reliable."""
|
||||
try:
|
||||
status_data = task_status_store.get_status(task_id)
|
||||
|
||||
if not status_data:
|
||||
return {
|
||||
"success": False,
|
||||
"message": "Task does not exist or has expired",
|
||||
"task_id": task_id,
|
||||
"status": "not_found"
|
||||
}
|
||||
|
||||
return {
|
||||
"success": True,
|
||||
"message": "Task status retrieved successfully",
|
||||
"task_id": task_id,
|
||||
"status": status_data["status"],
|
||||
"unique_id": status_data["unique_id"],
|
||||
"created_at": status_data["created_at"],
|
||||
"updated_at": status_data["updated_at"],
|
||||
"result": status_data.get("result"),
|
||||
"error": status_data.get("error")
|
||||
}
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error getting task status: {str(e)}")
|
||||
raise HTTPException(status_code=500, detail=f"Failed to retrieve task status: {str(e)}")
|
||||
|
||||
|
||||
@router.delete("/api/v1/task/{task_id}")
|
||||
async def delete_task(task_id: str):
|
||||
"""Delete task record."""
|
||||
try:
|
||||
success = task_status_store.delete_status(task_id)
|
||||
if success:
|
||||
return {
|
||||
"success": True,
|
||||
"message": f"Task record deleted: {task_id}",
|
||||
"task_id": task_id
|
||||
}
|
||||
else:
|
||||
return {
|
||||
"success": False,
|
||||
"message": f"Task record does not exist: {task_id}",
|
||||
"task_id": task_id
|
||||
}
|
||||
except Exception as e:
|
||||
logger.error(f"Error deleting task: {str(e)}")
|
||||
raise HTTPException(status_code=500, detail=f"Failed to delete task record: {str(e)}")
|
||||
|
||||
|
||||
@router.get("/api/v1/tasks")
|
||||
async def list_tasks(status: Optional[str] = None, dataset_id: Optional[str] = None, limit: int = 100):
|
||||
"""List tasks with optional filters."""
|
||||
try:
|
||||
if status or dataset_id:
|
||||
# Use search function
|
||||
tasks = task_status_store.search_tasks(status=status, unique_id=dataset_id, limit=limit)
|
||||
else:
|
||||
# Get all tasks
|
||||
all_tasks = task_status_store.list_all()
|
||||
tasks = list(all_tasks.values())[:limit]
|
||||
|
||||
return {
|
||||
"success": True,
|
||||
"message": "Task list retrieved successfully",
|
||||
"total_tasks": len(tasks),
|
||||
"tasks": tasks,
|
||||
"filters": {
|
||||
"status": status,
|
||||
"dataset_id": dataset_id,
|
||||
"limit": limit
|
||||
}
|
||||
}
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error listing tasks: {str(e)}")
|
||||
raise HTTPException(status_code=500, detail=f"Failed to retrieve task list: {str(e)}")
|
||||
|
||||
|
||||
@router.get("/api/v1/tasks/statistics")
|
||||
async def get_task_statistics():
|
||||
"""Get task statistics."""
|
||||
try:
|
||||
stats = task_status_store.get_statistics()
|
||||
|
||||
return {
|
||||
"success": True,
|
||||
"message": "Statistics retrieved successfully",
|
||||
"statistics": stats
|
||||
}
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error getting statistics: {str(e)}")
|
||||
raise HTTPException(status_code=500, detail=f"Failed to retrieve statistics: {str(e)}")
|
||||
|
||||
|
||||
@router.post("/api/v1/tasks/cleanup")
|
||||
async def cleanup_tasks(older_than_days: int = 7):
|
||||
"""Clean up old task records."""
|
||||
try:
|
||||
deleted_count = task_status_store.cleanup_old_tasks(older_than_days=older_than_days)
|
||||
|
||||
return {
|
||||
"success": True,
|
||||
"message": f"Cleaned up {deleted_count} old task records",
|
||||
"deleted_count": deleted_count,
|
||||
"older_than_days": older_than_days
|
||||
}
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error cleaning up tasks: {str(e)}")
|
||||
raise HTTPException(status_code=500, detail=f"Failed to clean up task records: {str(e)}")
|
||||
|
||||
@ -6,8 +6,6 @@ import logging
|
||||
|
||||
logger = logging.getLogger('app')
|
||||
|
||||
from task_queue.task_status import task_status_store
|
||||
|
||||
router = APIRouter()
|
||||
|
||||
|
||||
@ -155,22 +153,3 @@ async def list_datasets():
|
||||
except Exception as e:
|
||||
logger.error(f"Error listing datasets: {str(e)}")
|
||||
raise HTTPException(status_code=500, detail=f"Failed to retrieve dataset list: {str(e)}")
|
||||
|
||||
|
||||
@router.get("/api/v1/projects/{dataset_id}/tasks")
|
||||
async def get_project_tasks(dataset_id: str):
|
||||
"""Get all tasks for the specified project."""
|
||||
try:
|
||||
tasks = task_status_store.get_by_unique_id(dataset_id)
|
||||
|
||||
return {
|
||||
"success": True,
|
||||
"message": "Project tasks retrieved successfully",
|
||||
"dataset_id": dataset_id,
|
||||
"total_tasks": len(tasks),
|
||||
"tasks": tasks
|
||||
}
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error getting project tasks: {str(e)}")
|
||||
raise HTTPException(status_code=500, detail=f"Failed to retrieve project tasks: {str(e)}")
|
||||
|
||||
288
skills/developing/content-compliance-reviewer/SKILL.md
Normal file
@ -0,0 +1,288 @@
|
||||
---
|
||||
name: content-compliance-reviewer
|
||||
description: 对企业对外发布内容(新闻稿、社媒、广告物料、对外邮件/函件等)做严格的合规与风险审核,逐项检查违法违规红线、违反广告法的绝对化用语、夸大/误导宣传、敏感信息与个人信息(PII)泄露、未授权对外承诺、知识产权、品牌一致性、事实准确性、渠道受众适配、完整性等风险点,遵循“疑罪从严、对外即不可逆、存疑即退回”的从严原则。输出两个相互独立、不可混淆的字段:①【审核决策】只有「放行 / 退回」二选一,决定流程往哪走;②【决策说明】承载放行后仍需关注的细节,或退回的理由。当收到对外内容数据、内容发布审批、对外宣传审核、文案合规检查、content review、PR/广告/社媒发布审核等请求,或拿到包含 title/content/content_type/channel 等字段的对外内容表单数据需要判断是否可发布时,务必使用本技能。只输出结构化文本,不要输出 JSON。
|
||||
category: Compliance & Security
|
||||
---
|
||||
|
||||
# 对外内容发布合规审核助手(Content Compliance Reviewer)· 从严版
|
||||
|
||||
## Overview
|
||||
|
||||
本技能面向企业 OA 对外内容发布流程,对一篇拟对外发布的内容做**自动初审**,识别合规、敏感信息与夸大宣传风险。
|
||||
|
||||
本技能采用**从严审核(fail-closed)立场**:对外发布**一经发出即不可逆**,当合规存疑、含敏感信息、宣传无依据、信息不足以排除风险时,**默认退回**而非放行。宁可让发起人多改一版,也不放过一篇存疑内容流向人工审批后被发布。初审的价值在于把好第一道关,把明显违规或无法核实的内容挡在前面。
|
||||
|
||||
⚠️ **核心模型:决策(二元)与说明(文本)是两个完全独立的字段,绝不可混为一谈。**
|
||||
|
||||
1. **【审核决策】**:**只有两种、互斥、二选一** —— **放行** 或 **退回**。这是唯一驱动 OA 流程往哪走的字段。
|
||||
- **放行**:内容流向下一个人工审批节点(如经理、法务、品牌)。
|
||||
- **退回**:内容打回发起人修改,修改后可重新提交;流程未终止。
|
||||
2. **【决策说明】**:一段文本,**与决策正交**,用来承载“为什么”和“注意什么”。
|
||||
- 放行时:写**通过后仍需提醒人工审批人关注的细节**(没有就写“无”)。
|
||||
- 退回时:写**必须退回修改的理由**。
|
||||
|
||||
> 不存在“需关注”这种第三种决策。“需关注”不是一个决策档位,而是 **决策=放行 + 决策说明里写明了要关注的点**。把“是否放行”和“有没有要关注/要修改的事”彻底拆开,是本技能最重要的纪律。
|
||||
|
||||
定位说明:
|
||||
|
||||
- 你是**初审 agent**,只负责审核并**输出文本结论**。
|
||||
- 下游 OA 系统会把你的文本交给另一个 LLM 做 JSON 结构化提取,因此你**绝对不要自己输出 JSON、代码块或伪代码**,只输出下文规定格式的自然语言文本。
|
||||
- 你只能在“放行 / 退回”之间二选一,**无权终止流程**。你的“退回”=打回修改(可重提),**不等于**人工审批环节里的“驳回”(流程直接终止)。
|
||||
|
||||
## Triggering Cues
|
||||
|
||||
出现以下任一情况就使用本技能:
|
||||
|
||||
- 中文:内容发布审批、对外宣传审核、文案合规检查、新闻稿审核、社媒/公众号审核、广告合规审核、对外函件审核
|
||||
- 英文:content review, content compliance check, PR review, marketing copy review, social media review
|
||||
- 收到一段对外内容表单数据(含 `title` 标题、`content` 正文、`content_type` 内容类型、`channel` 发布渠道 等字段),要求判断是否可以发布。
|
||||
|
||||
## 输入(Input)
|
||||
|
||||
通常会收到一篇对外内容的字段数据,常见字段:
|
||||
|
||||
| 字段 key | 含义 | 说明 |
|
||||
|---|---|---|
|
||||
| `title` | 内容标题 | 必填 |
|
||||
| `content` | 发布正文 | 必填;审核主体 |
|
||||
| `content_type` | 内容类型 | pr(新闻稿) / social(社媒) / ad(广告) / letter(对外函件) / other |
|
||||
| `channel` | 发布渠道 | website / social / media / email / offline |
|
||||
| `target_audience` | 目标受众 | 可能没有;用于渠道-受众适配判断 |
|
||||
| `attachment` | 配图/物料 | URL 或附件标识;可能含图片中的文案/IP 风险 |
|
||||
| `reason` | 发布目的/背景 | 自由文本 |
|
||||
| `author` / `dept` | 发起人/部门 | 可能没有 |
|
||||
|
||||
字段缺失时:
|
||||
- **必填项(标题、正文)缺失或为占位**:一律视为硬性缺陷,**退回**。
|
||||
- **可选上下文(受众、配图、背景)缺失**:不因此卡死流程,但要在 `【说明与假设】` 中说明“因缺少 X 无法核验 Y”,并按从严方向取舍(存疑即偏向退回)。
|
||||
|
||||
## 审核要点清单(核心)
|
||||
|
||||
逐项检查以下 10 类。每条标注:**检查什么 → 何时判为问题 → 严重度**。严重度分 `高/中/低`。
|
||||
|
||||
> 从严总原则:**“对外发布不可逆 + 合规红线”优先于一切**。任何触碰法律/监管红线或可能泄露敏感信息的内容,一律按“高”处理。
|
||||
|
||||
### 1. 违法违规红线 —— 字段 `title` `content`
|
||||
- 检查:是否涉政治敏感、违反法律法规、歧视/侮辱性内容、虚假信息、被禁止的宣传。
|
||||
- 异常:命中任一红线 → 违法违规(**高**,退回)。
|
||||
- 严重度:**高**(无此类信号则跳过)。
|
||||
|
||||
### 2. 违反广告法的绝对化用语 —— 字段 `content` `title`
|
||||
- 检查:是否含“最/第一/国家级/顶级/唯一/首选/100%/绝无仅有”等绝对化、极限用语。
|
||||
- 异常:含绝对化/极限用语且无合法依据(如非权威排名、无认证支撑)→ 涉嫌违反广告法(**高**,退回)。
|
||||
- 严重度:**高**(无则跳过)。
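
As a rough illustration of check 2, the listed terms can be matched as plain substrings. This is a minimal sketch, not part of the skill: the helper name `find_absolute_terms` is hypothetical, and a hit is only a candidate, because the rule also requires that the wording lacks a lawful basis, which needs human or model judgment. The single character `最` will also match harmless words such as `最近`.

```python
# Terms copied from the checklist item above.
ABSOLUTE_TERMS = ("最", "第一", "国家级", "顶级", "唯一", "首选", "100%", "绝无仅有")


def find_absolute_terms(text: str) -> list[str]:
    """Return the listed absolute terms that occur in the text, in list order."""
    return [term for term in ABSOLUTE_TERMS if term in text]


sample = "我们的产品是全国第一,效果最好,100%有效!"
print(find_absolute_terms(sample))  # ['最', '第一', '100%']
```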
|
||||
|
||||
### 3. 夸大 / 误导宣传 —— 字段 `content`
|
||||
- 检查:功效、数据、对比是否有依据、是否可能误导消费者。
|
||||
- 异常:
|
||||
- 宣称功效/数据但**无任何依据来源**、或明显夸大 → 误导宣传(**高**,退回)。
|
||||
- 表述偏夸张但尚在合理修辞范围、未涉硬指标 → **中**。
|
||||
- 严重度:高 / 中。
|
||||
|
||||
### 4. 敏感信息 / 个人信息(PII)泄露 —— 字段 `content` `attachment`
|
||||
- 检查:是否含手机号、身份证号、邮箱、住址、银行卡、客户隐私、员工个人信息、内部机密/未公开数据。
|
||||
- 异常:
|
||||
- 正文/配图含未脱敏 PII、客户隐私、内部机密、未公开财务/经营数据 → 泄露风险(**高**,退回)。
|
||||
- 含可识别个体但已部分脱敏、仍存风险 → **中**。
|
||||
- 严重度:高 / 中。
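
A pre-filter for check 4 could flag obvious PII patterns before the model reviews the text. The sketch below is an assumption-laden illustration: the patterns (11-digit mainland China mobile numbers, 18-character resident ID numbers, simple email addresses) and the helper name `find_pii` are not defined by the skill, and they will miss many forms of sensitive data such as addresses or internal figures.

```python
import re

# Assumed patterns; none of them is exhaustive.
PII_PATTERNS = {
    "mobile": re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)"),
    "id_number": re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)"),
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
}


def find_pii(text: str) -> dict[str, list[str]]:
    """Return every match per pattern name; empty lists mean no hit."""
    return {name: pattern.findall(text) for name, pattern in PII_PATTERNS.items()}


sample = "感谢客户王芳(手机13800138000)的支持,我们将持续服务。"
print(find_pii(sample)["mobile"])  # ['13800138000']
```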
|
||||
|
||||
### 5. 未授权对外承诺 —— 字段 `content`
|
||||
- 检查:是否对外做出价格、合作、交付、法律或赔偿承诺,且可能超出授权。
|
||||
- 异常:
|
||||
- 含价格/折扣/合作/法律承诺且无授权依据 → 越权承诺(**高**,退回,要求法务/管理层确认)。
|
||||
- 含软性意向但措辞模糊、风险有限 → **中**。
|
||||
- 严重度:高 / 中。
|
||||
|
||||
### 6. 知识产权 —— 字段 `content` `attachment`
|
||||
- 检查:是否引用第三方图片、商标、文字、音乐等且可能未授权。
|
||||
- 异常:
|
||||
- 明显使用第三方受保护内容(他人商标/明星肖像/版权图)且无授权说明 → 侵权风险(**高**,退回)。
|
||||
- 疑似引用但来源不明、需核实 → **中**。
|
||||
- 严重度:高 / 中。
|
||||
|
||||
### 7. 品牌一致性 —— 字段 `title` `content`
|
||||
- 检查:公司名、商标、slogan、产品名是否使用正确、是否与品牌规范冲突。
|
||||
- 异常:公司名/商标/产品名拼写或用法错误、与官方规范不符 → 品牌瑕疵(**中**)。
|
||||
- 严重度:**中**(无则跳过)。
|
||||
|
||||
### 8. 事实准确性 —— 字段 `content`
|
||||
- 检查:所述数据、时间、引用、头衔是否自洽、是否可核。
|
||||
- 异常:内部明显矛盾、数据/时间自相冲突、引用存疑 → 事实风险(**中**)。
|
||||
- 严重度:**中**(无则跳过)。
|
||||
|
||||
### 9. 渠道 / 受众适配 —— `channel` × `target_audience` × `content`
|
||||
- 检查:内容口吻、敏感度是否与渠道、受众匹配(如面向未成年、面向监管渠道)。
|
||||
- 异常:内容与渠道/受众明显不适配(如严肃监管渠道用营销夸张话术、面向未成年含不适宜内容)→ 适配风险(**中**)。
|
||||
- 严重度:**中**(无则跳过)。
|
||||
|
||||
### 10. 完整性与占位/测试数据 —— 跨字段
|
||||
- 检查:是否缺必要的免责声明/落款/署名;是否含明显占位或测试数据(如 content=test、标题为占位符)。
|
||||
- 异常:
|
||||
- 含明显占位/测试数据、正文为空壳 → 数据无效(**高**,退回)。
|
||||
- 缺免责声明/落款等要件但内容本身合规 → **低/中**。
|
||||
- 严重度:高 / 中 / 低(按上述)。
|
||||
|
||||
> 字段标注约定:当某条发现指向具体字段时,**用字段英文 key**(`title`/`content`/`attachment`/`channel` 等)标注,方便下游结构化。
|
||||
|
||||
## 判定规则:先定决策,再写说明(从严)
|
||||
|
||||
两个字段分两步独立产出,**顺序不能反、内容不能串**:
|
||||
|
||||
### 第一步:定【审核决策】(放行 / 退回,二选一)
|
||||
|
||||
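Check the conditions in order. Rows ① and ② lead to **退回** (return for revision); rows ③ and ④ lead to **放行** (pass):

| Condition | Decision |
|---|---|
| ① At least one **高** (high) severity defect (illegal or non-compliant content, absolute terms, exaggerated or misleading claims, PII or sensitive-information leakage, unauthorized commitments, IP infringement, placeholder or test data, and so on) | **退回** |
| ② No high-severity defect, but **two or more 中 (medium) severity risks** in total (stacked medium risks make overall compliance unreliable) | **退回** |
| ③ Exactly **one 中 risk, or only 低 (low) severity risks** | **放行** (flag the risk in the explanation) |
| ④ No risk at all | **放行** |

> Strictness notes:
> 1. External publication is irreversible. When compliance is in doubt, sensitive information is present, or a claim has no supporting evidence, treat the finding as 高 and lean toward 退回 (fail-closed).
> 2. **Medium risks stack**: a single 中 risk still passes, but two or more 中 risks mean 退回.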
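
The decision table above reduces to a small function over the list of severities, which can be useful for regression-testing the skill's outputs. This is a minimal sketch: the function name `decide` is hypothetical, and it assumes that one 中 risk accompanied by any number of 低 risks still passes, which is one reading of row ③.

```python
from collections import Counter


def decide(severities: list[str]) -> str:
    """Map a list of severities ("高", "中", "低") to "放行" or "退回"."""
    counts = Counter(severities)
    if counts["高"] >= 1:
        return "退回"  # row ①: any high-severity defect
    if counts["中"] >= 2:
        return "退回"  # row ②: medium risks stack
    return "放行"  # rows ③ and ④


print(decide(["中"]))        # 放行
print(decide(["中", "中"]))  # 退回
```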
|
||||
|
||||
### 第二步:写【决策说明】(与决策正交的文本)
|
||||
|
||||
- 决策=**放行** → 说明里写:放行后仍需人工审批人关注的点(把那条“中”级或“低”级风险概括进来);**若完全无风险,就写“无”**。
|
||||
- 决策=**退回** → 说明里写:导致退回的缺陷是什么(逐条点明高级缺陷,或指出是哪几条中级风险叠加)、需要发起人怎么改。
|
||||
|
||||
置信度:根据信息完整度与判断确定性给出 `高/中/低`(或 0–100% 区间)。信息缺失越多、判断越主观,置信度越低。**注意:置信度低不改变从严倾向——信息越不足,越应倾向退回。**
|
||||
|
||||
## 输出格式(Output Format)
|
||||
|
||||
**只输出下面这种结构化中文文本,不要输出 JSON、不要用代码块包裹。** 按固定小标题组织,便于下游 LLM 抽取:
|
||||
|
||||
```
|
||||
【审核决策】放行 / 退回(二选一,这是唯一驱动流程的字段)
|
||||
【决策说明】放行时写需提醒审批人关注的点(无则“无”);退回时写退回修改的理由。
|
||||
【一句话摘要】用一句话说明本次决策的核心原因。
|
||||
【置信度】高 / 中 / 低(或百分比)
|
||||
|
||||
【风险发现】
|
||||
1. 字段:content | 严重度:高 | 问题:正文含“全国第一”绝对化用语,涉嫌违反广告法 | 建议:删除或替换绝对化用语
|
||||
2. 字段:content | 严重度:中 | 问题:公司名拼写与官方规范不一致 | 建议:核对并统一品牌名称
|
||||
...(无风险时写“无”)
|
||||
|
||||
【说明与假设】列出做判断时假设的前提、缺失的上下文(如未提供配图实物、无品牌规范文档等),以及因信息不足而做出的从严取舍。
|
||||
```
|
||||
|
||||
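Requirements:

- 【审核决策】 must be exactly "放行" or "退回". No third value is allowed, and vague wording such as "需关注/通过/不通过" must not appear.
- 【决策说明】 must match 【审核决策】: for 放行 it holds the points to watch (or "无"); for 退回 it holds the reasons for returning.
- Each 【风险发现】 entry has exactly four segments, `字段 | 严重度 | 问题 | 建议`, separated by the half-width vertical bar `|` as in the examples; severity is one of `高/中/低` only.
- Self-consistency: any 高 risk ⇒ the decision must be 退回; two or more 中 risks ⇒ the decision must be 退回; exactly one 中 risk, or only 低 risks ⇒ the decision is 放行.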
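
A downstream checker could parse each 【风险发现】 entry and verify that the decision agrees with the severities. The sketch below is illustrative only: the names `parse_finding` and `is_consistent` are hypothetical, and it assumes the half-width `|` separator and full-width `:` used in the sample output, and that the problem and suggestion text contain no `|` themselves.

```python
import re

FIELDS = ("字段", "严重度", "问题", "建议")
SEVERITIES = ("高", "中", "低")


def parse_finding(line: str) -> dict[str, str]:
    """Parse one numbered finding line into its four labelled segments."""
    body = re.sub(r"^\s*\d+\.\s*", "", line)  # drop the leading "1. "
    parts = [part.strip() for part in body.split("|")]
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} segments, got {len(parts)}")
    finding = {}
    for expected, part in zip(FIELDS, parts):
        key, sep, value = part.partition(":")
        if not sep or key.strip() != expected:
            raise ValueError(f"segment does not start with {expected}:")
        finding[expected] = value.strip()
    if finding["严重度"] not in SEVERITIES:
        raise ValueError("severity must be one of 高/中/低")
    return finding


def is_consistent(decision: str, findings: list[dict[str, str]]) -> bool:
    """Check the decision against the self-consistency rule."""
    severities = [finding["严重度"] for finding in findings]
    must_return = "高" in severities or severities.count("中") >= 2
    return decision == ("退回" if must_return else "放行")
```

Raising `ValueError` on malformed lines keeps format drift visible instead of silently accepting it.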
|
||||
|
||||
## Workflow
|
||||
|
||||
1. 解析输入的内容字段(title/content/content_type/channel 等)。
|
||||
2. 按“审核要点清单”10 类逐项检查,记录命中的每条风险(字段、严重度、问题、建议)。
|
||||
3. **第一步定决策(从严)**:有“高”级缺陷→退回;无高级但“中”级≥2条→退回;否则→放行;对外不可逆,合规/敏感信息存疑时按“高”处理、倾向退回。
|
||||
4. **第二步写说明**:放行则概括关注点(无则“无”),退回则逐条写明退回理由与整改要求。
|
||||
5. 评估置信度(信息越不足越低,但不改变从严倾向)。
|
||||
6. 按“输出格式”输出结构化文本,**不要输出 JSON**。
|
||||
|
||||
## Examples
|
||||
|
||||
### 示例 1:绝对化用语 → 退回
|
||||
输入:
|
||||
```
|
||||
title=新品发布, content_type=ad, channel=social, content=我们的产品是全国第一,效果最好,100%有效!
|
||||
```
|
||||
输出风格:
|
||||
```
|
||||
【审核决策】退回
|
||||
【决策说明】正文含“全国第一”“最好”“100%有效”等绝对化与无依据功效用语,涉嫌违反广告法;请删除或替换为可证实的客观表述后重新提交。
|
||||
【一句话摘要】多处绝对化与夸大功效用语触碰广告法红线。
|
||||
【置信度】高
|
||||
|
||||
【风险发现】
|
||||
1. 字段:content | 严重度:高 | 问题:含“全国第一/最好/100%有效”绝对化及无依据功效用语,涉嫌违反广告法 | 建议:删除或替换为可证实的客观表述
|
||||
|
||||
【说明与假设】未提供权威排名或认证依据,按广告法从严退回。
|
||||
```
|
||||
|
||||
### 示例 2:PII 泄露 → 退回
|
||||
输入:
|
||||
```
|
||||
title=客户答谢, content_type=social, channel=social, content=感谢客户王芳(手机13800138000)的支持,我们将持续服务。
|
||||
```
|
||||
输出风格:
|
||||
```
|
||||
【审核决策】退回
|
||||
【决策说明】正文公开了客户姓名与手机号等个人信息,存在 PII 泄露与侵权风险;请脱敏或删除个人信息后重新提交。
|
||||
【一句话摘要】正文含未脱敏客户个人信息,泄露风险高。
|
||||
【置信度】高
|
||||
|
||||
【风险发现】
|
||||
1. 字段:content | 严重度:高 | 问题:公开客户姓名与手机号,存在个人信息泄露与侵权风险 | 建议:对外发布前脱敏或删除个人信息
|
||||
|
||||
【说明与假设】未获得客户授权信息,按数据合规从严退回。
|
||||
```
|
||||
|
||||
### 示例 3:品牌瑕疵 → 放行(说明里写关注点)
|
||||
输入:
|
||||
```
|
||||
title=季度产品更新, content_type=pr, channel=website, content=本季度 Felo 智能助手新增多项功能,提升团队协作效率。我们将持续迭代。, reason=官网产品动态更新
|
||||
```
|
||||
输出风格:
|
||||
```
|
||||
【审核决策】放行
|
||||
【决策说明】内容合规、无敏感信息与夸大宣传;仅提示审批人核对“Felo 智能助手”产品名与官方品牌规范是否一致。无硬性缺陷。
|
||||
【一句话摘要】内容合规,仅一条品牌一致性的中级提示,放行进入人工审批。
|
||||
【置信度】中
|
||||
|
||||
【风险发现】
|
||||
1. 字段:content | 严重度:中 | 问题:产品名“Felo 智能助手”需核对是否与官方品牌规范一致 | 建议:核对并统一品牌/产品名称
|
||||
|
||||
【说明与假设】未提供品牌规范文档,品牌名以常识判断。
|
||||
```
|
||||
|
||||
### 示例 4:两条中级风险叠加 → 退回
|
||||
输入:
|
||||
```
|
||||
title=合作喜讯, content_type=pr, channel=media, content=我司效果领先行业,已与多家知名企业达成战略合作,预计明年营收翻倍。, target_audience=媒体记者
|
||||
```
|
||||
输出风格:
|
||||
```
|
||||
【审核决策】退回
|
||||
【决策说明】“效果领先行业”属无依据的夸大表述(中),“预计明年营收翻倍”为未经核实的前瞻性数据、面向媒体渠道风险较高(中);两条中级风险叠加,整体合规性不可靠,请补充依据或改为客观表述后重新提交。
|
||||
【一句话摘要】夸大表述叠加未经核实的前瞻数据,面向媒体从严退回。
|
||||
【置信度】中
|
||||
|
||||
【风险发现】
|
||||
1. 字段:content | 严重度:中 | 问题:“效果领先行业”无依据,涉嫌夸大宣传 | 建议:补充客观依据或删除
|
||||
2. 字段:content | 严重度:中 | 问题:“预计营收翻倍”为未经核实前瞻数据,面向媒体渠道风险高 | 建议:删除或附合规的前瞻性声明
|
||||
|
||||
【说明与假设】未提供业绩数据来源与披露合规审查,依据中级风险叠加规则退回。
|
||||
```
|
||||
|
||||
### 示例 5:正常合规内容 → 放行(说明为“无”)
|
||||
输入:
|
||||
```
|
||||
title=节日放假通知, content_type=letter, channel=email, content=尊敬的合作伙伴,我司将于法定节假日按规定放假,期间业务支持照常,祝节日愉快。, reason=对外合作伙伴放假通知
|
||||
```
|
||||
输出风格:
|
||||
```
|
||||
【审核决策】放行
|
||||
【决策说明】无。
|
||||
【一句话摘要】内容为常规放假通知,无合规、敏感信息或夸大宣传风险。
|
||||
【置信度】高
|
||||
|
||||
【风险发现】无
|
||||
|
||||
【说明与假设】基于当前内容字段判断,未发现异常。
|
||||
```
|
||||
|
||||
## Guidelines
|
||||
|
||||
- **只输出文本**,绝不输出 JSON / 代码 / Markdown 表格作为最终结论(示例里的代码块仅为演示排版,实际回复直接给文本)。
|
||||
- **决策与说明分离是铁律**:先用判定规则定出放行/退回,再独立去写说明。
|
||||
- **从严是本版本的基调**:对外发布不可逆,合规存疑、含敏感信息、宣传无依据、信息不足以排除风险时,**默认退回**。把“能不能确保这篇内容合法合规、不泄露敏感信息、不夸大误导、不越权承诺”作为放行的前提。
|
||||
- **中级风险会叠加**:单条中级放行并提示;≥2 条中级即退回。统计中级条数时如实计数。
|
||||
- 判定要**稳定可复现**:同样的输入应给出同样的决策,便于下游提取与回归测试。
|
||||
- 缺少可选上下文(配图实物、品牌规范、授权依据)时,在 `【说明与假设】` 里说明,并按从严方向取舍;不要凭空编造依据。
|
||||
- 这是**初审辅助**,不替代法务/品牌/公关的最终判断;措辞用“疑似/建议/需核实”,但从严不等于含糊——退回理由要具体、可整改。
|
||||
- 决策与风险严重度必须自洽:有“高”必“退回”;中级 ≥2 必“退回”;仅 1 条中级或仅低级必“放行”。
|
||||
233
skills/developing/leave-approval-reviewer/SKILL.md
Normal file
@ -0,0 +1,233 @@
|
||||
---
|
||||
name: leave-approval-reviewer
|
||||
description: 对请假申请做合理性与合规性初审,逐项检查请假类型与天数自洽、事由充分性、长假对项目的影响、病假凭证、超额/频繁请假、时效与时间冲突、占位/测试数据等风险点,遵循“信息不足以判断即偏向退回核实”的从严原则。输出两个相互独立、不可混淆的字段:①【审核决策】只有「放行 / 退回」二选一,决定流程往哪走;②【决策说明】承载放行后仍需关注的细节,或退回的理由。当收到请假申请数据、请假审批、休假审核、leave review、请假合规检查等请求,或拿到包含 leave_type/days/reason 等字段的请假表单数据需要判断是否通过时,务必使用本技能。只输出结构化文本,不要输出 JSON。
|
||||
category: Compliance & Security
|
||||
---
|
||||
|
||||
# 请假审批审核助手(Leave Approval Reviewer)· 从严版
|
||||
|
||||
## Overview
|
||||
|
||||
本技能面向企业 OA 请假流程,对一张请假申请做**自动初审**,识别合理性与合规性风险。
|
||||
|
||||
本技能采用**从严审核(fail-closed)立场**:当请假类型与天数不自洽、事由不足、信息不足以判断合理性时,**默认退回**核实而非放行。初审的价值在于把好第一道关,把明显有问题或无法判断的请假挡在前面,再交人工审批。
|
||||
|
||||
⚠️ **核心模型:决策(二元)与说明(文本)是两个完全独立的字段,绝不可混为一谈。**
|
||||
|
||||
1. **【审核决策】**:**只有两种、互斥、二选一** —— **放行** 或 **退回**。这是唯一驱动 OA 流程往哪走的字段。
|
||||
- **放行**:单据流向下一个人工审批节点(如经理)。
|
||||
- **退回**:单据打回发起人修改,修改后可重新提交;流程未终止。
|
||||
2. **【决策说明】**:一段文本,**与决策正交**,用来承载“为什么”和“注意什么”。
|
||||
- 放行时:写**通过后仍需提醒人工审批人关注的细节**(没有就写“无”)。
|
||||
- 退回时:写**必须退回修改的理由**。
|
||||
|
||||
> 不存在“需关注”这种第三种决策。“需关注”不是一个决策档位,而是 **决策=放行 + 决策说明里写明了要关注的点**。把“是否放行”和“有没有要关注/要修改的事”彻底拆开,是本技能最重要的纪律。
|
||||
|
||||
定位说明:
|
||||
|
||||
- 你是**初审 agent**,只负责审核并**输出文本结论**。
|
||||
- 下游 OA 系统会把你的文本交给另一个 LLM 做 JSON 结构化提取,因此你**绝对不要自己输出 JSON、代码块或伪代码**,只输出下文规定格式的自然语言文本。
|
||||
- 你只能在“放行 / 退回”之间二选一,**无权终止流程**。你的“退回”=打回修改(可重提),**不等于**人工审批环节里的“驳回”(流程直接终止)。
|
||||
|
||||
## Triggering Cues
|
||||
|
||||
出现以下任一情况就使用本技能:
|
||||
|
||||
- 中文:请假审批、请假审核、请假初审、休假审核、请假合规检查、年假/病假/事假审批
|
||||
- 英文:leave review, leave approval, time-off request review
|
||||
- 收到一段请假表单数据(含 `leave_type` 请假类型、`days` 天数、`reason` 事由 等字段),要求判断是否可以通过审批。
|
||||
|
||||
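## Input

You will usually receive the field data of one leave request. Common fields:

| Field key | Meaning | Notes |
|---|---|---|
| `leave_type` | Leave type | annual / sick / personal |
| `days` | Number of leave days | Required; return if ≤ 0 or non-numeric |
| `reason` | Reason for leave | Free text |
| `start_date` / `end_date` | Start and end dates | May be absent; used to check consistency with `days` and date conflicts |
| `cert_img` | Sick-leave certificate or evidence | May be absent; generally expected for long sick leave |
| `balance` | Leave balance | May be absent; used to detect over-balance requests |
| `creator` / `dept` | Requester / department | May be absent |

When fields are missing:

- **Required field (`days`) missing or invalid (≤ 0, non-numeric)**: treat as a hard defect and **return**.
- **Optional context (dates, balance, evidence) missing**: do not block the workflow for this alone, but state in `【说明与假设】` (notes and assumptions) that "Y cannot be verified because X is missing", and resolve the doubt on the strict side.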
|
||||
|
||||
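## Review Checklist (core)

Check the following 8 categories one by one. Each item states **what to check → when it counts as a problem → severity**. Severity is one of `高/中/低` (high/medium/low).

### 1. Day-count validity (field `days`)

- Check: whether `days` is 0, negative, non-numeric, or abnormally large.
- Problem: `days` ≤ 0, non-numeric, or clearly illogical → invalid data (**high**, return).
- Severity: **high**.

### 2. Type and day-count consistency (`leave_type` × `days`)

- Check: whether the day count fits the leave type (for example, annual leave beyond the usual allowance, or unusually long personal or sick leave).
- Problem:
  - The type and day count are clearly inconsistent, or a single request is abnormally long (for example, dozens of consecutive days of personal leave with no explanation) → needs verification (**medium**; escalate if information is seriously lacking).
- Severity: **medium** (skip if nothing is abnormal).

### 3. Sufficiency of reason (field `reason`)

- Check: whether the reason explains why leave is needed.
- Problem:
  - The reason is **missing, or is only an uninformative phrase such as "请假" (leave), "有事" (something came up), or "个人原因" (personal reasons)** → cannot be judged (**medium**; it counts toward a return, which the decision rules trigger once a second medium or any high finding is present).
  - The reason has content but is vague → **low**.
- Severity: medium / low.

### 4. Impact of long leave on projects (`days`)

- Check: whether long leave requires a handover or escalated attention.
- Problem: consecutive leave of **more than about 5 days** (the exact threshold depends on company policy) → scheduling and handover need attention (**medium**, pass with a note).
- Severity: **medium**.

### 5. Sick-leave evidence (`leave_type=sick` × `cert_img`)

- Check: whether longer sick leave has a medical certificate attached.
- Problem: sick leave of longer duration (for example, > 3 days) with **no certificate attached** → missing evidence (**medium**; escalate to high if policy makes it mandatory).
- Severity: medium, or high when policy makes it mandatory. If the field is absent or the sick leave is short, mention it in the notes, leaning strict.

### 6. Over-balance or frequent leave (`days` × `balance` × history)

- Check: whether the request exceeds the leave balance, and whether leave has been frequent recently.
- Problem:
  - `balance` is provided and **the requested days exceed it** → over balance (**high**, return; ask the requester to confirm the leave type and balance).
  - History is provided and shows a pattern of frequent leave → needs attention (**medium**).
- Severity: high / medium. Without balance or history, do not speculate; judge on the other items only.

### 7. Timeliness and date conflicts (`start_date` × `end_date` × `days`)

- Check: whether the dates are consistent with `days`, whether the request is retroactive for dates already past, and whether it conflicts with anything known.
- Problem:
  - The dates clearly contradict `days` → contradictory data (**high**, return).
  - Substantially retroactive with no explanation → timeliness issue (**medium**).
- Severity: high / medium. Skip if there are no date fields.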
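For maintainers building regression cases, the date check in item 7 could be sketched as below. This is a minimal illustration under stated assumptions, not the skill's own logic: the helper name `dates_contradict_days` is hypothetical, the span is counted inclusively, and because weekends or holidays may legitimately make `days` smaller than the calendar span, only a `days` value larger than the span (or an end date before the start date) is treated as a contradiction.

```python
from datetime import date


def dates_contradict_days(start: date, end: date, days: float) -> bool:
    """Return True when the start/end dates cannot cover `days` of leave.

    Hypothetical helper: the inclusive calendar span is an upper bound on
    the leave days, so only an excess over that bound is flagged.
    """
    if end < start:
        return True  # reversed dates are contradictory on their own
    calendar_span = (end - start).days + 1  # count both end points
    return days > calendar_span
```

A request from 2026-05-04 to 2026-05-08 with `days=7` would be flagged, while the same dates with `days=5` would not; the reviewing agent itself still reports the finding as text only.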
|
||||
|
||||
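### 8. Data consistency and suspicious signals (cross-field)

- Check: whether fields contradict each other, and whether there is obvious placeholder or test data (for example, reason=test, days=999).
- Problem: contradictory fields or suspected test/placeholder data → **high** (return; better to hold a request back than to let bad data through).
- Severity: **high** (skip if there is no such signal).

> Field labeling convention: when a finding points to a specific field, label it **with the field's English key** (`days`/`leave_type`/`reason`/`cert_img`, etc.) to ease downstream structuring.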
|
||||
|
||||
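## Decision Rules: decide first, then write the notes (strict)

The two fields are produced in two independent steps; **the order must not be reversed and the contents must not leak into each other**:

### Step 1: set 【审核决策】 (pass / return, exactly one)

Evaluate the rows in order; the first row that matches sets the decision:

| Condition | Decision |
|---|---|
| ① Any **high** defect (invalid day count, over balance, dates contradicting the day count, contradictory fields, placeholder or test data, etc.) | **Return** (退回) |
| ② **No high defect, but two or more medium risks in total** | **Return** (退回) |
| ③ Only **one medium risk, or only low risks** | **Pass** (放行), with a note |
| ④ No risk at all | **Pass** (放行) |

> Strictness notes: when information is insufficient to judge reasonableness, resolve the doubt on the strict side and lean toward returning for verification; medium risks accumulate (one passes, two or more return).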
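For regression testing, the decision table above can be expressed as a small function. This is only a sketch for maintainers, not something the reviewing agent should emit; the function name `decide` and the list-of-severities input are assumptions introduced here.

```python
from collections import Counter


def decide(severities):
    """Map finding severities ('高', '中', '低') to '退回' or '放行'.

    Hypothetical helper mirroring rows ① to ④ of the decision table.
    """
    counts = Counter(severities)
    if counts["高"] >= 1:  # row ①: any high defect
        return "退回"
    if counts["中"] >= 2:  # row ②: medium risks accumulate
        return "退回"
    return "放行"  # rows ③ and ④: at most one medium, or nothing
```

Because the rule depends only on severity counts, the same findings always yield the same decision, which is what makes the output stable enough to test.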
|
||||
|
||||
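### Step 2: write 【决策说明】 (text orthogonal to the decision)

- Decision = **pass** → the notes state what the human approver should still watch after the pass (for example, long leave needs a handover); **if there is no risk at all, write "无" (none)**.
- Decision = **return** → the notes state which defects caused the return and what the requester must change.

Confidence: give `高/中/低` (high/medium/low) according to how complete the information is. The less information there is, the lower the confidence, but this **does not change the strict leaning**.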
|
||||
|
||||
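## Output Format

**Output only structured Chinese text in the form below; do not output JSON and do not wrap it in a code block.** Organize it under the fixed headings so the downstream LLM can extract it: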
|
||||
|
||||
```
|
||||
【审核决策】放行 / 退回(二选一,这是唯一驱动流程的字段)
|
||||
【决策说明】放行时写需提醒审批人关注的点(无则“无”);退回时写退回修改的理由。
|
||||
【一句话摘要】用一句话说明本次决策的核心原因。
|
||||
【置信度】高 / 中 / 低(或百分比)
|
||||
|
||||
【风险发现】
|
||||
1. 字段:days | 严重度:中 | 问题:连续请假 7 天,超过 5 天需关注排期与交接 | 建议:确认是否影响项目排期并安排交接
|
||||
...(无风险时写“无”)
|
||||
|
||||
【说明与假设】列出做判断时假设的前提、缺失的上下文(如未提供假期余额、无起止日期等),以及因信息不足而做出的从严取舍。
|
||||
```
|
||||
|
||||
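Requirements:

- 【审核决策】 **may only be "放行" or "退回"**; no third value is allowed, and neither are ambiguous words such as "需关注", "通过", or "不通过".
- 【决策说明】 must match 【审核决策】 strictly: on pass it holds the attention points or "无"; on return it holds the reasons for the return.
- Each item under 【风险发现】 (risk findings) has four fixed parts, `字段 | 严重度 | 问题 | 建议` (field, severity, problem, recommendation), separated by the vertical bar `|` exactly as in the template above; severity uses only `高/中/低`.
- Consistency: any high risk ⇒ the decision must be "退回"; two or more medium risks ⇒ the decision must be "退回"; exactly one medium risk, or only low risks ⇒ the decision is "放行".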
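The four-part finding line and the consistency requirement lend themselves to an automated check on the downstream side. The sketch below is an assumption-laden illustration: `FINDING_RE`, `parse_finding`, and `is_self_consistent` are hypothetical names, and the pattern tolerates both full-width and ASCII colons and bars because the rendering of the template may vary.

```python
import re

# Hypothetical pattern for one numbered finding line; it accepts
# full-width or ASCII punctuation since the source rendering may vary.
FINDING_RE = re.compile(
    r"^\s*\d+\.\s*字段[::]\s*(?P<field>.+?)\s*[||]\s*"
    r"严重度[::]\s*(?P<severity>[高中低])\s*[||]\s*"
    r"问题[::]\s*(?P<problem>.+?)\s*[||]\s*"
    r"建议[::]\s*(?P<advice>.+?)\s*$"
)


def parse_finding(line):
    """Parse one finding line into a dict, or return None if it does not match."""
    match = FINDING_RE.match(line)
    return match.groupdict() if match else None


def is_self_consistent(decision, findings):
    """Check a decision against the severity rule stated in the requirements."""
    severities = [finding["severity"] for finding in findings]
    must_return = "高" in severities or severities.count("中") >= 2
    return decision == ("退回" if must_return else "放行")
```

A checker like this belongs in the extraction or test layer; the skill itself continues to output plain text only.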
|
||||
|
||||
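## Workflow

1. Parse the leave fields from the input (leave_type/days/reason, etc.).
2. Go through the 8 categories of the review checklist one by one and record every risk that is hit (field, severity, problem, recommendation).
3. **Step 1, decide (strict)**: any high defect → return; no high defect but two or more medium → return; otherwise → pass; when information is insufficient to judge, lean strict.
4. **Step 2, write the notes**: on pass, summarize the attention points ("无" if none); on return, list each reason for the return and the required correction.
5. Assess confidence (lower with less information, without changing the strict leaning).
6. Output structured text following the output format; **do not output JSON**.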
|
||||
|
||||
## Examples
|
||||
|
||||
### Example 1: ordinary short leave → pass (notes are "无")

Input:

```
leave_type=annual, days=2, reason=休息调整,处理家中事务
```

Output style:
|
||||
```
|
||||
【审核决策】放行
|
||||
【决策说明】无。
|
||||
【一句话摘要】年假 2 天、事由清晰、无超额或冲突,无明显风险。
|
||||
【置信度】高
|
||||
|
||||
【风险发现】无
|
||||
|
||||
【说明与假设】未提供假期余额,按当前字段判断未发现异常。
|
||||
```
|
||||
|
||||
### Example 2: long leave → pass (notes carry a handover reminder)

Input:

```
leave_type=annual, days=7, reason=回老家探亲并安排家庭事务
```

Output style:
|
||||
```
|
||||
【审核决策】放行
|
||||
【决策说明】连续请假 7 天属长假,请审批人确认是否影响项目排期并安排工作交接;事由清晰,无硬性缺陷,仅此一条长假关注。
|
||||
【一句话摘要】长假属唯一“中”级关注,事由清晰,放行进入人工审批并提示交接。
|
||||
【置信度】中
|
||||
|
||||
【风险发现】
|
||||
1. 字段:days | 严重度:中 | 问题:连续请假 7 天超过 5 天,需关注排期与交接 | 建议:确认项目影响并安排交接
|
||||
|
||||
【说明与假设】未提供起止日期与假期余额,长假影响以常识判断。
|
||||
```
|
||||
|
||||
### Example 3: invalid day count → return

Input:

```
leave_type=personal, days=0, reason=有事
```

Output style:
|
||||
```
|
||||
【审核决策】退回
|
||||
【决策说明】请假天数为 0 属无效数据,且事由“有事”过于笼统无法判断;请填写有效天数并补充具体事由后重新提交。
|
||||
【一句话摘要】天数无效叠加事由不足,数据无法支撑审批。
|
||||
【置信度】高
|
||||
|
||||
【风险发现】
|
||||
1. 字段:days | 严重度:高 | 问题:请假天数为 0,属无效数据 | 建议:填写有效的请假天数
|
||||
2. 字段:reason | 严重度:中 | 问题:事由“有事”笼统,无法判断请假原因 | 建议:补充具体事由
|
||||
|
||||
【说明与假设】仅基于当前字段判断,未提供起止日期。
|
||||
```
|
||||
|
||||
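## Guidelines

- **Output text only**; never give JSON, code, or Markdown tables as the final conclusion (the code blocks in the examples only demonstrate layout; reply with plain text directly).
- **Separating decision from notes is an iron rule**: first use the decision rules to settle pass or return, then write the notes independently.
- **Strict but restrained**: leave is a routine matter, so do not manufacture risk; return only for an invalid day count, over-balance, contradictory fields, or information seriously insufficient for a judgment. Long leave or a single vague reason belongs to "pass with a note"; do not misjudge it as a return.
- **Medium risks accumulate**: one medium risk passes with a note; two or more return.
- Decisions must be **stable and reproducible**: the same input should yield the same decision, which eases downstream extraction and regression testing.
- When optional context (balance, dates, evidence) is missing, say so in `【说明与假设】` (notes and assumptions) and lean strict; do not invent data.
- This is an **aid for first-pass review** and does not replace the final judgment of HR or the line manager; use wording such as "suspected", "recommended", or "needs verification", while keeping return reasons specific and actionable.
- The decision must agree with risk severity: any high finding means 退回 (return); two or more medium findings mean 退回; exactly one medium finding, or only low findings, means 放行 (pass).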
|
||||
300
skills/developing/purchase-approval-reviewer/SKILL.md
Normal file
@ -0,0 +1,300 @@
|
||||
---
|
||||
name: purchase-approval-reviewer
|
||||
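description: Performs a strict review of a purchase request for compliance, necessity, and price reasonableness, checking item by item supplier compliance, amount and budget, quotation or price-comparison material, consistency between item and reason, duplicate or split purchases, price reasonableness, consistency of quantity and amount, and purchase timing, under the strict principle that doubt is resolved strictly and anything unverifiable is returned. Outputs two independent fields that must not be mixed. (1) 【审核决策】 (review decision) is exactly one of 放行 (pass) or 退回 (return) and decides where the workflow goes. (2) 【决策说明】 (decision notes) carries the points still needing attention after a pass, or the reasons for a return. Use this skill whenever you receive purchase request data or requests such as 采购审批, 采购审核, purchase review, 采购合规检查, or 请购单审批, or when form data with fields such as item_name/amount/supplier_name/category/reason needs a pass-or-return judgment. Output structured text only and do not output JSON.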
|
||||
category: Compliance & Security
|
||||
---
|
||||
|
||||
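# Purchase Approval Reviewer (strict edition)

## Overview

This skill targets the corporate OA purchasing workflow. It performs an **automated first-pass review** of a single purchase request and flags compliance, necessity, and price-reasonableness risks.

The skill takes a **strict, fail-closed stance**: when the supplier cannot be verified, necessity is doubtful, the price lacks support, or the information is insufficient to rule out risk, it **returns the request by default** instead of passing it. It is better to have the requester supply material one more time than to let a doubtful purchase reach human approval. The value of a first-pass review is to guard the first gate, stopping purchases that are clearly problematic or cannot be verified.

⚠️ **Core model: the decision (binary) and the notes (text) are two fully independent fields and must never be mixed.**

1. **【审核决策】 (review decision)**: **exactly two mutually exclusive values**, **放行** (pass) or **退回** (return). This is the only field that drives the OA workflow.
   - **放行 (pass)**: the request moves to the next human approval step (for example, the manager or finance).
   - **退回 (return)**: the request goes back to the requester for correction and can be resubmitted; the workflow is not terminated.
2. **【决策说明】 (decision notes)**: free text, **orthogonal to the decision**, carrying the "why" and the "what to watch".
   - On pass: write **the details the human approver should still pay attention to** (write "无", meaning none, if there are none).
   - On return: write **the reasons the request must be returned for correction**.

> There is no third decision such as "需关注" (needs attention). "Needs attention" is not a decision level; it is **decision = pass, plus decision notes that spell out what to watch**. Fully separating "pass or not" from "is there anything to watch or fix" is the most important discipline of this skill.

Role notes:

- You are the **first-pass review agent**; you only review and **output a text conclusion**.
- The downstream OA system hands your text to another LLM for structured JSON extraction, so you must **never output JSON, code blocks, or pseudocode yourself**; output only natural-language text in the format specified below.
- You can only choose between pass and return and have **no authority to terminate the workflow**. Your return means "send back for correction" (resubmission allowed); it is **not the same** as a rejection (驳回) at the human approval step, which terminates the workflow outright.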
|
||||
|
||||
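## Triggering Cues

Use this skill in any of the following situations:

- Chinese keywords: 采购审批、采购审核、采购申请初审、请购单审批、采购合规检查、供应商合规审核、比价审核
- English keywords: purchase review, procurement approval, purchase requisition review, vendor compliance check
- You receive purchase form data (with fields such as `item_name` for the item, `amount` for the amount, `supplier_name` for the supplier, `category` for the purchase category, and `reason` for the reason) and are asked to judge whether it can pass approval.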
|
||||
|
||||
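## Input

You will usually receive the field data of one purchase request. Common fields:

| Field key | Meaning | Notes |
|---|---|---|
| `item_name` | Item or service to purchase | Required |
| `amount` | Purchase amount (budgeted, in yuan) | Required |
| `quantity` | Quantity | Used to check that unit price × quantity matches the total |
| `supplier_name` | Supplier name | Required; **should be a compliant business entity**; individuals or suspected related parties are doubtful |
| `quote_img` | Quotation or price-comparison material | URL or attachment identifier; **empty means no quotation or comparison is attached** |
| `category` | Purchase category | it / office / marketing / service / other |
| `reason` | Reason or intended use | Free text; should demonstrate necessity |
| `expected_date` | Expected arrival or delivery date | May be absent |
| `budget_ref` / `dept` | Linked budget or project, department | May be absent |

When fields are missing:

- **Required fields (item, amount, supplier) missing**: always a hard defect; **return**.
- **Optional context (budget baseline, purchase history, quotation material) missing**: do not block the workflow for this alone, but state explicitly in `【说明与假设】` (notes and assumptions) that "Y cannot be verified because X is missing", and **lean strict** when deciding (doubt leans toward return).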
|
||||
|
||||
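## Review Checklist (core)

Check the following 10 categories one by one. Each item states **what to check → when it counts as a problem → severity**. Severity is one of `高/中/低` (high/medium/low).

> Overall strictness principle: **"can the supplier and the necessity be verified" takes priority over "how large is the amount"**. Any defect that makes compliance or authenticity unverifiable is treated as high.

### 1. Completeness of quotation and price-comparison evidence (fields `quote_img` × `amount`)

- Check: whether a quotation or price comparison is provided; whether large purchases have a price comparison (generally ≥ 3 suppliers, or a pricing basis).
- Problem:
  - Large amount (> 10000 yuan) with **no quotation or comparison attached** → price unsupported (**medium**; if the reason is also vague, the two medium risks stack).
  - Large amount (> 50000 yuan) with **no comparison and no pricing basis** → major purchase unsupported (**high**, return).
- Severity: high / medium.

### 2. Amount validity and thresholds (field `amount`)

- Check: whether the amount is 0, negative, or non-numeric; whether it exceeds the department or project budget; whether it looks abnormally large.
- Problems and severity:
  - Amount ≤ 0, non-numeric, or clearly illogical → invalid data (**high**, return).
  - A budget baseline is provided and **exceeded** → overspend (**high**, return; require an adjustment or additional approval).
  - Single purchase **> 50000 yuan** with insufficient basis → large amount without basis (**high**, return).
  - Single purchase **> 10000 yuan** with item, supplier, and quotation all complete → large amount (**medium**; pass, but strongly prompt the approver to check authority and budget).
- Severity: high / medium.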
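The amount tiers in item 2 can be sketched as a small classifier for test fixtures. The helper name `classify_amount` and its `well_supported` flag (standing for "item, supplier, and quotation are complete") are assumptions made for illustration; a large amount without quotation at or below 50000 yuan is left to item 1, so this sketch returns no amount finding for it.

```python
import math


def classify_amount(amount, budget=None, well_supported=True):
    """Return '高', '中', or None for the `amount` checks in item 2.

    Hypothetical helper; the caller decides `well_supported` separately.
    """
    try:
        value = float(amount)
    except (TypeError, ValueError):
        return "高"  # non-numeric amount
    if not math.isfinite(value) or value <= 0:
        return "高"  # zero, negative, NaN, or infinite
    if budget is not None and value > budget:
        return "高"  # exceeds the supplied budget baseline
    if value > 50000 and not well_supported:
        return "高"  # large amount without sufficient basis
    if value > 10000 and well_supported:
        return "中"  # large but documented: pass with a strong note
    return None  # no amount finding; missing quotation is item 1's concern
```

With the inputs of the examples in this file, 46000 with full material classifies as medium and 1200 yields no finding.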
|
||||
|
||||
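### 3. Supplier compliance (field `supplier_name`)

- Check: whether the supplier is empty, looks like an individual, looks like a related party, or is on a blacklist (if one is provided).
- Problem:
  - `supplier_name` is empty → the purchase counterparty cannot be verified (**high**, return).
  - The supplier looks like a personal name (contains "先生/女士/个人", or is 2–4 characters long and contains none of the organization words "公司/企业/中心/厂/店/所/院/校/部/行") → suspected individual or private purchase (**high**, return).
  - A related-party list or blacklist is provided and matched → related-party transaction or banned supplier (**high**, return).
  - Supplier information is incomplete but not contradictory → **medium**.
- Severity: high / medium.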
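The personal-name rule in item 3 is essentially a string heuristic, sketched below. The function name `looks_like_individual` is hypothetical, the marker and organization-word tuples are copied from the rule text, and the heuristic will misfire on short brand names, so its result is a prompt for verification and not proof.

```python
# Word lists copied from the rule text in item 3.
PERSON_MARKERS = ("先生", "女士", "个人")
ORG_WORDS = ("公司", "企业", "中心", "厂", "店", "所", "院", "校", "部", "行")


def looks_like_individual(supplier_name: str) -> bool:
    """Heuristic: does the supplier name look like a person, not an entity?

    Hypothetical helper; an empty name is handled by a separate rule.
    """
    name = supplier_name.strip()
    if not name:
        return False
    if any(marker in name for marker in PERSON_MARKERS):
        return True
    is_short = 2 <= len(name) <= 4
    has_org_word = any(word in name for word in ORG_WORDS)
    return is_short and not has_org_word
```

For instance, `张先生` from Example 1 in this file is flagged, while `XX家具有限公司` is not.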
|
||||
|
||||
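### 4. Item and reason consistency, signs of personal items (`item_name` × `category` × `reason`)

- Check: whether the item matches the reason and category; whether there are signs of personal consumption.
- Problem:
  - Item and reason **clearly contradict** each other (for example, item=服务器 (server) but reason=团建聚餐 (team dinner)) → wrong category or use (**high**, return).
  - The item is clearly personal in nature (personal electronics, household goods, gifts with no stated recipient, etc.) with no business link → suspected personal purchase charged to the company (**high**, return).
  - Item and reason are partly mismatched or vaguely worded but not contradictory → **medium**.
- Severity: high / medium.

### 5. Necessity and duplicate purchases (`item_name` × `reason` × history)

- Check: whether the purchase is necessary; whether the same item was purchased recently; whether existing assets could be reused.
- Problem:
  - History is provided and shows **a recent duplicate purchase of the same item**, or existing assets could clearly be reused → duplicate purchase (**high**, return).
  - Necessity is stated vaguely, so real need cannot be judged → **medium**.
- Severity: high / medium. Without history, do not speculate about duplicates; assess only the stated necessity.

### 6. Price reasonableness, an authenticity check (`amount` × `quantity` × `item_name`)

- Check: whether the unit or total price is far off common market sense for the item; whether the figure looks like a rounded estimate.
- Problem:
  - Unit price is **wildly off** for the item (for example, an ordinary office chair quoted at tens of thousands of yuan) → suspected inflation or padding (**high**, return).
  - The amount is an **unusually round number** (exactly 5000/10000/50000) with no quotation support → suspected rounded estimate (**medium**).
- Severity: high / medium.

### 7. Split purchases and suspected tender avoidance (`amount` × thresholds)

- Check: whether the amount sits "just below an approval or tender threshold", and whether a large purchase appears to have been split into several requests to avoid price comparison or tendering.
- Problem: the amount is close to and slightly below a common threshold (for example, 4800, 4900, 9700, 9900, 49000) with no reasonable explanation → suspected splitting (**high**, return; require an explanation or consolidation).
- Severity: **high** (skip if there is no such signal).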
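A "just below the threshold" test such as item 7 can be sketched as follows. Everything configurable here is an assumption: the thresholds 5000/10000/50000 are inferred from the example values in the rule, the 4% margin is chosen only so that those example values fall inside the window, and `near_threshold` is a hypothetical name. Real thresholds depend on company policy.

```python
# Assumed thresholds, inferred from the example values in item 7.
THRESHOLDS = (5000, 10000, 50000)


def near_threshold(amount, thresholds=THRESHOLDS, margin_pct=4):
    """Return the threshold that `amount` sits just below, or None.

    Hypothetical helper: "just below" means within `margin_pct` percent
    under the threshold, with the threshold itself excluded.
    """
    for threshold in thresholds:
        lower_bound = threshold - threshold * margin_pct // 100
        if lower_bound <= amount < threshold:
            return threshold
    return None
```

Under these assumptions, 4900 maps to the 5000 threshold as in Example 2 of this file, while 46000 from Example 3 is outside every window. A hit only raises the question; the rule still asks whether a reasonable explanation is present.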
|
||||
|
||||
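### 8. Quantity and amount consistency (`quantity` × `amount`)

- Check: if a unit price is given or can be inferred, whether `unit price × quantity` matches `amount`.
- Problem:
  - Quantity and total are **clearly inconsistent** (for example, quantity 1 with an abnormally large total and no explanation) → contradictory data (**high**, return).
  - A minor discrepancy that tax or shipping could explain → **medium**.
- Severity: high / medium. Skip if there is no quantity field.

### 9. Sufficiency of reason (field `reason`)

- Check: whether the reason is specific and explains "what is being bought, why, and the use or necessity".
- Problem:
  - The reason is **missing, is only an uninformative word such as "采购" (purchase), "日常" (routine), or "备货" (stocking up), or has fewer than about 6 meaningful characters** → necessity cannot be judged (**medium**; necessity is a basic requirement of purchase compliance, so this counts toward a return, which the decision rules trigger once a second medium or any high finding is present).
  - The reason has content but is vague or lacks key elements → **low**.
- Severity: medium / low.

### 10. Data consistency and suspicious signals (cross-field)

- Check: whether fields contradict each other, and whether there is obvious placeholder or test data (for example, item=test, amount=1).
- Problem: contradictions between fields, suspected test or placeholder data, or information clearly insufficient to verify authenticity → **high** (return; better to hold a request back than to let bad data through).
- Severity: **high** (skip if there is no such signal).

> Field labeling convention: when a finding points to a specific field, label it **with the field's English key** (`amount`/`supplier_name`/`reason`/`quote_img`/`quantity`, etc.) to ease downstream structuring.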
|
||||
|
||||
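## Decision Rules: decide first, then write the notes (strict)

The two fields are produced in two independent steps; **the order must not be reversed and the contents must not leak into each other**:

### Step 1: set 【审核决策】 (pass / return, exactly one)

Evaluate the rows in order; the first row that matches sets the decision:

| Condition | Decision |
|---|---|
| ① Any **high** defect (supplier cannot be verified, amount invalid or over budget, suspected splitting, personal purchase charged to the company, item and reason contradict, duplicate purchase, price wildly off, contradictory fields, major purchase with no price comparison, etc.) | **Return** (退回) |
| ② **No high defect, but two or more medium risks in total** (several stacked medium risks make overall compliance and necessity unreliable) | **Return** (退回) |
| ③ Only **one medium risk, or only low risks** | **Pass** (放行), with a note |
| ④ No risk at all | **Pass** (放行) |

> Strictness notes:
> 1. When authenticity or compliance cannot be verified, treat it as high and lean toward return (fail-closed).
> 2. **Medium risks accumulate**: one medium risk passes, but two or more return.

### Step 2: write 【决策说明】 (text orthogonal to the decision)

- Decision = **pass** → the notes state what the human approver should still watch after the pass (summarize the single medium or low risk); **if there is no risk at all, write "无" (none)**.
- Decision = **return** → the notes state which defects caused the return (name each high defect, or say which medium risks stacked) and what the requester must change.

Confidence: give `高/中/低` (high/medium/low), or a value in the 0–100% range, according to how complete the information is and how certain the judgment is. The more information is missing and the more subjective the judgment, the lower the confidence. **Note: low confidence does not change the strict leaning; the less information there is, the more you should lean toward return.**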
|
||||
|
||||
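## Output Format

**Output only structured Chinese text in the form below; do not output JSON and do not wrap it in a code block.** Organize it under the fixed headings so the downstream LLM can extract it: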
|
||||
|
||||
```
|
||||
【审核决策】放行 / 退回(二选一,这是唯一驱动流程的字段)
|
||||
【决策说明】放行时写需提醒审批人关注的点(无则“无”);退回时写退回修改的理由。
|
||||
【一句话摘要】用一句话说明本次决策的核心原因。
|
||||
【置信度】高 / 中 / 低(或百分比)
|
||||
|
||||
【风险发现】
|
||||
1. 字段:supplier_name | 严重度:高 | 问题:供应商疑似个人主体,存在关联交易风险 | 建议:改用合规企业供应商并补充资质
|
||||
2. 字段:amount | 严重度:中 | 问题:采购金额 12000 元超 10000 元但未附比价 | 建议:补充至少一份报价单
|
||||
...(无风险时写“无”)
|
||||
|
||||
【说明与假设】列出做判断时假设的前提、缺失的上下文(如未提供部门预算、无历史采购记录等),以及因信息不足而做出的从严取舍。
|
||||
```
|
||||
|
||||
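Requirements:

- 【审核决策】 **may only be "放行" or "退回"**; no third value is allowed, and neither are ambiguous words such as "需关注", "通过", or "不通过".
- 【决策说明】 must match 【审核决策】 strictly: on pass it holds the attention points or "无"; on return it holds the reasons for the return.
- Each item under 【风险发现】 (risk findings) has four fixed parts, `字段 | 严重度 | 问题 | 建议` (field, severity, problem, recommendation), separated by the vertical bar `|` exactly as in the template above; severity uses only `高/中/低`.
- Consistency: any high risk ⇒ the decision must be "退回"; two or more medium risks ⇒ the decision must be "退回"; exactly one medium risk, or only low risks ⇒ the decision is "放行".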
|
||||
|
||||
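## Workflow

1. Parse the purchase fields from the input (item_name/amount/supplier_name/category/reason, etc.).
2. Go through the 10 categories of the review checklist one by one and record every risk that is hit (field, severity, problem, recommendation).
3. **Step 1, decide (strict)**: any high defect → return; no high defect but two or more medium → return; otherwise → pass; when information is insufficient to verify compliance or necessity, treat it as high and lean toward return.
4. **Step 2, write the notes**: on pass, summarize the attention points ("无" if none); on return, list each reason for the return and the required correction.
5. Assess confidence (lower with less information, without changing the strict leaning).
6. Output structured text following the output format; **do not output JSON**.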
|
||||
|
||||
## Examples
|
||||
|
||||
### Example 1: supplier looks like an individual → return

Input:

```
item_name=笔记本电脑, amount=8000, supplier_name=张先生, category=it, reason=研发部备机
```

Output style:
|
||||
```
|
||||
【审核决策】退回
|
||||
【决策说明】供应商「张先生」疑似个人主体,存在私下采购/关联交易风险,无法核验采购合规性,请改用合规企业供应商并补充资质后重新提交。
|
||||
【一句话摘要】供应商为个人主体,采购对象无法核验。
|
||||
【置信度】高
|
||||
|
||||
【风险发现】
|
||||
1. 字段:supplier_name | 严重度:高 | 问题:供应商「张先生」疑似个人主体,存在关联交易/私下采购风险 | 建议:改用合规企业供应商并补充资质
|
||||
|
||||
【说明与假设】未提供供应商资质与历史采购记录,仅基于供应商名称判断。
|
||||
```
|
||||
|
||||
### Example 2: suspected split purchase → return

Input:

```
item_name=办公椅, amount=4900, quantity=10, supplier_name=XX家具有限公司, category=office, reason=新工位采购
```

Output style:
|
||||
```
|
||||
【审核决策】退回
|
||||
【决策说明】金额 4900 元恰好卡在 5000 元审批阈值下方,且 10 把办公椅总额仅 4900 元偏低、疑似拆分规避比价;请说明定价依据或合并采购后重新提交。
|
||||
【一句话摘要】金额卡阈值下方,存在拆分采购规避审批的嫌疑。
|
||||
【置信度】中
|
||||
|
||||
【风险发现】
|
||||
1. 字段:amount | 严重度:高 | 问题:金额 4900 元恰好卡在 5000 元阈值下方,疑似拆分采购规避比价 | 建议:说明定价依据或合并采购后重新提交
|
||||
|
||||
【说明与假设】未提供部门预算与同期采购记录,依据卡点值规则从严退回核实。
|
||||
```
|
||||
|
||||
### Example 3: large amount with complete material → pass (notes carry the attention point)

Input:

```
item_name=研发服务器2台, amount=46000, quantity=2, supplier_name=XX科技有限公司, quote_img=http://img/quote.pdf, category=it, reason=AI训练集群扩容,含两台GPU服务器,附三家比价
```

Output style:
|
||||
```
|
||||
【审核决策】放行
|
||||
【决策说明】金额 46000 元属大额,请审批人核对预算额度与审批权限;物品、供应商、比价材料齐全,事由具体,无硬性缺陷,仅此一条大额关注。
|
||||
【一句话摘要】大额属唯一“中”级风险,供应商与比价齐全,放行进入人工审批并强提示金额。
|
||||
【置信度】中
|
||||
|
||||
【风险发现】
|
||||
1. 字段:amount | 严重度:中 | 问题:采购金额 46000 元属大额,需核对预算与审批权限 | 建议:审批人核对部门预算与审批权限
|
||||
|
||||
【说明与假设】部门预算基准未提供,金额合理性以比价材料与常识判断。
|
||||
```
|
||||
|
||||
### Example 4: two stacked medium risks → return

Input:

```
item_name=办公用品, amount=12000, supplier_name=XX贸易有限公司, category=office, reason=日常采购
```

Output style:
|
||||
```
|
||||
【审核决策】退回
|
||||
【决策说明】金额 12000 元超 10000 元但未附任何报价/比价材料(中),且事由“日常采购”笼统、无法判断必要性与用途(中);两条中级风险叠加,整体合理性不可靠,请补充比价与采购明细后重新提交。
|
||||
【一句话摘要】大额无比价叠加事由笼统,两条中级风险触发从严退回。
|
||||
【置信度】中
|
||||
|
||||
【风险发现】
|
||||
1. 字段:quote_img | 严重度:中 | 问题:金额 12000 元超 10000 元但未附报价/比价材料 | 建议:补充至少一份报价单或比价记录
|
||||
2. 字段:reason | 严重度:中 | 问题:事由“日常采购”笼统,无法判断必要性与用途 | 建议:补全采购物品明细与用途
|
||||
|
||||
【说明与假设】未提供报价材料与预算;依据中级风险叠加规则退回核实。
|
||||
```
|
||||
|
||||
### Example 5: ordinary small purchase → pass (notes are "无")

Input:

```
item_name=A4打印纸20箱, amount=1200, quantity=20, supplier_name=XX办公用品有限公司, quote_img=http://img/q.png, category=office, reason=行政部季度办公耗材补充
```

Output style:
|
||||
```
|
||||
【审核决策】放行
|
||||
【决策说明】无。
|
||||
【一句话摘要】金额小、物品与事由一致、供应商合规、有报价,无明显风险。
|
||||
【置信度】高
|
||||
|
||||
【风险发现】无
|
||||
|
||||
【说明与假设】基于当前单据字段判断,未发现异常。
|
||||
```
|
||||
|
||||
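## Guidelines

- **Output text only**; never give JSON, code, or Markdown tables as the final conclusion (the code blocks in the examples only demonstrate layout; reply with plain text directly).
- **Separating decision from notes is an iron rule**: first use the decision rules to settle pass or return, then write the notes independently.
- **Strictness is the keynote of this edition**: when the supplier cannot be verified, necessity is doubtful, the price lacks support, or the information is insufficient to rule out risk, **return by default**. Treat "can this purchase be shown to be real, necessary, compliant, and reasonably priced" as the precondition for a pass.
- **Medium risks accumulate**: one medium risk passes with a note; two or more return. Count medium findings truthfully.
- Decisions must be **stable and reproducible**: the same input should yield the same decision, which eases downstream extraction and regression testing.
- When optional context (budget baseline, purchase history, quotation material) is missing, say so in `【说明与假设】` (notes and assumptions) and lean strict; do not invent data.
- This is an **aid for first-pass review** and does not replace the final judgment of purchasing or finance. Use wording such as "suspected", "recommended", or "needs verification", but strict does not mean vague: the reason for a return must be specific and actionable.
- The decision must agree with risk severity: any high finding means 退回 (return); two or more medium findings mean 退回; exactly one medium finding, or only low findings, means 放行 (pass).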
|
||||
@ -1,5 +1,5 @@
|
||||
#!/bin/bash
|
||||
# Optimized startup script - integrates the FastAPI application and queue consumer
|
||||
# Optimized startup script for the FastAPI application
|
||||
|
||||
set -e
|
||||
|
||||
@ -7,7 +7,6 @@ set -e
|
||||
DEFAULT_HOST="0.0.0.0"
|
||||
DEFAULT_PORT="8001"
|
||||
DEFAULT_API_WORKERS="4"
|
||||
DEFAULT_QUEUE_WORKERS="2"
|
||||
DEFAULT_PROFILE="balanced"
|
||||
DEFAULT_LOG_LEVEL="info"
|
||||
DEFAULT_MAX_RESTARTS="3"
|
||||
@ -17,7 +16,6 @@ DEFAULT_CHECK_INTERVAL="5"
|
||||
HOST=${HOST:-$DEFAULT_HOST}
|
||||
PORT=${PORT:-$DEFAULT_PORT}
|
||||
API_WORKERS=${API_WORKERS:-$DEFAULT_API_WORKERS}
|
||||
QUEUE_WORKERS=${QUEUE_WORKERS:-$DEFAULT_QUEUE_WORKERS}
|
||||
PROFILE=${PROFILE:-$DEFAULT_PROFILE}
|
||||
LOG_LEVEL=${LOG_LEVEL:-$DEFAULT_LOG_LEVEL}
|
||||
MAX_RESTARTS=${MAX_RESTARTS:-$DEFAULT_MAX_RESTARTS}
|
||||
@ -47,7 +45,6 @@ print_config() {
|
||||
print_color $GREEN "Startup configuration:"
|
||||
echo "- API server: http://$HOST:$PORT"
|
||||
echo "- API worker processes: $API_WORKERS"
|
||||
echo "- Queue worker threads: $QUEUE_WORKERS"
|
||||
echo "- Performance profile: $PROFILE"
|
||||
echo "- Log level: $LOG_LEVEL"
|
||||
echo "- Maximum restarts: $MAX_RESTARTS"
|
||||
@ -87,7 +84,6 @@ create_directories() {
|
||||
print_color $YELLOW "Creating project directories..."
|
||||
|
||||
directories=(
|
||||
"projects/queue_data"
|
||||
"projects/data"
|
||||
"projects/uploads"
|
||||
"projects/robot"
|
||||
@ -161,16 +157,6 @@ start_services() {
|
||||
API_PID=$!
|
||||
echo "API server PID: $API_PID"
|
||||
|
||||
# Start the queue consumer
|
||||
print_color $BLUE "Starting queue consumer..."
|
||||
python3 task_queue/consumer.py \
|
||||
--workers=$QUEUE_WORKERS \
|
||||
--worker-type=threads \
|
||||
> queue_consumer.log 2>&1 &
|
||||
|
||||
CONSUMER_PID=$!
|
||||
echo "Queue consumer PID: $CONSUMER_PID"
|
||||
|
||||
echo
|
||||
print_color $GREEN "All services started successfully!"
|
||||
print_color $GREEN "API server: http://$HOST:$PORT"
|
||||
@ -179,7 +165,7 @@ start_services() {
|
||||
}
|
||||
|
||||
monitor_services() {
|
||||
local restart_counts=(0 0) # API, Consumer
|
||||
local restart_counts=(0) # API
|
||||
|
||||
while true; do
|
||||
# Check the API server
|
||||
@ -205,26 +191,6 @@ monitor_services() {
|
||||
fi
|
||||
fi
|
||||
|
||||
# Check the queue consumer
|
||||
if ! kill -0 $CONSUMER_PID 2>/dev/null; then
|
||||
print_color $RED "Queue consumer stopped unexpectedly"
|
||||
|
||||
if [ ${restart_counts[1]} -lt $MAX_RESTARTS ]; then
|
||||
print_color $YELLOW "Restarting queue consumer (${restart_counts[1]} + 1/$MAX_RESTARTS)..."
|
||||
python3 task_queue/consumer.py \
|
||||
--workers=$QUEUE_WORKERS \
|
||||
--worker-type=threads \
|
||||
>> queue_consumer.log 2>&1 &
|
||||
|
||||
CONSUMER_PID=$!
|
||||
restart_counts[1]=$((restart_counts[1] + 1))
|
||||
print_color $GREEN "Queue consumer restarted successfully, PID: $CONSUMER_PID"
|
||||
else
|
||||
print_color $RED "Queue consumer restart limit reached, stopping all services"
|
||||
break
|
||||
fi
|
||||
fi
|
||||
|
||||
# Wait for the next check interval
|
||||
sleep $CHECK_INTERVAL
|
||||
done
|
||||
@ -253,25 +219,6 @@ cleanup() {
|
||||
fi
|
||||
fi
|
||||
|
||||
# Stop the queue consumer
|
||||
if [ ! -z "$CONSUMER_PID" ] && kill -0 $CONSUMER_PID 2>/dev/null; then
|
||||
print_color $BLUE "Stopping queue consumer (PID: $CONSUMER_PID)..."
|
||||
kill $CONSUMER_PID 2>/dev/null || true
|
||||
|
||||
# Wait for graceful shutdown
|
||||
local count=0
|
||||
while kill -0 $CONSUMER_PID 2>/dev/null && [ $count -lt 10 ]; do
|
||||
sleep 1
|
||||
count=$((count + 1))
|
||||
done
|
||||
|
||||
# Force terminate if it is still running
|
||||
if kill -0 $CONSUMER_PID 2>/dev/null; then
|
||||
print_color $RED "Force stopping queue consumer..."
|
||||
kill -9 $CONSUMER_PID 2>/dev/null || true
|
||||
fi
|
||||
fi
|
||||
|
||||
print_color $GREEN "All services have been stopped"
|
||||
exit 0
|
||||
}
|
||||
@ -288,7 +235,6 @@ main() {
|
||||
echo " HOST API bind host address (default: $DEFAULT_HOST)"
|
||||
echo " PORT API bind port (default: $DEFAULT_PORT)"
|
||||
echo " API_WORKERS Number of API worker processes (default: $DEFAULT_API_WORKERS)"
|
||||
echo " QUEUE_WORKERS Number of queue worker threads (default: $DEFAULT_QUEUE_WORKERS)"
|
||||
echo " PROFILE Performance profile: low_memory, balanced, high_performance (default: $DEFAULT_PROFILE)"
|
||||
echo " LOG_LEVEL Log level: debug, info, warning, error (default: $DEFAULT_LOG_LEVEL)"
|
||||
echo " MAX_RESTARTS Maximum restart count (default: $DEFAULT_MAX_RESTARTS)"
|
||||
@ -296,7 +242,7 @@ main() {
|
||||
echo
|
||||
echo "Examples:"
|
||||
echo " PROFILE=high_performance API_WORKERS=8 $0"
|
||||
echo " PORT=8080 QUEUE_WORKERS=4 $0"
|
||||
echo " PORT=8080 API_WORKERS=4 $0"
|
||||
exit 0
|
||||
fi
|
||||
|
||||
|
||||
@ -1,6 +1,6 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Optimized unified startup script combining the FastAPI application and queue consumer.
|
||||
Optimized unified startup script for the FastAPI application.
|
||||
Supports performance monitoring, automatic restart, graceful shutdown, and related features.
|
||||
"""
|
||||
|
||||
@ -17,7 +17,7 @@ from typing import List, Optional, Dict, Any
|
||||
|
||||
|
||||
class ProcessManager:
|
||||
"""Process manager that controls the API service and queue consumer."""
|
||||
"""Process manager that controls the API service."""
|
||||
|
||||
def __init__(self):
|
||||
self.processes: Dict[str, subprocess.Popen] = {}
|
||||
@ -78,44 +78,6 @@ class ProcessManager:
|
||||
print(f"Failed to start API server: {e}")
|
||||
return None
|
||||
|
||||
def start_queue_consumer(self, args) -> Optional[subprocess.Popen]:
|
||||
"""Start the queue consumer."""
|
||||
print("Starting queue consumer...")
|
||||
|
||||
consumer_script = Path("task_queue/consumer.py")
|
||||
if not consumer_script.exists():
|
||||
consumer_script = consumer_script.with_suffix(".pyc")
|
||||
|
||||
# Build the queue consumer command
|
||||
cmd = [
|
||||
sys.executable,
|
||||
str(consumer_script),
|
||||
"--workers", str(args.queue_workers),
|
||||
"--worker-type", args.worker_type
|
||||
]
|
||||
|
||||
try:
|
||||
process = subprocess.Popen(
|
||||
cmd,
|
||||
stdout=subprocess.PIPE,
|
||||
stderr=subprocess.STDOUT,
|
||||
universal_newlines=True,
|
||||
bufsize=1
|
||||
)
|
||||
|
||||
# Start the output monitoring thread
|
||||
threading.Thread(
|
||||
target=self._monitor_output,
|
||||
args=(process, "Queue consumer"),
|
||||
daemon=True
|
||||
).start()
|
||||
|
||||
return process
|
||||
|
||||
except Exception as e:
|
||||
print(f"Failed to start queue consumer: {e}")
|
||||
return None
|
||||
|
||||
def _monitor_output(self, process: subprocess.Popen, name: str):
|
||||
"""Monitor process output."""
|
||||
try:
|
||||
@ -138,8 +100,6 @@ class ProcessManager:
|
||||
|
||||
if name == "API server":
|
||||
new_process = self.start_api_server(args)
|
||||
elif name == "Queue consumer":
|
||||
new_process = self.start_queue_consumer(args)
|
||||
else:
|
||||
return False
|
||||
|
||||
@ -169,27 +129,19 @@ class ProcessManager:
|
||||
print("Failed to start API server; exiting")
|
||||
return False
|
||||
|
||||
queue_process = self.start_queue_consumer(args)
|
||||
if not queue_process:
|
||||
print("Failed to start queue consumer; exiting")
|
||||
api_process.terminate()
|
||||
return False
|
||||
|
||||
self.processes["API server"] = api_process
|
||||
self.processes["Queue consumer"] = queue_process
|
||||
|
||||
print("\n" + "=" * 70)
|
||||
print("All services started successfully!")
|
||||
print(f"API server: http://{args.host}:{args.port}")
|
||||
print(f"API PID: {api_process.pid}")
|
||||
print(f"Queue consumer PID: {queue_process.pid}")
|
||||
print("Press Ctrl+C to stop all services")
|
||||
print("=" * 70 + "\n")
|
||||
|
||||
self.running = True
|
||||
|
||||
# Main monitoring loop
|
||||
restart_counts = {"API server": 0, "Queue consumer": 0}
|
||||
restart_counts = {"API server": 0}
|
||||
max_restarts = args.max_restarts
|
||||
|
||||
while self.running and not self.shutdown_event.is_set():
|
||||
@ -262,7 +214,6 @@ class ProcessManager:
|
||||
def create_directories(self):
|
||||
"""Create the required directories."""
|
||||
directories = [
|
||||
"projects/queue_data",
|
||||
"projects/data",
|
||||
"projects/uploads",
|
||||
"projects/robot",
|
||||
@ -313,11 +264,6 @@ def parse_args():
|
||||
parser.add_argument("--log-level", type=str, default="info",
|
||||
choices=["debug", "info", "warning", "error"], help="Log level")
|
||||
|
||||
# Queue consumer configuration
|
||||
parser.add_argument("--queue-workers", type=int, default=2, help="Number of queue consumer worker threads")
|
||||
parser.add_argument("--worker-type", type=str, default="threads",
|
||||
choices=["threads", "greenlets", "gevent"], help="Queue worker type")
|
||||
|
||||
# Performance profile
|
||||
parser.add_argument("--profile", type=str, default="low_memory",
|
||||
choices=["low_memory", "balanced", "high_performance"], help="Performance profile")
|
||||
|
||||
@ -1,154 +0,0 @@
|
||||
# Queue System Usage Guide

## Overview

This project integrates an asynchronous queue system built on huey's SqliteHuey, used for asynchronous file-processing tasks.

## Installing Dependencies

```bash
pip install huey
```

## Directory Structure

```
task_queue/
├── __init__.py         # Package initialization
├── config.py           # Queue configuration (SqliteHuey)
├── tasks.py            # File-processing task definitions
├── manager.py          # Queue manager
├── consumer.py         # Queue consumer (worker process)
├── example.py          # Usage example
└── README.md           # This document
```

## Core Features

### 1. Queue configuration (config.py)

- Uses SqliteHuey as the message queue
- The database file is stored at `projects/queue_data/huey.db`
- Supports task retries and error storage

### 2. File-processing tasks (tasks.py)

- `process_file_async`: process a single file asynchronously
- `process_multiple_files_async`: process a batch of files asynchronously
- `process_zip_file_async`: process a zip archive asynchronously
- `cleanup_processed_files`: clean up old files

### 3. Queue manager (manager.py)

- Task submission and management
- Queue status monitoring
- Task result lookup
- Task record cleanup
|
||||
|
||||
## Usage

### 1. Start the queue consumer

```bash
# Start a consumer with the default configuration
python task_queue/consumer.py

# Set the number of worker threads
python task_queue/consumer.py --workers 4

# Show queue statistics
python task_queue/consumer.py --stats

# Check queue status
python task_queue/consumer.py --check

# Clear the queue
python task_queue/consumer.py --flush
```

### 2. Use the queue from code

```python
from task_queue.manager import queue_manager

# Process a single file
task_id = queue_manager.enqueue_file(
    project_id="my_project",
    file_path="/path/to/file.txt",
    original_filename="myfile.txt"
)

# Process a batch of files
task_ids = queue_manager.enqueue_multiple_files(
    project_id="my_project",
    file_paths=["/path/file1.txt", "/path/file2.txt"],
    original_filenames=["file1.txt", "file2.txt"]
)

# Process a zip archive
task_id = queue_manager.enqueue_zip_file(
    project_id="my_project",
    zip_path="/path/to/archive.zip"
)

# Look up the task status
status = queue_manager.get_task_status(task_id)
print(status)

# Get queue statistics
stats = queue_manager.get_queue_stats()
print(stats)
```

### 3. Run the example

```bash
python task_queue/example.py
```
|
||||
|
||||
## Configuration

### Queue configuration parameters (config.py)

- `filename`: path of the SQLite database file
- `always_eager`: whether to run tasks immediately (can be set to True during development)
- `utc`: whether to use UTC time
- `compression_level`: compression level
- `store_errors`: whether to store error information
- `max_retries`: maximum number of retries
- `retry_delay`: delay between retries

### Consumer parameters (consumer.py)

- `--workers`: number of worker threads (default 2)
- `--worker-type`: worker type (threads/greenlets/processes)
- `--stats`: show statistics
- `--check`: check queue status
- `--flush`: clear the queue

## Task States

- `pending`: waiting to be processed
- `running`: being processed
- `complete/finished`: processing finished
- `error`: processing failed
- `scheduled`: scheduled task

## Best Practices

1. **Production**:
   - Set a suitable number of worker threads (1 to 2 times the CPU core count is suggested)
   - Clean up old task records regularly
   - Monitor queue status and task execution

2. **Development**:
   - Set `always_eager=True` to run tasks immediately for debugging
   - Use the `--check` option to inspect queue status
   - Run the example code to learn the features

3. **Error handling**:
   - Failed tasks are retried automatically (up to 3 times)
   - Error information is stored in the database
   - Use `get_task_status()` to view error details

## Troubleshooting

1. **Database locked**: make sure only one consumer instance is running
2. **Task stuck**: check file paths and permissions
3. **Out of memory**: adjust the number of worker threads or use process mode
4. **Disk space**: clean up old files and task records regularly
|
||||
@ -1,23 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Queue package initialization.
|
||||
"""
|
||||
|
||||
from .config import huey
|
||||
from .manager import QueueManager, queue_manager
|
||||
from .tasks import (
|
||||
process_file_async,
|
||||
process_multiple_files_async,
|
||||
process_zip_file_async,
|
||||
cleanup_processed_files
|
||||
)
|
||||
|
||||
__all__ = [
|
||||
"huey",
|
||||
"QueueManager",
|
||||
"queue_manager",
|
||||
"process_file_async",
|
||||
"process_multiple_files_async",
|
||||
"process_zip_file_async",
|
||||
"cleanup_processed_files"
|
||||
]
|
||||
@ -1,31 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Queue configuration using SqliteHuey for asynchronous file processing.
|
||||
"""
|
||||
|
||||
import os
|
||||
import logging
|
||||
from huey import SqliteHuey
|
||||
from datetime import timedelta
|
||||
|
||||
# Configure logging
|
||||
logger = logging.getLogger('app')
|
||||
|
||||
# Ensure projects/queue_data directory exists
|
||||
queue_data_dir = os.path.join(os.path.dirname(__file__), '..', 'projects', 'queue_data')
|
||||
os.makedirs(queue_data_dir, exist_ok=True)
|
||||
|
||||
# Initialize SqliteHuey
|
||||
huey = SqliteHuey(
|
||||
filename=os.path.join(queue_data_dir, 'huey.db'),
|
||||
name='file_processor', # Queue name
|
||||
always_eager=False, # Set to False to enable async processing
|
||||
utc=True, # Use UTC time
|
||||
)
|
||||
|
||||
# Set default task configuration
|
||||
huey.store_errors = True # Store error information
|
||||
huey.max_retries = 3 # Maximum retry count
|
||||
huey.retry_delay = timedelta(seconds=60) # Retry delay
|
||||
|
||||
logger.info(f"SqliteHuey queue initialized, database path: {os.path.join(queue_data_dir, 'huey.db')}")
|
||||
@ -1,171 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Queue consumer for processing file tasks.
|
||||
"""
|
||||
|
||||
import sys
|
||||
import os
|
||||
import time
|
||||
import signal
|
||||
import argparse
|
||||
from pathlib import Path
|
||||
|
||||
# Add project root directory to Python path
|
||||
project_root = Path(__file__).parent.parent
|
||||
sys.path.insert(0, str(project_root))
|
||||
|
||||
from task_queue.config import huey
|
||||
from task_queue.manager import queue_manager
|
||||
from task_queue.integration_tasks import process_files_async, cleanup_project_async
|
||||
from huey.consumer import Consumer
|
||||
|
||||
|
||||
class QueueConsumer:
|
||||
"""Queue consumer for processing async tasks."""
|
||||
|
||||
def __init__(self, worker_type: str = "threads", workers: int = 2):
|
||||
self.huey = huey
|
||||
self.worker_type = worker_type
|
||||
self.workers = workers
|
||||
self.running = False
|
||||
self.consumer = None
|
||||
|
||||
# Register signal handlers
|
||||
signal.signal(signal.SIGINT, self._signal_handler)
|
||||
signal.signal(signal.SIGTERM, self._signal_handler)
|
||||
|
||||
def _signal_handler(self, signum, frame):
|
||||
"""Signal handler for graceful shutdown."""
|
||||
print(f"\nReceived signal {signum}, shutting down queue consumer...")
|
||||
self.running = False
|
||||
|
||||
def start(self):
|
||||
"""Start the queue consumer."""
|
||||
print(f"Starting queue consumer...")
|
||||
print(f"Workers: {self.workers}")
|
||||
print(f"Worker type: {self.worker_type}")
|
||||
print(f"Database: {os.path.join(os.path.dirname(__file__), '..', 'projects', 'queue_data', 'huey.db')}")
|
||||
print("Press Ctrl+C to stop the consumer")
|
||||
|
||||
self.running = True
|
||||
|
||||
try:
|
||||
# Create Huey consumer
|
||||
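# Map the plural CLI names to Huey's worker types; rstrip('s') turned "processes" into "processe"
self.consumer = Consumer(self.huey, workers=self.workers, worker_type={"threads": "thread", "greenlets": "greenlet", "processes": "process"}[self.worker_type])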
|
||||
|
||||
# Display queue statistics
|
||||
stats = queue_manager.get_queue_stats()
|
||||
print(f"Current queue status: {stats}")
|
||||
|
||||
# Start consumer run loop
|
||||
print("Consumer starting task processing...")
|
||||
self.consumer.run()
|
||||
|
||||
except KeyboardInterrupt:
|
||||
print("\nReceived interrupt signal, shutting down...")
|
||||
except Exception as e:
|
||||
print(f"Queue consumer runtime error: {str(e)}")
|
||||
finally:
|
||||
self.stop()
|
||||
|
||||
def stop(self):
|
||||
"""Stop the queue consumer."""
|
||||
print("Stopping queue consumer...")
|
||||
try:
|
||||
if self.consumer:
|
||||
# Stop the consumer
|
||||
self.consumer.stop()
|
||||
self.consumer = None
|
||||
print("Queue consumer stopped")
|
||||
except Exception as e:
|
||||
print(f"Error stopping queue consumer: {str(e)}")
|
||||
|
||||
def process_scheduled_tasks(self):
|
||||
"""Process scheduled tasks."""
|
||||
print("Processing scheduled tasks...")
|
||||
# Additional scheduled task processing logic can be added here
|
||||
|
||||
|
||||
def main():
|
||||
"""Main entry point."""
|
||||
parser = argparse.ArgumentParser(description="File processing queue consumer")
|
||||
parser.add_argument(
|
||||
"--workers",
|
||||
type=int,
|
||||
default=2,
|
||||
help="Number of workers (default: 2)"
|
||||
)
|
||||
parser.add_argument(
|
||||
"--worker-type",
|
||||
choices=["threads", "greenlets", "processes"],
|
||||
default="threads",
|
||||
help="Worker type (default: threads)"
|
||||
)
|
||||
parser.add_argument(
|
||||
"--stats",
|
||||
action="store_true",
|
||||
help="Display queue statistics and exit"
|
||||
)
|
||||
parser.add_argument(
|
||||
"--flush",
|
||||
action="store_true",
|
||||
help="Flush the queue and exit"
|
||||
)
|
||||
parser.add_argument(
|
||||
"--check",
|
||||
action="store_true",
|
||||
help="Check queue status and exit"
|
||||
)
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
# Initialize consumer
|
||||
consumer = QueueConsumer(
|
||||
worker_type=args.worker_type,
|
||||
workers=args.workers
|
||||
)
|
||||
|
||||
# Handle different command-line options
|
||||
if args.stats:
|
||||
print("=== Queue Statistics ===")
|
||||
stats = queue_manager.get_queue_stats()
|
||||
print(f"Total tasks: {stats.get('total_tasks', 0)}")
|
||||
print(f"Pending tasks: {stats.get('pending_tasks', 0)}")
|
||||
print(f"Running tasks: {stats.get('running_tasks', 0)}")
|
||||
print(f"Completed tasks: {stats.get('completed_tasks', 0)}")
|
||||
print(f"Error tasks: {stats.get('error_tasks', 0)}")
|
||||
print(f"Scheduled tasks: {stats.get('scheduled_tasks', 0)}")
|
||||
print(f"Database: {stats.get('queue_database', 'N/A')}")
|
||||
return
|
||||
|
||||
if args.flush:
|
||||
print("=== Flushing Queue ===")
|
||||
try:
|
||||
# Flush all tasks
|
||||
consumer.huey.flush()
|
||||
print("Queue flushed")
|
||||
except Exception as e:
|
||||
print(f"Failed to flush queue: {str(e)}")
|
||||
return
|
||||
|
||||
if args.check:
|
||||
print("=== Checking Queue Status ===")
|
||||
stats = queue_manager.get_queue_stats()
|
||||
print(f"Queue status: OK" if "error" not in stats else f"Queue status: ERROR - {stats['error']}")
|
||||
|
||||
pending_tasks = queue_manager.list_pending_tasks(limit=10)
|
||||
if pending_tasks:
|
||||
print(f"\nPending tasks (showing up to 10):")
|
||||
for task in pending_tasks:
|
||||
print(f" Task ID: {task['task_id']}, Status: {task['status']}, Created: {task['created_time']}")
|
||||
else:
|
||||
print("No pending tasks")
|
||||
return
|
||||
|
||||
# Start consumer
|
||||
print("=== Starting File Processing Queue Consumer ===")
|
||||
consumer.start()
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
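The CLI accepts plural worker-type names (`threads`, `greenlets`, `processes`), while Huey's `Consumer` expects `thread`, `greenlet`, or `process`. Deriving the singular form with `str.rstrip('s')` is fragile, because `rstrip` removes a set of trailing characters rather than a suffix. A minimal sketch of an explicit mapping follows; the helper name `to_huey_worker_type` is hypothetical and not part of the original module.

```python
# Hypothetical helper: translate the CLI's plural names into the singular
# worker-type strings that Huey's Consumer expects.
WORKER_TYPES = {
    "threads": "thread",
    "greenlets": "greenlet",
    "processes": "process",
}


def to_huey_worker_type(cli_value: str) -> str:
    """Return the Huey worker type for a CLI value, or raise ValueError."""
    try:
        return WORKER_TYPES[cli_value]
    except KeyError:
        raise ValueError(f"unsupported worker type: {cli_value!r}") from None


# rstrip strips trailing "s" characters only, so the "es" plural is mangled.
print("processes".rstrip("s"))            # prints "processe"
print(to_huey_worker_type("processes"))   # prints "process"
```

An explicit table also gives one place to reject unexpected values instead of passing a malformed string to the consumer.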
@ -1,132 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Example usage of the queue system.
|
||||
"""
|
||||
|
||||
import sys
|
||||
import time
|
||||
from pathlib import Path
|
||||
|
||||
# Add project root directory to Python path
|
||||
project_root = Path(__file__).parent.parent
|
||||
sys.path.insert(0, str(project_root))
|
||||
|
||||
from task_queue.manager import queue_manager
|
||||
from task_queue.tasks import process_file_async, process_multiple_files_async
|
||||
|
||||
|
||||
def example_single_file():
|
||||
"""Example: Process a single file."""
|
||||
print("=== Example: Process a single file ===")
|
||||
|
||||
project_id = "test_project"
|
||||
file_path = "public/test_document.txt"
|
||||
|
||||
# Enqueue file for processing
|
||||
task_id = queue_manager.enqueue_file(
|
||||
project_id=project_id,
|
||||
file_path=file_path,
|
||||
original_filename="example_document.txt"
|
||||
)
|
||||
|
||||
print(f"Task submitted, task ID: {task_id}")
|
||||
|
||||
# Check task status
|
||||
time.sleep(2)
|
||||
status = queue_manager.get_task_status(task_id)
|
||||
print(f"Task status: {status}")
|
||||
|
||||
|
||||
def example_multiple_files():
|
||||
"""Example: Batch process files."""
|
||||
print("\n=== Example: Batch process files ===")
|
||||
|
||||
project_id = "test_project_batch"
|
||||
file_paths = [
|
||||
"public/test_document.txt",
|
||||
"public/goods.xlsx" # Assuming this file exists
|
||||
]
|
||||
original_filenames = [
|
||||
"batch_document_1.txt",
|
||||
"batch_goods.xlsx"
|
||||
]
|
||||
|
||||
# Enqueue multiple files for processing
|
||||
task_ids = queue_manager.enqueue_multiple_files(
|
||||
project_id=project_id,
|
||||
file_paths=file_paths,
|
||||
original_filenames=original_filenames
|
||||
)
|
||||
|
||||
print(f"Batch tasks submitted, task IDs: {task_ids}")
|
||||
|
||||
|
||||
def example_zip_file():
|
||||
"""Example: Process a zip file."""
|
||||
print("\n=== Example: Process a zip file ===")
|
||||
|
||||
project_id = "test_project_zip"
|
||||
zip_path = "public/all_hp_product_spec_book2506.zip"
|
||||
|
||||
# Enqueue zip file for processing
|
||||
task_id = queue_manager.enqueue_zip_file(
|
||||
project_id=project_id,
|
||||
zip_path=zip_path
|
||||
)
|
||||
|
||||
print(f"Zip task submitted, task ID: {task_id}")
|
||||
|
||||
|
||||
def example_queue_stats():
|
||||
"""Example: Get queue statistics."""
|
||||
print("\n=== Example: Queue statistics ===")
|
||||
|
||||
stats = queue_manager.get_queue_stats()
|
||||
print("Queue statistics:")
|
||||
for key, value in stats.items():
|
||||
if key != "recent_tasks":
|
||||
print(f" {key}: {value}")
|
||||
|
||||
|
||||
def example_cleanup():
|
||||
"""Example: Cleanup tasks."""
|
||||
print("\n=== Example: Cleanup tasks ===")
|
||||
|
||||
project_id = "test_project"
|
||||
|
||||
# Enqueue cleanup task (delayed 10 seconds)
|
||||
task_id = queue_manager.enqueue_cleanup_task(
|
||||
project_id=project_id,
|
||||
older_than_days=1, # Clean files older than 1 day
|
||||
delay=10
|
||||
)
|
||||
|
||||
print(f"Cleanup task submitted, task ID: {task_id}")
|
||||
|
||||
|
||||
def main():
|
||||
"""Main entry point."""
|
||||
print("Queue System Usage Examples")
|
||||
print("=" * 50)
|
||||
|
||||
try:
|
||||
# Run examples
|
||||
example_single_file()
|
||||
example_multiple_files()
|
||||
example_zip_file()
|
||||
example_queue_stats()
|
||||
example_cleanup()
|
||||
|
||||
print("\n" + "=" * 50)
|
||||
print("Examples completed!")
|
||||
print("\nTo check task execution, run:")
|
||||
print("python task_queue/consumer.py --check")
|
||||
print("\nTo start the queue consumer, run:")
|
||||
print("python task_queue/consumer.py")
|
||||
|
||||
except Exception as e:
|
||||
print(f"Error running examples: {str(e)}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
@ -1,499 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Queue tasks for file processing integration.
|
||||
"""
|
||||
|
||||
import os
|
||||
import json
|
||||
import time
|
||||
import hashlib
|
||||
import shutil
|
||||
from typing import Dict, List, Optional, Any
|
||||
|
||||
from task_queue.config import huey
|
||||
from task_queue.manager import queue_manager
|
||||
from task_queue.task_status import task_status_store
|
||||
from utils import download_dataset_files, save_processed_files_log, load_processed_files_log
|
||||
from utils.dataset_manager import remove_dataset_directory_by_key
|
||||
|
||||
|
||||
def scan_upload_folder(upload_dir: str) -> List[str]:
|
||||
"""
|
||||
Scan all supported file formats in the upload folder.
|
||||
|
||||
Args:
|
||||
upload_dir: Upload folder path
|
||||
|
||||
Returns:
|
||||
List[str]: List of supported file paths
|
||||
"""
|
||||
supported_extensions = {
|
||||
# Text files
|
||||
'.txt', '.md', '.rtf',
|
||||
# Document files
|
||||
'.doc', '.docx', '.pdf', '.odt',
|
||||
# Spreadsheet files
|
||||
'.xls', '.xlsx', '.csv', '.ods',
|
||||
# Presentation files
|
||||
'.ppt', '.pptx', '.odp',
|
||||
# E-books
|
||||
'.epub', '.mobi',
|
||||
# Web files
|
||||
'.html', '.htm',
|
||||
# Config files
|
||||
'.json', '.xml', '.yaml', '.yml',
|
||||
# Code files
|
||||
'.py', '.js', '.java', '.cpp', '.c', '.go', '.rs',
|
||||
# Archive files
|
||||
'.zip', '.rar', '.7z', '.tar', '.gz'
|
||||
}
|
||||
|
||||
scanned_files = []
|
||||
|
||||
if not os.path.exists(upload_dir):
|
||||
return scanned_files
|
||||
|
||||
for root, dirs, files in os.walk(upload_dir):
|
||||
for file in files:
|
||||
# Skip hidden files and system files
|
||||
if file.startswith('.') or file.startswith('~'):
|
||||
continue
|
||||
|
||||
file_path = os.path.join(root, file)
|
||||
file_extension = os.path.splitext(file)[1].lower()
|
||||
|
||||
# Check if file extension is supported
|
||||
if file_extension in supported_extensions:
|
||||
scanned_files.append(file_path)
|
||||
else:
|
||||
# For files without extension, try to process them (may be text files)
|
||||
if not file_extension:
|
||||
try:
|
||||
# Try reading the file header to determine if it's a text file
|
||||
with open(file_path, 'r', encoding='utf-8') as f:
|
||||
f.read(1024) # Read the first 1KB
|
||||
scanned_files.append(file_path)
|
||||
except (UnicodeDecodeError, PermissionError):
|
||||
# Not a text file or unreadable, skip
|
||||
pass
|
||||
|
||||
return scanned_files
|
||||
|
||||
|
||||
@huey.task()
|
||||
def process_files_async(
|
||||
dataset_id: str,
|
||||
files: Optional[Dict[str, List[str]]] = None,
|
||||
upload_folder: Optional[Dict[str, str]] = None,
|
||||
task_id: Optional[str] = None
|
||||
) -> Dict[str, Any]:
|
||||
"""
|
||||
Asynchronously process file tasks - compatible with existing files/process API.
|
||||
|
||||
Args:
|
||||
dataset_id: Unique project ID
|
||||
files: Dictionary of file paths grouped by key
|
||||
upload_folder: Upload folder dictionary organized by group name, e.g. {'group1': 'my_project1', 'group2': 'my_project2'}
|
||||
task_id: Task ID (for status tracking)
|
||||
|
||||
Returns:
|
||||
Processing result dictionary
|
||||
"""
|
||||
try:
|
||||
print(f"Starting async file processing task, project ID: {dataset_id}")
|
||||
|
||||
# If task_id is provided, set initial status
|
||||
if task_id:
|
||||
task_status_store.set_status(
|
||||
task_id=task_id,
|
||||
unique_id=dataset_id,
|
||||
status="running"
|
||||
)
|
||||
|
||||
# Ensure project directory exists
|
||||
project_dir = os.path.join("projects", "data", dataset_id)
|
||||
if not os.path.exists(project_dir):
|
||||
os.makedirs(project_dir, exist_ok=True)
|
||||
|
||||
# Process files: use key-grouped format
|
||||
processed_files_by_key = {}
|
||||
|
||||
# If upload_folder is provided, scan files in those folders
|
||||
if upload_folder and not files:
|
||||
scanned_files_by_group = {}
|
||||
total_scanned_files = 0
|
||||
|
||||
for group_name, folder_name in upload_folder.items():
|
||||
# Security check: prevent path traversal attacks
|
||||
safe_folder_name = os.path.basename(folder_name)
|
||||
upload_dir = os.path.join("projects", "uploads", safe_folder_name)
|
||||
|
||||
if os.path.exists(upload_dir):
|
||||
scanned_files = scan_upload_folder(upload_dir)
|
||||
if scanned_files:
|
||||
scanned_files_by_group[group_name] = scanned_files
|
||||
total_scanned_files += len(scanned_files)
|
||||
print(f"Scanned {len(scanned_files)} files from upload folder '{safe_folder_name}' (group: {group_name})")
|
||||
else:
|
||||
print(f"No supported files found in upload folder '{safe_folder_name}' (group: {group_name})")
|
||||
else:
|
||||
print(f"Upload folder does not exist: {upload_dir} (group: {group_name})")
|
||||
|
||||
if scanned_files_by_group:
|
||||
files = scanned_files_by_group
|
||||
print(f"Total scanned {total_scanned_files} files from {len(scanned_files_by_group)} groups")
|
||||
else:
|
||||
print("No supported files found in any upload folder")
|
||||
|
||||
if files:
|
||||
# Use files from the request (grouped by key)
|
||||
# The Huey task body runs synchronously in a worker, so drive the download coroutine on an event loop here
|
||||
import asyncio
|
||||
try:
|
||||
loop = asyncio.get_event_loop()
|
||||
except RuntimeError:
|
||||
loop = asyncio.new_event_loop()
|
||||
asyncio.set_event_loop(loop)
|
||||
|
||||
processed_files_by_key = loop.run_until_complete(download_dataset_files(dataset_id, files))
|
||||
total_files = sum(len(files_list) for files_list in processed_files_by_key.values())
|
||||
print(f"Async processed {total_files} dataset files across {len(processed_files_by_key)} keys, project ID: {dataset_id}")
|
||||
else:
|
||||
print(f"No files provided in request, project ID: {dataset_id}")
|
||||
|
||||
# Collect all document.txt files in the project directory
|
||||
document_files = []
|
||||
for root, dirs, files_list in os.walk(project_dir):
|
||||
for file in files_list:
|
||||
if file == "document.txt":
|
||||
document_files.append(os.path.join(root, file))
|
||||
|
||||
# Generate project README.md file
|
||||
try:
|
||||
from utils.project_manager import save_project_readme
|
||||
save_project_readme(dataset_id)
|
||||
print(f"README.md generated, project ID: {dataset_id}")
|
||||
except Exception as e:
|
||||
print(f"Failed to generate README.md, project ID: {dataset_id}, error: {str(e)}")
|
||||
# Does not affect main processing flow, continue
|
||||
|
||||
# Build result file list
|
||||
result_files = []
|
||||
for key in processed_files_by_key.keys():
|
||||
# Add corresponding dataset document.txt path
|
||||
document_path = os.path.join("projects", "data", dataset_id, "datasets", key, "document.txt")
|
||||
if os.path.exists(document_path):
|
||||
result_files.append(document_path)
|
||||
|
||||
# Also add document.txt files that exist but are not in processed_files_by_key
|
||||
existing_document_paths = set(result_files) # Avoid duplicates
|
||||
for doc_file in document_files:
|
||||
if doc_file not in existing_document_paths:
|
||||
result_files.append(doc_file)
|
||||
|
||||
result = {
|
||||
"status": "success",
|
||||
"message": f"Successfully processed {len(result_files)} document files across {len(processed_files_by_key)} keys",
|
||||
"dataset_id": dataset_id,
|
||||
"processed_files": result_files,
|
||||
"processed_files_by_key": processed_files_by_key,
|
||||
"document_files": document_files,
|
||||
"total_files_processed": sum(len(files_list) for files_list in processed_files_by_key.values()),
|
||||
"processing_time": time.time()  # NOTE: completion timestamp (epoch seconds), not an elapsed duration
|
||||
}
|
||||
|
||||
# Update task status to completed
|
||||
if task_id:
|
||||
task_status_store.update_status(
|
||||
task_id=task_id,
|
||||
status="completed",
|
||||
result=result
|
||||
)
|
||||
|
||||
print(f"Async file processing task completed: {dataset_id}")
|
||||
return result
|
||||
|
||||
except Exception as e:
|
||||
error_msg = f"Error during async file processing: {str(e)}"
|
||||
print(error_msg)
|
||||
|
||||
# Update task status to error
|
||||
if task_id:
|
||||
task_status_store.update_status(
|
||||
task_id=task_id,
|
||||
status="failed",
|
||||
error=error_msg
|
||||
)
|
||||
|
||||
return {
|
||||
"status": "error",
|
||||
"message": error_msg,
|
||||
"dataset_id": dataset_id,
|
||||
"error": str(e)
|
||||
}
|
||||
|
||||
|
||||
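The upload-folder scan relies on `os.path.basename` to block path traversal. That removes directory components, but `os.path.basename("..")` is still `".."` and a trailing slash yields an empty string, so either value would resolve outside or onto the uploads root. A stricter check might look like the sketch below; `resolve_upload_dir` and `UPLOAD_ROOT` are hypothetical names introduced only for illustration.

```python
import os

# Assumed uploads root, mirroring the "projects/uploads" path used in the task.
UPLOAD_ROOT = os.path.join("projects", "uploads")


def resolve_upload_dir(folder_name: str, root: str = UPLOAD_ROOT) -> str:
    """Return the absolute upload directory for folder_name, or raise ValueError if unsafe."""
    name = os.path.basename(folder_name)
    # basename keeps "..", "." and "" intact, so reject them explicitly.
    if name in ("", ".", ".."):
        raise ValueError(f"invalid upload folder name: {folder_name!r}")
    root_abs = os.path.realpath(root)
    candidate = os.path.realpath(os.path.join(root_abs, name))
    # realpath follows symlinks; make sure the result is still under the root.
    if os.path.commonpath([root_abs, candidate]) != root_abs:
        raise ValueError(f"upload folder escapes the uploads root: {folder_name!r}")
    return candidate


print(os.path.basename(".."))  # prints ".."
print(resolve_upload_dir("my_project1"))
```

The extra `commonpath` comparison is a defensive step for symlinked folders and is not something the original task performs.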
@huey.task()
|
||||
def process_files_incremental_async(
|
||||
dataset_id: str,
|
||||
files_to_add: Optional[Dict[str, List[str]]] = None,
|
||||
files_to_remove: Optional[Dict[str, List[str]]] = None,
|
||||
system_prompt: Optional[str] = None,
|
||||
mcp_settings: Optional[List[Dict]] = None,
|
||||
task_id: Optional[str] = None
|
||||
) -> Dict[str, Any]:
|
||||
"""
|
||||
Incremental file processing task - supports adding and removing files.
|
||||
|
||||
Args:
|
||||
dataset_id: Unique project ID
|
||||
files_to_add: Dictionary of file paths to add, grouped by key
|
||||
files_to_remove: Dictionary of file paths to remove, grouped by key
|
||||
system_prompt: System prompt
|
||||
mcp_settings: MCP settings
|
||||
task_id: Task ID (for status tracking)
|
||||
|
||||
Returns:
|
||||
Processing result dictionary
|
||||
"""
|
||||
try:
|
||||
print(f"Starting incremental file processing task, project ID: {dataset_id}")
|
||||
|
||||
# If task_id is provided, set initial status
|
||||
if task_id:
|
||||
task_status_store.set_status(
|
||||
task_id=task_id,
|
||||
unique_id=dataset_id,
|
||||
status="running"
|
||||
)
|
||||
|
||||
# Ensure project directory exists
|
||||
project_dir = os.path.join("projects", "data", dataset_id)
|
||||
if not os.path.exists(project_dir):
|
||||
os.makedirs(project_dir, exist_ok=True)
|
||||
|
||||
# Load existing processing log
|
||||
processed_log = load_processed_files_log(dataset_id)
|
||||
print(f"Loaded existing processing log with {len(processed_log)} file records")
|
||||
|
||||
removed_files = []
|
||||
added_files = []
|
||||
|
||||
# 1. Process removals
|
||||
if files_to_remove:
|
||||
print(f"Starting removal processing across {len(files_to_remove)} key groups")
|
||||
for key, file_list in files_to_remove.items():
|
||||
if not file_list: # If file list is empty, remove the entire key group
|
||||
print(f"Removing entire key group: {key}")
|
||||
if remove_dataset_directory_by_key(dataset_id, key):
|
||||
removed_files.append(f"dataset/{key}")
|
||||
|
||||
# Remove all records for this key from the processing log
|
||||
keys_to_remove = [file_hash for file_hash, file_info in processed_log.items()
|
||||
if file_info.get('key') == key]
|
||||
for file_hash in keys_to_remove:
|
||||
del processed_log[file_hash]
|
||||
removed_files.append(f"log_entry:{file_hash}")
|
||||
else:
|
||||
# Remove specific files
|
||||
for file_path in file_list:
|
||||
print(f"Removing specific file: {key}/{file_path}")
|
||||
|
||||
# Actually delete the file
|
||||
filename = os.path.basename(file_path)
|
||||
|
||||
# Delete original file
|
||||
source_file = os.path.join("projects", "data", dataset_id, "files", key, filename)
|
||||
if os.path.exists(source_file):
|
||||
os.remove(source_file)
|
||||
removed_files.append(f"file:{key}/{filename}")
|
||||
|
||||
# Delete processed file directory
|
||||
processed_dir = os.path.join("projects", "data", dataset_id, "processed", key, filename)
|
||||
if os.path.exists(processed_dir):
|
||||
shutil.rmtree(processed_dir)
|
||||
removed_files.append(f"processed:{key}/{filename}")
|
||||
|
||||
# Compute file hash to find in log
|
||||
file_hash = hashlib.md5(file_path.encode('utf-8')).hexdigest()
|
||||
|
||||
# Remove from processing log
|
||||
if file_hash in processed_log:
|
||||
del processed_log[file_hash]
|
||||
removed_files.append(f"log_entry:{file_hash}")
|
||||
|
||||
# 2. Process additions
|
||||
processed_files_by_key = {}
|
||||
if files_to_add:
|
||||
print(f"Starting addition processing across {len(files_to_add)} key groups")
|
||||
# Use async processing to download files
|
||||
import asyncio
|
||||
try:
|
||||
loop = asyncio.get_event_loop()
|
||||
except RuntimeError:
|
||||
loop = asyncio.new_event_loop()
|
||||
asyncio.set_event_loop(loop)
|
||||
|
||||
processed_files_by_key = loop.run_until_complete(download_dataset_files(dataset_id, files_to_add, incremental_mode=True))
|
||||
total_added_files = sum(len(files_list) for files_list in processed_files_by_key.values())
|
||||
print(f"Async processed {total_added_files} dataset files across {len(processed_files_by_key)} keys, project ID: {dataset_id}")
|
||||
|
||||
# Record added files
|
||||
for key, files_list in processed_files_by_key.items():
|
||||
for file_path in files_list:
|
||||
added_files.append(f"{key}/{file_path}")
|
||||
else:
|
||||
print(f"No files to add provided in request, project ID: {dataset_id}")
|
||||
|
||||
# Save updated processing log
|
||||
save_processed_files_log(dataset_id, processed_log)
|
||||
print(f"Updated processing log, now contains {len(processed_log)} file records")
|
||||
|
||||
# Save system_prompt and mcp_settings to project directory (if provided)
|
||||
if system_prompt:
|
||||
system_prompt_file = os.path.join(project_dir, "system_prompt.md")
|
||||
with open(system_prompt_file, 'w', encoding='utf-8') as f:
|
||||
f.write(system_prompt)
|
||||
print(f"Saved system_prompt, project ID: {dataset_id}")
|
||||
|
||||
if mcp_settings:
|
||||
mcp_settings_file = os.path.join(project_dir, "mcp_settings.json")
|
||||
with open(mcp_settings_file, 'w', encoding='utf-8') as f:
|
||||
json.dump(mcp_settings, f, ensure_ascii=False, indent=2)
|
||||
print(f"Saved mcp_settings, project ID: {dataset_id}")
|
||||
|
||||
# Generate project README.md file
|
||||
try:
|
||||
from utils.project_manager import save_project_readme
|
||||
save_project_readme(dataset_id)
|
||||
print(f"README.md generated, project ID: {dataset_id}")
|
||||
except Exception as e:
|
||||
print(f"Failed to generate README.md, project ID: {dataset_id}, error: {str(e)}")
|
||||
# Does not affect main processing flow, continue
|
||||
|
||||
# Collect all document.txt files in the project directory
|
||||
document_files = []
|
||||
for root, dirs, files_list in os.walk(project_dir):
|
||||
for file in files_list:
|
||||
if file == "document.txt":
|
||||
document_files.append(os.path.join(root, file))
|
||||
|
||||
# Build result file list
|
||||
result_files = []
|
||||
for key in processed_files_by_key.keys():
|
||||
# Add corresponding dataset document.txt path
|
||||
document_path = os.path.join("projects", "data", dataset_id, "datasets", key, "document.txt")
|
||||
if os.path.exists(document_path):
|
||||
result_files.append(document_path)
|
||||
|
||||
# Also add document.txt files that exist but are not in processed_files_by_key
|
||||
existing_document_paths = set(result_files) # Avoid duplicates
|
||||
for doc_file in document_files:
|
||||
if doc_file not in existing_document_paths:
|
||||
result_files.append(doc_file)
|
||||
|
||||
result = {
|
||||
"status": "success",
|
||||
"message": f"Incremental processing complete - added {len(added_files)} files, removed {len(removed_files)} files, {len(result_files)} document files remaining",
|
||||
"dataset_id": dataset_id,
|
||||
"removed_files": removed_files,
|
||||
"added_files": added_files,
|
||||
"processed_files": result_files,
|
||||
"processed_files_by_key": processed_files_by_key,
|
||||
"document_files": document_files,
|
||||
"total_files_added": sum(len(files_list) for files_list in processed_files_by_key.values()),
|
||||
"total_files_removed": len(removed_files),
|
||||
"final_files_count": len(result_files),
|
||||
"processing_time": time.time()  # NOTE: completion timestamp (epoch seconds), not an elapsed duration
|
||||
}
|
||||
|
||||
# Update task status to completed
|
||||
if task_id:
|
||||
task_status_store.update_status(
|
||||
task_id=task_id,
|
||||
status="completed",
|
||||
result=result
|
||||
)
|
||||
|
||||
print(f"Incremental file processing task completed: {dataset_id}")
|
||||
return result
|
||||
|
||||
except Exception as e:
|
||||
error_msg = f"Error during incremental file processing: {str(e)}"
|
||||
print(error_msg)
|
||||
|
||||
# Update task status to error
|
||||
if task_id:
|
||||
task_status_store.update_status(
|
||||
task_id=task_id,
|
||||
status="failed",
|
||||
error=error_msg
|
||||
)
|
||||
|
||||
return {
|
||||
"status": "error",
|
||||
"message": error_msg,
|
||||
"dataset_id": dataset_id,
|
||||
"error": str(e)
|
||||
}
|
||||
|
||||
|
||||
@huey.task()
|
||||
def cleanup_project_async(
|
||||
dataset_id: str,
|
||||
remove_all: bool = False
|
||||
) -> Dict[str, Any]:
|
||||
"""
|
||||
Asynchronously clean up project files.
|
||||
|
||||
Args:
|
||||
dataset_id: Unique project ID
|
||||
remove_all: Whether to remove the entire project directory
|
||||
|
||||
Returns:
|
||||
Cleanup result dictionary
|
||||
"""
|
||||
try:
|
||||
print(f"Starting async project cleanup, project ID: {dataset_id}")
|
||||
|
||||
project_dir = os.path.join("projects", "data", dataset_id)
|
||||
removed_items = []
|
||||
|
||||
if remove_all and os.path.exists(project_dir):
|
||||
import shutil
|
||||
shutil.rmtree(project_dir)
|
||||
removed_items.append(project_dir)
|
||||
result = {
|
||||
"status": "success",
|
||||
"message": f"Deleted entire project directory: {project_dir}",
|
||||
"dataset_id": dataset_id,
|
||||
"removed_items": removed_items,
|
||||
"action": "remove_all"
|
||||
}
|
||||
else:
|
||||
# Only clean processing log
|
||||
log_file = os.path.join(project_dir, "processed_files.json")
|
||||
if os.path.exists(log_file):
|
||||
os.remove(log_file)
|
||||
removed_items.append(log_file)
|
||||
|
||||
result = {
|
||||
"status": "success",
|
||||
"message": f"Cleaned project processing log, project ID: {dataset_id}",
|
||||
"dataset_id": dataset_id,
|
||||
"removed_items": removed_items,
|
||||
"action": "cleanup_logs"
|
||||
}
|
||||
|
||||
print(f"Async cleanup task completed: {dataset_id}")
|
||||
return result
|
||||
|
||||
except Exception as e:
|
||||
error_msg = f"Error during async project cleanup: {str(e)}"
|
||||
print(error_msg)
|
||||
return {
|
||||
"status": "error",
|
||||
"message": error_msg,
|
||||
"dataset_id": dataset_id,
|
||||
"error": str(e)
|
||||
}
|
||||
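Both processing tasks in this file call an async download helper from a synchronous Huey task by fetching or creating an event loop and calling `run_until_complete`, which leaves any newly created loop open. Assuming no event loop is already running in the worker thread, `asyncio.run` is a simpler way to do the same thing. In the sketch below, `download_files` is a hypothetical stand-in for `download_dataset_files`, and `process_sync` is an illustrative name.

```python
import asyncio
from typing import Dict, List


async def download_files(dataset_id: str, files: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Stand-in coroutine: pretends to download each file and returns local paths by key."""
    await asyncio.sleep(0)  # yield to the loop, as a real download would
    return {
        key: [f"{dataset_id}/{key}/{path.rsplit('/', 1)[-1]}" for path in paths]
        for key, paths in files.items()
    }


def process_sync(dataset_id: str, files: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Synchronous entry point, as a Huey task body would be."""
    # asyncio.run creates a fresh loop, runs the coroutine, and closes the loop.
    return asyncio.run(download_files(dataset_id, files))


result = process_sync("demo", {"docs": ["https://example.com/a.txt"]})
print(result)
```

`asyncio.run` raises `RuntimeError` if a loop is already running in the current thread, so this form would not suit a worker that runs tasks inside an existing loop.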
@ -1,228 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Queue manager for handling file processing queues.
|
||||
"""
|
||||
|
||||
import os
|
||||
import json
|
||||
import time
|
||||
import logging
|
||||
from typing import Dict, List, Optional, Any
|
||||
from huey import Huey
|
||||
from huey.api import Task
|
||||
from datetime import datetime, timedelta
|
||||
|
||||
# Configure logging
|
||||
logger = logging.getLogger('app')
|
||||
|
||||
from .config import huey
|
||||
from .tasks import process_file_async, process_multiple_files_async, process_zip_file_async, cleanup_processed_files
|
||||
|
||||
|
||||
class QueueManager:
|
||||
"""Queue manager for file processing tasks."""
|
||||
|
||||
def __init__(self):
|
||||
self.huey = huey
|
||||
logger.info(f"Queue manager initialized with database: {os.path.join(os.path.dirname(__file__), '..', 'projects', 'queue_data', 'huey.db')}")
|
||||
|
||||
def enqueue_file(
|
||||
self,
|
||||
project_id: str,
|
||||
file_path: str,
|
||||
original_filename: str = None,
|
||||
delay: int = 0
|
||||
) -> str:
|
||||
"""
|
||||
Add a file to the processing queue.
|
||||
|
||||
Args:
|
||||
project_id: Project ID
|
||||
file_path: File path
|
||||
original_filename: Original filename
|
||||
delay: Delay before execution in seconds
|
||||
|
||||
Returns:
|
||||
Task ID
|
||||
"""
|
||||
if delay > 0:
|
||||
task = process_file_async.schedule(
|
||||
args=(project_id, file_path, original_filename),
|
||||
delay=timedelta(seconds=delay)
|
||||
)
|
||||
else:
|
||||
task = process_file_async(project_id, file_path, original_filename)
|
||||
|
||||
logger.info(f"File queued for processing: {file_path}, task ID: {task.id}")
|
||||
return task.id
|
||||
|
||||
def enqueue_multiple_files(
|
||||
self,
|
||||
project_id: str,
|
||||
file_paths: List[str],
|
||||
original_filenames: List[str] = None,
|
||||
delay: int = 0
|
||||
) -> List[str]:
|
||||
"""
|
||||
Add multiple files to the processing queue.
|
||||
|
||||
Args:
|
||||
project_id: Project ID
|
||||
file_paths: List of file paths
|
||||
original_filenames: List of original filenames
|
||||
delay: Delay before execution in seconds
|
||||
|
||||
Returns:
|
||||
List of task IDs
|
||||
"""
|
||||
if delay > 0:
|
||||
task = process_multiple_files_async.schedule(
|
||||
args=(project_id, file_paths, original_filenames),
|
||||
delay=timedelta(seconds=delay)
|
||||
)
|
||||
else:
|
||||
task = process_multiple_files_async(project_id, file_paths, original_filenames)
|
||||
|
||||
logger.info(f"Batch files queued for processing: {len(file_paths)} files, task ID: {task.id}")
|
||||
return [task.id]
|
||||
|
||||
def enqueue_zip_file(
|
||||
self,
|
||||
project_id: str,
|
||||
zip_path: str,
|
||||
extract_to: str = None,
|
||||
delay: int = 0
|
||||
) -> str:
|
||||
"""
|
||||
Add a zip file to the processing queue.
|
||||
|
||||
Args:
|
||||
project_id: Project ID
|
||||
zip_path: Path to the zip file
|
||||
extract_to: Extraction target directory
|
||||
delay: Delay before execution in seconds
|
||||
|
||||
Returns:
|
||||
Task ID
|
||||
"""
|
||||
if delay > 0:
|
||||
task = process_zip_file_async.schedule(
|
||||
args=(project_id, zip_path, extract_to),
|
||||
delay=timedelta(seconds=delay)
|
||||
)
|
||||
else:
|
||||
task = process_zip_file_async(project_id, zip_path, extract_to)
|
||||
|
||||
logger.info(f"Zip file queued for processing: {zip_path}, task ID: {task.id}")
|
||||
return task.id
|
||||
|
||||
def get_task_status(self, task_id: str) -> Dict[str, Any]:
|
||||
"""
|
||||
Get task status.
|
||||
|
||||
Args:
|
||||
task_id: Task ID
|
||||
|
||||
Returns:
|
||||
Task status information
|
||||
"""
|
||||
try:
|
||||
# Try getting the task result from result storage
|
||||
try:
|
||||
# Use Huey's built-in result lookup when available
|
||||
if hasattr(self.huey, 'result') and self.huey.result:
|
||||
result = self.huey.result(task_id, preserve=True)  # preserve=True keeps the stored result so repeated status checks still see it
|
||||
if result is not None:
|
||||
return {
|
||||
"task_id": task_id,
|
||||
"status": "complete",
|
||||
"result": result
|
||||
}
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
# Check whether the task is in the pending queue
|
||||
try:
|
||||
pending_tasks = list(self.huey.pending())
|
||||
for task in pending_tasks:
|
||||
if hasattr(task, 'id') and task.id == task_id:
|
||||
return {
|
||||
"task_id": task_id,
|
||||
"status": "pending"
|
||||
}
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
# Check whether the task is in the scheduled queue
|
||||
try:
|
||||
scheduled_tasks = list(self.huey.scheduled())
|
||||
for task in scheduled_tasks:
|
||||
if hasattr(task, 'id') and task.id == task_id:
|
||||
return {
|
||||
"task_id": task_id,
|
||||
"status": "scheduled"
|
||||
}
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
# If not found anywhere, it may not exist or may have completed with cleaned results
|
||||
return {
|
||||
"task_id": task_id,
|
||||
"status": "unknown",
|
||||
"message": "Task status is unknown; it may already be complete or may not exist"
|
||||
}
|
||||
|
||||
except Exception as e:
|
||||
return {
|
||||
"task_id": task_id,
|
||||
"status": "error",
|
||||
"message": f"Failed to get task status: {str(e)}"
|
||||
}
|
||||
|
||||
def get_queue_stats(self) -> Dict[str, Any]:
|
||||
"""
|
||||
Get queue statistics.
|
||||
|
||||
Returns:
|
||||
Queue statistics information
|
||||
"""
|
||||
try:
|
||||
# Use a simplified approach for queue statistics
|
||||
stats = {
|
||||
"total_tasks": 0,
|
||||
"pending_tasks": 0,
|
||||
"running_tasks": 0,
|
||||
"completed_tasks": 0,
|
||||
"error_tasks": 0,
|
||||
"scheduled_tasks": 0,
|
||||
"recent_tasks": [],
|
||||
"queue_database": os.path.join(os.path.dirname(__file__), '..', 'projects', 'queue_data', 'huey.db')
|
||||
}
|
||||
|
||||
# Try to get the number of pending tasks
|
||||
try:
|
||||
pending_tasks = list(self.huey.pending())
|
||||
stats["pending_tasks"] = len(pending_tasks)
|
||||
stats["total_tasks"] += len(pending_tasks)
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to get pending tasks: {e}")
|
||||
|
||||
# Try to get the number of scheduled tasks
|
||||
try:
|
||||
scheduled_tasks = list(self.huey.scheduled())
|
||||
stats["scheduled_tasks"] = len(scheduled_tasks)
|
||||
stats["total_tasks"] += len(scheduled_tasks)
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to get scheduled tasks: {e}")
|
||||
|
||||
return stats
|
||||
|
||||
except Exception as e:
|
||||
return {
|
||||
"error": str(e),
|
||||
"queue_database": os.path.join(os.path.dirname(__file__), '..', 'projects', 'queue_data', 'huey.db')
|
||||
}
|
||||
|
||||
|
||||
# Global singleton instance
|
||||
queue_manager = QueueManager()
|
||||
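The lookup order used by `get_task_status` above (stored result first, then the pending queue, then the scheduled queue, otherwise `unknown`) can be sketched without a running Huey instance. `FakeTask`, `FakeQueue` and `lookup_status` are hypothetical stand-ins introduced for illustration; they are not part of Huey or of this repository.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class FakeTask:
    """Hypothetical task handle exposing only an id."""
    id: str


@dataclass
class FakeQueue:
    """Hypothetical in-memory stand-in for the queue backend."""
    results: Dict[str, Any] = field(default_factory=dict)
    pending_tasks: List[FakeTask] = field(default_factory=list)
    scheduled_tasks: List[FakeTask] = field(default_factory=list)

    def result(self, task_id: str) -> Optional[Any]:
        return self.results.get(task_id)

    def pending(self) -> List[FakeTask]:
        return list(self.pending_tasks)

    def scheduled(self) -> List[FakeTask]:
        return list(self.scheduled_tasks)


def lookup_status(queue: FakeQueue, task_id: str) -> Dict[str, Any]:
    """Resolve a status in the same order as the method above."""
    # 1. A stored result means the task has finished.
    result = queue.result(task_id)
    if result is not None:
        return {"task_id": task_id, "status": "complete", "result": result}
    # 2. Otherwise look for the id among tasks waiting to run.
    if any(task.id == task_id for task in queue.pending()):
        return {"task_id": task_id, "status": "pending"}
    # 3. Then among tasks scheduled for a later time.
    if any(task.id == task_id for task in queue.scheduled()):
        return {"task_id": task_id, "status": "scheduled"}
    # 4. Not found anywhere: finished with cleaned results, or never existed.
    return {"task_id": task_id, "status": "unknown"}
```

One consequence of this order, visible in the sketch as well as in the original method, is that a task whose return value is `None` cannot be told apart from a task with no stored result, so it ends up reported as `unknown` once it leaves the queues.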
@ -1,286 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Optimized queue consumer with integrated performance monitoring.
|
||||
"""
|
||||
|
||||
import sys
|
||||
import os
|
||||
import time
|
||||
import signal
|
||||
import argparse
|
||||
import multiprocessing
|
||||
import logging
|
||||
from pathlib import Path
|
||||
from concurrent.futures import ThreadPoolExecutor
|
||||
import threading
|
||||
|
||||
# Configure logging
|
||||
logger = logging.getLogger('app')
|
||||
|
||||
# Add project root directory to Python path
|
||||
project_root = Path(__file__).parent.parent
|
||||
sys.path.insert(0, str(project_root))
|
||||
|
||||
from task_queue.config import huey
|
||||
from task_queue.manager import queue_manager
|
||||
from task_queue.integration_tasks import process_files_async, cleanup_project_async
|
||||
from huey.consumer import Consumer
|
||||
|
||||
|
||||
class OptimizedQueueConsumer:
|
||||
"""Optimized queue consumer with integrated performance monitoring."""
|
||||
|
||||
def __init__(self, worker_type: str = "threads", workers: int = 2):
|
||||
self.huey = huey
|
||||
self.worker_type = worker_type
|
||||
self.workers = workers
|
||||
self.running = False
|
||||
self.consumer = None
|
||||
self.processed_count = 0
|
||||
self.start_time = None
|
||||
|
||||
# Performance monitoring
|
||||
self.performance_stats = {
|
||||
'tasks_processed': 0,
|
||||
'tasks_failed': 0,
|
||||
'avg_processing_time': 0,
|
||||
'start_time': None,
|
||||
'last_activity': None
|
||||
}
|
||||
|
||||
# Register signal handlers
|
||||
signal.signal(signal.SIGINT, self._signal_handler)
|
||||
signal.signal(signal.SIGTERM, self._signal_handler)
|
||||
|
||||
def _signal_handler(self, signum, frame):
|
||||
"""Signal handler for graceful shutdown."""
|
||||
logger.info(f"\nReceived signal {signum}, shutting down queue consumer...")
|
||||
self.running = False
|
||||
if self.consumer:
|
||||
self.consumer.stop()
|
||||
|
||||
def setup_optimizations(self):
|
||||
"""Set up performance optimizations."""
|
||||
# Set environment variables
|
||||
env_vars = {
|
||||
'PYTHONUNBUFFERED': '1',
|
||||
'PYTHONDONTWRITEBYTECODE': '1',
|
||||
}
|
||||
|
||||
for key, value in env_vars.items():
|
||||
os.environ[key] = value
|
||||
|
||||
# Optimize huey configuration
|
||||
if hasattr(huey, 'immediate'):
|
||||
huey.immediate = False
|
||||
|
||||
# Adjust based on worker type
|
||||
if self.worker_type == "threads":
|
||||
# Thread pool optimization
|
||||
if hasattr(huey, 'worker_type'):
|
||||
huey.worker_type = 'threads'
|
||||
|
||||
# Set thread pool size
|
||||
if hasattr(huey, 'always_eager'):
|
||||
huey.always_eager = False
|
||||
|
||||
logger.info("Queue consumer optimization setup complete:")
|
||||
logger.info(f"- Worker type: {self.worker_type}")
|
||||
logger.info(f"- Worker count: {self.workers}")
|
||||
|
||||
def monitor_performance(self):
|
||||
"""Performance monitoring thread."""
|
||||
while self.running:
|
||||
time.sleep(30) # Output statistics every 30 seconds
|
||||
|
||||
if self.start_time:
|
||||
elapsed = time.time() - self.start_time
|
||||
rate = self.performance_stats['tasks_processed'] / max(1, elapsed)
|
||||
|
||||
logger.info(f"\n[Performance Stats]")
|
||||
logger.info(f"- Uptime: {elapsed:.1f}s")
|
||||
logger.info(f"- Tasks processed: {self.performance_stats['tasks_processed']}")
|
||||
logger.info(f"- Failed tasks: {self.performance_stats['tasks_failed']}")
|
||||
logger.info(f"- Average processing rate: {rate:.2f} tasks/s")
|
||||
|
||||
if self.performance_stats['avg_processing_time'] > 0:
|
||||
logger.info(f"- Average processing time: {self.performance_stats['avg_processing_time']:.2f}s")
|
||||
|
||||
def start(self):
|
||||
"""Start the queue consumer."""
|
||||
logger.info("=" * 60)
|
||||
logger.info("Optimized queue consumer starting")
|
||||
logger.info("=" * 60)
|
||||
|
||||
# Apply optimizations
|
||||
self.setup_optimizations()
|
||||
|
||||
logger.info(f"Database: {os.path.join(os.path.dirname(__file__), '..', 'projects', 'queue_data', 'huey.db')}")
|
||||
logger.info("Press Ctrl+C to stop the consumer")
|
||||
|
||||
self.running = True
|
||||
self.start_time = time.time()
|
||||
self.performance_stats['start_time'] = self.start_time
|
||||
|
||||
# Start performance monitoring thread
|
||||
monitor_thread = threading.Thread(target=self.monitor_performance, daemon=True)
|
||||
monitor_thread.start()
|
||||
|
||||
try:
|
||||
# Create consumer
|
||||
self.consumer = Consumer(
|
||||
self.huey,
|
||||
workers=self.workers,
|
||||
worker_type=self.worker_type,
|
||||
max_delay=60.0, # Maximum delay
|
||||
check_delay=1.0, # Check interval
|
||||
periodic=True, # Enable periodic tasks
|
||||
)
|
||||
|
||||
logger.info("Queue consumer started, waiting for tasks...")
|
||||
|
||||
# Start the consumer
|
||||
self.consumer.run()
|
||||
|
||||
except KeyboardInterrupt:
|
||||
logger.info("\nReceived keyboard interrupt signal")
|
||||
except Exception as e:
|
||||
logger.error(f"Queue consumer runtime error: {e}")
|
||||
import traceback
|
||||
traceback.print_exc()
|
||||
finally:
|
||||
self.shutdown()
|
||||
|
||||
def shutdown(self):
|
||||
"""Shut down the queue consumer."""
|
||||
logger.info("\nShutting down queue consumer...")
|
||||
self.running = False
|
||||
|
||||
if self.consumer:
|
||||
try:
|
||||
self.consumer.stop()
|
||||
logger.info("Queue consumer stopped")
|
||||
except Exception as e:
|
||||
logger.error(f"Error stopping queue consumer: {e}")
|
||||
|
||||
# Output final statistics
|
||||
if self.start_time:
|
||||
elapsed = time.time() - self.start_time
|
||||
logger.info(f"\n[Final Stats]")
|
||||
logger.info(f"- Total uptime: {elapsed:.1f}s")
|
||||
logger.info(f"- Total tasks processed: {self.performance_stats['tasks_processed']}")
|
||||
logger.info(f"- Total failed tasks: {self.performance_stats['tasks_failed']}")
|
||||
|
||||
if self.performance_stats['tasks_processed'] > 0:
|
||||
rate = self.performance_stats['tasks_processed'] / elapsed
|
||||
logger.info(f"- Average processing rate: {rate:.2f} tasks/s")
|
||||
|
||||
|
||||
def calculate_optimal_workers():
    """Calculate the optimal number of worker threads."""
    cpu_count = multiprocessing.cpu_count()

    # Based on CPU core count and system resources
    if cpu_count <= 2:
        return 2
    elif cpu_count <= 4:
        return 4
    else:
        return min(8, cpu_count)
|
||||
|
||||
|
||||
def check_queue_status():
    """Check queue status."""
    try:
        stats = queue_manager.get_queue_stats()

        logger.info("\n[Queue Status]")
        if isinstance(stats, dict):
            if 'total_tasks' in stats:
                logger.info(f"- Total tasks: {stats['total_tasks']}")
            if 'pending_tasks' in stats:
                logger.info(f"- Pending tasks: {stats['pending_tasks']}")
            if 'scheduled_tasks' in stats:
                logger.info(f"- Scheduled tasks: {stats['scheduled_tasks']}")

        # Check database file
        db_path = os.path.join(os.path.dirname(__file__), '..', 'projects', 'queue_data', 'huey.db')
        if os.path.exists(db_path):
            size = os.path.getsize(db_path)
            logger.info(f"- Database size: {size} bytes")
        else:
            logger.info("- Database file: not found")

    except Exception as e:
        logger.error(f"Failed to get queue status: {e}")
|
||||
|
||||
|
||||
def main():
|
||||
"""Main entry point."""
|
||||
parser = argparse.ArgumentParser(description="Optimized queue consumer")
|
||||
parser.add_argument(
|
||||
"--workers",
|
||||
type=int,
|
||||
default=calculate_optimal_workers(),
|
||||
help=f"Number of worker threads (default: {calculate_optimal_workers()})"
|
||||
)
|
||||
parser.add_argument(
|
||||
"--worker-type",
|
||||
type=str,
|
||||
default="threads",
|
||||
choices=["threads", "greenlets", "gevent"],
|
||||
help="Worker type (default: threads)"
|
||||
)
|
||||
parser.add_argument(
|
||||
"--check-status",
|
||||
action="store_true",
|
||||
help="Check queue status and exit"
|
||||
)
|
||||
parser.add_argument(
|
||||
"--profile",
|
||||
type=str,
|
||||
default="balanced",
|
||||
choices=["low_memory", "balanced", "high_performance"],
|
||||
help="Performance profile"
|
||||
)
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
# Apply performance profile
|
||||
if args.profile == "low_memory":
|
||||
os.environ['PYTHONOPTIMIZE'] = '1'
|
||||
if args.workers > 2:
|
||||
args.workers = 2
|
||||
logger.info(f"Low memory mode: adjusted worker count to {args.workers}")
|
||||
elif args.profile == "high_performance":
|
||||
if args.workers < 4:
|
||||
args.workers = 4
|
||||
logger.info(f"High performance mode: adjusted worker count to {args.workers}")
|
||||
|
||||
# Check queue status
|
||||
if args.check_status:
|
||||
check_queue_status()
|
||||
return
|
||||
|
||||
# Check environment
|
||||
try:
|
||||
import psutil
|
||||
memory = psutil.virtual_memory()
|
||||
logger.info("[System Info]")
|
||||
logger.info(f"- CPU cores: {multiprocessing.cpu_count()}")
|
||||
logger.info(f"- Available memory: {memory.available / (1024**3):.1f}GB")
|
||||
logger.info(f"- Memory usage: {memory.percent:.1f}%")
|
||||
except ImportError:
|
||||
logger.info("[Tip] Install psutil to display system info: pip install psutil")
|
||||
|
||||
# Create and start the queue consumer
|
||||
consumer = OptimizedQueueConsumer(
|
||||
worker_type=args.worker_type,
|
||||
workers=args.workers
|
||||
)
|
||||
|
||||
consumer.start()
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
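The worker-count logic in the removed consumer is split between `calculate_optimal_workers` (CPU-based default) and the `--profile` handling in `main` (clamping). A minimal sketch of both steps as pure functions follows; `base_workers` and `apply_profile` are hypothetical names introduced here so the rules can be checked without touching `multiprocessing` or `argparse`.

```python
def base_workers(cpu_count: int) -> int:
    """Default worker count derived from the number of CPU cores."""
    if cpu_count <= 2:
        return 2
    if cpu_count <= 4:
        return 4
    # Never go above 8 workers, even on larger machines.
    return min(8, cpu_count)


def apply_profile(workers: int, profile: str) -> int:
    """Clamp a worker count according to the selected performance profile."""
    if profile == "low_memory":
        # Cap at 2 workers to limit memory use.
        return min(workers, 2)
    if profile == "high_performance":
        # Guarantee at least 4 workers.
        return max(workers, 4)
    # "balanced" (and anything else) leaves the value untouched.
    return workers


if __name__ == "__main__":
    for cores in (1, 4, 6, 32):
        for profile in ("low_memory", "balanced", "high_performance"):
            print(cores, profile, apply_profile(base_workers(cores), profile))
```

Keeping the two rules as side-effect-free functions makes it easy to see that an explicit `--workers` value is still overridden by the profile, which matches the order of operations in `main`.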
@ -1,210 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Task status SQLite storage system.
|
||||
"""
|
||||
|
||||
import json
|
||||
import os
|
||||
import sqlite3
|
||||
import time
|
||||
from typing import Dict, Optional, Any, List
|
||||
from pathlib import Path
|
||||
|
||||
|
||||
class TaskStatusStore:
|
||||
"""SQLite-based task status store."""
|
||||
|
||||
def __init__(self, db_path: str = "projects/queue_data/task_status.db"):
|
||||
self.db_path = db_path
|
||||
# Ensure directory exists
|
||||
Path(db_path).parent.mkdir(parents=True, exist_ok=True)
|
||||
self._init_database()
|
||||
|
||||
def _init_database(self):
|
||||
"""Initialize database tables."""
|
||||
with sqlite3.connect(self.db_path) as conn:
|
||||
conn.execute('''
|
||||
CREATE TABLE IF NOT EXISTS task_status (
|
||||
task_id TEXT PRIMARY KEY,
|
||||
unique_id TEXT NOT NULL,
|
||||
status TEXT NOT NULL,
|
||||
created_at REAL NOT NULL,
|
||||
updated_at REAL NOT NULL,
|
||||
result TEXT,
|
||||
error TEXT
|
||||
)
|
||||
''')
|
||||
conn.commit()
|
||||
|
||||
def set_status(self, task_id: str, unique_id: str, status: str,
|
||||
result: Optional[Dict] = None, error: Optional[str] = None):
|
||||
"""Set task status."""
|
||||
current_time = time.time()
|
||||
|
||||
with sqlite3.connect(self.db_path) as conn:
|
||||
conn.execute('''
|
||||
INSERT OR REPLACE INTO task_status
|
||||
(task_id, unique_id, status, created_at, updated_at, result, error)
|
||||
VALUES (?, ?, ?, ?, ?, ?, ?)
|
||||
''', (
|
||||
task_id, unique_id, status, current_time, current_time,
|
||||
json.dumps(result) if result else None,
|
||||
error
|
||||
))
|
||||
conn.commit()
|
||||
|
||||
def get_status(self, task_id: str) -> Optional[Dict]:
|
||||
"""Get task status."""
|
||||
with sqlite3.connect(self.db_path) as conn:
|
||||
conn.row_factory = sqlite3.Row
|
||||
cursor = conn.execute(
|
||||
'SELECT * FROM task_status WHERE task_id = ?', (task_id,)
|
||||
)
|
||||
row = cursor.fetchone()
|
||||
|
||||
if not row:
|
||||
return None
|
||||
|
||||
result = dict(row)
|
||||
# Parse JSON field
|
||||
if result['result']:
|
||||
result['result'] = json.loads(result['result'])
|
||||
|
||||
return result
|
||||
|
||||
def update_status(self, task_id: str, status: str,
|
||||
result: Optional[Dict] = None, error: Optional[str] = None):
|
||||
"""Update task status."""
|
||||
with sqlite3.connect(self.db_path) as conn:
|
||||
# Check if task exists
|
||||
cursor = conn.execute(
|
||||
'SELECT task_id FROM task_status WHERE task_id = ?', (task_id,)
|
||||
)
|
||||
if not cursor.fetchone():
|
||||
return False
|
||||
|
||||
# Update status
|
||||
conn.execute('''
|
||||
UPDATE task_status
|
||||
SET status = ?, updated_at = ?, result = ?, error = ?
|
||||
WHERE task_id = ?
|
||||
''', (
|
||||
status, time.time(),
|
||||
json.dumps(result) if result else None,
|
||||
error, task_id
|
||||
))
|
||||
conn.commit()
|
||||
return True
|
||||
|
||||
def delete_status(self, task_id: str):
|
||||
"""Delete task status."""
|
||||
with sqlite3.connect(self.db_path) as conn:
|
||||
cursor = conn.execute(
|
||||
'DELETE FROM task_status WHERE task_id = ?', (task_id,)
|
||||
)
|
||||
conn.commit()
|
||||
return cursor.rowcount > 0
|
||||
|
||||
def list_all(self) -> Dict[str, Dict]:
|
||||
"""List all task statuses."""
|
||||
with sqlite3.connect(self.db_path) as conn:
|
||||
conn.row_factory = sqlite3.Row
|
||||
cursor = conn.execute(
|
||||
'SELECT * FROM task_status ORDER BY updated_at DESC'
|
||||
)
|
||||
all_tasks = {}
|
||||
for row in cursor:
|
||||
result = dict(row)
|
||||
# Parse JSON field
|
||||
if result['result']:
|
||||
result['result'] = json.loads(result['result'])
|
||||
all_tasks[result['task_id']] = result
|
||||
return all_tasks
|
||||
|
||||
def get_by_unique_id(self, unique_id: str) -> List[Dict]:
|
||||
"""Get all tasks for a given project ID."""
|
||||
with sqlite3.connect(self.db_path) as conn:
|
||||
conn.row_factory = sqlite3.Row
|
||||
cursor = conn.execute(
|
||||
'SELECT * FROM task_status WHERE unique_id = ? ORDER BY updated_at DESC',
|
||||
(unique_id,)
|
||||
)
|
||||
tasks = []
|
||||
for row in cursor:
|
||||
result = dict(row)
|
||||
if result['result']:
|
||||
result['result'] = json.loads(result['result'])
|
||||
tasks.append(result)
|
||||
return tasks
|
||||
|
||||
def cleanup_old_tasks(self, older_than_days: int = 7) -> int:
|
||||
"""Clean up old task records."""
|
||||
cutoff_time = time.time() - (older_than_days * 24 * 3600)
|
||||
|
||||
with sqlite3.connect(self.db_path) as conn:
|
||||
cursor = conn.execute(
|
||||
'DELETE FROM task_status WHERE updated_at < ?',
|
||||
(cutoff_time,)
|
||||
)
|
||||
conn.commit()
|
||||
return cursor.rowcount
|
||||
|
||||
def get_statistics(self) -> Dict[str, Any]:
|
||||
"""Get task statistics."""
|
||||
with sqlite3.connect(self.db_path) as conn:
|
||||
# Total tasks
|
||||
total = conn.execute('SELECT COUNT(*) FROM task_status').fetchone()[0]
|
||||
|
||||
# Status breakdown
|
||||
status_stats = conn.execute('''
|
||||
SELECT status, COUNT(*) as count
|
||||
FROM task_status
|
||||
GROUP BY status
|
||||
''').fetchall()
|
||||
|
||||
# Tasks in the last 24 hours
|
||||
recent = time.time() - (24 * 3600)
|
||||
recent_tasks = conn.execute(
|
||||
'SELECT COUNT(*) FROM task_status WHERE updated_at > ?',
|
||||
(recent,)
|
||||
).fetchone()[0]
|
||||
|
||||
return {
|
||||
'total_tasks': total,
|
||||
'status_breakdown': dict(status_stats),
|
||||
'recent_24h': recent_tasks,
|
||||
'database_path': self.db_path
|
||||
}
|
||||
|
||||
def search_tasks(self, status: Optional[str] = None,
|
||||
unique_id: Optional[str] = None,
|
||||
limit: int = 100) -> List[Dict]:
|
||||
"""Search tasks."""
|
||||
query = 'SELECT * FROM task_status WHERE 1=1'
|
||||
params = []
|
||||
|
||||
if status:
|
||||
query += ' AND status = ?'
|
||||
params.append(status)
|
||||
|
||||
if unique_id:
|
||||
query += ' AND unique_id = ?'
|
||||
params.append(unique_id)
|
||||
|
||||
query += ' ORDER BY updated_at DESC LIMIT ?'
|
||||
params.append(limit)
|
||||
|
||||
with sqlite3.connect(self.db_path) as conn:
|
||||
conn.row_factory = sqlite3.Row
|
||||
cursor = conn.execute(query, params)
|
||||
tasks = []
|
||||
for row in cursor:
|
||||
result = dict(row)
|
||||
if result['result']:
|
||||
result['result'] = json.loads(result['result'])
|
||||
tasks.append(result)
|
||||
return tasks
|
||||
|
||||
|
||||
# Global task status store instance
|
||||
task_status_store = TaskStatusStore()
|
||||
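`set_status` above relies on `INSERT OR REPLACE`, which rewrites the whole row, so calling it again for an existing `task_id` also resets `created_at`. If preserving the creation time matters, one alternative is SQLite's upsert clause. The sketch below assumes SQLite 3.24 or newer (where `ON CONFLICT ... DO UPDATE` is available); `upsert_status` and the `now` parameter are hypothetical and not part of the removed module.

```python
import json
import sqlite3
import time
from typing import Dict, Optional

# Same columns as the task_status table created in _init_database.
SCHEMA = """
CREATE TABLE IF NOT EXISTS task_status (
    task_id TEXT PRIMARY KEY,
    unique_id TEXT NOT NULL,
    status TEXT NOT NULL,
    created_at REAL NOT NULL,
    updated_at REAL NOT NULL,
    result TEXT,
    error TEXT
)
"""


def upsert_status(
    conn: sqlite3.Connection,
    task_id: str,
    unique_id: str,
    status: str,
    result: Optional[Dict] = None,
    error: Optional[str] = None,
    now: Optional[float] = None,
) -> None:
    """Insert a status row, or update it in place while keeping created_at."""
    now = time.time() if now is None else now
    conn.execute(
        """
        INSERT INTO task_status
            (task_id, unique_id, status, created_at, updated_at, result, error)
        VALUES (?, ?, ?, ?, ?, ?, ?)
        ON CONFLICT(task_id) DO UPDATE SET
            unique_id = excluded.unique_id,
            status = excluded.status,
            updated_at = excluded.updated_at,
            result = excluded.result,
            error = excluded.error
        """,
        (
            task_id, unique_id, status, now, now,
            json.dumps(result) if result is not None else None,
            error,
        ),
    )
    conn.commit()
```

Because `created_at` is absent from the `DO UPDATE SET` list, the value written by the first insert survives later status changes, while `updated_at` keeps moving forward.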
@ -1,359 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
File processing tasks for the queue system.
|
||||
"""
|
||||
|
||||
import os
|
||||
import json
|
||||
import time
|
||||
import shutil
|
||||
import logging
|
||||
from pathlib import Path
|
||||
from typing import Dict, List, Optional, Any
|
||||
from huey import crontab
|
||||
|
||||
# Configure logging
|
||||
logger = logging.getLogger('app')
|
||||
|
||||
from .config import huey
|
||||
from utils.file_utils import (
|
||||
extract_zip_file,
|
||||
get_file_hash,
|
||||
load_processed_files_log,
|
||||
save_processed_files_log,
|
||||
get_document_preview
|
||||
)
|
||||
|
||||
|
||||
@huey.task()
|
||||
def process_file_async(
|
||||
project_id: str,
|
||||
file_path: str,
|
||||
original_filename: str = None,
|
||||
target_directory: str = "files"
|
||||
) -> Dict[str, Any]:
|
||||
"""
|
||||
Asynchronously process a single file.
|
||||
|
||||
Args:
|
||||
project_id: Project ID
|
||||
file_path: File path
|
||||
original_filename: Original filename
|
||||
target_directory: Target directory
|
||||
|
||||
Returns:
|
||||
Processing result dictionary
|
||||
"""
|
||||
try:
|
||||
logger.info(f"Starting file processing: {file_path}")
|
||||
|
||||
# Ensure project directory exists
|
||||
project_dir = os.path.join("projects", project_id)
|
||||
files_dir = os.path.join(project_dir, target_directory)
|
||||
os.makedirs(files_dir, exist_ok=True)
|
||||
|
||||
# Get file hash as identifier
|
||||
file_hash = get_file_hash(file_path)
|
||||
|
||||
# Check if file has already been processed
|
||||
processed_log = load_processed_files_log(project_id)
|
||||
if file_hash in processed_log:
|
||||
logger.info(f"File already processed, skipping: {file_path}")
|
||||
return {
|
||||
"status": "skipped",
|
||||
"message": "File already processed",
|
||||
"file_hash": file_hash,
|
||||
"project_id": project_id
|
||||
}
|
||||
|
||||
# Process the file
|
||||
result = _process_single_file(
|
||||
file_path,
|
||||
files_dir,
|
||||
original_filename or os.path.basename(file_path)
|
||||
)
|
||||
|
||||
# Update processing log
|
||||
if result["status"] == "success":
|
||||
processed_log[file_hash] = {
|
||||
"original_path": file_path,
|
||||
"original_filename": original_filename or os.path.basename(file_path),
|
||||
"processed_at": str(time.time()),
|
||||
"status": "processed",
|
||||
"result": result
|
||||
}
|
||||
save_processed_files_log(project_id, processed_log)
|
||||
|
||||
result["file_hash"] = file_hash
|
||||
result["project_id"] = project_id
|
||||
|
||||
logger.info(f"File processing complete: {file_path}, status: {result['status']}")
|
||||
return result
|
||||
|
||||
except Exception as e:
|
||||
error_msg = f"Error processing file: {str(e)}"
|
||||
logger.error(error_msg)
|
||||
return {
|
||||
"status": "error",
|
||||
"message": error_msg,
|
||||
"file_path": file_path,
|
||||
"project_id": project_id
|
||||
}
|
||||
|
||||
|
||||
@huey.task()
|
||||
def process_multiple_files_async(
|
||||
project_id: str,
|
||||
file_paths: List[str],
|
||||
original_filenames: List[str] = None
|
||||
) -> List[Dict[str, Any]]:
|
||||
"""
|
||||
Asynchronously process multiple files in batch.
|
||||
|
||||
Args:
|
||||
project_id: Project ID
|
||||
file_paths: List of file paths
|
||||
original_filenames: List of original filenames
|
||||
|
||||
Returns:
|
||||
List of processing results
|
||||
"""
|
||||
try:
|
||||
logger.info(f"Starting batch processing of {len(file_paths)} files")
|
||||
|
||||
results = []
|
||||
for i, file_path in enumerate(file_paths):
|
||||
original_filename = original_filenames[i] if original_filenames and i < len(original_filenames) else None
|
||||
|
||||
# Create async task for each file
|
||||
result = process_file_async(project_id, file_path, original_filename)
|
||||
results.append(result)
|
||||
|
||||
logger.info(f"Batch file processing tasks submitted, total {len(results)} files")
|
||||
return results
|
||||
|
||||
except Exception as e:
|
||||
error_msg = f"Error during batch file processing: {str(e)}"
|
||||
logger.error(error_msg)
|
||||
return [{
|
||||
"status": "error",
|
||||
"message": error_msg,
|
||||
"project_id": project_id
|
||||
}]
|
||||
|
||||
|
||||
@huey.task()
|
||||
def process_zip_file_async(
|
||||
project_id: str,
|
||||
zip_path: str,
|
||||
extract_to: str = None
|
||||
) -> Dict[str, Any]:
|
||||
"""
|
||||
Asynchronously process a zip archive file.
|
||||
|
||||
Args:
|
||||
project_id: Project ID
|
||||
zip_path: Zip file path
|
||||
extract_to: Extraction target directory
|
||||
|
||||
Returns:
|
||||
Processing result dictionary
|
||||
"""
|
||||
try:
|
||||
logger.info(f"Starting zip file processing: {zip_path}")
|
||||
|
||||
# Set extraction directory
|
||||
if extract_to is None:
|
||||
extract_to = os.path.join("projects", project_id, "extracted", os.path.basename(zip_path))
|
||||
|
||||
os.makedirs(extract_to, exist_ok=True)
|
||||
|
||||
# Extract files
|
||||
extracted_files = extract_zip_file(zip_path, extract_to)
|
||||
|
||||
if not extracted_files:
|
||||
return {
|
||||
"status": "error",
|
||||
"message": "Extraction failed or no supported files found",
|
||||
"zip_path": zip_path,
|
||||
"project_id": project_id
|
||||
}
|
||||
|
||||
# Batch process extracted files
|
||||
result = process_multiple_files_async(project_id, extracted_files)
|
||||
|
||||
return {
|
||||
"status": "success",
|
||||
"message": f"Zip file processing complete, extracted {len(extracted_files)} files",
|
||||
"zip_path": zip_path,
|
||||
"extract_to": extract_to,
|
||||
"extracted_files": extracted_files,
|
||||
"project_id": project_id,
|
||||
"batch_task_result": result
|
||||
}
|
||||
|
||||
except Exception as e:
|
||||
error_msg = f"Error processing zip file: {str(e)}"
|
||||
logger.error(error_msg)
|
||||
return {
|
||||
"status": "error",
|
||||
"message": error_msg,
|
||||
"zip_path": zip_path,
|
||||
"project_id": project_id
|
||||
}
|
||||
|
||||
|
||||
@huey.task()
|
||||
def cleanup_processed_files(
|
||||
project_id: str,
|
||||
older_than_days: int = 30
|
||||
) -> Dict[str, Any]:
|
||||
"""
|
||||
Clean up old processed files.
|
||||
|
||||
Args:
|
||||
project_id: Project ID
|
||||
older_than_days: Clean files older than this many days
|
||||
|
||||
Returns:
|
||||
Cleanup result dictionary
|
||||
"""
|
||||
try:
|
||||
logger.info(f"Starting cleanup of files older than {older_than_days} days in project {project_id}")
|
||||
|
||||
project_dir = os.path.join("projects", project_id)
|
||||
if not os.path.exists(project_dir):
|
||||
return {
|
||||
"status": "error",
|
||||
"message": "Project directory does not exist",
|
||||
"project_id": project_id
|
||||
}
|
||||
|
||||
current_time = time.time()
|
||||
cutoff_time = current_time - (older_than_days * 24 * 3600)
|
||||
cleaned_files = []
|
||||
|
||||
# Walk through project directory
|
||||
for root, dirs, files in os.walk(project_dir):
|
||||
for file in files:
|
||||
file_path = os.path.join(root, file)
|
||||
file_mtime = os.path.getmtime(file_path)
|
||||
|
||||
if file_mtime < cutoff_time:
|
||||
try:
|
||||
os.remove(file_path)
|
||||
cleaned_files.append(file_path)
|
||||
logger.info(f"Deleted old file: {file_path}")
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to delete file {file_path}: {str(e)}")
|
||||
|
||||
# Clean up empty directories
|
||||
for root, dirs, files in os.walk(project_dir, topdown=False):
|
||||
for dir in dirs:
|
||||
dir_path = os.path.join(root, dir)
|
||||
try:
|
||||
if not os.listdir(dir_path):
|
||||
os.rmdir(dir_path)
|
||||
logger.info(f"Deleted empty directory: {dir_path}")
|
||||
except Exception as e:
|
||||
logger.error(f"Failed to delete directory {dir_path}: {str(e)}")
|
||||
|
||||
return {
|
||||
"status": "success",
|
||||
"message": f"Cleanup complete, deleted {len(cleaned_files)} files",
|
||||
"project_id": project_id,
|
||||
"cleaned_files": cleaned_files,
|
||||
"older_than_days": older_than_days
|
||||
}
|
||||
|
||||
except Exception as e:
|
||||
error_msg = f"Error during file cleanup: {str(e)}"
|
||||
logger.error(error_msg)
|
||||
return {
|
||||
"status": "error",
|
||||
"message": error_msg,
|
||||
"project_id": project_id
|
||||
}
|
||||
|
||||
|
||||
def _process_single_file(
|
||||
file_path: str,
|
||||
target_dir: str,
|
||||
original_filename: str
|
||||
) -> Dict[str, Any]:
|
||||
"""
|
||||
Internal method for processing a single file.
|
||||
|
||||
Args:
|
||||
file_path: Source file path
|
||||
target_dir: Target directory
|
||||
original_filename: Original filename
|
||||
|
||||
Returns:
|
||||
Processing result dictionary
|
||||
"""
|
||||
try:
|
||||
# Check if file exists
|
||||
if not os.path.exists(file_path):
|
||||
return {
|
||||
"status": "error",
|
||||
"message": "Source file does not exist",
|
||||
"file_path": file_path
|
||||
}
|
||||
|
||||
# Get file info
|
||||
file_size = os.path.getsize(file_path)
|
||||
file_ext = os.path.splitext(original_filename)[1].lower()
|
||||
|
||||
# Different processing based on file type
|
||||
supported_extensions = ['.txt', '.md', '.csv', '.xlsx', '.zip']
|
||||
|
||||
if file_ext not in supported_extensions:
|
||||
return {
|
||||
"status": "error",
|
||||
"message": f"Unsupported file type: {file_ext}",
|
||||
"file_path": file_path,
|
||||
"supported_extensions": supported_extensions
|
||||
}
|
||||
|
||||
# Copy file to target directory
|
||||
target_file_path = os.path.join(target_dir, original_filename)
|
||||
|
||||
# If target file already exists, add timestamp
|
||||
if os.path.exists(target_file_path):
|
||||
name, ext = os.path.splitext(original_filename)
|
||||
timestamp = int(time.time())
|
||||
target_file_path = os.path.join(target_dir, f"{name}_{timestamp}{ext}")
|
||||
|
||||
shutil.copy2(file_path, target_file_path)
|
||||
|
||||
# Get file preview (if it's a text file)
|
||||
preview = None
|
||||
if file_ext in ['.txt', '.md']:
|
||||
preview = get_document_preview(target_file_path, max_lines=5)
|
||||
|
||||
return {
|
||||
"status": "success",
|
||||
"message": "File processed successfully",
|
||||
"original_path": file_path,
|
||||
"target_path": target_file_path,
|
||||
"file_size": file_size,
|
||||
"file_extension": file_ext,
|
||||
"preview": preview
|
||||
}
|
||||
|
||||
except Exception as e:
|
||||
return {
|
||||
"status": "error",
|
||||
"message": f"Error processing file: {str(e)}",
|
||||
"file_path": file_path
|
||||
}
|
||||
|
||||
|
||||
# Periodic task example: clean up files older than 30 days daily at 2 AM
@huey.periodic_task(crontab(hour=2, minute=0))
def daily_cleanup():
    """Daily cleanup task."""
    logger.info("Running daily cleanup task")
    # Add cleanup logic here
    return {"status": "completed", "message": "Daily cleanup task completed"}
|
||||
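`_process_single_file` above avoids overwriting an existing file by appending a Unix timestamp to the name before copying. That naming step can be isolated as shown in this sketch; `resolve_target_path` and its `now` parameter are hypothetical, added only so the behaviour can be exercised deterministically.

```python
import os
import tempfile
import time
from typing import Optional


def resolve_target_path(target_dir: str, filename: str, now: Optional[float] = None) -> str:
    """Return a destination path, adding a timestamp suffix on name collision."""
    candidate = os.path.join(target_dir, filename)
    if not os.path.exists(candidate):
        # No collision: keep the original filename.
        return candidate
    name, ext = os.path.splitext(filename)
    # Whole-second Unix timestamp, as in the original code.
    timestamp = int(time.time() if now is None else now)
    return os.path.join(target_dir, f"{name}_{timestamp}{ext}")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        first = resolve_target_path(tmp, "notes.txt")
        open(first, "w").close()
        print(first)
        print(resolve_target_path(tmp, "notes.txt"))
```

The suffix has one-second resolution, so two colliding uploads of the same filename within the same second would still map to the same path; this limitation applies to the original function as well.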
@ -13,23 +13,6 @@ from .file_utils import (
|
||||
save_processed_files_log
|
||||
)
|
||||
|
||||
from .dataset_manager import (
|
||||
download_dataset_files,
|
||||
generate_dataset_structure,
|
||||
remove_dataset_directory,
|
||||
remove_dataset_directory_by_key
|
||||
)
|
||||
|
||||
from .project_manager import (
|
||||
generate_project_readme,
|
||||
save_project_readme,
|
||||
get_project_status,
|
||||
remove_project,
|
||||
list_projects,
|
||||
get_project_stats
|
||||
)
|
||||
|
||||
|
||||
from .system_optimizer import (
|
||||
setup_system_optimizations
|
||||
)
|
||||
@ -59,11 +42,6 @@ from .api_models import (
|
||||
ProjectListResponse,
|
||||
ProjectStatsResponse,
|
||||
ProjectActionResponse,
|
||||
QueueTaskRequest,
|
||||
IncrementalTaskRequest,
|
||||
QueueTaskResponse,
|
||||
QueueStatusResponse,
|
||||
TaskStatusResponse,
|
||||
create_success_response,
|
||||
create_error_response,
|
||||
create_chat_response,
|
||||
@ -90,20 +68,6 @@ __all__ = [
|
||||
'load_processed_files_log',
|
||||
'save_processed_files_log',
|
||||
|
||||
# dataset_manager
|
||||
'download_dataset_files',
|
||||
'generate_dataset_structure',
|
||||
'remove_dataset_directory',
|
||||
'remove_dataset_directory_by_key',
|
||||
|
||||
# project_manager
|
||||
'generate_project_readme',
|
||||
'save_project_readme',
|
||||
'get_project_status',
|
||||
'remove_project',
|
||||
'list_projects',
|
||||
'get_project_stats',
|
||||
|
||||
# agent_pool
|
||||
'AgentPool',
|
||||
'get_agent_pool',
|
||||
@ -128,10 +92,6 @@ __all__ = [
|
||||
'ProjectListResponse',
|
||||
'ProjectStatsResponse',
|
||||
'ProjectActionResponse',
|
||||
'QueueTaskRequest',
|
||||
'QueueTaskResponse',
|
||||
'QueueStatusResponse',
|
||||
'TaskStatusResponse',
|
||||
'create_success_response',
|
||||
'create_error_response',
|
||||
'create_chat_response',
|
||||
|
||||
@ -270,133 +270,6 @@ def create_error_response(message: str, error_type: str = "error", **kwargs) ->
|
||||
}
|
||||
|
||||
|
||||
class QueueTaskRequest(BaseModel):
|
||||
"""Queue task request model"""
|
||||
dataset_id: str
|
||||
files: Optional[Dict[str, List[str]]] = Field(default=None, description="Files organized by key groups. Each key maps to a list of file paths (supports zip files)")
|
||||
upload_folder: Optional[Dict[str, str]] = Field(default=None, description="Upload folders organized by group names. Each key maps to a folder name. Example: {'group1': 'my_project1', 'group2': 'my_project2'}")
|
||||
priority: Optional[int] = Field(default=0, description="Task priority (higher number = higher priority)")
|
||||
delay: Optional[int] = Field(default=0, description="Delay execution by N seconds")
|
||||
|
||||
model_config = ConfigDict(extra='allow')
|
||||
|
||||
@field_validator('upload_folder', mode='before')
|
||||
@classmethod
|
||||
def validate_upload_folder(cls, v):
|
||||
"""Validate upload_folder dict format"""
|
||||
if v is None:
|
||||
return None
|
||||
if isinstance(v, dict):
|
||||
# Validate dict format
|
||||
for key, value in v.items():
|
||||
if not isinstance(key, str):
|
||||
raise ValueError(f"Key in upload_folder dict must be string, got {type(key)}")
|
||||
if not isinstance(value, str):
|
||||
raise ValueError(f"Value in upload_folder dict must be string (folder name), got {type(value)} for key '{key}'")
|
||||
return v
|
||||
else:
|
||||
raise ValueError(f"upload_folder must be a dict with group names as keys and folder names as values, got {type(v)}")
|
||||
|
||||
@field_validator('files', mode='before')
|
||||
@classmethod
|
||||
def validate_files(cls, v):
|
||||
"""Validate dict format with key-grouped files"""
|
||||
if v is None:
|
||||
return None
|
||||
if isinstance(v, dict):
|
||||
# Validate dict format
|
||||
for key, value in v.items():
|
||||
if not isinstance(key, str):
|
||||
raise ValueError(f"Key in files dict must be string, got {type(key)}")
|
||||
if not isinstance(value, list):
|
||||
raise ValueError(f"Value in files dict must be list, got {type(value)} for key '{key}'")
|
||||
for item in value:
|
||||
if not isinstance(item, str):
|
||||
raise ValueError(f"File paths must be strings, got {type(item)} in key '{key}'")
|
||||
return v
|
||||
else:
|
||||
raise ValueError(f"Files must be a dict with key groups, got {type(v)}")
|
||||
|
||||
|
||||
class IncrementalTaskRequest(BaseModel):
|
||||
"""Incremental file processing request model"""
|
||||
dataset_id: str = Field(..., description="Dataset ID for the project")
|
||||
files_to_add: Optional[Dict[str, List[str]]] = Field(default=None, description="Files to add organized by key groups")
|
||||
files_to_remove: Optional[Dict[str, List[str]]] = Field(default=None, description="Files to remove organized by key groups")
|
||||
system_prompt: Optional[str] = None
|
||||
mcp_settings: Optional[List[Dict]] = None
|
||||
priority: Optional[int] = Field(default=0, description="Task priority (higher number = higher priority)")
|
||||
delay: Optional[int] = Field(default=0, description="Delay execution by N seconds")
|
||||
|
||||
model_config = ConfigDict(extra='allow')
|
||||
|
||||
@field_validator('files_to_add', mode='before')
|
||||
@classmethod
|
||||
def validate_files_to_add(cls, v):
|
||||
"""Validate files_to_add dict format"""
|
||||
if v is None:
|
||||
return None
|
||||
if isinstance(v, dict):
|
||||
for key, value in v.items():
|
||||
if not isinstance(key, str):
|
||||
raise ValueError(f"Key in files_to_add dict must be string, got {type(key)}")
|
||||
if not isinstance(value, list):
|
||||
raise ValueError(f"Value in files_to_add dict must be list, got {type(value)} for key '{key}'")
|
||||
for item in value:
|
||||
if not isinstance(item, str):
|
||||
raise ValueError(f"File paths must be strings, got {type(item)} in key '{key}'")
|
||||
return v
|
||||
else:
|
||||
raise ValueError(f"files_to_add must be a dict with key groups, got {type(v)}")
|
||||
|
||||
@field_validator('files_to_remove', mode='before')
|
||||
@classmethod
|
||||
def validate_files_to_remove(cls, v):
|
||||
"""Validate files_to_remove dict format"""
|
||||
if v is None:
|
||||
return None
|
||||
if isinstance(v, dict):
|
||||
for key, value in v.items():
|
||||
if not isinstance(key, str):
|
||||
raise ValueError(f"Key in files_to_remove dict must be string, got {type(key)}")
|
||||
if not isinstance(value, list):
|
||||
raise ValueError(f"Value in files_to_remove dict must be list, got {type(value)} for key '{key}'")
|
||||
for item in value:
|
||||
if not isinstance(item, str):
|
||||
raise ValueError(f"File paths must be strings, got {type(item)} in key '{key}'")
|
||||
return v
|
||||
else:
|
||||
raise ValueError(f"files_to_remove must be a dict with key groups, got {type(v)}")
|
||||
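The validators removed above (`validate_files`, `validate_files_to_add`, `validate_files_to_remove`) all perform the same check: the value is either `None` or a dict mapping string keys to lists of string paths. A minimal, dependency-free sketch of a shared helper is shown below. The name `validate_file_groups` and its `field_name` parameter are hypothetical and do not appear in the removed module; the error messages only approximate the originals.

```python
from typing import Dict, List, Optional


def validate_file_groups(value: object, field_name: str) -> Optional[Dict[str, List[str]]]:
    """Return value unchanged if it is None or a dict of str -> list[str]; raise otherwise.

    Hypothetical helper: mirrors the checks of the removed validators.
    """
    if value is None:
        return None
    if not isinstance(value, dict):
        raise ValueError(f"{field_name} must be a dict with key groups, got {type(value)}")
    for key, paths in value.items():
        if not isinstance(key, str):
            raise ValueError(f"Key in {field_name} dict must be string, got {type(key)}")
        if not isinstance(paths, list):
            raise ValueError(
                f"Value in {field_name} dict must be list, got {type(paths)} for key '{key}'"
            )
        for item in paths:
            if not isinstance(item, str):
                raise ValueError(f"File paths must be strings, got {type(item)} in key '{key}'")
    return value
```

Each of the field validators could then delegate to this helper with its own field name, which would keep the three error paths consistent.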
|
||||
|
||||
class QueueTaskResponse(BaseModel):
|
||||
"""Queue task response model"""
|
||||
success: bool
|
||||
message: str
|
||||
dataset_id: str
|
||||
task_id: Optional[str] = None
|
||||
task_status: Optional[str] = None
|
||||
estimated_processing_time: Optional[int] = None # seconds
|
||||
|
||||
|
||||
class QueueStatusResponse(BaseModel):
|
||||
"""Queue status response model"""
|
||||
success: bool
|
||||
message: str
|
||||
queue_stats: Dict[str, Any]
|
||||
pending_tasks: List[Dict[str, Any]]
|
||||
|
||||
|
||||
class TaskStatusResponse(BaseModel):
|
||||
"""Task status response model"""
|
||||
success: bool
|
||||
message: str
|
||||
task_id: str
|
||||
task_status: Optional[str] = None
|
||||
task_result: Optional[Dict[str, Any]] = None
|
||||
error: Optional[str] = None
|
||||
|
||||
|
||||
def create_chat_response(
|
||||
messages: List[Message],
|
||||
model: str,
|
||||
|
||||
@ -1,439 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Data merging functions for combining processed file results.
|
||||
"""
|
||||
|
||||
import os
|
||||
import pickle
|
||||
import logging
|
||||
from typing import Dict, List, Optional, Tuple
|
||||
import json
|
||||
|
||||
# Configure logger
|
||||
logger = logging.getLogger('app')
|
||||
|
||||
# Try to import numpy, but handle if missing
|
||||
try:
|
||||
import numpy as np
|
||||
NUMPY_SUPPORT = True
|
||||
except ImportError:
|
||||
logger.warning("NumPy not available, some embedding features may be limited")
|
||||
NUMPY_SUPPORT = False
|
||||
|
||||
|
||||
def merge_documents_by_group(unique_id: str, group_name: str) -> Dict:
|
||||
"""Merge all document.txt files in a group into a single document."""
|
||||
|
||||
processed_group_dir = os.path.join("projects", "data", unique_id, "processed", group_name)
|
||||
dataset_group_dir = os.path.join("projects", "data", unique_id, "datasets", group_name)
|
||||
os.makedirs(dataset_group_dir, exist_ok=True)
|
||||
|
||||
merged_document_path = os.path.join(dataset_group_dir, "document.txt")
|
||||
|
||||
result = {
|
||||
"success": False,
|
||||
"merged_document_path": merged_document_path,
|
||||
"source_files": [],
|
||||
"total_pages": 0,
|
||||
"total_characters": 0,
|
||||
"error": None
|
||||
}
|
||||
|
||||
try:
|
||||
# Find all document.txt files in the processed directory
|
||||
document_files = []
|
||||
if os.path.exists(processed_group_dir):
|
||||
for item in os.listdir(processed_group_dir):
|
||||
item_path = os.path.join(processed_group_dir, item)
|
||||
if os.path.isdir(item_path):
|
||||
document_path = os.path.join(item_path, "document.txt")
|
||||
if os.path.exists(document_path) and os.path.getsize(document_path) > 0:
|
||||
document_files.append((item, document_path))
|
||||
|
||||
if not document_files:
|
||||
result["error"] = "No document files found to merge"
|
||||
return result
|
||||
|
||||
# Merge all documents with page separators
|
||||
merged_content = []
|
||||
total_characters = 0
|
||||
|
||||
for filename_stem, document_path in sorted(document_files):
|
||||
try:
|
||||
with open(document_path, 'r', encoding='utf-8') as f:
|
||||
content = f.read().strip()
|
||||
|
||||
if content:
|
||||
merged_content.append(f"# Page {filename_stem}")
|
||||
merged_content.append(content)
|
||||
total_characters += len(content)
|
||||
result["source_files"].append(filename_stem)
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error reading document file {document_path}: {str(e)}")
|
||||
continue
|
||||
|
||||
if merged_content:
|
||||
# Write merged document
|
||||
with open(merged_document_path, 'w', encoding='utf-8') as f:
|
||||
f.write('\n\n'.join(merged_content))
|
||||
|
||||
result["total_pages"] = len(document_files)
|
||||
result["total_characters"] = total_characters
|
||||
result["success"] = True
|
||||
|
||||
else:
|
||||
result["error"] = "No valid content found in document files"
|
||||
|
||||
except Exception as e:
|
||||
result["error"] = f"Document merging failed: {str(e)}"
|
||||
logger.error(f"Error merging documents for group {group_name}: {str(e)}")
|
||||
|
||||
return result
|
||||
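The core of `merge_documents_by_group` is the joining step: per-file contents are visited in sorted order, blank ones are skipped, and each remaining one is preceded by a `# Page <stem>` header. The sketch below isolates that step on in-memory strings so it can be tried without the `projects/data/...` directory layout. The function name `merge_pages` is an assumption made for illustration, not part of the removed module.

```python
from typing import Dict


def merge_pages(pages: Dict[str, str]) -> str:
    """Join page contents in sorted key order, each preceded by a '# Page <stem>' header.

    Hypothetical helper: contents that are empty after stripping are skipped.
    """
    parts = []
    for stem in sorted(pages):
        content = pages[stem].strip()
        if content:
            parts.append(f"# Page {stem}")
            parts.append(content)
    # Headers and bodies are separated by blank lines, as in the removed function.
    return "\n\n".join(parts)
```

Note that in the removed function `total_pages` is taken from the number of discovered files, while `source_files` lists only the files that were actually merged, so the two can disagree when a file is unreadable or blank.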
|
||||
|
||||
def merge_paginations_by_group(unique_id: str, group_name: str) -> Dict:
|
||||
"""Merge all pagination.txt files in a group."""
|
||||
|
||||
processed_group_dir = os.path.join("projects", "data", unique_id, "processed", group_name)
|
||||
dataset_group_dir = os.path.join("projects", "data", unique_id, "datasets", group_name)
|
||||
os.makedirs(dataset_group_dir, exist_ok=True)
|
||||
|
||||
merged_pagination_path = os.path.join(dataset_group_dir, "pagination.txt")
|
||||
|
||||
result = {
|
||||
"success": False,
|
||||
"merged_pagination_path": merged_pagination_path,
|
||||
"source_files": [],
|
||||
"total_lines": 0,
|
||||
"error": None
|
||||
}
|
||||
|
||||
try:
|
||||
# Find all pagination.txt files
|
||||
pagination_files = []
|
||||
if os.path.exists(processed_group_dir):
|
||||
for item in os.listdir(processed_group_dir):
|
||||
item_path = os.path.join(processed_group_dir, item)
|
||||
if os.path.isdir(item_path):
|
||||
pagination_path = os.path.join(item_path, "pagination.txt")
|
||||
if os.path.exists(pagination_path) and os.path.getsize(pagination_path) > 0:
|
||||
pagination_files.append((item, pagination_path))
|
||||
|
||||
if not pagination_files:
|
||||
result["error"] = "No pagination files found to merge"
|
||||
return result
|
||||
|
||||
# Merge all pagination files
|
||||
merged_lines = []
|
||||
|
||||
for filename_stem, pagination_path in sorted(pagination_files):
|
||||
try:
|
||||
with open(pagination_path, 'r', encoding='utf-8') as f:
|
||||
lines = f.readlines()
|
||||
|
||||
for line in lines:
|
||||
line = line.strip()
|
||||
if line:
|
||||
merged_lines.append(line)
|
||||
|
||||
result["source_files"].append(filename_stem)
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error reading pagination file {pagination_path}: {str(e)}")
|
||||
continue
|
||||
|
||||
if merged_lines:
|
||||
# Write merged pagination
|
||||
with open(merged_pagination_path, 'w', encoding='utf-8') as f:
|
||||
for line in merged_lines:
|
||||
f.write(f"{line}\n")
|
||||
|
||||
result["total_lines"] = len(merged_lines)
|
||||
result["success"] = True
|
||||
|
||||
else:
|
||||
result["error"] = "No valid pagination data found"
|
||||
|
||||
except Exception as e:
|
||||
result["error"] = f"Pagination merging failed: {str(e)}"
|
||||
logger.error(f"Error merging paginations for group {group_name}: {str(e)}")
|
||||
|
||||
return result
|
||||
|
||||
|
||||
def merge_embeddings_by_group(unique_id: str, group_name: str) -> Dict:
|
||||
"""Merge all embedding.pkl files in a group."""
|
||||
|
||||
processed_group_dir = os.path.join("projects", "data", unique_id, "processed", group_name)
|
||||
dataset_group_dir = os.path.join("projects", "data", unique_id, "datasets", group_name)
|
||||
os.makedirs(dataset_group_dir, exist_ok=True)
|
||||
|
||||
merged_embedding_path = os.path.join(dataset_group_dir, "embedding.pkl")
|
||||
|
||||
result = {
|
||||
"success": False,
|
||||
"merged_embedding_path": merged_embedding_path,
|
||||
"source_files": [],
|
||||
"total_chunks": 0,
|
||||
"total_dimensions": 0,
|
||||
"error": None
|
||||
}
|
||||
|
||||
try:
|
||||
# Find all embedding.pkl files
|
||||
embedding_files = []
|
||||
if os.path.exists(processed_group_dir):
|
||||
for item in os.listdir(processed_group_dir):
|
||||
item_path = os.path.join(processed_group_dir, item)
|
||||
if os.path.isdir(item_path):
|
||||
embedding_path = os.path.join(item_path, "embedding.pkl")
|
||||
if os.path.exists(embedding_path) and os.path.getsize(embedding_path) > 0:
|
||||
embedding_files.append((item, embedding_path))
|
||||
|
||||
if not embedding_files:
|
||||
result["error"] = "No embedding files found to merge"
|
||||
return result
|
||||
|
||||
# Load and merge all embedding data
|
||||
all_chunks = []
|
||||
all_embeddings = [] # Fix: collect all embedding vectors
|
||||
total_chunks = 0
|
||||
dimensions = 0
|
||||
chunking_strategy = 'unknown'
|
||||
chunking_params = {}
|
||||
model_path = 'TaylorAI/gte-tiny'
|
||||
|
||||
for filename_stem, embedding_path in sorted(embedding_files):
|
||||
try:
|
||||
with open(embedding_path, 'rb') as f:
|
||||
embedding_data = pickle.load(f)
|
||||
|
||||
if isinstance(embedding_data, dict) and 'chunks' in embedding_data:
|
||||
chunks = embedding_data['chunks']
|
||||
|
||||
# Get embedding vectors (critical fix)
|
||||
if 'embeddings' in embedding_data:
|
||||
embeddings = embedding_data['embeddings']
|
||||
all_embeddings.append(embeddings)
|
||||
|
||||
# Get model metadata from the first file
|
||||
if 'model_path' in embedding_data:
|
||||
model_path = embedding_data['model_path']
|
||||
if 'chunking_strategy' in embedding_data:
|
||||
chunking_strategy = embedding_data['chunking_strategy']
|
||||
if 'chunking_params' in embedding_data:
|
||||
chunking_params = embedding_data['chunking_params']
|
||||
|
||||
# Add source file metadata to each chunk
|
||||
for chunk in chunks:
|
||||
if isinstance(chunk, dict):
|
||||
chunk['source_file'] = filename_stem
|
||||
chunk['source_group'] = group_name
|
||||
elif isinstance(chunk, str):
|
||||
# If the chunk is a string, keep it unchanged
|
||||
pass
|
||||
|
||||
all_chunks.extend(chunks)
|
||||
total_chunks += len(chunks)
|
||||
|
||||
result["source_files"].append(filename_stem)
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error loading embedding file {embedding_path}: {str(e)}")
|
||||
continue
|
||||
|
||||
if all_chunks and all_embeddings:
|
||||
# Merge all embedding vectors
|
||||
try:
|
||||
# Try merging tensors with torch
|
||||
import torch
|
||||
if all(isinstance(emb, torch.Tensor) for emb in all_embeddings):
|
||||
merged_embeddings = torch.cat(all_embeddings, dim=0)
|
||||
dimensions = merged_embeddings.shape[1]
|
||||
else:
|
||||
# If the values are not tensors, try converting them to numpy
|
||||
import numpy as np
|
||||
if NUMPY_SUPPORT:
|
||||
np_embeddings = []
|
||||
for emb in all_embeddings:
|
||||
if hasattr(emb, 'numpy'):
|
||||
np_embeddings.append(emb.numpy())
|
||||
elif isinstance(emb, np.ndarray):
|
||||
np_embeddings.append(emb)
|
||||
else:
|
||||
# If conversion fails, skip this file
|
||||
logger.warning(f"Cannot convert embedding of type {type(emb).__name__} to numpy, skipping it")
|
||||
continue
|
||||
|
||||
if np_embeddings:
|
||||
merged_embeddings = np.concatenate(np_embeddings, axis=0)
|
||||
dimensions = merged_embeddings.shape[1]
|
||||
else:
|
||||
result["error"] = "No valid embedding tensors could be merged"
|
||||
return result
|
||||
else:
|
||||
result["error"] = "NumPy not available for merging embeddings"
|
||||
return result
|
||||
|
||||
except ImportError:
|
||||
# If torch is unavailable, try using numpy
|
||||
if NUMPY_SUPPORT:
|
||||
import numpy as np
|
||||
np_embeddings = []
|
||||
for emb in all_embeddings:
|
||||
if hasattr(emb, 'numpy'):
|
||||
np_embeddings.append(emb.numpy())
|
||||
elif isinstance(emb, np.ndarray):
|
||||
np_embeddings.append(emb)
|
||||
else:
|
||||
logger.warning(f"Cannot convert embedding of type {type(emb).__name__} to numpy, skipping it")
|
||||
continue
|
||||
|
||||
if np_embeddings:
|
||||
merged_embeddings = np.concatenate(np_embeddings, axis=0)
|
||||
dimensions = merged_embeddings.shape[1]
|
||||
else:
|
||||
result["error"] = "No valid embedding tensors could be merged"
|
||||
return result
|
||||
else:
|
||||
result["error"] = "Neither torch nor numpy available for merging embeddings"
|
||||
return result
|
||||
except Exception as e:
|
||||
result["error"] = f"Failed to merge embedding tensors: {str(e)}"
|
||||
logger.error(f"Error merging embedding tensors: {str(e)}")
|
||||
return result
|
||||
|
||||
# Create merged embedding data structure
|
||||
merged_embedding_data = {
|
||||
'chunks': all_chunks,
|
||||
'embeddings': merged_embeddings, # Critical fix: include the embeddings key
|
||||
'total_chunks': total_chunks,
|
||||
'dimensions': dimensions,
|
||||
'source_files': result["source_files"],
|
||||
'group_name': group_name,
|
||||
'merged_at': str(__import__('time').time()),
|
||||
'chunking_strategy': chunking_strategy,
|
||||
'chunking_params': chunking_params,
|
||||
'model_path': model_path
|
||||
}
|
||||
|
||||
# Save merged embeddings
|
||||
with open(merged_embedding_path, 'wb') as f:
|
||||
pickle.dump(merged_embedding_data, f)
|
||||
|
||||
result["total_chunks"] = total_chunks
|
||||
result["total_dimensions"] = dimensions
|
||||
result["success"] = True
|
||||
|
||||
else:
|
||||
result["error"] = "No valid embedding data found"
|
||||
|
||||
except Exception as e:
|
||||
result["error"] = f"Embedding merging failed: {str(e)}"
|
||||
logger.error(f"Error merging embeddings for group {group_name}: {str(e)}")
|
||||
|
||||
return result
|
||||
|
||||
|
||||
def merge_all_data_by_group(unique_id: str, group_name: str) -> Dict:
|
||||
"""Merge documents, paginations, and embeddings for a group."""
|
||||
|
||||
merge_results = {
|
||||
"group_name": group_name,
|
||||
"unique_id": unique_id,
|
||||
"success": True,
|
||||
"document_merge": None,
|
||||
"pagination_merge": None,
|
||||
"embedding_merge": None,
|
||||
"errors": []
|
||||
}
|
||||
|
||||
# Merge documents
|
||||
document_result = merge_documents_by_group(unique_id, group_name)
|
||||
merge_results["document_merge"] = document_result
|
||||
|
||||
if not document_result["success"]:
|
||||
merge_results["success"] = False
|
||||
merge_results["errors"].append(f"Document merge failed: {document_result['error']}")
|
||||
|
||||
# Merge paginations
|
||||
pagination_result = merge_paginations_by_group(unique_id, group_name)
|
||||
merge_results["pagination_merge"] = pagination_result
|
||||
|
||||
if not pagination_result["success"]:
|
||||
merge_results["success"] = False
|
||||
merge_results["errors"].append(f"Pagination merge failed: {pagination_result['error']}")
|
||||
|
||||
# Merge embeddings
|
||||
embedding_result = merge_embeddings_by_group(unique_id, group_name)
|
||||
merge_results["embedding_merge"] = embedding_result
|
||||
|
||||
if not embedding_result["success"]:
|
||||
merge_results["success"] = False
|
||||
merge_results["errors"].append(f"Embedding merge failed: {embedding_result['error']}")
|
||||
|
||||
return merge_results
|
||||
|
||||
|
||||
def get_group_merge_status(unique_id: str, group_name: str) -> Dict:
|
||||
"""Get the status of merged data for a group."""
|
||||
|
||||
dataset_group_dir = os.path.join("projects", "data", unique_id, "datasets", group_name)
|
||||
|
||||
status = {
|
||||
"group_name": group_name,
|
||||
"unique_id": unique_id,
|
||||
"dataset_dir_exists": os.path.exists(dataset_group_dir),
|
||||
"document_exists": False,
|
||||
"document_size": 0,
|
||||
"pagination_exists": False,
|
||||
"pagination_size": 0,
|
||||
"embedding_exists": False,
|
||||
"embedding_size": 0,
|
||||
"merge_complete": False
|
||||
}
|
||||
|
||||
if os.path.exists(dataset_group_dir):
|
||||
document_path = os.path.join(dataset_group_dir, "document.txt")
|
||||
pagination_path = os.path.join(dataset_group_dir, "pagination.txt")
|
||||
embedding_path = os.path.join(dataset_group_dir, "embedding.pkl")
|
||||
|
||||
if os.path.exists(document_path):
|
||||
status["document_exists"] = True
|
||||
status["document_size"] = os.path.getsize(document_path)
|
||||
|
||||
if os.path.exists(pagination_path):
|
||||
status["pagination_exists"] = True
|
||||
status["pagination_size"] = os.path.getsize(pagination_path)
|
||||
|
||||
if os.path.exists(embedding_path):
|
||||
status["embedding_exists"] = True
|
||||
status["embedding_size"] = os.path.getsize(embedding_path)
|
||||
|
||||
# Check if all files exist and are not empty
|
||||
if (status["document_exists"] and status["document_size"] > 0 and
|
||||
status["pagination_exists"] and status["pagination_size"] > 0 and
|
||||
status["embedding_exists"] and status["embedding_size"] > 0):
|
||||
status["merge_complete"] = True
|
||||
|
||||
return status
|
||||
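`get_group_merge_status` treats a group as complete only when `document.txt`, `pagination.txt`, and `embedding.pkl` all exist and are non-empty. A compact sketch of that completeness check follows. The names `REQUIRED_FILES` and `is_merge_complete` are assumptions introduced here for illustration.

```python
import os

# The three merged artifacts the removed code expects in a dataset group directory.
REQUIRED_FILES = ("document.txt", "pagination.txt", "embedding.pkl")


def is_merge_complete(dataset_group_dir: str) -> bool:
    """Return True only if every required file exists and has a size greater than zero."""
    for name in REQUIRED_FILES:
        path = os.path.join(dataset_group_dir, name)
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            return False
    return True
```

A missing directory simply yields `False`, since none of the required files can be found under it.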
|
||||
|
||||
def cleanup_dataset_group(unique_id: str, group_name: str) -> bool:
|
||||
"""Clean up merged dataset files for a group."""
|
||||
|
||||
dataset_group_dir = os.path.join("projects", "data", unique_id, "datasets", group_name)
|
||||
|
||||
try:
|
||||
if os.path.exists(dataset_group_dir):
|
||||
import shutil
|
||||
shutil.rmtree(dataset_group_dir)
|
||||
logger.info(f"Cleaned up dataset group: {group_name}")
|
||||
return True
|
||||
else:
|
||||
return True # Nothing to clean up
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error cleaning up dataset group {group_name}: {str(e)}")
|
||||
return False
|
||||
@ -1,297 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Dataset management functions for organizing and processing datasets.
|
||||
New implementation with per-file processing and group merging.
|
||||
"""
|
||||
|
||||
import os
|
||||
import json
|
||||
import logging
|
||||
from typing import Dict, List
|
||||
|
||||
# Configure logger
|
||||
logger = logging.getLogger('app')
|
||||
|
||||
# Import new modules
|
||||
from utils.file_manager import (
|
||||
ensure_directories, sync_files_to_group, cleanup_orphaned_files,
|
||||
get_group_files_list
|
||||
)
|
||||
from utils.single_file_processor import (
|
||||
process_single_file, check_file_already_processed
|
||||
)
|
||||
from utils.data_merger import (
|
||||
merge_all_data_by_group, cleanup_dataset_group
|
||||
)
|
||||
|
||||
|
||||
async def download_dataset_files(unique_id: str, files: Dict[str, List[str]], incremental_mode: bool = False) -> Dict[str, List[str]]:
|
||||
"""
|
||||
Process dataset files with new architecture:
|
||||
1. Sync files to group directories
|
||||
2. Process each file individually
|
||||
3. Merge results by group
|
||||
4. Clean up orphaned files (only in non-incremental mode)
|
||||
|
||||
Args:
|
||||
unique_id: Project ID
|
||||
files: Dictionary of files to process, grouped by key
|
||||
incremental_mode: If True, preserve existing files and only process new ones
|
||||
"""
|
||||
if not files:
|
||||
return {}
|
||||
|
||||
logger.info(f"Starting {'incremental' if incremental_mode else 'full'} file processing for project: {unique_id}")
|
||||
|
||||
# Ensure project directories exist
|
||||
ensure_directories(unique_id)
|
||||
|
||||
# Step 1: Sync files to group directories
|
||||
logger.info("Step 1: Syncing files to group directories...")
|
||||
synced_files, failed_files = sync_files_to_group(unique_id, files, incremental_mode)
|
||||
|
||||
# Step 2: Detect changes and cleanup orphaned files (only in non-incremental mode)
|
||||
from utils.file_manager import detect_file_changes
|
||||
changes = detect_file_changes(unique_id, files, incremental_mode)
|
||||
|
||||
# Only cleanup orphaned files in non-incremental mode or when files are explicitly removed
|
||||
if not incremental_mode and any(changes["removed"].values()):
|
||||
logger.info("Step 2: Cleaning up orphaned files...")
|
||||
removed_files = cleanup_orphaned_files(unique_id, changes)
|
||||
logger.info(f"Removed orphaned files: {removed_files}")
|
||||
elif incremental_mode:
|
||||
logger.info("Step 2: Skipping cleanup in incremental mode to preserve existing files")
|
||||
|
||||
# Step 3: Process individual files
|
||||
logger.info("Step 3: Processing individual files...")
|
||||
processed_files_by_group = {}
|
||||
processing_results = {}
|
||||
|
||||
for group_name, file_list in files.items():
|
||||
processed_files_by_group[group_name] = []
|
||||
processing_results[group_name] = []
|
||||
|
||||
for file_path in file_list:
|
||||
filename = os.path.basename(file_path)
|
||||
|
||||
# Get local file path
|
||||
local_path = os.path.join("projects", "data", unique_id, "files", group_name, filename)
|
||||
|
||||
# Skip if file doesn't exist (might be remote file that failed to download)
|
||||
if not os.path.exists(local_path) and not file_path.startswith(('http://', 'https://')):
|
||||
logger.warning(f"Skipping non-existent file: {filename}")
|
||||
continue
|
||||
|
||||
# Check if already processed
|
||||
if check_file_already_processed(unique_id, group_name, filename):
|
||||
logger.info(f"Skipping already processed file: {filename}")
|
||||
processed_files_by_group[group_name].append(filename)
|
||||
processing_results[group_name].append({
|
||||
"filename": filename,
|
||||
"status": "existing"
|
||||
})
|
||||
continue
|
||||
|
||||
# Process the file
|
||||
logger.info(f"Processing file: {filename} (group: {group_name})")
|
||||
result = await process_single_file(unique_id, group_name, filename, file_path, local_path)
|
||||
processing_results[group_name].append(result)
|
||||
|
||||
if result["success"]:
|
||||
processed_files_by_group[group_name].append(filename)
|
||||
logger.info(f" Successfully processed {filename}")
|
||||
else:
|
||||
logger.error(f" Failed to process {filename}: {result['error']}")
|
||||
|
||||
# Step 4: Merge results by group
|
||||
logger.info("Step 4: Merging results by group...")
|
||||
merge_results = {}
|
||||
|
||||
for group_name in processed_files_by_group.keys():
|
||||
# Get all files in the group (including existing ones)
|
||||
group_files = get_group_files_list(unique_id, group_name)
|
||||
|
||||
if group_files:
|
||||
logger.info(f"Merging group: {group_name} with {len(group_files)} files")
|
||||
merge_result = merge_all_data_by_group(unique_id, group_name)
|
||||
merge_results[group_name] = merge_result
|
||||
|
||||
if merge_result["success"]:
|
||||
logger.info(f" Successfully merged group {group_name}")
|
||||
else:
|
||||
logger.error(f" Failed to merge group {group_name}: {merge_result['errors']}")
|
||||
|
||||
# Step 5: Save processing log
|
||||
logger.info("Step 5: Saving processing log...")
|
||||
await save_processing_log(unique_id, files, synced_files, processing_results, merge_results)
|
||||
|
||||
logger.info(f"File processing completed for project: {unique_id}")
|
||||
return processed_files_by_group
|
||||
|
||||
|
||||
async def save_processing_log(
|
||||
unique_id: str,
|
||||
requested_files: Dict[str, List[str]],
|
||||
synced_files: Dict,
|
||||
processing_results: Dict,
|
||||
merge_results: Dict
|
||||
):
|
||||
"""Save comprehensive processing log."""
|
||||
|
||||
log_data = {
|
||||
"unique_id": unique_id,
|
||||
"timestamp": str(os.path.getmtime("projects") if os.path.exists("projects") else 0),
|
||||
"requested_files": requested_files,
|
||||
"synced_files": synced_files,
|
||||
"processing_results": processing_results,
|
||||
"merge_results": merge_results,
|
||||
"summary": {
|
||||
"total_groups": len(requested_files),
|
||||
"total_files_requested": sum(len(files) for files in requested_files.values()),
|
||||
"total_files_processed": sum(
|
||||
len([r for r in results if r.get("success", False)])
|
||||
for results in processing_results.values()
|
||||
),
|
||||
"total_groups_merged": len([r for r in merge_results.values() if r.get("success", False)])
|
||||
}
|
||||
}
|
||||
|
||||
log_file_path = os.path.join("projects", "data", unique_id, "processing_log.json")
|
||||
try:
|
||||
with open(log_file_path, 'w', encoding='utf-8') as f:
|
||||
json.dump(log_data, f, ensure_ascii=False, indent=2)
|
||||
logger.info(f"Processing log saved to: {log_file_path}")
|
||||
except Exception as e:
|
||||
logger.error(f"Error saving processing log: {str(e)}")
|
||||
|
||||
|
||||
def generate_dataset_structure(unique_id: str) -> str:
|
||||
"""Generate a string representation of the dataset structure"""
|
||||
project_dir = os.path.join("projects", "data", unique_id)
|
||||
structure = []
|
||||
|
||||
def add_directory_contents(dir_path: str, prefix: str = ""):
|
||||
try:
|
||||
if not os.path.exists(dir_path):
|
||||
structure.append(f"{prefix}└── (not found)")
|
||||
return
|
||||
|
||||
items = sorted(os.listdir(dir_path))
|
||||
for i, item in enumerate(items):
|
||||
item_path = os.path.join(dir_path, item)
|
||||
is_last = i == len(items) - 1
|
||||
current_prefix = "└── " if is_last else "├── "
|
||||
structure.append(f"{prefix}{current_prefix}{item}")
|
||||
|
||||
if os.path.isdir(item_path):
|
||||
next_prefix = prefix + ("    " if is_last else "│   ")
|
||||
add_directory_contents(item_path, next_prefix)
|
||||
except Exception as e:
|
||||
structure.append(f"{prefix}└── Error: {str(e)}")
|
||||
|
||||
# Add files directory structure
|
||||
files_dir = os.path.join(project_dir, "files")
|
||||
structure.append("files/")
|
||||
add_directory_contents(files_dir, "")
|
||||
|
||||
# Add processed directory structure
|
||||
processed_dir = os.path.join(project_dir, "processed")
|
||||
structure.append("\nprocessed/")
|
||||
add_directory_contents(processed_dir, "")
|
||||
|
||||
# Add dataset directory structure
|
||||
dataset_dir = os.path.join(project_dir, "datasets")
|
||||
structure.append("\ndataset/")
|
||||
add_directory_contents(dataset_dir, "")
|
||||
|
||||
return "\n".join(structure)
|
||||
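The tree drawing in `generate_dataset_structure` follows the usual convention: the last entry of a directory gets `└── `, the others get `├── `, and children inherit a prefix that is four characters wide so the columns line up. The sketch below applies the same rule to an in-memory nested dict instead of the filesystem. `render_tree` and the dict representation (a `None` value marks a file) are assumptions made for this example.

```python
from typing import Any, Dict, List


def render_tree(tree: Dict[str, Any], prefix: str = "") -> List[str]:
    """Render a nested dict as tree lines; a value of None marks a file, a dict a directory."""
    lines: List[str] = []
    names = sorted(tree)
    for i, name in enumerate(names):
        is_last = i == len(names) - 1
        connector = "└── " if is_last else "├── "
        lines.append(f"{prefix}{connector}{name}")
        child = tree[name]
        if child is not None:
            # Keep the vertical bar only while more siblings follow at this level.
            lines.extend(render_tree(child, prefix + ("    " if is_last else "│   ")))
    return lines
```

Joining the returned lines with `"\n"` gives a string in the same shape the removed function builds for the `files`, `processed`, and `datasets` directories.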
|
||||
|
||||
def get_processing_status(unique_id: str) -> Dict:
|
||||
"""Get comprehensive processing status for a project."""
|
||||
|
||||
project_dir = os.path.join("projects", "data", unique_id)
|
||||
|
||||
if not os.path.exists(project_dir):
|
||||
return {
|
||||
"project_exists": False,
|
||||
"unique_id": unique_id
|
||||
}
|
||||
|
||||
status = {
|
||||
"project_exists": True,
|
||||
"unique_id": unique_id,
|
||||
"directories": {
|
||||
"files": os.path.exists(os.path.join(project_dir, "files")),
|
||||
"processed": os.path.exists(os.path.join(project_dir, "processed")),
|
||||
"dataset": os.path.exists(os.path.join(project_dir, "datasets"))
|
||||
},
|
||||
"groups": {},
|
||||
"processing_log_exists": os.path.exists(os.path.join(project_dir, "processing_log.json"))
|
||||
}
|
||||
|
||||
# Check each group's status
|
||||
files_dir = os.path.join(project_dir, "files")
|
||||
if os.path.exists(files_dir):
|
||||
for group_name in os.listdir(files_dir):
|
||||
group_path = os.path.join(files_dir, group_name)
|
||||
if os.path.isdir(group_path):
|
||||
status["groups"][group_name] = {
|
||||
"files_count": len([
|
||||
f for f in os.listdir(group_path)
|
||||
if os.path.isfile(os.path.join(group_path, f))
|
||||
]),
|
||||
"merge_status": "pending"
|
||||
}
|
||||
|
||||
# Check merge status for each group
|
||||
dataset_dir = os.path.join(project_dir, "datasets")
|
||||
if os.path.exists(dataset_dir):
|
||||
for group_name in os.listdir(dataset_dir):
|
||||
group_path = os.path.join(dataset_dir, group_name)
|
||||
if os.path.isdir(group_path):
|
||||
if group_name in status["groups"]:
|
||||
# Check if merge is complete
|
||||
document_path = os.path.join(group_path, "document.txt")
|
||||
pagination_path = os.path.join(group_path, "pagination.txt")
|
||||
embedding_path = os.path.join(group_path, "embedding.pkl")
|
||||
|
||||
if (os.path.exists(document_path) and os.path.exists(pagination_path) and
|
||||
os.path.exists(embedding_path)):
|
||||
status["groups"][group_name]["merge_status"] = "completed"
|
||||
else:
|
||||
status["groups"][group_name]["merge_status"] = "incomplete"
|
||||
else:
|
||||
status["groups"][group_name] = {
|
||||
"files_count": 0,
|
||||
"merge_status": "completed"
|
||||
}
|
||||
|
||||
return status
|
||||
|
||||
|
||||
def remove_dataset_directory(unique_id: str, filename_without_ext: str):
|
||||
"""Remove a specific dataset directory (deprecated - use new structure)"""
|
||||
# This function is kept for compatibility but delegates to new structure
|
||||
dataset_path = os.path.join("projects", "data", unique_id, "processed", filename_without_ext)
|
||||
if os.path.exists(dataset_path):
|
||||
import shutil
|
||||
shutil.rmtree(dataset_path)
|
||||
|
||||
|
||||
def remove_dataset_directory_by_key(unique_id: str, key: str):
|
||||
"""Remove dataset directory by key (group name)"""
|
||||
# Remove files directory
|
||||
files_group_path = os.path.join("projects", "data", unique_id, "files", key)
|
||||
if os.path.exists(files_group_path):
|
||||
import shutil
|
||||
shutil.rmtree(files_group_path)
|
||||
|
||||
# Remove processed directory
|
||||
processed_group_path = os.path.join("projects", "data", unique_id, "processed", key)
|
||||
if os.path.exists(processed_group_path):
|
||||
import shutil
|
||||
shutil.rmtree(processed_group_path)
|
||||
|
||||
# Remove dataset directory
|
||||
cleanup_dataset_group(unique_id, key)
|
||||
@ -1,343 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Project management functions for handling projects, README generation, and status tracking.
|
||||
"""
|
||||
|
||||
import os
|
||||
import json
|
||||
import logging
|
||||
from typing import Dict, List, Optional
|
||||
from pathlib import Path
|
||||
|
||||
# Configure logger
|
||||
logger = logging.getLogger('app')
|
||||
|
||||
from utils.file_utils import get_document_preview, load_processed_files_log
|
||||
|
||||
|
||||
def generate_directory_tree(project_dir: str, unique_id: str, max_depth: int = 3) -> str:
    """Generate dataset directory tree structure for the project"""
    def _build_tree(path: str, prefix: str = "", is_last: bool = True, depth: int = 0) -> List[str]:
        if depth > max_depth:
            return []

        lines = []
        try:
            entries = sorted(os.listdir(path))
            # Separate directories and files
            dirs = [e for e in entries if os.path.isdir(os.path.join(path, e)) and not e.startswith('.')]
            files = [e for e in entries if os.path.isfile(os.path.join(path, e)) and not e.startswith('.')]

            entries = dirs + files

            for i, entry in enumerate(entries):
                entry_path = os.path.join(path, entry)
                is_dir = os.path.isdir(entry_path)
                is_last_entry = i == len(entries) - 1

                # Choose the appropriate tree symbols
                if is_last_entry:
                    connector = "└── "
                    new_prefix = prefix + "    "
                else:
                    connector = "├── "
                    new_prefix = prefix + "│   "

                # Add entry line
                line = prefix + connector + entry
                if is_dir:
                    line += "/"
                lines.append(line)

                # Recursively add subdirectories
                if is_dir and depth < max_depth:
                    sub_lines = _build_tree(entry_path, new_prefix, is_last_entry, depth + 1)
                    lines.extend(sub_lines)

        except PermissionError:
            lines.append(prefix + "└── [Permission Denied]")
        except Exception as e:
            lines.append(prefix + "└── [Error: " + str(e) + "]")

        return lines

    # Start building tree from dataset directory
    dataset_dir = os.path.join(project_dir, "datasets")
    tree_lines = []

    if not os.path.exists(dataset_dir):
        return "└── [No dataset directory found]"

    try:
        entries = sorted(os.listdir(dataset_dir))
        dirs = [e for e in entries if os.path.isdir(os.path.join(dataset_dir, e)) and not e.startswith('.')]
        files = [e for e in entries if os.path.isfile(os.path.join(dataset_dir, e)) and not e.startswith('.')]

        entries = dirs + files

        if not entries:
            tree_lines.append("└── [Empty dataset directory]")
        else:
            for i, entry in enumerate(entries):
                entry_path = os.path.join(dataset_dir, entry)
                is_dir = os.path.isdir(entry_path)
                is_last_entry = i == len(entries) - 1

                if is_last_entry:
                    connector = "└── "
                    prefix = "    "
                else:
                    connector = "├── "
                    prefix = "│   "

                line = connector + entry
                if is_dir:
                    line += "/"
                tree_lines.append(line)

                # Recursively add subdirectories
                if is_dir:
                    sub_lines = _build_tree(entry_path, prefix, is_last_entry, 1)
                    tree_lines.extend(sub_lines)

    except Exception as e:
        tree_lines.append(f"└── [Error generating tree: {str(e)}]")

    return "\n".join(tree_lines)
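
For comparison, the same rendering idea (directories first, hidden entries skipped, box-drawing connectors with a running prefix) can be sketched in a single recursive function using `pathlib`. This is an illustrative sketch rather than the repository's implementation: `render_tree` and `build_demo_tree` are hypothetical names, and the depth handling is simplified compared with the function above.

```python
import tempfile
from pathlib import Path
from typing import List


def render_tree(root: Path, prefix: str = "", depth: int = 0, max_depth: int = 3) -> List[str]:
    """Return tree lines for `root`, directories first, hidden entries skipped."""
    if depth > max_depth:
        return []
    entries = [p for p in root.iterdir() if not p.name.startswith(".")]
    # Sort directories before files, then alphabetically within each group.
    entries.sort(key=lambda p: (not p.is_dir(), p.name))
    lines: List[str] = []
    for index, entry in enumerate(entries):
        last = index == len(entries) - 1
        connector = "└── " if last else "├── "
        lines.append(prefix + connector + entry.name + ("/" if entry.is_dir() else ""))
        if entry.is_dir():
            # Children of the last entry get blank padding; others keep the vertical bar.
            child_prefix = prefix + ("    " if last else "│   ")
            lines.extend(render_tree(entry, child_prefix, depth + 1, max_depth))
    return lines


def build_demo_tree() -> List[str]:
    """Render a small temporary directory with one group and one loose file."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "group_a").mkdir()
        (root / "group_a" / "document.txt").write_text("hello", encoding="utf-8")
        (root / "notes.md").write_text("n", encoding="utf-8")
        return render_tree(root)
```

The child prefix is four characters wide so that nested connectors line up under the parent entry name.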
|
||||
|
||||
|
||||
def generate_project_readme(unique_id: str) -> str:
|
||||
"""Generate README.md content for a project"""
|
||||
project_dir = os.path.join("projects", "data", unique_id)
|
||||
readme_content = f"""# Project: {unique_id}
|
||||
|
||||
## Project Overview
|
||||
|
||||
This project contains processed documents and their associated embeddings for semantic search.
|
||||
|
||||
## Directory Structure
|
||||
|
||||
"""
|
||||
|
||||
# Generate directory tree
|
||||
readme_content += "```\n"
|
||||
readme_content += generate_directory_tree(project_dir, unique_id)
|
||||
readme_content += "\n```\n\n"
|
||||
|
||||
readme_content += """## Dataset Structure
|
||||
|
||||
"""
|
||||
|
||||
dataset_dir = os.path.join(project_dir, "datasets")
|
||||
if not os.path.exists(dataset_dir):
|
||||
readme_content += "No dataset files available.\n"
|
||||
else:
|
||||
# Get all document directories
|
||||
doc_dirs = []
|
||||
try:
|
||||
for item in sorted(os.listdir(dataset_dir)):
|
||||
item_path = os.path.join(dataset_dir, item)
|
||||
if os.path.isdir(item_path):
|
||||
doc_dirs.append(item)
|
||||
except Exception as e:
|
||||
logger.error(f"Error listing dataset directories: {str(e)}")
|
||||
|
||||
if not doc_dirs:
|
||||
readme_content += "No document directories found.\n"
|
||||
else:
|
||||
for doc_dir in doc_dirs:
|
||||
doc_path = os.path.join(dataset_dir, doc_dir)
|
||||
document_file = os.path.join(doc_path, "document.txt")
|
||||
pagination_file = os.path.join(doc_path, "pagination.txt")
|
||||
embeddings_file = os.path.join(doc_path, "embedding.pkl")
|
||||
|
||||
readme_content += f"### {doc_dir}\n\n"
|
||||
readme_content += f"**Files:**\n"
|
||||
readme_content += f"- `{doc_dir}/document.txt`"
|
||||
if os.path.exists(document_file):
|
||||
readme_content += " ✓"
|
||||
readme_content += "\n"
|
||||
|
||||
readme_content += f"- `{doc_dir}/pagination.txt`"
|
||||
if os.path.exists(pagination_file):
|
||||
readme_content += " ✓"
|
||||
readme_content += "\n"
|
||||
|
||||
readme_content += f"- `{doc_dir}/embedding.pkl`"
|
||||
if os.path.exists(embeddings_file):
|
||||
readme_content += " ✓"
|
||||
readme_content += "\n\n"
|
||||
|
||||
# Add document preview
|
||||
if os.path.exists(document_file):
|
||||
readme_content += f"**Content Preview (first 10 lines):**\n\n```\n"
|
||||
preview = get_document_preview(document_file, 10)
|
||||
readme_content += preview
|
||||
readme_content += "\n```\n\n"
|
||||
else:
|
||||
readme_content += f"**Content Preview:** Not available\n\n"
|
||||
|
||||
readme_content += f"""---
|
||||
*Generated on {__import__('datetime').datetime.now().strftime('%Y-%m-%d %H:%M:%S')}*
|
||||
"""
|
||||
|
||||
return readme_content
|
||||
|
||||
|
||||
def save_project_readme(unique_id: str):
|
||||
"""Save README.md for a project"""
|
||||
readme_content = generate_project_readme(unique_id)
|
||||
readme_path = os.path.join("projects", "data", unique_id, "README.md")
|
||||
|
||||
try:
|
||||
os.makedirs(os.path.dirname(readme_path), exist_ok=True)
|
||||
with open(readme_path, 'w', encoding='utf-8') as f:
|
||||
f.write(readme_content)
|
||||
logger.info(f"Generated README.md for project {unique_id}")
|
||||
return readme_path
|
||||
except Exception as e:
|
||||
logger.error(f"Error generating README for project {unique_id}: {str(e)}")
|
||||
return None
|
||||
|
||||
|
||||
def get_project_status(unique_id: str) -> Dict:
|
||||
"""Get comprehensive status of a project"""
|
||||
project_dir = os.path.join("projects", "data", unique_id)
|
||||
project_exists = os.path.exists(project_dir)
|
||||
|
||||
if not project_exists:
|
||||
return {
|
||||
"unique_id": unique_id,
|
||||
"project_exists": False,
|
||||
"error": "Project not found"
|
||||
}
|
||||
|
||||
# Get processed log
|
||||
processed_log = load_processed_files_log(unique_id)
|
||||
|
||||
# Collect document.txt files
|
||||
document_files = []
|
||||
dataset_dir = os.path.join(project_dir, "datasets")
|
||||
if os.path.exists(dataset_dir):
|
||||
for root, dirs, files in os.walk(dataset_dir):
|
||||
for file in files:
|
||||
if file == "document.txt":
|
||||
document_files.append(os.path.join(root, file))
|
||||
|
||||
# Check system prompt and MCP settings
|
||||
system_prompt_file = os.path.join(project_dir, "system_prompt.txt")
|
||||
mcp_settings_file = os.path.join(project_dir, "mcp_settings.json")
|
||||
|
||||
status = {
|
||||
"unique_id": unique_id,
|
||||
"project_exists": True,
|
||||
"project_path": project_dir,
|
||||
"processed_files_count": len(processed_log),
|
||||
"processed_files": processed_log,
|
||||
"document_files_count": len(document_files),
|
||||
"document_files": document_files,
|
||||
"has_system_prompt": os.path.exists(system_prompt_file),
|
||||
"has_mcp_settings": os.path.exists(mcp_settings_file),
|
||||
"readme_exists": os.path.exists(os.path.join(project_dir, "README.md")),
|
||||
"log_file_exists": os.path.exists(os.path.join(project_dir, "processed_files.json"))
|
||||
}
|
||||
|
||||
# Add dataset structure
|
||||
try:
|
||||
from utils.dataset_manager import generate_dataset_structure
|
||||
status["dataset_structure"] = generate_dataset_structure(unique_id)
|
||||
except Exception as e:
|
||||
status["dataset_structure"] = f"Error generating structure: {str(e)}"
|
||||
|
||||
return status
|
||||
|
||||
|
||||
def remove_project(unique_id: str) -> bool:
|
||||
"""Remove entire project directory"""
|
||||
project_dir = os.path.join("projects", "data", unique_id)
|
||||
try:
|
||||
if os.path.exists(project_dir):
|
||||
import shutil
|
||||
shutil.rmtree(project_dir)
|
||||
logger.info(f"Removed project directory: {project_dir}")
|
||||
return True
|
||||
else:
|
||||
logger.warning(f"Project directory not found: {project_dir}")
|
||||
return False
|
||||
except Exception as e:
|
||||
logger.error(f"Error removing project {unique_id}: {str(e)}")
|
||||
return False
|
||||
|
||||
|
||||
def list_projects() -> List[str]:
|
||||
"""List all existing project IDs"""
|
||||
projects_dir = "projects"
|
||||
if not os.path.exists(projects_dir):
|
||||
return []
|
||||
|
||||
try:
|
||||
return [item for item in os.listdir(projects_dir)
|
||||
if os.path.isdir(os.path.join(projects_dir, item))]
|
||||
except Exception as e:
|
||||
logger.error(f"Error listing projects: {str(e)}")
|
||||
return []
|
||||
|
||||
|
||||
def get_project_stats(unique_id: str) -> Dict:
|
||||
"""Get statistics for a specific project"""
|
||||
status = get_project_status(unique_id)
|
||||
|
||||
if not status["project_exists"]:
|
||||
return status
|
||||
|
||||
stats = {
|
||||
"unique_id": unique_id,
|
||||
"total_processed_files": status["processed_files_count"],
|
||||
"total_document_files": status["document_files_count"],
|
||||
"has_system_prompt": status["has_system_prompt"],
|
||||
"has_mcp_settings": status["has_mcp_settings"],
|
||||
"has_readme": status["readme_exists"]
|
||||
}
|
||||
|
||||
# Calculate file sizes
|
||||
total_size = 0
|
||||
document_sizes = []
|
||||
|
||||
for doc_file in status["document_files"]:
|
||||
try:
|
||||
size = os.path.getsize(doc_file)
|
||||
document_sizes.append({
|
||||
"file": doc_file,
|
||||
"size": size,
|
||||
"size_mb": round(size / (1024 * 1024), 2)
|
||||
})
|
||||
total_size += size
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
stats["total_document_size"] = total_size
|
||||
stats["total_document_size_mb"] = round(total_size / (1024 * 1024), 2)
|
||||
stats["document_files_detail"] = document_sizes
|
||||
|
||||
# Check embeddings files
|
||||
embedding_files = []
|
||||
dataset_dir = os.path.join("projects", "data", unique_id, "datasets")
|
||||
if os.path.exists(dataset_dir):
|
||||
for root, dirs, files in os.walk(dataset_dir):
|
||||
for file in files:
|
||||
if file == "embedding.pkl":
|
||||
file_path = os.path.join(root, file)
|
||||
try:
|
||||
size = os.path.getsize(file_path)
|
||||
embedding_files.append({
|
||||
"file": file_path,
|
||||
"size": size,
|
||||
"size_mb": round(size / (1024 * 1024), 2)
|
||||
})
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
stats["embedding_files_count"] = len(embedding_files)
|
||||
stats["embedding_files_detail"] = embedding_files
|
||||
|
||||
return stats
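
The `document.txt` and `embedding.pkl` scans above share one shape: walk a directory, pick files with a given name, and record their sizes. A minimal sketch of that shared step follows; `collect_named_files` and `demo_sizes` are hypothetical helpers introduced only for illustration.

```python
import os
import tempfile
from typing import Dict, List


def collect_named_files(root: str, target_name: str) -> List[Dict]:
    """Walk `root` and describe every file whose name equals `target_name`."""
    results = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if target_name in filenames:
            path = os.path.join(dirpath, target_name)
            size = os.path.getsize(path)
            results.append({
                "file": path,
                "size": size,
                "size_mb": round(size / (1024 * 1024), 2),
            })
    return results


def demo_sizes() -> int:
    """Create two groups with small documents and return their combined size in bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        for group, payload in (("a", b"12345"), ("b", b"123")):
            os.makedirs(os.path.join(tmp, group))
            with open(os.path.join(tmp, group, "document.txt"), "wb") as f:
                f.write(payload)
        return sum(item["size"] for item in collect_named_files(tmp, "document.txt"))
```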
|
||||
@@ -30,7 +30,12 @@ PROJECT_NAME = os.getenv("PROJECT_NAME", "support")
|
||||
TOKENIZERS_PARALLELISM = os.getenv("TOKENIZERS_PARALLELISM", "true")
|
||||
|
||||
# Embedding Model Settings
|
||||
SENTENCE_TRANSFORMER_MODEL = os.getenv("SENTENCE_TRANSFORMER_MODEL", "TaylorAI/gte-tiny")
|
||||
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "sk-hsKClH0Z695EkK5fDdB2Ec2fE13f4fC1B627BdBb8e554b5b-4")
|
||||
EMBEDDING_BASE_URL = os.getenv("EMBEDDING_BASE_URL", "https://one-dev.felo.me/v1")
|
||||
EMBEDDING_API_KEY = os.getenv("EMBEDDING_API_KEY", OPENAI_API_KEY)
|
||||
EMBEDDING_MODEL_NAME = os.getenv("EMBEDDING_MODEL_NAME", "text-embedding-3-small")
|
||||
EMBEDDING_DIMENSIONS = int(os.getenv("EMBEDDING_DIMENSIONS", "384"))
|
||||
EMBEDDING_TIMEOUT = int(os.getenv("EMBEDDING_TIMEOUT", "30"))
|
||||
|
||||
# Tool Output Length Control Settings
|
||||
TOOL_OUTPUT_MAX_LENGTH = SUMMARIZATION_MAX_TOKENS
|
||||
@@ -72,6 +77,15 @@ CHECKPOINT_CLEANUP_INACTIVE_DAYS = int(os.getenv("CHECKPOINT_CLEANUP_INACTIVE_DA
|
||||
CHECKPOINT_CLEANUP_INTERVAL_HOURS = int(os.getenv("CHECKPOINT_CLEANUP_INTERVAL_HOURS", "24"))
|
||||
|
||||
|
||||
# ============================================================
|
||||
# Redis Configuration (Huey task queue backend)
|
||||
# ============================================================
|
||||
|
||||
# Redis connection URL.
|
||||
# Format: redis://[:password]@host:port/db
|
||||
REDIS_URL = os.getenv("REDIS_URL", "redis://localhost:6379/1")
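
To illustrate the `redis://[:password]@host:port/db` format noted above, here is a minimal sketch that splits such a URL into its parts with the standard library. `parse_redis_url` is a hypothetical helper; the project itself may simply pass the URL to its Redis or Huey client unchanged.

```python
from urllib.parse import urlparse


def parse_redis_url(url: str) -> dict:
    """Split a redis:// URL into host, port, db index and optional password."""
    parsed = urlparse(url)
    if parsed.scheme != "redis":
        raise ValueError(f"unsupported scheme: {parsed.scheme!r}")
    return {
        "host": parsed.hostname or "localhost",
        "port": parsed.port or 6379,
        # The path component carries the database index, e.g. "/1".
        "db": int(parsed.path.lstrip("/") or 0),
        "password": parsed.password,
    }
```

With the default value above, the database index resolves to 1 and no password is set.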
|
||||
|
||||
|
||||
# ============================================================
|
||||
# Mem0 long-term memory configuration
|
||||
# ============================================================
|
||||
|
||||
@@ -1,301 +0,0 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Single file processing functions for handling individual files.
|
||||
"""
|
||||
|
||||
import os
|
||||
import tempfile
|
||||
import zipfile
|
||||
import logging
|
||||
from typing import Dict, List, Tuple, Optional
|
||||
from pathlib import Path
|
||||
|
||||
# Configure logger
|
||||
logger = logging.getLogger('app')
|
||||
|
||||
from utils.file_utils import download_file
|
||||
|
||||
# Try to import excel/csv processor, but handle if dependencies are missing
|
||||
try:
|
||||
from utils.excel_csv_processor import (
|
||||
is_excel_file, is_csv_file, process_excel_file, process_csv_file
|
||||
)
|
||||
EXCEL_CSV_SUPPORT = True
|
||||
except ImportError as e:
|
||||
logger.warning(f"Excel/CSV processing not available: {e}")
|
||||
EXCEL_CSV_SUPPORT = False
|
||||
|
||||
# Fallback functions
|
||||
def is_excel_file(file_path):
|
||||
return file_path.lower().endswith(('.xlsx', '.xls'))
|
||||
|
||||
def is_csv_file(file_path):
|
||||
return file_path.lower().endswith('.csv')
|
||||
|
||||
def process_excel_file(file_path):
|
||||
return "", []
|
||||
|
||||
def process_csv_file(file_path):
|
||||
return "", []
|
||||
|
||||
|
||||
async def process_single_file(
|
||||
unique_id: str,
|
||||
group_name: str,
|
||||
filename: str,
|
||||
original_path: str,
|
||||
local_path: str
|
||||
) -> Dict:
|
||||
"""
|
||||
Process a single file and generate document.txt, pagination.txt, and embedding.pkl.
|
||||
|
||||
Returns:
|
||||
Dict with processing results and file paths
|
||||
"""
|
||||
# Create output directory for this file
|
||||
filename_stem = Path(filename).stem
|
||||
output_dir = os.path.join("projects", "data", unique_id, "processed", group_name, filename_stem)
|
||||
os.makedirs(output_dir, exist_ok=True)
|
||||
|
||||
result = {
|
||||
"success": False,
|
||||
"filename": filename,
|
||||
"group": group_name,
|
||||
"output_dir": output_dir,
|
||||
"document_path": os.path.join(output_dir, "document.txt"),
|
||||
"pagination_path": os.path.join(output_dir, "pagination.txt"),
|
||||
"embedding_path": os.path.join(output_dir, "embedding.pkl"),
|
||||
"error": None,
|
||||
"content_size": 0,
|
||||
"pagination_lines": 0,
|
||||
"embedding_chunks": 0
|
||||
}
|
||||
|
||||
try:
|
||||
# Download file if it's remote and not yet downloaded
|
||||
if original_path.startswith(('http://', 'https://')):
|
||||
if not os.path.exists(local_path):
|
||||
logger.info(f"Downloading {original_path} -> {local_path}")
|
||||
success = await download_file(original_path, local_path)
|
||||
if not success:
|
||||
result["error"] = "Failed to download file"
|
||||
return result
|
||||
|
||||
# Extract content from file
|
||||
content, pagination_lines = await extract_file_content(local_path, filename)
|
||||
|
||||
if not content or not content.strip():
|
||||
result["error"] = "No content extracted from file"
|
||||
return result
|
||||
|
||||
# Write document.txt
|
||||
with open(result["document_path"], 'w', encoding='utf-8') as f:
|
||||
f.write(content)
|
||||
result["content_size"] = len(content)
|
||||
|
||||
# Write pagination.txt
|
||||
if pagination_lines:
|
||||
with open(result["pagination_path"], 'w', encoding='utf-8') as f:
|
||||
for line in pagination_lines:
|
||||
if line.strip():
|
||||
f.write(f"{line}\n")
|
||||
result["pagination_lines"] = len(pagination_lines)
|
||||
else:
|
||||
# Generate pagination from text content
|
||||
pagination_lines = generate_pagination_from_text(result["document_path"],
|
||||
result["pagination_path"])
|
||||
result["pagination_lines"] = len(pagination_lines)
|
||||
|
||||
# Generate embeddings
|
||||
try:
|
||||
embedding_chunks = await generate_embeddings_for_file(
|
||||
result["document_path"], result["embedding_path"]
|
||||
)
|
||||
result["embedding_chunks"] = len(embedding_chunks) if embedding_chunks else 0
|
||||
result["success"] = True
|
||||
|
||||
except Exception as e:
|
||||
result["error"] = f"Embedding generation failed: {str(e)}"
|
||||
logger.error(f"Failed to generate embeddings for {filename}: {str(e)}")
|
||||
|
||||
except Exception as e:
|
||||
result["error"] = f"File processing failed: {str(e)}"
|
||||
logger.error(f"Error processing file {filename}: {str(e)}")
|
||||
|
||||
return result
|
||||
|
||||
|
||||
async def extract_file_content(file_path: str, filename: str) -> Tuple[str, List[str]]:
|
||||
"""Extract content from various file formats."""
|
||||
|
||||
# Handle zip files
|
||||
if filename.lower().endswith('.zip'):
|
||||
return await extract_from_zip(file_path, filename)
|
||||
|
||||
# Handle Excel files
|
||||
elif is_excel_file(file_path):
|
||||
return await extract_from_excel(file_path, filename)
|
||||
|
||||
# Handle CSV files
|
||||
elif is_csv_file(file_path):
|
||||
return await extract_from_csv(file_path, filename)
|
||||
|
||||
# Handle text files
|
||||
else:
|
||||
return await extract_from_text(file_path, filename)
|
||||
|
||||
|
||||
async def extract_from_zip(zip_path: str, filename: str) -> Tuple[str, List[str]]:
    """Extract content from zip file."""
    import shutil

    content_parts = []
    pagination_lines = []
    temp_dir = None

    try:
        with zipfile.ZipFile(zip_path, 'r') as zip_ref:
            # Extract to temporary directory
            temp_dir = tempfile.mkdtemp(prefix=f"extract_{Path(filename).stem}_")
            zip_ref.extractall(temp_dir)

            # Process extracted files
            for root, dirs, files in os.walk(temp_dir):
                for file in files:
                    if file.lower().endswith(('.txt', '.md', '.xlsx', '.xls', '.csv')):
                        file_path = os.path.join(root, file)

                        try:
                            file_content, file_pagination = await extract_file_content(file_path, file)

                            if file_content:
                                content_parts.append(f"# Page {file}")
                                content_parts.append(file_content)
                                pagination_lines.extend(file_pagination)

                        except Exception as e:
                            logger.error(f"Error processing extracted file {file}: {str(e)}")

    except Exception as e:
        logger.error(f"Error extracting zip file {filename}: {str(e)}")
        return "", []

    finally:
        # Clean up the temporary directory even if extraction or processing failed
        if temp_dir and os.path.exists(temp_dir):
            shutil.rmtree(temp_dir, ignore_errors=True)

    return '\n\n'.join(content_parts), pagination_lines
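
One way to make the temporary-directory handling in this step self-cleaning is `tempfile.TemporaryDirectory`, combined with a check that each member stays inside the extraction root. The sketch below assumes only plain-text members are of interest. `read_text_members`, `TEXT_SUFFIXES` and `demo_zip` are hypothetical names, and the code is synchronous, unlike the coroutine above.

```python
import tempfile
import zipfile
from pathlib import Path
from typing import Dict

TEXT_SUFFIXES = (".txt", ".md")  # assumption: only plain-text members are read here


def read_text_members(zip_path: str) -> Dict[str, str]:
    """Extract a zip into a temporary directory and return {member name: text}."""
    contents: Dict[str, str] = {}
    with tempfile.TemporaryDirectory(prefix="extract_") as temp_dir:
        base = Path(temp_dir).resolve()
        with zipfile.ZipFile(zip_path, "r") as archive:
            for member in archive.infolist():
                target = (base / member.filename).resolve()
                # Skip members that would land outside the temporary directory.
                if base not in target.parents:
                    continue
                if member.is_dir() or not member.filename.lower().endswith(TEXT_SUFFIXES):
                    continue
                archive.extract(member, base)
                contents[member.filename] = target.read_text(encoding="utf-8").strip()
    # The temporary directory has been removed by this point, even on errors.
    return contents


def demo_zip() -> Dict[str, str]:
    """Build a small archive with one text member and one non-text member."""
    with tempfile.TemporaryDirectory() as tmp:
        zip_path = Path(tmp) / "sample.zip"
        with zipfile.ZipFile(zip_path, "w") as archive:
            archive.writestr("a.txt", "alpha\n")
            archive.writestr("img.png", "not text")
        return read_text_members(str(zip_path))
```

The context manager removes the directory on every exit path, so no explicit cleanup call is needed.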
|
||||
|
||||
|
||||
async def extract_from_excel(file_path: str, filename: str) -> Tuple[str, List[str]]:
|
||||
"""Extract content from Excel file."""
|
||||
try:
|
||||
document_content, pagination_lines = process_excel_file(file_path)
|
||||
|
||||
if document_content:
|
||||
content = f"# Page {filename}\n{document_content}"
|
||||
return content, pagination_lines
|
||||
else:
|
||||
return "", []
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error processing Excel file {filename}: {str(e)}")
|
||||
return "", []
|
||||
|
||||
|
||||
async def extract_from_csv(file_path: str, filename: str) -> Tuple[str, List[str]]:
|
||||
"""Extract content from CSV file."""
|
||||
try:
|
||||
document_content, pagination_lines = process_csv_file(file_path)
|
||||
|
||||
if document_content:
|
||||
content = f"# Page {filename}\n{document_content}"
|
||||
return content, pagination_lines
|
||||
else:
|
||||
return "", []
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error processing CSV file {filename}: {str(e)}")
|
||||
return "", []
|
||||
|
||||
|
||||
async def extract_from_text(file_path: str, filename: str) -> Tuple[str, List[str]]:
|
||||
"""Extract content from text file."""
|
||||
try:
|
||||
with open(file_path, 'r', encoding='utf-8') as f:
|
||||
content = f.read().strip()
|
||||
|
||||
if content:
|
||||
return content, []
|
||||
else:
|
||||
return "", []
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error reading text file {filename}: {str(e)}")
|
||||
return "", []
|
||||
|
||||
|
||||
def generate_pagination_from_text(document_path: str, pagination_path: str) -> List[str]:
|
||||
"""Generate pagination from text document."""
|
||||
try:
|
||||
# Import embedding module for pagination
|
||||
import sys
|
||||
sys.path.append(os.path.join(os.path.dirname(__file__), '..', 'embedding'))
|
||||
from embedding import split_document_by_pages
|
||||
|
||||
pages = split_document_by_pages(str(document_path), str(pagination_path))
|
||||
|
||||
# Return pagination lines
|
||||
pagination_lines = []
|
||||
with open(pagination_path, 'r', encoding='utf-8') as f:
|
||||
for line in f:
|
||||
if line.strip():
|
||||
pagination_lines.append(line.strip())
|
||||
|
||||
return pagination_lines
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error generating pagination from text: {str(e)}")
|
||||
return []
|
||||
|
||||
|
||||
async def generate_embeddings_for_file(document_path: str, embedding_path: str) -> Optional[List]:
|
||||
"""Generate embeddings for a document."""
|
||||
try:
|
||||
# Import embedding module
|
||||
import sys
|
||||
sys.path.append(os.path.join(os.path.dirname(__file__), '..', 'embedding'))
|
||||
from embedding import embed_document
|
||||
|
||||
# Generate embeddings using paragraph chunking
|
||||
embedding_data = embed_document(
|
||||
str(document_path),
|
||||
str(embedding_path),
|
||||
chunking_strategy='paragraph'
|
||||
)
|
||||
|
||||
if embedding_data and 'chunks' in embedding_data:
|
||||
return embedding_data['chunks']
|
||||
else:
|
||||
return None
|
||||
|
||||
except Exception as e:
|
||||
logger.error(f"Error generating embeddings: {str(e)}")
|
||||
return None
|
||||
|
||||
|
||||
def check_file_already_processed(unique_id: str, group_name: str, filename: str) -> bool:
    """Check if a file has already been processed."""
    filename_stem = Path(filename).stem
    output_dir = os.path.join("projects", "data", unique_id, "processed", group_name, filename_stem)

    document_path = os.path.join(output_dir, "document.txt")
    pagination_path = os.path.join(output_dir, "pagination.txt")
    embedding_path = os.path.join(output_dir, "embedding.pkl")

    # Check if all files exist and are not empty
    if (os.path.exists(document_path) and os.path.exists(pagination_path) and
            os.path.exists(embedding_path)):

        if (os.path.getsize(document_path) > 0 and os.path.getsize(pagination_path) > 0 and
                os.path.getsize(embedding_path) > 0):
            return True

    return False
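
The completeness rule above (all three outputs exist and none is empty) can be sketched as a single `all(...)` expression over the expected file names. `outputs_complete`, `REQUIRED_OUTPUTS` and `demo_outputs` are hypothetical names used only for this illustration.

```python
import tempfile
from pathlib import Path
from typing import Tuple

REQUIRED_OUTPUTS = ("document.txt", "pagination.txt", "embedding.pkl")


def outputs_complete(output_dir: Path) -> bool:
    """True only if every required output exists and is non-empty."""
    return all(
        (output_dir / name).is_file() and (output_dir / name).stat().st_size > 0
        for name in REQUIRED_OUTPUTS
    )


def demo_outputs() -> Tuple[bool, bool]:
    """Check a complete directory, then the same directory with one empty file."""
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp)
        for name in REQUIRED_OUTPUTS:
            (out / name).write_bytes(b"x")
        complete = outputs_complete(out)
        (out / "pagination.txt").write_bytes(b"")  # simulate an interrupted write
        return complete, outputs_complete(out)
```

Treating an empty file as "not processed" means an interrupted run is retried rather than silently skipped.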
|
||||