Compare commits

...

72 Commits

Author SHA1 Message Date
朱潮
cb649d83ee Merge branch 'developing' into bot_manager
# Conflicts:
#	.features/skill/MEMORY.md
#	poetry.lock
#	requirements.txt
2026-06-08 20:07:30 +08:00
朱潮
065403223d Merge branch 'dev' into developing 2026-06-08 19:44:11 +08:00
朱潮
fabb14c66a Merge branch 'feature/enable_redis' into dev
# Conflicts:
#	poetry.lock
2026-06-08 19:43:27 +08:00
朱潮
77079539c1 refactor: remove file-parsing knowledge-base pipeline and Huey queue
The local file-parsing pipeline (upload -> Huey async parse -> generate
projects/data/.../document.txt) is no longer needed: RAG retrieval runs
against the backend vector store and does not read the local parse output,
so removing this has zero impact on existing bot Q&A.

- Delete task_queue/ (Huey queue, consumer, tasks, task status store)
- Delete parsing utils: dataset_manager, single_file_processor,
  data_merger, project_manager
- Delete db_manager.py (only managed task_status.db)
- routes/files.py: keep only POST /api/v1/upload; drop all
  parse/queue/task endpoints
- routes/projects.py: drop /tasks endpoint and task_status import
- utils/__init__.py & api_models.py: remove exports/models for deleted
  modules and queue task models
- start_unified.py & start_all_optimized.sh: no longer launch the
  queue consumer
- Drop huey dependency (keep redis)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 19:40:58 +08:00
朱潮
e6b28818bf add REDIS_URL 2026-06-08 19:00:55 +08:00
朱潮
955a064ee3 add reviewer 2026-06-08 18:32:51 +08:00
朱潮
0983623a75 Merge branch 'developing' into dev 2026-06-05 14:53:31 +08:00
朱潮
0af1890493 Merge branch 'prod' of https://github.com/sparticleinc/catalog-agent into prod 2026-06-03 08:57:30 +08:00
朱潮
93e79f1a0e Merge branch 'staging' of https://github.com/sparticleinc/catalog-agent into staging 2026-06-02 21:13:07 +08:00
朱潮
66cde77117 Merge branch 'feature/fix_large_results_backend' into dev 2026-06-02 20:57:18 +08:00
朱潮
6bccd89e9a Fix: switch the two on-disk backends in agent/deep_assistant.py to virtual_mode=True 2026-06-02 20:56:43 +08:00
csh28
2f5a8e204c Merge branch 'staging' into prod 2026-06-01 22:54:56 +08:00
github-actions[bot]
2205d830e1
chore(.features): sync feature memory (auto) (#43)
Generated by sparticle-toolkit feature-memory-sync

Co-authored-by: Denya0529 <217564326+Denya0529@users.noreply.github.com>
2026-05-29 17:13:19 +00:00
csh28
26fc9e5226
Merge pull request #42 from sparticleinc/codex/tool-metrics-for-agent-first-char-staging
Add tool call metrics middleware to staging
2026-05-29 18:22:37 +08:00
csh28
9f0ae25233 Add tool call metrics middleware 2026-05-29 18:22:36 +08:00
csh28
667054f39f
Merge pull request #41 from sparticleinc/codex/tool-metrics-for-agent-first-char
Add tool call metrics middleware
2026-05-29 11:22:41 +08:00
csh28
ecc2687f7b Add tool call metrics middleware 2026-05-29 11:10:31 +08:00
朱潮
5173ca13b4 merge 2026-05-28 22:15:26 +08:00
csh28
6ff9555e91
Merge pull request #38 from sparticleinc/hotfix/staging-local-embedding-dimensions
Hotfix: skip dimensions for local embedding model
2026-05-28 10:52:50 +08:00
csh28
c5dbabf558 Skip dimensions for local embedding model 2026-05-28 10:52:27 +08:00
csh28
63f1f836c4
Merge branch 'staging' into dev 2026-05-28 10:42:07 +08:00
csh28
0260663aa9
Merge pull request #34 from sparticleinc/fix/local-embedding-dimensions
Skip dimensions for local embedding model
2026-05-28 10:40:04 +08:00
csh28
8c4659f590 Skip dimensions for local embedding model 2026-05-28 10:39:21 +08:00
朱潮
be44c243fd Merge branch 'feature/mcp-ui' into staging 2026-05-28 09:55:30 +08:00
朱潮
091659d693 Merge branch 'developing' into staging 2026-05-28 09:55:22 +08:00
csh28
667fdb8a3b Merge branch 'staging' into prod 2026-05-26 22:18:02 +08:00
朱潮
f05bb34e1e Merge branch 'dev' of https://github.com/sparticleinc/catalog-agent into dev 2026-05-26 20:34:31 +08:00
朱潮
9a496b955c Merge branch 'feature/mcp-ui' into dev 2026-05-26 20:32:49 +08:00
csh28
fb5562f977 fix: align mem0 pgvector config with validation 2026-05-26 10:36:52 +08:00
Denya0529
a4b32aad7f
Merge pull request #33 from sparticleinc/bot/feature-memory-sync
chore(.features): feature memory sync — monthly
2026-05-23 13:53:28 +09:00
Denya0529
68c2472d82 chore(.features): sync feature memory (auto)
Generated by sparticle-toolkit feature-memory-sync
2026-05-23 03:34:30 +00:00
csh28
1dd45107af fix: align mem0 pgvector config with validation 2026-05-22 11:28:23 +08:00
csh28
dc2a212f35 Merge branch 'optimize/docker-size-baseline' into dev 2026-05-22 10:59:43 +08:00
csh28
5129fdcc05 fix: import asyncio. 2026-05-22 10:59:34 +08:00
csh28
e38fd17b97 Merge docker embedding dependency optimization 2026-05-21 20:31:44 +08:00
csh28
45cf140472 Optimize catalog-agent image embedding dependencies 2026-05-21 20:30:54 +08:00
朱潮
51f88c8c2d Merge branch 'feature/mcp-ui' into dev 2026-05-21 19:47:04 +08:00
朱潮
3ebd006b6e Merge branch 'feature/mcp-ui' into dev 2026-05-21 18:33:57 +08:00
朱潮
c14a22bbd1 Merge branch 'feature/mcp-ui' into dev 2026-05-20 19:31:01 +08:00
朱潮
44ac8103d3 update dep 2026-05-20 16:44:46 +08:00
朱潮
881901c504 Merge branch 'feature/mcp-ui' into dev 2026-05-20 16:41:27 +08:00
朱潮
b54f0848cd update dep 2026-05-20 16:40:35 +08:00
朱潮
37025e4ce6 update dep 2026-05-20 16:39:11 +08:00
朱潮
5b378fcdf6 update dep 2026-05-20 16:12:29 +08:00
朱潮
86c58ffccf update dep 2026-05-20 16:01:56 +08:00
朱潮
3fd12e9fc6 Merge branch 'dev' of https://github.com/sparticleinc/catalog-agent into dev 2026-05-20 14:59:46 +08:00
朱潮
46cf0933a0 Merge branch 'feature/mcp-ui' into dev 2026-05-20 14:58:19 +08:00
autobee-sparticle
a1fffa311b
Merge pull request #32 from sparticleinc/bugfix/autobee-20260519-novare-contact-routing
[NOVARE] Fix contact-info queries being misrouted to message sending
2026-05-19 23:02:12 +09:00
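ae99d7a177 fix(novare): fix contact-info queries being misrouted to message sending
## Problem
When a user asks for 「プランニンググループの連絡先」 (contact details for the planning group), the bot misroutes to
wowtalk_send_message_to_member (send a message) instead of looking up the contact details.

## Fix
1. Add a "contact-info lookup scenario" example that clearly separates lookup from sending
2. Add rule 9: rule for contact-info / org-chart lookups
3. Add rule 10: fallback-reply rule for tool failures, to avoid empty loops

## Key difference
- 「連絡先を知りたい」 (wants to know the contact details) → lookup scenario, use rag_retrieve
- 「連絡して」 (please contact them) → sending scenario, use wowtalk_send_message_to_member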

Ref: #2993

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-19 13:52:18 +00:00
朱潮
2f4ad22293 Merge branch 'feature/mcp-ui' into dev 2026-05-19 19:20:09 +08:00
qianlir
3e142b22e5
Merge pull request #31 from sparticleinc/revert-29-fix/add-opensearch-py-onprem
Revert "Fix/add opensearch py onprem"
2026-05-19 11:15:17 +08:00
qianlir
89e9f9b6d6
Revert "Fix/add opensearch py onprem" 2026-05-19 11:14:54 +08:00
朱潮
76f04d9b24 Merge branch 'dev' of https://github.com/sparticleinc/catalog-agent into dev 2026-05-19 11:08:28 +08:00
qianlir
9f9c78548d
Merge pull request #29 from sparticleinc/fix/add-opensearch-py-onprem
Fix/add opensearch py onprem
2026-05-18 19:54:28 +08:00
qianlir
538be7da2c feat(deps): add opensearch-py for pmda-drug-info skill MCP server
pmda-drug-info skill's pmda_server.py imports opensearchpy to query
the OpenSearch pmda_sections index. catalog-agent base image already
ships psycopg (for PG drug_master queries) but was missing
opensearch-py, so the MCP stdio server failed at import time with
ModuleNotFoundError → 0 tools exposed to the bot.

Add opensearch-py >=2.2.0,<3.0.0 to pyproject.toml dependencies and
the matching pinned line (opensearch-py==2.7.1) to requirements.txt.

Verified pmda-drug-info needs:
  from opensearchpy import OpenSearch       # os_client.py
  from opensearchpy.helpers import bulk     # ingest path

After this image is rebuilt and deployed to onprem-dev, MCP stdio
servers loaded from skill plugin.json start cleanly and tools/list
returns the 10 tools from pmda_tools.json.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 19:51:38 +08:00
csh28
a0d285db28 Merge branch 'staging' into prod 2026-05-15 18:45:40 +08:00
朱潮
035b21bb43 Merge branch 'developing' into staging 2026-05-14 18:15:15 +08:00
朱潮
adbe5b5c65 Merge branch 'developing' into staging 2026-05-14 17:15:34 +08:00
朱潮
73c2051490 Merge branch 'developing' into staging 2026-05-14 16:46:22 +08:00
朱潮
fa5ddf0a4f Merge branch 'developing' into staging 2026-05-14 15:41:09 +08:00
朱潮
c830a0d6de Merge branch 'developing' into staging 2026-05-14 07:43:34 +08:00
csh28
d0480cca1b Merge branch 'feature/agent-final-answer-first-char' into staging 2026-05-11 21:29:07 +08:00
csh28
bace14838b Merge feature/agent-final-answer-first-char into staging 2026-05-11 17:46:31 +08:00
朱潮
60dd9c9332 Merge branch 'developing' into staging 2026-05-07 23:01:19 +08:00
朱潮
b9f750be96 Merge branch 'developing' into staging 2026-05-07 21:22:54 +08:00
朱潮
781f1af3f6 Merge branch 'dev' into staging 2026-04-29 19:54:53 +08:00
朱潮
100007f66b Merge branch 'dev' into staging 2026-04-29 17:28:59 +08:00
朱潮
1c90ab7956 Merge branch 'master' into staging 2026-04-21 18:31:28 +08:00
朱潮
d2bb915191 Merge branch 'developing' into prod 2026-04-20 22:38:30 +08:00
朱潮
77367673e5 Merge branch 'developing' into staging 2026-04-20 22:38:18 +08:00
朱潮
28755b3383 Merge branch 'developing' into prod 2026-04-20 20:33:53 +08:00
朱潮
4de489e69e Merge branch 'developing' into staging 2026-04-20 20:17:50 +08:00
38 changed files with 1656 additions and 4567 deletions

View File

@ -1,15 +1,13 @@
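# Skill Feature
> Scope: skill package management service - core implementation
> Last updated: 2026-05-20
> Last updated: 2026-05-26
## Current status
The skill system supports two sources: official skills (`./skills/`) and user skills (`projects/uploads/{bot_id}/skills/`). It supports a hook system and MCP server configuration, with metadata defined in SKILL.md or plugin.json.
A batch of **pure `SKILL.md` business skill MVPs** has been added for research, summarization, reporting, and intelligence orchestration; low-level file handling and external retrieval continue to reuse existing skills.
Since 2026-04, a skill package can define sub-agents under `agents/*.md` (loaded by `SubAgentMiddleware`); when the Daytona sandbox is enabled, the skill load path becomes `/workspace/skills` inside the sandbox.
MCP UI skills have been reworked to the MCP Apps pattern: tools return data, and the static HTML App is loaded by the host, then receives data via postMessage and renders it.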
@ -24,10 +22,29 @@ MCP UI skills have been reworked to the MCP Apps pattern: tools return data, and the static HTML
- `routes/mcp_resources.py` - REST entry point for MCP App static HTML resources
- `skills/` - official skills directory
- `skills_developing/` - skills under development
- `agent/subagent_loader.py` - scans skill `agents/*.md` files and loads sub-agents (introduced 2026-05)
- `agent/mcp_trace_meta.py` - monkey-patches `ClientSession.call_tool` to inject `trace_id` into the MCP `_meta` of `rag_retrieve` / `table_rag_retrieve` (introduced 2026-05)
## Recent highlights
- [2026-05-26](changelog/2026-Q2.md): skills gain a `category` field: `routes/skill_manager.py` adds `category` to `SkillItem` / `SkillValidationResult`, parsed from `plugin.json` or the `SKILL.md` frontmatter (official skills default to `"other"`, user skills to `"custom"`); a batch update also adds `category` to the metadata of many skills under the common/developing/onprem/support paths, and `data-dashboard` / `mcp-ui` are classified as `Interactive UI` (`203dcf4`, `3ada55a`, `9658588`)
- [2026-05-26](changelog/2026-Q2.md): a large merge from the developing branch adds several skills: `ai-ppt-generator` (Baidu AI PPT), `nfc-medicine-lookup` (NFC medicine lookup), `ppt-outline` (PPT outline / HTML slide deck), and `z-card-image` (illustrations / card images); the `skills/linggan/*` skills also return via the merge (`3ada55a`)
- [2026-05-23](changelog/2026-Q2.md): new MCP App skill `skills/developing/ecommerce-storefront/`, with two HTML Apps (`product-list` / `order-confirm`) and its own `ecommerce_server.py` MCP server; `docs/mcp-app-training.md` (about 1063 lines) lands at the same time as MCP App training material (`9d001c8`)
- [2026-05-21](changelog/2026-Q2.md): in Daytona sandbox mode, `init_agent` writes a `BASH_ENV` file inside the sandbox, injecting `ASSISTANT_ID` / `USER_IDENTIFIER` / `TRACE_ID` / `ENABLE_SELF_KNOWLEDGE` plus the shell environment variables from `config.shell_env` (`776acc2`)
- [2026-05-12](changelog/2026-Q2.md): batch refinement of `retrieval-policy*.md` across 6, then 10, skill variants, aligning the policy wording across the onprem/support/autoload paths (`be96f24`, `7b4f03d`)
- [2026-05-11](changelog/2026-Q2.md): new sub-agent (SubAgent) support: skill packages expose sub-agents through `agents/*.md`, loaded by `SubAgentMiddleware`; ships with the `pmda-drug-info` skill as an example (`5b634bc`)
- [2026-05-11](changelog/2026-Q2.md): `pmda_server.py` of `pmda-drug-info` heavily rewritten as a mock implementation (`a92096a`)
- [2026-05-11](changelog/2026-Q2.md): `retrieval-policy.md` content synced across 4 skill variants (`e6d1698`)
- [2026-05-08](changelog/2026-Q2.md): by monkey-patching `ClientSession.call_tool`, trace_id is passed through to the MCP `_meta` of `rag_retrieve` / `table_rag_retrieve` (`1f06450`)
- [2026-05-06](changelog/2026-Q2.md): the support branch gains the `kfs-answer` skill (`a9227b8`)
- [2026-05-06](changelog/2026-Q2.md): fixed Daytona sandbox incremental sync missing symlinks: `find -type f` does not cover symlinks, and `tar.add` does not dereference by default, which produced dangling symlinks; dataset paths are also unified to the plural `datasets/` (`3c0fa49`)
- [2026-04-24](changelog/2026-Q2.md): on the non-streaming response path, `_execute_post_agent_hooks` now runs non-blocking via `asyncio.create_task`; `ToolOutputLengthMiddleware` is temporarily commented out (`45a9494`)
- [2026-04-23](changelog/2026-Q2.md): PrePrompt hook content is now injected into the system prompt template via the `{hook_content}` placeholder instead of being appended to the end of the prompt (`51fbf01`)
- [2026-04-23](changelog/2026-Q2.md): Daytona sandbox integration: `init_agent` loads in parallel and its returned tuple gains a `sandbox` field; in sandbox mode `skills_sources` becomes `/workspace/skills` and `agent_dir_path` becomes `/workspace` (`c9e0789`, `8446dab`)
- [2026-04-22](changelog/2026-Q2.md): two in-development skills, `rag-retrieve-no-citation` and `novare-context`, added under `skills/developing/` (`7a30e52`)
- 2026-05-20: `mcp-ui` and `data-dashboard` move to the standard MCP Apps pattern: App HTML lives in the skill's `apps/` directory, and the host loads it and then posts data via postMessage
- 2026-04-20: added `retrieval-policy-forbidden-self-knowledge.md` for `rag-retrieve`, forbidding the model from filling in answers with its own knowledge in knowledge Q&A scenarios and requiring answers to be strictly grounded in retrieved evidence
- 2026-04-19: environment variable `SKILLS_SUBDIR` renamed to `PROJECT_NAME`, used to select the `skills/{PROJECT_NAME}` and `skills/autoload/{PROJECT_NAME}` directories
- 2026-04-19: strengthened autoload dedup and stale cleanup in `create_robot_project`: the autoload directory is now included in managed cleanup, so a stale `rag-retrieve` no longer lingers in `rag-retrieve-only` scenarios
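@ -59,6 +76,16 @@ MCP UI skills have been reworked to the MCP Apps pattern: tools return data, and the static HTML
- ⚠️ Upload size limit: 50 MB (ZIP); at most 500 MB after extraction
- ⚠️ Compression ratio check: at most 100:1 (guards against zip bombs)
- ⚠️ Symlink check: archives containing symlinks are refused
- ⚠️ **Sub-agents with the same name are silently last-wins**: when `subagent_loader._parse_agent_md` scans `agents/*.md` across skills, it deduplicates by the `name` field and **the later-scanned definition overrides the earlier one**, logging only a warning and raising no error. When several skills expose sub-agents, authors must keep the names distinct themselves.
- ⚠️ **`SubAgentMiddleware` ordering**: it must be inserted after `CustomFilesystemMiddleware` and before `AnthropicPromptCachingMiddleware`. This matches the official order in `deepagents.create_deep_agent`; do not move this segment when reordering middleware in `create_custom_cli_agent`.
- ⚠️ **Skill paths differ in Daytona mode**: with `DAYTONA_ENABLED=true`, the `skills_sources` of `enable_skills` is `/workspace/skills` (inside the sandbox) and the system prompt's `agent_dir_path` is `/workspace`; hooks and scripts with hard-coded local paths must handle both environments.
- ⚠️ **PostAgent hooks are fire-and-forget on the non-streaming branch**: `routes/chat.py` starts the hook with `asyncio.create_task`; the caller neither waits for it nor sees its exceptions, so a hook failure is caught only by the hook's own logger.
- ⚠️ **MCP `_meta.trace_id` is injected by a global monkey-patch**: once `agent/mcp_trace_meta.patch_mcp_client_session_trace_meta()` has been called at the entry of `get_tools_from_mcp()`, `mcp.ClientSession.call_tool` stays wrapped permanently; `_meta.trace_id` is injected only for tool names in the set `{"rag_retrieve", "table_rag_retrieve"}`, and extending that allowlist means editing the `_TRACE_META_TOOL_NAMES` constant directly.
- ⚠️ **PrePrompt hook placement is decided by the template**: since 2026-04-23, hook output is injected into `prompt/system_prompt.md` through the `{hook_content}` placeholder and is no longer appended to the end of the prompt; a custom template must contain `{hook_content}`, otherwise the hook content is lost.
- ⚠️ **`init_agent` now returns 3 elements**: after the Daytona change, `init_agent` returns `(agent, checkpointer, sandbox)`; callers must update their unpacking.
- ⚠️ **Default values for skill `category`**: in the `SkillItem.category` returned by the API, official skills fall back to `"other"` and user skills fall back to `"custom"`; a frontend category view must recognize both sentinels and must not assume official and user skills share one default.
- ⚠️ **Two entry points for `category`**: a skill can declare `category` in both `.claude-plugin/plugin.json` and the `SKILL.md` frontmatter. `get_skill_metadata` tries `parse_plugin_json` first and falls back to `parse_skill_frontmatter` only when the skill package has no plugin.json, so when the two disagree plugin.json wins.
- ⚠️ **Daytona shell_env is injected via a file, not the process env**: `init_agent` writes `export VAR=...` lines with `cat > $REMOTE_BASH_ENV_PATH`; they take effect only when a shell (bash) inside the sandbox loads them through `BASH_ENV`. Non-Daytona mode, or scripts not started via bash, do not see these variables. Adding injected entries means editing the `_shell_env` dict in `init_agent` directly.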
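
The three upload guards at the top of this list (size limits, compression ratio, symlink refusal) could be checked roughly as follows. This is a minimal sketch and not the project's validator: `check_skill_zip` and the constant names are hypothetical, and applying the 100:1 ratio per entry (rather than to the whole archive) is an assumption.

```python
import io
import stat
import zipfile

# Limits taken from the notes above; the names are hypothetical.
MAX_UPLOAD_BYTES = 50 * 1024 * 1024       # 50 MB compressed upload
MAX_EXTRACTED_BYTES = 500 * 1024 * 1024   # 500 MB after extraction
MAX_COMPRESSION_RATIO = 100               # 100:1, checked per entry (assumption)


def check_skill_zip(data: bytes) -> None:
    """Raise ValueError if the archive violates one of the documented limits."""
    if len(data) > MAX_UPLOAD_BYTES:
        raise ValueError("upload exceeds 50 MB")
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        total = 0
        for info in zf.infolist():
            # Unix mode bits are stored in the upper 16 bits of external_attr.
            mode = info.external_attr >> 16
            if stat.S_ISLNK(mode):
                raise ValueError(f"symlink entry not allowed: {info.filename}")
            total += info.file_size
            if info.compress_size and info.file_size / info.compress_size > MAX_COMPRESSION_RATIO:
                raise ValueError(f"suspicious compression ratio: {info.filename}")
        if total > MAX_EXTRACTED_BYTES:
            raise ValueError("extracted size exceeds 500 MB")
```

The sizes come from the archive's own headers, so a sketch like this only screens declared sizes; a real extractor would also need to cap the bytes actually written.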
## Skill directory structure

View File

@ -1,5 +1,392 @@
# 2026-Q2 Skill Changelog
Important changes this quarter, in reverse chronological order.
---
## 2026-05-26: skill `category` field fully wired in
**Type**: new feature
**Background**: the number of skills keeps growing (dozens across the common / developing / onprem / support / linggan / autoload paths). The list API needs to let the frontend group skills by category, but the metadata had no `category` field.
**Changes**:
- `routes/skill_manager.py`:
  - `SkillItem` model gains `category: str = "other"`
  - `SkillValidationResult` dataclass gains an optional `category: Optional[str]`
  - `parse_plugin_json` reads `plugin_config.get('category')`; `parse_skill_frontmatter` reads `metadata.get('category')` from the frontmatter
  - `get_official_skills` falls back to `"other"`; `get_user_skills` falls back to `"custom"`
  - `get_skill_metadata_legacy` writes `category` into the returned dict when it is non-empty (keeps backward compatibility)
- Batch-added a `category` field to the `.claude-plugin/plugin.json` or `SKILL.md` frontmatter of many skills under common / developing / onprem / support.
- `category` of `data-dashboard` and `mcp-ui` corrected from `"Data & Retrieval"` to `"Interactive UI"` (a better fit for how an MCP App renders).
**Root cause**: N/A (new feature)
**Impact**:
- Items returned by `GET /api/v1/skill/list` now include a `category` field; the frontend can group or filter by category.
- The skill metadata convention is extended: new skills should declare `category` in plugin.json or the SKILL.md frontmatter, otherwise they land on the `"other"` / `"custom"` fallback.
- When both `plugin.json.category` and `SKILL.md.category` exist, the former wins (`get_skill_metadata` prefers plugin.json).
**Related files**:
- `routes/skill_manager.py`
- `skills/common/data-dashboard/.claude-plugin/plugin.json`
- `skills/common/mcp-ui/.claude-plugin/plugin.json`
- plus a batch of `skills/{common,developing,onprem,support}/*/SKILL.md` and `.claude-plugin/plugin.json` files
**Commit/PR**: `203dcf4`, `3ada55a`, `9658588`
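
The precedence and fallback rules in this entry can be condensed into one small function. This is a sketch of the described behaviour, not the code in `routes/skill_manager.py`: `resolve_category` is a hypothetical name, and it assumes that an existing plugin.json without a `category` does not fall through to the frontmatter.

```python
from typing import Optional


def resolve_category(
    plugin_json: Optional[dict],
    frontmatter: Optional[dict],
    *,
    is_official: bool,
) -> str:
    """Pick a skill's category: plugin.json first, frontmatter only if plugin.json is absent."""
    # Assumption: the frontmatter is consulted only when there is no plugin.json at all.
    source = plugin_json if plugin_json is not None else (frontmatter or {})
    category = source.get("category")
    if category:
        return category
    # Documented fallbacks: "other" for official skills, "custom" for user skills.
    return "other" if is_official else "custom"
```

A frontend grouping by category would then treat both `"other"` and `"custom"` as "uncategorized" buckets.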
---
## 2026-05-26: batch of new skills from the developing branch
**Type**: new feature
**Background**: [TBD]: a batch of new skills landed together through the developing→staging merge, along with the return of the linggan skills.
**Changes**:
- Added `skills/developing/ai-ppt-generator/`: calls Baidu AI to generate PPTs, picking a template by topic automatically (business / tech / education / creative / Chinese style, etc.); `category: Document Processing`.
- Added `skills/developing/nfc-medicine-lookup/`: looks up medicine information by NFC chip ID or medicine name, with a voice-assistant interaction style aimed at elderly users; `category: Developer Tools`.
- Added `skills/developing/ppt-outline/`: generates PPT outlines and standalone HTML slide decks (four styles: dark / light / tech / minimal); `category: Document Processing`.
- Added `skills/developing/z-card-image/`: generates illustrations, cover images, card images, social-post share images, and similar; depends on `python3` + `google-chrome`.
- `skills/developing/static-hosting/SKILL.md` expanded from a one-line note into a full 80-line skill; a batch of existing SKILL.md / plugin.json files also gain `category`.
- The `skills/linggan/*` skills (baidu-search / bot-self-modifier / caiyun-weather / competitor-news-intel / contract-document-generator / financial-report-generator / market-academic-insight / ragflow-loader / sales-decision-report / seedream / static-hosting / static-site-deploy / voice-notification / weather-china) return to staging via the merge.
**Root cause**: N/A
**Impact**:
- The developing skill pool grows by about 5 new business skills; the linggan series reappears on staging.
- Most new skills are SKILL.md-type business skills that follow the pure-markdown "workflow + template" pattern; `ai-ppt-generator` and `z-card-image` depend on an external `BAIDU_API_KEY` and the `google-chrome` binary.
**Related files**:
- `skills/developing/ai-ppt-generator/SKILL.md`
- `skills/developing/nfc-medicine-lookup/SKILL.md`
- `skills/developing/ppt-outline/SKILL.md`
- `skills/developing/z-card-image/SKILL.md`
- `skills/developing/static-hosting/SKILL.md`
- `skills/linggan/**` (returned)
**Commit/PR**: `3ada55a`
---
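## 2026-05-23: new ecommerce-storefront skill (MCP App type) + MCP App training document
**Type**: new feature
**Background**: the MCP App pattern (host loads static HTML and passes data via postMessage) already works in `mcp-ui` and `data-dashboard`. An e-commerce sample skill was needed to demonstrate App rendering for multi-step interactions such as browsing products, selecting items, and confirming an order, along with a written MCP App development guide.
**Changes**:
- Added `skills/developing/ecommerce-storefront/`:
  - `apps/product-list.html` (288 lines) and `apps/order-confirm.html` (233 lines), two static Apps.
  - `ecommerce_server.py` (213 lines) as the bundled MCP server; `ecommerce_tools.json` defines the tool schemas.
  - `hooks/ecommerce_guide.md` + `hooks/pre_prompt.py` inject skill usage guidance into the system prompt.
  - `mcp_common.py` (252 lines) reuses the shared MCP tool base classes.
  - `.claude-plugin/plugin.json` configures the PrePrompt hook and the stdio MCP server; `category: Developer Tools`.
- Added `docs/mcp-app-training.md` (about 1063 lines): development training material for the MCP App pattern.
**Root cause**: N/A
**Impact**:
- The developing skill pool gains one MCP App skill, structured like `mcp-ui` / `data-dashboard`.
- MCP App developers have complete training material to refer to.
**Related files**:
- `skills/developing/ecommerce-storefront/**`
- `docs/mcp-app-training.md`
**Commit/PR**: `9d001c8`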
---
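## 2026-05-21: Daytona sandbox injects shell_env into BASH_ENV
**Type**: new feature
**Background**: skill scripts inside the Daytona sandbox need to read runtime context such as `ASSISTANT_ID` / `USER_IDENTIFIER` / `TRACE_ID`, but the host process env cannot be passed into the sandbox directly.
**Changes**:
- `agent/deep_assistant.py` `init_agent`: when `sandbox is not None and sandbox_type == "daytona"`, it builds the `_shell_env` dict (`ASSISTANT_ID` / `USER_IDENTIFIER` / `TRACE_ID` / `ENABLE_SELF_KNOWLEDGE` plus `config.shell_env`), composes a `cd {REMOTE_WORKSPACE_ROOT}\n` line followed by `export VAR="..."` lines, and writes them into the sandbox via `sandbox.execute("cat > $REMOTE_BASH_ENV_PATH << 'ENVEOF' ... ENVEOF")`.
- `utils/daytona_sync.py` provides the constants `REMOTE_BASH_ENV_PATH` / `REMOTE_WORKSPACE_ROOT`.
- `AgentConfig` gains `shell_env: Optional[Dict[str, str]]` (callers can append custom env entries).
**Root cause**: N/A
**Impact**:
- Skill scripts started via bash inside the sandbox can read the runtime context with calls such as `os.environ.get("ASSISTANT_ID")`.
- Takes effect only in Daytona sandbox mode; local processes, or processes not started via bash, do not receive the variables injected through `BASH_ENV`.
- Adding injected entries (new fixed environment variables) requires editing the `_shell_env` dict in `init_agent` directly.
**Related files**:
- `agent/deep_assistant.py`
- `utils/daytona_sync.py`
**Commit/PR**: `776acc2`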
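
To illustrate what the injected file might contain, here is a minimal sketch that renders a `cd` line followed by `export` lines from a dict. `render_bash_env` is a hypothetical helper, not the code in `init_agent`; it also swaps the `export VAR="..."` double-quoting described above for `shlex.quote`, which is one conservative way to keep values with spaces or quotes intact.

```python
import shlex
from typing import Mapping


def render_bash_env(workspace_root: str, env: Mapping[str, str]) -> str:
    """Render the text of a file that bash can load through BASH_ENV."""
    lines = [f"cd {shlex.quote(workspace_root)}"]
    for key, value in env.items():
        # shlex.quote is a substitution for the double quotes in the changelog entry.
        lines.append(f"export {key}={shlex.quote(value)}")
    return "\n".join(lines) + "\n"
```

Bash reads the file named by `BASH_ENV` only when it starts non-interactively, which matches the caveat above that processes not launched through bash never see these variables.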
---
## 2026-05-12: batch refinement of retrieval policy wording
**Type**: content adjustment
**Background**: [TBD]
**Changes**:
- `be96f24`: adjusts the wording of `retrieval-policy-forbidden-self-knowledge.md` across 6 skill variants (the versions under the onprem / support / autoload-onprem / autoload-onprem-rag-only / autoload-support-rag-only paths, plus one `retrieval-policy.md`).
- `7b4f03d`: syncs both policies, `retrieval-policy.md` and `retrieval-policy-forbidden-self-knowledge.md`, across a wider set of 10 files so the policy wording stays consistent across skill variants.
**Root cause**: N/A (not a bug)
**Impact**: all skills using the `rag-retrieve` / `rag-retrieve-only` hooks behave consistently in policy terms; deployments of both the onprem and support release branches are affected.
**Related files**:
- `skills/onprem/rag-retrieve/hooks/retrieval-policy*.md`
- `skills/support/rag-retrieve/hooks/retrieval-policy*.md`
- `skills/autoload/onprem/rag-retrieve/hooks/retrieval-policy*.md`
- `skills/autoload/onprem/rag-retrieve-only/hooks/retrieval-policy*.md`
- `skills/autoload/support/rag-retrieve-only/hooks/retrieval-policy*.md`
**Commit/PR**: `be96f24`, `7b4f03d`
---
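## 2026-05-11: sub-agent (SubAgent) support + pmda-drug-info skill
**Type**: new feature
**Background**: a single skill needs to host several dedicated sub-agents alongside the main agent, isolating context and tool sets by purpose (for example the four dedicated agents in the pmda drug-information scenario: single-drug / interaction / adverse-event / patient-specific).
**Changes**:
- Added `agent/subagent_loader.py`: scans `agents/*.md` under skill directories and parses the `name` / `description` / `tools` fields of the YAML frontmatter into `SubAgent` dicts; deduplicates by `name`, and **the later-scanned definition overrides the earlier one** (last-wins).
- `agent/deep_assistant.py`: `init_agent` calls `load_subagents()`; if any exist, it inserts `SubAgentMiddleware` (from `deepagents.middleware.subagents`) after `CustomFilesystemMiddleware` and before `AnthropicPromptCachingMiddleware`, matching the order in `create_deep_agent`.
- Added `skills/developing/pmda-drug-info/`: a complete skill package containing `.claude-plugin/plugin.json`, `hooks/pre_prompt.py` + `hooks/pmda-instructions.md`, four `agents/*.md` files, its own `pmda_server.py` MCP server + `pmda_tools.json`, and the `mcp_common.py` tool base classes.
**Root cause**: N/A
**Impact**:
- New skill package convention: markdown files matching `agents/*.md` are loaded as sub-agents.
- Skill loading adds one directory scan inside `init_agent`; skills without `agents/` are unaffected.
- When skills are shared across bots there is a risk of sub-agent name collisions: a duplicate name raises no error and is overridden by the later-scanned definition.
**Related files**:
- `agent/subagent_loader.py` (new)
- `agent/deep_assistant.py` (wiring)
- `skills/developing/pmda-drug-info/` (new skill)
**Commit/PR**: `5b634bc`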
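
The last-wins behaviour is easy to misread, so here is a minimal sketch of deduplication by `name` over already-parsed definitions. It is not the code in `agent/subagent_loader.py`: `dedupe_subagents` and the logger name are hypothetical, and frontmatter parsing is left out to keep the example standard-library only.

```python
import logging
from typing import Dict, Iterable, List

logger = logging.getLogger("subagent_sketch")


def dedupe_subagents(parsed: Iterable[dict]) -> List[dict]:
    """Deduplicate sub-agent dicts by name; a later definition replaces an earlier one."""
    by_name: Dict[str, dict] = {}
    for agent in parsed:
        name = agent["name"]
        if name in by_name:
            # Mirrors the documented behaviour: warn, do not raise.
            logger.warning("sub-agent %r redefined; keeping the later definition", name)
        by_name[name] = agent
    return list(by_name.values())
```

Because the override is silent apart from the warning, scan order decides which definition a bot ends up with.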
---
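## 2026-05-11: pmda-drug-info MCP server rewritten as a mock implementation
**Type**: internal rework
**Background**: [TBD]
**Changes**: `skills/developing/pmda-drug-info/pmda_server.py` largely replaced (+322 / -385); the agent-facing interface contract is kept and the internals are replaced with a mock-data implementation.
**Root cause**: N/A
**Impact**: the pmda-drug-info skill no longer depends on a real external PMDA data source for now, which eases integration testing during development.
**Related files**:
- `skills/developing/pmda-drug-info/pmda_server.py`
**Commit/PR**: `a92096a`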
---
## 2026-05-11: retrieval-policy.md content update
**Type**: content adjustment
**Background**: [TBD]
**Changes**: synchronized content updates to the four versions of `retrieval-policy.md` (onprem / support / autoload-onprem-rag-only / autoload-support-rag-only).
**Root cause**: N/A
**Impact**: pairs with the batch policy refinement on the 12th of the same month to keep the rag-retrieve hook policy consistent.
**Related files**:
- `skills/onprem/rag-retrieve/hooks/retrieval-policy.md`
- `skills/support/rag-retrieve/hooks/retrieval-policy.md`
- `skills/autoload/onprem/rag-retrieve-only/hooks/retrieval-policy.md`
- `skills/autoload/support/rag-retrieve-only/hooks/retrieval-policy.md`
**Commit/PR**: `e6d1698`
---
## 2026-05-08: pass trace_id to RAG tools via MCP `_meta`
**Type**: new feature
**Background**: catalog-agent's trace_id needs to reach the MCP-side `rag_retrieve` / `table_rag_retrieve` services for cross-process tracing.
**Changes**:
- Added `agent/mcp_trace_meta.py`: `patch_mcp_client_session_trace_meta()` applies an idempotent monkey-patch to `mcp.ClientSession.call_tool`. On each call, if the tool name is in the set `{"rag_retrieve", "table_rag_retrieve"}` and the current request context has a `trace_id`, it is injected into `kwargs["meta"]["trace_id"]`. `_call_tool_with_meta_compat` is provided for older MCP SDKs (when the `meta=` keyword is not accepted, it falls back to building `CallToolRequestParams._meta` by hand).
- `agent/deep_assistant.py`: installs the patch once at the entry of `get_tools_from_mcp()`.
- `skills/onprem/rag-retrieve/rag_retrieve_server.py` and `skills/support/rag-retrieve/rag_retrieve_server.py` adjusted in step to receive and use `_meta.trace_id`.
**Root cause**: N/A
**Impact**:
- `rag_retrieve` / `table_rag_retrieve` calls now always carry `trace_id` in MCP `_meta` (when the context has one).
- Global monkey-patch style: once `get_tools_from_mcp()` has been called, every `ClientSession.call_tool` is wrapped.
**Related files**:
- `agent/mcp_trace_meta.py` (new)
- `agent/deep_assistant.py`
- `skills/onprem/rag-retrieve/rag_retrieve_server.py`
- `skills/support/rag-retrieve/rag_retrieve_server.py`
**Commit/PR**: `1f06450`
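
The idempotent-patch idea can be shown without the MCP SDK. In the sketch below, `FakeClientSession`, `patch_call_tool`, `trace_id_var` and the `_trace_meta_patched` marker are all hypothetical stand-ins; the real `mcp.ClientSession.call_tool` signature and the project's compatibility fallback are not reproduced here.

```python
import asyncio
import contextvars
import functools

# Hypothetical request-scoped trace id; the project's own context mechanism may differ.
trace_id_var: contextvars.ContextVar = contextvars.ContextVar("trace_id", default=None)
TRACE_META_TOOL_NAMES = {"rag_retrieve", "table_rag_retrieve"}


class FakeClientSession:
    """Stand-in for an MCP client session; it simply echoes what it received."""

    async def call_tool(self, name, arguments=None, meta=None):
        return {"name": name, "meta": meta}


def patch_call_tool(cls) -> None:
    """Wrap cls.call_tool once; calling this again is a no-op."""
    original = cls.call_tool
    if getattr(original, "_trace_meta_patched", False):
        return  # already wrapped, keep the patch idempotent

    @functools.wraps(original)
    async def wrapper(self, name, *args, **kwargs):
        trace_id = trace_id_var.get()
        if name in TRACE_META_TOOL_NAMES and trace_id:
            meta = dict(kwargs.get("meta") or {})  # copy so the caller's dict is untouched
            meta["trace_id"] = trace_id
            kwargs["meta"] = meta
        return await original(self, name, *args, **kwargs)

    wrapper._trace_meta_patched = True
    cls.call_tool = wrapper
```

Marking the wrapper itself is what keeps repeated installs from stacking wrappers, which matters when the install call sits on a hot path such as tool loading.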
---
## 2026-05-06: new kfs-answer skill (support branch)
**Type**: new feature
**Background**: [TBD] - brings kfs-answer to the support branch (the onprem branch already had a skill of the same name).
**Changes**: added `skills/support/kfs-answer/`, including `SKILL.md` and 7 scripts under `scripts/`: `query.py` / `search.py` / `detail.py` / `query_db.py` / `format_answer.py` / `merge_citations.py` / `_session.py` (about 1809 lines).
**Root cause**: N/A
**Impact**: support deployments gain the kfs-answer capability.
**Related files**:
- `skills/support/kfs-answer/**`
**Commit/PR**: `a9227b8`
---
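## 2026-05-06: Daytona sandbox incremental sync missed symlinks
**Type**: bug fix
**Background**: datasets are mounted through symlinks, but incremental sync used `find -type f`, which matches regular files only, so dataset symlinks were neither detected nor packed and synced to the Daytona sandbox. In addition, `tar.add()` does not dereference by default, so what got packed was a dangling symlink pointing at a host path.
**Changes**:
- `utils/daytona_sync._list_local_changed_files`: matches both files and symlinks (`-type f -o -type l`).
- `utils/daytona_sync._tar_workspace_entries`: `tar.add(dereference=True)`, packing the symlink's target content instead of the link.
- `skills/onprem/kfs-answer/SKILL.md` and `prompt/system_prompt_deep_agent.md`: dataset path unified to the plural form `datasets/`.
**Root cause**: the default behaviour of `find -type f` and `tar.add()` does not handle symlinks.
**Impact**: in Daytona mode, kfs-answer and other skills that rely on dataset symlinks can use the data inside the sandbox; path wording in the prompt and SKILL.md is unified.
**Related files**:
- `utils/daytona_sync.py`
- `skills/onprem/kfs-answer/SKILL.md`
- `prompt/system_prompt_deep_agent.md`
**Commit/PR**: `3c0fa49`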
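
Both halves of the fix can be reproduced with the standard library. The sketch below is an illustration, not `utils/daytona_sync.py`: `list_files_and_symlinks` and `pack` are hypothetical names. Note that in Python's `tarfile`, `dereference` is an option of the `TarFile` object (passed to `tarfile.open`), not an argument of `add()`, so the sketch sets it there.

```python
import io
import os
import tarfile
import tempfile
from pathlib import Path


def list_files_and_symlinks(root: Path) -> list:
    """Rough Python analogue of `find -type f -o -type l` under root."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Symlinks to directories show up in dirnames, symlinks to files in filenames.
        for name in filenames + dirnames:
            path = Path(dirpath) / name
            if path.is_symlink() or path.is_file():
                found.append(path)
    return sorted(found)


def pack(root: Path, paths) -> bytes:
    """Pack paths into a gzip tarball, storing symlink targets instead of the links."""
    buf = io.BytesIO()
    # dereference=True makes tarfile stat the link target rather than the link itself.
    with tarfile.open(fileobj=buf, mode="w:gz", dereference=True) as tar:
        for path in paths:
            tar.add(path, arcname=str(path.relative_to(root)))
    return buf.getvalue()
```

With `dereference=True`, a link whose target is missing fails at `add()` time instead of producing a dangling entry, which surfaces the problem on the host rather than inside the sandbox.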
---
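## 2026-04-24: non-blocking PostAgent hooks + ToolOutputLengthMiddleware temporarily disabled
**Type**: performance optimization / temporary adjustment
**Background**: on the non-streaming response path, `_execute_post_agent_hooks` was awaited synchronously and blocked the response.
**Changes**:
- `routes/chat.py`: the non-streaming branch changes `await _execute_post_agent_hooks(...)` to `asyncio.create_task(_execute_post_agent_hooks(...))`; the hook runs in the background and no longer blocks the response.
- `agent/deep_assistant.py`: the whole `ToolOutputLengthMiddleware` block is commented out (not deleted, so it can be restored).
- `utils/settings.py`: swaps the commented lines for `DAYTONA_API_KEY` / `DAYTONA_SERVER_URL` (self-hosted Daytona enabled, SaaS lines commented out).
**Root cause**: N/A (mainly a performance optimization)
**Impact**:
- The non-streaming API no longer waits for PostAgent hooks to finish, so failures or exceptions in a hook are **caught only by the logger inside the task** and the caller gets no error feedback.
- Tool output length is temporarily not truncated, so an overlong output may blow the context window (the middleware is commented out, not removed).
**Related files**:
- `routes/chat.py`
- `agent/deep_assistant.py`
- `utils/settings.py`
**Commit/PR**: `45a9494`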
---
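## 2026-04-23: PrePrompt hook content injected via template placeholder
**Type**: refactor
**Background**: previously the PrePrompt hook output was appended to the end of the prompt after `system_prompt_default.format(...)`; its position in the prompt was fixed and late, and the template had little control over it.
**Changes**: `agent/prompt_loader.load_system_prompt_async` first runs `execute_hooks('PrePrompt', config)` to obtain `hook_content`, then passes it into `system_prompt_default.format(...)` through the new `{hook_content}` placeholder; the template `prompt/system_prompt.md` gains the matching placeholder.
**Root cause**: N/A (structured injection is easier to control)
**Impact**: skills that write PrePrompt hooks depend on where the template places `{hook_content}`; with an older template that has not been upgraded, hook content no longer appears in the final system prompt.
**Related files**:
- `agent/prompt_loader.py`
- `prompt/system_prompt.md`
**Commit/PR**: `51fbf01`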
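
The "hook content silently disappears" failure mode follows from how `str.format` treats unused keyword arguments: it ignores them. A minimal sketch of a guard, assuming a hypothetical helper `render_system_prompt` (the project's loader may handle this differently):

```python
def render_system_prompt(template: str, hook_content: str, **fields) -> str:
    """Fill the template, refusing templates that would drop the hook content."""
    if "{hook_content}" not in template:
        # str.format ignores unused keyword arguments, so without this check
        # the hook text would vanish without any error.
        raise ValueError("template lacks the {hook_content} placeholder")
    return template.format(hook_content=hook_content, **fields)
```

Failing loudly is a design choice for custom templates; logging a warning and appending the hook text at the end would be a softer alternative.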
---
## 2026-04-23: Daytona sandbox integration
**Type**: new feature
**Background**: skill scripts need to run in an isolated sandbox (Daytona) to avoid polluting the host directly.
**Changes**:
- `agent/deep_assistant.py`:
  - `init_agent` reads `DAYTONA_ENABLED` / `DAYTONA_API_KEY` / `DAYTONA_SERVER_URL` and creates a `DaytonaSandbox` when enabled; `sandbox` / `sandbox_type` are passed on to `create_custom_cli_agent` / `agent.invoke_config`.
  - Refactored to load in parallel: `load_system_prompt_async` and `load_mcp_settings_async` run concurrently under `asyncio.gather`; `get_tools_from_mcp` runs concurrently with `asyncio.to_thread(init_daytona_sandbox, ...)`; `init_agent` now returns `(agent, checkpointer, sandbox)` (sandbox is new).
  - `skills_sources` of `enable_skills` changes from `"/skills"` to `"/workspace/skills"` (a path inside the sandbox).
- `agent/prompt_loader.py`: `agent_dir_path` becomes `/workspace` when `DAYTONA_ENABLED=True`, otherwise it stays a local path.
- `utils/daytona_sync.py` added (204 lines): two-way sync between the sandbox and the local workspace.
- `pyproject.toml` / `poetry.lock` / `requirements.txt`: new `daytona` and `langchain_daytona` dependencies.
- `utils/settings.py`: new `DAYTONA_API_KEY` / `DAYTONA_SERVER_URL` / `DAYTONA_ENABLED` settings.
**Root cause**: N/A
**Impact**:
- The `init_agent` return tuple grows from 2 to 3 elements (`agent, checkpointer, sandbox`); **callers must update their unpacking accordingly**.
- A skill's root path differs between sandbox mode and local mode; every hook or script with hard-coded paths must handle both.
**Related files**:
- `agent/deep_assistant.py`
- `agent/prompt_loader.py`
- `utils/daytona_sync.py` (new)
- `utils/settings.py`
- `pyproject.toml`, `poetry.lock`, `requirements.txt`
**Commit/PR**: `c9e0789`, `8446dab`
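
The two-stage parallel load and the 3-element return can be sketched with stand-in coroutines. Everything below is hypothetical scaffolding (`load_prompt`, `load_mcp_settings`, `get_tools`, `init_sandbox_blocking`, `init_agent_sketch`); only the use of `asyncio.gather` and `asyncio.to_thread` (Python 3.9+) reflects the entry above.

```python
import asyncio
import time


async def load_prompt() -> str:
    await asyncio.sleep(0.01)  # stands in for async I/O
    return "prompt"


async def load_mcp_settings() -> dict:
    await asyncio.sleep(0.01)
    return {"servers": []}


async def get_tools(settings: dict) -> list:
    await asyncio.sleep(0.01)
    return ["rag_retrieve"]


def init_sandbox_blocking() -> str:
    time.sleep(0.01)  # stands in for a blocking SDK call
    return "sandbox"


async def init_agent_sketch():
    # Stage 1: prompt and MCP settings do not depend on each other.
    prompt, settings = await asyncio.gather(load_prompt(), load_mcp_settings())
    # Stage 2: tool loading overlaps with the blocking sandbox setup in a worker thread.
    tools, sandbox = await asyncio.gather(
        get_tools(settings),
        asyncio.to_thread(init_sandbox_blocking),
    )
    agent = {"prompt": prompt, "tools": tools}
    checkpointer = object()
    return agent, checkpointer, sandbox
```

A caller would unpack three values, for example `agent, checkpointer, sandbox = await init_agent_sketch()`; two-element unpacking raises `ValueError`, which is the breakage the impact note warns about.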
---
## 2026-04-22: two new in-development skills, rag-retrieve-no-citation and novare-context
**Type**: new feature
**Background**: [TBD]
**Changes**:
- `skills/developing/rag-retrieve-no-citation/`: a complete skill package with `.claude-plugin/plugin.json`, `README.md`, `hooks/pre_prompt.py`, `hooks/retrieval-policy.md` and `hooks/retrieval-policy-forbidden-self-knowledge.md`, plus a standalone `rag_retrieve_server.py` + `rag_retrieve_tools.json` + `mcp_common.py`.
- `skills/developing/novare-context/`: contains `.claude-plugin/plugin.json`, `README.md`, and `hooks/pre_prompt.py`.
**Root cause**: N/A
**Impact**: the set of in-development skills grows; they can serve as templates for later official versions.
**Related files**:
- `skills/developing/rag-retrieve-no-citation/**`
- `skills/developing/novare-context/**`
**Commit/PR**: `7a30e52`
---
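### 2026-05-20
- **Change**: `mcp-ui` and `data-dashboard` move from a custom `uri + data` tool protocol to the MCP Apps pattern
- **Notes**: static HTML Apps live in each skill's `apps/` directory; the host loads the iframe via a resource URI and then passes tool data with postMessage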

View File

@ -0,0 +1,47 @@
---
date: "2026-05-11"
status: pending
topic: "subagent-support"
impact: [skill, agent, deep_assistant]
---
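# How sub-agents are hosted inside a skill
## Background
The business side (pmda-drug-info) needs a single skill to host several dedicated agents for different subtasks
(single-drug / interaction / adverse-event / patient-specific). Each sub-agent
needs its own system prompt and tool allowlist, but should share the main agent's MCP tool instances
and the same LLM.
[TBD]: was a skill-per-subagent approach considered (one standalone skill per sub-agent)?
## Options
### Option A: `agents/*.md` inside the skill + global `SubAgentMiddleware` (implemented)
- Pros:
  - The skill package is self-contained; sub-agent definitions ship in the same package as hooks / MCP.
  - Reuses `deepagents.middleware.subagents.SubAgentMiddleware`; no custom routing layer is needed.
  - Tools are filtered by the `tools` allowlist field and referenced uniformly by MCP tool name.
- Cons:
  - Sub-agents with the same name across skills are **silently overridden (last-wins)**; there is only a warning and no hard validation.
  - Middleware position coupling: `SubAgentMiddleware` must be inserted
    after `CustomFilesystemMiddleware` and before `AnthropicPromptCachingMiddleware`
    (matching the `create_deep_agent` order), which is easy to break when reordering middleware.
### Option B: one skill per sub-agent
- Pros: naturally isolated; naming conflicts are handled by the skill loading layer.
- Cons: the sub-agents of one business domain are scattered across the skill list, and deployment / autoload configuration gets complicated;
  the 4 sub-agents of pmda-drug-info are tightly related and fit more naturally in one skill.
## Decision
**Option A** chosen (already landed).
## Impact
- Changes required: callers need to know that the `init_agent` return tuple now includes sandbox (this overlaps with the Daytona change).
- Risk: silent override of same-named sub-agents; if more skills expose sub-agents in future, conflict detection needs to be added.
- Follow-up tasks:
  1. Write up sub-agent authoring guidelines (`agents/*.md` frontmatter fields + tool allowlist conventions).
  2. Define a detection strategy for cross-skill sub-agent name conflicts: whether to upgrade to an error or add a skill-name prefix.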

View File

@ -9,7 +9,7 @@ ENV PYTHONPATH=/app
ENV PYTHONUNBUFFERED=1
# Install system dependencies (including LibreOffice and the libvips required by sharp)
RUN apt-get update && apt-get install -y \
RUN apt-get update && apt-get install -y --no-install-recommends \
curl \
wget \
gnupg2 \
@ -26,7 +26,8 @@ RUN apt-get update && apt-get install -y \
# Install Node.js (provides the npx command)
RUN curl -fsSL https://deb.nodesource.com/setup_20.x | bash - && \
apt-get install -y nodejs
apt-get install -y --no-install-recommends nodejs && \
rm -rf /var/lib/apt/lists/*
# Install uv (Python package manager)
RUN curl -LsSf https://astral.sh/uv/install.sh | sh
@ -36,7 +37,10 @@ ENV PATH="/root/.cargo/bin:$PATH"
# Copy the requirements file and install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
RUN grep -Ev '^(torch|triton|nvidia-[^=]+|sentence-transformers|transformers|tokenizers|safetensors|scikit-learn|scipy|huggingface-hub|hf-xet)==' requirements.txt > /tmp/requirements.runtime.txt && \
! grep -E '^(torch|triton|nvidia-[^=]+|sentence-transformers|transformers|tokenizers|safetensors|scikit-learn|scipy|huggingface-hub|hf-xet)==' /tmp/requirements.runtime.txt && \
pip install --no-cache-dir -r /tmp/requirements.runtime.txt && \
rm -f /tmp/requirements.runtime.txt
# Install Playwright and download Chromium
RUN pip install --no-cache-dir playwright && \
@ -49,9 +53,6 @@ RUN mkdir -p /app/projects
RUN mkdir -p /app/public
RUN mkdir -p /app/models
# Download the sentence-transformers model into the models directory
RUN python -c "from sentence_transformers import SentenceTransformer; model = SentenceTransformer('TaylorAI/gte-tiny'); model.save('/app/models/gte-tiny')"
FROM base AS bytecode-builder
# Copy the application code; compile to bytecode only in the build stage

View File

@ -10,7 +10,8 @@ ENV PYTHONUNBUFFERED=1
# Install system dependencies (including LibreOffice and the libvips required by sharp)
RUN sed -i 's|http://deb.debian.org|http://mirrors.aliyun.com|g' /etc/apt/sources.list.d/debian.sources && \
apt-get update && apt-get install -y \
apt-get -o Acquire::Retries=3 update && \
apt-get -o Acquire::Retries=3 install -y --no-install-recommends \
curl \
wget \
gnupg2 \
@ -26,7 +27,8 @@ RUN sed -i 's|http://deb.debian.org|http://mirrors.aliyun.com|g' /etc/apt/source
# Install Node.js (supports the npx command)
RUN curl -fsSL https://deb.nodesource.com/setup_20.x | bash - && \
apt-get install -y nodejs
apt-get -o Acquire::Retries=3 install -y --no-install-recommends nodejs && \
rm -rf /var/lib/apt/lists/*
# Install uv (Python package manager)
RUN curl -LsSf https://astral.sh/uv/install.sh | sh
@ -36,7 +38,10 @@ ENV PATH="/root/.cargo/bin:$PATH"
# Copy the requirements file and install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -i https://mirrors.aliyun.com/pypi/simple/ -r requirements.txt
RUN grep -Ev '^(torch|triton|nvidia-[^=]+|sentence-transformers|transformers|tokenizers|safetensors|scikit-learn|scipy|huggingface-hub|hf-xet)==' requirements.txt > /tmp/requirements.runtime.txt && \
! grep -E '^(torch|triton|nvidia-[^=]+|sentence-transformers|transformers|tokenizers|safetensors|scikit-learn|scipy|huggingface-hub|hf-xet)==' /tmp/requirements.runtime.txt && \
pip install --no-cache-dir -i https://mirrors.aliyun.com/pypi/simple/ -r /tmp/requirements.runtime.txt && \
rm -f /tmp/requirements.runtime.txt
# Install Playwright and download Chromium
RUN pip install --no-cache-dir -i https://mirrors.aliyun.com/pypi/simple/ playwright && \


@ -23,6 +23,7 @@ from utils.fastapi_utils import detect_provider, sanitize_model_kwargs
from .guideline_middleware import GuidelineMiddleware
from .tool_output_length_middleware import ToolOutputLengthMiddleware
from .tool_use_cleanup_middleware import ToolUseCleanupMiddleware
from .tool_metrics_middleware import ToolMetricsMiddleware
from .filepath_fix_middleware import FilePathFixMiddleware
from .mcp_trace_meta import patch_mcp_client_session_trace_meta
from utils.settings import (
@ -256,6 +257,7 @@ async def init_agent(config: AgentConfig):
# Build the middleware list
middleware = []
middleware.append(EmptyResponseRetryMiddleware())
middleware.append(ToolMetricsMiddleware(config))
middleware.append(ToolUseCleanupMiddleware())
# tool_output_middleware = ToolOutputLengthMiddleware(
# max_length=(getattr(config.generate_cfg, 'tool_output_max_length', None) if config.generate_cfg else None) or TOOL_OUTPUT_MAX_LENGTH,
@ -480,13 +482,19 @@ def create_custom_cli_agent(
backend = FilesystemBackend(root_dir=workspace_root, virtual_mode=False)
# Set up composite backend with routing based on the new implementation
# NOTE: virtual_mode=True anchors all paths to root_dir. This is required for
# these offload-only backends: CompositeBackend strips the route prefix and
# forwards "/" to grep, so virtual_mode=False would resolve "/" to the real
# filesystem root and scan the whole disk (hitting /usr, /var, other sessions'
# temp dirs), causing 45-152s grep calls. virtual_mode=True confines grep to
# the temp dir and filters out-of-root results.
large_results_backend = FilesystemBackend(
root_dir=tempfile.mkdtemp(prefix="deepagents_large_results_"),
virtual_mode=False,
virtual_mode=True,
)
conversation_history_backend = FilesystemBackend(
root_dir=tempfile.mkdtemp(prefix="deepagents_conversation_history_"),
virtual_mode=False,
virtual_mode=True,
)
composite_backend = CompositeBackend(
default=backend,


@ -5,14 +5,18 @@ Responsible for creating, caching, and managing the lifecycle of Mem0 client ins
import logging
import asyncio
import threading
import concurrent.futures
from typing import Any, Dict, List, Optional, Literal
from collections import OrderedDict
from embedding.manager import GlobalModelManager, get_model_manager
from urllib.parse import unquote, urlparse
from embedding.manager import get_model_manager
import json_repair
from psycopg2 import pool
from utils.settings import (
CHECKPOINT_DB_URL,
EMBEDDING_API_KEY,
EMBEDDING_BASE_URL,
EMBEDDING_DIMENSIONS,
EMBEDDING_MODEL_NAME,
MEM0_POOL_SIZE
)
from .mem0_config import Mem0Config
@ -27,15 +31,9 @@ logger = logging.getLogger("app")
class CustomMem0Embedding:
"""
Custom Mem0 embedding class that directly uses the project's existing GlobalModelManager
This prevents Mem0 from loading the same model again and saves memory
Custom Mem0 embedding class backed by the external embedding API.
"""
_model = None # Class variable caching the model instance
_lock = threading.Lock() # Thread-safe lock
_executor = None # Thread pool executor
def __init__(self, config: Optional[Any] = None):
"""Initialize the custom embedding."""
# Create a simple config object compatible with Mem0 telemetry code
@ -46,42 +44,7 @@ class CustomMem0Embedding:
@property
def embedding_dims(self):
"""Get the embedding dimension."""
return 384 # Dimension of gte-tiny
def _get_model_sync(self):
"""Synchronously get the model without using asyncio.run()."""
# First try to get an already-loaded model from the manager
manager = get_model_manager()
model = manager.get_model_sync()
if model is not None:
# Cache the model
CustomMem0Embedding._model = model
return model
# If the model is not loaded, run async initialization in a thread pool
if CustomMem0Embedding._executor is None:
CustomMem0Embedding._executor = concurrent.futures.ThreadPoolExecutor(
max_workers=1,
thread_name_prefix="mem0_embed"
)
# Run async code in a dedicated thread
def run_async_in_thread():
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
try:
result = loop.run_until_complete(manager.get_model())
return result
finally:
loop.close()
future = CustomMem0Embedding._executor.submit(run_async_in_thread)
model = future.result(timeout=30) # 30-second timeout
# Cache the model
CustomMem0Embedding._model = model
return model
return EMBEDDING_DIMENSIONS
def embed(self, text, memory_action: Optional[Literal["add", "search", "update"]] = None):
"""
@ -94,15 +57,11 @@ class CustomMem0Embedding:
Returns:
list: Embedding vector
"""
# Retrieve the model in a thread-safe manner
if CustomMem0Embedding._model is None:
with CustomMem0Embedding._lock:
if CustomMem0Embedding._model is None:
self._get_model_sync()
model = CustomMem0Embedding._model
embeddings = model.encode(text, convert_to_numpy=True)
return embeddings.tolist()
manager = get_model_manager()
input_texts = text if isinstance(text, list) else [text]
embeddings = manager.encode_texts_sync(input_texts, batch_size=1)
result = embeddings.tolist()
return result if isinstance(text, list) else result[0]
# Monkey patch: replace mem0's remove_code_blocks with json_repair
def _remove_code_blocks_with_repair(content: str) -> str:
@ -233,27 +192,68 @@ class Mem0Manager:
mem0_instance: Mem0 Memory instance
"""
try:
# Mem0 Memory instances have a vector_store attribute of type PGVector
if hasattr(mem0_instance, 'vector_store'):
vector_store = mem0_instance.vector_store
# PGVector has conn and connection_pool attributes
if hasattr(vector_store, 'conn') and hasattr(vector_store, 'connection_pool'):
if vector_store.conn is not None and vector_store.connection_pool is not None:
try:
# Close the cursor first
if hasattr(vector_store, 'cur') and vector_store.cur:
vector_store.cur.close()
vector_store.cur = None
# Return the connection to the pool
vector_store.connection_pool.putconn(vector_store.conn)
# Mark as cleaned up to prevent __del__ from releasing it again
vector_store.conn = None
logger.debug("Successfully released Mem0 database connection back to pool")
except Exception as e:
logger.warning(f"Error releasing Mem0 connection: {e}")
vector_store = getattr(mem0_instance, 'vector_store', None)
if vector_store is not None and getattr(vector_store, 'conn', None) is not None:
try:
if getattr(vector_store, 'cur', None):
vector_store.cur.close()
vector_store.cur = None
connection_pool = getattr(vector_store, 'connection_pool', None)
if connection_pool is not None:
connection_pool.putconn(vector_store.conn)
logger.debug("Successfully released Mem0 database connection back to pool")
else:
vector_store.conn.close()
logger.debug("Successfully closed Mem0 database connection")
vector_store.conn = None
except Exception as e:
logger.warning(f"Error releasing Mem0 connection: {e}")
except Exception as e:
logger.warning(f"Error cleaning up Mem0 instance: {e}")
def _build_pgvector_config(self, agent_id: str) -> Dict[str, Any]:
"""Build Mem0 PGVector config using only fields accepted by mem0."""
parsed_url = urlparse(CHECKPOINT_DB_URL)
if parsed_url.scheme not in ("postgresql", "postgres"):
raise ValueError(f"Unsupported CHECKPOINT_DB_URL scheme: {parsed_url.scheme}")
return {
"dbname": unquote(parsed_url.path.lstrip("/") or "postgres"),
"user": unquote(parsed_url.username or ""),
"password": unquote(parsed_url.password or ""),
"host": parsed_url.hostname or "localhost",
"port": parsed_url.port or 5432,
"collection_name": f"mem0_{agent_id}".replace("-", "_")[:50],
"embedding_model_dims": EMBEDDING_DIMENSIONS,
}
def _attach_pool_to_vector_store(self, mem0_instance: Any) -> None:
"""Move Mem0's runtime vector store onto the shared psycopg2 pool."""
vector_store = getattr(mem0_instance, 'vector_store', None)
if vector_store is None:
return
if getattr(vector_store, 'cur', None):
vector_store.cur.close()
vector_store.cur = None
if getattr(vector_store, 'conn', None) is not None:
vector_store.conn.close()
vector_store.conn = None
vector_store.connection_pool = self._sync_pool
def _close_telemetry_vector_store(self, mem0_instance: Any) -> None:
"""Close Mem0's migration telemetry vector-store connection after init."""
vector_store = getattr(mem0_instance, '_telemetry_vector_store', None)
if vector_store is None:
return
if getattr(vector_store, 'cur', None):
vector_store.cur.close()
vector_store.cur = None
if getattr(vector_store, 'conn', None) is not None:
vector_store.conn.close()
vector_store.conn = None
def _ensure_connection(self, mem0_instance: Any) -> None:
"""Ensure a Mem0 instance has a database connection before use.
@ -268,8 +268,7 @@ class Mem0Manager:
if hasattr(vs, 'conn') and vs.conn is None and self._sync_pool:
vs.conn = self._sync_pool.getconn()
vs.cur = vs.conn.cursor()
# Ensure the connection_pool reference exists for later return
if hasattr(vs, 'connection_pool') and vs.connection_pool is None:
if not hasattr(vs, 'connection_pool') or vs.connection_pool is None:
vs.connection_pool = self._sync_pool
logger.debug("Re-acquired Mem0 database connection from pool")
except Exception as e:
@ -292,8 +291,11 @@ class Mem0Manager:
if hasattr(vs, 'cur') and vs.cur:
vs.cur.close()
vs.cur = None
if hasattr(vs, 'connection_pool') and vs.connection_pool is not None:
vs.connection_pool.putconn(vs.conn)
connection_pool = getattr(vs, 'connection_pool', None)
if connection_pool is not None:
connection_pool.putconn(vs.conn)
else:
vs.conn.close()
vs.conn = None
logger.debug("Released Mem0 database connection back to pool")
except Exception as e:
@ -376,28 +378,25 @@ class Mem0Manager:
if not connection_pool:
raise ValueError("Database connection pool not available")
# Create a custom embedder that reuses the shared model to avoid duplicate loading
# Create a custom embedder backed by the external embedding API.
custom_embedder = CustomMem0Embedding()
# Configure Mem0 to use Pgvector
# Note: use huggingface_base_url here to bypass local model loading
# Set a dummy base_url so HuggingFaceEmbedding does not load SentenceTransformer
# Configure Mem0 to use Pgvector.
# Mem0 validates this config strictly, so connection_pool is attached after creation.
pgvector_config = self._build_pgvector_config(agent_id)
config_dict = {
"vector_store": {
"provider": "pgvector",
"config": {
"connection_pool": connection_pool,
"collection_name": f"mem0_{agent_id}".replace("-", "_")[:50], # Isolate by agent_id
"embedding_model_dims": 384, # Dimension of paraphrase-multilingual-MiniLM-L12-v2
}
"config": pgvector_config,
},
# Use huggingface_base_url to bypass model loading; it will later be replaced with the custom embedder
# The embedder is replaced immediately after Memory is created.
"embedder": {
"provider": "huggingface",
"provider": "openai",
"config": {
"huggingface_base_url": "http://dummy-url-that-will-be-replaced",
"api_key": "dummy-key" # Placeholder to prevent OpenAI client validation failure
"api_key": EMBEDDING_API_KEY,
"openai_base_url": EMBEDDING_BASE_URL,
"model": EMBEDDING_MODEL_NAME,
"embedding_dims": EMBEDDING_DIMENSIONS,
}
}
}
@ -432,6 +431,8 @@ class Mem0Manager:
# Create the Mem0 instance
mem = Memory.from_config(config_dict)
self._attach_pool_to_vector_store(mem)
self._close_telemetry_vector_store(mem)
logger.debug(f"Original embedder type: {type(mem.embedding_model).__name__}")
logger.debug(f"Original embedder.embedding_dims: {getattr(mem.embedding_model, 'embedding_dims', 'N/A')}")


@ -0,0 +1,100 @@
"""Structured metrics for agent tool calls."""
import asyncio
import logging
import time
from typing import Any, Callable
from langchain.agents.middleware import AgentMiddleware
from langchain.tools.tool_node import ToolCallRequest
from agent.agent_config import AgentConfig
from utils.structured_log import emit_question_metric
logger = logging.getLogger("app")
class ToolMetricsMiddleware(AgentMiddleware):
"""Emit structured timing metrics for every tool call."""
def __init__(self, config: AgentConfig):
self.config = config
def _emit_tool_metric(
self,
request: ToolCallRequest,
*,
started_at: float,
status: str,
error_type: str | None = None,
) -> None:
tool_call = request.tool_call or {}
tool_name = tool_call.get("name") or "unknown_tool"
tool_call_id = tool_call.get("id")
duration_ms = max(int((time.monotonic() - started_at) * 1000), 0)
try:
emit_question_metric(
stage="catalog_agent.tool_call",
status=status,
duration_ms=duration_ms,
trace_id=self.config.trace_id,
ai_id=self.config.bot_id,
session_id=self.config.session_id,
robot_type="agent",
model=self.config.model_name,
stream=self.config.stream,
error_type=error_type,
extra={
"bot_id": self.config.bot_id,
"tool_name": tool_name,
"tool_call_id": tool_call_id,
"tool_response": self.config.tool_response,
"enable_thinking": self.config.enable_thinking,
},
)
except Exception:
logger.exception("Failed to emit tool metric for tool_name=%s", tool_name)
def wrap_tool_call(
self,
request: ToolCallRequest,
handler: Callable[[ToolCallRequest], Any],
) -> Any:
started_at = time.monotonic()
try:
result = handler(request)
except Exception as exc:
self._emit_tool_metric(
request,
started_at=started_at,
status="error",
error_type=type(exc).__name__,
)
raise
self._emit_tool_metric(request, started_at=started_at, status="success")
return result
async def awrap_tool_call(
self,
request: ToolCallRequest,
handler: Callable[[ToolCallRequest], Any],
) -> Any:
started_at = time.monotonic()
try:
result = await handler(request)
except asyncio.CancelledError:
self._emit_tool_metric(request, started_at=started_at, status="cancel")
raise
except Exception as exc:
self._emit_tool_metric(
request,
started_at=started_at,
status="error",
error_type=type(exc).__name__,
)
raise
self._emit_tool_metric(request, started_at=started_at, status="success")
return result
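
The timing pattern used by `wrap_tool_call` can be exercised without LangChain. The following is a simplified sketch, assuming a plain callable handler and a recording function in place of `emit_question_metric`; `timed_call` and `record` are hypothetical names introduced for illustration.

```python
import time
from typing import Any, Callable, Dict, List


def timed_call(
    handler: Callable[[Any], Any],
    request: Any,
    emit: Callable[..., None],
) -> Any:
    """Run handler(request), emit one metric, and re-raise any exception."""
    started_at = time.monotonic()  # monotonic clock: immune to wall-clock jumps

    def elapsed_ms() -> int:
        return max(int((time.monotonic() - started_at) * 1000), 0)

    try:
        result = handler(request)
    except Exception as exc:
        emit(status="error", duration_ms=elapsed_ms(), error_type=type(exc).__name__)
        raise
    emit(status="success", duration_ms=elapsed_ms(), error_type=None)
    return result


events: List[Dict[str, Any]] = []


def record(**fields: Any) -> None:
    """Stand-in for the structured metric emitter."""
    events.append(fields)


timed_call(lambda req: req.upper(), "ok", record)
try:
    timed_call(lambda req: {}["missing"], "boom", record)
except KeyError:
    pass  # the wrapper re-raises after recording the failure
print(events)
```

The async variant in the diff additionally handles `asyncio.CancelledError` in its own branch; since Python 3.8 that exception derives from `BaseException`, so a plain `except Exception` would not catch it.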


@ -1,168 +0,0 @@
#!/usr/bin/env python3
"""
SQLite task status database management tool
"""
import sqlite3
import json
import time
from task_queue.task_status import task_status_store
def view_database():
"""View database contents"""
print("SQLite task status database contents")
print("=" * 40)
print(f"Database path: {task_status_store.db_path}")
# Connect to the database
conn = sqlite3.connect(task_status_store.db_path)
cursor = conn.cursor()
# View table schema
print(f"\nTable schema:")
cursor.execute("PRAGMA table_info(task_status)")
columns = cursor.fetchall()
for col in columns:
print(f" {col[1]} ({col[2]})")
# View all records
print(f"\nAll records:")
cursor.execute("SELECT * FROM task_status ORDER BY updated_at DESC")
rows = cursor.fetchall()
if not rows:
print(" (empty database)")
else:
print(f" Total {len(rows)} records:")
for i, row in enumerate(rows):
task_id, unique_id, status, created_at, updated_at, result, error = row
created_str = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(created_at))
updated_str = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(updated_at))
print(f" {i+1}. {task_id}")
print(f" Project ID: {unique_id}")
print(f" Status: {status}")
print(f" Created: {created_str}")
print(f" Updated: {updated_str}")
if result:
try:
result_data = json.loads(result)
print(f" Result: {result_data.get('message', 'N/A')}")
except:
print(f" Result: {result[:50]}...")
if error:
print(f" Error: {error}")
print()
conn.close()
def run_query(sql_query: str):
"""Run a custom query"""
print(f"Running query: {sql_query}")
try:
conn = sqlite3.connect(task_status_store.db_path)
conn.row_factory = sqlite3.Row
cursor = conn.cursor()
cursor.execute(sql_query)
rows = cursor.fetchall()
if not rows:
print(" (no results)")
else:
print(f" {len(rows)} results:")
for row in rows:
print(f" {dict(row)}")
conn.close()
except Exception as e:
print(f"Query failed: {e}")
def interactive_shell():
"""Interactive database management"""
print("\n🖥️ Interactive database management")
print("Type 'help' to view available commands, or 'quit' to exit")
while True:
try:
command = input("\n> ").strip()
if command.lower() in ['quit', 'exit', 'q']:
break
elif command.lower() == 'help':
print("""
Available commands:
view - View all records
stats - View statistics
pending - View pending tasks
completed - View completed tasks
failed - View failed tasks
sql <query> - Run an SQL query
cleanup <days> - Clean up records older than N days
count - Count total tasks
help - Show help
quit/exit/q - Exit
""")
elif command.lower() == 'view':
view_database()
elif command.lower() == 'stats':
stats = task_status_store.get_statistics()
print(f"Statistics:")
print(f" Total tasks: {stats['total_tasks']}")
print(f" Status breakdown: {stats['status_breakdown']}")
print(f" Last 24 hours: {stats['recent_24h']}")
elif command.lower() == 'pending':
tasks = task_status_store.search_tasks(status="pending")
print(f"Pending tasks ({len(tasks)}):")
for task in tasks:
print(f" - {task['task_id']}: {task['unique_id']}")
elif command.lower() == 'completed':
tasks = task_status_store.search_tasks(status="completed")
print(f"Completed tasks ({len(tasks)}):")
for task in tasks:
print(f" - {task['task_id']}: {task['unique_id']}")
elif command.lower() == 'failed':
tasks = task_status_store.search_tasks(status="failed")
print(f"Failed tasks ({len(tasks)}):")
for task in tasks:
print(f" - {task['task_id']}: {task['unique_id']}")
elif command.lower().startswith('sql '):
sql_query = command[4:]
run_query(sql_query)
elif command.lower().startswith('cleanup '):
try:
days = int(command[8:])
count = task_status_store.cleanup_old_tasks(days)
print(f"Cleaned up {count} records older than {days} days")
except ValueError:
print("Please enter a valid number of days")
elif command.lower() == 'count':
all_tasks = task_status_store.list_all()
print(f"Total tasks: {len(all_tasks)}")
else:
print("Unknown command. Type 'help' for help")
except KeyboardInterrupt:
print("\nGoodbye!")
break
except Exception as e:
print(f"Execution error: {e}")
def main():
"""Main function"""
import sys
if len(sys.argv) > 1:
if sys.argv[1] == 'view':
view_database()
elif sys.argv[1] == 'interactive':
interactive_shell()
else:
print("Usage: python db_manager.py [view|interactive]")
else:
view_database()
interactive_shell()
if __name__ == "__main__":
main()


@ -4,128 +4,93 @@ Model pool manager and cache system
Support high-concurrency embedding retrieval services
"""
import os
import asyncio
import time
import pickle
import hashlib
import logging
from typing import Dict, List, Optional, Any, Tuple
from dataclasses import dataclass
from collections import OrderedDict
from utils.settings import SENTENCE_TRANSFORMER_MODEL
import threading
import psutil
from typing import Dict, List, Any
from utils.settings import (
EMBEDDING_API_KEY,
EMBEDDING_BASE_URL,
EMBEDDING_DIMENSIONS,
EMBEDDING_MODEL_NAME,
EMBEDDING_TIMEOUT,
)
import numpy as np
import requests
from sentence_transformers import SentenceTransformer
import logging
logger = logging.getLogger('app')
class GlobalModelManager:
"""Global model manager"""
"""OpenAI-compatible embedding API manager."""
def __init__(self, model_name: str = 'TaylorAI/gte-tiny'):
self.model_name = model_name
self.local_model_path = "./models/gte-tiny"
self._model: Optional[SentenceTransformer] = None
self._lock = asyncio.Lock()
self._load_time = 0
self._device = 'cpu'
def __init__(self):
self.external_model_name = EMBEDDING_MODEL_NAME
self.external_base_url = EMBEDDING_BASE_URL.rstrip("/")
self.external_api_key = EMBEDDING_API_KEY
self.external_dimensions = EMBEDDING_DIMENSIONS
self.external_timeout = EMBEDDING_TIMEOUT
logger.info(f"GlobalModelManager initialized: {model_name}")
async def get_model(self) -> SentenceTransformer:
"""Get the model instance with lazy loading"""
if self._model is not None:
return self._model
async with self._lock:
# Double-check
if self._model is not None:
return self._model
try:
start_time = time.time()
# Check the local model
model_path = self.local_model_path if os.path.exists(self.local_model_path) else self.model_name
# Get device configuration
self._device = os.environ.get('SENTENCE_TRANSFORMER_DEVICE', 'cpu')
if self._device not in ['cpu', 'cuda', 'mps']:
self._device = 'cpu'
logger.info(f"Loading model: {model_path} (device: {self._device})")
# Run blocking operations in the event loop executor
loop = asyncio.get_event_loop()
self._model = await loop.run_in_executor(
None,
lambda: SentenceTransformer(
model_path,
device=self._device
)
)
self._load_time = time.time() - start_time
logger.info(f"Model loading completed: {self._load_time:.2f}s")
return self._model
except Exception as e:
logger.error(f"Model loading failed: {e}")
raise
logger.info(f"GlobalModelManager initialized: external_model={self.external_model_name}")
async def encode_texts(self, texts: List[str], batch_size: int = 32) -> np.ndarray:
"""Encode texts into vectors"""
"""Encode texts into vectors through the external embedding API."""
if not texts:
return np.array([])
model = await self.get_model()
loop = asyncio.get_event_loop()
return await loop.run_in_executor(
None,
lambda: self._encode_texts_external(texts)
)
try:
# Run blocking operations in the event loop executor
loop = asyncio.get_event_loop()
embeddings = await loop.run_in_executor(
None,
lambda: model.encode(texts, batch_size=batch_size, show_progress_bar=False)
def encode_texts_sync(self, texts: List[str], batch_size: int = 32) -> np.ndarray:
"""Synchronously encode texts. Used by synchronous integrations such as Mem0."""
if not texts:
return np.array([])
return self._encode_texts_external(texts)
def _encode_texts_external(self, texts: List[str]) -> np.ndarray:
if not self.external_base_url:
raise RuntimeError("EMBEDDING_BASE_URL is required for embedding API calls")
endpoint = f"{self.external_base_url}/embeddings"
headers = {"Content-Type": "application/json"}
if self.external_api_key:
headers["Authorization"] = f"Bearer {self.external_api_key}"
payload: Dict[str, Any] = {
"model": self.external_model_name,
"input": texts,
}
if self.external_dimensions and self.external_model_name not in ("text-embedding-ada-002", "local-embedding"):
payload["dimensions"] = self.external_dimensions
response = requests.post(
endpoint,
json=payload,
headers=headers,
timeout=self.external_timeout,
)
if response.status_code != 200:
raise RuntimeError(f"External embedding API failed: {response.status_code} - {response.text}")
data = response.json()
embeddings = [item["embedding"] for item in data.get("data", [])]
if len(embeddings) != len(texts):
raise RuntimeError(
f"External embedding API returned {len(embeddings)} embeddings for {len(texts)} texts"
)
# Ensure a NumPy array is returned
if hasattr(embeddings, 'cpu'):
embeddings = embeddings.cpu().numpy()
elif hasattr(embeddings, 'numpy'):
embeddings = embeddings.numpy()
elif not isinstance(embeddings, np.ndarray):
embeddings = np.array(embeddings)
return embeddings
except Exception as e:
logger.error(f"Text encoding failed: {e}")
raise
def get_model_sync(self) -> Optional[SentenceTransformer]:
"""Synchronously get the model instance for synchronous contexts
If the model is not loaded, return None. The caller should ensure the model is initialized via the async API first.
Returns:
The loaded SentenceTransformer model, or None
"""
return self._model
return np.array(embeddings)
def get_model_info(self) -> Dict[str, Any]:
"""Get model information"""
return {
"model_name": self.model_name,
"local_model_path": self.local_model_path,
"device": self._device,
"is_loaded": self._model is not None,
"load_time": self._load_time
"provider": "openai_compatible",
"base_url": self.external_base_url,
"model_name": self.external_model_name,
"dimensions": self.external_dimensions,
}
@ -137,5 +102,5 @@ def get_model_manager() -> GlobalModelManager:
"""Get the model manager instance"""
global _model_manager
if _model_manager is None:
_model_manager = GlobalModelManager(SENTENCE_TRANSFORMER_MODEL)
_model_manager = GlobalModelManager()
return _model_manager
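
The request and response handling in `_encode_texts_external` can be checked without a network call. This sketch reproduces only the payload construction and the count validation from the diff; `build_payload` and `parse_embeddings` are hypothetical helper names, and the model name and vectors in the example are illustrative values.

```python
from typing import Any, Dict, List, Optional

# Models for which the diff omits the "dimensions" field.
MODELS_WITHOUT_DIMENSIONS = ("text-embedding-ada-002", "local-embedding")


def build_payload(model: str, texts: List[str], dimensions: Optional[int]) -> Dict[str, Any]:
    """Build the JSON body for an OpenAI-compatible /embeddings request."""
    payload: Dict[str, Any] = {"model": model, "input": texts}
    if dimensions and model not in MODELS_WITHOUT_DIMENSIONS:
        payload["dimensions"] = dimensions
    return payload


def parse_embeddings(data: Dict[str, Any], expected: int) -> List[List[float]]:
    """Extract the vectors and verify one embedding came back per input text."""
    embeddings = [item["embedding"] for item in data.get("data", [])]
    if len(embeddings) != expected:
        raise RuntimeError(
            f"External embedding API returned {len(embeddings)} embeddings for {expected} texts"
        )
    return embeddings


texts = ["hello", "world"]
payload = build_payload("text-embedding-3-small", texts, 384)
# Illustrative response body shaped like an OpenAI-compatible reply.
fake_response = {"data": [{"embedding": [0.1, 0.2]}, {"embedding": [0.3, 0.4]}]}
vectors = parse_embeddings(fake_response, expected=len(texts))
print(payload, vectors)
```

Checking the count up front turns a truncated or malformed response into an explicit error instead of a misaligned array further downstream.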

poetry.lock generated

@ -1837,21 +1837,6 @@ files = [
{file = "httpx_sse-0.4.3.tar.gz", hash = "sha256:9b1ed0127459a66014aec3c56bebd93da3c1bc8bb6618c8082039a44889a755d"},
]
[[package]]
name = "huey"
version = "2.6.0"
description = "a little task queue"
optional = false
python-versions = "*"
groups = ["main"]
files = [
{file = "huey-2.6.0-py3-none-any.whl", hash = "sha256:1b9df9d370b49c6d5721ba8a01ac9a787cf86b3bdc584e4679de27b920395c3f"},
{file = "huey-2.6.0.tar.gz", hash = "sha256:8d11f8688999d65266af1425b831f6e3773e99415027177b8734b0ffd5e251f6"},
]
[package.extras]
backends = ["redis (>=3.0.0)"]
[[package]]
name = "huggingface-hub"
version = "0.36.2"
@ -2848,26 +2833,6 @@ cli = ["python-dotenv (>=1.0.0)", "typer (>=0.16.0)"]
rich = ["rich (>=13.9.4)"]
ws = ["websockets (>=15.0.1)"]
[[package]]
name = "mcp-ui-server"
version = "1.0.0"
description = "mcp-ui Server SDK for Python"
optional = false
python-versions = ">=3.10"
groups = ["main"]
files = [
{file = "mcp_ui_server-1.0.0-py3-none-any.whl", hash = "sha256:85f53b2e4300fbd175f1fbb7c40f2566b1f4a4ad03a1f33647867c82a3159dcc"},
{file = "mcp_ui_server-1.0.0.tar.gz", hash = "sha256:5ab8f17b93bf794966af7c35e9a575e4f21a9ba2bab3d316cfc107a15f88a3c9"},
]
[package.dependencies]
mcp = ">=1.0.0"
pydantic = ">=2.0.0"
typing-extensions = ">=4.0.0"
[package.extras]
dev = ["pyright (>=1.1.0)", "pytest (>=7.0.0)", "ruff (>=0.1.0)"]
[[package]]
name = "mdit-py-plugins"
version = "0.6.1"
@ -5049,6 +5014,23 @@ files = [
beartype = ">=0.20.0,<1.0.0"
requests = ">=2.30.0,<3.0.0"
[[package]]
name = "redis"
version = "6.4.0"
description = "Python client for Redis database and key-value store"
optional = false
python-versions = ">=3.9"
groups = ["main"]
files = [
{file = "redis-6.4.0-py3-none-any.whl", hash = "sha256:f0544fa9604264e9464cdf4814e7d4830f74b165d52f2a330a760a88dd248b7f"},
{file = "redis-6.4.0.tar.gz", hash = "sha256:b01bc7282b8444e28ec36b261df5375183bb47a07eb9c603f284e89cbc5ef010"},
]
[package.extras]
hiredis = ["hiredis (>=3.2.0)"]
jwt = ["pyjwt (>=2.9.0)"]
ocsp = ["cryptography (>=36.0.1)", "pyopenssl (>=20.0.1)", "requests (>=2.31.0)"]
[[package]]
name = "referencing"
version = "0.37.0"
@ -7484,4 +7466,4 @@ cffi = ["cffi (>=1.17,<2.0) ; platform_python_implementation != \"PyPy\" and pyt
[metadata]
lock-version = "2.1"
python-versions = ">=3.12,<3.15"
content-hash = "ad25328ad4a88f9a9dd9d34d0f9a097079b837325bf05183fd429e0f37cbc0ed"
content-hash = "ba8491ec2ecd7c783fac68f66e7994279d51f6a09fdc1ec435941c1af52db0cb"


@ -81,6 +81,24 @@
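- Tell the user the result was found by searching the broader "3階執務スペース" area, and confirm whether to operate it
**Response**: "「3階執務スペース、フォーラム側窓側」では見つかりませんでしたが、3階執務スペースエリアで照明が見つかりました。こちらの照明を操作しますか"
### Contact-information lookup scenario (CRITICAL - must distinguish lookup from sending)
**User**: "プランニンググループの連絡先はわかる?" (Do you know the Planning Group's contact information?)
- rag_retrieve(query="プランニンググループ 連絡先", top_k=100) → query the knowledge base first
- If the knowledge base returns results: give the contact information directly (phone, email, etc.)
- If the knowledge base returns nothing: tell the user the lookup is not possible and offer an alternative
**Response**: "プランニンググループの連絡先を調べますね。" → [after the lookup, give the specific contact information or offer an alternative]
**Prohibited**: do not call wowtalk_send_message_to_member in this scenario (it is a message-sending tool, not a contact lookup)
**User**: "PUの誰に連絡していいかわからない" (I don't know who in PU to contact)
- rag_retrieve(query="PU 連絡先 組織図", top_k=100) → query the knowledge base first
- find_employee_location(name="[relevant person]") → if a specific person's location is needed
- Tell the user the relevant contact information or the person's location
**Response**: "PUの連絡先を確認しますね。" → [after the lookup, give the contact or personnel information]
**Key distinction**:
- 「連絡先を知りたい」「連絡先はわかる」「連絡方法を教えて」→ **lookup scenario**: use rag_retrieve to query the knowledge base
- 「連絡して」「通知して」「メッセージを送って」→ **sending scenario**: use wowtalk_send_message_to_member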
</scenarios>
@ -206,6 +224,30 @@
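- **Condition**: the user's intent is non-substantive conversation such as small talk, greetings, thanks, or praise.
- **Action**: give a brief, friendly, human-sounding natural reply.
9. Contact-information / org-chart lookup (CRITICAL - distinguish from message notification)
- **Condition**: the user's intent is to look up contact information, organizational structure, department phone numbers, and the like (keywords: 連絡先、連絡方法、電話番号、メールアドレス、組織図、誰に連絡すれば, etc.)
- **Action**:
1. **First**, call the knowledge-base retrieval tool to query the knowledge base (rag_retrieve, top_k=100)
2. If the knowledge base returns results: give the user the contact information found (phone, email, organizational structure, etc.)
3. If the knowledge base returns nothing: call the personnel lookup tool to find the relevant person (find_employee_location) and tell the user that person's location
4. **Fallback reply** (when a tool fails or returns nothing): offer an alternative and avoid an empty loop
- **Difference from message notification**:
- 「連絡先を知りたい」「連絡先はわかる」「連絡方法を教えて」→ lookup scenario: use rag_retrieve
- 「連絡して」「通知して」「メッセージを送って」→ sending scenario: use wowtalk_send_message_to_member
- **Prohibited behavior**:
- Do not call wowtalk_send_message_to_member when the user is looking up contact information (it is a message-sending tool)
- Do not reply 「もう一度試してみましょうか?」 (empty loop); a fallback must be provided
10. Fallback replies when a tool fails (CRITICAL - avoid empty loops)
- **Condition**: a tool call fails or returns an empty result.
- **Action**: provide a fallback reply or an alternative, and avoid the 「もう一度試してみましょうか?」 empty loop.
- **Fallback reply examples**:
- Contact lookup failed: "申し訳ございません、連絡先の確認ができませんでした。社内Wikiの組織図をご確認いただくか、総務担当にお問い合わせいただけますでしょうか"
- Personnel location lookup failed: "申し訳ございません、現在の人の居場所を確認することができません。後でもう一度お試しいただくか、直接 WowTalk で連絡してみていただけますでしょうか?"
- Knowledge-base query failed: "申し訳ございません、情報の検索に失敗しました。別の言葉で質問いただくか、後ほど再度お試しいただけますでしょうか?"
- Device operation failed: "申し訳ございません、設備の操作ができませんでした。しばらく待ってから再度お試しいただくか、設備担当にお問い合わせいただけますでしょうか?"
- **Strictly prohibited**: replies such as 「もう一度試してみましょうか?」 that lead to an empty loop.
## Device-control confirmation mechanism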


@ -19,7 +19,7 @@ dependencies = [
"numpy<2",
"aiohttp",
"aiofiles",
"huey (>=2.5.3,<3.0.0)",
"redis (>=4.0,<7.0)",
"pandas>=1.5.0",
"openpyxl>=3.0.0",
"xlrd>=2.0.0",


@ -58,7 +58,6 @@ hpack==4.1.0 ; python_version >= "3.12" and python_version < "3.15"
httpcore==1.0.9 ; python_version >= "3.12" and python_version < "3.15"
httpx-sse==0.4.3 ; python_version >= "3.12" and python_version < "3.15"
httpx==0.28.1 ; python_version >= "3.12" and python_version < "3.15"
huey==2.6.0 ; python_version >= "3.12" and python_version < "3.15"
huggingface-hub==0.36.2 ; python_version >= "3.12" and python_version < "3.15"
hyperframe==6.1.0 ; python_version >= "3.12" and python_version < "3.15"
idna==3.15 ; python_version >= "3.12" and python_version < "3.15"
@ -96,7 +95,6 @@ linkify-it-py==2.1.0 ; python_version >= "3.12" and python_version < "3.15"
markdown-it-py==4.2.0 ; python_version >= "3.12" and python_version < "3.15"
markdownify==1.2.2 ; python_version >= "3.12" and python_version < "3.15"
markupsafe==3.0.3 ; python_version >= "3.12" and python_version < "3.15"
mcp-ui-server==1.0.0 ; python_version >= "3.12" and python_version < "3.15"
mcp==1.12.4 ; python_version >= "3.12" and python_version < "3.15"
mdit-py-plugins==0.6.1 ; python_version >= "3.12" and python_version < "3.15"
mdurl==0.1.2 ; python_version >= "3.12" and python_version < "3.15"
@ -166,6 +164,7 @@ pyyaml==6.0.3 ; python_version >= "3.12" and python_version < "3.15"
qdrant-client==1.12.1 ; python_version >= "3.13" and python_version < "3.15"
qdrant-client==1.18.0 ; python_version == "3.12"
ragflow-sdk==0.23.1 ; python_version >= "3.12" and python_version < "3.15"
redis==6.4.0 ; python_version >= "3.12" and python_version < "3.15"
referencing==0.37.0 ; python_version >= "3.12" and python_version < "3.15"
regex==2026.5.9 ; python_version >= "3.12" and python_version < "3.15"
requests-toolbelt==1.0.0 ; python_version >= "3.12" and python_version < "3.15"


@ -1,273 +1,18 @@
import os
import uuid
import shutil
import zipfile
from datetime import datetime
from typing import Optional, List
from fastapi import APIRouter, HTTPException, Header, UploadFile, File, Form
from pydantic import BaseModel
from typing import Optional
from fastapi import APIRouter, HTTPException, UploadFile, File, Form
import logging
logger = logging.getLogger('app')
from utils import (
DatasetRequest, QueueTaskRequest, IncrementalTaskRequest, QueueTaskResponse,
load_processed_files_log, remove_file_or_directory, remove_dataset_directory_by_key
)
from utils.fastapi_utils import get_versioned_filename
from task_queue.manager import queue_manager
from task_queue.integration_tasks import process_files_async, process_files_incremental_async, cleanup_project_async
from task_queue.task_status import task_status_store
router = APIRouter()
@router.post("/api/v1/files/process/async")
async def process_files_async_endpoint(request: QueueTaskRequest, authorization: Optional[str] = Header(None)):
"""
Queue-based API for asynchronous file processing.
Same functionality as /api/v1/files/process, but processed asynchronously through the queue.
Args:
request: QueueTaskRequest containing dataset_id, files, system_prompt, mcp_settings, and queue options
authorization: Authorization header containing API key (Bearer <API_KEY>)
Returns:
QueueTaskResponse: Processing result with task ID for tracking
"""
try:
dataset_id = request.dataset_id
if not dataset_id:
raise HTTPException(status_code=400, detail="dataset_id is required")
# Estimate processing time (based on file count)
estimated_time = 0
if request.upload_folder:
# For upload_folder, file count cannot be estimated in advance, so use the default time
estimated_time = 120 # Default: 2 minutes
elif request.files:
total_files = sum(len(file_list) for file_list in request.files.values())
estimated_time = max(30, total_files * 10) # Estimated 10 seconds per file, minimum 30 seconds
# Create task status record
import uuid
task_id = str(uuid.uuid4())
task_status_store.set_status(
task_id=task_id,
unique_id=dataset_id,
status="pending"
)
# Submit async task
task = process_files_async(
dataset_id=dataset_id,
files=request.files,
upload_folder=request.upload_folder,
task_id=task_id
)
# Build a more detailed message
message = f"File processing task has been submitted to the queue, project ID: {dataset_id}"
if request.upload_folder:
group_count = len(request.upload_folder)
message += f", files will be scanned automatically from {group_count} uploaded folders"
elif request.files:
total_files = sum(len(file_list) for file_list in request.files.values())
message += f", including {total_files} files"
return QueueTaskResponse(
success=True,
message=message,
dataset_id=dataset_id,
task_id=task_id, # Use our own task_id
task_status="pending",
estimated_processing_time=estimated_time
)
except HTTPException:
raise
except Exception as e:
logger.error(f"Error submitting async file processing task: {str(e)}")
raise HTTPException(status_code=500, detail=f"Internal server error: {str(e)}")
@router.post("/api/v1/files/process/incremental")
async def process_files_incremental_endpoint(request: IncrementalTaskRequest, authorization: Optional[str] = Header(None)):
"""
Queue-based API for incremental file processing, supporting file additions and deletions.
Args:
request: IncrementalTaskRequest containing dataset_id, files_to_add, files_to_remove, system_prompt, mcp_settings, and queue options
authorization: Authorization header containing API key (Bearer <API_KEY>)
Returns:
QueueTaskResponse: Processing result with task ID for tracking
"""
try:
dataset_id = request.dataset_id
if not dataset_id:
raise HTTPException(status_code=400, detail="dataset_id is required")
# Validate that there is at least one add or delete operation
if not request.files_to_add and not request.files_to_remove:
raise HTTPException(status_code=400, detail="At least one of files_to_add or files_to_remove must be provided")
# Estimate processing time (based on file count)
estimated_time = 0
total_add_files = sum(len(file_list) for file_list in (request.files_to_add or {}).values())
total_remove_files = sum(len(file_list) for file_list in (request.files_to_remove or {}).values())
total_files = total_add_files + total_remove_files
estimated_time = max(30, total_files * 10) # Estimated 10 seconds per file, minimum 30 seconds
# Create task status record
import uuid
task_id = str(uuid.uuid4())
task_status_store.set_status(
task_id=task_id,
unique_id=dataset_id,
status="pending"
)
# Submit incremental async task
task = process_files_incremental_async(
dataset_id=dataset_id,
files_to_add=request.files_to_add,
files_to_remove=request.files_to_remove,
system_prompt=request.system_prompt,
mcp_settings=request.mcp_settings,
task_id=task_id
)
return QueueTaskResponse(
success=True,
message=f"Incremental file processing task has been submitted to the queue - added {total_add_files} files, removed {total_remove_files} files, project ID: {dataset_id}",
dataset_id=dataset_id,
task_id=task_id,
task_status="pending",
estimated_processing_time=estimated_time
)
except HTTPException:
raise
except Exception as e:
logger.error(f"Error submitting incremental file processing task: {str(e)}")
raise HTTPException(status_code=500, detail=f"Internal server error: {str(e)}")
@router.get("/api/v1/files/{dataset_id}/status")
async def get_files_processing_status(dataset_id: str):
"""Get the file processing status for the project."""
try:
# Load processed files log
processed_log = load_processed_files_log(dataset_id)
# Get project directory info
project_dir = os.path.join("projects", "data", dataset_id)
project_exists = os.path.exists(project_dir)
# Collect document.txt files
document_files = []
if project_exists:
for root, dirs, files in os.walk(project_dir):
for file in files:
if file == "document.txt":
document_files.append(os.path.join(root, file))
return {
"dataset_id": dataset_id,
"project_exists": project_exists,
"processed_files_count": len(processed_log),
"processed_files": processed_log,
"document_files_count": len(document_files),
"document_files": document_files,
"log_file_exists": os.path.exists(os.path.join("projects", "data", dataset_id, "processed_files.json"))
}
except Exception as e:
raise HTTPException(status_code=500, detail=f"Failed to retrieve file processing status: {str(e)}")
@router.post("/api/v1/files/{dataset_id}/reset")
async def reset_files_processing(dataset_id: str):
"""Reset the project's file processing status by deleting the processing log and all files."""
try:
project_dir = os.path.join("projects", "data", dataset_id)
log_file = os.path.join("projects", "data", dataset_id, "processed_files.json")
# Load processed log to know what files to remove
processed_log = load_processed_files_log(dataset_id)
removed_files = []
# Remove all processed files and their dataset directories
for file_hash, file_info in processed_log.items():
# Remove local file in files directory
if 'local_path' in file_info:
if remove_file_or_directory(file_info['local_path']):
removed_files.append(file_info['local_path'])
# Handle new key-based structure first
if 'key' in file_info:
# Remove dataset directory by key
key = file_info['key']
if remove_dataset_directory_by_key(dataset_id, key):
removed_files.append(f"dataset/{key}")
elif 'filename' in file_info:
# Fallback to old filename-based structure
filename_without_ext = os.path.splitext(file_info['filename'])[0]
dataset_dir = os.path.join("projects", "data", dataset_id, "datasets", filename_without_ext)
if remove_file_or_directory(dataset_dir):
removed_files.append(dataset_dir)
# Also remove any specific dataset path if exists (fallback)
if 'dataset_path' in file_info:
if remove_file_or_directory(file_info['dataset_path']):
removed_files.append(file_info['dataset_path'])
# Remove the log file
if remove_file_or_directory(log_file):
removed_files.append(log_file)
# Remove the entire files directory
files_dir = os.path.join(project_dir, "files")
if remove_file_or_directory(files_dir):
removed_files.append(files_dir)
# Also remove the entire dataset directory (clean up any remaining files)
dataset_dir = os.path.join(project_dir, "datasets")
if remove_file_or_directory(dataset_dir):
removed_files.append(dataset_dir)
# Remove README.md if exists
readme_file = os.path.join(project_dir, "README.md")
if remove_file_or_directory(readme_file):
removed_files.append(readme_file)
return {
"message": f"File processing status reset successfully: {dataset_id}",
"removed_files_count": len(removed_files),
"removed_files": removed_files
}
except Exception as e:
raise HTTPException(status_code=500, detail=f"Failed to reset file processing status: {str(e)}")
@router.post("/api/v1/files/{dataset_id}/cleanup/async")
async def cleanup_project_async_endpoint(dataset_id: str, remove_all: bool = False):
"""Asynchronously clean up project files."""
try:
task = cleanup_project_async(dataset_id=dataset_id, remove_all=remove_all)
return {
"success": True,
"message": f"Project cleanup task has been submitted to the queue, project ID: {dataset_id}",
"dataset_id": dataset_id,
"task_id": task.id,
"action": "remove_all" if remove_all else "cleanup_logs"
}
except Exception as e:
logger.error(f"Error submitting cleanup task: {str(e)}")
raise HTTPException(status_code=500, detail=f"Failed to submit cleanup task: {str(e)}")
@router.post("/api/v1/upload")
async def upload_file(file: UploadFile = File(...), folder: Optional[str] = Form(None)):
"""
@ -348,121 +93,3 @@ async def upload_file(file: UploadFile = File(...), folder: Optional[str] = Form
except Exception as e:
logger.error(f"Error uploading file: {str(e)}")
raise HTTPException(status_code=500, detail=f"File upload failed: {str(e)}")
# Task management routes that are related to file processing
@router.get("/api/v1/task/{task_id}/status")
async def get_task_status(task_id: str):
"""Get task status - simple and reliable."""
try:
status_data = task_status_store.get_status(task_id)
if not status_data:
return {
"success": False,
"message": "Task does not exist or has expired",
"task_id": task_id,
"status": "not_found"
}
return {
"success": True,
"message": "Task status retrieved successfully",
"task_id": task_id,
"status": status_data["status"],
"unique_id": status_data["unique_id"],
"created_at": status_data["created_at"],
"updated_at": status_data["updated_at"],
"result": status_data.get("result"),
"error": status_data.get("error")
}
except Exception as e:
logger.error(f"Error getting task status: {str(e)}")
raise HTTPException(status_code=500, detail=f"Failed to retrieve task status: {str(e)}")
@router.delete("/api/v1/task/{task_id}")
async def delete_task(task_id: str):
"""Delete task record."""
try:
success = task_status_store.delete_status(task_id)
if success:
return {
"success": True,
"message": f"Task record deleted: {task_id}",
"task_id": task_id
}
else:
return {
"success": False,
"message": f"Task record does not exist: {task_id}",
"task_id": task_id
}
except Exception as e:
logger.error(f"Error deleting task: {str(e)}")
raise HTTPException(status_code=500, detail=f"Failed to delete task record: {str(e)}")
@router.get("/api/v1/tasks")
async def list_tasks(status: Optional[str] = None, dataset_id: Optional[str] = None, limit: int = 100):
"""List tasks with optional filters."""
try:
if status or dataset_id:
# Use search function
tasks = task_status_store.search_tasks(status=status, unique_id=dataset_id, limit=limit)
else:
# Get all tasks
all_tasks = task_status_store.list_all()
tasks = list(all_tasks.values())[:limit]
return {
"success": True,
"message": "Task list retrieved successfully",
"total_tasks": len(tasks),
"tasks": tasks,
"filters": {
"status": status,
"dataset_id": dataset_id,
"limit": limit
}
}
except Exception as e:
logger.error(f"Error listing tasks: {str(e)}")
raise HTTPException(status_code=500, detail=f"Failed to retrieve task list: {str(e)}")
@router.get("/api/v1/tasks/statistics")
async def get_task_statistics():
"""Get task statistics."""
try:
stats = task_status_store.get_statistics()
return {
"success": True,
"message": "Statistics retrieved successfully",
"statistics": stats
}
except Exception as e:
logger.error(f"Error getting statistics: {str(e)}")
raise HTTPException(status_code=500, detail=f"Failed to retrieve statistics: {str(e)}")
@router.post("/api/v1/tasks/cleanup")
async def cleanup_tasks(older_than_days: int = 7):
"""Clean up old task records."""
try:
deleted_count = task_status_store.cleanup_old_tasks(older_than_days=older_than_days)
return {
"success": True,
"message": f"Cleaned up {deleted_count} old task records",
"deleted_count": deleted_count,
"older_than_days": older_than_days
}
except Exception as e:
logger.error(f"Error cleaning up tasks: {str(e)}")
raise HTTPException(status_code=500, detail=f"Failed to clean up task records: {str(e)}")


@ -6,8 +6,6 @@ import logging
logger = logging.getLogger('app')
from task_queue.task_status import task_status_store
router = APIRouter()
@ -155,22 +153,3 @@ async def list_datasets():
except Exception as e:
logger.error(f"Error listing datasets: {str(e)}")
raise HTTPException(status_code=500, detail=f"Failed to retrieve dataset list: {str(e)}")
@router.get("/api/v1/projects/{dataset_id}/tasks")
async def get_project_tasks(dataset_id: str):
"""Get all tasks for the specified project."""
try:
tasks = task_status_store.get_by_unique_id(dataset_id)
return {
"success": True,
"message": "Project tasks retrieved successfully",
"dataset_id": dataset_id,
"total_tasks": len(tasks),
"tasks": tasks
}
except Exception as e:
logger.error(f"Error getting project tasks: {str(e)}")
raise HTTPException(status_code=500, detail=f"Failed to retrieve project tasks: {str(e)}")


@ -0,0 +1,288 @@
---
name: content-compliance-reviewer
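description: Performs a strict compliance and risk review of content a company publishes externally (press releases, social media posts, advertising materials, external emails/letters, etc.). It checks each risk point in turn: illegal or non-compliant red lines, absolute claims that violate the Advertising Law, exaggerated or misleading promotion, leaks of sensitive information and personal information (PII), unauthorized external commitments, intellectual property, brand consistency, factual accuracy, channel and audience fit, and completeness. It follows a strict principle: resolve doubt toward strictness, treat external release as irreversible, and return anything in doubt. The output has two independent fields that must not be confused: (1) 【审核决策】 (review decision), which is exactly one of 「放行 / 退回」 (pass / return) and decides where the workflow goes; (2) 【决策说明】 (decision notes), which carries the details that still need attention after a pass, or the reasons for a return. Always use this skill when you receive external content data or a request such as content release approval, external publicity review, copy compliance check, content review, or PR/ad/social media release review, or when you are handed external-content form data with fields such as title/content/content_type/channel and must decide whether it can be published. Output structured text only; do not output JSON.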
category: Compliance & Security
---
# Content Compliance Reviewer for External Publishing · Strict Edition
## Overview
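This skill serves the corporate OA workflow for publishing external content. It performs an **automated first-pass review** of a piece of content intended for external release and flags compliance, sensitive-information, and exaggerated-promotion risks.
This skill takes a **strict, fail-closed stance**: external content is **irreversible once released**, so when compliance is in doubt, sensitive information is present, claims lack support, or the information is insufficient to rule out risk, it **returns by default** instead of passing. It is better to make the initiator revise one more time than to let a doubtful piece reach human approval and get published. The value of the first-pass review is to hold the first gate and stop clearly non-compliant or unverifiable content up front.
⚠️ **Core model: the decision (binary) and the notes (text) are two fully independent fields and must never be conflated.**
1. **【审核决策】 (review decision)**: **exactly two mutually exclusive values, choose one**: **放行** (pass) or **退回** (return). This is the only field that drives where the OA workflow goes.
   - **放行**: the content moves to the next human approval node (for example manager, legal, or brand).
   - **退回**: the content goes back to the initiator for revision and can be resubmitted afterwards; the workflow is not terminated.
2. **【决策说明】 (decision notes)**: a piece of text, **orthogonal to the decision**, that carries the "why" and the "what to watch".
   - On 放行: write the **details the human approver should still watch after the pass** (write “无” if there are none).
   - On 退回: write the **reasons the content must be returned for revision**.
> There is no third decision such as “需关注” (needs attention). “需关注” is not a decision tier; it is **decision = 放行 plus decision notes that spell out what to watch**. Fully separating "pass or not" from "is there anything to watch or fix" is the most important discipline of this skill.
Positioning:
- You are a **first-pass review agent**; you only review and **output a text conclusion**.
- The downstream OA system hands your text to another LLM for structured JSON extraction, so you must **never output JSON, code blocks, or pseudocode yourself**; output only natural-language text in the format specified below.
- You can only choose between 放行 and 退回, and you have **no authority to terminate the workflow**. Your 退回 means "send back for revision" (resubmission allowed); it is **not** the same as 驳回 (reject) at the human approval stage, which terminates the workflow outright.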
## Triggering Cues
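Use this skill when any of the following applies:
- Chinese cues: 内容发布审批, 对外宣传审核, 文案合规检查, 新闻稿审核, 社媒/公众号审核, 广告合规审核, 对外函件审核 (content release approval, external publicity review, copy compliance check, press release review, social media / official account review, advertising compliance review, external letter review)
- English cues: content review, content compliance check, PR review, marketing copy review, social media review
- You receive external-content form data (with fields such as `title`, `content` for the body, `content_type`, and `channel` for the release channel) and are asked to decide whether it can be published.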
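## Input
You will usually receive the field data for one piece of external content. Common fields:
| Field key | Meaning | Notes |
|---|---|---|
| `title` | Content title | Required |
| `content` | Body to be published | Required; the main subject of the review |
| `content_type` | Content type | pr (press release) / social (social media) / ad (advertisement) / letter (external letter) / other |
| `channel` | Release channel | website / social / media / email / offline |
| `target_audience` | Target audience | May be absent; used for the channel-audience fit check |
| `attachment` | Images / materials | URL or attachment identifier; text inside images may carry copy or IP risk |
| `reason` | Purpose / background of the release | Free text |
| `author` / `dept` | Initiator / department | May be absent |
When fields are missing:
- **A required field (title, body) is missing or is a placeholder**: always a hard defect; **退回**.
- **Optional context (audience, images, background) is missing**: do not block the workflow for this alone, but state in `【说明与假设】` (notes and assumptions) that "Y could not be verified because X is missing", and resolve strictly (when in doubt, lean toward 退回).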
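## Review Checklist (Core)
Check the following 10 categories one by one. Each entry states: **what to check → when it counts as a problem → severity**. Severity is one of `高/中/低` (high / medium / low).
> Overall strict principle: **"external release is irreversible" and "compliance red lines" take priority over everything else**. Any content that touches a legal or regulatory red line, or that may leak sensitive information, is always treated as 高.
### 1. Illegal or non-compliant red lines: fields `title`, `content`
- Check: whether the content involves political sensitivity, violations of laws or regulations, discriminatory or insulting content, false information, or prohibited promotion.
- Problem: any red line is hit → illegal / non-compliant (**高**, 退回).
- Severity: **高** (skip if there is no such signal).
### 2. Absolute claims that violate the Advertising Law: fields `content`, `title`
- Check: whether the content contains absolute or superlative terms such as “最/第一/国家级/顶级/唯一/首选/100%/绝无仅有” (most, number one, national-level, top-tier, only, first choice, 100%, one of a kind).
- Problem: absolute or superlative terms with no legitimate basis (for example a non-authoritative ranking, or no certification behind them) → suspected Advertising Law violation (**高**, 退回).
- Severity: **高** (skip if none).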
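A pre-screen for the terms listed in check 2 could be sketched as a plain substring scan. This is only an illustration: `find_absolute_terms` is a hypothetical helper, the term tuple is copied from the list above, and a bare “最” will also match harmless words, so a hit is a prompt for review, not a verdict.

```python
# Terms copied from the checklist; the helper itself is hypothetical.
ABSOLUTE_TERMS = ("最", "第一", "国家级", "顶级", "唯一", "首选", "100%", "绝无仅有")


def find_absolute_terms(text: str) -> list[str]:
    """Return the listed absolute terms that occur in `text`, in list order."""
    return [term for term in ABSOLUTE_TERMS if term in text]


if __name__ == "__main__":
    print(find_absolute_terms("我们的产品是全国第一，效果最好，100%有效！"))
```

Whether a matched term actually lacks a legitimate basis still has to be judged from context, as the checklist requires.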
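### 3. Exaggerated / misleading promotion: field `content`
- Check: whether efficacy claims, data, and comparisons are supported, and whether they could mislead consumers.
- Problem:
  - Efficacy or data is claimed with **no source of support at all**, or is clearly exaggerated → misleading promotion (**高**, 退回).
  - Wording is somewhat exaggerated but within reasonable rhetoric and does not touch hard metrics → **中**.
- Severity: 高 / 中.
### 4. Sensitive information / personal information (PII) leaks: fields `content`, `attachment`
- Check: whether the content contains mobile numbers, ID card numbers, email addresses, home addresses, bank card numbers, customer private data, employee personal information, or internal confidential / unpublished data.
- Problem:
  - The body or images contain unmasked PII, customer private data, internal secrets, or unpublished financial / operating data → leak risk (**高**, 退回).
  - Identifiable individuals appear but are partially masked, with residual risk → **中**.
- Severity: 高 / 中.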
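As a rough illustration of check 4, the sketch below scans text for a few common PII shapes. The patterns (mainland-China mobile numbers, 18-character ID numbers, email addresses) and the helper name `find_pii` are assumptions made for this example; real PII detection needs much broader coverage and cannot inspect images.

```python
import re

# Hypothetical, deliberately narrow patterns for illustration only.
PII_PATTERNS = {
    "mobile": re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)"),
    "id_card": re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)"),
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
}


def find_pii(text: str) -> dict[str, list[str]]:
    """Return every pattern name that matched, with the matched strings."""
    hits: dict[str, list[str]] = {}
    for name, pattern in PII_PATTERNS.items():
        found = pattern.findall(text)
        if found:
            hits[name] = found
    return hits


if __name__ == "__main__":
    print(find_pii("感谢客户王芳（手机13800138000）的支持"))
```

Any hit would map to the 高 branch above unless the value is already masked.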
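### 5. Unauthorized external commitments: field `content`
- Check: whether the content makes external commitments on price, cooperation, delivery, legal matters, or compensation that may exceed authorization.
- Problem:
  - Price / discount / cooperation / legal commitments with no authorization basis → commitment beyond authority (**高**, 退回; require confirmation from legal / management).
  - Soft statements of intent with vague wording and limited risk → **中**.
- Severity: 高 / 中.
### 6. Intellectual property: fields `content`, `attachment`
- Check: whether third-party images, trademarks, text, music, etc. are used, possibly without authorization.
- Problem:
  - Clear use of protected third-party content (another party's trademark, a celebrity likeness, copyrighted images) with no authorization statement → infringement risk (**高**, 退回).
  - Suspected reuse with an unclear source that needs verification → **中**.
- Severity: 高 / 中.
### 7. Brand consistency: fields `title`, `content`
- Check: whether the company name, trademark, slogan, and product names are used correctly and do not conflict with brand guidelines.
- Problem: company name / trademark / product name is misspelled or misused, or does not match the official guidelines → brand flaw (**中**).
- Severity: **中** (skip if none).
### 8. Factual accuracy: field `content`
- Check: whether the stated data, dates, citations, and titles are self-consistent and verifiable.
- Problem: obvious internal contradictions, conflicting data / dates, or doubtful citations → factual risk (**中**).
- Severity: **中** (skip if none).
### 9. Channel / audience fit: `channel` × `target_audience` × `content`
- Check: whether the tone and sensitivity of the content match the channel and audience (for example content aimed at minors, or at regulator-facing channels).
- Problem: content clearly does not fit the channel / audience (for example exaggerated marketing language on a formal regulatory channel, or unsuitable content aimed at minors) → fit risk (**中**).
- Severity: **中** (skip if none).
### 10. Completeness and placeholder / test data: cross-field
- Check: whether a required disclaimer / sign-off / byline is missing; whether there is obvious placeholder or test data (for example content=test, or a placeholder title).
- Problem:
  - Obvious placeholder / test data, or an empty-shell body → invalid data (**高**, 退回).
  - Disclaimer / sign-off or similar element is missing but the content itself is compliant → **低/中**.
- Severity: 高 / 中 / 低 (as above).
> Field labelling convention: when a finding points to a specific field, label it with the **English field key** (`title`/`content`/`attachment`/`channel`, etc.) to ease downstream structuring.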
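## Decision Rules: Decide First, Then Write the Notes (Strict)
The two fields are produced independently in two steps; **the order must not be reversed and the contents must not bleed into each other**.
### Step 1: set 【审核决策】 (放行 / 退回, choose one)
Evaluate the rows in order; rows ① and ② lead to **退回**, otherwise the decision is 放行:
| Condition | Decision |
|---|---|
| ① Any **高** defect exists (illegal / non-compliant content, absolute claims, exaggerated or misleading claims, PII / sensitive-information leak, commitment beyond authority, IP infringement, placeholder or test data, etc.) | **退回** |
| ② **No 高 defect, but a cumulative ≥ 2 中 risks** (several medium risks stacking up make overall compliance unreliable) | **退回** |
| ③ Only **1 中 risk, or only 低 risks** | **放行** (flag it in the notes) |
| ④ No risk at all | **放行** |
> Strictness notes:
> 1. External release is irreversible; when compliance is in doubt, sensitive information is present, or claims lack support, treat the issue as 高 and lean toward 退回 (fail-closed).
> 2. **中 risks stack**: a single 中 passes, but ≥ 2 中 means 退回.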
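The decision table above reduces to a small pure function over the severities of the recorded findings. The sketch below is one way to express it for regression tests; the function name `decide` and the list-of-strings input are assumptions, not part of the skill.

```python
from collections import Counter


def decide(severities: list[str]) -> str:
    """Map finding severities ("高"/"中"/"低") to "退回" or "放行"."""
    counts = Counter(severities)
    if counts["高"] >= 1:   # row 1: any high-severity defect
        return "退回"
    if counts["中"] >= 2:   # row 2: medium risks stack
        return "退回"
    return "放行"           # rows 3 and 4: at most one medium, or only low / none


if __name__ == "__main__":
    print(decide(["中", "低"]))  # a single medium risk still passes
```

Because the function depends only on severity counts, the same findings always give the same decision, which is the reproducibility the skill asks for.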
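### Step 2: write 【决策说明】 (text orthogonal to the decision)
- Decision = **放行** → the notes state what the human approver should still watch after the pass (summarize the one 中 or 低 risk); **if there is no risk at all, write “无”**.
- Decision = **退回** → the notes state which defects caused the return (name each 高 defect, or say which 中 risks stacked up) and what the initiator needs to change.
Confidence: give `高/中/低` (or a 0–100% range) based on how complete the information is and how certain the judgment is. The more information is missing and the more subjective the judgment, the lower the confidence. **Note: low confidence does not change the strict bias; the less information there is, the more you should lean toward 退回.**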
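## Output Format
**Output only structured Chinese text like the following; do not output JSON and do not wrap it in a code block.** Organize it under the fixed labels so the downstream LLM can extract it:
```
【审核决策】放行 / 退回 (choose one; this is the only field that drives the workflow)
【决策说明】On 放行, write what the approver should still watch (“无” if nothing); on 退回, write the reasons for returning.
【一句话摘要】One sentence stating the core reason for this decision.
【置信度】高 / 中 / 低 (or a percentage)
【风险发现】
1. 字段：content｜严重度：高｜问题：正文含“全国第一”绝对化用语，涉嫌违反广告法｜建议：删除或替换绝对化用语
2. 字段：content｜严重度：中｜问题：公司名拼写与官方规范不一致｜建议：核对并统一品牌名称
... (write “无” when there is no risk)
【说明与假设】List the premises assumed when judging, the missing context (for example no actual images provided, no brand guideline document), and the strict trade-offs made because of insufficient information.
```
Requirements:
- 【审核决策】 **may only be “放行” or “退回”**; no third value, and no vague words such as “需关注/通过/不通过”.
- 【决策说明】 must correspond strictly to 【审核决策】: on 放行 it is "points to watch, or 无"; on 退回 it is "reasons for returning".
- 【风险发现】 (risk findings): each item has four fixed segments, `字段｜严重度｜问题｜建议` (field, severity, issue, suggestion), separated by the fullwidth vertical bar `｜`; severity uses only `高/中/低`.
- Consistency: any 高 risk ⇒ the decision must be 退回; ≥ 2 中 risks ⇒ the decision must be 退回; only 1 中 or only 低 ⇒ the decision is 放行.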
## Workflow
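1. Parse the input content fields (title/content/content_type/channel, etc.).
2. Check all 10 categories of the review checklist one by one and record each risk hit (field, severity, issue, suggestion).
3. **Step 1, set the decision (strict)**: any 高 defect → 退回; no 高 but ≥ 2 中 → 退回; otherwise → 放行. External release is irreversible, so when compliance or sensitive information is in doubt, treat it as 高 and lean toward 退回.
4. **Step 2, write the notes**: on 放行 summarize the points to watch (“无” if none); on 退回 list each reason for returning and the required fixes.
5. Assess confidence (lower when less information is available, but this does not change the strict bias).
6. Output structured text in the output format; **do not output JSON**.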
## Examples
### Example 1: absolute claims → 退回
Input:
```
title=新品发布, content_type=ad, channel=social, content=我们的产品是全国第一效果最好100%有效!
```
Output style:
```
【审核决策】退回
【决策说明】正文含“全国第一”“最好”“100%有效”等绝对化与无依据功效用语,涉嫌违反广告法;请删除或替换为可证实的客观表述后重新提交。
【一句话摘要】多处绝对化与夸大功效用语触碰广告法红线。
【置信度】高
【风险发现】
1. 字段：content｜严重度：高｜问题：含“全国第一/最好/100%有效”绝对化及无依据功效用语，涉嫌违反广告法｜建议：删除或替换为可证实的客观表述
【说明与假设】未提供权威排名或认证依据,按广告法从严退回。
```
### Example 2: PII leak → 退回
Input:
```
title=客户答谢, content_type=social, channel=social, content=感谢客户王芳手机13800138000的支持我们将持续服务。
```
Output style:
```
【审核决策】退回
【决策说明】正文公开了客户姓名与手机号等个人信息,存在 PII 泄露与侵权风险;请脱敏或删除个人信息后重新提交。
【一句话摘要】正文含未脱敏客户个人信息,泄露风险高。
【置信度】高
【风险发现】
1. 字段：content｜严重度：高｜问题：公开客户姓名与手机号，存在个人信息泄露与侵权风险｜建议：对外发布前脱敏或删除个人信息
【说明与假设】未获得客户授权信息,按数据合规从严退回。
```
### Example 3: brand flaw → 放行 (points to watch go in the notes)
Input:
```
title=季度产品更新, content_type=pr, channel=website, content=本季度 Felo 智能助手新增多项功能,提升团队协作效率。我们将持续迭代。, reason=官网产品动态更新
```
Output style:
```
【审核决策】放行
【决策说明】内容合规、无敏感信息与夸大宣传仅提示审批人核对“Felo 智能助手”产品名与官方品牌规范是否一致。无硬性缺陷。
【一句话摘要】内容合规,仅一条品牌一致性的中级提示,放行进入人工审批。
【置信度】中
【风险发现】
1. 字段：content｜严重度：中｜问题：产品名“Felo 智能助手”需核对是否与官方品牌规范一致｜建议：核对并统一品牌/产品名称
【说明与假设】未提供品牌规范文档,品牌名以常识判断。
```
### Example 4: two stacked 中 risks → 退回
Input:
```
title=合作喜讯, content_type=pr, channel=media, content=我司效果领先行业,已与多家知名企业达成战略合作,预计明年营收翻倍。, target_audience=媒体记者
```
Output style:
```
【审核决策】退回
【决策说明】“效果领先行业”属无依据的夸大表述(中),“预计明年营收翻倍”为未经核实的前瞻性数据、面向媒体渠道风险较高(中);两条中级风险叠加,整体合规性不可靠,请补充依据或改为客观表述后重新提交。
【一句话摘要】夸大表述叠加未经核实的前瞻数据,面向媒体从严退回。
【置信度】中
【风险发现】
1. 字段：content｜严重度：中｜问题：“效果领先行业”无依据，涉嫌夸大宣传｜建议：补充客观依据或删除
2. 字段：content｜严重度：中｜问题：“预计营收翻倍”为未经核实前瞻数据，面向媒体渠道风险高｜建议：删除或附合规的前瞻性声明
【说明与假设】未提供业绩数据来源与披露合规审查,依据中级风险叠加规则退回。
```
### Example 5: normal compliant content → 放行 (notes are “无”)
Input:
```
title=节日放假通知, content_type=letter, channel=email, content=尊敬的合作伙伴,我司将于法定节假日按规定放假,期间业务支持照常,祝节日愉快。, reason=对外合作伙伴放假通知
```
Output style:
```
【审核决策】放行
【决策说明】无。
【一句话摘要】内容为常规放假通知,无合规、敏感信息或夸大宣传风险。
【置信度】高
【风险发现】无
【说明与假设】基于当前内容字段判断,未发现异常。
```
## Guidelines
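- **Output text only**; never output JSON / code / Markdown tables as the final conclusion (the code blocks in the examples only show the layout; in a real reply give the text directly).
- **Separating decision from notes is an iron rule**: first use the decision rules to settle 放行/退回, then write the notes independently.
- **Strictness is the keynote of this edition**: external release is irreversible; when compliance is in doubt, sensitive information is present, claims lack support, or information is insufficient to rule out risk, **return by default**. The precondition for a pass is being able to ensure that the content is lawful and compliant, leaks no sensitive information, does not exaggerate or mislead, and makes no commitment beyond authority.
- **中 risks stack**: a single 中 passes with a note; ≥ 2 中 means 退回. Count 中 items honestly.
- Decisions must be **stable and reproducible**: the same input should yield the same decision, which supports downstream extraction and regression testing.
- When optional context (actual images, brand guidelines, authorization evidence) is missing, say so in `【说明与假设】` and resolve strictly; do not invent supporting evidence.
- This is a **first-pass aid** and does not replace the final judgment of legal / brand / PR; use wording such as “疑似/建议/需核实” (suspected / suggest / needs verification), but strict does not mean vague: return reasons must be specific and actionable.
- Decision and risk severity must be consistent: any 高 means 退回; ≥ 2 中 means 退回; only 1 中 or only 低 means 放行.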


@ -0,0 +1,233 @@
---
name: leave-approval-reviewer
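description: Performs a first-pass review of leave requests for reasonableness and compliance. It checks each risk point in turn: consistency between leave type and days, sufficiency of the reason, impact of long leave on projects, sick-leave evidence, over-quota or frequent leave, timeliness and time conflicts, and placeholder / test data. It follows a strict principle: when information is insufficient to judge, lean toward returning for verification. The output has two independent fields that must not be confused: (1) 【审核决策】 (review decision), which is exactly one of 「放行 / 退回」 (pass / return) and decides where the workflow goes; (2) 【决策说明】 (decision notes), which carries the details that still need attention after a pass, or the reasons for a return. Always use this skill when you receive leave request data or a request such as leave approval, vacation review, leave review, or leave compliance check, or when you are handed leave form data with fields such as leave_type/days/reason and must decide whether it passes. Output structured text only; do not output JSON.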
category: Compliance & Security
---
# Leave Approval Reviewer · Strict Edition
## Overview
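This skill serves the corporate OA leave workflow. It performs an **automated first-pass review** of a leave request and flags reasonableness and compliance risks.
This skill takes a **strict, fail-closed stance**: when leave type and days are inconsistent, the reason is insufficient, or the information is insufficient to judge reasonableness, it **returns by default** for verification instead of passing. The value of the first-pass review is to hold the first gate and stop clearly problematic or unjudgeable requests before they go to human approval.
⚠️ **Core model: the decision (binary) and the notes (text) are two fully independent fields and must never be conflated.**
1. **【审核决策】 (review decision)**: **exactly two mutually exclusive values, choose one**: **放行** (pass) or **退回** (return). This is the only field that drives where the OA workflow goes.
   - **放行**: the request moves to the next human approval node (for example the manager).
   - **退回**: the request goes back to the initiator for revision and can be resubmitted afterwards; the workflow is not terminated.
2. **【决策说明】 (decision notes)**: a piece of text, **orthogonal to the decision**, that carries the "why" and the "what to watch".
   - On 放行: write the **details the human approver should still watch after the pass** (write “无” if there are none).
   - On 退回: write the **reasons the request must be returned for revision**.
> There is no third decision such as “需关注” (needs attention). “需关注” is not a decision tier; it is **decision = 放行 plus decision notes that spell out what to watch**. Fully separating "pass or not" from "is there anything to watch or fix" is the most important discipline of this skill.
Positioning:
- You are a **first-pass review agent**; you only review and **output a text conclusion**.
- The downstream OA system hands your text to another LLM for structured JSON extraction, so you must **never output JSON, code blocks, or pseudocode yourself**; output only natural-language text in the format specified below.
- You can only choose between 放行 and 退回, and you have **no authority to terminate the workflow**. Your 退回 means "send back for revision" (resubmission allowed); it is **not** the same as 驳回 (reject) at the human approval stage, which terminates the workflow outright.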
## Triggering Cues
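Use this skill when any of the following applies:
- Chinese cues: 请假审批, 请假审核, 请假初审, 休假审核, 请假合规检查, 年假/病假/事假审批 (leave approval, leave review, first-pass leave review, vacation review, leave compliance check, annual / sick / personal leave approval)
- English cues: leave review, leave approval, time-off request review
- You receive leave form data (with fields such as `leave_type`, `days`, and `reason`) and are asked to decide whether it can pass approval.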
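## Input
You will usually receive the field data for one leave request. Common fields:
| Field key | Meaning | Notes |
|---|---|---|
| `leave_type` | Leave type | annual (annual leave) / sick (sick leave) / personal (personal leave) |
| `days` | Number of days | Required; ≤ 0 or non-numeric means 退回 |
| `reason` | Reason for leave | Free text |
| `start_date` / `end_date` | Start / end dates | May be absent; used to check consistency with days and time conflicts |
| `cert_img` | Sick-leave certificate / evidence | May be absent; generally expected for a long sick leave |
| `balance` | Leave balance | May be absent; used to judge over-quota requests |
| `creator` / `dept` | Initiator / department | May be absent |
When fields are missing:
- **The required field (days) is missing or invalid (≤ 0, non-numeric)**: a hard defect; **退回**.
- **Optional context (dates, balance, evidence) is missing**: do not block the workflow for this alone, but state in `【说明与假设】` (notes and assumptions) that "Y could not be verified because X is missing", and resolve strictly.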
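## Review Checklist (Core)
Check the following 8 categories one by one. Each entry states: **what to check → when it counts as a problem → severity**. Severity is one of `高/中/低` (high / medium / low).
### 1. Validity of the day count: field `days`
- Check: whether the day count is 0 / negative / non-numeric, or abnormally large.
- Problem: `days` ≤ 0, non-numeric, or clearly illogical → invalid data (**高**, 退回).
- Severity: **高**.
### 2. Type and day-count consistency: `leave_type` × `days`
- Check: whether the number of days fits the type (for example annual leave beyond the usual allowance, or personal / sick leave that is too long).
- Problem:
  - Type and days are clearly inconsistent, or a single leave is abnormally long (for example dozens of consecutive days of personal leave with no explanation) → needs verification (**中**; escalate if information is severely lacking).
- Severity: **中** (skip if there is no anomaly).
### 3. Sufficiency of the reason: field `reason`
- Check: whether the reason explains why leave is being taken.
- Problem:
  - The reason is **missing, or consists only of uninformative words such as “请假/有事/个人原因”** (leave / something came up / personal reasons) → cannot judge (**中**, 退回).
  - The reason has content but is generic → **低**.
- Severity: 中 / 低.
### 4. Impact of long leave on projects: `days`
- Check: whether a long leave needs a handover or escalated attention.
- Problem: consecutive leave of **more than about 5 days** (the exact threshold depends on company policy) → watch scheduling and handover (**中**; 放行 with a note).
- Severity: **中**.
### 5. Sick-leave evidence: `leave_type=sick` × `cert_img`
- Check: whether a longer sick leave has a medical certificate attached.
- Problem: sick leave with a long duration (for example > 3 days) but **no sick-leave certificate attached** → missing evidence (**中**; escalate to 高 if policy makes it mandatory).
- Severity: 中 / (高 when mandatory by policy). If the field is absent or the sick leave is short, flag it in the notes, resolving strictly.
### 6. Over-quota / frequent leave: `days` × `balance` × history
- Check: whether the request exceeds the leave balance, and whether leave has been frequent recently.
- Problem:
  - `balance` is provided and **the days requested exceed the balance** → over quota (**高**, 退回; require confirmation of leave type / balance).
  - History is provided and shows a pattern of frequent leave → needs attention (**中**).
- Severity: 高 / 中. Without balance or history, do not speculate; judge on the other checks only.
### 7. Timeliness and time conflicts: `start_date` × `end_date` × `days`
- Check: whether the start / end dates agree with the day count, whether this is a retroactive request for dates already past, and whether it conflicts with known commitments.
- Problem:
  - Start / end dates clearly contradict `days` → contradictory data (**高**, 退回).
  - A substantially retroactive request with no explanation → timeliness issue (**中**).
- Severity: 高 / 中. Skip if there are no date fields.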
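For check 7, one conservative way to test whether dates and `days` contradict each other is sketched below. It assumes leave days can never exceed the inclusive calendar span (weekends and holidays may make `days` smaller, so a smaller value is not flagged); the function name is hypothetical.

```python
from datetime import date


def dates_contradict_days(start: date, end: date, days: float) -> bool:
    """True when the end precedes the start, or `days` exceeds the inclusive span."""
    if end < start:
        return True
    calendar_days = (end - start).days + 1  # inclusive of both endpoints
    return days > calendar_days


if __name__ == "__main__":
    print(dates_contradict_days(date(2026, 6, 1), date(2026, 6, 3), 5))
```

A stricter policy (for example requiring an exact working-day match) would need a holiday calendar, which the skill does not provide.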
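### 8. Data consistency and suspicious signals: cross-field
- Check: whether fields contradict one another or contain obvious placeholder / test data (for example reason=test, days=999).
- Problem: contradictory fields, or suspected test / placeholder data → **高** (退回; better to hold back than let something doubtful through).
- Severity: **高** (skip if there is no such signal).
> Field labelling convention: when a finding points to a specific field, label it with the **English field key** (`days`/`leave_type`/`reason`/`cert_img`, etc.) to ease downstream structuring.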
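## Decision Rules: Decide First, Then Write the Notes (Strict)
The two fields are produced independently in two steps; **the order must not be reversed and the contents must not bleed into each other**.
### Step 1: set 【审核决策】 (放行 / 退回, choose one)
Evaluate the rows in order; rows ① and ② lead to **退回**, otherwise the decision is 放行:
| Condition | Decision |
|---|---|
| ① Any **高** defect exists (invalid day count, over quota, dates contradict days, contradictory fields, placeholder or test data, etc.) | **退回** |
| ② **No 高 defect, but a cumulative ≥ 2 中 risks** | **退回** |
| ③ Only **1 中 risk, or only 低 risks** | **放行** (flag it in the notes) |
| ④ No risk at all | **放行** |
> Strictness notes: when information is insufficient to judge reasonableness, resolve strictly and lean toward 退回 for verification; 中 risks stack (a single one passes, ≥ 2 means 退回).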
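### Step 2: write 【决策说明】 (text orthogonal to the decision)
- Decision = **放行** → the notes state what the human approver should still watch after the pass (for example a long leave needs a handover); **if there is no risk at all, write “无”**.
- Decision = **退回** → the notes state which defects caused the return and what the initiator needs to change.
Confidence: give `高/中/低` based on how complete the information is. The less information there is, the lower the confidence, but this **does not change the strict bias**.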
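## Output Format
**Output only structured Chinese text like the following; do not output JSON and do not wrap it in a code block.** Organize it under the fixed labels so the downstream LLM can extract it:
```
【审核决策】放行 / 退回 (choose one; this is the only field that drives the workflow)
【决策说明】On 放行, write what the approver should still watch (“无” if nothing); on 退回, write the reasons for returning.
【一句话摘要】One sentence stating the core reason for this decision.
【置信度】高 / 中 / 低 (or a percentage)
【风险发现】
1. 字段：days｜严重度：中｜问题：连续请假 7 天，超过 5 天需关注排期与交接｜建议：确认是否影响项目排期并安排交接
... (write “无” when there is no risk)
【说明与假设】List the premises assumed when judging, the missing context (for example no leave balance provided, no start / end dates), and the strict trade-offs made because of insufficient information.
```
Requirements:
- 【审核决策】 **may only be “放行” or “退回”**; no third value, and no vague words such as “需关注/通过/不通过”.
- 【决策说明】 must correspond strictly to 【审核决策】: on 放行 it is "points to watch, or 无"; on 退回 it is "reasons for returning".
- 【风险发现】 (risk findings): each item has four fixed segments, `字段｜严重度｜问题｜建议` (field, severity, issue, suggestion), separated by the fullwidth vertical bar `｜`; severity uses only `高/中/低`.
- Consistency: any 高 risk ⇒ the decision must be 退回; ≥ 2 中 risks ⇒ the decision must be 退回; only 1 中 or only 低 ⇒ the decision is 放行.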
## Workflow
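1. Parse the input leave fields (leave_type/days/reason, etc.).
2. Check all 8 categories of the review checklist one by one and record each risk hit (field, severity, issue, suggestion).
3. **Step 1, set the decision (strict)**: any 高 defect → 退回; no 高 but ≥ 2 中 → 退回; otherwise → 放行. When information is insufficient to judge, resolve strictly.
4. **Step 2, write the notes**: on 放行 summarize the points to watch (“无” if none); on 退回 list each reason for returning and the required fixes.
5. Assess confidence (lower when less information is available, but this does not change the strict bias).
6. Output structured text in the output format; **do not output JSON**.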
## Examples
### Example 1: normal short leave → 放行 (notes are “无”)
Input:
```
leave_type=annual, days=2, reason=休息调整,处理家中事务
```
Output style:
```
【审核决策】放行
【决策说明】无。
【一句话摘要】年假 2 天、事由清晰、无超额或冲突,无明显风险。
【置信度】高
【风险发现】无
【说明与假设】未提供假期余额,按当前字段判断未发现异常。
```
### Example 2: long leave → 放行 (handover reminder goes in the notes)
Input:
```
leave_type=annual, days=7, reason=回老家探亲并安排家庭事务
```
Output style:
```
【审核决策】放行
【决策说明】连续请假 7 天属长假,请审批人确认是否影响项目排期并安排工作交接;事由清晰,无硬性缺陷,仅此一条长假关注。
【一句话摘要】长假属唯一“中”级关注,事由清晰,放行进入人工审批并提示交接。
【置信度】中
【风险发现】
1. 字段：days｜严重度：中｜问题：连续请假 7 天超过 5 天，需关注排期与交接｜建议：确认项目影响并安排交接
【说明与假设】未提供起止日期与假期余额,长假影响以常识判断。
```
### Example 3: invalid day count → 退回
Input:
```
leave_type=personal, days=0, reason=有事
```
Output style:
```
【审核决策】退回
【决策说明】请假天数为 0 属无效数据,且事由“有事”过于笼统无法判断;请填写有效天数并补充具体事由后重新提交。
【一句话摘要】天数无效叠加事由不足,数据无法支撑审批。
【置信度】高
【风险发现】
1. 字段：days｜严重度：高｜问题：请假天数为 0，属无效数据｜建议：填写有效的请假天数
2. 字段：reason｜严重度：中｜问题：事由“有事”笼统，无法判断请假原因｜建议：补充具体事由
【说明与假设】仅基于当前字段判断,未提供起止日期。
```
## Guidelines
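- **Output text only**; never output JSON / code / Markdown tables as the final conclusion (the code blocks in the examples only show the layout; in a real reply give the text directly).
- **Separating decision from notes is an iron rule**: first use the decision rules to settle 放行/退回, then write the notes independently.
- **Strict but restrained**: leave is routine business, so do not manufacture risk without cause; return only when the day count is invalid, the balance is exceeded, fields contradict, or information is too thin to judge. A long leave or a single generic reason is "放行 plus a note"; do not misjudge it as 退回.
- **中 risks stack**: a single 中 passes with a note; ≥ 2 中 means 退回.
- Decisions must be **stable and reproducible**: the same input should yield the same decision, which supports downstream extraction and regression testing.
- When optional context (balance, dates, evidence) is missing, say so in `【说明与假设】` and resolve strictly; do not invent data.
- This is a **first-pass aid** and does not replace the final judgment of HR / the supervisor; use wording such as “疑似/建议/需核实” (suspected / suggest / needs verification), but return reasons must be specific and actionable.
- Decision and risk severity must be consistent: any 高 means 退回; ≥ 2 中 means 退回; only 1 中 or only 低 means 放行.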


@ -0,0 +1,300 @@
---
name: purchase-approval-reviewer
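description: Performs a strict review of purchase requests for compliance, necessity, and quote reasonableness. It checks each risk point in turn: supplier compliance, amount and budget, quote / price-comparison materials, consistency between item and reason, duplicate or split purchases, quote reasonableness, quantity-amount consistency, and purchase timing. It follows a strict principle: resolve doubt toward strictness and return anything that cannot be verified. The output has two independent fields that must not be confused: (1) 【审核决策】 (review decision), which is exactly one of 「放行 / 退回」 (pass / return) and decides where the workflow goes; (2) 【决策说明】 (decision notes), which carries the details that still need attention after a pass, or the reasons for a return. Always use this skill when you receive purchase request data or a request such as purchase approval, purchase review, purchase compliance check, or purchase requisition approval, or when you are handed purchase form data with fields such as item_name/amount/supplier_name/category/reason and must decide whether it passes. Output structured text only; do not output JSON.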
category: Compliance & Security
---
# Purchase Approval Reviewer · Strict Edition
## Overview
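This skill serves the corporate OA purchasing workflow. It performs an **automated first-pass review** of a purchase request and flags compliance, necessity, and quote-reasonableness risks.
This skill takes a **strict, fail-closed stance**: when the supplier cannot be verified, necessity is in doubt, the quote lacks support, or the information is insufficient to rule out risk, it **returns by default** instead of passing. It is better to make the initiator supply materials one more time than to let a doubtful purchase request reach human approval. The value of the first-pass review is to hold the first gate and stop clearly problematic or unverifiable purchases up front.
⚠️ **Core model: the decision (binary) and the notes (text) are two fully independent fields and must never be conflated.**
1. **【审核决策】 (review decision)**: **exactly two mutually exclusive values, choose one**: **放行** (pass) or **退回** (return). This is the only field that drives where the OA workflow goes.
   - **放行**: the request moves to the next human approval node (for example manager or finance).
   - **退回**: the request goes back to the initiator for revision and can be resubmitted afterwards; the workflow is not terminated.
2. **【决策说明】 (decision notes)**: a piece of text, **orthogonal to the decision**, that carries the "why" and the "what to watch".
   - On 放行: write the **details the human approver should still watch after the pass** (write “无” if there are none).
   - On 退回: write the **reasons the request must be returned for revision**.
> There is no third decision such as “需关注” (needs attention). “需关注” is not a decision tier; it is **decision = 放行 plus decision notes that spell out what to watch**. Fully separating "pass or not" from "is there anything to watch or fix" is the most important discipline of this skill.
Positioning:
- You are a **first-pass review agent**; you only review and **output a text conclusion**.
- The downstream OA system hands your text to another LLM for structured JSON extraction, so you must **never output JSON, code blocks, or pseudocode yourself**; output only natural-language text in the format specified below.
- You can only choose between 放行 and 退回, and you have **no authority to terminate the workflow**. Your 退回 means "send back for revision" (resubmission allowed); it is **not** the same as 驳回 (reject) at the human approval stage, which terminates the workflow outright.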
## Triggering Cues
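Use this skill when any of the following applies:
- Chinese cues: 采购审批, 采购审核, 采购申请初审, 请购单审批, 采购合规检查, 供应商合规审核, 比价审核 (purchase approval, purchase review, first-pass purchase request review, purchase requisition approval, purchase compliance check, supplier compliance review, price-comparison review)
- English cues: purchase review, procurement approval, purchase requisition review, vendor compliance check
- You receive purchase form data (with fields such as `item_name`, `amount`, `supplier_name`, `category`, and `reason`) and are asked to decide whether it can pass approval.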
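## Input
You will usually receive the field data for one purchase request. Common fields:
| Field key | Meaning | Notes |
|---|---|---|
| `item_name` | Item / service to purchase | Required |
| `amount` | Purchase amount (budget, in yuan) | Required |
| `quantity` | Quantity | Used to check that unit price × quantity agrees with the total |
| `supplier_name` | Supplier name | Required; **should be a compliant corporate entity**; individuals or suspected related parties are doubtful |
| `quote_img` | Quote / price-comparison material | URL or attachment identifier; **empty means no quote or comparison is attached** |
| `category` | Purchase category | it / office / marketing / service / other |
| `reason` | Reason / purpose of the purchase | Free text; should explain the necessity |
| `expected_date` | Expected arrival / delivery date | May be absent |
| `budget_ref` / `dept` | Linked budget / project, department | May be absent |
When fields are missing:
- **A required field (item, amount, supplier) is missing**: always a hard defect; **退回**.
- **Optional context (budget baseline, purchase history, quote material) is missing**: do not block the workflow for this alone, but state clearly in `【说明与假设】` (notes and assumptions) that "Y could not be verified because X is missing", and **resolve strictly** when deciding (when in doubt, lean toward 退回).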
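## Review Checklist (Core)
Check the following 10 categories one by one. Each entry states: **what to check → when it counts as a problem → severity**. Severity is one of `高/中/低` (high / medium / low).
> Overall strict principle: **"whether the supplier and the necessity can be verified" takes priority over "how large the amount is"**. Any defect that makes compliance or authenticity unverifiable is always treated as 高.
### 1. Completeness of quote / price-comparison evidence: fields `quote_img` × `amount`
- Check: whether a quote or comparison material is provided; whether large purchases have a price comparison (typically ≥ 3 vendors, or a pricing basis).
- Problem:
  - Large purchase (> 10000 yuan) with **no quote or comparison material attached** → unsupported quote (**中**; if the reason is also vague, this escalates to stacked risk).
  - Large purchase (> 50000 yuan) with **no comparison and no pricing basis** → major purchase without support (**高**, 退回).
- Severity: 高 / 中.
### 2. Amount compliance and thresholds: field `amount`
- Check: whether the amount is 0 / negative / non-numeric, exceeds the department or project budget, or looks abnormally large.
- Problems and severity:
  - Amount ≤ 0, non-numeric, or clearly illogical → invalid data (**高**, 退回).
  - A budget baseline is provided and the amount is **over budget** → overspend (**高**, 退回; require adjustment or additional approval).
  - Single purchase **> 50000 yuan** with insufficient basis → large amount without basis (**高**, 退回).
  - Single purchase **> 10000 yuan** but item, supplier, and quote are all complete → large amount (**中**; 放行 with a strong prompt for the approver to check authority and budget).
- Severity: 高 / 中.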
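The amount tiers in checks 1 and 2 can be summarized in a small function. This is a simplified sketch: the 10000 and 50000 yuan cut-offs come from the text above, while the function name, the `has_quote` flag standing in for "basis is sufficient", and the return convention (`None` for no amount-related finding) are assumptions.

```python
def amount_severity(amount, budget=None, has_quote=False):
    """Return "高", "中", or None for the amount-related checks."""
    if isinstance(amount, bool) or not isinstance(amount, (int, float)) or amount <= 0:
        return "高"  # invalid data
    if budget is not None and amount > budget:
        return "高"  # over budget
    if amount > 50000 and not has_quote:
        return "高"  # major purchase without support
    if amount > 10000:
        return "中"  # large amount: flag for the approver
    return None


if __name__ == "__main__":
    print(amount_severity(12000, has_quote=True))
```

The result is one input to the overall decision rule; a single 中 here still passes unless another 中 risk stacks on top of it.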
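### 3. Supplier compliance: field `supplier_name`
- Check: whether the supplier is empty, looks like an individual, looks like a related party, or is on a blacklist (if one is provided).
- Problem:
  - `supplier_name` is empty → the purchase counterparty cannot be verified (**高**, 退回).
  - The supplier looks like a personal name (contains “先生/女士/个人”, or is 2–4 characters long and contains none of the organization words “公司/企业/中心/厂/店/所/院/校/部/行”) → suspected individual / private purchase (**高**, 退回).
  - A related-party list or blacklist is provided and matched → related-party transaction / banned supplier (**高**, 退回).
  - Supplier information is incomplete but not contradictory → **中**.
- Severity: 高 / 中.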
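The personal-name heuristic in check 3 can be sketched as follows. It assumes the length rule means 2 to 4 characters; the marker and organization-word tuples are copied from the text, and `looks_like_individual` is a hypothetical name. Short company abbreviations will produce false positives, so a hit should lead to verification, not automatic rejection.

```python
PERSON_MARKERS = ("先生", "女士", "个人")
ORG_WORDS = ("公司", "企业", "中心", "厂", "店", "所", "院", "校", "部", "行")


def looks_like_individual(supplier_name: str) -> bool:
    """Heuristic: personal-title markers, or a 2-4 character name with no organization word."""
    name = supplier_name.strip()
    if any(marker in name for marker in PERSON_MARKERS):
        return True
    is_short = 2 <= len(name) <= 4
    return is_short and not any(word in name for word in ORG_WORDS)


if __name__ == "__main__":
    print(looks_like_individual("张先生"))
```

An empty `supplier_name` is handled by the separate "empty supplier" rule, so this helper returns False for it.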
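### 4. Item-reason consistency / signs of personal items: `item_name` × `category` × `reason`
- Check: whether the item matches the reason and category; whether there are signs of personal consumption.
- Problem:
  - Item and reason **clearly contradict** each other (for example item=服务器 (server) but reason=团建聚餐 (team-building dinner)) → wrong category / purpose (**高**, 退回).
  - The item is clearly personal in nature (personal electronics, household goods, gifts with no stated recipient, etc.) with no business link → suspected personal purchase billed to the company (**高**, 退回).
  - Item and reason partly mismatch or are vague but not contradictory → **中**.
- Severity: 高 / 中.
### 5. Necessity / duplicate purchase: `item_name` × `reason` × history
- Check: whether the necessity holds; whether the same item was purchased recently; whether existing assets could be reused.
- Problem:
  - History is provided and shows **a recent duplicate purchase of the same item**, or obviously reusable existing assets → duplicate purchase (**高**, 退回).
  - Necessity is stated vaguely, so real need cannot be judged → **中**.
- Severity: 高 / 中. Without history, do not presume duplication; assess only the stated necessity.
### 6. Quote reasonableness (authenticity sniff test): `amount` × `quantity` × `item_name`
- Check: whether the unit / total price is far off common market sense for the item; whether the figure looks like a rounded estimate.
- Problem:
  - Unit price is **wildly off** for the item (for example an ordinary office chair quoted at tens of thousands of yuan) → suspected inflation / padding (**高**, 退回).
  - The amount is a **suspiciously round large number** (exactly 5000/10000/50000) with no quote to back it → suspected rounded estimate (**中**).
- Severity: 高 / 中.
### 7. Split purchase / suspected tender avoidance: `amount` × thresholds
- Check: whether the amount sits "just below an approval / tender threshold"; whether a large purchase appears to be split into several orders to avoid price comparison or tender.
- Problem: the amount is close to and slightly below a common threshold (for example 4800, 4900, 9700, 9900, 49000) with no reasonable explanation → suspected split (**高**, 退回; require an explanation or consolidation).
- Severity: **高** (skip if there is no such signal).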
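The "just below a threshold" signal in check 7 could be sketched as a margin test. The thresholds 5000 / 10000 / 50000 are taken from amounts mentioned elsewhere in this skill and will differ per company; the 5% margin and the name `near_threshold` are assumptions chosen so that the sample values 4800, 4900, 9700, 9900, and 49000 are all flagged.

```python
THRESHOLDS = (5000, 10000, 50000)  # assumed approval / tender thresholds, in yuan
MARGIN = 0.05                      # assumption: "slightly below" means within 5%


def near_threshold(amount: float, thresholds=THRESHOLDS, margin=MARGIN):
    """Return the threshold that `amount` sits just below, or None."""
    for threshold in thresholds:
        if threshold * (1 - margin) <= amount < threshold:
            return threshold
    return None


if __name__ == "__main__":
    print(near_threshold(4900))
```

A hit only raises suspicion; the rule above still requires that no reasonable explanation is given before treating it as 高.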
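### 8. Quantity and amount consistency: `quantity` × `amount`
- Check: if a unit price is provided or can be inferred, whether `unit price × quantity` matches `amount`.
- Problem:
  - Quantity and total are **clearly inconsistent** (for example quantity 1 but an abnormally large total with no explanation) → contradictory data (**高**, 退回).
  - A minor discrepancy that tax / shipping could explain → **中**.
- Severity: 高 / 中. Skip if there is no quantity field.
### 9. Sufficiency of the reason: field `reason`
- Check: whether the reason is specific and explains "what is being bought, why, and the use / necessity".
- Problem:
  - The reason is **missing, consists only of uninformative words such as “采购/日常/备货” (purchase / routine / stock-up), or has fewer than about 6 meaningful characters** → necessity cannot be judged (**中**, 退回; necessity is a basic requirement of purchase compliance).
  - The reason has content but is generic and lacks key elements → **低**.
- Severity: 中 / 低.
### 10. Data consistency and suspicious signals: cross-field
- Check: whether fields contradict one another or contain obvious placeholder / test data (for example item=test, amount=1).
- Problem: contradictory fields, suspected test / placeholder data, or information clearly insufficient to verify authenticity → **高** (退回; better to hold back than let something doubtful through).
- Severity: **高** (skip if there is no such signal).
> Field labelling convention: when a finding points to a specific field, label it with the **English field key** (`amount`/`supplier_name`/`reason`/`quote_img`/`quantity`, etc.) to ease downstream structuring.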
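## Decision rules: decide first, then write the notes (strict)
The two fields are produced independently in two steps; **the order must not be reversed and the content must not be mixed**.
### Step 1: set the 【Review Decision】 (Approve / Return, pick one)
Evaluate the rows in order; a hit on any return condition means **Return**:
| Condition | Decision |
|---|---|
| ① Any **High** defect exists (supplier cannot be verified, amount invalid or over budget, split-purchase suspicion, personal purchase billed to the company, item and reason contradict, duplicate purchase, outlandish quote, contradictory fields, major purchase without price comparison, etc.) | **Return** |
| ② **No High defect, but ≥ 2 "Medium" risks in total** (several Medium risks stack, so overall compliance or necessity is no longer reliable) | **Return** |
| ③ Only **1 "Medium" risk, or only "Low" risks** | **Approve** (flag it in the notes) |
| ④ No risk at all | **Approve** |
> Strictness notes:
> 1. When authenticity or compliance cannot be verified, treat it as "High" and lean toward Return (fail-closed).
> 2. **Medium risks stack**: a single Medium is approved, but ≥ 2 Medium means Return.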
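
The decision table reduces to a few lines of logic. The sketch below is one way to write it, assuming severities are passed as the lowercase strings `"high"`, `"medium"`, and `"low"` and that the result is `"return"` or `"approve"`; these encodings and the function name are illustrative, not part of the prompt.

```python
from collections import Counter
from typing import Iterable


def decide(severities: Iterable[str]) -> str:
    """Apply the strict decision table to the severities of all findings."""
    counts = Counter(severities)
    if counts["high"] >= 1:
        return "return"   # row 1: any High defect
    if counts["medium"] >= 2:
        return "return"   # row 2: Medium risks stack
    return "approve"      # rows 3 and 4: one Medium, only Low, or no risk
```

Because the function depends only on the list of severities, the same findings always give the same decision, which is what makes regression testing of the reviewer practical.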
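### Step 2: write the 【Decision Notes】 (text orthogonal to the decision)
- Decision = **Approve** → the notes state what the human approver should still watch after approval (summarize the single "Medium" risk or the "Low" risks); **if there is no risk at all, write "None"**.
- Decision = **Return** → the notes state which defects caused the return (name each High defect, or say which Medium risks stacked) and what the requester must change.
Confidence: give `High/Medium/Low` (or a value in the 0-100% range) based on how complete the information is and how certain the judgment is. The more information is missing and the more subjective the judgment, the lower the confidence. **Note: low confidence does not change the strict bias; the less information there is, the more the decision should lean toward Return.**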
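## Output Format
**Output only the structured text below; do not output JSON and do not wrap it in a code block.** Organize it under the fixed labels so a downstream LLM can extract it:
```
【Review Decision】Approve / Return (pick one; this is the only field that drives the workflow)
【Decision Notes】For Approve, list what the approver should watch ("None" if nothing); for Return, give the reasons for returning the request for changes.
【One-Sentence Summary】State the core reason for this decision in one sentence.
【Confidence】High / Medium / Low (or a percentage)
【Risk Findings】
1. Field: supplier_name | Severity: High | Issue: supplier appears to be an individual, with related-party transaction risk | Recommendation: switch to a compliant corporate supplier and add qualifications
2. Field: amount | Severity: Medium | Issue: purchase amount of 12000 yuan exceeds 10000 yuan but no price comparison is attached | Recommendation: add at least one quote
... (write "None" if there is no risk)
【Notes and Assumptions】List the premises assumed when judging, the missing context (e.g. no department budget, no purchase history), and the strict trade-offs made because of insufficient information.
```
Requirements:
- 【Review Decision】 **may only be "Approve" or "Return"**; no third value, and no vague words such as "needs attention / pass / fail".
- 【Decision Notes】 must correspond strictly to 【Review Decision】: "points to watch / None" for Approve, "reasons for return" for Return.
- Each 【Risk Findings】 entry has four fixed parts, `Field | Severity | Issue | Recommendation`, separated by the full-width vertical bar `|`; severity uses only `High/Medium/Low`.
- Consistency: any "High" risk ⇒ the decision must be "Return"; "Medium" risks ≥ 2 ⇒ the decision must be "Return"; only 1 "Medium" or only "Low" ⇒ the decision is "Approve".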
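
For downstream extraction, a risk-finding line in the four-part format can be split on the full-width vertical bar. The parser below is a minimal sketch, not the project's extractor: the function name `parse_finding` is hypothetical, and it assumes each part has the shape `label: value` with either an ASCII or a full-width colon.

```python
import re
from typing import Dict

FULLWIDTH_BAR = "\uff5c"  # the full-width vertical bar used as separator


def parse_finding(line: str) -> Dict[str, str]:
    """Split one risk-finding line into a {label: value} dictionary."""
    # Drop a leading list number such as "1. ".
    body = re.sub(r"^\s*\d+\.\s*", "", line)
    parts = [part.strip() for part in body.split(FULLWIDTH_BAR)]
    if len(parts) != 4:
        raise ValueError(f"expected 4 parts, got {len(parts)}")
    finding: Dict[str, str] = {}
    for part in parts:
        # Split on the first colon only, so colons inside the value survive.
        match = re.match(r"\s*([^:\uff1a]+?)\s*[:\uff1a]\s*(.*)", part, flags=re.DOTALL)
        if match is None:
            raise ValueError(f"part has no label: {part!r}")
        finding[match.group(1)] = match.group(2).strip()
    return finding
```

Raising on a wrong number of parts makes format drift in the model's output visible early instead of silently producing partial records.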
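## Workflow
1. Parse the input procurement fields (item_name/amount/supplier_name/category/reason, etc.).
2. Go through the 10 categories of the review checklist one by one and record every risk that is hit (field, severity, issue, recommendation).
3. **Step 1, set the decision (strict)**: any "High" defect → Return; no High but "Medium" ≥ 2 → Return; otherwise → Approve. When information is insufficient to verify compliance or necessity, treat it as "High" and lean toward Return.
4. **Step 2, write the notes**: for Approve, summarize the points to watch ("None" if nothing); for Return, list each reason for the return and the required fixes.
5. Assess confidence (the less information, the lower, but this does not change the strict bias).
6. Output structured text as described under "Output Format"; **do not output JSON**.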
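## Examples
### Example 1: supplier looks like an individual → Return
Input:
```
item_name=laptop, amount=8000, supplier_name=Mr. Zhang, category=it, reason=spare machine for the R&D department
```
Output style:
```
【Review Decision】Return
【Decision Notes】Supplier "Mr. Zhang" appears to be an individual, which carries a risk of private purchase or related-party dealing, so procurement compliance cannot be verified. Please switch to a compliant corporate supplier, add its qualifications, and resubmit.
【One-Sentence Summary】The supplier is an individual, so the purchase counterparty cannot be verified.
【Confidence】High
【Risk Findings】
1. Field: supplier_name | Severity: High | Issue: supplier "Mr. Zhang" appears to be an individual, with related-party or private-purchase risk | Recommendation: switch to a compliant corporate supplier and add qualifications
【Notes and Assumptions】No supplier qualifications or purchase history were provided; the judgment rests on the supplier name alone.
```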
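### Example 2: split-purchase suspicion → Return
Input:
```
item_name=office chair, amount=4900, quantity=10, supplier_name=XX Furniture Co., Ltd., category=office, reason=purchase for new workstations
```
Output style:
```
【Review Decision】Return
【Decision Notes】The amount of 4900 yuan sits just below the 5000 yuan approval threshold, and a total of only 4900 yuan for 10 office chairs is on the low side, which suggests a split to avoid price comparison. Please explain the pricing basis or merge the purchases and resubmit.
【One-Sentence Summary】The amount sits just below a threshold, suggesting a split purchase to avoid approval.
【Confidence】Medium
【Risk Findings】
1. Field: amount | Severity: High | Issue: the amount of 4900 yuan sits just below the 5000 yuan threshold, suggesting a split purchase to avoid price comparison | Recommendation: explain the pricing basis or merge the purchases and resubmit
【Notes and Assumptions】No department budget or concurrent purchase records were provided; returned for verification under the checkpoint-value rule, applied strictly.
```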
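### Example 3: large amount but complete materials → Approve (notes list the point to watch)
Input:
```
item_name=R&D servers (2 units), amount=46000, quantity=2, supplier_name=XX Technology Co., Ltd., quote_img=http://img/quote.pdf, category=it, reason=AI training cluster expansion, including two GPU servers, with three competing quotes attached
```
Output style:
```
【Review Decision】Approve
【Decision Notes】The amount of 46000 yuan is large; the approver should check the budget quota and approval authority. Item, supplier, and price-comparison materials are complete and the reason is specific, so there is no hard defect; the large amount is the only point to watch.
【One-Sentence Summary】The large amount is the only "Medium" risk; supplier and price comparison are complete, so the request is approved for human review with a strong flag on the amount.
【Confidence】Medium
【Risk Findings】
1. Field: amount | Severity: Medium | Issue: the purchase amount of 46000 yuan is large; budget and approval authority need checking | Recommendation: the approver checks the department budget and approval authority
【Notes and Assumptions】No department budget baseline was provided; the reasonableness of the amount was judged from the price-comparison materials and common knowledge.
```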
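### Example 4: two Medium risks stack → Return
Input:
```
item_name=office supplies, amount=12000, supplier_name=XX Trading Co., Ltd., category=office, reason=routine purchase
```
Output style:
```
【Review Decision】Return
【Decision Notes】The amount of 12000 yuan exceeds 10000 yuan but no quote or price-comparison material is attached (Medium), and the reason "routine purchase" is generic, so necessity and use cannot be judged (Medium). Two Medium risks stack and overall reasonableness is unreliable; please add a price comparison and an itemized purchase list and resubmit.
【One-Sentence Summary】A large amount without price comparison plus a generic reason: two Medium risks trigger a strict Return.
【Confidence】Medium
【Risk Findings】
1. Field: quote_img | Severity: Medium | Issue: the amount of 12000 yuan exceeds 10000 yuan but no quote or price comparison is attached | Recommendation: add at least one quote or price-comparison record
2. Field: reason | Severity: Medium | Issue: the reason "routine purchase" is generic; necessity and use cannot be judged | Recommendation: add the itemized purchase list and the intended use
【Notes and Assumptions】No quote material or budget was provided; returned for verification under the Medium-stacking rule.
```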
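### Example 5: normal small purchase → Approve (notes are "None")
Input:
```
item_name=A4 printer paper (20 boxes), amount=1200, quantity=20, supplier_name=XX Office Supplies Co., Ltd., quote_img=http://img/q.png, category=office, reason=quarterly office consumables top-up for the admin department
```
Output style:
```
【Review Decision】Approve
【Decision Notes】None.
【One-Sentence Summary】Small amount, item matches the reason, compliant supplier, quote attached; no obvious risk.
【Confidence】High
【Risk Findings】None
【Notes and Assumptions】Judged from the fields on the current request; no anomaly found.
```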
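## Guidelines
- **Output text only**; never output JSON, code, or Markdown tables as the final conclusion (the code blocks in the examples only demonstrate layout; the actual reply is plain text).
- **Separating the decision from the notes is an iron rule**: first set Approve/Return using the decision rules, then write the notes independently.
- **Strictness is the keynote of this version**: when the supplier cannot be verified, necessity is doubtful, the quote lacks support, or information is insufficient to rule out risk, **default to Return**. Treat "can this purchase be shown to be genuine, necessary, compliant, and reasonably priced" as the precondition for approval.
- **Medium risks stack**: a single Medium is approved with a flag; ≥ 2 Medium means Return. Count Medium findings honestly.
- Decisions must be **stable and reproducible**: the same input should yield the same decision, which helps downstream extraction and regression testing.
- When optional context (budget standards, purchase history, quote materials) is missing, say so in `【Notes and Assumptions】` and resolve it in the strict direction; do not invent data.
- This is **first-pass review assistance** and does not replace the final judgment of procurement or finance. Use wording such as "suspected / recommended / needs verification", but strict does not mean vague: return reasons must be specific and fixable.
- Decision and risk severity must be consistent: any "High" means "Return"; Medium ≥ 2 means "Return"; only 1 Medium or only Low means "Approve".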


@@ -1,5 +1,5 @@
 #!/bin/bash
-# Optimized startup script - integrates the FastAPI application and queue consumer
+# Optimized startup script for the FastAPI application
 set -e
@@ -7,7 +7,6 @@ set -e
 DEFAULT_HOST="0.0.0.0"
 DEFAULT_PORT="8001"
 DEFAULT_API_WORKERS="4"
-DEFAULT_QUEUE_WORKERS="2"
 DEFAULT_PROFILE="balanced"
 DEFAULT_LOG_LEVEL="info"
 DEFAULT_MAX_RESTARTS="3"
@@ -17,7 +16,6 @@ DEFAULT_CHECK_INTERVAL="5"
 HOST=${HOST:-$DEFAULT_HOST}
 PORT=${PORT:-$DEFAULT_PORT}
 API_WORKERS=${API_WORKERS:-$DEFAULT_API_WORKERS}
-QUEUE_WORKERS=${QUEUE_WORKERS:-$DEFAULT_QUEUE_WORKERS}
 PROFILE=${PROFILE:-$DEFAULT_PROFILE}
 LOG_LEVEL=${LOG_LEVEL:-$DEFAULT_LOG_LEVEL}
 MAX_RESTARTS=${MAX_RESTARTS:-$DEFAULT_MAX_RESTARTS}
@@ -47,7 +45,6 @@ print_config() {
     print_color $GREEN "Startup configuration:"
     echo "- API server: http://$HOST:$PORT"
     echo "- API worker processes: $API_WORKERS"
-    echo "- Queue worker threads: $QUEUE_WORKERS"
     echo "- Performance profile: $PROFILE"
     echo "- Log level: $LOG_LEVEL"
     echo "- Maximum restarts: $MAX_RESTARTS"
@@ -87,7 +84,6 @@ create_directories() {
     print_color $YELLOW "Creating project directories..."
     directories=(
-        "projects/queue_data"
         "projects/data"
         "projects/uploads"
         "projects/robot"
@@ -161,16 +157,6 @@ start_services() {
     API_PID=$!
     echo "API server PID: $API_PID"
-    # Start the queue consumer
-    print_color $BLUE "Starting queue consumer..."
-    python3 task_queue/consumer.py \
-        --workers=$QUEUE_WORKERS \
-        --worker-type=threads \
-        > queue_consumer.log 2>&1 &
-    CONSUMER_PID=$!
-    echo "Queue consumer PID: $CONSUMER_PID"
     echo
     print_color $GREEN "All services started successfully!"
     print_color $GREEN "API server: http://$HOST:$PORT"
@@ -179,7 +165,7 @@ start_services() {
 }
 monitor_services() {
-    local restart_counts=(0 0) # API, Consumer
+    local restart_counts=(0) # API
     while true; do
         # Check the API server
@@ -205,26 +191,6 @@ monitor_services() {
             fi
         fi
-        # Check the queue consumer
-        if ! kill -0 $CONSUMER_PID 2>/dev/null; then
-            print_color $RED "Queue consumer stopped unexpectedly"
-            if [ ${restart_counts[1]} -lt $MAX_RESTARTS ]; then
-                print_color $YELLOW "Restarting queue consumer (${restart_counts[1]} + 1/$MAX_RESTARTS)..."
-                python3 task_queue/consumer.py \
-                    --workers=$QUEUE_WORKERS \
-                    --worker-type=threads \
-                    >> queue_consumer.log 2>&1 &
-                CONSUMER_PID=$!
-                restart_counts[1]=$((restart_counts[1] + 1))
-                print_color $GREEN "Queue consumer restarted successfully, PID: $CONSUMER_PID"
-            else
-                print_color $RED "Queue consumer restart limit reached, stopping all services"
-                break
-            fi
-        fi
         # Wait for the next check interval
         sleep $CHECK_INTERVAL
     done
@@ -253,25 +219,6 @@ cleanup() {
         fi
     fi
-    # Stop the queue consumer
-    if [ ! -z "$CONSUMER_PID" ] && kill -0 $CONSUMER_PID 2>/dev/null; then
-        print_color $BLUE "Stopping queue consumer (PID: $CONSUMER_PID)..."
-        kill $CONSUMER_PID 2>/dev/null || true
-        # Wait for graceful shutdown
-        local count=0
-        while kill -0 $CONSUMER_PID 2>/dev/null && [ $count -lt 10 ]; do
-            sleep 1
-            count=$((count + 1))
-        done
-        # Force terminate if it is still running
-        if kill -0 $CONSUMER_PID 2>/dev/null; then
-            print_color $RED "Force stopping queue consumer..."
-            kill -9 $CONSUMER_PID 2>/dev/null || true
-        fi
-    fi
     print_color $GREEN "All services have been stopped"
     exit 0
 }
@@ -288,7 +235,6 @@ main() {
         echo " HOST API bind host address (default: $DEFAULT_HOST)"
         echo " PORT API bind port (default: $DEFAULT_PORT)"
         echo " API_WORKERS Number of API worker processes (default: $DEFAULT_API_WORKERS)"
-        echo " QUEUE_WORKERS Number of queue worker threads (default: $DEFAULT_QUEUE_WORKERS)"
         echo " PROFILE Performance profile: low_memory, balanced, high_performance (default: $DEFAULT_PROFILE)"
         echo " LOG_LEVEL Log level: debug, info, warning, error (default: $DEFAULT_LOG_LEVEL)"
         echo " MAX_RESTARTS Maximum restart count (default: $DEFAULT_MAX_RESTARTS)"
@@ -296,7 +242,7 @@ main() {
         echo
         echo "Examples:"
         echo " PROFILE=high_performance API_WORKERS=8 $0"
-        echo " PORT=8080 QUEUE_WORKERS=4 $0"
+        echo " PORT=8080 API_WORKERS=4 $0"
         exit 0
     fi


@@ -1,6 +1,6 @@
 #!/usr/bin/env python3
 """
-Optimized unified startup script combining the FastAPI application and queue consumer.
+Optimized unified startup script for the FastAPI application.
 Supports performance monitoring, automatic restart, graceful shutdown, and related features.
 """
@@ -17,7 +17,7 @@ from typing import List, Optional, Dict, Any
 class ProcessManager:
-    """Process manager that controls the API service and queue consumer."""
+    """Process manager that controls the API service."""
     def __init__(self):
         self.processes: Dict[str, subprocess.Popen] = {}
@@ -78,44 +78,6 @@ class ProcessManager:
             print(f"Failed to start API server: {e}")
             return None
-    def start_queue_consumer(self, args) -> Optional[subprocess.Popen]:
-        """Start the queue consumer."""
-        print("Starting queue consumer...")
-        consumer_script = Path("task_queue/consumer.py")
-        if not consumer_script.exists():
-            consumer_script = consumer_script.with_suffix(".pyc")
-        # Build the queue consumer command
-        cmd = [
-            sys.executable,
-            str(consumer_script),
-            "--workers", str(args.queue_workers),
-            "--worker-type", args.worker_type
-        ]
-        try:
-            process = subprocess.Popen(
-                cmd,
-                stdout=subprocess.PIPE,
-                stderr=subprocess.STDOUT,
-                universal_newlines=True,
-                bufsize=1
-            )
-            # Start the output monitoring thread
-            threading.Thread(
-                target=self._monitor_output,
-                args=(process, "Queue consumer"),
-                daemon=True
-            ).start()
-            return process
-        except Exception as e:
-            print(f"Failed to start queue consumer: {e}")
-            return None
     def _monitor_output(self, process: subprocess.Popen, name: str):
         """Monitor process output."""
         try:
@@ -138,8 +100,6 @@ class ProcessManager:
             if name == "API server":
                 new_process = self.start_api_server(args)
-            elif name == "Queue consumer":
-                new_process = self.start_queue_consumer(args)
             else:
                 return False
@@ -169,27 +129,19 @@ class ProcessManager:
             print("Failed to start API server; exiting")
             return False
-        queue_process = self.start_queue_consumer(args)
-        if not queue_process:
-            print("Failed to start queue consumer; exiting")
-            api_process.terminate()
-            return False
         self.processes["API server"] = api_process
-        self.processes["Queue consumer"] = queue_process
         print("\n" + "=" * 70)
         print("All services started successfully!")
         print(f"API server: http://{args.host}:{args.port}")
         print(f"API PID: {api_process.pid}")
-        print(f"Queue consumer PID: {queue_process.pid}")
         print("Press Ctrl+C to stop all services")
         print("=" * 70 + "\n")
         self.running = True
         # Main monitoring loop
-        restart_counts = {"API server": 0, "Queue consumer": 0}
+        restart_counts = {"API server": 0}
         max_restarts = args.max_restarts
         while self.running and not self.shutdown_event.is_set():
@@ -262,7 +214,6 @@ class ProcessManager:
     def create_directories(self):
         """Create the required directories."""
         directories = [
-            "projects/queue_data",
             "projects/data",
             "projects/uploads",
             "projects/robot",
@@ -313,11 +264,6 @@ def parse_args():
     parser.add_argument("--log-level", type=str, default="info",
                         choices=["debug", "info", "warning", "error"], help="Log level")
-    # Queue consumer configuration
-    parser.add_argument("--queue-workers", type=int, default=2, help="Number of queue consumer worker threads")
-    parser.add_argument("--worker-type", type=str, default="threads",
-                        choices=["threads", "greenlets", "gevent"], help="Queue worker type")
     # Performance profile
     parser.add_argument("--profile", type=str, default="low_memory",
                         choices=["low_memory", "balanced", "high_performance"], help="Performance profile")


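@@ -1,154 +0,0 @@
# Queue System Usage Guide
## Overview
This project integrates an asynchronous queue system based on huey and SqliteHuey, used to run file-processing tasks asynchronously.
## Installing dependencies
```bash
pip install huey
```
## Directory structure
```
queue/
├── __init__.py # Package initialization
├── config.py # Queue configuration (SqliteHuey settings)
├── tasks.py # File-processing task definitions
├── manager.py # Queue manager
├── consumer.py # Queue consumer (worker process)
├── example.py # Usage examples
└── README.md # Documentation
```
## Core features
### 1. Queue configuration (config.py)
- Uses SqliteHuey as the message queue
- The database file is stored at `queue_data/huey.db`
- Supports task retries and error storage
### 2. File-processing tasks (tasks.py)
- `process_file_async`: process a single file asynchronously
- `process_multiple_files_async`: process files asynchronously in batches
- `process_zip_file_async`: process a zip archive asynchronously
- `cleanup_processed_files`: clean up old files
### 3. Queue manager (manager.py)
- Task submission and management
- Queue status monitoring
- Task result lookup
- Task record cleanup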
## Usage
### 1. Start the queue consumer
```bash
# Start a consumer with the default configuration
python queue/consumer.py
# Specify the number of worker threads
python queue/consumer.py --workers 4
# Show queue statistics
python queue/consumer.py --stats
# Check queue status
python queue/consumer.py --check
# Flush the queue
python queue/consumer.py --flush
```
### 2. Use the queue from code
```python
from queue.manager import queue_manager
# Process a single file
task_id = queue_manager.enqueue_file(
    project_id="my_project",
    file_path="/path/to/file.txt",
    original_filename="myfile.txt"
)
# Process files in a batch
task_ids = queue_manager.enqueue_multiple_files(
    project_id="my_project",
    file_paths=["/path/file1.txt", "/path/file2.txt"],
    original_filenames=["file1.txt", "file2.txt"]
)
# Process a zip file
task_id = queue_manager.enqueue_zip_file(
    project_id="my_project",
    zip_path="/path/to/archive.zip"
)
# Check task status
status = queue_manager.get_task_status(task_id)
print(status)
# Get queue statistics
stats = queue_manager.get_queue_stats()
print(stats)
```
### 3. Run the examples
```bash
python queue/example.py
```
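## Configuration
### Queue configuration parameters (config.py)
- `filename`: path of the SQLite database file
- `always_eager`: whether to run tasks immediately (can be set to True during development)
- `utc`: whether to use UTC time
- `compression_level`: compression level
- `store_errors`: whether to store error information
- `max_retries`: maximum number of retries
- `retry_delay`: delay between retries
### Consumer parameters (consumer.py)
- `--workers`: number of worker threads (default 2)
- `--worker-type`: worker type (threads/greenlets/processes)
- `--stats`: show statistics
- `--check`: check queue status
- `--flush`: flush the queue
## Task states
- `pending`: waiting to be processed
- `running`: being processed
- `complete/finished`: processing finished
- `error`: processing failed
- `scheduled`: scheduled task
## Best practices
1. **Production recommendations**:
- Set an appropriate number of worker threads (1-2 times the number of CPU cores is suggested)
- Clean up old task records regularly
- Monitor queue status and task execution
2. **Development recommendations**:
- Set `always_eager=True` to run tasks immediately while debugging
- Use the `--check` option to inspect queue status
- Run the example code to learn the features
3. **Error handling**:
- A failed task is retried automatically, up to 3 times
- Error information is stored in the database
- Error details can be viewed through `get_task_status()`
## Troubleshooting
1. **Database locked**: make sure only one consumer instance is running
2. **Task stuck**: check file paths and permissions
3. **Out of memory**: adjust the number of worker threads or use process mode
4. **Disk space**: clean up old files and task records regularly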


@@ -1,23 +0,0 @@
#!/usr/bin/env python3
"""
Queue package initialization.
"""

from .config import huey
from .manager import QueueManager, queue_manager
from .tasks import (
    process_file_async,
    process_multiple_files_async,
    process_zip_file_async,
    cleanup_processed_files
)

__all__ = [
    "huey",
    "QueueManager",
    "queue_manager",
    "process_file_async",
    "process_multiple_files_async",
    "process_zip_file_async",
    "cleanup_processed_files"
]


@@ -1,31 +0,0 @@
#!/usr/bin/env python3
"""
Queue configuration using SqliteHuey for asynchronous file processing.
"""

import os
import logging
from huey import SqliteHuey
from datetime import timedelta

# Configure logging
logger = logging.getLogger('app')

# Ensure projects/queue_data directory exists
queue_data_dir = os.path.join(os.path.dirname(__file__), '..', 'projects', 'queue_data')
os.makedirs(queue_data_dir, exist_ok=True)

# Initialize SqliteHuey
huey = SqliteHuey(
    filename=os.path.join(queue_data_dir, 'huey.db'),
    name='file_processor',  # Queue name
    always_eager=False,  # Set to False to enable async processing
    utc=True,  # Use UTC time
)

# Set default task configuration
huey.store_errors = True  # Store error information
huey.max_retries = 3  # Maximum retry count
huey.retry_delay = timedelta(seconds=60)  # Retry delay

logger.info(f"SqliteHuey queue initialized, database path: {os.path.join(queue_data_dir, 'huey.db')}")


@@ -1,171 +0,0 @@
#!/usr/bin/env python3
"""
Queue consumer for processing file tasks.
"""

import sys
import os
import time
import signal
import argparse
from pathlib import Path

# Add project root directory to Python path
project_root = Path(__file__).parent.parent
sys.path.insert(0, str(project_root))

from task_queue.config import huey
from task_queue.manager import queue_manager
from task_queue.integration_tasks import process_files_async, cleanup_project_async
from huey.consumer import Consumer


class QueueConsumer:
    """Queue consumer for processing async tasks."""

    def __init__(self, worker_type: str = "threads", workers: int = 2):
        self.huey = huey
        self.worker_type = worker_type
        self.workers = workers
        self.running = False
        self.consumer = None

        # Register signal handlers
        signal.signal(signal.SIGINT, self._signal_handler)
        signal.signal(signal.SIGTERM, self._signal_handler)

    def _signal_handler(self, signum, frame):
        """Signal handler for graceful shutdown."""
        print(f"\nReceived signal {signum}, shutting down queue consumer...")
        self.running = False

    def start(self):
        """Start the queue consumer."""
        print(f"Starting queue consumer...")
        print(f"Worker threads: {self.workers}")
        print(f"Worker type: {self.worker_type}")
        print(f"Database: {os.path.join(os.path.dirname(__file__), '..', 'projects', 'queue_data', 'huey.db')}")
        print("Press Ctrl+C to stop the consumer")

        self.running = True

        try:
            # Create Huey consumer
            self.consumer = Consumer(self.huey, workers=self.workers, worker_type=self.worker_type.rstrip('s'))

            # Display queue statistics
            stats = queue_manager.get_queue_stats()
            print(f"Current queue status: {stats}")

            # Start consumer run loop
            print("Consumer starting task processing...")
            self.consumer.run()

        except KeyboardInterrupt:
            print("\nReceived interrupt signal, shutting down...")
        except Exception as e:
            print(f"Queue consumer runtime error: {str(e)}")
        finally:
            self.stop()

    def stop(self):
        """Stop the queue consumer."""
        print("Stopping queue consumer...")
        try:
            if self.consumer:
                # Stop the consumer
                self.consumer.stop()
                self.consumer = None
            print("Queue consumer stopped")
        except Exception as e:
            print(f"Error stopping queue consumer: {str(e)}")

    def process_scheduled_tasks(self):
        """Process scheduled tasks."""
        print("Processing scheduled tasks...")
        # Additional scheduled task processing logic can be added here


def main():
    """Main entry point."""
    parser = argparse.ArgumentParser(description="File processing queue consumer")
    parser.add_argument(
        "--workers",
        type=int,
        default=2,
        help="Number of worker threads (default: 2)"
    )
    parser.add_argument(
        "--worker-type",
        choices=["threads", "greenlets", "processes"],
        default="threads",
        help="Worker thread type (default: threads)"
    )
    parser.add_argument(
        "--stats",
        action="store_true",
        help="Display queue statistics and exit"
    )
    parser.add_argument(
        "--flush",
        action="store_true",
        help="Flush the queue and exit"
    )
    parser.add_argument(
        "--check",
        action="store_true",
        help="Check queue status and exit"
    )

    args = parser.parse_args()

    # Initialize consumer
    consumer = QueueConsumer(
        worker_type=args.worker_type,
        workers=args.workers
    )

    # Handle different command-line options
    if args.stats:
        print("=== Queue Statistics ===")
        stats = queue_manager.get_queue_stats()
        print(f"Total tasks: {stats.get('total_tasks', 0)}")
        print(f"Pending tasks: {stats.get('pending_tasks', 0)}")
        print(f"Running tasks: {stats.get('running_tasks', 0)}")
        print(f"Completed tasks: {stats.get('completed_tasks', 0)}")
        print(f"Error tasks: {stats.get('error_tasks', 0)}")
        print(f"Scheduled tasks: {stats.get('scheduled_tasks', 0)}")
        print(f"Database: {stats.get('queue_database', 'N/A')}")
        return

    if args.flush:
        print("=== Flushing Queue ===")
        try:
            # Flush all tasks
            consumer.huey.flush()
            print("Queue flushed")
        except Exception as e:
            print(f"Failed to flush queue: {str(e)}")
        return

    if args.check:
        print("=== Checking Queue Status ===")
        stats = queue_manager.get_queue_stats()
        print(f"Queue status: OK" if "error" not in stats else f"Queue status: ERROR - {stats['error']}")
        pending_tasks = queue_manager.list_pending_tasks(limit=10)
        if pending_tasks:
            print(f"\nPending tasks (showing up to 10):")
            for task in pending_tasks:
                print(f" Task ID: {task['task_id']}, Status: {task['status']}, Created: {task['created_time']}")
        else:
            print("No pending tasks")
        return

    # Start consumer
    print("=== Starting File Processing Queue Consumer ===")
    consumer.start()


if __name__ == "__main__":
    main()


@@ -1,132 +0,0 @@
#!/usr/bin/env python3
"""
Example usage of the queue system.
"""

import sys
import time
from pathlib import Path

# Add project root directory to Python path
project_root = Path(__file__).parent.parent
sys.path.insert(0, str(project_root))

from task_queue.manager import queue_manager
from task_queue.tasks import process_file_async, process_multiple_files_async


def example_single_file():
    """Example: Process a single file."""
    print("=== Example: Process a single file ===")
    project_id = "test_project"
    file_path = "public/test_document.txt"

    # Enqueue file for processing
    task_id = queue_manager.enqueue_file(
        project_id=project_id,
        file_path=file_path,
        original_filename="example_document.txt"
    )
    print(f"Task submitted, task ID: {task_id}")

    # Check task status
    time.sleep(2)
    status = queue_manager.get_task_status(task_id)
    print(f"Task status: {status}")


def example_multiple_files():
    """Example: Batch process files."""
    print("\n=== Example: Batch process files ===")
    project_id = "test_project_batch"
    file_paths = [
        "public/test_document.txt",
        "public/goods.xlsx"  # Assuming this file exists
    ]
    original_filenames = [
        "batch_document_1.txt",
        "batch_goods.xlsx"
    ]

    # Enqueue multiple files for processing
    task_ids = queue_manager.enqueue_multiple_files(
        project_id=project_id,
        file_paths=file_paths,
        original_filenames=original_filenames
    )
    print(f"Batch tasks submitted, task IDs: {task_ids}")


def example_zip_file():
    """Example: Process a zip file."""
    print("\n=== Example: Process a zip file ===")
    project_id = "test_project_zip"
    zip_path = "public/all_hp_product_spec_book2506.zip"

    # Enqueue zip file for processing
    task_id = queue_manager.enqueue_zip_file(
        project_id=project_id,
        zip_path=zip_path
    )
    print(f"Zip task submitted, task ID: {task_id}")


def example_queue_stats():
    """Example: Get queue statistics."""
    print("\n=== Example: Queue statistics ===")
    stats = queue_manager.get_queue_stats()
    print("Queue statistics:")
    for key, value in stats.items():
        if key != "recent_tasks":
            print(f" {key}: {value}")


def example_cleanup():
    """Example: Cleanup tasks."""
    print("\n=== Example: Cleanup tasks ===")
    project_id = "test_project"

    # Enqueue cleanup task (delayed 10 seconds)
    task_id = queue_manager.enqueue_cleanup_task(
        project_id=project_id,
        older_than_days=1,  # Clean files older than 1 day
        delay=10
    )
    print(f"Cleanup task submitted, task ID: {task_id}")


def main():
    """Main entry point."""
    print("Queue System Usage Examples")
    print("=" * 50)
    try:
        # Run examples
        example_single_file()
        example_multiple_files()
        example_zip_file()
        example_queue_stats()
        example_cleanup()

        print("\n" + "=" * 50)
        print("Examples completed!")
        print("\nTo check task execution, run:")
        print("python queue/consumer.py --check")
        print("\nTo start the queue consumer, run:")
        print("python queue/consumer.py")
    except Exception as e:
        print(f"Error running examples: {str(e)}")


if __name__ == "__main__":
    main()


@@ -1,499 +0,0 @@
#!/usr/bin/env python3
"""
Queue tasks for file processing integration.
"""

import os
import json
import time
import hashlib
import shutil
from typing import Dict, List, Optional, Any

from task_queue.config import huey
from task_queue.manager import queue_manager
from task_queue.task_status import task_status_store
from utils import download_dataset_files, save_processed_files_log, load_processed_files_log
from utils.dataset_manager import remove_dataset_directory_by_key


def scan_upload_folder(upload_dir: str) -> List[str]:
    """
    Scan all supported file formats in the upload folder.

    Args:
        upload_dir: Upload folder path

    Returns:
        List[str]: List of supported file paths
    """
    supported_extensions = {
        # Text files
        '.txt', '.md', '.rtf',
        # Document files
        '.doc', '.docx', '.pdf', '.odt',
        # Spreadsheet files
        '.xls', '.xlsx', '.csv', '.ods',
        # Presentation files
        '.ppt', '.pptx', '.odp',
        # E-books
        '.epub', '.mobi',
        # Web files
        '.html', '.htm',
        # Config files
        '.json', '.xml', '.yaml', '.yml',
        # Code files
        '.py', '.js', '.java', '.cpp', '.c', '.go', '.rs',
        # Archive files
        '.zip', '.rar', '.7z', '.tar', '.gz'
    }

    scanned_files = []

    if not os.path.exists(upload_dir):
        return scanned_files

    for root, dirs, files in os.walk(upload_dir):
        for file in files:
            # Skip hidden files and system files
            if file.startswith('.') or file.startswith('~'):
                continue

            file_path = os.path.join(root, file)
            file_extension = os.path.splitext(file)[1].lower()

            # Check if file extension is supported
            if file_extension in supported_extensions:
                scanned_files.append(file_path)
            else:
                # For files without extension, try to process them (may be text files)
                if not file_extension:
                    try:
                        # Try reading the file header to determine if it's a text file
                        with open(file_path, 'r', encoding='utf-8') as f:
                            f.read(1024)  # Read the first 1KB
                        scanned_files.append(file_path)
                    except (UnicodeDecodeError, PermissionError):
                        # Not a text file or unreadable, skip
                        pass

    return scanned_files


@huey.task()
def process_files_async(
    dataset_id: str,
    files: Optional[Dict[str, List[str]]] = None,
    upload_folder: Optional[Dict[str, str]] = None,
    task_id: Optional[str] = None
) -> Dict[str, Any]:
    """
    Asynchronously process file tasks - compatible with existing files/process API.

    Args:
        dataset_id: Unique project ID
        files: Dictionary of file paths grouped by key
        upload_folder: Upload folder dictionary organized by group name, e.g. {'group1': 'my_project1', 'group2': 'my_project2'}
        task_id: Task ID (for status tracking)

    Returns:
        Processing result dictionary
    """
    try:
        print(f"Starting async file processing task, project ID: {dataset_id}")

        # If task_id is provided, set initial status
        if task_id:
            task_status_store.set_status(
                task_id=task_id,
                unique_id=dataset_id,
                status="running"
            )

        # Ensure project directory exists
        project_dir = os.path.join("projects", "data", dataset_id)
        if not os.path.exists(project_dir):
            os.makedirs(project_dir, exist_ok=True)

        # Process files: use key-grouped format
        processed_files_by_key = {}

        # If upload_folder is provided, scan files in those folders
        if upload_folder and not files:
            scanned_files_by_group = {}
            total_scanned_files = 0

            for group_name, folder_name in upload_folder.items():
                # Security check: prevent path traversal attacks
                safe_folder_name = os.path.basename(folder_name)
                upload_dir = os.path.join("projects", "uploads", safe_folder_name)

                if os.path.exists(upload_dir):
                    scanned_files = scan_upload_folder(upload_dir)
                    if scanned_files:
                        scanned_files_by_group[group_name] = scanned_files
                        total_scanned_files += len(scanned_files)
                        print(f"Scanned {len(scanned_files)} files from upload folder '{safe_folder_name}' (group: {group_name})")
                    else:
                        print(f"No supported files found in upload folder '{safe_folder_name}' (group: {group_name})")
                else:
                    print(f"Upload folder does not exist: {upload_dir} (group: {group_name})")

            if scanned_files_by_group:
                files = scanned_files_by_group
                print(f"Total scanned {total_scanned_files} files from {len(scanned_files_by_group)} groups")
            else:
                print("No supported files found in any upload folder")

        if files:
            # Use files from the request (grouped by key)
            # Since this is an async task, call synchronously
            import asyncio
            try:
                loop = asyncio.get_event_loop()
            except RuntimeError:
                loop = asyncio.new_event_loop()
                asyncio.set_event_loop(loop)

            processed_files_by_key = loop.run_until_complete(download_dataset_files(dataset_id, files))
            total_files = sum(len(files_list) for files_list in processed_files_by_key.values())
            print(f"Async processed {total_files} dataset files across {len(processed_files_by_key)} keys, project ID: {dataset_id}")
        else:
            print(f"No files provided in request, project ID: {dataset_id}")

        # Collect all document.txt files in the project directory
        document_files = []
        for root, dirs, files_list in os.walk(project_dir):
            for file in files_list:
                if file == "document.txt":
                    document_files.append(os.path.join(root, file))

        # Generate project README.md file
        try:
            from utils.project_manager import save_project_readme
            save_project_readme(dataset_id)
            print(f"README.md generated, project ID: {dataset_id}")
        except Exception as e:
            print(f"Failed to generate README.md, project ID: {dataset_id}, error: {str(e)}")
            # Does not affect main processing flow, continue

        # Build result file list
        result_files = []
        for key in processed_files_by_key.keys():
            # Add corresponding dataset document.txt path
            document_path = os.path.join("projects", "data", dataset_id, "datasets", key, "document.txt")
            if os.path.exists(document_path):
                result_files.append(document_path)

        # Also add document.txt files that exist but are not in processed_files_by_key
        existing_document_paths = set(result_files)  # Avoid duplicates
        for doc_file in document_files:
            if doc_file not in existing_document_paths:
                result_files.append(doc_file)

        result = {
            "status": "success",
            "message": f"Successfully processed {len(result_files)} document files across {len(processed_files_by_key)} keys",
            "dataset_id": dataset_id,
            "processed_files": result_files,
            "processed_files_by_key": processed_files_by_key,
            "document_files": document_files,
            "total_files_processed": sum(len(files_list) for files_list in processed_files_by_key.values()),
            "processing_time": time.time()
        }

        # Update task status to completed
        if task_id:
            task_status_store.update_status(
                task_id=task_id,
                status="completed",
                result=result
            )

        print(f"Async file processing task completed: {dataset_id}")
        return result

    except Exception as e:
        error_msg = f"Error during async file processing: {str(e)}"
        print(error_msg)

        # Update task status to error
        if task_id:
            task_status_store.update_status(
                task_id=task_id,
                status="failed",
                error=error_msg
            )

        return {
            "status": "error",
            "message": error_msg,
            "dataset_id": dataset_id,
            "error": str(e)
        }


@huey.task()
def process_files_incremental_async(
    dataset_id: str,
    files_to_add: Optional[Dict[str, List[str]]] = None,
    files_to_remove: Optional[Dict[str, List[str]]] = None,
    system_prompt: Optional[str] = None,
    mcp_settings: Optional[List[Dict]] = None,
    task_id: Optional[str] = None
) -> Dict[str, Any]:
    """
    Incremental file processing task - supports adding and removing files.

    Args:
        dataset_id: Unique project ID
        files_to_add: Dictionary of file paths to add, grouped by key
        files_to_remove: Dictionary of file paths to remove, grouped by key
        system_prompt: System prompt
        mcp_settings: MCP settings
        task_id: Task ID (for status tracking)

    Returns:
        Processing result dictionary
    """
    try:
        print(f"Starting incremental file processing task, project ID: {dataset_id}")

        # If task_id is provided, set initial status
        if task_id:
            task_status_store.set_status(
                task_id=task_id,
                unique_id=dataset_id,
                status="running"
            )

        # Ensure project directory exists
        project_dir = os.path.join("projects", "data", dataset_id)
        if not os.path.exists(project_dir):
            os.makedirs(project_dir, exist_ok=True)

        # Load existing processing log
        processed_log = load_processed_files_log(dataset_id)
        print(f"Loaded existing processing log with {len(processed_log)} file records")

        removed_files = []
        added_files = []

        # 1. Process removals
        if files_to_remove:
            print(f"Starting removal processing across {len(files_to_remove)} key groups")
            for key, file_list in files_to_remove.items():
                if not file_list:  # If file list is empty, remove the entire key group
                    print(f"Removing entire key group: {key}")
                    if remove_dataset_directory_by_key(dataset_id, key):
                        removed_files.append(f"dataset/{key}")

                    # Remove all records for this key from the processing log
                    keys_to_remove = [file_hash for file_hash, file_info in processed_log.items()
                                      if file_info.get('key') == key]
                    for file_hash in keys_to_remove:
                        del processed_log[file_hash]
                        removed_files.append(f"log_entry:{file_hash}")
                else:
                    # Remove specific files
                    for file_path in file_list:
                        print(f"Removing specific file: {key}/{file_path}")

                        # Actually delete the file
                        filename = os.path.basename(file_path)

                        # Delete original file
                        source_file = os.path.join("projects", "data", dataset_id, "files", key, filename)
                        if os.path.exists(source_file):
                            os.remove(source_file)
                            removed_files.append(f"file:{key}/{filename}")

                        # Delete processed file directory
                        processed_dir = os.path.join("projects", "data", dataset_id, "processed", key, filename)
                        if os.path.exists(processed_dir):
                            shutil.rmtree(processed_dir)
                            removed_files.append(f"processed:{key}/{filename}")

                        # Compute file hash to find in log
                        file_hash = hashlib.md5(file_path.encode('utf-8')).hexdigest()

                        # Remove from processing log
                        if file_hash in processed_log:
                            del processed_log[file_hash]
                            removed_files.append(f"log_entry:{file_hash}")

        # 2. Process additions
        processed_files_by_key = {}
        if files_to_add:
            print(f"Starting addition processing across {len(files_to_add)} key groups")

            # Use async processing to download files
            import asyncio
            try:
                loop = asyncio.get_event_loop()
            except RuntimeError:
                loop = asyncio.new_event_loop()
                asyncio.set_event_loop(loop)

            processed_files_by_key = loop.run_until_complete(download_dataset_files(dataset_id, files_to_add, incremental_mode=True))
            total_added_files = sum(len(files_list) for files_list in processed_files_by_key.values())
            print(f"Async processed {total_added_files} dataset files across {len(processed_files_by_key)} keys, project ID: {dataset_id}")

            # Record added files
            for key, files_list in processed_files_by_key.items():
                for file_path in files_list:
                    added_files.append(f"{key}/{file_path}")
        else:
            print(f"No files to add provided in request, project ID: {dataset_id}")

        # Save updated processing log
        save_processed_files_log(dataset_id, processed_log)
        print(f"Updated processing log, now contains {len(processed_log)} file records")

        # Save system_prompt and mcp_settings to project directory (if provided)
        if system_prompt:
            system_prompt_file = os.path.join(project_dir, "system_prompt.md")
            with open(system_prompt_file, 'w', encoding='utf-8') as f:
                f.write(system_prompt)
            print(f"Saved system_prompt, project ID: {dataset_id}")

        if mcp_settings:
            mcp_settings_file = os.path.join(project_dir, "mcp_settings.json")
            with open(mcp_settings_file, 'w', encoding='utf-8') as f:
                json.dump(mcp_settings, f, ensure_ascii=False, indent=2)
            print(f"Saved mcp_settings, project ID: {dataset_id}")

        # Generate project README.md file
        try:
            from utils.project_manager import save_project_readme
            save_project_readme(dataset_id)
            print(f"README.md generated, project ID: {dataset_id}")
        except Exception as e:
            print(f"Failed to generate README.md, project ID: {dataset_id}, error: {str(e)}")
            # Does not affect main processing flow, continue

        # Collect all document.txt files in the project directory
        document_files = []
        for root, dirs, files_list in os.walk(project_dir):
            for file in files_list:
                if file == "document.txt":
                    document_files.append(os.path.join(root, file))

        # Build result file list
        result_files = []
for key in processed_files_by_key.keys():
# Add corresponding dataset document.txt path
document_path = os.path.join("projects", "data", dataset_id, "datasets", key, "document.txt")
if os.path.exists(document_path):
result_files.append(document_path)
# Also add document.txt files that exist but are not in processed_files_by_key
existing_document_paths = set(result_files) # Avoid duplicates
for doc_file in document_files:
if doc_file not in existing_document_paths:
result_files.append(doc_file)
result = {
"status": "success",
"message": f"Incremental processing complete - added {len(added_files)} files, removed {len(removed_files)} files, {len(result_files)} document files remaining",
"dataset_id": dataset_id,
"removed_files": removed_files,
"added_files": added_files,
"processed_files": result_files,
"processed_files_by_key": processed_files_by_key,
"document_files": document_files,
"total_files_added": sum(len(files_list) for files_list in processed_files_by_key.values()),
"total_files_removed": len(removed_files),
"final_files_count": len(result_files),
"processing_time": time.time()  # Completion timestamp (epoch seconds), not an elapsed duration
}
# Update task status to completed
if task_id:
task_status_store.update_status(
task_id=task_id,
status="completed",
result=result
)
print(f"Incremental file processing task completed: {dataset_id}")
return result
except Exception as e:
error_msg = f"Error during incremental file processing: {str(e)}"
print(error_msg)
# Update task status to error
if task_id:
task_status_store.update_status(
task_id=task_id,
status="failed",
error=error_msg
)
return {
"status": "error",
"message": error_msg,
"dataset_id": dataset_id,
"error": str(e)
}
@huey.task()
def cleanup_project_async(
dataset_id: str,
remove_all: bool = False
) -> Dict[str, Any]:
"""
Asynchronously clean up project files.
Args:
dataset_id: Unique project ID
remove_all: Whether to remove the entire project directory
Returns:
Cleanup result dictionary
"""
try:
print(f"Starting async project cleanup, project ID: {dataset_id}")
project_dir = os.path.join("projects", "data", dataset_id)
removed_items = []
if remove_all and os.path.exists(project_dir):
import shutil
shutil.rmtree(project_dir)
removed_items.append(project_dir)
result = {
"status": "success",
"message": f"Deleted entire project directory: {project_dir}",
"dataset_id": dataset_id,
"removed_items": removed_items,
"action": "remove_all"
}
else:
# Only clean processing log
log_file = os.path.join(project_dir, "processed_files.json")
if os.path.exists(log_file):
os.remove(log_file)
removed_items.append(log_file)
result = {
"status": "success",
"message": f"Cleaned project processing log, project ID: {dataset_id}",
"dataset_id": dataset_id,
"removed_items": removed_items,
"action": "cleanup_logs"
}
print(f"Async cleanup task completed: {dataset_id}")
return result
except Exception as e:
error_msg = f"Error during async project cleanup: {str(e)}"
print(error_msg)
return {
"status": "error",
"message": error_msg,
"dataset_id": dataset_id,
"error": str(e)
}
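The removal branch above keys log entries by the MD5 digest of the file path string rather than of the file contents. A minimal sketch of that lookup follows, assuming the log is a plain dict keyed by digest; `path_key` and `drop_log_entries` are hypothetical helper names, not functions from the repository.

```python
import hashlib
from typing import Dict, List


def path_key(file_path: str) -> str:
    """Return the MD5 hex digest of a path string (the log key format used above)."""
    return hashlib.md5(file_path.encode("utf-8")).hexdigest()


def drop_log_entries(processed_log: Dict[str, dict], file_paths: List[str]) -> List[str]:
    """Delete the log entries for the given paths and return the removed keys."""
    removed = []
    for file_path in file_paths:
        key = path_key(file_path)
        if key in processed_log:
            del processed_log[key]
            removed.append(key)
    return removed


# Illustrative data only: two logged files, one of which is removed.
log = {
    path_key("docs/a.txt"): {"key": "docs"},
    path_key("docs/b.txt"): {"key": "docs"},
}
removed = drop_log_entries(log, ["docs/a.txt", "docs/missing.txt"])
```

Because the key depends only on the path string, a path that was logged under a different spelling (for example with a leading `./`) would not be found by this lookup.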

@ -1,228 +0,0 @@
#!/usr/bin/env python3
"""
Queue manager for handling file processing queues.
"""
import os
import json
import time
import logging
from typing import Dict, List, Optional, Any
from huey import Huey
from huey.api import Task
from datetime import datetime, timedelta
# Configure logging
logger = logging.getLogger('app')
from .config import huey
from .tasks import process_file_async, process_multiple_files_async, process_zip_file_async, cleanup_processed_files
class QueueManager:
"""Queue manager for file processing tasks."""
def __init__(self):
self.huey = huey
logger.info(f"Queue manager initialized with database: {os.path.join(os.path.dirname(__file__), '..', 'projects', 'queue_data', 'huey.db')}")
def enqueue_file(
self,
project_id: str,
file_path: str,
original_filename: str = None,
delay: int = 0
) -> str:
"""
Add a file to the processing queue.
Args:
project_id: Project ID
file_path: File path
original_filename: Original filename
delay: Delay before execution in seconds
Returns:
Task ID
"""
if delay > 0:
task = process_file_async.schedule(
args=(project_id, file_path, original_filename),
delay=timedelta(seconds=delay)
)
else:
task = process_file_async(project_id, file_path, original_filename)
logger.info(f"File queued for processing: {file_path}, task ID: {task.id}")
return task.id
def enqueue_multiple_files(
self,
project_id: str,
file_paths: List[str],
original_filenames: List[str] = None,
delay: int = 0
) -> List[str]:
"""
Add multiple files to the processing queue.
Args:
project_id: Project ID
file_paths: List of file paths
original_filenames: List of original filenames
delay: Delay before execution in seconds
Returns:
List of task IDs
"""
if delay > 0:
task = process_multiple_files_async.schedule(
args=(project_id, file_paths, original_filenames),
delay=timedelta(seconds=delay)
)
else:
task = process_multiple_files_async(project_id, file_paths, original_filenames)
logger.info(f"Batch files queued for processing: {len(file_paths)} files, task ID: {task.id}")
return [task.id]
def enqueue_zip_file(
self,
project_id: str,
zip_path: str,
extract_to: str = None,
delay: int = 0
) -> str:
"""
Add a zip file to the processing queue.
Args:
project_id: Project ID
zip_path: Path to the zip file
extract_to: Extraction target directory
delay: Delay before execution in seconds
Returns:
Task ID
"""
if delay > 0:
task = process_zip_file_async.schedule(
args=(project_id, zip_path, extract_to),
delay=timedelta(seconds=delay)
)
else:
task = process_zip_file_async(project_id, zip_path, extract_to)
logger.info(f"Zip file queued for processing: {zip_path}, task ID: {task.id}")
return task.id
    def get_task_status(self, task_id: str) -> Dict[str, Any]:
        """
        Get task status.
        Args:
            task_id: Task ID
        Returns:
            Task status information
        """
        try:
            # Try getting the task result from result storage
            try:
                # Use Huey's built-in result lookup when available.
                # preserve=True keeps the stored result in place; by default Huey
                # removes a result once it has been read, so a status poll would
                # otherwise consume it.
                if hasattr(self.huey, 'result') and self.huey.result:
                    result = self.huey.result(task_id, preserve=True)
                    if result is not None:
                        return {
                            "task_id": task_id,
                            "status": "complete",
                            "result": result
                        }
            except Exception:
                pass
            # Check whether the task is in the pending queue
            try:
                pending_tasks = list(self.huey.pending())
                for task in pending_tasks:
                    if hasattr(task, 'id') and task.id == task_id:
                        return {
                            "task_id": task_id,
                            "status": "pending"
                        }
            except Exception:
                pass
            # Check whether the task is in the scheduled queue
            try:
                scheduled_tasks = list(self.huey.scheduled())
                for task in scheduled_tasks:
                    if hasattr(task, 'id') and task.id == task_id:
                        return {
                            "task_id": task_id,
                            "status": "scheduled"
                        }
            except Exception:
                pass
            # If not found anywhere, it may not exist or may have completed with cleaned results
            return {
                "task_id": task_id,
                "status": "unknown",
                "message": "Task status is unknown; it may already be complete or may not exist"
            }
        except Exception as e:
            return {
                "task_id": task_id,
                "status": "error",
                "message": f"Failed to get task status: {str(e)}"
            }
def get_queue_stats(self) -> Dict[str, Any]:
"""
Get queue statistics.
Returns:
Queue statistics information
"""
try:
# Use a simplified approach for queue statistics
stats = {
"total_tasks": 0,
"pending_tasks": 0,
"running_tasks": 0,
"completed_tasks": 0,
"error_tasks": 0,
"scheduled_tasks": 0,
"recent_tasks": [],
"queue_database": os.path.join(os.path.dirname(__file__), '..', 'projects', 'queue_data', 'huey.db')
}
# Try to get the number of pending tasks
try:
pending_tasks = list(self.huey.pending())
stats["pending_tasks"] = len(pending_tasks)
stats["total_tasks"] += len(pending_tasks)
except Exception as e:
logger.error(f"Failed to get pending tasks: {e}")
# Try to get the number of scheduled tasks
try:
scheduled_tasks = list(self.huey.scheduled())
stats["scheduled_tasks"] = len(scheduled_tasks)
stats["total_tasks"] += len(scheduled_tasks)
except Exception as e:
logger.error(f"Failed to get scheduled tasks: {e}")
return stats
except Exception as e:
return {
"error": str(e),
"queue_database": os.path.join(os.path.dirname(__file__), '..', 'projects', 'queue_data', 'huey.db')
}
# Global singleton instance
queue_manager = QueueManager()

@ -1,286 +0,0 @@
#!/usr/bin/env python3
"""
Optimized queue consumer with integrated performance monitoring.
"""
import sys
import os
import time
import signal
import argparse
import multiprocessing
import logging
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor
import threading
# Configure logging
logger = logging.getLogger('app')
# Add project root directory to Python path
project_root = Path(__file__).parent.parent
sys.path.insert(0, str(project_root))
from task_queue.config import huey
from task_queue.manager import queue_manager
from task_queue.integration_tasks import process_files_async, cleanup_project_async
from huey.consumer import Consumer
class OptimizedQueueConsumer:
"""Optimized queue consumer with integrated performance monitoring."""
def __init__(self, worker_type: str = "threads", workers: int = 2):
self.huey = huey
self.worker_type = worker_type
self.workers = workers
self.running = False
self.consumer = None
self.processed_count = 0
self.start_time = None
# Performance monitoring
self.performance_stats = {
'tasks_processed': 0,
'tasks_failed': 0,
'avg_processing_time': 0,
'start_time': None,
'last_activity': None
}
# Register signal handlers
signal.signal(signal.SIGINT, self._signal_handler)
signal.signal(signal.SIGTERM, self._signal_handler)
def _signal_handler(self, signum, frame):
"""Signal handler for graceful shutdown."""
logger.info(f"\nReceived signal {signum}, shutting down queue consumer...")
self.running = False
if self.consumer:
self.consumer.stop()
def setup_optimizations(self):
"""Set up performance optimizations."""
# Set environment variables
env_vars = {
'PYTHONUNBUFFERED': '1',
'PYTHONDONTWRITEBYTECODE': '1',
}
for key, value in env_vars.items():
os.environ[key] = value
# Optimize huey configuration
if hasattr(huey, 'immediate'):
huey.immediate = False
# Adjust based on worker type
if self.worker_type == "threads":
# Thread pool optimization
if hasattr(huey, 'worker_type'):
huey.worker_type = 'threads'
# Set thread pool size
if hasattr(huey, 'always_eager'):
huey.always_eager = False
logger.info("Queue consumer optimization setup complete:")
logger.info(f"- Worker type: {self.worker_type}")
logger.info(f"- Worker count: {self.workers}")
def monitor_performance(self):
"""Performance monitoring thread."""
while self.running:
time.sleep(30) # Output statistics every 30 seconds
if self.start_time:
elapsed = time.time() - self.start_time
rate = self.performance_stats['tasks_processed'] / max(1, elapsed)
logger.info(f"\n[Performance Stats]")
logger.info(f"- Uptime: {elapsed:.1f}s")
logger.info(f"- Tasks processed: {self.performance_stats['tasks_processed']}")
logger.info(f"- Failed tasks: {self.performance_stats['tasks_failed']}")
logger.info(f"- Average processing rate: {rate:.2f} tasks/s")
if self.performance_stats['avg_processing_time'] > 0:
logger.info(f"- Average processing time: {self.performance_stats['avg_processing_time']:.2f}s")
    def start(self):
        """Start the queue consumer."""
        logger.info("=" * 60)
        logger.info("Optimized queue consumer starting")
        logger.info("=" * 60)
        # Apply optimizations
        self.setup_optimizations()
        logger.info(f"Database: {os.path.join(os.path.dirname(__file__), '..', 'projects', 'queue_data', 'huey.db')}")
        logger.info("Press Ctrl+C to stop the consumer")
        self.running = True
        self.start_time = time.time()
        self.performance_stats['start_time'] = self.start_time
        # Start performance monitoring thread
        monitor_thread = threading.Thread(target=self.monitor_performance, daemon=True)
        monitor_thread.start()
        try:
            # Huey documents its worker types in the singular ('thread',
            # 'greenlet', 'process'), so map the plural CLI values onto them.
            huey_worker_type = {
                "threads": "thread",
                "greenlets": "greenlet",
            }.get(self.worker_type, self.worker_type)
            # Create consumer
            self.consumer = Consumer(
                self.huey,
                workers=self.workers,
                worker_type=huey_worker_type,
                max_delay=60.0,  # Maximum polling delay in seconds
                scheduler_interval=1,  # Scheduler check interval in seconds
                periodic=True,  # Enable periodic tasks
            )
            logger.info("Queue consumer started, waiting for tasks...")
            # Start the consumer
            self.consumer.run()
        except KeyboardInterrupt:
            logger.info("\nReceived keyboard interrupt signal")
        except Exception as e:
            # logger.exception records the traceback through the logging system
            logger.exception(f"Queue consumer runtime error: {e}")
        finally:
            self.shutdown()
def shutdown(self):
"""Shut down the queue consumer."""
logger.info("\nShutting down queue consumer...")
self.running = False
if self.consumer:
try:
self.consumer.stop()
logger.info("Queue consumer stopped")
except Exception as e:
logger.error(f"Error stopping queue consumer: {e}")
# Output final statistics
if self.start_time:
elapsed = time.time() - self.start_time
logger.info(f"\n[Final Stats]")
logger.info(f"- Total uptime: {elapsed:.1f}s")
logger.info(f"- Total tasks processed: {self.performance_stats['tasks_processed']}")
logger.info(f"- Total failed tasks: {self.performance_stats['tasks_failed']}")
if self.performance_stats['tasks_processed'] > 0:
rate = self.performance_stats['tasks_processed'] / elapsed
logger.info(f"- Average processing rate: {rate:.2f} tasks/s")
def calculate_optimal_workers():
"""Calculate the optimal number of worker threads."""
cpu_count = multiprocessing.cpu_count()
# Based on CPU core count and system resources
if cpu_count <= 2:
return 2
elif cpu_count <= 4:
return 4
else:
return min(8, cpu_count)
def check_queue_status():
"""Check queue status."""
try:
stats = queue_manager.get_queue_stats()
logger.info("\n[Queue Status]")
if isinstance(stats, dict):
if 'total_tasks' in stats:
logger.info(f"- Total tasks: {stats['total_tasks']}")
if 'pending_tasks' in stats:
logger.info(f"- Pending tasks: {stats['pending_tasks']}")
if 'scheduled_tasks' in stats:
logger.info(f"- Scheduled tasks: {stats['scheduled_tasks']}")
# Check database file
db_path = os.path.join(os.path.dirname(__file__), '..', 'projects', 'queue_data', 'huey.db')
if os.path.exists(db_path):
size = os.path.getsize(db_path)
logger.info(f"- Database size: {size} bytes")
else:
logger.info("- Database file: not found")
except Exception as e:
logger.error(f"Failed to get queue status: {e}")
def main():
"""Main entry point."""
parser = argparse.ArgumentParser(description="Optimized queue consumer")
parser.add_argument(
"--workers",
type=int,
default=calculate_optimal_workers(),
help=f"Number of worker threads (default: {calculate_optimal_workers()})"
)
parser.add_argument(
"--worker-type",
type=str,
default="threads",
choices=["threads", "greenlets", "gevent"],
help="Worker type (default: threads)"
)
parser.add_argument(
"--check-status",
action="store_true",
help="Check queue status and exit"
)
parser.add_argument(
"--profile",
type=str,
default="balanced",
choices=["low_memory", "balanced", "high_performance"],
help="Performance profile"
)
args = parser.parse_args()
# Apply performance profile
if args.profile == "low_memory":
os.environ['PYTHONOPTIMIZE'] = '1'
if args.workers > 2:
args.workers = 2
logger.info(f"Low memory mode: adjusted worker count to {args.workers}")
elif args.profile == "high_performance":
if args.workers < 4:
args.workers = 4
logger.info(f"High performance mode: adjusted worker count to {args.workers}")
# Check queue status
if args.check_status:
check_queue_status()
return
# Check environment
try:
import psutil
memory = psutil.virtual_memory()
logger.info("[System Info]")
logger.info(f"- CPU cores: {multiprocessing.cpu_count()}")
logger.info(f"- Available memory: {memory.available / (1024**3):.1f}GB")
logger.info(f"- Memory usage: {memory.percent:.1f}%")
except ImportError:
logger.info("[Tip] Install psutil to display system info: pip install psutil")
# Create and start the queue consumer
consumer = OptimizedQueueConsumer(
worker_type=args.worker_type,
workers=args.workers
)
consumer.start()
if __name__ == "__main__":
main()
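The worker-count heuristic and the `--profile` adjustment in the consumer above can be checked in isolation. The sketch below restates both as pure functions, assuming the same thresholds as the original; `optimal_workers` and `apply_profile` are hypothetical names that take the CPU count as a parameter instead of calling `multiprocessing.cpu_count()`.

```python
def optimal_workers(cpu_count: int, cap: int = 8) -> int:
    """Mirror calculate_optimal_workers(): 2 or 4 on small machines, capped at `cap`."""
    if cpu_count <= 2:
        return 2
    if cpu_count <= 4:
        return 4
    return min(cap, cpu_count)


def apply_profile(workers: int, profile: str) -> int:
    """Mirror the profile handling in main(): clamp down or up, else leave as is."""
    if profile == "low_memory":
        return min(workers, 2)
    if profile == "high_performance":
        return max(workers, 4)
    return workers


# Worker counts for a few machine sizes under the default "balanced" profile.
table = {cpus: apply_profile(optimal_workers(cpus), "balanced") for cpus in (1, 3, 6, 16)}
```

Taking the CPU count as an argument keeps the functions deterministic, which makes the thresholds easy to test without depending on the host machine.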

@ -1,210 +0,0 @@
#!/usr/bin/env python3
"""
Task status SQLite storage system.
"""
import json
import os
import sqlite3
import time
from typing import Dict, Optional, Any, List
from pathlib import Path
class TaskStatusStore:
"""SQLite-based task status store."""
def __init__(self, db_path: str = "projects/queue_data/task_status.db"):
self.db_path = db_path
# Ensure directory exists
Path(db_path).parent.mkdir(parents=True, exist_ok=True)
self._init_database()
def _init_database(self):
"""Initialize database tables."""
with sqlite3.connect(self.db_path) as conn:
conn.execute('''
CREATE TABLE IF NOT EXISTS task_status (
task_id TEXT PRIMARY KEY,
unique_id TEXT NOT NULL,
status TEXT NOT NULL,
created_at REAL NOT NULL,
updated_at REAL NOT NULL,
result TEXT,
error TEXT
)
''')
conn.commit()
def set_status(self, task_id: str, unique_id: str, status: str,
result: Optional[Dict] = None, error: Optional[str] = None):
"""Set task status."""
current_time = time.time()
with sqlite3.connect(self.db_path) as conn:
conn.execute('''
INSERT OR REPLACE INTO task_status
(task_id, unique_id, status, created_at, updated_at, result, error)
VALUES (?, ?, ?, ?, ?, ?, ?)
''', (
task_id, unique_id, status, current_time, current_time,
json.dumps(result) if result else None,
error
))
conn.commit()
def get_status(self, task_id: str) -> Optional[Dict]:
"""Get task status."""
with sqlite3.connect(self.db_path) as conn:
conn.row_factory = sqlite3.Row
cursor = conn.execute(
'SELECT * FROM task_status WHERE task_id = ?', (task_id,)
)
row = cursor.fetchone()
if not row:
return None
result = dict(row)
# Parse JSON field
if result['result']:
result['result'] = json.loads(result['result'])
return result
def update_status(self, task_id: str, status: str,
result: Optional[Dict] = None, error: Optional[str] = None):
"""Update task status."""
with sqlite3.connect(self.db_path) as conn:
# Check if task exists
cursor = conn.execute(
'SELECT task_id FROM task_status WHERE task_id = ?', (task_id,)
)
if not cursor.fetchone():
return False
# Update status
conn.execute('''
UPDATE task_status
SET status = ?, updated_at = ?, result = ?, error = ?
WHERE task_id = ?
''', (
status, time.time(),
json.dumps(result) if result else None,
error, task_id
))
conn.commit()
return True
def delete_status(self, task_id: str):
"""Delete task status."""
with sqlite3.connect(self.db_path) as conn:
cursor = conn.execute(
'DELETE FROM task_status WHERE task_id = ?', (task_id,)
)
conn.commit()
return cursor.rowcount > 0
def list_all(self) -> Dict[str, Dict]:
"""List all task statuses."""
with sqlite3.connect(self.db_path) as conn:
conn.row_factory = sqlite3.Row
cursor = conn.execute(
'SELECT * FROM task_status ORDER BY updated_at DESC'
)
all_tasks = {}
for row in cursor:
result = dict(row)
# Parse JSON field
if result['result']:
result['result'] = json.loads(result['result'])
all_tasks[result['task_id']] = result
return all_tasks
def get_by_unique_id(self, unique_id: str) -> List[Dict]:
"""Get all tasks for a given project ID."""
with sqlite3.connect(self.db_path) as conn:
conn.row_factory = sqlite3.Row
cursor = conn.execute(
'SELECT * FROM task_status WHERE unique_id = ? ORDER BY updated_at DESC',
(unique_id,)
)
tasks = []
for row in cursor:
result = dict(row)
if result['result']:
result['result'] = json.loads(result['result'])
tasks.append(result)
return tasks
def cleanup_old_tasks(self, older_than_days: int = 7) -> int:
"""Clean up old task records."""
cutoff_time = time.time() - (older_than_days * 24 * 3600)
with sqlite3.connect(self.db_path) as conn:
cursor = conn.execute(
'DELETE FROM task_status WHERE updated_at < ?',
(cutoff_time,)
)
conn.commit()
return cursor.rowcount
def get_statistics(self) -> Dict[str, Any]:
"""Get task statistics."""
with sqlite3.connect(self.db_path) as conn:
# Total tasks
total = conn.execute('SELECT COUNT(*) FROM task_status').fetchone()[0]
# Status breakdown
status_stats = conn.execute('''
SELECT status, COUNT(*) as count
FROM task_status
GROUP BY status
''').fetchall()
# Tasks in the last 24 hours
recent = time.time() - (24 * 3600)
recent_tasks = conn.execute(
'SELECT COUNT(*) FROM task_status WHERE updated_at > ?',
(recent,)
).fetchone()[0]
return {
'total_tasks': total,
'status_breakdown': dict(status_stats),
'recent_24h': recent_tasks,
'database_path': self.db_path
}
def search_tasks(self, status: Optional[str] = None,
unique_id: Optional[str] = None,
limit: int = 100) -> List[Dict]:
"""Search tasks."""
query = 'SELECT * FROM task_status WHERE 1=1'
params = []
if status:
query += ' AND status = ?'
params.append(status)
if unique_id:
query += ' AND unique_id = ?'
params.append(unique_id)
query += ' ORDER BY updated_at DESC LIMIT ?'
params.append(limit)
with sqlite3.connect(self.db_path) as conn:
conn.row_factory = sqlite3.Row
cursor = conn.execute(query, params)
tasks = []
for row in cursor:
result = dict(row)
if result['result']:
result['result'] = json.loads(result['result'])
tasks.append(result)
return tasks
# Global task status store instance
task_status_store = TaskStatusStore()
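The status store above follows a create-then-update lifecycle: a row is inserted when a task is queued and later updated with a result or an error. A minimal standalone sketch of that lifecycle follows, using the same table layout; the module-level functions and the temporary database path are assumptions for illustration, not the repository's API.

```python
import json
import os
import sqlite3
import tempfile
import time
from contextlib import closing
from typing import Any, Dict, Optional

# Hypothetical throwaway path; the original defaults to projects/queue_data/task_status.db.
db_path = os.path.join(tempfile.mkdtemp(), "task_status.db")

SCHEMA = """
CREATE TABLE IF NOT EXISTS task_status (
    task_id TEXT PRIMARY KEY,
    unique_id TEXT NOT NULL,
    status TEXT NOT NULL,
    created_at REAL NOT NULL,
    updated_at REAL NOT NULL,
    result TEXT,
    error TEXT
)
"""


def set_status(task_id: str, unique_id: str, status: str) -> None:
    """Insert (or replace) the row for a newly queued task."""
    now = time.time()
    with closing(sqlite3.connect(db_path)) as conn, conn:
        conn.execute(SCHEMA)
        conn.execute(
            "INSERT OR REPLACE INTO task_status "
            "(task_id, unique_id, status, created_at, updated_at, result, error) "
            "VALUES (?, ?, ?, ?, ?, NULL, NULL)",
            (task_id, unique_id, status, now, now),
        )


def update_status(task_id: str, status: str, result: Optional[Dict[str, Any]] = None) -> bool:
    """Update an existing row; return False when the task ID is unknown."""
    with closing(sqlite3.connect(db_path)) as conn, conn:
        cursor = conn.execute(
            "UPDATE task_status SET status = ?, updated_at = ?, result = ? WHERE task_id = ?",
            (status, time.time(), json.dumps(result) if result else None, task_id),
        )
        return cursor.rowcount > 0


def get_status(task_id: str) -> Optional[Dict[str, Any]]:
    """Fetch one row as a dict, decoding the JSON result column."""
    with closing(sqlite3.connect(db_path)) as conn:
        conn.row_factory = sqlite3.Row
        row = conn.execute("SELECT * FROM task_status WHERE task_id = ?", (task_id,)).fetchone()
    if row is None:
        return None
    record = dict(row)
    if record["result"]:
        record["result"] = json.loads(record["result"])
    return record


set_status("task-1", "dataset-42", "pending")
update_status("task-1", "completed", result={"final_files_count": 3})
record = get_status("task-1")
unknown_updated = update_status("no-such-task", "failed")
```

Checking `rowcount` after the `UPDATE` replaces the separate existence query in the original and reports an unknown task ID in a single statement.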

@ -1,359 +0,0 @@
#!/usr/bin/env python3
"""
File processing tasks for the queue system.
"""
import os
import json
import time
import shutil
import logging
from pathlib import Path
from typing import Dict, List, Optional, Any
from huey import crontab
# Configure logging
logger = logging.getLogger('app')
from .config import huey
from utils.file_utils import (
extract_zip_file,
get_file_hash,
load_processed_files_log,
save_processed_files_log,
get_document_preview
)
@huey.task()
def process_file_async(
project_id: str,
file_path: str,
original_filename: str = None,
target_directory: str = "files"
) -> Dict[str, Any]:
"""
Asynchronously process a single file.
Args:
project_id: Project ID
file_path: File path
original_filename: Original filename
target_directory: Target directory
Returns:
Processing result dictionary
"""
try:
logger.info(f"Starting file processing: {file_path}")
# Ensure project directory exists
project_dir = os.path.join("projects", project_id)
files_dir = os.path.join(project_dir, target_directory)
os.makedirs(files_dir, exist_ok=True)
# Get file hash as identifier
file_hash = get_file_hash(file_path)
# Check if file has already been processed
processed_log = load_processed_files_log(project_id)
if file_hash in processed_log:
logger.info(f"File already processed, skipping: {file_path}")
return {
"status": "skipped",
"message": "File already processed",
"file_hash": file_hash,
"project_id": project_id
}
# Process the file
result = _process_single_file(
file_path,
files_dir,
original_filename or os.path.basename(file_path)
)
# Update processing log
if result["status"] == "success":
processed_log[file_hash] = {
"original_path": file_path,
"original_filename": original_filename or os.path.basename(file_path),
"processed_at": str(time.time()),
"status": "processed",
"result": result
}
save_processed_files_log(project_id, processed_log)
result["file_hash"] = file_hash
result["project_id"] = project_id
logger.info(f"File processing complete: {file_path}, status: {result['status']}")
return result
except Exception as e:
error_msg = f"Error processing file: {str(e)}"
logger.error(error_msg)
return {
"status": "error",
"message": error_msg,
"file_path": file_path,
"project_id": project_id
}
@huey.task()
def process_multiple_files_async(
    project_id: str,
    file_paths: List[str],
    original_filenames: Optional[List[str]] = None
) -> List[Dict[str, Any]]:
    """
    Asynchronously process multiple files in batch.
    Args:
        project_id: Project ID
        file_paths: List of file paths
        original_filenames: List of original filenames
    Returns:
        List of queued-task descriptors, one per file
    """
    try:
        logger.info(f"Starting batch processing of {len(file_paths)} files")
        results = []
        for i, file_path in enumerate(file_paths):
            original_filename = original_filenames[i] if original_filenames and i < len(original_filenames) else None
            # Enqueue one task per file. Calling a Huey task returns a result
            # handle rather than the task's return value, so record the task ID.
            task = process_file_async(project_id, file_path, original_filename)
            results.append({
                "status": "queued",
                "task_id": task.id,
                "file_path": file_path,
                "project_id": project_id
            })
        logger.info(f"Batch file processing tasks submitted, total {len(results)} files")
        return results
    except Exception as e:
        error_msg = f"Error during batch file processing: {str(e)}"
        logger.error(error_msg)
        return [{
            "status": "error",
            "message": error_msg,
            "project_id": project_id
        }]
@huey.task()
def process_zip_file_async(
    project_id: str,
    zip_path: str,
    extract_to: Optional[str] = None
) -> Dict[str, Any]:
    """
    Asynchronously process a zip archive file.
    Args:
        project_id: Project ID
        zip_path: Zip file path
        extract_to: Extraction target directory
    Returns:
        Processing result dictionary
    """
    try:
        logger.info(f"Starting zip file processing: {zip_path}")
        # Set extraction directory
        if extract_to is None:
            extract_to = os.path.join("projects", project_id, "extracted", os.path.basename(zip_path))
        os.makedirs(extract_to, exist_ok=True)
        # Extract files
        extracted_files = extract_zip_file(zip_path, extract_to)
        if not extracted_files:
            return {
                "status": "error",
                "message": "Extraction failed or no supported files found",
                "zip_path": zip_path,
                "project_id": project_id
            }
        # Enqueue batch processing of the extracted files. The call returns a
        # Huey result handle, so only its task ID is kept in the return value.
        batch_task = process_multiple_files_async(project_id, extracted_files)
        return {
            "status": "success",
            "message": f"Zip file processing complete, extracted {len(extracted_files)} files",
            "zip_path": zip_path,
            "extract_to": extract_to,
            "extracted_files": extracted_files,
            "project_id": project_id,
            "batch_task_id": batch_task.id
        }
    except Exception as e:
        error_msg = f"Error processing zip file: {str(e)}"
        logger.error(error_msg)
        return {
            "status": "error",
            "message": error_msg,
            "zip_path": zip_path,
            "project_id": project_id
        }
@huey.task()
def cleanup_processed_files(
project_id: str,
older_than_days: int = 30
) -> Dict[str, Any]:
"""
Clean up old processed files.
Args:
project_id: Project ID
older_than_days: Clean files older than this many days
Returns:
Cleanup result dictionary
"""
try:
logger.info(f"Starting cleanup of files older than {older_than_days} days in project {project_id}")
project_dir = os.path.join("projects", project_id)
if not os.path.exists(project_dir):
return {
"status": "error",
"message": "Project directory does not exist",
"project_id": project_id
}
current_time = time.time()
cutoff_time = current_time - (older_than_days * 24 * 3600)
cleaned_files = []
# Walk through project directory
for root, dirs, files in os.walk(project_dir):
for file in files:
file_path = os.path.join(root, file)
file_mtime = os.path.getmtime(file_path)
if file_mtime < cutoff_time:
try:
os.remove(file_path)
cleaned_files.append(file_path)
logger.info(f"Deleted old file: {file_path}")
except Exception as e:
logger.error(f"Failed to delete file {file_path}: {str(e)}")
# Clean up empty directories
for root, dirs, files in os.walk(project_dir, topdown=False):
for dir in dirs:
dir_path = os.path.join(root, dir)
try:
if not os.listdir(dir_path):
os.rmdir(dir_path)
logger.info(f"Deleted empty directory: {dir_path}")
except Exception as e:
logger.error(f"Failed to delete directory {dir_path}: {str(e)}")
return {
"status": "success",
"message": f"Cleanup complete, deleted {len(cleaned_files)} files",
"project_id": project_id,
"cleaned_files": cleaned_files,
"older_than_days": older_than_days
}
except Exception as e:
error_msg = f"Error during file cleanup: {str(e)}"
logger.error(error_msg)
return {
"status": "error",
"message": error_msg,
"project_id": project_id
}
def _process_single_file(
file_path: str,
target_dir: str,
original_filename: str
) -> Dict[str, Any]:
"""
Internal method for processing a single file.
Args:
file_path: Source file path
target_dir: Target directory
original_filename: Original filename
Returns:
Processing result dictionary
"""
try:
# Check if file exists
if not os.path.exists(file_path):
return {
"status": "error",
"message": "Source file does not exist",
"file_path": file_path
}
# Get file info
file_size = os.path.getsize(file_path)
file_ext = os.path.splitext(original_filename)[1].lower()
# Different processing based on file type
supported_extensions = ['.txt', '.md', '.csv', '.xlsx', '.zip']
if file_ext not in supported_extensions:
return {
"status": "error",
"message": f"Unsupported file type: {file_ext}",
"file_path": file_path,
"supported_extensions": supported_extensions
}
# Copy file to target directory
target_file_path = os.path.join(target_dir, original_filename)
# If target file already exists, add timestamp
if os.path.exists(target_file_path):
name, ext = os.path.splitext(original_filename)
timestamp = int(time.time())
target_file_path = os.path.join(target_dir, f"{name}_{timestamp}{ext}")
shutil.copy2(file_path, target_file_path)
# Get file preview (if it's a text file)
preview = None
if file_ext in ['.txt', '.md']:
preview = get_document_preview(target_file_path, max_lines=5)
return {
"status": "success",
"message": "File processed successfully",
"original_path": file_path,
"target_path": target_file_path,
"file_size": file_size,
"file_extension": file_ext,
"preview": preview
}
except Exception as e:
return {
"status": "error",
"message": f"Error processing file: {str(e)}",
"file_path": file_path
}
# Periodic task example: clean up files older than 30 days daily at 2 AM
@huey.periodic_task(crontab(hour=2, minute=0))
def daily_cleanup():
"""Daily cleanup task."""
logger.info("Running daily cleanup task")
# Add cleanup logic here
return {"status": "completed", "message": "Daily cleanup task completed"}
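The `daily_cleanup` periodic task above leaves its cleanup logic as a placeholder, while `cleanup_processed_files` shows the age-based rule. The sketch below isolates that rule as a plain function, assuming modification time is the age criterion as in the original; `remove_older_than` is a hypothetical name, and the demo runs against a temporary directory rather than a real project folder.

```python
import os
import tempfile
import time
from typing import List


def remove_older_than(root_dir: str, older_than_days: int) -> List[str]:
    """Delete files under root_dir whose mtime is older than the cutoff; return their paths."""
    cutoff = time.time() - older_than_days * 24 * 3600
    removed = []
    for root, _dirs, files in os.walk(root_dir):
        for name in files:
            path = os.path.join(root, name)
            if os.path.getmtime(path) < cutoff:
                os.remove(path)
                removed.append(path)
    return removed


# Demo: create two files and backdate one of them by 40 days.
workdir = tempfile.mkdtemp()
old_file = os.path.join(workdir, "old.txt")
new_file = os.path.join(workdir, "new.txt")
for path in (old_file, new_file):
    with open(path, "w", encoding="utf-8") as f:
        f.write("x")
forty_days_ago = time.time() - 40 * 24 * 3600
os.utime(old_file, (forty_days_ago, forty_days_ago))

removed = remove_older_than(workdir, older_than_days=30)
```

Note that a rule like this applied to a whole project directory also removes old metadata files such as a processing log, so a real periodic task would likely restrict it to specific subdirectories.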

@ -13,23 +13,6 @@ from .file_utils import (
save_processed_files_log
)
from .dataset_manager import (
download_dataset_files,
generate_dataset_structure,
remove_dataset_directory,
remove_dataset_directory_by_key
)
from .project_manager import (
generate_project_readme,
save_project_readme,
get_project_status,
remove_project,
list_projects,
get_project_stats
)
from .system_optimizer import (
setup_system_optimizations
)
@ -59,11 +42,6 @@ from .api_models import (
ProjectListResponse,
ProjectStatsResponse,
ProjectActionResponse,
QueueTaskRequest,
IncrementalTaskRequest,
QueueTaskResponse,
QueueStatusResponse,
TaskStatusResponse,
create_success_response,
create_error_response,
create_chat_response,
@ -90,20 +68,6 @@ __all__ = [
'load_processed_files_log',
'save_processed_files_log',
# dataset_manager
'download_dataset_files',
'generate_dataset_structure',
'remove_dataset_directory',
'remove_dataset_directory_by_key',
# project_manager
'generate_project_readme',
'save_project_readme',
'get_project_status',
'remove_project',
'list_projects',
'get_project_stats',
# agent_pool
'AgentPool',
'get_agent_pool',
@@ -128,10 +92,6 @@ __all__ = [
'ProjectListResponse',
'ProjectStatsResponse',
'ProjectActionResponse',
'QueueTaskRequest',
'QueueTaskResponse',
'QueueStatusResponse',
'TaskStatusResponse',
'create_success_response',
'create_error_response',
'create_chat_response',


@@ -270,133 +270,6 @@ def create_error_response(message: str, error_type: str = "error", **kwargs) ->
}
class QueueTaskRequest(BaseModel):
"""Queue task request model"""
dataset_id: str
files: Optional[Dict[str, List[str]]] = Field(default=None, description="Files organized by key groups. Each key maps to a list of file paths (supports zip files)")
upload_folder: Optional[Dict[str, str]] = Field(default=None, description="Upload folders organized by group names. Each key maps to a folder name. Example: {'group1': 'my_project1', 'group2': 'my_project2'}")
priority: Optional[int] = Field(default=0, description="Task priority (higher number = higher priority)")
delay: Optional[int] = Field(default=0, description="Delay execution by N seconds")
model_config = ConfigDict(extra='allow')
@field_validator('upload_folder', mode='before')
@classmethod
def validate_upload_folder(cls, v):
"""Validate upload_folder dict format"""
if v is None:
return None
if isinstance(v, dict):
# Validate dict format
for key, value in v.items():
if not isinstance(key, str):
raise ValueError(f"Key in upload_folder dict must be string, got {type(key)}")
if not isinstance(value, str):
raise ValueError(f"Value in upload_folder dict must be string (folder name), got {type(value)} for key '{key}'")
return v
else:
raise ValueError(f"upload_folder must be a dict with group names as keys and folder names as values, got {type(v)}")
@field_validator('files', mode='before')
@classmethod
def validate_files(cls, v):
"""Validate dict format with key-grouped files"""
if v is None:
return None
if isinstance(v, dict):
# Validate dict format
for key, value in v.items():
if not isinstance(key, str):
raise ValueError(f"Key in files dict must be string, got {type(key)}")
if not isinstance(value, list):
raise ValueError(f"Value in files dict must be list, got {type(value)} for key '{key}'")
for item in value:
if not isinstance(item, str):
raise ValueError(f"File paths must be strings, got {type(item)} in key '{key}'")
return v
else:
raise ValueError(f"Files must be a dict with key groups, got {type(v)}")
class IncrementalTaskRequest(BaseModel):
"""Incremental file processing request model"""
dataset_id: str = Field(..., description="Dataset ID for the project")
files_to_add: Optional[Dict[str, List[str]]] = Field(default=None, description="Files to add organized by key groups")
files_to_remove: Optional[Dict[str, List[str]]] = Field(default=None, description="Files to remove organized by key groups")
system_prompt: Optional[str] = None
mcp_settings: Optional[List[Dict]] = None
priority: Optional[int] = Field(default=0, description="Task priority (higher number = higher priority)")
delay: Optional[int] = Field(default=0, description="Delay execution by N seconds")
model_config = ConfigDict(extra='allow')
@field_validator('files_to_add', mode='before')
@classmethod
def validate_files_to_add(cls, v):
"""Validate files_to_add dict format"""
if v is None:
return None
if isinstance(v, dict):
for key, value in v.items():
if not isinstance(key, str):
raise ValueError(f"Key in files_to_add dict must be string, got {type(key)}")
if not isinstance(value, list):
raise ValueError(f"Value in files_to_add dict must be list, got {type(value)} for key '{key}'")
for item in value:
if not isinstance(item, str):
raise ValueError(f"File paths must be strings, got {type(item)} in key '{key}'")
return v
else:
raise ValueError(f"files_to_add must be a dict with key groups, got {type(v)}")
@field_validator('files_to_remove', mode='before')
@classmethod
def validate_files_to_remove(cls, v):
"""Validate files_to_remove dict format"""
if v is None:
return None
if isinstance(v, dict):
for key, value in v.items():
if not isinstance(key, str):
raise ValueError(f"Key in files_to_remove dict must be string, got {type(key)}")
if not isinstance(value, list):
raise ValueError(f"Value in files_to_remove dict must be list, got {type(value)} for key '{key}'")
for item in value:
if not isinstance(item, str):
raise ValueError(f"File paths must be strings, got {type(item)} in key '{key}'")
return v
else:
raise ValueError(f"files_to_remove must be a dict with key groups, got {type(v)}")
class QueueTaskResponse(BaseModel):
"""Queue task response model"""
success: bool
message: str
dataset_id: str
task_id: Optional[str] = None
task_status: Optional[str] = None
estimated_processing_time: Optional[int] = None # seconds
class QueueStatusResponse(BaseModel):
"""Queue status response model"""
success: bool
message: str
queue_stats: Dict[str, Any]
pending_tasks: List[Dict[str, Any]]
class TaskStatusResponse(BaseModel):
"""Task status response model"""
success: bool
message: str
task_id: str
task_status: Optional[str] = None
task_result: Optional[Dict[str, Any]] = None
error: Optional[str] = None
def create_chat_response(
messages: List[Message],
model: str,


@@ -1,439 +0,0 @@
#!/usr/bin/env python3
"""
Data merging functions for combining processed file results.
"""
import os
import pickle
import logging
from typing import Dict, List, Optional, Tuple
import json
# Configure logger
logger = logging.getLogger('app')
# Try to import numpy, but handle if missing
try:
import numpy as np
NUMPY_SUPPORT = True
except ImportError:
logger.warning("NumPy not available, some embedding features may be limited")
NUMPY_SUPPORT = False
def merge_documents_by_group(unique_id: str, group_name: str) -> Dict:
"""Merge all document.txt files in a group into a single document."""
processed_group_dir = os.path.join("projects", "data", unique_id, "processed", group_name)
dataset_group_dir = os.path.join("projects", "data", unique_id, "datasets", group_name)
os.makedirs(dataset_group_dir, exist_ok=True)
merged_document_path = os.path.join(dataset_group_dir, "document.txt")
result = {
"success": False,
"merged_document_path": merged_document_path,
"source_files": [],
"total_pages": 0,
"total_characters": 0,
"error": None
}
try:
# Find all document.txt files in the processed directory
document_files = []
if os.path.exists(processed_group_dir):
for item in os.listdir(processed_group_dir):
item_path = os.path.join(processed_group_dir, item)
if os.path.isdir(item_path):
document_path = os.path.join(item_path, "document.txt")
if os.path.exists(document_path) and os.path.getsize(document_path) > 0:
document_files.append((item, document_path))
if not document_files:
result["error"] = "No document files found to merge"
return result
# Merge all documents with page separators
merged_content = []
total_characters = 0
for filename_stem, document_path in sorted(document_files):
try:
with open(document_path, 'r', encoding='utf-8') as f:
content = f.read().strip()
if content:
merged_content.append(f"# Page {filename_stem}")
merged_content.append(content)
total_characters += len(content)
result["source_files"].append(filename_stem)
except Exception as e:
logger.error(f"Error reading document file {document_path}: {str(e)}")
continue
if merged_content:
# Write merged document
with open(merged_document_path, 'w', encoding='utf-8') as f:
f.write('\n\n'.join(merged_content))
result["total_pages"] = len(document_files)
result["total_characters"] = total_characters
result["success"] = True
else:
result["error"] = "No valid content found in document files"
except Exception as e:
result["error"] = f"Document merging failed: {str(e)}"
logger.error(f"Error merging documents for group {group_name}: {str(e)}")
return result
def merge_paginations_by_group(unique_id: str, group_name: str) -> Dict:
"""Merge all pagination.txt files in a group."""
processed_group_dir = os.path.join("projects", "data", unique_id, "processed", group_name)
dataset_group_dir = os.path.join("projects", "data", unique_id, "datasets", group_name)
os.makedirs(dataset_group_dir, exist_ok=True)
merged_pagination_path = os.path.join(dataset_group_dir, "pagination.txt")
result = {
"success": False,
"merged_pagination_path": merged_pagination_path,
"source_files": [],
"total_lines": 0,
"error": None
}
try:
# Find all pagination.txt files
pagination_files = []
if os.path.exists(processed_group_dir):
for item in os.listdir(processed_group_dir):
item_path = os.path.join(processed_group_dir, item)
if os.path.isdir(item_path):
pagination_path = os.path.join(item_path, "pagination.txt")
if os.path.exists(pagination_path) and os.path.getsize(pagination_path) > 0:
pagination_files.append((item, pagination_path))
if not pagination_files:
result["error"] = "No pagination files found to merge"
return result
# Merge all pagination files
merged_lines = []
for filename_stem, pagination_path in sorted(pagination_files):
try:
with open(pagination_path, 'r', encoding='utf-8') as f:
lines = f.readlines()
for line in lines:
line = line.strip()
if line:
merged_lines.append(line)
result["source_files"].append(filename_stem)
except Exception as e:
logger.error(f"Error reading pagination file {pagination_path}: {str(e)}")
continue
if merged_lines:
# Write merged pagination
with open(merged_pagination_path, 'w', encoding='utf-8') as f:
for line in merged_lines:
f.write(f"{line}\n")
result["total_lines"] = len(merged_lines)
result["success"] = True
else:
result["error"] = "No valid pagination data found"
except Exception as e:
result["error"] = f"Pagination merging failed: {str(e)}"
logger.error(f"Error merging paginations for group {group_name}: {str(e)}")
return result
def merge_embeddings_by_group(unique_id: str, group_name: str) -> Dict:
"""Merge all embedding.pkl files in a group."""
processed_group_dir = os.path.join("projects", "data", unique_id, "processed", group_name)
dataset_group_dir = os.path.join("projects", "data", unique_id, "datasets", group_name)
os.makedirs(dataset_group_dir, exist_ok=True)
merged_embedding_path = os.path.join(dataset_group_dir, "embedding.pkl")
result = {
"success": False,
"merged_embedding_path": merged_embedding_path,
"source_files": [],
"total_chunks": 0,
"total_dimensions": 0,
"error": None
}
try:
# Find all embedding.pkl files
embedding_files = []
if os.path.exists(processed_group_dir):
for item in os.listdir(processed_group_dir):
item_path = os.path.join(processed_group_dir, item)
if os.path.isdir(item_path):
embedding_path = os.path.join(item_path, "embedding.pkl")
if os.path.exists(embedding_path) and os.path.getsize(embedding_path) > 0:
embedding_files.append((item, embedding_path))
if not embedding_files:
result["error"] = "No embedding files found to merge"
return result
# Load and merge all embedding data
all_chunks = []
all_embeddings = [] # Fix: collect all embedding vectors
total_chunks = 0
dimensions = 0
chunking_strategy = 'unknown'
chunking_params = {}
model_path = 'TaylorAI/gte-tiny'
for filename_stem, embedding_path in sorted(embedding_files):
try:
with open(embedding_path, 'rb') as f:
embedding_data = pickle.load(f)
if isinstance(embedding_data, dict) and 'chunks' in embedding_data:
chunks = embedding_data['chunks']
# Get embedding vectors (critical fix)
if 'embeddings' in embedding_data:
embeddings = embedding_data['embeddings']
all_embeddings.append(embeddings)
# Get model metadata from the first file
if 'model_path' in embedding_data:
model_path = embedding_data['model_path']
if 'chunking_strategy' in embedding_data:
chunking_strategy = embedding_data['chunking_strategy']
if 'chunking_params' in embedding_data:
chunking_params = embedding_data['chunking_params']
# Add source file metadata to each chunk
for chunk in chunks:
if isinstance(chunk, dict):
chunk['source_file'] = filename_stem
chunk['source_group'] = group_name
elif isinstance(chunk, str):
# If the chunk is a string, keep it unchanged
pass
all_chunks.extend(chunks)
total_chunks += len(chunks)
result["source_files"].append(filename_stem)
except Exception as e:
logger.error(f"Error loading embedding file {embedding_path}: {str(e)}")
continue
if all_chunks and all_embeddings:
# Merge all embedding vectors
try:
# Try merging tensors with torch
import torch
if all(isinstance(emb, torch.Tensor) for emb in all_embeddings):
merged_embeddings = torch.cat(all_embeddings, dim=0)
dimensions = merged_embeddings.shape[1]
else:
# If the values are not tensors, try converting them to numpy
import numpy as np
if NUMPY_SUPPORT:
np_embeddings = []
for emb in all_embeddings:
if hasattr(emb, 'numpy'):
np_embeddings.append(emb.numpy())
elif isinstance(emb, np.ndarray):
np_embeddings.append(emb)
else:
# If conversion fails, skip this file
logger.warning(f"Warning: Cannot convert embedding to numpy from file {filename_stem}")
continue
if np_embeddings:
merged_embeddings = np.concatenate(np_embeddings, axis=0)
dimensions = merged_embeddings.shape[1]
else:
result["error"] = "No valid embedding tensors could be merged"
return result
else:
result["error"] = "NumPy not available for merging embeddings"
return result
except ImportError:
# If torch is unavailable, try using numpy
if NUMPY_SUPPORT:
import numpy as np
np_embeddings = []
for emb in all_embeddings:
if hasattr(emb, 'numpy'):
np_embeddings.append(emb.numpy())
elif isinstance(emb, np.ndarray):
np_embeddings.append(emb)
else:
logger.warning(f"Warning: Cannot convert embedding to numpy from file {filename_stem}")
continue
if np_embeddings:
merged_embeddings = np.concatenate(np_embeddings, axis=0)
dimensions = merged_embeddings.shape[1]
else:
result["error"] = "No valid embedding tensors could be merged"
return result
else:
result["error"] = "Neither torch nor numpy available for merging embeddings"
return result
except Exception as e:
result["error"] = f"Failed to merge embedding tensors: {str(e)}"
logger.error(f"Error merging embedding tensors: {str(e)}")
return result
# Create merged embedding data structure
merged_embedding_data = {
'chunks': all_chunks,
'embeddings': merged_embeddings, # Critical fix: include the embeddings key
'total_chunks': total_chunks,
'dimensions': dimensions,
'source_files': result["source_files"],
'group_name': group_name,
'merged_at': str(__import__('time').time()),
'chunking_strategy': chunking_strategy,
'chunking_params': chunking_params,
'model_path': model_path
}
# Save merged embeddings
with open(merged_embedding_path, 'wb') as f:
pickle.dump(merged_embedding_data, f)
result["total_chunks"] = total_chunks
result["total_dimensions"] = dimensions
result["success"] = True
else:
result["error"] = "No valid embedding data found"
except Exception as e:
result["error"] = f"Embedding merging failed: {str(e)}"
logger.error(f"Error merging embeddings for group {group_name}: {str(e)}")
return result
def merge_all_data_by_group(unique_id: str, group_name: str) -> Dict:
"""Merge documents, paginations, and embeddings for a group."""
merge_results = {
"group_name": group_name,
"unique_id": unique_id,
"success": True,
"document_merge": None,
"pagination_merge": None,
"embedding_merge": None,
"errors": []
}
# Merge documents
document_result = merge_documents_by_group(unique_id, group_name)
merge_results["document_merge"] = document_result
if not document_result["success"]:
merge_results["success"] = False
merge_results["errors"].append(f"Document merge failed: {document_result['error']}")
# Merge paginations
pagination_result = merge_paginations_by_group(unique_id, group_name)
merge_results["pagination_merge"] = pagination_result
if not pagination_result["success"]:
merge_results["success"] = False
merge_results["errors"].append(f"Pagination merge failed: {pagination_result['error']}")
# Merge embeddings
embedding_result = merge_embeddings_by_group(unique_id, group_name)
merge_results["embedding_merge"] = embedding_result
if not embedding_result["success"]:
merge_results["success"] = False
merge_results["errors"].append(f"Embedding merge failed: {embedding_result['error']}")
return merge_results
def get_group_merge_status(unique_id: str, group_name: str) -> Dict:
"""Get the status of merged data for a group."""
dataset_group_dir = os.path.join("projects", "data", unique_id, "datasets", group_name)
status = {
"group_name": group_name,
"unique_id": unique_id,
"dataset_dir_exists": os.path.exists(dataset_group_dir),
"document_exists": False,
"document_size": 0,
"pagination_exists": False,
"pagination_size": 0,
"embedding_exists": False,
"embedding_size": 0,
"merge_complete": False
}
if os.path.exists(dataset_group_dir):
document_path = os.path.join(dataset_group_dir, "document.txt")
pagination_path = os.path.join(dataset_group_dir, "pagination.txt")
embedding_path = os.path.join(dataset_group_dir, "embedding.pkl")
if os.path.exists(document_path):
status["document_exists"] = True
status["document_size"] = os.path.getsize(document_path)
if os.path.exists(pagination_path):
status["pagination_exists"] = True
status["pagination_size"] = os.path.getsize(pagination_path)
if os.path.exists(embedding_path):
status["embedding_exists"] = True
status["embedding_size"] = os.path.getsize(embedding_path)
# Check if all files exist and are not empty
if (status["document_exists"] and status["document_size"] > 0 and
status["pagination_exists"] and status["pagination_size"] > 0 and
status["embedding_exists"] and status["embedding_size"] > 0):
status["merge_complete"] = True
return status
def cleanup_dataset_group(unique_id: str, group_name: str) -> bool:
"""Clean up merged dataset files for a group."""
dataset_group_dir = os.path.join("projects", "data", unique_id, "datasets", group_name)
try:
if os.path.exists(dataset_group_dir):
import shutil
shutil.rmtree(dataset_group_dir)
logger.info(f"Cleaned up dataset group: {group_name}")
return True
else:
return True # Nothing to clean up
except Exception as e:
logger.error(f"Error cleaning up dataset group {group_name}: {str(e)}")
return False


@@ -1,297 +0,0 @@
#!/usr/bin/env python3
"""
Dataset management functions for organizing and processing datasets.
New implementation with per-file processing and group merging.
"""
import os
import json
import logging
from typing import Dict, List
# Configure logger
logger = logging.getLogger('app')
# Import new modules
from utils.file_manager import (
ensure_directories, sync_files_to_group, cleanup_orphaned_files,
get_group_files_list
)
from utils.single_file_processor import (
process_single_file, check_file_already_processed
)
from utils.data_merger import (
merge_all_data_by_group, cleanup_dataset_group
)
async def download_dataset_files(unique_id: str, files: Dict[str, List[str]], incremental_mode: bool = False) -> Dict[str, List[str]]:
"""
Process dataset files with new architecture:
1. Sync files to group directories
2. Process each file individually
3. Merge results by group
4. Clean up orphaned files (only in non-incremental mode)
Args:
unique_id: Project ID
files: Dictionary of files to process, grouped by key
incremental_mode: If True, preserve existing files and only process new ones
"""
if not files:
return {}
logger.info(f"Starting {'incremental' if incremental_mode else 'full'} file processing for project: {unique_id}")
# Ensure project directories exist
ensure_directories(unique_id)
# Step 1: Sync files to group directories
logger.info("Step 1: Syncing files to group directories...")
synced_files, failed_files = sync_files_to_group(unique_id, files, incremental_mode)
# Step 2: Detect changes and cleanup orphaned files (only in non-incremental mode)
from utils.file_manager import detect_file_changes
changes = detect_file_changes(unique_id, files, incremental_mode)
# Only cleanup orphaned files in non-incremental mode or when files are explicitly removed
if not incremental_mode and any(changes["removed"].values()):
logger.info("Step 2: Cleaning up orphaned files...")
removed_files = cleanup_orphaned_files(unique_id, changes)
logger.info(f"Removed orphaned files: {removed_files}")
elif incremental_mode:
logger.info("Step 2: Skipping cleanup in incremental mode to preserve existing files")
# Step 3: Process individual files
logger.info("Step 3: Processing individual files...")
processed_files_by_group = {}
processing_results = {}
for group_name, file_list in files.items():
processed_files_by_group[group_name] = []
processing_results[group_name] = []
for file_path in file_list:
filename = os.path.basename(file_path)
# Get local file path
local_path = os.path.join("projects", "data", unique_id, "files", group_name, filename)
# Skip if file doesn't exist (might be remote file that failed to download)
if not os.path.exists(local_path) and not file_path.startswith(('http://', 'https://')):
logger.warning(f"Skipping non-existent file: {filename}")
continue
# Check if already processed
if check_file_already_processed(unique_id, group_name, filename):
logger.info(f"Skipping already processed file: {filename}")
processed_files_by_group[group_name].append(filename)
processing_results[group_name].append({
"filename": filename,
"status": "existing"
})
continue
# Process the file
logger.info(f"Processing file: {filename} (group: {group_name})")
result = await process_single_file(unique_id, group_name, filename, file_path, local_path)
processing_results[group_name].append(result)
if result["success"]:
processed_files_by_group[group_name].append(filename)
logger.info(f" Successfully processed {filename}")
else:
logger.error(f" Failed to process {filename}: {result['error']}")
# Step 4: Merge results by group
logger.info("Step 4: Merging results by group...")
merge_results = {}
for group_name in processed_files_by_group.keys():
# Get all files in the group (including existing ones)
group_files = get_group_files_list(unique_id, group_name)
if group_files:
logger.info(f"Merging group: {group_name} with {len(group_files)} files")
merge_result = merge_all_data_by_group(unique_id, group_name)
merge_results[group_name] = merge_result
if merge_result["success"]:
logger.info(f" Successfully merged group {group_name}")
else:
logger.error(f" Failed to merge group {group_name}: {merge_result['errors']}")
# Step 5: Save processing log
logger.info("Step 5: Saving processing log...")
await save_processing_log(unique_id, files, synced_files, processing_results, merge_results)
logger.info(f"File processing completed for project: {unique_id}")
return processed_files_by_group
async def save_processing_log(
unique_id: str,
requested_files: Dict[str, List[str]],
synced_files: Dict,
processing_results: Dict,
merge_results: Dict
):
"""Save comprehensive processing log."""
log_data = {
"unique_id": unique_id,
"timestamp": str(os.path.getmtime("projects") if os.path.exists("projects") else 0),
"requested_files": requested_files,
"synced_files": synced_files,
"processing_results": processing_results,
"merge_results": merge_results,
"summary": {
"total_groups": len(requested_files),
"total_files_requested": sum(len(files) for files in requested_files.values()),
"total_files_processed": sum(
len([r for r in results if r.get("success", False)])
for results in processing_results.values()
),
"total_groups_merged": len([r for r in merge_results.values() if r.get("success", False)])
}
}
log_file_path = os.path.join("projects", "data", unique_id, "processing_log.json")
try:
with open(log_file_path, 'w', encoding='utf-8') as f:
json.dump(log_data, f, ensure_ascii=False, indent=2)
logger.info(f"Processing log saved to: {log_file_path}")
except Exception as e:
logger.error(f"Error saving processing log: {str(e)}")
def generate_dataset_structure(unique_id: str) -> str:
"""Generate a string representation of the dataset structure"""
project_dir = os.path.join("projects", "data", unique_id)
structure = []
def add_directory_contents(dir_path: str, prefix: str = ""):
try:
if not os.path.exists(dir_path):
structure.append(f"{prefix}└── (not found)")
return
items = sorted(os.listdir(dir_path))
for i, item in enumerate(items):
item_path = os.path.join(dir_path, item)
is_last = i == len(items) - 1
current_prefix = "└── " if is_last else "├── "
structure.append(f"{prefix}{current_prefix}{item}")
if os.path.isdir(item_path):
next_prefix = prefix + ("    " if is_last else "│   ")
add_directory_contents(item_path, next_prefix)
except Exception as e:
structure.append(f"{prefix}└── Error: {str(e)}")
# Add files directory structure
files_dir = os.path.join(project_dir, "files")
structure.append("files/")
add_directory_contents(files_dir, "")
# Add processed directory structure
processed_dir = os.path.join(project_dir, "processed")
structure.append("\nprocessed/")
add_directory_contents(processed_dir, "")
# Add dataset directory structure
dataset_dir = os.path.join(project_dir, "datasets")
structure.append("\ndataset/")
add_directory_contents(dataset_dir, "")
return "\n".join(structure)
def get_processing_status(unique_id: str) -> Dict:
"""Get comprehensive processing status for a project."""
project_dir = os.path.join("projects", "data", unique_id)
if not os.path.exists(project_dir):
return {
"project_exists": False,
"unique_id": unique_id
}
status = {
"project_exists": True,
"unique_id": unique_id,
"directories": {
"files": os.path.exists(os.path.join(project_dir, "files")),
"processed": os.path.exists(os.path.join(project_dir, "processed")),
"dataset": os.path.exists(os.path.join(project_dir, "datasets"))
},
"groups": {},
"processing_log_exists": os.path.exists(os.path.join(project_dir, "processing_log.json"))
}
# Check each group's status
files_dir = os.path.join(project_dir, "files")
if os.path.exists(files_dir):
for group_name in os.listdir(files_dir):
group_path = os.path.join(files_dir, group_name)
if os.path.isdir(group_path):
status["groups"][group_name] = {
"files_count": len([
f for f in os.listdir(group_path)
if os.path.isfile(os.path.join(group_path, f))
]),
"merge_status": "pending"
}
# Check merge status for each group
dataset_dir = os.path.join(project_dir, "datasets")
if os.path.exists(dataset_dir):
for group_name in os.listdir(dataset_dir):
group_path = os.path.join(dataset_dir, group_name)
if os.path.isdir(group_path):
if group_name in status["groups"]:
# Check if merge is complete
document_path = os.path.join(group_path, "document.txt")
pagination_path = os.path.join(group_path, "pagination.txt")
embedding_path = os.path.join(group_path, "embedding.pkl")
if (os.path.exists(document_path) and os.path.exists(pagination_path) and
os.path.exists(embedding_path)):
status["groups"][group_name]["merge_status"] = "completed"
else:
status["groups"][group_name]["merge_status"] = "incomplete"
else:
status["groups"][group_name] = {
"files_count": 0,
"merge_status": "completed"
}
return status
def remove_dataset_directory(unique_id: str, filename_without_ext: str):
"""Remove a specific dataset directory (deprecated - use new structure)"""
# This function is kept for compatibility but delegates to new structure
dataset_path = os.path.join("projects", "data", unique_id, "processed", filename_without_ext)
if os.path.exists(dataset_path):
import shutil
shutil.rmtree(dataset_path)
def remove_dataset_directory_by_key(unique_id: str, key: str):
"""Remove dataset directory by key (group name)"""
# Remove files directory
files_group_path = os.path.join("projects", "data", unique_id, "files", key)
if os.path.exists(files_group_path):
import shutil
shutil.rmtree(files_group_path)
# Remove processed directory
processed_group_path = os.path.join("projects", "data", unique_id, "processed", key)
if os.path.exists(processed_group_path):
import shutil
shutil.rmtree(processed_group_path)
# Remove dataset directory
cleanup_dataset_group(unique_id, key)


@@ -1,343 +0,0 @@
#!/usr/bin/env python3
"""
Project management functions for handling projects, README generation, and status tracking.
"""
import os
import json
import logging
from typing import Dict, List, Optional
from pathlib import Path
# Configure logger
logger = logging.getLogger('app')
from utils.file_utils import get_document_preview, load_processed_files_log
def generate_directory_tree(project_dir: str, unique_id: str, max_depth: int = 3) -> str:
"""Generate dataset directory tree structure for the project"""
def _build_tree(path: str, prefix: str = "", is_last: bool = True, depth: int = 0) -> List[str]:
if depth > max_depth:
return []
lines = []
try:
entries = sorted(os.listdir(path))
# Separate directories and files
dirs = [e for e in entries if os.path.isdir(os.path.join(path, e)) and not e.startswith('.')]
files = [e for e in entries if os.path.isfile(os.path.join(path, e)) and not e.startswith('.')]
entries = dirs + files
for i, entry in enumerate(entries):
entry_path = os.path.join(path, entry)
is_dir = os.path.isdir(entry_path)
is_last_entry = i == len(entries) - 1
# Choose the appropriate tree symbols
if is_last_entry:
connector = "└── "
new_prefix = prefix + "    "
else:
connector = "├── "
new_prefix = prefix + "│   "
# Add entry line
line = prefix + connector + entry
if is_dir:
line += "/"
lines.append(line)
# Recursively add subdirectories
if is_dir and depth < max_depth:
sub_lines = _build_tree(entry_path, new_prefix, is_last_entry, depth + 1)
lines.extend(sub_lines)
except PermissionError:
lines.append(prefix + "└── [Permission Denied]")
except Exception as e:
lines.append(prefix + "└── [Error: " + str(e) + "]")
return lines
# Start building tree from dataset directory
dataset_dir = os.path.join(project_dir, "datasets")
tree_lines = []
if not os.path.exists(dataset_dir):
return "└── [No dataset directory found]"
try:
entries = sorted(os.listdir(dataset_dir))
dirs = [e for e in entries if os.path.isdir(os.path.join(dataset_dir, e)) and not e.startswith('.')]
files = [e for e in entries if os.path.isfile(os.path.join(dataset_dir, e)) and not e.startswith('.')]
entries = dirs + files
if not entries:
tree_lines.append("└── [Empty dataset directory]")
else:
for i, entry in enumerate(entries):
entry_path = os.path.join(dataset_dir, entry)
is_dir = os.path.isdir(entry_path)
is_last_entry = i == len(entries) - 1
if is_last_entry:
connector = "└── "
prefix = "    "
else:
connector = "├── "
prefix = "│   "
line = connector + entry
if is_dir:
line += "/"
tree_lines.append(line)
# Recursively add subdirectories
if is_dir:
sub_lines = _build_tree(entry_path, prefix, is_last_entry, 1)
tree_lines.extend(sub_lines)
except Exception as e:
tree_lines.append(f"└── [Error generating tree: {str(e)}]")
return "\n".join(tree_lines)
def generate_project_readme(unique_id: str) -> str:
"""Generate README.md content for a project"""
project_dir = os.path.join("projects", "data", unique_id)
readme_content = f"""# Project: {unique_id}
## Project Overview
This project contains processed documents and their associated embeddings for semantic search.
## Directory Structure
"""
# Generate directory tree
readme_content += "```\n"
readme_content += generate_directory_tree(project_dir, unique_id)
readme_content += "\n```\n\n"
readme_content += """## Dataset Structure
"""
dataset_dir = os.path.join(project_dir, "datasets")
if not os.path.exists(dataset_dir):
readme_content += "No dataset files available.\n"
else:
# Get all document directories
doc_dirs = []
try:
for item in sorted(os.listdir(dataset_dir)):
item_path = os.path.join(dataset_dir, item)
if os.path.isdir(item_path):
doc_dirs.append(item)
except Exception as e:
logger.error(f"Error listing dataset directories: {str(e)}")
if not doc_dirs:
readme_content += "No document directories found.\n"
else:
for doc_dir in doc_dirs:
doc_path = os.path.join(dataset_dir, doc_dir)
document_file = os.path.join(doc_path, "document.txt")
pagination_file = os.path.join(doc_path, "pagination.txt")
embeddings_file = os.path.join(doc_path, "embedding.pkl")
readme_content += f"### {doc_dir}\n\n"
readme_content += f"**Files:**\n"
readme_content += f"- `{doc_dir}/document.txt`"
if os.path.exists(document_file):
readme_content += ""
readme_content += "\n"
readme_content += f"- `{doc_dir}/pagination.txt`"
if os.path.exists(pagination_file):
readme_content += ""
readme_content += "\n"
readme_content += f"- `{doc_dir}/embedding.pkl`"
if os.path.exists(embeddings_file):
readme_content += ""
readme_content += "\n\n"
# Add document preview
if os.path.exists(document_file):
readme_content += f"**Content Preview (first 10 lines):**\n\n```\n"
preview = get_document_preview(document_file, 10)
readme_content += preview
readme_content += "\n```\n\n"
else:
readme_content += f"**Content Preview:** Not available\n\n"
from datetime import datetime
generated_at = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
readme_content += f"""---
*Generated on {generated_at}*
"""
return readme_content
def save_project_readme(unique_id: str):
"""Save README.md for a project"""
readme_content = generate_project_readme(unique_id)
readme_path = os.path.join("projects", "data", unique_id, "README.md")
try:
os.makedirs(os.path.dirname(readme_path), exist_ok=True)
with open(readme_path, 'w', encoding='utf-8') as f:
f.write(readme_content)
logger.info(f"Generated README.md for project {unique_id}")
return readme_path
except Exception as e:
logger.error(f"Error generating README for project {unique_id}: {str(e)}")
return None
def get_project_status(unique_id: str) -> Dict:
"""Get comprehensive status of a project"""
project_dir = os.path.join("projects", "data", unique_id)
project_exists = os.path.exists(project_dir)
if not project_exists:
return {
"unique_id": unique_id,
"project_exists": False,
"error": "Project not found"
}
# Get processed log
processed_log = load_processed_files_log(unique_id)
# Collect document.txt files
document_files = []
dataset_dir = os.path.join(project_dir, "datasets")
if os.path.exists(dataset_dir):
for root, dirs, files in os.walk(dataset_dir):
for file in files:
if file == "document.txt":
document_files.append(os.path.join(root, file))
# Check system prompt and MCP settings
system_prompt_file = os.path.join(project_dir, "system_prompt.txt")
mcp_settings_file = os.path.join(project_dir, "mcp_settings.json")
status = {
"unique_id": unique_id,
"project_exists": True,
"project_path": project_dir,
"processed_files_count": len(processed_log),
"processed_files": processed_log,
"document_files_count": len(document_files),
"document_files": document_files,
"has_system_prompt": os.path.exists(system_prompt_file),
"has_mcp_settings": os.path.exists(mcp_settings_file),
"readme_exists": os.path.exists(os.path.join(project_dir, "README.md")),
"log_file_exists": os.path.exists(os.path.join(project_dir, "processed_files.json"))
}
# Add dataset structure
try:
from utils.dataset_manager import generate_dataset_structure
status["dataset_structure"] = generate_dataset_structure(unique_id)
except Exception as e:
status["dataset_structure"] = f"Error generating structure: {str(e)}"
return status
def remove_project(unique_id: str) -> bool:
"""Remove entire project directory"""
project_dir = os.path.join("projects", "data", unique_id)
try:
if os.path.exists(project_dir):
import shutil
shutil.rmtree(project_dir)
logger.info(f"Removed project directory: {project_dir}")
return True
else:
logger.warning(f"Project directory not found: {project_dir}")
return False
except Exception as e:
logger.error(f"Error removing project {unique_id}: {str(e)}")
return False
def list_projects() -> List[str]:
"""List all existing project IDs"""
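# Projects live under projects/data/<unique_id>, matching the other helpers in this module
projects_dir = os.path.join("projects", "data")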
if not os.path.exists(projects_dir):
return []
try:
return [item for item in os.listdir(projects_dir)
if os.path.isdir(os.path.join(projects_dir, item))]
except Exception as e:
logger.error(f"Error listing projects: {str(e)}")
return []
def get_project_stats(unique_id: str) -> Dict:
"""Get statistics for a specific project"""
status = get_project_status(unique_id)
if not status["project_exists"]:
return status
stats = {
"unique_id": unique_id,
"total_processed_files": status["processed_files_count"],
"total_document_files": status["document_files_count"],
"has_system_prompt": status["has_system_prompt"],
"has_mcp_settings": status["has_mcp_settings"],
"has_readme": status["readme_exists"]
}
# Calculate file sizes
total_size = 0
document_sizes = []
for doc_file in status["document_files"]:
try:
size = os.path.getsize(doc_file)
document_sizes.append({
"file": doc_file,
"size": size,
"size_mb": round(size / (1024 * 1024), 2)
})
total_size += size
except Exception:
pass
stats["total_document_size"] = total_size
stats["total_document_size_mb"] = round(total_size / (1024 * 1024), 2)
stats["document_files_detail"] = document_sizes
# Check embeddings files
embedding_files = []
dataset_dir = os.path.join("projects", "data", unique_id, "datasets")
if os.path.exists(dataset_dir):
for root, dirs, files in os.walk(dataset_dir):
for file in files:
if file == "embedding.pkl":
file_path = os.path.join(root, file)
try:
size = os.path.getsize(file_path)
embedding_files.append({
"file": file_path,
"size": size,
"size_mb": round(size / (1024 * 1024), 2)
})
except Exception:
pass
stats["embedding_files_count"] = len(embedding_files)
stats["embedding_files_detail"] = embedding_files
return stats
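The size accounting in `get_project_stats` above walks the dataset directory twice with the same pattern, once for `document.txt` and once for `embedding.pkl`. A minimal sketch of that pattern as a single helper is shown below. The name `collect_file_sizes` and its return keys are hypothetical and not part of the original module.

```python
import os
from typing import Dict, List


def collect_file_sizes(root_dir: str, target_name: str) -> Dict:
    """Walk root_dir and report the size of every file named target_name.

    Hypothetical helper: the name and return shape are illustrative only.
    """
    details: List[Dict] = []
    total = 0
    for current_dir, _dirs, files in os.walk(root_dir):
        for name in files:
            if name != target_name:
                continue
            path = os.path.join(current_dir, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                # The file vanished or is unreadable; skip it rather than abort.
                continue
            details.append({
                "file": path,
                "size": size,
                "size_mb": round(size / (1024 * 1024), 2),
            })
            total += size
    return {
        "count": len(details),
        "total_size": total,
        "total_size_mb": round(total / (1024 * 1024), 2),
        "files": details,
    }
```

Calling it once per file name (for example `"document.txt"` and `"embedding.pkl"`) would cover both loops; a missing `root_dir` simply yields a count of zero because `os.walk` produces nothing for a nonexistent path.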


@@ -30,7 +30,12 @@ PROJECT_NAME = os.getenv("PROJECT_NAME", "support")
TOKENIZERS_PARALLELISM = os.getenv("TOKENIZERS_PARALLELISM", "true")
# Embedding Model Settings
SENTENCE_TRANSFORMER_MODEL = os.getenv("SENTENCE_TRANSFORMER_MODEL", "TaylorAI/gte-tiny")
# No hardcoded fallback: the API key must be supplied through the environment
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "")
EMBEDDING_BASE_URL = os.getenv("EMBEDDING_BASE_URL", "https://one-dev.felo.me/v1")
EMBEDDING_API_KEY = os.getenv("EMBEDDING_API_KEY", OPENAI_API_KEY)
EMBEDDING_MODEL_NAME = os.getenv("EMBEDDING_MODEL_NAME", "text-embedding-3-small")
EMBEDDING_DIMENSIONS = int(os.getenv("EMBEDDING_DIMENSIONS", "384"))
EMBEDDING_TIMEOUT = int(os.getenv("EMBEDDING_TIMEOUT", "30"))
# Tool Output Length Control Settings
TOOL_OUTPUT_MAX_LENGTH = SUMMARIZATION_MAX_TOKENS
@@ -72,6 +77,15 @@ CHECKPOINT_CLEANUP_INACTIVE_DAYS = int(os.getenv("CHECKPOINT_CLEANUP_INACTIVE_DA
CHECKPOINT_CLEANUP_INTERVAL_HOURS = int(os.getenv("CHECKPOINT_CLEANUP_INTERVAL_HOURS", "24"))
# ============================================================
# Redis Configuration (Huey task queue backend)
# ============================================================
# Redis connection URL.
# Format: redis://[:password]@host:port/db
REDIS_URL = os.getenv("REDIS_URL", "redis://localhost:6379/1")
# ============================================================
# Mem0 long-term memory configuration
# ============================================================
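The `REDIS_URL` format noted in the hunk above (`redis://[:password]@host:port/db`) can be checked at startup before it is handed to a client. The following is a minimal sketch using only the standard library. The function name `parse_redis_url` and the fallback values for a missing host, port, or db are assumptions and do not come from the settings module.

```python
from urllib.parse import urlsplit


def parse_redis_url(url: str) -> dict:
    """Split a redis:// URL into host, port, db and password.

    Hypothetical helper for illustration; defaults are assumptions.
    """
    parts = urlsplit(url)
    if parts.scheme not in ("redis", "rediss"):
        raise ValueError(f"Unsupported scheme in REDIS_URL: {parts.scheme!r}")
    db_text = parts.path.lstrip("/")
    return {
        "host": parts.hostname or "localhost",
        "port": parts.port or 6379,
        # An empty path is treated as database 0 in this sketch.
        "db": int(db_text) if db_text else 0,
        "password": parts.password,
    }


print(parse_redis_url("redis://localhost:6379/1"))
```

A malformed database segment (for example `redis://localhost:6379/abc`) raises `ValueError` from `int()`, which surfaces a bad configuration early instead of at first use.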


@@ -1,301 +0,0 @@
#!/usr/bin/env python3
"""
Single file processing functions for handling individual files.
"""
import os
import tempfile
import zipfile
import logging
from typing import Dict, List, Tuple, Optional
from pathlib import Path
# Configure logger
logger = logging.getLogger('app')
from utils.file_utils import download_file
# Try to import excel/csv processor, but handle if dependencies are missing
try:
from utils.excel_csv_processor import (
is_excel_file, is_csv_file, process_excel_file, process_csv_file
)
EXCEL_CSV_SUPPORT = True
except ImportError as e:
logger.warning(f"Excel/CSV processing not available: {e}")
EXCEL_CSV_SUPPORT = False
# Fallback functions
def is_excel_file(file_path):
return file_path.lower().endswith(('.xlsx', '.xls'))
def is_csv_file(file_path):
return file_path.lower().endswith('.csv')
def process_excel_file(file_path):
return "", []
def process_csv_file(file_path):
return "", []
async def process_single_file(
unique_id: str,
group_name: str,
filename: str,
original_path: str,
local_path: str
) -> Dict:
"""
Process a single file and generate document.txt, pagination.txt, and embedding.pkl.
Returns:
Dict with processing results and file paths
"""
# Create output directory for this file
filename_stem = Path(filename).stem
output_dir = os.path.join("projects", "data", unique_id, "processed", group_name, filename_stem)
os.makedirs(output_dir, exist_ok=True)
result = {
"success": False,
"filename": filename,
"group": group_name,
"output_dir": output_dir,
"document_path": os.path.join(output_dir, "document.txt"),
"pagination_path": os.path.join(output_dir, "pagination.txt"),
"embedding_path": os.path.join(output_dir, "embedding.pkl"),
"error": None,
"content_size": 0,
"pagination_lines": 0,
"embedding_chunks": 0
}
try:
# Download file if it's remote and not yet downloaded
if original_path.startswith(('http://', 'https://')):
if not os.path.exists(local_path):
logger.info(f"Downloading {original_path} -> {local_path}")
success = await download_file(original_path, local_path)
if not success:
result["error"] = "Failed to download file"
return result
# Extract content from file
content, pagination_lines = await extract_file_content(local_path, filename)
if not content or not content.strip():
result["error"] = "No content extracted from file"
return result
# Write document.txt
with open(result["document_path"], 'w', encoding='utf-8') as f:
f.write(content)
result["content_size"] = len(content)
# Write pagination.txt
if pagination_lines:
with open(result["pagination_path"], 'w', encoding='utf-8') as f:
for line in pagination_lines:
if line.strip():
f.write(f"{line}\n")
result["pagination_lines"] = len(pagination_lines)
else:
# Generate pagination from text content
pagination_lines = generate_pagination_from_text(result["document_path"],
result["pagination_path"])
result["pagination_lines"] = len(pagination_lines)
# Generate embeddings
try:
embedding_chunks = await generate_embeddings_for_file(
result["document_path"], result["embedding_path"]
)
result["embedding_chunks"] = len(embedding_chunks) if embedding_chunks else 0
result["success"] = True
except Exception as e:
result["error"] = f"Embedding generation failed: {str(e)}"
logger.error(f"Failed to generate embeddings for {filename}: {str(e)}")
except Exception as e:
result["error"] = f"File processing failed: {str(e)}"
logger.error(f"Error processing file {filename}: {str(e)}")
return result
async def extract_file_content(file_path: str, filename: str) -> Tuple[str, List[str]]:
"""Extract content from various file formats."""
# Handle zip files
if filename.lower().endswith('.zip'):
return await extract_from_zip(file_path, filename)
# Handle Excel files
elif is_excel_file(file_path):
return await extract_from_excel(file_path, filename)
# Handle CSV files
elif is_csv_file(file_path):
return await extract_from_csv(file_path, filename)
# Handle text files
else:
return await extract_from_text(file_path, filename)
async def extract_from_zip(zip_path: str, filename: str) -> Tuple[str, List[str]]:
    """Extract content from zip file."""
    import shutil
    content_parts = []
    pagination_lines = []
    temp_dir = None
    try:
        # Extract to temporary directory
        temp_dir = tempfile.mkdtemp(prefix=f"extract_{Path(filename).stem}_")
        with zipfile.ZipFile(zip_path, 'r') as zip_ref:
            zip_ref.extractall(temp_dir)
        # Process extracted files
        for root, dirs, files in os.walk(temp_dir):
            for file in files:
                if file.lower().endswith(('.txt', '.md', '.xlsx', '.xls', '.csv')):
                    file_path = os.path.join(root, file)
                    try:
                        file_content, file_pagination = await extract_file_content(file_path, file)
                        if file_content:
                            content_parts.append(f"# Page {file}")
                            content_parts.append(file_content)
                            pagination_lines.extend(file_pagination)
                    except Exception as e:
                        logger.error(f"Error processing extracted file {file}: {str(e)}")
    except Exception as e:
        logger.error(f"Error extracting zip file {filename}: {str(e)}")
        return "", []
    finally:
        # Clean up the temporary directory even when extraction or processing fails
        if temp_dir:
            shutil.rmtree(temp_dir, ignore_errors=True)
    return '\n\n'.join(content_parts), pagination_lines
async def extract_from_excel(file_path: str, filename: str) -> Tuple[str, List[str]]:
"""Extract content from Excel file."""
try:
document_content, pagination_lines = process_excel_file(file_path)
if document_content:
content = f"# Page {filename}\n{document_content}"
return content, pagination_lines
else:
return "", []
except Exception as e:
logger.error(f"Error processing Excel file {filename}: {str(e)}")
return "", []
async def extract_from_csv(file_path: str, filename: str) -> Tuple[str, List[str]]:
"""Extract content from CSV file."""
try:
document_content, pagination_lines = process_csv_file(file_path)
if document_content:
content = f"# Page {filename}\n{document_content}"
return content, pagination_lines
else:
return "", []
except Exception as e:
logger.error(f"Error processing CSV file {filename}: {str(e)}")
return "", []
async def extract_from_text(file_path: str, filename: str) -> Tuple[str, List[str]]:
"""Extract content from text file."""
try:
with open(file_path, 'r', encoding='utf-8') as f:
content = f.read().strip()
if content:
return content, []
else:
return "", []
except Exception as e:
logger.error(f"Error reading text file {filename}: {str(e)}")
return "", []
def generate_pagination_from_text(document_path: str, pagination_path: str) -> List[str]:
"""Generate pagination from text document."""
try:
# Import embedding module for pagination
import sys
sys.path.append(os.path.join(os.path.dirname(__file__), '..', 'embedding'))
from embedding import split_document_by_pages
pages = split_document_by_pages(str(document_path), str(pagination_path))
# Return pagination lines
pagination_lines = []
with open(pagination_path, 'r', encoding='utf-8') as f:
for line in f:
if line.strip():
pagination_lines.append(line.strip())
return pagination_lines
except Exception as e:
logger.error(f"Error generating pagination from text: {str(e)}")
return []
async def generate_embeddings_for_file(document_path: str, embedding_path: str) -> Optional[List]:
"""Generate embeddings for a document."""
try:
# Import embedding module
import sys
sys.path.append(os.path.join(os.path.dirname(__file__), '..', 'embedding'))
from embedding import embed_document
# Generate embeddings using paragraph chunking
embedding_data = embed_document(
str(document_path),
str(embedding_path),
chunking_strategy='paragraph'
)
if embedding_data and 'chunks' in embedding_data:
return embedding_data['chunks']
else:
return None
except Exception as e:
logger.error(f"Error generating embeddings: {str(e)}")
return None
def check_file_already_processed(unique_id: str, group_name: str, filename: str) -> bool:
"""Check if a file has already been processed."""
filename_stem = Path(filename).stem
output_dir = os.path.join("projects", "data", unique_id, "processed", group_name, filename_stem)
document_path = os.path.join(output_dir, "document.txt")
pagination_path = os.path.join(output_dir, "pagination.txt")
embedding_path = os.path.join(output_dir, "embedding.pkl")
# Check if all files exist and are not empty
if (os.path.exists(document_path) and os.path.exists(pagination_path) and
os.path.exists(embedding_path)):
if (os.path.getsize(document_path) > 0 and os.path.getsize(pagination_path) > 0 and
os.path.getsize(embedding_path) > 0):
return True
return False
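The zip handling in `extract_from_zip` above extracts the whole archive and then filters by extension. A more defensive variant, sketched below under stated assumptions, extracts only the members it intends to read, skips names that would resolve outside the extraction root, and relies on `tempfile.TemporaryDirectory` for cleanup. The name `read_text_members` and the `TEXT_SUFFIXES` tuple are hypothetical, and the sketch covers only plain-text members, not the Excel/CSV branches.

```python
import os
import tempfile
import zipfile
from typing import List, Tuple

# Assumption for this sketch: only plain-text members are read.
TEXT_SUFFIXES = (".txt", ".md")


def read_text_members(zip_path: str) -> List[Tuple[str, str]]:
    """Return (member name, text) pairs for text files inside a zip archive."""
    results: List[Tuple[str, str]] = []
    # The directory is removed on exit, including when an exception is raised.
    with tempfile.TemporaryDirectory(prefix="extract_") as temp_dir:
        root = os.path.realpath(temp_dir)
        with zipfile.ZipFile(zip_path) as archive:
            for member in archive.infolist():
                if member.is_dir():
                    continue
                if not member.filename.lower().endswith(TEXT_SUFFIXES):
                    continue
                target = os.path.realpath(os.path.join(root, member.filename))
                # Skip names such as "../x.txt" that would land outside root.
                if os.path.commonpath([root, target]) != root:
                    continue
                archive.extract(member, root)
                with open(target, encoding="utf-8") as handle:
                    results.append((member.filename, handle.read()))
    return results
```

Reading the contents before the `with` block exits means callers never see the temporary paths, so nothing needs to be cleaned up by hand.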