Merge branch 'feature/enable_redis' into dev
# Conflicts: # poetry.lock
This commit is contained in:
commit
fabb14c66a
@ -1,6 +1,7 @@
|
|||||||
# Skill 功能
|
# Skill 功能
|
||||||
|
|
||||||
> 负责范围:技能包管理服务 - 核心实现
|
> 负责范围:技能包管理服务 - 核心实现
|
||||||
|
> 最后更新:2026-05-26
|
||||||
> 最后更新:2026-05-23
|
> 最后更新:2026-05-23
|
||||||
> 最后更新:2026-04-20
|
> 最后更新:2026-04-20
|
||||||
|
|
||||||
@ -30,6 +31,10 @@ MCP UI 类 skill 已按 MCP Apps 模式改造:工具返回数据,静态 HTML
|
|||||||
|
|
||||||
## 最近重要事项
|
## 最近重要事项
|
||||||
|
|
||||||
|
- [2026-05-26](changelog/2026-Q2.md): skill 引入 `category` 字段——`routes/skill_manager.py` 在 `SkillItem` / `SkillValidationResult` 增加 `category`,从 `plugin.json` 与 `SKILL.md` frontmatter 解析,official skill 默认 `"other"`、user skill 默认 `"custom"`;并通过 batch 给 common/developing/onprem/support 路径下大量 skill 元数据补 `category`,`data-dashboard` / `mcp-ui` 归类 `Interactive UI`(`203dcf4`, `3ada55a`, `9658588`)
|
||||||
|
- [2026-05-26](changelog/2026-Q2.md): developing 分支大合并新增多个 skill:`ai-ppt-generator`(百度 AI PPT)、`nfc-medicine-lookup`(NFC 药品检索)、`ppt-outline`(PPT 大纲 / HTML 演示文稿)、`z-card-image`(配图 / 卡片图),同时 `skills/linggan/*` 系列 skill 经合并回归(`3ada55a`)
|
||||||
|
- [2026-05-23](changelog/2026-Q2.md): 新增 MCP App 型 `skills/developing/ecommerce-storefront/`——含 `product-list` / `order-confirm` 两个 HTML App + 自带 `ecommerce_server.py` MCP server;同时落地 `docs/mcp-app-training.md`(约 1063 行)作为 MCP App 培训材料(`9d001c8`)
|
||||||
|
- [2026-05-21](changelog/2026-Q2.md): Daytona 沙箱模式下 `init_agent` 在沙箱内写入 `BASH_ENV` 文件,注入 `ASSISTANT_ID` / `USER_IDENTIFIER` / `TRACE_ID` / `ENABLE_SELF_KNOWLEDGE` 与 `config.shell_env` 的 shell 环境变量(`776acc2`)
|
||||||
- [2026-05-12](changelog/2026-Q2.md): 跨 6→10 个 skill 变体批量精修 `retrieval-policy*.md`,统一 onprem/support/autoload 各路径下的 policy 口径(`be96f24`, `7b4f03d`)
|
- [2026-05-12](changelog/2026-Q2.md): 跨 6→10 个 skill 变体批量精修 `retrieval-policy*.md`,统一 onprem/support/autoload 各路径下的 policy 口径(`be96f24`, `7b4f03d`)
|
||||||
- [2026-05-11](changelog/2026-Q2.md): 新增子 agent (SubAgent) 支持——skill 包通过 `agents/*.md` 暴露子 agent,由 `SubAgentMiddleware` 加载;附 `pmda-drug-info` skill 示例(`5b634bc`)
|
- [2026-05-11](changelog/2026-Q2.md): 新增子 agent (SubAgent) 支持——skill 包通过 `agents/*.md` 暴露子 agent,由 `SubAgentMiddleware` 加载;附 `pmda-drug-info` skill 示例(`5b634bc`)
|
||||||
- [2026-05-11](changelog/2026-Q2.md): `pmda-drug-info` 的 `pmda_server.py` 大改为 mock 实现(`a92096a`)
|
- [2026-05-11](changelog/2026-Q2.md): `pmda-drug-info` 的 `pmda_server.py` 大改为 mock 实现(`a92096a`)
|
||||||
@ -77,6 +82,9 @@ MCP UI 类 skill 已按 MCP Apps 模式改造:工具返回数据,静态 HTML
|
|||||||
- ⚠️ **MCP `_meta.trace_id` 是全局 monkey-patch 注入**:`agent/mcp_trace_meta.patch_mcp_client_session_trace_meta()` 在 `get_tools_from_mcp()` 入口调用一次后,会把 `mcp.ClientSession.call_tool` 永久包装;仅对工具名在 `{"rag_retrieve", "table_rag_retrieve"}` 集合内的调用注入 `_meta.trace_id`,扩展白名单要直接改 `_TRACE_META_TOOL_NAMES` 常量。
|
- ⚠️ **MCP `_meta.trace_id` 是全局 monkey-patch 注入**:`agent/mcp_trace_meta.patch_mcp_client_session_trace_meta()` 在 `get_tools_from_mcp()` 入口调用一次后,会把 `mcp.ClientSession.call_tool` 永久包装;仅对工具名在 `{"rag_retrieve", "table_rag_retrieve"}` 集合内的调用注入 `_meta.trace_id`,扩展白名单要直接改 `_TRACE_META_TOOL_NAMES` 常量。
|
||||||
- ⚠️ **PrePrompt hook 内容位置由模板决定**:自 2026-04-23 起 hook 产出通过 `{hook_content}` 占位符注入 `prompt/system_prompt.md`,不再追加在 prompt 末尾;自定义模板必须包含 `{hook_content}` 占位符否则 hook 内容会丢失。
|
- ⚠️ **PrePrompt hook 内容位置由模板决定**:自 2026-04-23 起 hook 产出通过 `{hook_content}` 占位符注入 `prompt/system_prompt.md`,不再追加在 prompt 末尾;自定义模板必须包含 `{hook_content}` 占位符否则 hook 内容会丢失。
|
||||||
- ⚠️ **`init_agent` 返回值已变 3 元素**:Daytona 改造后 `init_agent` 返回 `(agent, checkpointer, sandbox)`;调用方解构必须更新。
|
- ⚠️ **`init_agent` 返回值已变 3 元素**:Daytona 改造后 `init_agent` 返回 `(agent, checkpointer, sandbox)`;调用方解构必须更新。
|
||||||
|
- ⚠️ **skill `category` 默认值**:API 返回的 `SkillItem.category`——official skill fallback 为 `"other"`、user skill fallback 为 `"custom"`;前端做分类视图时需要同时识别这两个 sentinel,不要假设官方/用户 skill 用同一套缺省值。
|
||||||
|
- ⚠️ **`category` 字段双入口**:同一 skill 可以同时在 `.claude-plugin/plugin.json` 和 `SKILL.md` frontmatter 写 `category`;`get_skill_metadata` 优先走 `parse_plugin_json`,若 skill 包没有 plugin.json 才回落到 `parse_skill_frontmatter`——两者写不一致时以 plugin.json 为准。
|
||||||
|
- ⚠️ **Daytona shell_env 是文件注入而非 process env**:`init_agent` 通过 `cat > $REMOTE_BASH_ENV_PATH` 写入 `export VAR=...` 行,沙箱内必须由 shell(bash)的 `BASH_ENV` 加载才能生效;非 daytona 模式或不走 bash 启动的脚本拿不到这些变量。扩展注入项需直接改 `init_agent` 里的 `_shell_env` 字典。
|
||||||
|
|
||||||
## Skill 目录结构
|
## Skill 目录结构
|
||||||
|
|
||||||
|
|||||||
@ -4,6 +4,126 @@
|
|||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
|
## 2026-05-26: skill `category` 字段全面接入
|
||||||
|
|
||||||
|
**类型**:新功能
|
||||||
|
|
||||||
|
**背景**:skill 数量越来越多(common / developing / onprem / support / linggan / autoload 各路径下数十个),列表 API 需要前端能按类别分组展示,元数据层面缺少 `category` 字段。
|
||||||
|
|
||||||
|
**改动**:
|
||||||
|
- `routes/skill_manager.py`:
|
||||||
|
- `SkillItem` model 新增 `category: str = "other"`。
|
||||||
|
- `SkillValidationResult` dataclass 新增可选 `category: Optional[str]`。
|
||||||
|
- `parse_plugin_json` 解析 `plugin_config.get('category')`;`parse_skill_frontmatter` 解析 frontmatter 的 `metadata.get('category')`。
|
||||||
|
- `get_official_skills` 中 fallback 为 `"other"`;`get_user_skills` 中 fallback 为 `"custom"`。
|
||||||
|
- `get_skill_metadata_legacy` 在 `category` 非空时写入返回 dict(保持向后兼容)。
|
||||||
|
- 批量给 common / developing / onprem / support 多个 skill 的 `.claude-plugin/plugin.json` 与 `SKILL.md` frontmatter 添加 `category` 字段。
|
||||||
|
- `data-dashboard` 与 `mcp-ui` 的 `category` 从 `"Data & Retrieval"` 修正为 `"Interactive UI"`(更贴切 MCP App 的渲染语义)。
|
||||||
|
|
||||||
|
**根因**:N/A(新功能)
|
||||||
|
|
||||||
|
**影响**:
|
||||||
|
- `GET /api/v1/skill/list` 返回项现在包含 `category` 字段;前端可按 category 维度做分组/筛选。
|
||||||
|
- skill 元数据约定扩展——新 skill 应在 plugin.json 或 SKILL.md frontmatter 中写明 `category`,否则会落到 `"other"` / `"custom"` 兜底。
|
||||||
|
- `plugin.json.category` 与 `SKILL.md.category` 同时存在时以前者为准(`get_skill_metadata` 优先 plugin.json)。
|
||||||
|
|
||||||
|
**相关文件**:
|
||||||
|
- `routes/skill_manager.py`
|
||||||
|
- `skills/common/data-dashboard/.claude-plugin/plugin.json`
|
||||||
|
- `skills/common/mcp-ui/.claude-plugin/plugin.json`
|
||||||
|
- 以及一批 `skills/{common,developing,onprem,support}/*/SKILL.md` 与 `.claude-plugin/plugin.json`
|
||||||
|
|
||||||
|
**Commit/PR**:`203dcf4`, `3ada55a`, `9658588`
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 2026-05-26: developing 分支批量新增多类 skill
|
||||||
|
|
||||||
|
**类型**:新功能
|
||||||
|
|
||||||
|
**背景**:[待补充]——经 developing→staging 合并集中落地一批新 skill 与 linggan 系列 skill 回归。
|
||||||
|
|
||||||
|
**改动**:
|
||||||
|
- 新增 `skills/developing/ai-ppt-generator/`:调用百度 AI 生成 PPT,按 topic 自动选模板(商务/科技/教育/创意/中国风等);`category: Document Processing`。
|
||||||
|
- 新增 `skills/developing/nfc-medicine-lookup/`:通过 NFC 芯片 ID 或药品名称查询药品信息,面向老年用户的语音助手交互口径;`category: Developer Tools`。
|
||||||
|
- 新增 `skills/developing/ppt-outline/`:PPT 大纲与独立 HTML 演示文稿生成(dark/light/tech/minimal 四种风格);`category: Document Processing`。
|
||||||
|
- 新增 `skills/developing/z-card-image/`:生成配图、封面图、卡片图、社媒帖子分享图等;依赖 `python3` + `google-chrome`。
|
||||||
|
- `skills/developing/static-hosting/SKILL.md` 由 1 行说明扩展为完整 80 行 skill;同时一批已有 SKILL.md / plugin.json 补 `category`。
|
||||||
|
- `skills/linggan/*` 系列 skill(baidu-search / bot-self-modifier / caiyun-weather / competitor-news-intel / contract-document-generator / financial-report-generator / market-academic-insight / ragflow-loader / sales-decision-report / seedream / static-hosting / static-site-deploy / voice-notification / weather-china)经合并回归 staging。
|
||||||
|
|
||||||
|
**根因**:N/A
|
||||||
|
|
||||||
|
**影响**:
|
||||||
|
- developing skill 池扩张约 5 个新业务 skill;linggan 系列重新出现在 staging。
|
||||||
|
- 新 skill 多为 SKILL.md 型业务 skill,符合"workflow + 模板"的纯 markdown 模式;其中 `ai-ppt-generator`、`z-card-image` 依赖外部 `BAIDU_API_KEY` 或 `google-chrome` 二进制。
|
||||||
|
|
||||||
|
**相关文件**:
|
||||||
|
- `skills/developing/ai-ppt-generator/SKILL.md`
|
||||||
|
- `skills/developing/nfc-medicine-lookup/SKILL.md`
|
||||||
|
- `skills/developing/ppt-outline/SKILL.md`
|
||||||
|
- `skills/developing/z-card-image/SKILL.md`
|
||||||
|
- `skills/developing/static-hosting/SKILL.md`
|
||||||
|
- `skills/linggan/**`(回归)
|
||||||
|
|
||||||
|
**Commit/PR**:`3ada55a`
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 2026-05-23: 新增 ecommerce-storefront skill(MCP App 型)+ MCP App 培训文档
|
||||||
|
|
||||||
|
**类型**:新功能
|
||||||
|
|
||||||
|
**背景**:MCP App 模式(host 加载静态 HTML + postMessage 传数据)已经在 `mcp-ui`、`data-dashboard` 上跑通,需要一个面向电商场景的样例 skill,演示产品浏览 / 选购 / 下单确认这类多步交互的 App 渲染;同时沉淀一份 MCP App 开发指南。
|
||||||
|
|
||||||
|
**改动**:
|
||||||
|
- 新增 `skills/developing/ecommerce-storefront/`:
|
||||||
|
- `apps/product-list.html`(288 行)与 `apps/order-confirm.html`(233 行)两个静态 App。
|
||||||
|
- `ecommerce_server.py`(213 行)作为自带 MCP server,`ecommerce_tools.json` 定义工具 schema。
|
||||||
|
- `hooks/ecommerce_guide.md` + `hooks/pre_prompt.py` 注入 skill 使用指引到 system prompt。
|
||||||
|
- `mcp_common.py`(252 行)复用 MCP 通用工具基类。
|
||||||
|
- `.claude-plugin/plugin.json` 配置 PrePrompt hook 与 stdio MCP server,`category: Developer Tools`。
|
||||||
|
- 新增 `docs/mcp-app-training.md`(约 1063 行):MCP App 模式的开发培训材料。
|
||||||
|
|
||||||
|
**根因**:N/A
|
||||||
|
|
||||||
|
**影响**:
|
||||||
|
- developing skill 池新增一个 MCP App 型 skill,体例对齐 `mcp-ui` / `data-dashboard`。
|
||||||
|
- MCP App 开发者有完整培训材料可参考。
|
||||||
|
|
||||||
|
**相关文件**:
|
||||||
|
- `skills/developing/ecommerce-storefront/**`
|
||||||
|
- `docs/mcp-app-training.md`
|
||||||
|
|
||||||
|
**Commit/PR**:`9d001c8`
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 2026-05-21: Daytona 沙箱注入 shell_env 到 BASH_ENV
|
||||||
|
|
||||||
|
**类型**:新功能
|
||||||
|
|
||||||
|
**背景**:Daytona 沙箱内的 skill 脚本需要能读取 `ASSISTANT_ID` / `USER_IDENTIFIER` / `TRACE_ID` 等运行时上下文,但宿主 process env 无法直接透传到沙箱里。
|
||||||
|
|
||||||
|
**改动**:
|
||||||
|
- `agent/deep_assistant.py` `init_agent`:当 `sandbox is not None and sandbox_type == "daytona"` 时,组装 `_shell_env` 字典(`ASSISTANT_ID` / `USER_IDENTIFIER` / `TRACE_ID` / `ENABLE_SELF_KNOWLEDGE` 加上 `config.shell_env`),构造 `cd {REMOTE_WORKSPACE_ROOT}\n` + `export VAR="..."` 行,通过 `sandbox.execute("cat > $REMOTE_BASH_ENV_PATH << 'ENVEOF' ... ENVEOF")` 写入沙箱内。
|
||||||
|
- `utils/daytona_sync.py` 提供常量 `REMOTE_BASH_ENV_PATH` / `REMOTE_WORKSPACE_ROOT`。
|
||||||
|
- `AgentConfig` 增加 `shell_env: Optional[Dict[str, str]]`(调用方可追加自定义 env)。
|
||||||
|
|
||||||
|
**根因**:N/A
|
||||||
|
|
||||||
|
**影响**:
|
||||||
|
- 沙箱内通过 bash 启动的 skill 脚本可以 `os.environ.get("ASSISTANT_ID")` 等读到运行时上下文。
|
||||||
|
- 仅 daytona 沙箱模式生效;本地或非 bash 启动的进程不会收到 `BASH_ENV` 注入的变量。
|
||||||
|
- 扩展注入项(新增固定环境变量)需要直接改 `init_agent` 里的 `_shell_env` 字典。
|
||||||
|
|
||||||
|
**相关文件**:
|
||||||
|
- `agent/deep_assistant.py`
|
||||||
|
- `utils/daytona_sync.py`
|
||||||
|
|
||||||
|
**Commit/PR**:`776acc2`
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
## 2026-05-12: 批量精修 retrieval policy 文案
|
## 2026-05-12: 批量精修 retrieval policy 文案
|
||||||
|
|
||||||
**类型**:内容调整
|
**类型**:内容调整
|
||||||
|
|||||||
168
db_manager.py
168
db_manager.py
@ -1,168 +0,0 @@
|
|||||||
#!/usr/bin/env python3
|
|
||||||
"""
|
|
||||||
SQLite task status database management tool
|
|
||||||
"""
|
|
||||||
|
|
||||||
import sqlite3
|
|
||||||
import json
|
|
||||||
import time
|
|
||||||
from task_queue.task_status import task_status_store
|
|
||||||
|
|
||||||
def view_database():
|
|
||||||
"""View database contents"""
|
|
||||||
print("SQLite task status database contents")
|
|
||||||
print("=" * 40)
|
|
||||||
print(f"Database path: {task_status_store.db_path}")
|
|
||||||
|
|
||||||
# Connect to the database
|
|
||||||
conn = sqlite3.connect(task_status_store.db_path)
|
|
||||||
cursor = conn.cursor()
|
|
||||||
|
|
||||||
# View table schema
|
|
||||||
print(f"\nTable schema:")
|
|
||||||
cursor.execute("PRAGMA table_info(task_status)")
|
|
||||||
columns = cursor.fetchall()
|
|
||||||
for col in columns:
|
|
||||||
print(f" {col[1]} ({col[2]})")
|
|
||||||
|
|
||||||
# View all records
|
|
||||||
print(f"\nAll records:")
|
|
||||||
cursor.execute("SELECT * FROM task_status ORDER BY updated_at DESC")
|
|
||||||
rows = cursor.fetchall()
|
|
||||||
|
|
||||||
if not rows:
|
|
||||||
print(" (empty database)")
|
|
||||||
else:
|
|
||||||
print(f" Total {len(rows)} records:")
|
|
||||||
for i, row in enumerate(rows):
|
|
||||||
task_id, unique_id, status, created_at, updated_at, result, error = row
|
|
||||||
created_str = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(created_at))
|
|
||||||
updated_str = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(updated_at))
|
|
||||||
|
|
||||||
print(f" {i+1}. {task_id}")
|
|
||||||
print(f" Project ID: {unique_id}")
|
|
||||||
print(f" Status: {status}")
|
|
||||||
print(f" Created: {created_str}")
|
|
||||||
print(f" Updated: {updated_str}")
|
|
||||||
if result:
|
|
||||||
try:
|
|
||||||
result_data = json.loads(result)
|
|
||||||
print(f" Result: {result_data.get('message', 'N/A')}")
|
|
||||||
except:
|
|
||||||
print(f" Result: {result[:50]}...")
|
|
||||||
if error:
|
|
||||||
print(f" Error: {error}")
|
|
||||||
print()
|
|
||||||
|
|
||||||
conn.close()
|
|
||||||
|
|
||||||
def run_query(sql_query: str):
|
|
||||||
"""Run a custom query"""
|
|
||||||
print(f"Running query: {sql_query}")
|
|
||||||
|
|
||||||
try:
|
|
||||||
conn = sqlite3.connect(task_status_store.db_path)
|
|
||||||
conn.row_factory = sqlite3.Row
|
|
||||||
cursor = conn.cursor()
|
|
||||||
cursor.execute(sql_query)
|
|
||||||
rows = cursor.fetchall()
|
|
||||||
|
|
||||||
if not rows:
|
|
||||||
print(" (no results)")
|
|
||||||
else:
|
|
||||||
print(f" {len(rows)} results:")
|
|
||||||
for row in rows:
|
|
||||||
print(f" {dict(row)}")
|
|
||||||
|
|
||||||
conn.close()
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
print(f"Query failed: {e}")
|
|
||||||
|
|
||||||
def interactive_shell():
|
|
||||||
"""Interactive database management"""
|
|
||||||
print("\n🖥️ Interactive database management")
|
|
||||||
print("Type 'help' to view available commands, or 'quit' to exit")
|
|
||||||
|
|
||||||
while True:
|
|
||||||
try:
|
|
||||||
command = input("\n> ").strip()
|
|
||||||
|
|
||||||
if command.lower() in ['quit', 'exit', 'q']:
|
|
||||||
break
|
|
||||||
elif command.lower() == 'help':
|
|
||||||
print("""
|
|
||||||
Available commands:
|
|
||||||
view - View all records
|
|
||||||
stats - View statistics
|
|
||||||
pending - View pending tasks
|
|
||||||
completed - View completed tasks
|
|
||||||
failed - View failed tasks
|
|
||||||
sql <query> - Run an SQL query
|
|
||||||
cleanup <days> - Clean up records older than N days
|
|
||||||
count - Count total tasks
|
|
||||||
help - Show help
|
|
||||||
quit/exit/q - Exit
|
|
||||||
""")
|
|
||||||
elif command.lower() == 'view':
|
|
||||||
view_database()
|
|
||||||
elif command.lower() == 'stats':
|
|
||||||
stats = task_status_store.get_statistics()
|
|
||||||
print(f"Statistics:")
|
|
||||||
print(f" Total tasks: {stats['total_tasks']}")
|
|
||||||
print(f" Status breakdown: {stats['status_breakdown']}")
|
|
||||||
print(f" Last 24 hours: {stats['recent_24h']}")
|
|
||||||
elif command.lower() == 'pending':
|
|
||||||
tasks = task_status_store.search_tasks(status="pending")
|
|
||||||
print(f"Pending tasks ({len(tasks)}):")
|
|
||||||
for task in tasks:
|
|
||||||
print(f" - {task['task_id']}: {task['unique_id']}")
|
|
||||||
elif command.lower() == 'completed':
|
|
||||||
tasks = task_status_store.search_tasks(status="completed")
|
|
||||||
print(f"Completed tasks ({len(tasks)}):")
|
|
||||||
for task in tasks:
|
|
||||||
print(f" - {task['task_id']}: {task['unique_id']}")
|
|
||||||
elif command.lower() == 'failed':
|
|
||||||
tasks = task_status_store.search_tasks(status="failed")
|
|
||||||
print(f"Failed tasks ({len(tasks)}):")
|
|
||||||
for task in tasks:
|
|
||||||
print(f" - {task['task_id']}: {task['unique_id']}")
|
|
||||||
elif command.lower().startswith('sql '):
|
|
||||||
sql_query = command[4:]
|
|
||||||
run_query(sql_query)
|
|
||||||
elif command.lower().startswith('cleanup '):
|
|
||||||
try:
|
|
||||||
days = int(command[8:])
|
|
||||||
count = task_status_store.cleanup_old_tasks(days)
|
|
||||||
print(f"Cleaned up {count} records older than {days} days")
|
|
||||||
except ValueError:
|
|
||||||
print("Please enter a valid number of days")
|
|
||||||
elif command.lower() == 'count':
|
|
||||||
all_tasks = task_status_store.list_all()
|
|
||||||
print(f"Total tasks: {len(all_tasks)}")
|
|
||||||
else:
|
|
||||||
print("Unknown command. Type 'help' for help")
|
|
||||||
|
|
||||||
except KeyboardInterrupt:
|
|
||||||
print("\nGoodbye!")
|
|
||||||
break
|
|
||||||
except Exception as e:
|
|
||||||
print(f"Execution error: {e}")
|
|
||||||
|
|
||||||
def main():
|
|
||||||
"""Main function"""
|
|
||||||
import sys
|
|
||||||
|
|
||||||
if len(sys.argv) > 1:
|
|
||||||
if sys.argv[1] == 'view':
|
|
||||||
view_database()
|
|
||||||
elif sys.argv[1] == 'interactive':
|
|
||||||
interactive_shell()
|
|
||||||
else:
|
|
||||||
print("Usage: python db_manager.py [view|interactive]")
|
|
||||||
else:
|
|
||||||
view_database()
|
|
||||||
interactive_shell()
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
main()
|
|
||||||
34
poetry.lock
generated
34
poetry.lock
generated
@ -1790,21 +1790,6 @@ files = [
|
|||||||
{file = "httpx_sse-0.4.3.tar.gz", hash = "sha256:9b1ed0127459a66014aec3c56bebd93da3c1bc8bb6618c8082039a44889a755d"},
|
{file = "httpx_sse-0.4.3.tar.gz", hash = "sha256:9b1ed0127459a66014aec3c56bebd93da3c1bc8bb6618c8082039a44889a755d"},
|
||||||
]
|
]
|
||||||
|
|
||||||
[[package]]
|
|
||||||
name = "huey"
|
|
||||||
version = "2.5.3"
|
|
||||||
description = "huey, a little task queue"
|
|
||||||
optional = false
|
|
||||||
python-versions = "*"
|
|
||||||
groups = ["main"]
|
|
||||||
files = [
|
|
||||||
{file = "huey-2.5.3.tar.gz", hash = "sha256:089fc72b97fd26a513f15b09925c56fad6abe4a699a1f0e902170b37e85163c7"},
|
|
||||||
]
|
|
||||||
|
|
||||||
[package.extras]
|
|
||||||
backends = ["redis (>=3.0.0)"]
|
|
||||||
redis = ["redis (>=3.0.0)"]
|
|
||||||
|
|
||||||
[[package]]
|
[[package]]
|
||||||
name = "huggingface-hub"
|
name = "huggingface-hub"
|
||||||
version = "0.35.3"
|
version = "0.35.3"
|
||||||
@ -4825,6 +4810,23 @@ urllib3 = ">=1.26.14,<3"
|
|||||||
fastembed = ["fastembed (>=0.7,<0.8)"]
|
fastembed = ["fastembed (>=0.7,<0.8)"]
|
||||||
fastembed-gpu = ["fastembed-gpu (>=0.7,<0.8)"]
|
fastembed-gpu = ["fastembed-gpu (>=0.7,<0.8)"]
|
||||||
|
|
||||||
|
[[package]]
|
||||||
|
name = "redis"
|
||||||
|
version = "6.4.0"
|
||||||
|
description = "Python client for Redis database and key-value store"
|
||||||
|
optional = false
|
||||||
|
python-versions = ">=3.9"
|
||||||
|
groups = ["main"]
|
||||||
|
files = [
|
||||||
|
{file = "redis-6.4.0-py3-none-any.whl", hash = "sha256:f0544fa9604264e9464cdf4814e7d4830f74b165d52f2a330a760a88dd248b7f"},
|
||||||
|
{file = "redis-6.4.0.tar.gz", hash = "sha256:b01bc7282b8444e28ec36b261df5375183bb47a07eb9c603f284e89cbc5ef010"},
|
||||||
|
]
|
||||||
|
|
||||||
|
[package.extras]
|
||||||
|
hiredis = ["hiredis (>=3.2.0)"]
|
||||||
|
jwt = ["pyjwt (>=2.9.0)"]
|
||||||
|
ocsp = ["cryptography (>=36.0.1)", "pyopenssl (>=20.0.1)", "requests (>=2.31.0)"]
|
||||||
|
|
||||||
[[package]]
|
[[package]]
|
||||||
name = "referencing"
|
name = "referencing"
|
||||||
version = "0.37.0"
|
version = "0.37.0"
|
||||||
@ -7132,4 +7134,4 @@ cffi = ["cffi (>=1.17,<2.0) ; platform_python_implementation != \"PyPy\" and pyt
|
|||||||
[metadata]
|
[metadata]
|
||||||
lock-version = "2.1"
|
lock-version = "2.1"
|
||||||
python-versions = ">=3.12,<3.15"
|
python-versions = ">=3.12,<3.15"
|
||||||
content-hash = "dc130664802ad1344adc341931036a343f9892934a41bbc15c48663d0146696b"
|
content-hash = "f5b01d5a1e60672741f2c5e8cc6e2ec55534963a9a3791fd1cdf67d3c2fbd70b"
|
||||||
|
|||||||
@ -19,7 +19,7 @@ dependencies = [
|
|||||||
"numpy<2",
|
"numpy<2",
|
||||||
"aiohttp",
|
"aiohttp",
|
||||||
"aiofiles",
|
"aiofiles",
|
||||||
"huey (>=2.5.3,<3.0.0)",
|
"redis (>=4.0,<7.0)",
|
||||||
"pandas>=1.5.0",
|
"pandas>=1.5.0",
|
||||||
"openpyxl>=3.0.0",
|
"openpyxl>=3.0.0",
|
||||||
"xlrd>=2.0.0",
|
"xlrd>=2.0.0",
|
||||||
|
|||||||
@ -58,7 +58,6 @@ httpcore==1.0.9 ; python_version >= "3.12" and python_version < "3.15"
|
|||||||
httptools==0.7.1 ; python_version >= "3.12" and python_version < "3.15" and platform_system != "Windows"
|
httptools==0.7.1 ; python_version >= "3.12" and python_version < "3.15" and platform_system != "Windows"
|
||||||
httpx-sse==0.4.3 ; python_version >= "3.12" and python_version < "3.15"
|
httpx-sse==0.4.3 ; python_version >= "3.12" and python_version < "3.15"
|
||||||
httpx==0.28.1 ; python_version >= "3.12" and python_version < "3.15"
|
httpx==0.28.1 ; python_version >= "3.12" and python_version < "3.15"
|
||||||
huey==2.5.3 ; python_version >= "3.12" and python_version < "3.15"
|
|
||||||
huggingface-hub==0.35.3 ; python_version >= "3.12" and python_version < "3.15"
|
huggingface-hub==0.35.3 ; python_version >= "3.12" and python_version < "3.15"
|
||||||
hyperframe==6.1.0 ; python_version >= "3.12" and python_version < "3.15"
|
hyperframe==6.1.0 ; python_version >= "3.12" and python_version < "3.15"
|
||||||
idna==3.11 ; python_version >= "3.12" and python_version < "3.15"
|
idna==3.11 ; python_version >= "3.12" and python_version < "3.15"
|
||||||
@ -161,6 +160,7 @@ pywin32==311 ; python_version >= "3.12" and python_version < "3.15" and (sys_pla
|
|||||||
pyyaml==6.0.3 ; python_version >= "3.12" and python_version < "3.15"
|
pyyaml==6.0.3 ; python_version >= "3.12" and python_version < "3.15"
|
||||||
qdrant-client==1.12.1 ; python_version >= "3.13" and python_version < "3.15"
|
qdrant-client==1.12.1 ; python_version >= "3.13" and python_version < "3.15"
|
||||||
qdrant-client==1.16.2 ; python_version == "3.12"
|
qdrant-client==1.16.2 ; python_version == "3.12"
|
||||||
|
redis==6.4.0 ; python_version >= "3.12" and python_version < "3.15"
|
||||||
referencing==0.37.0 ; python_version >= "3.12" and python_version < "3.15"
|
referencing==0.37.0 ; python_version >= "3.12" and python_version < "3.15"
|
||||||
regex==2025.9.18 ; python_version >= "3.12" and python_version < "3.15"
|
regex==2025.9.18 ; python_version >= "3.12" and python_version < "3.15"
|
||||||
requests-toolbelt==1.0.0 ; python_version >= "3.12" and python_version < "3.15"
|
requests-toolbelt==1.0.0 ; python_version >= "3.12" and python_version < "3.15"
|
||||||
|
|||||||
377
routes/files.py
377
routes/files.py
@ -1,273 +1,18 @@
|
|||||||
import os
|
import os
|
||||||
import uuid
|
import uuid
|
||||||
import shutil
|
import shutil
|
||||||
import zipfile
|
|
||||||
from datetime import datetime
|
from datetime import datetime
|
||||||
from typing import Optional, List
|
from typing import Optional
|
||||||
from fastapi import APIRouter, HTTPException, Header, UploadFile, File, Form
|
from fastapi import APIRouter, HTTPException, UploadFile, File, Form
|
||||||
from pydantic import BaseModel
|
|
||||||
import logging
|
import logging
|
||||||
|
|
||||||
logger = logging.getLogger('app')
|
logger = logging.getLogger('app')
|
||||||
|
|
||||||
from utils import (
|
|
||||||
DatasetRequest, QueueTaskRequest, IncrementalTaskRequest, QueueTaskResponse,
|
|
||||||
load_processed_files_log, remove_file_or_directory, remove_dataset_directory_by_key
|
|
||||||
)
|
|
||||||
from utils.fastapi_utils import get_versioned_filename
|
from utils.fastapi_utils import get_versioned_filename
|
||||||
from task_queue.manager import queue_manager
|
|
||||||
from task_queue.integration_tasks import process_files_async, process_files_incremental_async, cleanup_project_async
|
|
||||||
from task_queue.task_status import task_status_store
|
|
||||||
|
|
||||||
router = APIRouter()
|
router = APIRouter()
|
||||||
|
|
||||||
|
|
||||||
@router.post("/api/v1/files/process/async")
|
|
||||||
async def process_files_async_endpoint(request: QueueTaskRequest, authorization: Optional[str] = Header(None)):
|
|
||||||
"""
|
|
||||||
Queue-based API for asynchronous file processing.
|
|
||||||
Same functionality as /api/v1/files/process, but processed asynchronously through the queue.
|
|
||||||
|
|
||||||
Args:
|
|
||||||
request: QueueTaskRequest containing dataset_id, files, system_prompt, mcp_settings, and queue options
|
|
||||||
authorization: Authorization header containing API key (Bearer <API_KEY>)
|
|
||||||
|
|
||||||
Returns:
|
|
||||||
QueueTaskResponse: Processing result with task ID for tracking
|
|
||||||
"""
|
|
||||||
try:
|
|
||||||
dataset_id = request.dataset_id
|
|
||||||
if not dataset_id:
|
|
||||||
raise HTTPException(status_code=400, detail="dataset_id is required")
|
|
||||||
|
|
||||||
# Estimate processing time (based on file count)
|
|
||||||
estimated_time = 0
|
|
||||||
if request.upload_folder:
|
|
||||||
# For upload_folder, file count cannot be estimated in advance, so use the default time
|
|
||||||
estimated_time = 120 # Default: 2 minutes
|
|
||||||
elif request.files:
|
|
||||||
total_files = sum(len(file_list) for file_list in request.files.values())
|
|
||||||
estimated_time = max(30, total_files * 10) # Estimated 10 seconds per file, minimum 30 seconds
|
|
||||||
|
|
||||||
# Create task status record
|
|
||||||
import uuid
|
|
||||||
task_id = str(uuid.uuid4())
|
|
||||||
task_status_store.set_status(
|
|
||||||
task_id=task_id,
|
|
||||||
unique_id=dataset_id,
|
|
||||||
status="pending"
|
|
||||||
)
|
|
||||||
|
|
||||||
# Submit async task
|
|
||||||
task = process_files_async(
|
|
||||||
dataset_id=dataset_id,
|
|
||||||
files=request.files,
|
|
||||||
upload_folder=request.upload_folder,
|
|
||||||
task_id=task_id
|
|
||||||
)
|
|
||||||
|
|
||||||
# Build a more detailed message
|
|
||||||
message = f"File processing task has been submitted to the queue, project ID: {dataset_id}"
|
|
||||||
if request.upload_folder:
|
|
||||||
group_count = len(request.upload_folder)
|
|
||||||
message += f", files will be scanned automatically from {group_count} uploaded folders"
|
|
||||||
elif request.files:
|
|
||||||
total_files = sum(len(file_list) for file_list in request.files.values())
|
|
||||||
message += f", including {total_files} files"
|
|
||||||
|
|
||||||
return QueueTaskResponse(
|
|
||||||
success=True,
|
|
||||||
message=message,
|
|
||||||
dataset_id=dataset_id,
|
|
||||||
task_id=task_id, # Use our own task_id
|
|
||||||
task_status="pending",
|
|
||||||
estimated_processing_time=estimated_time
|
|
||||||
)
|
|
||||||
|
|
||||||
except HTTPException:
|
|
||||||
raise
|
|
||||||
except Exception as e:
|
|
||||||
logger.error(f"Error submitting async file processing task: {str(e)}")
|
|
||||||
raise HTTPException(status_code=500, detail=f"Internal server error: {str(e)}")
|
|
||||||
|
|
||||||
|
|
||||||
@router.post("/api/v1/files/process/incremental")
|
|
||||||
async def process_files_incremental_endpoint(request: IncrementalTaskRequest, authorization: Optional[str] = Header(None)):
|
|
||||||
"""
|
|
||||||
Queue-based API for incremental file processing, supporting file additions and deletions.
|
|
||||||
|
|
||||||
Args:
|
|
||||||
request: IncrementalTaskRequest containing dataset_id, files_to_add, files_to_remove, system_prompt, mcp_settings, and queue options
|
|
||||||
authorization: Authorization header containing API key (Bearer <API_KEY>)
|
|
||||||
|
|
||||||
Returns:
|
|
||||||
QueueTaskResponse: Processing result with task ID for tracking
|
|
||||||
"""
|
|
||||||
try:
|
|
||||||
dataset_id = request.dataset_id
|
|
||||||
if not dataset_id:
|
|
||||||
raise HTTPException(status_code=400, detail="dataset_id is required")
|
|
||||||
|
|
||||||
# Validate that there is at least one add or delete operation
|
|
||||||
if not request.files_to_add and not request.files_to_remove:
|
|
||||||
raise HTTPException(status_code=400, detail="At least one of files_to_add or files_to_remove must be provided")
|
|
||||||
|
|
||||||
# Estimate processing time (based on file count)
|
|
||||||
estimated_time = 0
|
|
||||||
total_add_files = sum(len(file_list) for file_list in (request.files_to_add or {}).values())
|
|
||||||
total_remove_files = sum(len(file_list) for file_list in (request.files_to_remove or {}).values())
|
|
||||||
total_files = total_add_files + total_remove_files
|
|
||||||
estimated_time = max(30, total_files * 10) # Estimated 10 seconds per file, minimum 30 seconds
|
|
||||||
|
|
||||||
# Create task status record
|
|
||||||
import uuid
|
|
||||||
task_id = str(uuid.uuid4())
|
|
||||||
task_status_store.set_status(
|
|
||||||
task_id=task_id,
|
|
||||||
unique_id=dataset_id,
|
|
||||||
status="pending"
|
|
||||||
)
|
|
||||||
|
|
||||||
# Submit incremental async task
|
|
||||||
task = process_files_incremental_async(
|
|
||||||
dataset_id=dataset_id,
|
|
||||||
files_to_add=request.files_to_add,
|
|
||||||
files_to_remove=request.files_to_remove,
|
|
||||||
system_prompt=request.system_prompt,
|
|
||||||
mcp_settings=request.mcp_settings,
|
|
||||||
task_id=task_id
|
|
||||||
)
|
|
||||||
|
|
||||||
return QueueTaskResponse(
|
|
||||||
success=True,
|
|
||||||
message=f"Incremental file processing task has been submitted to the queue - added {total_add_files} files, removed {total_remove_files} files, project ID: {dataset_id}",
|
|
||||||
dataset_id=dataset_id,
|
|
||||||
task_id=task_id,
|
|
||||||
task_status="pending",
|
|
||||||
estimated_processing_time=estimated_time
|
|
||||||
)
|
|
||||||
|
|
||||||
except HTTPException:
|
|
||||||
raise
|
|
||||||
except Exception as e:
|
|
||||||
logger.error(f"Error submitting incremental file processing task: {str(e)}")
|
|
||||||
raise HTTPException(status_code=500, detail=f"Internal server error: {str(e)}")
|
|
||||||
|
|
||||||
|
|
||||||
@router.get("/api/v1/files/{dataset_id}/status")
|
|
||||||
async def get_files_processing_status(dataset_id: str):
|
|
||||||
"""Get the file processing status for the project."""
|
|
||||||
try:
|
|
||||||
# Load processed files log
|
|
||||||
processed_log = load_processed_files_log(dataset_id)
|
|
||||||
|
|
||||||
# Get project directory info
|
|
||||||
project_dir = os.path.join("projects", "data", dataset_id)
|
|
||||||
project_exists = os.path.exists(project_dir)
|
|
||||||
|
|
||||||
# Collect document.txt files
|
|
||||||
document_files = []
|
|
||||||
if project_exists:
|
|
||||||
for root, dirs, files in os.walk(project_dir):
|
|
||||||
for file in files:
|
|
||||||
if file == "document.txt":
|
|
||||||
document_files.append(os.path.join(root, file))
|
|
||||||
|
|
||||||
return {
|
|
||||||
"dataset_id": dataset_id,
|
|
||||||
"project_exists": project_exists,
|
|
||||||
"processed_files_count": len(processed_log),
|
|
||||||
"processed_files": processed_log,
|
|
||||||
"document_files_count": len(document_files),
|
|
||||||
"document_files": document_files,
|
|
||||||
"log_file_exists": os.path.exists(os.path.join("projects", "data", dataset_id, "processed_files.json"))
|
|
||||||
}
|
|
||||||
except Exception as e:
|
|
||||||
raise HTTPException(status_code=500, detail=f"Failed to retrieve file processing status: {str(e)}")
|
|
||||||
|
|
||||||
|
|
||||||
@router.post("/api/v1/files/{dataset_id}/reset")
|
|
||||||
async def reset_files_processing(dataset_id: str):
|
|
||||||
"""Reset the project's file processing status by deleting the processing log and all files."""
|
|
||||||
try:
|
|
||||||
project_dir = os.path.join("projects", "data", dataset_id)
|
|
||||||
log_file = os.path.join("projects", "data", dataset_id, "processed_files.json")
|
|
||||||
|
|
||||||
# Load processed log to know what files to remove
|
|
||||||
processed_log = load_processed_files_log(dataset_id)
|
|
||||||
|
|
||||||
removed_files = []
|
|
||||||
# Remove all processed files and their dataset directories
|
|
||||||
for file_hash, file_info in processed_log.items():
|
|
||||||
# Remove local file in files directory
|
|
||||||
if 'local_path' in file_info:
|
|
||||||
if remove_file_or_directory(file_info['local_path']):
|
|
||||||
removed_files.append(file_info['local_path'])
|
|
||||||
|
|
||||||
# Handle new key-based structure first
|
|
||||||
if 'key' in file_info:
|
|
||||||
# Remove dataset directory by key
|
|
||||||
key = file_info['key']
|
|
||||||
if remove_dataset_directory_by_key(dataset_id, key):
|
|
||||||
removed_files.append(f"dataset/{key}")
|
|
||||||
elif 'filename' in file_info:
|
|
||||||
# Fallback to old filename-based structure
|
|
||||||
filename_without_ext = os.path.splitext(file_info['filename'])[0]
|
|
||||||
dataset_dir = os.path.join("projects", "data", dataset_id, "datasets", filename_without_ext)
|
|
||||||
if remove_file_or_directory(dataset_dir):
|
|
||||||
removed_files.append(dataset_dir)
|
|
||||||
|
|
||||||
# Also remove any specific dataset path if exists (fallback)
|
|
||||||
if 'dataset_path' in file_info:
|
|
||||||
if remove_file_or_directory(file_info['dataset_path']):
|
|
||||||
removed_files.append(file_info['dataset_path'])
|
|
||||||
|
|
||||||
# Remove the log file
|
|
||||||
if remove_file_or_directory(log_file):
|
|
||||||
removed_files.append(log_file)
|
|
||||||
|
|
||||||
# Remove the entire files directory
|
|
||||||
files_dir = os.path.join(project_dir, "files")
|
|
||||||
if remove_file_or_directory(files_dir):
|
|
||||||
removed_files.append(files_dir)
|
|
||||||
|
|
||||||
# Also remove the entire dataset directory (clean up any remaining files)
|
|
||||||
dataset_dir = os.path.join(project_dir, "datasets")
|
|
||||||
if remove_file_or_directory(dataset_dir):
|
|
||||||
removed_files.append(dataset_dir)
|
|
||||||
|
|
||||||
# Remove README.md if exists
|
|
||||||
readme_file = os.path.join(project_dir, "README.md")
|
|
||||||
if remove_file_or_directory(readme_file):
|
|
||||||
removed_files.append(readme_file)
|
|
||||||
|
|
||||||
return {
|
|
||||||
"message": f"File processing status reset successfully: {dataset_id}",
|
|
||||||
"removed_files_count": len(removed_files),
|
|
||||||
"removed_files": removed_files
|
|
||||||
}
|
|
||||||
except Exception as e:
|
|
||||||
raise HTTPException(status_code=500, detail=f"Failed to reset file processing status: {str(e)}")
|
|
||||||
|
|
||||||
|
|
||||||
@router.post("/api/v1/files/{dataset_id}/cleanup/async")
|
|
||||||
async def cleanup_project_async_endpoint(dataset_id: str, remove_all: bool = False):
|
|
||||||
"""Asynchronously clean up project files."""
|
|
||||||
try:
|
|
||||||
task = cleanup_project_async(dataset_id=dataset_id, remove_all=remove_all)
|
|
||||||
|
|
||||||
return {
|
|
||||||
"success": True,
|
|
||||||
"message": f"Project cleanup task has been submitted to the queue, project ID: {dataset_id}",
|
|
||||||
"dataset_id": dataset_id,
|
|
||||||
"task_id": task.id,
|
|
||||||
"action": "remove_all" if remove_all else "cleanup_logs"
|
|
||||||
}
|
|
||||||
except Exception as e:
|
|
||||||
logger.error(f"Error submitting cleanup task: {str(e)}")
|
|
||||||
raise HTTPException(status_code=500, detail=f"Failed to submit cleanup task: {str(e)}")
|
|
||||||
|
|
||||||
|
|
||||||
@router.post("/api/v1/upload")
|
@router.post("/api/v1/upload")
|
||||||
async def upload_file(file: UploadFile = File(...), folder: Optional[str] = Form(None)):
|
async def upload_file(file: UploadFile = File(...), folder: Optional[str] = Form(None)):
|
||||||
"""
|
"""
|
||||||
@ -348,121 +93,3 @@ async def upload_file(file: UploadFile = File(...), folder: Optional[str] = Form
|
|||||||
except Exception as e:
|
except Exception as e:
|
||||||
logger.error(f"Error uploading file: {str(e)}")
|
logger.error(f"Error uploading file: {str(e)}")
|
||||||
raise HTTPException(status_code=500, detail=f"File upload failed: {str(e)}")
|
raise HTTPException(status_code=500, detail=f"File upload failed: {str(e)}")
|
||||||
|
|
||||||
|
|
||||||
# Task management routes that are related to file processing
|
|
||||||
@router.get("/api/v1/task/{task_id}/status")
|
|
||||||
async def get_task_status(task_id: str):
|
|
||||||
"""Get task status - simple and reliable."""
|
|
||||||
try:
|
|
||||||
status_data = task_status_store.get_status(task_id)
|
|
||||||
|
|
||||||
if not status_data:
|
|
||||||
return {
|
|
||||||
"success": False,
|
|
||||||
"message": "Task does not exist or has expired",
|
|
||||||
"task_id": task_id,
|
|
||||||
"status": "not_found"
|
|
||||||
}
|
|
||||||
|
|
||||||
return {
|
|
||||||
"success": True,
|
|
||||||
"message": "Task status retrieved successfully",
|
|
||||||
"task_id": task_id,
|
|
||||||
"status": status_data["status"],
|
|
||||||
"unique_id": status_data["unique_id"],
|
|
||||||
"created_at": status_data["created_at"],
|
|
||||||
"updated_at": status_data["updated_at"],
|
|
||||||
"result": status_data.get("result"),
|
|
||||||
"error": status_data.get("error")
|
|
||||||
}
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
logger.error(f"Error getting task status: {str(e)}")
|
|
||||||
raise HTTPException(status_code=500, detail=f"Failed to retrieve task status: {str(e)}")
|
|
||||||
|
|
||||||
|
|
||||||
@router.delete("/api/v1/task/{task_id}")
|
|
||||||
async def delete_task(task_id: str):
|
|
||||||
"""Delete task record."""
|
|
||||||
try:
|
|
||||||
success = task_status_store.delete_status(task_id)
|
|
||||||
if success:
|
|
||||||
return {
|
|
||||||
"success": True,
|
|
||||||
"message": f"Task record deleted: {task_id}",
|
|
||||||
"task_id": task_id
|
|
||||||
}
|
|
||||||
else:
|
|
||||||
return {
|
|
||||||
"success": False,
|
|
||||||
"message": f"Task record does not exist: {task_id}",
|
|
||||||
"task_id": task_id
|
|
||||||
}
|
|
||||||
except Exception as e:
|
|
||||||
logger.error(f"Error deleting task: {str(e)}")
|
|
||||||
raise HTTPException(status_code=500, detail=f"Failed to delete task record: {str(e)}")
|
|
||||||
|
|
||||||
|
|
||||||
@router.get("/api/v1/tasks")
|
|
||||||
async def list_tasks(status: Optional[str] = None, dataset_id: Optional[str] = None, limit: int = 100):
|
|
||||||
"""List tasks with optional filters."""
|
|
||||||
try:
|
|
||||||
if status or dataset_id:
|
|
||||||
# Use search function
|
|
||||||
tasks = task_status_store.search_tasks(status=status, unique_id=dataset_id, limit=limit)
|
|
||||||
else:
|
|
||||||
# Get all tasks
|
|
||||||
all_tasks = task_status_store.list_all()
|
|
||||||
tasks = list(all_tasks.values())[:limit]
|
|
||||||
|
|
||||||
return {
|
|
||||||
"success": True,
|
|
||||||
"message": "Task list retrieved successfully",
|
|
||||||
"total_tasks": len(tasks),
|
|
||||||
"tasks": tasks,
|
|
||||||
"filters": {
|
|
||||||
"status": status,
|
|
||||||
"dataset_id": dataset_id,
|
|
||||||
"limit": limit
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
logger.error(f"Error listing tasks: {str(e)}")
|
|
||||||
raise HTTPException(status_code=500, detail=f"Failed to retrieve task list: {str(e)}")
|
|
||||||
|
|
||||||
|
|
||||||
@router.get("/api/v1/tasks/statistics")
|
|
||||||
async def get_task_statistics():
|
|
||||||
"""Get task statistics."""
|
|
||||||
try:
|
|
||||||
stats = task_status_store.get_statistics()
|
|
||||||
|
|
||||||
return {
|
|
||||||
"success": True,
|
|
||||||
"message": "Statistics retrieved successfully",
|
|
||||||
"statistics": stats
|
|
||||||
}
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
logger.error(f"Error getting statistics: {str(e)}")
|
|
||||||
raise HTTPException(status_code=500, detail=f"Failed to retrieve statistics: {str(e)}")
|
|
||||||
|
|
||||||
|
|
||||||
@router.post("/api/v1/tasks/cleanup")
|
|
||||||
async def cleanup_tasks(older_than_days: int = 7):
|
|
||||||
"""Clean up old task records."""
|
|
||||||
try:
|
|
||||||
deleted_count = task_status_store.cleanup_old_tasks(older_than_days=older_than_days)
|
|
||||||
|
|
||||||
return {
|
|
||||||
"success": True,
|
|
||||||
"message": f"Cleaned up {deleted_count} old task records",
|
|
||||||
"deleted_count": deleted_count,
|
|
||||||
"older_than_days": older_than_days
|
|
||||||
}
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
logger.error(f"Error cleaning up tasks: {str(e)}")
|
|
||||||
raise HTTPException(status_code=500, detail=f"Failed to clean up task records: {str(e)}")
|
|
||||||
|
|||||||
@ -6,8 +6,6 @@ import logging
|
|||||||
|
|
||||||
logger = logging.getLogger('app')
|
logger = logging.getLogger('app')
|
||||||
|
|
||||||
from task_queue.task_status import task_status_store
|
|
||||||
|
|
||||||
router = APIRouter()
|
router = APIRouter()
|
||||||
|
|
||||||
|
|
||||||
@ -155,22 +153,3 @@ async def list_datasets():
|
|||||||
except Exception as e:
|
except Exception as e:
|
||||||
logger.error(f"Error listing datasets: {str(e)}")
|
logger.error(f"Error listing datasets: {str(e)}")
|
||||||
raise HTTPException(status_code=500, detail=f"Failed to retrieve dataset list: {str(e)}")
|
raise HTTPException(status_code=500, detail=f"Failed to retrieve dataset list: {str(e)}")
|
||||||
|
|
||||||
|
|
||||||
@router.get("/api/v1/projects/{dataset_id}/tasks")
|
|
||||||
async def get_project_tasks(dataset_id: str):
|
|
||||||
"""Get all tasks for the specified project."""
|
|
||||||
try:
|
|
||||||
tasks = task_status_store.get_by_unique_id(dataset_id)
|
|
||||||
|
|
||||||
return {
|
|
||||||
"success": True,
|
|
||||||
"message": "Project tasks retrieved successfully",
|
|
||||||
"dataset_id": dataset_id,
|
|
||||||
"total_tasks": len(tasks),
|
|
||||||
"tasks": tasks
|
|
||||||
}
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
logger.error(f"Error getting project tasks: {str(e)}")
|
|
||||||
raise HTTPException(status_code=500, detail=f"Failed to retrieve project tasks: {str(e)}")
|
|
||||||
|
|||||||
@ -1,5 +1,5 @@
|
|||||||
#!/bin/bash
|
#!/bin/bash
|
||||||
# Optimized startup script - integrates the FastAPI application and queue consumer
|
# Optimized startup script for the FastAPI application
|
||||||
|
|
||||||
set -e
|
set -e
|
||||||
|
|
||||||
@ -7,7 +7,6 @@ set -e
|
|||||||
DEFAULT_HOST="0.0.0.0"
|
DEFAULT_HOST="0.0.0.0"
|
||||||
DEFAULT_PORT="8001"
|
DEFAULT_PORT="8001"
|
||||||
DEFAULT_API_WORKERS="4"
|
DEFAULT_API_WORKERS="4"
|
||||||
DEFAULT_QUEUE_WORKERS="2"
|
|
||||||
DEFAULT_PROFILE="balanced"
|
DEFAULT_PROFILE="balanced"
|
||||||
DEFAULT_LOG_LEVEL="info"
|
DEFAULT_LOG_LEVEL="info"
|
||||||
DEFAULT_MAX_RESTARTS="3"
|
DEFAULT_MAX_RESTARTS="3"
|
||||||
@ -17,7 +16,6 @@ DEFAULT_CHECK_INTERVAL="5"
|
|||||||
HOST=${HOST:-$DEFAULT_HOST}
|
HOST=${HOST:-$DEFAULT_HOST}
|
||||||
PORT=${PORT:-$DEFAULT_PORT}
|
PORT=${PORT:-$DEFAULT_PORT}
|
||||||
API_WORKERS=${API_WORKERS:-$DEFAULT_API_WORKERS}
|
API_WORKERS=${API_WORKERS:-$DEFAULT_API_WORKERS}
|
||||||
QUEUE_WORKERS=${QUEUE_WORKERS:-$DEFAULT_QUEUE_WORKERS}
|
|
||||||
PROFILE=${PROFILE:-$DEFAULT_PROFILE}
|
PROFILE=${PROFILE:-$DEFAULT_PROFILE}
|
||||||
LOG_LEVEL=${LOG_LEVEL:-$DEFAULT_LOG_LEVEL}
|
LOG_LEVEL=${LOG_LEVEL:-$DEFAULT_LOG_LEVEL}
|
||||||
MAX_RESTARTS=${MAX_RESTARTS:-$DEFAULT_MAX_RESTARTS}
|
MAX_RESTARTS=${MAX_RESTARTS:-$DEFAULT_MAX_RESTARTS}
|
||||||
@ -47,7 +45,6 @@ print_config() {
|
|||||||
print_color $GREEN "Startup configuration:"
|
print_color $GREEN "Startup configuration:"
|
||||||
echo "- API server: http://$HOST:$PORT"
|
echo "- API server: http://$HOST:$PORT"
|
||||||
echo "- API worker processes: $API_WORKERS"
|
echo "- API worker processes: $API_WORKERS"
|
||||||
echo "- Queue worker threads: $QUEUE_WORKERS"
|
|
||||||
echo "- Performance profile: $PROFILE"
|
echo "- Performance profile: $PROFILE"
|
||||||
echo "- Log level: $LOG_LEVEL"
|
echo "- Log level: $LOG_LEVEL"
|
||||||
echo "- Maximum restarts: $MAX_RESTARTS"
|
echo "- Maximum restarts: $MAX_RESTARTS"
|
||||||
@ -87,7 +84,6 @@ create_directories() {
|
|||||||
print_color $YELLOW "Creating project directories..."
|
print_color $YELLOW "Creating project directories..."
|
||||||
|
|
||||||
directories=(
|
directories=(
|
||||||
"projects/queue_data"
|
|
||||||
"projects/data"
|
"projects/data"
|
||||||
"projects/uploads"
|
"projects/uploads"
|
||||||
"projects/robot"
|
"projects/robot"
|
||||||
@ -161,16 +157,6 @@ start_services() {
|
|||||||
API_PID=$!
|
API_PID=$!
|
||||||
echo "API server PID: $API_PID"
|
echo "API server PID: $API_PID"
|
||||||
|
|
||||||
# Start the queue consumer
|
|
||||||
print_color $BLUE "Starting queue consumer..."
|
|
||||||
python3 task_queue/consumer.py \
|
|
||||||
--workers=$QUEUE_WORKERS \
|
|
||||||
--worker-type=threads \
|
|
||||||
> queue_consumer.log 2>&1 &
|
|
||||||
|
|
||||||
CONSUMER_PID=$!
|
|
||||||
echo "Queue consumer PID: $CONSUMER_PID"
|
|
||||||
|
|
||||||
echo
|
echo
|
||||||
print_color $GREEN "All services started successfully!"
|
print_color $GREEN "All services started successfully!"
|
||||||
print_color $GREEN "API server: http://$HOST:$PORT"
|
print_color $GREEN "API server: http://$HOST:$PORT"
|
||||||
@ -179,7 +165,7 @@ start_services() {
|
|||||||
}
|
}
|
||||||
|
|
||||||
monitor_services() {
|
monitor_services() {
|
||||||
local restart_counts=(0 0) # API, Consumer
|
local restart_counts=(0) # API
|
||||||
|
|
||||||
while true; do
|
while true; do
|
||||||
# Check the API server
|
# Check the API server
|
||||||
@ -205,26 +191,6 @@ monitor_services() {
|
|||||||
fi
|
fi
|
||||||
fi
|
fi
|
||||||
|
|
||||||
# Check the queue consumer
|
|
||||||
if ! kill -0 $CONSUMER_PID 2>/dev/null; then
|
|
||||||
print_color $RED "Queue consumer stopped unexpectedly"
|
|
||||||
|
|
||||||
if [ ${restart_counts[1]} -lt $MAX_RESTARTS ]; then
|
|
||||||
print_color $YELLOW "Restarting queue consumer (${restart_counts[1]} + 1/$MAX_RESTARTS)..."
|
|
||||||
python3 task_queue/consumer.py \
|
|
||||||
--workers=$QUEUE_WORKERS \
|
|
||||||
--worker-type=threads \
|
|
||||||
>> queue_consumer.log 2>&1 &
|
|
||||||
|
|
||||||
CONSUMER_PID=$!
|
|
||||||
restart_counts[1]=$((restart_counts[1] + 1))
|
|
||||||
print_color $GREEN "Queue consumer restarted successfully, PID: $CONSUMER_PID"
|
|
||||||
else
|
|
||||||
print_color $RED "Queue consumer restart limit reached, stopping all services"
|
|
||||||
break
|
|
||||||
fi
|
|
||||||
fi
|
|
||||||
|
|
||||||
# Wait for the next check interval
|
# Wait for the next check interval
|
||||||
sleep $CHECK_INTERVAL
|
sleep $CHECK_INTERVAL
|
||||||
done
|
done
|
||||||
@ -253,25 +219,6 @@ cleanup() {
|
|||||||
fi
|
fi
|
||||||
fi
|
fi
|
||||||
|
|
||||||
# Stop the queue consumer
|
|
||||||
if [ ! -z "$CONSUMER_PID" ] && kill -0 $CONSUMER_PID 2>/dev/null; then
|
|
||||||
print_color $BLUE "Stopping queue consumer (PID: $CONSUMER_PID)..."
|
|
||||||
kill $CONSUMER_PID 2>/dev/null || true
|
|
||||||
|
|
||||||
# Wait for graceful shutdown
|
|
||||||
local count=0
|
|
||||||
while kill -0 $CONSUMER_PID 2>/dev/null && [ $count -lt 10 ]; do
|
|
||||||
sleep 1
|
|
||||||
count=$((count + 1))
|
|
||||||
done
|
|
||||||
|
|
||||||
# Force terminate if it is still running
|
|
||||||
if kill -0 $CONSUMER_PID 2>/dev/null; then
|
|
||||||
print_color $RED "Force stopping queue consumer..."
|
|
||||||
kill -9 $CONSUMER_PID 2>/dev/null || true
|
|
||||||
fi
|
|
||||||
fi
|
|
||||||
|
|
||||||
print_color $GREEN "All services have been stopped"
|
print_color $GREEN "All services have been stopped"
|
||||||
exit 0
|
exit 0
|
||||||
}
|
}
|
||||||
@ -288,7 +235,6 @@ main() {
|
|||||||
echo " HOST API bind host address (default: $DEFAULT_HOST)"
|
echo " HOST API bind host address (default: $DEFAULT_HOST)"
|
||||||
echo " PORT API bind port (default: $DEFAULT_PORT)"
|
echo " PORT API bind port (default: $DEFAULT_PORT)"
|
||||||
echo " API_WORKERS Number of API worker processes (default: $DEFAULT_API_WORKERS)"
|
echo " API_WORKERS Number of API worker processes (default: $DEFAULT_API_WORKERS)"
|
||||||
echo " QUEUE_WORKERS Number of queue worker threads (default: $DEFAULT_QUEUE_WORKERS)"
|
|
||||||
echo " PROFILE Performance profile: low_memory, balanced, high_performance (default: $DEFAULT_PROFILE)"
|
echo " PROFILE Performance profile: low_memory, balanced, high_performance (default: $DEFAULT_PROFILE)"
|
||||||
echo " LOG_LEVEL Log level: debug, info, warning, error (default: $DEFAULT_LOG_LEVEL)"
|
echo " LOG_LEVEL Log level: debug, info, warning, error (default: $DEFAULT_LOG_LEVEL)"
|
||||||
echo " MAX_RESTARTS Maximum restart count (default: $DEFAULT_MAX_RESTARTS)"
|
echo " MAX_RESTARTS Maximum restart count (default: $DEFAULT_MAX_RESTARTS)"
|
||||||
@ -296,7 +242,7 @@ main() {
|
|||||||
echo
|
echo
|
||||||
echo "Examples:"
|
echo "Examples:"
|
||||||
echo " PROFILE=high_performance API_WORKERS=8 $0"
|
echo " PROFILE=high_performance API_WORKERS=8 $0"
|
||||||
echo " PORT=8080 QUEUE_WORKERS=4 $0"
|
echo " PORT=8080 API_WORKERS=4 $0"
|
||||||
exit 0
|
exit 0
|
||||||
fi
|
fi
|
||||||
|
|
||||||
|
|||||||
@ -1,6 +1,6 @@
|
|||||||
#!/usr/bin/env python3
|
#!/usr/bin/env python3
|
||||||
"""
|
"""
|
||||||
Optimized unified startup script combining the FastAPI application and queue consumer.
|
Optimized unified startup script for the FastAPI application.
|
||||||
Supports performance monitoring, automatic restart, graceful shutdown, and related features.
|
Supports performance monitoring, automatic restart, graceful shutdown, and related features.
|
||||||
"""
|
"""
|
||||||
|
|
||||||
@ -17,7 +17,7 @@ from typing import List, Optional, Dict, Any
|
|||||||
|
|
||||||
|
|
||||||
class ProcessManager:
|
class ProcessManager:
|
||||||
"""Process manager that controls the API service and queue consumer."""
|
"""Process manager that controls the API service."""
|
||||||
|
|
||||||
def __init__(self):
|
def __init__(self):
|
||||||
self.processes: Dict[str, subprocess.Popen] = {}
|
self.processes: Dict[str, subprocess.Popen] = {}
|
||||||
@ -78,44 +78,6 @@ class ProcessManager:
|
|||||||
print(f"Failed to start API server: {e}")
|
print(f"Failed to start API server: {e}")
|
||||||
return None
|
return None
|
||||||
|
|
||||||
def start_queue_consumer(self, args) -> Optional[subprocess.Popen]:
|
|
||||||
"""Start the queue consumer."""
|
|
||||||
print("Starting queue consumer...")
|
|
||||||
|
|
||||||
consumer_script = Path("task_queue/consumer.py")
|
|
||||||
if not consumer_script.exists():
|
|
||||||
consumer_script = consumer_script.with_suffix(".pyc")
|
|
||||||
|
|
||||||
# Build the queue consumer command
|
|
||||||
cmd = [
|
|
||||||
sys.executable,
|
|
||||||
str(consumer_script),
|
|
||||||
"--workers", str(args.queue_workers),
|
|
||||||
"--worker-type", args.worker_type
|
|
||||||
]
|
|
||||||
|
|
||||||
try:
|
|
||||||
process = subprocess.Popen(
|
|
||||||
cmd,
|
|
||||||
stdout=subprocess.PIPE,
|
|
||||||
stderr=subprocess.STDOUT,
|
|
||||||
universal_newlines=True,
|
|
||||||
bufsize=1
|
|
||||||
)
|
|
||||||
|
|
||||||
# Start the output monitoring thread
|
|
||||||
threading.Thread(
|
|
||||||
target=self._monitor_output,
|
|
||||||
args=(process, "Queue consumer"),
|
|
||||||
daemon=True
|
|
||||||
).start()
|
|
||||||
|
|
||||||
return process
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
print(f"Failed to start queue consumer: {e}")
|
|
||||||
return None
|
|
||||||
|
|
||||||
def _monitor_output(self, process: subprocess.Popen, name: str):
|
def _monitor_output(self, process: subprocess.Popen, name: str):
|
||||||
"""Monitor process output."""
|
"""Monitor process output."""
|
||||||
try:
|
try:
|
||||||
@ -138,8 +100,6 @@ class ProcessManager:
|
|||||||
|
|
||||||
if name == "API server":
|
if name == "API server":
|
||||||
new_process = self.start_api_server(args)
|
new_process = self.start_api_server(args)
|
||||||
elif name == "Queue consumer":
|
|
||||||
new_process = self.start_queue_consumer(args)
|
|
||||||
else:
|
else:
|
||||||
return False
|
return False
|
||||||
|
|
||||||
@ -169,27 +129,19 @@ class ProcessManager:
|
|||||||
print("Failed to start API server; exiting")
|
print("Failed to start API server; exiting")
|
||||||
return False
|
return False
|
||||||
|
|
||||||
queue_process = self.start_queue_consumer(args)
|
|
||||||
if not queue_process:
|
|
||||||
print("Failed to start queue consumer; exiting")
|
|
||||||
api_process.terminate()
|
|
||||||
return False
|
|
||||||
|
|
||||||
self.processes["API server"] = api_process
|
self.processes["API server"] = api_process
|
||||||
self.processes["Queue consumer"] = queue_process
|
|
||||||
|
|
||||||
print("\n" + "=" * 70)
|
print("\n" + "=" * 70)
|
||||||
print("All services started successfully!")
|
print("All services started successfully!")
|
||||||
print(f"API server: http://{args.host}:{args.port}")
|
print(f"API server: http://{args.host}:{args.port}")
|
||||||
print(f"API PID: {api_process.pid}")
|
print(f"API PID: {api_process.pid}")
|
||||||
print(f"Queue consumer PID: {queue_process.pid}")
|
|
||||||
print("Press Ctrl+C to stop all services")
|
print("Press Ctrl+C to stop all services")
|
||||||
print("=" * 70 + "\n")
|
print("=" * 70 + "\n")
|
||||||
|
|
||||||
self.running = True
|
self.running = True
|
||||||
|
|
||||||
# Main monitoring loop
|
# Main monitoring loop
|
||||||
restart_counts = {"API server": 0, "Queue consumer": 0}
|
restart_counts = {"API server": 0}
|
||||||
max_restarts = args.max_restarts
|
max_restarts = args.max_restarts
|
||||||
|
|
||||||
while self.running and not self.shutdown_event.is_set():
|
while self.running and not self.shutdown_event.is_set():
|
||||||
@ -262,7 +214,6 @@ class ProcessManager:
|
|||||||
def create_directories(self):
|
def create_directories(self):
|
||||||
"""Create the required directories."""
|
"""Create the required directories."""
|
||||||
directories = [
|
directories = [
|
||||||
"projects/queue_data",
|
|
||||||
"projects/data",
|
"projects/data",
|
||||||
"projects/uploads",
|
"projects/uploads",
|
||||||
"projects/robot",
|
"projects/robot",
|
||||||
@ -313,11 +264,6 @@ def parse_args():
|
|||||||
parser.add_argument("--log-level", type=str, default="info",
|
parser.add_argument("--log-level", type=str, default="info",
|
||||||
choices=["debug", "info", "warning", "error"], help="Log level")
|
choices=["debug", "info", "warning", "error"], help="Log level")
|
||||||
|
|
||||||
# Queue consumer configuration
|
|
||||||
parser.add_argument("--queue-workers", type=int, default=2, help="Number of queue consumer worker threads")
|
|
||||||
parser.add_argument("--worker-type", type=str, default="threads",
|
|
||||||
choices=["threads", "greenlets", "gevent"], help="Queue worker type")
|
|
||||||
|
|
||||||
# Performance profile
|
# Performance profile
|
||||||
parser.add_argument("--profile", type=str, default="low_memory",
|
parser.add_argument("--profile", type=str, default="low_memory",
|
||||||
choices=["low_memory", "balanced", "high_performance"], help="Performance profile")
|
choices=["low_memory", "balanced", "high_performance"], help="Performance profile")
|
||||||
|
|||||||
@ -1,154 +0,0 @@
|
|||||||
# 队列系统使用说明
|
|
||||||
|
|
||||||
## 概述
|
|
||||||
|
|
||||||
本项目集成了基于 huey 和 SqliteHuey 的异步队列系统,用于处理文件的异步处理任务。
|
|
||||||
|
|
||||||
## 安装依赖
|
|
||||||
|
|
||||||
```bash
|
|
||||||
pip install huey
|
|
||||||
```
|
|
||||||
|
|
||||||
## 目录结构
|
|
||||||
|
|
||||||
```
|
|
||||||
queue/
|
|
||||||
├── __init__.py # 包初始化文件
|
|
||||||
├── config.py # 队列配置(SqliteHuey配置)
|
|
||||||
├── tasks.py # 文件处理任务定义
|
|
||||||
├── manager.py # 队列管理器
|
|
||||||
├── consumer.py # 队列消费者(工作进程)
|
|
||||||
├── example.py # 使用示例
|
|
||||||
└── README.md # 说明文档
|
|
||||||
```
|
|
||||||
|
|
||||||
## 核心功能
|
|
||||||
|
|
||||||
### 1. 队列配置 (config.py)
|
|
||||||
- 使用 SqliteHuey 作为消息队列
|
|
||||||
- 数据库文件存储在 `queue_data/huey.db`
|
|
||||||
- 支持任务重试和错误存储
|
|
||||||
|
|
||||||
### 2. 文件处理任务 (tasks.py)
|
|
||||||
- `process_file_async`: 异步处理单个文件
|
|
||||||
- `process_multiple_files_async`: 批量异步处理文件
|
|
||||||
- `process_zip_file_async`: 异步处理zip压缩文件
|
|
||||||
- `cleanup_processed_files`: 清理旧的文件
|
|
||||||
|
|
||||||
### 3. 队列管理器 (manager.py)
|
|
||||||
- 任务提交和管理
|
|
||||||
- 队列状态监控
|
|
||||||
- 任务结果查询
|
|
||||||
- 任务记录清理
|
|
||||||
|
|
||||||
## 使用方法
|
|
||||||
|
|
||||||
### 1. 启动队列消费者
|
|
||||||
|
|
||||||
```bash
|
|
||||||
# 启动默认配置的消费者
|
|
||||||
python queue/consumer.py
|
|
||||||
|
|
||||||
# 指定工作线程数
|
|
||||||
python queue/consumer.py --workers 4
|
|
||||||
|
|
||||||
# 查看队列统计信息
|
|
||||||
python queue/consumer.py --stats
|
|
||||||
|
|
||||||
# 检查队列状态
|
|
||||||
python queue/consumer.py --check
|
|
||||||
|
|
||||||
# 清空队列
|
|
||||||
python queue/consumer.py --flush
|
|
||||||
```
|
|
||||||
|
|
||||||
### 2. 在代码中使用队列
|
|
||||||
|
|
||||||
```python
|
|
||||||
from queue.manager import queue_manager
|
|
||||||
|
|
||||||
# 处理单个文件
|
|
||||||
task_id = queue_manager.enqueue_file(
|
|
||||||
project_id="my_project",
|
|
||||||
file_path="/path/to/file.txt",
|
|
||||||
original_filename="myfile.txt"
|
|
||||||
)
|
|
||||||
|
|
||||||
# 批量处理文件
|
|
||||||
task_ids = queue_manager.enqueue_multiple_files(
|
|
||||||
project_id="my_project",
|
|
||||||
file_paths=["/path/file1.txt", "/path/file2.txt"],
|
|
||||||
original_filenames=["file1.txt", "file2.txt"]
|
|
||||||
)
|
|
||||||
|
|
||||||
# 处理zip文件
|
|
||||||
task_id = queue_manager.enqueue_zip_file(
|
|
||||||
project_id="my_project",
|
|
||||||
zip_path="/path/to/archive.zip"
|
|
||||||
)
|
|
||||||
|
|
||||||
# 查看任务状态
|
|
||||||
status = queue_manager.get_task_status(task_id)
|
|
||||||
print(status)
|
|
||||||
|
|
||||||
# 获取队列统计信息
|
|
||||||
stats = queue_manager.get_queue_stats()
|
|
||||||
print(stats)
|
|
||||||
```
|
|
||||||
|
|
||||||
### 3. 运行示例
|
|
||||||
|
|
||||||
```bash
|
|
||||||
python queue/example.py
|
|
||||||
```
|
|
||||||
|
|
||||||
## 配置说明
|
|
||||||
|
|
||||||
### 队列配置参数 (config.py)
|
|
||||||
- `filename`: SQLite数据库文件路径
|
|
||||||
- `always_eager`: 是否立即执行任务(开发时可设为True)
|
|
||||||
- `utc`: 是否使用UTC时间
|
|
||||||
- `compression_level`: 压缩级别
|
|
||||||
- `store_errors`: 是否存储错误信息
|
|
||||||
- `max_retries`: 最大重试次数
|
|
||||||
- `retry_delay`: 重试延迟
|
|
||||||
|
|
||||||
### 消费者参数 (consumer.py)
|
|
||||||
- `--workers`: 工作线程数(默认2)
|
|
||||||
- `--worker-type`: 工作类型(threads/greenlets/processes)
|
|
||||||
- `--stats`: 显示统计信息
|
|
||||||
- `--check`: 检查队列状态
|
|
||||||
- `--flush`: 清空队列
|
|
||||||
|
|
||||||
## 任务状态
|
|
||||||
|
|
||||||
- `pending`: 等待处理
|
|
||||||
- `running`: 正在处理
|
|
||||||
- `complete/finished`: 处理完成
|
|
||||||
- `error`: 处理失败
|
|
||||||
- `scheduled`: 定时任务
|
|
||||||
|
|
||||||
## 最佳实践
|
|
||||||
|
|
||||||
1. **生产环境建议**:
|
|
||||||
- 设置合适的工作线程数(建议CPU核心数的1-2倍)
|
|
||||||
- 定期清理旧的任务记录
|
|
||||||
- 监控队列状态和任务执行情况
|
|
||||||
|
|
||||||
2. **开发环境建议**:
|
|
||||||
- 可以设置 `always_eager=True` 立即执行任务进行调试
|
|
||||||
- 使用 `--check` 参数查看队列状态
|
|
||||||
- 运行示例代码了解功能
|
|
||||||
|
|
||||||
3. **错误处理**:
|
|
||||||
- 任务失败后会自动重试(最多3次)
|
|
||||||
- 错误信息会存储在数据库中
|
|
||||||
- 可以通过 `get_task_status()` 查看错误详情
|
|
||||||
|
|
||||||
## 故障排除
|
|
||||||
|
|
||||||
1. **数据库锁定**: 确保只有一个消费者实例在运行
|
|
||||||
2. **任务卡住**: 检查文件路径和权限
|
|
||||||
3. **内存不足**: 调整工作线程数或使用进程模式
|
|
||||||
4. **磁盘空间**: 定期清理旧文件和任务记录
|
|
||||||
@ -1,23 +0,0 @@
|
|||||||
#!/usr/bin/env python3
|
|
||||||
"""
|
|
||||||
Queue package initialization.
|
|
||||||
"""
|
|
||||||
|
|
||||||
from .config import huey
|
|
||||||
from .manager import QueueManager, queue_manager
|
|
||||||
from .tasks import (
|
|
||||||
process_file_async,
|
|
||||||
process_multiple_files_async,
|
|
||||||
process_zip_file_async,
|
|
||||||
cleanup_processed_files
|
|
||||||
)
|
|
||||||
|
|
||||||
__all__ = [
|
|
||||||
"huey",
|
|
||||||
"QueueManager",
|
|
||||||
"queue_manager",
|
|
||||||
"process_file_async",
|
|
||||||
"process_multiple_files_async",
|
|
||||||
"process_zip_file_async",
|
|
||||||
"cleanup_processed_files"
|
|
||||||
]
|
|
||||||
@ -1,31 +0,0 @@
|
|||||||
#!/usr/bin/env python3
|
|
||||||
"""
|
|
||||||
Queue configuration using SqliteHuey for asynchronous file processing.
|
|
||||||
"""
|
|
||||||
|
|
||||||
import os
|
|
||||||
import logging
|
|
||||||
from huey import SqliteHuey
|
|
||||||
from datetime import timedelta
|
|
||||||
|
|
||||||
# Configure logging
|
|
||||||
logger = logging.getLogger('app')
|
|
||||||
|
|
||||||
# Ensure projects/queue_data directory exists
|
|
||||||
queue_data_dir = os.path.join(os.path.dirname(__file__), '..', 'projects', 'queue_data')
|
|
||||||
os.makedirs(queue_data_dir, exist_ok=True)
|
|
||||||
|
|
||||||
# Initialize SqliteHuey
|
|
||||||
huey = SqliteHuey(
|
|
||||||
filename=os.path.join(queue_data_dir, 'huey.db'),
|
|
||||||
name='file_processor', # Queue name
|
|
||||||
always_eager=False, # Set to False to enable async processing
|
|
||||||
utc=True, # Use UTC time
|
|
||||||
)
|
|
||||||
|
|
||||||
# Set default task configuration
|
|
||||||
huey.store_errors = True # Store error information
|
|
||||||
huey.max_retries = 3 # Maximum retry count
|
|
||||||
huey.retry_delay = timedelta(seconds=60) # Retry delay
|
|
||||||
|
|
||||||
logger.info(f"SqliteHuey queue initialized, database path: {os.path.join(queue_data_dir, 'huey.db')}")
|
|
||||||
@ -1,171 +0,0 @@
|
|||||||
#!/usr/bin/env python3
|
|
||||||
"""
|
|
||||||
Queue consumer for processing file tasks.
|
|
||||||
"""
|
|
||||||
|
|
||||||
import sys
|
|
||||||
import os
|
|
||||||
import time
|
|
||||||
import signal
|
|
||||||
import argparse
|
|
||||||
from pathlib import Path
|
|
||||||
|
|
||||||
# Add project root directory to Python path
|
|
||||||
project_root = Path(__file__).parent.parent
|
|
||||||
sys.path.insert(0, str(project_root))
|
|
||||||
|
|
||||||
from task_queue.config import huey
|
|
||||||
from task_queue.manager import queue_manager
|
|
||||||
from task_queue.integration_tasks import process_files_async, cleanup_project_async
|
|
||||||
from huey.consumer import Consumer
|
|
||||||
|
|
||||||
|
|
||||||
class QueueConsumer:
|
|
||||||
"""Queue consumer for processing async tasks."""
|
|
||||||
|
|
||||||
def __init__(self, worker_type: str = "threads", workers: int = 2):
|
|
||||||
self.huey = huey
|
|
||||||
self.worker_type = worker_type
|
|
||||||
self.workers = workers
|
|
||||||
self.running = False
|
|
||||||
self.consumer = None
|
|
||||||
|
|
||||||
# Register signal handlers
|
|
||||||
signal.signal(signal.SIGINT, self._signal_handler)
|
|
||||||
signal.signal(signal.SIGTERM, self._signal_handler)
|
|
||||||
|
|
||||||
def _signal_handler(self, signum, frame):
|
|
||||||
"""Signal handler for graceful shutdown."""
|
|
||||||
print(f"\nReceived signal {signum}, shutting down queue consumer...")
|
|
||||||
self.running = False
|
|
||||||
|
|
||||||
def start(self):
|
|
||||||
"""Start the queue consumer."""
|
|
||||||
print(f"Starting queue consumer...")
|
|
||||||
print(f"Worker threads: {self.workers}")
|
|
||||||
print(f"Worker type: {self.worker_type}")
|
|
||||||
print(f"Database: {os.path.join(os.path.dirname(__file__), '..', 'projects', 'queue_data', 'huey.db')}")
|
|
||||||
print("Press Ctrl+C to stop the consumer")
|
|
||||||
|
|
||||||
self.running = True
|
|
||||||
|
|
||||||
try:
|
|
||||||
# Create Huey consumer
|
|
||||||
self.consumer = Consumer(self.huey, workers=self.workers, worker_type=self.worker_type.rstrip('s'))
|
|
||||||
|
|
||||||
# Display queue statistics
|
|
||||||
stats = queue_manager.get_queue_stats()
|
|
||||||
print(f"Current queue status: {stats}")
|
|
||||||
|
|
||||||
# Start consumer run loop
|
|
||||||
print("Consumer starting task processing...")
|
|
||||||
self.consumer.run()
|
|
||||||
|
|
||||||
except KeyboardInterrupt:
|
|
||||||
print("\nReceived interrupt signal, shutting down...")
|
|
||||||
except Exception as e:
|
|
||||||
print(f"Queue consumer runtime error: {str(e)}")
|
|
||||||
finally:
|
|
||||||
self.stop()
|
|
||||||
|
|
||||||
def stop(self):
|
|
||||||
"""Stop the queue consumer."""
|
|
||||||
print("Stopping queue consumer...")
|
|
||||||
try:
|
|
||||||
if self.consumer:
|
|
||||||
# Stop the consumer
|
|
||||||
self.consumer.stop()
|
|
||||||
self.consumer = None
|
|
||||||
print("Queue consumer stopped")
|
|
||||||
except Exception as e:
|
|
||||||
print(f"Error stopping queue consumer: {str(e)}")
|
|
||||||
|
|
||||||
def process_scheduled_tasks(self):
|
|
||||||
"""Process scheduled tasks."""
|
|
||||||
print("Processing scheduled tasks...")
|
|
||||||
# Additional scheduled task processing logic can be added here
|
|
||||||
|
|
||||||
|
|
||||||
def main():
|
|
||||||
"""Main entry point."""
|
|
||||||
parser = argparse.ArgumentParser(description="File processing queue consumer")
|
|
||||||
parser.add_argument(
|
|
||||||
"--workers",
|
|
||||||
type=int,
|
|
||||||
default=2,
|
|
||||||
help="Number of worker threads (default: 2)"
|
|
||||||
)
|
|
||||||
parser.add_argument(
|
|
||||||
"--worker-type",
|
|
||||||
choices=["threads", "greenlets", "processes"],
|
|
||||||
default="threads",
|
|
||||||
help="Worker thread type (default: threads)"
|
|
||||||
)
|
|
||||||
parser.add_argument(
|
|
||||||
"--stats",
|
|
||||||
action="store_true",
|
|
||||||
help="Display queue statistics and exit"
|
|
||||||
)
|
|
||||||
parser.add_argument(
|
|
||||||
"--flush",
|
|
||||||
action="store_true",
|
|
||||||
help="Flush the queue and exit"
|
|
||||||
)
|
|
||||||
parser.add_argument(
|
|
||||||
"--check",
|
|
||||||
action="store_true",
|
|
||||||
help="Check queue status and exit"
|
|
||||||
)
|
|
||||||
|
|
||||||
args = parser.parse_args()
|
|
||||||
|
|
||||||
# Initialize consumer
|
|
||||||
consumer = QueueConsumer(
|
|
||||||
worker_type=args.worker_type,
|
|
||||||
workers=args.workers
|
|
||||||
)
|
|
||||||
|
|
||||||
# Handle different command-line options
|
|
||||||
if args.stats:
|
|
||||||
print("=== Queue Statistics ===")
|
|
||||||
stats = queue_manager.get_queue_stats()
|
|
||||||
print(f"Total tasks: {stats.get('total_tasks', 0)}")
|
|
||||||
print(f"Pending tasks: {stats.get('pending_tasks', 0)}")
|
|
||||||
print(f"Running tasks: {stats.get('running_tasks', 0)}")
|
|
||||||
print(f"Completed tasks: {stats.get('completed_tasks', 0)}")
|
|
||||||
print(f"Error tasks: {stats.get('error_tasks', 0)}")
|
|
||||||
print(f"Scheduled tasks: {stats.get('scheduled_tasks', 0)}")
|
|
||||||
print(f"Database: {stats.get('queue_database', 'N/A')}")
|
|
||||||
return
|
|
||||||
|
|
||||||
if args.flush:
|
|
||||||
print("=== Flushing Queue ===")
|
|
||||||
try:
|
|
||||||
# Flush all tasks
|
|
||||||
consumer.huey.flush()
|
|
||||||
print("Queue flushed")
|
|
||||||
except Exception as e:
|
|
||||||
print(f"Failed to flush queue: {str(e)}")
|
|
||||||
return
|
|
||||||
|
|
||||||
if args.check:
|
|
||||||
print("=== Checking Queue Status ===")
|
|
||||||
stats = queue_manager.get_queue_stats()
|
|
||||||
print(f"Queue status: OK" if "error" not in stats else f"Queue status: ERROR - {stats['error']}")
|
|
||||||
|
|
||||||
pending_tasks = queue_manager.list_pending_tasks(limit=10)
|
|
||||||
if pending_tasks:
|
|
||||||
print(f"\nPending tasks (showing up to 10):")
|
|
||||||
for task in pending_tasks:
|
|
||||||
print(f" Task ID: {task['task_id']}, Status: {task['status']}, Created: {task['created_time']}")
|
|
||||||
else:
|
|
||||||
print("No pending tasks")
|
|
||||||
return
|
|
||||||
|
|
||||||
# Start consumer
|
|
||||||
print("=== Starting File Processing Queue Consumer ===")
|
|
||||||
consumer.start()
|
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
main()
|
|
||||||
@ -1,132 +0,0 @@
|
|||||||
#!/usr/bin/env python3
|
|
||||||
"""
|
|
||||||
Example usage of the queue system.
|
|
||||||
"""
|
|
||||||
|
|
||||||
import sys
|
|
||||||
import time
|
|
||||||
from pathlib import Path
|
|
||||||
|
|
||||||
# Add project root directory to Python path
|
|
||||||
project_root = Path(__file__).parent.parent
|
|
||||||
sys.path.insert(0, str(project_root))
|
|
||||||
|
|
||||||
from task_queue.manager import queue_manager
|
|
||||||
from task_queue.tasks import process_file_async, process_multiple_files_async
|
|
||||||
|
|
||||||
|
|
||||||
def example_single_file():
|
|
||||||
"""Example: Process a single file."""
|
|
||||||
print("=== Example: Process a single file ===")
|
|
||||||
|
|
||||||
project_id = "test_project"
|
|
||||||
file_path = "public/test_document.txt"
|
|
||||||
|
|
||||||
# Enqueue file for processing
|
|
||||||
task_id = queue_manager.enqueue_file(
|
|
||||||
project_id=project_id,
|
|
||||||
file_path=file_path,
|
|
||||||
original_filename="example_document.txt"
|
|
||||||
)
|
|
||||||
|
|
||||||
print(f"Task submitted, task ID: {task_id}")
|
|
||||||
|
|
||||||
# Check task status
|
|
||||||
time.sleep(2)
|
|
||||||
status = queue_manager.get_task_status(task_id)
|
|
||||||
print(f"Task status: {status}")
|
|
||||||
|
|
||||||
|
|
||||||
def example_multiple_files():
|
|
||||||
"""Example: Batch process files."""
|
|
||||||
print("\n=== Example: Batch process files ===")
|
|
||||||
|
|
||||||
project_id = "test_project_batch"
|
|
||||||
file_paths = [
|
|
||||||
"public/test_document.txt",
|
|
||||||
"public/goods.xlsx" # Assuming this file exists
|
|
||||||
]
|
|
||||||
original_filenames = [
|
|
||||||
"batch_document_1.txt",
|
|
||||||
"batch_goods.xlsx"
|
|
||||||
]
|
|
||||||
|
|
||||||
# Enqueue multiple files for processing
|
|
||||||
task_ids = queue_manager.enqueue_multiple_files(
|
|
||||||
project_id=project_id,
|
|
||||||
file_paths=file_paths,
|
|
||||||
original_filenames=original_filenames
|
|
||||||
)
|
|
||||||
|
|
||||||
print(f"Batch tasks submitted, task IDs: {task_ids}")
|
|
||||||
|
|
||||||
|
|
||||||
def example_zip_file():
|
|
||||||
"""Example: Process a zip file."""
|
|
||||||
print("\n=== Example: Process a zip file ===")
|
|
||||||
|
|
||||||
project_id = "test_project_zip"
|
|
||||||
zip_path = "public/all_hp_product_spec_book2506.zip"
|
|
||||||
|
|
||||||
# Enqueue zip file for processing
|
|
||||||
task_id = queue_manager.enqueue_zip_file(
|
|
||||||
project_id=project_id,
|
|
||||||
zip_path=zip_path
|
|
||||||
)
|
|
||||||
|
|
||||||
print(f"Zip task submitted, task ID: {task_id}")
|
|
||||||
|
|
||||||
|
|
||||||
def example_queue_stats():
|
|
||||||
"""Example: Get queue statistics."""
|
|
||||||
print("\n=== Example: Queue statistics ===")
|
|
||||||
|
|
||||||
stats = queue_manager.get_queue_stats()
|
|
||||||
print("Queue statistics:")
|
|
||||||
for key, value in stats.items():
|
|
||||||
if key != "recent_tasks":
|
|
||||||
print(f" {key}: {value}")
|
|
||||||
|
|
||||||
|
|
||||||
def example_cleanup():
|
|
||||||
"""Example: Cleanup tasks."""
|
|
||||||
print("\n=== Example: Cleanup tasks ===")
|
|
||||||
|
|
||||||
project_id = "test_project"
|
|
||||||
|
|
||||||
# Enqueue cleanup task (delayed 10 seconds)
|
|
||||||
task_id = queue_manager.enqueue_cleanup_task(
|
|
||||||
project_id=project_id,
|
|
||||||
older_than_days=1, # Clean files older than 1 day
|
|
||||||
delay=10
|
|
||||||
)
|
|
||||||
|
|
||||||
print(f"Cleanup task submitted, task ID: {task_id}")
|
|
||||||
|
|
||||||
|
|
||||||
def main():
|
|
||||||
"""Main entry point."""
|
|
||||||
print("Queue System Usage Examples")
|
|
||||||
print("=" * 50)
|
|
||||||
|
|
||||||
try:
|
|
||||||
# Run examples
|
|
||||||
example_single_file()
|
|
||||||
example_multiple_files()
|
|
||||||
example_zip_file()
|
|
||||||
example_queue_stats()
|
|
||||||
example_cleanup()
|
|
||||||
|
|
||||||
print("\n" + "=" * 50)
|
|
||||||
print("Examples completed!")
|
|
||||||
print("\nTo check task execution, run:")
|
|
||||||
print("python queue/consumer.py --check")
|
|
||||||
print("\nTo start the queue consumer, run:")
|
|
||||||
print("python queue/consumer.py")
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
print(f"Error running examples: {str(e)}")
|
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
main()
|
|
||||||
@ -1,499 +0,0 @@
|
|||||||
#!/usr/bin/env python3
|
|
||||||
"""
|
|
||||||
Queue tasks for file processing integration.
|
|
||||||
"""
|
|
||||||
|
|
||||||
import os
|
|
||||||
import json
|
|
||||||
import time
|
|
||||||
import hashlib
|
|
||||||
import shutil
|
|
||||||
from typing import Dict, List, Optional, Any
|
|
||||||
|
|
||||||
from task_queue.config import huey
|
|
||||||
from task_queue.manager import queue_manager
|
|
||||||
from task_queue.task_status import task_status_store
|
|
||||||
from utils import download_dataset_files, save_processed_files_log, load_processed_files_log
|
|
||||||
from utils.dataset_manager import remove_dataset_directory_by_key
|
|
||||||
|
|
||||||
|
|
||||||
def scan_upload_folder(upload_dir: str) -> List[str]:
|
|
||||||
"""
|
|
||||||
Scan all supported file formats in the upload folder.
|
|
||||||
|
|
||||||
Args:
|
|
||||||
upload_dir: Upload folder path
|
|
||||||
|
|
||||||
Returns:
|
|
||||||
List[str]: List of supported file paths
|
|
||||||
"""
|
|
||||||
supported_extensions = {
|
|
||||||
# Text files
|
|
||||||
'.txt', '.md', '.rtf',
|
|
||||||
# Document files
|
|
||||||
'.doc', '.docx', '.pdf', '.odt',
|
|
||||||
# Spreadsheet files
|
|
||||||
'.xls', '.xlsx', '.csv', '.ods',
|
|
||||||
# Presentation files
|
|
||||||
'.ppt', '.pptx', '.odp',
|
|
||||||
# E-books
|
|
||||||
'.epub', '.mobi',
|
|
||||||
# Web files
|
|
||||||
'.html', '.htm',
|
|
||||||
# Config files
|
|
||||||
'.json', '.xml', '.yaml', '.yml',
|
|
||||||
# Code files
|
|
||||||
'.py', '.js', '.java', '.cpp', '.c', '.go', '.rs',
|
|
||||||
# Archive files
|
|
||||||
'.zip', '.rar', '.7z', '.tar', '.gz'
|
|
||||||
}
|
|
||||||
|
|
||||||
scanned_files = []
|
|
||||||
|
|
||||||
if not os.path.exists(upload_dir):
|
|
||||||
return scanned_files
|
|
||||||
|
|
||||||
for root, dirs, files in os.walk(upload_dir):
|
|
||||||
for file in files:
|
|
||||||
# Skip hidden files and system files
|
|
||||||
if file.startswith('.') or file.startswith('~'):
|
|
||||||
continue
|
|
||||||
|
|
||||||
file_path = os.path.join(root, file)
|
|
||||||
file_extension = os.path.splitext(file)[1].lower()
|
|
||||||
|
|
||||||
# Check if file extension is supported
|
|
||||||
if file_extension in supported_extensions:
|
|
||||||
scanned_files.append(file_path)
|
|
||||||
else:
|
|
||||||
# For files without extension, try to process them (may be text files)
|
|
||||||
if not file_extension:
|
|
||||||
try:
|
|
||||||
# Try reading the file header to determine if it's a text file
|
|
||||||
with open(file_path, 'r', encoding='utf-8') as f:
|
|
||||||
f.read(1024) # Read the first 1KB
|
|
||||||
scanned_files.append(file_path)
|
|
||||||
except (UnicodeDecodeError, PermissionError):
|
|
||||||
# Not a text file or unreadable, skip
|
|
||||||
pass
|
|
||||||
|
|
||||||
return scanned_files
|
|
||||||
|
|
||||||
|
|
||||||
@huey.task()
|
|
||||||
def process_files_async(
|
|
||||||
dataset_id: str,
|
|
||||||
files: Optional[Dict[str, List[str]]] = None,
|
|
||||||
upload_folder: Optional[Dict[str, str]] = None,
|
|
||||||
task_id: Optional[str] = None
|
|
||||||
) -> Dict[str, Any]:
|
|
||||||
"""
|
|
||||||
Asynchronously process file tasks - compatible with existing files/process API.
|
|
||||||
|
|
||||||
Args:
|
|
||||||
dataset_id: Unique project ID
|
|
||||||
files: Dictionary of file paths grouped by key
|
|
||||||
upload_folder: Upload folder dictionary organized by group name, e.g. {'group1': 'my_project1', 'group2': 'my_project2'}
|
|
||||||
task_id: Task ID (for status tracking)
|
|
||||||
|
|
||||||
Returns:
|
|
||||||
Processing result dictionary
|
|
||||||
"""
|
|
||||||
try:
|
|
||||||
print(f"Starting async file processing task, project ID: {dataset_id}")
|
|
||||||
|
|
||||||
# If task_id is provided, set initial status
|
|
||||||
if task_id:
|
|
||||||
task_status_store.set_status(
|
|
||||||
task_id=task_id,
|
|
||||||
unique_id=dataset_id,
|
|
||||||
status="running"
|
|
||||||
)
|
|
||||||
|
|
||||||
# Ensure project directory exists
|
|
||||||
project_dir = os.path.join("projects", "data", dataset_id)
|
|
||||||
if not os.path.exists(project_dir):
|
|
||||||
os.makedirs(project_dir, exist_ok=True)
|
|
||||||
|
|
||||||
# Process files: use key-grouped format
|
|
||||||
processed_files_by_key = {}
|
|
||||||
|
|
||||||
# If upload_folder is provided, scan files in those folders
|
|
||||||
if upload_folder and not files:
|
|
||||||
scanned_files_by_group = {}
|
|
||||||
total_scanned_files = 0
|
|
||||||
|
|
||||||
for group_name, folder_name in upload_folder.items():
|
|
||||||
# Security check: prevent path traversal attacks
|
|
||||||
safe_folder_name = os.path.basename(folder_name)
|
|
||||||
upload_dir = os.path.join("projects", "uploads", safe_folder_name)
|
|
||||||
|
|
||||||
if os.path.exists(upload_dir):
|
|
||||||
scanned_files = scan_upload_folder(upload_dir)
|
|
||||||
if scanned_files:
|
|
||||||
scanned_files_by_group[group_name] = scanned_files
|
|
||||||
total_scanned_files += len(scanned_files)
|
|
||||||
print(f"Scanned {len(scanned_files)} files from upload folder '{safe_folder_name}' (group: {group_name})")
|
|
||||||
else:
|
|
||||||
print(f"No supported files found in upload folder '{safe_folder_name}' (group: {group_name})")
|
|
||||||
else:
|
|
||||||
print(f"Upload folder does not exist: {upload_dir} (group: {group_name})")
|
|
||||||
|
|
||||||
if scanned_files_by_group:
|
|
||||||
files = scanned_files_by_group
|
|
||||||
print(f"Total scanned {total_scanned_files} files from {len(scanned_files_by_group)} groups")
|
|
||||||
else:
|
|
||||||
print("No supported files found in any upload folder")
|
|
||||||
|
|
||||||
if files:
|
|
||||||
# Use files from the request (grouped by key)
|
|
||||||
# Since this is an async task, call synchronously
|
|
||||||
import asyncio
|
|
||||||
try:
|
|
||||||
loop = asyncio.get_event_loop()
|
|
||||||
except RuntimeError:
|
|
||||||
loop = asyncio.new_event_loop()
|
|
||||||
asyncio.set_event_loop(loop)
|
|
||||||
|
|
||||||
processed_files_by_key = loop.run_until_complete(download_dataset_files(dataset_id, files))
|
|
||||||
total_files = sum(len(files_list) for files_list in processed_files_by_key.values())
|
|
||||||
print(f"Async processed {total_files} dataset files across {len(processed_files_by_key)} keys, project ID: {dataset_id}")
|
|
||||||
else:
|
|
||||||
print(f"No files provided in request, project ID: {dataset_id}")
|
|
||||||
|
|
||||||
# Collect all document.txt files in the project directory
|
|
||||||
document_files = []
|
|
||||||
for root, dirs, files_list in os.walk(project_dir):
|
|
||||||
for file in files_list:
|
|
||||||
if file == "document.txt":
|
|
||||||
document_files.append(os.path.join(root, file))
|
|
||||||
|
|
||||||
# Generate project README.md file
|
|
||||||
try:
|
|
||||||
from utils.project_manager import save_project_readme
|
|
||||||
save_project_readme(dataset_id)
|
|
||||||
print(f"README.md generated, project ID: {dataset_id}")
|
|
||||||
except Exception as e:
|
|
||||||
print(f"Failed to generate README.md, project ID: {dataset_id}, error: {str(e)}")
|
|
||||||
# Does not affect main processing flow, continue
|
|
||||||
|
|
||||||
# Build result file list
|
|
||||||
result_files = []
|
|
||||||
for key in processed_files_by_key.keys():
|
|
||||||
# Add corresponding dataset document.txt path
|
|
||||||
document_path = os.path.join("projects", "data", dataset_id, "datasets", key, "document.txt")
|
|
||||||
if os.path.exists(document_path):
|
|
||||||
result_files.append(document_path)
|
|
||||||
|
|
||||||
# Also add document.txt files that exist but are not in processed_files_by_key
|
|
||||||
existing_document_paths = set(result_files) # Avoid duplicates
|
|
||||||
for doc_file in document_files:
|
|
||||||
if doc_file not in existing_document_paths:
|
|
||||||
result_files.append(doc_file)
|
|
||||||
|
|
||||||
result = {
|
|
||||||
"status": "success",
|
|
||||||
"message": f"Successfully processed {len(result_files)} document files across {len(processed_files_by_key)} keys",
|
|
||||||
"dataset_id": dataset_id,
|
|
||||||
"processed_files": result_files,
|
|
||||||
"processed_files_by_key": processed_files_by_key,
|
|
||||||
"document_files": document_files,
|
|
||||||
"total_files_processed": sum(len(files_list) for files_list in processed_files_by_key.values()),
|
|
||||||
"processing_time": time.time()
|
|
||||||
}
|
|
||||||
|
|
||||||
# Update task status to completed
|
|
||||||
if task_id:
|
|
||||||
task_status_store.update_status(
|
|
||||||
task_id=task_id,
|
|
||||||
status="completed",
|
|
||||||
result=result
|
|
||||||
)
|
|
||||||
|
|
||||||
print(f"Async file processing task completed: {dataset_id}")
|
|
||||||
return result
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
error_msg = f"Error during async file processing: {str(e)}"
|
|
||||||
print(error_msg)
|
|
||||||
|
|
||||||
# Update task status to error
|
|
||||||
if task_id:
|
|
||||||
task_status_store.update_status(
|
|
||||||
task_id=task_id,
|
|
||||||
status="failed",
|
|
||||||
error=error_msg
|
|
||||||
)
|
|
||||||
|
|
||||||
return {
|
|
||||||
"status": "error",
|
|
||||||
"message": error_msg,
|
|
||||||
"dataset_id": dataset_id,
|
|
||||||
"error": str(e)
|
|
||||||
}
|
|
||||||
|
|
||||||
|
|
||||||
@huey.task()
|
|
||||||
def process_files_incremental_async(
|
|
||||||
dataset_id: str,
|
|
||||||
files_to_add: Optional[Dict[str, List[str]]] = None,
|
|
||||||
files_to_remove: Optional[Dict[str, List[str]]] = None,
|
|
||||||
system_prompt: Optional[str] = None,
|
|
||||||
mcp_settings: Optional[List[Dict]] = None,
|
|
||||||
task_id: Optional[str] = None
|
|
||||||
) -> Dict[str, Any]:
|
|
||||||
"""
|
|
||||||
Incremental file processing task - supports adding and removing files.
|
|
||||||
|
|
||||||
Args:
|
|
||||||
dataset_id: Unique project ID
|
|
||||||
files_to_add: Dictionary of file paths to add, grouped by key
|
|
||||||
files_to_remove: Dictionary of file paths to remove, grouped by key
|
|
||||||
system_prompt: System prompt
|
|
||||||
mcp_settings: MCP settings
|
|
||||||
task_id: Task ID (for status tracking)
|
|
||||||
|
|
||||||
Returns:
|
|
||||||
Processing result dictionary
|
|
||||||
"""
|
|
||||||
try:
|
|
||||||
print(f"Starting incremental file processing task, project ID: {dataset_id}")
|
|
||||||
|
|
||||||
# If task_id is provided, set initial status
|
|
||||||
if task_id:
|
|
||||||
task_status_store.set_status(
|
|
||||||
task_id=task_id,
|
|
||||||
unique_id=dataset_id,
|
|
||||||
status="running"
|
|
||||||
)
|
|
||||||
|
|
||||||
# Ensure project directory exists
|
|
||||||
project_dir = os.path.join("projects", "data", dataset_id)
|
|
||||||
if not os.path.exists(project_dir):
|
|
||||||
os.makedirs(project_dir, exist_ok=True)
|
|
||||||
|
|
||||||
# Load existing processing log
|
|
||||||
processed_log = load_processed_files_log(dataset_id)
|
|
||||||
print(f"Loaded existing processing log with {len(processed_log)} file records")
|
|
||||||
|
|
||||||
removed_files = []
|
|
||||||
added_files = []
|
|
||||||
|
|
||||||
# 1. Process removals
|
|
||||||
if files_to_remove:
|
|
||||||
print(f"Starting removal processing across {len(files_to_remove)} key groups")
|
|
||||||
for key, file_list in files_to_remove.items():
|
|
||||||
if not file_list: # If file list is empty, remove the entire key group
|
|
||||||
print(f"Removing entire key group: {key}")
|
|
||||||
if remove_dataset_directory_by_key(dataset_id, key):
|
|
||||||
removed_files.append(f"dataset/{key}")
|
|
||||||
|
|
||||||
# Remove all records for this key from the processing log
|
|
||||||
keys_to_remove = [file_hash for file_hash, file_info in processed_log.items()
|
|
||||||
if file_info.get('key') == key]
|
|
||||||
for file_hash in keys_to_remove:
|
|
||||||
del processed_log[file_hash]
|
|
||||||
removed_files.append(f"log_entry:{file_hash}")
|
|
||||||
else:
|
|
||||||
# Remove specific files
|
|
||||||
for file_path in file_list:
|
|
||||||
print(f"Removing specific file: {key}/{file_path}")
|
|
||||||
|
|
||||||
# Actually delete the file
|
|
||||||
filename = os.path.basename(file_path)
|
|
||||||
|
|
||||||
# Delete original file
|
|
||||||
source_file = os.path.join("projects", "data", dataset_id, "files", key, filename)
|
|
||||||
if os.path.exists(source_file):
|
|
||||||
os.remove(source_file)
|
|
||||||
removed_files.append(f"file:{key}/{filename}")
|
|
||||||
|
|
||||||
# Delete processed file directory
|
|
||||||
processed_dir = os.path.join("projects", "data", dataset_id, "processed", key, filename)
|
|
||||||
if os.path.exists(processed_dir):
|
|
||||||
shutil.rmtree(processed_dir)
|
|
||||||
removed_files.append(f"processed:{key}/{filename}")
|
|
||||||
|
|
||||||
# Compute file hash to find in log
|
|
||||||
file_hash = hashlib.md5(file_path.encode('utf-8')).hexdigest()
|
|
||||||
|
|
||||||
# Remove from processing log
|
|
||||||
if file_hash in processed_log:
|
|
||||||
del processed_log[file_hash]
|
|
||||||
removed_files.append(f"log_entry:{file_hash}")
|
|
||||||
|
|
||||||
# 2. Process additions
|
|
||||||
processed_files_by_key = {}
|
|
||||||
if files_to_add:
|
|
||||||
print(f"Starting addition processing across {len(files_to_add)} key groups")
|
|
||||||
# Use async processing to download files
|
|
||||||
import asyncio
|
|
||||||
try:
|
|
||||||
loop = asyncio.get_event_loop()
|
|
||||||
except RuntimeError:
|
|
||||||
loop = asyncio.new_event_loop()
|
|
||||||
asyncio.set_event_loop(loop)
|
|
||||||
|
|
||||||
processed_files_by_key = loop.run_until_complete(download_dataset_files(dataset_id, files_to_add, incremental_mode=True))
|
|
||||||
total_added_files = sum(len(files_list) for files_list in processed_files_by_key.values())
|
|
||||||
print(f"Async processed {total_added_files} dataset files across {len(processed_files_by_key)} keys, project ID: {dataset_id}")
|
|
||||||
|
|
||||||
# Record added files
|
|
||||||
for key, files_list in processed_files_by_key.items():
|
|
||||||
for file_path in files_list:
|
|
||||||
added_files.append(f"{key}/{file_path}")
|
|
||||||
else:
|
|
||||||
print(f"No files to add provided in request, project ID: {dataset_id}")
|
|
||||||
|
|
||||||
# Save updated processing log
|
|
||||||
save_processed_files_log(dataset_id, processed_log)
|
|
||||||
print(f"Updated processing log, now contains {len(processed_log)} file records")
|
|
||||||
|
|
||||||
# Save system_prompt and mcp_settings to project directory (if provided)
|
|
||||||
if system_prompt:
|
|
||||||
system_prompt_file = os.path.join(project_dir, "system_prompt.md")
|
|
||||||
with open(system_prompt_file, 'w', encoding='utf-8') as f:
|
|
||||||
f.write(system_prompt)
|
|
||||||
print(f"Saved system_prompt, project ID: {dataset_id}")
|
|
||||||
|
|
||||||
if mcp_settings:
|
|
||||||
mcp_settings_file = os.path.join(project_dir, "mcp_settings.json")
|
|
||||||
with open(mcp_settings_file, 'w', encoding='utf-8') as f:
|
|
||||||
json.dump(mcp_settings, f, ensure_ascii=False, indent=2)
|
|
||||||
print(f"Saved mcp_settings, project ID: {dataset_id}")
|
|
||||||
|
|
||||||
# Generate project README.md file
|
|
||||||
try:
|
|
||||||
from utils.project_manager import save_project_readme
|
|
||||||
save_project_readme(dataset_id)
|
|
||||||
print(f"README.md generated, project ID: {dataset_id}")
|
|
||||||
except Exception as e:
|
|
||||||
print(f"Failed to generate README.md, project ID: {dataset_id}, error: {str(e)}")
|
|
||||||
# Does not affect main processing flow, continue
|
|
||||||
|
|
||||||
# Collect all document.txt files in the project directory
|
|
||||||
document_files = []
|
|
||||||
for root, dirs, files_list in os.walk(project_dir):
|
|
||||||
for file in files_list:
|
|
||||||
if file == "document.txt":
|
|
||||||
document_files.append(os.path.join(root, file))
|
|
||||||
|
|
||||||
# Build result file list
|
|
||||||
result_files = []
|
|
||||||
for key in processed_files_by_key.keys():
|
|
||||||
# Add corresponding dataset document.txt path
|
|
||||||
document_path = os.path.join("projects", "data", dataset_id, "datasets", key, "document.txt")
|
|
||||||
if os.path.exists(document_path):
|
|
||||||
result_files.append(document_path)
|
|
||||||
|
|
||||||
# Also add document.txt files that exist but are not in processed_files_by_key
|
|
||||||
existing_document_paths = set(result_files) # Avoid duplicates
|
|
||||||
for doc_file in document_files:
|
|
||||||
if doc_file not in existing_document_paths:
|
|
||||||
result_files.append(doc_file)
|
|
||||||
|
|
||||||
result = {
|
|
||||||
"status": "success",
|
|
||||||
"message": f"Incremental processing complete - added {len(added_files)} files, removed {len(removed_files)} files, {len(result_files)} document files remaining",
|
|
||||||
"dataset_id": dataset_id,
|
|
||||||
"removed_files": removed_files,
|
|
||||||
"added_files": added_files,
|
|
||||||
"processed_files": result_files,
|
|
||||||
"processed_files_by_key": processed_files_by_key,
|
|
||||||
"document_files": document_files,
|
|
||||||
"total_files_added": sum(len(files_list) for files_list in processed_files_by_key.values()),
|
|
||||||
"total_files_removed": len(removed_files),
|
|
||||||
"final_files_count": len(result_files),
|
|
||||||
"processing_time": time.time()
|
|
||||||
}
|
|
||||||
|
|
||||||
# Update task status to completed
|
|
||||||
if task_id:
|
|
||||||
task_status_store.update_status(
|
|
||||||
task_id=task_id,
|
|
||||||
status="completed",
|
|
||||||
result=result
|
|
||||||
)
|
|
||||||
|
|
||||||
print(f"Incremental file processing task completed: {dataset_id}")
|
|
||||||
return result
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
error_msg = f"Error during incremental file processing: {str(e)}"
|
|
||||||
print(error_msg)
|
|
||||||
|
|
||||||
# Update task status to error
|
|
||||||
if task_id:
|
|
||||||
task_status_store.update_status(
|
|
||||||
task_id=task_id,
|
|
||||||
status="failed",
|
|
||||||
error=error_msg
|
|
||||||
)
|
|
||||||
|
|
||||||
return {
|
|
||||||
"status": "error",
|
|
||||||
"message": error_msg,
|
|
||||||
"dataset_id": dataset_id,
|
|
||||||
"error": str(e)
|
|
||||||
}
|
|
||||||
|
|
||||||
|
|
||||||
@huey.task()
|
|
||||||
def cleanup_project_async(
|
|
||||||
dataset_id: str,
|
|
||||||
remove_all: bool = False
|
|
||||||
) -> Dict[str, Any]:
|
|
||||||
"""
|
|
||||||
Asynchronously clean up project files.
|
|
||||||
|
|
||||||
Args:
|
|
||||||
dataset_id: Unique project ID
|
|
||||||
remove_all: Whether to remove the entire project directory
|
|
||||||
|
|
||||||
Returns:
|
|
||||||
Cleanup result dictionary
|
|
||||||
"""
|
|
||||||
try:
|
|
||||||
print(f"Starting async project cleanup, project ID: {dataset_id}")
|
|
||||||
|
|
||||||
project_dir = os.path.join("projects", "data", dataset_id)
|
|
||||||
removed_items = []
|
|
||||||
|
|
||||||
if remove_all and os.path.exists(project_dir):
|
|
||||||
import shutil
|
|
||||||
shutil.rmtree(project_dir)
|
|
||||||
removed_items.append(project_dir)
|
|
||||||
result = {
|
|
||||||
"status": "success",
|
|
||||||
"message": f"Deleted entire project directory: {project_dir}",
|
|
||||||
"dataset_id": dataset_id,
|
|
||||||
"removed_items": removed_items,
|
|
||||||
"action": "remove_all"
|
|
||||||
}
|
|
||||||
else:
|
|
||||||
# Only clean processing log
|
|
||||||
log_file = os.path.join(project_dir, "processed_files.json")
|
|
||||||
if os.path.exists(log_file):
|
|
||||||
os.remove(log_file)
|
|
||||||
removed_items.append(log_file)
|
|
||||||
|
|
||||||
result = {
|
|
||||||
"status": "success",
|
|
||||||
"message": f"Cleaned project processing log, project ID: {dataset_id}",
|
|
||||||
"dataset_id": dataset_id,
|
|
||||||
"removed_items": removed_items,
|
|
||||||
"action": "cleanup_logs"
|
|
||||||
}
|
|
||||||
|
|
||||||
print(f"Async cleanup task completed: {dataset_id}")
|
|
||||||
return result
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
error_msg = f"Error during async project cleanup: {str(e)}"
|
|
||||||
print(error_msg)
|
|
||||||
return {
|
|
||||||
"status": "error",
|
|
||||||
"message": error_msg,
|
|
||||||
"dataset_id": dataset_id,
|
|
||||||
"error": str(e)
|
|
||||||
}
|
|
||||||
@ -1,228 +0,0 @@
|
|||||||
#!/usr/bin/env python3
|
|
||||||
"""
|
|
||||||
Queue manager for handling file processing queues.
|
|
||||||
"""
|
|
||||||
|
|
||||||
import os
|
|
||||||
import json
|
|
||||||
import time
|
|
||||||
import logging
|
|
||||||
from typing import Dict, List, Optional, Any
|
|
||||||
from huey import Huey
|
|
||||||
from huey.api import Task
|
|
||||||
from datetime import datetime, timedelta
|
|
||||||
|
|
||||||
# Configure logging
|
|
||||||
logger = logging.getLogger('app')
|
|
||||||
|
|
||||||
from .config import huey
|
|
||||||
from .tasks import process_file_async, process_multiple_files_async, process_zip_file_async, cleanup_processed_files
|
|
||||||
|
|
||||||
|
|
||||||
class QueueManager:
|
|
||||||
"""Queue manager for file processing tasks."""
|
|
||||||
|
|
||||||
def __init__(self):
|
|
||||||
self.huey = huey
|
|
||||||
logger.info(f"Queue manager initialized with database: {os.path.join(os.path.dirname(__file__), '..', 'projects', 'queue_data', 'huey.db')}")
|
|
||||||
|
|
||||||
def enqueue_file(
|
|
||||||
self,
|
|
||||||
project_id: str,
|
|
||||||
file_path: str,
|
|
||||||
original_filename: str = None,
|
|
||||||
delay: int = 0
|
|
||||||
) -> str:
|
|
||||||
"""
|
|
||||||
Add a file to the processing queue.
|
|
||||||
|
|
||||||
Args:
|
|
||||||
project_id: Project ID
|
|
||||||
file_path: File path
|
|
||||||
original_filename: Original filename
|
|
||||||
delay: Delay before execution in seconds
|
|
||||||
|
|
||||||
Returns:
|
|
||||||
Task ID
|
|
||||||
"""
|
|
||||||
if delay > 0:
|
|
||||||
task = process_file_async.schedule(
|
|
||||||
args=(project_id, file_path, original_filename),
|
|
||||||
delay=timedelta(seconds=delay)
|
|
||||||
)
|
|
||||||
else:
|
|
||||||
task = process_file_async(project_id, file_path, original_filename)
|
|
||||||
|
|
||||||
logger.info(f"File queued for processing: {file_path}, task ID: {task.id}")
|
|
||||||
return task.id
|
|
||||||
|
|
||||||
def enqueue_multiple_files(
|
|
||||||
self,
|
|
||||||
project_id: str,
|
|
||||||
file_paths: List[str],
|
|
||||||
original_filenames: List[str] = None,
|
|
||||||
delay: int = 0
|
|
||||||
) -> List[str]:
|
|
||||||
"""
|
|
||||||
Add multiple files to the processing queue.
|
|
||||||
|
|
||||||
Args:
|
|
||||||
project_id: Project ID
|
|
||||||
file_paths: List of file paths
|
|
||||||
original_filenames: List of original filenames
|
|
||||||
delay: Delay before execution in seconds
|
|
||||||
|
|
||||||
Returns:
|
|
||||||
List of task IDs
|
|
||||||
"""
|
|
||||||
if delay > 0:
|
|
||||||
task = process_multiple_files_async.schedule(
|
|
||||||
args=(project_id, file_paths, original_filenames),
|
|
||||||
delay=timedelta(seconds=delay)
|
|
||||||
)
|
|
||||||
else:
|
|
||||||
task = process_multiple_files_async(project_id, file_paths, original_filenames)
|
|
||||||
|
|
||||||
logger.info(f"Batch files queued for processing: {len(file_paths)} files, task ID: {task.id}")
|
|
||||||
return [task.id]
|
|
||||||
|
|
||||||
def enqueue_zip_file(
|
|
||||||
self,
|
|
||||||
project_id: str,
|
|
||||||
zip_path: str,
|
|
||||||
extract_to: str = None,
|
|
||||||
delay: int = 0
|
|
||||||
) -> str:
|
|
||||||
"""
|
|
||||||
Add a zip file to the processing queue.
|
|
||||||
|
|
||||||
Args:
|
|
||||||
project_id: Project ID
|
|
||||||
zip_path: Path to the zip file
|
|
||||||
extract_to: Extraction target directory
|
|
||||||
delay: Delay before execution in seconds
|
|
||||||
|
|
||||||
Returns:
|
|
||||||
Task ID
|
|
||||||
"""
|
|
||||||
if delay > 0:
|
|
||||||
task = process_zip_file_async.schedule(
|
|
||||||
args=(project_id, zip_path, extract_to),
|
|
||||||
delay=timedelta(seconds=delay)
|
|
||||||
)
|
|
||||||
else:
|
|
||||||
task = process_zip_file_async(project_id, zip_path, extract_to)
|
|
||||||
|
|
||||||
logger.info(f"Zip file queued for processing: {zip_path}, task ID: {task.id}")
|
|
||||||
return task.id
|
|
||||||
|
|
||||||
def get_task_status(self, task_id: str) -> Dict[str, Any]:
|
|
||||||
"""
|
|
||||||
Get task status.
|
|
||||||
|
|
||||||
Args:
|
|
||||||
task_id: Task ID
|
|
||||||
|
|
||||||
Returns:
|
|
||||||
Task status information
|
|
||||||
"""
|
|
||||||
try:
|
|
||||||
# Try getting the task result from result storage
|
|
||||||
try:
|
|
||||||
# Use Huey's built-in result lookup when available
|
|
||||||
if hasattr(self.huey, 'result') and self.huey.result:
|
|
||||||
result = self.huey.result(task_id)
|
|
||||||
if result is not None:
|
|
||||||
return {
|
|
||||||
"task_id": task_id,
|
|
||||||
"status": "complete",
|
|
||||||
"result": result
|
|
||||||
}
|
|
||||||
except Exception:
|
|
||||||
pass
|
|
||||||
|
|
||||||
# Check whether the task is in the pending queue
|
|
||||||
try:
|
|
||||||
pending_tasks = list(self.huey.pending())
|
|
||||||
for task in pending_tasks:
|
|
||||||
if hasattr(task, 'id') and task.id == task_id:
|
|
||||||
return {
|
|
||||||
"task_id": task_id,
|
|
||||||
"status": "pending"
|
|
||||||
}
|
|
||||||
except Exception:
|
|
||||||
pass
|
|
||||||
|
|
||||||
# Check whether the task is in the scheduled queue
|
|
||||||
try:
|
|
||||||
scheduled_tasks = list(self.huey.scheduled())
|
|
||||||
for task in scheduled_tasks:
|
|
||||||
if hasattr(task, 'id') and task.id == task_id:
|
|
||||||
return {
|
|
||||||
"task_id": task_id,
|
|
||||||
"status": "scheduled"
|
|
||||||
}
|
|
||||||
except Exception:
|
|
||||||
pass
|
|
||||||
|
|
||||||
# If not found anywhere, it may not exist or may have completed with cleaned results
|
|
||||||
return {
|
|
||||||
"task_id": task_id,
|
|
||||||
"status": "unknown",
|
|
||||||
"message": "Task status is unknown; it may already be complete or may not exist"
|
|
||||||
}
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
return {
|
|
||||||
"task_id": task_id,
|
|
||||||
"status": "error",
|
|
||||||
"message": f"Failed to get task status: {str(e)}"
|
|
||||||
}
|
|
||||||
|
|
||||||
def get_queue_stats(self) -> Dict[str, Any]:
|
|
||||||
"""
|
|
||||||
Get queue statistics.
|
|
||||||
|
|
||||||
Returns:
|
|
||||||
Queue statistics information
|
|
||||||
"""
|
|
||||||
try:
|
|
||||||
# Use a simplified approach for queue statistics
|
|
||||||
stats = {
|
|
||||||
"total_tasks": 0,
|
|
||||||
"pending_tasks": 0,
|
|
||||||
"running_tasks": 0,
|
|
||||||
"completed_tasks": 0,
|
|
||||||
"error_tasks": 0,
|
|
||||||
"scheduled_tasks": 0,
|
|
||||||
"recent_tasks": [],
|
|
||||||
"queue_database": os.path.join(os.path.dirname(__file__), '..', 'projects', 'queue_data', 'huey.db')
|
|
||||||
}
|
|
||||||
|
|
||||||
# Try to get the number of pending tasks
|
|
||||||
try:
|
|
||||||
pending_tasks = list(self.huey.pending())
|
|
||||||
stats["pending_tasks"] = len(pending_tasks)
|
|
||||||
stats["total_tasks"] += len(pending_tasks)
|
|
||||||
except Exception as e:
|
|
||||||
logger.error(f"Failed to get pending tasks: {e}")
|
|
||||||
|
|
||||||
# Try to get the number of scheduled tasks
|
|
||||||
try:
|
|
||||||
scheduled_tasks = list(self.huey.scheduled())
|
|
||||||
stats["scheduled_tasks"] = len(scheduled_tasks)
|
|
||||||
stats["total_tasks"] += len(scheduled_tasks)
|
|
||||||
except Exception as e:
|
|
||||||
logger.error(f"Failed to get scheduled tasks: {e}")
|
|
||||||
|
|
||||||
return stats
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
return {
|
|
||||||
"error": str(e),
|
|
||||||
"queue_database": os.path.join(os.path.dirname(__file__), '..', 'projects', 'queue_data', 'huey.db')
|
|
||||||
}
|
|
||||||
|
|
||||||
|
|
||||||
# Global singleton instance
|
|
||||||
queue_manager = QueueManager()
|
|
||||||
@ -1,286 +0,0 @@
|
|||||||
#!/usr/bin/env python3
|
|
||||||
"""
|
|
||||||
Optimized queue consumer with integrated performance monitoring.
|
|
||||||
"""
|
|
||||||
|
|
||||||
import sys
|
|
||||||
import os
|
|
||||||
import time
|
|
||||||
import signal
|
|
||||||
import argparse
|
|
||||||
import multiprocessing
|
|
||||||
import logging
|
|
||||||
from pathlib import Path
|
|
||||||
from concurrent.futures import ThreadPoolExecutor
|
|
||||||
import threading
|
|
||||||
|
|
||||||
# Configure logging
|
|
||||||
logger = logging.getLogger('app')
|
|
||||||
|
|
||||||
# Add project root directory to Python path
|
|
||||||
project_root = Path(__file__).parent.parent
|
|
||||||
sys.path.insert(0, str(project_root))
|
|
||||||
|
|
||||||
from task_queue.config import huey
|
|
||||||
from task_queue.manager import queue_manager
|
|
||||||
from task_queue.integration_tasks import process_files_async, cleanup_project_async
|
|
||||||
from huey.consumer import Consumer
|
|
||||||
|
|
||||||
|
|
||||||
class OptimizedQueueConsumer:
|
|
||||||
"""Optimized queue consumer with integrated performance monitoring."""
|
|
||||||
|
|
||||||
def __init__(self, worker_type: str = "threads", workers: int = 2):
|
|
||||||
self.huey = huey
|
|
||||||
self.worker_type = worker_type
|
|
||||||
self.workers = workers
|
|
||||||
self.running = False
|
|
||||||
self.consumer = None
|
|
||||||
self.processed_count = 0
|
|
||||||
self.start_time = None
|
|
||||||
|
|
||||||
# Performance monitoring
|
|
||||||
self.performance_stats = {
|
|
||||||
'tasks_processed': 0,
|
|
||||||
'tasks_failed': 0,
|
|
||||||
'avg_processing_time': 0,
|
|
||||||
'start_time': None,
|
|
||||||
'last_activity': None
|
|
||||||
}
|
|
||||||
|
|
||||||
# Register signal handlers
|
|
||||||
signal.signal(signal.SIGINT, self._signal_handler)
|
|
||||||
signal.signal(signal.SIGTERM, self._signal_handler)
|
|
||||||
|
|
||||||
def _signal_handler(self, signum, frame):
|
|
||||||
"""Signal handler for graceful shutdown."""
|
|
||||||
logger.info(f"\nReceived signal {signum}, shutting down queue consumer...")
|
|
||||||
self.running = False
|
|
||||||
if self.consumer:
|
|
||||||
self.consumer.stop()
|
|
||||||
|
|
||||||
def setup_optimizations(self):
|
|
||||||
"""Set up performance optimizations."""
|
|
||||||
# Set environment variables
|
|
||||||
env_vars = {
|
|
||||||
'PYTHONUNBUFFERED': '1',
|
|
||||||
'PYTHONDONTWRITEBYTECODE': '1',
|
|
||||||
}
|
|
||||||
|
|
||||||
for key, value in env_vars.items():
|
|
||||||
os.environ[key] = value
|
|
||||||
|
|
||||||
# Optimize huey configuration
|
|
||||||
if hasattr(huey, 'immediate'):
|
|
||||||
huey.immediate = False
|
|
||||||
|
|
||||||
# Adjust based on worker type
|
|
||||||
if self.worker_type == "threads":
|
|
||||||
# Thread pool optimization
|
|
||||||
if hasattr(huey, 'worker_type'):
|
|
||||||
huey.worker_type = 'threads'
|
|
||||||
|
|
||||||
# Set thread pool size
|
|
||||||
if hasattr(huey, 'always_eager'):
|
|
||||||
huey.always_eager = False
|
|
||||||
|
|
||||||
logger.info("Queue consumer optimization setup complete:")
|
|
||||||
logger.info(f"- Worker type: {self.worker_type}")
|
|
||||||
logger.info(f"- Worker count: {self.workers}")
|
|
||||||
|
|
||||||
def monitor_performance(self):
|
|
||||||
"""Performance monitoring thread."""
|
|
||||||
while self.running:
|
|
||||||
time.sleep(30) # Output statistics every 30 seconds
|
|
||||||
|
|
||||||
if self.start_time:
|
|
||||||
elapsed = time.time() - self.start_time
|
|
||||||
rate = self.performance_stats['tasks_processed'] / max(1, elapsed)
|
|
||||||
|
|
||||||
logger.info(f"\n[Performance Stats]")
|
|
||||||
logger.info(f"- Uptime: {elapsed:.1f}s")
|
|
||||||
logger.info(f"- Tasks processed: {self.performance_stats['tasks_processed']}")
|
|
||||||
logger.info(f"- Failed tasks: {self.performance_stats['tasks_failed']}")
|
|
||||||
logger.info(f"- Average processing rate: {rate:.2f} tasks/s")
|
|
||||||
|
|
||||||
if self.performance_stats['avg_processing_time'] > 0:
|
|
||||||
logger.info(f"- Average processing time: {self.performance_stats['avg_processing_time']:.2f}s")
|
|
||||||
|
|
||||||
def start(self):
|
|
||||||
"""Start the queue consumer."""
|
|
||||||
logger.info("=" * 60)
|
|
||||||
logger.info("Optimized queue consumer starting")
|
|
||||||
logger.info("=" * 60)
|
|
||||||
|
|
||||||
# Apply optimizations
|
|
||||||
self.setup_optimizations()
|
|
||||||
|
|
||||||
logger.info(f"Database: {os.path.join(os.path.dirname(__file__), '..', 'projects', 'queue_data', 'huey.db')}")
|
|
||||||
logger.info("Press Ctrl+C to stop the consumer")
|
|
||||||
|
|
||||||
self.running = True
|
|
||||||
self.start_time = time.time()
|
|
||||||
self.performance_stats['start_time'] = self.start_time
|
|
||||||
|
|
||||||
# Start performance monitoring thread
|
|
||||||
monitor_thread = threading.Thread(target=self.monitor_performance, daemon=True)
|
|
||||||
monitor_thread.start()
|
|
||||||
|
|
||||||
try:
|
|
||||||
# Create consumer
|
|
||||||
self.consumer = Consumer(
|
|
||||||
self.huey,
|
|
||||||
workers=self.workers,
|
|
||||||
worker_type=self.worker_type,
|
|
||||||
max_delay=60.0, # Maximum delay
|
|
||||||
check_delay=1.0, # Check interval
|
|
||||||
periodic=True, # Enable periodic tasks
|
|
||||||
)
|
|
||||||
|
|
||||||
logger.info("Queue consumer started, waiting for tasks...")
|
|
||||||
|
|
||||||
# Start the consumer
|
|
||||||
self.consumer.run()
|
|
||||||
|
|
||||||
except KeyboardInterrupt:
|
|
||||||
logger.info("\nReceived keyboard interrupt signal")
|
|
||||||
except Exception as e:
|
|
||||||
logger.error(f"Queue consumer runtime error: {e}")
|
|
||||||
import traceback
|
|
||||||
traceback.print_exc()
|
|
||||||
finally:
|
|
||||||
self.shutdown()
|
|
||||||
|
|
||||||
def shutdown(self):
|
|
||||||
"""Shut down the queue consumer."""
|
|
||||||
logger.info("\nShutting down queue consumer...")
|
|
||||||
self.running = False
|
|
||||||
|
|
||||||
if self.consumer:
|
|
||||||
try:
|
|
||||||
self.consumer.stop()
|
|
||||||
logger.info("Queue consumer stopped")
|
|
||||||
except Exception as e:
|
|
||||||
logger.error(f"Error stopping queue consumer: {e}")
|
|
||||||
|
|
||||||
# Output final statistics
|
|
||||||
if self.start_time:
|
|
||||||
elapsed = time.time() - self.start_time
|
|
||||||
logger.info(f"\n[Final Stats]")
|
|
||||||
logger.info(f"- Total uptime: {elapsed:.1f}s")
|
|
||||||
logger.info(f"- Total tasks processed: {self.performance_stats['tasks_processed']}")
|
|
||||||
logger.info(f"- Total failed tasks: {self.performance_stats['tasks_failed']}")
|
|
||||||
|
|
||||||
if self.performance_stats['tasks_processed'] > 0:
|
|
||||||
rate = self.performance_stats['tasks_processed'] / elapsed
|
|
||||||
logger.info(f"- Average processing rate: {rate:.2f} tasks/s")
|
|
||||||
|
|
||||||
|
|
||||||
def calculate_optimal_workers():
|
|
||||||
"""Calculate the optimal number of worker threads."""
|
|
||||||
cpu_count = multiprocessing.cpu_count()
|
|
||||||
|
|
||||||
# Based on CPU core count and system resources
|
|
||||||
if cpu_count <= 2:
|
|
||||||
return 2
|
|
||||||
elif cpu_count <= 4:
|
|
||||||
return 4
|
|
||||||
else:
|
|
||||||
return min(8, cpu_count)
|
|
||||||
|
|
||||||
|
|
||||||
def check_queue_status():
|
|
||||||
"""Check queue status."""
|
|
||||||
try:
|
|
||||||
stats = queue_manager.get_queue_stats()
|
|
||||||
|
|
||||||
logger.info("\n[Queue Status]")
|
|
||||||
if isinstance(stats, dict):
|
|
||||||
if 'total_tasks' in stats:
|
|
||||||
logger.info(f"- Total tasks: {stats['total_tasks']}")
|
|
||||||
if 'pending_tasks' in stats:
|
|
||||||
logger.info(f"- Pending tasks: {stats['pending_tasks']}")
|
|
||||||
if 'scheduled_tasks' in stats:
|
|
||||||
logger.info(f"- Scheduled tasks: {stats['scheduled_tasks']}")
|
|
||||||
|
|
||||||
# Check database file
|
|
||||||
db_path = os.path.join(os.path.dirname(__file__), '..', 'projects', 'queue_data', 'huey.db')
|
|
||||||
if os.path.exists(db_path):
|
|
||||||
size = os.path.getsize(db_path)
|
|
||||||
logger.info(f"- Database size: {size} bytes")
|
|
||||||
else:
|
|
||||||
logger.info("- Database file: not found")
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
logger.error(f"Failed to get queue status: {e}")
|
|
||||||
|
|
||||||
|
|
||||||
def main():
|
|
||||||
"""Main entry point."""
|
|
||||||
parser = argparse.ArgumentParser(description="Optimized queue consumer")
|
|
||||||
parser.add_argument(
|
|
||||||
"--workers",
|
|
||||||
type=int,
|
|
||||||
default=calculate_optimal_workers(),
|
|
||||||
help=f"Number of worker threads (default: {calculate_optimal_workers()})"
|
|
||||||
)
|
|
||||||
parser.add_argument(
|
|
||||||
"--worker-type",
|
|
||||||
type=str,
|
|
||||||
default="threads",
|
|
||||||
choices=["threads", "greenlets", "gevent"],
|
|
||||||
help="Worker type (default: threads)"
|
|
||||||
)
|
|
||||||
parser.add_argument(
|
|
||||||
"--check-status",
|
|
||||||
action="store_true",
|
|
||||||
help="Check queue status and exit"
|
|
||||||
)
|
|
||||||
parser.add_argument(
|
|
||||||
"--profile",
|
|
||||||
type=str,
|
|
||||||
default="balanced",
|
|
||||||
choices=["low_memory", "balanced", "high_performance"],
|
|
||||||
help="Performance profile"
|
|
||||||
)
|
|
||||||
|
|
||||||
args = parser.parse_args()
|
|
||||||
|
|
||||||
# Apply performance profile
|
|
||||||
if args.profile == "low_memory":
|
|
||||||
os.environ['PYTHONOPTIMIZE'] = '1'
|
|
||||||
if args.workers > 2:
|
|
||||||
args.workers = 2
|
|
||||||
logger.info(f"Low memory mode: adjusted worker count to {args.workers}")
|
|
||||||
elif args.profile == "high_performance":
|
|
||||||
if args.workers < 4:
|
|
||||||
args.workers = 4
|
|
||||||
logger.info(f"High performance mode: adjusted worker count to {args.workers}")
|
|
||||||
|
|
||||||
# Check queue status
|
|
||||||
if args.check_status:
|
|
||||||
check_queue_status()
|
|
||||||
return
|
|
||||||
|
|
||||||
# Check environment
|
|
||||||
try:
|
|
||||||
import psutil
|
|
||||||
memory = psutil.virtual_memory()
|
|
||||||
logger.info("[System Info]")
|
|
||||||
logger.info(f"- CPU cores: {multiprocessing.cpu_count()}")
|
|
||||||
logger.info(f"- Available memory: {memory.available / (1024**3):.1f}GB")
|
|
||||||
logger.info(f"- Memory usage: {memory.percent:.1f}%")
|
|
||||||
except ImportError:
|
|
||||||
logger.info("[Tip] Install psutil to display system info: pip install psutil")
|
|
||||||
|
|
||||||
# Create and start the queue consumer
|
|
||||||
consumer = OptimizedQueueConsumer(
|
|
||||||
worker_type=args.worker_type,
|
|
||||||
workers=args.workers
|
|
||||||
)
|
|
||||||
|
|
||||||
consumer.start()
|
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
main()
|
|
||||||
@ -1,210 +0,0 @@
|
|||||||
#!/usr/bin/env python3
|
|
||||||
"""
|
|
||||||
Task status SQLite storage system.
|
|
||||||
"""
|
|
||||||
|
|
||||||
import json
|
|
||||||
import os
|
|
||||||
import sqlite3
|
|
||||||
import time
|
|
||||||
from typing import Dict, Optional, Any, List
|
|
||||||
from pathlib import Path
|
|
||||||
|
|
||||||
|
|
||||||
class TaskStatusStore:
|
|
||||||
"""SQLite-based task status store."""
|
|
||||||
|
|
||||||
def __init__(self, db_path: str = "projects/queue_data/task_status.db"):
|
|
||||||
self.db_path = db_path
|
|
||||||
# Ensure directory exists
|
|
||||||
Path(db_path).parent.mkdir(parents=True, exist_ok=True)
|
|
||||||
self._init_database()
|
|
||||||
|
|
||||||
def _init_database(self):
|
|
||||||
"""Initialize database tables."""
|
|
||||||
with sqlite3.connect(self.db_path) as conn:
|
|
||||||
conn.execute('''
|
|
||||||
CREATE TABLE IF NOT EXISTS task_status (
|
|
||||||
task_id TEXT PRIMARY KEY,
|
|
||||||
unique_id TEXT NOT NULL,
|
|
||||||
status TEXT NOT NULL,
|
|
||||||
created_at REAL NOT NULL,
|
|
||||||
updated_at REAL NOT NULL,
|
|
||||||
result TEXT,
|
|
||||||
error TEXT
|
|
||||||
)
|
|
||||||
''')
|
|
||||||
conn.commit()
|
|
||||||
|
|
||||||
def set_status(self, task_id: str, unique_id: str, status: str,
|
|
||||||
result: Optional[Dict] = None, error: Optional[str] = None):
|
|
||||||
"""Set task status."""
|
|
||||||
current_time = time.time()
|
|
||||||
|
|
||||||
with sqlite3.connect(self.db_path) as conn:
|
|
||||||
conn.execute('''
|
|
||||||
INSERT OR REPLACE INTO task_status
|
|
||||||
(task_id, unique_id, status, created_at, updated_at, result, error)
|
|
||||||
VALUES (?, ?, ?, ?, ?, ?, ?)
|
|
||||||
''', (
|
|
||||||
task_id, unique_id, status, current_time, current_time,
|
|
||||||
json.dumps(result) if result else None,
|
|
||||||
error
|
|
||||||
))
|
|
||||||
conn.commit()
|
|
||||||
|
|
||||||
def get_status(self, task_id: str) -> Optional[Dict]:
|
|
||||||
"""Get task status."""
|
|
||||||
with sqlite3.connect(self.db_path) as conn:
|
|
||||||
conn.row_factory = sqlite3.Row
|
|
||||||
cursor = conn.execute(
|
|
||||||
'SELECT * FROM task_status WHERE task_id = ?', (task_id,)
|
|
||||||
)
|
|
||||||
row = cursor.fetchone()
|
|
||||||
|
|
||||||
if not row:
|
|
||||||
return None
|
|
||||||
|
|
||||||
result = dict(row)
|
|
||||||
# Parse JSON field
|
|
||||||
if result['result']:
|
|
||||||
result['result'] = json.loads(result['result'])
|
|
||||||
|
|
||||||
return result
|
|
||||||
|
|
||||||
def update_status(self, task_id: str, status: str,
|
|
||||||
result: Optional[Dict] = None, error: Optional[str] = None):
|
|
||||||
"""Update task status."""
|
|
||||||
with sqlite3.connect(self.db_path) as conn:
|
|
||||||
# Check if task exists
|
|
||||||
cursor = conn.execute(
|
|
||||||
'SELECT task_id FROM task_status WHERE task_id = ?', (task_id,)
|
|
||||||
)
|
|
||||||
if not cursor.fetchone():
|
|
||||||
return False
|
|
||||||
|
|
||||||
# Update status
|
|
||||||
conn.execute('''
|
|
||||||
UPDATE task_status
|
|
||||||
SET status = ?, updated_at = ?, result = ?, error = ?
|
|
||||||
WHERE task_id = ?
|
|
||||||
''', (
|
|
||||||
status, time.time(),
|
|
||||||
json.dumps(result) if result else None,
|
|
||||||
error, task_id
|
|
||||||
))
|
|
||||||
conn.commit()
|
|
||||||
return True
|
|
||||||
|
|
||||||
def delete_status(self, task_id: str):
|
|
||||||
"""Delete task status."""
|
|
||||||
with sqlite3.connect(self.db_path) as conn:
|
|
||||||
cursor = conn.execute(
|
|
||||||
'DELETE FROM task_status WHERE task_id = ?', (task_id,)
|
|
||||||
)
|
|
||||||
conn.commit()
|
|
||||||
return cursor.rowcount > 0
|
|
||||||
|
|
||||||
def list_all(self) -> Dict[str, Dict]:
|
|
||||||
"""List all task statuses."""
|
|
||||||
with sqlite3.connect(self.db_path) as conn:
|
|
||||||
conn.row_factory = sqlite3.Row
|
|
||||||
cursor = conn.execute(
|
|
||||||
'SELECT * FROM task_status ORDER BY updated_at DESC'
|
|
||||||
)
|
|
||||||
all_tasks = {}
|
|
||||||
for row in cursor:
|
|
||||||
result = dict(row)
|
|
||||||
# Parse JSON field
|
|
||||||
if result['result']:
|
|
||||||
result['result'] = json.loads(result['result'])
|
|
||||||
all_tasks[result['task_id']] = result
|
|
||||||
return all_tasks
|
|
||||||
|
|
||||||
def get_by_unique_id(self, unique_id: str) -> List[Dict]:
|
|
||||||
"""Get all tasks for a given project ID."""
|
|
||||||
with sqlite3.connect(self.db_path) as conn:
|
|
||||||
conn.row_factory = sqlite3.Row
|
|
||||||
cursor = conn.execute(
|
|
||||||
'SELECT * FROM task_status WHERE unique_id = ? ORDER BY updated_at DESC',
|
|
||||||
(unique_id,)
|
|
||||||
)
|
|
||||||
tasks = []
|
|
||||||
for row in cursor:
|
|
||||||
result = dict(row)
|
|
||||||
if result['result']:
|
|
||||||
result['result'] = json.loads(result['result'])
|
|
||||||
tasks.append(result)
|
|
||||||
return tasks
|
|
||||||
|
|
||||||
def cleanup_old_tasks(self, older_than_days: int = 7) -> int:
|
|
||||||
"""Clean up old task records."""
|
|
||||||
cutoff_time = time.time() - (older_than_days * 24 * 3600)
|
|
||||||
|
|
||||||
with sqlite3.connect(self.db_path) as conn:
|
|
||||||
cursor = conn.execute(
|
|
||||||
'DELETE FROM task_status WHERE updated_at < ?',
|
|
||||||
(cutoff_time,)
|
|
||||||
)
|
|
||||||
conn.commit()
|
|
||||||
return cursor.rowcount
|
|
||||||
|
|
||||||
def get_statistics(self) -> Dict[str, Any]:
|
|
||||||
"""Get task statistics."""
|
|
||||||
with sqlite3.connect(self.db_path) as conn:
|
|
||||||
# Total tasks
|
|
||||||
total = conn.execute('SELECT COUNT(*) FROM task_status').fetchone()[0]
|
|
||||||
|
|
||||||
# Status breakdown
|
|
||||||
status_stats = conn.execute('''
|
|
||||||
SELECT status, COUNT(*) as count
|
|
||||||
FROM task_status
|
|
||||||
GROUP BY status
|
|
||||||
''').fetchall()
|
|
||||||
|
|
||||||
# Tasks in the last 24 hours
|
|
||||||
recent = time.time() - (24 * 3600)
|
|
||||||
recent_tasks = conn.execute(
|
|
||||||
'SELECT COUNT(*) FROM task_status WHERE updated_at > ?',
|
|
||||||
(recent,)
|
|
||||||
).fetchone()[0]
|
|
||||||
|
|
||||||
return {
|
|
||||||
'total_tasks': total,
|
|
||||||
'status_breakdown': dict(status_stats),
|
|
||||||
'recent_24h': recent_tasks,
|
|
||||||
'database_path': self.db_path
|
|
||||||
}
|
|
||||||
|
|
||||||
def search_tasks(self, status: Optional[str] = None,
|
|
||||||
unique_id: Optional[str] = None,
|
|
||||||
limit: int = 100) -> List[Dict]:
|
|
||||||
"""Search tasks."""
|
|
||||||
query = 'SELECT * FROM task_status WHERE 1=1'
|
|
||||||
params = []
|
|
||||||
|
|
||||||
if status:
|
|
||||||
query += ' AND status = ?'
|
|
||||||
params.append(status)
|
|
||||||
|
|
||||||
if unique_id:
|
|
||||||
query += ' AND unique_id = ?'
|
|
||||||
params.append(unique_id)
|
|
||||||
|
|
||||||
query += ' ORDER BY updated_at DESC LIMIT ?'
|
|
||||||
params.append(limit)
|
|
||||||
|
|
||||||
with sqlite3.connect(self.db_path) as conn:
|
|
||||||
conn.row_factory = sqlite3.Row
|
|
||||||
cursor = conn.execute(query, params)
|
|
||||||
tasks = []
|
|
||||||
for row in cursor:
|
|
||||||
result = dict(row)
|
|
||||||
if result['result']:
|
|
||||||
result['result'] = json.loads(result['result'])
|
|
||||||
tasks.append(result)
|
|
||||||
return tasks
|
|
||||||
|
|
||||||
|
|
||||||
# Global task status store instance
|
|
||||||
task_status_store = TaskStatusStore()
|
|
||||||
@ -1,359 +0,0 @@
|
|||||||
#!/usr/bin/env python3
|
|
||||||
"""
|
|
||||||
File processing tasks for the queue system.
|
|
||||||
"""
|
|
||||||
|
|
||||||
import os
|
|
||||||
import json
|
|
||||||
import time
|
|
||||||
import shutil
|
|
||||||
import logging
|
|
||||||
from pathlib import Path
|
|
||||||
from typing import Dict, List, Optional, Any
|
|
||||||
from huey import crontab
|
|
||||||
|
|
||||||
# Configure logging
|
|
||||||
logger = logging.getLogger('app')
|
|
||||||
|
|
||||||
from .config import huey
|
|
||||||
from utils.file_utils import (
|
|
||||||
extract_zip_file,
|
|
||||||
get_file_hash,
|
|
||||||
load_processed_files_log,
|
|
||||||
save_processed_files_log,
|
|
||||||
get_document_preview
|
|
||||||
)
|
|
||||||
|
|
||||||
|
|
||||||
@huey.task()
|
|
||||||
def process_file_async(
|
|
||||||
project_id: str,
|
|
||||||
file_path: str,
|
|
||||||
original_filename: str = None,
|
|
||||||
target_directory: str = "files"
|
|
||||||
) -> Dict[str, Any]:
|
|
||||||
"""
|
|
||||||
Asynchronously process a single file.
|
|
||||||
|
|
||||||
Args:
|
|
||||||
project_id: Project ID
|
|
||||||
file_path: File path
|
|
||||||
original_filename: Original filename
|
|
||||||
target_directory: Target directory
|
|
||||||
|
|
||||||
Returns:
|
|
||||||
Processing result dictionary
|
|
||||||
"""
|
|
||||||
try:
|
|
||||||
logger.info(f"Starting file processing: {file_path}")
|
|
||||||
|
|
||||||
# Ensure project directory exists
|
|
||||||
project_dir = os.path.join("projects", project_id)
|
|
||||||
files_dir = os.path.join(project_dir, target_directory)
|
|
||||||
os.makedirs(files_dir, exist_ok=True)
|
|
||||||
|
|
||||||
# Get file hash as identifier
|
|
||||||
file_hash = get_file_hash(file_path)
|
|
||||||
|
|
||||||
# Check if file has already been processed
|
|
||||||
processed_log = load_processed_files_log(project_id)
|
|
||||||
if file_hash in processed_log:
|
|
||||||
logger.info(f"File already processed, skipping: {file_path}")
|
|
||||||
return {
|
|
||||||
"status": "skipped",
|
|
||||||
"message": "File already processed",
|
|
||||||
"file_hash": file_hash,
|
|
||||||
"project_id": project_id
|
|
||||||
}
|
|
||||||
|
|
||||||
# Process the file
|
|
||||||
result = _process_single_file(
|
|
||||||
file_path,
|
|
||||||
files_dir,
|
|
||||||
original_filename or os.path.basename(file_path)
|
|
||||||
)
|
|
||||||
|
|
||||||
# Update processing log
|
|
||||||
if result["status"] == "success":
|
|
||||||
processed_log[file_hash] = {
|
|
||||||
"original_path": file_path,
|
|
||||||
"original_filename": original_filename or os.path.basename(file_path),
|
|
||||||
"processed_at": str(time.time()),
|
|
||||||
"status": "processed",
|
|
||||||
"result": result
|
|
||||||
}
|
|
||||||
save_processed_files_log(project_id, processed_log)
|
|
||||||
|
|
||||||
result["file_hash"] = file_hash
|
|
||||||
result["project_id"] = project_id
|
|
||||||
|
|
||||||
logger.info(f"File processing complete: {file_path}, status: {result['status']}")
|
|
||||||
return result
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
error_msg = f"Error processing file: {str(e)}"
|
|
||||||
logger.error(error_msg)
|
|
||||||
return {
|
|
||||||
"status": "error",
|
|
||||||
"message": error_msg,
|
|
||||||
"file_path": file_path,
|
|
||||||
"project_id": project_id
|
|
||||||
}
|
|
||||||
|
|
||||||
|
|
||||||
@huey.task()
|
|
||||||
def process_multiple_files_async(
|
|
||||||
project_id: str,
|
|
||||||
file_paths: List[str],
|
|
||||||
original_filenames: List[str] = None
|
|
||||||
) -> List[Dict[str, Any]]:
|
|
||||||
"""
|
|
||||||
Asynchronously process multiple files in batch.
|
|
||||||
|
|
||||||
Args:
|
|
||||||
project_id: Project ID
|
|
||||||
file_paths: List of file paths
|
|
||||||
original_filenames: List of original filenames
|
|
||||||
|
|
||||||
Returns:
|
|
||||||
List of processing results
|
|
||||||
"""
|
|
||||||
try:
|
|
||||||
logger.info(f"Starting batch processing of {len(file_paths)} files")
|
|
||||||
|
|
||||||
results = []
|
|
||||||
for i, file_path in enumerate(file_paths):
|
|
||||||
original_filename = original_filenames[i] if original_filenames and i < len(original_filenames) else None
|
|
||||||
|
|
||||||
# Create async task for each file
|
|
||||||
result = process_file_async(project_id, file_path, original_filename)
|
|
||||||
results.append(result)
|
|
||||||
|
|
||||||
logger.info(f"Batch file processing tasks submitted, total {len(results)} files")
|
|
||||||
return results
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
error_msg = f"Error during batch file processing: {str(e)}"
|
|
||||||
logger.error(error_msg)
|
|
||||||
return [{
|
|
||||||
"status": "error",
|
|
||||||
"message": error_msg,
|
|
||||||
"project_id": project_id
|
|
||||||
}]
|
|
||||||
|
|
||||||
|
|
||||||
@huey.task()
|
|
||||||
def process_zip_file_async(
|
|
||||||
project_id: str,
|
|
||||||
zip_path: str,
|
|
||||||
extract_to: str = None
|
|
||||||
) -> Dict[str, Any]:
|
|
||||||
"""
|
|
||||||
Asynchronously process a zip archive file.
|
|
||||||
|
|
||||||
Args:
|
|
||||||
project_id: Project ID
|
|
||||||
zip_path: Zip file path
|
|
||||||
extract_to: Extraction target directory
|
|
||||||
|
|
||||||
Returns:
|
|
||||||
Processing result dictionary
|
|
||||||
"""
|
|
||||||
try:
|
|
||||||
logger.info(f"Starting zip file processing: {zip_path}")
|
|
||||||
|
|
||||||
# Set extraction directory
|
|
||||||
if extract_to is None:
|
|
||||||
extract_to = os.path.join("projects", project_id, "extracted", os.path.basename(zip_path))
|
|
||||||
|
|
||||||
os.makedirs(extract_to, exist_ok=True)
|
|
||||||
|
|
||||||
# Extract files
|
|
||||||
extracted_files = extract_zip_file(zip_path, extract_to)
|
|
||||||
|
|
||||||
if not extracted_files:
|
|
||||||
return {
|
|
||||||
"status": "error",
|
|
||||||
"message": "Extraction failed or no supported files found",
|
|
||||||
"zip_path": zip_path,
|
|
||||||
"project_id": project_id
|
|
||||||
}
|
|
||||||
|
|
||||||
# Batch process extracted files
|
|
||||||
result = process_multiple_files_async(project_id, extracted_files)
|
|
||||||
|
|
||||||
return {
|
|
||||||
"status": "success",
|
|
||||||
"message": f"Zip file processing complete, extracted {len(extracted_files)} files",
|
|
||||||
"zip_path": zip_path,
|
|
||||||
"extract_to": extract_to,
|
|
||||||
"extracted_files": extracted_files,
|
|
||||||
"project_id": project_id,
|
|
||||||
"batch_task_result": result
|
|
||||||
}
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
error_msg = f"Error processing zip file: {str(e)}"
|
|
||||||
logger.error(error_msg)
|
|
||||||
return {
|
|
||||||
"status": "error",
|
|
||||||
"message": error_msg,
|
|
||||||
"zip_path": zip_path,
|
|
||||||
"project_id": project_id
|
|
||||||
}
|
|
||||||
|
|
||||||
|
|
||||||
@huey.task()
|
|
||||||
def cleanup_processed_files(
|
|
||||||
project_id: str,
|
|
||||||
older_than_days: int = 30
|
|
||||||
) -> Dict[str, Any]:
|
|
||||||
"""
|
|
||||||
Clean up old processed files.
|
|
||||||
|
|
||||||
Args:
|
|
||||||
project_id: Project ID
|
|
||||||
older_than_days: Clean files older than this many days
|
|
||||||
|
|
||||||
Returns:
|
|
||||||
Cleanup result dictionary
|
|
||||||
"""
|
|
||||||
try:
|
|
||||||
logger.info(f"Starting cleanup of files older than {older_than_days} days in project {project_id}")
|
|
||||||
|
|
||||||
project_dir = os.path.join("projects", project_id)
|
|
||||||
if not os.path.exists(project_dir):
|
|
||||||
return {
|
|
||||||
"status": "error",
|
|
||||||
"message": "Project directory does not exist",
|
|
||||||
"project_id": project_id
|
|
||||||
}
|
|
||||||
|
|
||||||
current_time = time.time()
|
|
||||||
cutoff_time = current_time - (older_than_days * 24 * 3600)
|
|
||||||
cleaned_files = []
|
|
||||||
|
|
||||||
# Walk through project directory
|
|
||||||
for root, dirs, files in os.walk(project_dir):
|
|
||||||
for file in files:
|
|
||||||
file_path = os.path.join(root, file)
|
|
||||||
file_mtime = os.path.getmtime(file_path)
|
|
||||||
|
|
||||||
if file_mtime < cutoff_time:
|
|
||||||
try:
|
|
||||||
os.remove(file_path)
|
|
||||||
cleaned_files.append(file_path)
|
|
||||||
logger.info(f"Deleted old file: {file_path}")
|
|
||||||
except Exception as e:
|
|
||||||
logger.error(f"Failed to delete file {file_path}: {str(e)}")
|
|
||||||
|
|
||||||
# Clean up empty directories
|
|
||||||
for root, dirs, files in os.walk(project_dir, topdown=False):
|
|
||||||
for dir in dirs:
|
|
||||||
dir_path = os.path.join(root, dir)
|
|
||||||
try:
|
|
||||||
if not os.listdir(dir_path):
|
|
||||||
os.rmdir(dir_path)
|
|
||||||
logger.info(f"Deleted empty directory: {dir_path}")
|
|
||||||
except Exception as e:
|
|
||||||
logger.error(f"Failed to delete directory {dir_path}: {str(e)}")
|
|
||||||
|
|
||||||
return {
|
|
||||||
"status": "success",
|
|
||||||
"message": f"Cleanup complete, deleted {len(cleaned_files)} files",
|
|
||||||
"project_id": project_id,
|
|
||||||
"cleaned_files": cleaned_files,
|
|
||||||
"older_than_days": older_than_days
|
|
||||||
}
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
error_msg = f"Error during file cleanup: {str(e)}"
|
|
||||||
logger.error(error_msg)
|
|
||||||
return {
|
|
||||||
"status": "error",
|
|
||||||
"message": error_msg,
|
|
||||||
"project_id": project_id
|
|
||||||
}
|
|
||||||
|
|
||||||
|
|
||||||
def _process_single_file(
|
|
||||||
file_path: str,
|
|
||||||
target_dir: str,
|
|
||||||
original_filename: str
|
|
||||||
) -> Dict[str, Any]:
|
|
||||||
"""
|
|
||||||
Internal method for processing a single file.
|
|
||||||
|
|
||||||
Args:
|
|
||||||
file_path: Source file path
|
|
||||||
target_dir: Target directory
|
|
||||||
original_filename: Original filename
|
|
||||||
|
|
||||||
Returns:
|
|
||||||
Processing result dictionary
|
|
||||||
"""
|
|
||||||
try:
|
|
||||||
# Check if file exists
|
|
||||||
if not os.path.exists(file_path):
|
|
||||||
return {
|
|
||||||
"status": "error",
|
|
||||||
"message": "Source file does not exist",
|
|
||||||
"file_path": file_path
|
|
||||||
}
|
|
||||||
|
|
||||||
# Get file info
|
|
||||||
file_size = os.path.getsize(file_path)
|
|
||||||
file_ext = os.path.splitext(original_filename)[1].lower()
|
|
||||||
|
|
||||||
# Different processing based on file type
|
|
||||||
supported_extensions = ['.txt', '.md', '.csv', '.xlsx', '.zip']
|
|
||||||
|
|
||||||
if file_ext not in supported_extensions:
|
|
||||||
return {
|
|
||||||
"status": "error",
|
|
||||||
"message": f"Unsupported file type: {file_ext}",
|
|
||||||
"file_path": file_path,
|
|
||||||
"supported_extensions": supported_extensions
|
|
||||||
}
|
|
||||||
|
|
||||||
# Copy file to target directory
|
|
||||||
target_file_path = os.path.join(target_dir, original_filename)
|
|
||||||
|
|
||||||
# If target file already exists, add timestamp
|
|
||||||
if os.path.exists(target_file_path):
|
|
||||||
name, ext = os.path.splitext(original_filename)
|
|
||||||
timestamp = int(time.time())
|
|
||||||
target_file_path = os.path.join(target_dir, f"{name}_{timestamp}{ext}")
|
|
||||||
|
|
||||||
shutil.copy2(file_path, target_file_path)
|
|
||||||
|
|
||||||
# Get file preview (if it's a text file)
|
|
||||||
preview = None
|
|
||||||
if file_ext in ['.txt', '.md']:
|
|
||||||
preview = get_document_preview(target_file_path, max_lines=5)
|
|
||||||
|
|
||||||
return {
|
|
||||||
"status": "success",
|
|
||||||
"message": "File processed successfully",
|
|
||||||
"original_path": file_path,
|
|
||||||
"target_path": target_file_path,
|
|
||||||
"file_size": file_size,
|
|
||||||
"file_extension": file_ext,
|
|
||||||
"preview": preview
|
|
||||||
}
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
return {
|
|
||||||
"status": "error",
|
|
||||||
"message": f"Error processing file: {str(e)}",
|
|
||||||
"file_path": file_path
|
|
||||||
}
|
|
||||||
|
|
||||||
|
|
||||||
# Periodic task example: clean up files older than 30 days daily at 2 AM
|
|
||||||
@huey.periodic_task(crontab(hour=2, minute=0))
|
|
||||||
def daily_cleanup():
|
|
||||||
"""Daily cleanup task."""
|
|
||||||
logger.info("Running daily cleanup task")
|
|
||||||
# Add cleanup logic here
|
|
||||||
return {"status": "completed", "message": "Daily cleanup task completed"}
|
|
||||||
@ -13,23 +13,6 @@ from .file_utils import (
|
|||||||
save_processed_files_log
|
save_processed_files_log
|
||||||
)
|
)
|
||||||
|
|
||||||
from .dataset_manager import (
|
|
||||||
download_dataset_files,
|
|
||||||
generate_dataset_structure,
|
|
||||||
remove_dataset_directory,
|
|
||||||
remove_dataset_directory_by_key
|
|
||||||
)
|
|
||||||
|
|
||||||
from .project_manager import (
|
|
||||||
generate_project_readme,
|
|
||||||
save_project_readme,
|
|
||||||
get_project_status,
|
|
||||||
remove_project,
|
|
||||||
list_projects,
|
|
||||||
get_project_stats
|
|
||||||
)
|
|
||||||
|
|
||||||
|
|
||||||
from .system_optimizer import (
|
from .system_optimizer import (
|
||||||
setup_system_optimizations
|
setup_system_optimizations
|
||||||
)
|
)
|
||||||
@ -59,11 +42,6 @@ from .api_models import (
|
|||||||
ProjectListResponse,
|
ProjectListResponse,
|
||||||
ProjectStatsResponse,
|
ProjectStatsResponse,
|
||||||
ProjectActionResponse,
|
ProjectActionResponse,
|
||||||
QueueTaskRequest,
|
|
||||||
IncrementalTaskRequest,
|
|
||||||
QueueTaskResponse,
|
|
||||||
QueueStatusResponse,
|
|
||||||
TaskStatusResponse,
|
|
||||||
create_success_response,
|
create_success_response,
|
||||||
create_error_response,
|
create_error_response,
|
||||||
create_chat_response,
|
create_chat_response,
|
||||||
@ -90,20 +68,6 @@ __all__ = [
|
|||||||
'load_processed_files_log',
|
'load_processed_files_log',
|
||||||
'save_processed_files_log',
|
'save_processed_files_log',
|
||||||
|
|
||||||
# dataset_manager
|
|
||||||
'download_dataset_files',
|
|
||||||
'generate_dataset_structure',
|
|
||||||
'remove_dataset_directory',
|
|
||||||
'remove_dataset_directory_by_key',
|
|
||||||
|
|
||||||
# project_manager
|
|
||||||
'generate_project_readme',
|
|
||||||
'save_project_readme',
|
|
||||||
'get_project_status',
|
|
||||||
'remove_project',
|
|
||||||
'list_projects',
|
|
||||||
'get_project_stats',
|
|
||||||
|
|
||||||
# agent_pool
|
# agent_pool
|
||||||
'AgentPool',
|
'AgentPool',
|
||||||
'get_agent_pool',
|
'get_agent_pool',
|
||||||
@ -128,10 +92,6 @@ __all__ = [
|
|||||||
'ProjectListResponse',
|
'ProjectListResponse',
|
||||||
'ProjectStatsResponse',
|
'ProjectStatsResponse',
|
||||||
'ProjectActionResponse',
|
'ProjectActionResponse',
|
||||||
'QueueTaskRequest',
|
|
||||||
'QueueTaskResponse',
|
|
||||||
'QueueStatusResponse',
|
|
||||||
'TaskStatusResponse',
|
|
||||||
'create_success_response',
|
'create_success_response',
|
||||||
'create_error_response',
|
'create_error_response',
|
||||||
'create_chat_response',
|
'create_chat_response',
|
||||||
|
|||||||
@ -224,133 +224,6 @@ def create_error_response(message: str, error_type: str = "error", **kwargs) ->
|
|||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
class QueueTaskRequest(BaseModel):
|
|
||||||
"""Queue task request model"""
|
|
||||||
dataset_id: str
|
|
||||||
files: Optional[Dict[str, List[str]]] = Field(default=None, description="Files organized by key groups. Each key maps to a list of file paths (supports zip files)")
|
|
||||||
upload_folder: Optional[Dict[str, str]] = Field(default=None, description="Upload folders organized by group names. Each key maps to a folder name. Example: {'group1': 'my_project1', 'group2': 'my_project2'}")
|
|
||||||
priority: Optional[int] = Field(default=0, description="Task priority (higher number = higher priority)")
|
|
||||||
delay: Optional[int] = Field(default=0, description="Delay execution by N seconds")
|
|
||||||
|
|
||||||
model_config = ConfigDict(extra='allow')
|
|
||||||
|
|
||||||
@field_validator('upload_folder', mode='before')
|
|
||||||
@classmethod
|
|
||||||
def validate_upload_folder(cls, v):
|
|
||||||
"""Validate upload_folder dict format"""
|
|
||||||
if v is None:
|
|
||||||
return None
|
|
||||||
if isinstance(v, dict):
|
|
||||||
# Validate dict format
|
|
||||||
for key, value in v.items():
|
|
||||||
if not isinstance(key, str):
|
|
||||||
raise ValueError(f"Key in upload_folder dict must be string, got {type(key)}")
|
|
||||||
if not isinstance(value, str):
|
|
||||||
raise ValueError(f"Value in upload_folder dict must be string (folder name), got {type(value)} for key '{key}'")
|
|
||||||
return v
|
|
||||||
else:
|
|
||||||
raise ValueError(f"upload_folder must be a dict with group names as keys and folder names as values, got {type(v)}")
|
|
||||||
|
|
||||||
@field_validator('files', mode='before')
|
|
||||||
@classmethod
|
|
||||||
def validate_files(cls, v):
|
|
||||||
"""Validate dict format with key-grouped files"""
|
|
||||||
if v is None:
|
|
||||||
return None
|
|
||||||
if isinstance(v, dict):
|
|
||||||
# Validate dict format
|
|
||||||
for key, value in v.items():
|
|
||||||
if not isinstance(key, str):
|
|
||||||
raise ValueError(f"Key in files dict must be string, got {type(key)}")
|
|
||||||
if not isinstance(value, list):
|
|
||||||
raise ValueError(f"Value in files dict must be list, got {type(value)} for key '{key}'")
|
|
||||||
for item in value:
|
|
||||||
if not isinstance(item, str):
|
|
||||||
raise ValueError(f"File paths must be strings, got {type(item)} in key '{key}'")
|
|
||||||
return v
|
|
||||||
else:
|
|
||||||
raise ValueError(f"Files must be a dict with key groups, got {type(v)}")
|
|
||||||
|
|
||||||
|
|
||||||
class IncrementalTaskRequest(BaseModel):
|
|
||||||
"""Incremental file processing request model"""
|
|
||||||
dataset_id: str = Field(..., description="Dataset ID for the project")
|
|
||||||
files_to_add: Optional[Dict[str, List[str]]] = Field(default=None, description="Files to add organized by key groups")
|
|
||||||
files_to_remove: Optional[Dict[str, List[str]]] = Field(default=None, description="Files to remove organized by key groups")
|
|
||||||
system_prompt: Optional[str] = None
|
|
||||||
mcp_settings: Optional[List[Dict]] = None
|
|
||||||
priority: Optional[int] = Field(default=0, description="Task priority (higher number = higher priority)")
|
|
||||||
delay: Optional[int] = Field(default=0, description="Delay execution by N seconds")
|
|
||||||
|
|
||||||
model_config = ConfigDict(extra='allow')
|
|
||||||
|
|
||||||
@field_validator('files_to_add', mode='before')
|
|
||||||
@classmethod
|
|
||||||
def validate_files_to_add(cls, v):
|
|
||||||
"""Validate files_to_add dict format"""
|
|
||||||
if v is None:
|
|
||||||
return None
|
|
||||||
if isinstance(v, dict):
|
|
||||||
for key, value in v.items():
|
|
||||||
if not isinstance(key, str):
|
|
||||||
raise ValueError(f"Key in files_to_add dict must be string, got {type(key)}")
|
|
||||||
if not isinstance(value, list):
|
|
||||||
raise ValueError(f"Value in files_to_add dict must be list, got {type(value)} for key '{key}'")
|
|
||||||
for item in value:
|
|
||||||
if not isinstance(item, str):
|
|
||||||
raise ValueError(f"File paths must be strings, got {type(item)} in key '{key}'")
|
|
||||||
return v
|
|
||||||
else:
|
|
||||||
raise ValueError(f"files_to_add must be a dict with key groups, got {type(v)}")
|
|
||||||
|
|
||||||
@field_validator('files_to_remove', mode='before')
|
|
||||||
@classmethod
|
|
||||||
def validate_files_to_remove(cls, v):
|
|
||||||
"""Validate files_to_remove dict format"""
|
|
||||||
if v is None:
|
|
||||||
return None
|
|
||||||
if isinstance(v, dict):
|
|
||||||
for key, value in v.items():
|
|
||||||
if not isinstance(key, str):
|
|
||||||
raise ValueError(f"Key in files_to_remove dict must be string, got {type(key)}")
|
|
||||||
if not isinstance(value, list):
|
|
||||||
raise ValueError(f"Value in files_to_remove dict must be list, got {type(value)} for key '{key}'")
|
|
||||||
for item in value:
|
|
||||||
if not isinstance(item, str):
|
|
||||||
raise ValueError(f"File paths must be strings, got {type(item)} in key '{key}'")
|
|
||||||
return v
|
|
||||||
else:
|
|
||||||
raise ValueError(f"files_to_remove must be a dict with key groups, got {type(v)}")
|
|
||||||
|
|
||||||
|
|
||||||
class QueueTaskResponse(BaseModel):
|
|
||||||
"""Queue task response model"""
|
|
||||||
success: bool
|
|
||||||
message: str
|
|
||||||
dataset_id: str
|
|
||||||
task_id: Optional[str] = None
|
|
||||||
task_status: Optional[str] = None
|
|
||||||
estimated_processing_time: Optional[int] = None # seconds
|
|
||||||
|
|
||||||
|
|
||||||
class QueueStatusResponse(BaseModel):
|
|
||||||
"""Queue status response model"""
|
|
||||||
success: bool
|
|
||||||
message: str
|
|
||||||
queue_stats: Dict[str, Any]
|
|
||||||
pending_tasks: List[Dict[str, Any]]
|
|
||||||
|
|
||||||
|
|
||||||
class TaskStatusResponse(BaseModel):
|
|
||||||
"""Task status response model"""
|
|
||||||
success: bool
|
|
||||||
message: str
|
|
||||||
task_id: str
|
|
||||||
task_status: Optional[str] = None
|
|
||||||
task_result: Optional[Dict[str, Any]] = None
|
|
||||||
error: Optional[str] = None
|
|
||||||
|
|
||||||
|
|
||||||
def create_chat_response(
|
def create_chat_response(
|
||||||
messages: List[Message],
|
messages: List[Message],
|
||||||
model: str,
|
model: str,
|
||||||
|
|||||||
@ -1,439 +0,0 @@
|
|||||||
#!/usr/bin/env python3
|
|
||||||
"""
|
|
||||||
Data merging functions for combining processed file results.
|
|
||||||
"""
|
|
||||||
|
|
||||||
import os
|
|
||||||
import pickle
|
|
||||||
import logging
|
|
||||||
from typing import Dict
|
|
||||||
from utils.settings import EMBEDDING_MODEL_NAME
|
|
||||||
|
|
||||||
# Configure logger
|
|
||||||
logger = logging.getLogger('app')
|
|
||||||
|
|
||||||
# Try to import numpy, but handle if missing
|
|
||||||
try:
|
|
||||||
import numpy as np
|
|
||||||
NUMPY_SUPPORT = True
|
|
||||||
except ImportError:
|
|
||||||
logger.warning("NumPy not available, some embedding features may be limited")
|
|
||||||
NUMPY_SUPPORT = False
|
|
||||||
|
|
||||||
|
|
||||||
def merge_documents_by_group(unique_id: str, group_name: str) -> Dict:
|
|
||||||
"""Merge all document.txt files in a group into a single document."""
|
|
||||||
|
|
||||||
processed_group_dir = os.path.join("projects", "data", unique_id, "processed", group_name)
|
|
||||||
dataset_group_dir = os.path.join("projects", "data", unique_id, "datasets", group_name)
|
|
||||||
os.makedirs(dataset_group_dir, exist_ok=True)
|
|
||||||
|
|
||||||
merged_document_path = os.path.join(dataset_group_dir, "document.txt")
|
|
||||||
|
|
||||||
result = {
|
|
||||||
"success": False,
|
|
||||||
"merged_document_path": merged_document_path,
|
|
||||||
"source_files": [],
|
|
||||||
"total_pages": 0,
|
|
||||||
"total_characters": 0,
|
|
||||||
"error": None
|
|
||||||
}
|
|
||||||
|
|
||||||
try:
|
|
||||||
# Find all document.txt files in the processed directory
|
|
||||||
document_files = []
|
|
||||||
if os.path.exists(processed_group_dir):
|
|
||||||
for item in os.listdir(processed_group_dir):
|
|
||||||
item_path = os.path.join(processed_group_dir, item)
|
|
||||||
if os.path.isdir(item_path):
|
|
||||||
document_path = os.path.join(item_path, "document.txt")
|
|
||||||
if os.path.exists(document_path) and os.path.getsize(document_path) > 0:
|
|
||||||
document_files.append((item, document_path))
|
|
||||||
|
|
||||||
if not document_files:
|
|
||||||
result["error"] = "No document files found to merge"
|
|
||||||
return result
|
|
||||||
|
|
||||||
# Merge all documents with page separators
|
|
||||||
merged_content = []
|
|
||||||
total_characters = 0
|
|
||||||
|
|
||||||
for filename_stem, document_path in sorted(document_files):
|
|
||||||
try:
|
|
||||||
with open(document_path, 'r', encoding='utf-8') as f:
|
|
||||||
content = f.read().strip()
|
|
||||||
|
|
||||||
if content:
|
|
||||||
merged_content.append(f"# Page {filename_stem}")
|
|
||||||
merged_content.append(content)
|
|
||||||
total_characters += len(content)
|
|
||||||
result["source_files"].append(filename_stem)
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
logger.error(f"Error reading document file {document_path}: {str(e)}")
|
|
||||||
continue
|
|
||||||
|
|
||||||
if merged_content:
|
|
||||||
# Write merged document
|
|
||||||
with open(merged_document_path, 'w', encoding='utf-8') as f:
|
|
||||||
f.write('\n\n'.join(merged_content))
|
|
||||||
|
|
||||||
result["total_pages"] = len(document_files)
|
|
||||||
result["total_characters"] = total_characters
|
|
||||||
result["success"] = True
|
|
||||||
|
|
||||||
else:
|
|
||||||
result["error"] = "No valid content found in document files"
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
result["error"] = f"Document merging failed: {str(e)}"
|
|
||||||
logger.error(f"Error merging documents for group {group_name}: {str(e)}")
|
|
||||||
|
|
||||||
return result
|
|
||||||
|
|
||||||
|
|
||||||
def merge_paginations_by_group(unique_id: str, group_name: str) -> Dict:
|
|
||||||
"""Merge all pagination.txt files in a group."""
|
|
||||||
|
|
||||||
processed_group_dir = os.path.join("projects", "data", unique_id, "processed", group_name)
|
|
||||||
dataset_group_dir = os.path.join("projects", "data", unique_id, "datasets", group_name)
|
|
||||||
os.makedirs(dataset_group_dir, exist_ok=True)
|
|
||||||
|
|
||||||
merged_pagination_path = os.path.join(dataset_group_dir, "pagination.txt")
|
|
||||||
|
|
||||||
result = {
|
|
||||||
"success": False,
|
|
||||||
"merged_pagination_path": merged_pagination_path,
|
|
||||||
"source_files": [],
|
|
||||||
"total_lines": 0,
|
|
||||||
"error": None
|
|
||||||
}
|
|
||||||
|
|
||||||
try:
|
|
||||||
# Find all pagination.txt files
|
|
||||||
pagination_files = []
|
|
||||||
if os.path.exists(processed_group_dir):
|
|
||||||
for item in os.listdir(processed_group_dir):
|
|
||||||
item_path = os.path.join(processed_group_dir, item)
|
|
||||||
if os.path.isdir(item_path):
|
|
||||||
pagination_path = os.path.join(item_path, "pagination.txt")
|
|
||||||
if os.path.exists(pagination_path) and os.path.getsize(pagination_path) > 0:
|
|
||||||
pagination_files.append((item, pagination_path))
|
|
||||||
|
|
||||||
if not pagination_files:
|
|
||||||
result["error"] = "No pagination files found to merge"
|
|
||||||
return result
|
|
||||||
|
|
||||||
# Merge all pagination files
|
|
||||||
merged_lines = []
|
|
||||||
|
|
||||||
for filename_stem, pagination_path in sorted(pagination_files):
|
|
||||||
try:
|
|
||||||
with open(pagination_path, 'r', encoding='utf-8') as f:
|
|
||||||
lines = f.readlines()
|
|
||||||
|
|
||||||
for line in lines:
|
|
||||||
line = line.strip()
|
|
||||||
if line:
|
|
||||||
merged_lines.append(line)
|
|
||||||
|
|
||||||
result["source_files"].append(filename_stem)
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
logger.error(f"Error reading pagination file {pagination_path}: {str(e)}")
|
|
||||||
continue
|
|
||||||
|
|
||||||
if merged_lines:
|
|
||||||
# Write merged pagination
|
|
||||||
with open(merged_pagination_path, 'w', encoding='utf-8') as f:
|
|
||||||
for line in merged_lines:
|
|
||||||
f.write(f"{line}\n")
|
|
||||||
|
|
||||||
result["total_lines"] = len(merged_lines)
|
|
||||||
result["success"] = True
|
|
||||||
|
|
||||||
else:
|
|
||||||
result["error"] = "No valid pagination data found"
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
result["error"] = f"Pagination merging failed: {str(e)}"
|
|
||||||
logger.error(f"Error merging paginations for group {group_name}: {str(e)}")
|
|
||||||
|
|
||||||
return result
|
|
||||||
|
|
||||||
|
|
||||||
def merge_embeddings_by_group(unique_id: str, group_name: str) -> Dict:
|
|
||||||
"""Merge all embedding.pkl files in a group."""
|
|
||||||
|
|
||||||
processed_group_dir = os.path.join("projects", "data", unique_id, "processed", group_name)
|
|
||||||
dataset_group_dir = os.path.join("projects", "data", unique_id, "datasets", group_name)
|
|
||||||
os.makedirs(dataset_group_dir, exist_ok=True)
|
|
||||||
|
|
||||||
merged_embedding_path = os.path.join(dataset_group_dir, "embedding.pkl")
|
|
||||||
|
|
||||||
result = {
|
|
||||||
"success": False,
|
|
||||||
"merged_embedding_path": merged_embedding_path,
|
|
||||||
"source_files": [],
|
|
||||||
"total_chunks": 0,
|
|
||||||
"total_dimensions": 0,
|
|
||||||
"error": None
|
|
||||||
}
|
|
||||||
|
|
||||||
try:
|
|
||||||
# Find all embedding.pkl files
|
|
||||||
embedding_files = []
|
|
||||||
if os.path.exists(processed_group_dir):
|
|
||||||
for item in os.listdir(processed_group_dir):
|
|
||||||
item_path = os.path.join(processed_group_dir, item)
|
|
||||||
if os.path.isdir(item_path):
|
|
||||||
embedding_path = os.path.join(item_path, "embedding.pkl")
|
|
||||||
if os.path.exists(embedding_path) and os.path.getsize(embedding_path) > 0:
|
|
||||||
embedding_files.append((item, embedding_path))
|
|
||||||
|
|
||||||
if not embedding_files:
|
|
||||||
result["error"] = "No embedding files found to merge"
|
|
||||||
return result
|
|
||||||
|
|
||||||
# Load and merge all embedding data
|
|
||||||
all_chunks = []
|
|
||||||
all_embeddings = [] # Fix: collect all embedding vectors
|
|
||||||
total_chunks = 0
|
|
||||||
dimensions = 0
|
|
||||||
chunking_strategy = 'unknown'
|
|
||||||
chunking_params = {}
|
|
||||||
model_path = EMBEDDING_MODEL_NAME
|
|
||||||
|
|
||||||
for filename_stem, embedding_path in sorted(embedding_files):
|
|
||||||
try:
|
|
||||||
with open(embedding_path, 'rb') as f:
|
|
||||||
embedding_data = pickle.load(f)
|
|
||||||
|
|
||||||
if isinstance(embedding_data, dict) and 'chunks' in embedding_data:
|
|
||||||
chunks = embedding_data['chunks']
|
|
||||||
|
|
||||||
# Get embedding vectors (critical fix)
|
|
||||||
if 'embeddings' in embedding_data:
|
|
||||||
embeddings = embedding_data['embeddings']
|
|
||||||
all_embeddings.append(embeddings)
|
|
||||||
|
|
||||||
# Get model metadata from the first file
|
|
||||||
if 'model_path' in embedding_data:
|
|
||||||
model_path = embedding_data['model_path']
|
|
||||||
if 'chunking_strategy' in embedding_data:
|
|
||||||
chunking_strategy = embedding_data['chunking_strategy']
|
|
||||||
if 'chunking_params' in embedding_data:
|
|
||||||
chunking_params = embedding_data['chunking_params']
|
|
||||||
|
|
||||||
# Add source file metadata to each chunk
|
|
||||||
for chunk in chunks:
|
|
||||||
if isinstance(chunk, dict):
|
|
||||||
chunk['source_file'] = filename_stem
|
|
||||||
chunk['source_group'] = group_name
|
|
||||||
elif isinstance(chunk, str):
|
|
||||||
# If the chunk is a string, keep it unchanged
|
|
||||||
pass
|
|
||||||
|
|
||||||
all_chunks.extend(chunks)
|
|
||||||
total_chunks += len(chunks)
|
|
||||||
|
|
||||||
result["source_files"].append(filename_stem)
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
logger.error(f"Error loading embedding file {embedding_path}: {str(e)}")
|
|
||||||
continue
|
|
||||||
|
|
||||||
if all_chunks and all_embeddings:
|
|
||||||
# Merge all embedding vectors
|
|
||||||
try:
|
|
||||||
# Try merging tensors with torch
|
|
||||||
import torch
|
|
||||||
if all(isinstance(emb, torch.Tensor) for emb in all_embeddings):
|
|
||||||
merged_embeddings = torch.cat(all_embeddings, dim=0)
|
|
||||||
dimensions = merged_embeddings.shape[1]
|
|
||||||
else:
|
|
||||||
# If the values are not tensors, try converting them to numpy
|
|
||||||
import numpy as np
|
|
||||||
if NUMPY_SUPPORT:
|
|
||||||
np_embeddings = []
|
|
||||||
for emb in all_embeddings:
|
|
||||||
if hasattr(emb, 'numpy'):
|
|
||||||
np_embeddings.append(emb.numpy())
|
|
||||||
elif isinstance(emb, np.ndarray):
|
|
||||||
np_embeddings.append(emb)
|
|
||||||
else:
|
|
||||||
# If conversion fails, skip this file
|
|
||||||
logger.warning(f"Warning: Cannot convert embedding to numpy from file {filename_stem}")
|
|
||||||
continue
|
|
||||||
|
|
||||||
if np_embeddings:
|
|
||||||
merged_embeddings = np.concatenate(np_embeddings, axis=0)
|
|
||||||
dimensions = merged_embeddings.shape[1]
|
|
||||||
else:
|
|
||||||
result["error"] = "No valid embedding tensors could be merged"
|
|
||||||
return result
|
|
||||||
else:
|
|
||||||
result["error"] = "NumPy not available for merging embeddings"
|
|
||||||
return result
|
|
||||||
|
|
||||||
except ImportError:
|
|
||||||
# If torch is unavailable, try using numpy
|
|
||||||
if NUMPY_SUPPORT:
|
|
||||||
import numpy as np
|
|
||||||
np_embeddings = []
|
|
||||||
for emb in all_embeddings:
|
|
||||||
if hasattr(emb, 'numpy'):
|
|
||||||
np_embeddings.append(emb.numpy())
|
|
||||||
elif isinstance(emb, np.ndarray):
|
|
||||||
np_embeddings.append(emb)
|
|
||||||
else:
|
|
||||||
logger.warning(f"Warning: Cannot convert embedding to numpy from file {filename_stem}")
|
|
||||||
continue
|
|
||||||
|
|
||||||
if np_embeddings:
|
|
||||||
merged_embeddings = np.concatenate(np_embeddings, axis=0)
|
|
||||||
dimensions = merged_embeddings.shape[1]
|
|
||||||
else:
|
|
||||||
result["error"] = "No valid embedding tensors could be merged"
|
|
||||||
return result
|
|
||||||
else:
|
|
||||||
result["error"] = "Neither torch nor numpy available for merging embeddings"
|
|
||||||
return result
|
|
||||||
except Exception as e:
|
|
||||||
result["error"] = f"Failed to merge embedding tensors: {str(e)}"
|
|
||||||
logger.error(f"Error merging embedding tensors: {str(e)}")
|
|
||||||
return result
|
|
||||||
|
|
||||||
# Create merged embedding data structure
|
|
||||||
merged_embedding_data = {
|
|
||||||
'chunks': all_chunks,
|
|
||||||
'embeddings': merged_embeddings, # Critical fix: include the embeddings key
|
|
||||||
'total_chunks': total_chunks,
|
|
||||||
'dimensions': dimensions,
|
|
||||||
'source_files': result["source_files"],
|
|
||||||
'group_name': group_name,
|
|
||||||
'merged_at': str(__import__('time').time()),
|
|
||||||
'chunking_strategy': chunking_strategy,
|
|
||||||
'chunking_params': chunking_params,
|
|
||||||
'model_path': model_path
|
|
||||||
}
|
|
||||||
|
|
||||||
# Save merged embeddings
|
|
||||||
with open(merged_embedding_path, 'wb') as f:
|
|
||||||
pickle.dump(merged_embedding_data, f)
|
|
||||||
|
|
||||||
result["total_chunks"] = total_chunks
|
|
||||||
result["total_dimensions"] = dimensions
|
|
||||||
result["success"] = True
|
|
||||||
|
|
||||||
else:
|
|
||||||
result["error"] = "No valid embedding data found"
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
result["error"] = f"Embedding merging failed: {str(e)}"
|
|
||||||
logger.error(f"Error merging embeddings for group {group_name}: {str(e)}")
|
|
||||||
|
|
||||||
return result
|
|
||||||
|
|
||||||
|
|
||||||
def merge_all_data_by_group(unique_id: str, group_name: str) -> Dict:
|
|
||||||
"""Merge documents, paginations, and embeddings for a group."""
|
|
||||||
|
|
||||||
merge_results = {
|
|
||||||
"group_name": group_name,
|
|
||||||
"unique_id": unique_id,
|
|
||||||
"success": True,
|
|
||||||
"document_merge": None,
|
|
||||||
"pagination_merge": None,
|
|
||||||
"embedding_merge": None,
|
|
||||||
"errors": []
|
|
||||||
}
|
|
||||||
|
|
||||||
# Merge documents
|
|
||||||
document_result = merge_documents_by_group(unique_id, group_name)
|
|
||||||
merge_results["document_merge"] = document_result
|
|
||||||
|
|
||||||
if not document_result["success"]:
|
|
||||||
merge_results["success"] = False
|
|
||||||
merge_results["errors"].append(f"Document merge failed: {document_result['error']}")
|
|
||||||
|
|
||||||
# Merge paginations
|
|
||||||
pagination_result = merge_paginations_by_group(unique_id, group_name)
|
|
||||||
merge_results["pagination_merge"] = pagination_result
|
|
||||||
|
|
||||||
if not pagination_result["success"]:
|
|
||||||
merge_results["success"] = False
|
|
||||||
merge_results["errors"].append(f"Pagination merge failed: {pagination_result['error']}")
|
|
||||||
|
|
||||||
# Merge embeddings
|
|
||||||
embedding_result = merge_embeddings_by_group(unique_id, group_name)
|
|
||||||
merge_results["embedding_merge"] = embedding_result
|
|
||||||
|
|
||||||
if not embedding_result["success"]:
|
|
||||||
merge_results["success"] = False
|
|
||||||
merge_results["errors"].append(f"Embedding merge failed: {embedding_result['error']}")
|
|
||||||
|
|
||||||
return merge_results
|
|
||||||
|
|
||||||
|
|
||||||
def get_group_merge_status(unique_id: str, group_name: str) -> Dict:
|
|
||||||
"""Get the status of merged data for a group."""
|
|
||||||
|
|
||||||
dataset_group_dir = os.path.join("projects", "data", unique_id, "datasets", group_name)
|
|
||||||
|
|
||||||
status = {
|
|
||||||
"group_name": group_name,
|
|
||||||
"unique_id": unique_id,
|
|
||||||
"dataset_dir_exists": os.path.exists(dataset_group_dir),
|
|
||||||
"document_exists": False,
|
|
||||||
"document_size": 0,
|
|
||||||
"pagination_exists": False,
|
|
||||||
"pagination_size": 0,
|
|
||||||
"embedding_exists": False,
|
|
||||||
"embedding_size": 0,
|
|
||||||
"merge_complete": False
|
|
||||||
}
|
|
||||||
|
|
||||||
if os.path.exists(dataset_group_dir):
|
|
||||||
document_path = os.path.join(dataset_group_dir, "document.txt")
|
|
||||||
pagination_path = os.path.join(dataset_group_dir, "pagination.txt")
|
|
||||||
embedding_path = os.path.join(dataset_group_dir, "embedding.pkl")
|
|
||||||
|
|
||||||
if os.path.exists(document_path):
|
|
||||||
status["document_exists"] = True
|
|
||||||
status["document_size"] = os.path.getsize(document_path)
|
|
||||||
|
|
||||||
if os.path.exists(pagination_path):
|
|
||||||
status["pagination_exists"] = True
|
|
||||||
status["pagination_size"] = os.path.getsize(pagination_path)
|
|
||||||
|
|
||||||
if os.path.exists(embedding_path):
|
|
||||||
status["embedding_exists"] = True
|
|
||||||
status["embedding_size"] = os.path.getsize(embedding_path)
|
|
||||||
|
|
||||||
# Check if all files exist and are not empty
|
|
||||||
if (status["document_exists"] and status["document_size"] > 0 and
|
|
||||||
status["pagination_exists"] and status["pagination_size"] > 0 and
|
|
||||||
status["embedding_exists"] and status["embedding_size"] > 0):
|
|
||||||
status["merge_complete"] = True
|
|
||||||
|
|
||||||
return status
|
|
||||||
|
|
||||||
|
|
||||||
def cleanup_dataset_group(unique_id: str, group_name: str) -> bool:
|
|
||||||
"""Clean up merged dataset files for a group."""
|
|
||||||
|
|
||||||
dataset_group_dir = os.path.join("projects", "data", unique_id, "datasets", group_name)
|
|
||||||
|
|
||||||
try:
|
|
||||||
if os.path.exists(dataset_group_dir):
|
|
||||||
import shutil
|
|
||||||
shutil.rmtree(dataset_group_dir)
|
|
||||||
logger.info(f"Cleaned up dataset group: {group_name}")
|
|
||||||
return True
|
|
||||||
else:
|
|
||||||
return True # Nothing to clean up
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
logger.error(f"Error cleaning up dataset group {group_name}: {str(e)}")
|
|
||||||
return False
|
|
||||||
@ -1,297 +0,0 @@
|
|||||||
#!/usr/bin/env python3
|
|
||||||
"""
|
|
||||||
Dataset management functions for organizing and processing datasets.
|
|
||||||
New implementation with per-file processing and group merging.
|
|
||||||
"""
|
|
||||||
|
|
||||||
import os
|
|
||||||
import json
|
|
||||||
import logging
|
|
||||||
from typing import Dict, List
|
|
||||||
|
|
||||||
# Configure logger
|
|
||||||
logger = logging.getLogger('app')
|
|
||||||
|
|
||||||
# Import new modules
|
|
||||||
from utils.file_manager import (
|
|
||||||
ensure_directories, sync_files_to_group, cleanup_orphaned_files,
|
|
||||||
get_group_files_list
|
|
||||||
)
|
|
||||||
from utils.single_file_processor import (
|
|
||||||
process_single_file, check_file_already_processed
|
|
||||||
)
|
|
||||||
from utils.data_merger import (
|
|
||||||
merge_all_data_by_group, cleanup_dataset_group
|
|
||||||
)
|
|
||||||
|
|
||||||
|
|
||||||
async def download_dataset_files(unique_id: str, files: Dict[str, List[str]], incremental_mode: bool = False) -> Dict[str, List[str]]:
|
|
||||||
"""
|
|
||||||
Process dataset files with new architecture:
|
|
||||||
1. Sync files to group directories
|
|
||||||
2. Process each file individually
|
|
||||||
3. Merge results by group
|
|
||||||
4. Clean up orphaned files (only in non-incremental mode)
|
|
||||||
|
|
||||||
Args:
|
|
||||||
unique_id: Project ID
|
|
||||||
files: Dictionary of files to process, grouped by key
|
|
||||||
incremental_mode: If True, preserve existing files and only process new ones
|
|
||||||
"""
|
|
||||||
if not files:
|
|
||||||
return {}
|
|
||||||
|
|
||||||
logger.info(f"Starting {'incremental' if incremental_mode else 'full'} file processing for project: {unique_id}")
|
|
||||||
|
|
||||||
# Ensure project directories exist
|
|
||||||
ensure_directories(unique_id)
|
|
||||||
|
|
||||||
# Step 1: Sync files to group directories
|
|
||||||
logger.info("Step 1: Syncing files to group directories...")
|
|
||||||
synced_files, failed_files = sync_files_to_group(unique_id, files, incremental_mode)
|
|
||||||
|
|
||||||
# Step 2: Detect changes and cleanup orphaned files (only in non-incremental mode)
|
|
||||||
from utils.file_manager import detect_file_changes
|
|
||||||
changes = detect_file_changes(unique_id, files, incremental_mode)
|
|
||||||
|
|
||||||
# Only cleanup orphaned files in non-incremental mode or when files are explicitly removed
|
|
||||||
if not incremental_mode and any(changes["removed"].values()):
|
|
||||||
logger.info("Step 2: Cleaning up orphaned files...")
|
|
||||||
removed_files = cleanup_orphaned_files(unique_id, changes)
|
|
||||||
logger.info(f"Removed orphaned files: {removed_files}")
|
|
||||||
elif incremental_mode:
|
|
||||||
logger.info("Step 2: Skipping cleanup in incremental mode to preserve existing files")
|
|
||||||
|
|
||||||
# Step 3: Process individual files
|
|
||||||
logger.info("Step 3: Processing individual files...")
|
|
||||||
processed_files_by_group = {}
|
|
||||||
processing_results = {}
|
|
||||||
|
|
||||||
for group_name, file_list in files.items():
|
|
||||||
processed_files_by_group[group_name] = []
|
|
||||||
processing_results[group_name] = []
|
|
||||||
|
|
||||||
for file_path in file_list:
|
|
||||||
filename = os.path.basename(file_path)
|
|
||||||
|
|
||||||
# Get local file path
|
|
||||||
local_path = os.path.join("projects", "data", unique_id, "files", group_name, filename)
|
|
||||||
|
|
||||||
# Skip if file doesn't exist (might be remote file that failed to download)
|
|
||||||
if not os.path.exists(local_path) and not file_path.startswith(('http://', 'https://')):
|
|
||||||
logger.warning(f"Skipping non-existent file: {filename}")
|
|
||||||
continue
|
|
||||||
|
|
||||||
# Check if already processed
|
|
||||||
if check_file_already_processed(unique_id, group_name, filename):
|
|
||||||
logger.info(f"Skipping already processed file: {filename}")
|
|
||||||
processed_files_by_group[group_name].append(filename)
|
|
||||||
processing_results[group_name].append({
|
|
||||||
"filename": filename,
|
|
||||||
"status": "existing"
|
|
||||||
})
|
|
||||||
continue
|
|
||||||
|
|
||||||
# Process the file
|
|
||||||
logger.info(f"Processing file: {filename} (group: {group_name})")
|
|
||||||
result = await process_single_file(unique_id, group_name, filename, file_path, local_path)
|
|
||||||
processing_results[group_name].append(result)
|
|
||||||
|
|
||||||
if result["success"]:
|
|
||||||
processed_files_by_group[group_name].append(filename)
|
|
||||||
logger.info(f" Successfully processed {filename}")
|
|
||||||
else:
|
|
||||||
logger.error(f" Failed to process {filename}: {result['error']}")
|
|
||||||
|
|
||||||
# Step 4: Merge results by group
|
|
||||||
logger.info("Step 4: Merging results by group...")
|
|
||||||
merge_results = {}
|
|
||||||
|
|
||||||
for group_name in processed_files_by_group.keys():
|
|
||||||
# Get all files in the group (including existing ones)
|
|
||||||
group_files = get_group_files_list(unique_id, group_name)
|
|
||||||
|
|
||||||
if group_files:
|
|
||||||
logger.info(f"Merging group: {group_name} with {len(group_files)} files")
|
|
||||||
merge_result = merge_all_data_by_group(unique_id, group_name)
|
|
||||||
merge_results[group_name] = merge_result
|
|
||||||
|
|
||||||
if merge_result["success"]:
|
|
||||||
logger.info(f" Successfully merged group {group_name}")
|
|
||||||
else:
|
|
||||||
logger.error(f" Failed to merge group {group_name}: {merge_result['errors']}")
|
|
||||||
|
|
||||||
# Step 5: Save processing log
|
|
||||||
logger.info("Step 5: Saving processing log...")
|
|
||||||
await save_processing_log(unique_id, files, synced_files, processing_results, merge_results)
|
|
||||||
|
|
||||||
logger.info(f"File processing completed for project: {unique_id}")
|
|
||||||
return processed_files_by_group
|
|
||||||
|
|
||||||
|
|
||||||
async def save_processing_log(
|
|
||||||
unique_id: str,
|
|
||||||
requested_files: Dict[str, List[str]],
|
|
||||||
synced_files: Dict,
|
|
||||||
processing_results: Dict,
|
|
||||||
merge_results: Dict
|
|
||||||
):
|
|
||||||
"""Save comprehensive processing log."""
|
|
||||||
|
|
||||||
log_data = {
|
|
||||||
"unique_id": unique_id,
|
|
||||||
"timestamp": str(os.path.getmtime("projects") if os.path.exists("projects") else 0),
|
|
||||||
"requested_files": requested_files,
|
|
||||||
"synced_files": synced_files,
|
|
||||||
"processing_results": processing_results,
|
|
||||||
"merge_results": merge_results,
|
|
||||||
"summary": {
|
|
||||||
"total_groups": len(requested_files),
|
|
||||||
"total_files_requested": sum(len(files) for files in requested_files.values()),
|
|
||||||
"total_files_processed": sum(
|
|
||||||
len([r for r in results if r.get("success", False)])
|
|
||||||
for results in processing_results.values()
|
|
||||||
),
|
|
||||||
"total_groups_merged": len([r for r in merge_results.values() if r.get("success", False)])
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
log_file_path = os.path.join("projects", "data", unique_id, "processing_log.json")
|
|
||||||
try:
|
|
||||||
with open(log_file_path, 'w', encoding='utf-8') as f:
|
|
||||||
json.dump(log_data, f, ensure_ascii=False, indent=2)
|
|
||||||
logger.info(f"Processing log saved to: {log_file_path}")
|
|
||||||
except Exception as e:
|
|
||||||
logger.error(f"Error saving processing log: {str(e)}")
|
|
||||||
|
|
||||||
|
|
||||||
def generate_dataset_structure(unique_id: str) -> str:
|
|
||||||
"""Generate a string representation of the dataset structure"""
|
|
||||||
project_dir = os.path.join("projects", "data", unique_id)
|
|
||||||
structure = []
|
|
||||||
|
|
||||||
def add_directory_contents(dir_path: str, prefix: str = ""):
|
|
||||||
try:
|
|
||||||
if not os.path.exists(dir_path):
|
|
||||||
structure.append(f"{prefix}└── (not found)")
|
|
||||||
return
|
|
||||||
|
|
||||||
items = sorted(os.listdir(dir_path))
|
|
||||||
for i, item in enumerate(items):
|
|
||||||
item_path = os.path.join(dir_path, item)
|
|
||||||
is_last = i == len(items) - 1
|
|
||||||
current_prefix = "└── " if is_last else "├── "
|
|
||||||
structure.append(f"{prefix}{current_prefix}{item}")
|
|
||||||
|
|
||||||
if os.path.isdir(item_path):
|
|
||||||
next_prefix = prefix + (" " if is_last else "│ ")
|
|
||||||
add_directory_contents(item_path, next_prefix)
|
|
||||||
except Exception as e:
|
|
||||||
structure.append(f"{prefix}└── Error: {str(e)}")
|
|
||||||
|
|
||||||
# Add files directory structure
|
|
||||||
files_dir = os.path.join(project_dir, "files")
|
|
||||||
structure.append("files/")
|
|
||||||
add_directory_contents(files_dir, "")
|
|
||||||
|
|
||||||
# Add processed directory structure
|
|
||||||
processed_dir = os.path.join(project_dir, "processed")
|
|
||||||
structure.append("\nprocessed/")
|
|
||||||
add_directory_contents(processed_dir, "")
|
|
||||||
|
|
||||||
# Add dataset directory structure
|
|
||||||
dataset_dir = os.path.join(project_dir, "datasets")
|
|
||||||
structure.append("\ndataset/")
|
|
||||||
add_directory_contents(dataset_dir, "")
|
|
||||||
|
|
||||||
return "\n".join(structure)
|
|
||||||
|
|
||||||
|
|
||||||
def get_processing_status(unique_id: str) -> Dict:
|
|
||||||
"""Get comprehensive processing status for a project."""
|
|
||||||
|
|
||||||
project_dir = os.path.join("projects", "data", unique_id)
|
|
||||||
|
|
||||||
if not os.path.exists(project_dir):
|
|
||||||
return {
|
|
||||||
"project_exists": False,
|
|
||||||
"unique_id": unique_id
|
|
||||||
}
|
|
||||||
|
|
||||||
status = {
|
|
||||||
"project_exists": True,
|
|
||||||
"unique_id": unique_id,
|
|
||||||
"directories": {
|
|
||||||
"files": os.path.exists(os.path.join(project_dir, "files")),
|
|
||||||
"processed": os.path.exists(os.path.join(project_dir, "processed")),
|
|
||||||
"dataset": os.path.exists(os.path.join(project_dir, "datasets"))
|
|
||||||
},
|
|
||||||
"groups": {},
|
|
||||||
"processing_log_exists": os.path.exists(os.path.join(project_dir, "processing_log.json"))
|
|
||||||
}
|
|
||||||
|
|
||||||
# Check each group's status
|
|
||||||
files_dir = os.path.join(project_dir, "files")
|
|
||||||
if os.path.exists(files_dir):
|
|
||||||
for group_name in os.listdir(files_dir):
|
|
||||||
group_path = os.path.join(files_dir, group_name)
|
|
||||||
if os.path.isdir(group_path):
|
|
||||||
status["groups"][group_name] = {
|
|
||||||
"files_count": len([
|
|
||||||
f for f in os.listdir(group_path)
|
|
||||||
if os.path.isfile(os.path.join(group_path, f))
|
|
||||||
]),
|
|
||||||
"merge_status": "pending"
|
|
||||||
}
|
|
||||||
|
|
||||||
# Check merge status for each group
|
|
||||||
dataset_dir = os.path.join(project_dir, "datasets")
|
|
||||||
if os.path.exists(dataset_dir):
|
|
||||||
for group_name in os.listdir(dataset_dir):
|
|
||||||
group_path = os.path.join(dataset_dir, group_name)
|
|
||||||
if os.path.isdir(group_path):
|
|
||||||
if group_name in status["groups"]:
|
|
||||||
# Check if merge is complete
|
|
||||||
document_path = os.path.join(group_path, "document.txt")
|
|
||||||
pagination_path = os.path.join(group_path, "pagination.txt")
|
|
||||||
embedding_path = os.path.join(group_path, "embedding.pkl")
|
|
||||||
|
|
||||||
if (os.path.exists(document_path) and os.path.exists(pagination_path) and
|
|
||||||
os.path.exists(embedding_path)):
|
|
||||||
status["groups"][group_name]["merge_status"] = "completed"
|
|
||||||
else:
|
|
||||||
status["groups"][group_name]["merge_status"] = "incomplete"
|
|
||||||
else:
|
|
||||||
status["groups"][group_name] = {
|
|
||||||
"files_count": 0,
|
|
||||||
"merge_status": "completed"
|
|
||||||
}
|
|
||||||
|
|
||||||
return status
|
|
||||||
|
|
||||||
|
|
||||||
def remove_dataset_directory(unique_id: str, filename_without_ext: str):
|
|
||||||
"""Remove a specific dataset directory (deprecated - use new structure)"""
|
|
||||||
# This function is kept for compatibility but delegates to new structure
|
|
||||||
dataset_path = os.path.join("projects", "data", unique_id, "processed", filename_without_ext)
|
|
||||||
if os.path.exists(dataset_path):
|
|
||||||
import shutil
|
|
||||||
shutil.rmtree(dataset_path)
|
|
||||||
|
|
||||||
|
|
||||||
def remove_dataset_directory_by_key(unique_id: str, key: str):
|
|
||||||
"""Remove dataset directory by key (group name)"""
|
|
||||||
# Remove files directory
|
|
||||||
files_group_path = os.path.join("projects", "data", unique_id, "files", key)
|
|
||||||
if os.path.exists(files_group_path):
|
|
||||||
import shutil
|
|
||||||
shutil.rmtree(files_group_path)
|
|
||||||
|
|
||||||
# Remove processed directory
|
|
||||||
processed_group_path = os.path.join("projects", "data", unique_id, "processed", key)
|
|
||||||
if os.path.exists(processed_group_path):
|
|
||||||
import shutil
|
|
||||||
shutil.rmtree(processed_group_path)
|
|
||||||
|
|
||||||
# Remove dataset directory
|
|
||||||
cleanup_dataset_group(unique_id, key)
|
|
||||||
@ -1,343 +0,0 @@
|
|||||||
#!/usr/bin/env python3
|
|
||||||
"""
|
|
||||||
Project management functions for handling projects, README generation, and status tracking.
|
|
||||||
"""
|
|
||||||
|
|
||||||
import os
|
|
||||||
import json
|
|
||||||
import logging
|
|
||||||
from typing import Dict, List, Optional
|
|
||||||
from pathlib import Path
|
|
||||||
|
|
||||||
# Configure logger
|
|
||||||
logger = logging.getLogger('app')
|
|
||||||
|
|
||||||
from utils.file_utils import get_document_preview, load_processed_files_log
|
|
||||||
|
|
||||||
|
|
||||||
def generate_directory_tree(project_dir: str, unique_id: str, max_depth: int = 3) -> str:
|
|
||||||
"""Generate dataset directory tree structure for the project"""
|
|
||||||
def _build_tree(path: str, prefix: str = "", is_last: bool = True, depth: int = 0) -> List[str]:
|
|
||||||
if depth > max_depth:
|
|
||||||
return []
|
|
||||||
|
|
||||||
lines = []
|
|
||||||
try:
|
|
||||||
entries = sorted(os.listdir(path))
|
|
||||||
# Separate directories and files
|
|
||||||
dirs = [e for e in entries if os.path.isdir(os.path.join(path, e)) and not e.startswith('.')]
|
|
||||||
files = [e for e in entries if os.path.isfile(os.path.join(path, e)) and not e.startswith('.')]
|
|
||||||
|
|
||||||
entries = dirs + files
|
|
||||||
|
|
||||||
for i, entry in enumerate(entries):
|
|
||||||
entry_path = os.path.join(path, entry)
|
|
||||||
is_dir = os.path.isdir(entry_path)
|
|
||||||
is_last_entry = i == len(entries) - 1
|
|
||||||
|
|
||||||
# Choose the appropriate tree symbols
|
|
||||||
if is_last_entry:
|
|
||||||
connector = "└── "
|
|
||||||
new_prefix = prefix + " "
|
|
||||||
else:
|
|
||||||
connector = "├── "
|
|
||||||
new_prefix = prefix + "│ "
|
|
||||||
|
|
||||||
# Add entry line
|
|
||||||
line = prefix + connector + entry
|
|
||||||
if is_dir:
|
|
||||||
line += "/"
|
|
||||||
lines.append(line)
|
|
||||||
|
|
||||||
# Recursively add subdirectories
|
|
||||||
if is_dir and depth < max_depth:
|
|
||||||
sub_lines = _build_tree(entry_path, new_prefix, is_last_entry, depth + 1)
|
|
||||||
lines.extend(sub_lines)
|
|
||||||
|
|
||||||
except PermissionError:
|
|
||||||
lines.append(prefix + "└── [Permission Denied]")
|
|
||||||
except Exception as e:
|
|
||||||
lines.append(prefix + "└── [Error: " + str(e) + "]")
|
|
||||||
|
|
||||||
return lines
|
|
||||||
|
|
||||||
# Start building tree from dataset directory
|
|
||||||
dataset_dir = os.path.join(project_dir, "datasets")
|
|
||||||
tree_lines = []
|
|
||||||
|
|
||||||
if not os.path.exists(dataset_dir):
|
|
||||||
return "└── [No dataset directory found]"
|
|
||||||
|
|
||||||
try:
|
|
||||||
entries = sorted(os.listdir(dataset_dir))
|
|
||||||
dirs = [e for e in entries if os.path.isdir(os.path.join(dataset_dir, e)) and not e.startswith('.')]
|
|
||||||
files = [e for e in entries if os.path.isfile(os.path.join(dataset_dir, e)) and not e.startswith('.')]
|
|
||||||
|
|
||||||
entries = dirs + files
|
|
||||||
|
|
||||||
if not entries:
|
|
||||||
tree_lines.append("└── [Empty dataset directory]")
|
|
||||||
else:
|
|
||||||
for i, entry in enumerate(entries):
|
|
||||||
entry_path = os.path.join(dataset_dir, entry)
|
|
||||||
is_dir = os.path.isdir(entry_path)
|
|
||||||
is_last_entry = i == len(entries) - 1
|
|
||||||
|
|
||||||
if is_last_entry:
|
|
||||||
connector = "└── "
|
|
||||||
prefix = " "
|
|
||||||
else:
|
|
||||||
connector = "├── "
|
|
||||||
prefix = "│ "
|
|
||||||
|
|
||||||
line = connector + entry
|
|
||||||
if is_dir:
|
|
||||||
line += "/"
|
|
||||||
tree_lines.append(line)
|
|
||||||
|
|
||||||
# Recursively add subdirectories
|
|
||||||
if is_dir:
|
|
||||||
sub_lines = _build_tree(entry_path, prefix, is_last_entry, 1)
|
|
||||||
tree_lines.extend(sub_lines)
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
tree_lines.append(f"└── [Error generating tree: {str(e)}]")
|
|
||||||
|
|
||||||
return "\n".join(tree_lines)
|
|
||||||
|
|
||||||
|
|
||||||
def generate_project_readme(unique_id: str) -> str:
|
|
||||||
"""Generate README.md content for a project"""
|
|
||||||
project_dir = os.path.join("projects", "data", unique_id)
|
|
||||||
readme_content = f"""# Project: {unique_id}
|
|
||||||
|
|
||||||
## Project Overview
|
|
||||||
|
|
||||||
This project contains processed documents and their associated embeddings for semantic search.
|
|
||||||
|
|
||||||
## Directory Structure
|
|
||||||
|
|
||||||
"""
|
|
||||||
|
|
||||||
# Generate directory tree
|
|
||||||
readme_content += "```\n"
|
|
||||||
readme_content += generate_directory_tree(project_dir, unique_id)
|
|
||||||
readme_content += "\n```\n\n"
|
|
||||||
|
|
||||||
readme_content += """## Dataset Structure
|
|
||||||
|
|
||||||
"""
|
|
||||||
|
|
||||||
dataset_dir = os.path.join(project_dir, "datasets")
|
|
||||||
if not os.path.exists(dataset_dir):
|
|
||||||
readme_content += "No dataset files available.\n"
|
|
||||||
else:
|
|
||||||
# Get all document directories
|
|
||||||
doc_dirs = []
|
|
||||||
try:
|
|
||||||
for item in sorted(os.listdir(dataset_dir)):
|
|
||||||
item_path = os.path.join(dataset_dir, item)
|
|
||||||
if os.path.isdir(item_path):
|
|
||||||
doc_dirs.append(item)
|
|
||||||
except Exception as e:
|
|
||||||
logger.error(f"Error listing dataset directories: {str(e)}")
|
|
||||||
|
|
||||||
if not doc_dirs:
|
|
||||||
readme_content += "No document directories found.\n"
|
|
||||||
else:
|
|
||||||
for doc_dir in doc_dirs:
|
|
||||||
doc_path = os.path.join(dataset_dir, doc_dir)
|
|
||||||
document_file = os.path.join(doc_path, "document.txt")
|
|
||||||
pagination_file = os.path.join(doc_path, "pagination.txt")
|
|
||||||
embeddings_file = os.path.join(doc_path, "embedding.pkl")
|
|
||||||
|
|
||||||
readme_content += f"### {doc_dir}\n\n"
|
|
||||||
readme_content += f"**Files:**\n"
|
|
||||||
readme_content += f"- `{doc_dir}/document.txt`"
|
|
||||||
if os.path.exists(document_file):
|
|
||||||
readme_content += " ✓"
|
|
||||||
readme_content += "\n"
|
|
||||||
|
|
||||||
readme_content += f"- `{doc_dir}/pagination.txt`"
|
|
||||||
if os.path.exists(pagination_file):
|
|
||||||
readme_content += " ✓"
|
|
||||||
readme_content += "\n"
|
|
||||||
|
|
||||||
readme_content += f"- `{doc_dir}/embedding.pkl`"
|
|
||||||
if os.path.exists(embeddings_file):
|
|
||||||
readme_content += " ✓"
|
|
||||||
readme_content += "\n\n"
|
|
||||||
|
|
||||||
# Add document preview
|
|
||||||
if os.path.exists(document_file):
|
|
||||||
readme_content += f"**Content Preview (first 10 lines):**\n\n```\n"
|
|
||||||
preview = get_document_preview(document_file, 10)
|
|
||||||
readme_content += preview
|
|
||||||
readme_content += "\n```\n\n"
|
|
||||||
else:
|
|
||||||
readme_content += f"**Content Preview:** Not available\n\n"
|
|
||||||
|
|
||||||
readme_content += f"""---
|
|
||||||
*Generated on {__import__('datetime').datetime.now().strftime('%Y-%m-%d %H:%M:%S')}*
|
|
||||||
"""
|
|
||||||
|
|
||||||
return readme_content
|
|
||||||
|
|
||||||
|
|
||||||
def save_project_readme(unique_id: str):
|
|
||||||
"""Save README.md for a project"""
|
|
||||||
readme_content = generate_project_readme(unique_id)
|
|
||||||
readme_path = os.path.join("projects", "data", unique_id, "README.md")
|
|
||||||
|
|
||||||
try:
|
|
||||||
os.makedirs(os.path.dirname(readme_path), exist_ok=True)
|
|
||||||
with open(readme_path, 'w', encoding='utf-8') as f:
|
|
||||||
f.write(readme_content)
|
|
||||||
logger.info(f"Generated README.md for project {unique_id}")
|
|
||||||
return readme_path
|
|
||||||
except Exception as e:
|
|
||||||
logger.error(f"Error generating README for project {unique_id}: {str(e)}")
|
|
||||||
return None
|
|
||||||
|
|
||||||
|
|
||||||
def get_project_status(unique_id: str) -> Dict:
|
|
||||||
"""Get comprehensive status of a project"""
|
|
||||||
project_dir = os.path.join("projects", "data", unique_id)
|
|
||||||
project_exists = os.path.exists(project_dir)
|
|
||||||
|
|
||||||
if not project_exists:
|
|
||||||
return {
|
|
||||||
"unique_id": unique_id,
|
|
||||||
"project_exists": False,
|
|
||||||
"error": "Project not found"
|
|
||||||
}
|
|
||||||
|
|
||||||
# Get processed log
|
|
||||||
processed_log = load_processed_files_log(unique_id)
|
|
||||||
|
|
||||||
# Collect document.txt files
|
|
||||||
document_files = []
|
|
||||||
dataset_dir = os.path.join(project_dir, "datasets")
|
|
||||||
if os.path.exists(dataset_dir):
|
|
||||||
for root, dirs, files in os.walk(dataset_dir):
|
|
||||||
for file in files:
|
|
||||||
if file == "document.txt":
|
|
||||||
document_files.append(os.path.join(root, file))
|
|
||||||
|
|
||||||
# Check system prompt and MCP settings
|
|
||||||
system_prompt_file = os.path.join(project_dir, "system_prompt.txt")
|
|
||||||
mcp_settings_file = os.path.join(project_dir, "mcp_settings.json")
|
|
||||||
|
|
||||||
status = {
|
|
||||||
"unique_id": unique_id,
|
|
||||||
"project_exists": True,
|
|
||||||
"project_path": project_dir,
|
|
||||||
"processed_files_count": len(processed_log),
|
|
||||||
"processed_files": processed_log,
|
|
||||||
"document_files_count": len(document_files),
|
|
||||||
"document_files": document_files,
|
|
||||||
"has_system_prompt": os.path.exists(system_prompt_file),
|
|
||||||
"has_mcp_settings": os.path.exists(mcp_settings_file),
|
|
||||||
"readme_exists": os.path.exists(os.path.join(project_dir, "README.md")),
|
|
||||||
"log_file_exists": os.path.exists(os.path.join(project_dir, "processed_files.json"))
|
|
||||||
}
|
|
||||||
|
|
||||||
# Add dataset structure
|
|
||||||
try:
|
|
||||||
from utils.dataset_manager import generate_dataset_structure
|
|
||||||
status["dataset_structure"] = generate_dataset_structure(unique_id)
|
|
||||||
except Exception as e:
|
|
||||||
status["dataset_structure"] = f"Error generating structure: {str(e)}"
|
|
||||||
|
|
||||||
return status
|
|
||||||
|
|
||||||
|
|
||||||
def remove_project(unique_id: str) -> bool:
|
|
||||||
"""Remove entire project directory"""
|
|
||||||
project_dir = os.path.join("projects", "data", unique_id)
|
|
||||||
try:
|
|
||||||
if os.path.exists(project_dir):
|
|
||||||
import shutil
|
|
||||||
shutil.rmtree(project_dir)
|
|
||||||
logger.info(f"Removed project directory: {project_dir}")
|
|
||||||
return True
|
|
||||||
else:
|
|
||||||
logger.warning(f"Project directory not found: {project_dir}")
|
|
||||||
return False
|
|
||||||
except Exception as e:
|
|
||||||
logger.error(f"Error removing project {unique_id}: {str(e)}")
|
|
||||||
return False
|
|
||||||
|
|
||||||
|
|
||||||
def list_projects() -> List[str]:
|
|
||||||
"""List all existing project IDs"""
|
|
||||||
projects_dir = "projects"
|
|
||||||
if not os.path.exists(projects_dir):
|
|
||||||
return []
|
|
||||||
|
|
||||||
try:
|
|
||||||
return [item for item in os.listdir(projects_dir)
|
|
||||||
if os.path.isdir(os.path.join(projects_dir, item))]
|
|
||||||
except Exception as e:
|
|
||||||
logger.error(f"Error listing projects: {str(e)}")
|
|
||||||
return []
|
|
||||||
|
|
||||||
|
|
||||||
def get_project_stats(unique_id: str) -> Dict:
|
|
||||||
"""Get statistics for a specific project"""
|
|
||||||
status = get_project_status(unique_id)
|
|
||||||
|
|
||||||
if not status["project_exists"]:
|
|
||||||
return status
|
|
||||||
|
|
||||||
stats = {
|
|
||||||
"unique_id": unique_id,
|
|
||||||
"total_processed_files": status["processed_files_count"],
|
|
||||||
"total_document_files": status["document_files_count"],
|
|
||||||
"has_system_prompt": status["has_system_prompt"],
|
|
||||||
"has_mcp_settings": status["has_mcp_settings"],
|
|
||||||
"has_readme": status["readme_exists"]
|
|
||||||
}
|
|
||||||
|
|
||||||
# Calculate file sizes
|
|
||||||
total_size = 0
|
|
||||||
document_sizes = []
|
|
||||||
|
|
||||||
for doc_file in status["document_files"]:
|
|
||||||
try:
|
|
||||||
size = os.path.getsize(doc_file)
|
|
||||||
document_sizes.append({
|
|
||||||
"file": doc_file,
|
|
||||||
"size": size,
|
|
||||||
"size_mb": round(size / (1024 * 1024), 2)
|
|
||||||
})
|
|
||||||
total_size += size
|
|
||||||
except Exception:
|
|
||||||
pass
|
|
||||||
|
|
||||||
stats["total_document_size"] = total_size
|
|
||||||
stats["total_document_size_mb"] = round(total_size / (1024 * 1024), 2)
|
|
||||||
stats["document_files_detail"] = document_sizes
|
|
||||||
|
|
||||||
# Check embeddings files
|
|
||||||
embedding_files = []
|
|
||||||
dataset_dir = os.path.join("projects", "data", unique_id, "datasets")
|
|
||||||
if os.path.exists(dataset_dir):
|
|
||||||
for root, dirs, files in os.walk(dataset_dir):
|
|
||||||
for file in files:
|
|
||||||
if file == "embedding.pkl":
|
|
||||||
file_path = os.path.join(root, file)
|
|
||||||
try:
|
|
||||||
size = os.path.getsize(file_path)
|
|
||||||
embedding_files.append({
|
|
||||||
"file": file_path,
|
|
||||||
"size": size,
|
|
||||||
"size_mb": round(size / (1024 * 1024), 2)
|
|
||||||
})
|
|
||||||
except Exception:
|
|
||||||
pass
|
|
||||||
|
|
||||||
stats["embedding_files_count"] = len(embedding_files)
|
|
||||||
stats["embedding_files_detail"] = embedding_files
|
|
||||||
|
|
||||||
return stats
|
|
||||||
@ -77,6 +77,15 @@ CHECKPOINT_CLEANUP_INACTIVE_DAYS = int(os.getenv("CHECKPOINT_CLEANUP_INACTIVE_DA
|
|||||||
CHECKPOINT_CLEANUP_INTERVAL_HOURS = int(os.getenv("CHECKPOINT_CLEANUP_INTERVAL_HOURS", "24"))
|
CHECKPOINT_CLEANUP_INTERVAL_HOURS = int(os.getenv("CHECKPOINT_CLEANUP_INTERVAL_HOURS", "24"))
|
||||||
|
|
||||||
|
|
||||||
|
# ============================================================
|
||||||
|
# Redis Configuration (Huey task queue backend)
|
||||||
|
# ============================================================
|
||||||
|
|
||||||
|
# Redis connection URL.
|
||||||
|
# Format: redis://[:password]@host:port/db
|
||||||
|
REDIS_URL = os.getenv("REDIS_URL", "redis://localhost:6379/1")
|
||||||
|
|
||||||
|
|
||||||
# ============================================================
|
# ============================================================
|
||||||
# Mem0 long-term memory configuration
|
# Mem0 long-term memory configuration
|
||||||
# ============================================================
|
# ============================================================
|
||||||
|
|||||||
@ -1,301 +0,0 @@
|
|||||||
#!/usr/bin/env python3
|
|
||||||
"""
|
|
||||||
Single file processing functions for handling individual files.
|
|
||||||
"""
|
|
||||||
|
|
||||||
import os
|
|
||||||
import tempfile
|
|
||||||
import zipfile
|
|
||||||
import logging
|
|
||||||
from typing import Dict, List, Tuple, Optional
|
|
||||||
from pathlib import Path
|
|
||||||
|
|
||||||
# Configure logger
|
|
||||||
logger = logging.getLogger('app')
|
|
||||||
|
|
||||||
from utils.file_utils import download_file
|
|
||||||
|
|
||||||
# Try to import excel/csv processor, but handle if dependencies are missing
|
|
||||||
try:
|
|
||||||
from utils.excel_csv_processor import (
|
|
||||||
is_excel_file, is_csv_file, process_excel_file, process_csv_file
|
|
||||||
)
|
|
||||||
EXCEL_CSV_SUPPORT = True
|
|
||||||
except ImportError as e:
|
|
||||||
logger.warning(f"Excel/CSV processing not available: {e}")
|
|
||||||
EXCEL_CSV_SUPPORT = False
|
|
||||||
|
|
||||||
# Fallback functions
|
|
||||||
def is_excel_file(file_path):
|
|
||||||
return file_path.lower().endswith(('.xlsx', '.xls'))
|
|
||||||
|
|
||||||
def is_csv_file(file_path):
|
|
||||||
return file_path.lower().endswith('.csv')
|
|
||||||
|
|
||||||
def process_excel_file(file_path):
|
|
||||||
return "", []
|
|
||||||
|
|
||||||
def process_csv_file(file_path):
|
|
||||||
return "", []
|
|
||||||
|
|
||||||
|
|
||||||
async def process_single_file(
|
|
||||||
unique_id: str,
|
|
||||||
group_name: str,
|
|
||||||
filename: str,
|
|
||||||
original_path: str,
|
|
||||||
local_path: str
|
|
||||||
) -> Dict:
|
|
||||||
"""
|
|
||||||
Process a single file and generate document.txt, pagination.txt, and embedding.pkl.
|
|
||||||
|
|
||||||
Returns:
|
|
||||||
Dict with processing results and file paths
|
|
||||||
"""
|
|
||||||
# Create output directory for this file
|
|
||||||
filename_stem = Path(filename).stem
|
|
||||||
output_dir = os.path.join("projects", "data", unique_id, "processed", group_name, filename_stem)
|
|
||||||
os.makedirs(output_dir, exist_ok=True)
|
|
||||||
|
|
||||||
result = {
|
|
||||||
"success": False,
|
|
||||||
"filename": filename,
|
|
||||||
"group": group_name,
|
|
||||||
"output_dir": output_dir,
|
|
||||||
"document_path": os.path.join(output_dir, "document.txt"),
|
|
||||||
"pagination_path": os.path.join(output_dir, "pagination.txt"),
|
|
||||||
"embedding_path": os.path.join(output_dir, "embedding.pkl"),
|
|
||||||
"error": None,
|
|
||||||
"content_size": 0,
|
|
||||||
"pagination_lines": 0,
|
|
||||||
"embedding_chunks": 0
|
|
||||||
}
|
|
||||||
|
|
||||||
try:
|
|
||||||
# Download file if it's remote and not yet downloaded
|
|
||||||
if original_path.startswith(('http://', 'https://')):
|
|
||||||
if not os.path.exists(local_path):
|
|
||||||
logger.info(f"Downloading {original_path} -> {local_path}")
|
|
||||||
success = await download_file(original_path, local_path)
|
|
||||||
if not success:
|
|
||||||
result["error"] = "Failed to download file"
|
|
||||||
return result
|
|
||||||
|
|
||||||
# Extract content from file
|
|
||||||
content, pagination_lines = await extract_file_content(local_path, filename)
|
|
||||||
|
|
||||||
if not content or not content.strip():
|
|
||||||
result["error"] = "No content extracted from file"
|
|
||||||
return result
|
|
||||||
|
|
||||||
# Write document.txt
|
|
||||||
with open(result["document_path"], 'w', encoding='utf-8') as f:
|
|
||||||
f.write(content)
|
|
||||||
result["content_size"] = len(content)
|
|
||||||
|
|
||||||
# Write pagination.txt
|
|
||||||
if pagination_lines:
|
|
||||||
with open(result["pagination_path"], 'w', encoding='utf-8') as f:
|
|
||||||
for line in pagination_lines:
|
|
||||||
if line.strip():
|
|
||||||
f.write(f"{line}\n")
|
|
||||||
result["pagination_lines"] = len(pagination_lines)
|
|
||||||
else:
|
|
||||||
# Generate pagination from text content
|
|
||||||
pagination_lines = generate_pagination_from_text(result["document_path"],
|
|
||||||
result["pagination_path"])
|
|
||||||
result["pagination_lines"] = len(pagination_lines)
|
|
||||||
|
|
||||||
# Generate embeddings
|
|
||||||
try:
|
|
||||||
embedding_chunks = await generate_embeddings_for_file(
|
|
||||||
result["document_path"], result["embedding_path"]
|
|
||||||
)
|
|
||||||
result["embedding_chunks"] = len(embedding_chunks) if embedding_chunks else 0
|
|
||||||
result["success"] = True
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
result["error"] = f"Embedding generation failed: {str(e)}"
|
|
||||||
logger.error(f"Failed to generate embeddings for {filename}: {str(e)}")
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
result["error"] = f"File processing failed: {str(e)}"
|
|
||||||
logger.error(f"Error processing file {filename}: {str(e)}")
|
|
||||||
|
|
||||||
return result
|
|
||||||
|
|
||||||
|
|
||||||
async def extract_file_content(file_path: str, filename: str) -> Tuple[str, List[str]]:
|
|
||||||
"""Extract content from various file formats."""
|
|
||||||
|
|
||||||
# Handle zip files
|
|
||||||
if filename.lower().endswith('.zip'):
|
|
||||||
return await extract_from_zip(file_path, filename)
|
|
||||||
|
|
||||||
# Handle Excel files
|
|
||||||
elif is_excel_file(file_path):
|
|
||||||
return await extract_from_excel(file_path, filename)
|
|
||||||
|
|
||||||
# Handle CSV files
|
|
||||||
elif is_csv_file(file_path):
|
|
||||||
return await extract_from_csv(file_path, filename)
|
|
||||||
|
|
||||||
# Handle text files
|
|
||||||
else:
|
|
||||||
return await extract_from_text(file_path, filename)
|
|
||||||
|
|
||||||
|
|
||||||
async def extract_from_zip(zip_path: str, filename: str) -> Tuple[str, List[str]]:
|
|
||||||
"""Extract content from zip file."""
|
|
||||||
content_parts = []
|
|
||||||
pagination_lines = []
|
|
||||||
|
|
||||||
try:
|
|
||||||
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
|
|
||||||
# Extract to temporary directory
|
|
||||||
temp_dir = tempfile.mkdtemp(prefix=f"extract_{Path(filename).stem}_")
|
|
||||||
zip_ref.extractall(temp_dir)
|
|
||||||
|
|
||||||
# Process extracted files
|
|
||||||
for root, dirs, files in os.walk(temp_dir):
|
|
||||||
for file in files:
|
|
||||||
if file.lower().endswith(('.txt', '.md', '.xlsx', '.xls', '.csv')):
|
|
||||||
file_path = os.path.join(root, file)
|
|
||||||
|
|
||||||
try:
|
|
||||||
file_content, file_pagination = await extract_file_content(file_path, file)
|
|
||||||
|
|
||||||
if file_content:
|
|
||||||
content_parts.append(f"# Page {file}")
|
|
||||||
content_parts.append(file_content)
|
|
||||||
pagination_lines.extend(file_pagination)
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
logger.error(f"Error processing extracted file {file}: {str(e)}")
|
|
||||||
|
|
||||||
# Clean up temporary directory
|
|
||||||
import shutil
|
|
||||||
shutil.rmtree(temp_dir)
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
logger.error(f"Error extracting zip file {filename}: {str(e)}")
|
|
||||||
return "", []
|
|
||||||
|
|
||||||
return '\n\n'.join(content_parts), pagination_lines
|
|
||||||
|
|
||||||
|
|
||||||
async def extract_from_excel(file_path: str, filename: str) -> Tuple[str, List[str]]:
|
|
||||||
"""Extract content from Excel file."""
|
|
||||||
try:
|
|
||||||
document_content, pagination_lines = process_excel_file(file_path)
|
|
||||||
|
|
||||||
if document_content:
|
|
||||||
content = f"# Page {filename}\n{document_content}"
|
|
||||||
return content, pagination_lines
|
|
||||||
else:
|
|
||||||
return "", []
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
logger.error(f"Error processing Excel file {filename}: {str(e)}")
|
|
||||||
return "", []
|
|
||||||
|
|
||||||
|
|
||||||
async def extract_from_csv(file_path: str, filename: str) -> Tuple[str, List[str]]:
|
|
||||||
"""Extract content from CSV file."""
|
|
||||||
try:
|
|
||||||
document_content, pagination_lines = process_csv_file(file_path)
|
|
||||||
|
|
||||||
if document_content:
|
|
||||||
content = f"# Page {filename}\n{document_content}"
|
|
||||||
return content, pagination_lines
|
|
||||||
else:
|
|
||||||
return "", []
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
logger.error(f"Error processing CSV file {filename}: {str(e)}")
|
|
||||||
return "", []
|
|
||||||
|
|
||||||
|
|
||||||
async def extract_from_text(file_path: str, filename: str) -> Tuple[str, List[str]]:
|
|
||||||
"""Extract content from text file."""
|
|
||||||
try:
|
|
||||||
with open(file_path, 'r', encoding='utf-8') as f:
|
|
||||||
content = f.read().strip()
|
|
||||||
|
|
||||||
if content:
|
|
||||||
return content, []
|
|
||||||
else:
|
|
||||||
return "", []
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
logger.error(f"Error reading text file {filename}: {str(e)}")
|
|
||||||
return "", []
|
|
||||||
|
|
||||||
|
|
||||||
def generate_pagination_from_text(document_path: str, pagination_path: str) -> List[str]:
|
|
||||||
"""Generate pagination from text document."""
|
|
||||||
try:
|
|
||||||
# Import embedding module for pagination
|
|
||||||
import sys
|
|
||||||
sys.path.append(os.path.join(os.path.dirname(__file__), '..', 'embedding'))
|
|
||||||
from embedding import split_document_by_pages
|
|
||||||
|
|
||||||
pages = split_document_by_pages(str(document_path), str(pagination_path))
|
|
||||||
|
|
||||||
# Return pagination lines
|
|
||||||
pagination_lines = []
|
|
||||||
with open(pagination_path, 'r', encoding='utf-8') as f:
|
|
||||||
for line in f:
|
|
||||||
if line.strip():
|
|
||||||
pagination_lines.append(line.strip())
|
|
||||||
|
|
||||||
return pagination_lines
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
logger.error(f"Error generating pagination from text: {str(e)}")
|
|
||||||
return []
|
|
||||||
|
|
||||||
|
|
||||||
async def generate_embeddings_for_file(document_path: str, embedding_path: str) -> Optional[List]:
|
|
||||||
"""Generate embeddings for a document."""
|
|
||||||
try:
|
|
||||||
# Import embedding module
|
|
||||||
import sys
|
|
||||||
sys.path.append(os.path.join(os.path.dirname(__file__), '..', 'embedding'))
|
|
||||||
from embedding import embed_document
|
|
||||||
|
|
||||||
# Generate embeddings using paragraph chunking
|
|
||||||
embedding_data = embed_document(
|
|
||||||
str(document_path),
|
|
||||||
str(embedding_path),
|
|
||||||
chunking_strategy='paragraph'
|
|
||||||
)
|
|
||||||
|
|
||||||
if embedding_data and 'chunks' in embedding_data:
|
|
||||||
return embedding_data['chunks']
|
|
||||||
else:
|
|
||||||
return None
|
|
||||||
|
|
||||||
except Exception as e:
|
|
||||||
logger.error(f"Error generating embeddings: {str(e)}")
|
|
||||||
return None
|
|
||||||
|
|
||||||
|
|
||||||
def check_file_already_processed(unique_id: str, group_name: str, filename: str) -> bool:
|
|
||||||
"""Check if a file has already been processed."""
|
|
||||||
filename_stem = Path(filename).stem
|
|
||||||
output_dir = os.path.join("projects", "data", unique_id, "processed", group_name, filename_stem)
|
|
||||||
|
|
||||||
document_path = os.path.join(output_dir, "document.txt")
|
|
||||||
pagination_path = os.path.join(output_dir, "pagination.txt")
|
|
||||||
embedding_path = os.path.join(output_dir, "embedding.pkl")
|
|
||||||
|
|
||||||
# Check if all files exist and are not empty
|
|
||||||
if (os.path.exists(document_path) and os.path.exists(pagination_path) and
|
|
||||||
os.path.exists(embedding_path)):
|
|
||||||
|
|
||||||
if (os.path.getsize(document_path) > 0 and os.path.getsize(pagination_path) > 0 and
|
|
||||||
os.path.getsize(embedding_path) > 0):
|
|
||||||
return True
|
|
||||||
|
|
||||||
return False
|
|
||||||
Loading…
Reference in New Issue
Block a user