add task

parent ee59209937
commit 8f0a5569e2

1  .gitignore  vendored

@@ -4,3 +4,4 @@ workspace
 __pycache__
 public
 models
+queue_data
536  README.md

@@ -1,108 +1,185 @@
# Qwen Agent - Intelligent Data Retrieval Expert System

[](https://python.org)
[](https://fastapi.tiangolo.com)
[](LICENSE)

## 📋 Project Overview

Qwen Agent is an intelligent data retrieval expert system built on FastAPI, designed for processing and analyzing structured datasets. Through a stateless ZIP-based project loading mechanism, it can dynamically load multiple datasets, and it exposes an OpenAI-style chat interface for easy integration with existing AI applications.

### 🌟 Core Features

- **🔍 Intelligent data retrieval** - professional retrieval built on inverted indexes and a multi-layer data architecture
- **📦 Stateless project loading** - datasets are loaded dynamically from a ZIP URL, with automatic caching and extraction
- **🏗️ Multi-layer data processing** - layered storage across a document layer, a serialization layer, and an index layer
- **🚀 Asynchronous file-processing queue** - high-performance async task queue built on huey and SQLite
- **📊 Task status management** - real-time task status queries with SQLite persistence
- **🤖 OpenAI-compatible API** - fully compatible with the OpenAI chat/completions interface

---

## 🚀 Quick Start

### Requirements

- Python 3.8+
- Poetry (recommended) or pip
- Sufficient disk space for caching

### Install Dependencies

```bash
# Using Poetry (recommended)
poetry install
poetry run python fastapi_app.py

# Or using pip
pip install -r requirements.txt
python fastapi_app.py
```

### Docker Deployment

```bash
# Build the image
docker build -t qwen-agent .

# Run the container
docker run -p 8001:8001 qwen-agent

# Or use Docker Compose
docker-compose up -d
```

---

## 📖 Usage Guide

### 1. Chat Endpoint (OpenAI Compatible)

**Endpoint**: `POST /api/v1/chat/completions`

**Request format**:
```bash
curl -X POST "http://localhost:8001/api/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": "HP Elite Mini 800 G9ってノートPC?"
      }
    ],
    "model": "qwen3-next",
    "zip_url": "http://127.0.0.1:8080/all_hp_product_spec_book2506.zip",
    "stream": false
  }'
```

**Parameters**:
- `messages`: list of chat messages
- `model`: model name (default: "qwen3-next")
- `model_server`: model server address (required)
- `api_key`: API key (can also be passed via the Authorization header)
- `zip_url`: URL of the ZIP dataset (required)
- `stream`: whether to stream the response (default: false)
- `max_input_tokens`: maximum number of input tokens
- `top_p`: nucleus sampling parameter
- `temperature`: temperature parameter
- `max_tokens`: maximum number of generated tokens
- any other model generation parameters

**Response format** (non-streaming):
```json
{
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "HP Elite Mini 800 G9はノートPCではなく、小型のデスクトップPCです。"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 18,
    "total_tokens": 43
  }
}
```

**Streaming responses** use the Server-Sent Events (SSE) format, with each chunk following the OpenAI format.
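Since the endpoint advertises OpenAI-format SSE chunks, a minimal consumer sketch looks like the following (an illustration assuming OpenAI-style `data: {...}` lines terminated by `data: [DONE]`; the exact chunk shape is not shown in this README):

```python
import json
import requests

# Minimal SSE consumer for the streaming chat endpoint (sketch).
def stream_chat(question: str, zip_url: str):
    with requests.post(
        "http://localhost:8001/api/v1/chat/completions",
        json={
            "messages": [{"role": "user", "content": question}],
            "model": "qwen3-next",
            "zip_url": zip_url,
            "stream": True,  # request SSE output
        },
        stream=True,
    ) as resp:
        for line in resp.iter_lines():
            if not line or not line.startswith(b"data: "):
                continue
            payload = line[len(b"data: "):]
            if payload == b"[DONE]":  # assumed OpenAI-style terminator
                break
            chunk = json.loads(payload)
            # Each chunk is assumed to mirror the OpenAI delta format.
            delta = chunk["choices"][0].get("delta", {})
            print(delta.get("content", ""), end="", flush=True)

stream_chat("HP Elite Mini 800 G9ってノートPC?",
            "http://127.0.0.1:8080/all_hp_product_spec_book2506.zip")
```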
### 2. Asynchronous File-Processing Queue

#### Start the queue system

```bash
# Terminal 1: start the queue consumer
poetry run python task_queue/consumer.py --workers 2

# Terminal 2: start the API server
poetry run python fastapi_app.py
```

#### Submit an async task

```bash
curl -X POST "http://localhost:8001/api/v1/files/process/async" \
  -H "Content-Type: application/json" \
  -d '{
    "unique_id": "my_project_123",
    "files": {
      "documents": ["public/document.txt"],
      "reports": ["public/data.zip"]
    },
    "system_prompt": "Process these documents"
  }'
```

**Response**:
```json
{
  "success": true,
  "task_id": "abc-123-def",
  "unique_id": "my_project_123",
  "task_status": "pending",
  "estimated_processing_time": 30
}
```

#### Query task status

```bash
# 🎯 Primary endpoint - this is the only one you need to remember
curl "http://localhost:8001/api/v1/task/abc-123-def/status"
```

**Status response**:
```json
{
  "success": true,
  "task_id": "abc-123-def",
  "status": "completed",
  "unique_id": "my_project_123",
  "result": {
    "status": "success",
    "message": "Successfully processed 2 document files",
    "processed_files": ["projects/my_project_123/dataset/docs/document.txt"]
  }
}
```
### 3. Python Client Example

```python
import requests
import time

def submit_and_monitor_task():
    # 1. Submit the task
    response = requests.post(
        "http://localhost:8001/api/v1/files/process/async",
        json={
            "unique_id": "my_project",
            "files": {"docs": ["public/file.txt"]}
        }
    )

    task_id = response.json()["task_id"]
    print(f"Task submitted: {task_id}")

    # 2. Poll the task status
    while True:
        response = requests.get(f"http://localhost:8001/api/v1/task/{task_id}/status")
        data = response.json()

        status = data["status"]
        print(f"Task status: {status}")

        if status == "completed":
            print("🎉 Task completed!")
            break
        elif status == "failed":
            print("❌ Task failed!")
            break

        time.sleep(2)

submit_and_monitor_task()
```
---

## 🗃️ Data Package Structure

### ZIP Dataset Format

The ZIP archive should contain the following directory structure:

```
dataset_name/
├── dataset/
│   └── data_collection/
│       ├── document.txt          # raw text content
│       ├── serialization.txt     # structured data
│       └── schema.json           # field definitions and metadata
├── mcp_settings.json             # MCP tool configuration
└── system_prompt.md              # system prompt (optional)
```

### File Descriptions

#### 1. document.txt
- Contains the raw Markdown text content
- Provides the full context of the data
- Retrieval needs to include the surrounding lines for results to be meaningful

#### 2. schema.json
- Defines field names, enum-value mappings, and file associations
- Provides the full set of fields that appear in serialization.txt
- Used for field previews and enum-value previews

```json
{
  "field_name": {
    "txt_file_name": "document.txt",
    "serialization_file_name": "serialization.txt",
    "enums": ["enum_value_1", "enum_value_2"],
    "description": "field description"
  }
}
```

#### 3. mcp_settings.json
- Configures Model Context Protocol tools
- Supports tool integration for data retrieval and processing
- May include tools such as a JSON reader and multi-keyword search

#### 4. serialization.txt
- Formatted structured data parsed from document.txt
- Line format: `field1:value1;field2:value2;...`
- Supports efficient regex matching and keyword retrieval
- A single line represents one complete record
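As a quick illustration of the serialization format described above, a minimal parsing sketch (the field names here are hypothetical; real fields come from schema.json):

```python
# Parse a serialization.txt line of the form "field1:value1;field2:value2;...".
def parse_serialization_line(line: str) -> dict:
    record = {}
    for pair in line.strip().split(";"):
        if ":" in pair:
            field, value = pair.split(":", 1)  # split on the first colon only
            record[field] = value
    return record

print(parse_serialization_line("type:desktop;brand:HP;model:Elite Mini 800 G9"))
# {'type': 'desktop', 'brand': 'HP', 'model': 'Elite Mini 800 G9'}
```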
---

## 📊 Data Storage and Management

### Task Status Storage

Task statuses are stored in a SQLite database:

```
queue_data/task_status.db
```

**Database schema**:
```sql
CREATE TABLE task_status (
    task_id TEXT PRIMARY KEY,        -- task ID
    unique_id TEXT NOT NULL,         -- project ID
    status TEXT NOT NULL,            -- task status
    created_at REAL NOT NULL,        -- creation time
    updated_at REAL NOT NULL,        -- update time
    result TEXT,                     -- processing result (JSON)
    error TEXT                       -- error message
);
```
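For a quick ad-hoc lookup against this store, a minimal sqlite3 sketch (path and column layout taken from the table definition above):

```python
import sqlite3

# Look up one task's status directly in the SQLite store (sketch;
# assumes the task_status schema shown above).
def lookup_task(task_id: str, db_path: str = "queue_data/task_status.db"):
    with sqlite3.connect(db_path) as conn:
        conn.row_factory = sqlite3.Row
        row = conn.execute(
            "SELECT task_id, unique_id, status, updated_at FROM task_status WHERE task_id = ?",
            (task_id,),
        ).fetchone()
        return dict(row) if row else None

print(lookup_task("abc-123-def"))
```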

### Database Management Tools

```bash
# View the database contents
poetry run python db_manager.py view

# Interactive management
poetry run python db_manager.py interactive

# Fetch statistics
curl "http://localhost:8001/api/v1/tasks/statistics"
```

### Data Backup

```bash
# Back up the database
cp queue_data/task_status.db queue_data/backup_$(date +%Y%m%d).db

# Clean up old records
curl -X POST "http://localhost:8001/api/v1/tasks/cleanup?older_than_days=7"
```

---

## 🛠️ API Endpoint Overview

### Chat
- `POST /api/v1/chat/completions` - OpenAI-compatible chat endpoint

### File Processing
- `POST /api/v1/files/process` - synchronous file processing
- `POST /api/v1/files/process/async` - asynchronous file processing
- `GET /api/v1/files/{unique_id}/status` - file processing status

### Task Management
- `GET /api/v1/task/{task_id}/status` - **primary endpoint** - query task status
- `GET /api/v1/tasks` - list tasks (with filtering)
- `GET /api/v1/tasks/statistics` - fetch statistics
- `DELETE /api/v1/task/{task_id}` - delete a task record

### System Management
- `GET /api/health` - health check
- `GET /system/status` - system status and cache statistics
- `POST /system/cleanup-cache` - clear all caches
- `POST /system/cleanup-agent-cache` - clear the agent instance cache
- `GET /system/cached-projects` - list all cached projects
- `POST /system/remove-project-cache` - remove a specific project's cache

---

## 🔧 Configuration and Deployment

### Environment Variables

```bash
# Model configuration
MODEL_SERVER=https://openrouter.ai/api/v1
API_KEY=your-api-key

# Queue configuration
MAX_CACHED_AGENTS=20

# Other configuration
TOKENIZERS_PARALLELISM=false
```
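The application is expected to pick these up from the process environment; a minimal sketch of reading them with fallbacks (variable names come from the block above, but the defaults and the exact lookup logic in fastapi_app.py are assumptions):

```python
import os

# Read the configuration documented above, with assumed defaults.
MODEL_SERVER = os.getenv("MODEL_SERVER", "https://openrouter.ai/api/v1")
API_KEY = os.getenv("API_KEY", "")
MAX_CACHED_AGENTS = int(os.getenv("MAX_CACHED_AGENTS", "20"))
```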

### Production Deployment Tips

1. **Queue configuration**
   ```bash
   # Set an appropriate number of worker threads
   poetry run python task_queue/consumer.py --workers 4 --worker-type threads
   ```

2. **Performance tuning**
   - Use Redis as the queue backend (optional)
   - Put nginx in front as a reverse proxy
   - Configure an appropriate caching policy

3. **Monitoring**
   - Check task statuses regularly
   - Monitor disk space usage
   - Set up log rotation

---

## 📈 Performance Characteristics

### Intelligent Retrieval Strategies
- **Exploratory queries**: structure analysis → pattern discovery → result expansion
- **Precision queries**: target location → direct search → result verification
- **Analytical queries**: multi-dimensional analysis → deep mining → insight extraction

### Tooling
- **Structure analysis**: json-reader-get_all_keys, json-reader-get_multiple_values
- **Search execution**: multi-keyword-search, ripgrep-count-matches, ripgrep-search
- **Smart path optimization**: picks the optimal search path based on query complexity

### Caching
- ZIP files are cached under the MD5 hash of their URL (see the sketch below)
- Agent instances are cached to improve response times
- SQLite query caching
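A minimal sketch of the URL-to-cache-path mapping described above (the `projects/_cache` location follows the project-structure section; treat the exact file layout as an assumption):

```python
import hashlib
import os

# Map a ZIP URL to its cache path via the MD5 of the URL (sketch).
def zip_cache_path(zip_url: str, cache_dir: str = "projects/_cache") -> str:
    url_hash = hashlib.md5(zip_url.encode("utf-8")).hexdigest()
    return os.path.join(cache_dir, f"{url_hash}.zip")

print(zip_cache_path("http://127.0.0.1:8080/all_hp_product_spec_book2506.zip"))
```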

### Concurrency
- Asynchronous file-processing queue
- Multi-threaded task execution
- Batch operation support

---

## 📁 Project Structure

```
qwen-agent/
├── fastapi_app.py               # FastAPI main application
├── gbase_agent.py               # agent service logic
├── task_queue/                  # queue system
│   ├── config.py                # queue configuration
│   ├── manager.py               # queue manager
│   ├── tasks.py                 # file-processing tasks
│   ├── integration_tasks.py     # integration tasks
│   ├── task_status.py           # task status store
│   └── consumer.py              # queue consumer
├── utils/                       # utility modules
├── models/                      # model files
├── projects/                    # project directory
├── queue_data/                  # queue data
├── public/                      # static files
├── db_manager.py                # database management tool
├── requirements.txt             # dependency list
├── pyproject.toml               # Poetry configuration
├── Dockerfile                   # Docker build file
└── docker-compose.yml           # Docker Compose configuration
```

---

## 🎯 Use Cases

### Typical Scenarios
- **Product spec retrieval** - quickly look up technical product specifications
- **Document analysis** - intelligent retrieval and analysis across large document sets
- **Data Q&A** - question answering over structured data
- **Knowledge base construction** - intelligent retrieval for enterprise knowledge bases

### Example Dataset

The project ships with an HP product spec book dataset as an example:

```
all_hp_product_spec_book2506/
├── document.txt          # complete HP product specifications
├── serialization.txt     # structured product spec data
└── schema.json           # product field definitions (type, brand, specs, ...)
```

The dataset covers:
- Business/consumer laptops (EliteBook/OmniBook)
- Desktops (Elite/OMEN)
- Workstations (Z series)
- Poly communication devices
- HyperX gaming accessories

---

## 🤝 Contributing

1. Fork the project
2. Create a feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

---

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

---

## 🆘 Support

- 📖 [Documentation](docs/)
- 🐛 [Issue tracker](https://github.com/your-repo/qwen-agent/issues)
- 💬 [Discussions](https://github.com/your-repo/qwen-agent/discussions)

---

## 🎉 Getting Started

1. **Clone the project**
   ```bash
   git clone https://github.com/your-repo/qwen-agent.git
   cd qwen-agent
   ```

2. **Install dependencies**
   ```bash
   poetry install
   ```

3. **Start the services**
   ```bash
   # Start the queue consumer
   poetry run python task_queue/consumer.py --workers 2

   # Start the API server
   poetry run python fastapi_app.py
   ```

4. **Test the API**
   ```bash
   # Run the test script
   poetry run python test_simple_task.py
   ```

### Notes

1. **Required parameter**: every request must provide the `zip_url` parameter
2. **API key**: can be passed via the Authorization header or as a request parameter
3. **URL format**: `zip_url` must be a valid HTTP/HTTPS URL or a local path
4. **File size**: ZIP files should ideally stay under 100 MB
5. **Security**: make sure the ZIP file comes from a trusted source
6. **Network**: the server must be able to reach the resource behind `zip_url`

You are now ready to use Qwen Agent for intelligent data retrieval! 🚀
168  db_manager.py  Executable file

@@ -0,0 +1,168 @@
#!/usr/bin/env python3
"""
Management tool for the SQLite task-status database.
"""

import sqlite3
import json
import time
from task_queue.task_status import task_status_store


def view_database():
    """Print the database contents."""
    print("📊 SQLite task-status database contents")
    print("=" * 40)
    print(f"Database path: {task_status_store.db_path}")

    # Connect to the database
    conn = sqlite3.connect(task_status_store.db_path)
    cursor = conn.cursor()

    # Show the table structure
    print(f"\n📋 Table structure:")
    cursor.execute("PRAGMA table_info(task_status)")
    columns = cursor.fetchall()
    for col in columns:
        print(f"  {col[1]} ({col[2]})")

    # Show all records
    print(f"\n📝 All records:")
    cursor.execute("SELECT * FROM task_status ORDER BY updated_at DESC")
    rows = cursor.fetchall()

    if not rows:
        print("  (empty database)")
    else:
        print(f"  {len(rows)} records in total:")
        for i, row in enumerate(rows):
            task_id, unique_id, status, created_at, updated_at, result, error = row
            created_str = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(created_at))
            updated_str = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(updated_at))

            print(f"  {i+1}. {task_id}")
            print(f"     Project ID: {unique_id}")
            print(f"     Status: {status}")
            print(f"     Created: {created_str}")
            print(f"     Updated: {updated_str}")
            if result:
                try:
                    result_data = json.loads(result)
                    print(f"     Result: {result_data.get('message', 'N/A')}")
                except (json.JSONDecodeError, TypeError):
                    print(f"     Result: {result[:50]}...")
            if error:
                print(f"     Error: {error}")
            print()

    conn.close()


def run_query(sql_query: str):
    """Run an ad-hoc SQL query."""
    print(f"🔍 Running query: {sql_query}")

    try:
        conn = sqlite3.connect(task_status_store.db_path)
        conn.row_factory = sqlite3.Row
        cursor = conn.cursor()
        cursor.execute(sql_query)
        rows = cursor.fetchall()

        if not rows:
            print("  (no results)")
        else:
            print(f"  {len(rows)} results:")
            for row in rows:
                print(f"  {dict(row)}")

        conn.close()

    except Exception as e:
        print(f"❌ Query failed: {e}")


def interactive_shell():
    """Interactive database management shell."""
    print("\n🖥️ Interactive database management")
    print("Type 'help' for available commands, 'quit' to exit")

    while True:
        try:
            command = input("\n> ").strip()

            if command.lower() in ['quit', 'exit', 'q']:
                break
            elif command.lower() == 'help':
                print("""
Available commands:
  view           - show all records
  stats          - show statistics
  pending        - show pending tasks
  completed      - show completed tasks
  failed         - show failed tasks
  sql <query>    - run a SQL query
  cleanup <days> - delete records older than N days
  count          - count all tasks
  help           - show this help
  quit/exit/q    - exit
                """)
            elif command.lower() == 'view':
                view_database()
            elif command.lower() == 'stats':
                stats = task_status_store.get_statistics()
                print(f"Statistics:")
                print(f"  Total tasks: {stats['total_tasks']}")
                print(f"  Status breakdown: {stats['status_breakdown']}")
                print(f"  Last 24 hours: {stats['recent_24h']}")
            elif command.lower() == 'pending':
                tasks = task_status_store.search_tasks(status="pending")
                print(f"Pending tasks ({len(tasks)}):")
                for task in tasks:
                    print(f"  - {task['task_id']}: {task['unique_id']}")
            elif command.lower() == 'completed':
                tasks = task_status_store.search_tasks(status="completed")
                print(f"Completed tasks ({len(tasks)}):")
                for task in tasks:
                    print(f"  - {task['task_id']}: {task['unique_id']}")
            elif command.lower() == 'failed':
                tasks = task_status_store.search_tasks(status="failed")
                print(f"Failed tasks ({len(tasks)}):")
                for task in tasks:
                    print(f"  - {task['task_id']}: {task['unique_id']}")
            elif command.lower().startswith('sql '):
                sql_query = command[4:]
                run_query(sql_query)
            elif command.lower().startswith('cleanup '):
                try:
                    days = int(command[8:])
                    count = task_status_store.cleanup_old_tasks(days)
                    print(f"✅ Deleted {count} records older than {days} days")
                except ValueError:
                    print("❌ Please enter a valid number of days")
            elif command.lower() == 'count':
                all_tasks = task_status_store.list_all()
                print(f"Total tasks: {len(all_tasks)}")
            else:
                print("❌ Unknown command; type 'help' for help")

        except KeyboardInterrupt:
            print("\n👋 Bye!")
            break
        except Exception as e:
            print(f"❌ Execution error: {e}")


def main():
    """Entry point."""
    import sys

    if len(sys.argv) > 1:
        if sys.argv[1] == 'view':
            view_database()
        elif sys.argv[1] == 'interactive':
            interactive_shell()
        else:
            print("Usage: python db_manager.py [view|interactive]")
    else:
        view_database()
        interactive_shell()


if __name__ == "__main__":
    main()
235  fastapi_app.py

@@ -17,7 +17,8 @@ from pydantic import BaseModel, Field
 from utils import (
     # Models
     Message, DatasetRequest, ChatRequest, FileProcessRequest,
-    FileProcessResponse, ChatResponse,
+    FileProcessResponse, ChatResponse, QueueTaskRequest, QueueTaskResponse,
+    QueueStatusResponse, TaskStatusResponse,

     # File utilities
     download_file, remove_file_or_directory, get_document_preview,
@@ -38,6 +39,11 @@ from utils import (
 # Import gbase_agent
 from gbase_agent import update_agent_llm

+# Import queue manager
+from task_queue.manager import queue_manager
+from task_queue.integration_tasks import process_files_async, cleanup_project_async
+from task_queue.task_status import task_status_store
+
 os.environ["TOKENIZERS_PARALLELISM"] = "false"

 # Custom version for qwen-agent messages - keep this function as it's specific to this app
@@ -233,6 +239,14 @@ async def process_files(request: FileProcessRequest, authorization: Optional[str
            json.dump(request.mcp_settings, f, ensure_ascii=False, indent=2)
            print(f"Saved mcp_settings for unique_id: {unique_id}")

        # Generate the project README.md file
        try:
            save_project_readme(unique_id)
            print(f"Generated README.md for unique_id: {unique_id}")
        except Exception as e:
            print(f"Failed to generate README.md for unique_id: {unique_id}, error: {str(e)}")
            # Does not affect the main processing flow; continue

        # The result includes file information grouped by key
        result_files = []
        for key in processed_files_by_key.keys():
@@ -261,6 +275,225 @@ async def process_files(request: FileProcessRequest, authorization: Optional[str
        raise HTTPException(status_code=500, detail=f"Internal server error: {str(e)}")


@app.post("/api/v1/files/process/async")
async def process_files_async_endpoint(request: QueueTaskRequest, authorization: Optional[str] = Header(None)):
    """
    Queue-based asynchronous variant of the file-processing API.
    Functionally identical to /api/v1/files/process, but processes files asynchronously via the queue.

    Args:
        request: QueueTaskRequest containing unique_id, files, system_prompt, mcp_settings, and queue options
        authorization: Authorization header containing API key (Bearer <API_KEY>)

    Returns:
        QueueTaskResponse: Processing result with task ID for tracking
    """
    try:
        unique_id = request.unique_id
        if not unique_id:
            raise HTTPException(status_code=400, detail="unique_id is required")

        # Estimate the processing time (based on the number of files)
        estimated_time = 0
        if request.files:
            total_files = sum(len(file_list) for file_list in request.files.values())
            estimated_time = max(30, total_files * 10)  # roughly 10s per file, at least 30s

        # Create the task status record
        import uuid
        task_id = str(uuid.uuid4())
        task_status_store.set_status(
            task_id=task_id,
            unique_id=unique_id,
            status="pending"
        )

        # Submit the asynchronous task
        task = process_files_async(
            unique_id=unique_id,
            files=request.files,
            system_prompt=request.system_prompt,
            mcp_settings=request.mcp_settings,
            task_id=task_id
        )

        return QueueTaskResponse(
            success=True,
            message=f"File-processing task queued, project ID: {unique_id}",
            unique_id=unique_id,
            task_id=task_id,  # use our own task_id
            task_status="pending",
            estimated_processing_time=estimated_time
        )

    except HTTPException:
        raise
    except Exception as e:
        print(f"Error submitting async file processing task: {str(e)}")
        raise HTTPException(status_code=500, detail=f"Internal server error: {str(e)}")


@app.get("/api/v1/task/{task_id}/status")
async def get_task_status(task_id: str):
    """Fetch a task's status - simple and reliable."""
    try:
        status_data = task_status_store.get_status(task_id)

        if not status_data:
            return {
                "success": False,
                "message": "Task not found or expired",
                "task_id": task_id,
                "status": "not_found"
            }

        return {
            "success": True,
            "message": "Task status fetched successfully",
            "task_id": task_id,
            "status": status_data["status"],
            "unique_id": status_data["unique_id"],
            "created_at": status_data["created_at"],
            "updated_at": status_data["updated_at"],
            "result": status_data.get("result"),
            "error": status_data.get("error")
        }

    except Exception as e:
        print(f"Error getting task status: {str(e)}")
        raise HTTPException(status_code=500, detail=f"Failed to fetch task status: {str(e)}")


@app.delete("/api/v1/task/{task_id}")
async def delete_task(task_id: str):
    """Delete a task record."""
    try:
        success = task_status_store.delete_status(task_id)
        if success:
            return {
                "success": True,
                "message": f"Task record deleted: {task_id}",
                "task_id": task_id
            }
        else:
            return {
                "success": False,
                "message": f"Task record not found: {task_id}",
                "task_id": task_id
            }
    except Exception as e:
        print(f"Error deleting task: {str(e)}")
        raise HTTPException(status_code=500, detail=f"Failed to delete task record: {str(e)}")


@app.get("/api/v1/tasks")
async def list_tasks(status: Optional[str] = None, unique_id: Optional[str] = None, limit: int = 100):
    """List tasks, with optional filtering."""
    try:
        if status or unique_id:
            # Use the search facility
            tasks = task_status_store.search_tasks(status=status, unique_id=unique_id, limit=limit)
        else:
            # Fetch all tasks
            all_tasks = task_status_store.list_all()
            tasks = list(all_tasks.values())[:limit]

        return {
            "success": True,
            "message": "Task list fetched successfully",
            "total_tasks": len(tasks),
            "tasks": tasks,
            "filters": {
                "status": status,
                "unique_id": unique_id,
                "limit": limit
            }
        }

    except Exception as e:
        print(f"Error listing tasks: {str(e)}")
        raise HTTPException(status_code=500, detail=f"Failed to fetch task list: {str(e)}")


@app.get("/api/v1/tasks/statistics")
async def get_task_statistics():
    """Fetch task statistics."""
    try:
        stats = task_status_store.get_statistics()

        return {
            "success": True,
            "message": "Statistics fetched successfully",
            "statistics": stats
        }

    except Exception as e:
        print(f"Error getting statistics: {str(e)}")
        raise HTTPException(status_code=500, detail=f"Failed to fetch statistics: {str(e)}")


@app.post("/api/v1/tasks/cleanup")
async def cleanup_tasks(older_than_days: int = 7):
    """Clean up old task records."""
    try:
        deleted_count = task_status_store.cleanup_old_tasks(older_than_days=older_than_days)

        return {
            "success": True,
            "message": f"Cleaned up {deleted_count} old task records",
            "deleted_count": deleted_count,
            "older_than_days": older_than_days
        }

    except Exception as e:
        print(f"Error cleaning up tasks: {str(e)}")
        raise HTTPException(status_code=500, detail=f"Failed to clean up task records: {str(e)}")


@app.get("/api/v1/projects/{unique_id}/tasks")
async def get_project_tasks(unique_id: str):
    """Fetch all tasks for a given project."""
    try:
        tasks = task_status_store.get_by_unique_id(unique_id)

        return {
            "success": True,
            "message": "Project tasks fetched successfully",
            "unique_id": unique_id,
            "total_tasks": len(tasks),
            "tasks": tasks
        }

    except Exception as e:
        print(f"Error getting project tasks: {str(e)}")
        raise HTTPException(status_code=500, detail=f"Failed to fetch project tasks: {str(e)}")


@app.post("/api/v1/files/{unique_id}/cleanup/async")
async def cleanup_project_async_endpoint(unique_id: str, remove_all: bool = False):
    """Asynchronously clean up project files."""
    try:
        task = cleanup_project_async(unique_id=unique_id, remove_all=remove_all)

        return {
            "success": True,
            "message": f"Project cleanup task queued, project ID: {unique_id}",
            "unique_id": unique_id,
            "task_id": task.id,
            "action": "remove_all" if remove_all else "cleanup_logs"
        }
    except Exception as e:
        print(f"Error submitting cleanup task: {str(e)}")
        raise HTTPException(status_code=500, detail=f"Failed to submit cleanup task: {str(e)}")


@app.post("/api/v1/chat/completions")
async def chat_completions(request: ChatRequest, authorization: Optional[str] = Header(None)):
    """
17  poetry.lock  generated

@@ -942,6 +942,21 @@ files = [
     {file = "httpx_sse-0.4.3.tar.gz", hash = "sha256:9b1ed0127459a66014aec3c56bebd93da3c1bc8bb6618c8082039a44889a755d"},
 ]

+[[package]]
+name = "huey"
+version = "2.5.3"
+description = "huey, a little task queue"
+optional = false
+python-versions = "*"
+groups = ["main"]
+files = [
+    {file = "huey-2.5.3.tar.gz", hash = "sha256:089fc72b97fd26a513f15b09925c56fad6abe4a699a1f0e902170b37e85163c7"},
+]
+
+[package.extras]
+backends = ["redis (>=3.0.0)"]
+redis = ["redis (>=3.0.0)"]
+
 [[package]]
 name = "huggingface-hub"
 version = "0.35.3"
@@ -3961,4 +3976,4 @@ propcache = ">=0.2.1"
 [metadata]
 lock-version = "2.1"
 python-versions = "3.12.0"
-content-hash = "06c3b78c8107692eb5944b144ae4df02862fa5e4e8a198f6ccfa07c6743a49cf"
+content-hash = "2c1ee5d2aabc25d2ae830cf2663df07ec1a7a25156b1b958d04e4d97f982e274"

pyproject.toml

@@ -20,6 +20,7 @@ dependencies = [
     "numpy<2",
     "aiohttp",
     "aiofiles",
+    "huey (>=2.5.3,<3.0.0)",
 ]
18  start_queue.py  Executable file

@@ -0,0 +1,18 @@
#!/usr/bin/env python3
"""
Task queue startup script.
"""

import sys
from pathlib import Path

# Add the project root to the Python path
# (this script lives at the project root, so use its own directory)
project_root = Path(__file__).parent
sys.path.insert(0, str(project_root))

from task_queue.consumer import main as consumer_main

if __name__ == "__main__":
    print("Starting the file-processing queue system...")
    consumer_main()
154  task_queue/README.md  Normal file

@@ -0,0 +1,154 @@
# Queue System Usage Guide

## Overview

This project integrates an asynchronous queue system based on huey and SqliteHuey for handling asynchronous file-processing tasks.

## Install Dependencies

```bash
pip install huey
```

## Directory Layout

```
task_queue/
├── __init__.py   # package initialization
├── config.py     # queue configuration (SqliteHuey settings)
├── tasks.py      # file-processing task definitions
├── manager.py    # queue manager
├── consumer.py   # queue consumer (worker process)
├── example.py    # usage examples
└── README.md     # this document
```

## Core Components

### 1. Queue configuration (config.py)
- Uses SqliteHuey as the message queue
- The database file lives at `queue_data/huey.db`
- Supports task retries and error storage

### 2. File-processing tasks (tasks.py)
- `process_file_async`: process a single file asynchronously
- `process_multiple_files_async`: process multiple files asynchronously
- `process_zip_file_async`: process a zip archive asynchronously
- `cleanup_processed_files`: clean up old files

### 3. Queue manager (manager.py)
- Task submission and management
- Queue status monitoring
- Task result queries
- Task record cleanup

## Usage

### 1. Start the queue consumer

```bash
# Start a consumer with the default configuration
python task_queue/consumer.py

# Specify the number of worker threads
python task_queue/consumer.py --workers 4

# Show queue statistics
python task_queue/consumer.py --stats

# Check the queue status
python task_queue/consumer.py --check

# Flush the queue
python task_queue/consumer.py --flush
```

### 2. Use the queue from code

```python
from task_queue.manager import queue_manager

# Process a single file
task_id = queue_manager.enqueue_file(
    project_id="my_project",
    file_path="/path/to/file.txt",
    original_filename="myfile.txt"
)

# Process multiple files
task_ids = queue_manager.enqueue_multiple_files(
    project_id="my_project",
    file_paths=["/path/file1.txt", "/path/file2.txt"],
    original_filenames=["file1.txt", "file2.txt"]
)

# Process a zip file
task_id = queue_manager.enqueue_zip_file(
    project_id="my_project",
    zip_path="/path/to/archive.zip"
)

# Check a task's status
status = queue_manager.get_task_status(task_id)
print(status)

# Fetch queue statistics
stats = queue_manager.get_queue_stats()
print(stats)
```

### 3. Run the examples

```bash
python task_queue/example.py
```

## Configuration Reference

### Queue configuration parameters (config.py)
- `filename`: path of the SQLite database file
- `immediate` (formerly `always_eager`): execute tasks inline (can be set to True during development)
- `utc`: use UTC timestamps
- `compression_level`: compression level
- `store_errors`: store error information
- `max_retries`: maximum number of retries
- `retry_delay`: retry delay

### Consumer parameters (consumer.py)
- `--workers`: number of workers (default 2)
- `--worker-type`: worker type (threads/greenlets/processes)
- `--stats`: show statistics
- `--check`: check the queue status
- `--flush`: flush the queue

## Task States

- `pending`: waiting to be processed
- `running`: being processed
- `complete/finished`: finished
- `error`: failed
- `scheduled`: scheduled for later

## Best Practices

1. **Production**:
   - Pick an appropriate worker count (1-2x the number of CPU cores is a good starting point)
   - Clean up old task records regularly
   - Monitor the queue status and task execution

2. **Development**:
   - Enable immediate mode to execute tasks inline for debugging
   - Use the `--check` flag to inspect the queue status
   - Run the example code to explore the features

3. **Error handling**:
   - Failed tasks are retried automatically (up to 3 times)
   - Error information is stored in the database
   - Use `get_task_status()` to inspect error details

## Troubleshooting

1. **Database locked**: make sure only one consumer instance is running
2. **Stuck tasks**: check file paths and permissions
3. **Out of memory**: reduce the worker count or switch to process workers
4. **Disk space**: clean up old files and task records regularly
23  task_queue/__init__.py  Normal file

@@ -0,0 +1,23 @@
#!/usr/bin/env python3
"""
Queue package initialization.
"""

from .config import huey
from .manager import QueueManager, queue_manager
from .tasks import (
    process_file_async,
    process_multiple_files_async,
    process_zip_file_async,
    cleanup_processed_files
)

__all__ = [
    "huey",
    "QueueManager",
    "queue_manager",
    "process_file_async",
    "process_multiple_files_async",
    "process_zip_file_async",
    "cleanup_processed_files"
]
27  task_queue/config.py  Normal file

@@ -0,0 +1,27 @@
#!/usr/bin/env python3
"""
Queue configuration using SqliteHuey for asynchronous file processing.
"""

import os
from huey import SqliteHuey

# Make sure the queue_data directory exists
queue_data_dir = os.path.join(os.path.dirname(__file__), '..', 'queue_data')
os.makedirs(queue_data_dir, exist_ok=True)

# Initialize SqliteHuey
huey = SqliteHuey(
    filename=os.path.join(queue_data_dir, 'huey.db'),
    name='file_processor',  # queue name
    immediate=False,        # False enables asynchronous processing
                            # (huey 2.x calls this "immediate", formerly "always_eager")
    utc=True,               # use UTC timestamps
)

# Note: retry behaviour in huey 2.x is configured per task, e.g.
# @huey.task(retries=3, retry_delay=60); errors are kept in the result
# store by default.

print(f"SqliteHuey queue initialized, database path: {os.path.join(queue_data_dir, 'huey.db')}")
171  task_queue/consumer.py  Executable file

@@ -0,0 +1,171 @@
#!/usr/bin/env python3
"""
Queue consumer for processing file tasks.
"""

import sys
import os
import time
import signal
import argparse
from pathlib import Path

# Add the project root to the Python path
project_root = Path(__file__).parent.parent
sys.path.insert(0, str(project_root))

from task_queue.config import huey
from task_queue.manager import queue_manager
from task_queue.integration_tasks import process_files_async, cleanup_project_async
from huey.consumer import Consumer

# Map CLI worker-type names (plural) to huey's singular identifiers;
# a naive rstrip('s') would mangle "processes" into "processe".
WORKER_TYPE_MAP = {"threads": "thread", "greenlets": "greenlet", "processes": "process"}


class QueueConsumer:
    """Queue consumer for processing asynchronous tasks."""

    def __init__(self, worker_type: str = "threads", workers: int = 2):
        self.huey = huey
        self.worker_type = worker_type
        self.workers = workers
        self.running = False
        self.consumer = None

        # Register signal handlers
        signal.signal(signal.SIGINT, self._signal_handler)
        signal.signal(signal.SIGTERM, self._signal_handler)

    def _signal_handler(self, signum, frame):
        """Signal handler for graceful shutdown."""
        print(f"\nReceived signal {signum}, shutting down the queue consumer...")
        self.running = False

    def start(self):
        """Start the queue consumer."""
        print(f"Starting the queue consumer...")
        print(f"Workers: {self.workers}")
        print(f"Worker type: {self.worker_type}")
        print(f"Database: {os.path.join(os.path.dirname(__file__), '..', 'queue_data', 'huey.db')}")
        print("Press Ctrl+C to stop the consumer")

        self.running = True

        try:
            # Create the huey consumer (huey expects the singular worker-type name)
            self.consumer = Consumer(self.huey, workers=self.workers,
                                     worker_type=WORKER_TYPE_MAP[self.worker_type])

            # Show queue statistics
            stats = queue_manager.get_queue_stats()
            print(f"Current queue status: {stats}")

            # Run the consumer
            print("Consumer started processing tasks...")
            self.consumer.run()

        except KeyboardInterrupt:
            print("\nInterrupted, shutting down...")
        except Exception as e:
            print(f"Error while running the queue consumer: {str(e)}")
        finally:
            self.stop()

    def stop(self):
        """Stop the queue consumer."""
        print("Stopping the queue consumer...")
        try:
            if self.consumer:
                # Stop the consumer
                self.consumer.stop()
                self.consumer = None
                print("Queue consumer stopped")
        except Exception as e:
            print(f"Error while stopping the queue consumer: {str(e)}")

    def process_scheduled_tasks(self):
        """Process scheduled tasks."""
        print("Processing scheduled tasks...")
        # Additional scheduled-task handling logic can go here


def main():
    """Entry point."""
    parser = argparse.ArgumentParser(description="File-processing queue consumer")
    parser.add_argument(
        "--workers",
        type=int,
        default=2,
        help="number of workers (default: 2)"
    )
    parser.add_argument(
        "--worker-type",
        choices=["threads", "greenlets", "processes"],
        default="threads",
        help="worker type (default: threads)"
    )
    parser.add_argument(
        "--stats",
        action="store_true",
        help="show queue statistics and exit"
    )
    parser.add_argument(
        "--flush",
        action="store_true",
        help="flush the queue and exit"
    )
    parser.add_argument(
        "--check",
        action="store_true",
        help="check the queue status and exit"
    )

    args = parser.parse_args()

    # Initialize the consumer
    consumer = QueueConsumer(
        worker_type=args.worker_type,
        workers=args.workers
    )

    # Handle the command-line options
    if args.stats:
        print("=== Queue statistics ===")
        stats = queue_manager.get_queue_stats()
        print(f"Total tasks: {stats.get('total_tasks', 0)}")
        print(f"Pending tasks: {stats.get('pending_tasks', 0)}")
        print(f"Running tasks: {stats.get('running_tasks', 0)}")
        print(f"Completed tasks: {stats.get('completed_tasks', 0)}")
        print(f"Failed tasks: {stats.get('error_tasks', 0)}")
        print(f"Scheduled tasks: {stats.get('scheduled_tasks', 0)}")
        print(f"Database: {stats.get('queue_database', 'N/A')}")
        return

    if args.flush:
        print("=== Flushing the queue ===")
        try:
            # Remove all tasks
            consumer.huey.flush()
            print("Queue flushed")
        except Exception as e:
            print(f"Failed to flush the queue: {str(e)}")
        return

    if args.check:
        print("=== Checking queue status ===")
        stats = queue_manager.get_queue_stats()
        print("Queue status: OK" if "error" not in stats else f"Queue status: error - {stats['error']}")

        pending_tasks = queue_manager.list_pending_tasks(limit=10)
        if pending_tasks:
            print(f"\nPending tasks (up to 10):")
            for task in pending_tasks:
                print(f"  Task ID: {task['task_id']}, status: {task['status']}")
        else:
            print("No pending tasks")
        return

    # Start the consumer
    print("=== Starting the file-processing queue consumer ===")
    consumer.start()


if __name__ == "__main__":
    main()
132  task_queue/example.py  Executable file

@@ -0,0 +1,132 @@
#!/usr/bin/env python3
"""
Example usage of the queue system.
"""

import sys
import time
from pathlib import Path

# Add the project root to the Python path
project_root = Path(__file__).parent.parent
sys.path.insert(0, str(project_root))

from task_queue.manager import queue_manager
from task_queue.tasks import process_file_async, process_multiple_files_async


def example_single_file():
    """Example: process a single file."""
    print("=== Example: process a single file ===")

    project_id = "test_project"
    file_path = "public/test_document.txt"

    # Enqueue the file
    task_id = queue_manager.enqueue_file(
        project_id=project_id,
        file_path=file_path,
        original_filename="example_document.txt"
    )

    print(f"Task submitted, task ID: {task_id}")

    # Check the task status
    time.sleep(2)
    status = queue_manager.get_task_status(task_id)
    print(f"Task status: {status}")


def example_multiple_files():
    """Example: process multiple files."""
    print("\n=== Example: process multiple files ===")

    project_id = "test_project_batch"
    file_paths = [
        "public/test_document.txt",
        "public/goods.xlsx"  # assumed to exist
    ]
    original_filenames = [
        "batch_document_1.txt",
        "batch_goods.xlsx"
    ]

    # Enqueue the files
    task_ids = queue_manager.enqueue_multiple_files(
        project_id=project_id,
        file_paths=file_paths,
        original_filenames=original_filenames
    )

    print(f"Batch task submitted, task IDs: {task_ids}")


def example_zip_file():
    """Example: process a zip file."""
    print("\n=== Example: process a zip file ===")

    project_id = "test_project_zip"
    zip_path = "public/all_hp_product_spec_book2506.zip"

    # Enqueue the zip file
    task_id = queue_manager.enqueue_zip_file(
        project_id=project_id,
        zip_path=zip_path
    )

    print(f"Zip task submitted, task ID: {task_id}")


def example_queue_stats():
    """Example: fetch queue statistics."""
    print("\n=== Example: queue statistics ===")

    stats = queue_manager.get_queue_stats()
    print(f"Queue statistics:")
    for key, value in stats.items():
        if key != "recent_tasks":
            print(f"  {key}: {value}")


def example_cleanup():
    """Example: cleanup task."""
    print("\n=== Example: cleanup task ===")

    project_id = "test_project"

    # Enqueue the cleanup task (delayed by 10 seconds)
    task_id = queue_manager.enqueue_cleanup_task(
        project_id=project_id,
        older_than_days=1,  # clean up files older than 1 day
        delay=10
    )

    print(f"Cleanup task submitted, task ID: {task_id}")


def main():
    """Entry point."""
    print("Queue system usage examples")
    print("=" * 50)

    try:
        # Run the examples
        example_single_file()
        example_multiple_files()
        example_zip_file()
        example_queue_stats()
        example_cleanup()

        print("\n" + "=" * 50)
        print("Examples finished!")
        print("\nTo inspect task execution, run:")
        print("python task_queue/consumer.py --check")
        print("\nTo start the queue consumer, run:")
        print("python task_queue/consumer.py")

    except Exception as e:
        print(f"Error while running the examples: {str(e)}")


if __name__ == "__main__":
    main()
215  task_queue/integration_tasks.py  Normal file

@@ -0,0 +1,215 @@
#!/usr/bin/env python3
"""
Queue tasks for file processing integration.
"""

import os
import json
import time
from typing import Dict, List, Optional, Any

from task_queue.config import huey
from task_queue.manager import queue_manager
from task_queue.task_status import task_status_store
from utils import download_dataset_files, save_processed_files_log


@huey.task()
def process_files_async(
    unique_id: str,
    files: Optional[Dict[str, List[str]]] = None,
    system_prompt: Optional[str] = None,
    mcp_settings: Optional[List[Dict]] = None,
    task_id: Optional[str] = None
) -> Dict[str, Any]:
    """
    Asynchronous file-processing task - compatible with the existing files/process API.

    Args:
        unique_id: unique project ID
        files: dict of file paths grouped by key
        system_prompt: system prompt
        mcp_settings: MCP settings
        task_id: task ID (used for status tracking)

    Returns:
        Result dict
    """
    try:
        print(f"Starting async file-processing task, project ID: {unique_id}")

        # If a task_id is given, set the initial status
        if task_id:
            task_status_store.set_status(
                task_id=task_id,
                unique_id=unique_id,
                status="running"
            )

        # Make sure the project directory exists
        project_dir = os.path.join("projects", unique_id)
        if not os.path.exists(project_dir):
            os.makedirs(project_dir, exist_ok=True)

        # Process files grouped by key
        processed_files_by_key = {}
        if files:
            # Use the files from the request (grouped by key).
            # download_dataset_files is a coroutine, so run it synchronously here.
            import asyncio
            try:
                loop = asyncio.get_event_loop()
            except RuntimeError:
                loop = asyncio.new_event_loop()
                asyncio.set_event_loop(loop)

            processed_files_by_key = loop.run_until_complete(download_dataset_files(unique_id, files))
            total_files = sum(len(files_list) for files_list in processed_files_by_key.values())
            print(f"Asynchronously processed {total_files} dataset files across {len(processed_files_by_key)} keys, project ID: {unique_id}")
        else:
            print(f"No files provided in the request, project ID: {unique_id}")

        # Collect all document.txt files under the project directory
        document_files = []
        for root, dirs, files_list in os.walk(project_dir):
            for file in files_list:
                if file == "document.txt":
                    document_files.append(os.path.join(root, file))

        # Save system_prompt and mcp_settings to the project directory (if given)
        if system_prompt:
            system_prompt_file = os.path.join(project_dir, "system_prompt.md")
            with open(system_prompt_file, 'w', encoding='utf-8') as f:
                f.write(system_prompt)
            print(f"Saved system_prompt, project ID: {unique_id}")

        if mcp_settings:
            mcp_settings_file = os.path.join(project_dir, "mcp_settings.json")
            with open(mcp_settings_file, 'w', encoding='utf-8') as f:
                json.dump(mcp_settings, f, ensure_ascii=False, indent=2)
            print(f"Saved mcp_settings, project ID: {unique_id}")

        # Generate the project README.md file
        try:
            from utils.project_manager import save_project_readme
            save_project_readme(unique_id)
            print(f"Generated README.md, project ID: {unique_id}")
        except Exception as e:
            print(f"Failed to generate README.md, project ID: {unique_id}, error: {str(e)}")
            # Does not affect the main processing flow; continue

        # Build the result file list
        result_files = []
        for key in processed_files_by_key.keys():
            # Add the corresponding dataset document.txt path
            document_path = os.path.join("projects", unique_id, "dataset", key, "document.txt")
            if os.path.exists(document_path):
                result_files.append(document_path)

        # Also include document.txt files that exist but are not in processed_files_by_key
        existing_document_paths = set(result_files)  # avoid duplicates
        for doc_file in document_files:
            if doc_file not in existing_document_paths:
                result_files.append(doc_file)

        result = {
            "status": "success",
            "message": f"Successfully processed {len(result_files)} document files across {len(processed_files_by_key)} keys",
            "unique_id": unique_id,
            "processed_files": result_files,
            "processed_files_by_key": processed_files_by_key,
            "document_files": document_files,
            "total_files_processed": sum(len(files_list) for files_list in processed_files_by_key.values()),
            "processing_time": time.time()
        }

        # Mark the task as completed
        if task_id:
            task_status_store.update_status(
                task_id=task_id,
                status="completed",
                result=result
            )

        print(f"Async file-processing task finished: {unique_id}")
        return result

    except Exception as e:
        error_msg = f"Error during async file processing: {str(e)}"
        print(error_msg)

        # Mark the task as failed
        if task_id:
            task_status_store.update_status(
                task_id=task_id,
                status="failed",
                error=error_msg
            )

        return {
            "status": "error",
            "message": error_msg,
            "unique_id": unique_id,
            "error": str(e)
        }


@huey.task()
def cleanup_project_async(
    unique_id: str,
    remove_all: bool = False
) -> Dict[str, Any]:
    """
    Asynchronously clean up project files.

    Args:
        unique_id: unique project ID
        remove_all: whether to remove the entire project directory

    Returns:
        Cleanup result dict
    """
    try:
        print(f"Starting async project cleanup, project ID: {unique_id}")

        project_dir = os.path.join("projects", unique_id)
        removed_items = []

        if remove_all and os.path.exists(project_dir):
            import shutil
            shutil.rmtree(project_dir)
            removed_items.append(project_dir)
            result = {
                "status": "success",
                "message": f"Removed the entire project directory: {project_dir}",
                "unique_id": unique_id,
                "removed_items": removed_items,
                "action": "remove_all"
            }
        else:
            # Only remove the processing log
            log_file = os.path.join(project_dir, "processed_files.json")
            if os.path.exists(log_file):
                os.remove(log_file)
                removed_items.append(log_file)

            result = {
                "status": "success",
                "message": f"Cleaned up the project processing log, project ID: {unique_id}",
                "unique_id": unique_id,
                "removed_items": removed_items,
                "action": "cleanup_logs"
            }

        print(f"Async cleanup task finished: {unique_id}")
        return result

    except Exception as e:
        error_msg = f"Error during async project cleanup: {str(e)}"
        print(error_msg)
        return {
            "status": "error",
            "message": error_msg,
            "unique_id": unique_id,
            "error": str(e)
        }
362
task_queue/manager.py
Normal file
362
task_queue/manager.py
Normal file
@ -0,0 +1,362 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Queue manager for handling file processing queues.
|
||||
"""
|
||||
|
||||
import os
|
||||
import json
|
||||
import time
|
||||
from typing import Dict, List, Optional, Any
|
||||
from huey import Huey
|
||||
from huey.api import Task
|
||||
from datetime import datetime, timedelta
|
||||
|
||||
from .config import huey
|
||||
from .tasks import process_file_async, process_multiple_files_async, process_zip_file_async, cleanup_processed_files
|
||||
|
||||
|
||||
class QueueManager:
|
||||
"""队列管理器,用于管理文件处理任务"""
|
||||
|
||||
def __init__(self):
|
||||
self.huey = huey
|
||||
print(f"队列管理器已初始化,使用数据库: {os.path.join(os.path.dirname(__file__), '..', 'queue_data', 'huey.db')}")
|
||||
|
||||
def enqueue_file(
|
||||
self,
|
||||
project_id: str,
|
||||
file_path: str,
|
||||
original_filename: str = None,
|
||||
delay: int = 0
|
||||
) -> str:
|
||||
"""
|
||||
将文件加入处理队列
|
||||
|
||||
Args:
|
||||
project_id: 项目ID
|
||||
file_path: 文件路径
|
||||
original_filename: 原始文件名
|
||||
delay: 延迟执行时间(秒)
|
||||
|
||||
Returns:
|
||||
任务ID
|
||||
"""
|
||||
if delay > 0:
|
||||
task = process_file_async.schedule(
|
||||
args=(project_id, file_path, original_filename),
|
||||
delay=timedelta(seconds=delay)
|
||||
)
|
||||
else:
|
||||
task = process_file_async(project_id, file_path, original_filename)
|
||||
|
||||
print(f"文件已加入队列: {file_path}, 任务ID: {task.id}")
|
||||
return task.id
|
||||
|
||||
def enqueue_multiple_files(
|
||||
self,
|
||||
project_id: str,
|
||||
file_paths: List[str],
|
||||
original_filenames: List[str] = None,
|
||||
delay: int = 0
|
||||
) -> List[str]:
|
||||
"""
|
||||
将多个文件加入处理队列
|
||||
|
||||
Args:
|
||||
project_id: 项目ID
|
||||
file_paths: 文件路径列表
|
||||
original_filenames: 原始文件名列表
|
||||
delay: 延迟执行时间(秒)
|
||||
|
||||
Returns:
|
||||
任务ID列表
|
||||
"""
|
||||
if delay > 0:
|
||||
task = process_multiple_files_async.schedule(
|
||||
args=(project_id, file_paths, original_filenames),
|
||||
delay=timedelta(seconds=delay)
|
||||
)
|
||||
else:
|
||||
task = process_multiple_files_async(project_id, file_paths, original_filenames)
|
||||
|
||||
print(f"批量文件已加入队列: {len(file_paths)} 个文件, 任务ID: {task.id}")
|
||||
return [task.id]
|
||||
|
||||
def enqueue_zip_file(
|
||||
self,
|
||||
project_id: str,
|
||||
zip_path: str,
|
||||
extract_to: str = None,
|
||||
delay: int = 0
|
||||
) -> str:
|
||||
"""
|
||||
将zip文件加入处理队列
|
||||
|
||||
Args:
|
||||
project_id: 项目ID
|
||||
zip_path: zip文件路径
|
||||
extract_to: 解压目标目录
|
||||
delay: 延迟执行时间(秒)
|
||||
|
||||
Returns:
|
||||
任务ID
|
||||
"""
|
||||
if delay > 0:
|
||||
task = process_zip_file_async.schedule(
|
||||
args=(project_id, zip_path, extract_to),
|
||||
delay=timedelta(seconds=delay)
|
||||
)
|
||||
else:
|
||||
task = process_zip_file_async(project_id, zip_path, extract_to)
|
||||
|
||||
print(f"zip文件已加入队列: {zip_path}, 任务ID: {task.id}")
|
||||
return task.id
|
||||
|
||||
def get_task_status(self, task_id: str) -> Dict[str, Any]:
|
||||
"""
|
||||
获取任务状态
|
||||
|
||||
Args:
|
||||
task_id: 任务ID
|
||||
|
||||
Returns:
|
||||
任务状态信息
|
||||
"""
|
||||
try:
|
||||
# 尝试从结果存储中获取任务结果
|
||||
try:
|
||||
# 使用huey的内置方法检查结果
|
||||
if hasattr(self.huey, 'result') and self.huey.result:
|
||||
result = self.huey.result(task_id)
|
||||
if result is not None:
|
||||
return {
|
||||
"task_id": task_id,
|
||||
"status": "complete",
|
||||
"result": result
|
||||
}
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
# 检查任务是否在待处理队列中
|
||||
try:
|
||||
pending_tasks = list(self.huey.pending())
|
||||
for task in pending_tasks:
|
||||
if hasattr(task, 'id') and task.id == task_id:
|
||||
return {
|
||||
"task_id": task_id,
|
||||
"status": "pending"
|
||||
}
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
# 检查任务是否在定时队列中
|
||||
try:
|
||||
scheduled_tasks = list(self.huey.scheduled())
|
||||
for task in scheduled_tasks:
|
||||
if hasattr(task, 'id') and task.id == task_id:
|
||||
return {
|
||||
"task_id": task_id,
|
||||
"status": "scheduled"
|
||||
}
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
# 如果都找不到,可能任务不存在或已完成但结果已清理
|
||||
return {
|
||||
"task_id": task_id,
|
||||
"status": "unknown",
|
||||
"message": "任务状态未知,可能已完成或不存在"
|
||||
}
|
||||
|
||||
except Exception as e:
|
||||
return {
|
||||
"task_id": task_id,
|
||||
"status": "error",
|
||||
"message": f"获取任务状态失败: {str(e)}"
|
||||
}
|
||||
|
||||
def get_queue_stats(self) -> Dict[str, Any]:
|
||||
"""
|
||||
获取队列统计信息
|
||||
|
||||
Returns:
|
||||
队列统计信息
|
||||
"""
|
||||
try:
|
||||
# 使用简化的统计方法
|
||||
stats = {
|
||||
"total_tasks": 0,
|
||||
"pending_tasks": 0,
|
||||
"running_tasks": 0,
|
||||
"completed_tasks": 0,
|
||||
"error_tasks": 0,
|
||||
"scheduled_tasks": 0,
|
||||
"recent_tasks": [],
|
||||
"queue_database": os.path.join(os.path.dirname(__file__), '..', 'queue_data', 'huey.db')
|
||||
}
|
||||
|
||||
# 尝试获取待处理任务数量
|
||||
try:
|
||||
pending_tasks = list(self.huey.pending())
|
||||
stats["pending_tasks"] = len(pending_tasks)
|
||||
stats["total_tasks"] += len(pending_tasks)
|
||||
except Exception as e:
|
||||
print(f"获取pending任务失败: {e}")
|
||||
|
||||
# 尝试获取定时任务数量
|
||||
try:
|
||||
scheduled_tasks = list(self.huey.scheduled())
|
||||
stats["scheduled_tasks"] = len(scheduled_tasks)
|
||||
stats["total_tasks"] += len(scheduled_tasks)
|
||||
except Exception as e:
|
||||
print(f"获取scheduled任务失败: {e}")
|
||||
|
||||
return stats
|
||||
|
||||
except Exception as e:
|
||||
return {
|
||||
"error": f"获取队列统计信息失败: {str(e)}"
|
||||
}
|
||||
|
||||
def cancel_task(self, task_id: str) -> bool:
|
||||
"""
|
||||
取消任务
|
||||
|
||||
Args:
|
||||
task_id: 任务ID
|
||||
|
||||
Returns:
|
||||
是否成功取消
|
||||
"""
|
||||
try:
|
||||
task = self.huey.get_task(task_id)
|
||||
if task and task.get_status() in ["pending", "scheduled"]:
|
||||
# Huey不直接支持取消任务,但可以从队列中移除
|
||||
# 这里返回False表示不能直接取消
|
||||
return False
|
||||
else:
|
||||
return False
|
||||
except Exception as e:
|
||||
print(f"取消任务失败: {str(e)}")
|
||||
return False
|
||||
|
||||
    def cleanup_old_tasks(self, older_than_days: int = 7) -> Dict[str, Any]:
        """
        Clean up old task records.

        Args:
            older_than_days: remove task records older than this many days

        Returns:
            The cleanup result.
        """
        try:
            # Simplified cleanup: huey has no per-age pruning, so flush the
            # whole queue instead.
            self.huey.flush()

            return {
                "status": "success",
                "message": "Queue flushed (simplified cleanup)",
                "older_than_days": older_than_days,
                "cleaned_count": "unknown (queue flushed)"
            }

        except Exception as e:
            return {
                "status": "error",
                "message": f"Failed to clean up task records: {str(e)}"
            }
    def enqueue_cleanup_task(
        self,
        project_id: str,
        older_than_days: int = 30,
        delay: int = 0
    ) -> str:
        """
        Enqueue a cleanup task.

        Args:
            project_id: the project ID
            older_than_days: remove files older than this many days
            delay: delay before execution, in seconds

        Returns:
            The task ID.
        """
        if delay > 0:
            task = cleanup_processed_files.schedule(
                args=(project_id, older_than_days),
                delay=timedelta(seconds=delay)
            )
        else:
            task = cleanup_processed_files(project_id, older_than_days)

        print(f"Cleanup task enqueued: project {project_id}, task ID: {task.id}")
        return task.id
    def list_pending_tasks(self, limit: int = 50) -> List[Dict[str, Any]]:
        """
        List tasks that are waiting to run.

        Args:
            limit: maximum number of tasks to return

        Returns:
            A list of pending tasks.
        """
        try:
            pending_tasks = []

            # Collect tasks waiting in the queue.
            try:
                tasks = list(self.huey.pending())
                for task in tasks[:limit]:
                    if hasattr(task, 'id'):
                        pending_tasks.append({
                            "task_id": task.id,
                            "status": "pending",
                        })
            except Exception as e:
                print(f"Failed to read pending tasks: {e}")

            # Collect tasks waiting in the schedule, up to the remaining limit.
            try:
                tasks = list(self.huey.scheduled())
                for task in tasks[:limit - len(pending_tasks)]:
                    if hasattr(task, 'id'):
                        pending_tasks.append({
                            "task_id": task.id,
                            "status": "scheduled",
                        })
            except Exception as e:
                print(f"Failed to read scheduled tasks: {e}")

            return pending_tasks

        except Exception as e:
            print(f"Failed to list pending tasks: {str(e)}")
            return []
    def get_task_result(self, task_id: str) -> Optional[Any]:
        """
        Get the result of a task.

        Args:
            task_id: the task ID

        Returns:
            The task result, or None if the task has not finished.
        """
        try:
            # huey returns None when no result has been stored yet.
            return self.huey.result(task_id)
        except Exception as e:
            print(f"Failed to get task result: {str(e)}")
            return None


# Global queue manager instance
queue_manager = QueueManager()
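A minimal usage sketch of the manager above, assuming a consumer process is already running and that the module is importable as `task_queue.queue_manager`; the project ID and delay are illustrative only:

```python
from task_queue.queue_manager import queue_manager

# Enqueue a delayed cleanup, then poll its status.
task_id = queue_manager.enqueue_cleanup_task(
    project_id="my_project_123",
    older_than_days=30,
    delay=60,
)

print(queue_manager.get_task_status(task_id))  # "scheduled" until the delay elapses
print(queue_manager.get_queue_stats())         # pending/scheduled counters
```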
210
task_queue/task_status.py
Normal file
210
task_queue/task_status.py
Normal file
@ -0,0 +1,210 @@
#!/usr/bin/env python3
"""
SQLite-backed task status store.
"""

import json
import os
import sqlite3
import time
from typing import Dict, Optional, Any, List
from pathlib import Path


class TaskStatusStore:
    """SQLite-based task status store."""

    def __init__(self, db_path: str = "queue_data/task_status.db"):
        self.db_path = db_path
        # Make sure the parent directory exists.
        Path(db_path).parent.mkdir(parents=True, exist_ok=True)
        self._init_database()

    def _init_database(self):
        """Create the task_status table if it does not exist."""
        with sqlite3.connect(self.db_path) as conn:
            conn.execute('''
                CREATE TABLE IF NOT EXISTS task_status (
                    task_id TEXT PRIMARY KEY,
                    unique_id TEXT NOT NULL,
                    status TEXT NOT NULL,
                    created_at REAL NOT NULL,
                    updated_at REAL NOT NULL,
                    result TEXT,
                    error TEXT
                )
            ''')
            conn.commit()
    def set_status(self, task_id: str, unique_id: str, status: str,
                   result: Optional[Dict] = None, error: Optional[str] = None):
        """Insert or overwrite a task's status record."""
        current_time = time.time()

        with sqlite3.connect(self.db_path) as conn:
            conn.execute('''
                INSERT OR REPLACE INTO task_status
                (task_id, unique_id, status, created_at, updated_at, result, error)
                VALUES (?, ?, ?, ?, ?, ?, ?)
            ''', (
                task_id, unique_id, status, current_time, current_time,
                json.dumps(result) if result else None,
                error
            ))
            conn.commit()

    def get_status(self, task_id: str) -> Optional[Dict]:
        """Get a task's status record, or None if it does not exist."""
        with sqlite3.connect(self.db_path) as conn:
            conn.row_factory = sqlite3.Row
            cursor = conn.execute(
                'SELECT * FROM task_status WHERE task_id = ?', (task_id,)
            )
            row = cursor.fetchone()

            if not row:
                return None

            result = dict(row)
            # Decode the JSON result column back into a dict.
            if result['result']:
                result['result'] = json.loads(result['result'])

            return result
    def update_status(self, task_id: str, status: str,
                      result: Optional[Dict] = None, error: Optional[str] = None):
        """Update an existing task's status; returns False if it does not exist."""
        with sqlite3.connect(self.db_path) as conn:
            # Check that the task exists first.
            cursor = conn.execute(
                'SELECT task_id FROM task_status WHERE task_id = ?', (task_id,)
            )
            if not cursor.fetchone():
                return False

            # Apply the update.
            conn.execute('''
                UPDATE task_status
                SET status = ?, updated_at = ?, result = ?, error = ?
                WHERE task_id = ?
            ''', (
                status, time.time(),
                json.dumps(result) if result else None,
                error, task_id
            ))
            conn.commit()
            return True

    def delete_status(self, task_id: str):
        """Delete a task's status record; returns True if a row was removed."""
        with sqlite3.connect(self.db_path) as conn:
            cursor = conn.execute(
                'DELETE FROM task_status WHERE task_id = ?', (task_id,)
            )
            conn.commit()
            return cursor.rowcount > 0
    def list_all(self) -> Dict[str, Dict]:
        """List all task status records, keyed by task ID."""
        with sqlite3.connect(self.db_path) as conn:
            conn.row_factory = sqlite3.Row
            cursor = conn.execute(
                'SELECT * FROM task_status ORDER BY updated_at DESC'
            )
            all_tasks = {}
            for row in cursor:
                result = dict(row)
                # Decode the JSON result column.
                if result['result']:
                    result['result'] = json.loads(result['result'])
                all_tasks[result['task_id']] = result
            return all_tasks

    def get_by_unique_id(self, unique_id: str) -> List[Dict]:
        """Get all tasks belonging to a given project ID."""
        with sqlite3.connect(self.db_path) as conn:
            conn.row_factory = sqlite3.Row
            cursor = conn.execute(
                'SELECT * FROM task_status WHERE unique_id = ? ORDER BY updated_at DESC',
                (unique_id,)
            )
            tasks = []
            for row in cursor:
                result = dict(row)
                if result['result']:
                    result['result'] = json.loads(result['result'])
                tasks.append(result)
            return tasks
    def cleanup_old_tasks(self, older_than_days: int = 7) -> int:
        """Delete task records older than the cutoff; returns the number removed."""
        cutoff_time = time.time() - (older_than_days * 24 * 3600)

        with sqlite3.connect(self.db_path) as conn:
            cursor = conn.execute(
                'DELETE FROM task_status WHERE updated_at < ?',
                (cutoff_time,)
            )
            conn.commit()
            return cursor.rowcount
    def get_statistics(self) -> Dict[str, Any]:
        """Get aggregate task statistics."""
        with sqlite3.connect(self.db_path) as conn:
            # Total number of tasks.
            total = conn.execute('SELECT COUNT(*) FROM task_status').fetchone()[0]

            # Counts grouped by status.
            status_stats = conn.execute('''
                SELECT status, COUNT(*) as count
                FROM task_status
                GROUP BY status
            ''').fetchall()

            # Tasks updated in the last 24 hours.
            recent = time.time() - (24 * 3600)
            recent_tasks = conn.execute(
                'SELECT COUNT(*) FROM task_status WHERE updated_at > ?',
                (recent,)
            ).fetchone()[0]

            return {
                'total_tasks': total,
                'status_breakdown': dict(status_stats),
                'recent_24h': recent_tasks,
                'database_path': self.db_path
            }
    def search_tasks(self, status: Optional[str] = None,
                     unique_id: Optional[str] = None,
                     limit: int = 100) -> List[Dict]:
        """Search tasks, optionally filtered by status and/or project ID."""
        query = 'SELECT * FROM task_status WHERE 1=1'
        params = []

        if status:
            query += ' AND status = ?'
            params.append(status)

        if unique_id:
            query += ' AND unique_id = ?'
            params.append(unique_id)

        query += ' ORDER BY updated_at DESC LIMIT ?'
        params.append(limit)

        with sqlite3.connect(self.db_path) as conn:
            conn.row_factory = sqlite3.Row
            cursor = conn.execute(query, params)
            tasks = []
            for row in cursor:
                result = dict(row)
                if result['result']:
                    result['result'] = json.loads(result['result'])
                tasks.append(result)
            return tasks


# Global status store instance
task_status_store = TaskStatusStore()
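A quick sketch of how the store above is used; the task ID, project ID, and result payload are illustrative:

```python
from task_queue.task_status import task_status_store

# Record a task's lifecycle, then query it back.
task_status_store.set_status("task-1", unique_id="my_project_123", status="pending")
task_status_store.update_status("task-1", status="complete",
                                result={"files_processed": 3})

print(task_status_store.get_status("task-1")["status"])  # "complete"
print(task_status_store.get_statistics())                # totals and per-status counts
```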
356
task_queue/tasks.py
Normal file
356
task_queue/tasks.py
Normal file
@ -0,0 +1,356 @@
#!/usr/bin/env python3
"""
File processing tasks for the queue system.
"""

import os
import json
import time
import shutil
from pathlib import Path
from typing import Dict, List, Optional, Any
from huey import crontab

from .config import huey
from utils.file_utils import (
    extract_zip_file,
    get_file_hash,
    is_file_already_processed,
    load_processed_files_log,
    save_processed_files_log,
    get_document_preview
)
@huey.task()
def process_file_async(
    project_id: str,
    file_path: str,
    original_filename: str = None,
    target_directory: str = "files"
) -> Dict[str, Any]:
    """
    Process a single file asynchronously.

    Args:
        project_id: the project ID
        file_path: path to the file
        original_filename: the original file name
        target_directory: target directory

    Returns:
        A result dict.
    """
    try:
        print(f"Processing file: {file_path}")

        # Make sure the project directory exists.
        project_dir = os.path.join("projects", project_id)
        files_dir = os.path.join(project_dir, target_directory)
        os.makedirs(files_dir, exist_ok=True)

        # Use the file hash as its identity.
        file_hash = get_file_hash(file_path)

        # Skip files that were already processed.
        processed_log = load_processed_files_log(project_id)
        if file_hash in processed_log:
            print(f"File already processed, skipping: {file_path}")
            return {
                "status": "skipped",
                "message": "File already processed",
                "file_hash": file_hash,
                "project_id": project_id
            }

        # Process the file.
        result = _process_single_file(
            file_path,
            files_dir,
            original_filename or os.path.basename(file_path)
        )

        # Update the processed-files log.
        if result["status"] == "success":
            processed_log[file_hash] = {
                "original_path": file_path,
                "original_filename": original_filename or os.path.basename(file_path),
                "processed_at": str(time.time()),
                "status": "processed",
                "result": result
            }
            save_processed_files_log(project_id, processed_log)

        result["file_hash"] = file_hash
        result["project_id"] = project_id

        print(f"File processed: {file_path}, status: {result['status']}")
        return result

    except Exception as e:
        error_msg = f"Error while processing file: {str(e)}"
        print(error_msg)
        return {
            "status": "error",
            "message": error_msg,
            "file_path": file_path,
            "project_id": project_id
        }
@huey.task()
def process_multiple_files_async(
    project_id: str,
    file_paths: List[str],
    original_filenames: List[str] = None
) -> List[Dict[str, Any]]:
    """
    Process multiple files asynchronously, as a batch.

    Args:
        project_id: the project ID
        file_paths: list of file paths
        original_filenames: list of original file names

    Returns:
        A list of per-file results.
    """
    try:
        print(f"Batch-processing {len(file_paths)} files")

        results = []
        for i, file_path in enumerate(file_paths):
            original_filename = original_filenames[i] if original_filenames and i < len(original_filenames) else None

            # Each call enqueues a sub-task and returns a huey Result handle.
            result = process_file_async(project_id, file_path, original_filename)
            results.append(result)

        print(f"Batch file-processing tasks submitted: {len(results)} files")
        return results

    except Exception as e:
        error_msg = f"Error while batch-processing files: {str(e)}"
        print(error_msg)
        return [{
            "status": "error",
            "message": error_msg,
            "project_id": project_id
        }]
@huey.task()
def process_zip_file_async(
    project_id: str,
    zip_path: str,
    extract_to: str = None
) -> Dict[str, Any]:
    """
    Process a ZIP archive asynchronously.

    Args:
        project_id: the project ID
        zip_path: path to the ZIP file
        extract_to: target directory for extraction

    Returns:
        A result dict.
    """
    try:
        print(f"Processing ZIP file: {zip_path}")

        # Pick a default extraction directory inside the project.
        if extract_to is None:
            extract_to = os.path.join("projects", project_id, "extracted", os.path.basename(zip_path))

        os.makedirs(extract_to, exist_ok=True)

        # Extract the archive.
        extracted_files = extract_zip_file(zip_path, extract_to)

        if not extracted_files:
            return {
                "status": "error",
                "message": "Extraction failed or no supported files found",
                "zip_path": zip_path,
                "project_id": project_id
            }

        # Batch-process the extracted files (enqueues a sub-task).
        result = process_multiple_files_async(project_id, extracted_files)

        return {
            "status": "success",
            "message": f"ZIP file processed; extracted {len(extracted_files)} files",
            "zip_path": zip_path,
            "extract_to": extract_to,
            "extracted_files": extracted_files,
            "project_id": project_id,
            "batch_task_result": result
        }

    except Exception as e:
        error_msg = f"Error while processing ZIP file: {str(e)}"
        print(error_msg)
        return {
            "status": "error",
            "message": error_msg,
            "zip_path": zip_path,
            "project_id": project_id
        }
@huey.task()
def cleanup_processed_files(
    project_id: str,
    older_than_days: int = 30
) -> Dict[str, Any]:
    """
    Clean up old processed files.

    Args:
        project_id: the project ID
        older_than_days: remove files older than this many days

    Returns:
        A result dict.
    """
    try:
        print(f"Cleaning files older than {older_than_days} days in project {project_id}")

        project_dir = os.path.join("projects", project_id)
        if not os.path.exists(project_dir):
            return {
                "status": "error",
                "message": "Project directory does not exist",
                "project_id": project_id
            }

        current_time = time.time()
        cutoff_time = current_time - (older_than_days * 24 * 3600)
        cleaned_files = []

        # Walk the project directory and delete files past the cutoff.
        for root, dirs, files in os.walk(project_dir):
            for file in files:
                file_path = os.path.join(root, file)
                file_mtime = os.path.getmtime(file_path)

                if file_mtime < cutoff_time:
                    try:
                        os.remove(file_path)
                        cleaned_files.append(file_path)
                        print(f"Deleted old file: {file_path}")
                    except Exception as e:
                        print(f"Failed to delete file {file_path}: {str(e)}")

        # Remove directories that are now empty (bottom-up).
        for root, dirs, files in os.walk(project_dir, topdown=False):
            for dir in dirs:
                dir_path = os.path.join(root, dir)
                try:
                    if not os.listdir(dir_path):
                        os.rmdir(dir_path)
                        print(f"Deleted empty directory: {dir_path}")
                except Exception as e:
                    print(f"Failed to delete directory {dir_path}: {str(e)}")

        return {
            "status": "success",
            "message": f"Cleanup complete; deleted {len(cleaned_files)} files",
            "project_id": project_id,
            "cleaned_files": cleaned_files,
            "older_than_days": older_than_days
        }

    except Exception as e:
        error_msg = f"Error while cleaning files: {str(e)}"
        print(error_msg)
        return {
            "status": "error",
            "message": error_msg,
            "project_id": project_id
        }
def _process_single_file(
    file_path: str,
    target_dir: str,
    original_filename: str
) -> Dict[str, Any]:
    """
    Internal helper that processes a single file.

    Args:
        file_path: source file path
        target_dir: target directory
        original_filename: original file name

    Returns:
        A result dict.
    """
    try:
        # The source file must exist.
        if not os.path.exists(file_path):
            return {
                "status": "error",
                "message": "Source file does not exist",
                "file_path": file_path
            }

        # Gather basic file information.
        file_size = os.path.getsize(file_path)
        file_ext = os.path.splitext(original_filename)[1].lower()

        # Only a fixed set of file types is supported.
        supported_extensions = ['.txt', '.md', '.pdf', '.doc', '.docx', '.zip']

        if file_ext not in supported_extensions:
            return {
                "status": "error",
                "message": f"Unsupported file type: {file_ext}",
                "file_path": file_path,
                "supported_extensions": supported_extensions
            }

        # Copy the file into the target directory.
        target_file_path = os.path.join(target_dir, original_filename)

        # If the target already exists, append a timestamp to the name.
        if os.path.exists(target_file_path):
            name, ext = os.path.splitext(original_filename)
            timestamp = int(time.time())
            target_file_path = os.path.join(target_dir, f"{name}_{timestamp}{ext}")

        shutil.copy2(file_path, target_file_path)

        # Build a short preview for plain-text files.
        preview = None
        if file_ext in ['.txt', '.md']:
            preview = get_document_preview(target_file_path, max_lines=5)

        return {
            "status": "success",
            "message": "File processed successfully",
            "original_path": file_path,
            "target_path": target_file_path,
            "file_size": file_size,
            "file_extension": file_ext,
            "preview": preview
        }

    except Exception as e:
        return {
            "status": "error",
            "message": f"Error while processing file: {str(e)}",
            "file_path": file_path
        }
# Periodic task example: runs daily at 02:00 to clean up files older than 30 days.
@huey.periodic_task(crontab(hour=2, minute=0))
def daily_cleanup():
    """Daily cleanup task."""
    print("Running daily cleanup task")
    # Cleanup logic can be added here.
    return {"status": "completed", "message": "Daily cleanup task finished"}
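Because these are plain `@huey.task()` functions, they can also be enqueued without going through the manager. A minimal sketch (the project ID and path are illustrative):

```python
from task_queue.tasks import process_zip_file_async

# Calling a @huey.task() function enqueues it and returns a huey Result handle.
res = process_zip_file_async("my_project_123", "public/data.zip")
print(res.id)               # task ID, usable with the status APIs above
print(res(blocking=False))  # None until a consumer has stored a result
```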
@ -69,6 +69,10 @@ from .api_models import (
    ProjectListResponse,
    ProjectStatsResponse,
    ProjectActionResponse,
    QueueTaskRequest,
    QueueTaskResponse,
    QueueStatusResponse,
    TaskStatusResponse,
    create_success_response,
    create_error_response,
    create_chat_response
@ -134,6 +138,10 @@ __all__ = [
    'ProjectListResponse',
    'ProjectStatsResponse',
    'ProjectActionResponse',
    'QueueTaskRequest',
    'QueueTaskResponse',
    'QueueStatusResponse',
    'TaskStatusResponse',
    'create_success_response',
    'create_error_response',
    'create_chat_response'
@ -199,6 +199,66 @@ def create_error_response(message: str, error_type: str = "error", **kwargs) ->
    }

class QueueTaskRequest(BaseModel):
    """Queue task request model."""
    unique_id: str
    files: Optional[Dict[str, List[str]]] = Field(default=None, description="Files organized by key groups. Each key maps to a list of file paths (supports zip files)")
    system_prompt: Optional[str] = None
    mcp_settings: Optional[List[Dict]] = None
    priority: Optional[int] = Field(default=0, description="Task priority (higher number = higher priority)")
    delay: Optional[int] = Field(default=0, description="Delay execution by N seconds")

    model_config = ConfigDict(extra='allow')

    @field_validator('files', mode='before')
    @classmethod
    def validate_files(cls, v):
        """Validate the dict format with key-grouped files."""
        if v is None:
            return None
        if isinstance(v, dict):
            # Every key must be a string mapping to a list of path strings.
            for key, value in v.items():
                if not isinstance(key, str):
                    raise ValueError(f"Key in files dict must be string, got {type(key)}")
                if not isinstance(value, list):
                    raise ValueError(f"Value in files dict must be list, got {type(value)} for key '{key}'")
                for item in value:
                    if not isinstance(item, str):
                        raise ValueError(f"File paths must be strings, got {type(item)} in key '{key}'")
            return v
        else:
            raise ValueError(f"Files must be a dict with key groups, got {type(v)}")
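A sketch of how the validator behaves, assuming the models package is importable as `api_models`; the payloads are illustrative:

```python
from api_models import QueueTaskRequest

# Accepted: files is a dict of key groups, each a list of path strings.
req = QueueTaskRequest(
    unique_id="my_project_123",
    files={"documents": ["public/document.txt"], "reports": ["public/data.zip"]},
    delay=5,
)

# Rejected: a bare list of paths raises a pydantic ValidationError.
QueueTaskRequest(unique_id="x", files=["a.txt"])
```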
class QueueTaskResponse(BaseModel):
    """Queue task response model."""
    success: bool
    message: str
    unique_id: str
    task_id: Optional[str] = None
    task_status: Optional[str] = None
    estimated_processing_time: Optional[int] = None  # seconds


class QueueStatusResponse(BaseModel):
    """Queue status response model."""
    success: bool
    message: str
    queue_stats: Dict[str, Any]
    pending_tasks: List[Dict[str, Any]]


class TaskStatusResponse(BaseModel):
    """Task status response model."""
    success: bool
    message: str
    task_id: str
    task_status: Optional[str] = None
    task_result: Optional[Dict[str, Any]] = None
    error: Optional[str] = None

def create_chat_response(
    messages: List[Message],
    model: str,