# Knowledge Base Module Implementation Plan

> **Enhanced on:** 2025-02-10
> **Sections enhanced:** 10
> **Research agents used:** FastAPI best practices, Vue 3 composables, UI/UX patterns, RAGFlow SDK, File upload security, Architecture strategy, Code simplicity, Security sentinel, Performance oracle

---

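## Enhancement Summary

### Key Improvements

1. **Security hardening** - Add file type validation, size limits, and API key management
2. **Performance optimization** - Streaming file uploads, paginated queries, connection pool management
3. **Layered architecture** - Introduce a service layer and the repository pattern to improve testability
4. **Code simplification** - Remove over-engineering and follow the YAGNI principle
5. **User experience** - Polish empty states, loading states, and error handling

### New Considerations Discovered

- The RAGFlow deployment is served over HTTP (not HTTPS); the security risk needs to be assessed
- File uploads must be streamed to avoid exhausting memory
- Chunk queries must be paginated; otherwise large datasets can cause OOM
- API keys should be managed through environment variables and never hard-coded
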
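
As a starting point for the HTTP risk assessment noted above, the following minimal sketch flags a RAGFlow base URL that would send the API key over plain HTTP to a non-local host. The function name `transport_warnings` and the list of hosts treated as local are assumptions made for illustration; they are not part of the existing codebase.

```python
from urllib.parse import urlparse

# Hosts treated as local for this sketch (assumption: traffic never leaves the machine)
LOCAL_HOSTS = {"localhost", "127.0.0.1", "::1"}


def transport_warnings(api_url: str) -> list[str]:
    """Return human-readable warnings about the transport used by api_url."""
    parsed = urlparse(api_url)
    warnings: list[str] = []
    if parsed.scheme not in ("http", "https"):
        warnings.append(f"unsupported URL scheme: {parsed.scheme!r}")
    elif parsed.scheme == "http" and (parsed.hostname or "") not in LOCAL_HOSTS:
        # The API key travels in a request header, so plain HTTP exposes it on the network
        warnings.append(
            f"{parsed.hostname} is reached over plain HTTP; "
            "the API key and uploaded files are sent unencrypted"
        )
    return warnings


if __name__ == "__main__":
    for message in transport_warnings("http://100.77.70.35:1080"):
        print("WARNING:", message)
```
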

---

## Overview

Add a standalone knowledge base module to the qwen-client project (independent of bots), with knowledge base management implemented through the RAGFlow SDK.

**Architecture:**

```
qwen-client (Vue 3) → qwen-agent (FastAPI) → RAGFlow (http://100.77.70.35:1080)
```

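## Background

Users need a standalone knowledge base management system that can:

1. Create and manage multiple knowledge bases (datasets)
2. Upload files to a knowledge base
3. Manage the document chunks inside a knowledge base
4. Later be linked to bots for RAG retrieval

---
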
## Technical Design

### Backend Implementation (qwen-agent)

#### 1. Environment Configuration

**File:** `/utils/settings.py`

```python
# ============================================================
# RAGFlow Knowledge Base Configuration
# ============================================================

# RAGFlow API settings
RAGFLOW_API_URL = os.getenv("RAGFLOW_API_URL", "http://100.77.70.35:1080")
RAGFLOW_API_KEY = os.getenv("RAGFLOW_API_KEY", "")  # must be set via environment variable

# File upload settings
RAGFLOW_MAX_UPLOAD_SIZE = int(os.getenv("RAGFLOW_MAX_UPLOAD_SIZE", str(100 * 1024 * 1024)))  # 100 MB
RAGFLOW_ALLOWED_EXTENSIONS = os.getenv("RAGFLOW_ALLOWED_EXTENSIONS", "pdf,docx,txt,md,csv").split(",")

# Performance settings
RAGFLOW_CONNECTION_TIMEOUT = int(os.getenv("RAGFLOW_CONNECTION_TIMEOUT", "30"))  # 30 seconds
RAGFLOW_MAX_CONCURRENT_UPLOADS = int(os.getenv("RAGFLOW_MAX_CONCURRENT_UPLOADS", "5"))
```

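
Because `RAGFLOW_API_KEY` defaults to an empty string, a missing key would only surface on the first RAGFlow call. The sketch below shows one way to fail fast at startup instead. The names `ConfigError` and `load_ragflow_settings` are hypothetical, and the normalisation of extensions (trimmed, lower-cased, de-duplicated) is an assumption beyond what the settings file above does.

```python
import os
from typing import Mapping


class ConfigError(RuntimeError):
    """Raised when a required RAGFlow setting is missing or invalid."""


def load_ragflow_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Read RAGFlow settings from env and validate them eagerly."""
    api_key = env.get("RAGFLOW_API_KEY", "").strip()
    if not api_key:
        raise ConfigError("RAGFLOW_API_KEY must be set via environment variable")

    max_size = int(env.get("RAGFLOW_MAX_UPLOAD_SIZE", str(100 * 1024 * 1024)))
    if max_size <= 0:
        raise ConfigError("RAGFLOW_MAX_UPLOAD_SIZE must be a positive number of bytes")

    raw_extensions = env.get("RAGFLOW_ALLOWED_EXTENSIONS", "pdf,docx,txt,md,csv")
    # Normalise so that " PDF" and ".pdf" both match an uploaded "report.pdf"
    extensions = sorted(
        {ext.strip().lower().lstrip(".") for ext in raw_extensions.split(",") if ext.strip()}
    )

    return {
        "api_url": env.get("RAGFLOW_API_URL", "http://100.77.70.35:1080"),
        "api_key": api_key,
        "max_upload_size": max_size,
        "allowed_extensions": extensions,
    }
```
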
#### 2. Installing Dependencies

**File:** `/pyproject.toml`

Add the following under `[tool.poetry.dependencies]`:

```toml
ragflow-sdk = "^0.1.0"
python-magic = "^0.4.27"
aiofiles = "^24.1.0"
```

Then run:

```bash
poetry install
poetry export -f requirements.txt -o requirements.txt --without-hashes
```

#### 3. Project Structure

Following the architecture review, the backend uses a layered design:

```
qwen-agent/
├── routes/
│   └── knowledge_base.py           # API route layer
├── services/
│   └── knowledge_base_service.py   # Business logic layer (new)
├── repositories/
│   ├── __init__.py
│   └── ragflow_repository.py       # RAGFlow adapter (new)
└── utils/
    ├── settings.py                 # Configuration
    └── file_validator.py           # File validation helper (new)
```

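
The testability benefit of this layering can be illustrated with a minimal sketch: the service depends on a repository interface, so tests can substitute an in-memory fake for RAGFlow. The `DatasetRepository` protocol, its method signature, and `InMemoryDatasetRepository` are assumptions made for illustration; the real `RAGFlowRepository` would forward the paging arguments to RAGFlow rather than slice a local list.

```python
import asyncio
from typing import Optional, Protocol


class DatasetRepository(Protocol):
    """What the service needs from storage (hypothetical interface)."""

    async def list_datasets(
        self, page: int, page_size: int, search: Optional[str]
    ) -> tuple[list[dict], int]:
        """Return one page of datasets plus the total match count."""
        ...


class InMemoryDatasetRepository:
    """Test double standing in for the RAGFlow adapter."""

    def __init__(self, datasets: list[dict]) -> None:
        self._datasets = list(datasets)

    async def list_datasets(self, page, page_size, search):
        matches = self._datasets
        if search:
            needle = search.lower()
            matches = [d for d in matches if needle in d.get("name", "").lower()]
        start = (page - 1) * page_size
        return matches[start:start + page_size], len(matches)


class KnowledgeBaseService:
    """Business logic layer: shapes repository results into API responses."""

    def __init__(self, repository: DatasetRepository) -> None:
        self._repository = repository

    async def list_datasets(self, page: int = 1, page_size: int = 20,
                            search: Optional[str] = None) -> dict:
        items, total = await self._repository.list_datasets(page, page_size, search)
        return {"items": items, "total": total, "page": page, "page_size": page_size}


if __name__ == "__main__":
    repo = InMemoryDatasetRepository([{"name": f"ds-{i}"} for i in range(45)])
    print(asyncio.run(KnowledgeBaseService(repo).list_datasets(page=3)))
```
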
#### 4. API Route Design

**File:** `/routes/knowledge_base.py`

**Route prefix:** `/api/v1/knowledge-base`

| Endpoint | Method | Description | Auth | Optimization |
|------|------|------|------|------|
| `/datasets` | GET | List all datasets (paginated) | Admin Token | Cached |
| `/datasets` | POST | Create a dataset | Admin Token | - |
| `/datasets/{dataset_id}` | GET | Get dataset details | Admin Token | Cached |
| `/datasets/{dataset_id}` | PATCH | Update a dataset (partial update) | Admin Token | - |
| `/datasets/{dataset_id}` | DELETE | Delete a dataset | Admin Token | - |
| `/datasets/{dataset_id}/files` | GET | List files in a dataset (paginated) | Admin Token | Cached |
| `/datasets/{dataset_id}/files` | POST | Upload a file to a dataset (streamed) | Admin Token | Rate-limited |
| `/datasets/{dataset_id}/files/{document_id}` | DELETE | Delete a file | Admin Token | - |
| `/datasets/{dataset_id}/chunks` | GET | List chunks in a dataset (paginated) | Admin Token | Cursor pagination |
| `/datasets/{dataset_id}/chunks/{chunk_id}` | DELETE | Delete a chunk | Admin Token | - |

**Code structure:**

```python
"""
Knowledge Base API routes.

Knowledge base management backed by the RAGFlow SDK.
"""
import logging
from typing import List, Optional

from fastapi import APIRouter, Depends, File, Header, HTTPException, Query, UploadFile
from pydantic import BaseModel, Field

from repositories.ragflow_repository import RAGFlowRepository
from services.knowledge_base_service import KnowledgeBaseService

logger = logging.getLogger('app')

router = APIRouter()


# ============== Dependency injection ==============
async def get_kb_service() -> KnowledgeBaseService:
    """Return a knowledge base service instance."""
    return KnowledgeBaseService(RAGFlowRepository())


async def verify_admin(authorization: Optional[str] = Header(None)):
    """Verify admin privileges (reuses the existing authentication)."""
    from routes.bot_manager import verify_admin_auth
    valid, username = await verify_admin_auth(authorization)
    if not valid:
        raise HTTPException(status_code=401, detail="Unauthorized")
    return username


# ============== Pydantic Models ==============

class DatasetCreate(BaseModel):
    """Request body for creating a dataset."""
    name: str = Field(..., min_length=1, max_length=128, description="Dataset name")
    description: Optional[str] = Field(None, max_length=500, description="Description")
    chunk_method: str = Field(default="naive", description="Chunking method")
    # Chunking methods supported by RAGFlow: naive, manual, qa, table, paper, book, laws, presentation, picture, one, email, knowledge-graph


class DatasetUpdate(BaseModel):
    """Request body for updating a dataset (partial update)."""
    name: Optional[str] = Field(None, min_length=1, max_length=128)
    description: Optional[str] = Field(None, max_length=500)
    chunk_method: Optional[str] = None


class DatasetListResponse(BaseModel):
    """Paginated dataset list response."""
    items: List[dict]
    total: int
    page: int
    page_size: int


# ============== Dataset endpoints ==============

@router.get("/datasets", response_model=DatasetListResponse)
async def list_datasets(
    page: int = Query(1, ge=1, description="Page number"),
    page_size: int = Query(20, ge=1, le=100, description="Items per page"),
    search: Optional[str] = Query(None, description="Search keyword"),
    username: str = Depends(verify_admin),
    kb_service: KnowledgeBaseService = Depends(get_kb_service)
):
    """List datasets (supports pagination and search)."""
    return await kb_service.list_datasets(
        page=page,
        page_size=page_size,
        search=search
    )


@router.post("/datasets", status_code=201)
async def create_dataset(
    data: DatasetCreate,
    username: str = Depends(verify_admin),
    kb_service: KnowledgeBaseService = Depends(get_kb_service)
):
    """Create a dataset."""
    try:
        dataset = await kb_service.create_dataset(
            name=data.name,
            description=data.description,
            chunk_method=data.chunk_method
        )
        return dataset
    except Exception as e:
        logger.error(f"Failed to create dataset: {e}")
        raise HTTPException(status_code=500, detail=f"Failed to create dataset: {str(e)}")


@router.get("/datasets/{dataset_id}")
async def get_dataset(
    dataset_id: str,
    username: str = Depends(verify_admin),
    kb_service: KnowledgeBaseService = Depends(get_kb_service)
):
    """Get dataset details."""
    dataset = await kb_service.get_dataset(dataset_id)
    if not dataset:
        raise HTTPException(status_code=404, detail="Dataset not found")
    return dataset


@router.patch("/datasets/{dataset_id}")
async def update_dataset(
    dataset_id: str,
    data: DatasetUpdate,
    username: str = Depends(verify_admin),
    kb_service: KnowledgeBaseService = Depends(get_kb_service)
):
    """Update a dataset (partial update)."""
    try:
        dataset = await kb_service.update_dataset(dataset_id, data.model_dump(exclude_unset=True))
        if not dataset:
            raise HTTPException(status_code=404, detail="Dataset not found")
        return dataset
    except HTTPException:
        # Let the 404 above propagate instead of being converted to a 500 below
        raise
    except Exception as e:
        logger.error(f"Failed to update dataset: {e}")
        raise HTTPException(status_code=500, detail=f"Failed to update dataset: {str(e)}")


@router.delete("/datasets/{dataset_id}", status_code=204)
async def delete_dataset(
    dataset_id: str,
    username: str = Depends(verify_admin),
    kb_service: KnowledgeBaseService = Depends(get_kb_service)
):
    """Delete a dataset."""
    success = await kb_service.delete_dataset(dataset_id)
    if not success:
        raise HTTPException(status_code=404, detail="Dataset not found")


# ============== File endpoints ==============

@router.get("/datasets/{dataset_id}/files")
async def list_dataset_files(
    dataset_id: str,
    page: int = Query(1, ge=1),
    page_size: int = Query(20, ge=1, le=100),
    username: str = Depends(verify_admin),
    kb_service: KnowledgeBaseService = Depends(get_kb_service)
):
    """List files in a dataset (paginated)."""
    return await kb_service.list_files(dataset_id, page=page, page_size=page_size)


@router.post("/datasets/{dataset_id}/files")
async def upload_file(
    dataset_id: str,
    file: UploadFile = File(...),
    username: str = Depends(verify_admin),
    kb_service: KnowledgeBaseService = Depends(get_kb_service)
):
    """
    Upload a file to a dataset (streamed).

    Supported file types: PDF, DOCX, TXT, MD, CSV
    Maximum file size: 100MB
    """
    # File validation is handled in the service layer
    try:
        result = await kb_service.upload_file(dataset_id, file)
        return result
    except ValueError as e:
        raise HTTPException(status_code=400, detail=str(e))
    except Exception as e:
        logger.error(f"Failed to upload file: {e}")
        raise HTTPException(status_code=500, detail=f"Upload failed: {str(e)}")


@router.delete("/datasets/{dataset_id}/files/{document_id}")
async def delete_file(
    dataset_id: str,
    document_id: str,
    username: str = Depends(verify_admin),
    kb_service: KnowledgeBaseService = Depends(get_kb_service)
):
    """Delete a file."""
    success = await kb_service.delete_file(dataset_id, document_id)
    if not success:
        raise HTTPException(status_code=404, detail="File not found")
    return {"success": True}


# ============== Chunk endpoints (optional, deferred) ==============
# Per the simplification review, chunk management is deferred until there is a concrete need
```


---

### Frontend Implementation (qwen-client)

#### 1. API Service Layer

**File:** `/src/api/index.js`

Add a `knowledgeBaseApi` module:

```javascript
// ============== Knowledge Base API ==============
// Note: export knowledgeBaseApi the same way the existing API modules in this file are exported
const knowledgeBaseApi = {
  // Dataset management
  getDatasets: async (params = {}) => {
    const qs = new URLSearchParams(params).toString()
    return request(`/api/v1/knowledge-base/datasets${qs ? '?' + qs : ''}`)
  },

  createDataset: async (data) => {
    return request('/api/v1/knowledge-base/datasets', {
      method: 'POST',
      body: JSON.stringify(data)
    })
  },

  updateDataset: async (datasetId, data) => {
    return request(`/api/v1/knowledge-base/datasets/${datasetId}`, {
      method: 'PATCH', // PATCH supports partial updates
      body: JSON.stringify(data)
    })
  },

  deleteDataset: async (datasetId) => {
    return request(`/api/v1/knowledge-base/datasets/${datasetId}`, {
      method: 'DELETE'
    })
  },

  // File management
  getDatasetFiles: async (datasetId, params = {}) => {
    const qs = new URLSearchParams(params).toString()
    return request(`/api/v1/knowledge-base/datasets/${datasetId}/files${qs ? '?' + qs : ''}`)
  },

  uploadFile: async (datasetId, file, onProgress) => {
    const formData = new FormData()
    formData.append('file', file)

    // XMLHttpRequest is used because it exposes upload progress events
    const xhr = new XMLHttpRequest()

    return new Promise((resolve, reject) => {
      xhr.upload.addEventListener('progress', (e) => {
        if (onProgress && e.lengthComputable) {
          onProgress(Math.round((e.loaded / e.total) * 100))
        }
      })

      xhr.addEventListener('load', () => {
        if (xhr.status >= 200 && xhr.status < 300) {
          resolve(JSON.parse(xhr.responseText))
        } else {
          reject(new Error(xhr.statusText))
        }
      })

      xhr.addEventListener('error', () => reject(new Error('Upload failed')))
      xhr.addEventListener('abort', () => reject(new Error('Upload cancelled')))

      xhr.open('POST', `${API_BASE}/api/v1/knowledge-base/datasets/${datasetId}/files`)
      xhr.setRequestHeader('Authorization', `Bearer ${localStorage.getItem('admin_token') || 'dummy-token'}`)
      xhr.send(formData)
    })
  },

  deleteFile: async (datasetId, documentId) => {
    return request(`/api/v1/knowledge-base/datasets/${datasetId}/files/${documentId}`, {
      method: 'DELETE'
    })
  }
}
```

#### 2. Simplified State Management

**Following the code simplicity review, state is managed directly in the component instead of in a separate composable.**

```vue
<!-- KnowledgeBaseView.vue -->
<script setup>
import { ref, onMounted } from 'vue'
import { knowledgeBaseApi } from '@/api'
import DatasetList from '@/components/knowledge-base/DatasetList.vue'
import FileList from '@/components/knowledge-base/FileList.vue'

// State
const datasets = ref([])
const currentDataset = ref(null)
const files = ref([])
const isLoading = ref(false)
const error = ref(null)

// Pagination
const page = ref(1)
const pageSize = ref(20)
const total = ref(0)

// Load datasets
const loadDatasets = async () => {
  isLoading.value = true
  error.value = null
  try {
    const response = await knowledgeBaseApi.getDatasets({
      page: page.value,
      page_size: pageSize.value
    })
    datasets.value = response.items || []
    total.value = response.total
  } catch (err) {
    error.value = err.message
  } finally {
    isLoading.value = false
  }
}

// Select a dataset
const selectDataset = async (dataset) => {
  currentDataset.value = dataset
  await loadFiles(dataset.dataset_id)
}

// Load files
const loadFiles = async (datasetId) => {
  isLoading.value = true
  error.value = null
  try {
    const response = await knowledgeBaseApi.getDatasetFiles(datasetId)
    files.value = response.items || []
  } catch (err) {
    error.value = err.message
  } finally {
    isLoading.value = false
  }
}

// Create a dataset
const createDataset = async (data) => {
  await knowledgeBaseApi.createDataset(data)
  await loadDatasets()
}

// Delete a dataset
const deleteDataset = async (datasetId) => {
  await knowledgeBaseApi.deleteDataset(datasetId)
  if (currentDataset.value?.dataset_id === datasetId) {
    currentDataset.value = null
    files.value = []
  }
  await loadDatasets()
}

// Upload a file to the current dataset, then refresh the list
// (assumes FileList emits the selected File object with "upload")
const handleFileUpload = async (file) => {
  await knowledgeBaseApi.uploadFile(currentDataset.value.dataset_id, file)
  await loadFiles(currentDataset.value.dataset_id)
}

// Delete a file from the current dataset, then refresh the list
// (assumes FileList emits the document ID with "delete")
const handleFileDelete = async (documentId) => {
  await knowledgeBaseApi.deleteFile(currentDataset.value.dataset_id, documentId)
  await loadFiles(currentDataset.value.dataset_id)
}

onMounted(() => {
  loadDatasets()
})
</script>

<template>
  <div class="knowledge-base-view">
    <!-- Dataset list -->
    <DatasetList
      :datasets="datasets"
      :loading="isLoading"
      :current="currentDataset"
      @select="selectDataset"
      @create="createDataset"
      @delete="deleteDataset"
    />

    <!-- File list (shown once a dataset is selected) -->
    <FileList
      v-if="currentDataset"
      :dataset="currentDataset"
      :files="files"
      @upload="handleFileUpload"
      @delete="handleFileDelete"
    />
  </div>
</template>
```

#### 3. Router Configuration

**File:** `/src/router/index.js`

Add the knowledge base route:

```javascript
{
  path: '/knowledge-base',
  name: 'knowledge-base',
  component: () => import('@/views/KnowledgeBaseView.vue'),
  meta: { requiresAuth: true, title: 'Knowledge Base Management' }
}
```

#### 4. View Components

**File:** `/src/views/KnowledgeBaseView.vue`

The main view contains:
- Dataset list (left side or top)
- File list (shown once a dataset is selected)
- Upload file button
- Create dataset button

**Child components (simplified structure):**

| Component | File | Purpose |
|------|------|------|
| `DatasetList.vue` | `/src/components/knowledge-base/DatasetList.vue` | Dataset list with create/delete |
| `DatasetFormModal.vue` | `/src/components/knowledge-base/DatasetFormModal.vue` | Create/edit dataset modal (merged) |
| `FileList.vue` | `/src/components/knowledge-base/FileList.vue` | File list with upload |
| `FileUploadModal.vue` | `/src/components/knowledge-base/FileUploadModal.vue` | File upload modal |

**Directory structure:**

```
src/components/knowledge-base/
├── DatasetList.vue          # Dataset list (with create button)
├── DatasetFormModal.vue     # Create/edit dataset form
├── FileList.vue             # File list (with upload button)
└── FileUploadModal.vue      # File upload modal
```

#### 5. Navigation Menu

**File:** `/src/views/AdminView.vue`

Add a knowledge base entry to the navigation menu:

```vue
<Button
  variant="ghost"
  @click="currentView = 'knowledge-base'"
>
  <Database :size="20" />
  <span>Knowledge Base</span>
</Button>
```

---

## Implementation Phases

### Phase 1: Backend Foundation (qwen-agent) - Core Features

- [ ] Add the `ragflow-sdk` dependency to `pyproject.toml`
- [ ] Add RAGFlow configuration (environment variables) to `utils/settings.py`
- [ ] Create `repositories/ragflow_repository.py` - RAGFlow SDK adapter
- [ ] Create `services/knowledge_base_service.py` - business logic layer
- [ ] Create `routes/knowledge_base.py` - API routes
- [ ] Register the router in `fastapi_app.py`
- [ ] Test the API endpoints

### Phase 2: Frontend API Layer (qwen-client)

- [ ] Add `knowledgeBaseApi` to `src/api/index.js`
- [ ] Add the knowledge base route to `src/router/index.js`
- [ ] Add the navigation entry to AdminView

### Phase 3: Frontend UI Components - Minimal Implementation

- [ ] Create the `src/components/knowledge-base/` directory
- [ ] Implement the `KnowledgeBaseView.vue` main view
- [ ] Implement the `DatasetList.vue` component
- [ ] Implement the `DatasetFormModal.vue` component
- [ ] Implement the `FileList.vue` component
- [ ] Implement the `FileUploadModal.vue` component

### Phase 4: Chunk Management (Deferred)

Following the YAGNI principle, chunk management is deferred until there is a concrete need:
- [ ] Backend: implement the chunk list/delete endpoints
- [ ] Frontend: implement the `ChunkList.vue` component
- [ ] Chunk search

### Phase 5: Testing and Optimization

- [ ] End-to-end tests
- [ ] Error handling improvements
- [ ] Loading state improvements
- [ ] Add performance monitoring

---

## Data Models

### Dataset

```typescript
interface Dataset {
  dataset_id: string
  name: string
  description?: string
  chunk_method: string
  chunk_count?: number
  document_count?: number
  created_at: string
  updated_at: string
}
```

### File (Document)

```typescript
interface Document {
  document_id: string
  dataset_id: string
  name: string
  size: number
  status: 'running' | 'success' | 'failed'
  progress: number  // 0-100
  chunk_count?: number
  token_count?: number
  created_at: string
  updated_at: string
}
```

### Chunk - Deferred

```typescript
interface Chunk {
  chunk_id: string
  document_id: string
  dataset_id: string
  content: string
  position: number
  important_keywords?: string[]
  available: boolean
  created_at: string
}
```

---

## API Endpoint Specification

### 1. List Datasets (Paginated)

```
GET /api/v1/knowledge-base/datasets?page=1&page_size=20&search=keyword
Authorization: Bearer {admin_token}

Response:
{
  "items": [
    {
      "dataset_id": "uuid",
      "name": "Product Manual",
      "description": "Company product documentation",
      "chunk_method": "naive",
      "document_count": 5,
      "chunk_count": 120,
      "created_at": "2025-01-01T00:00:00Z",
      "updated_at": "2025-01-01T00:00:00Z"
    }
  ],
  "total": 1,
  "page": 1,
  "page_size": 20
}
```

### 2. Create a Dataset

```
POST /api/v1/knowledge-base/datasets
Authorization: Bearer {admin_token}
Content-Type: application/json

{
  "name": "Product Manual",
  "description": "Company product documentation",
  "chunk_method": "naive"
}

Response:
{
  "dataset_id": "uuid",
  "name": "Product Manual",
  ...
}
```

### 3. Upload a File (Streamed)

```
POST /api/v1/knowledge-base/datasets/{dataset_id}/files
Authorization: Bearer {admin_token}
Content-Type: multipart/form-data

file: <binary>

Response (asynchronous):
{
  "document_id": "uuid",
  "name": "document.pdf",
  "status": "running",
  "progress": 0,
  ...
}
```

---

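## Security Considerations

### Research Insights

**File upload security:**
- Validate file types against a whitelist (extension + MIME type + magic number)
- Limit file size (100MB maximum)
- Rename files with a UUID to prevent path traversal
- Strip dangerous characters from file names

**API authentication:**
- Reuse the existing `verify_admin_auth` function
- Every endpoint requires a valid Admin Token
- Integrate with the existing RBAC system

**Input validation:**
- Validate input with Pydantic `Field`
- Limit string lengths
- Validate pagination parameter ranges

**Configuration security:**
- The API key must be set through an environment variable
- Do not hard-code sensitive information in the code

### Security Checklist

| Measure | Priority | Status |
|------|--------|------|
| File type validation | High | To do |
| File size limit | High | To do |
| API key via environment variable | High | Planned |
| Path traversal protection | High | To do |
| File name sanitization | Medium | To do |
| Virus scanning | Medium | Optional |
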
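
The file upload measures above (extension whitelist, magic number check, size limit, file name sanitization, UUID renaming) can be sketched with the standard library alone. This is an illustration under stated assumptions, not the planned `file_validator.py`: the function names are hypothetical, and the small hand-written prefix table stands in for the MIME detection that `python-magic` would provide. Plain text formats (TXT, MD, CSV) have no magic number, so only the extension and size are checked for them.

```python
import re
import uuid

ALLOWED_EXTENSIONS = {"pdf", "docx", "txt", "md", "csv"}
MAX_UPLOAD_SIZE = 100 * 1024 * 1024  # 100 MB

# Leading bytes of the binary formats; DOCX is a ZIP container
MAGIC_PREFIXES = {"pdf": b"%PDF-", "docx": b"PK\x03\x04"}


def sanitize_filename(name: str) -> str:
    """Drop directory components and replace characters outside [word . -]."""
    base = name.replace("\\", "/").rsplit("/", 1)[-1]
    cleaned = re.sub(r"[^\w.\-]", "_", base).lstrip(".")
    return cleaned or "unnamed"


def validate_upload(filename: str, head: bytes, size: int) -> dict:
    """Validate an upload from its name, first bytes, and size.

    Raises ValueError on rejection, which the upload route maps to HTTP 400.
    """
    safe_name = sanitize_filename(filename)
    ext = safe_name.rsplit(".", 1)[-1].lower() if "." in safe_name else ""
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"file type not allowed: {ext or '(none)'}")
    if size > MAX_UPLOAD_SIZE:
        raise ValueError("file exceeds the maximum upload size")
    expected = MAGIC_PREFIXES.get(ext)
    if expected is not None and not head.startswith(expected):
        raise ValueError("file content does not match its extension")
    return {
        "display_name": safe_name,
        # The stored name is derived from a UUID, never from user input
        "storage_name": f"{uuid.uuid4().hex}.{ext}",
    }
```
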
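
---

## Performance Optimization

### Research Insights

**File upload optimization:**
- Stream uploads instead of reading large files into memory in one go
- Limit concurrency (at most 5 concurrent uploads)
- Add an upload progress callback

**Query optimization:**
- Paginate results to avoid returning large payloads
- Use cursor pagination to keep deep pages fast
- Cache the dataset list

**Connection pool:**
- Use an asynchronous HTTP client with connection pooling
- Set reasonable timeouts
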
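### Performance Checklist

| Optimization | Priority | Expected effect |
|--------|--------|----------|
| Streaming file upload | High | Avoids OOM |
| Paginated queries | High | Response time < 100ms |
| Dataset caching | Medium | Fewer external API calls |
| Connection pool | Medium | Higher concurrency |
| Rate limiting | Medium | Prevents resource exhaustion |
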
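
For the dataset caching item, a small in-process cache with a time-to-live is one possible approach. The sketch below is an assumption about how it could look, not a decision the plan has made: the class name `TTLCache`, the 30-second default, and the key layout are all hypothetical. Entries expire after the TTL, and `invalidate()` is meant to be called after create, update, or delete so that lists do not go stale.

```python
import time
from typing import Any, Callable, Hashable, Optional


class TTLCache:
    """Minimal time-based cache; not shared across worker processes."""

    def __init__(self, ttl_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._store: dict[Hashable, tuple[float, Any]] = {}

    def get(self, key: Hashable) -> Optional[Any]:
        """Return the cached value, or None if absent or expired."""
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]
            return None
        return value

    def set(self, key: Hashable, value: Any) -> None:
        """Store value under key until the TTL elapses."""
        self._store[key] = (self._clock() + self._ttl, value)

    def invalidate(self) -> None:
        """Drop everything, e.g. after a dataset is created or deleted."""
        self._store.clear()


if __name__ == "__main__":
    cache = TTLCache(ttl_seconds=30)
    key = ("datasets", 1, 20, None)  # page, page_size, search
    if cache.get(key) is None:
        cache.set(key, {"items": [], "total": 0, "page": 1, "page_size": 20})
    print(cache.get(key))
```
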
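
---

## User Experience Improvements

### Research Insights

**Empty states:**
- Give empty lists a friendly message and a clear next action
- Distinguish between empty-state scenarios (first use, no search results, and so on)

**Loading states:**
- Use skeleton screens instead of a traditional loading indicator
- Show progress, especially for file uploads

**Error handling:**
- Use human-readable error messages
- Offer specific suggestions for fixing the problem
- Distinguish error types (network, validation, server)

**File upload UX:**
- Support drag-and-drop upload
- Show upload progress
- Support batch upload
- Show file size and type validation results

---
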
## Configuration Checklist

### Environment Variables

| Variable | Default | Description | Required |
|--------|--------|------|------|
| `RAGFLOW_API_URL` | `http://100.77.70.35:1080` | RAGFlow API address | Yes |
| `RAGFLOW_API_KEY` | - | RAGFlow API key | Yes |
| `RAGFLOW_MAX_UPLOAD_SIZE` | `104857600` | Maximum upload size (bytes) | No |
| `RAGFLOW_ALLOWED_EXTENSIONS` | `pdf,docx,txt,md,csv` | Allowed file extensions | No |
| `RAGFLOW_CONNECTION_TIMEOUT` | `30` | Connection timeout (seconds) | No |
| `RAGFLOW_MAX_CONCURRENT_UPLOADS` | `5` | Maximum concurrent uploads | No |

### Dependencies

```toml
[tool.poetry.dependencies]
ragflow-sdk = "^0.1.0"
python-magic = "^0.4.27"
aiofiles = "^24.1.0"
```

---

## References

- **RAGFlow official documentation:** https://ragflow.com.cn/docs
- **RAGFlow HTTP API:** https://ragflow.io/docs/http_api_reference
- **RAGFlow GitHub:** https://github.com/infiniflow/ragflow
- **RAGFlow Python SDK:** https://github.com/infiniflow/ragflow/blob/main/docs/references/python_api_reference.md
- **qwen-client API layer:** `/src/api/index.js`
- **qwen-agent route example:** `/routes/bot_manager.py`

**Research sources:**
- [Vue.js Official Composables Guide](https://vuejs.org/guide/reusability/composables.html)
- [FastAPI Official Documentation](https://fastapi.tiangolo.com/)
- [UX Best Practices for File Uploader - Uploadcare](https://uploadcare.com/blog/file-uploader-ux-best-practices/)
- [Empty State UX Examples - Pencil & Paper](https://www.pencilandpaper.io/articles/empty-states)
- [Error-Message Guidelines - Nielsen Norman Group](https://www.nngroup.com/articles/error-message-guidelines/)

---

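## Future Extensions

1. **Bot integration:** Select a knowledge base in the bot settings
2. **RAG retrieval:** Implement question answering backed by the knowledge base
3. **Batch operations:** Upload and delete files in bulk
4. **Knowledge base search:** Search content within a knowledge base
5. **Usage statistics:** View how the knowledge base is being used
6. **Chunk management:** View and edit chunks in the frontend (deferred)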