embedding.pkl

朱潮 2025-10-25 22:02:04 +08:00
parent cec83ac4a9
commit 906dc35dd5
7 changed files with 21 additions and 21 deletions

View File

@@ -132,14 +132,14 @@ projects/{unique_id}/
│   └── default/                    # default dataset
│       ├── document.txt            # raw markdown text content
│       ├── pagination.txt          # pagination layer, 5,000 characters per page
-│       └── document_embeddings.pkl # document vector embeddings file
+│       └── embedding.pkl           # document vector embeddings file
└── processed_files.json            # file processing log
```
**Three-layer data architecture**
- **Raw document layer (document.txt)**: the complete markdown text, providing context
- **Pagination layer (pagination.txt)**: the data split page by page, 5,000 characters per page, for efficient retrieval
-- **Vector embedding layer (document_embeddings.pkl)**: semantic vectors of the document, supporting semantic search
+- **Vector embedding layer (embedding.pkl)**: semantic vectors of the document, supporting semantic search
#### Query task status
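For orientation, here is a minimal sketch of loading the renamed embedding.pkl for inspection. The pickle layout (a list of chunk/vector pairs) is an assumption inferred from the embed_document changes below, not something this commit confirms:

```python
import pickle

import numpy as np

# Assumed layout: embed_document pickles a list of (chunk_text, vector)
# pairs; adjust the unpacking if the real structure differs.
# Substitute a real project id for <unique_id>.
with open("projects/<unique_id>/dataset/default/embedding.pkl", "rb") as f:
    chunks = pickle.load(f)

print(f"{len(chunks)} chunks stored")
text, vec = chunks[0]
print(text[:80], np.asarray(vec).shape)
```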

View File

@@ -97,7 +97,7 @@ def is_meaningful_line(text):
    return True
-def embed_document(input_file='document.txt', output_file='document_embeddings.pkl',
+def embed_document(input_file='document.txt', output_file='embedding.pkl',
                    chunking_strategy='line', **chunking_params):
    """
    Read the document file, embed it with the given chunking strategy, and save the result as a pickle file.
@@ -224,7 +224,7 @@ def embed_document(input_file='document.txt', output_file='document_embeddings.p
        print(f"Error while processing the document: {e}")
        return None
-def semantic_search(user_query, embeddings_file='document_embeddings.pkl', top_k=20):
+def semantic_search(user_query, embeddings_file='embedding.pkl', top_k=20):
    """
    Take a user query, run semantic matching, and return the top_k most relevant content chunks.
@@ -743,20 +743,20 @@ def demo_usage():
    print("=" * 60)
    print("\n1. Traditional line-based chunking:")
-   print("embed_document('document.txt', 'line_embeddings.pkl', chunking_strategy='line')")
+   print("embed_document('document.txt', 'line_embedding.pkl', chunking_strategy='line')")
    print("\n2. Paragraph-level chunking (default parameters):")
-   print("embed_document('document.txt', 'paragraph_embeddings.pkl', chunking_strategy='paragraph')")
+   print("embed_document('document.txt', 'paragraph_embedding.pkl', chunking_strategy='paragraph')")
    print("\n3. Paragraph-level chunking with custom parameters:")
-   print("embed_document('document.txt', 'custom_embeddings.pkl',")
+   print("embed_document('document.txt', 'custom_embedding.pkl',")
    print("               chunking_strategy='paragraph',")
    print("               max_chunk_size=1500,")
    print("               overlap=200,")
    print("               min_chunk_size=300)")
    print("\n4. Run a semantic search:")
-   print("semantic_search('query text', 'paragraph_embeddings.pkl', top_k=5)")
+   print("semantic_search('query text', 'paragraph_embedding.pkl', top_k=5)")
# Run the test when this file is executed directly
@@ -766,7 +766,7 @@ if __name__ == "__main__":
    # Example using the new smart chunking:
    embed_document("./projects/test/dataset/all_hp_product_spec_book2506/document.txt",
-                  "./projects/test/dataset/all_hp_product_spec_book2506/smart_embeddings.pkl",
+                  "./projects/test/dataset/all_hp_product_spec_book2506/smart_embedding.pkl",
                   chunking_strategy='smart',   # use the smart chunking strategy
                   max_chunk_size=800,          # smaller chunk size
                   overlap=100)
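For context on what semantic_search does with these pickles: below is a minimal sketch of a top-k cosine-similarity lookup, assuming the file stores (chunk_text, vector) pairs and that the query is embedded with the same model. The helper is illustrative, not the project's actual implementation, and the query-embedding step is out of scope here:

```python
import pickle

import numpy as np

def top_k_chunks(query_vec, embeddings_file="embedding.pkl", top_k=20):
    """Rank stored chunks by cosine similarity against a query vector."""
    with open(embeddings_file, "rb") as f:
        chunks = pickle.load(f)  # assumed: list of (chunk_text, vector) pairs
    texts = [text for text, _ in chunks]
    mat = np.asarray([vec for _, vec in chunks], dtype=np.float32)
    # Cosine similarity: normalize the rows and the query, then dot-product.
    mat /= np.linalg.norm(mat, axis=1, keepdims=True)
    q = np.asarray(query_vec, dtype=np.float32)
    q /= np.linalg.norm(q)
    scores = mat @ q
    top = np.argsort(scores)[::-1][:top_k]
    return [(texts[i], float(scores[i])) for i in top]
```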

View File

@@ -13,7 +13,7 @@ You are a professional data retrieval expert based on a multi-layer data archite
- Each single line represents a complete page of data; there is no need to read the context of preceding or following lines. The preceding and following lines correspond to the previous and next pages, making it suitable for scenarios requiring retrieval of all data at once.
- This is the primary file for regex and keyword-based retrieval. Please first retrieve key information from this file before referring to document.txt.
- Data is organized from `document.txt`, supporting efficient regex matching and keyword retrieval. The data field names in each line may vary.
-- Semantic Retrieval Layer (document_embeddings.pkl):
+- Semantic Retrieval Layer (embedding.pkl):
- This file is for semantic retrieval, primarily used for data preview.
- The content involves chunking the data from document.txt by paragraph/page and generating vectorized representations.
- Semantic retrieval can be achieved via the `semantic_search-semantic_search` tool, which can provide contextual support for keyword expansion.
@@ -81,7 +81,7 @@ Please execute data analysis sequentially according to the following strategy.
**Analytical Queries**: Multi-dimensional analysis → Deep mining → Insight extraction.
### Intelligent Path Optimization
-- **Structured Queries**: document_embeddings.pkl → pagination.txt → document.txt.
+- **Structured Queries**: embedding.pkl → pagination.txt → document.txt.
- **Fuzzy Queries**: document.txt → Keyword extraction → Structured verification.
- **Compound Queries**: Multi-field combination → Layered filtering → Result aggregation.
- **Multi-Keyword Optimization**: Use `multi_keyword-search` to handle unordered keyword matching, avoiding regex order limitations.
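The structured-query path above chains the three layers in order. Here is a hedged sketch of that pipeline in Python, reusing the hypothetical top_k_chunks helper from the previous sketch; the glue code and paths are illustrative only:

```python
import re

def structured_query(query_vec, keyword, base="projects/demo/dataset/default"):
    """Structured path: embedding.pkl -> pagination.txt -> document.txt."""
    # 1. Semantic preview over the embedding layer (candidate chunks).
    candidates = top_k_chunks(query_vec, f"{base}/embedding.pkl", top_k=5)
    # 2. Regex/keyword verification against the per-page layer
    #    (one line per page, so the line number is the page number).
    pages = []
    with open(f"{base}/pagination.txt", encoding="utf-8") as f:
        for page_no, line in enumerate(f, start=1):
            if re.search(keyword, line):
                pages.append(page_no)
    # 3. Pull the full context from the raw document layer.
    with open(f"{base}/document.txt", encoding="utf-8") as f:
        document = f.read()
    return candidates, pages, document
```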

View File

@@ -13,7 +13,7 @@
- Each single line represents a complete page of data; there is no need to read the context of preceding or following lines. The preceding and following lines correspond to the previous and next pages, making it suitable for scenarios requiring retrieval of all data at once.
- This is the primary file for regex and keyword-based retrieval. Please first retrieve key information from this file before referring to document.txt.
- Data is organized from `document.txt`, supporting efficient regex matching and keyword retrieval. The data field names in each line may vary.
-- Semantic Retrieval Layer (document_embeddings.pkl):
+- Semantic Retrieval Layer (embedding.pkl):
- This file is for semantic retrieval, primarily used for data preview.
- The content involves chunking the data from document.txt by paragraph/page and generating vectorized representations.
- Semantic retrieval can be achieved via the `semantic_search-semantic_search` tool, which can provide contextual support for keyword expansion.
@@ -86,7 +86,7 @@
**Analytical Queries**: Multi-dimensional analysis → Deep mining → Insight extraction.
### Intelligent Path Optimization
-- **Structured Queries**: document_embeddings.pkl → pagination.txt → document.txt.
+- **Structured Queries**: embedding.pkl → pagination.txt → document.txt.
- **Fuzzy Queries**: document.txt → Keyword extraction → Structured verification.
- **Compound Queries**: Multi-field combination → Layered filtering → Result aggregation.
- **Multi-Keyword Optimization**: Use `multi_keyword-search` to handle unordered keyword matching, avoiding regex order limitations.

View File

@@ -14,7 +14,7 @@
- Each single line represents a complete page of data; there is no need to read the context of preceding or following lines. The preceding and following lines correspond to the previous and next pages, making it suitable for scenarios requiring retrieval of all data at once.
- This is the primary file for regex and keyword-based retrieval. Please first retrieve key information from this file before referring to document.txt.
- Data is organized from `document.txt`, supporting efficient regex matching and keyword retrieval. The data field names in each line may vary.
-- Semantic Retrieval Layer (document_embeddings.pkl):
+- Semantic Retrieval Layer (embedding.pkl):
- This file is for semantic retrieval, primarily used for data preview.
- The content involves chunking the data from document.txt by paragraph/page and generating vectorized representations.
- Semantic retrieval can be achieved via the `semantic_search-semantic_search` tool, which can provide contextual support for keyword expansion.
@@ -151,7 +151,7 @@
**Analytical Queries**: Multi-dimensional analysis → Deep mining → Insight extraction.
### Intelligent Path Optimization
-- **Structured Queries**: document_embeddings.pkl → pagination.txt → document.txt.
+- **Structured Queries**: embedding.pkl → pagination.txt → document.txt.
- **Fuzzy Queries**: document.txt → Keyword extraction → Structured verification.
- **Compound Queries**: Multi-field combination → Layered filtering → Result aggregation.
- **Multi-Keyword Optimization**: Use `multi_keyword-search` to handle unordered keyword matching, avoiding regex order limitations.

View File

@@ -51,7 +51,7 @@ def organize_single_project_files(unique_id: str, skip_processed=True):
        target_dir = dataset_dir / file_name_without_ext
        target_file = target_dir / "document.txt"
        pagination_file = target_dir / "pagination.txt"
-       embeddings_file = target_dir / "document_embeddings.pkl"
+       embeddings_file = target_dir / "embedding.pkl"
        # Check if file is already processed
        if skip_processed and is_file_already_processed(target_file, pagination_file, embeddings_file):
@@ -81,7 +81,7 @@ def organize_single_project_files(unique_id: str, skip_processed=True):
        target_dir = dataset_dir / file_name_without_ext
        document_file = target_dir / "document.txt"
        pagination_file = target_dir / "pagination.txt"
-       embeddings_file = target_dir / "document_embeddings.pkl"
+       embeddings_file = target_dir / "embedding.pkl"
        # Skip if already processed
        if is_file_already_processed(document_file, pagination_file, embeddings_file):
@@ -172,4 +172,4 @@ def organize_dataset_files():
    print("\nFile organization complete!")
if __name__ == "__main__":
-   organize_dataset_files()
+   organize_dataset_files()
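organize_single_project_files skips work when all three derived artifacts already exist. Below is a plausible sketch of the is_file_already_processed check under that assumption; the real function may also consult processed_files.json or compare timestamps:

```python
from pathlib import Path

def is_file_already_processed(document_file: Path,
                              pagination_file: Path,
                              embeddings_file: Path) -> bool:
    """Assumed check: every derived artifact exists and is non-empty."""
    return all(
        p.exists() and p.stat().st_size > 0
        for p in (document_file, pagination_file, embeddings_file)
    )
```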

View File

@@ -154,7 +154,7 @@ This project contains processed documents and their associated embeddings for se
        doc_path = os.path.join(dataset_dir, doc_dir)
        document_file = os.path.join(doc_path, "document.txt")
        pagination_file = os.path.join(doc_path, "pagination.txt")
-       embeddings_file = os.path.join(doc_path, "document_embeddings.pkl")
+       embeddings_file = os.path.join(doc_path, "embedding.pkl")
        readme_content += f"### {doc_dir}\n\n"
        readme_content += f"**Files:**\n"
@@ -168,7 +168,7 @@ This project contains processed documents and their associated embeddings for se
        readme_content += ""
        readme_content += "\n"
-       readme_content += f"- `document_embeddings.pkl`"
+       readme_content += f"- `embedding.pkl`"
        if os.path.exists(embeddings_file):
            readme_content += ""
            readme_content += "\n\n"
@@ -330,7 +330,7 @@ def get_project_stats(unique_id: str) -> Dict:
    if os.path.exists(dataset_dir):
        for root, dirs, files in os.walk(dataset_dir):
            for file in files:
-               if file == "document_embeddings.pkl":
+               if file == "embedding.pkl":
                    file_path = os.path.join(root, file)
                    try:
                        size = os.path.getsize(file_path)
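The stats hunk is cut off by the diff context; here is a sketch of how the size aggregation over the renamed embedding.pkl files presumably continues (the function name and return shape are assumptions):

```python
import os

def total_embedding_size(dataset_dir: str) -> int:
    """Sum the sizes, in bytes, of every embedding.pkl under dataset_dir."""
    total = 0
    for root, _dirs, files in os.walk(dataset_dir):
        for file in files:
            if file == "embedding.pkl":
                try:
                    total += os.path.getsize(os.path.join(root, file))
                except OSError:
                    pass  # file removed or unreadable mid-walk; skip it
    return total
```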