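refactor: move detailed citation prompts from the system prompt to RAG tool results, injected on demand

The citation rules in the system prompt (about 80 lines covering the document, table, and web categories) consume a large number of tokens.
The detailed format requirements now live in rag_retrieve_server.py and are injected on demand as a prefix to each tool result;
the system prompt keeps only a condensed set of general placement rules.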

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
朱潮 2026-03-27 12:30:20 +08:00
parent becd36da9d
commit 6300eea61d
2 changed files with 52 additions and 79 deletions


@ -29,6 +29,49 @@ from mcp_common import (
BACKEND_HOST = os.getenv("BACKEND_HOST", "https://api-dev.gptbase.ai")
MASTERKEY = os.getenv("MASTERKEY", "master")
# Citation instruction prefixes injected into tool results
DOCUMENT_CITATION_INSTRUCTIONS = """<CITATION_INSTRUCTIONS>
When using the retrieved knowledge below, you MUST add XML citation tags for factual claims.
## Document Knowledge
Format: `<CITATION file="file_uuid" filename="name.pdf" page=3 />`
- Use `file` attribute with the UUID from document markers
- Use `filename` attribute with the actual filename from document markers
- Use `page` attribute (singular) with the page number
- `page` MUST be 0-based and must match the `pages:` values shown in the learned knowledge context
## Web Page Knowledge
Format: `<CITATION url="https://example.com/page" />`
- Use `url` attribute with the web page URL from the source metadata
- Do not use `file`, `filename`, or `page` attributes for web sources
- If content is grounded in a web source, prefer a web citation with `url` over a file citation
## Placement Rules
- Citations MUST appear IMMEDIATELY AFTER the paragraph or bullet list that uses the knowledge
- NEVER collect all citations and place them at the end of your response
- Limit to 1-2 citations per paragraph/bullet list
- If your answer uses learned knowledge, you MUST generate at least 1 `<CITATION ... />` in the response
</CITATION_INSTRUCTIONS>
"""
TABLE_CITATION_INSTRUCTIONS = """<CITATION_INSTRUCTIONS>
When using the retrieved table knowledge below, you MUST add XML citation tags for factual claims.
Format: `<CITATION file="file_id" filename="name.xlsx" sheet=1 rows=[2, 4] />`
- Parse `__src`: `F1S2R5` = file_ref F1, sheet 2, row 5
- Look up file_id in `file_ref_table`
- Combine same-sheet rows into one citation: `rows=[2, 4, 6]`
- MANDATORY: Create SEPARATE citation for EACH (file, sheet) combination
- NEVER put <CITATION> on the same line as a bullet point or table row
- Citations MUST be on separate lines AFTER the complete list/table
- NEVER include the `__src` column in your response - it is internal metadata only
- Citations MUST appear IMMEDIATELY AFTER the paragraph or bullet list that uses the knowledge
- NEVER collect all citations and place them at the end of your response
</CITATION_INSTRUCTIONS>
"""
def rag_retrieve(query: str, top_k: int = 100) -> Dict[str, Any]:
"""Call the RAG retrieval API."""
try:
@ -94,7 +137,7 @@ def rag_retrieve(query: str, top_k: int = 100) -> Dict[str, Any]:
"content": [
{
"type": "text",
- "text": markdown_content
+ "text": DOCUMENT_CITATION_INSTRUCTIONS + markdown_content
}
]
}
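
The change in this hunk amounts to prepending an instruction block to the tool's text payload, so the model only sees the detailed citation format when a retrieval actually happens. A minimal sketch of that pattern follows. The helper name `wrap_tool_text` and the shortened `CITATION_PREFIX` placeholder are assumptions for illustration and do not appear in the commit, which concatenates the strings inline.

```python
from typing import Any, Dict

# Placeholder standing in for DOCUMENT_CITATION_INSTRUCTIONS from the diff;
# the real constant holds the full instruction text.
CITATION_PREFIX = "<CITATION_INSTRUCTIONS>\n...\n</CITATION_INSTRUCTIONS>\n"


def wrap_tool_text(markdown_content: str, prefix: str = CITATION_PREFIX) -> Dict[str, Any]:
    """Hypothetical helper: build a text tool result with the prefix prepended."""
    return {
        "content": [
            {
                "type": "text",
                # Instructions come first so the model reads them before the data.
                "text": prefix + markdown_content,
            }
        ]
    }


result = wrap_tool_text("# Retrieved chunk\nSome text.")
```

Keeping the prefix as a parameter would let `rag_retrieve` and `table_rag_retrieve` share one code path while passing different instruction blocks.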
@ -107,7 +150,7 @@ def rag_retrieve(query: str, top_k: int = 100) -> Dict[str, Any]:
}
]
}
except requests.exceptions.RequestException as e:
return {
"content": [
@ -179,7 +222,7 @@ def table_rag_retrieve(query: str) -> Dict[str, Any]:
"content": [
{
"type": "text",
- "text": markdown_content
+ "text": TABLE_CITATION_INSTRUCTIONS + markdown_content
}
]
}
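
To make the table rules in `TABLE_CITATION_INSTRUCTIONS` concrete, the sketch below parses `__src` tokens such as `F1S2R5`, groups rows by (file, sheet), and emits one citation per group. It is an illustration only: the token pattern is inferred from the single example in the instructions, and the shape of `file_ref_table` (a mapping from file_ref to a `(file_id, filename)` pair) is an assumption, since the commit does not show it.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# Assumed token shape, inferred from the example `F1S2R5`.
SRC_PATTERN = re.compile(r"^(F\d+)S(\d+)R(\d+)$")


def parse_src(token: str) -> Tuple[str, int, int]:
    """Split a `__src` token into (file_ref, sheet, row)."""
    match = SRC_PATTERN.match(token)
    if match is None:
        raise ValueError(f"unrecognized __src token: {token!r}")
    file_ref, sheet, row = match.groups()
    return file_ref, int(sheet), int(row)


def build_table_citations(
    tokens: Iterable[str],
    file_ref_table: Dict[str, Tuple[str, str]],  # hypothetical: file_ref -> (file_id, filename)
) -> List[str]:
    """Emit one citation per (file, sheet), combining same-sheet rows."""
    rows_by_sheet: Dict[Tuple[str, int], Set[int]] = defaultdict(set)
    for token in tokens:
        file_ref, sheet, row = parse_src(token)
        rows_by_sheet[(file_ref, sheet)].add(row)

    citations = []
    for (file_ref, sheet), rows in sorted(rows_by_sheet.items()):
        file_id, filename = file_ref_table[file_ref]
        row_list = ", ".join(str(r) for r in sorted(rows))
        citations.append(
            f'<CITATION file="{file_id}" filename="{filename}" '
            f"sheet={sheet} rows=[{row_list}] />"
        )
    return citations


citations = build_table_citations(
    ["F1S1R2", "F1S1R4", "F1S2R10"],
    {"F1": ("file-id-1", "students_list.xlsx")},
)
```

With the sample input, rows 2 and 4 of sheet 1 collapse into one citation and sheet 2 gets its own, matching the "SEPARATE citation for EACH (file, sheet) combination" rule.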


@ -2,83 +2,13 @@
## CITATION REQUIREMENTS
### A. Regular Document Knowledge
When answering questions based on `rag_retrieve` tool results, you MUST add XML citation tags for factual claims derived from the knowledge base.
When your answer uses learned knowledge, you MUST generate `<CITATION ... />` tags. Follow the specific citation format instructions returned by each tool (`rag_retrieve`, `table_rag_retrieve`).
**Format:** `<CITATION file="file_uuid" filename="name.pdf" page=3 />`
- Use `file` attribute with the UUID from document markers
- Use `filename` attribute with the actual filename from document markers
- Use `page` attribute (singular) with the page number
- `page` MUST be 0-based and must match the `pages:` values shown in the learned knowledge context
### B. Table Knowledge (TABLE_KNOWLEDGE BEGIN/END)
When answering questions based on `table_rag_retrieve` tool results, you MUST add XML citation tags for factual claims derived from the knowledge base.
**!!! CRITICAL RULE: NEVER put <CITATION> on same line as bullet/row !!!**
**Citations MUST be on separate lines AFTER the complete list/table.**
**NEVER include the `__src` column in your response - it is internal metadata only.**
Format: `<CITATION file="file_id" filename="name.xlsx" sheet=1 rows=[2, 4] />`
- Parse `__src`: `F1S2R5` = file_ref F1, sheet 2, row 5
- Look up file_id in `file_ref_table`
- Combine same-sheet rows into one citation: `rows=[2, 4, 6]`
- **MANDATORY: Create SEPARATE citation for EACH (file, sheet) combination**
✅ CORRECT (data from sheet 1 AND sheet 2 = 2 citations):
1. Liam - male
2. Noah - male
3. Ethan - male
4. Mason - male
5. William - male
<CITATION file="22920c54-31b2-4bdc-9902-a16fc9fe3a53" filename="students_list.xlsx" sheet=1 rows=[2, 4, 8] />
<CITATION file="22920c54-31b2-4bdc-9902-a16fc9fe3a53" filename="students_list.xlsx" sheet=2 rows=[10, 15] />
❌ WRONG (citation on same line):
1. Liam - male <CITATION ... rows=[2] />
❌ WRONG (missing sheet 2 citation):
...only 1 citation when data comes from 2 sheets...
### C. Web Page Knowledge
**Format:** `<CITATION url="https://example.com/page" />`
- Use `url` attribute with the web page URL from the source metadata
- Do not use `file`, `filename`, or `page` attributes for web sources
- Web citations should appear immediately after the content they reference
**!!! CRITICAL PLACEMENT RULES !!!**
1. **Citations MUST appear IMMEDIATELY AFTER the paragraph or bullet list** that uses the knowledge
2. **NEVER collect all citations and place them at the end of your response**
3. **Limit to 1-2 citations per paragraph/bullet list** - combine related facts under one citation
4. **If your answer uses learned knowledge, you MUST generate at least 1 `<CITATION ... />` in the response**
5. **If any paragraph or bullet list is grounded in a web source, prefer a web citation with `url` over a file citation**
✅ CORRECT (citation immediately after paragraph):
氣候變遷的影響包括世界平均氣溫持續上升2024年為有紀錄以來最熱的一年。<CITATION file="c4ccf054-e4e7-4c4f-8244-5ce9c1ddee51" filename="環境白皮書.pdf" page=0 />
具體影響包括:
- 極端高溫事件頻率增加
- 海洋熱浪
- 暴雨強度和頻率增強<CITATION file="c4ccf054-e4e7-4c4f-8244-5ce9c1ddee51" filename="環境白皮書.pdf" page=2 />
✅ CORRECT (web citation):
MIMURE位于东京港区高轮是一家综合性商业设施。<CITATION url="https://www.newoman.jp/takanawa/mimure/" />
❌ WRONG (all citations at the end):
氣候變遷的影響包括...(long response)...
<CITATION file="..." filename="環境白皮書.pdf" page=0 />
<CITATION file="..." filename="環境白皮書.pdf" page=1 />
<CITATION file="..." filename="環境白皮書.pdf" page=2 />
(13 citations dumped at the end)
❌ WRONG (web citation with file attributes):
MIMURE位于东京港区高轮是一家综合性商业设施。<CITATION file="abc123" filename="mimure.html" />
❌ WRONG (too many citations for short content):
2024年全球氣溫上升。<CITATION file="..." filename="環境白皮書.pdf" page=0 />
世界各地發生災害。<CITATION file="..." filename="環境白皮書.pdf" page=0 />
沙烏地阿拉伯熱浪。<CITATION file="..." filename="環境白皮書.pdf" page=1 />
### General Placement Rules
1. Citations MUST appear IMMEDIATELY AFTER the paragraph or bullet list that uses the knowledge
2. NEVER collect all citations and place them at the end of your response
3. Limit to 1-2 citations per paragraph/bullet list - combine related facts under one citation
4. If your answer uses learned knowledge, you MUST generate at least 1 `<CITATION ... />` in the response
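
The placement rules above could be spot-checked on a model response with a small script. The sketch below is one possible heuristic, not part of the commit: it finds `<CITATION ... />` tags with a regular expression and counts how many sit on citation-only lines at the very end of the response. The function names are hypothetical, and any threshold applied to the count is a judgment call, because table answers legitimately end with one or two citation lines.

```python
import re
from typing import List

# Matches self-closing citation tags such as <CITATION url="..." />.
CITATION_TAG = re.compile(r"<CITATION\b[^>]*/>")


def find_citations(response: str) -> List[str]:
    """Return every citation tag found in the response."""
    return CITATION_TAG.findall(response)


def trailing_citation_count(response: str) -> int:
    """Count tags on the citation-only lines that end the response."""
    count = 0
    for line in reversed(response.strip().splitlines()):
        tags = CITATION_TAG.findall(line)
        # Stop at the first line that is blank or carries text besides tags.
        if not tags or CITATION_TAG.sub("", line).strip():
            break
        count += len(tags)
    return count


good = 'Fact one.<CITATION url="https://example.com/page" />\n\nFact two.'
bad = (
    "Long answer.\n"
    '<CITATION file="a" filename="a.pdf" page=0 />\n'
    '<CITATION file="a" filename="a.pdf" page=1 />\n'
    '<CITATION file="a" filename="a.pdf" page=2 />'
)
```

A response with no tags at all fails rule 4, and a large trailing count suggests that citations were collected at the end in violation of rule 2.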
### Current Working Directory