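refactor: move detailed citation prompt from system prompt to RAG tool results, injected on demand

The citation rules in the system prompt (about 80 lines covering the document,
table, and web categories) consume a large number of tokens. The detailed format
requirements now live in rag_retrieve_server.py and are injected on demand as a
prefix to each tool result; the system prompt keeps only a condensed set of
general placement rules.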

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
朱潮 2026-03-27 12:30:20 +08:00
parent becd36da9d
commit 6300eea61d
2 changed files with 52 additions and 79 deletions


@@ -29,6 +29,49 @@ from mcp_common import (
 BACKEND_HOST = os.getenv("BACKEND_HOST", "https://api-dev.gptbase.ai")
 MASTERKEY = os.getenv("MASTERKEY", "master")
+# Citation instruction prefixes injected into tool results
+DOCUMENT_CITATION_INSTRUCTIONS = """<CITATION_INSTRUCTIONS>
+When using the retrieved knowledge below, you MUST add XML citation tags for factual claims.
+## Document Knowledge
+Format: `<CITATION file="file_uuid" filename="name.pdf" page=3 />`
+- Use `file` attribute with the UUID from document markers
+- Use `filename` attribute with the actual filename from document markers
+- Use `page` attribute (singular) with the page number
+- `page` MUST be 0-based and must match the `pages:` values shown in the learned knowledge context
+## Web Page Knowledge
+Format: `<CITATION url="https://example.com/page" />`
+- Use `url` attribute with the web page URL from the source metadata
+- Do not use `file`, `filename`, or `page` attributes for web sources
+- If content is grounded in a web source, prefer a web citation with `url` over a file citation
+## Placement Rules
+- Citations MUST appear IMMEDIATELY AFTER the paragraph or bullet list that uses the knowledge
+- NEVER collect all citations and place them at the end of your response
+- Limit to 1-2 citations per paragraph/bullet list
+- If your answer uses learned knowledge, you MUST generate at least 1 `<CITATION ... />` in the response
+</CITATION_INSTRUCTIONS>
+"""
+TABLE_CITATION_INSTRUCTIONS = """<CITATION_INSTRUCTIONS>
+When using the retrieved table knowledge below, you MUST add XML citation tags for factual claims.
+Format: `<CITATION file="file_id" filename="name.xlsx" sheet=1 rows=[2, 4] />`
+- Parse `__src`: `F1S2R5` = file_ref F1, sheet 2, row 5
+- Look up file_id in `file_ref_table`
+- Combine same-sheet rows into one citation: `rows=[2, 4, 6]`
+- MANDATORY: Create SEPARATE citation for EACH (file, sheet) combination
+- NEVER put <CITATION> on the same line as a bullet point or table row
+- Citations MUST be on separate lines AFTER the complete list/table
+- NEVER include the `__src` column in your response - it is internal metadata only
+- Citations MUST appear IMMEDIATELY AFTER the paragraph or bullet list that uses the knowledge
+- NEVER collect all citations and place them at the end of your response
+</CITATION_INSTRUCTIONS>
+"""
 def rag_retrieve(query: str, top_k: int = 100) -> Dict[str, Any]:
     """调用RAG检索API"""
     try:
@@ -94,7 +137,7 @@ def rag_retrieve(query: str, top_k: int = 100) -> Dict[str, Any]:
                 "content": [
                     {
                         "type": "text",
-                        "text": markdown_content
+                        "text": DOCUMENT_CITATION_INSTRUCTIONS + markdown_content
                     }
                 ]
             }
@@ -107,7 +150,7 @@ def rag_retrieve(query: str, top_k: int = 100) -> Dict[str, Any]:
                     }
                 ]
             }
     except requests.exceptions.RequestException as e:
         return {
             "content": [
@@ -179,7 +222,7 @@ def table_rag_retrieve(query: str) -> Dict[str, Any]:
                 "content": [
                     {
                         "type": "text",
-                        "text": markdown_content
+                        "text": TABLE_CITATION_INSTRUCTIONS + markdown_content
                     }
                 ]
             }
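
The two one-line changes above both follow the same pattern: the instruction block is concatenated in front of the retrieved markdown inside the tool result's text item. A minimal sketch of that pattern is shown below. The helper name `build_text_result` is hypothetical and does not appear in the commit, and the two constants are abbreviated stand-ins for the real ones.

```python
from typing import Any, Dict

# Abbreviated stand-ins for the constants added in this commit (assumption:
# the real values are the full instruction blocks shown in the diff).
DOCUMENT_CITATION_INSTRUCTIONS = (
    "<CITATION_INSTRUCTIONS>\n...document and web rules...\n</CITATION_INSTRUCTIONS>\n"
)
TABLE_CITATION_INSTRUCTIONS = (
    "<CITATION_INSTRUCTIONS>\n...table rules...\n</CITATION_INSTRUCTIONS>\n"
)


def build_text_result(markdown_content: str, instructions: str = "") -> Dict[str, Any]:
    """Wrap retrieved markdown in the tool-result shape used in the diff.

    Hypothetical helper: the instructions are prepended, so the model reads
    the citation rules immediately before the knowledge they apply to.
    """
    return {
        "content": [
            {
                "type": "text",
                "text": instructions + markdown_content,
            }
        ]
    }


# Document retrieval and table retrieval differ only in the prefix they pass.
document_result = build_text_result("## Doc A\npages: 0", DOCUMENT_CITATION_INSTRUCTIONS)
table_result = build_text_result("| name | __src |", TABLE_CITATION_INSTRUCTIONS)
```

Because the prefix travels with the tool result, a conversation that never calls a retrieval tool never pays for the detailed rules, which is the token saving the commit message describes.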


@@ -2,83 +2,13 @@
 ## CITATION REQUIREMENTS
-### A. Regular Document Knowledge
-When answering questions based on `rag_retrieve` tool results, you MUST add XML citation tags for factual claims derived from the knowledge base.
-**Format:** `<CITATION file="file_uuid" filename="name.pdf" page=3 />`
-- Use `file` attribute with the UUID from document markers
-- Use `filename` attribute with the actual filename from document markers
-- Use `page` attribute (singular) with the page number
-- `page` MUST be 0-based and must match the `pages:` values shown in the learned knowledge context
-### B. Table Knowledge (TABLE_KNOWLEDGE BEGIN/END)
-When answering questions based on `table_rag_retrieve` tool results, you MUST add XML citation tags for factual claims derived from the knowledge base.
-**!!! CRITICAL RULE: NEVER put <CITATION> on same line as bullet/row !!!**
-**Citations MUST be on separate lines AFTER the complete list/table.**
-**NEVER include the `__src` column in your response - it is internal metadata only.**
-Format: `<CITATION file="file_id" filename="name.xlsx" sheet=1 rows=[2, 4] />`
-- Parse `__src`: `F1S2R5` = file_ref F1, sheet 2, row 5
-- Look up file_id in `file_ref_table`
-- Combine same-sheet rows into one citation: `rows=[2, 4, 6]`
-- **MANDATORY: Create SEPARATE citation for EACH (file, sheet) combination**
-✅ CORRECT (data from sheet 1 AND sheet 2 = 2 citations):
-1. Liam - male
-2. Noah - male
-3. Ethan - male
-4. Mason - male
-5. William - male
-<CITATION file="22920c54-31b2-4bdc-9902-a16fc9fe3a53" filename="students_list.xlsx" sheet=1 rows=[2, 4, 8] />
-<CITATION file="22920c54-31b2-4bdc-9902-a16fc9fe3a53" filename="students_list.xlsx" sheet=2 rows=[10, 15] />
-❌ WRONG (citation on same line):
-1. Liam - male <CITATION ... rows=[2] />
-❌ WRONG (missing sheet 2 citation):
-...only 1 citation when data comes from 2 sheets...
-### C. Web Page Knowledge
-**Format:** `<CITATION url="https://example.com/page" />`
-- Use `url` attribute with the web page URL from the source metadata
-- Do not use `file`, `filename`, or `page` attributes for web sources
-- Web citations should appear immediately after the content they reference
-**!!! CRITICAL PLACEMENT RULES !!!**
-1. **Citations MUST appear IMMEDIATELY AFTER the paragraph or bullet list** that uses the knowledge
-2. **NEVER collect all citations and place them at the end of your response**
-3. **Limit to 1-2 citations per paragraph/bullet list** - combine related facts under one citation
-4. **If your answer uses learned knowledge, you MUST generate at least 1 `<CITATION ... />` in the response**
-5. **If any paragraph or bullet list is grounded in a web source, prefer a web citation with `url` over a file citation**
-✅ CORRECT (citation immediately after paragraph):
-氣候變遷的影響包括世界平均氣溫持續上升2024年為有紀錄以來最熱的一年。<CITATION file="c4ccf054-e4e7-4c4f-8244-5ce9c1ddee51" filename="環境白皮書.pdf" page=0 />
-具體影響包括:
-- 極端高溫事件頻率增加
-- 海洋熱浪
-- 暴雨強度和頻率增強<CITATION file="c4ccf054-e4e7-4c4f-8244-5ce9c1ddee51" filename="環境白皮書.pdf" page=2 />
-✅ CORRECT (web citation):
-MIMURE位于东京港区高轮是一家综合性商业设施。<CITATION url="https://www.newoman.jp/takanawa/mimure/" />
-❌ WRONG (all citations at the end):
-氣候變遷的影響包括...(long response)...
-<CITATION file="..." filename="環境白皮書.pdf" page=0 />
-<CITATION file="..." filename="環境白皮書.pdf" page=1 />
-<CITATION file="..." filename="環境白皮書.pdf" page=2 />
-(13 citations dumped at the end)
-❌ WRONG (web citation with file attributes):
-MIMURE位于东京港区高轮是一家综合性商业设施。<CITATION file="abc123" filename="mimure.html" />
-❌ WRONG (too many citations for short content):
-2024年全球氣溫上升。<CITATION file="..." filename="環境白皮書.pdf" page=0 />
-世界各地發生災害。<CITATION file="..." filename="環境白皮書.pdf" page=0 />
-沙烏地阿拉伯熱浪。<CITATION file="..." filename="環境白皮書.pdf" page=1 />
+When your answer uses learned knowledge, you MUST generate `<CITATION ... />` tags. Follow the specific citation format instructions returned by each tool (`rag_retrieve`, `table_rag_retrieve`).
+### General Placement Rules
+1. Citations MUST appear IMMEDIATELY AFTER the paragraph or bullet list that uses the knowledge
+2. NEVER collect all citations and place them at the end of your response
+3. Limit to 1-2 citations per paragraph/bullet list - combine related facts under one citation
+4. If your answer uses learned knowledge, you MUST generate at least 1 `<CITATION ... />` in the response
 ### Current Working Directory
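
The table rules that this change moves into `TABLE_CITATION_INSTRUCTIONS` ask the model to parse each `__src` marker, look up the file in `file_ref_table`, and emit one citation per (file, sheet) pair with the rows combined. The sketch below illustrates that rule in code so the expected output is easy to check by hand. It is not part of the commit: the function names are hypothetical, and the shape assumed for `file_ref_table` (a mapping from file_ref to a dict with `file_id` and `filename` keys) is a guess, since the diff does not show it.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# Matches markers such as "F1S2R5": file_ref F1, sheet 2, row 5.
SRC_PATTERN = re.compile(r"^(F\d+)S(\d+)R(\d+)$")


def parse_src(src: str) -> Tuple[str, int, int]:
    """Split a `__src` marker into (file_ref, sheet, row)."""
    match = SRC_PATTERN.match(src.strip())
    if match is None:
        raise ValueError(f"unrecognized __src marker: {src!r}")
    file_ref, sheet, row = match.groups()
    return file_ref, int(sheet), int(row)


def build_table_citations(
    src_markers: Iterable[str],
    file_ref_table: Dict[str, Dict[str, str]],  # assumed shape, see lead-in
) -> List[str]:
    """Return one <CITATION> tag per (file, sheet), with rows combined."""
    rows_by_sheet: Dict[Tuple[str, int], Set[int]] = defaultdict(set)
    for marker in src_markers:
        file_ref, sheet, row = parse_src(marker)
        rows_by_sheet[(file_ref, sheet)].add(row)

    citations = []
    for (file_ref, sheet), rows in sorted(rows_by_sheet.items()):
        entry = file_ref_table[file_ref]
        row_list = ", ".join(str(r) for r in sorted(rows))
        citations.append(
            f'<CITATION file="{entry["file_id"]}" filename="{entry["filename"]}" '
            f"sheet={sheet} rows=[{row_list}] />"
        )
    return citations
```

Fed the markers `F1S1R2`, `F1S1R4`, `F1S1R8`, `F1S2R10`, and `F1S2R15` for `students_list.xlsx`, this produces two tags, one for sheet 1 and one for sheet 2, which is the shape of the "CORRECT" example removed from the system prompt.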