qwen_agent/skills/support/kfs-answer/SKILL.md
2026-05-26 17:43:12 +08:00


name: kfs-answer
description: Primary skill for answering ALL questions about the datasets knowledge base. Search files, run queries (SQL / markdown), and return answers with citations. MUST be used first for any data-related question.
category: Data & Retrieval

kfs-answer

Answer ALL questions about the datasets knowledge base using this skill's scripts. This is the primary and mandatory tool for any question involving uploaded data files. Do NOT explore the filesystem, write Python code, or use other tools to access dataset content — all data access goes through the scripts below.

Inputs

  • {user_question} — the user's question
  • {chat_history} — recent conversation context (may be empty)

Scripts are in {SKILL_DIR}/scripts/.

The scripts auto-discover datasets from ./datasets/ subdirectories; the agent does NOT need to know or pass dataset IDs.

Scripts

  • python3 {SKILL_DIR}/scripts/search.py <query> <kw1> <kw2> ... — scan knowledge files, return RECOMMENDED file:sheet pairs with compact summaries (source name, L0, L1, per-sheet description with fallback).

  • python3 {SKILL_DIR}/scripts/detail.py <file_id1:sheet_id1>,<file_id2:sheet_id2> — return full schema (columns with types/stats) + sample data

  • python3 {SKILL_DIR}/scripts/query.py <file_id1:sheet_id1>,... <question> <kw1> <kw2> ... — budget-aware auto query (db: keyword SQL, markdown: section match)

  • python3 {SKILL_DIR}/scripts/query_db.py <db_path> <SQL> [--offset N] — execute custom SQL with auto-pagination.

    query_db.py output structure: TSV header + data rows + status line at the end.

    Status line — three cases:

    1. [RESULT: N/N rows returned | COMPLETE] All data fully returned. Proceed to answer.

    2. [RESULT: K/total returned | this batch: rows X-Y (offset A-B) | PARTIAL — call again with --offset=M] Output size limit reached; more data remaining.

      • K/total — cumulative rows returned so far / total matching rows
      • this batch: rows X-Y — 1-indexed row range this call returned
      • Re-invoke with SAME <db_path> and <SQL>, adding --offset M as CLI arg. Repeat until COMPLETE.
      • If total is very large (1000+), consider reducing SELECT columns or adding WHERE filters instead of paginating.
    3. Zero rows returned. Two variants:

      • [RESULT: 0 rows | EMPTY] The query matched no rows.
      • [RESULT: 0 rows | offset N exceeds total M | call again with --offset=0] The offset is out of range.

    Pagination rules:

    • --offset is a COMMAND-LINE argument, NOT a SQL clause. Do NOT write OFFSET N in SQL.
    • Do NOT use SQL LIMIT/OFFSET to manually control output size — pagination handles it automatically.
    • You MAY use SQL LIMIT when the question genuinely requires it (e.g. "top 10 by revenue").
    • Keep the SQL string character-for-character IDENTICAL across pagination calls.
  • python3 {SKILL_DIR}/scripts/merge_citations.py — merge accumulated citations from query.py/query_db.py into final <CITATION> tags. MUST call once before composing answer (Step 4), regardless of which query path was used.
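For reference, the query_db.py status line described above could be classified with a few regular expressions. This is a minimal sketch, not the script's own logic: the function name `parse_status` and the returned dictionary keys are hypothetical, and the patterns assume the real output matches the three documented cases character for character.

```python
import re

# Patterns mirror the three documented status-line cases.
# Exact spacing and wording of the real output are assumptions.
_COMPLETE = re.compile(r"\[RESULT: (\d+)/(\d+) rows returned \| COMPLETE\]")
_PARTIAL = re.compile(
    r"\[RESULT: (\d+)/(\d+) returned \| this batch: rows (\d+)-(\d+) "
    r"\(offset (\d+)-(\d+)\) \| PARTIAL .*?--offset=(\d+)\]"
)
_EMPTY = re.compile(r"\[RESULT: 0 rows \| EMPTY\]")
_OUT_OF_RANGE = re.compile(
    r"\[RESULT: 0 rows \| offset (\d+) exceeds total (\d+) "
    r"\| call again with --offset=0\]"
)


def parse_status(line: str) -> dict:
    """Classify one status line and extract the counters it carries."""
    if m := _COMPLETE.search(line):
        return {"state": "COMPLETE", "returned": int(m[1]), "total": int(m[2])}
    if m := _PARTIAL.search(line):
        return {
            "state": "PARTIAL",
            "returned": int(m[1]),
            "total": int(m[2]),
            "next_offset": int(m[7]),  # value to pass as --offset next time
        }
    if _EMPTY.search(line):
        return {"state": "EMPTY", "returned": 0}
    if m := _OUT_OF_RANGE.search(line):
        return {"state": "OUT_OF_RANGE", "total": int(m[2]), "next_offset": 0}
    return {"state": "UNKNOWN"}
```

Only the PARTIAL and OUT_OF_RANGE states call for another invocation; COMPLETE and EMPTY mean the query is finished.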

Note: file:sheet pairs are comma-separated strings. Keywords are SEPARATE positional arguments — one keyword per arg, placed after the fixed args.
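The argument layout in the note above can be illustrated by building the argument vector programmatically. This is a sketch under assumptions: `build_query_argv` is a hypothetical helper, and the skill directory and file:sheet IDs in the usage lines are made up.

```python
import shlex


def build_query_argv(skill_dir, pairs, question, keywords):
    """Build the argv for query.py: pairs joined by commas, one arg per keyword."""
    joined_pairs = ",".join(pairs)  # file:sheet pairs form ONE comma-separated arg
    return [
        "python3",
        f"{skill_dir}/scripts/query.py",
        joined_pairs,
        question,       # the question is a single positional argument
        *keywords,      # each keyword becomes its own positional argument
    ]


# Hypothetical values, for illustration only.
argv = build_query_argv(
    "/path/to/skill",
    ["file_a:sheet_001", "file_b:sheet_001"],
    "overdue deliveries",
    ["delivery", "overdue"],
)
print(shlex.join(argv))  # shell-quoted form of the same command
```

Passing a list rather than a single string keeps multi-word questions intact as one argument and avoids accidentally merging keywords.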

Protocol

Step 1 — search

Consider chat_history to understand the full context, and rewrite user_question as a standalone query if it depends on that context. Extract keywords from user_question (in the question's language). Then:

Bash: python3 {SKILL_DIR}/scripts/search.py "<rewritten_question>" <kw1> <kw2> ...

Example: python3 {SKILL_DIR}/scripts/search.py "delivery report" delivery report overdue

If output shows NO_MATCH, answer: "The dataset does not contain data relevant to this question."

Step 2 — query

From search output, pick ONLY the file_id:sheet_id pairs relevant to the question (often just 1 file).

Before calling query.py, classify your keywords against the search output (sheet names + L0 + L1 + per-sheet description):

  • Table-level: the keyword appears in the sheet name or L0 description → it describes the file/sheet scope, not individual rows. Do NOT pass it as a row-level filter. Example: the question asks about "福井県のBCP企業" ("BCP companies in Fukui Prefecture") → "福井県" ("Fukui Prefecture") is the sheet name, so all rows belong to Fukui. Do not use it as a WHERE keyword.
  • Column-level: the keyword matches a concept mentioned in L0 as a data dimension → it determines which columns to look at and is not a WHERE filter. Example: L0 says "エネルギー・たんぱく質・脂質等68項目" ("68 items including energy, protein, and fat") → "エネルギー" ("energy") is a column concept, not a row filter.
  • Row-level: the keyword refers to a specific entity/item not mentioned in sheet names or L0 → use it as a query.py keyword for WHERE filtering. Example: "アーモンド" ("almond") is a specific food item that appears in neither the sheet name nor L0 → valid row-level keyword.

Only pass row-level keywords to query.py:

Bash: python3 {SKILL_DIR}/scripts/query.py "<recommended_pairs>" "<question>" <row_kw1> <row_kw2> ...
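The keyword classification above can be approximated by a simple containment check. The sketch below assumes plain substring matching is good enough, which will not hold for synonyms or paraphrases; `row_level_keywords` is a hypothetical helper, and the sample sheet name and L0 text are combined from the examples above purely for illustration.

```python
def row_level_keywords(keywords, sheet_name, l0_description):
    """Keep only keywords that appear in neither the sheet name nor L0.

    Keywords found in the sheet name or L0 are table-level or column-level
    by the rules above, so they are dropped from the filter list.
    """
    scope_text = f"{sheet_name}\n{l0_description}"
    return [kw for kw in keywords if kw not in scope_text]


# Illustrative inputs assembled from the examples in this section.
kept = row_level_keywords(
    ["福井県", "エネルギー", "アーモンド"],
    sheet_name="福井県",
    l0_description="エネルギー・たんぱく質・脂質等68項目",
)
print(kept)
```

Only the surviving keywords would be passed to query.py as positional arguments; the dropped ones still guide which sheet and columns to read.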

This handles ~80% of questions directly. Check the results:

  • Sufficient (no [BUDGET] tag, or truncation is acceptable) → go to Step 4 (answer). Done.
  • Insufficient ([BUDGET] shows missing rows/columns critical to the question) → go to Step 3. Discard query.py results completely — query_db.py uses different SQL and ordering, so do NOT use --offset to "continue" from query.py. Always start query_db.py from offset=0.
  • Suspiciously few (≤3 rows returned, but the question asks for "最初/一覧/全部/比較" ("first / list / all / compare") or the total row count is much larger) → results are likely incomplete. Remove the most restrictive keyword and re-run query.py, or use empty keywords to get a broader view. If still unclear, go to Step 3.
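The three outcomes above amount to a small decision function. This is only a sketch: `next_step` and its return labels are hypothetical, whether a [BUDGET] gap is critical remains a judgment the agent makes and passes in, and the "much larger" threshold of 10x is an arbitrary assumption.

```python
# Listing-style cues quoted in the "suspiciously few" rule above.
LISTING_HINTS = ("最初", "一覧", "全部", "比較")


def next_step(question, rows_returned, total_rows, budget_gap_is_critical):
    """Return 'step3', 'broaden', or 'answer' for a query.py result."""
    if budget_gap_is_critical:
        return "step3"  # discard query.py results, restart with query_db.py
    wants_many = any(hint in question for hint in LISTING_HINTS)
    # Assumed threshold: total is "much larger" when it exceeds 10x the rows seen.
    much_larger = total_rows is not None and total_rows > 10 * max(rows_returned, 1)
    if rows_returned <= 3 and (wants_many or much_larger):
        return "broaden"  # drop the most restrictive keyword and re-run
    return "answer"
```

The "broaden" branch corresponds to re-running query.py with fewer keywords before escalating to Step 3.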

Sheet selection from multi-sheet files

When a RECOMMENDED file has multiple sheets (e.g., sheet_001/sheet_002, 7-2-2図①/7-2-2図②, 基本票/詳細票), the technical sheet names may not convey what each sheet contains. The search output includes a per-sheet description line for each sheet. Use it to select the correct sheet:

  - 7-2-2図①[db,30]: ①女性14歳以上の年齢層別女性人口の推移...
  - 7-2-2図②[db,30]: ②男性14歳以上の年齢層別男性人口の推移...

Do NOT infer sheet identity from:

  • data value heuristics (e.g., "larger value = female")
  • technical sheet id / name alone (e.g., sheet_001, 7-2-2図①)

If the per-sheet description is missing, short, or ambiguous, call detail.py to get the full sheet description before issuing a WHERE/filter decision.
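The check for missing or short descriptions can be sketched as a parser over the per-sheet lines shown above. The line format is inferred from the two sample lines only, the meaning of the number inside the brackets is not stated in this document, and both `sheets_needing_detail` and the 10-character threshold are assumptions.

```python
import re

# Matches lines such as "  - 7-2-2図①[db,30]: description text".
SHEET_LINE = re.compile(
    r"^\s*-\s*(?P<name>.+?)\[(?P<kind>\w+),(?P<n>\d+)\]:\s*(?P<desc>.*)$"
)


def sheets_needing_detail(search_output, min_chars=10):
    """Return sheet names whose description is missing or very short."""
    unclear = []
    for line in search_output.splitlines():
        m = SHEET_LINE.match(line)
        if m and len(m["desc"].strip()) < min_chars:
            unclear.append(m["name"])
    return unclear
```

Any sheet this returns is a candidate for a detail.py call before deciding on a filter. An ambiguous but long description still needs human-style judgment that a length check cannot provide.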

Step 3 — detail + refine (only if Step 2 insufficient)

Call detail.py to understand the full schema, then write precise SQL:

Bash: python3 {SKILL_DIR}/scripts/detail.py "<recommended_pairs>"

Read the column names and types from detail output. Then write a targeted SQL query:

Bash: python3 {SKILL_DIR}/scripts/query_db.py "<db_path>" "SELECT col1,col2 FROM table WHERE ..."

CRITICAL — Hidden __src column: Every db table has an __src column that is NOT shown in detail.py schema output (by design — the parser hides it from the human-readable schema). You MUST include __src as the FIRST column in every SELECT on a db table, regardless of what detail.py reports. Without it, per-row citation is impossible.

Example: SELECT __src, col_a, col_b FROM sheet_001 WHERE col_a = 'x'
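A small guard can make the __src requirement harder to forget. The sketch below is an assumption-laden helper, not part of the skill's scripts: `ensure_src_first` is a hypothetical name, it handles only plain SELECT statements (optionally with DISTINCT), and it does not remove a __src that already appears later in the column list.

```python
import re

_SELECT = re.compile(r"^\s*SELECT\s+(DISTINCT\s+)?", re.IGNORECASE)
_SRC_FIRST = re.compile(r"^\s*SELECT\s+(DISTINCT\s+)?__src\b", re.IGNORECASE)


def ensure_src_first(sql):
    """Return sql with __src as the first selected column."""
    if _SRC_FIRST.match(sql):
        return sql  # already compliant; leave the string untouched
    m = _SELECT.match(sql)
    if m is None:
        raise ValueError("expected a SELECT statement")
    # Insert right after SELECT (or SELECT DISTINCT).
    return sql[: m.end()] + "__src, " + sql[m.end():]
```

Apply such a rewrite once, before the first query_db.py call, so that the SQL string stays identical across pagination calls.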

The db_path is shown in query.py output. Pagination is automatic — see Scripts section above for query_db.py output protocol (COMPLETE / PARTIAL / EMPTY status line). When PARTIAL, follow the --offset=N instruction in the status line; keep calling until COMPLETE.
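The "keep calling until COMPLETE" loop can be sketched with the script call abstracted behind a callable. Assumptions: `fetch_all` and `run` are hypothetical names, `run(offset)` returns the raw text of one query_db.py call made with the same db_path and SQL, and every batch repeats the TSV header before its data rows.

```python
import re
from typing import Callable, List

# A PARTIAL status line carries the next offset as "--offset=M".
_NEXT_OFFSET = re.compile(r"PARTIAL.*--offset=(\d+)\]")


def fetch_all(run: Callable[[int], str], max_calls: int = 50) -> List[str]:
    """Call run(offset) repeatedly and collect TSV lines until not PARTIAL."""
    header = None
    rows: List[str] = []
    offset = 0  # always start from 0, never continue from query.py output
    for _ in range(max_calls):
        lines = run(offset).strip().splitlines()
        if not lines:
            break
        *body, status = lines  # the status line is the last line of the output
        if body:
            header = header or body[0]
            rows.extend(body[1:])
        m = _NEXT_OFFSET.search(status)
        if m is None:  # COMPLETE, EMPTY, or anything unexpected: stop
            break
        offset = int(m[1])
    return ([header] if header else []) + rows
```

In practice `run` would wrap a subprocess call that passes `--offset` as a command-line argument. The `max_calls` cap is a safety net for very large result sets, where narrowing the SELECT or adding WHERE filters is the better option.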

Step 4 — output

Compose the final answer with source attribution, observing the following:

  • Answer in the same language as the question
  • When data was truncated, state total count and what was omitted (e.g., "Showing 20 of 32 results. Narrow your query for complete data.")
  • Keep response concise — the output will be injected into another LLM's context with a ~3000 character budget

Citations (MANDATORY):

ALWAYS call merge_citations.py before composing your answer, regardless of whether you used query.py or query_db.py:

Bash: python3 {SKILL_DIR}/scripts/merge_citations.py

This script reads all citation data accumulated by query.py / query_db.py, merges rows by (file, sheet), and outputs ready-to-use tags like:

[CITATIONS]
<CITATION file="a1b2c3d4-e5f6-7890-abcd-ef1234567890" filename="商品リスト.xlsx" sheet="1" rows="[2, 3, 4, 5]" />

Your job:

  1. Copy each CITATION tag EXACTLY as output by the script — character for character. Do not modify any attributes. Rows are Excel row numbers (row 1 = header, data starts at row 2). Do NOT renumber, do NOT use range syntax. Do not invent tags.

  2. Place each tag after the paragraph / list / table that uses its data. If only one tag, place it after the main content block. If multiple tags from different files, place each near the content that references that file.

  3. NEVER put a tag on the same line as a list bullet or table row. NEVER write __src=.
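The three citation rules above lend themselves to a mechanical self-check before the answer is returned. This is a sketch only: `citation_problems` is a hypothetical helper, the bullet and table-row markers it recognizes are an assumed set, and it checks form, not whether a tag sits near the right content.

```python
import re

_TAG = re.compile(r"<CITATION\b[^>]*/>")
# Assumed markers for list bullets, numbered items, and table rows.
_BULLET_OR_ROW = re.compile(r"^\s*(?:[-*•]|\d+\.|\|)")


def citation_problems(answer, script_tags):
    """Return a list of rule violations found in the drafted answer."""
    problems = []
    for tag in script_tags:
        if tag not in answer:  # every tag must be copied character for character
            problems.append(f"missing or altered tag: {tag}")
    for found in _TAG.findall(answer):
        if found not in script_tags:  # tags must never be invented
            problems.append(f"tag not produced by the script: {found}")
    for lineno, line in enumerate(answer.splitlines(), start=1):
        if "<CITATION" in line and _BULLET_OR_ROW.match(line):
            problems.append(f"line {lineno}: tag shares a line with a bullet or table row")
        if "__src=" in line:
            problems.append(f"line {lineno}: raw __src= appears in the answer")
    return problems
```

An empty list means the draft passes these formal checks; anything else should be fixed before the answer is emitted.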

Calculation audit (mandatory when the answer involves arithmetic, aggregation, ratios, or percentages):

Before writing the final answer, explicitly verify that every operand in your formula aligns with the question's scope. Output a short audit block:

[AUDIT]
Question scope: <entity / range / time period the question specifies>
Formula: <numerator> / <denominator> = <result>  (or SUM, AVG, etc.)
Operand check:
  - <operand1>: <value> — source: <row/column description> — matches question scope? YES/NO
  - <operand2>: <value> — source: <row/column description> — matches question scope? YES/NO
Verdict: PASS — formula matches question semantics
         OR WARNING — <operand X> scope mismatch: <explanation>. Re-querying.

If the verdict is WARNING, do NOT output the answer. Instead, re-query with corrected keywords to find the operand that matches the question's scope. Only output the answer after the audit passes.

Note: Pre-computed percentages or "share" values found in data remarks may use a different denominator than what the question asks. Always verify — never adopt them without confirming the denominator matches the question.
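The operand check can be made concrete with a ratio whose denominator is easy to get wrong. Everything in this sketch is hypothetical: the `Operand` class, the `audit_ratio` function, the scope labels, and the numbers are illustrative only and do not come from any dataset.

```python
from dataclasses import dataclass


@dataclass
class Operand:
    name: str
    value: float
    scope: str  # what the source row/column actually covers


def audit_ratio(numerator, denominator, wanted_num_scope, wanted_den_scope):
    """Return (verdict, mismatched operand names, result or None)."""
    checks = ((numerator, wanted_num_scope), (denominator, wanted_den_scope))
    mismatched = [op.name for op, wanted in checks if op.scope != wanted]
    if mismatched:
        return "WARNING", mismatched, None  # re-query before answering
    return "PASS", [], numerator.value / denominator.value


# Hypothetical question: what share of all women are aged 65 or over?
women_65 = Operand("women aged 65+", 300, "women 65+")
everyone = Operand("total population", 2000, "all persons")
all_women = Operand("total women", 1200, "all women")

first_try = audit_ratio(women_65, everyone, "women 65+", "all women")
second_try = audit_ratio(women_65, all_women, "women 65+", "all women")
print(first_try[0], second_try[0], second_try[2])
```

The first attempt uses a denominator with the wrong scope, which is the same trap as adopting a pre-computed share from a data remark; only the second attempt yields a usable result.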

Final output format: Write your answer body with <CITATION> tags from merge_citations.py placed near the relevant content. Do not add anything else.

Rules

  1. Query first. Always try query.py before detail.py. Skip detail if query results are sufficient.
  2. Minimize turns. Typical: 2 turns (search + query). Max: 3-4 turns (+ detail + query_db for complex cases).
  3. No exploratory reads. Do not ls, Glob, or Read files. All info comes from the scripts.
  4. Verify before answering. If query.py returns very few rows (≤3) for a listing/ranking question, do not assume the result is complete. Check if a table-level keyword was accidentally used as a row filter.
  5. Fallback flexibility. query_db.py with custom SQL handles most needs including large result sets (via auto-pagination). Do NOT write inline Python (sqlite3.connect) to query knowledge.db — it bypasses query_db.py's auto-fix protections (fullwidth comma, identifier quoting, __src replacement) and causes debug loops. Accuracy over speed.