qwen_agent/skills/support/kfs-answer/SKILL.md

---
name: kfs-answer
description: Primary skill for answering ALL questions about the datasets knowledge base. Search files, run queries (SQL / markdown), and return answers with citations. MUST be used first for any data-related question.
category: Data & Retrieval
---

# kfs-answer

Answer ALL questions about the datasets knowledge base using this skill's scripts. This is the **primary and mandatory** tool for any question involving uploaded data files. Do NOT explore the filesystem, write Python code, or use other tools to access dataset content — all data access goes through the scripts below.

## Inputs

- `{user_question}` — the user's question
- `{chat_history}` — recent conversation context (may be empty)

Scripts are in `{SKILL_DIR}/scripts/`.

Datasets are auto-discovered by the scripts from `./datasets/` subdirectories. The agent does NOT need to know or pass dataset IDs.

## Scripts

- `python3 {SKILL_DIR}/scripts/search.py <query> <kw1> <kw2> ...` — scan knowledge files, return RECOMMENDED file:sheet pairs with compact summaries (source name, L0, L1, per-sheet description with fallback).
- `python3 {SKILL_DIR}/scripts/detail.py <file_id1:sheet_id1>,<file_id2:sheet_id2>` — return full schema (columns with types/stats) + sample data
- `python3 {SKILL_DIR}/scripts/query.py <file_id1:sheet_id1>,... <question> <kw1> <kw2> ...` — budget-aware auto query (db: keyword SQL, markdown: section match)
- `python3 {SKILL_DIR}/scripts/query_db.py <db_path> <SQL> [--offset N]` — execute custom SQL with auto-pagination.

  **query_db.py output structure:** TSV header + data rows + status line at the end.

  **Status line — three cases:**

  1. `[RESULT: N/N rows returned | COMPLETE]`
     All data fully returned. Proceed to answer.

  2. `[RESULT: K/total returned | this batch: rows X-Y (offset A-B) | PARTIAL — call again with --offset=M]`
     Output size limit reached; more data remaining.
     - `K/total` — cumulative rows returned so far / total matching rows
     - `this batch: rows X-Y` — 1-indexed row range this call returned
     - Re-invoke with SAME `<db_path>` and `<SQL>`, adding `--offset M` as CLI arg. Repeat until COMPLETE.
     - If `total` is very large (1000+), consider reducing SELECT columns or adding WHERE filters instead of paginating.

  3. `[RESULT: 0 rows | EMPTY]`
     The query matched no rows.
     Variant: `[RESULT: 0 rows | offset N exceeds total M | call again with --offset=0]` means the offset is out of range; restart from offset 0.

  **Pagination rules:**
  - `--offset` is a COMMAND-LINE argument, NOT a SQL clause. Do NOT write `OFFSET N` in SQL.
  - Do NOT use SQL `LIMIT`/`OFFSET` to manually control output size — pagination handles it automatically.
  - You MAY use SQL `LIMIT` when the question genuinely requires it (e.g. "top 10 by revenue").
  - Keep the SQL string character-for-character IDENTICAL across pagination calls.

- `python3 {SKILL_DIR}/scripts/merge_citations.py` — merge accumulated citations from query.py/query_db.py into final `<CITATION>` tags. **MUST call once before composing answer (Step 4), regardless of which query path was used.**

Note: file:sheet pairs are comma-separated strings. Keywords are SEPARATE positional arguments — one keyword per arg, placed after the fixed args.
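
The status-line cases above can be sketched as a small parser. This is for illustration of the decision logic only (the agent reads the status line directly and does not run this); the regexes assume the real output matches the shapes shown in this document, and `parse_status` / `last_status` are hypothetical names.

```python
import re

# Regexes follow the status-line shapes listed above; exact spacing in real
# query_db.py output is assumed to match this document.
COMPLETE = re.compile(r"\[RESULT: (\d+)/(\d+) rows returned \| COMPLETE\]")
PARTIAL = re.compile(r"\[RESULT: (\d+)/(\d+) returned \|.*PARTIAL.*--offset=(\d+)\]")
OUT_OF_RANGE = re.compile(r"\[RESULT: 0 rows \| offset (\d+) exceeds total (\d+)")
EMPTY = re.compile(r"\[RESULT: 0 rows \| EMPTY\]")


def parse_status(line):
    """Return (state, next_offset) for one query_db.py status line."""
    line = line.strip()
    if COMPLETE.search(line):
        return ("COMPLETE", None)      # all rows returned, proceed to answer
    m = PARTIAL.search(line)
    if m:
        return ("PARTIAL", int(m.group(3)))  # re-invoke with this --offset
    if OUT_OF_RANGE.search(line):
        return ("OUT_OF_RANGE", 0)     # restart from offset 0
    if EMPTY.search(line):
        return ("EMPTY", None)         # query matched no rows
    return ("UNKNOWN", None)


def last_status(output):
    """The status line is the last non-empty line of the script output."""
    lines = [l for l in output.splitlines() if l.strip()]
    return parse_status(lines[-1]) if lines else ("UNKNOWN", None)
```

Only the `PARTIAL` and `OUT_OF_RANGE` states carry an offset; every other state means no further call with the same SQL is needed.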

## Protocol

### Step 1 — search

Use chat_history to resolve the full context and rewrite the question as a standalone query (`<rewritten_question>`). Extract keywords from user_question, in the question's language. Then:

```
Bash: python3 {SKILL_DIR}/scripts/search.py "<rewritten_question>" <kw1> <kw2> ...
```

Example: `python3 {SKILL_DIR}/scripts/search.py "delivery report" delivery report overdue`

If output shows `NO_MATCH`, answer: "The dataset does not contain data relevant to this question."
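
The argument layout of the search call (one quoted question, then one keyword per positional argument) can be sketched as follows. This is illustrative only; `build_search_argv` and the `/skills/kfs-answer` path are hypothetical stand-ins for `{SKILL_DIR}`.

```python
import shlex


def build_search_argv(skill_dir, rewritten_question, keywords):
    """Question is ONE argument; each keyword is its own positional argument."""
    return ["python3", f"{skill_dir}/scripts/search.py", rewritten_question, *keywords]


argv = build_search_argv(
    "/skills/kfs-answer",            # hypothetical value of {SKILL_DIR}
    "delivery report",
    ["delivery", "report", "overdue"],
)
# shlex.join quotes only the argument that contains a space.
print(shlex.join(argv))
# python3 /skills/kfs-answer/scripts/search.py 'delivery report' delivery report overdue
```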

### Step 2 — query

From search output, pick ONLY the file_id:sheet_id pairs relevant to the question (often just 1 file).

**Before calling query.py, classify your keywords against the search output (sheet names + L0 + L1 + per-sheet description):**
- **Table-level**: keyword appears in sheet name or L0 description → it describes the file/sheet scope, not individual rows. Do NOT pass as row-level filter.
  Example: the question asks about "福井県のBCP企業" (BCP companies in Fukui Prefecture) → "福井県" is the sheet name (all rows belong to Fukui). Do not use it as a WHERE keyword.
- **Column-level**: keyword matches a concept mentioned in L0 as a data dimension → determines which columns to look at, not a WHERE filter.
  Example: L0 says "エネルギー・たんぱく質・脂質等68項目" (68 items including energy, protein, and fat) → "エネルギー" (energy) is a column concept, not a row filter.
- **Row-level**: keyword refers to a specific entity/item not mentioned in sheet names or L0 → use as query.py keywords for WHERE filtering.
  Example: "アーモンド" (almond) is a specific food item, not in sheet name or L0 → valid row-level keyword.
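
As a rough sketch of the classification above: a keyword counts as row-level only if it appears in no sheet name and not in the L0 text. Plain substring matching is a simplifying assumption (real classification needs judgment about meaning), and `row_level_keywords` is a hypothetical helper, shown for illustration only.

```python
def row_level_keywords(keywords, sheet_names, l0_text):
    """Keep only keywords found in no sheet name and not in the L0 description."""
    def in_scope_text(kw):
        # Table-level or column-level: mentioned in a sheet name or in L0.
        return any(kw in name for name in sheet_names) or kw in l0_text

    return [kw for kw in keywords if not in_scope_text(kw)]


# Values taken from the examples above, combined into one hypothetical sheet.
kept = row_level_keywords(
    ["福井県", "エネルギー", "アーモンド"],
    sheet_names=["福井県"],
    l0_text="エネルギー・たんぱく質・脂質等68項目",
)
print(kept)  # ['アーモンド']
```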

Only pass **row-level keywords** to query.py:

```
Bash: python3 {SKILL_DIR}/scripts/query.py "<recommended_pairs>" "<question>" <row_kw1> <row_kw2> ...
```

This handles ~80% of questions directly. Check the results:
- **Sufficient** (no `[BUDGET]` tag, or truncation is acceptable) → go to Step 4 (answer). Done.
- **Insufficient** (`[BUDGET]` shows missing rows/columns critical to the question) → go to Step 3. **Discard query.py results completely** — query_db.py uses different SQL and ordering, so do NOT use `--offset` to "continue" from query.py. Always start query_db.py from offset=0.
- **Suspiciously few** (≤3 rows returned, but the question asks for "最初/一覧/全部/比較" (first / list / all / compare) or the total row count is much larger) → results are likely incomplete. Remove the most restrictive keyword and re-run query.py, or pass no keywords to get a broader view. If still unclear, go to Step 3.

### Sheet selection from multi-sheet files

When a RECOMMENDED file has multiple sheets (e.g., `sheet_001`/`sheet_002`, `7-2-2図①`/`7-2-2図②`, `基本票`/`詳細票`), the technical sheet names may not convey semantics. The search output includes a per-sheet description line for each sheet. Use it to select the correct sheet:

```
  - 7-2-2図①[db,30]: ①女性：14歳以上の年齢層別女性人口の推移...
  - 7-2-2図②[db,30]: ②男性：14歳以上の年齢層別男性人口の推移...
```

**Do NOT infer sheet identity from**:
- data value heuristics (e.g., "larger value = female")
- technical sheet id / name alone (e.g., `sheet_001`, `7-2-2図①`)

If the per-sheet description is missing, short, or ambiguous, call `detail.py` to get the full sheet description before deciding which sheet to query or which WHERE filter to apply.

### Step 3 — detail + refine (only if Step 2 insufficient)

Call detail.py to understand the full schema, then write precise SQL:

```
Bash: python3 {SKILL_DIR}/scripts/detail.py "<recommended_pairs>"
```

Read the column names and types from detail output. Then write a targeted SQL query:

```
Bash: python3 {SKILL_DIR}/scripts/query_db.py "<db_path>" "SELECT col1,col2 FROM table WHERE ..."
```

**CRITICAL — Hidden `__src` column:**
Every db table has an `__src` column that is NOT shown in detail.py schema output (by design — the parser hides it from the human-readable schema). You MUST include `__src` as the FIRST column in every SELECT on a db table, regardless of what detail.py reports. Without it, per-row citation is impossible.

Example: `SELECT __src, col_a, col_b FROM sheet_001 WHERE col_a = 'x'`

The db_path is shown in query.py output. Pagination is automatic — see Scripts section above for query_db.py output protocol (COMPLETE / PARTIAL / EMPTY status line). When PARTIAL, follow the `--offset=N` instruction in the status line; keep calling until COMPLETE.
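
The constraints on a query_db.py call (`__src` first, no `OFFSET` in the SQL, offset passed as a CLI argument, SQL unchanged between pages) can be sketched as a guard. This is illustrative only: `build_query_db_argv` is a hypothetical helper, and the `OFFSET` check is a crude word match that would also flag a column literally named `offset`.

```python
import re


def build_query_db_argv(skill_dir, db_path, sql, offset=None):
    """Build the argv for one query_db.py call, enforcing the rules above."""
    if not re.match(r"\s*SELECT\s+__src\b", sql, flags=re.IGNORECASE):
        raise ValueError("__src must be the first column in the SELECT")
    if re.search(r"\bOFFSET\b", sql, flags=re.IGNORECASE):
        raise ValueError("pass the offset via --offset, not inside the SQL")
    argv = ["python3", f"{skill_dir}/scripts/query_db.py", db_path, sql]
    if offset is not None:
        # The SQL string stays identical across pages; only this flag changes.
        argv += ["--offset", str(offset)]
    return argv


sql = "SELECT __src, col_a, col_b FROM sheet_001 WHERE col_a = 'x'"
first_page = build_query_db_argv("/skills/kfs-answer", "knowledge.db", sql)
next_page = build_query_db_argv("/skills/kfs-answer", "knowledge.db", sql, offset=40)
```

Both calls carry the same SQL argument; `next_page` differs only by the trailing `--offset 40`.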

### Step 4 — output

Compose the final answer with its source citations, following these rules:

- Answer in the same language as the question
- When data was truncated, state total count and what was omitted (e.g., "Showing 20 of 32 results. Narrow your query for complete data.")
- Keep response concise — the output will be injected into another LLM's context with a ~3000 character budget

**Citations (MANDATORY):**

**ALWAYS** call `merge_citations.py` before composing your answer, regardless of whether you used query.py or query_db.py:

```
Bash: python3 {SKILL_DIR}/scripts/merge_citations.py
```

This script reads all citation data accumulated by query.py / query_db.py, merges rows by (file, sheet), and outputs ready-to-use tags like:

```
[CITATIONS]
<CITATION file="a1b2c3d4-e5f6-7890-abcd-ef1234567890" filename="商品リスト.xlsx" sheet="1" rows="[2, 3, 4, 5]" />
```

Your job:

1. **Copy each CITATION tag EXACTLY as output by the script — character for character.** Do not modify any attributes. Rows are Excel row numbers (row 1 = header, data starts at row 2). Do NOT renumber, do NOT use range syntax. Do not invent tags.

2. **Place each tag after the paragraph / list / table that uses its data.** If only one tag, place it after the main content block. If multiple tags from different files, place each near the content that references that file.

3. **NEVER** put a tag on the same line as a list bullet or table row. **NEVER** write `__src=`.
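
The three rules above can be expressed as a self-check. This is a sketch for illustration only, assuming each tag is a single-line self-closing `<CITATION ... />` element as in the sample output; `check_citations` is a hypothetical name, and the list/table detection is a simple prefix heuristic.

```python
import re

CITATION_RE = re.compile(r"<CITATION\b[^>]*/>")
# Lines that start a bullet, a numbered item, or a table row.
LIST_OR_TABLE_RE = re.compile(r"^\s*(?:[-*+]\s|\d+[.)]\s|\|)")


def check_citations(script_output, answer):
    """Return a list of rule violations; an empty list means the answer passes."""
    expected = CITATION_RE.findall(script_output)
    problems = []
    for tag in expected:
        if tag not in answer:  # rule 1: copied character for character
            problems.append("missing or altered: " + tag)
    for line in answer.splitlines():
        for tag in CITATION_RE.findall(line):
            if tag not in expected:  # rule 1: no invented tags
                problems.append("not produced by the script: " + tag)
            if LIST_OR_TABLE_RE.match(line):  # rule 3: not on a bullet/table line
                problems.append("tag on a list or table line: " + line.strip())
    if "__src=" in answer:  # rule 3: never expose __src=
        problems.append("answer contains __src=")
    return problems
```

An answer that places the unmodified tag on its own line after the content block yields an empty list.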

**Calculation audit (mandatory when the answer involves arithmetic, aggregation, ratios, or percentages):**

Before writing the final answer, explicitly verify that every operand in your formula aligns with the question's scope. Output a short audit block:

```
[AUDIT]
Question scope: <entity / range / time period the question specifies>
Formula: <numerator> / <denominator> = <result>  (or SUM, AVG, etc.)
Operand check:
  - <operand1>: <value> — source: <row/column description> — matches question scope? YES/NO
  - <operand2>: <value> — source: <row/column description> — matches question scope? YES/NO
Verdict: PASS — formula matches question semantics
         OR WARNING — <operand X> scope mismatch: <explanation>. Re-querying.
```

If the verdict is WARNING, do NOT output the answer. Instead, re-query with corrected keywords to find the operand that matches the question's scope. Only output the answer after the audit passes.

Note: Pre-computed percentages or "share" values found in data remarks may use a different denominator than what the question asks. Always verify — never adopt them without confirming the denominator matches the question.
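
A minimal sketch of the audit verdict, using hypothetical numbers where the first denominator found covers the whole table instead of the region the question asks about. `Operand` and `audit_verdict` are illustrative names, not part of the skill's scripts.

```python
from dataclasses import dataclass


@dataclass
class Operand:
    name: str
    value: float
    source: str
    in_scope: bool  # does this operand match the question's scope?


def audit_verdict(operands):
    """Return ("PASS", []) or ("WARNING", [names of out-of-scope operands])."""
    mismatched = [op.name for op in operands if not op.in_scope]
    return ("WARNING" if mismatched else "PASS", mismatched)


# Hypothetical values: the question asks for a share within one region.
numerator = Operand("region sales", 30.0, "row for the region", True)
wrong_denominator = Operand("all-table total", 200.0, "grand total row", False)
right_denominator = Operand("region total", 120.0, "region subtotal row", True)

print(audit_verdict([numerator, wrong_denominator]))  # ('WARNING', ['all-table total'])
print(audit_verdict([numerator, right_denominator]))  # ('PASS', [])
print(numerator.value / right_denominator.value)      # 0.25
```

The first verdict triggers a re-query; only the second combination may be used in the answer.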

**Final output format:** Write your answer body with `<CITATION>` tags from merge_citations.py placed near the relevant content. Do not add anything else.

## Rules

1. **Query first.** Always try query.py before detail.py. Skip detail.py if the query results are sufficient.
2. **Minimize turns.** Typical: 2 turns (search + query). Max: 3-4 turns (+ detail + query_db for complex cases).
3. **No exploratory reads.** Do not ls, Glob, or Read files. All info comes from the scripts.
4. **Verify before answering.** If query.py returns very few rows (≤3) for a listing/ranking question, do not assume the result is complete. Check if a table-level keyword was accidentally used as a row filter.
5. **Fallback flexibility.** query_db.py with custom SQL handles most needs including large result sets (via auto-pagination). Do NOT write inline Python (`sqlite3.connect`) to query knowledge.db — it bypasses query_db.py's auto-fix protections (fullwidth comma, identifier quoting, __src replacement) and causes debug loops. Accuracy over speed.