---
name: kfs-answer
description: Primary skill for answering ALL questions about the datasets knowledge base. Search files, run queries (SQL / markdown), and return answers with citations. MUST be used first for any data-related question.
category: Data & Retrieval
---

# kfs-answer

Answer ALL questions about the datasets knowledge base using this skill's scripts. This is the **primary and mandatory** tool for any question involving uploaded data files. Do NOT explore the filesystem, write Python code, or use other tools to access dataset content: all data access goes through the scripts below.

## Inputs

- `{user_question}`: the user's question
- `{chat_history}`: recent conversation context (may be empty)

Scripts are in `{SKILL_DIR}/scripts/`. Datasets are auto-discovered by the scripts from `./datasets/` subdirectories, so the agent does NOT need to know or pass dataset IDs.

## Scripts

- `python3 {SKILL_DIR}/scripts/search.py ...`: scan knowledge files, return RECOMMENDED file:sheet pairs with compact summaries (source name, L0, L1, per-sheet description with fallback).
- `python3 {SKILL_DIR}/scripts/detail.py ,`: return full schema (columns with types/stats) + sample data.
- `python3 {SKILL_DIR}/scripts/query.py ,... ...`: budget-aware auto query (db: keyword SQL, markdown: section match).
- `python3 {SKILL_DIR}/scripts/query_db.py [--offset N]`: execute custom SQL with auto-pagination.

  **query_db.py output structure:** TSV header + data rows + status line at the end.

  **Status line, three cases:**

  1. `[RESULT: N/N rows returned | COMPLETE]`
     All data fully returned. Proceed to answer.
  2. `[RESULT: K/total returned | this batch: rows X-Y (offset A-B) | PARTIAL — call again with --offset=M]`
     Output size limit reached; more data remaining.
     - `K/total`: cumulative rows returned so far / total matching rows
     - `this batch: rows X-Y`: 1-indexed row range this call returned
     - Re-invoke with SAME `` and ``, adding `--offset M` as a CLI arg. Repeat until COMPLETE.
     - If `total` is very large (1000+), consider reducing SELECT columns or adding WHERE filters instead of paginating.
  3. `[RESULT: 0 rows | EMPTY]`: the query matched no rows.
     `[RESULT: 0 rows | offset N exceeds total M | call again with --offset=0]`: the offset is out of range.

  **Pagination rules:**

  - `--offset` is a COMMAND-LINE argument, NOT a SQL clause. Do NOT write `OFFSET N` in SQL.
  - Do NOT use SQL `LIMIT`/`OFFSET` to manually control output size; pagination handles it automatically.
  - You MAY use SQL `LIMIT` when the question genuinely requires it (e.g. "top 10 by revenue").
  - Keep the SQL string character-for-character IDENTICAL across pagination calls.
- `python3 {SKILL_DIR}/scripts/merge_citations.py`: merge accumulated citations from query.py/query_db.py into final `` tags. **MUST call once before composing answer (Step 4), regardless of which query path was used.**

Note: file:sheet pairs are comma-separated strings. Keywords are SEPARATE positional arguments (one keyword per arg), placed after the fixed args.

## Protocol

### Step 1: search

Consider chat_history to understand full context. Extract keywords from user_question (in the question's language). Then:

```
Bash: python3 {SKILL_DIR}/scripts/search.py "" ...
```

Example: `python3 {SKILL_DIR}/scripts/search.py "delivery report" delivery report overdue`

If output shows `NO_MATCH`, answer: "The dataset does not contain data relevant to this question."

### Step 2: query

From search output, pick ONLY the file_id:sheet_id pairs relevant to the question (often just 1 file).

**Before calling query.py, classify your keywords against the search output (sheet names + L0 + L1 + per-sheet description):**

- **Table-level**: keyword appears in sheet name or L0 description → it describes the file/sheet scope, not individual rows. Do NOT pass as row-level filter. Example: question asks about "福井県のBCP企業" (BCP companies in Fukui Prefecture) → "福井県" is the sheet name (all rows belong to Fukui). Do not use it as a WHERE keyword.
- **Column-level**: keyword matches a concept mentioned in L0 as a data dimension → determines which columns to look at, not a WHERE filter. Example: L0 says "エネルギー・たんぱく質・脂質等68項目" (68 items including energy, protein, and fat) → "エネルギー" (energy) is a column concept, not a row filter.
- **Row-level**: keyword refers to a specific entity/item not mentioned in sheet names or L0 → use as query.py keywords for WHERE filtering. Example: "アーモンド" (almond) is a specific food item, not in sheet name or L0 → valid row-level keyword.

Only pass **row-level keywords** to query.py:

```
Bash: python3 {SKILL_DIR}/scripts/query.py "" "" ...
```

This handles ~80% of questions directly. Check the results:

- **Sufficient** (no `[BUDGET]` tag, or truncation is acceptable) → go to Step 4 (answer). Done.
- **Insufficient** (`[BUDGET]` shows missing rows/columns critical to the question) → go to Step 3. **Discard query.py results completely**: query_db.py uses different SQL and ordering, so do NOT use `--offset` to "continue" from query.py. Always start query_db.py from offset=0.
- **Suspiciously few** (≤3 rows returned, but the question asks for "最初/一覧/全部/比較", i.e. first / list / all / compare, or the total row count is much larger) → results are likely incomplete. Remove the most restrictive keyword and re-run query.py, or use empty keywords to get a broader view. If still unclear, go to Step 3.

### Sheet selection from multi-sheet files

When a RECOMMENDED file has multiple sheets (e.g., `sheet_001`/`sheet_002`, `7-2-2図①`/`7-2-2図②`, `基本票`/`詳細票`), the technical sheet names may not convey semantics. The search output includes a per-sheet description line for each sheet. Use it to select the correct sheet:

```
- 7-2-2図①[db,30]: ①女性:14歳以上の年齢層別女性人口の推移...
- 7-2-2図②[db,30]: ②男性:14歳以上の年齢層別男性人口の推移...
```

**Do NOT infer sheet identity from**:

- data value heuristics (e.g., "larger value = female")
- technical sheet id / name alone (e.g., `sheet_001`, `7-2-2図①`)

If the per-sheet description is missing, short, or ambiguous, call `detail.py` to get the full sheet description before making a WHERE/filter decision.

### Step 3: detail + refine (only if Step 2 insufficient)

Call detail.py to understand the full schema, then write precise SQL:

```
Bash: python3 {SKILL_DIR}/scripts/detail.py ""
```

Read the column names and types from detail output. Then write a targeted SQL query:

```
Bash: python3 {SKILL_DIR}/scripts/query_db.py "" "SELECT __src, col1, col2 FROM table WHERE ..."
```

**CRITICAL: hidden `__src` column.** Every db table has an `__src` column that is NOT shown in detail.py schema output (by design: the parser hides it from the human-readable schema). You MUST include `__src` as the FIRST column in every SELECT on a db table, regardless of what detail.py reports. Without it, per-row citation is impossible. Example: `SELECT __src, col_a, col_b FROM sheet_001 WHERE col_a = 'x'`

The db_path is shown in query.py output. Pagination is automatic; see the Scripts section above for the query_db.py output protocol (COMPLETE / PARTIAL / EMPTY status line). When PARTIAL, follow the `--offset=N` instruction in the status line; keep calling until COMPLETE.

### Step 4: output

Compose final answer, then append source attribution on the last line:

- Answer in the same language as the question
- When data was truncated, state total count and what was omitted (e.g., "Showing 20 of 32 results.
Narrow your query for complete data.")
- Keep response concise: the output will be injected into another LLM's context with a ~3000 character budget

**Citations (MANDATORY):**

**ALWAYS** call `merge_citations.py` before composing your answer, regardless of whether you used query.py or query_db.py:

```
Bash: python3 {SKILL_DIR}/scripts/merge_citations.py
```

This script reads all citation data accumulated by query.py / query_db.py, merges rows by (file, sheet), and outputs ready-to-use tags like:

```
[CITATIONS]
```

Your job:

1. **Copy each CITATION tag EXACTLY as output by the script, character for character.** Do not modify any attributes. Rows are Excel row numbers (row 1 = header, data starts at row 2). Do NOT renumber, do NOT use range syntax. Do not invent tags.
2. **Place each tag after the paragraph / list / table that uses its data.** If only one tag, place it after the main content block. If multiple tags from different files, place each near the content that references that file.
3. **NEVER** put a tag on the same line as a list bullet or table row. **NEVER** write `__src=`.

**Calculation audit (mandatory when the answer involves arithmetic, aggregation, ratios, or percentages):**

Before writing the final answer, explicitly verify that every operand in your formula aligns with the question's scope. Output a short audit block:

```
[AUDIT]
Question scope:
Formula: / = (or SUM, AVG, etc.)
Operand check:
- : — source: — matches question scope? YES/NO
- : — source: — matches question scope? YES/NO
Verdict: PASS — formula matches question semantics
OR WARNING — scope mismatch: . Re-querying.
```

If the verdict is WARNING, do NOT output the answer. Instead, re-query with corrected keywords to find the operand that matches the question's scope. Only output the answer after the audit passes.

Note: Pre-computed percentages or "share" values found in data remarks may use a different denominator than what the question asks.
Always verify: never adopt them without confirming the denominator matches the question.

**Final output format:**

Write your answer body with `` tags from merge_citations.py placed near the relevant content. Do not add anything else.

## Rules

1. **Query first.** Always try query.py before detail.py. Skip detail if query results are sufficient.
2. **Minimize turns.** Typical: 2 turns (search + query). Max: 3-4 turns (+ detail + query_db for complex cases).
3. **No exploratory reads.** Do not ls, Glob, or Read files. All info comes from the scripts.
4. **Verify before answering.** If query.py returns very few rows (≤3) for a listing/ranking question, do not assume the result is complete. Check if a table-level keyword was accidentally used as a row filter.
5. **Fallback flexibility.** query_db.py with custom SQL handles most needs including large result sets (via auto-pagination). Do NOT write inline Python (`sqlite3.connect`) to query knowledge.db: it bypasses query_db.py's auto-fix protections (fullwidth comma, identifier quoting, __src replacement) and causes debug loops. Accuracy over speed.
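
The query_db.py status-line cases described in the Scripts section can be sketched as a small classifier. This is illustrative only: `parse_status_line` is a hypothetical helper, not part of the skill's scripts, and the regular expressions assume the real output matches the templates shown in this document character for character apart from the numbers. It only inspects text and does not touch knowledge.db.

```python
import re

# Patterns transcribed from the status-line templates in the Scripts section.
# Exact spacing is an assumption; real query_db.py output may differ.
COMPLETE_RE = re.compile(r"\[RESULT: (\d+)/(\d+) rows returned \| COMPLETE\]")
PARTIAL_RE = re.compile(
    r"\[RESULT: (\d+)/(\d+) returned \| this batch: rows (\d+)-(\d+) "
    r"\(offset (\d+)-(\d+)\) \| PARTIAL .*--offset=(\d+)\]"
)
OUT_OF_RANGE_RE = re.compile(
    r"\[RESULT: 0 rows \| offset (\d+) exceeds total (\d+) \| .*--offset=0\]"
)
EMPTY_LINE = "[RESULT: 0 rows | EMPTY]"


def parse_status_line(line):
    """Classify one status line (hypothetical helper, not part of the skill)."""
    line = line.strip()
    m = COMPLETE_RE.fullmatch(line)
    if m:
        # All rows returned: no further call needed.
        return {"status": "COMPLETE", "returned": int(m[1]),
                "total": int(m[2]), "next_offset": None}
    m = PARTIAL_RE.fullmatch(line)
    if m:
        # More rows remain: the next call should pass --offset next_offset.
        return {"status": "PARTIAL", "returned": int(m[1]),
                "total": int(m[2]), "next_offset": int(m[7])}
    if line == EMPTY_LINE:
        return {"status": "EMPTY", "returned": 0,
                "total": 0, "next_offset": None}
    m = OUT_OF_RANGE_RE.fullmatch(line)
    if m:
        # Offset past the end: restart from offset 0.
        return {"status": "OUT_OF_RANGE", "returned": 0,
                "total": int(m[2]), "next_offset": 0}
    return {"status": "UNKNOWN", "returned": None,
            "total": None, "next_offset": None}
```

Under these assumptions, a PARTIAL result means re-running the identical SQL string with `--offset` set to `next_offset`, and repeating until the status is COMPLETE.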