---
name: kfs-answer
description: Primary skill for answering ALL questions about the datasets knowledge base. Search files, run queries (SQL / markdown), and return answers with citations. MUST be used first for any data-related question.
category: Data & Retrieval
---

# kfs-answer

Answer ALL questions about the datasets knowledge base using this skill's scripts. This is the **primary and mandatory** tool for any question involving uploaded data files. Do NOT explore the filesystem, write Python code, or use other tools to access dataset content — all data access goes through the scripts below.

## Inputs

- `{user_question}` — the user's question
- `{chat_history}` — recent conversation context (may be empty)

Scripts are in `{SKILL_DIR}/scripts/`.

The scripts auto-discover datasets from `./datasets/` subdirectories — the agent does NOT need to know or pass dataset IDs.

## Scripts

- `python3 {SKILL_DIR}/scripts/search.py <query> <kw1> <kw2> ...` — scan knowledge files and return RECOMMENDED file:sheet pairs with compact summaries (source name, L0, L1, per-sheet description with fallback).
- `python3 {SKILL_DIR}/scripts/detail.py <file_id1:sheet_id1>,<file_id2:sheet_id2>` — return the full schema (columns with types/stats) plus sample data.
- `python3 {SKILL_DIR}/scripts/query.py <file_id1:sheet_id1>,... <question> <kw1> <kw2> ...` — budget-aware auto query (db: keyword SQL, markdown: section match).
- `python3 {SKILL_DIR}/scripts/query_db.py <db_path> <SQL> [--offset N]` — execute custom SQL with auto-pagination.

**query_db.py output structure:** TSV header + data rows + status line at the end.

**Status line — three cases:**

1. `[RESULT: N/N rows returned | COMPLETE]`
   All data fully returned. Proceed to answer.

2. `[RESULT: K/total returned | this batch: rows X-Y (offset A-B) | PARTIAL — call again with --offset=M]`
   Output size limit reached; more data remaining.
   - `K/total` — cumulative rows returned so far / total matching rows
   - `this batch: rows X-Y` — 1-indexed row range this call returned
   - Re-invoke with the SAME `<db_path>` and `<SQL>`, adding `--offset M` as a CLI argument. Repeat until COMPLETE.
   - If `total` is very large (1000+), consider reducing SELECT columns or adding WHERE filters instead of paginating.

3. `[RESULT: 0 rows | EMPTY]` — the query matched no rows.
   `[RESULT: 0 rows | offset N exceeds total M | call again with --offset=0]` — the offset is out of range.

**Pagination rules:**
- `--offset` is a COMMAND-LINE argument, NOT a SQL clause. Do NOT write `OFFSET N` in SQL.
- Do NOT use SQL `LIMIT`/`OFFSET` to manually control output size — pagination handles it automatically.
- You MAY use SQL `LIMIT` when the question genuinely requires it (e.g. "top 10 by revenue").
- Keep the SQL string character-for-character IDENTICAL across pagination calls.

- `python3 {SKILL_DIR}/scripts/merge_citations.py` — merge accumulated citations from query.py/query_db.py into final `<CITATION>` tags. **MUST be called once before composing the answer (Step 4), regardless of which query path was used.**

Note: file:sheet pairs are passed as one comma-separated string. Keywords are SEPARATE positional arguments — one keyword per argument, placed after the fixed arguments.

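The pagination protocol above can be sketched as a small driver loop. This is an illustration only: the regular expressions assume the exact spacing shown in the status-line templates, `run` is a hypothetical callable that executes a command and returns its stdout, and the loop assumes every batch repeats the TSV header.

```python
import re

# Status-line shapes from the list above. Exact spacing is an assumption.
PARTIAL = re.compile(r"\[RESULT: \d+/\d+ returned \|.*PARTIAL.*--offset=(\d+)\]")
COMPLETE = re.compile(r"\[RESULT: \d+/\d+ rows returned \| COMPLETE\]")
OUT_OF_RANGE = re.compile(r"\[RESULT: 0 rows \| offset \d+ exceeds total \d+")
EMPTY = re.compile(r"\[RESULT: 0 rows \| EMPTY\]")


def parse_status(line):
    """Return (state, next_offset) for one status line."""
    m = PARTIAL.search(line)
    if m:
        return "PARTIAL", int(m.group(1))
    if COMPLETE.search(line):
        return "COMPLETE", None
    if OUT_OF_RANGE.search(line):
        return "OUT_OF_RANGE", 0
    if EMPTY.search(line):
        return "EMPTY", None
    return "UNKNOWN", None


def fetch_all(run, script, db_path, sql, max_calls=50):
    """Re-run the identical SQL with a growing --offset until COMPLETE.

    `run(argv)` is a hypothetical runner that returns the command's stdout.
    """
    header, rows, offset = None, [], 0
    for _ in range(max_calls):
        argv = ["python3", script, db_path, sql]  # SQL string never changes
        if offset:
            argv += ["--offset", str(offset)]  # CLI argument, not a SQL clause
        lines = run(argv).strip("\n").split("\n")
        state, next_offset = parse_status(lines[-1])
        if state not in ("PARTIAL", "COMPLETE"):
            break
        if header is None:
            header = lines[0]
        rows.extend(lines[1:-1])  # assumes each batch repeats the header
        if state == "COMPLETE":
            break
        offset = next_offset
    return header, rows
```

Passing the SQL as a single list element keeps it byte-identical across calls, which is what the pagination rules require.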
## Protocol

### Step 1 — search

Use `{chat_history}` to understand the full context, and extract keywords from `{user_question}` (in the question's language). Pass the question, rewritten to be self-contained where the history requires it, as the first argument:

```
Bash: python3 {SKILL_DIR}/scripts/search.py "<rewritten_question>" <kw1> <kw2> ...
```

Example: `python3 {SKILL_DIR}/scripts/search.py "delivery report" delivery report overdue`

If the output shows `NO_MATCH`, answer: "The dataset does not contain data relevant to this question."

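Because the question is one argument and each keyword is its own positional argument, it can help to think of the command as an argument list rather than a single string. A minimal sketch follows; `build_search_argv`, `reply_if_no_match`, and the `/skills/kfs-answer` path are hypothetical names for illustration, and the substring test for `NO_MATCH` is an assumption about the script's output.

```python
import shlex

NO_MATCH_REPLY = "The dataset does not contain data relevant to this question."


def build_search_argv(skill_dir, question, keywords):
    """One argument for the question, then one argument per keyword."""
    return ["python3", f"{skill_dir}/scripts/search.py", question, *keywords]


def reply_if_no_match(search_output):
    """Return the fixed reply when search.py reports NO_MATCH, else None."""
    return NO_MATCH_REPLY if "NO_MATCH" in search_output else None


argv = build_search_argv(
    "/skills/kfs-answer",  # hypothetical SKILL_DIR
    "delivery report",
    ["delivery", "report", "overdue"],
)
# shlex.join renders the list as a shell command with correct quoting.
command = shlex.join(argv)
```

Here `command` quotes the two-word question once and leaves the three keywords as separate words, matching the example above.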
### Step 2 — query

From the search output, pick ONLY the file_id:sheet_id pairs relevant to the question (often just 1 file).

**Before calling query.py, classify your keywords against the search output (sheet names + L0 + L1 + per-sheet description):**
- **Table-level**: the keyword appears in the sheet name or L0 description → it describes the file/sheet scope, not individual rows. Do NOT pass it as a row-level filter.
  Example: the question asks about "福井県のBCP企業" (BCP companies in Fukui Prefecture) → "福井県" is the sheet name (all rows belong to Fukui). Do not use it as a WHERE keyword.
- **Column-level**: the keyword matches a concept mentioned in L0 as a data dimension → it determines which columns to look at, not a WHERE filter.
  Example: L0 says "エネルギー・たんぱく質・脂質等68項目" (68 items including energy, protein, and fat) → "エネルギー" is a column concept, not a row filter.
- **Row-level**: the keyword refers to a specific entity/item not mentioned in sheet names or L0 → use it as a query.py keyword for WHERE filtering.
  Example: "アーモンド" (almond) is a specific food item, not in the sheet name or L0 → valid row-level keyword.

Only pass **row-level keywords** to query.py:

```
Bash: python3 {SKILL_DIR}/scripts/query.py "<recommended_pairs>" "<question>" <row_kw1> <row_kw2> ...
```

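The three-way keyword classification can be approximated with substring checks. This is a rough sketch, not the skill's actual logic: `classify_keywords` is a hypothetical helper, and it assumes the caller has already separated the scope text (sheet names, L0 scope) from the text that lists data dimensions. The sample values reuse the examples above, which come from different questions.

```python
def classify_keywords(keywords, scope_text, dimension_text):
    """Bucket keywords as table-, column-, or row-level by substring match."""
    levels = {"table": [], "column": [], "row": []}
    for kw in keywords:
        if kw in scope_text:
            levels["table"].append(kw)   # describes the sheet as a whole
        elif kw in dimension_text:
            levels["column"].append(kw)  # names a column concept
        else:
            levels["row"].append(kw)     # specific entity: usable as a filter
    return levels


levels = classify_keywords(
    ["福井県", "エネルギー", "アーモンド"],
    scope_text="福井県",
    dimension_text="エネルギー・たんぱく質・脂質等68項目",
)
row_keywords = levels["row"]  # only these go to query.py
```

Substring matching will miss synonyms and paraphrases, so the result is a starting point for the judgment described above, not a replacement for it.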
This handles ~80% of questions directly. Check the results:
- **Sufficient** (no `[BUDGET]` tag, or truncation is acceptable) → go to Step 4 (answer). Done.
- **Insufficient** (`[BUDGET]` shows missing rows/columns critical to the question) → go to Step 3. **Discard query.py results completely** — query_db.py uses different SQL and ordering, so do NOT use `--offset` to "continue" from query.py. Always start query_db.py from offset=0.
- **Suspiciously few** (≤3 rows returned, but the question asks for "最初/一覧/全部/比較" (first / list / all / compare) or the total row count is much larger) → results are likely incomplete. Remove the most restrictive keyword and re-run query.py, or use empty keywords to get a broader view. If still unclear, go to Step 3.

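The result check can be written down as a small decision function. The sketch below is illustrative: `triage` and its return labels are hypothetical, `budget_is_critical` stands in for the judgment of whether the omitted data matters, and the factor of 10 used for "much larger" is an assumption, since the text gives no threshold.

```python
LISTING_WORDS = ("最初", "一覧", "全部", "比較")  # first / list / all / compare


def triage(question, output, rows_returned, total_rows=None,
           budget_is_critical=False):
    """Return 'step3', 'broaden', or 'answer' for a query.py result."""
    # Insufficient: the [BUDGET] tag is present and the omitted part matters.
    if "[BUDGET]" in output and budget_is_critical:
        return "step3"
    asks_for_listing = any(word in question for word in LISTING_WORDS)
    # "Much larger" is taken as 10x here, which is an assumption.
    much_larger = (total_rows is not None
                   and total_rows >= 10 * max(rows_returned, 1))
    # Suspiciously few: at most 3 rows for a listing-style question.
    if rows_returned <= 3 and (asks_for_listing or much_larger):
        return "broaden"
    return "answer"
```

A `"step3"` result means the query.py rows are discarded and query_db.py starts from offset 0; `"broaden"` means dropping the most restrictive keyword and re-running query.py.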
### Sheet selection from multi-sheet files

When a RECOMMENDED file has multiple sheets (e.g., `sheet_001`/`sheet_002`, `7-2-2図①`/`7-2-2図②`, `基本票`/`詳細票`), the technical sheet names may not convey their meaning. The search output includes a per-sheet description line for each sheet. Use it to select the correct sheet:

```
- 7-2-2図①[db,30]: ①女性:14歳以上の年齢層別女性人口の推移...
- 7-2-2図②[db,30]: ②男性:14歳以上の年齢層別男性人口の推移...
```

**Do NOT infer sheet identity from**:
- data value heuristics (e.g., "larger value = female")
- technical sheet id / name alone (e.g., `sheet_001`, `7-2-2図①`)

If the per-sheet description is missing, short, or ambiguous, call `detail.py` to get the full sheet description before deciding on a WHERE filter.

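Reading the per-sheet description lines can be sketched with a regular expression. The pattern is inferred from the two sample lines above and is an assumption about the format; the second bracket field is captured without interpretation because the text does not say what it means. `parse_sheet_line`, `needs_detail`, and the 10-character threshold for "short" are hypothetical.

```python
import re

# Inferred from the sample lines: "- <sheet>[<kind>,<number>]: <description>"
SHEET_LINE = re.compile(
    r"^- (?P<sheet>.+?)\[(?P<kind>\w+),(?P<number>\d+)\]: (?P<desc>.*)$"
)


def parse_sheet_line(line):
    """Return the fields of one per-sheet line, or None if it does not match."""
    m = SHEET_LINE.match(line)
    return m.groupdict() if m else None


def needs_detail(description, min_len=10):
    """True when the description is missing or too short to decide on.

    The length threshold is an assumption. Ambiguity still needs judgment.
    """
    return len(description.strip()) < min_len


info = parse_sheet_line("- 7-2-2図①[db,30]: ①女性:14歳以上の年齢層別女性人口の推移...")
use_description = info is not None and not needs_detail(info["desc"])
```

When `use_description` is false, the fallback is the `detail.py` call described above, not a guess based on the sheet name.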
### Step 3 — detail + refine (only if Step 2 insufficient)

Call detail.py to understand the full schema:

```
Bash: python3 {SKILL_DIR}/scripts/detail.py "<recommended_pairs>"
```

Read the column names and types from the detail output. Then write a targeted SQL query:

```
Bash: python3 {SKILL_DIR}/scripts/query_db.py "<db_path>" "SELECT col1,col2 FROM table WHERE ..."
```

**CRITICAL — Hidden `__src` column:**
Every db table has an `__src` column that is NOT shown in detail.py schema output (by design — the parser hides it from the human-readable schema). You MUST include `__src` as the FIRST column in every SELECT on a db table, regardless of what detail.py reports. Without it, per-row citation is impossible.

Example: `SELECT __src, col_a, col_b FROM sheet_001 WHERE col_a = 'x'`

The db_path is shown in the query.py output. Pagination is automatic — see the Scripts section above for the query_db.py output protocol (COMPLETE / PARTIAL / EMPTY status line). When the status is PARTIAL, follow the `--offset=N` instruction in the status line and keep calling until COMPLETE.

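The `__src` rule can be checked mechanically before a query is sent. The helper below is a hypothetical sketch that handles only a plain top-level `SELECT` (optionally `SELECT DISTINCT`); it does not parse SQL, and it makes no attempt to reason about subqueries, CTEs, or aggregate queries.

```python
import re


def ensure_src_first(sql):
    """Return sql with __src as the first selected column (simple SELECTs only)."""
    m = re.match(r"\s*SELECT\s+(DISTINCT\s+)?", sql, flags=re.IGNORECASE)
    if not m:
        raise ValueError("expected a SELECT statement")
    rest = sql[m.end():]
    if re.match(r"__src\b", rest):
        return sql  # already leads with __src
    return sql[:m.end()] + "__src, " + rest


fixed = ensure_src_first("SELECT col_a, col_b FROM sheet_001 WHERE col_a = 'x'")
```

Applied to the example above without its `__src`, this reproduces the documented form `SELECT __src, col_a, col_b FROM sheet_001 WHERE col_a = 'x'`.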
### Step 4 — output

Compose the final answer, then append source attribution on the last line:

- Answer in the same language as the question.
- When data was truncated, state the total count and what was omitted (e.g., "Showing 20 of 32 results. Narrow your query for complete data.").
- Keep the response concise — the output will be injected into another LLM's context with a ~3000 character budget.

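One way to respect the character budget while still reporting the total is to keep whole rows until the budget is used up and then add the truncation note. A minimal sketch, assuming rows are already rendered as strings; `fit_rows` is hypothetical, and the `budget` passed in should be smaller than the overall limit to leave room for prose and citation tags.

```python
def fit_rows(rows, total, budget):
    """Keep leading rows that fit in `budget` characters and describe the rest."""
    kept, used = [], 0
    for row in rows:
        cost = len(row) + 1  # +1 for the newline that joins rows
        if used + cost > budget:
            break
        kept.append(row)
        used += cost
    note = ""
    if len(kept) < total:
        note = (f"Showing {len(kept)} of {total} results. "
                "Narrow your query for complete data.")
    return kept, note


# Hypothetical data: 32 rendered rows of 99 characters each.
kept, note = fit_rows(["x" * 99] * 32, total=32, budget=2000)
```

With these hypothetical inputs, 20 rows fit and the note matches the wording of the example above.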
**Citations (MANDATORY):**

**ALWAYS** call `merge_citations.py` before composing your answer, regardless of whether you used query.py or query_db.py:

```
Bash: python3 {SKILL_DIR}/scripts/merge_citations.py
```

This script reads all citation data accumulated by query.py / query_db.py, merges rows by (file, sheet), and outputs ready-to-use tags like:

```
[CITATIONS]
<CITATION file="a1b2c3d4-e5f6-7890-abcd-ef1234567890" filename="商品リスト.xlsx" sheet="1" rows="[2, 3, 4, 5]" />
```

Your job:

1. **Copy each CITATION tag EXACTLY as output by the script — character for character.** Do not modify any attributes. Rows are Excel row numbers (row 1 = header, data starts at row 2). Do NOT renumber, do NOT use range syntax. Do not invent tags.

2. **Place each tag after the paragraph / list / table that uses its data.** If only one tag, place it after the main content block. If multiple tags from different files, place each near the content that references that file.

3. **NEVER** put a tag on the same line as a list bullet or table row. **NEVER** write `__src=`.

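The three citation rules lend themselves to a quick self-check on a draft answer. The sketch below is illustrative only: the function names are hypothetical, and the tag pattern assumes each tag is a single self-closing `<CITATION ... />` element with no `>` inside its attributes, as in the sample above.

```python
import re

TAG = re.compile(r"<CITATION\b[^>]*/>")
LIST_OR_TABLE = re.compile(r"^\s*(?:[-*+] |\d+\. |\|)")


def missing_or_altered(script_output, answer):
    """Tags from merge_citations.py that do not appear verbatim in the answer."""
    return [tag for tag in TAG.findall(script_output) if tag not in answer]


def invented(script_output, answer):
    """Tags in the answer that the script never produced."""
    expected = set(TAG.findall(script_output))
    return [tag for tag in TAG.findall(answer) if tag not in expected]


def misplaced_lines(answer):
    """Lines where a tag shares a line with a list bullet or table row."""
    return [line for line in answer.splitlines()
            if TAG.search(line) and LIST_OR_TABLE.match(line)]


def draft_is_clean(script_output, answer):
    """True when every check passes and no raw __src= marker leaked through."""
    return not (missing_or_altered(script_output, answer)
                or invented(script_output, answer)
                or misplaced_lines(answer)
                or "__src=" in answer)
```

A renumbered or range-style `rows` attribute shows up twice: once as a missing original tag and once as an invented one.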
**Calculation audit (mandatory when the answer involves arithmetic, aggregation, ratios, or percentages):**

Before writing the final answer, explicitly verify that every operand in your formula aligns with the question's scope. Output a short audit block:

```
[AUDIT]
Question scope: <entity / range / time period the question specifies>
Formula: <numerator> / <denominator> = <result> (or SUM, AVG, etc.)
Operand check:
- <operand1>: <value> — source: <row/column description> — matches question scope? YES/NO
- <operand2>: <value> — source: <row/column description> — matches question scope? YES/NO
Verdict: PASS — formula matches question semantics
OR WARNING — <operand X> scope mismatch: <explanation>. Re-querying.
```

If the verdict is WARNING, do NOT output the answer. Instead, re-query with corrected keywords to find the operand that matches the question's scope. Only output the answer after the audit passes.

Note: Pre-computed percentages or "share" values found in data remarks may use a different denominator than what the question asks. Always verify — never adopt them without confirming the denominator matches the question.

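The audit gate amounts to refusing to compute a result while any operand is out of scope. The sketch below illustrates that for a ratio; `Operand`, `audit`, and `audited_ratio` are hypothetical names, the `in_scope` flag records a judgment made by reading the source row or column, and the numbers are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Operand:
    name: str
    value: float
    source: str      # row/column description
    in_scope: bool   # does it match the question's scope?


def audit(operands):
    """Return ('PASS' or 'WARNING', names of operands with a scope mismatch)."""
    mismatched = [op.name for op in operands if not op.in_scope]
    return ("WARNING" if mismatched else "PASS"), mismatched


def audited_ratio(numerator, denominator):
    """Compute the ratio only when both operands pass the scope check."""
    verdict, mismatched = audit([numerator, denominator])
    if verdict != "PASS":
        return verdict, mismatched, None  # re-query instead of answering
    return verdict, mismatched, numerator.value / denominator.value


# Invented figures: a regional share computed against two candidate totals.
region = Operand("region sales", 120, "row 'Region A', column 'sales'", True)
regional_total = Operand("all-region total", 480, "row 'Total', column 'sales'", True)
other_total = Operand("total incl. exports", 1200, "remarks column", False)

ok = audited_ratio(region, regional_total)
blocked = audited_ratio(region, other_total)
```

The second call returns no result at all, which mirrors the rule that a WARNING verdict leads to a re-query instead of an answer.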
**Final output format:** Write your answer body with `<CITATION>` tags from merge_citations.py placed near the relevant content. Do not add anything else.

## Rules

1. **Query first.** Always try query.py before detail.py. Skip detail.py if the query results are sufficient.
2. **Minimize turns.** Typical: 2 turns (search + query). Max: 3-4 turns (+ detail + query_db for complex cases).
3. **No exploratory reads.** Do not ls, Glob, or Read files. All info comes from the scripts.
4. **Verify before answering.** If query.py returns very few rows (≤3) for a listing/ranking question, do not assume the result is complete. Check whether a table-level keyword was accidentally used as a row filter.
5. **Fallback flexibility.** query_db.py with custom SQL handles most needs, including large result sets (via auto-pagination). Do NOT write inline Python (`sqlite3.connect`) to query knowledge.db — it bypasses query_db.py's auto-fix protections (fullwidth comma, identifier quoting, __src replacement) and causes debug loops. Accuracy over speed.