diff --git a/skills/developing/table-query/SKILL.md b/skills/developing/table-query/SKILL.md
new file mode 100644
index 0000000..9a9156a
--- /dev/null
+++ b/skills/developing/table-query/SKILL.md
@@ -0,0 +1,137 @@
+---
+name: table-query
+description: Query structured spreadsheet/table data (Excel/CSV) to answer questions about values, prices, quantities, inventory, specifications, rankings, comparisons, summaries, aggregations, lists, or any numeric/tabular lookup. Use this skill whenever the answer likely comes from uploaded tables. You locate tables, read their schema, author SQLite SQL yourself, and run it — the backend does no LLM work, so it is fast.
+category: Data & Retrieval
+---
+
+# Table Query
+
+Answer table/spreadsheet questions by authoring and running SQLite SQL against the
+bot's uploaded Excel data. The backend is a thin, fast SQL executor — **you** do the
+thinking (rewrite the question, pick tables, write SQL). Row-level citations
+(`__src`) are produced for you.
+
+## When to use
+
+Use `table-query` for: values, prices, quantities, inventory, specifications,
+rankings, comparisons, summaries, aggregations (sum/avg/count), lists, person /
+project / product lookups, monthly/period totals, or any question whose answer
+comes from structured tables. For pure concept / definition / policy / explanation
+questions, use the `rag_retrieve` document tool instead.
+
+## Workflow (do this in order, once)
+
+1. **search-tables** — rewrite the user's question into a retrieval query (core
+   entity + attributes + synonyms), then locate candidate tables. Call this **once**.
+2. **get-schemas** — for the relevant subset of returned tables, fetch their
+   `CREATE TABLE` schema and sample rows. Never write SQL without seeing the schema.
+3. **author SQL** — write a SQLite query plan as JSON (see below).
+4. **run-sql** — execute the plan. It returns CSV with an `__src` column and a
+   `file_ref_table` mapping plus citation instructions.
+5. **answer + cite** — write the answer and add `` tags built from
+   `__src` + `file_ref_table`. Never print the `__src` column to the user.
+
+### Anti-waste rules
+
+- Call **search-tables at most once** per question. Do not re-locate tables you
+  already have schemas for.
+- If `run-sql` returns an error, fix the SQL and call **run-sql** again (at most ~2
+  tries). Do **NOT** restart from search-tables.
+- If `search-tables` finds nothing, fall back to the `rag_retrieve` document tool.
+
+## Commands
+
+```bash
+# 1. locate tables
+python {SKILL_DIR}/scripts/table_query.py search-tables --query "2025 April May June sales total" --top-k 20
+
+# 2. read schema + sample rows for the tables you picked
+python {SKILL_DIR}/scripts/table_query.py get-schemas --tables "sales_2025,customers"
+
+# 3. run your authored plan — pipe the JSON plan via stdin (no temp file needed)
+python {SKILL_DIR}/scripts/table_query.py run-sql <<'PLAN'
+{"queries":[{"step":1,"sql":"CREATE TEMP TABLE \"final_table_step1\" AS SELECT \"month\", SUM(\"amount\") AS \"total\" FROM \"sales_2025\" GROUP BY \"month\"","source_table_names":["sales_2025"],"destine_table_name":"final_table_step1","destine_table_type":"final","destine_table_description":"Monthly totals"}]}
+PLAN
+```
+
+## Authoring the SQL plan
+
+The plan is a JSON object `{ "queries": [ ... ] }` that you pass to `run-sql` **on
+stdin via a quoted heredoc** (`<<'PLAN' ... PLAN`). The quoted delimiter keeps all
+the double quotes, single quotes and `$` in your SQL intact — no shell escaping.
+(You may instead write it to a file and use `--plan-file path.json` if a plan is very
+large, but stdin is the default and needs no extra step.)
+
+Each query is one SQL step:
+
+```json
+{
+  "queries": [
+    {
+      "step": 1,
+      "sql": "CREATE TEMP TABLE \"final_table_step1\" AS SELECT \"month\", SUM(\"amount\") AS \"total\" FROM \"sales_2025\" WHERE \"month\" IN ('2025-04','2025-05','2025-06') GROUP BY \"month\"",
+      "source_table_names": ["sales_2025"],
+      "destine_table_name": "final_table_step1",
+      "destine_table_type": "final",
+      "destine_table_description": "Monthly sales totals for Apr-Jun 2025"
+    }
+  ]
+}
+```
+
+Field meaning:
+- `step`: 1-based execution order.
+- `sql`: a SQLite statement, normally `CREATE TEMP TABLE "..." AS SELECT ...`.
+- `source_table_names`: tables this step reads (original tables, or earlier steps'
+  `destine_table_name` for multi-step plans).
+- `destine_table_name`: the temp table this step creates. Convention:
+  `intermediate_table_stepN` or `final_table_stepN`.
+- `destine_table_type`: `"final"` for results the user should see, `"intermediate"`
+  for helper steps. **At least one `final` is required.**
+- `destine_table_description`: short human description of the result.
+
+### SQL rules (important)
+
+- **Quote every identifier** with double quotes: `"column name"`, `"table name"`.
+- String literals use single quotes; escape `'` as `''`.
+- Prefer **one logical result per `final` table**. For multiple separate results,
+  emit multiple `final` tables (e.g. step1, step2) — do **NOT** `UNION` unrelated results.
+- For row-level citations to be precise, keep `final` steps as simple single-table
+  `SELECT`s (no `JOIN` / `GROUP BY` / aggregation). Aggregations still work but the
+  citation degrades to file+sheet level (`F1S2`) instead of an exact row (`F1S2R5`).
+- Multi-step plans run in `step` order: build `intermediate_table_stepN` first, then
+  read it in a later step. Don't reference a temp table before it is created.
+- **Sample rows are a format hint only** — never assume they represent the full data
+  or the row count. Your SQL must scan the whole table. Use `LIKE '%value%'` for free
+  text and `=` for enums/codes.
+
+## Result handling & citations
+
+- `run-sql` output begins with citation instructions, then `file_ref_table`, then the
+  result CSV (with `__src`).
+- Parse `__src` (`F1S2R5` = file_ref F1, sheet 2, row 5) and `file_ref_table` to build
+  ``.
+- Put citations on their own line **after** the list/table that uses the data; combine
+  same-(file,sheet) rows into one citation.
+- If the result hint says rows were truncated (`Only the first N rows ...; the
+  remaining M ...`), tell the user the total (`N+M`), shown (`N`), and omitted (`M`).
+- Never expose the `__src` column itself to the user.
+
+### Controlling truncation
+
+`run-sql` truncates results by default (total rows and per-cell characters) to keep
+the context manageable. If a result comes back truncated and you genuinely need more,
+re-run with higher limits — do **not** re-run search-tables:
+
+```bash
+python {SKILL_DIR}/scripts/table_query.py run-sql --max-rows 500 --cell-max 4000 <<'PLAN'
+{"queries":[ ... ]}
+PLAN
+```
+
+- `--max-rows`: max total rows across all `final` tables (default from backend config,
+  hard ceiling 2000). Prefer writing an aggregate query (SUM/COUNT/GROUP BY) over
+  pulling thousands of detail rows.
+- `--cell-max`: max characters per cell before it is truncated with `..` (default from
+  backend config, hard ceiling 10000). Raise this when a long-text column (e.g. a
+  description/spec field) is getting cut off.
diff --git a/skills/developing/table-query/scripts/table_query.py b/skills/developing/table-query/scripts/table_query.py
new file mode 100755
index 0000000..b45a121
--- /dev/null
+++ b/skills/developing/table-query/scripts/table_query.py
@@ -0,0 +1,213 @@
+#!/usr/bin/env python3
+"""
+table-query CLI.
+
+Fast, LLM-free table querying. Talks to the felo-mygpt table_query endpoints:
+  - search-tables : POST /v1/table_query/search_tables/{bot_id}
+  - get-schemas   : POST /v1/table_query/get_schemas/{bot_id}
+  - run-sql       : POST /v1/table_query/run_sql/{bot_id}
+
+The agent drives the orchestration (rewrite -> locate -> author SQL -> run);
+the backend only does cheap work, so each call returns in seconds.
+"""
+
+import argparse
+import hashlib
+import json
+import os
+import sys
+
+try:
+    import requests
+except ImportError:
+    print("Error: requests module is required. Please install it with: pip install requests")
+    sys.exit(1)
+
+DEFAULT_BACKEND_HOST = os.getenv("BACKEND_HOST", "https://api-dev.gptbase.ai")
+DEFAULT_MASTERKEY = os.getenv("MASTERKEY", "master")
+
+# Same citation contract the legacy table_rag_retrieve used, so the agent's
+# behaviour is unchanged.
+TABLE_CITATION_INSTRUCTIONS = """
+When using the retrieved table knowledge below, you MUST add XML citation tags for factual claims.
+
+Format: ``
+- Parse `__src`: `F1S2R5` = file_ref F1, sheet 2, row 5
+- Look up file_id in `file_ref_table`
+- Combine same-sheet rows into one citation: `rows=[2, 4, 6]`
+- MANDATORY: Create SEPARATE citation for EACH (file, sheet) combination
+- NEVER put on the same line as a bullet point or table row
+- Citations MUST be on separate lines AFTER the complete list/table
+- NEVER include the `__src` column in your response - it is internal metadata only
+- Citations MUST appear IMMEDIATELY AFTER the paragraph or bullet list that uses the knowledge
+- NEVER collect all citations and place them at the end of your response
+
+"""
+
+
+def load_config() -> dict:
+    """Load robot_config.json from the robot project root (3 levels up from scripts/)."""
+    config_path = os.path.join(os.path.dirname(__file__), '..', '..', '..', 'robot_config.json')
+    if os.path.exists(config_path):
+        try:
+            with open(config_path, 'r', encoding='utf-8') as f:
+                return json.load(f)
+        except (json.JSONDecodeError, IOError) as e:
+            print(f"Warning: failed to load robot_config.json: {e}", file=sys.stderr)
+    return {}
+
+
+def _resolve_bot_id(cli_bot_id: str) -> str:
+    if cli_bot_id:
+        return cli_bot_id
+    return load_config().get('bot_id') or os.getenv("BOT_ID") or os.getenv("ASSISTANT_ID")
+
+
+def _post(path: str, bot_id: str, payload: dict) -> dict:
+    url = f"{DEFAULT_BACKEND_HOST}/v1/table_query/{path}/{bot_id}"
+    auth_token = hashlib.md5(f"{DEFAULT_MASTERKEY}:{bot_id}".encode()).hexdigest()
+    headers = {
+        "content-type": "application/json",
+        "authorization": f"Bearer {auth_token}",
+    }
+    trace_id = os.getenv("TRACE_ID") or os.getenv("X_REQUEST_ID")
+    if trace_id:
+        headers["X-Request-ID"] = trace_id
+    resp = requests.post(url, json=payload, headers=headers, timeout=30)
+    if resp.status_code != 200:
+        raise RuntimeError(f"API {path} returned {resp.status_code}: {resp.text}")
+    return resp.json()
+
+
+def cmd_search_tables(args, bot_id: str) -> str:
+    res = _post("search_tables", bot_id, {"query": args.query, "top_k": args.top_k})
+    tables = res.get("tables", [])
+    if not tables:
+        return ("No matching tables found. If the question may be answered from documents "
+                "instead of spreadsheets, fall back to the rag_retrieve document tool.")
+    lines = [f"Found {len(tables)} candidate table(s). Pick the relevant ones and call "
+             f"`get-schemas` for them next.\n"]
+    for t in tables:
+        lines.append(
+            f"- table_name: {t['table_name']}\n"
+            f" file: {t.get('file_name','')} | sheet: {t.get('sheet_name','')} "
+            f"| score: {round(t.get('score', 0), 3)}\n"
+            f" description: {t.get('table_description','')}"
+        )
+    return "\n".join(lines)
+
+
+def cmd_get_schemas(args, bot_id: str) -> str:
+    table_names = [t.strip() for t in args.tables.split(',') if t.strip()]
+    res = _post("get_schemas", bot_id,
+                {"table_names": table_names, "sample_rows": args.sample_rows})
+    schemas = res.get("schemas", [])
+    missing = res.get("missing_tables", [])
+    if not schemas:
+        return f"No schemas resolved. Missing tables: {missing}"
+    blocks = []
+    for s in schemas:
+        block = [f"### Table: {s['table_name']}",
+                 f"File: {s.get('file_name','')} | Sheet: {s.get('sheet_name','')}",
+                 "```sql", s.get('sql_create', ''), "```"]
+        sample = s.get('sample_rows') or []
+        if sample:
+            block.append("Sample rows (format hint only, NOT the row count):")
+            block.append("```csv")
+            for row in sample:
+                block.append(",".join('"' + str(c).replace('"', '""') + '"' for c in row))
+            block.append("```")
+        blocks.append("\n".join(block))
+    out = "\n\n".join(blocks)
+    if missing:
+        out += f"\n\nNote: these requested tables were not found: {missing}"
+    out += ("\n\nNow author a SQLite plan and run it by piping the JSON to run-sql on stdin:\n"
+            " run-sql <<'PLAN'\n"
+            " {\"queries\": [{\"step\": 1, \"sql\": \"CREATE TEMP TABLE \\\"final_table_step1\\\" "
+            "AS SELECT ...\", \"source_table_names\": [\"...\"], "
+            "\"destine_table_name\": \"final_table_step1\", \"destine_table_type\": \"final\"}]}\n"
+            " PLAN\n"
+            "Quote all identifiers with double quotes.")
+    return out
+
+
+def cmd_run_sql(args, bot_id: str) -> str:
+    # Read the plan from --plan-file if given, otherwise from stdin (heredoc).
+    try:
+        if args.plan_file:
+            with open(args.plan_file, 'r', encoding='utf-8') as f:
+                raw = f.read()
+        else:
+            raw = sys.stdin.read()
+        if not raw.strip():
+            return ("Error: no plan provided. Pipe the JSON plan via stdin, e.g.\n"
+                    " python scripts/table_query.py run-sql <<'PLAN'\n"
+                    " {\"queries\": [...]}\n"
+                    " PLAN")
+        plan = json.loads(raw)
+    except (json.JSONDecodeError, IOError) as e:
+        return f"Error: failed to read SQL plan: {e}"
+    # accept either {"queries": [...]} or a bare [...] list
+    queries = plan.get("queries") if isinstance(plan, dict) else plan
+    if not queries:
+        return "Error: the plan must contain a non-empty `queries` list."
+    payload = {"queries": queries}
+    if args.max_rows is not None:
+        payload["max_rows"] = args.max_rows
+    if args.cell_max is not None:
+        payload["cell_max"] = args.cell_max
+    res = _post("run_sql", bot_id, payload)
+    if not res.get("success"):
+        return (f"SQL execution failed: {res.get('error')}\n"
+                "Fix your SQL and call run-sql again. Do NOT restart from search-tables.")
+    parts = [TABLE_CITATION_INSTRUCTIONS]
+    if res.get("instruction"):
+        parts.append(res["instruction"])
+    if res.get("knowledge"):
+        parts.append(res["knowledge"])
+    if res.get("extra_goal"):
+        parts.append(res["extra_goal"])
+    return "\n".join(parts)
+
+
+def main():
+    parser = argparse.ArgumentParser(description="table-query: fast LLM-free table querying")
+    parser.add_argument("--bot-id", default=None, help="Bot id (defaults to robot_config.json)")
+    sub = parser.add_subparsers(dest="command", required=True)
+
+    p_search = sub.add_parser("search-tables", help="Vector-locate relevant tables")
+    p_search.add_argument("--query", "-q", required=True, help="Rewritten retrieval query")
+    p_search.add_argument("--top-k", "-k", type=int, default=20)
+
+    p_schemas = sub.add_parser("get-schemas", help="Fetch CREATE TABLE schema + sample rows")
+    p_schemas.add_argument("--tables", "-t", required=True, help="Comma-separated table names")
+    p_schemas.add_argument("--sample-rows", type=int, default=3)
+
+    p_run = sub.add_parser("run-sql", help="Execute an authored SQL plan (JSON via stdin or file)")
+    p_run.add_argument("--plan-file", "-f", default=None,
+                       help="Path to plan JSON file (optional; defaults to reading stdin)")
+    p_run.add_argument("--max-rows", type=int, default=None,
+                       help="Max total result rows (raise if a result came back truncated)")
+    p_run.add_argument("--cell-max", type=int, default=None,
+                       help="Max characters per cell before truncation")
+
+    args = parser.parse_args()
+    bot_id = _resolve_bot_id(args.bot_id)
+    if not bot_id:
+        print("Error: bot_id is required (robot_config.json / --bot-id / BOT_ID env)")
+        sys.exit(1)
+
+    try:
+        if args.command == "search-tables":
+            print(cmd_search_tables(args, bot_id))
+        elif args.command == "get-schemas":
+            print(cmd_get_schemas(args, bot_id))
+        elif args.command == "run-sql":
+            print(cmd_run_sql(args, bot_id))
+    except Exception as e:
+        print(f"Error: {str(e)}")
+        sys.exit(1)
+
+
+if __name__ == "__main__":
+    main()
diff --git a/skills/developing/table-query/skill.yaml b/skills/developing/table-query/skill.yaml
new file mode 100644
index 0000000..839dda9
--- /dev/null
+++ b/skills/developing/table-query/skill.yaml
@@ -0,0 +1,25 @@
+name: table-query
+version: 1.0.0
+description: Fast LLM-free table querying. Locate tables, fetch schema, author SQLite SQL, and run it with row-level citations.
+author:
+  name: sparticle
+  email: support@gbase.ai
+license: MIT
+tags:
+  - table
+  - sql
+  - excel
+  - retrieval
+  - citation
+runtime:
+  python: ">=3.7"
+  dependencies:
+    - requests
+entry_point: scripts/table_query.py
+commands:
+  search-tables:
+    description: Vector-locate relevant tables for a query
+  get-schemas:
+    description: Fetch CREATE TABLE schema + sample rows for given tables
+  run-sql:
+    description: Execute an authored SQLite plan and return CSV with __src citations
diff --git a/skills/developing/table-query/verify_table_query.sh b/skills/developing/table-query/verify_table_query.sh
new file mode 100755
index 0000000..f6de962
--- /dev/null
+++ b/skills/developing/table-query/verify_table_query.sh
@@ -0,0 +1,67 @@
+#!/usr/bin/env bash
+#
+# Manual verification for the new table_query endpoints.
+# Run this against an environment where the feature/table-query-split branch is
+# deployed (e.g. dev). It checks the 3 fast endpoints and diffs run_sql output
+# against the legacy table_rag_retrieve for parity.
+#
+# Usage:
+#   HOST=https://api-dev.gptbase.ai BOT_ID= MASTERKEY=master ./verify_table_query.sh
+#
+set -euo pipefail
+
+HOST="${HOST:-https://api-dev.gptbase.ai}"
+# bot from the slow-request log (has the 案1_売上明細 xlsx). Override as needed.
+BOT_ID="${BOT_ID:-c1fa021b-6c41-41d5-b1e6-adfb8896aaaa}"
+MASTERKEY="${MASTERKEY:-master}"
+QUERY="${QUERY:-2025年4月〜6月の売上実績}"
+
+# auth token = MD5(masterkey:bot_id)
+TOKEN=$(python3 -c "import hashlib,sys;print(hashlib.md5(f'{sys.argv[1]}:{sys.argv[2]}'.encode()).hexdigest())" "$MASTERKEY" "$BOT_ID")
+AUTH="authorization: Bearer ${TOKEN}"
+CT="content-type: application/json"
+
+echo "=== HOST=$HOST BOT_ID=$BOT_ID ==="
+
+echo
+echo "### 1) search_tables ###"
+curl -s --request POST "$HOST/v1/table_query/search_tables/$BOT_ID" \
+  --header "$AUTH" --header "$CT" \
+  --data "{\"query\": \"$QUERY\", \"top_k\": 20}" | python3 -m json.tool
+
+echo
+echo "### 2) get_schemas (EDIT --data table_names with names from step 1) ###"
+echo "curl -s --request POST \"$HOST/v1/table_query/get_schemas/$BOT_ID\" \\"
+echo " --header \"$AUTH\" --header \"$CT\" \\"
+echo " --data '{\"table_names\": [\"\"], \"sample_rows\": 3}' | python3 -m json.tool"
+
+echo
+echo "### 3) run_sql (EDIT the sql to match the real table/columns from step 2) ###"
+cat > /tmp/tq_plan.json <<'JSON'
+{
+  "queries": [
+    {
+      "step": 1,
+      "sql": "CREATE TEMP TABLE \"final_table_step1\" AS SELECT \"計上日\", \"得意先名\", \"売上金額\" FROM \"\" LIMIT 10",
+      "source_table_names": [""],
+      "destine_table_name": "final_table_step1",
+      "destine_table_type": "final",
+      "destine_table_description": "sample rows"
+    }
+  ]
+}
+JSON
+echo "Edit /tmp/tq_plan.json (replace ), then:"
+echo "curl -s --request POST \"$HOST/v1/table_query/run_sql/$BOT_ID\" \\"
+echo " --header \"$AUTH\" --header \"$CT\" \\"
+echo " --data @/tmp/tq_plan.json | python3 -m json.tool"
+echo
+echo "ASSERT: run_sql output 'knowledge' contains a '__src' column and 'file_ref_table'."
+
+echo
+echo "### 4) legacy table_rag_retrieve (parity reference, same question) ###"
+echo "curl -s --request POST \"$HOST/v1/table_rag_retrieve/$BOT_ID\" \\"
+echo " --header \"$AUTH\" --header \"$CT\" \\"
+echo " --data '{\"query\": \"$QUERY\"}' | python3 -m json.tool"
+echo
+echo "Compare the __src tokens / result rows between #3 and #4 for the same SQL intent."
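
Review note: the citation rules in this diff ask the agent to parse `__src` tokens (`F1S2R5` = file_ref F1, sheet 2, row 5; `F1S2` when a result is aggregated) and to combine rows from the same (file, sheet) into a single citation. A minimal sketch of that grouping step might look like the following. It is not part of the patch: the helper name `group_src_tokens` and the regular expression are assumptions inferred only from the two token examples above, and the real backend may emit other token shapes.

```python
import re
from collections import defaultdict

# Assumed token shape, inferred from the examples `F1S2R5` and `F1S2` in the diff.
_SRC_TOKEN = re.compile(r"F(\d+)S(\d+)(?:R(\d+))?")


def group_src_tokens(tokens):
    """Group `__src` tokens by (file_ref, sheet) and collect sorted row numbers.

    Hypothetical helper for illustration; a sheet-level token such as `F1S2`
    yields an entry with an empty row list.
    """
    grouped = defaultdict(set)
    for token in tokens:
        match = _SRC_TOKEN.fullmatch(token.strip())
        if match is None:
            continue  # ignore values that do not look like a citation token
        file_no, sheet_no, row_no = match.groups()
        rows = grouped[(f"F{file_no}", int(sheet_no))]  # creates the group if new
        if row_no is not None:
            rows.add(int(row_no))
    return {key: sorted(rows) for key, rows in grouped.items()}


if __name__ == "__main__":
    # Three row-level tokens from one sheet plus one sheet-level token.
    print(group_src_tokens(["F1S2R4", "F1S2R2", "F1S2R6", "F2S1"]))
```

Each key in the returned mapping corresponds to one citation under the "SEPARATE citation for EACH (file, sheet) combination" rule. Resolving the file_ref to a file_id through `file_ref_table` is left out, since the layout of that table is not shown in the diff.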