RAG corpus prep

Take a raw document and turn it into a vector-DB-ready JSONL dataset, deterministically. Measures the corpus, token-counts it with the real OpenAI BPE, chunks at the right token boundary, attaches entities + keywords as metadata, emits NDJSON, then validates every record against a JSON Schema before you ingest it. Seven pure-CPU tools, free-tier eligible — the canonical 'prep my docs for embeddings' workflow done as deterministic tool calls instead of a hand-rolled Python script.

When to use this pack

The 'prep my docs for embeddings' workflow is the universal first step for every RAG pipeline, fine-tuning dataset, and agent corpus — and it's almost always done with a hand-rolled Python script that silently drops malformed records, chunks by character instead of token, and ships unvalidated JSONL to the vector DB. This pack does the whole pipeline as deterministic tool calls so the output is reproducible across runs, the schema gate catches corruption at the boundary, and the agent can re-run a single step (re-chunk with a different size) without re-doing the rest.

Tools in this pack

Workflow

  1. Size the corpus with text-stats. Returns characters / words / sentences / paragraphs / tokens (cheap heuristic, not the real LLM tokenizer). This is the budget probe: if the doc is 50k characters you can afford to chunk and embed it; if it's 50MB you need a different strategy (streaming, summarization-first, or sharding). Treat the heuristic-token count as an upper bound; the real count comes from token-count next.
  2. Count exact LLM tokens with token-count. Uses the real OpenAI BPE (o200k_base for gpt-4o/o-series, cl100k_base for gpt-4/gpt-3.5) — same tokens the embedding model will actually consume. The reason this matters: a chunk that text-stats counts as 800 'tokens' (whitespace-split) is often 1100+ real tokens (BPE splits subwords), and OpenAI's text-embedding-3-small caps at 8191 input tokens. Knowing the true count before you chunk lets you pick a size that fits the embedding model's context window with safety margin.
  3. Chunk with text-chunk in token mode. Pass unit='tokens' and size=512 (or whatever your embedding model + retrieval strategy wants). Returns the chunks plus offsets. Token-mode chunking is the correct primitive: char-mode is a convenient default but will split mid-word and produce chunks that mismatch embedding-model context boundaries. Overlap of ~10-20% (e.g., 64 tokens on a 512 chunk) is the conventional sweet spot for retrieval recall.
  4. Per chunk: enrich with extract-entities. Returns deduped emails / urls / ipv4 / mentions / hashtags — exactly the metadata fields you want indexed alongside the embedding so you can hybrid-search ('chunks mentioning user@example.com' OR semantic-similarity). The agent runs this once per chunk and attaches the result as a metadata block on each JSONL record.
  5. Per chunk: extract keywords with keywords. Top-N tokens by frequency with stopword removal — your hybrid-search BM25 / tag-filter index. Conventional payload: keep top 10. This is the 'sparse retrieval' lane on top of dense embeddings; modern vector DBs (Pinecone, Weaviate) expect both.
  6. Emit JSONL with jsonl in to-jsonl mode. Input: an array of {id, text, metadata: {entities, keywords, chunkIndex, sourceDoc, tokens}} records. Output: newline-delimited JSON — the canonical format for OpenAI fine-tuning (.jsonl), Pinecone batch upserts, Weaviate import scripts, and HuggingFace datasets. One-record-per-line means downstream tools can stream-process without loading the whole file.
  7. Final gate: json-validate the emitted records against a JSON Schema. Schema defines required fields (id, text, metadata.tokens within model limit, metadata.chunkIndex ≥ 0), types, and constraints. This is the boundary check the rest of the pipeline takes for granted: if any record is missing 'id' or has tokens > 8191 the validator returns errors and the agent can re-chunk or drop those records BEFORE they hit the vector DB and become silent failures (missing search results, broken citations) downstream.

Run it in Claude

claude mcp add agent402 -s user -- npx -y agent402-mcp@latest

Then paste this prompt into Claude:

Prep this document for vector-DB ingestion using Agent402.

Document:
"""
Alice from acme@example.com filed a support ticket on 2026-06-21 about the checkout flow returning a 502 from api.acme.com/v2/orders. Engineer Bob investigated and found the issue was a connection-pool exhaustion in the order-service: postgres max_connections was 100 and the pool had been silently leaking since the rollout of feature flag #orders-2026. Fix landed in commit 9a3b2c1; deploy went out 2026-06-22. Follow-up: add pgbouncer in front of the order-service and an alert on pool.in_use / max_connections > 0.8 in PagerDuty. Slack thread: #incident-orders-502. Mentioned engineers: @alice @bob @carol.
"""

Target embedding model: text-embedding-3-small (8191 token cap).
Target chunk size: 512 tokens with 64-token overlap.

(1) text-stats on the full doc. Report char / word / sentence / token (heuristic) counts. (2) token-count on the full doc with model=gpt-4o. This is the BPE-accurate count; compare against the heuristic and report the delta — the BPE count is usually 1.3-1.5x the whitespace-split count. (3) text-chunk with unit='tokens', size=512, overlap=64. Returns N chunks. For each chunk, attach chunkIndex (0-based), sourceDoc ('input'), and tokens (the chunk's exact token count). (4) For each chunk: extract-entities. Attach the result as metadata.entities (emails, urls, ipv4, mentions, hashtags). (5) For each chunk: keywords with n=10. Attach as metadata.keywords. (6) Assemble the array: [{id: '<sourceDoc>#<chunkIndex>', text: <chunk>, metadata: {chunkIndex, sourceDoc, tokens, entities, keywords}}, …]. Call jsonl in to-jsonl mode to emit the NDJSON string. (7) Define a JSON Schema with required: ['id','text','metadata'], metadata.properties.tokens.maximum: 8191, metadata.properties.chunkIndex.minimum: 0. For each record in the array, call json-validate. Collect any records with errors. Final return: {totalChunks, totalTokens, ndjson, schemaPassed: N, schemaFailed: M, failingRecords: [{id, errors}], oneLineSummary: 'prepped N chunks (avg X tokens, max Y); M failed schema; ready for upsert'}. All seven tools are pure-CPU and PoW-eligible — the whole pipeline runs on the free tier. Budget ≤ $0.01 even paid.

← All skill packs