Takes a raw document and turns it into a vector-DB-ready JSONL dataset, deterministically. The pack measures the corpus, counts tokens with a real OpenAI BPE tokenizer, chunks on token boundaries, attaches entities and keywords as metadata, emits NDJSON, then validates every record against a JSON Schema before you ingest it. Seven pure-CPU tools, free-tier eligible: the canonical 'prep my docs for embeddings' workflow done as deterministic tool calls instead of a hand-rolled Python script.
The 'prep my docs for embeddings' workflow is the first step for nearly every RAG pipeline, fine-tuning dataset, and agent corpus, and it is often done with a hand-rolled Python script that silently drops malformed records, chunks by character instead of by token, and ships unvalidated JSONL to the vector DB. This pack runs the whole pipeline as deterministic tool calls, so the output is reproducible across runs, the schema gate catches corruption at the boundary, and the agent can re-run a single step (for example, re-chunk with a different size) without redoing the rest.
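To make "chunks by token instead of by character" concrete, here is a minimal sketch of sliding-window chunking over token IDs. It assumes the text has already been encoded to a list of integer token IDs by a BPE tokenizer. The function name chunk_tokens and its defaults are illustrative assumptions, not Agent402's text-chunk implementation.

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence[int], size: int = 512, overlap: int = 64) -> List[List[int]]:
    """Split a token-ID sequence into windows of at most `size` tokens.

    Consecutive windows share `overlap` tokens. This is a hypothetical helper
    for illustration; the real tool may treat the final window differently.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")

    step = size - overlap  # how far the window start advances each time
    chunks: List[List[int]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + size]))
        # Stop once the current window has reached the end of the input,
        # so no trailing chunk made only of overlap is emitted.
        if start + size >= len(tokens):
            break
    return chunks
```

Because the windows are cut on token IDs, no chunk can exceed the token budget, which a character-based split cannot guarantee. A document shorter than `size` tokens comes back as a single chunk.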
claude mcp add agent402 -s user -- npx -y agent402-mcp@latest
Then paste this prompt into Claude:
Prep this document for vector-DB ingestion using Agent402.
Document:
"""
Alice from acme@example.com filed a support ticket on 2026-06-21 about the checkout flow returning a 502 from api.acme.com/v2/orders. Engineer Bob investigated and found the issue was a connection-pool exhaustion in the order-service: postgres max_connections was 100 and the pool had been silently leaking since the rollout of feature flag #orders-2026. Fix landed in commit 9a3b2c1; deploy went out 2026-06-22. Follow-up: add pgbouncer in front of the order-service and an alert on pool.in_use / max_connections > 0.8 in PagerDuty. Slack thread: #incident-orders-502. Mentioned engineers: @alice @bob @carol.
"""
Target embedding model: text-embedding-3-small (8191 token cap).
Target chunk size: 512 tokens with 64-token overlap.
(1) text-stats on the full doc. Report char / word / sentence / token (heuristic) counts.
(2) token-count on the full doc with model=gpt-4o. This is a true BPE count; compare it against the heuristic and report the delta. The BPE count is usually 1.3-1.5x the whitespace-split count. Note that gpt-4o uses the o200k_base encoding while text-embedding-3-small uses cl100k_base, so treat this figure as a close estimate for the embedding model.
(3) text-chunk with unit='tokens', size=512, overlap=64. Returns N chunks. For each chunk, attach chunkIndex (0-based), sourceDoc ('input'), and tokens (the chunk's exact token count).
(4) For each chunk: extract-entities. Attach the result as metadata.entities (emails, urls, ipv4, mentions, hashtags).
(5) For each chunk: keywords with n=10. Attach as metadata.keywords.
(6) Assemble the array: [{id: '<sourceDoc>#<chunkIndex>', text: <chunk>, metadata: {chunkIndex, sourceDoc, tokens, entities, keywords}}, …]. Call jsonl in to-jsonl mode to emit the NDJSON string.
(7) Define a JSON Schema with required: ['id','text','metadata'], metadata.properties.tokens.maximum: 8191, metadata.properties.chunkIndex.minimum: 0. For each record in the array, call json-validate. Collect any records with errors.
Final return: {totalChunks, totalTokens, ndjson, schemaPassed: N, schemaFailed: M, failingRecords: [{id, errors}], oneLineSummary: 'prepped N chunks (avg X tokens, max Y); M failed schema; ready for upsert'}.
All seven tools are pure-CPU and PoW-eligible, so the whole pipeline runs on the free tier. Budget ≤ $0.01 even paid.
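Outside the prompt, steps (6) and (7) can be sketched in plain Python to show the record shape and the checks the schema gate performs. This is a rough illustration under stated assumptions, not how the jsonl or json-validate tools are implemented. The function names are hypothetical, and the hand-written validate stands in for a real JSON Schema validator, covering only the three constraints named in the prompt.

```python
import json
from typing import Any, Dict, List

MAX_TOKENS = 8191  # input cap quoted in the prompt for text-embedding-3-small


def build_record(source_doc: str, chunk_index: int, text: str, tokens: int,
                 entities: Dict[str, List[str]], keywords: List[str]) -> Dict[str, Any]:
    """Assemble one record in the shape described in step (6)."""
    return {
        "id": f"{source_doc}#{chunk_index}",
        "text": text,
        "metadata": {
            "chunkIndex": chunk_index,
            "sourceDoc": source_doc,
            "tokens": tokens,
            "entities": entities,
            "keywords": keywords,
        },
    }


def to_ndjson(records: List[Dict[str, Any]]) -> str:
    """Emit one JSON object per line; json.dumps escapes embedded newlines."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


def _is_number(value: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate(record: Dict[str, Any]) -> List[str]:
    """Stand-in for the schema in step (7): required keys plus two bounds."""
    errors = [f"missing required property: {key}"
              for key in ("id", "text", "metadata") if key not in record]
    meta = record.get("metadata")
    if isinstance(meta, dict):
        tokens = meta.get("tokens")
        if _is_number(tokens) and tokens > MAX_TOKENS:
            errors.append(f"metadata.tokens {tokens} exceeds maximum {MAX_TOKENS}")
        index = meta.get("chunkIndex")
        if _is_number(index) and index < 0:
            errors.append(f"metadata.chunkIndex {index} is below minimum 0")
    return errors


def summarize(records: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Build a report with the same fields as the prompt's final return."""
    failing = []
    total_tokens = 0
    for record in records:
        errs = validate(record)
        if errs:
            failing.append({"id": record.get("id"), "errors": errs})
        meta = record.get("metadata")
        if isinstance(meta, dict) and _is_number(meta.get("tokens")):
            total_tokens += meta["tokens"]
    return {
        "totalChunks": len(records),
        "totalTokens": total_tokens,
        "ndjson": to_ndjson(records),
        "schemaPassed": len(records) - len(failing),
        "schemaFailed": len(failing),
        "failingRecords": failing,
    }
```

Failing records are collected rather than dropped, so nothing disappears silently between chunking and upsert.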