Content extraction

Turn arbitrary URLs and PDFs into clean structured text: articles, page metadata, PDF pages, OCR'd images, and browser-rendered SPAs.

When to use this pack

Use it when building a RAG corpus, assembling a daily newsletter from a list of source URLs, or extracting a table from a scanned PDF.

Tools in this pack

Extract article ($0.010): Extract the main article content from any public URL as clean markdown. Returns title, byline, excerpt, word count, and markdown.
Page metadata ($0.002): Fetch page metadata for a URL: title, description, OpenGraph, Twitter cards, canonical URL, favicon.
PDF to Markdown ($0.01, POST /api/pdf-to-markdown): Convert a PDF to clean markdown, with headings, paragraphs, and bullets reconstructed from the text layer, ready to drop into a model's context. Body: {"url":"https://…/file.pdf"}.
Extract / split PDF pages ($0.003, POST /api/pdf-extract-pages): Pull a subset of pages into a new PDF (split). Body: {"url":"https://…/file.pdf","pages":"1-3,5"}. Returns the new PDF as base64.
Browser render ($0.02): Render a page in a real headless Chromium browser (JavaScript executed), then extract the main content as clean markdown. Use this for SPAs and JS-heavy sites where plain fetching returns an empty shell.
Image OCR ($0.01, POST /api/image-ocr): Extract text from an image (PNG/JPEG). Returns the full text, overall confidence (0-100), and per-line bounding boxes. Send either {image: base64} or {url: 'https://…'}. Runs pure-CPU Tesseract via tesseract.js, with no upstream API and no keys. Default lang is 'eng'; pass 'lang' (ISO 639-2) for other languages.
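
The request bodies listed above are small enough to build by hand, but a helper avoids malformed page ranges. The following is a minimal Python sketch, not an official client: the function names are hypothetical, the page-range format is inferred only from the "1-3,5" example, and sending "lang" alongside "url" in the OCR body is an assumption based on the tool description. It builds JSON strings only and makes no network calls.

```python
import json


def pages_spec(pages):
    """Collapse page numbers into a range string such as "1-3,5"."""
    nums = sorted(set(pages))
    if not nums:
        raise ValueError("at least one page number is required")
    parts = []
    start = prev = nums[0]
    for n in nums[1:]:
        if n == prev + 1:
            # Still inside a consecutive run; extend it.
            prev = n
            continue
        parts.append(str(start) if start == prev else f"{start}-{prev}")
        start = prev = n
    parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ",".join(parts)


def extract_pages_body(pdf_url, pages):
    """JSON body for POST /api/pdf-extract-pages (shape taken from the listing)."""
    return json.dumps({"url": pdf_url, "pages": pages_spec(pages)})


def image_ocr_body(image_url, lang="eng"):
    """JSON body for POST /api/image-ocr, URL form; 'lang' placement is assumed."""
    return json.dumps({"url": image_url, "lang": lang})
```

For example, pages_spec([1, 2, 3, 5]) yields "1-3,5", matching the body shown for the split tool.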

Workflow

For an article URL, extract returns clean markdown (Readability-style) plus title, byline, word count.
For OpenGraph card data (title, description, image, canonical), meta is faster than extract.
For a PDF that lives at a URL, pdf-to-markdown converts the whole document; pdf-extract-pages pulls a specific page range.
For a SPA or paywalled page that needs JavaScript execution, render executes the page's JavaScript and returns the post-JS content; extract usually works directly against the rendered URL.
For an image URL (scanned receipt, screenshot of a table), image-ocr returns the text.
Pipeline: render → extract → embed for a robust ingest path that handles client-rendered sites without breaking.
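
The routing described in the steps above can be sketched as a small dispatcher. This is an illustrative Python sketch under stated assumptions: call_tool is a hypothetical caller-supplied function (for instance a wrapper around an MCP client) that takes a tool name and a URL and returns the tool's text output, the short tool names follow the ones used in this section, and the tool is chosen from the URL path suffix alone, which will miss PDFs or images served from extension-less URLs.

```python
from urllib.parse import urlparse

IMAGE_SUFFIXES = (".png", ".jpg", ".jpeg")


def first_tool(url):
    """Pick the first tool to try, based only on the URL path suffix."""
    path = urlparse(url).path.lower()
    if path.endswith(".pdf"):
        return "pdf-to-markdown"
    if path.endswith(IMAGE_SUFFIXES):
        return "image-ocr"
    return "extract"


def ingest(url, call_tool):
    """Return (tool_used, text) for one URL.

    call_tool(name, url) is a hypothetical function that invokes the named
    tool and returns its text output, or an empty string if there is none.
    """
    tool = first_tool(url)
    text = call_tool(tool, url)
    if tool == "extract" and not text.strip():
        # Plain fetching returned an empty shell: retry in the headless browser.
        tool = "render"
        text = call_tool(tool, url)
    return tool, text
```

Trying extract before render keeps the common case on the cheaper tool and pays the browser-render price only when the plain fetch comes back empty.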

Run it in Claude

claude mcp add agent402 -s user -- npx -y agent402-mcp@latest

Then paste this prompt into Claude:

Ingest these 10 URLs into clean markdown using Agent402. For each: try extract first; if it returns no body, fall back to render→extract; for any PDF URL, use pdf-to-markdown. Return one markdown blob per URL with the source URL as the H1.
