Content extraction

Turn arbitrary URLs and PDFs into clean structured text — articles, page metadata, PDF pages, OCR'd images, browser-rendered SPAs.

When to use this pack

Use this pack when you are building a RAG corpus, assembling a daily newsletter from a list of source URLs, or extracting a table from a scanned PDF.

Tools in this pack

Workflow

  1. For an article URL, extract returns clean markdown (Readability-style) plus title, byline, word count.
  2. For OpenGraph card data (title, description, image, canonical), meta is faster than extract.
  3. For a PDF that lives at a URL, pdf-to-markdown converts the whole document; pdf-extract-pages pulls a specific page range.
  4. For a SPA or paywalled page that needs JavaScript execution, render returns the post-JS HTML — extract usually works directly against the rendered URL.
  5. For an image URL (scanned receipt, screenshot of a table), image-ocr returns the text.
  6. For a robust ingest path, chain render → extract → embed; rendering first means client-rendered sites still yield a body to extract.
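
The routing in the steps above can be sketched as a small Python helper. This is only an illustration: the function name `choose_tool`, its `needs_js` and `metadata_only` flags, and the list of image suffixes are assumptions made for the example, not part of Agent402. The returned strings are the tool names used in the workflow.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical set of suffixes treated as images; adjust to your sources.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".gif", ".webp", ".tiff"}


def choose_tool(url: str, needs_js: bool = False, metadata_only: bool = False) -> str:
    """Pick the workflow tool for a URL, following the steps listed above."""
    # Look only at the URL path so query strings do not hide the file suffix.
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    if suffix == ".pdf":
        return "pdf-to-markdown"
    if suffix in IMAGE_SUFFIXES:
        return "image-ocr"
    if metadata_only:
        return "meta"      # OpenGraph card data only
    if needs_js:
        return "render"    # SPA or other page that needs JavaScript
    return "extract"       # default: article extraction
```

For example, `choose_tool("https://example.com/report.pdf")` selects `pdf-to-markdown`, while a plain article URL falls through to `extract`.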

Run it in Claude

claude mcp add agent402 -s user -- npx -y agent402-mcp@latest

Then paste this prompt into Claude:

Ingest these 10 URLs into clean markdown using Agent402. For each: try extract first; if it returns no body, fall back to render→extract; for any PDF URL, use pdf-to-markdown. Return one markdown blob per URL with the source URL as the H1.
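
The fallback logic this prompt describes can also be sketched in Python. This is a minimal sketch under stated assumptions: `ingest` and the three callables passed to it are hypothetical stand-ins for however you invoke the tools; nothing here is the Agent402 API itself.

```python
from typing import Callable, Dict, Iterable
from urllib.parse import urlparse

# Hypothetical tool signature: takes a URL, returns markdown text.
Tool = Callable[[str], str]


def ingest(
    urls: Iterable[str],
    extract: Tool,
    render_then_extract: Tool,
    pdf_to_markdown: Tool,
) -> Dict[str, str]:
    """Return one markdown blob per URL, with the source URL as the H1."""
    results: Dict[str, str] = {}
    for url in urls:
        if urlparse(url).path.lower().endswith(".pdf"):
            body = pdf_to_markdown(url)
        else:
            body = extract(url)
            # An empty body suggests a client-rendered page: render first, then extract.
            if not body.strip():
                body = render_then_extract(url)
        results[url] = f"# {url}\n\n{body.strip()}\n"
    return results
```

Passing the tools in as callables keeps the control flow testable with fakes before it is wired to real tool calls.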
