Content extraction
Turn arbitrary URLs and PDFs into clean structured text — articles, page metadata, PDF pages, OCR'd images, browser-rendered SPAs.
When to use this pack
Building a RAG corpus, assembling a daily newsletter from a list of source URLs, or extracting a table from a scanned PDF.
Tools in this pack
Workflow
- For an article URL, extract returns clean markdown (Readability-style) plus title, byline, word count.
- For OpenGraph card data (title, description, image, canonical URL), meta is faster than extract.
- For a PDF that lives at a URL, pdf-to-markdown converts the whole document; pdf-extract-pages pulls a specific page range.
- For a SPA or paywalled page that needs JavaScript execution, render returns the post-JS HTML — extract usually works directly against the rendered URL.
- For an image URL (scanned receipt, screenshot of a table), image-ocr returns the text.
- Pipeline: render → extract → embed gives a robust ingest path that handles client-rendered sites without breaking.
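
The routing in the workflow above can be sketched as a small pure function. This is only an illustration: the tool names come from the list above, but the function `pick_tool`, its `needs_js` and `card_only` flags, and the file-extension heuristics are assumptions, not part of Agent402.

```python
import os
from urllib.parse import urlparse

# Hypothetical extension list: the pack does not document how file types are detected.
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".gif", ".webp", ".tiff"}


def pick_tool(url: str, *, needs_js: bool = False, card_only: bool = False) -> str:
    """Return the name of the tool the workflow suggests for this URL.

    needs_js and card_only are illustrative flags supplied by the caller;
    nothing here talks to Agent402 itself.
    """
    # Look only at the path so query strings do not hide the extension.
    path = urlparse(url).path.lower()
    ext = os.path.splitext(path)[1]

    if ext == ".pdf":
        return "pdf-to-markdown"   # whole-document conversion
    if ext in IMAGE_EXTS:
        return "image-ocr"         # scanned receipt, screenshot of a table
    if card_only:
        return "meta"              # OpenGraph card data only
    if needs_js:
        return "render"            # SPA: get post-JS HTML first, then extract
    return "extract"               # default: article URL to clean markdown
```

For a page range rather than a whole PDF, the caller would choose pdf-extract-pages instead of pdf-to-markdown; that choice depends on intent, not on the URL, so it is left out of the sketch.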
Run it in Claude
claude mcp add agent402 -s user -- npx -y agent402-mcp@latest
Then paste this prompt into Claude:
Ingest these 10 URLs into clean markdown using Agent402. For each: try extract first; if it returns no body, fall back to render→extract; for any PDF URL, use pdf-to-markdown. Return one markdown blob per URL with the source URL as the H1.
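
If you would rather script the same fallback logic than prompt for it, a minimal sketch might look like the following. The three callables (`extract`, `render_then_extract`, `pdf_to_markdown`) are hypothetical wrappers you would write around the corresponding tools; their names and their "URL in, markdown string out" shape are assumptions, since the pack's actual call signatures are not shown here.

```python
from typing import Callable
from urllib.parse import urlparse

# Hypothetical shape: each wrapper takes a URL and returns markdown (possibly empty).
Fetcher = Callable[[str], str]


def ingest(
    url: str,
    *,
    extract: Fetcher,
    render_then_extract: Fetcher,
    pdf_to_markdown: Fetcher,
) -> str:
    """Return one markdown blob for the URL, with the source URL as the H1."""
    if urlparse(url).path.lower().endswith(".pdf"):
        # PDF URLs skip extract entirely.
        body = pdf_to_markdown(url)
    else:
        # Try extract first; fall back only when it returns no body.
        body = extract(url)
        if not body.strip():
            body = render_then_extract(url)
    return f"# {url}\n\n{body.strip()}"


def ingest_all(urls: list[str], **fetchers: Fetcher) -> list[str]:
    """Apply ingest to each URL in order, one blob per URL."""
    return [ingest(u, **fetchers) for u in urls]
```

Passing the wrappers in as arguments keeps the fallback order testable without network access: stub functions can stand in for the real tools.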