Any URL to markdown

The 'I have a URL but it might be HTML, PDF, or an image — give me clean markdown either way' workflow. HEAD-detect the content-type, branch to the right deterministic extractor (article extract for HTML, pdf-to-markdown for PDFs, OCR for images), and report token/word stats on the output so the caller can budget the result against an LLM context window.

When to use this pack

An agent is handed a URL by a user and needs LLM-clean text out — but the URL might point at an HTML article, a PDF whitepaper, or a JPG screenshot, and the naive single-tool approach (just call extract on everything) silently fails on PDFs (returns empty) and images (returns nothing at all). Agents currently hand-roll the content-type detection + branching, often badly: they call extract first, get an empty body, then guess at a PDF extractor. This pack hands them the canonical decision tree — HEAD probe → branch → extract → stat — as a single workflow. Output is markdown plus a {chars, words, est_tokens} block so the caller can decide whether to chunk before feeding an LLM. The same pattern powers any 'ingest the document at this URL' agent step: research assistants, RAG ingest pipelines, document QA bots, archive-to-knowledge-base scripts.

Tools in this pack

http-headers, extract, html-to-markdown, pdf-to-markdown, image-ocr, text-stats

Workflow

  1. Call http-headers with the URL to fetch the response headers without downloading the body. The decisive field is Content-Type: `text/html` → step 2, `application/pdf` → step 3, `image/*` → step 4. If Content-Type is missing or generic (`application/octet-stream`), fall back to extension sniffing on the URL path: `.pdf` → step 3; `.png`, `.jpg`, `.jpeg`, `.gif`, `.webp` → step 4; anything else defaults to HTML (step 2). http-headers is cheaper than a full fetch and also returns the HTTP status, so bail early with a clear message on a 4xx/5xx response before spending money on a downstream extractor.
  2. HTML branch — call extract on the URL for the readable article body as clean markdown. extract handles the boilerplate-stripping (nav, footer, sidebars, cookie banners) and returns title + byline + excerpt + markdown. If extract returns an empty markdown body (some heavy SPAs render fully client-side and extract can't see the article without JS), fall back to html-to-markdown on the same URL — it converts the raw DOM verbatim, which is noisier but never empty. Cost: extract is wallet-only ($0.005); html-to-markdown is PoW-eligible.
  3. PDF branch — call pdf-to-markdown with the URL. Returns markdown preserving headings, paragraphs, and bullet structure from the PDF's text layer. For scanned PDFs (image-only, no text layer), pdf-to-markdown will return empty or near-empty — in that case the caller should hand the PDF pages to image-ocr (step 4) page-by-page, which is more expensive but the only path that works on scans. Most modern PDF whitepapers and research papers have proper text layers and don't need OCR fallback.
  4. Image branch — call image-ocr with the imageUrl. Returns the recognized text plus per-word confidence scores. Confidence < 60 on most words signals a low-resolution or low-contrast source — surface this to the caller so they don't trust the output as authoritative text. image-ocr is the catch-all for screenshots, scanned receipts, whiteboard photos, and image-only PDF pages. The output is plain text (not markdown) — wrap it in a single fenced code block if downstream needs a markdown payload.
  5. Optional: re-render. If the agent wants *raw* HTML-to-markdown instead of the boilerplate-stripped article that extract returns (e.g. for archiving a documentation page where the nav links matter), swap step 2 for html-to-markdown directly. It's the same shape (URL → markdown), just more verbose. Both extract and html-to-markdown return the same field name (`markdown`), so the rest of the workflow is interchangeable.
  6. Finalize: call text-stats with the markdown body to get the character and word counts, and estimate the token count as ≈chars/4. This is a budget step: it tells the caller whether the result fits in a single LLM call (<32k tokens), needs chunking (32k-200k), or warrants RAG-style ingestion (>200k). Final payload: { url, contentType, branch: 'html'|'pdf'|'image'|'html-raw', markdown: '<body>', stats: { chars, words, est_tokens } }, a single object the caller's LLM-input layer consumes directly.
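The decision logic in steps 1 and 2 can be sketched in Python without committing to how the tools are actually invoked. This is a minimal sketch, not the pack's implementation: `pick_branch` and `html_branch` are hypothetical helper names, the `extract` and `html_to_markdown` parameters are caller-supplied functions standing in for the real tool calls, and sniffing the extension for any non-decisive Content-Type (not only a missing or generic one) is an assumption.

```python
from typing import Callable, Optional, Tuple
from urllib.parse import urlparse

IMAGE_EXTS = (".png", ".jpg", ".jpeg", ".gif", ".webp")


def pick_branch(status: int, content_type: Optional[str], url: str) -> str:
    """Return 'html', 'pdf', or 'image' from the header probe (hypothetical helper)."""
    if status >= 400:
        # Bail early: do not spend money on an extractor for a dead URL.
        raise ValueError(f"unreachable: HTTP {status}")
    # Drop parameters such as '; charset=utf-8' and normalise case.
    mime = (content_type or "").split(";", 1)[0].strip().lower()
    if mime == "text/html":
        return "html"
    if mime == "application/pdf":
        return "pdf"
    if mime.startswith("image/"):
        return "image"
    # Assumption: any other type (missing, octet-stream, ...) falls back
    # to extension sniffing on the URL path, ignoring the query string.
    path = urlparse(url).path.lower()
    if path.endswith(".pdf"):
        return "pdf"
    if path.endswith(IMAGE_EXTS):
        return "image"
    return "html"


def html_branch(
    url: str,
    extract: Callable[[str], str],
    html_to_markdown: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (branch, markdown): article extract first, raw conversion if empty."""
    markdown = extract(url)
    if markdown.strip():
        return "html", markdown
    # Empty body, e.g. a client-side-rendered SPA: use the noisier converter.
    return "html-raw", html_to_markdown(url)
```

Keeping the tool calls as injected functions means the branching can be tested offline, for example `html_branch(url, lambda u: "", lambda u: "# raw")` exercises the fallback path without touching the network.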

Run it in Claude

claude mcp add agent402 -s user -- npx -y agent402-mcp@latest

Then paste this prompt into Claude:

Convert https://example.com to clean markdown using Agent402, branching on content-type.

(1) http-headers with url=https://example.com — return {status, headers}. Read headers['content-type']. If status >= 400, abort with {error: 'unreachable', status}. (2) Branch: if content-type starts with 'text/html' → call extract with url=https://example.com, return {title, markdown}. If empty markdown, retry with html-to-markdown (url=https://example.com). If content-type is 'application/pdf' or the path ends in .pdf → call pdf-to-markdown with url=https://example.com, return {markdown}. If content-type starts with 'image/' or the path ends in .png/.jpg/.jpeg/.gif/.webp → call image-ocr with imageUrl=https://example.com, return {text, confidence}. Wrap the text in a single fenced code block as the markdown. (3) text-stats with text=<markdown from step 2> — return {chars, words}. Compute est_tokens = Math.ceil(chars/4). Final return: {url: 'https://example.com', contentType: <from step 1>, branch: 'html'|'pdf'|'image'|'html-raw', markdown: <step 2 result>, stats: {chars, words, est_tokens}, warnings: [<'low OCR confidence' if image branch and confidence<60, 'empty extract — used raw html-to-markdown' if html-raw fallback>]}. Budget ~$0.018 paid; 4 of 6 tools are PoW-eligible (extract and pdf-to-markdown are wallet-only).
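The final assembly described in the prompt can be sketched in Python as well. This is an illustration under stated assumptions, not the pack's code: all function names are hypothetical, `ocr_confidence` is assumed to be a single aggregate score as in the prompt's `{text, confidence}`, the tier labels are invented names for the three budget ranges in the workflow, and the warning strings only loosely follow the prompt's wording.

```python
import math
import re
from typing import Optional


def wrap_as_code_block(text: str) -> str:
    """Wrap plain OCR text in one fenced block whose fence is longer
    than any run of backticks inside the text."""
    longest = max((len(run) for run in re.findall(r"`+", text)), default=0)
    fence = "`" * max(3, longest + 1)
    return f"{fence}\n{text}\n{fence}"


def estimate_tokens(chars: int) -> int:
    """Rough token estimate: one token per four characters, rounded up."""
    return math.ceil(chars / 4)


def budget_tier(est_tokens: int) -> str:
    """Map a token estimate to the workflow's three ranges (labels are made up)."""
    if est_tokens < 32_000:
        return "single-call"
    if est_tokens <= 200_000:
        return "chunk"
    return "rag"


def build_payload(
    url: str,
    content_type: str,
    branch: str,
    markdown: str,
    chars: int,
    words: int,
    ocr_confidence: Optional[float] = None,
) -> dict:
    """Assemble the final object; chars and words come from text-stats."""
    warnings = []
    if branch == "image" and ocr_confidence is not None and ocr_confidence < 60:
        warnings.append("low OCR confidence")
    if branch == "html-raw":
        warnings.append("empty extract, used raw html-to-markdown")
    return {
        "url": url,
        "contentType": content_type,
        "branch": branch,
        "markdown": markdown,
        "stats": {
            "chars": chars,
            "words": words,
            "est_tokens": estimate_tokens(chars),
        },
        "warnings": warnings,
    }
```

Choosing the fence length from the text matters for screenshots of code or chat logs, where the OCR output may itself contain backtick runs that would otherwise close the block early.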
