Structured scrape

Pull structured data out of any web page deterministically — articles to clean text, tables to JSON rows, specific elements via CSS selector — without writing regex against raw HTML.

When to use this pack

Extracting a product price, a sports stats table, a roster, a pricing tier, an outlink list — anything where the page has the data but no public API exposes it, and you need a repeatable deterministic answer instead of an LLM guess.

Tools in this pack

Extract article $0.010 Extract the main article content from any public URL as clean markdown. Returns title, byline, excerpt, word count, and markdown.
Browser render $0.02 Render a page in a real headless Chromium browser (JavaScript executed), then extract the main content as clean markdown. Use this for SPAs and JS-heavy sites where plain fetching returns an empty shell.
HTML select $0.001 POST /api/html-select Run a CSS selector against an HTML string and return the matches (text, attrs, and optionally outerHTML for each). The deterministic alternative to regex when you already have the HTML and know the selector — pairs with /api/render or any page you've fetched yourself.
HTML table to JSON/CSV $0.001 POST /api/html-table Extract a <table> from an HTML string as JSON rows (header-keyed) or CSV. Useful for prices, schedules, sports stats, or anything an agent has already fetched as HTML. If multiple tables match, the first is used.
HTML strip to text $0.001 POST /api/html-strip Strip all HTML tags and return plain text. Preserves block-level structure (paragraphs and headings become newline-separated). Faster and more predictable than running extract on a raw HTML string when you already have it.
HTML links $0.001 POST /api/html-links Enumerate every <a href> in an HTML string with its anchor text and rel attribute. Optionally resolves relative hrefs against a base URL and filters by a regex on the href. The deterministic way to crawl a page's outlinks without writing a regex.
HTML meta (from string) $0.001 POST /api/html-meta Extract <title>, <meta description>, OpenGraph/Twitter cards, canonical URL, and JSON-LD blocks from an HTML string. Distinct from /api/meta which fetches a URL — feed this the HTML you already have (from /api/render, /api/extract.body, or your own fetch).

Workflow

If the page is prose (an article, a blog post, a docs page), try extract first — it returns clean Readability-style markdown in one call, no HTML wrangling needed.
If the page is a SPA, paywalled-but-bypassable-with-render, or has data that lives outside the article body, fall back to render — it runs Chromium and returns the post-JS HTML you can then drill into.
Pipe the HTML from render into html-select with a CSS selector to pull specific elements (a price, a header, a button label). Use the `attr` parameter when you only need href/id/data-* values — keeps the response tight.
If the data is in a <table>, use html-table — it returns header-keyed JSON rows by default, or RFC 4180 CSV if you'd rather paste it into a spreadsheet. It picks the first matching table; pass a selector for more specificity.
If you need plain text from a specific subtree (e.g. "give me the body of <article>"), use html-strip with a selector — it preserves block-level newlines and removes <script>/<style>.
To enumerate outlinks (link audits, crawl seeds, footnote URLs), use html-links — it resolves relative hrefs against a base URL and dedups by href. Filter by regex when you only want one host or path prefix.
If you already have the rendered HTML and just want the metadata (title, description, OpenGraph, Twitter, canonical, JSON-LD), use html-meta on the string — avoids paying for a second fetch from /api/meta.

Run it in Claude

claude mcp add agent402 -s user -- npx -y agent402-mcp@latest

Then paste this prompt into Claude:

Scrape the price and SKU from https://example.com/product/42 using Agent402. (1) Try extract first; if the price isn't in the article body, (2) call render to get the post-JS HTML. (3) Use html-select with a precise CSS selector to pull the price element — fall back to a broader selector if the first returns 0 matches. (4) Use html-select again with attr="data-sku" or similar to read the SKU. Return a single JSON object {price, sku, url, source} where source = "extract" or "render" depending on which path worked.

← All skill packs