
At its core, the internet was built for human eyes, not machine consumption. HTML is a layout language, not a data storage format. When we try to "scrape" data, we are essentially trying to reverse-engineer a visual interface back into a database record.
The Fragility of CSS Selectors
If you’ve ever used BeautifulSoup in Python or Cheerio in Node.js, you know the drill. You right-click "Inspect" on a price tag, find a class name, and write something like:
```python
price = soup.find('span', class_='product-price-value').text
```

This works... for about ten minutes. Then the site owner pushes an update, or they're using a CSS-in-JS library that generates random class names on every build. Your selector returns `None`, your pipeline crashes, and your dashboard shows $0.00 for everything.
The JavaScript Wall
Modern web apps (React, Vue, Next.js) don't send data in the initial HTML. They send a skeleton and then fetch data via internal APIs. If you use a simple HTTP client, you get an empty shell. To get the data, you have to spin up a headless browser (like Puppeteer or Playwright), which is:
Resource Heavy: It eats RAM like Chrome does (because it is Chrome).
Slow: You have to wait for the page to "settle."
Complex: You have to manage browser contexts, cookies, and timeouts.
The "Manual" Approach: A Python Example
Let’s look at what a "traditional" robust scraper looks like. Suppose we want to get product names and prices from an e-commerce site. We’ll use Playwright because, unlike `requests`, it can actually execute the page's JavaScript.
```python
import asyncio
from playwright.async_api import async_playwright

async def get_product_data(url):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()

        # We have to mimic a real user to avoid immediate blocking
        await page.set_extra_http_headers({
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ..."
        })

        await page.goto(url, wait_until="networkidle")

        # Now we pray the selectors haven't changed
        products = []
        items = await page.query_selector_all(".inventory_item")

        for item in items:
            name = await item.query_selector(".inventory_item_name")
            price = await item.query_selector(".inventory_item_price")
            products.append({
                "name": await name.inner_text() if name else "N/A",
                "price": await price.inner_text() if price else "N/A"
            })

        await browser.close()
        return products

# Usage
# data = asyncio.run(get_product_data("https://www.example-store.com"))
```

The "Hidden" Costs of This Code
On the surface, this code looks fine. But as a developer, you know what’s coming next:
The Anti-Bot Wall: The site notices you're running a headless browser and throws a CAPTCHA.
The Parsing Headache: The price comes back as "$1,200.00 USD". Now you need a regex or a parsing library to turn that into a float.
The Schema Shift: One product has a "Discount Price" and your logic only looks for "Price." You’re missing data.
The Maintenance Trap: Why "In-House" Scraping Scales Poorly
Most teams start with a few scripts. Then they realize they need:
Proxies: To avoid IP bans.
Retries: Because the web is flaky.
Data Validation: To ensure the "price" is actually a number.
Monitoring: To know when the site layout changed.
Suddenly, you aren't a Backend Engineer anymore; you're a "Data Acquisition Engineer" spending 40% of your week fixing broken scrapers. It’s a massive drain on productivity.
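The "retries" item on that list alone is a non-trivial piece of code. A minimal sketch of exponential backoff with jitter (`with_retries` is an illustrative helper, not a library function):

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def with_retries(fn: Callable[[], T], attempts: int = 3, base_delay: float = 1.0) -> T:
    """Run fn, retrying on failure with exponential backoff and jitter."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the real error
            # Exponential backoff plus jitter so retries don't synchronize
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))

# Usage (fetch_page is hypothetical):
# data = with_retries(lambda: fetch_page("https://example.com"), attempts=5)
```

And that's before you add per-status-code handling, proxy rotation on ban, and alerting when retries are exhausted.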
Developer Pro-Tip: If you find yourself writing more than three nested `if-else` blocks just to find a single piece of data on a webpage, you've already lost the battle against the DOM.
A Better Way: From HTML to Type-Safe APIs
What if we stopped treating web data extraction as a "scraping" problem and started treating it as a "transformation" problem?
Instead of writing custom logic for every site, the industry is moving toward AI-powered structured extraction. By combining browser automation with Large Language Models (LLMs), we can tell a system: "Here is a URL. I want the product name, the price as a number, and the availability as a boolean."
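As a rough sketch of that idea: you hand the model a target schema and the page content, and parse the JSON it returns. Here `call_llm` is a stand-in stub (a real pipeline would call your LLM provider of choice); everything else shows the shape of the approach:

```python
import json

def call_llm(prompt: str) -> str:
    # Hypothetical stub: a real implementation would send `prompt`
    # to an LLM API and return its text response.
    return '{"name": "Backpack", "price": 29.99, "in_stock": true}'

def extract_product(page_text: str) -> dict:
    schema = {"name": "string", "price": "number", "in_stock": "boolean"}
    prompt = (
        "Extract the following fields from this page as JSON matching "
        f"this schema: {json.dumps(schema)}\n\nPage:\n{page_text}"
    )
    return json.loads(call_llm(prompt))

result = extract_product("Backpack - $29.99 - In stock")
```

The key shift: the selector logic lives in the model's understanding of the page, not in hard-coded class names.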
This is where tools like ManyPI change the game. Instead of you managing headless browsers, proxy rotations, and brittle CSS selectors, ManyPI acts as a middleware that turns any website into a structured, type-safe API.
How it works in practice
Imagine you don't have to write any Playwright code. Instead, you send a request to a single endpoint and get back clean JSON that matches your specific schema.
Here’s how you’d get that same product data using ManyPI:
```shell
curl -X POST \
  'https://app.manypi.com/api/scrape/YOUR_API_ENDPOINT_ID' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -H 'Content-Type: application/json'
```

Why this is a paradigm shift for developers:
Type Safety: You define the schema. The data comes back in the format you expect (e.g., numbers are actually numbers, not strings with currency symbols).
Resilience: Because ManyPI doesn't rely solely on hard-coded CSS selectors, it can often survive small layout changes that would break a traditional scraper.
Speed to Production: You go from "writing a scraper" to "calling an API" in seconds.
Practical Best Practices for Structured Data Extraction
Regardless of the tool you use, if you're dealing with web data at scale, you should follow these pragmatic guidelines:
1. Define Your Schema Early
Don't just "grab everything." Use a validation library like Zod (TypeScript) or Pydantic (Python) to define what your "Clean Data" looks like. If the incoming data doesn't match, fail loudly and early.
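In Python, a minimal Pydantic version of such a "fail loudly" schema might look like this (model and field names are illustrative):

```python
from pydantic import BaseModel, ValidationError

class Product(BaseModel):
    name: str
    price: float    # fails loudly if the scraper hands us "$1,200.00"
    in_stock: bool

# A clean record passes:
product = Product(name="Backpack", price=29.99, in_stock=True)

# A schema mismatch surfaces immediately instead of poisoning the pipeline:
try:
    Product(name="Backpack", price="$29.99", in_stock=True)
except ValidationError as e:
    print("Bad record:", e)
```

Validating at the boundary means a layout change shows up as a loud error today, not a silent column of garbage in next month's report.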
2. Respect the robots.txt
Just because you can scrape it doesn't always mean you should at a high frequency. Be a good web citizen. Check the site's robots.txt and try to run your jobs during off-peak hours if you're hitting them hard.
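Python's standard library can do this check for you. A small sketch using `urllib.robotparser` (the rules shown are made up for illustration; in practice you'd fetch the live file with `rp.set_url(...)` and `rp.read()`):

```python
from urllib.robotparser import RobotFileParser

# Parse a robots.txt body directly for demonstration
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /admin/",
    "Allow: /products/",
])

print(rp.can_fetch("my-scraper", "https://example.com/products/123"))  # True
print(rp.can_fetch("my-scraper", "https://example.com/admin/users"))   # False
```

Gating every fetch behind a `can_fetch` check is cheap insurance, both ethically and operationally.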
3. Handle Ethics and Legality
Always ensure you have the right to use the data you are collecting, especially if it’s PII (Personally Identifiable Information). Structured data is powerful—use it responsibly.
4. Cache Everything
The web is slow and expensive. If the data you’re pulling doesn't change every minute, cache it in Redis or a simple S3 bucket. Your latency (and your API bill) will thank you.
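Even a tiny in-process TTL cache goes a long way. A minimal sketch (illustrative only; Redis or S3 would replace this dict in production):

```python
import time

class TTLCache:
    """Tiny in-process TTL cache; swap for Redis/S3 in production."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store: dict = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() > expires_at:
            del self._store[key]  # expired: evict and report a miss
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, time.monotonic() + self.ttl)

# Usage (fetch is hypothetical):
# cache = TTLCache(ttl_seconds=300)
# if (data := cache.get(url)) is None:
#     data = fetch(url)
#     cache.set(url, data)
```

The pattern is the same at any scale: check the cache, fetch only on a miss, write back with a TTL that matches how often the source actually changes.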
When to Build vs. When to Buy (or Use a SaaS)
I love building things from scratch as much as the next dev. There’s a certain satisfaction in getting a complex Puppeteer script to bypass a tough anti-bot measure.
But ask yourself: Is this the core value of my product?
If you're building a groundbreaking AI for fashion, your value is in the AI, not in the script that pulls images from clothing sites. Every hour you spend debugging a SelectorNotFoundException is an hour you aren't improving your core product.
Use a manual approach (Playwright/BeautifulSoup) if:
You are scraping a single, simple site that never changes.
You have zero budget but infinite time.
The data is hidden behind a complex multi-step login flow that requires manual intervention.
Use a solution like ManyPI if:
You need to extract data from dozens or hundreds of different domains.
You need the data to be structured and typed (JSON) immediately.
You want to build a "Type-Safe" data pipeline that doesn't break every Friday afternoon.
The Path Forward: Web Data as an Asset, Not a Chore
Getting structured web data doesn't have to be a dark art. As LLMs and automated extraction tools mature, the "API-less" web is slowly becoming accessible to every developer.
By moving away from brittle CSS selectors and resource-heavy browser management, you can focus on what actually matters: what you do with the data.
Whether you're building a price aggregator, a lead generation tool, or a research engine, the goal is the same: clean, reliable data delivered at scale. Tools like ManyPI are making that "Friday afternoon request" from your PM a lot less scary.
Ready to stop debugging HTML and start using data? You can try turning your first website into an API in seconds. Your terminal (and your sanity) will thank you.
Written by
Ole Mai
Founder / ManyPI

