How to Extract Data From a Website (Without Losing Your Mind)

If you’ve ever tried to pull data from a website, you already know how this usually goes.

At first, it looks easy. The data is right there on the page. You open DevTools, inspect an element, copy a selector, write a quick script — and it works. For a day. Maybe two.

Then the site changes. A class name disappears. JavaScript starts loading content lazily. Pagination behaves differently. Suddenly, what looked like a simple task turns into a brittle mess of edge cases and fixes.

This article walks through how website data extraction actually works in practice, shows real code examples, and explains when manual scraping makes sense — and when it doesn’t.

What “extracting data from a website” really means

At its core, extracting data from a website means turning unstructured content into something structured and usable — typically JSON, CSV, or database records.

Websites are built for humans. Data is mixed with layout, ads, scripts, and interaction logic. Your job is to isolate just the parts you care about and make them predictable.

That predictability is the hardest part.
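
To make that concrete, here is a rough sketch of the goal: the same product once as it appears in page markup, and once as the structured record you actually want. The field names and values are purely illustrative, not from any specific site.

# What the page gives you: markup, layout, and noise mixed together.
raw_html = """
<div class="product-card">
  <span class="title">Wireless Mouse</span>
  <span class="price">$24.99</span>
  <span class="badge">Free shipping!</span>
</div>
"""

# What you actually want: a predictable, structured record.
product = {"title": "Wireless Mouse", "price_usd": 24.99}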

Step one: understand where the data comes from

Before writing code, always answer one question:

Is the data already in the HTML, or is it loaded dynamically?

You can usually tell by viewing the page source. If the content is there, you’re dealing with a static page. If not, the site is probably fetching data via JavaScript.
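
A quick way to check from code, rather than eyeballing the page source, is to fetch the raw HTML and search it for a value you can see in the rendered page. The URL and search string below are placeholders.

import requests

# Fetch the raw HTML exactly as the server returns it, before any JavaScript runs.
url = "https://example.com/products"  # placeholder URL
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)

# Pick a value you can see in the rendered page, e.g. a product name.
if "Wireless Mouse" in response.text:
    print("Static: the data is in the initial HTML")
else:
    print("Dynamic: the data is probably loaded via JavaScript")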

This decision determines everything that follows.

Extracting data from static HTML pages

Static pages are the best-case scenario. You send an HTTP request, parse the HTML, and extract elements using selectors.

Here’s a simple Python example using requests and BeautifulSoup:

import requests
from bs4 import BeautifulSoup

url = "https://example.com/products"
headers = {"User-Agent": "Mozilla/5.0"}

response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

products = []
for card in soup.select(".product-card"):
    title = card.select_one(".title")
    price = card.select_one(".price")
    products.append({
        "title": title.text.strip() if title else None,
        "price": price.text.strip() if price else None
    })

print(products)

This works well, but it relies on one fragile assumption: that the HTML structure stays the same.

Even small redesigns can break scrapers like this.
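
You can soften (but not eliminate) that brittleness by trying several candidate selectors and logging when none of them match, so breakage is loud instead of silent. A minimal sketch, with made-up selector names:

import logging

logger = logging.getLogger("scraper")

def select_first(card, selectors):
    """Return the text of the first selector that matches, or None."""
    for selector in selectors:
        node = card.select_one(selector)
        if node:
            return node.get_text(strip=True)
    logger.warning("No selector matched: %s", selectors)
    return None

# Inside the loop from the previous example, try the current class names
# first, then older ones you have seen on the same site:
# title = select_first(card, [".title", ".product-title", "h2"])
# price = select_first(card, [".price", ".product-price", "[data-price]"])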

JavaScript-heavy websites: the real challenge

Modern websites often load data after the page renders. In these cases, the HTML response contains almost nothing useful.

One option is to use a headless browser like Playwright:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")
    page.wait_for_selector(".product-card")

    products = page.evaluate("""
        () => Array.from(document.querySelectorAll('.product-card')).map(el => ({
            title: el.querySelector('.title')?.innerText,
            price: el.querySelector('.price')?.innerText
        }))
    """)

    print(products)
    browser.close()

This works — but it’s slow, resource-intensive, and surprisingly hard to scale.

Experienced engineers typically take a different approach.

The better way: extract data from internal APIs

Most JavaScript websites load their data from JSON endpoints. You can see these in the Network tab of your browser’s developer tools.

Once you find the endpoint, you can often call it directly:

import requests

api_url = "https://example.com/api/products?page=1"

response = requests.get(api_url)
data = response.json()

print(data["items"])

This is faster, cleaner, and far more reliable than scraping rendered HTML.

But it still has problems:

  • Endpoints change

  • Authentication gets added

  • Response shapes drift over time (see the validation sketch after this list)

  • No guarantees about types or stability
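
You can at least make that drift visible by validating responses before trusting them. A minimal sketch, assuming the items, title, and price fields from the example above:

def validate_item(item):
    """Fail loudly if the response shape drifts from what we expect."""
    required = {"title": str, "price": (int, float)}
    for field, expected_type in required.items():
        if field not in item:
            raise ValueError(f"Missing field: {field}")
        if not isinstance(item[field], expected_type):
            raise TypeError(f"Unexpected type for {field}: {type(item[field]).__name__}")

# Continuing from the previous example, where data = response.json():
# for item in data["items"]:
#     validate_item(item)

Checks like this turn silent drift into an immediate error, but they only detect the problem; they do not fix it.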

Which brings us to the real pain point.

The hidden cost of manual scraping

Manual scraping looks cheap at first, but the maintenance cost adds up quickly:

  • Selectors break

  • APIs change without notice

  • Pagination logic grows complex

  • Data formats become inconsistent

  • Every new website requires custom logic

You’re not just extracting data — you’re maintaining dozens of tiny, fragile integrations.

This is exactly where most teams start looking for alternatives.

Turning websites into type-safe APIs instead

Rather than scraping every site manually, a growing number of teams are moving toward API abstraction: turning any website into a clean, typed API that behaves like a real backend.

Instead of writing and maintaining scrapers, you define what data you want — and let the platform handle how it’s extracted.
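
To make the idea concrete, "defining the data shape" can be as simple as declaring the fields you care about, for example as a dataclass. This is purely illustrative and not ManyPI's actual definition syntax.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Product:
    title: str
    price: float
    currency: str = "USD"
    in_stock: Optional[bool] = None

# The platform's job is to return records matching this shape,
# regardless of how the underlying website renders its pages.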

This is where tools like ManyPI quietly change the game.

How ManyPI fits into this workflow

ManyPI lets you treat websites as if they already had proper APIs.

Instead of writing brittle scraping logic, you:

  • Define the data shape you want

  • Get structured, type-safe responses

  • Avoid manual selector maintenance

  • Skip browser automation entirely

From a developer’s perspective, it feels like calling a normal API:

curl --request POST \
  --url https://app.manypi.com/api/scrape/{scraperId} \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://example.com/product/123",
    "options": {
      "timeout": 30000,
      "retries": 3
    }
  }'
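
The same request from Python, as a minimal sketch that mirrors the curl example above. The scraper ID and token are placeholders, and the response is printed as-is rather than assuming any particular fields.

import requests

response = requests.post(
    "https://app.manypi.com/api/scrape/{scraperId}",  # replace {scraperId} with your scraper's ID
    headers={"Authorization": "Bearer <token>"},       # replace <token> with your API token
    json={
        "url": "https://example.com/product/123",
        "options": {"timeout": 30000, "retries": 3},
    },
    timeout=60,
)
response.raise_for_status()
print(response.json())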

No HTML parsing. No selector juggling. No silent breakage.

You still control the data — you just don’t have to fight the website anymore.

When manual extraction still makes sense

Manual scraping is still useful when:

  • You’re experimenting or prototyping

  • The data source is small and stable

  • You need full control over every request

  • You’re learning how web data works

But once extraction becomes part of a product, pipeline, or automation workflow, abstraction usually wins.

Final thoughts

Extracting data from websites isn’t about clever hacks. It’s about choosing the right level of abstraction.

Sometimes that means parsing HTML. Sometimes it means calling hidden APIs. And sometimes it means stepping back and asking whether you should be doing this manually at all.

If your goal is reliable, structured, type-safe data from arbitrary websites, turning sites into APIs — instead of scraping them — is often the most pragmatic solution.

If you want to explore that route, tools like ManyPI are worth a look. They exist for one reason: so you can focus on using data, not fighting websites to get it.
