
How to extract data from a website?

It's a 21st-century truism that web data touches virtually every aspect of our daily lives. We create, consume, and interact with it while we’re working, shopping, traveling, and relaxing. It’s not surprising, then, that web data often makes the difference when companies are trying to innovate and get ahead of their competitors. But how do you extract data from a website? And what’s this thing called ‘web scraping’?

Why would you want to extract data from a webpage?

Up-to-date, trustworthy data from other websites is the rocket fuel that can power every organization’s successful growth, including your own.

There are multiple reasons you may want to extract data from the web. You might want to compare the pricing of competitors’ products across popular e-commerce sites. You could be monitoring customer sentiment by trawling for name-checks for your brand – favorable or otherwise – in news articles and blogs. Or you might be gleaning information about a particular industry or market sector to guide critical investment decisions.

A concrete example of where the ability to extract data from the web plays an increasingly valuable role in the financial services industry is insurance underwriting and credit scoring. There are billions of ‘credit invisibles’ around the world, in both developing and mature markets.

Although these individuals don’t possess a standard credit history, there’s a huge range of ‘alternative data’ sources out there, helping lenders assess risk and potentially take these individuals on as clients. These sources range from debit card transactions and utility payments to survey responses, social media posts on a particular topic, and product reviews. Read our blog that explains how public web data can provide financial services providers with a precise, insightful alternative dataset.

Also in the financial sector, hedge fund managers are turning to alternative data – beyond the scope of conventional sources like company reports and bulletins – to help inform their investment decisions. We’ve blogged recently about the value of web data in this space, and how Zyte can help deliver standards-compliant custom data feeds that complement traditional research methodologies.

What is so important about data?

Data, in short, is the differentiating factor for companies when it comes to understanding customers, knowing what competitors are up to – or making just about any kind of commercial decisions based on hard facts rather than intuition.

The web holds answers to all these questions and countless more. Think of it as the world’s biggest and fastest-growing research library. There are billions of web pages out there. This is where knowing how to extract data comes into play. Unlike a static library, however, many of those pages present a moving target when details like product pricing can change regularly.

Whether you’re a developer or a marketing manager, getting your hands on reliable, timely web data might seem like searching for a needle in a huge, ever-changing digital haystack.

The best way to access high-quality and timely web data is to work with a web data partner like Zyte.

What is web scraping?

So you know your business needs to extract data from the web.

What happens next?

There’s nothing to stop you from collecting data from any website manually by cutting and pasting the relevant bits you need from other websites. But it’s easy to make errors, and it’s going to be fiddly, repetitive, and time-consuming for whoever’s been tasked with the job. And by the time you’ve gathered all the data you need, there’s no guarantee that the price or availability of a particular product hasn’t changed.

For all but the smallest projects, you’ll need to turn to some kind of automated extraction solution. Often referred to as ‘web scraping’, data extraction is the art and science of grabbing relevant web data – maybe from a handful of pages, or hundreds of thousands – and serving it up in a neatly organized structure that your business can make sense of.

So how does data extraction work? In a nutshell, it makes use of computers to mimic the actions of human beings when they’re finding specific information on a website, quickly, accurately, and at scale. Webpages are designed primarily for the benefit of humans. They tend to present information in ways that we can easily process, understand, and interact with.

If it’s a product page, for example, the name of a book or a pair of trainers is likely to be shown pretty near the top, with the price nearby and probably with an image of the product too. Along with a host of other clues lurking in the HTML code of that webpage, these visual pointers can help a machine pinpoint the data you’re after with impressive accuracy.

There are various practical ways to tackle the challenges you’ll face when you extract data.

The crudest is to make use of the wide range of open-source scraping tools that are out there. In essence, these are chunks of ready-written code that scan the HTML content of a webpage, pull out the bits you need, and file them into some kind of structured output.

Going down the open-source route has the obvious appeal of being ‘free’. But it’s not a task for the faint-hearted, and your own developers will spend a fair amount of time writing scripts and tweaking off-the-shelf code to meet the needs of a specific job.
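To give a sense of what those chunks of ready-written code look like in practice, here’s a minimal sketch – using the same requests and BeautifulSoup libraries as the walkthrough below – that fetches a page and files one piece of it into a structured record:

import requests
from bs4 import BeautifulSoup

# Fetch a page and pull a single field out of its HTML (illustrative only).
response = requests.get('http://books.toscrape.com')
parser = BeautifulSoup(response.text, 'html.parser')

record = {'page_title': parser.select_one('title').text.strip()}
print(record)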

Step-by-step on how to extract data from a product page

OK – it’s time to put all this web scraping theory into practice so you can extract the data you need.

Here’s a worked example that illustrates the three key steps in a real-world extraction project.

1. Create an extraction script

To keep things simple, we are going to use the requests and Beautiful Soup (bs4) libraries to create our script.
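If you don’t already have them installed, both libraries – along with price_parser, which we’ll use later to clean up prices – can typically be installed with pip:

pip install requests beautifulsoup4 price-parser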

As an example, I will be extracting product data from this website: books.toscrape.com

The extraction script will contain two functions:

  1. A crawler to find product URLs
  2. A scraper that will actually extract information from a website

Making requests is an important part of the script: both for finding the product URLs and fetching the product HTML files. So first, let’s start off by creating a new class and adding the base URL of the website:

class ProductExtractor(object):
    BASE_URL = 'http://books.toscrape.com'

Then, let’s create a simple function that will help us make requests:

import requests

def make_request(self, url):
    return requests.get(url)

The requests.get() function is fairly simple in itself, but if you want to scale up your requests with proxies later, you will only need to modify this one function rather than every place in the script where a request is made.
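As a side note, if you want the function to be a little more defensive, a timeout and a status check can be added here – an optional variation on the snippet above, not something the rest of the walkthrough depends on:

import requests

def make_request(self, url):
    # Fail fast on slow responses and raise an exception for HTTP error codes.
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response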

Extract product URLs

I will only extract products from one category, called Travel, to get some sample data. Here, the task is to find all the product URLs on this category page and return them in some kind of iterable format, so we have each URL to make a request to:

from urllib.parse import urljoin
from bs4 import BeautifulSoup

def extract_urls(self, start_url):
    response = self.make_request(start_url)
    parser = BeautifulSoup(response.text, 'html.parser')
    product_links = parser.select('article.product_pod > h3 > a')
    for link in product_links:
        relative_url = link.attrs.get('href')
        absolute_url = urljoin(self.BASE_URL, relative_url.replace('../../..', 'catalogue'))
        yield absolute_url

This is what this function does, line by line:

  1. Make a request to the category page (start_url)
  2. Create a BeautifulSoup object to parse the HTML of the category page
  3. Select each product link on the page using the specified CSS selector
  4. Iterate over the extracted links, which at this point are <a> elements
  5. Extract the relative URL from each <a> element by reading its href attribute
  6. Convert the relative URL to an absolute URL
  7. Yield the absolute URLs as a generator
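As a quick sanity check – assuming make_request() and extract_urls() are defined as methods of ProductExtractor, as in the full script at the end of this section – you can print the URLs the generator produces:

extractor = ProductExtractor()
category_url = 'http://books.toscrape.com/catalogue/category/books/travel_2/index.html'
for product_url in extractor.extract_urls(category_url):
    print(product_url)
    # e.g. http://books.toscrape.com/catalogue/<book-slug>/index.html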

2. Extract product fields

The other important part of our data extraction script is the product extractor function.

def extract_product(self, url):
    response = self.make_request(url)
    parser = BeautifulSoup(response.text, 'html.parser')
    book_title = parser.select_one('div.product_main > h1').text
    price_text = parser.select_one('p.price_color').text
    stock_info = parser.select_one('p.availability').text.strip()
    product_data = {
        'title': book_title,
        'price': self.clean_price(price_text),
        'stock': stock_info
    }
    return product_data

As you can see above, for the price field I needed to do some cleaning because it contained the currency symbol and other characters as well. Luckily, there’s an open-source library that can do the heavy lifting of parsing the price value for us: price_parser (created by Zyte):

from price_parser import Price

def clean_price(self, price_text):
    return Price.fromstring(price_text).amount_float

This function returns the price of the product - extracted from text - as a float value.
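To illustrate what price_parser does on its own (a standalone snippet, separate from the class):

from price_parser import Price

price = Price.fromstring('£51.77')
print(price.amount_float)  # 51.77
print(price.currency)      # £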

3. The main function

And finally – this is the main function when it comes to extracting data from a product page. It is where we put together extract_urls() and extract_product().

def main():
    extractor = ProductExtractor()
    product_urls = extractor.extract_urls('http://books.toscrape.com/catalogue/category/books/travel_2/index.html')
    extracted_data = []
    for url in product_urls:
        product_data = extractor.extract_product(url)
        extracted_data.append(product_data)
    extractor.export_json(extracted_data, 'data.json')

And the export_json() function:

import json

def export_json(self, data, file_name):
    with open(file_name, 'w') as f:
        json.dump(data, f)

The end result is a clean JSON data file, something like this:

[
  {
    "title": "It's Only the Himalayas",
    "price": 45.17,
    "stock": "In stock (19 available)"
  },
  {
    "title": "Full Moon over Noah\u2019s Ark: An Odyssey to Mount Ararat and Beyond",
    "price": 49.43,
    "stock": "In stock (15 available)"
  },
  {
    "title": "See America: A Celebration of Our National Parks & Treasured Sites",
    "price": 48.87,
    "stock": "In stock (14 available)"
  },
  {
    "title": "Vagabonding: An Uncommon Guide to the Art of Long-Term World Travel",
    "price": 36.94,
    "stock": "In stock (8 available)"
  },
  {
    "title": "Under the Tuscan Sun",
    "price": 37.33,
    "stock": "In stock (7 available)"
  },
  {
    "title": "A Summer In Europe",
    "price": 44.34,
    "stock": "In stock (7 available)"
  },
  {
    "title": "The Great Railway Bazaar",
    "price": 30.54,
    "stock": "In stock (6 available)"
  },
  {
    "title": "A Year in Provence (Provence #1)",
    "price": 56.88,
    "stock": "In stock (6 available)"
  },
  {
    "title": "The Road to Little Dribbling: Adventures of an American in Britain (Notes From a Small Island #2)",
    "price": 23.21,
    "stock": "In stock (3 available)"
  },
  {
    "title": "Neither Here nor There: Travels in Europe",
    "price": 38.95,
    "stock": "In stock (3 available)"
  },
  {
    "title": "1,000 Places to See Before You Die",
    "price": 26.08,
    "stock": "In stock (1 available)"
  }
]
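For reference, here is how all the pieces above fit together as a single runnable script. The method bodies are the ones shown earlier; the only additions are grouping them as methods of ProductExtractor, moving the imports to the top, and adding a standard __main__ guard:

import json
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
from price_parser import Price


class ProductExtractor(object):

    BASE_URL = 'http://books.toscrape.com'

    def make_request(self, url):
        return requests.get(url)

    def extract_urls(self, start_url):
        # Crawl the category page and yield absolute product URLs.
        response = self.make_request(start_url)
        parser = BeautifulSoup(response.text, 'html.parser')
        product_links = parser.select('article.product_pod > h3 > a')
        for link in product_links:
            relative_url = link.attrs.get('href')
            absolute_url = urljoin(self.BASE_URL, relative_url.replace('../../..', 'catalogue'))
            yield absolute_url

    def extract_product(self, url):
        # Scrape the title, price, and stock information from one product page.
        response = self.make_request(url)
        parser = BeautifulSoup(response.text, 'html.parser')
        book_title = parser.select_one('div.product_main > h1').text
        price_text = parser.select_one('p.price_color').text
        stock_info = parser.select_one('p.availability').text.strip()
        return {
            'title': book_title,
            'price': self.clean_price(price_text),
            'stock': stock_info,
        }

    def clean_price(self, price_text):
        # Turn a price string like '£45.17' into a float.
        return Price.fromstring(price_text).amount_float

    def export_json(self, data, file_name):
        with open(file_name, 'w') as f:
            json.dump(data, f)


def main():
    extractor = ProductExtractor()
    product_urls = extractor.extract_urls('http://books.toscrape.com/catalogue/category/books/travel_2/index.html')
    extracted_data = []
    for url in product_urls:
        product_data = extractor.extract_product(url)
        extracted_data.append(product_data)
    extractor.export_json(extracted_data, 'data.json')


if __name__ == '__main__':
    main()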

Why do you need to use a smart proxy to extract data?

There are plenty of pitfalls to negotiate during the course of any web scraping project. One of the biggest challenges comes when you’re trying to extract data at scale.

At Zyte we often talk to clients who successfully extract data from a hundred web pages a day, or a thousand. Surely, they ask, it must be just as easy to get data from a million pages daily.

Many websites use ‘anti-bot’ technology to discourage automated scraping. There are ways to bypass IP bans, the most effective being smart rotating proxies. This technique effectively lulls a target website into thinking it’s being visited innocuously by a human, rather than by an extraction script.

Here’s an illustration of how Zyte’s Smart Proxy Manager can be integrated into a data extraction script to reduce your chances of getting banned.

Remember that we created a make_request() function at the beginning so it handles all the requests in the script? Now if we want to use Smart Proxy Manager, we only need to make a small change in this function. Everything else will work just fine. To integrate Smart Proxy Manager, change this function:

def make_request(self, url):
    return requests.get(url)

to this:

def make_request(self, url):
    zyte_apikey = 'apikey'
    proxy_url = 'proxy.zyte.com:8011'
    return requests.get(
        url,
        proxies={
            "http": "http://{}:@{}".format(zyte_apikey, proxy_url),
        },
    )

In this code, we add the Smart Proxy Manager endpoint as a proxy and authenticate using the Zyte API key.
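Note that the snippet above only routes plain HTTP traffic through the proxy. If the pages you’re extracting are served over HTTPS, you’d typically route that scheme through the proxy as well – a sketch along these lines, with certificate verification configured as described in Zyte’s Smart Proxy Manager documentation:

def make_request(self, url):
    zyte_apikey = 'apikey'
    proxy = 'http://{}:@proxy.zyte.com:8011'.format(zyte_apikey)
    return requests.get(
        url,
        proxies={
            'http': proxy,
            'https': proxy,  # send HTTPS requests through the proxy too
        },
    )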

If you want to learn more about Smart Proxy Manager and how it can help you scale, check out our webinar.

Is it legal to extract data from a website?

The legality of extracting data – also known as web scraping – depends on the context of how you extract the data and how you plan to use it.

Copying information from public domain websites for your own personal review and analysis is normally permissible. But re-using other people’s copyrighted content for profit may be unethical and possibly illegal.

It’s important that you discuss your plans to extract data with legal counsel to ensure that your use is in compliance with copyright laws.

How can I extract data from a website for free?

If you’re viewing a website – just as you’re doing now – you could simply cut and paste the information you’re reading on screen into another document like a spreadsheet.

It’s certainly one way to extract data for free. But gathering information manually this way is going to be slow, inefficient, and error-prone for all but the simplest tasks.

In practice you’ll be looking at ways to automate this process, allowing you to extract data from lots of web pages – maybe thousands or millions of them per day – and serve up the results in a neatly organized structure. To achieve this you’ll need some kind of web data extraction tool, often known as a web scraper.

There are plenty of free scraping solutions out there to extract data from web pages. Some of these are dedicated applications aimed firmly at programmers, requiring a level of coding proficiency to configure and manage.

There are also some easy-to-use scrapers that run as a browser extension or plug-in with a simple point-and-click interface – ideal for non-specialists with moderate extraction needs. Less sophisticated than their developer-focused counterparts, they’re typically more limited in the variety and volume of data they let you scrape.

How can Zyte help you extract data for your projects?

At Zyte we’ve spent the best part of a decade focused on extracting the all-important web data that companies need.

Our international team of developers and data scientists includes some of the biggest brains in analytics, AI, and machine learning. And along the way we’ve developed some powerful tools – several of them protected by international patents – to help our customers extract data quickly, reliably, and cost-efficiently.

If web data is what you're interested in, we are here for you. All you have to do is tell us what you need!

Learn from the leading web scraping developers

A Discord community of over 3,000 web scraping developers and data enthusiasts dedicated to sharing new technologies and advancing the practice of web scraping.