
Solution architecture part 2: How to define the scope of your web scraping project

Read time: 8 mins
Posted on: April 5, 2019
By: Colm Kenny

In this second post in our solution architecture series, we will share with you our step-by-step process for data extraction requirement gathering.

As we mentioned in the first post in this series, the ultimate goal of the requirement gathering phase is to minimize the number of unknowns (ideally, to have zero assumptions about any variable) so that the development team can build the optimal solution for the business need.

As a result, accurately defining project requirements is the most important part of any web scraping project.

In this article, we will discuss the four critical steps to scoping every web scraping project and the exact questions you should be asking yourself when planning your data extraction needs.

The requirement gathering process

The requirement gathering process can be broken into two parts: 1) understanding the business needs, and 2) defining the technical requirements to meet those needs.

The business need

The business need is at the core of every web scraping project, as it defines what objective you want to achieve.

  • Do you want to optimize your pricing strategy by monitoring your competitors’ prices?
  • Do you want to monitor online sentiment towards specific companies?
  • Do you want to grow your business by generating new leads for your sales team?
  • etc.

Once you’ve clearly defined what you’d like to achieve, you can start gathering the technical requirements.

At the start of every solution architecture process, our team puts a huge amount of focus on understanding our customers’ underlying business needs and objectives. When we truly understand your objectives, we can help you select the best data sources for your project and optimize the project if any rescoping or trade-offs are required due to technical or budgetary constraints.

Questions to ask yourself are:

  • What is your business goal, what are you trying to achieve?
  • How can web data play a role in achieving this goal?
  • What type of web data do you need?
  • How will you use this data to achieve your business goals?

On the surface, these questions may seem very simple, but we’ve found that in many cases customers come to us without a clear idea of what web data they need or how they can use it to achieve their business objectives. So the first step is to clearly identify what type of web data you need to achieve those objectives.

Technical requirements

Once everyone has a clear understanding of the business objectives, it is time to define the technical requirements of the project, i.e. how we will extract the web data we need.

There are four key parts to every web scraping project:

  1. Data discovery
  2. Data extraction
  3. Extraction scale
  4. Data delivery

We will look at each one of these individually, breaking down why each one is important and the questions you should be asking yourself at each stage.

Step 1: Data discovery

The first step of the technical requirement gathering process is defining what data needs to be extracted and where it can be found.

Without knowing what web data you need and where it can be found, most web scraping projects aren’t viable. As a result, getting a clear idea of these basic factors is crucial for moving forward with the project.

Questions to ask yourself are:

  • Do you know which sites have suitable data to extract?
  • How will the crawler navigate to the data, i.e. how will the crawler find the data on the website?
  • Do you need to login to access the desired data?
  • Do you need to input any data to filter the data on the website before extraction?
  • Do you need to access this website from a certain location to see the correct data?

By asking yourself these questions, you can really start to define the scope of the project.
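To make these questions concrete, here is a minimal sketch (in Python, using Scrapy) of what the answers might translate into once a crawler is built. It assumes a hypothetical site where the data sits behind a login form and is spread across paginated listing pages; the URLs, selectors, and credentials are purely illustrative.

import scrapy


class DiscoveryExampleSpider(scrapy.Spider):
    # Illustrates the discovery questions above: where the data lives,
    # how the crawler navigates to it, and whether a login is needed.
    name = "discovery_example"
    start_urls = ["https://example.com/login"]  # hypothetical target site

    def parse(self, response):
        # If the desired data sits behind a login, submit credentials first.
        yield scrapy.FormRequest.from_response(
            response,
            formdata={"username": "user", "password": "secret"},  # placeholders
            callback=self.parse_listing,
        )

    def parse_listing(self, response):
        # Follow each record on the listing page through to its detail page.
        for href in response.css("a.record-link::attr(href)").getall():
            yield response.follow(href, callback=self.parse_record)

        # Keep paginating until the listing runs out of pages.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse_listing)

    def parse_record(self, response):
        # The actual field extraction is the subject of Step 2 below.
        yield {"url": response.url}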

The next step in the requirement gathering process is digging into extracting the data…

Step 2: Data extraction

During this step, our goal is to clearly capture what data we want to extract from the target web pages. Oftentimes, there are vast amounts of data available on a page, so the goal here is to focus on the exact data the customer wants.

One of the best methods of clearly capturing the scope of the data extraction is to take screenshots of the target web pages and mark the fields that need to be extracted. During calls with our solution architects, we will often run through this process with customers to ensure everyone understands exactly what data is to be extracted.

Questions to ask yourself are:

  • What data fields do I want extracted?
  • Do I want to extract any images on the page?
  • Do I want to download any files (PDFs, CSVs, etc)?
  • Do I want to take a screenshot of the page?
  • Do I need the data transformed into a different format (e.g. the currency signs removed from product prices)?
  • Is all the desired data available on a single web page?

A general rule of thumb is that the more data is extracted from a page, the more complex the web scraping project becomes. Every new data type requires additional extraction logic, data quality assurance checks, and in certain circumstances more technical resources (e.g. a headless browser if a data type is rendered via JavaScript, or extra requests if multiple pages must be fetched to access the target data).
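As a rough illustration of how a field list like the one above turns into extraction logic, here is a sketch of a parsing callback for a hypothetical product page. The selectors, field names, and formatting rules are stand-ins for whatever is agreed in the annotated screenshots, and the price cleaning shows the kind of reformatting mentioned above.

import re


def parse_product(response):
    # Extraction callback for a hypothetical product page (e.g. a Scrapy
    # spider method); all selectors and field names are illustrative.
    raw_price = response.css("span.price::text").get(default="")
    yield {
        "name": response.css("h1.product-title::text").get(),
        # Strip currency signs and thousands separators, per the
        # formatting requirement discussed above.
        "price": float(re.sub(r"[^\d.]", "", raw_price) or 0),
        "currency": "".join(re.findall(r"[^\d.,\s]", raw_price)) or None,
        # Image and file URLs are captured so a download pipeline can fetch them.
        "image_urls": response.css("img.product-image::attr(src)").getall(),
        "datasheet_urls": response.css("a[href$='.pdf']::attr(href)").getall(),
        "source_url": response.url,
    }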

Once this step is complete we then look to estimate the scale of the project...

Step 3: Extraction scale

Okay, by this stage you should have a very good idea of the type of data you want to extract and how your crawlers will find and extract it. Next, our goal is to determine the scale of the web scraping project.

This is an important step as it allows you to estimate the amount of infrastructural resources (servers, proxies, data storage, etc.) you’ll need to execute the project, the amount of data you’re likely to receive, and the complexity of the project.

The three big variables when estimating the scale of your web scraping projects are the:

  1. Number of websites
  2. Number of records being extracted from each website
  3. Frequency of the data extraction crawls

After following steps 1 & 2 of the requirement gathering process, you should know exactly how many websites you’d like data extracted from (variable 1). However, estimating the number of records that will be extracted (variable 2) can be a bit trickier.

Number of records being extracted

In some cases (often in smaller projects), it is just a matter of counting the number of records (products, etc.) on a page and multiplying by the number of pages. In certain situations, the website will even list the total number of records it contains.

However, there is no one-size-fits-all solution for this question. Sometimes it is impossible to estimate the number of results you will extract, especially when the crawl is extracting hundreds of thousands or millions of records. In cases like these, often the only way to know for sure how many records the crawl will return is to actually run the crawl. You can guesstimate, but you will never know for sure.
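For the simple case described above, a back-of-the-envelope estimate might look like the following; every number here is purely illustrative.

# Rough estimate for one hypothetical target site.
records_per_page = 48     # records shown per listing page
listing_pages = 210       # taken from the site's own pagination or result count

estimated_records = records_per_page * listing_pages
print(f"Estimated records per crawl: {estimated_records:,}")    # ~10,080

# Requests include the listing pages plus one detail page per record.
estimated_requests = listing_pages + estimated_records
print(f"Estimated requests per crawl: {estimated_requests:,}")  # ~10,290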

Frequency of the data extraction

The final factor to take into account is how often you would like to extract this data: daily, weekly, monthly, or as a once-off?

Also, do you need the data extraction completed within a certain time window to avoid changes to the underlying data?

It goes without saying that the more often you extract the data, the larger the scale of the crawl. If you go from extracting data monthly to extracting data daily you are in effect multiplying the scale of the crawl by a factor of 30.

In most circumstances, increasing the scale from monthly to daily won’t have much of an effect other than increasing the infrastructural resources required (server bandwidth, proxies, data storage, etc.).

However, it does increase the risk that your crawlers will be detected and banned, or that they will put excessive pressure on the website's servers, especially if the website typically doesn’t receive a large volume of traffic each day.

Sometimes the scale of the crawl is just too big to complete at the desired frequency without crashing a website or requiring huge infrastructural resources. Typically this is only an issue when extracting enormous quantities of data on an hourly or more frequent basis, such as monitoring every product on an e-commerce store in real-time. During the technical feasibility phase, we normally test for this.
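To see how quickly frequency multiplies the numbers, here is a rough calculation that reuses the hypothetical ~10,000-record site from the estimate above; the sustained requests-per-second figure is only a sanity check on server load, not a hard rule.

# Impact of crawl frequency on request volume for the hypothetical site above.
requests_per_crawl = 10_290

for label, crawls_per_month in [("monthly", 1), ("weekly", 4),
                                ("daily", 30), ("hourly", 24 * 30)]:
    monthly_requests = requests_per_crawl * crawls_per_month
    # Average request rate against the target site, spread over the month.
    avg_rps = monthly_requests / (30 * 24 * 3600)
    print(f"{label:>8}: {monthly_requests:>10,} requests/month "
          f"(~{avg_rps:.3f} req/s sustained)")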

Deltas & incremental crawls

The other factor to take into account when moving from once-off data extraction to a recurring project is how you intend to process the incremental crawls.

Here you have a number of options:

  • Extract all the available data every single time the site is crawled.
  • Only extract the deltas or changes to the data available on the website.
  • Extract and monitor all the changes that occur to the available data (data changes, new data, data deletions, etc.).

The crawlers can be configured to do this, or they can simply extract all the available data during each crawl so you can post-process it to your requirements afterward.
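If you choose the post-processing route, the delta logic can be as simple as diffing two crawl snapshots. The sketch below assumes each record carries a stable "id" field and that each crawl was exported as JSON Lines; both are illustrative assumptions.

import json


def load_snapshot(path):
    # Load one crawl's JSON Lines output, keyed by a stable record ID.
    with open(path) as f:
        return {rec["id"]: rec for rec in (json.loads(line) for line in f)}


def diff_crawls(previous_path, current_path):
    # Post-process two full crawls into new / changed / deleted deltas.
    prev, curr = load_snapshot(previous_path), load_snapshot(current_path)
    new_ids = curr.keys() - prev.keys()
    deleted_ids = prev.keys() - curr.keys()
    changed_ids = {i for i in curr.keys() & prev.keys() if curr[i] != prev[i]}
    return {
        "new": [curr[i] for i in new_ids],
        "changed": [curr[i] for i in changed_ids],
        "deleted": [prev[i] for i in deleted_ids],
    }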

Step 4: Data delivery

Finally, the last step of the project scoping process is defining how you want to interact with the web scraping solution and how you want the data delivered.

If you are building the web scraping infrastructure yourself, you really only have one option: you’re managing the data, the web scraping infrastructure, and the underlying source code.

However, when customers work with Zyte (or other web scraping providers), we can offer them a number of working relationships to best meet their needs. Here are some of Zyte’s most common working relationships:

  • Zyte Data Extraction - If the customer is just interested in receiving data, we can develop a custom data feed for their project and deliver their data to them at the desired frequency.
  • Zyte Automatic Extraction - If the customer would like instant access to news or product data, they can use our patented AI-powered automated extraction service. All they have to do is give us the URLs and they get quality data right back, with no coding.

The next question is what format you want the data in and how you would like it delivered. These questions are largely dependent on how you would like to consume the data and the nature of your current internal systems. There is a huge range of options for both, but here are some examples:

  • Data Formats - CSV, JSON, JSON Lines, XML files, etc.
  • Data Delivery - Amazon S3 bucket, FTP, Dropbox, etc.

These considerations are typically more relevant when you are working with an external web scraping partner.
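If, for example, your crawl is built with Scrapy, the format and delivery choices often come down to a feed export configuration like the sketch below; the bucket name, file paths, and credentials are placeholders.

# settings.py (Scrapy feed exports)
FEEDS = {
    # Deliver JSON Lines to a hypothetical S3 bucket, one file per crawl.
    "s3://my-data-bucket/products/%(time)s.jsonl": {"format": "jsonlines"},
    # Also keep a local CSV copy for quick inspection.
    "exports/products-%(time)s.csv": {"format": "csv"},
}

# Required for S3 delivery (placeholders; botocore must be installed).
AWS_ACCESS_KEY_ID = "YOUR_ACCESS_KEY"
AWS_SECRET_ACCESS_KEY = "YOUR_SECRET_KEY"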

So there you have it: those are the four steps you need to take to define the scope of your web scraping project. In the next article in the series, we will share with you how to take that project scope and conduct a legal review of the project.

Your web scraping project

At Zyte we have extensive experience architecting and developing data extraction solutions for every possible use case.

Our legal and engineering teams work with clients to evaluate the technical and legal feasibility of every project and develop data extraction solutions that enable them to reliably extract the data they need.

If you need to start or scale a web scraping project, our solution architecture team is available for a free consultation, where we will evaluate and architect a data extraction solution to meet your data and compliance requirements.

At Zyte we always love to hear what our readers think of our content and any questions you might have. So please leave a comment below with what you thought of the article and what you are working on.
