
A practical guide to web data QA part IV: Complementing semi-automated techniques

Read Time: 7 Mins
Posted on: September 3, 2020
By Ivan Ivanov and Warley Lopes

If you haven't read the previous ones, here are the first, second, and third parts of the series.

In this article, we build upon some of the semi-automated techniques and tools introduced in the previous part of the series.

1. Text editor search-and-replace techniques

Let's say that the data we work with is separated by commas and line breaks:

change,deep-thoughts,thinking,world

abilities,choices

However, the number of comma-separated words per line is not consistent. To make the data easier to work with, we can transform the dataset so that each line contains only one word:

change

deep-thoughts

thinking

world

abilities

choices

To make this transformation, we can use the search-and-replace functionality of code editors such as Sublime Text, Notepad++, or Visual Studio Code.

This is how you can do it using Sublime Text:

  1. Open the data file or paste it into a new tab in the program
  2. Open the Search and Replace dialog using the shortcut CTRL + H
    1. Enable Regex through the button “.*” in the bottom left corner
    2. Search for line breaks using the escape character \n
    3. Replace with a comma (“,”)
  3. Click on “Replace All”

Once the process finishes, all the words will be in a single row, separated by commas:

Finally, we replace the commas with \n (newline).

Once the replacing is done, we have a normalized dataset with only one word per line.
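
If this normalization needs to be repeated or run over a large file, the same idea can be scripted. Here is a minimal Python sketch, assuming the tags live in a plain text file named tags.txt (a hypothetical file name):

# Normalize a file of comma- and newline-separated tags so that each line
# ends up containing exactly one tag (same result as the editor steps above).
from pathlib import Path

raw = Path("tags.txt").read_text(encoding="utf-8")  # hypothetical input file

# Split on line breaks first, then on commas, and drop empty leftovers.
tags = [
    tag.strip()
    for line in raw.splitlines()
    for tag in line.split(",")
    if tag.strip()
]

Path("tags_normalized.txt").write_text("\n".join(tags), encoding="utf-8")
print(f"Wrote {len(tags)} tags, one per line.")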

2. Approaches with spreadsheets

Let's work again with data from http://quotes.toscrape.com/. Our goal for this example is to make sure that the Top Ten tags stated on the website are indeed the top ten tags present in the scraped data. The web page looks like this:

After scraping the data from the page and loading it into a spreadsheet, this is what it looks like:

We will be using Google Sheets for this example. The first step will be to split the tags column into several columns so that we can count each word individually and analyze the results better:

  • Select the tags column
  • Click on the Data menu option
  • Then Split text to columns
  • In the small Separator dropdown shown in the spreadsheet:
    • Keep "Detect automatically" (if the comma separator is detected correctly)
    • Or change the selection to "Comma"

Then your spreadsheet will look like this:

The next step will be to convert the multiple tag columns into a single one. This can be done by defining a named range:

  • Select the range with the tags - in my case it is D2:K101
  • Click on the Data menu option
  • Click on Named ranges
  • Then enter the name for it, e.g: Tags

Then apply the following formula in a new column (M in this example) and fill it down to the total expected length (in this case, 8 columns × 100 rows = 800 rows in total). The formula walks through the Tags range row by row and flattens it into a single column:

=INDEX(Tags,1+INT((ROW(A1)-1)/COLUMNS(Tags)),MOD(ROW(A1)-1+COLUMNS(Tags),COLUMNS(Tags))+1)

When you reach the end you will see an error “#NUM!” as shown below:

Next, we create a new column next to the tags populated with 1:

You can enter 1 in cells N2 and N3, select them, and then double-click the small square in the bottom-right corner of the selection to duplicate the value down the rest of the rows.

The final step is to create a Pivot table with columns M and N, as follows:

  • Select the entire range, in this case, M1:N801
  • Click on the Data menu option
  • Then click on Pivot Table
  • Click on Create:

  • For Rows - add the Tags column
  • For Values - add the Count column:

Then, this should be the result:

Finally, we can sort our data by occurrence count - this can be done from the Pivot table editor which gives us the following:
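
If you prefer scripting over spreadsheets, the same top-ten check can be done in a few lines of Python with pandas. This is only a sketch, assuming the scraped quotes were exported to a CSV file (here called quotes.csv, a hypothetical name) with a tags column holding comma-separated tags:

# Count tag occurrences from the scraped CSV and show the ten most common ones.
import pandas as pd

df = pd.read_csv("quotes.csv")  # hypothetical export of the scraped data

# One row per individual tag: split the comma-separated column and explode it.
tags = (
    df["tags"]
    .str.split(",")
    .explode()
    .str.strip()
)

top_ten = tags.value_counts().head(10)
print(top_ten)

The result can then be compared against the Top Ten tags shown on the website, just like the pivot table above.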

3. Manually checking multiple pages

The top tags were verified in the previous example. But what if we need to open the URLs of the top 10 tags, or of a few random tags, and visually check them? In this case, we can use a combination of tools such as a text editor (Sublime Text or Notepad++) and Chrome extensions like "Open Multiple URLs", "Copy All Urls", and "Session Buddy".

If we take a closer look at the tag links, they follow this structure:

http://quotes.toscrape.com/tag/inspirational/
http://quotes.toscrape.com/tag/love/

Let’s say we are picking the following tags:

change

deep-thoughts

thinking

world

To visually check them, we can:

  1. Paste them into Sublime Text or Notepad++ for example
  2. Copy the initial part of the URL http://quotes.toscrape.com/tag/
  3. Use block selection to select the start of all the tag rows
    1. Hold Shift + Right Mouse button and drag over the start of the text in the rows, before the first letter
  4. Then paste the copied initial part of the URL

The result will be:

  • http://quotes.toscrape.com/tag/change
  • http://quotes.toscrape.com/tag/deep-thoughts
  • http://quotes.toscrape.com/tag/thinking
  • http://quotes.toscrape.com/tag/world

Now we can open all of them simultaneously by using the Chrome extension we mentioned, “Open Multiple URLs”.
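
Building and opening the URLs can also be scripted. Below is a small sketch using only Python's standard library webbrowser module, assuming a short list of tags picked for spot-checking (each URL opens in a new browser tab):

# Build tag page URLs and open them in browser tabs for a visual spot-check.
import webbrowser

BASE = "http://quotes.toscrape.com/tag/"
tags = ["change", "deep-thoughts", "thinking", "world"]  # tags picked for checking

urls = [BASE + tag + "/" for tag in tags]
for url in urls:
    print(url)
    webbrowser.open_new_tab(url)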

After checking the results page by page and comparing them with the scraped data, we can close the pages without problems and keep the bad ones by:

  • Copying all URLs that are still open with the "Copy All Urls" extension and saving them somewhere
  • Saving the open tabs as a Chrome session for later with the "Session Buddy" extension

The final result you’ll have for analysis or reporting issues will be:

  • List of all pages
  • List of pages with problems 

4. Difference check

Whenever a side-by-side comparison looks helpful, diff-check tools (simple text comparison tools such as WinMerge) can be used. For this example, let's say we received the task of making sure that all book category links from books.toscrape.com were indeed scraped.

The data would look similar to this:

category_links

http://books.toscrape.com/catalogue/category/books/cultural_49/index.html

http://books.toscrape.com/catalogue/category/books/politics_48/index.html

http://books.toscrape.com/catalogue/category/books/health_47/index.html

...

The website has a listing of the categories just like this:

Since it's pretty simple to select all the categories on the page, we can copy them and compare them side by side with the extracted data.

First, after copying all the categories from the website, we need to normalize them in the same way we will normalize the scraped category links: order them alphabetically and transform them to lowercase.

To transform everything to lowercase, I'll use Visual Studio Code's built-in "Transform to Lowercase" command (available through the command palette, opened with the F1 key):

Then I'll use a Visual Studio Code extension called Sort lines, which gives us the following result to be compared later against the scraped category_links data:

Now we normalize the category_links from the scraped data, removing everything not related to the category name with search-and-replace approaches similar to those shown in the first method:

Removal of the start part of the URLs:

Then removing the header and the final part:

Now wrapping up with everything else (replacing "-" with a single space and removing a final leftover part):

Then we go over to WinMerge, open a new window with CTRL + N, paste the content copied from the website on one side and the normalized categories from the links on the other. Upon hitting F5, both will be compared, and if they are equal, every category shown on the website was indeed scraped:
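
If WinMerge is not at hand, the same normalization and comparison can be sketched in Python with the standard library's difflib module. The file names below (site_categories.txt for the list copied from the website and category_links.txt for the scraped links) are hypothetical:

# Normalize both category lists and print any differences between them.
import difflib
from pathlib import Path

def normalize(lines):
    # Lowercase, strip whitespace, drop empty lines, and sort alphabetically.
    return sorted({line.strip().lower() for line in lines if line.strip()})

# Categories copied from the website, one per line (hypothetical file).
site = normalize(Path("site_categories.txt").read_text(encoding="utf-8").splitlines())

# Scraped category links, reduced to the category name, e.g.
# ".../category/books/health_47/index.html" -> "health".
scraped_links = Path("category_links.txt").read_text(encoding="utf-8").splitlines()
scraped = normalize(
    link.rstrip("/").split("/")[-2].rsplit("_", 1)[0].replace("-", " ")
    for link in scraped_links
    if "/category/books/" in link
)

diff = list(difflib.unified_diff(site, scraped, "website", "scraped", lineterm=""))
print("\n".join(diff) if diff else "No differences - all categories were scraped.")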

5. Structured data testing tool

Some scraping methods rely on data and structures that are available only in the page source code, such as microdata and JSON-LD (JSON for Linked Data). To make the job of assuring data quality easier, these can be inspected and compared against the scraped data with Google's Structured Data Testing Tool.
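
For a quick programmatic spot-check, JSON-LD blocks can also be pulled straight out of the page source and compared with the scraped fields. Here is a minimal sketch, assuming the page under test actually embeds application/ld+json data (the URL below is only a placeholder):

# Extract JSON-LD blocks from a page's source for comparison with scraped data.
import json
import re
import urllib.request

URL = "https://example.com/product-page"  # placeholder: a page that embeds JSON-LD

html = urllib.request.urlopen(URL).read().decode("utf-8")

# Grab the contents of every <script type="application/ld+json"> block.
pattern = re.compile(
    r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)

for block in pattern.findall(html):
    data = json.loads(block)
    print(json.dumps(data, indent=2))  # inspect each structured data block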

6. SQL

SQL has been a key tool in the QA engineer's toolbelt. Let's suppose we have 1000 rows of data scraped from books.toscrape.com, following the pattern below, and we want to assess their data quality:

_type | image | price | title | URL
All products Books to Scrape - Sandbox | http://books.toscrape.com/media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg | £51.77 | http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html | http://books.toscrape.com/
All products Books to Scrape - Sandbox | http://books.toscrape.com/media/cache/26/0c/260c6ae16bce31c8f8c95daddd9f4a1c.jpg | £53.74 | http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html | http://books.toscrape.com/

Apart from field names not matching the scraped data nicely (like the "title" column actually containing URLs instead), with SQL we can:

  • Investigate minimum and maximum prices for anomalies (e.g., negative or zero values):

  • Investigate aggregated value counts for fields that should follow a distinguishable pattern, where unexpected values could indicate issues (both checks are sketched in code below):

However, let’s suppose we expected the field “_type” to be the genre/category of the books:

All rows had the same single value for "_type", so the spider could be corrected to scrape the genre of the books instead, which would be more useful.
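
Here is a sketch of both checks using Python's built-in sqlite3 module, assuming the scraped rows were loaded into a table called books inside a database file named scraped_books.db (both names are hypothetical):

# Run simple SQL sanity checks over scraped rows loaded into SQLite.
import sqlite3

conn = sqlite3.connect("scraped_books.db")  # hypothetical database with the data
cur = conn.cursor()

# Price anomalies: negative, zero, or suspiciously high values stand out here.
# Prices were scraped as text like "£51.77", so the symbol is stripped first.
cur.execute("""
    SELECT MIN(CAST(REPLACE(price, '£', '') AS REAL)) AS min_price,
           MAX(CAST(REPLACE(price, '£', '') AS REAL)) AS max_price
    FROM books
""")
print(cur.fetchone())

# Aggregate the _type field: a single value across all 1000 rows hints that
# the field is not capturing the genre/category we expected.
cur.execute("""
    SELECT _type, COUNT(*) AS occurrences
    FROM books
    GROUP BY _type
    ORDER BY occurrences DESC
""")
for row in cur.fetchall():
    print(row)

conn.close()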

Conclusions

In this post, we showcased multiple semi-automated techniques which, combined with the approaches shown in the previous posts of the series, will hopefully bring creative ideas into your data quality assurance process and help you test your data better.

Want to learn more about web data QA? Read the Practical guide to web data QA part V: Broad crawls

Learn more about Enterprise Web Scraping

Web scraping can look deceptively easy when you're starting out. There are numerous open-source libraries/frameworks and data extraction tools that make it very easy to scrape data from a website. But in reality, it can be very hard to extract web data at scale. Read our whitepaper and learn how to build a scalable web scraping infrastructure for your business or project.
