St Patrick’s Day special: Finding Dublin’s best pint of Guinness with web scraping
At Zyte we are known for our ability to help companies make mission-critical business decisions through the use of web scraping.
But for anyone who enjoys a freshly poured pint of stout, there is one mission-critical question that creates a debate like no other…
“Who serves the best pint of Guinness?”
So on St Patrick's Day, we decided to turn our expertise in large-scale data extraction to answering this question.
Although this is a somewhat humorous question, the data extraction and analysis methods used are applicable to numerous high-value business use cases and are used by the world’s leading companies to gain a competitive edge in their respective markets.
In this article, we’re going to explore how to architect a web scraping and data science solution to find the best pint of Guinness in Dublin and, most importantly, which Dublin pub serves it.
Step #1 - Identify rich data
Anyone who has ever enjoyed a pint of the black stuff knows that the taste of a pint of Guinness is heavily influenced by the skill of the person pouring it and the quality of the equipment they use.
With that in mind, our first task is to identify where we can find web data that contains rich insights into the quality of a pub’s Guinness and where the coverage levels are sufficient for all pubs in Dublin.
After a careful analysis of our options (pub websites, social media, review sites, articles, etc.), we decided customer reviews would be our best option: they provide the best combination of relevant, high-granularity data and sufficient coverage to answer this question.
Step #2 - Extract review data
The next step would be to develop a web scraping infrastructure to extract this review data at scale using Scrapy. To do so, we’d need to create two separate types of spiders:
- Pub discovery spiders - designed to find and index pub listings for the data extraction spiders.
- Data extraction spiders - designed to extract the details of each pub once it has been found by a discovery spider: pub name, location, description, customer rating, and customer reviews (a minimal sketch follows this list).
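Below is a minimal sketch of what such an extraction spider might look like. The listing URL and CSS selectors are illustrative placeholders, not a real review site's structure:

```python
import scrapy


class PubReviewSpider(scrapy.Spider):
    """Hypothetical data extraction spider: the start URL and CSS
    selectors below are placeholders, not a real site's markup."""

    name = "pub_reviews"
    # In practice, these URLs would be fed in by the discovery spider.
    start_urls = ["https://example.com/dublin/pubs"]

    def parse(self, response):
        for pub in response.css("div.pub-listing"):
            yield {
                "name": pub.css("h2.name::text").get(),
                "location": pub.css("span.address::text").get(),
                "description": pub.css("p.description::text").get(),
                "rating": pub.css("span.rating::text").get(),
                "reviews": pub.css("p.review::text").getall(),
            }
        # Follow pagination so every listed pub is indexed.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```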
We’d also need to run these spiders on a web scraping infrastructure that can reliably extract the review data with no data quality issues. To do so, we’d configure the web scraping infrastructure as follows:
- Spider Hosting & Scheduling - to enable our spiders to run at scale in the cloud we’d use Scrapy Cloud.
- Proxies - critical to reliably extracting raw data is the ability to make successful requests to the target website. To do this we’d use Smart Proxy Manager, which manages your proxies so you don’t have to.
- Spider Monitoring & Quality Assurance - we’d apply Zyte's 4-layer QA system to the project, monitoring the data extraction 24/7 and alerting our QA team to any malfunctioning spiders.
Due to data protection regulations such as GDPR, it is important that the extraction spiders don’t extract any personal information of the customers who submitted the review. As a result, data extraction spiders need to anonymize customer reviews.
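One way to enforce this would be a Scrapy item pipeline that strips personal fields before items are stored. A minimal sketch, assuming items are dicts with the hypothetical field names shown:

```python
import hashlib


class AnonymizeReviewPipeline:
    """Drops personal data from review items before storage.
    The field names here are assumptions for illustration."""

    def process_item(self, item, spider):
        # Remove directly identifying fields entirely.
        item.pop("reviewer_name", None)
        item.pop("reviewer_profile_url", None)
        # Replace the reviewer ID with a one-way hash so repeat
        # reviewers can be counted without storing who they are.
        if item.get("reviewer_id"):
            item["reviewer_id"] = hashlib.sha256(
                str(item["reviewer_id"]).encode("utf-8")
            ).hexdigest()
        return item
```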
Step #3 - Text pre-processing
Once the unstructured review data has been extracted from the site, the next step is to convert it into a collection of text documents, or “corpus”, and pre-process the reviews in advance of analysis.
Natural Language Processing (NLP) techniques have difficulty modeling unstructured, messy text, preferring well-defined, fixed-length inputs and outputs. As a result, this raw text typically needs to be converted into numbers: specifically, vectors of numbers, where the more similar two words are, the closer the numbers assigned to them.
The simplest way to convert a corpus to a vector format is the bag-of-words approach, where each unique word is represented by a unique number.
To use this approach, the review data first needs to be cleaned up and structured. This can be implemented using a library such as NLTK, the most widely used Python library for text processing.
Here are some of the common web scraping pre-processing steps:
- Convert all text to lower case - to ensure accurate data mining, we need a single form for each word. For example, if both “Guinness” and “guinness” appear, all instances of “Guinness” would be converted to “guinness”.
- Remove stopwords - to strip filler words from the reviews, as text generally contains a large number of prepositions, pronouns, conjunctions, etc. (common stopwords include “the”, “a”, and “an”).
- Remove punctuation - to remove punctuation such as full stops and commas.
- Stem words - to avoid having multiple versions of similar words in the text, inflected (or derived) words are reduced to their word stem, base, or root form.
The goal of this pre-processing step is to ensure the text corpus is clean and contains only the core words required for text mining.
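Put together, the pre-processing steps might look like this minimal sketch using NLTK (the whitespace tokenization and the choice of the Porter stemmer are assumptions):

```python
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)

STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()


def preprocess(review: str) -> list[str]:
    # 1. Convert all text to lower case.
    text = review.lower()
    # 2. Remove punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # 3. Remove stopwords and 4. stem the remaining words.
    return [STEMMER.stem(word) for word in text.split() if word not in STOP_WORDS]


print(preprocess("The Guinness here is always terrific!"))
# Note: stems are not always dictionary words, e.g. "terrific" -> "terrif".
```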
Once cleaned, the review data then needs to be vectorized to enable analysis. Here is an example review prior to vectorization:
```python
review_21 = X[20]
review_21
```

Output:

```
"One of the greatest of Dublin's great bars, the Guinness here is always terrific, the atmosphere is friendly and it is perfect especially around Christmas -- snug warm and welcoming."
```
Here is how the review is represented once it has been vectorized using the bag-of-words approach: each unique word is assigned a unique number, and the frequency of each word's appearance is recorded.
```python
bow_21 = bow_transformer.transform([review_21])
bow_21
```

Output:

```
(0, 2079)   1
(0, 2006)   1
(0, 6295)   1
(0, 8609)   1
(0, 9152)   1
(0, 13620)  1
(0, 14781)  1
(0, 12165)  1
(0, 16179)  1
(0, 17816)  1
(0, 22077)  1
(0, 24797)  1
(0, 26102)  1
```
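The snippet above doesn't show where `bow_transformer` comes from; one common choice would be Scikit-learn's `CountVectorizer` fitted over the cleaned corpus, as in this sketch (assuming `X` is the list of reviews and `preprocess` is the cleaning function sketched earlier):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Passing our cleaning function as the analyzer means each raw review
# is lower-cased, stripped, and stemmed before counting.
bow_transformer = CountVectorizer(analyzer=preprocess)
bow_transformer.fit(X)

print(len(bow_transformer.vocabulary_))  # unique words across the corpus

# Transform one review into a sparse vector of word counts.
bow_21 = bow_transformer.transform([X[20]])
print(bow_21)
```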
Step #4 - Exploratory text analysis
Once the text corpus has been cleaned, structured, and vectorized, the next step is to analyze the review data to determine which pubs have the best Guinness reviews.
Although there is no definitive method of achieving this goal, for the purposes of this project we decided not to overcomplicate things and instead do a simple analysis of the data to see what insights it would yield.
One approach would be to filter the review data by looking for the word “guinness”. This would enable us to identify all the reviews that specifically mention “guinness”, an essential requirement when trying to determine who pours the best pint of the black stuff.
Next, we need a way to determine whether Guinness was mentioned in a positive or negative context.
One powerful method would be to build a classifier model from a labeled training dataset: 30% of the overall dataset, with each review labeled as having positive or negative sentiment. We’d train a Multinomial Naive Bayes classifier from Scikit-learn (a variant of Naive Bayes well suited to text documents) on this labeled data, then apply the trained sentiment classifier to the entire dataset, categorizing every review as either positive or negative.
To ensure the accuracy of these sentiment predictions, the results need to be analyzed and compared against the actual reviews. Our aim is an accuracy of 90% or above.
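A minimal sketch of training and validating such a classifier, assuming `bow_matrix` is the vectorized corpus and `labels` holds the hand-labeled sentiment (1 for positive, 0 for negative):

```python
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Train on the 30% labeled subset described above and hold out the
# remaining labeled reviews for validation. Names are assumptions.
X_train, X_test, y_train, y_test = train_test_split(
    bow_matrix, labels, train_size=0.3, random_state=42
)

classifier = MultinomialNB()
classifier.fit(X_train, y_train)

# Compare predictions against held-out labels; we aim for >= 90%.
predictions = classifier.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, predictions):.1%}")
```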
Step #5 - Who serves the best pint of Guinness?
Finally, with a fully classified database of Guinness reviews, we should now be in a position to analyze this data and determine which pub serves the best Guinness in Dublin.
In this simple analysis project, we analyzed the classified review data using the following assumptions and weighting criteria (an illustrative scoring sketch follows the list):
- There is a strong correlation between overall review sentiment (high star rating and positive sentiment) and sentiment in the context of a pint of Guinness. That is, if the overall review is very positive and it mentions Guinness, the pub likely serves good Guinness, and vice versa for negative sentiment.
- There is a strong correlation between the number of times Guinness is mentioned in a pub’s reviews and the quality of Guinness the pub serves.
- The ratio of overall reviews to the number of reviews mentioning Guinness is indicative of how known the pub is for serving great pints of Guinness.
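As an illustration only, these criteria could be turned into a simple score with pandas. The DataFrame layout and the ratio used here are assumptions, not the original analysis:

```python
import pandas as pd

# Assumed layout: one row per review in reviews_df, with columns
#   pub               - pub name
#   mentions_guinness - True if the review contains "guinness"
#   positive          - classifier output (True = positive sentiment)
guinness_reviews = reviews_df[reviews_df["mentions_guinness"]]

per_pub = guinness_reviews.groupby("pub").agg(
    guinness_mentions=("positive", "size"),
    positive_guinness=("positive", "sum"),
)
per_pub["total_reviews"] = reviews_df.groupby("pub").size()

# Share of all of a pub's reviews that mention Guinness positively,
# e.g. Kehoes: 36 / 74 = 48.6%.
per_pub["score"] = per_pub["positive_guinness"] / per_pub["total_reviews"]

print(per_pub.sort_values("score", ascending=False).head(3))
```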
Using this methodology we were able to get an interesting insight into the quality of Guinness in every bar in Dublin and find the best place to get a pint of the black stuff.
So enough with the data science mumbo jumbo, what do our results say?
Winner: Kehoes Pub - 9 South Anne Street
Of the 74 reviews analyzed, 36 (48.6% of all reviews) displayed positive sentiment for its pints of Guinness: the highest ratio of reviews mentioning Guinness in a positive light, and the highest total number of reviews mentioning Guinness. A great sign that Kehoes serves the best Guinness in Dublin.
To validate our results, the Zyte team did our due diligence and sampled Kehoes’ Guinness. We can safely say those reviews weren’t lying: a great pint of stout!
Worthy runners up…
Runner-up #1: John Kavanagh The Gravediggers - 1 Prospect Square
Of the 54 reviews analyzed, 25 (46.3% of all reviews) displayed positive sentiment for its pints of Guinness.
Runner-up #2: Mulligan’s Pub - 8 Poolbeg St
Of the 49 reviews analyzed, 21 (42.9% of all reviews) displayed positive sentiment for its pints of Guinness.
So if you’re looking for the best place to find a great pint of Guinness this St Patrick’s Day, be sure to check out these great options.
At Zyte we specialize in turning unstructured web data into structured data through web scraping and other techniques. If you would like to learn more about how you can use web scraped data in your business, feel free to contact our solution architecture team, who will talk you through the services we offer, from startups right through to Fortune 100 companies.
We always love to hear what our readers think of our content and any questions you might have. So please leave a comment below with what you thought of the article and what you are working on right now.
Until next time…
Happy St Patrick's Day! ☘️