How to extract data from an HTML table
HTML tables are a very common format for displaying information. When building scrapers, you often need to extract data from HTML tables on web pages and turn it into a different structured format, such as JSON, CSV, or Excel. In this article, we discuss how to extract data from HTML tables using Python and Scrapy.
Before we move on, make sure you understand web scraping and its two main parts: web crawling and web extraction.
Crawling involves navigating the web and accessing web pages to collect information. In this phase, you often need mechanisms that get around IP blocks and mimic human behavior.
Once the crawler has reached a web page, the scraper extracts specific information from it, and that information is often laid out in HTML tables.
To get this tabular information into a structured form for further analysis or use, such as a database or a spreadsheet, we can extract it using Python and Scrapy.
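Whatever tool performs the extraction, the result is usually a header row plus a list of data rows. As a rough sketch of the final conversion step, the snippet below turns such rows into JSON and CSV using only the Python standard library. The `headers` and `rows` values are made-up sample data, not output from a real page.

```python
import csv
import io
import json

# Hypothetical sample data: a header row plus data rows,
# in the shape a table extraction step might produce.
headers = ["Fruit", "Price"]
rows = [["Apple", "1.20"], ["Banana", "0.50"]]

# JSON: one object per row, keyed by column header.
records = [dict(zip(headers, row)) for row in rows]
json_text = json.dumps(records, indent=2)

# CSV: written to an in-memory buffer here; a real script would open a file.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(headers)
writer.writerows(rows)
csv_text = buffer.getvalue()

print(json_text)
print(csv_text)
```

Keeping the cell values as strings is a deliberate simplification; converting prices or dates to native types would be a separate cleaning step.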
Understanding HTML Tables
Understanding the HTML code that makes up these tables is key to extracting data successfully. HTML tables provide a structured way to display information on a web page. They are grid-based structures made up of rows and columns that hold and organize data. Although they are intended for tabular data, web developers have also often used them for page layout, so not every table on a page contains data worth extracting. Let's delve deeper into their structure:
<table>: This tag indicates the beginning of a table on an HTML document. Everything that falls between the opening <table> and closing </table> tags constitutes the table's content.
<thead>: Standing for 'table head', this element serves to encapsulate a set of rows (<tr>) defining the column headers in the table. Not all tables need a <thead>, but when used, it should appear before <tbody> and <tfoot>.
<tbody>: This 'table body' tag is used to group the main content in a table. In a table with a lot of data, there might be several <tbody> elements, each grouping together related rows. This is useful when you want to style different groups of rows differently.
<tfoot>: The 'table foot' tag is used for summarizing the table data, for example with a row of totals. The <tfoot> should appear after any <thead> or <tbody> sections. It is rendered at the bottom of the table, providing a summary or conclusion to the data presented.
<tr>: The 'table row' element. Each <tr> tag denotes a new row in the table, which can contain either header (<th>) or data (<td>) cells.
<td> and <th>: The 'table data' (<td>) tag defines a standard cell in the table, while the 'table header' (<th>) tag identifies a header cell. Browsers render the text within <th> bold and centered by default. <td> and <th> cells can contain a wide variety of content, including text, images, lists, and even other tables.
<caption>: The 'caption' tag provides a title or summary for the table, enhancing accessibility. When present, it is placed immediately after the opening <table> tag.
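To make the roles of these tags concrete, here is a minimal sketch that walks a small table and groups its rows by section. It uses the standard library's html.parser instead of Scrapy so that it runs without extra dependencies. The `TableOutline` class and the sample markup are illustrative assumptions, and the parser only handles a single, non-nested table.

```python
from html.parser import HTMLParser


class TableOutline(HTMLParser):
    """Collect the caption and the rows of each section of one simple table."""

    SECTIONS = ("thead", "tbody", "tfoot")

    def __init__(self):
        super().__init__()
        self.caption = ""
        self.sections = {name: [] for name in self.SECTIONS}
        self._section = "tbody"  # rows outside any section count as body rows
        self._row = None         # cells of the <tr> currently being read
        self._capture = None     # tag whose text is currently being collected
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SECTIONS:
            self._section = tag
        elif tag == "tr":
            self._row = []
        elif tag in ("th", "td", "caption"):
            self._capture = tag
            self._text = []

    def handle_data(self, data):
        # Text between tags is ignored unless a cell or caption is open.
        if self._capture:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == self._capture:
            text = "".join(self._text).strip()
            if tag == "caption":
                self.caption = text
            elif self._row is not None:
                self._row.append(text)
            self._capture = None
        elif tag == "tr" and self._row is not None:
            self.sections[self._section].append(self._row)
            self._row = None
        elif tag in self.SECTIONS:
            self._section = "tbody"


# Made-up sample markup that uses every tag described above.
SAMPLE_HTML = """
<table>
  <caption>Fruit prices</caption>
  <thead>
    <tr><th>Fruit</th><th>Price</th></tr>
  </thead>
  <tbody>
    <tr><td>Apple</td><td>1.20</td></tr>
    <tr><td>Banana</td><td>0.50</td></tr>
  </tbody>
  <tfoot>
    <tr><td>Total</td><td>1.70</td></tr>
  </tfoot>
</table>
"""

outline = TableOutline()
outline.feed(SAMPLE_HTML)
print(outline.caption)
print(outline.sections)
```

Separating rows by section matters in practice: header and footer rows usually need different handling from body rows, for example to avoid storing a totals row as if it were a regular record.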
Let's look at an example of an HTML table: