Introduction to Material Properties
Explore how to extract and organize material properties of aluminum alloys from websites using Python. Learn HTML basics, use BeautifulSoup to parse content, and automate data export to Excel with Openpyxl.
Material properties
You used Requests earlier to scrape a website and do something with the results. That website functioned more like an API because you knew exactly what format it would return. Data is not always so nicely formatted. Sometimes, you have to find, format, clean, or otherwise manipulate the data yourself. When you scrape a real website, you will need to parse HyperText Markup Language (HTML) to get the information that you want. On any webpage that you visit, HTML and its sister language, Cascading Style Sheets (CSS), work together to define the structure and design of the page.
The Engineering Toolbox website [1] contains technical data, material properties, chemical properties, economic information, drawing tools, and much more for all types of engineers. You will write a Python program to grab information about the material properties of different aluminum alloys from the Engineering Toolbox website. The program will put this information into an Excel spreadsheet so you can reference it later.
BeautifulSoup4 is the preferred library for HTML parsing, and its documentation [2] is very good. Openpyxl is the preferred cross-platform library for manipulating Excel spreadsheets; its documentation can be found under footnote [3].
The program will need to go to the material properties page [4], convert the table into a nested list so the information is easy to parse, and iterate through that list to put each value into a cell in an Excel spreadsheet.
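The last of these steps, copying a nested list into spreadsheet cells, can be sketched as follows. This is a minimal illustration rather than the lesson's final program: the function name `rows_to_workbook` and the rows in `sample_rows` are made-up placeholders, not real alloy data, and the sketch assumes Openpyxl is installed.

```python
from openpyxl import Workbook


def rows_to_workbook(rows):
    """Copy a nested list into the active sheet, one value per cell."""
    workbook = Workbook()
    sheet = workbook.active
    # Openpyxl rows and columns are 1-indexed, hence start=1.
    for row_index, row in enumerate(rows, start=1):
        for column_index, value in enumerate(row, start=1):
            sheet.cell(row=row_index, column=column_index, value=value)
    return workbook


# Placeholder rows standing in for the scraped table (not real alloy data).
sample_rows = [
    ["Alloy", "Property A", "Property B"],
    ["alloy-1", 1.0, 2.0],
    ["alloy-2", 3.0, 4.0],
]

workbook = rows_to_workbook(sample_rows)
sheet = workbook.active
print(sheet.cell(row=2, column=1).value)  # first data row, first column
```

Calling `workbook.save("aluminum_alloys.xlsx")` (a hypothetical filename) would then write the spreadsheet to disk.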
About HTML
Let’s go through a quick crash course on HTML, the original language for building websites. HTML uses “tags” such as <a>text here</a> to mark up different types of content. The <a> tag creates a hyperlink, also called a link. In this case, “text here” would show up as a blue, clickable link on the page. Below is a table of common HTML tags:
| HTML Tag | What it represents |
|---|---|
| <a> | Hyperlink |
| <div>, <p> | Generic block container and paragraph of text, respectively |
| <table>, <tr>, <td>, <tbody>, <thead> | Table, table row, table cell, table body, and table header section, respectively |
| <style>, <span> | Embedded CSS rules and an inline text container, respectively |
| <input> | A field to input data |
| <i>, <b>, <u>, <strike>, <sup> | Font styles: italics, bold, underline, strikethrough, and superscript, respectively |
| <img> | Image |
When information on a website is not nicely formatted, you have to dive into the HTML to get the information yourself. This means using Requests to get the text of the website, that is, the page's entire source code. Then, you give the source code to BeautifulSoup, which parses it so you can more easily find the information that you want. You can then search the parsed source and get back lists of matching elements, such as only the <tr> tags or only the <a> tags.
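As a rough sketch of that idea, the snippet below parses a small hand-written HTML string instead of a downloaded page, so it runs without a network connection. The HTML, the example.com link, and the table values are invented placeholders; with Requests, the `html` string would instead come from the text of the response for the page being scraped.

```python
from bs4 import BeautifulSoup

# Stand-in for the page source (made-up content, not a real page).
html = """
<p>See the <a href="https://example.com/alloys">alloy list</a>.</p>
<table>
  <thead><tr><th>Alloy</th><th>Value</th></tr></thead>
  <tbody>
    <tr><td>alloy-1</td><td>1.0</td></tr>
    <tr><td>alloy-2</td><td>2.0</td></tr>
  </tbody>
</table>
"""

# "html.parser" is the parser bundled with Python's standard library.
soup = BeautifulSoup(html, "html.parser")

# find_all returns a list-like collection of every matching tag.
links = [a["href"] for a in soup.find_all("a")]

# Build a nested list: one inner list of cell text per <tr>.
table_rows = []
for tr in soup.find_all("tr"):
    cells = tr.find_all(["th", "td"])
    table_rows.append([cell.get_text(strip=True) for cell in cells])

print(links)
print(table_rows)
```

Every value comes back as a string, so numeric columns would still need converting before any calculations.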