Web scraping can be very useful when you need data from a website that has no API. In this tutorial, I will demonstrate a simple Python program that scrapes a list of open source books off a website. You will only need a basic understanding of HTML and Python to pull this off.
First, let us head over to the page (I will be using Chrome) that we will be scraping. Our objective is to get a list of books from the catalog by specifying their discipline (“Biomedicine” in this tutorial). Take note of the URL parameters and the number of results, because we will need those later.
Next, press Ctrl+Shift+I to open the developer tools on the side of the page. At the top of the panel, click the Elements tab to inspect the HTML elements on the page. Part of your screen should look like this.
Now, if you mouse over the HTML elements, you will see that certain parts of the page become highlighted. Once you have found the part you want, expand the element to see its nested elements until you reach a list of identical elements. Mouse over them to see if each of them highlights an individual book in the catalog. If they do, take note of the elements and their class names.
For the final step, click into one of the li elements to see its nested elements. Write down the elements and class names that hold the content you want. For example, the book’s URL is in an ‘a’ element with the class name ‘link’.
Now we can begin writing our program. Before you start, make sure you have installed Beautiful Soup and Requests by running pip install bs4 and pip install requests. Depending on how you want the output to look, you may also need a library to write the file; in this case we will use Python’s built-in csv module, because I want to view the output in Excel.
Start by importing the necessary libraries.
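A minimal set of imports might look like this. requests and bs4 are the third-party packages installed above, while csv ships with Python.

```python
import csv  # built into Python, used to write the output file

import requests
from bs4 import BeautifulSoup
```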
First, we need to define a function that sends a GET request to the URL. The response to the request is parsed by Beautiful Soup and assigned to the soup variable. This variable includes many things that we do not need, so we have to narrow it down to a list containing only the results we want.

Recall that in the earlier steps we found the elements that house the books we want, and that they all have a class called “has-cover”. Call the find_all() function on soup with the parameter class_ set to that class name. This finds every element with the specified class name and returns them as a list. Assign this list to the books variable and return it.
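Putting those two paragraphs together, a sketch of the function could look like the one below. The ‘has-cover’ class comes from the inspection step above; the function name get_books and the generic url parameter are my own placeholders.

```python
def get_books(url):
    # Send a GET request and parse the response HTML with Beautiful Soup
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Keep only the elements that represent individual books in the catalog
    books = soup.find_all(class_='has-cover')
    return books
```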
Now that we have a list of the books, we need to extract only the information we want. If you print the list right now, it will print everything in each element, including its HTML tags.
Start by iterating through the list of books and extracting the contents of the nested elements. The code below may seem a little messy (I am still a beginner in Python), but I have removed some lines to make it more readable.
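Here is a trimmed-down sketch of that function. The ‘title’ and ‘link’ class names come from the inspection steps above, but the ‘authors’ span and the extra csv_writer parameter are assumptions on my part; adjust them to match the page you are scraping.

```python
def extract(books, csv_writer):
    for book in books:
        # Text content of the <a class="title"> element
        title = book.find('a', {'class': 'title'}).get_text()

        # The href attribute of the <a class="link"> element holds the book's URL
        link = book.find('a', {'class': 'link'})['href']

        # A book can have several authors, so select() returns a list of
        # <a> elements ('authors' is an assumed class name)
        authors = []
        for author in book.find('span', {'class': 'authors'}).select('a'):
            authors.append(author.get_text())

        # Compile everything into one row and append it to the CSV file
        row = [title, link, ', '.join(authors)]
        csv_writer.writerow(row)
```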
I have defined a function called extract which takes in the list of books and writes each of them to a CSV file that Excel can open. The function loops through the list of books and pulls out the title, link and author fields.

Let us dissect the code more. To get the title, the find() function returns the first element that matches its parameters: the first parameter ‘a’ denotes the ‘a’ element, while the second parameter, in curly brackets, looks for a class with the name ‘title’. The call is chained with another function named get_text(), which retrieves the text content of the element instead of the entire element. For instance, book.find(‘a’) will return ‘<a>content</a>’ whereas book.find(‘a’).get_text() will return ‘content’.

The link is extracted with a similar call, but it ends with square brackets [‘href’] instead of get_text(). The square brackets retrieve a specific attribute of the element. In this case, we want to extract the value of ‘href’ from ‘<a href=’asd’>not href</a>’, so we use the square brackets instead.

For the authors, the same find() function is used, but it is chained with another function called select(‘a’). The select() function returns a list of elements, because a book may have more than one author. I create an empty list and append each author’s name to it.

Finally, all the information is compiled into a ‘row’ list, and csv_writer takes that row and appends it to the CSV file. If you do not have an existing CSV file, you can create one as shown below.
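A minimal sketch of creating the file and tying everything together, assuming a hypothetical catalog URL and output filename:

```python
# Hypothetical catalog URL; use the one you noted down earlier,
# including its discipline parameter
URL = 'https://example.com/books?discipline=Biomedicine'

# Create the CSV file with a header row, then scrape and write the books
with open('books.csv', 'w', newline='') as csv_file:
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(['Title', 'Link', 'Authors'])
    extract(get_books(URL), csv_writer)
```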
And that is how I scraped a website. The process is much the same for most ordinary websites; you just need to understand how the page is structured. However, this is not applicable to pages that “post-render” their HTML with JavaScript or other tools. Scraping those pages is more sophisticated, but there will be another tutorial for it if I have time.
I wrote this tutorial mainly for myself because I tend to forget. If you find anything wrong with it, please let me know so that I can correct my mistakes in the future. Thank you for reading!