Web Scraping Essentials with Python, Requests and BeautifulSoup

Andrei Dumitrescu, DevOps Engineer and Professional Trainer

6 Lessons (38m)
    • 1. Welcome (1:56)
    • 2. Intro to Web Scraping using Requests and BeautifulSoup (3:40)
    • 3. Setup the Environment. Installing Requests and BeautifulSoup (2:42)
    • 4. Diving into Requests HTTP Library (8:46)
    • 5. Diving into BeautifulSoup Library (10:18)
    • 6. Project: Real-World Web Scraping (Requests, BeautifulSoup and OpenPyXL) (10:22)

About This Class

Web Scraping Essentials with Python, Requests, and BeautifulSoup will teach you one of the hottest topics in the data science industry.

Web Scraping (also known as Web Data Extraction, Web Harvesting, Web Crawling, etc.) is a technique used to extract large amounts of data from websites and save the extracted data into a local file or to a database.

In this course, you will learn how to perform web scraping using Python 3, Requests, and BeautifulSoup, probably the most used open-source library written in Python for parsing HTML documents.

Since this is an intermediate Python class, you are required to already master the basics of Python before enrolling. My advice is to first check my other classes on Python published here on Skillshare; they will help you build a strong foundation in the Python programming language.

In this course, you'll get the Web Scraping skills in Python to move ahead!

Major topics of this course:

  • Intro to Web Scraping using Requests and BeautifulSoup
  • Setup the Environment. Installing Requests and BeautifulSoup
  • Diving into Requests HTTP Library
  • Diving into BeautifulSoup Library
  • Project: Real-World Web Scraping (Requests, BeautifulSoup and OpenPyXL)

and more!

Transcripts

1. Welcome: Hello and welcome to this class on Web Scraping Essentials with Python, Requests and BeautifulSoup. My name is Andrei Dumitrescu, and I'll be your instructor for this class, as well as for lots of other classes here on Skillshare. Since this is intermediate Python, you are required to already master the basics of Python before enrolling in this class. My advice is to first check my other classes on Python published here on Skillshare; they will help you build a strong foundation in the Python programming language. Web Scraping Essentials with Python, Requests and BeautifulSoup will teach you one of the hottest topics of the data science industry. Web scraping is a technique used to extract large amounts of data from websites and save the extracted data into a local file or to a database. In this course, you will learn how to perform web scraping using Python, Requests and BeautifulSoup, probably the most used open-source library written in Python for parsing HTML documents. In this course, you'll get the web scraping skills in Python to move ahead. This hands-on course goes straight to the point without any distraction and focuses solely on how to scrape web pages using Python, Requests and BeautifulSoup. If you want to waste no more time with incomplete scripts or tutorials, copy-paste solutions or confusing source code, then this class is for you. See you in the class.

2. Intro to Web Scraping using Requests and BeautifulSoup: In this section we'll start talking about web scraping using Requests and BeautifulSoup. First, let's see what web scraping means. Web scraping is the process of automatically mining, collecting or extracting data from websites using the HTTP protocol. Web scraping is a two-step process. The first step involves fetching or downloading the web page from the server using the HTTP protocol. Once fetched, the second step, which is extraction, can take place. The content of the page may be parsed, searched or copied into a spreadsheet or a database. Web scraping is getting more and more popular every day. It is used for brand or price monitoring, stock market tracking, getting the latest trends, or for building machine learning models. Let me show you some examples of how useful web scraping can be for you. Let's assume you are interested in buying or renting a property. Getting information about every single property listed in a city is not a small task, and for this reason many companies are actually taking the help of web scraping services to get more listings on their websites, as well as to help more people get their perfect property. You could develop your own Python web scraping script that crawls and extracts data about each property listed on major real estate websites. Your script could save the extracted data in a spreadsheet or in a database for you to see. Or, if you're an investor, getting stock exchange closing prices every day can be a pain, especially when the information you need is found across several websites. You could make data extraction easier by building a web scraper to retrieve stock indexes automatically from the Internet. At the beginning of this lecture, I said that the first step of web scraping is fetching, or web page downloading. For this purpose we use a very popular library called Requests. Requests is a Python HTTP library, and its goal is to make HTTP requests simpler and more human-friendly.
It allows you to send HTTP requests and add content like headers, form data, multipart files and parameters such as authentication data from a Python script. It also allows you to access the response data in the same way. After fetching the web page using Requests, we need to pull data out of the HTML files. Here the BeautifulSoup module comes into play. It helps us to parse the web page, extract the desired information and present it in a format we can easily make sense of. This was just an introduction about web scraping and how useful it can be, and in the next lecture we'll set up the environment needed for web scraping, install the Requests and BeautifulSoup modules, and see how to use them. See you in a few seconds in the next lecture.

3. Setup the Environment. Installing Requests and BeautifulSoup: The Requests and BeautifulSoup modules required for web scraping don't belong to the Python standard library, and we have to install them. For that we'll use pip, the Python package manager. So in a terminal I write pip install requests, a whitespace, and bs4. bs4 is in fact the module that contains the BeautifulSoup class used for web scraping. To test whether they are installed correctly, try to import the modules in a Python shell: import requests, bs4. If there is no error, we assume they have been installed correctly. If you get an error such as "No module named requests" or "No module named bs4", there was a problem with the installation of the modules and further troubleshooting is required (this check is condensed into a short sketch just before the coding examples below). Now, if you are using a virtual environment, which is an isolated environment for your application, you need to install the modules in that virtual environment as well, not only globally like we did before. For example, the latest version of PyCharm creates a virtual environment for each new project. In this case we must install the modules in the virtual environment too: go to File, Settings, Project Interpreter, click on the plus sign and write requests. Click on requests, then Install Package, and the package requests is installed successfully. Now we do the same for bs4, and bs4 is installed successfully. Let's test whether they have been installed correctly, so import requests, bs4, and there is no error when running this script. Now that we have everything set up, we'll start working with these modules and see how web scraping is done.

4. Diving into Requests HTTP Library: In this lecture we'll take a closer look at the Requests library. The first thing we'll need to do to scrape a web page is to download the page. We can download pages using Requests. The Requests library will make a GET request to a web server, which will download the HTML contents of a given web page for us. There are several types of requests we can make, and GET is one of them. By the way, another possibility to retrieve data from a web server is to use an API, or Application Programming Interface. Sites like Reddit, Twitter and Facebook all offer certain data through their APIs. APIs are a way to serve customers, and companies are packaging APIs as products; for example, weather websites sell access to their weather data via an API. APIs are applications hosted on web servers. Instead of using a web browser to ask for a web page, your program asks for data, and the data is usually returned in JSON format. The same Requests library is used to communicate with APIs. Anyway, the majority of websites don't offer an API, and in that case manual web scraping is required. However, APIs are not a topic for this section.
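The whole setup from the previous lecture condenses into one terminal command plus an import check. A minimal sketch, assuming pip points at the same interpreter you run the script with (inside a virtual environment, activate it first):

```python
# In a terminal (not in Python):  pip install requests bs4
# Then verify the installation from Python:
import requests
import bs4

# If neither import raises "No module named ...", the setup is complete.
print(requests.__version__)
print(bs4.__version__)
```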
Now let's go coding and see some examples of using the Requests library. First, we import the module. After that, we create a variable called url that holds the address of the website, let's say http://www.python.org. Take care: http is mandatory. If you don't use http or https, you'll get an error. Now I'll create a variable called response = requests.get(url). Let's see what the type of the response object is: print(type(response)), and we see it's a requests Response object. Now let's talk about HTTP status codes. Status codes are issued by a server in response to a client's request made to the server. All HTTP response status codes are separated into five classes or categories. The first digit of a status code specifies one of five standard classes of responses, so there are five values for the first digit. If the first digit is 1, it means that the request was received and the process continues; 2 means success, so the request was successfully received, understood and accepted; 3 means redirection; and 4 and 5 mean errors. Let's see what the HTTP status code of our response object is: print(response.status_code), and we see 200, which means the request was successfully received, understood and accepted. What if we modify the address of the website this way: /a.html? Let's try to access this page in the browser, and the browser displays error 404 Not Found. Now let's try it from Python, and we see the same error code. The status code is now 404, which means that I've requested a resource that doesn't exist, or Page Not Found. I'll make the URL correct again. Another useful attribute is ok. It returns True if the request was successful and False if not. So print(response.ok), and it returned True. But what happens now? Here I add a.txt and execute the script again, and the ok attribute returned False. Okay, the next step is to see the contents of the page. For this purpose there are two attributes: content and text. response.content is the content of the response in bytes; it returns an object of type bytes. I'll execute the script. We see the letter b; it comes from bytes, and this is the content of the page as bytes. Being a bytes object, I can decode it to a string. For that I use the decode method, and I'll execute the script again, and it returned a string. In fact, this is the HTML source of the page; it's the same content I see here. There is also the response.text attribute, which returns the response in Unicode. I'll comment out these lines and write print(response.text), and this is the content of the page in Unicode; in fact, it's the same content we've seen before when we used the decode method. We can see the encoding scheme used to decode when accessing the response.text object using the encoding attribute, response.encoding, and the encoding scheme is utf-8. If we're not happy with the proposed encoding scheme, we can change it by modifying the encoding attribute. Now let's take a look at headers. HTTP headers allow the client and the server to pass additional information with the request or the response. An HTTP header consists of its case-insensitive name followed by a colon, then its value. Using the headers attribute we can see a case-insensitive dictionary of the headers of this response: response.headers, and these are the headers. The header name is Server and its value is nginx; this is the web server. Content-Type is text/html, and so on.
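Put together, the calls walked through in this lecture look roughly like this; a minimal sketch using the same python.org URL as the video:

```python
import requests

url = "http://www.python.org"    # http:// or https:// is mandatory
response = requests.get(url)

print(type(response))            # <class 'requests.models.Response'>
print(response.status_code)      # 200 on success, 404 for a missing page
print(response.ok)               # True for any status code below 400

# response.content is the raw body as bytes; decode it to get a str
print(response.content[:60])                  # b'<!doctype html>...'
print(response.content.decode("utf-8")[:60])

# response.text is the same body already decoded to str (Unicode)
print(response.text[:60])
print(response.encoding)         # encoding used for response.text, e.g. 'utf-8'

# Headers behave like a case-insensitive dictionary
print(response.headers["Server"])
print(response.headers["Content-Type"])
```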
These are the basics of the Requests library, and in the next lecture we'll see how to use the BeautifulSoup library and how to combine Requests and BeautifulSoup to scrape the content of web pages. See you in the next lecture.

5. Diving into BeautifulSoup Library: Hi and welcome back. In this lecture we learn how to scrape using the well-known Python library called BeautifulSoup. This library navigates through the HTML code of the web page and extracts the data. It doesn't actually download the page; for that we use the Requests library. Let's go coding and see exactly how it's done. For this example, I've created a very simple HTML file and put all its content into a variable. This way it will be easier for you to understand in detail what happens. Of course, there is no problem if you want to download the page from the Internet; you just have to import the requests module, get the page, create a response object and pass the response.text object, which is a giant string, to BeautifulSoup. This is what we've done in a previous lecture, and we'll also see some examples in the next lecture. Our html variable contains some basic HTML text: some paragraphs, a div tag, some links and other HTML stuff. You need some very basic HTML knowledge to understand it. We begin the scraping by creating a BeautifulSoup object like this: soup = BeautifulSoup(html, "html.parser"). The first argument is my variable, and the second argument is "html.parser", which is a string. BeautifulSoup can also parse XML content, but in this case we want HTML, so we use "html.parser". It's very important not to forget to import the BeautifulSoup class from the bs4 module, so: from bs4 import BeautifulSoup. bs4 is the module, and BeautifulSoup is the class we'll use. After creating the BeautifulSoup object and parsing the document, we can navigate through the HTML code. There are several ways to navigate: by tag name, using the find method, which finds one matching tag; using the find_all method, which returns all matching tags as a list; and using CSS selectors. Let's start with the tag name. I can write print(soup.title) and it will return the title tag, or I can do print(soup.body); of course, it will return the body of the page. I can go deeper into the body tag and write soup.body.div, and it returns the div tag inside the body. If there are more divs, it returns only the first one; if you need all the divs, you use another method called find_all, and we'll see in a minute what it is all about. Even if the output looks like a string, it's not a string; it's in fact an object of type Tag. I'll save it into a variable, let's say d = soup.body.div, and now print(type(d)), and we see that d is of type Tag. Let's move further and see the find and find_all methods. If I write print(soup.find("div")), it will find and print out the first div tag. This is the div tag. Let's try again using the p tag instead of div: soup.find("p"), and it returned only the first paragraph. If you want all tags of a kind, let's say all p tags, you should use the find_all method, so instead of find I use find_all. And it returned a list that contains all p tags in the HTML code. We see here the first paragraph, the second paragraph, the third paragraph, and so on; we also see the paragraph inside the div tag.
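Here is the navigation part of this lecture in one place; a minimal sketch with a stand-in HTML string, since the exact document from the video isn't shown:

```python
from bs4 import BeautifulSoup

# A small stand-in document; the lecture's actual HTML is assumed similar
html = """
<html><head><title>My Page</title></head>
<body>
  <p>First paragraph.</p>
  <div class="some_class"><p>Paragraph inside the div.</p></div>
  <p id="my_id">Second paragraph.</p>
  <a href="https://example.com">A link</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

print(soup.title)            # the <title> tag
print(soup.body.div)         # the first <div> inside <body>
print(type(soup.body.div))   # <class 'bs4.element.Tag'>, not a string

print(soup.find("p"))        # only the first <p> tag
print(soup.find_all("p"))    # a list of all <p> tags, including the nested one
```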
Now, if we want the text inside a tag, we use the text attribute of any Tag object: soup.find("div").text, and it returned the text inside my div tag. find_all returns a list, so I can also iterate over that list, let's say for x in soup.find_all("p"); it will return all p tags as a list, and then print(x.text) prints out the text of every paragraph tag in our HTML code. The find_all method also takes a list as its argument. For example, print(soup.find_all(["div", "p"])); this is a list whose first element is div and whose second element is p. It will find and return all div and p tags as a list. This is the list; it contains all p and div tags. The next thing I'll talk about is selecting using tag attributes, such as id or class. In our HTML code there is a p tag that has an id called my_id. If I want to select all p tags that have that id, I write soup.find_all("p", id="my_id"); the first argument is the name of the tag, p, and the second argument is id= and the name of that id, in this case my_id. And it selected only the p tag that has that id. Now, if I want to select by class, for example to find and retrieve whatever div tag has a class called some_class, I write soup.find_all("div", class_="some_class"); the first argument is the tag name, div, and the second argument is class_ (with a trailing underscore) and the name of the class, some_class. And it selected and returned only the div that has that class. Take care: the name of the function's argument is class with a trailing underscore, not class; class is a reserved keyword in Python, and it's not recommended to use it as a name for a variable or an argument. In the last example of this lecture, I'll show you how to find and retrieve all anchors, or hyperlinks, in a page. In our example there are two links. I create a variable called links = soup.find_all("a", href=True); the first argument is the name of the tag, in this case a, and the second argument is href=True. And print(links) returned a list; it contains the links of my page. It's better to use the href=True argument because there can be a tags without an href attribute. If I just want to retrieve the href value, or the URL, the address of the hyperlink, I write: for link in links: print(link.get("href")), with the name of the attribute, in this case href. It will return the value of an attribute of a tag, and these are the addresses, or the href values. In the next lecture we'll start a real web scraping project using Requests and BeautifulSoup.
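And the selection examples from this lecture, consolidated; a minimal sketch over the same kind of stand-in document as before (the extra anchor without an href is added to show the href=True filter):

```python
from bs4 import BeautifulSoup

# Stand-in document, assumed similar to the lecture's HTML
html = """
<p>First paragraph.</p>
<div class="some_class"><p>Paragraph inside the div.</p></div>
<p id="my_id">Second paragraph.</p>
<a href="https://example.com">A link</a><a>No href here</a>
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.find("div").text)                      # text inside the first <div>

for p in soup.find_all("p"):                      # iterate over every paragraph
    print(p.text)

print(soup.find_all(["div", "p"]))                # all <div> and <p> tags, one list

print(soup.find_all("p", id="my_id"))             # select by id
print(soup.find_all("div", class_="some_class"))  # class_ because class is reserved

links = soup.find_all("a", href=True)             # only anchors that carry an href
for link in links:
    print(link.get("href"))                       # the URL of each hyperlink
```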
6. Project: Real-World Web Scraping (Requests, BeautifulSoup and OpenPyXL): In this lecture we'll start the real-world web scraping project using Requests and BeautifulSoup. We start with a disclaimer: while scraping can sometimes be used as a legitimate way to access all kinds of data on the Internet, it's also important to consider the legal implications. There are many cases where scraping data may be considered illegal. This lecture is purely intended to be educational and nothing else. Before starting scraping, check the policies of the individual websites, and scrape at your own risk. Now let's take a look at IMDb, the Internet Movie Database, which is the world's most popular and authoritative source for movie, TV and celebrity content. I've sorted all movies released in 2019 by number of votes, in descending order. On each page you see the first 50 movies. I want to fetch the page using Requests and then parse and retrieve some data using BeautifulSoup. I am interested in the movie name and its rating. After retrieving the name and the rating, I want to save them in an Excel file for future use. Before starting the project, let's discuss some aspects. All the pages we want to scrape must have the same overall structure. This implies that they also have the same overall HTML structure. So to write our script, it will be enough to understand the HTML structure of only one page. To do that, we use the browser's developer tools. If you are using Mozilla Firefox or Google Chrome, right-click on a page element that interests you and then click Inspect. This will take you right to the HTML line that corresponds to that element. You can also do this in both Microsoft Edge and Safari. Notice that all of the information for each movie is contained in a div tag, and that div tag has a class attribute named lister-item-content. There are a lot of HTML lines nested within each div, and you can explore them by clicking those little grey arrows on the left of the HTML lines corresponding to each div. Within these nested tags we'll find the information we need, like a movie's name or a rating. Now let's start coding. First we import the requests library; then we import the BeautifulSoup class from the bs4 module. Then I'll create a url variable that stores the address of the website. Like before, we create the page object: page = requests.get(url). If I want to test the script up to this point, I write print(page.ok); if it returned a successful status code, the ok attribute will be True, and it is True. Then I'll create a soup object: soup = BeautifulSoup(page.content, "html.parser"). Now we start the selecting and data retrieving process. We saw earlier that the information for each movie is inside a div, and that div has a class attribute with the value of lister-item-content. So I write content = soup.find_all with the tag name div, and the second argument is class (with a trailing underscore) = the name of the class, lister-item-content; there are many movies, so many div tags. Then I create a variable of type list called movies, where I'll save the name and the rating of each movie: movies = [], an empty list. Now I'll iterate over the content, so for item in content: and the for block of code. Let's check the content of the page one more time using Inspect Element. We begin with the movie's name and locate its corresponding HTML line by using Inspect Element. You can see that the name is contained within an anchor tag, and this tag is nested within a header tag, an h3 tag, which in turn is nested within our div tag, the tag we've selected. So in order to retrieve the name of the movie, I write name = item.h3.a.text; I only want the text of the anchor. Now that we've got the name of the movie, we'll focus on extracting its IMDb rating. Using Inspect Element again, we notice that the rating is contained within a strong tag, so: rating = item.strong.text. Now that I also have the rating, I'll append both the name and the rating to the movies list, so movies.append, and I'll append them as a tuple: (name, rating). I'll have a list of tuples, and I can check the content of the movies list to see whether I've fetched the information correctly: print(movies). Let's execute the script, and it seems OK.
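Consolidated, the scraping half of the project looks roughly like this; a sketch, assuming the IMDb search URL below (an illustrative guess, since the exact address isn't spelled out in the transcript) and assuming the page still uses the lister-item-content class it had at recording time:

```python
import requests
from bs4 import BeautifulSoup

# Illustrative URL: 2019 releases sorted by number of votes, descending
url = "https://www.imdb.com/search/title/?release_date=2019&sort=num_votes,desc"

page = requests.get(url)
print(page.ok)  # should be True before we try to parse anything

soup = BeautifulSoup(page.content, "html.parser")

# Each movie's data sits in a <div class="lister-item-content">
content = soup.find_all("div", class_="lister-item-content")

movies = []
for item in content:
    name = item.h3.a.text      # the name: an <a> nested inside the <h3> header
    rating = item.strong.text  # the IMDb rating: inside a <strong> tag
    movies.append((name, rating))

print(movies)  # a list of (name, rating) tuples
```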
Keep in mind that this is a trial and error process. Maybe you don't retrieve the information on the first try, but you have to keep trying; in this case, we were lucky. Now that we have the information, it's time to save it to an Excel file. I'm importing the openpyxl module, and I write wb = openpyxl.Workbook(); I am creating an in-memory workbook, so a new Excel file. Then I get a reference to the active sheet: sheet = wb.active. I am setting the title: sheet.title = "IMDB Movies". And I'll write some cells. These are the headers, so sheet["A1"] = "Movie" and sheet["B1"] = "Rating"; this will be the header line of the spreadsheet. Having the movies and their ratings in a list, I'll iterate over the list and append each movie and its rating to the file. So for movie in movies: sheet.append(movie). After appending the information, I must save the file; in fact, this will create a new file, a new workbook: wb.save, with the name of the file, let's say movies2019.xlsx. And I'll run the script. Okay, there's no error. Let's see if a new file called movies2019.xlsx appeared in the current working directory, and this is the file. Let's open it, and inside the spreadsheet we see the content we retrieved from the website.
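And the saving step, condensed; a minimal sketch assuming openpyxl is installed (pip install openpyxl), with placeholder rows standing in for the scraped list:

```python
import openpyxl

# Placeholder data standing in for the scraped (name, rating) tuples
movies = [("Movie One", "8.0"), ("Movie Two", "7.5")]

wb = openpyxl.Workbook()        # a new in-memory workbook
sheet = wb.active               # reference to the active sheet
sheet.title = "IMDB Movies"

sheet["A1"] = "Movie"           # header row
sheet["B1"] = "Rating"

for movie in movies:            # each tuple becomes one spreadsheet row
    sheet.append(movie)

wb.save("movies2019.xlsx")      # writes the file to the current working directory
```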