Build a News Web Scraper with Python: A Beginner’s Project | Navid Ansari | Skillshare

Build a News Web Scraper with Python: A Beginner’s Project

Navid Ansari

Watch this class and thousands more

Get unlimited access to every class
Taught by industry leaders & working professionals
Topics include illustration, design, photography, and more

Lessons in This Class

  • 01 Web Scraper introduction (0:28)
  • 02 Load BBC site (6:03)
  • 03 Get first heading of BBC news (4:25)
  • 04 Get all headlines (3:11)
  • 05 Show all headlines with numbers (2:22)
  • 06 Skip headline that is short (1:20)
  • 07 For loop with counter (2:42)

2 Students

About This Class

Unlock the power of Python with this beginner-friendly project course! In just one hour, you’ll learn how to build your own web scraper that automatically fetches the latest news headlines from BBC.

Whether you're brand new to Python or looking to practice real-world coding skills, this hands-on course will guide you step-by-step using:

  • requests to access web pages

  • BeautifulSoup to parse HTML and extract data

  • Clean, readable Python code to structure and display the results

By the end of this class, you’ll not only understand how to collect real-time data from websites, but also have a useful and customizable tool you can build on for scraping other sites.

No prior web scraping experience needed—just basic Python and a curious mind!

Meet Your Teacher

Navid Ansari

Teacher

Hello, I'm Navid.

Level: Beginner

Transcripts

1. 01 Web Scraper introduction: In this Python practice, I want to teach you how you can scrape a website like BBC to get all the news off of it and show it to the user. This is for beginners, and you will learn the basics of web scraping. You will learn how to use requests to get a site, and how you can use Beautiful Soup to scrape through a web page and show the headlines. So let's begin.

2. 02 Load BBC site: For creating our web scraper, we need to create a project. In here, I will create a folder and call it web scraper. On the side, I open it up inside VS Code: in here, Open Folder, go to Desktop, web scraper, Select. And in here, I want to create a new file of type Python file. Then Control+S, and I want to call it web scraper, hit Save, and put .py at the end of it. Now, in here, I want to close this one. We need to request a site and get it, but which site do I want to get? In here, I want to search for bbc.com, right? For example, let's just open it. If you go to the News tab, you can see we have bbc.com/news, and we want to get all of the headlines that you can see in here. For example, at this time, it says "Trump and Musk trade insults as row...", right? We want to get the headlines of all of these news items and show them to the user. So we need this link. For now, let's just leave it at that. Now, the next thing I want to do is open a terminal. The first thing I want to do in here is say pip install requests, right? With this requests library, we can get bbc.com/news, open it up, and see the headlines. Hit Enter. Because I already installed it, it will be installed really fast, but for you, it will take a little bit of time. Now, with that done, we want to say import requests, right? This is the library that we want to use.
Now, the URL that we want to use, we want to save it inside a variable so it will be easier to work with. Let me copy it from here and paste it over here. That's it. Now we want to use the requests library to get the URL. So put the URL in there and press Control+S to save. This will give us a response, and we can save it inside something. I will save it inside response, so we set response equal to this, right? And now, let's just print the response and see what it gives us. Control+S to save. Let's open up the terminal in here; I want to make it big. I will say py web, hit Tab, hit Enter. Wait a little bit so it will give us something. It says Response 200, right? That means it successfully connected to bbc.com/news and got the data. But how can we see the data? If I click in here, I add a breakpoint. Now I can go to Run and Debug and click on Run and Debug, and it will ask which debugger you want to run. Click on Python Debugger, because we are writing in Python, and after that it will ask which file you want to debug. We say Python File, debug the currently active Python file, the one that we are working on, right? Click on it, and it will run the code for us. When it reaches the line where we added the breakpoint, it will wait there. Let me make this bigger, and you can see all the variables: for example, the URL that we created together, the response that we created together, and some special variables that are built in. We don't need to talk about them; the important thing is the response. We can copy this response and add it to the watch list: put it there and hit Enter. Now let's look at the response. If you go inside it, you can see there are lots of things: for example, the content, which is just an HTML file, the cookies, the encoding.
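
A minimal sketch of the steps up to this point. The full address https://www.bbc.com/news, the timeout value, and the helper names load_page and summarize are assumptions for illustration; the video only shows the URL stored in a variable and passed to requests.

```python
import requests

# Assumed full form of the address spoken as "bbc.com news" in the lesson.
URL = "https://www.bbc.com/news"


def load_page(url: str) -> requests.Response:
    """Send a GET request and return the response object."""
    # The timeout is an addition; the lesson passes only the URL.
    return requests.get(url, timeout=10)


def summarize(response: requests.Response) -> dict:
    """Collect a few of the attributes that the debugger's watch list shows."""
    return {
        "status_code": response.status_code,
        "encoding": response.encoding,
        "url": response.url,
        "text_length": len(response.text),
    }
```

Printing the object returned by load_page(URL) shows <Response [200]> when the request succeeds, and summarize gives a quick look at the same fields without opening the debugger.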
And if you go down, you can see the headers, and you can see the links. You can see the status code as well; we want to use this status code. After that, you see the text. Again, it is the HTML file that it gets. You can see the URL, and you can see lots of stuff in here that we want to get. The first thing that we want to get is the status code, right? You can see the status code is 200. It means we connected to the web server correctly. Now I can stop the debugging; I don't need that anymore, so I can move this to the left so you can see everything. Now, with that done, instead of printing the response, I want to check: if the response status code is 200, we can print, for example, "Success, website is loaded". That's it. But in the else case in here, it's not 200. Whatever it is, it means we are not connected to the site. So in here, with an f-string, I can say "Failed status code", after that a colon, and after that curly brackets, and I can get the status code from response.status_code. So it will show us what the problem is when it can't connect. Now I will click in here to get rid of that breakpoint. Now let me run. Make sure you save and after that run, and you can see it says "Success, website is loaded", and that's it. If we can connect to the website, it will say success, but if we can't, it will say failed.

3. 03 Get first heading of BBC news: Now, with this response that we get from BBC, we know we have the whole HTML code of the BBC page inside our code. Now we want to use a library called Beautiful Soup to scrape through all of these properties. In Google, if you search for Beautiful Soup Python and open up the PyPI site, it will tell you how to install it and how to work with it. It has documentation as well. But all we need in here is to click here to copy this pip install beautifulsoup4. After that, go to VS Code. Let me open up the terminal like this.
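
The status check described in lesson 2 can be sketched as a small function. The function name and the exact message wording are assumptions; in the video the check is written inline against response.status_code.

```python
def status_message(status_code: int) -> str:
    """Return the message to print for a given HTTP status code."""
    if status_code == 200:
        return "Success, website is loaded"
    # Any other code is treated as a failed connection, as in the lesson.
    return f"Failed status code: {status_code}"


print(status_message(200))
print(status_message(404))
```

Keeping the check in a function makes it easy to try different status codes without contacting the site.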
Hold Control and Shift and hit V to paste pip install beautifulsoup4. You can also just write it down, but copying it is easier. Hit Enter, and it will try to install it, and you can see how easily it gets installed, right? That's beautiful. Next, we need to import it. So I will say from bs4 import BeautifulSoup. That's it. That's more than enough. Now, in here, after we say success, website is loaded, I want to use this BeautifulSoup to parse the HTML that we get from the response. So we say response.text; if you remember, in the last video we had a text that gives us the whole HTML, right? After that, we should tell BeautifulSoup how to parse it, how to decode it. The parser that we want to use is called html.parser. Now, this will give us an object that we can save somewhere. I want to save it inside a variable, and I want to call it soup, right? Now, with this soup, we can get some stuff. Let me open up the BBC in here. For example, just right-click in here, go to Inspect, click in here, and, for example, choose this one. You can see this one is a header of h2. Let me zoom in a little bit so you can see it better, and make this bigger like this. Click in here and click in here. You can see this is an h2 tag. If we want to get the content of this, which says "Trump and Musk trade insults", we need to tell it to give us the h2, right? So let's see how we can do that. In here, we say soup.find, okay? h2. That's it. Now, what does this find do? If you hover over it, it will tell you: look in the children of this PageElement and find the first PageElement that matches the given criteria. That is h2. Let's see what this will give us. I want to save this inside a variable, and I will call it first_heading, equal to this. Now let's show that with print.
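
As a rough illustration of parsing and find, here is a sketch that runs on a small made-up HTML string instead of the live BBC page. SAMPLE_HTML and its headlines are invented for the example; in the video the input is response.text.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for response.text, so the example runs offline.
SAMPLE_HTML = """
<html><body>
  <h2>First sample headline</h2>
  <h2>Second sample headline</h2>
</body></html>
"""

# "html.parser" is Python's built-in HTML parser.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find returns only the first matching tag (or None when nothing matches).
first_heading = soup.find("h2")

# Printing the tag shows the whole element, including the <h2> markup.
print(f"\n{first_heading}")
```

Note that the printed value is the complete tag, not just its text, which is the issue the lesson turns to next.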
In here, I want to use backslash n so we go to a new line, and I want to use an f-string to show the first heading: in here, I use curly brackets and put first_heading there. Control+S to save. Now let me run. With that, you can see it did give us the h2, and in there it says "Trump and Musk trade insults", right? But we don't need the complete tag. We need just the text inside it. So in here, we can tell it to get the text for us with get_text. This is a method, so we can call it. Now let's test that. In here, I want to run again. And now you can see "Trump and Musk trade insults as row erupts in public view", right? Beautiful. We are getting the first heading of the BBC correctly. We specify the tag, and it gives us the first heading, which is this one. In your case, it will be different because every day there is new news. But in my case, you can see we did get the first heading.

4. 04 Get all headlines: The next thing that I want to do is get all the headlines and show them to the user. Let's see how we can do that. For that, I don't want to do it like this, so let's delete that. We say soup.find_all, right? So we find all the h2 tags. If you go here and hover over this, you can see it is h2. This one is h2 as well. If I hover over this, you can see on the right side it is h2. If I click on it, you can see it is h2. Again, click in here, go down. It doesn't matter which one; you can see it is h2, right? Now, with that done, let's see what this will give us. First, I want to save what find_all returns inside a variable, and I want to call it headlines. It's equal to this. Now I want to just print it to see what this gives us: headlines. Control+S to save. Now let's run, and you can see there are a lot of them, and it has everything. But as you can see, it is just a list.
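
The difference between find with get_text and find_all can be sketched on a made-up snippet. SAMPLE_HTML is invented for the example and stands in for the page's HTML.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML with two h2 tags and one unrelated paragraph.
SAMPLE_HTML = (
    "<h2> Sample headline one </h2>"
    "<p>Body text</p>"
    "<h2>Sample headline two</h2>"
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# get_text returns only the text inside the tag, without the <h2> markup.
first_text = soup.find("h2").get_text()
print(first_text)

# find_all returns a list-like collection of every matching tag.
headlines = soup.find_all("h2")
print(headlines)
```

The text returned by get_text keeps any surrounding spaces, which is why the later lessons call strip on it.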
A list of h2 tags. You can see it's just a list: it starts with a square bracket, and if you go down, you can see it ends with a square bracket as well. That's beautiful, right? We can use that. First of all, I want to say print, open and close parentheses, after that double quotation marks, and in here I want to use backslash n first. I want to say "Today's headlines", right? And we don't need to say success, website is loaded, because the response is 200, so it is loaded. After that, I want to use a for loop to show all the headlines. So I will say for headline in headlines, right? From each one of these headlines, we want to get the text with get_text. After that, I want to strip it, because it gives us a string and we can strip it, so we delete the spaces at the beginning and at the end of it. Now we can save this somewhere; for example, let's save it inside a text variable. Now we can show that: we can print the text. Let's see what happens. Bring this up, make sure you save, and run. Now you can see all the headlines are here, right? It says "Today's headlines", then "Trump and Musk", after that "Washington buckles up", and one problem I can see in here is that we don't know where each headline starts and where it ends. So we need to add a number to each one of these headlines. We can do it with enumerate, with range, or with a counter. You decide which way you want to do it; we will pick one of them and do it in the next video.

5. 05 Show all headlines with numbers: Now we want to add numbers to each one of the headlines, right? There are lots of ways to do it. I want to do it with enumerate because I want to show you something new. So instead of just saying for headline in headlines, I want to use enumerate. So I will say for index, headline in, and in here, we say enumerate, right?
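
The loop from lesson 4 might look like the sketch below. SAMPLE_HTML is invented, and the exact heading string is an assumption based on what is read out in the video.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML; the first headline has extra whitespace around it.
SAMPLE_HTML = """
<h2>
    Sample headline about politics
</h2>
<h2>Most read</h2>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
headlines = soup.find_all("h2")

print("\nToday's headlines:")
for headline in headlines:
    # strip removes spaces and newlines at the start and end of the text.
    text = headline.get_text().strip()
    print(text)
```

Without numbers, the output is one headline per line, and short section titles such as "Most read" are mixed in with the real headlines.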
In the open and close parentheses, let me put headlines. Now what will happen? It will give an index to each headline inside headlines, and it will give us that index number inside the index variable. And because I don't want enumerate to start from zero, in here I can say start equals one, right? That's it. Now it will start from one. Now, how can we use this index? Instead of just showing the text, we can use an f-string in here: curly brackets with index, after that a dot, after that a space, after that curly brackets again, and now we can show the text. With that done, Control+S to save, and now let's run. Here, let me go down, hit the up arrow key, hit Enter, and now let's see. We have lots of them, right? If you go up, you can see "Trump and Musk", and you can see there are a lot of them. But there is one more problem: we want to say, if a headline has fewer than this many characters, don't show it. For example, "Also in news": we don't need to see it. This "Most watched": we don't need to see it. This "Most read": we don't need to see it, right? So, for example, we can say, if the length of the headline is less than, let's count, one, two, three, four, five, six, seven, eight, nine, ten, 11, 12. Let's just say 15. If the length of the headline is less than 15, don't show it. If it is bigger, show it. You try to do that. We will do it in the next video.

6. 06 Skip headline that is short: Now, how can we check that the length of the headline is less than 15 and not show it? For that, in here, we want to check the text: if there is a text, okay, and the length of that text is bigger than or equal to 15, in that case you can print the headline, right? That's it. Control+S to save. Now let's check that out in here: hit the up arrow, hit Enter. Make sure you save before running. And now you can see all the headlines look really good.
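
A sketch of the enumerate version with the length check, using plain strings in place of the h2 tags so it runs without a network connection. The sample headlines and the function name are made up for illustration.

```python
# Hypothetical, already-stripped headline texts.
sample_headlines = [
    "Sample headline about the economy",
    "Most read",
    "Another sample headline for testing",
]


def numbered_with_enumerate(headlines, min_length=15):
    """Number headlines with enumerate and skip the short ones."""
    lines = []
    # start=1 makes the numbering begin at 1 instead of 0.
    for index, text in enumerate(headlines, start=1):
        if text and len(text) >= min_length:
            lines.append(f"{index}. {text}")
    return lines


for line in numbered_with_enumerate(sample_headlines):
    print(line)
```

Because enumerate counts every item, including the ones that are skipped, the printed numbers jump from 1 to 3 here.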
But we have one problem: for example, in here, from 33 we are going to 35 all of a sudden. Why is this happening? It's because we have some headlines that we are not showing because their length is less than 15. To fix that, we need to change this for loop completely to a for loop with a counter, so we decide when we add to the counter, right? You try to do that. You try to do a for loop with a counter, and we will do it in the next video.

7. 07 For loop with counter: Now, for changing the for loop, I want to go down in here. I want you to see everything side by side. In here, I will create a variable, call it index, and set it equal to one because we want it to start from one. After that, we say for headline, like before, in headlines, right? First, we want to get the headline text. So we say headline.get_text, right? After that, we want to strip it: strip. That's it. Now we want to save it inside a variable. I will call it headline_text; it is equal to this, right? After that, if headline_text exists and the length of headline_text is bigger than or equal to 15, in that case, first, you can show it with this print: let's copy it and paste it over here and change text to headline_text and index to the index. Now everything is working fine. If this happened correctly, in this case we can increase the index, right? So we say index plus equals one. In that case, we are increasing the index only for the headlines we show, not for the headlines that have fewer characters, right? Now, with that done, we can delete all of these, or we can hold Shift, drag from top to bottom, and put a hash sign at the beginning of each line. With that done, it will be commented out and it won't be executed. Now, if you go up and run, and look at all the numbers, you can see it is okay. Let me see: 31, 32, 33, 34, 35, 36, 37. And you can see everything is working fine.
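
The counter-based loop can be sketched the same way, again on made-up strings rather than real h2 tags. The function name and sample data are assumptions for the example.

```python
# Hypothetical headline texts; the second one is too short to show.
sample_headlines = [
    "Sample headline about the economy",
    "Most read",
    "Another sample headline for testing",
]


def numbered_with_counter(headlines, min_length=15):
    """Number only the headlines that are actually shown."""
    lines = []
    index = 1
    for headline in headlines:
        headline_text = headline.strip()
        if headline_text and len(headline_text) >= min_length:
            lines.append(f"{index}. {headline_text}")
            # The counter only moves forward when a headline is kept.
            index += 1
    return lines


for line in numbered_with_counter(sample_headlines):
    print(line)
```

Here the numbers run 1, 2 with no gap, since skipped headlines no longer consume a number.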
And we are showing all the headlines of the BBC. There is just one in here, "More to explore". If you count it: one, two, three, four, five, six, seven, eight, nine, ten, 11, 12, 13, 14, and 15. That's a problem. So to fix that, we can just change this to 20, right? Why not? Just put 20 in there. And now, I don't think any headline of the BBC will be shorter than 20. So now let's run it one more time and look at it. Everything looks awesome. Now we are getting all the headlines of BBC News. Congratulations on finishing this project.
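
Putting the lessons together, the finished scraper could look roughly like the sketch below. It is not the instructor's exact file: the function names, the full URL form, the timeout, and SAMPLE_HTML are assumptions, and the demo at the bottom runs on the sample string so nothing is fetched unless run() is called.

```python
import requests
from bs4 import BeautifulSoup

URL = "https://www.bbc.com/news"  # assumed full form of "bbc.com news"
MIN_LENGTH = 20  # raised from 15 so "More to explore" (15 characters) is skipped


def extract_headlines(html: str, min_length: int = MIN_LENGTH) -> list[str]:
    """Return the stripped text of every h2 that is long enough."""
    soup = BeautifulSoup(html, "html.parser")
    kept = []
    for headline in soup.find_all("h2"):
        headline_text = headline.get_text().strip()
        if headline_text and len(headline_text) >= min_length:
            kept.append(headline_text)
    return kept


def print_headlines(html: str) -> None:
    """Print the kept headlines with a counter that has no gaps."""
    print("\nToday's headlines:")
    index = 1
    for headline_text in extract_headlines(html):
        print(f"{index}. {headline_text}")
        index += 1


def run(url: str = URL) -> None:
    """Fetch the live page and print its headlines (needs network access)."""
    response = requests.get(url, timeout=10)
    if response.status_code == 200:
        print_headlines(response.text)
    else:
        print(f"Failed status code: {response.status_code}")


# Offline demo on hypothetical HTML.
SAMPLE_HTML = "<h2>More to explore</h2><h2>A sample headline long enough to keep</h2>"
print_headlines(SAMPLE_HTML)
```

Calling run() fetches the real page. The h2 selector and the length threshold are tied to how the BBC page looked in the video and may need adjusting if the site's markup changes.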