Transcripts
1. 01 Web Scraper introduction: In this Python practice, I
want to teach you how you can scrape a website like BBC
to get all the news off of it and show it to the user.
This is for beginners, and you will learn the basics of web scraping.
You will learn how to use requests to get a site, and how you can use
Beautiful Soup to scrape through a web page and show the headlines.
So let's begin.
2. 02 Load BBC site: For our web scraping, we need to create a project. In here, I will create a
folder and I will call it web scraper. Then open it up inside
VS Code: in here, Open Folder, go to Desktop,
web scraper, Select. And in here, I want to create a new file of type
Python file. And in here, Control S, and I want to call
it web scraper, hit Save, and put the
.py at the end of it. Now, in here, I want
to close this one. We need to request a
site and get a site, but what site do I want to get? In here, I want to search
for bbc.com, right? For example, let's just open it. And if you go to the News tab, you can see we
have bbc.com/news, and we want to get all of these headlines that
you can see in here. For example, at this
time, of course, it says Trump and Musk trade insults,
for example, right? All of these
headlines, of all of these news items, we want to get
them and show them to the user. So we need this link. For now, let's just leave it at that. Now, the next thing
that I want to do is open a terminal.
The first thing that I want to do in here is say
pip install requests, right? With this requests library, we can get
bbc.com/news, right? We can open it up and see the headlines. Hit Enter.
Because I already installed it, it installs
really fast, but for you, it will
take a little bit of time. Now, with that done,
we want to say import requests, right? This is the library
that we want to use. Now, the URL that
we want to use, we want to save it
inside a variable, so it will be easier
to work with. And in here, let me
copy it from here, just like this, copy it, and now paste it over here. That's it. Now, we want to use the requests
library to get the URL, so we say requests.get. Put the URL there
and Control S to save. This will give us a response, and we can
save it inside something. I will save it
inside response. We can set it equal to this, right? And now, let's just
print the response. Let's see what it will give us, right? Control S to save. Let's just open up
the terminal in here. I want to make it big. I will say py web, hit Tab, hit Enter. Wait a little bit, and it
will give us something. It says Response 200, right? What it means is that it
successfully connected to bbc.com/news and
it got the data. But how can we see the data? If I click in here, I will add a breakpoint. Now I can go in here, Run and Debug, and click
on Run and Debug, and it will ask you
which one of these debuggers you want
to run with. You click on Python Debugger because we
are writing in Python, and after that, it will ask you which file you want to debug. We say Python File, debug the currently
active Python file, the one that we are
working on, right? Click on it, and it will
run the code for us. Now, when it reaches the line
where we added the breakpoint, it will wait there. Let me make this bigger, and you can see all the
variables: for example, the URL that we created together, the response that
we created together, and some special variables
that are built in, right? We don't need to
talk about them. The important
thing is the response. We can just copy this response and add it to this
watch list, right? Paste it and hit Enter. Now, with this, let's just
look at the response. If you go inside it, you can
see there are lots of things. For example, the content,
which is just the HTML file, the cookies, the encoding. And if you go down, you
can see the headers, and you can see the links. You can see the
status code as well. We want to use this status code. After that, you see the text. Again, it is the HTML file
that it got. You can see the URL, and you can see lots of stuff in here that
we can get. The first thing that we want to get is the status code, right? You can see the status code is 200. It means we connected to
the web server correctly. Now, I can stop
the debugging, right? I don't need that anymore, so I can bring it left, so you can see
everything, right? Now, with that done, instead
of printing the response, I want to check: if the
response status code is 200, we can print, for example, Success, website
is loaded, right? That's it. But in
the else case in here, it's not 200. Whatever it is, it means we are not connected
to the site. So in here,
with an f-string, I can say Failed, status code,
after that a colon, and
after that curly brackets, and I can get the
status code from response.status_code. So it will show us what the problem is when it can't
connect. Now, I will click in here to
get rid of that breakpoint. Now, let me run. Make sure you save,
and after that run, and you can see it says Success, website is loaded, and that's it. If we can connect
to the website, it will say success, but if we can't, it will say failed.
3. 03 Get first heading of BBC news: Now, with this response
that we get from BBC, we know we get the whole HTML code of the BBC page
inside our code. Now, we want to use a
library called Beautiful Soup to scrape through
all of these elements. In Google, if you search for Beautiful Soup Python and
open up PyPI, right, this site, it will tell you how to install it and how
to work with it. It has documentation as well. But all we need in here is to click in here to copy this pip install
beautifulsoup4. After that, go to VS Code. Let me open up the
terminal like this. Hold Control Shift and hit V to paste pip install
beautifulsoup4. You can just write it down, but copying it is
easier. Hit Enter, and it will try to install it, and you can see how easily it gets installed, right?
That's beautiful. Now, next thing, we
need to import it. So I will say from bs4 import
BeautifulSoup. That's it. That's more than enough. Now, in here, after we say
Success, website is loaded, I want to use this BeautifulSoup, right? Parse the HTML that we
get from the response, right? So we say
response.text; if you remember, in the last video, we had a text attribute that gives
us the whole HTML, right? After that, we should tell BeautifulSoup how to
parse it, how to decode it. The parser that we want to
use is called html.parser. Now, this will give us an object that we can
save somewhere. I want to save it
inside a variable, and I want to call
it soup, right? Now, with this soup, we can get some stuff. Let me open up the
BBC page here. For example, just
right-click in here, go to Inspect, right, and click in here and, for example, choose
this one, right? You can see this one
is a header of h2. Let me zoom in a little
bit so you can see it better, and make
this bigger, like this. Click in here and click in here. You can see this
is an h2 tag. If we want to get
the content of this, which says Trump and Musk
trade insults, right,
we need to tell it to
give me the h2, right? So let's see how we can do that. In here, we say
soup.find, okay, h2. That's it. Now, what does this find do? If you hover over
it, it will tell you: look in the children of this page element and find the first page element that
matches the given criteria. That is h2. Let's see
what this will give us. I want to save this
inside a variable, and I will call it first_heading,
right, equal to this. Now, let's just show that with print. In here, I want to
use backslash n, so we go to a new line, and I want to use an f-string
to show the first heading. And in here, I want to
use curly brackets and put first_heading there. Control S to save. Now, let me run. And now, with that,
you can see it did give us the h2, and in there it says Trump and Musk
trade insults, right? But we don't need the
complete tag, right? We don't need the
whole of it, right? We need
just the text inside it. So in here, we can just tell
it to get the text for me, with get_text. This is a method,
so we can call it. Now, let's just test that. In here, I want to run again. And now you can
see Trump and Musk trade insults as row erupts
in public view, right? Beautiful. We are getting the first heading of
the BBC correctly. We specify the tag, and it gives us
the first heading, that is, this one. In your case, it will be different because every day there is
new news, right? But in my case, you can see we did
get the first heading.
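The parsing step from this video can be sketched roughly as below. To keep it runnable without a network call, a small made-up SAMPLE_HTML string stands in for response.text; that sample, the helper name first_heading_text, and the None check are assumptions, not something shown in the video.

```python
from bs4 import BeautifulSoup

# Stand-in for response.text so the example runs offline (hypothetical sample).
SAMPLE_HTML = (
    "<html><body>"
    "<h2>Trump and Musk trade insults</h2>"
    "<h2>Most read</h2>"
    "</body></html>"
)


def first_heading_text(html):
    """Return the text of the first h2 tag, or None if the page has no h2."""
    soup = BeautifulSoup(html, "html.parser")
    first_heading = soup.find("h2")  # first matching tag only
    if first_heading is None:
        return None
    return first_heading.get_text()  # text inside the tag, without the tag itself


print(f"\n{first_heading_text(SAMPLE_HTML)}")
```

In the real script, response.text would be passed in place of SAMPLE_HTML. The None check is there because find returns None when no tag matches, and calling get_text on None would raise an error.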
4. 04 Get all headlines: Now, the next thing that I want to do is get all the headlines and show them to the user. Let's see how we can do that. For doing that, I don't
want to do it like this, so let's just delete that. We say soup.find_all, right? So we find all the tags
with h2 on them, right? If you go here and
hover over this, you can see it is h2. This one is h2 as well. If I hover over
this, you can see at the right side it is h2. If I click on it, you
can see it is h2. Again, click in here, go down. It doesn't matter which one, you can see it is h2, right? Now, with that done, let's see
what this will give us. First, I want to save what find_all returns inside a variable, and I want to call it
headlines. It's equal to this. Now, I want to just print it to see what
this will give us: headlines. Control S to save. Now, let's just run,
and you can see there are a lot of them, and it has everything. But as you can see, it is just a list, a list of h2 tags. It starts with a square bracket, and if you go down, you can see it ends with
a square bracket as well. That's beautiful,
right? We can use that. For using that,
first of all, I want to say print, open and close parentheses,
after that double quotation marks, and in here, I want
to use backslash n first. I want to say Today
headlines, right? And we don't need to say Success, website is loaded,
because the response is 200, so it is loaded. After that, I want to use a for loop to show
all the headlines. So I will say for headline
in headlines, right? And for each one of
these headlines, we want to get the
text, right, get_text. After that, I want
to strip it, because it will give us a
string and we can strip it, so we delete the spaces at the beginning and at
the end of it, right? And now we can save
this somewhere. For example, let's just save
it inside a text variable. Now
we can show that. We can print the text, right? Let's just see what will happen. Bring this up, make
sure you save, and run. Now, you can see all the
headlines are here, right? It says Today
headlines, Trump and Musk, after that
Washington buckles up, and one problem
that I can see in here is that we don't know where a headline starts and
where it ends. So we need to add a number to each one of these
headlines, and we can do that. We can do it with enumerate, we can do it with range, and we can do it
with a counter. You decide which
way you want to do it, and we will pick one of them and do it in the next video.
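A minimal sketch of the find_all loop from this video might look like this. As before, a small made-up SAMPLE_HTML string replaces response.text so the example runs offline, and the helper name headline_texts is an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for response.text; note the extra whitespace inside the tags.
SAMPLE_HTML = """
<h2>  Trump and Musk trade insults  </h2>
<p>Not a headline</p>
<h2>
  Most watched
</h2>
"""


def headline_texts(html):
    """Return the stripped text of every h2 tag on the page."""
    soup = BeautifulSoup(html, "html.parser")
    headlines = soup.find_all("h2")  # list-like collection of all h2 tags
    return [headline.get_text().strip() for headline in headlines]


print("\nToday headlines")
for text in headline_texts(SAMPLE_HTML):
    print(text)
```

The strip call matters here because the text inside a tag often carries leading or trailing spaces and newlines from the page markup.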
5. 05 Show all headlines with numbers: Now, we want to add numbers to each one of the
headlines, right? For doing that, there
are lots of ways. I want to do it with enumerate because I want
to show you something new. So instead of just saying
for headline in headlines, I want to use enumerate. I will say for index,
headline in, and in here we say enumerate, right? In open and close parentheses, let
me put headlines there. Now what will happen? It will give an index to each one of the headlines inside
headlines, and it will give us that index number inside
this index variable. And because I don't want this enumerate to
start from zero, in here I can say
start equals one, right? That's it. Now it
will start from one. Now we can use this
index. How can we use it? Instead of just
showing the text, we can use an f-string
in here, and we can say index in curly brackets, right, after that a dot,
after that a space, after that curly brackets, and now we can show
the text, right? News text or whatever text. With that done, Control S
to save, and now let's run. Here, let me go down in here, hit the up arrow key, hit Enter,
and now let's see. We have lots of them, right? If you go up, you can
see Trump and Musk, and you can see there
are a lot of them. But one more problem
that we have: we want to say, if
there is a headline that is shorter than a certain number of
characters, don't show it. For example, this Also in News, we don't need to see
it. This Most watched, we don't need to see
it. This Most read, we don't need to see it, right? So, for example, we can say, if the length of the headline is less than, let's say, one, two, three, four,
five, six, seven, eight, nine, ten, 11, 12. Let's just say 15.
If the length of the headline is less
than 15, don't show it. If it is bigger
than 15, show it. You try to do that. We
will do it in the next video.
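The numbering step can be illustrated with plain strings, as in the sketch below. The list stands in for the stripped h2 texts, and the helper name numbered is made up for the example.

```python
# Placeholder strings standing in for the stripped h2 texts from the page.
headline_texts = ["Trump and Musk trade insults", "Most watched", "More to explore"]


def numbered(texts):
    """Prefix each text with its position, counting from 1 instead of 0."""
    return [f"{index}. {text}" for index, text in enumerate(texts, start=1)]


for line in numbered(headline_texts):
    print(line)
```

Without start=1, enumerate would count from zero, so the first headline would be labeled 0.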
6. 06 Skip headline that is short: Now, for checking whether the length of the headline is less than 15, and not showing it, how
can we do that? For doing that, in here, we want to check the
text: if there is a text, okay, and the length of that text is bigger
than or equal to 15, in that case, you can
print the headline, right? That's it. Control S to save. Now, let's just check
that out in here: hit the up arrow key, hit Enter. Make sure you save
before running. And now you can see all the
headlines are really good. But there is one problem:
for example, in here, from 33 we are going
to 35 all of a sudden. Why is this problem happening? It's because we
have some headlines that we are not showing because their length
is less than 15. Now, for fixing that,
we need to change this for loop completely to
a for loop with a counter, so we decide when we add
to the counter, right? You try to do that. You try to do a for loop with a counter, and we will do it in the next video.
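To see why the numbers jump, here is a small sketch of the length check combined with enumerate. The sample strings and the helper name numbered_long_only are assumptions for illustration.

```python
MIN_LENGTH = 15  # threshold chosen in the video


def numbered_long_only(texts, min_length=MIN_LENGTH):
    """Number every text with enumerate, but keep only the long enough ones."""
    lines = []
    for index, text in enumerate(texts, start=1):
        # Skip empty strings and short section labels such as "Most read".
        if text and len(text) >= min_length:
            lines.append(f"{index}. {text}")
    return lines


# Placeholder strings standing in for the stripped h2 texts.
sample = ["Trump and Musk trade insults", "Most read", "More to explore"]
for line in numbered_long_only(sample):
    print(line)
```

Because enumerate advances the index for every item, including the skipped one, the printed numbers go from 1 straight to 3. That is the same gap as the jump from 33 to 35 in the video.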
7. 07 For loop with counter: Now, for changing the for loop, I want to go down in here. I want you to see
everything side by side. In here, I will create a variable and
call it index, and I will set it equal to one because we want
to start from one. After that, we say for headline, like before, in
headlines, right? In that case, first, we want to get the
headline text. So we say
headline.get_text, right? After that, we
want to strip it, right? strip. That's it. Now, we want to save it
inside a variable. I will call it headline_text, equal to this, right? After that, if headline_text
exists and the length of headline_text is
bigger than or equal to 15, in that case, first, you can show it with this print. Let's just copy and paste
it over here and change text to headline_text,
and index stays index. Now everything is working fine. If this happens correctly, in this case, we can
increase the index, right? So we say index plus equals one. That way, we are
increasing the index only for the headlines we show, not for the headlines that
have fewer characters, right? Now, with that done, we can just delete all of these,
or we can hold Shift, drag from top to bottom, and put a hash sign
at the beginning of each line. With that done, it will be commented out and
won't be executed. Now, let's go up and
run, and if you look at all the numbers, you can see they
are okay. Let me see: 31, 32, 33, 34, 35, 36, 37. And you can see everything
is working fine, and we are showing all
the headlines of the BBC. There is just one in here,
More to explore. If you count it, one, two, three, four, five,
six, seven, eight, nine, ten, 11, 12, 13, 14, and 15. That's a problem.
So for fixing that, we can just change this
to 20, right? Why not? Just put 20 in there. And now, I don't think any real headline of the BBC will be shorter than 20. So now, let's just run
it one more time and look at it. Everything looks awesome. Now we are getting all
the headlines of BBC News. Congratulations on
finishing this project.
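The final loop with a manual counter can be sketched as follows. It works on plain strings so it runs without the network; in the real script each string would come from headline.get_text(). The sample list and the helper name numbered_headlines are assumptions.

```python
def numbered_headlines(texts, min_length=20):
    """Number only the headlines that are shown, so the count has no gaps."""
    lines = []
    index = 1  # manual counter, starting from one
    for text in texts:
        headline_text = text.strip()
        if headline_text and len(headline_text) >= min_length:
            lines.append(f"{index}. {headline_text}")
            index += 1  # advance only when a headline was actually kept
    return lines


# Placeholder strings standing in for the h2 texts from the page.
sample = [
    "Trump and Musk trade insults",
    "Most read",
    "More to explore",
    "Washington buckles up",
]
for line in numbered_headlines(sample):
    print(line)
```

With the threshold at 15, More to explore (exactly 15 characters) would still pass the check. Raising it to 20, as in the video, filters that label out while keeping the longer headlines.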