Webscrapping in Python for Beginners | Max S | Skillshare

Webscrapping in Python for Beginners

Max S, Power through programming

28 Lessons (3h 58m)
    • 1. Introduction

    • 2. Prerequisite libraries

    • 3. Introduction to The Modulus Operation

    • 4. Introduction to Simple Error Handling

    • 5. Introduction to Pandas

    • 6. Response Status Codes From a HTTP Request

    • 7. Reading The Response Text From Our Request

    • 8. First Approach at Parsing The Data

    • 9. Understanding the Exception Cases

    • 10. Parsing Out All Data for One Company

    • 11. Determining Where We Can Get More Ticker Symbols

    • 12. Extracting Company Ticker Symbols Part 1

    • 13. Extracting Company Ticker Symbols Part 2

    • 14. Getting Data For All Parsed Companies

    • 15. Final Data For All Parsed Companies

    • 16. Final Result Static Websites

    • 17. Prerequisite Libraries for Dynamic Web Scrapping

    • 18. Short review: Recursive Functions

    • 19. Getting started with Selenium

    • 20. View The Page Source

    • 21. Website Elements and XPath

    • 22. Navigating Deeper Into The Page Source

    • 23. Identifying The Path To Our Data

    • 24. Using The XPath To Our Data

    • 25. Parsing Out Our Data

    • 26. Getting Our Final Data

    • 27. Final Results Dynamic Websites

    • 28. WebscrapingPythonOutro

About This Class

Web scraping is the art of picking out data from a website by looking at the HTML code and identifying patterns that can be used to identify your data. This data can then be gathered and later used for your own analysis.

In this course we will go over the basics of web scraping and learn how to extract data from websites, guided throughout by a worked example.

At the end of the course you should be able to go off on your own, and pick out most common websites, and be able to extract all the relevant data you may need just through using Python code.

Meet Your Teacher

Max S

Power through programming


1. Introduction: Hi everyone, it's Max, and welcome to my course on web scraping in Python. This course is made for the beginner to intermediate level, basically anyone who has some basic Python programming experience. You don't need any HTML or CSS knowledge or anything like that. In this course you're going to learn web scraping in Python. You'll learn how to think about getting data from websites, and then how to extract static data, which is data that's already there when you load the website, as well as dynamic data, which is data that updates and loads as you open the web page. By the end of this course, you'll be able to write Python code that goes through websites and extracts just the relevant data that you want. You'll also be able to write code that loads a website, waits for some data to load or update, and then extracts it so that you can use it for whatever purpose you like.

2. Prerequisite libraries: All right, so let's go ahead and talk about some of the libraries that we're going to use. Most importantly, we're going to be using requests, which is a Python library. First of all, we're going to be writing in Python 3.5; that's important to note. If you're writing in version 2.7 or something, you should probably look at urllib or urllib2, and you can even look at the request submodule of urllib. But since we're going to be coding in 3.5, we'll be using the requests library, which has been built almost specifically for this kind of purpose. requests lets us interact over the HTTP protocol with all sorts of websites. We can use it to read websites, we can use it to interact with APIs, and it's just really great. It also has authentication methods and everything included in it.
So that's what we're going to be using, and this is the website for it. If you like, you can also head over to the installation guide. Usually it's quite easy: you can just do a pip install and requests should be ready for you. This is what we'll mainly be using for reading HTML websites, or reading the content, and later on, once we have that, we can extract it using Python. It's our gateway to the Internet, or to these websites, so that's what we're going to need for now.

To actually store our data, and maybe analyze it a little bit, we'll be using pandas, and we'll mainly be using it just to have everything in a nice, orderly format. If you don't know pandas, it lets us store data and analyze it in a format that's very similar to Excel, so think CSV or something like that. pandas is a great tool for data analysis because it neatly structures your data, and it has so many functions for selecting data, sorting through it, and moving it around to get it exactly the way you want, without all the fuss of first organizing all of your data. So that's what pandas is for, and it's really great for that. To visualize our results a little bit, we'll probably use pandas. You can also head over to the download section for more detailed instructions; here you'd probably just do a pip install for pandas. pandas is also built upon NumPy, so in case you don't have NumPy installed, you can also pip install numpy. I'm not sure if that's strictly necessary or whether it comes with pandas, but in case it doesn't work, you should probably also go ahead and get NumPy.
But these are kind of the standard tools that you'll be using for data analysis anyway, so it's just really great to have them.

3. Introduction to The Modulus Operation: Hi everyone, and welcome to our data extraction tutorial in Python. In this section we're going to go over some of the prerequisite material first, so that everyone's able to follow along with what we're doing in the next section. If you feel like you already know all of this, you can just skip this section and move right on to the main data extraction section.

Anyway, the first thing we're going to look at is the modulus. The modulus, or the remainder, is a form of division that tells us how much we have left over after we've completed the division. I'm just going to be working with print statements here so that we see everything in the output. We'll start with modulus 2 and go through a couple of numbers so that we can really get the hang of what's going on. If we do 0 modulus 2, we get a remainder of 0. If we do 1 modulus 2, we get a remainder of 1. And if we do 2 modulus 2, we get a remainder of 0 again. So we're kind of in a cycle.

What's happening is that if I do some a modulus b, I'm dividing a by b, and the output I get is the remainder after the division. In this case, for example, 2 divided by 2 gives me 1 with a remainder of 0. If I divide 3 by 2, I get 1 and a bit: 3 by 2 gives me 1, and then we've got a remainder of 1. So we're dividing 3 by 2.
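The remainders described so far can be checked directly in Python with the `%` operator. This is a small sketch rather than the code shown in the lesson; `divmod` is an extra built-in, not mentioned in the lesson, that returns the whole-number quotient and the remainder together.

```python
# Remainders of 0, 1, 2 and 3 when divided by 2.
for a in range(4):
    print(a, "% 2 =", a % 2)

# Collect the same remainders in a list to make the cycle visible.
remainders_mod_2 = [a % 2 for a in range(4)]
print(remainders_mod_2)

# divmod (not used in the lesson) gives quotient and remainder at once:
# 3 divided by 2 is 1 with a remainder of 1.
quotient, remainder = divmod(3, 2)
print(quotient, remainder)
```

The list alternates between 0 and 1, which is the cycle the lesson points out.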
So 2 fits once into 3 completely, and then we've got 1 extra. Since 1 is smaller than 2, that leaves us with a remainder of 1. If we were to do 5 by 2 instead, we would get 1 with a remainder of 3, which would then be 2 with a remainder of 1, or you can just skip that step. So if we divide 5 by 2, 2 fits twice into 5 and then we've got 1 left over, and that's what we get with this modulus division.

We can also do this for higher numbers, for example 4. If we do 5 modulus 4, we get 1. If we do 6, we get 2. If we do 7, we get 3. If we do 8, we get 0. So we can see that there's a kind of pattern going on. If we do a modulus by a certain number b, to be consistent with the notation above, we're going to get numbers from 0 up to b minus 1, because if we put b itself in here, that just gives us 0 again. So that's the abstract form. In the form of an example, if we do a modulus of 5, we can get numbers from 0 up to 4, so 0, 1, 2, 3 and 4, because if we put in 5, then 5 divided by 5 gives us a remainder of 0 again.

This is a nice way to produce cycling numbers, especially if you're going through numbers in a row. For example, taking 0, 1, 2, 3, 4, 5, 6, 7 modulus 5 gives 0, 1, 2, 3, 4, and then we start at 0 again and go 1, 2. So we see that we get this cycle going on, and that's really nice for us. We can use it to our advantage to produce cycling numbers, especially if we're going along a linear sequence of numbers.

4. Introduction to Simple Error Handling: Hey, everyone, it's Max.
And welcome back to our data extraction tutorial in Python. In this section we're still just looking at the prerequisite material. Now we'll be looking at error handling in the form of try and except statements. What often happens when you write some code is that at some point you run into errors. One example would be printing the character 'a' divided by 3. We get something here called a TypeError, and that's because I can't divide a string by an integer; that doesn't make sense to Python. Python doesn't understand what I'm trying to do here, so it gives us an error, and the name of that error is TypeError in this case. But there are many different types of errors, so depending on what I do, I can have different errors coming up.

You can get around them using try and except statements. As an example, we write a try, then we have an indentation here, and then we also have an except, so they come together. It's like an if and else, but you need to have an except if you have a try. You can start off with something called pass, which we'll do right now. pass pretty much means do nothing. It's a valid Python statement that you can always use if you want. So if I run this again, I see I don't get any output, but I also don't get an error. What happened is, I told Python: try the following, try to print the character 'a' divided by the number 3. It tries, and it gets an error because it doesn't know how to divide strings by integers. Because it got an error, it went into this except statement, and instead of continuing on in the try block, it continued with the statement in the except block.
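The try/except pattern described above can be sketched as follows. The lesson's on-screen code is not reproduced in the transcript, so this is a reconstruction under assumptions: it names `TypeError` explicitly where the lesson appears to use a bare `except`, and the helper `divides_cleanly` is a hypothetical function added only to make the behaviour easy to check.

```python
# Dividing a string by an integer raises TypeError.
try:
    print("a" / 3)
except TypeError:
    pass  # do nothing: no output, but no crash either

print("the program keeps running after the except block")


def divides_cleanly(value):
    """Hypothetical helper: report whether value / 3 works without a TypeError."""
    try:
        value / 3
    except TypeError:
        return False
    return True


print(divides_cleanly("a"))
print(divides_cleanly(9))
```

Catching a named exception is usually safer than a bare `except`, because unrelated errors still surface instead of being silently swallowed.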
In this case, it just did nothing because we told it to pass. A better way to visualize the switch is to put a print statement here, so we can print 'entered error', or even better, 'encountered error'. So either we try this and print out the result, or we print 'encountered error'. If we run this, we now see the 'encountered error' output, and that's because we entered the except statement.

Now, we only stop executing the try block once we run into an error. For example, if I print the character 'a' before the line where my error is going to occur, and print the character 'b' after it, let's see what we get as output. First we get the character 'a', because we try this and print 'a', and then we encounter the error, so we can no longer do anything and the statement after it is never executed. Everything above the line with the error is still executed, because it's still part of the try statement. So we go into the try and execute line by line, then we hit the line that has an error, and all of a sudden we leave the try statement, go into the except, and print 'encountered error'. So we never print the 'b', and we also never finish executing the line that has the error in it.

The form works just like we know from before with indentation, so you don't have to have just one line of code. You're also not limited to printing, obviously; this works like function definitions, if statements, loops, or whatever else you like, and you can continue on as long as you maintain the indentation. Just to show you quickly, we can print 'encountered error' while we're inside the except statement, and then we can also print 'line 2' in this case.

5. Introduction to Pandas: Hey everyone, it's Max.
And welcome back to our data extraction tutorial in Python. Now we're going to quickly look at pandas, and we're only going to do a very quick overview because we're not actually using pandas for much of this. pandas is just the final form that we're going to put the data into as an example. We won't be doing a lot with pandas, but we'll take a little look so we better understand what we're getting towards the very end.

The first thing we need to do is import pandas so that we have the library in our code. We import pandas as pd, which just gives us a short form to use. One of the main data structures in pandas is something called a DataFrame. You can imagine a DataFrame as something very similar to an Excel table: you've got rows and you've got columns. Let's look at how we can create a DataFrame.

One of the easiest ways to put in information, apart from reading it from CSV files, is in the form of a dictionary, so let's create some input data. We'll call our variable input_data, and it's going to be a dictionary. We'll have some keys: data 1, data 2 and data 3. We're going to have several elements per key, so for each one we'll put a list. Our dictionary right now contains three keys, data 1, data 2 and data 3, and each of these keys has a corresponding value. In this case, they're just empty lists.
Now we can fill these up a little bit. We want to make sure they're all the same length, so we can put in four numbers: 1, 2, 3, 4. It's still just a dictionary, so we can think of it like this: our key data 1 has as its value a list that contains the four numbers 1, 2, 3, 4. For data 2, let's keep it simple and have 5, 6, 7, 8, and data 3 will have 9, 10, 11 and 12. Notice that I use commas as separators between the different elements, so we have four elements here, and we also use commas to separate the different keys.

Now that we've got our data set up, let's visualize it once by printing it out, and then we'll feed it into our data structure and see what that looks like. Printing it out, we see the general form of a dictionary. There's no order here: we put in data 3 last, but here it's printed first, then data 1 second and data 2 in third position. That's just the way dictionaries work in Python 3.5; there is no guaranteed order. But the order inside the values, which are lists with four elements, is maintained, so that's good. These keys are just going to be the header names for our DataFrame, which is kind of a fancy word for an Excel-like table.

So let's go ahead and create that now. Usually we create a DataFrame and save it in a variable called df, which stands for DataFrame. Of course, if you start writing your own code, you can change that name however you like.
But right now we'll just use df, which is short for DataFrame. To create a DataFrame, we go into pandas, which we abbreviated up here as pd, and access DataFrame, with a capital D and a capital F, and the input is just going to be our input data. Then let's print out the DataFrame, and what we see is a very different way of representing this data. If we add an extra print statement in between to get some extra space, we see how the data looked before in the form of a dictionary, and now we have it in this format. We've got row indices over here, we've got header names up here, and we've got nicely ordered data for each of the headers. It looks very similar to what we would see in Excel, for example, or in a CSV file. It's also very easy for pandas DataFrames to write out to CSV files or read from CSV files, because the way the data looks is very similar to that structure.

Now, there are some small things we can do. For example, if we only want to print out some of the top elements, we can go into our DataFrame and call the head method. If we don't put anything in here, it prints out the top five rows, which we're not going to notice because we only have four. So let's add two more elements to each of these lists so that we can see the output a little better. We've extended each of our lists by two, and we print out the DataFrame once completely. Now we print the DataFrame's head, which, if we don't pass a number, is just the top five rows. If we do put in a number, for example 2, and rerun this, we only get the top two rows printed, the first two, which we can also see from the row indices.
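The DataFrame steps above can be sketched like this. The key names (`data1`, `data2`, `data3`) and the two extra values appended to each list are assumptions, since the transcript does not show the exact code; the first four values of each list follow the lesson.

```python
import pandas as pd

# Dictionary of equal-length lists; each key becomes a column header.
# The last two values in each list are made up for illustration.
input_data = {
    "data1": [1, 2, 3, 4, 5, 6],
    "data2": [5, 6, 7, 8, 9, 10],
    "data3": [9, 10, 11, 12, 13, 14],
}

df = pd.DataFrame(input_data)

print(df)          # the full table with row indices 0 to 5
print(df.head())   # top five rows by default
print(df.head(2))  # only the top two rows
```

Every list has to be the same length, otherwise `pd.DataFrame` raises a `ValueError` when given a plain dictionary of lists.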
We can also print out the DataFrame's tail, which by default prints the last five rows, so the bottom five here. If we put in a different number, for example 3, we print out the last three rows. So that's how a DataFrame works. This is all very simple stuff; we're not going to go through all of pandas because there's a lot to cover, but this gives you a general idea of what these things look like.

6. Response Status Codes From a HTTP Request: Hey, guys, it's Max, and welcome back to our Python data extraction tutorial. Right now I have this Apple Yahoo Finance page open, because we're going to use it for our walk-through example, and what we're going to try to do is get out all of this data here. I'm using finance because maybe some of you are already interested in finance, or maybe that's what you want to apply this to, so that you have something concrete and relevant to work with. I know not all of you are interested in finance, and that's okay. We're just using this as an example. You can, of course, do this with whatever website you like and pick a different example, but I thought this may be relevant to some of you.

The only thing we need from here right now is the URL. So we're going to copy that out, head into our coding environment, and define a variable that we'll call url. We copy and paste our URL in here inside quotation marks so that we've got it saved as a string and can use it later. All right, so now we've got our URL.
Let's go ahead and import the requests library that we installed earlier, so that we can start talking to this website using the HTTP protocol. We'll go over this in more detail in a second, but let's write the code first and then go through what actually happened. We're going to send a GET request. We go into our requests library and call the get method, which sends a GET request to this URL, and we save whatever we get back in a variable called response. Then let's just print out our response. If we run our code, what we see is a response 200. We'll go over what that means in a second too, but let's talk about what we did here first.

We went inside the requests library and accessed the get method. What the get method does, as the name pretty much suggests, is get something, and we've passed it the URL. So it gets the URL and everything around it, which in this case is the HTML code. If we're talking to an API and we send a GET request, then we may have a URL that's changed a little bit, or we'll have a header, and we may also have a body or payload or something like that, and we'll get back a different response. But the GET is a general way of asking for something from the URL that we pass it. In this case, we're sending a GET request to this URL, which returns the HTML code to us.
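A minimal sketch of the request described above, assuming the requests library is installed. The URL below is a placeholder, because the transcript never spells out the Yahoo Finance address that was copied; the `fetch` wrapper and its `timeout` value are also illustrative choices, not part of the lesson.

```python
import requests

# Placeholder URL: substitute the page you actually want to read.
url = "https://example.com/"


def fetch(page_url, timeout=10):
    """Send a GET request to page_url and return the Response object."""
    return requests.get(page_url, timeout=timeout)


# Typical usage (this performs a real network call, so it is left commented out):
# response = fetch(url)
# print(response)
```

Passing a timeout keeps the script from hanging indefinitely if the server never answers.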
If we use it for APIs, there may be a special format that we get back, which is usually outlined in the API's documentation, but we'll see all that later. Right now we're just interested in the HTML code. So what we're doing here is pretty much reading the website and saving it in response. Now, what we get is a response 200. What we have here is something called a status code, and we can print that out too: we go into response and print out the status code. If we run that again, we see that this is our response variable, and if we print the status code, we get this 200, which is what we also see up here.

There are many different status codes, so let's quickly go to this Wikipedia page I've pulled up that outlines them. If you go down, you'll see there are many different types, but the important thing to note is that there's 1xx, which stands for an informational response. Then we've got 2xx, which pretty much means success. If you get a 2xx, or specifically a 200 response, that's usually what you're looking for. It means everything went OK, and it's what we got in our coding environment. If you get something that starts with a 3, there's some form of redirection, and you can look up exactly what the specific response means. Then there's a response that starts with a 4, which is a client error. That means you somehow messed up: something you sent or asked for wasn't quite valid. Depending on the response you get, there may be different errors. Maybe the most famous one is error 404, Not Found.
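The status-code classes can be summarised in a few lines of standard-library Python. This is a sketch, not code from the lesson: `status_class` is a hypothetical helper, and it also covers the 5xx server-error class, which is part of the same standard HTTP scheme.

```python
from http import HTTPStatus


def status_class(code):
    """Map a status code to its class using the first digit (hypothetical helper)."""
    classes = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    return classes.get(code // 100, "unknown")


for code in (200, 301, 404, 500):
    # HTTPStatus(code).phrase is the standard reason phrase, e.g. "Not Found".
    print(code, HTTPStatus(code).phrase, "->", status_class(code))
```

With a requests response in hand, the same check would be applied to `response.status_code` before any parsing is attempted.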
Simply put, what you're looking for just wasn't there on the website, or maybe the URL doesn't exist, or something like that. And then there's also a 500 response, or 500-something, which is pretty much an error on the server side. You probably did your job OK, but there was an error on the server side in interpreting whatever it is you asked for. So if you get back a status code different from 200, you can look it up and see what it means. Usually we're looking for this 200. If you get a 400-something, that probably means you messed up somehow, but hopefully you'll be getting 200s most of the time. If you get something like a 500, you can assume that something's probably wrong on the server side, although if you keep getting it and you notice other servers are running, you may also have formatted your URL badly, and the website just wasn't able to deal with it correctly, so it's sending back a 500 response instead of a 400.

So those are the different kinds of status codes you can get, and they tell you whether what you asked for was done correctly or whether an error occurred while reading the website. Before you go on to parsing the HTML code of your website, you want to make sure that your status code is OK. That's one of the first errors that can pop up when reading websites: sometimes you may just not be able to read the page, and you'll see that in the response status code that you get.

7. Reading The Response Text From Our Request: All right, everyone, welcome back to our Python data extraction tutorial. What I've done here is take out these indicators. I've copied this list.
Then I went into our coding environment and pasted all of it in here, so that I've got the names of everything and you don't have to do this yourself. It's just the information that we're interested in extracting. That's pretty much the only starting point we need besides the URL: the names of the information we're going to extract, so that we have something to base the extraction on, and we'll pretty much be splitting based on these. So I just copy-pasted the values that we have right here, all of the ones we want to extract, took out the numbers, and put the names into a Python list so that we can go through them directly. I saved them in indicators, and we'll be using that now.

First we'll learn how to read the actual text of the response, and then we'll go ahead and split based on this, so that we can get out this type of data. So let's do that. The way we get the actual HTML code is to go into our response and access the text attribute. If we run our code, and right now we're just printing it out, this is what we see: pretty much all of the HTML code that describes the website at the URL we've loaded. What we can do, first of all, is save this rather than just print it. We'll save it in html_text; that's what we'll call our variable, and it's going to be our response.text. The text attribute of the response holds all of this HTML code that we've got over here.
Again, if we scroll up to the very top, we see that first we print out the response, and what we get is a Response object with the code 200, and we also see the status code right here. If we go into the response and look at the text attribute, then starting from here, this is what we see: all of this HTML code. So now we've saved that in html_text.

Before we really go and analyze the HTML text, let's look for one of the values to see what it is that we're looking for. Before we can parse our text, we first have to understand exactly what we have, so that we know how to parse it. So we can go over here and call up find, and we're looking for Previous Close. We have to make sure we take out the quotation marks, because they probably won't be present in this HTML, although they could be. So we search for exactly the text we're looking for. Right now we're trying to find the value corresponding to this Previous Close. We've loaded the website for Apple, and we have this Previous Close attribute, which has the value 138.68. That's what we're looking for.

If we go through the results and watch the sidebar, we see that there are actually several occurrences of Previous Close. As we scroll down, there are multiple occurrences, so we have to know which one to take and where our data is located. That's the first thing that's important to know. Again, the value we're looking for is 138.68, and if we go back here, we see that we find it right here.
If we scroll through our data again and arrive at the very top occurrence, we see that the very first one actually contains our data. The value isn't directly behind it, there's some stuff in between, but it's right up here, so that's where we find it. If we go to the later occurrences, we see they're surrounded by different stuff, so they don't really interest us. Again, it's the first occurrence that holds the data for us.

We can do the same for Open, just to check a different value, and we see it's right below here, and it looks like this could be the value for Open. Let's double-check: 139.25. Maybe we also look for Bid, just to check again. So it looks like all of the data we have right here corresponds to this table. If we look manually, the next thing we're looking for is Ask, which we see right here. So it looks like all of our data is right here, and that's the first important piece of information we need: where in our response, in our HTML, the data is actually located, so that we can start parsing it and getting it out.

So we've found the location of our data, and in this case it's at the first occurrence. Again, if we scroll down, we reach the later occurrences, but if we go all the way back, here we go: the first occurrence actually corresponds to our data. That's good to know, and we'll go on the assumption that the first occurrence of everything here corresponds to our data. In case we get some weird values, we'll have an exception, and we'll account for that later.
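The lesson locates the label by searching the printed output by hand. A programmatic equivalent, offered here only as a sketch, uses `str.find` and `str.count`. The `html_text` below is a tiny made-up stand-in for the real page, which is far longer and structured differently; only the label and the two values come from the lesson.

```python
# Made-up HTML fragment standing in for response.text.
html_text = (
    "<td><span>Previous Close</span></td><td><span>138.68</span></td>"
    "<td><span>Open</span></td><td><span>139.25</span></td>"
    "<script>Previous Close appears again in unrelated markup</script>"
)

label = "Previous Close"

# How many times does the label occur, and where is the first occurrence?
count = html_text.count(label)
first = html_text.find(label)  # index of the first match, or -1 if absent

# Look at a short window of text right after the first occurrence.
window = html_text[first:first + 60]

print(count)
print(first)
print(window)
```

In this fragment the value sits a few tags after the first occurrence of the label, which mirrors the situation described for the real page.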
But right now, rather than checking everything, we'll just assume that that's the way it is, unless we see later on that it's not. That's what we're going to be doing here. Okay, so we've got the indicators that we're looking for, and we assume that after their first occurrence we can find the data maybe two lines down or something like that, depending on how wide I drag this window, but somewhere in that range. So it's quite close by. Let's go ahead and find the value again, since it disappeared after I dragged the window, and we'll go to the top. So here we go. That's where we're going to find our data. The first thing that we're going to do is go into our HTML text and play around a little bit with the HTML code that we're getting in the response, to find a good way to extract this data. We're going to start by splitting by this Previous Close label. Right now we're doing it all manually, and in a second we'll automate it, once we have the first version down. So right now we're just trying to find a method to get at the data, and then we're going to assume everything is coherent. Unless we run into exceptions, we'll continue on the assumption that all of this HTML code is coherent and based on the same structure. This may not always be the case, but usually it is. If we look at the format here, it all looks very, very similar, so it's a good assumption that everything follows the same structure. So we'll go back into our code, and what we're doing is going into this text, which is pretty much just one big string, and we're calling the split method on the string and splitting by Previous Close.
Now, if you don't know what split does: it takes a text and splits it at the value you give it. It's going to split the string into a list. Everything before the first occurrence will be the first element, so the zeroth element, and everything after that occurrence will be the element with index one. If you have multiple occurrences, you're going to be splitting into multiple elements. Maybe I can comment this out and show you a quick example. We can create a string example containing several b characters with other letters in between, and we're just going to print out our string example split by b. This is just a quick example to make sure that we're on the same page. What we see here is that every time the character b occurs, it pretty much creates a new element in the list. We can also add a g here just to show you that it really splits at the b, and everything before it is one element in the list. That's exactly the principle that we're going to be using for our HTML code. We're going to be splitting by Previous Close. If we run our code now and wait a little bit, we see at the very end that we have these square brackets, which indicate that we've got a list right here. And we shouldn't have any occurrences of Previous Close anymore, because we split by it. Instead, we now have several elements in a list. We'll save this in our split list; that's going to be our HTML text split by the Previous Close label. And if we look at the length of this split list, we see it has three elements, so there were two occurrences, and every time Previous Close occurred, the text was split into another element. So that's what we've got going on right here.
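The exact example string is hard to make out in the recording, so here is a minimal sketch with a made-up string that shows the same behavior of str.split. The variable names are assumptions for illustration.

```python
# A made-up example string with three occurrences of "b".
string_example = "abcbdbe"

parts = string_example.split("b")
print(parts)  # ['a', 'c', 'd', 'e']

# Adding a "g" before the first "b" shows that everything in front of the
# separator ends up together in one element.
longer_example = "agbcbdbe"
print(longer_example.split("b"))  # ['ag', 'c', 'd', 'e']

# The separator itself is removed, and n occurrences give n + 1 elements.
print(len(parts) - 1)  # 3
```

This is why a split list of length three for the page text means the label occurred twice.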
All right, so now that we've split it, we have everything in a list, and we're getting a little bit closer to finding a nice way to extract our data from this HTML code.

8. First Approach at Parsing The Data: Hey everyone, it's Max, and welcome back to our Python data extraction tutorial. In the last tutorial we saw, and we still have our output here, how we can read the HTML text, and then we looked at splitting it so that we can get it into a list format. Now we're going to look at how we can parse this list so that we can finally get out our data. We also saw a quick example of what exactly split does, and we see that we have a resulting list over here. We've split every time Previous Close occurred, and now we have different elements in the list. Something that we also know is that our data comes after the first mention of Previous Close. What that means is that our data should be in the second element, so the element with index one of our list. The reason is, if you remember the example that we had over here, and we'll just comment this out again: if the data we're looking for is c, which comes after the first occurrence of b, then that's the same situation as our data coming after the first occurrence of Previous Close, which in this case is this b here. We see that c is in second place, so it's the second element, the element with index one. That's where our data should be. Let's test that out. We'll run our code and look at the second element of the list that we split the HTML code into, and we'll scroll up. Let's see, I can't really get to it right now. All right, here we go. Let's scroll up to the very top and make sure not to miss where this output actually started. Maybe it'll be better if we just search for Open and go from there.
So we'll start at the very bottom, and then we'll find where our data is. We know that Open should be very close to Previous Close; it was actually right below it. So we'll just look for that occurrence. Maybe it wasn't right there. Now, here we go. This is the start. The reason I searched for this is that there were several outputs, so I couldn't just go to the top to find it again. If you only have one output at a time, you can just scroll to the top. In this case I had more output, so we also scrolled back into previous HTML responses. But this is our last output, from the last time we ran our code, and this is the element with index one of our split list. We see here that we're very close to our value, so it's pretty much there already. We just have a little bit of trash in between. It's not really trash, but for our purposes it's trash because we don't need it. And then we've got the value that we're really interested in right here. What we need first of all is a common way of getting at this data. We see that we have our value here, and that's going to be for Previous Close. For Open we have the value here, and for Bid, that should be the third value, we have our data here. Let's check that it's right: the bid price, times 100. Okay, that's good. So we've got our data here, and now we need to identify a common theme for all of these values. We'll assume that theme keeps on going, so that we can extract our data. If you look at this a little bit longer, you'll see that there is a quotation mark and a greater-than symbol before each value, and then a less-than symbol with a forward slash and td after it.
This occurs everywhere. We've got this quotation mark and greater-than symbol, and then this less-than symbol followed by forward slash td, every time we have our data. The reason is that the data is ordered into a table format in the HTML, so it's going to be consistent like this. It's a valid assumption that, because everything here is in the same format, the way that we read out our data should be consistent all the way through. So what we can try to do is take the second element, the element with index one, of our split list. That's everything that comes after the first occurrence of Previous Close and before the second occurrence. We can try to split this again, and now we're going to use the quotation mark and the greater-than symbol as the separator. If you're confused about why I put this backslash here: the backslash is an escape character that escapes the quotation mark. Because I'm using double quotes around the string, a bare double quote would end my input string. If I remove the backslash, we see that our string ends here and a new one seems to start here. With the backslash in place, the double quotation mark is read as an actual character inside the string. Another way to do this would be to put the string in single quotation marks, and then the double quotation mark is also read as part of the string. But in case you're using double quotation marks, or several kinds of quotation marks, it's good to know that you can use this escape character to include quotation marks as part of your string. So that's what we're going to do.
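A small sketch of the two ways to write that separator. The sample cell is a made-up fragment, not the real page markup.

```python
# Two equivalent ways of writing the separator: a double quote followed by ">".
escaped = "\">"   # backslash escapes the double quote inside a double-quoted string
single = '">'     # single quotes around the string, so no escape is needed

print(escaped == single)  # True

# Hypothetical table cell, only for illustration.
sample = '<td class="value">138.68</td>'
print(sample.split(escaped))  # ['<td class="value', '138.68</td>']
```

Either spelling produces the same two-character string, so the choice is purely about readability.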
We're going to split by this character sequence right here, and then hopefully our data will be the first value that we get in this print statement. To find our data, I'm going to put a marker here that we can search for. I'll type in "search to find", and then I'll search for that so that we can easily pull up our response rather than having to scroll through everything. This is just to allow me to search. So let's go ahead and run this, and now I'll take this marker and search for it. We see we've got "search to find", so let's see what we've got over here. The second element is our data. Since we split again, we got a list, split every time this character sequence occurs. We see we've got this first element, and the second element is the data that we're interested in. We also have more elements, because there are a lot of occurrences of this character sequence. So we have this list, and we know we're interested in the second element. Everything before the first occurrence of the sequence is the zeroth element, everything after the first occurrence is the element with index one, everything after the second occurrence is the element with index two, and so on. It just splits the list every time the sequence occurs. If we run this again, now taking only that element, we see that our response is a lot shorter. We've got "search to find" here, and this is the second element of our split HTML text, which we split again by this character sequence, and we see that our value is the first thing here. So we'll save this value in a new variable.
We'll just call it after_first_split. There are many more appropriate names, but we'll call it that. Then we'll print out after_first_split to make sure that we're saving everything correctly. And that's what we see here: this is after_first_split. Now we can split again, this time by the sequence back here, the closing td tag. We assume that this sequence occurs every time right after our data, and then hopefully we should get our data. So we can create after_second_split: we take after_first_split, which is this string here, and split it by this sequence. We'll copy and paste the sequence to make sure that everything is correct. We save the result in our after_second_split variable and print it out to see what we get. We get a list, which comes from the fact that we split, and we see that there are two elements, because there's only one comma that I can see here. The element with index zero is the data that we're interested in. The reason it's the element with index zero rather than the element with index one is that this time our data occurs before the sequence that we're splitting by, rather than after it like in the earlier splits. We can save that in a variable called data, which is going to be the first element of this after_second_split list that we've created. If we print out our data, we see that our Previous Close is 138.68. And if we look at the page, we see that we've got the right value: 138.68. Let's try that on different values too. We'll try Open, which should be 139.25.
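Put together, the three splits can be sketched as follows. The html_text string is a simplified, made-up stand-in for the real page, and wrapping the steps in a function called extract_value is an assumption for illustration; the variable names follow the ones used in the lesson.

```python
# Hypothetical, simplified stand-in for the relevant part of the page.
html_text = (
    '<td class="label">Previous Close</td>'
    '<td class="value">138.68</td>'
    '<td class="label">Open</td>'
    '<td class="value">139.25</td>'
)


def extract_value(text, indicator):
    """Return the text between the first '">' after the indicator and the next '</td>'."""
    # 1) Everything after the first occurrence of the label is at index 1.
    split_list = text.split(indicator)
    # 2) The value starts right after the next '">' sequence.
    after_first_split = split_list[1].split("\">")[1]
    # 3) The value ends right before the closing td tag, so it is at index 0.
    after_second_split = after_first_split.split("</td>")
    return after_second_split[0]


print(extract_value(html_text, "Previous Close"))  # 138.68
print(extract_value(html_text, "Open"))            # 139.25
```

Note that the result is still a string; converting it to a number is a separate step, and values that are not plain numbers would need their own handling.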
So we'll copy and paste Open, replace Previous Close with Open, and hope it works the same. We see we've got 139.25. Let's try something completely different that we haven't looked at yet: Volume. Let's run that, and we get 17,225,390. Let's see what we're supposed to get. Okay, so this is actually one longer number rather than several different elements. That's good to know. So here, too, our response is correct. Maybe we can also look at the last value, the one-year target estimate. Let's split by that and hope that it works too. We get a value here, which in this case is not correct. That's something we're going to need to keep in mind: we've got an exception for the one-year target estimate. But it looks like this works for most of the other values. Maybe we can try the earnings date too, which we've got right here. In this case, for the earnings date, it also looks like we're not getting exactly what we want. So for most, or at least for some, of our values this is the correct way of getting out the data. When we get to the exception cases, we'll look at what exactly is going on in the HTML code and maybe look for a better way to parse it from there.

9. Understanding the Exception Cases: Hey everyone, and welcome back to our Python tutorial, where we're looking at data extraction. Let's continue where we left off last time, which was looking at the exception cases. If we run our code again, what we saw is that we got a not-available value back for the earnings date, whereas if we look at the page, we see that there is a date showing up.
Now, the first thing that we're going to do is try to understand what kind of error is going on by looking at the response text again. So we'll print out our HTML text, just to see whether maybe we were taking out the wrong values, or whether the format looks a little bit different here. Let's run that. This is the body that we get back, so let's search for the earnings date. We see we have the earnings date here, and the format looks very similar to everything that we've seen before. Here we also have the P/E ratio, for example, and before that we've got the market cap. So what we're seeing here looks very similar to what we expect to see on the page, and we're probably in the right section. Now I've lost the earnings date, so let's find that again. Here we go. In this case, we actually do have the not-available value here. So it's not that we're reading the data incorrectly; we're getting the N/A value that happens to be in the HTML. Why is that? Let's keep looking. If we search for the next occurrence of earnings date and look around it a little bit, what we see is that there seems to be some kind of loading going on. So it seems like what we're dealing with here is a dynamic page, and we can actually see that if we refresh the website and keep an eye on the earnings date, and also on the EPS and the one-year target estimate. So let's refresh the web page and wait for it to load. Here we go. If you look at the earnings date right now, it's actually not available, and similarly for the one-year target estimate and the EPS value that we've got here.
And if we wait a little bit, this data becomes available to us. So what we're dealing with right now is that Yahoo has integrated a dynamic website. Parts of the page are loaded after we've opened the website, and they adapt to the current state so that the information is always up to date. That's great if we're looking at things through a web browser, and not so great if we're looking at things through Python. These parts are dynamic and they're probably driven by JavaScript, and dealing with JavaScript and dynamic websites is something that will be covered later on. It's not something for right now. Knowing that, we can say: all right, we're not reading the data wrong. It just so happens that our website is dynamic and runs JavaScript code as soon as we open the page in a browser, and based on that JavaScript code it loads information for us. But since we're not opening a browser, and we're just reading the web page once, that JavaScript is not being executed for us, and therefore these values stay as not available rather than being filled in with the values that they should have. That at least gives us an explanation of why, if we run this code, we get not-available data for the earnings date as well as for the one-year target estimate, the EPS, and so on. So we know that it's nothing to worry about. That's just the way it is right now, because our website loads dynamically and the way that we do things with Python is that we read the web page once. This is also one of the reasons why it can sometimes be a little more difficult to get at this information: websites can load content later, and they can change.
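A minimal sketch of why the parsing is "correct" even though the result is not useful. The static_html string is a made-up illustration of a page whose script has not run, the "N/A" placeholder text is an assumption about what the static page contains, and extract_value is a hypothetical helper that repeats the three splits from the previous lesson.

```python
# Hypothetical static HTML as a one-time download might return it: the cell still
# holds a placeholder, and the script that would fill it in a browser never runs.
static_html = (
    '<td class="label">Earnings Date</td>'
    '<td class="value">N/A</td>'
    '<script>/* in a browser, this would fetch and insert the real date */</script>'
)


def extract_value(text, indicator):
    """Same three splits as before: after the label, after '">', before '</td>'."""
    after_first_split = text.split(indicator)[1].split("\">")[1]
    return after_first_split.split("</td>")[0]


value = extract_value(static_html, "Earnings Date")
print(value)  # N/A

# The extraction itself worked; the page simply did not contain the date yet.
if value == "N/A":
    print("Value is filled in dynamically; a one-time download cannot see it.")
```

A check like this makes it easy to tell "the parser broke" apart from "the data was never in the static HTML".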
If any of the format changes here, then the code that we're using may no longer apply directly. If you look at the earnings date one more time, and for some reason they decide to change their format, maybe add some other types of indents or put some extra comments here, however they want to change their HTML code, then it could well be that our code no longer works. These are important things to keep in mind: we're parsing our data based on what we see in the HTML code, and that's valid for right now. But later on the website may change the way they format their HTML code, and then your code could stop working all of a sudden. Then you have to go back to the root and ask: do my base assumptions still hold? Is this still what the data looks like? Is this still what the format looks like? And then you can build up from there. Usually these aren't big issues, especially if you're just going through static websites and you want to walk through them and read off the information. However, it's still important to keep in mind. All right, so now we've looked at the exception cases, and we understand what's going on. We understand that our code is working fine there too. We noticed that we always get an N/A value back rather than an error, which is already a good sign. It means our code executes; our data just has an N/A value in it rather than anything useful. But it still has a value, so it's good to know that the code still executes in this case.
We just double checked against what the values should actually be, and we found out that our website is dynamic, so our data loads later on, after the website is opened and after some JavaScript code or some other type of script finishes loading the content of the website.

10. Parsing Out All Data for One Company: Hey everyone, it's Max, and welcome back to our Python data extraction tutorial. Let's go ahead and continue. What we're going to do now is make a loop, go through all of these indicators, and get out the information one by one. Right now we're just splitting based on one particular indicator. What we're going to do now is go through all of our text and split by each and every one of these indicators, so that we can get our complete data set. The first thing that we're going to do is make a couple of changes so that we can use this data in a pandas DataFrame later. We're going to transform this into a dictionary, and we're going to make the values here empty lists, like this. Then we're going to go through each of these keys, and to each key in the dictionary we're going to append the value that we get back. You'll see what I mean in a second; we'll just finish adding Python lists to all of these. If you're familiar with NumPy, you can also use NumPy arrays, of course. In pandas the lists should get converted to NumPy arrays too, so that's nothing to worry about. But if you like NumPy arrays, you can of course put a NumPy array right here. If you don't know what that is, don't worry about it. It's just a different kind of data structure to deal with.
All right, so what we did here is transform our indicators list into a dictionary of indicators, where each key is an indicator name and the corresponding value is a list that's going to hold the values. There is one more thing, and we'll fill it in afterwards: we're going to have a new dictionary called data. data is going to hold all of the data that we put into our pandas DataFrame later, but we'll fill it out at the very end. First of all, let's just worry about filling up our indicators dictionary. We're going to take away this print statement here, as well as this one, and we're going to keep this HTML text variable, which stores our response text. Now we're going to make a for loop over our indicators dictionary. We say: for indicator in indicators. When we loop over a dictionary with this kind of for loop, we always get the keys, which makes things really nice. So instead of looking for a specific value here, we can say that we're going to split based on the indicator. If we want, we can also print out the indicator to see which one we're on. And then we store our data. Our whole process is still the same. The only thing that we've really changed is that, rather than putting a manual value here, we're going through our dictionary key by key and splitting based on every one of them. Our HTML text never changes, because we never overwrite it. So we're pretty much using the same basic code without changing much. We start with our standard string and split based on one value, and then we start with the standard string again.
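The loop described above can be sketched like this. The html_text string is a short made-up stand-in for the page, and only three indicators are included to keep the sketch small; the name data_value anticipates the rename mentioned right after this point in the lesson.

```python
# Hypothetical, simplified page text with three table rows.
html_text = (
    '<td class="label">Previous Close</td><td class="value">138.68</td>'
    '<td class="label">Open</td><td class="value">139.25</td>'
    '<td class="label">Bid</td><td class="value">139.20 x 100</td>'
)

# Each key is an indicator name; each value is a list that collects the results.
indicators = {"Previous Close": [], "Open": [], "Bid": []}

# Iterating over a dictionary yields its keys.
for indicator in indicators:
    # html_text is never overwritten, so every pass starts from the full string.
    split_list = html_text.split(indicator)
    after_first_split = split_list[1].split("\">")[1]
    after_second_split = after_first_split.split("</td>")
    data_value = after_second_split[0]
    indicators[indicator].append(data_value)

print(indicators)
```

Using lists as values means the same dictionary can later collect one entry per company under each indicator.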
Then we split based on another value, so that things don't go all over the place. That's what we're going to be doing here, and the rest of the code is the same. What we're going to do at the end is go into indicators, look at the corresponding key, which is the current indicator, and append the data that we have here to its list. We'll change this variable's name to data_value, because we're going to have our actual data stored in the data dictionary later on, and we don't want conflicting names. At the end, let's also print out our indicators, and then let's run our code and see what we get. All right, so we ran into an error. It looks like the dividend yield gave us an error when we split by it, so it looks like we've entered an exception case. One thing that we can do to quickly debug this is to look at the HTML text again. We'll print the response text and break out of our loop so that we don't run into the error again, and then we need to find where our code hit a road bump and see what was going on. What happened is that accessing the element with index one of whatever we split by this key didn't work, because that element didn't exist. So there is something missing here. Let's go ahead and look at that. We're going to be looking for Dividend and Yield in our HTML. And we actually don't have this value occurring. So it could well be that the way we copied and pasted it over isn't correct, or maybe the way that it's written in the HTML code is different.
So we see here that it's actually not called Dividend and Yield in the HTML. It has a slightly different name. Let's just make sure that everything is right. We look at the dividend yield: 2.28 and 1.64%, which, if we look at the page, is right down here. So that's the value that we're looking for. The error that we ran into is that, even though it says dividend and yield on the page, the HTML code has a slightly different name because of the formatting in this table. So we're going to adapt our indicators dictionary: the key is going to be changed to this new name. Then we'll take out this print statement, take out this break, rerun our code, and see how it works now. We got past the dividend yield, but we've encountered something again at the day's range. Let's see where that value is coming from: Day's Range. Let's look for that and see what's going on here. Maybe we'll search for Range instead. What we see here is again different formatting. If we look at the day's range, it looks like the data is 138.64 to 139.36, which, if we look at the page, is exactly the corresponding value. But because of formatting, the apostrophe that we see on the page is written differently in the HTML code. We have to account for that, so we're going to change the corresponding name of our key right here so that it matches the name used in the HTML code, because that's really what matters. We do care about the actual name, but we can't get the information out if we don't know the name that it corresponds to in the HTML code. So that's the name that's important to us. And if we run it now, we see that our code completes without any errors.
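The transcript does not show exactly how the page writes these two names. One common cause is HTML entity encoding, where characters such as the ampersand and the apostrophe are written as entities. The sketch below assumes that is what happened; the html_text sample and the entity forms in it are made up to match what Python's html.escape produces, not copied from the real page.

```python
import html

# Names as they appear when the page is displayed in a browser.
display_names = ["Day's Range", "Dividend & Yield"]

# Hypothetical page source in which special characters are entity-encoded.
html_text = (
    '<td class="label">Day&#x27;s Range</td><td class="value">138.64 - 139.36</td>'
    '<td class="label">Dividend &amp; Yield</td><td class="value">2.28 (1.64%)</td>'
)

for name in display_names:
    encoded = html.escape(name)  # "&" becomes "&amp;", "'" becomes "&#x27;"
    print(name, "->", encoded)
    # The display name is missing from the source, but the encoded form is present.
    print(name in html_text, encoded in html_text)

# Splitting by a name that never occurs gives a one-element list,
# which is why indexing with [1] raised an error in the loop.
print(len(html_text.split("Day's Range")))  # 1
```

Whatever encoding the real page uses, the fix in the lesson is the same: copy the name exactly as it appears in the HTML source and use that as the dictionary key.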
So it looks like the rest of the names are okay, and we see that we've got our dictionary here, which contains all of these nice values. Right now we just have one element in each list for each of these indicators, but in a second we'll understand why we've put lists here and not just single values. For now, we were able to extract this data. We encountered some errors that had to do with the way the names were represented in the HTML code. We went back to the source, looked over what was going on, found the data that we wanted, and looked at the corresponding name. It turns out we had to change the names for the day's range as well as the dividend yield, because of the way the formatting in the HTML worked.

11. Determining Where We Can Get More Ticker Symbols: Hey everyone, it's Max, and welcome back to our data extraction tutorial in Python. Now that we've got the basic code down for extracting all of the data for these indicators from Yahoo Finance, it would be great if we could build ourselves a list of companies. So I've loaded up the list of S&P 500 companies, and we're going to try to get all of these company symbols from Wikipedia. Then we're going to put them into our existing code and use them to get data for each company. That's going to be our goal, so let's see if we can do that. The first thing that we need is the Wikipedia URL. Let's copy and paste that and write it in here. We can call it wiki_url and put the URL in quotation marks. I can make this a little bigger so that we can see what's going on. So we've got this Wikipedia URL that I just copied and pasted in here. I'll make this a bit smaller again so that we can see the output better, and now we can make a second request.
We'll call this wiki_response. We go into requests and again send a get request, this time to the wiki_url. We have to make sure the variable name is the same here. What we're doing is using the requests library and its get method to contact the website over HTTP and read it. That's what this get request allows us to do: we can get the information from this website. So we've got one of our responses, which goes to Yahoo Finance, and now the other response goes to Wikipedia, using the wiki URL that we've provided here. We're also going to need to save our companies somewhere. What we could do, for example, is add them to our indicators: we could make another key here for the companies and handle it the same way. But we'll actually put it in data, because our indicators are solely for Yahoo Finance, given the way that we constructed this code and the name that we've given them. We could call the key company list, or actually we can just call it company, and here we'll put in an empty Python list. So data is, for now, just a dictionary with one key, whose value is an empty list. This empty list is eventually going to hold the symbols of all the companies that are showing up here in the S&P 500 component stocks. All of these ticker symbols are what we want, because based on them we can change our Yahoo Finance URL and get the required data.
But now let's go ahead and look at these company names. The first thing we want to do is see the HTML text that we're getting, so we're just going to read the text and print it out. Let's run this. If we scroll all the way to the top, what you'll notice very quickly is that the text we're getting is actually bigger than our output console can show. Generally we're getting all of the information, but if we search for one of these keywords to see where we are on Wikipedia, we see that the visible output starts somewhere in the middle of the list, not at the top of it. That's something to note, because this is quite a long article and we want to make sure that we can get out all of that information. For example, if you were looking for a ticker symbol at the start of the list, such as MMM, you're not going to find it in your output console, just because the string is too long to all be displayed here. We'll look at some ways to get around that and see how we can best get our data out anyway. If we scroll down a little bit, we really want to section out just this company list right here, so we're going to look for some key indicators that we can split by. If we scroll all the way to the bottom, we see this unique number here, and that could be an indicator in the HTML code that we could split by to reduce the amount of output we're getting. We don't really care about anything right here.
We only care about this company list that we have up here, so we should be able to split by that number we picked. If we scroll to the top and read what it represents, it's a CIK, so it's actually an index key, which is great, because that means it should be unique. So we can try splitting here to see if we can reduce the amount of data that we have. We're not going to save anything yet, we're just going to play around a little bit. We'll split by this number and see what we get. We get a list, because I forgot to choose the element of this list that we want. Since we're splitting by this number all the way down there, we want the zeroth element, which is everything in front of it. So we'll run that again, now taking only the zeroth element, the first element in the list, and this is what we get. Let's go to the top and see if we get most of our data. We still don't: the output only goes back as far as United Continental Holdings. Let's try to find where that is on the page. Starting from the top and going down, we see that we're still pretty far towards the bottom. But that's okay, maybe we'll just use a different key to split by, something like this phrase here, 'contains 505 stocks'. I've tried splitting several times before, and it just seems our string is too long to actually read all of this output, so I'm trying to find a unique phrase in here. If we search the web page for it and hit enter, we don't get any other match.
So maybe it would be smarter to split by this and take the second element, the element with index one. We run this and see what we get. Now we would again have to split off the lower part. Right now we split here and we're taking everything below it. If we want everything up to a certain point, we can split again by a certain number and take everything in between. So far we've pretty much only cut off the top part, so let's use a number in the middle just to see what it's like. We'll split by this and take everything before it, to see if it's all okay and if we can reduce the size a little bit. Well, it still doesn't look like we can reduce the size enough, so we're just going to continue on. I've actually been playing around with this website a little bit before, and the whole website's HTML is really, really long. You can try to play around with it a little to reduce the size, but we don't want to spend too much time on that, because we want to move on with our tutorial too. So we'll just look for some of our indicators wherever we are right now. Let's look for HOLX, since we're seeing the output towards the bottom first. That should appear somewhere here, and it's right here: HOLX. That's what we see, and then we've got all this information, with a hyperlink going to a different website, and all types of other information, also links to other Wikipedia pages and so on. So that looks pretty good. Let's see if we can find a different one. We'll look for Home Depot, HD, and here we've got Home Depot.
So it's the same again here, and we could kind of judge that from before, because we're in a table format, so everything is probably going to be similar. One thing to notice is that we've always got this hyperlink here, and we can really use that to our advantage: we can try to split by hyperlinks and use that to get this information out. If we go back to our data, we always see that our ticker is right in here. The hyperlink is on this ticker, and then there's more information for the next columns. Now what we really care about is making sure that we cut off at the very end, and then we're actually going to start processing our data. Everything after this is not going to be relevant to us anymore, so this is going to be our final value. So let's stop printing and actually start storing. We'll call this the wiki first parse, since this is the first time we parse the response. It's going to be our wiki response that we're getting here from the URL, and we look at its text, and this text we split by this unique number that we found here. Again, if you search for it and hit enter, it doesn't go anywhere else, so this really is a unique number, at least on this website, which is great. We want the zeroth element, the first element, the element with index zero, which means we want everything before it. That ensures that we get all of this table, everything above here and nothing below. Okay, good. The next thing we want to do is go to the very top and define the top of our data set. So maybe we can print out the first couple of thousand characters and then split based on something very close to what we see here.
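To make the "split by a unique marker and keep everything before it" step concrete, here is a small sketch on a made-up string. The marker value and the toy page are hypothetical placeholders, not the real number or HTML from the Wikipedia page.

```python
# A tiny stand-in for the real page; the real HTML is far longer.
page = "<h1>Intro</h1><table>row A|row B</table>CIK 0000000000<p>footer</p>"

# Hypothetical unique marker that appears only after the table.
marker = "0000000000"

# Check that the marker really is unique before relying on it.
print(page.count(marker))

# split() returns a list of the pieces around the marker.
pieces = page.split(marker)

# Index 0 is everything in front of the marker, so the footer is cut off.
wiki_first_parse = pieces[0]
print(wiki_first_parse)
```

Taking index 1 instead would keep everything after the marker, which is the variant tried earlier with the 'contains 505 stocks' phrase.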
We could just split by 'S&P 500 component stocks', for example. But it may well be that this ampersand symbol is written differently in the HTML code than it is when we see it here and copy-paste it. That's why, if you try to copy-paste this, it probably won't work, just because of the way the ampersand is defined in HTML code. So we're going to look at the first couple of thousand characters to see if we can identify where this table starts, so that we can split there and really just section off our table. Let's start from zero and go to maybe 2000 characters, and we want to make sure that we actually stay within our output window. It looks like the output finishes within the window, which is great. What we want to look for in here is 'component stocks'. We don't find it yet. Did I run the code? Let's see. It looks like we don't have enough output here, and the first part of the page seems to be quite long. So maybe if we go to something like 5000 we'll have a little more luck. Still not. Maybe we can start at 5001 and go up to 10,000. What we're trying to do right now is just narrow down our range. All right, here we've got the S&P 500 component stocks. If we search for it, we see it actually occurs several times, three times in total. That's good to know. And if we go back to the website it also makes sense, because it occurs here once, a second time in the hyperlink definition, and then here another time. So let's split by 'component stocks' to get our wiki data table. That's going to be based on our wiki first parse.
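The following sketch illustrates the three ideas from this part on a made-up string: peeking at a slice of a long text, the ampersand being written as an HTML entity, and a phrase that occurs three times producing four pieces when split. The sample HTML is an assumption built for illustration only.

```python
import html

phrase = "component stocks"

# Made-up snippet in which the phrase occurs three times, as on the real page.
sample = (
    "<title>S&amp;P 500 component stocks</title>"
    '<a href="#list" title="component stocks">component stocks</a>'
    "<table><tr><td>MMM</td></tr></table>"
)

# Slicing lets us look at a small window of a very long string.
print(sample[0:20])

# In HTML source the ampersand is written as the entity &amp;,
# so searching for the text as it appears in the browser can fail.
print("S&P 500" in sample)       # the displayed form is not in the source
print("S&amp;P 500" in sample)   # the entity form is
print(html.unescape("S&amp;P 500"))

# Three occurrences give four pieces; index 3 is everything after the last one.
occurrences = sample.count(phrase)
pieces = sample.split(phrase)
wiki_data_table = pieces[3]
print(occurrences, len(pieces))
print(wiki_data_table)
```

Splitting by only 'component stocks' sidesteps the ampersand question entirely, which is the choice made in the lesson.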
And this we're going to split by 'component stocks', taking the third, or rather the fourth, element, the element with index three. Then, if we print out our wiki data table and run this one more time, we see, if we go to the top, that we're getting our data. So it looks like we've picked out the data table that we see here.

12. Extracting Company Ticker Symbols Part 1: Hey everyone, and welcome back to our data extraction tutorial in Python. Let's continue right where we left off. Last time we identified the data table in the Wikipedia entry that corresponds to the S&P 500 companies, this one here. It took us a little bit of time to identify it, but that's just how it is with data extraction: you need to go through the HTML a little bit, become familiar with it, and then, if possible, reduce the data range down to the part you're actually interested in, so that you can write a nice little algorithm, or a nice little script with an algorithm in it, to extract information based on what your data table looks like. So now we have managed to extract this data table: we start up here and go all the way to the bottom, and we also end there, because we split by this identification number. We've really narrowed it down to just the data table. So let's see if we can get out some of these ticker symbols. One thing we notice is that there are hyperlinks at every single one of these ticker symbols, and that's going to be useful for us.
That's actually going to be really useful for us, because we're going to split every time we see a hyperlink, see where the ticker symbol appears, and then just assume that this format is consistent all the way down, and use that to extract the ticker symbols. Let's see what I mean. First we go back into our environment, and before we save anything let's just print. We go into our wiki data table and split, and a hyperlink is denoted by this 'href=', so we split by that. Then let's look at the zeroth element, the element with index zero, the first element, and run this. The other output above is still from our previous runs, in case you were wondering. What we see here is just a class, nothing of interest. If we look at the second element, the element with index one, we see we get this 'Ticker symbol' hyperlink. So what we did is, since we're looking only at this table and we split by the hyperlink reference, we get everything before it, which is this portion here and not really of interest to us. Then we get this 'Ticker symbol' hyperlink, and that's our first element: everything after here is element one. Then we get the second hyperlink, so everything in here is element two; the third hyperlink, element three; the fourth hyperlink, element four; and then the fifth hyperlink, which means everything in here is element five. So let's take a look at element five and see what we get. What we get with element five is our actual ticker symbol, right in here.
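Here is a small sketch of the hyperlink split on a toy table. The table layout (four header links, then four links per row) mirrors what the lesson describes, but the HTML itself, the link targets, and the helper `make_row` are assumptions made up for illustration.

```python
# Four header cells, each containing one hyperlink (hypothetical markup).
header = (
    '<th><a href="/h1">Ticker symbol</a></th>'
    '<th><a href="/h2">Security</a></th>'
    '<th><a href="/h3">Filings</a></th>'
    '<th><a href="/h4">Location</a></th>'
)


def make_row(ticker):
    """Build one toy table row with exactly four hyperlinks."""
    return (
        f'<td><a href="https://www.nyse.com/quote/XNYS:{ticker}">{ticker}</a></td>'
        f'<td><a href="/wiki/{ticker}_Co">{ticker} Co</a></td>'
        f'<td><a href="/filings/{ticker}">reports</a></td>'
        '<td><a href="/wiki/Somewhere">Somewhere</a></td>'
    )


rows = "".join(make_row(t) for t in ["MMM", "ABT", "ABBV"])
wiki_data_table = '<table class="wikitable">' + header + rows + "</table>"

# Every hyperlink starts a new list element.
hyperlink_split_wiki = wiki_data_table.split("href=")

print(hyperlink_split_wiki[0])   # everything before the first hyperlink
print(hyperlink_split_wiki[5])   # first ticker link
print(hyperlink_split_wiki[9])   # four elements later: second ticker link
print(hyperlink_split_wiki[13])  # and again four later: third ticker link
```

With 16 hyperlinks in the toy table the split produces 17 elements, and the ticker links sit at indexes 5, 9 and 13.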
It's not by itself yet, but we've managed to isolate it, so that's really, really good. One thing we can do later is split by this symbol, just like we did before, and then also split by this one, and use that to extract the data. So that's already a really good point to be at. But something important to note is that there aren't only hyperlinks here, but also here and here, so every time we're going to need to skip over all of these and take out only the ticker symbols here at the very front. Again we notice a pattern: every row has four hyperlinks, which is really great, because it means the way our list is organized is repetitive in some sense. Every fourth element after that should be another ticker symbol. So, for example, we've got the element with index five here, which is our ticker symbol; then element six, element seven, element eight; and element nine, the element with index nine, should be right here, our next ticker symbol, ABT. Let's see if that's true. We look at the element with index nine and see if we get ABT, and we do. Let's try that one more time, adding four again. First we started at five, then we went to nine, so let's follow the same pattern of adding four and try 13. If we look at this we see we've got ABBV, and let's see if that's the next symbol, and it is. The reason this works right now is that we've identified a pattern, and we can use that pattern to our advantage and split the data accordingly. So what we're going to do is write a little for loop that goes over each of the elements in our list, and then we'll shift our indexes around a little bit and do a little bit of testing.
We can write a for loop based on this because there's a pattern here, a little algorithm to get out our ticker symbol. So let's start again with the first value. Right now we're still just printing, because we're still trying to identify the steps. What we need to do is split by this, and then split by this. So we're taking out the element with index five, and next we split by this right here, using single quotation marks this time because there's a double quote in the string. From that we want the second element, the element with index one, so we want this part right here. Let's run it. We've almost got it. Now we just have to take this part at the end away. Right now we're just assuming that everything is always going to be the same, and that's a valid assumption because we're in a table that looks extremely repetitive. If we do encounter errors or exceptions, those are things we'll deal with when we get there, but for now we'll assume that everything is the same, because we've identified a pattern, and we'll use that to the best of our ability. So here we'll split again, by this character, which hopefully appears after every ticker symbol. Maybe I can also make this a little bit bigger so that we can see the whole code, but you'll have access to it anyway. So we've got this list, and from it we want the element with index zero, the first element, because we want everything in front of the character we're splitting by, and that should give us our ticker symbol. Let's see. Yes, it does. Good. So these are the splits we need to get our ticker symbol, and that's already really great. We've identified a pattern.
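The two splits can be sketched as follows. The transcript does not show the exact strings that were split by, so the separators `'">'` and `"<"` below are assumptions that fit the description (a single-quoted string containing a double quote, then a character that follows every ticker symbol). The sample element is made up.

```python
# One element of the hyperlink-split list, as it might look (made-up sample).
element = '"https://www.nyse.com/quote/XNYS:MMM">MMM</a></td><td><a '

# First split: single quotes around the separator because it contains a
# double quote. Index 1 is everything after the end of the opening tag.
after_tag = element.split('">')[1]
print(after_tag)

# Second split: the "<" of the closing tag follows the ticker symbol,
# so index 0 is the ticker symbol on its own.
ticker = after_tag.split("<")[0]
print(ticker)
```

Chaining both steps into one expression gives `element.split('">')[1].split("<")[0]`, which is the form used in the loop sketches further on.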
Now we just need to code it up. How are we going to do that? First of all, we take away this print statement, and we can leave this comment here to give us a little bit of a hint. We're going to write a little for loop, and we'll say for position, which is just going to be our index (we could also call it index or whatever): for position in the range of the length of, and now we put in a list. Rather than splitting inline here, I'm going to move the list that we're creating somewhere else, and we'll call it hyperlink split wiki. So we're going to loop over this list. Again, we haven't done any of the extra splitting here yet. We've just gone through our wiki table and split every time there's a hyperlink, so we have this long list where every hyperlink starts a new element. That's all we've got right now, and we're going to loop over that with a for loop. We know that our starting position is five, so we'll define a variable to remember our starting position, which is five. We want to go into our hyperlink split wiki and eventually apply these splits, so we look at the element, if it's in the right position, and split it like this, and that should give us our ticker data based on what we did before. We don't want to look at every element, because not every element is going to give us correct data, since we've got all of these extra hyperlinks, but we want to look at every fourth element. So we're going to have a little variable that keeps track, which we'll just call tracker, and it's going to keep track of every time we reach our fourth element. We'll also need an if statement here.
So we're going to say: if our position is, well, actually we'll use greater-than, and then we can change our start to four. So if position is greater than our starting number, for example if position is equal to five, then we've passed the first four values, and the next value that we have would be this ticker symbol. Once we've entered this case, we check whether our tracker is equal to zero. If it is, we have the right data and we want to take out that position: we look at the element at that position, and that should give us our ticker symbol. So now we just have to define our tracker. What are we doing? We're looking at our position, and first we need to take five off of it, because we have this shift of five. So we're pretty much subtracting the start plus one, and that's where it comes from: our hyperlinks are shifted by five, and the first time we actually get valuable data is at position five. So our tracker is the position minus five, and since we want every fourth element, we take the modulus by four. Let's use the integrated environment to make a quick example of what this looks like. Our tracker is going to start at zero, and the first position where we find data is five. In that case five is greater than four, so we enter this if statement, and our tracker is equal to our position, five, minus the starting value of four plus one. That's just because of this shift, because we've got some hyperlinks before here. So we've got five minus (four plus one), which is five minus five.
Now we take the modulus of that by four, because the pattern repeats: every fourth hyperlink should be a ticker symbol. If you don't know what the modulus is, it's pretty much the remainder. For example, if we take a modulus by four, we can get the numbers 0, 1, 2, 3. For the value five we get five minus five, which is zero, and the remainder of dividing zero by four is zero. If we do the same thing for nine instead and run this again, we also get zero: nine minus five is four, and four divided by four has a remainder of zero. That's why this is a good value for a tracker. What our tracker is saying is: every time the tracker is equal to zero, we take out the element and assume it's our ticker symbol. The reason we take the modulus by four is that we have four of these hyperlinks per row, and we assume that every fourth element is our ticker symbol. So really we're assuming that everything continues just the way it is right now, with no hyperlinks missing or anything like that. If there are, then we'll need to change our code later on. We'll run it first, and hopefully things will work out and it will fill out our company list. So if everything is consistent and always stays the same, this should give us a ticker symbol like we tested before. All we have to do is go into our data, or I guess we can save this first: we'll save it in a temp data variable so that it's a little easier to read and not all on one line. Then we go into our data, access the company key, which gives us this list, and to that list we append our temp data like this.
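Putting the tracker idea together, a minimal sketch of the loop might look like this. The list is built by hand to mimic the split result (one leading piece, four header links, then four links per row); the element contents, the `example.com` links, and the split separators are assumptions for illustration.

```python
start = 4  # the four header hyperlinks we want to skip

# Hand-built stand-in for the hyperlink-split list.
hyperlink_split_wiki = ['<table class="wikitable"><a '] + ['"/h">header</a>'] * 4
for symbol in ["MMM", "ABT", "ABBV"]:
    hyperlink_split_wiki.append(f'"https://example.com/quote/{symbol}">{symbol}</a>')
    hyperlink_split_wiki.extend(['"/other">other link</a>'] * 3)

# The modulus is the remainder after division.
print((5 - 5) % 4)  # first ticker position
print((9 - 5) % 4)  # four elements later
print((6 - 5) % 4)  # a non-ticker position

data = {"company": []}

for position in range(len(hyperlink_split_wiki)):
    if position > start:
        # Shift by start + 1 so that position 5 maps to 0, then repeat every 4.
        tracker = (position - (start + 1)) % 4
        if tracker == 0:
            temp_data = hyperlink_split_wiki[position].split('">')[1].split("<")[0]
            data["company"].append(temp_data)

print(data)
```

This only works while every row really has four hyperlinks, which is exactly the assumption the lesson goes on to test.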
And then at the end, let's print out the data dictionary that we've created and run our code. I'll make this a bit bigger. Let's see. We're getting names, and it looks quite good up to a certain point. We start off well here, and then we seem to run into some different things. It looks like hyperlinks start going missing after AON. So let's look for AON and see if there's something missing somewhere. It doesn't look like it right now, but we'll have to dive into a bit more detail, because all of a sudden we're getting all of these unexpected extra values. We'll see what we can figure out in a second. So we ran into some exception cases: our code worked quite well for the first couple of values, and now we've hit one of these exception cases where something happened and our table is no longer as homogeneous as we thought it was.

13. Extracting Company Ticker Symbols Part 2: Hey everyone, it's Max, and welcome back to our data extraction tutorial using Python. Last time, if we run our code here, we split our data based on what we identified to be the pattern. We wanted to get all of the ticker symbols from here, and we did a pretty good job until we reached AON, and then things started to go haywire all of a sudden. So now let's look at the Wikipedia website and see what went wrong. Let's take this symbol, go to the website, and look at where the error happened. Up to here it went fine, but we didn't get APA. So for some reason there was a missing hyperlink, or there must be an extra hyperlink. If we go through here, everything looks the same until we reach London, United Kingdom, and what we see here is that we've got one hyperlink for London and one for United Kingdom.
But if we look at the link above, we see that the comma is included in the hyperlink, which makes it one link. And here we've got two hyperlinks, which makes this two different elements. So that's already a problem, and it nullifies our assumption that everything was homogeneous. If we look at the output, we see that things change again after Arlington, Virginia. So let's find Arlington, Virginia, over here, which is right here. After it we notice that we've got this extra hyperlink, so there are a lot of extra cases that we may need to consider. There are different things we can do. We can continue with our algorithm like this, make it a little more complicated, and account for all of these exceptions. Or we can look at the HTML code in a little more detail again and see if we can identify a different pattern that helps us better. Let's try the second option first. Rather than making things too complicated, let's keep things as simple as we can and try to identify a different pattern that we can use to extract these ticker symbols. So we'll go back in here, and let's print out the values that we're parsing. We want to print out all of this, and then we want to say: if the length of this value is greater than, let's say, six, so if all of a sudden we have something like United Kingdom or Houston, Texas rather than a ticker symbol, which is usually quite short, then we want to break. We're going to use this for debugging, so we print out whatever it is that we're parsing, this whole element that we're going to parse in a second.
And if the length is too long, or actually we want to check the length of our temp data rather than the raw element, because the raw element is always going to be longer than six before we've parsed it. If, after parsing, the value is still too long, then we just want to break out, so we can use that to debug a little. Let's run our code. We see we do quite well, and then we stop here at United Kingdom, and we get all of this output. To make it a little easier to read, let's add another print statement so that everything is separated by new lines, and run our code again. Here we go. The format is already a lot better. We see that things start to get messed up here in the United Kingdom section, but above that, for our ticker symbols at least, we seem to have another type of pattern. It looks like the links are either going to the NYSE website, which one may assume is the New York Stock Exchange, or to the NASDAQ website. If we scroll through here, it looks like there are only two websites that these ticker symbol hyperlinks go to. That's another pattern we could use to pull out this data. So rather than using the tracker that we defined before, we'll be using these names right here. We're going to redefine our if statements a bit, or rather refine them. First we check if this NYSE occurs in our current element, so if NYSE is in whatever it is that we're parsing. We can either use an and here, or we can make a separate statement, and we'll do that so that each line stays short. Then we're also going to look for another thing: we always see 'quote' for the NYSE website, and we'll use that as well, just in case.
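The length check used for debugging earlier in this lesson can be sketched like this. The rows are toy data: one row is given an extra hyperlink (London and United Kingdom as separate links) so that the every-fourth-element pattern drifts. The element format and split separators are assumptions.

```python
start = 4

# Toy rows; the second one has five links instead of four.
rows = [
    ["MMM", "3M Company", "reports", "St. Paul, Minnesota"],
    ["AON", "Aon plc", "reports", "London", "United Kingdom"],
    ["APA", "Apache Corporation", "reports", "Houston, Texas"],
]

hyperlink_split_wiki = ["<table>"] + ['"/h">header</a>'] * 4
for row in rows:
    hyperlink_split_wiki.extend(f'"/link">{text}</a>' for text in row)

parsed = []
stopped_at = None

for position in range(len(hyperlink_split_wiki)):
    if position > start and (position - (start + 1)) % 4 == 0:
        element = hyperlink_split_wiki[position]
        temp_data = element.split('">')[1].split("<")[0]
        print(element)  # show the raw element being parsed
        print()         # blank line so the output is easier to read
        # Ticker symbols are short; anything longer means the pattern broke.
        if len(temp_data) > 6:
            stopped_at = temp_data
            break
        parsed.append(temp_data)

print(parsed, stopped_at)
```

In this toy run the loop collects the first two symbols and then stops at the extra link, skipping APA, which is the same kind of drift seen on the real page.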
After all, NYSE could be mentioned in another type of hyperlink, so we want to be extra sure in this case. We're going to have a second if statement and say: if 'quote' is in here. So what we're really doing is a double check: first we check if we have the NYSE key, then we check if 'quote' is in there, and if both conditions are fulfilled, we should be quite good. Let's make sure that everything is indented correctly, so that everything sits nicely inside our if statements. Again, we're using an if inside an if. We could just as well have used a single if with the and keyword and the second condition, but we'll use the double if because it's a little clearer to read. So we've got this, and then we want to do the following: this is what we did when the tracker was zero, and we still want to do it. We can take this print statement out now, and we can also take out the break. All of this handles the case when we have this form of hyperlink: if we have this form of hyperlink, we get our ticker symbol. What we can do now is copy-paste this and use the exact same thing, but now we check whether, instead of NYSE, we have NASDAQ in here, and again, to make sure that we always get our ticker symbol, we also check for 'symbol' afterwards. So we've got NASDAQ and 'symbol' for the double check, and NYSE and 'quote' like we have right here. We can even use an else-if statement, because it's going to be one or the other, so we don't need to check both. Then we can take out this tracker here, and also that tracker there. And then let's see if the pattern that we identified holds the whole way through.
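A sketch of the refined checks, again on hand-built elements. The lowercase strings `"nyse"` and `"nasdaq"`, the link formats, and the split separators are assumptions about what the hyperlinks looked like; the lesson only says that the links go to the NYSE or NASDAQ websites and contain 'quote' or 'symbol'.

```python
# Hand-built elements; only three of them are ticker links.
hyperlink_split_wiki = [
    '<table class="wikitable"><a ',
    '"https://www.nyse.com/quote/XNYS:MMM">MMM</a>',
    '"/wiki/3M">3M Company</a>',
    '"https://www.nasdaq.com/symbol/adbe">ADBE</a>',
    '"/wiki/London">London</a>',
    '"/wiki/United_Kingdom">United Kingdom</a>',
    '"https://www.nyse.com/quote/XNYS:AON">AON</a>',
]

data = {"company": []}

for element in hyperlink_split_wiki:
    if "nyse" in element:
        # Double check: NYSE ticker links also contain "quote".
        if "quote" in element:
            temp_data = element.split('">')[1].split("<")[0]
            data["company"].append(temp_data)
    elif "nasdaq" in element:
        # Double check: NASDAQ ticker links also contain "symbol".
        if "symbol" in element:
            temp_data = element.split('">')[1].split("<")[0]
            data["company"].append(temp_data)

print(data["company"])
print(len(data["company"]))
```

Because the check looks at where each link points rather than at its position, the extra London and United Kingdom links no longer shift anything.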
If it does, we should be good. Let's run this and see what we get. We ran it, and it looks very good. There don't seem to be any cases in here like the ones up here from before, where we had a jumble of other stuff mixed in. This all just looks like ticker symbol data, so that's great. Let's also look at the company key. Rather than printing out the whole data dictionary, we want to look at this list and print out its length, to see how many ticker symbols we have and whether we maybe missed something. So we're just doing a form of double checking, and it looks like we've got 505 symbols, which is actually more than we would expect for the S&P 500. Let's print this out one more time and see if we see anything strange. But at least we know that we got around 500, which is already really good: we hopefully shouldn't be missing any of these ticker symbols. If we go through here, at least to me it all looks pretty clean. So we've got all of these ticker symbols, and we should be able to go ahead and use them for our Yahoo Finance URL. That's going to be the next step. So we finally managed to get this out. It took a little bit of time, but that's just the way it is: because this HTML code is a little complicated, we first have to identify or hypothesize certain patterns, and then we have to try them out. Either we confirm the pattern, or we run into a problem like we did before, and then we can either make our algorithm more complicated or try to find a different pattern, which is what we did here. So usually this kind of data extraction does take a little bit of time.
It takes a little bit of thinking and a little bit of analysis, but it does work very, very well. We wrote this short little algorithm here, and it already does a very good job of parsing this pretty complex Wikipedia page. It doesn't look that complex to the eye, but when you actually try to find some of these patterns, like we did with the hyperlinks, you run into some of the errors we had, like the comma case, or later on with these extra hyperlinks that are being introduced. Take this one, for example: that wasn't the error that we ran into, but it would have been another error that we'd run into. 14. Getting Data For All Parsed Companies: Hey everyone, and welcome back to our data extraction tutorial in Python. Last time we were able to get out all of these company stock symbols, or ticker symbols, from this Wikipedia page. We got 505 stocks in total, and after reading the Wikipedia entry, the S&P 500 is actually comprised of 505. When we checked the length earlier on, we also got a value of 505, so that's really good. We can always rerun that one more time just to make sure, but we do have all of the 505 stocks, which is great. So we've done exactly what we wanted to do, and now we have all of these different companies that we can look up and parse information for using Yahoo Finance. All right, let's jump right into that. Let's first look at the Apple page that we originally pulled up, and in particular at the URL. We see that we've got the Apple symbol once, and then we've got the Apple symbol another time. I've already gone ahead and opened up a different page in Yahoo Finance, this time for Alphabet, and we see that we also have the Alphabet ticker symbol up here, and we've got it here another time. 
So we can assume for now, until we run into an error or something like that, that the general URL is the first part here, then the company symbol, then the second part here, and then the company symbol again. That should be the general Yahoo Finance URL. We're going to try this first and see if it works. If it works, we're all good; if it doesn't, we'll have to find where the problem lies. So our basic assumption right now is that we can make up this URL from these two base parts as well as the ticker symbol. Great. So let's go back to our coding environment, and we can take these print statements away now, and I guess we can leave this in. We're going to take this URL here and move it down a little bit, like this, and we can take these print statements away too. We're also going to take this response and move it a little bit further down, so that things are a bit more ordered. What we want to do is manipulate this base URL, and these ticker symbols here are going to be modified dynamically. So what are we going to do? We're going to loop through all of the companies that we found. So we say for company in, and our companies are saved in the data variable, which is a dictionary, as a list saved under the company key right here. So we're saying: for company in the list that's saved under the company key in the data variable, which is this dictionary here. Pretty much what we're doing is going through this list element by element. That's what our for loop is doing, and on each pass it's going to have one of these symbols. That's what company is going to be. 
Company is going to be each of these ticker symbols, or we can even call it ticker symbol rather than company, so that we're very clear about what we're doing. We're going through here symbol by symbol, and now we can use that to form the new URL. So we'll take this again and move it down here, and now we're going to build our own URL. Let me put this in quotation marks, sorry, and in parentheses, just so that we can put it across multiple lines. All right, so this is going to be the first part, and then we're going to need a second part. In between these parts, the ticker symbol goes in. We just took out the Apple symbol, and in its place the ticker symbol goes in. Before, we had AAPL, then this part here, and then AAPL again. But now we're doing it more broadly, more dynamically: we take each of these symbols and create the URL just like that. Then we want to indent this so that it's all part of this for loop. Now we want to create our HTML text again, and we also want to get a response, so let's copy and paste that. We need to indent this, and this, because we want to make sure that everything is inside this for loop. What we've done is something we pretty much had already. Before, it was a much simpler URL: it was just the Yahoo Finance Apple URL, and that was the base that we started with. Now we've taken out the Apple ticker symbol and replaced it with this dynamic variable that we get by looping through each of the company ticker symbols we parsed out of the Wikipedia page. Then we add this part of the string, which seems to be part of the URL, and then we add the ticker symbol one more time. And that's going to be the URL that we set. 
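As a minimal sketch of this URL-building loop: the two string parts below are assumptions about what the Yahoo Finance URL looked like at the time, and `build_quote_url` is a hypothetical helper name, so check both against the URL in your own browser before relying on them.

```python
def build_quote_url(ticker_symbol,
                    first_part="https://finance.yahoo.com/quote/",  # assumed first part
                    second_part="?p="):                              # assumed second part
    # Parentheses let the expression span several lines, as in the video.
    return (first_part
            + ticker_symbol
            + second_part
            + ticker_symbol)


# Stand-in for the dictionary filled from the Wikipedia page.
data = {"company": ["AAPL", "GOOGL", "MMM"]}

urls = []
for ticker_symbol in data["company"]:
    url = build_quote_url(ticker_symbol)
    urls.append(url)
    # In the tutorial, the GET request for `url` is sent at this point,
    # inside the loop, and the response text is parsed as before.

print(urls[0])
```

Keeping the URL construction in one small function makes it easy to adjust if the assumed pattern turns out to be wrong for some ticker.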
And then to this URL we're going to send a GET request, read the website, and save that in the HTML text. Now we pretty much have everything that we had before. We don't need to change anything here, because we're dealing with our HTML text again. We can take out that print statement as well as the commented-out break, and maybe also this example down here, and we can leave in our indicators, just to see what we have going on afterwards. So let's run this and see how long it takes. If it takes too long, we can add another if statement here so that we can break before we have to wait so long. Maybe I can already start running this, and yes, it takes too long. So let's set a counter to zero, and every time we finish a loop, increment the counter by one. And if our counter is, let's say, greater than 10, just break. All right, so it doesn't seem to finish right now, and just for the sake of time, we'll run it to begin with using this extra little check here, just to make things go a little bit faster. So we're going to go through 10 different companies, and we should hopefully find a value for each of them, so we're going to have a length of 10 for each of these indicators. We see here our indicator dictionary that we've created, and the list for each of the keys should contain ten elements, which it looks like it does. So our code seems to be working, at least for now. We haven't tested it for the whole list, so you can try that on your own. It shouldn't take too long, but it may take a little bit longer, which is why we're not doing it in this tutorial right now. But let the code run one more time in full, just to make sure that you don't run into errors and that there are no false assumptions that we've made. And then let's see what we get. 15. 
Final Data For All Parsed Companies: Hey, everyone, it's Max, and welcome back to our data extraction tutorial in Python. So I went ahead and ran our code. If you didn't, don't worry about it; if you did, that's great. You would probably have noticed that at some point you encountered an error. That error, for me, happened when I was looking at the CINF ticker, and it told me that there was an index out of range. So what I did is I changed the code a little bit and reran it, and we'll walk through that in a second. I wanted to see if there were any other symbols causing errors, so I used try and except statements, so that if an error occurs, you can just skip past it. What we see is that at the very end, when it finished running, I looked at the length of one of these lists and we have a length of 504, so it looks like this CINF ticker symbol is the only one that's causing an error. What we're going to do now is change our code a little bit to add error handling in here. This part here was also just added at the end, to make sure that we actually get that length; it's an extra piece that was added after the last tutorial just to check how much data we actually have. So let's go ahead and put in some error handling so that we can account for the error that's going on here. In this case, there seems to be some error in getting more than the website, or at least that's what I was getting. For some reason this website is a little bit special and it's giving us errors in most of these regions. That's okay, though; we can get by without it, since we have a lot of data. So what we're going to do is add some error handling through a try statement. Inside of our loop here, we're going to try to do the following, and if this works, then great. 
If, however, an exception is raised when we try the following, we do something else instead. So, except: if this doesn't work, what we're going to do is go into our indicators here, and instead of appending nothing, which is what would happen if we didn't do anything, we're just going to append "N/A" like this. So: not a number, not available, whatever you'd like to call it. That's what we append instead. In case something doesn't work, we just say that the result we got is not available. So that's the little change that we're going to make to our code. We have a for loop, and we put in error handling that says: let's try to get the website, let's try to split it by the indicator and get out all the relevant information, and if we can't, then just put an "N/A" in its place. Those are the two cases that we have. We can let this run in a second, but first we're going to do a little bit more: we're going to look at transferring this into a pandas data frame, so that we can do some more data analysis with it. That will be our last step. So let's go ahead and write the code for that right now. We'll import pandas as pd, and then we take our data that we have here, which right now only includes the companies, and we update it with our indicators. Since none of our keys are overlapping, this works really well for us, because we can just add all of these indicators into this data dictionary, and then we can use this dictionary and put it directly into a pandas data frame. So first we update our data, and now all of the indicator data that we've appended and stored in here is going to be in this data variable as well. 
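Here is a small, offline sketch of that error handling and of the dictionary update. The `pages` dictionary, the `parse_indicator` helper and the values in it are made-up stand-ins for the downloaded Yahoo Finance pages, and the try statement is placed around each single indicator (an assumption, not necessarily where the video puts it) so that every list ends up with the same length.

```python
def parse_indicator(html_text, indicator):
    """Return the table cell that follows the indicator name."""
    # [1] raises IndexError when the indicator is not on the page at all.
    after_indicator = html_text.split(indicator)[1]
    return after_indicator.split("<td>")[1].split("</td>")[0]


# Made-up page snippets; the second one lacks the table entirely.
pages = {
    "AAPL": "<td>Previous Close</td><td>172.50</td><td>Open</td><td>171.00</td>",
    "CINF": "<html>page without the expected table</html>",
}

data = {"company": ["AAPL", "CINF"]}
indicators = {"Previous Close": [], "Open": []}

for ticker_symbol in data["company"]:
    html_text = pages[ticker_symbol]
    for indicator in indicators:
        try:
            value = parse_indicator(html_text, indicator)
        except IndexError:
            # Could not find the value: record a placeholder instead of nothing.
            value = "N/A"
        indicators[indicator].append(value)

# The keys do not overlap, so update() simply adds the indicator lists.
data.update(indicators)
print(data)
```

Because a placeholder is appended on failure, the company list and every indicator list stay aligned row by row, which matters once the dictionary is turned into a table.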
And then we'll just create a data frame like this. We can call it data frame, for example, and create a pandas DataFrame with data set to our data. That's the name of the variable that we have here, and that's going to be our input data. Now with this data frame we can do a whole lot of stuff. We can write it out to CSVs, do more analysis, restructure it, and whatnot. But that's not really the point of this tutorial. All that we're really looking at here is getting all of our data into a pandas-ready format, so that we can nice and easily do some more data analysis with it. So you can go ahead and run your code now, and at the very end you should have a nice, filled-up data frame that you can do all of your further data analysis with. 16. Final Result Static Websites: Hey, everyone, it's Max, and welcome back to our data extraction tutorial in Python. So these are our final results. I went ahead and printed out the data frame head as well as the data frame tail, which are part of pandas, and which we can see here. This is the tail of the data frame, and if we move up a little bit, we've got the head right here. Then we've got the dictionary keys, so we see that our dictionary has been nicely merged. We also see that, of course, inside of our data frame. And if we move up here, we see that the length of the companies is 505, as well as the length of, in this case, the expected dividend date or something like that. So that also has a length of 505, which is great. We've got exactly what we want, since 505 is also the number of stocks that are in the S&P 500. So we've got all of our data, although some of it, as we can see here, may not be available; that's just the way it is. But at least we have a lot of data that we can do stuff with, and we have it in a very nice structure, since we're using this pandas data frame here. 
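To make the data frame step concrete, here is a minimal sketch. It assumes pandas is installed, and the dictionary contents are invented placeholder values, not real results from the scrape.

```python
import pandas as pd

# Placeholder values for illustration only; the real dictionary comes
# from the scraping loop and has one entry per parsed company.
data = {
    "company": ["AAPL", "GOOGL", "CINF"],
    "Previous Close": ["172.50", "1050.10", "N/A"],
    "Open": ["171.00", "1048.00", "N/A"],
}

# Each key becomes a column, each list position becomes a row.
data_frame = pd.DataFrame(data=data)

print(data_frame.head())   # first rows
print(data_frame.tail())   # last rows
print(data.keys())         # the merged dictionary keys
print(len(data["company"]), len(data_frame))
```

All lists in the dictionary need the same length for this to work, which is why a placeholder is appended whenever a value cannot be parsed.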
So we were able to parse our data from, you know, not too messy but just general websites, and we boiled that data down very nicely and put it into this pandas data frame. Now we have very well-structured data that we can do all sorts of analysis with, in this case maybe finance or something. But this obviously doesn't just apply to finance; you could really do this for whatever it is that you'd like to do. 17. Prerequisite Libraries for Dynamic Web Scrapping: Hey, everyone, it's Max! And welcome back to my web scraping tutorial in Python. Now that we've covered how to get data from static websites, let's move on to getting data from dynamic websites, or websites that load their data with JavaScript. For example, if we stay on this Yahoo Finance website and move from the Summary tab over to Statistics, all of the information that we see down here is actually dynamically generated through JavaScript code. We can see here in the URL that it has been extended to show which tab we're on. But if you try to get the data from this website, you're not really going to get anything. None of these values are going to appear, because they're only generated once the page is loaded. More and more websites are moving towards generating their content with JavaScript, and if you want to parse those websites, it's not enough to just send a request and get the static HTML, because the JavaScript hasn't been run yet. So we need to learn how to deal with websites that use JavaScript. The way we're going to do that is with a Python package called Selenium. You can go over to selenium-python.readthedocs.io, which is where I am right now. This is where the documentation is. A simple way of installing it is just to type in the pip install. 
But again, there should be instructions here on how you can download all of these drivers. Just go ahead and pip install it; that's probably the easiest way to set all this up. So that's the one package that we're going to be using, and Selenium is pretty much going to allow us to open a browser in our code. That browser will then load all the JavaScript and load the website for us, so we pretty much have access to this dynamic website, or really to the whole browser, through Selenium. Now, one thing that you'll notice later on, and that's why we're going to talk about it now, is that when we run Selenium and create a browser, the browser actually opens as an application. We don't really care about the browser window that much, and we don't want it popping up all the time. So what we're going to do is make our browser headless. In other words, we're going to use a browser without actually having a browser window open. To do that, we're going to use PhantomJS. PhantomJS is actually for JavaScript, but we can just go ahead and pip install phantomjs too, just like we did with Selenium. So what does pip install phantomjs do? It's going to allow us to create this headless browser, and if you don't really understand what I mean right now, I'll also show you that concretely later on. It really allows us to work with this browser instance without having the browser open or pop up in the background every time we run the code. So that's what we're going to work with. Now, this next part is for the case where you're using Anaconda like I am and you don't have your Python path linked to your Anaconda path. 
So if you're using pip install and for some reason the packages aren't importing, a solution would be to go up here into your Python path manager, where you can add paths that tell it where your packages are. My packages are usually in one of these two paths: in the Python frameworks, under Python 3.5, either in bin or in lib and then over in site-packages. If you're not sure how I got this or where it came from, I can show you. When I do a pip install, because I have several versions of Python installed, I usually have to type in pip3.5 install to specify that I actually want to install the package for Python 3.5. So that's how you go about it. But where does that package actually land? To find out, I can use this little command called which. If I type in which pip3.5, that tells me where this executable is located, and that's pretty much the path that I've got here, the first path that we see. Sometimes the packages get installed in here; other times they get installed into site-packages. That's why I added both paths. In case your packages are installed somewhere else, you'd have to find where that is, and then you can add the path here by clicking Add path and navigating directly to it. So this is just in case you're downloading the packages and for some reason they're not showing up; this is one of the solutions. And this pretty much allows you to sync the Python 3.5 that you're using on your machine with the Anaconda platform here. 18. Short review: Recursive Functions: Hey, everyone, it's Max. And welcome back to my Python tutorial for web scraping. Before we jump right in completely, I'm going to cover one thing: I'm going to do a little review of recursion. 
Recursion is when a function calls itself from within itself again. We're going to be using this later on when parsing out some of the data, so I thought it might be nice, if you're not familiar with it, to quickly cover it now. What we're going to do is write a function that gives us back numbers from the Fibonacci sequence. If you don't know what the Fibonacci sequence is, it pretty much starts with the numbers zero and one, and then the next number is always the sum of the previous two numbers. So, for example, the third number would be the sum of the second number and the first number, so one plus zero, which is one. The fourth number is going to be one plus one, which is two. The fifth one is two plus one, which is three. The sixth one is three plus two, which is five. And then we've got five plus three, which is eight, and so on. So this is the Fibonacci sequence: the current number is the sum of the previous two numbers. We're going to write a little recursive function that calculates the sequence for us up to a specific number. So we say def fib, to create our Fibonacci function, and the input is going to be n. If our number is zero or one, we're going to return it. So the first thing we check is: are we at our final result? Ultimately, we're going to start at this number n, work our way back all the way to these two numbers, and then add them all together again. So if n is zero, then we want to return zero; else, if n is one, then we want to return one. These are the two base cases that we know, and everything else we're not going to hard-code. We're just going to work that out backwards. 
So in case n is one or zero, these are the results that we're going to get. If it's neither of these two, then we return the Fibonacci number of n minus one plus the Fibonacci number of n minus two. So what that means is that this is going to be our nth Fibonacci number; that's what we're getting from this recursive function. We check: is this number zero or one, so is it the zeroth or the first number? If not, then this number is equal to the previous number plus the one before that, so it's the previous two values added together. And that's pretty much what the Fibonacci sequence is. Remember what we had here: we start off with zero and one. The third number is equal to the previous one plus the one before that, so zero plus one, which is this one here. For the next one we again have the previous one plus the one before, so n minus one plus n minus two, which gives two. And then for three, we have n minus one, which is two now, plus n minus two, which is one, so one plus two is three, and so on and so on. So that's what we're doing here. Really, we're just calling this function again, and we go down, down, down until we reach a base case, where n is either equal to one or equal to zero. We reach one of these two base cases, return one or zero, and then build everything up again. So we have this huge tree that builds downward, and once we reach zero or one, everything is added up together again. Let's run an example of this and print out the third Fibonacci number. If we run this, we see we get two here. So we start off: this is what we call the zeroth Fibonacci number, just because of the way we define it in our code. This is the first Fibonacci number, because that's the way we define it in our code. 
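Written out, the recursive function described here looks roughly like this. The name `fib` and the zero-based counting follow the description in the video; the loop at the end is only added here to show the first few values.

```python
def fib(n):
    """Return the n-th Fibonacci number, counting from fib(0) = 0."""
    # Base cases: the two values we know without any calculation.
    if n == 0:
        return 0
    elif n == 1:
        return 1
    # Recursive step: the sum of the two previous Fibonacci numbers.
    return fib(n - 1) + fib(n - 2)


print(fib(3))  # prints 2

# The zeroth through sixth numbers: 0, 1, 1, 2, 3, 5, 8
for n in range(7):
    print(n, fib(n))
```

Each call that is not a base case spawns two further calls, which is the tree that builds downward until it hits zero or one. That is fine for small inputs like these, but it gets slow quickly for large n.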
That makes this the second, and that makes this one the third. If we do the fourth one, for example, we're going to get three. The fifth one should give us five, and then, nope, sorry, we wanted the sixth one and not the eighth one. Yes, so the sixth one should give us eight. So that's exactly what the Fibonacci series is. Now let's go through this once in detail, maybe for Fibonacci of four, and really write out what happens. Fibonacci of four checks, first of all: is four equal to zero? No. Is four equal to one? No. So what we do is return Fibonacci of three plus Fibonacci of two. That is what Fibonacci of four is equal to, like this. Now we need to calculate Fibonacci of three and Fibonacci of two. For Fibonacci of three: is three equal to zero? No. Is three equal to one? No. So Fibonacci of three is equal to Fibonacci of two, so n minus one, plus Fibonacci of one, like this. And over here we've got Fibonacci of two. We check again: is n zero or one? No, it's not, so this is going to be equal to Fibonacci of one plus Fibonacci of zero. Now, these ones we know: this is equal to one plus zero, because n here is equal to one, so we return one, and n here is equal to zero, so we return zero. So that's what Fibonacci of two is equal to. But here we still have to break down Fibonacci of three. This one over here is resolved now, but this one we need to break down. Right now it's equal to Fibonacci of two plus one. We're not going any deeper on that side, because Fibonacci of one, with n equal to one, gives us back one. But for this tree, we again need to calculate Fibonacci of two, which is equal to Fibonacci of one plus Fibonacci of zero, just like we had above. But again, this is a different tree. Here Fibonacci of one is equal to one, and Fibonacci of zero is equal to zero. So now we move back up. We have Fibonacci of two, and this Fibonacci of two represents this one here, so the final result here is one. 
So this goes back up, and now Fibonacci of two is replaced by one, and Fibonacci of two here is also replaced by one from this value. So this is equal to two, and now Fibonacci of three here is replaced by two, and all of this is equal to three. So Fibonacci of four is equal to three. That's how recursion works, and that's the very simple principle behind it: we call the function within itself. But we need to make sure that we have these base cases, these ending values that we check for, and that's as deep as we go. Once we reach them, we know the answer, and then we build up from there again. We always need to make sure that we check for them first and only otherwise call the function again, so that when we reach a base case, we stop our recursion. 19. Getting started with Selenium: Hey, everyone, it's Max, and welcome back. Now that we've got everything installed and set up, let's go ahead and do our first test run. What we're going to do is try to open this website here using Python. So we take the URL that we have up here, go into our coding environment, and save this URL in a variable that we're going to call url. Just directly copy and paste it in here. This is the URL that we'll be referencing from now on, so that we don't have to type it out each time. Later on, if you want, you can do the same thing that we did before with the different ticker symbols and change those dynamically, but let's start with the basics. The first thing we want is to somehow have access to this web driver, or to this browser. To do that, we're going to use Selenium, and we're actually going to import just one specific thing from Selenium. 
So, from selenium, we're going to import webdriver. You'll see in a second exactly what the web driver allows us to do, so just bear with me here. What we're going to use this web driver for is to create a browser variable, and the reason we need a variable is so that we can reference this browser later on. We create the browser and save it in a variable, so it's going to be kind of a running instance, and then we can always reference the variable to reference the browser that we've opened and are dealing with. That's why we want to save it in a variable. Of course, you don't need to call it browser; call it whatever you like. I'll just call it browser because that's a good representation in my head. To create this browser, we go into the webdriver that we've just imported, and here I'm going to use Firefox first for the example. You can use different things; you can also use Chrome if you want, or whatever else. But right now we'll just use Firefox. We'll also change this in a second; this is just to show you the first steps. All right, so what we've done is go into this webdriver module and create a Firefox browser. Now, what can we do with this browser? Now that we have access to this Firefox browser running in our code, we can get a URL. That's very similar to requests: when we get a URL with requests, it gives us back the HTML code. In this case, if we get the URL, then we're actually opening that URL in our browser. So with our browser we're pretty much navigating to this URL. It's like entering the URL in the address bar at the top. And then at the very end, we're just going to quit our browser. 
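The three steps (create the browser, get the URL, quit) can be sketched like this. The URL is an assumed example of a Yahoo Finance statistics page, not necessarily the one used in the video, and the helper names `open_and_close` and `create_firefox` are hypothetical. The browser is created through a small factory function only so that the steps can be tried without launching a real browser; running it for real requires Selenium, Firefox and a matching driver to be installed.

```python
def create_firefox():
    # Imported here so the rest of the sketch works without Selenium installed.
    from selenium import webdriver
    return webdriver.Firefox()


def open_and_close(url, create_browser):
    """Open `url` in a freshly created browser, then quit the browser."""
    browser = create_browser()   # the running browser instance
    browser.get(url)             # navigate, like typing the URL into the address bar
    browser.quit()               # close the browser at the very end


# Assumed example URL; replace it with the page you want to open.
url = "https://finance.yahoo.com/quote/AAPL/key-statistics?p=AAPL"

# Uncomment to run against a real Firefox installation:
# open_and_close(url, create_firefox)
```

Unlike `requests.get`, the `get` call here returns nothing useful by itself; the loaded page lives inside the browser object, which is why the browser is kept in a variable.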
So these are the very simple steps that we're going to take right now. We create a Firefox web browser, then we get the URL, or navigate to the URL, and then we just close our browser. All right, now that we have that, let's give our code a nice little test run. What we see, and this is what I mentioned when we talked about the packages and libraries, is that a Firefox browser actually opens up right here. That's what happens when I create this Firefox web driver: my Firefox browser opens, and I have this app running right here. Now it has closed again, because at the very end I quit the browser. But this is the thing I was talking about: if we create this web browser, then an application is going to open for us. What headless means, and we'll see that in a second too, is that we won't have this app pop up in the background. If we run that again, we see it here, and once it actually opens, maybe I can click on it, and we see the web browser navigating to this website. It's loading the website, and once the website is done loading, it should just close the browser, because we're not really doing anything else. There we go: as soon as it finished loading, we closed the browser. So that's what's going on when we create this Firefox driver. I think you're also starting to understand why it would be nice to have a headless browser. It's not a big nuisance, but it's nice not to have that app open all the time in the background. To do that, we're going to use PhantomJS, which we pip installed before. So instead of Firefox, we're going to use a PhantomJS web driver. This is going to allow us to create this kind of phantom (hence the name) web driver. 
It's not going to create an app that pops up. There is one more thing that we need to do with it, and that is to define the path where the PhantomJS executable file is located. If you don't know where this executable file is, we can again go over into a terminal and type which, and this time look for phantomjs, and it's going to give us the path to the PhantomJS executable file. We're going to take this path and use it in a second. What we do now is, inside of this PhantomJS browser, give it the path to the file. The reason for that is probably that there isn't really a standard path, so your Python environment may not know where the PhantomJS executable file is, and we're going to tell it. Otherwise, it's going to tell us that the executable file is probably missing, unless you have some special setup already. So, generally, we can just tell the Python code where this executable file is. We go into our PhantomJS web driver here and define the executable path to be, inside of a string, this path. If you're on Windows, you can also put an r in front of the string to make it a raw string rather than a plain string, in case there are things like backslashes or double backslashes that you'd otherwise have to use. If it works without, then you can also do without. But just in case something weird happens for you, or it says the path isn't found or some other error, you can also try it with an r in front to make that path a raw path, or rather a raw string, not a raw path. 
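As a hedged sketch of this step: the Windows-style path below is a hypothetical example, and `create_phantomjs` is a made-up helper name. Note also that `webdriver.PhantomJS` only exists in older Selenium releases (it was deprecated and then removed in Selenium 4), so this will only run with the kind of older setup used in the video.

```python
# Two ways of writing the same (hypothetical) Windows path.
plain_path = "C:\\tools\\phantomjs\\bin\\phantomjs.exe"   # backslashes doubled
raw_path = r"C:\tools\phantomjs\bin\phantomjs.exe"        # raw string, no doubling

print(plain_path == raw_path)


def create_phantomjs(executable_path):
    # Imported here so the string comparison above runs without Selenium.
    # Requires an older Selenium release that still ships PhantomJS support.
    from selenium import webdriver
    return webdriver.PhantomJS(executable_path=executable_path)


# With such a setup, and the path reported by `which phantomjs`:
# browser = create_phantomjs("/path/reported/by/which/phantomjs")
# browser.get(url)
# browser.quit()
```

The raw string matters on Windows because sequences such as `\t` or `\b` inside a normal string would otherwise be read as a tab or a backspace rather than as part of the path.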
All right, so here is what we're doing. Before, we created this Firefox driver; our Python knows where Firefox is, and it's also inside of the webdriver module. And because we installed PhantomJS, we can now use PhantomJS to create this phantom web browser, so that we don't have this app popping up all the time. But since our Python environment may not know where it is, we tell it: hey, this is the path to the executable file. That's pretty much what we've done here. We've changed Firefox to this PhantomJS driver so that we don't have the app popping up, and we defined the executable path because our environment may not know where it is. It may, if you're lucky or have things set up, but when I first ran this it gave me complaints, so I had to tell it exactly where this executable file is. All right, let's try our code one more time. If we run things now, we see nothing pops up down here, and we'll wait just a little bit. And now our code finished running. We don't actually see any printed output, but that's no different from what we had before. We didn't get any errors, so it seems to be working quite well, unless there's something weird happening in the background. We've created this phantom web driver that opens the URL for us, and then it closes again once the URL is loaded. So this is a nice first step: we have our browser inside of our coding environment, we don't have anything popping up in the background every time it opens and closes, and now we can go on and start doing more with it. 20. View The Page Source: Hey everyone, it's Max, and welcome back to my Python web scraping course.
So now that we've got this browser running, we can pretty much go ahead and extract information. But there are some things that we need to know about Selenium and how we can nicely get data out. So let's take a quick detour and look at that first, so that we really understand what we're doing in a second and at least get a general overview. Something you probably know is that when you get the website back, you get this HTML code, and you may be a little bit familiar with what HTML looks like. If not, you don't need to be an expert in HTML, CSS, or even JavaScript. You just need to have a general idea of what things look like, and I'll show you a quick example, because just having this general idea tells you a lot about how the website is formatted and structured. That's what all of this is really going to tell us: what the format looks like, and the order of the website. If we understand that, we can better use it to take out the sections that we're really interested in, because everything is nice and ordered. So let's look at a quick example, and then we'll go into applying it. I'm just going to write down something very simple. You'll maybe have something with this html tag at first. In the website that you get back, these tags usually have the less-than and greater-than signs around them to indicate where you're at. So if you start an HTML website, you'll probably have this less-than, html, greater-than. This is the tag that's going on. And then, going into a subsection, inside we would maybe have something like a body.
Inside the body, you can have something else, like a form that you need to fill out. And then things are closed at the end. To close a tag, you put a forward slash in front of its name, and that tells you that this form is over. Now, you don't really need to know what exactly a form is; the word pretty much tells you already, and you don't need to know specifically how it works. But you should be able to tell: oh, this is where I would find a form, inside of this tag, for example. Once the body is done, you close the body the same way, and at the very end the html tag appears again, closed. So that's a very, very rough overview. Of course, websites get a lot more complicated than paragraphs and headers, a lot more. But generally what we care about is that there are names in here, and we can use these names to get information out. If we know where the information is, we can say: our information, for example, is in the form, and to get to the form we need to go into the html, then into the body, then into the form, and inside of there is where we find our information. That's what's important to us. Sure, it's also the stuff in between, but it's really the path that we need to take to get to our final destination. Now, if you have heard of XML before, which is a way of formatting data, also for storing it, this may seem familiar to you, at least using these tags to navigate. Otherwise, you can also imagine it as something very similar to a dictionary.
You could imagine it something like this: you've got this dictionary with an html key, and inside of html you find the body, and inside of this body we have our form, and this has some specific data. So the data is what would be in here. Then the form closes after the data, then our body closes, and then our html closes. This one here would be our html closing, and this one would be our body closing. Sure, your form could technically also be a dictionary like this, but let's just keep it as a value; otherwise we'd need to put another key-value pair in here. So if you like dictionaries a little bit more, you can imagine it like this. This is also very similar to the JSON structure, or pretty much is a JSON structure. It's important to have at least a general understanding of how the HTML is structured, because we will be using that in a second to locate our data. If the website creates data with scripts, we want to be able to navigate there, and we don't want to have to search the whole website without knowing where we are. So this is going to help us simplify things. Of course, if you want to do it a different way, you can do that too. But we'll do it this way, because it really helps us narrow down exactly where the data is. So that sums up the general HTML or XML format, and its relation to a Python dictionary or a JSON data structure.
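The dictionary picture above can be written out directly. This is just a toy sketch of the analogy: the nested dictionary, the placeholder value "some data", and the helper name `follow_path` are all invented for illustration and are not part of Selenium or of any real page.

```python
# Toy stand-in for the page <html><body><form>some data</form></body></html>:
# each tag name is a key, and nested tags are nested dictionaries.
page = {
    "html": {
        "body": {
            "form": "some data",   # placeholder value, not real website data
        },
    },
}


def follow_path(tree, tag_names):
    """Walk down the nested dictionary one tag name at a time."""
    current = tree
    for name in tag_names:
        current = current[name]
    return current


# Going html -> body -> form is the same as page["html"]["body"]["form"].
print(follow_path(page, ["html", "body", "form"]))
```

The list of names plays the role of the path we take through the tags: knowing the path is what lets us jump to the data instead of searching everything.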
So, yeah, that's just a reference that you can come back to. Now we'll try to use this, and let's get something out of this browser first. Now that we have this browser, we want to see what's going on. The first thing we want to do is print: we can go into the browser, and in the browser we can print the page_source. That's all of the source that we're getting inside of the browser. I'll run this, and then I'll show you in the browser in a second what exactly all of this text relates to and what it means. But let's run this first to see what we get out of it; for the browser view we'll be using Chrome, because Chrome has a cool feature that lets you view the page source. Here we see, for example, these tags, and then we've got the scripts, and we've got the forward slash here. So we see that the html is ending, the body is ending, and this script over here is ending. These are all the tags that I was talking about. In here it looks like we have some form of JSON data structure, and then we've got a script starting here, and the script has specific identifiers and all of these extra things in it. Okay, so let's see that in the browser. If we go into Chrome, which I'll use because it nicely lets you see the source, and press Option-Command-J on a Mac, what we get is this panel that pops up, and this is the page source that we've just loaded. You see that if I hover over here, it highlights the different sections and what they correspond to. And here we're going, for example, inside the body.
And then here we can go into this div, and there are specific IDs here, these unique identifiers. So this is how that translates: this is what we've pulled in, all of that page source, and we see there's a bunch of stuff in the body, and there are also lots of scripts here. When we printed the page source, what we really printed is all of this, all of it opened up. That's all of the information that's available to us, and somewhere in here our data is available to us. Somewhere in there this script has been executed, and the data is there. Now all we have to do, "all we have to do" in quotation marks, is find where it is. Once we find where it is, we can narrow things down and pretty much go there directly, and then hopefully we can pull things out nicely and efficiently. So that's the general structure, and this is how we get the page source out. In the next tutorial we'll go into a lot more detail, and you'll understand why I brought this stuff up. It's important because it has to do with how we can use Selenium to narrow down our search. 21. Website Elements and XPath: Hey guys, it's Max. Welcome back to my Python tutorial on web scraping. Let's now apply what we covered last time with all of these tags. Let's also comment out this page source print for a second, because it prints out a lot of stuff. Now that we've gone through this and have a little bit of a review, let's understand why we actually did that. The reason we learned that is because in our browser we can actually select certain elements.
For example, we can select this body section here. I'll show you how we can do that, and I'll give you a concrete example of what that looks like rather than just explaining it in the abstract. Let's do it first and then understand what the output is, because that's probably a better way of approaching it. We're going to save the result that we get in a variable that we'll call element, and you'll see the reason for that name in a second. We go into our browser, which has loaded the website that we're accessing, and in Selenium there are these methods that are called find_element_by. That's why we're calling it element: we're finding an element, and that's what it's called in Selenium. And we're going inside of this browser because this is the web browser that we've created, which has all of our content, with the URL loaded and everything. So we go into the browser and try to find certain elements. Now, if I type find_element and let the suggestions pop up again, we see there are many different things to find elements by: a class name, for example, an ID, a name, a CSS selector, partial link text, tag name, XPath. Let me show an example. Here we can see an id that you can reference, for example, and here we've got a class name, and here you've got another id. These are the different things that we can reference. The more familiar you are with HTML or with the website, the easier you can pick out these elements. If you're not really familiar with it at all, it takes a bit more work.
Then we're going to have to play around with it a little bit, which is also what we're going to do now. Right now, we don't really have any understanding of what this website looks like, and we're going to try to decipher it and find everything. Okay, so first of all we're going to use find_element_by_xpath. The reason is that an XPath is an XML path. Whether or not XML rings a bell, it's something very similar to this format: an XML document looks very similar to HTML, and the path tells you where certain data is located. In the JSON or dictionary format, under the html key we have all of the following; in the XML format, inside this html tag we have all of this data. That's the principle behind it. So what we do now is go to the very top and start with the html path, because that's our very top, and that's what websites usually start with. We'll see what we get from there. We find the element by the XPath html. So our path right now is just html: we're on the top level, and we're choosing the element with this tag, html. Note that it's this name right here. It's always these tag names, not any of the other stuff like the src or the id. It's really this pinkish-purple name that we see over here. So that's the XPath. In terms of a dictionary, it would be like accessing your dictionary and going into the html key.
So this would be the Python dictionary equivalent of this XPath, just for reference, in case this looks a little bit weird. I'll leave that up here just in case. All right, so this is how we can pick out these elements. With elements, we can reduce the information that we have. Let's see what is actually contained inside this element. The first thing we can do is go into the element that we've gotten and print out its tag_name. The tag name is just this name here, so in this case it would be html. If we got the body, our tag name would be body instead; if we got the form, our tag name would be form. In terms of a Python dictionary, it would be the name of the key: not the full path, just the name of the key. Let's try that out. We run our code, wait for our driver to start and for everything to load, and we see that the tag name we have is html. So this is the tag that we're in right now. That's good to have: now we have an overview and know where we are. Let's try to print out some better information, maybe some text. Something else we can do is go into the element and print out its text. If we run this, then once it's done we get the tag name again, and we also get this formatted text. You'll notice immediately that this text is nowhere near as long as the page source we got before, even though we didn't really reduce the information, because everything is still included inside of the html tags. So something is a little bit different, and that's because the text included here isn't actually the full text. It's the superficial, nicely formatted kind of text that is specific to this region.
That can be nice, but in our case it's not, because we're interested in something much deeper. To see the other part of it, let's comment this out, go into the element, and use the method get_attribute. I'll scroll over so we can see the rest. To see the full output again, including all the JavaScript output and all of that hidden text, the attribute that we get is textContent. It's very similar to text in name, but very different in what it returns. If we run this one more time and wait for our browser to load the website, here we go. Now we've got the full output again, which is much nicer. This is exactly what we wanted, and somewhere in here our data is contained. So that's the general idea behind elements: we can use these tags to navigate. If things still seem a little bit weird, don't worry about it. We're going to use this more and more, and I'll make sure to go through everything slowly and explain again why we're doing certain things and what those paths look like. By the very end, hopefully you'll understand a little better how you can use XPaths to find your data. Now that we've gotten this element and gotten the text content out, what we're going to look at next is finding our data first, and then later on we'll learn how to navigate directly to that data. 22. Navigating Deeper Into The Page Source: Hey everyone, it's Max, and welcome back to my Python web scraping tutorial. Last time we saw how we can find these individual elements that give us tag names and all these things, and reduce the information. We're going to look at one more thing.
Sometimes, if we go and find an element, there could be several that match. It could be that inside this body we don't have only one form, but maybe two forms, like this. So we want to be able to get not only one element, but all of them. We're going to look now at how we can handle that. The reason we're doing all of this is that it's going to make it a lot easier to narrow down the paths to the data. Sure, we could take a shortcut and just look for the values and pick out the individual strings. But let's make this really clean, so that our code can also scale and apply to different things, so that it's not just a one-time thing, and so that we really understand what's going on, why we do it, and how we do it. All right. So what we saw here is that we used find_element_by_xpath. What we can do instead is use something called find_elements_by_xpath. So for the case that there may be more than one element at this path, we can try to get all of them. Now, we've gone into this html path and gotten that element and everything associated with it. What if we want to go in here and look at the next level of tags? If you want to go into this html element, it's almost like going into a folder, so we put a forward slash. But rather than specifying a folder name, we can use an asterisk. If you're familiar with regular expressions, or with using your terminal, you may have seen this before. If you don't know what it means: it means that we go in here and look at everything. The star pretty much means everything.
Again, I'll show you exactly what that means in a second. What the star allows us to do is go into this html tag, or you can imagine going into the html folder, and list pretty much everything in it. And what we're doing with find_elements is that, because we're listing everything, we also want to store everything. Let me first show you what happens with just the asterisk here, without find_elements, and then we'll change this code to use elements. We're not going to print out all of the text content; we're just going to print out the tag name. So let's run this code and wait a little for it to finish. Ah, all right. Apparently there's something wrong with the website right now, or at least with the browser. Sometimes this happens, and we just need to wait a little bit for this error to go away. While we wait, let me explain what we're doing. If we go back into Chrome, go into this html tag, and collapse everything, we see that inside of html there are two other elements, or tags. First we've got this head, and then we've got this body. What we're doing with html/* is going into this html, and then we have these two things that pop up. The slash is pretty much like opening this dropdown. We can't see it here for the html because it's a special case, but if you were to do, for example, html/head/, that would be like opening the dropdown for the head here.
That's really what we're doing with this html and forward slash, and the asterisk allows us to look for everything. Rather than looking specifically for head or body, it lets us look for everything here. We're not giving any specification; we're just saying look for everything directly beneath this html element. That's the principle of what we're doing right now. Let's run our code one more time, and hopefully this error has resolved itself. We'll wait for the web browser to start and get the URL. All right, not right now. We'll get back to this once the error has settled down. It's really more of a timing thing than anything specific that you can do about it. It has to do with the HTTP connection, so you just have to wait it out; nothing has dramatically changed in your code. All right, everything seems to be working again. One more thing I noticed is that we've renamed this variable to elements, because we changed the method here, so we can quickly change it back to element, and then we'll change the name in a second so that it's more descriptive of what we're actually doing. Let's run this one more time. We open the browser, get the URL, and once that's done we should see the tag. All right, we see the first tag, which is this head. Now, the reason that we don't get the body: there are two tags inside this html, the head tag and the body tag. The asterisk allows us to get everything in there, but we're not getting everything because we're using find_element_by, and this just finds the first element.
That's why we're going to switch to elements and find_elements_by. This allows us to get not just one element but multiple, and to read all of them we use a little for loop: for element in elements. We loop over each of the elements that we get back, and each of them is saved in the temporary variable that we call element. So what we get back is a list of elements, and we loop over them element by element and print out the individual tags. Let's run that, and once it's done we should hopefully see all of the tags at that level. We see the head and the body, so that's great. What we've done is find all the elements at a certain path, which is html and then everything inside html. We go inside of html and look at all of these elements, and because we're using find_elements, plural, we get all of them rather than just the first one. They're returned to us in a list, and then we can loop over the list. Each value in the list is an element of this browser, and we can loop over each of those elements and print out the corresponding tag name. We see that first we get the head, and then we also get the body. So this is the start of how we can navigate our website using these elements and using this forward slash and star to get more of the content. Then we can start going deeper and deeper into our website. For example, say I go into the head now and look at everything inside the head. What I'm doing here is going into the html, then into the head. If we look at that in the browser's page source, I'm going into the html, then into the head.
So I'm unfolding this and now looking at everything in here. I can run this one more time and we'll see the results. Okay, let's wait for that to run. There we go. Now we see that inside of the head there are a bunch of different tags, and that's everything we see over here. The reason we're getting all of that is again that we're using the asterisk. We're also not getting the body anymore, because now we're inside of the head tag and taking everything from there. We're getting all of the elements, and we don't care what name they have. That's what the asterisk does: get all the names, everything in there, get all the elements, store them in a list, go through them, and print out the tag name. So that's the general way we can start navigating this website. There are different things that we can put in here, too. Right now we went into html and then looked at the head. Something else we can do is go into html first, like this, and get everything there. Then, if we want, we can take one of these elements and create new elements from it. We're getting all of these back and saving them in the elements list, so it's like we've got an element one and an element two, which in our case correspond to the head and the body, the two things we got back up here. What we can do is go into each of them. The first one, for example, is the head, saved in element. On it we can again call find_elements_by_xpath, and then put a new path here.
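The difference between asking for one element and asking for all of them can be tried out without starting a browser. The sketch below is an analogy, not Selenium code: it uses Python's built-in `xml.etree.ElementTree` on a tiny hand-written page, where `find` plays the role of find_element_by_xpath (first match only) and `findall` the role of find_elements_by_xpath (a list of all matches). The sample markup is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written page (invented for this example); the root element is <html>.
SAMPLE_PAGE = """
<html>
  <head><title>Demo</title><script>var x = 1;</script></head>
  <body><form>first form</form><form>second form</form></body>
</html>
"""

html = ET.fromstring(SAMPLE_PAGE.strip())

# "./*" means: everything directly beneath the current element (here <html>).
first_child = html.find("./*")       # only the first match
all_children = html.findall("./*")   # every match, returned as a list

print(first_child.tag)
for element in all_children:
    print(element.tag)

# One level deeper: everything directly inside <head>.
head_tags = [element.tag for element in html.findall("./head/*")]
print(head_tags)
```

Just as in the lesson, the single-element search stops at the head, while the plural search returns both the head and the body, and adding head to the path swaps the listing for the tags inside the head.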
Rather than typing out the full path every time, we can use a dot and a forward slash. What the dot means is: use the current path that we're in right now, so start from where we are. Since we're already in html, the dot pretty much stands for html, so it's like writing html and a forward slash. If we were in html/head, the dot would stand for that. So the dot just means that wherever we are right now is where we want to start; we don't want to start from the very beginning. If you leave out the dot and just have the slash, that means you start from the very top. So to continue navigating we can use the dot, and that way we don't always have to add all these extra tag names and keep track of our path; we just continue from where we are. And if we put the asterisk after it, that means: from the current element, go to the next level down and get everything there. What we can do then is use another for loop. We'll save the result as new_elements, because they're sub-elements, and then for every new_element in the new_elements list that we found, we print out new_element.tag_name, just like we're doing here, but now one level deeper. Let's run that and see what we get. Something you may also notice is that with our web driver, everything takes a little bit longer.
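The idea of continuing from an element we already hold, instead of spelling out the full path again, can also be sketched offline. As before, this is an analogy using Python's built-in `xml.etree.ElementTree` rather than a real browser, and the sample page and the dictionary name `children_by_parent` are made up for the example. The variable names element and new_elements mirror the ones used in the lesson.

```python
import xml.etree.ElementTree as ET

# Invented sample page; note that <p> sits two levels below <body>.
SAMPLE_PAGE = (
    "<html>"
    "<head><title>Demo</title><meta/></head>"
    "<body><div><p>text</p></div><form>data</form></body>"
    "</html>"
)

html = ET.fromstring(SAMPLE_PAGE)

children_by_parent = {}
elements = html.findall("./*")                # head and body
for element in elements:
    # "." = start from this element, "*" = everything one level below it.
    new_elements = element.findall("./*")
    children_by_parent[element.tag] = [new.tag for new in new_elements]
    for new_element in new_elements:
        print(element.tag, "->", new_element.tag)
```

Because each search starts from the element itself, the paragraph nested inside the div does not show up: the star only reaches one level down from wherever we currently are.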
This is also one of the reasons that, if you have a static HTML website, you probably want to use requests. What you're noticing, and the reason it takes so long, is that this web driver needs to start, it needs to load the website, and it needs to execute the JavaScript. So if you don't have JavaScript, or you don't need to get data from JavaScript, don't use these dynamic web drivers, because they take so much longer. As we can see, every time I run the code it takes a bit of time, because the program has to start, load the website, and execute the JavaScript, and only then start doing everything else. So if you don't actually need it, you should probably stick to requests and the static HTML, because it would make your code much faster. That's just a short explanation of why it takes a bit longer to run. All right, on the side here we get a bunch of output. Let's see where the head stopped. This is all of the output that's in the head, which is what we've seen before. So first we went in here, then we went into the head and printed all of that. The next time around we went into the body; this part is a lot shorter, and those are all of the tags that belong to the body. So that's how we can continue navigating deeper. We can take the elements that we already have and again use this find_elements_by method to go deeper into them. We're not limited to a single search where we have to specify the full path; we can use the elements we already have and go deeper and deeper. And that's what we're going to use to go ahead and find our data.
23. Identifying The Path To Our Data: Hey everyone, it's Max, and welcome back to my Python tutorial for web scraping. Now that we've covered elements and have a bit of an understanding of XPath and navigating with it, let's try to find our data, and then let's try to get to our data. What I'm going to do is comment all of this out so that we don't have all of this output, and print out the page source again, because the page source contains all of the information, including the content output by JavaScript. In there we're going to look for our data, and then we're going to write several scripts that go through everything, get out our data, and have it nicely formatted. It's going to be really cool, and that's our final goal. So let's first run this, and in the meantime let's also refresh the website that we have here, so that we know what values to look for. All of this is generated by JavaScript. And if we go back, here we have our page source. What we're going to do is pick out one of these values. For example, let's pick out the trailing P/E and see if we can find 17.21 somewhere. We just want to see where we can find this trailing P/E. The reason we're not searching for this name is that, as you may remember, names sometimes look different in the HTML than they do here, because there may be extra formatting. So we may not always get a direct hit just by putting in the name. Instead, we're going to look for the number, the value, and see if that tells us anything and gets some useful information out. So we've got this page source output, and I'm just going to use Command-F and manually search for the value, which was 17.21.
All right, so 17.21. I see I've got a hit here, and I'm inside the code block. This looks like the raw value and this is the formatted value, so let's scroll along and look at what this belongs to. This looks like a dictionary or JSON format, and we see we have this trailingPE name here. This is what I talked about before: the name in here doesn't always correspond to the name that you see out on the page. This is probably just the value, and then it gets reformatted on the web page. So this is, apparently, the name that actually contains the data we're interested in. We're going to take this name and copy it over for now, and we'll use it in a second. Right now we don't really know where this is; we just used Command+F, and it's somewhere in the page source. So that's great: first of all, the data exists, as it should, so we know it's somewhere in there. Now we need to find it, and the first thing we want to do is find where in all of these tags it is. So that's what we're going to do now: we're going to write a script and see what the path to this data is. That's our first task, to narrow down exactly where we have to go to find this data. Now, of course, you could just say, "Hey, I found the data here. I'm going to copy out the values that are interesting to me, write a little script, and take the next formatted value after I see trailingPE." Sure, you can do that, and then you'd be done. But we're going to do this a little more elegantly, so that we understand much better exactly what we're doing and end up with a really nice script that just works.
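The manual Command+F search can also be done in code. This is a minimal sketch, assuming a made-up page source string; the key spelling trailingPE and the "raw"/"fmt" layout are illustrative assumptions, and in the lesson the real string would come from the browser's page source.

```python
# Hypothetical fragment of a page source, used only for illustration.
page_source = (
    '<html><body><script>root.App.main = '
    '{"trailingPE":{"raw":17.21,"fmt":"17.21"}};</script></body></html>'
)

value = "17.21"
position = page_source.find(value)  # -1 would mean the value is not in the page

# Print some text on either side of the hit so the surrounding name is visible.
start = max(0, position - 30)
context = page_source[start:position + len(value) + 20]
print(context)
```

Searching for the value first and then reading the name next to it mirrors the approach described above: the label on the page and the name in the source do not have to match.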
So of course you can stop here and say, "I'm done," but we'll take this a little further so that there's some extra learning involved. We'll see these cool methods, we'll understand the elements a little better and how we can deal with them, and we'll see some cool approaches that you might like to use later. They can act as inspiration that you come back to and say, "Hey, I've seen this before. Let me just use this; it will simplify my life." That's why we're going to write a method that gets this path for us, the XPath directly to our data. OK, so as the name already suggests, we're going to define a function, and we're going to call it find_xpath. What we want this function to do is go through all of this and give us the path to our data, whatever it may be. For example, maybe it's html/body/form, and that's where the data is. We want this path so that we can search for the elements with find_elements_by_xpath and go directly to it. Then we don't need to deal with all the rest of the data; we can throw that away and look only at the data that's really interesting to us. That's what we're doing with find_xpath. All right, so we need some inputs. The first input we need is a reference to either an element or our browser, so something that we can call find_elements_by_xpath on and that we can search through. We'll just call that element, in general. Then we also need our target value, which in this case is trailingPE, but we'll keep that a little more general in case you want to look for something else instead. So we're going to have our element.
We're going to have a target, and we're also going to have our path. The path is the actual result that we're interested in, the final path, so we also supply it and keep changing it as we go. These are the three inputs that we're going to use. With them we're going to write a nice little recursive function that goes through all of the tags in here and tells us exactly what path we can use to get to this data. OK, so let's get started. Inside find_xpath, the first thing we check is whether our data is in there, so whether our target is in the element. Here we use the get_attribute call with textContent that we used before, to see if our data is already in the text content. Remember, get_attribute with textContent gives us something like the page source for that narrowed-down element, so it gives us all of the text, not just the limited text that we get from .text. We're going to see if our target is in there. Then we narrow this down a little. We know that our data is generated by JavaScript, so it's going to be inside a script tag. We limit our search with that knowledge, because otherwise, if we just take the html tag and get its text content, then trailingPE is contained in there too, but that's not really a path; that's just html. So we use the fact that our data is inside script tags, and we add a second condition: and element.tag_name is script.
So we check whether the tag name is script. Let's look at that in Chrome again and see if we can find a script here. So this is a script tag, and we know that our data should be somewhere inside a script. That's what we're looking for here: if our tag name is script, and if somewhere inside the text content of that script we have the target value we're looking for, then we return our path. That's the final path, and once we've reached that final element we just return it. Now, in case we don't find it, what we do as a kind of worst case is return nothing. We didn't find it, there's no path to return, so we just return an empty string that doesn't contain anything. So if we do find it, we return the path; otherwise we don't return anything. Now for the actual searching part in between, the recursive part. If we don't find it in this element itself, we need to look further down. So we create new elements that are the children of our current element. For example, if we're in html and it's not a match, we go into each of its child elements, so we go into the body, for example. Our new_elements are going to be equal to the current element that we're in, for example html, with find_elements_by_xpath called on it. There shouldn't be a space here; the call should be directly attached to the element. So that's element.find_elements_by_xpath, just as we did before.
For the path, we just stay where we are right now and look at the elements directly below, the same ./* as before. Again take the example of being in the html tag. If we look at the html element and its text content, our target is actually contained in that text content, but our tag is not script, so that's a little too general for us. There should be a more direct path than just html. So to find all of the elements inside this html tag, we go into the element that corresponds to it and list all of the elements inside, and we save those in a variable that we call new_elements. What that allows us to do is go in here and, if we don't find anything, go one step down. Then it's the same again for body: because body's tag name is not script, we go into the body and get its new elements, which are going to be script and form, and then finally we go into a script and find our data. So this line is pretty much what starts the going deeper and deeper into the tags; we're just following the path down through this XML. Now that we have those new elements, we need to loop over them, new element by new element. So for new_element in new_elements, we want to get the final result, which we store in final_value, and we search again in each of these new elements. For example, again with html: we go into the html, and since the tag name is not script, we don't even enter this if statement. Then we find the new elements, which in this case is actually only body, and inside body we do the same thing again. So now that we know it's not html itself, we look inside the body: we call find_xpath again. The element we're searching on is no longer the element we came in with but the new_element we're looping over. Our target is still the same, and our path has changed a little: it's our current path, plus a forward slash because we've gone down one tag, plus new_element.tag_name. For example, for html our current path would just be html, and once we go into the body it's html plus a forward slash plus the body's tag name, which is body. So the new path would be html/body. And now we just need one final check: if final_value is not an empty string, which is what we get when we don't find anything, then we return final_value. All right, so this is our recursive function. I've got some indentation problems here, so let's move everything back by one level so that the indentation is aligned. All right, all the complaints from Python have gone. Great. So what we've done here is write this nice recursive function. It may seem a little weird and abstract at first, but it helps if you go over it several times in your head or try some printing. Actually, what we can do is print out the path that we're currently looking through, to see how this code is going to run.
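The recursive search described above can be sketched without a browser. This is a hedged stand-in, not the lesson's exact code: it walks a made-up page with Python's built-in xml.etree.ElementTree, and the helper text_content is a hypothetical substitute for get_attribute with textContent. In the Selenium version, element.tag would be element.tag_name and the loop over children would come from find_elements_by_xpath with ./* (newer Selenium releases spell that call find_elements(By.XPATH, ...)).

```python
import xml.etree.ElementTree as ET


def text_content(element):
    """All text inside the element, including its children (like textContent)."""
    return "".join(element.itertext())


def find_xpath(element, target, path):
    # Base case: the target is in this element's text AND the element is a script.
    if target in text_content(element) and element.tag == "script":
        return path
    # Otherwise go one level down and search every child in turn.
    for new_element in element:
        final_value = find_xpath(new_element, target, path + "/" + new_element.tag)
        if final_value != "":
            return final_value
    # Nothing found below this element: signal that with an empty string.
    return ""


# Hypothetical miniature page used only for illustration.
page = ET.fromstring(
    "<html><head><script>var a = 1;</script></head>"
    '<body><script>root.App.main = {"trailingPE": 17.21};</script><form></form></body>'
    "</html>"
)

print("The final path is:", find_xpath(page, "trailingPE", "html"))
```

The tag check in the base case is what stops the search from returning plain html, whose text also contains the target.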
But pretty much what we're doing is going in here and navigating through all of the tags. For example, if we go into Google Chrome, we start in html. Then we go into the head, and we see that the head itself isn't what we're looking for. So we go into the first tag, which is the script, and there's nothing inside here. So we open this script and go inside here, and we go inside all of these. Once that's done and we haven't found anything, we go on to the next one, and the next, and the next. Once we've looked through all of the head, we go into the body, then into here, then into here, then into here, traversing all of this, and so on. If we don't find anything here, we go to the next one. So that's what we're doing: we're searching through all of it in depth, going deeper and deeper, checking until we get to the very last elements, and if we don't find it, we go back up. We're pretty much traversing the whole tree, down as far as we can go or as far as we need to go to get to our data. All right, so that's what the function we just wrote does. Now the only thing we need to do is call this function. So let's create the seed value. Since we need an element to put in here, we create a starting element, which we'll call element. We go into our browser, and in our browser we find elements by XPath, starting at the very top, just at html. So we're not really doing anything here; we're just getting the first element in the HTML, and that element's tag is going to be html.
So we're really not making any assumptions; we're just starting at the very top, at the html tag, and that's our starting point. Then let's just print out the final response. We call find_xpath, our starting value is this element, and our target value is this trailingPE. The trailingPE, again, is the actual name. Let me see if I can find it again. Ah, here we go. So again, trailingPE is the name that actually holds our data, and we want the path to it. We know it's somewhere in here, but we want the actual path, and that's why we're going up and down inside this tree, checking through everything, so that we know the exact path to this data. That's the final value we're looking for. And our starting path is just html, since we're starting at the very top. Maybe just for some nice formatting we can print "the final path is the following" in front of it. Then we can run this, and let's scroll down to the very bottom. Down, down, down. OK, here we go. Here we see what our script is doing. It starts off at html, then goes into the head, then into the script, looks through the script, and keeps going down and down. And if we compare that with the page, it starts at html, then goes into the head, then looks in here, and in here, and it looks through all of those. So we see we're going deeper and deeper and looking through all of it. Our script right now is going through the whole tree, the whole depth, until it finds the path to where our data is contained.
So until it finds this path, until it finds the data in here and we get the resulting path, it's going to continue looking. Or, if it doesn't find anything, it's just going to return an empty string to us. But that shouldn't be the case, because we know the data exists. If it does happen, that probably means that the tag name we're looking for isn't script. That would be the only other explanation for getting an empty path: the tag name is not script, and our data is actually contained in something else. But because we know that our data is loaded through JavaScript, our tag name should be script. All right, so I'm just going to let this run, and in the next tutorial we'll look at the results, because this is going to take a little bit of time to run.
24. Using The XPath To Our Data: Hey, everyone, it's Max. Welcome back to my Python tutorial for web scraping. So now that the code has finished running, it's produced a result for us and told us the XPath to our data. So we can go, for example, into Chrome here, go into the html and then the body, and then into the script. Now, what we see here is that there are three scripts present. Just from looking at it, we can kind of see that it's probably the first one, because if we go into it, it actually says data here. So that's a good indication that it may be the first one. But we can't be really sure, just from the result we have here, that it's the first script rather than the second or the third. We've got the XPath, but because there are different scripts at that path, we're not sure which one it is. So we're going to write a little piece of check code just for that, so that we know which one it is.
All right, so what we can do is copy this final path that we got, replace what we had before, and find elements by this path. The reason we're doing this is that if there's only one element, good, there's only one; but if there are several, we need to identify which one of these scripts contains trailingPE. So we create these elements, which we identify by the XPath that we got from our recursive code. Now again, the recursive code: if it looks a little weird to you at first, or it's a little hard to grasp, I completely understand. Recursion is a weird thing to get your head around. It sometimes takes a while to develop, with lots of debugging, lots of weird error messages, and things going on that you don't want. So I understand that recursion is kind of hard to think about. But if you really get into writing recursive code, you see it's not all that long, especially if I take away the print statement. It's not long, but it's actually quite powerful, for example for going through the whole depth of our tree until we get that final result. So it's really cool; it's just a little weird to think about the logic at first. But it's one of those things that you slowly get into, or at least you'll understand the general idea. You always want to check first whether you've reached your result. Otherwise you continue doing more checking. And you always want to return something at the end too, in case you don't find anything. That's the general idea that you're going with.
And then you can just do a little bit of trial and error, do some debugging, and finally you get that nice result that you want, and that's really rewarding. So I know this recursion is a little weird, but maybe you can also try to go through it by yourself a little more to really understand what we're doing. Fortunately, it's not that long, and there aren't really that many complicated parts to it. It's a check and a loop, finding the new elements on the next level of the path, and then just searching again. I hope the output helps you here too, showing that we're just going in and out and checking through everything. All right, so now that our script has done its job and found the path, we're going to find all of the elements and do a little loop. We say for element in elements, so we loop over these elements one by one, and we check if trailingPE is in our element, in the text content that we get with get_attribute. That again gives us all of that text that we got before, the text that actually contains our data. If trailingPE is in there, then that's the one we want. We're also going to have a counter, and our counter is actually going to start at one rather than zero. Unlike the usual counting in programming, for example lists, which start at zero, in the HTML path the first script is the script with index one. So this first script that we have here is index one rather than index zero, and that's why we start at one, not at zero. It's one of those little things that can change a lot. So again, we're starting at one here.
Again, the index starts at one here rather than zero, because in the HTML path, unlike in Python, counting starts at one. So if we find the data in a script, we print out the counter, and at the very end of the loop we increment the counter by one. Technically, if we want, we can also break here, because we assume that the data is in only one of the scripts. But we can leave that out for now; maybe we'll comment it out, like that. If you don't know how I comment out a line here: I just press Command+1, and that comments out the line. You can also highlight several lines and comment them all out. That's just one of those nice shortcuts. Anyway, for a bit of an overview, let's also print out element.tag_name, just to see that they're all scripts and we don't have to look that up separately. So we'll run this. What's nice is that we're not using find_xpath again. We only needed that once, to identify the final path, and now that we have the final path, we can put it in here directly. Now we're just going through the different elements, and we see we reach our first script, and we see that trailingPE is actually in the text content of the first script, so it prints out the counter. The other ones are also scripts, but they don't contain the trailingPE data. What that allows us to do is put script[1] in here, so script with index one. As we saw in Google Chrome, we have several scripts that we can choose from, and this index one lets us choose the first script. So that's what we're doing. Then we can also take away this find_elements loop. Actually, maybe I'll keep it here so that you can reference it in the code later on. I'll take this line, comment it out, and copy-paste it onto the next line.
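The counter check can be sketched as follows. As before, this is an illustrative stand-in that assumes a made-up page and uses xml.etree.ElementTree instead of the lesson's web driver; enumerate with start=1 replaces the manual counter. The point it shows is the same: positions in the path expression count from 1, so the loop counter has to start at 1 to line up with script[1].

```python
import xml.etree.ElementTree as ET

# Hypothetical page with several scripts under the same path.
page = ET.fromstring(
    "<html><head></head><body>"
    '<script>root.App.main = {"trailingPE": 17.21};</script>'
    "<script>var b = 2;</script>"
    "<script>var c = 3;</script>"
    "</body></html>"
)

target = "trailingPE"
elements = page.findall("./body/script")

matches = []
# Start counting at 1, because positions in the path expression are 1-based.
for counter, element in enumerate(elements, start=1):
    print(counter, element.tag)
    if target in "".join(element.itertext()):
        matches.append(counter)

print("scripts containing the target:", matches)

# Use the position we found to select exactly that script.
chosen = page.find("./body/script[%d]" % matches[0])
print(chosen.text)
```

Had the counter started at 0, the number printed for the matching script would have pointed at the wrong position once it was put into the path.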
So here we'll choose the first script. We'll do the find-elements call on that path, and we'll change the variable name back to element. This loop we can also comment out, and maybe we'll move it down to the very bottom, just to keep a bit of order in the structure. All right, so we've got our element here from our browser. Let's inspect it again: we go into element and print out its text content, so we get the textContent attribute like this and print it. We'll run that and wait for it to finish. OK, there we go. Now we can go in here and search for trailingPE. I'm going to scroll up, because we ran this code several times, and we want to make sure that the value we're getting is the only one, which it looks like it is. I search for any other ones, and there's nothing else here. So we see that this element really just contains the data. If we scroll all the way to the top, we can look at how this data is formatted. There's all of this... almost there... OK, here we go. So this is where we start; this is what's contained inside the element. It starts off with this function (root), then this root.App.main, and then our data starts here. It's all contained inside this dictionary or JSON format. If we go into the body and into the first script in Chrome, we see the same thing: here the function starts, and then this is the data that's contained inside, all in a nice JSON format like this. All right, cool. So now we've identified our data. We can also scroll to the very bottom and back up a little, and we see there's a bit of extra formatting after the end of the data, but it looks very, very similar to a dictionary or JSON format.
So the next thing we're going to do is reduce this down into a JSON format that we can nicely search through to get our data out. But yes, we've already done a really good job of identifying the path and identifying which of these scripts it is, and we've narrowed down our content immensely. Like I said before, you could have gone in here and copied out all of these values. That would have worked for this case, but it would have been more manual labor and less flexible. This way we also learned a lot more. And for cases where there are hundreds of pieces of data that you don't want to copy out one by one, you can do this, and at the end we'll see that we have access to a lot more data, much faster and in a much cleaner way.
25. Parsing Out Our Data: Hey, everyone, it's Max. Welcome back to my Python tutorial for web scraping. So now that we've got this data from last time, let's go ahead and reduce it down to the part that's actually useful. Then we're going to put it into a JSON format so that we have very nice, very fast access to it. So we'll put it in a format like this, pretty much like a dictionary, because that's also how it's stored in here. That makes it really easy for us to put it into that format, just because that's the way it's already stored. So what we're going to do is save our data in a variable that we'll call temporary data, and rather than printing out this text content, we're going to save it in there.
We don't want to keep all of this data, so we take the text content, and the first thing we do to get it into proper JSON format concerns the end. If we go to the bottom here, we see there is this extra stuff at the back that's not really part of the JSON object or of the data; it's part of the function around it. For example, what my environment does is this: if I click this bracket here and then scroll up to the top, it highlights where the bracket starts. So let me go all the way to the top and find it; this is very long data. Ah, there we go, we passed it. This bracket corresponds to the start of this function here. So there's some extra stuff at the bottom that we need to get rid of. We look at the very bottom and at its format, and we also see, if I highlight it, that there's this extra newline here. So what we're going to do is strip some things off from the back. The first thing we want to strip off is the newline. Then we also want to get rid of this extra semicolon here. Then we want to get rid of this thing here. We know that none of this belongs, because it's not part of a JSON format. The JSON format, again, is very similar to a dictionary format, it pretty much looks identical, and none of that stuff belongs in there, so we can just strip all of it off immediately, like that. All right, so that takes care of the back end, at least for now; we'll change it again in a little bit. The next thing we're going to do is take off the top. So we update our temporary data, and instead of going into the element, we now go into the modified text content that we already have.
To get the start of our data, we go to the top here and split by this. So we go into our text and split it by this piece here, and we make the piece we split by as long as possible, because then there is probably only one of these in the text: root.App.main, then a space, an equals sign, and a space. So we split by this. When we split by it, the piece itself gets taken out; everything before it becomes the first element of a list, and everything after it becomes the second element. The second element in a Python list has index one. So we split here and take everything after it, the element with index one. But we're not actually going to take all of it. We take this text and go all the way up to the third-to-last value. After the JSON we also have this extra semicolon here, so we want to get rid of that, and then there's this extra parenthesis that also doesn't belong, so we get rid of all of that. Again, you see that if I highlight this bracket here, we can identify where it starts. Let's go back to the top. So that bracket marks the end of our data here. If I put my cursor here again: this is the end of our data. The closing bracket of our data is the bracket before this semicolon. We've already taken all of this away, so this is now the last element, and this is the second-to-last element. Since this is all a string, we go all the way up to the third-to-last element, and that takes these things away. So that's our data, now in string format, and our whole string contains JSON-formatted data. Now we just need to turn our string into a dictionary, into that JSON format. The way we can do that is to import json.
And this is just a standard library that comes with Python. We're going to save our result in a JSON object, which we'll call our JSON data; that's going to be our data in JSON format. To set it, we go into json, where there's a function called loads, into which we can put our temporary data in the form of a string, and it's going to transform it into this JSON format. Right now our data pretty much looks like this, just more complicated, but it's in string format, and while it's in string format we can't call the keys. So we want to turn it into a JSON format, or a dictionary format, and json.loads allows us to put in the data in string format and turns it into this dictionary format for us. So, for example, what I can do now is print out my JSON data's keys, the keys on the first level of this JSON data. We'll let this run, and if everything worked out correctly, we should see them. So we see the top-level keys are context and plugins. If we go here, for example, the top-level key is context, and somewhere else there should be plugins. So it seems everything has worked correctly: our data has been transformed into JSON format, and now it's much easier for us to handle, because we can access things like this, for example: JSON data and then context, and then we go into this context key. So, for example, if our JSON data looked like what we have up here, we could go into html and then we would have all of this. It's so much easier for us to access all of our data, and we don't need to look for matching strings; we can just use these keys.
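The steps described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the sample string is made up, the tail after the JSON object (a semicolon, a newline, and `}(this));`) is an assumption about how the script ends, and the variable names are hypothetical rather than taken from the course code.

```python
import json

# Made-up stand-in for the script's text content; the real page is far longer.
# The tail after the JSON object is an assumption about how the script ends.
text_content = (
    '(function (root) {\n'
    'root.App.main = {"context": {"dispatcher": {}}, "plugins": {}};\n'
    '}(this));\n'
)

# Strip trailing characters that are not part of the JSON. Note that rstrip()
# removes any of the listed characters, not this exact sequence.
modified_text = text_content.rstrip("(this));\n")

# Keep everything after the assignment (index 1 of the split), then drop the
# last three characters: the semicolon, the newline, and the closing brace
# of the surrounding function.
temp_data = modified_text.split("root.App.main = ")[1][:-3]

# Turn the JSON string into a Python dictionary.
json_data = json.loads(temp_data)
print(json_data.keys())
```

Because `rstrip()` works on a set of characters, this only behaves as intended when the JSON itself does not end in one of those characters; here it ends in `}`, which is not in the set.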
And the reason that works, again, is the way this data is formatted in the response from the function. The response of the function actually contains JSON, which in Python looks just like a dictionary. So if you know dictionaries, you know JSON, and I assume you've probably heard of dictionaries before. That's how it looks here, and because it looks like that, we're able to take the whole string, turn it into JSON format, and make everything a lot easier for us. So this does a great job of making our data a lot more accessible to us, because the JSON data structure is much easier to handle than pure strings, and it has a lot of structure to it. The next thing we're going to look at is identifying where exactly inside this JSON data our trailing P/E data is, and maybe also the rest of our data. We're going to identify the path to it so that we can just grab all of that data out nicely, without having to look through all of it, go through the string, and try to find where the brackets open and close and all that stuff. So we're going to identify exactly the path we need to take to get to our data. 26. Getting Our Final Data: Hey everyone, it's Max. Welcome back to my Python web scraping tutorial. We're getting really, really close to our final data. We have everything in JSON format now. The only thing we're missing is the actual path, in the JSON this time, to our data. So we're going to do something very similar to what we did up here for finding the XPath, but we're going to find the JSON path instead. Let's comment out this example from before and define a new function, which we'll call find JSON path.
What it's going to do is something very similar to above, where we went through all of the elements and the attributes and found the XPath. Now we're going to go through our JSON object and find the JSON path. So our input is going to be a JSON object, then we're also going to have the target value to look for, and then, again, the path variable. There's one more thing that we need, which we'll call the match type, and I'll explain in a second what that is. So these are the four parameters that we're going to use. Now, this is where the match type comes in. We're first going to check whether the type of our JSON object is equal to our match type, and the match type we pass in is going to be the type of our JSON data. The reason we're doing this is that if we go too deep, we're no longer going to get a JSON object. For example, if we go too far in here, we're not going to get a JSON object anymore; we're going to get an individual value, which can be an integer, a Boolean, a string, whatever. So to avoid running into errors, we're going to check, and we're only going to go deeper and deeper while we're still inside JSON format. Otherwise, we're not going to go any deeper. That's why we check whether our JSON object is still a JSON object; if it isn't, it means we went too far. And the reason it works like this is that we're going to search through the whole JSON object and check whether any of the keys are the key we're looking for. If they're not, we don't need to go further in and look at the data behind an individual key, because we know that key doesn't hold the data we want. So if we're that far in, we're already too far in.
So that means that if our JSON object's type is no longer JSON format, or a dictionary in this sense, we've already gone too far. I'll reiterate that in a second, but we'll leave it like this for now. If all of this is good, we do the following; otherwise, we just return an empty string, in case we don't find anything. And here, again, we're going to use recursion. What's nice is that the recursion works very similarly to what we had above. The first thing we check is whether our target value is in our JSON object. What we're doing with this check, if target in JSON object, is testing whether the target value is one of the keys on the current level of our JSON object. So, for example, for this HTML example, we start in here; all of this is our JSON object right now. The key right now is html, and trailing P/E is not in here, as you can see. So then we go into html and look at body, and it's not in there either. So we go to the next one, and here we'll have something like script, and inside that we'll have another JSON object or something like that. Oops, let me get the formatting right. And in here we're going to have our trailing P/E, like that. So that's roughly how it's going to look. We go into body and don't find anything; then we go into script. Again, it's not going to be exactly this format; it's different now. This is the format we used when finding the XPath. There's no more html and body and so on in the data, because it's all JSON data now, and all of this data represents what was created in this JavaScript here. So all of this data is the kind of stuff that we're seeing here.
So there are no more HTML tags or anything like that; I'm just using them for the example. Eventually we're going to go into this thing here, which isn't actually going to be called script; it'll be called something like our data, so whatever that name is, call it the name of our data. Inside of it there's going to be that trailing P/E value, and that's what we're checking for: whether that key is inside the dictionary here. So we don't actually need to go into trailing P/E and look at the value. We just need to see whether the dictionary we have here contains the trailing P/E key. That's what we're doing here: we're checking for the key. If it is in there, we return the path. In case it's not in there, we use a for loop. We say: for new key in our JSON object. So first we go through all the keys in our JSON object; for example, here we could have a second key, key 2 with value 2, or something like that. First we check whether any of these keys match our target value. If they don't, we go through each of these keys one by one. So, for example, here we check whether this key, html, is equal to trailing P/E. It's not, so now we go through each of these keys one by one, look in here, and then in a second we take this key and go inside it. So we're doing this depth search again; that's our final goal. First we check whether it's on the same level. If not, we go inside and look for it there, and if it's not in there, we go into the next one. So we're doing the exact same depth search that we did before, the one that got us that XPath.
All right, so we're going to iterate over all of the new keys, and again, our final value is going to be equal to the result of find JSON path. Our input is not going to be the JSON object itself; to go into it, we take our JSON object and go into the new key. That's this in-depth step. Up here, we went into the new element, which is inside, so we went down one level. Here we're going into the JSON object inside that new key, for each of the keys we're iterating over. Then, since we're calling this function here, we also need to supply the target value again. Our path is going to be our current path with the new key added to it. We'll just use a general format here: we'll separate everything by commas and add our new key like that. And then we keep the same match type, so this is a constant that we just continue to pass along. Then, if our final value is not the empty string, which is what gets returned when we don't find the right data, we return our final value. So again, this is a recursive function. We can also run it while printing out the path, for example, so that we can see how we're going through everything. The structure is very similar to finding the XPath like we did before. We're going in depth through everything, and if we don't find anything in the first one, we go out and into the second one, all the way down and up, down and up. That's what we're doing with this recursion here. To run that, let's just print out the result. So we're going to call find JSON path. Let me actually copy the whole function here.
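The recursive search described above might look roughly like the sketch below. The function and parameter names, the comma-separated path format, and the sample data are assumptions pieced together from the narration, not the course's exact code; the nested keys and the numbers in the sample are made up for illustration.

```python
def find_json_path(json_object, target, path, match_type):
    """Depth-first search for the dictionary that contains `target` as a key.

    Returns a comma-separated path of keys leading to that dictionary,
    or an empty string if the key is not found.
    """
    # Only keep descending while we are still inside a dictionary;
    # plain values (strings, numbers, Booleans) mean we went too far.
    if type(json_object) != match_type:
        return ""
    # Check the keys on the current level first.
    if target in json_object:
        return path
    # Otherwise go one level deeper into each key, one by one.
    for new_key in json_object:
        final_value = find_json_path(
            json_object[new_key], target, path + "," + new_key, match_type
        )
        if final_value != "":
            return final_value
    return ""


# Made-up sample shaped like a nested JSON object.
sample = {
    "plugins": {},
    "context": {"outer": {"inner": {"trailingPE": {"raw": 15.2, "fmt": "15.20"}}}},
}
final_path = find_json_path(sample, "trailingPE", "", type(sample))
print(final_path)
```

One limitation of this string-based design: if the target key sits on the top level, the returned path is the empty string, which is indistinguishable from "not found". Returning a list of keys, or `None` on failure, would avoid that ambiguity.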
So we're going to call find JSON path, and we need to define the match type. That match type is just going to be the type of our JSON data. Our JSON object is just going to be our JSON data. Our target value is, just like above, the trailing P/E; here we go, that's the final key we're looking for. Our current path is just empty, and our match type is the match type. All right, let's run our code, see that in action, and hopefully it gives us the path directly to our data. Here we see it's pretty much going over all of these different keys. First it goes into context and so on, and it goes down, and we see that our final path is this last value that we got here. We can also print that out again: the final path is the following. So this is our final path. We could run it one more time, but this is the result that we got. We did get something, which is great, and this tells us exactly where our data is located. To access it, we can just go directly into our JSON data and then into the context key. Now we know this is exactly the path we need to take to get to our trailing P/E values. So we go into context, and from there we go into the dispatcher. I'm also showing you this in Chrome: we go into context, then into dispatcher here. The next thing we go into is stores. Again, our result is this one right here; the reason we have all this other stuff coming out is that we're printing the path as we go. So we go into stores, and then into the quote summary store. Let's just copy and paste that over to make sure we don't make any mistakes. And then we go into summary detail, like this.
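Following a comma-separated path like the one printed above can also be done in a loop instead of typing out each key by hand. The sketch below is an illustration only: `follow_path` is a hypothetical helper, the exact key spellings (`QuoteSummaryStore`, `summaryDetail`, `trailingPE`) are assumptions based on the spoken names, and the values are made up.

```python
def follow_path(json_object, path):
    """Walk a comma-separated key path and return the object it leads to."""
    current = json_object
    # Skip empty pieces, such as the one produced by a leading comma.
    for key in [piece for piece in path.split(",") if piece]:
        current = current[key]
    return current


# Made-up sample; key spellings are assumed from the narration.
json_data = {
    "context": {
        "dispatcher": {
            "stores": {
                "QuoteSummaryStore": {
                    "summaryDetail": {
                        "trailingPE": {"raw": 15.2, "fmt": "15.20"},
                        "dayHigh": {"raw": 101.5, "fmt": "101.50"},
                    }
                }
            }
        }
    },
    "plugins": {},
}

final_path = ",context,dispatcher,stores,QuoteSummaryStore,summaryDetail"
summary_detail = follow_path(json_data, final_path)

# Equivalent to indexing key by key, as done in the lesson.
same_thing = json_data["context"]["dispatcher"]["stores"]["QuoteSummaryStore"]["summaryDetail"]
print(summary_detail["trailingPE"])
```

Splitting on commas assumes that no key contains a comma itself; a list of keys would be safer if that cannot be guaranteed.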
And we'll just print out the result that we get from here, run this again, and wait for it to finish. All right, here we go. If I scroll through it, we see we've got pretty much all of the relevant data: regular market open, payout, open interest, day high, et cetera. And somewhere in here we also have our trailing P/E; right here, that's where our trailing P/E is. So now we've got this direct path to all of our data, and what's nice is that we don't just have our trailing P/E value, we have all of the other values too. All we have to do now is loop over all of these keys and use them to extract the data directly, so there isn't really much more to do. And what we've done so far is pretty cool. We've taken this huge dynamic website, reduced everything down, and built this in-depth searching recursion to tell us the exact path. Once we got to our data, we again parsed through the whole JSON object to find the path to our data, and now we have all of these keys and values here. So we've done a really great job with that. If we want, we can do what we did before and import pandas as pd. What's cool about this JSON object is that it's pretty much just like a dictionary. So we can save this as our final data, like this, and just save the whole JSON object in there. Then we create a pandas DataFrame, which is going to be equal to pd.DataFrame (capital D, capital F), and our data is going to be equal to our final data. Then we can print out our DataFrame. Let's give that a run and see what we get. All right, again, waiting for that to load. So, cool.
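As a small illustration of that last step, a dictionary of dictionaries can be passed straight to `pd.DataFrame`: the outer keys become columns and the inner keys become the row index. The sample values below are made up, and the key names (`raw`, `fmt`, `trailingPE`, `dividendYield`) are assumptions about how the scraped data is shaped.

```python
import pandas as pd

# Made-up stand-in for the dictionary found at the end of the JSON path.
final_data = {
    "trailingPE": {"raw": 15.2, "fmt": "15.20"},
    "dividendYield": {"raw": 0.02, "fmt": "2.00%"},
}

# Outer keys become the columns; inner keys ("raw", "fmt") become the index.
data_frame = pd.DataFrame(data=final_data)
print(data_frame)

# Individual values can then be looked up by row label and column name.
trailing_pe_raw = data_frame.loc["raw", "trailingPE"]
```

From here the usual pandas tools apply, for example `data_frame.to_csv(...)` or `data_frame.to_excel(...)` for exporting (the latter needs an Excel writer package such as openpyxl installed).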
So now what we've done is take all that data and transfer it directly into a pandas DataFrame, which is really, really cool. We see we have all of these values stored in here, and they're directly accessible to us. Once you have these recursive functions, that's a lot easier than copying out each of the values and each of the names, keeping those in a list, and then going through your text over and over again, looking for those values, splitting there, and taking out the raw value. Here we just have each of these values as a column; each of the different indicators is a column. We see that here we have the formatted value and here we have the corresponding raw value, and sometimes there's also a long format and all types of other things. So that's what we've done, and it's very cool. We've really taken all of this data, and we have access to a lot more of it. And in the end, the formatting is very easy, because we can put it directly into this pandas DataFrame and do all the pandas things that we also did before: exporting it, using it in Excel, or continuing with our data analysis there. 27. Final Results Dynamic Websites: Hey, everyone, it's Max. Welcome back to my Python web scraping tutorial. So what we've managed to do now is finally get all of this data out that was dynamically generated through JavaScript code. The way we did that is that we created this browser instance, and we created a phantom browser so that we don't have the application popping up the whole time. We had to define the path to where the executable file is.
Then we navigated to this URL, went through it, and got out the text content. We figured out the path to our data by recursively going through all the paths until we identified the right one. Then we isolated our data, put everything into JSON format, and ran another piece of code that traversed our JSON data until it gave us the path to the final data that we're looking for. And then we were nicely and easily able to transfer that into a pandas DataFrame. Now, what we've done here is a little more intricate than just reading out the HTML code and looking for the values you're interested in: copying them out, putting them in a list, running over the list, going through the string, and then looking for the corresponding raw value, that is, the next raw value that appears. For example, I could have copied out dividend yield, gone through my text until I found dividend yield, and then taken out the raw value. I could have done that, but it's not as clean as what we did here. What we did here is extremely clean, also because the final data format that we have is much nicer and we have access to a lot more data. It's also a lot safer, because it could be that this dividend yield isn't the only occurrence in the text. With all of this parsing, all of this checking, and all of this depth traversing, we made sure that the data we got is actually the final data that we want. So, great job, guys. I hope you've learned a lot about how you can approach scraping dynamic websites and some of the things that you have to watch out for. I also hope the recursive code that we've created together can be useful to you, or at least the idea of this depth traversing. Of course, you don't always have to do it.
And every website may be different, too. Nevertheless, I really hope that you're able to use this, that it's been interesting and fun for you, and that you go out and apply it to something fun and hopefully get out some cool data that you can do some really nice data analysis on. One more thing that's important to note: you shouldn't use these Selenium web drivers if the data or the website that you're parsing is just static HTML. The reason is that you saw how long, relatively speaking, it takes for the browser to open and navigate to the website, and then it takes a lot of time for the JavaScript to run. So if you don't actually need JavaScript content, you shouldn't use these web drivers. You can just use requests or something like that, get the static HTML, and parse that, because that will go a lot faster than having to run all of the JavaScript on the website before you can start parsing the data. So be mindful of which methods you're using. Don't always use the one that's most obvious, and don't always stick to the one you used last just because you know it works. Of course it would work, and if time is not an issue for you, then that's great. But if you want to write more efficient algorithms, these are things you should keep in mind: do I need to load all the JavaScript? Because that's going to slow my code down, especially if you go through hundreds or even thousands of different websites. If you need to run all the JavaScript on them, that's really going to slow down your code, compared with just getting the static HTML and parsing that without caring about the results of the JavaScript, because they're not relevant to you anyway. So these are some things that you should keep in mind.
Nevertheless, you should now have a lot of power and a lot of ability to go through the web and do lots of great, fun scraping. I hope you enjoyed what you learned in this section, and I hope you can apply it, too. I hope you had fun, and I hope you've learned a lot. 28. WebscrapingPythonOutro: Hey everyone, it's Max, and I just want to say congratulations on finishing the course. I also want to remind you to go through the exercises in the project section so that you apply the skills you've learned in the course. Also, think about some ways to use web scraping in your own life, be that extracting more data to provide context for an analysis that you're doing, or creating a database for yourself using that unique, unstructured data that you find on websites. It's basically out there everywhere; it just needs to be gathered before you can use it. Then you can use it for personal projects or for work projects too, to have extra data to work with, whether for an analysis, for creating a report, or for whatever other reason you may be using data.