Machine Learning for Absolute Beginners - Level 2 | Idan Gabrieli | Skillshare


Machine Learning for Absolute Beginners - Level 2

Idan Gabrieli, Pre-sales Manager | Cloud and AI Expert

Watch this class and thousands more

Get unlimited access to every class
Taught by industry leaders & working professionals
Topics include illustration, design, photography, and more


Lessons in This Class

42 Lessons (4h 2m)
    • 1. ML Level 2 Promo

    • 2. Welcome!

    • 3. Anaconda Installation

    • 4. JupyterLab Overview

    • 5. Working with a Jupyter Notebook

    • 6. Python Fundamentals for Data Science - Overview

    • 7. Variables and Data Types

    • 8. Strings

    • 9. Lists

    • 10. IF and For Loop Statements

    • 11. Functions

    • 12. Dictionaries

    • 13. Classes

    • 14. Importing Modules

    • 15. Libraries for Data Science Projects

    • 16. Exercise #1 - Python Fundamentals

    • 17. Introduction to the Pandas Library - Overview

    • 18. Series Data Structure

    • 19. DataFrame Data Structure

    • 20. Data Selection in a DataFrame

    • 21. Exercise #2 – Pandas Series and DataFrame

    • 22. Loading Data into a DataFrame - Overview

    • 23. Kaggle and the Titanic Dataset

    • 24. Loading a Tabular Data File

    • 25. Adjusting the Loading Parameters

    • 26. Preview the DataFrame

    • 27. Using Summary Statistics

    • 28. Methods Chaining

    • 29. Sorting and Ranking

    • 30. Filtering

    • 31. Exercise #3 – Data Loading and Analysis

    • 32. Grouping and Aggregating

    • 33. Data Cleaning and Transformation - Overview

    • 34. Removing Columns or Rows

    • 35. Removing Duplicate Rows

    • 36. Renaming Column Labels

    • 37. Dropping Missing Values

    • 38. Filling in Missing Values

    • 39. Creating Dummy Variables

    • 40. Exporting Pandas DataFrames

    • 41. Exercise #4 – Data Cleaning and Transformation

    • 42. Course Summary - Let’s Recap






About This Class


Unleash the Power of ML

Machine learning is one of the most exciting fields in the high-tech industry, gaining momentum in a wide range of applications. Companies are looking for data scientists, data engineers, and ML experts to develop products, features, and projects that will help them unleash the power of machine learning. As a result, data scientist is one of the top ten most in-demand jobs worldwide!

Machine Learning for Absolute Beginners

The “Machine Learning for Absolute Beginners” training program is designed for beginners looking to understand the theoretical side of machine learning and to enter the practical side of data science. The training is divided into multiple levels, and each level covers a group of related topics for continuous, step-by-step learning.

Level 2 - Python and Pandas

The second course in the training program aims to help you start your practical journey. You will learn the Python fundamentals and the powerful pandas data science library, including:

  • Python syntax for developing data science projects

  • Using the JupyterLab tool to work with Jupyter notebooks

  • Loading large datasets from files using Pandas

  • Performing data analysis and exploration

  • Performing data cleaning and transformation as a pre-processing step before moving on to machine learning algorithms

Each section includes a summary exercise, along with a complete solution, so you can practice your new knowledge.
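As a taste of what the course covers, the loading, cleaning, and exploration workflow described above can be sketched in a few lines of pandas. The column names and values here are invented for illustration, and the in-memory DataFrame stands in for a file you would normally load with `pd.read_csv`:

```python
import pandas as pd

# A small DataFrame built in memory (a stand-in for pd.read_csv("your_file.csv"))
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Bob", None],
    "age": [34, 28, 28, 45],
})

df = df.drop_duplicates()           # data cleaning: remove duplicate rows
df = df.dropna(subset=["name"])     # drop rows with a missing name
adults = df[df["age"] >= 30]        # data exploration: filter rows by a condition

print(adults)
```

The same pattern (load, clean, filter) scales from this toy table to the large datasets used later in the course.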

The Game just Started!

Enroll in the training program and start your journey to become a data scientist!

Meet Your Teacher


Idan Gabrieli

Pre-sales Manager | Cloud and AI Expert





1. ML Level 2 Promo: Hi, and welcome to this training program about machine learning. My name is Idan, and I will be your teacher. Machine learning, and the umbrella term artificial intelligence, are exciting and engaging topics gaining tremendous momentum every year. They represent a mind shift in how we develop applications: instead of using hard-coded rules to perform a task, we let the machine learn from data, decipher complex patterns automatically, and gain new knowledge. Companies are looking for ways to put those technologies to practical use, as features in products — physical products like our mobile phones, or virtual products like a recommendation system on a website. It's a game-changing technology, and the game has just started. The market demand for skilled people is growing, and as a result the data science community is becoming the hottest place in the high-tech industry. Still, machine learning is a complex topic divided into many subtopics, and we can easily get lost while trying to figure out where to start and what kind of skills we should develop. This training program provides a comprehensive yet straightforward and sequential landing pad for beginners. You can follow the complete learning path step by step, or decide which levels are relevant for you. Level 2 is the first step in building a machine learning pipeline. We want to make sure we feed that pipeline with the most relevant and optimized data. Therefore, we must develop the skills of loading a dataset, performing data cleaning, handling missing data, filtering things that are not relevant, and transforming the raw data into a more optimized input for a machine learning system. In this level, we will set up a Python development environment, learn how to use a Jupyter notebook, learn the basic Python object-oriented syntax, load, clean, and transform a raw dataset using the famous pandas data science library, and finally perform initial data analysis.
Sign up today and start to learn the exciting concepts of machine learning and data science. Thanks for watching, and I hope to see you inside. 2. Welcome!: Hi, and welcome to the second level of my training program about machine learning. My name is Idan, and I will be your teacher. In Level 1 we talked about the most fundamental concepts of artificial intelligence and machine learning. It was a soft entry point to get you started with this exciting topic. Now you know what supervised learning and unsupervised learning are, what a trained model and a training dataset are, and so on — you got some of the core terms in machine learning. Level 2 is the next step, moving into the practical side of machine learning: taking some raw dataset about something, exploring it using all kinds of data analysis and data visualization methods, and then being able to use machine learning algorithms for performing tasks like prediction and classification — and hopefully building some useful machine learning projects. At this level, we're going to jump-start our practical journey. We're focusing on one single programming language, called Python. Python is the most popular language for data science projects, and the main objective of this level, Level 2, is to learn the fundamentals of the Python language, including one of the most popular libraries, called pandas. As always, we're going to do things in baby steps, learning something new and then trying it with practical use cases, hopefully while enjoying the learning process. Every level after this one, meaning Level 3 and Level 4, will keep building our knowledge in Python in the context of data science while covering more advanced topics. Okay, I would like to wish you exciting and useful learning. Let me know if you have any questions, and I will do my best to help you. Let's get started. 3. Anaconda Installation: As part of the training, we will need to run Python code in a Python environment.
As you will see in the next lecture, the most popular way to create a data science project is by using something called Jupyter notebooks. So our first step is to make sure you have a local Python environment on your computer, together with the needed data science libraries. There are also online options for running Python, but I recommend having your own local installation. The good news is that it is an easy process; we will do it by using the popular Anaconda package manager. Anaconda is an open-source distribution package for Python that already includes multiple data science tools, including a tool called JupyterLab. It comes as a package with everything that is needed to develop projects. So let's go ahead and perform the installation of the Anaconda Individual Edition; it's a free version without any cost. Okay, I just opened the Anaconda website, went to the list of products, and selected the first one, which is the Individual Edition. This is the free and open-source distribution package. Clicking on Downloads provides all kinds of options based on the operating system. I have Windows, so I will download this file. By the way, Python has all kinds of versions; this is the latest version, Python 3.7. So as a first step, I will download this file. Let's run the installation. It's a simple wizard, and it will take a few minutes to install. After the installation, I will search for Anaconda Navigator and start the application. Under the home page of the Navigator we will be able to see all the tools that come out of the box as part of the installation. For example, this one, called Spyder, is a free integrated development environment for Python. It's probably much more useful for anyone that is looking for a tool to develop more complex applications in Python. You can use it for data science projects, but in my opinion it is overkill and more complex to learn.
A much better option for data science projects is called Jupyter Notebook. It's a web-based interactive environment that looks like a notebook; we'll see that later. It will help us write small pieces of code and play around with our data almost interactively. It's one of the most popular tools today for such data science projects. Now, Jupyter Notebook as a tool is starting to be replaced by a new tool called JupyterLab. It's the next-generation user interface for the classic Jupyter Notebook, so we will use the latest tool, meaning JupyterLab, which is also part of the Anaconda installation. Okay, that's it. We are ready to start using this tool, JupyterLab, and in the next lecture we'll have a quick overview of its web interface. 4. JupyterLab Overview: Let's launch the JupyterLab tool. It will open in my browser with this IP address. Localhost is the local IP address of my computer, and together with it is a TCP port number. It creates a web interface, with a web server running on my computer and the browser as its client. The main type of document that we will create and manage in JupyterLab is called a Jupyter notebook, and in the next lecture we will dive into the concept of Jupyter notebooks. Right now, I'm just reviewing the overall interface of the tool. There are three main areas in the interface: a menu bar, a left sidebar, and the main working area, in which we can open multiple files — multiple notebook documents and other types of files, as multiple tabs. The first option in the sidebar is for opening saved notebook files. A notebook file has a dedicated file extension, ending with the .ipynb extension name. Let's open one notebook as an example; it opens as a new tab. I can also open other types of files that are part of my project.
For example, I can open a JPEG picture or a CSV file — all kinds of files that are part of some project. It makes sense to place all the relevant files of a specific project in a dedicated folder and then open them as multiple tabs while working on the project. I can also start a new notebook from scratch by going into the File menu, clicking New, and selecting Notebook. I need to select the kernel — which in our case is Python 3 — that will handle the code in the notebook. A notebook is a long program that we decided to break into many small blocks, each block being a few lines of code. Then we need some engine that will take those lines of code and provide us the results. That's the job of the kernel, which in our case is Python; on the upper right side I can see the name of that kernel. By the way, JupyterLab can also be used for other programming languages like Java, MATLAB, R, and more, for which a dedicated kernel should be installed and then selected for use in a specific notebook. Another useful thing I would like to mention is the option to change the theme. Going into Settings, JupyterLab Theme, I have two options, Light and Dark. You can switch to Dark if you like to work that way; it's a nice option. I will jump back to the Light theme. Okay, let's move to the interesting part, which is the mindset of handling projects as notebooks. 5. Working with a Jupyter Notebook: One of the key advantages of the Jupyter notebook is the option to follow an interactive programming workflow. We can write a few lines of code, execute them as small blocks, and then explore the output; then we do it again and again until getting the needed results. We can mix in one notebook the Python code itself, presentations of data using all kinds of visualization widgets, and remarks as text, images, and video clips.
All of them can be embedded in the notebook, which makes it an organized way to record our project steps, including the input and the output. Then we can save a notebook as a file and share it with others; we'll see that later. This kind of interactive programming approach is well aligned with the typical data analysis workflow, which is about exploring the data step by step, breaking the problem into small pieces, and trying different options until getting the needed results. Just to set expectations, I'm going to use a couple of simple Python lines of code for the demonstration, without going into the syntax of the code. We're going to review the Python syntax in the next section; at this point I just want to provide you some high-level understanding of the concept of Jupyter notebooks. Let's go to the new notebook option. Good. A notebook consists of a linear sequence of cells. This is a cell; right now my new notebook is empty. A single cell is a multi-line text input field that we can type inside, and its content can be executed by using Shift+Enter (we'll see that in a minute) or by clicking on the Play button in the toolbar. The execution behavior of a particular cell is determined by the cell's type. There are three types of cells: code cells, Markdown cells, and raw cells. The default setting of a new cell is a code cell, but the type can be changed using the drop-down in the menu or using a keyboard shortcut. Inside the first cell, I will type some simple Python code. Good. I can click Run over here or just use the keys Shift and Enter, which asks the Python kernel to run those two lines in that cell. The cells are automatically numbered here on the left side. I can now write the next lines of code in the next cell.
The blue bar on a specific cell indicates that I'm in editing mode and can change the cell's content. I will type additional lines: print(a + b). Then a = 4, and again print(a + b). Great. The results returned from this simple calculation are printed in the notebook as the cell's output. You can see that this cell, number two, is printing the sum of two variables that I already declared in the previous cells. I don't need to place values in a and b again; the Python kernel remembers my previous lines of code that were executed in earlier cells. This concept helps to break a complex program into small steps. I can always click on any cell, enter editing mode (the blue bar), and change the code inside. I can move a complete cell to a new location if I would like, or delete a cell: right-click, Delete Cell, and it will be deleted. As a reminder, the Shift+Enter combination is used to run a cell. In addition to writing Python code inside a cell, I can write notes as rich text inside the notebook using the Markdown language. Markdown is a popular markup language; it provides a simple way to perform text markup, similar to HTML. Think of the notebook as your paper office notebook, in which you can write notes combined with graphs or whatever information. It's a useful way, when documenting and organizing our data science projects, to add such notes in a specific location inside the notebook. To do this I need to change the cell type. By clicking on this drop-down menu, we can see the three different cell types — Code, Markdown, and Raw — and the one that we're looking for is this one, Markdown.
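Written as plain Python, the cell-by-cell demonstration above looks like this (each comment marks what would be a separate notebook cell; the specific values are illustrative):

```python
# Cell 1: define two variables
a = 2
b = 3

# Cell 2: the kernel remembers a and b from Cell 1
print(a + b)   # 5

# Cell 3: reassigning a variable changes the result of later cells
a = 4
print(a + b)   # 7
```

This is exactly why restarting the kernel matters: the state built up by earlier cells (here, the values of a and b) lives in the kernel session, not in the notebook file itself.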
I need to select it. By the way, another option to switch between a Markdown cell and a code cell is using the keyboard shortcuts M and Y: I click on a cell and press M to make it a Markdown cell. Now I can write free text inside using the Markdown syntax. I will just paste some text — "The following step is going to analyze the provided data using clustering", for example — click Shift and Enter, and it will render that Markdown and convert it into the corresponding rich text. In that context, let me show you the option to create headlines by starting a line with the hash character. Using one hash character, let's call it "Step 1" and execute that. And another one with two hash characters, "Step 1.1" — these render as different headline levels, which is useful when we are adding titles and all kinds of remarks to our project, something we're going to use later. For saving my notebook, I go to File, Save Notebook As, and provide some name. I already saved it before, so that's the name, and this is the extension name. Click Save, and you can see the saved file over here. This is basically a dedicated file format that stores the notebook content in a single place. Let's close it for a second, go here, and try to reopen it. This is the file name; clicking on it, and here we have it: Project One. This is extremely useful because we can record our work while working on some data science project and then share that project as a notebook with someone else, or load a project example that we can download from some website — something we will do at a later stage. Another very useful feature is something called tab completion. While entering an expression when writing a line of code, I can press the Tab key one time.
Wait a moment, and it will search the namespace for any variables matching the text I have typed so far. Imagine that I have many lines in the notebook declaring many variables; then I can find those variables quite quickly. But not just that. Let's do an exercise: I will declare a numeric variable, then add a string variable, and another one, all with names starting with "my". Now, if I type just "my" anywhere in my notebook and then press the Tab key one time, I will get the three variables in my notebook with that prefix, and I can select one of them. Also, each variable type — something we will see in the next section — comes with many useful functions that we will use for all kinds of things, and the automatic tab completion will help us to find the needed functions. We'll dive into that in more detail in the next section, but in the context of tab completion it's going to be very useful. For example, I type the name of a string variable, and I can use a special character, the dot character, right after the name of the variable to access all kinds of methods related to it. Then I just press Tab, and it will open all the available functions that I can use for this type of object. Sometimes you don't remember the name of the function; this is a great way to access the list of available functions. Let's say that I selected some function and I'm not sure what that specific function is doing. I can type a question mark after the name and execute that, and I will get some information about the function: its name and a description of what it does. This is something that's called a docstring; we'll see that later. But it's a great way to see information about functions that we can use.
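Outside a notebook, you can reach the same information programmatically: `dir` lists an object's methods (roughly what Tab completion shows), and a method's docstring is the text Jupyter displays for the question-mark syntax. A small sketch, with an illustrative variable name:

```python
my_str = "idan"

# dir() lists the attributes/methods Tab completion would offer for a string
methods = [m for m in dir(my_str) if not m.startswith("_")]
print(methods[:5])        # the first few string methods, alphabetically

# The docstring is the text shown by Jupyter's `?` help syntax
print(str.title.__doc__)

# Calling one of those methods via the dot notation
print(my_str.title())     # "Idan"
```

The `?` suffix itself (e.g. `my_str.title?`) is IPython/Jupyter syntax, not plain Python, which is why the sketch reads the `__doc__` attribute instead.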
We can export our final notebook to a range of static file formats: just go to File, Export Notebook As, and we have all kinds of options. For example, I can export the notebook to HTML format, and then it can be used in a webpage, or export it as a PDF and send it by email to someone as a file. Another important thing to know is that the code we are writing inside the notebook is running in a separate process called the kernel. We can see it from here: you see the open kernel sessions, and for every notebook there is a dedicated kernel session. A kernel session basically holds all the things we have done so far in a notebook — all the lines that we ran. Let's say there were 100 lines of code divided into 10 cells, and we already executed those 10 cells; the kernel will remember all the hundred lines that we executed, remembering the state. And you have the option to perform a restart of the kernel to start from scratch; then you need to re-execute the relevant lines. That's enough at this point to jump-start your knowledge of Jupyter notebooks as used inside the JupyterLab interface. Our next step is to learn the fundamental Python syntax, so you'll be able to apply this knowledge in Jupyter notebooks. Thanks for watching so far. See you in the next section. 6. Python Fundamentals for Data Science - Overview: Hi and welcome. In the previous section we covered the needed steps to install and run our interactive development environment, meaning the JupyterLab tool. Now we're ready to start using this tool while covering the fundamentals of the Python language. Just to set expectations, I'm not planning to cover everything about Python as a programming language; that's not my objective.
My plan is to focus on topics that are needed to develop data science projects — core topics such as the Python syntax, data structures, built-in functions, and libraries that are used to perform data analysis and data exploration. We may also learn new things in Python in future levels, according to the projects we are going to explore. So why Python? Python is one of the most useful languages for data science and machine learning projects. It has an extensive list of libraries that enables us to perform complex operations on data. If you're not familiar with the term "library": a library is a package of code that someone else developed and released so other people can use it in their applications. It simplifies the development of new programs. Another advantage of Python is that it has a low entry barrier for beginners. It is a fairly easy language to learn and use, even for people without any programming background. The third advantage is the size of the active community. There are many people worldwide using Python for data science projects. If you perform a web search about Python, you will see that it has a massive community; for any question there is probably an answer online, and we will be able to find a substantial amount of useful information that will help us develop our projects. And the last thing is that Python is the primary job requirement for anyone looking to become a data scientist, machine learning engineer, AI or machine learning business consultant, ML product manager, et cetera. It's a very popular machine learning development framework in general. Okay, if I convinced you that it makes sense to learn Python, then let's start to review the necessary syntax step by step. So go and get some coffee or tea, and let's begin to work. 7. Variables and Data Types: Our first topic in Python is about variables and data types.
A variable is a container for storing data values, like a number, a line of text, a record from a table, and so on. Applications usually load data from files or store data in files, but the data manipulation itself is done using variables stored in memory during runtime. How much memory will be allocated as a container for a specific variable, and what kind of operations can be performed on it, is based on the variable's data type. Each variable must be related to a specific data type. Python has multiple data types, divided into the following main categories. Please note that the column on the right side represents the name of the data type in Python. For example, for handling text it has a string data type called str. For numbers we have int for integers, or float for supporting floating-point numbers. A Boolean data type handles the values True and False. A sequence is an ordered collection of similar or different data types; it is useful for handling a list of values. And there are more data types, as you can see in the table. During this section we will cover the most relevant data types for our data science projects. Using Python, we don't need to declare the data type for a variable; it is created the moment we first assign a value to that variable. It was a little bit strange for me, with my background in C++ and Java, but I got used to it very quickly. Moving to the practical side: for example, I type a = 10. Python will create an integer variable called a and place this value, 10, inside. I can check the type of any variable that Python created using the type function, something that is very useful. This is an integer variable. Let's create another variable, b = 10.5. Now if I check it, this is a float data type, because it needs to handle the floating point. I can convert one data type to another; in the context of numbers, x = float(a).
Even though a is an integer, this creates x as a float variable, and I can go the other way around: y = int(b), and y will be an integer data type. Another very useful data type is the string, so let's create an example. Good. This is a string. A string variable can be declared by using single quotes or double quotes. Very useful; we will use them a lot. Another data type is the list. I created a simple list with three items: red, green, and blue. This is a list. They are very useful, and we'll talk about lists at a later stage. The last simple variable type that I would like to show you is the Boolean, so I can assign True or False. To print any variable's value or some text, we can use the Python built-in print function while passing objects as parameters. I can print a number, print a, or print some text: "My name is Idan". We can assign values to multiple variables in one line, like c1, c2, c3 = "red", "green", "blue", then print(c1, c2, c3). Or assign the same value to multiple variables in one line: num1 = num2 = num3 = 10. There are all kinds of options. Now, there are a few things to remember about variable names. First of all, a variable name cannot start with a number: I cannot write 4a = 5; I will get a syntax error. It must start with a letter or the underscore character, and can only contain alphanumeric characters and underscores. I cannot write a variable name containing the dollar sign; I will again get a syntax error. Also, variable names are case sensitive, so my_name = 5 is not the same as My_name = 6; those are two different variables. And the last topic for this lecture is comments. In Python, the hash mark indicates a comment; Python will ignore anything after the hash mark, so I can write a comment line.
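The examples from this lecture, gathered into one runnable sketch (the variable names follow the ones used in the lecture; the printed values are what Python actually produces):

```python
# Dynamic typing: the type is set when a value is first assigned
a = 10
print(type(a))        # <class 'int'>

b = 10.5
print(type(b))        # <class 'float'>

# Converting between numeric types
x = float(a)          # 10 -> 10.0
y = int(b)            # 10.5 -> 10 (the fractional part is truncated)

# Strings, Booleans, and a list
greeting = "Hello"    # single or double quotes both work
flag = True
colors = ["red", "green", "blue"]

# Multiple assignment in one line
c1, c2, c3 = "red", "green", "blue"
num1 = num2 = num3 = 10

# Names are case sensitive: my_name and My_name are different variables
my_name = 5
My_name = 6

print(a, x, y, flag)  # this trailing comment is ignored by Python
```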
For example, "# this is a comment": execute it, and it does nothing. Or just after some line: print(a)  # this line is doing x, y, z. Comments are useful for two main things. One is to explain code. It is essential to develop the habit of adding comments to our code; they will be used by us to remember what we did a few months ago, or by others reviewing our code. Secondly, sometimes you would like to prevent Python from executing a piece of code without removing it from the program — maybe it's temporary and you would like to check something — so it's useful to use comments to exclude a piece of code from your program. 8. Strings: Our next topic is the string data structure. A string is a series of characters in a specific order. In Python, anything inside quotes is considered a string data type, and we can use single or double quotes around the text. For example, I can type str1 = "Hello" with double quotes, which is the same as str2 = 'Hello' with single quotes. To get the size of a string, I can use the len function: len with the name of the string inside — five characters. There are many operations that we can perform on strings by using all kinds of built-in methods in Python. For example, the upper method returns the string in uppercase, so I can use the following syntax: print(str1.upper()), and I will get HELLO in uppercase. Or use the lower method with the same syntax; let's copy, paste, and use lower. Let's explain those two lines. Any variable in Python is an object of a specific type: str1 is a string variable, and str2 is also a string variable. To be able to run some piece of code related to the specific object type, I'm using a special dot notation. This dot, placed after the variable name, is telling Python to
perform the method on that specific object, and this is a key thing to remember: this dot syntax is going to be a very useful way to access methods. To check if a certain text or single character is present in a string, we can use the keywords in or not in. Let's see some examples: str1 = "Hey, my name is Idan", and then I will create some Boolean flag. Let's call it flag1 and write something like flag1 = "Idan" in str1. This line returns a Boolean value, True or False, into the flag. Let's print the value: flag1 is True; my name is in that particular string. Let's do another check — we can do it directly, without a flag — and write something like print("David" in str1). Of course "David" is not in str1, so we get False. Let's move on and talk about splitting strings. The split method splits a string into substrings wherever it finds an instance of some specific separator. Let's do an example: create the variable a with some text inside, "My name is Idan Gabrieli", and now create a new variable called list1, access the split method of a, and use a space separator. In this case, the separator is a white space. I will get the five words in a list: print(list1) — all of the words, "My", "name", "is", "Idan", "Gabrieli". There are many more methods for a string object. If you need to manipulate a string somehow, then check the list of available methods; most probably you will find the required method instead of trying to develop it from scratch. As a reminder, you can type a dot and then use tab completion — press Tab one time — and you will get the list of available methods for that particular object type, which in this case is a string. The last thing I would like to talk about is formatting a string, for when we would like to print a line that includes text as well as variables.
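The string operations walked through above, as one runnable sketch (the sentences are the lecture's own examples):

```python
str1 = "Hello"
print(len(str1))        # 5
print(str1.upper())     # HELLO
print(str1.lower())     # hello

# Membership tests with `in` and `not in`
s = "Hey, my name is Idan"
flag1 = "Idan" in s
print(flag1)            # True
print("David" in s)     # False

# split() breaks a string into a list on a separator (a space here)
a = "My name is Idan Gabrieli"
list1 = a.split(" ")
print(list1)            # ['My', 'name', 'is', 'Idan', 'Gabrieli']
```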
There are several options in Python to perform this action, but the most recommended option is using the f-string formatting. The syntax is simple: we'll use the f character at the beginning, and then curly braces containing expressions that will be replaced by the values. For example, I will define two variables, name and person_id, and let's say I would like to create a new string variable that will combine some text and also those two variables. So I can write something like this: str_a1 equal to, starting with the special character f, and then the string combined with those two variables: "Hi {name}. Your ID number is {person_id}". Let's print str_a1: Hi Idan. Your ID number is 12345. I'm using the f character at the beginning, and each variable is placed in curly braces, like the name and person_id variables. I can, of course, skip the string variable definition and use the f-string formatting directly inside the print function, something like that, and get the same result. I can also combine more expressions and function calls while using the f-string. Let's see an example: print an f-string whose braces contain an expression, and the result of the expression is printed, or I can even call a method inside the braces. Very simple and very useful. 9. Lists: The next useful data structure type is the list. A list data structure is used to store an ordered collection of items, items like characters, numbers, strings or any other available object type in Python. To define a list, we will use square brackets, and individual elements will be separated by a comma. For example, I will define a list called colors with four values; each value is a string. If I would like to access a specific element in that list, then I need to provide the index location of that element. For example, print colors at location number one, and I will get yellow. Wait: the first element in the list is the red one, at index zero, not one.
Okay, so if I would like to access the first one, I need to provide index number zero, and I will get red. It is recommended to use a plural name for a list variable, okay, like colors, because such kinds of variables will contain more than one element, and you don't want to get confused with a single-value variable like color. In some cases, we may want to make a list of numbers without typing anything manually. For that, we'll use the range function to create the needed numbers, and the list function to convert the result into a list variable. So it's going to be the following: let's say that I would like to create a list called numbers; then I'm going to use the function called list, with the range function inside, and then print it. Here we go. I'm using the list function to convert the result from the range function into a list. Another useful thing to do is to sort a list. So I will take my previously defined list called colors, with those four items, and let's say I would like to sort it in alphabetical order. I have two options. One option is to use the sorted function; I get a sorted result, but it is not changing the original colors list. Okay, so if I print colors, that's the original one. Another option is to change it permanently, using the sort method: colors.sort. It changed that specific list. For adding elements to the end of the list, we can use the append method. So let's use the same colors list with the four colors, and I will add a color at the end. Great, it was added as the last item in that list. It's useful to build a list dynamically, so we can declare an empty list and then start adding elements to that list. Let's do that for a second: list1. Now, this is an empty list, and then I can start to add items to that list, and so on. Now, what about removing elements from a list? Okay, for that, we have two options. It could be by position, if we know the position of the item we would like to remove.
So I'm looking at the colors list: red, yellow, green, black and white. Let's say I would like to remove yellow; it's at location number one, so I can write del colors at index one. Let's print it: it removed the yellow. Another option is to remove by the value of the item, so it's going to be colors, using the method called remove, with some value. Let's say I would like to remove the red color. Let's print it again. Okay, it removed the color. As with other objects, we can use the len function to check the size of the list, so print len of colors: three items. If I would like to concatenate lists, we can use the plus operator. Let's declare our list1 and another one, list2. I can do something like: new_list equals list1 plus list2: one, two, three, four, five. When we want to select more than one element in a list, like a specific group of items, a slice of that list, we can use the notation of slicing. Okay, like a pizza slice. You will see it everywhere, so let's make sure we understand this option. The generic syntax of making a slice from a list is the following: that's the name of the slice, this is the original list, and those are three parameters: start, stop, step. Okay? To make a slice, we specify the index of the first and last items; the step is optional. Python will stop one item before the second index we specify, which is the stop index. The step parameter is optional, and the default step is one. As a reminder, don't forget that the first element in a list is located at index zero, okay, not one. Let's use the colors list and make a slice from that list. My first slice is from index zero until three; we get red, yellow and green. Okay, because it stops one item before the stop index. Let's use another example with colors. In that scenario, I didn't provide the start index, so the default value will be zero. And another option that you may see looks like this: in that case, I didn't provide the stop value, so it will run from the start until the end of the list. This is useful because sometimes you don't know how many items you have in the list, and you would like to start from a specific location until the end of the list, regardless of the length of that list. To copy a list, we can make a slice that includes the entire original list by omitting the first index and the second index, so it will look something like this: new_colors is my new list, and to make a copy from colors, I will do it like that. Now new_colors is a copy of the colors list. On the other hand, please note that the following line is not copying a list to a new list: if I write something like new_colors2 equals colors, in that scenario it will create another variable called new_colors2 that is pointing to the same list, the same list that colors is pointing at. So if I now change something in new_colors2, it will change something in colors. Okay, we'll talk about that later, because this is an important thing to discuss when delivering a huge amount of information between functions. I would like to mention another data type called tuples, which is less used compared to the list data type, but you may see it. A tuple is a list that can't be changed after it was created for the first time; tuples are unchangeable. It looks like a list, except we use round brackets instead of square brackets when we define the new variable. So let's see an example: my_tuple equal to one, two, three, four, and print it. We can access a tuple item or a group of items by providing the index numbers, same as we saw before for the list. Nothing special here. But if I try to change a tuple variable, for example assigning a new value at some location, we get an error, because this is something that I cannot change. Okay, that's enough about the list object type.
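Before moving on, the list operations from this section can be collected into a short sketch. The variable names and values here are my own; they follow the lesson's examples but are not copied from the screen.

```python
# Define a list and access elements by zero-based index
colors = ["red", "yellow", "green", "black"]
print(colors[0])        # red: the first element is at index zero

# sorted() returns a new sorted list; sort() changes the list in place
print(sorted(colors))   # an alphabetical copy; colors itself is unchanged
colors.sort()           # now colors itself is sorted permanently

# append() adds an item at the end of the list
colors.append("white")

# Slicing: the start index is included, the stop index is excluded
print(colors[0:3])      # the first three items
print(colors[2:])       # from index 2 until the end of the list

# Copying vs. referencing
copy_of_colors = colors[:]   # a real copy, made with a full slice
alias = colors               # just another name for the same list object
alias[0] = "pink"
print(colors[0])             # pink: alias points to the same list
print(copy_of_colors[0])     # unchanged, because it is a separate copy

# A tuple looks like a list but cannot be changed after creation
my_tuple = (1, 2, 3, 4)
print(my_tuple[1])           # 2
```

Trying `my_tuple[1] = 99` after this would raise a TypeError, which is exactly the "unchangeable" behavior described above.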
Let's move on and talk about the if and for loop statements in our next lecture. 10. IF and For Loop Statements: Moving to our next topic, which is about creating conditions using the if statement. The if statement is one of the most common options to control the flow in a program. The syntax is simple, looking something like this: we are using the if keyword and then some conditional test that is evaluated. If the conditional test is true, then the block of code below will be executed. Let's see a simple example: a equal to 10; if a is greater than five, then it will execute those two lines. Now, a few important things to learn here in this simple example. After the conditional test, there is a special character called a colon. This colon marks the starting point of an indented block of code. Okay? The two lines below the statement are part of the same block to be executed. We're not using braces like in Java to open and close a block of code; in Python, we use white spaces to structure the code, so those two lines should be indented a little bit deeper after the statement. Okay? The block of code must be indented by the same number of white spaces until the end of the block, so I cannot write something like that; okay, I will get an error. It should be the same amount of spaces. Now, how many spaces are needed? At least one, but it's recommended to use four white spaces to make it nicer. Just to make it clear for anyone completely new to programming, the logical conditions that can be used inside the test condition are coming from mathematics, so we have all those options: equal to, not equal to, less than, less than or equal to, greater than, and greater than or equal to. Sometimes it is useful to test if a specific value exists in a list. We'll do it using the keywords in or not in, so I can write: if "green" in colors, then do something, or the other way around: if "green" not in colors, do something. Okay, so you have these two options. You can also build multiple conditions. For example, I can write: if a is greater than 10 and a is less than or equal to 20, and after the colon, do something. Another variation of the if statement is the if-else statement, very straightforward. An if-else block is similar to a simple if statement, but the block below the else will be executed if the test condition failed. And I can do that also with multiple conditions that are evaluated one right after another. For example, over here: if a is greater than 10, print yes; elif a is greater than seven, print no; elif a is greater than three, print maybe. That's about the if statement. As you will see later, in some cases we'll need to run through all items in a list and perform the same task on each item inside the list. For this job, we'll use the for loop statement. Assuming we have four colors in a list called colors, as we saw before, I could scan that list using the following for loop: for color in colors, print color. Unlike other languages, the for loop in Python does not require an indexing variable to be set and adjusted while accessing the list. Okay, if you're familiar with Java and C++, you need some index; here, all we need to do is use a temporary variable, like this one, color, and combine it with the in operator, this way: color in colors. As we said, it is recommended to use plural names for list variables, like colors, and then I can use a singular name for the variable that is changing in each iteration, like color. As we saw in the if statement, also here the colon at the end of the for loop is being used to tell Python that the next line will start the block of code that will run in the loop. We can, of course, use multiple lines under the for loop as the same block, by using indented lines.
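Put together, the if tests and the basic for loop just described might look like this; the specific values and list contents are my own examples, not the on-screen code.

```python
colors = ["red", "yellow", "green", "black"]
a = 15

# A multiple condition joined with "and"; the colon starts the indented block
if a > 10 and a <= 20:
    print("a is between 11 and 20")

# if / elif / else: the first condition that is true wins
if a > 20:
    print("big")
elif a > 10:
    print("medium")
else:
    print("small")

# Membership tests with in / not in
if "green" in colors:
    print("green is in the list")
if "pink" not in colors:
    print("pink is missing")

# A for loop scans the list without any index variable
for color in colors:
    print(color)
```

Running this prints the "medium" branch, since 15 is greater than 10 but not greater than 20, and then each color on its own line.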
What about the common situation where we need to loop through a set of code a specific number of times, like repeating something 10 times? In that case, we can use the range function that we already saw before, and I can write something like: for a in range of 10, print a. The output of the range function is the values between zero and x minus one, where x is the input parameter. It will count 10 times, but the first value for a will be zero. Another option is to specify the starting value by adding another parameter, like writing range of 4 until 10, which means values from 4 to 10, but not including 10. And the last option is to change the increment size, which is one by default, by adding a third parameter. So I can write something like: from 1 to 10 with step two, print a. Okay: 1, 3, 5, 7, 9. The last thing I would like to mention about loops is nested loops. A nested loop is a loop inside a loop. Okay, the syntax is very simple; I can write something like: for a in range of two, and then write another for loop, print a and b, and print a new line. That's about the for loops and if statements. We can move on to the next topic, which is about functions. 11. Functions: Sometimes, when writing a little bit more complex programs, we may need to perform the same task multiple times in different locations. One option is to copy-paste the same block of code into any needed location in our program, which is easy to perform but a much less effective method. A much better option is to create functions. Okay, with a function, we can encapsulate a block of code under a function name and then call this function multiple times in different locations. It is also making a program easier to read and easier to test. In addition to the option to create functions by ourselves, Python is coming with many built-in functions, so it is also essential to understand how to use functions, pass arguments and get values from functions. Let's start with the syntax being used to define a function.
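Since the on-screen code isn't reproduced in this transcript, here is my own reconstruction of a definition along the lines discussed below; the body and the ID lookup are placeholders, not the lesson's exact code.

```python
def display_name(person_id):
    """This function is presenting a name based on an ID number."""
    # A hypothetical lookup; the lesson's real mapping is not shown on screen
    if person_id == 123:
        print("Welcome, David")
    else:
        print("Welcome, guest")

display_name(123)   # prints a welcome line for the matching ID
```

Typing `display_name?` in a Jupyter cell would show the docstring from the triple quotes, which is the behavior described next.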
Okay, the first line uses the keyword def, which is for defining the function. This definition is telling Python the name of the function, which is called display_name, and, if needed, a list of parameters that are provided to the function. For example, here I'm providing only one parameter. As always in Python, any indented lines that follow the first line make up the body of the function, so this is the body of the function. The second line is a special comment called a docstring. A docstring is used to describe what the function does, and it is enclosed in triple quotes, as you see here, those triple quotes. Now, it is used by Python when it generates the documentation for the function in our program. If I type the function name with a question mark at the end, for example display_name and a question mark, I will get this comment, okay, this docstring: this function is presenting a name based on an ID number. Python will display the docstring information, helping to figure out what the function is doing without revealing the actual code. The next thing I would like to mention about functions is the scope of variables. There are two main types of variables: local variables and global variables. Variables that are created outside of any function are known as global variables, like this one, and they can be used everywhere. On the other hand, a variable created inside a function belongs to the scope of that function and can only be used inside that function. So this is a local variable inside this function that is called my_fun. Let's do a small exercise. First of all, I can use the global variable everywhere, so I can print that global variable; and if I try to access the local variable, I will get an error, of course. To call a function, we will write the name of the function, followed by any necessary information that we need to provide to that function. A function call tells Python to execute the code inside that function. Here, the person_id over here is the only parameter that should be provided when calling this function. So if I write display_name and provide some value, okay, I get "Welcome, Dan". Let's change that to another argument, and I will get "Welcome, David". It is also possible and useful to return a value from inside the function to the line that called the function. For that, we're using the return statement. So let's define another function called get_name; as a parameter, it will get the person_id, check that value, and return the name of that person. Let's call it for a second with some ID number, and I will get David. In some cases, we will need to pass a list to a function, and that's a little bit tricky, because when we pass a list as an argument to a function, the function gets direct access to the list. Let's see a simple example. This is a global variable called colors, which is a list, and then I define some function called get_color. This function is getting a list as a parameter, and also a location as a parameter, and then will return the value at that location in the list. Okay, very simple. So I can basically write get_color, and then colors with a location, and I will get green. As part of the function call, I passed colors as a reference to the local list parameter in that function. Now, what is happening in the background is that those two variables, one of them a global variable and the other one a local variable, are pointing to the same list. It means that if I perform any change to the list inside the function, it will change the original list. Let's add the following line to the function: okay, colors at location number zero equals "white". Then I will call the function again and try to access location number zero. Oh, I got a different color, which is white. The first element changed from red to white.
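The behavior just demonstrated can be reproduced in a few lines; the function body here is my reconstruction of what the lesson builds, with the mutating line added inside it.

```python
colors = ["red", "yellow", "green"]

def get_color(some_list, location):
    # The parameter points to the very same list object that was passed in,
    # so this assignment changes the caller's original list as well
    some_list[0] = "white"
    return some_list[location]

print(get_color(colors, 1))  # yellow
print(colors[0])             # white: the original list was changed
```

Calling `get_color(colors[:], 1)` instead would pass a sliced copy and leave the original untouched, which is the workaround discussed next.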
Now, this kind of behavior is handy in some cases, because it allows us to work more efficiently when we are working with a large amount of data stored in a list. Assuming we have a list with 10 million numbers, for example, when passing the list to a function, Python does not need to spend any time or memory to copy it into a new temporary list. However, in some cases we don't want the function to be able to change the list. The way to solve that is by sending a copy of the original list, using the slice notation when calling the function. So it will be something like this: I can write get_color and just add this kind of notation. Okay, that's it. The slice notation makes a copy of the list to send to the function. This method consumes more computing resources, okay, because Python needs to create the duplication. So keep in mind that the first option is more efficient and therefore should be the selected option, unless you have a specific reason to pass a copy of a list. 12. Dictionaries: The next data structure type provided by Python that we're going to learn is called a dictionary. So what is a dictionary? A nice analogy is a paper dictionary. If we open a dictionary book to find the meaning of a word, then we find the location in that book by using the word we're looking for as a key; all words are ordered alphabetically as an index inside that book. A Python dictionary works in a similar way to a paper dictionary. A dictionary in Python is a collection of key-value pairs that are stored in key-value format, where a key is associated with a value or a list of values, and we can use a key to access the value associated with that specific key. A value can be any Python object, such as a number, a string, a list or even another dictionary. This is making the dictionary object a very flexible data type for storing more complex data structures. It's useful when we would like to connect pieces of related information.
Any stored dictionary item can be retrieved by its respective key, instead of using an index number as we used in the list data structure. Let's see some examples to get the concept. To create an empty dictionary, we can use the curly brackets. Okay, that's my dictionary, with those curly brackets, and a new dictionary object will be created by Python. I can check and print the type of the new variable using the type function: this is a dictionary. Let's define another dictionary, but this time with content, which must follow the key-value pairs format. Let's say that I would like to manage in one place the profile information of a person. As you can see in this declaration, each key-value pair is represented using the key-value definition, with this colon in the middle, which is used to connect between the key and a specific value. "id", okay, is the key, and 123 is the number value for that key; and also "full name" is the key, and David is the value for that key. Individual key-value pairs are separated by a comma. We can store as many key-value pairs as we want in a dictionary. I can print the complete dictionary using the print function. Here we go. If we would like to get a value associated with a specific key, we will write the name of the dictionary and then place the key inside a set of square brackets. This is the way to access a dictionary using keys. So that's an example of accessing a specific key, which is the name, and I will get the value of the key, which is David. Let's define another dictionary that will represent a list of students with their grades. In a dictionary, the key must always be unique. Here, the key is the name of a student; if we're going to have two students with the same name, then it can't be used as a key. On the other hand, the values can be repeated in multiple pairs. Here, Mary and John are having the same grade, which is 91. A dictionary object is coming with a few very useful methods. Like other objects in Python, we will access those methods by providing the dictionary object name and then using the dot notation and then the method name. Let's review some of these methods in the following examples, starting with adding a new key-value pair. A dictionary is a dynamic structure, and we can add a new key-value pair to a dictionary at any time. The syntax will be: providing the dictionary name with a new key, and then the actual new value. This is the generic syntax. So let's do something with our grades: to grades, I will add another student and the new grade. Here we go, a new key-value pair. What about removing a key-value pair? If we would like to remove a key-value pair from a dictionary, we can use the del statement, so I will write del and then a specific key. Another option is to use the pop method, which will remove an element, the item with the specific key and also the attached value. So let's add Marta again; now I will remove it using the pop function, so it will be something like that: print. Okay, this is the second option. What about modifying values in a dictionary? To modify a value in a dictionary, we'll use the following syntax. So let's say I would like to change Marta's grade: it's similar to adding a new element, but just providing the new value for an existing key. Okay, Marta just got a new grade. Assuming we store a huge number of key-value pairs in a dictionary, it can be, you know, thousands or even millions of key-value pairs, and now we would like to scan that dictionary. We can loop through all dictionary key-value pairs using the for statement. Let's perform that action on our grades dictionary. I'm using a dictionary method called items, which returns a list of all key-value pairs. The for loop then assigns each of these pairs to the two temporary variables, key and value, so in each for loop iteration, those two variables will get the next key-value pair.
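A short sketch collecting the dictionary operations described above; the student names and grade values are my own placeholders following the lesson's examples.

```python
grades = {"David": 88, "Mary": 91, "John": 91}

# Add a new key-value pair at any time
grades["Marta"] = 75

# Modify the value of an existing key with the same syntax
grades["Marta"] = 80

# Remove a pair with del, or with pop (which also returns the value)
del grades["John"]
removed = grades.pop("Marta")
print(removed)  # 80

# Loop over all key-value pairs with the items method
for name, grade in grades.items():
    print(f"{name}: {grade}")
```

Each iteration of the loop unpacks one key-value pair into the two temporary variables, exactly as described for the items method.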
Sometimes you would like to scan just the keys, or scan only the values, in a dictionary. So another option to loop over a dictionary is using the keys method. The keys method returns a list of all keys in a dictionary. For example, I can check if a specific key exists in a dictionary; in our case, it is checking if a student exists in the grades dictionary. We'll run it: David has a grade. So I used the in operator to evaluate if the object on the left side is included in the object on the right side. I can also use the not in operator. Another method, called values, is used to get just the values. So another use case in our example is to get all grades and check which one is the highest grade, by sorting the values. So that's the example; let's see what we have here. First of all, I'm using the values method to get, from the grades dictionary, all the values in that dictionary. Then I'm asking the sorted function to sort them in reverse order, from high to low, and then I access the first item, which is going to be the highest grade. And then I'm printing the result. Okay, that's it. The profile dictionary we defined is holding the profile of one single person. Let's create two profiles, okay, for two different people. What if we would like to manage a list of profiles, which is a list of dictionaries of the same type? If you're familiar a little bit with relational databases, it's like storing multiple records; each record in our case is a profile of one person. For this common use case, we will usually create a list variable, and every item in that list will be a dictionary object. So it's going to be: profiles, as a list, equal to profile1, profile2. Okay. Using this concept, we can manage a dynamic list of profiles and loop through this list in the same way, as it is the same dictionary object type inside that list. To access a specific profile, we can use the index number of that profile in the list, so I can use print profiles at location zero, and I'm just presenting a specific dictionary. To access a specific key-value pair inside a specific profile, I will use print profiles at location zero and a specific key, and I'm getting David. Another use case of the dictionary object is to use lists inside a dictionary, meaning the value, as part of the key-value pair, is going to be a list. Okay, the value in a key-value pair can be a single value or a complete list. Let's see a simple example: d1 is a dictionary, "x" is a key, and those values, 1, 2, 3, 4, 5, are a list, which is the value associated with that specific key. That's about dictionaries. It's a very useful data type; it may seem a little bit complex, but as soon as you play with it, it will be simple to use. 13. Classes: We covered the main data types in Python, like integers, strings, lists and dictionaries. Now I would like to touch on a fundamental concept in Python called object-oriented programming. I'm saying "touching" because it's going to be a high-level review, without going into complex topics about object-oriented programming. Now, Python is an object-oriented language. Every variable we create in Python is an object of a specific object type: an integer variable is an object with an integer object type; a string variable is an object with the string object type. Each object is used as a container to store data. Each object has a group of functions that are associated only with that object; those associated functions are called methods, and they are used to access the data inside that object. As a simple example, and this is something we already used before: when creating a string variable, we can access methods related to the string object type. Okay, here str1 is an object with a string object type, and the upper function is a method associated only with string objects, and we can access it using the dot notation.
Let's take another object type, like a list. Okay, for example colors, and this is the thing that we already saw before, just organizing it: colors is an object of the list object type, and sort is a method associated with list objects. Colors is an instance of the list object type, and then I'm printing the result of that method. This kind of association between an object and its methods is the core concept in Python, because Python is an object-oriented programming language; everything is around objects. Now, how does Python know which group of methods is associated with a specific object type? This is being done by using the concept of classes. Okay, a class is like a blueprint to create a particular type of object, like a template. We can create many instances from the same template. The definition of a class includes a group of attributes to store values and a group of methods to access those values. From every class definition, we can create an endless number of objects, also called instances. The Python language is coming with many built-in classes for creating integer objects, string objects, list objects and more. In addition, we can define a completely new class as a template and then create objects from that new class. Let's move to the practical side. To define a class, we'll use the class keyword followed by the class name: class, and then the name of our class, which is Car. It is a best practice to start with a capital letter for the class name. Now, what do we see here? First of all, we have a special function which is called init, with those two underscores on each side. It is called automatically every time the class is being used to create a new object; we'll use this function to assign values to the object attributes. In our case, the Car class includes two attributes: the color and the model of the car. Now, this self parameter is the reference to the current new instance created from the class template, and it is used to access variables that belong to the new object. By the way, it doesn't have to be named self; we can call it whatever we would like, but it has to be the first parameter of any function in the class that would like to access the instance variables. When creating a new object from the Car class, we need to deliver two arguments into this initialization method. So let's create a new car, car1, this time using the Car class, and I'm providing those parameters: white is the color of my new car, and the manufacturer is Mazda. That's it. I can create another one, car2. Print car1's color. Okay. As you can see, we don't provide a value for the self parameter, okay, this one, which is the first parameter in the function; Python is doing that automatically. In addition to the init method, we can add more methods for performing all kinds of tasks. Let's define the Car class again. So what do we have here? Color, model and age: three attributes of the class that I can access on a particular instance, a particular new object of the class. And I also added two methods: change_color and increase_age. Now let's create another object from this new class; okay, let's call it car3, green and also from Mazda. To access a method, we'll use the dot notation, as we saw before, so I can use car3 dot change_color to red, and let's print the new color: car3's color, okay, it changed to red. And if I access the increase_age method and then print the age, okay, I get two, because the initial value is one. Great. That's enough about classes, objects and attributes. You now know how to declare classes and then create objects from those classes. Practically speaking, in a typical data science project, you will not create classes, but you will use existing classes to create all kinds of instances, objects from those already defined classes. 14. Importing Modules: The Python programming language includes all kinds of useful built-in functions.
For example, we use the print functions toe display messages on the screen. Still, to make the best usage of Fight on, we should use a leverage function skated by the amazing fight on community those additional off the shelf functions that we can use all encapsulated under the concept off models In Piketon, a model is a file with the dot PH extension that consists off a group of functions, classes and valuables. In many cases, a group off models is back together into something that is called a library. There are several models that built into the pipes on Standard Library, which is coming. It's part of the fight on installation. We can create models by ourselves, but the more common scenario in our future data science project will be to use existing models. There are a variety off motives for performing different dusk and this part off the anaconda installation. You have a long list off models that are ready to be used now to be able to use the model in our program. We first need toe import that model into our program using the import statement A Z you will see later on Almost any fight on poor come will probably include a few lines for importing ago Poff models, the genetic sick talks to import the model repair, using the import, a statement and then the mortal name. And if I would like toa access a function related to that model after I have done the import, I can use the name of the model, model and name and then using the dot notation and then the function name. That's it. This is the generic syntax. Let's import some real model import a condom and then for a five print random don't on in t J between 0 to 100. Okay, it will look for five numbers and each time it will generate some random number between 0 to 100. And I'm actually using the model random and function related to the thunder model. When importing a model into our program. 
When importing a module into our program, we have the option to provide a different name, a new alias, using the keyword as. You will see that in other programs: import random as rd, and then I'm accessing any function and object in that module using the new name, rd, with the dot notation: rd.randint between 0 and 100. That's it. Now keep in mind that every time I need to access a function related to the random module, I need to use the dot notation with the name of the module before accessing that function. If you would like to avoid that, you can import the module into the program's namespace by using the from keyword combined with the import keyword, so I can write something like: from random import randint. Now I can access randint directly, like a function in my program. If I would like, I can also import all functions from the random module: from random import *, with the asterisk. In that scenario, I can access all functions that exist in random without using the random module name. Okay, those are the syntax options you will encounter to import modules or specific functions from specific modules. To display a list of all available modules, we can use the following command in a Python console: help('modules'), and it will take a moment until we get a full list of the modules that are available to be used. Some of them are built-in modules. So let's review some of the frequently used built-in modules. The first step is to import those modules and then use all kinds of functions in those modules. For example, in the os module I have a function called listdir, where I can get the list of files and directories in a specific directory. If I'm not providing anything, it will return the list of folders and files in the working directory. Using the sys module, I have an attribute called version. Okay, 3.7.6. Using the math module, I have all kinds of attributes like the pi value.
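The import styles described above, plus a couple of the built-in modules mentioned, might look like this sketch (the randint bounds follow the lecture's 0 to 100 example):

```python
# style 1: plain import - names are reached through the module
import random
print(random.randint(0, 100))

# style 2: import with an alias
import random as rd
print(rd.randint(0, 100))

# style 3: import a specific name into the program's namespace
from random import randint
print(randint(0, 100))
# "from random import *" also works, but it pollutes the
# namespace and is generally discouraged

# a few built-in modules mentioned in the lecture
import os, sys, math
print(os.listdir())   # files and folders in the working directory
print(sys.version)    # the Python version string
print(math.pi)        # 3.141592653589793
```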
I have all kinds of functions, like math.pow for powers, or the other way around, sqrt for the square root. Another useful module is statistics, for calculating all kinds of statistical values. So let's create a list variable and then apply that module. I have a shortcut, st, and, for example, the mean of the list, statistics.mean(st), the median, the standard deviation and more. We already saw random, with all kinds of useful functions for generating random numbers. Okay, those are some of the frequently used built-in modules in Python that you will probably use in different places. 15. Libraries for Data Science Projects: So far, we covered the most important Python syntax that will be needed to kick off our future data science projects. In addition to the built-in capabilities of the Python language, there are many additional Python libraries that can be used as frameworks to extend and add more capabilities to Python. Those libraries help simplify the overall programming while building applications, and this is also true for data science projects, because we don't want to reinvent the wheel. We would like to use whatever method or piece of code will help us to keep our focus on the main task, and libraries are a critical component to help us achieve that goal, meaning get things done as fast as possible. In this lecture, I would like to provide a brief overview of some of the most popular libraries for data science projects. The NumPy library, which stands for Numerical Python, is used for high-performance scientific and numeric computing in Python. Mathematical algorithms require containers to store and handle numbers, usually a huge amount of numbers, and the key challenge is processing time. NumPy is a solution to that challenge. It adds support for working with large multi-dimensional arrays, coupled with an extensive collection of functions.
Those arrays will be used as a container for data to be passed between a variety of machine learning algorithms. Now, why should we use arrays in NumPy if we can use the built-in list data structure, which can also be used as an array, a list of items? Well, we can use lists, but compared to NumPy arrays they are very slow to process when handling a large amount of data, and NumPy arrays will be the much faster option. Now, even though this library is very important, we're not going to cover it at this level, level 2, because we don't need it at this point. We'll cover this library in the next levels when it will be needed. The next very popular Python library for data science projects is called pandas. Pandas is an open-source Python library for data analysis and exploration. Pandas provides powerful high-level data structures and a group of functions to prepare and manipulate data. This data preparation and exploration is the foundation step before moving the data into any machine learning algorithm, and the pandas library makes our life much easier while handling this important preparation and exploration step. We will use it to load data from different file formats and perform initial inspection, filtering, aggregation, cleaning and transformation. Practically speaking, in many projects around 80% of the total work time will be dedicated to this step while using pandas. One of our primary learning objectives in this level is to learn how to use this important library. Using pandas, we can load, transform, filter and clean the data. During this process, we may want to present the data in a visual way, and for that purpose we can use Matplotlib. Matplotlib is the most popular Python library for creating static, animated and interactive data visualizations. We can use it to create line plots, pie charts, bar charts, histograms and more.
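Stepping back to the list-versus-array contrast mentioned above, here is a minimal sketch of the difference (NumPy itself is covered in a later level, so the values here are just illustrative):

```python
import numpy as np

lst = [1, 2, 3]
print(lst * 3)   # list repetition: [1, 2, 3, 1, 2, 3, 1, 2, 3]

arr = np.array([1, 2, 3])
print(arr * 3)   # element-wise math: [3 6 9]
```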
By the way, there are additional data visualization libraries in Python that extend the capabilities of the Matplotlib library. Anyway, the topic of data visualization will be the main learning objective in our next level in this training, meaning level 3. So after we have loaded a dataset, performed some cleaning and transformation, and used all kinds of visualization options to understand the nature of our data, we can move on to the next step and apply all types of machine learning algorithms. Scikit-learn is a core library in Python for machine learning algorithms. It includes classification, regression, clustering, model selection options, feature extraction, cross-validation and more, all the good stuff we will use later on. If you remember from level 1, we talked about shallow learning and deep learning. In deep learning, we develop much more complex models called neural networks. It's the most exciting, innovative and sophisticated part of machine learning, and over recent years several open-source projects like TensorFlow, Keras and PyTorch were introduced while trying to simplify the process of developing and training neural networks. This is, of course, a much more advanced topic, and we will get to those topics in the future levels. Okay, to summarize, there are all kinds of libraries that are useful for data science projects. Selecting the needed libraries is based on the task we would like to handle, and it will require some experience. As I mentioned, the pandas library is going to be our first touch point in almost any project, so we will start our journey with this essential and very important library. And with that we finalized the introduction to the Python language in this section. Thanks for watching so far. I hope it was clear as well as enjoyable. Don't forget that you can send me questions using the course dashboard or review existing questions and answers.
Your next mission, if you're willing to accept it, is to practice. Okay, practice the information in order to transform it into solid knowledge. In the following lecture, please review and download the summary exercise that will help you to practice all the essential things we covered so far. See you again in the next section. 16. Exercise #1 - Python Fundamentals: 17. Introduction to the Pandas Library - Overview: Hi and welcome. I hope you managed to practice your Python skills with the provided exercise. From this point moving forward, we will focus our attention on a specific data science library called pandas. Pandas is a very popular open-source Python library for data exploration, data analysis and data manipulation. We can use it for reading and writing files, handling missing data, filtering, cleaning, aggregating, performing data transformations and much more. It makes our life easier while working with large datasets. Pandas brings two useful data structures, Series and DataFrame. Those data structures will be used to load data from files and then apply all kinds of actions while analyzing the data. When are we going to use pandas? Well, there are two main use cases here. The first one is when we would like to perform data exploration and analysis, which is a common use case when handling a dataset. Okay, not every data science project will require machine learning algorithms. Sometimes it is just about data analysis and data exploration, and pandas is a great tool to perform such analysis. The second use case is about preparing the data as a prerequisite step before applying a machine learning algorithm. As a reminder from level 1, we know that data is the primary input for any machine learning algorithm. If we would like to get high-quality results from a machine learning system, we must make sure we're feeding it with high-quality data. It's a simple equation; otherwise it will be garbage in, garbage out.
Almost any raw data being collected and stored in databases must be prepared to some level before we can feed it into a machine learning system. We need to clean up duplications, remove some columns that are not adding useful information for the algorithm, transform and rescale some values, and so on. It is a critical step of many data science projects, and again, pandas is a tool to perform such preparation. Therefore, the capabilities of the pandas library are our main learning objective before moving on to any other data science library. Just as a reminder, we already have the pandas library as part of the Anaconda installation, so we can start to use it in JupyterLab too. 18. Series Data Structure: We will start with the pandas Series data structure. A pandas Series is a one-dimensional labeled array containing a sequence of values and also an associated array of data labels called the index. Let's create a simple Series object. First of all, we will import pandas as pd, and then we create a new object called s1 with a list of numbers, and let's present it. Okay, what do we have here? We can see two columns. The left side is the index: 0, 1, 2, 3 until 4. And the right side is the actual data. A Series object has a single data column. Now, since we didn't specify an index for the data, this first left-side column was generated automatically. To present information about the index, I can access it as an attribute of the Series object, so I will type s1.index. That's it. The second column is the actual data, which is a list of numbers. The last line over here represents the type of this column, identified by pandas as an integer. We can directly access the Series data type using the dtype attribute, and I'm getting that it is an integer. To access a value from the s1 object, we must use the index number, so I will type s1 and access item number 1. Okay, which is 20.
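The s1 walkthrough above might look like this sketch; the actual numbers are not shown in the transcript, so these values are illustrative (chosen so that item 1 is 20, as narrated):

```python
import pandas as pd

# a Series with the default integer index
s1 = pd.Series([10, 20, 30, 40, 50])
print(s1)
print(s1.index)   # RangeIndex(start=0, stop=5, step=1)
print(s1.dtype)   # int64
print(s1[1])      # 20
```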
Okay, until this point it looks the same as a list object, a list of items that we can access using an index number. So what is so special about the Series data structure? Well, when creating a Series object, we can also provide index labels instead of using numbers, meaning an index identifying each data point with a label. Let's see some examples. We create another object, s2, and present that object. Well, maybe just present the index. Now we can use the labels as the index when selecting one value or maybe a group of values. This is the new index, "a", "b", "c", "d" until "f", so I can type s2 and access a specific item, or maybe change the value of a specific item and then present that value. To access a group of items, we can use the slice notation, so it will be s2, for example, from "b" until "d". Okay, it's a subset of the Series object. So setting labels as an index is the first unique and useful capability of the Series data structure. The next useful property is about mathematical operations. Let's create a list object called list1 with five numbers, 1, 2, 3, 4, 5, and try to add the same value to all numbers. I will write list1 = list1 + 10. I will get an error. The only way to add the same value to all items is by using some for loop, as we saw before. Or maybe I will try to multiply all numbers by the same value. I will type list1 = list1 * 3. Okay, no error, but when I display it, it's not really good. I mean, Python will just concatenate three similar lists, and I will get just a longer list. That's not my objective; I wanted to perform some mathematical action on all values. Okay, so a list is not a good candidate for such actions. So let's move back to our Series object. We will create an object called s3, let's present it, five numbers, and let's say I would like to add a value like 10 to all of them. I can do that in one line: s3 equals s3
plus 10. That's it. Or maybe multiply all items by the same value. Here we go. This is a very powerful option: we can perform an operation on the whole object in one line. Think about a situation where we need to normalize all values in a column or do some rescaling. Instead of performing it using a for loop, we can use this simple syntax. If we're performing an operation between two Series objects of the same size, the result will be an element-by-element calculation. So let's create s3 and also s4 with the same size, something like that. I can print s3 + s4, okay, it will be a line-by-line calculation, or print s3 * s4. Here we go. We can also check if a value is a key in a Series object, so I can write something like: 4 in s3. Okay, it will search the whole index and check if 4 exists over there. Note that the in operator checks the index labels of a Series, not the data values; to check the values, you can use s3.values. Let me ask you something. Assuming you would like to create a Series object with 100 lines with the same value, how can we do it? Okay, if you remember from the previous section, think about the range function we used while building for loops. So I can do something like that: I can create s5, provide it with some value, and then use an argument called index with this range function. Let's display s5. We have 100 lines with the same value. Or let's say I would like to create a Series object, but this time I would like to use another list as the index, not numbers but labels like a, b, c, d, e, or whatever. And here we have it: a, b, c, d and so on. Another powerful option is the ability to apply a testing condition on a Series object. Okay, let's go back and present s2. We can compare and check the whole Series in one line, so I can write something like: s2 > 10, which is true for all values. So I will get a Boolean Series, where each line tests the condition for that line. Okay, this is true, true, true and so on.
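The labeled-index and element-wise examples above, sketched with illustrative values (the transcript doesn't show the actual numbers):

```python
import pandas as pd

# labels as the index
s2 = pd.Series([10, 20, 30, 40, 50, 60],
               index=["a", "b", "c", "d", "e", "f"])
print(s2["b"])        # 20
print(s2["b":"d"])    # label slicing includes the end label

# element-wise math, no for loop needed
s3 = pd.Series([1, 2, 3, 4, 5])
print(s3 + 10)
print(s3 * 3)

# element-by-element calculation between two Series of the same size
s4 = pd.Series([10, 20, 30, 40, 50])
print(s3 + s4)
print(s3 * s4)

# "in" checks the index labels, not the values
print(4 in s3)          # True - 4 is an index label
print(4 in s3.values)   # True - 4 also happens to be a data value here
```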
Let's use a different number, like s2 > 40. This time only three lines have a true value. We can use this result to filter lines from the original Series. So I can do something like that: I can create my_filter and write s2 > 40. my_filter is basically a Boolean Series: false, false, true and so on. I can use that and write something like s2[my_filter], and I will get only the relevant items. Okay? And I can also do that without the previous step of creating my_filter. I can directly write something like s2, and then inside, s2 > 40. That's it. Another example is to filter only the numbers that are greater than the mean value. So I can do something like: s2, and inside write s2 > s2.mean(). Okay, there is a method called mean that I can run, and again, it will calculate the mean value for the whole of s2, then compare each line to check if it's more than the mean, and then filter only those rows. Another option is to create a Series object from a dictionary object. So let's say I have a dictionary called profile, okay, with a list of key-value pairs. I will create a new object, s7, and as an argument I will provide that dictionary, profile, and present s7. What I'm getting is that the keys are being used as the index, okay, and these are the values of that Series object. I can now access a specific line by using the index label, like writing s7 with the index label, and I will get the value. And the last thing I would like to mention about the Series data structure is that we can provide it with a name attribute. It's like providing metadata information about the data. So let's create another dictionary called grades. Okay, a name is the key, and the grade is the value.
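The Boolean-filtering and dictionary examples above might look like this sketch; the numbers and the profile contents are made up for illustration:

```python
import pandas as pd

s2 = pd.Series([5, 30, 42, 50, 17],
               index=["a", "b", "c", "d", "e"])

my_filter = s2 > 40        # a Boolean Series: False, False, True, ...
print(s2[my_filter])       # only the rows where the condition is True
print(s2[s2 > 40])         # the same filter in a single line
print(s2[s2 > s2.mean()])  # values above the mean (28.8 here)

# a Series of 100 identical values, using range() as the index
s5 = pd.Series(7, index=range(100))
print(len(s5))             # 100

# a Series built from a dictionary - the keys become the index
profile = {"name": "Dana", "city": "Haifa", "age": 30}
s7 = pd.Series(profile)
print(s7["city"])          # Haifa
```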
Now, when I create a Series object called s8 from that dictionary, I'm providing not just the dictionary, which is called grades, but also the name of that object, "Grades" with a capital letter. If I present s8, you will see that name over here: Grades. It is like metadata that describes the meaning of that column. Okay, this is enough about the one-dimensional Series data structure. Let's move on to the DataFrame data structure. 19. DataFrame Data Structure: Moving on to the next data structure that comes with pandas, called the DataFrame. A pandas DataFrame is a two-dimensional labeled tabular data structure with rows and columns, which makes it a perfect container to store data that comes in a spreadsheet structure, like a table or an Excel sheet. Now, usually when we build or store data in an Excel sheet, we provide labels for the rows and columns. Every row will have some sequence number or a unique key, and every column will have a title, so we can understand what kind of data we have in that specific column. It is the same concept in a pandas DataFrame. The rows and columns are labeled, and they are called axes. We have two axes, one for the rows and another one for the columns. Those axes will be used to access and filter data from the DataFrame, to access specific rows or specific columns. Another essential property of the DataFrame relates to the Series one-dimensional data structure we covered in the previous lecture. Each column in a DataFrame is a Series object; when filtering a single column from a DataFrame, the result is a Series data structure. This will be useful when performing all kinds of actions on a DataFrame, and also when creating a new DataFrame: we can create a DataFrame while using the built-in dictionary data structure.
If you think about a dictionary with key-value pairs, then a key can be the name of a column, and the value will be a list of all the rows in that column. So basically a DataFrame is like a dictionary of Series objects, and therefore dictionaries are the most common way of creating a DataFrame. Let's see an example of creating a DataFrame from a dictionary. First of all, I will import pandas and then define a dictionary object with this structure. Okay, I have a key, which is called "name", and then a list of items as values, and the same for "age" and "weight". Now I can use it to create a DataFrame object with rows and columns. Let's create a DataFrame object called df using the DataFrame function, with this dictionary as an argument, and now we will present the DataFrame. Looking at this simple example, a DataFrame is based on three main components: the index, the columns and the data. Looking at this tabular view, this is the index on the left side. The upper first row is the columns, and each specific value inside it is a column name. In the middle, we have the actual data. Each of the three DataFrame components may be accessed directly from the DataFrame. For example, I can access the index values, which are basically a range of numbers, or the columns. Okay, we can see name, age and weight; these are the column names. The columns and the index are used to provide labels for the columns and rows, and they are marked here in bold font. The vertical axis, called the index, is also referred to as axis 0, and the horizontal axis, called columns, is also referred to as axis 1. The DataFrame's data is the remaining part, an entirely separate component from the columns and the index.
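The DataFrame construction and its three components, as described above, can be sketched like this (the names and numbers are illustrative, since the on-screen values aren't in the transcript):

```python
import pandas as pd

people = {
    "name":   ["Alice", "Bob", "Carol", "Dan"],
    "age":    [25, 32, 47, 19],
    "weight": [55, 80, 62, 70],
}
df = pd.DataFrame(people)   # each key becomes a column label
print(df)

# the three components can be accessed directly
print(df.index)     # the row labels - axis 0
print(df.columns)   # the column labels - axis 1
print(df.values)    # the actual data, as a 2-D array
```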
A DataFrame is an object, and like any other object, we can use the dot notation combined with the Tab key. So I can type df. and press Tab, and I will get the list of available attributes and methods related to this type of object. 20. Data Selection in a DataFrame: In most cases, the DataFrame will be based on many columns and many rows, so the ability to select a subset of data from a DataFrame is a critical skill to master. This is a very important lecture. In pandas, we have two main options to select rows and columns: integer-position-based or label-based selection. It's a dual selection capability, and it can be a little bit of a confusing concept for beginners. I will create again a simple DataFrame from a dictionary object. Import pandas as pd, and this is a dictionary, and I'm creating a DataFrame with that dictionary, and I'm also indicating the index for those rows: a, b, c, d. Great, it's a nice, simple DataFrame. We have labels for the columns, name, age and weight, and also labels for the rows, a, b, c, d. Now, whenever we would like to select something from that DataFrame, we can use the integer position or use the labels as the index. Let's start with selecting columns by labels. As we have meaningful labels for our columns, we can use the column name with square brackets. Let's select the name column by label; this is the simplest syntax. The result is a one-dimensional object, a single column with all the rows in that DataFrame. And we use the square brackets to select multiple columns by providing a list of columns, and in that case there are double brackets. See, the inner brackets are the list. The result will be a DataFrame if we select two or more columns. Can we select those columns by position, like writing the following simple line: df, open square brackets, 0 until 2? Well, it's not working. I'm still getting all the columns in the DataFrame.
It's happening because Python thinks we're trying to filter rows, and I'm getting the first two rows using this slice notation. There is a way to select columns by position, and we'll see it later. What about filtering rows? We just saw a simple syntax to select rows by position; we see the first rows with the data from all columns. If we would like to select rows by labels, then we write them with single square brackets and the slice notation. Okay, this is the square brackets and the slice notation. But I can't select specific rows by labels using a comma separator and a list of arguments like that. If I provide a list inside, as I just tried, Python will think we're referring to columns, and we'll get an error because those two labels are not part of the column index. As you can see, there are all kinds of limitations that we can overcome by using another option we will see later. We can always combine column filtering based on labels and row filtering based on position, so it looks something like that. This is columns based on labels, and these are rows based on position. Or maybe columns based on labels and also rows based on labels, but here I must use the slice notation. Here we go. So this is the first option, using the notation of square brackets, and another option will be to use the dot notation, which is very popular, but it comes with some limitations. Let's talk about the dot notation. As you remember, the dot notation is used to access methods and attributes of an object. Okay, Python is an object-oriented programming language. There is an option to access a single column as an attribute using the dot notation, so you will see something like that: the name of the DataFrame object, a dot, and the name of the column, like df.age or df.name. It is sometimes handy, because we can press the Tab completion key and get the list of methods and attributes, and the DataFrame columns appear there as attributes too.
So if I type df. and press Tab, I will get a list of available methods and attributes, as well as columns, so I can search, for example, for age, and I will get it. Keep in mind that it's not always possible to use the dot notation, and this is very important to remember. Many programmers use this second option, the dot notation, instead of the square brackets, because it's fewer things to type, so you may see that in many examples, but it will not always work. If the column name has a space inside, for example when the column label is "full name", we can't access it using syntax like df.full name, because of that space. There is a workaround that some programmers use: when they are loading something into a DataFrame, they add an underscore between those two words, so they will be able to access the column using the dot notation. So we have the two options. I'm going to use the more generic one, meaning the square brackets, but you should be aware that you may see the dot notation used to access columns in other places and in documentation. If we check what type of object we get by accessing a single column from a DataFrame, the result will be a pandas Series, a one-dimensional object. Let's see that. I'm using the type function, and inside I will type df and access some column. That's it. As you can see, the result is a Series object. Any operation that can be done on a Series object, as we saw before, can be applied to a column in a DataFrame, something that will be extremely useful later on. We talked about selecting columns or rows using the square-brackets notation or using the dot notation. They are basically translated by Python to the same thing, and you will see them in many places. I'm going to use them also during the training.
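To recap the square-bracket and dot-notation options discussed above (the DataFrame content is illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Alice", "Bob", "Carol", "Dan"],
     "age": [25, 32, 47, 19],
     "weight": [55, 80, 62, 70]},
    index=["a", "b", "c", "d"])

col = df["name"]            # single brackets -> a Series
sub = df[["name", "age"]]   # a list inside the brackets -> a DataFrame
rows = df[0:2]              # a slice selects ROWS by position, not columns
print(type(col), type(sub), len(rows))

print(df.age)               # dot notation works for a simple label
# df.full name              # would be a SyntaxError: labels with spaces
#                           # need square brackets
print(type(df["age"]))      # <class 'pandas.core.series.Series'>
```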
There is another, more powerful and flexible option in pandas to select rows and columns, and it is about using the loc and iloc methods. loc is a method to select data using labels, okay, label-based indexes, and iloc, with the i, is a method to access data using integer-based indexes. As a reminder, this is our original DataFrame. Previously we saw the options to access a column using the square brackets, df and the label of the column, or using the dot notation. Okay, that's it. Let's select the same column by label using the loc method. Okay, the syntax is very simple and clear, so it will be something like df.loc, and the first argument will be which rows to filter, then a comma, and then which columns to filter. So, as an example, if we would like to select all rows but only a single column, as we have done over here, that will be the syntax. As we can see, it's very explicit and therefore very clear: I'm saying which rows and which columns are needed. For getting all rows with two columns, I'm providing a list as the second argument. Okay, this is the second argument, now it's a list, and I will get those two columns with all the rows. For selecting specific rows and also specific columns using labels, okay, for both of them, it will be like that: I'm providing lists for the two arguments. Okay, that's it. Another option is to use the slice notation for getting all the rows or all the columns between some values. So instead of using a list as an argument with square brackets, I can use the slice notation. So I'm getting the rows between "a" and "b", and the columns between "name" and "weight", which includes also the "age" column in the middle. What about integer-based index selection? For that, we can use the iloc method with a similar syntax. Okay, again, here I'm using iloc, and I'm providing which rows to filter and which columns to filter.
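The loc examples above, together with the similar iloc syntax just introduced, in one sketch (same illustrative DataFrame as before):

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Alice", "Bob", "Carol", "Dan"],
     "age": [25, 32, 47, 19],
     "weight": [55, 80, 62, 70]},
    index=["a", "b", "c", "d"])

# loc - label-based selection
print(df.loc[:, "name"])                       # all rows, one column
print(df.loc[:, ["name", "age"]])              # all rows, two columns
print(df.loc[["a", "c"], ["name", "weight"]])  # specific rows and columns
print(df.loc["a":"b", "name":"weight"])        # label slices INCLUDE both ends

# iloc - integer-position-based selection
print(df.iloc[0, :])       # row 0, all columns
print(df.iloc[1:3, 0:2])   # integer slices EXCLUDE the end position
print(df.iloc[:, [0, 2]])  # all rows, columns 0 and 2
```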
So, for example, to access a single row by position, okay, with all columns: it's a single row from the DataFrame, and it is presented in a column structure. We can also write the line without mentioning the columns as the second argument, okay, like that, and I will get the same result. But this is really bad syntax. It is always better to be explicit and mention which columns or which rows we're planning to select; the code will be much clearer. We can use the slice notation to access multiple lines. Just keep in mind that when using the integer slice notation, the first position will be included, okay, and the last position will not be included. Here the filtered rows will be the ones at positions 1 and 2, without the row located at position 3, which is basically row number 4, because the index starts from 0. Let's also filter the columns by position. I will print the DataFrame again. Each column is located at a specific position, starting from 0, even though we can't see that sequence number, because the columns are filled with labels. So to get all rows but specific columns by position, like 0 and 2, that will be the syntax. Or a group of columns between those positions using the slice notation. Or selecting specific rows, and for the rows I can also use the slice notation, similar to columns. Those are the two options, loc and iloc. Let's quickly summarize. Selecting rows and columns can be done using several options. Okay, option number one is using square brackets, which can be used to select multiple columns or multiple rows. This is for selecting multiple columns, and this one for selecting multiple rows. Okay, this kind of syntax can sometimes be confusing while reading the line; it's not 100% clear whether we are filtering columns here or rows. The next option was using the dot notation, which is used for selecting only a single column, and it's a very popular option: df. and then the label of that column. That's it.
Assuming I don't have any space inside the label name. Okay, another, more flexible and clear option is to use the loc and iloc methods. iloc is based on integer positions, starting with i; I guess it's easy to remember that the i is for integer, and I can filter by position. And loc is based on labels, okay, starting with l, just to remember that. Your next mission, if you're willing to accept it, will be to practice. Please review the next lecture and download exercise number 2, Pandas Series and DataFrames. See you again in the next section, and good luck with that exercise. 21. Exercise #2 – Pandas Series and DataFrame: 22. Loading Data into a DataFrame - Overview: Hi and welcome back. In the previous section, we learned how to use the two pandas data structures, meaning the Series one-dimensional data structure and the DataFrame two-dimensional data structure, and specifically, how to select rows and columns in a DataFrame. We also used the dictionary object to load a small amount of data into a DataFrame. Now we are ready to move on and start to play with real data, and this is where things get more interesting. In this section, we're going to learn how to load a dataset from a file and then perform all kinds of data analysis and exploration operations, like filtering, sorting, grouping and more. Which dataset are we going to use? Well, one of the most famous datasets for beginners is the well-known Titanic dataset, coming from the popular website called Kaggle. In the next lecture, let's start with a quick introduction to Kaggle and the Titanic dataset. 23. Kaggle and the Titanic Dataset: Do you know what is the biggest challenge for people trying to become data scientists? Well, it's not just about knowledge. It's about acquiring experience while working on real projects, with real datasets, across a variety of industry challenges.
So, to help people worldwide practice their knowledge with real datasets and share the challenges with other people, the Kaggle website was founded in 2010. Kaggle is probably the largest online community of data scientists and machine learning practitioners. It allows registered users to find and publish interesting datasets, explore and build machine learning models, work with other data scientists while solving challenges, and also enter competitions to solve data science challenges. It's a great source of information, and most probably you will start to use it somewhere in the near future. If I search for the word "Titanic", then one of the first results will be the famous Titanic competition, which is still up and running. This is like the "getting started" competition on Kaggle. The mission here is to download and use a dataset that represents a list of passengers, and then predict the survival of passengers based on their information; the output will be zero or one for every passenger. A very nice and interesting challenge, and we will solve it in a future level. In our case, we will use the Titanic dataset to practice our data analysis and exploration capabilities using pandas. So I'm going into the Data tab. The provided data is divided into two main files: the first one is called test.csv and the second one is called train.csv. At this point, we're not planning to create a prediction model, so we can just use the train.csv file. Let's download that file, and then I will rename it to titanic.csv. By the way, if you would like to practice during the training, you can also download this CSV file from the course resources. Let's open the file. I will zoom in a little bit. Let's see what we have here. As you can see, the dataset is just a long list of passengers. Every passenger has a unique ID number here on the left side. The first line is the header.
So for every column there is some column name or description, like Survived, Name, Sex, Age. Other names are not descriptive enough, so it is not clear what the meaning of some columns is — for example, this one, Pclass, or those two columns. Sometimes, by looking at the values, we can guess the meaning, but there is a better option. We can go back to the Kaggle website, open the data description under the Data tab, and search for something that is called a data dictionary. It's a simple table that describes the meaning of each column in the dataset. So, for example, Survived is a binary value, zero or one — this is easy. Pclass is the ticket class, and there are three classes: 1, 2, 3. Below, there are also some valuable notes. So, for example, Pclass is a socio-economic indication of that passenger: upper, middle, lower. Some columns are quite straightforward, like Sex, Age, ticket Fare and Cabin. Embarked is basically the port of embarkation, and there are three possible values in the dataset: C, Q and S — shortcuts for the full port names. Then there are two additional columns, this one and this one, which indicate some information about the family relations of that particular passenger, and we have details and notes about them right here below. That's it. Now we have a better understanding of the data structure — the metadata of the data — and we are ready to explore the content of the data while loading our file into a DataFrame.
24. Loading a Tabular Data File: In the previous section, we saw how to create a DataFrame object from a dictionary. Practically speaking, in most cases the data we would like to analyze will be stored in tabular data files, like our titanic.csv file. Generally speaking, tabular data is data that is stored as text in a table structure with multiple rows, where each
row contains information about something, and each row will have the same number of cells. All cells under a specific column describe the same property — like the age of a person, or the name, and so on. There are multiple well-known file formats to save tabular data and separate the cells from each other — for example, the well-known comma-separated values (CSV) file format. A CSV file is a text file containing data in a table structure — this is what we see right now — where columns are separated using the comma character. We don't see the comma character right now, and rows are on separate lines. The comma is used to separate the data in each row. Let's open the titanic.csv file again, but this time I will use a simple text editor. We can see here how the columns are separated using commas, and each row is located on a different line. This is the structure of a CSV file. Our next step is to learn how to import such tabular data into a pandas DataFrame. Thanks to pandas, this is a very simple process. The pandas library provides functions for reading tabular text data stored in files into a DataFrame object. For example, we have read_csv for reading delimited data from a file or a URL, read_excel, read_json, and there are more. We know that our Titanic file is in CSV format, so let's load it using the read_csv function that we have in pandas. The first step will be to import pandas. Now, the first argument we need to pass to the read_csv function is the name and path of the file that we would like to load. So that will be the structure: df equals pd.read_csv with a relative directory — dir1 is the relative directory — and this is the file name.
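The call just described can be sketched as follows. To keep the example self-contained (without assuming the Titanic file is on disk), I first write a tiny CSV file and then load it back; the file name and contents are invented for illustration:

```python
import pandas as pd

# Create a tiny CSV file so the example runs anywhere
with open("mini.csv", "w") as f:
    f.write("PassengerId,Name,Age\n1,Anna,22\n2,Ben,35\n")

# Relative path: resolved against the current working directory
df = pd.read_csv("mini.csv")

# An absolute path works the same way, e.g.:
# df = pd.read_csv("C:/Users/me/data/titanic.csv")
```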
Now, what is the meaning of "relative"? Relative to the current working directory, which is typically the directory where we started the JupyterLab application, unless we change it. By the way, if you would like to check the current working directory, you can use the built-in os module: I can type import os and then print os.getcwd(), and this is my working directory. I can also check the list of files in that working directory by typing os.listdir(), and here we go — the relevant CSV file is here: titanic.csv. Now, if the required CSV files are not there, we can move them into our working directory and then use the relative path. This is usually the recommended option if you're going to share the Jupyter notebook with someone else. The second option is to use an absolute path, meaning a complete path from the base of the file system to the file location. Let's see some examples. So I'm loading the same CSV file, but this time, as you can see, I'm typing the full directory, starting from the drive. So those are the two options. Right now, my file is already in my current working directory, so the relative path is enough. Okay, let's load the file. That's it — and present it. We just loaded the CSV file into a DataFrame with the read_csv default parameters. Overall, we can see that there are 891 rows and 12 columns. Let's look at the first 10 rows. I can type something like df.head(10), and I will get only 10 rows, from 0 to 9. We got a nicely formatted DataFrame, with numbers as the index on the left side, the column headers as the index at the top of the DataFrame, and the content of the file nicely organized into rows and columns. However, in many cases, the file we would like to load will not be so perfectly organized.
Therefore, in the next lecture, we will learn how to adjust some essential arguments when calling the read_csv function.
25. Adjusting the Loading Parameters: The first argument I would like to talk about is the header. Let's open our titanic.csv file for a second; we can open it directly in the JupyterLab tool. A CSV file does not contain information about the content of the data — it's a straightforward, very simple format. So when someone is creating a CSV file, it is a best practice to use the first line as a header, even though it's not required. Some files will come with a header line and some files without any header. Our Titanic file comes with a nice header as the first line, so in this case we will be able to use this line when creating the DataFrame. It's done by providing the header parameter with the location of the header line as an argument. So I will go here, import pandas again, and type the line to load the file into a DataFrame. Now I added another parameter with the argument value zero; it's indicating the location of the header line in our titanic.csv file — it's actually this line, which is number zero. Don't forget that the first line is zero. On the other hand, in case the CSV file does not have a header line, we need to set header equal to None. When doing that, pandas will use auto-generated integer values as a header. To practice this scenario, I already created a file called titanic_no_header, without the first line. Now let's see how I'm reading that file. All I need to do is set the new file name, titanic_no_header, and set the header to None. Let's display the first 10 lines and see what we have. Now the header looks a little bit strange — it's just automatically generated numbers, like we saw for the rows.
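A minimal sketch of the header parameter, again with self-contained toy files rather than the real Titanic data:

```python
import pandas as pd

# A file WITH a header line (it is line 0)
with open("with_header.csv", "w") as f:
    f.write("Name,Age\nAnna,22\nBen,35\n")

# The same data WITHOUT a header line
with open("no_header.csv", "w") as f:
    f.write("Anna,22\nBen,35\n")

df1 = pd.read_csv("with_header.csv", header=0)   # use line 0 as the header
df2 = pd.read_csv("no_header.csv", header=None)  # auto-generate 0, 1, ... as column names
```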
If we want to use the same prefix for each column name — maybe like Col0, Col1 — we can use the prefix argument. So what I need to do is add another argument called prefix and put some string as the prefix. Now, if I display the DataFrame, we will be able to see that prefix being used across all the columns in that DataFrame. Another option is to manually specify the column names, using the names parameter and providing a list of strings as an argument. So let's say I created a list of column names: A1, A2 and so on, up to A12. Now I can use it while reading my dataset: I'm setting header equal to None, but this time I'm providing the names parameter with this argument called col_names, where we have A1, A2 until A12. Just an important reminder: all the column names should be mentioned within that list. By the way, if the file contains a header row, as we have in the original titanic.csv file, and we still want to override it, then we can explicitly pass header equal to zero and also the names, in order to override the column names. So it looks something like that: I'm reading titanic.csv — there is a header there, so I'm setting header equal to zero, but I'm overriding the header with the names parameter. So that's another option to consider. Let's talk about the row labels, also called the index. When creating a DataFrame, pandas generates the index as a sequence of numbers starting at zero. It's not part of the DataFrame content; it's an attribute of the DataFrame. So let's load the Titanic file again and display the DataFrame. What we see on the left side is the automatically generated sequence — 0, 1 and so on — for every row in the dataset. This is the index attribute. We can access it this way; we're getting a special object, a RangeIndex starting from zero until this value.
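Overriding the column names and inspecting the auto-generated index can be sketched like this (toy file, invented names; note that in recent pandas versions the prefix argument mentioned above has been removed, so this sketch uses only names):

```python
import pandas as pd

with open("people.csv", "w") as f:
    f.write("Name,Age\nAnna,22\nBen,35\n")

col_names = ["A1", "A2"]

# The file has a header on line 0, but we override it with our own names
df = pd.read_csv("people.csv", header=0, names=col_names)

# The row index is auto-generated as a RangeIndex starting at zero
idx = df.index
```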
Now, sometimes it makes sense to change that and provide labels also for the rows. Let's take an example. If I'm looking at the titanic.csv file, I have several columns, and let's assume that for some reason someone would like to use the name of a person as an index. So this is 0, 1, 2, 3 — the Name column is at location number three in the dataset. What I can do is use a special parameter called index_col, and I will provide the location of the column that I would like to use as an index for the rows. Let's run it. Now, as you can see in bold font, the names are being used as the index — not very useful in this particular case, but it may be useful in other datasets, where you can maybe combine several columns to be the label for the row index. So this is something you can consider while loading a dataset: changing the row labels as well. Moving on: sometimes we will review the file and decide that only specific columns are relevant for the required analysis. So instead of loading all columns into memory and then deleting the columns we don't need, we can specify which columns to load into the DataFrame. This is important because sometimes the dataset has millions of rows, and with every column that you remove which is not needed, you reduce the load on the memory. For that, we can use the usecols argument, passing a list of strings corresponding to the column names, or maybe a list of integers corresponding to the column index locations. So let's see some examples. I can write something like df equals pd.read_csv with the name of the file, header equal to zero — I would like to use the first line as the header for the columns — and I would like to use specific columns: 0, 1, 2, 3, 4, until this column. All the other columns will not be loaded. Here you go.
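Both options — index_col and usecols — can be sketched together on a toy file (invented columns, not the Titanic ones):

```python
import pandas as pd

with open("crew.csv", "w") as f:
    f.write("Id,Name,Age,Cabin\n1,Anna,22,C85\n2,Ben,35,E46\n")

# Use column 1 (Name) as the row index
df_named = pd.read_csv("crew.csv", index_col=1)

# Load only a subset of the columns (by name here; positions work too)
df_small = pd.read_csv("crew.csv", usecols=["Name", "Age"])
```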
A much smaller dataset. The data stored in the multiple columns inside the CSV file can represent a variety of data types, like numbers, strings, boolean values, dates, etc. But the CSV file does not contain information on what kind of data types are stored in each column. When we call the read_csv function, and also the other functions to read files, pandas will try to figure out the data types based on the content of those columns. For example, if a column contains only numbers, then pandas will set that column's data type to a numeric type such as integer or float. I will load the file again — the full dataset — and then check the data type of each column by using the dtypes attribute: df.dtypes. As a reminder, when accessing an attribute we don't use brackets, just the dot notation. We're getting a list: for each column, we can see the automatically identified data type. As a simple example, the Survived column was identified as int64, meaning an integer number, but this is not 100% accurate: this column is a boolean data type with only two values, zero and one. So in case we would like to override it during the file loading, we can manually set the data types for the DataFrame columns by using the dtype parameter. So it will be something like that — the same structure, I'm reading the file with dtype. The input is basically a dictionary with key-value pairs. In that case, the column I would like to change is Survived, and I would like to set it to bool. That's it. Let's check dtypes again. Here we have it — Survived is now a boolean type. Now, if I present the DataFrame, instead of getting zero or one, I will get False and True, which is in some cases easier to read. It's also worth mentioning that when adjusting data types after loading, I can use a dedicated method for that — astype — something that is useful.
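Type inference, the dtype parameter, and the astype method can be sketched on a toy file:

```python
import pandas as pd

with open("survival.csv", "w") as f:
    f.write("Survived,Age\n0,22\n1,35\n")

# Let pandas infer the types: Survived comes back as an integer column
df = pd.read_csv("survival.csv")
inferred = str(df["Survived"].dtype)

# Override during loading with a {column: type} dictionary
df2 = pd.read_csv("survival.csv", dtype={"Survived": bool})

# The same conversion after loading, via astype
df3 = df.astype({"Survived": bool})
```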
One more thing that I would like to mention is the object data type. We can see in our DataFrame several columns that were identified with the object data type. A column that is identified as object may contain values that are of any Python object. Typically, when loading a tabular file, as we have done, when a column is of the object data type it means the entire column is strings. For example, the Name and Sex columns and so on — all of those are object data types, and they are basically text strings. But sometimes a column that should be identified with an integer data type is mistakenly classified as an object data type, and in that case we need to fix it manually, as we just saw. That's it about loading a CSV file. There are many more optional parameters for the read_csv function, handling all kinds of different cases, but we covered the most common scenarios. In any case, it is always recommended to check them in the official online pandas documentation that you see right now. I'm looking at pandas.read_csv, and I can see all the available parameters for that particular function, with a description of how to use each one. We are now ready to start exploring and analyzing the content of the DataFrame.
26. Preview the DataFrame: We loaded our CSV file, with all the needed or relevant arguments, into a DataFrame object. Now it's time to inspect the DataFrame — check what we have and what we don't have. To perform such an initial analysis, we can use several attributes and methods that are part of the DataFrame object, starting with attributes. The DataFrame is based on three main components: the index, the columns and the values attributes. Let's access each one of them. So we already saw the index, which represents the rows; this one is the index of the columns; and the last component is the values, which is the content of the DataFrame.
By simply typing the name of the DataFrame, as we saw before, I'm getting a nicely formatted output. It is a very convenient way to preview the data, check and confirm that the column names were imported correctly, and maybe spot some missing values. For example, looking at the Cabin column over here, there are many cells with the value NaN. In pandas, this represents missing values. Later on, we will talk about the process of handling missing values. Another useful attribute is the shape attribute: df.shape. It will return the number of rows and the number of columns of the DataFrame. For getting much more detailed summary information, we can use the info method: df.info(). As we can see in the output, the summary includes a list of all the columns with their data types, and also information about missing values. We know that the number of rows is 891, so most of the columns are okay; but, for example, the Cabin column has only 204 non-null rows, which means many missing values for that column — something we will need to take into account later on. At the top, we also have information about the index, and at the bottom, a summary of how many columns are of a specific data type. So we have, for example, five columns which are integer, two columns that are floating-point numbers, and five that are generic objects. In many cases, the DataFrame will be too long, with many rows — it could be thousands and even millions of rows. So it makes sense to display only a few lines; we can use the functions head and tail. We already saw head, and you can provide the number of rows that you would like to see. Another method will display the last N rows: tail. I would like to see just the last five rows, and I will get those last five rows.
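The preview tools above, sketched on a toy DataFrame with some deliberately missing values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Anna", "Ben", "Carla", "Dan", "Eve", "Finn"],
    "Age": [22, 35, np.nan, 41, 29, np.nan],   # NaN marks missing values
})

# The three building blocks of a DataFrame
rows_index = df.index        # RangeIndex over the rows
cols_index = df.columns      # Index of column labels
values = df.values           # the raw data as a NumPy array

n_rows, n_cols = df.shape    # number of rows and columns

df.info()                    # prints dtypes and non-null counts per column

first = df.head(3)           # first 3 rows
last = df.tail(2)            # last 2 rows
```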
One of the first things to check when loading a dataset into a DataFrame is the number of missing values per column. There is a dedicated method called count for this task, so I can type df.count() and I will get a number for each column. We know that the total number of rows is 891, so most of the columns are perfect, without any missing values. On the other hand, the Age column — this one — as well as the Cabin column are missing many values. There are a few things we should do with missing values, and we'll see that later on. The next very useful method is used to get the unique values of a column by name. So let's see the DataFrame again, and let's check, for example, this column, Sex, which clearly has two values, male and female. Let's verify that using this function: I can access a specific column using square brackets, and then use the unique method. I'm getting the result — only two options, male and female. Let's do that again, but this time for another column, Embarked. Now I'm getting the unique values — S, C and Q — and also NaN, because there are some missing values in that column that we will later handle in order to remove them. If we want to know how many rows have the same specific value in a specific column, then we can use the value_counts method, so it will be something like that. The resulting object will be in descending order, so that the first element is the most frequently occurring value. By default, it ignores missing values. So I can see the count for the S value, then C and Q — basically, most of the passengers embarked from this port location, which stands for, as we saw, the Southampton port. Let's analyze additional columns, like the number of male and female passengers. So again, I will take this line and put Sex as the column, and here we have it.
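The count, unique and value_counts methods just shown, sketched on a toy DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male"],
    "Embarked": ["S", "C", np.nan, "S", "S"],
})

non_missing = df.count()               # non-missing cells per column
sexes = df["Sex"].unique()             # array of the distinct values
ports = df["Embarked"].value_counts()  # counts in descending order; NaN ignored
```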
By default, value_counts returns the absolute counts, but we can also get the relative frequency by setting an additional parameter: the normalize parameter set to True. So I can take this line, and instead of getting absolute values, I can set normalize equal to True, and now I'm getting a percentage breakdown: almost 65% are male and around 35% are female. I'm not sure the following option will be used so frequently for simple data analysis, but we can also transpose the data structure — it will be used in future levels, when we're going to run machine learning algorithms. So I can basically type something like df2, which is a new DataFrame, equal to df.T (transpose), and if I display df2, I will see that it switched between the columns and rows. Again, something that will be useful in future levels when we use machine learning algorithms.
27. Using Summary Statistics: Moving on to summary statistics. Pandas objects like DataFrame or Series objects already include a set of standard mathematical and statistical methods that can help us answer specific questions about the dataset. Let's see some examples using questions. For example: what are the minimum, maximum and average ages of the Titanic passengers? So I can write something like df['Age'] and call a function called min — this is the minimum age over all rows — or max, and the same option, mean, for the average. The default behavior of those methods is to exclude missing data and operate across all rows in the DataFrame. We can combine multiple columns; for example, let's say I would like to get the mean of two columns together, like the age and also the ticket price — I just need to add another pair of brackets, and there we go. For getting the maximum value of all columns, I can just write df.max(); it will run on all columns, and I will get the maximum value of each.
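The relative-frequency, transpose and basic statistics options, sketched on a toy DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    "Sex": ["male", "male", "female", "male"],
    "Age": [20.0, 40.0, 30.0, 50.0],
    "Fare": [10.0, 20.0, 30.0, 40.0],
})

share = df["Sex"].value_counts(normalize=True)  # fractions instead of counts

oldest = df["Age"].max()
youngest = df["Age"].min()
average = df["Age"].mean()
total_fare = df["Fare"].sum()

df_t = df.T                  # transpose: rows become columns and vice versa
```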
How many passengers survived? I can write df['Survived'].sum() — this is the number of passengers that survived. How much money did all passengers spend on their tickets? I can use the Fare column combined with sum and get the total amount. And let's ask a more tricky question: what is the maximum age of 80% of all passengers? Here we can use a special function called quantile, and we can provide the percentage value as an argument. So it's going to be like that: df.quantile(), and I'm providing the percentage as a value between 0 and 1. And here we go — if I look at the Age column, the age is 41; this is the maximum age of 80% of all passengers. This is some kind of statistical analysis. In many cases, we don't have specific questions about the dataset, at least in the initial phase, but we would like to get some summary overview of the dataset. For this purpose, we can use the describe method, which will provide summary statistics for the numerical columns in the DataFrame. So I'm typing df.describe(). It will return the count of items, the mean (average) value, the standard deviation, minimum and maximum, and also a range of quantile data for every column that is a float or an integer. Looking at this summary, I can see that there are only 714 non-missing lines for the Age column — there are still many missing values for this column. The mean age is around 29 for all rows without missing values. One standard deviation from the mean is 14.5 years, meaning 68% of people have an age between 14.5 and 43.5. The youngest age is less than one year, and the oldest is 80 years. For example, 75% of all passengers are less than 38 years old. By the way, if we have too many columns, we can run it on a single column, so I can do something like df['Age'].describe() and it will run on just that single column, which is Age.
28. Methods Chaining: Before presenting more options to analyze the content of our DataFrame,
I would like to touch on an important programming concept that you will probably see in many places. It is called method chaining. Many programmers use this option, and it is very useful when working with pandas. Method chaining is a programming style, or methodology, of performing operations sequentially on a DataFrame or a Series object that emphasizes continuity. It is a friendly, top-down approach, and it eliminates the need for temporary variables while performing a few sequential steps. Let's see a few examples to understand the concept. There is a function called isnull. It will return a DataFrame with a boolean value for each cell: if the cell is a missing value, it will be True; most of the values will be False. Now, assuming we would like to count the missing values in each column, I can write something like df_temp = df.isnull(). The result will be stored in the df_temp DataFrame, and then I can take df_temp and run the sum method. Here we go — this is the result. A better approach is to use method chaining and skip this temporary variable. Each method call is added by using the dot notation, so it will be something like df.isnull() and then accessing the next one, sum, getting the same result. We read this line from left to right: first of all, isnull will run on the original DataFrame and return an object. We know that the object is a DataFrame, and then the sum method will run on the object coming from the isnull result. The key thing in method chaining is to know the exact object being returned during each step of the chain — there could be many, many methods chained in pandas. This will usually be a DataFrame, a Series object, a list, or maybe a single value. Now, if you're not sure what type of object you got during some step, just use the type function.
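The chaining idea just explained, including the type check, sketched on a toy DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [22.0, np.nan, 58.0, np.nan],
    "Fare": [7.25, 71.28, np.nan, 8.05],
})

# Step by step, with a temporary variable
df_temp = df.isnull()               # True where a cell is missing
per_column = df_temp.sum()          # missing values per column

# The same with method chaining -- no temporary variable
per_column_chained = df.isnull().sum()

# One more link in the chain: total missing values in the whole DataFrame
total_missing = df.isnull().sum().sum()

# Not sure what a step returns? Ask with type()
step_type = type(df.isnull().sum())   # a pandas Series
```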
So I can do something like this: I can call type and put all of that inside, and I'm getting the type of object that will be returned from that line — it's a Series object. Now, let's say I would like to sum the total amount of missing values in the entire DataFrame. I can add another method, so it will be something like that. And again, if I check it, it is quite straightforward — this is an integer number. Let's take another example. There is a dedicated method to drop rows with missing values, something that we will use later on, but let's just use it in the context of method chaining. If one or more columns in a row are missing values, the row will be dropped. This is the name of that method: dropna. As you can see here, the returned object is a DataFrame, and now the number of rows was reduced to only 183 rows. Now, I would like to take this new DataFrame and calculate the average age, so I will use again the concept of method chaining and write it over here. First of all, I will filter the Age column and get the mean, which is the average. Just to see the difference, I will also calculate the average age without dropping rows — let's remove those steps. Here we go — different numbers. This is the concept of method chaining, and it will be used a lot during our training.
29. Sorting and Ranking: The next very common operations we will perform while analyzing our data are sorting and ranking. Let's review the methods in pandas to perform such operations while answering all kinds of questions about our dataset. Assuming we would like to see the list of the 10 oldest passengers, we can use the sort_values method, asking it to sort all rows in the DataFrame using the Age column, in descending order. We can see how the rows are sorted using the Age value. Assuming I would like to see just two columns — for example, the name and age of each passenger — I can filter the result
and just use Name and Age this way, and I will get a DataFrame with only two columns. As with other methods, we can perform the sorting in place, meaning changing the original DataFrame, by setting the inplace parameter to True. Another option is to sort the data using more than one column. Let's say I would like to first sort the data into male and female groups, and inside each group sort the rows by age. In that case, we need to provide a list of columns as an argument and, if needed, also a boolean list for the ascending argument. Here you go — we can see the first five rows and the last five rows are organized by the Sex column. The first group is female; inside the female group, the rows are organized by age, in descending order. On the other hand, I can see that the last rows of the male group have missing values, presented by NaN. In that case, I can also use the dropna method before sorting the data — drop the missing values and then sort the data. This is an example of the method chaining that we saw: drop the missing values, and then the cleaned DataFrame will be used as input for the next method in the chain. Another option is to use the nlargest or nsmallest methods. The arguments are the number of rows and the column to be used for sorting. The following line will bring the top 10 rows using the Age column, meaning the 10 oldest passengers. The other option is nsmallest — just a different method with the same syntax.
30. Filtering: In this lecture, I would like to cover the important concept of filtering. Thanks to pandas, it is quite simple, even when performing filtering with complex conditions. One of the most useful ways to filter data is by using the concept of boolean selection. Boolean selection refers to selecting rows by providing a boolean value, True or False, for each row.
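Before moving on, the sorting and ranking options above can be sketched on a toy DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Anna", "Ben", "Carla", "Dan", "Eve"],
    "Sex": ["female", "male", "female", "male", "female"],
    "Age": [22.0, 35.0, np.nan, 41.0, 29.0],
})

# Sort by one column, descending (NaN rows go last by default)
oldest_first = df.sort_values("Age", ascending=False)

# Sort by two columns: first by Sex, then by Age descending inside each group
grouped_sort = df.sort_values(["Sex", "Age"], ascending=[True, False])

# Drop missing ages first, then sort -- a small method chain
clean_sorted = df.dropna().sort_values("Age", ascending=False)

# Shortcuts for "top N" / "bottom N" by a column
top2 = df.nlargest(2, "Age")
bottom2 = df.nsmallest(2, "Age")
```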
These boolean values are usually created by applying a boolean condition to one or more columns in the DataFrame. Using our Titanic dataset, let's say I would like to filter and keep only the female passengers from the whole DataFrame. So I can do something like df, choose the relevant column, which is Sex, and then apply a condition: equal to the value "female". Run it. The output will be a Series-type object with boolean values inside, based on this condition — for each row, there will be a value, True or False. Now I can take this boolean output and filter my original DataFrame, so I can write something like df with this condition inside — the condition we just saw over here — and here we go: it filters and keeps just the female passengers. Let's take another example: assuming I would like to filter all passengers above the age of 60. So again, I can write something like df and, inside, the condition: df['Age'] greater than 60 — and that's it, run it. Now, looking at the Age values, we can see they are all more than 60. What about using multiple filtering conditions? Let's say I would like to filter all passengers above the age of 60 and keep only the female passengers. It requires two conditions: the first condition is this one, and the second condition is this one. In order to combine those two conditions using an AND or an OR operation, we need to add parentheses to help Python understand the evaluation order. So it will be something like that — let's copy this one. But it still will not work. The additional thing we should replace is the "and" keyword: we need to change it to the ampersand character, and then we'll be able to run it. Now it's checking those two conditions, and we can use it to filter our original DataFrame. So I just write df and copy all this content inside, and that's it. Just filter.
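The boolean-selection syntax just described can be sketched with a toy DataFrame (invented passengers, not the real Titanic rows):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Anna", "Ben", "Carla", "Dan"],
    "Sex": ["female", "male", "female", "male"],
    "Age": [62.0, 35.0, 17.0, 71.0],
})

# A condition on a column yields a boolean Series, one True/False per row
is_female = df["Sex"] == "female"

# Use it to filter the rows of the original DataFrame
females = df[is_female]

# Combine conditions: parentheses are required, and & replaces "and"
older_females = df[(df["Age"] > 60) & (df["Sex"] == "female")]

# | replaces "or"
young_or_old = df[(df["Age"] > 60) | (df["Age"] < 20)]
```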
And I got only eight rows, meaning passengers that are female and also above the age of 60. Another example will be to get all passengers above the age of 60 or below the age of 20. So it will look something like that. If I am using an OR, again I need to use this special pipe character. Here we go. By the way, we can also use the loc option to filter rows and columns. I will put the condition result in another object, just to make it easier to read. So let's say I will create some condition variable, and then I will use it in the loc syntax. I am providing loc the condition inside, and the relevant columns that I would like to see. So this one is used to filter the rows, and this one is used to filter the columns. This is the syntax of the loc option.

31. Exercise #3 – Data Loading and Analysis:

32. Grouping and Aggregating: The next operation I would like to present as part of our data analysis is the option to aggregate rows by column values and then run some calculation. Aggregation is the process of grouping rows and converting them down to a single value by doing some calculation. It is done by using the groupby method. Let's see some examples by asking all kinds of questions. The first question, for example, is: what is the average ticket price for male compared to female passengers? So I can write something like that, and let's break this line into its building blocks. The first part, this one, is just a selection of two columns with all the rows in the DataFrame. Next, this one, I am using the groupby method while using the Sex column to make the grouping. We provide as an argument the aggregation column; in our case it is a single aggregation column. The output until this point will be a new groupby object, and then I can apply some aggregation function to each aggregated group.
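A minimal groupby sketch along these lines, with column names chosen to mirror the Titanic examples but made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "sex":  ["female", "male", "female", "male"],
    "fare": [100.0, 50.0, 80.0, 30.0],
})

# Select the columns, group by sex, then apply an aggregation function
avg_fare = df[["sex", "fare"]].groupby("sex").mean()

# With .agg we can apply several aggregation functions at once
stats = df.groupby("sex")["fare"].agg(["count", "mean", "max"])
```

The intermediate groupby object does nothing on its own; the calculation only happens once an aggregation function such as mean is applied to it.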
The aggregation function will define how the aggregation takes place, and there are some well-known aggregation functions that we can use, like sum, mean, max, count, standard deviation and so on. So our aggregation function here is mean, to get the average. For each group identified by the groupby, for female it will calculate the mean ticket price, and for male the average price. Let's see one more example. I would like to know what is the average age for the survived group. In that case, I will use the mean method as the aggregation function, and I will group by the Survived column. So if I am looking at the Age column, the groupby method created two groups, those that did not survive and those that did, and I can see the average age. Other questions are: how many survived, grouped by the embarked port? So it will be something like that. I am grouping by the Embarked column, then I am selecting just the Survived column, and I am asking to count the numbers here. What is the average age, as well as the average price, for male and female? In that scenario, I can just group by Sex and run the mean aggregation method, and it will calculate the mean value for all columns. But in our case I am only interested in the Age and the Fare price; I didn't select them specifically, even though I could do that. In any case, another nice option is to combine the groupby method with the agg method. In that case, we can specify multiple aggregation functions, and not just the single one that we saw until now. Let's say I would like to know the number of passengers, and the mean age, minimum age and maximum age, for each embarked port. So the group is Embarked, and I am interested in this column, the Age column, and I would like to run several aggregation functions. So I am using this special method called agg, and providing as an argument the list of aggregation functions that I would like to use. And here we go.
These are the three groups, S, C and Q, and just for the Age column it will calculate all those aggregation functions. That's it for this section. We covered many topics while analyzing a dataset, and now it is a great time to practice. So please review the next lecture, Exercise number 3, Data Loading and Analysis, and see you again in the next section.

33. Data Cleaning and Transformation - Overview: Hi and welcome back. In the previous section, we covered the structure of the one-dimensional Series data type and the two-dimensional DataFrame data type. We learned how to upload a tabular file into a DataFrame object and then preview the content and structure of that DataFrame. Also, we learned how to select a portion of the DataFrame, like a group of columns or rows, and also how to analyze the content of the DataFrame using all kinds of statistical and filtering methods. Now we are ready to move on to the next step, called data cleaning and transformation. In the real world, data is usually quite messy, and we will have to tweak it a little bit. In most cases, the original file we load into our DataFrame will not be 100% perfect or fully aligned to the task we would like to perform, and let me give you some examples. There is a high chance that some cells will be empty with missing values, or that some rows will be the same, meaning duplicated rows, and the column headers will not be descriptive enough or easy to use. Some columns may be redundant or less useful for a particular task, some values will be on a scale that is more complex to present or analyze, and more. The good news is that pandas, along with some Python built-in features, provides us with plenty of tools to clean, transform and rearrange data into the right, optimized structure. And this is our learning objective in this section: we would like to take the original DataFrame and make it more useful for our specific task.

34.
Removing Columns or Rows: Sometimes, after loading a dataset, we may want to remove specific columns or specific rows from a DataFrame. Let's see how it can be performed. I will load our Titanic dataset into a DataFrame, check the structure of the DataFrame and also access the shape attribute. Let's assume we would like to remove the Ticket and Fare columns, those two columns. For such an action, we can use the drop method with the following syntax. We need to provide a list of strings as the columns to be removed. This is the name of the method, drop, and this is the first argument, a list of strings with the columns that I would like to remove. We also need to set the axis parameter equal to 1, which is the index of the columns axis, and set the inplace parameter to True in order to perform the action on the original DataFrame. Okay, let's run it and access the DataFrame again, and also the shape attribute. Those two columns are gone. We can see it over here, 10 columns instead of 12, and I cannot see them anymore in the DataFrame. By the way, if you would like to keep the original DataFrame, then don't provide the inplace argument, as its default value is False. So I will upload the file again, and this time I will store the result in a new DataFrame, without changing the original DataFrame. In any case, I can work this way without storing the result in a new DataFrame object, while using method chaining as we saw before. So it will be something like that: drop, and I can add another method at the end and just see the result. The best workflow will be to first remove things from the DataFrame without changing the original DataFrame, and when we are sure about the change we would like to make, then we can set inplace to True and perform the change on the original DataFrame. Now, what about deleting rows?
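Before moving on to rows, the column-dropping calls above can be sketched like this, with toy columns standing in for the Titanic ones:

```python
import pandas as pd

# Illustrative stand-in for the Titanic DataFrame
df = pd.DataFrame({
    "name": ["Ann", "Ben"],
    "ticket": ["A1", "B2"],
    "fare": [100.0, 50.0],
})

# Copy-returning form: the original DataFrame is untouched
trimmed = df.drop(["ticket", "fare"], axis=1)

# In-place form: modifies df itself and returns None
df.drop(["ticket"], axis=1, inplace=True)
```

The copy-returning form is the safer default while experimenting; switch to inplace=True only once you are sure about the change.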
It is a little bit less common than removing columns, but we may need it. In that case, we are again using the drop method, but this time we will provide different arguments. I will explain them in a second, but first of all, let's check the amount of rows that we have right now, before making the change. Now I will use the drop method, and I am providing several arguments. The first one is the list of rows that I would like to delete. Now, instead of providing a manual list, let's say I would like to drop the first 100 rows, so I am using the range function to generate the list automatically. So this is the first thing that I am doing. Secondly, I am setting the axis to 0, which indicates that I would like to change something along the rows index, and inplace equal to True, meaning I would like to change the original DataFrame. I will run it, and if I check the amount of rows again, I will get 100 rows less than the original.

35. Removing Duplicate Rows: Let's open our original Titanic dataset in edit mode and then copy and paste the second line, this line, because the first line is the header. I copied and pasted it two times, which created two duplications of the first data line. Let's take another line of the file and create three duplications. So, in total, I have five duplications, and I will save that file with a "_duplication" suffix. Okay, I have a new CSV file. Let's go to the notebook, load that file and check the head. I can immediately see the duplicated lines, which can be in whatever location inside the DataFrame. So, as a first step, I would like to check for duplicated rows, meaning any row that has at least one identical row somewhere in the DataFrame, identical in the content of all columns, excluding the index of each row. So I will write something like that: DF dot duplicated. And we get a Series object with Boolean values, True or False.
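The duplicated check can be sketched like this, on a tiny table with deliberately repeated rows, echoing the edited CSV from the lecture:

```python
import pandas as pd

# Toy table: "Ann" appears twice, "Ben" three times
df = pd.DataFrame({
    "name": ["Ann", "Ann", "Ben", "Ben", "Ben"],
    "age":  [22, 22, 35, 35, 35],
})

# True for every row that repeats an earlier identical row
dup_mask = df.duplicated()

# Summing a Boolean Series counts the True values
n_dups = df.duplicated().sum()

# keep=False marks every occurrence of a duplicated row
all_occurrences = df.duplicated(keep=False).sum()
```

Here three rows repeat an earlier one, but with keep=False all five rows involved in a duplication are marked.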
It will return True if there is a previous identical row. Now, in order to understand how many lines are duplicated, I will use the sum method. So I will add sum, and I get that I have five duplicated rows, as expected. This is what I have done to the original CSV file. Let's see these duplicated rows using the loc method, passing as an argument what to filter. Now, DF dot duplicated returns a Boolean Series object that can be used as a filter, because any duplicated row will have a True value. So I can write something like that: DF dot loc, and I need to provide two arguments. The first one is to filter my rows, and I am using the DF dot duplicated result. And the second one is to filter the columns that I would like to see; right now it is going to display all columns. Here we go, those are the duplicated lines. So the idea is to pass the filtering condition for the rows as the first argument, and the second argument will select specific columns, or all columns by using the slice notation. It will show us the five rows that have previous duplicated rows. The first occurrence of a duplicated row will not be presented here; it presents only the repeats, from the second time that the duplication happens. I can further tune that a little bit more by providing a different argument to a parameter called keep inside the duplicated method, so it will be keep equal to False. Run that again. In that case, I am asking the duplicated method to show me every occurrence of any duplicated row. As a reminder, I duplicated two rows, five times in total, so it makes sense to get seven rows in total. Okay, these are the seven rows. I could go ahead and remove all of them, but in other cases we may want to keep one unique occurrence of any duplicated row. So I can set the keep parameter to one of two types of argument, "first" or "last", and I can change it to, for example, "last".
In that case, I am asking the duplicated method to show me the first occurrence of any duplicated row while keeping the last occurrence. It is a little bit confusing, but just look at the keep argument: I would like to keep the last occurrence. Let's say that I would like to remove those duplicated rows. Before changing anything, I will display the shape of the DataFrame, DF dot shape; we saw that several times, the amount of rows and columns. And now I am using another method called drop_duplicates, with a similar argument called keep, keeping the first occurrence, and I would like to see the result using shape. It just removed five rows, the five duplicated rows. Again, like in many other methods, we can change the original DataFrame while using the inplace parameter. So I can add here inplace equal to True, and it just changed the original DataFrame, and then I can check the shape again. Here we go, without the duplicated rows. What about duplications in a specific column? Instead of looking at all the columns, I am just looking at a specific column. So I will go again to the original file and simulate such a problem. Click Titanic, open it with a text editor, and I will copy, for example, this name and paste it into other rows. Let's do that twice. Okay, I just created a duplication in the Name column two times, so I have three passengers with the same name. Let's save the file, and this time I will call it duplication 2. Great. Go back to the notebook and let's load this file. Here we have it, DF dot head. I can immediately see the problem here, but again, it can happen in whatever rows in the dataset. We can check the duplication in a specific column and not all the columns. In that case, the syntax will be something like that: DF, and I am selecting only the Name column, then duplicated, and I am asking for the sum. So I am getting two duplicated rows with the same column value. Okay, they are not completely duplicated rows, just rows with the same value in one specific column.
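Putting it together, removing duplicates, either fully identical rows or rows matched by one column via the subset parameter, might look like this on toy data:

```python
import pandas as pd

# Two rows share the name "Ann" but differ in age,
# so no row is a full duplicate of another
df = pd.DataFrame({
    "name": ["Ann", "Ann", "Ben", "Cara"],
    "age":  [22, 30, 35, 35],
})

# Keep the first occurrence of each fully identical row
deduped = df.drop_duplicates(keep="first")

# Only compare the name column when deciding what counts as a duplicate
by_name = df.drop_duplicates(subset=["name"], keep="first")
```

Since no row is identical across all columns, the full check removes nothing, while the subset check drops the second "Ann" row.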
And from that point, the whole process is the same, so I can use drop_duplicates with only that column, with the keep parameter, and inplace equal to True. And let's check again for duplications: zero, no more duplicated rows when using the Name column. That's about checking and removing fully identical duplicated rows, or duplicated rows in the context of a specific column. We need to be a little bit more careful about duplicated values in a column, which are perfectly okay in some scenarios.

36. Renaming Column Labels: Another standard action will be to fix the column and row names to make them more clear and descriptive, and to follow some conventions, like maybe using a capital letter or an underscore between words, etc. Let's start with the first one, which is selective renaming. The most flexible method for renaming columns or rows is the rename method. I will upload our dataset again, the full dataset, and present the rows index labels, which are just a list of numbers, so nothing to rename here. Let's look at the columns index. This is coming from the header line in the file. Now, let's say that we would like to rename some columns after we upload the data into the DataFrame. For this action, we have a dedicated method called rename. We can use it to rename any number of columns, whether it is just one column or all columns. And this will be the syntax, a little bit long: DF dot rename. The first argument is columns; this is the parameter, and I am providing as a dictionary the list of columns that I would like to change. The simple syntax here is that the key is the old name and the value is the new name. So I am changing Name to Full Name, and I am changing Fare to Ticket Price. I would like to make that change in place, inside the original DataFrame. Let's check the change. Here we go, this is Ticket Price and Full Name. Now, assuming we would like to rename all column names at once, and not a specific column.
Then there is a simple option without using the rename method: we will just overwrite the columns attribute of the DataFrame. Let's see how it is done. I will first upload the original DataFrame again and present that attribute. It is basically a list, and I can change it. So, as a first step, I will create a list with the names of the new columns. Let's say this is the list of names that I would like to use for my columns, C1, C2, C3, up to C12. We need to supply all the column names, and then the only thing that I need to do is just DF dot columns equal to this new column list. Keep in mind that we need to use the slice notation; this is extremely important, to force Python to copy the list and not just copy the object reference. All I need to do is run it, and let's check again. Here we go. I can also present the DataFrame, and I will see the new labels on my columns index.

37. Dropping Missing Values: Our next step will be to identify missing values and then do something with those values. We have two options. The first one is to simply drop any rows with missing values, and the second option will be to fill in those missing values somehow; this is something that we will do in the next lecture. Keep in mind that removing rows from a DataFrame can damage the analysis and leave us with a useless dataset. For example, if I have 100 lines and I just removed 40 lines because of missing values, then I reduced my dataset by 40%, which is a lot, and as a result we may not have enough data to run our analysis. When we load data from a tabular file, a missing value will become NaN. This value stands for Not a Number, and it is usually ignored in mathematical operations in all kinds of methods. So let's load our dataset into a DataFrame, and let's start by identifying missing values with the isnull method: DF dot isnull. We already used that before.
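Both renaming approaches just covered can be sketched together, with made-up labels in place of the Titanic columns:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ann"], "fare": [100.0]})

# Selective renaming: dictionary keys are old names, values are new names
df.rename(columns={"name": "full_name", "fare": "ticket_price"}, inplace=True)
renamed = list(df.columns)

# Wholesale renaming: overwrite the columns attribute with a new list
new_labels = ["c1", "c2"]
df.columns = new_labels[:]  # the slice copies the list instead of sharing it
```

The rename method is safer when only a few columns change, because columns you do not mention are left untouched; overwriting the columns attribute requires supplying every label.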
We will get the DataFrame filled with Boolean values; the cells are True where the data is missing in a particular cell. In case our dataset is small, it is useful to present the missing data this way, but with a larger dataset such an inspection is less useful. So we can use the any method, which will provide a quick indication if we have at least one cell that is missing in some column. We can see that the Age, Cabin and Embarked columns have at least one missing value. Next, I will check the number of missing values for each column. This is the number of missing values, here, here and here. The Cabin column is missing many values, and it makes sense to remove it from the dataset. Another option would be to check the amount of non-missing values, if you would like, with notnull, which is just the opposite. Now I have two options to handle those missing values. I will start with the first, easy option, which is to drop those missing values. For dropping missing values, we will use the dropna method. The dropna method has all kinds of options; by default, it will drop any row that contains at least one missing value in one of the columns. So let's check the shape attribute before changing anything. This is the original amount of rows. Then let's run dropna and check the shape. If I use this simplistic approach, I will get 183 rows in our dataset instead of 891, which is not really recommended. In general, our target will be to keep as much data as possible. So a much better option in our case will be to first remove the problematic columns, which, as you remember, means the Cabin column, and only then drop the missing values. So I can do something like that: first of all, drop the Cabin column, which has many missing values, indicating that I am working on the columns axis, and with method chaining I will run dropna and get the shape. Now I get a much larger and cleaner dataset with 712 rows, much better. And I can also use the inplace argument to change the original DataFrame, so I can just remove the shape and add inplace equal to True. I will not run it, but this is the way to change the original DataFrame. If we want to drop a row only if all the values in that row are missing, then we will pass the "all" argument to the how parameter. So we have an option to do something like that. We don't have such a row in our dataset, a row where all the columns are missing values, so it didn't drop anything. If we want to drop a column only if all the values in that column are missing, then we will again pass the "all" argument to the how parameter, and in addition we will change the axis to 1. So it will look something like that: dropna with how equal to "all", but this time with the axis argument. Again, we don't have a column where all the values are missing, so it didn't drop any columns, as we can see here. In case we would like to be more selective and drop a row only when a value is missing in a specific column, then we can use the subset argument. So it will look something like that: dropna, using the subset parameter with a list as its argument, inspecting only those columns and not the whole DataFrame. The meaning here is to drop a row if either Age or Cabin has a missing value; it is ignoring missing values in other columns. If we would like to drop a row only if a value is missing in the Age column and also in the Cabin column in the same row, then we will add the how argument. So I am going to use the following syntax with how equal to "all". In that case, fewer rows are dropped, only the rows where both Age and Cabin are missing a value in the same row. This is the first option to consider, dropping rows or columns with missing values. The next possible option is about filling the missing values in different ways.
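A compact sketch of the dropna variations covered above, using toy data where None stands in for missing cells:

```python
import pandas as pd

# Illustrative data; None becomes NaN inside the DataFrame
df = pd.DataFrame({
    "age":   [22, None, 35, None],
    "cabin": ["C1", None, None, "C4"],
})

no_missing  = df.dropna()                    # drop rows with any missing value
all_missing = df.dropna(how="all")           # drop a row only if every value is missing
by_age      = df.dropna(subset=["age"])      # inspect only the age column
```

The default is the most aggressive option, so it pays to check how many rows each variant would remove before committing with inplace=True.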
And that's the topic of our next lecture.

38. Filling in Missing Values: Sometimes a few missing values in a column, or a few missing values in a row, will not justify the removal of the column or the row. So, instead of filtering out those missing values, the other option will be to fill in the holes we have in the data. There is a dedicated method called fillna, and let's see how we can use it. As the first step, I will upload the dataset again, and we will start with numbers. I can call fillna with a constant value, which will replace all missing values in the DataFrame with that particular value. So I can use the DataFrame dot fillna and the value that I would like to fill in. As you can see, there were many missing values, for example in the Cabin column, and now they are zero. So this is option number one. Or we can use the mean value of a specific column. For example, I can calculate the average age and fill the missing values in that column with this calculation. Let me show you; this is an important example. I would like to fill any empty cell in the Age column, and I am using fillna, where the value that I would like to fill in is basically a calculation, the average of the Age over the rows where we have information. So that would be the process to do that. By the way, if I want to change the actual DataFrame, I will use the inplace equal to True argument, as we saw before. What about a string column with missing values? Let's say I am checking a specific column, and using value_counts I will be able to see how many rows I have with each value in that column. Let's do that for the Embarked column. By default, value_counts will drop the missing values, so let's bypass that and add a parameter called dropna equal to False, and run it again.
Now I have another line over here: we have only two rows with missing values for that column. Now I want to take all the rows with missing values and give them the following value. I will do something like that: access the Embarked column with fillna, and the value I would like to put inside is "no boat information". Okay, that's it. And to change the DataFrame, I will use the inplace argument, inplace equal to True. Let's run this one again. Now I can see this new group, "no boat information", for those two lines that I changed, which were missing values.

39. Creating Dummy Variables: As part of the data preparation step, we may need to create something that is called dummy variables. Dummy variables are columns we add to our DataFrame to represent categorical features with numbers instead of text. As a reminder, a categorical feature can hold a limited number of possible text values. For example, a gender can be female or male, or a country has a limited amount of possible values, or maybe a vendor name, etc. Now, there are some machine learning algorithms whose input must be numerical values; they cannot handle text. So the idea will be to take categorical features and transform them into new features that are based on numbers, usually Boolean numbers like zero or one. Let's assume you have a column in the DataFrame that describes the gender category of a person; it can be male or female. By using our Titanic dataset, I know that I have a column like that, which is called Sex, with string values, male and female. Now we will create a new column that will map those two string values into numbers, by using a dedicated function called map, and the syntax will be like that. This is the new column that I would like to create, called gender_female, and I am accessing the Sex column.
And I am running a function called map, and it takes a dictionary that describes how to map the values in that column: male will be equal to zero, female will be equal to one. That simple. Let's present the DataFrame again. I added a new column over here, gender_female, and the idea is that if it is a female, it will be one, and we can see how it is correlated to the Sex column. Now, another step that we should do: we should now remove the original feature, Sex, if this DataFrame is going to be an input to some algorithm, because we don't need duplicated information in our dataset. So I can run something like that: DF drop Sex, with axis equal to 1 and inplace True, and present the DataFrame again, and the Sex column is not here anymore. This is the first, simple option to create a dummy variable, using the map method. Let's see the next option. The next, more powerful option is by using the function called get_dummies, and the easiest way to use this function is by providing a DataFrame object and then a list of columns for which we want to create dummy variables. Please note that this is a function that we access from the pandas module, not from a specific DataFrame object, meaning it is not a method. So, as a first step, let's load the original dataset into a DataFrame. I will remove some columns so it will be easy to see the result. I will remove all those columns and present our dataset, with much fewer columns. So, in this example, we have a categorical feature called Sex, and let's say I would like to create a dummy variable for this categorical feature. All I need to do is run this get_dummies function with the DataFrame and the list of columns for which I would like to create dummy variables; I just want this one. So this function will take the categorical feature and then create one column for every possible value inside that categorical feature.
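The two encoding options, map and get_dummies, can be sketched side by side on toy data:

```python
import pandas as pd

# Illustrative stand-in for the Titanic Sex column
df = pd.DataFrame({"sex": ["male", "female", "female"]})

# Option 1: map string values to numbers with a dictionary
df["gender_female"] = df["sex"].map({"male": 0, "female": 1})

# Option 2: let pandas build one column per category
dummies = pd.get_dummies(df[["sex"]], columns=["sex"])
```

Note that get_dummies is called on the pandas module and receives the DataFrame as an argument, exactly as described above; it replaces the listed columns with their dummy columns in the result.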
So, when running this function on the Sex column, it creates two Boolean columns, called Sex_female and Sex_male, because there are two possible values. The name of the column is used as a prefix; the Sex with an underscore is used as the prefix. Now, for each row, we will know if it is a male or a female by looking for the one value. This is a male, this is a female, and so on. But if you think about it, we can understand if it is a male or a female by using only one column, as we already did before while using the first option. One column is capturing all the required information. Generally speaking, if we have K possible values for a categorical feature, we will need K minus 1 dummy variables. So, back to our simple example: I can just drop the first column by using a specific argument called drop_first and setting it to True. So it will be like that, adding drop_first equal to True, and I am getting just one column. That's it. For every categorical feature, it will drop the first dummy variable, because we need K minus 1 dummy variables for each feature. Another nice thing that was done automatically is the removal of the original column; I don't see the Sex column anymore. The last step will be to override the same DataFrame with the result, or to store it in a new DataFrame. Let's say I would like to store it in a new DataFrame, so I will run it like that and present the result of the new DataFrame. That's it; I added the needed column. Let's see another example. We know that the Embarked column is also some kind of categorical feature, because it has three options. As the first step, let's review the unique values; we can see S, C, Q and some missing values. So, as a first step, I will drop the missing values and run it.
Now I have only three options, S, C and Q. We don't have those missing values anymore, and I can move on with creating dummy variables using the get_dummies function. Embarked has three possible values for this categorical feature, so I am expecting to get three minus one dummy variables, meaning two. So I will run it with this information. Here we go: we have two new columns, Embarked_Q and Embarked_S, replacing the original Embarked column. That's it. It is a very powerful and flexible function for creating dummy variables.

40. Exporting Pandas DataFrames: The last step we may want to perform will be to save the cleaned and transformed new DataFrame back to a CSV file, so we will be able to use it as input data for the next step in our project. The process of saving a dataset into a file is, again, very simple thanks to pandas. We will use the to_csv method to save the DataFrame to a comma-separated file. The syntax is quite simple: new_df dot to_csv, this is the name of the method, and that will be the name of the file. And I can see that a new file was just added over here. Let's open that file. The first column is the row index, and typically we may want to avoid storing the row numbers in the output file. In that case, we can set a specific parameter to avoid storing that index information, using this syntax, index equal to False. We overwrite that file and open it again, and here we go, we don't have it anymore. We can also remove the header line, if you would like, using this syntax, header equal to False. Here is another file, and there is no header anymore. If we have a useful list of column labels, it does not make sense to remove the header, so it is a case-by-case decision. That's about saving the new DataFrame into a file, a very simple process.
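The export step sketched below writes to a temporary directory so the example is self-contained; the lecture simply writes into the working folder:

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Ben"], "age": [22, 35]})

out_dir = tempfile.mkdtemp()
path = os.path.join(out_dir, "cleaned.csv")

# index=False leaves the row numbers out of the file
df.to_csv(path, index=False)

with open(path) as f:
    first_line = f.readline().strip()
```

With index=False the first line of the file is just the column header, ready to be read back in without a spurious unnamed column.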
One of the primary use cases for saving data to files after we have manipulated it is when we would like to use it as a training dataset with some supervised machine learning algorithm, something we will cover in future levels. Thanks for watching so far. Please practice a little bit using the provided exercise, Exercise number 4, Data Cleaning and Transformation, and see you again in the Course Summary section.

41. Exercise #4 – Data Cleaning and Transformation:

42. Course Summary - Let's Recap: Hi and welcome back to our last section in this training. I want to recap the things we covered so far, to create an end-to-end story, and to talk about some of the next steps. We started by installing the Anaconda package and then learned to use the interactive development environment, meaning JupyterLab, for creating Jupyter notebooks. Those notebooks are handy for developing, storing and sharing the steps in data science projects, and we will keep using them in future levels. We covered the fundamentals of the Python language while focusing on the topics needed to develop data science projects: topics such as the Python syntax, data structures like lists and dictionaries, built-in functions, if-else statements, for loops, classes, objects, attributes, methods and more. The next step was to focus our attention on a specific data science library called pandas, which is a very popular Python library for data exploration, data analysis and data manipulation. It is the first tool when working with large datasets. We learned how to use pandas for loading a tabular data file into a DataFrame data structure, then how to preview the DataFrame using all kinds of methods, selecting rows and columns, sorting and ranking, handling missing data, filtering, cleaning, aggregating, performing data transformations, creating dummy variables, dropping missing values and much more. This solid background about pandas will be essential for us.
Also, in future levels, we are planning to use it as a data preparation step before applying machine learning algorithms. It is a critical step of many data science projects, and again, pandas is the tool to perform such preparation. That's it; it was a quick recap to connect the dots. I want to thank you for watching this training. I hope that you enjoyed it and learned some interesting things along the way. It will be awesome and useful if you rate the course and share your experience. If you would like to continue your learning path about machine learning, please check if Level 3 is already available. Thanks again, and I hope to see you again in my next training course. Bye bye.