Data Science in Python | Vishal Rajput | Skillshare

Lessons in This Class

  • 1. Introduction to Data Science Course (3:05)
  • 2. Exploring Kaggle Datasets (5:39)
  • 3. Data Preprocessing using Pandas (29:39)
  • 4. Numpy Arrays (47:17)
  • 5. Numpy Functions in Python (18:24)
  • 6. Statistics for Data science (24:07)

34 Students

About This Class

Data science is one of the fastest-emerging fields in IT. Learn data science by implementing its concepts in Python and become a data scientist.

What will you learn?

  • What is Data Science?
  • Data preprocessing techniques
  • Data Aggregation
  • Data Sampling
  • Python

Meet Your Teacher

Vishal Rajput

Programming Instructor from India

Teacher

I am a software developer with 4 years of experience in making products and working for startups.

I am a passionate teacher and educator at ThinkX Academy. I have experience in making good content that helps students learn programming and get jobs in the IT sector or build their own products.

Enroll in my classes to fall in love with programming!!

Happy Coding :)

Level: Intermediate

Transcripts

1. Introduction to Data Science Course: Hi everyone, welcome to the Data Science course. In this course, we are going to cover all the practical aspects of data science. This is a project-based course: we will be building a project and applying the concepts of data science to it. By the end of this course, you will have a hold on all the important concepts of data science, which include data analysis, data preprocessing, and visualization techniques. So first, what exactly is data science? I will also give you an overview of what we are going to cover in this course. Data science is about extracting knowledge and insights from noisy and unstructured data using algorithms and processes. Many companies and industries use different types of data, and they have millions of records. In order to structure them and extract knowledge for the benefit of their businesses, they require data visualization and data preprocessing techniques. Data science is a growing and emerging field with a lot of opportunities for data scientists, and it is on the rise in the industry. During this course, I will try my best to give you hands-on experience of how to implement the concepts of data science. Let's start with the prerequisites for this course. The first important prerequisite is Python programming. With some basics of Python, you will be able to understand these tutorials, and after that you will be able to get started with data science. The next requirement is dedication: since data analysis is a demanding task, it is important that you have a good amount of dedication.
You need it to understand what the dataset is and which techniques you will have to apply to it. Every dataset is going to be different from the others, so there have to be different ways and techniques to process and analyze that data, and that requires a lot of dedication. That is the main reason why this is an emerging field. Now let's see the tools that we will be using throughout this course. We will be using Python programming and some Python libraries. We can use Jupyter Notebook, which is an IDE-like environment where you can write Python code to analyze different datasets. Then there is Kaggle, where you can download datasets and take a look at what exactly we can do with them. Then there are some important libraries like pandas and scikit-learn, which are very useful for preprocessing data. Then we have Matplotlib, which is used for data visualization, and some advanced libraries like TensorFlow, used in deep learning. We will also do some classification, like random forest classification and decision trees, with some machine learning models. These are the practical things that we are going to cover in this course. So, see you in the next tutorial. Thanks for watching.

2. Exploring Kaggle Datasets: This is the first video of this data science course. We are going to perform a lot of things like data preprocessing, data visualization, data sampling, aggregation, and dimensionality reduction. But before starting with any of the concepts, I want to familiarize you with the Kaggle website, which is kaggle.com. I am going to show you the advantages of using Kaggle and why every data science enthusiast uses it.
First of all, you just need to go to kaggle.com. Kaggle is a website which provides a lot of datasets and other things that are really helpful for students who want to learn and for those who want to compete, since there are competitions as well. On the bar here we have Datasets. If I click on it, it shows a list of trending datasets, such as Google stock data. There are also a lot of popular datasets like Boston Housing, NSL-KDD, and Mobile Price Classification. Throughout the course, we are going to perform a lot of things like data preprocessing, and for that we are going to pick up different datasets from Kaggle. The reason for doing so is that different types of datasets present different types of challenges, and that really helps you to improve your data science skills while you are studying. For example, a dataset may have some missing (NA) values in it, or some duplicate items, or other challenges. If you pick up different datasets, you get good exposure to the different types of challenges that you might face in your data science career. So we will be picking up various datasets, and I will be choosing a dataset for showing you certain concepts. Kaggle also has some courses, and there are other things like competitions as well. Basically, what GitHub is for developers, Kaggle is for data science. We have some competitions here, and these are prize-based as well.
So you can actually get some money if you participate and win these competitions. You can see some of the competitions here. Then we have Courses and Code. The interesting part of Kaggle is this: you are going to use Python programming for writing and doing all the data preprocessing and analysis, and for that you do not even need to set up a whole Python environment. You can just click on Create and then on New Notebook to create your own notebook. That is one very interesting thing. This notebook is a Python environment that also gives you access to the pandas library and scikit-learn, so you do not need to install them manually on your laptop. Here we also have user rankings, progression, and so on, which take some time to load. Now let's take a look at a dataset. Here we have Boston Housing, so I am going to click on this one, and it will open in a minute. We have different types of datasets, and not only datasets: the Kaggle community also posts a lot of their own notebooks and their ways of analyzing data. You can open these notebooks and learn from them. It is really a very good thing, just like reading someone's code on GitHub and learning from it. I think I need to refresh it. It is open now. Here it shows the data. The data is in .csv format, and this is the whole dataset. The interesting part is the Code tab. If you go to Code, you will see that a lot of people from the Kaggle community have published notebooks to visualize or preprocess this data.
So let's click on this one. It shows that this is the notebook, and these are the libraries that it uses. We are going to use them throughout this course. I am just giving you a gist of how you can use Kaggle, because it is a really important tool for every data scientist. You can see pandas, Matplotlib, and seaborn for data visualization, and SciPy. This person has written their own code to visualize, understand, and plot the dataset. You can see everything here, including the code. So it is a very good platform to learn data science. I will be teaching all of these functions and concepts to you, so you do not need to worry about them. I am just giving you an idea of what you can expect from the Kaggle website. So that's all for this tutorial. In the next tutorial, we will start by picking up a dataset and importing it using the pandas library in Python. Thanks for watching.

3. Data Preprocessing using Pandas: In this video, we are going to discuss some of the very important functions of the pandas library, and what we can do with Python and pandas for data analysis and for extracting knowledge from a given dataset. This is going to be a very important tutorial. Here we are going to cover some of the very basic functions, and as we move on in this course, we will explore more complex tasks that can be done easily using pandas. First things first, we need to start Jupyter Notebook. Here I have Jupyter Notebook open, and we need to create a new notebook using Python 3. I have already created one.
We just need to open it. Now we can write Python code here and perform all the data science tasks that we want. The next thing is to download the dataset. You can download any dataset of your choice. I have downloaded the Boston Housing Prices dataset, which is available on kaggle.com, and I will also give the link to it in the description of this video so you can download it from there. Here I have a folder named housing data, and in it we have housing.csv. First we have to see what this dataset looks like, so I am going to open it. We can see that there are some columns and a lot of rows in this dataset. What we are going to do is create a DataFrame, which will import the whole of housing.csv into Jupyter Notebook. Whatever we do to that DataFrame is not going to affect our original data: the DataFrame is a copy held in memory, so even if we do something wrong in our Python code, the file stays as it is. So first things first, let's import the pandas library. We use the command import pandas, and we can assign it an alias, which is pd. Whenever we want to use the pandas library, we can use the pd shortcut. The first thing we need is a DataFrame. Let's call it df. This DataFrame will hold the dataset from housing.csv. To import the dataset, we use the pd.read_csv function, which is a built-in function of the pandas library. Here we specify the path to the data, which is in the housing data folder, and the name of the file is housing.csv. So now I have imported the whole dataset.
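The import step described above can be sketched as follows. To keep the example self-contained, it reads a few made-up rows from an in-memory string instead of the real file. The column names are the ones mentioned later in the lesson, the values are invented for illustration, and the file path shown in the comment is an assumption about how the folder is named on disk.

```python
import io

import pandas as pd

# Made-up sample rows standing in for housing.csv (values are illustrative only).
csv_text = """RM,LSTAT,PTRATIO,MEDV
6.5,5.0,15.0,500000
6.4,9.0,18.0,450000
7.2,4.0,18.0,730000
"""

# With the real file you would pass its path instead, for example
# pd.read_csv("housing_data/housing.csv")  (the folder name is an assumption).
df = pd.read_csv(io.StringIO(csv_text))

# Changing the DataFrame only changes the in-memory copy, not the source.
df.loc[0, "RM"] = 0.0
fresh = pd.read_csv(io.StringIO(csv_text))

print(df.loc[0, "RM"])     # value changed in the DataFrame
print(fresh.loc[0, "RM"])  # value re-read from the untouched source
```

Re-reading the source after the change shows that the original data is untouched, which is the point the lesson makes about working on a DataFrame.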
The df variable now holds the whole dataset, that is, the rows and columns of housing.csv. The first task is to look at the first, let's say, five rows of this dataset. For that we can use a very useful function, head, which returns the starting rows. I can specify any number here, so let's specify five. If I hit Control+Enter, it shows me the first five rows of the DataFrame df. Remember, I have already told you that if we change any value in df, it is not going to affect the data in housing.csv. If we did want to change the file, there is a separate way to do that. The main point is that we can do anything here, such as data preprocessing tasks, without affecting the original data. So this is how we can extract the first five rows using the head function. Now let's move on to the next very important function, tail, which gives us the last rows of the data, let's say the last five. We are doing this because when we are given a dataset, it is very important to inspect it well. We will use these functions very often in our data science programs. Say I have performed some data preprocessing task and want to see what has changed in the dataset. I do not need to display the whole dataset: I can display just the first four or five rows and get an idea of how it has changed. Now let's find the last five rows of this dataset. When I hit Control+Enter, you can see the last five rows.
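A minimal sketch of head and tail, using a small made-up DataFrame (eight invented rows, two of the lesson's column names) so that it runs without the CSV file:

```python
import pandas as pd

# Made-up values; only the row positions matter for this example.
df = pd.DataFrame({
    "RM": [6.5, 6.4, 7.2, 7.0, 7.1, 6.4, 6.0, 6.2],
    "MEDV": [500, 450, 730, 700, 760, 600, 480, 570],
})

first_five = df.head(5)   # rows with index 0..4
last_five = df.tail(5)    # rows with index 3..7
default_head = df.head()  # with no argument, head also returns five rows

print(first_five)
print(last_five)
```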
Here you can see that the last row has index 488. Let's move on to some other functions. The next important one is shape, which is not actually a function but a property. If I run df.shape with Control+Enter, the result states that there are 489 rows and four columns. This is a very important property because most of the time we are not going to open the whole dataset; we can just check shape to find the number of rows and columns. Now let's move on to a very important function, describe. I will hit Control+Enter. describe is a very important tool that data scientists use to understand a dataset. It summarizes the four columns of this dataset, which are RM, LSTAT, PTRATIO, and MEDV. For each column it shows count and mean. std stands for standard deviation. We will study the terms mean and standard deviation in a separate video because they are very important in data science. Then we have min, 25%, 50%, 75%, and max. In other words, it describes the whole dataset column by column. For example, for the RM column the mean is about 6.24, the standard deviation is about 0.64, and the minimum is about 3.56. These values are also crucial in data visualization: when we want to visualize our data, we need to look at the standard deviations, the means, and so on. So that is the describe function. Let's move on to some other crucial functions.
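As a small illustration of shape and describe, here is a sketch on a made-up four-row DataFrame (the numbers are invented so the statistics are easy to check by hand; they are not the housing data):

```python
import pandas as pd

# Made-up values chosen so the summary statistics are easy to verify.
df = pd.DataFrame({
    "RM": [5.0, 6.0, 7.0, 8.0],
    "MEDV": [400.0, 500.0, 600.0, 700.0],
})

# shape is a property, so it is written without parentheses.
print(df.shape)  # (number of rows, number of columns)

# describe returns a DataFrame with one summary row per statistic.
summary = df.describe()
print(summary)

rm_mean = summary.loc["mean", "RM"]
rm_std = summary.loc["std", "RM"]
```

Because describe returns an ordinary DataFrame, individual statistics can be picked out with loc, as the last two lines show.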
The next function is drop, called as df.drop. Here we specify a column. Let's say I want to drop the RM column, so I specify RM. Then I specify the axis. axis=1 means a column; axis=0 would mean a row. I have specified axis=1 because RM is a column; if it were a row, I would write 0. If I hit Control+Enter, you can see that the output has only three columns: RM has been removed. Now, one important thing to notice: if I call df.head again and hit Control+Enter, RM is still there in the DataFrame. That is because drop returns a new DataFrame and leaves df itself unchanged. So we have to do something more for the column to actually be removed from df. There are two ways to do that. We can reassign df, as in df = df.drop(...). If I now call df.head, you can see that the RM column has been removed. That is one way, but there is a more convenient one. Instead of writing df = ..., we can pass another parameter, inplace=True. inplace=True means the column is removed from the DataFrame in place. If I hit Control+Enter, you can clearly see that RM has been removed. That is the use of the inplace parameter. Now let's move on to some other important things. The next one is that we can also write df[:3]. I will hit Control+Enter.
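The behaviour of drop described here, together with the df[:3] shorthand, can be sketched like this. The DataFrame is made up, and the names without_rm, reassigned, in_place, and first_three are only illustrative:

```python
import pandas as pd

# Made-up sample rows using the lesson's column names.
df = pd.DataFrame({
    "RM": [6.5, 6.4, 7.2, 7.0],
    "LSTAT": [5.0, 9.0, 4.0, 3.0],
    "PTRATIO": [15.0, 18.0, 18.0, 19.0],
    "MEDV": [500, 450, 730, 700],
})

# drop returns a new DataFrame; df itself still has the RM column.
without_rm = df.drop("RM", axis=1)   # axis=1 -> column, axis=0 -> row
print("RM" in df.columns)            # still True

# Option 1: reassign the result.
reassigned = df.copy()
reassigned = reassigned.drop("RM", axis=1)

# Option 2: modify the DataFrame in place (the call then returns None).
in_place = df.copy()
in_place.drop("RM", axis=1, inplace=True)

# Slice shorthand: the first three rows, the same rows as df.head(3).
first_three = df[:3]
```

The copies are used only so that each option starts from the same four columns.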
You can see that it gives me the first three rows. So instead of using head, we can also use this expression: df, then square brackets containing a colon followed by three. This is another way of doing it if you do not want to write the head function, and it is quicker to type. Now let's move on to some other tasks, like deleting the first n columns. Say you want to remove the first two or three columns from a dataset. How will you do that? We are going to use the drop function again, but the parameters are going to change. First I specify the columns parameter, which takes all the columns that I want to delete. The way to do that is columns=df.columns followed by a slice. If I want to delete the first two columns, I write a colon and then two, that is, df.columns[:2]. The next parameter is axis. Remember to always specify the axis, because it says whether we are dealing with rows or columns: for columns the axis is 1, and for rows it is 0. The last one is the inplace parameter, which I set to True. Now if I run df.head, you can see that the first two columns have been removed from our DataFrame. The first two are selected by the colon and two, and the column names come from df.columns, which is a property of the DataFrame. Now let's see how to delete the last n columns. Say you have a task where you want to delete the last two or three columns from the dataset. Again, we will use the drop function.
These are all variations of the drop function, and they are very crucial because you will be using them daily when you are analyzing datasets. To delete the last n columns, we again have to specify the columns we want to drop, and for that I will again use df.columns. Here I specify, let's say, -1, then the axis, which is 1 because we are deleting columns, and inplace=True. Now let's take a look at the DataFrame. When I specify -1, it deletes one column from the end: the last column, MEDV, is now deleted. If I write -2 here and run it again, you can see that the column at index -2 is removed. An index without a colon removes only the single column at that position. So we have to use a colon; otherwise it is not going to delete the last n columns, only the column at index -2. If I add a colon and run it, it deletes a range of columns rather than one: you can see that the first two columns were deleted. Let's see what happens if I write -2 with the colon at the end. Now the last two columns have been deleted. Try out these different variations to see exactly what they do to the DataFrame, because it is very important. df.columns[:2] deletes the first two columns, and df.columns[-2:] deletes the last two. I say DataFrame and not dataset because the dataset itself stays intact. Now let's move on to deleting rows from our DataFrame. This is our DataFrame.
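The column-slice variations discussed above can be compared side by side. This is a sketch on a made-up two-row DataFrame; it uses the non-in-place form of drop so that every variation starts from the same four columns, and the result names are illustrative:

```python
import pandas as pd

# Made-up sample rows using the lesson's column names.
df = pd.DataFrame({
    "RM": [6.5, 6.4],
    "LSTAT": [5.0, 9.0],
    "PTRATIO": [15.0, 18.0],
    "MEDV": [500, 450],
})

# df.columns[:2] -> the first two column names.
drop_first_two = df.drop(columns=df.columns[:2])

# df.columns[-2:] -> the last two column names.
drop_last_two = df.drop(columns=df.columns[-2:])

# df.columns[-2] (no colon) -> a single name, so only that column goes.
drop_single = df.drop(columns=df.columns[-2])

print(list(drop_first_two.columns))
print(list(drop_last_two.columns))
print(list(drop_single.columns))
```

When the columns= keyword is used, pandas already knows the labels refer to columns, so axis=1 is not needed in this form.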
Let's say we want to delete the first n rows from this DataFrame. Again, we will use the drop function, which is a very important function for deletion: df.drop. Remember that here we do not want to delete columns, so we are not going to use columns=. We are going to use df.head instead. Let's say we want to delete the first three rows, so I pass df.head(3). The next important parameter is the axis. Since we are deleting rows, the axis is 0. The last parameter is inplace, which is True. Let's run this. You can see that it raises an error saying the labels were not found in the axis. What we have to write is df.head(3).index; only then will it delete the first three rows. df.head(3) is the first three rows of the DataFrame, .index gives their index labels, and drop removes the rows with those labels. Now when I display the first few rows of our DataFrame, the index no longer starts at 0, 1, 2, 3; it starts at 3, 4, 5, 6. If I specify five instead, the first five rows are deleted in the same way. Now let's see how to delete the last n rows. I can do that right here just by using the tail method. df.tail(5) gives us the last five rows of the DataFrame, .index gives us their index labels, and drop removes them. So I write tail instead of head. Our DataFrame initially ended at index 488, and now the last index is 483, which means that five rows have been removed from the end.
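A minimal sketch of deleting rows from the start and the end, assuming a made-up eight-row DataFrame with the default 0..7 index (the names without_first_three and without_last_two are illustrative):

```python
import pandas as pd

# Made-up values; the default index runs from 0 to 7.
df = pd.DataFrame({
    "RM": [6.5, 6.4, 7.2, 7.0, 7.1, 6.4, 6.0, 6.2],
    "MEDV": [500, 450, 730, 700, 760, 600, 480, 570],
})

# .index turns the selected rows into the labels that drop expects.
without_first_three = df.drop(df.head(3).index, axis=0)
without_last_two = df.drop(df.tail(2).index, axis=0)

print(list(without_first_three.index))  # remaining labels start at 3
print(list(without_last_two.index))     # remaining labels stop at 5
```

The remaining rows keep their original labels, which is why the index in the lesson starts at 3 after the first three rows are dropped.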
So this is how we can delete the first n rows and the last n rows, using df.head(n).index and df.tail(n).index. I am going to comment this out, because we are now moving on to some other important functions. Let's see how we can sort by a column. Sorting rows by a column, whether by names or by numerical values, in increasing or decreasing order, is very important, and you are going to do it very frequently. For that we have a simple function, sort_values. We write df.sort_values and specify the column with the by parameter. Let's say we want to sort by the RM column. Right now the RM column has values like 5.796 and then 5.859. After sorting, the rows will be in ascending order of RM. So we pass by="RM". The next thing we specify is the inplace parameter, which will be True. Then I display the result; let's use df.head here. You can see that the RM column is now sorted: it goes 3.561, 3.863, and so on. This is the main use of sort_values: it reorders all the rows by the values of the RM column. Now, let's say we have a DataFrame with some duplicate rows and we want to drop them. I am going to comment out the previous cells. To drop the duplicate items from a DataFrame there is again a very simple function, df.drop_duplicates. We write inplace=True.
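Since the housing data has no duplicates, here is a sketch on a made-up DataFrame that does contain one repeated row, so that both sort_values and drop_duplicates can be seen in action (all values are invented):

```python
import pandas as pd

# Made-up values; the rows at index 0 and 2 are exact duplicates.
df = pd.DataFrame({
    "RM": [6.5, 3.9, 6.5, 5.8, 3.6],
    "MEDV": [500, 230, 500, 410, 270],
})

# Sort the rows by RM in ascending order, modifying df in place.
df.sort_values(by="RM", inplace=True)
print(df)

# Remove rows that repeat an earlier row in every column.
df.drop_duplicates(inplace=True)
print(df)
```

By default drop_duplicates compares whole rows, so two rows that share only an RM value would both be kept.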
It removes all the duplicate rows from the DataFrame. Since this dataset does not have any duplicates, we cannot see that in action. So here is a very simple task for you: open the housing.csv file, create some duplicate rows in it, and then use df.drop_duplicates. You can then see how it has dropped those duplicates. inplace=True means that it makes the changes in the original DataFrame. All right, so these were some of the important functions for deleting, sorting, and so on. Now for a very important task, known as slicing. Slicing involves two very important locators, loc and iloc. loc basically means location. These two attributes of the pandas DataFrame, loc and iloc, help us slice rows and columns. Sometimes when you are analyzing a dataset you do not want to analyze the whole of it, only a small portion. You slice that portion out so that you can visualize it and do a lot of other things with it. Let's see how to use loc and iloc. First we use df.loc, and we specify 0:4 for the rows. Then I specify the column names, which are RM and, let's say, one more column, LSTAT. This slices the dataset: 0:4 means the rows labelled 0 through 4, and the columns I want are RM and LSTAT. I first mistyped the column name, which was a mistake; it is LSTAT. Now I hit Control+Enter, and you can see the sliced portion of the dataset.
It has picked only these two columns, RM and LSTAT, and the rows range from 0 to 4. I can change that to, let's say, 2:6. If I hit Enter, you get rows 2, 3, 4, 5, 6: with loc, both ends of the range are included. So the rows are from 2 to 6 and the columns are RM and LSTAT. loc is very important. It is not actually a function; it is a locator. It locates rows and columns by their labels and slices them out. We can also assign the result back, as in df = df.loc[...]. If I then display df.head, it shows me the slice. So we can slice the DataFrame and reassign it if we want to. We will be using loc whenever we want to analyze just a small part of the dataset. Now we have another locator, iloc. iloc is basically the same as loc, with one major difference: it does not take string values. It uses only integer positions to locate rows and columns. So instead of names like RM and LSTAT, we have to specify numbers. Let's run a very simple command with iloc, with rows 0:4. It slices out the first four rows of the dataset. Here we cannot specify the column names themselves; we slice the columns by position in the same way. If I write :2, :4 and hit Enter, okay, we do not have four columns here, so let's write :3. Now hit Enter. Here :2 means the first two rows, and :3 means the first three columns. If I write 3: instead, it selects the columns from position 3 onward. You can compare this with what we did for drop earlier, where we used :2 to specify the first n columns and -2: to specify the last n columns.
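The difference between the two locators can be sketched as follows, on a made-up eight-row DataFrame with the lesson's column names. The point to notice is that a loc range includes its end label, while an iloc range stops before its end position:

```python
import pandas as pd

# Made-up sample rows; the default index runs from 0 to 7.
df = pd.DataFrame({
    "RM": [6.5, 6.4, 7.2, 7.0, 7.1, 6.4, 6.0, 6.2],
    "LSTAT": [5.0, 9.0, 4.0, 3.0, 5.0, 5.0, 12.0, 19.0],
    "PTRATIO": [15.0, 18.0, 18.0, 19.0, 19.0, 19.0, 15.0, 15.0],
    "MEDV": [500, 450, 730, 700, 760, 600, 480, 570],
})

# loc works with labels: rows labelled 0 through 4 (five rows), named columns.
by_label = df.loc[0:4, ["RM", "LSTAT"]]

# iloc works with integer positions: rows 0..3 (four rows), columns 0..1.
by_position = df.iloc[0:4, 0:2]

# First two rows, first three columns.
first_block = df.iloc[:2, :3]

# Columns from position 3 onward: only MEDV in a four-column frame.
from_position_3 = df.iloc[:2, 3:]

# Columns from position 1 onward: the last three columns.
from_position_1 = df.iloc[:2, 1:]
```

Printing any of these results in a notebook cell shows the selected block.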
The same slicing applies here in iloc, because in the column section we cannot specify names. If I hit Control+Enter, you can see that 3: selects only one column, MEDV, which is the last column itself. If I make it 1:, it shows me the last three columns. You can do a lot of things here, so play around with it. What happens if I specify 3:2 for the rows? That does not give us anything useful, because it is not a valid range: the start comes after the stop, so the selection is empty. So I will specify, let's say, 3:10. That slices the rows from position 3 up to, but not including, position 10. For the columns, 1:2 selects from position 1 up to, but not including, position 2. If I make it 1:3, it selects the columns at positions 1 and 2. You can play around with these values. You can also specify some negative values and take a look at what happens in the DataFrame and how the slicing works. It will help you a lot in performing data analysis. In the next tutorial, we are going to start with the data preprocessing task. You now have a good idea of how to use the pandas library. Make sure to try all of these functions yourself and look at how the output changes by using df.head or df.tail. That's all for this tutorial. Thanks for watching.

4. Numpy Arrays: In this video, we are going to start with a very important library, NumPy. The first thing I am going to do is import numpy as np, where np is the alias. In the previous tutorial of this data science course, we covered another very important library, pandas.
And we saw how we can do various operations using that. Now, we're going to perform the data preprocessing tasks in the upcoming videos. And for that, we're going to use these two important libraries, which are the NumPy and pandas libraries. Basically, if you want to see the whole documentation of NumPy, you can just go to numpy.org, which is the official website of the NumPy library. You will find all the functions that this library supports. Now, since we are focusing on the data science course and the data pre-processing tasks, I have collected some of the very important functions of the NumPy library. And basically I've selected them from various projects that I've done. So here we are going to cover all of them; most of them are very useful and we will use them in the upcoming videos. So basically there are two basic uses of the NumPy library. The first one is NumPy arrays. And the second one is the numerical analysis or numerical operations that we want to perform. So NumPy stands for Numerical Python. So here we are going to have these two parts. But in this video we are only going to cover NumPy arrays. And in the next video, we will see how we can perform mathematical operations like logarithm, standard deviation, mean, all of that. Let's start with NumPy arrays. So basically, first we need to understand why we need NumPy. So let's create a simple list. I'm going to simply create a list here, which is a. It will have three elements in it. Or let's say these are the four elements. So we can already create a list, and let's say I print the type of this list here. If I hit Control Enter, you can see that this belongs to the class list. So why are we using arrays? Let us discuss that first. Now, the thing is that in a list, the elements are not stored in contiguous memory locations. So these four elements are not stored in one contiguous block of memory.
That's the main reason we will not have faster access to the elements of the list: they're not stored contiguously in memory. So that's why we need NumPy arrays. Because in data science we want to perform operations faster, we want to access these elements faster. So we are going to use NumPy arrays. And the second thing is that we can use mathematical operations on these arrays, like matrix multiplication. And we can even create multi-dimensional arrays using NumPy. Alright, so let's start with the first task, which is to actually create a NumPy array. Now, a NumPy array is actually an ndarray, and ndarray means n-dimensional array. So we can create n-dimensional arrays using NumPy, and each one is basically a contiguous block of elements. It is an n-dimensional object. So I'm going to write right here: n-dimensional objects, right? So let's see how we can create an array. So I'm going to create an array here with the name arr. The way to do it is to use np.array, and here you just need to specify the elements of the array. So if I specify 1, 2, 3, this is going to be a NumPy array, right? So let's try to print the type of this, so we will know what this actually is. Right here, you can see it shows that it belongs to the class numpy.ndarray. So arr is a variable and the type function gives us the type of this variable. So you can see it says that this is a NumPy array, so it is an n-dimensional array. Now let us see how we can find out the dimension of this array. We can use ndim, which is an attribute of the array that shows us its number of dimensions. This array has only one dimension, which you can see here: 1, 2, 3. Now let's create another dimension here by adding a comma and specifying another list of elements, like 5, 6, 7, right? So now you can see that it says data type not understood. So the reason why this is happening is that both lists need to be enclosed in a single outer list.
That is, we need to write one more square bracket here, like this, and we need to close it here. Now let's hit Control Enter. Now you can see that it is a two-dimensional array. So if we want to specify a two-dimensional array, we have to specify it like this. So the first row will have these three elements. The second row will have these elements. If I want to create more dimensions, I will enclose them in more square brackets. Even if I want to increase the dimension of these two lists, what I can do is simply add more square brackets here, right? So if I add three square brackets, interestingly, you can see that it has increased the dimension of the array. So the more square brackets there are, the greater the number of dimensions. So you can see now that the dimension is seven, although we have only these two lists, right? So if I try to print this array here, you can see that this is how it is going to show up. In this way we can create arrays with any number of dimensions. And now here I'm going to just make it two-dimensional again. Okay, so now we have this two-dimensional array. Now let's see how we can create arrays with more dimensions. Let's create another array, which is arr2 equals np.array. Here, let's say we want to create a three-dimensional one. This is how we're going to specify three dimensions. Here I will write 1, 2, 3. We will create another list, which is 4, 5, 6. And the last one is 7, 8, 9. So now let's try to print this out. Let's try to print the number of dimensions of this array. You can see here that we have three dimensions, and this is how the array looks. In this manner, we can create any n-dimensional object. Basically, you can see that we have the ability to create arrays which are n-dimensional. So it will help us a lot in data pre-processing also.
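The array-creation steps described above can be sketched as follows. The values follow the narration; the name arr_2d is an assumption used only to keep the one-dimensional and two-dimensional versions apart (the video reuses arr).

```python
import numpy as np

arr = np.array([1, 2, 3])                  # one pair of brackets: 1-D
arr_2d = np.array([[1, 2, 3], [5, 6, 7]])  # both lists inside one outer list: 2-D
arr2 = np.array([[[1, 2, 3], [4, 5, 6], [7, 8, 9]]])  # three levels of brackets: 3-D

print(type(arr))                          # <class 'numpy.ndarray'>
print(arr.ndim, arr_2d.ndim, arr2.ndim)   # 1 2 3
print(arr2.shape)                         # (1, 3, 3)
```

Passing the two lists as separate arguments, as in np.array([1, 2, 3], [5, 6, 7]), fails because NumPy treats the second argument as the data type, which is why the extra pair of brackets is needed.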
And basically, when we combine it with matrix multiplication and some crucial operations like logs, standard deviation, and mean, we will be able to do some very useful pre-processing, right? So this was the first step, and now let's see what operations we can actually perform on these arrays. Now we know how to create an array and how to create an n-dimensional array. Now let us see how we can do the indexing of these arrays. So basically I will write indexing. Let's say we have this array arr. And here I write 1, 1. Let's see what the output is. You can see 1, 1: these are the two indexes that we're supplying. Indexing basically means: how will I access a particular element inside of this whole given array? So you can see this is our array. If I'm writing 1, 1, the first 1 tells us the dimension that we're in. So you can see that we have two dimensions here, and they are numbered 0 and 1. So this is the zeroth dimension and this is the first dimension. So the first 1 selects this list; we can call it this dimension. The next one, which is here, specifies the element in that list. So here we are looking inside 5, 6, 7, and 1 means that we are pointing to the element at position one. So 5 is at position 0 and 6 is at position 1, which is why we get 6. If I write 3 here, let us see. We get an error, because there are only the positions 0, 1, 2 here, right? So let's write 2. We get 7 here, you can see. Now let us see what will happen if I write 0, 2. Now you can see 0 means that we are looking at this list here, which is at the 0th position, and then we are reaching the element at position two in it, which is 3. So we're getting 3 in the output. Alright, so this is how you can perform indexing. The first index gives us the dimension we are in.
And the second index gives us the position of the element within it. Let's move on to another operation that we are going to perform very often on these arrays, which is known as slicing. We have already seen slicing in pandas also, and we saw how we can do the slicing on DataFrames. Now let's see how we can do that on arrays, right? We're going to consider the same array, which is arr. And here I'm just going to write this command, which is 1, then a colon, and then 3. Let us see the output of this. Now you can see that 1:3 gives me 5, 6, 7. Why are we getting this? You can see we are slicing this array. This array has these two dimensions, as we can see here. The first one has 1, 2, 3 and the second one has 5, 6, 7. Slicing the array with 1:3 means taking everything from position one onward. So you can see we have the zeroth position here, then we have the first position here, and the slicing runs from the first position to the second position, because three is not included. So I'm going to write here that three is not inclusive. So it is going to slice the given array from the first position to the second position, because three is not inclusive, right? Let's see how we can do that on the second array, which is arr2. Okay, so on arr2 let's write the same command to see the output. Now you can see here that we have nothing inside of this array. Let us see why this is happening: it is because in the first position we don't have any item. This whole thing is at the 0th position, so let's try 0 here. And now you can quickly see that if we slice it from 0, we get all of these elements, at the positions 0, 1, 2 inside it. So this one is at the 0th position, this one is at the first position, and this one is at the second position. So you can see that this is the result of the slicing that we have done.
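The indexing and slicing results narrated above can be reproduced with the short sketch below, assuming arr and arr2 hold the values given earlier in the lesson.

```python
import numpy as np

arr = np.array([[1, 2, 3], [5, 6, 7]])
arr2 = np.array([[[1, 2, 3], [4, 5, 6], [7, 8, 9]]])

# Indexing: the first index picks the row, the second the position inside it.
print(arr[1, 1])   # 6
print(arr[1, 2])   # 7
print(arr[0, 2])   # 3

# Slicing along the first axis: the stop value is not included.
print(arr[1:3])          # [[5 6 7]]
print(arr2[1:3].shape)   # (0, 3, 3), empty because arr2 only has position 0
print(arr2[0:3].shape)   # (1, 3, 3), the whole array

# An index past the end raises an error instead of returning nothing.
try:
    arr[1, 3]
except IndexError as exc:
    print("IndexError:", exc)
```

A slice that runs past the end is quietly clipped, whereas a single index past the end raises IndexError, which matches the two different behaviours seen in the video.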
Now you need to play around with different types of arrays and with these different values, and look at what happens in the result, to actually get a better understanding of how things are working. Because you cannot memorize all of this stuff. You will have to keep practicing with arrays of different dimensions, slicing them with different values. Here, instead of 0:3, let's remove the 0 and see what will happen. You can see there is no change here. The reason is that when we write :3, it basically means that positions 0, 1, 2 are included, right? Alright, so this is how we can do the slicing inside of a given array. Now let's see how we can do step slicing, which is another very important concept. So here we are given this array and we're going to do the step slicing on it. Let's see what exactly this is. So I'm going to print arr here. I'm going to write 1, then a comma, and then let's try 0:1:2. And let's see what exactly we get as the result. So with step slicing you can see we have this as the result. So basically, what exactly is step slicing? With step slicing, we say that we want to slice the given array, but in steps. So we are specifying these three values: where the slice starts, where it stops, and the step, which is how many positions to move each time. So the slicing happens in a stepwise way. So instead of taking every element of the slice, we can pick out smaller parts of the array using step slicing. There are more ways to do this, and we can try all of these methods. So basically I'm going to create another array here, which I will call arr3. And for this, I'm going to use np.array. And this is going to have the elements, let's say.
1, 2, 3, then 4, 5, 6, 7. And let's create one more, which is 1, 2, 3. The last one will be 3, 4 only, right? So we have these elements here. You will have to be very careful when you are creating a NumPy array, because you have to make sure that the number of dimensions is actually the one you desire, right? So here we have the first list of elements, and here we have the second list of elements. Now let's say I want to include these two in a single dimension. I can do that by enclosing them in a single pair of square brackets. And if I want to include them in another dimension, I can do that by creating another pair of square brackets, right? So whenever you want to create a dimension, you have to make sure you create a square bracket for it. Now, let's use ndim, which will tell us the dimension of this array. And make sure to use ndim so that you know whether you are getting the required dimensions and whether the array is the one that you want, right? So you can see the number of dimensions is two here; the result is here. Now, here we have two dimensions. In the first dimension, I have these two positions, and in the second dimension I have these two lists right here. I want to do the slicing on this arr3. And let's say I write 1, 4. Let's hit Enter here and see what will happen. So now you can see it is not showing us anything. So let's make it 0. This will be one. What we're doing here is, in the first part, I'm specifying that we are at the 0th position, which means the zeroth dimension. In the zeroth dimension, we're slicing from one; let's write it as 0:2. So it will include the elements at positions 0 and 1, and the dimension we are looking at is 0.
So if I make it one, let's see what the result will be. You can see it shows the result 1, 2, 3 and 3, 4, which you can see is in the dimension at the first position, right? So if you want to slice in a particular dimension, you can specify the dimension here. So here I will write a comment that the first position specifies the dimension and the second position specifies the index of slicing. In this manner, you will be able to understand this more quickly: the first parameter is the dimension in which we want to perform the slicing, and here we are writing how we want to do the slicing, right? We want to slice 0, 1, 2. But here, if we want to do step slicing, we can also do that. We can specify a start, a stop, and a step, and if the NumPy array has that many elements, it will be able to slice them out. So this is how we can do the slicing inside of an array. You can play around by creating different arrays with different numbers of dimensions and different numbers of elements. And you will be able to understand how this step slicing is working and how this type of indexing is working, right? Now we're going to move to another concept. Let's say we want to test out some functions of NumPy. Let's say we want to calculate the mean and some other stuff as well. So in that case, let's say I want to create an array of consecutive natural numbers. In that case, we do not need to create the array by writing the numbers manually. Let's say I want to create an array that includes the first 20 natural numbers, so I can just write np.arrange, right? And let's say I want natural numbers that start from one and go till 20, right? So what will happen is that the NumPy library will create an array which goes from one till 20.
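The slicing forms discussed above, including the step value, can be sketched as follows. The arrays row and grid and their contents are assumptions chosen for illustration; they are rectangular on purpose, since recent NumPy versions refuse to build an array from lists of unequal length unless dtype=object is given.

```python
import numpy as np

row = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
grid = np.array([[0, 1, 2, 3],
                 [4, 5, 6, 7],
                 [8, 9, 10, 11]])

# A slice is written start:stop:step; stop is excluded and step defaults to 1.
print(row[:3])      # [0 1 2]    same as row[0:3]
print(row[1:8:2])   # [1 3 5 7]  every second element from position 1 up to 8
print(row[::3])     # [0 3 6 9]  every third element of the whole array

# In two dimensions, the first entry picks the row(s), the second the column(s).
print(grid[1, 0:2])      # [4 5]
print(grid[0:2, ::2])    # [[0 2]
                         #  [4 6]]
```

The three values are always read as start, stop and step, so 0:1:2 means "from position 0, stop before position 1, moving two at a time", which selects a single element.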
If I simply print this array here, you will be able to see that in the output, right? So here it says that the module numpy does not have this attribute. Okay, so it should be arange, with a single r. Now you can see we have these elements that start from one and go till 19. So you can see that 20 is not included, right? So the last one is not included. So if I write 20, it will start from one and go till 19. Now, why are we doing this? Because if we want a set of natural numbers to play around with, we can use this function. And make sure there is only a single r, right? It's not a double r. Okay, so let's see how you can create more kinds of values. Let's say we want to create floating-point numbers from one number to another. So the way to do it is, let's create another array here. We will again use the arange function. Here, we're going to specify the range from which we want the floating-point numbers, right? Let's say I want floating-point numbers from one till n, which is ten here. And now I will have to specify the data type here by using dtype. And here I will specify float, right? Now, this will create a NumPy array with floating-point numbers from one up to ten. So you can see that these are now floating-point numbers. So it has 1., 2., 3., and so on like this, right? So this is another interesting thing. Remember that all these functions are going to come up in the data pre-processing task. So make sure that you practice them by yourself. These are all very important and we have already used them in some projects. So make sure that you also practice them. Let's move on to a very important concept, which is changing the shape of the array. Let's say we have an array with a given dimension. Let's say that the array is three-by-three, or let's say two-by-two. And now we want to change the shape of the array. We want to change the dimension of the array.
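The arange calls described above can be sketched as follows; the variable names naturals and floats are assumptions, since the names used in the video are not clear from the narration.

```python
import numpy as np

# The stop value is excluded, so this gives 1 .. 19.
naturals = np.arange(1, 20)
print(naturals)

# The same idea with a floating-point data type: 1.0 .. 9.0.
floats = np.arange(1, 10, dtype=float)
print(floats)          # [1. 2. 3. 4. 5. 6. 7. 8. 9.]
print(floats.dtype)    # float64
```

To include 20 itself, the stop value has to be 21, as in np.arange(1, 21).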
So let's see how we can do that. The first thing is, let's see how we can check the shape of an array. So for that I'm going to create an array a equals np.array. Here I will just specify 1, 2, 3. And I'm going to print a.shape, right? shape is not a function, it is a property. So if I hit Enter, it will show us that the shape is (3,): a three, a comma, and nothing after it, because the array has only one dimension. There are only three elements here, so that is what it is showing. Let's create one more. Instead of creating an array like this, let's create natural numbers by using the arange function, np.arange. Let's create six elements. So for that I will just specify six. And here, let's say I use the reshape function. Now let us say I want to reshape this array. Let's say we have this one here, and here I write a.shape. We know that it will show (3,). Now, I want to reshape it, so I can use the reshape function, right? So let's reassign it with a.reshape. And since there are only three elements in here, we can add some more elements so that we can actually change the shape of it. So let's add some more elements. I will add 4, 5, 6, 7, 8, 9 also. We have created these elements here. And this array now has nine elements and a single dimension of length nine. So if I hit Enter or Control Enter, it will show me that the shape of this array is (9,). So what I want is to convert this single-dimension array: I want to reshape it into a three-by-three matrix, right? So the way to do it is to specify the sizes here. So if I want a three-by-three shape, I write 3, 3. Now what it will do is quickly change the shape of this whole given array, which is here, into a three-by-three array. So here, if I hit Control Enter, you can quickly see that now the shape has been changed to three by three, right?
So let's take a look at the shape of the array before using the reshape function. So if I write a.shape, you can see that initially it was (9,) and now it is (3, 3). So we have changed the shape, or the dimension, of this array to three by three. So now let's try to print this here. Now you can see that instead of one single row, it is now 1, 2, 3, then we have 4, 5, 6, and then we have 7, 8, 9. So we have a three-by-three array here, and it has been divided up like this, right? So what will happen if I write 3, 2 here? Let us see whether it will be able to do that. Now you will quickly see here that a ValueError says that we cannot reshape an array of size nine into this shape. Which means that whenever you want to do a reshaping, you have to make sure that the product of these two numbers, which we are going to write in the reshape function, is equal to the number of elements inside the array. Otherwise, you won't be able to do that. All right, let's include only six elements here. And now we know that the product of 3 and 2 is six. So if I hit Control Enter, now you can see it has created an array with two elements in each row, and there are three rows. So with 3, 2 we have three rows and two columns. And now let's change it to 2, 3. Now you'll quickly see here in the output that there are three elements in each row, but we have only two rows here. So that's how we can do the reshaping of the elements of a given NumPy array, right? So it is a very important thing; you will be using it very often when we're performing the analysis and the data pre-processing tasks. Let's also move to another important function that I keep seeing in a lot of projects on data science, which is replacing the elements with one. Let's say I want to replace all the elements of an array with the value one. So let's see how I can do that. First of all, I will create a new array.
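The reshaping steps walked through above can be sketched as follows, assuming the nine-element array holds the values 1 to 9 as in the narration. The names m, b, c and error_message are assumptions added for the example.

```python
import numpy as np

a = np.arange(1, 10)        # nine elements: 1 .. 9
print(a.shape)              # (9,)

m = a.reshape(3, 3)         # 3 * 3 == 9, so this works
print(m.shape)              # (3, 3)

# 3 * 2 != 9, so NumPy refuses and raises a ValueError.
error_message = None
try:
    a.reshape(3, 2)
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# With six elements, both (3, 2) and (2, 3) are valid shapes.
b = np.arange(6).reshape(3, 2)   # three rows, two columns
c = np.arange(6).reshape(2, 3)   # two rows, three columns
print(b)
print(c)
```

reshape returns a new view of the data and leaves the shape of a itself unchanged, which is why the video reassigns the result.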
And instead of writing the elements manually, I will simply use the arange function. Here I will create an array with, let's say, four elements in it. First I will just show you the value of this array. You can see that this array has 0, 1, 2, 3. Now what I'm going to do is use a very important function, which is the np.ones_like function. Here I will supply it with the new array. I will hit Control Enter. Now you can see it has replaced all of them by one. So it is useful in a lot of cases; when we want to perform some data categorization, we can do that using this, right? And there is one more, which is zeros_like. So if I write zeros here, it will convert all of them into zeros. You can see here, these are two very important functions, which I have seen and have personally used in some projects. So make sure you practice them also. And you can see we have zeros_like and ones_like. Now, let's move on to the next part, which is how we can concatenate two arrays. I'm going to write here: concatenating. All right, so for that, I will need two arrays. So let's create one, which I will call a1. So let's create some elements in this array, which will be np.arange. And here, let's say we want elements from two to six. Here, I will create another array, which will be a2. This will go from seven. We have these two arrays and we want to concatenate them into a single array. To do that, it's very easy. Let's say we are going to create another array, which is the merger of a1 and a2. Now to do that, we have a very simple function, which is the np.concatenate function. Here we just need to specify these two arrays. Now remember, this is a function and we want to specify a1 and a2. The way to do it is not like this. Some people write it like this, as a1, a2. You cannot do it like this.
You will have to specify them as a pair, inside round brackets like this. So now you can see we have one bracket, which is for the concatenate function. The second bracket specifies a1 and a2 as a pair. Now, if I try to print this merged array, you can see it shows that the name arange is not defined. Okay, so here it should be np.arange, with a dot, not a comma. Now you can see that it has concatenated these two arrays into a single array. You will use this function a lot, to combine two or more arrays. This is how we can do the concatenation part. Let's move on to the opposite of this, which is how we can split an array. Let's say I want to split this merged array. To do that, first let's create another array. I will call it the unmerged array. Here I will use a very simple function for the splitting part, which is the np.array_split function. Here I just need to specify the merged array. Then we have to specify, let's say, three. Alright, so I want to split it at the third position. Now, I will try to print out this unmerged array. The merged array had all these elements: 2, 3, 4, 5, 7, 8, and so on. So here it says that the name merged is not defined; it was not merged, it was merge_arr. So it will split this merged array. So we can see here that we had 2, 3, 4, 5, 7, 8, 9, 10, 11. And now you can see that three means that it has split it into three equal parts. The first part is 2, 3, 4, the second part is 5, 7, 8, and the third part is 9, 10, 11, right? So let's change it to two and see what will happen. Now you can see that it first creates an array with five elements and then the next array with four elements. In this way, we can split an array into multiple arrays, right? Now let's move on to another important part, which is to perform searching inside of a given array using the NumPy library. We want to search for some elements. So first, let's take an example.
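The helper functions covered in this part of the lesson (ones_like, zeros_like, concatenate and array_split) can be sketched together as follows. The stop values passed to arange are inferred from the element lists read out in the video, and the names new_arr, parts3 and parts2 are assumptions.

```python
import numpy as np

new_arr = np.arange(4)              # [0 1 2 3]
ones = np.ones_like(new_arr)        # same shape and dtype, filled with 1
zeros = np.zeros_like(new_arr)      # same shape and dtype, filled with 0
print(ones, zeros)                  # [1 1 1 1] [0 0 0 0]

a1 = np.arange(2, 6)                # [2 3 4 5]
a2 = np.arange(7, 12)               # [ 7  8  9 10 11]

# The arrays go in as ONE argument: a tuple (or list) of arrays.
merge_arr = np.concatenate((a1, a2))
print(merge_arr)                    # [ 2  3  4  5  7  8  9 10 11]

# The second argument is the number of parts, not a position.
parts3 = np.array_split(merge_arr, 3)
parts2 = np.array_split(merge_arr, 2)
print([p.tolist() for p in parts3])   # [[2, 3, 4], [5, 7, 8], [9, 10, 11]]
print([p.tolist() for p in parts2])   # [[2, 3, 4, 5, 7], [8, 9, 10, 11]]
```

np.array_split is used rather than np.split because it accepts a part count that does not divide the length evenly, which is what produces the five-element and four-element pieces.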
I am going to take the first example from here only, this one. Next, let's create an array e with some random numbers. I'm just using any random numbers here. And let's say I want to search for 87 inside of this array. Note that it is at position three, counting 0, 1, 2, 3, and we want to search for it. So the way to do it is very simple. I will first create a variable x, which will get the location of the element 87. So we have the where function, which is used to perform the searching part. So the np.where function can take more parameters, but only one parameter is needed here: e == 87. So I need to specify the element, which is 87. So now what it will do is search for 87 inside of this given array, which is e here, right? So if I try to print x, you can see here it shows the result of the search. And it says that it is at the third position, and the data type of the index is int64. You can see here that it was indeed at the third position. So this is the 0th position, and then the first, second, and third positions. So it has given us the position. And you can see that since we are using arrays, searching is faster in this case. Now, let us move on to another important part, which is sorting a given array. Sorting is also important. Let's sort this array only. You can see this array is not sorted. Let's try to sort it. I will write print and the np.sort function. And inside of this sort call I will specify e here. You can see that this is the sorted data in ascending order, right? So you can see that initially it was not sorted, and now this array is sorted. So these very simple utility functions will help you a lot in the data pre-processing task. Practicing them is very important. And when you do more and more projects, you will get familiar with all of these, and you will get a good hold of all of these functions here, right? Okay, so let's move on to another one. And this one is upper triangular.
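The searching and sorting steps described above can be sketched as follows. The values in e are assumptions; the only detail taken from the video is that 87 sits at position three.

```python
import numpy as np

e = np.array([45, 12, 3, 87, 60])   # hypothetical values; 87 is at index 3

# With a single condition, np.where returns a tuple of index arrays,
# one array per dimension of the input.
x = np.where(e == 87)
print(x)          # a tuple holding one index array: [3]
print(x[0][0])    # 3

# np.sort returns a sorted copy in ascending order; e itself is unchanged.
sorted_e = np.sort(e)
print(sorted_e)   # [ 3 12 45 60 87]
print(e)          # [45 12  3 87 60]
```

If the value occurs more than once, the index array simply contains every matching position, and it is empty when there is no match.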
Now, this function I've seen in a lot of projects: how to create upper triangles. And it is a very important concept. So focus on this part here, because it is really important to understand why we create upper triangles. So let's say I create an array here. Let's call it a0. Here I will use np. Let's take e as the example. We're going to use e, which is here, to create upper triangles. For that, I will just print np.triu, which is the short form of upper triangular: tri means triangle, and u means upper. So np.triu means it will create an upper triangle of the given matrix. For that, let's create a three-by-three matrix. I will quickly create one. We have already created a three-by-three matrix above; when we were doing the reshaping part, we created this one also. Let's just create it again. Here, I will use np.arange, and I will include elements from one till nine. And I'm going to quickly reshape it into a three-by-three matrix. And let's take a look at this a0 first. Let's see whether it is correct or not. It shows an error, so let me fix the range. Now we have this; you can see that this is the array. We have 1, 2, 3, 4, 5, 6, 7, 8, 9. So you can consider it as a three-by-three matrix. Then we want to create an upper triangle. Let's take a look at how the array will change when we apply the upper triangle right here. So I'm going to print the np.triu function. Here, I'm just going to specify two parameters. The first parameter is going to be the array for which we want the upper triangle, and the second parameter is 0. I will tell you what exactly this second parameter can be. We can change this value. It can be 0, minus 1, or 1. We will see how the values change when we apply the second parameter as 0, then minus one, and then one.
So let's hit Control Enter to see the output here. You can see that when I specified 0, it created an upper triangle. So initially this was the array. And now, after creating the triangle, you can see all the elements below the diagonal. You can see these are the diagonal elements: 1, 5, 9. And now you can see it is forming this triangle. 1, 2, 3, 5, 6, 9 are forming a triangle here, which you can see. And these other elements have become 0, right? So once they have become 0, we have an upper triangle here. Now let's change this value from 0 to one. Let's see what the change will be here, right? So now you can see here that if we write one, the diagonal elements are no longer included. So it creates an upper triangle that starts above the diagonal. You can see that 2, 3, 6 are involved in this triangle. The other elements have become 0. If I change this to minus one and hit Control Enter, now you can see that we have got an upper triangle, but only the last element, in the bottom-left corner, is 0, right? So all the elements above it are not 0. So in this way we can create upper triangles, and you will see the significance of creating these triangles when we start with the data pre-processing tasks. You will see this triu function used in a lot of projects also. Now you have a good idea of how exactly this function changes the array. Now we are moving on to the last function, which is to change the data type of the elements of the array. All right, so for that, I will create another array, arr23. And here I will use np.array. Let's create elements which are floating-point values, which are 2.1 and 1.2. And let's give it one more, which is, let's say, 3.1. Now we have this NumPy array. What I'm going to do is print the type of this. So first of all, I will create a new array. Here, I will use the arr23.astype function. I will write int in here.
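The effect of the second argument of np.triu, as described above, can be sketched as follows. The matrix matches the one built in the video; the names upper_0, upper_1 and upper_m1 are assumptions.

```python
import numpy as np

a0 = np.arange(1, 10).reshape(3, 3)
# [[1 2 3]
#  [4 5 6]
#  [7 8 9]]

upper_0 = np.triu(a0, 0)    # keep the diagonal and everything above it
upper_1 = np.triu(a0, 1)    # start one diagonal higher: the main diagonal is zeroed
upper_m1 = np.triu(a0, -1)  # start one diagonal lower: only the corner is zeroed

print(upper_0)    # [[1 2 3] [0 5 6] [0 0 9]]
print(upper_1)    # [[0 2 3] [0 0 6] [0 0 0]]
print(upper_m1)   # [[1 2 3] [4 5 6] [0 8 9]]
```

The second argument is the offset of the first diagonal that is kept: 0 is the main diagonal, positive values move up and to the right, and negative values move down and to the left.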
So now what will happen is that it will create a new array, which has all the values of arr23, but converted to their integer part. So let's take a look at how exactly our new array will look. So the new array will have all these elements, but only the integer part of each one. So here you can see, under changing data type, that it now has only the integer parts and has ignored the decimal parts. In this way, if you want to change the data type of the elements, you can do that, and you will face this situation a lot when you are preprocessing data. Sometimes you don't need float values, so for the sake of ease you convert them into integers, right? So there are some more things. If you want to print the data type of this array, you can just use the dtype property. And here it will show that it is int32. And let's say that you want to change the data type to string. So let's create an array of strings here, which is '1', '2', '3', like this. Now here what I can do is specify the data type of this, so that it has a data type of string. Now, let us see whether we can convert it into integers. If I hit Control Enter, you can see that it has been successfully converted into integers. Let's say now I want to specify that this is not a string but a four-byte integer. i4 means a four-byte integer. If I hit Control Enter, now you can see that it has 1, 2, 3 and the type int32. Here I will have to make some changes. Instead of the new array, I will specify arr23 here. Now I can see it is 1, 2, 3. Here, I want to print the type of arr23. So the dtype is int32 again. This is how you can change the data type of the elements from string to integer or from integer to float. So basically that's all for this tutorial. We'll see you in the next tutorial.
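The data type conversions described above can be sketched as follows. The float values are the ones heard in the narration, and the names strs and ints are assumptions; the string array uses NumPy's byte-string type "S", as in the video.

```python
import numpy as np

arr23 = np.array([2.1, 1.2, 3.1])
new_arr = arr23.astype(int)      # decimal parts are dropped, not rounded
print(new_arr)                   # [2 1 3]
print(arr23.dtype)               # float64
print(new_arr.dtype)             # default integer type; its size depends on the platform

# Strings that contain digits can be converted to integers as well.
strs = np.array(["1", "2", "3"], dtype="S")
ints = strs.astype("i4")         # "i4" is a four-byte (32-bit) integer
print(ints)                      # [1 2 3]
print(ints.dtype)                # int32
```

astype always returns a new array, so the original arr23 keeps its float values unless the result is assigned back to it.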
Thanks for watching.
5. Numpy Functions in Python: In this course, we have already covered NumPy arrays and some of the very important functions that we can perform on them. In this video we're going to cover the NumPy mathematical operations that are supported and that we can perform on NumPy arrays. These are some of the basic and very important functions that you will be using throughout your data science career. So let's start. First of all, I will quickly import numpy as np here. Then I'm going to create a matrix with np.array. We're going to create three rows of three elements each: 1, 2, 3, then 4, 5, 6, then 7, 8, 9. Let's quickly check whether we have defined it correctly by printing it out. Here you can see the matrix with its nine elements. Now I'm going to perform some mathematical operations on this matrix, for example dot product, standard deviation, mean, and the other statistical functions. Let's start with some of the basic ones. The first one computes the maximum element, which is a very important one: it gives the maximum element in the whole matrix. I'm going to print it out using the np.max function. If I hit Control+Enter, you can see that 9 is the maximum element in this whole matrix. Now let's suppose we want to find the maximum element along an axis. I can specify the axis here as well, as axis equals 0. If I hit Control+Enter, you can see that for axis 0 the maximum elements are 7, 8, 9, which is the maximum of each column. If I write 1 here and hit Control+Enter, you can see it is 3, 6, 9, which is the maximum of each row.
So by changing the axis, you can return the maximum element along that axis, that is, per column or per row. The next part, which is similar, is to compute the minimum element. For that we have the same technique, which is to use the np.min function. Here I can just specify the matrix, and you can see that the minimum element of this matrix is 1, which is what it prints out. Similarly, we can also provide the axis as 0 or 1 here. Now let us move on to some other functions. These are the statistical functions. In this course I have not yet touched the topics of statistics, that is, the meaning of standard deviation, variance and mean. These are some of the very crucial topics that need to be covered in data science. So in this tutorial I'll just show how to use them, and in the next video I'm going to teach the important concepts of statistics like variance, mean and standard deviation. That way you will be able to understand better how these functions are useful in data science. Let's start with the most basic one, which is to compute the mean of the given array. Let's say we are given this matrix and I want to compute the mean. Mean is basically the average. To print it, I'm just going to use the np.mean function and supply it with the matrix. You can see that it returns 5 as the mean, because that is the average of all of the elements. We will discuss the concepts of statistics in detail in the next tutorial. So let's move on to another statistics concept, which is variance. Let's cover all of them in this single cell: variance and standard deviation.
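As a minimal sketch of the maximum, minimum and mean calls described above, assuming the same 3x3 matrix of the numbers 1 to 9 (the variable names are chosen here for illustration):

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

overall_max = np.max(matrix)       # largest element of the whole matrix
col_max = np.max(matrix, axis=0)   # axis=0: maximum of each column
row_max = np.max(matrix, axis=1)   # axis=1: maximum of each row

overall_min = np.min(matrix)       # smallest element of the whole matrix
col_min = np.min(matrix, axis=0)   # minimum of each column

mean_value = np.mean(matrix)       # average of all nine elements

print(overall_max, col_max, row_max)
print(overall_min, col_min)
print(mean_value)
```

A way to remember the axis argument: axis=0 collapses the rows and leaves one result per column, while axis=1 collapses the columns and leaves one result per row.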
These two are very important and widely used in data science, because they are useful for some of the important methods of data preprocessing. Here we can print the mean, and if we want to print the variance, I can do the same with the variance function and supply it with the matrix. You can see that the variance of this matrix is about 6.67. Similarly, if I want the standard deviation, I can use np.std, which stands for standard deviation, and supply it with the matrix. If I hit Control+Enter, it gives me the standard deviation. We will study these three important concepts, and there is one more important concept, the normal distribution, which we will also study. Let's move on to some of the linear algebra topics. These are mathematical concepts such as the dot product and the multiplication and addition of matrices. So let's perform these here. The first operation is to calculate the transpose of a matrix. This course requires that you have basic knowledge of mathematics, that is, matrices and determinants. Computing the transpose is very simple: I can just write matrix.T, with a capital T. If I hit Control+Enter, you can see the transpose of the matrix. Essentially the rows have become columns. So 1, 2, 3 was a row in the matrix, and after the transpose it has become a column: you can see the first column is now 1, 2, 3. That's how we compute the transpose of a matrix. Let's move on to how to compute the determinant of a matrix. These are all basic concepts of linear algebra, and the mathematics required in data science is statistics, probability and linear algebra. Even if you know just the basics of these concepts, you are good to go. Let's see how we can calculate the determinant of this matrix.
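The variance, standard deviation and transpose steps above could look like the sketch below. The variance function is not named in the recording, so np.var is an assumption here; np.std and the .T attribute are the ones mentioned.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

# Assumption: np.var is the variance function used in the lesson.
# By default it divides by the number of elements (population variance).
variance = np.var(matrix)
std_dev = np.std(matrix)   # square root of the variance above

transposed = matrix.T      # rows become columns

print(variance)
print(std_dev)
print(transposed)
```

Both np.var and np.std accept an axis argument in the same way as np.max, in case the statistic is wanted per row or per column.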
For that, we're going to use the NumPy library, with np.linalg, which is the linear algebra module, and then its det function to calculate the determinant of this matrix. You can see the determinant of this matrix here, from np.linalg.det. All right, so let's move on to how to calculate the rank of a matrix. The rank is the number of linearly independent rows or columns of the matrix. This array is 3 by 3. To compute the rank, I will again use np.linalg, which stands for linear algebra, and here I write matrix_rank and supply it with the matrix. You can see that 2 is the rank of this matrix. That's how you can compute the rank of these matrices. Let's move on to some other important functions. Let's take a look at how to compute the eigenvalues and eigenvectors. Eigenvalues and eigenvectors are also important, and you will be using these functions quite often in data preprocessing tasks. Basically, let's suppose we have a square matrix A. If I take the dot product of A with v, it will be equal to k multiplied by v, where v is the eigenvector and k is the eigenvalue. So the point of an eigenvector is that when you apply the linear transformation, only its length is scaled and not its direction. I can even write that here: when you apply the linear transformation, eigenvectors change in scale and not in direction. Okay, so let's see how we can compute these two values. Again, we're going to compute the eigenvalues and eigenvectors of the matrix that we are using throughout this program.
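A short sketch of the determinant and rank calls described above, assuming the same 3x3 matrix of the numbers 1 to 9. For this particular matrix the rows are linearly dependent, so the exact determinant is 0; the floating-point result may print as a tiny nonzero number instead.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

# Determinant: exactly 0 in theory for this matrix, because
# row 3 equals 2 * row 2 - row 1; rounding may leave a tiny residue.
determinant = np.linalg.det(matrix)

# Rank: number of linearly independent rows (or columns).
rank = np.linalg.matrix_rank(matrix)

print(determinant)
print(rank)
```

The rank of 2 reflects that only two of the three rows are independent, which is also why the determinant vanishes.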
First, to compute the eigenvalues, we define variables here. Let's define two variables, eigenvalues and eigenvectors. We have a function which returns both of these, np.linalg.eig, and we supply it with the matrix. This function returns the eigenvalues and the eigenvectors, and they will be stored here. So let's look at these values by printing them out: first the eigenvalues, and then the eigenvectors. If I hit Control+Enter, you can see that these are the eigenvalues, and this whole array holds the eigenvectors. Let's move on to more functions. Let's see how we can calculate the dot product, which is also very important. First of all, I have to create two matrices. Let's create a very simple one, matrix1, with 1, 2, 3, and one more, matrix2, with the values 4, 5, 6. Each of these is going to be an np.array. Now, to compute the dot product and print it out, I have to use the np.dot function and provide these two matrices as the arguments, matrix1 and matrix2. You can see that 32 is the dot product of these two. Remember, before calculating the dot product you first have to understand the linear algebra behind it, and make sure that the rows and columns of the two operands match. Only then will we be able to calculate the dot product. Let's move on to how to add two arrays, which is the addition of these vectors. We're going to use the same matrix1 and matrix2.
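The eigenvalue and dot product steps above can be sketched like this. The variable names matrix1 and matrix2 are assumed spellings of the names spoken in the lesson, and the final check is an addition that simply confirms the defining relation between a matrix, its eigenvectors and its eigenvalues.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

# eig returns the eigenvalues and a matrix whose COLUMNS are the eigenvectors.
eigenvalues, eigenvectors = np.linalg.eig(matrix)

# Check A v = k v for every pair at once: multiplying the eigenvector matrix
# by the eigenvalues scales column i by eigenvalue i.
relation_holds = np.allclose(matrix @ eigenvectors, eigenvectors * eigenvalues)

# Dot product of two one-dimensional arrays: 1*4 + 2*5 + 3*6.
matrix1 = np.array([1, 2, 3])
matrix2 = np.array([4, 5, 6])
dot_product = np.dot(matrix1, matrix2)

print(eigenvalues)
print(eigenvectors)
print(relation_holds)
print(dot_product)
```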
So if I print the result of the np.add function for matrix1 and matrix2, you can see that their corresponding values are added and stored in another vector. It is 5, 7, 9: 1 plus 4 is 5, then we have 7, then we have 9. Similarly, if you want to do the subtraction, you perform the same step with the np.subtract function, again supplying it with the two values, matrix1 and matrix2. You can see that if I subtract them, 1 minus 4 gives me -3, 2 minus 5 gives me -3, and similarly -3 here. That's how we calculate the subtraction. Now let's see how we can do the multiplication. This is element-wise multiplication, not the dot product, so I'm going to write here that this is not the dot product. You will see the difference between them. We can multiply the two arrays element by element simply by using an asterisk like this. So 4 multiplied by 1 is 4, then 10, then 18: 4, 10, 18 is the answer. This element-wise multiplication is different from the dot product, which, as you saw, was 32. Now we're going to move on to some other functions as well. I'm going to start with how to calculate the inverse of a matrix. To calculate the inverse, we again use the linear algebra module, where the inv function is available. I just have to supply it with the matrix. If I hit Control+Enter, you can see it simply calculates the inverse of this matrix. So this is the usage of the inv function, which belongs to the linear algebra module. Now we're going to see how to generate random values using NumPy, which is again a very important concept.
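Here is a small sketch of the addition, subtraction, element-wise multiplication and inverse steps above. The 2x2 matrix used for the inverse is a hypothetical example chosen because it is safely invertible; it is not the matrix from the recording.

```python
import numpy as np

matrix1 = np.array([1, 2, 3])
matrix2 = np.array([4, 5, 6])

added = np.add(matrix1, matrix2)            # element-wise sum
subtracted = np.subtract(matrix1, matrix2)  # element-wise difference
multiplied = matrix1 * matrix2              # element-wise product, not the dot product

# Hypothetical invertible matrix (determinant 4*6 - 7*2 = 10).
square = np.array([[4.0, 7.0],
                   [2.0, 6.0]])
inverse = np.linalg.inv(square)

# Multiplying a matrix by its inverse should give the identity matrix.
is_identity = np.allclose(square @ inverse, np.eye(2))

print(added, subtracted, multiplied)
print(inverse)
print(is_identity)
```

np.linalg.inv only makes sense for square matrices with a nonzero determinant, so it is worth checking the determinant or the rank first when the input is not known in advance.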
For that, I'm going to create a very simple program which generates five random values. I'm going to write a comment here: generate five random values between 1 and 10. Let's see how we can do that. I will use the np.random.randint function, and I have to supply it with three values. The range goes from 0 to 11, and I want five values, so it will be like this. You can see that it generates five random values. Note that the upper bound, 11, is excluded, while the lower bound is included, so to get values strictly from 1 to 10 the lower bound should be 1. One interesting thing is that if you hit Control+Enter again, the random values change, and they keep changing each time. If you do not want the random values to change every time, you can use a very important thing known as a seed. For that you can use the np.random.seed function, and supply it with a fixed value such as 1. Now if I keep hitting Control+Enter, the values will not change. They have become constant because of the seeding that we have done here. Now let's move on to another important thing. Let us say we want to generate some random values from the normal distribution. The normal distribution is another important concept of data science, because it belongs to statistics and probability. We will discuss it in the next tutorial, where I cover these concepts of statistics. Let's see how we can get values from the normal distribution. For that, we can use the np.random.normal function. We need to give it three parameters. The first will be the mean, the second will be the standard deviation.
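The random-number steps above can be sketched as follows. np.random.randint includes the lower bound and excludes the upper bound, so this sketch passes 1 and 11 to stay within 1 to 10. The third argument of np.random.normal is the number of samples to draw; the specific mean, standard deviation and count below are illustrative values.

```python
import numpy as np

# Fixing the seed makes the sequence of random numbers repeatable.
np.random.seed(1)
values = np.random.randint(1, 11, 5)   # five integers from 1 to 10 inclusive

# Re-seeding with the same number reproduces exactly the same draw.
np.random.seed(1)
repeated = np.random.randint(1, 11, 5)

# Ten samples from a normal distribution with mean 1 and standard deviation 2.
samples = np.random.normal(1, 2, 10)

print(values)
print(repeated)
print(samples)
```

Without the calls to np.random.seed, each run of the cell would print different numbers.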
And the third one is the number of values that you want to generate. So 1 is the mean, 2 is the standard deviation, and 10 is the number of values that I want. Then I hit Control+Enter, and it picks ten numbers from the normal distribution with the given mean and standard deviation. So basically these are the important functions of the NumPy library. We will uncover more of them as we move on in this course. So basically that's all for this tutorial. Thanks for watching.
6. Statistics for Data science: Now let's move on to the first topic that we're going to study, which is the mean. We have mean, standard deviation, Gaussian distribution and variance. We are going to cover these very important topics, and you will face them a lot in data science. Let's start with the mean and try to understand its significance. Here I've drawn a graph of an example. Let's consider a very simple example: there is a smartphone company selling smartphones, and from its data I have picked seven days of sales. You can see that the data has these seven items in it: 15, 10, 30, 20, 5, 25, 25. On this graph, one axis has the day number, 1 through 7, for the whole week, and the other has the number of phones sold. On the first day 15 phones were sold, on the second day 10 phones were sold, and on the third day 30 were sold. This is how I've plotted these blue points that represent the data. Now let's see how to compute the mean. The mean is the average of these values. We compute it by summing them up and dividing by the number of data points that we have, which is seven in our case. On calculation you get 130 by 7, which is 18.57.
You can see a red line going through this graph. This line represents the mean, which is 18.57, and you can see that it lies between 15 and 20. Now, let's try to understand what the mean tells us. The significance of the mean is very simple: it gives us the daily average over the past seven days. This mean is 18.57; let's round it down to a whole value of 18. It means that, on average, 18 phones were sold every single day in the past seven days. This is crucial information, because sometimes the company is not interested in how many phones are sold on a single day; what they want is the average. Here the average is 18, so we can say that 18 phones were sold per day on average in this one week. But if you take a closer look, there is interesting information missing from this graph, and the mean can be misleading. To see why, let's assume a data point which is far away from this line, far away from this mean. If I calculate the mean now, it will become higher. Let's suppose the value increases to 30. It would then suggest that 30 phones were sold every day, which is not true at all. One data point can distort the mean, and a distorted mean gives the company false and misleading information, because it was only ever an average and not what actually happened on each day. If we have such data points, which are anomalies in the data, the mean can be misinterpreted. To avoid this, we can improve the information and make it much more informative by adding the standard deviation to it.
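To make the effect of a single far-away point concrete, here is a minimal sketch in plain Python. The seven sales figures are reconstructed from the lesson (they sum to 130), and the extra value of 110 is a hypothetical outlier chosen only so that the new mean lands on 30.

```python
# Assumption: daily sales reconstructed from the lesson (total of 130 phones).
sales = [15, 10, 30, 20, 5, 25, 25]

mean_sales = sum(sales) / len(sales)   # 130 / 7

# Hypothetical anomaly: one extra day with unusually high sales.
sales_with_outlier = sales + [110]
mean_with_outlier = sum(sales_with_outlier) / len(sales_with_outlier)

print(round(mean_sales, 2))
print(mean_with_outlier)
```

One unusual day is enough to pull the average from under 19 up to 30, even though no ordinary day came close to 30 on a regular basis.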
Let us try to understand in very simple terms what standard deviation is. Standard deviation is a distance, and here distance simply means deviation: it measures how far the points are from the mean. You can see the green lines on the graph; each one is the distance of a point from the mean. Standard deviation tells us how far all of these data points are from the mean. The reason we do this is to understand how close the data points are to the mean. Suppose I say that 18 phones were sold every single day for the past seven days, and the standard deviation is small. If the standard deviation is small, the distances are small, which means the data points are close to the average. That is good information. But if the standard deviation is high, these points are far away from the mean, which means they deviate from it. That tells the company that the average was this value, but the points had a large deviation, which is interesting information to add. Let's understand how to calculate the standard deviation. It is very simple. We just need to calculate these green distances. For example, to get this green distance, I subtract the mean from this value of 5, which is 5 minus 18. I will calculate this for all of the points. Some of these differences are negative, and we are not interested in the sign, because standard deviation is a magnitude of how far the data points are from the mean, so we take the squares of these distances. Since the standard deviation represents the deviation of all the points, we sum the squares up and write them like this in the numerator. You can see 130 by 7 here; I've taken it from the mean.
Do not use 18.57, because the calculation will become a lot harder. If you use 130 by 7, you just subtract it from 15 to get the first distance, and we square all of these distances like this. Then we add them up, and finally we divide by the number of data points that we have, which is seven. If you calculate this whole value, you get 69.387. Since we squared the numbers, we have to undo that, so we take the square root. After taking the square root, I get 8.32, and this is the standard deviation of this data. Let's try to understand what this tells us and how it improves the information. What we stated before was the mean: 18 phones were sold every single day on average in the past seven days. But there was a deviation, which I'm going to write here as plus 8. Strictly it was 8.32, so I should write that there was a deviation of 8.32, or we can say a deviation of 8 phones. Since this deviation can be plus 8 or minus 8, I have to write it as plus or minus 8. Now this is much better information. From it, a person will know that 18 phones were sold on average and the standard deviation was 8, so there was a rise and fall of sales from day to day. A lower standard deviation is good, because it means the values are close to the red line, which is the mean. If they are close, the value we report is close to what happened on each day, and the information is much better: 18 phones might be sold, give or take one or two sales, which won't matter that much.
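The hand calculation above can be followed step by step in a few lines of plain Python. This is a sketch assuming the seven sales figures reconstructed from the lesson; it divides by the number of data points, as the lesson does, which gives the population variance and standard deviation.

```python
import math

# Assumption: daily sales reconstructed from the lesson (total of 130 phones).
sales = [15, 10, 30, 20, 5, 25, 25]
n = len(sales)

mean_sales = sum(sales) / n   # exactly 130 / 7, not the rounded 18.57

# Squared distance of every point from the mean (squaring removes the sign).
squared_distances = [(x - mean_sales) ** 2 for x in sales]

variance = sum(squared_distances) / n   # average squared distance
std_dev = math.sqrt(variance)           # undo the squaring

print(variance)
print(std_dev)
```

The result comes out at roughly 69.39 for the variance and roughly 8.3 for the standard deviation, in line with the figures quoted in the lesson. Calling np.std on the same list would give the same number, since NumPy also divides by the number of data points by default.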
So that's how we calculate the standard deviation. That is its significance: it completes the information by telling us there can be an increase or decrease of this much around the average. 18 phones will be sold every single day on average, but there could typically be an increase or decrease of about 8. Now let's move on to the third one, which is the variance. Here we have taken the square root of this value. If you do not take the square root, this value is known as the variance. So here you can see this is the variance: 69.387 is the variance. Now what does variance mean? The variance is the average of the squared distances of all of these data points from the mean. Let's understand what happens if the variance is low and what happens if it is high. If the variance is low, it means that the distances of the points from the mean are small, so the points are very close to the mean. I can write here: if the variance is low, the points are close to the mean. What happens if the variance is high? The variance is high when these distances are very large, which means that the points are very far from the mean. We have the mean here and the points are scattered; they are far from the mean. Now let us see how we can use these two facts about variance in data science or machine learning. In machine learning, there is a concept known as clustering. In clustering, we try to form groups within the data.
I'm going to draw a very simple graph here. Let's consider that we have these points here, marked in black, and then these other points. So we have this whole dataset, but we have marked it into two different groups, or clusters. Recall that if the variance is low, the values are close to the mean, and if the variance is high, the values are far from the mean. Now, in order to form clusters or groups within our data, there are two conditions. The first condition is that within a group, the data elements should be close to each other. How can we ensure that? By calculating the variance within the group and checking that it is low. Similarly, if we want to create this other group, we have to ensure that its elements are very close to each other. So the first condition is that the elements within a group should be close together, and we know that we can use the variance for that. The second condition is that the values of this group and the values of that group should be far from each other. This makes sense: we want the data points within a group to be close, so that they form a cluster, but we also want the groups to be far from each other, because only then will we be able to distinguish between the two groups. We can use this concept of variance to ensure this grouping within a dataset.
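The two clustering conditions above can be illustrated with a tiny sketch. The two groups of one-dimensional points below are hypothetical values invented for the illustration, and the helper names are chosen here; this is not a clustering algorithm, only a check of the two conditions on groups that are already given.

```python
def mean(values):
    """Average of a list of numbers."""
    return sum(values) / len(values)

def variance(values):
    """Average squared distance of the values from their mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

# Hypothetical one-dimensional data, already split into two groups.
group_a = [1.0, 1.5, 2.0, 2.5]
group_b = [9.0, 9.5, 10.0, 10.5]

# Condition 1: points inside each group are close together (low variance).
within_a = variance(group_a)
within_b = variance(group_b)

# Condition 2: the groups themselves are far apart (distance between means).
between = abs(mean(group_a) - mean(group_b))

# Treating all points as one group gives a much larger variance.
combined = variance(group_a + group_b)

print(within_a, within_b, between, combined)
```

A small variance inside each group together with a large gap between the group means is what makes the two clusters easy to tell apart.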
This is very important, and you will only understand it if you know how the variance behaves: if you increase the variance, the points move further from the mean and further from each other. That's one thing. Now let me move on to the last and very important concept, which is known as the normal or Gaussian distribution. For that, I will first remove this graph. All right, so let's try to understand what the normal or Gaussian distribution is. Before that, we need to understand the meaning of distribution. A very simple example: let's say I have ten chocolates and there are a few people, and I distribute those ten chocolates among them. This is the plain English meaning of distribution, and it is the same here. In a distribution, we take x, which is the inputs, and map them to what we call y, the outputs, within a range. Let's say we have a range of 0 to 1. I have these data points and I'm distributing them within this range by plotting them. For that I need a function f(x), which takes these inputs and makes sure that the outputs lie within this range. This is just an example to explain what a distribution is. To distribute the elements, we use different types of distributions. If the output, the range, is a probability, it is known as a probability distribution. Now let's understand the Gaussian distribution, which is also known as the normal distribution. We have to understand it using a graph. The graph I'm going to draw is a representation of this function which I have here. You can see that f(x) is 1 divided by the standard deviation times the square root of 2 pi, multiplied by e raised to the power of minus half times the square of x minus the mean divided by the standard deviation: $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}$.
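The formula just read out can be written as a small function. This is a sketch in plain Python; the function name gaussian_pdf is chosen here for illustration, and the inputs use the rounded mean of 18 and standard deviation of 8 from the sales example.

```python
import math

def gaussian_pdf(x, mean, std):
    """Value of the normal (Gaussian) density at x for the given mean and standard deviation."""
    coefficient = 1.0 / (std * math.sqrt(2.0 * math.pi))
    exponent = -0.5 * ((x - mean) / std) ** 2
    return coefficient * math.exp(exponent)

# Density at the mean and one standard deviation above it.
density_at_mean = gaussian_pdf(18.0, mean=18.0, std=8.0)
density_one_std_away = gaussian_pdf(26.0, mean=18.0, std=8.0)

print(density_at_mean)
print(density_one_std_away)
```

The function is largest when x equals the mean, because the exponent is then 0, and it falls off symmetrically on both sides.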
So this is the function which represents the Gaussian distribution. Here you can see the symbol sigma, which represents the standard deviation, and this value, mu, represents the mean. So if we have the mean and the standard deviation, we can use the Gaussian distribution. The x here represents the data points that we have. If I supply the data points as x, the function distributes the inputs over a certain range. Whatever value f(x) gives me, I plot, and it always falls in a range which we call the outputs. Now let's understand the steps to create the graph of the normal distribution, which is the graph of this function. The first step is to mark a value at the center of this axis, and this value is the mean. 18.57 is the mean; for simplicity I'm just taking the whole value, 18. I'm going to use mu to represent it. The second step is to create more markers by adding and subtracting the standard deviation. The standard deviation is 8.32, and again I'm going to take the whole value of 8. If I add it to the mean, I get 26. If I subtract 8 from the mean, I get 10. This step is the standard deviation, represented by sigma, and sigma is equal to 8 in our case. These markers are known as the first standard deviation. The third step is to calculate the second standard deviation, which is also simple. We just need to add the standard deviation to this number: 26 plus 8 is equal to 34. Here we have 34, and I have to do the same on the other side.
I subtract 8 from 10, and the value will be 2. We can continue like this on the graph. These markers were the first standard deviation, and these are the second. Similarly, we can create more standard deviations by repeatedly adding the standard deviation to, and subtracting it from, the mean. So the question arises: what are we trying to do with this graph? What is its purpose, and how are we going to use it in data science? Let's go back to our example. We stated that the number of sales on average was 18, with an increase or decrease of 8 mobile phones. Here we have the mean and the standard deviation. What we want to analyze is this: if the standard deviation increases, how does that affect the points? Are they going to come closer to the mean or move further from it? So we take the mean, the standard deviation and these points, and supply them to this function. It is observed that you get a graph like this. Step four is to draw the graph. This is the y-axis, which runs from a low value to a high value. Down here we have a less likely situation, and up here a highly likely situation. It means that if the value of the function f(x) is higher, the point is more likely, and the function is highest close to the mean. So the whole point is that we want to understand from this graph how the data points deviate from the mean: are they close to the mean or far from it? This graph will help with that. Now let's draw this graph. So 18 is the mean.
So I'm going to draw a dotted line here. This is the first standard deviation, so I'm going to draw another dotted line here like this. So this was our first standard deviation and this is the mean. Now let's try to draw the curve of f(x). It is observed that a bell-shaped curve appears. It goes like this: near the first standard deviation it starts rising steeply, it goes up like this, and when it reaches the mean it starts going down like this, and then it flattens out. This is known as a bell-shaped curve, and it is the graph of f(x). Let's see the important facts about this curve. It is observed that when you supply these inputs, the mean and the standard deviation, and get this curve, about 34% of all the data points lie in this region which I'm marking here. 34% lie here and 34% lie in the other half. In total, about 68% of all the data points lie within the first standard deviation. What does this mean? If I take this value of 10, it means that about 68% of the points have sales between 10 and 26, close to the mean. Now let us try to understand what happens if I take a value which lies on the mean. Let's take a look at the formula. If the value is equal to the mean, which is 18.57, then x minus the mean is 18.57 minus 18.57, which becomes 0. Minus half multiplied by 0 is 0, so we have e raised to the power 0, and e raised to the power 0 is equal to 1. What we get is f(x) equal to 1 divided by the standard deviation times the square root of 2 pi when the value of the data point is equal to the mean. The value of 1 by the square root of 2 pi I've already calculated.
The value of one by under root of two pi is approximately equal to 0.4, which is a constant here. So I can write that at the mean, the function f(x) is equal to 0.4 divided by the standard deviation. You can see that the standard deviation and the function f(x) are inversely proportional to each other. Since they are inversely proportional, as the value of the standard deviation increases, the value of f(x) will decrease. You can see from the curve that as the standard deviation increases, the graph goes down. This is an important point, and it basically makes sense, because the standard deviation is a measure of the typical distance of the points from the mean. If you increase that distance, then it becomes less likely that a point is close to the mean. So that is why this function has a lower value. Now let's try to understand one more graph, which is a very interesting one. Let us assume that instead of eight we have a standard deviation of two. Let's just assume that the value came out to be two. Since two is less than eight, it means that the points are much closer to the mean. So let's try to plot this on the graph and let's see whether our logic of the points being closer to the mean holds true in the graph or not. We know that the steps for drawing the graph are simple. We have to add the standard deviation to the mean and subtract it from the mean. So here, 18 plus two will become 20, and 18 minus two will become 16. Now if I draw it, the graph will go like this, and the graph will be flat. But when it reaches the first standard deviation, it starts increasing like this. Now the question is, will this curve go lower and then down, or will it go higher and then down? The answer to this question is in the logic itself.
There are two pieces of logic that explain this. The first one is that, as we observed, 68% of the points are going to lie here, within the first standard deviation. So obviously, if you shrink this region, you will have to raise the curve to accommodate those 68% of values. That is the first logic that you can infer. But the most important, common-sense logic here is that if you decrease the standard deviation, it means that the points are closer to the mean. If they are already closer to the mean, then the curve is going to get higher, because it is highly likely that the points are close to the mean. So the curve will become like this: it will go up, then go down again to the first standard deviation, and again it will go flat. This is all about these important topics, and you will use these concepts in machine learning. You will use variance in machine learning to see how the groups are scattered. You will use it in regularization also, when you study the problem of overfitting. Basically, that's all for this video. Thanks for watching.
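The comparison between a standard deviation of eight and a standard deviation of two can be sketched as follows. This is only an illustration under the assumption that the curve is the normal density, with the rounded mean of 18 from the example; `normal_pdf` is a hypothetical helper, not something from the lecture.

```python
import math


def normal_pdf(x, mean, sd):
    """Normal (Gaussian) density f(x); assumed to be the function in the lecture."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


mean = 18  # rounded mean from the sales example

for sd in (8, 2):
    low, high = mean - sd, mean + sd   # boundaries of the first standard deviation
    peak = normal_pdf(mean, mean, sd)  # height of the curve at the mean
    print(f"sd={sd}: first standard deviation {low} to {high}, peak {peak:.3f}")

peak_wide = normal_pdf(mean, mean, 8)
peak_narrow = normal_pdf(mean, mean, 2)
print("narrow peak is higher:", peak_narrow > peak_wide)
```

With the smaller standard deviation the first standard deviation shrinks from 10 to 26 down to 16 to 20, and the peak at the mean is higher, which matches the reasoning in the lecture.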