Build a Data Analysis Library from Scratch in Python | Ted Petrou | Skillshare

Build a Data Analysis Library from Scratch in Python

Ted Petrou, Master Data Science with Python


Lessons in This Class

57 Lessons (7h 23m)
    • 1. Course Overview (2:24)
    • 2. Pandas Cub Examples (13:43)
    • 3. Downloading the Material from GitHub (2:27)
    • 4. Opening the Project in VS Code (2:53)
    • 5. Setting up the Development Environment (8:22)
    • 6. Test Driven Development (7:06)
    • 7. Installing an IPython Kernel for Jupyter (13:13)
    • 8. Inspecting the init File (6:07)
    • 9. Importing Pandas Cub (6:08)
    • 10. Manually Test in a Jupyter Notebook (8:10)
    • 11. Getting Ready to Start (1:52)
    • 12. Check DataFrame Input Types (20:05)
    • 13. Check Array Lengths (6:18)
    • 14. Convert Unicode Arrays to Object (11:01)
    • 15. Implementing the __len__ Special Method (9:25)
    • 16. Return Columns as a List (6:04)
    • 17. Set New Column Names (10:30)
    • 18. The Shape Property (3:27)
    • 19. Visual Notebook Representation (11:29)
    • 20. The values Property (3:21)
    • 21. The dtypes Property (9:52)
    • 22. Select a Single Column (5:37)
    • 23. Select Multiple Columns (3:49)
    • 24. Boolean Selection (9:38)
    • 25. Check for Simultaneous Selection (7:43)
    • 26. Select a Single Cell (8:43)
    • 27. Select Rows as Booleans, Lists, or Slices (10:14)
    • 28. Multiple Column Simultaneous Selection (5:55)
    • 29. Column Slices (8:02)
    • 30. Tab Completion for Columns (3:30)
    • 31. Create a New Column (13:54)
    • 32. head and tail Methods (3:26)
    • 33. Generic Aggregation Methods (8:41)
    • 34. The isna Method (5:05)
    • 35. The count Method (4:50)
    • 36. The unique Method (5:50)
    • 37. The nunique Method (3:29)
    • 38. The value_counts Method (8:42)
    • 39. Normalize value_counts (3:24)
    • 40. The rename Method (4:13)
    • 41. The drop Method (3:46)
    • 42. Non-Aggregation Methods (12:07)
    • 43. The diff Method (9:19)
    • 44. The pct_change Method (2:49)
    • 45. Arithmetic and Comparison Operators (16:47)
    • 46. The sort_values Method (11:06)
    • 47. The sample Method (10:47)
    • 48. Pivot Tables Part 1 (12:55)
    • 49. Pivot Tables Part 2 (8:15)
    • 50. Pivot Tables Part 3 (3:56)
    • 51. Pivot Tables Part 4 (8:40)
    • 52. Pivot Tables Part 5 (8:22)
    • 53. Automatically Add Documentation (7:53)
    • 54. String-only Methods (14:05)
    • 55. The read_csv Function Part 1 (10:25)
    • 56. The read_csv Function Part 2 (7:56)
    • 57. Conclusion (4:54)

Students: 5

Projects: --

About This Class

Build a Data Analysis Library from Scratch in Python targets those who want to immerse themselves in a single, long, and comprehensive project that covers several advanced Python concepts. By the end of the project, you will have built a fully functioning Python library that can complete many common data analysis tasks. The library is titled Pandas Cub and has functionality similar to the popular pandas library.

This course focuses on developing software within the massive ecosystem of tools available in Python. There are 40 detailed steps that you must complete to finish the project. During each step, you are tasked with writing code that adds functionality to the library. To complete a step, you must pass the unit tests that have already been written for it; once you pass all of them, the project is complete. The nearly 100 unit tests give you immediate feedback on whether your code completes each step correctly.

There are many important concepts that you will learn while building Pandas Cub.

  • Creating a development environment with conda

  • Using test-driven development to ensure code quality

  • Using the Python data model to allow your objects to work seamlessly with built-in Python functions and operators

  • Building a DataFrame class with the following functionality (see the code sketch after this list):

    • Select subsets of data with the brackets operator

    • Aggregation methods such as sum, min, max, mean, and median

    • Non-aggregation methods such as isna, unique, rename, drop

    • Group by one or two columns to create pivot tables

    • Specific methods for handling string columns

    • Read in data from a comma-separated value file

    • A nicely formatted display of the DataFrame in the notebook
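A minimal sketch of what using the finished library looks like, based on the demos shown later in the class; the file path, column names, and parameter names are taken from those demos and are illustrative rather than definitive:

    import pandas_cub as pdc

    df = pdc.read_csv('data/employee.csv')    # read a simple CSV file
    df[['department', 'salary']]              # select subsets with the brackets operator
    df['salary'].mean()                       # an aggregation method
    df.isna()                                 # a non-aggregation method
    df.pivot_table(rows='department', values='salary', aggfunc='mean')
    df.str.count('department', 'a')           # string-only methods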

In my experience, many people learn just enough of a programming language like Python to complete basic tasks, but never develop the skills to complete larger projects or build entire libraries. This course provides a challenging and exciting project for students who want one; it will take serious effort and a long time to complete.

This course is taught by expert instructor Ted Petrou, author of Pandas Cookbook, Master Data Analysis with Python, and Exercise Python.

Meet Your Teacher


Ted Petrou

Master Data Science with Python


My name is Ted Petrou and I am the author of Pandas Cookbook, a highly rated text on performing real-world data analysis with Pandas. I am also the author of the books Exercise Python and Master Data Analysis with Python. I am the founder of Dunder Data, a professional training company focusing on teaching mastery of data science and machine learning so that you can produce trusted results.


Class Ratings

Expectations Met?
  • Exceeded!
    0%
  • Yes
    0%
  • Somewhat
    0%
  • Not really
    0%


Transcripts

1. Course Overview: Hey everybody, and welcome to the overview video for the course Build a Data Analysis Library from Scratch in Python. My name is Ted Petrou and I'm going to be your instructor. I founded a company called Dunder Data, where we specialize in teaching data science and machine learning, and I am the author of Pandas Cookbook, Exercise Python, and Master Data Analysis with Python. A quick course overview: the goal is to build a fully functioning data analysis library similar to pandas. It is a single, long, comprehensive project that covers advanced Python. We will follow test-driven development to ensure code quality: there are 40 challenging steps where you write the source code, and to complete each step you must pass unit tests. There are nearly 100 unit tests throughout the whole project. A bit about Pandas Cub, the library, and the functionality it will possess: it will have a DataFrame class with data stored in NumPy arrays; it will read in data from comma-separated value files; the DataFrame will have a nicely formatted display in the Jupyter notebook; you will select subsets of data with the brackets operator; it will use special methods defined in the Python data model; it will implement aggregation methods such as sum and min, and non-aggregation methods such as isna and unique; it will use pivot tables to group by one or two columns; and it will have specific methods just for string columns. The target student is someone who desires to immerse themselves in a larger, more comprehensive project, who wants to learn more advanced Python topics, and who wants to learn basic software development. You must be comfortable with the fundamentals of Python; this is not a course for beginners. It is going to be a challenging, fun, and very long project, but it will be worth it: you will build an entire library from scratch in Python, and it will be a powerful way to learn data analysis. With that said, please come join the course; I hope to see you soon. 2. Pandas Cub Examples: In this video we look at some specific examples from the Pandas Cub library, which is the final product you will produce at the end of this project. I have a Jupyter notebook open to show these examples. The first thing I do is import pandas_cub into my namespace, aliased as pdc (pandas, by convention, is aliased as pd). One of the first things I will show you is how Pandas Cub reads in data from an external file. Pandas Cub has a function called read_csv. It has the same name as the function in the pandas library, but it is far simpler: it has only one parameter, a string giving the location of the CSV file, and it can read only very simple comma-separated value files. There is an employee.csv file in the data folder, and reading it gives us a DataFrame. As an analyst, it is important to have a visual representation of the data, so we will also learn how to format a display of the DataFrame inside a Jupyter notebook.
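The calls just described look roughly like this; a sketch, with the data/employee.csv path taken from the video:

    import pandas_cub as pdc

    # read_csv takes a single argument: the path to a simple CSV file
    df = pdc.read_csv('data/employee.csv')

    # evaluating the DataFrame as the last expression in a cell
    # renders its formatted representation in the notebook
    df.head()  # just the first five rows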
If we put the DataFrame variable name df in a cell by itself, the notebook displays its representation, with the columns and rows nicely laid out. There is also a head method that returns just the first five rows; I like to use it to shorten the output and avoid polluting the notebook with lots of rows. Our DataFrame will also be able to select subsets of data with the brackets operator, the universal mechanism Python gives developers and users for selecting subsets of data. Many kinds of functionality are available through the brackets operator in Pandas Cub. The simplest is selecting a single column by giving its name as a string: for instance, to select the department column, put the name of that column as a string inside the brackets, and you get just that column (head is used again to show only the first five rows). Not only can you select a single column, you can select multiple columns by giving a list: for instance, to retrieve the department and salary columns, pass a list of those column names inside the brackets. You can also select rows and columns simultaneously, with rows selected by integer location. For instance, to select rows 10, 50, and 30 (they do not have to be in any order) along with the department and salary columns, give the brackets operator both the rows and the columns, separated by a comma; this returns three rows and two columns, corresponding to those row locations and the department and salary columns. The DataFrame will also select with slices: rows 10 through 20 along with the same columns, or columns selected numerically with a slice, say from the second column to the end. There are other combinations available, but I want to show you one more: boolean selection, a very necessary operation. Boolean selection selects data by value, not by column name or row location. For instance, if I am interested in only those employees with a salary greater than 100,000, I first create a boolean mask and save it to a variable, say filt (for filtering), and output its first five rows with head. I can then pass filt into the brackets, and it selects only the rows for employees with a salary greater than 100,000. You can also combine boolean selection with column selection: select just the department and salary columns for those rows greater than 100,000, and call head to see the first five rows.
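A sketch of the selections narrated above; the exact bracket semantics are defined step by step later in the project, so treat these forms as illustrative:

    # single column: pass its name as a string
    df['department']

    # multiple columns: pass a list of names
    df[['department', 'salary']]

    # rows by integer location and columns by name, separated by a comma
    df[[10, 50, 30], ['department', 'salary']]

    # slices work for rows, and numerically for columns
    df[10:20, ['department', 'salary']]
    df[10:20, 2:]

    # boolean selection: filter rows by value
    filt = df['salary'] > 100_000
    df[filt]
    df[filt, ['department', 'salary']].head()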
That covers subset selection with Pandas Cub. What you will learn in the next section is how to use special methods defined in the Python data model. I am not going to show you the special methods themselves here, but I will show you what they allow you to do. Take the built-in len function, which is built into Python itself: we will cover how a special method can be defined within your class so that your object, the DataFrame in this case, works with len, returning the number of rows. You will also learn how a special method allows us to use the plus operator; in this case, I add 10,000 to everyone's salary, and the special method is required so that our DataFrame understands how to work with the plus operator. There are many other special methods that allow us to interact with Python's functions and operators, and we will encounter them several times during this project. Moving along, the next thing our DataFrame will be able to do is aggregate its columns. Many aggregation methods will be implemented. An aggregation method, by definition, summarizes a sequence of values with a single number. Typically you aggregate numeric columns, and here there is only one numeric column, salary: to find the mean salary, use the mean method, which aggregates the column and returns a single value. Some aggregation methods also work with strings: you don't typically find the minimum of a string column like department, but it is allowed, and the string highest up in the alphabet is Health and Human Services. The aggregation methods include sum, mean, median, max, and several more. Our DataFrame will also have non-aggregation methods, meaning methods that do not return a single value per column. Looking at the data again, take the unique method, which operates a little differently than in pandas: it returns a list of DataFrames, one for each column, each holding the unique values of that column. The result looks a little strange at first because it is a list, so to actually see the unique values, use the brackets to pull out one particular DataFrame: index 0 gives the unique values of the department column, index 1 the race column, then gender, and salary, which has many unique values. That is one example of a non-aggregation method. Let's move on to the next piece of functionality, grouping. It is very important that our DataFrame be able to group by at least one variable; ours will group by exactly one or two columns, whereas pandas can group by any number of columns. What Pandas Cub offers for grouping is the pivot table.
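A sketch of the special-method, aggregation, and unique examples above; which columns each aggregation accepts is pinned down later in the project:

    # __len__ lets the built-in len function work: the number of rows
    len(df)

    # __add__ lets the plus operator work: add 10,000 to every salary
    df['salary'] + 10_000

    # aggregation methods collapse a column to a single value
    df['salary'].mean()
    df['department'].min()   # alphabetical minimum works for strings

    # unique returns a list of one-column DataFrames
    dfs = df.unique()
    dfs[0]   # unique values of the first column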
Let's look at the head again so we can see the columns. The pivot_table method allows us to group by one or two columns. If we want to find the average salary for every combination of department and race: the grouping columns come first and go into the rows and columns parameters (department and race); the values parameter names the column whose values will be aggregated (salary); and the aggfunc parameter says how to aggregate, here by taking the mean. This returns a pivot table: the rows hold the unique values of department, the column names are the unique values of race, and inside are the mean salaries for each combination. That is grouping by two columns; you can group by a single column as well by simply getting rid of the columns parameter, which finds the mean salary of every department. So that's grouping. The last major piece of functionality Pandas Cub will have is very specific functionality for strings, which are typically handled differently than numeric columns. There is a str accessor, just like in pandas. Say we want to use the count method; it looks a little different than in pandas because you pass in the column name. To count the number of occurrences of the letter 'a' in every value of the department column, we call count with the column name and the letter, and it returns a one-column DataFrame with the number of 'a's for every single value in the department column. There are many string methods to implement, the same string methods that are available for regular Python strings; pressing Tab after the accessor shows all the string methods our string columns will be able to use, and there are quite a few of them. That wraps up the examples for Pandas Cub. 3. Downloading the Material from GitHub: In this video, we download the material from the GitHub repository where it is located. The URL will be in the description below, and I have already navigated to it in my browser. This repository page contains all of the instructions and all of the material necessary to complete the project; in fact, previous videos have already covered some of the material on this page. All we do in this video is download the material so that we have a copy on our local machine. If you are familiar with Git and have it installed, you can go ahead and clone the repository; I won't assume that you do, so instead we simply download it. If you click the green button on the right-hand side of the page, a block that says Download ZIP will appear. Go ahead and download the ZIP; it should finish quickly, because the file sizes are not large at all. It is a small library, and it has gone into my Downloads folder.
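Before continuing with the downloaded material, here is a minimal sketch of the pivot_table and str calls demonstrated in Lesson 2 above; the parameter names (rows, columns, values, aggfunc) follow the narration, and passing the aggregation function as a string is an assumption:

    # average salary for every combination of department and race
    df.pivot_table(rows='department', columns='race',
                   values='salary', aggfunc='mean')

    # group by a single column: drop the columns parameter
    df.pivot_table(rows='department', values='salary', aggfunc='mean')

    # string-only methods live under the str accessor and
    # take the column name as their first argument
    df.str.count('department', 'a')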
The download attaches the word "master" (the master branch) to the name, so you don't have to worry about that; the full name will be pandas-cub-master.zip. Go ahead and unzip it. When you do, you will see the contents of the entire repository. You can open the PDF of the README, which contains the full instructions found on the GitHub repository homepage, so you can read the instructions as a PDF instead of referring to them online. Once you have the contents of the repository unzipped, you should move them out of the Downloads folder; I am simply going to put the folder into my Documents folder. That's all I wanted to cover in this short video. 4. Opening the Project in VS Code: In this video, we open the project in VS Code. VS Code stands for Visual Studio Code, a free, open-source code editor distributed by Microsoft. It was rated the number one code editor in the 2018 Stack Overflow Developer Survey, so it is powerful and popular, and it is an editor I like to use. In the last video we downloaded all the Pandas Cub material from its GitHub repository onto our local machine, and in this video we open that material with Visual Studio Code. I already have VS Code downloaded and installed, with an instance running. This tutorial will not cover the ins and outs of Visual Studio Code; I will merely open the project with it and continue with the rest of the steps. My particular VS Code has several extensions installed, and I will link to all the extensions I use for my development environment. From the opening screen you see when no project is started, I open the pandas-cub folder that I placed in my Documents folder. Here are the contents of that folder, the exact contents downloaded from GitHub. When you first begin this project, you will want to open the README.md file. It contains the underlying Markdown of all the instructions you need to follow in order to complete the project. Raw Markdown is not particularly pleasant to read, so Visual Studio Code provides a Markdown preview; I make it much more readable by placing the preview to the side. Keep this Markdown open at all times during the project: it is your roadmap for completing it, and every step has detailed instructions on how to complete it, so you will definitely want it open to follow along. This is the end of this tutorial, and we begin creating the development environment in the next video. 5. Setting up the Development Environment: Hey everybody, and welcome to the next installment of the project to build a data analysis library from scratch in Python. This video covers setting up the development environment. In the previous video we opened our project in VS Code, along with the README Markdown file.
This file contains all of the instructions necessary to complete the project, and you should keep it open at all times while you are working on it. If you scroll through, you will see that we have already covered the first few sections, and soon you will reach the section on setting up the development environment, which is what this video covers. So what exactly is a development environment? It is simply an environment where you develop software and where that software gets executed. I always recommend creating a new environment whenever you start a new project. A new environment is isolated from every other part of your file system and from any previous installations of Python and their environments. When we create a new environment, we know exactly which libraries are in it and which version of Python is in it, giving us more confidence that our code will work with the current specifications; and we can replicate the environment on other systems so that our code works there as well. I will use a tool called conda to set up the development environment. Conda is better known as a package manager, but it can also create environments; if you do not have the conda tool, you will need to download it. Conda is not the only tool that creates environments, but it is a popular one and the one I am going to use. The company that creates conda is called Anaconda, well known for the Anaconda distribution, a distribution of Python packages that contains most of the popular data science libraries. Whenever you install conda, you automatically get an environment called base, so there is already an environment there for you; but for development we want our own environment, which we are about to create. There are a few ways to create environments with conda; I am going to use a YAML file, this environment.yml file, which I open and place to the side. It is a simple text file that gives instructions on how to create the environment. First is the name of the environment, which we simply name pandas_cub. Then, under the dependencies section, is the version of Python: I pin the version to Python 3.6. We also install pandas, jupyter, and pytest. We are not strict about the versions here; conda will install whatever the latest versions are, so our package is not so fragile that it depends on very specific versions of these libraries. It will also install all of the dependencies of pandas, jupyter, and pytest, so many other libraries get installed. The main dependency of pandas is a library called NumPy, which is a dependency of many of the scientific computing packages in Python. With that said, we will use this YAML file along with conda's environment-creation command, sketched below.
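A sketch of the environment file and the conda commands the rest of this lesson walks through; the contents match the video's description, but the exact file extension (.yml versus .yaml) and any pinned versions in the real repository may differ:

    # environment.yml
    name: pandas_cub
    dependencies:
      - python=3.6
      - pandas
      - jupyter
      - pytest

    # in the terminal: create, inspect, activate, and deactivate
    conda env create -f environment.yml
    conda env list                # the active environment is starred
    conda activate pandas_cub
    conda deactivate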
If you go to View and click on Terminal, you can open a terminal right inside VS Code and issue commands from there; conveniently, it opens in our current working directory. You will see, in parentheses preceding the prompt, the name of the currently active environment; here I am in base. To create a new environment from the YAML file, enter the command conda env create -f environment.yml. There are many things you can do with conda; it is a complex, flexible tool for managing packages and creating environments, and this is just the one command that creates an environment from a file. I have already created this environment, because it can take a minute or two to download all of its contents. Note that creating a new environment does not put you inside of it: it is not yet what's called active. There is always one active environment, and right now that is base. You can also see where the environment was downloaaded: it is inside the envs folder within the Anaconda installation, under pandas_cub. This is a completely isolated area of your file system that will not be contaminated by any other installation or environment on your machine. The command conda env list lists all of the environments on the machine; you can see I have a few others besides base and pandas_cub. The current environment always has a star next to it, along with the location in your file system where that environment lives. Base is current, and we want to activate pandas_cub, so issue the command conda activate pandas_cub. Now that I have activated it, pandas_cub precedes my command prompt, and running conda env list again shows the star has moved to pandas_cub. So that creates the environment and activates it. You should use only this environment to develop this library. Once you are done with a session, deactivate it: conda deactivate is the command to return to your default environment, which will very likely be base. That is how you get started creating and activating an environment, and we will always be developing within the pandas_cub environment. That does it for this one. 6. Test Driven Development: In this video we cover test-driven development. We ended the last video by creating our environment with the conda tool. We did deactivate the environment at the end, but I have gone ahead and reactivated it; make sure pandas_cub is always active whenever you are working on this project. If you scroll down a little in the README, you will see that the next section is on test-driven development with pytest, and that is what this video covers. So what exactly is test-driven development?
It is the idea that you write tests first: whenever you begin a project, or whenever you add a new feature to your software, you write the tests first. These will be failing tests, and then you go back and write the code that passes them. You do things a little in reverse from how you might normally approach a problem. The idea is that if you pass the tests, you can feel fairly confident that your code will keep working in the future. That is what test-driven development is all about: writing tests first, then writing the code that passes them. Now, all the tests have already been written for you, and there are about 100 tests you must pass in order to complete the project. They are all in one file, test_dataframe.py, which resides in the tests folder. If you look over on the left-hand side at all the files in this project, you will see the tests folder; go ahead and open test_dataframe.py. It is not necessary to understand every last thing in it, and you will not be editing this file whatsoever, but it is important to understand its structure and how the tests are created. The file contains a number of classes: for instance, there is a TestDataFrameCreation class, and underneath it a TestSelection class. The individual tests are the methods inside each class: every time you see def creating a method within a class, that is a test. So test_input_types is one whole test, the array-length check is another test, the Unicode-to-object check is another, and there are several tests within the TestDataFrameCreation class. It is these individual methods that are the actual tests, not the classes themselves; the classes simply subdivide the project into groups of related tests. We use the pytest library to test our code, through the command-line tool it comes with, also named pytest: give it the location of the tests and it runs them all. If we type pytest followed by the location of test_dataframe.py, it runs every test in the project, and if you do this correctly they will all fail: a screen full of red letter F's, with a report that all 96 tests failed and none passed. The way this project works is that at each step you write a little bit of code and then test that particular code. pytest does not force you to run every test in the file: you can run only the tests in one particular class by running the same command followed by two colons and the class name, for instance ::TestDataFrameCreation to test just that class, as sketched below.
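A sketch of the pytest invocations described here; the file, class, and test names follow the video, and this double-colon selection syntax is standard pytest:

    # run every test in the project
    pytest tests/test_dataframe.py

    # run only the tests in one class
    pytest tests/test_dataframe.py::TestDataFrameCreation

    # run one single test
    pytest tests/test_dataframe.py::TestDataFrameCreation::test_input_types

    # automated test discovery: pytest alone finds the tests itself
    pytest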
Run that way, pytest executes only the nine tests in that particular class, and again they all fail. You can also run a single test by appending two more colons and the name of the one test you want: here we run test_input_types, get one red F, and have one failure. We have failed all of the tests so far, and we will have to pass every single one to complete the project. pytest also has the idea of automated test discovery. There are rules for it, predicated on the names of the files, folders, classes, and tests within those files; if you are interested in the rules for automated test discovery, you can click the link in the README and read more. Thanks to discovery, it is not actually necessary to write out the exact file location: calling pytest by itself collects all 96 tests, all currently failing. We will use pytest frequently throughout the rest of the project to see whether we have written code that correctly passes the tests. That's it for this one. 7. Installing an IPython Kernel for Jupyter: In this video, we install an IPython kernel for Jupyter. In the last video, we ran all the unit tests in this project with the excellent pytest library. We ran those tests down on the command line, inside the pandas_cub environment we created in a previous step: pytest went into the tests folder, in particular the test_dataframe.py file, and ran all the tests. Now, the only Python code and libraries available to this project are those contained within the pandas_cub environment: whatever was downloaded and installed during environment creation is what is available to us. For instance, we installed pandas, which also brought in NumPy, so those libraries are available; we did not install the scikit-learn library, so it is not. That isolation, an area with only the particular packages installed into it, is one thing environments give us. In this video, we will launch a Jupyter notebook. I particularly like Jupyter notebooks for testing code and experimenting with how my DataFrame class looks and behaves; it is a great environment for quick feedback and rapid iteration. You do not have to use Jupyter notebooks for this; you can use the IPython shell started from the terminal, but I like notebooks, so I want to teach you how to hook Jupyter notebooks up to your environment. Now, interestingly enough, if you launch a Jupyter notebook from the command line, even though you are in the pandas_cub environment, you will not automatically be connected to that environment: Jupyter is actually disconnected from the machinery that runs the code.
The machinery that runs the code is called a kernel, and we are going to install a particular kernel that connects directly to our pandas_cub environment; that is exactly what we do in this video. To get started, before even launching a Jupyter notebook, I launch the IPython shell with the command ipython. It is not apparent, just from running it, where the Python executable is actually running; the prompt says pandas_cub, but there should be a way to verify this, and there is. The sys module, part of Python's standard library, can give us information about where our Python is running: its executable attribute tells us the location of the Python binary that is actually running. Here we can see that we are inside the environments folder, inside pandas_cub, with the Python binary right there; so we have verified that when we run IPython, we really are within the pandas_cub environment. I did not install scikit-learn, so if I try to import it, that simply will not work. Let me exit this IPython shell and deactivate the environment to get back to my base environment, which does have scikit-learn installed, and run ipython again so you can see the difference. I import sys again and output sys.executable: in my base environment there is no envs folder in the path; it goes straight to the binary folder where Python is located. You can see that Python executes from a completely different part of your file system depending on which environment you are in. Let me get out of here and activate pandas_cub again. Now, the interesting part happens when I launch a Jupyter notebook, which I do with the command jupyter notebook. It opens in my browser, where I have already created a test notebook that I will use to show how I manually test; but first we have to hook the conda environment up to the Jupyter notebook. The notebook is very short and none of its cells have been run; the first cell prints that same sys.executable path. Executing it gives a probably surprising finding: you launched a Jupyter notebook from your pandas_cub environment, yet the output says you are not in the environment you thought you were in. And if you try to import scikit-learn, it works, because scikit-learn is in my base environment even though it is not in pandas_cub. So what we need to do is shut this down: I close the notebook and shut Jupyter down, because I have to issue one command to install a new IPython kernel just for this particular environment.
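A sketch of the verification and the install command; the kernel name and display name come from the video, and the ipykernel invocation is the standard one its documentation describes:

    # inside IPython: verify which Python is running
    import sys
    sys.executable   # ends in envs/pandas_cub/bin/python when the env is active

    # back in the terminal, with pandas_cub active:
    # install a kernel tied to this environment
    python -m ipykernel install --user --name pandas_cub --display-name "Python (Pandas Cub)"

    # list the kernels Jupyter knows about
    jupyter kernelspec list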
This command is found within the documentation; it is not magic, and I have a link to it right here. I copy and paste it: it is a command that begins with python -m ipykernel install, installing the IPython kernel just for the user; the name of the kernel will be pandas_cub, and the display name is just something extra that helps you find it more easily. You can make the display name whatever you want; it is what you will see in the Jupyter notebook, here "Python (Pandas Cub)". This is the command you need to run to create the kernel that executes inside your pandas_cub environment. A kernel is simply the program that runs and interprets your code: it takes the code you write in the Jupyter notebook, finds the correct interpreter, and executes it. So let's run the command. It says it installed the kernelspec pandas_cub in a directory, and if you run jupyter kernelspec list you get a list of all the kernels available to you: before we had just the python3 kernel, and now we also have pandas_cub. Now, if we launch Jupyter again with jupyter notebook and open the test notebook, executing the first cell one more time still says we are not in the right environment, and this is correct: what you need to do is go to the Kernel menu at the top, click Change kernel, and you now have a list of choices, pandas_cub or Python 3 (what we are currently in). When you choose pandas_cub, the kernel restarts, so any variables in the notebook will be lost; you can see it say "restarting" over here, and then it is ready. Re-executing the cell now shows that we are in pandas_cub, and if I try to import scikit-learn it does not work, because scikit-learn is not in this environment. That is mainly what I wanted to show, but now that the notebook is hooked up to Pandas Cub, I save it before exiting, and I exit the whole Jupyter notebook system. Having set the kernel for that notebook, the notebook carries metadata that informs Jupyter which kernel it uses: if I open the test notebook again, it automatically puts me in the pandas_cub kernel, as rerunning the first two cells confirms. You only have to set it up once per notebook. Of course, you can always change the kernel back to whatever you want, and you can have more than two kernels; you are not limited to two. One more thing: when you start a new notebook, you are given a choice of which kernel to start it with, and that is how you set the kernel for a new notebook.
From now on, this notebook opens with the pandas_cub kernel unless you manually change it again from the Kernel menu. That is how you install a new kernel for the Jupyter notebook. It is unfortunate that this does not happen automatically; a couple of other libraries try to help, but they are not as robust as this command, so this is the command I suggest using to make sure you have hooked your Jupyter system up to the correct environment. In the next video, we will manually test our code, our DataFrame, in the Jupyter notebook. 8. Inspecting the init File: In this video, we inspect the dunder __init__.py file. We opened the test_dataframe.py file and ran all the tests with pytest, but that is not the file you will edit for the project. A single file contains the source code for Pandas Cub, and that is the file you will edit. If we take a look at the File Explorer, you will see that there is a pandas_cub directory, and in that directory a single Python file: the __init__.py file. I use the terminology "dunder", which is special to Python and simply stands for double underscore: it is a file whose name begins with two underscores and ends with two underscores. This file contains all of the source code for Pandas Cub; it is the only file you will need to edit throughout the whole project, and the file you will keep open for the rest of it. Let me give myself a little more room in the editor and cover in more detail what is actually inside this file. As you can see, it is not a blank file: about 700 lines of code have already been written. This is simply a shell of the program that has been written for you. One main class has been defined, the DataFrame class, and within it many dozens of methods have also already been defined. You will not define any other classes or any new methods within a class; the only thing you will do is edit the bodies of some of the methods of the DataFrame class. If you scroll down a little from the top (I will talk more about the dunder init method itself in a later video), you start to see the word pass in the body of several methods. pass is a keyword in Python that simply means do nothing; the numerous methods where it appears are the ones you will end up completing. There is something else you will notice in almost every single method: a triple-quoted string directly underneath the method name. These are technically called docstrings, a shortened way to say documentation strings: literal Python strings where you explain how your method works. We are going to use a specific style of docstring, the NumPy docstring guide.
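A sketch of that docstring style, using the head method the video examines next; the exact wording in the project file may differ:

    def head(self, n=5):
        """
        Return the first n rows

        Parameters
        ----------
        n : int

        Returns
        -------
        DataFrame
        """
        pass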
The README links to the NumPy docstring guide, and we use only a few of its recommendations. The guide breaks the string into different sections that help the user. The first section has no title: it is simply a description of what the method does; for instance, the head method has the description "Return the first n rows". Each subsequent section, as you can see, has a title with a row of hyphens directly underneath it. The Parameters section lists every parameter and its type: the head method has a single parameter n, which it expects to be an integer. There is one more section, Returns, which gives the type of object that gets returned from the method. Those are docstrings, and we will see them elsewhere as they pop up. They provide help in the notebook and are specifically meant for our users to understand how to use these methods. So this is the file we will edit and spend essentially all of our time in for the rest of the project, __init__.py, and I will talk more about what makes this file special when it comes to importing it into our namespace in the next video. That is all I have about this particular file; you will become very accustomed to it, since it is the only file we will edit. 9. Importing Pandas Cub: In this video, we import Pandas Cub. In the last video, we inspected the dunder __init__.py file, which contains the DataFrame class. In this video, we see how to actually import our own library into our workspace so that we can use it. It is great to have the file open and ready to be edited, but I want to show you how the import statement works and how to get access to this DataFrame; I want to explain the machinery behind it. We could do this in a Jupyter notebook, but I find it easier to keep everything in plain view inside VS Code, so I will use the IPython shell; we will see it again in the notebook in the next video. I run ipython right here at the top level of pandas-cub-master, at the same level as the data and images folders, pandas_cub, and so forth. With the shell open, we can verify, as in a previous video, that we are indeed operating within the pandas_cub environment; that looks good. Now I run import pandas_cub as pdc. That worked and completed with no errors, and I want to go into the details of what happens when we write it. Python follows a strict set of rules whenever you write an import statement.
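A sketch of the inspection the video performs next in the IPython shell; the __version__ attribute name is an assumption based on the narration:

    import pandas_cub as pdc

    import sys
    sys.path         # the list of directories Python searches, in order
    # the empty string '' in that list stands for the current
    # working directory, which is where pandas_cub lives

    pdc.__version__  # names defined in __init__.py are reachable via pdc
    pdc.DataFrame    # including the DataFrame class itself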
When you write an import, Python searches a specific sequence of directories in which to find the library. Those directories are stored in the path variable of the sys module. If you look at sys.path, you will see it is a Python list of directories that Python looks through one by one, in order, until it finds the library you imported; if it does not find the library, it raises an error. Going through the list, you reach an interesting entry that is just an empty string: it represents the current working directory, and that is exactly where pandas_cub is located. You can output the current working directory with pwd (an IPython command, not Python code) and even use ls to list its contents, just like in a normal shell; you can see that pandas_cub is indeed found there. Now, you might be thinking: how can you import a directory? The answer is that you can; Python allows it, but if the directory contains a dunder __init__.py file, Python imports that instead. It would not quite make sense to import a bare directory, a thing with no file and no contents; so when there is an __init__.py file, that is what gets imported, and that is exactly what pdc is referencing: this particular __init__.py file. I will write the import statement again here (it does nothing new; I just want it on screen): pdc references the contents of that file, so all of the names in it are available to us through pdc. For instance, skipping over numpy since it is less interesting, I can get the version, which was defined in the file; every definition, everything with an equals sign in the file, has been imported and is accessible through pdc. DataFrame has also been defined, so I can reference it without instantiating it; there is also one function further down and another class defined, which we will get to later. That is exactly what happens when you run the import statement, and we will see it again in the notebook, but I wanted to cover specifically what happens and the rules for importing a directory. The file name is special: if it is not named __init__.py, it will not get loaded; you must name it __init__.py for it to be loaded when you import the directory name. Those are the special rules Python has created for imports. That wraps this one up; I hope you enjoyed it. 10. Manually Test in a Jupyter Notebook: In this video, we manually test our DataFrame in a Jupyter notebook. We have already formally tested our DataFrame, or at least all of Pandas Cub, with the pytest library, but in this video I want to show you how to manually test your DataFrame in a Jupyter notebook. Often you will want to really see how your DataFrame behaves in the way you would actually use it.
Since much data analysis is done in a Jupyter notebook, it is a good idea to exercise your DataFrame just as if you were using it to do data analysis. So let's open a Jupyter notebook so I can show you how I work with the DataFrame as I am building and testing it; I open the test notebook. First, we can see that we are using the pandas_cub kernel, and we verify again that we are in the pandas_cub environment. Next is a very important piece I want to show you: a magic command called autoreload. This is not Python code; it is a magic command, something that only works in IPython and Jupyter notebooks, not when executing plain Python. This particular magic reloads any changes to your files back into the notebook. Normally, say you are working in the __init__.py file and make a change: if the notebook is already open, that change will not be reflected unless you restart the kernel, and then you would have to re-execute everything you had done up to that point. What autoreload does is let you change some code and, without restarting the kernel, see the changes reflected as you work. This saves a tremendous amount of time when developing: make some changes, visually inspect what is going on, make more changes, and iterate like that. It is quite a useful magic command, so I execute it, then import our libraries. NumPy is used simply to help us make our DataFrames. pandas_cub, in its current state, is just the shell program, with essentially nothing implemented so far, although it imports properly, as we saw in the previous video. pandas_cub_final is what the final version of Pandas Cub looks like; I include it so you can play around with the finished library and see how it works when complete. I have also imported pandas, so you can experiment with the pandas DataFrame, since ours is essentially meant to emulate it. After executing that cell, the next cell creates three DataFrames, one from each of the three libraries, each constructed from a dictionary of strings mapped to one-dimensional NumPy arrays; we will see how that works later in the project.
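A sketch of those setup cells; the variable names and data values here are illustrative, not taken from the video, and the constructor signature follows the narrated description:

    %load_ext autoreload
    %autoreload 2

    import numpy as np
    import pandas_cub as pdc
    import pandas_cub_final as pdcf
    import pandas as pd

    # a dictionary of strings mapped to one-dimensional NumPy arrays
    data = {'name': np.array(['Niko', 'Penelope']),
            'salary': np.array([90_000, 110_000])}

    df = pdc.DataFrame(data)         # current, mostly empty shell
    df_final = pdcf.DataFrame(data)  # completed reference version
    df_pandas = pd.DataFrame(data)   # pandas itself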
NumPy will be used simply to help us make our DataFrames. pandas_cub, at its current state, is just a shell program: nothing has been implemented thus far, it's basically empty, although it imports properly, and we saw how to import it in the previous video. pandas_cub_final is what the final version of pandas cub will look like; I put it in here so you can play around with the finished version and see how it works when it's complete. And I've also imported pandas, so you can experiment with the pandas DataFrame, since our DataFrame is essentially meant to emulate it. Okay, let's execute that cell.

In the next cell we create three DataFrames: one from pandas_cub, one from pandas_cub_final, and one from pandas. Each uses the same dictionary of strings mapped to one-dimensional NumPy arrays, and we'll see how that works later on in the project. So I instantiated three DataFrame instances, one for each of the libraries I imported that have DataFrames, and now I'll output the visual representation of each DataFrame in the notebook. `df` is simply pandas_cub in its current state, and there's actually no representation whatsoever, so you will not be able to see anything just yet. Again, it's just an empty shell: all you get is the address in memory where the object currently resides, with actually no visible data in it. `df_final` is from pandas_cub_final, and `df_pandas` is what pandas itself produces. The end result is that our DataFrame will look very similar to what pandas displays, and I'll show you how to produce that representation when we get to that step. For now, I really want to focus on the process of iterating back and forth between the source code and the Jupyter notebook, where you can actually see how things are working.

So let's go back to the source code and make some changes so I can show you what autoreload does. Here's pandas_cub. Whenever you instantiate the class, the `__init__` special method is called. This method is actually already implemented for you, and you will not be editing it, but I do want to add something here just so you can see what happens when we make a change. I'll add a simple print call at the end of this initialization method, which always runs whenever you instantiate the class. Back in the notebook, you would normally have to restart the kernel for this change to be reflected, but since we ran autoreload, it detects the change and applies it automatically. I'm not going to re-import anything; I'll simply run the instantiation statement again, by itself in its own cell, without re-executing anything else. And you can see, very nicely, that the change has propagated and is reflected in the output. Working this way helps you prototype and write code quickly without having to go back and forth restarting things. That's why I'm showing you this: it's something I use, I believe it's very valuable when you're developing, and it gives you a way to iterate on your ideas quickly. So that's how you manually test your DataFrame in a Jupyter notebook. It lets you see whether your results match; instead of just getting results from pytest, you can go into the notebook and see how your DataFrame is actually working, beyond the automated testing that pytest provides. Okay, that's it for this one. I hope you enjoyed it. Thanks.

11. Getting Ready to Start: In this video, we are getting ready to start pandas cub. There are just a couple of things I'd like to cover before we begin editing the `__init__.py` file and completing the steps. In the very last video, we added one line to `__init__.py`, that print call, so let's erase it and bring the file back to its original state. Again, this is the only file you will be editing for the entire project. One other thing: the answers to all the steps are provided, and there are 40 steps. If you look over here, you'll see they begin at one, and if you go all the way down to the bottom you'll see there are 40. The final version of pandas cub is available in the pandas_cub_final directory, as a full file with the exact same name, `__init__.py`. You can look at that to see the code that passes all of the tests.
I only recommend looking at it once you've at least attempted to complete each step, not beforehand; but it is the answer key, and you can use it to compare your code against the code I wrote. Okay, so that covers all the necessary setup for the project. In the next video we begin with step 1, which is checking the DataFrame constructor input types. All right, see you there.

12. Check DataFrame Input Types: In this video, we are going to begin coding. This is where the real fun begins, and this is step 1 out of 40. In this step we will check the DataFrame constructor input types. Let's hop over to VS Code, where we have our `__init__.py` file open. Again, this is the only file that will be edited throughout the entire project, and the project has been created so that the methods you'll complete come in order as you go down the file. You won't have to hop around; just work vertically down and complete the methods as you come across them.

So, step 1: check the DataFrame constructor input types. Whenever our users construct an instance of this DataFrame, they pass a single parameter named `data`. We're going to enforce that `data` is a dictionary of strings mapped to one-dimensional NumPy arrays, and that's what this first step does: it checks that `data` is in fact such a dictionary. The strings will eventually be the column names of our DataFrame, and the NumPy arrays will be the values of those columns.

The `__init__` method is the special method Python calls to initialize the object upon creation. This method is actually already complete: it makes several other method calls, and you will not have to edit `__init__` itself. Instead, we'll edit some of the methods it calls, and in this step that's the `_check_input_types` method down here, the only one we'll be editing in this video. The first thing to notice before we get started is the single underscore preceding `_check_input_types`. That denotes a private method, one not meant to be accessed by the user, only internally within this class. Now, in Python there is technically no such thing as a private method, meaning our users can call it if they so choose; but instead of a keyword, we use this naming convention to denote that it's strictly for internal use and not a public method for our users. Tools like the Jupyter notebook respect the convention: by default they won't even show these private methods in tab completion, because a name beginning with a single underscore is not meant for public use.

So here is exactly what `_check_input_types` is going to do. Number one, it raises a TypeError if `data` is not a dictionary. Number two, it raises a TypeError if the keys of `data` are not strings; the keys must be strings. And then another TypeError if the values are not NumPy arrays.
And lastly, we're going to enforce that our users only give us one-dimensional NumPy arrays. So that's what this method needs to do, and once we ensure the `data` parameter follows these four bullet points, we can see if we pass the test.

Let's delete `pass` and complete the first part: raise a TypeError if `data` is not a dictionary. We write `if not isinstance(data, dict):`. `isinstance` is a function that checks whether an object is of a particular type, and VS Code is nice here, popping up a help menu with the documentation. You give it an object first, in this case `data`, and then the class to check against; you can also give it a tuple of classes, asking whether the object is an instance of any one of two or more classes rather than just one. So we check whether `data` is a dictionary (`dict` is the class name), and if it is not, we raise a TypeError. The `raise` keyword is how you raise an error, and we give the user a message: data must be a dictionary.

That completes the first part, and we can now test it in the notebook. I told you it's good to manually test things to see whether they're working before running the formal tests, so let's do that. I'm going to split this terminal. My face is somewhere on the right-hand side, so let me move this up; I'll activate the environment and launch the Jupyter notebook from one pane, and I'll issue the other commands in the pane that's covered by my talking head, which is okay. We can also minimize the left-hand side file explorer, which gives us quite a bit more screen room; we don't need it during this coding session. Back in Jupyter, let's open a notebook. None of the code has been executed yet, and it's important to run autoreload first, so let's run that. There's a default DataFrame in here that we'll use for testing later, but what we want to test right now is what we just coded: if the input is not a dictionary, it should raise that TypeError. Let me give myself a little more room. If I pass it, say, 10 (ten is not a dictionary), I should get the TypeError I just created, and indeed it says data must be a dictionary. Great, fantastic. That is how I check my code in the Jupyter notebook; it lets me iterate pretty quickly. Now, if I do give it a dictionary, a single-item dictionary, and try pandas cub again, I get no error at all, just the current representation, which is not very useful at the moment. Okay, that's fine. Now let's finish the other parts, one at a time. All right, let's keep going.
If the code reaches the next line, we're guaranteed that `data` is a dictionary. Now we need to check that all of its keys are strings, which means going through the keys one by one, since there can be multiple keys in the dictionary. So we write a loop. The way to iterate through a dictionary and get back both the key and the value is with the `items` method; `data` is now guaranteed to be a dictionary, so `for key, value in data.items():` iterates through every key-value pair in it. Then, `if not isinstance(key, str):`, meaning the key is not a string, we raise a TypeError with a message like: the keys of `data` must be strings.

Let's run this again. With a string key we get no error, but if the dictionary has an integer key, say 10, it raises the TypeError (the autoreload automatically reloads the new code). So I just triggered that error: the keys of `data` must be strings; I need a string key here. Okay, good. So that passes: it's a dictionary with all keys as strings, in this case just one key-value pair. Great, that takes care of this one.

Next it says: raise a TypeError if the values of `data` are not NumPy arrays. Even once the key check passes, we still need to check whether the value is a NumPy array. Inside the same loop: `if not isinstance(value, np.ndarray):`. I have already imported the NumPy library; it's the first line in the file, up at the top. If you're not familiar with NumPy, you might not know the name of its underlying array class, but it is `ndarray`. If the value is not an ndarray, we raise a TypeError, and I use backticks in the message to signify code: values of `data` must be NumPy arrays. Okay, good; I think that takes care of that one.

Back in our notebook: here's our dictionary, which I'll just call `data` so it maps exactly to what we have. Right now its value is an integer, so this should trigger the error we just wrote, and it does: values of `data` must be NumPy arrays. Great. So let's make it a NumPy array: I'll define `arr` (I've already imported NumPy up there) and make it a two-dimensional array with three rows and two columns, just so you can see what this array looks like. So now we know for sure it is two-dimensional. Now if we run this, it should pass the test, except I did not actually wire it up right the first time; `data` did not get written properly. `data` needs to be a dictionary with a string key whose value is the NumPy array. With that fixed, it's valid input for my constructor at this point in time. Okay, the last part of this step: raise a ValueError if the values of `data` are not one-dimensional.
So we need to check that each value is a one-dimensional array, and in this particular example, `arr` is two-dimensional, so this is not going to fly. We need to check the number of dimensions. This time we're not checking an instance, so it's no longer a TypeError; it's a ValueError. You can raise whatever errors you want, by the way; there are a number of different error types available right out of the box in Python. But TypeError and ValueError are two of the most common errors you'll raise, and they're basically just about the only errors we will raise: either we have the wrong type, or we have the wrong value. In this case it's the wrong value; we want to check that the number of dimensions is correct.

All right, at this point we're guaranteed that `value` is a NumPy array, and, as you would not know unless you know NumPy, the `ndim` attribute returns the number of dimensions. So we write: if the number of dimensions is not one (we are forcing it to be one), raise a ValueError saying that the values of `data` must be a one-dimensional array.

Let's go back to our Jupyter notebook, rerun this, and see if we've triggered that ValueError. There we go: it says values of `data` must be a one-dimensional array. All right, fantastic. We've handled all four cases right here, and it looks like the check of the DataFrame constructor input types is ready to go. We've manually verified it; this is what I mean by manual testing. Back in the Jupyter notebook, let's create a new array, `arr1`, make it one-dimensional with some random numbers, and have `data` hold that instead of the two-dimensional array. Just so you can see, right here it is a single-dimensional array, and now the constructor gives no error at all, and it does not.

Now, just because we've passed the manual test does not mean we've completely passed the automated testing with pytest, and that's what we need to do. We are going to run a single test. It's contained in the test_dataframe.py file, in the TestDataFrameCreation class, in the test_input_types method. Let's look at that test: I'll open the tests folder and open test_dataframe.py; under TestDataFrameCreation is test_input_types. This is the test we need to pass, this one right here: pytest is going to run this method and see if it passes. You can just copy and paste the command into your terminal. If you've passed, it will say it collected one item, show a single green dot, and report one passed; if you fail, it will say one failed. This passed, and we're happy with it, so we'll close out of that. `_check_input_types` has passed our test, and that's what's required to continue on to the next step.
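Putting the whole method together, here's a minimal sketch of `_check_input_types` at this point, reconstructed from the walkthrough above, with the error messages as approximations of the ones dictated:

```python
import numpy as np


class DataFrame:

    def __init__(self, data):
        self._check_input_types(data)
        # ... the remaining constructor steps are covered later ...

    def _check_input_types(self, data):
        # 1) data must be a dictionary
        if not isinstance(data, dict):
            raise TypeError('`data` must be a dictionary')
        for key, value in data.items():
            # 2) keys become column names, so they must be strings
            if not isinstance(key, str):
                raise TypeError('keys of `data` must be strings')
            # 3) values become columns, so they must be NumPy arrays
            if not isinstance(value, np.ndarray):
                raise TypeError('values of `data` must be NumPy arrays')
            # 4) each column must be one-dimensional
            if value.ndim != 1:
                raise ValueError('values of `data` must be '
                                 'one-dimensional arrays')
```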
I'm going to stop there; let me just recount what we did. We fired up a Jupyter notebook so we could test manually. Autoreload, one of the first things we ran, helped tremendously: we were able to make changes to our code and then go back to the Jupyter notebook and see the changes happen essentially in real time, without having to restart the kernel. That's a valuable tool. We went ahead and raised errors for four different situations, and the end result is that `data` is guaranteed to be a dictionary of strings mapped to one-dimensional NumPy arrays. All right, that's it for `_check_input_types` and step 1. I hope you enjoyed it, and I'm looking forward to the other 39 steps. We're going to check the array lengths next.

13. Check Array Lengths: In this video, we will complete step 2, check array lengths. In the last video, we completed the method `_check_input_types`. This is the very first method called in the DataFrame constructor, the `__init__` method. Within it, we check that `data`, which is provided by the user, is a dictionary of strings mapped to one-dimensional NumPy arrays. In this video, the `_check_array_lengths` method will be completed and filled out. In here, we check that all the arrays in our data dictionary have exactly the same length, because a DataFrame has to have the same number of elements in each column. We'll go one by one through the values of this `data` dictionary (again, the same `data` the user passed, which gets passed into this method), and those values are NumPy arrays; we'll check that they are all the same length.

So let's begin iterating. I'm going to use the `enumerate` function, because I'd like to keep track of which iteration we're on: if it's the first iteration, so `i` equals 0, I'll create a new variable assigned to the length of just that first array. What this does is iterate through all the values of the dictionary; the `values` method returns an iterable of just the values of your dictionary, in this case NumPy arrays. So on the very first iteration, `i` equals 0, and I assign the variable `length` to the length of the first array. If it's not the very first iteration, I test whether this `length` equals the length of the current array (`value`, again, is just the next NumPy array). If there's a mismatch between the first array's length and the current array's length, we raise a ValueError, saying something like: all arrays must be the same length. And that should do it.
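In code, the logic just described looks roughly like this (a minimal sketch, with the message wording approximated):

```python
def _check_array_lengths(self, data):
    # Every column of a DataFrame must have the same number of rows.
    for i, value in enumerate(data.values()):
        if i == 0:
            # Remember the length of the first array...
            length = len(value)
        elif length != len(value):
            # ...and compare every subsequent array against it.
            raise ValueError('All arrays must be the same length')
```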
I still have my Jupyter notebook open, so let's go back to where we left off. This was a valid way of constructing, and let me get rid of this; excuse me, I must have restarted the notebook. Okay, so I'm passing it a dictionary of a string mapped to a NumPy array; this is `arr1`. Now I'm going to create another array, but this one will have just two elements instead of three, and I'll use the string 'b' to map to `arr2`. So `data` is a dictionary with two items: 'a' mapped to `arr1` and 'b' mapped to `arr2`, and they're different lengths. This should trigger the error, and it does. Good: it says ValueError, all arrays must be the same length. And if I go back up and add just one more element to that array, so that both are three elements in length, it should not trigger the error, and it doesn't. So that's one way to complete this.

Of course, we actually have to run the unit test to formally pass and fully complete this step. Let's copy and paste the command; it runs the very next test in the TestDataFrameCreation class, in the test_dataframe.py file. Let's see if we pass, and we do: one passed, just one test ran, and the green dot signifies it passed. Okay, great. We've passed step 2, which was verifying that all the array lengths are exactly the same. That's it for this step; we're going to move on to step 3 in the next video.

14. Convert Unicode Arrays to Object: In this video, we are going to complete step 3, convert Unicode arrays to object. Back in our DataFrame class, we're still in the constructor: in the first step we ran this `_check_input_types` method, in the second step this `_check_array_lengths` method, and now we're right here. Our `data` at this point is guaranteed to be a dictionary of strings mapped to one-dimensional NumPy arrays, where all the arrays are exactly the same length. In this method, we'll be converting any arrays whose data type is Unicode into the data type called object. We haven't said anything yet about the data types of the NumPy arrays that will be inside our DataFrame; we're going to allow basically any data type, except that Unicode gets converted to object. So if our user gives us a Unicode NumPy array, we are going to convert it to an object array.

I want to spend a little time covering data types in NumPy so you can have a better idea of what they are. There's a link in the README where you can read much more about them, but I want to show a few examples here. Yes, our DataFrame will be composed of one-dimensional NumPy arrays, and every NumPy array has a data type. So, importing NumPy as `np`, let's create some arrays. First, an array of Boolean values: a simple one-dimensional array with two elements, True and False, and it has type bool. All right, let's keep going and create another array, this time some integers, using the `dtype` attribute to access the underlying data type: in this case we have a 64-bit integer. That's what the 64 is, but it is an integer. We can also create floats with NumPy: here's a NumPy array composed of floats, and retrieving the data type with the `dtype` attribute once again verifies that we have a NumPy array with a float data type, a 64-bit float. Now, where things get a little bit interesting is when we create strings.
I'm just going to reproduce this data from over here in the README. If you create an array of strings, NumPy returns a somewhat cryptic data type: it says U5. The capital, uppercase U stands for Unicode, and 5 is the maximum length of the strings you gave it; 'snake' here has five characters, so it's U5. Now, the issue with these Unicode arrays in NumPy is that they are not very flexible. For instance, say I want to overwrite 'cat', the very first element of this array, with 'elephant'. This code completes without an error, but if I look at the underlying NumPy array, you can see that it only captured the first five characters: each element of this array has a maximum of five characters. The other inflexibility: let's say we want to set an element to a missing value, something like None. That also completes without error, but if you look at the underlying data, it has turned it into the string 'None'. That's not ideal if we want to have missing values in our arrays.

So what we're going to do is use a more flexible data type called object. Object is just a catch-all data type in NumPy that allows any Python object to be in an array; you can put anything in there, with no restriction, so you can mix numbers and strings with None, and that's valid. This is actually what pandas does: pandas does not have a Unicode data type for any of its columns; it uses the NumPy object dtype. So we're copying pandas in this instance and following in its footsteps, doing that conversion so our string columns can be a little bit more flexible.

Okay, one more thing before we get there. As you saw, these data types (integers, for example) have the number of bits appended to them. What we'll do instead is use the data type's kind to identify it: a single character that identifies just its generic data type. Just so I can show you: here's the Boolean array again, and if we use the `kind` attribute of the dtype, you'll get a single-character string. Same with integers: using `kind`, we get back 'i'. The reason I'm showing you this is that it's easier to deal with the kind, just a single character, than to worry about the full dtype, which is a much bulkier object. And the last one, of course, is Unicode, whose kind is 'U'. So in our code, whenever we come across an array with a kind of 'U', we are going to convert it to object, using the `astype` method to do the conversion. Here's how: here's `d`, still a Unicode array. We call `astype` and pass in the string 'object', and that makes the conversion to an object array. If we call the result `o`, for object array, we can now put anything in it: I can assign the string 'elephant' to an element, and it will store that exact object; nothing gets cut off anymore. So hopefully I've convinced you why we're using object: it's just a simple way to make our columns of strings more flexible.
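Here's a small runnable illustration of the dtype behavior just described (the specific array contents are only examples):

```python
import numpy as np

a = np.array(['cat', 'dog', 'snake'])
print(a.dtype)        # <U5 -- Unicode, at most 5 characters per element
print(a.dtype.kind)   # 'U'

a[0] = 'elephant'     # silently truncated to fit 5 characters
print(a[0])           # 'eleph'

a[1] = None           # stored as the literal string 'None'
print(a[1])           # 'None'

# Converting to the object dtype removes both limitations
o = np.array(['cat', 'dog', 'snake']).astype('object')
o[0] = 'elephant'
o[1] = None
print(o)              # ['elephant' None 'snake']
print(o.dtype.kind)   # 'O'
```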
So let's hop down to `_convert_unicode_to_object`. There are a couple of lines of code already written here: we're going to use this `new_data` dictionary to hold all of the new data, really all of the data for our DataFrame. Instead of changing the original dictionary itself, we simply place the contents in this `new_data` dictionary and return it, and that's what eventually gets stored in the `_data` instance attribute. So we iterate through our data dictionary with `for key, value in data.items():` and do a simple check: `value` is a NumPy array, so if its kind is 'U', we assign `new_data[key]` the same array converted with `astype('object')`; otherwise, we keep the array in `new_data` unchanged, without touching its data type. That's all we do here: any time a user gives us NumPy arrays with the Unicode data type, we convert them to object.

This should work, so let's run the test at the bottom. It says test_unicode_to_object; I didn't give the entire test name in the command, so I'll type it out. Let's see if this passes, and it does, which is good: one passed, with the green dot. We've correctly converted our Unicode arrays to object. Okay, that does it for step 3, and we'll keep going and move on to step 4 in the next one.
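For reference, a minimal sketch of the loop body this step adds (the `new_data` scaffolding and return are the lines already provided in the file):

```python
def _convert_unicode_to_object(self, data):
    new_data = {}
    for key, value in data.items():
        if value.dtype.kind == 'U':
            # Unicode arrays become flexible object arrays
            new_data[key] = value.astype('object')
        else:
            # every other dtype passes through unchanged
            new_data[key] = value
    return new_data
```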
15. Implementing the __len__ special method: In this video, we're going to complete step 4, implementing the `__len__` special method. In the last video, we finished right here: we completed the `_convert_unicode_to_object` method, and its result was stored in the `_data` attribute. This is where our dictionary of data will reside for the rest of the project, so we'll be accessing this `_data` attribute frequently to find where our data is stored internally. Now, you'll notice there are a couple of other lines of code at the bottom of the constructor; those are actually already implemented for you, and we're not going to cover them right now. With that said, the `__init__` constructor method is complete: all the methods it calls are fully implemented, and there's no other code to be written for any of them.

So let's skip on down to our current step, which involves completing the `__len__` special method. This is the method we'll use to return the number of rows in our DataFrame. Being a special method means it gets called whenever the `len` function is used with our object: whenever you pass our DataFrame `df` into `len`, Python calls this `__len__` method, runs its code, and whatever it returns gets returned. This is what you use to make the `len` function work with your object. Okay, so to do this, we need to get one of the NumPy arrays in our `_data` dictionary and return the length of that array, since they all have exactly the same length. I'm going to show you a couple of ways to do this.

Number one, we'll iterate through all the values of the dictionary with a for loop and return the length of the first value we come across. So let me write a simple for loop: `for value in self._data.values():`. `self._data` is where our data dictionary is stored, and the dictionary's `values` method creates an iterable of all the NumPy arrays. Now, we don't actually have to loop through all of them, since they're all the same length; we'll just write a return statement inside the loop, and that should be good enough. We begin an iteration through the values and immediately return the length of the first one. You might wonder why we can do this. Well, `value` is a NumPy array, and NumPy arrays already work with the `len` function, so we're piggybacking off NumPy to do all the hard work for us: we're simply using the `len` function on a NumPy array.

Let's run the test here, called test_len, and indeed it passes, so we're good to go. I want to show you how this looks in a Jupyter notebook, so let me hop over there. Again, just to quickly recap, we've made three DataFrames: one for pandas_cub, one for pandas_cub_final, and one for pandas itself; `df` is the pandas_cub one. So now, when I pass `df` into `len`, what's really happening is that the `__len__` special method is being called, and it's right here; this is exactly what's being called. You can actually call it directly, `df.__len__()`. You'd typically never do this, since it's only meant to be called internally, but you can. Whenever Python sees `len(df)`, it translates it into the `__len__` method and calls it on your object. That is exactly what Python is doing.

We're good to go on to the next step, but I want to show an alternative way of doing this. You might think: why can't we just get the first array by indexing into the values with brackets and selecting it like that? Why can't we do something like that? Well, it turns out the object that `values()` returns is not a list; it's called a dict view, I believe, and it does not allow indexing. If I try to run that code, I get an error: this dict_values object does not support indexing. It's trying to index right here, but it fails. Now, we can still do this in a different manner. Going back to the values: it's an iterable object, as we saw; we can loop through all the values in a for loop. But if you don't want to actually write the for loop, you can do something a little tricky and a bit more advanced, and that is to create an iterator out of it. The iterator is the object that actually does the iterating, and we create it with the `iter` function; you can actually see here that VS Code pops up and says it will return an iterator. This is something you can manually step through using what's called the `next` function.
`iter` is a built-in function in Python that isn't used often, but we're going to use it now. To get the very first value of the iteration, we use `next`: when I pass the iterator into `next`, we get the very first NumPy array in our dictionary. Then we can just pass that to `len`, return it, and we pass the test. To unpack this once again: first of all, `self._data` is our dictionary of strings mapped to NumPy arrays. `.values()` is a dictionary method that returns the dict_values object, which does not support indexing, so we cannot use brackets there. `iter` returns an iterator, the thing you can iterate over in a for loop; this is actually what a for loop does, creating an iterator sort of behind the scenes for you. `next` is how you manually advance it: it gives you the very next item, which here is a NumPy array; we've finally reached that one array. And then `len` takes the length of it and returns the number we want, which is the length of our DataFrame. If I run the test again, you can see it passes, so we're good to go.

This is a way to do it in a single line of code without using a for loop explicitly; a sort of manual way of setting up a for loop. You can do this with anything iterable, like a list: pass a list into `iter`, and it would do the same thing, make an iterator, and `next` will give you the very first object in that list. The only reason we want to do that in this case is that we don't want to loop through all the values, and we cannot use indexing to grab the very first element. So I'll just leave it written this way and say that this step is now complete, and we can move on to step 5.
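Here's a sketch of `__len__` using the `iter`/`next` one-liner, with the explicit-loop version from earlier in the lesson shown for comparison in a comment:

```python
class DataFrame:
    # ... constructor from the earlier steps ...

    def __len__(self):
        # Grab the first column array from the dict view via an
        # iterator; every column has the same length, so any one
        # of them gives the row count.
        return len(next(iter(self._data.values())))

# Equivalent explicit-loop version shown first in the lesson:
#     def __len__(self):
#         for value in self._data.values():
#             return len(value)
```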
16. Return Columns as a List: In this video, we're going to complete step 5, return columns as a list. What we want in this step is to make the columns attribute return a list: whenever anyone accesses columns as an attribute, such as `df.columns`, we would like a list to be returned. Let's see how that looks in the final product. Back in the Jupyter notebook, with pandas cub already loaded, `df_final.columns` returns a list of the column names in order. Now, this is exactly the same functionality as pandas; pandas has its own separate object for this, not a list, but it works essentially identically. You'll notice that columns is not a method: there's nothing to execute, and you do not end the line of code with parentheses. It's simply an attribute access, and it returns a list.

To implement this, we're actually going to define columns as if it were a method. It will look just like any other method, except it will be decorated with `property`, which is built into Python itself. I'm not going to spend time in this tutorial talking about decorators, or in this particular case the property decorator; there are other tutorials that will cover that. All I'm going to say about it is that it basically transforms something that looks like a method into an attribute: we'll be able to execute all the code under it without actually putting parentheses after `columns`. That's what the property decorator does. It does quite a few more things as well, and we'll see one of them in the very next method. So basically, whenever someone accesses `df.columns`, it will throw us into this method right here, and all the code in it gets executed.

So let's simply return the columns as a list. The column names are stored as the keys of the `_data` dictionary, so all we have to do is return `list(self._data)`, forcing our dictionary to be a list. Now, you might wonder: dictionaries are composed of keys and values, so why does this return only the keys? Well, that's just the way Python created dictionaries: if you iterate through one, you don't get the keys and the values; you only get the keys. So when you pass a dictionary into `list` like this, it simply iterates through all of the keys and forces them to be a list; it will not look at the values whatsoever. You have to call `.items()` or `.values()` in order to retrieve the values of a dictionary. So it's as simple as that.

Now, one other minor note here is that the order in our DataFrame does indeed matter. Dictionaries used to be inherently unordered; before Python 3.6 they were. From Python 3.6 on, they became ordered: whatever you entered into your dictionary first is how it would be retrieved whenever you iterated through it. We're going to use this orderedness of dictionaries to our advantage. Otherwise, we would have to use something like an OrderedDict from the collections module to store the data, but now that dictionaries are ordered, we do not have to do that anymore. Whatever order the user passes the data dictionary in, we're going to assume that is the order they want the columns in, and that's the order we're going to preserve; Python does this for us automatically, without needing an OrderedDict.

This should work; let's see if it passes the test_columns test, and it does. We just implemented it with a very small number of characters. Let's go back into our notebook and test it directly over here, and we can see that pandas cub is now matching pandas. Okay, that does it for step 5, and we'll go on to step 6 next.
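A sketch of the getter as described, with one line of real work under the decorator:

```python
class DataFrame:
    # ... earlier methods ...

    @property
    def columns(self):
        # Iterating a dict yields its keys (the column names),
        # in insertion order on Python 3.6+.
        return list(self._data)

# Usage: df.columns (no parentheses) returns e.g. ['a', 'b', 'c']
```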
17. Set New Column Names: In this video, we are going to complete step 6, set new column names. In the previous video, we retrieved all the columns in our DataFrame by issuing the command `df.columns`. What we did to implement that was write columns as if it were a method that executes code, but we decorated it with the property decorator. That basically converts columns into an attribute, meaning no parentheses need to follow it to execute it as if it were a method; we can still have lots of code executed on attribute access when we use this property decorator.

The property decorator also allows us to set attributes, and that's what we're going to be doing in this step. When we issue the command `df.columns = <a new list of strings>`, what happens underneath is that another method executes. To do this we have to use another decorator, and the pattern is this: take whatever attribute you decorated with the property decorator, and use that attribute name followed by `.setter` as another decorator on a second method with the same name; here, `@columns.setter`. This method gets executed whenever our user makes an assignment statement to columns: any time you write `df.columns = something`, that triggers this particular method, and all the code under it runs.

This is all part of the property decorator's machinery. In fact, the very first method that gets defined is called the getter; this is how you get, or retrieve, an attribute with the property decorator. The next part is called the setter: the code that gets executed whenever someone makes an assignment statement. There's also a deleter, which we're not going to implement: you could define yet another columns method and decorate it with `@columns.deleter` instead of `.setter`, and it would get invoked whenever someone writes `del df.columns`, using the `del` keyword in Python.

Okay, with that said, we're now going to implement this setter method for the columns property. Whenever our user writes `df.columns = something`, that something, whatever is on the right-hand side of the equals sign, gets passed into this method as the parameter `columns`. There are a few things we need to do to complete this step, and they're all listed right here in this bulleted list. The first is that we are going to require `columns` to be a list; it has to be a list. So let's do that now: if `columns` is not a list (using `isinstance` to check its type), we raise a TypeError and say something to the effect that columns must be a list. All right, that takes care of the first one.

The next one: if the number of column names in the list does not match the current DataFrame, we raise a ValueError. We're also going to enforce that our user passes a list with the same number of columns as our current DataFrame, which makes sense. So let's do that: if the length of `columns` does not equal the length of `self._data` (remember, `_data` is our dictionary that maps the column names as strings to NumPy arrays, so the length of that dictionary is the current number of columns), we raise a ValueError here and say something like: new columns must be the same length as the current DataFrame. Okay?
Maybe that's a little wordy, and you can reword it to be a little bit different later. A clear error message is important, but at the same time, just getting something down and moving on is appropriate as well, to complete the step. Okay, the next piece: we raise a TypeError if any of the column names are not strings. We are only going to allow strings as column names. Here we can loop through them one by one, which is fine: `for col in columns:`, and if a column name is not a string, we raise a TypeError and say: all column names must be strings. Okay, that takes care of that one.

The next validation is to check whether any of the column names are duplicated; we're not going to allow duplicates. We can't allow duplicates, in fact, because we are using a dictionary to store the data, and dictionary keys can only appear once, so by that alone we're not allowed to have duplicate column names. And it wouldn't make sense to have duplicate column names anyway; it would make it very difficult to know which one you're referencing. Now, we could continue in that loop, and that's fine; you could add some logic in there to check for duplicate column names. But we're going to use a slightly clearer statement instead: if the length of `columns` does not equal the length of `set(columns)`, we raise a ValueError and say: your columns have duplicates; this is not allowed. The reason this works is that sets can only have unique values, so if there are duplicates in the columns list, the set removes them. So if the length of that now non-duplicated collection of items does not equal the original length, then we have a problem, and we raise the ValueError.

That actually completes all of the checks. The very last bullet point says to update the data; we obviously need to figure out a way to change the keys of the DataFrame. So we update `self._data` with a new dictionary, built by zipping up the columns, which have now been validated as a list of non-duplicated strings of the same length as the original, with the values, the NumPy arrays. The `zip` pairs each column name with its corresponding array, and when you give `dict` a zip object, it uses the very first sequence as the keys and the second sequence as the values. So this reassigns the data for us with the new column names.

Okay, let's run the test, test_set_columns, to see whether this is actually working appropriately. I've already written it in here, so let's see if it passes, and good, it does pass. Before we end this video, let's see how this works in the notebook. Here we are again, with `df` from some previous work. Let's go ahead and reassign `df.columns`; we'll just make it simple. Looks like there are five columns right now.
So we did that, and if we just look at the underlying data, since we don't have a visual representation yet, you'll see that the keys are now in fact a, b, c, d, e instead of the originals over here. So we were able to reset the columns correctly; our tests verified that, and in the notebook we've done some manual verification to see it as well. Okay, so that concludes step number six.

18. The Shape Property: In this video, we are going to complete step 7, the shape property. The shape property will return to our users the number of rows and the number of columns of our DataFrame as a tuple: a two-item tuple of the number of rows and the number of columns. And we're going to make this a property, just like columns was a property, so our users don't have to write `df.shape()` with parentheses following it like a method; instead, they'll just be able to do `df.shape`. Now, we could have implemented it as a method, without the property, but we're following pandas' lead, which treats shape as a property. Our DataFrame class is meant to be very similar to pandas, so we're going to follow exactly what they've done with the shape property: keep it a property and not a method.

Okay, this is actually going to be quite a simple task; we can return the two-item tuple in just one line of code. For the number of rows, we've already implemented `__len__`, which returns the number of rows, so we can just go ahead and call `len` on `self`. For the number of columns, we'll just use `len` again, this time on the data dictionary, whose length is the number of columns. Now, we could use `self.columns`, which returns the list of columns, and that would be perfectly valid too; but going to the data directly saves us a little bit of computational time, because this is exactly what `columns` returns, it just forces it into a list. There's no need to call `columns`, which calls this anyway; instead, we can just get the length of the data itself directly. It's a very small amount of time saved, but if you did have a DataFrame with a ridiculously large number of columns, converting it into a list would be expensive, so this would save some time there.

Okay, this should work. Let's run the test, test_shape, and see if we pass; indeed we pass, so everything looks good. Back in our Jupyter notebook, `df.shape` should now work: this DataFrame has three rows and five columns, and it's returned as a tuple. That looks good. Now, there's not going to be any setter for the shape property, only a getter, because we don't want our users to just set the shape, unless they are actually going to delete rows or delete columns. So we're not going to let our users set the shape with the shape property itself. This will not be settable; there will be no way to set it, only a getter here. Okay, so that does it for step 7.
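Here's a combined sketch of the columns setter and the shape property as walked through above (the message strings approximate the ones dictated in the lesson):

```python
class DataFrame:
    # ... __len__ and the getter from the previous steps ...

    @property
    def columns(self):
        return list(self._data)

    @columns.setter
    def columns(self, columns):
        if not isinstance(columns, list):
            raise TypeError('New columns must be a list')
        if len(columns) != len(self._data):
            raise ValueError('New columns must be the same length '
                             'as the current DataFrame')
        for col in columns:
            if not isinstance(col, str):
                raise TypeError('All column names must be strings')
        if len(columns) != len(set(columns)):
            raise ValueError('Column names must not have duplicates')
        # Rebuild the data dict, pairing new names with existing arrays
        self._data = dict(zip(columns, self._data.values()))

    @property
    def shape(self):
        # (number of rows, number of columns)
        return len(self), len(self._data)
```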
19. Visual Notebook Representation: In this video, we are going to complete step 8, the visual notebook representation. I want to hop right into the notebook before we do any coding. If we look back up here at the top of our notebook, we'll see that pandas cub currently has no visual representation in the notebook: whenever we try to look at the contents of our DataFrame by putting `df` as the last line of a cell, we just get that unhelpful default display, the location in memory where the DataFrame resides. `df_final` shows what pandas cub will look like when it's complete: after this step, this is what your DataFrame will look like, with a nice representation in the notebook. Pandas is rendered almost identically to pandas cub. So this is what we're trying to do: produce a nice visual representation so that our users can see what's in their DataFrame.

What we're going to do is return some HTML that represents the DataFrame in the notebook. To be clear, this representation is not how pandas or pandas cub stores the data; it's simply a sort of visual front end for our DataFrame, and it will be HTML. If we inspect the page in our browser (I just right-click and choose Inspect in Google Chrome), we can see exactly what is being returned over here. If you know any HTML, and I can zoom in right here, you'll see on the left-hand side that our DataFrame is highlighted, and it's rendered with the table tag: we're making an HTML table that has a header and a body, and it renders as such in the notebook. You can look in here at all the contents of our data: here are the column names, here's a row label (we actually have the 0 over here), then Penelope, Texas, 3.6, True, and 45. And this continues for the couple of other rows in our DataFrame, whose values we can inspect as well. So that's it: we want to somehow represent this as HTML so that the Jupyter notebook can render it nicely in our notebooks.

Let's go ahead and look at our file. The method we're going to work on in this particular step is the `_repr_html_` method. This is a special method that only gets triggered inside a Jupyter notebook. This is specific to IPython, which gives us access to it, so it does not work outside of the IPython/Jupyter system. There's a link in the documentation right here where you can find out more; IPython exposes other repr methods too, for different representations such as LaTeX or PDF, and I believe some others, which you can read about in the documentation. So what happens is this: we need to return some nicely formatted HTML from this method, as a string. We're going to return a string that contains HTML, and Jupyter will pick that string up and use it to show us our data. Okay, this is going to be quite a difficult task if you don't know any HTML at all, so in fact I'm not actually going to code this one live.
I'm going to cheat here and simply copy and paste the solution that I already have, and step through it a little, instead of trying to live-code it here. Before we get there, I want to cover a bit about the structure of the HTML I want to produce. Again, you have to know some basics of HTML (not a whole lot, just some basics) in order to complete this step. We're essentially going to create an HTML table. The table has a header, which will just contain the column names, and a body, which will just contain all the values of our data, whatever the NumPy arrays hold. `tr` stands for table row, and each individual value goes inside a `td` tag; td simply stands for table data. So it's a very simple HTML table that has a header and a body, and within the body are the rows. That's all it is.

So I'll open my file explorer, go into the answer file, and grab all of the code that's in the `_repr_html_` method of the answers. It is quite a bit of code; I'm going to simply copy it, come back over here to the current pandas_cub, and paste it right in. This is the final answer, and it will in fact work, so let me step through it just a little bit; perhaps not all of it, but enough that you get a sense of what's happening.

Inside the method I create a variable just called `html`, and this is exactly what's going to be returned: a string. Notice that I've begun the string by creating the table and then the header, starting with one row in the header; thead stands for table head. In this header I simply add the column names, one by one, to the HTML inside header tags. Now, in here you'll see that I'm using a lot of f-strings, which are available only in Python 3.6 or greater. These are formatted strings that allow you to use braces to put Python variables into a string; a popular feature. Basically this says: put the column names in here, with formatting that gives each column at least ten spaces. I go through that, and then I start going through the actual data, row by row, getting the data and putting it inside the table row and table data tags, `tr` and `td`.
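The copied solution is longer than this, but here's a simplified sketch of the same structure: a header of column names followed by a body of `tr`/`td` rows. This is an illustration, not the answer-key code; it omits the row truncation discussed next and any per-dtype value formatting:

```python
def _repr_html_(self):
    # Header: one row of <th> cells containing the column names
    html = '<table><thead><tr><th></th>'
    for col in self.columns:
        html += f'<th>{col:10}</th>'
    html += '</tr></thead><tbody>'

    # Body: one <tr> per row; a bold row label, then one <td> per value
    for i in range(len(self)):
        html += f'<tr><td><strong>{i}</strong></td>'
        for values in self._data.values():
            html += f'<td>{values[i]}</td>'
        html += '</tr>'

    html += '</tbody></table>'
    return html
```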
If you want to read through the rest of the solution, it basically just does this, and at the very end we close our tags, the body and table tags, and return that string of HTML. That's all we've done. I believe there are actually no tests for this, so we will just test it manually in the notebook. If we go back, df now has _repr_html_ completed, so all of our DataFrames have a visual representation. You could actually call this private method directly if you'd like; nothing stops you, since nothing is technically private in Python, and you can see it returns the HTML string that Jupyter uses to display your object. So it's very cool that we're able to create some HTML here and give our users a very nice representation of their DataFrame. That does it for step number eight. 20. The values Property: In this video, we are going to be completing step 9, the values property. Whenever somebody accesses values as an attribute, we are going to return a single NumPy array with all of the columns. This is exactly what pandas does: whenever you do something like df.values, it returns all of the underlying data as a single two-dimensional NumPy array, and we're going to do the same exact thing. Let's see how this works in pandas first, and then we'll implement it ourselves. If we look at df_pandas.values, you'll see that a single NumPy array with all of the columns put together is returned, and that is what we're going to do. Currently our data resides in the _data attribute, so what we need to do is collect all of those columns and stack them together, one after another, as the columns of one single NumPy array. The way to do this is to rely on NumPy: there's a column_stack function which, if you give it a tuple or really any sequence of one-dimensional arrays, will stack them one after another as columns into a two-dimensional array. Our data is in the _data attribute, but we don't want the keys, just the values. We're also going to have to force this to be a list, because the dict view we saw a few steps above is not an object NumPy can deal with by itself; once we force it, we get a list of all the NumPy arrays. We also need to add a return statement. That should do it: we can check df.values in the notebook, and it looks like everything is correct. Let's run the test for this, test_values, over in the step 9 section, and we've passed, so it looks good. So all we need here is the column_stack function, passing in a list of all the NumPy arrays, in order to return a single two-dimensional NumPy array. Note that even if we have just one column, column_stack still returns a two-dimensional array with a single column. All right, that's it for step number nine.
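To summarize the step in code, the property can be as short as the following sketch, again assuming the simplified _data layout from the previous sketch:

import numpy as np

class DataFrame:
    def __init__(self, data):
        # simplified: the real constructor validates its input
        self._data = data

    @property
    def values(self):
        # a dict view can't be indexed by NumPy directly, so
        # materialize it as a list before stacking
        return np.column_stack(list(self._data.values()))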
21. The dtypes Property: In this video, we are going to be completing step 10, the dtypes property. This video covers another property, dtypes, which will act very similarly to how the pandas dtypes property behaves. Whenever somebody writes a statement like df.dtypes, we're going to return a two-column DataFrame: the first column will contain all the column names, and the second will contain each column's data type as a string. We're going to use a provided dictionary to map each column's NumPy kind to a data type string. We talked about the kind of a data type before; it is simply a single-character string, and this dtype-name dictionary is used to translate the kind into something more readable for our users. Let's see an example before we get started. If you access dtypes on the pandas DataFrame, you get back the original column names with their data type next to each. And if we look at df_final, we can see what is supposed to be returned: a DataFrame with two columns, whose names we will fix as 'Column Name' and 'Data Type'. The old column names appear as the values of the first column, and each data type appears as a string in the second. Let's get started. First, let's create a new dictionary that will hold all of the new data, and we will use it to construct an entirely new DataFrame. Backing up a little bit: this new_data dictionary is what we're going to pass to the constructor, so we can actually go to the bottom right away and write return DataFrame(new_data). That's what it's going to look like: new_data will hold a dictionary of strings mapped to one-dimensional arrays, and we'll use the DataFrame constructor already defined above to return an entirely new DataFrame. It needs to conform to the rules we set up in the first few steps, where the keys are strings and the values are one-dimensional NumPy arrays, all of the same length. First, let's create our columns. The dictionary's keys method returns the keys, but to create a NumPy array out of them we need to formally convert them to a list and pass that to np.array. We can start this inside the dictionary definition itself: 'Column Name' maps to that array, which will be the values of the first column. That takes care of the very first column. At this juncture, you can already test whether this is working without running the unit tests: go into the notebook and see how it looks. If we look at df.dtypes in its current state, there's no error, and we have a single-column DataFrame with the old column names as the values of the 'Column Name' column. That looks good; we just need to add one more column, the data type of each column.
What we're going to do is loop through all the values in our old data. Every value is a NumPy array, and what we want is to figure out each array's kind and use that kind to look up a more readable data type name. We'll keep it simple and return one of 'string', 'int', 'float', or 'bool'; those are the only data types we allow in our DataFrame. So let's make this conversion: the lookup is a dictionary, so we put the kind in as the key, and we need a list to hold the results; we could write dtypes as an empty list and append each new string while looping through one by one. But we can combine this into a single list comprehension, so let's delete the loop and do it in one line. And maybe we can go one step further for clarity: pull the column names out into their own variable, call it col_names, and convert both of these to NumPy arrays. So, to recap: col_names is a NumPy array of all the column names, the keys of the data dictionary. For the dtypes, we first build a list of strings by looping through all the values, the NumPy arrays, and using the provided dictionary to map each kind to a better-looking string for users; then we convert that list to a NumPy array too. That looks good. We then map 'Column Name' to the column names array and, as the notes say, use 'Data Type' as the other column name, mapped to the dtypes array. We already have our return statement, so I think everything is looking good: let's go back to the notebook, rerun this, and it matches df_final, so hopefully the tests will pass. Let's walk through it one last time to make sure everything is right. The dtype-name dictionary was provided for us and maps each kind to a string that's easier for our users to interpret. col_names is a NumPy array of all the old column names, held in the keys of the data dictionary. Then a list comprehension iterates through all the values of the data dictionary, which are NumPy arrays; we get the kind of each one and do a lookup in the dtype-name dictionary, returning 'string', 'int', 'float', or 'bool'. We force that list to be a NumPy array. The last thing we do is finish creating the all-important dictionary of data, strings mapped to one-dimensional NumPy arrays, and call our own constructor, which goes back up, runs all those checks again, and returns an entirely new DataFrame. That looks good; a sketch follows below.
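Put together, the property might look like this minimal sketch; the DTYPE_NAME dictionary below is a stand-in for the one provided in the course material, and the exact column labels are taken from the description above:

import numpy as np

# stand-in for the provided mapping of dtype kind -> readable name
DTYPE_NAME = {'O': 'string', 'i': 'int', 'f': 'float', 'b': 'bool'}

class DataFrame:
    def __init__(self, data):
        # simplified: the real constructor validates its input
        self._data = data

    @property
    def dtypes(self):
        col_names = np.array(list(self._data.keys()))
        dtypes = np.array([DTYPE_NAME[values.dtype.kind]
                           for values in self._data.values()])
        new_data = {'Column Name': col_names, 'Data Type': dtypes}
        # reuse our own constructor, which re-runs all the input checks
        return DataFrame(new_data)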
Let's go ahead and run the test, called test_dtypes, and see if we get it correct. All right, it passed; one test, passed. That is it for the dtypes property and concludes step number 10. 22. Select a Single Column: In this video, we are completing step 11, select a single column. We want our users to be able to select one or more columns from our DataFrame; in this step, we're just going to let them select a single column, using the brackets operator. The selection will look like this: df, open bracket, the column name as a string, close bracket. Eventually we will allow our users to select in a number of different ways, but in this step we're only doing single-column selection; we'll get to the other column selections in a little bit. In order for these brackets to work with your object, which again is part of the Python data model built inherently into Python, you need to define the __getitem__ special method. This special method is triggered whenever somebody uses brackets directly appended to your object: whatever object is within the brackets gets passed to the very first parameter. We're just going to assign it to a variable named item; you can call it whatever you want. In this step we handle the case where a user gives us a single string, and we return a single-column DataFrame of just that column. So inside the method: if item is a string, if the user has passed us a string, we return a DataFrame of just that single column. This is actually fairly simple. We use our own DataFrame constructor, which requires a dictionary: the column name, in this case just item, mapped to a single one-dimensional NumPy array, which we get by accessing item through our _data dictionary. So here's the inner dictionary we're passing to our DataFrame constructor; we're simply creating a one-item dictionary and getting the array within it. That's all we want to do. We can see how this looks in the notebook, since I'd like to test this out manually. Here's our DataFrame, Pandas Cub in its current state. If we get the state column, it looks like that's working, and the school column looks good too. Let's go ahead and run the test, called test_one_column, and it doesn't look like that actually worked; there's no such name as test_one_column. It looks like I must have changed the name, so let's go in and correct that. Ah, that's why: it's in an entirely new class. I made a mistake here; if I had actually read the instructions, I would have known this. It's in the TestSelection class, so I had called pytest improperly.
This is a good segue to take a look at the test_dataframe.py file. The tests are broken out into different classes; I've used classes to categorize the different tests. We've finished the DataFrame-creation part of the file, and I've put all of these selection tests inside a new class called TestSelection. So we're no longer in the DataFrame-creation test class; I had called the test improperly, and now that I've found it, it passes. You will see in the notes exactly which class each test is under; any time there's a new class, I write that in the notes. I just wasn't paying attention to my own notes. Anyway, that's fixed, and it looks like I've passed the test. That completes step 11, where we select a single column from our DataFrame by placing a string within the brackets operator. 23. Select Multiple Columns: In this video, we complete step 12, select multiple columns. In the last video we looked at the __getitem__ special method: we tested whether our user passed us a single string, and returned a single-column DataFrame, assuming that string was one of the columns in the DataFrame. Now, in this video, we're going to select multiple columns with a list. This is exactly what pandas does, and we're following pandas here. The syntax looks like this: df, and then within the brackets, which are the universal selection operator for Python, we place a list of column names as strings to select just those columns. So we continue in the method: if the item the user gives us is a list, we branch off into a new case. Again, we're just going to return a DataFrame, mapping from the columns they pass us to the underlying NumPy arrays. We could write a loop over both columns and values with self._data.items(), but actually, let's take that back: we only need to iterate through item itself, exactly what the user has given us. Item here is a list, so each element, col, is a column name, and in a dictionary comprehension we make a mapping of col to self._data[col]. That's all we do, and we return it. Let's test it manually; I really like testing this out manually. Here's df again: say we want to get school and weight, and that works. What about three columns, school, weight, and state? That's looking good too. It's just one line, so let's explain it once more: we return a call to our DataFrame constructor, which takes a dictionary built by a dictionary comprehension. Each col is assumed to be a string; if someone passes bad data, it will fail in the constructor, so we don't have to worry about taking care of that case ourselves. And each column name gets mapped to the NumPy array it is already mapped to. Let's run the test, test_multiple_columns, and see if it passes; in fact, it does. We have completed this step of selecting multiple columns with a list, so that is it for step 12.
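Before moving on, here is a minimal sketch covering both of these __getitem__ branches, the single string and the list of strings; as in the earlier sketches, the real constructor performs all the validation:

class DataFrame:
    def __init__(self, data):
        # simplified: the real constructor validates its input
        self._data = data

    def __getitem__(self, item):
        # triggered by the brackets operator: df[...]
        if isinstance(item, str):
            # a single column name -> one-column DataFrame
            return DataFrame({item: self._data[item]})
        if isinstance(item, list):
            # a list of column names -> DataFrame with just those columns;
            # non-string entries will simply fail the constructor's checks
            return DataFrame({col: self._data[col] for col in item})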
24. Boolean Selection: In this video, we are completing step 13, boolean selection. Boolean selection is where we select data based on the values themselves; we're not selecting based on the column names. For instance, if we wanted to select all the rows that had a height greater than some value, that would be boolean selection. Let's take a look at this with pandas first, so we have a better idea of what I'm talking about. We go back up and do some boolean selection with the pandas DataFrame. Typically, the way we show boolean selection in pandas is to select a single column, so let's use the height column, and apply one of the comparison operators, say greater than 4. In that case we're only selecting the one row that has a height greater than 4: there are two False values and one True. We can assign this to a variable called filt and then pass filt into the pandas DataFrame, and it selects all the rows where filt is True; here, that's just the one. Let's change it a little, to 3.55, so we get two rows: two people have a height greater than 3.55. That's an example of boolean selection, and we're going to make our DataFrame work similarly. What we're going to do is check whether the user gives us a single-column DataFrame that has nothing but boolean values; if so, we will do boolean selection. Those are the requirements, so let's implement them. We make another case in __getitem__: the first case was the user giving us a string within the brackets, the second was a list, and in this case we check whether the item is an instance of DataFrame. If it is, we have to do a few other things first; here are the checks needed before we can actually process this DataFrame. First of all, we only allow one-column DataFrames here; the user is going to be forced to give us a single-column DataFrame. We already have the shape property, which returns a tuple of the number of rows and columns, so if item.shape[1] does not equal 1, forcing it to be a one-column DataFrame, we raise a ValueError saying that item must be a one-column DataFrame. If it passes that, we extract the underlying NumPy array: call it arr, the array underneath item, stored in the _data attribute under the values. Now, remember that we can't just use brackets to select a single value from there; instead, we force it to be an iterator and just take the very next element.
Again, this is just because calling values() on your dictionary doesn't return something you can index with brackets using integers like 0, 1, 2, or 3. It is iterable, though, so we make an iterator and take the very next element of it. That gives us the underlying array of the one-column DataFrame. We can only process this if it is a boolean array; its type must be boolean. So if arr.dtype.kind does not equal 'b', we raise an error; we probably should raise a TypeError, although the notes say ValueError, so we'll raise a ValueError to be consistent with the notes, and we'll make the message a little more descriptive: item must be a one-column boolean DataFrame. So if the user did not give us a DataFrame with a single boolean column, we raise an error. Now we're guaranteed that arr is a boolean one-dimensional NumPy array, so we simply create some new data, iterating through all of the columns and selecting only the rows where we have a True value. We rely on NumPy here again. In fact, we can go straight to a return statement with a dictionary comprehension: for col, values in self._data.items(), we iterate through each column/value pair of our current data, the DataFrame we want to filter, and we pass the boolean array the user gave us into the brackets of each values array, which performs the selection. This is NumPy acting for us, so you have to know a little bit of NumPy: arr is a boolean array, values is also a one-dimensional NumPy array, so values[arr] does boolean selection via NumPy and returns another one-dimensional NumPy array, mapped from the column's string name. We could break this out into two steps, assigning the dictionary to new_data and returning the DataFrame constructor of new_data; maybe that's a little easier to read than one long line. A sketch of this branch appears below. We can test this in the Jupyter notebook: here is df, the current state of Pandas Cub. We have not yet implemented the greater-than or less-than operators, but we do have a column of boolean values, so let's use that as our filter: we select the school column and assign it to filt. We should be able to pass filt in, and with True in the first and last rows, those rows should be returned; it looks like it has done just that. Everything looks good: if you give it a one-column DataFrame of boolean values, it returns the filtered rows. It's also good to check the errors sometimes, so let's select some other column, say state, and pass that in; we do in fact invoke a ValueError saying item must be a one-column boolean DataFrame. So everything is looking good, and when we give it a one-column boolean DataFrame, it does the boolean selection properly. Let's run the test for step 13, test_simple_boolean, and it looks like we've passed. All right, that does it for step number 13.
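Here is a minimal sketch of the boolean-selection branch, with the string and list cases from the previous steps elided; shape is the property from the earlier step, and the attribute layout is the simplified one used in the previous sketches:

class DataFrame:
    # ... __init__, shape, and the earlier __getitem__ branches ...

    def __getitem__(self, item):
        if isinstance(item, DataFrame):
            if item.shape[1] != 1:
                raise ValueError('item must be a one-column DataFrame')
            # a dict view is not indexable, so iterate to get the lone array
            arr = next(iter(item._data.values()))
            if arr.dtype.kind != 'b':
                raise ValueError('item must be a one-column boolean DataFrame')
            # NumPy boolean selection filters every column at once
            new_data = {col: values[arr]
                        for col, values in self._data.items()}
            return DataFrame(new_data)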
25. Check for Simultaneous Selection: In this video, we are completing step 14, check for simultaneous selection. What I mean by simultaneous selection is that we will be checking for simultaneous row and column selection from our DataFrame. It's nice to be able to select columns like we did in the previous few steps, but it would also be nice if we granted our users the ability to select rows as well as columns. So we're going to allow them to select rows and columns simultaneously by giving the brackets two items separated by a comma. In this step, we're just going to make a check to see if the user wants to select rows and columns simultaneously. Whenever you pass the brackets objects separated by a comma, the __getitem__ special method receives them as a tuple; that is the type Python uses internally to pass around comma-separated values within the brackets. So if item is a tuple, it means the user is trying to send us one or more items separated by a comma. A note here: steps 14 through 18 are optional. They are difficult, so if you want to skip them, you're perfectly okay to do so, and you won't miss out on too much. These five steps all deal with simultaneous row and column selection, and there's a lot that needs to be covered in order to give our users a complete ability to select rows and columns simultaneously. With that said, let's move on to the actual step 14. We're inside the __getitem__ special method: we first check whether the user passed us a string, then whether they passed a list, and last whether they passed a one-column boolean DataFrame. Now we check whether the user gives us a tuple. If so, we're hoping they're trying to do row and column selection simultaneously, and we're going to separate out the logic into its own method: we return the result of the _getitem_tuple method, defined right below, passing it the item. That method will be implemented in the later steps. Now, if the code reaches the end, so the item is not a string, not a list, not a DataFrame, and not a tuple, we raise a TypeError telling the user they must pass either a string, list, DataFrame, or tuple to the selection operator; I'll press Enter in the message so the line wraps nicely. So all of the branches are taken care of: if we have a string, we return a one-column DataFrame; if a list, a DataFrame with just those columns; if a DataFrame, we ensure it's a one-column boolean DataFrame and do boolean selection; if a tuple, we call _getitem_tuple, to be implemented later; and if it's not one of those four types, we raise a TypeError. Everything is covered. One last thing inside _getitem_tuple: now we're guaranteed that item is a tuple, and all we need to do to complete this step is verify that it is a two-item tuple. If you're going to pass a tuple to the brackets selection operator, it needs to be exactly two items, and if it's not, we raise, not a TypeError but a ValueError: if the length of the item is not 2, we raise a ValueError saying the item tuple must have length 2. That should do it, so let's run the simultaneous tuple test and see if it passes. Great, it passed. To recap this particular step, step 14: all we're doing is checking whether our user has given us a tuple; if so, we call the _getitem_tuple method, which we'll work on for the rest of steps 15 through 18. If it's not a tuple and not one of the other accepted types, we raise a TypeError saying you have to pass us a string, list, DataFrame, or tuple. This sets us up for simultaneous row and column selection, and we'll get to the actual selecting in the next step.
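Sketched out, the dispatching logic looks roughly like this, with _getitem_tuple still a stub at this stage:

class DataFrame:
    # ... __init__ and the earlier selection branches ...

    def __getitem__(self, item):
        # ... str, list, and DataFrame cases from the previous steps ...
        if isinstance(item, tuple):
            # df[rows, cols] arrives here as a single tuple
            return self._getitem_tuple(item)
        raise TypeError('You must pass either a string, list, DataFrame, '
                        'or tuple to the selection operator')

    def _getitem_tuple(self, item):
        if len(item) != 2:
            raise ValueError('item tuple must have length 2')
        # row and column handling is filled in over steps 15 through 18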
26. Select a Single Cell: In this video, we will complete step 15, select a single cell. We're going to allow our users to select a single cell of the data using the brackets operator. To select a single cell, the user will give us both a row selection and a column selection. In the last video, we began the _getitem_tuple method, which handles the case where a user gives us a row selection and a column selection; this is what we'll be working with, and we'll write the code within it. If the user gives us a single row and a single column selection, we will be able to process it here. The first thing this step requires us to do is create a variable called row_selection and another called col_selection, assigning row_selection to the first item of the tuple and col_selection to the second.
What we do here is simply unpack our tuple: row_selection, col_selection = item. Item is guaranteed to be a two-item tuple at this stage, which means we can use unpacking to put the first item in row_selection and the second in col_selection. Let's keep reading the notes: if the row selection is an integer, reassign it as a one-element list of that integer. Let's do that: if row_selection is an integer, we reassign it to be a one-item list of that particular integer. Next, if the column selection is an integer, we reassign it to a one-item list of the string name of the column it represents. For instance, if the user has given us 2, 3, then 2 becomes the list [2], while 3, instead of becoming the list [3], becomes the column name. To do this, we reassign it to a list, using our columns property and indexing it, self.columns[col_selection], which returns the column name at that integer position. We implemented columns earlier, so we're just taking advantage of code we've already written, and we put the result inside a list. Now, if the column selection is a string, we assign it to a one-element list of just that string; it's already okay the way it is, so we add an elif branch and just wrap it in a list, assuming the user is passing us a column name as a string. So now, after going through this, row_selection and col_selection are both lists containing one single item. Next, the notes say to write a for loop that goes through every column selection, makes a new dictionary, new_data, and keeps just the selected rows for each column. In this case, there's only going to be one row per column selection, and in fact only one column selection, but this for loop will work for all the other steps in _getitem_tuple: it's a generic loop that the remaining problems, steps 15 through 18, will all be able to reuse. So let's create the new_data dictionary; a lot of these methods create an entirely new dictionary and then pass it to our constructor. Then we iterate through col_selection, here just one member, and update the dictionary: new_data[col] should be a NumPy array, which we get from self._data[col], indexed by the row selection. This works because NumPy arrays can use the brackets operator, so within the brackets we pass the row selection: self._data[col][row_selection]. Then we return a DataFrame of this new_data. Right now, our code should only work when the row selection is an integer and the column selection is either an integer or a string. A sketch of the method at this stage follows below.
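Here is a minimal sketch of _getitem_tuple at this stage, covering only the integer row with an integer or string column; columns is the list-returning property from the earlier step:

class DataFrame:
    # ... __init__ and the columns property as in the earlier steps ...

    def _getitem_tuple(self, item):
        if len(item) != 2:
            raise ValueError('item tuple must have length 2')
        row_selection, col_selection = item

        if isinstance(row_selection, int):
            row_selection = [row_selection]

        if isinstance(col_selection, int):
            # translate a column position into its name
            col_selection = [self.columns[col_selection]]
        elif isinstance(col_selection, str):
            col_selection = [col_selection]

        # generic loop reused by all the later selection steps
        new_data = {}
        for col in col_selection:
            new_data[col] = self._data[col][row_selection]
        return DataFrame(new_data)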
Let's look at the Jupyter notebook and test this out live. Here's our Pandas Cub df. It should work if we select, say, the first row and the second column: row with index 1 and column with index 2, which is height, so this should return 3.5, and it does. That looks like it's working. Let's select a different one, using a string like 'weight' for the column, and that works as well. Everything looks like it's working, so let's run the test we need to run; this time it's called test_single_element. Good, it passed; everything is green. Let's go back and make sure we understand it. When the row selection is an integer, we make it a one-item list. If the column selection is an integer, we make it a one-item list containing that column's string name; if it's a string, we leave it as it is but still wrap it in a one-item list. The end result is that row_selection and col_selection are both lists. We then iterate through all of the columns, create a new dictionary to contain our data, and make entries mapping the column names, which are now strings, to one-dimensional arrays: we select from the original array only the chosen rows, in this case just a single row, using the brackets operator with the row selection inside. Then we call our own constructor with that new dictionary, which returns a completely new DataFrame. That's it for step 15. 27. Select Rows as Booleans, Lists, or Slices: In this video, we complete step 16, select rows as booleans, lists, or slices. In the last video, we only handled the case where our user gave us a single integer for the row selection; for the column selection, we allowed either an integer or a string. Now we're going to give our users even more choices for rows: we'll allow them to give us a single-column boolean DataFrame, a list, or a slice. Here are some examples of selection with each of those three. When I say a boolean, I mean a one-column boolean DataFrame, which looks something like the result of a comparison operator. Our DataFrame will also work with a list like [2, 4, 1], meaning rows 2, 4, and 1, or with a slice object.
A slice uses the slice notation with the colon, for example 2:5, going from row two all the way up to row five, not including five. We're not going to do anything with the column selection here; we'll allow it to remain the way it is, so passing an integer or a string is fine for now, and we will handle more complex column selections in a later step. We've handled the case where row_selection is an integer; now let's check whether it is a DataFrame. If it is, we do the same checks we did for boolean selection a couple of steps above: if row_selection is not a one-column DataFrame, we raise a ValueError, exactly as the notes say, with the message that the row selection DataFrame must be one column. If it passes this, we extract the underlying NumPy array, just as we did up above. Now that we're in this branch, we're guaranteed that row_selection is a DataFrame with exactly one column, because we checked that with shape. We use the somewhat awkward trick of getting the only value in the dictionary: an iterator over row_selection._data.values(), taking the very next element, and we reassign it to row_selection itself. We don't care about the name of the column; we just care about getting that underlying NumPy array, so row_selection is now the underlying NumPy array. We have one check to do here: it has to be a boolean NumPy array. If its dtype kind is not 'b', the same operation as before, we raise a TypeError; that's the error the notes mean to raise here, unlike the ValueError in the other spot, but the difference is fairly minor, so we keep going, with the message that the row selection DataFrame must be boolean. That takes care of the case where row_selection is a DataFrame. But what happens if it's a list or a slice? That's what the notes say next: if the row selection is not a list or a slice, raise a TypeError. So let's add another branch: if row_selection is not an instance of list or slice, we raise a TypeError saying the row selection must be an int, list, slice, or DataFrame. We won't actually change row_selection when it is a list or a slice; it's perfectly fine as it is, and our dictionary builder will work out fine, since NumPy arrays handle selection by a list, a slice, or a boolean array without any conversion. That's why we don't have to do anything in those cases; we only need to raise an error otherwise. That should handle all the cases correctly, so let's review really quickly. At the end of the last step, we handled the case where someone gave us an integer as the row selection. Now, if someone gives us a DataFrame, we raise an error if it has more than one column; otherwise, we extract the underlying NumPy array, and we only accept DataFrames whose one column is boolean, so if it passes all these tests, row_selection will be a one-dimensional boolean NumPy array. Otherwise, if row_selection is not a list or a slice, we raise a TypeError; if it is a list or slice, we keep it as it was, with no reassignment on that branch. That's all we've done here; we simply added this block of code, and we won't touch the column selection until the next step. A sketch of the updated row handling follows below.
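A minimal sketch of _getitem_tuple with the expanded row handling; the column handling and the generic dictionary-building loop from step 15 are elided:

class DataFrame:
    # ... as in the previous sketches ...

    def _getitem_tuple(self, item):
        row_selection, col_selection = item
        if isinstance(row_selection, int):
            row_selection = [row_selection]
        elif isinstance(row_selection, DataFrame):
            if row_selection.shape[1] != 1:
                raise ValueError('row selection DataFrame must be one column')
            # pull out the lone underlying array
            row_selection = next(iter(row_selection._data.values()))
            if row_selection.dtype.kind != 'b':
                raise TypeError('row selection DataFrame must be boolean')
        elif not isinstance(row_selection, (list, slice)):
            raise TypeError('row selection must be an int, list, slice, '
                            'or DataFrame')
        # lists, slices, and boolean arrays all index a NumPy array
        # directly, so those pass through unchanged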
Let's go back to our DataFrame so we can see this in action. We allowed our DataFrame to handle lists, so let's try one: here we have the list [0, 1], which gives rows 0 and 1 along with the column at index 3; you can of course put the column name there instead. We can also use slicing: slicing from one to the end works, from the beginning to one works, and from the beginning to two works. And we can use boolean selection in the rows: let's take that school column, which is a boolean column, and assign it to the name filt... and it looks like we have an error. It says the row selection must be boolean, but filt certainly looks like a boolean; this should work. Let's look at the code: the comparison should be not-equal. That was a little bug in our code, where the check fired when the array was boolean rather than when it wasn't; only if it's not a boolean should we raise the TypeError. Let's fix that, and we might have to restart the kernel and run all. Now, going back down, it does in fact work. The reason I had to restart the kernel is the autoreload extension from above, and this is a good chance to talk about it: autoreload is not 100 percent foolproof. What happens is that it messes up whenever you're checking for an instance of the class itself. When the module auto-reloads, it doesn't reset existing instances, so Python no longer recognizes that an object is an instance of the reloaded class: isinstance of DataFrame can fail because the check thinks the instance is a completely different type when it is not. That's the one time you will need to restart the kernel, when you're checking isinstance of a DataFrame. It's a bit of a bummer, but it's something that will crop up, so if you're getting weird errors that you don't think you have while manually testing, you might just need to restart the notebook. We have not formally tested this with pytest yet, so let's do that; the notes say test_all_row_selections. It looks like we're good right there, so we pass the test. Fantastic, that completes step number 16. 28. Multiple Column Simultaneous Selection: In this video, we complete step 17, multiple column simultaneous selection. We're going to allow our users to select multiple columns with a list whenever they're selecting rows and columns simultaneously.
Again, this entire section, the _getitem_tuple method, is for when the user attempts to select rows and columns simultaneously with this syntax: df, brackets, and within the brackets two separate objects, a row selection and a column selection. So far, for the row selection we've handled the cases where it is an integer, a one-column boolean DataFrame, a list, or a slice; if it's not any of those, we raise a TypeError. For the column selection, on the other hand, we can only handle an integer or a string right now. We're going to add the additional case where the user gives us a list; so far, nothing happens for a list, so let's make a specific case for it. If col_selection is indeed a list, we want to allow users to put either integers or strings in that list. The notes say to create an empty list called new_col_selection, so let's do that. The final value of new_col_selection will be a list of strings only; we want to turn any integer into a string. So we iterate through each element of col_selection and check whether it is an integer. If it is, we do the conversion and select the actual column name: we append self.columns[col], digging into the columns to find the string name, to new_col_selection. Otherwise, we take a shortcut and just assume the user has given us a string; if they haven't, we're not going to handle that case. You might want to write a little note here: assuming col is a string. So now new_col_selection is a list of strings, and since our little for loop that builds the data dictionary uses col_selection, we reassign col_selection to equal new_col_selection; that's exactly what the notes ask for. We've basically transformed col_selection by iterating through it. Our original loop from step 15 should still work: it requires col_selection to be a list of strings, while row_selection can be any of the other types we've already discussed; actually, by this point row_selection will end up being either a list, a slice, or a boolean array. A sketch of this branch follows below.
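The column-handling portion with the new list case might look like this sketch; the row handling and the final loop are elided as before:

class DataFrame:
    # ... as in the previous sketches ...

    def _getitem_tuple(self, item):
        row_selection, col_selection = item
        # ... row handling from the previous step ...
        if isinstance(col_selection, int):
            col_selection = [self.columns[col_selection]]
        elif isinstance(col_selection, str):
            col_selection = [col_selection]
        elif isinstance(col_selection, list):
            new_col_selection = []
            for col in col_selection:
                if isinstance(col, int):
                    # translate a column position into its name
                    new_col_selection.append(self.columns[col])
                else:
                    # assume anything else is already a string name
                    new_col_selection.append(col)
            col_selection = new_col_selection
        # ... generic dictionary-building loop from step 15 ...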
Let's see if this works; let's take a look in the notebook. We now allow the column selection to be a list: we can use something like ['name', 'height'], and we can even use integers here, where -1 grabs the very last column name. The row selection here is a list and the column selection is a list, and they both seem to work. We can also make the rows an empty slice, which just selects all the rows. Everything's looking good, so let's run the test, called test_list_columns, and we pass the test. Fantastic. All we did here was handle the case where the user gives us a list for the column selection when doing simultaneous selection: we turn any integers into strings, and we just assume the user was hopefully intelligent enough to give us the correct column names as strings for the rest. You could do some type checking here, but I've decided against it for simplicity. That completes step 17. 29. Column Slices: In this video, we complete step 18, column slices. We're going to handle one additional case for column selection when doing simultaneous row and column selection: the case where the column selection is a slice. All three of the selections over here on the right-hand side, where my mouse is hovering, have the slice notation for the column selection part. We're going to allow the slices to contain either integers or strings. If a column slice uses integers, it will be exclusive of the very last one, which is exactly what pandas does; for instance, this first slice goes up to but not including index three. Slices that use column names as strings, on the other hand, are inclusive of the last one: in this example, column f will be included, provided it lands on the slice's step of two.
Looking back up, we're still in the _getitem_tuple method. We've handled all the cases for row selection in our DataFrame: an integer, a DataFrame, a list, or a slice; we're done with all those, and anything else raises a TypeError. For the column selection, we've handled the cases where it is an integer, a string, or a list. Now we handle one final case: when the column selection is a slice. A slice has three attributes: start, stop, and step. I'm going to define three variables inside this branch that simply take the same names as those attributes, start, stop, and step, to make them easier to handle; that's what this code does, defining new variables to hold those attributes. Before I get any further, the notes say: if the column selection is not a slice, raise a TypeError. So let's step out of this branch and end the chain: if it's not an integer, string, list, or slice, we raise a TypeError saying the column selection must be an int, string, list, or slice. That handles that. Continuing: if start is a string, reassign it to its integer index among the columns; this will become more apparent later on. So in the case where start is a string, we reassign it to the integer index where that column sits: columns is a list, so we simply call self.columns.index(start) to find its position. Now for stop: we reassign it to the integer index again, but we add one, since we want to include the stop column when it is given as a string. So if stop is a string, we reassign it to self.columns.index(stop) + 1. We're not going to check the step; we'll assume our users give us a correct step, which should be an integer, or leave it blank, as in most cases, in which case the step size defaults to one, so we won't mess around with it. So start, stop, and step should now be integers; that's not entirely true, since step could be anything, but we'll just assume it's the default. Now we reassign the column selection as a list: we grab the columns again and slice them with start, stop, and step, self.columns[start:stop:step], which gets us the correct column selection as a list of exactly the columns we want. So we broke out start, stop, and step; if they were strings, we changed them to integers; and we used those integers in the slice notation on the columns list. That gives us col_selection as a list, which is what our little for loop requires: a list of the strings of the exact columns we want. That should work: if start is already an integer, we don't enter that branch, and it goes straight through; the same goes for stop. A sketch of this branch follows below.
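Here is the slice branch in sketch form, appended to the column-handling chain from the previous steps:

class DataFrame:
    # ... as in the previous sketches ...

    def _getitem_tuple(self, item):
        row_selection, col_selection = item
        # ... row handling and the int/str/list column cases as before ...
        if isinstance(col_selection, slice):
            start = col_selection.start
            stop = col_selection.stop
            step = col_selection.step
            if isinstance(start, str):
                start = self.columns.index(start)
            if isinstance(stop, str):
                # a string stop is inclusive, so move one past its index
                stop = self.columns.index(stop) + 1
            # start, stop, and step are now integers (or None), so we
            # can slice the list of column names directly
            col_selection = self.columns[start:stop:step]
        elif not isinstance(col_selection, (int, str, list)):
            raise TypeError('column selection must be an int, string, '
                            'list, or slice')
        # ... generic dictionary-building loop from step 15 ...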
So if we're down here and let's look at how it works in pandas. So here's pandas. In the way Pandas works is if you want to select a single column, say like height and you start typing it in. Here, I've just typed in the first three characters and you press Tab. So if you don't have to pay attention very closely and see if I can go back one character. So you press Tab. So if there's more than one option available, then it'll bring down a drop-down menu and you can go down or up in select the correct option. So I believe there's actually a bug with IPython at the moment in is showing too many options here because obviously help or hex or not column names. So that's unfortunate and it's sort of pollutes the drop-down menu, but hopefully that will get sorted out at some point. Regardless, you still have help available and you can just press Enter to select the column that you want. So there's a little bit of tab completion help Let's see if yes, So weight is a little bit more unique. So I'm right here, WE and I press Tab and I get weight completed instantaneously. So to do this, you have to pull up, fill out. This is sort of the method that special just to IPython. So it begins with a single underscore. So I, python will pick up on this and it's called ipython key completions. So when you, what neat what you need to do here and if you read the documentation in iPython, is that you're going to need to return a list here of all the possibilities that you want. I'm your users to be able to choose from whatever they're putting a key into the brackets. So this is a specificly whenever you have the brackets somewhere, when you're doing some sort of get item operation that you want to give your users choices on what to select from those brackets. So here we just want it, want our users to select the, be able to select the column names. So let's go ahead and do that. We're just going to return the column name so that which are already a list. So we're just going to return self dot columns. And that should do it. So let's go ahead and see. So here's our DataFrame. It has the same exact columns, so if we do it, okay, good. So I just press Tab and I checked it out live and it looks like it's working. So that looks that looks very good. Okay, so let's go ahead. There is a test here, Test Tab complete. And let's see if we get it right. So Tests tab complete, and everything looks good. Okay, so that does it. For Step 19. 31. Create a New Column: In this video, we complete Step 20, create a new column. The past several videos had us selecting subsets of data with the brackets. And the way Python allows us to do this is gives us this special get item method, this special method, dunder, good item. So in this video we are going to be doing is creating an entire new column, or in fact overriding an old column as well. So for instance, this is what our syntax is going to look like. We're going to be able to do something like this. Df and using the same brackets, put the name of the new column. Or it could be actually an old column and we're overriding it. But regardless, we're going to allow our DataFrame to take an, a, a single column or append a new column to it. And we're going to force our users to use a NumPy array to give us the values for the new call. So this is not done. This is what we're going to be doing an assignment statement here. So this is done not with the dunder good item, but with dunder set item special method. So here it is, over here. It's already defined. 
31. Create a New Column: In this video, we complete Step 20, create a new column. The past several videos had us selecting subsets of data with the brackets, which Python enables through the dunder getitem special method. In this video we'll create an entirely new column, or in fact overwrite an old one. The syntax will look like this: df, the same brackets, and the name of the new column (or an old column we're overwriting), with a NumPy array on the right-hand side supplying the values. Because this is an assignment statement, it is handled not by dunder getitem but by the dunder setitem special method, which is already stubbed out for us, and we're going to complete it. This method gets triggered whenever you do an assignment to the brackets operator. It receives two parameters: the key, which is the value inside the brackets, and the value, which is whatever sits on the right-hand side of the equals sign. It's our job to implement it now, so let's follow the instructions and knock the requirements out one by one. First: if the key is not a string, raise a NotImplementedError stating that the DataFrame can only set a single column. We're limiting the functionality of setitem to setting one column at a time; pandas lets you set multiple columns or even subsets of your data, but that's actually fairly complex, so we'll only handle the simple case. So if not isinstance(key, str), we raise a NotImplementedError with a message like "only able to set a single column" (you can write a better error message, but that gets us started). We won't handle the cases where the user gives us a list or an integer; we simply force our users to give us a string. Now let's do some checking on the value. If the value is a NumPy array, we raise a ValueError if it is not one-dimensional: so if isinstance(value, np.ndarray), we check the ndim attribute, which returns the number of dimensions of a NumPy array, and if it is not one, we raise a ValueError saying the NumPy array must be one-dimensional. Next, we raise a different ValueError if the length differs from the calling DataFrame: a new column had better be the same exact length as all the other columns. So if len(value) does not equal len(self), we raise a ValueError saying the length of the setting array must match the length of the DataFrame. That handles the NumPy array case and its two checks. Next, if the value is a DataFrame, we use an elif.
If we allow our users to give us a DataFrame, we're going to enforce that it has a single column. So if the value is a DataFrame and its shape does not show exactly one column, we raise a ValueError saying the setting DataFrame must be one column. Then, just as above, we raise a different ValueError if the length differs from the calling DataFrame: if len(value) does not equal len(self), we say the setting DataFrame must be the same length. If it passes those two checks, we reassign value to be the underlying NumPy array of that single column. As we've done in the past, we create an iterator over the values of the underlying _data dictionary (calling the values method on it) and grab the next, and only, item: value = next(iter(...)). That looks good. So we've handled the cases where value is a NumPy array and where it is a DataFrame; the next case is a single scalar, an int, float, bool, or string. How does that work? If you make an assignment like df['new col'] = 5, we make every row of the new column equal to that single scalar value; it's not a list or a NumPy array, just one value, repeated for every row. So if value is an instance of int, bool, str, or float, we reassign it using np.repeat. Looking at the documentation, np.repeat takes the value to repeat (a single scalar is fine) and the number of repeats, which must be an integer; we pass len(self), the length of the current DataFrame. This returns a one-dimensional NumPy array of our scalar repeated that many times. If value is none of these types (and maybe we should add a blank line to separate the key logic from the value logic), we raise a TypeError saying the setting object must be an array, DataFrame, int, bool, string, or float. At this point value is guaranteed to be a one-dimensional NumPy array: if it was an array, we kept it; if it was a DataFrame, we extracted its underlying array; if it was a scalar, we repeated it into an array. One last thing remains: handling a data type kind of 'U'.
If the array's data type kind is 'U', meaning a Unicode array, we convert it to object, just as we did in the constructor: if value.dtype.kind == 'U', reassign value to its object-typed version. That handles the case where the new column holds strings, keeping it as the more flexible object type. Finally, we make one more assignment to write in the new data: we take the key, which at this point must be a valid string column name, and assign it the value, self._data[key] = value. After all that validation work, this one direct assignment either overwrites an old column or adds a new column to the DataFrame. Let's check how this works in a Jupyter notebook, as we've done before. Looking at df, let's make a new assignment: df['new col'] = np.array([100, 99, 98]), making sure it's the same length, three. That completed, and when we look at df, we have a new column. Let's make another new column and be lazy: assign it a single string and see if this works. Good, the same string is repeated three times. Let's overwrite a column: assigning df['state'] = 5 overwrites state with all fives. Everything looks good in the manual notebook test. Back over here, we run test_new_column, and we pass the test. Excellent. That completes Step 20: assigning a new column, or overwriting an old one, with the setitem special method.
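Putting the pieces together, the whole method looks roughly like this, shown as it would sit on the DataFrame class (assume import numpy as np at the top of the module; the error messages are paraphrased, so check yours against the tests):

    def __setitem__(self, key, value):
        if not isinstance(key, str):
            raise NotImplementedError('Only able to set a single column')
        if isinstance(value, np.ndarray):
            if value.ndim != 1:
                raise ValueError('Setting array must be one-dimensional')
            if len(value) != len(self):
                raise ValueError('Setting array must be same length as DataFrame')
        elif isinstance(value, DataFrame):
            if value.shape[1] != 1:
                raise ValueError('Setting DataFrame must be one column')
            if len(value) != len(self):
                raise ValueError('Setting DataFrame must be same length')
            value = next(iter(value._data.values()))  # pull out the lone array
        elif isinstance(value, (int, str, float, bool)):
            value = np.repeat(value, len(self))       # broadcast scalar to every row
        else:
            raise TypeError('Setting object must be array, DataFrame, '
                            'int, bool, str, or float')
        if value.dtype.kind == 'U':
            value = value.astype('object')            # unicode -> flexible object dtype
        self._data[key] = value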
32. head and tail methods: In this video, we complete Step 21, the head and tail methods. These methods return either the first or last n rows of the DataFrame, where n defaults to the integer five. They're quite simple methods that won't take much work, and they have the exact same names pandas uses for the same task. Let's complete the head method, which returns the first n rows. We've already implemented the getitem special method, so we'll take advantage of it and use our own brackets with slice notation for the rows and the columns; this is exactly what we worked hard on in the previous steps. For the columns, head does no selection and keeps everything, and for the rows, we slice the first n. Let's check this out in the Jupyter notebook. I restarted the notebook and ran everything from the top, since it had been a while, to clear out the old data. Calling df.head() gives the top five rows; actually there aren't even five rows here, so we can't tell whether it works. Passing two returns the first two rows, and passing one returns the first row, so that looks good. For tail we do the same thing, but we want the last n rows, so instead of slicing up to n we slice from -n to the end, which is valid slice notation, and again keep all the columns. Testing manually: asking for the very last row returns the final row, and the last two rows work as well. Let's run the test, test_head_tail, which is still in the test selection class since this is still selecting data. It passes, and that completes Step 21. All right, fantastic.
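Both methods reduce to one line each once getitem is in place; a sketch, as methods on the DataFrame class:

    def head(self, n=5):
        # lean on the row/column selection machinery from the earlier steps
        return self[:n, :]

    def tail(self, n=5):
        return self[-n:, :]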
33. Generic Aggregation Methods: In this video, we complete Step 22, generic aggregation methods. We've finished subset selection and setting new columns, and now we'll implement about ten methods all at the same time, finally giving our DataFrame the power to do basic statistical aggregation. By the end of this step, our DataFrame will be able to compute the min, max, mean, median, sum, var, std, all, any, argmax, and argmin for every column. These work column-wise only, and they are all what I call aggregation methods: technically, an aggregation method returns a single value when given a sequence of values. Here the sequence of values is each column, so df.min(), for instance, returns a single value for every column; whichever of these methods we call, the result is a DataFrame with just one row. Each individual method is actually already implemented: if we look at min, max, mean, median, and so forth, they all have return statements, and you will not be editing any of them. Notice that each makes a call to the private _agg method, passing it a NumPy function which, I believe, has the same name as the method itself. The method we will edit is _agg. This greatly simplifies our lives: all of these aggregation methods work exactly the same way, returning a single value per column, so instead of implementing each one separately, we implement one generic method that handles them all. Here's how it works. We iterate through each column of the DataFrame and pass the underlying array to the given NumPy aggregation function, returning a new DataFrame with the same columns but a single row. Now, not all aggregations work for strings: taking the mean or median of a string fails, and NumPy raises a TypeError when that happens. Instead of letting our program error out on the columns it cannot aggregate, we simply won't return those columns; string columns drop out whenever we call something like mean or median on them. So we accept the TypeError and basically do nothing when the aggregation errors. Let's implement it. The generic _agg method is given aggfunc, a NumPy function, so NumPy again does the hard work for us. We create a new data dictionary, iterate with for col, values in self._data.items() (that's where our data is held), and inside a try block add aggfunc(values) to the dictionary. The only problem is that the function returns a single scalar, so to work with our DataFrame constructor we force it to be a NumPy array by putting it in a one-element list and wrapping that with np.array. It's a little convoluted to go through a list and then an array, but that's how it works. When the aggregation fails on a string column, NumPy raises a TypeError; this is the first time we deliberately accept an error. In the except TypeError branch we do nothing at all, making no entry in the dictionary and simply continuing to the next iteration of the loop, and at the end we return a DataFrame of new_data. Columns that can't be processed just don't appear in the returned DataFrame.
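A sketch of the method as the lesson describes it, on the DataFrame class with numpy imported as np:

    def _agg(self, aggfunc):
        # aggfunc is a NumPy aggregation function such as np.min or np.mean
        new_data = {}
        for col, values in self._data.items():
            try:
                val = aggfunc(values)
            except TypeError:
                continue                    # aggregation undefined for this column: drop it
            new_data[col] = np.array([val])  # wrap the scalar for the constructor
        return DataFrame(new_data)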
Let's test this manually in the notebook. Summing all the columns of df works, though sum is actually an operation that works on strings too, so the result looks a bit bizarre: every column has been summed. You can see what sum does for strings, it simply concatenates them; height is a float, so its values are added up; school is a Boolean, and Booleans evaluate as 0 or 1, so the two True values sum to two; and the integers are simply added as well. Now let's try an operation strings can't do. Taking the mean of every column keeps only the numeric columns (Booleans also work, since they're treated as numeric), and the two string columns, name and state, are dropped from the final DataFrame. That looks good. Back over here, there is one test for every single method, but instead of running them individually we'll run the whole TestAggregation class at once, and hopefully that works. There were 11 tests, as you can see, and all 11 passed. That completes Step 22, these generic aggregation methods.

34. The isna Method: In this video, we complete Step 23, the isna method. In this step we complete a single method, unlike the previous step where a call to one method completed eleven. The isna method takes no arguments, and all it does is return a DataFrame of exactly the same size as the original, the same number of rows and columns, but holding all Boolean values. It derives those Booleans from whether or not each value in the DataFrame is missing: True if the value is missing, False if it is not, so every column ends up a Boolean array. We'll use NumPy to help us again: the np.isnan function determines whether we have missing values. However, it does not work for string columns, so when we have a column of data type object, we instead make a vectorized equality comparison against the None object; we assume our users will use Python's None as the missing value for strings. Let's do it. Create an empty new data dictionary for the data we'll pass to the constructor, then iterate through the underlying _data dictionary of column names and NumPy arrays. We need to test whether each array is an object array, so we look at the dtype kind: if it is an uppercase 'O', which is what it will be for object, we make a new dictionary entry of value == None. This tests whether every element is None, which is why I use the phrase "vectorized equality expression": it asks whether each value in the NumPy array equals None and returns an array of all Boolean values. Otherwise, if it is not object, np.isnan works directly, so new_data gets np.isnan(value). Then we just return a DataFrame of new_data. That looks good.
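As a sketch on the DataFrame class (numpy imported as np):

    def isna(self):
        new_data = {}
        for col, values in self._data.items():
            if values.dtype.kind == 'O':
                # vectorized elementwise comparison against None
                # (deliberately == and not `is`, so NumPy broadcasts it)
                new_data[col] = values == None
            else:
                new_data[col] = np.isnan(values)
        return DataFrame(new_data)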
Let's hop over to the test notebook and see whether this works. It doesn't look like we have any missing values over here, so let's create some: put a None right there in state and a nan right here in height. Now testing df.isna(), good: we get True for the state row that holds None and True for the height row that is missing. One quirk of NumPy is that Boolean arrays cannot hold missing values, and neither can integer arrays, so school can never have missing values, and weight won't either, since it's an integer column. Let's make sure we pass our tests. These live in a new class, TestOtherMethods, so we go back up, type in that class name, and run test_isna. Great, we've passed the test, isna looks good, and that completes Step 23.

35. The count method: In this video, we complete Step 24, the count method. The count method returns the number of non-missing values per column. It is an aggregation method: one integer is returned per column, so the number of columns doesn't change, but the returned DataFrame has a single row of integers. This method actually requires the isna method, so we'll take advantage of having just implemented it and use its result as the DataFrame we calculate with. We call isna on self, the current DataFrame, which returns a DataFrame of True/False values marking whether each value is missing. Then we create our new data dictionary as usual and iterate through the columns and NumPy arrays of that Boolean DataFrame; df here is just another DataFrame we created from the current one, and each value is a Boolean array of True/False. When you sum a Boolean array, the Booleans evaluate as 0 or 1, so the sum gives the number of missing values, since we called isna. But that's not what we want; we want the number of non-missing values. So we first get the number of rows, and we simply subtract that sum, the missing count, from the total length, leaving the number of non-missing values. We add this to the dictionary under the column name, and since it's a single integer, we have to wrap it in a NumPy array to match how the DataFrame constructor works. Then we return the constructor call with the new dictionary. Let's walk through it one more time: the first step is to call isna, turning the entire DataFrame into True/False values marking whether each value is missing; we then initialize our data dictionary.
We get the number of rows of the current DataFrame, then iterate through the Boolean DataFrame created in the first line, subtracting the number of missing values, the sum of each Boolean array, from the total number of rows to get the number of non-missing values. Let's see it work manually. From before, we have a couple of columns with missing values. Running df.count(): name has no missing values, so it has a count of three; state has one missing value, so its count is two; height has one missing value, so its count is two; and school and weight have no missing values, so they count three, the total number of rows. That looks good. Running the test, test_count, we pass, which is great. This completes Step 24.
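The whole method, sketched on the DataFrame class with numpy as np:

    def count(self):
        df = self.isna()            # DataFrame of booleans, True where missing
        length = len(self)
        new_data = {}
        for col, values in df._data.items():
            # values.sum() counts the missing entries; subtract from the total
            new_data[col] = np.array([length - values.sum()])
        return DataFrame(new_data)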
36. The unique Method: In this video, we complete Step 25, the unique method. The unique method returns all the unique values for each column. Specifically, it returns one DataFrame per column, each with a single column holding only the unique values: a five-column DataFrame yields a list of five one-column DataFrames. This is a little different from pandas, where the DataFrame has no unique method; only a Series, which is essentially a one-column object similar to a DataFrame, has unique, and it works on a single column of data. Our unique method will work on entire DataFrames and return a list of one-column DataFrames. Let's get started. First, create an empty list to contain the DataFrames, then iterate through our _data dictionary as we normally do, with col and value. Within the loop we create a new DataFrame on every iteration: we build a dictionary for its data, relying on NumPy's unique function. We call np.unique, passing the one-dimensional NumPy array stored in our data dictionary, which gives us a single-element dictionary keyed by the same column name, and we append a new DataFrame of that data to the list by calling our constructor. One other note: if there is a single column, we just return the DataFrame itself, so users aren't troubled by having to look inside a list for a one-column result. So if len(dfs) == 1, a one-item list, we return the very first item, which is a DataFrame; otherwise we return the list itself. That looks good to me. Over in the notebook, I changed the data a little so that some columns share values, restarted the notebook, and executed everything from the top. Our DataFrame is a little different now: weight has only one unique value, school has two, and name and state each have two. Calling the unique method, we get five DataFrames back in a list. Assigning the result to a variable: the first item is a one-column DataFrame of the unique names, Lenny and Nico; the second, state, holds just California and Texas; the heights are all unique; and the last one, weight, has the single unique value 40. One caveat: this will not work if there are missing values in your string columns, and we're not going to cover that case. Unfortunately, NumPy's unique function doesn't work when you have None in your object columns; you could use a Python set instead if you'd like to handle it, but we'll keep it simple and leave it as is. Running test_unique, we pass, so we can move on, and that completes Step 25.
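A sketch of the method, again on the DataFrame class with numpy as np:

    def unique(self):
        dfs = []
        for col, values in self._data.items():
            dfs.append(DataFrame({col: np.unique(values)}))
        if len(dfs) == 1:
            return dfs[0]   # single column: hand back the DataFrame itself
        return dfs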
37. The nunique Method: In this video, we complete Step 26, the nunique method. nunique stands for "number of unique": this method returns the number of unique values for each column, meaning a DataFrame with the same number of columns as the original but a single row of integers, each representing that column's number of unique values. This is fairly similar to the unique method we just completed, and we follow a similar procedure, except all we're interested in is the length of the unique values. Start by creating new_data as a dictionary, then iterate through the column names and their associated one-dimensional NumPy arrays, using items again to extract them. The body is simple: new_data[col] is the length of np.unique(value), which is exactly what we want, the number of unique values. We have to ensure each entry is a NumPy array, so we wrap the length in a list with brackets and pass it to the np.array function. Then we simply return a DataFrame constructor call with this new data dictionary. Let's see if this works in the Jupyter notebook; I'll restart the kernel and run through from the top (I already tested this on my own). We have our five-column DataFrame, where some columns have repeated values. Calling nunique (not unique, let me get the syntax right): name has two unique values, state also has two, the heights are all unique so there are three, school has two, and weight has just a single unique value. Everything looks good. Running the test, test_nunique, we pass. This was a fairly simple method: we iterate through every column, run np.unique on it, take the length, turn it into an array, and return a new DataFrame. That completes Step 26.
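The sketch is correspondingly short:

    def nunique(self):
        new_data = {}
        for col, values in self._data.items():
            # wrap the integer in a one-element array for the constructor
            new_data[col] = np.array([len(np.unique(values))])
        return DataFrame(new_data)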
38. The value counts Method: In this video, we complete Step 27, the value_counts method. The primary purpose of value_counts is to look at one column of data and count the number of times each value appears in it. If a column has, say, 100 values of which ten are unique, it returns just those ten values along with their associated counts, the number of times each occurred. Our value_counts method acts on every column, producing one DataFrame per column and returning the results as a list: five columns means a list of five DataFrames. The only time it won't return a list is when there is a single column, in which case it returns just that DataFrame. Each returned DataFrame consists of two columns: the first has the same name as the original and holds the unique values, and the second is titled count and holds the frequency of each value. Let's go. We start by creating an empty list to hold the DataFrames, then iterate through our data, which is always stored in the _data dictionary. Very interestingly, we'll use NumPy's unique function here in a way you might not expect it to support: it has another parameter, return_counts, a little hard to spot because the docstring is jumbled, which defaults to False; when set to True, it also returns the number of times each unique item appears. So it actually returns two arrays as a tuple: the first argument is just the array whose unique values we want, and with return_counts=True we unpack the tuple into two separate variables, uniques and counts. The bulk of the work is done by that one function call. The one remaining requirement is to return the DataFrame with counts sorted from greatest to least. We could sort counts independently, but we also need to sort uniques right along with it, so we rely on NumPy again and use the argsort function. argsort returns the order of the data from lowest to highest as integers: if there are ten values in the counts array, it returns the integers 0 through 9 giving their order, where the very first integer is the index of the lowest item. It's a good idea to see this in the notebook; I have an array here whose smallest value, 2, sits at index four and whose largest, 99, is the very last item, at index five. Running argsort on it gives back the order from least to greatest: the first entry is four, meaning the element with index four is the smallest, and counting over, 0, 1, 2, 3, 4, we see that indeed 2 is the smallest element. It builds from there, and we can verify that five, the position of the very last item, 99, corresponds to the largest. If we assign that result to another variable, b, and pass b into the brackets, it selects items in that order and correctly sorts the array. The reason we need argsort is that we need this order not only to sort the counts but also to sort the first column, the unique values themselves. So we create a variable, order = np.argsort(-counts). By default argsort sorts from least to greatest, so one way to sort from greatest to least is to negate the values, which simply multiplies them all by negative one; we can verify in the notebook that passing in the negated array sorts from greatest to least. Then we reassign uniques and counts using that order array, sorting both. Now we create a DataFrame out of this. We build a new_data dictionary (or separate variables, if you prefer): the first column keeps the same column name and maps to uniques, the one-dimensional array of unique values, and the second column is called count and maps to counts. Passing this dictionary to the constructor creates the DataFrame, and we append it to our list. The only thing left is to check whether len(dfs) is one: if so, we return the very first item; if not, we don't even need an else statement, just a return statement, return dfs.
That looks good, and we need to verify that it works. Our new test is actually in the TestGrouping class, a new testing class, and the test is called test_value_counts. It indeed passed. That completes Step 27, value counts. All right, fantastic.

39. Normalize value counts: In this video, we complete Step 28, normalize value counts. We remain in the value_counts method for this step. There's a single parameter we'll allow our users to pass to value_counts: normalize, defaulted to False. What we implemented in the last step is the default choice, returning the raw counts. In this step, when normalize is True, we instead return what's called the relative frequency, the percentage of the values in that column; summed over the column, the relative frequencies add up to one. We only need to change one small thing: the order doesn't change, but the actual values of counts do. If the user has asked for normalize, we reassign counts by dividing all of the counts by the total. In fact, dividing by the length of the DataFrame works just as well, since all the counts sum to that length, so either will do. Running test_value_counts_normalize, I first wrote len(df), which is not correct anyway since df isn't defined there; we want the length of the original DataFrame, len(self). With that fix, it passed. A quick look at how this functions in the Jupyter notebook: running value_counts on df gives back a list of DataFrames; the first shows the name column with two Nicos and one Lenny, and the very last shows there is only one value of weight. That's with normalize=False. With normalize=True we get the relative frequencies: two-thirds Nico and one-third Lenny, and weight is simply 100 percent at 40. We already passed the test, so that completes Step 28.
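Here is a sketch of the finished method, normalize branch included, on the DataFrame class with numpy as np:

    def value_counts(self, normalize=False):
        dfs = []
        for col, values in self._data.items():
            uniques, counts = np.unique(values, return_counts=True)
            order = np.argsort(-counts)       # negate so argsort yields greatest first
            uniques, counts = uniques[order], counts[order]
            if normalize:
                counts = counts / len(self)   # relative frequency; sums to one
            dfs.append(DataFrame({col: uniques, 'count': counts}))
        if len(dfs) == 1:
            return dfs[0]
        return dfs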
40. The rename Method: In this video, we complete Step 29, the rename method. The rename method lets us rename the columns of the DataFrame. It's pretty simple: it accepts a single parameter, columns, a dictionary mapping an old column name to a new column name, where both key and value are strings. There's one check: we force our users to give us a dictionary, so if not isinstance(columns, dict), we raise a TypeError and just say columns must be a dictionary. Now, rather than assigning anything in place, we return an entirely new DataFrame with the columns renamed. We create the usual new data dictionary and iterate through every column/value pair in our data. The values don't change, but we update the column name using the dictionary's get method, attempting to retrieve the new column name from the dictionary the user gave us. The user doesn't have to supply new names for every column; at a minimum they might rename just one. So columns.get(col, col) tries to fetch the new name, and if that column doesn't appear in the dictionary, it falls back to the same column name. We add that new name to new_data, mapped to the value; this exchanges the old column name for the new one held inside columns. Then we return the DataFrame constructor call with the new data. The test, test_rename, is in the TestOtherMethods class, so we go back into that class and run it: everything looks good, we passed. Let's also check it live in the Jupyter notebook. Calling the rename method with a dictionary that maps name to all-caps NAME and school to university, we can see it worked: name has been renamed to NAME and school to university. You give it a dictionary and it simply goes through and replaces those particular column names with the new ones. That completes Step 29.
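The sketch:

    def rename(self, columns):
        if not isinstance(columns, dict):
            raise TypeError('columns must be a dictionary')
        new_data = {}
        for col, values in self._data.items():
            # use the new name if one was given, otherwise keep the old one
            new_data[columns.get(col, col)] = values
        return DataFrame(new_data)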
41. The drop Method: In this video, we complete Step 30, the drop method. The drop method accepts a single parameter, columns, which is either a string or a list of strings naming all the columns we will drop from our DataFrame; the end result is a DataFrame without the columns passed to drop. To begin: if columns is a string, we reassign it as a one-item list of itself to make it easier to process. If it is then not a list, we raise a TypeError saying columns must be either a string or a list, just as the instructions say. Now we're guaranteed to have a list (we won't verify that everything inside the list is actually a string). We create a new data dictionary and iterate through all the data with items, as we normally do. The check is if col not in columns: when the column is not in the list the user provided, we keep it by making a new entry in our dictionary, and when it is in the list, we do nothing. So we return only those columns that are not in the user's list, and then we can return the DataFrame constructor with this new data dictionary. Let's run the test: it's in the other-methods class, called test_drop, and it passed. Good. Let's see it in action: df.drop('state') drops the state column, and passing a list of strings, say state and weight, drops both correctly. We verified it in the notebook and formally tested it with pytest through our unit tests, so everything is good for Step 30.
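The sketch:

    def drop(self, columns):
        if isinstance(columns, str):
            columns = [columns]
        if not isinstance(columns, list):
            raise TypeError('columns must be either a string or a list')
        new_data = {}
        for col, values in self._data.items():
            if col not in columns:     # keep everything the user didn't name
                new_data[col] = values
        return DataFrame(new_data)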
42. Non Aggregation Methods: In this video, we complete Step 31, non-aggregation methods. At this point in the code I have a comment that says "non-aggregation methods". In a step above we completed a bunch of aggregation methods; to recap, an aggregation method returns a single value for every column, so min, median, max, and sum are all aggregation methods, and the end result was a one-row DataFrame keeping only the columns it could actually aggregate. These non-aggregation methods are essentially the opposite: a group of methods that all work the same way and all preserve the shape of the DataFrame, with no aggregating going on. For instance, when you take the absolute value of a column, every value remains in its place; the only thing that happens is that the absolute value of each particular value is returned. The methods are all listed here: abs, cummin (which stands for cumulative minimum), cummax (cumulative maximum), cumsum, clip, round, and copy. They all have similar functionality in that they transform the data without aggregating, so the shape of the DataFrame stays the same: if your DataFrame is 100 rows by 10 columns, the result is also 100 rows by 10 columns. Notice that all of these methods are implemented very similarly: a single line of code that returns the result of a call to the private _non_agg method, passing the NumPy function that _non_agg will use. It is this _non_agg method that we edit to complete the step. As you can see, _non_agg takes two parameters: the function, plus a very strange-looking parameter with two stars in front of it, kwargs. This is useful whenever there are other arguments you want to pass along to a function or method, and two of our methods pass additional keyword arguments: clip passes two extra arguments, a min and a max, which determine how to clip the numeric columns, and round passes one parameter, decimals. This is Python's way of giving you flexibility when calling other methods: you can pass any number of extra arguments, or in this case keyword arguments. The two stars denote that all the extra keyword arguments passed to _non_agg are collected into the kwargs variable as a dictionary. So when clip makes its call, a dictionary is created with the min key mapped to lower, whatever value that is, and the max key mapped to upper. It's just Python's way of letting developers pass extra keyword arguments around between methods and functions. (If there were a single star here instead of two, the extra arguments would be collected as a tuple, not a dictionary; you couldn't give it keyword arguments, only comma-separated values.) Regardless, we use this to forward the extra keyword arguments to the underlying NumPy function. Now let's complete the method. We begin as usual by creating a new data dictionary and iterating through our underlying _data dictionary. One thing I haven't mentioned yet is that all of these methods besides copy only work on numeric columns; you could make a case that some, like cummin, could work on a string column, but for our purposes we won't worry about that. So we write an if statement: if the data type kind is uppercase 'O', meaning a string column, we won't process it with the function. Instead we do one thing: we make a copy of the underlying array, so the result doesn't share the exact same array. (Honestly, I probably should have been copying the data in some other steps too, so that different DataFrames hold entirely new data.) This also conveniently makes the copy method work, since copy does work with strings even though the other methods don't; it lets us cheat a little. Otherwise, if we don't have a string column, we do the calculation: we call whatever function we were given, passing the underlying array as its first argument.
Then we pass it the keyword-argument dictionary, and the two stars in the call unpack the dictionary so that each element is read as one parameter equal to some value; the unpacking happens for us within the function call. Take round, for instance: our _non_agg method gets passed the np.round function, which becomes the function parameter, so the call is now equivalent to np.round; the value is simply the underlying NumPy array; and kwargs is a dictionary mapping decimals to whatever n the user provided, unpacked within the call. It ends up exactly like calling np.round on the NumPy array with decimals=n. The last thing to do is return our DataFrame with new_data. That's all we have to do here: implementing this one method completes all of the other methods at once, because they all follow the same pattern and this one for-loop does the work for every one of them. Let's test it. The tests are all in the TestNonAgg class, a totally new class, so let's see if they all pass at once. They don't. Let's see which one failed. Okay, I see what happened: we actually can't run all the tests at once, because a couple of other methods that rely on _non_agg haven't been implemented yet, which is a little bit of a pain. So we test them one at a time; they all have similar names: test_abs looks good, test_cummin looks good, test_cummax looks good, test_cumsum, then test_clip passes, test_round, and test_copy. We passed all of those; the few remaining _non_agg tests belong to the upcoming steps, which is why running everything failed at first. A quick look in our notebook: df.cummin() does not touch the string columns but does work for Booleans, which are treated as 0 or 1; once it hits a nan it stops updating and just fills nans the rest of the way down; and for weight, every value is 40, which is indeed the cumulative minimum. With df.cumsum(), name and state are ignored, height stops accumulating once it hits a nan, school accumulates its 0s and 1s, and weight accumulates normally. So we manually tested the DataFrame and it appears to work, and we formally tested it with pytest and passed all of those tests. Step 31 is now complete.
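A sketch of _non_agg plus one of the thin public wrappers, on the DataFrame class with numpy as np:

    def _non_agg(self, funcname, **kwargs):
        # funcname is a NumPy function such as np.cumsum, np.clip, or np.round;
        # extra keyword arguments (a min/max for clip, decimals for round, ...)
        # all land in the kwargs dictionary
        new_data = {}
        for col, values in self._data.items():
            if values.dtype.kind == 'O':
                new_data[col] = values.copy()           # strings: just copy through
            else:
                new_data[col] = funcname(values, **kwargs)
        return DataFrame(new_data)

    def round(self, n):
        # every public non-aggregation method is a one-liner like this
        return self._non_agg(np.round, decimals=n)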
43. The diff Method: In this video, we complete Step 32, the diff method. The diff method subtracts from the current value in each column the nth value above it; by default n equals one, so each value just looks directly above itself, subtracts that, and reports the result as the difference. It returns a DataFrame with the same number of rows and columns as the original. This one is easier to understand by example, so before we embark on it, let's go into the Jupyter notebook. I've created a new DataFrame similar to the one we've used for the whole project, but with one more row, one fewer string column, and none of the missing values. Calling the diff method on it, df_final rather than df_pandas, with the default n=1 written out explicitly so we can see it: take the height column, for instance. It goes through one value at a time and subtracts whatever value sits above it, returning that as the new value for the column. The first row is missing because 3.6 has no value above it; 8.5 does, so 8.5 minus 3.6 is 4.9; then 5.2 minus 8.5 is negative 3.3, and so forth. Booleans are treated as 0 or 1. Weight is a bit easier to follow since it's whole numbers: 50 minus 40 is 10, 45 minus 50 is negative 5, and 100 minus 45 is 55. Notice that all of the columns are turned into float columns, and that's because of the nans: there is no missing-value representation for integers or Booleans, so we have to convert them to floats. Now look at n=2: the first two rows are filled with nans, because we have to hop two rows up to make the subtraction. 5.2 is the first height where the diff can be computed: 5.2 minus 3.6 is 1.6, then 10 minus 8.5 is 1.5, and so forth. That's how the diff method works, so let's fill it out. There's actually a function defined inside the method: this is the function we will pass to the _non_agg method we completed before. diff is a non-aggregating method, but NumPy has no ready-made function with exactly this behavior (its built-in np.diff returns a shorter array rather than one padded with nans), so we create our own function and pass it into _non_agg. The function takes a single column as a parameter I'll call values (in my solutions I used values; either name is fine). The first thing we do is convert it to a float: whether it's a Boolean or an integer, we force it to be a float array, because of those nans. The next thing is to shift the array using NumPy's roll function: I'll call the result values_shifted and use np.roll, which does roughly what it sounds like, shifting an array by the amount you give it.
So let's call it with value and shift it over by n; whatever n is, it rolls by that much. Then we just perform the subtraction: value minus value_shifted. That should work, but it's probably good to see what the roll function actually does, so let's go back to the Jupyter notebook and roll the weight array. When you roll it by one, the very last item moves to the front and the others get shifted downward. If you roll it by two, the last two values become the first two values. You can also shift by a negative number to go the other way: the first value goes to the end and the others get shifted to the left. We reassign the subtraction result to value. This is what we want to return, but it's not quite right, because it won't have any NaNs in it; we have to replace the first n values with NaN. So we write something like value[:n] = np.nan, making the first n values NaN. That works if n is greater than or equal to 0. But we're going to allow n to be negative: if n is negative, say negative 2, we instead assign from n to the end, value[n:] = np.nan. So there's different logic based on whether n is positive or negative for which values we force to be NaN. That looks like it works, and then we just return value, which is the array. Again, this is an internal function underneath the diff method, written in place of a NumPy function: NumPy doesn't give us a ready-made diff that fits here, so we're creating our own and passing it to _non_agg. Notice that n appears both as the diff parameter and inside the inner function. Python's scoping rules allow this: once we define func, n is captured from the enclosing method, so we don't have to pass it along explicitly; it's carried with the function correctly. This looks like it should work, so let's test it in the notebook, this time with df and not df_final, since this is what we just implemented. df.diff() looks good to me, and it works with n equal to 2 as well. So we wrote an internal function and made a call to _non_agg, which we built in the previous step. Let's run the test, test_diff, and it looks like we passed. Fantastic: that small amount of code completes step 32. 44. The pct change Method: In this video, we complete step 33, the pct_change method. The pct_change method is nearly identical to the diff method we just completed, except instead of returning the actual difference between the current value and the nth value above it, we return the percentage change. It's a very minor difference: one returns the raw difference, and the other returns the percentage difference. So I'm going to copy and paste the function we just worked on above and make one minor adjustment.
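For reference, here is roughly what that diff helper looks like before the adjustment. This is a sketch following the walkthrough; the exact names are assumptions:

    import numpy as np

    def diff(self, n=1):
        def func(value):
            value = value.astype('float64')      # ints/bools can't hold NaN
            value_shifted = np.roll(value, n)    # roll wraps values around the ends
            value = value - value_shifted
            if n >= 0:
                value[:n] = np.nan               # wrapped-around values are meaningless
            else:
                value[n:] = np.nan
            return value
        # n is captured by the closure, so func needs no extra parameters
        return self._non_agg(func)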
In the diff method we return the difference between the current value and the nth value above it. Instead of returning this difference, pct_change returns the percent difference: we divide by that nth value above, which is value_shifted. This should get us exactly what we want; we're doing one minor thing, simply dividing by the value n rows above. Let's see how this works in the Jupyter notebook. Here is our df, and if we call pct_change with n equal to 1, we can see that, number one, we get a warning blasted in our face. It says something like "divide by zero encountered", so there's a division by zero somewhere: False evaluates to 0, and that's not good when we're dividing. Any place where you have a 0, you'll get a warning, but we're not going to handle cases like this; we'll just bypass it and not worry about it. What we can do is manually check something, say the jump from 40 to 50 in the weight column. That's a difference of 10; 50 minus 40 is 10, and 10 divided by 40 is 0.25, so a 25 percent jump. That checks out manually, and then we should obviously check with pytest. We have the test_pct_change test, so I'll run that, and it passes. This was nearly identical to the diff method; we just made one small division to complete step number 33. 45. Arithmetic and Comparison Operators: In this video, we complete step 34, arithmetic and comparison operators. Before we begin, it's vital to know exactly what an arithmetic or comparison operator is. These are quite simple: the arithmetic operators are all the common operators used to do addition, subtraction, multiplication, and division. Specifically, the plus sign is an arithmetic operator, the minus sign is an arithmetic operator, and there's also the multiplication sign and the division sign. Then you have a few more: two multiplication signs (**), which stand for raising to a power; two division signs (//) for floor division; and the percent sign (%), the modulus operator, which returns the remainder of a division. Those are the arithmetic operators. We also have comparison operators: greater than, less than, greater than or equal to, less than or equal to, equal to (two equals signs), and not equal to (an exclamation mark followed by an equals sign). In this step we're going to make sure our DataFrame knows what to do whenever a user applies one of these arithmetic or comparison operators to it; all of these examples will work after this step. In order for a DataFrame to understand what the plus sign, the minus sign, or the greater-than sign does, we have to implement the corresponding special method. The corresponding special method for the plus sign is the dunder add method, __add__. This is what gets triggered whenever you have your DataFrame plus some other object. It accepts a single parameter, the other object it's operating with; whatever you're adding to the DataFrame gets passed there.
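In plain Python terms, the following two lines are equivalent (this is a fact of the language, not specific to this project):

    df + 5           # using the + operator on a DataFrame...
    df.__add__(5)    # ...triggers the special method, with other=5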
Whenever we make the call df + 5, 5 gets passed as this other parameter, and then the body of the method gets executed. Before we talk about our implementation, I want to look at all of the arithmetic and comparison special methods in the official Python documentation; there's a link to it in the README. Under "emulating numeric types" in the data model, you'll see all of these special methods with their corresponding operators. For instance, the __add__ special method implements the plus sign, the addition operator, and so forth. For the minus sign you use __sub__; for the multiplication sign you use __mul__; for the @ symbol you use __matmul__, which is typically used when implementing matrix multiplication. We're not going to do matrix multiplication in our DataFrame; you certainly can, but it's a little more complex, so we won't do it with ours. There are a few more: __truediv__ for one division sign, __floordiv__ for two division signs, and the __mod__ method for the percent sign, the modulus operator. There are several more methods, from __lshift__ on down, that deal with bitwise manipulation, and we're not going to deal with those either. So that's how you can find out from the documentation what the corresponding arithmetic and comparison operators are. If we look a little further down, we see the six comparison operators and their corresponding methods as well: __lt__ stands for less than, __le__ for less than or equal to, __eq__ for equals, __ne__ for not equals, and so forth. The documentation is very valuable here for figuring out the exact name of the method you need to implement for each specific operator. One other note: we want our DataFrame to work regardless of whether it is on the left-hand side or the right-hand side of the operator. If the DataFrame is on the left-hand side, the normal dunder method is called; for df + 5, __add__ is called. But in the case of 5 + df, or generally any object, then the operator, then the DataFrame, the DataFrame is on the right-hand side, and you'll see that each of these methods has a distinct counterpart with an r preceding the name. The r stands for "right", for whenever your object is on the right-hand side, because you might need to do something different in that case, so Python provides a separate special method with the r prefix for these instances. Whenever 5 + df is evaluated, the __radd__ method gets called, not __add__: when the DataFrame is on the right-hand side, the methods beginning with r get called. You can see these in the documentation as well, right below the normal dunder methods; they're the same names, just with the letter r preceding them. Okay, let's go on to the implementation. If we look over here, all of these methods are already complete.
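They look something like the following sketch; the private method name _oper and its exact signature are assumptions based on the walkthrough:

    def __add__(self, other):
        return self._oper('__add__', other)

    def __radd__(self, other):
        return self._oper('__radd__', other)

    def __gt__(self, other):
        return self._oper('__gt__', other)

    # ...and likewise for __sub__, __mul__, __truediv__, __lt__, __eq__, etc.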
They all basically do the same thing: they call this private _oper method, passing it two parameters. One is a string of the special method's name itself, and the other is simply the other object being operated with; in all of these examples, that's the integer 5. They're all already implemented, so you don't have to edit any of them. Instead, we're just going to write this one _oper method, and implementing it will make all of the others work. The reason we funnel everything through one method is that it makes the work a lot easier: rather than implementing them all individually, we write one method, since they all behave essentially the same. Okay, let's get started. First, think about what other can be: it can be a scalar value like 5, or a float, maybe even a boolean; or we can allow it to be a DataFrame. So within the _oper method we check whether other is a DataFrame. If it is a DataFrame and it's not a one-column DataFrame, we raise a ValueError; otherwise, we reassign other to the one-dimensional array of values of its single column. Let's do that: if other is a DataFrame, we check its shape, and if it's not a one-column DataFrame, we raise a ValueError saying the DataFrame must be a single column. Otherwise, we get the underlying data. Remember how we get that: it's stored in the _data dictionary, and we do something like getting an iterator over its values and calling next on it, which returns the very next array, the only one in this case. That's the quick way of doing it, as we've already discussed in previous videos. So other simply becomes the underlying one-dimensional array of that DataFrame if someone is trying to, say, add two DataFrames together. We force it to be a single dimension, which lets a one-column DataFrame be applied against all of the other columns. If other is not a DataFrame, we will not check what type it is; we're passing that responsibility on to NumPy. If the NumPy array can combine with whatever the other object is, then we're good; if it can't, NumPy will throw the error, so we're not going to worry about it. Okay, so now we iterate through all the columns of our original DataFrame and apply the operation to each array: for col, value, iterating through the _data dictionary like we always do. Now we need to apply the operation, and we're given the operation as a string; if you look back up, for __gt__ it's just that name as a string. Why did I give you a string? Because we're going to use the getattr built-in function.
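A quick illustration of getattr, which is standard Python:

    import numpy as np

    arr = np.array([1, 2, 3])
    method = getattr(arr, '__add__')   # look up a method from its string name
    method(5)                          # array([6, 7, 8]) -- same as arr + 5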
getattr is a built-in function that gets an attribute or method if you give it a string. It takes two parameters: the first is the object you want to get the attribute from, and the second is the attribute name as a string. getattr is like saying, hey, get this attribute from this object, where what you have to give it is the operation's method name as a string. It's a way to find attributes and methods of an object via a string. So let's call it; we could name the result func, but since it retrieves the underlying method, maybe we should call it method instead. That retrieves the method we want, and then we call that method with other, which applies the operation on other. For instance, it will fetch the __add__ method and then call that method with other, and this will be the new result. What we need to do, as we usually do (don't forget this), is create a new_data dictionary. value is a NumPy array, op is the string that retrieves the method, and other could be an integer, another NumPy array, whatever it is. We write new_data[col] = method(other), which tries to compute the operation, and hopefully it works. Then we return DataFrame(new_data). This will hopefully work now, and if it doesn't, we can debug. Let's go back into the notebook and look at the last DataFrame we created. If we try to add 5 here, well, the way we implemented this, we apply the operation to every single column, so it errors out: a column of strings is unable to add 5 to it. That's fine; this error is valid, and we're not handling the case where a calculation cannot occur. What we need is a DataFrame that's all numeric, so let's use our selection machinery from earlier: df1 will be these three numeric columns. Now if we add 5 to df1, that works; it looks like we've successfully added 5 to every value in the entire DataFrame. Adding 5 to a boolean turns it into an integer, which is good. Let's do some others, say df1 > 3: yes, it returns all boolean values, so that looks good. Everything is working out, and we can even raise the DataFrame to the second power. Everything looks like it's working, so let's formally test with pytest. The tests are in the test operators class, which runs the tests for all the operators, and it looks like they all pass. Everything looks good: we tested it in the notebook, and we tested it formally with our unit tests using pytest. That completes step 34, the arithmetic and comparison operators. 46. The sort values Method: In this video, we complete step 35, the sort_values method.
We're going to sort our DataFrame based on one or more columns with this sort_values method. The method has two parameters: by, the column or columns we're going to sort by, and asc, which controls whether we sort ascending or descending. By default we sort from lowest to highest, so asc (short for ascending) is True. Those are the two parameters. Let's first do the simple case of a single column. We allow by to be either a string or a list of column names; in the case that by is a string, we know we're sorting by just one column. To sort by that column, we get the underlying NumPy array from our _data dictionary, and then we call NumPy's argsort function, which returns the sorted order as integer positions, and we assign that to order. Then we return self[order.tolist(), :], using the brackets, the __getitem__ special method we've already implemented, and passing in this array of integers giving the correct order. One small thing here: order is a NumPy array, so we have to use the tolist method, an array method that returns the values of an array as a list. That's because we only implemented row selection with a list; you cannot give it a NumPy array. That's unfortunate, but it's what we did during the implementation. So from our DataFrame, which is self, we select particular rows by passing a list, and the single colon denotes that we select all of the columns. Remember, when you give it two items separated by a comma, rows then columns, it does simultaneous selection of rows and columns; here the lone colon is slice notation for "slice all of them". Let's look in the Jupyter notebook. If we sort_values by the weight column, it goes from 40 all the way to 100; basically the second and third rows just exchange places, so that looks good. We can sort by a string column too, and the column gets sorted alphabetically, so everything looks good there. Now let's make this work when ascending is False: if not asc, what we do to order is simply reverse it. The [::-1] slice, which you should know from regular Python, is how you reverse a list, and it also reverses a NumPy array. So if we go back now and pass asc=False, it sorts from greatest to least, and we can check this again with weight: 100 down to 40. Everything looks good when sorting with just a single column. Now we have to implement the case where we have multiple columns, which is a little bit different.
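Here is the single-column case so far as a sketch; the _data attribute and the list-only row selection are assumptions carried over from earlier steps:

    import numpy as np

    def sort_values(self, by, asc=True):
        if isinstance(by, str):
            order = np.argsort(self._data[by])   # integer positions of the sorted order
        else:
            raise NotImplementedError('multi-column sorting is added next')
        if not asc:
            order = order[::-1]                  # reverse for descending
        # row selection only accepts lists, hence tolist()
        return self[order.tolist(), :]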
So there's the else branch: if by is a list, we're sorting by multiple columns. You cannot use argsort to sort on multiple columns; NumPy has a separate function called lexsort. We give it the columns that we want, but we have to give them in reverse order, which is the slightly tricky part. First, we loop through all of the column names in the by list with a list comprehension, getting the underlying data for each: for col in by, get the data. This gives us a list of NumPy arrays, and we can just reassign it to by; that's okay. But the way lexsort works, the array you want sorted first must come last; it processes them in reverse order. So given a list of NumPy arrays, the primary sorting array comes last, which is certainly not intuitive. We therefore need to reverse by, assuming the user wants the first item in by as the primary sorting column. So we reverse by, and then for every column in the reversed by we get the underlying NumPy array; by is now a list of NumPy arrays. Then we rely on NumPy to do all the hard work for us and assign the lexsort result to order. That should work. The very last thing is to raise a TypeError if we don't have a list or a string: "by must be either a list or a string". And we've already implemented the reversal, so order just gets reversed in the case that we go high to low. All right, let's check this out in our Jupyter notebook. If we want to sort by multiple values, we pass in a list here. This should actually still work even if it's a one-item list, so we can try that, and maybe we'll just leave out asc. Sorting by weight looks good; with asc=False we now go from 100 to 40, which also looks good. Let's sort by school: yes, booleans, True is 1 and False is 0, so that's okay. Now what if we want to sort by multiple columns, say by weight within school? Let's try that. With ascending equal to True, it goes from False to True, and within True the sorting happens from least to greatest. That looks good. With asc=False, within True we now sort from greatest to least, from 100 down to 40, and there's only one False, so that looks good too. The only case where it makes sense to sort by multiple columns is when there are duplicated values in one of the columns, at least in the first sorting column. In this case school is the only column that has repeated values, so it's the only way we can verify the behavior visually; you'd have to add more rows or more data otherwise. But everything looks good with at least this simple example. Let's go ahead and test with pytest.
There are four separate test cases that we'll run individually, all in the test class for these additional methods. First the basic sort_values test: that passes, so that looks good. We can test the next one, the descending test, then the next, the two-column test, which passes, and the very last one, the two-column descending test. Great, we've passed all of the tests. That does it for sort_values and for step number 35. 47. The sample Method: In this video, we complete step 36, the sample method. The sample method randomly selects rows from the DataFrame. We're going to give our users two choices for how to select them. One way is the parameter n, which is simply an integer number of rows to randomly sample. The other way is to provide frac, some number between 0 and 1 that selects a fraction of the rows; for instance, if the user provides a number like 0.2, the sample method returns 20 percent of the rows. Now, there's one other option, and that's to sample with or without replacement, which is what the replace parameter means. Regardless of whether you're sampling with n or with frac, you'll be able to sample with or without replacement: if replace is True, the same row can appear more than once; if it is False, which is the default, every row appears at most one time in the sample. The final parameter is seed, which is an integer, is not required, and allows the user to set NumPy's random number generator seed if they so choose. Okay, let's get started by completing the case where the user gives us n. We'll rely on the numpy.random module to do the heavy lifting; again we're relying on NumPy, but this time digging into the random module, which contains dozens of functions to help with randomization. In this case we choose the choice function, which selects randomly from a collection of objects. Our collection of objects will be the numbers 0 up to the number of rows, so we create a little range object that it's able to sample from. The second parameter is the size of the sample, which will be n, whatever the user provides, and for replace we just use whatever the user gives us, since the choice function has this same exact replace parameter for sampling with or without replacement. So this returns some numbers between 0 and the number of rows in the DataFrame, non-inclusive, where the length was already computed above. It selects n of those row numbers, with or without replacement depending on what the user wants. So we're actually sampling the row numbers, and that's what helps us make the selection: we return our DataFrame with just those particular rows, and we have to convert to a list first, since rows is going to be a NumPy array.
Just like in the previous step, we have to convert the NumPy array into a list in order to make the selection, since our DataFrame only works with lists for row selection. We do simultaneous row and column selection: the rows go first, and for the columns we just select everything. Let's see how this looks in the Jupyter notebook. Here's df, and if we randomly sample, say, two rows, that looks good, and we can re-run it over and over. Now, if we make replace True, we can actually sample more than the number of rows we have, which guarantees some duplication somewhere; you can see Lenny and Teddy are duplicated, those rows have been repeated. So you get duplication, or at least the possibility of it, whenever replace is True. Okay, so it looks like it's working when the user gives us n. Now, if the user provides us a fraction, we have to do a little bit of math. If frac is given, so it's not None and not 0, we can use it to calculate n: n equals frac times the length of self. This will not return an integer, so we have to force it to be one by passing it into int. Basically, if the user gives us a number for frac, we turn that fraction into an actual whole number n, and that eventually gets passed into the choice function. Let's see how that works. If frac equals 0.5, that's 50 percent of four, so it should return two rows, and it does. With 0.8, that's 3.2, which rounds down to three, so it returns three rows. That looks good. Now, with replace equal to True, we can even ask for 300 percent, and that returns 12 rows: three times four is 12. So that also works. The instructions say to raise a ValueError if frac is not positive and a TypeError if n is not an integer, so let's complete that; we kind of skipped over it. If frac is not positive, so less than or equal to 0, we raise a ValueError saying frac must be positive. That takes care of that case. Then, if n is not an integer, if not isinstance(n, int), we raise a TypeError saying n must be an integer. We're not going to handle the case where the user gives us both frac and n, in case you were wondering; you could certainly add a check to make sure they only give us one of those, but we won't here. One last thing: we need to do something whenever the seed parameter is given to us. If the user does give us a seed, we set it by calling NumPy's seed function and passing the seed in. That looks pretty good, so let's go back through the whole method one piece at a time. First, if the seed is given to us, we set the seed.
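For reference while we recap, here is the assembled method as a sketch. The names follow the walkthrough; using np.arange for the collection of row numbers is an assumption (a plain range would work with np.random.choice as well):

    import numpy as np

    def sample(self, n=None, frac=None, replace=False, seed=None):
        if seed:
            np.random.seed(seed)                     # make results reproducible
        if frac is not None:
            if frac <= 0:
                raise ValueError('`frac` must be positive')
            n = int(frac * len(self))                # turn the fraction into a row count
        if not isinstance(n, int):
            raise TypeError('`n` must be an int')
        rows = np.random.choice(np.arange(len(self)), size=n, replace=replace)
        return self[rows.tolist(), :]                # row selection requires a list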
Next, if we're given frac, we check that the fraction is positive, and if it is, we calculate n from it. If n is not an integer (the user must give us either an integer n or frac), we raise the TypeError. Otherwise, here's where the heavy lifting comes in: we use the choice function from NumPy to select from the possible row numbers, passing along whatever replace option the user gave us. That returns the sampled row numbers, which we convert to a list and then use to make the selection: just those rows, along with all of the columns. That covers everything, so let's run my test, test_sample, and it looks like we passed. All right, great. We can now randomly sample rows from our DataFrame, with or without replacement, and that completes step number 36. 48. Pivot Tables Part 1: In this video, we complete step 37, pivot tables part 1. We're beginning the pivot_table method in this step. It is the most complex method in the entire project and will take multiple steps to complete; this video focuses on just getting started and discussing what exactly a pivot table is. Let's look at this data: what I have here is an image of some City of Houston employee data. It contains four columns: department, race, gender, and salary. The raw data is over here, and on the right-hand side is the actual pivot table. A typical pivot table involves two columns whose unique values form independent groups. In this case we're using the race and gender columns to form the groups, so any unique combination of race and gender forms a group. As you can see, race is the vertical grouping column, the vertical axis if you will, with five unique values; the gender column runs horizontally and has two unique values. Each combination of race and gender is represented in the actual data, and this pivot table shows the average salary for those groups. For instance, the number in the upper left-hand corner represents the average salary of all the Asian females in the data, and this other cell contains the average salary of all the Hispanic males. So that is a pivot table, a generic pivot table, and it consists of four pieces: one grouping column, a second grouping column (here race and gender are the grouping columns), a third column of values that gets aggregated, and the type of aggregation. We could have done some aggregation other than the mean, like the max salary or the min, or we could have done a count, but you have to choose one aggregation for a pivot table, since only one value can be reported for each unique combination of race and gender. Now let's look at the method signature. You can see it has four parameters: rows is one of the columns whose unique values will form groups; columns is another column whose unique values will form groups; values is the column whose values are going to be aggregated.
values is almost always a numeric column, since numeric columns are the ones typically capable of being aggregated. The aggfunc parameter is a string naming the type of aggregation: min, max, mean, median, standard deviation, all of those aggregation functions will be possible here. Let's take a look at what the final product looks like in a Jupyter notebook, so you can get a better idea of how pandas_cub is going to behave. I've created a fake DataFrame with eight rows and three columns, built with pandas_cub_final just so you can see how the finished version looks. It's a very simple DataFrame. Say, for instance, I want to create a pivot table with state as one of the grouping columns, fruit as the other grouping column (columns), and I want to aggregate weight. How do I want to aggregate it? By summing it up. We could choose a different aggregation, but that's a perfectly viable one. Okay, so what the result says is that Texas has 15 pounds of apples, Florida has seven pounds of apples and three pounds of oranges, and Texas has six pounds of oranges. You can go into the raw data and verify: Texas with apples appears with 5 and 10, which is 15; Florida appears for apples with 4 and 3, which is 7, so that looks good. You can also verify the oranges: 2 plus 4 is 6 for Texas, and 2 plus 1 is 3 for Florida. Everything looks good there, and that is the pivot table we're going to produce for pandas_cub. Now, I'm just going to copy and paste this: we're also going to allow our pivot table to accept just one of rows or columns, so it will work with just one of those. In this case, providing only state for the rows gives the total weight of all the fruit regardless of what type of fruit it is: all of the Florida fruit weighs 10 pounds, and all of the Texas fruit weighs 21 pounds. We'll allow either one to be used alone, rows or columns. If you use columns, you get a different view: simply the unique values as the column names, with the aggregation directly underneath them as a single row. And you can put any column in values, change the orientation, or change the aggregation method, so there are a lot of possibilities our pivot table will be able to handle once it's complete. Let's get started with the code now. I've actually copied and pasted some code to get started here, because I don't want to spend time on this beginning portion; the more interesting code gets written in the next step, but this is the step that gets us going. This starter code does two things. Number one, it extracts the data into NumPy arrays; it gets the arrays out of our _data dictionary. Number two, it determines what type of pivot we're doing: just rows, just columns, or rows and columns. There are three possibilities, and you've seen them over here: this is one where there are rows and columns, and this is one where there are just rows.
And this one we can change so that it's just columns, so there will be three different branches within our program for what happens. Let's walk through this starter code real quick. Number one: rows and columns cannot both be None. If rows is None and columns is None, we raise a ValueError, because that's not valid; you have to have at least one grouping column. Now, we are going to allow values to be None, and I need to show you how that works. If values is None, then you certainly can't have an aggregation on values that don't exist (there was a small bug here in the code which I've fixed, but the behavior is the same): we're not going to allow someone to give us an aggfunc if there are no values to aggregate, since that simply doesn't make any sense. So back in the code, in the branch where values is None, if you try to give an aggfunc, we raise an error. Let's go back up to the case where values is not None: if you give us a column for values, we extract its underlying NumPy array, since we need those arrays in their own variables. Now, if you give us values but don't provide an aggfunc, there's no way to know how to aggregate, so that will also raise an error. There is one special case: if the user gives us no values and no aggfunc, we assume they just want to straight-up count the co-occurrences of all the unique combinations in the DataFrame. In that case, we make aggfunc equal to 'size'; there's a NumPy size function, which is why it's called that, not for any arbitrary reason. We also create an empty NumPy array as a placeholder for the value data. You can sort of get around doing this, but it's actually a fairly inexpensive operation: np.empty doesn't fill anything in, it just creates an array out of whatever data is already at those memory addresses, so it happens quite fast and there's no expensive computation to worry about. Okay, that takes care of values and aggfunc. In the next segment of the program we continue collecting our data into NumPy arrays: if rows is not None, we extract its data into row_data, and if columns is not None, we extract its NumPy array into col_data. By the end of this block we've extracted val_data, row_data, and col_data. The last thing is to determine what type of pivot table we're doing: only columns, only rows, or both rows and columns, which we'll label with the string 'all'. (I was looking at the final version for a moment there, but it's the same code.) So: if rows is None, then columns must not be None, and our pivot_type is 'columns'. If columns is None, then rows cannot be None, so our pivot_type is 'rows'. Otherwise we have both of them, and it's 'all'.
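Condensed, that starter logic looks something like this sketch (the parameter handling follows the walkthrough; the variable and attribute names are assumptions):

    import numpy as np

    def pivot_table(self, rows=None, columns=None, values=None, aggfunc=None):
        if rows is None and columns is None:
            raise ValueError('`rows` or `columns` must be provided')

        if values is not None:
            val_data = self._data[values]
            if aggfunc is None:
                raise ValueError('`aggfunc` is required when `values` is given')
        else:
            if aggfunc is not None:
                raise ValueError('`aggfunc` only makes sense with `values`')
            aggfunc = 'size'                  # plain count of co-occurrences
            val_data = np.empty(len(self))    # cheap placeholder; only its length matters

        row_data = self._data[rows] if rows is not None else None
        col_data = self._data[columns] if columns is not None else None

        if rows is None:
            pivot_type = 'columns'
        elif columns is None:
            pivot_type = 'rows'
        else:
            pivot_type = 'all'
        # grouping and aggregation continue in the next parts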
At this stage there's no more code in the method, so we can just return. I like to do this so I can slowly check my code: we'll return the pivot_type, which is just one of those three strings, so I can check things on the fly. Let's make sure it's valid: if I use rows and columns, it returns 'all'; with just rows it says 'rows'; with just columns it returns 'columns'. It looks like it's working at this point in time. Okay, that covers setting up the problem, extracting the data into NumPy arrays, and determining what type of pivot table we're doing. We did those two things, and that completes step 37, part 1. 49. Pivot Tables Part 2: In this video, we complete step 37, pivot tables part 2. In the last video we extracted the underlying NumPy arrays for the row, column, and value data where they exist, and we classified our pivot table as having just columns, just rows, or both rows and columns with the string 'all'. Now that we have our data and have classified the type of pivot table we're creating, we need to divide the data into groups. This is a little easier to see by looking at the data itself, so let me delete these cells and look at our data. When I say divide the data into groups, suppose we're creating a pivot table where rows is state, values is weight, and the aggfunc is sum. To complete this successfully, we have to map all of the weights for Texas into one list or array, and all the weights for Florida into another; we have to perform some sort of grouping here. If we were to do this manually, we would simply iterate through the rows one by one, and whenever we encountered the state Texas, we would append that weight to the values collected so far for Texas, and likewise append each Florida value to all the previous values for Florida. By the end, every single unique state would have a list or array of all the values associated with it. We're going to do the same thing here: iterate through the grouping columns to form the unique groups, and keep appending whatever the values are for those groups. Let's see how this might work. First, we delete the return statement, since we don't need it anymore. Now, what are we going to iterate through? Before writing the loop, we can think about zipping things up. Let's say we're working with only columns, and write that if statement first. If the pivot_type is 'columns', we only have col_data and val_data, so what if we zip up col_data and val_data and iterate through the pairs? It's like zipping up state and weight in this example. (If an error shows up in the editor here, it's just because the code is incomplete; it'll go away.)
So let's iterate through the column data and the value data zipped together. The first loop variable will be the group (from col_data), and the second will be val; these are the individual variable names as we iterate. What we need is something like a dictionary that keeps appending values to the end of a list. We could just create a plain dictionary, but there's actually a good data structure for this in the collections module called a defaultdict, so we'll use that: from collections import defaultdict, and we create a defaultdict whose default value is a list. What this means is that every key is automatically mapped to an empty list if it's not mapped to something already. So d will be our dictionary, and for every group we simply append val to its list. That looks good; we're literally just iterating through all of the data for the columns case. If the pivot_type is 'rows', we do the same thing, except with row_data; that should work for rows, when we just have rows. Otherwise, we need one more branch for the case where both rows and columns are defined, and this one looks a little different. We'll have two group variables, group1 and group2, plus val, and we zip up all three arrays: row_data, col_data, and val_data, and iterate through each triple. Now, instead of a single group being the key, we make a tuple of group1 and group2 the key in our dictionary: a two-item tuple will be the key to each group, and we simply append val like before. Okay, that looks okay; it seems we've collected all of the data. Why don't we return this dictionary in its current state, so we can verify that we've collected everything correctly? Let's go back and run this again. Okay, good: it returns a defaultdict, which is exactly what it is, and here's Texas, mapped to those values, which looks fantastic, and Florida is mapped to these values. That's exactly what we wanted: we're mapping each group to its values. It would work with fruit as well, and it works for rows too, the same thing using rows. Now, if we do columns and rows together, let's see if this works: and that looks good. We actually have tuple keys, so apples and Texas map to 5 and 10, apples and Florida map to 4 and 3, and so forth. So we've created this dictionary whose keys are the groups, either a single string or a tuple of strings, and whose values are simply a list of whatever is in the values column. That's it for this one: I just wanted to collect all the data into each independent group as a dictionary.
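In code, that grouping section looks roughly like this (a sketch; variable names follow the walkthrough):

    from collections import defaultdict

    d = defaultdict(list)                 # every new key starts as an empty list
    if pivot_type == 'columns':
        for group, val in zip(col_data, val_data):
            d[group].append(val)
    elif pivot_type == 'rows':
        for group, val in zip(row_data, val_data):
            d[group].append(val)
    else:                                 # pivot_type == 'all'
        for group1, group2, val in zip(row_data, col_data, val_data):
            d[(group1, group2)].append(val)   # a two-item tuple keys each group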
So, yes, that completes part two of step 37. 50. Pivot Tables Part 3: In this video, we complete step 37, pivot tables part 3. In the last step we collected all of the values for each group, and what we have is this dictionary d that maps every group to a list of the values we'd like to aggregate. In this step we iterate through this dictionary and perform the actual aggregation for each group; this will be a fairly short part of the entire process. We create a new dictionary called agg_dict, which will map the groups to their aggregations; right now the groups are mapped to all of the raw values. Let's iterate through the groups: for group, vals in d.items(). The first thing we do is convert vals, which is a list, to a NumPy array. Now we need to perform the aggregation. If we look back up, the user has given us the aggregation as a string: aggfunc holds the type of aggregation as a string. So we use getattr, since we have a string, to get the attribute from NumPy: aggfunc is the string and np is the module, so getattr gets the actual function from NumPy, and we'll call it func. Then we take agg_dict and, for every group, apply this function to that underlying NumPy array. So: we iterate through our previous dictionary, convert each list of values into a NumPy array, get the function from NumPy given its string name, compute the aggregation, and assign it into the new dictionary. Let's return agg_dict right here at this stage to see if this one little chunk of code is working. This was the result of the last step, so if we call the same pivot now, we should see that, yes indeed, the aggregation has been performed: 5 and 10 have now been aggregated to 15, 4 and 3 have been aggregated to 7, and so forth. It looks like we've correctly performed the aggregation and moved on to a completely new dictionary to hold our data. So we're nearly there. The result as it stands is quite usable, good information, but it is not a DataFrame; we go one more step and pivot this information so that a DataFrame is returned. All right, that completes step 37, part 3. 51. Pivot Tables Part 4: In this video, we complete step 37, pivot tables part 4. In the last step we aggregated all the values in each group into a single dictionary; the final result, if we just look back here, was this dictionary mapping each group to its single aggregated value. We can also aggregate just by rows or just by columns, which is a much simpler aggregation, as you see here, and it works identically whether we use rows or columns. Let's handle those two simpler cases first; a condensed sketch of the part 3 aggregation loop is below for reference.
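The names follow the walkthrough (d is the grouping dictionary from part 2, aggfunc the user-supplied string):

    import numpy as np

    agg_dict = {}
    for group, vals in d.items():
        arr = np.array(vals)              # list of raw values -> NumPy array
        func = getattr(np, aggfunc)       # e.g. aggfunc='sum' -> np.sum
        agg_dict[group] = func(arr)       # one aggregated value per group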
By "handle those two cases" I mean we need to convert this result into a DataFrame, because at this point we can't return the raw dictionary to our users. So we're going to create a DataFrame at this stage. As we normally do when creating DataFrames, we begin by creating an empty new_data dictionary, and we'll have different logic for each of the three pivot types. Let's first assume our user has given us columns. If columns is provided, we want to produce a DataFrame with one row, where all the unique values become the columns. So let's go through agg_dict and create a new column for every entry. One thing we'll do as we iterate: I want to go through the aggregation dictionary's keys in sorted order. So if the grouping column is states, they'll all be sorted; if it's race, sorted; if it's gender, sorted; all the unique values in the columns column will be sorted. This is what pandas does, and it makes the pivot table easier to read. So we write new_data[col] = the entry from agg_dict, but since that's a single value, we need to wrap it in a list and then pass it to the np.array function to make it an array. All we're doing is iterating through the sorted keys of agg_dict. To give you a little more information on this: when you pass agg_dict into sorted, sorted iterates through just the keys. So we iterate through the sorted keys, extract the single aggregated value for each (it's a single number, not an array), wrap it inside a list, convert it to a NumPy array, and assign it into the new_data dictionary. Okay, so we should be able to return our DataFrame correctly; let's see how it looks over here. I'll type it out so we can see it. Good, that worked: we now have a formal DataFrame where the column names are sorted, and underneath each one is its aggregation. All right, now let's do the same thing for pivot_type equal to 'rows'. In this instance we'll have two columns: the first holds the unique values of the rows, and the second holds the aggregation. Let's extract those. The keys of agg_dict contain all the values for the first column, and you can't convert dictionary keys directly into a NumPy array, so we turn them into a list first and then into an array. We'll call these the row values, rows for short. For the vals we do the same thing, copy and paste, but using the dictionary's values instead. So now we have rows and vals, both arrays, and we can directly create new_data here.
So we write new_data of... actually, we need to use a different variable name here, so let's say row_vals, since rows is our original column-name parameter. We map the rows parameter, the original column name, to row_vals in new_data. Then new_data gets a second entry keyed by aggfunc, so this new column is named after the aggregation function, and it's set equal to the aggregated vals. Okay, that looks good. Now let's see if this returns something sensible. It pivots things a little: we have two columns, the first being fruit with the individual fruits. That looks good. Now, one thing we want to do to be consistent with pandas is sort this data; I think we should sort this DataFrame by the row values. So instead of immediately assigning row_vals to our dictionary, we call argsort to get the order, sorting from low to high (or reversed for high to low), and then we use that order for both row_vals and their corresponding vals. That should sort it so that we match what pandas does; pandas sorts both the rows and the columns. So we get the order and simply use it to do the sorting. Okay, that looks good; let's see if it computes over here. Well, it was already sorted, since we only have apples and oranges; we'd have to add another fruit to really see the effect, but we'll leave that testing to pytest when we get there in the next step, which will be the last one for pivot tables. That completes pivot table part 4. We have one more portion, and that's pivoting the data when both rows and columns are given to us. We did the simple cases, when just columns or just rows was given; the next case is when pivot_type is 'all'. All right, that is it for part 4 of the pivot tables. 52. Pivot Tables Part 5: In this video, we complete step 37, pivot tables part 5. This is going to be the last step of the pivot table section. In this step we pivot the results for when our user gives us both rows and columns, so when the pivot_type is 'all', and return a DataFrame. In the previous steps we returned a simple DataFrame whenever the pivot type was 'rows' or 'columns'. This time I've actually added all the code that completes the step, so it's already done, and I'm simply going to talk through every line instead of live coding it. Before we get there, I want to show the result right before we enter this branch of the program. After we've aggregated our results, I'll output that dictionary once again by adding a temporary return agg_dict statement, so execution never actually reaches this part of the program for now. Let's go into the Jupyter notebook and do a pivot table that has both rows and columns, summing up the values for every combination of fruit and state; we have four unique combinations of fruit and state. What we want to do here is turn this into a pivot table.
52. Pivot Tables Part 5: In this video, we complete step 37, pivot tables part five. This is the last step of the pivot table section. Here we pivot the results when our user gives us both rows and columns — when pivot_type is 'all' — and return a DataFrame. In the previous steps we returned a simple DataFrame whenever the pivot type was 'rows' or 'columns'. For this case, I've actually already added all the code that completes the step, so instead of live coding I'm simply going to talk through every line. Before we get there, I want to show what the result looks like right before we enter this branch of the program. After we've aggregated the results, I'll temporarily put in a return statement: return agg_dict. This is the aggregated dictionary, mapping each key to its single aggregated value, and for now the program will never reach the rest of this branch. Over in the Jupyter notebook, let's build a pivot table with both rows and columns, summing the values for every combination of fruit and state — there are four unique combinations of fruit and state. We don't yet have a DataFrame here; we haven't converted this dictionary into one. You can see the final result needs to look like the pivot table, so we need to transform this dictionary into that table. The strategy we're going to employ: find all the unique values for the first entry of each key tuple in this dictionary, and all the unique values for the second entry; then iterate over both of those lists and fill in the blanks with the actual aggregated values. So let's go back to the code and remove the temporary return statement. Scrolling down past the two branches we completed in the previous step: we've checked whether pivot_type is 'columns' or 'rows', so we're now in the else branch, and the only remaining option is when pivot_type is 'all' — the user has given us both rows and columns. To get the unique values of all the rows and all the columns, we use a set. We iterate through agg_dict — remember, iterating a dict goes through just its keys — and the group keys here are two-item tuples. We add each tuple's very first item to row_set and its second item to col_set. Sets cannot hold duplicates, so row_set and col_set end up containing all the unique values for the rows and for the columns. We then convert row_set into a sorted list — sorted goes through the set and actually returns a list. (I had new_data defined again down here; it was already defined above, so I must have implemented this slightly differently elsewhere — we can erase the duplicate definition rather than redefining it.) The row values make up one column, so in our new_data dictionary we go ahead and create that column; that one is complete. Then we iterate through every unique value of the columns, one by one. Maybe it helps to look back at the notebook: in this instance our columns were states, so we iterate over just Texas and Florida. For each of those, we iterate through every unique value of the row list, and we look up the actual aggregated value by passing the (row, col) tuple to agg_dict. If that combination doesn't exist, we say it's missing and use np.nan, NumPy's missing value. We keep appending the looked-up values to a new_vals list, and that list forms each new column. So within this block I'm simply iterating over each column, then iterating through each unique row to build a new list, and finally converting that list into an array — and that creates all the columns we need to make the DataFrame.
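Here is a runnable sketch of this whole 'all' branch, with an invented three-entry agg_dict so the NaN fill-in is visible. The final line of the real method wraps new_data in a DataFrame:

```python
import numpy as np

# assumed shape: keys are (row value, column value) tuples
agg_dict = {('apples', 'Florida'): 5, ('oranges', 'Texas'): 12,
            ('apples', 'Texas'): 8}
rows = 'fruit'   # the original rows column name

row_set, col_set = set(), set()
for group in agg_dict:          # iterating a dict yields its keys
    row_set.add(group[0])       # first item of the tuple
    col_set.add(group[1])       # second item of the tuple

row_list = sorted(row_set)      # sorted() returns a list
col_list = sorted(col_set)

new_data = {rows: np.array(row_list)}     # first column: unique row values
for col in col_list:
    new_vals = []
    for row in row_list:
        # look up the aggregated value; NaN when the combination is absent
        new_vals.append(agg_dict.get((row, col), np.nan))
    new_data[col] = np.array(new_vals)

print(new_data)
# {'fruit': array(['apples', 'oranges'], ...),
#  'Florida': array([ 5., nan]), 'Texas': array([ 8., 12.])}
# the method then finishes with: return DataFrame(new_data)
```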
Okay, let's see if this works now. Run it, and it looks like everything is good. The last thing to do is run the tests. These tests live in the grouping test class, so we run pytest there: first the test for pivoting by rows or columns, which passes — good — and then one more test for pivoting by both, which also passes. Okay, great. This was a very complex method: you can pivot by just rows, by just columns, or by rows and columns. The bulk of the work separates into two distinct areas. The first was to gather the data into groups — that's what the dictionary did — and the second was to convert that gathered, aggregated data into a DataFrame, which is the portion we just finished. So that completes step 37, all five parts of creating pivot tables. 53. Automatically Add Documentation: In this video, we complete step 38, automatically add documentation — meaning docstrings — to several methods at once. This particular step is actually already complete, so no additional coding is needed, but I will explain exactly what happens. Before I get into the method itself, a quick rehash on docstrings: let me scroll up to the pivot_table method, which we just completed. It has a docstring with a small description, all the parameters, and a Returns section. This is a literal Python string, and it isn't just useful for developers — it's very useful for the users of our library, because docstrings can be viewed while you're coding in an environment such as VS Code, the Jupyter notebook, and many others. Let's see what I mean by that. Back in the test notebook: if I bring up the pivot_table method without calling it and put a single question mark at the end, the Jupyter notebook pulls up a little pop-up window at the bottom of the screen showing the documentation string, so users can access it easily. That's very nice. You can also access the docstring by pressing Shift+Tab+Tab, which pops it up as well — this is what I actually use when working with libraries in the Jupyter notebook. I press Shift+Tab+Tab frequently, as it's a great way to quickly pull up the documentation; there's no way to memorize every possibility for all the methods and functions available. So docstrings are very valuable for quickly getting help while you're coding.
Now, if we scroll back up — a little further, past the aggregation methods, to where we defined our earlier aggregation methods: min, max, mean, median, sum, and so on — you'll notice there are no docstrings here at all. Nothing was written under them. And yet, if I go to the notebook, pull up the min method, and check, there in fact is a docstring. How did it get there when it's clearly not visible in the code? The answer lies in a special attribute called __doc__ ("dunder doc"). You can retrieve a docstring through this special attribute, and like any other attribute, it's just a normal Python string assigned to a variable — which means you can also set docstrings dynamically through __doc__. That's exactly what I did in the code that adds all those docstrings to the aggregation methods. What I noticed while building this code was that all the aggregation methods would have very similar docstrings, so instead of writing the same thing over and over again, I defined a method that automatically adds the documentation — and that's what this entire step is about. The first thing I did was assign a list of all the method names I wanted to automatically document: all the aggregation method names. Then I wrote a generic docstring and assigned it to a variable called agg_doc. It's just a string, but it contains braces, and those signify something important: I can replace their contents with a variable later. The docstring itself is very simple: "Find the blank of each column", where blank becomes one of the method names once we fill it in. Then we iterate through all of the names — for name in names — and for each one we use getattr to retrieve the method from our DataFrame class by its string name. We then reassign that method's __doc__ attribute to agg_doc with the braces filled in — that's what the string format method lets you do — replacing them with whatever the current name is. So for the first one, it will say "Find the min of each column". That's what this _add_docs method does: it conveniently adds documentation to several methods automatically, so instead of repeating the same docstring over and over, one single method adds all of it. This only really works when you have methods whose documentation is essentially identical apart from a very small difference, like the method's name; in that case you can replace each method's __doc__ attribute with whatever string you want the docstring to be.
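Here is a stripped-down, runnable illustration of the pattern. The real course code has the full list of aggregation names on a fuller DataFrame class; the two-method class below is just for demonstration:

```python
class DataFrame:
    def min(self): ...
    def max(self): ...

# generic template; the braces get filled in per method
agg_doc = """
Find the {} of each column

Returns
-------
DataFrame
"""

def add_docs():
    for name in ['min', 'max']:           # the real list is much longer
        # look the method up on the class by its string name, then
        # overwrite its __doc__ attribute with the filled-in template
        getattr(DataFrame, name).__doc__ = agg_doc.format(name)

add_docs()
print(DataFrame.min.__doc__)   # "Find the min of each column ..."
```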
You might be wondering when this actually gets invoked, and the answer is: in the constructor. We glanced over this very quickly in one of the first steps, so let me scroll up to the constructor so you can see where it executes. It's the very last line of the constructor: we call this _add_docs method (not a dunder method), which goes through all of those aggregation methods and adds documentation to them by overriding their __doc__ attributes. So that's a little special trick for changing docstrings dynamically and automatically documenting many methods at once. And that completes step 38. 54. String only Methods: In this video, we complete step 39, string-only methods. The last step completed the DataFrame class — there are no more methods to implement in it after _add_docs. Instead, we'll be looking at an entirely new class called StringMethods, which will help us process the columns of our DataFrame that hold strings: it lets us manipulate string data in the DataFrame through a dedicated class. Let's take a quick look through this StringMethods class, since it's all we'll be dealing with in this step. Inside, you'll see the names of a lot of methods, all of which do some sort of string manipulation or string calculation — capitalize, count, endswith, startswith, and so forth. Many, many string methods will become available on the string columns of our DataFrame. So how does this become part of our DataFrame — how is that possible? It happens during DataFrame construction. Inside the __init__ method of the DataFrame class, you'll see a line where we assign a new instance of the StringMethods class to the str attribute: we call the StringMethods constructor right there, and we pass self — the DataFrame being constructed — as a parameter to that constructor. That is exactly where the StringMethods class comes in: we create an instance of it and assign it to the str attribute. Now let's look at the StringMethods constructor and see what that self is doing. When that line executes during DataFrame construction, the self from above — the current DataFrame — gets passed to the df parameter and assigned to the _df instance variable of the current StringMethods instance. So only one thing happens during construction of the StringMethods class: it keeps a reference to the DataFrame. That lets us go back one level — the StringMethods instance can reach the DataFrame that holds it. This will make more sense once we look at a DataFrame.
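Here is a stripped-down sketch of the relationship between the two classes — only the pieces discussed here, with everything else omitted:

```python
class StringMethods:
    def __init__(self, df):
        # keep a reference to the owning DataFrame so the string
        # methods can reach its underlying data later
        self._df = df

class DataFrame:
    def __init__(self, data):
        self._data = data
        # every DataFrame carries a str accessor that points back to it
        self.str = StringMethods(self)

df = DataFrame({'state': ['Texas', 'Florida']})
print(df.str._df is df)   # True -- the accessor can reach its DataFrame
```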
Let's try this in the Jupyter notebook. If we have df, some DataFrame, and we look at df.str: str is just an attribute right now, and as you can see, it points to a single instance of the StringMethods class. You can now call methods on this object — if you press Tab, you'll see all of the public methods become available. But you can't actually call them yet: we have to implement one single method within StringMethods for them to work. In the current state they're available, just not functional. So let's make them work. Flipping back over to VS Code, take a look at all of the methods here: they all look very similar. Almost every one is a single line of code that calls the underlying _str_method — the generic string method that handles all the others. Notice that the first argument passed to it is the Python string method we actually want to use — plain Python's own method, str.capitalize for instance — and the second argument is always the name of the column we want to apply that method to. All of them follow this pattern: first argument, the Python string method; second argument, the column name in the DataFrame. You'll also notice that several of these methods take extra arguments. These are exactly the same arguments that accompany the corresponding Python string methods — I simply copied them from Python into this code, and we pass them along, allowing our users to use the exact same string methods that Python gives its users. Regardless, the first two arguments are always the method and the column name, for every single one of these. Now let's go down to _str_method itself. This is the one thing we need to implement — get it right, and all the others will work. Looking at the signature: the first parameter is the method we want to apply, the second is the column name, and the third is this curious *args. We talked a little about this in a previous step: anything with two stars captures extra keyword arguments; anything with one star captures extra positional arguments as a tuple. So any extra arguments — for instance, the last three arguments of one of those wrapper methods — get collected in *args, and we simply hand them back to Python and say: here, use these when executing the method. We'll see that in just a bit.
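Before we implement the generic method, here is roughly what those one-line wrappers look like. The exact signatures — especially count's optional arguments — are approximations of the course code, and _str_method is implemented in the next passage:

```python
class StringMethods:
    # (constructor from above omitted)

    def capitalize(self, col):
        # delegate to the generic handler; the first argument is the
        # plain Python string method itself
        return self._str_method(str.capitalize, col)

    def count(self, col, sub, start=None, stop=None):
        # the extra arguments are simply forwarded along to str.count
        return self._str_method(str.count, col, sub, start, stop)
```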
Step number one is to get the underlying NumPy array for the column. The data lives one level back: self is the current StringMethods instance, self._df is the DataFrame, and its _data attribute is the dictionary where the arrays are held. So we reach in and grab that array. The comment says: raise a TypeError if it does not have kind 'O'. We only allow users to call this on object-dtype columns, so if the array's dtype kind is not uppercase 'O' (O stands for object), we raise a TypeError saying the str accessor only works with string columns. Okay, good. Next, we iterate over each value in the array, walking one by one down it — you can iterate through a NumPy array simply with a for loop. Since values is an array of strings, each val is a single string, and we pass it to the actual method, along with any extra arguments we collected in *args. Each call produces a result, so I suggest collecting the results in a list — call it new_vals — and appending each result to it. That's basically it: all that's left is to return a one-column DataFrame. We return a DataFrame keyed by the original column name, converting the new_vals list into a NumPy array. It's actually quite simple — just a small amount of code. Let's test it on some actual data. We need a string column, so say we want to uppercase one. The first thing I do is press Shift+Tab+Tab — well, I don't actually have the documentation here yet, so that's something we could work on — but we just give it the name of the column, and that should uppercase it. That looks good; it's working. One thing we should be aware of, though, is missing values. If val is None, we'll just append None — we won't touch the value at all. Otherwise the string methods would throw an error whenever a None appears, so we make an exception for it: if one of the values is None — a missing value in your string column — we simply append None and move on to the next value. But the loop is the main thing; it's the crux of it. You iterate through a single column, apply a Python string method to each value, and pass along any extra arguments. Let's try one that uses extra arguments: df.str.count. Count takes several parameters, and we'll use one of them to count occurrences of a letter in each value — counting the letter 'i' in a column of US state names, for instance, California has two while Texas has none. That basically does it for string columns: we now have quite a bit of firepower for dealing with strings, and that's thanks to Python, which we're relying on.
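Putting the walkthrough together, here is a minimal runnable sketch of _str_method, using a SimpleNamespace to stand in for the owning DataFrame. The real code returns a DataFrame rather than a plain dict:

```python
import numpy as np
from types import SimpleNamespace

class StringMethods:
    def __init__(self, df):
        self._df = df

    def upper(self, col):
        return self._str_method(str.upper, col)

    def _str_method(self, method, col, *args):
        # reach back into the owning DataFrame for the column's array
        values = self._df._data[col]
        if values.dtype.kind != 'O':
            raise TypeError('The str accessor only works with string columns')

        new_vals = []
        for val in values:
            if val is None:
                new_vals.append(val)          # leave missing values alone
            else:
                # call the plain Python string method, forwarding any
                # extra arguments collected in *args
                new_vals.append(method(val, *args))
        return {col: np.array(new_vals)}      # real code: DataFrame({...})

fake_df = SimpleNamespace(
    _data={'state': np.array(['Texas', None, 'Iowa'], dtype='O')})
print(StringMethods(fake_df).upper('state'))
# {'state': array(['TEXAS', None, 'IOWA'], dtype=object)}
```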
In this case we're not relying on NumPy; we're relying on core Python to do all of our string manipulation for us, and core Python has excellent string capabilities built right into the language. Okay, one thing we haven't done is run the tests, so let's do that: we run pytest against the entire string-methods test class and hope it passes. There were actually 25 tests just run — probably one for each method — and they all passed. Good. Alright, that does it for step number 39. 55. The read csv Function Part 1: In this video, we complete step 40, the read_csv function, part 1. Completing this step is the very last thing we do in this project — after it's done, the project is complete. It's a fairly difficult step, which is why I've broken it into two parts. The read_csv function reads a comma-separated-value text file and returns a DataFrame from it. Now, our read_csv will only be able to handle very simple files. CSV and text files can be very messy, and there isn't really a single standard for them, so ours handles only the simplest cases — we're not going to dig into CSV edge cases. One thing to notice before we get started: read_csv is a function, not a method. It's defined at the module level, not inside the DataFrame class, so you won't access it as df.read_csv; instead you'll call it as pdc.read_csv. It's a function not bound to any object, and it returns a DataFrame. It has one parameter, fn: the location of the CSV file in your file system, given as a string. Alright, we need to build a dictionary mapping column names to column values. We're going to assume the first row — the first line of the file — contains the column names. Before we start coding, let's create the dictionary that will hold all this data, mapping each column name to its values. Since we won't know how long each column is until we've actually read through the file, we'll create a defaultdict whose default value is a list, and continually append each column value to the end of its list. So I'll import defaultdict again. We could have imported it once at the very top of the module, but it's only needed in two places — here and in the pivot_table method above — so we import it within just those two places. Importing at the top would work too; it's not a big deal, but this makes it clear where it's used. Normally we'd name the variable new_data, but for now we'll just call it data, because we'll create a new_data later when we iterate over this one more time to convert everything to NumPy arrays. So for the time being, data is a defaultdict of lists.
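A quick demonstration of why defaultdict(list) is convenient here:

```python
from collections import defaultdict

data = defaultdict(list)
# a missing key is created automatically as an empty list, so we can
# append without first checking whether the column exists yet
data['state'].append('Texas')
data['state'].append('Florida')
print(data['state'])   # ['Texas', 'Florida']
```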
Alright, now we need to open up this file. We'll use the with statement, passing in the filename — this is how you open files in Python, with the open function — and the with statement lets us refer to the open file as f. We're assuming the first line of the file is where the column names are, so we read it with the file's readline method. We'll assign it to a variable — actually, let's use two: header for the raw line, and column_names for the processed result. At the end of every line there's a newline character, so we need to strip it off with the strip method. Backing up a little: header is a string of that very first line of the file — a bunch of column names separated by commas — and the very last character is typically a newline, which strip removes. Then we use the split method to split the string into a list, making a split wherever it sees a comma. Okay, let's temporarily return the column names and see if this works with some actual data. Back in the Jupyter notebook, we call pdc.read_csv on the sample data file, employee.csv — and that looks good: it returned the very first row of that CSV. You can actually open the CSV itself with Jupyter if you wish: go into the data folder and click the file, and you'll see that everything was read in correctly — that's the first line. I'll keep it open so we can refer back to it, and let's go back to the code. So we've opened the file and extracted the column names as a list, which is good. Now we need to iterate through every other line of the file. Files are iterable, so when you loop over f, line takes on the very next line of the file each time — and since we already read in the first line, we're starting from the second. We do basically the same thing for the values: values equals line, strip off the last character, then split on the comma — so values holds all the comma-separated values for that row. Then we can zip up the column names with the values and iterate over the pairs: for col, val in the zipped-up column names and values, we append val onto data[col]. Let's unpack that one more time. We go through every remaining line of the file, assuming each one is valid data. The line ends in a newline character, which we strip off; then split breaks the data on every comma it sees and returns a list. So values is a list, the same as column_names, and we assume they're the same length — the number of items in every row of data — which lets us zip them up, iterate through each pair, and append the current value to its particular column. We're going row by row, appending each value one at a time.
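Here is a condensed sketch of everything part 1 builds, with the variable names from the walkthrough. The real function will go on to convert these lists and return a DataFrame in part 2:

```python
from collections import defaultdict

def read_csv(fn):
    data = defaultdict(list)
    with open(fn) as f:
        # the first line holds the column names: strip the trailing
        # newline, then split on the commas
        header = f.readline()
        column_names = header.strip('\n').split(',')

        # every remaining line is one row of data
        for line in f:
            values = line.strip('\n').split(',')
            for col, val in zip(column_names, values):
                data[col].append(val)
    return data   # part 2 converts these lists into NumPy arrays
```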
That line is at ends, is going to end in a new line characters. So we're going to strip that off. And then we're going to use the split method to split, to split all the data on all the commas that it sees. And it will return that as a list. So values is a list. I'll same way as column names is a list. So these are going to be the same length. We're going to assume that they're the same length of the number of items in every row of data. So we can zip them up. And then we can iterate through each one and simply, simply append whatever the current value is to the particular column. So this is going row by row, appending all this. I'm appending each value one at a time. So let's actually return data over here. So let's go ahead and run this again. And I'll just assign this to this data. And we can look at this. So here's the keys. So if I look at this, that IT department key, I'll get a list of all the departments. So let's say string and we'll get a look at all of the salaries, for instance. And notice that they are read in as strings here. So, so it looks good. So we've read in all the data, it's very close to what we want. We have a dictionary of column names, map2 strings of all of our data in the CSV file. But that's all I wanted to complete in this step and the next step, we're going to turn those into NumPy arrays and then finally create the DataFrame. So that does it for Part 1 of step 40. 56. The read csv Function Part 2: In this video, we completed step 40, the read CSV function part 2. And so we ended up last time with a dictionary of the column names mapped to lists of strings. So we're very close to what we want, but we need to convert lists of strings into the correct datatype. So if our list is consisting all of say, integers, and we don't want to have a, you know, a column that is of strings that are integers. We're going to assume that this column should be an integer, not a string. So we're going to try to make this conversion and try to have NumPy do it for us. So the first thing we're gonna do is create a new data dictionary like we usually do. And then we're going to iterate through, we're going to iterate through these dictionary that we created from the last step. We're gonna go through every single key value pair. So call is a string and vowels a list of strings. And so what we're going to try to do here is when I try to make a conversion. So we're gonna go from the strictest conversion to the least strict. So that means that I'm going to try to convert whatever this list of strings into an integer first if that doesn't work. And then I'll try to move up and converted to a list of strings. And if that doesn't work, I can convert it into our sorry, list of converted the floats. And if that doesn't work, I can convert it to leave it as strings. So just a C is to show you how this would work with NumPy array itself. So say we have some sort of list. We'll just call it a and put in some numbers. Just so you can see how this would work with num pi if I, if I converted this to an array number 1, well, it's going to convert it to a unicode is going to keep it as a string. But you can try to, you can try to give it a particular datatype. So particular datatype. So if we try to do an integer here, it will actually convert it to an int. So that looks good. So if I do B and do b dot d type is converted into a 64-bit int. Now. So that works. Now what if I had a decimal here? So say 4.3 and I try to convert it to an integer, excuse me, run that line of code. 
NumPy raises a ValueError and says it can't do that. In that case, when int doesn't work, we move up to the next most flexible data type, a float, and NumPy makes that conversion without complaint. So in our code, we begin by trying to convert the list of strings to integers — we assign np.array(vals, dtype='int') to new_data[col] inside a try. If that doesn't work, we try something else: new_data[col] gets np.array(vals, dtype='float') instead. Back over in the Jupyter notebook: if there's an actual non-numeric string in the list — and, by the way, that big pink error message earlier was just a bug from my not having closed off the try/except yet — then trying to convert such a list to a numeric NumPy array raises another ValueError. That's where we fall back to the object data type and keep everything as strings. It's our last-ditch effort: we give up on conversion and just leave the values as strings. You might be wondering about booleans, since our DataFrame is capable of holding them. The problem is that NumPy won't do any sort of conversion from strings to booleans — dtype='bool' isn't even worth trying on a list of strings — so there's no way to read booleans in as an actual boolean data type. They'll come in as the strings 'True' and 'False', and from there you can overwrite the column with a boolean one manually after reading in the file. Alright, that should do it, and we just need to return our DataFrame class with new_data. This goes column by column: it tries an integer; if that fails, a float; if that fails, it falls back to object. A condensed sketch of that cascade is below.
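Here is the cascade condensed into a runnable sketch, with a small made-up data dictionary standing in for the one read_csv builds in part 1:

```python
import numpy as np

data = {'name': ['Ted', 'Sam'],
        'salary': ['50000', '60000'],
        'score': ['1.5', '2']}

new_data = {}
for col, vals in data.items():
    try:
        # strictest conversion first: integers
        new_data[col] = np.array(vals, dtype='int')
    except ValueError:
        try:
            # next most flexible: floats
            new_data[col] = np.array(vals, dtype='float')
        except ValueError:
            # last resort: keep everything as strings (object dtype)
            new_data[col] = np.array(vals, dtype='O')

print({col: arr.dtype for col, arr in new_data.items()})
# {'name': dtype('O'), 'salary': dtype('int64'), 'score': dtype('float64')}
# read_csv then finishes with: return DataFrame(new_data)
```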
Let's read the file in now — delete the temporary code, run it, and no errors. Lo and behold, we have our DataFrame. Checking the dtypes, it worked out: string, string, string, and then salary got read in as an integer. Everything looks good there. The last thing is the tests: the final test class covers read_csv, so we run pytest on it — there were four tests in there, and they all passed. That was the very last piece, so you can now read simple CSV files with the read_csv function. Finally, let's make sure we can pass all the tests: typing pytest by itself runs the entire suite — 96 tests, and they all pass. Alright, a hundred percent. So that actually completes it: you've checked off every single one of the 96 tests across all 40 steps, which wraps up and concludes the project. Hopefully it was a lot of fun building your own data analysis library from scratch. And so that completes step number 40. 57. 57 Conclusion: I really hope you had a lot of fun building the entire project. It is a lot of work, but it's a project you can refer back to and continue improving for quite some time. Developing a library completely from scratch is one of my favorite things to do — really taking hold of the ecosystem Python offers developers and doing whatever you want with the language. With that said, there's still quite a bit more you can do in this project if you wish. Number one: there are many different implementations you could come up with for these methods, and some might be better or even faster than the ones I developed. I wasn't necessarily going for the very fastest implementations; I typically focused on the cleanest and clearest ones during the project. Now that you have the unit tests especially, you can feel comfortable going back, rewriting some of the code, and seeing if you can get it working a little better. An obvious next step is adding more functionality — there's a lot that could be added to the project. One way to figure out what to add is simply to go to the pandas documentation, read up on the methods and functions available there, and implement them yourself. For instance, say you want to read data from a SQL database: you can write a function for that. Or you could write a groupby method — a very powerful method that we only partially implemented, in one scenario, with pivot_table. Regardless, there's lots of other functionality available, and I encourage you to look through pandas and bring some of the methods we didn't implement into pandas_cub. If you do add more functionality, you absolutely need to write the unit tests. Look inside the test_dataframe.py file to see how I wrote mine; you can more or less follow that pattern of test writing whenever you add new functionality, and you can even write tests for cases that weren't covered during the project. Testing is definitely something you can expand on. There are also many more special methods that we did not get into. For instance, the __abs__ special method can make your DataFrame object work whenever someone calls the built-in abs function in Python — and there are many more like it. To figure out what special methods are available, I've recommended this before, but you really should read the Python data model document from the beginning. It's a very, very long document, but it's extremely important — probably one of the most important in the entire Python documentation. Going through it, you'll learn quite a bit more about what's available in Python itself, and you'll be able to implement even more special methods so that your DataFrame works with other functions and operators built into the language.
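As a hypothetical illustration — this is not part of the course code, just one possible way to do it for numeric columns — __abs__ might look something like this on a minimal DataFrame:

```python
import numpy as np

class DataFrame:
    def __init__(self, data):
        self._data = data   # dict of column name -> NumPy array

    def __abs__(self):
        # invoked automatically whenever Python's built-in abs() is
        # called on a DataFrame: abs(df)
        new_data = {col: np.abs(vals) for col, vals in self._data.items()}
        return DataFrame(new_data)

df = DataFrame({'a': np.array([-1, 2, -3])})
print(abs(df)._data['a'])   # [1 2 3]
```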
Okay, so that basically does it for this project. I hope you had a lot of fun building it. I'm going to have many more videos to come. You can also check out my website, where I have lots of tutorials, and I have a new book coming out called Master Data Analysis with Python — lots of good stuff. Hopefully you'll stick with me, subscribe, and stay tuned for lots of other projects involving the Python data science ecosystem.