Polars: Python library for Speed in Data Processing. Analysis with Real-World Large Dataset | Olha Al | Skillshare

Polars: Python library for Speed in Data Processing. Analysis with Real-World Large Dataset

Olha Al, Software engineer

Watch this class and thousands more

Get unlimited access to every class
Taught by industry leaders & working professionals
Topics include illustration, design, photography, and more

Lessons in This Class

    • 1.

      Intro

      1:34

    • 2.

      Introduction to Polars: Key Differences from Pandas and Why It’s Faster

      6:07

    • 3.

      Installing Polars, Loading DataFrames, and Accessing Columns Efficiently

      7:12

    • 4.

      Data Manipulation in Polars: Arithmetic Operations, Column Management, and Filtering Techniques

      8:05

    • 5.

      Mastering DataFrames in Polars: Slicing, Descriptive Statistics, and Advanced Data Exploration

      9:52

    • 6.

      Exploring Polars DataFrame Methods: Flags, Schema, Column Operations, and Data Conversion Techniques

      7:38

    • 7.

      Advanced Data Manipulation in Polars: Grouping, Aggregation, Sorting, and Custom Transformation

      5:25

    • 8.

      Advanced Data Operations in Polars: write_csv, Pivot Tables, and Join Strategies

      8:20

    • 9.

      Understanding Eager and Lazy Execution in Polars: A Speed Comparison with Pandas for Large DataFrame

      16:53

    • 10.

      Data Visualization in Polars. Advantages, Limitations, and a Comparative Analysis

      4:23

Community Generated

The level is determined by a majority opinion of students who have reviewed this class. The teacher's recommendation is shown until at least 5 student responses are collected.

3

Students

--

Projects

About This Class

Ready to supercharge your data analysis?

In this course, you’ll dive into Polars, the fast and efficient library built to handle massive datasets with ease. If you've struggled with performance issues using Pandas, or simply want to work faster with large amounts of data, this course is for you.

I'll guide you through everything - from the basics of loading and manipulating data to more advanced concepts like lazy evaluation and chunk processing. You’ll also learn how to compare Polars with Pandas in real-world scenarios, and why Polars could be the game-changer you've been looking for.

We will work with a massive real-world dataset that is several gigabytes in size, giving you hands-on experience in efficiently processing large datasets.

By the end of this course, you'll be equipped to process millions of rows in a fraction of the time it takes with traditional methods. Join me and discover how to streamline your workflow and take your data analysis to the next level.



It’s also helpful to explore Matplotlib! You can find my Matplotlib course here: 
Data Visualization with Matplotlib: From Basics to Advanced Techniques 

It's also useful to learn Pandas for fast data analysis! You can check out my Pandas course here: Pandas for Data Analysis: Master Data Handling in Python

If you’re just getting started, my Python course will be very helpful! You can find it here: 
Python Programming: From Beginner to OOP Mastery

You might also find my Streamlit course useful! Check it out here: 
Interactive Data Visualization with Python: Streamlit & Matplotlib. Free Deployment on Cloud 
Mastering NumPy: A Comprehensive Guide to Array Operations, Data Manipulation and Advanced Technique  

Subscribe so you don’t miss the upcoming lessons!

Meet Your Teacher

Olha Al

Software engineer

Teacher

Hi, I'm Olha. I have over 10 years of experience in production environments, working with backend technologies, containerization, and version control with Git. I specialize in Python and its ecosystem - including Pandas, Polars, NumPy, and Streamlit - for cleaning, processing, analyzing, and visualizing real-world datasets. My courses are designed to give you practical, hands-on experience, helping you build skills in coding, data analysis, and modern development workflows.

For more, check out my Skillshare courses and projects.

Level: Intermediate

Class Ratings

Expectations Met?
    Exceeded!
  • 0%
  • Yes
  • 0%
  • Somewhat
  • 0%
  • Not really
  • 0%

Transcripts

1. Intro: Hi, guys. Are you tired of slow data processing? Do you struggle with large datasets that push your system to its limits? If you work with data and need speed, efficiency, and scalability, this course is exactly what you need. Let's get acquainted with the Polars library, and we will compare it with the ever popular Pandas. If you are tired of waiting for large amounts of data to load and process in Pandas, this tutorial is for you, because this library is set to replace Pandas. But will it really be able to do that? Unlike Pandas, Polars is designed to easily handle large datasets using advanced parallel processing and lazy evaluation. That's what we will explore in this session. We will cover the basic principles of how Polars operates, discuss its advantages and disadvantages, and compare it with Pandas in terms of performance and usability. We will also go through the main functions for working with data, using practical examples to illustrate these concepts. You'll discover how Polars outperforms Pandas when working with datasets of over 100 million rows. You'll learn how to process data in chunks, allowing you to work efficiently without running out of memory. We'll explore data visualization in Polars, and most importantly, you will understand lazy mode, the secret behind Polars' unmatched speed. By the end of this tutorial, you will have a solid understanding of how to use Polars for your data analysis needs and be able to decide whether it's the right tool for your project or not. So, ready to take your skills to the next level? Let's dive in.

2. Introduction to Polars: Key Differences from Pandas and Why It’s Faster: Hi, guys, let's get acquainted with the Polars library, and we will compare it with the ever popular Pandas. Polars is a fast and efficient data frame library designed for working with large datasets. It is built with performance in mind, utilizing multithreading and parallel processing to handle data manipulation tasks quickly.
Polars is implemented in Rust, which allows it to offer superior speed compared to other data frame libraries like Pandas. The library supports operations such as filtering, aggregation, and transformation of data, and it's particularly useful for data analysts and data scientists who need to work with large volumes of data efficiently. With both a Python and a Rust API, Polars is accessible and powerful for modern data processing workflows. As I said before, this library provides a wide range of functions for data manipulation, aggregation, and transformation. But in my opinion, its main feature is lazy evaluation. What is lazy evaluation? Lazy evaluation is a computational strategy that delays the execution of an operation until its result is actually needed. In the context of Polars, this means that data manipulation operations aren't executed immediately when they are defined. Instead, they are recorded as a series of steps to be executed later. This approach allows Polars to optimize the entire sequence of operations, reducing the overall computational workload and improving performance by executing only the necessary calculations at the end. Another important feature is multithreaded processing. One of the key advantages of Polars is its ability to process data using multiple threads at the same time. This means that instead of performing tasks one after another sequentially, Polars can split the workload into smaller parts and run them simultaneously across multiple CPU cores. This significantly speeds up data operations, especially when working with large datasets. Polars achieves this efficiency because it's built with Rust, a programming language designed for high performance and safe memory management. Rust makes it easier to work with parallel computing, meaning that Polars can fully utilize the power of modern computers with multicore processors. Another powerful feature in Polars is its ability to memory map data.
It means that instead of loading an entire large dataset into RAM, which could slow down your computer or even cause it to crash, Polars can read and process only the parts that are needed at the moment. For example, if you are working with a massive CSV or Parquet file, Polars doesn't need to load the whole file into memory. Instead, it accesses the data directly from the file as needed, making the process much faster and more memory efficient. These features make Polars an excellent choice for working with big data, as it helps analysts and researchers handle large datasets quickly and effectively without requiring high-end hardware. Polars is built on top of Apache Arrow, a data format designed to make data storage and transfer faster and more efficient. Think of Arrow as a highly optimized way to organize and structure data so that it can be processed quickly by different systems. Because Arrow is used by Polars, it allows Polars to share data seamlessly with other tools and systems that also use Arrow. For example, if you're working in Polars and want to pass your data to a different system, like a machine learning tool or another data analysis library, Arrow makes that data transfer happen smoothly and efficiently, without needing to convert data into a different format, which can be slow and costly. Another feature that makes Polars user friendly is its Pandas-like API. Pandas is one of the most popular tools for data analysis in Python, and many data analysts and scientists are already familiar with how it works. Polars was designed to feel familiar to Pandas users. So if you already know Pandas, you can start using Polars without needing to learn everything from scratch. However, while the API looks familiar, Polars has the added performance advantage of being faster and more efficient, especially when dealing with large datasets. So if you are coming from Pandas, you can benefit from the same easy-to-use syntax, but enjoy the speed and memory efficiency of Polars.
Even though Polars has many great features, there are some limitations to consider. It's important to keep in mind that, like any tool, it may not be the best fit for every situation. We will go into more detail about these drawbacks later on. But for now, let's take a look at some of the challenges. One potential disadvantage is that Polars is relatively new compared to more established libraries like Pandas. Since it's still growing, there may be fewer resources available, such as tutorials, community support, or documentation, which could make it harder for beginners to get started. Also, since it's new, there may be fewer examples of how companies are using it in real-world, large-scale projects. Because Polars is not as widely adopted yet, there is limited information about how many companies are using it in production environments, the real-world systems where companies run their operations. Most companies that use Polars might not publicly share details about how it fits into their workflows, so it's harder to know how well it performs under very large or complex workloads. However, some companies are starting to use Polars for their data processing tasks, and you might see examples of this in the industry. As the library matures, its adoption will likely grow and more companies will start sharing their experiences. So let's get started.

3. Installing Polars, Loading DataFrames, and Accessing Columns Efficiently: Here's the command to install the library. Let's get started. I will open my terminal since I'm used to working with it. First, I will activate my virtual environment. If you're not familiar with virtual environments, I strongly recommend checking out my video on how to manage virtual environments and how they can make your life easier. But if you don't know what a virtual environment is, you can run the command directly in the terminal. Right now, knowing how to work with virtual environments is not a priority.
After activating my environment, I can run my Jupyter notebook right from the terminal by executing the command jupyter notebook. If you use Anaconda, after starting Jupyter Notebook you can install this library directly inside Jupyter by running the following command. As you can see, I already have the library installed, so we can start working. First, I will import all the necessary libraries that we will be working with. I'm importing the NumPy library because we'll need it. For those who aren't familiar with it, NumPy is a powerful Python library used for numerical computing. In my profile, you can find a tutorial on this library. Then I will check the version of Polars. I've downloaded a huge dataset, over 1 gigabyte in size, and now I'm going to import it using the read_csv function in Polars. It takes a little time to load. Then, using shape, which you may know from Pandas, we can check the dimensions of our dataset. If you're not familiar with it, shape in Polars returns a tuple representing the number of rows and columns in the data frame. This is useful for quickly understanding the size of your dataset. And here we can see that the dataset contains more than 130 million rows. Another useful function for quickly understanding a data frame without displaying all the data is the head function. By default, it displays the first five rows. As we can see, the appearance of the data frame in Polars is slightly different from what we observed when loading data with Pandas. The first noticeable difference is the data type information displayed for each column. Using the to_pandas function in Polars, we can convert a Polars data frame into a Pandas data frame. This is particularly useful when you need to leverage Pandas-specific functionality or integrate with libraries that only support Pandas data frames.
The to_pandas method ensures a smooth transition between Polars and Pandas, allowing you to take advantage of both libraries within the same workflow. If you don't want to download large datasets onto your computer, you can use publicly available datasets on GitHub. To do this, find a large dataset on GitHub, go to the Raw tab, and copy the direct link. Use the read_csv function in Polars to load it. Like before, it takes a little time to process, but eventually we can see the data. For today's example, we are working with the Chicago crime dataset for 2022. We can also check the column types, and we can see that it contains 2,525,551 rows, quite large but not too overwhelming. To avoid confusion with the first data frame we loaded earlier, I'll rename this new data frame. I will still use the first data frame later, but for now, we will work with this second one, which was loaded from GitHub. I will rerun all cells. Just like in Pandas, in Polars a Series is a one-dimensional, array-like structure that represents a single column of a data frame. It holds values of a single data type, such as integers, floats, strings, or Booleans. Series are the building blocks of data frames in Polars. They allow us to perform various data manipulation and analysis operations, such as filtering, transformation, and aggregation. Each Series has an associated name and data type, making it easy to reference within the data frame. We can extract a Series from a data frame in several ways. The first one is using the column name. The most straightforward way is accessing the column by name. To avoid displaying the entire Series in large datasets, I will use the head function to display only the first four rows. The second one is using the get_column function. This approach has performance and flexibility advantages over direct column access. For me, the main reason to use the get_column function is its error handling.
If you try to retrieve a nonexistent column, Polars immediately raises an error. This provides instant feedback if there is a typo in the column name. In contrast, accessing a column by name may either silently return None or raise a key error, which can make debugging harder. The third way is the select method. It creates a new data frame containing only the specified column. However, even if it contains just one column, the result is still a data frame. To work directly with a Series, we need to convert it using the to_series function. The select method in Polars returns a new data frame even if we are selecting only a single column. This means that if I create an s1 variable and assign this result to it, it's not just a single column, but a Polars data frame containing one column. So if I remove the to_series method and check the dtype of s1, it results in an error, as dtype is an attribute of a Series, not of a data frame. And here we are, we have an error. If I use the to_series method again, s1 is no longer a data frame but a Polars Series, and the dtype attribute will return the data type of the District column. Let's print it out. In the first case, we can see that printing s1 displays the values of the District column as a Series. Here we can see the shape, the data type, and the values. The second case displays the District column in data frame format, showing the column name and its data.

4. Data Manipulation in Polars: Arithmetic Operations, Column Management, and Filtering Techniques: In Polars, we can perform arithmetic operations on Series or columns using standard Python operators. For example, we can use addition. If we add ten to a Series, we can see that each number in the resulting Series has increased by exactly ten compared to the number in the original Series. Here I print the original Series and we can see the result. Each value increases by ten. Multiplication. Multiplying a Series by two doubles each value.
Here I compare it to the original Series, and we can see that each value is doubled. Subtracting 20 decreases each value by 20. Let's go on with aggregation methods. Polars provides built-in methods for aggregations, such as sum. The sum method in Polars is used to calculate the sum of all elements in a Series or a column in a data frame. When applied to a column containing integer values, it returns the total sum of those values. The mean method in Polars calculates the average of all elements in a Series or column. In our example, it computes the arithmetic mean of all integer values in our Series. We can also use comparison operators in Polars to create Boolean Series. In Polars, comparison operators allow you to compare values in a column with other values, and the result is a Boolean Series. A Boolean Series is a sequence of true or false values, which you can use to filter, analyze, or manipulate your data. Here's how the operators work. In this case, we create a new Series containing Boolean values. The operation compares each value in the column with 20 and returns a Series of the same length as the original column, where each element is either true or false. Also in Polars, we can use the gt (greater than) method. The first time it returned only false, so I changed the comparison value to 30. This method is used to compare a column with a specific value, similar to the greater than operator that we used earlier. The gt method is a built-in function that performs the same operation, but allows for more flexibility. It can be useful when working with methods that require function calls instead of operators. In some cases, it may be preferred when using method chaining or working with custom expressions in a query. So the first case we can use when checking values in a data frame.
The gt method that we used in the second case we can use when we need function-based processing, such as method chaining or compatibility with certain query structures. In most cases, both approaches achieve the same result. This is useful for filtering or analyzing data where you want to focus on values that meet a specific condition, like finding records with higher sales than a certain target. The is_between method checks if a value falls within a specific range of numbers. For example, using is_between with 10 and 30 will return true for all values in the column that are between 10 and 30, including 10 and 30, and false for those that are outside of this range. This is particularly helpful when you need to filter data or create conditions that check if values lie within a specific range, like ages 18 to 65 or prices from $10 to $100. In Polars, adding a new column to a data frame is done using the with_columns method. It's important to know that Polars data frames are immutable, meaning that the original data frame doesn't change when you add a new column. Instead, this method creates and returns a new data frame with the new column added, leaving the original data frame unchanged. So we use the with_columns method. It tells Polars to add new columns to our data frame. To add a constant value as a new column, we use the lit function. This function creates a literal, or fixed, value that will be assigned to every row in the new column. For example, if we use lit(1), it means every row in the new column will have the value one. After creating the new column, you can give it a name using alias. In our case, I call it new column. This is how you specify what the new column will be called in the data frame. After executing these steps, we create a new data frame based on data frame two, with a new column named new column added. The new column contains the value one for all rows. And as expected, the original data frame remains unchanged.
If you want to delete a column from a data frame, you can use the drop method. This method allows you to remove a column by providing its name as the argument. You simply need to tell Polars which column you want to remove by specifying the column's name. It's important to know that Polars data frames are immutable, meaning that the drop method doesn't modify the original data frame directly. Instead, it creates a new data frame with the column removed. This is done to ensure that the original data remains unchanged. Since the original data frame doesn't change, if you want to keep the modified data frame, the one without the dropped column, you need to reassign the result back to the original variable or store it in a new variable. Without doing this, the original data frame will remain unchanged. Now let's look at filtering a Polars Series based on certain conditions. For example, here we are filtering to keep only the even numbers. The condition checks for even numbers, and the filter method keeps only those elements where the condition evaluates to true. This is really useful, as you can adjust the condition inside the filter to filter the Series based on different criteria. For instance, I filtered all values greater than 20. If I change it to 30, the condition becomes clearer. Let me quickly update that and rerun the cell for the next example. Now, let's create two Series. It's probably better to just name the Series. I removed the print call because Jupyter Notebook renders it well. And the second Series is just slightly different from the first. Now let's concatenate them. When working with Polars, the concatenation of Series depends on the data type. If you concatenate two string-type Series using the plus operator, the operation performs element-wise concatenation of the strings. This means that each element in the first Series is concatenated with the corresponding element in the second Series.
Since both Series are of the string type, the plus operator concatenates the strings. However, when Series of different types are involved, Polars will automatically cast the Series to a compatible data type before performing the operation. Here, Polars automatically converts the integers into strings before performing the concatenation. If you combine two integer-type Series using the plus operator, the operation will perform element-wise addition of these integers, not string concatenation. As a result, the Series will contain summed integers.

5. Mastering DataFrames in Polars: Slicing, Descriptive Statistics, and Advanced Data Exploration: Let's quickly review how to access rows in Polars. In Polars, you can access rows using indexing and slicing, similar to how it works in other data manipulation libraries like Pandas. However, there are some key differences in how Polars handles rows due to its columnar data structure. You can access rows using indexing, but Polars returns a new data frame instead of a single row object. When you access a specific row using an index, Polars returns it as a new data frame rather than as a single tuple or individual row object. This is different from Pandas, where accessing a row returns a Series, a one-dimensional object. Since Polars is optimized for working with columns rather than rows, this behavior ensures that data remains consistent and efficient when performing operations. You can also extract multiple rows using slice notation, which works similarly to how you slice lists or arrays in Python. This allows you to retrieve a range of rows efficiently without changing the data frame structure. Unlike Pandas, Polars doesn't use an explicit row index. Instead, rows are identified by their position, making it more efficient when working with large datasets. This columnar approach helps Polars perform data operations much faster than traditional row-based indexing methods.
The describe method in Polars is a powerful tool used to generate descriptive statistics for a data frame. Descriptive statistics help you summarize and understand the key features of your data, allowing you to quickly get an overview of its characteristics. Here we can see what the describe method provides. Count shows the number of non-null values in each column. It helps you understand how much data is available for each variable and whether there are any missing values. The mean, or average, of each numerical column gives you an idea of the central tendency of the data: what value is the most typical for that column? The standard deviation tells you how spread out the data is around the mean. A small standard deviation means the values are close to the mean, while a large standard deviation means the values are more spread out. Depending on the data, Polars may also provide other statistics, which give more insight into the distribution of values in each column. The describe method is great for quickly gaining insight into your data. It helps you identify patterns and understand the general behavior of the data, detect outliers or values that are unusually far from the mean, and get a sense of how much variation exists in the dataset, which can help with making decisions about how to clean or process the data further. The estimated_size method in Polars is used to estimate how much memory a data frame uses in your system. It gives you an approximate size of the data frame, so you can understand how much space it's taking up in memory. When you call estimated_size on a data frame, it calculates the memory usage of the data frame in the unit you ask for, such as megabytes. This is useful when you're working with large datasets and want to ensure your system has enough memory to process them efficiently. Instead of measuring memory usage in some slower or less efficient way, you can get a quick estimate.
The is_duplicated method in Polars is used to identify duplicate rows within a data frame based on all columns. It returns a Boolean Series indicating whether each row is duplicated somewhere else in the data frame. To get only the duplicated rows, we can filter the data. This will remove all non-duplicated rows, leaving only the duplicates in the resulting data frame. As we can see, there are no duplicates in this case. The is_empty method in Polars is used to check whether a data frame contains any data or not. Essentially, it helps you determine if the data frame is empty, meaning it has no rows or data inside it. For example, you could have a data frame with column names, but no data in the rows. This would be considered empty. When you use the is_empty method, it checks if the data frame has any rows. If the data frame has no rows, it will return true, indicating that the data frame is empty. If the data frame has one or more rows, it will return false, meaning the data frame is not empty. This method is especially useful in situations where you might want to perform some operations on a data frame, but only if it contains data. For example, before performing a complex calculation or data transformation, you may want to make sure the data frame actually has data to work with. This can prevent errors or unnecessary computations on an empty dataset. The is_unique method in Polars is used to check whether the values in a particular column are unique. When you call the is_unique method on a column in a Polars data frame, it checks each value in that column. The result is true for every value that occurs only once and false for any value that is repeated. It helps in cleaning or validating data before performing tasks like analysis, merging datasets, or creating indexes. Here I used the filter again to get the unique values instead of just true or false. Polars also provides the n_unique method, which counts the number of unique values in a Series or data frame.
It returns an integer representing the count of unique values. When applied to a column, n_unique returns the total count of distinct values in that column. If used on a data frame, it counts unique rows. The n_unique method is helpful when analyzing data to understand its diversity or spread. It can be used to detect duplicate values, for example, by checking how many different customer IDs, product names, or user emails exist in a dataset. It helps with data validation, ensuring that a column meant to have unique values doesn't contain unexpected duplicates. The null_count method in Polars is used to count the number of missing values in a column or an entire data frame. When applied to a column, it returns the total number of null values in that column. If I use it on a data frame, it counts null values for each column separately. Missing values can cause issues in calculations, so knowing how many null values exist helps decide how to handle them. If a column that should always have data has null values, it may indicate data entry errors. Many machine learning models cannot handle missing data directly, so counting null values is the first step in deciding how to fill or remove them. The count method in Polars is used to count the number of non-null values in a column or an entire data frame. It helps you quickly determine how many actual, non-missing values exist in your dataset. When I apply this method to a column, it returns the total number of non-null values in that column. If I use it on a data frame, it counts the non-null values for each column separately. It helps determine how much usable data is available in each column. By comparing the count with the total number of rows, you can see how many missing values exist. Many data processing steps require non-null values, so knowing how many valid entries are present helps with data cleaning and processing.
The mean_horizontal method in Polars is used to calculate the mean, or average, across the rows of a data frame. It computes the mean for each row individually, treating each row as a separate sequence of values. This method is useful for summarizing data and gaining insights into the distribution and characteristics of values within a dataset. To find the minimum or maximum values in a data frame, you can use methods like min and max. The min function returns the minimum value of each column in the data frame. The max function returns the maximum value of each column. These functions work not only with numerical data, but also with other types of data such as strings or dates. The product function in Polars is used to calculate the product of all values in a column or series, or in multiple columns of a data frame. This means it multiplies all the values together and returns the result. When applied to a column, it returns a single number representing the product of all values in that column. For a data frame, it calculates the product for each numerical column separately. It's very useful for mathematical calculations, financial analysis, and data validation. If you want to estimate the spread of values, measured as the squared deviation from their mean, in each numerical column, you can use the var function. It measures how much, on average, values in a column deviate from the mean. It's commonly used in statistical analysis to understand the distribution and variability of data. The std function computes the standard deviation of values in each column of a data frame. A low standard deviation means that values are close to the mean, while a high standard deviation indicates that the values are spread out across a wide range. 6. 
Exploring Polars DataFrame Methods: Flags, Schema, Column Operations, and Data Conversion Techniques: The flags attribute in Polars returns a dictionary of flags for each column in the data frame. Each flag is represented as a Boolean value indicating whether the flag is set for that column or not. For example, I see the sorted flags here, but if I want to check whether a column is unique, I can use the is_unique function. This check returns a Boolean value: true if all the values in the specified column are unique, with no duplicates, and false if there are any duplicate values. Here I check whether the District column in my data frame contains only unique values or not. In the first case, I checked for the presence of unique values in the column, whether they exist at all. Then we found out whether the column consists entirely of unique values. As we can see, there are no unique values, and the column is not marked with a unique flag. To retrieve the list of column names in the data frame, you can use the columns attribute. It returns a list of strings where each string is the name of a column. This is useful for quickly checking the structure of the data frame and understanding which columns are available for analysis or processing. The schema of a data frame refers to its structure, specifically the column names and their data types. In Polars, understanding the schema is important because it helps you know what kind of data you're working with. The schema includes the names of all columns in the data frame, which helps you quickly see what data is available. Each column in the data frame has a data type, and knowing the data types is also important because some operations only work on specific types of data. So the schema helps check if the data is in the correct format before performing operations. Some operations might fail if the data type is incorrect, so checking the schema beforehand can prevent errors. 
The width attribute in Polars is used to find how many columns are present in a data frame. This is useful when working with large datasets where manually counting columns is impractical. It helps quickly check how many different data fields exist in your dataset. It can also help when working with dynamic datasets, where knowing the number of columns helps with tasks like looping through columns or selecting specific ones. The glimpse method in Polars is used to preview a summary of your data frame, giving you a quick look at the data structure. It's helpful when you want to understand the general layout of your data without having to display the entire dataset, especially when working with large datasets. The glimpse method provides a compact overview of the data frame, including column names, the data type of each column, and a preview of the first few values in each column. This method doesn't show the entire dataset, but instead offers a quick snapshot so you can understand the data structure at a glance. It helps to quickly check what data is available, what the columns are, and the types of data you're working with, without displaying everything. It also helps you avoid overwhelming yourself with large amounts of data by only showing a small, summarized portion. You can also spot potential issues like unexpected data types, missing values, or inconsistencies in column names during this quick review. The n_chunks method in Polars is used to find out how many chunks a data frame is divided into internally. In Polars, data can be split into chunks for more efficient processing, especially when working with large datasets that do not fit into memory all at once. A chunk is a smaller portion of the entire data frame. Polars often divides large datasets into chunks to handle them more effectively in memory. Each chunk can be processed separately, allowing Polars to work with datasets that are too large to fit entirely into memory at once. 
This is part of Polars' efficient memory management system. When you use the n_chunks method, it returns the number of chunks that our data frame has been split into for processing. This can be useful for understanding how Polars is managing memory and data distribution for your specific data frame. In some cases, understanding the chunking of data helps you monitor and debug how Polars is handling data in your program. If a data frame has a large number of chunks, operations might be less efficient compared to when the data is stored in a single chunk. Understanding chunks is key when working with large datasets. The to_arrow function in Polars is used to convert a Polars data frame into an Apache Arrow table. This is useful for data interoperability because Apache Arrow is a widely used format for efficient data exchange between different data processing systems. Apache Arrow is a columnar memory format designed for fast data processing. It allows different data processing tools like Pandas or Spark to share data efficiently without needing to copy or convert it multiple times. When you use the to_arrow function, Polars transforms its internal data format into an Apache Arrow table while keeping the columnar structure. It helps with faster data sharing, efficient memory usage, and better compatibility. Polars also provides the to_dict function, which converts a data frame or series into a Python dictionary. Each column in the data frame becomes a key in the dictionary, with the corresponding values forming a list for that key. We also have the to_dicts method. In Polars, it converts a data frame into a list of dictionaries. Many Python functions and libraries work well with lists of dictionaries. For example, if you need to convert your data into JSON format, this can be a helpful intermediate step, as lists of dictionaries are easily serializable to JSON. If for some reason you need a string representation of the data frame, you can use the to_init_repr method. 
The string produced by to_init_repr is particularly useful for debugging or for scenarios where you need to generate code that can reproduce that data frame exactly as it is. You can also use the to_numpy method, which converts the Polars data frame to a NumPy array. This is useful when you need to leverage NumPy's powerful array manipulation and mathematical functions. Since NumPy arrays are highly efficient for numerical computations, this method is beneficial when performing operations that are better handled by NumPy. The same can be said about the to_pandas method. It converts the Polars data frame to a Pandas data frame. This is useful when you want to use Pandas-specific functionality or when working in an ecosystem where Pandas is the primary data manipulation tool. The to_torch method converts a Polars data frame to a PyTorch tensor. This is ideal for preparing data for deep learning models in PyTorch. However, since I don't have this library installed, I encounter an error. We can install this library with this command, but right now we don't need it, so I leave it as it is. 7. Advanced Data Manipulation in Polars: Grouping, Aggregation, Sorting, and Custom Transformation: Now let's continue with the group_by method. If you have worked with Pandas before, you're likely familiar with this method. If not, here's a quick explanation. The group_by method in Polars is used to group a data frame by one or more columns. In our case, grouping by the year column means that all rows with the same value in the year column will be grouped together. Each unique year will form a separate group. However, in our case, we only have one year, 2022. Next, we use aggregation. The agg method in Polars is used to perform aggregate calculations on groups of data within a data frame. We then specify the Beat column and apply the count function. This means that for each group, we count the number of values in the Beat column. Essentially, this counts the number of rows in each group, since each row represents an incident. 
Then we use the alias function to rename the column that results from the aggregation to count_incidents. This is done to give the aggregated column a meaningful name. In theory, we can use this approach to summarize data and understand the frequency of incidents per year. However, since we only have one year in our dataset, we only see the count of incidents for 2022. Now let's consider another example, where I group by year, but instead of using an aggregation function, I use the all method. The all method retains all rows within each group. This operation is useful when you need to work with all data points within each group without performing any aggregation or transformation that reduces the dataset. In some cases, when you want to get a quick overview of each group, you can use the first method instead of all. This is useful when you want to extract sample data points from each group or reduce the data frame to one row per group for further analysis or reporting. I don't have the best example for this, so let's group by primary type instead. That makes it clearer. Here we can see the first row for each group, in our case for each value of the primary type column. The last method works almost the same, except it returns only the last row for each group. Now I want to define a custom function to calculate the arrest percentage. First, I get the total number of cases by calculating the length of the data frame. Then I get the number of arrest cases by summing the arrest column. Next, I calculate the arrest percentage by dividing the number of arrests by the total number of cases. Finally, I return a data frame with the result. This code groups the data by primary type and calculates the arrest percentage for each type of crime. From the result, we can see which types of crimes have higher or lower arrest rates. Now that we have created this function, it's time to use it. A very useful method for this is map_groups. 
The map_groups method takes the grouped data as input, applies the custom function to each group, and then returns a new data frame with the result of this operation. In this case, we group the data frame by the primary type column and then apply our custom arrest percentage function to each group. This function processes each group independently and can perform any custom logic needed. The result is a new data frame where each row contains the result of applying the arrest percentage function to one group. There is also an apply function, which applies a custom or user-defined function in a group-by context. However, it has been deprecated and renamed to map_groups. So if you see the apply function in older code, don't get confused. It's just an outdated version. You're probably familiar with the head function, which allows you to see the first several rows of a data frame. However, let's consider another useful function, tail. Let's say I want to see only the last rows of each group. Using the tail method within a group allows us to focus on the most recent or last few entries for each category. This can be useful for examining the most recent cases in our dataset. In this case it's crimes, but in other cases it could be any type of data. You can also use the sort function after applying these operations. It allows you to sort the extracted entries by a specific column, for example community area, for better readability or further analysis. So what did we do here? We grouped by primary type, we selected the last row for each group, and we sorted the data by community area. This sequence of operations helps in analyzing the most recent entries while improving readability. 8. Advanced Data Operations in Polars: write_csv, Pivot Tables, and Join Strategies: For the next example, I need two data frames. Joining data frames is a common operation in data manipulation where rows from two or more data frames are combined based on common columns. 
Polars provides various types of joins, similar to SQL-style joins. Let's print both data frames and start with the inner join. I take my first data frame and use the join function. Then I specify the second data frame that I want to join with the first one. Next, I define the column to join on. This can be a single column name or a list of column names. In my case, it's name. Finally, I specify the type of join, which in our case is an inner join. Sorry for the typo. This means that only rows with matching keys in both data frames will appear in the result. Bob and Charlie are present in both data frames, so we see them in our result. Now let's try a left join. This means that all rows from the left data frame and only the matching rows from the right data frame will be in the result. We can see that all names and all information from the left data frame appear in our result, while from the right data frame, in our case the second data frame, we see data only for the two names, Bob and Charlie, that are present in both data frames. For a full join, all rows from both data frames will appear in the result, with null values where there are no matches. Here we can see the nulls in the result. We also have the cross join. It returns all possible combinations of rows from both data frames, a Cartesian product. Every row from the left is combined with every row from the right. To show you all types of joins, I also want to explain the semi join. It returns only rows from the left data frame that have matching keys in the right data frame. You might ask what the difference is between an inner join and a semi join, since we got almost the same result. Both inner join and semi join return rows where there is a match between the two tables, but they have a key difference. Inner join returns all matching rows from both tables, including columns from both. 
Semi join returns only the rows from the left table that have a match in the right table, but does not include columns from the right table. So while they may return the same number of rows, the inner join includes additional columns from the right table, whereas the semi join keeps only the original columns from the left table. We also have the anti join. It's a type of join operation where rows from the left data frame are included in the result only if there is no matching key in the right data frame. Essentially, it filters out any rows from the left dataset that have a corresponding match in the right dataset, based on the specified key or column. If you have noticed, I often use Tab to avoid manually typing existing variables repeatedly. Press the Tab key and Jupyter Notebook will suggest the variable for selection instead of manual input. Let's continue with pivot tables. For this, I will slightly change my data frame. If you have worked with Pandas before, you probably know what a pivot table is. If not, pivot operations allow you to reshape your data by summarizing it in different ways, based on a specified aggregation function. Here, I will show you how to reshape my recently created data frame. I specify the values to be pivoted, and in my case, it will be the score column. Then I specify that the rows of the new pivot table will be indexed by the name column. I will also use the city column, which will serve as the basis for the new columns of the pivot table. For the aggregation function, I will use first. This means that if there are multiple values for a particular combination of name and city, only the first value will be used. The resulting data frame will have unique name values as the index rows and unique city values as the columns. Null values in the pivot table indicate that there were no matching combinations of name and city in the original data frame for those cells. Despite the fact that we have two Bobs, we only get the first one, with a score of 90. 
Next, I will change the aggregation function to sum. Now we get the sum of scores for the two Bobs, the two Charlies, and so on. I add a row below our data frame to show the updated results. And of course, the sum is 160. We can see that the first Bob has a score of 90 and the second Bob has a score of 70. The same applies to Frank and the other names. The mean function shows the average score value for each name and city combination. The max function shows the maximum score value. We can also calculate the median score values. In this case, we don't see a difference between the median and the mean because our data frame isn't the best example, but they represent different aspects of the data distribution. The mean represents the average value of the dataset and is suitable for symmetrically distributed data without extreme values. The median, on the other hand, is the middle value of a dataset when it's ordered from least to greatest. The median is more suitable for skewed or non-normally distributed data. For the next example, I need two different data frames. I will copy the first one and make a few changes. In Polars, we can compare two data frames to check if they are equal, that is, whether both data frames match exactly. If both data frames have the same schema, which means the same column names and data types, as well as the same data in each corresponding column and row, the comparison returns the Boolean value true, which indicates that both data frames are identical in terms of both schema and data. It returns false if there is any difference in schema or data between the two data frames. In the first example, I compared two different data frames and got false. Then I compared two identical data frames and got true, which makes sense since they are exactly the same. If I undo the changes I made to the second data frame, we are left with two identical data frames again, both in terms of schema and data, and the function returns true. 
Using the equals function allows you to check whether two data frames are completely identical, which is useful for data validation and testing purposes. You can save a data frame to a CSV file using the write_csv method. For example, if I want to save our data frame to a file named data.csv, I can do so with a single command. After running the ls command, we can see that the file has been successfully created. Saving a data frame to a CSV file is really helpful when you need to store data for later use, making it easier to share or reload. CSV files are widely supported and can be opened by many software programs. 9. Understanding Eager and Lazy Execution in Polars: A Speed Comparison with Pandas for Large DataFrame: Polars offers two execution models for data frame operations, eager mode and lazy mode. Understanding the difference between these two models is crucial for optimizing performance. Let's start with eager mode. Operations on a data frame are executed immediately. The results are computed and returned as soon as an operation is called. This means that every step you take is executed immediately and sequentially. This mode is easier to debug because you can see the result of each operation right away. Eager mode is often preferred in interactive environments like Jupyter Notebook because it allows you to see the result of operations immediately. It's most suitable for interactive data analysis and smaller datasets where immediate feedback is needed. Let's continue with lazy mode. Instead of executing operations immediately, Polars first builds and optimizes a query plan before execution. This approach allows Polars to optimize the execution plan, reducing the number of operations and the amount of data processed. Lazy mode in Polars refers to a way of performing data operations where computations are not executed immediately. Instead, operations are delayed and only executed when you explicitly request the result. 
This approach allows Polars to optimize the sequence of operations before actually performing them, improving performance, especially with large datasets. Lazy mode is generally most suitable for large datasets or workflows that involve many steps. The optimization it applies can lead to significant performance improvements over eager execution. I think the functions read_csv and scan_csv are pretty self-explanatory, but I will briefly explain read_parquet. The read_parquet function in Polars is used to read data from a Parquet file into a Polars data frame. Parquet is a columnar storage file format, optimized for use with data processing frameworks. Unlike row-based formats like CSV, Parquet stores data by columns instead of rows, which makes it highly efficient for both storage and processing. The columnar format is particularly useful when you need to read only specific columns from large datasets, as it allows you to avoid loading unnecessary data into memory. If you only need a few columns from a dataset, Parquet allows you to load only the relevant data into memory, improving speed and reducing memory usage. Parquet is also a great choice for big data applications, especially when you are using frameworks that support it, such as Apache Spark, Dask, or Polars. Now let's compare the performance of reading a CSV file using Pandas versus Polars. If you remember, we have a large dataset with more than 100 million rows, so I'm going to use it. First, I import Pandas and use the read_csv function. To measure execution time, I use the %timeit magic command in Jupyter Notebook. This command allows us to measure execution time and calculate average results over multiple runs. It will take some time to load. This really takes a lot of time. Let me remind you that the data frame has more than 100 million rows, so executing operations on it takes a considerable amount of time. After running the code seven times, we get an average execution time of 32 seconds per loop. 
Now let's repeat the same steps as we did with Pandas, but this time using Polars. We will use %timeit again to see how long it takes to load the data frame. The difference in execution time is immediately noticeable. Polars loads the data frame significantly faster. I assigned the previous expression to the variable df so that we can continue working with the loaded data frame. So we have just read more than 100 million rows using both Pandas and Polars. If we look at the data frame, we can see that some of the column names are meaningless. To make our data easier to work with, I will rename them. First, I prepare a list of new column names and then use the rename function to rename the columns in the Polars data frame. Now we can see that the columns have been renamed. We have a properly structured dataset, and I will proceed with grouping the dataset by the unit and level columns. After grouping, I will perform an aggregation operation on the number column by calculating the sum of its values. First, I select the column named number from the data frame. Then I use an aggregation function that adds up all the values in that column. Instead of keeping the default name, I assign a more meaningful name to the output column. For this, I use an alias. For time off, I'm calculating the mean. For salary, I'm calculating the max. Finally, I will select the required columns from the aggregated data frame. The select method in Polars is used to choose specific columns and apply transformations to them. It creates a new data frame with only the selected columns, leaving the original data frame unchanged. Sorry for the typo. I use the print function to display the result along with a heading. In a regular code editor, this is necessary to see the output. However, in Jupyter Notebook, you don't need to print. You can simply type the variable name, polars_result, and it will automatically display the result in an interactive way. And here we are. 
Now, I'll use the %timeit command to run the code several times. At the end, we will see how long it takes to perform the operation. We see the code running several times, and at the end, we find that it took 3.83 seconds per loop, or I can say per operation. Let's repeat the same process with Pandas. It takes more time than with Polars. We look at the same data frame and need to rename the columns, so let's do that. The command is almost the same as with Polars, except we define the columns first and then pass in the new column names. Now we can proceed with the same grouping and aggregation as we did above. We can see here that Pandas uses a dictionary inside the aggregation function to specify the column names and operations. In Polars, the aggregation was applied to columns by selecting them first and then using functions like sum, mean, and max. In Polars, you cannot use a dictionary directly inside the aggregation function in the same way as in Pandas. Polars requires you to specify each aggregation operation explicitly for each column, and this is the difference. Pandas provides a shorter syntax for aggregation. Both expressions will give us the same result, but the syntax is slightly different between Pandas and Polars. I strongly recommend trying it yourself. Pause the video and repeat this code by yourself. This is what the final table looks like. Now I will use %timeit again to run the code several times. It was run seven times. After running it, we see it took 7.34 seconds per loop. We can clearly see the difference. But we also need to consider the case when we perform these operations along with reading the data. I will rewrite the code just a bit, so we will read the CSV file, rename the columns, perform the group-by operation and aggregation, and then select specific columns, all in one chain using Polars method chaining syntax. This takes more time. And now it took 11.8 seconds. Here, I rewrote the code for Pandas to do the same thing. 
I combined all the steps into a single chain, just like we did in the Polars code above. Even without %timeit, I run it and my system runs out of memory. Oh, yep, my system is dead. But we can manage this even with Pandas if we start reading the CSV file in chunks. Let me show you how we can handle this situation. Reading the file in chunks helps manage memory usage. First, I set the number of rows to be read into memory at one time by setting the chunk size. Then I initialize an empty data frame that will store the final aggregation results from each chunk. I read the CSV file in chunks using the for loop, rename the columns, and perform the same grouping and aggregation as before. At the end, I concatenate the aggregated results from each chunk into the final data frame. Also, I used ignore_index=True. This means that when performing operations like concatenation, the original index values will be ignored and a new default integer index will be assigned to the result. It helps avoid duplicate or non-sequential index values when modifying data frames. Here we can't break the line like that, so I changed it for better readability. It takes some time, but we finally get the result. To avoid running out of memory, we can use %memit, a magic command from the memory_profiler package. It measures and prints the memory usage of the current code execution. This helps monitor memory usage for each chunk. In this version, %memit is inside the for loop, before the print operation. It will execute in every iteration of the loop. This is useful for analyzing and optimizing work with large files, but it may slow down code execution due to the additional measurements. You can install the package using pip. Of course, you should use this command to enable it in a Jupyter Notebook environment. You use this command only once, not before each use of %memit. After that, you can use %memit to check the peak memory usage. Here we used this command to measure the total memory consumption of the Pandas result. 
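The chunked reading approach with Pandas can be sketched as below. This is a simplified variation, not the exact code from the lesson: the CSV is a tiny in-memory string, partial results are collected in a list instead of an initially empty data frame, and a final re-aggregation step is added because the same group can appear in more than one chunk.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for the large file.
csv_text = "unit,level,number\nA,1,10\nA,1,5\nB,2,7\nA,1,3\n"

chunk_size = 2          # rows held in memory at one time
partial_results = []    # aggregated result of each chunk

for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=chunk_size):
    # Same grouping and aggregation as before, but on one chunk only.
    aggregated = chunk.groupby(["unit", "level"], as_index=False).agg({"number": "sum"})
    partial_results.append(aggregated)

# ignore_index=True assigns a fresh 0..n-1 index to the combined frame.
combined = pd.concat(partial_results, ignore_index=True)

# Groups split across chunks still need to be merged into one row each.
pandas_result = combined.groupby(["unit", "level"], as_index=False).agg({"number": "sum"})

print(pandas_result)
```

The re-aggregation works here because a sum of partial sums is still the overall sum; for a mean, the per-chunk sums and counts would have to be carried along instead.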
This second measurement runs after all chunks have been processed. So if the goal is to check the final memory usage, the second variant is better. If the goal is to track memory consumption over time, the first variant is more informative. However, we can avoid this issue entirely by using Polars instead of Pandas. Let's return to the Polars library. In the previous example, we used read_csv from Polars. This function reads the entire CSV file into memory immediately, loading all data into a data frame, so we can perform operations on it directly. However, if the CSV file is large, this approach can consume a lot of memory, just like we saw with Pandas. But as I mentioned earlier, Polars also has a lazy mode. With lazy evaluation, or lazy mode, we use scan_csv instead of read_csv. The scan_csv function doesn't read the data into memory immediately. Instead, it creates a lazy frame, which records transformations and only executes them when explicitly triggered. This enables query optimization and more efficient execution. I copied the previous code and simply replaced read_csv with scan_csv. If we check the type of polars_result now, we get a lazy frame instead of a regular data frame. Since lazy frames use deferred execution, we can actually visualize the execution plan using the show_graph method. This method generates a graphical representation of the query plan, helping us understand and debug the data processing pipeline. It provides insights into the steps and optimizations involved in the query execution. To use show_graph, you need to have Graphviz installed on your system. On macOS, you can install it using brew install graphviz. On Windows or Linux, you can check the official documentation for the appropriate installation command. Here you can choose the command for your operating system. Looking at the query plan, we can see that only five out of eight columns are being selected. 
This means that Polars only loads the required columns instead of reading the entire dataset into memory. Next, I call head on polars_result, and instead of getting the actual data, we see a naive query plan. This happens because lazy frames don't execute immediately. They just build the execution plan. To actually execute the query, we need to call collect. The collect method triggers execution, processes all operations, and returns a regular data frame instead of a lazy frame. Since our dataset contains more than 100 million rows, execution takes some time. At the end, we get the result. If we check the type of our data frame now, we see that it's a Polars data frame, not a lazy frame anymore. Now I copy the previous code using scan_csv and run it with collect, then measure execution time with %timeit. The code runs multiple times, and at the end, we get the result. Let's compare this to the previous execution. Using read_csv, the execution took 10.7 seconds. There is no huge difference, but look at this. If we enable streaming execution by setting streaming=True in the collect method, we process data incrementally instead of loading everything into memory at once. Streaming execution is great for large datasets because it reduces memory usage by processing chunks instead of the entire dataset, and it can potentially leverage parallel processing, where different chunks are processed on separate CPU cores simultaneously. Streaming execution generally improves performance for large datasets by reducing memory usage and allowing parallel processing. However, for small datasets, the difference may be negligible. With Polars, users don't need to manually split their data into smaller chunks like we did with Pandas. Polars handles it automatically. Now, we can see a significant difference in performance. 10. Data Visualization in Polars. 
Advantages, Limitations, and a Comparative Analysis: When it comes to data visualization in Polars, it's often recommended to convert your data frame to a Pandas data frame. As I mentioned earlier, we can use the to_pandas method for it. Using Pandas for plotting provides more flexibility. But you can also use Matplotlib or Seaborn, both widely used libraries for data visualization. You can find videos about these libraries in my profile. Matplotlib provides more flexibility and allows fine-tuned control over plots, while Seaborn makes it easier to create complex statistical plots with a simpler syntax and better default styles. However, these aren't the only options. You can also use hvPlot, a high-level plotting library that works natively with Polars data frames. It offers a powerful and flexible interface for creating interactive visualizations. If you want to learn more about hvPlot, you can check out the documentation. Of course, you will need to install it before using it. In Polars, you can also use the built-in plot method to create basic visualizations without converting to Pandas or using hvPlot. The plot method understands the structure of Polars data frames and can generate plots using the basic Matplotlib backend. You can specify the type of plot, such as scatter, line, or histogram. You also define column names for the X axis and Y axis, along with additional plot parameters like the title. These types of plots are suitable for creating simple visualizations. However, keep in mind that the built-in plotting functionality in Polars is limited compared to libraries like hvPlot or Matplotlib, which offer more advanced features. For anything more advanced, you may need to use other libraries. So we have covered a lot today. While Polars offers many advantages, it also has some limitations and disadvantages. Let's see what they are. 
First, Polars is a relatively new library compared to Pandas, and it may not have as extensive community support or as many third-party extensions available. Pandas has a larger and more established ecosystem, with many third-party libraries and tools designed to work seamlessly with Pandas data frames. Being a newer library, Polars may not have the same level of compatibility with existing data analysis libraries and tools. You should consider this before switching from Pandas to Polars. Another limitation is that Polars doesn't have the same level of indexing flexibility as Pandas. While it supports row-based indexing, it lacks the robust multi-level and hierarchical indexing capabilities that Pandas offers. For example, in Polars, you can only set a single column as the row index, whereas Pandas allows for more complex indexing structures. If you have used multi-level indexing in your projects before, this might be something to consider when switching to Polars. Additionally, visualization capabilities in Polars are currently limited compared to Pandas. In Pandas, it's relatively straightforward to visualize data directly from the data frame using built-in plotting methods or by seamlessly integrating with external visualization libraries. So if visualization is a crucial part of your workflow and you don't need the performance benefits of Polars for your specific use cases, Pandas might be a more convenient option for you. Switching from Pandas to Polars can potentially lead to unexpected behavior or functionality gaps due to the differences in the feature sets and capabilities of the two libraries. As I mentioned earlier, Pandas has been around for a longer time and has a more mature and comprehensive set of features. While Polars provides many essential data manipulation and analysis functions, it may lack some of the more advanced features or niche functionality available in Pandas. So, guys, congratulations on completing this course. 
You've done a good job. Keep learning, keep coding. See you in the next courses. Bye.