Looping Through Pandas DataFrames

Data Science Rebalanced, Data Scientists

Lessons in This Class

- 1. Class Trailer (1:43)
- 2. Course Overview & Tools (1:22)
- 3. Guide to Optimization (1:19)
- 4. When to Optimize (2:36)
- 5. Overview of Methods (0:22)
- 6. Load a Jupyter Notebook (2:17)
- 7. About the Data (1:03)
- 8. For Loop (2:46)
- 9. Iterrows (1:13)
- 10. Itertuples (1:22)
- 11. List Comprehension (1:41)
- 12. Apply (1:19)
- 13. Vectorization with Pandas Series (1:09)
- 14. Code Example 1 - Slow (4:26)
- 15. Code Example 2 - Refactored (3:59)
- 16. Code Example 2 - Slow (2:44)
- 17. Code Example 2 - Refactored (2:13)

About This Class

Have you ever spent an hour writing code to clean your data only to find it takes three hours to run? Sometimes it's ok to switch to another task while you wait for your code to run; however, refactoring is often needed when moving to production environments. In this course, you'll learn how to speed up a task you'll find yourself doing a lot as a Data Scientist or Data Analyst: looping through Pandas DataFrames and transforming your data.

Leah is a Data Scientist at a large financial institution and discovered there is a serious gap between the skills and techniques students learn in school versus what they actually need on the job in the real world. Writing efficient Python wasn't stressed at all in Leah's undergraduate program. She'll help you avoid making the same mistakes she made at her first job by teaching you how to loop through DataFrames quickly.

This course is intended for aspiring data scientists and programmers looking to expand their knowledge of writing efficient Python.

In this course you’ll learn the following techniques for looping through Pandas DataFrames:

For loops
Iterrows()
Itertuples()
List Comprehension
Apply()
Vectorization with Pandas Series

Leah will walk through two real world examples of slow code and show you how to refactor it.

No prior knowledge of Pandas is needed for this course; however, a basic understanding of Python 3 will be helpful (but not required).

Music by TimMoor from Pixabay

Images used in this course - computer

Meet Your Teacher

Data Science Rebalanced

Data Scientists

Teacher

Leah Berg and Ray McLendon are Data Scientists at a large financial institution and have over 15 years of combined experience. They have a passion for seeing people grow and become the best versions of themselves. When Leah and Ray graduated from university, they struggled at their first Data Scientist jobs and quickly realized that academia only told half the story.

While their degree programs placed a large emphasis on machine learning algorithms with perfectly cleaned and balanced data sets, they found the opposite true in the industry. Every problem they encountered required 90% of their time spent focusing on messy and imbalanced data sets, as well as the people generating those data sets.

Leah and Ray created Data Science Rebalanced to help data scientists new to the...


Level: Beginner

Hands-on Class Project

Now that you've learned how to efficiently loop through a Pandas DataFrame, review any projects you've worked on in the past where you've looped through a Pandas DataFrame. Refactor your code using any of the methods from this course, and calculate how much faster your code runs.

If you didn't use Pandas to loop through data in your past projects, create an account on Kaggle.com, review an existing project (check out the Code section), and refactor someone else's code.

Share your code before and after refactoring with the class by uploading to the "Your Project" section.

Transcripts

1. Class Trailer: Hi everyone, welcome to today's video, which is on tips for writing efficient Python. My name is Leah and I am a data scientist at a large financial institution with about four years of experience. My coworker Ray and I really wanted to make these videos because we noticed there's a huge gap between the skills that you learn in school and the ones that you actually need in the real world. So all of our videos will be focused on real-world datasets, real-world problems, and giving you the skills that you need to solve those, skills that they don't necessarily teach in school. We are super excited to have you today and hope you will stick around. There are tons of different ways to speed up or optimize your code in Python. In this course, I'm going to focus on speeding up a specific task that you'll probably find yourself doing a lot as a data scientist, and that is looping through a Pandas DataFrame and applying some transformations to text data. We'll walk through when and why to optimize your code, as well as go through six different ways to loop through a Pandas DataFrame. Then we'll take what we've learned and apply it to two examples of code that we have refactored. This tutorial is meant to be super beginner friendly. We'll use Jupyter Notebooks so that you'll be able to see all the code that I'm writing and run it yourself. And we'll use some really popular data science libraries like pandas and NLTK. The data that we'll use for today's course is the AG News dataset. This is a really popular text dataset that contains over a million news articles, their titles and their descriptions, as well as categories for each of the news articles. In this course, we will just be focusing on the title of the news article. Also, although there are over a million data points in this dataset, we're going to sample only a few thousand records, just because we don't want our code to run forever. And with that, let's get started.

2. Course Overview & Tools: Alright, let's get into some tips for writing efficient Python. This course is meant to be really beginner friendly. We'll be using a couple of really popular data science libraries, such as pandas and NLTK, the Natural Language Toolkit. Now you don't need to be an expert in those libraries, but it does help to have a little bit of familiarity with them. I'll do my best to explain things as we go along throughout the course. I've listed Python 3 under the intermediate skills section. I do expect you to have quite a bit of familiarity with Python already. But if you don't, we'll be using Jupyter notebooks, so you can easily follow along and run the cells yourself. Throughout this course, we're going to be focusing on some text preprocessing for our news articles. I do actually have another course all about natural language processing in Python, and we'll be using several of the methods that are taught in that course. So I would highly recommend checking that one out before you watch this course. For this course, we'll be using PyCharm as our development environment, but really we'll be using Jupyter Notebooks for the actual development. PyCharm will just allow us to spin up a Jupyter notebook and make all of our edits in there. Now you can use any IDE that you want. I just prefer PyCharm, but feel free to use something else; if you use Anaconda, that might be Spyder, or VS Code, or any IDE will do. In addition to PyCharm and Jupyter notebooks, we will be focusing on the libraries pandas and Natural Language Toolkit, or NLTK.

3. Guide to Optimization: So the short version to sum up this course is you may not need to do optimization right away. I'd recommend running through a few steps before you actually jump into optimizing your code. First of all, your code has to actually run for you to be able to know if it needs optimization.
So first take the step to make sure that your code runs. Then after that, check that you get the output that you're expecting, because sometimes your code might run, but the results that it's producing aren't the ones you would expect. Once you've got it running and it gives the output that you expect, then you can start to think about: okay, is the speed that this is running at currently acceptable for me? But it's not all about you. If you're writing this code for someone else, or someone else is consuming your code, you want to make sure that the speed is acceptable for them. And even if there aren't people that are using your code directly, you might have users that indirectly work with your code. So it's important to talk with them about what kind of performance they're expecting out of a certain application. Then finally, after you cover all that, you are potentially ready to optimize. And when you start to optimize, you're going to end up repeating the cycle over and over again until you meet your optimization requirements. I modified this guide slightly from Python's documentation. So feel free to go out to their website and read a ton more about different ways to optimize. And again, in this course I'm going to be focusing only on a really, really small subset of how you can optimize your Python code.

4. When to Optimize: We've talked about steps to optimization, but when should you actually optimize your code? I've listed a few different examples here of when you might want to optimize your code. So let's talk through those. One of the first times you might need to optimize your code is when you start working with big data. I, for the longest time, worked with really small datasets. I wrote code that worked for those datasets. So there were times when I started to scale up to larger datasets that I wasn't actually able to run my code efficiently like I had before.
So once you start dealing with really large datasets, you want to start to try and write your code in the most optimal way possible. Another reason why you might want to optimize your code is if you're creating a reusable component. So let's say you're working on a certain function and you find out that it's actually a really useful function for other people on your team. You have other people that are doing the exact same thing as you, and they could use that function as well. That might be a case when you want to optimize your code, since more people are using it, and potentially they might be running it on a larger dataset than you are, so you go ahead and do some optimization for them. Next up, you might want to optimize when you are deploying to the cloud. Whenever you start running things in, say, AWS or Google Cloud, those resources aren't free. When you run things on your local laptop, it doesn't matter how slow they run. It's just basically about how long you are willing to wait for them to complete. In the cloud, you have this added complexity of actually having to pay for the resources that you're using. So in order to not have your stuff running forever and charging you this huge bill, you probably want to try and optimize your code to make it run as fast as possible. Another example of when you might want to optimize is when you are working with external users and trying to make the best experience possible for them. Say you've created some sort of model that makes a prediction on some website and the user has to interact with that model. Well, they don't want to be waiting 10 hours to get a response back from you for what your prediction is. When you have people that are dependent on something that you are providing them, you definitely want to try and optimize your code and make it as quick as possible. Optimization also comes into play whenever you're trying to quickly iterate during a proof of concept.
And oftentimes when you're working on a proof of concept, you'll have a lot of different methods or techniques that you want to check out. And to be able to try out all of those quickly, your code has to be optimized. Finally, there are ethical concerns for code that runs really long. Even if you have your code running in the cloud, behind the scenes that's all running in a data center somewhere, and data centers have a carbon footprint attached to them. So the longer your code runs, the more you're potentially increasing the carbon footprint of that data center. So if you're conscious about the environment, that might also be another reason to optimize.

5. Overview of Methods: We are going to walk through six different ways to loop through a Pandas DataFrame. We're going to start from the slowest version and work our way up to the fastest version. So we're going to be covering for loops, iterrows, itertuples, list comprehension, apply statements, and vectorization with pandas Series. And if you don't know what any of that means, that's fine. We're going to cover it all in this course today.

6. Load a Jupyter Notebook: All right, so I have opened up my IDE, which is PyCharm. And what I've done is already created a project for this class. I have a folder for the data, and I have my Jupyter Notebook in there as well as my requirements.txt file that has all the libraries that we'll be using in it. Again, feel free to use whatever development environment works for you. Maybe that's VS Code, maybe that's Spyder; it doesn't really matter. We're just going to use this to see our files and then open up a Jupyter notebook from here. So one of the best practices whenever you start up a Python project is to create a virtual environment and install all of your libraries there.
I've already done this, but if you haven't, what you can do is just open up the terminal in PyCharm. I already have my virtual environment set up, and to install those requirements, what I would do is just run this line here. So this is pip install -r, telling it that we have a requirements file, and then we are listing out the name of the requirements file. So to get our Jupyter notebook up and running, here in the terminal I'm just going to type jupyter notebook. And that will spin up a server for us to interact with the Jupyter Notebook. And so here's our local web server with our Jupyter notebook in it. I've got this Python notebook here. You should have been able to download this plus the data and the requirements file with the libraries from the course website. I'm going to click on this tips for writing efficient Python notebook to open that up and we'll get started. All right, so now that we have our notebook, let's start walking through some code. This notebook is really meant to be a standalone piece, and there's a lot of what I covered in the slides in it. So feel free on your own time to read through all this text if you want a little bit of motivation for why we might want to speed up our code. There's a little example here at the beginning. I also have a note here of how to download PyCharm. If you don't have a development environment and would like to have one, I definitely recommend PyCharm. You can follow this link here, which will go out to that site, and then from there you can select your operating system. And I would recommend going with the community version because it has pretty much everything you would need in it. It's what I'm running right now. But if you want to pay for a few more features, they do have a professional version as well. I'm going to skip over some of the stuff that I already covered in the slides, and let's move down to learn more about our dataset.

7. About the Data: We'll be working with the AG News dataset.
And this is a really popular text dataset that has over a million different news articles in it, and it contains their news article titles, the descriptions, as well as a category column for what the different categories of the news articles are. I pulled this dataset from Kaggle, so I did link out here if you'd like to go and look at it. Even though there are over a million records in the dataset, we're just going to be using a sample of them. So what I'm doing here is loading in a couple of libraries that we'll need. I'm reading in this train.csv file. This is going to be our dataset that we'll work with. And then I am sampling 5,000 records from there. This random state is just a number that allows you to reproduce my results. So let's go ahead and run this and take a look at our dataset. So you can see here that we've got this class index column. This has a category for each of the news articles. We have the title of the news article as well as a description of the news article. In this class, we're really going to be focused only on manipulating the title of the news article, but you could also use this dataset for a classification problem if you wanted to.

8. For Loop: So today we're going to be looping over a DataFrame. And there are a lot of different ways you can do that. One of the really great things that I love about Python, or really any programming language, is that there's no single right way to solve a problem. You can have two people, ten people, 20 people all come up with entirely different solutions that solve the same problem. So we're going to walk through six different ways to loop over a Pandas DataFrame. And we're going to be calculating the number of characters that are in the title of each of these news articles. And we'll see how each of these methods impacts performance. So we'll start off with one of the slowest methods.
And this is just a basic for loop. For loops are used across a lot of different languages and are one of the most basic ways that you can iterate over data. So if you've learned another programming language before Python, you've probably written for loops before. What this ends up looking like in Python is the following. For each of these examples, I'm going to start a timer to track how long the examples take. Just note that depending on the resources that you have on your machine, you might get slightly different answers than I do. Even if you run it multiple times, you'll probably get different answers. And so we use the time library to get our starting time. And then we also get our ending time as well and calculate how long it takes. So to loop over a DataFrame with a for loop, what we first do is create a temporary list to hold the number of characters in each title that we're going to be looping through. The syntax is similar to a lot of different languages. We're going to say for i in range 0 through the length of the DataFrame. All that's doing is getting how many rows are in the DataFrame, figuring out what's the maximum number that we need to iterate over. And we're using i as this index variable. So inside the for loop, what we're doing is taking this DataFrame and calling iloc on it. This is a way to index different pieces of a Pandas DataFrame and access different parts of it. We're using our variable from before to say which item we're on in the DataFrame. And then we're also extracting out the title column, and to get the number of characters, all we do is use this len function to get the length of that. We store that in a variable and then append that onto our list that we had made temporarily before. So it's going to iterate over each item in our DataFrame, calculate the length, append that onto the list, and then finally we assign that list to a new column in our DataFrame.
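
The loop described above can be sketched roughly as follows. This is a minimal sketch, not the notebook's actual code: the three-row DataFrame is a made-up stand-in for the 5,000 sampled records, and the column names Title and Length are assumptions.

```python
import time

import pandas as pd

# Hypothetical stand-in for the sampled AG News data (the course samples
# 5,000 rows from the Kaggle CSV); "Title" is an assumed column name.
df = pd.DataFrame(
    {"Title": ["Stocks rally on rate cut", "NBA finals preview", "UK election results"]}
)

start = time.time()  # starting time for the timer

lengths = []  # temporary list holding one character count per row
for i in range(0, len(df)):
    # .iloc[i] selects row i by position; ["Title"] picks the title value
    title_length = len(df.iloc[i]["Title"])
    lengths.append(title_length)

df["Length"] = lengths  # assign the finished list as a new column

elapsed = time.time() - start  # ending time minus starting time
print(f"for loop took {elapsed:.4f} seconds")
print(df.head())
```

On three rows the timing is meaningless; the gap between methods only becomes visible on a few thousand rows or more.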
So let's see how long that takes on 5,000 records and let's see the result. So for me that took about five seconds, and depending again on what your resources are like on your machine, you probably will get a different answer than I do, and that's totally fine. So we're starting with our slowest method here and then we're going to work up to the more efficient methods. So for our results here, we've got this new column on our DataFrame called length. And that has the number of characters that are in the title for each of these records. So I'm just printing out the first five or so records. But this has done this for all 5,000 records in total.

9. Iterrows: Next up we've got a function called iterrows. This is a function that is specific to Pandas DataFrames and it lets you iterate over a DataFrame as (index, Series) pairs. And the (index, Series) pairs are the important part, because that's where we get some of our efficiency gains here. Syntax-wise, this actually looks really similar to the for loop that we wrote before. It actually does start with a for loop, but we use this iterrows function instead of range to automatically get the number of rows that need to be iterated over. We don't have to do that range stuff that we did before. However, this time we do name two variables, one called index and one called row, so that we can iterate over the indices. And then we have our row, and we take each row, we get the title for that row, and then get the length of that. And we also append that to our list, similar to our previous example. Then we assign that list to a new column called length. So if we run this, we can see how long that takes and compare that to the original method. So it looks like previously we had about five seconds here and we got it down to about 1.5 seconds or so. And you can see in our DataFrame here we get the exact same answer. All of that came from just slightly changing the way that we wrote our code.
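
As a rough illustration of the iterrows version just described, here is a minimal sketch under the same assumptions as before (a tiny made-up DataFrame and an assumed Title column), not the notebook's exact code.

```python
import time

import pandas as pd

# Hypothetical sample data; "Title" is an assumed column name.
df = pd.DataFrame(
    {"Title": ["Stocks rally on rate cut", "NBA finals preview", "UK election results"]}
)

start = time.time()

lengths = []
# iterrows() yields (index, Series) pairs, so no range() bookkeeping is needed
for index, row in df.iterrows():
    lengths.append(len(row["Title"]))  # bracket notation works on the row Series

df["Length"] = lengths

elapsed = time.time() - start
print(f"iterrows took {elapsed:.4f} seconds")
```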
We gained a lot of efficiencies here, but spoiler alert, we can actually do better than this.

10. Itertuples: So there is another function in pandas called itertuples, and it's really similar to iterrows, but instead of iterating over (index, Series) pairs, this actually creates each row as a named tuple. And that's where we get the efficiencies here. Syntax-wise it looks similar to the other one. We don't have to define an index since it's a named tuple; we just define our row variable and then we call this df.itertuples to loop over each item in our DataFrame without having to do any of the range stuff that we did in our first example. So this time we take each row, we get the title, and then we get the length of that, append that onto our list, and assign that to a new column. So we run that, and that actually ends up getting a sub-second result, which is quite an improvement from where we started out. The annoying thing about itertuples is that you do have to use this dot accessor, so row.title instead of the brackets notation that we used with iterrows. And the annoying thing about that is that this does not work if you have spaces in the names of your columns. So to use this method, which I would highly recommend that you do over iterrows, you have to rename all of your columns if they have spaces to get rid of those, which can be a little annoying to do. But once you've written that once, you have it for the rest of your life. So in the case where you're trying to decide whether to use iterrows or itertuples, I'd always recommend using itertuples to iterate over a DataFrame. It's one of the fastest ways that you can do that.

11. List Comprehension: The last three methods that I'm going to talk about are all pretty similar in speed. And depending on the situation that you're in, you might be able to use one or you might have to pick another. Next up we've got list comprehension.
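
Before the list comprehension walkthrough, here is a minimal sketch of the itertuples loop from lesson 10, including the column rename it requires when names contain spaces. The sample data, the column names (Class Index, Title, Length), and the underscore renaming scheme are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample data; "Class Index" contains a space on purpose.
df = pd.DataFrame(
    {
        "Class Index": [2, 3, 1],
        "Title": ["Stocks rally on rate cut", "NBA finals preview", "UK election results"],
    }
)

# Dot access needs valid Python identifiers, so replace spaces in column names.
df = df.rename(columns=lambda name: name.replace(" ", "_"))

lengths = []
# itertuples() yields one named tuple per row; fields are read with dot access
for row in df.itertuples():
    lengths.append(len(row.Title))

df["Length"] = lengths
print(df)
```

The rename step is the reusable part: once written, the same one-liner can be applied to any DataFrame before looping with itertuples.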
List comprehension just offers a shorter syntax for creating a new list based on values from an existing list. And this is specific to Python, I believe. So you may not have seen this in other languages before, and the syntax can be a little bit confusing. I know when I first started learning list comprehensions, they kind of blew my mind. They were really hard for me to understand. But once you've seen them a few times, you get really comfortable with them and then start seeing where you can use them a lot in your code. What I like to do with a list comprehension is start with the for loop, read that, and then go back to what's happening at the beginning of the list comprehension. So here we're saying for x in our DataFrame column title, get the length of each x. So we're using this as a list, and basically we're saying that our original column was a list and then we're going to get the lengths of those and save that off as a new list. We do that all in one line, so it's a bit simpler and more compact than the other methods that we've talked about so far. And this actually makes a list for us, so we don't have to append to the list like we did in other scenarios. We take our list and then we assign that to a new column, and we can run that and see how long that takes. And again, this is slightly faster than the itertuples approach that we did before. But sometimes there is a lot of complicated stuff that you might need to do to text or whatever data you're working with, and it doesn't quite fit perfectly in a list comprehension. So there are times when you can use this and other times where it might make sense to use the next method, which is apply.

12. Apply: Apply is specific to Pandas DataFrames. And what it does is it allows you to apply a function along an axis of a DataFrame. And what that means is I can apply a function to all of my columns or all of my rows.
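
To make the comparison concrete, the list comprehension from lesson 11 and a per-row apply call can be sketched side by side. This is a minimal sketch on made-up data; the column names and the helper count_characters are hypothetical.

```python
import pandas as pd

# Hypothetical sample data; "Title" is an assumed column name.
df = pd.DataFrame(
    {"Title": ["Stocks rally on rate cut", "NBA finals preview", "UK election results"]}
)

# List comprehension: read the "for x in ..." part first, then the len(x) part.
# It builds the whole list in one expression, so no append is needed.
df["Length_listcomp"] = [len(x) for x in df["Title"]]

# apply with a lambda: the function is called once per value in the column.
df["Length_apply"] = df["Title"].apply(lambda x: len(x))


def count_characters(title):
    """Hypothetical custom function; anything more complex could go here."""
    return len(title)


# apply with a named function instead of an inline lambda.
df["Length_custom"] = df["Title"].apply(count_characters)

print(df)
```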
And the way that I use it most times is I'm going to apply a function to all of the rows in my DataFrame. Oftentimes you use apply with lambda statements to create inline functions. So the way this looks in Python is we figure out which column we want to manipulate. In this case, it's the title of our DataFrame. We do .apply and then we do this lambda x with a colon and then give it what function we want to run. So in this case I'm just getting the length of my x value, so it's going to get the length of each title in our DataFrame rows. But the really nice thing about apply statements is that you can write your own custom function, which we'll actually see later on in a couple of examples, and instead of putting len here, you just call your function. So it's really nice that you can do some more complex things in your function, and then just call the function here and apply that to each row in your DataFrame. So if we run that, that one too is super, super fast. We've got, again, sub-second speeds here on 5,000 records. And apply is one of my personal favorite ways to apply functions to different rows in my DataFrame. This is the one I tend to pick a lot of times out of all the methods that we're talking about.

13. Vectorization with Pandas Series: One final method for looping over a DataFrame is vectorization with pandas Series. This is usually the fastest approach that you can get for looping over a DataFrame. The tricky thing is that sometimes you can't get your data in the right format to work with this method. We're lucky that in this case, I have given us an example where this does work, so we can see how that works, but sometimes you just can't get your data in the right format for this to work. And so vectorization actually allows you to apply different operations on entire arrays instead of on each individual element. So for every other example that we've gone through so far, it's iterating through each row in our DataFrame.
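
A minimal sketch of what the vectorized version might look like, assuming a Title column of plain strings; the data and column names are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; "Title" is an assumed column name.
df = pd.DataFrame(
    {"Title": ["Stocks rally on rate cut", "NBA finals preview", "UK election results"]}
)

# The .str accessor exposes string methods that operate on the whole column
# at once, so there is no explicit Python-level loop over rows.
df["Length"] = df["Title"].str.len()

print(df)
```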
Vectorization is taking the entire column and then applying a transformation to the column. So this way we don't have to iterate over each item in the DataFrame. How that looks in practice here is we have our column that we want to manipulate. I'm getting the string version of that column and then getting the length of that. And when we run that, that's also pretty fast. And I do see here that it is showing up as a little bit slower than the apply method. But if you run this a few different times, you'll get different results. So these are all sub-second, so we're in pretty good shape.

14. Code Example 1 - Slow: So now let's take all of this that we've learned here and go through a couple of examples where we have a slower version of code, and then we'll do some refactoring to make it a little bit faster. In this first example, what we're going to do is clean the title of our news articles by lowercasing the text, replacing certain tokens from a dictionary, and then removing numbers and punctuation from the text. These are pretty standard preprocessing steps that you would normally do if you're dealing with text data. And I do actually have a course on natural language processing with Python that covers several of these, as well as different methods for working with natural language processing. So if these are new to you, I would definitely recommend checking out that course and learning more about those steps there. So in our data folder, one other thing that I have included is a dictionary that I have made up. It's just a CSV file. So we have two columns, from and to. The from column has a bunch of abbreviations, like UK, NBA, and US, and then the to column has the fully expanded version of that abbreviation. Sometimes there are cases when you want to expand out the abbreviations that you might have in your text. And this is one way to do it, by creating a dictionary yourself. What we're going to do here is exactly that.
We've got some abbreviations in the titles of our news articles that we want to expand out to the full version of them. So for this first example, we're going to start by importing a few libraries that we'll be using. Most of them are just in base Python, and then we'll be using pandas for our DataFrames like we've been using in these other examples. I'm also going to filter out a few warnings that we get so they don't clutter up our output; that's what this filterwarnings call is doing. And then we have two functions here to preprocess our text. The first one is a dictionary replacement function, and we are just taking in the title of the news article here. So in this function, we load in our dictionary and save it off as a DataFrame, and we create a temporary list to hold all of our expanded or clean tokens. And then what we're doing is taking the title of our news article and splitting that on whitespace. So that gets us a list of tokens that we're going to iterate over. So what we're doing here is looping through each of our tokens and checking if it's in the dictionary that we loaded in. If it is, then we're going to append the expanded version to our new list. Otherwise, we just append the original token. From there, what we do is join all of those back together using this join function. And we just separate those with whitespace so that this is a single string instead of a list of tokens. So all of that is to do some dictionary replacement. And then we have the second part of our preprocessing here. We take in an entire DataFrame in this function and what we do is actually do some vectorization, which was one of the fastest methods that we talked about. We are first lowercasing our text with this string lower on the entire column. And then we are replacing digits with a space. And then we have a bunch of different punctuation that we've listed out here that we're going to replace with spaces as well.
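
The two functions described above might look roughly like the following. This is a reconstruction from the spoken description, not the course's actual code: the function names, the Title and Title_Clean columns, the short punctuation list, and the temporary dictionary file are all assumptions made so the sketch can run on its own.

```python
import os
import tempfile

import pandas as pd

# Hypothetical abbreviation dictionary with "from" and "to" columns, written
# to a temporary CSV so the example is self-contained.
dict_path = os.path.join(tempfile.mkdtemp(), "dictionary.csv")
pd.DataFrame(
    {
        "from": ["uk", "nba", "us"],
        "to": ["united kingdom", "national basketball association", "united states"],
    }
).to_csv(dict_path, index=False)


def dictionary_replacement(title):
    # Slow on purpose: the CSV is read from disk on every single call.
    dictionary = pd.read_csv(dict_path)
    clean_tokens = []  # temporary list of expanded or untouched tokens
    for token in title.split():
        match = dictionary[dictionary["from"] == token]
        if len(match) > 0:
            clean_tokens.append(match["to"].iloc[0])  # expanded version
        else:
            clean_tokens.append(token)  # keep the original token
    return " ".join(clean_tokens)


def preprocess_title(df):
    # Vectorized steps that run on the entire column at once
    df["Title_Clean"] = df["Title"].str.lower()
    df["Title_Clean"] = df["Title_Clean"].str.replace(r"\d", " ", regex=True)
    for mark in [".", ",", "!", "?", ":", ";", "'", "-"]:  # assumed subset
        df["Title_Clean"] = df["Title_Clean"].str.replace(mark, " ", regex=False)
    # Row-by-row dictionary lookup, which is where the time goes
    df["Title_Clean"] = df["Title_Clean"].apply(lambda x: dictionary_replacement(x))
    return df


df = pd.DataFrame({"Title": ["UK stocks rise 5 points!", "NBA: finals preview"]})
df = preprocess_title(df)
print(df["Title_Clean"].tolist())
```

The cost that matters here is the file read inside dictionary_replacement: with 5,000 titles, the same small CSV is parsed 5,000 times.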
So after we do all that, then we're actually using an apply with a lambda to call that dictionary replacement function and apply that to each row of the DataFrame. So we're using a mix here of some slower methods as well as some fast methods. And we'll see here how long this will take to run. I'm going to load in those functions that I just talked through. And then here, the way we actually call that preprocess title function is we pass in our entire DataFrame. And then remember that it also applies the dictionary replacement after we do the cleaning up of our tokens. So if we run that, we'll see how long that takes. Note that on my machine I am running some recording software, so this is probably why it's taking a little bit longer, but this took about two minutes for me to run on about 5,000 records. Now if you only have 5,000 data points, maybe this is good enough for you. Again, you can think back to those different steps for optimization. We got our code to run, we got the right results, and then we start thinking about, okay, does it make sense for this to run this long in my scenario? Am I okay with that? So if I'm just doing this analysis on the side, probably that's fine for me. But the problems usually come in when you start dealing with really large datasets. Again, this dataset has in total over a million records. So if we were going to do this on the entire dataset, it might take too long for us to be patient and wait around for it. So we can see the results here: we have this new column called title clean. And you can see here that we've lowercased it. We have some abbreviations that got expanded out too. We also got rid of numbers and then any of this punctuation.

15. Code Example 2 - Refactored: So moving on to our refactored version of this, we've got our same two functions, but we have written them slightly differently.
Instead of loading the dictionary every time the function runs, which is what we were doing before, we're going to load it in once and then pass it into our function. We'll also take in the title of the news article like we did before. Here we've condensed our original function down into one line, using the functionality of dictionaries as well as a list comprehension. You can see down here in our code below that when we load in our dictionary, we do some formatting to get it into a true dictionary format, whereas before it was actually a DataFrame. Once we do that, we can use the .get functionality to do the lookup and see whether the word exists in the dictionary. If it does, return the fully expanded version; otherwise, return the original word. So we changed that whole if-else statement into just a single line here. The rest of this is just a list comprehension: for each word in title.split(), perform this lookup. title.split() just gets a list of the tokens in the title based on whitespace, and then we loop through each of those and perform the dictionary lookup. Finally, after we get our entire list back, we join it back together as a single string with the join function.

Next up we've got our preprocessing function. This time, instead of taking an entire DataFrame and performing operations on the entire DataFrame, we're going to do these on the title itself and use apply later on. Here we've got the same few things that we were doing before. We're doing our lowercasing with title.lower(). We're replacing digits, this time with the re library, which is a little bit faster than the approach we were using before. And then we're also removing all punctuation. This might look a little bit complicated, but what we're doing here is using the translate function, which is a function you can call on strings.
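As a rough sketch of the one-line dictionary replacement described above: the abbreviation table and its column names are assumed for illustration, and the dict(zip(...)) conversion is just one plausible way to get from a DataFrame to a true dictionary, not necessarily the one used in the notebook.

```python
import pandas as pd

# Hypothetical abbreviation table; in the lesson this is loaded from a file.
abbrev_df = pd.DataFrame(
    {"abbrev": ["govt", "intl"], "expanded": ["government", "international"]}
)

# Convert the DataFrame to a plain dictionary once, outside the function.
abbrev_dict = dict(zip(abbrev_df["abbrev"], abbrev_df["expanded"]))


def dictionary_replacement(abbrev_dict, title):
    # dict.get(word, word) returns the expansion if the word is a key,
    # otherwise it falls back to the word itself.
    return " ".join([abbrev_dict.get(word, word) for word in title.split()])


result = dictionary_replacement(abbrev_dict, "govt signs intl deal")
print(result)
```

The second argument to .get is what replaces the whole if-else branch from the slow version.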
Then we use str.maketrans to define the transformation. Here you'll notice we've got string.punctuation. The string library is really nice because it already has a list of all the different punctuation characters, so we don't have to list them all out like we did previously. So ultimately this just replaces all punctuation with whitespace. And to be honest, I didn't write this myself. I went to my old friend Google and found a post on Stack Overflow that was discussing different ways to remove punctuation and which way was the fastest. I found this and just copied it over into this notebook, and it works great for our situation. So that's just a quick reminder that you don't have to know how to optimize everything. It's totally normal not to. Feel free to use resources like Google and Stack Overflow to find the most efficient way to do something.

Now let's move on to actually applying these functions. You'll see here that we are loading our dictionary just once. Originally, it seemed like it made sense to include that in the function since it was related to what we were doing, but the dictionary was being opened every single time the function was called. We don't need to do that. We can just load it in once, get it into this dictionary format, and then pass that into our function. To actually apply these functions, we use apply statements with a lambda to call the function inline. Here we take the DataFrame's title column, use apply with lambda x, and call our preprocessing function. This is the more advanced version of what I was talking about earlier, where we were just using lambdas to get the length of the title. This time we're calling our own custom function, which is really nice, and it's usually the way that I like to do things. After we get our text preprocessed, the next thing we can do is call our dictionary replacement and replace all those instances.
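A minimal sketch of the per-title preprocessing and the two apply calls described above. The sample data, the one-entry abbrev_dict, and the title_clean_v2 column name are assumptions; str.maketrans, str.translate, string.punctuation, and re.sub are the standard-library pieces named in the lesson.

```python
import re
import string

import pandas as pd

# Translation table mapping every punctuation character to a space.
# Built once so it is not recreated for every title.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess_title(title):
    title = title.lower()               # lowercase
    title = re.sub(r"\d", " ", title)   # replace digits with a space
    return title.translate(PUNCT_TO_SPACE)  # replace punctuation with spaces


def dictionary_replacement(abbrev_dict, title):
    return " ".join([abbrev_dict.get(word, word) for word in title.split()])


abbrev_dict = {"govt": "government"}  # hypothetical, loaded once up front
news = pd.DataFrame({"title": ["Govt: 2 bills pass!"]})

# apply with a lambda calls our own functions on each title.
news["title_clean_v2"] = news["title"].apply(lambda x: preprocess_title(x))
news["title_clean_v2"] = news["title_clean_v2"].apply(
    lambda x: dictionary_replacement(abbrev_dict, x)
)
print(news["title_clean_v2"].iloc[0])
```

Because the dictionary is passed in rather than loaded inside the function, each call costs only a few string operations and dictionary lookups.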
If we run this, it actually ends up taking less than a second. So we went from about two minutes previously to less than a second here, which is crazy, just by changing the way that we wrote our original code. You'll see here that I have saved this off as a new column called title clean version two, and it gives us exactly the same result that we got before.

16. Code Example 2 - Slow: So moving on to our next example. Let's say in this example we want to count the number of tokens that are in the title of the news article. But to be considered a token, a few conditions have to be true. We're going to lemmatize all of our tokens to get them to their root form, and those lemmatized tokens have to, one, be longer than one character; two, not be a stop word; and three, be in the Natural Language Toolkit vocabulary. If all of that is new terminology to you, I would recommend you check out my other course, Natural Language Processing in Python, where we dive deep into what it means to lemmatize, what stop words are, what a vocabulary is, and all this terminology.

To actually implement these rules in Python, one way we might do this is by importing a few libraries that we need, mainly NLTK, which is the Natural Language Toolkit, and pandas. NLTK has a list of stop words and also has its own lemmatizer that we'll be using. If you're running this for the first time, you might need to download these. I have already downloaded them, so it should just skip over that for me when I run it. We've got this function here called count tokens in NLTK vocab, and it takes in a string. Inside the function we are calling the WordNet lemmatizer from NLTK, and then we're setting up a counter. We're starting out with just a simple for loop here, and we are again splitting our string on whitespace, similar to our other examples. That gets us a list of tokens, and we're going to iterate over each of those tokens.
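Since the notebook isn't visible in this transcript, here is a hedged sketch of the shape of the slow function being walked through. To keep it runnable without NLTK downloads, the lemmatizer, stop word list, and vocabulary are tiny hypothetical stand-ins (ToyLemmatizer, load_stop_words, load_vocab); in the lesson these come from NLTK's WordNet lemmatizer, stop word corpus, and word corpus.

```python
def load_stop_words():
    # Stand-in for NLTK's English stop word list (assumed for illustration).
    return ["the", "a", "of", "in", "to"]


def load_vocab():
    # Stand-in for NLTK's vocabulary, which is a very large list of words.
    return ["market", "raise", "bank", "rate"]


class ToyLemmatizer:
    # Stand-in for NLTK's WordNet lemmatizer: it only strips a trailing "s".
    def lemmatize(self, token):
        return token[:-1] if token.endswith("s") and len(token) > 3 else token


def count_tokens_in_nltk_vocab(text):
    # Everything below is recreated on each call, and the lists are
    # rebuilt for each token, which is what makes this version slow.
    lemmatizer = ToyLemmatizer()
    counter = 0
    for token in text.split():
        lemma = lemmatizer.lemmatize(token)
        if (
            len(lemma) > 1
            and lemma not in load_stop_words()
            and lemma in load_vocab()
        ):
            counter += 1
    return counter


count = count_tokens_in_nltk_vocab("the banks raise rates in a xyz market")
print(count)
```

With the real NLTK resources, the repeated loading and the membership checks against a very large list are where the time goes.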
So for each token in our list, we're going to lemmatize the token. Then, if the length of the token is greater than one, it's not in our stop words, and it is in our vocabulary, we increase the counter. This will return the number of tokens that meet these criteria. Let's load these up. You can see here that I already have these NLTK items downloaded, but it might take a second for you to download them. Next up, I'm going to delete the DataFrame that we were working with earlier and instead just sample 50 records because, spoiler alert, this method takes quite a while, and I wouldn't want us to sit here waiting on 5,000 records. So we'll save ourselves some time here, delete that, and reload a DataFrame with just 50. To actually apply that function, we're first going to apply the preprocessing from our first example, as well as our dictionary replacement. Then we're going to use one of the faster methods to apply this function to our title clean column. So let's see how long that takes to run. All right, so for me this took close to three minutes, which is probably not acceptable considering this is only 50 records. So let's see what the results were. Just looking at the first five records in our DataFrame, you can see that we've got the number of tokens that meet our criteria.

17. Code Example 2 - Refactored: Now on to the refactored version of this. We're actually going to get rid of the function that we wrote before and instead do this in a slightly different way. We're going to keep the first two things that we did in our prior example, preprocessing and doing our dictionary replacement. What we've also done here is take that WordNet lemmatizer and the stop words and vocabulary out of the function. We're just creating those once so they don't get recreated every single time the function is called. Now here's where the magic happens.
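The statement on screen isn't captured in this transcript, so the following is only a sketch of the pattern: an apply with a lambda wrapping a list comprehension, with the lemmatizer, stop words, and vocabulary created once up front. ToyLemmatizer, STOP_WORDS, VOCAB, the sample titles, and the token_count column name are hypothetical stand-ins for the NLTK objects and data used in the lesson. Storing the stop words and vocabulary as sets is an addition of this sketch, not something the lesson states.

```python
import pandas as pd


class ToyLemmatizer:
    # Stand-in for NLTK's WordNet lemmatizer: it only strips a trailing "s".
    def lemmatize(self, token):
        return token[:-1] if token.endswith("s") and len(token) > 3 else token


# Created once, outside of any per-row work.
# Sets are an assumption of this sketch; they make membership checks fast.
lemmatizer = ToyLemmatizer()
STOP_WORDS = {"the", "a", "of", "in", "to"}
VOCAB = {"market", "raise", "bank", "rate"}

news = pd.DataFrame(
    {"title_clean": ["the banks raise rates in a xyz market", "bank of xyz"]}
)

# Read the comprehension from the "for" rightwards, then go back to the left:
# split x into tokens, keep the ones whose lemma passes all three checks,
# and take the length of the resulting list.
news["token_count"] = news["title_clean"].apply(
    lambda x: len(
        [
            token
            for token in x.split()
            if len(lemma := lemmatizer.lemmatize(token)) > 1
            and lemma not in STOP_WORDS
            and lemma in VOCAB
        ]
    )
)
print(news["token_count"].tolist())
```

The walrus assignment (Python 3.8+) just avoids lemmatizing the same token three times; calling lemmatize in each condition would give the same counts.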
We have an apply statement here with a lambda containing a list comprehension. Let's break this down, because it might seem a little bit complicated at first. What I usually like to do when I look at list comprehensions is start where the for loop is happening, read everything to the right of that, and then go back to what's happening on the left-hand side. Here we're applying this operation across our entire DataFrame on our title clean column, so in this case each title is our x. For each item that we're iterating over, we split it on whitespace, which gives us a list of tokens similar to what we've done in previous examples. Then we check some different conditions. We make sure the lemmatized token is longer than one character, it's not in our stop words, and it is in our NLTK vocab. If all of those criteria are met, we return just that token. Remember, this is all wrapped in a list comprehension, so we're going to have a list of tokens that meet the criteria, and then we can just get the length of that list. That should give us the same result as what we had on the previous slide. So when we run that, we again go from close to three minutes down to less than a second, which is awesome to see. In our code here, you'll see we get the exact same answer as we did for the previous version, and we've saved so much time doing it.

So that wraps up this course on tips for writing efficient Python. Hopefully you learned a few techniques for how to loop through Pandas DataFrames, and also some preprocessing along the way. This really just scratches the surface of what you can do to optimize your code in Python. Also, when you start dealing with bigger data, there are other tools and libraries, such as Hadoop and Vaex, that can help you deal with your big data. Thanks for following along in this lesson, and I look forward to seeing you in the next one.
