Machine Learning: K-Means Clustering | Lazy Programmer Inc | Skillshare


Machine Learning: K-Means Clustering

Teacher: Lazy Programmer Inc


Lessons in This Class

11 Lessons (1h 32m)
    • 1. Cluster Analysis Introduction (5:03)
    • 2. An Easy Introduction to K-Means Clustering (7:06)
    • 3. K-Means Exercise Prompt 1 (9:13)
    • 4. K-Means Exercise Solution 1 (11:09)
    • 5. K-Means Exercise Prompt 2 (5:04)
    • 6. K-Means Exercise Solution 2 (7:08)
    • 7. K-Means Exercise Prompt 3 (6:55)
    • 8. K-Means Exercise Solution 3 (16:22)
    • 9. K-Means Objective (Theory) (13:01)
    • 10. K-Means Objective (Code) (5:13)
    • 11. Where to get discount coupons and FREE machine learning material (5:31)

About This Class

Cluster analysis is a staple of unsupervised machine learning and data science.

It is very useful for data mining and big data because it automatically finds patterns in the data, without the need for labels, unlike supervised machine learning.

In a real-world environment, you can imagine that a robot or an artificial intelligence won’t always have access to the optimal answer, or maybe there isn’t an optimal correct answer. You’d want that robot to be able to explore the world on its own, and learn things just by looking for patterns.

Do you ever wonder how we get the data that we use in our supervised machine learning algorithms?

We always seem to have a nice CSV or a table, complete with Xs and corresponding Ys.

If you haven’t been involved in acquiring data yourself, you might not have thought about this, but someone has to make this data!

Those “Y”s have to come from somewhere, and a lot of the time that involves manual labor.

Sometimes, you don’t have access to this kind of information or it is infeasible or costly to acquire.

But you still want to have some idea of the structure of the data. If you're doing data analytics, automating pattern recognition in your data would be invaluable.

This is where unsupervised machine learning comes into play.

In this course, we are first going to talk about clustering. This is where, instead of training on labels, we try to create our own labels! We’ll do this by grouping together data that looks alike.

In this course, you will learn how to build the k-means clustering algorithm.

Prerequisites:

  • Basic math
  • Basic Python coding


Transcripts

1. Cluster Analysis Introduction: Hey everyone and welcome to cluster analysis, unsupervised machine learning in Python. Guys, this is one of my most exciting courses yet. I'm sure I don't have to tell you this, but machine learning is one of the hottest subjects to learn right now. If you want to advance your career and do something interesting at your job that almost no one else can do, you might want to consider learning machine learning. But machine learning is kind of a hard subject. There's lots of college-level math and, on top of that, pretty significant programming work. So how am I going to make this process easy for you? Well, I'm going to show you how to do everything from scratch, from basic first principles. After this course, you'll be able to impress your boss or your supervisor or your coworkers with your newfound machine learning skills. So why unsupervised learning and cluster analysis in particular? I've been making machine learning courses for some time now, but they were all focused on supervised learning. Supervised learning requires a labeled dataset, which means someone has gone in and put labels onto all of your data. So why might that be undesirable? Well, whoever labeled that data has to get paid. So either you spend money to get your labeled dataset, or you have to do that work yourself, which is time-consuming. It's also been shown that people who go in and label these datasets can make mistakes. It's not their fault; labeling data is boring and repetitive work. But we should keep in mind that just because we have a labeled dataset, it doesn't mean those labels are correct. So what is unsupervised machine learning all about, and cluster analysis specifically? Well, unsupervised learning requires no labeled data at all. This is very nice, because on today's modern technology platforms we are collecting more data than we know what to do with, and this data doesn't come with labels. For example, there are tons of images on the Internet, and it's not as if underneath each of these images someone writes, "this is a cat." Instead, it's probably just going to be the image by itself. And let's not forget about the greatest encyclopedia ever built: Wikipedia. This is essentially a collection of nearly all human knowledge. What is the format of this knowledge? Well, it's text, and of course none of that text is labeled. It's just paragraphs and paragraphs of information. And yet, do we not want to build a machine that learns from this great repository of human knowledge? Well, we can't do that if we only know algorithms that require labels. And how about conversation platforms and social media such as Twitter? These are definitely not labeled. When you make a tweet, you just make a tweet. You don't say "this is a tweet about tech" or "this is a tweet about politics." You just make a tweet and that's it. So what's clustering all about? Clustering is one of the major categories of unsupervised learning, with tons of applications in business and in the sciences, such as biology, genetics, and so forth. It allows us to group data together, even when the data doesn't come with specific labels to begin with. So for example, I can look at some attributes of my customers and say, these customers are likely to buy one model of phone, and those customers are likely to buy another. Or perhaps I'm a fashion designer or a clothing manufacturer. I can look at data about human body measurements and cluster them.
So I can answer a question like, how many different sizes of T-shirts should we have? What should the dimensions of the XL t-shirts be so that they fit a majority of our customers in the XL cluster. Or perhaps I'm a geneticist and I'm looking at DNA sequence is, I want to be able to see, it looks like these viruses are all related genetically, but they are unrelated to those viruses. Okay, So I hope it's clear that whether you are a tech worker, a biologist, or a financial analyst, you can make use of cluster analysis. There are higher level reasons why we don't always want to do supervised learning. Let's consider the fact that in order for you and me to learn, we don't need someone to label everything for us as correct or incorrect. We just learned based on experience and based on our own inherent pattern-recognition abilities. We understand analogies and we understand the relationships between concepts. So it should be clear that if what we want to do with machine learning is to create an actual intelligence, something that we would regard as intelligence. Perhaps supervised learning is not the way, instead is possible that unsupervised learning might hold the key. Thanks for listening, and I'll see you in the next lecture. 2. An Easy Introduction to K-Means Clustering: Hey everyone and welcome back to this class, cluster analysis, unsupervised machine learning in Python. In this lecture, I'm going to introduce you to the intuition behind K-means clustering and how it works. So firstly, we know that since this is a machine-learning algorithm, we're going to be working on data. So let's try to visualize some data we might get. The first thing you'll notice about this data is that all these points are the same color. This is because we're doing unsupervised learning. So there are no classes given to these points. Each point is just the vector, and that's all we know about each point. We don't know if it should be red or blue or otherwise. But there's one important feature of this set of data. What do you notice about it? Well, our human pattern-recognition abilities allow us to immediately see that there appear to be three groups of data here. In other words, we don't need the dataset to tell us that these three groups are distinct. Our own pattern-recognition abilities allow us to see this very clearly. That's what we mean by unsupervised learning. It's unsupervised because nobody needs to tell us the answer. This is a key concept behind artificial general intelligence as well. If we're going to build an AI that can navigate the real-world. Practically speaking, it probably has to have some general intuition and pattern recognition ability since it would be infeasible to provide it with training data for every possible situation it could encounter. So that's a very important point. We're using our own pattern recognition abilities to tell these clusters apart. It's not in the data itself. But now you may have noticed that what we just looked at was a very unique and very specific situation. The first limitation of this data was that it was two-dimensional. This is of course necessary because if we had, say, a 100-dimensional dataset, you wouldn't be able to see it. The universe itself only has three dimensions of space. So that's all you can see. Sometimes I get students that are not convinced of his fat. So if you're convinced that you can see beyond three-dimensions, try drawing a 10 dimensional cube and see what happens. Now. Why do I mention this? 
Well, most real-world datasets are not two-dimensional. Therefore, you can't see most real-world data. And hence, your own pattern recognition skills are not useful in this scenario. It would be nice if we had some algorithm to find clusters are groups of data that could work no matter what the dimensionality of the data is. Of course, That's what this entire course is about. So you can be confident, you know the answer to this question by the end of this course. Here's another issue. In the original data I showed you. I generated it so that it would show three distinct clusters very clearly. But what if our data looks like this? Or what if our data looks like this? Now you can see the clusters are not so clear cut. In general, we'd like to know, is the cluster I found good or not? And in this course will answer this question as well. So as a first step into understanding k-means clustering, Let's consider two sort of fundamental truths about the clusters in the dataset. Suppose I am told there is yellow, purple, and green points are the centers of some clusters. And I'd like to know which Center does this new blue point belong to? And I think it makes intuitive sense that this point of course, belongs to the center that is closest to. So to decide which cluster a data point belongs to, I pick the nearest cluster center. Pretty intuitive. Let's now consider the second fundamental fact about clusters. Suppose I'm told that these datapoints you see here all belong to the same cluster. Let's call them x1 to x. We'd like to know what is the centre of this cluster. Of course, that's just the mean of all these data points. This is also called the centroid, if you're thinking geometrically. And as you know, the way to find the mean of a set of vectors is to add them up and divide by the number of vectors. That's how you find the cluster center. The big question, we have these two fundamental facts about clusters. Is there any way that we can combine these two ideas to give us a clustering algorithm. Now believe it or not, these two fundamental facts are all you need to implement K-Means clustering. It turns out if I initialize the cluster centers randomly, then if I just repeat these two steps over and over, I'll converge to an answer. So just to recap, we start out by picking some random points as our cluster centers. Then I can use the fact that each point in my dataset just belongs to the cluster whose center is closest to. So I assign all my data points to their nearest cluster center. Then, now that I know each data point belongs to a cluster, I can calculate new cluster centers based on all the points that belong to it. Then I go back to the first step, which is assigning data points to the new cluster centers. Again. Later in this section, we'll see visual demonstrations of how the cluster assignments and cluster centers evolve over each iteration. Let's recap this lecture. Since while it was quite sure there were some very important ideas introduced. First, we talked about the fact that sometimes it's really easy for us to see clusters. We don't need the dataset to tell us the label for each data point. We can easily see which data point belongs to which cluster. That means we've learned to understand the data in an unsupervised way. Second, it's important to understand the ways in which that example was limited. Sometimes the boundaries between the data might not be so clearly-defined. 
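To make these two facts concrete, here is a minimal NumPy sketch. The specific centers and points below are made-up illustrative numbers, not values from the lectures:

```python
import numpy as np

# Fact 1: a data point belongs to the cluster whose center is closest to it.
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])    # hypothetical cluster centers
x_new = np.array([4.2, 4.8])                                 # a new, unlabeled point
closest = np.argmin(np.sum((centers - x_new) ** 2, axis=1))  # index of the nearest center (1 here)

# Fact 2: a cluster's center (its centroid) is the mean of the points assigned to it.
cluster_points = np.array([[4.0, 5.0], [5.0, 4.0], [6.0, 6.0]])
center = cluster_points.mean(axis=0)                         # add them up, divide by the count
```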
And more often than not, we won't be able to look at our data, because our data won't be two-dimensional. So what we would like to have is an automatic algorithm that can find clusters for us given any arbitrary dataset. Finally, we looked at two fundamental facts about clusters: the fact that each point should be assigned to the cluster whose center it is closest to, and the fact that each cluster center should just be the mean of the points that belong to it. We saw that when we combine these two facts together, we get the k-means clustering algorithm. 3. K-Means Exercise Prompt 1: In this lecture, we're going to jump right in to exercising your coding skills. This course is designed to be a hands-on course. So for every topic you learn about, you will implement it in code. As my saying goes, if you can't implement it, then you don't understand it. Actually, I recently discovered that the famous physicist Richard Feynman said a very similar thing. He said, "What I cannot create, I do not understand." So if you don't agree with me, then you also disagree with one of the most famous physicists of all time. Okay, so how can you practice what you have just learned? Well, in this lecture, your exercise will be to implement one part of k-means clustering. So here's the exercise. First, you're going to generate a random synthetic dataset called X. As you recall, for machine learning, X should be two-dimensional: N by D, where N is the number of samples and D is the number of features. You're also going to create a one-dimensional array of cluster identities called Y. Y should also be of size N, the number of samples. This is because for each of the samples, you're going to give it a cluster identity. Remember that these must be integers from 0 up to K minus one inclusive, indicating that you have K clusters in total. So for example, the first X might belong to cluster 0, the second X might belong to cluster 1, and so on. Now, how you create this data is totally up to you. You can make the data randomly, or you can try to generate data which is appropriate for clustering. Personally, I would opt for the latter, since the result will make more sense and it will be more visually pleasing in the end. Let's say for our example that N equals 300, D equals 2, and K equals 3. This means your cluster identities will take on the values 0, 1, and 2. I would suggest making it so that your data are evenly split amongst the three clusters. So you'll have 100 data points belonging to cluster 0, 100 data points belonging to cluster 1, and 100 data points belonging to cluster 2. Okay, so once you've generated the data, what do you do with it? So to recap, what we have so far is X, which is of size N by D, and a Y, which is of size N. The next step, which is the centerpiece of this exercise, is to calculate the mean of each cluster. So what does that entail? Well, let's suppose I have three clusters, so K equals three. Then what you have to do is loop through each of the cluster values, that's 0, 1, and 2. For each of these clusters, find every data point that belongs to that cluster. So for the sake of this example, suppose you choose cluster 0. So you want to find all the data points in X which belong to cluster 0. As you recall, that information is stored in Y. Then you want to find the mean of all these data points. As an exercise, think about the size of the result. I'll give you a minute to think about it, or you can pause this video until you have the answer.
Okay, so what is the size of the mean vector of all the data points that belong to a specific cluster? Well, remember that all of our data points are feature vectors that live in a D-dimensional space. In our example, D equals 2. Simply put, the mean of a bunch of D-dimensional vectors is still D-dimensional. In other words, this simply means that you have a bunch of data points and you want to find the data point which is at the centroid, or the center of mass. And of course, doing this is just a matter of adding up all the vectors and dividing by the total number of vectors. Okay, so now let's suppose that for each of the K clusters, you calculate the D-dimensional mean vector. This means that you have K vectors, each of size D. Of course, you can store this in a two-dimensional array of size K by D. So that should be the result of this exercise: a K by D matrix that contains the D-dimensional mean vectors for each of the K clusters is our desired output. Let's think of this problem in a different way, just in case you didn't understand it the first time. Okay, so suppose we have N data points, x1, x2, all the way up to xN. Each of these x's is a D-dimensional vector. Furthermore, we have the N corresponding cluster identities, y1, y2, all the way up to yN. Each of these y's is an integer which can be 0 up to K exclusive. Remember that if we combine all these x's into a single matrix, a big X, that's the N by D matrix I was referring to earlier. We can do the same thing to the y's to get a big vector Y of length N. Let's make this example a little more concrete. Suppose that Y takes on the values you see here. So y1 equals 0, y2 equals 1, y3 equals 2, y4 equals 1, y5 equals 2, y6 equals 0, y7 equals 0, y8 equals 1, and y9 equals 2. It should be clear that N is equal to nine. Now, your job is to find the mean of all the x's that belong to each of the clusters. So for cluster 0, you have to find the mean of x1, x6, and x7. That's because y1, y6, and y7 are 0. For cluster 1, you have to find the mean of x2, x4, and x8. That's because y2, y4, and y8 are equal to one. For cluster 2, you have to find the mean of x3, x5, and x9. That's because y3, y5, and y9 are equal to two. Let's call these means m1, m2, and m3. Logically, these should all be D-length vectors, since they're just the averages of D-length vectors. Finally, you should combine these three mean vectors into a single matrix of size K by D, where in this example K is equal to 3. And remember that the data we will use for this exercise will be randomly generated by you. As mentioned previously, you can make it completely random, or you can make it look like data which is appropriate for clustering. Now you might ask, why can't we use real data for this exercise? This is an important question for beginners. Remember that all data is the same. Therefore, this exercise would be the same no matter what kind of data we use. Generating the data yourself ensures that you understand the shape and nature of the data, rather than simply loading in some CSV. Therefore, this requires a greater understanding, which is good. Furthermore, since we have many real-world practical datasets used elsewhere in the course, now is not the time for that. Now I want to be clear that we're not doing any machine learning just yet. This is just a very simple programming exercise to get you warmed up. All we're doing is a bit of geometry, and I hope you'll agree with me on that.
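Here is a minimal sketch of that computation for the example above, assuming NumPy; the x values are random placeholders, since the exercise lets you generate any data you like:

```python
import numpy as np

# The cluster identities y1..y9 from the example: 0, 1, 2, 1, 2, 0, 0, 1, 2
y = np.array([0, 1, 2, 1, 2, 0, 0, 1, 2])
X = np.random.randn(9, 2)               # nine placeholder 2-D data points

K, D = 3, X.shape[1]
means = np.zeros((K, D))
for k in range(K):
    means[k] = X[y == k].mean(axis=0)   # mean of all points whose identity equals k

print(means.shape)                      # (3, 2), i.e. K by D
```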
We're saying, Here's a bunch of data points, and each of the data points belongs to a different group. Now find the average of each data point within each group. Okay, so I hope you'll agree with me that this is just a simple geometric programming exercise. And there's nothing to be afraid of at this point. As a bonus, since our data is two-dimensional, we can also plot it on a two-dimensional grid. So your second job is this. First plot, all of the data on a two-dimensional grid using the scatter plot. In this scatter plot, you should also color-code each data point according to the cluster identities stored in the array y. It doesn't matter what the actual colors are, just that every data point in the same cluster should have the same color. Finally, now that you have the means of each cluster, you should also plot the centers of each cluster on the same scatterplot. You should use a different style so you can differentiate the cluster centers from the actual data points. As you can see in this plot, I've used star. Okay, so that's exercise number 1 for k-means clustering. Good luck, and I'll see you in the next lecture. 4. K-Means Exercise Solution 1: In this lecture, we are going to look at the solution to the previous exercise, k-means exercise number 1. Note that currently I will not provide the code for this exercise since it is very short. And because your job in this course is to learn how to code by yourself, that includes typing, designing, and getting the syntax right. At the very least, you should be able to copy what I do, although I do not recommend doing that. If you were doing the exercises as instructed, you should have 0 need for a code file containing the solutions to these exercises. Typing on the keyboard, it builds muscle memory, which is an invaluable skill to have when coding. If you have a legitimate reason as to why you cannot type the code by yourself, please let me know and I will do my best to accommodate you. Okay, so let's get started. First, we're going to import numpy and matplotlib to standard libraries for numerical computing in Python. Next, we're going to set our configuration parameters. As mentioned previously. This means that the dimensionality of the data is to the number of clusters is three, and the number of data points is 300. Next, we're going to create the data. This is perhaps the most challenging part of this script. So let's talk about what we want to do at a high level. First, the basic idea is I want to have three Clouds of data points so that each cloud can be considered a different cluster. This is the picture you should have in your mind when you think about clustering. So how can we generate such clouds? Well, one possible idea is to draw samples from three different Gaussian distributions, each with a different mean. As you recall, a Gaussian distribution is characterized by its mean and covariance. The mean will tell us where the Gaussian is located. And the covariance will tell us how the data points from the Gaussian would be spread out. So let's start by defining three means corresponding to the three Gaussians, will say that Mu 1 is 000, which is at the origin, will say that mu two is 55. And we'll say that Mu 3 is 0, 5. Next I'm going to create an N by D array called x. This is where we will store the data to populate x, we'll start with the first 100 points, as you recall, the colon. And then the 100 means selects the indices from 0 up to 100. 
Then on the right side, I say generate an array of size 100 by d from the standard normal, and then I add a mu 1. This results in 100 data points with dimension two centered at mu one with identity covariance. Next I'm going to do the same thing. But for the next 100 data points, this time the indices will be from 100 up to 200. And on the right side, the data points will be centered at mu two. Finally, we'll do the same thing for the last 100 data points, for the indices 200 of the 300. These data points will be centered at mu three. Next, I'm going to create the array y, which tells us the cluster identity corresponding to each of the axes. Since we just created the axis and they are all in order, the structure of Y is very simple. It's just 100 zeros followed by 100, 1's, followed by 102. As you recall, if you create a list and multiply it by an integer, it's simply repeats that element of the list that many times. Then when we use the plus operation on these lists, the result is the concatenation of the lists. And lastly, we cast the result to a NumPy array. Since it's easier to work with NumPy arrays compared to lists. Note that although our data has a special structure with all the clusters in order, the subsequent code does not assume this. In other words, the code we are about to write, well work with data with any structure. Okay, so the next step is to visualize the data we just created. To do that, we'll create a scatter plot by calling the scatter function. In the first argument, we pass in the first column of x. In the second argument, we pass in the second column of x. In the third at Names argument, we specify the color corresponding to our data points and we pass in y. Note that y is just a bunch of zeros, ones and twos. Using this scheme, matplotlib will decide what actual colors to assign to our data points. Of course, you can have more fine grain control if you would like to choose the colors yourself. But this is good enough for us. Okay, So hopefully this is what you expected to see. We can see that we have three Clouds of data points, and each of them is colored according to y. Next, we're going to consider the question, given a two-dimensional array of size n by b, by convention, how do I get the D size, that mean vector of that array? That is, I want to take the mean across each of the n samples. Now you have to be careful here, because if you just call mean by itself, you're going to get a single scalar, which is the mean of every single element of X. We don't want that, since what we should end up with is a mean vector and not a scalar. To get what we want, we have to pass in the argument axis equals 0, which says, take the mean along the n dimension. If you pass it an axis equals one, you would get the mean along the d dimension. As always, you're encouraged to try these things out for yourself so you get a better feel for how they work. Don't just take my word for it. Okay, so when we run this and we check the shape of the result, we get two as expected. Next, we're going to use what we just discovered to calculate the mean of each cluster using the cluster identities provided in the array y. To start, we'll create an array of size K by D called means. Next, we'll calculate the mean of each cluster one by one, starting with Cluster 0. Okay, So this code is quite compact, but in fact, there are many things happening at once. If you don't understand this at first glance, hopefully you did the exercise in a way that didn't make sense to you. 
So you never have to come up with the same solution as I did. You just have to come up with the same answer. Your code can be completely different than mine. So as you know, code can be read from the inside out. So let's start with the most insight thing, which is y equals equals 0. If you don't know what this does, I would recommend isolating this and printing it out by itself so you get a better understanding. Essentially what this returns is a Boolean array. As you know, equals, equals must return true or false. But since we're using it on an array, which is why the result will also be an array. Since we say equals equals 0, the resulting array will be true in every location where y is equal to 0 and false otherwise. Okay, so what happens when we index x with this Boolean array? Well, as you may have guessed, given the exercise we are doing, this selects all the elements of X where the index is true. Okay, so hopefully that makes sense. As you recall, X is an array with n rows. Y is also an array with n elements. When we say y equals equals 0, that returns a Boolean array with n elements. Then by indexing x with this Boolean array, we get back only the parts of X where the index was true. That is to say, we get only the rows of x corresponding to when y is equal to 0. And equivalently, we only get the rows of X that belongs to Cluster 0. Finally, we take the mean using the mean function passing in axis equals 0. Then we assign the resulting mean vector to our means array on the left side at index 0. Next, we do the same thing to calculate the means for Cluster 1 and Cluster 2. Okay, so technically this is the end of the exercise. Although it would be nice if we could visualize what we've just found. So the next step is to redraw our scatterplot. But with the cluster means we calculated previously. First, we'll start by drawing the same scatter plot we had before. Next, we call scatter again, but this time using the means array we just created. As before. The first argument is the first column of means, which refers to the first dimension. And the second argument is the second column of means, which refers to the second dimension. Next, I'm going to pass in a few additional arguments to make the plot easier to see. First, I'll pass NS equal to 500, which controls the size of the data points. This will make the means much larger than the data. Second, I'll pass in C equals red to make each of the cluster centers red. This will differentiate these points from my data further. As a side note, this is an example of how we can set the color of the data points in the scatterplot manually. Whereas we did not choose the colors in the previous scatterplot. Finally, we'll set marker equal to star so that instead of just appearing as a circle, the means will look like stars, which will further make them easier to see. Okay, so let's run this. Okay, so hopefully this is what you expect it to see. Each mean vector up here is roughly at the center of each cluster. As mentioned, these represent the centers of mass or the centroids of each cluster. Further, we can see that they are at approximately the original Gaussian centers. One of the stars is at 000, another is at 0, 5, and another is at 55. 5. K-Means Exercise Prompt 2: In this lecture, we are going to jump right into your second exercise for this section. Earlier your exercise was to take a set of data points and a set of cluster identities. Then you had to find the mean of the data points for each cluster. 
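For reference, here is one possible sketch of that first exercise, reconstructed from the walkthrough in the previous lecture (Gaussian clouds centered at (0, 0), (5, 5), and (0, 5)); the instructor's actual notebook may differ in its details:

```python
import numpy as np
import matplotlib.pyplot as plt

N, D, K = 300, 2, 3
mu1, mu2, mu3 = np.array([0, 0]), np.array([5, 5]), np.array([0, 5])

# three clouds of 100 points each: standard normal shifted by each mean
X = np.zeros((N, D))
X[:100] = np.random.randn(100, D) + mu1
X[100:200] = np.random.randn(100, D) + mu2
X[200:] = np.random.randn(100, D) + mu3
y = np.array([0] * 100 + [1] * 100 + [2] * 100)

# per-cluster means via Boolean indexing
means = np.zeros((K, D))
for k in range(K):
    means[k] = X[y == k].mean(axis=0)

# scatter plot colored by cluster identity, with the means drawn as large red stars
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.scatter(means[:, 0], means[:, 1], s=500, c='red', marker='*')
plt.show()
```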
This time, you're going to go in the opposite direction. Now, you will be given a set of means along with a set of data points. Your job will be to take these means and these data points and find out which cluster each data point belongs to. So remember that this is done according to Euclidean distance, or equivalently, squared Euclidean distance. Note that since the square is a monotonically increasing function, whether you compare the distances or the squared distances does not change the answer. That is to say, suppose you're comparing distances, say 2 and 4. If you square these, you'll get 4 and 16, so the relationship, that two is less than four, is preserved, because four is also less than 16. And it's only the relative distances that matter, because all you're trying to do is find the closest mean. The actual distance itself does not come out in the answer. Okay, so as before, let's think of a very simple example of how this will work. Suppose we have three means, m0, m1, and m2. We also have nine data points, x1, x2, x3, all the way up to x9. So what can we see? Well, we can see that x1, x2, and x3 are all closer to m0 than they are to m1 and m2. Therefore, x1, x2, and x3 belong to cluster 0. In similar fashion, we can see that x4, x5, and x6 are closer to m1 than they are to m0 or m2. Therefore, x4, x5, and x6 belong to cluster 1. Finally, x7, x8, and x9 are closer to m2 than they are to m0 or m1. Therefore, x7, x8, and x9 belong to cluster 2. So hopefully this makes sense. The challenge is to put this visual intuition into code. Here's how your code should work at a high level. First, you're going to generate a set of means in a matrix of size K by D, just like in the previous exercise. Also, you can use the same values for N, D, and K as we did previously. So N equals 300, D equals 2, and K equals 3. Remember that D equals 2 is helpful because we can see it visually. Second, you're also going to generate a data matrix of size N by D. As before, you can create these matrices randomly, but it may be helpful to generate them in a way that makes sense from a clustering perspective. Either way you do it is up to you. Personally, I think it's more useful to generate the data in a structured way, so that after you find the answer, you can plot the results as a sanity check. Speaking of which, as a bonus, I would encourage you to plot the results along with the output of the exercise. As output, you should have an N-length vector, which tells us the cluster identity of each data point. So for example, if the first x belongs to cluster 0, then the first value of your output array should be 0. Finally, I want to note that there are many ways to do this exercise. You may have heard that when you're using NumPy, it's not good to use for loops, because for loops are slow. Instead, you may have been told that it's better to use vectorized operations. That is to say, normally you would like to find a NumPy function that can do what you want all at the same time. Personally, I think that for this exercise, it's more instructive to do it with for loops, because it makes you think about what's going on algorithmically. Ultimately, whether you decide to use for loops or not is up to you. Finally, I want to reiterate that just like exercise number one, this is not quite machine learning just yet. In fact, it's just another simple programming warm-up in geometry. What we are doing is taking a bunch of data points and a bunch of centers, and we're just trying to find out, for each data point, which center is the closest.
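Here is a minimal sketch of this assignment step using explicit for loops, as suggested; the means and the data generation below are placeholders, and your own solution can be structured differently:

```python
import numpy as np

# placeholder inputs: K means (K x D) and N data points (N x D)
means = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.random.randn(300, 2) + means[np.random.choice(3, size=300)]

N, K = X.shape[0], means.shape[0]
y = np.zeros(N, dtype=int)
for n in range(N):
    closest_k, min_dist = -1, float('inf')      # sentinel values
    for k in range(K):
        d = np.sum((X[n] - means[k]) ** 2)      # squared Euclidean distance
        if d < min_dist:
            min_dist, closest_k = d, k
    y[n] = closest_k                            # cluster identity of the nth point
```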
So I hope you'll agree with me that this exercise is just another simple geometric programming exercise. Okay, so good luck on exercise number two, and I'll see you in the next lecture. 6. K-Means Exercise Solution 2: In this lecture, we are going to look at the solution for k-means exercise number 2. As before, please ensure that you have completed the exercise prior to watching this lecture. Note that currently I will not provide the code for this exercise since it is very short. And because it is your job in this course to learn how to code by yourself. That includes typing, designing, and getting the syntax right. At the very least, you should be able to copy what I do, although I do not recommend doing that. If you are doing the exercises as instructed, you should have 0 need for a code file containing the solutions to these exercises. Typing on the keyboard and builds muscle memory, which is an invaluable skill to have when coding. If you have a legitimate reason as to why you cannot type the code by yourself, please let me know and I will do my best to accommodate you. Okay, so let's get started. So first we're going to start by importing numpy and matplotlib, same as before. Next, we initialize our configuration parameters. That's n, the number of samples equal to 300, d the number of features equal to 2, and k, the number of clusters equal to three. Next, we initialize three means which we'll define our cluster centers. For convenience, I've used the same means that we had previously. Next, we're going to generate the data. We start by initializing an array of all zeros called the X of shape N by B. Next, we set the first 100 points to be normally distributed centered at the mean with identity covariance. So these will go in the indices 0 up to 100. Next, we set the next 100 data points to be normally distributed centered at the second meeting. Also with identity covariance, these will go into indices 100 to 200. Finally, we set the last 100 data points to be normally distributed centered at the third mean, also with identity covariance. These will go in the indices 200 to 300. Next, we have our main loop, which is the heart of this exercise. This is the loop where we will assign to the cluster identity is. We'll start by creating an array called a Y filled with all zeros of size n. Of course, this must have size n because we need a cluster identity for each of our N data points in x. Why is the array that will store our cluster identity? Next, we loop through all n data points. As mentioned in the previous exercise prompt, there are many ways to do this, even without you having to write any FOR loops yourself. However, this is not really helpful for this exercise because writing things this way helps you think about the algorithm more clearly. Okay, So it should be clear that since we're looping over all n data points, we're going to find the best cluster identity one by one for each data point. The basic idea is this. For each data point x of little n, we're going to loop through each of the cluster means. For each of these means, we're going to check the distance from x of little n two. That mean. We'll keep track of all these distances so we can find the one that is minimum. So if mean 0 is the closest, then that means the cluster identity is 0. If mean one is the closest, that means the cluster identity is one and so forth. Okay, So we'll start by initializing a variable called closest k to minus one. We use minus1 as a sentinel value. 
This will be overwritten as we iterate through the inner loop. Similarly, we initialize a variable called min dist to infinity. Clearly, this variable will store the minimum distance from x of n over all the means. By initializing it to infinity, any distance we find within the loop will be smaller. Okay, so next we loop through little k equals 0 up to big K. That is, we're going to loop through each cluster center. Inside the loop, we find the squared distance between x of little n and means of k, the kth mean. If you don't recognize why this is the formula for a squared distance, I would recommend reviewing the definition of squared Euclidean distance. We'll call this variable d. Next, we check whether or not d is less than min dist. Of course, the first time we iterate through this loop, this is guaranteed to be true, since min dist starts out as infinity and any finite number is less than infinity. If this condition is true, we do two things. First, we update min dist to be the current distance d. That is to say, min dist is the smallest distance we have found so far. Next, we assign closest k to be the current k. And again, this means that closest k is the index of the closest cluster center we have found so far. You should be convinced that by the end of this loop, closest k will store the index of the closest cluster center and min dist will store the corresponding distance. Thus, at this point, we can assign closest k to y, indexed at little n. In other words, closest k is the cluster identity for the data point x of little n. When the outer loop is complete, we will have found the cluster identities for all big N data points. Okay, so as a sanity check, we would like to plot our results. Luckily, you already know how to do this, because we did this in the previous lecture. So let's run this. Okay, so hopefully this is what you expected to see. We can see that all the data points are assigned to the correct color according to which of the means they are closest to. That is, all the points closest to (0, 0) are one color, which looks to be purple. All the points closest to (0, 5) are a different color, which looks to be green. Finally, all the data points closest to (5, 5) are another color, which looks to be yellow. 7. K-Means Exercise Prompt 3: In this lecture, we are going to continue our discussion of K-means clustering. One surprising fact you will learn is that by doing the previous two exercises, you've essentially done 90% of the work. You may recall me saying that the previous two exercises were just very simple warm-ups. They were nothing but simple programming exercises for doing some geometry. And yet, at the same time, what you will see is that k-means clustering is nothing but these two simple geometric exercises repeated over and over. Earlier, at the start of this section, we stated two fundamental intuitions about clusters. Let's recall what they are. Fact number one is that the cluster identity of each data point should be the cluster whose center it is closest to. I mean, this just makes sense. For example, suppose we have two groups of students: ones who are very tall, say six feet on average, and ones who are very short, say five feet on average. If your height is five foot ten inches, then it should be obvious that you belong in the tall cluster and not the short cluster. Why? Because you are closer to six feet than you are to five feet. So hopefully this is intuitive. So here's fact number 2.
The cluster center is simply the average or mean of all the data points belonging to that cluster. That also makes sense if we have five people in our group than the so-called center would be the average height of everyone in our group. Hopefully this is just as intuitive as factor number one. Moreover, you should recognize that these two facts are nothing but the previous two exercises. Earlier, I said that k-means clustering just happens to be a loop that simply repeats these two operations over and over. So what is the implication of this? Well, it means that by doing the previous two exercises, you have essentially implemented at k-means clustering already. The only thing left is to put them together. On the other hand, there seems to be a slight problem. What one might consider a chicken and egg problem. You see each of the two steps seems to depend on the results of the other. To find the means, you need the cluster identities. But to find the cluster identities, you need the means. So which came first, the chicken or the egg? The answer to this is called the initialization step. In this case, k-means does provide an answer for whether the chicken or egg comes first. Basically what we do is assign the cluster centers to randomly chosen data points in x. So this is a step we do before we loop over the two steps we discussed previously. So the end result is this. To initialize K-means, we start by picking a k random points in x and assigning these cluster centers. Then we enter a loop. Inside the loop, we employ fax number one. We assign each data point in x to our current cluster centers. The first time this runs, those cluster centers will simply be a subset of the points in x. The output of this step is a new set of cluster identities. Then we employ facts and number two, we recalculate each of the cluster centers according to the latest cluster identities. So the output of this step is a new set of cluster centers. Then we repeat this process over and over until the algorithm converges. So how do we know when the algorithm converges? Well, it's when the cluster identities stop changing. When the cluster identities stop changing, then obviously the cluster centers will also stop changing. And there is no point in running the loop any further, since the answer will no longer change. One crucial detail you have to know about K-means clustering is this. Although the process I've just described seems pretty straightforward, it does not result in the same answer every time you run it. That is to say k-means and never finds a globally optimal solution. Instead, it's only possible to find a local optima one. A simple solution to this is to simply run k-means several times and then choose the best answer. We'll define what best means later in this section. Practically speaking, this is pretty much all you can do. And in the real world, this is just fine. Okay, so hopefully the next exercise is not a surprise. Your job, before moving on to the next lecture will be to implement k-means in code. Remember, you have already done most of the work in exercises 1 and 2 of this section. All you really have to do is put these two things in a loop. The basic outline of your code should be as follows. First, you're going to generate a dataset called X of shape n by d. This time, you really do want to generate a dataset that is appropriate for k-means clustering. That is to say, don't generate a dataset from just a single Gaussian with no clearly discernible clusters. 
Later in this section, we will look at real-world datasets. But for now, we just want to make sure that your version of k-means does what it's supposed to do. Next, you're going to initialize the cluster centers by selecting K randomly selected points from X. Since you've created the data yourself, you can choose the values of N, D, and K. I would still recommend using D equals two so that you can visualize the results. Next, you're going to enter a loop that performs the two steps we discussed previously. Inside the loop, you'll want to check for convergence. That is, when the cluster identities stop changing, at that point you should exit the loop. Finally, if you chose D equals 2, you should visualize the results. You can do this by drawing a scatterplot of the data along with the cluster identities you found. Furthermore, you should also draw the corresponding cluster centers. Since you've already learned how to do this, it shouldn't be much trouble to do it again. Okay, so good luck on this exercise, and I'll see you in the next lecture. 8. K-Means Exercise Solution 3: In this lecture, we are going to implement k-means clustering in code. As always, I hope you had a chance to complete this exercise before watching this lecture. If not, then I would recommend stopping this video now until you've completed the exercise. Remember that this is to benefit your learning. So I hope you've spent some quality time on implementing this code. Actually, the third exercise should have been much easier than the first two, since the third exercise really is just putting the first two exercises together. So hopefully you found that to be the case. Okay, so let's get started. First, we're again going to import numpy and matplotlib. By now, you are familiar with how we make use of these libraries. Next, we again set our configuration parameters. That's D equals 2, K equals three, and N equals 300. So the data dimensionality is two, the number of clusters is three, and the number of samples is 300. Next, we create the data. We start by first creating the true means: mu one, mu two, and mu three. One important fact to keep in mind is that these true means are unknown to the k-means algorithm. They are known to us, because we're using them to create the data, but k-means doesn't make use of them at all. You might want to go over the code yourself to verify this fact. So as before, we're going to use the means (0, 0), (5, 5), and (0, 5). Next, we create our X matrix of size N by D. We set the first 100 points to be a Gaussian centered at mu one with identity covariance. We set the next 100 points to be a Gaussian centered at mu two, again with identity covariance. Finally, we set the last 100 points to be a Gaussian centered at mu three, again with identity covariance. So hopefully you remember all this from the previous exercises. Now, you might ask, now that we are doing the actual k-means clustering code, why are we still not using real data? This is a good question for beginners, and all beginners should be able to answer this question by the end of this course, if not already. Okay, so number one: remember that we've done a good number of examples on real-world data outside of this lecture. The focus of this lecture is to implement k-means, so the kind of data that we use is irrelevant. Why is it irrelevant? Well, remember that no matter what the data is, the code for k-means clustering would not change. Therefore, there is no real advantage to using real data at this point. In fact, there is only a disadvantage.
So why is that? Well, that brings us to point number 2, which is that using synthetic data allows us to test whether or not our code working as expected. That's an important thing in machine learning. When you write code, you have to test whether or not it works. But you can't do that if you use your code on some unknown high-dimensional dataset. On the other hand, this is a dataset we created ourselves and it's two-dimensional, so we can see it. Most importantly, we know what the answer should be. Therefore, it's a good way to test whether or not our code is working. So hopefully you now have a good understanding about why synthetic data is important. And also, you understand that we are still going to look at real data elsewhere in the course. So you haven't lost any opportunities to look at real data. Okay, So at this point, let's plot our data so we can remind ourselves what it looks like. This time we will only plot ax because that's the only thing given to our k-means algorithm, as you will see shortly. Previously in our first two warm up assignments, we assume that we were either given the cluster centers or that we were given the cluster identities. Now, we are in a more realistic scenario where we know neither, we only know Acts. However, even just plotting x by itself, we can see intuitively where the clusters should be. Okay, so let's draw our plot. So hopefully you agree that the cluster groupings are very intuitive. Later, when you learn more advanced techniques, you will understand why generating our data from a Gaussian clouds is actually the ideal use case. Next, we begin the code for k-means clustering. We'll start by doing the initialization. As you recall, this involves randomly assigning the cluster centers using randomly chosen points from x. So first, we start by creating an array of cluster centers of shape K by D. That's because there are k cluster centers, each with dimension B. Next, we do a loop, k times. Inside the loop, we pick a random index from 0 up to n minus one. If you're interested, you can verify that these are chosen from a uniform distribution. Okay? So we call the result of this I. And I is an integer from 0 up to n minus 1. Next, we assign x sub I to be the KF cluster center. So we take our cluster centers array and we index it at k. Next we do our k-means loop. As you recall, this involves two steps repeated over and over. And these two steps are simply the two previous exercises you've already done. So first we'll start by setting a variable called max iterators to 20. This will set an upper limit on how many times our loop iterates. Normally, k-means converges quite fast. So you'll find that not all of the iterations are needed. Next, we create an empty array to store the cluster identities. As before, this will be an array of length n, which stores integers from 0 up to k minus one inclusive. Next, for debugging purposes, I'll start an empty list called saved cluster identities to help us visualize what k-means is doing on each step, we're going to save the cluster identities on each iteration of the loop. Next we enter our loop. So I goes from 0 up to max editors exclusive. Inside the loop, we first make a copy of the cluster identities array. By calling the copy function. We'll assign this to a variable called old cluster identities. Remember that this is how we will check whether or not K-means has converted. 
If the cluster identities haven't changed from one iteration to the next, then there's no point in continuing, because if the cluster identities are the same, then the means will also be the same. Therefore, both of the two steps would result in no change. Next, we also save the old cluster identities to our saved cluster identities list. Next, we perform step one. As you recall, this is to determine the cluster identities given the current cluster centers. Note that the first time you run this loop, the cluster centers are just random points in X. This is why, in the previous exercises, it wasn't necessary to make the data look nice the way that I did. K-means, at some point, will calculate cluster identities for which the cluster means are in the wrong place. So that's why I gave the instruction that you could create any X and any set of cluster means. The point was only to write working code. Okay, so basically, you should recognize that this entire loop is exactly the same as what we had before. I'll go through it more quickly this time since you've already seen it. First, we loop through all N data points. Inside this loop, we initialize closest k to minus 1, a sentinel value. Next, we initialize min dist to infinity. This is so that any finite distance we calculate will be smaller. Next, we loop through all K cluster means. This is because we want to find out which of these is closest to the nth data point. Inside the loop, we calculate the squared Euclidean distance between x sub n and the kth cluster center. We'll call this d. Next, we check whether d is less than the current min dist. If it is, we save the current d as our new min dist, and we save k as our closest k. When we're outside the loop, we assign closest k to be the nth cluster identity. Next, we have step two, which was the other exercise we did. This is to calculate the new cluster means based on the cluster identities we just found. So again, you should recognize this code from earlier, with a few naming adjustments. As you recall, we work from the inside out. First, we use equals equals to obtain a Boolean array telling us which of the data points belong to cluster 0, 1, or 2. Then we use this Boolean array as an index into X. Next, we call the mean function to get the mean of all those data points, and we pass in axis equals 0, so we take the mean along the rows. Next, we assign this to our array which contains the cluster centers. Once we're done with step one and step two, we can check for convergence. Again, you can see we make use of equals equals. We want to know if our old cluster identities are equal to the new cluster identities we just found. As before, since the operands for equals equals are NumPy arrays, this will return a NumPy array of Booleans containing only trues and falses: true when both sides are the same, and false otherwise. Note that our algorithm has only converged if all the values are true. Basically, we are doing a big AND statement. So we're saying: if the first position is true, and if the second position is true, and if the third position is true, and so on. A shortcut way of doing this is to call the np dot all function. This will return true if the array that you pass in contains all trues. Inside the if statement, which only runs if the condition is true, we print out which iteration we converged on, and then we call break to end the outer loop. Okay, so let's run this. All right, so as you can see, k-means has converged in much less than 20 steps.
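Putting the walkthrough together, here is a sketch of the complete k-means loop as described in this lecture. It is reconstructed from the narration, so variable names and small details may differ from the instructor's notebook, and it assumes every cluster keeps at least one assigned point (otherwise the mean of an empty cluster would be undefined):

```python
import numpy as np

np.random.seed(0)  # optional, for reproducibility

# synthetic data: three Gaussian clouds (the "true" means are not used by k-means itself)
D, K, N = 2, 3, 300
X = np.zeros((N, D))
X[:100] = np.random.randn(100, D) + np.array([0, 0])
X[100:200] = np.random.randn(100, D) + np.array([5, 5])
X[200:] = np.random.randn(100, D) + np.array([0, 5])

# initialization: pick K random points from X as the starting cluster centers
cluster_centers = np.zeros((K, D))
for k in range(K):
    i = np.random.randint(N)
    cluster_centers[k] = X[i]

max_iters = 20
cluster_identities = np.zeros(N, dtype=int)
saved_cluster_identities = []

for it in range(max_iters):
    old_cluster_identities = cluster_identities.copy()
    saved_cluster_identities.append(old_cluster_identities)

    # step 1: assign each point to its closest cluster center
    for n in range(N):
        closest_k, min_dist = -1, float('inf')
        for k in range(K):
            d = np.sum((X[n] - cluster_centers[k]) ** 2)
            if d < min_dist:
                min_dist, closest_k = d, k
        cluster_identities[n] = closest_k

    # step 2: recompute each center as the mean of the points assigned to it
    # (assumes no cluster ends up empty)
    for k in range(K):
        cluster_centers[k] = X[cluster_identities == k].mean(axis=0)

    # convergence: stop when the identities no longer change
    if np.all(old_cluster_identities == cluster_identities):
        print(f"converged on iteration {it}")
        break
```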
Next, just like in our earlier exercise, we are going to plot the result of our algorithm. So let's think about this at a high level. Starting out, all we had was the array x. Yes, we had the true cluster means, but we're pretending that we don't know those. And in fact, when you think about what you would have with a real dataset, you would not know those. So our original plot was a plot of only the data without any colors and without any cluster centers. Now, after running K-means, we have more information. Now we have cluster identities, which will let us give each point in each cluster a different color. Furthermore, we also have the cluster centers, so we can draw those on our plot as well. As you recall, we earlier drew these as big red star. Okay, so hopefully you can recognize this code from earlier. The first line does a scatterplot of the data, coloring each data point by the cluster identities. So all the points belonging to cluster 0, we'll have one color. All the points belonging to cluster 1, we'll have another color, and so on. Next we call the scatter function. But this time on the means, as before, we pass in S equals 500, make these data points bigger and we pass in C equals red to make them red. And we set marker equal to star, so they show up as star. Okay, so let's run this. All right, so hopefully this is what you expect it to see. This is the same data as before, but now it has colors according to the cluster identities found by K-Medians. Furthermore, we've marched to the cluster centers with red star. This is all information we did not have before running K-means where we only had the data itself. Next, we have a short loop to visualize the training process. As you recall, we saved the cluster identities on each step of k-means. So by plotting the data according to the cluster identities on each step, we can see how the clusters evolve as the algorithm learns. Okay, So if you want to understand this code, I'll walk through it very quickly. First, we grab the number of cluster identities we've stored. Note that this is variable, since we don't know how many times the training loop will go, we'll assign this to a variable called m. Next we call the function at plt.plot subplots to set the size of the plot. The first argument, it sets the width, and the second argument says the height. I've chosen the number of five arbitrarily, which is large enough for me. But the important part is to set the height, which is m times the size of the width. This is because we are going to make em a subplots and we want them all to fit. Each of the individual subplots will be five by five. Okay, so next we do a loop, m times. Inside the loop we call plt.plot. There are three arguments to this function. First is the number of rows in our plot, and second is the number of columns in our plot. So we're going to have m rows and one column. That is to say, we're going to make em plots and they'll all be on top of one another. The third argument specifies which plotted is out of the m by one plots. Basically this is just I plus 1, since the first value of i will be 0. Next we get y, which will be assigned to the same cluster identities on iteration i. Next we call plt.plot scatter passing in our data and why the current color settings. Okay, so let's run this. All right, So hopefully this plot is useful for you to see the progression of k-means. You can see that we start with everything the same color because we initialize the cluster identities to be an array of all zeros. 
All right, so hopefully this plot is useful for you to see the progression of K-means. You can see that we start with everything the same color, because we initialize the cluster identities to be an array of all zeros. Then we can see that gradually the cluster separation improves. Eventually, we end up with the final cluster identities, at which point we've quit training. The important thing to notice is how the cluster identities slowly improve over each step. By setting the cluster identities in an intelligent way, the cluster identities do not get worse, only better. In later lectures, we will quantify this idea.

9. K-Means Objective (Theory): In this lecture, we're going to talk about the k-means objective function. To introduce this idea, it helps to have some experience with other machine learning models first, such as linear regression and logistic regression. So if you do, that will be very helpful. But if you don't, then try your best to follow along.

So essentially, linear regression, logistic regression, and K-means clustering are all examples of machine learning. The key word in this phrase is learning. Thus, the important question to ask is: what do we mean by learning? Generally speaking, our model usually has some objective. Yes, you might say our objective is to learn, but we want to be more specific and precise. The objective is a number. In supervised learning, such as linear regression and logistic regression, that number is the error. You can think of the objective as "how many errors did I make". Thus, to learn is to configure your brain such that you make fewer and fewer errors. So I hope the analogy to human learning is clear. Learning is equivalent to configuring your brain, or your model of the world, to make fewer errors.

In linear regression, our objective is the sum of squared errors. There are multiple reasons why this makes sense, but here is one simple reason. Linear regression is all about finding the line of best fit. In other words, we would like the line to be as close to the points as possible. On this plot, you can see that sometimes our line of best fit overestimates when it makes a prediction, and other times it underestimates. Thus, the error can actually be positive or negative. Of course, you don't want to just add the errors together, because if you add a positive number and a negative number which are the same, then you will get 0. For example, if one of the errors is plus five and the other error is minus five, plus five plus minus five is 0. And obviously you don't want the total error to be 0 when your model actually made a non-zero error. Therefore, we square the errors so that they are all positive. And in order to turn them into a single number, we add them together; hence, the sum of squared errors. We call this sum of squared errors a cost function.
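For reference, the sum of squared errors just described can be written as follows, where y sub n is the target for the nth sample and y-hat sub n is the model's prediction. This is my own typesetting, not copied from the slides:

```latex
J = \sum_{n=1}^{N} \left( y_n - \hat{y}_n \right)^2
```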
Note that there are other kinds of cost functions, such as the cross-entropy, which is used for logistic regression. However, for this course, we are primarily interested in the squared error. The next step when you are training a linear regression or logistic regression model is to figure out how to update the model parameters so that you can make the error smaller. Exactly how you do this is outside the scope of this course, but you are encouraged to learn about linear regression and logistic regression if you're curious. The important thing to note is that, just like k-means clustering, this is done iteratively. That is to say, we have some loop. Inside the loop, we update the model parameters. On each iteration of the loop, we hope to update the model's parameters in such a way that the error always improves. Again, how we do this is outside the scope of this course, but recognize that it can be done.

Note that the analogy to human learning also makes sense in this context. You can think of each iteration of the loop as one pass through the training data. That is to say, each time you get to see the training data, your understanding of it improves, such that you will decrease the errors that you make. This is true for both humans and machines. The more I practice, the more intelligent I become.

Okay, so it turns out that many machine learning algorithms are trained in this way. The method is: we think of some useful cost function. Then we derive some update rule such that if we keep applying this update rule, the model improves its error. Eventually, the error converges to some minimum value, at which point you can say that you have finished learning. In these situations, it's useful to plot your cost per iteration when training is complete, so that you can verify that the training process was successful. And again, many algorithms use this method of learning. As mentioned previously, this applies to k-means clustering, linear regression, and logistic regression. It also applies to deep neural networks, reinforcement learning, and matrix factorization, just to name a few more examples.

Okay, so that's the idea of learning and cost functions in general. But what is the cost function, or the objective function, for k-means? As a side note, recall that cost, objective, loss, and error are all synonyms. So if I ever use one of these terms in place of another, don't be alarmed. They all mean the same thing. In any case, if you are comfortable with the sum of squared errors for linear regression, then you will also be comfortable with the cost function for k-means, which looks very similar.

Okay, so let's go through this slowly so that you understand each component of the objective. First, you can see that it's in the form of a sum of squared errors. We sum over all N data points and we square the thing inside. Note that because x sub n and m sub k are vectors, we denote the Euclidean distance with double bars. This is our notation for the norm. The part that's confusing for some people is the strange "1" notation. This is called an indicator function. It has a value of one when the argument is true and zero otherwise. As you recall, y sub n stores the cluster identity of the nth data point. Therefore, we can interpret this as follows. Although we sum over all values of k, only one of them contributes to the cost: the one that is equal to y of n. And this is because y of n assigns x of n to one cluster. Therefore, the cost is the squared distance between x sub n and the mean of whichever cluster it belongs to. So hopefully that makes sense. We sum over all N data points. For each of the N data points, we add a single squared distance. The squared distance we add is the squared distance between x sub n and the mean of the cluster that it belongs to. So if x sub n belongs to cluster k, then we add the squared distance between x of n and m sub k. As a side note, you can see that little k sums from one up to big K. In the code, little k runs from 0 up to K minus one. This is just due to how programming in Python works. So just be mindful of this difference. When we're writing equations, typically it's easier to count from one instead of zero.

It's also worth talking about why this objective makes sense. To understand this, we can consider some different scenarios. First, let's consider the scenario where the clusters are very clearly defined.
All the data points belonging to a cluster are very close to the cluster center. Since that is the case, all of the squared distances will be very small, and therefore the total sum of squared distances will also be very small. Okay, now let's consider a second scenario. This time, the clusters are less well-defined. Now, the data points can be very far away from the cluster center. Because of this, the squared distances are larger, and therefore the total sum of squared distances will be larger. The cost function does not favor this scenario, because the clusters are less well-defined than before. Now let's consider a third scenario, where the actual cluster assignments are wrong. In this case, because the data points are not assigned to the closest cluster center, the squared distance between each data point and the cluster center it is actually assigned to will be very large. Because these distances are very large, the total sum of squared distances will also be very large. So clearly, in order to make the cost small, we should assign each data point to the closest cluster center, thus minimizing the distances.

Note that because of the k-means algorithm, we can write the objective function in an alternative way, and perhaps this one will be easier to understand. In this form, we no longer have to sum over k. We know that x of n will be assigned to whichever cluster it is closest to. Therefore, the inner part of the cost is just the squared distance between x of n and whichever mean it is closest to, using the squared Euclidean distance. In this form, we take the minimum distance directly, using the min notation.

Yet another way to write the k-means objective is this. We can define a quantity called the responsibility matrix, denoted by the letter R. Basically this is the same thing as the indicator function we saw earlier, but now it is a matrix instead of a function. Essentially, r of n, k (that is, the value at row n, column k) is equal to 1 if x of n belongs to cluster k; otherwise it is 0. You should be able to verify that each row of R, corresponding to a single data point in X, contains only a single one, and the rest of the values must be 0. That's because x can only belong to a single cluster. In other words, x can only have a single cluster identity. And so again, we have this situation where even though we sum over all values of k, in actuality only one of these terms is non-zero and the rest are 0, because the responsibility value is 0 when x does not belong to those clusters.
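To summarize, here is my own typesetting of the three equivalent forms of the K-means objective described above, where y sub n is the cluster identity of x sub n, m sub k is the kth cluster mean, and r sub n,k is the responsibility matrix:

```latex
% indicator form
J = \sum_{n=1}^{N} \sum_{k=1}^{K} \mathbf{1}[y_n = k] \, \| x_n - m_k \|^2

% min form
J = \sum_{n=1}^{N} \min_{k} \| x_n - m_k \|^2

% responsibility-matrix form
J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \| x_n - m_k \|^2,
\qquad
r_{nk} =
\begin{cases}
1 & \text{if } y_n = k \\
0 & \text{otherwise}
\end{cases}
```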
One interesting fact is that for k-means, we did not derive the learning algorithm from the cost function. All we did was start from two simple facts about clustering that seemed to make sense. Then we discovered that if we repeat these two simple facts in a loop, we end up with K-means. To recap, these two facts are as follows. Number one: the cluster identities should be assigned so that x of n belongs to whichever cluster center it is closest to. Number two: the cluster centers are just the average of all the x's that belong to them. However, using what we know about cost functions, this doesn't seem very rigorous. You might ask: if linear regression and neural networks can derive their learning rules from the objective function, why can't we do this for K-means? At this point, we are not yet ready to understand how this algorithm can be derived. But if you want to learn more, you're encouraged to learn about Gaussian mixture models, also known as GMMs. GMMs are probabilistic models which can be seen as a generalization of K-means clustering. Through the lens of the GMM, you can derive the update algorithm for K-means in terms of probabilistic quantities. So if you're looking for a more rigorous derivation, that would be the next step to explore. The main point of this lecture is to understand the idea that k-means has an objective function in the first place. You wouldn't have known this only from the algorithm itself. That's why we haven't discussed it until now. The second point is that this sets us up to write more code. In particular, your next exercise is this: take the k-means code we had earlier and add more code so that after training, we can plot the objective function at each iteration. We expect that at each iteration of k-means clustering, the objective should get smaller and smaller until it converges to some minimum value. So please write this code and confirm that this is true. All right, so that's the exercise. Good luck, and I'll see you in the next lecture.

10. K-Means Objective (Code): In this lecture, we are going to complete the next exercise in implementing k-means, which is to plot the cost function during the training process. Once we've done this, we can consider our k-means implementation complete. As you'll see, this is just a simple modification of the code we've already written, since we completed most of the work in the previous lecture. As always, please make sure you've completed this exercise before watching this lecture, since that is your homework for this course. In other words, don't look at the solutions for the homework before you've completed the homework.

Okay, so let's get started. First, you can see that all of the initial parts of the code are the same. The inputs are the same, the data is the same. What we're interested in is the k-means training loop. So let's scroll down to that part of the notebook. Now, you may have assumed that we would have a separate function for calculating the cost, but it turns out it is simpler to compute it within our training loop. Note that we are going to use the version of the cost that contains the min operator. In this case, there's no need to do a double summation over the N samples and the K clusters. The real reason you need to sum over the K clusters is because you need to find out which cluster each data point x of n belongs to. In K-means, this is defined by the minimum distance between x of n and each of the means. However, you will recall that we've already found this in the code. Thus, if we made a separate function to do this, we would be repeating work we've already done, which would be inefficient.

Okay, so let's look at the code so we can review where we've already made this calculation. First, before we even go inside the training loop, we're going to create two new variables. We have min dists, that's plural, which is an array that will store the minimum distances for all N samples. Next, we have a list to store the costs. This is a list and not an array, since the number of iterations is variable. So this is going to store the cost at each iteration. Okay, so inside the loop, let's look carefully at step number one. You will notice that we do two loops, one over the N data points and one over the K clusters. For each of the K clusters, we calculate the squared Euclidean distance between the cluster mean and the nth data point. Furthermore, you'll recall that we store this in a variable called min dist. This is exactly what we need for the cost. Therefore, once our inner loop is complete, all we need to do is save min dist into the array min dists. Once we have completed step one, we now have all N min dists, which are the squared Euclidean distances between each data point and the closest cluster center. As you know, the total cost is just the sum of these. Therefore, we call the sum function on min dists to get the cost, and we store this in our list of costs. After that, the rest of the code is the same.
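Here is a rough sketch of the full loop with these cost-tracking additions. As before, the variable names are my own assumptions and may not match the notebook; the two lines marked as new are the only changes relative to the earlier sketch.

```python
import numpy as np
import matplotlib.pyplot as plt

def kmeans_with_cost(X, K, max_iter=20):
    # Same sketch as before, with cost tracking added (assumed names).
    N = len(X)
    cluster_centers = X[np.random.choice(N, K, replace=False)].astype(np.float64)
    cluster_identities = np.zeros(N, dtype=int)

    min_dists = np.zeros(N)  # minimum squared distance for each of the N samples
    costs = []               # a list, since the number of iterations is variable

    for i in range(max_iter):
        old_cluster_identities = cluster_identities.copy()

        # step 1: assign each point to its closest center, saving the min distance
        for n in range(N):
            closest_k = -1
            min_dist = float('inf')
            for k in range(K):
                d = ((X[n] - cluster_centers[k]) ** 2).sum()
                if d < min_dist:
                    min_dist = d
                    closest_k = k
            cluster_identities[n] = closest_k
            min_dists[n] = min_dist          # new: keep the min distance for the cost

        costs.append(min_dists.sum())        # new: the cost is the sum of the min distances

        # step 2: recompute the cluster centers
        for k in range(K):
            cluster_centers[k] = X[cluster_identities == k].mean(axis=0)

        # convergence check
        if np.all(old_cluster_identities == cluster_identities):
            break

    # after training, plot the cost per iteration; it should decrease and then flatten out
    plt.plot(costs)
    plt.xlabel("iteration")
    plt.ylabel("cost")
    plt.show()
    return cluster_centers, cluster_identities, costs
```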
Okay, so after we are done training, we can call plt.plot to plot the cost per iteration. So as you can see, the cost per iteration decreases on every iteration and eventually converges to some minimum value, as promised.

And again, let's remind ourselves why this is important. This is important to do whenever you are training a model using an iterative algorithm. Usually this involves minimizing or maximizing some objective. Therefore, you would like to plot that objective over each iteration to check that your code is working as intended. In fact, sometimes, like in deep learning, even if your code is correct, things can still fail due to a sub-optimal selection of hyperparameters. So that's another reason you need to plot the cost per iteration. If you don't, then you have no idea whether your model has learned what it was supposed to learn. Just because your code is correct doesn't mean the hyperparameters you chose are also correct. So the lesson is: whenever you have an iterative algorithm, it is always important to plot the cost per iteration.

11. Where to get discount coupons and FREE machine learning material: Hey everyone, and welcome back to this class. In this lecture, I'm going to answer one of the most common questions I get: where can I get discount coupons and free deep learning material? Let's start with coupons. I have several ways for you to keep up to date with me. The absolute number one best way for you to keep up to date with newly released discount coupons is to subscribe to my newsletter. There are several ways you can do this. First, you can visit my website, lazyprogrammer.me. At the top of the page, there is a box where you can enter your email and sign up for the newsletter. Another website I own and operate is deeplearningcourses.com. This website largely contains the same courses as you see on this platform, but it also contains extra VIP material. More on that later. So if you scroll to the bottom of this website, you'll find a box to enter your email, which will sign you up for the newsletter, just as you would on lazyprogrammer.me. So you only have to do one of these.

Now let's do a small digression, because this is another common question I get: what's this VIP material all about, and how can I get it? So here's how the VIP thing works. Usually, when I release a course, I'll release it with temporary VIP material, which is exclusive for those early birds who signed up for the course during my announcement. This is a nice little reward for those of you who stay up to date with my announcements and, of course, actually read them. It's important to note that VIP material can come out at any time. For example, I could make major updates to a course three years after starting it and do another VIP release. The purpose of deeplearningcourses.com is to have a permanent home for these VIP materials. So even though it could be temporary on the platform you signed up on,
if you signed up for the VIP version of the course, then you'll get access to the VIP materials on deeplearningcourses.com permanently, upon request. Here are some examples of materials you might find in the VIP sections. In my TensorFlow 2 course, there are three extra hours of material on Deep Dream and object localization. Usually I don't release the VIP content in video format, but this was an exception. Another example: in my Cutting-Edge AI course, there was an extra written section on the TD3 algorithm. This course covered three algorithms in total, so the extra section gives you one more, or in other words, 33 percent more material. Another example: in my Advanced NLP and RNNs course, there is a section on speech recognition using deep learning. In addition, there is an all-new section of the course on stock predictions or memory networks, depending on which version of the course you are taking. The reason for this is that I might release slightly different versions of each course on different platforms. Because of how the rules on all these platforms work, I must differentiate the courses. However, since I own deeplearningcourses.com, this is the only platform that contains the most complete version of the course, which includes all the sections. Please note that this is rare, so depending on what course you are taking, it might not affect you.

All right, so let's get back to discount coupons and free material. Other places where I announce discount coupons are Facebook, Twitter, and YouTube. You might want to pause this video so you can go to these URLs and follow me or subscribe to me on these sites, if they are websites that you use regularly. For Facebook, that's facebook.com/lazyprogrammer.me. For Twitter, that's twitter.com/lazy_scientists. For YouTube, that's youtube.com/c/lazyprogrammerx.

Okay, finally, I still release completely free material. This is nice if I want to just talk about a single topic without having to make an entire course for it. For example, I just released a video on stock market predictions and why most other blogs and courses approach this problem completely wrong. That's another benefit of signing up for these things: I get to expose fake data scientists who are really marketers, whereas I wouldn't ever make an entire course about that. Sometimes this can be in written form and sometimes it can be in video form. If it's in written form, it will be on either lazyprogrammer.me or deeplearningcourses.com. If it's a video, it will be on YouTube. So make sure you subscribe to me on YouTube. If I release a video, I may also make a post about it on lazyprogrammer.me, and I may also announce it using all the methods I previously discussed. So that's the newsletter, Facebook, Twitter, and obviously YouTube itself.

Now, I realize that's a lot of stuff and you probably don't use all these platforms. I certainly don't, at least not regularly. So if you want to do the bare minimum, here's what you should do. First, sign up for my newsletter. Remember, you can do that either on lazyprogrammer.me or deeplearningcourses.com. Second, subscribe to my YouTube channel at youtube.com/c/lazyprogrammerx. Thanks for listening, and I'll see you in the next lecture.