Cluster Analysis in Data Mining - Master Clustering Algorithms | Abhishek Kumar | Skillshare


Cluster Analysis in Data Mining - Master Clustering Algorithms

Abhishek Kumar, Computer Scientist at Adobe


Lessons in This Class

6 Lessons (1h 20m)
    • 1. Introduction (1:08)
    • 2. Clustering Overview (10:41)
    • 3. K-Means Clustering - Part 1 (10:13)
    • 4. K-Means Clustering - Part 2: Python Code (9:32)
    • 5. DBSCAN - Density-Based Clustering (28:41)
    • 6. Hierarchical Clustering Algorithm (19:49)

25 Students · 1 Project

About This Class

    Welcome to Cluster Analysis in Data Mining! The problem of clustering is to take a collection of points and group them into "clusters," such that members of the same cluster are close together, while members of different clusters are far apart.

    Clustering is an unsupervised learning technique and is crucial for applications such as categorization, data preprocessing, data visualization, outlier mining, and more. In this course we explore the different techniques used to cluster data.

    Good luck as you get started, and I hope you enjoy the course!

Meet Your Teacher


Abhishek Kumar

Computer Scientist at Adobe




Transcripts

2. Clustering Overview: This is the first video on clustering, or cluster analysis. In this video I will give an overview of the clustering process: the applications of clustering, what exactly clustering is, and some of the most common clustering methods. The basic problem in clustering is that we are given a bunch of data points and we want to understand their structure. If some points are located close together while others sit somewhere else, we are trying to recover that structure. More formally, we are given a set of points and we want to group them into some number of partitions, or clusters, such that two conditions are satisfied: the members of one cluster should be similar to each other (we will come to how similarity is defined), and the members of different clusters should be dissimilar. Similarity is defined using a distance measure, and that distance measure can be defined in different ways depending on the application and on the representation of the data. For example, given a set of points we might group them into three clusters, with a few leftover points treated as outliers because they do not match any cluster. If the data lives in Euclidean space and we use Euclidean distance (this is not always the case, but assume it here for the sake of explanation), then the distance between any pair of points inside one cluster is small compared with the distance between a point picked from one cluster and a point picked from another. Distance is therefore a measure of similarity or dissimilarity. Now let's see some applications of clustering. One application is categorization: we may want to categorize a market segment for a given product into categories such as heavy, medium and light users, or by some other usage pattern. Another application is preprocessing. A common use of clustering is that if you have a very large amount of data, say 10 million points, you can cluster it into, say, 10,000 (or 1,000) clusters and then pick one point from each cluster as the representative of that cluster. If you want to sample from those 10 million points, picking the cluster representatives gives a sample that covers the whole dataset much better than picking the same number of points at random. There are many applications where you want to work on a sample rather than on the entire dataset, and clustering helps you obtain a well-distributed sample. Clustering also helps in visualizing your data in a given space, and one more application is outlier mining: many clustering techniques help in outlier detection and are themselves robust to outliers.
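Returning to the sampling-by-representatives idea above, here is a minimal Python sketch of one way to do it. The dataset, the cluster count of 100 and the use of scikit-learn's KMeans are illustrative assumptions, not part of the lesson: cluster the data and keep, for each cluster, the point nearest its centre.

    # Sketch: replace a large dataset by one representative point per cluster.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=10_000, centers=5, random_state=0)

    # Cluster into (say) 100 groups and keep the point nearest each centroid.
    k = 100
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

    representatives = []
    for i in range(k):
        members = X[km.labels_ == i]
        d = np.linalg.norm(members - km.cluster_centers_[i], axis=1)
        representatives.append(members[np.argmin(d)])
    sample = np.vstack(representatives)   # 100 points that cover the data
    print(sample.shape)                   # (100, 2)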
Now let's come back to distance measures, which we mentioned above as a measure of similarity between points. Different ways of representing the data lead to different distance measures; the same data can be represented in different ways. For example, suppose you are trying to cluster documents so that documents belonging to the same topic are grouped together. You can categorize them on the basis of the words present in the documents: a certain set of words occurs together in one group of documents, another set of words is common in another group, and you group documents by subject accordingly. There are several useful ways of representing the same document. You can treat it as a set of words (bag of words is a common technique); in that case you would consider using Jaccard distance, because you are working with sets. Alternatively, you can view a document as a point in a space of words, in which case you can use Euclidean distance. Or you can represent a document as a vector of word counts, and then it is natural to consider the angle between the vectors, so you could use cosine distance. So the same document can be represented in different ways, and you have to choose the distance measure accordingly.
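As a rough illustration of how the choice of representation drives the choice of distance, here is a small sketch comparing two toy documents using Jaccard distance (set-of-words view) and cosine distance (vector-of-counts view). The documents themselves are made up.

    # Sketch: two toy documents compared with Jaccard and cosine distance.
    import numpy as np

    doc_a = "clustering groups similar points together".split()
    doc_b = "clustering groups similar documents by topic".split()

    # Set-of-words view -> Jaccard distance = 1 - |A & B| / |A | B|
    A, B = set(doc_a), set(doc_b)
    jaccard_dist = 1 - len(A & B) / len(A | B)

    # Vector-of-counts view -> cosine distance = 1 - cos(angle between vectors)
    vocab = sorted(A | B)
    va = np.array([doc_a.count(w) for w in vocab], dtype=float)
    vb = np.array([doc_b.count(w) for w in vocab], dtype=float)
    cosine_dist = 1 - va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb))

    print(round(jaccard_dist, 3), round(cosine_dist, 3))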
Now let's see some of the common methods of clustering. We can categorize them into partitioning clustering, where we search directly for the partition; one example is k-means, where we start with k clusters, assign each data point to the cluster whose centre is nearest, and keep refining. Another category is hierarchical clustering, where we do not know the number of clusters beforehand: we treat each data point as a cluster in itself, repeatedly find the two nearest clusters and merge them (reducing the number of clusters by one), take some centroid for the merged cluster, and keep going until a stopping condition is reached; we will see all of this, including the termination conditions, when we study hierarchical clustering. This bottom-up approach is also called agglomerative, because we start with each data point as a single cluster. The alternative is top-down, or divisive, where we start with one single cluster and then break it up according to certain criteria. The merging criterion can also be density based: we define a density threshold, and after merging some points into one cluster, if the merged cluster still satisfies the threshold (that is, there are at least some number of points within a given volume), we keep merging; if a further merge would breach the density threshold, so that the points are no longer densely packed, we stop. That approach is density based. These categories are not disjoint: a method that is hierarchical can also be density based, and the same holds for the other combinations. There are also other properties we can use to categorize methods, such as point assignment, where a set of clusters is maintained and each point belongs to its nearest cluster; partitioning and density-based methods often have this property. We will also see examples that mix these approaches, for instance the clustering method called HDBSCAN: it is a density-based clustering method and it is also hierarchical, so it uses both hierarchical and density-based ideas.

3. K-Means Clustering - Part 1: In this video we will see what the k-means algorithm is, and we will do clustering with it on an example. So let's begin. K-means is the most popular and most widely used clustering algorithm. It is a partitioning algorithm: we start with a partition of the data into k groups and iteratively optimize that partition with respect to a distance criterion. It is an iterative algorithm consisting of two steps. The first is the cluster assignment step, where we assign each data point to the cluster whose centre is nearest to it. Then, once that is done, for each cluster we take the average (the mean) of its points to find a new centre; the centre moves, so this is the move-centroid step. We keep alternating: assign the points, then recompute the centres. Let's quickly run it on an example. Say these are the data points and k = 2 (we will see how to choose k in the next video). We randomly place two cluster centres, one red and one green. Each point is assigned to whichever centre is nearer, so some points fall under the red cluster and the rest under the green one. After this assignment we compute the mean of all the points in the red cluster, and the red centre moves to that mean; similarly the green centre moves to the mean of its points. Then we repeat the assignment step with the new centres, so some points may switch from red to green or vice versa, and we again recompute the means.
As the centres keep moving, points keep getting reassigned: a point that was red may now be closer to the green centre and switch clusters, after which it no longer contributes to the red mean. Eventually recomputing the centres no longer changes them and no point changes cluster; the algorithm has converged to two clusters, and we stop. The inputs to this algorithm are k, the number of clusters, and the data points. You need to give it the number of clusters beforehand, unlike some other clustering algorithms such as hierarchical or density-based methods, where you keep merging based on density and stop once a merge would fall below the density threshold, so you do not need to provide the number of clusters. The other input, common to every clustering algorithm, is the set of data points, which we write as x1, x2, ..., xn; each of these can be d-dimensional, for example x1 could be (7.5, 10, ...), and so on. Now let's state the algorithm formally. We randomly initialize k cluster centroids mu_1, ..., mu_k; these are vectors of the same dimension as the data points, since each is just an average of data points. After initialization we assign a cluster to each of the n data points: each x_i gets a label c_i, the index of the cluster centroid closest to x_i. So c_i is a value between 1 and k, found by minimizing ||x_i - mu_k|| over all values of k: whichever mu_k gives the minimum distance, that k becomes c_i. Once every data point has a cluster, we recompute the mean of each cluster by averaging the points assigned to it. For example, if x1, x5 and x6 are assigned to cluster 2, then the new mu_2 is (x1 + x5 + x6) / 3, the sum of those points divided by the number of points, here three. So the first step is cluster assignment and the second step is moving the centroid; these are the two main steps of the k-means algorithm.
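Here is a minimal NumPy sketch of the two steps just described (assignment and centroid update). The initialization-by-sampling, the iteration cap and the variable names are illustrative choices, and empty-cluster handling is omitted for brevity.

    # Minimal NumPy sketch of k-means: assignment step + move-centroid step.
    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=k, replace=False)]  # random init
        for _ in range(n_iter):
            # Step 1: assign each point to its nearest centroid.
            d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = d.argmin(axis=1)
            # Step 2: move each centroid to the mean of its assigned points.
            # (empty-cluster handling omitted for brevity)
            new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new_centroids, centroids):   # converged
                break
            centroids = new_centroids
        return labels, centroids

    # Toy usage on random 2-D points.
    X = np.random.default_rng(1).normal(size=(200, 2))
    labels, centroids = kmeans(X, k=2)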
In the same example, with k = 2 and points in the plane (so we can take the space to be R^2), we have two centroids mu_1 and mu_2. We assign each point to its nearest centroid, then compute mu_1 as the element-wise average of the points currently assigned to cluster 1 and mu_2 as the average of the remaining points, move the centroids, reassign the points, and recompute, stopping when the centroids stay in the same place. That was the basic explanation of the k-means algorithm; in the next video we will see how to choose the value of k and run an example in Python.

4. K-Means Clustering - Part 2: Python Code: In this video you will see how to use scikit-learn's k-means implementation in Python. Let's begin by importing the libraries. First we import numpy and pandas, and we import the pyplot module from matplotlib. Then we import the dataset generator: we will use the make_blobs function from sklearn.datasets to generate sample data, and from sklearn.cluster we import KMeans, scikit-learn's implementation of k-means. Let's run the cell to import the required libraries. Now we generate a dataset using make_blobs and store it in X and y: we ask for 300 samples, and with the centers parameter we specify how many blobs of data we want, so let's create three centres, along with a standard deviation. Then we draw a scatter plot of the generated data just to see whether it looks right, and indeed we see three groups around the three centres we asked for. Next we set up KMeans. Here we need to specify n_clusters, the number of clusters, because the number of clusters is one of the input parameters of the k-means algorithm. In this example I know there are three clusters, so I pass 3, but in general you may not know, and you determine the number of clusters using the elbow method: you plot the average distance of each point from its cluster centre on the y-axis against k on the x-axis as you increase k. Initially the average distance drops quickly, but after a point it drops very slowly; that point is called the elbow, and it gives a good value of k. Here the elbow would be at three, because we generated exactly three blobs.
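As a quick sketch of the elbow heuristic just mentioned, one can plot scikit-learn's inertia_ (the sum of squared distances of points to their nearest centre, a close cousin of the average distance described in the lesson) against k. The blob parameters below are assumptions for illustration, not the lesson's exact values.

    # Sketch of the elbow heuristic: plot within-cluster distance (inertia) against k.
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)

    inertias = []
    ks = range(1, 10)
    for k in ks:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        inertias.append(km.inertia_)    # sum of squared distances to centroids

    plt.plot(list(ks), inertias, marker="o")
    plt.xlabel("k (number of clusters)")
    plt.ylabel("inertia")
    plt.show()                          # the bend ("elbow") suggests k = 3 here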
Back to the parameters: init='k-means++' initializes the k centroids in a smart way for faster convergence (you can skip it and use random initialization); max_iter can be, say, 100 or 200; and n_init tells how many times to run k-means. Since k-means starts by picking random centres, assigning a cluster to each data point and iterating, you run it multiple times and keep the best of those runs; here I pass n_init=10, so it runs k-means ten times and keeps the best result. Then we do the prediction: we call fit_predict on X to get the cluster label for each point, and we plot the points coloured by those labels. The model also exposes the cluster centres it found, in cluster_centers_; let's print them and then plot these three points on top of the scatter plot. You can adjust the cosmetics, for example set the marker size to 200 and give the centres a different colour so they stand out clearly. Now you can see the three centres that k-means found on this data. That was a brief demo of how to use scikit-learn's k-means implementation.
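A compact, hedged recreation of that demo might look like the following; the exact values of cluster_std, random_state and the marker styling are assumptions rather than the lesson's actual settings.

    # Hedged recreation of the demo: blobs, KMeans, predicted labels, cluster centres.
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, y = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

    kmeans = KMeans(n_clusters=3, init="k-means++", max_iter=300, n_init=10,
                    random_state=42)
    labels = kmeans.fit_predict(X)

    plt.scatter(X[:, 0], X[:, 1], c=labels)                  # points by cluster
    plt.scatter(kmeans.cluster_centers_[:, 0],
                kmeans.cluster_centers_[:, 1],
                s=200, c="red", marker="X")                  # the 3 centres found
    plt.show()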
5. DBSCAN - Density-Based Clustering: In this video you will study another very useful clustering algorithm called DBSCAN. We will see the advantages of using DBSCAN, its disadvantages, and finally a brief demo of how to use DBSCAN from scikit-learn in Python. So let's begin. DBSCAN stands for density-based spatial clustering of applications with noise. It was introduced by Ester, Kriegel, Sander and Xu in 1996, and a follow-up paper revisiting DBSCAN appeared in 2017. DBSCAN is a density-based clustering algorithm: given a set of points in some space, it groups together points that are closely packed. There are three main inputs. Obviously you need the data points to work with, and then there are the two main parameters. The first is epsilon, the radius of the neighbourhood around a point: taking that point as the centre, every point within distance epsilon of it, in any direction, belongs to its epsilon-neighbourhood. The second parameter is minPts: if a point has at least minPts points within its epsilon-neighbourhood (counting the point itself), that point is called a core point. Using these definitions we can divide the points into core points, border points and noise. As an example, take minPts = 3 and let epsilon be some fixed distance. If we draw the epsilon-neighbourhood around one point and it contains only that point, the point is, for now, noise. For another point, suppose its neighbourhood contains four other points plus the point itself, five in total, which is more than minPts = 3, so it is a core point; checking its neighbours the same way, some of them turn out to be core points too. A point that is not itself a core point but is connected to a core point (lies within epsilon of one) is called a border point, and border points belong to the same cluster as that core point. A point that is connected to no core point and no border point is noise, an outlier. This makes the algorithm very robust to noise; we will see more of this when we run the algorithm on a real example.
First let's understand what density-reachable means. If we have a point p and a point q, then q is directly reachable from p if and only if p is a core point and q is within epsilon of p. A border point belongs to the cluster of the core point it is connected to, but it cannot be used to extend the cluster: a point connected only to a border point does not thereby become a border point itself; it will be noise or perhaps part of some other cluster. If that intermediate point had been a core point instead, the outer point would become a border point and be included in the cluster. Likewise, if q lies beyond epsilon from p, it is not directly reachable from p; points are only directly reachable from core points. The next term is reachable: q is reachable from p if there is a chain of points p = p0, p1, p2, ..., pn = q such that each p(i+1) is directly reachable from p(i). Since a point is directly reachable only from a core point, every point in the chain except possibly the last must be a core point, with each consecutive pair within epsilon of each other; the last point q may or may not be a core point. Now let's label the core points and other points on an example, picking a suitable epsilon and minPts = 4, and marking core points red, border points blue and noise black. We start at a random point, draw a circle of radius epsilon around it, and see no other point inside, so we tentatively mark it noise. Then we visit another unvisited point and find four neighbours within epsilon plus the point itself, so its density is five, more than the required four: clearly a core point, so we mark it red. Its neighbours are part of the same cluster; they are at least border points, and they may even be core points, but we do not label them border yet, because we have not explored their own neighbourhoods. Instead we move to one of the neighbours, much as in a breadth-first traversal of a graph: look at all the neighbours, then expand them one by one. One neighbour has enough connections and is also a core point, so we expand its neighbours in turn; another neighbour is connected to only two points, so it is not a core point and we label it a border point. We backtrack to the parent node and continue with its remaining neighbours: some are core points, some are border points because they are merely connected to a core point. Then we move on to the remaining unvisited points: a couple of them are not connected to anything, so they are noise. In the end we get one cluster plus two or three noise points. If we had instead run something like k-means with k = 2 on this data, it would have split these points into two clusters, with a centre somewhere in each half, and the noise points would have been absorbed into one cluster or the other.
Now let's state the DBSCAN algorithm we have just seen on the example. First, compute the neighbours of each point and identify the core points. Then join neighbouring core points into clusters: core points that lie in each other's neighbourhoods become part of the same cluster. We are then left with the non-core points; some of them will be border points and some will be noise. For each such point, we try to attach it to a nearby core point: if it lies within epsilon of a core point we make it a border point and include it in that cluster, otherwise it automatically becomes noise. Let's run through this on another example to make it clear. Starting from one point, it is connected to three others, so with minPts = 4 (counting itself) it is a core point; some of its neighbours are core points too, so we label them. One point is connected only to a core point and has fewer than minPts neighbours, so it is a border point; similarly another point is a border point because it is only connected to a core point. Once this connected component is exhausted, we pick another unvisited point from the set; it is not connected to any other point, so it is noise, and likewise the next one. Then we find another core point with enough neighbours and grow a second cluster; a point connected to one of its core points becomes a border point of that cluster. In the end we have identified two clusters plus some noise. If we had used some other algorithm such as k-means with two clusters, it might have clustered one group of points into the first cluster and the rest into the second, with the centres somewhere in the middle, and it would not have detected the outliers, whereas this algorithm clearly identifies them. It has formed two clusters that look correct, matching what you would get if you clustered the data by hand, which is a good sanity check for the algorithm.
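To make the neighbour/core/border/noise procedure described above concrete, here is a minimal from-scratch sketch in its spirit. It is not the library implementation, and the default eps and min_pts values are placeholders.

    # Minimal sketch of the steps just described: find each point's eps-neighbourhood,
    # mark core points, grow clusters from core points, attach border points;
    # everything left over keeps the noise label -1.
    import numpy as np

    def dbscan(X, eps=0.5, min_pts=4):
        n = len(X)
        # Pairwise distances and eps-neighbourhoods (the point itself counts).
        dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
        neighbors = [np.where(dist[i] <= eps)[0] for i in range(n)]
        core = np.array([len(nb) >= min_pts for nb in neighbors])

        labels = np.full(n, -1)          # -1 = noise until proven otherwise
        cluster_id = 0
        for i in range(n):
            if not core[i] or labels[i] != -1:
                continue
            # Breadth-first expansion from an unvisited core point.
            labels[i] = cluster_id
            queue = list(neighbors[i])
            while queue:
                j = queue.pop()
                if labels[j] == -1:
                    labels[j] = cluster_id          # core or border point
                    if core[j]:
                        queue.extend(neighbors[j])  # only core points extend the cluster
            cluster_id += 1
        return labels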
Now let's look at the advantages of DBSCAN. First, we do not need to specify the number of clusters; it figured out the two clusters automatically. Second, it can find arbitrarily shaped clusters: even if some points lie on a circle with other points in the centre, the algorithm will work out that the outer ring is one cluster and the inner points another, provided you have chosen a good epsilon and minPts (if epsilon is very large everything falls into one cluster; if it is too small you get too many clusters). Third, it is robust to outliers: we saw that it was able to detect the noise. Fourth, it is mostly insensitive to the ordering of the points in the dataset. There are exceptions: a border point that lies within epsilon of core points belonging to two separate clusters may end up in one cluster or the other depending on the order in which the points are processed, but otherwise the result does not depend on the ordering. Now the disadvantages. DBSCAN needs some understanding of the data in order to choose a proper epsilon and minPts, and it may require some experimentation. It is not fully deterministic, for the same reason just mentioned: a border point may end up in one cluster or another depending on the processing order, although that is not a big deal. And it does not handle clusters with large differences in density. If we can clearly see two clusters, but the first is very dense with points very close together while in the second the points are spread further apart, it is difficult to pick one epsilon that works for both: a very large epsilon may merge them into one cluster, while a very small epsilon may keep the dense group intact but shatter the sparse one into many small clusters. In such cases picking a suitable epsilon may be difficult or impossible.
Now let's see this in a notebook. I will be using scikit-learn's DBSCAN implementation, so first the imports. We will use the two-moons dataset, which is very popular: the points lie along two interleaved half-circles, so there are two natural clusters. I also like to suppress warnings in these notebooks, so I ignore them; those lines are optional, the warnings are just long and become irritating. Let's run it and generate the data: we pick 100 points and add some noise (we will see the effect of adding noise and of leaving it out). Printing the first five elements shows that X is an array of 2-D points and the label array y contains zeros and ones, since the dataset has two clusters. We will not use those labels for clustering, because we are doing unsupervised clustering, but we can scatter-plot the data coloured by the true labels to see the correct clustering, and later plot our own result to see whether it clustered well. Now we train the model: we create a DBSCAN model with a value of epsilon and min_samples, which plays the role of the minPts we saw in the algorithm, and fit it on X. Printing the model's labels gives the clusters our model generated, and comparing with the original correct labels shows that it has also identified two clusters; in fact the labelling looks exactly like the original. When plotting, we pass the model's labels as the colour argument, otherwise all points would be drawn in the same colour; with the labels passed, the plot separates into the clusters found. We also plot the data coloured by the true labels from the dataset, and the two plots look exactly alike. Now let's add more noise and run it again (each run creates new points). With more noise you can see the difference: the true labelling still has just two moons, but the model finds four or five clusters, so it is no longer exactly right. If we remove the noise entirely it recovers the two clusters with no problem. With a small amount of noise it does a pretty good job: these points form one cluster, those form the second cluster, and you can see two noisy points labelled -1, probably because they were not connected to a border point or a core point at all; you can spot those two points in the plot. That is a brief demo of how you can use scikit-learn's DBSCAN. Try varying the parameters epsilon and min_samples and see how the result changes, try changing the number of sample points, try adding more noise points, try a different dataset, and play with it; you will get more clarity on how DBSCAN actually works. That's all for this video, thanks for watching.
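A hedged recreation of that notebook demo is sketched below, with illustrative values for the noise level, eps and min_samples (the lesson's exact numbers are not recoverable from the transcript).

    # Hedged recreation of the demo: two moons with noise, clustered by DBSCAN.
    import matplotlib.pyplot as plt
    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_moons

    X, y_true = make_moons(n_samples=100, noise=0.05, random_state=0)

    model = DBSCAN(eps=0.3, min_samples=5)   # eps / min_samples values are illustrative
    labels = model.fit_predict(X)            # noise points get the label -1

    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    axes[0].scatter(X[:, 0], X[:, 1], c=y_true)   # true moon labels
    axes[0].set_title("true labels")
    axes[1].scatter(X[:, 0], X[:, 1], c=labels)   # DBSCAN labels (-1 = noise)
    axes[1].set_title("DBSCAN labels")
    plt.show()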
6. Hierarchical Clustering Algorithm: In this video we will study hierarchical clustering methods: what hierarchical clustering is and how it differs from the other approaches. As the name says, there is a hierarchy, which you can see in the small diagram: given many data points, one approach is to find the two closest points and merge them, then merge the next closest pair, treat each merged pair as a single cluster, merge the closest clusters, and so on. There are two approaches. One is bottom-up: we start with each data point as a cluster of its own, then repeatedly look for the two nearest clusters and combine them into one, stopping at some point (we will see the stopping conditions later in the video). The second approach is top-down, or divisive: we start with all the data points in a single cluster and then recursively split it into multiple clusters, finally stopping somewhere. The first, agglomerative (bottom-up) hierarchical clustering, is more popular, and it is the one we will mainly study in this video; the divisive approach is used much less. The main operation in agglomerative clustering is combining the two nearest clusters into a larger cluster, but there are several questions we must answer before we can do this: how should a cluster be represented, what do we mean by the nearest clusters, and when should we stop? We start with each point as one cluster and keep merging the two nearest clusters, but if we keep going we eventually reach a single cluster containing everything. Should we go all the way, or stop somewhere? If we merge everything into one cluster, the clustering is meaningless, so we need to answer all of these questions.
Let's start with the representation: how do we represent a cluster with more than one point? One point on its own is easy, but once a cluster has several points we need a single summary of it. We begin with the simpler case of Euclidean space, and later we will see what has to change for non-Euclidean spaces. In Euclidean space we can use the notion of a centroid: if a cluster has, say, three points, its centroid is their average, which lies somewhere in the middle and need not coincide with any of the points. Each point is a vector of d coordinates (x1 through xd), and we take the element-wise sum and divide by the number of points. So in Euclidean space a cluster is represented by its centroid. Next, what is the notion of nearness? Continuing in Euclidean space, we measure the distance between two clusters as the distance between their centroids. Take an example with six points, numbered 1 to 6. Points 1 and 2 are the nearest pair, so we merge them into one cluster, represented by the centroid of 1 and 2; of the remaining five clusters, 3 and 4 are the next nearest, so we merge them and represent the result by their centroid. Next, the cluster {3, 4} and point 5 are nearest, so we merge them and compute the new centroid. Then the cluster {1, 2} is nearest to point 6, so we merge those, and finally we are left with two clusters, which we merge into one. So if we never stop, we end up with a single cluster, which is why we need to decide where to stop. The tree diagram on the right that records these merges is called a dendrogram, and dendrograms are used a lot in hierarchical clustering to understand what the algorithm is doing.
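For a hands-on counterpart to this worked example, SciPy's linkage and dendrogram functions carry out exactly this repeated merge-the-nearest-clusters process; the six coordinates below are made-up stand-ins for the points in the example.

    # Sketch of agglomerative (bottom-up) clustering with a dendrogram, using SciPy.
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

    # Six 2-D points standing in for the worked example above (coordinates made up).
    X = np.array([[1.0, 1.0], [1.2, 1.1],      # points 1 and 2
                  [4.0, 4.0], [4.2, 4.1],      # points 3 and 4
                  [4.5, 3.5],                  # point 5
                  [1.5, 0.5]])                 # point 6

    Z = linkage(X, method="centroid")          # repeatedly merge the two nearest clusters
    dendrogram(Z, labels=[str(i) for i in range(1, 7)])
    plt.show()

    # Cut the tree into two clusters instead of merging all the way to one.
    print(fcluster(Z, t=2, criterion="maxclust"))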
So far we have answered the representation and nearness questions for the Euclidean case. In a non-Euclidean space we cannot simply average the points to create a centroid; the average may not even be a valid point in that space. (In Euclidean space, joining two points gives a straight line you can subdivide evenly; in other spaces the notions of averaging do not work the same way.) So instead of a centroid we represent the cluster by one of its own points, chosen to be close to the other points; this representative is called the clustroid. In Euclidean space we called the representative the centroid; here it is the clustroid, an actual data point of the cluster picked to be closest to the other points. For computing the distance between clusters we then treat the clustroid as if it were the centroid. That leaves the question of what "closest to the other points" means, and there are several reasonable criteria for selecting the clustroid. One option is to look, for each point, at its maximum distance to the other points in the cluster, and pick the point for which this maximum is smallest. Another is to pick the point with the smallest average distance to the other points. A third is to pick the point with the smallest sum of squared distances to the other points (in Euclidean space this criterion would have given us something close to the centroid). Whichever criterion we choose, the selected point is the clustroid and is said to be close to all the other points.
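Here is a small NumPy sketch of the three clustroid criteria just listed, applied to a made-up four-point cluster.

    # Sketch: picking a clustroid under the three "closeness" criteria described above.
    import numpy as np

    points = np.array([[0.0, 0.0], [1.0, 0.2], [0.8, 1.1], [3.0, 2.5]])  # toy cluster
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)

    clustroid_max = np.argmin(dist.max(axis=1))         # minimise the maximum distance
    clustroid_avg = np.argmin(dist.mean(axis=1))        # minimise the average distance
    clustroid_sq  = np.argmin((dist ** 2).sum(axis=1))  # minimise the sum of squared distances

    print(clustroid_max, clustroid_avg, clustroid_sq)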
We are left with the last and one of the most important questions, the termination condition: when should we stop merging? In agglomerative clustering we keep merging the nearest clusters and keep reducing their number, so when do we stop? One approach is to pick the number of clusters k up front. This is useful when you know how many groups you want; for example, you know you want to divide a given market segment for a product into three segments, heavy, medium and light users, so you pick k = 3, keep doing the hierarchical clustering, and stop when you reach the required number of clusters. A second approach is to stop when the next merge would create a bad cluster, using some measure of cohesion, or goodness, of a cluster: we keep merging until the next merge would fall below a cohesion threshold. Cohesion can be defined in different ways. One is the diameter of the merged cluster, the maximum distance between any two points in it: compute all pairwise distances, take the maximum, and stop if the diameter of the merged cluster would exceed a threshold. Another is the radius, the maximum distance of a point from the centroid or clustroid. You can also take a density-based approach: define density as the number of points divided by a volume, for example the number of points divided by some power of the radius or diameter, as the application requires, and keep merging as long as the density stays above a threshold. You will have grouped the densest points first, then the next densest clusters, and when a merge would produce a cluster whose density drops below the threshold, you stop; that is the density-based stopping rule. Finally, let's look at the implementation of hierarchical clustering. The naive implementation computes, at each step, the pairwise distances between all pairs of clusters: at the first step there are n points, so n(n-1)/2 pairs, which is O(n^2); at the next step there are n-1 clusters, again on the order of n^2 pairs; then n-2, and so on, so overall the naive implementation takes O(n^3) time. However, most of those distances do not change from one step to the next; only the distances involving the newly merged cluster's centroid or clustroid need to be recomputed. With a careful implementation using a priority queue you can reduce the complexity from O(n^3) to O(n^2 log n). That is still expensive: when n is very large, n^2 log n is a big number. So for very large datasets it is generally not recommended to use hierarchical clustering; we would look at other approaches instead.