Machine Learning Optimization Using Genetic Algorithm | Dana Knight | Skillshare

Machine Learning Optimization Using Genetic Algorithm

Dana Knight


25 Lessons (2h 6m)
    • 1. Introduction and Course Scope (1:48)
    • 2. Machine Learning (7:45)
    • 3. Support Vector Machine (4:39)
    • 4. Neural Networks (11:11)
    • 5. Optimization (5:29)
    • 6. Metaheuristics (2:41)
    • 7. Genetic Algorithm #1 (2:35)
    • 8. Genetic Algorithm #2 (7:33)
    • 9. Genetic Algorithm #3 (4:35)
    • 10. Genetic Algorithm #4 (5:08)
    • 11. Dataset (1:54)
    • 12. Support Vector Machine Optimization #1 (5:57)
    • 13. Support Vector Machine Optimization #2 (7:37)
    • 14. Support Vector Machine Optimization #3 (7:42)
    • 15. Support Vector Machine Optimization #4 (6:40)
    • 16. Support Vector Machine Optimization #5 (6:28)
    • 17. Support Vector Machine Optimization #6 (5:58)
    • 18. Support Vector Machine Optimization #7 (7:31)
    • 19. Multilayer Perceptron Neural Network Optimization #1 (3:09)
    • 20. Multilayer Perceptron Neural Network Optimization #2 (4:39)
    • 21. Multilayer Perceptron Neural Network Optimization #3 (5:16)
    • 22. Multilayer Perceptron Neural Network Optimization #4 (6:25)
    • 23. Feature Selection #1 (3:28)
    • 24. Feature Selection #2 (5:19)
    • 25. Feature Selection #3 (7:00)
  • --
  • Beginner level
  • Intermediate level
  • Advanced level
  • All levels
  • Beg/Int level
  • Int/Adv level

Community Generated

The level is determined by a majority opinion of students who have reviewed this class. The teacher's recommendation is shown until at least 5 student responses are collected.

205 Students

About This Class

In this course, you will learn what hyperparameters are, what the Genetic Algorithm is, and what hyperparameter optimization is. You will apply the Genetic Algorithm to optimize the performance of Support Vector Machines and Multilayer Perceptron Neural Networks. Hyperparameter optimization will be done on a regression dataset for the prediction of cooling and heating loads of buildings. The SVM and MLP will first be applied to the dataset without optimization, and their results will be compared to the results after optimization.

By the end of this course, you will have learnt how to code the Genetic Algorithm in Python and how to optimize your Machine Learning algorithms for maximal performance. The ideal student is someone with some knowledge of Machine Learning algorithms and some prior knowledge of optimization; some prior knowledge of coding will help too.

Meet Your Teacher

Teacher Profile Image

Dana Knight

Teacher

Hi! I'm Dana. I'm currently a PhD student in Industrial Engineering. I finished my B.S. in Architectural Engineering and my M.S. in Industrial Engineering. Lean Six Sigma Green Belt certified. I enjoy learning new things. My research interest is Data Science including Deep Learning, Machine Learning, and Artificial Intelligence.

See full profile



Transcripts

1. Introduction and Course Scope: Welcome, everyone. My name is Dana and I'll be your instructor for the Machine Learning Optimization Using Genetic Algorithm course. The first thing we'll cover is machine learning, just as a brief introduction; we're not going to go deep with it. Then we're going to learn two very widely used machine learning algorithms, which are the Support Vector Machine and the Multilayer Perceptron Neural Network. We'll also cover optimization and why it is necessary to find optimal or near-optimal solutions, and I will also cover the difference between optimal and near-optimal solutions. Then we'll discuss metaheuristics and why they are sometimes chosen over traditional optimization methods; one of those, which we are going to implement, is the genetic algorithm. Lastly, we will implement the genetic algorithm as an optimization method to find the best hyperparameters for both the Support Vector Machine and the Multilayer Perceptron Neural Network. We'll use it on a regression problem where we will predict the cooling and heating loads of buildings based on several independent variables, or, as you may call them, features. Now, in this course, you are assumed to know the basics of three things. First of all, you are assumed to know the basics of machine learning. You are also assumed to know the basics of operations research, which is optimization. And lastly, you are assumed to know the basics of Python coding. So I'll see you in the next lecture. Thank you. 2. Machine Learning: Welcome back. In this lecture we're not going to go too deep into machine learning; we're just going to touch the surface of it, because it is assumed that you know all about machine learning, or at least the basics of it. So machine learning is one of the biggest fields of computer science. Machine learning is a type of artificial intelligence that provides computers with the ability to learn without being explicitly programmed.
Machine learning focuses on the development of computer programs that can teach themselves to grow and change when exposed to new data. It studies computer algorithms for learning to do stuff. We might, for instance, be interested in learning to complete a task, or to make accurate predictions, or to behave intelligently. The learning that is being done is always based on some sort of observations or data, such as examples; the most common case is direct experience or instruction. So in general, machine learning is about learning to do better in the future based on what was experienced in the past. In the 1950s, artificial intelligence came along, where it was about creating smart machines; then, decades later, machine learning flourished, and decades after that, deep learning just recently boomed, which mostly talks about neural networks. Actually, it's all neural networks: you have convolutional neural networks, recurrent neural networks, Boltzmann machines, and others. Deep learning is mostly about image recognition, pattern recognition, voice recognition, sentiment analysis, and so on. The three types of machine learning are supervised learning, unsupervised learning, and reinforcement learning. With supervised learning, you have Class A and Class B, and you know which observations belong to each class. Imagine Class A is patients who have cancer and Class B is patients who turned out not to have cancer. If you have a new observation, you ask: does this observation have cancer or not? And based on this separation, this hyperplane, you can tell, say, that this patient does not have cancer. So with supervised learning, the data are labelled: each observation has a class label. With unsupervised learning, there's no Class A or Class B.
There are no labels, no targets. You give the machine learning algorithm a bunch of data, and it starts to cluster and group those data, the observations that you give it, into clusters that look alike, that are similar, that share the same information. For example, here you have yellow points that look the same, then red, orange, green, and blue. So you give it data and say, hey, what can you make out of this data? And it starts to cluster, bringing together all the data that look alike and share similar features. Like on Facebook: have you ever been targeted with an ad that is relevant to you? The reason is that you are clustered with other people who have the same features and the same interests as you, and you as a cluster, you and the other people who share similar features, are targeted with similar ads. There are many, many clusters, and each cluster is targeted with ads that are specific to its interests. Reinforcement learning is like self-driving cars: based on the previous move, the system adapts and changes its next move, using the previous move and the environment around it. So it learns based on previous moves. Now, for supervised learning, the most commonly used tasks are classification and regression. Classification is like the example where we had patients who have cancer and patients who don't, Class A and Class B. As another example, let's say we have a set of emails classified into spam emails and not-spam emails, and we already have this historic data: there was an email with six exclamation points, and it turned out to be spam; an email with two exclamation points turned out not to be spam; and so on. So after you create the model based on this data, you ask the model: if an email had eight exclamation points, would it be spam or not spam?
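To make the spam example concrete, here is a minimal Python sketch. Only the six-exclamation-point spam email and the two-exclamation-point non-spam email come from the lecture; the other rows and the simple midpoint "model" are made up for illustration, not the course's actual method.

```python
# Toy history of emails as (exclamation_points, is_spam); only the first
# two rows come from the lecture, the rest are invented for illustration.
history = [(6, True), (2, False), (7, True), (1, False), (5, True), (0, False)]

spam_counts = [x for x, s in history if s]
ham_counts = [x for x, s in history if not s]

# A very simple "model": the midpoint between the two class averages.
threshold = (sum(spam_counts) / len(spam_counts)
             + sum(ham_counts) / len(ham_counts)) / 2

def predict_spam(exclamation_points):
    """Classify a new email against the learned threshold."""
    return exclamation_points > threshold

print(predict_spam(8))  # True: eight exclamation points looks like spam
```

Even this crude classifier answers the lecture's question: an email with eight exclamation points falls on the spam side of the boundary.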
So this is for classification tasks: you have discrete labels, like spam/not spam, tumor/not tumor, yes/no, pass/fail. With regression, you have numerical output. Let's say you have an x and a y, two variables: the dependent variable and the independent variable. Say you have historic data: there was a house with two bedrooms whose price was $100,000, and a house with four bedrooms that had a price of $150,000, and so on. So you ask it: if I had a house with three bedrooms, how much, on average, would the price be? Now, regression and classification are for supervised learning. Regarding unsupervised learning, just as we said, you can cluster the data into clusters that have similar features, and you decide the number of clusters: you say, okay, give me three clusters, and it will separate the data into three clusters. You can ask for two, you can ask for four, you can ask for as many as you want, but there's always an optimal answer, an optimal K, the number of clusters. You find it using the elbow diagram, or the elbow method, which shows you the K that best explains the information and the variance with that number of clusters. But we're not going to get into that; I just wanted you to know about it. So what we did in this lecture: we talked about what machine learning is; we discussed the timeframe of artificial intelligence, machine learning, and deep learning; we discussed the three types of machine learning, supervised versus unsupervised versus reinforcement learning; we talked about what classification is and gave an example; we talked about regression and also gave an example; and lastly, we gave a very brief example about clustering. I hope to see you in the next lecture. Thank you.
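The house-price regression example can be sketched with a simple straight-line fit. The two data points come from the lecture; the use of `numpy.polyfit` is just one convenient way to fit the line, not necessarily what the course uses.

```python
import numpy as np

# Historic data from the lecture: (bedrooms, price).
bedrooms = np.array([2, 4])
price = np.array([100_000, 150_000])

# Fit a straight line: price = a * bedrooms + b (simple linear regression).
a, b = np.polyfit(bedrooms, price, deg=1)

# Ask the model: what would a 3-bedroom house cost, on average?
predicted = a * 3 + b
print(predicted)  # approximately 125000.0
```

With only two points the line passes through both exactly, so the three-bedroom prediction lands halfway between the two known prices.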
3. Support Vector Machine: Welcome back. This lecture will cover support vector machines. Support vector machines are very widely used; they are supervised learning algorithms used for both classification and regression. Here's how it works: let's say you have Class A and Class B, and you want to find the hyperplane that separates the two classes, right? You want to find the hyperplane where the margin between the two classes is at its maximum. The points which help define this hyperplane are called support vectors, because they support and define the hyperplane. If the hyperplane were positioned at a different angle, then the margin between the two classes would not be the maximum you can find. So support vector machines help you find the hyperplane where the margin, the gap between the classes, is at its maximum. Sometimes you can get some things misclassified, like a red point misclassified among the blue and a blue point misclassified among the red. It can happen, because nothing is 100% accurate; you can't find an algorithm that predicts with 100% accuracy. Sometimes you have a problem that is linearly separable, where one simple linear hyperplane can separate the two classes, but sometimes you have problems that are non-linearly separable. For those, you use something called a kernel, which we will cover in the next slide. For now, the hyperparameters of the SVM: the SVM algorithm has a lot of hyperparameters, but the four most used are these. First, kernels: kernels are methods used for pattern analysis; a kernel is a similarity function that takes two inputs and outputs how similar they are. The degree, used when the kernel is set to poly, is the degree of the polynomial kernel function.
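The kernel and degree hyperparameters map directly onto scikit-learn's `SVC`, assuming scikit-learn is the library in use (the course's own implementation may differ); the one-feature toy data below is made up for illustration, and C and gamma, covered next, are left at their defaults.

```python
from sklearn.svm import SVC

# Toy data: one feature, two well-separated classes.
X = [[0.0], [1.0], [9.0], [10.0]]
y = [0, 0, 1, 1]

rbf_model = SVC(kernel="rbf")              # radial basis function kernel
poly_model = SVC(kernel="poly", degree=3)  # polynomial kernel of degree 3

rbf_model.fit(X, y)
poly_model.fit(X, y)
print(rbf_model.predict([[9.5]]))  # a point near the class-1 cluster
```

Swapping the `kernel` argument is all it takes to change the shape of the decision surface, which is exactly the role the lecture assigns to the kernel.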
Now, the C hyperparameter is a regularization parameter that determines how correctly the hyperplane separates the data; it controls the trade-off between achieving a low error on the training data and minimizing the norm of the weights. And lastly, gamma is the parameter of the Gaussian kernel: it defines how far the influence of a single training observation reaches, with low values meaning far and high values meaning close. Now, when we said that classes can sometimes be non-linearly separable, something called the kernel trick is used. If the data is not linearly separable, you can't separate the two classes with a line, so what you do is apply the kernel trick, and now they are linearly separable, with this being the decision surface, or hyperplane. For example, you can have a linear kernel, an RBF kernel, which is the radial basis function kernel, and a polynomial kernel with degree three, and you can see that they are all different from each other. So the kernel plays a huge role in shaping the classification, in finding that hyperplane between the classes. The most commonly used kernel is the radial basis function kernel, and that's what we're going to use in our implementation of the SVM. With gamma and C too, as you change them, the hyperplane between the two classes starts changing: you can see what happens with gamma set to a relatively high number, and also C, and how it differs when both gamma and C are set to a low number. So I'll see you in the next lecture. Thank you. 4. Neural Networks: Welcome back. This lecture will discuss neural networks. Neural networks are modeled after the human brain: in the brain, you have billions of neurons, and these neurons are connected to each other by axons.
The same is true of artificial neural networks, the ones that are machine learning algorithms: they consist of nodes, which act as neurons, and these nodes are connected to each other through edges, like the human brain with its axons. The beautiful thing about neural networks, which distinguishes them from many other algorithms, is that neural networks are capable of capturing complex, nonlinear models. The three components that make up a neural network are the input nodes, the hidden nodes, and the output nodes: the input layer, the hidden layer, and the output layer. The input layer corresponds to the features in your dataset; the number of input nodes is the number of features. Then you have hidden nodes, where the number of nodes is as many as you want (of course, there are rules of thumb), and you can have as many hidden layers as you want. The output node is the output, for example the probability of something being a zero or a one. All of these nodes are connected to each other through edges, and on each edge there exists a weight. Now, we said that neural networks are capable of capturing complex, nonlinear models. To process them, something called an activation function is used: calculations happen in the hidden nodes, and then a prediction comes out at the output node. So, for example, let's say you have two features, hours of sleep and hours of study, and you want to predict the grade you will get depending on how many hours of sleep and how many hours of study you had. Sometimes your network can have several hidden layers, and with that it starts to be called deep, as in deep neural networks. Now, for simplicity, let's say you have one hidden node.
So you have one hidden node here. This is the first input, from the first feature, and it has its associated weight; the second input is associated with another weight; and so on. What happens is you multiply the first feature by its weight, plus the second feature multiplied by its weight, plus the third feature multiplied by its weight, up until x_m multiplied by w_m, and you add all of these together to get the value in this node. Now, to pass it to the next layer, or to the output, the activation function is applied, which transforms this number into a number between zero and one. Because at the end of the day, you want something between zero and one for probabilities: probabilities can never be under zero and can never be above one, or 100%. So you want it to spit out a number between zero and one, and you transform this number into something between zero and one using the activation function. Now, as our example, let's use the neural network to model this historic data: say if you had eight hours of sleep and ten hours of study, you got a grade of 95; somebody else with five hours of sleep and seven hours of study got a grade of 84; and so on. So your question would be: we have a student who slept eight hours and studied eight hours; how much would you predict their grade to be? You have the hours of sleep and hours of study as inputs, let's say three hidden nodes in one hidden layer, and the grade as output. For the first hidden node, you have w11 and w21, where w11 connects the first feature to the first node and w21 connects the second feature to the first node. You multiply the input, say eight hours, by its weight (usually the weight is set at random; actually, always at random), plus the other input multiplied by its weight.
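The weighted sum for one hidden node, and the activation that squishes it into (0, 1), can be sketched like this. The feature values match the lecture's example; the weights are arbitrary stand-ins for the random initial weights.

```python
import numpy as np

def sigmoid(z):
    """Activation function: squishes any number into the (0, 1) range."""
    return 1.0 / (1.0 + np.exp(-z))

# One hidden node: multiply each feature by its weight and add the products.
x = np.array([8.0, 10.0])    # hours of sleep, hours of study
w = np.array([0.04, 0.035])  # stand-ins for w11 and w21 (random in practice)

node_value = float(x @ w)        # 8*0.04 + 10*0.035 = 0.67
activated = sigmoid(node_value)  # a number between zero and one
print(node_value, activated)
```

The same multiply-and-add happens at every hidden node, each with its own weights, before the activation is applied.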
So you have the value inside the first hidden node, and then you do the same thing for the second hidden node, and the same for the third. Now, in order to avoid overfitting, you add something called the bias, which has its own weights: you take this input times this weight, plus this input times this weight, plus the bias times its weight, and add them to get the value in the node. And to bring these values between zero and one, you apply an activation function on them. This one is called a sigmoid activation function, and it squishes the number down to something between zero and one. You do the same thing in the output layer, where you have a bias as well to avoid overfitting: this transformed value times w31, plus this value times w32, plus this one times w33, plus the bias, gives you the grade, and you apply the activation function so you get something between zero and one. That is either the output itself, if you have a regression problem, or, if you have a classification problem, the probability of belonging to a certain class, like the probability of belonging to class zero. There are several activation functions; the most commonly used are the logistic, which is the sigmoid we just talked about, where you squish the value between zero and one, and ReLU, the rectifier or rectified linear unit, which gives you the maximum of zero and z. So this is the number that would be output. Now, I forgot to mention: after you get the grade here, the grade would be way off target, right? Because you cannot get something on target, or close to it, on the first try. So what happens is you go back and adjust the weights based on the error. You go back.
You adjust the weights, then you retrain the model, and then you check the error again, comparing the output to the actual number it's supposed to be, and you calculate the error, the difference, the error function, or the cost. Then you go back and retrain the model, and you go back again, adjust the weights, and retrain, and you keep doing this until your prediction can no longer be improved. When your prediction can no longer be improved is when you stop, and that set of weights is used for future predictions. Now, when you go back and adjust the weights, this is called backpropagation: you go back, you backpropagate, you adjust the weights, and then you retrain the model. The way the weights are adjusted is based on something called gradient descent. Let's say this is your error function and this is your weight. You want to find the weight where the error is lowest, right? So what it does is use the slope of the function, the gradient. If you are at a point right here, where w is equal to this value, you want to go somewhere that gives you a lower error. So you go downhill, following the slope of the function, the gradient, and you keep descending until the slope reaches zero; if you went any further, the error would start going up again. So you want to arrive at the place where the slope is zero: you keep descending until you find the location where the slope is at zero, and that's where you are at the minimum possible error you can find. Now, there are five very commonly used hyperparameters of neural networks.
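Before those hyperparameters, here is a minimal sketch of the gradient descent loop just described, on a one-weight error function. The quadratic E(w) = (w - 3)^2 is an illustrative stand-in, not the network's actual cost.

```python
# Minimize E(w) = (w - 3)^2, whose slope (gradient) is dE/dw = 2 * (w - 3).
# The minimum, where the slope is zero, is at w = 3.
w = 10.0              # start somewhere on the curve
learning_rate = 0.1   # step size of each downhill move

for _ in range(200):
    gradient = 2 * (w - 3)             # slope of the error function at w
    w = w - learning_rate * gradient   # step downhill, against the slope

print(round(w, 4))  # converges to 3.0, where the slope is zero
```

Each step moves the weight against the slope, so the updates shrink as the slope flattens and stop changing once the slope is effectively zero.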
You have the number of hidden layers: how many hidden layers there are, since you can have several. Then, how many nodes are in each layer. So: number of hidden layers, number of hidden nodes, and then the optimizer. We said that the way it optimizes the weights is using gradient descent; you can have other methods to optimize the weights as well. And then you have something called the learning rate and momentum. To make it very easy, you can think of the learning rate and momentum as the direction and step size of the move you are making in the weights, whether it's a large step size or a small one. And just as an FYI: if you have a small learning rate, with very small steps, it will take you a very long time to converge to a local or global minimum; and if you have a large learning rate, you could overshoot and miss the minimum. So you need to find a balance, and that's what optimization is for: we're going to find the optimal set of hyperparameters, including momentum and learning rate. Thank you. 5. Optimization: Welcome back. This lecture will talk about optimization. Remember in the last lecture we talked about gradient descent, where you have one weight that you need to optimize and you want to find the minimum, the least amount of error it could achieve: you go down the slope of the function using the gradient, and you find the place where the slope is zero, where you get the least error. Now let's say you have one weight you want to optimize, and you want to try 1000 values. Imagine finding the best combination when you have two weights: that's 1000 times 1000, one million combinations of solutions, for only two weights.
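The growth in the number of combinations can be checked directly; 1000 values per weight is the lecture's illustrative figure.

```python
# Trying 1000 candidate values per weight means 1000**k combinations
# for k weights: the search space multiplies with every weight added.
values_per_weight = 1000

for num_weights in (1, 2):
    combos = values_per_weight ** num_weights
    print(num_weights, "weight(s):", combos, "combinations")
```

One weight gives a thousand candidates, two weights already give a million, and each additional weight multiplies the total by another thousand.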
And when you increase that to three weights, 1000 times 1000 times 1000, you will have one billion combinations between those three weights. So how exactly can you find the set that would give you the least amount of error from those three weights, the three weights working together in a way that achieves the lowest error? A billion combinations is a lot of combinations. It could take more time than has passed since the universe was born; literally, that's how much time it would take. That's how scientists have put it: it would take more time than the time from the birth of the universe up until now. And there's another reason why we need optimization. Say this is a complex problem: if there were only one minimum, you could find it, that one location. But what if your search space looked something like this, and you ended up getting stuck here? You think you've reached a global minimum, or at least a good one. Imagine this is your error function: you think you've reached a low error, but in fact, if you go through your search space, you can find a better one, a place where the error is even lower, and you never know, maybe far away in your search space there are even lower errors. Now, here we're talking about one weight, so one dimension. Imagine if you had 100 weights to optimize: not only does the number of combinations get huge, but you also cannot picture 100 dimensions; you can picture 1-D, 2-D, at most 3-D, which is three weights. So as the number of weights and the number of combinations of weights increase, the more complex it gets. So you need to use methods such as metaheuristics to find these solutions for you, to find what's called a local minimum, as opposed to the global minimum; a local minimum is a near-best or good solution, but not the best.
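The getting-stuck behaviour can be seen on a small error function with two valleys. The function below is invented for illustration: it has a shallow local minimum near w = +1 and a deeper, global one near w = -1, so plain gradient descent ends up wherever it started.

```python
# An error function with two valleys: a local minimum near w = +1
# and a deeper (global) minimum near w = -1.
def error(w):
    return (w * w - 1) ** 2 + 0.3 * w

def slope(w):
    return 4 * w ** 3 - 4 * w + 0.3   # derivative of the error function

def descend(w, learning_rate=0.01, steps=1000):
    """Plain gradient descent: always walks downhill from the start point."""
    for _ in range(steps):
        w = w - learning_rate * slope(w)
    return w

stuck = descend(1.5)   # starts in the right valley and stays there
best = descend(-1.5)   # starts in the left valley, reaches the deeper minimum
print(round(error(stuck), 3), round(error(best), 3))
```

Both runs stop where the slope is zero, but only one of those points is the global minimum; escaping the wrong valley is exactly what metaheuristics are for.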
So metaheuristics help us find either good local solutions, local minima, or the global minimum. Now, we have a famous problem called the P versus NP problem, which is yet to be solved. P is short for polynomial problems; NP is short for nondeterministic polynomial problems. When we say polynomial problems, it means they can be solved in polynomial time. The simplest examples of polynomial problems are subtraction, addition, division, and multiplication; these are examples of P problems, because you can solve them in polynomial time. A polynomial-time algorithm is one whose operational complexity is proportional to n to the power of c, where n is the size of the input and c is a constant. Non-polynomial problems increase exponentially, like two to the power of n. So the more weights you give the algorithm to find, the more it grows, and it grows exponentially. That's why finding the optimal solution for a neural network is considered an NP problem: the more weights you have, the more the problem starts to get exponentially bigger. Now, we said that P problems can be solved in polynomial time, but NP problems cannot. However, if you have a solution to an NP problem, you can verify that solution in polynomial time; you just cannot find it in polynomial time. I'll see you in the next lecture. Thank you. 6. Metaheuristics: Welcome back, everyone. This lecture will cover what metaheuristics are. A heuristic is a technique which seeks optimal or near-optimal solutions at a reasonable computational cost. Metaheuristics are heuristics, often inspired by nature, and the beautiful thing about them is that they are not problem-specific: you can apply metaheuristics to just about any problem you want.
We talked in the last lecture about P versus NP problems, where NP problems are much more difficult to solve than P problems, and we also talked about the size of the problem: when you have a large problem, traditional deterministic methods of optimization won't work; you need something like metaheuristics to help you solve those optimization problems, where you want to find the global optimum or global minimum, the best solution to the problem. We talked about the error function, and we want the error to be the lowest, so that's a minimization problem. With minima, you have local ones, but you have a global one too, and the global is the best answer. Sometimes, if the problem is very, very large, metaheuristics won't help us reach the global minimum, but they will help us reach a good local one, which we can be satisfied with. Because if we applied traditional optimization tools, trust me, we wouldn't even be able to hit a local minimum, ever; it would be very, very slow, the convergence would be very slow, and we would be way off target. That's why we use metaheuristics. Commonly used metaheuristics are the genetic algorithm, tabu search, ant colony optimization, particle swarm optimization, and simulated annealing, and the one we are covering in this lecture and this course is the genetic algorithm. Now, as an example, let's say we have a minimization problem. If we had a small problem, we could just calculate the objective function, the solution, for each combination and choose the minimum, right? But the problem is that in real life, the number of possible solutions would be way too big to calculate all the possible answers, and this is why we need metaheuristics. Thank you. 7. Genetic Algorithm #1: Welcome back.
So this is the first lecture on the genetic algorithm. The genetic algorithm was developed by John Holland in the 1970s, and it's inspired by nature, most specifically evolution: natural selection, reproduction, and survival of the fittest. The concept of the genetic algorithm is that you have parents, and those parents, through crossover, mutation, and selection, produce offspring, and these offspring end up becoming the new population; then the cycle repeats itself. So you have parents, you apply crossover to get children, you apply mutation on those children to get mutated children, and then you have a new population, which is a new set of solutions that goes to the next generation. If it doesn't seem clear right now, it will as we go through the lectures. So the genetic algorithm is a population-based search. You encode solutions in a binary setting, especially if you have a numerical, continuous problem: you encode them as zeros and ones. This is called a chromosome; each position in it is called an allele, and alleles are encoded as genes. You take two parents, and these two parents produce two children; again, all of this will become clear in the next lecture and the one after it. When we say that the genetic algorithm is a population-based search, it means that it uses a population of solutions in each iteration, or generation: it evaluates each potential solution and iteratively selects solutions to produce other solutions. Genetic algorithms, evolution strategies, and particle swarm optimization are metaheuristics under population-based search. So let's say you have a set of parents: you take two parents out, you do crossover, you do mutation, and then you get a new set of parents, a new set of solutions.
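The encoding just described can be sketched in a few lines; the chromosome length and population size below are arbitrary choices for illustration.

```python
import random

# Each solution (chromosome) is a list of bits (alleles), and a
# population is a set of such solutions.
random.seed(0)

CHROMOSOME_LENGTH = 8   # bits per solution (illustrative size)
POPULATION_SIZE = 6     # number of parent solutions per generation

def random_chromosome(length):
    """One candidate solution, encoded as zeros and ones."""
    return [random.randint(0, 1) for _ in range(length)]

population = [random_chromosome(CHROMOSOME_LENGTH)
              for _ in range(POPULATION_SIZE)]
for chromosome in population:
    print(chromosome)
```

Each generation, pairs of these chromosomes are selected as parents and recombined, which is what the next lecture's crossover and mutation operate on.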
So with a genetic algorithm you have a population of solutions, and you try to find the solution that gives you the minimum answer — the global minimum. Again, all of this will become clearer in the next lectures. Thank you. 8. Genetic Algorithm #2: Welcome back. As a reminder, the genetic algorithm works on reproduction. It keeps a set of solutions, called parents. You take two solutions out of that set through some selection mechanism, you cross over those two solutions, you apply mutation on the crossed-over solutions, and so you obtain a new set of solutions, which goes to the new generation. What happens is that you converge from the set of solutions you started with to a very different — and hopefully better — set of solutions at the end. The terminology you need to know is crossover, mutation, elitism, fitness value, and selection. For crossover, say you have parent one and parent two and you want to produce two children, child one and child two. In two-point crossover you select two points at random and swap the segment between them: the middle segment of parent one goes into child two, and the middle segment of parent two goes into child one, while everything outside the two cut points stays with the original parent. So child one is parent one with parent two's middle segment, and child two is parent two with parent one's middle segment. The idea of crossover is that you want to keep the solution mostly as it is and change it only slightly — that's why we exchange just a part of it. We don't want to lose a lot of information: once we have converged to good solutions, we don't want to break a whole solution apart; we want a different, maybe better, solution. So we apply crossover like this, and only a part of the solution changes.
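The two-point crossover just described can be sketched as a small function (a minimal illustration with made-up all-ones and all-zeros parents, not the course's exact code):

```python
import random

def two_point_crossover(parent1, parent2):
    """Swap the segment between two random cut points."""
    n = len(parent1)
    c1, c2 = sorted(random.sample(range(1, n), 2))   # two distinct cut points
    child1 = parent1[:c1] + parent2[c1:c2] + parent1[c2:]
    child2 = parent2[:c1] + parent1[c1:c2] + parent2[c2:]
    return child1, child2

p1 = [1, 1, 1, 1, 1, 1, 1, 1]
p2 = [0, 0, 0, 0, 0, 0, 0, 0]
child1, child2 = two_point_crossover(p1, p2)
print(child1, child2)   # each child keeps its parent's ends, swaps the middle
```

Because only the middle segment is exchanged, most of each parent survives into its child — exactly the "change it only slightly" idea above.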
Now, we just talked about two-point crossover. Another example is single-point crossover: you pick one cut point, and one whole side gets exchanged between the two parents — you can swap either side, but in the end one side is exchanged. For multi-point crossover, you select multiple points and apply the same kind of exchanges between them. So that's crossover: it's how you produce your two children from your two parents. Now, on the children you apply something called mutation, at a certain probability. Crossover also happens with a certain probability, but the crossover probability is usually set to one, while the mutation probability is usually set to a small number. What happens with mutation is that you go through each bit individually and draw a random number; if that random number is lower than the probability of mutation, the bit gets mutated — a one becomes a zero and a zero becomes a one. If the random number is not less than the mutation probability, you don't mutate, and you simply move to the next bit, where you again draw a random number and apply the same check.
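The per-bit mutation step just described could be sketched like this (an illustration with a made-up child string; the 0.2 mutation probability matches what this course uses later):

```python
import random

def mutate(child, p_m=0.2):
    """Flip each bit independently with probability p_m."""
    mutated = []
    for bit in child:
        r = random.random()              # draw a random number for this bit
        if r < p_m:
            mutated.append(1 - bit)      # mutate: 1 -> 0, 0 -> 1
        else:
            mutated.append(bit)          # leave the bit as it is
    return mutated

random.seed(0)
print(mutate([0, 1, 0, 1, 1, 0, 0, 1]))
```

With p_m = 0 nothing changes and with p_m = 1 every bit flips, which is a handy sanity check.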
So, again, at each bit: is the random number less than your mutation rate? If less, mutate; if more, go on to the next bit, and so on. That is how mutation happens. Now, elitism: it means the organism — the child, mutated child, or parent — that had the best fitness value gets to live on to the next generation. You can take that individual out and put it into the next generation without any alteration: it doesn't get mutated or crossed over, you just carry it over as it is. Now, we said you need to select two parents from the set of solutions, and there are certain ways you select those parents. The most commonly used are something called roulette wheel and something called tournament selection. For roulette wheel: say you have a set of solutions. If you have a minimization problem you want the fitness value that is the lowest, and if you have a maximization problem you want the fitness value with the highest number. You assign each individual a probability based on its fitness value: an individual with a fitness of 80, for example, gets a big slice of the wheel, 70 a little smaller, 50 smaller still, and so on — each slice is a parent candidate. Then, based on the roulette wheel, you select your parent: those with good fitness values have a higher chance of being selected. The other method is tournament selection, where from a population of solutions you randomly select, say, three — you can pick two, three, four, or whatever you want. If it's a maximization problem, you choose your champion, the one with the highest number; if it were a minimization problem, the lowest value — the five in this example — would win.
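Tournament selection as just described might be sketched like this (made-up error values, minimization with k = 3 contestants, matching the setup used later for the SVM error):

```python
import random

def tournament_select(population, errors, k=3):
    """Pick k contestants at random; the one with the lowest error wins."""
    contestants = random.sample(range(len(population)), k)
    winner = min(contestants, key=lambda i: errors[i])
    return population[winner]

pop = ["sol_A", "sol_B", "sol_C", "sol_D", "sol_E"]
err = [0.30, 0.05, 0.22, 0.40, 0.11]   # made-up errors; lower is fitter
parent1 = tournament_select(pop, err)
parent2 = tournament_select(pop, err)
print(parent1, parent2)
```

For a maximization problem you would take the max instead of the min; roulette wheel would instead sample indices with probability proportional to fitness.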
But because this is an example of a maximization problem, the 91 wins, so you get your first parent: that solution becomes parent number one. You do the same thing again — randomly select three and take the best of those three — and that becomes parent number two. The fitness value is the value of the objective function associated with that organism, which is its strength; in the case of machine learning, the fitness value would be the model's accuracy or its error. So this is how the genetic algorithm works. Thank you. 9. Genetic Algorithm #3: Welcome back to the third lecture on the genetic algorithm. When we use the binary encoding of zeros and ones, we usually have a range — a lower bound and an upper bound. Say you want to optimize the parameters of an SVM, for instance the parameter gamma: you have to specify a lower bound and an upper bound, because otherwise the search space is infinite, so you have to specify the range you want your search to be in. For example, for gamma you might select the upper bound to be 0.99, since it goes from 0 to 1 — so your upper bound is 1 or 0.99 — and your lower bound a might be 0.1, for example. You also have something called the chromosome length, which is how many genes you have, and something called the precision, which is the upper bound minus the lower bound, divided by two to the power of the chromosome length, minus one. And you have decoding and encoding: encoding is when you transform a certain value into zeros and ones, and decoding is when you take that string of zeros and ones and turn it back into a number you can understand. The bits are weighted by powers of two — 2^0, 2^1, 2^2, 2^3, and so on, all the way up to two to the power of the length minus one.
Because the length here is nine, the weights run from 2^0 up to 2^(L−1). To decode the encoded string back into a number, you take the weighted sum of the bits, multiply it by the precision, and add the lower bound. As an example, say you have one decision variable — let's call it gamma just for the sake of the example, although gamma couldn't actually range between minus six and six — whose lower bound a is −6 and upper bound b is 6. So your search space is between −6 and 6, and you choose the chromosome length yourself: let's put L = 9. You fill the chromosome with completely random zeros and ones to start. To decode it, we said you add up the weighted bits — each bit times its power of two, from 2^0 up to 2^8. Say that sum comes out to 418. Now we want the precision: (b − a) / (2^L − 1) = (6 − (−6)) / (2^9 − 1), which gives roughly 0.0235. To decode, you take the 418, multiply it by the precision, and add the lower bound a, which is −6, and you end up with approximately 3.816. If you decode, the result has to land between −6 and 6, because that is your search space — and after decoding, this is indeed the number you end up with.
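The decoding just worked through can be written out as a small function. One caveat: the transcript's bit-order convention isn't fully explicit, so this sketch treats the last bit as the 2^0 position, matching the backwards indexing used in the coding lectures later; the bit string below was chosen so its weighted sum is 418, as in the example.

```python
def decode(bits, a, b):
    """Map a binary chromosome to a real value in [a, b]."""
    L = len(bits)
    precision = (b - a) / (2 ** L - 1)
    # last bit weighs 2^0, next-to-last 2^1, ..., first bit 2^(L-1)
    integer = sum(bit * 2 ** i for i, bit in enumerate(reversed(bits)))
    return integer * precision + a

bits = [1, 1, 0, 1, 0, 0, 0, 1, 0]   # weighted sum = 418
print(decode(bits, a=-6, b=6))        # ~3.816, inside the [-6, 6] search space
```

An all-zeros chromosome decodes to the lower bound and an all-ones chromosome to the upper bound, which is a quick way to check the encoding.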
And just so you know, if you had more than one decision variable, you put them in the same chromosome and assign segments: you say the first 10 genes, for example, are y and the next 10 genes are x. Also, the decoded value is called the phenotype, and the encoded string is called the genotype. Thank you. 10. Genetic Algorithm #4: Welcome to the fourth lecture on the genetic algorithm. In this lecture we'll learn exactly how it works — the flowchart, the steps. First of all, you need to know how many generations you want, which is M; how large a population you want; the probability of crossover; the probability of mutation; and, if you choose tournament selection as your selection mechanism, how many contestants you want — how many solutions compete for each parent slot. The methodology: first you set the number of generations, the population size N, the probability of crossover, the probability of mutation, the length of your chromosome L, and K, which is how many contestants you want in your tournament selection. Then you create the N-member population randomly, with string size L. At this moment your generation counter is still at one, and you start with the first family — let's call it family one. You select two parents through the selection mechanism, then you cross those two parents over, at the probability of crossover, to get the children. Then you take the two children produced by the crossover and apply the mutation operator on them, so you get two mutated children. Then you calculate the fitness — the objective function value — of these two mutated children, and you save the fitness values.
The reason you save them is that you want to keep track of solutions: at the end of your search, if you did not converge to a good solution, you can look back through all of your stored solutions and pick the best one. Then you repeat steps 4 to 7 N/2 times. The reason it's N/2 times is that each pass selects two parents, so repeating N/2 times gets you a population of the same size N back. Then you start a new generation: m becomes m + 1, which is two. You apply elitism if it's applicable: you can select the best of that generation and carry it into the next generation. Then you repeat steps 4 to 10 M times — once per generation. At the end, when you have all the solutions stored, you either choose the best fitness value from the last generation, or keep track of the best chromosome in each generation for the final answer. And that's exactly what we're going to do: we are going to keep track of the best chromosome in each generation, so that at the end we can look back at the best of each generation and pick the best out of those we stored. As a flowchart — and again, this is for tournament selection — we select the number of generations, the population size, the probability of crossover, the probability of mutation, and how many competitors we want for the tournament selection. We create the population of size N and pick K strings at random; if we set K to be three, we pick three solutions at random, evaluate them, and pick the best one. Now you have parent number one. You do this a second time to get parent number two. Now you have two parents; you apply crossover at the probability of crossover, and you have your two children.
You apply the mutation operator on your two children and you obtain two mutated children. Then you check: is n equal to N/2? If not, you repeat this again and set n = n + 1, so you repeat it N/2 times — each time you get two parents, two children, two mutated children, so N/2 times two gives you a population of size N. If n equals N/2, then you check: is m equal to capital M? For example, if you set the number of generations to 300, are you at generation 300? If not, you repeat all of this — M times in total, M as in Mark. If yes, you end and return the final set. Thank you. 11. Dataset: Welcome back, everyone. In this lecture I'm going to show you the dataset we're going to apply the genetic algorithm on. You go to archive.ics.uci.edu, then /ml (machine learning), /datasets, /Energy+efficiency — this is the dataset right here. We're going to apply a support vector machine and a multilayer perceptron neural network: these eight columns are the independent variables and these two are the dependent variables, and the goal is to create a prediction model to predict the cooling and heating loads of buildings. Going back: X1 is relative compactness, X2 is surface area, and then you have wall area, roof area, overall height, orientation, glazing area, and glazing area distribution. Y1, the first dependent variable, is the heating load, and Y2 is the cooling load. So you fit a prediction model on these eight independent variables to predict the heating load, and another model to predict the cooling load. This is the dataset we are going to work on.
If you go to this website, you can download the dataset by clicking on "Data Folder" and then downloading the Excel file there, and you can follow along with us while we code. Thank you. 12. Support Vector Machine Optimization #1: Welcome back, everyone. In this lecture we're going to start coding the genetic algorithm for the support vector machine. This is our dataset: we have eight independent variables and two dependent variables. We're going to do it only for Y1, the heating load — we'll apply the genetic algorithm to an SVM to predict it — and the same can be done for Y2, so if you want, you can do that on your own. We have a total of 768 observations, since the first row holds the headers. The code is already written, so we're going to walk through it. First of all, we need the numerical package of Python, which is NumPy; we also need the pandas package for data frames; we need random for shuffling — actually, we're definitely going to need that; we need cross-validation in order to apply k-fold cross-validation; and we're going to preprocess our data and, of course, use the support vector machine module. The first thing we do is read the CSV file, which is located in my folder called "Data Set". Let's go ahead and run it first: this is our dataset, with Y1, Y2, and the eight independent variables. Then let's shuffle the data: this is data.sample(frac=1), and frac=1 means return all the rows. If we run this, we end up with shuffled data, which is just better — it makes sure that every time you run the code you get slightly different results, even though the genetic algorithm already works on an element of randomness.
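The loading-and-shuffling steps just described might look like the sketch below. The real data is the Excel workbook from the UCI Data Folder; the two rows here are placeholders so the snippet runs without the download — they are not real measurements, and the filename in the comment is the dataset page's workbook name.

```python
import pandas as pd

# Placeholder rows mimicking the column layout of the Energy Efficiency data.
data = pd.DataFrame({
    "X1": [0.98, 0.62], "X2": [514.5, 808.5], "X3": [294.0, 367.5],
    "X4": [110.25, 220.5], "X5": [7.0, 3.5], "X6": [2, 3],
    "X7": [0.0, 0.4], "X8": [0, 3], "Y1": [15.5, 12.0], "Y2": [21.3, 14.3],
})
# With the file downloaded you would instead do something like:
# data = pd.read_excel("ENB2012_data.xlsx")

data = data.sample(frac=1)                    # frac=1 returns all rows, shuffled
X = data[[f"X{i}" for i in range(1, 9)]]      # the eight independent variables
y = data["Y1"]                                # heating load ("Y2" for cooling)
print(X.shape, y.shape)
```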
Still, I like to shuffle the data — especially if it were a classification dataset, I would definitely shuffle the rows. Now we want to create an X and a y. X is going to be all the independent variables, X1 up to X8, and for y we only care about Y1. Let's run this: now we have X with all the independent variables and y with Y1 only. Now, X6 and X8 are discrete variables — let me show you X6 and X8 — they are categorical variables, so we want to create dummy variables for them. If we run this, X6 and X8 become dummy variables. For example, if X6 is 3 in the first row, then the dummy for level 2 should be 0, for level 3 it should be 1, for level 4 it should be 0, and for level 5 it should be 0 — so 0, 1, 0, 0 — and the same applies for X8 and the rest of X6's rows. Let's take another example just to make sure you understood: X6 in the second row is 2, so the dummies will be 1, 0, 0; and the third was also 2, so again 1, 0, 0. Now, because the independent variables have different scales, we need to scale them. Here's the reason. Imagine we have two independent variables, age and income. Age has two figures, while income would have five or six figures. The weight of income on the model would then be much, much higher than the weight of age. We need the weights of these independent variables to be comparable so they have a comparable effect on the model; otherwise, maybe income wouldn't truly have a huge impact on the dependent variable, but because of its magnitude it would overpower age and affect the model more. So we need to put them all on the same scale — basically, we're going to scale them all to between zero and one.
The maximum value of each column will be one, and the minimum will be zero. So let's scale them: you use preprocessing.MinMaxScaler, fit the scaler to the X from before — the one we got after creating the dummy variables — and obtain the new data frame X. If we run this, you see X is all scaled between 0 and 1. Let's stop here and continue in the next lecture. Thank you. 13. Support Vector Machine Optimization #2: Welcome back, everyone. Let's continue where we last stopped. The last thing we did was normalize the X data frame to values between zero and one. Now, for k-fold cross-validation, the number of data points we need is the length of X, which is 768. You can type 768 here, or you can just put the length — how many observations do we have? If we run it: 768 observations. Now, the parameters for the genetic algorithm: we always want to cross over, so the probability of crossover is going to be 1; the probability of mutation is going to be 0.2; let's have a population of 100 — although it would be wise to have more, the computation time would of course be longer; and let's have 50 generations — again, you would want more generations, but purely for running time I'm keeping it at 50, and likewise the population at 100. Also for running time, I'm making it three-fold cross-validation. And we start off with a completely random string of length 30: the first 15 bits will represent C and the next 15 will represent gamma.
So we have C and gamma, which are the parameters of the SVR. Now we need to create an empty list so it can be populated with the population — the random solutions. We create an empty array, basically filled with zeros, with the same length as the string. Then we shuffle this string and stack the results on top of each other, so that we get a random population: if we run it, the list is completely random, with as many elements as range(pop), which is 100 — so we have 100 different members in the population. Now, if you remember the formula, we need the precision, which is the upper bound minus the lower bound over two to the power of the length minus one. Let's put a lower bound on C of 10 and an upper bound on C of 2000, with the segment length being half the length of the chromosome. Likewise for gamma, let's put a lower bound of 0.05 and an upper bound of 0.99, again with length equal to half the chromosome. The precision for x, which is for C, is the upper bound minus the lower bound over that quantity. Now let's decode this, so I can show you how: this half is going to be C and this half is going to be gamma, so let's start with C. Remember, we need each bit times an increasing power of two — this times 2^0, plus this times 2^1, plus this times 2^2, and so on — and we add them up. What we're going to do is start from the back, so the indices will be −1, −2, −3, −4, and so on: for i in range of half the length, t starts at one, and we use the index −t.
So we take the element at index −1 times 2^z, where z starts at zero, then add the next: this times 2^0, plus the next times 2^1, incrementing as we go — index −2 pairs with exponent 1 — and we add them all together. The same goes for y, which is gamma. The reason I offset by one plus the length is that if the length is 15 for each half, then one plus 15 is 16, so for gamma the index starts at −16 and works backwards from there; again you add up bit times 2^0, bit times 2^1, and so on. Then, to decode them, we said it's the sum of the bits times the precision plus the lower bound. If we run this — oops, sorry, I forgot to print them; let's print — our decoded x comes out around 433 and the decoded y around 0.95. Of course, if I run it again it should change — actually, at first it didn't, because we were decoding index zero of the list; once I take the last element of the entire shuffled list, it changes every run, because the list is reshuffled each time. You see, if we run it again we get different random numbers, and that's why the decoded values change. Let's stop here and continue in the next lecture. Thank you. 14. Support Vector Machine Optimization #3: Welcome back. Let's continue where we last stopped. Now we're going to create several empty lists and several NumPy arrays.
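As a quick recap, the decoding from the previous lecture can be sketched like this. Two assumptions to flag: the chromosome is 30 bits, with the first 15 taken as C and the last 15 as gamma (the split is a convention — either order works as long as encoding and decoding agree), and the bounds 10–2000 and 0.05–0.99 are the ones chosen in the lecture.

```python
import random

def decode_segment(bits, a, b):
    """Decode one binary segment to a real value in [a, b]."""
    precision = (b - a) / (2 ** len(bits) - 1)
    integer = sum(bit * 2 ** i for i, bit in enumerate(reversed(bits)))
    return integer * precision + a

random.seed(7)
chromosome = [random.randint(0, 1) for _ in range(30)]   # random initial string
C     = decode_segment(chromosome[:15], 10, 2000)        # first 15 bits -> C
gamma = decode_segment(chromosome[15:], 0.05, 0.99)      # last 15 bits -> gamma
print(C, gamma)
```

Whatever the random bits are, the decoded values always land inside their search ranges.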
The reason these are needed will become clear as we go through the code, but in general we want to keep track of solutions, like we mentioned: we track each mutated child of each generation — mutated child one and mutated child two — and we also record which generation each mutated child was achieved in, so that at the end, when we get our final solution, we know at which generation it was found. We start with generation equal to one, and loop for i in range of the number of generations, which is range(50). (There was a counter initialized to zero, but I don't think we need it, so we can delete it.) We also need an empty array for the new population. The reason is that we start from the initial list of solutions, but at the end of each generation we need to repopulate it with the new set of solutions — the mutated children — because the mutated children go to the next generation, and so on. The first list comes from the initialization, but then we delete its members and fill it with a new set of solutions. We're also going to keep track of all the mutated children — both of them — and take the minimum of them, and we're going to save the best of each generation, because in each generation you end up with a population of 100. We'll save the best, and also the worst: the reason is that we want to apply elitism, which means we take out the worst and replace it with the best. First we print the generation number, just so we can keep track of which generation we're running at. Then, within each generation, we process N/2 families: for i in range(population over two) — one family at a time.
Now we need to select. Since it's tournament selection, we need to pick three contestants that will fight, and the winner is the best — the one with the minimum amount of error. So, for i in range(2) — because we want to repeat it twice to get two parents — we create a random integer from 0 up to the length of the list, which is 100, so that we can choose one solution at random. We pick a random number for warrior one, a random number for warrior two, and a random number for warrior three. Because we want diversity, we don't want any of the random indices to be the same: if by accident warrior one's index and warrior two's index happen to be equal, we select again — while warrior one's index equals warrior two's index, go ahead and draw another index — and the same for the other pairs. So now, say warrior one's index is 2, for example: warrior one is the element of the list at index 2 — we take that one. We do the same thing for warriors two and three, and we put them together in one list: warrior one, warrior two, warrior three. Now, for each warrior we need to calculate C and gamma: C is one 15-bit half of the string and gamma is the other. We do the same decoding we did before for warrior one, warrior two, and warrior three, so for each warrior we get the decoded C and decoded gamma. And now we apply cross-validation: we want to apply this to the SVM now, with cross-validation over the 768 points and three folds. So what we do is apply the decoded gamma and decoded C to the dataset we have.
So we split X into X_train and X_test, and y into y_train and y_test. Then: model = svm.SVR — the support vector regressor — with kernel='rbf', the radial basis function; C is set to the decoded x and gamma to the decoded y, so C is whatever came out of decoding one half of the string and gamma out of the other. Then we fit X_train to y_train — we apply the SVM with these parameters — then we predict X_test and get the score, which is the accuracy of the model based on X_test and y_test. We want an objective function we can minimize, which is the error: one minus the score. So if the score were 90%, the error would be 10%, and we need to minimize the error. Because we have three folds, this is done three times, stacked on top of each other, and at the end the objective function so far of warrior one is that sum over three. So now we have the error for warrior one after plugging the decoded C and decoded gamma into the parameters of the SVM. You do the same thing for warrior two and get the objective function so far for warrior two, and the same for warrior three — again, the decoded C and decoded gamma there come from decoding warrior three's two 15-bit halves of the binary string. Let's stop here and continue in the next lecture. Thank you. 15. Support Vector Machine Optimization #4: Welcome back, everyone. Let's continue where we last left off. We said that we decoded the string for each warrior — warrior one, warrior two, warrior three — and for each warrior we built a support vector machine on X and y with the decoded C and decoded gamma.
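That per-warrior fitness evaluation might be sketched as below. Two substitutions to note: the old `cross_validation` module mentioned in the lecture has since been replaced by `sklearn.model_selection`, and a synthetic regression set stands in for the building data. Also, for a regressor `.score` returns R², so "accuracy" in the lecture corresponds to R² here.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold
from sklearn.svm import SVR

# Synthetic stand-in for the eight-feature building-energy data.
X, y = make_regression(n_samples=120, n_features=8, noise=5.0, random_state=0)

def fitness(C, gamma, n_folds=3):
    """Average error (1 - R^2) of an RBF-kernel SVR over k folds."""
    errors = []
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=0)
    for train_idx, test_idx in kf.split(X):
        model = SVR(kernel="rbf", C=C, gamma=gamma)
        model.fit(X[train_idx], y[train_idx])
        errors.append(1 - model.score(X[test_idx], y[test_idx]))  # error = 1 - R^2
    return sum(errors) / n_folds

print(fitness(C=433.0, gamma=0.1))   # lower is better; the GA minimizes this
```

Each decoded (C, gamma) pair gets one such score, and the warrior with the smallest error wins the tournament.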
And we fitted the model in order to get the objective function value, which is the error: one minus the accuracy. So now, let's call them prize one — the objective function value for warrior one — prize two for warrior two, and prize three for warrior three. We want to find the minimum of these, because the warrior with the minimum error gets selected as the parent. You could do it with if-statements — I chose the long way — but you could just say prize = min(prize one, prize two, prize three), and that would be the winner. So the winner becomes a parent: for example, if prize two had the least amount of error, warrior two would be the winner. And because we did this in range(2) — remember, for i in range(2) — we end up with the two parents that had the two least errors. So now we have two parents: parent one at index 0 and parent two at index 1. Next, let's create two empty arrays for the children, because now we want to apply crossover. We create a random number; if that random number is less than the probability of crossover — and remember, we set the probability of crossover to 1, so we will always cross over — we proceed. What we do here is create one random integer and another random integer for the crossover. (Just to demonstrate, I'm going to set the population to 4 and use one generation, so that when I run it, it runs fast and I can show you.) All right, where we left off: so now we've created the two random integers.
The reason for these random integers is that we want to know where the cut-off points are for the crossover. For example, if one random integer lands here and the other there, we can take this segment and exchange it with the same segment of the other parent. And while the two random integers are equal, we pick again — we keep drawing until they differ. Now, if random integer c1 is less than c2, we want to make sure c1 comes before c2. The mid segment would then be from c1 up until c2 + 1. The reason for the +1 is that Python slicing excludes the end index: we want from c1 up to and including c2, but because the end index is not included, we add one so c2 is included. So the mid segment of the first parent runs from here to here, and the mid segment of parent number two is the same slice of parent two — say from two until five, so index four is included; that gives indices 2, 3 and 4, with 5 excluded. The first segment is everything from the beginning up until c1, not included (because it's the start of the mid segment), and the last segment runs from c2 + 1 until the end — from five until the end, in our example. Same thing for parent two: the first segment goes from the beginning up until the first integer, not included, and the last segment from five until the end. Now we create the first child: it is the first segment from parent one, plus the middle segment from parent two, plus the last segment from parent one. And child two is the first segment from parent two, plus the middle segment from parent one, plus the last segment from parent two. So say these are your two parents.
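The two-point crossover just walked through can be sketched as follows — a minimal sketch that mirrors the description above (re-draw equal cut points, swap them if out of order, exchange the inclusive middle segment):

```python
import random

def two_point_crossover(p1, p2, pc=1.0):
    """Exchange the segment between two random cut points (inclusive)."""
    if random.random() >= pc:
        return p1[:], p2[:]                  # no crossover: children copy parents
    c1 = random.randint(0, len(p1) - 1)
    c2 = random.randint(0, len(p1) - 1)
    while c2 == c1:                          # cut points must differ: re-draw
        c2 = random.randint(0, len(p1) - 1)
    if c1 > c2:                              # make sure c1 comes before c2
        c1, c2 = c2, c1
    # +1 so the end index c2 is included in the exchanged middle segment
    child1 = p1[:c1] + p2[c1:c2 + 1] + p1[c2 + 1:]
    child2 = p2[:c1] + p1[c1:c2 + 1] + p2[c2 + 1:]
    return child1, child2
```

With pc = 1, as in the course, the else branch never fires and the children always mix both parents.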
So the first child is the first segment, plus this exchanged middle segment, plus this last segment. Now, if crossover integer two turned out smaller than crossover integer one, we just switch them so the smaller one always comes first — for example, if c2 was two and c1 was five, the slice goes from 2 to 5, starting from c2 instead of c1. That's how we create the two children. In the else case — if the random number was greater than the probability of crossover, pc — child one would just remain parent one and child two would remain parent two, because no crossover happened. But again, our probability of crossover is one, so we always cross over. So let's stop here, and in the next lecture we'll learn how to mutate the children. Thank you. 16. Support Vector Machine Optimization #5: Welcome back. What we did in the last lecture was create the two children from the two parents: the first segment from parent one, then the middle segment from parent two, then the last segment from parent one — we exchanged the middle segments of parents one and two. Now we need to mutate. We create an empty list for the mutated child, and for i in range of child one — imagine this is the child — we go through each element, and for each element we generate a random number. If this random number is less than the probability of mutation, which I think we set to 0.2, we flip the bit: if child one at index t was a one it becomes a zero, and if it was a zero it becomes a one. Then t becomes t + 1, we append to mutated child one, and we repeat this for every index.
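The bit-flip loop just described boils down to a few lines — a minimal sketch of mutation with the course's probability of 0.2:

```python
import random

def mutate_bits(child, pm=0.2):
    """Flip each bit of the chromosome independently with probability pm.
    A 1 becomes a 0 and a 0 becomes a 1; otherwise the bit is untouched."""
    return [1 - bit if random.random() < pm else bit for bit in child]
```

Both children go through the same function, giving mutated child one and mutated child two.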
Now, if the random number was not less than 0.2, the probability of mutation, the bit remains untouched: a one stays a one, a zero stays a zero, and the mutated child keeps the original value at that index. Same thing for mutated child two: wherever the random number falls below the mutation probability, a zero turns into a one and a one turns into a zero. So now we have two mutated children, mutated child one and mutated child two. Next we need to decode them, because these are new children, and find their objective values. Same as before: for mutated child one we apply the same decoding — each bit times two to the power of its position — and sum everything up. Now we have a decoded x for C and a decoded y for gamma for mutated child one, and likewise decoded values for mutated child two. Then we run the cross-validation — X_train and X_test from X, y_train and y_test from y — and fit a support vector machine with the new parameters: C equal to the decoded x of mutated child one and gamma equal to the decoded y of mutated child one. We fit the model and get its accuracy, and because we want to minimize the error we take one minus the accuracy, add the folds on top of each other, and divide by three, since our k-fold was three. That is for mutated child number one. We do the same for mutated child two — C equal to its decoded C and gamma equal to its decoded gamma — and now we have the objective function so far for mutated child two.
So we have the fitness value of mutated child one at whatever generation it was, and the fitness value of mutated child two at that same generation, as their objective functions so far. Now we want to stack them on top of each other so we can keep track of the mutated children. First we create a temporary array, because we need to flip the axis — the chromosome comes out vertical and we need it horizontal — so we transpose it to the new axis. Then we stack the objective function next to the mutated child, so we get one row containing the objective function followed by the chromosome string of mutated child one. We do the same for mutated child two: flip the axis and put its objective function next to its string, because we need to keep track of the mutated children together with their objective functions. Then we stack all of them on top of each other — every mutated child one, each with its objective value plus its string. Remember all those empty arrays we created at the beginning? This is why: we want to keep track of the objective function of each mutated child. Then we save all of them in generation X, regardless of whether they came from the mutated-child-one slot or the mutated-child-two slot — we keep all the mutated children from both and save them as "save best in generation X". So now we have a new array with all of the mutated children.
Now, if you remember, we created the new population. This new population holds every mutated child one and mutated child two, and for each family — each family, not each generation — we keep stacking each mutated child one and each mutated child two, so you end up with a new population built by stacking every mutated child one and every mutated child two on top of each other. So let's stop here and continue in the next lecture. Thank you. 17. Support Vector Machine Optimization #6: Welcome back. Let's continue where we last left off. We started stacking the mutated children on top of each other so we can keep track of them — of their objective function values, the fitness values — and we also created the new population, which stacks all the mutated children on top of each other. Now we need to find the best, the minimum, of generation X for all the mutated child ones. It will be the minimum over all the rows of the first column only, the column that holds the objective function value. You know what, I'm going to run this so I can show you. Okay — save best in generation. Let's stop it here. So, new population: for everything in generation one, you have the objective value followed by the whole chromosome string of mutated child one. After you run it several times — I just stopped it early — you would have several chromosomes here. You want to find the minimum of these and save it as the minimum in generation X for mutated child one. You find the minimum of column zero only; that's why we index only the first column here — we can put zero there, that's fine.
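The bookkeeping just described — one row per mutated child, the objective value in column zero followed by the chromosome — can be sketched like this with numpy (row layout and array names are illustrative, not the lecture's exact variables):

```python
import numpy as np

def record(population_so_far, child_bits, objective_value):
    """Stack [objective | chromosome] as a new row of the running population."""
    row = np.hstack(([objective_value], child_bits))
    return np.vstack((population_so_far, row))

new_population = np.empty((0, 31))   # starts empty: 1 objective column + 30 bits
rng = np.random.default_rng(0)
new_population = record(new_population, rng.integers(0, 2, 30), 0.12)
new_population = record(new_population, rng.integers(0, 2, 30), 0.08)

# The per-generation minimum is then just the row with the smallest column 0
best_row = new_population[np.argmin(new_population[:, 0])]
```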
And so you find the minimum of that column, and you do the same for mutated child two — you find its minimum as well, because we want to keep track of the minimum of each generation, so that at the end we can choose among them. Also, for "save best in generation X", we stacked all of them on top of each other, and we want to find the best of that too. The reason is that "save best in generation X" contains all of the mutated children, regardless of whether they were mutated child one or mutated child two, because we stacked them together. We need to find the best and the worst of it for elitism. So we find the best and call it the final best in generation X, and we find the worst and call it the worst in generation X. Our final best — the one with the minimum error — we call the Darwin guy, because he will survive; and the not-so-Darwin guy is the worst, the one with the maximum error. So here, best_one is where in the new population the Darwin guy is, and worst_one is where the not-so-Darwin guy is — because we stacked all the mutated children in the new population, we need to locate both of them there. Then the not-so-Darwin guy's slot in the new population gets substituted with the Darwin guy: we remove the worst and replace it with the best, so we end up with two Darwin guys, the original one and the substituted one. And you can see here that the n-list gets substituted with the new population, so at the end, when we go to create a new generation, our n-list has become the new population we just created. Now we stack them on top of each other again.
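The elitism step — substituting the not-so-Darwin guy with the Darwin guy — can be sketched as follows, assuming the same row layout as before (error in column 0, chromosome after it):

```python
import numpy as np

def apply_elitism(new_population):
    """Replace the worst row (not-so-Darwin guy, maximum error) with a copy
    of the best row (Darwin guy, minimum error). Column 0 holds the error."""
    best = int(np.argmin(new_population[:, 0]))
    worst = int(np.argmax(new_population[:, 0]))
    new_population[worst] = new_population[best]
    return new_population
```

After this call the population contains two copies of the Darwin guy, exactly as the lecture describes, and it becomes the n-list for the next generation.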
You have the minimum children, and you keep track of the minimum of each generation, over all generations, for mutated child one, and likewise for mutated child two — you keep track of all the mutated children across all the generations. Now, if you look here, there's a tab: we're done with the family loop, so this code runs per generation, not per family, because we're out of the family loop and inside the generation loop. So we keep track of the best of each generation for mutated child one and mutated child two, and we insert the generation number here so we can record at which generation each best child was achieved. That matters at the end: when we find a solution, we can say the solution was found at generation number 39, for example. Then we stack them on top of each other, with the generation number inserted at position zero, before the minimum of that generation, and the same for mutated child two. So here we just keep track of which generation we're at, and then we increment: generation equals generation plus one. So let's stop here and continue in the next lecture. 18. Support Vector Machine Optimization #7: Welcome back. Because we kept track of the minimum of each generation — for mutated child one and for mutated child two, along with the generation at which it was achieved — we can now stack all those per-generation minimums on top of each other and call the result one final guy.
For example, if there were 50 generations, we have the 50 best mutated child ones and the 50 best mutated child twos — 50 and 50, so 100 rows. Now we need to choose the best of those: the least amount of error, the best guy. So we take the minimum of the error column. The reason the index is one and not zero is that we inserted the generation in front, so a row reads: the generation (for example 42), then the error, then the chromosome. Column number one is the error, and we choose its minimum. We call the result one final guy final — the minimum over all generations. The final solution then reads: index zero is the generation, index one is the error, and index two until the end is the chromosome, which shows the binary string. And one minus the value at index one — the error — gives the R squared. Then we take the chromosome, from index two until the end, and decode it so we know what gamma and C are: the first 15 bits for C and the last 15 for gamma, decoded the same way we did at the top, at the beginning. We call them decoded x after and decoded y after, and that gives our final C and gamma. I went ahead and ran it with generations equal to 50 and population equal to 100. It's still running — it's at generation 31, family number 16, 17 now. I'm going to pause and wait until it finishes, and then we'll come back.
But when it does, what you'll get here is the minimum over all the generations, the R squared — which is our evaluation metric for regression problems — and the chromosome for it. Then we'll take the regression problem and run it before optimization and with the optimized parameters, and see the difference between them. I'll pause the video now and be back when this is done. All right, we're back, and it's almost done — one family left. Okay, perfect. So this is our C and this is our gamma. "Minimum in all generations" prints the entire row: this value right here is the error, so one minus it gives the R squared — the highest R squared, achieved in generation 17 — and this is the chromosome. When we decode it, we get C equal to this and gamma equal to this. Now, before comparing before and after optimization, let me show you the support vector regressor's defaults. If we don't pass any parameters, it uses its own default parameters: the kernel is the radial basis function, gamma is 'auto', which is one over the number of features — so with 10 features gamma would be 0.1 — and C is equal to one. (Note that these defaults depend on the scikit-learn version; newer versions default gamma to 'scale'.) So here, this is X; of course you apply the scaling and the dummy variables, and we only want y1 — exactly what we did at the beginning. Given the number of observations, we apply k-fold cross-validation, once without the genetic algorithm and once with it. Without it, you can see we did not specify any parameters, so it uses the defaults.
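The before/after comparison can be sketched like this. It's a minimal sketch on synthetic data standing in for the building-loads dataset, and the "optimized" C and gamma here are hypothetical placeholders — in practice you substitute the values decoded from the winning chromosome:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

# Synthetic stand-in for the course's regression dataset
X, y = make_regression(n_samples=100, n_features=8, noise=5.0, random_state=0)

# Default hyperparameters (kernel='rbf', C=1.0; gamma defaults to 'scale' in
# recent scikit-learn, 'auto' = 1/n_features in older versions)
baseline = cross_val_score(SVR(), X, y, cv=10).mean()

# Hypothetical GA-optimized values -- replace with the decoded C and gamma
tuned = cross_val_score(SVR(kernel="rbf", C=60.0, gamma=0.05), X, y, cv=10).mean()

print(f"R^2 without GA: {baseline:.3f}")
print(f"R^2 with GA:    {tuned:.3f}")
```

`cross_val_score` with `cv=10` does the sum-and-divide-by-ten averaging the lecture describes in one call.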
So it runs the model, adds the fold scores, and divides by 10 because it's 10-fold, then prints the R squared for the SVM without the genetic algorithm. On the other hand, with the optimized parameters, we set the kernel to the radial basis function — though even if you don't specify it, the default is rbf — C equal to this much, and gamma equal to this much. If we go ahead and run it, we can see the before and after: before optimization with the default parameters, and after optimization with the parameters we got. So let's run this — oops, sorry, I changed the file location. Okay, my courses folder. There we go. And you can see the difference is 6%, which is honestly considered a lot. Without optimization — without the genetic algorithm — the accuracy of the model was 87%, and after implementing the genetic algorithm with the optimized parameters we got almost 94%, so the difference is over 6%. This is how you use a genetic algorithm to optimize a machine learning algorithm, for SVM. In the next lecture, we're going to edit the code so we can apply it to the MLP, the multilayer perceptron neural network. So I'll see you in the next lecture. Thank you. 19. Multilayer Perceptron Neural Network Optimization #1: Welcome back. In this lecture we're going to talk about implementing the multilayer perceptron neural network on our dataset. Because everything is the same except for a few little tweaks here and there, I went ahead and wrote the code — it's the same as SVM, with just a small difference in how you encode the chromosome. So here are the differences.
First, of course, we import MLPRegressor — this is a regression problem — from sklearn.neural_network. Everything else is the same, but now the chromosome holds more decision variables: the number of hidden layers, the number of neurons, the learning rate, the momentum, and a solver. One gene indicates the number of neurons, another the number of hidden layers, and everything else encodes the learning rate and momentum, just like we did with C and gamma. So it's like C and gamma, but in addition we put in some discrete decision variables: the number of neurons and the number of hidden layers. Accordingly, we have a probability of crossover for the continuous part, the learning rate and momentum; a probability of crossover for the combinatorial part, the integers — the number of hidden layers and hidden neurons; a probability of mutation for the continuous part; a probability of mutation for the discrete variables; and a probability of mutation for the solver, because the two solvers we're choosing between are Adam and stochastic gradient descent. Everything here is the same as before, but now the first 15 bits represent the learning rate and the other 15 represent the momentum. All right, so for i in range(population), just like before: st is the solver type, which we keep in its own place, the n-list's st; and in the n-list we put the number of hidden layers and the number of hidden neurons.
Actually, to be precise: the number of hidden neurons will be between six and ten with a step size of two, and the number of hidden layers will be between three and eight with a step size of one. In the next lecture, we'll continue and describe exactly what happens. Thanks. 20. Multilayer Perceptron Neural Network Optimization #2: Welcome back. Just like we did with the support vector machine, where the first 15 bits represented C and the second 15 represented gamma, here we have the learning rate and the momentum: this part represents the learning rate and the other represents the momentum. The minimum value the learning rate can take is 0.0 and the maximum is 0.3. The reason I capped it at 0.3 and not more is that, if you remember, we said we don't want to overshoot and miss the local or global optimum — this is a continuous problem, so there's no way to guarantee we hit the global optimum, but we don't want to overshoot and miss a good local minimum. Same thing with the momentum, which should be between zero and one. The decoding is the same as before: two to the power of zero times this bit, plus two to the power of one times that bit, plus two to the power of two times the next, and so on — exactly as in the support vector machine — and these are all the lists and empty arrays we had there too. Selection is also the same: using a random integer we take three warriors from the n-list, they fight together, and the minimum wins. Now, when we come here, you can see how the hidden layer sizes are written: the number of hidden layers and neurons is given as a tuple, and this is considered one hidden layer with 100 neurons.
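A sketch of what one MLP individual looks like under this encoding — the layout (two integer genes in front of 30 bits) follows the lecture; the exact array structure is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_mlp_chromosome():
    """One individual: [n_neurons, n_layers] followed by 30 bits
    (15 bits for the learning rate, 15 bits for the momentum)."""
    n_neurons = int(rng.choice([6, 8, 10]))   # 6..10 neurons, steps of 2
    n_layers = int(rng.integers(3, 9))        # 3..8 hidden layers, step 1
    bits = rng.integers(0, 2, size=30)
    return np.concatenate(([n_neurons, n_layers], bits))

chromosome = random_mlp_chromosome()
```

The binary tail is decoded exactly as for C and gamma, only now into a learning rate in [0.0, 0.3] and a momentum in [0, 1].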
If you add another 100, it becomes two hidden layers with 100 neurons each — that's how scikit-learn writes it. Same thing here: for i in range of the number of hidden layers, hidden_layers equals hidden_layers plus the number of hidden neurons, so we build the tuple in the same manner — for example (6, 6, 6) would be three hidden layers with six neurons each. So that's how we do it: (100,) is one hidden layer with 100 neurons, and to represent ours the same way, we loop over the number of hidden layers and append the neuron count, getting something like (5, 5, 5) or (4, 4, 4, 4). Then the learning rate is the decoded x and the momentum is the decoded y, and the activation will always be ReLU. And where's the — okay, we should have an optimizer here. You know what, let's keep the solver at Adam. Let's not use stochastic gradient descent, because Adam is stronger than plain stochastic gradient descent — Adam is a variant of stochastic gradient descent. So since we don't pass the solver argument, we keep it at the default, which is Adam. We do this for all three warriors and let them compete, the same as with the support vector machine — there's no difference. The only difference between the multilayer perceptron neural network and the support vector machine in the genetic algorithm is the chromosome: all we did was add two discrete variables, the number of hidden layers and the number of hidden neurons. So the warriors compete, the minimum gets to be parent number one, and the loop runs again, for i in range(2), giving parent number two.
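Turning the two discrete genes plus the decoded learning rate and momentum into a model can be sketched like this — a minimal sketch; the helper name and argument values are illustrative:

```python
from sklearn.neural_network import MLPRegressor

def build_mlp(n_layers, n_neurons, learning_rate, momentum):
    """hidden_layer_sizes repeats the neuron count once per hidden layer,
    e.g. 3 layers of 6 neurons -> (6, 6, 6)."""
    hidden = tuple(n_neurons for _ in range(n_layers))
    return MLPRegressor(hidden_layer_sizes=hidden,
                        activation="relu",
                        solver="adam",             # kept at the default, as in the lecture
                        learning_rate_init=learning_rate,
                        momentum=momentum,         # only used when solver='sgd'
                        max_iter=500)

model = build_mlp(3, 6, 0.02, 0.6)
```

Note that scikit-learn only applies `momentum` when the solver is 'sgd'; with the solver fixed at Adam, that gene has no effect on training.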
And then we stack them on top of each other. Let's stop here; in the next lecture we'll continue. Thanks. 21. Multilayer Perceptron Neural Network Optimization #3: Welcome back, let's continue. So we have two children — two empty arrays for them — because now we want to cross over. First, let's start with the combinatorial part: this gene is the hidden neurons and this one the hidden layers. What happens here is that if the random number is less than the probability of crossover, the genes get exchanged. Let me just grab something to show you — bear with me — okay, perfect. So imagine these are the two parents. If the random number was less than the combinatorial crossover probability, they cross over: this value takes the place of that one and vice versa, so they get exchanged, and the same for the number of hidden neurons — this one is the number of hidden layers, this one the number of hidden neurons — they get exchanged if the random number was below the probability. Then, for the binary crossover, it's exactly as before: imagine these are the parents, you take a random segment, and you exchange those segments — the same as in the support vector machine, nothing different. The first child is the first segment of parent one plus the middle segment of parent two plus the last segment of parent one, and the same pattern for the second child. And now we have our two children. And forget about this part — the solver, we kept it at Adam.
But in case you do want to include the solver, you can: if a random number is less than the probability of mutation for the solver, then Adam becomes stochastic gradient descent, and stochastic gradient descent becomes Adam; otherwise — if the random number was not below the mutation probability — the solver remains the same. Now for the combinatorial part, the number of hidden layers and the number of hidden neurons. What happens is we check the bounds: if you remember, the upper and lower bounds of the hidden neurons were set to ten and six, and eight and three are the upper and lower bounds for the hidden layers. So we say: if the number of hidden neurons is equal to the upper bound, it remains the way it is; if it is equal to the lower bound, it remains the way it is; but if it's not equal to either bound, we add two, because we move in step sizes of two. This, of course, only if we decided to mutate: we take a random number and, if it's less than the mutation probability, we mutate; then we take another random number, and if it's greater than 0.5 we add two, and if it's less than 0.5 we subtract two instead. That's how we mutate, and it's the same for the number of hidden layers: one random number decides whether to mutate, and another decides the direction — greater than 0.5 we add one, less than 0.5 we subtract one, because the step size is one. And for the binary chromosome, mutation is as before: if the random number is less than the probability of mutation, a zero becomes a one and a one becomes a zero; otherwise the bit stays.
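The two MLP-specific mutation operators just described can be sketched as follows — a minimal sketch; the bound handling follows the lecture's description (a gene sitting exactly on a bound is left unchanged, so it never leaves the range):

```python
import random

def mutate_solver(solver, pm=0.1):
    """Swap the solver gene between 'adam' and 'sgd' with probability pm."""
    if random.random() < pm:
        return "sgd" if solver == "adam" else "adam"
    return solver

def mutate_discrete(value, lo, hi, step, pm=0.2):
    """Mutate an integer gene by +/- step; a gene already on a bound
    is left as it is, so it stays inside [lo, hi]."""
    if random.random() >= pm or value in (lo, hi):
        return value
    # Second random draw picks the direction, as in the lecture
    return value + step if random.random() > 0.5 else value - step

# Hidden neurons: bounds 6..10, step 2; hidden layers: bounds 3..8, step 1
neurons = mutate_discrete(8, 6, 10, 2)
layers = mutate_discrete(5, 3, 8, 1)
```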
Otherwise, the mutated child remains the child. Then we insert — remember, for the mutated children — the number of hidden neurons and the number of hidden layers in front of the continuous zeros and ones: so the layout is the number of hidden neurons, then the number of hidden layers, then the binary string. Mutated child two is exactly the same: we put the number of hidden neurons and the number of hidden layers in front of its zeros and ones. Let's continue in the next lecture. Thanks. 22. Multilayer Perceptron Neural Network Optimization #4: Hey, welcome back. We'll start where we left off. So we have mutated child one; now we decode it so we can get the fitness value, the same as we did for the warriors and the parents. We have our mutated child and we need its fitness value, the objective function — the error, one minus the accuracy. We do the same for mutated child two: its objective function is one minus its accuracy. Then, remember, we stack them on top of each other: the objective function number first, and then the rest is the chromosome. You stack them so you have a new population, and then you find the minimum over everything in the generation for mutated child one and save it as the minimum in generation for mutated child one. Same thing here: you find the minimum for mutated child two, so you can stack them on top of each other and have one complete array with all the mutated child ones and all the mutated child twos — the best of them for every generation.
And then you have "save best in generation X", where you take the minimum of that to get the final answer. In "save best in generation X" we also find the maximum, which is the worst, for elitism, so we have the Darwin guy and the not-so-Darwin guy: we take the minimum, which is the best, and the worst, locate where each one is, substitute the worst with the best, and throw out the worst. Also, this part here is for the solver, which we don't touch. Then our new population becomes the n-list, and we put the generation number in front, just like we did with the support vector machine. And basically that's it — then we decode at the end. It's exactly like the support vector machine; the difference is that instead of having just zeros and ones, we added hidden layers and hidden neurons. So what I'm going to do is let it run — let's say a population of 50. Of course you should put it higher, but for time reasons I'll use 50, and let's make k equal to three. Let's go ahead and run it. Let me pause, and we'll get back soon. Okay, we're back — it's about to finish; it took longer than I expected. All right, so the highest accuracy, the R squared, was 0.92, and it used 10 neurons and three hidden layers, with a learning rate of 0.021 and a momentum of 0.608 — it's listed right here. This is the MLPRegressor. So we're going to try one run with the default parameters and one with the parameters we just got. Let's keep the solver at Adam.
So set the hidden neurons to (10, 10, 10), because it's three hidden layers, with the learning rate at about 0.022 and the momentum at 0.608. And let's see. Okay: without genetic algorithm we get 90%, and with genetic algorithm 92%, so we can see the difference — genetic algorithm optimized the performance of the MLP, which is a multilayer perceptron neural network. Just to let you know, the default parameters are one hidden layer with 100 nodes, the activation is ReLU, the solver is Adam, and the learning rate is constant. What does constant mean? Constant means the learning rate is the one given by learning_rate_init; there's also invscaling, and I don't know what that means, to be honest. The default learning rate is 0.001, and the momentum default is 0.9. So let me see — this is almost done. You can see here that there is a difference. I'm just going to try changing this to SGD, gradient descent, and see how it works — I'm just curious. Okay: definitely keep it Adam. So you can see the genetic algorithm did help the performance of the MLP. There you have it — genetic algorithm helped the MLP achieve a higher accuracy. So now you can see how optimization helps machine learning: it helps the algorithms achieve their maximum performance. Thank you.

23. Feature Selection #1: Hi, everyone. Welcome back. So now we're going to learn how to do feature selection for a dataset using genetic algorithm. By feature selection, I mean finding the subset of the features, or independent variables, such that when you take only that subset, you get either a higher accuracy or the same accuracy as using all of the features. Since you're using fewer features, it's better, because it's less complex. First of all, we're going to use a new dataset called Breast Cancer. So go to Google and search for "UCI breast cancer".
If you open the first link, you will see that we have 32 variables and 569 instances. We don't want to use this, so we go to the second link, where we have only 10 variables — it's easier to work with — and 699 observations, or instances. You go to the Data Folder and download breast-cancer-wisconsin.data, so you end up getting this data right here. Then you right-click it and open it with Excel. You end up with all the data in the first column; the second, third, and fourth columns don't have anything, so we need to split this data. Select the whole first column, go to Data, Text to Columns, Delimited, Next; since the values are separated by commas, click Comma, Next, Finish — and that's how you get your dataset. The first column is not important to the data or your dependent variable, so we just remove it, because the patient ID is not relevant to whether somebody has cancer or not, benign or malignant. So let's delete it. This is your dataset, with nine independent variables, or nine features, where 2 is benign and 4 is malignant. Now, first of all, let's clean the data. You see here there's a question mark — it means missing data. So we delete all the rows that have a question mark. This row also has a question mark, so we delete every row with a question mark, and we go from 699 data points, or observations, to 683. Then save it as breast_cancer.csv. Make sure you save it as a CSV file, which is comma-separated values, so we can work with it in Python. So let's stop here, and in the next lecture we'll jump into how we can use genetic algorithm to do feature selection on this dataset. I'll see you there.

24. Feature Selection #2: Welcome back.
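As an aside, the Excel clean-up from the last lecture can also be done in a few lines of pandas; this is a sketch, with a tiny inline sample standing in for the downloaded breast-cancer-wisconsin.data file:

```python
import io
import pandas as pd

# Tiny inline sample in the same comma-separated format as the UCI file
raw = io.StringIO(
    "1000025,5,1,1,1,2,1,3,1,1,2\n"
    "1002945,5,4,4,5,7,10,3,2,1,2\n"
    "1057013,8,4,5,1,2,?,7,3,1,4\n"   # contains '?' -> will be dropped
)

df = pd.read_csv(raw, header=None)
df = df.drop(columns=0)                       # patient ID is irrelevant
df = df[~df.astype(str).eq("?").any(axis=1)]  # remove rows with missing '?'
df.to_csv("breast_cancer.csv", index=False, header=False)
```

On the real file this takes you from 699 rows to 683, matching the 16 rows with question marks removed by hand in the lecture.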
In this lecture, I'll show you the concept behind feature selection — how we are going to choose a subset of features from the original dataset. First, let's run this file. The original dataset is this: you have nine independent variables, or nine features, and the last column is the dependent variable, benign or malignant — 2 is benign, 4 is malignant. Indexing starts at zero, so the nine features end up numbered 0 to 8. Now, let's say we want only the first, second, sixth, eighth, and ninth features. Let's call this a gene, since it's going to make up a chromosome — say this is a chromosome. We create an empty list, and whenever a gene is equal to one, we append its index to the empty list. The reason we increment from zero is that we start at feature zero — remember, feature one in Python is index zero, so 0, 1, 2, 3, 4. We start at zero: if the gene at index zero equals one, we append that number, which is zero. Since the first one is a one, the empty list becomes [0]. Then we move to the second one — t equals t plus one. Since the second one is also one, we append that t too. Then it goes to the third one; since it's zero, it skips it, and it skips the fourth and fifth as well. It goes on to the sixth — 0, 1, 2, 3, 4, 5, so index 5 — and appends it, then on to indices 6, 7, and 8, skipping 6 because it's zero and appending 7 and 8 because they're ones. So those indices end up in the empty list, and you can see when we run this — since we put print(empty_list) — we end up with [0, 1, 5, 7, 8]. Now, just as before, let's call the data X_old before applying the transformation, where we used preprocessing and MinMaxScaler.
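The index-collecting loop just described, together with the column subsetting covered next, can be sketched as follows (a hypothetical toy matrix in place of the real data):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

def selected_indices(gene):
    empty_list = []
    t = 0                      # t starts at feature zero
    for bit in gene:
        if bit == 1:           # a one means "include this feature"
            empty_list.append(t)
        t = t + 1              # move to the next feature index
    return empty_list

gene = [1, 1, 0, 0, 0, 1, 0, 1, 1]
empty_list = selected_indices(gene)      # -> [0, 1, 5, 7, 8]

# X_old: hypothetical 3 x 9 feature matrix; scale it, then keep only
# the columns the gene selected (all rows, just those columns)
X_old = np.arange(27, dtype=float).reshape(3, 9)
X = MinMaxScaler().fit_transform(X_old)  # still nine features
X_temporary = X[:, empty_list]           # now only five features
```

Setting the gene to `[0, 0, 0, 0, 0, 0, 1, 1, 1]` would instead return `[6, 7, 8]` — the last three features, as in the lecture's second example.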
So X_old, before applying preprocessing, still has nine independent variables and one dependent variable, y — nine independent variables in X. But we only want the first, second, sixth, eighth, and ninth features (remembering that indexing starts at zero). So let's call it X_temporary: it's our X after applying MinMaxScaler. After applying MinMaxScaler you still have nine features, and X_temporary is X with all of the rows and only the columns in the empty list — it takes all the rows and just columns 0, 1, 5, 7, 8. If we go ahead and run that, and open X_temporary and X: since it took the first and second columns — 0 and 1 — you can see this one and this one are equal. Then it went to the sixth feature — 1, 2, 3, 4, 5, 6, which is index five — and took this one; you can see 1.1 and 1.33 here, which match. And it also took the last two features; if we open it here, you can see it took these two columns. So from this gene we can tell which features to include and which not to include, and we put the indices of the ones into an empty list. Again, a one gets its index put in the empty list; if it's zero, it won't be appended. It keeps appending only where there are ones, and that's how you end up with a subset of features based on the gene you choose. Let's set this one to zero, for example — actually, let's keep just the last three features — and run that. We end up with features 6, 7, and 8, which are the seventh, eighth, and ninth features. So let's stop here, and in the next lecture we'll actually implement that in genetic algorithm.

25. Feature Selection #3: All right. In this lecture I'll show you how we can do it.
We can implement genetic algorithm for feature selection. So instead of, say, finding the best gamma and C for a support vector machine to achieve the highest accuracy, here we don't care about gamma and C — we just want to find the best subset of features that gives the highest accuracy. That is why, to get the accuracy for a subset of features, I just used a support vector machine with the default parameters. Of course, you can always use other algorithms or other classifiers, and you can always use the optimized parameters — that would be better — but this is for illustration purposes, only to show you feature selection. That's why we only care about the feature selection rather than the optimized parameters. But definitely try it with the optimized parameters; I'm sure you'll get better results. So basically everything is the same as we learned before; the only difference is here. First of all, I'm going to run this. If we look at the n_list, it's the same as before: we have zeros and ones, nine of them. So this one means it didn't include the first feature, included the second, didn't include the third, included the fourth, fifth, and sixth, then the eighth and ninth, and so on. And this one didn't include the first or the second, but included the third, fourth, sixth, seventh, eighth, and ninth, and so on. So this is what we're going to work with. Now let's go below to warrior one and the empty list we need to create. For example, here we got a certain chromosome for warrior one: if the random integer was two, it chose this chromosome here.
That means this chromosome includes the first feature, the fourth feature, and so on. So after we get the random integer, we apply it to the n_list. Say the random integer was two: we end up with warrior one having the second chromosome, which included the first and fourth features, and so on. Then we create the empty list the same way as we learned: for each gene in warrior one's chromosome of zeros and ones, wherever there's a one, append its index; wherever there's a zero, don't. You end up with a new subset of features, and you use that new subset for training the model — you use the new X, which is the subset, and your y doesn't change. Then you get the accuracy of the model based on that subset of features. You do the same thing for warrior two and warrior three, and you pick the one with the highest accuracy — the chromosome whose subset of features achieved the highest accuracy. Crossover is the same, mutation is the same; nothing is different, it's all as we learned. Basically the only difference is here at the warriors: instead of choosing the best combination of gamma and C for the support vector machine, we choose the best combination of zeros and ones, which is the best subset of features that gives the highest accuracy. So, for example, say this one achieved an accuracy of 90%, this one 92%, and this one 91%. This one would be the best, so warrior two, with this chromosome, would go on to be parent number one, and you do the same thing for parent number two. So everything is the same; nothing is different, and actually there's nothing different at the end either — you get the final chromosome.
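A rough sketch of this warrior step — each candidate chromosome scored by a default-parameter SVC trained on only its selected columns, with the best one becoming a parent. A synthetic classification set stands in for the breast-cancer data, and the three fixed chromosomes below play the role of randomly drawn warriors:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=9, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

def accuracy_of(gene):
    # Train a default-parameter SVC on only the columns the gene keeps
    cols = [i for i, bit in enumerate(gene) if bit == 1]
    clf = SVC().fit(X_tr[:, cols], y_tr)
    return clf.score(X_te[:, cols], y_te)

# Three warriors (in the real code these are drawn at random from n_list)
warriors = [[1, 0, 1, 1, 0, 1, 0, 1, 1],
            [0, 1, 1, 0, 1, 1, 1, 0, 1],
            [1, 1, 1, 1, 1, 1, 1, 1, 1]]

scores = [accuracy_of(w) for w in warriors]
parent_1 = warriors[int(np.argmax(scores))]   # best warrior becomes a parent
```

Crossover and mutation then proceed on these bit strings exactly as in the gamma-and-C version.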
Now you want to display that chromosome. Say, for example, the final answer was the chromosome 1, 1, 0 — that would be the first and second features, not the third. So we take the final chromosome that had the highest accuracy and just convert it to the features included — that empty list we end up putting the chromosome's indices in. Let's go ahead and run it. We'll run it for 30 generations with a population of 80. Okay, I think I deleted something by mistake — okay, fixed. So I'll go ahead and pause, and as soon as it finishes the 30 generations I'll unpause, so I'll see you soon. All right, it's almost done. Just a very important and quick note: the number of ones and zeros you use here is the same as the number of features you have. With the breast cancer dataset we had nine features, so we end up with nine zeros and ones. This is almost done. Okay, perfect. So we ended up with the highest accuracy, 97%, using the first, fifth, sixth, seventh, and ninth features. The reason I say the first, fifth, and so on when the printed indices were 0, 4, 5, 6, and 8 is that there's no feature zero — in Python indexing starts at zero, so index 0 is the first feature, and so on. So this is how we use genetic algorithm for feature selection: 97%, with these five features. Now, if you run it with all the features and still get 97% — as I said, even if you do end up getting the same accuracy as with the whole set of features — this is still better, because you reduced the number of features, which means you reduced the complexity. So I hope you enjoyed this lecture, and please let me know if you have any questions — post them on the Q&A forum. Thanks.
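Assuming the printed index list was [0, 4, 5, 6, 8], the conversion to the one-based feature numbers quoted above is just:

```python
final_indices = [0, 4, 5, 6, 8]            # GA output: zero-based columns
feature_numbers = [i + 1 for i in final_indices]
print(feature_numbers)                      # -> [1, 5, 6, 7, 9]
```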