Transcripts
1. Introduction to the Class: Machine learning is an exciting topic, but the real question is, how can you combine it with your current skill stack and your existing experience to stay relevant in today's work environment? Whether you're coming from a background in mobile and web development, business analysis, engineering, or as a recent graduate, your journey into this exciting and complex field starts with learning the fundamentals. Hi, my name is Oliver, and I'm the author of Machine Learning for Absolute Beginners. In this video course, I will teach you the basics of machine learning. I will walk you through the fundamental concepts, algorithms, and terms without overwhelming you with advanced math and lines and lines of code. After completing this course, you can confidently go into more complex learning resources on Skillshare and other platforms. Or perhaps a general understanding of machine learning is enough to satisfy your curiosity for now. I will also provide recommendations for further learning resources at the end of the course. Okay, so let's get started.
2. What is Machine Learning? Part 1: So what exactly is machine learning? Here are two definitions. The first comes from Google: machine learning gives computers the ability to make predictions and perform tasks without specific instructions by identifying patterns in large and complex aggregations of data. Machine learning can be applied in a variety of ways, such as searching for information using imagery, personalizing a chat app experience, and identifying music. The second definition comes from Arthur Samuel. He wrote the blueprints of machine learning and programmed the first working algorithmic models in the 1950s and '60s. According to Samuel, machine learning is a field of study that gives computers the ability to learn without being explicitly programmed. So both definitions converge on the fact that machine learning relies on self-learning to form predictions without direct instructions or programming. Now, to understand self-learning, we first need to look at traditional computer programming, a field of computing you're likely well versed and familiar with, partly because of its domination over self-learning for most of the information age. Traditional computer programming works on direct commands with a pre-programmed output. One of the main competitors to machine learning in the 1970s was what we called knowledge-based systems. You don't hear about these systems anymore, but they were essentially humans inputting or dumping their knowledge into a computer repository, which could then answer queries based on the human knowledge pre-loaded into the computer. These systems relied on a lot of upfront human input and then a series of if-else rules to reach a decision. Knowledge was therefore saved into the program itself.
Increasing supplies of data and cheap computing power eventually handed machine learning victory over knowledge-based systems, as it became more efficient to derive knowledge from data rather than task experts to migrate their knowledge to the computer. Another example of traditional computer programming is the free chess game that comes pre-installed on a new computer. The bot player in the game is pre-programmed to play a certain way based on input from a human developer. And sure, there are functions built into the program to randomize certain movements, meaning that the bot doesn't perform the exact same opening move every game you play. Otherwise, that would get very boring. However, this doesn't qualify as machine learning or artificial intelligence, because the bot player isn't thinking for itself. It's not adapting to your gameplay like a human player could, and it's not cognisant of how it might improve itself to beat you next time. It's simply following a pre-coded sequence of moves with an element of randomness thrown in, which, you guessed it, is also pre-coded by the game developers. So computer games are a great example of traditional computer programming, with a pre-programmed input command and a direct output. Many of the chatbot applications on the web are also examples of traditional computer programming, because their answers are pre-programmed by a computer programmer, again using a sequence of if-else rules. Having said that, chatbot applications, as with modern chess applications, are moving towards a new model based on self-learning. Now, unlike traditional computer programming, where outputs are pre-defined by the programmer, self-learning uses data as input to build what's called a model. The model itself consists of an algorithmic equation, such as linear regression, decision trees, or a neural network.
Each algorithmic equation, or algorithm, is also different. Some use probabilistic reasoning, such as the Bayes classifier, which uses conditional probability to classify a class, such as pregnant or not pregnant. Linear regression, meanwhile, uses correlation to predict a continuous variable, such as the price of a house. Other algorithms use massive trial and error and brute force to find a possible solution. Okay, now I want to pause here to distinguish between what's an algorithm and what is a model. The algorithm is the rule or mathematical equation that extracts insight from the data, such as bx + a = y in the case of linear regression. The values of the equation here, though, are not concrete. They have not been assigned actual values. A model, on the other hand, is concrete. It's an equation that has been filled with values learned from the data. So the model, you can say, is the final equation for explaining patterns in the data. It has been tuned to the data and saved for future use, and that saved algorithm we can call the model. So hopefully this helps to tidy up the difference between a model and an algorithm. Later, once the model is ready, we can then use it to look at new input data and predict an output. An important point to make about the model is that the output is determined more by the contents of the input data rather than any preset rules defined by a human programmer. Yes, the machine learning developer is responsible for collecting the data, selecting an appropriate algorithm, and fine-tuning its settings, which we call hyperparameters, to reduce prediction error. But ultimately it's the model which decides what the output should be. And in contrast to traditional programming, the machine learning developer therefore operates a layer apart from the output, as is indicated in the diagram, where we have the model in between the input and the output. This is the space where the self-learning comes in.
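To make the algorithm-versus-model distinction concrete, here's a minimal sketch in plain Python (not from the course; the rooms-and-prices numbers are invented). The unfitted equation y = bx + a is the algorithm; once b and a are learned from the data, the saved result is the model.

```python
# The "algorithm" is the equation form y = b*x + a;
# the "model" is that equation after b and a are learned from data.

def fit_linear(xs, ys):
    """Least-squares fit of y = b*x + a (the training step)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a, b  # concrete values: this pair IS the model

# Toy housing data: x = number of rooms, y = price in $1000s (invented)
rooms  = [2, 3, 4, 5]
prices = [200, 250, 300, 350]

a, b = fit_linear(rooms, prices)

def predict(x):
    return b * x + a   # the saved model, ready for new inputs

print(predict(6))  # prints 400.0, the predicted price (in $1000s) of an unseen 6-room house
```

Before fitting, b and a are just placeholders; after fitting, they hold concrete values learned from the data, which is exactly the difference described above.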
A concrete example of self-learning is the recommendations you see daily on YouTube and Amazon, which are determined by user actions, such as what items other users click on and what comments or ratings they allocate to relevant products. These actions are users' input data that is fed into an algorithm, which then generates relevant recommendations as part of a prediction model. This qualifies as self-learning because the developers at Amazon are not sitting there whispering into the ears of the model to serve you book recommendations related to data science or sci-fi, for example. The model bases its recommendations on the input data. What's more, it has the ability to identify combinations that a human might not easily find, such as a correlation between two random book genres. Moving away now from e-commerce, self-learning is also coming to dominate what we see in search engine results, which is super interesting because we can actually see the before and after effect of injecting machine learning into a business model that previously relied exclusively on traditional computer programming. In the past, Google relied heavily on pre-programmed rules that broke words down into strings of letters. This included strings of letters found in the web page's title, navigation menu, body text, metadata and its descriptions, and so forth. Google stored the strings in a massive repository, which gave it the ability to return results based on the string you entered into the search bar. So if you typed in Donald Trump, the search engine would go to its repository of indexed websites and return web pages with that exact same string of letters. This methodology has all changed now. Google's latest algorithms, backed by machine learning, rely less on programmed commands and more on input data, courtesy of every time that we search for something. To give you a more concrete example, let's say you want to search: who was Donald Trump's first wife? Prior to machine learning, Google would search for web pages
containing those six keywords as strings of letters. However, the results might not always match up with the true intent of the user's search. The search engine might, by mistake, return some web pages that aren't relevant but that contain similar strings of letters, such as "Melania Trump, Donald Trump's wife and the First Lady of the United States of America." Google would therefore be lured into returning content featuring Melania Trump, Trump's third wife, on the first page of its search results, and the same error still happens on Yahoo today, actually. Google, meanwhile, has overcome this problem using past search queries as input data. Past Google searches from users likely point to a high click-through rate for content containing the keywords Ivana Trump for this search query. So from users' past search activities and the results they click on, Google can identify a relationship between the keywords Donald Trump's first wife and Ivana Trump. Google now understands Ivana Trump is Donald Trump's first wife, and it can eliminate any content that is not related from its first-page results. As Wired magazine's founder, Kevin Kelly, says, each of the three billion queries that Google conducts each day tutors the Google AI over and over again. So to summarize, self-learning produces an output based on the model, which is a combination of an algorithm and the patterns deciphered from the input data, whereas in traditional programming the output is pre-defined by the programmer.
3. What is Machine Learning? Part 2: Having established how machine learning is unique and different from traditional computer-based decision-making, it's useful to understand how machine learning overlaps and relates with other relevant fields, beginning with computer science, which is depicted here as the outer circle on the right. Computer science is a broad field and encompasses everything related to the design and use of computers. Machine learning, data mining, artificial intelligence, and computer programming each fall under the high-level umbrella of computer science. Then, within the all-encompassing space of computer science is the next broad field of data science. Narrower than computer science, data science comprises methods and systems to extract knowledge and insights from data with the aid of computers. Then, within data science we have artificial intelligence. Artificial intelligence, or AI, encompasses the ability of machines to perform intelligent and cognitive tasks. So, just like how the Industrial Revolution gave birth to an era of machines simulating physical tasks, AI is now driving the development of machines capable of simulating cognitive abilities. This includes some fields that are popular and newsworthy today, such as computer vision, natural language processing, and of course, machine learning. And finally, we have data mining, which focuses on analyzing input variables to predict a new output. It's similar to machine learning but uses a more direct approach that relies on human intuition rather than self-learning. It also focuses more on describing what has happened in the past, rather than trying to predict the future, which really is the focus of machine learning. Data mining also doesn't cover the depth of algorithms that machine learning employs, such as neural networks, which really puts machine learning into the space of AI.
To finish off this section, I want to point out some interesting relationships contained in this graphic. The first is that it's perfectly viable to work inside the space of data science without touching machine learning. You could, for example, be somewhere in the data mining space working on solving a traditional classification task. Or you could be using a decision tree that narrates past data rather than attempting to build a self-learning model to predict the future. On the other hand, it's unworkable to separate data science from machine learning, data mining, and AI, as all three subdisciplines involve extracting knowledge and insight from data. So in sum, you can be a data science company or professional without touching machine learning. But you can't call yourself an exponent of machine learning without saying that you also practice data science. This is because it's impossible to build a prediction model without common data science methods like data cleaning, feature selection, and evaluation techniques. Okay, and the third interesting relationship we have here is the overlap between the three smaller circles, and the problem a lot of people have in trying to isolate machine learning from AI and data mining. Algorithms from machine learning cross over into AI and data mining, so it's very hard to decipher the boundary sometimes. And in a way, machine learning is a bridge from data mining to produce artificial intelligence.
4. Independent & Dependent Variables: Having looked at the definition of machine learning and related fields of computing, let's now look under the hood to review the actual mechanics of machine learning, and this leads us to a discussion of x and y variables. As with other fields of statistical inquiry, machine learning is built around the cross-analysis of dependent and independent variables, and the goal of machine learning is to teach the model to make predictions through exposure to data. The data itself is split into independent variables, which are denoted as an uppercase X, and a single dependent variable, which is denoted as a lowercase y. The dependent variable is the output you want to predict. An independent variable is an input that supposedly impacts the dependent variable. So, for example, the supply of oil (X) impacts the cost of fuel (y). The cost of fuel is the dependent variable, which is affected by the supply, which is constantly changing. Other independent variables, including population growth and economic development, also affect the cost of fuel. Other examples of a dependent variable might be the coordinates for a rectangle around a person in a digital photo, so face recognition, the price of a house, or the category of a product, such as a sports drink. The independent variables, which supposedly impact the dependent variable, could be the pixel colors, the distance to the city, the suburb and the number of rooms in the house, or the specifications of the product, respectively. After analyzing a significant number of examples, we can then create a model, which is an algorithmic equation for predicting y based on X. So to predict the value of a house, for example, we can analyze the house attributes, such as the distance to the city, the suburb, and the number of rooms, as independent variables to predict the dependent variable, which is the asset value of the house. Using the model, the machine can then predict an output based exclusively on the input data.
The market price for your existing house, for example, can be estimated using the labeled examples of other houses recently sold in your neighborhood.
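As a quick sketch of how this looks in practice, here's how you might separate a toy housing table into X and y in plain Python (the column layout, suburb names, and prices are hypothetical, not from the course):

```python
# Hypothetical rows of house data:
# each row is (distance_to_city_km, suburb, rooms, price)
houses = [
    (5.0, "Richmond", 3, 480_000),
    (12.5, "Footscray", 2, 390_000),
    (2.0, "Carlton", 4, 750_000),
]

# Independent variables (uppercase X): everything used as input.
X = [(dist, suburb, rooms) for dist, suburb, rooms, price in houses]

# Dependent variable (lowercase y): the single output we want to predict.
y = [price for dist, suburb, rooms, price in houses]

print(X[0])  # (5.0, 'Richmond', 3)
print(y[0])  # 480000
```

The split mirrors the convention in the section above: the model learns how the X columns relate to the y column.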
5. Model Overview: Okay, in this video, I want to give a quick, high-level overview of how to build a model based on machine learning. So machine learning models work by first taking the input data and selecting which features to analyze, such as height, weight, arm span, and so forth, and this is called feature selection. Next, we take those features and refine the data to reduce the noise, fill missing values, or just fix general errors, and this process is called data scrubbing, which we'll go into in more detail later in the course. Both these steps come under the heading of data preparation. Once the data is prepared, we can then feed that data into the algorithm. However, if we're performing supervised learning, for example, we first need to split that data into two sets: one set for building the model and one set for testing the model. This step, though, isn't necessary for unsupervised and reinforcement learning. Next, the algorithm will run through the input data to generate a prediction output, which is usually a category or a numerical prediction, and then the final step is to evaluate the results.
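The steps above can be sketched as a bare-bones pipeline in plain Python. Everything here is a simplified stand-in (the column names and the drop-missing-rows policy are my own illustrative choices, not the course's):

```python
# A toy version of the pipeline: feature selection -> data scrubbing -> split.

def select_features(raw_rows):
    # Step 1: feature selection - keep only the columns worth analyzing.
    return [(row["height"], row["weight"]) for row in raw_rows]

def scrub(rows):
    # Step 2: data scrubbing - here, drop rows with missing values
    # (one simple cleaning policy among many).
    return [r for r in rows if all(v is not None for v in r)]

def split(rows, ratio=0.7):
    # Step 3 (supervised learning only): split into training and test sets.
    cut = int(len(rows) * ratio)
    return rows[:cut], rows[cut:]

raw = [{"height": 170, "weight": 65},
       {"height": 180, "weight": None},   # a missing value to scrub out
       {"height": 160, "weight": 55},
       {"height": 175, "weight": 70}]

clean = scrub(select_features(raw))
train, test = split(clean)
print(len(train), len(test))  # 2 1
```

The function bodies are placeholders for the real techniques covered in the following videos; the point is only the order of the steps.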
6. Feature Selection: Okay, so in this video, I'm going to elaborate further on features and what they look like in a typical dataset. Now, features help to distinguish qualities and contrasts between objects, such as age, gender, and height. Objects, meanwhile, constitute the item you are describing, such as people, products, customers, or films, and may also be referred to as instances, datapoints, or samples. Features are also often called variables, and these two terms can be used interchangeably. For this example on the screen, we have a dataset with online user information used for ad targeting. The dataset consists of ten features, as highlighted in navy blue, including daily time spent on site, age, income, and so on. In this dataset, we have rows of information documenting individual users. So in row one, we have a user who's 35 years of age from Tunisia. Most datasets you'll find have the variables displayed horizontally from left to right, and individual instances of those variables, such as product items or individual users, listed vertically below as rows. This is generally because the number of items in a dataset outnumbers the number of features, and so it's easier to look at a dataset scrolling down rather than scrolling horizontally. Equally though, there's nothing stopping you from looking at a dataset with the instances at the top and features listed vertically below. Now, variables also have other qualities, such as continuous or discrete. Continuous variables are integers or floating-point numbers that are compatible with mathematical operations such as addition, subtraction, division, and so forth. So daily time spent on site, age, income, and daily internet usage are all examples of continuous variables. If we add each row together for time spent on site, for example, we can then aggregate this information to find the mean value or the range. We can't do the same, though, for a variable like city, because this is a discrete variable.
A discrete variable is categorical or finite in value. What this means is that it cannot be aggregated or mathematically manipulated as with other variable observations. Examples of discrete variables here include ad topic line, city, gender, and country. Even categorical variables describing numbers, such as zip codes and customer ID numbers, also meet the criteria of discrete, as they cannot be aggregated like natural numbers. In the case of our dataset, male and clicked on ad are both variables expressed using integers, but these numbers cannot be aggregated like natural numbers because they are categorical. Identifiers and timestamps are also discrete because they mark a fixed period in time. Now, age, on the other hand, is a little tricky. It is easy to aggregate age and find the mean age or work out the range, which makes this variable continuous. On the other hand, it could be interpreted as a discrete variable, because each age has its own discrete preferences. So, for example, if you add a 30-year-old and a 20-year-old, you won't be able to suddenly find the preferences of a 50-year-old. So age usually is a discrete variable, but in other cases it can also be seen as a continuous variable. It really depends on what you're attempting to predict and the values of the other input variables in your analysis. Now, in this slide we have the independent variables highlighted in purple and the dependent variable, clicked on ad, highlighted in green. In this case, we are attempting to predict if a user will click on an ad based on the nine input variables, including daily time spent on site, age, income, and so on. However, it's important to note that no variable is intrinsically locked into behaving as a dependent or independent variable, and we can switch this around based on the goals of our analysis.
So now, in this updated example, we are predicting how much time a user will spend on the website, which is the dependent variable in green, based on the other nine independent variables, as displayed in purple.
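As a small illustration of the continuous-versus-discrete distinction, here's a plain-Python sketch (the ages and cities are made-up values in the spirit of the ad-targeting dataset):

```python
from collections import Counter

# Toy user rows modeled loosely on the ad-targeting example (values invented).
ages   = [35, 28, 42, 31]                    # continuous: arithmetic is meaningful
cities = ["Tunis", "Oslo", "Tunis", "Lima"]  # discrete: categories only

# A continuous variable can be aggregated...
mean_age = sum(ages) / len(ages)
print(mean_age)  # 34.0

# ...while a discrete variable can only be counted, not averaged.
print(Counter(cities).most_common(1))  # [('Tunis', 2)]
```

Averaging the city strings would be meaningless, which is exactly why discrete variables need different handling from continuous ones.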
7. Data Scrubbing: Having reviewed a sample dataset, we now need to talk about preparing the data for analysis, which involves a discussion of data scrubbing. Data scrubbing is a common term used for manipulating data in preparation for analysis. Some algorithms, for example, cannot read certain types of data, or they return an error message in response to missing values or non-numerical input. Variables, too, may need to be scaled to size or converted to a more compatible data type. Linear regression, for example, analyzes continuous variables, whereas gradient boosting requires that both discrete and continuous variables are expressed numerically as an integer or a floating-point number. So each algorithm has its own requirements on what the data should look like. Duplicate information, redundant variables, and errors in the data can also conspire to derail the model's capacity to dispense valuable insight, and so these are other considerations for data scrubbing. Another important consideration when working with data, and especially private data, is removing personal identifiers that could contravene relevant data privacy regulations or damage the trust of customers, users, and other stakeholders. This is less of a problem for publicly available datasets, but something to be mindful of when working with private data. Now, in terms of some concrete examples of data scrubbing, they include one-hot encoding, which is where you take a discrete variable and then convert it into integers. You can remove variables, you can remove personal identifiers, you can merge variables. And there are also some really interesting algorithms which help to reduce the actual size of the dataset, but we'll get to that later in the course. Now, while machine learning algorithms can run without using the next two techniques in this video, normalization and standardization really help to improve model accuracy when used with the right algorithm.
The former, normalization, rescales the range of values for a given feature into a set range with a prescribed minimum and maximum, such as 0 to 1 or negative 1 to 1, thereby containing the range of the feature. This technique helps to normalize the variance among the dataset's features, which may otherwise be exaggerated by another factor. The variance of a feature measured in centimeters, for example, might distract the algorithm from another feature with a similar or higher degree of variance that is measured in meters or another metric that downplays the actual variance of the feature. Standardization, meanwhile, converts a feature's values to a standard normal distribution with a mean of 0 and a standard deviation of 1. Standard deviation measures variability by calculating the average squared distance of all data points from the mean of the dataset. So in this example we have the original data, which is quite spread out. Using normalization, we can then put that data into a range of 0 to 1. Or, using standardization, we can squeeze the data to within one standard deviation of the mean, which would be considered 0. Now, the narrow range of normalization isn't really suitable for features with an extreme range, so very high or very low feature values. Standardization, meanwhile, excels when the variability of the data reflects a bell-curve shape of normal distribution. It's also often used in unsupervised learning with principal component analysis and k-nearest neighbors. In other situations, normalization and standardization can be applied separately and compared for accuracy. It's also good practice to train the model twice, with and without rescaling, and compare the performance.
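Both rescaling techniques are easy to sketch in plain Python. This is a minimal illustration, not a production implementation (it uses the population standard deviation and assumes the feature isn't constant):

```python
import math

def normalize(values, new_min=0.0, new_max=1.0):
    """Min-max normalization: rescale values into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

def standardize(values):
    """Z-score standardization: shift and scale to mean 0, std deviation 1."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / sd for v in values]

data = [10, 20, 30, 40, 50]
print(normalize(data))    # [0.0, 0.25, 0.5, 0.75, 1.0]
print(standardize(data))  # values centered on 0, roughly -1.41 to 1.41
```

In practice you'd usually reach for a library implementation (for example, scikit-learn's MinMaxScaler and StandardScaler), but the arithmetic is exactly this.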
8. Split Validation: After cleaning your dataset, the next job is to split the data into two segments for training and testing. This process is known as split validation. The first split of the data is your training data, which is the initial reserved data used to develop your prediction model. After you have developed a model based on patterns extracted from the training data, you can then test the model on the remaining data, which is called the test data. Please note that the test data should be used to assess model performance rather than to optimize the model. Now, when splitting the data, the ratio of the two splits should be approximately 70-30 or 80-20. This means that your training data should account for 70% to 80% of the rows in your dataset, and the remaining 20% to 30% of rows are left for the test data. It's also vital to split your data by rows and not by columns, if the variables are expressed horizontally. Other options for splitting the data are a three-way split using a validation set, or k-fold validation. As the test data cannot be used to build and optimize the model, data scientists sometimes use a third independent dataset called the validation set. After building an initial model with the training data, the validation set can be fed into the prediction model and used as feedback to improve the model's hyperparameters. The test set is then used to assess the prediction error of the final model. Alternatively, in k-fold validation, the data is randomly assigned to k number of equal-sized buckets. One bucket is reserved as the test bucket and is used to measure and evaluate the performance of the remaining k minus 1 buckets. The process is repeated until all buckets have been utilized as both a training and test bucket. The results are then aggregated and combined to formulate a single prediction model.
Thus, by using all the available data for both training and testing purposes, the k-fold validation technique dramatically minimizes the prediction error that can come from relying on a fixed split of training and test data.
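Here's a minimal plain-Python sketch of both ideas: a shuffled 70-30 split and k-fold index generation (a simplified illustration; real projects would typically use a library such as scikit-learn's train_test_split and KFold):

```python
import random

def train_test_split(rows, train_ratio=0.7, seed=42):
    """Shuffle rows, then split roughly 70-30 into training and test data."""
    rows = rows[:]                      # copy so the caller's list is untouched
    random.Random(seed).shuffle(rows)   # shuffle to avoid ordering bias
    cut = int(len(rows) * train_ratio)
    return rows[:cut], rows[cut:]

def k_fold_indices(n_rows, k=5):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_rows))
    fold_size = n_rows // k
    for i in range(k):
        test = indices[i * fold_size:(i + 1) * fold_size]
        train = [j for j in indices if j not in test]
        yield train, test

data = list(range(10))
train, test = train_test_split(data)
print(len(train), len(test))  # 7 3

# Each of the 5 folds holds out a different 2-row test bucket.
for train_idx, test_idx in k_fold_indices(10, k=5):
    assert len(test_idx) == 2 and len(train_idx) == 8
```

Note how every row appears in exactly one test bucket across the k folds, which is what lets k-fold use all the data for both training and testing.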
9. Self-learning Algorithms: Now, in terms of algorithms, there are three main categories: supervised, unsupervised, and reinforcement learning. These three techniques can be easily separated based on what kind of data you are looking at. If you have a dataset with both the independent variables and the dependent variable that you are predicting, you can use the first technique, which is called supervised learning, where you decode relationships between known X and y variable combinations to create a model that explains that relationship. However, if you have a dataset that doesn't have a dependent variable, in the sense of a specific variable that you wish to predict, such as house price, then you need to use an unsupervised learning technique, which finds relationships between all the known X variables to create a completely new y variable. This might be a new category of houses clustered in the same city, for example, that have similar attributes, such as a tennis court and swimming pool. So unsupervised learning doesn't have a set variable that you're trying to predict, like supervised learning does, and you usually use unsupervised learning when you have a complex dataset and you're not sure how, or even if, the data falls into set categories. So, for instance, if you have a lot of customer data, you might want to break that down by x variables like age and suburb to create new categories that are useful for marketing activities. And finally, we have reinforcement learning, where there's actually a blank dataset, meaning you still have variables you want to look at, but you don't know their values yet. Instead, you have a known y goal, which is the dependent variable, and you want to use a bunch of random independent variable values to see if you can achieve that goal. So this method involves a large-scale trial and error approach to find input data combinations that can fulfill the target output, such as winning a game of chess.
So that provides a quick snapshot of supervised, unsupervised, and reinforcement learning. But let's spend some time here really getting to know these techniques, because they do play a big part in matching your data with the right algorithm. So let's start by looking at supervised learning as the first branch of machine learning. Supervised learning takes labeled datasets to decode the relationship between known input variables X and their known output y. Datasets are labeled when both X and y are known in the dataset. Supervised learning essentially works by feeding the machine lots of sampled data containing various independent variables and the dependent variable values as well. The algorithm then deciphers patterns that exist between the X and y values and uses this knowledge to build a prediction model. Once the prediction model is built, it can then be used to predict the output of new cases based exclusively on the independent variables. So an example of this is having a dataset with lots of houses. The x variables might be the land size, the distance to the city, the number of rooms, etc., and the y variable, which you are trying to predict, is the cost of the house. So the x and y variables for each house are fed into an algorithm, which works out how the x variables impact the y variable of price, and then creates a model based on those rules. Using the model, we can then take a new house with a completely unknown selling price and tell the model its x values. Then, based on knowledge from the training data from over here on the left, it can predict the price of the house, which is $395,000 in this particular example. So remember, supervised learning has two parts. One is training the model with lots of known examples of X and y, and then the second part is using the model to predict the dependent value of new cases based on just the value of the independent variables.
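As a tiny illustration of that two-part process, here's a sketch using a one-nearest-neighbour rule in plain Python (the house figures are invented, and 1-NN is just one simple stand-in for the many supervised algorithms available):

```python
# Part 1: labeled training data, (land_size_sqm, rooms) -> price. Values invented.
train_X = [(450, 3), (300, 2), (600, 4), (520, 3)]
train_y = [400_000, 310_000, 560_000, 455_000]

def predict_price(house, X=train_X, y=train_y):
    """Part 2: 1-nearest-neighbour prediction -
    output the label of the most similar training example."""
    def distance(a, b):
        # Squared Euclidean distance between two feature tuples.
        return sum((p - q) ** 2 for p, q in zip(a, b))
    best = min(range(len(X)), key=lambda i: distance(house, X[i]))
    return y[best]

# A new house with an unknown selling price: only its x values are known.
print(predict_price((480, 3)))  # 400000 - the closest labeled example decides
```

Training here is trivial (the model just memorizes the labeled examples), but the shape is the same as the section describes: learn from known (X, y) pairs, then predict y for new X.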
Next, we have unsupervised learning, which is where the independent features are known, but the dependent variable is not available, making the dataset unlabeled. This being the case, the model is not able to decode the relationship between existing X and y examples. Unsupervised learning algorithms instead focus on analyzing relationships between X variables and uncovering hidden patterns that can be extracted to create new labels regarding possible y outputs. So if you group data points based on the purchasing behavior of small and medium-sized enterprises and also large enterprise customers, for example, you're likely to see two clusters of data points emerge. This is because the SMEs, the small and medium-sized enterprises, and the large companies tend to have different procurement needs when it comes to purchasing cloud computing infrastructure, for example. Essential cloud hosting products and a content delivery network should prove sufficient for most SME customers. Large enterprise customers, on the other hand, are likely to purchase a broader array of products and solutions that involve advanced security and networking products, like a web application firewall, a dedicated private connection, a VPC, and so forth. So by analyzing customer purchasing habits, unsupervised learning is actually capable of identifying these two groups of customers without specific labels that classify a given company as small, medium, or large. So this is the main point of unsupervised learning algorithms: you don't actually have any pre-existing knowledge of how to group your data, but using unsupervised learning, you can be pushed in the right direction to discover new relationships contained in your data. Now, let's look again at a real-life example to reinforce the process of unsupervised learning. With this model, we want to analyze some Lego pieces to produce an unspecified y output.
This time, though, we only have the x data, which means that yes, we have independent variables in our dataset, but not the dependent variable, which is what we want to predict. In fact, we don't even know what we really want to predict. The nature of unsupervised learning is that y is completely unknown; hence, it's the model's job to produce and uncover a new type of y output. So here we can feed in the known input data of x variables, which is the Lego pieces we have in our toolbox, and we ask the algorithm to shuffle and rearrange these pieces until it finds something interesting. So this might mean grouping the Lego pieces based on similarity, or even reducing the number of pieces to those that have the most functionality. Some Lego pieces, for example, such as tires, may only connect with a very small number of other pieces, whereas basic building blocks can have far more versatility in terms of joining together with other Lego pieces. The model will therefore be tasked to focus on blocks rather than tires, as these pieces are more versatile in their functionality. We can then use that information to solve another challenge. For instance, if we know that basic blocks are the most versatile and join with other Lego parts, then we need to concentrate on analyzing how these pieces fit together to produce a car, instead of focusing on the tires, which we already know connect with the gray wheels, for example. So in sum, unsupervised learning is really a great way of exploring relationships within the dataset, rather than trying to find a set value or category, as is the case with supervised learning. Now, reinforcement learning is the third and most advanced category of machine learning, and it's generally used for performing a sequence of decisions, such as playing chess or driving an automobile. Reinforcement learning is actually the opposite of unsupervised learning in that the y feature is known, but the independent features and values of x are unknown.
So in reinforcement learning, the y dependent variable can be considered as the intended goal. So this could be to win the game of chess. And the optimal x values are found using a brute-force technique based on trial and error. So, as I mentioned before, you basically have a blank dataset: you still have the independent variables, but you just don't know what their values should be. So using random x values as input, the model grades its performance according to how close those x values can get to the target y value. In the case of self-driving vehicles, movements to avoid a crash, for example, would be graded positively, and in the case of chess, moves to avoid defeat would likewise be rewarded. Over time, the model leverages this feedback to progressively improve its choice of x variables to achieve its target y output. And there's a really good demonstration that you can find online of this iterative approach of learning from failure, which is watching how cars are trained to drive. So I've included the link here, which you can check out in your own time. Okay, so let's look at a visual example once again. Here we have a blank dataset and we want to find out what Lego pieces go into producing a specified output, which in this case is a yellow car. So upfront, we already know what our y value is, but what inputs produce that desired output, we don't know. So to solve this challenge using reinforcement learning, we must feed the algorithm lots and lots of potential x values to find a good match. To start off, we feed in randomly chosen x values. Then once the model spits out an output, we can compare the actual output with the desired output, and this feedback is then used to refine the input x data. This means that the model will improve over time as the x input data is optimized based on experience using trial and error.
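The trial-and-error loop just described can be sketched as a toy random search. The environment function `f` and the target value below are hypothetical stand-ins, and real reinforcement learning uses far more sophisticated feedback than this, but the keep-whichever-guess-scores-best loop is the core idea:

```python
# Toy sketch of trial and error toward a known target y:
# we know the goal, we don't know which x produces it, so we try
# random x values and keep the best-performing one so far.
import random

random.seed(42)

def f(x):
    # hypothetical environment: maps an input x to an output
    return 3 * x + 7

target_y = 25                       # the known goal (like winning the game)
best_x, best_error = None, float("inf")

for _ in range(1000):               # trial and error with random x values
    x = random.uniform(-10, 10)
    error = abs(f(x) - target_y)    # feedback: how close did we get?
    if error < best_error:          # keep the best x found so far
        best_x, best_error = x, error
```

After many trials, `best_x` settles close to the input that hits the target, mirroring how the model's choice of x values improves with experience.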
Another example would be, say, designing a model that generates a positive reputation on Reddit, known as karma. The specified y value is to boost the user account's karma score. The x inputs are actions that the model can take on the website, such as commenting on other users' posts, sharing different types of content on the platform, and choosing which Reddit threads to target. In the beginning, the model is probably going to get its account suspended by Reddit for taking inappropriate actions, such as posting unrelated content to a thread or spamming lots of posts with low-quality comments. And because getting an account suspended doesn't result in the target output of lifting karma, the model will learn to avoid those actions that lead to suspension while also honing in on the actions that actually boost its karma or reputation. Okay, so to wrap up this section of the course, I have all three categories of learning algorithms mapped on a single slide, which might look a little scary at first, but this is my best attempt to visualize the workflow of each type of learning. So first we have supervised learning, which has two steps: using known examples of x and y, it creates a model, which can then be used later to predict the y value of new data based exclusively on the x value. Second, we have unsupervised learning, which is much simpler. It's just a matter of pushing x values into an algorithm and seeing if any interesting y values come out, and it's mostly used for shuffling data around to either group data points or to summarize their structure. And then there's reinforcement learning, which feeds random x values into an algorithm, and based on a loop of trial and error, it eventually gets better at hitting the target y value. Lastly, it's important to remember that the choice between supervised, unsupervised, and reinforcement learning comes down to what data you have available and the problem you are seeking to solve.
In addition, it's not practical to actually switch out one type of algorithm for another. So if the task is to predict the price of a house based on historical data, the only real option is supervised learning algorithms. We can't use unsupervised learning algorithms, as these don't let us decide what the y variable actually is, which in this case is house price. Unsupervised learning is really useful for finding interesting relationships between independent variables, such as rooms and land size, but if there's no existing example of price, then it's impossible to create a model to predict it. Likewise, reinforcement learning won't work, because this method doesn't use existing data. It instead generates random x values to produce y, which won't help us to predict house prices based on historical data. However, if you have a problem where you want to try lots of potential combinations, such as drug combinations to produce a vaccine for a new type of virus such as the COVID-19 virus, then reinforcement learning is a really good approach. So for this example, the y value is specified, which is to eradicate the effects of the virus from humans. But what goes into producing that drug is not actually known, and there's no historical data we can use. So we need to use a trial-and-error approach to examine different combinations of x to produce the target y value. So yes, definitely keep in mind that these three learning techniques are not interchangeable, and you will need to look at your data and what you are aiming to do in order to evaluate which of these techniques makes the most sense for your prediction model.
10. Classification & Regression: Okay. Before we move on to looking at specific examples of algorithms, I want to talk a little bit about the difference between regression analysis and classification. These are usually the most popular examples of algorithm types, especially for supervised learning. Now, regression analysis is a very popular statistical technique that is used to model the relationship between one or more independent variables and the dependent variable. By determining the mathematical strength of the relationship between a dependent variable and its changing independent variables, regression analysis is able to make predictions about the target value of future input. So businesses, for example, often utilize regression to predict sales based on a range of input variables, including weather, social media mentions, historical sales, GDP, inbound tourists, and so forth. And one of the key features is that the output is an integer or floating-point number, which means it's expressed in numbers, and those numbers have mathematical meaning. The other option is classification. This is a broader term for algorithms that generate category predictions, such as recommender systems that predict user preferences, image processing that recognizes individual faces, and cluster analysis for conducting market research on customer profiling. Classification models tend to be more common in industry than regression analysis, because most analytical problems involve forming a decision, and classification helps to produce an information roadmap. In the case of an email spam detection system, the email client applies a classification algorithm to determine whether an incoming email is spam or non-spam, which can be linked to a designated action, rather than feeding back an indexed number.
Quantifying the level of possible maliciousness, for example. So we can say that the output is actually a category rather than a numerical value, as was the case in regression. So here's an overview of the algorithms covered in this course. The only pure regression algorithm that we're looking at is linear regression, and most of the other algorithms are either a classification algorithm or a combination of classification and regression, which means there are two different ways you can use the algorithm. You can either use the algorithm for solving a classification problem, which is predicting a category, or you can use the algorithm to solve a numerical problem, such as predicting the price of a car or the value of a house. And also we have k-means clustering, which is actually a sorting algorithm. So it's a little bit like classification in terms of trying to categorize the data, but it's more about sorting and rearranging the data rather than a pure classification technique. Okay, so now in the next few videos, we'll start looking at supervised learning algorithms, beginning with linear regression.
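A minimal sketch of this regression-versus-classification distinction, using tiny made-up datasets: the regression model returns a number with mathematical meaning, while the classifier returns a category label that could be linked to an action (like moving an email to a spam folder).

```python
# Regression output is a quantity; classification output is a category.
# The data below is invented purely to show the difference in outputs.
from sklearn.linear_model import LinearRegression, LogisticRegression

X = [[1], [2], [3], [4]]

# Regression: the training y values are numbers, so the prediction is a number
reg = LinearRegression().fit(X, [10.0, 20.0, 30.0, 40.0])
numeric_output = reg.predict([[5]])[0]        # a quantity (here approximately 50.0)

# Classification: the training y values are labels, so the prediction is a label
clf = LogisticRegression().fit(X, ["ham", "ham", "spam", "spam"])
category_output = clf.predict([[5]])[0]       # a class label, not a number
```

Because the classifier's output is a discrete class, it maps directly onto a decision, which is why classification dominates in industry use cases like spam filtering.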
11. Linear Regression: Okay, moving on to algorithms, we have linear regression as the hello world of machine learning algorithms. Linear regression is a simple supervised learning algorithm for finding the best trendline to describe patterns between x and y. This trendline is known as the hyperplane in linear regression. The algorithm is based on the equation y = bx + a, where b is the slope and a is the y-intercept, and this may look familiar from high school mathematics. A notable attribute of this algorithm is the hyperplane, which is the straight line used to explain the dependence between a dependent variable, y, and its changing independent variables, x. The slope is b, and a is the location of where the hyperplane crosses the vertical y-axis. Let's say we want to predict the daily views for someone's YouTube channel. Then we might look at an independent variable such as the number of subscribers. And let's say we also look at five other YouTubers and plot their x and y values on a scatter plot as shown here, with y being daily views and x being the number of YouTube subscribers. We can summarize this relationship using a hyperplane based on the algorithm's equation of y = bx + a, and the job of the model is to find the slope and the y-intercept to fix the position of the hyperplane. And then we can use the hyperplane to make some predictions. So if our friend's YouTube account has 400 subscribers, then we can predict that he or she attracts approximately 1000 views each day. Now, approximately is a key word here, as linear regression is not renowned for amazing accuracy, and accuracy itself really comes down to the data. If you have a lot of outliers in your data, such as a YouTuber who gets thousands of views with hardly any subscribers because they pay for advertising, then this is naturally going to radically shift the position of the hyperplane.
So in general, the closer the data points are to the hyperplane, the more accurate the predictions. If there is a high deviation between the data points and the regression line, the model may not prove very accurate at its predictions. So one of the major prerequisites of regression analysis is that the variables behave in a linear manner. For example, on the right-hand side we have four plots, and plot A, which is a positive correlation, and plot B, which is a negative correlation, are both examples of a linear relationship, where a given change in the independent variable produces a reliable and corresponding change in the dependent variable. Negative correlation means an increase in the independent variable leads to a decrease in the dependent variable. For example, a house's value, which is the dependent variable, is likely to depreciate as the distance to the city, which is the independent variable, increases. So the further that house is from the city, the more likely the house's value is to go down in price. Conversely, a positive correlation captures a positive relationship between the dependent and independent variables. House value as a dependent variable, for instance, increases in sync with the size of the house as the independent variable. In the case of plots C and D, there is no linear relationship between variables, making regression analysis a poor choice for interpreting these patterns. Also, another primary requirement of linear regression is that both your x and y variables are continuous, in the form of whole numbers, which are integers, or with a decimal place in the form of a floating-point number. So let's look at a sample dataset, and we can see that the number of rooms, price, distance to the city, number of bathrooms, and land size are all examples of continuous variables that we can use to assess mathematical relationships based on correlation.
Categorical variables, meanwhile, like suburb and address, don't comply with linear regression, as they are not expressed in numbers and we can't plot them on a scatter plot. And even some of the other variables, such as car and postcode, are also not eligible, because they are discrete and not continuous. Car, for example, does have integers in terms of zero and one, but it is discrete, because houses that do have a carport are classified as one and houses without a carport are classified as zero. So these numbers are placeholders rather than natural numbers that we can aggregate. Okay, so let's now walk through a practical example of how to use linear regression. On the first slide, we looked at predicting YouTube views based on only one independent variable, and we plotted that on a 2D scatter plot. In machine learning, though, you're virtually never looking at one independent variable, and if we have five independent variables, we don't have enough axes on the scatter plot to plot every single independent variable. So we actually do away with visualizing the hyperplane as a physical line. This doesn't mean we're not using a hyperplane, just that the hyperplane splits the data in a higher number of dimensions than we as humans can visualize. So in this example, we have four independent variables, with rooms, distance to the city, bathroom, and land size as the independent variables, and the dependent variable is price, which is what we want to predict. So, using a dataset of houses with each of these features, we can feed that data into the algorithm of y = bx + a, and the algorithm has to work out the y-intercept and the slope for each independent variable, which is b. And from there we can generate the slope coefficients and the y-intercept. So rooms, for example, is worth 400,000, and the y-intercept is 300,560.
We can then take a new house and input the independent variables, such as the number of rooms, the distance to the city, bathroom, and land size, and then multiply each with the coefficients. For example, rooms would be 400,000 times two, and the distance to the city would be 30,000 times 10, and each of these is added with the y-intercept to then get the prediction, which is y, which in this case is $595,802. Now, moving on to the strengths and weaknesses of this algorithm. Linear regression is very effective for analyzing linear relationships, as mentioned. It's also fast to run, it's simple, and it's fairly transparent, meaning it's easy to interpret. Features can be noted by their slope value, and we can see which features have the biggest impact on the final result. And also, it rarely encounters overfitting, which we'll discuss in a second. Now, the downsides of the algorithm are that it's highly sensitive to outliers, as discussed earlier, because they drastically reposition the hyperplane. It's also vulnerable to collinearity, which occurs when there is a strong linear correlation among two independent variables, such that they do not provide unique and independent information to the regression model, thereby undermining the prediction of the dependent variable. An example of this would be using liters of fuel consumed and liters of fuel remaining in the tank to predict car mileage. Both variables are directly correlated, and in this case negatively correlated, so when put into the same model to predict fuel mileage, they have a tendency to cancel each other out, and you're better off including one of these variables and leaving the other one out. Height and weight are another example of two variables that can be highly correlated, which can lead to problems with collinearity. And the final drawback is underfitting, which we can discuss in tandem with the opposite concept of overfitting.
So underfitting basically occurs when the model is too simple, and overfitting is when the model is too complicated, meaning it might not generalize well with new data. So because the linear regression hyperplane is straight and does not bend to patterns in the data, it is typically oversimplifying the patterns in the dataset, which can lead to underfitting. And equally, because the hyperplane doesn't bend to patterns in the data, it hardly ever runs into trouble with overfitting, which is where the model is overcomplicated and might work well with the training data but fail when it comes to the test data, which contains patterns that diverge slightly from the training data. We will dig further into underfitting and overfitting later in the course. But for now, just remember that linear regression can be susceptible to underfitting, or oversimplifying the true patterns of the data, which can result in some loss of accuracy with its final predictions.
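Here's a minimal sketch of the multi-variable workflow from this lesson, with made-up houses and prices rather than the slide's figures: fit the model on several independent variables, inspect the learned slope coefficients and y-intercept, then predict a new house's price from its x values alone.

```python
# Multi-variable linear regression: one slope coefficient (b) per
# independent variable, plus a shared y-intercept (a).
# All houses, prices, and the new house are invented for illustration.
from sklearn.linear_model import LinearRegression

# columns: rooms, distance to city (km), bathrooms, land size (sqm)
X_train = [
    [2, 10, 1, 300], [3, 9, 1, 400], [3, 8, 1, 450], [3, 5, 2, 500],
    [4, 4, 2, 600], [4, 3, 2, 650], [5, 2, 3, 800], [6, 1, 3, 900],
]
y_train = [320_000, 360_000, 400_000, 470_000,
           530_000, 560_000, 680_000, 760_000]

model = LinearRegression().fit(X_train, y_train)

slopes = model.coef_           # one b value per independent variable
intercept = model.intercept_   # a, where the hyperplane crosses the y-axis

# predict a new house's price from its independent variables alone
new_house = [[4, 6, 2, 600]]
predicted = model.predict(new_house)[0]
```

Inspecting `slopes` is what makes linear regression so transparent: each coefficient shows how much the predicted price moves per unit change in that feature.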
12. Logistic Regression: Now, whereas linear regression quantifies the relationship between continuous variables, logistic regression is used as a classification technique to predict discrete classes. Hence, rather than quantify how much a customer will spend over a given period of time, logistic regression is used to qualify whether the customer is, for example, a new customer or a returning customer. There are thus two discrete possible outputs: new customer or returning customer. Similar to linear regression, logistic regression measures a mathematical relationship between the dependent variable and its independent variables, but then adds what's called a sigmoid function to convert the relationship into an expression of probability between zero and one. A value of zero represents no chance of occurring, whereas one represents a certain chance of occurring. After assigning data points to classes using the sigmoid function, a hyperplane is used as a decision boundary to split the classes and predict the class of new data points. So in this example, we're using the same dataset as the previous video for linear regression. However, this time we can select more variables, because we can accept both discrete and continuous variables into our model. So here we have rooms, price, distance, postcode, bathroom, car, and land size as our independent variables, and the dependent variable that we wish to predict is the type. So we're trying to predict whether the property is a house, which is a zero, or a unit, which is one. So if we then take our training data, which consists of x and y examples from the dataset, we can then put that into the algorithm to build a model. Once that model has been developed, we can then test it on the test data, or take a new unknown data point, put that into the model, and see what prediction it produces.
So in this case, we have a new property where the x variables are known, such as the land size and the distance to the city, and we put those values into the model, and it will then predict whether that property is a house or an apartment. And in this particular example, it predicts that it is a house. So in terms of the strengths and weaknesses of this algorithm, it's very fast to run, it's transparent, and it really does excel at binary classification, which is predicting one of two discrete classes, for example, pregnant or not pregnant. Given its strength in binary classification, logistic regression is therefore used in many fields, including fraud detection, disease diagnosis, emergency detection, loan default detection, and identifying spam email, through the process of discerning specific classes, such as spam or non-spam. The weakness is that it's not so good at analyzing a high number of variables, and it's also very sensitive to outliers. So other techniques, such as support vector machines, may be required to overcome significant issues with outliers.
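The sigmoid idea from this lesson can be shown with a tiny from-scratch sketch. The slope `b` and intercept `a` below are invented for illustration (in practice they're learned from the training data), but the mechanics are the same: a linear score is squashed into a probability between 0 and 1, then thresholded at 0.5 to choose one of two discrete classes.

```python
# Logistic regression mechanics: linear score -> sigmoid -> probability
# -> discrete class. The weights b and a here are hypothetical, not learned.
import math

def sigmoid(z):
    # squashes any real number into the range (0, 1)
    return 1 / (1 + math.exp(-z))

b, a = 0.8, -4.0    # made-up slope and intercept

def predict_class(x):
    probability = sigmoid(b * x + a)      # expression of probability, 0..1
    return "unit" if probability >= 0.5 else "house"

label = predict_class(9)    # a large x pushes the probability toward 1
```

The 0.5 threshold is the decision boundary the lesson describes: every input on one side of it is assigned one class, and everything on the other side gets the other class.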
13. Bias and Variance: So, as alluded to in previous videos, we're now going to talk about underfitting and overfitting, and also bias and variance, which describe how closely your model understands the actual patterns in the data. Now, to understand underfitting and overfitting, we first need to focus on bias and variance. Bias is the term used to refer to the gap between the value predicted by your model and the actual value. Variance, meanwhile, describes how scattered your predicted values are in relation to each other. Bias and variance can also be understood by viewing a visual representation using the shooting targets on the left-hand side. Shooting targets have nothing to do with machine learning, but we can use them here as a visual metaphor for explaining bias and variance. So if we imagine that the center of the target, or the bullseye, perfectly predicts the correct value of your data, the dots marked on the target represent individual predictions of your model based on the training or test data provided. Now, in some cases, the dots will be densely positioned close to the bullseye, ensuring that the predictions made by the model are very close to the actual patterns of the data. In other cases, the model's predictions will perhaps lie more scattered and further apart. The more the predictions deviate from the bullseye, the higher the bias and the less reliable your model is at making accurate predictions. In the first target, we can see an example of low bias and low variance. The bias is low because most predictions are closely aligned to the center, and there is low variance between the predictions because they're positioned densely in one general location. The second target shows a case of low bias and high variance. Although the predictions are not as close to the bullseye as the previous example, they're still pretty near, and the bias is therefore relatively low.
However, there is high variance this time, because the predictions are much more spaced out. The third target, meanwhile, shows high bias and low variance, and the fourth target shows high bias and high variance. Some combinations of bias and variance can lead to overfitting and underfitting, as shown in the next diagram on the right-hand side. So on the left we have underfitting, where the model is too simple and the linear regression hyperplane is oversimplifying the patterns in the dataset. This will lead to some cases of high bias, despite the controlled variance in the model, as it looks very consistent, especially in contrast to the model on the right. On the right-hand side, we have a case of low bias and high variance, which is leading to some overfitting, and this might be good in the short term, but it could lead to inaccurate predictions when the model is fed with new data that contains some variations in the pattern. Ideally, you want to encounter a situation where there's both low variance and low bias, but in reality there's usually a trade-off between optimal bias and optimal variance. Bias and variance both contribute to error, but it's the prediction error that we want to minimize, not the bias or variance necessarily, and finding an optimal balance is one of the most challenging aspects of machine learning. In order to mitigate underfitting and overfitting, you may need to modify the model's hyperparameters to ensure that they fit the patterns of both the training and test data, and not just one split of the data. An ideal fit should acknowledge significant trends in the data and play down or even omit minor variations. This might mean re-randomizing your training and test data, adding new data points to better detect underlying patterns, or switching algorithms to manage the issue of the bias-variance trade-off.
Specifically, this might entail switching from linear regression, which is a straight line, to nonlinear regression, which can bend and curve to the data, to thereby reduce bias by increasing variance.
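One way to see this trade-off in action is to fit a straight line versus a curved model to data with a genuinely curved pattern. This sketch uses synthetic noisy quadratic data, and the polynomial degrees are arbitrary choices, picked only to contrast a too-simple model with a too-flexible one:

```python
# Underfitting vs overfitting on made-up data: a straight line
# (degree 1) cannot bend to a curved pattern (high bias), while a
# high-degree polynomial bends to every wiggle (high variance).
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 20)
y = x**2 + rng.normal(0, 0.5, size=x.size)   # true pattern is curved, plus noise

line = Polynomial.fit(x, y, deg=1)(x)        # too simple: underfits
wiggly = Polynomial.fit(x, y, deg=9)(x)      # too flexible: risks overfitting

line_train_error = np.mean((y - line) ** 2)
wiggly_train_error = np.mean((y - wiggly) ** 2)
```

The wiggly model always scores better on the training data, but as the lesson notes, that low training error can hide poor generalization to new data, which is why we judge models on the prediction error for unseen data rather than the training fit alone.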
14. Support Vector Machines: Next we have SVM, support vector machines, which is considered one of the best classifiers in supervised learning for analyzing complex datasets and downplaying outliers. Developed inside the computer science community in the 1990s, SVM was initially designed for predicting numeric and categorical outcomes as a sort of double-barrel prediction technique. Today, SVM is mostly used as a classification technique for predicting categorical outcomes, similar to logistic regression. In binary prediction scenarios, SVM mirrors logistic regression, as it attempts to separate classes based on the mathematical relationship between variables. Unlike logistic regression, however, SVM attempts to separate data classes from a position of maximum distance between itself and the partitioned data points. One of its key features is the margin, which is the distance between the boundary line and the nearest data point, multiplied by two. The margin provides support to cope with new data points and outliers that would otherwise infringe on a logistic regression boundary line, as shown in the diagram. So here we can see that line A, logistic regression, sits right up against the data points on both sides, which is great for the training data, but with new data points, that may lead to some misclassifications. Whereas with the B line, which is SVM, you have the greater margin there, which gives you a lot more of a buffer and more room to move if new data points diverge slightly from the existing data points that the model was trained on. So if you look at a new data point, which is classified as a circle, it is correctly categorized by the B SVM line, but incorrectly predicted by the A logistic line. So that's just one example of where SVM gives you a lot more flexibility in predicting new data points. Now, the SVM boundary can also be modified to ignore cases in training using a hyperparameter called C.
A hyperparameter, remember, is just a setting of the algorithm. Now, by loosening this C hyperparameter, this introduces what's called a soft margin, which ignores a determined portion of cases that cross over the soft margin, leading to greater generalization in your model. A large C value, meanwhile, narrows the width of the margin to avoid misclassification in training, but may lead to overfitting down the road. The C value is usually chosen experimentally, and can be automated over time using grid search. So in this slide we have an example of a soft margin and a hard margin. With the soft margin, there's a lot more leeway for data points to fall within the margin, whereas with a hard margin, there's much less tolerance of data points infringing on the margin. Now, SVM is quite sensitive to feature scale, so you may need to rescale the data prior to training. Using standardization, for example, you can convert the range of each feature to a standard normal distribution with a mean of zero, and standardization is implemented in scikit-learn using StandardScaler. Okay, so let's look at the strengths and weaknesses of this algorithm. SVM, first of all, is regarded as one of the best classifiers in supervised learning, and it really excels at untangling outliers from complex data patterns and managing high-dimensional datasets. It also helps to downplay outliers, it works well with predicting multiple output classes, and lastly, it excels when the number of features is larger than the number of items in the dataset. The main drawback, though, is the processing time it takes to train a model relative to other classification techniques, such as logistic regression. In particular, SVM is not recommended for datasets with a low feature-to-row ratio, which means a low number of features relative to rows, due to speed and performance constraints.
So this might not be the best algorithm for trying to get a quick benchmark impression of a dataset. But if you are looking for accuracy, and you do have the time and resources, then this can be a very, very effective algorithm to use with your dataset.
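A small sketch tying these points together: scale the features with StandardScaler (since SVM is sensitive to feature scale), then train an SVC whose C hyperparameter controls how soft or hard the margin is. The data below is made up for illustration:

```python
# Standardize features, then fit SVMs with a small C (wider, softer
# margin) and a large C (narrower, harder margin). Data is invented.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# two well-separated groups on very different feature scales
X = [[1, 100], [2, 200], [3, 150], [8, 800], [9, 900], [10, 850]]
y = ["A", "A", "A", "B", "B", "B"]

soft_model = make_pipeline(StandardScaler(), SVC(C=0.1)).fit(X, y)   # soft margin
hard_model = make_pipeline(StandardScaler(), SVC(C=100)).fit(X, y)   # hard margin

prediction = soft_model.predict([[9, 870]])[0]
```

In practice, as the lesson says, candidate C values are usually compared experimentally, and scikit-learn's grid search tools can automate that comparison.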
15. k-Nearest Neighbors: The second clustering technique is k-nearest neighbors (k-NN), which is a supervised learning technique used with continuous variables. It predicts the category of new data points according to the known category of nearby data points. This process of classification is determined by setting k, the number of data points closest to the target data point. So, equivalent to a voting system, we can predict the category of new data points based on their position relative to the existing data points, as demonstrated in the diagram. First, though, we need to set k to determine how many data points we wish to nominate in order to classify the new data point. If we set k to 3, for example, k-NN analyzes the new data point's position relative to the three nearest data points, or neighbors. The outcome of selecting the three closest neighbors returns two Class B data points and one Class A data point in this example. So, as defined by k=3, the prediction for the category of the new data point is Class B, because that class returns two matches among the three nearest neighbors. The chosen number of neighbors, defined by k, is very important in determining the results. You can see here that the outcome of classification changes when altering k from 3 to 7. It's therefore useful to test numerous k combinations to find the best fit and avoid setting k too low or too high. The main advantage of this algorithm is that it's very simple to understand, and it's easy to implement on small data sets. The downside is that it imposes a fairly high linear cost on computing resources, and it doesn't do so well with high-dimensional data, because this is taxing on both computing resources and accuracy. So although k-nearest neighbors is fairly accurate and a great, simple technique, storing an entire data set and then calculating the distance between each new data point and all the existing data points really does impose a linear cost on computing resources. This means that the number of data points in the data set is proportional to the time it takes to execute a single prediction, which can lead to slow processing times. For this reason, k-NN is generally not recommended for analyzing large data sets. The other downside is that it can be difficult to use k-NN with high-dimensional data. Measuring multiple distances between data points in, say, an eight- or nine-dimensional space is taxing on computing resources, and it makes it difficult to perform accurate classifications.
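To make the voting idea concrete, here is a minimal sketch of k-NN classification using scikit-learn. The toy coordinates and the choice of k values are invented for illustration; they are not the points from the course's diagram.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy two-dimensional data: Class 0 points cluster in the lower-left,
# Class 1 points in the upper-right (made-up values for illustration).
X = np.array([[1, 1], [1, 2], [2, 1], [2, 2],
              [6, 6], [6, 7], [7, 6], [7, 7]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

new_point = [[3, 3]]  # the new data point we want to classify

# Try two different k values; the k nearest neighbors "vote" on the class.
for k in (3, 7):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X, y)
    print(f"k={k} -> predicted class {knn.predict(new_point)[0]}")
```

Because every prediction measures the distance from the new point to all stored points, the cost grows linearly with the size of the data set, which is exactly the limitation described in this section.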
16. k-Means Clustering: Our next algorithm is k-means clustering, which is a popular unsupervised learning technique for identifying groups of data points. We have no upfront knowledge of existing categories, and it's useful for scenarios where you want to find new, unidentified groupings. k-Means works by dividing data into k clusters, with k representing the number of clusters. Each cluster is randomly assigned a centroid, which is a center point: a data point that becomes the epicenter of an individual cluster. The remaining data points are then assigned to the closest centroid, and the centroid coordinates are updated based on the mean of the new cluster. This update may cause some data points to switch allegiance and join a new cluster based on their comparative proximity to a different centroid. If so, the centroid coordinates must be recalculated and updated, and this may lead to a shift in the shape of the cluster. This process is repeated by matching data points to the closest centroid, and the algorithm usually takes several iterations to find an ideal solution, reaching a final output where no data points switch clusters after updating the centroids' positions. So, having covered some of the theory of k-means clustering, let's now look at the following diagrams to break down the full algorithmic process. Okay, so as this is an unsupervised learning technique, usually we don't know the category of our existing data points. Using k-means clustering, we first need to generate two random centroids to act as the center points for our two clusters. Here we have two centroids highlighted in blue, and then we need to align the remaining data points to the closest centroid, which produces the following result: here we have a cluster on the left and a cluster on the right.
The cluster on the left has three data points, whereas the cluster on the right has four data points, and these two clusters are formed after calculating the Euclidean distance of the data points to the two centroids. So now that we have our two clusters, we need to update the centroid coordinates to represent the mean value of the data points contained in each cluster. So now we'll look at version two, where the centroids have changed location. In terms of the left cluster, the centroid has moved more to the right, and for the cluster on the right, the centroid has moved to the left. What has also happened is that one of the data points has actually switched clusters. So if we go back for a second, we can see that this data point here, with the pink arrow, has switched to the other cluster. Given that one of the data points has changed to a different cluster, we now need to go back and recalculate the mean coordinates of the data points contained in each cluster. That brings us to version three. In version three, the centroid in the left cluster remains the same because nothing there has changed. However, the centroid in the right cluster has been updated, and it's now sitting on top of an existing data point. This doesn't mean we removed any data points; the centroid simply occupies the exact same location as an existing data point. And now we have our two final clusters, formed based on the updated centroid of the top-right cluster. So for this example, it took three iterations to successfully create our two clusters. However, k-means clustering is not always able to identify a final composition of clusters, and in some cases you will need to switch tactics and utilize another algorithm to formulate the classification model. Now, in terms of the strengths and weaknesses of this algorithm, k-means clustering is great at uncovering previously unknown groups as an unsupervised learning technique.
It's also good for analyzing complex data patterns, and it's very fast to run compared to other clustering algorithms. Regarding the limitations of cluster analysis, the most notable are multi-dimensional analysis, where there are a lot of variables, and the identification of variable relevance and quality. This is because success depends largely on the quality of your data, and there's no mechanism in k-means clustering to differentiate between relevant and irrelevant variables. k-Means clustering also segments data points based on the variables that you provide, but this isn't to say that the variables you selected are relevant, especially if chosen from a large pool of variables. In summary, k-means clustering is useful when you already have an indication of how many groups exist in the data set and you want to check how those groups look, or in situations where you don't know how many groups exist and you want the algorithm to produce an estimate.
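The assign-and-update loop described in this section can be sketched with scikit-learn's KMeans. The seven coordinates are invented for illustration, standing in for the two groups in the diagrams.

```python
import numpy as np
from sklearn.cluster import KMeans

# Seven made-up data points loosely forming two groups.
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
              [8.0, 8.0], [8.5, 9.0], [9.0, 8.0], [9.0, 9.5]])

# k=2 clusters; KMeans repeats the assign-points / update-centroids loop
# until no data point switches cluster.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(labels)                   # which cluster each data point ended up in
print(kmeans.cluster_centers_)  # the final centroid coordinates
print(kmeans.n_iter_)           # how many iterations it took
```

Note that `n_clusters` has to be chosen up front, which is why the algorithm suits cases where you already have some indication of how many groups exist.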
17. Decision Trees: Okay, in a lot of situations there is no simple path for predicting an outcome. Doctors, for example, only decide on the diagnosis of a patient after asking them an extended series of questions. Likewise, salespeople try to elicit as much information from the customer as possible through questioning, in order to understand the customer's individual requirements. In both cases, the outcome of each question links together to produce a final decision, such as a medical diagnosis or a recommendation for a car model to buy. This useful method of structuring and solving problems is used in data science through a technique called decision trees. As a supervised learning algorithm, decision trees structure a series of rules in a hierarchical sequence based on variables. At each variable, sub-branches are used to connect data points to leaves based on each of the variable's outcomes, such as sunny, overcast, and rainy. The data points contained in those leaves are then exposed to the next variable, and this process is repeated using the data points collected in each new leaf. Eventually, an output is reached when a leaf no longer generates any new branches and results in what's called a terminal node. The aim of this method is to keep the tree as small as possible, which means using as few variables as possible. This is achieved by selecting, at each step, the variable that best splits the data into homogeneous groups, such that it minimizes the level of variance at the next branch. Also, by analyzing outputs against inputs, decision trees are an example of supervised learning, and they can be used as both a regression and a classification technique. So in classification, we may be seeking to predict the general weather forecast of, say, raining, snowing, or fine, and in regression, we may be aiming to predict the actual temperature expressed in Celsius or Fahrenheit.
Now, part of the appeal of decision trees is that they can be displayed graphically, and they're very easy to explain to non-experts. When a customer queries why they weren't selected for a home loan, for example, you can share the decision tree to show the decision-making process, which is not possible when using a black-box technique such as a neural network. A notable drawback of this technique, though, is its susceptibility to overfitting the model to the training data. While decision trees are useful for explaining a model's decision structure, because there is a fixed sequence of decision paths, any variance in the test data or new data can upset its predictions. The fact that there's only one tree design also limits the flexibility of this method to manage variance and future outliers.
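Both points, the readable rules and the risk of memorizing one data set, can be seen in a small sketch. The weather data, its numeric encoding, and the feature names below are all hypothetical; scikit-learn's `export_text` prints the learned decision path, which is the explainability advantage just discussed.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical weather data: [outlook, temperature],
# with outlook encoded as 0=sunny, 1=overcast, 2=rainy.
X = [[0, 30], [0, 28], [1, 22], [1, 25], [2, 18], [2, 15]]
y = ['fine', 'fine', 'fine', 'fine', 'rain', 'rain']

# max_depth keeps the tree small, in line with the aim of using few variables.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# Unlike a black-box neural network, the decision rules can be printed out.
print(export_text(tree, feature_names=['outlook', 'temperature']))
print(tree.predict([[2, 16]]))  # classify a new rainy, 16-degree day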
18. Random Forests: In the previous video we looked at decision trees, and we noted that this technique is vulnerable to overfitting, as the model relies on a fixed sequence of decision paths designed based on one set of data. That's great for explaining that particular set of data, but the model might not work well with new data that contains outliers or even slightly different patterns. So, instead of trying to build a single tree that is highly tuned to one set of training data, which may limit its scalability, an alternative technique is to construct multiple trees and combine their predictions. A popular example of this technique is random forests. This technique involves merging a collection of independent decision trees to obtain a more accurate and stable prediction. Rather than aiming for the most efficient split at each layer of a single tree, random forests work by growing multiple trees and then combining the results into one output, by averaging in the case of regression, or voting in the case of classification. By combining predictions into a unified prediction model, random forests are an example of ensemble learning. The key premise of ensemble learning is that, rather than relying on a single estimate, ensemble modeling helps to build a consensus on the meaning of the data by combining predictions from different models. Now, random forests also artificially limit the choice of variables by capping the number of variables considered for each split. In other words, the algorithm is not allowed to consider all n variables at each partition. By randomly limiting the choice of variables, this algorithm gives other variables a greater chance of selection, which helps to create more unique trees. And this is really important, because for multiple decision trees to generate unique insight, there needs to be an element of variation and randomness spread across each tree model. There's little sense in compiling five or ten identical trees.
And while no single tree alone appears as well optimized as it could be, the dominant patterns in the data will surface in the final prediction as more trees are added. In addition, random forests extract a different random variation of the data to train each tree, using a technique called bootstrap sampling. And lastly, by using random variations of the training data, random forests can deal with outliers and lower the degree of variance in a way that a single decision tree can't.
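As a minimal sketch with scikit-learn (using synthetic data as a stand-in for a real data set): `max_features` caps the variables considered at each split, and `bootstrap=True` feeds each tree a different random sample, which are the two sources of randomness just described. The hyperparameter values are illustrative, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 200 rows, 8 variables, binary outcome.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# 100 trees; 'sqrt' limits each split to a random subset of the variables,
# and bootstrap=True trains each tree on a random sample of the rows.
forest = RandomForestClassifier(n_estimators=100, max_features='sqrt',
                                bootstrap=True, random_state=0)
forest.fit(X, y)

print(forest.predict(X[:5]))  # each prediction is a vote across 100 trees
```

Because the trees are independent, they can be trained in parallel, which matters for the speed comparison with gradient boosting later on.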
19. Gradient Boosting: Okay, now for our final algorithm, which is gradient boosting. Like random forests, gradient boosting provides yet another regression and classification technique for aggregating the outcome of multiple decision trees to obtain a prediction. However, rather than building random, independent variants of a decision tree in parallel, gradient boosting is a sequential method that aims to improve the performance of the decision trees with each subsequent tree. This works by evaluating the performance of weak trees and then weighting subsequent trees to mitigate the outcome of instances misclassified in earlier trees. Instances that were classified correctly by the previous tree are replaced with a higher proportion of instances that weren't classified well. This, in effect, creates another weak tree. This new tree is able to key in on the mistakes made by the previous tree, and it's this adaptability of the algorithm to learn from its mistakes that makes gradient boosting one of the most used algorithms in machine learning today. Now, a unique distinction of this algorithm is that it deliberately starts out by creating weak learners. The transition from a weak to a strong learner is achieved by making small changes to each new tree's decision structure. By weak, I mean that the model is a poor predictor, perhaps only marginally better than a random guess. So, for instance, if it's a binary outcome with two possible outcomes, you would roughly have a 50% chance of getting the right answer just by guessing, so a weak model would only be slightly better than a 50% chance prediction. A strong model, though, is one whose predictions are very strongly correlated with the actual outcome or the true classification. So what happens is that, based on misclassifications or errors from the first few trees, new trees are grown that learn from the weak learners to create progressively stronger trees.
Then, using weighting, the algorithm allocates votes to each tree based on performance. This means that the best individual models have the greatest say in the final prediction. This, as you will remember from the previous video, is different from random forests, where the trees have an even impact on the final decision regardless of their individual accuracy. The end result is that gradient boosting usually has an edge over random forests in terms of accuracy, because of this weighting system and its sequential approach to self-learning. So in this slide we have an example of gradient boosting where we have three trees. Of course, you'd have far more trees in your actual model, but this is just a basic illustration of the algorithm. So we have tree one, tree two, and tree three, and we can see that they have different outcomes. Now, as mentioned, the weighted voting system is a major aspect of this algorithm. By taking the outcome of each decision tree and then factoring in its actual prediction error rate, it creates a weighted voting system which generates the final prediction based on the accuracy of each individual tree and its predictions. Now, while adding more trees to a random forest usually helps offset overfitting, the same approach can cause overfitting when it comes to gradient boosting. In general, the model should not fit too closely to outlier cases, but this can be difficult to avoid, as gradient boosting is constantly optimizing to offset earlier errors that might be due to outliers in the data. So for complex data sets with a large number of outliers, random forests may therefore be the preferred algorithm for the job. Okay, so let's now turn to the strengths of this algorithm. As mentioned, it's highly accurate with consistent data. It's also compatible with both discrete and continuous variables as input and as output, and it's generally more accurate than random forests in terms of prediction accuracy, thanks to the voting system and the sequential approach to self-learning. The weakness is that this algorithm, like random forests, cannot be visualized easily the way a single decision tree can, and also, as mentioned, more trees can lead to overfitting. The final downside is that this model can be a little slow to build. This is because the trees are trained sequentially and each tree must wait for the previous tree, thereby limiting the production scalability of the model, especially as more trees are added. A random forest, meanwhile, is trained in parallel, and this makes it much faster to train the model.
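Here is a hedged sketch using scikit-learn's GradientBoostingClassifier on synthetic data. This is not the course's own code, and the hyperparameter values are purely illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data: 200 rows, 8 variables, binary outcome.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Trees are grown one after another; a small learning_rate keeps each new
# tree a deliberately weak learner that nudges the model toward correcting
# the errors of the trees before it.
gbm = GradientBoostingClassifier(n_estimators=150, learning_rate=0.1,
                                 max_depth=3, random_state=0)
gbm.fit(X, y)

print(gbm.score(gbm_X := X, y))  # accuracy on the training data
```

Because each of the 150 trees depends on the one before it, training cannot be parallelized across trees, which is the speed drawback described above.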
20. Evaluation: After implementing your chosen algorithm, the next step is to evaluate the results. The angle of evaluation will depend on the scope of your model, and specifically on whether it is a classification or a regression model. In the case of classification, common evaluation methods include the confusion matrix, the classification report, and the accuracy score. So, starting with the accuracy score, this is a simple metric measuring how many cases the model classified correctly, divided by the total number of cases. If all predictions are correct, the accuracy score is 1, and it is 0 when all predicted cases are wrong. While accuracy alone is normally a reliable metric of performance, it may hide a lopsided number of false positives or false negatives. What I mean here is that although you know the overall accuracy, you don't know whether the error is due to a high number of false positives, a high number of false negatives, or an even balance of the two. This leads us on to the confusion matrix. Also known as an error matrix, the confusion matrix is a simple table that summarizes the performance of the model, including the exact number of false positives and false negatives. So, as seen in the top-left box in red, the model in this example predicted 134 data points correctly as 0, and also, in the bottom red square, 125 cases correctly as 1. The model also mispredicted 12 data points as 1 when they should have been predicted as 0, and 29 cases as 0 when they should have been predicted as 1. From the confusion matrix, we can then analyze the ratio of false positives to false negatives, as well as calculate the final accuracy of the predictions, by dividing the total number of false positives, which was 12, and false negatives, which was 29, by the total number of data points, which in this case is 300. So if we take 12 plus 29 divided by 300, we get 13.67%.
So the model misclassified 13.67% of data points, and if we take the inverse of this calculation, we have the accuracy score of the model, which is 86.33%. Another very popular evaluation technique for classification is the classification report, which consists of three different metrics: precision, recall, and the F1 score. Starting with precision, this is the ratio of correctly predicted true positives to the total number of predicted positive cases. A high precision score translates to a low number of false positives, and this metric addresses the question of how accurate the model is when predicting a positive outcome. This, in other words, is the ability of the model not to label a negative case as positive, which is important in the case of drug testing, for example. The next metric is recall, which is the ratio of correctly predicted true positives to the total number of actual positive cases. This metric addresses the question of how many positive outcomes were rightly classified as positive, and it can be understood as the ability of the model to identify all positive cases. Also note that the numerator, on top, is the same for both precision and recall, while the denominator, on the bottom, is different. And lastly, we have the F1 score, which is a weighted average of precision and recall. It's typically used as a metric for model-to-model comparison rather than for internal model accuracy. In addition, the F1 score is generally lower than the accuracy score, due to the way recall and precision are calculated. I should also mention support, which is not an evaluation metric per se, but just a tally of the total number of positive and negative cases, respectively, in the data set.
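All of these figures can be reproduced directly from the confusion matrix counts used in the example above (134 true negatives, 125 true positives, 12 false positives, 29 false negatives):

```python
# Counts from the confusion matrix example above.
tn, fp, fn, tp = 134, 12, 29, 125
total = tn + fp + fn + tp            # 300 data points

error_rate = (fp + fn) / total       # share of misclassified cases
accuracy   = (tn + tp) / total       # the inverse of the error rate

precision = tp / (tp + fp)           # how trustworthy a positive prediction is
recall    = tp / (tp + fn)           # how many actual positives were found
f1 = 2 * precision * recall / (precision + recall)

print(f"error rate {error_rate:.2%}, accuracy {accuracy:.2%}")
print(f"precision {precision:.2f}, recall {recall:.2f}, F1 {f1:.2f}")
```

Running this gives an error rate of 13.67% and an accuracy of 86.33%, and the F1 score (0.86) indeed comes out lower than the accuracy score, as noted above.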
Okay, so having talked about classification evaluation techniques, we now need to talk about regression. The two most common measures are mean absolute error (MAE) and root mean square error (RMSE). Mean absolute error measures the average magnitude of the errors in a set of predictions: that is, how far was the regression line from the actual data points? Root mean square error, meanwhile, measures the standard deviation of the prediction errors, which informs how concentrated or spread out the prediction errors are in relation to an optimal fit. Now, given that the errors are squared before they are averaged, root mean square error is far more sensitive to large errors than mean absolute error. On the other hand, RMSE is not as easy to interpret as MAE, as it doesn't describe the average error of the model's predictions. Subsequently, RMSE is used more as a feedback mechanism to penalize poor predictions, rather than to investigate the actual error of each prediction.
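A tiny worked example (with invented numbers) shows why RMSE penalizes large errors more heavily than MAE:

```python
import math

# Hypothetical actual vs. predicted values from a regression model.
actual    = [10.0, 12.0, 15.0, 20.0]
predicted = [11.0, 11.5, 16.0, 26.0]   # the last prediction is badly off

errors = [p - a for a, p in zip(actual, predicted)]

mae  = sum(abs(e) for e in errors) / len(errors)            # average error size
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))  # square first, then average

print(mae, rmse)  # the single large error (6.0) pulls RMSE well above MAE
```

Here MAE comes out at about 2.1 while RMSE is above 3.0, driven almost entirely by the one large miss.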
21. Introducing the Class Project: As part of our class project, we'll be building a prediction model in Python to predict whether a user will click on an advertisement, using a free data set downloaded from kaggle.com and the decision trees algorithm, which we covered in a previous video. So before we commence our project, I just quickly want to go over the different steps involved in designing our model. Step one is to import our libraries. In machine learning, libraries are collections of pre-compiled and standardized code routines that enable you to manipulate data and execute algorithms with minimal use of code. In other words, they're basically your best friend, because rather than writing lines and lines of code in order to plot a simple graph or run a simple regression algorithm, you can often execute advanced functions from a relevant library in just one line of code. For our model, we'll be using two libraries: pandas and scikit-learn. Pandas is used to structure your data as what's called a data frame, which consists of rows and columns, similar to a spreadsheet, and you can also use pandas to import and manipulate data sets without affecting the source file. Scikit-learn, meanwhile, offers an extensive collection of shallow algorithms, including linear and logistic regression, decision trees, k-nearest neighbors, and so on. It also offers many evaluation metrics, such as mean absolute error, as well as data partition methods, including split validation and cross validation. Step two is to import the data set; in this case, we're using the advertising data set, which we'll download for free from kaggle.com. Then we have the data scrubbing process, where we use one-hot encoding to take categorical strings and convert them into numerical identifiers, and we'll also be removing some columns which aren't relevant to our model.
Then, as with any other machine learning model, you'll set your X and y variables before choosing your algorithm, which in this case is the decision trees classification algorithm. Then we need to train the algorithm on our training data and evaluate the results using a confusion matrix and classification report, as introduced in the previous video. Okay, so those are the steps. In the next few videos, I'm going to show you how to set up your development environment and also how to download the data set to complete this class exercise.
22. Installing a Development Environment: In order to build a prediction model in Python, we first need to install a development environment, and for this class we'll be using Jupyter Notebook. Jupyter Notebook is a very popular choice for machine learning practitioners and for online courses, as it combines live code, explanatory notes, and visualizations into one convenient workspace, and it runs from any web browser. Jupyter Notebook can be installed using the Anaconda distribution or Python's package manager, pip. As an experienced Python user, you may wish to install Jupyter Notebook via pip, and there are instructions available on the Jupyter Notebook website outlining this option. For beginners, though, I recommend choosing the Anaconda distribution option, which offers an easy click-and-drag setup, which I will now walk you through. Okay, let's start by navigating to anaconda.com, and from there we can go to Products and then Individual Edition. Let's now click on Download, which will take us to the bottom of the page, where we have our three operating systems: Windows, macOS, and Linux. You need to choose the appropriate operating system for your computer; in my case, I'll be using macOS. Also, there are two options: Python 3.7 and Python 2.7. For this class, I strongly recommend using Python 3.7, and I'm going to go ahead and install the 64-bit graphical installer. So let's now wait while this downloads. Okay, once that's downloaded, we can click on the file and proceed with Continue. I will now go through the various installation options: Continue, Continue, Agree, we can skip this, and Continue. I can now navigate to my apps and go to the Anaconda Navigator. Okay. After installing Anaconda onto your computer, you now have access to a number of data science applications through the Anaconda application, including RStudio, Jupyter Notebook, and graphing tools for data visualization.
Next, we need to select Jupyter Notebook by clicking on Launch inside the Jupyter Notebook tab. So we'll go to Launch, which will then open up Jupyter in our browser, and we can then go to New > Python 3 to open up a new notebook. Okay, so we now have our notebook, which is ready for coding, and in the next video we'll start producing our machine learning model in Python.
23. Downloading the Dataset: Okay, hi everyone, welcome back. Now, in order to complete our class project, we obviously need some data to work with, and one of the great resources available on the web is kaggle.com, which offers a whole range of different data sets that you can download to your computer. For our class project, we're using the advertising data set, which you can find at kaggle.com/fayomi/advertising. Also, keep in mind that you'll need to register an account on Kaggle, which is free, and then log in in order to download the data set. Once you reach this page, you can check out the data set below. We have some visualizations here, generated from the individual variables, including daily time spent on site, age, area income, daily internet usage, and so forth. These visualizations are great for getting an idea of the different variables, their distributions, and also their types. Okay, so now we can go up and download the CSV file. Once it's downloaded, you can unzip the file, and it's now ready for us to use as part of our class project to predict the outcome of the advertising data set.
24. Class Project: Predicting Ad Responses: So, as mentioned earlier, we will be using a decision tree classifier to predict the outcome of whether a user clicks an advertisement, using the advertising data set as part of our class project. Now, while decision trees may be used for regression or classification problems, this model uses the classification version of the algorithm, as we're predicting a discrete variable. Using the decision tree classifier algorithm from scikit-learn, we will attempt to predict the dependent variable, which is Clicked on Ad and is either 0 (false) or 1 (true), and the performance of the model will be evaluated using a classification report and a confusion matrix. Okay, so let's begin by importing the libraries for this model. First of all, we need pandas as pd. Then, from scikit-learn, we'll import the train_test_split method. Also from scikit-learn, we will need the decision tree classifier algorithm, and again from scikit-learn, we'll import the classification report and confusion matrix, which we'll use for evaluation purposes once we've trained our model. Step two is to import the advertising data set, which we downloaded in the previous video. We want to load the data set as a data frame and assign a variable to it. The variable here is df, which stands for data frame. You could use another variable name, but df is very common in machine learning. Using the pd.read_csv command, we can then call the data set, which is saved in our downloads folder, and the name of the file is advertising.csv. But keep in mind that this file directory name will differ depending on your operating system and also on where you saved the data set. Okay, so let's now convert the Country and City variables to numerical values using one-hot encoding.
So we need to call our data set, which is df, and then use the pd.get_dummies function, and then again call the data set and the columns that we wish to convert, which in this case are Country and City. Step four: we need to delete some variables. We're going to remove the discrete variables Ad Topic Line and Timestamp, which aren't relevant or practical for use with this model. So we use delete (del) and then df, which is the name of the data set, and then we insert the names of those columns that we wish to remove. Okay, let's now inspect the data frame using df.head. We go to Cell > Run Cells, and that will then populate the data frame below the input section. So here we have our different variables, and we can see the first five rows, and we can scroll across to see the other columns. As you can see, a lot of these variables have been one-hot encoded, which means that they've been split up into separate columns. Okay, let's now move on and assign our X and y variables. Clicked on Ad will serve as the dependent variable, y, for this exercise, while the remaining variables constitute our independent variables, which are X. The independent variables comprise daily time spent on site, age, area income, daily internet usage, male, country, and city. What we can do here is a shortcut: just set X equal to df and then drop the column Clicked on Ad, which will basically give us all the variables except Clicked on Ad. And then we can go to our y variable and link that to Clicked on Ad. Next, we'll split the data 70/30. We want to shuffle it and also set the random state to 10. Okay, so we write X_train, X_test, y_train, y_test equals train_test_split, calling on X and y, with a test size of 0.3. Setting the test size to 0.3 will automatically set the training data to 0.7, or 70%. Then we have our random state, which is number 10. You can use any random state.
It's just a basic bookmark to give you the same variation of data each time you reproduce the model. However, we're also shuffling the data using the shuffle parameter, which will be set to True. Let's now go ahead and assign the algorithm, which is the decision tree classifier, and fit the variable containing the decision tree algorithm to the X and y training data to build our model. The next step is to test the trained model on the X test data using the predict method, assigning a new variable name, which we're going to call model_predict. So we write model_predict equals model.predict on X_test. Then we can use the confusion matrix and classification report to review the performance of the model using the y test data. So let's go ahead and use print confusion_matrix on the y test data and model_predict. Then we need print classification_report, again on the y test data and also on model_predict. Okay, let's now run the model: we go to Cell and then Run Cells. And now, below, we have our results. We had 139 cases predicted correctly as 0 and 145 cases correctly predicted as 1. However, we had seven false positives and also nine false negatives. Now, in terms of precision and recall, we have 0.95, and likewise for the F1 score. So in general, it's a fairly accurate model, and we can also see the support was 300 cases: we had 146 cases of 0 and 154 cases of 1. Okay, so there we go: we've built our first prediction model in Python using the decision trees algorithm. If you come across any problems, or you need some help, or you have some feedback about using this model, please contact me at oliver.theobald@scatterplotpress.com.
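Pulling the whole walkthrough together, here is a sketch of the complete script. Since the Kaggle file isn't bundled here, a tiny synthetic data frame with the same column layout stands in for the pd.read_csv call; swap in the real advertising.csv file path to reproduce the results discussed above (the exact numbers will of course differ on this toy data).

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, classification_report

# Stand-in for: df = pd.read_csv('advertising.csv')
# (a small made-up frame with the same columns as the Kaggle data set).
df = pd.DataFrame({
    'Daily Time Spent on Site': [68.9, 80.2, 69.5, 74.2, 55.0, 60.1, 90.0, 45.3],
    'Age': [35, 31, 26, 29, 50, 47, 22, 58],
    'Area Income': [61833.9, 68441.9, 59785.9, 54806.2,
                    40000.0, 42000.0, 70000.0, 38000.0],
    'Daily Internet Usage': [256.1, 193.8, 236.5, 245.9,
                             120.0, 130.0, 260.0, 110.0],
    'Ad Topic Line': ['example ad topic'] * 8,
    'City': ['Wrightburgh', 'West Jodi', 'Davidton', 'West Terrifurt',
             'South Manuel', 'Jamieberg', 'Brandonstad', 'Port Jeffery'],
    'Male': [0, 1, 0, 1, 0, 1, 1, 0],
    'Country': ['Tunisia', 'Nauru', 'San Marino', 'Italy',
                'Iceland', 'Norway', 'Myanmar', 'Australia'],
    'Timestamp': ['2016-03-27 00:53:11'] * 8,
    'Clicked on Ad': [0, 0, 0, 0, 1, 1, 0, 1],
})

# Step 3: one-hot encode the categorical strings, then delete the columns
# that aren't practical for the model.
df = pd.get_dummies(df, columns=['Country', 'City'])
del df['Ad Topic Line']
del df['Timestamp']

# Step 4: assign the X and y variables.
X = df.drop('Clicked on Ad', axis=1)
y = df['Clicked on Ad']

# Step 5: 70/30 split, shuffled, with a fixed random state.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=10)

# Steps 6-7: train the classifier, then evaluate on the test data.
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
model_predict = model.predict(X_test)

print(confusion_matrix(y_test, model_predict))
print(classification_report(y_test, model_predict))
```

The column names match the Kaggle advertising data set, but the row values here are placeholders, so treat the printed confusion matrix as a demonstration of the workflow rather than of model quality.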