Machine Learning and Jupyter Notebooks and AWS

Qasim Shah, Digitization and marketing expert

14 Lessons (1h 42m)
    • 1. Course Agenda and Intro (3:32)
    • 2. Machine Learning on AWS (7:25)
    • 3. First steps in building a Machine Learning model (10:45)
    • 4. Understanding AWS Datasources (9:54)
    • 5. Machine Learning Training Models in AWS (7:24)
    • 6. Importance of Feature Transformation (6:26)
    • 7. Evaluating Models (13:40)
    • 8. Creating a Datasource and Model (7:01)
    • 9. Serverless machine learning inference with AWS Lambda (8:31)
    • 10. What is a notebook and Installing Jupyter (4:19)
    • 11. Creating first Jupyter Notebook through Anaconda (7:36)
    • 12. Data Analysis in Jupyter (6:43)
    • 13. Looking at Jupyter Kernels (5:20)
    • 14. Plotting with MatPlotLib (3:43)

About This Class

Are you a company, an IT administrator, a data center architect or consultant, an enterprise architect, a data protection officer, a programmer, a data security specialist, or a big data analyst who wants to gain fundamental and intermediate-level skills and enjoy a fascinating, high-paying career?

Or maybe you just want to learn additional tips and techniques to take your skills to a whole new level?

Welcome to the Machine Learning, Reinforcement Learning and AWS course for beginners - a one-of-its-kind course!

The flipped-classroom model with hands-on learning will help you dive directly into the course as you begin your learning journey. Be sure to watch the preview lectures that set course expectations!

In this course, you'll learn and practice:

  1. Machine Learning topics

  2. Jupyter Notebooks

  3. Reinforcement Learning

  4. Machine Learning Services in AWS

  5. AWS Sagemaker

  6. Dynamic Programming

  7. Q-Learning

  8. Understand best practices, and much more...

Transcripts

1. Course Agenda and Intro: So the world of artificial intelligence, machine learning, as well as cloud computing and big data, is growing exponentially. More and more of the enterprise organizations that I work with myself, as a senior project manager and enterprise architect, are developing solutions based on the cloud, because the only viable path forward is cloud computing combined with machine learning and artificial intelligence, which is, of course, one of the fastest-growing fields at this point in time, and it is going to grow further as we move ahead. Welcome to this course on working with AWS machine learning using Jupyter notebooks. Together, myself and my co-instructor Simcha have designed this course specifically keeping real-world scenarios in mind. So let me walk you through the course agenda. In this course we are going to start off with the basics of what AWS machine learning is, so if you're a beginner, perfect. We also explain the types of machine learning in a manner that is most understandable, and, of course, we are then going to apply all of these concepts using Jupyter notebooks and work with Amazon SageMaker as well. Let me backtrack a little bit: we'll talk about AWS machine learning concepts such as unsupervised learning, supervised learning, reinforcement learning, and working with structured and unstructured data, so you will get a solid understanding of all of these concepts. And, of course, we're going to take all of these concepts and apply them using Jupyter notebooks, working with code such as Python, Java, or any other language. Then we'll get into Amazon SageMaker, which allows developers to build and train AI and machine learning models so that we can do predictive analysis or other analytics. Who is this course for? We've designed this course for complete beginners as well as some intermediates, so if you're an intermediate user of AWS or of other cloud computing technologies such as Azure, you may still pick up a few tips and techniques from this course. The ultimate goal is to provide you with valuable content, working with real-world projects and use cases. So it's not just talking about definitions and terms, but actually taking those terms, definitions, and concepts and applying them in a real-world scenario, working hands-on with Amazon Web Services. By the end of this course, you will not only gain valuable skills, or perhaps enhance your existing skills, but also get content that will help you improve your career; if you're working on projects, you will definitely find this course valuable. We value your feedback, so please feel free to provide it so we can make our courses better and better. Together we're teaching over 200,000 happy students on Udemy, and we're super excited to bring you these courses; we have over 45 courses so far. Based on our student feedback, we've designed this specific course so you can gain some valuable skills. So if you're ready to learn machine learning within AWS, working with SageMaker and Jupyter notebooks, this is the right course for you. I welcome you to this course. What are you waiting for? Click on the enroll button now and we will see you in class. 2. Machine Learning on AWS: Hi everybody, and welcome to this lesson, where we're looking at the machine learning that's offered by AWS. AWS has a pretty broad and deep set of machine learning and AI services to cater to almost any business of any size.
There are many different things you can choose from, such as pre-trained AI services for computer vision, language, recommendations, and forecasting, and then there's also Amazon SageMaker, which you can use to build, train, and deploy machine learning models; we'll look at SageMaker a little more deeply in another section. But just to give you an overview, SageMaker is basically one of the main machine learning services offered by Amazon, and it's very customizable in that you are able to build, train, and deploy your own customized machine learning models based even on your own algorithm, or you can use one of the built-in algorithms that are offered by AWS. And again, like I mentioned, there's an entire section in this course dedicated to Amazon SageMaker, where we'll look specifically at how you can build your machine learning model, train it using the different algorithms, and then deploy it in Amazon Web Services. To give you an overview of some of the other services that are offered by AWS: these AI services can very easily be integrated with your applications to address common use cases, such as if you want personalized recommendations, if you're trying to improve safety and security, or if you're trying to increase your customer engagement and get more analytics to improve your marketing campaigns or your operations. These various services all allow you to do that very easily and, quite often, very simply. There's one service for recommendations, which personalizes experiences for customers with the same recommendation technology that's used by amazon.com. I'm sure all of us are familiar with amazon.com and the robustness of it, and how it personalizes almost everything for you when you go on it; you can use that same technology and the same service they use to personalize your experience on their website and implement it within your own organization, whether it's an e-commerce website or your own corporate website, maybe at a smaller scale. You can also build accurate forecasting models based on machine learning technology; for example, if you're trying to forecast your sales for the next quarter or the next year, or the number of users that are going to be hitting the website or e-commerce site that you're launching or building. You can even use those machine learning models to forecast, for example, how much stock to order, so you can minimize the stock that you keep in your warehouse. So there are various applications you can use forecasting for. There's also image and video analysis that you can add to your applications to catalog assets, automate media workflows, extract meaning, and much more. Additionally, you can use natural language processing to extract insights and relationships from unstructured text. You can also do document analysis; for example, you can extract text and data from millions of documents in a matter of hours, reducing a lot of the manual effort that many of us spend analyzing documents, regardless of which industry we're currently in. And there are various other services. For example, there's voice: you can turn text into lifelike speech. You can use chatbots, which are referred to as conversational agents.
You can use translation services, and there are also transcription services, with which you can easily add high-quality speech-to-text capabilities to your applications and workflows. So there's a host of different AI services within the machine learning offering that AWS provides. Depending on what we're trying to do within the AI framework, or within the machine learning framework, we can pick and choose various services to implement, and embed them in our applications and our business processes. And to add to those AI services, AWS also offers a host of other services to make it a complete package for you. For one, you can get the right compute for any use case: you can leverage a broad set of powerful EC2 instances, ranging from GPUs for compute-intensive deep learning, to FPGAs for specialized hardware acceleration, to memory-optimized instances for running inference. These EC2 instances, which are basically your virtual machines, or VMs, offer a range of hardware options that you can pick and choose from. The best part about it is that you don't have to provision them on a permanent basis; you can use, for example, something called on-demand, so they're only provisioned when you need them. During off-peak hours you can scale down your instances, and during peak hours you can scale them up, so it's a very cost-effective way to manage your infrastructure on the cloud. AWS also offers analytics and security for machine learning, because in order to do machine learning successfully, you not only need the capabilities, you also need the right security, the right data store, and the right analytics services to work together in harmony in order to have a well-functioning machine learning model. AWS offers all of that, from S3 storage to AWS analytics and AWS security, so it brings everything together for you in one package. Additionally, there are a couple of neat learning tools within the AI framework. You can get deep into machine learning through something called AWS DeepRacer, which is the car that you see there. It's basically a fully autonomous, 1/18th-scale race car designed to help you learn about reinforcement learning through autonomous driving, and we'll talk about reinforcement learning a little bit later on in more detail, because it's one of the most up-and-coming areas of machine learning, especially for autonomous driving. You can experience, for example, what it's like to race in the real world when you deploy your own reinforcement learning model onto the AWS DeepRacer. There's also DeepLens, which is the camera you see on the bottom right. It's basically the world's first deep-learning-enabled video camera, built primarily for developers. It's integrated with SageMaker and many other AWS services and allows you to get started with deep learning in a minimal amount of time; most times, if you're trying to set it up, you can probably do it in less than 10 minutes. The best part about it is that it's all integrated with AWS, and you can watch the results in real time through the AWS console. So this is basically just to give you an overview and a broad understanding of what AWS offers for machine learning and the capabilities it provides.
So it's basically a one-stop shop for you to do everything that is AI- or machine-learning-related, and for me, the best part is that everything is available in one console, in one area, so you don't need to go to several different platforms or programs to do machine learning; everything can be done right here within the AWS framework. 3. First steps in building a Machine Learning model: Hi everybody, and welcome to this lesson, where we're looking at how we can build a machine learning application. Building ML applications is an iterative process that involves a sequence of steps. To build a machine learning application, there are generally five steps that one should follow: formulate the problem, collect and label data, analyze the data, feature processing, and splitting the data. So what do I mean by formulating the problem? The first step in machine learning is to decide what you want to predict, which is known as the label or target answer. Imagine a scenario in which you want to manufacture products, but your decision to manufacture each product depends on its number of potential sales. In this scenario, you want to predict how many times each product will be purchased, which is predicting the number of sales. There are multiple ways to define this problem using machine learning, and choosing how to define it depends on your use case or specific business need. Do you want to predict the number of purchases your customers will make for each product? In that case, the target is numeric, and you're solving a regression problem. Or do you want to predict which products will get more than 10 purchases? In that case, the target is binary, and you're simply solving a binary classification problem. It's very important to avoid over-complicating the problem and to frame the simplest solution that meets your needs. However, it's also important to avoid losing information, especially information in the historical answers; for example, converting an actual past sales number into a binary value of over 10 versus fewer would lose valuable information. Investing the time in deciding which target makes the most sense for you to predict will save you from building models that don't really answer your question. So formulating the problem well lays the groundwork for making sure that the model you're trying to build will actually meet your needs and your business case. If you do that incorrectly, the rest of the process falls apart, so it's very important that we spend the right amount of time and resources on getting this right. After we have that, we need to collect the data. Machine learning problems obviously start with data, preferably lots of data for which you already know the target answer. What I mean by that is basically historical data; in the example that I gave, you would need to get the historical data for all of your past sales. Data for which you already know the target answer is what's called labeled data, and in supervised machine learning, the algorithm teaches itself to learn from the labeled data examples that we provide.
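Going back to the sales example from the problem-formulation step, here is a minimal pandas sketch, with made-up products and column names, showing how the same historical sales numbers can be kept as a numeric target for regression or collapsed into a binary target, and what information that collapse throws away.

```python
import pandas as pd

# Hypothetical historical sales data; products and column names are invented for illustration.
df = pd.DataFrame({
    "product_id": ["A1", "B2", "C3", "D4"],
    "units_sold": [3, 25, 0, 14],
})

# Regression framing: predict the numeric sales count directly.
regression_target = df["units_sold"]

# Binary classification framing: predict whether a product sells more than 10 units.
# Note the information lost: 25 and 14 both collapse to 1, and 3 and 0 both collapse to 0.
df["sells_over_10"] = (df["units_sold"] > 10).astype(int)
print(df)
```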
Each observation, or example, in your data has to have two elements: the target and the features. The target is the answer that you want to predict, for example the number of sales, and the features are attributes of the example that can be used to identify patterns to predict the target. For the email classification example, the target is a label that indicates whether an email is spam or not spam. Examples of variables are the sender of the email, the text in the body of the email, the text in the subject line, the time the email was sent, and so on; all those variables will help the machine learning model predict whether that email is spam or not. Often, data is not readily available in a labeled form, so collecting and preparing the variables and the target are often the most important steps in solving a machine learning problem. The example data should be representative of the data you will have when you're using the model to make a prediction. For example, if you want to predict whether an email is spam or not, you must collect both positive and negative examples, meaning both spam emails and non-spam emails, so the machine learning can learn what is spam and what is not. Once you've labeled the data, you might need to convert it to a format that's acceptable to your algorithm or software. For Amazon Machine Learning, you need to convert the data to CSV format, with each example making up one row of the CSV file. That's specific to Amazon Machine Learning; if you're using other software, you have to make sure that the data is formatted according to the requirements of that program or software. After we have the data, and before feeding that labeled data to the algorithm, it's good practice to inspect the data to identify issues and gain insights about the data you're using, because the predictive power of the model will only be as good as the data you feed it. So you have to analyze your data and keep the following things in mind. First, it's useful to understand the values that your variables take and which values are dominant in your data, and you could run these summaries by a subject matter expert for the problem that you want to solve. Ask yourself, or whoever the expert is: does the data match your expectations? Does it look like you have a data collection problem? And so on. You have to make sure you run it by an expert to confirm the data makes sense. Second, knowing the correlation between each variable and the target class is helpful, because a high correlation implies that there is a relationship between the variable and the target class. In general, you want to include variables with high correlation, because they're the ones with higher predictive power, and leave out the variables with low correlation. Now that we have labeled data and have analyzed it to make sure that it's proper, that it's correct, and that we have the right correlations, that's when you might want to transform your variables further to make them more meaningful. This is what's known as feature processing. For example, say you have a variable that captures the date and time at which an event occurred. This date and time will never occur again, and hence won't be useful to predict your target. However, if this variable is transformed into features that represent the hour of the day, the day of the week, and the month, these variables could be useful to learn whether the event tends to happen at a particular hour, a particular day of the week, or a particular month of the year.
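As a small illustration of that date-and-time example, here is a pandas sketch, with invented timestamps and a made-up target column, that derives hour, day-of-week, and month features from a raw timestamp.

```python
import pandas as pd

# Hypothetical event data with a raw timestamp column.
df = pd.DataFrame({
    "event_time": pd.to_datetime([
        "2019-03-04 09:15:00", "2019-03-08 17:40:00", "2019-07-21 09:05:00",
    ]),
    "target": [1, 0, 1],
})

# The raw timestamp never repeats, so it has little predictive value on its own.
# Deriving hour, day of week, and month produces features a model can generalize from.
df["hour"] = df["event_time"].dt.hour
df["day_of_week"] = df["event_time"].dt.dayofweek
df["month"] = df["event_time"].dt.month
print(df.drop(columns="event_time"))
```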
Such feature processing, which forms more generalizable data points to learn from, can provide significant improvements to the predictive models. Obviously, it's not always possible to know in advance which features have predictive influence, so it's good to include as many features as you potentially can that are potentially related to the target label, and let the model training algorithm pick the features with the strongest correlations. Your job should be to include as many features as possible and let the machine learning do its work; let it learn from all of that data. Because, as I mentioned, you want to have as much data as possible and feed as much of it as possible into the machine learning model. That does not mean feeding in data that's unstructured, unanalyzed, or irrelevant; we have to make sure all of that work is done first. But once it is done, we want as much data as possible going into the machine learning model so it can learn and make correct predictions. And then, finally, splitting the data. The fundamental goal of machine learning is to generalize beyond the data instances used to train models, and we want to evaluate the model to estimate the quality of its pattern generalization on data the model has not been trained on. However, because future instances have unknown target values, and we cannot really check the accuracy of our predictions for future instances since we don't have a crystal ball, we need to use some of the data that we already know the answer for as a proxy for future data. Evaluating the model with the same data that was used for training is not really useful, because it rewards models that can remember the training data, as opposed to generalizing from it. A common strategy is to take all available labeled data and split it into training and evaluation subsets, usually with a ratio of 70-80% for training and 20-30% for evaluation. The machine learning system uses the training data to train models to see patterns, and uses the evaluation data to evaluate the predictive quality of the trained model. The system evaluates the predictive performance by comparing predictions on the evaluation data set with the true values, known as ground truth, using a variety of metrics, and usually you use the best model on the evaluation subset to make predictions on future instances for which you do not know the target answer. Amazon Machine Learning splits data sent for training a model through the console into 70% for training and 30% for evaluation; that's done by default. It also allows you to select a random 70% of your source data for training, instead of using the first 70%, and use the complement of this random subset for evaluation. And you can use the APIs provided by Amazon Machine Learning to specify custom split ratios and to provide training and evaluation data that was split outside of Amazon Machine Learning. So if you have done the split outside and you'd like to feed in that data, you do have that capability with Amazon Machine Learning if you do not want to use the automated system.
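Outside of the Amazon Machine Learning console, the same 70/30 idea looks roughly like the sketch below; the file name is hypothetical, and scikit-learn's train_test_split is used only to illustrate the random split alongside a simple sequential one.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical labeled dataset with a binary target column "y".
df = pd.read_csv("banking.csv")  # assumed file name

# Random 70/30 split, mirroring the Amazon ML default ratio.
train_df, eval_df = train_test_split(df, test_size=0.30, random_state=42)

# Sequential alternative: first 70% of rows for training, last 30% for evaluation.
cutoff = int(len(df) * 0.70)
train_seq, eval_seq = df.iloc[:cutoff], df.iloc[cutoff:]
```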
So these are the five main steps that we need to follow, in order, if we want a properly working machine learning model that forecasts our future sales, or predicts spam emails as in the example I gave, properly and effectively. If we do not follow these steps to the letter, we can end up making wrong predictions, and if you are doing this in a production environment, that can have pretty dire consequences for you and for your organization. They are simple steps, but nonetheless ones that we need to follow in detail in order to have a properly working machine learning model. 4. Understanding AWS Datasources: Hi everybody, and welcome to this lesson on understanding in a little more detail the data sources that are going to be used within Amazon Machine Learning. Data source objects contain metadata about your input data. When you create a data source, Amazon Machine Learning reads your input data, computes descriptive statistics on its attributes, and stores the statistics, a schema, and other information as part of the data source object. After you create a data source, you can use the data insights provided by the Machine Learning console, which we'll take a look at, to explore the statistical properties of your input data, and you can use the data source to train a machine learning model. Input data is, as I mentioned, the data used to create the data source, and you must save your input data in CSV format for AWS. Different programs and different software have different requirements, but for Amazon Machine Learning it has to be a CSV file, and each column in the CSV file contains an attribute of the observation. For example, the figure that you see is basically a snapshot of a CSV file that has four observations, each in its own row, and each observation contains eight attributes, which are separated by commas. The attributes represent information about each individual represented by an observation. For example, it starts off with a customer ID, then a job ID, the education (whether it's basic or high school), the housing, the loan, the campaign, the duration, and whether they will respond to a campaign, which is classified by either a zero or a one, zero being no and one being yes. Now, Amazon Machine Learning requires a name for each attribute, and you can specify these attribute names either by including them in the first line, which is known as the header line of the CSV file, or by including them in a separate schema file that should be located in the same S3 bucket as the input data. The CSV file that contains your input data has to meet a certain set of requirements. It must be in plain text and consist of observations, each line should contain only one observation, and values should be separated by commas. If an attribute value contains a comma, the entire attribute value must be enclosed in double quotes. Each observation must be terminated with an end-of-line character, and attribute values cannot include end-of-line characters, even if the value is enclosed in double quotes. Every observation must have the same number of attributes, and each observation must be no longer than 100 kilobytes. So these are the requirements that must be met by the CSV file that you're going to be using as your data source.
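To tie those requirements together, here is a small sketch that writes a CSV in the required shape and a matching .schema file. The column names and values are made up, and the schema key names follow my recollection of the Amazon Machine Learning schema format, so double-check them against the documentation before relying on them.

```python
import csv
import json

# A few illustrative rows in the style of the banking example from the lesson;
# the column names and values are hypothetical.
header = ["customerId", "jobId", "education", "housing", "loan", "campaign", "duration", "y"]
rows = [
    ["101", "3", "basic",       "yes", "no",  "1", "261", "0"],
    ["102", "7", "high.school", "no",  "yes", "2",  "72", "1"],
]

# Plain-text CSV: one observation per line, comma-separated, values containing
# commas quoted, matching the input requirements described above.
with open("input1.csv", "w", newline="") as f:
    writer = csv.writer(f, quoting=csv.QUOTE_MINIMAL)
    writer.writerow(header)
    writer.writerows(rows)

# A minimal .schema file to place in the same S3 bucket as the CSV.
# Key names are reconstructed from memory of the Amazon ML schema format.
schema = {
    "version": "1.0",
    "targetFieldName": "y",
    "dataFormat": "CSV",
    "dataFileContainsHeader": True,
    "attributes": [
        {"fieldName": "customerId", "fieldType": "CATEGORICAL"},
        {"fieldName": "jobId",      "fieldType": "CATEGORICAL"},
        {"fieldName": "education",  "fieldType": "CATEGORICAL"},
        {"fieldName": "housing",    "fieldType": "BINARY"},
        {"fieldName": "loan",       "fieldType": "BINARY"},
        {"fieldName": "campaign",   "fieldType": "NUMERIC"},
        {"fieldName": "duration",   "fieldType": "NUMERIC"},
        {"fieldName": "y",          "fieldType": "BINARY"},
    ],
}
with open("input1.csv.schema", "w") as f:
    json.dump(schema, f, indent=2)
```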
And as I mentioned, you can provide your input to Amazon Machine Learning in a single file or as a collection of files, and these must satisfy a few other conditions: the files must have the same data schema, and they must reside in the same S3 bucket and path. For example, in the three paths that I've given, the input data files are named input1, input2, and input3, and the S3 bucket is, for example, exampleBucket; your paths should look like what you see on the screen. Lastly, when you create the CSV file, each observation, as required, will be terminated by a special end-of-line character. This character is not visible, but it's automatically included at the end of each observation when you press the Enter or Return key. The special character that represents the end of line varies depending on the operating system that you're using: Apple macOS and Linux use a line feed character, indicated by \n, while Windows uses two characters, called carriage return and line feed, indicated by \r\n. So when you're saving those files, you have to make sure you specify the correct format for the operating system that you're using. In particular, if you're on macOS and you create your CSV file with Microsoft Excel, you have to make sure that you save it as a Windows comma-separated, or Windows CSV, file in order for it to work properly within Amazon Machine Learning. It's very important to make sure the data format is correct before feeding the data to the machine learning model; otherwise, if your data is formatted incorrectly, especially if you have large amounts of data, it can become a hassle to correct it later on. So we want to know what the requirements are and follow them from the start when we're preparing the data for ingestion into the machine learning model. One of the fundamental goals of machine learning is to make accurate predictions on future data instances. Before using a machine learning model to make predictions, we need to evaluate the predictive performance of the model to see if it is actually working properly, and to estimate the quality of the model's predictions on data it has not seen, we can reserve, or split off, a portion of the data for which we already know the answer, use that as a proxy for future data, and evaluate how well the machine learning model actually predicts it. With Amazon Machine Learning, we have three options for splitting the data. We can pre-split the data, which means splitting it into two data input locations before uploading them to the S3 bucket and creating two separate data sources. We can use what's called a sequential split, which is basically telling Amazon Machine Learning to split the data sequentially when creating the training and evaluation data sources. Or we can use the random split, which splits it randomly. Just for your information, by default Amazon Machine Learning splits the data in a 70/30 format, meaning that 70% will be used for training and 30% will be used for evaluation. A simple way to split your input data for training and evaluation is to select non-overlapping subsets of your data while preserving the order of the records. This approach is useful if you want to evaluate your models on data from a certain date or within a certain time range.
So, for example, say that you have customer engagement data for the past five months and you want to use this historical data to predict customer engagement in the upcoming month. Using the data from the beginning of the range for training, and the data from the end of the range for evaluation, might produce a more accurate estimate of the model's quality than using records drawn from the entire data range. The two figures you see give an example of when you should use a sequential splitting strategy versus when you should use a random splitting strategy. When you create a data source, you can choose to split your data source sequentially, and again, as I mentioned, Amazon will use the first 70% for training and the remaining 30% of your data for evaluation. So it's very important that we know what data we're working with and what predictions we want to make; that will help us decide the right way of splitting it, either sequentially or randomly. It's very important that we understand our data before we get to this point, because if we split it incorrectly, the machine learning model, regardless of how good the back end is, will predict the wrong answers, since the data we have ingested and the way we're splitting it is incorrect. And finally, the data schema. There are two different options we can use for the data schema, and I mentioned this briefly before: you can either allow AWS to infer the data types of each attribute in the input data file and automatically create a schema for you, or you can provide a .schema file when you upload your data into the Amazon S3 bucket, and you have to make sure it's uploaded to the same bucket. Either way, how your business operates and how advanced your organization's machine learning capabilities are will determine whether you provide your own schema file or let AWS infer it automatically for you. For most small to medium-sized organizations, it's probably best to let AWS infer the data types; if you are pretty heavy into data science and you have your data structured properly, you are able to provide your own schema file. Just for your information, a schema file is basically a skeleton structure that represents the logical view of the entire data set; it tells the machine learning model what the data is and how it's structured. Again, you can either let AWS do it for you, which is an automated and simple way of doing it, or, if you have a more advanced need, you can upload your own .schema file, but please make sure that it's uploaded to the same S3 bucket that holds your original data source. So I hope you got a good understanding of the data structures and how they're utilized by AWS machine learning, and of the importance of making sure that our data is structured properly as a CSV file and follows the requirements set by AWS in terms of what the CSV file should contain, where it should be located, and so on. 5. Machine Learning Training Models in AWS: Hi everybody, and welcome to this lesson, where we're looking at the different training models that are offered by AWS. Amazon Machine Learning supports three different types of models: binary classification, multi-class classification, and regression.
Models for binary classification problems predict a binary outcome, one of two possible classes, and to train these models Amazon Machine Learning uses the industry-standard learning algorithm known as logistic regression. Some examples are: Is this email spam or not, which again is a yes or no? Will the customer buy this product? Is this product a book or a farm animal? Is this review written by a customer or a robot? Again, it's an either/or. The multi-class model allows you to generate predictions for multiple classes, that is, to predict one of more than two outcomes, and for this AWS uses an algorithm known as multinomial logistic regression. For example: Is this product a book, a movie, or clothing? Is this movie a romantic comedy, a documentary, a thriller, or horror? Which category of products is most interesting to this customer, where you would have a list of products? And finally, the regression model basically predicts a numeric value, and for this AWS uses the linear regression algorithm. For example, what will the temperature be in Dubai tomorrow? Or how much will this house sell for? So those are the three main model types used by Amazon Machine Learning. To train a machine learning model, you need to specify a few things: first of all, the input training data source; the name of the data attribute that contains the target to be predicted; the required data transformation instructions; and finally, training parameters to control the learning algorithm. During the training process, Amazon Machine Learning automatically selects the correct learning algorithm for you based on the type of target that you specified in the training data source, so it's a very automated and easy way of starting out with machine learning. Typically, machine learning algorithms accept parameters that can be used to control certain properties of the training process and of the resulting model. In AWS these are called training parameters, and you can set them using the console, the API, or even the command line interface. If you don't set any parameters, don't worry, because AWS will use default values that are known to work well for a large range of machine learning tasks. You can specify values for the training parameters that you see: the maximum model size, the maximum number of passes, the shuffle type, and the regularization type and amount. All of these parameters are set by default, and the default settings are adequate for most machine learning problems, but depending on your specific business case, you are able to choose and define your own values and fine-tune them based on your data. Now, to discuss a few of these in a little more detail; I won't cover all of them, but I'll discuss a few. What do we mean by maximum model size? That's basically, in units of bytes, the total size of your model. By default, AWS creates a 100 MB model, and you can instruct it to create a smaller or larger model by specifying a different size. If, for example, it cannot find enough patterns to fill the model size, it creates a smaller model for you automatically. So if you specify a maximum model size of 100 MB, but it finds patterns that amount to only 50 MB, the resulting model will only be 50 MB; it's not going to enlarge the model for you, it will only reduce it.
Choosing the right model size basically allows you to control the trade-off between predictive quality and cost of use, because small models can cause AWS to remove many patterns to fit within the maximum size limit, affecting the quality of predictions, but smaller models also cost less. So again, there's a trade-off that you need to decide on. Next is the maximum number of passes. For best results, AWS may need to make multiple passes over your data to discover the patterns, because most of the time a single pass will not be enough to correctly discover all patterns. By default it makes 10 passes, but you can change the default by setting the number up to a maximum of 100 passes. For example, if you set the number of passes to 20, but AWS discovers that no new patterns can be found by the end of 15 passes, then it will stop; even if you specify a maximum of 100, it will stop when it stops discovering new patterns. You can also shuffle your training data. What that basically does is mix up the order of your data so that the algorithm doesn't encounter one type of data for too many observations in succession. For example, suppose you're training a model to predict a product type, your training data includes, let's say, movie, toy, and video game product types, and you sort the data by the product type column before uploading it. The algorithm then sees the data alphabetically by product type: it sees all of your data for movies first, and your model begins to learn patterns for movies. Then, when your model encounters the next type, which is toys in this example, every update that the algorithm makes would fit the model to the toy product type, even if those updates degrade the patterns that fit movies. This sudden switch from the movie to the toy type can produce a model that doesn't really learn how to predict product types accurately. Shuffling mixes all of those up and allows the model to predict more accurately. Lastly, the regularization type and amount. The predictive performance of very large or complex models suffers when the data contains too many patterns; as the number of patterns increases, so does the likelihood that the model learns unintentional data artifacts rather than true patterns. When that happens, the model does very well on the training data but doesn't fare well in generalizing to new data, and this is what a lot of people in the industry refer to as overfitting the training data. Regularization helps prevent linear models from overfitting the training data examples by penalizing extreme weight values. I'm not going to get into too much detail on regularization, because it does get a bit outside the scope of this course, but I just wanted to familiarize you with the different parameters that are used by AWS machine learning at training time.
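For reference, training parameters like the ones discussed above are passed as simple string key/value pairs when a model is created through the API. The keys below follow my recollection of the Amazon Machine Learning documentation and the values are only illustrative, so verify both before using them.

```python
# Illustrative training parameters for Amazon ML's CreateMLModel call
# (for example via boto3's "machinelearning" client). Keys and values are
# assumptions based on memory of the docs, not a definitive reference.
training_parameters = {
    "sgd.maxMLModelSizeInBytes": "104857600",  # 100 MB maximum model size (the default)
    "sgd.maxPasses": "10",                     # up to 10 passes over the training data
    "sgd.shuffleType": "auto",                 # shuffle the training data before passes
    "sgd.l2RegularizationAmount": "1e-6",      # mild L2 regularization to limit overfitting
}
```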
6. Importance of Feature Transformation: Hi everybody, and welcome to this lesson on feature transformation and its importance in the machine learning process. Let's consider a machine learning model whose task is to decide whether a credit card transaction is fraudulent or not. Based on your application, background knowledge, and data analysis, you might decide which data fields, or features, are important to include in the input data. For example, you might decide that the transaction amount, the merchant name, the address, and the credit card owner's address are important to provide to the learning process. On the other hand, a randomly generated transaction ID carries no beneficial information. Once you have decided which fields to include, you transform these features to help the learning process. Transformations add background experience to the input data, enabling the model to benefit from this experience. For example, say a merchant address is represented as a string, such as '1625 32nd Avenue, New York, NY 45678'. By itself, the address has limited expressive power; it is useful only for learning patterns associated with that exact address. Breaking it up into its constituent parts creates additional features, such as the address being '1625 32nd Avenue', the city being 'New York', the state being 'NY', and the ZIP code being '45678'. Now the learning algorithm can group more disparate transactions together and discover broader patterns; perhaps some merchant ZIP codes experience more fraud activity than others. So it's very important that we transform our data properly. There are two ways to transform data features: you can either transform your input data directly before showing it to Amazon Machine Learning, or you can use the built-in data transformations of Amazon Machine Learning. Amazon Machine Learning recipes contain instructions for transforming your data as part of the process. They're defined using a JSON-like syntax, but they have additional restrictions beyond the normal JSON restrictions. Recipes have three different sections. First, we have groups, which enable grouping of multiple variables for ease of applying transformations; for example, you can create a group of all variables having to do with the free-text parts of a web page, such as the title or the body, and then perform a transformation on all of these parts at once. Then there are assignments, which enable the creation of intermediate named variables that can be reused in processing. And finally, outputs, which define which variables will be used in the learning process and what transformations, if any, apply to those variables. When you create a new data source in Amazon Machine Learning and statistics are computed for that data source, it will also create a suggested recipe that can be used to create a new model from that data source. Suggested recipes are based on the data and the target attribute present in the data, and they provide a useful starting point for creating and fine-tuning your models. There are various types of data transformations used by AWS. For example, there's the n-gram transformation, which takes a text variable as input and produces strings corresponding to a sliding window of a user-configurable n words, generating outputs in the process. For example, consider the text string 'I really enjoyed reading this book'. Specifying the n-gram transformation with a window size of one simply gives you all the individual words in that string, each word by itself: 'I', 'really', 'enjoyed', 'reading', 'this', 'book'. If you specify a window size of two, it will group two words together, and three, and so on.
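The sliding-window idea itself is easy to see in a few lines of plain Python. This is only a sketch of what the transformation produces, not the built-in implementation, which may handle lowercasing and punctuation differently.

```python
def ngrams(text, n):
    """Return the sliding-window n-grams of a whitespace-tokenized string."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "I really enjoyed reading this book"
print(ngrams(sentence, 1))  # ['I', 'really', 'enjoyed', 'reading', 'this', 'book']
print(ngrams(sentence, 2))  # ['I really', 'really enjoyed', 'enjoyed reading',
                            #  'reading this', 'this book']
```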
I won't discuss each one of these in detail; that gets a bit outside the scope and a bit too involved in the details of the types of transformations, but just know that these are the main types of transformations that AWS uses, if you're using the AWS service for the transformations rather than doing them yourself. Amazon Machine Learning also has a data rearrangement functionality, which enables you to create a data source that is based on only a portion of the input data that it points to. For example, when you create a model using the wizard and choose the default evaluation option, it automatically reserves 30% of the data for evaluation and uses 70% for training. This functionality is enabled by the data rearrangement feature of Amazon Machine Learning, and you can use the Amazon Machine Learning API or command line interface if you want to change some of those parameters. It allows you to change certain parameters such as percentBegin, to indicate where the data for the data source starts, and percentEnd, to indicate where the data for the data source ends. The complement parameter tells AWS to use the data that is not included in the range from percentBegin to percentEnd, which is useful if you need to create complementary data sources for training and for evaluation. And then there's also the strategy parameter, which splits the data for your data source differently from the default 70/30; so if you want to use 50/50, 50% for training and 50% for evaluation, you can use the strategy setting to change those options.
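As a sketch of how those parameters fit together, a DataRearrangement value like the following could be attached to the training and evaluation data sources to get a sequential 50/50 split. The key names and value formats reflect my memory of the Amazon Machine Learning API and should be checked against the current documentation.

```python
import json

# Assumed DataRearrangement shape for a sequential 50/50 split; treat the exact
# keys ("splitting", "percentBegin", "percentEnd", "strategy", "complement")
# and value formats as illustrative rather than authoritative.
training_rearrangement = json.dumps({
    "splitting": {
        "percentBegin": 0,
        "percentEnd": 50,            # first 50% of the data for training
        "strategy": "sequential",
    }
})

evaluation_rearrangement = json.dumps({
    "splitting": {
        "percentBegin": 0,
        "percentEnd": 50,
        "strategy": "sequential",
        "complement": "true",        # use the remaining 50% for evaluation
    }
})
```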
So I hope you got a good overview of the types of data transformations that are offered by AWS, and of how important it is that our data is transformed properly and correctly, so that our machine learning model predicts the correct answers, learns properly, and trains itself properly. 7. Evaluating Models: Hi everybody, and welcome to this lesson, where we're looking at evaluating the machine learning models developed with AWS. As a good practice, you should always evaluate a model to determine if it will do a good job of predicting the target on new and future data. Because future instances have unknown target values, you need to check the accuracy metric of the model on data for which you already know the target answer, and use this assessment as a proxy for predictive accuracy on future data. To properly evaluate a model, you hold out a sample of data that has been labeled with the target from the training data source. Evaluating the predictive accuracy with the same data that was used for training is not really useful, because it just rewards the model for remembering the training data rather than generalizing from it. In Amazon Machine Learning, you evaluate a model by creating an evaluation, and to create an evaluation for a model, you need the model that you want to evaluate and labeled data that was not used for training. So let's look at how we can evaluate the different types of models and the insights provided. When you evaluate a model, Amazon Machine Learning provides an industry-standard metric and a number of insights to review the predictive accuracy of the model, and the outcome of an evaluation contains a number of insights. For example, it contains a prediction accuracy metric to report on the overall success; visualizations to help explore the accuracy of your model beyond the prediction accuracy metric; the ability to review the impact of setting a score threshold (just keep in mind that's only for binary classification); and alerts on criteria that check the validity of the evaluation. The choice of metric and visualization depends on the type of model that you're evaluating, so it's important to review these visualizations to decide whether your model is performing well enough to match your business requirements and expectations. Now, the actual output of many binary classification algorithms is what's called a prediction score. The score indicates the system's certainty that the given observation belongs to the positive class. Binary classification models output a score that ranges from 0 to 1, and as the consumer of the score, you make the decision about whether the observation should be classified as a one or a zero. You interpret the score by picking a classification threshold, or what's called a cutoff, and comparing the score against it. Observations with scores higher than the cutoff are predicted as target 1, and lower scores are predicted as target 0. The default cutoff is 0.5 for Amazon Machine Learning, and you can choose to update this cutoff to match your business needs. You can also use the visualizations in the console, which we'll look at in the demonstrations, to understand how the choice of cutoff will affect your application. Amazon Machine Learning provides an industry-standard accuracy metric for binary classification models called AUC, or area under the curve, which measures the ability of the model to predict a higher score for positive examples than for negative ones. Because it is independent of the score cutoff, you can get a sense of the prediction accuracy of your model from the AUC metric without picking a threshold. It returns a decimal value from 0 to 1: values near 1 indicate a model that's highly accurate, values near 0.5 indicate a model that's no better than guessing at random, and values near 0 are unusual to see and typically indicate a problem with the data. Essentially, an AUC near 0 says that the model has learned the correct patterns but is using them to make predictions that are opposite from reality. Just keep in mind that the baseline for the AUC metric for a binary model is 0.5; it is the value for a hypothetical machine learning model that randomly predicts a 1 or 0 answer. To explore the accuracy of the model, you can review, and we will review, the graphs in the evaluation through the console. A model with good predictive accuracy will predict higher scores for the actual ones and lower scores for the actual zeros. A perfect model will have two histograms at the two different ends of the x-axis, showing that the actual positives all got high scores and the actual negatives all got low scores. However, every model makes mistakes, and a typical graph will show the two histograms overlapping at certain scores. An extremely poorly performing model will be unable to distinguish between the positive and negative classes, and both classes will have mostly overlapping histograms, as in the examples you can see on your screen in terms of the good chart and the bad chart on the right-hand side.
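Here is a small scikit-learn sketch, with made-up scores and labels, that shows AUC being computed independently of any cutoff and how moving the cutoff trades false positives against missed true positives.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical prediction scores from a binary model and the true labels.
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
scores = np.array([0.10, 0.35, 0.62, 0.80, 0.55, 0.47, 0.91, 0.20])

# AUC is independent of any cutoff: 1.0 is perfect, 0.5 is random guessing.
print("AUC:", roc_auc_score(y_true, scores))

# Applying a cutoff turns scores into 0/1 predictions; moving it changes the
# balance between false positives and missed true positives (false negatives).
for cutoff in (0.3, 0.5, 0.7):
    preds = (scores >= cutoff).astype(int)
    fp = int(((preds == 1) & (y_true == 0)).sum())
    fn = int(((preds == 0) & (y_true == 1)).sum())
    print(f"cutoff={cutoff}: false positives={fp}, false negatives={fn}")
```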
Now, for correct predictions, there's what's called a true positive, or TP, where the model predicted the value as 1 and the true value is 1, whereas for a true negative the model predicted the value as 0 and the true value is 0. Then you have erroneous predictions, which are the false positives and false negatives. You can also adjust the score cutoff: by changing the score cutoff, you adjust the model's behavior when it makes a mistake. Just keep in mind that moving the cutoff to the left will capture more true positives, but the trade-off is an increase in the number of false positive errors; moving it to the right captures fewer false positives, but again there's a trade-off, as it will miss some true positives. So for your predictive applications, you have to decide which kind of error is more tolerable by selecting an appropriate cutoff score. And as I mentioned at the very beginning of this course, machine learning is an iterative process, so most likely you will have to test out different cutoff points to see which one works better for your business case. Then we have the multi-class model. The actual output of this model is a set of prediction scores; the scores indicate the model's certainty that a given observation belongs to each of the classes. Unlike binary classification, you do not need to choose a score cutoff to make predictions: the predicted answer is the class, that is the label, with the highest predicted score. The typical metrics used in multi-class classification are the same as the metrics used in the binary classification case, after averaging them over all classes. In AWS, the macro-average F1 score is used to evaluate the predictive accuracy of a multi-class model. The F1 score is a binary classification metric that considers both of the binary metrics precision and recall; it is the harmonic mean of precision and recall, it ranges from 0 to 1, and a larger value indicates better predictive accuracy. You can see the mathematical calculation for the F1 score and the macro average. The macro-average F1 is an unweighted average of the F1 score over all classes; it does not take into account the frequency of occurrence of the classes in the data set, and a larger value indicates better predictive accuracy. Keep in mind that AWS does provide a baseline metric for multi-class models. For example, if you were predicting the genre of a movie and the most common genre in your training data set was romance, the baseline model would always predict the genre as romance. You would compare your model against this baseline to validate whether your ML model is better than an ML model that predicts this constant answer, and this is where you can use the performance visualization that you see on the right-hand side. AWS provides a confusion matrix as a way to visualize the accuracy of the multi-class model. The confusion matrix illustrates, in table form, the number or percentage of correct and incorrect predictions for each class, by comparing an observation's predicted class with its true class. For example, going back to the movies: if you're trying to classify a movie into a genre, the model might predict that the genre is romance; however, its true genre might actually be thriller.
When you evaluate the accuracy of a multi-class classification model, AWS identifies these misclassifications and displays the results in the confusion matrix. As you can see in that illustration, it displays a number of features: the number of correct and incorrect predictions for each class, the class-wise F1 score, the true class frequencies in the evaluation data, and the predicted class frequencies for the evaluation data. So it does provide a visual display, and it can accommodate up to 10 classes in the confusion matrix, listed in order from the most frequent to the least frequent class in the evaluation data. Now, the output of a regression model is a numeric value for the model's prediction. For example, if you're predicting, let's say, housing prices, the prediction of the model could be a value such as 300,000 or 357,455, and so on. For regression tasks, AWS uses the industry-standard root mean square error, or RMSE, as the distance measure between the predicted numeric target and the actual answer. The smaller the RMSE, the better the predictive accuracy of the model, and a model with perfectly correct predictions would have an RMSE of zero. Again, like with most things, AWS does provide a baseline metric for regression models: it's the RMSE for a hypothetical regression model that would always predict the mean of the target as the answer. For example, if you were predicting the age of a house buyer and the mean age for the observations in your training data set was 35, the baseline model would always predict the answer as 35. And again, just like with the other model types, you can also use the performance visualization that you see on the right-hand side, which is basically a histogram of the residuals on the evaluation data. One distributed in a bell shape and centered at zero indicates that the model makes mistakes in a random manner and does not systematically over- or under-predict any particular range of target values. Then we have cross-validation, which is a technique for evaluating models by training several models on subsets of the available input data and evaluating them on the complementary subset of the data. You would use cross-validation to detect overfitting, also known as failing to generalize a pattern. The diagram that you see shows an example of the training subsets and complementary evaluation subsets generated for each of the four models that are created and trained during a four-fold cross-validation. The first model uses the first 25% of the data for evaluation and the remaining data for training; the second one uses the next 25% for evaluation, and so on. Each model is trained and evaluated using complementary data sources; the data in the evaluation data source includes, and is limited to, all of the data that is not in the training data source. The basic idea is that by performing this validation, you generate four different models, four data sources to train the models, four data sources to evaluate them, and four evaluations, one for each model.
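A four-fold cross-validation of the kind described above can be sketched with scikit-learn as follows; the data is synthetic, and logistic regression stands in for whatever model you are actually evaluating.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic feature matrix X and binary target y, just for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Four-fold cross-validation: each fold is held out once for evaluation while
# the other three folds are used for training, mirroring the diagram described above.
aucs = []
for train_idx, eval_idx in KFold(n_splits=4, shuffle=False).split(X):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    fold_scores = model.predict_proba(X[eval_idx])[:, 1]
    aucs.append(roc_auc_score(y[eval_idx], fold_scores))

print("per-fold AUC:", [round(a, 3) for a in aucs])
# A large spread between folds can be a sign of overfitting or uneven data.
```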
So that gives us a good overview in this lesson of how AWS evaluates different models and how we can use the metrics it provides to judge whether the models we've developed for our data and our business case are working properly, or whether we need to change them; we also saw which options are available for binary and multiclass classification to fine-tune the machine learning model so it helps us predict the correct answers.

8. Creating a Datasource and Model: Hi, everybody, and welcome back. In this tutorial we're going to create our first training data source. Previously, we uploaded both of our CSV files, the one used for training and the one used for prediction, to our newly created S3 bucket. Next we want to create a data source, which is an object that contains the location of the input data and important metadata about it; Amazon Machine Learning uses the data source for operations like model training and evaluation. Before we create the data source, we have to make sure we have the following: the location of our Amazon S3 bucket; the schema, which includes the names of the attributes in the data and the type of each attribute, for example numeric, text, categorical, or binary; and the name of the attribute that contains the answer we want Amazon Machine Learning to learn to predict, which in our case is y, whether or not the customer is going to subscribe to our product.

So let's open the Amazon Machine Learning console and choose Get Started. Since we have nothing launched yet, we'll click on Launch; if we already had objects created, we could go directly to the dashboard. Here is where we specify where our data is located, either in an S3 bucket or in Amazon Redshift. If we had massive amounts of data we would want Redshift, but since our data set is small, we can use our S3 bucket. And just for your information, this is the same data set that I downloaded, and that you have downloaded; it's provided by AWS, so you can also grab it directly from AWS if you have not already downloaded it from the course resources. Here is where we specify the S3 location: I'll type in the path to the bucket and the CSV file, and for the data source name I'll add a 1 so I can differentiate the two sets of data, one for training and one for prediction. We'll click on Verify, and it's going to tell us that Amazon Machine Learning needs read permission to the S3 location, so we want to make sure we grant it.
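While we wait for the verification, here is a feel for what that schema information looks like, written as a Python dictionary. Treat the key names and attribute list as assumptions: they only approximate the Amazon ML .schema format and pick a few illustrative fields from the banking data, so check the AWS documentation for the exact file layout.

```python
# Rough sketch of an Amazon ML schema expressed as a Python dict.
# Key names and attributes are illustrative assumptions, not the exact file format.
banking_schema = {
    "version": "1.0",
    "dataFormat": "CSV",
    "dataFileContainsHeader": True,
    "targetAttributeName": "y",          # the answer we want the model to learn to predict
    "attributes": [
        {"attributeName": "age",       "attributeType": "NUMERIC"},
        {"attributeName": "job",       "attributeType": "CATEGORICAL"},
        {"attributeName": "education", "attributeType": "CATEGORICAL"},
        {"attributeName": "y",         "attributeType": "BINARY"},   # yes/no subscription
    ],
}

import json
print(json.dumps(banking_schema, indent=2))  # roughly what a .schema file would contain
```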
Once it can access and read the data, you'll see a page showing the properties of the data: the CSV format, the schema (auto-generated, because I have not provided a .schema file for it), the number of files, and the size. I'll click on Continue. Here is where we establish our schema; again, the schema is the information machine learning needs to interpret the input data for the model, including the attribute names, the data types, and any special attributes. And if you remember, there are two different ways of providing the schema: either let AWS infer it automatically, as it has done here, or provide a .schema file. The option we're using for this demonstration is to allow AWS to infer the schema, and here you can see the inferences it has made for us.

Here we can review the data types that machine learning inferred for the attributes, and it's important that the attributes are assigned the correct data type to help machine learning ingest the data correctly and to enable the correct feature processing on the attributes. Attributes that have only two possible states, such as yes or no, should be marked as binary; attributes that are numbers or strings used to denote a category should be categorical; numbers should be numeric; and attributes that are strings you would like to treat as words should be text. You're more than welcome to go through each one of these attributes, but for the purposes of this demonstration, let me just tell you that machine learning has inferred the correct schema for all of them. Next, I'll go to the last column, which, if you remember, is the y column, where we specified either 0/1 or yes/no to let machine learning know whether the person subscribed to the product or not. This is what I want to specify as my target attribute, because, remember, the target attribute is the one we want machine learning to learn how to predict. After I do that, I'll click on Continue.

Here it asks whether the data contains a row identifier. That's an optional field, and it helps you understand how prediction rows correspond to observation rows from the input data: if you choose to make an attribute the row identifier, the model will add that column to the prediction output. The row identifier is intended for reference purposes only, so it is not used when training the model. For the purposes of this simple demonstration, we'll make sure we click No and then click on Review. Here it gives us the option to edit anything we've done if we want to change it, but let's say we've done everything correctly, so I'll click Continue. Now that we have created our training data source, we can use it to create the machine learning model, train the model, and then evaluate the results.
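The console does all of this for you, but the same data source creation can also be scripted. Below is only a sketch using the boto3 machinelearning client under the assumption that boto3 is configured with credentials; the bucket path, key names, and IDs are placeholders of my own, not values from the demo.

```python
import boto3

ml = boto3.client("machinelearning", region_name="us-east-1")

# Placeholder bucket/key and IDs -- substitute your own values.
ml.create_data_source_from_s3(
    DataSourceId="banking-ds-1",
    DataSourceName="Banking training data 1",
    DataSpec={
        "DataLocationS3": "s3://my-ml-bucket/banking.csv",
        # Point at a .schema file instead of letting the console infer the schema.
        "DataSchemaLocationS3": "s3://my-ml-bucket/banking.csv.schema",
    },
    ComputeStatistics=True,  # needed if the data source will be used for training
)
```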
Now, to create the machine learning model: because Amazon Machine Learning automatically uses the training data source we just created, it takes us directly to the model settings page. Here we can give the model a name, depending on how we want to classify it, and for the purposes of this demonstration we'll accept the default evaluation settings. We could specify custom settings if we wanted to use some of the advanced training options, for example to change the 70/30 split that AWS applies by default, where 70% of the data is used for training and 30% is held out for evaluation; all of those variables can be changed through the custom settings. Let's stick with the defaults, give the evaluation the same name with a 1 appended, and click on Review. Here again it gives us a chance to change any of the options, but let's say we've done everything correctly, so we'll go ahead and create the machine learning model. While the model is queued and processing, its status will show as Pending; this changes to In Progress and finally to Completed once the model has been fully trained. And there you have it: those are the steps involved in creating a data source and a machine learning model using the Amazon Machine Learning service.
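If you'd rather drive the same steps from code, including a custom training/evaluation split instead of the default 70/30, a hedged boto3 sketch might look like the following. The IDs are placeholders, and the rearrangement string is my best approximation of the split syntax, so verify it against the AWS documentation before relying on it.

```python
import json
import boto3

ml = boto3.client("machinelearning", region_name="us-east-1")

# Carve out the first 70% of the data source for training (placeholder IDs throughout).
training_split = json.dumps({"splitting": {"percentBegin": 0, "percentEnd": 70}})

ml.create_data_source_from_s3(
    DataSourceId="banking-ds-train",
    DataSpec={
        "DataLocationS3": "s3://my-ml-bucket/banking.csv",
        "DataSchemaLocationS3": "s3://my-ml-bucket/banking.csv.schema",
        "DataRearrangement": training_split,
    },
    ComputeStatistics=True,
)

# Train a binary classification model on that data source.
ml.create_ml_model(
    MLModelId="banking-model-1",
    MLModelName="Banking subscription model 1",
    MLModelType="BINARY",
    TrainingDataSourceId="banking-ds-train",
)

# The status moves from PENDING to INPROGRESS to COMPLETED, just like in the console.
print(ml.get_ml_model(MLModelId="banking-model-1")["Status"])
```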
9. Serverless machine learning inference with AWS Lambda: Hi, everybody, and welcome. In this lesson we're looking at how we can create a Lambda function and run scikit-learn inference from it. For those of you who don't know what serverless is: serverless computing enables you to build and run applications without thinking about servers and other network infrastructure. It relieves the developer of the burden of bootstrapping, configuring, networking, clustering, updating, and all the other aspects of physical server management, and the enormous benefit is that you can focus on the task at hand without sacrificing access or scalability. A few things to keep in mind: we obviously need our AWS account, and you can use the free tier for this, since almost nothing we do here is chargeable.

The first thing we need to do is create a SageMaker lifecycle configuration. Log in to the console and navigate to SageMaker, which you can find under Machine Learning or through the search box; the dashboard contains links to all of the major components, such as notebooks, training, and inference. Under Notebook we have the option of Lifecycle configurations, which are startup scripts that initialize the Jupyter notebook environment; they can be run once on creation or once on every notebook start. Click on that and, under Scripts, click Create notebook and paste in the provided code. The script does the following when the instance is created: it downloads the code and necessary files from the GitHub repository, organizes the folder structure and places files in session folders, sets write permission on the folder, and installs 7-Zip, which is required to compress Lambda packages to their smallest size.

After we do that, we create our notebook instance. Click on Notebook instances and create an instance; this creates a Jupyter notebook using the lifecycle configuration we just built. Provide the regular information such as the notebook name, keep the smallest instance type available, use the S3 bucket and an IAM role that has access to all of the S3 buckets, and, most importantly, choose the lifecycle configuration we just specified, then create the notebook instance. It takes several minutes to get up and running, so in the meantime navigate to the S3 console and create the bucket while waiting for the notebook instance to launch. That bucket will host all of our data; it's needed to store the training data and the models we'll be creating in this workshop.

After we create the bucket, we need to set up IAM roles and attach the necessary policies: we add rights to our newly created SageMaker role and create a new role for the serverless inference. We'll use two policies, Lambda full access and S3 full access; these permissions are required because from the notebook we'll be uploading objects to S3 and creating Lambda functions. Go into the IAM dashboard, click on Roles, search for the SageMaker role, and on the summary page click Attach policies; then use the search box to add the two policies, Lambda full access and S3 full access. Next, create a serverless inference execution role for Lambda: create a new role for Lambda and assign it full access to S3 and to SageMaker as well. Now both roles are in place, so the Lambda functions and SageMaker have access to the S3 buckets and vice versa.

With that done, go back to the SageMaker dashboard; since the notebook instance is up and ready, open Jupyter to launch the notebook environment. Once we're in there, navigate to the serverless workshop folder, then the Lambda scikit-learn inference folder, and open the scikit-learn sentiment analysis tweets notebook. A couple of things to do: make sure you replace where it says "your bucket name" with the bucket name you created in S3 earlier. Then run all of the steps, and it will train scikit-learn's built-in logistic regression algorithm on the tweets data set; the last lines of the code upload the trained model and the validation and test data sets back into the S3 bucket we previously created. After we do that, we go back into the Jupyter notebook file browser and launch a terminal into our instance; from the terminal we'll set up the Lambda and enable inference on our newly created model.
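The workshop notebook contains the real training code; the sketch below is only my approximation of the key steps, training a scikit-learn logistic regression on a labelled tweets file and pushing the serialized model to S3. The file name, column names, and bucket name are assumptions rather than the workshop's actual values.

```python
import boto3
import joblib
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labelled tweets file with "text" and "sentiment" columns.
tweets = pd.read_csv("tweets.csv")

# Vectorize the tweet text and train a logistic regression classifier.
model = make_pipeline(TfidfVectorizer(min_df=2), LogisticRegression(max_iter=1000))
model.fit(tweets["text"], tweets["sentiment"])

# Serialize the trained model and upload it to the workshop bucket.
joblib.dump(model, "sentiment_model.joblib")
boto3.client("s3").upload_file(
    "sentiment_model.joblib", "your-bucket-name", "model/sentiment_model.joblib"
)
```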
We'll use the AWS command line interface, and we can do that right from the Jupyter notebook instance, since the CLI comes preinstalled in the bash shell provided on every SageMaker instance. On the right-hand side of the Jupyter notebook, click New, and at the bottom click Terminal; this launches a new terminal instance in a new tab. Next, we create a Lambda layer on the notebook instance by issuing a few commands, and then we publish the Lambda layer. After that's done, we create a Lambda function using just our lambda_function.py file, since all of the dependencies have been bundled into the layer we published in the last step. To run this command we need the full ARN, or Amazon Resource Name, of the execution role, and we can get that by going back into the IAM console, opening the role, and copying it from there. After that, we update the Lambda to use the layer we just published.

Once that's done, go back into the console and navigate to the Lambda dashboard. The package has been uploaded and the function instantiated, so now we can call it on demand, and we'll use Lambda's built-in testing feature to call the model. Lambda is in the Compute section of the console if you're not familiar with where it is. Click on Functions, select the sentiment analysis function, and in the upper right-hand corner you'll see the Test option. Click on that, and a new window pops up; we'll just call it "test event" and, for the bucket field, put in our bucket name. As soon as that's done, click the Test button, and after a couple of seconds you'll see the execution result come back as Succeeded. So there you have it, it's fairly simple, I should say: we created a model using scikit-learn, built a Lambda layer, published it, used a Lambda function that relies on that layer, and got a successful test from the console. You can now take this same model and scale it out: you can call the function from an API Gateway and put it into production. Essentially, this model can be put into production as is, with a few minor tweaks here and there. It really is that simple to build a serverless AI environment.
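Before moving on, here is a rough idea of what a lambda_function.py for this kind of serverless inference could look like. This is not the workshop's actual code: the bucket and key, the event fields, and the /tmp caching are all assumptions, and it presumes scikit-learn and joblib are available through the layer.

```python
import json
import os

import boto3
import joblib

s3 = boto3.client("s3")
MODEL_PATH = "/tmp/sentiment_model.joblib"  # /tmp persists across warm invocations
_model = None


def _load_model(bucket):
    """Download the serialized scikit-learn model from S3 once and cache it."""
    global _model
    if _model is None:
        if not os.path.exists(MODEL_PATH):
            s3.download_file(bucket, "model/sentiment_model.joblib", MODEL_PATH)
        _model = joblib.load(MODEL_PATH)
    return _model


def lambda_handler(event, context):
    # Assumed event shape: {"bucket": "...", "text": "some tweet to score"}
    model = _load_model(event["bucket"])
    prediction = model.predict([event.get("text", "")])[0]
    return {"statusCode": 200, "body": json.dumps({"sentiment": str(prediction)})}
```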
10. What is a notebook and Installing Jupyter: Hi, everybody, and welcome to this lesson on the Jupyter notebook, where I'll also walk you through how to install Jupyter on your PC. The Jupyter notebook is an incredibly powerful tool for interactively developing and presenting data science projects. A notebook integrates code and its output into a single document that combines visualizations, narrative text, mathematical equations, and other rich media. This workflow promotes iterative and rapid development, making notebooks an increasingly popular choice at the heart of contemporary data science and, increasingly, science at large. Best of all, as part of the open-source Project Jupyter, they're completely free.

Now, there are multiple ways to install and work with Jupyter. One way is to work through the online interface, which is entirely cloud based, so there's no installation required: simply navigate to the Jupyter website, choose to try it in the cloud, and as soon as you click on the Jupyter Notebook option it launches a notebook instance in the cloud for you; the entire project is free, so there's no cost even in the cloud. It's as simple as click and go, and we can start working with notebook instances right from there.

Another way of installing the Jupyter notebook is on our local PC. Keep in mind that you'll need Python 3.3 or greater installed on your PC if you want to work with Jupyter notebooks locally, and the easiest way to get everything is through a piece of software called Anaconda, which is also free. So that's what we'll walk through now: installing Jupyter by downloading Anaconda from the URL you see on screen. As soon as we click the Download button, we can pick which version of Python we want to work with; again, this is the version that gets installed on your local PC as part of the Anaconda installation. After downloading the version we want, we walk through the installation process, which is pretty straightforward: double-click the executable to launch the Anaconda installer, click Next, agree to the terms, and choose whether Anaconda should be available just for you or for all users on the PC. Leave the default destination folder, there's no need to change that, and stick with the default option of not adding Anaconda to the PATH environment variable, because if it's on the PATH it starts when your PC starts, before most other services, and that can sometimes cause issues, especially in a Windows environment. Click Next and wait for the installation files to install on the PC.

There's yet another way to install: through the command line interface. If you're a more advanced user and already work with Python, you can run the pip3 install jupyter command from the command line, which downloads the Jupyter files and installs them on your local PC. So there are three main ways to work with and install Jupyter: a purely cloud-based environment through the Jupyter website, where nothing is installed on your local machine; or, if you're going to be doing more advanced work with Jupyter, it's recommended to install it locally, and even locally we have two options, either download Anaconda and work through it, or install directly onto our machine from the command line with pip3 install jupyter. Those are the main ways to get started doing data science on our local machines or through the cloud-based interface, and there we have it, we've just successfully installed it through the command line interface as well.
11. Creating first Jupyter Notebook through Anaconda: Hi, everybody. Now that we have Jupyter downloaded and installed through Anaconda, let's go through how to run and save notebooks, familiarize ourselves with the structure, and understand the interface. Let's launch the Anaconda application we installed in the previous lesson, and from there launch our Jupyter instance. Once Anaconda is open, you can see there are many different programs we can launch through it, but the one we want for the purposes of this course and this demonstration is specifically the Jupyter Notebook, so click Launch to start it; it opens a tab in the browser for the notebook server. Now, this isn't a notebook just yet, but don't worry, we'll walk through step by step how to create one. Keep in mind, as I mentioned in the previous lesson, that there are multiple ways of working with this, including in the cloud; even though this opened in your browser, it is still running from your local machine, as you can see from the URL: it's localhost, meaning the notebook server is running on your own machine.

The dashboard gives you access only to the files and sub-folders contained within Jupyter's start-up directory, but the start-up directory can be changed, so it's possible to start the dashboard in any directory via the command prompt if you're familiar with it. What we want to do is click the drop-down next to New and select Python 3, or a different version of your choice; I stuck with Python 3, but if you have another one you can always use that. And there we go, we have our first Jupyter notebook open in a tab. Each notebook uses its own tab, and as you can see when you go back to the dashboard, it has created an Untitled .ipynb file. Each .ipynb file is a text file that describes the contents of a notebook in JSON format: each cell and its contents, including image attachments that have been converted into strings of text, is listed there along with some metadata. You can also edit the metadata yourself: click Edit and you have the option of editing it. I do want to highlight that there's no need to modify the metadata on a routine basis, but keep in mind that the option is there.

Now, with the notebook in front of you, the interface will hopefully not look alien, because Jupyter is essentially just an advanced word processor in a nutshell; as you can see, some of the menus look very familiar, since it's modeled right after a word processor. There are two terms, though, that you should notice and keep in mind: cells and kernels. They are key both to understanding Jupyter and to what makes it more than just a word processor. The kernel is basically a computational engine that executes the code contained in a notebook document, and if you remember, I'm using Python 3.
But there are other languages and kernels you can use as the computational engine, and the cell is basically a container for the code or text to be displayed. We'll look at kernels in more detail a little later; for now, just keep in mind that the kernel is the engine that drives the whole thing. Cells form the body of the notebook, and on the screen in front of you, the box with the blue line next to it is an empty cell. There are two main types of cell: a code cell and a Markdown cell. A code cell contains code to be executed and displays its output in place when it's run, while a Markdown cell contains text formatted using Markdown. The first cell in a new notebook is always a code cell, so let's try it out: type a print statement, such as print('Hello World!'), into the cell and click the Run button you see in the toolbar, right under the Cell menu. As soon as you run it, the printed text appears directly below the cell. In my notebook the second cell is what you'd refer to as a Markdown cell, while the cell where we typed the print statement is a code cell. When you run a code cell, its output is displayed below it and the label to its left changes from blank to In [1]. The output of a code cell also forms part of the document, which is why you can see it right there. You can always tell the difference between code and Markdown cells because code cells have that label on the left and Markdown cells do not. The "In" part of the label is simply short for input, while the number indicates when the cell was executed on the kernel; in this case it was number 1. If we run the cell again, you can see the label changes to 2, because it was the second cell to be run; the kernel keeps track of which cell was run when. We're also able to insert cells above or below, depending on the order in which we want to run the code.

Let's try another one: type import time and give it time.sleep(10), then run the code and see what happens. You might notice that nothing seems to happen, but there's an asterisk in the label where the number should be; the asterisk means the cell is currently running. I gave it a sleep of 10, so all this code does is make the machine wait for 10 seconds; once the 10 seconds are up, the label gets its number. In general, the output of a cell comes from any text data specifically printed during the cell's execution, as well as from the value of the last line in the cell, be it a lone variable, a function call, or something else, and you'll find yourself using that almost constantly in your own projects; we'll see a lot of it throughout the rest of this course, and you can also see it in the example I'm doing in the third cell. Finally, keep in mind that there are lots of keyboard shortcuts you should familiarize yourselves with; I've included a downloadable sheet with a few handy shortcuts to use while working in Jupyter notebooks.
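If you want to reproduce the little experiments from this lesson, the cell contents would look something like the sketch below (shown together here for convenience; in the notebook each numbered block would be its own cell).

```python
# Cell 1: the classic first code cell.
print('Hello World!')

# Cell 2: nothing prints for about 10 seconds; the label shows In [*] while it runs,
# then gets its execution number once the sleep finishes.
import time
time.sleep(10)

# Cell 3: output comes from anything printed during execution *and* from the
# value of the last line, which Jupyter displays automatically below the cell.
message = 'the last line of a cell is echoed as output'
message
```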
It's always good to familiarize yourself with the Jupyter notebook, especially if you're going to be working in machine learning; it's an essential part of it. 12. Data Analysis in Jupyter: Hi, everybody, and welcome to this lesson on analyzing some data using Jupyter notebooks. Now that we've looked at what a Jupyter notebook is, it's time to look at how notebooks are used in practice, which should give you a clearer understanding of why they're so popular. We're going to use a Fortune 500 data set that I mentioned earlier; remember, our goal is to find out how the profits of the largest companies in the US have changed historically. It's worth noting that everyone develops their own preferences and style, but the general principles still apply, and you can follow along with this section in your own notebook if you wish, which gives you some good hands-on practice.

Before we start, let's give our notebook a name; it's always good practice to do that, especially when you're working with multiple notebooks at the same time. It's also common to start off with a code cell specifically for imports and setup, so that if you choose to add or change anything, you can simply edit and re-run that cell without causing any side effects. That's what we'll do: we import pandas to work with our data, matplotlib to plot the charts, and seaborn to make the charts look a little prettier. It's also common to import NumPy, but in this case, although we use it via pandas, we don't need it explicitly. The first line isn't a Python command at all; it uses something called line magic to instruct Jupyter to capture matplotlib plots and render them in the cell output. That's one of a range of advanced features that are out of the scope of this course, which we'll look at in some of the other courses coming up in the pipeline.

Next, let's load our data. It's sensible to do this in a single cell as well, in case we need to reload it at any point, so we load the data in a cell all by itself; we're grabbing that fortune500.csv file, which I've included in the download section, so you can grab it from the course resources if you're going to walk through this on your own PC. Once the data is loaded, keep in mind that it's always good practice to save the notebook regularly, which is as simple as Ctrl+S, or you can save from the File menu. Also keep in mind that every time you create a notebook, a checkpoint file is created alongside your notebook file; by default, Jupyter auto-saves your notebook every 120 seconds to this checkpoint file without altering your primary notebook file. When you Save and Checkpoint, both the notebook and checkpoint files are updated, so the checkpoint enables you to recover unsaved work in the event of an unexpected issue, and you can also revert to the checkpoint from the menu via File and Revert to Checkpoint. All right, now let's get going: our notebook is saved and our data is loaded.
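For reference, the setup and loading cells would typically look something like the sketch below. The file name matches the download provided with the course, but treat the exact cell contents as my reconstruction rather than a copy of the instructor's notebook.

```python
# Setup cell: Jupyter line magic plus the imports for the whole analysis.
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="darkgrid")   # nicer default chart styling

# Loading cell: kept separate so the data can be reloaded on its own.
df = pd.read_csv('fortune500.csv')
```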
Loading the CSV put our data set into df, a pandas DataFrame, which is the most commonly used pandas data structure; it basically looks like a table, and with the head of the table and the tail of the table (meaning the end of it) we can take a quick look at the first and last rows of the Fortune 500 data we're working with. We have the columns we need, and each row corresponds to a single company in a single year. Let's rename those columns so we can refer to them later: year, rank, company, revenue, and profit, respectively.

Next we need to explore our data set: is it complete, did pandas read it as expected, are any values missing? We run len(df), and that looks good: there are 500 rows for every year from 1955 to 2005. Now let's check whether the data has been imported as we would expect; a simple check is to see if the data types, the dtypes, have been correctly interpreted. As you can see, there is a bit of a problem: something is wrong with the profit column. We would expect it to be float64 like the revenue column, since it's a number, so this indicates it probably contains some non-numeric values. Let's take a closer look at what the issue is; I've shown this to you on purpose, so you know what to do when issues like these arise with your own data sets. It looks like some of the values are strings, which have been used to indicate missing data, so let's check whether any other values have crept in. That makes it a little easier to interpret, but what should we do? That depends on how many values are missing, so let's check how many are missing out of the 25,000-odd rows in this data set. As you can see, 369 values are missing. That's a small fraction of our data set, though not completely inconsequential, as it's still around 1.5%. If the rows containing them are roughly uniformly distributed over the years, the easiest solution is to remove them, so let's take a quick look at the distribution. As we can see, the most invalid values in a single year is fewer than 25, and since there are 500 data points per year, removing those values would account for less than 4% of the data for the worst years; other than a surge around the nineties, most years have fewer than half the missing values of the peak. For the purposes of this demo, let's say that's acceptable and go ahead and remove those rows. Afterwards, let's check that it worked: we run len(df) again and look at df.dtypes to confirm the profit column has changed to float64, as it should be.

Great, our data set setup is finished. If you were going to present your notebook as a report, you could get rid of the investigatory cells we created, which are included here as a demonstration of the flow of working with notebooks, and merge the relevant cells into a single data set-up cell. That would mean that if we ever mess up our data set elsewhere, we could just re-run the setup cell to restore it.
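The exploration and clean-up steps above map onto a handful of short cells; here is a sketch of one typical approach. It assumes the df loaded earlier, and it assumes the missing profits are encoded as non-numeric strings (for example "N.A."), which is my reading of the data rather than something shown on screen.

```python
# Give the columns friendlier names.
df.columns = ['year', 'rank', 'company', 'revenue', 'profit']

len(df)        # expect 25500: 500 rows for each year from 1955 to 2005
df.dtypes      # profit shows up as object instead of float64

# Find profit values containing anything other than digits, a dot, or a minus sign.
non_numeric = df['profit'].str.contains('[^0-9.-]', na=False)
df.loc[non_numeric, 'profit'].unique()   # inspect what crept in
len(df.loc[non_numeric])                 # 369 rows, roughly 1.5% of the data

# Drop those rows and convert the column to a numeric type.
df = df.loc[~non_numeric]
df['profit'] = df['profit'].astype(float)
df.dtypes      # profit is now float64
```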
13. Looking at Jupyter Kernels: Hi, everybody, and welcome to this lesson on Jupyter kernels. This is as far as we had gotten in the previous lesson, and I had mentioned that behind every notebook runs a kernel. When you run a code cell, that code is executed within the kernel, and any output is returned to the cell to be displayed. The kernel's state persists over time and between cells: it pertains to the document as a whole, not to individual cells. So, for example, if you import libraries or declare variables in one cell, they will be available in another; in this way you can think of a notebook document as being somewhat comparable to a script file, except that it's multimedia.

First we'll import a Python package and define a function: we import numpy as np, and then we define a square function. Once we've executed that cell, we can reference np and square in any other cell. Let's see how that works. Here I'm going to set x to a random integer from 1 to 10 and y to the square of x, and then print a line saying "x squared is y". This will work regardless of the order of the cells in your notebook, and you can try it for yourself by printing the variables from any cell.

Now, let's say we assign a specific value to y, which, as you can see above, I said should be the square of x. Say we set y equal to 10. What do you think will happen if we run the print cell again? Will the output change, or will it still show the square of x? Most of the time the flow in your notebook will be top to bottom, but it's common to go back and make changes. In the Kernel menu we have the options to restart the kernel, restart and run all, or restart and clear output. Restart and run all re-runs all of the cells from top to bottom, in order: the first one, the second one, all the way down, so if we want to run the code all over again from a fresh start, we can do that. And as you can see, once I specified y equals 10, the output in cell number 7 says "8 squared is 10", which is mathematically incorrect. The reason is that in the cell before it, I hard-coded y to 10. Right after the square calculation you can see "8 squared is 64", which is correct, but once I set y to 10, every print after that point uses 10 for y. Just keep in mind that cells in a Jupyter notebook run in the order you execute them, so be aware of it; if you have a significant number of cells running code in different orders, it might impact the results you're trying to achieve.
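Here are the cells from that experiment, in the order they appear in the notebook; run top to bottom they produce the misleading "squared is 10" output described above. The %d formatting is my choice for the print line, so treat the exact strings as illustrative.

```python
# Cell 1: imports and a helper that every later cell can reference.
import numpy as np

def square(x):
    return x * x

# Cell 2: pick a random integer and square it.
x = np.random.randint(1, 10)
y = square(x)
print('%d squared is %d' % (x, y))   # correct, e.g. "8 squared is 64"

# Cell 3: hard-code y, overriding the computed value in the kernel's state.
y = 10

# Cell 4: rerunning the print now reports "8 squared is 10", which is wrong,
# because the kernel remembers whichever assignment ran most recently.
print('%d squared is %d' % (x, y))
```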
Lastly, you may have noticed that Jupyter gives you the option to change the kernel, and in fact there are many different options to choose from; when you initially created the notebook from the dashboard by selecting a Python version, you were actually choosing which kernel to use. Not only are there kernels for different versions of Python, there are kernels for over 100 languages, including Java, C, and even Fortran. As a data scientist you may be particularly interested in the kernels for R and Julia, as well as the MATLAB and Calysto MATLAB kernels, and the SoS kernel, which provides multi-language support within a single notebook. Each kernel has its own installation instructions and will likely require you to run some commands on your computer. The type of machine learning or data analysis you're trying to do will determine which kernel you'll be working with; one of the most common is Python, which is why I've stuck with it for my demonstrations, but keep in mind that you're able to use a different kernel based on the kind of data you're trying to analyze.

14. Plotting with MatPlotLib: Hi, everybody, and welcome to this lesson on plotting with matplotlib. In the previous lesson we prepared our data and made sure it was all cleaned up, so next we can address the question at hand by plotting the average profit by year for the Fortune 500 data we loaded earlier. We might as well plot the revenue as well, so first we define some variables and a method to reduce our code: we group the data by year, take the mean of each group to get the averages, set x to the index and y1 to the average profit, and define a small plotting method that sets the title, the x label, and the y label so everything is nice and organized. With that done, we punch in a small bit of code to do the plotting, which, as I said, plots the increase in the mean Fortune 500 company profit. As you can see, there's exponential growth, but with some huge dips; those must correspond, for example, to the recession in the early nineties and the dot-com bubble in the early two thousands. It's pretty interesting to see that in the data, but how come the profits recovered to even higher levels after those recessions? Maybe the revenues can give us deeper insight, so let's plot the increase in the mean Fortune 500 company revenues from 1955 up to 2005 and analyze the data a bit further. As soon as we plot that, we can see another side of the story: revenues were not hit nearly as badly as the profits were, and that's some pretty good accounting work by the finance departments. With a little help from Stack Overflow, we can superimpose these plots with plus and minus one standard deviation to get a clearer picture of what's going on; as I mentioned, I've included this code in the download section of the course, so you can grab it and use it on your own at home. Now we can see that the standard deviations are huge: some companies made billions while others lost billions, and the risk has increased along with the rise in profits over the years. Perhaps some companies perform better than others; are the profits of the top 10% more or less volatile than those of the bottom 10%?
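Before we wrap up, here is a compact sketch of plotting code along these lines. It assumes the cleaned df and the imports from the earlier lesson, and it is my reconstruction of the approach, not necessarily identical to the code provided in the course downloads (the "millions" unit label is an assumption about the data).

```python
# Average revenue and profit per year.
group_by_year = df.loc[:, ['year', 'revenue', 'profit']].groupby('year')
avgs = group_by_year.mean()
x = avgs.index
y1 = avgs.profit
y2 = avgs.revenue

def plot(x, y, ax, title, y_label):
    # Small helper so every chart shares the same labelling logic.
    ax.set_title(title)
    ax.set_ylabel(y_label)
    ax.plot(x, y)
    ax.margins(x=0, y=0)

fig, ax = plt.subplots()
plot(x, y1, ax, 'Increase in mean Fortune 500 company profits 1955-2005', 'Profit (millions)')

fig, ax = plt.subplots()
plot(x, y2, ax, 'Increase in mean Fortune 500 company revenues 1955-2005', 'Revenue (millions)')

# Superimpose plus/minus one standard deviation around the mean profit.
stds = group_by_year.std().profit
fig, ax = plt.subplots()
plot(x, y1, ax, 'Mean profit with standard deviation', 'Profit (millions)')
ax.fill_between(x, y1 - stds, y1 + stds, alpha=0.3)
```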
There are plenty of questions we could look into next, and it's easy to see how the flow of working in a notebook matches our own thought process. This flow helped us easily investigate our data set in one place without context-switching between applications, and our work is immediately shareable and reproducible. If we wished to create a more concise report for a particular audience, we could quickly refactor our work by merging cells and removing the intermediary code, and then present it to whomever we like. Additionally, we can download the notebook in various formats to share with our team or with external parties: it can be downloaded as HTML, as PDF, and in a host of other formats. So this gives you a good overview of why Jupyter notebooks are used so often and why they've become so popular: they're easy to use, they're robust, and they give us good, quick insights into our data.