Introduction to WEKA + Real Data Mining Project | Amanda Courses | Skillshare


Introduction to WEKA + Real Data Mining Project

Amanda Courses, Entrepreneurship + IT


Lessons in This Class

8 Lessons (54m)
    • 1. WekaIntroSkillShare

    • 2. Introduction to Weka

    • 3. Preprocessing the data in Weka

    • 4. Classification in Weka

    • 5. Clustering in Weka

    • 6. REAL PROJECT - Business Understanding

    • 7. REAL PROJECT - Data Understanding and Preparation

    • 8. REAL PROJECT - Modeling and Evaluation






About This Class

Introduction to WEKA + Real Data Mining Project

This course will give you an overview of basic functionalities in Weka:

  • introduction to Weka
  • preparing and processing the data
  • classification of instances
  • creating clusters
  • visualizing the data

You will also be able to see how a REAL DATA MINING PROJECT in Weka is done.

Meet Your Teacher


Amanda Courses

Entrepreneurship + IT


Software Developer and Data Scientist based in Prague. Here she shares her knowledge and experience in information technology, data science, machine learning and entrepreneurship.





1. WekaIntroSkillShare: Hello and welcome to this course. My name is Amanda, and I'm a computer science engineer. In this course, I'm going to show you how to work with a popular data mining software called Weka. I will give you a basic overview of Weka, as well as explain how to do preprocessing of the data, preparation of the data, classification, clustering and visualization of the data. Also, part of this course will be a real data mining project that follows the CRISP-DM methodology, and it is also done in this data mining software called Weka.

2. Introduction to Weka: In this lecture, I want to talk to you a little bit more about Weka as a data mining software and also give you a basic overview of its capabilities. Then we will talk about each of those in more detail in the following lectures. Weka is a data mining system developed at the University of Waikato in New Zealand that implements data mining algorithms. In fact, if you put the cursor here, you can see that the weka is a native bird of New Zealand, and that's how the software got its name. Weka is a state-of-the-art facility for developing machine learning techniques and applying them to real-world data mining problems. It is a collection of machine learning algorithms for data mining tasks; the algorithms are applied directly to a data set. Weka implements algorithms for data preprocessing, classification, regression, clustering and association rules, and it also includes visualization tools. We'll talk about these individually in the following lectures. Weka is open source software issued under the GNU General Public License. In my case, I have it open on Linux, specifically on Ubuntu, but you should be seeing something similar even if you open it on another operating system. So here on the right side, I have Explorer, Experimenter, KnowledgeFlow, Workbench and Simple CLI.
Now, Explorer is a graphical user interface that uses the power of the Weka software, and we will look at it in a few seconds. Experimenter allows you to design your own experiments of running algorithms on data sets and also analyze the results. Right below it is KnowledgeFlow, which is a dataflow-inspired interface to Weka: the user can select components from a toolbar, place them on a layout canvas and connect them together in order to form a knowledge flow for processing and analyzing data. Finally, we have Workbench, and it is an environment that combines all of the GUI interfaces in a single interface. It is useful especially if you find yourself jumping a lot between two or more different interfaces, such as, for example, between the Explorer and the Experimenter. So first of all, let's open Explorer and let me show you some of the capabilities that we have here. The first thing, of course, that we have to do is give it some sort of data that we want to process; here you have Open file. Now, the native data file format for Weka is in fact ARFF, which stands for Attribute-Relation File Format. An ARFF file is an ASCII text file that describes a list of instances sharing a set of attributes. ARFF files were developed by the Machine Learning Project at the Department of Computer Science of the University of Waikato for use with the Weka machine learning software. But don't worry, you can in fact load other file types as well. For example, the most common one is CSV, and then all you have to do is use the converter and you will be able to work with that data in Weka as well. There are about ten different file formats that you can use; you simply use the converter and you can work with the data normally, just as if it were an ARFF file.
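To make the format concrete, here is a small hypothetical ARFF file: a relation name, typed attribute declarations, then comma-separated data rows, with `?` standing in for a missing value. The attribute names and data are made up for illustration.

```text
% A toy customer data set in ARFF format (hypothetical example)
@relation customers

@attribute age    numeric
@attribute gender {male,female}
@attribute income numeric
@attribute flag   {true,false}

@data
34,male,52000,false
41,female,61000,true
29,female,?,false
```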
So if we want to access Classify, Cluster, Associate or the rest of the functions here, we do have to load the data first. When you want to load your raw data into the preprocessor, all you do is click Open file, and then, as you can see, the default type is the ARFF data file. But since I don't have that, I'm going to choose the one that I do have, that is CSV; you can also choose from the rest of the file types as well, depending on what you have. In my case it's CSV, so I'm just going to click Open, and then you can see that Weka is not able to load the instances yet; I have to use the converter first. The most important thing in this converter, at least for this introductory lesson, is the field separator: in order to properly convert your file, you have to know what the field separator in your CSV file is. You can find out simply by opening your CSV file in a text editor, where you should be able to see the field separator. In my case, I already know that it is a semicolon, so I'm just going to put a semicolon here. Another thing that you may also want to change is the missing value: in case you have a missing value anywhere in your CSV file, Weka is going to mark it with a question mark, as a placeholder for missing values. Another thing that's very neat in Weka is that if you don't know what something does or what it's going to change, you can just place your mouse over it, and it will tell you: "The placeholder for missing values, default is question mark." This is a neat feature if you're not sure what something is supposed to do — for example, what this buffer size is supposed to do. You just put your mouse over it and it will tell you that it is the number of rows to process in memory at any one time. So you're able to get some hints here.
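The separator check described above is done by eye in a text editor, but it can also be automated. This is a rough sketch using Python's standard-library `csv.Sniffer`; the file name and contents are invented for the example, and this is only an illustration of the idea, not anything Weka itself does.

```python
import csv

def detect_separator(path, candidates=";,\t|"):
    """Guess the field separator of a CSV file by sniffing a sample,
    the same thing you would otherwise check by eye in a text editor."""
    with open(path, newline="") as f:
        sample = f.read(4096)
    return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter

# Build a small semicolon-separated file (hypothetical data) and test:
with open("customers.csv", "w", newline="") as f:
    f.write("age;gender;income\n34;male;52000\n41;female;61000\n")

print(detect_separator("customers.csv"))  # ;
```

Once you know the delimiter, you can enter it in Weka's CSV loader as the lecture shows.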
Now, you can change this missing-value placeholder to something else, for example "null", or anything else; I'm just going to leave it as a question mark, because it's not really important at this point. So you just go ahead and click OK. Now, as you can see, we have our data here. In the CRISP-DM methodology, which is the standard used in the majority of data mining projects, the first step is business understanding, but the second step is data understanding, and data understanding is what we do here. We have a list of attributes; once you click on a certain attribute, here under "Selected attribute" you get more information about it, and you also get a nice visualization here. So this is how you try to understand the data, and Weka gives you a simple interface to do it. We also know that the third step in CRISP-DM projects is data preprocessing, and usually data preprocessing is the most time-consuming phase of the project. You can process this data by using filters: you can see the Filter panel here, and when you click Choose you can select from a list of supervised and unsupervised filters, and also decide whether you want to apply the filter to an attribute or to an instance. But we'll talk more about this in a different lecture, the lecture about preprocessing your data. Another thing you can do is select all attributes with All, or, conversely, deselect them with None, and then you can remove specific selected attributes as well, so we will talk about this in more detail. Let's look at other capabilities as well. The second tab we have here is Classify — and we are able to access these tabs only after we load a data set into Weka, which we have already done.
We can see here that we can choose from a list of classifiers, which are neatly divided into groups, and also that we have four different testing options to choose from: training set, supplied test set, cross-validation or percentage split. We'll talk more about this as well. The next thing is clustering your data, and it looks very similar to the classification we've just described, with a few differences regarding the options that you can select. Then we have Associate, which, as you can see here, discovers association rules: Weka provides algorithms to extract association rules from non-numerical data. Then we have Select attributes: here Weka provides techniques to discard irrelevant attributes and/or reduce the dimensionality of your data set. What you do is, after loading a data set, you click on Select attributes, which opens a GUI that allows you to choose both the evaluation method, such as principal components analysis, for example, and the search method. You can see both here: you can click Choose and then pick one of those — as I've said, for instance principal components — and then here you can also choose the search method, for example greedy stepwise. Now, depending on the chosen combination, the actual time spent selecting attributes can vary substantially and can also be very long, even for small data sets. It's also important to know that not all evaluation/search method combinations are valid, so you should watch out for the error message in the status bar; you are not able to make all possible combinations out of these two. And the last thing that I want to show you is data visualization. Going back to the Preprocess tab, we also have data visualization here, which we can see in this panel.
It will show a histogram of the attribute distribution for one selected attribute at a time, and by default this is the class attribute. But here I have selected, for example, gender, and we can see the histogram. Now, if you move the mouse over the histogram, it will show you the ranges and how many samples fall in each range; here we can see male: 6158. And we have a button here, Visualize All, that will bring up a screen showing all distributions at once — if you click on it, you can see all of the distributions here. There is also a Visualize tab, which can be used to obtain scatter plots for all attribute pairs. So for now, this will be all regarding a basic overview of Weka, but in the following lectures we will look at each of these in more detail.

3. Preprocessing the data in Weka: In this lecture, I want to talk to you about preprocessing the data. I already said this, but preprocessing is the third step of the CRISP-DM methodology that is used for data mining projects, and it is usually the most time-consuming phase, because sometimes it can be really difficult to figure out how we want to process our data so that we can correctly apply the modeling algorithms on it in order to get the correct results. If we don't prepare and preprocess our data properly, then we may think that we got the correct results, whereas in fact we didn't. That's why we also have the evaluation phase, but we'll talk more about that in other lectures. Now, here we have the Weka Explorer open and we are in the Preprocess tab. I've already opened my data file, and you should do the same. One of the most important things that I want to talk to you about is filters.
So if you click Choose here, you can see that there are categories of supervised and unsupervised filters. Supervised filters consider the class value, while unsupervised filters don't: for example, the unsupervised Discretize filter only considers the attribute being discretized. I also want to talk to you about the most important filters here. As you can see, both the supervised and unsupervised folders are also split into two categories each, depending on whether you want to apply the filter to the attributes or to the instances. Let's get started with the supervised filters that we apply to the attributes, and the first one of them is the Discretize filter. Discretization is defined as the process of converting continuous data attribute values into a finite set of intervals with minimal loss of information. Some techniques, such as association rule mining, can only be performed on categorical data, and this is when the data requires discretization. So here in the filter list we have Discretize: it is a filter that discretizes a range of numeric attributes in the data set into nominal attributes. If you left-click on it, you can see here that we have debug, which is set to False; but if set to True, the filter may output additional information to the console. So this is the first one I wanted to talk to you about — and the filters that I've chosen are the ones that I have mostly used myself in data mining projects I worked on, and also the ones that I think are the most important. The second one I want to show is NominalToBinary: it converts all nominal attributes into binary numeric attributes. Now let's go to the instance filters, those that we can apply to instances. Here we have ClassBalancer: it reweights the instances in the data so that each class has the same total weight.
Now, of course, this specific filter may not mean much to you right now, but of course, when you have the data, you should understand what kind of preprocessing you want to do on it and why it is necessary. For example, ClassBalancer can be necessary because sometimes not each class has the same total weight, and then this may affect the modeling algorithm, and this can lead to overfitting. This, of course, heavily depends on the data that you specifically have and that you're working on. The next one to look at is Resample. Resample produces a random subsample of a data set, using either sampling with replacement or without replacement, so we can decide whether we want to replace the data or not. Let's go now to the unsupervised filters, and specifically to the attribute filters. The first one I'm going to talk about is one that I use very often, and that is very often used in data mining projects: InterquartileRange. We will also use it in our project, so it's important that you understand how it works. It is a filter for detecting outliers and extreme values based on interquartile ranges. Once you open the filter, you can see here that there is per-attribute detection, and it is set to False, but I want to set it to True so that I can detect a potential outlier and an extreme value in each attribute. Also, here you can see that debug is False, but if you set it to True it's going to output some additional information, which is something I'm just going to show you. So here you go: click OK, and then you click Apply in order to apply the filter. Now, once you scroll down, you're going to see that it has created new attributes — or rather, it displays them to us.
It tells us precisely which attributes those are: we can see that they have "Outlier" and "ExtremeValue" at the end of their names, which can be extremely useful if we want to remove all of our outliers and extreme values. And now what you can do is simply pick those that you want to remove, and remove them; you may decide you want to remove all of them, which can be done simply as well, and then you can just uncheck those that you don't want to remove. OK, but let's get on with our filters. Next we have some very important and very commonly used filters. If we scroll down, you're going to see that we have Remove, RemoveByName and RemoveType. RemoveByName removes attributes based on a regular expression matched against their names, but will not remove the class attribute. Once you open it, you can see that in here you write the regular expression, and the attributes matching this expression will be removed — so you're removing attributes by the name that they have. Then also, as I've said, we have simply Remove, which removes a range of attributes, and we have RemoveType, which removes attributes of a given type. These can be especially useful if you have some attributes that you know you have to remove from your data before you start modeling. So let's get on with it and continue. The next one that we have is ReplaceMissingValues. Now, this one can actually be a little bit tricky, so let me explain a little bit about missing values first. There are actually three different types of missing values: the values can be missing completely at random, they can be missing at random, and they can be missing not at random — so we can find, for example, some sort of pattern in the data that is missing.
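Before moving on, the kind of detection the InterquartileRange filter performs can be sketched in a few lines of Python. This is only an illustration of the idea: the multipliers below are the common 1.5×IQR / 3×IQR convention, not necessarily Weka's defaults, which are configurable in the filter's options.

```python
from statistics import quantiles

def iqr_flags(values, outlier_factor=1.5, extreme_factor=3.0):
    """Flag each value as 'ok', 'outlier' or 'extreme' using IQR fences."""
    q1, _, q3 = quantiles(values, n=4)   # first and third quartiles
    iqr = q3 - q1
    out_lo, out_hi = q1 - outlier_factor * iqr, q3 + outlier_factor * iqr
    ext_lo, ext_hi = q1 - extreme_factor * iqr, q3 + extreme_factor * iqr
    flags = []
    for x in values:
        if x < ext_lo or x > ext_hi:
            flags.append("extreme")      # beyond the wider fence
        elif x < out_lo or x > out_hi:
            flags.append("outlier")      # beyond the narrow fence only
        else:
            flags.append("ok")
    return flags

data = [10, 11, 12, 11, 10, 12, 11, 13, 30, 200]
print(iqr_flags(data))   # 30 is flagged "outlier", 200 is "extreme"
```

Weka adds the per-attribute Outlier/ExtremeValue flags as new attributes, just as shown in the lecture.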
Now, there are different ways to handle missing values, and this is simply left up to you to decide: what way you want to remove or handle the missing values, considering the specific data that you have and the specific modeling algorithms that you want to use. But there are some suggestions. The first suggestion would be to ignore objects with some missing values; you can decide to do that. The second one is to replace a missing value with a new value: this value could be something like null, or unknown, or a question mark. But you have to consider that not all modeling algorithms can work with missing values — some of them can, but not all of them — so it depends on which modeling algorithm you want to choose. And then the third possibility is to replace missing values with an existing value. This existing value could either be the most frequent value, a proportional fraction of all values, an arbitrary value, or a predicted value. What we have here, in our case, is that the missing values will be replaced with the modes and means from the training data. So those are all the possibilities, and the most important thing before you start handling the missing data is to think about the data set that you have and what kind of modeling algorithms you want to apply on it. Now, finally, we come to the unsupervised filters that can be applied to instances. The first one that we want to look at here is RemoveWithValues, and what it does is filter instances according to the value of an attribute. As I've already said, here again we have this debug option, where we can decide to set it to True and get some additional information as well. And there is one more filter that I want to show you.
It's SubsetByExpression, and it filters instances according to a user-specified expression. The expression here is "true", but you can also change it to whatever you want. Now, these are some filters I consider to be very important and that you should know. You certainly don't have to memorize any of them; it's just important to have an understanding of how they work, what some of their features are, what is supervised and what is unsupervised. And in case you decide that you need any other filter and you want to work with it, I think Weka is great, because it gives you a lot of hints. Let's say you want to pick this one: you can open it, and here under About the filter has a short summary, but you can also click More, where you will get more information. And another great thing, as I've already said, is that if you hold your mouse over something, it's going to show you some information as well. It's very, very user-friendly, so you shouldn't have trouble understanding what these other filters that we didn't talk about do. So that's all regarding preprocessing. It's very important to prepare and preprocess your data correctly, so that you can indeed apply all the modeling algorithms that you want and be able to get correct results.

4. Classification in Weka: In this lecture, we'll take a look at classification in Weka. As you can see, I have the Classify tab open, which we can access only after we have loaded the data, which we do here: we open the file, we load the data, and only after that can we access the rest of the tabs. The one that we're going to talk about now is classification. Classification is a data mining function that assigns items in a collection to target categories or classes.
The goal of classification is to accurately predict the target class for each case in the data. You can see that in Weka we can choose from six groups of classification algorithms, and we can summarize these six groups nicely. Starting here, we have bayes: it is a set of classification algorithms that use Bayes' theorem, such as Naive Bayes or BayesNet, which we see here. The second one is functions: once you open it, you can see the algorithms — it is a set of regression functions, such as linear and logistic regression, and here we also have the multilayer perceptron. Now, it's important to say, as you can see, that some of the algorithms are grayed out and we are not able to access them. This depends on what kind of data you have: for example, if you have numeric data, then you cannot use the algorithms that specifically expect categorical data. That's why preprocessing is very, very important — in case you want to use a specific algorithm, you need to understand what input it expects, what kind of data it works with as input. Let's continue: here we have lazy, and these are lazy learning algorithms, such as the k-nearest neighbors method. Then we have meta, which is a set of ensembling methods and dimensionality reduction algorithms, such as AdaBoost, which we have here, and bagging, used to reduce variance. If we scroll down, we have misc, which is miscellaneous algorithms such as, for example, SerializedClassifier, which we see here, and which can be used to load a pre-trained model to make predictions. Here we have rules, which are rule-based algorithms, and finally we have trees, which are decision tree algorithms — famous ones are, for example, RandomTree, RandomForest, or also J48. Now, the other thing that we can see here is that we have test options.
Unless you have your own training set or a client-supplied set — you can see here you can use "Use training set", that is, your own training set, or "Supplied test set" — you would use one of these two: cross-validation or percentage split. The percentage split that we can see here splits the data and separates a percentage of it, by default 66% of the data, for learning, and uses the rest for testing. Percentage split is especially useful when your algorithm is slow. The second one that we have here, which you most typically use when working on a data mining project, is cross-validation, and it works like many percentage splits: you split the data into, in our case, 10 folds, and repeat the following process ten times, because there are 10 folds (if it were 20, then 20 folds would be used). How we do this is that we use, in our case, nine folds for learning and we leave one fold out for testing, every time leaving a different fold out. What you're going to choose will depend on the amount of data that you have and on the type of algorithm that you choose, and as I already said, if your algorithm is slow, it is definitely recommended to use percentage split. In our case, we can take a look at an example: let's say we want to see a J48 tree. I've chosen cross-validation with 10 folds, and here you can click Start in order to start the classification. As you can see, it is working right now — the bird is moving — and now it is finished. Here you have the classifier output, in which you have a nice summary: you get the correctly classified instances as well as the incorrectly classified instances, and you also get different types of error, so you can see all of that.
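The two test options just described, percentage split and k-fold cross-validation, boil down to simple index bookkeeping. Here is a rough Python sketch of the splitting logic itself; note that for nominal classes Weka additionally stratifies the folds by class, which this toy version skips.

```python
import random

def percentage_split(instances, train_pct=66, seed=1):
    """Shuffle, then cut off train_pct% for learning, the rest for testing."""
    rng = random.Random(seed)
    shuffled = instances[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_pct / 100)
    return shuffled[:cut], shuffled[cut:]

def cross_validation_folds(instances, k=10, seed=1):
    """Yield (train, test) pairs; each fold is held out exactly once."""
    rng = random.Random(seed)
    shuffled = instances[:]
    rng.shuffle(shuffled)
    for i in range(k):
        test = shuffled[i::k]  # every k-th instance, offset by the fold index
        train = [x for j, x in enumerate(shuffled) if j % k != i]
        yield train, test

data = list(range(100))
train, test = percentage_split(data)
print(len(train), len(test))                           # 66 34
folds = list(cross_validation_folds(data, k=10))
print(len(folds), len(folds[0][0]), len(folds[0][1]))  # 10 90 10
```

With 10 folds, each instance is tested exactly once and trained on nine times, which is why cross-validation makes fuller use of small data sets than a single split.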
You can see all of that information here, and also, going from the beginning, you can see what data we used, how many instances, how many attributes — this is just some run information — and you can also see the test mode: in our case, 10-fold cross-validation. Then there is the classifier model; you can see that we used a J48 pruned tree. You get all of this information in the output, and finally you have the confusion matrix, which shows you the number of correctly classified examples. Another thing that you can see here is the result list: for example, if you decide to run the algorithm again — let's say in this case I want to try percentage split, so I click Start — you're going to see another result appear here. And, as it tells you right here, you can right-click for options. One thing that you're able to do is visualize the tree, by clicking Visualize tree — and in our case, this doesn't look like a very nice tree. Let's try this other result instead: we go Visualize tree, and unfortunately, I didn't preprocess my data — I'm not sure what data precisely I'm working on here — but otherwise you would see a tree, and it is a nice way to visualize your data while classifying it. So just remember that when you're choosing the algorithm, you have to think about the output that you want to get. I have already explained the majority of these — the most important algorithms and the way they work — in previous lectures, so make sure to check those lectures as well; now it is only important to apply your desired algorithm based on the data that you have and the results that you want to get. And just like when we were applying the filters, here you can also left-click on the algorithm and make certain adjustments in case you want to.
Also, just like with filters, you can get more information about the specific classification algorithm that you want to use.

5. Clustering in Weka: In this lecture, we'll talk about clustering. As you can see, I have opened the third tab here, the one called Cluster. Cluster analysis is used to find groups, also called clusters, of similar examples. This looks a lot like classification, as we've seen: you can choose the clustering algorithm and the clustering mode, and then here you're going to get the output, and you can also visualize the results that you get from this result list. So let's take a look at a few clustering algorithms. The first one is Canopy. Canopy requires just one pass over the data; it can run in either batch or incremental mode. Results are generally not as good when running incrementally, as the minimum and maximum for each numeric attribute are not known in advance. The second clustering algorithm that I think it's important to talk about is called SimpleKMeans; it is used to cluster the data using the k-means algorithm. It can use either the Euclidean distance, which is the default, or the Manhattan distance, and you can see here, once you open it, that the distance function is set to the default, Euclidean. But you can also choose from other distances as well, and the most commonly used alternative is the Manhattan distance; if you use the Manhattan distance, then centroids are computed as the component-wise median rather than the mean. The third clustering algorithm that I think it's important to talk about is called EM. EM assigns a probability distribution to each instance, which indicates the probability of it belonging to each of the clusters. EM can decide how many clusters to create by cross-validation, or you may specify a priori how many clusters to generate. The cross-validation performed to determine the number of clusters is done in the following steps.
Number one: the number of clusters is set to one. Two: the training set is split randomly into 10 folds. Three: EM is performed 10 times using the 10 folds, the usual cross-validation way. Four: the log-likelihood is averaged over all 10 results. Five: if the log-likelihood has increased, the number of clusters is increased by one, and the procedure continues at step two. The number of folds is fixed at 10 as long as the number of instances in the training set is not smaller than 10; if it is, the number of folds is set equal to the number of instances. It's also important to say that in case you use the EM algorithm, the missing values are globally replaced with the ReplaceMissingValues filter that I've spoken about in the preprocessing lecture. In the cluster mode panel, you can see that you can use the training set or the supplied test set, in case you have either of those; otherwise, you can use the percentage split — it's set by default to 66% — or classes-to-clusters evaluation. In this case, we'll use the percentage split, just to show you what it looks like. So click Start here, and then you can see that here we have the cluster output. You can see the algorithm is working; it is building the model on the training data, so you can see the little bird here moving, and once it stops, you'll know the algorithm has finished. It is actually taking a bit of time, but once it is finished, one thing that we can do — as you can see here, we have, say, four clusters for visualization — is right-click here and then visualize the cluster assignments. But right now we have to wait until the algorithm finishes. Now, as you can see, the clustering algorithm has finished, and it has taken quite a long time: as you can see here, the time taken to build the model is 50.8 seconds, almost a whole minute.
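The two distance functions mentioned for SimpleKMeans are easy to state precisely; this small Python sketch is just to make the difference concrete, and is not Weka's implementation.

```python
from math import sqrt

def euclidean(a, b):
    """Straight-line distance — SimpleKMeans' default."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """City-block distance; with it, centroids are computed as
    component-wise medians rather than means."""
    return sum(abs(x - y) for x, y in zip(a, b))

p, q = (0.0, 0.0), (3.0, 4.0)
print(euclidean(p, q))   # 5.0
print(manhattan(p, q))   # 7.0
```

Manhattan distance is less sensitive to single large coordinate differences, which is one reason to pair it with median-based centroids.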
Now you have the output here, so you can take a look. There is quite a lot, because I didn't really preprocess the data that I've used here, and that's also why the model took so long to build. Here you have the clustered instances. Finally, the last thing I want to show you is the visualization, which is something I already mentioned: you right-click and click Visualize cluster assignments, and you can see them here. Again, I repeat that I didn't preprocess or prepare the data I've used here, and that's why it doesn't necessarily look as nice as it could. But this is also the data — the time deposit bank data — that we will be working on in the real project, where we will do real preparation and preprocessing, so you will be able to get a better understanding of it. That is all regarding the clustering algorithms in Weka. 6. REAL PROJECT - Business Understanding: Hello and welcome. In this last section, we will take a look at a real project example where we will use Weka and where we will apply the CRISP-DM methodology. So let's get started with business understanding. The purpose of this work is to analyze the data using the CRISP-DM methodology. We have to use data mining tools and methods in order to find a solution to the data mining task that we will decide on for the analysis. For a simple exploratory analysis, I will use LibreOffice Calc, just because I am working on Ubuntu at the moment, but feel free to use Excel or even Google Sheets — it doesn't really matter; we're just using it for a simple overview of the data. We will use Weka for the data preprocessing and modeling. Since Weka works best with ARFF files, I have used Weka's file converter to convert the CSV file to an ARFF file so that Weka could read it, and this is something that I will show you later on. I will attach the data file that we will work on.
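To make the CSV-to-ARFF conversion step concrete, here is a minimal hand-rolled sketch of what the ARFF format looks like. The column names and values below are hypothetical examples, not the actual columns of the course's bank data set, and in practice you would let Weka's own converter infer the attribute types for you:

```python
# Minimal sketch of the ARFF layout that Weka's CSV converter produces:
# a @relation line, one @attribute line per column, then @data rows.
# Column names and rows are made-up illustrations.
import csv, io

csv_text = """gender,total_income,flag
M,42000,F
F,38500,T
"""

def csv_to_arff(csv_text, relation, nominal_values):
    """nominal_values maps a column name to its allowed values;
    all other columns are treated as numeric."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    lines = ["@relation " + relation, ""]
    for col in header:
        if col in nominal_values:
            lines.append("@attribute %s {%s}" % (col, ",".join(nominal_values[col])))
        else:
            lines.append("@attribute %s numeric" % col)
    lines += ["", "@data"]
    lines += [",".join(r) for r in data]
    return "\n".join(lines)

arff = csv_to_arff(csv_text, "bank", {"gender": ["M", "F"], "flag": ["F", "T"]})
print(arff)
```

Running this prints a tiny but valid ARFF file: nominal attributes list their allowed values in braces, numeric ones are declared `numeric`, and the data rows follow the `@data` marker.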
Please open it on your computer as well, so you can follow along or even explore the data on your own. It is important to understand the business problem before we start analyzing the data. Now, this is a sample of 10,000 bank customers, as you can see here, so we have a lot of data. All the way at the end, you can find the flag variable. This will be our target variable, also known as the class. The flag variable indicates the opening of a new term deposit account: you can see here an F, which is for false, or, for example, you can see here a T, which is for true. True means that the customer has opened a new term deposit account. This sample can be used for building predictive models and proper customer profiling for promoting term deposits. Using data like this, we can create models that will help us with targeted marketing. The data description document also states some interesting uses for this data: for example, descriptive statistics, association analysis, propensity modeling, or simple data management. Now, because we're talking about business understanding, this is also the phase where we define the problem that we want to solve by looking at this data. So I will take a look at what some of the common attribute values are for customers that decided to create the term deposit account, as these could serve as indicators of which customers we should advertise this option to more often in the future. In the next lecture, we will talk more about the data that we have and the possible indicators. 7. REAL PROJECT - Data Understanding and Preparation: Hello and welcome. In this lecture, we will talk about data understanding. We have our data open here in LibreOffice Calc — again, you can use Excel or Sheets, as I said already. We have received the data in the CSV format. We have 10,000 rows, which means that we have information about 10,000 customers.
Attributes that have a flag of either one or zero, as you can see here, are those that can only take a value of basically one or zero, meaning yes or no. In total, we have 45 attributes, and out of those, six are categorical. Our data contains the following attributes, which can be split into the following groups for a better understanding of the data. So let's start here at the beginning and move a little bit through the columns. We have gender, which can basically be either male or female; this is a demographic attribute. Some other demographic attributes would be birth date and marital status; the marital status can be single, married, divorced, or widowed. Then we have the number of children, as well as the occupation category. The other attributes each fall into one of the remaining categories: for example, financial attributes, such as total income; then bank products, for example savings and current accounts; then we have balances and accounts, transactions, and credit card information. Finally, the last attribute is our target attribute — the term deposits flag, which records whether the client opened a new term deposit. As I already mentioned, the term deposits flag is either F or T, so either false or true. The first thing we're going to do is something called exploratory analysis, but before we start that, let me show you how to get started in Weka. I want to open my file here, and that's what we'll do from the beginning. We're going to go to Open file, and you're going to locate your CSV file. Weka is going to tell you that you need to use the converter, so you use the converter. There are some things you can adjust here: for example, you can change the field separator, because a different one is used in this file, and for the missing value placeholder you can leave it like this or,
for example, choose a different one. Right now we're not changing anything else, and you can click OK, and then you will have the data open here. If you, for example, click gender, then you can see the counts of males and females. That's a nice piece of exploratory analysis that you can do just to see what's happening: how many males you have, and how many females. If you go to marital status, that's also interesting: you can see that the majority of people are married, and the smallest group is actually widowed. So you can see a lot of interesting things here. Another interesting thing that you can see is that you have 0% missing data. That's very useful to know, because otherwise we would have to do something with it. So there's a lot of exploratory analysis you can do — that basically means just going through the data and seeing if you can find something interesting. So far, the important thing is that the data tells us there are no missing values. Another thing that you want to look at is outliers and extreme values. So what we want to do is here, where you have Filter: we're going to go to Choose, then to the unsupervised attribute filters, and there you have InterquartileRange. We'll choose that one, and you also want to set it up here: there is a detectionPerAttribute option, which you want to set to true. This detectionPerAttribute option will generate separate attributes marking, for each attribute, whether a value is an outlier or an extreme value, which will help us remove them. So here you go OK, and then you have to click Apply to apply the filter. After we have applied the filter, we can scroll down, and you're going to see exactly what I was talking about: here you have Outlier, ExtremeValue, Outlier, ExtremeValue —
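The rule behind the InterquartileRange filter can be sketched in a few lines of plain Python. This is an illustration only, not Weka's code; the factors of 3 (outlier) and 6 (extreme value) match the filter's defaults as I understand them, so verify them in the filter's options panel:

```python
# Sketch of the interquartile-range rule: values beyond Q1/Q3 plus or
# minus a multiple of the IQR are flagged. Factor 3 -> outlier,
# factor 6 -> extreme value (assumed defaults; check the filter options).

def quartiles(values):
    s = sorted(values)
    def q(p):
        i = p * (len(s) - 1)                    # linear interpolation
        lo, hi = int(i), min(int(i) + 1, len(s) - 1)
        return s[lo] + (s[hi] - s[lo]) * (i - lo)
    return q(0.25), q(0.75)

def classify(values, outlier_factor=3.0, extreme_factor=6.0):
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    labels = []
    for v in values:
        if v < q1 - extreme_factor * iqr or v > q3 + extreme_factor * iqr:
            labels.append("extreme")
        elif v < q1 - outlier_factor * iqr or v > q3 + outlier_factor * iqr:
            labels.append("outlier")
        else:
            labels.append("ok")
    return labels

incomes = [30, 32, 35, 36, 38, 40, 41, 43, 45, 300]
print(classify(incomes))  # the 300 lands far beyond the extreme-value fence
```

This mirrors what the filter's generated Outlier and ExtremeValue indicator attributes record for each instance.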
All of the outliers and extreme values are now flagged here separately. So what you can do is check these indicator attributes and remove them — you could go here and click Remove. But what you could also do is check all of the attributes and then just uncheck the ones you want to keep. So I'm going to go one by one, unchecking the ones that are not outlier or extreme-value attributes. This is also very important, because outliers and extreme values can have a significant impact on how your model is going to look, and if you have a lot of outliers and a lot of extreme values, then your model may not be accurate. Now we have all of the outlier and extreme-value attributes checked, and we click Remove in order to remove them all. One more thing I want to show you here: if we go through this data, let's see whether there's something interesting we can find. If you go to total income, you can look at some statistics, and we find that the minimum value is actually a negative value. This is probably just a typo, but since we cannot be sure, and this is a mere 0.01% of the data, I have decided to completely remove this customer. How do we do this? We can go here to Edit, and we're going to see all of the instances. Here we have total income, and we can sort the data ascending, so the first instance we see is the one that has the negative total income. What I decided to do is to completely remove this instance: right-click on it and then delete the selected instance. Okay. 8. REAL PROJECT - Modeling and Evaluation: Hello and welcome. In this lecture, we'll talk about modeling — that is, modeling on the data that we have already processed and prepared. We're going to take a look at the different modeling algorithms that I have decided to use, but you can also try out different ones.
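The sanity check just described — sort by income, spot the impossible value, drop that row — can be sketched like this (the records below are made-up examples, not the real data set):

```python
# Sketch of the same check done in the Weka instance editor:
# sort by total income, notice the impossible negative value, drop it.
# These records are invented for illustration.

customers = [
    {"id": 1, "total_income": 42000.0},
    {"id": 2, "total_income": -1500.0},   # impossible: likely a typo
    {"id": 3, "total_income": 38500.0},
]

by_income = sorted(customers, key=lambda c: c["total_income"])
print(by_income[0])  # the suspicious record sorts first, as in Weka's editor

cleaned = [c for c in customers if c["total_income"] >= 0]
print(len(cleaned))  # -> 2: one record out of the sample was removed
```

Since only one record out of 10,000 is affected, removing it is cheaper and safer than guessing what the income should have been.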
So the first one, from the decision trees, I decided to use J48. Despite the fact that decision trees can only work with numeric data if it is discretized, there is no need for us to do it ourselves, since the J48 algorithm does it for us. Every split in a decision tree is based on a feature: if the feature is categorical, the split is done on the elements belonging to a particular class; if the feature is continuous, the split is done on the elements higher than a threshold. So here we have the output from Weka. Now let us look at the next algorithm — that is Naive Bayes. Naive Bayes did badly, classifying a little more than 40% of the examples incorrectly. This should not be taken as a surprise, since Naive Bayes, as we've already mentioned, works great, but only when the assumption of independence among the features holds true. I also decided to try out IBk, an instance-based classifier, which did quite well here too. And finally, from the rules, I decided to try OneR, and we can see that the percentage of correctly classified instances is quite high: it's 98.6%. Finally, we're going to take a look at all of them together as an evaluation, so we can see how successful they are. You can also try adjusting certain elements a little bit later: for example, I've used ten-fold cross-validation, and you can try other settings as well. So with Naive Bayes we have a success rate of 59%, with IBk 98%, with J48 almost 99%, and with OneR we have 98.6%, so we can see that J48 did the best in this case.
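To show why a rule learner as simple as OneR can score so well, here is a minimal sketch of the OneR idea: for each attribute, build a one-attribute rule (predict the most frequent class for each value) and keep the attribute whose rule makes the fewest training errors. This is an illustration of the principle, not Weka's implementation, and the toy rows below are invented:

```python
# Minimal OneR sketch: one rule per attribute, keep the least-error one.
# Toy data invented for illustration; "flag" plays the role of the class.
from collections import Counter, defaultdict

def one_r(rows, attributes, target):
    best = None
    for attr in attributes:
        # Tally the class counts for each value of this attribute.
        per_value = defaultdict(Counter)
        for row in rows:
            per_value[row[attr]][row[target]] += 1
        # Rule: each value predicts its most frequent class.
        rule = {v: counts.most_common(1)[0][0] for v, counts in per_value.items()}
        errors = sum(1 for row in rows if rule[row[attr]] != row[target])
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best

rows = [
    {"gender": "M", "has_savings": "yes", "flag": "T"},
    {"gender": "F", "has_savings": "yes", "flag": "T"},
    {"gender": "M", "has_savings": "no",  "flag": "F"},
    {"gender": "F", "has_savings": "no",  "flag": "F"},
    {"gender": "M", "has_savings": "yes", "flag": "T"},
]
attr, rule, errors = one_r(rows, ["gender", "has_savings"], "flag")
print(attr, errors)  # has_savings predicts flag with zero training errors here
```

When one attribute alone is strongly predictive of the class, as in this toy data, OneR's single-attribute rule is already nearly as accurate as a full decision tree, which matches the close J48/OneR scores seen above.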