
Applied Data Science - 4 : Data Engineering

Kumaran Ponnambalam, Dedicated to Data Science Education

6 Lessons (1h 12m)
    • 1. About Applied Data Science Series (8:12)
    • 2. Data Acquisition (16:01)
    • 3. Data Cleansing (10:50)
    • 4. Data Transformations (11:09)
    • 5. Text Pre Processing TF IDF (14:53)
    • 6. R Examples for Data Engineering (11:14)

About This Class

This class is part of the "Applied Data Science Series" on SkillShare presented by V2 Maestros. If you wish to go through the entire curriculum, please register for all the other courses and go through them in the sequence specified.

This course focuses on Data Engineering. It goes through the steps of Data Acquisition, Cleansing, Transformation, and Text Pre-processing.

Transcripts

1. About Applied Data Science Series: Hey, welcome to the course Applied Data Science with R. This is your instructor, Kumaran Ponnambalam, from V2 Maestros. Let's go through and understand what this course is all about. The goal of the course is to train students to become full-fledged data science practitioners: people who can execute an end-to-end data science project, right from acquiring data, all the way to transforming it, loading it into a final destination, performing analytics on it, and finally achieving some business results from that analysis. What you gain by taking this course: you understand the concepts of data science; you understand the various stages in the life cycle of a data science project; you develop proficiency with R and use R in all stages of analytics, from exploratory data analytics to descriptive analytics to modeling to, finally, prediction using machine learning algorithms; you learn the various data engineering tools and techniques for acquiring, cleansing, and transforming data; you acquire knowledge about the different machine learning techniques and, most importantly, how to use them; and you become a full-fledged data science practitioner who can immediately contribute to real-life data science projects. Not to mention that you can take this knowledge to your interviews so that you can get a position in data science. Theory versus practice: we wanted to touch upon this particular point. Data science principles, tools, and techniques emerge from different science and engineering disciplines: computer science, computer engineering, information theory, probability and statistics, artificial intelligence, and so on. The theoretical study of data science focuses on the scientific foundations and reasoning behind the various machine learning algorithms. It focuses on understanding how these machine learning algorithms work in a deep sense, and on being able to develop your own algorithms and your own implementations of these algorithms to solve real-world problems; it dwells on a lot of equations, formula derivations, and reasoning. The applied part of data science, on the other hand, focuses on applying the tools, principles, and techniques to solve business problems. Here the focus is on using existing techniques, tools, and libraries, and on how you can take these and apply them to real-world problems and come out with business results. It requires an adequate understanding of the concepts, knowledge of what tools and libraries are available, and knowing how to use those tools and libraries to solve real-world problems. So this course is focused on the practice of data science, and that's why it's called Applied Data Science. Inclination of the course: data science is a trans-disciplinary and complex subject. It mainly has three technical areas to focus on: math and statistics, machine learning, and programming. This course is oriented towards programming; it is oriented towards existing software professionals and is heavily focused on programming and solution building.
It has limited, as-required exposure to math and statistics, and it covers an overview of machine learning concepts; it gives you an adequate understanding of how these machine learning algorithms work, but the focus is on using existing tools to develop real-world solutions. In fact, 90 to 95 percent of the work that data scientists do in the real world is the practice of data science, not really the theory of data science. This course strives to keep things simple and very easy to understand. We have definitely made this very simple: we have stayed away from some of the complex concepts, or we have toned them down, so that it is easy to understand for people at all levels of knowledge in the data science field. It is kind of a beginner's course, if I may say so. The course structure: it goes through the concepts of data science to begin with — what exactly data science is and how data science works. It looks at the life cycle of a data science project, with its various stages. It then goes into some basics of statistics that are required for doing data science. It then goes into R programming, with a lot of examples of how you would use R for the various stages in a data science project. Then there is the data engineering part: what you typically do in data engineering and what the best practices in data engineering are — it covers those areas. Finally, there is the modeling and predictive analytics part, where we delve into the machine learning algorithms. We also look at end-to-end use cases for these machine learning algorithms, and there are some advanced topics that we touch upon as well. Finally, there is a resource bundle that comes as part of this course. The resource bundle basically contains all the data sets, the data files, and the sample and example code for everything we teach as part of this course and cover in the examples. So do download the resource bundle; it has all the data and all the code samples you need to experiment with the same things yourself. Guidelines for students: the first thing is to understand that data science is a complex subject, and it needs significant effort to understand. So if you're getting stuck, do review and re-review the videos and exercises, and do seek help from other books, online resources, and support forums. If you have queries or concerns, do send us a private message or post your question, and we will be happy to respond as soon as possible. We're constantly looking to improve our courses, so any kind of feedback you have is welcome; please do provide feedback through private messages or emails. At the end of the course, if you do like it, do leave a review — reviews are helpful for other prospective students considering this course, and do look out for other future courses from V2 Maestros; we want to make that easy for our students. Relationship with the other V2 Maestros courses: our courses are focused on data science related topics — the technologies, processes, tools, and techniques of data science — and we want to make our courses as self-sufficient as possible.
What that means is: if you are an existing V2 Maestros student, you may see some content and examples repeated across courses. We want to make the courses self-sufficient, so rather than saying at some point in a course, "okay, go look at this topic in another course, register for that other course and learn about it there," we would rather keep things within the same course itself. Unless the other concept is a huge concept that deserves a separate course of its own, we want to include it as part of this course. So you might see some content repeated across courses. Finally, we hope this course helps you advance your career. Best of luck, happy learning, and do keep in touch. Thank you.

2. Data Acquisition: Hello, welcome to this module on data engineering. This is your instructor, Kumaran, here. Data engineering is a vital part of data science, and it is the most difficult part of data science. Getting data from its sources, making sure the data is valid and reliable, cleansing it, doing transformations, and putting it into a repository where it can be properly analyzed is one of the most painful tasks you do in data engineering. It is the most painful, most laborious, most time-consuming work, and this is what will give you headaches in your data engineering life — or rather, your data science life. So let's start by looking at what kinds of data sources exist and what kinds of data sources a data scientist will deal with. The data sources play a very important role in determining what kind of data processing you do — the type of the data, the source of the data, the domains — and what kind of data processing architectures and workflows you set up. This depends on what kind of data quality and reliability exists for the source data. Is the source data really reliable and believable, or do you have to do a lot of checks, measures, and validations to make sure the data is indeed reliable? It also impacts your network planning, because the size of the data, where the data lives, and the bandwidth required to move the data from one place to another all affect your network planning considerations. You might have to provide fault-tolerance capabilities if the data is going over what we would call a risky network, or if the data is coming in real time and you cannot go back and reprocess it — so you design for fault tolerance. Security is an important consideration: typically a lot of security measures are put in place, especially if the data is flowing across organizations or the data is coming from the cloud. There are also organizational boundaries to deal with, because you might be sitting in one department while the data sits in another department, and you have to work through those organizational boundaries to get access to the data and bring it into your domain to start working on it. So what kinds of data sources are there? The first kind is enterprise data sources — data sources that exist within your own company or organization. This is the easiest, most convenient data store you might have, and it typically sits in an RDBMS. Enterprise data sources are generally CRM systems or similar systems, typically sitting in an RDBMS, and they are typically populated by good, well-built applications.
Why do I call them well-built applications? Because the applications typically take care of validating data. When the user is entering data, the application usually validates it and makes sure the data being entered follows certain rules and constraints: a field allows only a certain number of characters, values come from a pick list, some columns are mandatory. It makes all these checks so that the data coming in is complete. Even things like foreign keys are usually correctly linked with the foreign key table, so the data is pretty complete and pretty clean. Accessibility to enterprise data sources is easy, and there are no rate limits. If you're looking at cloud data sources, there are rate limits on how much data you can access in a given day or in a given 15-minute window; organizational data sources do not have these rate limits, which is an advantage when you design your data flows and how you extract data from the sources. Enterprise sources have excellent quality and availability: there typically won't be errors, missing records, or missing fields in an enterprise data source. The issues are the data guardians: you may be getting data from another department, or you're getting data from IT, and you have to answer a number of questions about why you want this data and how you are going to use it, because they have to make sure the data is not misused in any way and that the security of the data is not compromised in any way. So you have to go through these organizational boundaries, get past the data guardians, and get the data flowing. But enterprise data sources are among the cleanest data sources you can get. The second type of data source, which is getting more and more popular, is cloud data sources. A number of organizations are moving their applications onto the web, so rather than on-premise applications you have cloud-based applications, the data is sitting in the cloud, and how to go and get data from the cloud is a big challenge. The data is all stored on the web; Salesforce is one of the most popular cloud application vendors, and a number of companies today use Salesforce for their sales activities. Access to data in Salesforce is typically through SOAP or REST APIs. A lot of companies also support CSV downloads, but CSV downloads are a pain and not that secure, especially when you have to program something; SOAP or REST APIs are far more secure, far more reliable, and robust. New-age cloud data sources usually support open REST APIs. Security in these cases is a predominant factor, because you're getting data from the cloud and the data is flowing through the public Internet, so you have to make sure the data exchange happens without any kind of security compromise. Rate limits may apply — limits the vendors usually put on how much data you can extract in a given day — and those are also determined by the kind of licensing you have and the kind of product you bought. So you have to consider that when you're building your data acquisition code: how much data you can get, how frequently you can get it, and so on.
The quality of the data will typically still be excellent, because the cloud data source also puts in constraints, checks, and balances to make sure the data is what it should be: what the valid values are, which columns cannot be empty and which can be, and any kind of cross-linking, like foreign key linking, is taken care of. So the quality of the data is also pretty good. The third type of data source is social media data sources, where you're trying to get data from social media websites like Facebook, Twitter, LinkedIn, or Google. There is a lot of data mining going on against these kinds of data sources, because you want to mine information about people and then analyze them: maybe they are your customers, maybe they are potential customers; you mine data about them, you analyze them, and then you use this analysis for further research. It is similar to cloud data sources in most aspects — they have their own REST APIs and security, they have rate limits and all that. But accessing public data about people and companies might involve privacy issues, so one thing you want to consider is what kind of data you can really extract and use about other people without telling them. Rate limits in these cases are pretty restrictive, because you're not paying for them — these are free services — so they typically limit how much data you can get in and get out; you have to consider this before you build all your data extraction programs. The data is mostly profile-based and transaction-based in these cases: you're basically getting data about people, finding their links, building their network, and so on. The last way is the brute-force way, which is called web scraping, in which case you're just scraping websites: a robot, a web crawler, just gets the raw HTML of a website, and then you use that HTML to extract pieces of information inside it and then use it. This is a very cumbersome approach, because HTML code can be like anything — it has so many things embedded, JavaScript inside, and so on — so you really have to build very intelligent code to do this web scraping. Also, web scraping is pretty much looking inside the HTML you extracted for other links, then going to those other links and continuing to extract data. It is a very hard, very cumbersome way of getting data, and the data is very dirty, because there is no reliability: there is no guarantee that the data is going to be of a particular shape, and nothing makes sure that all the columns or data elements you need always exist. So you pretty much have to do a lot of cross-checking and balancing, and in fact a lot of data imputation, if you want to get data from web scraping. This data is mostly text and requires significant processing resources, because to use it for predictive analytics purposes you have to convert the text data into numeric data — we're going to see later in this data engineering section how you can do that. There are also a lot of security, privacy, and intellectual-property concerns, because you're just brute-force scraping without telling the owner what you're going to be doing with the content.
So you do have to be aware of what you can really get and what you can really scrape without running into the security, privacy, and intellectual-property concerns of the owners of the web pages. What kinds of data formats are there for the data flowing in? You can have tables that come from an RDBMS — the most famous and popular, very structured data that can come in. You have data in CSVs; this is the most common data exchange format. Typically, when somebody asks you for data, this is the easiest format to extract and send off, and the easiest format to receive. CSVs can hold large amounts of data, but in processing there is a lot of manual moving around of these CSV files from the source to the destination. XML is used for configuration and metadata, but it can sometimes also hold the actual data you want to use; it depends on the source of the data and what kind of formats the source supports. JSON is the new-age exchange format that is becoming very popular: a lot of applications today support JSON — in fact, all the cloud data sources today, like Salesforce or Twitter, support JSON — and it is the most popular new-age data exchange format in which data flows across the web. Then there is text, of course — the last resort. When all you have is raw text, you typically have a lot of processing to do in terms of getting the text, processing it, cleaning it up, and then getting the information you want out of it. The last formats are binary, like images and voice streams. If you're trying to analyze images and voice streams, that is another type of data you will be extracting and moving around, and this data is typically huge — as we know, images and voice streams take a lot of space — so you have to take all those requirements into consideration, like the storage requirements and the bandwidth requirements, to store this data and move it around. Moving on to data acquisition: what kinds of trends are happening in terms of how you acquire data from the source? Typically it can be batch mode — the most popular mode, where you get a data file every day from the source to the destination, typically a CSV file, and then you process that CSV file on a daily basis. But today data acquisition is getting more real time — really real time — where you typically set up push triggers on the source, so that any time new data comes in or data is modified, you immediately get an event from the source to the destination, and then you get the data and immediately process it. Real-time analytics, or streaming analytics, is becoming more and more popular today because people want information and analytics in real time, so data acquisition also becomes real time in those cases, and it just happens through these push triggers. Interval acquisition happens, say, every 30 minutes or so; it is a kind of balance between batch and real time. Sometimes real-time triggers are not possible because the source does not support a push technology, so you have to have interval-based acquisition, where you go every five or ten minutes, look for all the records that have changed, and pull those records.
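As a rough illustration of interval-based acquisition, here is a minimal R sketch; the endpoint URL, the modified_since parameter, and the 30-minute interval are all assumptions for illustration, not part of the course materials:

    # Minimal sketch of interval-based acquisition: every 30 minutes, ask the
    # source only for records changed since the last pull. All names are assumed.
    library(httr)
    library(jsonlite)

    lastPull <- Sys.time() - 30 * 60

    repeat {
      resp <- GET("https://source.example.com/api/records",
                  query = list(modified_since = format(lastPull,
                                                       "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")))
      newRecords <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
      # ... append newRecords to the local store / analytics repository here ...
      lastPull <- Sys.time()
      Sys.sleep(30 * 60)    # wait for the next interval
    }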
You can also create a hybrid system of batch, real time, and interval, where maybe some data comes in real time, some data comes at intervals, and some data comes in a batch at the end of the day — or you might get the same data again in an end-of-day batch to make sure it is complete. So all kinds of data acquisition strategies are deployed in the world today for moving data from the source to the place where analytics happens. What determines these acquisition intervals? The analytics needs: how frequently do people need this analytics? Then availability: is the data really available in real time, or is it only available at the end of the day? Then rate limits: how much data you can get in a given day or time frame determines how much data you can acquire, and whether you want to do interval acquisition or real-time acquisition. And finally, the reliability of these channels: how reliable are real-time channels versus batch channels in terms of security and availability, and in making sure the data arrives in one piece from the source to the destination? All of these determine how you come up with an acquisition strategy for the data. When it comes to acquisition, the coding part — the programming part — is not the challenge. If you're a programmer, you usually know how to get data from one place to another; these applications are typically built as Java/J2EE applications or Python applications that move data from one place to another, and they are industrial-quality applications. The challenge is more the non-technical part of getting data: working through all these limitations and challenges, and making sure you have a data acquisition architecture and workflow that takes care of a lot of these issues and considerations. Thank you.

3. Data Cleansing: Hi. In this section we are going to be talking about data cleansing. Data that is coming into your data processing stream might have a lot of issues, and you need to come up with some strategy, a plan, and some code to do the data cleansing before you start using the data for any kind of analysis or machine learning purposes. So what kinds of issues exist with data quality? It starts with invalid values. If you have something like a gender column, you expect it to be F or M, or male or female, but you might find something like A or B there. There can be many reasons why you have invalid data in a column, but it does come up in a data stream. Then there are formats of data, the standard example being date formats: dates can be written in multiple ways, like day-month-year or month-day-year, and it is a challenge to understand what the data actually is. Things like names — last name, first name versus first name, last name — are a very classic formatting issue with data flowing in. Attribute dependencies means there is a dependency of one attribute on another. For example, there may be a column called "is manager" in employee data, and another column that says "number of reporting people." It is expected that somebody who is a manager will have some number of reporting people, but it could be that one record says "is manager" is zero while the number of reporting people is five. So there is some issue with how that data was extracted from somewhere.
If it is coming from an RDBMS or from a CRM system, typically these kinds of issues won't happen, but there are plenty of cases where issues can be created in the data, either in the extraction part or in the data processing part, so there is always a possibility of issues with the data. Uniqueness of data: there could be duplicate records in the data that is coming in. Referential integrity problems: if you're getting two data sets and you expect that every time there is a record like A in the first set there is a corresponding record like B in the second data set, there could be referential integrity issues. There could be issues with missing values — some columns being blank, or half the record being blank — and there can be many reasons for it. Misspellings: spelling issues in the data will have a significant impact if you're doing something like text analytics, where you're using strings to compare documents and relate documents, so misspellings are an issue. Misfielded values, where the values end up in the wrong field: that typically happens in a CSV file when a comma is missed, so the column to which a specific value belongs gets shifted. Those kinds of issues also happen, along with wrong references, invalid references, invalid URLs, and so on. There are tons of these issues with respect to data quality when you're doing data engineering work. So how do you first find these data quality issues? There are a number of ways. The first is usually sampling with visual inspection: you take a random sample of records from the data that is coming in, visually inspect it, and see if there are any issues. A good way, but it is laborious and this kind of system is not going to scale. The second way is to have some automated validation code, like schema validations: when data is coming in, every record goes through validation code that runs checks, very similar to how you design a data entry screen — any time somebody enters data, there is validation code that runs to make sure the data entered by the user is correct. Similar to that, you might have automated validation code for the data that is flowing in, to validate every record and every column and make sure the data coming in is correct. You can do outlier analysis and see where outliers exist; outlier analysis is an excellent way to find data quality issues. For example, suppose you have a column called age, and you do a box plot and you see that there are outliers, like ages past 300 or 350. That immediately tells you there is a problem with the data, because you know that age cannot be 350, and that triggers you to go and find out what is happening with the age here. So outlier analysis is a great way of finding out whether there are problems with the data that is coming in. You can also do exploratory data analysis — look at some charts, bar charts, some X-Y plots — and they also tend to show if there is some data that is not within scope or not within range, similar to outlier analysis, and that also helps you identify whether there are problems with the data coming in.
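For instance, a quick R sketch of spotting an impossible age with quantiles and a box plot — the data here is made up purely for illustration:

    # Hypothetical ages with one obviously invalid value
    age <- c(23, 41, 35, 350, 28, 52, 19, 44)

    quantile(age)               # the maximum jumps far away from the rest of the data
    boxplot(age, main = "Age")  # 350 shows up as an extreme outlier

    age[age > 120]              # flag impossible values so they can be investigated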
How do you fix the data quality issues? Fixing data quality issues is like regular coding. You can do it in any language you're comfortable with; typically it is done in general-purpose languages like Java or R. There are also a number of ETL engines, ETL frameworks, and products that are used to move data from one place to another, and these ETL frameworks typically have some functionality through which you can fix data quality issues or prevent them. But this is pretty much boilerplate coding — regular coding in any programming language — so we're not going to see specific examples of how to fix these issues in this class; it is general boilerplate coding, and once you identify the issue, engineers typically know what they have to do about it. Possible fixes: fix the source if possible. If you know the data is coming from an enterprise database and you see that the data is actually wrong in that database, it means one of the systems putting data into the table is not behaving correctly — so go fix the source if possible. Find possible loopholes in the data processing streams: issues may not exist in the data source but might exist in the data processing stream. Suppose you have a program that extracts data from a database, summarizes it, and sends it to you; it is possible that that piece of code also has some bugs, so you also look there and see if you can fix that code so the bug goes away. You can also analyze the batches that are coming in, and then you can automate the fixing of the data that is coming in; that is also possible. There are a lot of libraries and tools available for working with data quality, especially when you look at the data analytics tools that exist today, like R or Python, or even the ETL engines like Pentaho — you will see that they do have libraries with which you can look at data for possible errors and then go and fix them. The last thing you want to be bothered about is what is called data imputation. Data imputation needs special-case handling because it has a severe influence on how machine learning algorithms work. Imputation is nothing but how you fix missing data: if a particular column doesn't have a value, what value do you put in there? For example, you have a column that is about gender, male or female, and that column comes in empty — when it is empty, what do you do with it? That is what is called data imputation. One thing you always have to remember is that any value present in a data set is used by machine learning algorithms as a valid value. What does that mean? If you have a database in an RDBMS and the value is null, the RDBMS knows how to handle it: it knows how to ignore null or use it in a proper way. Not so with machine learning algorithms. Suppose you have a column called gender and it has the values male and female, and wherever the value does not exist it is blank — or maybe instead of blank it is "null" or "NA," it doesn't matter. The machine learning algorithm will think that this particular column has three different valid values, the three values being male, female, and null, and it will continue to use null as a class of data. It is going to consider that null as valid data, so you have to figure out a way to replace these nulls with a proper, valid value, like male or female.
But how do you populate it? How do you know, for a record that is missing the gender, whether the gender should be male or female — what value do you put in there? That is called data imputation. Populating missing data is going to be a key step, because it is going to impact your prediction results. What techniques exist? You can populate by mean, median, or mode: if it is continuous data — that particular column of data is continuous, like age — and somebody's age is missing, one possible way to replace the missing value is to populate it with the mean value. The second technique is called multiple imputation, where you try multiple imputation techniques and then combine the results: you can use the mean, you can use regression, you can use the mode, and you can combine them together as much as you want. You can also predict the missing value: you can actually use a prediction algorithm to predict the missing value based on the other columns, so you end up doing predictive analytics to predict the missing data, using something like decision trees. That is also possible. But it is important that you don't leave the missing data as it is; you need to replace it using data imputation techniques.
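As a small illustration (not the course's own code), mean imputation for a numeric column and most-frequent-value imputation for a categorical one might look like this in R, using made-up data:

    # Hypothetical data with missing values
    age    <- c(23, NA, 41, 35, NA, 52)
    gender <- c("male", "female", NA, "male", "female", NA)

    # Numeric column: replace NA with the mean of the observed values
    age[is.na(age)] <- mean(age, na.rm = TRUE)

    # Categorical column: replace NA with the most frequent (mode) value
    modeValue <- names(which.max(table(gender)))
    gender[is.na(gender)] <- modeValue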
Thank you.

4. Data Transformations: Okay, now we move on to the various transformations that you have to do to the data to prep it for analysis or for machine learning purposes. One thing I want to say up front is that, going forward, a lot of the code examples for what you're seeing in the data cleansing and data transformation activities will be part of the use cases you're going to see later in the course, because it makes sense for these things to be used in the place where they are actually required, to show how important and purposeful these transformations are. So, just to clarify: if you're looking for a lot more code samples, you will find them as part of the use cases later in the modules. Different sources of data typically have different formats, and hence standardization is required. For example, say you're getting customer data from two different sources: data about your web customers from your web CRM and data about your phone customers from your phone system. Those two sets of data will have different formats and different structures, and they need to be standardized before they can be joined together and put into a single data source to use for further analysis. Having data in the same format and on the same scale obviously makes comparison and summarization activities easier. So what are the various things you do in terms of data standardization? The first thing you would start with is numbers. In the case of numbers, you want to standardize the decimal places, or, if the numbers are in logarithmic format, decide what the log base is — is it log base 2 or log base 10 — and normalize that. In the case of date and time: dates and times typically come in different formats, and you want to convert them to a proper structure — typically store them as epoch, which is time-zone insensitive, or in POSIX format. You also want to make sure that the date and time you're getting has a time zone associated with it, and that you actually adjust the time zones correctly, so that when you're looking at the data you're looking at it in the right way — two data sources might have the same data showing up in different time zones, so you have to adjust for the time zones before you can start comparing them. These are some of the standardizations you would do to the data. For text data, of course, you do things like name formatting: some names might be coming in as "first name last name" and some might be coming in as "last name, first name," and you want to sanitize them to a single format. There could be things like lower-casing everything or converting to a consistent case; all these kinds of things have to be done to text before it can be used for proper machine learning. Then there is further processing you do to get the data into proper shape before it can be used for analysis and machine learning, and one of the first things you might do is called binning. Why do you do binning? You want to convert numeric, continuous data into categorical data — convert numeric values into categories. In this example, on the right side, you have continuous data called age, which varies all the way from 11 up to 65. You want to convert it into a few classes or categories — four or five categories. In this case you're going to be doing that as ranges: you create a new column called age range, and the ranges are 1 to 20, 20 to 40, 40 to 60, and 60 to 80 — four different classes. Typically you add a new column and, based on the original column, you populate the range in the new column. These different ranges are what we call bins — the age ranges are the bins — and the individual data records are classified into these bins. Putting data into bins typically makes analysis a lot easier. It makes the use of classification algorithms for prediction purposes easier, and it is a very popular technique: rather than predicting age as a single continuous variable, a lot of the time you only want to predict the range of the age, and in those kinds of cases classification algorithms work better if the age range is available as a classification variable — a class variable — rather than as a continuous variable.
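A minimal R sketch of that binning step, using a made-up age vector and the ranges described above:

    # Hypothetical ages ranging from 11 to 65
    age <- c(11, 18, 25, 34, 47, 52, 61, 65)

    # Create the age-range bins 1-20, 20-40, 40-60, 60-80 described above
    ageRange <- cut(age, breaks = c(0, 20, 40, 60, 80),
                    labels = c("1-20", "20-40", "40-60", "60-80"))

    data.frame(age, ageRange)   # each record is now classified into a bin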
The next technique you would want to use is what is called indicator variables. With indicator variables, what you're doing is converting categorical data into Boolean data. How do we convert categorical data into Boolean data? An example is shown on the right side, where you have a categorical, or classification, variable like pressure. Pressure has three unique values: high, medium, and low. How do you convert that? For indicator variables, you create two new columns: one column is called is-high, the other is called is-medium. Both of them are Boolean, and based on the value in the pressure column, you populate them with ones or zeros. So what you do here is: if the variable has n different classes, you create n minus 1 new variables. Pressure has three classes — high, medium, and low — so you create two new columns (n minus 1), is-high and is-medium, and populate them with ones or zeros. The way that low is going to be represented is that when both these columns are zero, it indicates low. That's why, if there are n classes, you create n minus 1 columns: the absence of a value in both of these columns indicates the third value. Indicator variables sometimes work better in predictions than categorical variables — for something like clustering, indicator variables work a lot better than the corresponding categorical counterparts. So it pays to try indicator variables too: if your regular classification variables are not working that well, go create indicator variables and see if they give you better results. That's another kind of data transformation you want to do to help you make better predictions. The next technique we're going to talk about is what we call centering and scaling. When you have two sets of data — two data columns — they may be in different value ranges. When you have data in different value ranges and you try to put them together in machine learning algorithms, they sometimes tend to skew the behavior of the machine learning algorithms. So the best thing to do is to take these values and standardize them using the centering and scaling methodology. When you do centering and scaling, the values are converted onto the same scale, but they retain their unique signal characteristics. For example, on the right side you have two columns, age and height. The age ranges somewhere between 11 and 65; the height column ranges somewhere between 152 and 195. They come in different ranges, and by centering and scaling, what you want to achieve is to convert them onto the same scale, which makes comparison of these two variables much easier. How do you do centering and scaling? You find the mean and standard deviation for both columns — for both variables. Here, age has a mean of 35 and a standard deviation of 16.3; height has a mean of 170 with its own standard deviation. You first find the mean and standard deviation. Then you center the values: you center a value by subtracting the mean from it. So if you have the value 35, you subtract the mean, which is also 35, and come up with a value of zero — that is centering. If you take 23, you subtract 35 from 23 and come up with minus 12 — that is centering. Then how do you scale? To scale, you divide this value by the standard deviation. So centering and scaling means you take every value, subtract the mean, and divide by the standard deviation to get the centered and scaled value. Here, for 23, you compute 23 minus 35, divided by 16.3, and that is your centered and scaled value: minus 0.74. If you apply this to all the values you see here, you get the centered and scaled age and the centered and scaled height in the third and fourth columns, and you see that they're pretty much in the same range: the scaled age ranges between minus 1.47 and plus 1.84, and the scaled height falls in a similar range, up to plus 1.92. Now, the important thing happening here is that the data retains its original shape.
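Concretely, the computation just described looks roughly like this in R; the age vector below is made up, and only the single worked value (23 with a mean of 35 and standard deviation of 16.3) comes from the slide:

    # Hypothetical age values
    age <- c(23, 35, 11, 47, 65, 29)

    # Centering and scaling by hand: subtract the mean, divide by the standard deviation
    centeredScaledAge <- (age - mean(age)) / sd(age)

    # The built-in scale() function does the same thing
    scale(age)

    # The single value worked out on the slide
    (23 - 35) / 16.3    # about -0.74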
Suppose you have a frequency distribution of age and you compare it with the frequency distribution of the centered data: you will see them having pretty much the same shape. The spread of the values will typically be the same, and if you look at the quartiles, the behavior of the quartiles will also be pretty much the same. So you retain the characteristics of the signals in the data: the highest value continues to be the highest value, the lowest value continues to be the lowest value, the middle value continues to be the middle value. You retain the signals in the data while adjusting the values to be on the same scale. And a lot of the time, when machine learning algorithms use distance measures to find affinity between data points, centering and scaling helps those distance measures a lot and gives you better results. So centering and scaling is a very popular technique that you apply as pre-processing on your data before feeding it into machine learning algorithms. That's it for centering and scaling, and that's it for data transformations. Thank you.

5. Text Pre Processing TF IDF: We now move on to the last part of the data transformations and processing, and this is called text pre-processing. We're trying to use more and more text in our data processing, machine learning, and predictive analytics, but text has a few characteristics — it has a lot of issues — which a regular data set that comes from a table, like an RDBMS table, does not have. So text has to undergo a lot of pre-processing before we can start using it for predictive analytics, and in this section we're going to see what kind of processing needs to be done on text to convert it into a format that lets it be used for predictive analytics. To begin with, let us try to understand how machine learning algorithms work. Machine learning algorithms can only work with numbers — continuous data — or with classes, that is, discrete or categorical data. They do not work with text; they don't understand text in any form. So the challenge is that all textual data has to be converted to an equivalent numerical or class-based representation before it can be used for any kind of machine learning. The use of text is becoming more and more common in predictive analytics, and that's why text pre-processing takes on a lot of importance, and that's what we want to cover here. Why is text pre-processing becoming more important? We do a lot of text-based classification: when emails are coming in, you want to classify them as spam or not spam — that is text classification. You have a number of news articles on the web, and you want to take these news articles and classify them based on which domain each article belongs to, like politics, sports, or economics. All these classifications require taking text data and classifying it, and to do that, the text has to be converted into a form that these machine learning algorithms are comfortable with. That is what we're going to see — the kinds of things you typically do here. To begin with, there are a number of text cleansing steps you do as pre-processing for any kind of textual data that comes in. What are the various cleansing steps? The first thing you do is remove punctuation — all the punctuation marks in the text.
Suppose it's a document, suppose it's an email: you take the email and remove all the punctuation in it. Then you remove white space — paragraph spaces, carriage returns, line feeds, and so on. You remove all the extra white space, keep just enough white space to differentiate the words, and get rid of everything else. You convert all the text to lower case: a given word can occur in many forms — the same word can be in upper case or lower case — and in that case you just want to look at which word occurs, without bothering about whether it is upper case or lower case, so you convert everything to lower case. You typically remove numbers — numbers like amounts or scores — you remove them from the text. You remove what are called stop words. What are stop words? Stop words are frequently occurring words that don't have a meaning in themselves — for example, "is," "on," "the." All these commonly used words are called stop words; they typically don't carry any meaning and they occur in all the documents. You're focused on words that occur uniquely in a document, not on commonly occurring words like "this," "and," "was," "then" — so you get all these stop words out of the document. Then you do what is called stemming. What is stemming? The same word has multiple forms, and which form is used depends on the grammatical requirements. You have a word like "fast": it can be used as fast, faster, fastest. You have a word like "real": you can say real or really. What you're trying to do is keep only the gist of the word: you keep the front portion of the word and chop off the remaining portion, so fast, faster, and fastest all become just "fast." That way you know it is the same word with the same meaning; you're just removing the grammatical variation based on tense, degree, and the like — that is called stemming. And then you remove any other commonly used words. Suppose you're analyzing emails coming into your company: typically all the emails will contain your own company's name, and you don't want that word showing up and disturbing all your machine learning work, so you want to kick those commonly used words out as well. The difference between this and stop words is that stop words are a set of commonly occurring words used globally, while this is specific to your use case — that's the only difference.
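One common way to run these cleansing steps in R is the tm package; here is a sketch under the assumption that tm (and SnowballC, for stemming) is installed — the course's own examples may use different tooling, and the extra word "acme" removed below is just a placeholder for a use-case-specific term like a company name:

    library(tm)   # text mining package; stemming also needs SnowballC installed

    docs <- c("This is a sampling of good words.",
              "He said again and again the same word after word.",
              "Words can't really hurt.")

    corpus <- Corpus(VectorSource(docs))
    corpus <- tm_map(corpus, content_transformer(tolower))       # lower-case everything
    corpus <- tm_map(corpus, removePunctuation)                  # strip punctuation
    corpus <- tm_map(corpus, removeNumbers)                      # strip numbers
    corpus <- tm_map(corpus, removeWords, stopwords("english"))  # drop stop words
    corpus <- tm_map(corpus, removeWords, c("acme"))             # drop use-case-specific words
    corpus <- tm_map(corpus, stripWhitespace)                    # collapse extra white space
    corpus <- tm_map(corpus, stemDocument)                       # stemming: words -> word

    inspect(corpus)   # the cleaned, stemmed documents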
So you do all this text processing first, and then you apply what is called TF-IDF, which we're going to see on the next slide. What is TF-IDF? TF-IDF is the most popular technique by which text is converted into table-based data. Text documents, as I mentioned, are used more and more in machine learning — news items for classification, email messages for spam detection, and also text-based search — and the text needs to be represented in terms of numbers and classes for machine learning algorithms to properly recognize and use it. How do you do that? You use this technique called term frequency–inverse document frequency. It's called TF-IDF: term frequency, inverse document frequency. What this technique does for you is convert text into a table, and the table basically contains rows and columns: every document becomes a row and every word becomes a column. What do we mean by a document here? A document doesn't necessarily mean a Word document; any piece of text is called a document in text processing. It may be just a sentence, a tweet, a text SMS message, an email message, or an entire news article — all of them are documents in text-processing parlance. So every document becomes a row, and every word that occurs in any of these documents becomes a column. Then each cell holds a value which is basically the strength of that word in that document: if a given word occurs more times in a specific document, the value in the cell is a lot higher; if it does not occur in that document at all, of course, it is zero. So the strength is represented in each of the cells. What you see is that it becomes a table very similar to a normal data table, where the rows represent the documents and the columns represent the words. Obviously, you are going to have a lot of columns, and it is precisely to reduce the number of columns that you do all that pre-processing — removing the commonly used words, stemming, and removing the other material; you're trying to reduce the data set by doing those techniques. So how does TF-IDF work? We start with the formulas for TF and IDF. First, the formula for term frequency. Term frequency is computed for every word, for every document: given a word and a document, the term frequency of that word in that document is equal to the number of times the word occurs in the document divided by the total number of words in the document. This is pretty simple and straightforward; it just tells you how often a word occurs — the more times a word occurs in a document, the higher this value will be — so it indicates the strength of that word in the document. Next comes inverse document frequency. Inverse document frequency gives you a measure of how unique a particular word is — basically, whether it occurs only in a few documents rather than in all the documents. Inverse document frequency is computed for a word across all documents; it is not done document by document. The way it is computed is: you take the log of the total number of documents divided by the number of documents where the word exists. So the inverse document frequency for a given word is equal to the log of the total number of documents divided by the number of documents containing that word. What happens when you compute inverse document frequency is that the fewer documents the word occurs in, the higher its inverse document frequency will be.
That's how this formula works: if a word occurs in all documents, its inverse document frequency will be zero; if a word occurs in only one document, its inverse document frequency will be really high. It helps to find the uniqueness of a word across the documents. TF-IDF is then nothing but multiplying the term frequency by the inverse document frequency — that is the final formula you're going to use. So what we're now going to do is take a set of documents and walk through all of this as an example use case. Let's start with a set of original documents. These are three documents; I just made the words and sentences up, so don't worry too much about them — imagine they could be emails, chats, or SMS messages. You have three documents here: "This is a sampling of good words." "He said again and again the same word after word." "Words can't really hurt." The first thing you do is all the cleansing we talked about before, and after you do the cleansing, this is the output you get. What do you see here? Words like "this," "is," and "of" have all been kicked out: the document "This is a sampling of good words" has become "sample good word." "Sampling" has become "sample" — that is stemming at work. "Good" is retained, and "words" has become "word" — again stemming, because you strip away singular versus plural and the past, present, and future forms and focus just on the core word. So this is how document one, document two, and document three end up after you do all the cleansing we talked about. Then we build what is called a document-term matrix, in which the documents are the rows and the terms — the words — are the columns. The first thing you do is create a count table: you just count the number of times each word occurs in each document, and that is how the counts come out. Then you compute the term frequency, which is this count divided by the total number of words in the document. In document one there are three words, and "sample" occurs once, so it is one divided by three, which is 0.33; you apply the same formula for all three documents, for all the words, and you end up with the term frequency table. Once you have the term frequency table, the next thing you do is compute the inverse document frequency using the formula: log of the total number of documents divided by the number of documents containing the word. Inverse document frequency is computed across all documents for each word, so this gives the inverse document frequency value for each word once you apply the formula. Then it is easy to compute TF-IDF: you take the term frequency table, multiply it by the inverse document frequency, and you end up with the table below. And you see something interesting: the word "word" occurs in all three documents, and it ends up with a score of zero, because we are not interested in words that occur in all the documents — they give us no form of differentiation.
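To mirror that hand computation, here is a small base-R sketch; the cleaned tokens below are my approximation of the slide's three documents, not an exact copy:

    # Approximate cleaned tokens for the three example documents
    docs <- list(doc1 = c("sample", "good", "word"),
                 doc2 = c("said", "again", "again", "same", "word", "word"),
                 doc3 = c("word", "hurt"))

    terms <- sort(unique(unlist(docs)))

    # Term frequency: count of the word in the document / total words in the document
    tf <- t(sapply(docs, function(d) table(factor(d, levels = terms)) / length(d)))

    # Inverse document frequency: log(total docs / docs containing the word)
    idf <- sapply(terms, function(w)
      log(length(docs) / sum(sapply(docs, function(d) w %in% d))))

    # TF-IDF: multiply each document's term frequencies by the IDF weights
    tfidf <- sweep(tf, 2, idf, "*")
    round(tfidf, 3)   # "word" scores 0 everywhere; "again" scores high in doc2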
Instead, we are focused on unique words that occur only in one class of documents and not in the others. A word like "again" occurs in only one document — document two — and occurs multiple times in that document, so it gets a pretty high score. The analogy is that when you want to differentiate documents, you go after the unique words in each document, so you find the unique words and score them. This is the final table you end up with, where the documents are in rows and the words are in columns, and the score indicates how good, or how unique, that particular word is for that document. This can then serve as input: it becomes like a regular data table, and it can be used as input to any of your predictive analytics or machine learning algorithms, very similar to other data. There is no difference between text handling and non-text handling at this point, because all you're dealing with here is numbers. So this is how we process text, and you will see some examples in the use cases that follow. I hope this has been helpful to you. Thank you.

6. R Examples for Data Engineering: Hi. In this lecture, we're going to be looking at some examples of data engineering — some data acquisition and some data cleansing and transformation. These are going to be a few basic examples; you will see a lot more examples when you go through the use cases later in the class. The first thing I'm going to do is set up my working directory; it is now set to this particular directory. In the first example, I'm going to show how we can acquire data from a database, and for this I'm using the library called RMySQL, which is going to connect to a MySQL database and get some data out of it. There are other libraries available for MySQL; as I said, R is pretty rich in terms of its library support. The first thing I do is create a connection to the database, and I create that connection using the command dbConnect — to MySQL — giving the user name, the password, the database name, and the host. This creates a connection, and I get a connector back. Then I do a dbSendQuery on this connection, and I'm going to execute this query: select name from the demo table, limit 10. So I'm going to be taking ten rows, and that is the record set I create out of it. Once I have this record set, I can do a fetch on it, and the fetch returns the data to me as a data frame. So I do this fetch on the record set, it returns the data to a data frame, and now I can look at the data and see what it looks like. It gives me just names — record one, record two, and so on. I just selected one column, name, pretty simple, and that gets pulled out of the database; it shows me three records that came out of the table — the table only has three records — so that's what came out. Pretty simple. Once that is done and you have what you need, you have to close the result set and clear the result. You have to do some housekeeping here to disconnect from the database and so on; there is a set of steps you follow to close the connection and disconnect. So this is a pretty basic data acquisition from a MySQL table.
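A reconstruction of that database example might look like the sketch below; the connection details, table name, and query are placeholders, not the actual values used in the lecture:

    library(RMySQL)   # MySQL driver for R (uses the DBI interface)

    # Connection details are placeholders
    con <- dbConnect(MySQL(), user = "demo_user", password = "demo_pass",
                     dbname = "demo_db", host = "localhost")

    rs   <- dbSendQuery(con, "SELECT name FROM demo_table LIMIT 10")
    data <- fetch(rs, n = 10)    # returns the result as a data frame
    data                         # inspect what came back

    # Housekeeping: clear the result set and close the connection
    dbClearResult(rs)
    dbDisconnect(con)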
Then we move on and ask: how do I download files from the web? Here is a CSV file that is sitting on the web; it is about flight data, and you can extract it using its URL. I do that by using download.file. I give a local file name, say downloadedFile.csv, which is just the file name, and I am going to download the data from this web URL and store it in that local file using the command download.file. So download.file(webURL, localFile) is going to download the web data and store it in this local file. I am just going to run it now, and you see it is trying to go to the URL and fetch the data; it fetches the data, and now you can see the CSV has been downloaded and stored on disk. Once it is downloaded and stored, I can read the file using read.csv and load it into this data set called airlines data, and then I can look at what the airlines data looks like. The str command gives me a lot of information about the flights: the start time, the end time and things like that. The next thing is how we do scraping of web pages. I use another library called RCurl. I load the library and then I do getURL on the page, so it is going to get the entire HTML page and store it in this variable called downPage; it is just HTML output. Now I do a cat of this particular variable, just print it out, and it is going to print what looks like a lot of junk here; as you can see, it just prints the HTML content of that particular page. Now you can take this content and start doing whatever web scraping you want to do on it. The last example is how you access RESTful data using REST APIs. Typically today a lot of the open source and cloud systems support REST, and they all pretty much follow the same mechanism. To do that, I am first going to load the three libraries required for this. To use a RESTful API, you first have to get a key for yourself, and for that you have to go and create an application; each site has its own process for how you create an application on the website and then get a key. In this case I am going to be connecting to GitHub and getting some data out of GitHub. Once I have the OAuth key for the application, and a secret for that key, I initialize this GitHub app variable, and that variable is then used to initiate my OAuth token. These are the steps I have to follow to connect to a REST API. Once I have the GitHub token set up, I can create a sample request with that token and connect, and once I connect and get the data, I can send a different sample request, and it gives me information about my own login. Then you can get some content out of it: what I am doing is taking the content out of this output, which has a blog as one of its entries, and I am just extracting that blog entry and printing it out here. So these are the steps you have to follow to get RESTful data. If you want to go to Twitter, Facebook or Salesforce, they all pretty much have the same kind of steps: you have to go register an application and get a secret key for yourself, and that secret key then has to be used in all your interactions with the cloud-based service.
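Here is a minimal sketch of these web acquisition steps, assuming base R plus the RCurl and httr packages. The URLs and file names are placeholders, and the REST call is simplified to an unauthenticated GitHub request instead of the full OAuth application registration walked through in the lecture.

    # 1. Download a CSV file from the web and read it into a data frame
    webUrl    <- "http://example.com/flights.csv"   # placeholder URL
    localFile <- "downloadedFile.csv"
    download.file(webUrl, destfile = localFile)
    airlinesData <- read.csv(localFile)
    str(airlinesData)

    # 2. Scrape a page: pull down the entire HTML of a page as one string
    library(RCurl)
    downPage <- getURL("http://example.com/")       # placeholder URL
    cat(substr(downPage, 1, 500))                   # print the first part of the HTML

    # 3. Call a REST API; here a simple unauthenticated GitHub request
    library(httr)
    req <- GET("https://api.github.com/users/octocat")
    content(req)$blog                               # extract the "blog" entry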
Going on to data cleansing: how do we do data cleansing? First, let us see how to find outliers. In this case, you see that I am creating a vector of student ages, and you see there are some negative values put in there purposefully. Once I do a quantile of the student ages, you will see that the minus one shows up immediately here. You know that age cannot be less than zero, so minus one is definitely an outlier. You can also find the same thing by doing a boxplot, and then you will see that it again shows there is an outlier, the minus one. The way to extract the outliers is simply by filtering the student age vector on the condition that student age is less than zero. When I run this, I see that all such records can be filtered out, and then you can apply some cleansing to them, remove the records, or whatever you want to do. Going on to examples of data transformations, I am going to be using the mtcars data set. I am copying it into this data frame called carData, and we use this in the other examples too. So what you are looking at here is carData; it has got mpg, cylinders, displacement, horsepower, a lot of things like that. The first example is how we convert a numeric column into a factor. You see the cylinder column showing up: cylinder is a numeric column in this case, and I am going to convert it into a factor by using the command as.factor. This is how we can convert a numeric into a categorical variable. The next thing I want to show you is how to do binning in R. Let us look, in this case, at the horsepower: I am going to convert the horsepower into a binned horsepower column. You see here the quantiles for the horsepower; as you can see, it ranges anywhere from about 50 to 335. So I am going to bin it using the cut command. I am going to take carData's horsepower and create bins 0 to 100, 100 to 200, 200 to 300 and 300 to 400. This is how you can do a cut, and you get a new column called binnedHP that has this information. Next, creating indicator variables: for the number of cylinders I am going to create indicator variables. There are three types of cylinders available, 4, 6 and 8; those are the unique values. So I am going to create two columns called isFourCylinder and isSixCylinder, and obviously, when both of them are zero, it means it is an eight-cylinder car. How do I create them? I create the new column isFourCylinder by using the ifelse function: if carData$cyl is equal to four, put the value of one, else put the value of zero; and similarly for isSixCylinder, if carData$cyl is equal to six, put the value of one, else put the value of zero. These two new columns, isFourCylinder and isSixCylinder, are created like this, and you execute them. Then comes centering and scaling. How do we do scaling? There is a command called scale in R that can be used for scaling. I am going to run scale on this data, only for mpg, and the output of that scaling I am storing into a new column here called scaledMPG. Now that is done.
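Here is a minimal sketch of these cleansing and transformation steps, using base R and the built-in mtcars data set; the student-age vector and the new column names are illustrative stand-ins for the ones used in the lecture.

    # Outlier detection on a vector of student ages with a bad value in it
    studentAge <- c(18, 22, 25, -1, 30, 21)
    quantile(studentAge)                 # the minimum of -1 stands out immediately
    boxplot(studentAge)                  # the boxplot also flags -1 as an outlier
    studentAge[studentAge < 0]           # filter out the impossible values

    # Data transformations on a copy of mtcars
    carData <- mtcars

    # 1. Convert a numeric column into a factor (categorical variable)
    carData$factorCyl <- as.factor(carData$cyl)

    # 2. Bin horsepower into ranges with cut()
    carData$binnedHP <- cut(carData$hp, breaks = c(0, 100, 200, 300, 400))

    # 3. Indicator variables for 4- and 6-cylinder cars (both zero means 8 cylinders)
    carData$isFourCylinder <- ifelse(carData$cyl == 4, 1, 0)
    carData$isSixCylinder  <- ifelse(carData$cyl == 6, 1, 0)

    # 4. Center and scale mpg into a new column
    carData$scaledMPG <- scale(carData$mpg)

    # Inspect the new columns, then drop an original column by assigning NULL
    str(carData)
    carData$mpg <- NULL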
So now, once all of this is done, you can look at the structure of carData, and you will see that the new columns have been added. There is the factor cylinder column that has been added, which is a factor with three levels. There is binnedHP, which we used for binning; it has four levels, 0 to 100, 100 to 200 and so on. Two more columns, isFourCylinder and isSixCylinder, have been added as indicator variables. And finally, the scaled mpg values are available in the scaledMPG column. Once you have created these new columns, you can possibly go ahead and delete the old columns, the original columns you no longer want, by assigning them the value NULL. For example, carData$mpg <- NULL will take that column out of the data frame; that is all it takes to remove a column. So these are examples of the various data engineering tasks we saw, and all of these things are done in R pretty simply and straightforwardly. You will see more examples in the use cases. Thank you.