Applied Data Science - 1 : Overview | Kumaran Ponnambalam | Skillshare
9 Lessons (1h 44m)
    • 1. About Applied Data Science Series (8:12)
    • 2. What is Data Science - One (11:51)
    • 3. What is Data Science - Two (10:44)
    • 4. What is Data Science - Three (12:55)
    • 5. What is Data Science - Four (9:31)
    • 6. Data Science Use Cases (7:47)
    • 7. Data Science Life Cycle - Setup (11:46)
    • 8. Data Science Life Cycle - Data Engg (11:57)
    • 9. Data Science Life Cycle - Analysis & Production (19:16)

About This Class

This class is part of the "Applied Data Science Series" on SkillShare presented by V2 Maestros. If you wish to go through the entire curriculum, please register for all the other courses and go through them in the sequence specified.

This course focuses on an overview of Data Science. It explains how Data Science works, from data elements through relationships and predictions, and then walks through the stages of a Data Science project.

Transcripts

1. About Applied Data Science Series: Hey, welcome to the course Applied Data Science with R. This is your instructor, Kumaran Ponnambalam, from V2 Maestros. Let's go through and understand what this course is all about. The goal of the course is to train students to become full-fledged data science practitioners: people who can execute an end-to-end data science project, right from acquiring data, all the way to transforming it, loading it into a final data destination, performing analytics on it, and finally achieving some business results from that analysis. What do you gain by taking this course? You understand the concepts of data science. You understand the various stages in the life cycle of a data science project. You develop proficiency in R and use R in all stages of analytics, right from exploratory data analytics to descriptive analytics to modeling to finally doing prediction using machine learning algorithms. You learn the various data engineering tools and techniques for acquiring, cleansing, and transforming data. You acquire knowledge about the different machine learning techniques and, most importantly, learn where and how you can use them. You become a full-fledged data science practitioner who can immediately contribute to real-life data science projects. Not to mention that you can take this knowledge to your interviews, so that you can get a position in data science. Theory versus practice: we wanted to touch upon this particular topic of theory versus practice. Data science principles, tools, and techniques emerge from different science and engineering disciplines.
They come from computer science, computer engineering, information theory, probability and statistics, artificial intelligence, and so on. The theoretical study of data science focuses on the scientific foundations and reasoning of the various machine learning algorithms. It focuses on trying to understand how these machine learning algorithms work in a deep sense, and on being able to develop your own algorithms and your own implementations of these algorithms to solve real-world problems. It dwells on a lot of equations, formula derivations, and reasoning. The applied part of data science, on the other hand, focuses on applying the tools, principles, and techniques in order to solve business problems. Here the focus is on using existing techniques, tools, and libraries, and on how you can take them, apply them to real-world problems, and come out with business results. It requires an adequate understanding of the concepts, knowledge of which tools and libraries are available, and the skill to use those tools and libraries to solve real-world problems. This course is focused on the practice of data science, and that's why it's called Applied Data Science. Inclination of the course: data science is a trans-disciplinary and complex subject. It has mainly three technical areas to focus on: math and statistics, machine learning, and programming. This course is oriented towards existing software professionals. It is heavily focused on programming and solution building. It has limited, as-required exposure to math and statistics. It covers an overview of machine learning concepts and gives you an adequate understanding of how these machine learning algorithms work, but the focus is on using existing tools to develop real-world solutions. In fact, 90-95% of the work that data scientists do in the real world is the practice of data science, not the theory of data science. This course also strives to keep things simple and easy to understand. We have deliberately kept it simple: we have either toned down the complex concepts or stayed away from them, so that the material is easy to understand for people at all levels of knowledge in the data science field. It is a kind of beginner's course, if I may say so. Course structure: the course goes through the concepts of data science to begin with: what exactly data science is and how it works. It looks at the life cycle of data science, with its various stages. It then goes into some basics of statistics that are required for doing data science. It then goes into R programming, with a lot of examples of how you would use R for the various stages in a data science project. Next is the data engineering part: what you typically do in data engineering and what the best practices are. Finally, there is the modeling and predictive analytics part, where we delve into the machine learning algorithms. We also look at end-to-end use cases for these machine learning algorithms, and there are some advanced topics that we touch upon. Finally, there is a resource bundle that comes as part of this course. The resource bundle contains all the data sets, data files, and sample code used in the examples taught as part of this course. So do download the resource bundle; it has all the data and all the code samples you need to experiment with the same things yourself.
Guidelines for students: the first thing to understand is that data science is a complex subject. It needs significant effort to understand. So if you are getting stuck, do review and re-review the videos and exercises, and do seek help from other books, online resources, and support forums. If you have queries and concerns, do send us a private message or post a discussion question, and we will be happy to respond as soon as possible. We are constantly looking to improve our courses, so any kind of feedback is welcome. Please do provide feedback through private messages or emails. At the end of the course, if you liked it, do leave a review. Reviews are helpful for other prospective students deciding to take this course. And do expect maximum discounts on future courses from V2 Maestros; we want to make that easy for our students. Relationship with the other V2 Maestros courses: our courses are focused on data-science-related topics, basically the technologies, processes, tools, and techniques of data science. We want to make our courses as self-sufficient as possible. What that means is that, if you are an existing V2 Maestros student, you may see some content and examples repeated across courses. Rather than saying at some point in a course, "Go look at this topic in another course; register for that course and learn about it there," we would rather keep everything in the same course, unless the other concept is big enough to deserve a separate course of its own. So you might see some content that is repeated across courses. Finally, we hope this course helps you to advance your career. Best of luck, happy learning, and do keep in touch. Thank you. 2. What is Data Science - One: Hello.
This is your instructor, Kumaran, and in this section we are going to see what data science is. Data science is something we have been hearing a lot about, but what exactly does data science consist of? What is it really about? We are going to see two things in this session: the first is what data is, and the second is what learning is. We are going to see some definitions of the things that constitute data science. Some of the things you are going to see in this session may seem obvious, things you have been used to, but it is good to take a second look at the definitions of each of them, because they mean a lot in data science; they in fact form the very foundation of data science. So let us go through all these definitions. First, what is data science? Data science is the skill of extracting knowledge from data. You have something called data, which is raw; you look at the data and extract knowledge. Knowledge could also be thought of as information, insight, or signal; there are different terms in use, but it is basically something useful that you extract from data. Then you use this knowledge to predict the unknown. You learn something about the past from data, and you use that information to predict what is going to happen in the future. That is what data science is all about. One of its goals is to improve business outcomes with the power of data. You can do prediction, but what is the use of it? The use is that you want to apply data science to improve business outcomes, and you are going to improve those outcomes using data. Data science employs technologies and theories drawn from a broad range of areas; it is not restricted to a single domain.
You have mathematics, statistics, information technology, database technologies, and programming languages in there; we actually use a host of different techniques, theories, and areas when it comes to data science. And what is a data scientist? A data scientist is a practitioner of data science: somebody who uses the theories, technologies, and skills of data science to produce better business outcomes. A data scientist typically has, or should have, expertise in data engineering, statistics, and the relevant business domain. A data scientist typically investigates complex business problems and uses data to provide solutions. The most important thing here is "uses data to provide solutions": data is the driver for a data scientist. So let us go into some of the definitions around data. What exactly are we talking about when we say data? What are the various things you need to learn about when talking about data? We are going to go through a set of definitions here; again, they may be obvious to you, but it is good to take a second look at all of them. The first thing we are going to talk about is what is called an entity. An entity is a thing that exists, which we research and predict about in data science. An entity is a thing, an object, something that exists in the real world that we are going to be working on. In a data science problem, you have a set of entities that you care about. You do some research on them, you get data about these entities, and then you work on that data to make your predictions. Entities always have a business context: there is a business context, which is the business problem you are trying to solve, in which the entity exists. An example of an entity: a customer of a business is an entity.
A customer is an entity, probably the most popular entity, about whom we do a lot of research and predictions. A patient at a hospital is another entity. Now, the customer of a business and the patient of a hospital might actually be the same person, but they have different business contexts. Different business contexts mean that, for the same person, we care about different information; the person does different things as a customer than they would do as a patient. Entities can be non-living things too, like a car. Cars are non-living things about which you also collect information and predict things. Moving on to the next item: characteristics. What are characteristics? Every entity has a set of characteristics. These are properties of the entity, information about the entity. We might call them static information, because they are bound to the entity: name, telephone number, age; those are all characteristics of an entity. Properties also have a business context: in different business contexts, you care about different characteristics of the same entity or person. For example, for a customer, the characteristics you would care about are age, income group, gender, and education. For a patient, you would again care about age, so the characteristic called age repeats, but now you have a different set of characteristics specific to being a patient, like blood pressure, weight, and family history. So again, there is a business context, a business requirement, which drives what characteristics you care about for an entity. Cars again: when you look at cars, you talk about make, model, year, the type of engine (like four-cylinder or six-cylinder), and the VIN number of the car. These are all examples of characteristics.
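The point that one real-world person is two different entities in two different business contexts can be sketched in code. This is a hypothetical illustration in Python (the course itself works in R); all the names and values here are invented:

```python
# Hypothetical sketch: the same real-world person modeled as two
# different entities, each carrying only the characteristics that
# its business context cares about.

def as_customer(person):
    # Retail context: age, income group, gender, education matter.
    return {"age": person["age"],
            "income_group": person["income_group"],
            "gender": person["gender"],
            "education": person["education"]}

def as_patient(person):
    # Hospital context: age repeats, plus clinical characteristics.
    return {"age": person["age"],
            "blood_pressure": person["blood_pressure"],
            "weight": person["weight"],
            "family_history": person["family_history"]}

john = {"age": 42, "income_group": "middle", "gender": "M",
        "education": "masters", "blood_pressure": 128,
        "weight": 81, "family_history": ["diabetes"]}

print(sorted(as_customer(john)))  # characteristics in the retail context
print(sorted(as_patient(john)))   # characteristics in the hospital context
```

Note how "age" appears in both views while the other characteristics differ: the business context, not the person, decides which properties matter.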
You might also call them properties. The next item to care about is: what is environment? Environment refers to the ecosystem in which the entity exists or functions. An entity does not exist in a vacuum; there is an environment in which it exists. In that environment are other entities: other entities of the same type, and other entities of different types. A patient entity exists in a hospital along with other entities of the same type, the other patients. There may also be other entity types, like doctors and nurses, and entities that are non-living things, like ambulances; a system used to monitor patients might be an entity too. All these entities exist in an environment, and multiple entities share the same environment. The environment affects an entity's behavior; that is the most important thing. The same entity might behave differently in different environments, or even in the same environment under different conditions experienced in that environment. Examples of environment: for a customer, the country, the city, the weather, the locality the customer resides in. For a patient, again, maybe the city, the climate, and the hospital where the patient currently is. For a car, whether the car is being used mainly for city driving or highway driving becomes its environment, and the climate too: cars perform differently under different weather conditions, like hot weather or snow conditions; cars show different behavior. In all these cases, what you see is that the environment affects how the entity behaves. Now comes the event. What is an event? An event is a significant business activity in which the entity participates. Entities do not simply sit there; they do something.
Or somebody does something to the entity; that is what you call an event, some business activity. An event, again, happens within an environment. An entity like a patient goes to the hospital, and in the hospital the entity is treated, or a set of tests is administered to the patient, and some results come out of those tests; all of these are events. Examples of events might be a customer browsing a website, a customer making a store visit, or a customer getting a sales call from some company trying to sell something. All of these are events. In the case of patients, it is things like doctor visits or a blood test. For a car, the smog test that it undergoes, or a comparison test: if you go to any of the car-related websites, you see that they do comparison tests. All of these are events in which an entity participates. Behavior: an event happens and an entity participates in it, but what is behavior? What the entity does in a given event is the entity's behavior. An entity might show different behaviors in different environments and different situations. For example, in the case of a customer, what the customer talks about in a phone call is the customer's behavior. The clicks during a website visit, that is, which links the customer clicks while browsing the website, are another type of behavior. The response the customer gives to a sales offer, whether the customer is happy or not with it: all of these are different behaviors of the customer. For patients, the complaints they raise to a doctor or a nurse, the patient falling asleep, the patient showing any kind of symptoms.
All of these are behaviors of the patient. And for cars, things like acceleration and stopping distances represent forms of behavior of those entities. These are all things you see in the real world: entities, events, and behavior. Now comes the introduction to data. 3. What is Data Science - Two: Introduction to data. There is something called an outcome. So what is an outcome? The result of an activity deemed significant by the business. You have events; in the events there are entities, and entities behave differently in different events. But all these events typically have some form of outcome that is important to the business. An outcome is the result of an activity, the result of a business activity, for example. Outcomes are values, and outcome values can be of different types. They can be boolean, like yes/no: somebody took a test and either passed or failed; boolean is basically a yes-or-no type of data. The value can be continuous, a numeric value: somebody took a blood pressure test, and the blood pressure reading is a continuous value that can range anywhere. Or it can be some form of classification, classes: somebody wrote a review of a movie, and the rating they gave might be a class like excellent, very good, good, fair; that is a classification type. So outcomes can be of any of these different types. Examples of outcomes: in the case of a customer, whether the customer made a purchase is a boolean, and the sale value, how much they bought for, is a continuous value ("continuous" meaning the value can fall anywhere in a range, from 0 to 100 or 2000). In the case of a patient, the outcome can be a blood pressure reading, which is a continuous outcome, or the type of diabetes the patient is diagnosed with, a class like Type A or Type B diabetes, which is a classification. In the case of cars, the smog level is a classification; the level is classified like A, B, C. The stopping distance is continuous: you do a test on a car, which is an event, and in that event you measure the stopping distance, how much distance it takes to come to a full stop when you jam the brakes; that is a continuous outcome. The smog test pass or fail is a boolean outcome. The type of car, say a sports car or a family sedan, is a kind of classification. So these are different outcomes that happen as a result of some event. Outcomes are important in data science because, typically, what you are trying to predict in data science is outcomes in the future; you will see more about that later. Now comes what is called an observation. What is an observation? An observation is a measurement of an event. You measure something about an event deemed significant by the business: you measure the important things in an event that matter to the business. It captures information about the entities involved. A given event may have multiple entities involved, and an observation captures the characteristics of those entities, their behaviors, information about the environment in which the event happens, and the outcomes. An observation is information about all these things that happen in an event. You basically go and collect all this information and record it in some form. Observations are typically kept in what is called the system of record. Anywhere you go, you see that people are recording information; in the old days they recorded it in journals, logbooks, and the like. Now everything is automated and computerized.
There are scanners capturing this information automatically, or somebody enters it into a computer; formally, this is called the system of record. Examples of observations: in the case of customers, a phone call record (also called a CDR in the telephone domain); a transaction, like a buying transaction, where somebody goes to the store, buys something, goes to the point-of-sale counter, and the transaction is recorded there; or an email offer, where an email comes to you offering some product at some price and you end up buying something. All of these are observations. If you look at a patient: a doctor visit record, a test result, data captured from a monitoring device. All of these are different types of observations. And finally, the car: a service record is an observation; the car goes in for service, and at the end the findings of the mechanic are logged in the service record. A smog test result is an observation. All of these observations are captured in some form, recorded, and stored. So finally we get to data. What is a data set? A data set is a collection of observations. Every observation records an event about a set of entities, and a collection of observations for a business becomes a data set. Each observation in a data set is typically a record. We should call it a logical record rather than a physical record: a given observation might be recorded in multiple forms, through multiple user interfaces, maybe with master-detail relationships; all that is fine, but here we are talking about a logical record that represents one observation. Typically, observations have an ID, like a transaction ID, a test ID, or a serial number. So a data set is a collection of observations, and each record has a set of attributes that point to characteristics, behaviors, and outcomes.
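The vocabulary above can be made concrete with a toy data set: each record is one observation of an event, carries an ID, and its attributes point to a characteristic, a behavior, and the three outcome types described earlier (boolean, continuous, class). A hypothetical Python sketch with invented field names and values (the course itself works in R):

```python
# Hypothetical data set: each dict is one observation, a logical
# record of one event, identified by a transaction ID.
data_set = [
    # "age" is a characteristic, "channel" a behavior; "sale" is a
    # boolean outcome, "amount" a continuous outcome, and "segment"
    # a class outcome.
    {"txn_id": 1001, "age": 34, "channel": "web",
     "sale": True,  "amount": 59.90, "segment": "regular"},
    {"txn_id": 1002, "age": 51, "channel": "store",
     "sale": False, "amount": 0.0,   "segment": "premium"},
    {"txn_id": 1003, "age": 29, "channel": "web",
     "sale": True,  "amount": 120.0, "segment": "premium"},
]

# Working on a data set means iterating over its observations.
total_revenue = sum(r["amount"] for r in data_set)
print(len(data_set), total_revenue)
```

This is exactly the spreadsheet picture from the lecture: each dict is a row, each key is a column.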
If you look at an Excel worksheet, you will see that typically every row represents a record, an observation. The Excel worksheet itself is a data set; every row is an observation, and every column is an attribute that points to one of the characteristics of the entities, their behaviors, or the outcomes. Data can be structured, like database records and spreadsheets. It can be unstructured too: Twitter feeds are an example of unstructured data, as are newspaper articles. And some data is called semi-structured, like email. As a data scientist, you typically deal with all these types of data: structured data, unstructured data, and semi-structured data. Data sets are what a data scientist collects and works on; the bread and butter for a data scientist is data, data, and more data. Data is collected as data sets, stored, worked upon, and predictions are made based on those data sets. So the data set is the core of data science. What is structured data? The example you see on the right side is an example of structured data: data where the attributes are labeled and distinctly visible. You see that every attribute in that particular UI is labeled separately; everything is labeled and distinctly visible, whether it is being displayed in the UI or stored in the database. That is what you call structured data: data that is labeled and stored separately. It is easily searchable and easily queryable, because the attributes are labeled separately; even in the storage in a database, they are stored in different columns, so it is straightforward to write an SQL statement to query this data. It can, of course, be stored easily in tables, like database tables or Excel worksheets; it is easy to store structured data in general. Unstructured data, on the other hand, is not labeled.
Unstructured data is continuous text, like what you see on the right side: a text description about a Mazda3 car. It is continuous text in which the attributes are not distinctly labeled, but they are present within the data. The highlighted items you see are different attributes: "compact" is the type of the car, "hatchback" is the body style, "six-speed transmission" is the transmission the car has. All of them are present inside the data, but not distinctly labeled. This is what you call unstructured data: continuous text with no format, where the data is hidden or embedded inside the text. Querying, of course, is not going to be easy. And by querying we are talking not about visual inspection, but about writing computer programs to extract the information; that is not going to be easy. Now comes the third form, which is semi-structured data. The example you see here is an email. Part of the data in an email is structured, and part of it is unstructured. In an email, some attributes are distinctly labeled, like the From address, To, CC, and Subject; these are labeled and available as separate pieces of information. Others may be hidden within the body text. So you have both structured and unstructured data mixed up in the case of semi-structured data. XML documents are another example of semi-structured data: some information is available in attributes, and some information is in the text portions of the document. In summary, what have we seen with respect to data? We have seen entities, characteristics, environment, events, behavior, outcomes, observations, and finally the data set. These are the key foundations on which data science builds.
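As the lecture says, getting attributes out of unstructured text means writing a program rather than reading a label. Here is a minimal, hypothetical Python sketch (the car description, the keyword patterns, and the email are all invented, and real text extraction is considerably harder than this):

```python
import re

# Unstructured data: attributes are embedded in prose, not labeled.
text = "A compact hatchback with a six-speed transmission and good mileage."

# Naive keyword patterns, for illustration only.
body_style = re.search(r"\b(hatchback|sedan|coupe)\b", text)
transmission = re.search(r"\b(\w+-speed)\b", text)
print(body_style.group(1))    # hatchback
print(transmission.group(1))  # six-speed

# Semi-structured data (an email) mixes both: the headers are
# labeled attributes, while the body is free text.
email = "From: a@example.com\nSubject: Offer\n\nBuy the new hatchback today!"
headers, _, body = email.partition("\n\n")
fields = dict(line.split(": ", 1) for line in headers.splitlines())
print(fields["Subject"])      # Offer
```

Contrast this with structured data, where the same attributes would simply sit in named columns and one SQL statement would retrieve them.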
So it is good for you to know and understand each of them. This completes this part of the section; we will continue with more in the next part of the presentation. Thanks. 4. What is Data Science - Three: Hello. This is your instructor Kumaran here, continuing on what data science is. We are going to be talking about what learning is in data science parlance: learning, which is discovering knowledge from data. The first thing we want to look at is: what is a relationship? Relationships, again, form one of the foundations of data science, and when we talk about relationships, we talk about relationships between attributes. Attributes in a data set exhibit relationships. That is, you have a set of observations in a data set, and the attributes you see in these observations exhibit what are called relationships. Relationships model the real world and have a logical explanation. When we say they model the real world, we mean that a relationship is something that is actually happening in the real world; it is not something out of the blue that you see only in the data set. Whatever relationship the data shows is something that exists in the real world. For example, age and blood pressure levels: the relationship between them is that as age goes up, the propensity for higher blood pressure keeps going up. The higher your age, the higher your blood pressure levels might be. And there is always a logical explanation associated with it; the reason given in the medical field is that with more weight you have more fat and more clogged arteries, which leads to higher blood pressure. So there is something happening in the real world, and there is a logical explanation for it. Explanation is a very important part of data science.
When you see a relationship, you should be able to explain why it is happening, because that is how you can say whether the relationship is real or merely incidental, something that happened by chance. For attributes A and B, the relationship can take a few forms. "When A occurs, B also occurs": you have two attributes, A and B, and whenever A occurs, B also occurs. Say, whenever a sale of a cell phone happens, a sale of a cell phone cover also happens; things that happen together. "When A occurs, B does not occur" is the negative form: A and B are kind of mutually exclusive, and mutual exclusiveness is again a kind of relationship. The third one is "when A goes up, B also goes up"; that is another type of relationship. And "when A increases, B decreases" is another negative relationship. So when, for two attributes, the values seen in those attributes show any of these kinds of patterns, you have a relationship. Not all entities will exhibit a relationship; there will always be some entities where you see a relationship and some that do not exhibit any relationship at all. The goal of learning is to look for attributes that together exhibit some form of relationship. Relationships can involve multiple attributes too, like "when A is present and B increases, C will decrease"; multiple attributes together might exhibit some form of relationship. So this is kind of an overview of what relationships are. Now let us see some examples of relationships. Take a customer: as age goes up, spending capacity goes up, so there is a relationship between age and revenue from the customer. There is a logical explanation: as age goes up, the person is possibly making more money, so their spending capacity is also higher.
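The age-versus-spending relationship just described is what a correlation coefficient measures: a value near +1 means "when A goes up, B goes up", near -1 means the opposite, and near 0 means no linear relationship. A small Python sketch with invented numbers, computing the Pearson correlation by hand so only the standard library is needed (the course itself does this kind of analysis in R):

```python
import math

# Invented toy data: age and yearly spend for six customers.
age   = [22, 28, 35, 41, 50, 58]
spend = [180, 240, 310, 330, 420, 460]

def pearson(xs, ys):
    # Pearson correlation: covariance divided by the product of
    # the standard deviations (up to a common factor of n).
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

r = pearson(age, spend)
print(round(r, 3))  # close to +1: as age goes up, spend goes up
```

In R the same number comes from the built-in `cor()` function; the hand computation just shows there is nothing mysterious behind the term.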
Now, when we talk about relationships in data science, these are not very concrete relationships. It is not literally like a formula that holds all the time. Those kinds of 100% relationships are good to have, but what we see here are overall, general kinds of relationships: when age goes up, spending capacity goes up; not all customers, not all older customers, are going to spend more, but most of them do. That is what we talk about as a relationship. Another one: urban customers buy more Internet bandwidth. There is a relationship between the location of the customer and the bandwidth purchased by the customer, possibly because they are doing more browsing. Look at the patient again, and there are a lot of relationships you can see. Older patients have a higher prevalence of diabetes: a relationship between age and disease level. Overweight patients typically have higher cholesterol levels: a relationship between weight and health. Again, there are scientific reasons why these things happen. Take a car: there is a relationship between the number of cylinders and the mileage it gives; the more cylinders, the less the mileage, because more fuel burning happens when there are more cylinders. Sports cars have higher insurance rates; this is not a physical relationship, but you will see it as a business relationship: whenever a car is of the sports type, its insurance rates are typically higher, so there is a relationship between the type of the car and the insurance rates. Some more things about relationships: one thing you want to care about when you see a relationship between two attributes is whether the relationship is consistent or incidental. Relationships can also be described as patterns, patterns that you see in data, patterns of behavior. Sometimes a pattern of behavior is consistent, because it happens all the time.
When a pattern happens consistently, you can actually predict such behavior in the future. But there can also be incidental patterns — incidental relationships that just happened by chance. There may be no logical explanation for an incidental behavior or an incidental pattern. So whenever you see a relationship, it is very important for you to make sure the relationship is consistent rather than incidental. Consistent relationships are what you need for data science. Relationships are also called correlations — that is the technical term you will see being used. A correlation between two entities, or two attributes, is what you see when A goes up and B goes up, or A goes up and B goes down. That is what we call correlation; correlation is the mathematical term used when we talk about relationships. Finally, you will hear people talk about signals and noise when it comes to data science. Signals are nothing but consistent patterns — consistent relationships you see in data. Noise is the incidental patterns — incidental relationships you see in data. So if you have been hearing these terms, signal and noise, they are nothing but relationships: relationships that are meaningful versus relationships that happened by chance, which are not predictable, which are just incidental. That is the difference between signals and noise. Now comes: what is learning? We keep talking about machine learning, deep learning, and all forms of learning. So what exactly is learning? Learning implies learning about relationships. That is the most important thing you want to know about data science: data science uses machine learning, and learning here means you are trying to learn about the relationships between these attributes. That is what learning is all about. It involves taking a domain — like a hospital domain, or a business domain.
It means understanding the entities and the attributes that represent the domain, collecting data about all of them, and understanding the relationships between these attributes. This understanding of relationships between attributes is what learning is all about. Models are the outcome of learning: what you do after you learn about something is build a model of it. Now, this learning we are talking about happens all the time inside the human brain. We are constantly collecting data; the human brain is continuously learning about things and continuously building models, and we use these models all the time without even our knowledge — subconsciously, we are continuously learning. What we are talking about here, in terms of data science, is just this kind of learning made into a proper process, where the learning happens outside the human brain, in machines. That is the small difference between learning that happens inside the human brain and learning that happens in machines: there is more of a process to it, there is more data in it, and there is more rigor in doing it. So what is a model? A model is a simplified, approximated representation of a real-world phenomenon. There is a real-world phenomenon that is happening, and when you build a model, you are trying to build a simplified version of it. You are not trying to put too many things into the model; you are trying to take the most important things about the real-world phenomenon and build a simplified, approximated representation of it. You could go and build as complex a model as you want, but usually when people build models, they want them to be simplified, so the model brings out all the important factors you care about and ignores everything you don't want to bother about. So: a simplified, approximated representation of a real-world phenomenon.
A model captures the key attributes of the entities and their relationships. One example of a model could be a mathematical model. A mathematical model is something that represents relationships as an equation: you can write an equation that represents the relationship between the attributes. For example — this is a formula I got from somewhere on the web — here is a formula for how you can determine blood pressure. It is an equation: blood pressure equals 56, plus the age of the person times 0.8, plus the weight of the person times 0.14, plus the LDL level of the person times 0.09. What you see here is that you are trying to compute blood pressure — blood pressure is one attribute — from three other attributes: age, weight, and LDL. Now, this is an approximate computation of blood pressure. It is never going to give you the accurate value, but it could be approximately close to the real-world value. So here is a formula that represents a mathematical model of how blood pressure can be related to three other attributes: weight, age, and LDL levels. There could be another kind of model, like a decision tree model. It is a logical model where you ask a series of questions — questions about various attributes — and based on the answers, come up with an outcome. Say you want to predict something like whether a customer will buy a music CD. For that, you can come up with a decision model like this: if the age of the customer is less than 25, and the gender of the customer is male, they will buy a Beyoncé CD. So you use two attributes, gender and age, and based on them you try to predict the outcome — whether the customer will buy the Beyoncé CD or not. That is another type of model. The accuracy of your models depends on the strength of the relationships between the attributes.
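The two kinds of models just described can be written out directly in code. Note that the blood pressure coefficients are the lecture's illustrative numbers, not a medically validated formula, and the CD decision rule is the lecture's toy example:

```python
def predict_blood_pressure(age, weight, ldl):
    # Mathematical model: an equation relating one attribute to three others.
    # Coefficients are the lecture's illustrative values -- an approximation,
    # not a clinically accurate formula.
    return 56 + 0.8 * age + 0.14 * weight + 0.09 * ldl

def will_buy_beyonce_cd(age, gender):
    # Decision-tree-style model: a series of questions on the attributes
    # leading to a yes/no outcome.
    return age < 25 and gender == "male"

print(predict_blood_pressure(age=40, weight=70, ldl=130))
print(will_buy_beyonce_cd(age=22, gender="male"))
```

The first function is the "mathematical model" form; the second is the "logical model" form. Both are simplified, approximated representations — exactly what the definition of a model calls for.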
Sometimes the relationship between the attributes is very strong, such that you can predict with near-certainty: if I see this, I am definitely sure this is going to be the outcome. Sometimes the accuracy is not that high, and in that case you might want to combine multiple attributes and see if you can increase the accuracy level. Sometimes there is no relationship at all. So accuracy can come in any kind of varying scale. But a model, overall, is a simplified, approximated representation of something that is happening in the real world. 5. What is Data Science - Four: Once you have a model, what you can do is prediction. A model can be used to predict unknown attributes. A simple example: we already saw that there is a formula, blood pressure equals 56, plus age times 0.8, plus weight times 0.14, plus LDL times 0.09. You have here a formula that relates four attributes: blood pressure, age, weight, and LDL. What this means is that if you know three of these four attributes, you can predict the fourth one. That is what we call prediction. When you say "compute," you are guaranteeing 100% accuracy, because you know the formula; when you say "predict," you are mostly approximating. So you have four attributes here: if I know any three of them, I can rearrange this formula to compute whichever attribute I want. If I know three of them, I can predict the fourth one. This is what you call prediction — prediction from a model. A simple equation can be considered a simple prediction algorithm, and relationships can be a lot more complex, leading to more complex models and prediction algorithms.
What you see here is that equations are a very simplified model of something really simple; as the problem gets more complex, it leads to more complex learning, more complex models, and more complex prediction algorithms. So that is what learning is all about: data sets, relationships, modeling, and prediction. Now let's talk about predictors and outcomes. Whenever we talk about data and data science, you will hear about predictors and outcomes. So what are they? Outcomes are attributes that you want to predict. Whichever attributes you want to predict are called outcomes — in the earlier formula, we wanted to predict blood pressure, so it is called the outcome. Predictors are attributes that you use to predict the outcome. You have a set of attributes; the one you want to predict is the outcome, and everything else that you use to predict the outcome are called predictors. You might have ten attributes in your data set; one of them may be your outcome, and three others may be your predictors. Not all attributes have a relationship with the outcome attribute — only those that have a good relationship with the outcome variable will become predictors. And obviously, predictors and outcomes will show some form of relationship, because that is the only way you can predict outcomes from predictors. So learning is all about building models that can be used to predict outcomes — the output — using the predictors, which are the input. Here are some examples, going back to the same three cases. In the case of a customer, the predictors are age, income range, and location, and the outcome can be: is the customer going to buy your product or not? For a patient, the predictors can be age, blood pressure, and weight, and the outcome can be: is the patient diabetic or not?
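The predictor/outcome split described above is just a partition of a record's attributes. As a minimal sketch — field names here are hypothetical, echoing the patient example:

```python
def split_predictors_outcome(record, outcome_field):
    # The outcome is the attribute we want to predict; every other attribute
    # we use to predict it is a predictor. In practice only attributes with a
    # real relationship to the outcome would be kept as predictors.
    outcome = record[outcome_field]
    predictors = {k: v for k, v in record.items() if k != outcome_field}
    return predictors, outcome

patient = {"age": 52, "blood_pressure": 130, "weight": 82, "diabetic": True}
predictors, outcome = split_predictors_outcome(patient, "diabetic")
print(sorted(predictors), outcome)
```

During learning you know both sides of this split; during prediction you have only the predictors and the model fills in the outcome.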
An example for a car might be: the predictors are things like the number of cylinders and the acceleration, and you might want to predict whether the car is going to be a sports car or a family car. So these are what we call predictors and outcomes. One of the most important things to understand here is humans versus machines. Humans understand relationships and predict all the time. It happens in the human brain without us even being conscious of it: we keep collecting data, we keep understanding relationships, we keep building models in our heads, and we keep predicting all the time. Any time you say, "I think this is going to happen," it means you are using a model that you built inside your head to predict something. If you say, "I think it may happen," it is a weak model; if you say, "I am 100% sure this is going to happen," it is a very strong model. But a human being can only handle a finite amount of data. Take shopkeepers, for example — you have seen them. They know their best customers, their long-time customers. They know what their customers like and what their customers want, and whenever a customer comes in, they usually address them by name and immediately know what the customer wants, even without the customer asking for it. But a human being can only handle a finite amount of data, so they can know the preferences of 100 customers — not of, say, 10 million of them. What happens then? That is when machines, or computers, come into play. We want to store all this customer information in computers and let the computers learn about the preferences and help you.
Machines come into play when the number of entities and the data volume are large or huge, and that is where machine learning comes in. When you work with computers to collect all the data, do all the learning, build all the models, and do the prediction — that is when it becomes machine learning, predictive analytics, and data science. So what is data science? Entities, relationships, modeling, and prediction. It is all about picking a problem in a specified domain; understanding the problem domain — the entities, the attributes, the behaviors, and the events; collecting data sets that represent the entities — you go and collect all the data that you need; and then discovering relationships from the data. That is what you call learning, and when computers do this, it is called machine learning. Machine learning is not something out of this world: it is all about machines learning about certain things — discovering relationships from the data — and then building models that represent the relationships. The model can be a mathematical model, it can be a decision tree model, and there can be other types of complex models too. What we do to build models is use past data. With past data, you know the outcomes: you know the values of the predictors, and you know the values of the outcomes. Using those values, you establish relationships, and from the relationships you build models. Once you build a model, you can then start predicting — predicting for current or future data, where you know the predictors but you don't know the outcomes. So you use the past to learn and build models, and then you predict the future, where you don't know the outcomes. Here is an example of how a shopping website would use data science.
In this example, the problem is to predict whether a shopper will buy a smartphone, and what to do about it. You get all the past purchase history of all the shoppers. You collect shopper characteristics like age, gender, and income level. You collect seasonal information about when they buy — what kinds of things they buy during winter versus summer, versus Halloween, versus a regular Wednesday. You collect all the relevant data that is there. Then you build models — models that capture relationships about what goes up or what comes down when the customer buys and when the customer does not buy. You basically try to relate the other attributes to the outcome: you look at the values of the other attributes when customers are buying, and the values of those attributes when customers are not buying. Say you see that when the value of the attribute age is greater than 25, the customer buys, and when the value of age is less than 25, the customer does not buy — there comes a relationship. You try to use this relationship to build a model. And then you try to predict: whenever you see a customer whose age is greater than 25 — yes, this person is going to buy. So you make predictions. When a new shopper is browsing, you use the model and predict in real time whether the shopper is going to buy a product or not. And what will you do with the prediction? Now that you know whether a customer is likely to buy or not, you can take some actions, like offering chat help. These days, whenever you go to a website, a small pop-up comes up asking, "Do you want to talk to a live agent?" Live agents are costly — they are human beings, and you pay them a lot of money — so you only want to offer live-agent help to shoppers who you think are going to buy your product. So you can make an intelligent decision as to which shoppers you want to offer a live agent.
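The learn-then-predict loop just described can be sketched in a few lines. This is deliberately crude — a one-attribute "model" that learns the age threshold from hypothetical past data, echoing the lecture's age-greater-than-25 pattern; a real system would use many attributes and a proper learning algorithm:

```python
def learn_age_threshold(past_shoppers):
    # "Learning": find the age threshold that best separates buyers from
    # non-buyers in the historical data (a crude one-attribute model).
    best_threshold, best_correct = None, -1
    for t in sorted({s["age"] for s in past_shoppers}):
        correct = sum((s["age"] > t) == s["bought"] for s in past_shoppers)
        if correct > best_correct:
            best_threshold, best_correct = t, correct
    return best_threshold

def predict_will_buy(threshold, new_shopper_age):
    # "Prediction": apply the learned model to a new shopper whose outcome
    # is unknown -- e.g. to decide whether to offer live-agent help.
    return new_shopper_age > threshold

# Hypothetical past data: predictors (age) plus known outcomes (bought)
past = [
    {"age": 19, "bought": False}, {"age": 22, "bought": False},
    {"age": 24, "bought": False}, {"age": 27, "bought": True},
    {"age": 31, "bought": True},  {"age": 40, "bought": True},
]
threshold = learn_age_threshold(past)
print(threshold, predict_will_buy(threshold, 29))
```

The past data (known outcomes) produces the model; new shoppers (unknown outcomes) get predictions, which then drive the business action.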
Based on this prediction, that is an example of how data science would work for you. Thank you. 6. Data Science Use Cases: Hello, this is your instructor Kumaran here, and we are going to be looking at some data science use cases — let's see how the world is benefiting from data science. The use of data science is growing exponentially; it has been growing exponentially for the past few years and is spreading across multiple domains — business, science, finance, and even personal life. Recent advances in computing power — in terms of hardware, in terms of software (a lot of open-source software has come into the world, like the whole Hadoop ecosystem), and in predictive algorithms — the combination of all of these has made it very cost-effective to apply data science in commercial use these days. Okay, let's see some examples of using data science. Let's start with finance — finance is all about making money and saving money. Fraud reduction: credit card fraud reduction is a very important application where data science is being used. What happens in credit card fraud is that fraud exhibits certain patterns. Whenever you look at transactions that are related to credit card fraud, they exhibit some pattern — some kind of relationship between the various entities and their attributes. It is these patterns that are captured in the historical data, and they are used to build models of fraudulent transactions. The historical data has good transactions and fraud transactions, and these are used to build models of what a fraudulent transaction is going to look like. Whenever a new transaction occurs, that transaction is immediately evaluated, using computers and the model, to come up with what is called a fraud score.
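As a toy sketch of such a fraud-scoring model — the risk signals, point weights, and threshold below are all hypothetical; real systems learn them from historical fraud data rather than hand-coding them:

```python
def fraud_score(txn):
    # Hypothetical scoring model on a 0-100 scale: each risky pattern
    # seen in historical fraud adds points to the score.
    score = 0
    if txn["amount"] > 1000:
        score += 40  # unusually large purchase
    if txn["country"] != txn["home_country"]:
        score += 35  # transaction far from the cardholder's home
    if txn["minutes_since_last_txn"] < 5:
        score += 25  # rapid-fire transactions
    return score

def flag_if_fraudulent(txn, threshold=60):
    # When the score crosses the threshold, flag the transaction for action
    # (call the cardholder, block the card, etc.).
    return fraud_score(txn) >= threshold

suspicious = {"amount": 2500, "country": "XX", "home_country": "US",
              "minutes_since_last_txn": 2}
print(fraud_score(suspicious), flag_if_fraudulent(suspicious))
```

Scoring each new transaction against a learned model and acting when a threshold is crossed is the core of the workflow described here.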
A fraud score basically tells you whether the particular transaction is a fraudulent transaction or not. It is a score, say from 1 to 100, and whenever the score crosses a certain threshold, the transaction is immediately flagged as a possible fraudulent transaction. Then some action is taken: calls are made to the credit card owner asking whether they are indeed making these transactions, or sometimes the credit card is immediately blocked from further transactions until the verification is made. So there are actions taken like this. Fraud detection is a very important application of data science in the financial world. The second application is retailing. You will see that whenever you go to a website, do your shopping, and put some items in your shopping cart, you immediately see some recommendations coming up — in the case of Amazon, you would immediately see a recommendation like "items frequently bought together." How do they make these recommendations? Again, items exhibit patterns in how they are bought together — cell phones and accessories, books — some items are frequently bought together; they exhibit those affinity patterns. Based on that, they build what are called affinity scores between the items: between any pair of items there is an affinity score assigned, and the higher the affinity score, the more frequently those items have been bought together. What happens next? Whenever one of those items is bought by a new shopper, items with high affinity scores to that item are immediately recommended. So you use the affinity scores to recommend more items to the shopper, with the idea that if past shoppers have bought them together, possibly that is what the next shopper is also going to do — and that helps you do more cross-selling and upselling.
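A minimal sketch of affinity scoring, assuming a tiny hypothetical set of shopping baskets: count how often each pair of items is bought together, then recommend the highest-scoring partners of whatever the shopper just bought.

```python
from collections import Counter
from itertools import combinations

def affinity_scores(baskets):
    # Affinity score for a pair of items = how many baskets contained both.
    pair_counts = Counter()
    for basket in baskets:
        for a, b in combinations(sorted(set(basket)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts

def recommend(item, scores, top_n=2):
    # Recommend the items with the highest affinity to the one just bought.
    related = Counter()
    for (a, b), count in scores.items():
        if a == item:
            related[b] = count
        elif b == item:
            related[a] = count
    return [other for other, _ in related.most_common(top_n)]

# Hypothetical purchase history
baskets = [
    ["phone", "phone_cover"], ["phone", "phone_cover", "charger"],
    ["phone", "charger"], ["book", "bookmark"],
]
scores = affinity_scores(baskets)
print(recommend("phone", scores))
```

Real recommenders normalize these counts (e.g. by item popularity), but the bought-together counting shown here is the essence of the affinity idea described above.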
Next, contact centers. We have contact centers, which have traditionally been used for customer service. The use of contact centers has grown today to do more sales — a lot more high-end sales and support — and they have also started using data science to improve their performance. How do they do that? They have started scoring callers as well as agents. Past interactions are used to score callers based on their value — how much business value they carry, what type of caller they are, how much business they have already done with the company. They also assign scores to agents based on their abilities: a high-selling agent versus a low-selling agent, or agents with the ability to handle a specific type of problem — say, agents who can handle a specific product, or a specific type of issue, like a network issue versus a phone issue, things like that. What they then do is try to match the right callers with the right agents based on these scores, the idea being that once you match the right callers with the right agents, you optimize your business outcomes. Then there are call recordings. Whenever you are talking to a contact center, they always say "your call may be recorded for quality purposes," and what they do with these call recordings is apply machine learning algorithms on the recordings to understand the quality and outcome of the call, and use that for future enhancements. And finally, we look at health care. Predicting disease outbreaks has been a trending application of late: you can predict disease outbreaks by looking at what people are searching on Google and what they are tweeting on Twitter. Data is collected from public domains like Google searches and Twitter feeds, and this data is always linked with location information.
Whenever you are googling something or tweeting something, the location from where you are doing it is collected. This information — what you are tweeting about, or what you are googling about, along with the location — is then used to come up with patterns: are people making these kinds of queries about a specific disease from a specific locality? The moment you start seeing a pattern — people tweeting more about a specific disease in a specific location — there is a possibility that an outbreak is happening there. This kind of information is now being used to start predicting disease outbreaks. What is the good thing about predicting disease outbreaks? The government can react in a more proactive manner. If you see that a disease is starting to break out in a specific locality, the government can immediately start marshaling its resources — sending preventive care, more medicines, more doctors, things like that. They can get organized a couple of days in advance and prevent further outbreaks in the same area. So data science is helping to prevent, or at least manage, these disease outbreaks. These are some of the interesting applications of data science — just a few popular ones; in fact, there is a lot more happening out there, and I hope you will be able to do some more reading and understand all of them in the near future. Thank you. 7. Data Science Life Cycle - Setup: Hello, this is your instructor Kumaran here, and in this section we are going to see what a data science project's life cycle is all about. We are going to talk about data science projects, what their activities are, and how they are sequenced. Let's start with some introductory notes. Data science efforts are typically executed as projects.
When any company or business wants to do anything with data science, they typically create projects — just as when people want to build software, they create software projects: for a project, they set some objective, some goal, and then go about executing it. Similar to that, data science efforts are also executed as projects. One thing to note here is that data science projects should be considered research projects. They are not like build-and-operate projects; they do not have everything set in stone so that you can just go and execute and be done with it. These are research projects: there is a lot of thinking involved, and a lot of rework involved, until you achieve the objective. So they should be considered research projects, not build-and-operate kinds of software projects. The projects do have a start and an end like any other project, and they do have phases and activities, with transitions happening between phases and activities. Data science projects involve a lot of back and forth between the phases, so it is not a straight-line waterfall model — it is more like an iterative model, if you want to relate it to software development. In this section we will talk about what the data science project phases and activities are, the importance of each of these activities, how they transition from one to another, and some of the best practices. Here is an overview of the data science project phases and activities. You will see that there are four broad categories, or stages, in a data science project: the setup phase, the data engineering phase, the analytics phase, and the production phase. In the setup phase, you are prepping up the team with what they have to do. The data engineering phase is all about getting data, cleaning up data, and working with data to get it into good shape.
Then you can do the third stage, which is the analytics stage. Analytics is all about exploring the data and getting some meaningful information out of it — it is all about learning and predicting. Once you finish the analytics phase and come up with some sort of recommendations, you can then go to the production stage, where you actually build out data products that do everything you just did in an automated, repeatable fashion, and keep producing the outcomes that you desire. Now, on to the first phase, which is the setup phase. The first thing you want to do in any data science project is what you call goal setting. Every data science project will, and should, have a goal. If somebody wants to do a data science project which is like, "okay, let us look at the data and see what we can get out of it," that project is doomed to failure. A data science project should have a specific goal for the team to go after. The team's efforts will all be focused on achieving this goal, and the activities will also be based on what you want to achieve. Remember: projects without goals are like cars without a driver. If somebody comes and tells you, "okay, we are going to do a data science project — just look at the data and see what we can come up with," that project is going nowhere. That has been the experience of many, many people who have tried to do data science projects. Some examples of goal setting are: predict which customers will churn in the next three months; take the tweets we are getting about our company and group them based on the sentiment of the tweets; or identify patients who have a possibility of getting a heart attack in the next three months. So you are going to predict customer churn, you are going to predict the sentiment of tweets, or you are going to predict which patients are going to have heart attacks.
The goals can be anything like this, but the most important thing is to have a well-defined goal before you start on your project. The second very important thing you want to focus on is understanding the problem domain. In software projects, I would say that understanding the business domain is a good thing; in the case of a data science project, it is necessary for all the team members to have some basic understanding of what the business problem domain is all about. When we say problem domain, we are talking about the business basics: whether you are in the finance field, the CRM field, or the medical field, understand some basics about the business — what is that business all about, how does that business make money, what are the business processes involved, what are the workflows, and what are some of the key performance metrics in the business. In larger data science teams there is always somebody called a domain expert. Domain expertise is a critical part of a data science team, so large teams typically have a domain expert who may not be a technical person — not a statistics person, not a programming person, just somebody who knows the business. You keep them on the team in order to help you with understanding the problem domain. Why is this important? Machines only know numbers and strings; they only do garbage in, garbage out. They need humans to associate any meaning with these numbers and strings. Machines don't understand business; human beings understand business. In data science, it is important for you to understand and validate anything that the data is going to come up with, and that can only be done by humans — and for humans to do that, they need an understanding of the problem.
Knowledge of the domain helps the team to understand the entities involved, the relationships, and the patterns. Any kind of knowledge discovery you do, you need to validate, and the validation can only be done if you know what the problem domain is all about. This understanding of the problem domain helps you validate all the assumptions. More importantly, it helps you identify errors. The data has some errors — how do you know? For example, suppose you are looking at a data set and the age of a person shows up as 600 years. The moment you look at it, you know that is a wrong number, because nobody is 600 years old. But you can only do that because you know the domain — age is a very commonly used term; everybody understands what it is about. What about something like cholesterol level? How do you know what is a valid cholesterol level and what is not? If somebody has an LDL level of 1000, is that possible? Is it a normal number, a high number, an invalid number? You can only tell if you know the domain, and that is why a domain expert is required. After understanding the domain, the next phase is understanding the data associated with the domain. We have seen enough about data in some of the other sections, so back to it: business processes and workflows generate data — a lot of data, some captured, some not. Wherever the data is captured, it takes multiple forms: there is application data entered through various data-entry applications, there are reports, there are visualizations, there is automated data coming from sensor data feeds, there are web clicks you get in a browser — every click is also a data feed — there are point-of-sale transactions being recorded, and there are social media data feeds too. All of these are business data generated through multiple sources, and they are stored in multiple systems.
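The error-spotting described above — knowing that an age of 600 or an LDL of 1000 is invalid — is exactly the kind of domain knowledge that can be encoded as validation rules. A minimal sketch, with hypothetical ranges a domain expert would supply:

```python
# Hypothetical valid ranges supplied by a domain expert. Machines only see
# numbers; these rules are how the human's domain knowledge gets encoded.
VALID_RANGES = {
    "age": (0, 120),           # nobody is 600 years old
    "cholesterol": (50, 500),  # illustrative bounds only, not medical advice
}

def find_invalid_values(records):
    # Return (record_index, field, value) for every out-of-range value.
    errors = []
    for i, record in enumerate(records):
        for field, (low, high) in VALID_RANGES.items():
            if field in record and not (low <= record[field] <= high):
                errors.append((i, field, record[field]))
    return errors

patients = [{"age": 45, "cholesterol": 180},
            {"age": 600, "cholesterol": 1000}]
print(find_invalid_values(patients))
```

Without the domain expert's input there is nothing to put in `VALID_RANGES` — which is the point being made here.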
Some are on the corporate network, some are outside it — there is data everywhere which you might want to use. Data, of course, can be structured, unstructured, or semi-structured — again, we have seen this before — and data has different origins, is stored in different silos, and may have a lot of logical relationships. Relationships, of course, are the key to any kind of data management. Understanding the data — understanding what data you have — is a very important thing for a data scientist. What is it that you want to understand about the data? You want to understand the source of the data. How reliable is the data? Is it machine-generated, or is it entered by human beings? Is there a possibility of somebody manipulating the data entry — putting in wrong data and getting away with it? Because how good the data you are going to use for your analysis is determines how good your predictions are going to be. Data has to be valid; you have to make sure the data has not been manipulated by somebody for some other reason. You need to understand what kinds of processing and transformation steps are performed on the data. More importantly: has some data been discarded by somebody during processing because they thought it was not important? Is some duplicate data making its way into the processing? Are you losing some data because of summarization? All those things you need to understand about the data. How is the data stored — in enterprise databases, clouds, news feeds? How is the data synchronized between these different sources? When somebody enters data in place A, it might also be going into place B — how do they synchronize with each other? And what relationships exist within the data?
Like the foreign-key relationships between the data: the ID here should match the ID there, and stuff like that. Ordering of creation — when there is ordering, it's something like: an agent first goes and enters something in system A, then he goes and creates a record in system B, then he does something in system C. This is where the understanding of your business process helps you to understand how the data is being created and in what order it is being created. Also, understanding the data helps the team to identify possible sources of your predictive patterns — where you are getting these patterns from. You always validate, whenever you see a pattern, whether it is valid or not. So it is important for you to understand where the data is coming from and how it was created, and to understand your patterns themselves. Sometimes the patterns might be created merely because of the way the data has been created. These things are a little complex to explain at this point, but an understanding of the data in general is a good thing to have for a data scientist. 8. Data Science Life Cycle - Data Engg: The next phase we're going to talk about is the data engineering phase, where you do your data setup. Data engineering is all the dirty work you have to do to get the data from various sources into the form you want it to be. The data is out there all over the place, unmanaged. You've got to get your act together, get all the data together, beat it into shape, and put it all into one single, logical, nice destination where you can then do any of your further analysis. The first stage in data engineering is data acquisition — acquiring data. Your job is to acquire data from different data sources. They may be enterprise databases — maybe sitting in an Oracle database or MySQL databases. It might have to be done through cloud APIs; there are a lot of applications on the cloud.
They give you APIs on the cloud — like Salesforce, for example — and you've got to go and get the data through the APIs. Data might be coming through scanner feeds or sensor feeds, like barcode scanners. It may be coming through social media; you might have to download social media data, like Twitter and Facebook. All of them are sources of data. Each of them presents a different kind of use case and a different kind of challenge for you. Sometimes the data feeds might be coming in real time, in bulk, or at intervals — all of that creates different problems for you. One of the most important things about data acquisition is sanity checking: making sure that you have got all the data you need and that no data is lost in the transport layer. So sanity checking is an important part of data acquisition. It is the most cumbersome and time-consuming step to set up. Why is it cumbersome and time-consuming to set up — not to acquire, to set up? Because when you have all these data sources, what comes with them is things like security. There are people who own these databases; there are security policies involved; there are sharing policies involved. So you're going to spend a lot of time establishing connections to the machines involved, and to the human beings who control the machines, and this can be really frustrating. As the data scientist, if you are really close — if you are already in the IT department and the data sources are also in the IT department — possibly you don't have a lot of issues. But maybe you are not in the IT department; you are maybe a consultant, or you're in a different department, and your data is sitting in enterprise databases, or it is sitting in the cloud.
Then it becomes all the more cumbersome to talk to all the people involved, explain to them what data you need, why you need the data, and in what format you need it, and get them to share the data. Going through all the organizational red tape involves a lot of time and effort. So this is a very cumbersome, frustrating step — this is the dirty work you have to do. Data cleansing: once you get the data, you have to cleanse it. Why do you have to cleanse it? Because data have different degrees of cleanliness and completeness. Not all the data you're going to get is clean and complete. Structured data from corporate applications — data sitting in a database — is usually clean and complete, so you don't have to worry about it: already clean, already complete, already in the format you want it to be. No problems. But data that you're getting from the Internet, from social media, from voice transcripts — all of that might need significant cleansing. It can be dirty, incomplete, and in all kinds of formats. If you look at any of the Twitter feeds, you know they're not complete sentences; there are a lot of abbreviations and all kinds of junk sitting there, and it all needs to be cleaned up. Next, examine missing data — that's another big, important point. What about missing data? You might be missing attributes for certain columns, or maybe missing values for certain attributes. How are you going to handle them? Are you going to give them a value? If you put something like the mean there, for example, your machine learning algorithm doesn't understand "missing" — it's going to think the mean is the actual value. If you put zero as the value for some number, your algorithm is going to take it: okay, zero is some value. How do you tell your machine learning algorithm that zero means "not available" here, whereas elsewhere it is a real value? It's not an easy thing to do.
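The zero-versus-missing problem can be shown in a few lines. This is a minimal sketch with made-up blood pressure readings, where `None` marks values that were never recorded:

```python
# Why a sentinel like 0 distorts learning: the mean shifts badly when
# "missing" is encoded as a real value instead of being excluded.
readings = [120, 130, None, 125, None]  # None = never recorded

# Treat missing as 0 (the naive sentinel approach)
mean_with_zero_sentinel = sum(r if r is not None else 0 for r in readings) / len(readings)

# Exclude missing values entirely
observed = [r for r in readings if r is not None]
mean_excluding_missing = sum(observed) / len(observed)

print(mean_with_zero_sentinel)   # 75.0  -- dragged down by fake zeros
print(mean_excluding_missing)    # 125.0 -- the honest estimate
```

Whether to drop, impute, or flag missing values is the key decision being discussed here; the sketch only shows why the decision matters.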
A lot of times you have to put a replacement value in there, and that may affect your machine learning algorithms. So missing data handling is a very key decision to make here. Cleansing examples are things like normalizing date formats. Sometimes dates arrive as mm/dd, sometimes as dd/mm — all kinds of formats — and you want to normalize and standardize them to one format before you can start using them. Standardizing on decimal places: sometimes the data comes as 1.23, sometimes it uses the exponential format for a number, and all of that needs to be standardized. And once again, the classic one: is it last name, first name, or first name, last name? How are names represented in the data? Whatever formats they arrive in, all of them need to be standardized — that is one part of the cleansing process. More importantly, if you're getting text feeds from somewhere, you have to do a lot of cleansing for text; that's a whole domain in itself — text mining — and what you do for text cleansing is its own subject. All that work needs to be done before you can start using the data for any other analysis. Data transformation: data, after cleansing, might have to be transformed into a different format or a different shape before you start using it. The reason for data transformation is to extract information from the data while discarding unnecessary baggage. What the unnecessary baggage is, is again determined by the goal — by what you're searching the data for. If you don't need some data, or don't need some levels of detail, you can summarize it and discard all the unnecessary baggage. Typical aids are merge processing and summarization: you try to summarize at logical activity levels, and transformations help cut down the data size and minimize the processing you need to do.
Why do you want to do some transformation? You want to get the data into a shape that you can understand better — like collapsing a number of records into one logical record that represents the entire thing that happened. An example you might want to see here: a visitor comes to a website and clicks on a number of the pages in the website. You might want to summarize all of those clicks into one single record, if that's all the level of detail you need. You might want to do some language translation between multiple languages. If there is medical sensor data coming in — let's say there is a sensor that is capturing your blood pressure every second and sending you a blood pressure reading — maybe you want to summarize it by interval: you can take a 30-minute interval and then summarize and say, in this 30-minute interval, what is the maximum reading, what is the minimum reading, what is the average reading, things like that. It again depends upon your use case what kind of transformation you want to do. After transformation comes data enrichment. Enrichment is about adding some additional attributes to your data that improve the quality of information. You want to add some more information to your data that can make your analysis a lot better. What kind of information can you add? For example, you can add demographic information from a customer database to a point-of-sale transaction record. The point-of-sale transaction record is just going to have the customer name, the customer credit card number, and what products they bought. Now you can get the customer's demographic information from a third party — like the customer's age, marital status, education, income level — and you can attach that to this data.
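That kind of enrichment join might look like the following sketch. The customer IDs and field names are made up for the illustration:

```python
# Hypothetical enrichment: attach third-party demographics to
# point-of-sale records by customer id.
demographics = {
    "C001": {"age": 34, "income_band": "medium"},
    "C002": {"age": 61, "income_band": "high"},
}

pos_records = [
    {"customer_id": "C001", "product": "milk", "amount": 3.49},
    {"customer_id": "C002", "product": "bread", "amount": 2.99},
]

def enrich(records, lookup):
    """Merge demographic attributes into each transaction record."""
    out = []
    for rec in records:
        extra = lookup.get(rec["customer_id"], {})  # empty dict if no match
        out.append({**rec, **extra})
    return out

for row in enrich(pos_records, demographics):
    print(row)
```

In practice this is a database or dataframe join rather than a hand-rolled loop, but the idea is the same: one wide record per transaction, carrying the extra attributes.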
Once you have attached that to the data, what you can do is some analysis as to which products people buy. Take people who buy, let's say, milk: are the people who buy milk male or female? Are they over 25 or below 25? You can do all those kinds of analysis once you enrich your data with additional information. Similarly, you can build logical groupings of patients by past medical history: you can attach a patient's past medical history to his current visit, and then you can look at how people with different kinds of medical history fare, or what kinds of things are done to them. So enriching data is a very important step. Adding more data — more meaningful data — gives you better insights into the data you have. And once you're done with all of that, you're going to persist your data: you save your data in some neat, sensible fashion. Processed data is stored in a reliable, retrievable data sink. You want to process all your data and put it in a nice, reliable, retrievable data sink, with all the relevant information captured in a single logical record as much as possible. You have data coming from multiple different sources, and the best thing for you to do is to organize it as a logical record — one single long record that contains all the information you need. You shouldn't be doing a lot of foreign-key kinds of things; you rather want to denormalize the data and put it all in the same record, so that further querying and analysis is made really easy for you. An example would be a retail sales transaction: you can take the point-of-sale data, add the customer demographic information to it, and add the item characteristics to it — for the item that is purchased, you can say the type of item.
Is it dairy, is it organic, stuff like that. And you can also add sales associate performance information to it, so that you can then do analysis of a sales associate's performance based on the product being sold, based on the customer demographics, and stuff like that. You want to put all of them together in a single straight record and store them — that is the step called data persistence. And finally, scaling and query performance are pretty important factors. Of course, this gets into the data architecture domain, where the data architect's job is to design your data sink in such a way that it can hold all the data that you have, has got reasonable scaling, and has got good query performance — all of that to help you in the next step, which is the analysis step. Data, of course, can be stored as flat files or in traditional SQL databases, and then today you have all the big data technologies like Hadoop, and Hadoop-based databases like HBase, where you can store your data. This completes the second phase of a data science project. 9. Data Science Life Cycle - Analysis & Production: Hello, this is your instructor Kumaran here, continuing on the data science life cycle. The third phase is analytics, where you're trying to learn from the data and do your predictions. The first step in analysis is what is called exploratory data analysis, or EDA in short form — a very famous short form in data science. What are you going to do in EDA? You want to understand individual attributes and their patterns. Say you take age as an attribute: you want to understand things like the range, minimum value, maximum value, the frequency distribution, the mean, things like that.
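Those per-attribute summaries take only a few lines. The course itself works in R, but here is a minimal Python illustration over a made-up age column:

```python
import statistics
from collections import Counter

ages = [23, 35, 35, 41, 29, 35, 50, 23]  # a toy "age" attribute

print("min:", min(ages), "max:", max(ages))        # range
print("mean:", statistics.mean(ages))              # central tendency
print("frequency:", Counter(ages).most_common(3))  # distribution of values
```

Real EDA adds histograms and summary tables, but every tool starts from exactly these quantities.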
The next thing you want to do is understand relationships between the attributes: what is the relationship between age and your buying pattern, the relationship between income and gender, things like that. How does a change in one affect the other? In other words, you're learning all about relationships in this phase. You're trying to do some graphs, trying to do some analysis, and understand more about what you see in the data. You then do reasoning: is the behavior explainable? Whatever relationships or patterns you're seeing in the data — is there an explanation for why it is so, or not? If you don't find an explanation, then possibly there is an error, or maybe it's a new pattern. That's something you want to discuss and then figure out. You also look at outliers and then decide what you want to do with them — whether to include them or exclude them. It depends upon what the outlier values are, and on a use-case-by-use-case basis you decide what you want to do with outliers. Possible errors in processing can often only be found with exploratory data analysis — that is a very good use of the process. Let's take the example again of patient weights, which we discussed a few slides back. The moment you see a weight of, say, 600, you immediately know that there is something wrong with it; there is a possible error. There are also what you call outliers: suppose there are a couple of patients who are 70 or 75 years old, and everybody else is 40 or less than 40 years old. Maybe you want to decide to eliminate those two records as outliers — that is one possible outlier processing you might want to do. And of course you want to understand the relationships between the patient weight and the diabetes level, the cholesterol level, the family history, and stuff like that. Finally, you validate your findings with the domain experts. You say: hey, this is what I'm seeing in the data — does that gel with what you already know, or is it something new? You want to talk to them and understand how things are.
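One simple way to flag candidates like that 600 weight is a standard-deviation rule. This is a sketch of one common heuristic, not the only method, and the threshold `k` is a judgment call:

```python
import statistics

def flag_outliers(values, k=2.0):
    """Flag values more than k standard deviations from the mean
    (one simple rule among many; k is use-case dependent)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > k * sd]

weights = [62, 70, 68, 75, 600, 71]  # one impossible patient weight
print(flag_outliers(weights, k=2.0))
```

Whether a flagged value is an error to remove or a rare-but-real case to keep is exactly the decision the domain expert helps you make.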
Does that gel with what you already know about something new, you want to talk to them and understand how things are. The next step is inferential analysis. What do you do in inferential analysis is look for signals. You know, you look for patterns, you look for consistency in those back and you look for correlations. You look for reasoning. This is kind of an overlap with explore a treaty down. Unless is, this is more in depth and more focused and more methodical that you do here in French in analysis, then you check and see if the patterns are consistent and reproducible. What you mean by consistent is do you see the same part on month after month after month? You see that that's a As the rate increases, you see that cholesterol levels increase does it happen for your patients? Every month, every month you get a new set of patients and you keep seeing the same pattern. Do you see the same pattern across? Let's see cities across countries across different races, all that as a part of inferential analysis. And then you do some statistical test to see that the findings that you see with the data said that you have. Can that be extrapolated to the India population like you have data from San Francisco can the same, and this is be with results, be the same if you extrapolate it so they and their us out of the entire world are they going to be different? It's all those just you do as a part of infringing analysis again. And let's take an example of patient. Wait, was the diabetes. You do the all this in French In analysis like you might take fast data from one state that California do the analysis and then see how California compares with New York at the patterns of seeing R Calif. Are the then you look out races. Look at Asian Americans to Asian Americans in California showed the same pattern location Americans in the New York. Our donation American showed the same pattern as African Americans. Worse is other people. 
So you do all these kinds of segmentation, and you do all this profiling during inferential analysis, and you come up with and validate all your findings during this process. Once you've done inferential analysis, the next stage is modeling. This is where all your machine learning algorithms kick into play: you apply your machine learning algorithms to build models. What you do in model building is typically try to build multiple models using different algorithms and different data sets. These are all techniques in machine learning: there are techniques for how you can segment your data sets into multiple subsets and then use them to build models and test models, and for how different algorithms can be used — this is what the domain of machine learning is all about. If you take a course in machine learning, it's essentially this one line explored through the entire course. You, of course, have to test your models for accuracy — again, there are methods for how you do that in machine learning. Finally, you identify your best-performing models. When we say best performing, we talk about accuracy, we talk about the response time, and we talk about the resources used, so you again have to make some trade-offs as to what your best-performing model is. Let's say one model gives you 80% accuracy and takes one minute to run, and there is another model that gives you 85% accuracy, but it takes one hour to run. Which one is more important to you? Are you hung up on the 85, or is it okay for you to have 80% accuracy but a reasonable response time? So you have to look at all three things — the accuracy, the response time, and the resources used (the computing power required to build models) — and then say what is going to be your best model. The model that you build at the end could be as simple as a decision tree or an equation; it can be as complex as a neural net.
It depends on the problem and the data in question. But at the end of the process, you do have a model that you select based on the different algorithms and the different tryouts that you have run while building models. Then you're going to go and do your predictions using new data. Again, there are ways you can test the predictions and test your models — a part of the machine learning courses that you will see. You have to keep validating your model accuracy. You don't just build one model, test it once, and walk away from it; you're going to be trying different models — sometimes even combinations of different models — and then seeing which one gives you the best accuracy possible. You're going to be trying multiple trials and variations in this process of trial and error. That's why I called it a research project at the beginning: you're going to do a lot of research here, try different things, and see which one works best for your specific project. Response time and resources matter especially when you have to make predictions in real time — like when a web shopper has just come to your website and is browsing through it, making clicks, and you want to predict in real time whether the shopper is going to buy or not. Those decisions have to be made in real time — within a second, with as minimal resources as possible. So you pick your algorithms based on that. You want to keep measuring improvements as you keep working on different combinations. Prediction algorithms have two parts: one is the model-building part, and the second is the prediction part. You have to look at both of them and see how an algorithm does at each. Sometimes a prediction algorithm takes longer to build the model but may be very fast doing the prediction part — different things to weigh there.
So again, you have to keep measuring how all your algorithms perform, keep comparing them, and then see which is the best one you want to choose. You might even have simulations. A simulation may be as simple as a mathematical simulation, or you might build software that can simulate certain use cases. A simulation is used to validate what your algorithm is saying: that in this given situation, this should be the outcome. You need a simulator that can simulate that environment — it can simulate what the entity is doing in the environment — and then you see if the outcome you're predicting is what you actually get. Simulations are complex pieces of software; sometimes people do build them to validate the predictions. Once you do all this model building and prediction, the last step is to come up with a set of recommendations. What do you do here? At the end of the project, recommendations need to be provided to the project owners: based on what you have done, what are the algorithms to be used, and what are the expected benefits? All of this you put together in a nice presentation and present to the product owners. And here comes the catch: a data science project may have no recommendations to make if the data does not exhibit any explainable patterns. We have been talking about data science as being all about learning from relationships. If the data that you have does not exhibit any pattern — any pattern between the outcome and any other variable — if the outcome is not predictable from the data that you have, there is nothing you can predict. As simple as that. That does not mean that the data science project is a failure. You can have a project which says: let's look at our customer database and see if we can predict customer churn. At the end of the project,
you can come up and say: based on the data that we have, we cannot predict customer churn. That doesn't mean the data science project is a failure. A data science project will only work if the data has patterns, so it is not the fault of the data scientist if your data does not have any patterns. Of course, it is the fault of the data scientist if the data has patterns and the data scientist fails to find them. But if the data does not have any patterns, it is not the fault of the data scientist — this is another important thing to note. Sometimes unexpected patterns are discovered that may lead to other benefits. You might be looking at the data with a specific goal in mind — like predicting customer churn — but you may see that, okay, I see some nice patterns here, and these patterns might be used to predict something else. You might be using that data to predict upsells, for example. So a data science project might have these side benefits. You might say: okay, I see this nice pattern here; maybe we've got to dig in deeper. Then you go create another data science project for that and continue down that path. A data science project can also come up with these side benefits, and in fact a lot of them can come up during the process, once you start looking at the data. And, of course, you finally make a presentation on the recommendations to the stakeholders. The last thing you want to note here is the iterations that are required. Even though the steps are listed here as if they're supposed to be done in sequence, you do keep going back and forth between the steps, based on intermediate or end-of-analysis feedback. After you do all your analysis, you share it with the domain experts; you share it with the other project stakeholders.
They might come back with some feedback that may force you to go back and redo the analysis, based on new light that has been shed on the data that you have. People may have different objectives, different perspectives, that might give you new triggers to go back and look at the data. That is common in data science: the product team responds to the findings in the data, and then it can take you down multiple analysis paths. Then comes the final phase: the production phase, or the productionizing phase. Here you implement continuous processes that utilize all the work you've done in the earlier phases and start doing something on a continuous basis. Here comes what is called building data products. What is a data product? A data product is an application that works on data, gets something out of the data, and uses it to achieve some objective. As simple as that — that's a data product. Once your data modeling and prediction algorithms are firmed up — you know exactly what you have to do — then you go build a data product. What is building a data product? It is basically productionizing: making the code production quality, no longer R&D code, with all the error checking in place, with all the management and monitoring in place, so that it will do all the steps that we have talked about — all the data engineering steps. You want to automate getting data feeds from all your data sources, and then you have to automate these applications to run regularly, look at the data that is coming in, and start cleansing the data, transforming the data, persisting the data. Then all your analysis code will kick in, and it will start looking at the data regularly and start building models.
So all of them are data products. In one word, they have to run continuously and regularly and keep producing — keep getting data and producing these models. And of course the prediction part, which then runs in real time, in batch, or whatever way it has to run — that again is another data product that runs regularly and uses the model that has been built to make predictions when and wherever required. So building these data products is the final part. This is more like software engineering — more like a software project, actually, if you want to say so — because you know exactly the algorithms and data transformations already, and you are converting them into a software product. You just need to have quality software rigor in both development and testing, and it can be deployed in enterprise or cloud models, depending upon what the data product is supposed to do. Of course, the most important thing here is also that you need operationalized data feeds. The data feeds from all the data sources now have to be continuous. When I say continuous, it need not be instantaneous, where you keep getting data as it happens: sometimes you're getting daily data dumps, sometimes once every 15 minutes, or at 30-minute intervals — it depends on your use case. But it has to be operationalized so that the data keeps coming regularly; you don't have to work with somebody every day to get the data — it's all automated here. And, as we discussed, these data products perform all the cleansing, transformation, and the reporting — reporting is a key thing you want to be doing here. Finally, purging of old data might be necessary. As you keep getting incoming data, there's going to be a lot of it; especially once you transform the raw data into the form you want, you might want to keep the raw data for only 10 or 15 days and then throw it out.
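A retention sweep like that can be a few lines. The 15-day window below is just the example figure from above, and the record layout is made up for the illustration:

```python
import datetime

RETENTION_DAYS = 15  # assumed policy: raw data kept 15 days, then purged

def purge_old(records, now):
    """Drop raw records older than the retention window."""
    cutoff = now - datetime.timedelta(days=RETENTION_DAYS)
    return [r for r in records if r["ingested_at"] >= cutoff]

now = datetime.datetime(2024, 1, 31)
records = [
    {"id": 1, "ingested_at": datetime.datetime(2024, 1, 5)},   # too old, purged
    {"id": 2, "ingested_at": datetime.datetime(2024, 1, 28)},  # kept
]
print([r["id"] for r in purge_old(records, now)])  # [2]
```

In a real data product this would be a scheduled job against the data sink, not an in-memory list, but the retention decision itself looks exactly like this.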
So that completes all the steps that you have to do in a typical data science project. But there is always something called continuous improvement. Once you deploy a data product, there are always changes in the business environment that might affect all your learning in production. This is something to remember: everything that you built as a data product — the algorithms, the models — their accuracy might go down because of changes in the business environment, and all the learning in production has to be revalidated periodically, at appropriate intervals, to make sure it continues to show the accuracy levels it was originally determined to have. Revalidation needs to happen when the business processes change: when there is a change in the business process, or the way the entities behave is changing, or the environment in which it is going to work is changing. Then, obviously, you have to revalidate everything you're doing here. So there might have to be another child project — call it a maintenance project or an improvement project — that comes up periodically to validate that what you have been doing is still fine. The search for a better model should be ongoing — this is an important thing; you just can't build one and stop there; it has to be continuous. So, in summary of what we have seen so far: data science projects follow a life cycle. Data science projects are research-type projects; there is a lot of experimentation, and sometimes no results — that's why we keep calling them research-type projects. Second, it is the data that drives the results, not the algorithms; data is more important than the algorithms themselves. Multiple iterations might be necessary before reasonable results are achieved. This is another thing you want to remember: there is not a strict series of stages in a data science project where things get done once and are over.
Hopefully this has been helpful to you. Thank you for listening.