Be Aware of Data Science | Robert Barcik | Skillshare

Be Aware of Data Science

Robert Barcik, Data Science Trainer

Watch this class and thousands more

Get unlimited access to every class
Taught by industry leaders & working professionals
Topics include illustration, design, photography, and more

Lessons in This Class

51 Lessons (5h 16m)
    • 1. Welcome to the course!

    • 2. Introduction to Chapter 1

    • 3. The Goal of Data Science

    • 4. Approaches of Data Science

    • 5. The "Data" Part of Data Science (1)

    • 6. The "Data" Part of Data Science (2)

    • 7. Statistical bias

    • 8. I love the yellow walkman!

    • 9. The Limitation of Our Mind

    • 10. The "Science" Part of Data Science

    • 11. Introduction to Chapter 2

    • 12. Disciplines: Statistics

    • 13. Disciplines: Databases

    • 14. Disciplines: Big Data (1)

    • 15. Disciplines: Big Data (2)

    • 16. Disciplines: Data Mining

    • 17. Disciplines: Machine Learning

    • 18. Disciplines: Artificial Intelligence

    • 19. T-shaped Skillset of Data Science

    • 20. Skills: Mindset of a Data Scientist

    • 21. Skills: Rectangular Data

    • 22. Skills: Specializations

    • 23. Skills: Technical Wing

    • 24. Skills: Soft Wing

    • 25. Introduction to Chapter 3

    • 26. Describing the life of a foodie!

    • 27. Why do we need to describe the data?

    • 28. Basics of Descriptive Methods (1)

    • 29. Basics of Descriptive Methods (2)

    • 30. Basics of Descriptive Methods (3)

    • 31. Calculating Average Income

    • 32. From Description to Exploration

    • 33. Which house is the right one?

    • 34. Correlation

    • 35. When Temperature Rises

    • 36. Do storks bring babies?

    • 37. Football and Presidents

    • 38. Introduction to Chapter 4

    • 39. From Sample to Population (1)

    • 40. From Sample to Population (2)

    • 41. Is the mushroom edible?

    • 42. Inference: Experiment Setup

    • 43. Inference: Statistical Test

    • 44. Inference: Solving and Summarising

    • 45. The Function of Nature

    • 46. When do we need a predictive model?

    • 47. Building a Predictive Model

    • 48. Predictive Model Types

    • 49. Predictive Model is Never Perfect

    • 50. Are we seeing a dog or a wolf?

    • 51. Is our model having an impact


About This Class

Understanding how we can derive valuable information from data has become an everyday expectation. Previously, organizations looked up to data scientists; nowadays, organizations democratize data science, and everyone can contribute to the effort of turning data into valuable information. Thus, even if you do not aspire to be a data scientist, open the door to these projects by gaining the necessary intuitive understanding. With this course, you can take the first step into the world of data science! This course explains from the absolute basics how data science models create value, even if you feel like a complete beginner to the topic.

The course is delivered by three data scientists with a cumulative 15 years of professional and academic experience. Hence, we won't repeat the textbooks. With every lecture, we will uncover a valuable bit of this lucrative field and take you closer to your desired future role around data science projects. We do not teach the programming aspects of the field; instead, we focus entirely on a conceptual understanding of data science. As practice shows, real-world projects benefit tremendously from incorporating practitioners with thorough, intuitive knowledge.

The course offers over 5 hours of content, consisting of top-notch video lectures, state-of-the-art assignments, and intuitive learning stories from the real world. The narrative will be straightforward to consume. Instead of boring you with lengthy definitions, the course will enlighten you through dozens of relatable examples. We will put ourselves in the shoes of ice cream vendors, environmentalists examining deer migrations, researchers wondering whether storks bring babies, and much more! After the course, you will be aware of the basic principles, approaches, and methods that allow organizations to turn their datasets into valuable and actionable knowledge!

The course structure follows an intuitive learning path! Here is an outline of chapters and a showcase of questions that we will answer:

  • Chapter 1: "Defining data science". We start our journey by defining data science from multiple perspectives. Why are data so valuable? What is the goal of data science? In which ways can a data science model be biased?  
  • Chapter 2: "Disciplines of Data Science". We continue by exploring individual disciplines that together create data science - such as statistics, big data, or machine learning. What is the difference between artificial intelligence and machine learning? Who is a data scientist, and what skills does s/he need? Why do data science use cases appear so complex?
  • Chapter 3: "Describing and exploring data". We tackle descriptive and exploratory data science approaches and discover how these can create valuable information. What is a correlation, and when is it spurious? What are outliers, and why can they bias our perceptions? Why should we always study measures of spread?
  • Chapter 4: "Inference and predictive models". Herein, we focus on inferential and predictive approaches. Is Machine Learning our only option when creating a predictive model? How can we verify whether a new sales campaign is successful using statistical inference?
  • Chapter 5: "Bonus section". We provide personal tips on growing into data science, recommended reading lists, and more!

We bring real-life examples through easy-to-consume narratives instead of boring definitions. These stories cover the most critical learnings in the course, and the story-like description will make it easier to remember and take away. Example:

  • The "Do storks bring babies?" story will teach us the key differences among correlation, causation, and spurious correlation.
  • The "Are we seeing a dog or a wolf?" story will explain why it is crucial not to blindly trust a Machine Learning model, as it might learn unfortunate patterns.
  • The "Is the mushroom edible?" case will show a project that might be a complete failure simply because of the biased dataset that we use.
  • The "Which house is the right one?" story will explain why we frequently want to rely on Machine Learning when we want to discover complex, multi-dimensional patterns in our data.
  • "I love the yellow walkman!" is a case from 20 years ago, when a large manufacturer was considering launching a new product. Had they relied on what people said instead of what the data said, they would have had a distorted view of reality!
  • "Don't trust the HIPPO!" is a showcase of what is, unfortunately, happening in many organizations worldwide. People tend to trust the Highest Paid Person's Opinion instead of trusting what the data says.

Important reminder: This course does not teach the programming aspects of the field. Instead, it covers the conceptual and business learnings.

Meet Your Teacher

Robert Barcik

Data Science Trainer



1. Welcome to the course!: Hi and welcome to the course Be Aware of Data Science. My name is Robert, and in this brief lecture I would like to not only welcome you to the course, but also provide you with an overview of what's ahead of us. On top of that, I will also provide five practical tips on how to be as efficient as possible in your learning journey throughout this course. Let's go for it. First of all, let's discuss what's ahead of us in this course. As you can see, the course is structured in four key chapters: the essence of data science, disciplines of data science, describing and exploring the data, and lastly, inference and predictive models. What can you expect from each of these four chapters? Well, within the first chapter, as the name says, we will talk about the essence of data science. We will want to understand what the goal of data science is, why we are even using the data, and how it helps us overcome what we call cognitive biases. These will be our key first learnings. Then we will also ask ourselves how data science applies scientific approaches, considering it has "science" in its name. So this is the essence of data science that we'll talk about in the first chapter. Once we understand it, we will proceed with the second chapter, where we are more practical, more tangible. We'll talk about the disciplines of data science. There is data mining, machine learning, databases, big data. And we will look at how each of these contributes and how they together create what we nowadays call data science. So as you can see, it will be very practical. In the second part of this chapter, we also ask who a data scientist is, and we will examine the necessary skillset of a data scientist. Hopefully this part of the chapter also gives you inspiration on how you can join the world of data science or possibly grow further in your data science learnings.
Having the basics covered, we proceed to the third chapter, describing and exploring the data. We are basically applying the first two approaches of data science that we will study in this course. We are doing so in this order because when data scientists get their hands on a dataset, the first thing they do is describe the data, for example using descriptive statistics. Secondly, they explore the data, searching for some patterns within the dataset, and we will do the very same thing within the third chapter. Once we have described and explored our dataset, we can proceed with the fourth chapter, where we talk about inference and predictive models. This is the chapter where we will dig deeper into machine learning, and we will be creating powerful predictive models which can learn on some historical data, on some sample of the data, and then possibly predict some future value. So we'll be building these cool predictive models. Lastly, I would like to highlight that there is also a bonus section at the end of the course, which I am really trying to fill with a lot of bonus content for you. For example, if I see that a couple of students are having the same questions or are curious about something, I will collect those and record a special lecture, which will then be placed in the bonus section of the course. You can also find there some bonus footage and bonus information on some of the lectures, to provide you with some behind-the-scenes knowledge, and also some practical tips, such as what I recommend to read or how I recommend to grow further into the world of data science. So be sure not to miss the bonus section of the course. Alright, so this is the journey which is ahead of us. I hope you will enjoy it.
And now, as promised, let me provide you with five practical tips on how you can be as efficient with your learning within this course as possible. First of all, pace yourself: find your ideal and optimal pace. Maybe some hints: one chapter is approximately 90 minutes. If you like longer uninterrupted sessions, you can pace it by the chapters and take one chapter at a time. If you would like very short bursts of learning, the course is also optimal for that, because one lecture is approximately five to seven minutes long, so it's really short pieces of content. I really hope that you will find the pace which suits you throughout the course. Secondly, practice makes perfect. I really recommend practicing through the assignments that we have created for you. They are spread all over the course, and we have really tried our best to bring you interesting assignments. For example, there is one where we'll be analyzing deer migrations and then predicting how the deer will migrate in our nature reservation. There are more assignments; for example, you will be a restaurant owner and you will be identifying relevant inputs for our predictive models. I really recommend that you invest your time and practice with the prepared assignments. Thirdly, whenever there is a really crucial learning within the course, we turn it into a learning story so that you can remember it better. So I really recommend focusing on a learning story when you find one and remembering the concept through it. For example, we'll be talking about why HiPPOs are dangerous for organizations, we'll be asking ourselves whether storks bring babies, or we'll be asking ourselves whether this mushroom is edible or poisonous. I really hope that these learning stories will help you remember the key concepts. Now, at least for myself, I love collecting things.
We are also constructing various handouts and collectibles, which are always available as a handout next to a lecture. So I recommend that you always check whether there is a handout available for a lecture. There will be, for example, one-pagers which summarize the key learnings from a particular lecture. So always check the handout section. And lastly, especially when we're talking about data science, it's really crucial to be curious and to be asking questions. So I also recommend that you ask questions with regard to the content. For example, if you stumble upon something which is unclear, or something that you would like to discuss with your fellow students or with me, use the Q&A section. I'm always checking it and I will be replying to your questions. I also hope that you will develop meaningful discussions with fellow students, for example when it comes to the assignments. So I really recommend making use of the Q&A section; I'll be super happy to hear from you and reply to you. Alright, so these were the five quick tips for taking the course, and I can't wait to see you in the upcoming lectures. See you there. 2. Introduction to Chapter 1: Hello everyone, a warm welcome to the first chapter of the course Be Aware of Data Science. My name is Robert, and in this brief lecture I would like to present the goal that we have during this chapter, which is called the essence of data science. I will also provide you with a bit of an outlook of the lectures that we will have. Let's go for it. If we look at the overall course structure, we can see that we are really at the beginning. We want to define and understand data science from multiple perspectives. This will be the goal during this very first chapter; we'll be tackling questions such as: what is the goal of data science? What are the tools, methods, and approaches which are available to us? Or, what is a data science model?
With these basic questions, we are really starting from the ground up, and we will be building on this knowledge during the remainder of the course. So, having the goal defined, let me show you the lectures which are ahead of us. We're starting really from the ground, and already in the first lecture we are defining what the goal of data science is. Essentially, you will understand that data science is the art of turning data into valuable information. Then, I really recommend that you look forward to the second lecture, as I have a quick win for you there: the approaches of data science. I will build a framework of four approaches which are nowadays used by industries a lot: a descriptive approach, an exploratory one, an inferential one, and a predictive one. We'll be relying on this framework during the remainder of the course as we study these four approaches of data science. It's a very important lecture. Then we will start to challenge the data and think about the data part of data science. Data science is very popular nowadays. Why does everyone want to utilize the data? And if we are utilizing the data, are the data sacred? Can we blindly trust them? Are they always right? As you can see, we'll be focusing on the data part of data science. I then have a learning story for you, as I would really like you to remember to always remain skeptical towards the data and towards the nature of the data that you have. I then have an assignment for you which is called "bias is everywhere". I will ask you to reflect, from your own experience, upon possible cognitive and statistical biases which you might have experienced in the past. So it will be a practical lecture. Then, in the second part of the chapter, we will move towards the science part of data science.
We will think about some limitations of the human mind and why it is so useful that data science builds models, simplifications of reality; we are moving towards and focusing on the science part of data science. Then, just before we conclude the chapter, there is a checkpoint on which you can recap all of the knowledge that you have gained during the chapter. It might be necessary, because the last part of the chapter is a quiz that I prepared for you, where you can test your knowledge from this chapter. Overall, you can expect that this chapter will take you approximately 60 minutes in lectures, plus there are approximately 20 minutes of practice activities in terms of the assignment and also the quiz. So I'm looking forward to seeing you in this first chapter. 3. The Goal of Data Science: Hi and welcome to another lecture in the course Be Aware of Data Science. My name is Robert, and in this lecture we're going to talk about the goal of data science, because I can't think of a better way to start the journey of exploring the wonderful world of data science than to talk about its goal. So let's go for it. At first sight, the goal of data science is very simple and straightforward. The goal is to turn data into valuable information. That's what data science attempts to do. Is that it? Have we now defined data science? No, of course not. Even though this definition is vague, there is quite a lot behind it, and I would like to talk about it within this lecture. First of all, let's ask ourselves: were data scientists the first ones who ever attempted to turn data into valuable information? Certainly not. You had statisticians, artificial intelligence experts, data miners. You had business intelligence experts. All of these people were trying to turn data into valuable information, but they were kind of separated from each other. But things changed approximately at the turn of the millennium, around 1999.
Basically, smart folks were meeting at conferences and thinking that if we unify all of these separated efforts under a common umbrella, under a common definition, then the industry will be able to adopt it better. And that's exactly what happened. They came up with the term data science and with this vague definition, which unifies all of the efforts of turning data into valuable information. And it seemed to be the right move, because the industries really started to adopt data science methodologies and the various disciplines which all contribute to data science. This is the first key takeaway from this lecture: even though the definition of data science is vague, it's vague by design. It's supposed to be like this. It's supposed to unify all of these otherwise separated efforts of turning data into valuable information under one common umbrella, which we call data science. Now, I would like to zoom in and look at what's behind this definition, as there is of course a lot more. First of all, we can look at the left side, the data side of the definition. There are actually lots of aspects of the data that can vary quite a lot from project to project, or use case to use case, of data science. First of all, we can have structured or unstructured data. If the data is structured, you can, for example, imagine some demographics dataset of your customers, literally in an Excel sheet which you can load into Excel and look at where each individual customer is located or at their socio-demographic characteristics. This would be structured data, and it's quite easy to work with. Contrary to that, we can have unstructured data.
Well, maybe your company has some large IT system which logs everything that happens, for example, on a server or within your company's solution that is exposed to the customers, and it is just outputting the logs. The data exist, but we do not have such a great overview of what's in there if we would like to analyze it and derive some valuable information. So of course, it's going to be a lot harder to work with such unstructured data. Secondly, the data can be what we call purposed or unpurposed. We will discuss these in detail later in the course, but for now: a purposed dataset would be one which is collected exactly for us, for the hypotheses, for the idea which we are having. We have collected the dataset just for our purpose; that's what we would call a purposed dataset. On the other hand, you can have datasets which are unpurposed. Maybe they are just being collected for some operational purpose, such as the IT system that I already mentioned: the logs are there in case something goes wrong, so that our IT experts can examine it. Or maybe there was another data science project going on which collected some data. But again, the data is not primarily collected for our purpose, for the idea that we are having. So it might be a bit more troublesome to work with these data, and it might be a bit less informative, as it's unpurposed from our perspective. Thirdly, we can think about the nature and accessibility of the data. Do we have access to the data on a real-time or near-real-time basis? Do we have to process the data on a near-real-time basis, so that as the data are popping up, as they're coming into existence, we have to analyze and process them right away? Or do we have access to the data, and also the need to access the data, in a batch way? Let's say once a month we are supposed to analyze the data and come up with some valuable information.
You can see that even from this perspective, there can be a lot of differences within the data. Now, this list is of course not exhaustive and full. I just wanted to give you, right away at the start of the course, a clear idea that datasets and data might be really different. Maybe a key takeaway from this part of the lecture is that we should always be asking ourselves: what is the availability and the nature of the data within our domain, or within the use case or the project that we're working on right now? So you can see we have now zoomed in to the data part of the definition of data science. Secondly, let's zoom into the right side of the definition, the side of valuable information. There are also a lot of differences in what form we are creating the valuable information; there are actually various forms. First of all, it can be some simple descriptive statistic. For example, we summarize on which days of the week we're making how much in sales. This can already be valuable information to our marketing colleagues, who can then send better campaigns. So a simple descriptive statistic could already be the valuable information. Or we could be a bit more complex: we are thinking about a visualization or a pattern. Let's say we have a large dataset and we are visualizing how our sales are changing across the days of the week, or maybe across the months of the year, and we might find some valuable pattern through the visualization. So the visualization itself is the valuable information that we are seeking. Or lastly, we could be a bit more complex with our approach and aim for the creation of a predictive model, usually a machine learning model. Later in the course we'll be building up an example in which we are the owner of an ice cream stand and we're hoping to predict how many customers will come to our ice cream stand on any given day, so that we can stock up properly and have an appropriate amount of personnel in place.
This is the most complex type of valuable information that we could be creating: individual predictions, for individual customers, for individual days. As you can see, again, this is not an exhaustive and full list. I just wanted to give you an idea that the valuable information we are creating using data science can also vary in form. And it's always going to be the use case, or the domain in which we are working, that determines in which form we desire to have the valuable information. I would like to stay one more minute on this definition of data science, because I have one more key message for you, which is that this vague definition is a gift and a curse at the same time. What do I mean by that? Well, it's a gift because data science can be applied virtually anywhere, within any industry. As a data scientist you can work for a manufacturing company or a solar energy engineering company; you can be working within the financial industry, virtually anywhere, and within any organization, which is really great, isn't it? At the same time, this vague definition means that data science can be applied at any level within an organization. So we could have expert data scientists with many years of experience creating large and complex models, and at the same time, alongside them, you could have maybe their business colleagues who are creating simpler models and integrating them really well into the business domain where they are the experts. So this is the gift: you can apply data science in virtually any industry and organization, and also at any level of the organization. That's the gift which this vague definition brings. At the same time, it's kind of a curse. Why do I claim that this vagueness is a bit of a curse for data science? Because data science use cases and projects can get overwhelmingly complex, and you can then find people wondering: do I need a lot of data to apply data science? Do I need a powerful infrastructure?
Do I need data scientists to be able to do a use case within my organization? Who even is a data scientist? You can see this vagueness is causing various confusions all over data science nowadays. Of course, later in the course we are going to discuss all of these and clarify them. I just wanted, already now at the beginning of the course, to give you a sort of quick win on why we are seeing so much confusion around the world of data science nowadays. Well, it's because this vague definition is causing a little bit of a curse. Now, this is what I wanted to discuss within this lecture. We have defined the goal of data science and we have zoomed into various parts of it. You might now find yourself wondering: how is all of this happening? How are we turning the data into valuable information? That's what we will tackle in the upcoming lecture, where we will talk about the approaches of data science. So thank you so much for being part of this lecture, and I look forward to seeing you in the upcoming lectures. 4. Approaches of Data Science: Hi and welcome to another lecture in the course Be Aware of Data Science. My name is Robert, and in this lecture I would like to talk about the approaches of data science, just as I promised: once we understood the basic goal and definition of data science, now we tackle how it happens. How do we turn data into valuable information? Let's go for it. What is widely applied in the industries nowadays are four approaches of data science, and in this video I would like to give you an overview of these four approaches. Later in the course we will dig deeper into them: you will get an overview of the methods which are available, you will have a chance to practice with the methods, and a lot more. The key takeaway from this video will be to understand how these approaches build on each other and what the differences among them are.
And, of course, what the essence of each of them is. As I said, we will be building this overview with a colorful example: we are a fruit and vegetable stand owner. Of course, we want to sell as much of the fruits and vegetables as we can, we have some data available, maybe from the past sales that we were making, and we want to utilize data science. What do we do? The very first thing that we can attempt is to use the descriptive approach of data science. As you can see, I have put this arrow over here: these approaches, as we'll be discussing them, usually build on each other in a sort of sequence within a data science project or use case. So the descriptive approach is usually the first one that we apply when we get our hands on our dataset. Alright, but back to our fruit and vegetable stand. We are utilizing the descriptive approach, and we are mainly asking ourselves: what is the essence of this dataset? What is it that is in front of us, and what is even the problem that we are tackling? We will be using very simple tools from statistics, such as descriptive statistics, measuring averages, measuring the extremes, and then we will also utilize some simple visualizations. We are just trying to understand what is in the data, being very objective. An example outcome using this descriptive approach of data science would be that on weekdays I sell on average five kilograms of apples. Now, this is already useful; this is already valuable information for me, because I know how many apples I should have in stock on any given weekday: approximately five kilograms. It's not the most powerful finding, not the most valuable information, but it's a good start. This is also how data science use cases usually start: just trying to understand the essence of the dataset and trying to come up with these quick wins. So that is how we utilize the descriptive approach.
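The course itself deliberately stays away from programming, but the descriptive step above is so small that it can be sketched in a few lines of Python. The sales numbers below are invented for illustration only:

```python
from statistics import mean

# Hypothetical daily sales of apples (kg) at our stand, keyed by weekday
sales_kg = {
    "Mon": [4.5, 5.2, 5.0],
    "Tue": [4.8, 5.5, 4.9],
    "Wed": [5.1, 4.7, 5.3],
    "Thu": [5.0, 5.4, 4.6],
    "Fri": [5.2, 4.9, 5.1],
}

# Descriptive statistics: the average and the extremes across weekdays
all_days = [kg for day in sales_kg.values() for kg in day]
print(f"Average weekday sales: {mean(all_days):.1f} kg")  # ~5.0 kg
print(f"Extremes: min {min(all_days)} kg, max {max(all_days)} kg")
```

Nothing more than an average and the extremes, yet it already answers the stocking question: keep roughly five kilograms of apples on hand per weekday.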
We turn our attention to what we call the exploratory approach. Here the question is different: we are asking, are there any patterns in my data? We'll be using tools such as correlation examination (we will study correlation in detail later in the course), basically measuring some relationships which could be occurring, and we're also building some more complex visualizations to visualize these relationships and patterns. An example outcome would be that customers who buy apples also buy bananas. You can see the difference: we are no longer just describing that we are on average selling five kilograms of apples per day, we're describing that there is a relationship between apples and bananas. So this is an example of the exploratory approach: finding patterns in our data. Now, once we have found such a pattern, we might have gotten curious: is this buying of apples and bananas together something that is happening only within our stand, or is it something generally happening anywhere in the world, so that people who like apples also buy bananas? We might turn our attention to the third approach, which we call the inferential one. Here the question that we're asking ourselves is: if there are patterns in my data, can I generalize them outside of my sample? In our case, can I generalize these patterns outside of our fruit and vegetable stand? Are the other vendors around the market also selling apples and bananas together, or is it something specific to us? Here we are reaching out to the world of inferential statistics, which we also have an introduction to later in the course. And let's say that the finding will be that the apples-and-bananas pattern only happens within our stand.
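The correlation check from the exploratory step above can also be sketched in plain Python. The per-day quantities are again invented; the point is only how a Pearson correlation coefficient quantifies "apples and bananas move together":

```python
from math import sqrt

# Hypothetical per-day quantities sold (kg): do apples and bananas move together?
apples  = [4.5, 5.2, 5.0, 4.8, 5.5, 4.9, 5.1]
bananas = [3.1, 3.8, 3.6, 3.3, 4.0, 3.4, 3.7]

def pearson(xs, ys):
    """Pearson correlation coefficient, computed from its definition."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

r = pearson(apples, bananas)
print(f"Pearson correlation: {r:.2f}")  # close to +1: sold together
```

A value near +1 is exactly the kind of pattern the exploratory approach surfaces; whether it generalizes beyond our stand is then the job of the inferential approach.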
Maybe what we did is that we obtained sales data also from other stands which are on the market, and we have compared, and we found out that really, it's just for us that the apples and bananas are frequently bought together. Now, is this valuable information for us? Of course it is, because it tells us that we are doing something special with the apples and bananas. Maybe it's the display, how we're putting them together in our little fruit and vegetable stand, which makes people buy them together. Again, another kind of valuable information which we might create thanks to the inferential approach. And the last approach that we might decide to utilize, which industries nowadays are also utilizing a lot, is called the predictive approach. Here we are basically asking a question: can I make granular predictions with the patterns? What do I mean by granular predictions? Well, let's say granular predictions are on the granularity of days, so I could maybe forecast how many apples and bananas I will be selling on any given day. Or I can be on a customer granularity, so I would like to make individualized predictions for my customers. And I would say that if I would like to utilize this predictive approach, I would focus on making personalized offers for our customers. So I will run a little loyalty program with loyalty cards, so that whenever customers are coming to our fruit and vegetable stand, they are showcasing the loyalty card and I can link their purchases together. And let's say that I found out that there are customers who like artichokes. And I also found out that there are customers who like to cook special meals on the weekend. Now you can see, this is not just one pattern; those are two patterns: customers who like artichokes and those who like to cook special meals on the weekends. If I combine the multiple patterns at the same time into one predictive model, I might be able to make personalized, granular predictions on my customer level.
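The idea of combining several patterns into one granular, per-customer prediction can be sketched very naively. The customers, flags, weights and threshold below are all invented for illustration; a real predictive model would learn these from data.

```python
# Toy customer records built from a hypothetical loyalty program.
customers = [
    {"name": "A", "likes_artichokes": True,  "cooks_weekend_meals": True},
    {"name": "B", "likes_artichokes": True,  "cooks_weekend_meals": False},
    {"name": "C", "likes_artichokes": False, "cooks_weekend_meals": True},
    {"name": "D", "likes_artichokes": False, "cooks_weekend_meals": False},
]

def purchase_score(customer):
    # Naive additive model: each pattern contributes some assumed evidence
    # that the customer would respond to a fresh-artichoke offer.
    return 0.5 * customer["likes_artichokes"] + 0.3 * customer["cooks_weekend_meals"]

# Customers scoring above an assumed threshold get a personal call.
to_call = [c["name"] for c in customers if purchase_score(c) >= 0.5]
print("call before the weekend:", to_call)
```

Even this toy version shows the key difference from the earlier approaches: several patterns are combined into one score per individual customer.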
So maybe if I utilize the predictive approach, the outcome will be that the model is going to output several customers whom I should call before the weekend, when the fresh batch of artichokes arrives, and then make them an offer: come to our fruit and vegetable stand, we have these fresh artichokes that you might enjoy. Ideally, the predictive model might be right about a couple of them. So we have again generated valuable information by combining multiple patterns; unlike with the inferential approach, where we had one pattern, we are combining them together to make these individualized predictions. So as you can see, with this last predictive approach we will of course be utilizing more complex techniques: predictive modeling, machine learning or statistical learning. We will have time to dig deeper into these. The important takeaway from this video is to have an overview of these four approaches, how they are nicely and intuitively building up on each other, and how they all can be utilized for a problem that we're facing. And that is it from this lecture. I would like to thank you for being part of this lecture and I look forward to seeing you in the upcoming ones. 5. The "Data" Part of Data Science (1): Hi, and welcome to another lecture in the course Be Aware of Data Science. My name is Robert, and in this lecture we are going to be talking about the data part of data science. Why is it so powerful? Why is it so useful? First, let's think about decisions and information. We make decisions every day, small or big ones. Decisions matter because they influence our future. For example, you might decide not to take an umbrella today; it turns out to be raining, and your day is going to be ruined. Bad decision. Or let's say that you decided to buy a property at the outskirts of the city. This property doubles in value over the upcoming years. So obviously it was a good decision.
We want to make the right choices within our decision-making process. Now, what can help us in this decision-making process? Well, I would say that if we had watched the weather forecast in the morning, we would have seen that it's going to be raining today. So this valuable information could have told us that we should take an umbrella. Or we might have examined in detail the geographical plans of our municipality, and we might have noticed that there are large infrastructural investments planned which will increase this property's value. So you can see, information is valuable for decision-making, and the data contains it. But do we really need the data to make the right decisions? Can't we just use our own experience or knowledge to make the decisions? Well, there is a problem. And the problem is that humans are incredibly biased. We suffer from what we call cognitive biases. Psychologists, anthropologists and sociologists have been compiling for decades a very long list of these cognitive biases. Here on the slide, I'm actually listing just quite a few; there are a lot more of them. What are these cognitive biases? Well, we could simplify them and translate them to statements such as: we as humans are not fair, we create prejudice, we discriminate, we are not even rational, you could say. I think you get my point from this summarization. Now, even though we are right now shedding sort of a bad light on these cognitive biases, some of them actually helped us survive in the past, or they helped our ancestors survive. Let's say an ancestor of ours walked toward a dark alley and spotted a shady individual. He walked through another street instead, and it saved his life. Another ancestor of ours was offered a venture by an unhealthy-looking partner. She passed on the business venture and she avoided the failure of this business venture, obviously because of the health reasons of this potential partner. In the past, these might have been useful.
However, nowadays, when we talk, for example, about business decision-making, these can be incredibly problematic. The fact that human decision-making can be this incredibly biased is a huge incentive for companies to incorporate data in their decision-making process. Now, just to show you what I mean, I would like to focus on one of these biases, which is maybe one of the most prevalent ones you can find in companies nowadays. It's called the authority bias. We have made this beautiful illustration for you, so you can remember that hippos are dangerous for companies. You can remember the statement that you should trust the data and not the highest paid person's opinion, or the HiPPO. That is the way I like to remember the authority bias. Here is the story. Let's say that you are working for a yogurt-producing company, and maybe you have experienced meetings such as this one in your past. We have launched a new product, say a new taste and flavor of yogurt, and now, after a couple of months of attempting to sell this new yogurt, you are meeting to decide whether you should continue producing, manufacturing and promoting this new yogurt. Well, as it looks based on the sales data, it's not selling that well. And also your customer reviews are indicating that this is not the best product ever. However, the manager, the HiPPO in this case, stands up and says: No, I like the new product. I actually think that things will start to look better. We just need to give it some time. We have already invested so much into the development of this new yogurt. We won't just stop now. And the opinion of this HiPPO outweighs what everyone else in the room is saying, obviously, because that's the highest paid person's opinion, and we have just suffered from the authority bias. Here's the thing. Luckily, the data does not suffer from cognitive biases such as this one. We could use the data to override the HiPPO's opinion.
The data can easily showcase that even though customers tried the new product and bought it, no one bought it a second time, and no one wrote a positive review. If companies learn from the data and utilize the information out of the data properly within their decision-making, they can avoid these human biases. Now, am I saying that the data are flawless? Not at all. The data is unfortunately not free of problems either, and we will talk about that in the upcoming lecture. 6. The "Data" Part of Data Science (2): Hi, and welcome to another lecture in the course Be Aware of Data Science. We have left the last lecture with a question: are the data flawless? And I'm claiming: not at all. Even though the data might be really helpful in our decision-making process so that we avoid all our cognitive biases, we have to be aware of some potential pitfalls and issues with the data. I will bring up a handful of these issues just to give you an idea. First of all, there is the famous statement by Mr. Coase that if you torture the data enough, it will confess to anything. What do we mean by it? Well, even though the data might be unbiased, it is still us humans who are interpreting the data. If I, as a data scientist, am suffering from some cognitive bias, I might portray my cognitive bias on top of the data and still obtain a biased result. If you would like to simplify this statement even further: the problem is between the computer screen and the chair, which is of course us, the humans. We will have a bit more examples on this one later in the course, but I definitely recommend remembering this statement. So this is the first issue of working with the data. Secondly, I would say that not all data are born equal, or, for example, not all features about people are born equal. What do I mean by it? Well, we can, for example, break down data points into four categories based on how they come into existence.
And some of these are going to be more useful for our data science projects, and some of these are going to be less useful. Usually the most useful ones are data which are observed, or so-called exhaust data. Observed data, I think, are really easy to imagine: you just observe how old somebody is, so that could be their age. Exhaust data are created as a result of some process that, for example, a person could be doing. Let's say that you are typing an email and I create out of it a data point which says your speed of typing. These two kinds of data, based on how they come into existence, tend to be pretty useful for data science projects. But then we are getting to the problematic parts. For example, there are learned data points. What are those? Well, let's say your bank or your insurance company has some risk rating about you. This risk rating is not a raw data point which is describing you; it was already learned from some other data points, such as, for example, what your behavior was in the past, whether you were paying your debt, and so on and so on. Now, the problem for us might be that if we reuse this data point for our use case, for our project, we might be taking over some biases or some issues from the previous project which constructed this learned data point. Working with these could be already problematic. Lastly, we are coming to the red category, which is the self-reported data. These are, from my perspective, the most problematic. Many applications, products and services of organizations deal with humans in one way or another. Unfortunately, there are several issues connected with people. First of all, people might lie. Let's say that I would like to get a loan from a bank. The bank clerk asks me if I owe money somewhere outside of this bank, at some other financial institutions. I do, but at the same time, I really would like to get this loan from this bank.
So I'm going to say: hey, I don't owe money anywhere else. Of course, I have lied with a purpose, so whatever I say is going to be very biased. I have self-reported something that is wrong. Secondly, people might just not know themselves. Let's say I participate in a job interview and the interviewer asks me how good I am at data science, or how good I am at dealing with stressful situations. Well, of course, I'm going to report that I'm great at data science and I'm great at dealing with stressful situations. But maybe I don't have the right perspective; I might be terrible at dealing with stressful situations, I just don't know it myself. So again, I create a data point about me which is completely biased, because I don't know myself. It's another self-reporting problem. I think you get my point over here: whenever data science works with data about humans, these issues should be kept in mind. In my experience, the best data about humans are, how to say it, the ones at the top, and we should rather be avoiding the ones at the bottom. At least from my perspective and from my subjective opinion, I do not really like to work with surveys, because they are full of these self-reporting biases. Here is another issue that can be there with the data: biased datasets. As the worlds of data science and machine learning are progressing, we want to build more powerful, more impactful and larger models. To build such models, you oftentimes need very huge datasets. This is especially true for use cases such as visual recognition or natural language processing. And the issue is that when you build such a model on a dataset, the dataset might have some statistical bias. And now be careful: I'm saying statistical bias, not cognitive bias. Cognitive biases are over here; the statistical bias is inside of the data.
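Before looking at a concrete research example, here is one simple way such an imbalance could be detected: counting how often each label occurs in a dataset. The labels and proportions below form a made-up toy dataset, not the data from any real study.

```python
from collections import Counter

# Toy dataset of image labels; the proportions are invented.
image_labels = ["light-skinned"] * 840 + ["dark-skinned"] * 160

counts = Counter(image_labels)
total = sum(counts.values())
for label, count in counts.most_common():
    print(f"{label}: {count} images ({count / total:.0%})")
```

A quick check like this, done before any training, is often enough to reveal that one group dominates the dataset.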
For example, there was a very nice piece of research done on one of the famous datasets which are being used for various visual recognition applications. What the research found out is that there is a terrible imbalance between light-skinned people and dark-skinned people inside of the dataset. It was found that 84% of the images which are inside of the dataset, on top of which our model will learn, are of light-skinned people, and only 14% of these images were of dark-skinned people. So as a result, if we learn on top of this dataset, the model is going to be way better at recognizing faces of light-skinned people, and it will have way worse performance when it comes to dark-skinned people, which is of course very bad. Now imagine that we resorted to the data because we wanted to avoid all the cognitive biases, yet we have stumbled upon a biased dataset. We will have a problematic model as a result. Now, we will talk about statistical biases as well later in the course, but you can see that even the datasets which we are using can have these sorts of problems. Lastly, I would like to talk about fairness. In recent years, fairness as a phenomenon is really growing in importance tremendously. Let's say that you are a bank and you are deciding to whom you will grant a loan, or you are building a model which will be automatically deciding who will be granted a loan. Now, it is a well-known fact that using the data within such a modelling exercise can be incredibly helpful. What data might you use? The simplest form might be the socio-demographic data. Let's say that you are using age, gender, income, residence, occupation, all these kinds of data points. But a new question arises: should you be using all of these data? Would it be unfair if you used some of these data points? What do I mean? Well, think for example of the gender information. You might look at your past data and learn that there is a difference in the default rates.
That is, a difference in who will not be able to pay back the loan, between men and women. For the sake of simplicity, let's say that middle-aged men are a few percent less likely to pay back the loan. Hence, you might consider it: you might be less likely to grant a loan to a middle-aged man. Is it fair, though? A man was most likely born a man and cannot change the state of this variable, and it then becomes discriminatory to act upon a variable which the customer cannot influence. So you say: alright, alright, I'm going to exclude this variable and I'm going to make a data-based decision. We found this unfairly discriminating variable, and we are hoping that now our model is becoming fair. Well, what if, however, the gender information is somehow reflected in another proxy, in another variable in your dataset? Within the geographical region where you operate, there is an unfortunate wage gap between men and women. Hence, income becomes an indicator of the gender information. You should then maybe also exclude this income feature from the features which you are using for your loan-granting model. Now, we have just dipped our toes into the topic of fairness; it's a really big one. I just wanted to highlight for you that it's not only about the issues which might be there with, you know, portraying our own cognitive biases on top of the data, or that not all data are of the same quality, or that there can be statistical bias. We also have to think about the fairness of using some data points and some sources of the data. Just to sum up these two lectures about the data: we want to make decisions, we need to make decisions every day, and we would like to make the right choices within our decision-making process. And it's troublesome to be using only our own perceptions, our own experience and knowledge, because we have all of these cognitive biases.
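The proxy problem described in this lecture can be illustrated with a tiny toy table; every number here is invented purely to show the mechanics, not drawn from any real bank.

```python
# Invented customer rows: even if the "gender" column were dropped,
# the "income" column still separates the two groups, so a model could
# recover the protected attribute indirectly.
toy_rows = [
    {"gender": "man",   "income": 3400},
    {"gender": "man",   "income": 3600},
    {"gender": "man",   "income": 3500},
    {"gender": "woman", "income": 2900},
    {"gender": "woman", "income": 3000},
    {"gender": "woman", "income": 2800},
]

def mean_income(gender):
    values = [row["income"] for row in toy_rows if row["gender"] == gender]
    return sum(values) / len(values)

gap = mean_income("man") - mean_income("woman")
print(f"average income gap in the toy data: {gap:.0f}")
# In this toy data, a naive rule like "income > 3200 means man" would be
# right for every row -- income acts as a proxy for the excluded variable.
```

This is why simply deleting a sensitive column does not by itself make a model fair: the information can survive in correlated features.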
That's why we might resort to data, and that's why many companies are resorting to data to act as an aid in their decision-making process, to overcome these cognitive biases. However, we should not think that the data are flawless. The data can have their own problems which we need to tackle. And within the second part of the lecture, we have discussed that we can still, as humans, portray our own cognitive biases on top of the data; that based on how the data came into existence, they can have their own problems; that our sample of the data could be biased, and so the model which we build on top of these data will be biased as well; and lastly, that we should think about whether it's even fair to use some of the data points that we have available. And that is it from this lecture. I'm looking forward to seeing you in the upcoming ones. 7. Statistical bias: Hi, and welcome to another lecture in the course Be Aware of Data Science. My name is Robert, and in this lecture we will talk about statistical bias. We have actually already touched upon an example of a statistical bias in the previous lecture, where we had the biased dataset that contained an imbalance of images of dark-skinned and light-skinned people. Now, this is an important term for the whole of data science, so in this lecture I want to go into more detail about what statistical bias is. And then you have an assignment coming up where you can practice with both cognitive biases as well as statistical biases. So let's go for this statistical bias. First, to fully explain what statistical bias is, we need to understand a bit of a difference in terminology when it comes to an observation, a population, and a sample. So let's start with a formal definition. In statistics, a sample is a set of individuals or objects collected or selected from a statistical population. The elements of a sample are known as sample points or observations. Now let's start from the population.
I think it's easiest to simply imagine human populations. A population could be every human being living in the world, a population could be every person living in a certain state, or it could be a much narrower population, so, for example, everyone who is using a computer. As you can see, the population definition is always dependent on us. It's coming from us; it is about what or whom we are interested in studying. If we are studying an entire country, then it's everyone who is living in this country who creates our population. Moreover, it's not only about human populations. Populations could be anything, even if it's not living. For example, if I have a building and I have many light bulbs in the building, all of the light bulbs in a certain building can create my population. So this definition is really, really flexible. Now let's continue with a sample. As you can already see from the drawing, the sample is a subset of this population. Now, when talking about the sample, I would start by asking: why are we even talking about a sample? Why is it important? Why is it necessary? The reason is very simple and practical. Oftentimes it will be very impractical, sometimes even impossible, to collect the data about an entire population. Let's bring a concrete example. We are interested in the average height of the people living in a certain country. Of course, you can imagine that if this country has 10 million inhabitants, it will be completely impractical, if not impossible, to collect all of the heights of people living in this country. What we can do instead is that we will only draw a sample of people from this population, and we'll measure the height of the people in the sample. So we'll measure the heights of our observations. Now we have the population and we have the sample defined. I think you can already see where I'm headed.
Basically, what we will attempt to do is that we will attempt to make an educated guess about what the average height in this country is, based on the measurements that we have made on the sample. For example, we have measured our 1,000 observations and we can see our sample average. We also see some deviations from this average; of course, not everyone is the same height. And looking at the sample data, we will attempt to infer the average height of the people in the country. How may it look? Well, let's be very concrete. We have measured that the average height in our sample is 175 centimeters. Our claim about the population will be that we are 99% certain that the true population mean lies between 171 and 179 centimeters. Why are we not claiming that the population mean is also exactly 175 centimeters? We should not do this, because the data is not perfectly right; we cannot have perfect inference from the sample to the population. I mean, we could be fairly lucky and have drawn a very representative sample, and indeed the population mean is going to be 175 centimeters. But more likely, we are experiencing some degree of statistical bias, so that there is some degree of noise, error or randomness coming into play. Now, when I'm saying statistical bias, again, please be careful: cognitive biases are over here, statistical biases are in our data. Now, we have an assignment coming up where you will practice with both of these. To define a statistical bias in a fairly simple way, we could say that it is when the sample value differs from the population value. As I was saying, this could be due to a fairly long list of reasons; during the assignment, you will have a chance to practice with these, such as selection bias, attrition bias or exclusion bias. But for the sake of this lecture, let's continue on. Let's say that we are experiencing an exclusion bias, which is causing the statistical bias.
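The height inference described above can be sketched in a few lines of Python, using a simulated sample and a normal-approximation confidence interval. The "true" population parameters (mean 175 cm, standard deviation 8 cm) are assumptions for this demo, not real measurements.

```python
import random
import statistics

random.seed(42)
# Simulate measuring 1,000 heights (cm) drawn from an assumed population.
sample = [random.gauss(175, 8) for _ in range(1000)]

n = len(sample)
mean = statistics.mean(sample)
sem = statistics.stdev(sample) / n ** 0.5   # standard error of the mean
z = 2.576                                   # z-value for 99% confidence
low, high = mean - z * sem, mean + z * sem
print(f"99% CI for the population mean: {low:.1f} to {high:.1f} cm")
```

Notice that the claim is an interval, not a single number: exactly as in the lecture, we account for the noise and randomness of working with a sample.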
So our inference from the sample to the population will not be exact, due to the statistical bias. How does the exclusion bias happen? Well, let's say that we have collected these 1,000 heights. How are we collecting these 1,000 heights? We have decided to spend five days in a city, and each day we wanted to collect 200 heights. Well, each day we tried to stand at a different place in a certain city, or maybe somewhere in the country, and we measured the height of everyone who passed by. Well, unfortunately, one day we stood just next to a sports club, and a lot of basketball players passed by. Now, I think you can guess it: basketball players tend to be taller than the average population. On the other hand, elderly people did not pass by this sports center, and they were clearly excluded from our measurement. As you can see, our estimation from the sample is biased to a certain degree. We can sum up this little example: whenever we are learning, we usually only work with a sample of the data, and we are going to be inferring something about the rest of the population. Whenever we will try to learn something from a sample and then make an estimate about the entire population, we have to account for some uncertainty and errors due to the statistical bias. And on top of it, we are usually making some assumptions, for example, the assumption that our sample is indeed representative of the population from which it is drawn. Later during the course, within the inference chapter, we'll talk about the inference a lot more. For now, I wanted to give you the core idea of the statistical bias. And just to connect to what you already know: you have seen that there are four basic approaches of data science. We have discussed that there is the descriptive, the exploratory, the inferential, and the predictive one. The statistical bias is not equally problematic for all of these; we will of course be studying these later on in the course.
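The sports-club story can be seen numerically in a toy simulation: over-representing one tall subgroup pulls the sample mean well above the population mean. The height distributions and group sizes here are invented.

```python
import random
import statistics

random.seed(0)
# Assumed population: mostly the general public, plus a small tall subgroup.
general_public = [random.gauss(175, 8) for _ in range(100_000)]
basketball_players = [random.gauss(195, 6) for _ in range(2_000)]

true_mean = statistics.mean(general_public + basketball_players)

# Biased sampling spot: half of the passers-by are basketball players,
# far more than their share of the population.
biased_sample = random.sample(general_public, 500) + random.sample(basketball_players, 500)
biased_mean = statistics.mean(biased_sample)

print(f"true mean ~{true_mean:.1f} cm, biased sample mean ~{biased_mean:.1f} cm")
```

The gap between the two means is exactly the kind of exclusion/selection bias the lecture warns about: nothing went wrong with the arithmetic, only with who ended up in the sample.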
But when you think about it, when we are using the descriptive and exploratory approaches, we are only staying within the sample. So whatever we find out, whether it's people buying bananas and apples together, we are only claiming it about our sample. This issue of a statistical bias, this issue of inference, is not impacting the descriptive and exploratory approaches. However, as soon as we try to generalize from our sample to an entire population, either via the inferential approach or via the predictive approach, we already have to account for these issues that we have just discussed. And we have to account for a certain degree of uncertainty, error or noise, basically summed up as a statistical bias. And maybe as a little bonus from this lecture: you might now be thinking, alright, how often do I have to estimate the average height of people in a country from a sample? I want to build machine learning predictive models. Well, the same holds true for all the machine learning models. Why is that? Because the machine learning model, the predictive one, will also be learning from a sample. You just need to broaden your horizons with understanding what a sample and a population are. Let's say that I would like to build a predictive model which will be tackling my customer base: whenever a customer comes to my store, I want to have a prediction ready to recommend him or her the best product. Well, our machine learning model will be learning on our historical data. And what is then the population? Well, it's not only all the past customers which we have; it's also going to be all the customers which come in the future. We don't have data about them; we only have data about our past customers, and the population is also all the customers in the future. Again, we have the inference; we have to generalize to the whole population. So there's going to be the same problem. Similarly, you might like to build a visual recognition model that is recognizing breeds of dogs.
Well, let's say that we have available data; maybe we have pictures of all of the dogs in our city. Well, it's still a sample, because our population is all the dogs in the past, all the dogs in the future, and actually all the possible angles from which somebody could take a picture of a dog. Then we certainly only have a sample available; the population is much broader. Again, the same issue. Let's say that we are within the stock market and we would like to predict some stock prices in the future. What is the sample that we have available for our machine learning model? It's the past performance, it's the past data. And again, the future is what we are trying to generalize into, what we are trying to predict. So this understanding of a population, a sample and an observation, and that there is always some degree of a statistical bias, will hold true even when we will be building machine learning models. All right, that is it from this lecture. I thank you for being part of it and I'm looking forward to seeing you in the upcoming lectures. 8. I love the yellow walkman!: Hi, and welcome to another lecture in the course Be Aware of Data Science. In the previous lectures, we talked a lot about the biases. Of course, we started with cognitive biases that are inside of our mind, and basically we are hoping that the data can help us fight these cognitive biases so that we are making more rational decisions. However, I said that also the data are not sacred, and the data can hold, for example, some form of a statistical bias. And we should always be questioning the qualities of the data that we have available. Now, I have prepared for you a learning story which is called "I love the yellow Walkman", where we will reiterate on this notion of being skeptical towards the data that we have available. Now, for the learning story: do you remember Walkmans? Those were these big chunky things, almost the size of a shoebox, which were playing music. I do remember them.
Yes, I am that old. Nevertheless, around the year 2000, the breakthrough of the millennium, Sony was manufacturing Walkmans successfully, and they were selling a black color of the Walkman. And now they had a strange hypothesis: well, maybe if we would launch also a yellow color of the Walkman, our customers might enjoy it. How they decided to go about this hypothesis is that they invited their customers to focus groups. What are focus groups? Well, basically, you invite your customers, you ask them a couple of questions, and the answers to these questions generate for you a dataset based on which you can then make decisions within your business. These focus groups usually take from a few minutes up to maybe even an hour, and we would also provide some reward to our customers for having come. Now, one of the questions that they asked during the focus group was: would you buy this yellow Walkman? Would you maybe prefer it over the black color of the Walkman which we are currently selling? And the answer was: yes, I love the yellow Walkman, I would definitely buy it. If Sony collected data only this way, they might arrive at a conclusion that the yellow color of the Walkman might be well appreciated, and they would now start manufacturing them. They would have big expectations and they would offer it to the customers as well. But there is a catch within the story. Sony was likely very smart within the focus groups: they didn't only ask the customers to provide the self-reported data (so, would you buy this yellow Walkman); they were also offering the customers a reward in a very specific way. When a customer was leaving the room, when he or she was done with the focus group, they said: hey, pick your reward on your way out, please. And there were these two piles of Walkmans: one pile was with yellow Walkmans and the other pile was with black Walkmans. Now you can guess which pile was almost empty and which of these the customers picked the most.
Of course, it's the black Walkman. So even though the customers were saying during the focus group, yes, I would prefer the yellow Walkman, I would definitely buy it, well, when they were actually supposed to pick the reward, they were preferring the black color of the Walkman. Luckily, Sony also collected this second dataset from the focus groups, where you are not directly asking your customers what they think or what they would like; you are observing them while they're picking their reward, which gives you a much more unbiased dataset. So now we have these two views, we have these two datasets. Basically, you can see that oftentimes the quality of the data really matters: how was the data collected? Can we rely on it? And as I'm saying, usually self-reported data which are collected through surveys can have a lot of data quality issues. So to sum up this learning story, I hope that you will always be skeptical towards the qualities of the data that you have available, as there can be various forms and various sources of statistical bias. I'm looking forward to seeing you in the upcoming lectures. 9. The Limitation of Our Mind: Hi, and welcome to another lecture in the course Be Aware of Data Science. So now that we have covered the data part of data science, so we have discussed cognitive biases and why we would like to use the data, and we have also discussed statistical biases and potential issues in the data, I think it's time to move forward. We will move to the science part of data science. But before we get there, before we get to a data science model or some scientific methods, I would like to talk in this lecture about a specific limitation of our mind, as it will be a solid foundation for our upcoming lectures. Just a few minutes from now, I'm going to claim that data science creates scientific models.
Let me bring up this definition from Britannica, which has these somewhat dense definitions: scientific modelling is the generation of a physical, conceptual, or mathematical representation of a real phenomenon that is difficult to observe directly. Scientific models are used to explain and predict the behavior of real objects or systems. So this is our goal; a few minutes from now, we would like to get here. Now, let's come back, and as I promised, we will start from a very intuitive understanding of the limitation of our mind. Our mind is used to three dimensions. I think it's easiest to imagine the three physical dimensions around us: we have the width, we have the height, and we have the depth. If we are now supposed to imagine, say, the position of an object around us, our mind can easily use these three dimensions to define the position of that object, where it is. However, what if we were thinking in four or more dimensions? I, at least for myself, cannot imagine the world being defined by four or more dimensions. And now please excuse my lame knowledge of physics, but as it turns out, our world might be defined by more than four dimensions. Can our mind imagine it? No. I think this is the easiest way to see this natural limitation that our mind has: it is simply used to working in one, two, or three dimensions. We really have problems comprehending more than three dimensions at once. But we are not here for the physics, right? So let's be more concrete. We are experiencing this limitation of our mind also in our daily lives. Let's say that we were offered multiple jobs. We want to compare which one is better for us and which one we will take. Even though there are a lot of factors to consider about each of the jobs, our mind will naturally limit itself to a maximum of three dimensions. But please be mindful here: I am not claiming that our mind cannot comprehend more than these three dimensions individually.
What I'm saying is that our mind will have trouble comprehending more than three dimensions at once. So to make a final decision when comparing which job is better, we will limit ourselves to three, for example: what's the commuting time, whether the job role fits us, and what the salary is. The limitation is influencing our daily lives as well. But we are also not here for deciding about jobs, so let's come closer to data science. Now, let's say that you are a marketing manager and you want to design a data-driven campaign. You have a lot of characteristics available about your customers, such as social demographics or behavioral factors. And let's say that you do not have any machine learning available or anything like that. When you want to create some rules on which this campaign will run, again, you will be able to consider only a limited number of dimensions. For example: deliver this campaign to customers of higher age, who have a higher salary, and who own a certain product. Now, I think I got my point through with regards to the dimensions. The issue is that when we study a phenomenon, it will be defined by many dimensions. As we mentioned, your dataset about customers might contain dozens of social demographic and behavioral characteristics. And it will again be troublesome for our mind to comprehend this large number of characteristics, to design, for example, some model on top of which we can deliver marketing campaigns. So to sum it up, our mind is limited in the number of dimensions with which it can work, or in other words, which it can consider at the same time when drawing conclusions and making decisions. Now unfortunately, that is not the only limitation that our mind has. Another limitation, or another aspect of our mind that we have to consider, is its limited capability to handle a large amount of data. Where our mind can still work is when the amount of data is small. So for example, like in this case, we have some 15 customers.
We can, let's say, manually look through their characteristics and comprehend what is going on with them, for example, whether there is some purchase pattern. But what would happen if the number of customers increased? We would have to comprehend what is going on with hundreds, thousands, or even millions of customers. Of course, our mind is not able to do that. That is the second aspect, or the second limitation, of our mind. This time it regards essentially the number of observations, the amount of data that we have to comprehend. If this number grows, our mind hits a limitation again. Alright, and this brings us back to data science models and scientific models. A data science model is essentially a simplification of reality, or as you can see in this definition, it is a representation of a phenomenon that we are interested in. Now, this representation is useful precisely due to the limitations of the human mind that you have just heard about. If someone, let's say a data scientist, provides a model which is a simplification of reality, hopefully we can understand things about the phenomenon and potentially draw conclusions, such as how to design marketing campaigns. This was the reasoning for the science part of data science. We will be building scientific models just in the upcoming lecture. But within this lecture, I still wanted to provide you with a little bit of a bonus, and I would like to touch upon one famous quote which is used around data science a lot. It is originally attributed to George Box and it says: all models are wrong, but some are useful. Now, how I usually see this sentence being interpreted is that models always fall short of the complexities of reality, in the sense that they cannot grasp the complexity of reality, and that this is some sort of a weakness of data science and of machine learning models. Well, hopefully now you can appreciate that the interpretation should be a little different.
A model has to be wrong, because if it wasn't, then it wouldn't be a model. A model has to simplify reality, and for this simplification, you just have to ignore some noise, some randomness, and some nuances that reality brings. A model has to be wrong to a certain extent. It's definitely not the case that our techniques are too poor to comprehend the realities of life. Alright, that's it for this lecture, and in the upcoming lecture, we will build a data science model. 10. The "Science" Part of Data Science: Hi there. Let's continue our exploration of the science part of data science. The last lecture we concluded with the understanding that a model is attempting to represent some real phenomenon, to simplify it for our mind. Let's continue now. I will say that we can focus also on the second part of the definition, which says that scientific models are used to explain and predict the behavior of real objects or systems. It turns out that we have created this little box which is grasping the purchase decisions of our customers. How can we use it? What can we do with it? So as the definition says, we could use it to explain the behavior of our customers. This is option one that we have. We will basically look inside of the model and observe the patterns that the model has learned. Now, this might be a bit counter-intuitive for now, but later on in the course, where we will be talking about machine learning models and predictive models, it will be a bit clearer. Now, if it is a machine learning model that has been learning on historical data, how is it useful for us? Well, it was learning on historical data that would be, though, tricky to comprehend with our mind. So we look inside of the model, observe what the model has learned, and use that to explain the purchase decisions of our customers. And what would be an example of using a data science model such as this? An example of this could be our fruit and vegetable stand.
We have discussed it previously: we learned that apples and bananas are frequently bought together. We would reuse the way we are displaying the apples and bananas for some other combinations of fruits and vegetables. Thus, we have adjusted our business process in some way based on this data science model. This option might seem a bit counter-intuitive, but as I say, later on it will make more sense. So this is option one that we have: we are looking inside of the model and letting it explain the phenomenon for us. The second option that we have is that we would use our model to predict the behavior of our customers. Now, we would provide some data as an input to the model, and we would observe what comes out as a prediction on the other side of our data science model. So let's say that we are running some sort of loyalty program at our fruit and vegetable stand, so we have a lot of historical data about the purchases of our customers, and we also have an option to call our customers. Now, we would provide the purchases from the previous week as an input to our model, and we would let the model predict what the customer might buy in the upcoming week. The model would say: hey, there is a high probability that this customer will purchase asparagus in the upcoming week. So we would pick up a phone and call the customer and say: hey, wouldn't you like to come and buy an asparagus? I just got a fresh batch. Now, of course, the model will not be perfect. With some customers it will be wrong, and the customer will not come and purchase the asparagus from us. But with some customers, we are hoping that the model will correctly predict their behavior in the upcoming week. Thus, we create a business benefit by calling these customers and inviting them to buy the asparagus at our stand. As you can see, this is the second option of how we can utilize our data science model.
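To make these two options a bit more concrete, here is a tiny hypothetical sketch in Python. It is not a real machine-learning model, just a co-occurrence count over invented baskets; still, inspecting the learned counts corresponds to option one (explaining), and suggesting a likely next item corresponds to option two (predicting).

```python
from collections import Counter
from itertools import combinations

# Hypothetical historical baskets from our loyalty program (invented data)
baskets = [
    {"apples", "bananas"},
    {"apples", "bananas", "asparagus"},
    {"apples", "bananas", "carrots"},
    {"carrots", "asparagus"},
    {"apples", "asparagus"},
]

# "Train" the model: count how often each pair of items is bought together
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Option one (explain): look inside the model at the patterns it learned.
# ("apples", "bananas") shows up as the most frequent pair, so we might
# adjust how we display those two items at the stand.
print(pair_counts.most_common(2))

# Option two (predict): given last week's basket, suggest the item most
# often co-bought with what the customer already buys.
def suggest(last_week):
    scores = Counter()
    for (a, b), n in pair_counts.items():
        if a in last_week and b not in last_week:
            scores[b] += n
        if b in last_week and a not in last_week:
            scores[a] += n
    return scores.most_common(1)[0][0] if scores else None

print(suggest({"apples"}))
```

A real model would of course be learned differently, but the two ways of using it, reading out its patterns versus feeding it new inputs, are exactly the same.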
Either we are explaining what's happening, or we are letting the model predict what might happen, for example. Now, the very last thing that I would like to talk about when it comes to the science part of data science is: how are we creating these data science models? I mean, we by now understand why we need them, and we also understand how we utilize them. The only question that remains is: how are they created? Now, please don't get confused. I am not talking about the approaches and methods that we already touched upon, so descriptive, exploratory, inferential, or predictive. We are now one level above these. We first need to make a more general decision on how we go about the creation of the model, and only then will we decide about the particular approach and the connected methods of how we are turning the data into valuable information, like we will do later in the course. So we are now one level above it. There are, in general, two ways how data science usually goes when creating a model: there is observation, and there is experimentation. Both of these are coming from the world of science. Observation is, I think, easy to imagine. For example, in sociology, we observe how subjects such as humans behave and learn something from it. So let's say that we are observing how people walk and navigate around a library. In data science terms, these are most of the use cases that you will see nowadays: a company has some historical data available and uses it for observation. Basically, they are constructing the model by observing, or learning from, historical data. There is, however, a different path, which is unfortunately still considered a bit exotic by many companies nowadays: this is experimentation. Experimentation is mostly associated with and practiced within medicine nowadays. Let's say that you have a new medicament, and you hope that it improves patients' lives.
How would you know that this is true and that the medicament indeed improves patients' lives? Well, you can conduct a sort of an A/B experiment. To a portion of your patients, you provide the medicament, and to another portion of the patients, you only provide a placebo, which is not the real medicament. You wait for a bit and see if there is a difference between the two groups. If your hypothesis was right and your medicament works, then the lives of the patients to whom you provided the medicament improve, as opposed to those to whom you only gave the placebo. This approach is useful within data science when you don't have the data, similarly to how researchers within medicine do not have data about their medicament and do not know whether it works. So you would set up an experiment where you would offer your customers a product at different times of the day, because you have a hypothesis that, let's say, if you are making the offer in the morning, your marketing campaign will have a higher conversion rate. You would be focusing on collecting the data from this experiment: whether there is a difference in the acceptance of the offers by the customers. So you would essentially generate the data from which you can proceed further and from which you can build your data science model. Now unfortunately, as I said, this latter approach of experimentation I oftentimes see being undermined and underestimated by companies, which usually resort to observation. I think this underestimation has something to do with culture, maybe. I mean, imagine that data scientists just managed to persuade senior management that data science is something that we should be doing, that there is value in the data. And now, all of a sudden, they would come back to the senior management and they would say: hey, maybe we have a lot of data, but for this particular hypothesis, we do not have the right data.
We would need to conduct an experiment that would generate the data from which we can create this data science model. Well, maybe the senior management wouldn't be too happy about it. I think there are also data scientists that are a bit worried to experiment nowadays. All right, so these are the two ways of how we could go about creating a data science model. Now, just to summarize a little bit, we went through quite a lot on the science part of data science. At first, we started with the limitations of the human mind when it comes to the number of dimensions as well as the amount of data. Then we defined what a data science model is: it simplifies and represents a phenomenon, such as grasping the purchase decisions of our customers. When we build such a model, we can use it in two ways: for example, we can use it for explaining the phenomenon to us, to our limited mind, or we can use the data science model to predict, for example, based on the purchases from the last week, the purchases in the upcoming week. And how do we go about building these data science models? Well, a model could be constructed through observation or through experimentation, and you can sometimes even see these combined. When it comes to the particular approaches and methods of how we go about constructing this model, we will get to that later in the course. So that is it for this lecture. I thank you very much for being part of it, and I look forward to seeing you in the upcoming lectures. 11. Introduction to Chapter 2: Hello and a warm welcome to the second chapter in the course Be Aware of Data Science. In this chapter, we are going to talk about the disciplines of data science. In this very brief lecture, I want to invite you to the chapter, state the goal of the chapter, and also give you a bit of an outlook of what is ahead of us.
First of all, if we look at the overall course structure, we are already done with defining and understanding the essence of this field that we call data science. Now it's time to be a bit more practical and to tackle questions from the everyday life of data scientists. So first of all, there is a lot of confusion around data science. For example, people are still wondering: is data science the same thing as machine learning? Maybe you already know that this is not the case, but these are the confusions that we will be clarifying. And really, if we do so, we will help ourselves and gain a deeper understanding of data science by clearing out these confusions. Secondly, we should also be focused on the people who are around data science, so we will also be talking about data scientists. The second chapter will deepen our understanding of data science; we will view it from new perspectives. To maybe show you even better what is ahead of us, let's imagine that you right now go ahead and Google "what is data science", or even better, "who is a data scientist" or "how to become a data scientist". Unfortunately, you will be met with more questions than answers. There will be people mentioning the work with databases, there will be people mentioning some statistical methods and machine learning, and you will even stumble upon some advanced concepts from software engineering. Why is it that we are always met with this confusion? Well, the reason is relatively straightforward: data science is not a sole field, it's rather a joint initiative of several smaller, overlapping areas. All of these are putting something special on the table. Examining these fields, which are involved under the common umbrella of data science, will help us understand data science itself. Hence, the goal of this chapter is to provide us with an overview of these disciplines, as a kind of a starting line on which we can then build later during the course.
As you can see in this image, we will be talking about these areas. We will be starting with statistics, sort of the origin of what we call data science nowadays. Then we will be talking about databases and how these are in recent years turning into the topics of big data. Then we have data mining, which is sort of a field which emerged on top of big data, if you would like to call it this way. Then we are digging into the areas of machine learning, deep learning, and artificial intelligence. But now, to set the expectations: we are not going to study these individual disciplines deeply. We will be focused on: what is the special thing that this discipline puts on the table? How does it contribute to the overall field of data science? I wanted to mention this to set the expectation that we are here to study data science and not the underlying individual disciplines. Once we have covered the basics of these disciplines, we will move in the second part of the chapter towards data scientists. The image that you are seeing right now on the slide is my view of who a data scientist is. It is the T-shaped skillset of a data scientist, and we will slowly build it up. We will start by talking about the data science mindset. Then we will move towards the vertical component of this T-shaped skillset, meaning the things in which a data scientist should be deeply skilled, and we will understand what the necessary skills are when it comes to data science use cases.
Then, in the later part, we will move towards the horizontal part of this T-shaped skillset, which are the related skills and related disciplines in which a data scientist should still have some idea. Hopefully, this will also give you ideas on how you can connect to data science use cases, for example, if you are coming from a data engineering or cloud engineering background, or maybe you are just coming from a certain domain such as banking or insurance. Whatever domain you are coming from, you can always connect to the data science use cases which are done in your area thanks to your domain knowledge. We will talk about the importance of domain knowledge later in the chapter. Now, as a small bonus, if you then visit the bonus section of the course, I am providing there a bit more informal lecture on the T-shaped skillset of the data scientist, where I reflect on it and give you hints and tips, in case you, after taking this course, decide to continue and grow further into the field of data science, on how you can use this T-shaped skillset as a sort of advice and guidance for your studies. Alright, so this was the goal of this chapter, and I have provided you with an overview of what's ahead of us. I'm really looking forward to seeing you in the lectures. 12. Disciplines: Statistics: Hi there. It is time that we start exploring the disciplines of data science and studying how these are all contributing together, creating what we call data science nowadays. In this lecture, we will start with statistics and databases. Why are we starting with these two? Well, because they are sort of the historical and early foundation of what we nowadays call data science. So in this video, we will take on statistics, and in the upcoming one, we will connect it with databases. So let's go for it and study the origin of data science in statistics.
Now, statistics as a field is ancient, and I don't mean that in any negative sense. We could be reaching out to famous statisticians decades or even centuries ago, or just pick a few examples of what they were doing across history. I think we can easily see the connection in this definition that I brought from the Cambridge dictionary: statistics is the science of using information discovered from collecting, organizing, and studying numbers. And I think you can already spot the connection between statistics and data science. We said that data science is the art of turning data into valuable information. As a matter of fact, statistics, or the statistical approach, will be one of the ways a data scientist can go when creating valuable information out of the data. And this is still prevalent nowadays. If we move to the contemporary era, you oftentimes see results of statistical studies. For example, that a certain dietary supplement is helpful for a certain health condition, that is a result of a statistical study; or, on the contrary, that some habits such as smoking can lead to negative health implications. Long story short, statistics as a field has been super helpful to humanity for a very long time, and it still is. It allows us to quantitatively study phenomena and understand them. Now, you might be thinking that we will now start exploring the world of statistics and all of the statistical methods that are available, such as descriptive statistics and inferential statistics. We are not going there; we will do this later in the course. So in the chapter about describing and exploring the data, we will talk about the descriptive methods of statistics, and in the chapter about inference and predictive modelling, we will talk about inferential statistics.
Remember what we are after within these lectures about the disciplines of data science: we want to find the key ideas of how these disciplines contribute to what we call data science nowadays. What special thing do they put on the table? Later in the course, we will go there; do not feel upset that we don't discuss it now. You will have lectures, assignments, and much, much more. So, what is the key contribution that statistics has made to data science nowadays? I would say it's the idea of a hypothesis. Now we have something new which we did not discuss just yet, which in my view is really the most essential and important contribution of statistics to data science nowadays: a hypothesis. We are interested in something. This is a definition by Oxford dictionaries, and it says: if we are talking about something countable, it's an idea or explanation of something that is based on a few known facts but that has not yet been proved to be true or correct; or, if we are talking about something uncountable, guesses and ideas that are not based on certain knowledge, more of an intuition. We are revolving around this idea of a hypothesis. So good data scientists nowadays, similarly to statisticians, are often starting their use cases with a particular hypothesis in mind. For example, as we already mentioned, we believe that a certain dietary supplement might have a positive impact on our health. We would start with this hypothesis and proceed to data collection. We would collect the data and then either accept or reject the hypothesis that we had at the beginning. Now, this idea of a hypothesis is really important. I want to run really quickly through one example of a use case where we are starting with a hypothesis and taking the statistical approach. The hypothesis that we have is: a diet enriched with citruses helps sailors' health. So let's say that someone provided us with this hypothesis. We are statisticians. What do we do?
Well, we design a little experiment where we would collect data about this hypothesis. We would provide ships with two different kinds of diet. To 36 ships, we would provide a diet which is based on lentils and beans. Now let's imagine we are a few hundred years back, when statistics was just in its early beginnings. So we would provide 36 ships with this basic lentils-and-beans-based diet. To 42 ships, we would provide a different diet: the typical diet enriched with citruses, because that is what our hypothesis regards. We would let the experiment run, and once the ships are returning, we would observe the average percentage of sailors with medium or large health issues on each ship, which we of course want to be as low as possible. And we would see that on the ships which had the typical diet, approximately 55% of the sailors had medium or large health issues, whereas if we look at the ships which had the diet enriched with citruses, we would see that only 22% had such health issues. And this is now telling us that our hypothesis might be correct, and indeed the citruses are helping out with sailors' health. This is the key takeaway from this lecture: the statistical approach, the statistical mindset, is to start from a certain hypothesis and then proceed towards data collection, just as we did within this very quick use case. And as I'm saying, good data scientists nowadays are still adopting this approach: starting from a hypothesis, then proceeding to the data, collecting it, and then either accepting or rejecting the hypothesis. All right, and in the upcoming lecture, we will connect statistics with databases. As you can imagine, once we start with the statistical approach, data start to arise, and we need to store them somewhere. See you in the upcoming lecture. 13. Disciplines: Databases: Hi there. In the previous lecture, we discussed statistics, and maybe we have conducted some statistical experiments resulting in various datasets.
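Before we move on, here is a small aside on how a ship-diet comparison like the one in the statistics lecture might be checked numerically today. The per-ship percentages below are purely invented to roughly match the 55% versus 22% averages from the story; a simple permutation test then asks how often a random relabelling of the ships would produce a difference as large as the one observed.

```python
import random

# Invented per-ship percentages of sailors with medium or large health issues
typical_diet = [58, 51, 60, 49, 55, 57]   # lentils-and-beans ships
citrus_diet = [20, 25, 18, 27, 22, 21]    # citrus-enriched ships

observed = (sum(typical_diet) / len(typical_diet)
            - sum(citrus_diet) / len(citrus_diet))

# Permutation test: shuffle the group labels many times and count how
# often a random split produces a difference at least as large as observed
random.seed(0)
pooled = typical_diet + citrus_diet
trials = 10_000
count = 0
for _ in range(trials):
    random.shuffle(pooled)
    a = pooled[:len(typical_diet)]
    b = pooled[len(typical_diet):]
    if abs(sum(a) / len(a) - sum(b) / len(b)) >= observed:
        count += 1

p_value = count / trials
print(f"observed difference: {observed:.1f} points, p-value ~ {p_value:.4f}")
```

A tiny p-value means we reject the idea that the diet makes no difference, which is exactly the accept-or-reject step on the hypothesis described above.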
We need to store the data somewhere. In this lecture, we will discuss the second predecessor of data science, which is databases, or the discipline of databases. Usually, people tend to tell themselves: the data is somewhere and we use it. Well, that's not quite it; there is more to it. For data science, data are, in an ideal scenario: stored safely, as we do not want our data to get lost or stolen; accessible and retrievable, as we want to be able to reach out to the data whenever we want and take it out of its storage, for example, move or copy it somewhere else where we will analyze it; and finally, described and known, as we want to know what data we exactly have. Oftentimes, we want other people to know what data we have; for example, we want the sales department to be aware that the customer care department is storing certain useful data. These are the three characteristics of what is ideal for data science when it comes to databases and storing data. Now, if we take the aspect of time and look historically at what was going on when it comes to databases, we could start in ancient times. People have been keeping records of various things since they were building the pyramids. Doctors kept paper-based records of their patients to keep track of them. Farmers kept paper-based records of their harvest and weather. Without storage of and access to the data, we cannot, of course, create any valuable information. Hence, clearly, without data stored in databases, there can be no data science that would create valuable information, even though people have been storing and accessing data already a long time ago. But let's move to a more contemporary era. When we move to the 20th century and talk about databases, we really are talking about the so-called relational model of a database.
This dates back to approximately 1970, when Edgar Codd published a paper explaining the relational data model, which was a revolution of its own: a database user now just wrote a query that defined what he or she wanted, while not having to worry too much about the underlying data structures or where the data was physically stored. If we continue further in history, we discover that companies attempted to bring together data from various databases, so that more complex operations and calculations could be done on top of them. This is when we start to refer to them as data warehouses. If we think about the electronic form of storing data, the discipline of databases started to emerge rapidly around the 1980s. These were essentially collections of data on a certain phenomenon; companies have been creating databases storing data about their products, customers, or equipment. Companies have thus been creating and using various database options, usually custom ones. Many companies would literally have a little server, or multiple servers, somewhere in the back closet, you could say. For many years, the aspects which we highlighted above were seemingly well addressed: we had a safe solution for storing the data, and we were able to access, use, and retrieve it. Unfortunately, as companies continued their digitization process, various problems started to occur. From my perspective, the most prevalent one is the one of silos. Parts of a company were really getting closed off around a certain product and the data around it. Many companies, especially the large ones, can hugely benefit just from managing to successfully connect the various data sources which they have. For example, there are two big divisions in an organization, one handling sales, one handling customer care. If the sales department has access to the data of customer care, they could integrate the knowledge extracted from this data into their own processes.
So yeah, we were starting to have this electronic form of data, the relational data models, and even data warehouses. Unfortunately, as time was passing by, due to these custom solutions, silos were being created, and it was becoming a bit of a problem. Now let's move to the current times. The century started with the introduction of cloud computing. I could hold an entire lecture just to introduce you to the basic idea and basic concepts of the cloud, but let's stay very simple. Instead of buying and maintaining that custom server in your back closet, you rent one from a wide and huge network which is available globally. This will bring you a lot of benefits, such as lowered costs, better security of your data, better accessibility of your data, and much more. This shift, which companies have been making primarily in the past ten years, is also accompanied by the attempt to monetize the data as much as possible and to break down the silos which were occurring some decades ago. This, of course, goes hand in hand with why data science is so popular nowadays. If a company has a well set up cloud solution, it can be a really potent foundation for data science. For example, you have a certain dataset, and your data scientist now has a nice idea about a data science use case. All it takes is a few clicks, really, and he or she can right away have a server that is there just for him or her, with state-of-the-art tools and the data ready. It is really wonderful to have a well-established cloud infrastructure. Unfortunately, for example when you talk about the EU region with its stricter rules, companies oftentimes have to be careful when it comes to complying with the European regulations, which might sometimes be limiting the cloud adoption. That's not everything from this lecture, of course; I also want to showcase one example of how a data scientist or a data science model interacts with a database solution.
What I have over here is really a conceptual drawing of how a database in a company might look. First, you would have some sources of data. For example, your website is generating some data, your invoice system is generating data, and your warehouse is generating data. These are really the raw sources of the data. Then they get transferred to your relational database. There, they might still be stored in their raw form. So we would have the logs from the website, we would have the orders and items from our invoice system, and then we would have the shipping data from our warehouse, which is still not overly useful. We want to increase the usefulness of this raw data by various harmonization and organization techniques. For example, we would create a customer table in which we store, from various sources, all of the data that we have about our customers. It could be the invoices, it could be the customer's interactions with the website. Then we would have one table which is about orders: the order from the invoice system, which items this order contained, and also the shipping which was made with regard to this order. This would already be what we would call pre-cooked data. It is prepared for us, maybe for some specific purpose. Now, a data scientist or a data science model could be interacting with this database on various levels. The most classical and simplest one is to consume the pre-cooked data. For example, we would consume the prepared customer table, and we would create some predictive model on top of it. It's there, it's prepared for us; we just query it and use it for our purpose. The second way we could be interacting with our whole database could be that maybe we think the way the data was pre-cooked and prepared for us is not ideal, and we would need it in a slightly different shape.
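To make that first level of interaction concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name, columns and values are hypothetical stand-ins for a company's "pre-cooked" customer table; a real setup would of course be a shared database server, not an in-memory toy.

```python
import sqlite3

# In-memory database standing in for the company's relational database.
conn = sqlite3.connect(":memory:")

# A hypothetical "pre-cooked" customer table, already harmonized
# from raw website logs and invoice data.
conn.execute("CREATE TABLE customer (id INTEGER, name TEXT, total_spent REAL)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(1, "Alice", 120.0), (2, "Bob", 80.5), (3, "Carol", 240.0)],
)

# The data scientist simply queries the prepared table and feeds
# the result into a model or an analysis.
rows = conn.execute(
    "SELECT name, total_spent FROM customer WHERE total_spent > 100"
).fetchall()
print(rows)  # the high-value customers: [('Alice', 120.0), ('Carol', 240.0)]
```

The point is that the hard preparation work already happened upstream; the data scientist only writes the query.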
So we would reach out to the raw data that we have in our relational database, and we would prepare it ourselves. We would create different views on top of this data, which would then fit our data science use case better. Lastly, we could even go all the way back, and we could influence the way we are collecting the data. Maybe we have a certain hypothesis, like from the field of statistics, and we now come back to our warehouse colleagues and tell them: hey, this is an aspect that we should be collecting about the shipping of our orders. They would now collect a new data point, which then runs all the way through the database, and we can use it for our data science model. As you can see, a data scientist or a data science model interacts with a database solution in different ways. So that is everything that I wanted to discuss in this lecture. We ran through the history of databases to really outline where we are headed and what we want to achieve with our databases: we want the data to be stored safely, we want to have it accessible, and we want to have it described. In the second part of the lecture, I highlighted a conceptual understanding of how a data scientist or a data science model interacts with databases. I'm looking forward to seeing you in the upcoming lectures. 14. Disciplines: Big Data (1): We will continue our exploration of data science disciplines. We have already covered statistics and databases, and in this lecture, which will span across multiple videos, we will be tackling big data and data mining. In this video, I would like to talk about the emergence of big data, which I will then connect to an explanation of how the emergence of big data influenced data scientists and how they had to adjust and change their processes. So let's talk about the emergence of big data. The problem is simple, and I think we are all aware of it.
The amount of data has been rising since the breakthrough of the millennium. A big point in the history of humanity was the onset of digitization. Processes are getting digitized, and people interact with digital products more and more. All of these digital products and processes generate data, for example due to simple operational reasons: if you are a bank, you might just have to store the logs of how your customers interact with your internet banking solution, simply due to legal reasons and obligations. So all of these digitization efforts and processes gave rise to the concept that we call big data. Even though I am describing databases and big data in two separate lectures in this course, there is not exactly a clear cut between the two, in case you are interested. Now, there is no explicit and broadly accepted definition of how big the big data really are. As hardware and computational capabilities are constantly evolving and changing, such a definition would also have to be constantly changing and evolving. Maybe one useful way to think about big data is that it is the amount of data that you wouldn't be able to work with on your laptop. Luckily, in 2004, Google introduced its famous MapReduce algorithm, which enables data scientists and companies to distribute large chunks of data that do not fit on a single machine across different machines. These machines are then going to collaborate to analyze one big dataset together. Thus, we can still work with big data, don't worry. Now, I would like to start the conversation on big data with regard to misunderstanding and overestimation. What do I mean? At first, on the misunderstanding. I think the misunderstanding comes from the wild 2010s, when data science was peaking in its hype. By hype, I mean the maximum of the difference between what is perceived as being done and what is really being done.
Many people were attempting to define data science as statistics that deals with big data: that it would be the same methods which we have been using for decades within statistics, just adjusted so that they can be applied to these vast datasets which are now available. That is simply not true. In the upcoming video I will explain the difference that comes with big data, and how data scientists have to adjust their approaches to working with and analyzing datasets. No, data science is not statistics that deals with big data. That's the misunderstanding. Secondly, I would say that big data also comes with overestimation. Think about which companies really have big data and also have the need to utilize it, so both having it and having the need to utilize it. It's not that many companies. The classical examples are of course banks with their transactional and markets data, telcos with their voluminous log data from all the telecommunications, some areas within healthcare, and a few more. The truth is simply that a load of data science use cases do not deal with big data. Also, it might just happen that a data scientist stumbles upon some bigger dataset and can simply subsample it: create a random subset, a random sample, from this larger dataset, work with this smaller sample, and still draw some conclusions. So occasionally, even if you stumble upon big data, you are still able to work with it as if it were a smaller dataset. One important thing to understand about big data is that it is kind of overestimated nowadays, and simply a lot of use cases are just not dealing with big data. Alright, let's summarize with some key takeaways from this lecture. Big data emerged due to digitization: processes going online, people interacting with them. Secondly, big data is often misunderstood in its connection to data science.
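The subsampling idea mentioned above can be sketched in a few lines of plain Python. The "big" dataset here is just a stand-in list, and the seed is only there to make the example reproducible.

```python
import random

random.seed(42)  # reproducible sampling for the example

# Stand-in for a dataset too large to analyze comfortably in full.
big_dataset = list(range(1_000_000))

# Draw a random subset and work with that instead of the whole thing.
sample = random.sample(big_dataset, k=10_000)

# Conclusions drawn on the sample approximate the full dataset;
# here, the sample mean is close to the true mean of 499999.5.
sample_mean = sum(sample) / len(sample)
print(sample_mean)
```

As long as the sample is drawn randomly, estimates like this mean stay close to the values you would get from the full dataset, which is exactly why subsampling often lets you avoid big-data tooling altogether.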
And lastly, I would say that it is oftentimes overestimated when it comes to data science. Now, in the upcoming lecture, we will keep on talking about big data. You will typically hear trainers talking about the V's of big data, such as volume and velocity. I do not want to go there. I mean, yes, the amount of data is increasing, but the important thing is to realize what sorts of data we are gaining, and we will continue with that in the upcoming lecture. 15. Disciplines: Big Data (2): Hi there. In the previous lecture, we started to talk about the emergence of big data. Let's build upon it and discuss how it impacts data science as we know it nowadays. A big change relevant for data science, which comes indirectly with big data, is the nature of the data. Think about how datasets were created in the days before all of the social networks, online shopping, the telcos and the machinery sensors, which are the traditional sources of big data. Someone had a hypothesis, for example a statistician, and collected data specifically to verify this hypothesis. Those were oftentimes very purposed collections of data, just like I told you about in the previous lecture on statistics. For example, we would specifically collect data about sailors' health, by asking them, to verify our own hypotheses. Now, however, in the era of digitization and big data, we also have unpurposed collections of data. Let's make a clear distinction about what I mean when I say a dataset has a purpose. I really mean a data science purpose: a purpose regarding the valuable information which is derived from the data. For example, is a survey on political preference purposed data? Yes, the information on overall political preference is derived from it. Is a bank's transactional data a purposed dataset? No. The bank is primarily using the transactional data for legal purposes and customer experience purposes.
But let's say that from a data science perspective, there is no purpose at the moment when the data is created and stored. So this would be, for me, the clear distinction between purposed and unpurposed data. Alright, then what does this difference between purposed and unpurposed data have to do with data science? Well, there are two things. First of all, if a data scientist intends to work with an unpurposed collection of data, he might need, firstly, legal approval for the new purpose, and secondly, availability of the data for this new purpose. For example, a data scientist who works for a bank sees the transactional data. He cannot just take it and create a new sales model with this data. Within the European Union, for example, the customer needs to be aware that his transactional data might be used for such a purpose. Then, when it comes to the availability of this transactional dataset that the data scientist would like to use: well, he might have to extract the data from some operational system, where the data is stored for operational purposes and not for our data science purpose, and bring it to some analytical system where he can analyze it. This is the first change that unpurposed datasets are bringing to data science nowadays. Secondly, unpurposed collections of data give a new perspective on the job of a data scientist. You see these datasets around you, and your job is to give them a new purpose. This is exactly what companies expect from data science nowadays: that data scientists and other people come and work with the data that these companies have, and give these datasets a new purpose and new ideas about what valuable information could be derived from them. This is the second big change: it gives a new perspective on the job of data scientists and people who work with data. Alright, so a key takeaway from this part of the lecture, with the rise of big data and digitization:
Nowadays, not all datasets have a purpose from a data science perspective, and our task might be to put a new purpose on top of such a dataset. All right, now that we understand the crucial aspects of big data, I have a few more things which I want to say on top of it, which then nicely connect us to data mining. For example, as I mentioned, these are larger, sometimes very large datasets. So clearly, if a data scientist needs to work with them, they need some new tools. The way you can imagine these is that the dedicated tools do the very same thing as a normal tool that works with a smaller dataset; they are just adjusted so they can work with larger datasets. Remember to think about big data as possibly the data that would not fit on a single machine and that we wouldn't be able to analyze on a single machine. An example: let's say that we have data about 10 million people and their heights. It would either take too long, or it wouldn't even be possible, to calculate the average height of these 10 million people on a single computer. Instead, these dedicated tools would break the 10 million people into groups of 1 million people each, and they would provide ten computers with these chunks of the dataset. Each computer would calculate the average height of its 1 million people, then we would collect the summaries from these ten computers and make the final calculation of the overall average height. So this is kind of an intuition for how these tools work. Secondly, it's also about new methods. I would say that the new focus mainly regards the unpurposed collections of data that we mentioned previously. In the old days, statistics worked with well-defined statistical models and tools on top of those purposed collections of data. You would oftentimes prioritize that the data fit some statistical model.
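The ten-computer intuition above can be sketched in plain Python, with each "machine" reduced to a function that returns only a partial sum and a count. This is the MapReduce idea in miniature; real tools such as Hadoop or Spark do the same across actual machines. The heights are randomly generated stand-in data, scaled down for speed.

```python
import random

random.seed(0)
# Stand-in for the heights (in cm) of 10 million people; smaller here.
heights = [random.gauss(170, 10) for _ in range(100_000)]

def partial_summary(chunk):
    """What each 'machine' computes locally: a sum and a count."""
    return sum(chunk), len(chunk)

# Split the data into 10 chunks, one per 'machine'.
chunk_size = len(heights) // 10
chunks = [heights[i:i + chunk_size] for i in range(0, len(heights), chunk_size)]

# Map step: each machine summarizes only its own chunk.
summaries = [partial_summary(c) for c in chunks]

# Reduce step: combine the tiny summaries into the overall average.
total = sum(s for s, _ in summaries)
count = sum(n for _, n in summaries)
average_height = total / count
print(average_height)  # close to 170
```

Notice that only the small (sum, count) pairs travel between machines, never the raw data, which is what makes the approach scale.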
Well, however, this had to change. Now, with the rise of big data, we rather have to prioritize the data over the underlying statistical model. We will get to statistical models later in the course; for now, remember that with the rise of big data, we became less formal with our modelling approaches. This last notion connects us to the second part of the lecture, where we will talk about data mining, which is exactly the discipline that brings this change, where we become less formal with our modelling approaches and techniques. So see you in the upcoming video, where we discuss data mining. 16. Disciplines: Data Mining: Hi, let's continue our exploration of big data and unpurposed collections of data. More concretely, in this lecture we'll talk about data mining, which is a newer approach to responding to the need of analyzing these unpurposed collections of data. As this is again a new term, let's take a look at the definition by the Oxford Dictionary: looking at large amounts of information that have been collected on a computer and using them to provide new information. Now, the terms knowledge discovery in databases and data mining refer more or less to the same concept, the distinction being that data mining is prevalent in business communities, while KDD, knowledge discovery in databases, is most prevalent in academic communities. Now, this definition seems awfully similar to what statistics used to do, right? I mean, in a way you are right, but let's put things into perspective. We are talking about big data, and these unpurposed collections are rising in amount. You might have noticed a clear distinction, and this distinction relates exactly to the rise of big data. Within the realm of statistics, you usually start with a hypothesis, which we discussed before, and only then would you think about what data you will need to accept or reject this hypothesis. Within the realm of data mining, however, we will be starting with the data. Remember, we have lots of it.
And only then would you be thinking about what question you could ask the data, so that new and valuable information could be derived by answering the question that you have asked of this data. So as you can see, big data gave us a new way of deriving valuable information out of raw data. What would be an example of data mining? To see this contrast better, one very classical example is clustering. We do not know much about the data, and we also have no particular hypothesis in mind. We are starting from the data and trying to figure out what kind of valuable hypothesis we can put on top of it; what is the new question that we could be asking? Clustering might be simply about finding similar groups of our customers. So we have customers as observations, and then we have various data about their socio-demographic characteristics. Clustering will put together customers who have similar socio-demographic characteristics and maybe similar behavior. We have no concrete question or hypothesis; we have just allowed this data mining method to group together customers who have similar behavior, and now we can work with it. Now we can ask: alright, are there differences in the ages of our customers? Are there differences in how customers interact with our products? How are these groups, formed by the algorithm, composed? Then we can proceed with more concrete hypotheses which we put on top of the data, or we can right away discuss the result with our business colleagues and say: hey, my algorithm has created these two groups, can we do something with them? Are they anyhow useful? So you can see, we are starting from the data, and only then are we proceeding to more concrete hypotheses. Now, as a last learning from this part, where we discuss big data and data mining, I would like to show you how these two fields, statistics and data mining, nicely click together when it comes to data science.
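The clustering idea described above, finding groups of similar customers without any prior hypothesis, can be sketched with a tiny hand-rolled version of the classic k-means algorithm. The customers, their (age, monthly spend) numbers, and the choice of two clusters are all made up for illustration; in practice you would reach for a library implementation.

```python
import math

# Hypothetical customers: (age, monthly spend) with two obvious groups.
customers = [(22, 40), (25, 35), (27, 45), (58, 210), (61, 190), (65, 205)]

def assign(points, centers):
    """Assign each point to the index of its nearest center."""
    return [
        min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        for p in points
    ]

def update(points, labels, k):
    """Move each center to the mean of the points assigned to it."""
    centers = []
    for i in range(k):
        members = [p for p, l in zip(points, labels) if l == i]
        centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return centers

# A few iterations of Lloyd's algorithm with naive starting centers.
centers = [customers[0], customers[-1]]
for _ in range(5):
    labels = assign(customers, centers)
    centers = update(customers, labels, k=2)

print(labels)  # [0, 0, 0, 1, 1, 1]: the two groups emerge on their own
```

No hypothesis went in; the groups come out of the data, and only afterwards do we ask what, if anything, they mean for the business.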
Both of these approaches, statistics and data mining, are preserved and well practiced in modern data science, and in some use cases, even a mixture of the two will yield the desired benefit. Imagine, for example, that you are talking with the owner of a trading company, let's say even a few hundred years ago. The owner of the trading company tells you: hey, I would like you to make my sailors happier and more productive. There is no concrete hypothesis like we had previously, so you would start by collecting and examining the various datasets that he has. You would have the number of sailors on various ships, the length of the sail, the salary, the type of the ship, the geographical region, the diet, and various productivity measures. So at first you do not have a hypothesis, but you have a lot of data available. You would proceed with data mining, and you would discover that there appear to be patterns within the diet variable: it somehow seems to correlate with the productivity and happiness measures. This variable appears to have some influence, but you are lacking a concrete explanation. So now you pose a concrete statistical hypothesis, which is that the happiness and productivity levels of your sailors are influenced by the diets that they have on the ships. We now come back to a few lectures ago, where we had a concrete hypothesis, and we would provide the ships with different diets. Afterwards, we would observe the results, which would be a little statistical experiment. You can see this could be the flow of a single use case: you are combining the two approaches. At first you start with data mining, finding some interesting patterns, and then on top of them you put a concrete hypothesis, for which you construct the statistical experiment. All right then, that's everything for this lecture. Key takeaway: with the statistical approach,
We are starting from a hypothesis and proceeding to the data; with data mining, we are starting from the data and proceeding to some hypotheses. I am looking forward to seeing you in the upcoming lectures, where we will explore more disciplines that together create data science. 17. Disciplines: Machine Learning: Hi, and welcome to another lecture in the course Be Aware of Data Science. We are continuing our exploration of the disciplines that together create data science, and now we will slowly touch upon the area on the right side. In this video we will discuss machine learning and gain an understanding of what it does and why it is important. In the upcoming one, I will talk about deep learning, and then we'll conclude with artificial intelligence. So let's start with the phenomenon of machine learning. What is it about? I will start with the statement that patterns matter. Let's say that I am creating this course while working from home, and at around 11 o'clock I start to prepare lunch, even though I'm not hungry yet. I do this because I know that I will get hungry around noon. This simple pattern that I learned about myself can save me a few minutes of hunger. This is a bit of a naive example, but let's say that I also need to go shopping. I have visited the shop many times in the past, and I learned that around two o'clock in the afternoon it has the fewest people in it, so I like to go shopping around this time of the day. Another useful pattern that I learned, which makes my life easier and more comfortable. Two useful patterns for me. Coming back to data science, we have already been working with a statistical hypothesis, which was the sailors and their diets, and our finding was that if we provide the sailors with an enriched diet, they will become happier. This is another example of a pattern.
A pattern is the valuable information that we are seeking when we are, for example, looking through the data. Patterns are useful; patterns matter. Now, unfortunately, there is a downside to it, which is that the process of searching for patterns in the data is costly. Manually going through all of the available data might take us a lot of time, even more so if we are working with large datasets, and remember our lecture on big data: we now have a lot of datasets as a result of the digitization processes. Moreover, as time goes by, some patterns might become obsolete while new patterns arise. If we search for patterns manually, we would constantly need to search for new ones. Alright, so we know that patterns matter and are valuable, but at the same time searching for them manually might be very costly. Well, where do we go? We resort to machine learning, and I think this is why machine learning is so popular nowadays. But let me bring one more important motivation for machine learning: instead of us humans searching for valuable patterns in the data, we let a machine, or an algorithm, do the search for us. We hope that the machine will learn valuable and non-obvious patterns. You might now feel a bit confused if you have some prior knowledge about machine learning: aren't we supposed to predict the future with machine learning? Well, we can predict the future even without machine learning, if we have the right pattern. Like I have learned about myself: at 12 o'clock I get hungry, so at 11 o'clock I start cooking. So it's not about being able to predict the future; we can't do that, even with machine learning. Machine learning is about being able to automatically search through the data, so that an algorithm, or a machine learning model, finds some valuable patterns for us. I am also bringing a more formal definition of machine learning.
So machine learning is the study of computer algorithms that improve automatically through experience. Now we can go back to my example: I have visited the shop many times in the past, and I have learned the pattern. A machine learning algorithm does the same thing. It is seen as a subset of artificial intelligence; we will come to that in the upcoming lectures. Machine learning algorithms build a model based on sample data, known as training data, in order to make predictions or decisions without being explicitly programmed to do so. Now I think we can nicely imagine the machine learning algorithm. It is super useful for companies who have some prior datasets, which might be completely unpurposed: we would let machine learning search through this data, and hopefully it would find some patterns which, for example, allow us to predict some future behavior of our customers. So now we have gained an intuitive understanding of why machine learning is useful and what it is about. We will come back to machine learning in chapter four, where we will talk about inference and predictive modelling. There is even an assignment where you will have a chance to become a machine learning model: you will be searching for patterns in historical data about deer migrations, and then your task will be to predict some migrations yourself. Alright, so this was the key idea of machine learning. However, it has a subset called deep learning, and I will talk about that in the upcoming lecture. 18. Disciplines: Artificial Intelligence: Hi there, and welcome to another video, in which we will conclude our little journey through machine learning, deep learning, and artificial intelligence. Artificial intelligence is a sort of discipline of data science, because data science has some overlap with the area of artificial intelligence, and this overlap is mainly with machine learning. So let's focus on defining artificial intelligence, and we will see the difference.
I frequently see people kind of missing the clear distinction between data science and artificial intelligence, so I am bringing this definition by Britannica: artificial intelligence is the ability of a digital computer or computer-controlled robot to perform tasks commonly associated with intelligent beings. Even though this is a vague definition, for me the key term is the word perform. Hopefully now we can already grasp where the overlap between data science and artificial intelligence is: it is mainly within machine learning. Both of these areas use machine learning for some automated search for patterns in the available data, but data science is more passive, while artificial intelligence will more actively attempt to construct an intelligent system which will maybe act independently, such as the robot that I have in the picture. Let's talk about two examples of use cases that would fall under artificial intelligence. First of all, let's take a case where we would like to construct a virtual assistant that will help our customers by answering the most common questions, either via phone or via text. Thanks to this, we will save costs on our side and hopefully increase customer satisfaction, as the responses will be faster. We would at first want to use machine learning, which will search for patterns in our data. We let the algorithms search for the most common questions asked by customers, and maybe the algorithm will come up with the 15 most common questions. Then we would link the most suitable answers to these most common questions, and this will then be the intelligent system. We just need to connect it with the perform capabilities: we would allow the model to read or listen to the input from the customers, and then respond automatically with the most suitable identified answer. So you can see it is a more complex system, and the machine learning is under the hood.
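One small piece of such an assistant, matching a customer's wording to the closest of the known common questions, could be sketched with Python's standard difflib module. The questions and answers here are made up, and a real assistant would use far more robust language understanding; this only illustrates the "link answers to common questions" idea.

```python
import difflib

# Hypothetical "most common questions" found by the pattern search,
# each linked to its prepared answer.
faq = {
    "how do i reset my password": "Use the 'Forgot password' link on the login page.",
    "what are your opening hours": "We are open 9:00-17:00 on weekdays.",
    "how can i cancel my order": "Open your order history and click 'Cancel'.",
}

def answer(question):
    """Match the customer's wording to the closest known question."""
    matches = difflib.get_close_matches(question.lower(), faq.keys(), n=1, cutoff=0.5)
    if matches:
        return faq[matches[0]]
    return "Let me connect you to a human colleague."

print(answer("How do I reset my pasword?"))  # tolerates the typo
```

The pattern-found questions are the "intelligence", and the matching plus automatic reply is the "perform" layer on top.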
It is still searching for the patterns, but on top of it we have the perform capability. Another use case, or example, of artificial intelligence would be this little robot called Pepper. The robot is literally a physical example: it moves around the space, and thanks to the processing of visual data, it can see the world around itself and navigate within it efficiently. It can also listen to people and their commands. So you can see that under the hood of this robot are many machine learning models and deep learning models, which can recognize visual inputs and audio inputs, and then again there is this perform layer on top of it: it is able to act independently, as if it were an intelligent being. I hope these two examples highlight what artificial intelligence is about, and now we also understand the difference between data science and artificial intelligence. I have a small bonus for you within this lecture, which again regards the European Union. The European Union has been taking the space of artificial intelligence pretty seriously in recent years. It even constructed a High-Level Expert Group on Artificial Intelligence, which is supposed to come up with regulations for artificial intelligence systems. And as you can see, they started off by defining what artificial intelligence means: artificial intelligence refers to systems that display intelligent behavior by analyzing their environment and taking actions, with some degree of autonomy, to achieve specific goals. Now, this is kind of an interesting one, and many industries are intrigued by this definition. It is very wide: it is not only about the robots or virtual assistants, it also includes a wide suite of models which we would previously call statistical models or machine learning models. Really, all it takes to fall under this definition is that you display some intelligent behavior and take actions.
If you have some machine learning model which is analyzing historical data about your customers and then automatically sending them a promotion campaign at the end of the month, well, that is already, by the definition of the European Union, an artificial intelligence system, and it is going to fall under all of the regulations which are coming up. This was a bit of a bonus: the European Union is pretty serious about AI, and it is bringing up a very wide definition which will cover a lot of systems that many companies use nowadays. That's everything that I wanted to provide you with within this lecture. I hope that it is clear now how data science, machine learning, deep learning, and artificial intelligence all work together, and how these individual disciplines contribute to what we nowadays call data science. I'm looking forward to seeing you in the upcoming lectures. 19. T-shaped Skillset of Data Science: Hi, and welcome to another lecture in the course Be Aware of Data Science. My name is Robert, and in this lecture I would like to introduce you to the upcoming parts of the chapter, where we will answer the question: who is a data scientist? So let me give you the goal and an overview of what's ahead of us. What you have learned in the previous lectures is that data science is not a sole discipline, but rather an interdisciplinary field, and all of these disciplines, such as databases, statistics, data mining, and of course machine learning, contribute to together create what we call data science. Now, if we ask ourselves who a data scientist is: is a data scientist an expert in all of these disciplines? Certainly not. A data scientist picks these disciplines up to a necessary extent, or to a certain extent. So looking at the disciplines is not too helpful for answering who a data scientist is. We will rather use a new concept, which I like to call the T-shaped skill set of a data scientist.
If you are not familiar with the T-shaped skill set: when we talk about any job or any occupation, we have basically two components. We have a vertical component; that's something where a person should be really good. So we have a couple of skills where the data scientist should be really good and hold the responsibility over these areas and over these components of a data science solution. Then we have the horizontal component, where on the left side we have what we call a soft wing, and on the right side we have what we call a technical wing. These are areas where a data scientist has some knowledge and some capabilities, but usually has to rely on other colleagues, who are, for example, experienced in the business, or who are experts in some infrastructure components. This is the general idea of the T-shaped skill set, and as you will see in the upcoming videos, we will discuss each of these components and go into detail about what they mean, to really build up this T-shaped skill set and to answer the question: who is a data scientist? 20. Skills: Mindset of a Data Scientist: Hi there. We have started building our T-shaped skill set of a data scientist, and in this video we will start at the very heart of this T-shaped skill set, with the data science mindset. What do I mean by a data science mindset? Well, actually, there are a couple of components which I think every data scientist should have in their mind. Let me explain. First of all, it's about being skeptical, and here I really mean that a data scientist should be skeptical about everything:
Towards the data, towards what the business stakeholders say, even towards what the method that he or she is using says. Let me go into the details. When it comes to being skeptical towards the data: you have already seen in the previous lectures that there can be biases in our data, such as statistical bias. A data scientist should always question, for example, whether a data collection method might not have introduced some form of statistical bias into the dataset that the data scientist is using. So be skeptical towards your data. Secondly, being skeptical towards business stakeholders. I mean, if you are unlucky, and unfortunately it always depends upon the organization, it might happen that your business stakeholders are going to be pushing ideas for projects and use cases which might not be the best overall, but are the best for their own agendas. So again, be skeptical about it. And lastly, being skeptical towards what the method says. We have not been talking too much yet about data science methods, but what you can remember already now is that every method within the realm of data science has its own pitfalls. For example, later on we will be working with correlation. Correlation is a very powerful tool and method within data science, and even this one has its pitfalls. If we are not careful enough when applying this data science method, we might stumble upon a pitfall, and even though the method makes it look like everything is okay, things are not okay; something went terribly wrong. So first and foremost, the data science mindset is about being skeptical, essentially towards everything. Secondly, it is about being down to earth, or in other words, a data scientist should be realistic and practical. 
This means promoting methods, approaches, and products that actually have the potential to create a social or business impact, in a way that delivers an efficient solution, and not working on some crazy overkill. On the downside, data scientists are oftentimes really curious and would like to try some cool approach, the state-of-the-art techniques that were just recently released. Yet as a matter of fact, delivering a successful data science use case is oftentimes more about talking for weeks with the domain experts, making sure that your data science model, your technical solution, will be well integrated into the domain where it attempts to create the benefit. Then, as the end result, you will not have some crazy state-of-the-art machine learning solution. Maybe you will have something simple and straightforward, some super simple machine learning model alongside a couple of visualizations. So the second important aspect of the mindset of a data scientist is to be down to earth and practical. Thirdly, I would say it's about being ethical. Here, when it comes to ethics, everyone draws their boundaries elsewhere. When thinking about ethics, there are of course crazy, boundary cases like Cambridge Analytica, which you might have heard about in the media. And now you are telling yourself: no, ethically questionable cases would never happen, for example, next to me or within my organization. Well, in many organizations you will have ethically questionable projects and use cases; hopefully you are lucky enough to never stumble upon them. Now, when it comes to a data scientist, I think he or she should clearly draw some sort of an ethical boundary. For example, I know about a couple of use cases that I would never like to work on. One example would be workforce optimization. I just never want to create a data science model that would, as a result, mean that some people are going to lose their jobs. 
That's one example of an ethical boundary for me. But as I say, everyone draws their boundaries elsewhere, and we could discuss ethics at length; we could probably make an entire course on ethics and data science. So I just wanted to mention it when it comes to the data science mindset. And maybe as a small bonus when it comes to ethics: it's also about the fact that data scientists should not abuse their own technical advantage and technical competence. What I mean is that you shouldn't purposely misinterpret what the data has to say to pursue your own agenda or the agenda of your own department or company. But I guess not abusing one's technical competence and technical advantage is true for any profession. So being ethical is the third component of the data science mindset. Now lastly, I will tell you about being hypothesis oriented. A data scientist, from my perspective, should be able to form a hypothesis and operate around hypotheses. There are, for example, other data-related occupations, such as data analyst, business intelligence expert, or reporting. From my perspective, this hypothesis-oriented mindset is one sort of personality trait that should distinguish a data scientist from these other occupations. He should not expect that someone comes to him and says: hey, analyze these data, create this exact predictive model, put it together in that way, create this form of visualization. No, a data scientist should expect to work with the unknown. For example, he sees a dataset and can pose a new hypothesis on top of the dataset, or the other way around, he sees a business problem and can form a concrete hypothesis out of the vague business problem. Being hypothesis oriented is another key component of a data science mindset. All right, that is it for this video. We have discussed a couple of important components when it comes to the data science mindset. 
And in the upcoming lecture, we continue with the building of the T-shaped skillset. 21. Skills: Rectangular Data: Hi and welcome. Let's continue with the building of the T-shaped skillset of a data scientist. In the previous lecture we discussed the data science mindset. In this lecture, I would like to continue and explore the vertical dimension of the T-shaped skillset. These are skills and areas where the data scientist needs to be good, where he needs to be able to take responsibility. I will also bring examples of various tools and packages that you can utilize to apply these skills. If these packages and libraries are not telling you much, that's still okay, don't worry. As a key takeaway, we are discussing the concepts and the skills which are necessary, so the packages are just a bonus. Maybe a bonus takeaway for you is that it's not rocket science; it's really just a couple of libraries and packages that a data scientist needs to be able to operate. Alright, so let's start. Before we get to the individual skills, I would like to mention that a data scientist should first and foremost pick up a sort of umbrella technology or umbrella tool, a programming language in which he or she develops his or her skills. For me, this is Python, and Python is also my recommendation to my students. It's an open-source programming language in which you can apply basically all of the common data science operations, methods, transformations, however you would call them; it's just about picking up the right library. There are, of course, alternatives to Python. One famous example is the R programming language. The two are fairly similar on the surface, but under the hood you really start to spot differences. These, I think, originate in how the two programming languages came to be: R was originally developed mainly for the statistical community, which was evolving around academia. 
Whereas Python was always closer to artificial intelligence, pattern recognition, and machine learning, and it was always more focused on industry applications. Nowadays, I would say that for most companies Python is the go-to programming language, so that's also my recommendation to you in case you want to pick up a programming language. Once we have the umbrella tool, let's talk about data. As we already mentioned, there are various types of data. For me, a general data scientist should know how to work with what we call rectangular data. If you are not perfectly familiar with this term, let me explain. We mean the most basic, classical data that you can open up in Excel: your customers will be stored in the rows of the table, and the characteristics of your customers will be stored in the columns of this table. Of course, I'm generalizing with this statement. For example, there are companies which have an entire business strategy or product based on a different type of dataset, such as unstructured data. For these niche cases and niche companies, the definition of a data scientist and what is expected of him or her is of course different. But for me, the generalization would be that a data scientist is able to work with rectangular data. Having defined the umbrella tool and also the general data type, let's continue to the individual skills. We start with data pre-processing, which you might also oftentimes hear referred to as data wrangling. So here's the thing: data in the real world is not clean. It actually can be real garbage. If you want to take this data and turn it into some data science model, you will have to clean it up. An example of this is that the characteristics of your customers that you would like to use for your predictive model are going to be stored in multiple sources, in multiple tables. 
A data scientist will need to put them together into a neatly organized single table that he can then use for the creation of a predictive model. These sorts of skills we call data pre-processing, and there are various tools here. I would say that the absolute must, the bread and butter, is SQL, which stands for Structured Query Language. SQL allows us to query data which are stored in some database, and this is what you usually meet in companies: they have their data stored in a database, and the data scientist, using SQL for example, needs to be able to obtain the data from the database. Then of course it's not only about SQL, but also about various Python packages. Here I will name the two most common ones which I am meeting in the industry: pandas, which allows us to nicely wrangle our data, and, if we are talking about bigger datasets, the PySpark library. All right, so now that we have wrangled, pre-processed, and cleaned our data, we can move forward with our data science process towards exploration and visualization. I think we can all imagine what a visualization is, and later in the course we will be building some visualizations. Oftentimes the company has an idea of what is in the dataset, but most likely that dataset was collected for a completely different purpose than the project or the use case currently at hand. So a data scientist needs to be able to explore these data and visualize them. In other words, a data scientist should sit down and really challenge the data, asking all the time whether what he is seeing makes sense. And this is usually done through data visualization. Here you can meet, for example, various business intelligence tools; you might have heard about Power BI or Tableau, which are well fitted for this purpose. Then we have, of course, a long list of Python packages which are available for visualization: you have Matplotlib, and you can also visualize in pandas. 
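To make the wrangling step just described a bit more tangible, here is a minimal pandas sketch of joining customer characteristics from two sources into one rectangular table. All table names, column names, and values are made up for illustration; this is not a real dataset from the course.

```python
import pandas as pd

# Hypothetical source 1: one row per customer with basic characteristics
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "age": [34, 52, 29],
})

# Hypothetical source 2: one row per purchase (multiple rows per customer)
purchases = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "amount": [120.0, 80.0, 200.0, 45.0],
})

# Aggregate the transactional table to one row per customer,
# then join everything into the single neatly organized table
# that a predictive model could consume
spend = purchases.groupby("customer_id")["amount"].sum().reset_index()
table = customers.merge(spend, on="customer_id", how="left")
print(table)
```

This is exactly the "multiple tables into one table" motion the lecture describes; in practice the first extraction step would often happen in SQL against the company database, with pandas taking over from there.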
But I would say the industry standard nowadays is Seaborn, which is a powerful visualization library within the Python programming language. Alright, having the exploration out of the way, we proceed to rectangular data modelling, or in other words, the creation of a data science model out of rectangular data. As we were discussing already, our descriptions and explorations of a dataset can, for example, find a pattern or produce a visualization that will serve the purpose of a data science model. But oftentimes we need to go further and, for example, use machine learning to create a predictive model out of our rectangular data. This is what I mean by rectangular data modelling. Now, there are a lot of libraries in Python which could be used for rectangular data modelling; I just want to bring out one important distinction for you. On one side, there is scikit-learn, which you might have heard about; it's a very famous Python package. This one is an industry standard: really, for most of the use cases that I was working on in the past, you can pick up this library and find a method or tool which will deliver the solution, because the use case at hand requires a standard model. And then you will have datasets or use cases which are kind of niche, and maybe a bit more exotic. An example would be a dataset which contains a very strong component of time. This is a bit of a niche application, which we could refer to as time series, and here the industry-standard library, which covers most of the datasets and their needs, would fail us, and we need to pick up an exotic library. And here is the thing for a data scientist: you need to operate very well within the industry-standard library. You are of course not an expert in, let's say, dozens of niche libraries which are suited for a specific purpose; but what you should be able to do is pick up such a niche library in, let's say, a few days and get the basics of it. 
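To make the point about the industry-standard library concrete, here is a minimal scikit-learn sketch of rectangular data modelling. The tiny dataset is entirely made up (two hypothetical customer features and a churn label), and the model choice is just one of many standard options the library offers.

```python
from sklearn.linear_model import LogisticRegression

# Made-up rectangular data: each row is a customer,
# each column a characteristic (age, yearly spend)
X = [[34, 120.0], [52, 200.0], [29, 45.0], [41, 80.0]]
y = [0, 1, 0, 1]  # hypothetical label: did the customer churn?

# Fit a standard predictive model and score the same customers
model = LogisticRegression().fit(X, y)
preds = model.predict(X)
```

The pattern of `fit` followed by `predict` is the same across most scikit-learn models, which is a large part of why one library can cover so many standard use cases.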
And then, if the use case requires it, you apply the solution using this niche modelling library. All right, and now for the final skill within this lecture: on a more senior level, a data scientist also picks up skills that have to do with the deployment of a model into production, model management, and integration. How do we integrate a statistical solution with the other components of the technical infrastructure? An example from this box would be model lifecycle management. A model is not just built and then deployed into production; it then needs to be maintained, maybe replaced by a more novel approach, basically a refreshed version of our model that incorporates the recent trends. An example of a tool here could be Git, a technology that allows us to version our code, or basically to version our models. Another example would be that when you are moving into production, you want to make sure that you have data quality controls. You do not want some crazy data point or crazy observation entering your model, which could completely break its predictions. So we have to integrate some data quality controls, and this also belongs under the category of model management and integration. Here you would, for example, pick up another library, which is called Great Expectations. Alright, so this was a bit of a deep dive into the vertical component of our T-shaped skillset of a data scientist, and in the upcoming lecture we will of course continue building it up. 22. Skills: Specializations: Hi there. In the previous lecture, we moved forward quite a lot with the T-shaped skillset of a data scientist, and we discussed the rectangular data skills. Now there is one more skill, or set of skills, missing to complete the vertical component of our T-shaped skillset, and that's what I would like to discuss in this lecture. 
Alright, this set of skills is going to revolve around the areas of deep learning, natural language processing, and visual recognition. Here's the thing: before we discuss this set of skills, I should say that many data scientists start out with rectangular data and only then decide to proceed into these niche areas of visual recognition and deep learning, which are closer to the artificial intelligence and machine learning technologies. So if you are thinking about becoming a data scientist, you do not need to stress yourself out with "oh, I will actually need to learn deep learning, natural language processing, and visual recognition to be a data scientist". No; for me, this bottom component is kind of a specialization. As I say, some data scientists decide to specialize in these areas. All right, now let's discuss the skill itself. It's basically about the data. As we said, in the real world you will not only stumble upon structured, rectangular data, but you will also stumble upon various more complex data sources, such as images or natural language. By natural language you could, for example, imagine an email written out by a human. We need different methodologies and approaches to analyze such datasets, and these usually revolve around deep learning. Let me give you some examples of use cases which could live here. Let's say that your company is receiving letters, really letters on paper, and you would like to automatically read out and understand what is in each letter. This is actually a fairly complex task. First, let's say that you scan the document. Now, from the scan, you need to use visual recognition to identify where on the scanned image the texts are. And then, once you have the texts identified, you need a natural language processing model which will try to recognize what is written in this text, and then you have fulfilled the task. 
Another example of a use case within this area: let's say we are trying to predict the sale prices of the houses in our area. There is a lot to consider when it comes to house prices and what determines for how much a house will be sold. Let's say that we have our rectangular, structured data, so about every house we can say what the size of the house is, how many rooms it has, and how far it is from the city center. But we have a hypothesis that also the images, the pictures of how the house looks from the inside, will determine its sale price. Within one predictive model, we would need to combine this structured data with a couple of images of the house. Again, a rather complex task, for which we might pick up a deep learning model. These were just two examples of use cases within this specialized niche area revolving around deep learning. Now, as I was saying, most of the use cases which I see creating benefit for the industry are indeed in the top part, so around rectangular data modelling. But you can already find companies which have, I would say, a mature data science culture and are already experimenting with these niche areas, attempting to analyze images or natural texts. Or, as I was saying, you have some smaller niche companies which are building their entire product, or the entire company, around, for example, a single engine which could be used for the automatic recognition of what is written in letters. Then of course such a company would say: hey, for us a data scientist is someone who can work with these approaches and technologies. And so, if you would like to develop models within this skillset, you would also need to pick up new libraries and tools, which could be rather different from what we discussed previously within rectangular data. You might have heard about Keras or TensorFlow. 
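As a small peek at what such deep learning libraries automate for us, here is a hand-rolled sketch of a single dense layer (the basic building block that Keras and TensorFlow stack into deep networks) in plain numpy. The input and weights are arbitrary made-up numbers, not a trained model.

```python
import numpy as np

def dense_layer(x, weights, bias):
    """One fully connected layer with a ReLU activation:
    multiply by weights, add a bias, cut negatives to zero."""
    return np.maximum(0.0, x @ weights + bias)

x = np.array([1.0, 2.0])        # a tiny two-feature input
w = np.array([[0.5, -1.0],
              [0.25, 0.5]])     # arbitrary made-up weights
b = np.array([0.1, 0.0])
out = dense_layer(x, w, b)      # → array([1.1, 0.0])
```

A deep learning library chains hundreds of such layers and, crucially, learns the weights from data, which is why we reach for Keras or TensorFlow rather than writing this by hand.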
These are all libraries which are well suited for the development of deep learning models. So this niche area, the specialization, also comes with its own skillset and toolset. And that is it for this lecture; I wanted to discuss this bottom part of our vertical T-shaped skillset. In the upcoming lecture, we will move towards the horizontal component of our T-shaped skillset. 23. Skills: Technical Wing: Hi, let's continue with the building of the T-shaped skillset. In this lecture we will focus on the horizontal component. Let us just briefly recap where we are right now. It all starts with a data science mindset. Then we continue, and kind of a baseline skill for a data scientist is the capability to work with rectangular data. And then on the bottom, as I was saying, some data scientists decide to specialize in the niche areas around deep learning, natural language processing, and visual recognition. However, most of the time these skills are not sufficient to create a business value or a business impact. The thing is that with these skills you only create a model; you can imagine really a technical solution. But now we need to make the model fly; we need to give it wings, and there will be two of these wings. In this lecture, we are going to discuss the technical wing, which is about deploying the model and integrating it into the infrastructure which the organization has. And then in the upcoming lecture we will discuss the soft wing, which revolves around integrating the model well into the domain where we operate. So again, the previous skills alone are insufficient, because with them we only create the model; now we need to integrate it into the infrastructure and into the domain where we are operating. Let's give our model the technical wing. The technical wing consists of two core skills, which revolve around data engineering and cloud engineering. So let's start with the data. The first thing I would like to mention is knowledge of the data. 
When it comes to a data scientist and the company where he or she works: every company has different datasets, which somehow historically evolved into what they are today. A data scientist needs to know where things are, which data sources are useful, which are maybe not so useful, these sorts of things. Unfortunately, this is something that can rarely be learned, for example, from online courses like the one you are watching right now; it usually only comes with experience. The good news is that if the company is, let's say, healthy, has some form of documentation, and has a culture which is open to asking questions, then a data scientist can fairly quickly pick up the data knowledge; we are talking weeks or maybe three months. And again, I can't emphasize the importance of Structured Query Language enough. Sometimes it might happen to data scientists that they need to write a sort of data engineering pipeline, or that they need to collaborate with a professional data engineer on the creation of a data engineering pipeline which is bringing the data to their model: for example, extracting it from one system, then transforming it, and loading it into another analytical system, where our model, for example our machine learning model, is developed and is analyzing the data. So it's also about some SQL skill, for example to write data engineering pipelines. Moving on from knowledge of the data and data engineering skills, we are coming to the cloud. It really depends from which geographical region you are watching this lecture; I'm speaking from Europe, so things might of course be different in your geographic region. But to put it simply, a lot of companies nowadays opt for a cloud solution. Cloud, in short, is when you do not own the hardware, and maybe also a lot of the software, but you are renting it out from some large vendor. Cloud has a lot of advantages nowadays, and data scientists like it. 
There are three major vendors of cloud technologies: Amazon Web Services, or AWS; then you have Google Cloud; and then you have Azure from Microsoft. All of these implement more or less the same concepts, they are just named differently, at least from my perspective. For a data scientist, it is more a question of whether he or she can use the cloud in general; then they can rather quickly adapt to a different vendor. If my organization is working right now on AWS, and previously I have had experience with Google Cloud, I should be able to adapt rather quickly and easily. Lastly, some companies still operate in a way that they utilize their own custom server instead of a cloud. I would say that with every year that passes, you have fewer and fewer such companies. If a data scientist works in such a company, then it will require a few special skills for working with the custom solution that the company has, such as some basic Linux skills and maybe some basic command line skills. But in general, what I'm seeing is that data scientists are more inclined to join companies and use cases that are on the cloud. All right, to sum up this technical wing: as you see, it's about the technicalities that are happening around our model. How are the data flowing into our model? That is the data engineering part. Where does our model reside when we want to deploy it, or where are we when developing the model? Is it happening on the cloud? Is it happening maybe on our laptop, or is it happening on some custom server solution? And this is the end of the lecture. We have just discussed the technical wing of the T-shaped skillset of a data scientist, and in the upcoming lecture we go for the soft wing. 24. Skills: Soft Wing: Hi, and welcome to the lecture where we finally finish our T-shaped skillset of a data scientist. The last thing that we have missing is the soft wing of a data science model. Now, I cannot emphasize this enough. 
Data scientists will never create a business benefit or some business impact alone. He or she will need to interact with other people and with the domain. We basically have two skills over here: people organization, or we could call it people skills if you like, and then domain integration, or domain knowledge. Let's start with the domain knowledge. As we discussed at the beginning of the course, the gift of data science is that it can be applied essentially in any domain: finance, manufacturing, any sort of sales, medicine; I could go on, but I think I have already made my point. Now, when it comes to the domain knowledge, a data scientist should understand a couple of things. First, the pains. What pains does a certain domain pose to the organization? For example, you are working in a bank and you understand that one of the largest pains for your organization is the speed of delivery. Thus, you as a data scientist are constantly wrapping your mind around how you can help out with the speed of delivery using data science methodologies. Understanding the pains is the first key piece of domain knowledge that a data scientist should have. Secondly, business value, or: in which ways can business value be created in a given domain? Is it through increasing the revenues? Is it through cutting of the costs? Or is it by extending the domain in which the organization operates? Maybe the margins in the current business are very tight and there is no direct way in which data science can help with it. Take banking, for example: maybe the only way a data scientist in a bank can help out with a mortgage product is by extending it. Thus, your bank would no longer be offering just a mortgage; the bank would also be searching for properties that might be ideal for this customer. You would not only offer the mortgage, but alongside it also a property, and the customer would just sign the mortgage on the property and everything would be sorted out. 
So this would be an example of a data scientist understanding that business value can be created in a certain way in the domain where he is right now. Understanding how business value can be created is the second key piece of domain knowledge. Then lastly, the hurdles: what are the main hurdles that a given domain poses to the organization for which the data scientist is working? For example, in recent years a lot of companies within the European Union were having trouble with a particular regulation called GDPR, the General Data Protection Regulation. Companies had to adjust a lot of their processes, mainly regarding how they handle their customer data, to be compliant with this regulation. A data scientist should be pushing forward initiatives, projects, and use cases that go well alongside such hurdles; you would, for example, not be pushing for a use case that goes against such a strict European Union regulation. So lastly, it's about understanding the hurdles in a given domain. I will sum up the domain part by saying that you can have the best predictive machine learning model that anyone has ever created, but if it is not integrated well into the domain, it will not create any real value at the end of the day, so it might very well be worthless. As you can see, domain integration and knowledge are really crucial. Now, let's continue and talk about people skills and people organization. Firstly, I will take communication with other people. The data scientist is going to collaborate with other people who are technical, so he should be able to hold technical conversations. He will also communicate with non-technical people, so he should be able to translate his technical thoughts into more actionable business terminology; so, kind of adjusting the communication depending on whom you are talking with right now. But I think we can all imagine that. Let's move to the second one, which is the development methodology. 
People in organizations organize themselves into some form of development approach, you could say. You might have heard about multiple of them, such as Waterfall, Agile, Kanban, Scrum, Kaizen; there are quite a lot of them. Now, it is not only that the data scientist should be able to fit within whatever the organization is pushing as the go-to approach for development, for example some agile methodology. A good data scientist should also be able to adjust it so that it fits well for the project on which the data scientist is working. What do I mean by that? Well, as mentioned, nowadays you have mostly agile methodologies being put forward. But one has to keep in mind that these are maybe not perfectly applicable to data science models and products, as they were originally developed for software engineering. The thing is that within data science there is a lot more uncertainty than if you were working within software engineering. So oftentimes it is simply not possible to estimate how long things might take, because you just don't know what you will stumble upon within the data once you open up the dataset and start exploring it. It is about the ability to reflect and utilize a development approach which is the best for the use case at hand. All right, and that is our T-shaped skillset completed. We have just answered the question of who is a data scientist. I thank you so much for being part of these lectures, and I look forward to seeing you in the upcoming ones. 25. Introduction to Chapter 3: Hello, and a warm welcome to the third chapter in the course Be Aware of Data Science. My name is Robert, and in this short video I would like to provide you with an overview of the chapter which is ahead of us; it's called Describing and Exploring the Data. First of all, let's take a look at the overall course structure. We have already covered the essence of data science and its basic principles. 
We have also talked about the disciplines of data science, and we have even answered the question: who is a data scientist? Now it is finally time to start working with the data. If you remember, at the beginning of the course I said that there are four basic approaches of data science: the descriptive approach, the exploratory approach, the inferential, and the predictive. Within this chapter, we will cover the first two of these approaches; we will talk about the descriptive and exploratory approaches. We cover them together as they are closely related. Basically, the goal of this chapter is to understand the essence of these two approaches: why are they valuable, and why should we never skip the data description or data exploration before we maybe move to some more complex approaches, such as the inferential or predictive. Within this chapter, we want to fully understand these two approaches, what they bring to the table, and even what the potential pitfalls within these two simple approaches are. Alright, that is the high-level view of this chapter. Now let me provide you with a more detailed overview of what's ahead of us. We are starting the chapter with a learning story called Describing the Life of a Foodie. I recommend not skipping this lecture, because it will provide an intuitive basis for the upcoming chapters. As we finish this learning story, we will understand how we should go about describing the data, what approach or flow we should be taking. Then, moving to the second lecture, we are asking ourselves why it is so important to describe the data, so that we really understand why this approach is so valuable. Then, within the third lecture, we are of course covering the basics of descriptive methods. I want to fill your toolkit with methods such as measures of position, measures of spread, and the relevant visualizations; we will cover the necessary basics of these methods. Then, as I always keep saying, no data science method is sacred. 
So within the fourth lecture we'll talk about calculating an average income, where you will see how even the simplest method of data science can fail if we are not careful enough. Then we will conclude the first part of the chapter with a wonderful assignment. It's called The Power of Describing, and I really recommend not skipping this assignment, because people usually think that descriptions of data or data explorations are boring, and that we should be building the cool machine learning predictive models instead. Well, I think that's very unfortunate, and within this assignment I want to show you how far powerful data descriptions or powerful data visualizations can really take you, and how they can really deliver valuable information. So I hope you will enjoy this assignment. Now, within the second part of the chapter, we will move towards the exploratory approach. We are done with the descriptive one and we'll move from description to exploration. We'll start off the second part by highlighting the difference between these two approaches. Then we have a learning story where we'll be asking ourselves the question: which house is the right one? Here we'll revisit the limitations of the human mind which we outlined at the beginning of the course, namely how we have trouble understanding problems defined by many dimensions. But now you will see it connected to data exploration. I would also like to provide you with concrete methods when it comes to the exploratory approach. So we will talk about correlation, which is a really powerful tool, a really powerful method that we have available. Within the eighth lecture, we will go in depth into the method of correlation. But, as I keep saying, always remain skeptical; data science methods can fail us. Already in the next lecture, which is called When Temperature Rises, we will see how even the powerful method of correlation can fail if we are not careful enough. And then we will be concluding the chapter with a couple of learning stories.
We'll be talking about storks, babies, football, and presidents. What do these things have in common? Well, as you will see, they do have something in common. For example, we will be asking the question: do storks bring babies, or can football predict who will win the presidential election? These are wonderful learning stories and I hope you will enjoy them. The learning stories will uncover an important concept, which is called spurious correlation, and I think it's so important that I have also designed an assignment where you can practice with spurious correlation, which is at the end of the chapter. Of course, we'll conclude the chapter in a fairly classical way, where I will provide an article which is sort of a checkpoint where you can recap the most important learnings from the chapter, before you conclude the chapter with the assessment, a couple of quiz questions that I have prepared for you. And that is it for this video; I'm looking forward to seeing you within the third chapter. 26. Describing the life of a foodie!: Hi and welcome to another lecture in the course Be Aware of Data Science. We are now starting a new chapter, which will be about describing and exploring the data, and I would like to start with a learning story which is called Describing the Life of a Foodie. Who a foodie is? These are wonderful folks who really enjoy travelling to enjoy some nice meals at some nice restaurants. For example, you don't just go out because you are hungry; you would pick a special place because of some special meal. Basically, you are treating it as a hobby. Now, imagine that we are foodies, and as a matter of fact our friend is interested in joining our foodie hobby, and he would like us to describe this foodie hobby for him. The first question that he's asking is: how does your foodie trip look like?
Our description, our first answer, is: well, we usually wake up early in the morning and drive to a city where we would like to visit the restaurants. We then take the first meal of the day, usually a sort of breakfast. Afterwards, we walk around the city or nearby to get hungry. We always go to the tastiest restaurant for lunch, so that we can take multiple courses. Most of the time, we spend around €30 per such trip. Now, listening to this answer that we have provided to our friend, what is the natural tendency of our mind when it comes to describing something, describing our foodie hobby? We are at first thinking about averages; we are thinking about what is usual. We can now translate this to technical terms, because we will be doing the very same thing when we are supposed to describe the data. We will first describe some averages: what is usual or what is common. This will be the first step when we're describing the data, and it follows the natural tendency that our mind has, as you have just seen within our answer to our friend. Now, unfortunately, our friend is still not persuaded. So he is asking: well, I'm still somehow worried whether it won't cost too much, or if the foods are not too exotic for me. Well, there goes our answer: don't be worried. Sometimes we are adventurous and go for a visit to some very authentic places, but even there you can pick some less exotic options. Indeed, when it comes to money, we occasionally go on a spending spree, but that is very seldom, and it will be announced beforehand, so you can skip the trip if that is a trouble for you. Okay, in technical terms, how did we continue describing our foodie hobby to our friend? Well, in technical terms, we are interested in how things range and what big extremes might occur. We have seen it: sometimes we are adventurous, occasionally we go on a spending spree. The very same thing we'll be doing when we describe the data.
At first, when we have described what is average, what is usual, we will reach towards the extremes. So how do things spread? How do they vary? What is the minimum, what is the maximum value? We will again translate the natural tendency of our mind into our descriptive data efforts. However, that's not everything. Our friend is now inclined, but he's still not persuaded, and he says: I still want to imagine it better. So our answer is: look, let me show you my foodie diary, and you can also check my social media where I post about these trips. You will see images of the meals that I took, and thanks to the diary you will also be able to imagine how much money we spend on these trips. It will be all right. This brings us to the last point in the natural flow of our mind: we will be visualizing, because oftentimes we want to see the data. We want to imagine better what's going on with the data. So oftentimes describing some usual tendency, some average, and then describing some extremes is not sufficient; we still might want to visualize the data. To sum up this learning story: we will basically translate this natural flow of our mind into the upcoming lectures. At first we will be talking about measures of position, similarly to how we were describing averages to our friend. Then we will be talking about measures of spread or variance; there are different terms, but you will meet terms such as range or mean absolute deviation. At the very end, we'll of course be visualizing; sometimes we just need to see the thing in order to imagine it really well. We will be mentioning boxplots, pie charts, and scatter plots. I hope that this learning story has provided a solid basis for our upcoming lectures, where we'll be talking about the descriptive approach of data science. I'm looking forward to seeing you in the upcoming lectures. 27. Why do we need to describe the data?: Hi. In the previous lecture, we started to uncover the descriptive approach of data science.
We have discussed the intuition behind it. But even before we proceed to the concrete methods, I want to stop and talk about why it is so important to describe the data. For me, it goes something like this: when thinking about descriptive methods, you should remember to think simply and objectively. You will want to describe how things look, without the danger of falling into some bias. So it's really going to be about simple descriptions of what's usual, what is an extreme, and about some descriptions of ranges and dispersions. Because here's the thing: when we proceed to more complex approaches, such as an exploratory, inferential, or predictive one, there will be a lot of space for falling into some bias or encountering a pitfall. With the descriptive approach, we want to leave ourselves a space where the danger of falling into some bias is minimized. To give you a concrete example: already with the exploratory approach, we will be searching for patterns in our data. We will want to apply domain knowledge and domain thinking, and there is a danger that there is some cognitive bias in our mind, and that while we are applying ourselves on top of the data, we also project our cognitive bias onto the data. So before we go there, before we do that, we want to think simply and objectively about the dataset that we have available. If I'm supposed to put it really simply, I would say the descriptive approach is about capturing the essence or the nature of the given data: what's in there, what is this dataset telling me, and later on, what will I be able to do with it? If you want to think more concretely about it, here are a couple of questions that we could be answering with the descriptive approach. Does the dataset cover the sample that I was hoping for? Isn't there some bias in my dataset? Aren't there some weird extreme values? Isn't the data somehow weirdly shaped?
Or I'm already thinking about how the data was collected. You could already be more concrete: if you are taking the statistical approach, starting with the hypothesis and proceeding to the data collection, you could be asking, did I collect the data which I was hoping for? Or, if you take it the other way around and you are starting with data mining, where you start with the dataset and only on top of the dataset attempt to form the hypothesis, you would want to answer the question: is this dataset useful? Can I even form some sort of a hypothesis on top of it? So I hope that with these two slides I gave you an intuition of why it is so important to describe the data. But there might be one thing that you are still wondering about: hey, what about the predictive approach? Should I still be describing the data even if I'm aiming for some inferential or predictive approach? Yes, definitely you should be. Because think about it: what you are attempting to do with the inferential or predictive approach is to generalize from a sample to the population, and you shouldn't go for it directly. You should at first focus on the data which you have. You should take a step back, and your first step should be: here is my data, here is my interest; right now I'm trying to capture the essence of the sample that I have available. Once you understand that, you have sort of laid the foundation for the more complex approach of inferring or predicting from the sample to an entire population. So to sum it up: yes, the descriptive approach is always a good start, even if you are aiming for a predictive approach. All right, that is it for this lecture, and I look forward to seeing you in the upcoming ones, where we will already discuss the descriptive methods. 28. Basics of Descriptive Methods (1): Hi and welcome to another lecture in the course Be Aware of Data Science.
My name is Robert, and as promised, in this lecture we are starting with the basics of descriptive methods. In the previous videos, we have understood the intuition and how we might be applying these, even in which order: that we might be starting with the measures of position. And that's exactly what we will do in this lecture. Then, in the upcoming two lectures, we will continue with measures of spread and visualization. Now, the key in this course is not to fill your toolkit with every descriptive method; the key is to get the mindset of data science and how it handles the description of the data. We will have a concrete example of applying one of these methods, and then I will also provide you with a brief overview of methods within each of these sets. Again, please don't feel discouraged if we do not cover all of the descriptive methods which are out there; really, the key is to get the mindset of how we are applying measures of position and to have a sort of an overview of what is available from these sets of tools. So let's start with the measures of position. As I said, I want to give you at first one concrete example of how an application of a measure of position might look. I am pretty sure that the simplest measure of position you have already encountered: it's a mean calculation. Let's say that our task or question is very straightforward: what is the average temperature in your country? And we have a dataset available of various temperatures in degrees Celsius. The method we will use is the mean calculation. The calculation is pretty straightforward: we just need to add up all of the values, so you can see 35 plus 35 plus 32, up until plus 24, with the result of 263; that's the sum. As a second step, we need to divide it by the count. We originally had nine values, so we will divide by nine, and the resulting mean temperature in our country is 29.22 degrees Celsius.
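As a minimal sketch, this mean calculation could look as follows in Python. Note that the lecture only shows a few of the nine temperatures (35, 35, 32, and 24), so the remaining values below are illustrative ones, chosen so that the list sums to 263 as in the example.

```python
# Mean temperature: sum the values, then divide by the count.
# Only some of these nine temperatures appear in the lecture;
# the rest are made up so that the sum is 263.
temperatures = [35, 35, 32, 28, 26, 27, 29, 27, 24]

total = sum(temperatures)          # 263
count = len(temperatures)          # 9
mean_temperature = total / count   # about 29.22 degrees Celsius

print(round(mean_temperature, 2))  # 29.22
```

The same result comes from `statistics.mean(temperatures)` in the standard library, which is the more idiomatic choice in real code.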
As you can see, we are not applying any domain knowledge; we are not thinking about what these temperatures mean in the real world. We are just applying a very simple descriptive statistic, the mean calculation, to get the essence of what the average temperature in our country is. Of course, calculating a mean isn't the only measure of position that we have available. There is a lot more to these, and in general they fall into three categories: there are averages, or what someone could mean when they say "what is the average"; then we have extremes; and then we have various positions. So let's talk about these. First of all, when we talk about averages, we have the mean and the median. Already a few lectures from now, I will show you an example where the mean as a method, as a descriptive statistic, can fail, and we will actually need to replace it with a median to really get the measure of average that we would like. So these two are usually applied at the same time, as they are really super useful. Then you have some nuanced applications of means, such as the weighted mean or the trimmed mean, where you could be interested in knowing what the mean is without taking into account some extreme positions or some extreme temperatures. If you are aware that such extreme values could exist in your data, you might want to disregard them and focus only on what is average or usual. So you can see we have various measures of average. The two most common ones are the mean and the median, and we will also see an application of these two in one of the upcoming lectures. Now, for the extremes, I think that's pretty straightforward to imagine: we have a minimum value and we have a maximum value. I think there is no need to elaborate on that further. Let's continue with what you might not have encountered until now: the terms quartiles, deciles, and percentiles. What are these?
Well, first of all, always remember: we will order our sample, we will order our observations. You can see our original data was out of order: we had 35 degrees Celsius, then we had 32, 28, 31, and so on. It's unordered. We want to order it from lowest to highest: you can see we are starting with 24, we continue with 25, 28, up until 37. Now, having our sample ordered like this, we could be looking at various positions, and all of these strange terms are only about into how many parts we are dividing our sample. If we are talking about quartiles, we're dividing our sample into four parts, where in each part there will be an equal amount of observations. If we are dividing into deciles, it's about ten parts, where we have an equal amount of observations in each of the ten parts. And percentiles: that is dividing the sample into 100 parts. Now, let's divide our sample into deciles, because it's the simplest and most straightforward for our case: we have ten observations, so basically in each of these buckets, in each of these deciles, we have one observation. Now someone tells us: hey, please report to me, what's the eighth decile? So we will be starting from the lowest bucket, you could say, or the lowest decile, and we're counting 1, 2, 3, 4, 5, 6, up until 8, and our eighth decile is 35 degrees Celsius. You can see this is a bit different approach to looking for a certain position within our sample. We are not looking at the average, which has to be somewhere in the middle; we're also not looking at the extremes, which are at the edges. We could be looking at various positions, which are approximately at a third or at three-fifths of the data, and we have a lot more freedom with these. Just for the sake of curiosity, in case it's already connecting in your mind: if we are talking about the median, well, that we will be obtaining in a very different way.
We will be ordering the data and looking at the value which is in the middle of this data, which breaks down our sample into two equal parts: on the right side and on the left side from the median we have the same amount of observations. You can see all of these are connecting to each other. Now, just to sum it up, these are the basic descriptive statistics, the basic measures of position, which we are applying on top of our data. As I was saying, these are really simple and straightforward, and we want to just apply them objectively to understand what the essence of our data is: averages, extremes, and then we might be measuring some nuanced positions within our sample. So this is it. In the upcoming lecture, we will continue with the measures of spread or dispersion. 29. Basics of Descriptive Methods (2): Hi. In the previous lecture we talked about the measures of position; we had the mean, the median, and some other positions. Now we should not stop there; we should continue, and in this video we will cover measures of spread, which will again enrich our view and our understanding of the data which is in front of us. Before we go there, I would like to just reiterate: we're not here to cover all of the available methods which belong under measures of spread. We are here to gain an intuitive understanding of what the measures of spread do and how they enrich our view of the dataset which is in front of us. And of course, we will go through an example of one measure of spread. All right, so let's go for it. We have a beautiful project ahead of us: we are a producer of windows, and I do not mean the software, I mean actual physical windows. Now, the issue is that we would like to produce windows at a specific width, let's say 100 centimeters, but we are having a bit of trouble with our manufacturing methods. Occasionally it can of course happen that the window is not 100 centimeters wide but 101 or 99 centimeters wide.
This is a problem, because if we deliver such a window to a customer, the customer might complain that it's not fitting perfectly. Now we, as a producer of windows, have two ideas, two different manufacturing methods. We have produced a couple of windows with each of these manufacturing methods, and these generated two samples; we measured the widths within these two samples. Here, the measures of central tendency are insufficient. If we measured the mean width within each of these samples, most likely both of them would say 100 centimeters, but it is very likely that with one of them we are achieving more desirable results, a more stable width of the windows, than with the other. So the measures of central tendency are insufficient; we need to expand with the measures of spread, because we care about how much the width varies around the desired value of 100 centimeters. The method that we'll use is called mean absolute deviation, and it belongs under the measures of spread. So we have a concrete example, and it will show exactly how the width varies around the optimal mean. Let's say that we have one of our samples; these are the windows that we have produced. And let's say that the average of the sample is 100 centimeters, so our desired width of the windows. But we can see that some of the windows were maybe 99 or 98 centimeters, and some of the windows were 101 or 102, and so on. So the width is really varying around the mean, the average, and we need to measure this. All right, we will start by looking at a single window, and we are basically interested in how far the width of this window is from the average width of all the windows. Expressed mathematically, we are taking the width of the window and subtracting from it the average width of all the windows, which will measure this distance over here.
Now we need to use an absolute value around this calculation, simply because we'll be doing this for all of the windows, and some are above the mean, some are below the mean, which would mean that the positive differences would cancel out with the negative differences. So by using the absolute value, we are really just interested in the absolute distance from the mean. This is the basis for our calculation. We now continue, and we will do this for each of the windows within the sample. Let's say in this case we have six windows, so we will repeat this calculation six times, each time for a different window. Now the result of this upper part needs to be divided by the number of windows, and here you may be reminded of the mean calculation that we had in the previous lectures: we are basically summing up all of the values and dividing by the number of values that we have. Well, we are working with the mean absolute deviation, so we have the absolute deviations, and we are calculating the mean from these absolute deviations; it's the same principle as when we're calculating a mean. Now, believe it or not, this is everything. With this simple calculation we have measured the mean absolute deviation, and it will answer the business problem that we're having. We can then compare the mean absolute deviation within the two samples that we have, and this way decide which of the two manufacturing methods is producing more stable results, where we are getting a more stable width that is as close to 100 centimeters as possible. This was mean absolute deviation, a concrete example from the measures of spread. Now, as I was mentioning, there are a lot more measures of spread; you will be meeting terms such as standard deviation or interquartile range. I do not want to bother you too much with these terms; they are all more or less working in a similar fashion.
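As a rough Python sketch of the comparison we just walked through: the lecture does not list the actual measured widths, so both six-value samples below are hypothetical, constructed so that each has a mean of exactly 100 centimeters.

```python
def mean_absolute_deviation(widths):
    """Average absolute distance of each value from the sample mean."""
    mean = sum(widths) / len(widths)
    return sum(abs(w - mean) for w in widths) / len(widths)

# Hypothetical widths (in centimeters) from the two manufacturing methods;
# both samples average exactly 100 cm, so the mean alone cannot separate them.
method_a = [99, 101, 98, 102, 100, 100]
method_b = [99.8, 100.2, 99.9, 100.1, 100.0, 100.0]

mad_a = mean_absolute_deviation(method_a)  # (1+1+2+2+0+0)/6 = 1.0
mad_b = mean_absolute_deviation(method_b)  # about 0.1

# The smaller mean absolute deviation wins: method B produces widths
# that vary less around the desired 100 cm.
print(round(mad_a, 2), round(mad_b, 2))
```

The standard library also offers `statistics.stdev` and `statistics.variance` for the closely related spread measures mentioned above; the intuition of "how far do values sit from the mean" is the same.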
And if you remember intuitively what we are doing when we are measuring some spread, that will be a great key takeaway from this lecture. This was mean absolute deviation, and in the upcoming lecture we will continue with visualization; looking forward to seeing you there. 30. Basics of Descriptive Methods (3): It is time that we conclude our exploration of the available descriptive methods by talking about visualization. Before we even get to the actual methods, I want to reiterate on what I already said: why visuals, why is it so important to visualize the data? Simply because sometimes the numbers, meaning the descriptive tools that we discussed until now, the measures of position and measures of spread, do not tell the story. Here is a famous example, obtained from Wikipedia; it's called Anscombe's quartet. I hope I'm not butchering the name too much. Anyway, it's a beautiful artificial experiment where we have collected four samples: a first sample, a second, a third, and a fourth. By looking at these four samples, so by the visual, we can right away see that they're completely different; the patterns that the samples are forming are completely different from each other. However, here's the thing: if we measure the measures of position or some measures of spread or variance, they wouldn't look different. I have an extract from it here on the right side. We have a property, the mean of x, measuring the horizontal mean; we also have the mean of y, measuring the vertical mean. As you can see, on the horizontal axis they have exactly the same mean, so all four of these have exactly the same mean, and when we're talking about the vertical dimension, they have almost exactly the same mean. Now, we could be resorting to what we discussed in the previous lecture, the measures of variance.
Well, here we would also be failing, as the variance on the horizontal axis is exactly the same for all four samples, and the variance on the vertical axis is again almost the same. The variance we have over here is a very similar method to the mean absolute deviation that we have discussed. Anyway, back to Anscombe's quartet: I hope that through this example you will remember the key takeaway that sometimes the numbers do not tell the story. Sometimes we just have to visualize what we are seeing to uncover what's really in the data. Now that we have covered the why, let's talk about the visualization itself. Here's the thing: when I try to visualize, especially when I'm describing the data, when I'm just starting with the dataset, I like to start with what I call one-dimensional, or as we tend to call them, univariate visualizations. You are only focusing on one characteristic, you could say one column of your dataset, at a time. Let's say you have a dataset about the social demographics of your customers: you have age, you have their occupation, and many other characteristics. I always look at only one at a time and produce a visualization. Here we have various tools and methods. If we are talking about some numerical characteristics, numerical features, we have tools such as distributions. Here we would be having age, for example, and we would observe various buckets of how old our customers are. Then we have the boxplot, which is a bit more complex visualization, but it's a wonderful one; even though it looks weird, it's actually showcasing a lot of measures of position. For example, here in the middle we have a median, we're also measuring the quartiles, and we could also be finding out extreme values through it. It is just another method that we could be using to display a numerical characteristic. And on top of it, we could even visualize the individual observations themselves.
So you can see that on top of our boxplot, which in this case looks a little different, we have added the exact values of our observations. So if something like on the previous slide happened, we would be able to spot the differences between these observations. Then, if we are having some categorical characteristic, a categorical feature such as occupation, we also have different univariate plots available, such as a count plot, with which we could be showcasing what the occupation is: are you a white-collar worker, are you an entrepreneur, are you a student? This would be a one-dimensional display of our occupation characteristic. So I think the univariate visualizations, one dimension at a time, are always a good start to a data visualization exercise. Of course, we then proceed to more complex visualizations where we are looking at two dimensions at once; we would call these bivariate visualizations. And here it's mainly about the nature of the characteristics that we are looking at. We could either have numerical characteristics, such as age or income, or we could have categorical characteristics, categorical features, such as what your occupation is. Then we're just asking: okay, are we plotting a numerical versus a numerical characteristic, a numerical versus a categorical one, or a categorical versus a categorical one? And again, we have a couple of options available, but basically it's all about reusing what we had within the univariate visualizations. If we were just plotting, like in the boxplot, the observations and their values, and we build it up into a bivariate visualization, you can see it becomes what we call a scatterplot: just the values of our two numerical characteristics. Then, if we are plotting a numerical variable against a categorical characteristic, again, let's reuse what we had within the univariate visualization, which was the boxplot, and we have one boxplot, one showcase of the numerical values, per each category.
In this case it's a funky one; it's coming from a penguins dataset, where we are measuring the body mass, how much the penguins weigh, with regard to their species. And we can nicely see that the Adelie species and the chinstrap species are in general much lighter than the gentoo species. You can see that here we are already stepping away from descriptions; we're no longer just describing the data. This is the edge between describing the data and exploring the data, because we're already seeing some pattern in our data, and that will be our goal with the exploratory methods. So data visualization is the point where you are kind of going over the edge from the descriptions towards the explorations, where you are uncovering some patterns in your data. Now, this was the two-dimensional, bivariate visualization. The most complex visualization that we could undertake is the multivariate visualization. Here we are trying to put three or more dimensions into the same visual at the same time, and here is where it starts to be problematic. Remember what we discussed at the beginning of the course: our mind is rather limited, and it has problems visualizing or imagining more than three dimensions at once. So really, the big question with a multivariate visualization is: how do we encode the third or fourth or fifth dimension into these visuals? We could be using shapes, we could be using colors, we could be using some composite plot. As you can see in the visualization that I have right now on the slide, we are using colors and also a sort of composite plot, because we have some numerical characteristics of our beautiful penguins over here: what's the bill length, what's the bill depth, what's the flipper length, and what's the body mass in grams. We're plotting them against each other, and at the same time we are distinguishing the species of the penguins using the colors.
We have really a lot of dimensions at the same time in a single visual. However, the issue with multivariate visualizations is that they are tricky to construct, because you always have to undertake this creative exercise of thinking how to encode the third or fourth dimension into your visualization. So this is everything that I wanted to talk about when it comes to visualization. The key takeaways from this lecture really are: why is visualization important? Because numbers sometimes just don't tell the story. And then, when it comes to the visualization itself, we should be undertaking the simple approach of starting from univariate visualizations, then proceeding to bivariate visualizations, and if we have time and space for it, we could even take on some multivariate visualizations. And this is already the breaking point between the descriptions that we talked about until now and the exploration, the exploratory approach, that we'll talk about in the upcoming lectures. I'm looking forward to seeing you there. 31. Calculating Average Income: Hi there and welcome to another lecture in the course Be Aware of Data Science. In the previous lectures we were discussing descriptive methods, and you might be telling yourself: oh, these methods are nice and straightforward; if we are applying them, nothing can go wrong. Well, in this lecture I would like to talk about one of the really key takeaways from this course, which is that no data science method is sacred. We should always be skeptical about the methods that we are using, and even such a simple method as a descriptive one can fail us if we're not careful enough and if we're not skeptical enough. The example that we are going to work through is called calculating average income. So let's go for it. We have a task that is similar to one we already faced with the temperatures. Now it is: what is the average salary in your country? And here is our dataset, here is the sample.
We have salaries ranging from €350, then €500 and €820, up until €4,000, and we are interested in the average salary in the country. So we can proceed to the calculation of the mean. We already know how to calculate it. We would, as a first step, sum up all our values. So we start with 350 plus 400, all the way up to plus 4,000, and the result would be 11,605 euros. That's the first step. Now we need to divide by the count, or the number of observations that we have. So we divide 11,605 by 11, and the result is €1,055. Now let's stop here for a second. I mean, if we're just intuitively looking at the salaries — we have three hundred and fifty, five hundred, eight hundred and so on — does it seem reasonable that the average salary is €1,055? If we should really rely on this measure of position, then we have only three salaries that are higher than what we would claim the average salary is; it would lie right over here. Something doesn't seem to be right. So let's reach out to the second measure of position, the second measure of central tendency that we have, and calculate the median. Now in this case, we need to order our observations from the lowest to the highest. So we can see we are starting with 350, and then we have everything nicely ordered up until 4,000. And now we want to find the observation which is splitting our sample into two parts, where within each part we have an equal amount of observations. So we can see that to the left of our median we have five salaries, and to the right of our median we have also five salaries. So our median in this case is the middle value, €520. Now looking at it, doesn't €520 seem more reasonable, if we are claiming that this is the average salary in our country, when we are looking at our original sample? Of course it does. What we have encountered is that the mean calculation, or the mean measure, is sort of failing us. The reason for that is that we have outliers, or extreme values, present.
On the upper side we have salaries of €2,225 and €4,000. These are extreme salaries. And if we are calculating the mean, they are skewing the mean, they are pulling it towards the high values. And that's why the mean ended up being such a high value. Whereas for the median, it's a different story. The median doesn't care how high the salaries are. It only cares about the salary which lies in the middle of our sample. In this case, the median might be a much better representation of what the average salary in our country is. So I wanted to highlight with this story one key takeaway: as you can see, even the simple descriptive methods have potential pitfalls, and every method in data science has potential pitfalls. The job of a good data science practitioner is to be aware of these and address them within a use case. I think this is a perfect example of what a big chunk of the job of a data scientist is about. You have your dataset, you have a suite of methods available, you have your experience and knowledge of things which might potentially go wrong. It is now about trying out the tools you know and always remaining skeptical about things that can go wrong. I mean, for example, your stakeholder comes to you and tells you: hey, I want to know what the average salary in the country is. That question should already spin up a lot of thoughts in the mind of the data scientist. Should I be working with a mean? Should I be working with a median, a mode, or a trimmed mean, or any other different measure of central tendency? Could there be outliers, such as in this case? Or is the distribution maybe somehow skewed towards one side? So let me repeat it again: every data science method has pitfalls, and the job of a data scientist is to always challenge the data and the methods that are being used. Okay, this concludes our exploration of the descriptive methods. As you can see, even they can have their pitfalls.
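To make the salary story concrete, here is a small Python sketch using only the standard library. The list below is an illustrative sample built to match the totals from this lecture (sum 11,605, mean €1,055, median €520) — it is not necessarily the exact dataset from the handout:

```python
import statistics

# Illustrative salaries in EUR: mostly modest values plus two extreme
# outliers, constructed to reproduce the lecture's totals.
salaries = [350, 400, 450, 480, 500, 520, 600, 820, 1260, 2225, 4000]

mean_salary = statistics.mean(salaries)      # pulled up by the outliers
median_salary = statistics.median(salaries)  # middle value, robust to them

print(f"mean:   {mean_salary:.0f} EUR")   # 1055 -- skewed by 2225 and 4000
print(f"median: {median_salary:.0f} EUR") # 520  -- the middle observation
```

The gap between the two numbers is exactly the pitfall from this lecture: the mean chases the outliers, while the median stays with the bulk of the sample.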
And I'm looking forward to seeing you in the upcoming lectures. 32. From Description to Exploration: Hi, and welcome to another lecture in the course Be Aware of Data Science. My name is Robert, and in this lecture we are going to step away from the descriptive approach of data science and we will move towards the exploratory approach of data science. So let's go for it. But first, let's remind ourselves what the idea behind both of these approaches really is. Within the descriptive approach that we have already discussed, it is mainly about questioning the usefulness of the dataset and understanding what the nature of the data is. We are possibly trying to uncover whether there are some issues, we may be questioning the data collection method, and we are in general trying to understand what this data is about. However, now that we are moving towards the exploratory approach, we will be searching for some useful and non-obvious patterns. And these can then hopefully serve as a basis for some more complex methods, or they can already be creating valuable information which can be, for example, used by our business colleagues in their business decision making. Now, within this lecture, I would like to talk about the differences that appear when we are making the step from the descriptive to the exploratory approach. And there will be two of them. The first difference, when taking the step from describing the data to exploring the data, is that we start to search for patterns, and we usually start to look at multiple characteristics, or multiple features, at the same time. Just a few lectures from now, I will give you a concrete and practical example of a very strong pattern that we might be searching for, called correlation. But for now, I would like to stay with something super simple, to focus on the intuition behind the patterns. In this case, we have a simple dataset. We have two characteristics.
We have a gender of our buyer in the shop, and then we have the wine type which he or she purchased. So you can see we have men and women, and for the wine type, we have white wine and red wine. Now, if we apply our descriptive methods and our descriptive thinking on top of this dataset, well, our findings might look something like this: the ratio of white to red wines bought is 40 to 60, and the ratio of men to women is 50:50. I mean, neither of these appears spectacular in any way. Both of them seem reasonable. So we concluded the descriptive stage and the descriptive approach. Now, however, remember: we are going to be a detective looking for the patterns. Thus, we ask ourselves, is there some pattern in my data which could be useful and interesting? And so if we are actually looking with the exploratory lens at this dataset, our findings might look something like this: men purchase under a ratio of 70 red wine to 30 white wine; women purchase under a ratio of 20 red wine to 80 white wine. And you can see this is already an indication of a pattern, which is that the choice of a wine is different between men and women. This is the sort of pattern that we might be looking for within the exploratory stage. What we'll do is that we might now take it and deliver it, for example, to our business colleagues, for whom it's already valuable information, so that they can adjust their sales strategies. Or we might take this pattern and move it forward towards the inferential approach or the predictive approach. As for the second difference: you might have heard that for data science to work properly, domain knowledge is needed. I was mentioning it already a couple of times. Now, domain knowledge is needed in many components of a data science project; however, the exploratory approach is exactly the spot where it is the most obvious how much we need the domain knowledge. Let's say that we get a dataset where we have socio-demographic characteristics of people.
Now if we're just describing the data, without the application of any domain knowledge, without looking at the context of everything, we would find something like this: we have two males, they were both born in 1948, they were both raised in the UK, they were married twice, both of them live in a castle, and they are both wealthy and famous. You can see the descriptive approach is not bringing us anything if we are just mindlessly describing the data. However, if we apply domain knowledge and if we try to look at the context of things, well, we might see something like this: behind one of these we have Ozzy Osbourne, and behind the other one we have Prince Charles. So you can see the difference: if we are applying domain knowledge, if we are employing the context and not just describing what we are seeing, our conclusions will be completely different. A bit of domain knowledge, such as looking at how these two obtained their wealth, would make all the difference. I mean, one characteristic would have been way more useful than all of the ones that we have listed. But to know which characteristic would show the difference between these two, of course, we need to apply the domain knowledge and context. Maybe to show you things from a bit different perspective, I wanted to take this beautiful quote from Mr. John Tukey, who is considered one of the godfathers of exploratory data analysis. It's an old one, from 1977. He says: exploratory data analysis is detective work. A detective investigating a crime needs both tools and understanding. If he has no fingerprint powder, he will fail to find fingerprints on most surfaces — so we need the tools, we need the methods. And if he does not understand where the criminal is likely to have put his fingers, he will not look in the right places — so we need also the domain knowledge, we need also the understanding of the context.
Whether the difference between Ozzy Osbourne and Prince Charles helps you remember it, or the quote by John Tukey, I definitely recommend remembering that domain knowledge is needed when we start to explore the data. And that's what I wanted to tell you during this lecture; in the upcoming lectures, we continue with the exploratory approach of data science. 33. Which house is the right one?: Hi, and welcome to another lecture in the course Be Aware of Data Science. In the previous video, I promised you that we would already be starting with the exploration of the exploratory methods of data science — pun intended, by the way. Now, we will not go there just yet, because I have one more important learning story for you, which is called which house is the right one? With this learning story, we'll revisit what we already discussed at the beginning of the course, that our mind has some limitations, and these really come back to haunt us within the exploratory approach of data science. So let's go for it. And even before we go ahead and purchase a house, let's think about something simpler. Imagine that we want to purchase a bicycle. Now, what factors would we consider when we are deciding about which bicycle we will purchase? Well, let's say we would of course consider the price, maybe we would consider the weight of the bicycle, and maybe the intended use of the bicycle. So a fairly limited number of factors, maybe three or four factors to consider. However, things really change when we want to purchase a house or a property. If you were ever purchasing a property or renting one, you know that there are really a lot of factors to consider. You should be thinking of whether you want the property to be in the city or in the countryside, what the neighborhood is like, whether there is a school facility nearby that your children can attend, what the price is, of course, and then you are deciding about whether the house has an appropriate layout. I have just listed a few.
We can then talk about location, whether the property has a nice view — a lot, a lot of factors to consider. Now we can revisit what we discussed at the beginning of the course, that our mind is limited in comprehending more than two or three dimensions at once. And remember, I'm not claiming that we can't look individually at all of these factors when we are deciding about purchasing a property; rather, we have trouble comprehending all of them at once. For example, if we have a couple of bicycles from which we are deciding, well, it will be easy for our mind to decide which one is the best, taking into account all of these factors. But our mind simply has trouble doing this if we have multiple properties and we want to evaluate them based on all of these factors. So our mind is very limited, and this will really haunt us within the exploratory approach of data science. Let me show you what I mean. First of all, let's say that we have this visualization over here. It's coming from the beautiful penguins dataset. In this case, we have a two-dimensional visualization. There are two variables in play. First of all, we have the species of penguins — the Adelie, Chinstrap and Gentoo species — and on the vertical axis we have the body mass of the penguins. And our mind, similarly to when purchasing a bicycle, can really easily see a pattern. We can see that the Gentoo species is generally much higher in body mass compared to the other two species. This is a low-dimensional problem, and our mind right away sees the pattern. However, imagine if the story was different, if the visualization was different. In this case, we have three dimensions, three variables. So we have the island where the penguins reside, then we have again the body mass, but we also have encoded a third variable, a third factor, which is the sex of the penguins. And you can already see that our mind has troubles comprehending what's going on.
At first glance, you can't see exactly what the pattern is within this visualization. I would even encourage you to stop the video and think about it for a second. All right, I presume that you have stopped the video, and I think you have discovered the pattern, but it already took your mind a couple of seconds and you had to focus on the visualization. It didn't pop out. And this is just three dimensions. Imagine that we would want to consider four or more dimensions. We would be in the same trouble as when deciding about the property to purchase. Now, let's summarize this learning story. I really hope that it provides you with an intuitive understanding of the challenge that we're facing. Within the exploratory approach, we are searching for useful and non-obvious patterns. And now the devil is hidden in the word non-obvious. Most likely, other people have been working with the data that we are working with right now. So the obvious patterns, which are in one or two dimensions, have already been discovered. And we are kind of forced to be discovering patterns in three or more dimensions, where our mind is fairly limited, and we need to be simplifying the problems, creating some data science models which then allow us to see the patterns in three or more dimensions. This also justifies why oftentimes we will be resorting to machine learning methods, as we will later in the course, because a machine learning model does not have the same limitation as our mind has. You can provide it with a huge dataset with limitlessly many dimensions, and it will find patterns in this many-dimensional space. In such cases, we are more worried about whether the patterns which the machine learning model finds are reasonable ones, whether they make sense from the domain perspective. But that's a case for a different learning story that we will have later in the course. So this was the learning story which house is the right one.
I hope you will take the key learning with you, and I'm looking forward to seeing you in the upcoming videos, where we already start with the exploratory methods. 34. Correlation: Hi and welcome to another lecture in the course Be Aware of Data Science. In this lecture we will discuss correlation. So, Robert, you have been telling us since we started the course Be Aware of Data Science that we are looking for patterns in the data. Wouldn't it be time that you finally showed us a practical example of a strong pattern that we might find, especially during the exploratory phase of a data science project? Indeed, it is the time. In this lecture we will talk about the incredibly powerful concept of a correlation, which is a prime example of a pattern that we are interested in when we are exploring a dataset. So let's go for it. And as I would like to go a bit in depth on this concept of correlation, I'm bringing also an artificial dataset, and we will be doing some calculations. Don't worry, it won't be anything scary. Also, if you'd like to play around with it, this dataset is available, together with the calculations that we're going to do, as a handout to this lecture. Alright, so the dataset that we have contains two characteristics. We have a temperature in Celsius, and then we have sunglasses sales revenues. So the observations, the rows in this case, might be days, and we are observing what the temperature was on any given day and what the revenue from the sunglasses sales was. We are supposedly an owner of a stand that is selling sunglasses. Now, we are curious whether these two are somehow connected, whether there is a relationship, because if there is, maybe we can utilize it within our daily business. For example, if there were a positive pattern, we could then say: if the temperature is going to be high tomorrow, we can expect that a lot of sunglasses will be sold, so we might keep our stand open for longer hours.
We said that our prime tool within the exploratory approach could be visualization. So we go ahead and visualize the data, and we can see right away that there indeed is a pattern: as the temperatures are increasing, the sunglasses revenues are also rising, so we can see a clear pattern being formed. Now, this is great. This is already useful for us. However, we might want to quantify this relationship; we might want to quantify the strength of this relationship. Why is that? Well, due to several reasons. For example, let's say that we don't have such a simple dataset, but we have a dataset which contains multiple characteristics, and we are relating all of these characteristics to our sunglasses sales revenue. If we had maybe 10 or 15 characteristics, it would be ten or fifteen visualizations, and all of a sudden we start to have too many visualizations. It's all becoming a bit messy. Having instead a quantification of these relationships — let's say a single number which tells us how strongly two characteristics are related — could be more practical. Secondly, having the relationship quantified will also allow us to compare the strength of these relationships. What if two characteristics are forming very similar patterns with our sunglasses sales revenues? One would be the temperature that we are seeing right now, but we would also have another characteristic, let's say the number of visitors in our park, which is forming a very similar relationship with our sunglasses sales revenues. For some reason we only want to pick one pattern, the one which seems to be the best, to utilize within our daily business. And it's kind of tricky to compare these two relationships just based on the visualizations. We want an exact quantification of the strength of this relationship. Now, this quantification is going to happen, as I mentioned already, through correlation. Correlation is simply a connection between two or more things.
If you look at the research in various disciplines, you will see that correlation is being widely used. For example, there has been a proven correlation between education level and income, food and hunger of course, sleep and happiness, and smoking and cancer of course. Remembering the idea of correlation is, I would say, one of the most important things right now. The thing to realize — we will talk about this also later — is that if things correlate, if there is a relationship, that does not mean that one thing is causing another one. With the idea of a correlation, we will not claim that temperature is causing people to buy sunglasses; but more on that in the upcoming lectures. Now let's focus on correlation. It will be a bit of math, but just bear with me, please. Okay, let's start with it. As a first step, if we want to quantify the relationship between the temperatures in Celsius and sunglasses sales revenues, we need to calculate the means of these two characteristics. We have a mean of the temperature, 18.21, and the mean sunglasses revenue, €263. The second step that we need to do is that we need to subtract the means. For example, if we over here have a temperature of five degrees Celsius, five minus 18.21 will give us -13.21. We do the same with both characteristics, and we end up with these two columns that we called a and b. Then thirdly, we need to calculate a few things — things that are maybe becoming a bit more abstract, but don't worry about it, let me go through it. So we have a times b, where we are just multiplying these two columns, and then we are squaring both of the columns, so we have a squared and b squared. As a fourth step, we just sum up these newly created columns. And you can see these are already very abstract figures, which would be tricky to interpret in the real-world terms of this dataset. But let's just finish up the calculation and I will explain.
Lastly, we need to plug the numbers that we have obtained through the sums into a formula. What is this formula? Do you have to remember it? No. As I was saying, it's not about remembering the exact calculation of a correlation. One thing to remember is that we have a way how to quantify the strength of a relationship. Secondly, it's the realization that within data science we oftentimes rely on decades or even centuries old models, which have been figured out by great statisticians and great mathematicians who have been studying the world and the nature. They figured out some very powerful methods and models, such as correlation, and we simply rely on them because they have been proven to be very powerful. And the very last thing which I would like to highlight, by calculating the correlation exactly, is that you see that there is no magic beneath. People tend to look at data science, machine learning and artificial intelligence as some scary disciplines which are full of crazy mathematics, with some magic happening beneath these models. Well, not quite. I mean, we have just walked through an example of a correlation, and as a matter of fact, when we will later in the course be building predictive models using machine learning, these rely on very similar principles to what you see right now on the screen. Basically, they will attempt to find a pattern or relationship between input features and an output feature, something that we are trying to predict. So we would be using the very same formulas over here: let's say the temperature is the input, and the sunglasses sales revenue is the output. What I wanted to highlight as a third, possibly key learning is that there is no magic beneath, and you don't have to be worried to go even for some more complex methods and study them. All right, but back to our correlation. What did we calculate? We have a result of 0.989.
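The five steps from the slide can be written out in a few lines of plain Python. The temperature and revenue values below are stand-ins with a similarly strong linear pattern, not the lecture's handout data:

```python
import math

# Hypothetical temperature / revenue pairs -- illustrative values only,
# chosen to show a strong positive linear relationship.
temperature = [5, 10, 15, 20, 25, 30]
revenue = [50, 110, 140, 210, 240, 310]

# Step 1: means of both characteristics
mean_t = sum(temperature) / len(temperature)
mean_r = sum(revenue) / len(revenue)

# Step 2: subtract the means (the columns "a" and "b" from the lecture)
a = [t - mean_t for t in temperature]
b = [r - mean_r for r in revenue]

# Steps 3-4: multiply and square the columns, then sum each one up
sum_ab = sum(x * y for x, y in zip(a, b))
sum_a2 = sum(x * x for x in a)
sum_b2 = sum(y * y for y in b)

# Step 5: plug the sums into the correlation formula
r = sum_ab / math.sqrt(sum_a2 * sum_b2)
print(f"correlation: {r:.3f}")  # close to 1 -> strong positive relationship
```

Running this yields a value very close to 1, mirroring the 0.989 we obtained on the slide for the handout data.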
So this way of calculating correlation will always result in a number between minus 1 and 1. The closer we are to minus one or one, the stronger the relationship is between the characteristics which we are measuring; we will come to an exact interpretation in just a second. For now, we can conclude that there indeed is a strong positive relationship between the temperature in Celsius and sunglasses sales revenues. But anyway, this we already saw from our visualization; the correlation quantification just confirmed what we already knew, and we can now enjoy all of those benefits of quantification that we have discussed. Now, let's continue with the interpretation. When you are looking at the correlation, as I was saying, the closer you are to one or minus one, the stronger your relationship. In our case, if we look back at the visualization, we have a pattern which is very similar to this one; that's why our correlation measure was very close to one. Now, if the relationship was of an opposite nature — that is, as one characteristic is increasing, the other one is decreasing — then we would obtain a minus one, or a value very close to minus one. Then we have all of these values in between. The worst-case scenario for us, where we do not find any pattern, is that our correlation measure would report a 0. In such a case it would look similar to this, and basically it indicates that there is no relationship between these two characteristics. So as I was mentioning, correlation is a very powerful idea. The world around us is filled with correlations, and you could reuse this concept on your own dataset to see and quantify a pattern between two characteristics. In the upcoming lectures, we'll keep on talking about correlation, as it can sometimes be a bit tricky to work with. I'm looking forward to seeing you there. 35. When Temperature Rises: Hi, in the previous lecture we learned about correlation, and it seems to be a really powerful tool that we can apply within our dataset.
However, remember what I keep saying across the whole course Be Aware of Data Science: that no data science method is sacred, and that we should always remain skeptical towards what the dataset says and towards what the method that we are using says. In this lecture we will have a learning story called when the temperature rises, and we will keep on talking about correlation. In the previous lecture, we had this simple and straightforward dataset where we were measuring the temperature in Celsius and the revenue from our sunglasses sales. And we could see that there was a really nice strong pattern, which was also confirmed by our quantification: the calculated correlation reported 0.989, pretty close to one. So it seems a very strong positive relationship. However, remember what I said, that no data science method is sacred. Well, let's keep on building our dataset. The temperatures got very high on a few days, and we collected a few more observations. So these were our original observations, from five degrees Celsius up until approximately 33 degrees Celsius, and now it gets very hot. So we collected more data points. As you can see, the pattern has now changed. It seems like when it's over 33 degrees Celsius outside, it's becoming too hot for people to go outside, so our sunglasses sales revenues are dropping. Now, this still seems okay for us as an owner of a stand which is selling the sunglasses, and we can still see a clear pattern with our eyes: if the temperature is rising, up until 33 degrees Celsius, we can maybe expect the sales to also go up; if it goes beyond that, we can expect the sales to go down. Now, what would happen if we reused the same calculation as we did in the previous lecture, the correlation calculation, on this dataset that we have added on? Just to be mindful: if you'd like to play around with this calculation, it is available as a handout to this lecture.
We have salaries — excuse me, observations — extended, and you can also play around with these numbers. If we use this calculation, it will report 0.27. What's going on? I mean, we have said that if our correlation measure is close to one or minus one, that indicates a strong relationship, either a positive or a negative one. And we have said that if we are close to 0 with our correlation, there is no relationship between these two characteristics. Something is off: based on the visualization, we clearly see that there is a relationship, but our quantification is saying that there is no relationship. What happened then? Well, the method that we used, the correlation calculation, is defined for what we call linear relationships. And what we had in the previous lecture was indeed a linear relationship: we could nicely draw basically a line through the points, almost perfectly. That's why it was 0.989. However, what we have in this lecture is already a non-linear relationship. If you would like to have it explained in technical terms, what went wrong is that the correlation measure actually has an assumption: it is well-defined if there is a linear relationship. So technically speaking, we have violated the assumption of this particular method. But I don't want to get too technical. I want to reiterate again one of the most important learnings from the course: always be skeptical towards what your dataset says; always be skeptical towards what your method says. This also brings us again back to the importance of data visualization, which is undoubtedly an important part of the exploratory approach of data science. Finally, to visualize better what I mean by nonlinear or linear relationships, I found this very nice visualization. It's showcasing the result of the correlation calculation. We can see that for linear relationships, such as the ones on the top, it is well-defined and it reports that yes, I can see a relationship.
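You can reproduce this failure mode yourself. The sketch below, in plain Python with invented numbers, builds a hump-shaped relationship — revenues rising up to about 33 degrees and then falling — and the same Pearson-style calculation lands near zero even though the pattern is obvious to the eye:

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation, computed exactly as in the previous lecture."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    a = [x - mx for x in xs]
    b = [y - my for y in ys]
    return sum(p * q for p, q in zip(a, b)) / math.sqrt(
        sum(p * p for p in a) * sum(q * q for q in b))

# Illustrative data: sales rise until roughly 33 degrees, then drop again.
temperature = [15, 21, 27, 33, 39, 45, 51]
revenue = [100, 220, 300, 330, 300, 220, 100]

r = pearson(temperature, revenue)
print(f"correlation: {r:.2f}")  # near 0, despite an obvious nonlinear pattern
```

The eye sees a clear arc; the linear correlation measure sees almost nothing — exactly the violated-assumption problem described above.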
However, if we go to the bottom and we see these nonlinear relationships — well, with our eyes we see that there is a pattern; it's just that this method is failing us. It's not well-defined for nonlinear relationships, and we would of course need to use a different method to quantify the relationship. I'm not claiming that we do not have methods which can quantify these relationships; it's just that we would need to use a different one. Alright, and that's what I wanted to say during this lecture. Always remain skeptical, and I'm looking forward to seeing you in the upcoming lectures. 36. Do storks bring babies?: Hi, and welcome to another lecture in the course Be Aware of Data Science. This lecture has a strange name: do storks bring babies? It is a strange question, but it will help us uncover a very important concept of a spurious correlation. So let's go for it. Some decades ago, in some smaller cities in Germany, a strange phenomenon started to occur. At the same time, a lot of storks were moving into the city and a lot of babies were being born. People started to think: do storks bring babies? This phenomenon was not isolated to this geographical region, so maybe even you have heard the story that storks bring babies. Anyway, it was bothering the researchers in Germany, so they started to investigate it. What bothered the researchers was something like this: is it possible that nature gave us this sort of a function or a process where storks are causing babies to be born, or maybe is there a correlation that could be useful? With correlation, you can really imagine the correlation concept that we learned just a few lectures ago. Maybe if there is a useful correlation, we could be observing the storks, and this would help us determine how many babies will be born. Well, probably since you were like eight years old, you know that storks do not bring babies. What was really going on in reality? What went down looked something like this.
There is a concept of a spurious correlation, and the underlying story was rather different. At the very heart of this phenomenon was a socio-demographic trend: young couples were moving into the city. These young couples were settling, and they had babies. So you could say that there is a form of a causation, or at least a very strong correlation. At the same time, when the young couples were moving in, they were building houses. As it turns out, these houses were an ideal nesting place for the storks. So we could rather point to this relationship. Now, the relationship that was originally observed, between babies being born and storks moving into the city, is what we call a spurious correlation. That is a concept that we should be wary of whenever it comes to data science. Let me explain. We have learned that within the exploratory approach of data science, we are searching for relationships. Now, we are going to stumble upon some relationships that are useful and can be used. For example, if we discover that young couples are moving into the city, there is an increased possibility that they are going to have a baby. Also the other relationship could be very useful: we could, for example, use it for prediction of bird migrations based on the young couples' socio-demographic trends. However, when we stumble upon a spurious correlation, we should not use it. The problem with a spurious correlation is that it is not stable, it is not reasonable. It is just a result of maybe a chance or some randomness. In the upcoming lecture, we are going to talk about why we should not be using a spurious correlation. I hope that with this example of do storks bring babies, you will remember that spurious correlation exists. And then in the upcoming lecture we will talk about football and presidential elections, and we will learn why we should not be relying on a spurious correlation if we discover it. 37. Football and Presidents: Hi there.
In the previous lecture, we started to talk about spurious correlation. Let's continue talking about it, as it is a fairly important concept. There is a famous case of a spurious correlation called the Redskins Rule. It regards the Washington football team. You could actually observe the last home game of this team, and whether the Washington Redskins won would then be a great indicator of who will win the presidential election. For those of you who are not from the US: you basically have two parties with their candidates, and the incumbent party is the one currently holding the office. And the thing is, if the Redskins won their last home game, then the incumbent party would win the elections. This case of spurious correlation actually held for over 70 years. The first time it held was with Franklin Roosevelt in 1936; the last time it held was in 2008 with Barack Obama. During these seventy years, you could have used this pattern to predict who will win the presidential election. Realistically, though, that these two phenomena have something to do with each other, that there is some sort of reasonable relationship between them, is very unlikely. This is just a coincidence and a perfect example of a spurious correlation. Since the correlation broke in 2008, it has actually reversed; it is since 2008 the other way round: if the Redskins lose, the incumbent party loses the elections. And this is a fairly practical reason why we should not be relying on a correlation that appears to have no logical justification and is most likely just a result of pure coincidence. I mean, these examples are mainly for the conceptual understanding of a spurious correlation, but you will stumble upon spurious correlations also in your daily life. For example, you are a data scientist and you see a strong correlation between coffee consumption at the office and the sales.
Now, should you rely on this correlation and maybe attempt to increase the coffee consumption in the office with the hope that the sales will also go up? Certainly not, as this is most likely another case of a spurious correlation, which will be very unstable. So now that we have seen two great examples of spurious correlation, let me summarize with some formal thoughts. At the heart of everything are causations: one thing is causing another. Unfortunately, discovering and proving causation is very hard; in scientific terms, you would most likely need a thorough and expensive experiment to prove it. That is why within data science we are rarely hoping to discover and prove a causation. We are rather, most of the time, relying on the idea of a correlation, on discovering a correlation of some sort. We are hoping that we discover two phenomena that correlate with each other and that the correlation is useful and is reasonable. For example, temperature outside correlating with sunglasses sales: a clear example of a useful correlation. However, we need to be careful, because we might stumble upon a spurious correlation. Our task is to have a sensible and skeptical mind and discard the spurious correlation from our analysis. The spurious correlation could be caused by some unseen factor, such as in the case with the storks and babies. Really, there was this unseen factor of young couples moving into the city, and the underlying mechanism was completely different than what we initially observed. Alternatively, it could be a result of pure coincidence. This was the case with the Washington Redskins rule and the US presidential elections. In any case, whether it is some unseen factor or coincidence, we should not rely on a spurious correlation, as it will likely not hold stable and will break at a certain moment in time. In the upcoming assignment, you will have a chance to practice your creative and critical thinking with these concepts. So I hope you will enjoy it.
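As a hypothetical illustration of the "pure coincidence" kind of spurious correlation, the following Python sketch (with NumPy, and with entirely made-up data) generates many completely independent random series and shows that some pairs still correlate strongly just by chance:

```python
import numpy as np

rng = np.random.default_rng(0)

# 200 completely independent, short random series ("phenomena").
series = rng.normal(size=(200, 10))

# Correlate every pair of series and count how many pairs *look*
# strongly related even though, by construction, none of them are.
corr = np.corrcoef(series)
upper = corr[np.triu_indices(200, k=1)]  # each pair counted once
strong = np.mean(np.abs(upper) > 0.7)

print(f"{strong:.1%} of the pairs correlate above 0.7 by pure chance")
```

A small percentage of nearly 20,000 pairs is still hundreds of "impressive" correlations. This is exactly why, if you search through enough phenomena (football games, elections, coffee, sales), you are almost guaranteed to find a Redskins Rule of your own.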
And that's it, what I wanted to say in this lecture. And I look forward to seeing you in the upcoming ones. 38. Introduction to Chapter 4: Hi, and welcome to Chapter four in the course Be Aware of Data Science. In this very brief lecture, I want to provide you with the goal that we are having during this last chapter of the course, and also to give you an outlook of the lectures that are ahead of us. So first of all, looking at the course structure, we are almost at the end. Within this last chapter, we will take the two remaining approaches and study them more in depth. We'll be talking about inference and predictive modelling, so our goal is to gain an understanding of these two. Now, how can you imagine this chapter? Well, until now we were only describing and exploring the sample of the data that we have available. Now comes the time to infer something from the sample about the rest of the population. Or it is about time that we build a predictive model on top of the sample, based on which we can then make some predictions about the rest of the population or about some future value. So we are tackling some really cool and powerful concepts within this chapter. Alright, now let me show you the lectures. First of all, we will start very slowly and gradually. We will talk about the sample and we will talk about the population, and about how important it is to always have it clear in your mind what your population is, what your sample is, and to also think about whether your sample is representative of the population. We also have a learning story which is focused on this topic; it's called "Is the mushroom edible?", where we will together be building a visual recognition app. It will be cool. Then we're getting more practical and talking about inference. We'll set up an inferential experiment in our imaginary organization, where we'll try to test whether our new sales strategy is really having a positive impact on our profits and revenues.
Afterwards, we will conclude the inference part, where we are just inferring one thing from a sample to the population, and we'll focus on building predictive models. I want to provide you with a deep, intuitive understanding of how you can imagine the workings of a predictive model. We have an important lecture, lecture number four; it's called "The function of nature". Then, once we have understood how we can imagine predictive models, we have an assignment which is focused on the most important thing about a predictive model: its inputs. Within this assignment, you will creatively think about the possible inputs for a predictive model that we are about to build. Then we will continue with our exploration of the predictive models. We will be asking ourselves a question: when do we really need a predictive model, and when would a more simple approach, such as a descriptive, exploratory, or inferential one, suffice? Then we will be actually building a predictive model, and you will have a chance to build one yourself. You can actually think about it as you becoming a machine learning model, because we have a powerful assignment within which we are trying to predict deer migration. It's a kind of artificial use case based on one on which I have worked in the past. So it will be a bit longer assignment, but I hope that you will enjoy it. Afterwards, it's time to conclude the chapter with more thoughts on the predictive part of data science, where we will think about why a predictive model is never perfect, why it is always a little bit off and never has perfect predictions. Then we have a learning story at the end of this chapter, which is about distinguishing dogs from wolves. It's basically a story of how a machine learning model can trick us and learn something completely different than we would have thought.
At the very end of the chapter, we'll be thinking about how our model is impacting our business or our domain, because once we have built a predictive model, we would also need to verify and test whether it's really having the impact that we intended. So you can really see that we go through all of the components of the inferential and predictive approach during this chapter. Of course, we'll conclude the chapter rather classically: there will be a checkpoint with which you can recap all of the knowledge that you gained during the chapter, and at the very end there are a few quiz questions waiting for you, with which you can test your knowledge. So this is chapter four, and I can't wait to see you within the lectures. 39. From Sample to Population (1): Welcome to another lecture in the course Be Aware of Data Science. With this lecture, we are officially opening up the next big chapter, which is about inference and predictive modelling. But before we get to the inferential methods and predictive methods, I want to set a healthy basis on which we will build in the upcoming lectures. The most important thing, from my perspective, to understand if you want to perform some inference or some prediction, is to understand your sample and the difference between a sample and the population. And lastly, to be able to define: okay, this is my population to which I will be inferring; this is my sample from which I am inferring. These are the topics of this lecture. Let's go for it. First of all, have it clear in your mind what we were doing until now: we were only working with our sample. We described the sample, we explored the sample, and whatever conclusions we had, whether the storks are bringing babies, or whether there are some useful correlations such as sunglasses and temperature outside, we were only using these conclusions within our sample. We can't say that this is true everywhere in the world.
We just can't say that whenever the temperature rises, wherever in the world, people will buy more sunglasses. We can only stay within our sample. The big thing is coming now: we'll be able to infer something from a sample about an entire population, such as, as I just mentioned, whether it is really true, or what is the probability, that wherever in the world you are, if the temperature rises, more people go and buy sunglasses. This will be the inference. Or we'll be building predictive models, which enable us to make granular predictions about some observation, some individual who is within the population but maybe was not part of our sample. So this will be the predictive part. So as I was saying, I would like to discuss a few things within this lecture. First of all, a population. Whenever you are starting some inferential experiment or some predictive exercise, I would strongly recommend that you start off by defining what your population is. I have listed a few questions which you can ask yourself, and these should help you define your population. First of all, you can ask yourself: is my population living, tangible, or intangible? I mean, as I was saying, it's easiest to imagine human populations, such as a population in a city, but you could also have animal populations; it's still kind of easy to imagine. But also objects: let's say that I'm working for a factory, and this factory has a very large building, or it's a very large facility, and I'm doing research on the light bulbs which are in the buildings. My population could be all of the light bulbs which are in the facility, which are in this factory. I will be drawing a sample and will only be examining a few light bulbs, and the population is then all of the light bulbs in the buildings. So it can also be about objects. Lastly, your population could also be intangible.
There were even experiments where the population was all of the roulette wheel spins which happened in Las Vegas, if I'm not mistaken. It doesn't even have to be tangible. Secondly, you can ask yourself: how is my population defined? I mean, usually, if we are talking about industries, we're talking about organizations which have their customers; these are the most common populations that we are defining. Here, some sociodemographic characteristics will be helping us, such as narrowing our customer base by age, maybe by occupation, and so on. Sociodemographic characteristics are the most classical definition. Secondly, we could also have a definition of a population based basically on business relevance, such as product ownership. Let's say that I am a smartphone manufacturer and I have a couple of smartphones which I'm producing and selling. The population that I'm interested in is the owners of one smartphone type which I'm selling. So I can also define a population based on business relevance. Thirdly, a very important question to ask is whether your population is fixed with respect to time. What do I mean by that? Well, we could have a population which is fixed with respect to time, let's say all the people in the city. I mean, they are changing as time goes by, but not with respect to the particular project or use case that we have in mind, such as measuring what the average height of all the people in the city is. Or you could stumble upon the more problematic case when your population is really changing with time. For example, let's say that I'm interested in the behavior of customers in a store; let's say I own a grocery store. Now, my customers are going to come by at various times of the day. A single customer will arrive on Monday morning, will arrive on Wednesday evening, maybe Friday afternoon, and maybe they will display different behavior at different times of the day.
Maybe they will also display different behavior at different parts of the month. This is what I mean by fixed with respect to time. If this is the case, then it's going to be a bit troublesome when we will be drawing a sample from such a population; we need to keep in mind the aspect of time. And lastly, this is a pretty simple one: is the population defined for us? You are going to meet some projects or use cases where, for example, if you are a penguin researcher, it's very straightforward: the penguin population is predefined for you. Let's say you are pointed to a certain island where penguins live, and that is the population that you are supposed to work with. Or the population will not be defined for you, and that's when you should be asking all of these questions and really spinning it in your mind: what is your population? This oftentimes happens, for example, when you are a product development analyst. Your organization is about to launch a new product, and you are thinking about what the target group or the population that you are targeting is going to look like. In such a case, you need to define the population yourself. So, to summarize some key takeaways: as you can see, there is a lot of freedom that you have when you are defining the population. The important thing is to always have it defined; whatever exercise you are starting, make sure that it's clear what your population is. And in the upcoming lecture, we'll continue with a similar thought on a sample. I'm looking forward to seeing you there. 40. From Sample to Population (2): Hi there. In the previous lecture, we thought about a population and how we can be defining it. In this lecture, let's continue and think for a bit about a sample. The crucial question that you should be asking yourself when you are drawing a sample or working with a sample is: does every member of the population have the same chance to end up in the sample? If we are lucky, the answer is yes.
And in such a case, we have probability sampling. Let's say that we are doing some study or an experiment at a university campus, so we would randomly select students from the campus to be involved in the study. How could such random sampling look? Let's say we would come to the university administrators and we would ask for anonymous IDs of all of the students that are on the campus. Now we would use a random generator and randomly pick some of the students who are on the campus. This would be a random sample, a probability sample. And if we have such a random sample, such probability sampling, then we are lucky, because it will be easy and straightforward to infer something from the sample to the entire population. Now, what would be the opposite case, let's say some non-probability sampling? Let's say that we wouldn't have access to the administrator's office, and we are interested in which of the students are studying data science, and then we are interested in what their behavior is. Well, we don't know, because we do not have access to the administrator's office; we don't know which students are taking the data science courses. We would walk around the campus and find one person who is studying data science, and we would ask them: hey, do you know anyone else who is studying data science? We would this way kind of snowball our way to our sample. This is unfortunately non-probability sampling, and it makes things trickier when we are inferring from such a sample to an entire population, because we always have to ask ourselves, and this is the key learning from this lecture: if we are employing non-probability sampling, always consider whether your sample is representative of the population. Previously in the course, we had one example where we drew a non-representative sample. It was the case when we were interested in people's heights, and we were measuring people just outside of a sports club.
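The campus example can be sketched in a few lines of Python; the student IDs and campus size here are of course hypothetical:

```python
import random

# Hypothetical anonymous IDs obtained from the university administrators.
student_ids = list(range(1, 5001))  # 5,000 students on campus

random.seed(42)  # fixed seed only so the example is reproducible

# Probability sampling: every student has the same chance of selection.
sample = random.sample(student_ids, k=100)

print(len(sample), len(set(sample)))  # prints: 100 100 (no duplicates)
```

The snowball approach described next cannot be written this way at all: without the full ID list there is no way to give every student an equal chance, which is precisely what makes it non-probability sampling.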
And a lot of basketball players were passing by, or other sportspeople who are, let's say, taller than the average population. On the contrary, older people, who are maybe shorter, were not passing by the sports club. This was a non-representative sample. We cannot generalize from it to an entire population; we cannot infer what the average height in the population is. This is such an important learning that in the upcoming lecture, we will also have a learning story where we will be building a visual recognition app that is supposed to recognize whether a mushroom is edible or poisonous. And on top of it, you'll be having an assignment where I will present a few scenarios, and you will be supposed to judge whether the sample that we are having is a probability or non-probability sample, and if it is a non-probability sample, whether it is representative of the population to which we would like to draw some conclusions. Another thing that I would like to mention when it comes to a sample is that you should also be considering your sample size: is your sample sufficiently large? I mean, this was more of a problematic question in the past. Nowadays it's not so prevalent in the real-world applications, where we usually have hundreds, thousands, or millions of observations. And basically the story goes that the larger your sample, the better inferences you are able to make about the population. Now, just to give you an idea about what we mean when we say sample size: if you would like to infer something about the country where I'm coming from, which has 5 million inhabitants, you would need a sample size of approximately 10 thousand observations. So we would really need to collect 10 thousand heights to be able to infer what the average height in this country is.
But as I'm saying, it's not so problematic nowadays with the sample sizes. What really matters is to always consider whether you have probability or non-probability sampling, and whether your sample is representative of the population. Now that we have talked about both a population and a sample, I think we are ready to start with the inferential and predictive approaches. But as I was saying, before we go there, there is a learning story waiting for you, and an assignment. I'm looking forward to seeing you there. 41. Is the mushroom edible?: Hi, and welcome to another lecture in the course Be Aware of Data Science. This time we have a learning story. In one of the previous lectures, we discussed a population and a sample, and we were saying that having a representative sample is really important, because if we're not having a representative sample, we will not be able to make a good inference about a population, or we will not be able to build a strong predictive model. So here comes the learning story; it's called "Is the mushroom edible?". We have a clear business idea: we want a mobile app with which you can just point your smartphone camera at a mushroom, and the app will tell you whether the mushroom is edible or poisonous. Ideally, it would also tell you what kind of a mushroom this is. This would be a standard task for data science and machine learning capabilities. Now, of course, we need some sort of data on which our machine learning model would learn and would then be able to recognize between the edible and the poisonous mushrooms. We have decided to obtain the data by scraping from the internet pictures that people have posted from their mushroom-picking ventures. So let's say that someone was out mushroom picking and they found a mushroom; they took a picture and posted it to the Internet for everyone to see.
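The effect of sample size on inference can be illustrated with a quick simulation. This Python sketch uses a made-up stand-in population of heights (random numbers, not real census data) and shows that larger samples land closer to the true population mean on average:

```python
import random
import statistics

random.seed(1)

# A made-up stand-in population of heights (mean ~170 cm, sd ~10 cm).
population = [random.gauss(170, 10) for _ in range(100_000)]
true_mean = statistics.fmean(population)

def mean_error(sample_size, trials=100):
    """Average distance between a sample's mean and the true population mean."""
    errors = []
    for _ in range(trials):
        sample = random.sample(population, sample_size)
        errors.append(abs(statistics.fmean(sample) - true_mean))
    return statistics.fmean(errors)

err_small = mean_error(100)
err_large = mean_error(10_000)

# The larger sample's estimate sits much closer to the true mean.
print(f"n=100: off by ~{err_small:.2f} cm; n=10000: off by ~{err_large:.2f} cm")
```

This is the intuition behind "the larger your sample, the better inferences you are able to make": the typical error of the sample mean shrinks as the sample grows.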
And we have decided to scrape these pictures from the Internet and use them as what we call training data for our machine learning model, for our predictive model. Now, even though we have not yet talked about machine learning, that's not an important part of this learning story. We care about our sample that we have collected and from which we want to learn. Now, I would encourage you to stop the video and think about what's going to be a problem. Are we going to be able to build a good predictive model which we can then use in this visual recognition app that people can just download and then point at a mushroom, and the app would recognize whether the mushroom is edible or poisonous? Really, please stop the video and think about the sample that we have collected. Alright, I presume that you paused the video and gave it a try. Now, we're not going to build a good predictive model based on the sample that we have collected. There will be a problem with the app and the model that we will build: it will be great at recognizing edible mushrooms, but rather poor at recognizing poisonous mushrooms. Why is that? Well, the data that we used for the training were problematic. They were not representative of the population in nature. Let's simplify things and say that there is a ratio of 50% edible mushrooms and 50% poisonous mushrooms, so 50/50. So, for example, you have an equal chance of stumbling upon an edible mushroom as you have of stumbling upon a poisonous mushroom. But what about the images that people take, that we used as training data for our predictive model? People are way less likely to pick up a poisonous mushroom, take a picture of it, and of course post it online. This is why the ratio of edible to poisonous mushrooms in our data will be skewed: for example, 95% of the pictures will be of edible mushrooms and only 5% of the pictures will be of poisonous mushrooms.
The model will then really be primarily focused on recognizing, and being good at recognizing, edible mushrooms, but it will be quite poor at recognizing poisonous mushrooms. Or in other words, most of the images that our model sees during the training process are of edible mushrooms. So I hope that this little learning story provided kind of an intuitive example of why the sample needs to be representative. This is true of whatever we are doing, whether we are doing some statistical test from which we are drawing an inference, which we'll do in one of the upcoming lectures, or we are building a predictive machine learning model. Always think about whether your sample is representing the population well, because we might stumble upon unfortunate scenarios like in this case, when we didn't represent the correct ratio of edible to poisonous mushrooms in our sample. I'm looking forward to seeing you in the upcoming lectures. 42. Inference: Experiment Setup: Hi, and a warm welcome to another lecture in the course Be Aware of Data Science. With this lecture, we are kicking off a chapter in which we study inference and predictive models. In this particular lecture, we will focus on inference, and we will set up a little inferential experiment, which we will then continue working on in the upcoming videos. But before we get there, let me remind us of where we are right now and what we are really trying to do. Basically, within this chapter, we're focused on the methods through which we can learn on a sample and then infer something about the rest of the population from which the sample is coming. Or, alternatively, we can build a predictive model that will be able to make granular predictions about individual observations within our population. Though, as I'm saying, we get to predictive methods later. Now we are focused on inference: we will learn from a sample and infer, generalize, to the rest of the population. Let's set up the inferential experiment.
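A tiny Python sketch of why the skewed ratio is so dangerous: a "model" that lazily predicts "edible" every time looks very accurate on such data while catching zero poisonous mushrooms (the 95/5 numbers are the hypothetical ones from the story):

```python
# Hypothetical class ratio from the scraped photos: 95 edible, 5 poisonous.
labels = ["edible"] * 95 + ["poisonous"] * 5

# A lazy "model" that always answers "edible", i.e. the majority class.
predictions = ["edible"] * len(labels)

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
poisonous_caught = sum(
    p == "poisonous" for p, y in zip(predictions, labels) if y == "poisonous"
)

# High accuracy, and yet the app is useless (and dangerous) in the forest.
print(accuracy, poisonous_caught)  # prints: 0.95 0
```

A real model trained on such data is not quite this lazy, but it is pulled in the same direction: getting the abundant class right is rewarded far more than getting the rare class right.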
Imagine that we are working for a company which is, of course, making some sales. Now, these sales vary in sale size, that is, in how much each sale was for. As you can see, this is our past historical data from the sales. You can see that occasionally we are unfortunate and the sale we make is only for 80 or €90. Most of the time, we are able to make sales for approximately €100, and occasionally we are lucky and the sales are for 110 or even €120. Now, the business colleagues are coming to us and they're saying: hey, we have this belief, we have this hypothesis, that if we manage to increase the average sale size, it will really benefit our business. We're looking at the visualization that we have, and basically what this goal set by our business colleagues means is that we will attempt to move the mean sale size to the right. We want to increase the average sale size. So this will be the goal of our inferential experiment. Now, having the experiment defined from a business perspective, let's translate it to data science terminology. Basically, what we will be doing is that we'll be working with two populations, or we'll be thinking about two populations. I will take it slowly, as if you're seeing this for the first time it can be a bit confusing. There is a population which is defined by our old sales strategy. It's the customers that we have right now; we're selling to them through our old sales strategy. And we need to imagine that there is potentially another population. It's the same humans, it's the same customers, but they will be defined by our new sales strategy. Now we can go back a few videos, a few lectures, where I was talking about how we are defining populations: it doesn't only have to be a set of individuals or a set of humans; we can incorporate some business logic into defining this population of our customers. Even though it's the same humans, the ones on the left side are defined by the old sales strategy.
The ones on the right side are defined by the new sales strategy. Basically, what our business colleagues attempt to do is that they hope to change the population, and they're hoping that the new population, defined by the new sales strategy, will have a higher average sale size. You may be asking: why don't we just do it? Why don't we just change the sales strategy and hope that it increases the average sale size? Well, that might not be a good idea, because it's a hypothesis; we don't know whether the new sales strategy will really increase the average sale size. It might very well happen that the effect will be the opposite: maybe our customers will think that we are bothering them with the new sales strategy, and we would instead lower the average sale size. So we shouldn't just take the new sales strategy now and deploy it to the entire population; we should be smarter about it. And that's where the power of inference and the inferential experiment comes into play. So what we will do is that we will think about collecting samples from both of these populations. We want to sample from the old sales strategy, and we want to sample from the new sales strategy. How can this be translated to real terms, to the real world? Well, we will just take our old sales strategy and attempt to sell to maybe 15 of our customers using the old sales strategy. For the new sales strategy, we take it and attempt to sell to 15 customers using this new sales strategy. We will use maybe some random sampling and randomly pick these two groups of 15 customers, and that will provide us with the necessary data that then allows us to compare these two populations. We will then be looking at the samples, observing whether there is a difference in the average sale size. I mean, if our business colleagues are right, then we will be hoping that the sample on the right side, which is defined by the new sales strategy, will have a higher average sale size.
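The random sampling step of the experiment could be sketched like this in Python; the customer IDs are hypothetical, and only the group sizes of 15 come from the lecture:

```python
import random

random.seed(7)  # fixed seed only so the example is reproducible

# Hypothetical customer IDs; in reality this would be our customer database.
customers = [f"C{i:04d}" for i in range(1, 1001)]

# Randomly draw 30 customers and split them 15/15 between the strategies,
# so every customer had the same chance of landing in either group.
chosen = random.sample(customers, k=30)
old_strategy_group = chosen[:15]
new_strategy_group = chosen[15:]

print(len(old_strategy_group), len(new_strategy_group))  # prints: 15 15
```

The random split is what makes this probability sampling, in the sense of the earlier lectures: neither group is self-selected, so a difference between them can be attributed to the strategy rather than to who happened to end up in which group.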
But that's not everything; we cannot just stop at comparing the two samples. We will also use a statistical model, a statistical test from the world of inferential statistics, where we will check whether the difference that we're seeing between the two samples is generalizable to the entire population. So there will be two steps: we collect the samples, compare their means, and then, as a second step, we use a test from inferential statistics. It will be a lot of fun. I hope this video made clear how we're setting up our inferential experiment, and in the upcoming videos we get to solving it. Looking forward to seeing you there. 43. Inference: Statistical Test: Welcome, and let's together conduct our little inferential experiment. Let's just briefly recap what we will do. We will attempt to sell to a small portion of our customers using the old sales strategy, and then we'll attempt to sell to a small portion of our customers with the new sales strategy. Then we will be comparing these two samples, observing whether we managed to increase the average sale size, and we will also rely on a test from inferential statistics. Alright, let's do it. So here are the results. First of all, I need to start off with what we don't know. We have managed to influence the population, to change the population the way we intended. You can see that originally, with the old sales strategy, the mean sale size was €100, as we knew already; with the new sales strategy, the mean sale size is €104. However, we don't know this. We have not measured the entire population, neither the old one nor the new one. I'm just putting it here for our reference and for our comparison. What we do have, however, are our two samples. You can see that our sample with the old sales strategy contains 15 observations: we have attempted to sell to 15 customers with the old sales strategy. Now we are measuring the mean, which is €99.689.
With the new sales strategy, we have also attempted to sell to 15 customers, and the sample mean is €103.07. If you would like, I have also visualized these two samples here on the left side. Now that we have these two samples, it's time to ask the big question: did the new strategy increase the average sale size? I mean, if we were looking just at the samples, our answer would be definitely yes; this is approximately a 3% difference. However, this would be a wrong conclusion. It is only true within our sample; we are just comparing the samples. We now need to face the results of our experiment with a statistical test to verify whether this difference in means is generalizable to the entire population. What is this statistical test going to help us with? Well, it's going to consider a few things, and it's going to report to us, or indicate to us, whether the change, the difference in means, could have been caused by pure randomness. Or rather, it's going to indicate the probability with which the difference in means could have been caused by randomness. It's going to consider the sample size, of course: the smaller the sample we have, the higher the probability that we have just encountered some random noise. We ideally want to have a larger sample size. So this will be the first consideration of this statistical test. The second consideration will be a measure of dispersion. I mean, even though we are comparing a measure of central tendency, or a measure of position, which is really the mean, we should also be looking at the dispersion of our two samples. When you look at these two small hills, these two distributions which we have, they are overlapping. Now, what if they were overlapping just very little? We would have one narrow distribution over here which is just slightly overlapping on the side with the other distribution, or vice versa.
A worse example would be if these distributions were very wide: the dispersion, the spread, is large and they are overlapping quite a lot. In such a case, if they are overlapping a lot because they are wide, it might really be happening that the difference is just caused by pure randomness. Thus, the statistical test is also going to consider the spread, and not just a measure of central tendency. Now, the statistical test that we will perform is called the t-test. Do you need to remember that we are using a t-test for this particular thing? Well, not quite. What matters more is that you focus on the intuition: whenever we are facing a task where we have defined ourselves a hypothesis, just like we did in the previous video, and we collect the samples and would like to compare them, what is key for you to remember is that you can reach out to the world of inferential statistics, and there will most likely be a test available for you. I just picked one of these tests, one of these statistical models, which was defined years, decades, or maybe even centuries ago and has been proven to be useful. So we're just reaching out to an appropriate test and reusing it for our project, our use case. Now back to our t-test. When we plug the data from our two samples into this t-test, it's going to report something like this: the probability that, given a chance model, a result as extreme as the observed result could occur is 12%. Now that's a really tricky formulation. Here's the thing: statisticians are very careful about how they interpret the results of these statistical tests. And rightfully so, because many people have the misconception that if you conduct these tests, you get a definite answer, that yes, there is a difference between the means, or no, there isn't a difference between the means. Now, that's not right.
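The t-test just described can be run in a few lines. Here is a sketch on simulated data: the sample sizes and population means follow the lecture, while the spread of €10 and the random seed are my assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulated sales: 15 customers per strategy, population means of
# 100 and 104 euros as in the lecture; the spread of 10 euros is assumed.
old_sales = rng.normal(loc=100, scale=10, size=15)
new_sales = rng.normal(loc=104, scale=10, size=15)

# Two-sample t-test: how probable is a difference in means this extreme
# under pure chance?
result = stats.ttest_ind(old_sales, new_sales)
print(f"old mean {old_sales.mean():.2f}, new mean {new_sales.mean():.2f}")
print(f"p-value: {result.pvalue:.3f}")
```

With samples this small, the p-value will often land well above the usual 1-5% thresholds, mirroring the 12% result from the lecture.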
These tests just indicate the probability that the result could be caused by randomness. Anyway, the number that we have gotten back is 12%. Well, that's kind of a high chance that the difference between the sample means that we are seeing is just caused by randomness. Usually we are hoping for, let's say, 1%, 5%, or even less, to be really certain that the difference in means was caused by us, by our new sales strategy, and not just by some randomness. So we have actually failed. We cannot conclude from this experiment, even though we are seeing a difference in means between the samples, that it generalizes; the statistical test is telling us that we cannot rely on what we have observed within these samples being generalizable to the entire population. Now, this is unfortunate. Can we do something about it? Of course there is something we can do about it, and we will do it in the upcoming lecture. Just to sum up the lecture: we have collected the two samples, we have compared them, and then we have faced them with a statistical test, in this case a t-test, which says that there is a rather large probability that the difference in means which we are observing is just caused by randomness. 44. Inference: Solving and Summarising: Hi there. Our last lecture was sort of unfortunate. Even though we conducted our experiment beautifully, it ended in an unfortunate way, because we cannot come back to our business colleagues and say: hey, the change in sales strategy which you have performed is really influencing our customer base; you can really change the population of our customers and increase the average sale size. We have failed to prove it. In this lecture we will of course continue with this example, as it's not the end of the story, and we can address this. As we were saying, one of the things could be the sample size.
What if we addressed the issue with a larger sample? Because here's the thing: we know, as we have seen in this lecture, that there is a difference between the population means; it's just not showing up sufficiently within our samples. So instead of collecting 15 observations, we would collect 80 observations. And now we are seeing the difference between the samples, which again is approximately 3-4%. But here's the thing: if we plug this data into our t-test, it's going to report that the probability that, given a chance model, a result as extreme as the observed one could occur is less than 0.01%. So now, with a larger amount of data available, the statistical test can say that there is a pretty low probability that the difference in means is caused by randomness. So this time we have actually succeeded with our experiment and with our statistical test. The only thing that was really needed was to collect larger samples, which showcase the difference between the populations better. The second thing that we can do is to address it via a more aggressive sales strategy. What do I mean by it? Well, let's say that the business colleagues come to us and say: we're thinking about calling our customers every second day. And we would be like: maybe no, maybe we should try to call them every day, which is of course way more aggressive, but maybe it will change the population much, much more. There will be a much larger difference between the two samples. So what's really happening is that because of the new, more aggressive sales strategy, the new mean sale size is €110, unlike before when we had €104; the populations are further apart. What happened now is that we collected 15 observations, just like in our previous case. But now the populations from which the samples are coming are further away; they are more separated. You can also see it on these distributions on the left side.
Again, even if we take the smaller samples and plug them into our statistical test, it is going to report 0.01% or less. So again, we could come back to our colleagues and confirm: yes, this change which you are doing seems to be making sense; it really seems to be changing the population. Alright, so let me summarize what we just did. We had really three attempts at collecting the samples and comparing the sales strategies. Our first attempt was kind of unfortunate: we collected 15 observations, and what we of course didn't know were the true population means, which were €100 and €104. The statistical test concluded that, based on the data we provided, the difference in means is not generalizable to the entire population. So we cannot really say that our new sales strategy will be changing the population. In our second attempt, we collected larger samples; we had 80 observations. And this is helpful: it provides more evidence to the statistical test that there is a difference in the means. The test is telling us not a definite yes; it's more like a 'maybe', but we can rely on it and work with it further. We can come back to our business colleagues and say: yes, the new sales strategy seems to be working, you can now deploy it to the entire customer base, and thanks to it we can expect increased revenues. In our third attempt, the story was similar, but in this case we decided to use a more aggressive sales strategy. You can see that the difference in means is ten, so from €100 to €110. And now this difference in population means is so large that even if we drew smaller samples, such as 15 observations per sample, the statistical test reports that maybe we could be relying on this difference. So we would again come back to our business colleagues and say: yes, there seems to be a change in the population.
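The three attempts just summarized can be simulated side by side. This is a sketch with made-up data (the €10 spread and the seed are assumptions); it shows how both a larger sample and a larger difference in population means drive the reported probability down.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def experiment(n, old_mean, new_mean, spread=10):
    """Draw one sample per strategy and return the t-test p-value."""
    old = rng.normal(old_mean, spread, size=n)
    new = rng.normal(new_mean, spread, size=n)
    return stats.ttest_ind(old, new).pvalue

# The lecture's three attempts:
print(f"attempt 1 (n=15, means 100 vs 104): p = {experiment(15, 100, 104):.4f}")
print(f"attempt 2 (n=80, means 100 vs 104): p = {experiment(80, 100, 104):.4f}")
print(f"attempt 3 (n=15, means 100 vs 110): p = {experiment(15, 100, 110):.4f}")
```

The exact numbers vary from run to run, but attempts 2 and 3 will typically report far smaller p-values than attempt 1: more data, or a stronger effect, makes it easier to rule out pure randomness.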
Now why am I saying 'maybe' rather than 'definitely yes'? It's as I was saying: statisticians are very careful with how they interpret the results of a statistical test. It's never certain; it's just a test based on the provided data, but we can certainly rely on it. What can we take away from attempts two and three? As I was saying, we can now generalize one thing: if the new sales strategy is applied, the average sale size might increase, which will of course be beneficial to our business colleagues, as they can now deploy the new sales strategy to the entire population. But be mindful: we are only managing to generalize one thing, about the sales strategy, as this is the difference that we were testing. In the upcoming lectures we are going to continue to learn from the sample, but we will continue in a different way: we will be building predictive models. This now concludes our little three lectures about statistical inference. I thank you for being part of these lectures. 45. The Function of Nature: Hi, and welcome to another lecture in the course Be Aware of Data Science. My name is Robert, and we have a lecture ahead of us called The Function of Nature. Even though this lecture has a strange name, basically what we are starting with is the predictive approach of data science. We will attempt to build predictive models with which we can make predictions about individuals, or individual observations, within our population. And the function of nature is basically my view on predictive modelling. So here's the thing: are data scientists some sort of fortune tellers, claiming to be able to predict the future? Fortunately, no. We're just simplifying the world around us, as depicted in this picture. Basically, data scientists believe that many processes in the world around them are happening through a specific mapping function provided by nature.
Let's say this mapping function, as you can see in the picture, has some inputs, and then it has a certain output or outcome. Now, is the term 'function' ringing some bells in your mind? Certainly it should. We have the same notion of functions in mathematics as well as in programming: it is a box which takes inputs and produces an output. For example, in mathematics we could have the function x squared, where the input goes in place of our x and is squared. If the input is 2, the output would be 4. Within programming, we could provide a certain string and paste it together with another string, and we would have a result such as 'The digit you typed is number five.' The same notion of functions can be found in mathematics as well as programming. Now let's come back to predictive modeling and talk about a concrete example: imagine the desire to purchase an ice cream. Is it a completely random process that is not influenced by anything, whether you go and buy an ice cream right now? I'm pretty sure that it is not a random process. If the weather is warm, if your partner says that he or she would like an ice cream, or if you just saw an ice cream commercial, you might be more likely to purchase an ice cream. Similarly, some factors can negatively influence you and you will be less likely to go, such as being on a diet or having a sore throat. Inside of your mind there is this complex function designed by nature, which takes into account all of these positive and negative influences and makes the final decision whether you now stand up, go, and purchase an ice cream. Now before we proceed, I would like to mention one thing. Until now we have talked about individual patterns, or multiple individual patterns. You could say that we are hoping to learn from a sample and generalize to the population. Now, why did we change our view to this mapping function?
It is because one such mapping function consists of multiple of these individual patterns. They are combined, at the same time, into the same mapping function, mapping the input to the output. For example, it is never the case that only the weather influences whether you will go and purchase an ice cream; it will be a multitude of patterns, like we can see in the picture. And then, as I said, we combine these patterns into this little box, into this mapping function. Alright, let's assume that we have grasped this basic idea of a mapping function. What do we do with it? Well, we will attempt to estimate or approximate this function. We want to construct a little box to which, if we provide the relevant inputs, it will as accurately as possible map them to the outcome. If we manage to do that, we can then reuse this little box for the purpose that we have in mind. For example, we can predict in the morning how many customers will arrive at our ice cream stand and purchase ice cream. Thanks to an accurate prediction, we can then stock appropriately and have an appropriate amount of personnel in place. So this is really what someone means when they say: I am going to build a predictive model. They see a function, they believe that they can estimate or approximate it, and they have a clear benefit in doing so. Like with our ice cream stand, we see a clear benefit in estimating or approximating this function. In the upcoming lectures we're going to explore what's inside of this little grey box. Clearly we can use various tools, approaches, and so on. However, before we go there, I will now take a little break and talk about probably the most important aspect of a predictive modelling exercise. The most important thing in the whole estimation and approximation are the inputs.
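The mapping-function idea above, several positive and negative influences combined into one box that produces a decision, can be written out as a tiny toy function. All the inputs and scores below are made up purely for illustration; nature's real function is of course far more complex.

```python
# A toy "function of nature" for the ice cream decision.
# Every weight here is an invented illustration, not a real model.
def wants_ice_cream(warm_weather, partner_asked, saw_commercial,
                    on_a_diet, sore_throat):
    score = 0
    score += 2 if warm_weather else 0    # positive influences
    score += 1 if partner_asked else 0
    score += 1 if saw_commercial else 0
    score -= 2 if on_a_diet else 0       # negative influences
    score -= 3 if sore_throat else 0
    return score > 0                     # the final decision

print(wants_ice_cream(True, True, False, False, False))   # True
print(wants_ice_cream(True, False, False, False, True))   # False
```

The point is the shape of the box: many inputs go in, one outcome comes out, and predictive modelling is the attempt to approximate that box from data.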
If we are not using inputs that are well representative of the mapping function at hand, we can use the best predictive method out there, but we will still fail. The inputs which we are using are the most important thing. To emphasize the importance of high-quality inputs, after this lecture we are going to have an assignment where you will creatively think about the inputs that we can collect data about and then use for the estimation or approximation of this mapping function. Afterwards, we'll dig deeper and examine how exactly we can build this box which estimates or approximates the function of nature. So this is what I meant when I said 'the function of nature'. I think it's the easiest way to imagine what we mean when we say we are going to build a predictive model. In the upcoming lectures we will dig deeper into that. I'm looking forward to seeing you there. 46. When do we need a predictive model?: Hi, I hope you enjoyed the little assignment that we had with identifying relevant inputs for the predictive model. Now it would be time to start building predictive models. But before we go there, I want to provide this brief lecture and ask ourselves a question: when do we really need a predictive model? There is a lot of talk around the predictive part of data science, and there is a good reason for it: it has a lot of potential applications for various organizations. However, before I explain to you how to build a predictive model, I want to highlight one important thing: you should never jump straight into predictive modelling. You should first consider the simpler approaches that we already discussed: the descriptive approach, the exploratory approach, and the inferential approach. This is because of multiple reasons. First of all, these tend to be simpler, and it might just happen that they will suffice for the project or the use case that we have in mind.
For example, let's say that you are facing a customer churn problem. A business colleague comes to you and says: hey buddy, we need a predictive model. It would predict which customer has a high potential to churn and generate for me a signal. With such a signal, I can then send an e-mail to the customer with a special offer and we can prevent this customer from churning. I mean, of course you can just go ahead and build a predictive model. Or you take it step by step. You start by describing your data and understanding what is really going on within it. Secondly, you explore and search for patterns, and perhaps you stumble upon an interesting churning pattern: the customers which have churned from your customer base in recent months have, just before they churned, stumbled upon a technical issue within your product. And you can see this in your application logs. So there is a solution to your churning problem. And of course it is way easier to do this little bit of data exploration and pattern recognition exercise as opposed to building a complex predictive machine learning model. So the simpler solutions might already suffice, and if we are taking it step by step, we are picking the simplest solution that will answer the problem at hand. Secondly, you really should only jump to a predictive approach if you need granular, observation-based predictions. What do I mean by it? Well, let's go back to our example of an ice cream stand. If I want to understand how the weather influences my sales, I do not need a predictive model; I only need a pattern to understand it. If I need to understand which ice creams are frequently bought together, I also do not need a predictive model. On the other hand, if I need a fairly accurate estimation of how many customers will arrive at my ice cream stand on any given day, I need a predictive approach. I need to know, on a day granularity, how many customers I can expect to come to my stand.
Only resort to a predictive approach if you need granular, observation-based predictions. With this brief lecture I wanted to highlight a few reasons why we shouldn't jump directly into predictive modelling and why we shouldn't disregard the simpler approaches that we have discussed until now. I'm looking forward to seeing you in the upcoming lectures. 47. Building a Predictive Model: It is finally time that we build a predictive model. Hi, and welcome to another lecture in the course Be Aware of Data Science. In this lecture we continue with the exploration of the predictive approach of data science. In one of the previous lectures we discussed the function of nature, this mapping function which we are attempting to approximate or estimate. If we manage to do so, if we manage to build this grey box over here, it might be super useful for us. For example, we're working with the case that we are the owner of an ice cream stand. If we have this box, we are able to provide it with the relevant inputs, and the predictive model will make a prediction about whether or not a certain individual comes and purchases an ice cream, or how many people will make the decision to come and purchase an ice cream. To build this box, we need to break it down into smaller components. We will walk through these step by step. We will be talking about the inputs; then we assign weights, or importances, to the inputs; afterwards we summarize, and having summarized, we can make a final prediction. First of all, let's talk about the inputs. By the way, I hope that you have enjoyed the assignment where you could creatively think about what inputs could be relevant for the project that we were solving. Something similar is happening within every project and every use case. It's usually our domain knowledge telling us what inputs we would like to have; this is what we discussed already.
We definitely have some ideas about what could be driving the decisions of whether people go and purchase an ice cream. On the other hand, on the downside, there is the limiting data availability. For example, we will certainly not have data points about whether a customer is having a sore throat or not. It would be a greatly useful data point; we just don't know whether some customer is having a sore throat, which is of course negatively influencing his or her decision to come and buy an ice cream. As owners, we of course want a predictive model that has as much predictive power as possible. Oftentimes you will hear that how successful we are is primarily dependent on the availability of relevant inputs; in other words, on how many of these significant inputs, and in what quality, we can grasp and collect data about. If we really only have data available about what day of the week it is, our model will not be perfectly accurate, far from it. On the other hand, if we have very rich datasets that cover the weather, that cover historically which offers we made to the customers, and much, much more, then we can create a powerful predictive model. We will now put together the set of inputs that are available. Of course, assuming that all of these are really relevant for the mapping function at hand is not exactly correct. We would only select a subset of these factors for the estimation or approximation. We can either do this manually, or we can use the statistical tools which are available. We would call this a feature selection or feature elimination process. Before we proceed, I would like to mention one thing. Hopefully now you can appreciate why some mapping functions are inherently trickier to estimate or approximate than others. For example, it is rather difficult to estimate your political views, as these are shaped by demographic, social, and economic factors; there are a lot of relevant inputs.
The data scientist might simply not have the data about them. On the other hand, it could be way simpler to estimate whether you will click on an online promotion; there is a load of digital data available about you and your online behaviour. So hopefully now you can appreciate why some mapping functions are inherently trickier to estimate than others. Having the inputs sorted out and on the table, let's proceed to the importances, the weights of the inputs. Not every input is equally important. For example, weather and day of the week might be way more important as inputs as opposed to, for example, what offer we are making; whether we have a 5% or 10% discount doesn't matter so much, it matters way more whether it's sunny weather outside. So not every input is equally important. The inputs also have different effects. Some have a positive effect and increase the probability that the person comes and purchases an ice cream, while others have negative effects and lower the probability that a person comes and purchases an ice cream. The question which you might now be having is: how do we assign these importances, these weights? We can go two ways. First of all, we can assign them manually. This is what we would call a heuristic or expert-based model. For example, we would say: it's Saturday, and on Saturdays we expect 100 customers to arrive at our ice cream stand. If it is Saturday and on top of it the weather is sunny, we would expect an additional 50 customers to arrive. So in total, we would expect 150 customers. This is an example of a heuristic, expert-based model. Going manual is of course not the only way. The importances can be learned by an algorithm. This is where machine learning comes into place. Remember what we said about machine learning: it is capable of learning patterns from historical data by itself.
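Learning the weights from historical data can be sketched with the simplest possible learner, a least-squares linear fit. Everything in this snippet, the inputs, the hidden weights, the noise level, is invented for illustration; the point is only that the weights come out of the data rather than being assigned by hand.

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulated history of the ice cream stand (all numbers are assumptions):
n = 200
temperature = rng.uniform(10, 35, n)         # degrees Celsius
is_weekend = rng.integers(0, 2, n)           # 0 or 1
discount = rng.choice([0.0, 10.0, 30.0], n)  # percent

# "Nature's" hidden weights that generated the customer counts:
customers = (20 + 4.0 * temperature + 50.0 * is_weekend
             + 1.5 * discount + rng.normal(0, 10, n))

# Learning the weights from data: a least-squares linear fit, the
# simplest version of what a machine learning model does here.
X = np.column_stack([np.ones(n), temperature, is_weekend, discount])
weights, *_ = np.linalg.lstsq(X, customers, rcond=None)
print("learned weights (intercept, temperature, weekend, discount):",
      weights.round(1))
```

With enough data, the learned weights land close to the hidden 20, 4.0, 50.0, and 1.5 that generated the history, without anyone ever typing them in.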
Well, the weights, the importances, are the learned patterns; that's what the machine learning model is coming up with, and we do not have to assign them manually. So hopefully now you can appreciate why machine learning gained so much popularity in recent years. Our datasets are fairly large; we are oftentimes working with dozens, hundreds, or even thousands of inputs, and going through these manually and assigning the weights, the importances, by hand could be a very tedious exercise. Instead, we can rely on a machine learning model that is capable of automatically assigning these importances or weights. So now, having the inputs sorted out and having the weights assigned, the third step is that we summarize. Basically, we need to take a holistic look at all of our inputs and the importances that we assigned, to make the final decision. For example: the weather is sunny and warm, we do not have any special offer at our ice cream stand, and it is a weekend. Thus, we can make a final prediction that a certain person will come and purchase an ice cream. Or we can have a different setup of our predictive model, and it would rather be predicting how many people make the decision to come and purchase an ice cream. This last step is really a summarization of all the inputs with their respective weights, their respective importances. To sum up this lecture a little bit: we just went through all of the components that are needed for us to make a predictive model. You will have an assignment coming up where you will have the chance to be the predictive model: you will be coming up with these importances, these weights, for a particular use case from my past that I worked on. I hope this assignment will give you more intuition about how the importances are assigned. But for now, one more thing, which you might have heard about.
There are various types of machine learning algorithms: decision trees, linear models, random forests, neural networks. They all work in a slightly different way. But if you intuitively want to imagine what these are doing, well, it's what we just discussed about how we build a predictive model. I'm pretty sure you can keep this intuition in your mind, and you can well imagine what a machine learning model is doing when we are building one. Alright, then the final question for this lecture is: what do we do now? How can we utilize this predictive model that we have just created? So we have our little box. Tomorrow morning, when we are deciding how much stock, how much ice cream, we make for the day, and how many of our employees we invite to be at our ice cream stand, we can use our predictive model. It is sunny, it is Saturday, and we are offering a 30% discount. We take these values and we provide them as input to the predictive model that we have created, and it's going to provide us with a prediction that 170 customers will arrive at our ice cream stand. So this is how we will use our predictive model. And I have to reiterate two statements. First, can we expect perfectly accurate predictions? No, we cannot. We have created a model, a simplification of the phenomenon that we are studying. Our predictions will not be perfectly accurate. It will not happen that exactly 170 customers come to our stand; it can be 168 or 172. But we are hoping that our predictions are accurate enough that they create a business benefit for us when it comes, for example, to how much ice cream we should make and how many of our employees should arrive at the stand. And then one more point, concerning causation: can we claim that the inputs are causing the outputs? Certainly not. We have just discovered some form of a correlation.
We have mapped the inputs to the output, but we cannot claim that we have discovered a causation; we just have a useful estimation of a function which was provided to us by nature. All right, so this is everything that I wanted to say. I hope you now have the confidence that you can build a predictive model, because in the upcoming assignment I will ask you to do so. 48. Predictive Model Types: Hi, and welcome to another lecture in the course Be Aware of Data Science. We have already learned the basics of constructing a predictive model. There are various types of these predictive models, and it really depends on the application which we decide to go for, the problem that we're solving. So in this lecture I would like to discuss some of these predictive model types. The very first distinction between predictive models that we should make is whether we have a target feature available or not. What do I mean by target feature? Well, this is the data about the outcome that we are interested in. The most common scenario is when the target feature is available. This is what we call supervised learning, or a supervised problem. The example that we were solving with the ice cream stand, estimating how many customers will make the decision to arrive at our stand, was an example of a supervised learning problem: we had the data from history of how many customers arrived at our stand on any given day, for the period that we were collecting the data. Now, if we are within the supervised realm, if we are facing a supervised problem, this is usually the simpler scenario, because we have the supervisor available. What do I mean by a supervisor? Well, the target feature. And it's very important to have the data about the outcome which we are trying to predict. Because remember, we are trying to learn some patterns, either us or the machine learning model that we are using. Now, whatever pattern our machine learning model picks up, it can check against the outcome.
It can check against the target feature whether a pattern is feasible. For example, the machine learning model picks up that on Saturdays more customers come to our ice cream stand. It can then verify this against the historical data about Saturdays and see if the pattern is feasible and whether the model should keep it; then it moves forward and searches for different patterns. Having a target feature is the simpler scenario for us: we are then within the realm of supervised learning. A more troublesome scenario is here on the right side. This is the realm of unsupervised learning, when we do not have a target feature available. Now, for a second, we have to forget this little setup that we created in one of the previous lectures, because it wouldn't be overly helpful when we are facing an unsupervised learning problem, where no target feature is available. Here, the problems are more about grouping or clustering; you might have heard about it. There is also anomaly detection. What am I talking about? Imagine that you have your customer base. Now, you don't have a target feature, nothing in particular that you are interested in predicting. You can still extract some valuable patterns and some valuable information from this data. For example, you can attempt to group your customers into clusters of similar behaviour, of similar socio-demographic characteristics. Or what you can also do is try to discover customers whose behaviour is anomalous. So here we would have projects and use cases around clustering or anomaly detection. And as I'm saying, we would need slightly different model types than the ones we are using for supervised learning. This is the first distinction: we have supervised learning and unsupervised learning. Nowadays, most of the use cases are within the realm of supervised learning, because luckily for us, we still have a lot of untapped potential within our datasets.
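The clustering idea can be sketched with a minimal k-means implementation: customers are grouped by similarity without the algorithm ever seeing a "correct answer". The two customer groups below are simulated under assumed numbers, purely so the clusters are visible.

```python
import numpy as np

rng = np.random.default_rng(3)

# Unsupervised setup: two measurements per customer, no target feature.
# Two behavioural groups are simulated (all numbers are assumptions).
low_spenders = rng.normal([25, 100], [3, 10], size=(50, 2))
high_spenders = rng.normal([50, 300], [3, 10], size=(50, 2))
customers = np.vstack([low_spenders, high_spenders])

def kmeans(data, k, steps=20):
    """A minimal k-means sketch: group observations into k clusters
    of similar behaviour, with no labels involved."""
    centers = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(steps):
        # assign every point to its nearest center
        dists = np.linalg.norm(data[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        # move each center to the mean of its assigned points
        centers = np.array([
            data[labels == i].mean(axis=0) if np.any(labels == i)
            else centers[i]
            for i in range(k)
        ])
    return labels, centers

labels, centers = kmeans(customers, k=2)
print("cluster centers:\n", centers.round(1))
```

In practice one would reach for a library implementation, but the loop above is the whole intuition: assign points to the nearest center, recompute the centers, repeat.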
For example, your business colleagues might constantly be coming up with interesting target features which are worth predicting. So nowadays we see a lot of use cases within the supervised realm, but there are some who predict that we will eventually run out of target features worth predicting, and that in the upcoming years we will see a rise in unsupervised learning. Another distinction that we should make is whether we are interested in creating a predictive model or a descriptive one. Both of these follow the very same principles, the approach that we have discussed, but they aim to achieve different things. Which of these two we pick will of course depend on the application, the project. What you most of the time see is a predictive model. These aim to make predictions as accurate as possible. For example, if we really want accurate predictions about how many customers will arrive at our ice cream stand, we will include a lot of inputs, a load of relevant data, basically as much as possible, and we will use a rather complex model type, because we really want to identify every subtle pattern in our data and increase the predictive performance of our model. Another option, another use of the model, is a descriptive model; I have it here on the right side. Here, having high predictive power is not the prime concern. We are rather interested in the patterns that the model learns. We might want to do this because we want to learn something about the decision-making of our customers, while we're not so concerned with the predictive accuracy. When building such a model, we would include a lower number of inputs, and maybe we would also opt for a simpler, less complex model type. Now, you might be meeting this idea of a descriptive model for the first time, so let me elaborate on it for a bit. How might we try to look inside of our model?
For example, we are interested in the effect that various values of our inputs have on the outcome. I printed out two examples. Firstly, we can see the effect that temperature has on the number of customers that arrive at the store. As the temperature increases, more and more customers are coming into the store. However, when temperature rises beyond a certain point, we can see that the effect is the opposite; it's becoming too hot for people to be outside. If we increase the temperature beyond that point, the number of customers who arrive at the store lowers. Secondly, we can see the effect that a discount in our offer has on the number of customers that arrive at the store. You can see that having a small discount does not increase the number of customers in our store rapidly. But as you can see, beyond a certain point, with let's say at least a 30% discount, we can positively influence the number of customers who arrive at our store. So this would be an example of how we might look inside a model, inside a machine-learning model that was learning on historical data, and extract some patterns from it with which we expand our domain knowledge. Now, you might be thinking why this distinction between a predictive and a descriptive model even exists. Can we not aim for both high predictive accuracy as well as interpreting the model and describing the phenomenon? Well, not quite. There is a certain trade-off between these two. We can increase the predictive power of our model by increasing its complexity. Complexity could be increased, for example, by including a lot of inputs. Now unfortunately, if we increase the complexity like this, the possibility of looking inside the model and describing what is happening inside of it is lowered. It becomes increasingly difficult to pinpoint which of the inputs is having what sort of effect on the outcome. Thus, achieving both of these is not exactly possible.
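The temperature effect described above — customers rise with temperature up to a comfort point, then fall — can be sketched as a tiny inverted-U relationship. The peak at 27 °C and all coefficients are invented for illustration; a real descriptive model would learn such a shape from historical data.

```python
def expected_customers(temp_c):
    """Toy descriptive relationship: customers increase with temperature
    up to a comfort point, then decrease as it gets too hot outside."""
    # Inverted-U shape peaking at an assumed 27 degrees Celsius.
    return max(0, round(90 - 0.5 * (temp_c - 27) ** 2))

for t in (15, 27, 38):
    print(t, expected_customers(t))
```

Reading off such a curve (the effect of one input on the outcome, holding the rest fixed) is precisely the kind of "looking inside the model" a descriptive use case is after.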
We always have to pick whether we are aiming for higher predictive accuracy or rather a descriptive model. And this brings us to the last distinction that we should be making, with regard to the complexity of the predictive model. Basically, complex models can find and utilize more complex patterns, thus having a higher predictive power. On the other hand, the more complex a model you build, the more considerations there will be during its development and maintenance. I have now generalized predictive models into three categories with regard to their complexity. We can start with the simplest ones, which are heuristics, rule-based systems, or scores. One example of a rule-based system would be our ice cream stand. Let's say we have this intuitive understanding of our business and we translate it into a set of rules. For example, if it is a Saturday, we would expect 50 people to come to our store. If it is sunny on top of it, we would expect an additional 30 people. This way we would build a set of rules, maybe 20 or 30 rules, and we connect them to construct a predictive model. This predictive model could then be used to predict future values. This is a pretty common misconception: people believe that predictive models are only machine-learning models, while that's not true. As I said already, if you are constructing these heuristics or rule-based systems, you are constructing a predictive model, because it can predict future values. Now, are these still useful nowadays, when we have so many options when it comes to machine learning? Well, they are. For example, this is when we have policies, or some cases where full transparency is required. For example, your bank or some governmental agency is assigning some risk score, and basically this risk score is predicting how likely you are to pay back a loan if you now borrow some money. Well, this is one example of a case where you would require full transparency.
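The two rules from the ice cream stand example can be written down directly as code. The 50-person Saturday rule and the +30 sunny rule come from the lecture; the weekday baseline of 20 is an invented assumption, since the lecture does not state one.

```python
def predict_customers(is_saturday, is_sunny):
    """A tiny rule-based predictive model for the ice cream stand."""
    expected = 20          # assumed weekday baseline (not from the lecture)
    if is_saturday:
        expected = 50      # "if it is a Saturday, we expect 50 people"
    if is_sunny:
        expected += 30     # "if it is sunny on top of it, an additional 30"
    return expected

print(predict_customers(is_saturday=True, is_sunny=True))   # 80
print(predict_customers(is_saturday=False, is_sunny=False)) # 20
```

No machine learning anywhere, yet this is a predictive model: given tomorrow's inputs, it predicts a future value, and every prediction is fully transparent.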
You as a human want to know what is behind the score that has been assigned to you. These simple predictive models are still in use when it comes to some niche applications where full transparency is required. Let's increase the complexity and talk about simple machine learning models. Maybe you are already familiar with machine learning and you have heard about two model types: linear models and decision trees. These are two prime examples of simple machine learning models. Actually, you have for sure already experienced a simple machine learning model within our dear assignment; we were basically imitating a linear model. And this model is so simple in its workings that we were even able to imitate it. So we became the linear model, which was assigning the individual weights to the inputs. Now, when would you use a simple machine-learning model? Well, there are many reasons why we might opt for a simpler model. For example, when we want to increase the transparency of our predictive model: a linear model, once we construct it, we can really nicely and easily interpret. So if this is our building goal, then we should opt for a simpler machine learning model. Another case would be that our resources for building a machine learning model are limited. For example, we are in a rush; we have two days to build a machine learning model, so we would apply only a simpler machine learning technique. So there are multiple reasons why we might opt for a simpler model. However, there are cases when a simple machine learning model wouldn't suffice and we would enter the realm of complex machine learning models. This would be when maximum predictive accuracy is required, or where the patterns in the dataset are very complex. This was, for example, the case with visual recognition; the human eye is a very complex function to imitate.
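What it means that a linear model "assigns individual weights to the inputs" can be shown in three lines. The weights here (+3 customers per degree Celsius, +25 on a weekend day) are invented for illustration; in practice they would be fitted from historical data.

```python
def linear_model(inputs, weights, intercept):
    """A linear model is just a weighted sum of its inputs plus an intercept,
    which is exactly what makes it so easy to interpret."""
    return intercept + sum(w * x for w, x in zip(weights, inputs))

# Hypothetical fitted weights: [per degree Celsius, per weekend flag].
weights = [3.0, 25.0]
intercept = 10.0
print(linear_model([20, 1], weights, intercept))  # a warm weekend day -> 95.0
```

Because each weight maps one-to-one to one input, you can read the model's reasoning straight off the numbers — this is the transparency argument for simple models.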
So whenever we are supposed to analyze images or understand human language, human texts, we really need complex machine-learning models, because these are complex functions to estimate. Some famous examples of complex machine-learning models would be, for example, neural networks. These are well understood by now and they are used exactly for these complex applications. Alright, so as you can see, we need to keep in mind this axis when it comes to the complexity of the predictive model. When we are building one, we should also keep in mind how complex we would want our predictive model to be. We are basically balancing between transparency and simplicity as opposed to high predictive accuracy. That's everything that I wanted to say within this lecture; looking forward to seeing you in the upcoming ones. 49. Predictive Model is Never Perfect: Hi and welcome to the lecture Predictive Model Is Never Perfect. There is oftentimes this wrong assumption that people hold, which is that data science models or machine learning models can have perfect predictive power. Of course, this is never the case. No model will ever have all of its predictions correct. So within this lecture, I want to provide you with reasons why this is the case. Or, taken from a different perspective: we already understood that a model is a simplification of reality, and thus it cannot include all of the subtle nuances of the phenomenon which we are studying. In this lecture, I would like to show you this in a more practical way: what is causing that a predictive model never has perfect predictive power? So let me list these reasons and elaborate on them a bit. First of all, it's about non-representative training data. We have our little ice cream stand we have been building across the chapter in Sweden. However, for some reason we only have data available from a similar stand in Spain.
Now, if we try to learn some valuable information from these data, it will most likely not be representative of the behavior in Sweden. It might just be completely different when it comes to purchasing power in Sweden. So this is an example of non-representative training data. And we already discussed the issue at the beginning of the chapter; basically, we are referring to a non-representative sample. You might be now telling yourself that this is a bit of a naive example, collecting the data in one country and then using it in another country. Well, hopefully we would never do this in the real world. However, remember the learning story about the edible and poisonous mushrooms. That was a very realistic scenario of non-representative training data, of a non-representative sample. This is the first issue. Secondly, it might happen that we will have an overfitted model, or in a rarer scenario, an underfitted model. We are usually learning from a sample and generalizing to a population; we already know that much. We basically do not want to focus too much on the sample, because that would create an overfitted model. With the patterns and valuable information that we learn from the sample, we essentially want to leave a little bit of freedom so that our model generalizes well to the population. Remember what we said about the inference of an average height from the sample to the population. Even though we measured in the sample that the average height is 175 centimeters, we did not claim that the average height in the whole population is also 175 centimeters. We claimed that we are ninety-five percent certain that the average height in the population is between 171 and 179 centimeters. It's something very similar over here. Now, we do not want to overfit the sample, while at the same time we also do not want to underfit the sample.
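The height example can be reproduced with the standard 95% confidence interval formula for a mean. The sample mean of 175 cm and the 171–179 cm interval come from the lecture; the sample size (25) and standard deviation (10.2 cm) are assumptions chosen so that the numbers work out to the same interval.

```python
from math import sqrt

def confidence_interval_95(sample_mean, sample_sd, n):
    """Approximate 95% confidence interval for a population mean,
    using the normal-approximation margin 1.96 * sd / sqrt(n)."""
    margin = 1.96 * sample_sd / sqrt(n)
    return sample_mean - margin, sample_mean + margin

# Lecture's example: sample mean height 175 cm.
# n=25 and sd=10.2 cm are invented to match the 171-179 cm interval.
low, high = confidence_interval_95(175, 10.2, 25)
print(round(low), round(high))  # 171 179
```

The "little bit of freedom" in the interval is the same idea as not overfitting: we refuse to claim the sample value holds exactly in the population.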
That would be the case when we are not utilizing the sample enough, when we are not learning enough from the sample. But as I'm saying, overfitting is the bigger issue. This is especially troublesome with machine-learning models. You literally have methods and techniques for how to hold them back, to stop them from overly learning on the sample and overfitting to it, where they would then fail to generalize to the population. So the second issue of why a predictive model is never perfect is that it might happen that we are overfitting or underfitting our training data, or our sample. Thirdly, irrelevant inputs. As we already said, the features, or in other words the inputs that we provide the model to learn from, are the most important aspect of the predictive model. It might happen that we provide some irrelevant inputs, or inputs which contain some spurious correlation that will not hold in the future. In such a case the model learns a pattern, but it will not generalize well to the population, or the spurious correlation will not hold stable in the future. We might just be providing the model with an irrelevant input, and we are lowering its predictive power and its predictive accuracy. Lastly, we are staying with the inputs, and it's about the low quality of inputs. At the beginning of the course, we discussed a little bit about the data. We said that not all data is of the same quality, and that, for example, self-reported data might be very problematic. Always remember the yellow Walkman learning story. It could be the same if we are interested in predicting from the data points that people report. This data might be a low-quality input, and thus it will cause our model to not have great predictive power. And also always remember that the data represents the phenomenon that we are studying.
If the data are not representing the phenomenon well, if it's a low-quality input, well, then we can have as good a predictive model technique or method as we want, but we will never grasp what's really going on in the phenomenon if our data is of low quality. To wrap up this little exercise that we did with listing the reasons why our predictive model will never be perfect, I would say that we always have to keep in mind that there is always some sort of bias or error within our predictive exercise, or that it will always contain some degree of noise that will not allow us to have perfect predictive power. Maybe you are feeling a bit discouraged now, but you definitely shouldn't be. We can create great and powerful predictive models. So even though our model will not have perfect predictive power, we are not aiming to create a model which has perfect predictive power. We are aiming to create a good enough predictive model that can create a benefit for a concrete application. So this is one of the key takeaways from this lecture: we want to create a good enough predictive model, and it doesn't matter that it will not be perfect. Maybe as a bonus takeaway from this lecture, it's about how we can view and approach these issues. A data scientist should admit that these limitations and issues exist in our predictive modeling exercises and should view them as an opportunity for improvement of the predictive model. Incorporating better training data, maybe incorporating more training data, improving the data quality, working on the data collection methods: all of these have a potential to improve the model that the data scientist is working on. We should view this as a potential for improvement of our predictive model. And that's it; that's what I wanted to discuss in this lecture. And I'm looking forward to seeing you in the upcoming learning story about dogs and wolves, where we will wrap up our predictive approach. 50.
Are we seeing a dog or a wolf?: Hi, and welcome to another lecture in the course Be Aware of Data Science. It is just about time to conclude our exploration of the predictive approach to data science. So we will have a learning story called Are we seeing a dog or a wolf? You have learned about one of the crucial parts of data science: estimating or approximating a function that will allow us to make some predictions. I decided to provide you with a little bonus story to show you the beauties and wonders of data science. When the function that we are trying to estimate or approximate is very complex, an example could be imitating the human eye and the human brain with a visual recognition model, a strange issue might occur. The problem is that the function becomes so complicated that it basically becomes a black box for us. We will not know what patterns the model was learning, such as is often the case with visual recognition and deep learning models. There are six pictures on the slide, each one depicting a dog or a wolf. So, some researchers constructed a visual recognition model. The model was supposed to decide whether it sees a dog or a wolf. The model worked very well, even astonishingly well. I mean, in some cases we can imagine that it's pretty simple and straightforward to recognize that this is a dog. However, with some specific breeds, or shot from specific angles, it might be a bit tricky even for a human eye to recognize whether it is a dog or a wolf. Yet the model was really good; it could basically perfectly distinguish whether there is a dog in the picture or a wolf. How is it possible that the data science model was so good? It really bothered the researchers, so they decided to look inside this complex model and examine the patterns that the model was relying on when making the predictions.
Basically, they were expecting that they would find something like this: that the model had learned to distinguish dogs from wolves based on the fur color or head shape or snout shape. These would be the features they would expect. However, in reality, the model didn't at all learn the nuanced differences between dogs and wolves, such as the fur coloring or the posture or the contours. It was heavily relying on the fact that dogs are usually pictured in an urban environment or with some human-made objects, such as a bridge, house, fence, or floor. On the other hand, we often photograph wolves in nature settings: grass, trees, or forest. As the model focused a lot on this simple pattern, it was accurate at predicting whether there is a dog or a wolf. Now, an interesting turn of events, isn't it? I think this story really nicely highlights the potential pitfall with the predictive approach. Oftentimes, when you stumble upon large datasets and complex mapping functions that you attempt to estimate or approximate, you might end up with a model that is doing what you would expect; however, it is basing its decisions on something completely different than you would expect. The question is, of course: is this really a broken predictive model? If you now look at the pictures, the pattern which the model picked up is a correct one to classify whether it sees a dog or a wolf. And maybe, if you think about it, your brain did this trick as well. When you were looking at the pictures of the American White Shepherd or, for example, the husky on the right side, did you base your decision on the contours of the dog, on the shape of the dog itself? Or did you just look at the environment, subconsciously of course, and decide: this is a dog, and maybe this is a wolf? Maybe at the end of the day, our brain is doing something very similar to this kind of broken visual recognition model. Before I end the lecture, I need to thank the authors of the pictures of these lovely doggos.
And I'm hoping that you have enjoyed this learning story. Always remember also what we have discussed: we should first be describing and exploring the data, trying to find some useful patterns ourselves, and only then should we be relying on some predictive approach, such as a machine-learning model, so that we can use it with more confidence. And that's everything that I wanted to discuss when it comes to the predictive approach. Thank you so much for being part of this chapter. 51. Is our model having an impact: Hi and welcome to another lecture in the course Be Aware of Data Science. We have already designed some data science models. Most importantly, in this chapter, we have designed a predictive model, for example by using a machine-learning algorithm that learns on our historical data. And now we are thinking about applying this predictive model to our customer base. Let's say that we are a company which has 1 million customers. And the big question for us is: is our predictive model having the impact that we would intend? For example, is it positively impacting our revenues or our profit? This is the topic that we'll be tackling within this lecture. Let's start off by designing our little business setup. I guess you are already tired of the ice cream stand, so let's have a different setup. In this case, we had a vision or an idea about the function of nature which is inside of our customers' minds, deciding about whether they purchase a mortgage from us. We have considered and collected various inputs which we consider relevant, such as the offer that we are having with the mortgages, the age of the customer, whether the customer already owns some property, and let's say some more. And of course, all of these inputs are coming into the function of nature.
What comes out on the other side is the final decision of a customer, whether he or she comes and purchases a mortgage from our bank. As we've discussed in the previous lectures, what we basically now attempt to do is estimate or approximate this function, and thanks to it, we create a predictive model. Alright, so let's take it one step further and consider how we would apply this in our bank, in our real business setup. Let's say that we have 1 million customers, on top of which we would like to apply this predictive model. The way we would apply it would be that for the customers about whom we predict a really high probability of coming to our bank and purchasing the mortgage from us, we would give them a call and try to persuade them: okay, really come to our bank and purchase the mortgage from us. So this is the application. Now comes the first key learning from this lecture. It's negative, so I highlighted it in red: we should not apply our model right away to the entire customer base, for various reasons. First of all, we might have missed some form of cognitive or statistical bias. What do I mean by this sentence? Well, maybe we are thinking about the whole function of nature wrong and the underlying process is completely different. That's one possible scenario. Or secondly, the data that we have used might have some form of statistical bias inside of it, and the model which is fitted and created on top of these datasets will be doing something completely different than we intend it to. Due to these reasons, if we would now apply the model to the entire customer base, well, it might be having a negative impact instead of a positive one, so we should be careful with applying it. Secondly, our model might be negatively influencing a different aspect of our business. So think about it.
Our model is not considering some other aspects of our bank. We are, of course, having different products, we need to take care of our customers, we have our operations. There are many different aspects to our business. The model is focused only on one of them, which is our mortgage sales. And if we are now applying this optimization and trying to increase the mortgage sales, we might be negatively influencing, for example, customer satisfaction. Maybe these phone calls which we are making to the customers are bothering them; they are unhappy, and they will not only not buy the mortgage from us, but they will actually churn from our bank as customers completely. We have to be careful with the application of our predictive model. Okay, now, what do we do? If only we had a way to apply the predictive model only on a sample of our customer base, and then generalize the learnings from the sample to the entire population. If only we had such a method. Of course we have it, and I think you guessed it correctly. If we go just a few lectures back, when we were discussing inference, you might remember the case where we were considering a new sales strategy. We would collect a sample of customers to whom we try to sell with the old sales strategy, then we collect a sample of customers to whom we try to sell with the new sales strategy, we compare these two samples, and we check against a statistical model for the generalizability of the difference. We just need to reuse the very same technique and reshape our setup a little bit. We would consider that, okay, our old population is the one without the predictive model applied, so basically how we are doing our business right now. And the other population is the one on which we already apply the predictive model, where we are already calling the customers based on the predictions of our predictive model. And again, we compare the two samples and we conduct a statistical test.
Now, just so that you can imagine it really as a real-world scenario, I have redrawn this picture. Basically, we have our 1 million customers and we will break our customer base into two halves. On the one half, we would not apply the predictive model; on the other half, we would apply the predictive model. But again, we do not want to apply the predictive model on such a large scale, to 0.5 million customers. So we would only collect a sample. We would collect a sample of 1,000 customers with whom we are doing the business as usual, as we do it today. And then we would have a sample of 1,000 customers on which we are applying the predictive model. And again, we will compare them. What will come out of this experiment? I do not want to resort back to the statistical test; we have already covered it. You can revisit that lecture in case you would like to have the technical details. I want to rather imagine this as a real-world scenario. So here is the outcome of our experiment. As you can see, we have our two groups in the rows: we have a group without our new predictive model, and then we have a group with our predictive model. Now let's look at the number of customers in each sample. We have 980 customers without our new predictive model and 930 with our predictive model. The reason why this is not 1,000 is simply that some customers have churned in the process of conducting our little experiment. So here is the thing: we cannot measure it right away. We need to wait some time, maybe three months, maybe six months, between when we start to conduct the experiment, when we apply the predictive model, and when we measure the result. It simply took a few months, so some customers naturally churned in the process, and we have of course collected the data about it. Now, about each of the groups we are collecting various metrics, because remember, we should consider other aspects of our business.
We do not want to look only at the mortgages themselves, because the model might be negatively influencing, or maybe positively influencing, some other aspects of our business. So we have multiple metrics that we are considering. Let's look at it. Revenue per customer in a group: this is the one that's fairly important for us; we hope that we increased the mortgage sales. And indeed, we can see that without our new predictive model it is $1,135, while with our predictive model it seems to be larger, $1,222. So it seems our predictive model is increasing the revenue per customer. Now, the second metric that we are considering is customer churn in a group. We indeed can see that more customers are churning from the group where we are applying the new predictive model. This really might be because of the phone calls that we are making; they are simply bothering the customers and they do not want to hear from us. And lastly, we have also collected a customer satisfaction score: 8.55 versus 8.53, so let's say that there is no difference between the two. Now, looking at these, I think we can make a conclusion that our predictive model is having the desired and overall positive impact. So even though we are negatively influencing the customer churn, all in all, our business colleagues have looked at the results of our experiment and they are concluding that the increased revenues are indeed worth it. So we are making the conclusion that yes, our model is having the positive impact. And based on the results of this experiment, based on the results of this pilot phase, we will now deploy the predictive model to the entire population, so to the entire 1 million of our customers. So let's summarize with a key takeaway from this lecture: we apply the predictive model only to a sample of the population to see whether it has a positive impact.
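The experiment comparison above can be sketched as a small calculation. The numbers (1,000 customers starting per group, 980 vs. 930 remaining, $1,135 vs. $1,222 revenue per customer) come from the lecture's table; note that a real evaluation would also run the statistical test mentioned earlier, which this sketch omits.

```python
def summarize_group(started, remaining, revenue_per_customer):
    """Compute the churn rate and keep the revenue metric for one experiment group."""
    churn_rate = (started - remaining) / started
    return {"churn_rate": churn_rate, "revenue": revenue_per_customer}

# Each group started with 1,000 sampled customers.
control = summarize_group(1000, 980, 1135)    # without the predictive model
treatment = summarize_group(1000, 930, 1222)  # with the predictive model

revenue_lift = treatment["revenue"] - control["revenue"]
print(f"revenue lift per customer: ${revenue_lift}")
print(f"churn: {control['churn_rate']:.0%} -> {treatment['churn_rate']:.0%}")
```

Laying the metrics out side by side like this is what lets the business colleagues weigh the $87 revenue lift against the extra churn before deciding to deploy.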
This is not only applicable for the predictive approach; it's actually applicable for all approaches. But it is crucial with a predictive model, because most likely we have used a machine-learning model to create it, and we need to make sure that there is no statistical bias, that there is no cognitive bias, or that we haven't gotten the whole setup just wrong. So it's very crucial to do these experiments, especially with a predictive model. This is the end of the lecture. This is how you check whether your predictive model is having a positive and intended impact. I'm looking forward to seeing you in the upcoming lectures.