Transcripts
1. Welcome to the course!: Hi and welcome to the course. Be aware of Data Science. My name is Robert, and
in this brief lecture, I would like to not only
welcome you to the course, but also provide you
with an overview of what's ahead of
us in the course. And on top of it, I will also provide five
practical tips on how to be as efficient as possible in your learning journey
throughout this course. Let's go for it. First of all, let's discuss what's ahead
of us in this course. As you can see, the course is structured in four key chapters: the essence of data science,
disciplines of data science, describing and
exploring the data. And lastly, we have inference
and predictive models. What can you expect from
each of these four chapters? Well, within the first chapter, as the name says, we will talk about the essence
of data science. So we will want to understand what the goal of data science is, why are we even
using the data, and how it helps us fight what
we call cognitive biases. These will be our
first key learnings. Then we also ask ourselves
a question of how does data science apply
scientific approaches, considering it has
science in its name. So this is the essence of data science that we'll talk
about in the first chapter. Now once we understood it, we will proceed with the
second chapter where we are more practical,
more tangible. We'll talk about the
disciplines of data science. So there is data mining, machine learning,
databases, Big Data. And we will look at how each
of these contribute and how they together create what we
call data science nowadays. So as you can see, it will
be very practical. In the second part
of this chapter, we also ask the question of who a data scientist is and will examine the necessary
skillset of a data scientist. And hopefully this part
of the chapter also gives you an
inspiration on how you can join the world
of data science or possibly grow further in
your data science learnings. Having the basics covered, we proceed to the third chapter, describing and
exploring the data. We are basically applying the first two approaches of data science that we will
study in this course. We are doing so in
this order because when data scientists get
their hands on data, well, the first thing
that they do is that they will
describe the data, for example, they would use
descriptive statistics. Secondly, what they
will do is that they would explore the data. So they would search
for some patterns within the datasets, and we will do the very same thing
within the third chapter. Now, once we have described
and explored our dataset, we can proceed with the
fourth chapter where we talk about inference
and predictive models. So this is the
chapter where we will dig deeper into machine learning and we will be creating powerful predictive models which can learn on some
historical data, on some sample of
the data and then possibly predict
some future value. So we'll be building these
cool predictive models. Now lastly, I would like
to highlight that there is also a bonus section at
the end of the course, which I'm really
trying to fill with a lot of bonus
content for you. For example, if I'm seeing
that a couple of students are having the same questions or they're curious about something, I would collect those
questions and record a special lecture
which will then be placed in the bonus
section of the course. You can also find there some bonus footage and
bonus information on some of the lectures to sort
of provide you with some behind the
scenes knowledge. And then also providing some practical tips such
as what I recommend to read or how I recommend to grow further into the world
of data science. So be sure to not miss the
bonus section of the course. Alright, so this is the
journey which is ahead of us. I hope you will enjoy it. And now as promised, let me provide you with five practical tips
on how you can be as efficient with your learning within
this course as possible. First of all, pace
yourself, like find your ideal and optimal
pace. Maybe some hints: one chapter is
approximately 90 minutes. If you like longer
uninterrupted sessions, well, you can then pace it by the chapters and
take one chapter at a time. If you would like to have very short bursts of learning, then the course is also
optimal, because one lecture is approximately five to
seven minutes long. So it's really short
pieces of content. And I really hope that
you will find the pace which is really suiting
you throughout the course. Secondly, practice
makes perfect. I really recommend practicing through the assignments that we
have created for you. They're spread
all over the course. And we have really tried our best to bring you
interesting assignments. For example, there
is one where we'll be analyzing migrations and then predicting
how the animals will migrate in a nature
reservation. There are more assignments. For example, you will be a restaurant owner
and you will be identifying relevant inputs
for our predictive models. I really recommend that
you invest your time and practice with the
prepared assignments. Thirdly, whenever there is really a crucial learning
within the course, we turn it into a learning story so that
you can remember it better. So I really recommend focusing on a learning story when you find one, and remembering
the concept through it. For example, we'll
be talking about why hippos are dangerous for organizations, we'll
be asking ourselves a question of whether
storks bring babies, or we'll be asking
ourselves a question of whether this mushroom
is edible or poisonous. I really hope that these
learning stories will help you to remember
the key concepts. Now, at least for myself, I love collecting things. We have also constructed various handouts
and collectibles, which are always available as a handout next
to a lecture. So I recommend that you
always check whether there is a handout available
for a lecture. There will be, for example, one-pagers which are summarizing the key learnings from
this particular lecture. So always check the
handout section. And lastly, especially when we're talking about
data science, it's really crucial to be curious and to be
asking questions. And I also recommend
for you to ask questions with regards
to the content. For example, if you stumble upon something which is unclear or something that you
would like to discuss with your fellow
students or with me, use the Q&A section. I'm always checking
it and I will be replying to your questions. And I also hope that you will develop meaningful
discussions with fellow students, for example, when it comes to the assignments. So I really recommend to
make use of the Q&A section. I'll be super happy to hear
from you and reply to you. Alright, so these were the five quick tips
for taking the course. And I can't wait to see you in the upcoming
lectures. See you there.
2. Introduction to Chapter 1: Hello everyone. A warm welcome to the first
chapter of the course. Be aware of Data Science. My name is Robert
and in this brief lecture I would like to provide you
with the goal that we have during the chapter that is called
the essence of data science. Also, I will provide you
with a bit of an outlook of the lectures that we will
have. Let's go for it. If we look at the
overall course structure, we can see that really
we are at the beginning. We want to define and understand data science
from multiple perspectives. This will be the goal during
this very first chapter. We'll be tackling
questions such as, what is the goal
of data science? What are the tools, methods, and approaches which
are available to us, or what is a data science model? With these basic questions,
we are really starting from the ground up, and we'll be building up
on this knowledge during the remainder
of the course. So having the goal defined, let me show you the lectures
which are ahead of us. We're starting really
from the ground. And already in the
first lecture we are defining what is the
goal of data science. Essentially, you will understand that data science is the art of turning data into
valuable information. Then I really recommend
that you look forward to the second lecture, as I
have a quick win for you there: the approaches
of data science. I will build a framework of four approaches which are nowadays used by
industries a lot. So a descriptive approach, an exploratory one,
an inferential one, and a predictive one. And we'll be relying on this framework during the
remainder of the course. We'll be studying these four
approaches of data science. It's a very important lecture. Then we will kind of start
to challenge the data and we will start to think about the data part
of data science. Data science is very
popular nowadays. Why does everyone want
to utilize the data? And if we are
utilizing the data, are the data sacred, like can we blindly trust them? Are they always right? As you can see,
we'll be focusing on the data part
of data science. I then have a
learning story for you, as I would really
like you to remember to always remain skeptical towards the data and towards
the nature of the data that you
have. I then have an assignment for you which
is called Bias is Everywhere. I will ask you to reflect
with your own experience upon possible cognitive
and statistical biases, which you might have
experienced in the past. So it will be a
practical lecture. Then in the second
part of the chapter, we will move towards the
science part of data science. We will think about
some limitations of the human mind and why
it is so useful that data science is building models as simplifications
of reality. We are moving and focusing towards the science
part of data science. Then just before we
conclude the chapter, there is a checkpoint
on which you can recap all of the knowledge that you have gained during the lectures. It might be necessary, because the last part of the
chapter is a quiz that I prepared
for you where you can test your knowledge
from this chapter. Overall, you can expect
that this chapter will take you approximately
16 minutes in lectures, plus there is approximately
20 minutes of practice activities in terms of the assignment and
also the quiz. So I'm looking forward to seeing
you in this first chapter.
3. The Goal of Data Science: Hi and welcome to another
lecture in the course Be aware of Data Science. My name is Robert,
and in this lecture we're going to talk
about the goal of data science, because I can't
think of a better way to start the
journey of exploring the wonderful world of data science than to
talk about its goal. So let's go for it. At first, the goal of data science is very simple
and straightforward. The goal is to turn data
into valuable information. That's what data
science attempts to do. Is that really all, have we
now defined data science? No, of course not. Even though this
definition is vague, there is quite a lot
behind this definition. I would like to talk about
it within this lecture. First of all, let's ask
ourselves: were data scientists the first ones who have
ever attempted to turn data into valuable
information? Certainly not. You had statisticians, artificial intelligence
experts, data miners. You had business
intelligence experts. And all of these
people were trying to turn data into
valuable information, but they were kind of
separated from each other. But things changed
around the turn of the millennium,
approximately 1999. Basically, smart
folks were meeting on conferences and they
were thinking that if we unify all of these separated efforts
under a common umbrella, under a common definition, then the industry will be
able to adopt it better. And that's exactly
what happened. They came up with the term data science and
with this vague definition, which is unifying all
of the efforts of turning data into
valuable information. And it seemed to be the
right move, because then the industries really
started to adopt data science methodologies and various disciplines
which are all contributing to
the data science. This is the first key
takeaway from this lecture. Even though the definition
of data science is vague, it's vague by design; it's supposed to be like this. It's supposed to unify all of these otherwise
separated efforts of turning data into valuable information under
one common umbrella, which we call data science. Now, I would like to zoom in and look at what's behind this
definition, as there is, of course, a lot more. First of all, we can
look at the left side, the data side of the definition. There are actually lots of
aspects about the data that can vary quite a lot from project to project or use case to
use case in data science. First of all, we
can have structured or unstructured data. If the data is structured,
you can, for example, imagine some
demographics dataset of your customers, let's say in
an Excel sheet, which you can load into Excel
and look at where your individual customers
are located, or at their socio-demographic
characteristics. This would be structured data and it's quite
easy to work with. Then contrary to that, we can have unstructured data. Well, maybe your
company is having some large IT system which is logging everything which
happens, for example, on a server or within your company solution that is exposed to the customers, and it is just outputting the logs. The data exists,
but we do not have such a great overview
of what's in there if we would like
to analyze it and derive some valuable
information. So of course, it's going
to be a lot harder to work with such
unstructured data. Secondly, the data can be what we call purposed
or unpurposed. We will discuss these in
detail later in the course. But for now, a purposed dataset would be one which is collected
exactly for us, for the hypotheses, for the
idea which we are having. We have collected the dataset
just for our purpose. That's what we would
call a purposed dataset. On the other hand, you can have datasets which are unpurposed. Maybe they're just
being collected for some operational purpose such as the IT system that I
already mentioned. The logs are there in
case something goes wrong so that our IT
experts can examine it. Or maybe there was
some other project going on which was a data
science project and it collected some data. But again, the data is not primarily collected
for our purpose, for the idea that we are having. So it might be a bit more troublesome to work
with these data, and it might be a bit
less informative, as it's unpurposed from our perspective. Thirdly, we can think about the nature and
accessibility of the data. Now, do we have
access to the data on a real-time basis or a
near real-time basis? Or do we have to process the data on a
near real-time basis? So as they are popping up, as they're coming
into existence, we have to analyze them and
process them right away. Or do we have
access to the data, and also the need to access
the data, in a batch way? So let's say once a month
we are supposed to analyze the data and come up with
some valuable information. You can see even from
this perspective, there can be a lot of
differences within the data. Now, this list is of course
not exhaust you and fall. I just wanted to give you right away the start of the course a clear idea that datasets and data might be
really different. Maybe a key takeaway
from this part of the lecture is
always we should be asking ourselves what is the availability and the
nature of the data within our domain are within
the use case within the project that we're
working on right now. So you can see we
have now zoomed in to the data part of the
definition of data science. Secondly, let's zoom
into the right side of the definitions on the side
of valuable information. I mean, there, there are
also a lot of differences. In what form are we creating the valuable information that are actually various forums. First of all, it can be some simple
descriptive statistic. For example, we
summarize in which days of the week we're
making how much sales. These can already be
available information to our marketing colleagues who can then send better campaign. So a simple descriptive
statistic could already be available information or we
could be a bit more complex. We are thinking about the
visualization or a pattern. Let's say we have a
large dataset and we are visualizing
how our sales are changing across the
days of the week or maybe across the
months of the year. And we might find some valuable pattern
through the visualization. So the visualization itself is the valuable information
that we are seeking. Or lastly, we could be
a bit more complex with our approach and aim for a creation of a
predictive model, usually a
machine-learning model. Later in the course we'll be building up an example that we are an owner of an
ice cream stand and we're hoping to predict how many customers will come
to our ice cream stand on any given day so that we
can stock up properly and we can have appropriate amount
of personnel in place. This is the most complex type of valuable information
that we could be creating: individual
predictions for individual customers
and for individual days. As you can see again, this is not an exhaustive and full list. I just wanted to give you an idea that the valuable
information that we are creating
using data science can also vary in forms. And it's always going to be the use case or the domain
in which we are working that will determine
in which form we are desiring to have the
valuable information. I would like to stay one more minute on
this definition of data science because I have
one more key message for you, which is that this
vague definition is a gift and a curse
at the same time. What do I mean by it? Well, it's a gift because data science can be
virtually applied anywhere within any industry. As a data scientist, you can work for a
manufacturing company or a solar energy
engineering company. You can be working within the
financial industry, virtually anywhere and within
any organization, which is really great, isn't it? At the same time, this vague definition means
that data science can be applied on any level
within an organization. So we could have expert data scientists with many
years of experience creating large and complex models, and at the same time,
alongside of them, you could have maybe
their business colleagues who are creating simpler models, integrating them
really well into the business domain where
they are the experts. So this is the gift, because you can apply
data science in any industry or
organization, and also within any level of
the organization. That's the gift which this
vague definition brings. Now at the same time, it's kind of a curse. Why do I claim that this vagueness is a bit of
a curse for data science, because data science
use cases and projects can get
overwhelmingly complex. And you can then find
people wondering like, do I need a lot of data
to apply data science? Do I need a powerful
infrastructure? Do I need data
scientists to be able to do a use case within
my organization? Who even is a data scientist? You can see this vagueness
causing various confusions. I would say it's all
around data science nowadays. Now of course, later in the course
we are going to discuss all of these
and we'll clarify them. I just already now at the
beginning of the course, wanted to give you a sort
of a quick win of why we are seeing so much confusion around the world of
data science nowadays? Well, it's because this
working definition is causing a little
bit of a curse. Now this is what I wanted to
discuss within this lecture: we have defined the goal
of data science. We have zoomed into
various parts of it. And now you might find
yourself wondering: how is all of this happening? How are we turning the data
into valuable information? And that's what we will tackle in the upcoming lecture where we will talk about the
approaches of data science. So thank you so much for
being part of this lecture. I look forward to see you
in the upcoming lectures.
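The contrast between structured and unstructured data from this lecture can be sketched in a few lines of code. This is a minimal, invented illustration (the customer sheet and the log line are not from the course):

```python
import csv
import io
import re

# Structured data: a small "Excel-like" sheet of customer demographics.
# The columns are known up front, so analysis is straightforward.
structured = io.StringIO("customer_id,city,age\n1,Prague,34\n2,Brno,28\n3,Prague,45\n")
rows = list(csv.DictReader(structured))
prague_customers = [r for r in rows if r["city"] == "Prague"]
print("customers located in Prague:", len(prague_customers))  # 2

# Unstructured data: a raw server log line. Before any analysis, we must
# first impose structure ourselves, for example with a regular expression.
log_line = "2024-05-01 12:03:11 ERROR payment timeout for customer 2"
match = re.search(r"\b(ERROR|WARN|INFO)\b\s+(.*)", log_line)
if match:
    level, message = match.groups()
    print("extracted log level:", level)  # ERROR
```

The structured half needs no preprocessing at all, while the unstructured half already requires us to invent a parsing rule before any valuable information can be derived.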
4. Approaches of Data Science: Hi and welcome to another
lecture in the course Be aware of Data Science. My name is Robert,
and in this lecture I would like to talk about the approaches of data science. So just as I promised you, once we understood
the basic goal and definition of data science, now we tackle how it happens. How do we turn data into
valuable information? Let's go for it. Actually, what is
widely applied in the industries nowadays are four approaches of data science. And in this video, I
would like to give you an overview of these
four approaches. And then later in the course we will dig deeper into these. You will get an overview of the methods which are available. You will have a
chance to practice with the methods, and a lot more. Now the key takeaway from this video will be
to understand how these are building
up on each other and what are the
differences among them? Of course, what is the
essence of each of them? As I said, we
will be building this overview with
a colorful example. We are a fruit and
vegetable stand owner. Of course, we want
to sell as much of the fruits and
vegetables as we can. And we have some data available maybe from the past sales
that we were making. And we want to utilize data
science. What do we do? The very first thing that
we can attempt to do is to use the descriptive
approach of data science. Now as you can see, I have
put this arrow over here. These four approaches, as
we'll be discussing them, are usually, within
a data science use case, building up on each other
in a sort of a sequence. So the descriptive approach is
usually the first one that we are applying when we get
our hands on our dataset. Alright, but back to our
fruit and vegetable stand, we are utilizing the
descriptive approach and we are mainly asking
ourselves a question, what is the essence
of this dataset? What is it that is
in front of us and what is the problem even
that we are tackling? We will be using very simple
tools from statistics, such as descriptive statistics, measuring averages,
measuring the extremes. And then we will also utilize
some simple visualizations. We are just trying to
understand what is in the data, being
very objective. An example outcome using
this descriptive approach of data science will be
that on weekdays I sell on average five
kilograms of apples. Now this is already useful. This is already valuable
information for me, because I know
how many apples I should have in stock on any given weekday:
approximately five kilograms. Now, it's not the most
powerful finding, and it's not the most valuable information,
but it's a good start. And this is also how data science use cases usually start, just trying
to understand what is the essence
of the dataset and trying to come up with
these quick wins. Once we understood the
essence of the dataset and we utilized the
descriptive approach, we are turning our attention to what we call the
exploratory approach. Here the question is different. We are asking, are there
any patterns in my data? This is different.
We'll be using tools such as
correlation examination. We will later in the course
study correlation in detail. But we're basically measuring some relationships which
could be occurring. And we're also building some more complex visualizations to visualize these
relationships and patterns. And an example
outcome will be that customers who buy
apples also buy bananas. And you can see the difference
that we are no longer just describing that
we are on average selling five kilograms
of apples per day. But we're describing
that there is a relationship between
apples and bananas. So this is an example of
an exploratory approach, finding patterns in our data. Now, once we found such a pattern, we might have
gotten curious: well, these people buying
apples and bananas, is this something that is happening only
within our stand, or is it something generally happening anywhere in the world, so that people who like
apples also buy bananas? Well, we might turn
our attention to the third approach which
we call the inferential. Here the question that
we're asking ourselves is, if there are
patterns in my data, can I generalize these
outside of my sample? In our case, can I
generalize these patterns outside of our fruit
and vegetable stand? So are the other
vendors around in the market also selling
apples and bananas together? Or is it something
specific to us? Here we are reaching out to the world of
inferential statistics, which we also have
an introduction to later in the course. And let's say that
the finding will be that the apples and bananas pattern only happens within our stand. Maybe what we did is that we obtained sales data also from other stands which
are on the market, and we have compared and
we found out that really, it's just for us
that the apples and bananas are frequently
bought together. Now is this valuable
information for us? Of course it is because
it tells us that we are doing something special with
the apples and bananas. Maybe it's the display
how we're putting them together in our little
fruit and vegetable stand, which makes people
buy them together. Again, another kind of valuable information
which we might create thanks to the
inferential approach. And the last approach that
we might decide to utilize, which industries
nowadays are also utilizing a lot, is called the
predictive approach. Here we are basically
asking a question: can I make granular
predictions with the patterns? What do I mean by
granular predictions? Well, let's say
granular predictions are on the granularity of days. So I could maybe forecast
how many apples and bananas I will be selling
on any given day. Or I can be on a
customer granularity. So I would like to make individualized predictions
for my customers. And I would say that
if I would like to utilize this
predictive approach, I would focus on making personalized offers
for our customers. So I will run a little
loyalty program with loyalty cards so that whenever customers are coming to our
fruit and vegetable stand, they are showcasing
the loyalty card and I can link their
purchases together. And let's say that I found out there are customers who
like artichokes. And I also found out
that there are customers who like to cook special
meals on weekends. Now you can see this is
not one pattern, those are two patterns. So customers who like
artichokes and those who would like to cook special
meals on the weekends. If I combine the
multiple patterns at the same time into
one predictive model, I might be able to make personalized granular predictions
on my customer level. So maybe if I utilize
the predictive approach, the outcome will be that
the model is going to output several
customers whom I should call before the weekend, when the fresh batch of
artichokes arrives, and then make them an
offer to come to our fruit and vegetable stand: we have these fresh artichokes that you might enjoy. The predictive model might be
right about a couple of them. So we have again generated valuable information by
combining multiple patterns, unlike with the
inferential approach where we had one pattern, we are combining
them together to make these individualized
predictions. So as you can see, with this last predictive approach we
will be, of course, utilizing more
complex techniques: predictive modeling, machine learning, or
statistical learning. We will have time to dig deeper into these. But the important takeaway
from this video is to have an overview of
these four approaches, how they are nicely and intuitively building
up on each other and how they all can be utilized for a problem
that we're facing. And that is it
from this lecture, I would like to thank
you for being part of this lecture and I look forward to see you in the upcoming ones.
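As a recap, the four approaches can be sketched on a toy version of the fruit and vegetable stand. All numbers, names, and the scoring rule below are invented for illustration; in particular, the hand-made score only stands in for a real trained machine-learning model:

```python
from statistics import mean, pstdev

# Toy daily sales for the fruit and vegetable stand (kg per day).
apples  = [5.0, 4.5, 5.5, 5.0, 6.0, 4.0, 5.0]
bananas = [7.0, 6.5, 7.5, 7.0, 8.0, 6.0, 7.0]

# 1) Descriptive approach: what is the essence of the dataset?
avg_apples = mean(apples)
print(f"average apples sold per day: {avg_apples:.1f} kg")

# 2) Exploratory approach: is there a pattern, e.g. a correlation
#    between apple sales and banana sales?
def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = mean((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (pstdev(xs) * pstdev(ys))

r = pearson(apples, bananas)
print(f"apples/bananas correlation: {r:.2f}")

# 3) Inferential approach (sketch): is the co-purchase pattern specific
#    to our stand, or does it hold for other vendors as well?
our_rate = 0.40     # share of our baskets with both fruits (invented)
market_rate = 0.15  # the same share at other stands (invented)
print("pattern looks specific to our stand:", our_rate > market_rate)

# 4) Predictive approach (sketch): combine several patterns into one
#    granular, per-customer prediction, here as a simple hand-made score.
def offer_score(likes_artichokes, cooks_on_weekends):
    return int(likes_artichokes) + int(cooks_on_weekends)

customers = {"Alice": (True, True), "Bob": (True, False), "Carol": (False, False)}
to_call = [name for name, traits in customers.items() if offer_score(*traits) == 2]
print("call before the artichokes arrive:", to_call)
```

Note how each step builds on the previous one: the description gives us the baseline, the exploration surfaces a pattern, the inference asks whether the pattern generalizes, and the prediction combines several patterns into an individualized output.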
5. The "Data" Part of Data Science (1): Hi, and welcome to another
lecture in the course Be aware of Data Science. My name is Robert, and in this lecture we are going
to be talking about the data part of data science. Why is it so powerful? Why is it so useful? At first, let's think about
decisions and information. We make decisions every day, small or big ones. Decisions matter because they
influence our future. For example, you might decide not to take an umbrella today. It's going to be raining and your day is going to be ruined. Bad decision. Let's say that you
decided to buy a property at the
outskirts of the city. This property doubles in
value over the upcoming years. So obviously it was
a good decision. We want to make the right choices within our
decision-making process. Now, what can help us in this
decision-making process? Well, I would say
that if we watch the weather forecast
in the morning, we would have seen that it's
going to be raining today. So this valuable information could have told us
that we should take an umbrella. Or we
might have examined in detail the geographical
planning of our municipality. And we might have noticed that there are large environmental and cultural investments
planned, which will increase this property's value. So you can see information
is valuable for decision-making and
the data contains it. But do we really need the data to make the
right decisions? Can't we just use our own experience or knowledge
to make the decisions? Well, there is a problem. And the problem is that
humans are incredibly biased. We suffer from what we
call cognitive biases. Psychologists,
anthropologists, and sociologists have
been compiling for decades a very long list
of these cognitive biases. Here on the slide, I'm listing actually
just quite a few. There are lots more of them. What are these cognitive biases? Well, we could simplify them and translate them to
statements such as: we, as humans, are not fair. We create prejudice,
we discriminate, we are not even rational, you could say. I
think you get my point from this summarization. Now, even though we
are right now shedding sort of a bad light on
these cognitive biases. Some of them actually
helped us survive in the past, or they helped
our ancestors survive. Let's say an ancestor
of ours walked through a dark alley and spotted a shady individual. He then walked through another street, and
that saved his life. Another ancestor of ours was offered a venture by an unhealthy-
looking partner. She passed on the
business venture and she avoided a failure of
this business venture, obviously because of
the health reasons of this potential
partner. In the past, these might have been useful. However, nowadays, when
we talk, for example, about the business
decision-making, these can be incredibly
problematic. The fact that human
decision-making can be this incredibly biased is a huge incentive
for companies to incorporate data in their
decision making process. Now, just to show
you what I mean, I would like to focus
on one of these biases, which is maybe one of
the most prevalent ones which you can find in
the companies nowadays: it's called the authority bias. We have made this beautiful
illustration for you. You can remember that hippos
are dangerous for companies. You can now remember
the statement that you should trust
the data and not the highest paid person's
opinion, or the hippo. That is the way I like to
remember the authority bias. Here is the story. Let's say that you
are working for a yogurt producing company and maybe you have experienced meetings such as this
one in your past. We have launched a new
product in the past, a new taste and
flavor of the yogurt, and now, after a
couple of months of attempting to sell
this new yogurt, you are meeting to decide whether you should
continue producing, manufacturing, and
promoting this new yogurt. Well, as it looks like
based on the sales data, it's not selling that well. And also your
customer reviews are indicating that this is
not the best product ever. However, the manager,
the HiPPO in this case, stands up and says: No, I like the new product. I actually think that things
will start to look better. We just need to
give it some time. We have already invested so much into the development
of this new yogurt. We won't just stop now. And the opinion of this HiPPO outweighs what everyone else in the room is saying. Obviously, that's the highest paid person's opinion, and we have just suffered from the authority bias. Here's the thing. Luckily, the data does not suffer from cognitive
biases such as this one. We could use the data to
overrule the HiPPO's opinion. The data can easily showcase that even though customers tried the new product and bought it once, no one bought it for the second time, and no one wrote a positive review. If companies learn from
the data and utilize the information out of the data properly within their
decision-making, they can avoid
these human biases. Now, am I saying
that the data are flawless? Not at all. The data is unfortunately not free of biases either. And we will talk about that
in the upcoming lecture.
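To make the yogurt story tangible, here is a minimal sketch of how a team could check repeat purchases directly instead of deferring to the HiPPO. The purchase log and customer names are invented for illustration; real data would come from a sales system.

```python
from collections import Counter

# Hypothetical purchase log for the new yogurt flavor; every entry is
# one purchase by the named customer (all names invented).
purchases = [
    "anna", "ben", "anna", "carla", "dan", "emma", "ben", "filip",
    "greta", "hana", "ivan", "jana", "karel", "lena", "marek",
]

buy_counts = Counter(purchases)
total_customers = len(buy_counts)
repeat_buyers = sum(1 for c in buy_counts.values() if c >= 2)
repeat_rate = repeat_buyers / total_customers

# A low repeat rate backs the sales data against the HiPPO's opinion.
print(f"{repeat_buyers}/{total_customers} customers bought again "
      f"(repeat rate {repeat_rate:.0%})")
```

In this toy log, only 2 of 13 customers ever bought the yogurt twice, which is exactly the kind of number that can outweigh a single loud opinion in the meeting room.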
6. The "Data" Part of Data Science (2): Hi and welcome to
another lecture in the course Be Aware of Data Science. We have left the last lecture with a question: are the data flawless? And I'm claiming: not at all. Even though the data might be really helpful in our decision-making process so that we avoid our cognitive biases, we have to be aware of some potential pitfalls
and issues with the data. I will bring a handful of these issues just to
give you an idea. First of all, there is the
famous statement by Mr. Coase that if you torture the data long enough, it will confess to anything. What do we mean by it? Well, even though the data might be unbiased, it is still us humans who are interpreting the data. If I, as a data scientist, am suffering from some cognitive bias, I might project my cognitive bias on top of the data and still obtain a biased result. If you would like to simplify this statement even further: the problem sits between the computer screen and the chair, which is of course us, the humans. We will have a few more examples on this one later in the course, but I definitely recommend remembering this statement. So this is the first issue
of working with the data. Secondly, I would say that not all data are born equal, or that not all features about, for example, people are born equal. What do I mean by it? Well, we can, for example, break down data points into four categories based on how
they come into existence. And some of these
are going to be more useful for our
data science projects. And some of these are
going to be less useful. Usually the most useful
ones are data which are observed, or so-called exhaust data. Observed data, I think, are really easy to imagine: you just observe how old somebody is, so that could be an age. Exhaust data are created as a result of some process that, for example, a person could be doing. Let's say that you are typing an email, and out of it I create a data point which says your speed of typing. These two kinds of data, based on how they
come into existence, tend to be pretty useful
for data science projects. But then we are getting
to the problematic parts. For example, there are learned data points.
What are those? Well, let's say your bank or your insurance company has
some risk rating about you. This risk rating is not a raw data point which is describing you; it was already learned from some other data points, such as, for example, what your behavior was in the past, whether you were paying your debts, and so on. Now the problem for us
might be that if we reuse this data point for our
use case for our project, we might be taking over some
biases or some issues from the previous project which constructed this learned data point. Working with these could already be problematic. Lastly, we are coming
to the red category, which is the self-reported data. These are from my perspective,
the most problematic. Many applications, products, services, and organizations deal with humans in one way or another. Unfortunately, there
are several issues connected with people. First of all, people might lie. So let's say that I would like
to get a loan from a bank. The bank clerk asks
me if I owe money somewhere outside of this bank, at some other financial institution. Even though I do, at the same time I really would like to get this loan from this bank. So I'm going to say: Hey, I don't owe money anywhere else. Of course, I have lied with a purpose. So whatever I say is going to be, of course, very biased. I have self-reported
something that is wrong. Secondly, people might
just not know themselves. Let's say I participate in a job interview and the interviewer asks me how good I am at data science, or how good I am at dealing with stressful situations. Well, of course, I'm going to report that I'm great at data science and great at dealing with stressful situations, because I don't have the right perspective. In reality, I might be terrible at dealing with stressful situations; I just don't know it myself. So again, I create a data
point about me which is completely biased because
I don't know myself. So it's another
self-reported problem. I think you get my
point right over here. Whenever data science works with human data, these issues should be kept in mind. In my experience, the best data about humans, as I was saying, are the ones at the top of this list, and we should rather avoid the ones at the bottom. At least from my perspective and my subjective opinion, I do not really like to work with surveys, because they are full of these self-report biases. Thirdly, there's another example of an issue that can be there with the data: it comes to biased datasets. As the world of data science and machine learning
are progressing, we want to build more powerful, more impactful,
and larger models. To build such models, you oftentimes need very large datasets. This is especially true
for use cases such as visual recognition or
natural language processing. And the issue is that when you build such model on a dataset, the dataset might have
some statistical bias. And now be careful, I'm saying statistical bias, not
cognitive bias. Cognitive biases are over here, in our minds; the statistical bias is inside of the data. For example, there was a very nice piece of research done on one of the
famous datasets which are being used for various visual
recognition applications. What the research found
out is that there is a terrible imbalance between light-skinned people and dark-skinned people inside of the dataset. It was found that 84% of the images inside the dataset, on top of which our model will learn, are of light-skinned people, and only 14% of these images were of dark-skinned people. So as a result, if we learn on top of this dataset, the model is going to be way better at recognizing faces of light-skinned people. And it will have way worse performance when it
comes to dark-skinned people, which is of course very bad. Now imagine that we
have resorted to the data because we wanted to avoid all the cognitive biases. Yet we have stumbled
upon a biased dataset. We will have a problematic
model as a result. Now, we will talk about statistical biases as
well later in the course. But you can see even
the datasets which we are using can have
these sorts of problems. Lastly, I would
like to talk about fairness. In recent years, fairness as a phenomenon has been growing in importance tremendously. Let's say that you are
a bank and you are deciding to whom you will grant a loan or you are building
a model which will be automatically deciding who
will be granted a loan. Now, it is a well-known fact that using data within such a modelling exercise can be incredibly helpful. What data might you use? The simplest form might be socio-demographic data. Let's say that you are using age, gender, income, residence, occupation, all these kinds of data points. But a new question arises: should you be using all of these data? Would it be unfair if you used some of these data points? What do I mean? Well, think, for example,
of the gender information. You look at your past data and learn that there is a difference in the default rates, so in who will not be able to pay back the loan, between men and women. For the sake of simplicity, let's say that middle-aged men are a few percent less likely to pay back the loan. Hence, you might consider it: you might be less likely to grant a loan to a middle-aged man. Is it fair, though? A man was most likely born a man and cannot change the state of this variable, and it then becomes discriminatory to act upon a variable which the customer cannot influence. So you say: alright, alright, I'm going to exclude
this variable and I'm going to make a data-based decision. We found this unfairly discriminating variable, and we are hoping that now your model is becoming fair. Well, what if, however, the gender information is somehow reflected in another proxy, in another variable in your dataset? Within the geographical region where you operate, there is an unfortunate wage gap between men and women. Hence, income becomes an indicator of the gender information. Maybe you should also exclude this income feature from the features which you are using for your loan-granting model. Now, we have just dipped our toes into the
topic of fairness. It's a really big one. I just wanted to highlight
for you that it's not only about the issues of projecting our own cognitive biases on top of the data, or that not all data are of the same quality, or that there can be statistical bias. We also have to think about
the fairness of using some data points and some
sources of the data. Just to sum up these two
lectures about the data, we want to make decisions. We need to make
decisions every day. We would like to make
the right choices within our
decision-making process. And it's troublesome to be
using our own perceptions, our own experience
and knowledge, because we have all of
these cognitive biases. That's why we might
resort to data, and that's why many companies are resorting to data to act as an aid in their
decision-making process to overcome these
cognitive biases, however, we should not think
that the data are flawless. The data can have
their own problems which we need to tackle. And within the second
part of the lecture, we have discussed that we can
still, as humans, project our own cognitive biases on top of the data. That based on how the data came into existence, they can have their own problems. That a sample of the data could be biased, and so the model which we build on top of these data will be biased as well. And lastly, that we should
think about whether it's even fair to use some of the data points that
we have available. And that is it from
this lecture. I'm looking forward to seeing you in the upcoming ones.
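One way to make the proxy problem above concrete: before trusting a model with the gender column removed, you can check whether a remaining feature still encodes gender. Here is a minimal sketch; the records, incomes, and the size of the gap are invented numbers, not real data.

```python
# Toy applicant records: the gender column was dropped from the model,
# but income may still act as a proxy for it (all values invented).
records = [
    {"gender": "M", "income": 4200},
    {"gender": "M", "income": 3900},
    {"gender": "M", "income": 4400},
    {"gender": "F", "income": 3400},
    {"gender": "F", "income": 3600},
    {"gender": "F", "income": 3300},
]

def mean_income(gender: str) -> float:
    """Average income of all records with the given gender."""
    values = [r["income"] for r in records if r["gender"] == gender]
    return sum(values) / len(values)

gap = mean_income("M") - mean_income("F")

# A large systematic gap means income partially encodes gender, so a
# model using income can still discriminate without the gender column.
print(f"average income gap between groups: {gap:.0f}")
```

If such a gap shows up, income carries part of the gender signal, which is exactly why dropping the sensitive column alone does not guarantee a fair model.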
7. Statistical bias: Hi, and welcome to another
lecture in the course Be Aware of Data Science. My name is Robert, and
in this lecture we will talk about statistical bias. We have actually
already touched upon an example of a statistical
bias in the previous lecture, where we had the biased dataset that contains an imbalance of images of dark-skinned and light-skinned people. Now, this is an important topic for the whole of data science. So in this lecture
I want to go into more detail about what statistical bias is. And then you have an assignment coming up where you can practice with both cognitive biases as well as statistical biases. So let's go for this statistical bias. First, to fully explain what statistical bias is, we need to understand a bit of a difference in terminology when it comes to an observation, a population, and a sample. So let's start with
a formal definition. In statistics, a sample is
a set of individuals or objects collected or selected from a statistical population. The elements of a sample are known as sample points or observations. Now let's start from the population. I think it's easiest to simply imagine human populations. A population could be every human being living in the world; a population could be every person living in a certain state; or it could be a much
narrower population. So, for example, everyone
who is using a computer. As you can see, the
population definition is always dependent on us. It's coming from us. It is about what or whom we
are interested in studying. If we are studying
an entire country, then it's everyone
who is living in this country who
creates our population. Moreover, it's not only
about human populations. Populations could be anything
even if it's not living. For example, if I
have a building and I have many light bulbs
in the building, all of the light bulbs in a certain building can
create my population. So this definition is
really, really flexible. Now let's continue
with a sample, as you can already
see from the drawing, that the sample is a subset
from this population. Now, when talking about the sample, I would start by asking:
about the sample? Why is it important? Why is it necessary? The reason is very
simple and practical. Oftentimes it will
be very impractical, sometimes even impossible, to collect the data about
an entire population. Let's bring a concrete example. We are interested in
an average height of the people living in
a certain country. Of course, you can
imagine that if this country has 10
million inhabitants, it will be completely
impractical, if not impossible, to collect all of the heights of people living in this country. What we can do instead
is that we will only draw a sample of people
from this population, and we'll measure the height
of the people in the sample. So we'll measure the heights
of our observations. Now we have the population and we have the sample defined. I think you can already see where I'm headed. Basically, what we will attempt to do is make an educated guess about the average height in this country, based on the measurements that we have made on the sample. For example, we have measured our 1,000 observations and we can see our sample average. We also see some deviations from this average, as of course not everyone is the same height. Looking at the sample data, we will attempt to infer the average height of the people in the country. How may it look? Well, let's be very concrete. We have measured that
the average height in our sample is 175 centimeters. Our claim about the
population will be that we are 99% certain that the true population
mean lies between 171 and 179 centimeters. Why are we not claiming that the population mean is also exactly 175 centimeters? We should not do this because the data is not perfectly right. We cannot have perfect inference from the sample to
the population. I mean, we could be
fairly lucky and we have drawn a very
representative sample. And indeed the
population mean is going to be 175 centimeters. But more likely, we are experiencing some degree of statistical bias, so that there is some degree of noise, error, or randomness coming into play. Now, when I'm saying statistical bias, again, please be careful: cognitive biases are over here, in our minds; statistical biases are in our data. Now, we have an
assignment coming up where you will practice with both of these. To define statistical bias in a fairly simple way, we could say that it is when the sample value differs from the population value. As I was saying, this could be due to a fairly long list of reasons. During the assignment, you will have a chance
to practice with these, such as selection bias, attrition bias or
exclusion bias. But for the sake of this
lecture, let's continue on. Let's say that we are
experiencing an exclusion bias, which is causing the
statistical bias. So our inference from the sample to the
population will not be exact, due to the statistical bias. How does the exclusion
bias happen? Well, let's say that
we have collected these 1,000 heights. How were we collecting these 1,000 heights? We have decided to spend five days in a city. Each day we wanted to collect 200 heights. Each day we tried to stand at a different place in a certain city, or maybe somewhere in the country, and we measured the height of everyone who passed by. Well, unfortunately, one day we stood just next to a sports club and a lot of basketball players passed by. Now, I think you can guess it: basketball players tend to be taller than the average population. On the other hand, elderly people did not pass by this sports center and they were effectively excluded
from our measurement. As you can see, our
estimation from the sample is biased
to a certain degree. We can sum up this little example with our first key learning: we usually only work with a sample of the data, and we are going to be inferring something about the rest of the population. Whenever we try to learn something from a sample and then make an estimate about the entire population, we have to account
for some uncertainty and errors due to the
statistical bias. And on top of it, we are usually
making some assumptions. For example, the assumption
that our sample is indeed representative of the population
from which it is drawn. And later during the
course within the chapter, we'll talk about the
inference a lot more. For now, I wanted to give you the core idea of a statistical bias. And just to connect to
what you already know, you have seen that there are four basic approaches
of data science. We have discussed that
there is a descriptive, exploratory, inferential,
and predictive approach. The statistical bias is not equally problematic for all of these; we will be studying them later on in the course. But when you think about it, when we are using the descriptive and
exploratory approach, we are only staying
within the samples. So whatever we find out, whether it's people buying
bananas and apples together, we are only claiming
it about our sample. This issue of a
statistical bias, this issue of inference is not impacting the descriptive
and exploratory approach. However, as soon as we try to generalize from our sample
to an entire population, either via the
inferential approach or via the predictive approach, we already have to account for these issues that we have just discussed. We have to account for a certain degree of uncertainty, error, or noise, basically summed up as a statistical bias. And now, as a little
bonus from this lecture, you might be now
thinking: all right, how often do I actually have to estimate the average height of people in a country from a sample? I want to build machine learning predictive models. Well, the same holds true for all machine-learning models. Why is that? Because a machine-learning model, a predictive one, will also be learning from a sample. You just need to broaden your horizons with the understanding of what a sample and a population are. Let's say that I would like
to build a predictive model which will be targeting my customer base. Whenever a customer comes to my store, I want to have a prediction ready to recommend him or her the best product. Well, our machine-learning model will be learning on our historical data. And what is then the population? Well, that's all the past customers which we have, but it's also going to be all the customers in the future, whom we don't have data about. We only have data about our past customers, and the population is also all the customers in the future. Again, we have the inference; we have to generalize to the whole population. So there's going to
be the same problem. Similarly, if you would like to build a visual recognition model that is recognizing
breeds of dogs. Well, let's say that we have data available; maybe we have pictures of all of the dogs in our city. Well, it's still a sample, because our population is all the dogs in the past, all the dogs in the future, and actually all the possible angles from which somebody could take a picture of a dog. So we certainly only have a sample available; the population is much broader. Again, the same issue. Let's say that we are in
the stock market and we would like to predict some stock prices in the future. What is the sample that we have available for our machine learning model? It's the past performance, the past data. And again, the future is what we are trying to generalize into, what we are trying to predict. So this understanding of a population, a sample, and an observation, and that there is always some degree of statistical bias, will hold true even when we are building machine learning models. All right, that is it
from this lecture. I thank you for being
part of it, and I'm looking forward to seeing you in the upcoming lectures.
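The height example from this lecture can be sketched in a few lines of Python. The simulated sample of 1,000 heights, the true mean of 175 cm, and the spread of 7 cm are assumptions chosen to mirror the lecture's numbers; 2.576 is the standard normal critical value for 99% confidence.

```python
import random
import statistics

random.seed(0)
# Simulate drawing 1,000 heights (cm) from a population whose true
# mean (175) and spread (7) we then pretend not to know.
sample = [random.gauss(175, 7) for _ in range(1000)]

n = len(sample)
mean = statistics.mean(sample)
sem = statistics.stdev(sample) / n ** 0.5  # standard error of the mean

z = 2.576  # standard normal critical value for 99% confidence
low, high = mean - z * sem, mean + z * sem
print(f"sample mean {mean:.1f} cm, "
      f"99% CI for the population mean: [{low:.1f}, {high:.1f}] cm")
```

The interval, not the single sample mean, is the honest claim about the population; a biased collection procedure (like standing next to the sports club) would shift the whole interval without the math warning you about it.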
8. I love the yellow walkman!: Hi, and welcome to
another lecture in the course Be Aware of Data Science. In the previous lectures, we talked a lot
about the biases. Of course, we started with cognitive biases that
are inside of our mind. And basically we are hoping that the data can help us fight these cognitive
biases so that we are making more rational decisions. However, I said
that also the data are not sacred and
the data can hold, for example, some form
of a statistical bias. And we should always
be questioning the qualities of the data
that we have available. Now I have prepared for you
a learning story which is called 'I love the yellow Walkman', where we will reiterate this notion of being skeptical towards the data that we have available. Now, for the learning story, do you remember Walkmans? Those were these big chunky things, almost the size of a shoebox, which were playing music. I do remember them; yes, I am that old. Nevertheless, around the year 2000, the breakthrough of the millennium, Sony was successfully manufacturing Walkmans. And they were selling a
black color of a Walkman. And now they had a
strange hypothesis. Well, maybe if we
would also launch a yellow color of a Walkman, our customers might enjoy it. How they decided to test this hypothesis is that they invited their customers to focus groups. What are focus groups? Well, basically you
invite your customers, you ask them a couple
of questions and the answers to these
questions generate for you a dataset
based on which you can then make decisions
within your business. These focus groups usually take from a few minutes up to maybe even an hour. You would also provide some reward to your customers for having participated. Now, one of the questions
that they have asked during the
focus group was, would you buy this yellow
Walkman do likely do maybe prefer it over the
black color of a Walkman which we are
currently selling. And the answer was, yes, I loved the yellow Walkman, I would definitely buy it. If Sony collected data this way, they might arrive to a conclusion that the
yellow color of a Walkman might be well appreciated and they will now start
manufacturing them. They would have big
expectations and they would offer it to the
customer as well. That is a catch
within the story. So Sony was likely very smart
within the focus groups, they didn't only ask the customers to provide
the self-reported data. So what do you buy
this yellow Walkman, they were also offering the
customer as a reward in a very specific way
when the customer was leaving the room went one is being done with the focus group. They said, Hey, pick your
reward on your way out, please. And there were these
two piles of Walkmans. One pile was with yellow Walkmans and other
pile was with like Walkmans. Now you can guess
which pile was almost empty and which of these
customers pick the most. Of course, it's
the black Walkman. So even though the customer was saying during
the focus group, yes, I would prefer
the yellow Walkman. I would definitely buy it. Well, when they were actually supposed to pick the reward, they were preferring the
black color of a Walkman. Luckily, Sony also collected
the second dataset from the focus groups where you
are not directly asking your customers what they think
or what they would like. You are observing them while they're picking
their reward, which gives you a much
more unbiased dataset. So now we have these two views, these two datasets, basically. You can see that the quality of the data really matters: how was the data collected? Can we rely on it? As I'm saying, usually self-reported data, which are
collected through surveys, can have a lot of
data quality issues. So, to sum up this learning story: I hope that you will always be skeptical towards the quality of the data that you have available, as there can be various forms and various sources of statistical bias. I'm looking forward to seeing you in the upcoming lectures.
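The Walkman story boils down to comparing two datasets about the same people: what they said versus what they did. A minimal sketch with invented focus-group data (the answers and picks below are made up to mirror the story, not Sony's actual numbers):

```python
# For each focus-group participant (invented data): did they say they
# would buy the yellow Walkman, and did they actually pick a yellow one?
said_yellow = [True, True, True, True, False, True, True, False, True, True]
picked_yellow = [False, False, True, False, False, False, True, False, False, False]

stated = sum(said_yellow) / len(said_yellow)        # self-reported dataset
revealed = sum(picked_yellow) / len(picked_yellow)  # observed dataset

# The gap between the two rates is the self-report bias in this toy data.
print(f"said they'd buy yellow: {stated:.0%}, "
      f"actually picked yellow: {revealed:.0%}")
```

Whenever both kinds of data exist, the observed behavior is usually the more trustworthy of the two, which is exactly the lesson of the two Walkman piles.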
9. The Limitation of Our Mind: Hi, and welcome to another
lecture in the course Be Aware of Data Science. So, now that we have covered the data part of data science, we have discussed
cognitive biases, why we would like
to use the data. We have also discussed
statistical biases and potential
issues in the data. I think it's time
to move forward. We will move to the science
part of data science. But before we get there, before we get to a data science model or
some scientific methods, I would like to in this
lecture talk about the specific limitation of our mind as it will be a solid foundation for
our upcoming lectures. Just a few minutes from now, I'm going to claim that data science creates scientific models. What are scientific models? Let me show you this definition from Britannica: scientific modelling is the generation of a physical, conceptual, or mathematical representation of a real phenomenon that is difficult to observe directly. Scientific models are used to explain and predict the behavior of real objects or systems. So this is our goal; in a few minutes from now, we
would like to get here. Now, let's come back. And as I promised,
we will start from a very intuitive understanding of the limitation of our mind. Our mind is used to three dimensions. I think it's easiest to imagine the three physical dimensions around us: we have the width, we have the height, and we have the depth. If we are now supposed to imagine, say, the position of an object around us, our mind can easily use these three dimensions to define the position of the object, or where it is. However, what if we were thinking in four or more dimensions? I mean, I, at least for myself, cannot imagine the world being defined by four or more dimensions. Now, please excuse my lame knowledge of physics. But as it turns out,
our world might be defined by more than
four dimensions. But can our mind imagine it? No. I think this is the
easiest way we can imagine this natural limitation: our mind is just simply used to working in two or three dimensions. We really are having
problems with comprehending more than
three-dimensions at once. But we are not here for
the physics, right? So let's be more concrete. We are experiencing this limitation of our mind also in our daily lives. Let's say that you were offered multiple jobs. We want to compare which one is better for us and which we will take. Even though there are a lot of factors to consider about each of the jobs, our mind will naturally limit itself to a maximum of three dimensions. But please be mindful here: I am not claiming that our mind cannot comprehend more than these three dimensions individually. What I'm saying is that our mind will have trouble comprehending more than three dimensions at once. So, to make a final decision comparing which job is better, we will limit ourselves to three: for example, what's the commuting time, whether the job role fits us, and what the salary is. This limitation is influencing
also our daily lives, but we are also not here
for deciding about jobs. So let's come closer
to data science. Now, let's say that you are
a marketing manager and you want to design a data-driven campaign. You have a lot of characteristics available about your customers, such as socio-demographics or behavioral factors. And let's say that you do not have any machine learning available, or anything of the sort, when you want to create some rules on which this campaign will run. Again, you will be able to consider only a limited number of dimensions. For example: deliver this campaign to customers of higher age, who have a higher salary and own a certain product. Now, I think I got my point
regards to the dimensions. And now the issue is that when we will be
studying a phenomenon, it will be defined by many dimensions. As we mentioned, your dataset about customers might contain dozens of socio-demographic and behavioral characteristics. And it will again be troublesome for our mind to comprehend this large number of characteristics to design, for example, some model on top of which we can deliver
marketing campaigns. So to sum it up, our mind is limited
in the number of dimensions with
which it can work, or in other words, consider
them at the same time when drawing conclusions
and making decisions. Now unfortunately, that is not the only limitation that
our mind is having. Another limitation
or another aspect of our mind that we have to consider is it's
limited capability to consider a large
amount of data. Where our mind could work is when the amount
of data is small. So for example,
like in this case, we have some 15 costumers. We can let say many
only look through their characteristics
and comprehend what is going on with them. For example, if there is
some PR chase pattern. But what would happen if the number of customers
would increase? We would have to
comprehend what is going on with hundreds, thousands, or even
millions of customers. Of course, our mind is
not able to do that. That is the second aspect or the second limitation
of our mind. This time it regards
essentially the number of observations or the amount of data that we
have to comprehend. If this number grows, our mind hits a
limitation again. Alright, and this
brings us back to data science models
and scientific models. A data science model
is essentially a simplification of reality, or as you can see
in this definition, it is a representation of a phenomena that we
are interested in. Now this representation
is indeed useful due to the limitations of human mind that you
have just heard about. If someone, let's say
a data scientist, provides a model which is a simplification of reality, hopefully we can understand things about this phenomenon and potentially draw conclusions, such as designing marketing campaigns. This was the reasoning for the science part
of data science. We will be building scientific models just
in the upcoming lecture. But within this lecture, I still wanted to provide you with a little bit of a bonus. And I would like to
touch upon one famous quote which is used around
data science a lot. It is originally
attributed to Mr. George Box and it says
All models are wrong, but some of them are useful. Now how I usually
see the sentence being interpreted is
that these models always fall short of the
complexities of reality in a sense that they cannot grasp the complexity
of the reality. And that it's some sort of
a weakness of data science and it's some sort of weakness of machine-learning models. Well, hopefully now you
can appreciate that the interpretation should be
maybe a little different. A model has to be wrong, because if it wasn't, then it wouldn't be a model. A model has to simplify reality, and for this simplification, you just have to ignore some noise, some randomness, and some nuances that the reality brings. A model has to be wrong
to a certain extent. It's definitely not the case
that our techniques will be too poor to comprehend
the realities of life. Alright, that's it
for this lecture and in the upcoming lecture, we will build a
data science model.
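To make the "all models are wrong, but some of them are useful" idea concrete, here is a minimal Python sketch. Everything in it is invented for illustration: a hypothetical reality of y = 2x + 5 plus noise, and a straight line fitted to it. The line is "wrong" on every single observation, yet "useful" because it recovers the underlying trend.

```python
import random

# Hypothetical "reality": y = 2x + 5 plus noise the model will deliberately ignore.
random.seed(42)
xs = [float(i) for i in range(50)]
ys = [2.0 * x + 5.0 + random.gauss(0, 3) for x in xs]

def fit_line(xs, ys):
    """Ordinary least squares for one feature, computed by hand."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line(xs, ys)

# The model is "wrong": its predictions never match reality exactly...
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]

# ...but "useful": it recovers the simplified trend well.
print(f"estimated slope = {slope:.2f}, intercept = {intercept:.2f}")
```

The residuals are never all zero, because the model ignores the noise, yet the estimated slope lands close to the true 2.0. That is exactly the simplification the quote is talking about.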
10. The "Science" Part of Data Science: Hi there. Let's continue our
exploration of the science part
of data science. The last lecture we
have concluded with the understanding
that a modal is attempting to represent
some real phenomenon to simplify it for our mind,
Let's continue now. I will say that we
can focus also on the second part of the
definition which says that scientific models are used
to explain and predict the behavior of real
objects or system. It turned that we have created
this little box which is grasping the purchase
decisions of our customers. How can we use it? What
can we do with it? Well, as the definition says, we could use it to explain the behavior of our customers. This is option one that we have. We would basically look inside of the model and observe the patterns that the model has learned. Now, this might be a bit counter-intuitive for now, but later in the course, where we will be talking about machine learning models and predictive models, it will become clearer. Now, if it is a machine-learning model that has been learning on historical data, how is it useful for us? Well, it was learning on historical data that would be tricky for our mind to comprehend. So we look inside of the model, observe what the model has learned, and use that to explain the purchase decisions of our customers. And what would be an example of using a data science model like this? An example of this could be our fruit and
vegetable stand that we have discussed previously. We learned that apples and bananas are frequently bought together. We would then reuse the way we display the apples and bananas for some other combinations of fruits and vegetables. Thus, we have adjusted our business process in some way based on this data science model. These options might seem a bit counter-intuitive, but as I said, later on it will make more sense. So this is option one that we have: we're looking inside of the model and letting it explain the
phenomenon for us. The second option that we have is that we would use our model to predict the behavior of our customers. We would provide some data as an input to the model, and we would observe what comes out as a prediction on the other side of our data science model. So let's say that we are running some sort of a loyalty program in our fruit and vegetable stand. We have a lot of historical data about the purchases of our customers, and we also have an option to call our customers. Now, we would provide the purchases from the previous week as an input to our model, and we would let the model predict what the customer might buy in the upcoming week. The model might say: hey, there is a high probability that this customer will purchase asparagus in the upcoming week. So we would pick up a phone, call the customer and say: Hey, wouldn't you like to come and buy some asparagus? I just got a fresh batch. Now of course, the model will not be perfect
with some customers. It will be wrong, and the customer will not come and purchase the asparagus from us. But with some customers, we are hoping that the model will correctly predict their behavior in the upcoming week. Thus, we create a business benefit by calling these customers and inviting them to buy the asparagus in our stand. As you can see, this
is the second option of how we can utilize
our data science model. Either we are explaining
what's happening or we're letting the model predict what
might happen, for example. Now, the very last thing that I would like to talk about when it comes to the science part of data science is: how are we creating these data science models? I mean, we by now understand why we need them. We also understand how we utilize them. The only question that remains is: how are they created? Now, please don't get confused, I am not talking about the approaches and methods that we already touched upon, so descriptive, exploratory, inferential or predictive. We are now one level higher, above these. We first need to make a more general decision on how we go about the creation of the model, and only then we will decide about the particular approach and connected methods of how we turn the data into valuable information, like we will do later in the course. So we are now one level above it. There are in general two ways how data science usually goes when creating a model: there is observation, and
then there is experimentation. Both of these are coming
from the world of science. Observation is, I think, easy to imagine. For example, in sociology, we observe how subjects such as humans behave and learn something from it. So let's say that we are observing how people walk and navigate around the library. In data science terms, these are most of the use cases that you would see nowadays: a company has some historical data available and uses these for the observation. Basically, they are constructing the model by observing, or learning, from historical data. There is still, however, a different path, which is unfortunately still
considered a bit exotic by many companies nowadays: this is experimentation. Experimentation is mostly associated and practiced within medicine nowadays. Let's say that you have a new medicament, and you hope that it improves the patients' lives. How would you know that this is true, and that the medicament indeed improves the patients' lives? Well, you can conduct a sort of an A/B experiment. To a portion of your patients, you provide the medicament, and to the other portion of the patients, you only provide a placebo, which is not the real medicament. You wait for a bit and see if there is a difference between the two groups. If your hypothesis was right and your medicament works, then the lives of the patients to whom you have provided the medicament improve, as opposed to those to whom you only gave the placebo. This approach is useful within data science when
you don't have the data, similarly to how the researchers within medicine do not have data about their medicament and do not know whether it works. So you would set up an experiment where you would offer your customers a product at different times of the day, because you have a hypothesis, let's say, that if you make the offer in the morning, your marketing campaign will have a higher conversion rate. You would be focusing on collecting the data from this experiment, on whether there is a difference in the acceptance of the offers by the customers. So you would essentially generate the data from which you can proceed further and from which you can build your data
science model. Now unfortunately, as I said, this latter approach of experimentation I oftentimes see being undermined and underestimated by companies, which usually resort to the observation. I think this underestimation has something to do with culture, maybe. I mean, imagine that data scientists have managed to persuade senior management that data science is something that we should be doing, that there is value in the data. And now all of a sudden they would come back to the senior management and say: hey, maybe we have a lot of data, but for this particular hypothesis, we do not have the right data. We would need to conduct an experiment that would generate the data from which we can create this data science model. Well, maybe the senior management wouldn't be too happy about it. I think that's also why data scientists are a bit worried to experiment nowadays. All right. So these are the two ways of how we could go about creating
a data science model. Now, just to summarize a little bit: we went through quite a lot on the science part of data science. At first, we started with the limitations of the human mind when it comes to the number of dimensions, as well as when it comes to the amount of data. Then we defined what a data science model is: it simplifies and represents a phenomenon, such as grasping the purchase decisions of our customers. When we build such a model, we can use it in two ways. For example, we can use it for explaining the phenomenon to us, to our limited mind. Or we can use this data science model to predict; for example, based on the purchases from the last week, we can predict the purchases in the upcoming week. And how do we go about building these data science models? Well, a model could be constructed through an observation or through an experimentation, or you can sometimes even see these combined. When it comes to the particular approaches and methods of how we go about constructing this model, we will get to that later in the course. So that is it from this lecture. I thank you very much for being part of it, and I look forward to seeing you in the upcoming lectures.
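The two options from this lecture, explaining what a model has learned versus letting it predict, can be sketched with a toy model. Below is a hedged Python illustration in which the baskets and item names are all invented: we "observe" historical baskets, learn co-purchase counts, and then use them either to explain the learned patterns or to predict a likely next purchase.

```python
from collections import Counter, defaultdict

# Invented historical baskets from our fruit and vegetable stand.
history = [
    {"apples", "bananas"},
    {"apples", "bananas", "carrots"},
    {"asparagus", "lemons"},
    {"apples", "bananas"},
    {"asparagus", "lemons", "carrots"},
]

# Build the "model": how often each pair of items was bought together.
co_bought = defaultdict(Counter)
for basket in history:
    for item in basket:
        for other in basket - {item}:
            co_bought[item][other] += 1

def explain(item):
    """Option 1: look inside the model - which items co-occur with `item`?"""
    return dict(co_bought[item])

def predict(last_week_basket):
    """Option 2: predict what the customer might buy in the upcoming week."""
    votes = Counter()
    for item in last_week_basket:
        votes.update(co_bought[item])
    for item in last_week_basket:  # don't suggest what they already bought
        votes.pop(item, None)
    return votes.most_common(1)[0][0] if votes else None

print(explain("apples"))       # bananas dominate the apple co-purchases
print(predict({"asparagus"}))  # a likely candidate for the phone call
```

Real predictive models are of course more sophisticated, but the two uses are the same: inspect what was learned, or feed in new data and read off a prediction.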
11. Introduction to Chapter 2: Hello and a warm welcome to the second chapter of the course Be Aware of Data Science. In this chapter, we're going to talk about the disciplines of data science. In this very brief lecture, I want to invite you to the chapter, state the goal of the chapter, and also give you a bit of an outlook of what lies ahead of us. First of all, if
we are looking at the overall course structure, we are already done with defining and understanding the essence of this field that we call data science. Now it's time to be a bit more practical and to tackle questions from the everyday life of data scientists. So first of all, there is a lot of confusion around data science. For example, people are still wondering: is data science the same thing as machine learning? Maybe you already know that this is not the case, but these are the confusions that we will be clarifying. And really, if we do so, we will help ourselves and gain a deeper understanding of data science by clearing out these confusions. Secondly, we should also focus on the people who are around data science, so we'll be also talking about data scientists. The second chapter will deepen our understanding of data science; we'll view it from new perspectives to show you even better what is ahead of us. Let's imagine that you right now go ahead and Google: what is data science, or even better, who is a data scientist, how to become a data scientist? Unfortunately, you will be met with more questions than answers. There will be people mentioning the work with databases, there will be people mentioning some statistical methods, machine learning. You will even stumble upon some advanced concepts from software engineering. Why is that? Why are we always met with this confusion? Well, the reason is
relatively straightforward. Data science is not a sole field; it's rather a joint initiative of several smaller overlapping areas. Each of these puts something special on the table. Examining these fields, which are involved under the common umbrella of data science, will help us understand data science itself. Hence, the goal of this chapter is to provide us with an overview of these disciplines as a kind of a starting line on which we can then again build later
during the course. As you can see on this image, we will be talking about these areas. We will be starting with statistics, sort of the origin of what we call data science nowadays. Then we will be talking about databases and how these have in recent years been turning into the topics of big data. Then we have data mining, which is sort of a field which emerged on top of the big data, if you would like to call it this way. Then we are digging into the areas of machine learning, deep learning and artificial intelligence. But now, to set the expectations: we're not going to study these individual disciplines deeply. We will be focused on: what's the special thing that each discipline puts on the table? How does it contribute to the overall field of data science? I wanted to mention this to set the expectations right. We are here to study data science and not the underlying individual disciplines. Once we have covered the basics of these disciplines, we will move in the second part of the chapter towards data scientists. This image that you are seeing right now on the slide is my view of who a data scientist is: the T-shaped skill set of a data scientist, and we will slowly build it up. We'll start by talking about the data science mindset. Then we'll move towards the vertical component of this T-shaped skill set, meaning the things in which a data scientist should be skilled. We'll understand what the necessary skills are when it comes to data science use cases. Then in the later part
we will move towards the horizontal part of this T-shaped skill set, which covers the related skills, the related disciplines where a data scientist should still have some idea. Hopefully, these will also give you ideas of how you can connect to data science use cases. For example, if you are coming from a data engineering or cloud engineering background, or maybe you are just coming from a certain domain such as banking or insurance, whatever domain you are coming from, you can always connect to the data science use cases which are done in your area thanks to your domain knowledge. We will talk about the importance of domain knowledge later in the chapter. Now, as a small bonus: if you then visit the bonus section of the course, I'm providing there a bit more informal lecture on the T-shaped skill set of the data scientist. And I'm reflecting on it, giving you hints and tips in case you, after taking this course, decide to continue and grow further into the field of data science, and on how you can use this T-shaped skill set as a sort of advice and guidance for your studies. Alright, so this was the goal of this chapter, and I provided you with an overview of what's ahead of us. And I'm really looking forward to seeing you in the lectures.
12. Disciplines: Statistics: Hi, there. It is time that we start exploring the disciplines of data science and studying how these are all contributing together, creating what we call data science nowadays. In this lecture, we will start with statistics and databases. Why are we starting with these two? Well, because they are sort of the historical and early foundation of what we nowadays call data science. So in this video, we will take on statistics, and in the upcoming one, we will continue with databases. So let's go for it and study the origin of data science in statistics. Now, statistics as
a field is ancient, and I'm not meaning that in any negative sense. We could be reaching out to famous statisticians decades or even centuries ago, or just pick a few years and see what they were doing across history. I think we can easily see the point in this definition that I brought from the Cambridge dictionary: statistics is the science of using information discovered from collecting, organizing, and studying numbers. And I think you can already now spot the connection between statistics and data science. We said that data science is the art of turning data into valuable information. As a matter of fact, statistics, or the statistical approach, will be one of the ways a data scientist can go when creating valuable information out of the data. And this way of thinking is
still prevalent nowadays. If we move to the contemporary era, you oftentimes see results of statistical studies. For example, that a certain dietary supplement is helpful for a certain health condition, which results from a statistical study, or on the contrary, that some habits such as smoking can lead to negative health implications. Long story short, statistics as a field has been super helpful to humanity for a very long time, and it still is. It allows us to quantitatively
understand them. Now, you might be thinking that we will now start exploring the world of
statistics and all of these statistical methods
that are available, such as descriptive statistics and inferential statistics, we are not going there. We will do this
later in the course. So in the chapter about describing and
exploring the data, we will talk about the descriptive
methods of statistics. And in the chapter about inference and
predictive modelling, we will talk about the
inferential statistics. Remember, what do
we are onto within these lectures about
disciplines of data science, we want to find the key ideas
of how these disciplines contribute toward we call data science knowledge is what special Do they put on a table? Later in the course? We will go there, do
not feel upset that we don't discuss it now
you will have lectures, assignments and much, much more. So what is the key
contribution that statistics has to data science nowadays? I would say it's the idea of a hypothesis. Now we have something new, which we did not discuss just yet, and which in my view is really the most extensive and important contribution of statistics to data science nowadays: a hypothesis, something we are interested in. This is a definition by the Oxford dictionaries, and it says, if we're talking about something countable: it's an idea or explanation of something that is based on a few known facts but that has not yet been proved to be true or correct. Or, if we are talking about something uncountable: guesses and ideas that are not based on certain knowledge; more of an intuition. We are revolving around this idea of a hypothesis. So good data scientists nowadays, similarly to statisticians, are often starting their use cases with a particular hypothesis in mind. For example, as we
already mentioned, we believe that a certain dietary supplement might have positive impacts on our health. We would start with this hypothesis and proceed to data collection. We would collect the data and then either accept or reject the hypothesis that we had at the beginning. Now, this idea of a hypothesis is really important. I want to run really quickly through one example of a use case where we are starting with a hypothesis, where we are taking the statistical approach. The hypothesis that we have is: a diet enriched with citruses helps sailors' health. So let's say that someone provided us with this hypothesis. We are statisticians.
What do we do? Well, we design a little experiment where we would collect data about this hypothesis. We would provide ships with two different kinds of diet. Now let's imagine we are a few hundred years back, when statistics was just in its early beginnings. To 36 ships, we would provide a basic diet which is based on lentils and beans. To 42 ships, we would provide a different diet: the typical diet, enriched with citruses, because that is what our hypothesis is about. We would let the experiment run, and once the ships return, we would observe the average percentage of sailors with medium or large health issues on each ship, which we of course want to be as low as possible. And we would see that on the ships which had the typical diet, approximately 55% of the sailors had medium or large health issues, whereas if we look at the ships which had the diet enriched with citruses, we would see that only 22% had some health issues. This is now telling us that our hypothesis might be correct, and that indeed the citruses are helping out with sailors' health. This is the key takeaway
from this lecture. The statistical approach, the statistical mindset, is to start from a certain hypothesis and then proceed towards data collection, just as we did within this very quick use case. And as I'm saying, good data scientists nowadays are still adopting this approach: starting from a hypothesis, then proceeding to the data, collecting it, and then either accepting or rejecting the hypothesis. All right. In the upcoming lecture, we will connect statistics with databases; as you can imagine, once we start with the statistical approach, data starts to arise and we need to store it somewhere. See you in the upcoming lecture.
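The citrus experiment above can be checked with a classic two-proportion z-test. To be clear about what is assumed here: the 55% and 22% figures come from the lecture, but the sailor counts per group below are made up, since the lecture only gives ship counts and percentages.

```python
import math

# Hypothetical counts: 200 sailors per group, matching the lecture's 55% vs 22%.
n_typical, issues_typical = 200, 110   # lentils-and-beans diet
n_citrus,  issues_citrus  = 200, 44    # diet enriched with citruses

def two_proportion_z(x1, n1, x2, n2):
    """Z-statistic for the difference between two proportions (pooled SE)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

z = two_proportion_z(issues_typical, n_typical, issues_citrus, n_citrus)
print(f"z = {z:.2f}")  # well above 1.96, the usual 5% significance cutoff
```

With these assumed counts, the z-score lands far above 1.96, so we would reject the "diet makes no difference" hypothesis and accept that the citruses help, exactly the accept-or-reject step described in the lecture.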
13. Disciplines: Databases: Hi there. In the previous lecture, we discussed statistics, and maybe we have conducted some statistical experiments resulting in various datasets. We need to store the data somewhere. In this lecture, we'll discuss the second predecessor of data science, which is databases, or the discipline of databases. Usually, people tend to tell themselves: the data is somewhere and we use it. Well, that's not quite it; there is more to it. For data science, data are in the ideal scenario, first, stored safely: we do not want our data to get lost or stolen. Second, accessible and retrievable: we want to be able to reach out to the data whenever we want, and we want to be able to take it out of its storage and, for example, move it around or copy it somewhere else where we will analyze it. Third, firmly described and known: we want to know what data we exactly have, and oftentimes we want other people to know what data we have; for example, we want the sales department to be aware that the customer care department is storing certain useful data. These are the three characteristics of what is ideal for data science when it comes to databases and storing data. Now, if we take the
comes to databases, we could start in
energy and time. So people have been
keeping records of various things since they were building the pyramids as well. Dr escaped paper-based records of their patients to
keep track of them. Farmers kept paper-based
records of their harvest and whether we felt storage
and access to the data. We cannot of course create
any available information. Hence, clearly, we felt
data stored in databases. There can be no data science that would make
available information, although veto people
have been storing and accessing the data
already long time ago. But let's move to more
contemporary era. When we move to the 20th century and talk about databases, we are really talking about the so-called relational model of a database. This dates back to approximately 1970, when Edgar Codd published a paper explaining the relational data model, which was a revolution of its own: a database user now just wrote a query that defined what he or she wanted, while not having to worry too much about the underlying data structures or where the data was physically stored. If we continue further in history, we will discover that companies attempted to bring together data from various databases, so that more complex operations and calculations could be done on top of them. This is when we start to refer to them as data warehouses. If we think about
the electronic form of storing the data, the discipline of databases started to emerge rapidly around the 1980s. Databases were essentially collections of data on a certain phenomenon. Companies have been creating databases storing data of their products, customers or equipment, and thus companies have been creating and using various database options, usually custom ones. Many companies would literally have a little server or multiple servers somewhere in the back closet. You could say that for many years, the aspects which we highlighted above were seemingly well addressed: we had a solution for storing the data, and we were able to access, use, and retrieve it. Unfortunately, as companies continued their
digitization process, various problems started to occur. From my perspective, the most prevalent one is the one of silos. Parts of a company were really getting closed up around a certain product and the data around it. Many companies, especially the large ones, could hugely benefit just if they managed to successfully connect the various data sources which they have. For example, there are two big divisions in an organization, one handling sales, one handling customer care. If the sales department had access to the data of customer care, they could integrate the knowledge extracted from this data into their own processes. So yeah, we are starting to have this electronic form of data, the relational data models and even data warehouses, but as time was passing by, due to these custom solutions, silos were being created, and that was becoming a bit of a problem. Now let's move to
the current times. The century started with the introduction of cloud computing. I could hold an entire lecture just to introduce you to the basic idea and basic concepts of the cloud, but let's stay very simple: instead of buying and maintaining that custom server in your back closet, you rent one from a wide and huge network which is available globally. This brings you a lot of benefits, such as lowered costs, better security of your data, better accessibility of your data, and much more. This shift, which companies have been making primarily in the past ten years, is also accompanied by the attempt to monetize the data as much as possible and to break down the silos which were occurring maybe some decades ago. This of course goes hand in hand with why data science is
so popular nowadays. If a company has a well-set-up cloud solution, it can be a real enabler for data science. For example, you have a certain dataset, and your data scientist now has a nice idea about a data science use case. All it takes is a few clicks, really, and he or she can right away have a server ready that is there just for him or her, with state-of-the-art tools and the data ready. It is really wonderful to have a well-established cloud infrastructure. Unfortunately, for example, when you talk about the EU region with its stricter regulations, companies oftentimes have to be careful when it comes to complying with the European regulations, which might sometimes be limiting the cloud adoption. That's not everything, of
course, from this lecture; I also want to show you one example of how a data scientist or a data science model interacts with a database solution. What I have over here is really a conceptual drawing of how a database in a company might look. First, you would have some sources of data. For example, your website is generating some data, your invoice system is generating data, and your warehouse is generating data. These are really the raw sources of the data. Then the data gets transferred to your relational database. There it might still be stored in its raw form: we would have the logs from the website, we would have the orders and items from our invoice system, and we would have the shipping data from our warehouse, which is still not overly useful. We want to increase
the usefulness of this raw data by various harmonisation and organization techniques. For example, we would create a customer table in which we would store, from various sources, all of the data that we have about our customers. So it could be the invoices, it could be the customer's interaction with the website. Then we would have one table which is about orders: about the order from the invoice system, about which items were in this order, and also about the shipping which was made with regard to this order. Now, this would be what we would call pre-cooked data; it's prepared for us, maybe for some specific purpose. Now, a data scientist or a data science model could be interacting with this database on various levels. The most classical and the simplest one is to be consuming the pre-cooked data. For example, we would consume the prepared customer table, and we would be creating some predictive model on top of it. It's there, it's prepared for us; we just query it and use it for our purpose. The second way how we
could be interacting with our database is that maybe we think the way the data were pre-cooked and prepared for us is not ideal; we would need it in a slightly different shape. So we would reach out to the raw data that we have in our relational database, and we would prepare it ourselves. We would create different views on top of this data, which would then fit our data science use case better. Lastly, we could even go all the way back, and we could be influencing the way we are collecting the data. Maybe we have a certain hypothesis, like in the field of statistics, and we would now be coming back to our warehouse colleagues and telling them: hey, this is an aspect that we should be collecting about the shipping of our orders. They would then be collecting a new data point, which is then running all the way through the database, and we can use it for our data science model. As you can see, a data scientist or a data science model can interact with a database
solution in different ways. So that is everything that I wanted to discuss in this lecture. We ran through the history of databases to really outline where we are headed and what we want to achieve with our databases: we want the data to be stored safely, we want to have it accessible, and we want to have it described. In the second part of the lecture, I highlighted the conceptual understanding of how a data science model is interacting with the databases. I'm looking forward to seeing you in the upcoming lectures.
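The raw-sources-to-pre-cooked-table flow described in this lecture can be sketched with Python's built-in sqlite3 module. All table names, columns and rows below are invented for illustration: two raw source tables feed one harmonised `customer` view that a data scientist can simply query.

```python
import sqlite3

# Two "raw" source tables (website logs, invoice system) feeding one
# pre-cooked `customer` view, as in the lecture's conceptual drawing.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE web_logs (customer_id INTEGER, page TEXT);
    CREATE TABLE invoices (customer_id INTEGER, amount REAL);
    INSERT INTO web_logs VALUES (1, 'fruit'), (1, 'veg'), (2, 'fruit');
    INSERT INTO invoices VALUES (1, 9.50), (2, 4.20), (1, 3.00);

    -- Harmonisation step: one row per customer, combining both raw sources.
    CREATE VIEW customer AS
        SELECT i.customer_id,
               (SELECT COUNT(DISTINCT w.page)
                  FROM web_logs w
                 WHERE w.customer_id = i.customer_id) AS pages_visited,
               SUM(i.amount) AS total_spent
          FROM invoices i
         GROUP BY i.customer_id;
""")

# The relational idea: the data scientist declares WHAT they want,
# not where or how the bytes are physically stored.
for row in con.execute("SELECT * FROM customer ORDER BY customer_id"):
    print(row)  # one (customer_id, pages_visited, total_spent) row per customer
```

This mirrors the three interaction levels from the lecture: consume the `customer` view as-is, query the raw `web_logs` and `invoices` tables to build a different view yourself, or go further back and change what the source systems collect.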
14. Disciplines: Big Data (1): We will continue our exploration of data science disciplines. We have already covered statistics and databases, and in this lecture, which will span across multiple videos, we will be tackling big data and data mining. In this lecture, I would like to talk about the emergence of big data, and I will then connect this and explain how the emergence of big data influenced data scientists and how they had to adjust and change their processes. So let's talk about the emergence of big data. The problem is simple, and I think we are all aware of it: the amount of data has been rising since the breakthrough of the millennium. A big point in the history of humanity was the onset of digitization. More and more processes are getting digitized, and people interact with digital products more and more. All of these digital products and processes generate data, for example due to simple operational reasons: if you're a bank, you might just have to store the logs of how your customers are interacting with your Internet banking solution, simply due to legal reasons and obligations. So all of these digitization
efforts and processes gave rise to the concept that we call big data. Even though I am describing databases and big data in two separate lectures in the course, there is not exactly a clear cut between the two, in case you are interested. Now, there is no explicit and broadly accepted definition of how big the big data really are. As the hardware and computational capabilities are constantly evolving and changing, this definition would also have to be constantly changing and evolving. Maybe one useful way to think about big data is that it is an amount of data that you wouldn't be able to work with on your laptop. Luckily, in 2004, Google introduced its famous MapReduce algorithm, which enables data scientists and companies to distribute large chunks of data that do not fit on a single machine across different machines. These machines are then going to collaborate to analyze one big dataset together. Thus, we can still work with big data. Don't worry. Now, I would like to start the conversation
on big data with regard to misunderstanding and overestimation. What do I mean? At first, on the misunderstanding: I think the misunderstanding is coming from the 2010s, when data science was peaking in its hype. By hype, I mean the maximum of the difference between what is perceived as being done and what is really being done. Many people were attempting to define data science as statistics that deals with big data: that it would be the same methods which we have been using for decades within statistics, and that we have just adjusted them so that they can be applied on these vast datasets which are now available. That is simply not true. In the upcoming video, we will explain the difference that comes with big data, and how data scientists have to adjust their approaches to working with and analyzing datasets. No, data science is not the statistics that
deals with big data. That's the misunderstanding. Secondly, I will
send it big data also comes with overestimation. Think about which
companies really have big data and they have also
the need to utilize them. So both having them and having
the need to utilize them. It's not that many companies, the classical examples
are of course, banks with their transactional
and markets data. Dell calls with their
glorious log data from all the telecommunications, some areas within health
care and also some more. The truth, we simply
did a load of data science use cases simply
do not deal with big data. Also, it might just happen for a data scientists that okay, I'm stumbling upon
some bigger dataset. I might just subsample it. So I create a random subset, a random sample from
this larger data. And they can work with
the smaller sample of the smaller dataset and
still draw some conclusions. So occasionally, even if you've
stumbled upon a big data, you are still able
to work with it as if it was a smaller dataset. I will tell you one
important thing to understand about big data
is that it's kind of overestimated nowadays
and simply a lot of use cases are just not
dealing with the big data. Alright, let's summarize with some key takeaways
from this lecture. Big data emerge due to digitization processes
going online, people interacting with them. It can only big data is often misunderstood with the
connection to data science. And lastly, I would say
that it's oftentimes overestimated when it
comes to data science. Now in the upcoming lecture, we will keep on talking
about big data. Maybe you will t usually
trainers talking about the V's of big data such as
volume and velocity. I do not want to go there. I mean, yes, the amount
of data is increasing. But the important thing is
to realize what sorts of data are we gaining and we will continue with that in
the upcoming lecture.
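As a side note, the subsampling workaround mentioned in this lecture can be sketched in a few lines of Python. The dataset here is simulated, so treat the numbers as illustrative only:

```python
import random

rng = random.Random(42)

# Simulated stand-in for a dataset that is inconveniently large:
# 100,000 customer ages drawn uniformly between 18 and 90.
big_dataset = [rng.randint(18, 90) for _ in range(100_000)]

# A uniform random subsample that easily fits in memory.
sample = rng.sample(big_dataset, 1_000)

# Conclusions drawn from the sample track the full data closely,
# e.g. the sample mean is a good estimate of the full-data mean.
full_mean = sum(big_dataset) / len(big_dataset)
sample_mean = sum(sample) / len(sample)
```

For skewed data or rare events, a simple uniform sample may miss important rows; stratified sampling is the usual refinement.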
15. Disciplines: Big Data (2): Hi there. In the previous lecture, we started to talk about the emergence of big data. Let's build upon it and discuss how it impacts data science as we know it nowadays. A big change relevant for data science, which comes indirectly with big data, is the nature of the data. In the days many years before all the social networks, online shopping, the telcos, machinery, and sensors, which are traditional sources of big data, think about how datasets were created. Someone, for example a statistician, had a hypothesis and collected data specifically to verify this hypothesis. So those were oftentimes very purposed collections of data, just like the one I told you about in the previous lecture on statistics: for example, we would specifically collect data about sailors' health in order to verify our own hypotheses. Now, however, in the era of digitization and big data, we also have unpurposed collections of data. Let's make a clear distinction about what I mean when I say a dataset has a purpose. I really mean a data science purpose, a purpose regarding the valuable information which is being derived from the data. For example, is a survey on political preferences purposed data? Yes, the information on overall political preference is derived from it. Is a bank's transactional data a purposed dataset? No, the bank is primarily using the transactional data for legal purposes and customer experience purposes, but let's say that from a data science perspective, there is no purpose at the moment when the data is being created and stored. So this would be, for me, the clear distinction between purposed and unpurposed data. Alright, then what does this difference between purposed and unpurposed data have to do with data science? Well, there are two things. First of all, if a data scientist intends to work with an unpurposed collection of data, he might need, firstly, legal approval for the new purpose, and secondly, availability of the data for this new purpose. For example, a data scientist who works for a bank sees transactional data. He cannot just take it and create a new sales model with these data: within the European Union, for example, the customer needs to be aware that his transactional data might be used for such a purpose. Then, when it comes to the availability of this transactional dataset that the data scientist would like to use: well, he might have to extract the data from some operational system, where the data is stored for operational purposes and not for our data science purpose, and bring it to some analytical system where he can analyze it. This is the first change that unpurposed datasets are bringing to data science nowadays. Secondly, unpurposed collections of data give a new perspective on the job of the data scientist: you see these datasets around you, and your job is to give them a new purpose. This is exactly what companies expect from data science nowadays, that data scientists and other people come, work with the data that these companies have, and give these datasets a new purpose and new ideas about what valuable information could be derived from them. This is the second big change: it gives a new perspective on the job of data scientists and people who work with data. Alright, so a key takeaway from this part of the lecture: with the rise of big data and digitization, nowadays not all datasets have a purpose from a data science perspective, and our task might be to put a new purpose on top of such a dataset. Alright, now that we understand the crucial aspects of big data, I have a few more things to say on top of it, which then nicely connect us to data mining. Firstly, as I mentioned, these are large, sometimes very large, datasets. So clearly, if a data scientist needs to work with them, they need some new tools. The way you can imagine these is that the dedicated tools do the very same thing as a normal tool working with a smaller dataset would do; they are just adjusted so that they can work with larger datasets. Remember to think about big data as possibly the data that would not fit on a single machine, so we wouldn't be able to analyze it on a single machine. As an example, let's say that we have data about 10 million people and their heights. It would either take too long, or it wouldn't even be possible, to calculate the average height of these 10 million people on a single computer. Instead, these dedicated tools will break the 10 million people into groups of 1 million people each and provide ten computers with these chunks of the dataset. Each computer would calculate the average height of its 1 million people, and then we would just collect the summaries from these ten computers and make the final calculation of the overall average height. So this is kind of an intuition of how these tools work. Secondly, it's also about the new methods. I would say that a new focus mainly regards the unpurposed collections of data that we mentioned previously. In the old days, statistics worked with well-defined statistical models and tools on top of purposed collections of data, and you would oftentimes prioritize that the data fits some statistical model. However, this had to change: with the rise of big data, we rather have to prioritize the data over the underlying statistical model. We will get to statistical models later in the course; the point is that with the rise of big data, we became less formal with our modelling approaches. This last notion connects us to the upcoming lecture, where we will talk about data mining, which is exactly the discipline that brings this change of becoming less formal with our modelling approaches and techniques. So see you in the upcoming video, where we discuss data mining.
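The average-height computation described in this lecture can be sketched as a tiny map-reduce in Python, with the ten "computers" simulated as chunks of one list. Note that each chunk returns a sum and a count rather than its own average, which keeps the final answer exact even if the chunks differ in size:

```python
def map_chunk(heights):
    # "Map" step: each machine summarizes only its own chunk of the data.
    return sum(heights), len(heights)

def reduce_summaries(summaries):
    # "Reduce" step: combine the small per-machine summaries into one answer.
    total = sum(s for s, _ in summaries)
    count = sum(c for _, c in summaries)
    return total / count

# Simulated heights in cm for a small stand-in population,
# dealt out across ten "machines".
heights = [150 + (i % 51) for i in range(1_000)]
chunks = [heights[i::10] for i in range(10)]

summaries = [map_chunk(chunk) for chunk in chunks]
distributed_avg = reduce_summaries(summaries)

# Same answer as computing it on a single machine.
assert distributed_avg == sum(heights) / len(heights)
```

The key design point is that only the tiny (sum, count) summaries travel between machines, never the raw data, which is why this scales to datasets that no single machine could hold.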
16. Disciplines: Data Mining: Hi, let's continue our exploration of big data and unpurposed collections of data. More concretely, in this lecture we'll talk about data mining, which is a newer approach for responding to the need of analyzing these unpurposed collections of data. As this is again a new term, let's take a look at the definition from the Oxford Dictionary: looking at large amounts of information that have been collected on a computer and using them to provide new information. Now, the terms knowledge discovery in databases (KDD) and data mining refer more or less to the same concept, the distinction being that data mining is prevalent in business communities, while KDD is most prevalent in academic communities. Now, this definition seems awfully similar to what statistics used to do, right? In a way you are right, but let's put things into perspective. We are talking about big data, and these unpurposed collections are rising in number. You might have noticed a clear distinction, and this distinction relates exactly to the rise of big data. Within the realm of statistics, you usually start with a hypothesis, which we discussed before, and only then would you think about what data you will need to accept or reject this hypothesis. Within the realm of data mining, however, we start with the data. Remember, we have lots of it. You would then be thinking about what question you could ask the data, so that new and valuable information could be derived by answering that question. So as you can see, big data gave us a new way of deriving valuable information out of raw data. Now, what would be an example of data mining? To see the contrast better, one very classical example would be clustering. We do not know much about the data, and we also have no particular hypothesis in mind. We are starting from the data and trying to figure out what kind of valuable hypothesis we can put on top of it. What's the new question that we could be asking? Clustering might simply be about finding similar groups of our customers. So we have customers as observations, and then we have various data about their socio-demographic characteristics. Clustering will put together customers who have similar socio-demographic characteristics and maybe similar behavior. We had no concrete question or hypothesis; we have just allowed this data mining method to group together customers who behave similarly. And now we can work with it. Now we can ask: alright, are there differences in the ages of our customers? Are there differences in how customers interact with our products? How were these groups formed by the algorithm? Then we can proceed with more concrete hypotheses, which we put on top of the data. Or we can right away discuss it with our business colleagues and say: hey, my algorithm has created these two groups, can we do something with that? Is it in any way useful? So you can see, we are starting from the data, and only then are we proceeding to more concrete hypotheses. Now, as a last learning from this part, where we are discussing big data and data mining, I would like to show you how these two fields, statistics and data mining, nicely click together when it comes to data science. Both of these approaches are preserved and well-practiced in modern data science, and in some use cases, even a mixture of the two will yield the desired benefit. Imagine, for example, that you are talking with the owner of a trading company, let's say even a few hundred years ago. The owner of the trading company tells you: hey, I would like you to make my sailors happier and more productive. There is no concrete hypothesis like we had previously. So you would start by collecting and examining the various datasets that he has: the number of sailors on the various ships, the lengths of the sails, salaries, the types of the ships, the geographical regions, the diets, and various productivity measures. So at first you do not have a hypothesis, but you have a lot of data available. You would proceed with data mining, and you might discover that there appear to be patterns within the diet variable: it somehow seems to correlate with the productivity and happiness measures. This variable appears to have some influence, but you are lacking a concrete explanation. So now you pose a concrete statistical hypothesis, which is that the happiness and productivity levels of your sailors are influenced by the diets that they have on the ships. We now come back to a few lectures ago, where we had a concrete hypothesis: we would provide the ships with different diets and afterwards observe the results, which would be a little statistical experiment. You can see this could be the flow of a single use case in which you combine the two approaches: at first you start with data mining, finding some interesting patterns, and then on top of them you put a concrete hypothesis, for which you construct a statistical experiment. Alright, that's everything for this lecture. Key takeaway: with the statistical approach, we start from a hypothesis and proceed to the data; with data mining, we start from the data and proceed to hypotheses. I'm looking forward to seeing you in the upcoming lectures, where we will explore more disciplines that together create data science.
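The clustering idea from this lecture can be sketched with a minimal k-means implementation in pure Python. The customer data below is made up, and for simplicity the two starting centers are the two extreme points (real libraries use smarter random initialization), but the loop is the genuine k-means alternation of assignment and update steps:

```python
def kmeans2(points, iters=10):
    """Group 2-D points into two clusters with the k-means alternation:
    assign each point to its nearest center, then move each center to
    the mean of its assigned points."""
    # Deterministic initialization from the two extreme points.
    centers = [min(points), max(points)]
    for _ in range(iters):
        clusters = ([], [])
        for x, y in points:
            d0 = (x - centers[0][0]) ** 2 + (y - centers[0][1]) ** 2
            d1 = (x - centers[1][0]) ** 2 + (y - centers[1][1]) ** 2
            clusters[0 if d0 <= d1 else 1].append((x, y))
        centers = [
            (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl))
            if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical customers as (age, monthly_spend) pairs: a younger,
# low-spending segment and an older, high-spending one.
customers = [(25 + i % 5, 100 + i) for i in range(20)] + \
            [(55 + i % 5, 400 + i) for i in range(20)]

centers, clusters = kmeans2(customers)
# The algorithm recovers the two segments without being told they exist;
# only afterwards do we pose concrete questions about how they differ.
```

In practice you would reach for something like scikit-learn's `KMeans`, which handles more clusters, better initialization, and convergence checks; the point here is only the hypothesis-free, data-first workflow.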
17. Disciplines: Machine Learning: Hi, and welcome to another lecture in the course Be Aware of Data Science. We are continuing our exploration of the disciplines that together create data science, and now we will slowly touch upon the area on the right side. So in this video, we will discuss machine learning and gain an understanding of what it is about and why it is important. In the upcoming one, I will talk about deep learning, and then we'll conclude with artificial intelligence. So let's start with the phenomenon of machine learning. What is it about? I will start with the statement that patterns matter. Let's say that I am creating this course while working from home, and at 11 o'clock I start to prepare lunch. Even though I'm not hungry yet, I do this because I know that I will get hungry around noon. This simple pattern that I learned about myself can save me a few minutes of hunger. This is a bit of a naive example, but let's say that I also need to go shopping. I have visited the shop many times in the past, and I learned that around two o'clock in the afternoon there are the fewest people in the shop, so I like to go shopping around that time of the day. That's another useful pattern I learned which makes my life easier and more comfortable. Two useful patterns for me. Coming back to data science, we have already been working with a statistical hypothesis, the one about the sailors and their diets. Our finding was that if we provide the sailors with a diet rich in citruses, they will become happier. This is another example of a pattern. A pattern is the valuable information that we are seeking when we are, for example, looking through the data. Patterns are useful; patterns matter. Now, unfortunately, there is a downside to it, which is that the process of searching for patterns in the data is costly. Manually going through all of the available data might take us a lot of time, even more so if we are working with large datasets; remember our lecture on big data, we now have a lot of datasets as a result of the digitization processes. Moreover, as time goes by, some patterns might become obsolete while new patterns arise. If we searched for patterns manually, we would constantly need to search for new ones. Alright, so we know that patterns matter and are valuable, but at the same time, searching for them manually might be very costly. Well, where do we go? We resort to machine learning, and I think this is why machine learning is so popular nowadays. Let me give a proper motivation for machine learning: instead of us humans searching for valuable patterns in the data, we let a machine, an algorithm, do the search for us. We hope that the machine will learn valuable and non-obvious patterns. You might now feel a bit puzzled if you have some prior knowledge about machine learning: aren't we supposed to predict the future with machine learning? Well, we can predict the future even without machine learning if we have the right pattern, like the one I learned about myself: around 12 o'clock I get hungry, so at 11 o'clock I start cooking. So it's not about being able to predict the future; we can't do that even with machine learning. Machine learning is about being able to automatically search through the data, and an algorithm, a machine learning model, will find some valuable patterns for us. I'm also bringing a more formal definition of machine learning. Machine learning is the study of computer algorithms that improve automatically through experience. Now we can go back to my example: I have visited the shop many times in the past, and I have learned the pattern. A machine learning algorithm does the same thing. Machine learning is seen as a subset of artificial intelligence; we will come to that in the upcoming lectures. Machine learning algorithms build a model based on sample data, known as training data, in order to make predictions or decisions without being explicitly programmed to do so. Now I think we can nicely imagine the machine learning algorithm. It's super useful for companies who have some prior datasets, which might be completely unpurposed: we would let machine learning search through these data, and hopefully it would find some patterns which, for example, allow us to predict some future behavior of our customers. So now we have gained an intuitive understanding of why machine learning is useful and what it is about. We will come back to machine learning in chapter four, where we will talk about inference and predictive modelling. There is even an assignment where you will have a chance to become a machine learning model: you will be searching for patterns in historical migration data, and then your task will be to predict some future migrations. Alright, so this is the key idea of machine learning. However, it has a subset called deep learning, and I will talk about that in the upcoming lecture.
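The formal definition above, algorithms that improve automatically through experience, can be illustrated with a deliberately tiny toy in Python, built around the shop example from this lecture. All the numbers are invented; the point is only that the rule is learned from observations instead of being hard-coded:

```python
from collections import defaultdict

class ShoppingHourModel:
    """Toy learner: estimates which hour of the day the shop is least
    crowded, improving its answer as more visits are observed."""

    def __init__(self):
        # hour -> [total people seen, number of visits]
        self.stats = defaultdict(lambda: [0, 0])

    def observe(self, hour, people_in_shop):
        # "Training": every visit is one more piece of experience.
        self.stats[hour][0] += people_in_shop
        self.stats[hour][1] += 1

    def predict_best_hour(self):
        # "Prediction": the hour with the lowest average crowd so far.
        return min(self.stats, key=lambda h: self.stats[h][0] / self.stats[h][1])

model = ShoppingHourModel()
# Invented past visits as (hour, people seen in the shop).
for hour, crowd in [(10, 30), (14, 8), (18, 45), (14, 12), (10, 25), (18, 50)]:
    model.observe(hour, crowd)

model.predict_best_hour()  # -> 14, i.e. around two o'clock
```

Real machine learning models are vastly more sophisticated, but they share this shape: a training step that accumulates experience and a prediction step that exploits the learned pattern.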
18. Disciplines: Artificial Intelligence: Hi there, and welcome to another video, in which we will conclude our little journey through machine learning, deep learning, and artificial intelligence. Artificial intelligence is a sort of discipline of data science, because data science has some sort of overlap with the area of artificial intelligence, and this overlap is mainly within machine learning. So let's focus on the definition of artificial intelligence, and we will see the difference. I frequently see people missing the clear distinction between data science and artificial intelligence, so I'm bringing this definition from Britannica: artificial intelligence is the ability of a digital computer or computer-controlled robot to perform tasks commonly associated with intelligent beings. Even though this is a vague definition, for me the key term is the word perform. Hopefully, we can now grasp where the overlap between data science and artificial intelligence lies: it's mainly within machine learning. Both of these areas use machine learning for some automated search for patterns in the available data. But data science is more passive, while artificial intelligence more actively attempts to construct an intelligent system which can maybe act independently, such as the robot I have on the picture. Let's talk about two examples of use cases that would fall under artificial intelligence. First, let's take a case where we would like to construct a virtual assistant that helps our customers answer the most common questions, either via phone or via text. Thanks to this, we save costs on our side and hopefully increase customer satisfaction, because the responses will be faster. We would at first want to use machine learning, which will search for patterns in our data: we let the algorithms search for the most common questions asked by customers, and maybe the algorithm will come up with the 15 most common questions asked by our customers. Then we would link the most suitable answers to these most common questions, and this will then be the intelligent system. We just need to connect it with the perform capabilities: we would allow the model to read the input from the customers, or listen to the input from the customers, and then, on the other end, respond automatically with the most suitable identified answer. So you can see, in this more complex system the machine learning is under the hood, still searching for the patterns, but on top of it we have the perform capability. Another use case or example of artificial intelligence would be this little robot called Pepper. The robot is literally a visual example: it moves around the space, and thanks to the processing of visual data, it can see the world around itself and navigate within it efficiently. It can also listen to people and their commands. So you can see, under the hood of this robot are many machine learning and deep learning models which can recognize visual and audio inputs, and then again there is this perform layer on top of them: it's able to act independently, as if it were an intelligent being. I hope these two examples highlight what artificial intelligence is about, and that we now also understand the difference between data science and artificial intelligence. I have a small bonus for you within this lecture, which again regards the European Union. The European Union has been taking the space of artificial intelligence pretty seriously in recent years. It even constructed a high-level expert group on artificial intelligence that is supposed to come up with regulations for artificial intelligence systems. And as you can see, they started off by defining what artificial intelligence means: artificial intelligence refers to systems that display intelligent behavior by analyzing their environment and taking actions, with some degree of autonomy, to achieve specific goals. Now, this is kind of an interesting one, and many industries are intrigued by this definition. It's very wide: it's not only about the robots or virtual assistants, it also includes a wide suite of models which we would previously call a statistical model or a machine learning model. All it takes to fall under this definition is that you display some intelligent behavior and take actions. If you have some machine learning model which is analyzing historical data about your customers and then automatically sending them a promotion campaign at the end of the month, then that's already, by the definition of the European Union, an artificial intelligence system, and it's going to fall under all of the regulations which are coming up. This was a bit of a bonus: the European Union is pretty serious about AI, and it's bringing up a very wide definition which will cover a lot of systems that many companies use nowadays. That's everything that I wanted to provide you with within this lecture. I hope that it's clear now how data science, machine learning, deep learning, and artificial intelligence all work together, and how these individual disciplines contribute to what we nowadays call data science. I'm looking forward to seeing you in the upcoming lectures.
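The virtual-assistant example from this lecture can be caricatured in a few lines of Python. The FAQ below is invented, and `difflib`'s fuzzy string matching is a very crude stand-in for the machine learning layer, but it shows the structure: learned question patterns, linked answers, and a "perform" layer that reads input and responds automatically:

```python
import difflib

# Hypothetical FAQ: the most common questions (as a pattern-mining step
# might surface them), each linked to its most suitable answer.
faq = {
    "how do i reset my password": "Use the 'Forgot password' link on the login page.",
    "what are your opening hours": "Our branches are open 9:00-17:00 on weekdays.",
    "how do i block my card": "Call our 24/7 hotline or block the card in the mobile app.",
}

def respond(customer_message):
    # The "perform" layer: read the input, match it against the known
    # questions, and answer automatically, or hand over to a human.
    matches = difflib.get_close_matches(
        customer_message.lower(), list(faq), n=1, cutoff=0.5
    )
    if matches:
        return faq[matches[0]]
    return "Let me connect you to a human colleague."

respond("How do I reset my password?")  # -> the password-reset answer
```

A production assistant would use a trained language model rather than string similarity, but the division of labor, pattern matching under the hood plus an acting layer on top, is the same.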
19. T-shaped Skillset of Data Science: Hi, and welcome to another lecture in the course Be Aware of Data Science. My name is Robert, and in this lecture, I would like to introduce you to the upcoming parts of the chapter, where we will answer the question: who is a data scientist? So let me give you the goal and an overview of what's ahead of us. What you have learned in the previous lectures is that data science is not a sole discipline but rather an interdisciplinary field, and all of these disciplines, such as databases, statistics, data mining, and, as we have just found out, machine learning, contribute to together create what we call data science. Now, if we ask ourselves who a data scientist is: is a data scientist an expert in all of these disciplines? Certainly not. A data scientist picks these disciplines up to the necessary extent, or to a certain extent. So looking at the disciplines alone is not too helpful for answering who a data scientist is. We will rather use a new concept, which I like to call the T-shaped skillset of a data scientist. If you are not familiar with a T-shaped skillset: when you talk about any job or occupation, it has basically two components. We have a vertical component; that's something where a person should be really good. So we have a couple of skills where the data scientist should be really good and hold the responsibilities over these areas and over these components of a data science solution. Then we have the horizontal component, where on the left side we will have what we call a soft wing, and on the right side we'll have what we call a technical wing. These are areas where a data scientist has some knowledge and some capabilities, but usually has to rely on other colleagues, who are, for example, experienced in the business or who are experts in some infrastructure components. This is the general idea of the T-shaped skillset. As you will see in the upcoming videos, we will discuss each of these components and go into the details of what they mean, to really build up this T-shaped skillset and to answer the question: who is a data scientist?
20. Skills: Mindset of a Data Scientist: Hi there. Let's start the building of our T-shaped skillset of a data scientist. And to do so, we will start at the very heart of this T-shaped skillset, with the data science mindset. What do I mean by a data science mindset? Well, actually, there are a couple of components which I think every data scientist should have in their mind. Let me explain. First of all, it's about being skeptical. And here I really mean that data scientists should be skeptical towards everything: towards the datasets, towards what the business stakeholders say, even towards what the method that he or she is using says. Well, let me go into details. When it comes to being skeptical towards the data, you have already seen in the previous lectures that there can be biases in our data, such as statistical bias. A data scientist should always question, for example, whether a data collection method might not have introduced some form of statistical bias into the dataset that the data scientist is using. So be skeptical towards your data. Secondly, being skeptical towards business stakeholders. I mean, if you are unlucky, and unfortunately it always depends upon the organization, it might happen that your business stakeholders are going to be pushing ideas for projects and use cases which might not be the best, but which are the best for their own agendas. So again, be skeptical about it. And lastly, being skeptical towards what the method says. I mean, we have not been talking too much yet about data science methods, but basically, what you can remember already now is that every method within the realm of data science has its own pitfalls. For example, later on we'll be working with correlation. Correlation is a very powerful tool and method within data science, and even this one has its pitfalls. If we're not careful enough when applying this data science method, we might stumble upon a pitfall, and even though the method makes everything look okay, things are actually not okay; something went terribly wrong. So first and foremost, the data science mindset is about being skeptical, essentially towards everything. Secondly, it's about being down to earth, or in other words, data scientists should be realistic and practical. This means promoting methods, approaches, and products that actually have the potential to create a social or business impact in a way that delivers an efficient solution, and not working on some crazy overkill. On the downside, data scientists are oftentimes really curious, and they would like to try some cool approach, the state-of-the-art techniques that were just recently released. But as a matter of fact, delivering a successful data science use case might oftentimes be more about talking for weeks with the domain experts, making sure that your data science model, your technical solution, will be well integrated into the domain where it attempts to create the benefit. Then, as the end result, you will not have some crazy state-of-the-art machine learning solution; maybe you will have something simple and straightforward, some super simple machine learning model alongside a couple of visualizations. So the second important aspect of the mindset of the data scientist is to be down to earth and practical. Thirdly, I would say it's about being ethical. Here, when it comes to ethics, everyone draws their boundaries elsewhere. When thinking about ethics, there are, of course, crazy cases and boundary cases like Cambridge Analytica, which you might have heard about in the media. And now you're telling yourself: no, ethically questionable cases would never happen, for example, next to me or within my organization. Well, in many organizations you will find ethically questionable projects and use cases; hopefully you are lucky enough to never stumble upon them. Now, when it comes to the data scientist, I think he or she should clearly draw some sort of an ethical boundary. For example, for me, I know about a couple of use cases that I would never like to work on. One example would be workforce optimization: I just never want to create a data science model that would, as a result, mean that some people are going to lose their jobs. That's one example of an ethical boundary for me, but as I say, everyone draws their boundaries elsewhere, and we could discuss ethics at length; we could probably make an entire course on ethics and data science. So I just wanted to mention it when it comes to the data science mindset. And maybe as a small bonus when it comes to ethics: it's also about data scientists not abusing their own technical advantage and technical competence. What I mean is that you shouldn't purposely misinterpret what the data has to say in order to pursue your own agenda or the agenda of your own department or company. But I guess not abusing one's technical competence and technical advantage is true for any profession. Being ethical is the third component of the data science mindset. Now lastly, I will tell you about being hypothesis oriented. Data scientists, from my perspective, should be able to form a hypothesis and operate around hypotheses. There are, for example, other data-related occupations, such as data analysts, business intelligence experts, or reporting specialists. From my perspective, this hypothesis-oriented mindset is a sort of personality trait that should distinguish data scientists from these other occupations. A data scientist should not expect that someone comes to him and says: hey, analyze these data, create this exact predictive model, put it together in that way, create this form of visualization. No, a data scientist should expect to work with the unknown. For example, he sees a dataset and can pose a new hypothesis on top of the dataset, or, the other way around, he sees a business problem and can form a concrete hypothesis out of the vague business problem. Being hypothesis oriented is another key component of the data science mindset. Alright, that is it for this video. We have discussed a couple of important components of the data science mindset, and in the upcoming lecture we continue with the building of the T-shaped skillset.
21. Skills: Rectangular Data: Hi and welcome. Let's continue with
the building of the T-shaped skill set
of data scientists. In the previous lecture, we have discussed the data
science mindset. And in this lecture, I would like to continue and explore the vertical dimension of
the T-shaped skill set. Now, these are skills and areas where a data scientist needs to be good, where he needs to be able to take responsibility. I will also bring examples of various tools and packages that you can utilize to apply these skills. If these packages and libraries are not telling you much, that's still okay, don't worry. As a key takeaway, we are discussing the concepts and the skills which are necessary. So the packages are just a bonus; maybe a bonus takeaway for you is that it's not rocket science, it's really just a couple of libraries and packages that the data scientist needs to be able to operate. Alright, so let's start. Before we get to the
individual skills, I would like to mention that a data scientist should first and foremost pick up a sort of an umbrella technology or umbrella tool: a programming language in which he or she develops his or her skills. Now, for me, this is Python, and Python is also my recommendation to my students. It's an open-source programming language in which you can basically apply all of the common data science operations, methods, transformations, however you would call them; it's just about picking up the right library. Now there are, of course,
alternatives to Python. One famous example is, of course, the R programming language. The two are fairly similar on the surface, but under the hood you really start to spot differences, which are, I think, originating in how these two programming languages came to be. I mean, R was originally developed mainly for the statistical community, which was evolving around academia, whereas Python was always closer to artificial intelligence, pattern recognition and machine learning, and it was always more focused on industry applications. Nowadays, I would say that for most companies, Python is the go-to programming language. So that's also my recommendation to you in case you want to pick up a
programming language. Once we have the umbrella tool, let's talk about data. As we already mentioned, there are various types of data. For me, a general data scientist should know how to work with what we call rectangular data. If you are not perfectly familiar with this term, let me explain. We mean the most basic, classical data that you can open up in Excel: your customers will be stored in the rows of the table, and the characteristics of your customers will be stored in the columns of this table. Of course, I'm generalizing with this statement. For example, there are some companies which have an entire business strategy or product based on a different type of dataset, such as unstructured data. For these niche cases and niche companies, the definition of a data scientist and what is expected of him or her is of course different. But for me, a generalization would be that a data scientist is able to work with rectangular data. Having the umbrella tool defined, and also the general data type we would like to work with out of the way, let's begin with the individual skills. And we start with
data pre-processing, which you might also
oftentimes hear referred to as data wrangling. So here's the thing. Data in the real
world is not clean. It actually can be real garbage. If you want to take this data and turn it into some data science model, you will have to clean it up. An example of this is that the characteristics about your customers that you would like to use for your predictive model are going to be stored in multiple sources, in multiple tables. A data scientist will need to put them together into a neatly organized single table that he can then use for the creation of a predictive model. These sorts of skills we call data pre-processing. And there are various
tools here. I would say that the absolute must, the bread and butter, is SQL, which stands for Structured Query Language. SQL allows us to query data which is stored in some database, and this is what you usually meet in companies: they have their data stored in a database, and a data scientist, using SQL for example, needs to be able to obtain the data from the database. Then of course it's not only about SQL, but also about various Python packages. Here I will name the two most common ones which I meet in the industry: pandas, which allows us to nicely wrangle our data, and, if you are talking about bigger datasets, the PySpark library. All right, so now
that we have wrangled, preprocessed, and cleaned our data, we can move forward with our data science process towards exploration and visualizations. I think we can all imagine what a visualization is, and also later in the course we'll be building some visualizations. Oftentimes the company has an idea of what is in the dataset, but most likely that dataset was collected for a completely different purpose than the project or the use case currently at hand. So a data scientist needs to be able to explore this data and visualize it. In other words, data scientists should sit down and really challenge the data, asking all the time whether what they are seeing makes sense. And this is usually done through data visualization. Here you can meet, for example, various business intelligence tools; you might have heard about Power BI or Tableau, which are well fitted for this purpose. Then we have, of course, a long list of Python packages which are available for visualization. Here you have Matplotlib, and you can also visualize in pandas. But I would say the industry standard nowadays is Seaborn, which is a powerful visualization library within the Python programming language. Alright, having the
exploration out of the way, we proceed to rectangular data modelling, or in other words, the creation of a data science model out of the rectangular data. Now, as we were discussing already, our descriptions and explorations of a dataset can, for example, find a pattern or produce a visualization that will serve the purpose of a data science model. But oftentimes we need to go further and, for example, use machine learning to create a predictive model out of our rectangular data. So this is what I mean by rectangular data modelling. Now, there are a lot of libraries already in Python which could be used for rectangular data modelling. I just want to bring out one important distinction for you. For me at least, there is scikit-learn, which you might have heard about. It's a very famous Python package. This one is an industry standard; really, for most of the use cases that I worked on in the past, you could pick up this library and find a method or tool which would deliver the solution, as the dataset my use case had required only a standard model. And then you will have datasets or use cases which
are kind of niche, and maybe a bit more exotic. An example would be if you have a dataset which contains a very strong component of time. Well, this is a bit of a niche application; we could refer to it as time series, and here the industry standard, which is covering most datasets and their needs, would fail us, and we would need to pick up some exotic library. And here is the thing for data scientists: you need to operate very well within the industry-standard library. And then, of course, you are not an expert in, let's say, dozens of niche libraries which are suited for a specific purpose. Rather, what you should be able to do is pick up such a niche library and, let's say in a few days, get the basics of it. And then, if the use case requires it, you apply the solution using this niche modelling library. All right, and now for
the final skill within this lecture: on a more senior level, a data scientist also picks up skills that have to do with the deployment of a model into production, model management, and integration. How do we integrate a statistical solution with other components in your technical infrastructure? An example from this bucket would be model lifecycle management. A model is not just built and then deployed into production; it then needs to be maintained, maybe replaced by a more novel approach, basically a refreshed version of our model that is incorporating the recent trends. An example of a tool here could be, for example, Git, a technology that allows us to version our code, or version our models, basically. Another example would be that when you are moving into production, you want to make sure that you have data quality controls. You do not want to have some crazy data point or crazy observation entering your model which could completely break its predictions. So we have to be integrating some data quality controls. This also belongs under the category of model management and integration, and here you would, for example, pick up another library, which is called Great Expectations. Alright, so this was a bit of a deep dive into the vertical component of our T-shaped skill set of a data scientist. And the upcoming lecture will of course continue building it up.
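The vertical skills from this lecture, wrangling rectangular tables scattered across multiple sources with pandas and then fitting a standard scikit-learn model, can be sketched in a few lines. This is a minimal illustrative sketch, not a recipe from the course; all table names, columns, and values below are invented.

```python
# Illustrative sketch: pandas wrangling of rectangular data, then a
# standard scikit-learn model. All names and numbers are made up.
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Customer characteristics scattered across two sources...
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "age": [25, 34, 45, 52],
})
purchases = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "monthly_spend": [120.0, 80.0, 200.0, 40.0],
    "churned": [0, 0, 1, 1],
})

# ...wrangled into one neatly organized rectangular table
table = customers.merge(purchases, on="customer_id")

# A standard predictive model from the industry-standard library
features = table[["age", "monthly_spend"]]
model = LogisticRegression().fit(features, table["churned"])
print(model.predict(features))
```

In a real project the two tables would come out of a database via SQL rather than being typed in by hand, but the merge-then-model flow stays the same.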
22. Skills: Specializations: Hi there. In the previous lecture, we moved forward quite a lot with the whole T-shaped skill set of data scientists, and we discussed the rectangular data skills. Now there is one more skill, or set of skills, missing to complete the vertical component of our T-shaped skill set. And that's what I would like to discuss in this lecture. Alright, this skill set is going to revolve around the areas of deep learning, natural language processing, and visual recognition. Here's the thing. Before we discuss this set of skills: you could say many data scientists start out with rectangular data, and only then do they decide to proceed into these niche areas of visual recognition and deep learning, which are closer to the artificial intelligence and machine learning technologies. If you are now thinking about becoming a data scientist, you do not need to stress yourself out thinking: Oh, I will actually need to learn deep learning, natural language processing, and visual recognition to be a data scientist. No; for me, this bottom skill or bottom component is kind of a specialization. As I say, some data scientists decide to specialize in these areas. All right, now let's discuss the skill itself. It's basically about the data. As we said, in the world you will not only stumble upon structured, rectangular data, but you will also stumble upon various more complex data sources, such as images or
natural language. By natural language, you could, for example, imagine an email written out by a human. Now, we need different methodologies and approaches to analyze this kind of data, and these usually revolve around deep learning. Let me give you some examples of use cases which could fall here. Let's say that your company is receiving letters, really, literal letters on paper, and you would like to automatically read out and understand what is in each letter. Now, this is actually a fairly complex task. At first, let's say that you scan the document. Now, from the scan, you'll need to use visual recognition to identify where on the scanned image the texts are. And then, once you have the texts identified, you need a natural language processing model, which will try to recognize what is written in this text, and then you have fulfilled the task. Another example of a use case within this area would be the following: let's say we are trying to predict sale prices of the houses in our area. Now, there is a lot to consider when it comes to house prices and what determines for how much a house will be sold. Let's say that we have our rectangular, structured data. So about every house we can say what the size of the house is, how many rooms it has, how far it is from the city center. But we have a hypothesis that also the images, the pictures of how the house looks from the inside, will determine its sale price. Within one predictive model, we will need to combine this nicely structured data with a couple of images from the house. Again, a rather complex task, for which we might pick up a deep learning model. These were just two examples of the use cases
which are within the specialized niche area revolving around deep learning. Now, as I was saying, most of the use cases which I see creating benefit for the industries are rather in the top part, so around rectangular data modelling. But you can already find companies which have, I would say, a mature data science culture and that are already experimenting with these niche areas, attempting to analyze images or natural texts. Or, as I was saying, you have some smaller niche companies which are building their entire product, or the entire company, around, for example, an engine which could be used for automatic recognition of what is written on letters. Then of course this company would say: hey, for us a data scientist is someone who can work with these approaches and technologies. And so, if you would like to develop models within this skill set, you would also need to pick up new libraries and tools, which could be rather different from what we have discussed previously within the rectangular data. You might have heard about Keras or TensorFlow; these are libraries which are well suited for the development of deep learning models. So this niche area, the specialization, also comes with its own skill set and toolset. And that is it. From this lecture I wanted to discuss this bottom part of our vertical T-shaped skill set. In the upcoming lecture, we will move towards the horizontal component of our T-shaped skill set.
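Libraries such as Keras or TensorFlow do the heavy lifting in this specialization, so a full example would need them installed. To make the idea less abstract, here is a hedged, plain-NumPy sketch of the basic building block those libraries automate for you: one dense (fully connected) layer followed by a ReLU activation. All shapes and values are arbitrary.

```python
# What deep learning libraries automate under the hood, sketched with
# plain NumPy: a single dense layer with a ReLU activation.
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=(4, 3))   # batch of 4 inputs, 3 features each
W = rng.normal(size=(3, 2))   # layer weights: 3 inputs -> 2 units
b = np.zeros(2)               # layer biases

# Forward pass: linear transform, then nonlinearity, i.e. ReLU(xW + b)
hidden = np.maximum(0.0, x @ W + b)
print(hidden.shape)  # (4, 2)
```

A deep learning library stacks many such layers and, crucially, trains the weights for you; this sketch only shows the forward computation.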
23. Skills: Technical Wing: Hi, let's continue with the building of the T-shaped skill set. In this lecture, we will focus on the horizontal component. So let us just briefly recap where we are right now. It all starts with a data science mindset. Then we continue, and kind of a baseline skill for a data scientist is the capability to work with rectangular data. And then, on the bottom, as I was saying, some data scientists decide to specialize into the niche areas around deep learning, natural language processing, and visual recognition. However, most of the time these skills are not sufficient to create a business value or a business impact. The thing is that with these skills you only create a model, you can imagine really a technical solution. But now we need to make the model fly. We need to give it wings, and there will be two of these wings. In this lecture, we are going to discuss the technical wing, which is about deploying the model and integrating it into the infrastructure which the organization has. And then, in the upcoming lecture, we will discuss the soft wing, which revolves around integrating the model well into the domain where we operate. So actually, those skills alone are insufficient because we only create the model; now we need to integrate it into the infrastructure and into the domain where we are operating. Let's give our model its technical wing first. The technical wing consists of two core skills, which revolve around data engineering and cloud engineering. So let's start with the data. The first thing I
would like to mention really is knowledge of the data. When it comes to the data scientist and the company where he or she works, every company has different datasets which somehow historically evolved into what they are today. A data scientist needs to know where things are: which data sources are useful, which are maybe not so useful, these sorts of things. Now, unfortunately, this is something that can rarely be learned, for example, from online courses like the one you're watching right now; it usually only comes with experience. The good news is that if the company is, let's say, healthy, and it has some form of documentation and a culture which is open to asking questions, then the data scientist can fairly quickly pick up the data knowledge; we're talking weeks or maybe three months. And again, I can't emphasize the importance of Structured Query Language enough. Sometimes it might happen to a data scientist that they would need to write a sort of data engineering pipeline, or they would need to collaborate with a professional data engineer on the creation of a data engineering pipeline which is bringing the data to their model. So, for example, extracting it from one system, then transforming it, and loading it into another analytical system where our model is developed and is analyzing the data, for example, our machine learning model. So it's also about some SQL skills, for example, to write data engineering pipelines. Moving on from knowledge
engineering skills, we are coming to Cloud. Now it really depends from which geographical region you
are watching this lecture. I'm speaking from Europe. So things might be of course different in your
geographic origin. But to put it simply, a lot of companies nowadays obtained for a Cloud solution. Cloud in short is
when you do not own a hardware and maybe
also a lot of software, but you are renting it out
from some large vendor. Cloud has a lot of advantages nowadays and data
scientists like it. Then our free major vendors
of cloud technologies, Amazon Web Services or AWS. Then you have Google Cloud, and then you have
Azure from Microsoft. Now, all of these
are implementing more or less the same concepts. They are just
called differently, at least from my perspective. For data scientists,
it is more of a question of whether
he or she can use the cloud and then
they can rather quickly adopt for a
different vendor. An organization is working
right now on Ada boys, and previously I have had
experience with Google Cloud. I should be able to adapt
rather quickly and easily. Lastly, some companies are
still operating in a way that they utilize their own custom
server instead of a clown, I would say every
year that passes, you have less and
less sad companies. If a data scientist
works in such company, then it will require
a few special skills of working with the custom
solution that the company has, such as having some
basic Linux skills and maybe some basic
command line skills. But in general, what I'm seeing data scientists are
more inclined to join companies and use cases
that are on the Cloud. All right, To sum up this
technical wing, as you see, it's about the
technicalities that are happening around of our model. How are the data
flowing into our model? That is the data engineering
part where our modal resides when we want to deployed or where are we When
developing the model? Is it happening on the Cloud? Is you'd happening maybe
on our laptop or is it happening on some
custom server solution? And this is the end
of the lecture. We have just discussed the technical wink of the T-shaped skill set
of the data scientists. And in the upcoming lecture, we go for the soft drink.
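The extract-transform-load flow described in this lecture can be sketched in a few lines, with Python's built-in sqlite3 module standing in for both the source system and the analytical system, and pandas doing the transformation. All table and column names here are invented for the example.

```python
# A toy extract-transform-load (ETL) pipeline: in-memory SQLite
# databases stand in for the source and analytical systems.
import sqlite3
import pandas as pd

# A source system with some raw operational data
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, amount REAL, status TEXT)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10.0, "paid"), (2, 99.0, "cancelled"), (3, 5.5, "paid")],
)

# Extract: pull the raw data out of the source system with SQL
raw = pd.read_sql_query("SELECT * FROM orders", source)

# Transform: keep only the rows our model should ever see
clean = raw[raw["status"] == "paid"].copy()

# Load: write the cleaned table into the analytical database
analytics = sqlite3.connect(":memory:")
clean.to_sql("orders_clean", analytics, index=False)
```

In a real company the source and target would be proper database servers, often in the cloud, but the extract-transform-load shape of the pipeline stays the same.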
24. Skills: Soft Wing: Hi, and welcome to the lecture where we finally finish our T-shaped skill set of data scientists. The last thing that we have missing is the soft wing of a data science model. Now, I cannot emphasize this enough: a data scientist will never create a business benefit or some business impact alone. He or she will need to interact with other people and will need to interact with the domain. We basically have two skills over here: people organization, or we could call it people skills if you like; then domain integration, or domain knowledge. Let's start with the domain knowledge. As we have discussed at the beginning of the course, the gift of data science is that it can be applied essentially in any domain: finance, manufacturing, any sort of sales, medicine; I could go on, but I think I already made my point. Now, when it comes to domain knowledge, a data scientist should understand a couple of things. First, the pains. What pains does a certain domain pose to the organization? For example, you are working in an organization where you understand that one of the largest pains is the speed of delivery. Thus, you as a data scientist are constantly wrapping your mind around how you can help out with the speed of delivery using data science methodologies. Understanding of pains is the first key piece of domain knowledge that the data scientist should have. Secondly, business
value, or in which ways can business value be created in a given domain? Is it through increasing the revenues? Is it through cutting of the costs? Or is it by extending the domain in which the organization operates? Maybe the margins in the current business are very tight and there is no direct way data science can help. Take banking, for example: maybe the only way a data scientist in a bank can help out with a mortgage product is by extending it. Thus, your bank would no longer be offering just a mortgage; the bank would also be searching for properties that might be ideal for this customer. And you don't only offer the mortgage, but alongside of it also a property, and the customer will just sign the mortgage on the property and everything will be sorted out. So this would be an example of a data scientist understanding that, okay, business value can be created in a certain way in this domain where I am right now. So understanding how business value can be created is the second key piece of domain knowledge. Then lastly, hurdles. What are the main hurdles that a given domain poses to the organization for which the data scientist is working? For example, in recent years a lot of companies within the European Union were having troubles with a particular regulation called GDPR, the General Data Protection Regulation. Companies had to adjust a lot of their processes, mainly regarding how they handle their customer data, to be compliant with this regulation. Now, data scientists should be pushing forward initiatives, projects, and use cases that go well alongside these posed hurdles. You will, for example, not be pushing for a use case that goes against such a strict European Union regulation. Lastly, then, it's about understanding the hurdles in a given domain. I will sum up the domain part by saying
that you can have the best predictive machine learning model that anyone has ever created, but if it is not integrated well into the domain, it will not create any real value at the end of the day, so it might very well be worthless. So, as you can see, domain integration and knowledge is really crucial. Now, let's continue and talk about people skills and people organization. Firstly, I will take communication with other people. The data scientist is going to collaborate with other people who are technical, and so he should be able to hold technical conversations. Also, he will communicate with non-technical people, so he should be able to translate his technical talk into a more actionable business terminology. So it is kind of about adjusting the communication depending on whom you are talking with right now. But I think we can all imagine that. Let's move to the second skill, which is the development
methodology. People in organizations organize themselves into some form of development approach, you could say. You might have heard about multiple of these, such as Waterfall, Agile, Kanban, Scrum, Kaizen; there are quite a lot of them. Now, it is not only that the data scientist should be able to fit within whatever the organization is pushing as the go-to approach for development, for example, some agile methodology. A good data scientist should also be able to adjust these so that they fit well for the project on which the data scientist is working. What do I mean by that? Well, as mentioned, nowadays you have mostly agile methodologies being put forward. But one has to keep in mind that these are maybe not perfectly applicable to data science models and products, as they were originally developed for software engineering. The thing is that within data science there is a lot more uncertainty than if you were working in software engineering. So oftentimes it is simply not possible to estimate how long things might take, because you just don't know what you will stumble upon within the data once you open up the dataset and start exploring it. So it is about the ability to reflect and utilize a development approach which is the best for the use case at hand. All right, and that is our T-shaped skill set completed. We have just answered the question of who is a data scientist. I thank you so much for being part of these lectures and I look forward to seeing the upcoming ones.
25. Introduction to Chapter 3: Hello, and a warm welcome to the third chapter of the course Be Aware of Data Science. My name is Robert, and in this short video I would like to provide you with an overview of the chapter which is ahead of us; it's called Describing and Exploring the Data. First of all, let's take a look at the overall course structure. We have already covered the essence of data science, its basic principles. We have also talked about the disciplines of data science, and we have even answered the question of who is a data scientist. Now it is finally time to start working with the data. And if you remember, at the beginning of the course I said that there are four basic approaches of data science: we have the descriptive approach, the exploratory approach, the inferential, and the predictive. So within this chapter, we will cover the first two of these approaches; we'll talk about the descriptive and exploratory approaches. We cover them together as they are closely related. And basically, the goal of this chapter is to understand the essence of these two approaches. Why are they valuable? Why should we never skip the data description or data exploration before we maybe move to some more complex approaches such as the inferential or predictive? Within this chapter, we want to fully understand these two approaches. What do they bring to the table? And even, what are the potential pitfalls within these two simple approaches? Alright, that is the high-level
view on this chapter. Now let me provide you with a more detailed overview of what's ahead of us. We are starting the chapter with a learning story called Describing the Life of a Foodie. I recommend not to skip this lecture, because it will provide an intuitive basis for the upcoming chapters. So we'll basically finish this learning story understanding how we should go about describing the data, what approach or flow we should be taking. Then, moving to the second lecture, we ask ourselves why it is so important to describe the data, so that we really understand why this approach is so valuable. Then, within the third lecture, we are of course covering the basics of descriptive methods. I want to fill in your toolkit with methods such as measures of position, measures of spread, and the relevant visualizations; we'll cover the necessary basics of these methods. Then, as I always keep saying, no data science method is sacred. So within the fourth lecture we'll talk about the calculating of average income, where you will see how even the simplest method of data science can fail if we are not careful enough. We will then conclude the first part of the chapter with a wonderful assignment. It's called The Power of Describing. And I really recommend not to skip this assignment, because people usually think that descriptions of data or data explorations are boring, that we should instead be building the cool machine learning predictive models. Well, I think that's very unfortunate. And within this assignment, I want to show you how powerful data descriptions or powerful data visualizations can really take you far and can really deliver valuable information. So I hope you will enjoy this assignment. Now, within the second
part of the chapter, we will move towards the exploratory approach. We are done with the descriptive one, and we'll move from description to exploration. We'll start off the second part by highlighting the difference between these two approaches. Then we have a learning story where we'll be asking ourselves the question: which house is the right one? Here we'll revisit the limitations of the human mind which were outlined at the beginning of the course, how we have troubles with understanding problems defined by many dimensions; but now you will see it connected to data exploration. I would also like to provide you with concrete methods when it comes to the exploratory approach. So we will talk about correlation, which is a really powerful tool, a really powerful method that we have available. Within the eighth lecture, we will go in depth into the method of correlation. But, as I keep saying, always remain skeptical; data science methods can fail us. Already in the next lecture, which is called When Temperature Rises, we will see how even the powerful method of correlation can fail if we are not careful enough. And then we will be concluding the chapter with a couple of learning stories. We'll be talking about storks, babies, football and presidents. What do these things have in common? Well, as you will see, they do have something in common; for example, we'll be asking the question: do storks bring babies, or can football predict who will win the presidential election? These are wonderful learning stories, and I hope you will enjoy them. The learning stories will uncover an important concept which is called spurious correlation. And because I think it's so important, I have also designed an assignment where you can practice with spurious correlation, which is at the end of the chapter. Of course, we'll conclude the chapter in a fairly classical way, where I will provide an article which is sort of a checkpoint where you can recap the most important learnings from the chapter, before you conclude the chapter with the assessment, a couple of quiz questions that I have prepared for you. And that is it for this video; I'm looking forward to seeing you within the third chapter.
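The fourth lecture's warning about calculating average income can be previewed in a couple of lines of Python with made-up numbers: a single extreme value drags the mean far from what is typical, while the median, a measure of position, stays put.

```python
# Why "average income" can mislead: one extreme earner pulls the mean
# far above what is typical; the median stays put. Numbers are made up.
import statistics

incomes = [1800, 2000, 2100, 2200, 2400, 50000]  # one outlier

print(statistics.mean(incomes))    # pulled up by the outlier
print(statistics.median(incomes))  # 2150, close to the "usual" income
```

This is exactly the kind of pitfall the chapter explores: even the simplest descriptive method needs a careful look at the data behind it.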
26. Describing the life of a foodie!: Hi and welcome to another lecture in the course Be Aware of Data Science. We are now starting a new chapter, which will be about describing and exploring the data. And I would like to start with a learning story which is called Describing the Life of a Foodie. Who a foodie is: these are wonderful folks who really enjoy travelling to enjoy some nice meals at some nice restaurants. For example, you don't just go out because you are hungry. When you go to a restaurant, you would pick a special place because of some special meal. And basically, you are treating it as a hobby. Now, imagine that we are foodies, and as a matter of fact, our friend is interested in joining our foodie hobby, and he would like us to describe this foodie hobby for him. The first question that he's asking is: how does your foodie trip look like? Our description, our first answer, is: well, we usually carpool in the morning and drive to a city where we would like to visit the restaurants. We then take the first meal of the day, usually a sort of a breakfast. Afterwards, we walk around the city or nearby to get hungry. We always go to the most desired restaurant for lunch, so that we can take multiple courses. Most of the time, we spend around €30 per such trip. Now, listening to our answer that we have provided
to our friend, what is the natural tendency of our mind when it comes to describing something, describing our foodie hobby? We are at first thinking about averages, we are thinking about what is usual. We can now translate it to the technical terms, because we will be doing the very same thing when we're supposed to describe the data. We will first describe some averages: what is usual, what is common. This will be the first step when we're describing the data, and it kind of follows the natural tendency that our mind has, as you have just seen within our answer to our friend. Now unfortunately, our friend
is still not persuaded. So he is asking, well, I'm still somehow worried
whether it won't cost too much, or if the foods are not too exotic for me. Well, there goes our answer: don't be worried. Sometimes we are adventurous and go for a visit to some very authentic places, but even there you can pick some less exotic options. Indeed, when it comes to money, we occasionally go on a spending spree, but that is very seldom, and it will be announced beforehand so you can skip the trip if that is a trouble for you. Okay. In technical terms,
how did we continue describing our foodie hobby to our friend? Well, in technical terms, we are interested in how things range and how big the extremes might be. We have seen it: sometimes we are adventurous, occasionally we go on a spending spree. The very same thing we'll be doing when we are describing the data. After we have described what is average, what is usual, we will reach towards the extremes. How do things spread? How do they vary? What is the minimum, what is the maximum value? We will again translate the natural tendency of our mind into our descriptive data efforts. Still, that's not everything:
our friend is now inclined, but he's still not persuaded, so he says: I still want to imagine it better. So our answer is: look, let me show you my foodie diary, and you can also check my social media where I post about these trips. You will see images of the meals that I took, and thanks to the diary, you will also be able to imagine how much money we spend on these trips. All right. This brings us to the last
point in the natural flow of our mind, which is visualizing, because oftentimes we want to see the data. We want to imagine better what's going on with the data. So oftentimes describing some usual tendency, some average, and then describing some extremes is not sufficient. We still might want to visualize the data. So much for this learning story; we will basically translate this natural flow of our mind into the upcoming lectures. At first we will be talking
about measures of position, similarly to how we were describing things to our friend. Then we will be talking about measures of spread, or variance; there are different terms, but you will meet terms such as range or mean absolute deviation. At the very end, we'll of course be visualizing. Sometimes we just need to see the thing in order to imagine it really well; we will be mentioning boxplots, pie charts, and scatter plots. I hope that this
learning story has provided a solid basis for our upcoming lectures where
we'll be talking about the descriptive approach
of data science. I'm looking forward to seeing you in the upcoming lectures.
27. Why do we need to describe the data?: Hi, In the previous lecture, we started to uncover the descriptive approach
of data science. We have discussed the
intuition behind it. And even before we proceed
to the predictive methods, I want to stop and I want
to talk about why is it so important to
describe the data? For me, it goes something like this: when thinking about descriptive methods, you should remember to think simply and objectively. You will want to describe how things look, without the danger of falling into some bias. So it's really going to be about simple descriptions of what's usual and what is an extreme, about some descriptions of ranges and dispersions. Because here's the thing: when we proceed to more
complex approaches, such as an exploratory, inferential, or predictive one, there will be a lot of space for falling into some bias or encountering a pitfall. With the descriptive approach, we kind of want to leave ourselves a space where the danger of falling into some bias is minimized. To give you a concrete example: already with the exploratory approach, we will be searching for
patterns in our data. We will want to apply domain knowledge and
domain thinking. And there is a danger that there is some cognitive
bias in our mind, and while we are applying ourselves on top of the data, we also project our cognitive bias onto the data. So before we go there, before we do that, we want to think simply and objectively about the dataset
that we have available. If I'm supposed to
put it really simply, I would say the descriptive
approach is about capturing the essence or the nature
of the given data, like what's in there, what is this dataset telling me? And later on, what will
I be able to do with it? If you want to think more
concretely about it, here are a couple of
questions that we could be answering with
the descriptive approach. Does the dataset cover the sample that I was hoping for? Isn't there some bias in my dataset, like some weird extreme values? Isn't the data somehow weirdly shaped? Here I'm already thinking about how the data was collected. Or you could already be more concrete: if you are taking the statistical approach, starting with the hypothesis and proceeding to the data collection, I could be asking, did I collect the data which I was hoping for? Or if you take the
other way around and you are starting
with data mining, where you will start with the dataset and only on top of the dataset you would attempt to form the hypothesis, you want to answer a question: is this dataset useful? Can I even form some sort of a hypothesis on top of it? So I hope that with
these two slides, I give you an
intuition of why it is so important to
describe the data. But there might be one thing that you are still
wondering about. Hey, what about the
predictive approach? Should I still be
describing the data even if I'm aiming for some inference or predictive approach? Yes, definitely you should be. Because think about it: what you are attempting to do with the inference or predictive approach is to generalize from a sample to the population, and you shouldn't go for it directly. You should at first focus
on the data which you have. You should kind of take a step back and your first
step should be, here is my data, here is my interest right
now I'm trying to capture the essence of the sample
that I have available. Once I have understood that, I have laid the foundation for the more complex approach of inferring or predicting from the sample to an entire population. So to sum it up: yes, the descriptive approach is always a good start, even if you are aiming for a predictive approach. All right, that is it for this lecture, and I look forward to seeing you in the upcoming ones, where we will already discuss the descriptive methods.
28. Basics of Descriptive Methods (1): Hi and welcome to another
lecture in the course. Be aware of Data Science. My name is Robert and as
promised in this lecture, we are starting with the basics of descriptive methods. In the previous videos, we have understood the intuition, how we might be applying these, and even in which order: that we might be starting with the measures of position. And that's exactly what we
will do in this lecture. And then in the
upcoming two lectures, we will continue with measures of spread and visualization. Now, the key in this
course is not to fill your toolkit with various
descriptive methods. The key is to get the
mindset of data science and how it handles the
description of the data. We will have a concrete example of applying one
of these methods. And then I will also
provide you with a brief overview of methods
within each of these sets. Again, please don't feel discouraged if we do not cover all of the descriptive methods which are out there. Really, the key is to get the mindset of how we are applying measures of position and to have a sort of an overview of what is available from these sets of tools. Let's start with the measures of position. As I said, I want to give you at first one concrete example of how an application of a measure of position might look. I'm pretty sure you have already encountered the simplest measure of position: the mean calculation. Let's say that our task or question is very
straightforward. What is the average
temperature in your country? And we have a dataset
available for various temperatures
in degrees Celsius. Now, the method we will use is the mean calculation. The calculation is pretty straightforward: we just need to add up all of the values. So you can see 35 plus 35 plus 32, up until plus 24, with the result of 263; that's the sum. Now, as a second step, we need to divide it by the count. We originally had nine values, so we will divide it by nine. And the resulting
mean temperature in our country is 29.22
degrees Celsius. As you can see, we are not
applying any domain knowledge. We are not thinking about what these temperatures mean
in the real world. We are just applying a very
simple descriptive statistic, which is a mean
calculation to get the essence of what is the average temperature
in our country. Of course, calculating
a mean isn't the only measure of position
that we have available. There is a lot more to these, and in general they fall into three categories: there are averages, or what someone could mean when they say "what is the average"; then we have some extremes; and then we have various positions. So let's talk about these. First of all, when we talk about averages, we have mean and median. And already in a few
lectures from now, I will show you an example
where mean as a method, as a descriptive
statistic tool, can fail, and we'll actually need to replace it with a median to really get the measure of average that we would like. So these two are
usually applied at the same time as they
are really super useful. Then you have some
nuanced applications of means such as weighted
mean or trimmed mean, for when you could be interested in knowing what the mean is without taking into account some extreme positions or some extreme temperatures. If you are aware that such extreme values could exist in your data, you might want to disregard them and focus only on what is average or what is usual. You can see we have various
measures of average. Now the two most common
ones are mean and median, and we will also see an application of these two in one of the upcoming lectures. Now for the extremes, I think that's pretty straightforward to imagine: we have a minimum value and we have a maximum value. I think there is no need to elaborate on that further. Let's continue with what you might not have encountered until now: the terms quartiles, deciles, and percentiles. What are these? Well, first
of all, always remember: we will order our sample, we will order our observations. You can see our original data was out of order: we had 35 degrees Celsius, then we had 32, 28, 31, and so on. It's unordered. We want to order it from lowest to highest. You can see we are starting with 24, we continue with 25, 28, up until 37. Now having our sample
ordered like this, we could be looking
at various positions. And now, all of these strange terms are only about into how many parts we are dividing our sample. If we are talking about quartiles, we're dividing our sample into four parts, where in each part there will be an equal amount of observations. If we are dividing it into deciles, it's about ten parts, where we have an equal amount of observations in each of the ten parts. And percentiles? That's dividing the population into 100 parts. Now, let's divide our population into deciles, because it's the simplest and most straightforward one for our case, as we have ten observations. So basically in each of these buckets, in each of these parts, we have one observation. And now someone tells us: hey, please report to me, what's the eighth decile? So we will be starting
from the lowest bucket, you could say the lowest tenth or lowest decile, and we're counting 1, 2, 3, 4, 5, 6, up until eight. And our eighth decile
is 35 degrees Celsius. You can see this is a bit
different approach to looking for a certain
position within our sample. We are not looking at the average, which has to be somewhere in the middle. We're also not looking at the extremes, which are at the edges. We could be looking at various positions, which are approximately at a third or three-fifths of the data, and we have a lot more freedom with these. Just for the sake of curiosity, in case it's already connecting in your mind: if we are talking about the median, well, we will be obtaining that in a similar way. We will be ordering
the data and looking at the value which is in
the middle of this data, which breaks down our population
into two equal parts. On the right side and
on the left side of the median we have the same amount of observations. You can see all of these are connecting to each other. Now, just to sum it up, these are the basic descriptive statistics, the basic measures of position which we are applying on top of our data. As I was saying, these are really simple and straightforward, and we want to just apply them objectively to understand what the essence of our data is: averages, extremes, and then we might be measuring some nuanced positions
within our sample. So this is it. In the upcoming lecture,
we will continue with the measures of
spread or dispersion.
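As a recap, the measures of position from this lecture can be sketched in a few lines of Python. This is a minimal sketch with a hypothetical dataset: the temperature values below are made up to resemble the lecture's ordered sample (ten observations from 24 up to 37 degrees), and the decile lookup uses the lecture's simple counting approach rather than a general quantile formula.

```python
from statistics import mean, median

# Hypothetical temperatures in degrees Celsius: ten made-up observations
# that resemble the lecture's ordered sample (24 up to 37).
temps = [35, 32, 28, 31, 24, 25, 33, 34, 36, 37]

print(mean(temps))             # the average of all values
print(median(temps))           # the middle of the ordered sample
print(min(temps), max(temps))  # the extremes

# Deciles by the lecture's simple counting: order the sample, then with
# ten observations the k-th decile is simply the k-th ordered value.
ordered = sorted(temps)
eighth_decile = ordered[8 - 1]
print(eighth_decile)           # 35, matching the lecture's example
```

Notice that no domain knowledge enters anywhere; we are only summarizing positions within the ordered sample.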
29. Basics of Descriptive Methods (2): Hi, In the previous
lecture, we talked about the measures of position: we had the mean, the median, and some other positions. Now we should not stop
there, we should continue. And in this video we will
cover measures of spread, which will again
enrich our view and our understanding of the data
which is in front of us. Before we go there, I would
like to just reiterate: we're not here to cover all of the available methods which belong under measures of spread. We are here to gain an intuitive understanding of what the measures of spread do, how they enrich our view of the dataset which is in front of us. And of course, we will go through an example of one measure of spread. Alright, so let's go for it. We have a beautiful
project ahead of us. We are a producer of Windows and I do not
mean the software, I mean actual physical windows. Now the issue is that
we would like to produce windows at
a specific width, let's say 100 centimeters. We are having a bit of troubles with our manufacturing methods. Occasionally it can of course happen that the
window is not the 100 centimeters wide but
101 or 99 centimeters wide. Which is a problem because if we deliver such a window
to a customer, the customer might
be complaining because it's not
fitting perfectly. And now we, as a
producer of windows, have two ideas or two different
manufacturing methods. We have produced a
couple of windows with each of these
manufacturing methods, and these generated two samples. We measured the width
within these two samples. Now the measures of central
tendency are insufficient here. If we measure the mean width within each of these samples, most likely both of them would say 100 centimeters, but it is very likely that with one of them we are achieving more desirable results, a more stable width of the windows, than with the other. The measures of central tendency are insufficient. We need to expand with the measures of spread, because we care about how much the width varies around the desired value of 100 centimeters. The method that we'll use is called mean absolute deviation. It belongs under
measures of spread. So we have a concrete example. It will show exactly
how the width varies around the optimal mean. So let's say that we
have one of our samples. These are the windows
that we have produced. And let's say that
the average of the sample is 100 centimeters, so our desired width
of the windows. But we can see that some of the windows were maybe 99 or 98 centimeters, and some of the windows were 101 or 102, and so on. So the width is really varying around the mean, or the average, and we need to measure this. Alright, we will start by
looking at a single window, and we are basically
interested in how far the width of this window is from the average width of all the windows. Expressed mathematically, we're basically taking the width of the window and subtracting from it the average width of all the windows, which will measure this distance over here. Now, we need to use an absolute value around this calculation, simply because we'll be doing this for all of the windows, and some are above the mean, some are below the mean, which would mean that the positive differences will be canceling out with the negative differences. So basically, by using the absolute value, we are really just interested in the absolute distance from the mean. This is the basis
for our calculation. We now continue and
we will do this for each of the windows
within the samples. So let's say in this case we have six windows; we will repeat these calculations six times, each time for a different window. Now, the result of this upper part needs to be divided by the number of windows. And now you can be reminded of the mean calculation that we had in the previous lecture: we are basically summing up all of the values and dividing by the number of values that we had. Only now we are working with the mean absolute deviation, so we have the absolute deviations, and we are calculating the mean from these absolute deviations. It's the same principle as when we're calculating the mean. Now believe it or not, but this is everything. By this simple calculation, we have measured the mean
absolute deviation, and it will answer the business problem that we're having. We can then compare the mean absolute deviation within the two samples that we have, and this way decide which of the two manufacturing methods is producing more stable results, where we are getting a more stable width, as close to the 100 centimeters of interest as possible. This was the mean absolute deviation. That's a concrete example
from the measures of spread. Now as I was mentioning, there are a lot more
measures of spread. You will be meeting terms such as standard deviation or interquartile range. I do not want to bother you too much with these terms; they are all more or less working in a similar fashion. And if you remember intuitively what we are doing when we are measuring some spread, that will be a great key takeaway from this lecture. This was the mean absolute deviation, and in the upcoming lecture we will continue with visualization. Looking forward to seeing you there.
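The calculation walked through in this lecture is short enough to write out directly. The window widths below are invented for illustration: both hypothetical samples have a mean of exactly 100 cm, so the mean alone cannot tell the two manufacturing methods apart, but the second sample varies more around it.

```python
from statistics import mean

def mean_absolute_deviation(values):
    """Average absolute distance of each value from the sample mean."""
    m = mean(values)
    return mean(abs(v - m) for v in values)

# Hypothetical window widths in centimeters from two manufacturing
# methods; both average exactly 100 cm.
method_a = [99, 100, 101, 100, 99, 101]
method_b = [95, 104, 100, 97, 103, 101]

print(mean_absolute_deviation(method_a))  # small spread around 100 cm
print(mean_absolute_deviation(method_b))  # noticeably larger spread
```

Comparing the two numbers answers the business question: the method with the smaller mean absolute deviation produces the more stable widths.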
30. Basics of Descriptive Methods (3): It is time that we conclude
our exploration of the descriptive
methods that we have available by talking
about visualization. Before we even get to the actual methods, I want to reiterate what I already said about why visuals matter, why it is so important to visualize the data: simply because sometimes the numbers, meaning the descriptive tools that we discussed until now, the measures of position and measures of spread, do not tell the story. This is a famous example that I obtained from
Wikipedia. It's called Anscombe's quartet; I hope I'm not butchering the name too much. Anyway, it's a beautiful artificial experiment where we have collected four samples: a first sample, a second, a third, a fourth. By looking at these four samples, so by the visual, we can right away see that they're completely different, that the patterns the samples are forming are completely different from each other. However, here's the thing: if we are
measuring the measures of position or some
measures of spread, our variance, they
wouldn't look different. I have an extract from it
here on the right side. We have a property, the mean of x, measuring the horizontal mean. We also have the mean of y, measuring the vertical mean. As you can see, on the horizontal axis they have exactly the same mean, so all four of these have exactly the same mean, and when we're talking about the vertical dimension, they have almost exactly the same mean. Now, we could be resorting to what we discussed in the previous lecture, the measures of spread. Well, here we would also be failing, as the variance on the horizontal axis is exactly the same for all four samples, and the variance on the vertical axis is again almost the same. The variance we have over here is a very similar method to the mean absolute deviation that we have discussed. Anyway, back to
Anscombe's quartet. With this example, I hope that you will remember the key takeaway: sometimes the numbers do not tell the story. Sometimes we just have to visualize the data to uncover what's really in it. Now that we have
covered the why, let's talk about the
visualization itself. Here's the thing: when I try to visualize, especially when I'm describing the data, when I'm just starting with the dataset, I like to start with what I call one-dimensional, or as we tend to call them, univariate visualizations. You are only focusing on one characteristic, you could say one column of your dataset, at a time. Let's say you have a dataset about the social demographics of your customers. Well, you have age, you have their occupation, and many other characteristics. Always look only at one at a time and produce a visualization. Here we have various
tools and methods. Of course, if we are talking about some numerical characteristics, numerical features, we have tools such as distributions. Here we will be having an age, for example, and we will observe various buckets of how old our customers are. Then we have the boxplot, which is a bit more complex visualization, but it's a wonderful one. Even though it looks weird, it's actually showcasing a lot of measures of position: for example, here in the middle we have a median, we're also measuring the quartiles, and we could also be finding out extreme values through it. It's just another method that we could be using to display a numerical characteristic. And on top of it, we could even visualize the individual
observations themselves. So you can see, on top of our boxplot, which in this case looks a little different, we have added the exact values of our observations. So if something like what was on the previous slide happens, we would be able to spot the differences between these observations. Then, if we are having some categorical characteristics, a categorical feature such as an occupation, we also have different univariate plots available, such as a count plot, with which we could be showcasing what the occupation is: are you a white-collar worker? Are you an entrepreneur? Are you a student? So this would be a one-dimensional display of our occupation characteristic. So I think the univariate visualizations, one dimension at a time, are always a good start to a data
visualization exercise. Of course, then we proceed to more complex visualizations, where we are looking at two dimensions at once. We would call these bivariate visualizations. And here it's mainly about the nature of the characteristics that we are looking at. We could either have numerical characteristics, such as age or income, or we could have categorical characteristics, categorical features, such as: what is your occupation? Then we're just asking, okay, are we plotting numerical versus numerical variables, numerical versus categorical, or categorical versus categorical? And again, we have a couple of options available, but basically it's all about reusing what we had within the univariate visualization. So if we were just plotting,
like in the boxplot, the observations
and their values, if we build it up into a bivariate visualization, you can see it becomes what we call a scatterplot: just the values of our two numerical characteristics. Then, if we are plotting a numerical variable against a categorical characteristic, again, let's reuse what we had within the univariate visualization, which was the boxplot. And we have one boxplot, so one showcase of the numerical values, per each category. In this case it's a funky one: it's coming from a penguins dataset, where we are measuring the body mass, or how much the penguins weigh, with regards to their species. And we can nicely see that the Adelie species and the Chinstrap species are much lighter than the Gentoo species in general. You can see that here we are already stepping
away from descriptions. We're no longer just
describing the data. But here is the edge between describing the
data and exploring the data because we're already seeing some pattern in our data. And that will be our goal
with the exploratory methods. So data visualization
is the point where you are kind of going
over the edge from the descriptions toward
the explorations where we are uncovering some
patterns in your data. Now, this is a two-dimensional
bivariate visualization. Then the most complex
visualization that we could undertake is multivariate
visualization. Here we are trying to
put into a same visual, three or more dimensions
at the same time. And here is where it starts to be problematic. Remember what we discussed at the beginning of the course: our mind is rather limited, and it has problems visualizing or imagining more than three dimensions at once. So really, the big question with a multivariate visualization is: how do we encode the third or fourth or fifth dimension into these visuals? We could be using shapes, we could be using colors, we could be using
some composite plot, as you can see in
this visualization that I have right
now on the slide, we are using colors and also sort of a composite plot, because we have some numerical characteristics over here of our beautiful penguins, which are the bill length, the bill depth, the flipper length, and the body mass in grams, and we're plotting them against each other. And at the same time, we are distinguishing the species of the penguins using the colors. We have really a lot of dimensions at the same time in a single visual. However, the issue with multivariate visualizations is that they are tricky to construct, because you always have to undertake this creative exercise of thinking: how do I encode the third
or fourth dimension into my visualization? So this is everything
that I wanted to talk about when it
comes to visualization. The key takeaways from this lecture really are: why is visualization important? Because numbers sometimes just don't tell the story. And then, when it comes to the visualization itself, we should be undertaking the simple approach of starting from univariate visualizations, then proceeding to bivariate visualizations, and if we have time and space for it, we could even take on some multivariate visualizations. And it's already the breaking point between the descriptions that we talked about until now and the exploration, or exploratory approach, that we'll talk about in the upcoming lectures. I'm looking forward to seeing you in the upcoming lectures.
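The Anscombe point from this lecture is easy to verify numerically. Below are the published x and y values for the first two of the quartet's four sets (sets I and II share the same x values); the summary statistics come out near-identical even though scatter plots of the two sets look completely different.

```python
from statistics import mean, variance

# Anscombe's quartet: the x and y values of the first two of the four
# sets, as published. Sets I and II share the same x values.
x  = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

# The summary numbers are (near-)identical across the sets...
print(mean(x), round(mean(y1), 2), round(mean(y2), 2))
print(variance(x), round(variance(y1), 2), round(variance(y2), 2))
# ...yet plotting (x, y1) against (x, y2) shows a noisy straight line
# versus a clean curve. The numbers alone do not tell the story.
```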
31. Calculating Average Income: Hi there and welcome to
another lecture in the course. Be aware of data science. In the previous lectures
we were discussing descriptive methods and you
might be telling yourself: oh, these methods are nice and straightforward; if we are applying them, nothing can go wrong. Well, in this lecture, I would like to talk about one of the really key takeaways from this course, which is that no data science method is sacred. We should always be skeptical about the methods that we are using, and even such a simple method as a descriptive one can fail us if we're not careful enough and if we're not skeptical enough. The example that we are
going to work through is called calculating
average income. So let's go for it. We have a task similar to the one we already faced with the temperatures; now it is: what is the average salary in your country? And here is our dataset, here is the sample. We have salaries ranging from €350, then we have €500, €820, up until €4,000, while we are interested in the average salary in the country. So we can proceed to the calculation of the mean. We already know how to calculate it: as a first step, we sum up all our values. So we will start with 350 plus 400, all the way to plus 4,000, and the result is €11,605. That's the first step. Now we need to divide by the count, or the number of observations that we're having. So we divide 11,605 by 11, and the result is €1,055. Now let's stop here
for a second. I mean, if we're just intuitively looking at the salaries, we have three hundred and fifty, five hundred, eight hundred, and so on. Does it seem reasonable that the average salary is €1,055? If we really relied on this measure of position, then we would have only three salaries that are higher than what we claim the average salary is; it would lie right over here. Well, something doesn't seem to be right. So let's reach out to the
second measure of position, the second measure of
central tendency that we have and calculate the median. Now in this case,
we need to order our observations from the
lowest to the highest. So we can see we are starting with 350, and then we have everything nicely ordered up until 4,000. And now we want to find the observation which is splitting our sample into two parts, where within each part we have an equal amount of observations. So we can see that left of our median we have five salaries, and right of our median we have also five salaries. So our median in this case is the middle value, €520. Now looking at it, doesn't €520 seem more
reasonable if we are claiming that this is
the average salary in our country when we are looking
at our original sample? Of course it does. What we have encountered is that the mean calculation, or the mean measure, is sort of failing us. The reason for that is that we have outliers, or extreme values, present. On the upper side we have a salary of €2,225 and one of €4,000. These are extreme salaries, and if we are
calculating the mean, they are skewing the mean, they are pulling it
towards the high values. And that's why the mean ended
up to be such a high value. Whereas for the median,
it's a different story. The median doesn't care
how high the salaries are. It only cares about
the salary which lies in the middle
of our sample. In this case, median might be a much better representation of what the average salary
in our country is. So I wanted to highlight with this story one key
takeaway, as you can see, even the simple
descriptive methods have potential pitfalls, and every method in data
science has potential pitfalls. The job of a gold data
science practice is to be aware of these and address
them within a use case. I think this is a
perfect example of what a big chunk of a job of
data scientist is about. You have your dataset, you will have a sudo
of metals available. You have your experience
and knowledge of things which might
potentially go wrong. It is now about trying
out the tools, you know, and always remaining skeptical about things that can go wrong. I mean, for example,
your stakeholder comes through and tells you, Hey, I want to know what the average salary in the country is. That question should already spin up a lot of thoughts in the mind of the data scientist. Should I be working with a mean? Should I be working with a median, a mode, or a trimmed mean, or any other measure of central tendency? Could there be outliers, such as in this case, or is the distribution maybe somehow skewed towards one side? So let me repeat it again. Every data science method
has pitfalls, and the job of a data scientist is to always challenge the data and the methods that are being used. Okay, this concludes
our exploration of the descriptive methods. As you can see, even they
can have their pitfalls. And I'm looking forward to seeing you in the upcoming lectures.
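The salary story above can be reproduced in a few lines of code. The numbers below are hypothetical, chosen so that the median comes out at €520 as in the lecture, with two extreme salaries pulling the mean far above it:

```python
from statistics import mean, median

# Hypothetical salary sample (in euros): ten ordinary salaries around
# the middle value of 520, plus two extreme outliers at the top.
salaries = [300, 350, 400, 450, 500, 520, 550, 600, 650, 2225, 4000]

print(mean(salaries))    # pulled far above most salaries by the outliers
print(median(salaries))  # 520: the middle value, unaffected by the outliers
```

Whenever the mean and the median disagree this strongly, it is usually a sign of outliers or a skewed distribution, which is exactly the pitfall this lecture warns about.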
32. From Description to Exploration: Hi, and welcome to another
lecture in the course, Be Aware of Data Science. My name is Robert, and in this lecture we are going to step away from the descriptive approach of data science and move towards the exploratory approach of data science. So let's go for it. But first, let's
remind ourselves what the idea behind both of these approaches really is. Within the descriptive approach that we have already discussed, it's mainly about questioning the usefulness of the dataset and understanding what the nature of the data is. We are possibly trying to uncover whether there are some issues, we are maybe questioning the data collection method, and we are in general trying to understand what this data is about. However, now we
are moving towards the exploratory approach, where we will be searching for some useful and non-obvious patterns. These can then hopefully serve as a basis for some more complex model. Or they can already constitute valuable information which can be, for example, used by our business colleagues in their decision making. Now, within this lecture, I would like to talk about the differences that
appear when we are making the step from a descriptive to the
exploratory approach. And there will be two of them. The first change, when taking the step from describing the data to exploring the data, is that we start to search for
patterns and we usually start to look at multiple characteristics
or multiple features at the same time. Now it just a few
lectures from now, it will give you concrete
and practical example of a very strong pattern that we might be searching
for cold correlation. But for now, just like to
stay with something super simple to focus at the
intuition behind the patterns. In this case, we have
a simple dataset. We have two characteristics. We have the gender of our buyer in the shop, then we have the wine type
which he or she purchased. So you can see we
have men and women, and for the wine type, we have white wine and red wine. Now, if we apply our descriptive methods and our descriptive thinking on top of this dataset, well, our findings might look something like this: the ratio of white to red wines bought is 40 to 60, and the ratio of men to women is 50 to 50. I mean, neither of these
appear spectacular in any way. Both of them seem reasonable. So we concluded the
descriptive stage and the descriptive approach. Now however, remember,
we are going to be a detective looking
for the patterns. Thus, we ask ourselves, is there some pattern in my data which could be
useful and interesting? And so if we are
actually looking with the exploratory
lens on this dataset, our findings might look something like this: men purchase at a ratio of 70 red wine to 30 white wine, while women purchase at a ratio of 20 red wine to 80 white wine. And you can see this is already an indication
of a pattern, which is that the
choice of wine is different between men and women. This is the sort of pattern that we might be looking for within the exploratory stage. What we'll do is that we might now take it and deliver it, for example, to our business colleagues, for whom it's already valuable information so that they can adjust their sales strategies. Or we might take this pattern and move it forward towards the inferential approach or the predictive approach. For the second difference: you might have heard that for data science to work properly, domain knowledge is needed. I was mentioning it
already a couple of times. Now, domain knowledge is needed in many components of a data science project; however, the exploratory approach is exactly the spot where it is most obvious how much we need the domain knowledge. Let's say that we
will get a dataset and we have socio-demographic characteristics of people. Now, if we're just describing the data, without the application of any domain knowledge, without looking at the context of everything, we would find something like this. We have two males. They were both born in 1948. They were both raised in the UK. They were married twice. Both of them live
in a castle and they are both
wealthy and famous. You can see the descriptive approach is not bringing us anything if we are just mindlessly describing the data. However, if we apply
domain knowledge and if we try to look at
the context of the things, well, we might see
something like this. Behind one of these we have Ozzy Osbourne, and behind the other one we have Prince Charles. So you can see the difference if we are
applying domain knowledge, if we are employing the context and not just describing
what we are seeing, our conclusions will be
completely different. A bit of domain knowledge
such as looking at how these two obtained their wealth, would make all the difference. I mean, one characteristic
would have been way more useful than all of the
ones that we have listed. But to know which characteristic would show the difference
between these two, of course, we need to apply the domain knowledge
and context. Maybe to show you things from a bit different perspective, I wanted to take
this beautiful quote from Mr. John Tukey, who is considered one
of the godfathers of exploratory data analysis. It's an old one from 1977. He says exploratory data
analysis is a detective work. A detective investigating
a crime needs both tools and understanding. If he has no fingerprint powder, he will fail to find
fingerprints on most surfaces. So we need the tools, we need the methods. Now, if he does not
understand where the criminal is likely to have put his fingers, he will not look in the right places. We also need the domain knowledge; we also need the understanding of the context. Whether the difference between
Ozzy Osbourne and Prince Charles helps you remember it, or the quote by John Tukey, I definitely recommend remembering that domain knowledge
I wanted to tell during this lecture and
in the upcoming lectures, we continue with
the Exploratorium, proud of data science.
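The wine example can be sketched as a tiny cross-tabulation. The purchase counts below are hypothetical, chosen to reproduce the per-group ratios from the lecture (70:30 red to white for men, 20:80 for women):

```python
from collections import Counter

# Hypothetical purchase records: (gender, wine type).
purchases = (
    [("man", "red")] * 70 + [("man", "white")] * 30
    + [("woman", "red")] * 20 + [("woman", "white")] * 80
)

counts = Counter(purchases)
for gender in ("man", "woman"):
    total = counts[(gender, "red")] + counts[(gender, "white")]
    red_share = counts[(gender, "red")] / total
    print(gender, round(red_share, 2))  # man 0.7, woman 0.2
```

Looking only at the overall ratios would hide this split; grouping by a second characteristic is what surfaces the pattern.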
33. Which house is the right one?: Hi, and welcome to another
lecture in the course. Be aware of data science. In the previous video, I have promised you that we
will be already starting with the exploration of the exploratory methods
of data science, pun intended by the way. Now, we will not go there just yet, because I have one more important learning story for you, which is called Which House Is the Right One? With this learning story, we'll revisit what we already discussed in the beginning of the course, that our mind has some limitations, and these really come to haunt us within the exploratory approach of data science. So let's go for it. And even before we go ahead and purchase a house, let's think about something simpler. Imagine that we want
to purchase a bicycle. Now, what factors would
we consider when we are deciding about which bicycle we will purchase? Well, let's say we would of course consider the price, maybe we would consider the weight of the bicycle, and maybe the intended use of the bicycle. So a fairly limited number of factors, maybe three or four factors to consider. However, things
really change when we want to purchase a
house or a property. If you were ever purchasing
a property or renting one, you know that there are really a lot of factors to consider. You should be
thinking of whether you want the property to be in the city or in the countryside, what the neighborhood is like, whether there is a school nearby that your children can attend, and what the price is. And then, of course, you are deciding about whether the house has an appropriate layout. I have just listed a few. We can then talk about location, whether the property
has a nice view, a lot, a lot of
factors to consider. Now we can revisit what we discussed at the
beginning of the course, that our mind is limited in comprehending two or
three dimensions at once. And remember, I'm not claiming that we can't look individually at all of these factors when we are deciding about purchasing a property; we just have trouble comprehending all of them at once. For example, if we have a couple of bicycles from which
we are deciding, well, it will be easy for our mind to decide which one is the best, taking into account all of these factors. But our mind simply has trouble doing this if we have multiple properties and we want to evaluate them based on
all of these factors. So our mind is very
limited and this will really haunt us within the exploratory approach
of data science. Let me show you what I mean. First of all, let's say that we have this visualization
over here. It's coming from a beautiful
pink goings dataset. In this case, we have a
two-dimensional visualization. There are two variables in play. First of all, we have the species of penguins
are fully Paris. We have that daily chain
strip and the gentle species. And on the vertical axis
we have the body mass of the penguins and
our minds similarly, like when purchasing a bicycle, can really easily see a pattern. We can see that the gentle
species are generally much higher in the body mass compared to the
other two species. This is a low
dimensional problem and our mind right away
sees the pattern. However, imagine if the
story was different, if the visualization was different. In this case, we have three dimensions, three variables. So we have the island where the penguins reside. Then we have again the body mass, but we also have encoded a third variable, a third factor, which is the sex of the penguins. And you can already
see that our mind already has troubles
comprehending what's going on. At first glance, you can't see exactly what the pattern within this visualization is. I would even encourage you to stop the video and think about it for a second. All right. I presume that you have stopped the video and I think you have
discovered the pattern, but it already took
your mind a couple of seconds and you had to
focus on the visualization. It didn't pop out. And this is just
three dimensions. Imagine that we would want to consider four or
more dimensions. We would be in the same trouble as when deciding about the property to purchase. Now, let's summarize
this learning story. I really hope that
it provides you with an intuitive understanding
of the challenge that we're facing within the
exploratory approach: we are searching for useful and non-obvious patterns. And the devil is hidden in the word non-obvious. Most likely, other people have been working with the data that we are working with right now. So the obvious patterns, which are in one or two dimensions, have already been discovered. And we are kind of forced to be discovering patterns in three or more dimensions, where our mind is fairly limited, and we need to be simplifying the problems, creating some data
science models, which then allow us to see the patterns in three
or more dimensions. This is also justifying
why oftentimes we will be resorting to machine learning methods, as we will later in the course, because a machine learning model does not have the same limitation as our mind has. You can provide it with a huge dataset with limitlessly many dimensions, and it will find patterns in this many-dimensional space. In such cases, we are more worried about whether the patterns which the machine learning model finds are reasonable ones, that they are making sense from the domain perspective. But that's a case for a different learning
story that we will have later in the course. So this was the learning story Which House Is the Right One. I hope you will take the key learning with you, and I'm looking forward to seeing you in the upcoming videos, where we will already start with the exploratory methods.
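One simple way to let the computer look at three characteristics at once, where our eyes start to struggle, is a grouped aggregation. The penguin-like records below are hypothetical, just to illustrate the idea:

```python
from collections import defaultdict

# Hypothetical records: (island, sex, body mass in grams).
records = [
    ("Biscoe", "male", 5100), ("Biscoe", "female", 4600),
    ("Dream", "male", 4000), ("Dream", "female", 3500),
    ("Biscoe", "male", 5300), ("Dream", "female", 3400),
]

# Group body masses by the (island, sex) combination.
groups = defaultdict(list)
for island, sex, mass in records:
    groups[(island, sex)].append(mass)

# The average per group summarizes a three-variable pattern in one glance.
for key in sorted(groups):
    print(key, sum(groups[key]) / len(groups[key]))
```

Instead of staring at a three-dimensional chart, we reduce it to a handful of group averages that our mind can compare easily.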
34. Correlation: Hi and welcome to another
lecture in the course. Be aware of data science. In this lecture we will
discuss correlation. So Robert, you have been telling us since we started the course, Be Aware of Data Science, that we are looking for patterns in the data. Wouldn't it be time that you finally show us a practical example of a strong pattern that we might find, especially during the exploratory phase of a data science project? Indeed, it is the time. In this lecture we
will talk about an incredibly powerful concept
of a correlation that is a prime example of a
pattern that we are interested in when we
are exploring a dataset. So let's go for it. And as I would like to go a bit in depth on this concept of correlation, I'm bringing along an artificial dataset and we will be doing some calculations. Don't worry, it won't be anything scary. Also, if you'd like to play around, this dataset is available, together with the calculations that we're going to do, as a handout to this lecture. Alright, so the dataset that we have contains two
characteristics. We have a temperature
in Celsius, and then we have sunglasses sales revenues. So the observations, the rows, in this case might be days, and we are observing what the temperature was on any given day and what the revenue from the sunglasses sales was. We are supposedly an owner of a stand that is selling sunglasses. Now, we are curious
whether these two are somehow connected, whether there is a relationship, because if there is, maybe we can utilize it
within our daily business. For example, if there is a positive pattern, we could then say: if the temperature is going to be high tomorrow, we can expect that a lot of sunglasses will be sold, so we might keep our stand open for longer hours. We said that our prime tool within the exploratory approach
could be visualization. So we go ahead and visualize the data. As we can see right away, there indeed is a pattern: as the temperatures are increasing, we can also see that the sunglasses revenues are rising, so we can see a clear pattern being formed. Now, this is great. This is already useful for us. However, we might want to
quantify this relationship, or rather, we might want to quantify the strength of this relationship. Why is that? Well, due to several reasons. For example, let's say that we don't have such a simple dataset, but we have a dataset which contains multiple characteristics, and we are relating all of these characteristics to our sunglasses sales revenue. If we would have maybe 10 or 15 characteristics, it would be ten or fifteen visualizations. And all of a sudden we start to have too many visualizations; it's all becoming a bit messy. Having instead a quantification of these relationships, let's say a single number which tells us how strongly these two characteristics are related, could be more practical. Secondly, having the relationship
quantified will also allow us to compare the strength of these relationships. What if two characteristics are forming very similar patterns with our sunglasses sales revenues? One would be the temperature that we are seeing right now, but we would also have a characteristic, let's say the number of visitors in our park, forming a very similar relationship with our sunglasses sales revenues. And for some reason, we only want to pick one pattern, the one which seems to be the best to utilize within our daily business. And it's kind of tricky to compare these two relationships just based on the visualizations. We want an exact quantification of the strength of
this relationship. Now, this quantification is going to happen, as I mentioned already, through correlation. Correlation is simply a connection between two or more things. If you would look at the research in various disciplines, you would see that correlation is being widely used. For example, there has been a proven correlation between educational level and income, food and hunger, of course, sleep and happiness, and smoking and cancer, of course. Remembering the idea of
correlation is I would say one of the most
important things right now. The thing to realize, and we will talk about this also later, is that if things correlate, if there is a relationship, that does not mean that one
thing is causing another one. With the idea of a correlation, we will not claim
that temperature is causing people
to buy sunglasses, but more on that in
the upcoming lectures. Now let's focus on correlation. It will be a bit of math, but just bear with me, please. Okay, let's start with it. As a first step, if we want to quantify the relationship between the temperatures in Celsius and sunglasses sales revenues, we need to calculate the means of these two characteristics. We have the mean of the temperature, 18.21, and the mean sunglasses revenue, €263. The second step that
we need to do is subtract the means. For example, if over here we have a temperature in Celsius of five degrees, then 5 minus 18.21 will give us minus 13.21. We do the same with both characteristics, and we end up with these two columns that we called a and b. Then, thirdly, we need to calculate
a few things now, things that are maybe becoming a bit more abstract, but don't worry about it, let me go through it. So we have a times b, where we are just multiplying these two columns, and then we are squaring both of the columns, so we have a squared and b squared. As a fourth step, we will just sum up these newly created columns. And you can see these are already very abstract figures, which will be tricky to interpret in the real-world terms of this dataset. But let's just finish up the calculation and I will explain. Lastly, we need to plug the numbers that we have obtained through the summations into a formula. What is this formula? Do you have to remember it? As I keep saying, it's
not about remembering the exact calculation of a correlation. One thing to remember is that we have a way to quantify the strength of a relationship. Secondly, it's the realization that within data science we oftentimes rely on decades- or even centuries-old
models which have been figured out by great
statisticians and great mathematicians
who have been studying the world
and the nature. And they figured out some very powerful
methods and models, such as a correlation. And we simply rely
on them because they have been proven
to be very powerful. And the very last thing which I would like
to highlight, by calculating the correlation exactly, is that you see that there is no magic beneath. People tend to look at data
of these models. Well, not quite. I mean, we have
just walked through an example of a correlation
and isn't methanol fact when we will be later on
in the course building predictive models using
machine learning. These are relying on a very similar principles like what do you see
right now on the screen. Basically, they will attempt to find a pattern or relationship between input features and an output feature is something that we are
trying to predict. So we will be using the very
same formulas over here. Let's say the
temperature is the input and then sunglasses cells that
are venues is the output. What I wanted to
highlight as a third, possibly key learning is that there is no magic
beneath and you don't have to be worried
to even go for some more complex
methods and study them. All right, but back
to our correlation, what did we calculate? We have result of 0.989. So this way of calculating
correlation will always result in a
number between minus 11. The closer we are to
minus one or one, the stronger the relationship is between the characteristics
which we are measuring, we will come to an exact
interpretation just in a second. For now, we can conclude
that there indeed is a strong positive
relationship between the temperature in Celsius and
sunglasses cells revenues. But anyway, this
world we already saw from our visualization, the correlation
quantification just confirmed what we already knew and we can now enjoy all
of those benefits of quantification that
we have discussed. Now, let's continue with
the interpretation. When you are looking at the
correlation, as I was saying, the closer you are to one or minus one, the stronger your relationship. In our case, if we would look back at the visualization, we have a pattern which is very similar to this one. That's why our correlation measure was very close to one. Now, if the relationship was of an opposite nature, that is, one characteristic is increasing while the other one is decreasing, then we would obtain a minus one or a value very
close to minus one. Then we have all of
these values in-between. The worst-case scenario for us, where we do not find any pattern, is if our correlation measure reports a 0. In such a case, it would look similar to this, and it basically indicates that there is no relationship between these two characteristics. So as I was mentioning, correlation is a
very powerful idea. The world around us is filled with correlations, and you could reuse this concept for your own dataset to see and quantify a pattern between two characteristics. In the upcoming lectures, we'll keep on talking about correlation, as it can sometimes be a bit tricky to work with. I'm looking forward to seeing you there.
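The five steps of the lecture can be written out directly. The temperature and revenue values below are hypothetical stand-ins for the handout data:

```python
import math

# Hypothetical daily observations: temperature (°C) and sunglasses revenue (€).
temps = [5, 12, 18, 21, 25, 28, 31]
sales = [40, 120, 250, 300, 380, 410, 480]

def correlation(xs, ys):
    """Pearson correlation, following the lecture's steps."""
    mean_x = sum(xs) / len(xs)                      # step 1: the means
    mean_y = sum(ys) / len(ys)
    a = [x - mean_x for x in xs]                    # step 2: subtract the means
    b = [y - mean_y for y in ys]
    sum_ab = sum(ai * bi for ai, bi in zip(a, b))   # steps 3-4: a*b, a², b², summed
    sum_a2 = sum(ai ** 2 for ai in a)
    sum_b2 = sum(bi ** 2 for bi in b)
    return sum_ab / math.sqrt(sum_a2 * sum_b2)      # step 5: plug into the formula

print(round(correlation(temps, sales), 3))  # close to 1: strong positive pattern
```

A perfectly increasing relationship gives exactly 1, a perfectly decreasing one gives exactly -1, and unrelated data lands near 0.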
35. When Temperature Rises: Hi. In the previous lecture, we learned about correlation, and it seems like a really powerful tool that we can apply within our dataset. However, remember what I keep saying across the whole course, Be Aware of Data Science: no data science method is sacred, and we should always remain skeptical towards what the dataset says, towards what the method that we are using says. We will have a learning
story called When the Temperature Rises, and we will keep on talking about correlation. In the previous lecture, we had this simple and straightforward dataset where we were measuring the temperature in Celsius and the revenue from our sunglasses sales. And we could see that there was a really nice, strong pattern, which was also confirmed by our quantification, by calculating the correlation, which reported 0.989, pretty close to one. So it seems like a very strong
positive relationship. However, remember
what I said, that no data science method is flawless. Well, let's keep on building our dataset. The temperatures got very high on a few days, and we collected a few more observations. So these were our original observations, from five degrees Celsius up until approximately 33 degrees Celsius, and now it gets very hot. So we collected
more data points. As you can see, the
pattern has now changed. It seems like when it's 33
degrees Celsius outside, it's becoming too hot for
people to go outside. So our sunglasses he sells,
revenues are dropping. Now, this still seems as okay for us as an owner
of a stand which is selling the sunglasses and we can still see with our
eyes are clear pattern. If the temperature is rising, combine until 33
degrees Celsius, we can maybe expect the
sales to also go up. If it goes beyond that, we can expect the
sales to go down. Now, what would
happen if we reuse the same calculation as we
did in the previous lecture, the correlation calculation on this dataset that
we have added on. Now, just to be mindful, if you'd like to play around
with these calculation is available as a
handout to this lecture. You can also play around
with these numbers. If we use this calculation, it will report 0.27. What's going on? I mean, we have said that if our correlation measure is
close to one or minus one, that indicates a
strong relationship, either a positive
one or negative one. And then we're saying
that if we are close to 0 with our correlation, we are coming to no relationship between
these two characteristics. Something is off here: based on the visualization, we clearly see that there is a relationship, but our quantification is saying that there
is no relationship. What happened then? Well, the method that we used, the correlation calculation, is defined for what we call linear relationships. And what we had in the previous lecture was indeed a linear relationship. We could nicely draw basically a line through the points, almost perfectly. That's why it's 0.989. However, what we
have in this lecture is already a non-linear
relationship. If you would like to
have it explained in technical terms
what went wrong: the correlation measure actually has an assumption. It's well-defined if there
is a linear relationship. So technically speaking,
we have violated the assumption of this
particular method, but I don't want to
get too technical. I want to reiterate
again on one of the most important
learnings from the course, always be skeptical towards
what your dataset says. Always be skeptical towards what your method says. This also brings us again back to the importance of data visualization, which is undoubtedly an important part of the exploratory approach of data science. Finally, to better visualize what I mean by nonlinear or linear relationships, I found this very nice visualization. It's showcasing the result of the correlation calculation. We can see that for linear relationships, such as the ones on the top, it is well-defined and it reports that yes, there is a relationship. However, if we go to the bottom and we see these nonlinear relationships, well, with our eyes, we see that there is a pattern. It's just that these
methods are failing us; they're not well-defined for nonlinear relationships. And we would of course need to use a different method to quantify this relationship. I'm not claiming that we do not have methods which can also quantify these relationships; it's just that we would need to use a different one. Alright, and that's what I wanted to say during this lecture. Always remain skeptical, and I'm looking forward to seeing you in the upcoming lectures.
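The failure mode from this lecture is easy to reproduce. With a hypothetical inverted-U relationship (sales rise with temperature, then drop once it gets too hot), the same Pearson calculation reports almost no relationship even though the pattern is obvious to the eye:

```python
import math

def correlation(xs, ys):
    """Pearson correlation, the same calculation as in the previous lecture."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    a = [x - mx for x in xs]
    b = [y - my for y in ys]
    return sum(ai * bi for ai, bi in zip(a, b)) / math.sqrt(
        sum(ai ** 2 for ai in a) * sum(bi ** 2 for bi in b)
    )

temps = list(range(5, 46))                    # 5 ... 45 °C
sales = [500 - (t - 25) ** 2 for t in temps]  # revenue peaks in the middle

print(round(correlation(temps, sales), 3))  # → 0.0: the linear measure fails here
```

The relationship is perfectly real, just not linear, so a linear measure is the wrong tool; this is the assumption violation described above.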
36. Do storks bring babies?: Hi, and welcome to another
lecture in the course, Be Aware of Data Science. This lecture has a strange name: Do Storks Bring Babies? It is a strange question, but it will help us uncover a very important concept
of a spurious correlation. So let's go for it. In the 1980s through 1990s, in some smaller cities in Germany, a strange phenomenon started to occur. At the same time, a lot of storks were moving into the city and a lot of babies
were being born. People started to think: do storks bring babies? This phenomenon was not isolated to this geographical region, so maybe even you have heard the story that storks bring babies. Anyway, it was bothering
the researchers in Germany, so they started to
investigate it. What bothered the researchers was something like this: is it possible that nature gave us this sort of a function or a process where storks are causing babies to be born, or maybe is there a correlation that could be useful? With correlation, you can really imagine the correlation
concept that we have learned just a few lectures ago. Maybe if there is a
useful correlation, we could be observing
the storks. And this will help us determine how many babies will be born. Well, probably since you
were like eight years old, you know that storks
do not bring babies. What was really
going on in reality? What went down looked something like this. There is a concept of a spurious correlation, and the underlying story
was rather different. At the very heart of
this phenomenon was a socio-demographic trend: young couples were moving into the city. These young couples were
settling and they had babies. So you could say that there
is a form of a causation or at least a very strong
correlation at the same time. So when the young couples were moving in, they were building houses. As it turns out,
these houses were an ideal nesting
place for the storks. So we could rather point
to this relationship. Now, the relationship
that was originally observed, between babies being born and storks moving into the city, is what we call a spurious correlation. That is a concept that we should be wary of whenever it comes to data science; let me explain. We have learned that within
the exploratory approach of data science, we are
searching for relationships. Now we are going to stumble upon some relationships that are
useful and can be used. For example, if we discover that young couples are
moving into the city, there is an increased
possibility that they are going
to have a baby. Also, the other relationship
could be very useful. We could, for example, use it for prediction
of bird migrations based on the young couples
socio demographic trends. However, when we stumble
upon a spurious correlation, we should not use it. The problem with
spurious correlation is that it is not stable, it is not reasonable. It is just a result of maybe
a chance or some randomness. In the upcoming lecture, we are going to talk about why we should not be using spurious correlations. I hope that with this example
of the storks bringing babies, you will remember that spurious correlations exist. And then in the upcoming lecture, we will talk about football and presidential elections, and we will learn why we should not be relying on a spurious correlation if we discover it.
37. Football and Presidents: Hi there. In the
previous lecture, we started to talk about
spurious correlation. Let's continue talking
about it as it is a fairly important concept. There is a famous case of a spurious correlation
called the Redskins rule. It regards the Washington
football team. You could actually observe the last home game of this team, and whether the Washington Redskins won would then be a great indicator of who will win the presidential election. For those of you who
are not from the US, you would basically
have two parties with their candidates. The incumbent party is the one currently
holding the office. And the thing is, if the Redskins win their last home game, then the incumbent party will win the elections. This case of
spurious correlation actually held for over 70 years. The first time it
happened once we, Franklin Roosevelt in 1936, the last time it
helped drawers in 2008 with Barack Obama during
the seventh years, you could have used
this pattern to predict who will win the
presidential election. Realistically though
do these two phenomena have something to do with each other? Is there some sort of reasonable relationship? That is very unlikely. This is just a coincidence and a perfect example of a spurious correlation. Since the correlation broke in 2008, it has actually reversed; it is since 2008 the other way round: if the Redskins lose, the incumbent party loses the elections. And this is a fairly practical reason why we should not be relying on a correlation that appears to have no logical justification and is most likely just a result of pure coincidence. I mean, these examples
are mainly for the conceptual understanding
of a spurious correlation, but you will stumble upon spurious correlation
also in your daily life. For example, you are a data
scientist and you will see a strong correlation between coffee consumption at the
office and the sales. Now, should you rely on these correlation and
maybe attempt to increase the coffee consumption
in the office with the hope that
also sales will go up. Certainly not as this is
most likely in other case, of a spurious correlation
which will be very unstable. So now that we have seen two great examples of
spurious correlations, let me summarize with some formal thoughts. At the heart of everything are causations: one thing is causing another. Unfortunately, discovering and proving causation is very hard; in scientific terms, you would most likely need a thorough and expensive experiment to prove it. That is why, within data science,
discover and prove a causation. We are rather, most of the time, relying on the idea of a correlation, or discovering a correlation of some sort. We are hoping that we have discovered two phenomena that correlate with each other and that the correlation is useful and reasonable. For example, the temperature outside correlating with sunglasses sales is a clear example of a useful correlation. However, we need to
be careful because we might stumble upon a
spurious correlation. Our task is to have a
sensible and skeptical mind and discard the spurious
correlation from our analysis. A spurious correlation could be caused by some unseen factor, such as in the case with storks and babies. In reality, there was this unseen factor of young couples moving into the city, and the underlying mechanism was completely different than we initially observed. Alternatively, it could be a result of pure coincidence. This was the case with the Washington Redskins rule and the US presidential elections. In any case, whether it is some unseen factor or a coincidence, we should not rely on a spurious correlation, as it will likely stop holding at a certain moment in time. In the upcoming assignment, you will have a
chance to practice your creative and critical
thinking with these concepts. So I hope you will enjoy it. And that's all I wanted to say in this lecture. I look forward to seeing you in the upcoming ones.
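To build intuition for how easily spurious correlations appear, here is a small sketch (not part of the course materials; the "coffee" and "sales" series are invented random walks with no connection to each other) showing that two completely unrelated drifting quantities often correlate strongly:

```python
import random
import statistics

def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def random_walk(n, rng):
    """A quantity that drifts at random -- there is no mechanism behind it."""
    level, walk = 0.0, []
    for _ in range(n):
        level += rng.gauss(0, 1)
        walk.append(level)
    return walk

rng = random.Random(0)
trials = 200
strong = 0
for _ in range(trials):
    # e.g. "office coffee consumption" vs "sales" -- generated independently
    r = pearson(random_walk(50, rng), random_walk(50, rng))
    if abs(r) > 0.7:
        strong += 1
print(f"{strong} of {trials} unrelated pairs correlate with |r| > 0.7")
```

The point of the sketch is that a sizeable fraction of these completely independent pairs still show a strong correlation, which is exactly why a correlation without a logical justification deserves skepticism.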
38. Introduction to Chapter 4: Hi and welcome to Chapter four in the course Be Aware of Data Science. In this very brief lecture, I want to remind you of the goal that we are pursuing during this last part of the course, and also to give you an outlook of the lectures that are ahead of us. So first of all, looking at the course structure, we are almost at the end. Within this last chapter, we will take the two remaining approaches and we'll study them more in depth. We'll be talking about inference and predictive models. So our goal is to gain an understanding of those. Now how can you
imagine this chapter? Well, until now we were
only describing and exploring the sample of the
data that we have available. Now comes the time to infer something from the sample about the rest of
the population. Or it is about the
time that we build a predictive model on
top of the sample, based on which we can then
make some predictions about the rest of the population
or about some future value. So we are tackling some really cool and powerful concepts
within this chapter. Alright, now let me show you the lectures. First of all, we will start very slowly and gradually. We will talk about the sample and we will talk about the population, and about how important it is to always have it clear in your mind: what's your population, what's your sample? And to also think about whether your sample is representative of the population. We also have a learning story which is focused on this topic. It's called 'Is the mushroom edible?', where we will be together building a visual recognition app. It will be cool. Then we're being more practical and we are talking
about the inference. We'll set up an
inferential experiment in our imaginary organization, where we'll try to test whether
our new sales strategy is really having a
positive impact on our profits and revenues. Afterwards, we will conclude the inference part, where we are just inferring one thing from a sample to the population, and we'll shift our focus to building predictive models. I want to provide you with a deep, intuitive understanding of how you can imagine the workings of a predictive model. We have an important lecture, lecture number four, called 'The function of the nature'. Then, once we have understood how we can imagine predictive models, we have an assignment which is focused on the most important thing about a predictive model: its inputs. Within this assignment, you will creatively think about the possible inputs for a predictive model that
we are about to build. Then we will continue
with our exploration of predictive models. We will be asking ourselves a question: when do we really need a predictive model, and when would a simpler approach, such as a descriptive, exploratory, or inferential one, suffice? Then we will be actually building a predictive model, and
you will have a chance to build one yourself or you can actually think about it
that you are becoming a machine-learning
model because we have a powerful assignment
within which we are trying to predict
deer migration. It's a kind of artificial use case on which I have worked in the past. So it will be a bit longer assignment, but I hope that you will enjoy it. Afterwards, it's time to conclude the chapter with more thoughts on the predictive part of data science, where we will think about why a predictive model is never perfect, why it is always a little bit off and never has perfect predictions. Then we have a learning story at the end of this chapter, which is about distinguishing dogs from wolves. It's basically a story of how a machine learning model can trick us and learn something completely different than we would have thought. At the very end of the chapter, we'll be thinking
about how our model is impacting our business or our domain, because once we have built a predictive model, we also need to verify and test whether it's really having the impact that we intended. So you can really see that we go through all of the components of the inferential and predictive approach during this chapter. Of course, we'll conclude the chapter rather classically: there will be a checkpoint on which you can recap all of the knowledge that you gained during the chapter. And at the very end, there are a few quiz questions waiting for you, on which you can test your knowledge. So this is chapter four and I can't wait to see you within the lectures.
39. From Sample to Population (1): Welcome to another lecture in the course Be Aware of Data Science. With this lecture, we are officially opening up the next chapter, which is about inference and predictive modelling. But before we get to the inferential methods and predictive methods, I want to set a healthy basis on which we will build in the upcoming lectures. The most important thing, from my perspective, to understand if you want to perform some inference or some prediction, is to understand your sample and the difference between a sample and the population. Lastly, you need to be able to define: okay, this is my population, to which I will be inferring; this is my sample, from which I am inferring. These are the topics of this lecture. Let's go for it. First of all, have it clear in your mind what we were doing: until now, we were only working with our sample. We described the sample, we explored the sample, and
whatever conclusions we had, whether the storks are bringing babies or whether there is some useful correlation such as sunglasses and the temperature outside, we were only using these conclusions within our sample. We can't say that this is true everywhere in the world. We just can't say that whenever the temperature rises, wherever in the world, people will buy more sunglasses. We can only stay within our sample. The big thing is coming now: we'll be able to infer something from a sample about
an entire population, such as what I just mentioned: whether it is really true, or what is the probability, that wherever in the world you are, if the temperature rises, more people go and buy sunglasses. This will be the inference. Or we'll be building predictive models, which enable us to make granular predictions about some observation, some individual who is within the population but maybe was not part of our sample. So this will be the predictive part. So as I was saying,
I would like to discuss a few things
within this lecture. First of all, a population. Whenever you are starting some inferential experiment or some predictive exercise, I would strongly recommend that you start off by defining what your population is. I have listed a few questions which you can ask yourself, and these should help you define your population. First of all, you can ask yourself: is my population living, tangible, or intangible? I mean, as I was saying, it's easiest to imagine human populations, such as a population in a city, but you could also have
animal populations. They are still kind of easy to imagine. But also objects: let's say that I'm working for a factory, and this factory has a very large building, or it's a very large facility, and I'm doing research on the light bulbs which are in the buildings. My population could be all of the light bulbs which are in the facility, which are in this factory. I will be drawing a sample and will only be examining a few light bulbs, and the population is then all of the light bulbs in the buildings. So it can also be about objects. Lastly, your population could also be intangible. There were even experiments where the population was all of the roulette wheel spins which happened in Las Vegas, if I'm not mistaken. It doesn't even have to be tangible. Secondly, you can
ask yourself: how is my population defined? I mean, usually if we are talking about industries, we're talking about organizations which have their customers; these are the most common populations that we are defining here. Some sociodemographic characteristics will help us, such as narrowing our customer base by age, maybe by occupation, and so on. Sociodemographic characteristics are the most classical definition. Secondly, we could also have a definition of a population based basically on relevance, such as product ownership. Well, let's say that I am a smartphone manufacturer and I have a couple of smartphone models which I'm producing and selling. The population that I'm interested in is the owners of one smartphone type which I'm selling. So I can also define a population based on business relevance. Thirdly, a very important
question to ask is whether your population is fixed
with respect to time. What do I mean by that? Well, we could have a population which is fixed with respect to time; let's say all the people in a city. I mean, they are changing as time goes, but not within the particular project or use case that we have in mind, such as measuring the average height of all the people in the city. Or you could stumble upon
changing with time. For example, let's say
that I'm interested in the behavior of
customers in the store. I'm only in a grocery store. Now my customers
are going to come by at various times of the day. So a single customer will
arrive on Monday morning, we'll arrive on
Wednesday evening, maybe Friday afternoon. And maybe they will display different behavior at
different times of the day. Maybe they only spoke also different behavior at
different parts of the month. This is what I mean
with respect to time. If this is the case, then it's going to be a bit
troublesome when we will be drawing a sample
from such populations, we need to keep in mind
the aspect of the time. And lastly, this is
a pretty simple one: is the population defined for us? You are going to meet some projects or use cases where, for example, if you are a penguin researcher, it's very straightforward. The penguin population is predefined for you: let's say you are pointed to a certain island where penguins live, and that is the population that you are supposed to work with. Or the population will not be defined for you, and that's when you should be asking all of these questions and really spinning it in your mind: what is your population? This oftentimes happens, for example, when you are a product development analyst. Your organization is about to launch a new product, and you are thinking about what the target group, the population that you are targeting, is going to look like. So in such a case, you need to define the population yourself. So to summarize some key
takeaways: as you can see, there is a lot of freedom that you have when you are defining the population. The important thing is to always have it defined; whatever exercise you are starting, make sure that it's clear what your population is. In the upcoming lecture, we'll continue with a similar thought on a sample. I'm looking forward to seeing you there.
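As a small illustration of the ideas above, here is a sketch of defining a population in code. The customer records and field names are purely hypothetical; the point is that a population can be pinned down both by business relevance (product ownership) and by time (only recently active customers):

```python
from datetime import date

# Hypothetical customer records -- the fields are just an illustration.
customers = [
    {"id": 1, "product": "phone_x", "last_purchase": date(2023, 5, 1)},
    {"id": 2, "product": "phone_y", "last_purchase": date(2023, 6, 12)},
    {"id": 3, "product": "phone_x", "last_purchase": date(2021, 1, 3)},
]

def define_population(records, product, active_since):
    """Population defined by business relevance (who owns this product)
    and pinned with respect to time (only customers active recently)."""
    return [c for c in records
            if c["product"] == product and c["last_purchase"] >= active_since]

population = define_population(customers, "phone_x", date(2023, 1, 1))
print([c["id"] for c in population])  # only customer 1 qualifies -> [1]
```

Changing either argument changes the population itself, which is exactly the freedom (and the responsibility) this lecture describes.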
40. From Sample to Population (2): Hi there. In the previous lecture, we thought about a population and how we can define it. In this lecture, let's continue and think for a bit about a sample. So the crucial question that you should be asking yourself when you are drawing a sample or working with a sample is: does every member of the population have the same chance to end up in the sample? If we are lucky, the answer is yes, and in such a case we have
probability sampling. Let's say that we are
doing some study or an experiment at a
university campus. So we would randomly select students from the campus to be involved in the study. What could such random sampling look like? We would come to the university administrators and ask for anonymous IDs of all of the students that are on the campus. Then we would use a random generator and randomly pick some of the students who are on the campus. This would be a random sample, a probability sample. And if we have such a random sample, such probability sampling, then we are lucky, because it will be easy and straightforward to infer something from the sample to the entire population. Now what would be
the opposite case, some non-probability sampling? Let's say that we don't have access to the administrator's office, and we are interested in which of the students are studying data science, and then we are interested in what their behavior is. Well, we don't know, because we do not have access to the administrator's office; we don't know which students are taking the data science courses. So we would walk around the campus and find one person who is studying data science. And we would ask them: hey, do you know anyone else who is studying data science? This way we would kind of snowball our way to our sample. This is unfortunately non-probability sampling, and it makes things trickier when we are inferring from such a sample to an entire population, because we always have
to ask ourselves, and this is the key
learning from this lecture. If we are employing a
non-probability sampling, always consider
whether your sample is representative
of the population. Previously in the course, we had one example where we drew a non-representative sample. It was the case when we were interested in people's heights, and we were measuring people just outside of a sports club. A lot of basketball players were passing by, and other sports people who are, let's say, taller than the average population. On the contrary, elderly people, who are maybe shorter, were not passing by the sports club. This was a non-representative sample. We cannot generalize from it to an entire population; we cannot infer what the average height in the population is. This is such an
important learning that in the upcoming lecture, we will also have a learning story where we will be building a visual recognition app that is supposed to recognize whether a mushroom is edible or poisonous. And on top of it, you'll have an assignment where I will present a few scenarios, and you will be supposed to judge whether the sample that we have is a probability or non-probability sample. And if it is a non-probability sample, whether it is representative of the population about which we would like to
draw some conclusions. Another thing that I would like to mention when it comes to a sample is that you should also be considering your sample size. Is your sample sufficiently large? I mean, this was more of a problematic question in the past. Nowadays it's not so prevalent in real-world applications, where we usually have hundreds, thousands, or millions of observations. And basically the story goes that the larger your sample, the better inferences you are able to make about the population. Now, just to give you an idea about what we mean when we say sample size: if you would like to infer something about the country I'm coming from, which has 5 million inhabitants, you would need a sample size of approximately 10 thousand observations. So we would really need to collect 10 thousand heights to be able to infer what the average height in this country is. But as I'm saying, sample sizes are not so problematic nowadays. What really matters is to always consider whether you have probability or non-probability sampling, and whether your sample is representative of the population. Now that we have talked about both population and a sample, I think we are
ready to start with the inferential and
predictive approaches. But as I was saying
before we go there, there is a learning story waiting for you, and an assignment. I'm looking forward to seeing you there.
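The campus example above can be sketched in a few lines of Python. The anonymous IDs here are made up; the key point is that `random.sample` draws without replacement and gives every student the same chance of being selected, which is exactly what probability sampling requires:

```python
import random

# Hypothetical anonymous student IDs obtained from the administrator's office
student_ids = list(range(1, 1001))  # 1,000 students on campus

# Probability sampling: every ID has the same chance to end up in the sample
rng = random.Random(7)
sample = rng.sample(student_ids, k=50)

# The draw is without replacement, so all 50 sampled students are distinct
print(len(sample), len(set(sample)))  # 50 50
```

Snowball sampling, by contrast, cannot be written this way at all: each selection depends on whom the previous person happens to know, so the equal-chance property is lost.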
41. Is the mushroom edible?: Hi, and welcome to another lecture in the course Be Aware of Data Science. This time we have a learning story. In one of the previous lectures, we discussed population and a sample, and I was saying that having a representative sample is really important, because if we don't have a representative sample, we will not be able to make good inferences about a population, or we will not be able to build a strong predictive model. So here comes the learning story. It's called 'Is the mushroom edible?'. We have a clear business idea: we want a mobile app with which you can just point your smartphone camera at a mushroom and the app will tell you whether the mushroom is edible or poisonous. Ideally, it would also tell you what kind of a mushroom it is. This will be a standard task for data science and machine
learning capabilities. Now of course, we need
some sort of data on which our machine-learning
model would learn, and it will then be able to distinguish between the edible and the poisonous mushrooms. We have decided to obtain the data by scraping from the internet pictures that people have posted from their mushroom picking ventures. So let's say that someone was out mushroom picking and they found a mushroom; they took a picture and posted it to the Internet for everyone to see. And we have decided to scrape these pictures from the Internet and use them as what we call training data for our machine learning model, for our predictive model. Now, even though we have not told you much about machine learning,
of this learning story. We care about our sample that we have collected and from
which we want to learn. Now, I would
encourage you to stop the video and think about
what's going to be a problem? Are we then going
to be able to build a good predictive model
which we can then use as these visual recognition
app that people can just download and
then point to a mushroom. And the IRB wouldn't
recognize if the mushroom is edible
or poisonous Really, please stop the video and think about the sample that
we have collected. Alright, I presumed that you pause the video
and gave it a try. Now, we're not going to build a good predictive model based on the sample
that we have collected. There will be a problem with the app and the model
that we will build. It will be great at
recognizing edible mushrooms, but rather poor at recognizing
poisonous mushrooms. Why is that? Well, our data that we used for the training where problematic. They were not the
representative of the population in nature. Let's say, let's
simplify things and let's say that
there is a ratio of 50% of edible mushrooms and
50% of poisonous mushroom. So 5050. So for example, you have an equal chance
of stumbling upon an edible mushroom
as you have of stumbling upon a
poisonous mushrooms. But what about the images
that people take that we used as a training data
for our predictive model. People are way less
likely to pick up a poisonous mushrooms and take a picture of it and of
course posted online. This is why the ratio of edible, too poisonous mushrooms in our
data will be, for example, ninety-five percent of pictures will be of edible mushrooms and only 5% of the pictures will
be off poisonous mushrooms. The modal is really
primarily focused at recognizing and being good at recognizing of edible mushrooms, but it will be quite poor at recognizing
poisonous mushrooms. Or in other words, most of the
images that are modal sees during the training process
of edible mushrooms. So I hope that this little learning story provided kind of an
intuitive example behind the sample needs to be representative when it's true
of whatever we are doing, whether we are doing
some statistical tests from which we are
drawing an inference, which we'll do in one of
the upcoming lectures, or we are building a predictive
machine learning model. Always think about whether your sample is representing
the population well, because we might stumble upon unfortunate scenarios
like in this case, when we didn't represent the
correct ratio of edible, too poisonous mushrooms
in our sample. I'm looking forward to see
you in the upcoming lectures.
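The problem with the scraped mushroom pictures can be made concrete with a tiny simulation (the 95/5 split and the "model" here are illustrative, not a real classifier): on such an unrepresentative sample, a model that effectively ignores the rare class still scores a high accuracy while catching zero poisonous mushrooms.

```python
import random

rng = random.Random(3)
# Scraped training data: 95% edible, 5% poisonous (not the 50/50 of nature)
labels = ["edible"] * 950 + ["poisonous"] * 50
rng.shuffle(labels)

# A lazy "model" that simply always predicts the majority class
predictions = ["edible" for _ in labels]

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
poisonous_found = sum(p == "poisonous" and y == "poisonous"
                      for p, y in zip(predictions, labels))
print(f"accuracy: {accuracy:.0%}, poisonous mushrooms caught: {poisonous_found}")
# prints: accuracy: 95%, poisonous mushrooms caught: 0
```

This is why looking at accuracy alone on a skewed sample can be dangerously misleading: the one class we most need to catch is exactly the one the data under-represents.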
42. Inference: Experiment Setup: Hi and a warm welcome to another lecture in the course Be Aware of Data Science. With this lecture, we are kicking off a chapter in which we study inference and predictive models. And in this particular lecture, we will focus on
inference and we will set up a little
inferential experiment, which will then
continue working on in the upcoming videos. But before we get there, let me remind us of where we are right now and what are
we really trying to do? Basically, within this chapter, we're focused on the
methods through which we can learn on a
sample and then infer something
about the rest of the population from which
the sample is coming. Or alternatively, we can build a predictive model that
we will be able to do granular predictions about individual observations
within our population. Though, as I'm saying, we get to predictive
methods later. Now we are focused on inference. We will learn from
a sample and infer, generalize to the rest
of the population. Let's set up the
inferential experiment. Imagine that we are working for a company which is of
course making some sales. And these sales vary in sale size, so in how much the sale was for. As you can see, this is our past historical
data from the sales. You can see that
occasionally we are unfortunate and the
size of the sale we make is 80 or €90. Most of the time we are able to make sales for approximately €100, and occasionally we are lucky and the sales are even for 110 or €120. Now the
business colleagues are coming to us
and they're saying, Hey, we have this belief, we have this hypothesis
that if we manage to increase the
average sale size, it will really
benefit our business. We're looking at the
visualization that we have. And basically what
this goal set by our business colleagues
means is that we will attempt to move the
mean cell size to the right. We want to increase
the average cell size. So this will be the goal of
our inferential experiment. Now having the experiment defined from a
business perspective, let's translate it to
data science terminology. Basically, what we will be
doing is that we'll be working with two populations, or we'll be thinking about two populations. Now I will take it slowly, as it can be a bit confusing if you're seeing this for the first time. There is a population which is defined by our old
sales strategy. It's our customers that
we are having right now. We're selling to them through
our old sales strategy. And we need to
imagine that there is potentially
another population. It's the same humans, it's the same customers, but they will be defined
by our new sales strategy. And now we can go back a few videos or a few lectures, where I was talking about how we define populations. It doesn't only have to be a set of individuals or a set of humans; we can incorporate some
business logic into defining this population
of our customers. Even though it's the same humans, the ones on the left side are defined by the old sales strategy, and the ones on the right side are defined by the new sales strategy. Basically, what our business colleagues attempt to do is change the population. And now they're hoping that the new population, defined by the new sales strategy, will have a higher average sale size. You may be asking: why don't we just do it? Why don't we just change
the sales strategy and hope that it increases the average sale size? Well, that might not be a good idea, because it's a hypothesis: we don't know whether the new sales strategy will really increase the average sale size. It might very well happen that the effect will be the opposite, that maybe our customers will think that we are bothering them with the new sales strategy, and we would instead lower the average sale size. So we shouldn't just take the new sales strategy and deploy it to the entire population; we should be smarter about it. And that's where the
power of inference and inferential experiment
will come to play. So what we will do is
that we will think about collecting samples from both of these populations. We want to sample from the old sales strategy and we want to sample from the new sales strategy. How can this be translated to real terms, to the real world? Well, we will just take our old sales strategy and attempt to sell to maybe 15 of our customers using the old sales strategy. For the new sales strategy, we take the new sales strategy and attempt to sell to 15 customers using this new sales strategy. We will use maybe some random sampling and randomly pick these two groups of 15 customers. And that will provide us with the necessary data that then allows us to compare these two populations. We will then be looking
at the samples, observing whether there is a difference in the
average sale size. I mean, if our business colleagues are right, then we will be hoping that the sample on the right side, which is defined by the new sales strategy, would have a higher average sale size. But that's not everything; we cannot just stop at comparing the two samples. We will also use a statistical model, a statistical test from the world of inferential statistics, where we will check whether the difference that we're seeing between the two samples is generalizable to the entire population. So there will be two steps: we collect the samples,
compare their means, and then as a second step, we use a test from
inferential statistics. It will be a lot of fun. I hope this video made clear how we're setting up our inferential experiment; in the upcoming videos we get to solving it. Looking forward to seeing you there.
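The sampling step of the experiment can be sketched as follows. The population means (€100 old, €104 new) and the spread are invented for the simulation; in reality we would not know them, we would only ever see the two samples of 15 sales each:

```python
import random
import statistics

rng = random.Random(11)

def draw_sample(mean_sale, n, rng, spread=10.0):
    """Simulate selling to n randomly picked customers.

    mean_sale and spread describe the (hidden) population; each draw
    is one observed sale size in euros.
    """
    return [rng.gauss(mean_sale, spread) for _ in range(n)]

# Hidden population parameters -- unknown to us in a real experiment
old_sample = draw_sample(100.0, 15, rng)   # old sales strategy
new_sample = draw_sample(104.0, 15, rng)   # new sales strategy

print(round(statistics.mean(old_sample), 2),
      round(statistics.mean(new_sample), 2))
```

With samples this small, the two sample means can easily land closer together or further apart than the true population means, which is exactly why the next lecture brings in a statistical test rather than trusting the raw comparison.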
43. Inference: Statistical Test: Welcome, and let's together conduct our little inferential experiment. So let's just briefly recap what we will do. We will attempt to sell to a small portion of our customers using the old sales strategy, and then we'll attempt to sell to a small portion of our customers with the new sales strategy. Then we will be comparing these two samples, observing whether we managed to increase the average sale size. And we will also rely on a test from inferential statistics. All right, let's do it. So here are the results. First of all, I need to start
off with what we don't know. We have managed to influence the population, or change the population, the way we intended: you can see that originally, with the old sales strategy, the mean sale size was €100, as we knew already; with the new sales strategy, the mean sale size is €104. However, we don't know this. We have not measured the entire population, neither the old one nor the new one. I'm just putting it here for our reference and for our comparison. What we do have, however, are our two samples. You can see that our sample with the old sales strategy contains 15 observations. So we have attempted to sell to 15 customers with the old sales strategy, and we measure the sample mean, which is €99.689. With the new sales strategy, we have also attempted to sell to 15 customers, and the sample mean is €103.07. If you would like to see, I have also visualized these two samples here on the left side. Now that we have
these two samples, it's time to ask the big question: did the new strategy increase the average sale size? I mean, if we look just at the samples, our answer would be definitely yes; this is approximately a 3% difference. However, this would be a wrong conclusion. It is only true within our sample. We are just comparing the samples. We now need to face the results of our experiment with a statistical test to verify whether this difference in means is generalizable to the entire population. What is this statistical test going to help us with? Well, it's going to consider a few things, and it's going to report to us, or indicate to us, whether the change or the difference in means could have been caused by pure randomness. Or rather, it's going to indicate the probability with which the difference in means could have been caused by randomness. It's going to consider the sample size. Of course, the smaller the sample we have, the higher the probability that we have just encountered some random noise; we ideally want to have a larger sample size. So this will be the
first consideration of this statistical test. The second consideration will be a measure of dispersion. I mean, even though we are comparing a measure of central tendency, or a measure of position, which is the mean, we should also be looking at the dispersion of our two samples. When you look at these two small hills, these two distributions that we have, they are overlapping. Now, what if they were overlapping just very little? So we would have one narrow distribution over here which is just slightly overlapping on the side with the other distribution, or vice versa. A worse example would be if these distributions were very wide: the dispersion, the spread, is large and they are overlapping quite a lot. In such a case, if they are overlapping a lot because they are wide, it might really be the case that the difference is just caused by pure randomness. Thus, the statistical test is also going to consider the spread and not just a measure of central tendency. Now, this statistical
test that we will perform is called the t-test. Do you need to remember that we are using a t-test for this particular thing? Well, not quite. What matters more is that you focus on the intuition: whenever we are facing a task where we have defined ourselves a hypothesis, just like we did in the previous video, and we collect the samples and would like to compare them, what is key for you to remember is that you can reach out to the world of inferential statistics and there will most likely be a test available for you. I just picked one of these tests, one of these statistical models, which were defined years, decades, or maybe even centuries ago and have been proven to be useful. So we're just reaching out to an appropriate test and reusing it for our project or our use case. Now back to our t-test. When we plug the data from our two samples into this t-test, it's going to report
something like this: the probability that, given a chance model, a result as extreme as the observed result could occur is 12%. Now that's a really tricky
formulation. Here's the thing. Statisticians, they are
very careful at how they are interpreting
the results from these statistical tests. And rightfully so,
because many people have this misconception that
if you conduct these tests, it says that there is a
definite answer, that yes, there is a difference
between the means or there isn't a difference between the means. Now, that's not right. These tests are just indicating what the probability is that the result could be caused by randomness. Anyway, the number that we
have gotten back is 12%. Well, that's kind
of a high chance that the difference between the means in the samples
that we are seeing is just caused by randomness. Usually we are hoping for, let's say, 1%, 5 percent, or even less, to be really certain that the difference in means was caused by us, by our new sales strategy, and not just by some randomness. So we have actually failed; we cannot conclude from this experiment. Even though we are seeing within the samples that there is a difference in means, the statistical test is telling us that we cannot
rely on what we have observed within
these samples to be generalizable to an
entire population. Now, this is unfortunate. Can we do something about it? Of course, there is something
that we can do about it and we will do it in
the upcoming lecture. Just to sum up the lecture, we have collected
the two samples, we have compared them, and then we have faced them with a statistical test,
in this case, a t-test, which says that there is a rather large probability that the difference
in means which we are observing is just
caused by randomness.
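The probability the t-test reports can be illustrated with a small permutation test: we repeatedly shuffle the pooled values and ask how often a mean difference at least as extreme as the observed one arises by chance alone. This is a simplified stand-in for the t-test, not its exact computation, and the sample values below are invented for the demonstration.

```python
import random
import statistics

def permutation_p_value(sample_a, sample_b, n_shuffles=10_000, seed=42):
    """Estimate the probability that, under a pure-chance model, a mean
    difference as extreme as the observed one could occur."""
    observed = abs(statistics.mean(sample_a) - statistics.mean(sample_b))
    pooled = list(sample_a) + list(sample_b)
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        a, b = pooled[:len(sample_a)], pooled[len(sample_a):]
        if abs(statistics.mean(a) - statistics.mean(b)) >= observed:
            extreme += 1
    return extreme / n_shuffles

# Hypothetical sale sizes (euros) under the old and the new sales strategy.
old_strategy = [98, 101, 95, 104, 99, 102, 97, 100, 103, 96, 101, 99, 98, 102, 100]
new_strategy = [103, 99, 106, 101, 104, 98, 107, 102, 105, 100, 103, 106, 101, 104, 102]
p = permutation_p_value(old_strategy, new_strategy)
```

A large probability (such as the 12% in the lecture) means the observed difference could plausibly be pure randomness; only a small one lets us trust that the strategies really differ.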
44. Inference: Solving and Summarising: Hi there. Our last lecture
was sort of unfortunate. Even though we conducted our experiment beautifully, it ended in an unfortunate way, because we cannot come back to our business colleagues and say: hey, the change in sales strategy which you have performed is really influencing our customer base; it can really change the population of our customers and increase the average sale size. So we have failed to prove it. In this lecture, we will
of course continue with this example, as it's not the end of the story, and we can address it. As we were saying, one of the things could be the sample size. What if we addressed the issue with larger samples? Because here's the thing: as we saw in the last lecture, there is a difference between the population means, it's just not showing up sufficiently within our samples. So instead of collecting 15 observations, we would collect 80 observations. And now we are seeing the difference between the samples, which again is approximately 3 to 4%. But here's the thing. If we plug in this
data to our t-test, it's going to report that
the probability that, given a chance model, a result as extreme as the observed result could occur is less than 0.01%. So now, with a larger amount of data available, the statistical test can say that there is a pretty low probability that the difference in means is caused by randomness. So this time we have actually succeeded with our experiment and with our statistical test. The only thing that
was really needed was to collect larger samples, which would showcase
the difference between the populations better. The second thing that we
can do is that we will address it via a more
aggressive sales strategy. What do I mean by it? Well, let's say that the business colleagues are coming to us and saying: we're thinking about calling our customers every second day. And we would say: maybe not, maybe we should try to call them every day, which is of course way more aggressive. But maybe it will be changing the population
much, much more. There will be much
larger difference between the two samples. So what's really happening is that, because the new sales strategy is a more aggressive one, its mean sale size is €110, unlike before when we had €104. The populations are further apart. What happened now is that we
collected 15 observations, just like in our previous case. But now the populations from which the samples are coming are further away; they are more separated. You can also see it in these distributions
on the left side. Again, if we take even the smaller samples and plug them into our statistical test, it is again going to report 0.01% or less. So again, we could come back to them and confirm: yeah, this change which you are doing, it seems to be making sense. It really seems to be
changing the populations. Alright, so let me
summarize what we just did. We had really three attempts at collecting the samples and comparing the sales strategies. Our first attempt was kind of unfortunate: we collected 15 observations, and we of course don't know the true population means, which were €100 and €104. The statistical
test concluded that based on the data
that we provided, the difference in means is not generalizable to an
entire population. So we cannot really say that our new sales strategy will be rapidly changing
the populations. In our second attempt, we have collected
larger samples: 80 observations. And this is helpful: it provides more evidence to the statistical test that there is a difference in the means. The test isn't saying a definite yes; it's more like: maybe, I agree, you can rely on it and work with it further. We can come back to our business colleagues and say: yeah, the new sales strategy seems to be working; you can now deploy it to the entire customer base, and thanks to it, we can expect increased revenues. In our third attempt, we have decided to use a more
aggressive sales strategy. You can see that the
difference in means is ten, so from €100 to €110. And now this difference in population means is so large that even if we drew smaller samples, such as 15 observations per sample, the statistical test is reporting: yes, maybe you can rely on this difference. So we will again come back
and we can say, yeah, there seems to be a
change in the population. Now, why am I saying 'maybe' and not 'definitely yes'? It's as I was saying: statisticians are very careful with how they interpret the results of a statistical test. It's never certain, it's just a test based on the provided data, but we can certainly rely on it. What can we take away from attempts two and three? As I was saying, we can now generalize one thing: if the new sales strategy is applied, the average sale size might increase, which will be of
course beneficial to our business colleagues
as they can now deploy the new sales strategy
to an entire population. But be mindful: we are only managing to generalize one thing, about the sales strategy, as this is the difference that we were testing. In the upcoming lectures, we are going to continue to learn from the sample, but we will continue in a different way. We will be building
predictive models. This now concludes our three little lectures about statistical inference. I thank you for being part of these lectures.
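Why did 80 observations succeed where 15 failed? The chance wobble of a difference between two sample means shrinks with the square root of the sample size, so the same €4 gap becomes far more convincing with more data. The spread value of €10 below is an assumption chosen purely for illustration.

```python
import math

def se_of_difference(sd, n):
    """Standard error of the difference between two sample means,
    assuming both samples have n observations and spread sd."""
    return sd * math.sqrt(2 / n)

sd = 10.0    # assumed spread of individual sale sizes, in euros
gap = 4.0    # observed difference between the sample means (104 - 100)

se_small = se_of_difference(sd, 15)  # chance wobble with 15 observations
se_large = se_of_difference(sd, 80)  # chance wobble with 80 observations
```

With 15 observations the wobble is about €3.65, so a €4 gap is barely one standard error and could easily be randomness; with 80 it is about €1.58, and the same gap is roughly two and a half standard errors, which is why the test now reports a tiny probability.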
45. The Function of Nature: Hi, and welcome to another
lecture in the course Be aware of Data Science. My name is Robert, and we have a lecture ahead of us called The Function of Nature. Even though this lecture has a strange name, what we are basically starting with is the predictive approach of data science. We will attempt to build
within our population. And the function of
nature is basically my view on predictive
modelling. So here's the thing: are data scientists some sort of fortune tellers, claiming to be able to predict the future? Fortunately, no. We're just simplifying the world around us, as depicted in this picture. Basically, data
scientists believe that many processes in the
world around them are happening through a specific mapping function
provided by nature. This mapping function, as you can see in the picture, has some inputs, and then it has a certain output or outcome. Now, is the term 'function' ringing some bells in your mind? Certainly it should. We have the same
notion of functions in mathematics as
well as programming. So it is a box which takes inputs and then it
produces an output. For example, in mathematics, we could have the function x squared, where the input goes in place of our x and is squared. And let's say, if the input is now 2, the output would be 4. Within programming, we could just provide a certain string, paste it together with another string, and we would have a result, such as the string 'the digit you typed is number five'. The same notion of functions can be found in mathematics as well as programming. Now let's come back to
predictive modeling. Let's talk about the
concrete example: imagine your desire to purchase an ice cream. Is it a completely random process that is not influenced by anything, whether you go and buy an ice cream right now? I'm pretty sure that it is not a random process. If the weather is warm,
he or she would like an ice cream or if you just
saw on ice cream commercial, you might be more likely
to purchase an ice cream. Similarly, some
factors can negatively influence you, and you will be less likely to go, such as being on a diet or having a sore throat. Inside of your mind, there is a complex function designed by nature, which takes into account all of these positive and negative influences and makes the final decision whether you now stand up, go, and purchase an ice cream. Now before we proceed, I would like to
mention one thing. Until now we have talked about individual patterns or
multiple individual patterns. You could say that we
are hoping to learn from a sample and generalize
to the population. Now, why did we change our view into this
mapping function? It is because one such
mapping function is consisting of multiple of
these individual patterns. They are combined at the same time into the
same mapping function, mapping the input
into the output. For example, it is never the
case that only the weather is influencing whether you will go and purchase an ice cream. It will be a multitude of patterns, like we can see on the picture. And then, as I said,
patterns into this little box, into this mapping function. Alright, let's assume
that we have grasp these basic idea of
a mapping function. What do we do with it? Well, we will attempt estimate or approximate
this function. We want to construct
a little Box2D, which if we provide
the irrelevant, it puts, it will
as accurately as possible map them
to the outcome. If we manage to do that, we can then reuse
this little box for the purpose that
we have in mind. For example, we can predict in the morning how many customers will arrive to our
ice cream stand and purchase the ice cream. Thanks to an
accurate prediction, we can then stock appropriately and have an appropriate amount of personnel in place. So this is really what
someone means when they say, I am going to build
a predictive model: they see a function, they believe that they can estimate or approximate it, and they have a clear benefit in doing so. Like with our ice cream stand: we see a clear benefit in why we should be estimating or approximating this function. In the upcoming lectures,
what's inside of this little gray box
that we are having. Clearly we can use
various tools, a protease and so on. However, before we go there, I will not take a little break and talk about Probably what's the most important aspect of a predictive
modelling exercise. The most important
thing in the wall, estimation and approximation
are the inputs. If we are not using
inputs that are not well representative of the
mapping function at hand, we can use the best predictive
method which is out there, but we will still fail. Inputs which we are using are
the most important thing. To emphasize the importance of high-quality inputs, we are going, after this lecture, to have an assignment where you will creatively think about the inputs that we can collect data about and then use for the estimation or approximation of this mapping function. Afterwards, we'll dig deeper and examine how exactly we can build this box which estimates or approximates the function of nature. So this is what I meant when I say the
function of nature. I think it's the easiest way to imagine what we mean when we say we are going to build a predictive model. And in the upcoming lectures, we will dig deeper into that. I'm looking forward to seeing you there.
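The function idea from this lecture can be sketched in code. The first two functions mirror the mathematics and string examples; the third is a toy stand-in for nature's mapping function, with influences and weights that are entirely invented for illustration.

```python
def square(x):
    # mathematical function: the input takes the place of x and is squared
    return x * x  # square(2) returns 4

def typed_digit_message(digit):
    # programming function: paste two strings together into a result
    return "The digit you typed is number " + str(digit)

def wants_ice_cream(weather_warm, partner_asks, saw_commercial, on_diet, sore_throat):
    """Toy mapping function: combine positive and negative influences
    into a single decision, as nature's function is imagined to do."""
    score = 0
    score += 2 if weather_warm else 0    # positive influence
    score += 1 if partner_asks else 0    # positive influence
    score += 1 if saw_commercial else 0  # positive influence
    score -= 2 if on_diet else 0         # negative influence
    score -= 3 if sore_throat else 0     # negative influence
    return score > 0
```

Warm weather plus a partner's request outweighs nothing negative, so `wants_ice_cream(True, True, False, False, False)` comes out true, while a diet and a sore throat together push the decision to no.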
46. When do we need a predictive model?: Hi, I hope you enjoyed the little assignment that we had with identifying relevant inputs for the predictive model. Now it is time to start building predictive models. But before we go there, I will now give this brief lecture and ask ourselves a question: when do we really need
a predictive model? So, there is a lot of talk around the predictive part of data science, and there is good reason for it. I mean, it has a lot of
various organizations. However, before I
come to explain to you how you are building
a predictive model, I want to highlight
one important thing. You should never jump straight
into predictive modelling. You should first consider the simpler approaches that we already discussed: the descriptive approach, the exploratory approach, and the inferential approach. This is for multiple reasons. First of all, these tend to be simpler, and it might just happen that they will suffice for the project or the use
case that we have in mind. For example, let's
say that you are facing a customer churn problem. Your business colleague comes to you and says: hey buddy, we need a predictive model. It would predict which customer has a high potential to churn and generate for me a signal. With such a signal, I can then send an e-mail to the customer with a special offer, and we can prevent this customer from churning. I mean, of course
you can just go ahead and build a
predictive model. Or you take it step-by-step. You will start by
describing your data and understanding what is really going on within it. Secondly, you explore and search for patterns, and perhaps you stumble upon an interesting churning pattern: the customers which have churned from your customer base in recent months have, just before they churned, stumbled upon a technical issue within your product. And you can see this in your application logs. So there is a solution to your churning problem. And of course it is way
little bit of data exploration and pattern
recognition exercise as opposed to building of a complex predictive
machine learning model. So the simpler solutions
might already suffice. And if we are taking
it step-by-step, we are picking the
simplest solution that will answer the
problem at hand. Secondly, you really should only jump to a predictive approach if you need granular observation
based predictions. What do I mean by it? Well, let's go back to our
example of an ice cream stand. If I want to understand how the weather influences my sales, I do not need a
predictive model. I only need a pattern
to understand it. If I need to understand which ice creams are
frequently bought together, I also do not need
a predictive model. On the other hand,
if I need to know a fairly accurate estimation of how many customers
will arrive to my ice cream stand
on any given day. I need a predictive approach. I need to know on
a day granularity, how many customers can I
expect to come to my stand? Only resort to a predictive approach if you need granular, observation-based predictions. So with this brief lecture,
shouldn't jump directly into predictive modelling and
why we shouldn't disregard the simpler approaches that we have discussed until now. I'm looking forward to seeing you in the upcoming lectures.
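The point of this lecture can be made concrete with a tiny comparison. An aggregate pattern, such as average sales per weekday, answers the descriptive question without any predictive model; the invented records below show how little is needed for that.

```python
from collections import defaultdict

# Hypothetical daily records for the ice cream stand: (weekday, units sold).
sales = [("Mon", 40), ("Sat", 110), ("Sat", 130), ("Mon", 50), ("Sun", 120)]

# Descriptive answer: average units per weekday; a pattern, not a prediction.
by_day = defaultdict(list)
for day, units in sales:
    by_day[day].append(units)
averages = {day: sum(u) / len(u) for day, u in by_day.items()}
```

If knowing that Saturdays average 120 units answers the business question, we are done. Only when we need a number for one specific upcoming day, folding in weather, offers, and more, does a granular, observation-based predictive model become necessary.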
47. Building a Predictive Model: It is finally time that we
build a predictive model. Hi, and welcome to another
lecture in the course, be aware of data science. And in this lecture we
continue with the exploration of the predictive
approach of data science. In one of the previous lectures, we discussed the function of nature, this mapping function which we are attempting to approximate or estimate. If we manage to do so, if we manage to build this gray box over here, it might be super useful for us. For example, we're working with the case that we are the owner of an ice cream stand. If we have this box, we are able to provide it with the relevant inputs, and the predictive model will make a prediction about whether or not a certain individual comes and purchases an ice cream, or how many people will make the decision to come and purchase an ice cream. To build this box, we need to break it down
into smaller components. We will walk through
these step-by-step. We will be talking about the inputs. Then we are assigning weights or importances to the inputs. Afterwards we will summarize, and having them summarized, we can make a final prediction. First of all, let's
talk about the inputs. By the way, I hope that you have enjoyed the assignment where you could creatively think about what inputs could be relevant for the project that we were solving. Now, something similar is happening within every project and every use case. It's usually our domain knowledge telling us what inputs we would like to have; this is what we discussed already. We definitely have some
ideas what could be driving the decisions of whether people go and purchase an ice cream. On the other hand,
on the downside, there is limited data availability. For example, we will certainly not have data points about whether a customer is having a sore throat or not. I mean, it would be a great and useful data point; we just don't know whether some customer has a sore throat, which is of course negatively influencing his or her decision to come and buy an ice cream. We, as owners, know that of course we want a predictive model that has as much predictive power as possible. Oftentimes you will hear that how successful we are is primarily dependent on the availability of relevant inputs, in other words, on how many of these significant inputs, and in what quality, we can grasp and collect data about. If we really only have data available about what is
the day of the week, our model will not be perfectly
accurate or far from it. On the other hand, if we have very rich datasets that cover the weather, that
cover, historically, which offers we made to the customers, and much, much more, then we can create a powerful predictive model. We will now put together the set of inputs that are available. Of course, the assumption that all of these are really relevant for the mapping function at hand is not exactly correct. We would only select a subset of these factors for the estimation or approximation. We can either do this manually, or we can use the statistical tools which are available. We would call this
a feature selection or feature elimination process. Before we now proceed, I would like to
mention one thing. Hopefully now you can appreciate why some mapping functions are inherently trickier to estimate or approximate than others. For example, it is rather difficult to estimate your political views, as these are shaped by demographic, social, and economic factors; there are a lot of relevant inputs, and the data scientist might simply not have the data about them. On the other hand, it could be way simpler to estimate whether you will click on an online promotion, as there is a load of digital data available about you and your online behavior. So hopefully now you can appreciate why some mapping functions are inherently trickier to estimate than others. Having the inputs sorted
out, let's proceed to the importances or the weights of the inputs. Not every input is equally important. For example, weather and day of the week might be way more important as inputs as opposed to, for example, what offer we are making. Whether we have a 5 or 10% discount doesn't matter so much; it matters way more whether it's sunny weather outside. So not every input is
equally important. The inputs also have
different effects. Some have a positive
effect and are increasing the probability
that the person comes and purchases the ice cream
while other are having negative effects that are
lowering the probability that a person comes in
purchases and ice cream. The question which you
might now be having, how do we assign these important
cities or these weights? We can go in two ways. First of all, we can
assign them manually. This is what we would call a heuristic or an
expert based model. So for example, we would say it's Saturday
and on Saturdays we are expecting that 100 customers arrive to
our ice cream stand. If it is Saturday and on
top of it is sunny weather, we would expect
additional 50 customers to arrive to our
ice cream stand. So in total, we would
expect 150 customers. This is an example of a
heuristic expert based model. Going manual is of
course not the only way. The importances can be learned by an algorithm. This is where machine learning comes into play. Remember what we said about machine learning: it is capable of learning patterns from historical data by itself. Well, the weights, the importances, are the learned patterns; that's what the machine learning model is coming up with, and we do not have to assign them manually. So hopefully now you can appreciate why machine learning gained so much popularity in recent years. Of course, our datasets
are fairly large; we are oftentimes working with dozens, hundreds, or even thousands of inputs, and going through these and assigning the weights, the importances, manually could be a very tedious exercise. Instead, we can rely on the machine learning model, which is capable of automatically assigning these importances or weights. So now we have the inputs sorted out and also the weights assigned. The third step that we need to do is to summarize. Basically, we need to take a holistic look at all of our inputs and the importances that we assigned, to make the final decision. For example, the weather
is sunny and warm, we do not have any special offer at our ice cream stand, and it is a weekend. Thus, we can make a
final prediction that a certain person will come
and purchase an ice cream. Or we can have a different
setup of our predictive model. And it would be rather
predicting how many people make the decision of coming
and purchasing an ice cream. This last step is really
a summarization of all the inputs with their respective weights, with their respective importances. Let's sum up this lecture a little bit. We just went through all of the components that are needed in order for us to make a predictive model. You will have an assignment coming up where you will have the chance to be the predictive model yourself: you will be coming up with these importances and with these weights for a particular use case from my past that I worked on. I hope this assignment will give you more intuition for how the importances are assigned. But for now, you might
have heard about it. There are various types of
machine learning algorithms, decision trees, linear models, random forests, neural networks. They all work in a
slightly different way. But if you want to intuitively imagine what they are doing, well, it's what we just discussed: how we build a predictive model. I'm pretty sure you can keep this intuition in your mind and imagine well what the machine learning model is doing when we are building one. Alright, then the final question for this lecture is: what do we do now? How can we utilize this predictive model that we have just created? So we have our little box. Tomorrow morning, when we are deciding how much stock, how much ice cream, we make for the day and how many of our employees we invite to be at our ice cream stand, we can use our predictive model. It is sunny, it is Saturday, and we are offering
a 30% discount. We take these values
and we provide them as an input to the predictive
model that we have created. And it's going to provide
us with a prediction that 170 customers will arrive
to our ice cream stand. So this is how we will
use our predictive model. And I have to reiterate two statements. Can we expect perfectly accurate predictions? No, we cannot. We have created a model, a simplification of the phenomenon that we are studying. Our predictions will not be perfectly accurate. It will not happen that exactly 170 customers come to our store; it can be a few more or a few fewer. But we are hoping that our predictions are accurate enough so that they create a business benefit for us when it comes, for example, to how much ice cream we should make and how many of our employees should arrive at the stand. And the second statement concerns causation. Can we claim that the inputs
are causing the outputs? Certainly not. We have just discovered some form of correlation. We have mapped the inputs into the output, but we cannot claim that we have discovered causation; we just have a useful estimation of a function which was provided to us by nature. All right, so this is everything
that I wanted to say. I hope you now have a confidence that you can build
a predictive model, because in the
upcoming assignment, I will ask you to do so.
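The three components of this lecture can be put together in one small sketch: inputs, their importances, and a summarization step. The weights here are assigned by hand, playing the role of either an expert's heuristic or of what a machine learning algorithm would learn from historical data; every number is invented for illustration.

```python
def predict_customers(inputs, weights, baseline=50):
    """Step 3: summarize the weighted inputs into a final prediction."""
    return baseline + sum(weights[name] * value for name, value in inputs.items())

# Step 1: the inputs we managed to collect data about (1 = yes, 0 = no).
inputs = {"is_saturday": 1, "is_sunny": 1, "discount_30pct": 1}

# Step 2: the importances; positive weights raise the prediction,
# negative ones would lower it.
weights = {"is_saturday": 50, "is_sunny": 50, "discount_30pct": 20}

# Step 3: summarize, predicting tomorrow's customer count.
prediction = predict_customers(inputs, weights)  # 50 + 50 + 50 + 20 = 170
```

Feeding it tomorrow's values, a sunny Saturday with a 30% discount, yields the 170-customer prediction used in the lecture; a model learned by an algorithm would differ only in where the weights come from.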
48. Predictive Model Types: Hi, and welcome to another
lecture in the course. Be aware of data science. We have already
learned the basics about constructing
a predictive model. There are various types of
these predictive models, and it really depends on the application which
we decide to go for, the problem that we're solving. So in this lecture, I
would like to discuss some of these predictive model types. The very first
distinction between predictive models that we should be making is whether we have a target feature available or not. What do I mean by
target feature? Well, this is the data about the outcome that we
are interested in. The most common scenario is when the target
feature is available. This is what we
call a supervised learning or a
supervised problem. The example that
we're solving with an ice cream stand of estimating how many
customers will make the decision to arrive
to our store was an example of a supervised
learning problem. We had the data from the history of how many
customers arrive to our store on any given day for the period that we're
collecting the data. Now if we are within a supervised realm or if we are facing a
supervised problem, this is usually the
simpler scenario because we have the
supervisor available. What do I mean by a supervisor? Well, the target feature. And it's very important to have the data about the outcome which we are trying to predict. Because remember, we are trying to learn some patterns, either ourselves or the machine learning model that we are using. Now, whatever pattern our machine learning model picks up, it can check against the outcome; it can check against the target feature whether this pattern is feasible. For example, the machine learning model is picking up that on Saturdays more customers are coming to our ice cream stand. It can then verify it against the historical data about the Saturdays and see if this pattern is feasible and whether the machine learning model should keep it; it will then move forward and further search for different patterns. Having a target feature is
a simple scenario for us. We are then within the realm
of supervised learning. A more troublesome scenario is here on the right side. This is the realm of unsupervised learning, where we would not have a target feature available. Now, for a second, we have to forget this little setup that we have created in one of the previous lectures, because it wouldn't be overly helpful; here we are facing an unsupervised learning problem, where the target feature is not available. Here, the problems
are more about grouping or clustering; you might have heard about these. There is also anomaly detection. What am I talking about? Imagine that you have your customer base. Now, you don't have a target feature, nothing in particular that you are interested in predicting. You can still extract some valuable patterns and some valuable information from these data. For example, you might attempt to group your customers into some clusters of similar behavior, of similar socio-demographic characteristics. What you can also do is that
you can try to discover customers which are anomalous
with their behavior. So here we would have
projects and the use cases around clustering or
anomaly detection. And as I'm saying, we would need a slightly different
model types then the ones that we are using
for supervised learning. This is the first distinction. We have supervised learning
and unsupervised learning. Nowadays most of the
use cases are within the realm of supervised learning
because luckily for us, we still have a lot of untapped potential
within our dataset. For example, your
business colleagues might constantly be coming up with interesting
target features which are worth predicting. Nowadays we see a lot of use cases within the
supervised realm, but there are some who predict that we will all kind of run out of target features which are worth predicting. And in the upcoming years, we will see a rise in unsupervised learning. Another distinction
that we should be making is whether we
are interested in creating a predictive model
or a descriptive one. Both of these will follow the very same principles or the approach that
we have discussed, but they will aim to
achieve different things. Which of these two we pick will, of course, depend on the application or the project. What you see most of the time is a predictive model. These are aiming to make predictions as accurate as possible. So, for example, if we really want to have accurate predictions about how many customers will arrive at our ice cream stand, we will include a lot of inputs, a load of relevant data, basically as much as possible. And we will use a rather complex model type, because we really want to identify every subtle pattern in our data and increase the predictive performance of our model. Another option or
use of the model that we have is a
descriptive model. I have it here on
the right side. Here, having a high predictive power is not the prime concern; we are rather interested in the patterns that the model learns. We might want to do this because we want to learn something about the decision-making of our customers, while we're not so concerned about the predictive accuracy. When building such a model, we would include a lower number of inputs, and maybe we would also opt for a simpler, less complex model type. Now, you might be meeting
model for the first time. So let me elaborate
on it for a bit. How are we trying to look
inside of our model? For example, we are
interested in the effect that various values of our
inputs have on the outcome. I printed out two examples. Firstly, we can see the
effect that temperature has on the number of customers
that arrive to the store. As the temperature increases, more and more customers
are coming into the store. However, when the temperature rises beyond a certain point, we can see that the
effect is opposite. It's becoming too hot for
the people to be outside. If we are increasing the
temperature beyond that point, the number of customers who
arrive to the store lowers. Secondly, we can see the
effect that discount in our offer makes on the number of customers that
arrive at the store. You can see that having a small discount does not increase the number of customers in our store much. But as you can see, beyond a certain point, let's say at least a 30% discount, we can positively
influence the number of customers who
arrive to our store. So this would be an
example of how we might look inside of a
predictive model, inside of a machine-learning
model that was learning on historical data, and extract some patterns from it, with which we expand our domain knowledge. Now, you might be thinking
why this distinction between a predictive and
descriptive model even exists. Can't we aim for both high predictive accuracy and the ability to interpret the model and describe the phenomenon? Well, not quite. There is a certain trade-off
between these two. We can increase the
predictive power of our model by increasing
its complexity. Complexity could be
increased, for example, by including a lot of inputs. Now unfortunately, if we increase the
complexity like this, the possibility of looking
inside the model and describing what is happening
inside of it is lowered. It becomes increasingly difficult to pinpoint which input is having what sort of effect on the outcome; thus, achieving both of these is not exactly possible. We always have to pick whether we are aiming for higher predictive accuracy or rather a descriptive model. And this brings us to the last distinction that we should be making, with regard to the complexity of the predictive model. Basically, complex models can find and utilize more
complex patterns, thus having a higher
predictive power. On the other hand, the more
complex model you build, the more considerations
there could be during its development and maintenance. I have now generalized
predictive models into three categories with
regards to their complexity. We can start with
the simplest ones, which are heuristics,
rule-based systems or scores. Now, one example of a rule-based system will
be our ice cream stand. Let's say we have this
intuitive understanding of our business, and we will translate it into a set of rules. For example, if it is a Saturday, we would expect 50 people to come to our store. If it is sunny on top of it, we would expect an additional 30 people. This way we would build a set of rules, maybe 20 or 30 rules, and we connect them to construct a predictive model. This predictive model could then be used to predict future values. This is a pretty
common misconception. People believe that
predictive models are only machine-learning
models, while that's not true. Already if you are constructing these heuristics or rule-based systems, you are constructing a predictive model, because it can predict future values. Now, are these still
useful nowadays when we have so many options when it comes to
machine learning? Well, they are. And for example, this is when we have policies, or some cases, where full transparency is required. For example, your bank or some governmental agency is
assigning some risk score. And basically, this risk score is predicting how likely you are to pay back a loan if you now borrow some money. Well, this is one
example of a case where you would require a
full transparency. You as a human want to know what is behind the score that has been assigned to you. These simple predictive models
are still in use when it comes to some niche applications where full transparency
is required. Let's increase the complexity and talk about simple
machine learning models. Maybe you are already familiar with machine learning, and you have heard about two model types: linear models and decision trees. These are two prime examples of simple machine learning models. Actually, you have for sure already experienced a simple machine learning model within our earlier assignment, where we were basically imitating a linear model. And this model is so simple in its workings that we were
even able to imitate it. So we became the linear model, which was assigning the
individual weights to the inputs. Now, when would you use a simple machine-learning model? Well, there are many reasons why we might opt for a simpler model. For example, when we want to increase the transparency of our predictive model: the linear model, once we construct it, we can really nicely and easily interpret. So if this is our building goal, then we should opt for a simpler machine learning model. Another case would be that our resources for building a machine learning
model are limited. For example, we are in a rush: we have two days to build a machine learning model, so we would apply only a simpler machine learning technique. So there are multiple
reasons why we might opt for a simpler model. However, there are cases when simple machine learning models wouldn't suffice, and we would move into the realm of complex machine learning models. This would be when maximum
predictive accuracy is required or where patterns in the dataset are very complex. This was, for example, the case with visual recognition: the human eye is a very complex
function to imitate. So whenever we are
supposed to analyze images or understand
human language, human texts, we really need complex machine-learning
models because these are complex functions
to estimate. Now some famous examples of complex machine-learning
models would be, for example, neural networks. These are well understood
by now and they're used exactly for these
complex applications. Alright, so as you can see, we need to keep in mind this axis when it comes to the complexity of the predictive model. When we are building one, we should also keep in mind how complex we would want our
predictive model to be. We are basically balancing
between the transparency and the simplicity as opposed to
high predictive accuracy. That's everything that I wanted to say within this lecture. Looking forward to seeing you in the upcoming ones.
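As a small appendix to this lecture, here is a sketch of the two simplest rungs of the complexity ladder we discussed: a hand-written rule-based system like the ice cream example, next to a linear model that assigns a weight to each input. All of the concrete numbers below (the baseline of 20 visitors, the rule bonuses, the linear weights) are invented for illustration, not taken from any real data.

```python
# Illustrative sketch: two ends of the complexity spectrum from the lecture.
# Every number here is made up for demonstration purposes.

def rule_based_prediction(is_saturday: bool, is_sunny: bool) -> int:
    """Simplest model type: a hand-written rule-based system (heuristic)."""
    customers = 20                # baseline expectation on an ordinary day
    if is_saturday:
        customers += 50           # rule 1: weekends bring more visitors
    if is_sunny:
        customers += 30           # rule 2: sunshine brings more visitors
    return customers

def linear_prediction(temperature_c: float, discount_pct: float) -> float:
    """One step up: a linear model that assigns a weight to each input.
    In practice these weights would be fitted on historical data."""
    intercept = 10.0
    w_temperature = 2.5           # each extra degree adds ~2.5 customers
    w_discount = 0.8              # each discount point adds ~0.8 customers
    return intercept + w_temperature * temperature_c + w_discount * discount_pct

print(rule_based_prediction(is_saturday=True, is_sunny=True))    # 100
print(linear_prediction(temperature_c=24.0, discount_pct=30.0))  # 94.0
```

Both of these are fully transparent: you can point at each rule or each weight and say what effect it has, which is exactly what you give up when moving to complex model types.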
49. Predictive Model is Never Perfect: Hi and welcome to the lecture Predictive Model is Never Perfect. There is oftentimes this wrong assumption that people hold, which is that data science models or machine learning models can have perfect predictive power. Of course, this is
never the case. No model will ever get all of its predictions correct. So within this lecture, I want to provide
you with reasons why this is the case. Or, taken from a different perspective: we already understood that a model is a simplification of reality. Thus, it cannot include all of the subtle nuances of the phenomenon which we are studying. In this lecture, I would like to show you this in a more practical way: what is causing that the predictive model never has perfect predictive power. So let me list these reasons
and elaborate on them a bit. First of all, it's about non-representative
training data. We have our little ice cream stand that we have been building across the chapter, in Sweden. However, for some reason, we have available data from a similar stand in Spain. Now, if we try to learn some valuable information
from these data, it will most likely not be representative of the
behavior in Sweden. It might just be completely
different when it comes to purchasing power in Sweden. So this is an example of a non-representative
training data. And we already discussed the issue at the
beginning of the chapter. Basically we are referring to a non-representative sample. You might be now
telling yourself that this is a bit of a naive example, collecting the data
in one country and then using them
in another country. Well, we would maybe never do this in the real world, or hopefully not. However, remember the
learning story about the edible and
poisonous mushrooms. This was a very
realistic scenario of a non-representative
training data or a non-representative sample. This is the first issue. Secondly, it might happen that we will have an
overfitted model, or in a rarer scenario, we would have an underfitted model. We are usually learning from a sample, generalizing
to a population. We already know that much. We basically do not want to focus too much on the
sample, because that would create an overfitted model. With the patterns and valuable information that we learn from the sample, we essentially want to leave a little bit of freedom, so that our model generalizes well to the population. Remember what we said
about the inference of an average height from the
sample to the population. Even though we have measured in this sample that the average height is 175 centimeters, we did not claim that the average height in the whole population is also 175 centimeters. We claim that we are ninety-five percent certain that
the average height in the population is between 171 and 179 centimeters. It's something very
similar over here. Now, we do not want to
overfit the sample, while at the same
time we also do not want to underfit the sample. This would be the case when we are not utilizing the sample enough, when we are not learning enough from the sample. But as I'm saying, overfitting is the bigger issue. This is especially troublesome with machine-learning models. There are literally methods and techniques for how you have to hold them back or stop them from overly learning on the sample and overfitting to the sample, where they would then fail to generalize to the population. The second issue of why a predictive model is never perfect is because it might happen that we are overfitting or underfitting our training data, or our sample. Thirdly, irrelevant inputs:
as we already said, the features, or in other words the inputs that we provide the model to learn from, are the most important aspect of the predictive model. It might happen that we provide
some irrelevant inputs or inputs which contains
some spurious correlation that will not hold
in the future. In such case the model
learns a pattern, but it will not generalize
well to the population, or the spurious correlation will not hold stable in the future. We might just be providing the model with an irrelevant input, and we are lowering its predictive power and its predictive accuracy. Lastly, we are staying with the inputs, and it's about
the low quality of input. At the beginning of the course, we discussed a little
bit about the data. We said that not all the
data is of the same quality, and that, for example, self-reported data might be very problematic. Always remember the yellow
Walkman learning story. It could be the same if we
are interested in predicting from the data points
that the people report. This data might be low quality
input, and thus it will cause our model to not have great predictive power. And also, always remember that the data represents the
phenomenon that we are studying. If the data are not representing
the phenomenon well, if it's a low-quality input, well, then we can have as good a predictive model technique or method as we like, we will never grasp what's really going on in the phenomenon if our data is of low quality. To wrap up this little exercise that we did, listing the reasons why our predictive model will never be perfect: I would say that we always have to keep in mind
that there is always some sort of bias or error within our
predictive exercise, or that it will always
contain some degree of noise that will not allow us to have a perfect
predictive power. Maybe you are feeling
a bit discouraged now, but you definitely shouldn't be. We can create great and
powerful predictive models. So even though our
model will not have perfect predictive power, we are not aiming to create a model which has perfect
predictive power. We are aiming to create a good enough predictive
model that can create a benefit for a
concrete application. So this is one of the key
takeaways from this lecture. We want to create a good enough
predictive model and it doesn't matter that it
will not be perfect. I would say maybe as a bonus
takeaway from this lecture: it's about how we can view and approach these issues. As data scientists, we should admit that these limitations and issues exist in our predictive modeling exercises, and we should view them as an opportunity for improvement of a predictive model. Incorporating better
training data, maybe incorporating
more training data, improving the data quality, working on the data
collection methods. All of these have a
potential to improve the model that the data
scientist is working on. We should view this
as a potential for an improvement of
our predictive model. And that's it. That's what I wanted to discuss
in this lecture. And I'm looking
forward to seeing you in the upcoming learning story
about dogs and wolves, where we will wrap up
our predictive approach.
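As a small appendix to this lecture, the height-inference idea recalled above (a sample mean of 175 centimeters with a 95% confidence interval of roughly 171 to 179 centimeters) can be sketched in Python. The sample below is invented so that its mean lands at 175 cm, and the interval uses the usual normal approximation; it is only a sketch of the reasoning, not the course's actual data.

```python
# Illustrative sketch of the inference idea recalled in this lecture: we never
# claim the sample mean IS the population mean; we give a confidence interval.
# The heights below are invented so that their mean is exactly 175 cm.
import math
import statistics

heights = [168, 172, 175, 171, 180, 178, 174, 182, 169, 176,
           173, 177, 175, 179, 170, 176, 174, 181, 172, 178]

n = len(heights)
mean = statistics.mean(heights)
sd = statistics.stdev(heights)            # sample standard deviation
margin = 1.96 * sd / math.sqrt(n)         # ~95% margin (normal approximation)

print(f"sample mean: {mean:.1f} cm")
print(f"95% CI: ({mean - margin:.1f}, {mean + margin:.1f}) cm")
```

The same logic is what holds us back from overfitting: we state what the sample supports, plus an explicit allowance for the noise that any sample carries.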
50. Are we seeing a dog or a wolf?: Hi, and welcome to another
lecture in the course Be Aware of Data Science. It is just about time to conclude our exploration of the predictive approach
to data science. So we will have a
learning story called Are We Seeing a Dog or a Wolf? You'll learn about one of the crucial parts of data science: estimating or approximating a function that will allow us to make some predictions. I decided to provide you with a little bonus story to showcase the beauties and wonders of data science. To be a little bit more specific: when the function that we're trying to estimate or approximate is very complex, an example could be imitating a human eye and the human brain with a visual recognition model, a strange issue might occur. The problem is that the function becomes so complicated that it basically becomes a black box for us. We will not know what patterns the model was learning, as is often the case with visual recognition and deep learning models. There are six pictures on the slide. Each one is depicting a dog or a wolf. So some researchers constructed a visual recognition model. The model was supposed to decide whether it sees a dog or a wolf. The model worked very well, even astonishingly well. I mean, in some cases we can imagine that it's pretty simple and straightforward to recognize that this is a dog, or that that one is a dog. However, with some specific breeds, or shots from specific angles, it might be even a bit tricky for a human eye to recognize whether it is a dog or a wolf. Yet the model was really good; it could basically perfectly distinguish whether there is a dog in the picture or a wolf in the picture. How was it possible that the data science model was so good? It really bothered the researchers. So they decided to look inside of this complex model and examine the patterns that the model was relying on when making the predictions. Basically, they were expecting that they would find something like this: that the model had learned to distinguish dogs from wolves based on the fur color, head shape, or snout shape; these would be the features they would expect. However, in reality, the model did not learn all the nuanced differences between dogs and wolves, such as the fur coloring or posture or the contours. It was heavily relying on the fact that dogs are usually pictured in an urban environment or with some human-made objects, such as a bridge, house, fence, or floor. On the other hand, we often photograph wolves in nature settings: grass, trees, or forest. As the model focused a lot on this simple pattern, it was accurate at predicting whether there is a dog or a wolf. Now, an interesting turn
of events, isn't it? I think this story really nicely highlights the potential pitfall of the predictive approach. Oftentimes, when you stumble upon large datasets and complex mapping functions that you attempt to estimate or approximate, you might end up with a model that is doing what you would expect; however, it is basing its decisions on something completely different than you would expect. The question is, of course: is this really a broken predictive model? If you now look at the pictures, the pattern which the model picked up is a correct one to classify whether it sees a dog or a wolf. Maybe, if you think about it, your brain did this trick as well. When you were looking at the pictures of the American white shepherd or, for example, the husky on the right side, did you base your decision on the contours of the dog or the shape of the dog itself? Or did you just look at the environment, subconsciously of course, and decide: this is a dog, and maybe this is a wolf? Maybe at the end of the day, our brain is doing something very similar to this kind of broken visual recognition model. Before I end the lecture, I need to thank the authors of the pictures of these lovely doggos. And I'm hoping that you have enjoyed this learning story. Always remember also what we have discussed: we should first be describing and exploring the data, trying to find some useful patterns ourselves. And only then should we be relying on predictive approaches, such as a machine-learning model, so that we can use them with more confidence. And that's everything that I wanted to discuss when it comes to the predictive approach. Thank you so much for being part of this chapter.
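As a small appendix to this learning story, here is a toy sketch of the dog-or-wolf pitfall in Python: a "classifier" that ignores the animal entirely and keys only on the background. The photo records are invented for illustration; the point is how a spurious pattern can look like high accuracy on a biased dataset and then fail on a picture that breaks the pattern.

```python
# Illustrative sketch of the dog-vs-wolf story: a classifier that ignores
# the animal entirely and relies on the spurious background cue.

def background_classifier(photo: dict) -> str:
    """Predicts from the background only: urban scenery -> dog."""
    return "dog" if photo["background"] == "urban" else "wolf"

# Training-style photos follow the bias: dogs in cities, wolves in nature.
biased_photos = [
    {"animal": "dog",  "background": "urban"},
    {"animal": "dog",  "background": "urban"},
    {"animal": "wolf", "background": "nature"},
    {"animal": "wolf", "background": "nature"},
]

# A husky photographed on a meadow breaks the spurious pattern.
tricky_photo = {"animal": "dog", "background": "nature"}

accuracy = sum(
    background_classifier(p) == p["animal"] for p in biased_photos
) / len(biased_photos)

print(accuracy)                               # 1.0 on the biased set
print(background_classifier(tricky_photo))    # "wolf" -- wrong!
```

Perfect accuracy on the biased photos, and a confident mistake the moment a dog shows up in a nature setting; that is exactly what the researchers found when they looked inside their model.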
51. Is our model having an impact?: Hi and welcome to another
lecture in the course Be Aware of Data Science. We have already designed
some data science models. Most importantly,
in this chapter, we have designed predictive models, for example, by using a
machine-learning algorithm that learns on our
historical data. And now we are thinking
about applying this predictive model
on our customer base. Let's say that we
are a company which has 1 million customers. And the big question for us is our predictive model having an impact that we
would intend to, for example, is it positively impacting our revenues
or our profit? This is the topic that we'll be tackling within this lecture. Let's start off by
designing our little business setup. I guess
you are already tired of the ice cream stand, so let's have a different setup. In this case, we had a
vision or an idea about the function of nature which is inside our customers' minds, deciding whether they purchase a mortgage from us. We have considered and collected various inputs which we consider relevant, such as the offer that we are having
with the mortgages, an age of the customer, whether the customer
owns already some property and
let's say some more. And of course, all
of these inputs are coming into the function of nature. What comes out on the other side is the final decision of a customer, whether he or she comes and purchases a mortgage
from our bank, as we've discussed in
the previous lectures, what we basically now attempt to do is estimate or approximate this function. And thanks to it, we create a predictive model. Alright, so let's take it one step further and already consider how we would apply this in our bank, in our real business setup. Let's say that we have 1
million of our customers, on top of which we would like to apply this predictive model. The way we would apply it would be that for the customers about whom we predict that they have
really a high probability of coming to our bank and purchasing the mortgage from us. We would give them a call and we will try
to persuade them, okay, really come
to our bank and purchase the mortgage from us. So this is the application. Now comes the first key learning from this lecture, and it's negative. So I highlighted it in red: we should not apply our model right away to the entire customer base
because of various reasons. First of all, we might
have missed some form of cognitive or a
statistical bias. What do I mean by this sentence? Well, maybe we are thinking about the whole function of nature wrong, and the underlying process
is completely different. That's one possible scenario. Or secondly, the data that we
have used might have some form of a statistical
bias inside of it. And the model which is fitted and created on top of these datasets will be doing something completely different than we intend it to. Due to these reasons,
if we would now apply the model to the
entire customer base, well, it might be having a negative impact instead
of a positive one, so we should be careful
with applying it. Secondly, our model might be negatively influencing a different aspect of our business. So think about it: our model is not considering some other aspects of our bank. We have, of course,
different products. We need to take care
of our customers. We have our operations. There are many different
aspects of our business. The model is focused only on one of them, which is our mortgage sales. And if we are now applying this optimization and if we are trying to increase
the mortgage sales, we might be negatively
influencing, for example, customer
satisfaction. Maybe these phone
calls which we are making to the customers
are bothering them. They are unhappy
and they will not only not buy the
mortgage from us, but they will actually churn from our bank as customers completely. We have to be careful
with the application of our predictive
model. Okay, now, what do we do? If only we had a way to apply the predictive model on just a sample of our customer base, and then from the sample generalize the learnings to the entire population. If only we had such a method. Of course, we have it, and I think you guessed
it correctly. If we go just a few lectures back when we were
discussing the inference, you might remember
the case when we are considering a new
sales strategy. And we would collect a
sample of customers to whom we try to sell the old sales strategy. Then we collect a sample of customers to whom we try to sell the new sales strategy. We compare these two samples
and we check against the statistical model for the generalizability
of this difference. We just need to reuse the
very same technique and just reshape our
setup a little bit. We would consider that, okay, our old population is the one without applying the predictive model. So basically, how we are doing
our business right now. And the other population
is the one on which we already apply
the predictive model, where we are already
calling the customers based on the predictions
of our predictive model. And again, we compare the two samples; we would conduct a statistical test. Now, just so that
you can imagine it really as a real-world scenario, I have redrawn this picture. Basically, we have our
1 million customers and we will break our customer
base into two halves. For example, on one half, we would not apply the predictive model. On the other half, we would
apply the predictive model. But again, we do not want to apply the predictive model on such a large scale, to
0.5 million customers. So we would only
collect a sample. We would collect a
sample of 1000 customers with whom we are doing the business as usual
as we do it today. And then we would have a
sample of 1,000 customers on which we are applying this predictive model. And again, we will compare them. What will come out
of this experiment? I do not want to go back to the statistical test; we have already covered it. You can revisit that lecture in case you would like the technical details. I want to rather imagine this
as a real-world scenario. So here is the outcome
of our experiment. As you can see, we have
our two groups in the rows. We have a group without our new predictive model, and then we have a group
with our predictive model. Now let's look at the number
of customers in a sample. We have 980 customers without our new predictive model and 930 with our predictive model. Now, the reason why this is not 1,000 is simply because some customers churned in the process of conducting our little experiment. So here is the thing: we cannot measure the impact right away. We need to wait some time, maybe three months, maybe six months, between when we start to conduct the experiment and when we measure the results; it simply takes a few months or so. Some customers naturally churned during the process, and we have, of course, collected data the whole time. Now, about each of the groups we are collecting various metrics, because, remember, we should consider other aspects of our business. We do not want to look only at the mortgage itself, because the model might be negatively influencing, or maybe positively influencing, some other aspects of our business. We have multiple metrics that we are considering. Let's look at them. Revenue per customer group: this is the one that's fairly important for us. We hope that we increased it. And indeed, we can see that without our new predictive model it is $1,135, while with our predictive model it seems to be larger, $1,222. So it seems our
predictive model is increasing the
revenue per customer. Now, the second
metric that we are considering is the customer churn per group. We can indeed see that more
customers are churning from a group when we are applying the new
predictive model. This really might be because of the phone calls
that we are making. It's simply bothering them and they do not want
to hear from us. And lastly, we have
also collected a customer satisfaction score: 8.55 versus 8.53. So let's say that there is no difference between the two. Now, if we are looking at these, I think we can make
a conclusion that our predictive model is having the desired and overall
positive impact. So even though we are negatively influencing the customer churn, all in all, our business colleagues have evaluated the results of our experiment, and they are concluding that the increased revenues are indeed worth it. So we are making the conclusion that yes, our model is having the positive impact. And based on the results
of these experiments, based on the results
of this pilot phase, we will now deploy the predictive model to
address the entire population. So entire 1 million
of our customers. So let's summarize with a key
takeaway from this lecture: we apply the predictive model only to a sample of the population to see whether
it has a positive impact. This is not only applicable
to a predictive approach, it's actually applicable to all approaches, but it is crucial with a predictive model, because most likely we have used a machine-learning model to create it, and we need to make sure that there is no statistical bias, that there is no cognitive bias, and that we have not gotten the whole setup wrong. So it's very crucial to
do these experiments, especially with a
predictive model. This is the end of the lecture. This is how you check whether your predictive model is having a positive and the
intended impact. I'm looking forward to seeing you in the upcoming lectures.
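As a small appendix to this lecture, the pilot comparison above can be sketched in Python as a two-sample test statistic computed from summary numbers. The group sizes and revenue means come from the lecture's example; the standard deviations are invented here, so the resulting statistic is only illustrative of how such a check would be done.

```python
# Illustrative sketch of the pilot comparison: a Welch-style two-sample
# statistic from summary numbers. Means and group sizes are the lecture's
# example figures; the standard deviations are assumed for illustration.
import math

n_without, mean_without, sd_without = 980, 1135.0, 410.0   # business as usual
n_with,    mean_with,    sd_with    = 930, 1222.0, 430.0   # with the model

# Standard error of the difference in means (Welch / unequal variances)
se = math.sqrt(sd_without**2 / n_without + sd_with**2 / n_with)
t_stat = (mean_with - mean_without) / se

print(f"difference: ${mean_with - mean_without:.0f} per customer")
print(f"test statistic: {t_stat:.2f}")
# With samples this large, |t| > 1.96 suggests the difference would be
# statistically significant at roughly the 5% level.
print("significant at ~5%:", abs(t_stat) > 1.96)
```

In practice you would run the same kind of comparison for every metric you collected (revenue, churn, satisfaction), exactly as the lecture's experiment table does, before deciding to roll the model out to the full customer base.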