Transcripts
1. 1 course introduction promo video final: The financial industry
is no longer just about numbers on
a balance sheet. It's about the patterns
hidden within those numbers. Today, the world's most
successful institutions aren't just making decisions
based on intuition. They are building
intelligent systems that can learn, adapt, and predict. Hello, I'm Omar
Koryakin, and welcome to Financial Intelligence
and Machine Learning: Mastering Predictive
Systems and RegTech. Throughout my career, I've seen firsthand how AI and
machine learning have transformed finance from
a rule-based industry into a data-driven powerhouse. But there is a huge
gap between knowing the theory and building a
system that actually works. This course is designed
to bridge that gap. We aren't just going to
talk about algorithms, we are going to master them. We will move beyond traditional economic theory and let the data speak for itself. You will learn how to use machine learning
not just as a tool, but as a competitive advantage, whether you are
automating trades, rating bonds, or
predicting tasks. Our journey begins
with the foundations of AI and statistical modeling. We'll dive deep into
predictive modeling, mastering everything
from linear regression to gradient-boosted trees
and neural networks; deep learning, understanding
complex architectures like CNNs and how they apply
to financial data; the RegTech revolution, exploring how AI is disrupting regulatory compliance,
monitoring, and fraud detection; and ethics and risks,
where we will talk about the vital
considerations of bias, overfitting, and systemic
risks in automated systems. If you are a student looking
to enter the quant space, a financial professional aiming to upgrade your toolkit or an AI enthusiast eager to apply your skills to the world's
most complex markets, this course is for you. We believe in an
interdisciplinary approach drawing inspiration
from statistics, computer science,
and even psychology to build the next generation
of financial intelligence. The future of finance
is intelligent, automated, and data driven. The question is,
are you ready to lead this transformative
industry? Let's get started. I'll see you soon in the first lecture.
2. 2 Getting Started with AI & ML in Finance: I want to give you a first idea of what artificial
intelligence is, how we can define
artificial intelligence, and what problems we
are confronted with when using AI and ML in finance. Now, what is artificial
intelligence? Well, if you try to define it, it's best to define
it as something different from human or
natural intelligence, natural intelligence
being the intelligent behavior displayed by humans and animals. Artificial intelligence,
or just AI for short, is in contrast intelligence
demonstrated by machines, for example by a computer,
a robot, et cetera. You can also define
artificial intelligence by the cognitive functions
that are mimicked by machines, that is, the way
machines try to mimic cognitive functions that
humans would usually associate with the human
mind and human behavior. This leads us to behavior such
as learning and problem solving. This is one way of defining
artificial intelligence. As you can see in the box below, machine learning is
actually a subfield of AI. It's the part where we are
trying to teach a machine. We are trying to teach a model to learn from input
data to generate, for example, some
type of output data. This brings us to learning
or just machine learning. And if we are using models
that have more layers, we call this deep learning. Then of course,
artificial intelligence, just like natural intelligence, also includes the field of
natural language processing. If we are given, for example, an MP3, a video recording,
or an audio recording of language that is spoken
by a computer or by a human, this needs to be processed in certain applications,
and we need to put it into a form which we can work
with on a computer. Next, we also
have perception. You might know that certain
electric cars have sensors that can detect other cars and
humans on the street. This is perception. We need sensors, and then we need algorithms and models that
can process this input data, which is actually
big data, and make decisions. For example, in the car: make a left turn, make a right turn. Motion and manipulation is the
same thing; here we are getting closer to
what robots would do. And last but not least, there is also social intelligence and what we call
affective computing. These are all
different subfields of artificial intelligence
and machine learning is just one part of AI really, which is used quite
frequently in finance. That's why I included
it in the title, but we will look at all these different subfields
more or less in this class. Now, artificial intelligence
is a lot of statistics, actually, but it's
not just statistics. It obviously relies on
statistical analysis. It needs to rely on
computer science because we're using
computers and computer algorithms to make our decisions and to teach our
machine to make decisions. But we also draw inspiration
from psychology, and also from biology and medicine. This is where it all
gets mixed up, really. We have statistical analysis, and we have statistical models that are built to
mimic, for example, the neural
networks that we know from biology and human medicine, and that are then used
on a computer. So you have statistics coming up
with a model that is related to biology and used
on a computer, plus, of course, engineering
and so on. This makes artificial
intelligence very, very interesting because it's
interdisciplinary: we're using models that are
part statistics, part
computer science, a little bit of
psychology, and so on. Is artificial intelligence really a new field of
research? Actually, no. The
first examples of AI date back to
ancient automatons. These were machines that
were able to learn very, very basic things or to make
certain decisions. They were
actually used very early in human history, and at that time they were simply called
automatons. But even modern
methods, like, for example, the neural networks that we
will discuss in this class,
were first proposed in the 1940s. So some foundations of AI
are actually very, very old, but we have new
statistical models and new statistical
methods that have been developed over the
last couple of years. We have seen an increase in the available amounts
of data, which is very, very important
because even though we might have had
the methods ten, 20, 30 years ago, we
didn't have the data, and the beauty of AI really comes into play
when we combine a lot of data with very good algorithms and
very, very fast computers. That's the third bullet
point here below. The increase in computer
processing speeds has made it possible to use
modern algorithms on huge amounts of data, which makes it even more interesting to
use AI. And then, of course, there is the reduction in
data storage costs. The combination of all these
four points, new methods, more data, faster computers, and a reduction in
data storage costs, has led to a vast
increase in the interest we've taken in AI and machine learning
in applied sciences, even though the basic foundations of AI might have been
laid as early as the 1940s, 50s, 60s, and 70s. Now that's AI and ML in general. So what about AI
and ML in finance? Well, here are just a
couple of benefits we can reap by using AI
and ML in finance. Now, financial operations,
and that's usually trades, transactions in a stock market or in a bond market, for example, are based on
predefined rules. So automating these rules with AI and machine learning can reduce costs and
can increase speed. We can implement trading algorithms that can
make decisions on their own. And this may reduce
costs, obviously, but it can also increase speed
and by increasing speed, we might be the traders on the market that are
quickest to buy or sell, thereby increasing our
profits or minimizing losses. Second, financial decisions. Every financial institution,
every financial investor
has to make decisions, for example, whether to grant a loan, how to rate a bond, or whether to make an investment or
not. These usually require quick but also fact-based
judgment calls. We shouldn't base
them on emotions. We should make all our
financial decisions based on hard facts:
balance sheet data, income statement data, analysts' forecasts, fundamental
data from the market, macroeconomic
factors, et cetera. So if financial decisions do not really rely
so much on emotions but on hard facts, we can take those fact-based
judgment calls and try to teach
them to a machine. We can use AI and ML for automated decision
making when, for example, granting a loan, rating a bond, or buying or selling
a stock. AI and ML algorithms make these fact-based and hopefully
objective decisions, and this is another advantage. They will always
comply with the laws and regulations if we
program them to do so. Even if we don't
have an increase in speed, even if we don't have
a reduction in cost, this can be very attractive
to financial investors, especially to regulated
and supervised financial institutions,
because it takes out the human element
that is error-prone. We have an advantage in comparison
to a situation where humans make
these decisions, because the machines will always comply with the
laws and regulations. That's one advantage we get besides an increase in speed
and a reduction in costs. Also, AI and ML do not
require economic theory; they simply use data
to detect patterns. This is an advantage and
disadvantage at the same time. The advantage is that we are using data
and letting the data speak for itself, and we are getting patterns that
we do not need to explain; they are simply the reality
that we can see, that we can observe
in the market. I'm pretty sure
researchers will disagree on this, and you will probably have
researchers arguing for the first and
for the second case. You could also argue that this is a disadvantage of AI and ML, because we do not have any economic theory that
can explain these patterns. We are only modeling the patterns that we can
observe in the market. This is actually an argument that I think is made in the
textbook by López de Prado, who argues that this is a huge advantage of AI
and ML because you're not relying on any theory
that in the end might not have been tested and might not
be testable at all. So you only concentrate
on what you can observe and the patterns
you can see in the data, and then hopefully AI and ML will tell you what
these patterns are. Now I want to give
you a very simple introductory example
for machine learning, and this is taken from the
textbook by John Hull. You can actually download the data from John
Hull's website; the link is here. The task is that we want to predict the salaries of
people based on their age. The sample is of size 30, N is equal to
30, and we will divide the data sample into three
datasets. The first one is a training set, the second one is
a validation set, and the third one is a test set. We will use
three models, a linear model, a
polynomial model of order two, and a polynomial model of
higher order, to predict the salary of people,
with X being the age. The linear model, let me
highlight this for you,
simply assumes that we use people's age, X, to estimate salary, which is Y, so Y = A + B X, and we need to estimate the
parameters A and B. Now, if we use a polynomial
model of order two, we take age as a linear
term, and we also take its square and
include it in our model. Then we have a
polynomial model of order five, where we have
X, X squared, and X taken to the third,
fourth, and fifth power. Very simple, and again, we are using three sets of data. Now, this is the training set. We have age 25 and a salary
of $135,000, et cetera, age 55 and 260,000; these are very
wealthy people actually, and you can see this
is the training set. Now, if we plot
the training set, at first you just see the data. One would think that it could be a model
that looks like this. It could also be a model
that looks like this, or
one that looks like this. That's why, absent any theory, we need a good model and
a good procedure to train our machine
learning algorithm, so that we come up
with a model that is able to explain
the data we have. Now, a very simple linear
model would look like this. So we estimate
this linear model. Later on, we will see that this is regression analysis and we are estimating a linear model based on regression
analysis in this case. The quadratic model
would look like this. And if we go back to our data, well, a quadratic model
would look like this, could be okay, but maybe it's rather a polynomial
model of order five, which gives us a
pretty good fit. We could have seen this
from this plot here, but actually, you can also
see this in the data. Now, the polynomial model of order five is the
most flexible one, and it will yield the best fit. However, as you can see here, we have ups and
downs and ups again. This might be indicative of
what we call overfitting. We have a model that is too
flexible, and it will not only model the
pattern in the data, but it might also be
modeling the noise. We have maybe some
errors, some outliers. In contrast, the linear model is not very flexible at all. It's rather simplistic, as you can see here
in the blue line. It's not a very flexible model. It only has two parameters, so it might be too inflexible. And this is what we
call underfitting. The solution is we
have to back test our model using the
second dataset, which we then call
the validation set. We use the second
dataset and we check whether the model generalizes well to a validation dataset, and this is what is shown on
the second data plot here. We have the ten points, the ten observations
in our validation set, and what we then do is we estimate the so called
root mean square error. We take our three models, the linear and the two
polynomial models. And first of all, we estimate our models
based on the training set. Then, and this is quite simple and I can
show you here, let me erase a little
bit, what we can do is
take the differences between
the data points and our model, and those are the errors. For example, if I were
to take a linear model, which would look like this, then I can show you the errors here. The errors would be these
differences, quite simply. And so on and so on,
and you get the idea. If you now
take these errors, these differences,
square them, take the mean of the squares, and then take the
square root of that mean, you get what we call the
root mean square error. You take those errors,
you square them, you take the mean
and the root, and you get the RMSE. Now, in our training set, if we estimate our models, it's quite clear that
the most flexible model, which in this case is
the polynomial model of order five, gives you the best fit in the training set. We have a root mean
square error of 49,000 here, 32,000 here, and 12,000 here. If we now do the same thing
in the validation set, we take our models and we look at where
our validation set lies. We can see that in the validation set,
for the linear model, we get almost the same
root mean square error; the difference is 259. For the polynomial
model of order two, we have 33,000 and
a difference between the root mean square error of the training and
validation set of only 622, but for the polynomial
model of order five, we get a huge difference. This shows that here, in this case, we
have overfitting. The polynomial
model of order two now produces the best fit without overfitting the data. As you can see, the root mean square error is considerably
lower for this model than, for example, for
the linear model. This means that this
is a better model. And as you can see, the
difference between the training set and the validation set
is also very small here. We would argue that this is the best model
without overfitting. As you can see, in the
validation set, the polynomial
model of order five again has a huge root
mean square error. We should use model two: it gives you the best fit in the validation set and it
doesn't overfit the data. Now, how accurate is
the chosen model? We have the RMSE in the training set and
in the validation set. Is it the first one?
Is it the second one? Actually, neither of the two. The accuracy of our chosen
model should not be measured based on datasets that were used to choose or validate the model. So what we need to do is use the third set of data, that is, the test set. Remember that we have
30 observations. We estimated and trained our models
using the first set. We validated and chose our
model based on the second set, and then, to make an estimate of the
accuracy of our model, we have to use the third one. This is a very
prototypical approach in machine learning, here of course with very simple models. We have a dataset.
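The procedure described here (estimate each model on the training set, compare the models on the validation set, report accuracy on the test set) can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: the 30 age and salary pairs are made-up placeholders in thousands, not the textbook data, and `fit_polynomial` and `rmse` are hypothetical helper names. The RMSE is computed as the square root of the mean of the squared errors.

```python
import math

def fit_polynomial(xs, ys, degree):
    """Least-squares fit of a polynomial of the given degree; returns a predictor."""
    mean = sum(xs) / len(xs)
    scale = max(abs(x - mean) for x in xs) or 1.0

    def features(x):
        z = (x - mean) / scale  # rescale age so that high powers stay well-behaved
        return [z ** k for k in range(degree + 1)]

    rows = [features(x) for x in xs]
    n = degree + 1
    # Augmented matrix of the normal equations (X'X) beta = X'y.
    a = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
         + [sum(r[i] * y for r, y in zip(rows, ys))]
         for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n + 1):
                a[r][c] -= factor * a[col][c]
    # Back substitution.
    beta = [0.0] * n
    for i in reversed(range(n)):
        rest = sum(a[i][j] * beta[j] for j in range(i + 1, n))
        beta[i] = (a[i][n] - rest) / a[i][i]
    return lambda x: sum(b * f for b, f in zip(beta, features(x)))

def rmse(model, xs, ys):
    """Root mean square error: square the errors, average them, take the root."""
    return math.sqrt(sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs))

# Made-up data (ages, salaries in thousands), NOT the textbook sample.
train_set = ([24, 29, 33, 38, 42, 47, 51, 56, 60, 64],
             [160, 162, 242, 226, 297, 268, 305, 253, 270, 226])
valid_set = ([26, 31, 36, 40, 44, 49, 53, 58, 62, 66],
             [145, 218, 266, 245, 278, 250, 298, 282, 226, 239])
test_set = ([25, 30, 34, 39, 43, 48, 52, 57, 61, 65],
            [165, 180, 244, 276, 245, 284, 269, 295, 241, 230])

# Estimate all three models on the training set only.
models = {d: fit_polynomial(*train_set, d) for d in (1, 2, 5)}
results = {d: (rmse(m, *train_set), rmse(m, *valid_set)) for d, m in models.items()}

# Choose the model on the validation set, report accuracy on the test set.
best_degree = min(results, key=lambda d: results[d][1])
test_rmse = rmse(models[best_degree], *test_set)

for d, (tr, va) in results.items():
    print(f"degree {d}: train RMSE {tr:.1f}, validation RMSE {va:.1f}")
print(f"chosen degree {best_degree}, test RMSE {test_rmse:.1f}")
```

The training RMSE can only fall or stay the same as the degree rises, because each model contains the simpler ones as special cases. The validation RMSE is what indicates whether the extra flexibility generalizes.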
We divide it into a training, a validation,
and a test sample. And then we estimate the root mean square error
for our third set, the test set, and this gives us 34,275 as an estimate of
the root mean square error for our model two, the polynomial model of order two. Okay. Now, a second
machine learning example. The task here, and
this is taken from the textbook by James, Witten, Hastie,
and Tibshirani, is as follows: we have defaults by customers of a bank, for example,
or a credit card company, and we want to predict
the probability of credit card default
based on annual income, which is on the y-axis, and the
monthly credit card balance, which is on the x-axis. The defaults are shown in orange
and the non-defaults
in blue. One can immediately see that there seems to be
a pattern at work here, meaning that
we can simply draw a line here and
do quite well at classifying
our observations as defaults or non-defaults. Now, what does this mean? It would mean that
here, if we have a balance on
our credit card below, let's say, 1,200, that's okay: anyone with a credit
card balance below 1,200 has a pretty high
probability of not defaulting, whereas here, this
pretty much looks like a customer who is bound to default. This is a second task we will see quite often
in machine learning. We want to train an
algorithm to classify observations as good or bad,
as ones or zeros, and so on. And we need models
to do this for us. Here we have two
predictors, balance and income, and of course, we
want to do this with many more predictors. In this case, if we
were to use box plots, you could see that with income, which is on the y-axis, there's actually
no real pattern. We can see that
for any level of income, the probability of default
is probably the same. You can see here
that for default and no default, the means and the quantiles
are very close together. Now with the balance on the credit card, there's
a huge difference. As you can see here, no
default lies here and
default lies here. So credit card balance seems to be a very
good predictor of the probability of
default and this can then be used in a
machine learning model, machine learning algorithm to make automated
decisions on whether, for example, a new customer
is bound to default or not. So what could we do? We
can do the same as before. We could fit a linear model
to the data that could look like this: default equals A plus B times
income, with two parameters A and B. But the problem is
that, in contrast to our first model,
where we were trying to estimate and forecast
a quantitative variable, salary, we now want to
estimate and forecast
probabilities of default. With a linear model, the fitted probabilities
can be negative, and this is a problem here. So if we were to use
this linear model, this would be the linear
model, the blue line. Suddenly, the probabilities
we are trying to estimate
could be negative, and this
should not be the case. We need a different model; we cannot use the linear model. The solution, which is a better idea, is to fit a so-called logistic
function, a logistic model. We will see this later on in the lecture. The
logistic model takes the probability divided by one minus
the probability, where the probability is the conditional one that we have a default given the income. These are the odds. If we now take the natural logarithm
of these odds, we get what we call the logit, which is modeled as a linear function of income. This is the logit
or logistic model. If we estimate it, as
you can see here, all those probabilities
are now between 0 and 1, and this is a more suitable, better model to forecast
probabilities of default. This will actually
be a huge part of this lecture:
classification problems, classifiers, and machine
learning algorithms that can classify loans, that can classify
stocks and investments. Okay, this was a
short introduction. You should know what AI is, that machine
learning is actually a subfield of
artificial intelligence, and you should have seen
two pretty simple examples of how these models can be
applied in economics, business, and
especially in finance.
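To make the contrast between the linear and the logistic model from the second example concrete, here is a minimal Python sketch. The coefficients and the predictor values are hypothetical and chosen only for illustration; they are not estimated from the textbook data. The logistic model sets ln(p / (1 - p)) = A + B x, which is equivalent to p = 1 / (1 + exp(-(A + B x))).

```python
import math

def linear_probability(x, a, b):
    """Linear model: nothing keeps the result inside [0, 1]."""
    return a + b * x

def logistic_probability(x, a, b):
    """Logistic model: the log-odds are linear in x, so p stays between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-(a + b * x)))

# Hypothetical coefficients, for illustration only.
LIN_A, LIN_B = -0.10, 0.0002
LOG_A, LOG_B = -10.0, 0.005

# Hypothetical values of a single predictor x.
xs = [0, 500, 1000, 1500, 2000, 2500]

linear = [linear_probability(x, LIN_A, LIN_B) for x in xs]
logistic = [logistic_probability(x, LOG_A, LOG_B) for x in xs]

for x, p_lin, p_log in zip(xs, linear, logistic):
    print(f"x={x:5d}  linear={p_lin:+.3f}  logistic={p_log:.5f}")
```

With these made-up coefficients the linear model returns a negative value at the low end of the range, which cannot be a probability, while the logistic values always lie strictly between 0 and 1.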
3. 3 Sources of Financial Data output: Now in this video, I would like to start
with Chapter two: data sources, data generation, and data preprocessing. As you can see
on this slide, of the three subjects we have, I would like to start with the data sources and
data types that we will usually encounter
in applications of AI and ML in finance, and give you an idea that it's
no longer just about capital market data. We need additional
data sources, and this leads us to
new problems, because we need to merge all these
different data sources, and then we have to preprocess
the data in order to make the most of our
machine learning and artificial
intelligence algorithms. In some cases, last but not least, we are
also led to the problem of being required to generate new data that we can
feed to our algorithms. So we'll start in
this video with data sources and data types. We start
with financial data, and even financial data
are quite diverse. They come at different
levels of complexity, and they can be about prices, they can be about indexes, they can be about transactions, and so on and so on. We'll shortly see what
financial data are. But actually, even
financial data are quite diverse and have
always been quite diverse. Now, the data can be structured. They can be unstructured. They come at low
frequency, high frequency. Sometimes we have prices available only
once a week, or we have prices, for example, that can be sampled from transactional
data at five-minute intervals; then we get what we call intraday data or
high-frequency data. At the other end, we might just have
balance sheet data. Balance sheet data is only
published every quarter, sometimes just once a year. The data can also be publicly available or private. If it's published, if
it's disclosed by firms, like a balance sheet or
an annual report, it's publicly available. But sometimes we also
have private data, data that is only
available to one company and that is actually
a business secret. Financial data can be complemented with
alternative data. I call this alternative data. You could also say just
non financial data. Now, we've always worked with financial data in
finance, obviously, with balance sheet,
income statement data, with prices, market data. Nowadays, we are
trying, in research, but also here in teaching
and, of course, in practice, to complement
financial data with additional data sources
from a non-financial realm. It could be satellite images, data from Twitter, data from Facebook, or data from some
other source that we don't know about yet or
haven't thought about yet. The combination of financial data with non-financial
data, first of all, makes artificial
intelligence and machine learning so necessary, because
we have big data, we have more data available,
and we need more powerful algorithms and statistical tools to deal
with this type of data. And second, this is also what makes artificial
intelligence and machine learning so darn interesting
in finance, because we can hopefully see more than just by looking
at financial data. So more data sources
lead to more data, more data in turn
leads to the need for AI and ML methods to
process this data, but also, of course, for
big data algorithms. But then different data
sources, structured and unstructured, high and low frequency, all this makes data
preprocessing necessary. We need to think about
how we can combine, how we can merge
this type of data, how we can make working with
this amount of data most efficient and we need
to make sure that our algorithms do not run
into problems in between, so we have to
preprocess the data. Now what types of
financial data do we have? We have fundamental data like balance sheet items,
income statement items, and also macro variables that
are usually published and processed by
central banks. Obviously, we have
market data, so prices, also yields for bonds, and implied volatilities
when it comes to options and
other derivatives. We have transactional
data: trading volume, dividends, coupons, open interest, quotes,
cancellations, and so on and so on. There's a huge universe of data available when
it comes to market data. In the third
category, we have analytics, which uses fundamental
and market data and extracts
what we would later call in machine
learning features. We have analyst recommendations, credit ratings, earnings
expectations, and news sentiment. So this is data that has already
been processed. It builds on market
data and fundamental data, and it complements
this data with, say, for example,
the recommendation of a financial analyst. Then there is the fourth category. This is not financial, but it complements the other
three categories. It's the alternative data
section, for example, images, which could be images of persons, of companies, or of products, but could also be
simply images of buildings, as well as Google searches, Twitter,
chats, and metadata. In this class, just like in big data analysis in
finance in practice, we'll concentrate on
all four types of data, financial and alternative
data to make the most of AI and ML in
our applications. Now, where do I get
financial data? Usually from vendors
like Bloomberg, Compustat, and Eikon, which used to be called Datastream,
so I mention this here. These are professional
vendors of data, and it usually only
comes at a high price. You have to pay a
lot for Bloomberg, Compustat, or Eikon/Datastream, but you also get
high-quality data. There are, of course,
lower-quality data sources like Yahoo Finance, but don't expect too much from these free-of-charge sources. It's just like in
any area of life, if you pay more, you usually
get a better product, and this is also the case here. But in machine learning,
especially in machine learning, it's quite nice to
see that there's a huge community now
of practitioners, researchers who have
published data song boards, algorithms, and you
can access these. For example, you can
find a lot on Kaggle or the UCI Machine
Learning Repository. UCI is the University of
California at Irvine. If
I click here, kaggle.com or archive.ics.uci.edu/ml,
you get Kaggle or the UCI Machine
Learning Repository. You find a lot of data, you
find a lot of examples, and I encourage you
to look these pages up and see what you
can use yourself. You also get data from public government
agencies, for example, in the US and the
European Union, and you can get IDCs, you get census data. This is maybe not alternative data, as it has been used, especially in
economics, for a long time, but it's the first
step to complement traditional capital markets data with more alternative
data samples. The same goes for the economics
datasets from the World Bank. Throughout the class, we will introduce
different sources of data, different data samples,
different databases, and you will learn how to import different kinds
of data from, say, CSV files and PDF files, but also image data, in our practical applications
throughout the lecture.
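As a small illustration of two points from this video, sampling transactional data at five-minute intervals and merging financial with alternative data, here is a minimal Python sketch using only the standard library. The trades, the sentiment scores, and the function name `five_minute_bars` are all hypothetical.

```python
from datetime import date, datetime, timedelta

def five_minute_bars(trades, minutes=5):
    """Map each interval start to the last traded price within that interval.

    Assumes `minutes` divides 60, so intervals align with the clock hour.
    """
    bars = {}
    for ts, price in sorted(trades):
        start = ts - timedelta(minutes=ts.minute % minutes,
                               seconds=ts.second,
                               microseconds=ts.microsecond)
        bars[start] = price  # later trades overwrite earlier ones
    return bars

# Hypothetical transactional data: (timestamp, price).
trades = [
    (datetime(2024, 1, 2, 9, 30, 12), 100.0),
    (datetime(2024, 1, 2, 9, 33, 47), 100.4),
    (datetime(2024, 1, 2, 9, 36, 5), 100.1),
    (datetime(2024, 1, 2, 9, 44, 59), 99.8),
    (datetime(2024, 1, 2, 9, 45, 0), 99.9),
]

# Hypothetical alternative data: one sentiment score per calendar day.
sentiment_by_day = {date(2024, 1, 2): 0.3}

bars = five_minute_bars(trades)

# Merge the intraday bars with the lower-frequency sentiment score.
merged = [(start, price, sentiment_by_day.get(start.date()))
          for start, price in sorted(bars.items())]

for row in merged:
    print(row)
```

Intervals without any trade simply do not appear in `bars`. Whether to forward-fill them is a preprocessing decision that depends on the application.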
4. 4 Preparing and Cleaning Data: Hi, everyone, and welcome
back to our class in artificial intelligence and
machine learning in finance. The topic for today is
data preprocessing. We've talked about the
different data sources we have at our
disposal in finance nowadays. We have
financial markets data. We have alternative datasets, so we are complementing
our data from, say, balance sheets and
income statements and from markets with
data, let's say, from Twitter and Facebook,
but also geographic data, satellite images, et cetera. Now, financial and other data nowadays are often
quite incomplete. For example, you
have missing values, you have missing attributes
for some data points. Sometimes we have granular data, sometimes we have
aggregated data. So depending on
what you want to do and want to achieve with
your data analysis, you sometimes need to
aggregate the data or you need to disentangle, if that's possible,
aggregated data back to the granular data
they were created from. Financial and other data
are often also noisy: they contain errors.
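As a brief aside on the first two issues mentioned here, missing values and granular versus aggregated data, both can be checked with very little code. The following is a minimal Python sketch on hypothetical trade records (the field names and values are made up): it counts missing entries per attribute and aggregates granular trades to daily volume.

```python
from collections import defaultdict

# Hypothetical granular records; None marks a missing value.
records = [
    {"date": "2024-01-02", "ticker": "AAA", "volume": 100, "price": 10.0},
    {"date": "2024-01-02", "ticker": "AAA", "volume": 250, "price": None},
    {"date": "2024-01-03", "ticker": "AAA", "volume": None, "price": 10.4},
    {"date": "2024-01-03", "ticker": "AAA", "volume": 300, "price": 10.5},
]

def count_missing(rows):
    """Count missing values per attribute."""
    counts = defaultdict(int)
    for row in rows:
        for key, value in row.items():
            if value is None:
                counts[key] += 1
    return dict(counts)

def daily_volume(rows):
    """Aggregate granular trades to total volume per day, skipping missing volumes."""
    totals = defaultdict(int)
    for row in rows:
        if row["volume"] is not None:
            totals[row["date"]] += row["volume"]
    return dict(totals)

print(count_missing(records))
print(daily_volume(records))
```

Going in the other direction, from the daily totals back to the individual trades, is not possible without additional information, which is why disentangling aggregated data only works in some cases.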
You have outliers. The outliers may
not be erroneous, but you may want to
exclude them anyway because they might drive your results in a
direction you don't want. The data could also be inconsistent. For example, they could contain discrepancies in codes or names. A very simple
and trivial example is that many companies don't
just have one single name, but they have many
legal entities, and all the subsidiaries and all the companies that belong to one large conglomerate
share one part of the name but have slightly
different names in different countries, and you need to make sure that these inconsistencies
are taken care of. So data preprocessing is
about resolving these issues, making sure you
have complete data, you have no errors in your data, and everything
is consistent. You need to transform raw
data into a format that is understandable
by the computer and by machine
learning algorithms. Data preprocessing is key
to good model performance. The first thing that could happen
if you have
errors in your data, if you have incomplete data, is that you might not be able to perform your analysis in
the first place. Your algorithm might just stop. But that's actually
the good case. The bad, the worst case is
that your algorithm works, it runs on the data, and you don't see
what the errors do. You don't see the
errors in your result. You only get an output that is biased by the errors you
haven't identified before. So we need data preprocessing, and this typically
involves two steps. First, you need to
understand the data. You need to make
sure that you have a good feeling for what
the data looks like, and this includes
looking at the raw data, going the extra mile, looking at the Excel file, looking at the CSV file or any of the other formats the raw data might come in, and getting a
good understanding of what the data looks like. And then you have to prepare the data so that your
machine learning and artificial intelligence
algorithms can properly work on the data. Now, if we have a look at what
Less Miste 2015 proposed, the main task of
data understanding are first of all,
collecting the data. That's usually a huge task, especially if you think
about alternative data. You need to work with the interfaces and the APIs
with different data sources. You need to describe the data. In research, this
is usually done by looking at
descriptive statistics. You have to explore the data
and verify the data quality. Look at the NAs, the not availables,
look at missing values, look at data values
that are completely off the chart and make
sure that this is not a data error
in the best case, this is an outlier
and then think about whether you should
remove these outlies. These tasks are performed
to make sure that the data are adequate to meet the goal that you want to achieve with your data analysis. Furthermore, by
exploring the data, determining its sparseness, and identifying missing values, you get a better idea of
which learning method might actually be appropriate for this kind of data sample. It might be that you thought of one method and now,
after looking at the data, you see: okay, maybe I should
use a different angle. And verifying the data
quality is critical. We are talking about algorithms and methods that are based on, and get their huge advantage from, working with big data. And if you have a
huge data sample and the quality is not good, well, it might be that the analysis is doomed
from the very start. So it helps to understand who
collects the data and how it is collected. You can try to identify
incomplete or erroneous data. It might be that the data vendor
already tells you: okay, we are rounding values, we don't have access
to this sort of data, or sometimes, for that variable, for that feature, we do
have some missing values. It's also possible that rules for collecting the data
change over time. And when this happens, this will lead to structural
breaks in the data, which isn't a problem per se, but you need to be aware of this and you need to account
for this in your analysis. In data preparation, you
need to select the data. You need to clean the data, construct the data, integrate
and format the data. In the simplest example, this would simply mean: put
it all into an Excel file, put it all into a spreadsheet,
and make sure that the spreadsheet at the end is in a format that
is readable by, let's say, R, Stata, or any
other statistics program. The goal of these steps is
to get the data ready to use as input in the algorithms. In other words, make it ready to use in your
statistical software. This includes merging data from different sources,
feature engineering, and maybe further
transformations you need to apply to make it
readable to the computer. Some algorithms require categorical
variables, yes or no, to be
formatted as one or zero, as true or false, or
as factors. This sometimes
seems to be very trivial, but it is one part of your data analysis that
will take up a lot of time. And the data might be split into training, test,
and validation sets. We saw that in the
very first video. It is quite simple, I guess, but at least you
need to remember this. Now, data understanding
and data preparation are often not performed separately,
but are interrelated. As an example, we
will later have a closer look at the
German credit data set by Dua and Graff. The data are also available from the UCI
Machine Learning Repository, along with a very detailed
description of the data. So I encourage you
to have a look at the data and the example at UCI, and we'll talk about
this credit data set, this loan data set,
in the next videos. Now, before we start with the example in the next video, let's talk about which
statistical software we will be using in the
remainder of the class. We're using R. Everything is
performed in R. R is free software under the terms of the Free Software Foundation's GNU General Public
License. My team and I recommend the use
of RStudio, the RStudio Desktop, which is just a little bit
more comfortable and has a higher usability than the standard software
that is distributed. And this is also free. You can find the link here. If I find my cursor,
yes, here's the link. And alternatively, if
you're studying at Leipzig, you can also download it from our computer center's website. So RStudio is the way to go here, and this is quite convenient. Now, R is a language and environment for statistical
computing and graphics. It's one of the major languages for data science, has been around for decades, and is highly
extensible via packages. That's why, with the new
algorithms and new methods for machine learning and
artificial intelligence, we've seen an
increase in the number of packages that are purely
devoted to these methods, and the standard
statistics packages are obviously also around. There are, of course, alternatives, most notably Python with libraries such as scikit-learn,
Keras, TensorFlow, et cetera, which are related and similar
to some extent. But here we'll focus on R, and you can also use different software and different programming
languages, of course. Next, we'll start with
the practical example, but that will be done
in the next video. Thank you.
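The preparation steps described in this video (counting missing values, recoding yes/no as one/zero, and splitting into training, validation, and test sets) can be sketched as follows. The class itself works in R; this is only a minimal Python sketch using the standard library, and the records, the column names `income` and `homeowner`, the 60/20/20 split, and the choice to simply drop incomplete rows are all assumptions made for illustration.

```python
import random

# Hypothetical raw records; None marks a missing value (an "NA").
records = [
    {"income": 40000.0, "homeowner": "yes"},
    {"income": None,    "homeowner": "no"},
    {"income": 52000.0, "homeowner": "yes"},
    {"income": 38000.0, "homeowner": None},
    {"income": 61000.0, "homeowner": "no"},
    {"income": 45000.0, "homeowner": "yes"},
    {"income": 47000.0, "homeowner": "no"},
    {"income": 39000.0, "homeowner": "yes"},
    {"income": 55000.0, "homeowner": "no"},
    {"income": 43000.0, "homeowner": "yes"},
]

def count_missing(rows):
    """Count missing values per column (part of data understanding)."""
    counts = {}
    for row in rows:
        for name, value in row.items():
            counts[name] = counts.get(name, 0) + (value is None)
    return counts

def encode_yes_no(value):
    """Map 'yes'/'no' to 1/0."""
    return {"yes": 1, "no": 0}[value]

def split(rows, train=0.6, validation=0.2, seed=42):
    """Shuffle a copy of the rows and cut it into training, validation, and test parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validation)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

missing = count_missing(records)

# Assumption for this sketch: incomplete rows are dropped rather than imputed.
complete = [r for r in records if None not in r.values()]
prepared = [{"income": r["income"], "homeowner": encode_yes_no(r["homeowner"])}
            for r in complete]

train_set, validation_set, test_set = split(prepared)
print(missing)
print(len(train_set), len(validation_set), len(test_set))
```

Dropping incomplete rows is the bluntest option. Whether that is acceptable depends on how many observations are lost and on why the values are missing.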
5. 5 German Credit Dataset Part 1: Hello, everyone,
and welcome back to our class in artificial
intelligence and machine learning in finance. Now, in the last video, we looked at data preprocessing, and this is what we'll
do here in this example. This is the German credit data
example taken from the UCI Machine
Learning Repository. So you can actually download the same dataset from
the Internet and try to go through the
individual steps of this example yourself, and I would encourage
you to do so. We'll start by importing
the data, and we'll read the data by using the read.table command in R. Let me just
highlight this here. And we are reading
it directly from the website at UCI, so you can see UCI here, the UCI Machine
Learning Repository. We are downloading the data into a data structure
which, at this point,
we call German credit.
Makes sense to do so. And to get a first feeling for what German credit, the
data sample, looks like, we simply use, in the fourth line, the
command dim for dimension. The dimension of German
credit, of this array, is 1,000 rows and 21 columns. We have 1,000 observations
of, I guess, 21 variables. At this point, we're not so
sure whether the data consists of rows for observations and columns for the
different variables. At this point this is actually
an assumption, but we'll later see that it makes
sense: yes, we have 1,000 observations
for loans and borrowers, and we have 21 variables. Now, because this dataset includes a
lot of features, we will first reduce the number of features to make
them fit on one slide, just for expositional purposes. You can actually have a
look at all 21 variables, and maybe in your exercises you will see that you can also
use additional variables, additional features,
for the purpose for which we are doing all this, that is, forecasting
defaults and default rates. But here, for these slides, we wanted to make it
fit onto the slide, so we're first reducing
the number of features. This is what we are
doing in the last line. We are just using columns
one through five and 21, so we'll end up with
just six columns, but we are keeping all the rows. We're taking German credit, reducing it to columns one, two, three, four, five, and 21, and then again writing it into our
data array German credit. Now to explore the
structure of the data, we are using the command
str, for structure, in R. And if we enter
str for German credit, we'll see that this is a
data frame, that's the type of object we have here, with 1,000 observations
and six variables. Now, the six variables are, as we've seen before,
the columns one, two, three, four,
five, and then 21. They have names which are
given by V1, V2, V3, V4, V5, and V21. The first one is a
factor variable. It has four levels, A11, A12, A13, and, because we
only have four levels, A14. And you can see 1, 2, 4, 1, 1, 4, and so on. These are the first
observations for this column. V2 is an integer with 6, 48, 12, and so on
as first values. V3 is a factor
with five levels. V4 is a factor with ten different levels, and so on. And most interestingly
here, for V21, we have an integer that seems to only take on the values one, two, and again one, one, two. So it's probably a variable that looks
like a dummy variable but doesn't really
have a good coding, so we'll later have to
switch that as well. Now, the str command, as I said, gives us valuable information at the very start
of our analysis. We're seeing that it's a data frame with
1,000 observations and six variables, by construction, because we only cut
out six columns from the initial German
credit data frame, and the features are of type factor
and of type integer. Now the factors: these represent categorical or ordinal
variables and have the advantage that category
labels are stored only once. This requires a
lot less memory in your workspace
and enables faster computations
than if, for example, you had used a float or an integer number. The different values
of a factor are referred to as levels. For example, the
first variable, V1, has the levels A11, A12, A13, and A14. That's actually the
way the data frame was coded and programmed. It doesn't really make
sense in an economic way. We don't know what
the levels are or what kind of variable this is. We need to find out what variable one actually is
and what the levels mean. But as you can see here, this is how it's coded. And we can already see
the first observations: 1, 2, 4,
1, 1, 4. Those are the values of the first variable for the
first, one, two, three, four, five, six, seven, eight, nine, so the first ten
rows of observations. Now the variable or feature names, we usually call them
features in machine learning,
are not meaningful
in an economic sense. Since it's a German
credit data sample, we would expect the names to
be something like income, marital status, gender,
maybe location or assets. That would make sense, but from V1, V2, V3 we cannot infer what is meant
by these variables. So in the next step we need to rename them to understand what is meant by variable one,
variable two, and so on. And even if they
were meaningful, one should
always be skeptical about
the provided labels. It could be that they
are plain wrong. So again, check the data before starting your analysis and make sure
that you know your data. You can actually look at the description
here at the UCI website; there is a description
of the variables. And if you look through
the description, you will see that, for example, attribute 1 is actually the status of the existing checking
account. This is very old data, so the sums
are in Deutsche Mark. The four
levels are coded as follows: the first level, A11, is that the checking
account is below zero. If it's between 0 and 200 Deutsche
Mark, it's level two. Level three is above 200
Deutsche Mark or salary assignments
for at least one year, and
the fourth level is that this customer doesn't
have a checking account. The second attribute is
the duration in months. And most importantly, if we
now switch to attribute 21, the last variable we
cut out here, the 21st column,
this is actually the rating: one is good, two is bad. This is later on our
outcome variable. We have our borrowers,
and they have a good or bad credit
rating, one or two.
It is more or
less a dummy variable, but you should probably code it with ones and
zeros and not one and two, so we have to switch
the coding later. Now, based on the documentation, we can assign meaningful variable names, and furthermore, we can transform the data frame data type in R into
what we call a tibble, which in R is a more modern
type of data frame. Now, R was developed a long
time ago, some decades ago, and it has evolved
over all this time. Some things that
were once included in the original base version of R
are nowadays a
little bit outdated and aren't as fast and
convenient as they could be. The same applies
to the data frame, which is in the base delivery. A tibble, as a type of modern data frame, has some advantages over
the standard data frame, and you can see this described in the tibble
documentation of the tidyverse. For example, it has an enhanced print method. So we are changing
it into a tibble. We first use the
documentation to assign meaningful column names. The colnames command in R for German credit renames the columns, and we are using
checking account status, duration, credit
history, purpose, amount, and rating as the new column names instead
of V1, V2, and so on. Now let's look at the tibble. We first have to load
the library tibble, and then we rewrite German
credit as a tibble. We are taking German credit,
which used to be a data frame, and with the command
as_tibble we write it over German credit,
and it's now a tibble. Let's print some first
observations here. If we print German credit, we see it's a tibble with 1,000
rows and six columns, and the column types are factor,
integer, factor, factor, integer, integer. And for example,
in line six here, you can see the
first observation, which has A11 as
the first factor, then 6 months, A34, A43, 1169, and a rating of one. Yes, one is good. This is a good rating,
and then you have the remaining 990 rows. Now, the first example of analysis is
summary statistics. I think I mentioned
this in one of our previous lectures
and previous videos. You should always have a
look at the data itself, at some observations,
but also have a look at the summary statistics, because
the summary statistics will give you the mean,
the first and third
quartile, the median, and maybe also the
standard deviation, the volatility, and
the variance of the data. These summary statistics
will give you a first impression
of the sample, not just individual
observations, but the sample as a whole, and you will probably see that you might have
some outliers. You can see,
for example, here, with
the mean rating being 1.3, that most
of these observations have
a good rating and only
a few have a bad rating;
otherwise the mean would
be closer to two. Obviously, if you were
to see the minimum not being one or the
maximum not being two, you would know that
something's off, because this variable
is defined to have only
ones and twos as observations. Now, as you can see
here with the amount, this might be an outlier; 81000 could be one. We
have to check this. For credit history, you can see the counts for
those five levels; most observations seem
to be around A32. Duration is 4 to 72 months,
which seems to be okay. So use the summary
statistics to check your data and see if something's off, if something doesn't
make sense economically. That's always a good starting point. Now next, we need to transform
the integer variables into numeric float ones because we want to facilitate later
processing of the data. If we use float numbers, this makes computations easier. And for this and some
further operations, we rely on the dplyr package. Here on this slide, you have some documentation
from the website on which dplyr is provided. It's a grammar of
data manipulation, providing a consistent
set of verbs that help you solve the most common data manipulation
challenges. You can do all this in R without the help
of this package, but this is much
more convenient. For example, mutate is
a function, a command, that adds new variables that are functions of existing variables. You can use existing
variables and simply add new ones that are
functions of the old ones. For example, if you
need twice the amount, or if you want to multiply
the amount by five, that could be one mutation
of an existing variable. Select picks variables
based on their names. Filter picks cases
based on their values. Summarize reduces
multiple values down to a single summary,
and arrange changes the ordering of the rows. These are many
convenient functions
if you have to do
data manipulation over and over again, and this is why we're relying
on this package here. It's also part of
the tidyverse, which is a collection
of R packages that are specifically designed for data science and
machine learning. And again, everything
is based on the tibble data structure as the modern version
of the data frame. So we'll adjust the
variable names, which were, again, still
quite inconvenient. And we first transform the integer variables to
numeric ones, to float numbers, and we use the dplyr-style
notation with the pipe operator. So German credit
is now German credit, then the pipe
operator, and we mutate_if with is.integer and as.numeric. We're changing the
integers to numerics. We transform rating
into a factor. It isn't one yet. If you go back, it's still an integer
that has ones and twos. If we were to include a three,
it wouldn't cause a problem. It would still accept a three, but the three
wouldn't make sense. In order for rating to have only the ones
and twos as levels, we need to change it
from integer to a factor. German credit is German
credit, pipe operator, mutate, rating equals factor of rating.
This is the
original variable: we take the integer
variable as input for the factor function
and we say, okay, we mutate, and rating now has to be a factor based on
the old rating variable, and the labels are
now good and bad. So that's how we do it. Last but not least, we
show the percentage of good and bad rated
credit applications. We do a table of German
credit dollar rating, which is the rating
column, divided by nrow, the number of rows in German
credit. This is 1,000, but if we don't know this, we simply calculate the number
of rows using the nrow command. And as you can see, we have 70% good credits, good loans, and 30% bad loans. So on the next slide
and in the next video, we will continue this data preprocessing and data
manipulation analysis. Next, we'll concentrate on
the weight of evidence ratio, but that's up to the next video. Thank you and hopefully
see you in the next video.
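The last two steps of this video, turning the 1/2 rating into labelled categories and computing the share of good and bad ratings as counts divided by the number of rows, can be sketched in a few lines. The class does this in R with dplyr; the following is only a Python sketch with the standard library. The list of ratings is reconstructed from the 70%/30% split quoted above, so its ordering is an assumption and does not reflect the real file.

```python
from collections import Counter
from statistics import mean

# Ratings rebuilt from the shares quoted in the video: 700 good (1), 300 bad (2).
# The ordering is arbitrary; the real file interleaves the two values.
rating = [1] * 700 + [2] * 300

# Sanity check in the spirit of the summary statistics: only ones and twos.
assert min(rating) == 1 and max(rating) == 2

# A mean close to one means most ratings are good; close to two, mostly bad.
mean_rating = mean(rating)

# Recode the integer into labelled categories, similar in spirit to an R factor.
labels = {1: "good", 2: "bad"}
rating_label = [labels[value] for value in rating]

# Relative frequencies: counts divided by the number of rows.
n_rows = len(rating_label)
shares = {label: count / n_rows for label, count in Counter(rating_label).items()}

print(mean_rating)
print(shares)
```

Using a fixed label mapping also catches invalid codes early: a stray value such as 3 raises a `KeyError` instead of silently passing through.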
6. 6 German Credit Dataset Part 2: Hello, everyone, and
welcome back to our class in machine learning and artificial intelligence
in finance. Now in this video,
we want to continue our practical example based
on the German credit data, which is available
from the machine learning repository at UCI. We've already seen
some summary statistics. We've done some data
preprocessing in the sense that we've renamed
the variables, the columns in our data array. We have transformed
the data frame in R to what we call a tibble
from the tidyverse, which is just a more convenient
form of a data structure. It's a tibble, that's
what it's called. And after having looked at
the summary statistics, we now want to continue
by using what is called the weight of
evidence ratio, the WOE. The weight of evidence ratio
is a first way of having a look at the explanatory power of some features,
because in the end, in data science, we want
to predict something. We want to forecast
something. And as you might have guessed, if the data sample is
called German credit data, it's probably that
we are trying to forecast and predict default
rates in a loan portfolio. And for this, we are going to use the weight of
evidence to get an understanding of the
predictive power of some of our covariates,
some of our features. Now, the weight of evidence
encodes the relation between a categorical predictor variable and a binary target variable. So we have the binary
target variable, which in our case
is good rating or bad rating, and we have our
numerous predictor variables, which need to be categorical. This weight of evidence ratio actually originated in the finance industry. In this very same setting, it was used to separate
good from bad risks. It has also been in use in other areas for
quite some time now, but it's still best
known, I guess, in the finance industry
and the insurance industry. So we'll use weight of evidence
ratios, and in our case, it's defined as the logarithm of the number of non-events, which in this case
is a good rating, there is no default,
divided by the number of
events we are looking at, so that's a bad rating. Now, ratios of non-events to events close to one indicate that the
corresponding category of the covariate has no predictive power
for the target value. This corresponds
to a value near zero after having
applied the logarithm. And one should be
careful in drawing conclusions from the ratio, as
giving a loan to a defaulting customer, for example,
is actually worse than
not giving a loan to a
non-defaulting potential customer. So this is still a purely data-driven
approach to get a first glimpse of the predictive power of
some of our covariates, and it is completely
void of any economic theory. But again, it gives us a first impression and a first hint
of what the data looks like. Now, the weight of evidence is calculated in different groups, in different
subsamples that are formed based on the
covariate of interest. So for example, if we had
gender as our covariate, this would be very
simple because gender will probably come in two or three, maybe
four, I guess, levels. And if we only were
to use male and female, then we would only have two
groups, two subsamples. Both subsamples would
probably be of equal size. There would be
enough observations in each of these two groups, and we could easily estimate and calculate the
weight of evidence. For categorical variables, the groups might be the categories themselves or a pooling of multiple
smaller subcategories. For example, if we
think of our data, where we had the checking
account status, we might also
have income.
Income will most likely
be a float variable
or an integer one, and everyone has a slightly
different income. For example, you might have an income of 40,000
euros per year. The next person might have
40,005 euros per year. So all these income observations would be slightly different. You need to pool them.
Based on those multiple smaller
subcategories and observations,
you arrive at
larger categories and larger pools so that there are enough observations
in all of these pools. So for continuous variables, most definitely,
one has to create bins based on thresholds. For example, you
could say, okay, income from 0 to 40,000 euros, 40,000 to 80,000 euros, and everyone who has an income higher
than 80,000 euros per year. Each category or bin should contain at least 5% of the observations to avoid the results being
driven by noise or outliers. There need to be enough observations in
each and every bin. Let's do this for the
checking account status. We'll use the first attribute, which is a qualitative one. Remember that we have
four levels: below zero Deutsche Mark,
0 to 200 Deutsche Mark, more than 200, and no checking account.
We calculate the weight of
evidence and some further simple ratios to compare them to the
weight of evidence. We are using the
pipe operator again, let me highlight this here,
from the dplyr package. You need to select the
checking account status. That's the new variable name we've given to
what was V1. What we are also calculating is the percentage of
total observations, which is simply the length of rating divided by
the number of rows. The good rating is the mean of rating being good, and the
weight of evidence, as defined on the previous
slide, is the log of the sum of good ratings divided by the sum
of bad ratings. Let's see what
comes out of this. If we print those results, you can see that
these four levels, A11, A12, and so on, have a percentage of
total observations of 27, 27, 6, and 39%. The good rating is
50%, 61%, and so on, and the weight of
evidence is more tilted towards the extremes,
to zero and two. If we plot this and compare
the two statistics, the percentage of good ratings in these four bins and the weight of evidence
for those four levels, we can see that, based on the percentage
of good ratings, one could think, okay, there seems to be a difference
between those four levels. It's increasing, so
A14 seems to be a level that has more predictive power to
explain default rates. But the differences between those four levels are
not that extreme. But if you look at the
weight of evidence, you will see that
actually the first level has almost no predictive power, whereas the fourth level, which, if you remember,
is no checking account, you don't have any
checking account at all, seems to be highly
predictive of a good credit rating: with the logarithm of good over bad ratings close to two, good ratings clearly dominate in this group. And this is also what
we are looking for. So it seems that, when looking
at the weight of evidence, the third and fourth levels of this variable seem to predict the credit
rating quite well. So the weight of evidence shows this more clearly than the
percentage of good ratings. Let's turn to the loan duration. The first plot on the
left shows you a box plot across all
thousand observations. And if you divide this up into good and
bad credit ratings, remember that rating is
defined as a variable that takes on one and two as values, you can see that, and this is only speculative
if we are honest, a lower
loan duration seems to predict
a low value for rating, and a higher loan duration seems to predict a higher
value for rating. So a low loan duration seems to imply a good rating,
and the other way around. But this is only
speculative, because the plot on the right
isn't really helping here. Again, let's calculate
the weight of evidence separately for each unique value
of loan duration. We can do this because we take this
integer variable. But still, as you can
see from the plots here, the thousand
observations are quite dispersed
across the universe of all those loan durations, so we need to create bins. First of all, we calculate the good rating for each value of loan duration: we group by duration and take the mean of the good rating in
each of these groups. Then we form larger
groups on a yearly basis, do the same, and calculate the mean
of the good rating. So this, on the
left-hand side, is before grouping. As you can see, there are some loan durations for which we don't even
have an observation. And, well, one would say, yes, there seems to be a
trend that looks like this, and this becomes much
clearer as soon as we group our observations into
yearly bins of loan duration. And you can see, yes, the percentage of
good ratings seems to decrease the longer
the loan duration is. Okay. Now finally, we also have a look at the credit history variable
and for brevity, we do not consider
further variables. Remember that we actually had 20 covariates in
our data sample. We could have used
additional variables, but for now we have only shown this
for the credit history, the
checking account status,
and the loan duration. So let's use credit
history. It has five values: no credits taken or all
credits paid back duly, and all credits at this
bank paid back duly. Well, the second line,
A31, will probably be the best predictor and the best level if we are interested in a good
rating because it means, yes, you've taken up loans and you've paid all those
loans back in time. Then there are existing credits and loans
paid back duly till now, delay in paying off
in the past, and critical account, so that's actually the worst state
of this variable. Again, let's calculate the percentage of
total observations, the mean of the good ratings, and the weight of evidence, and this is what comes
out of this analysis. You can see here it starts
with 4%, 4%, 50%, 8%, 29%, and for the
weight of evidence, it's even more
extreme than we've seen before. This is the percentage
of good ratings. It looks like this,
and the weight of evidence is much more extreme. Not surprisingly,
the last status, which is that the account is critical, is highly
predictive and has a high explanatory power for explaining a
bad credit rating. So this is the weight of evidence ratio, which can be used to study the
explanatory power of some of our covariates in order to see which variables should be
included in later models. So this is data preprocessing, and in the next video,
we'll talk about data generation.
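The weight of evidence table described in this video (share of total observations, share of good ratings, and the logarithm of good over bad counts per level) can be sketched as follows. The class computes this in R with dplyr; this is only a Python sketch using the standard library. The good/bad counts per checking account level are an assumption: they are chosen to be consistent with the shares quoted above (27%, 27%, 6%, and 39% of 1,000 rows; 70% good overall), but they are not read from the transcript and should be checked against the actual data.

```python
import math

# Illustrative good/bad counts per checking account level (an assumption, see
# the lead-in): consistent with the quoted shares, not taken from the transcript.
counts = {
    "A11": {"good": 139, "bad": 135},
    "A12": {"good": 164, "bad": 105},
    "A13": {"good": 49, "bad": 14},
    "A14": {"good": 348, "bad": 46},
}

def weight_of_evidence(good, bad):
    """WoE as defined in the video: log of non-events (good) over events (bad)."""
    return math.log(good / bad)

n_total = sum(c["good"] + c["bad"] for c in counts.values())

summary = {}
for level, c in counts.items():
    n = c["good"] + c["bad"]
    summary[level] = {
        "perc_total": n / n_total,   # share of all observations in this group
        "good_rate": c["good"] / n,  # share of good ratings within the group
        "woe": weight_of_evidence(c["good"], c["bad"]),
    }

for level, row in summary.items():
    print(level,
          round(row["perc_total"], 2),
          round(row["good_rate"], 2),
          round(row["woe"], 2))
```

Two caveats on this sketch. A group without any bad ratings would cause a division by zero, which is one practical reason for requiring enough observations per bin. Also, some texts divide each count by the overall number of good or bad ratings before taking the logarithm; that variant shifts every value by the same constant, the log of total good over total bad, and leaves the differences between groups unchanged.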
7. 7 Generating Synthetic Data: Hello, everyone, and
welcome back to our class in artificial intelligence and machine learning in finance. We are now at Chapter 2.3, which deals with
data generation. Now, we've already looked
at data preprocessing. So why do we need
data generation? Well, actually, in many cases, machine learning techniques
rely on synthetic instead of real data for training
and testing purposes. Why is that? Well, in
several instances, we need to preserve privacy. We have confidential
data, for example, on credit card usage, credit card fraud, et cetera, and it should not be possible to derive any conclusions on the
origins of the data or on, for example, a certain person. So we need to preserve privacy, and in some
instances we also need additional data because we have too few data to
run our algorithms. So synthetic data is artificial data, usually
created by algorithms, sometimes via simulations, that should mirror statistical
properties of the original data as
well as possible while not revealing
information on real people. So we want to preserve privacy. We want to create training
data for our algorithms, and we want to test our systems. This is why we need
to generate data. Now, many training datasets
are highly imbalanced,
making classification
tasks difficult. In these cases, synthetic
data generation is particularly important for building accurate machine
learning models, and we'll later on see how
this actually works. Let's, at this point, look at
three different ways to generate data and what types of synthetic data we can
differentiate between. The first one is
fully synthetic data. This is when the data is completely synthetic,
completely artificial, if you want to call it this way, and it does not contain
any original data entries, any original observations or
values of certain variables. Thus, the joint density of the original
data is estimated, and we sample random variables from the estimated
density function. In other words, we are only taking the original data, fitting a statistical distribution
to the data, and then sampling synthetic data from that fitted statistical
distribution. In this case, the data has
strong privacy protection, but the truthfulness
of the data is obviously lost because
it's fully synthetic. Now, with partially
synthetic data, this is the second possibility, we replace values
of selected attributes that have a high risk of
disclosure with synthetic data. Now, this could, for example, in the
simplest case,
be that we
replace the name, the first name and the
surname of our observations, for example, when it comes
to credit card data. Now, if we call everyone Mr. X or Mrs. X, it could be
that this doesn't change
the value of the data, but we preserve the privacy of the real people behind
the original data. Now, disclosure risk
is higher than in fully synthetic data because
you might imagine that it is often possible to identify
persons not just by name, but also if you take, for example, age, gender, income, place of birth, date of birth, et cetera. So by combining
different variables, you might still identify a person. So disclosure risk is higher than in the first case
of fully synthetic data. Then we have hybrid synthetic
data that is the dataset is generated using both
original and synthetic data. Now each record and the
original data is replaced by the nearest record
in the synthetic data, you simulate from a
fitted distribution, but you try to replace
the original data by synthetic data that is closest to the
original observation. This method combines
good privacy protection with high utility at the cost of more memory and processing time. Obviously, this takes longer, but on a modern computer, this shouldn't take too long. Um, how should we
generate synthetic data? We have three
broad concepts: generating data from
a known distribution, fitting a distribution
to real data and then simulating
from that distribution, and using deep learning. So what do we do
in the first case? You simply take a statistical distribution. You assume that the
data comes, for example, from a Student's t, exponential,
or normal distribution, and then you simulate
random data from this a priori chosen
distribution. The difference to the second method, where we fit a distribution to real
data, is that here you simply assume the distribution and you assume it to be known.
For example, a normal
distribution with mean two and standard
deviation five. This is an assumption
that is not validated by any estimation, by any look at the real data. If we now take
this to the real data, we still make an assumption on the parametric form
of the distribution, but we estimate
the parameters. If we fit the
distribution to the data, then we are at the second method. If we have real data,
you can determine the best-fit distribution chosen from a given parametric
family of distributions. Usually it's parametric,
and you can then generate synthetic data by
Monte Carlo simulation. The quality of the
generated data obviously depends,
first of all, on the selection of
the parametric form of the distribution and
also on the estimation. So we might want to try
a goodness-of-fit test. We want to check how far the fitted distribution is from the empirical
distribution function. And last but not least, you can also use a
machine learning model such as decision trees, providing an approximation to non-classical distributions. For example, if you want to use a multimodal distribution,
that is one, for example, that has two humps. In these cases, overfitting
might be an issue. You have to be
careful: if you use very complex distributions, you might get overfitting. You might be fitting the distribution also to
the noise in the data. And as the third method,
we have deep learning. Later in this lecture, we'll see different methods
from deep learning. I only want to mention
two of these here, deep generative models such
as the variational autoencoder, VAE, or the generative
adversarial network, GAN. Now, the VAE is an
unsupervised method where the original data is first compressed by a so-called
encoder into a more compact structure,
and then the decoder generates a representation of the original data from
the compressed data. The system is
trained to minimize the differences
between the outputs and the original data. As you can see from
the word encoder, this is something
that is also used in audio and visual compression. In the GAN model, we have two separate
networks that are trained iteratively.
The first network, which is called the generator, takes random sample data to create a synthetic dataset,
and the second network, which is called
the discriminator, then compares the synthetic
data with the real dataset. The generator
network is trained to make discriminating
between the generated and the real data
as hard as possible for the discriminator network. In a sense, you
want to make sure that the discriminator network, which could also be a real person, is not able to
distinguish between the synthetic dataset and
the original dataset. This is obviously
what you want to achieve when it
comes to data privacy. No one should be able
to determine whether this is an original
observation or a synthetic one. This is also frequently used for
generating image data. If you click on the link here,
you can watch a demonstration
of the GAN on YouTube. I don't want to go
into detail yet, but you should know
that you can also use deep learning algorithms
for data generation. Okay, so we've now talked
enough about data sources, data generation,
and data preprocessing, and next, we will
start looking at AI and ML methods with their
applications in finance.
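The second approach from this lecture, fitting a distribution to real data and then simulating from it, can be sketched in a few lines of Python (the lecture itself works in R). This is only an illustrative sketch under stated assumptions: the "real" data below is itself simulated as a stand-in for confidential records, the normal family is an assumed parametric form, and the helper names fit_normal and sample_synthetic are hypothetical.

```python
import random
import statistics


def fit_normal(data):
    """Estimate mean and standard deviation of an assumed normal distribution."""
    return statistics.fmean(data), statistics.stdev(data)


def sample_synthetic(mu, sigma, n, seed=0):
    """Monte Carlo step: draw n synthetic values from the fitted normal."""
    rng = random.Random(seed)
    return [rng.gauss(mu, sigma) for _ in range(n)]


# Stand-in for real data (assumption): draws from a normal distribution
# with mean 2 and standard deviation 5, as in the lecture's example.
rng = random.Random(42)
real = [rng.gauss(2.0, 5.0) for _ in range(1000)]

# Fit the parametric form to the data, then simulate fully synthetic data.
mu_hat, sigma_hat = fit_normal(real)
synthetic = sample_synthetic(mu_hat, sigma_hat, n=1000, seed=1)

print(f"fitted mean = {mu_hat:.2f}, fitted sd = {sigma_hat:.2f}")
print(f"synthetic mean = {statistics.fmean(synthetic):.2f}, "
      f"synthetic sd = {statistics.stdev(synthetic):.2f}")
```

None of the synthetic values is copied from the original sample, which is the privacy argument for fully synthetic data. How useful the result is depends on whether the assumed parametric form is adequate, which is where a goodness-of-fit check would come in.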
8. 8 Basic Linear Regression: Hello, everyone, and
welcome back to our class in artificial intelligence and machine learning in finance. After having talked a
lot about data sources, data generation, and our
introduction to the realm of ML, we've now finally arrived at
the applications in finance. We'll go through all
these topics by first
describing and discussing
the statistical methods and then highlighting the
applications in finance. Maybe a little bit surprisingly, we'll start with
simple linear regression and multiple linear regression,
because these are the building blocks for more sophisticated models
in statistical learning. Our discussion of these different
AI and ML methods is a little bit heavy on
the machine learning methods, but we'll also see some
artificial intelligence methods, and we'll discuss the
statistical background and then the application. Now, all of these
methods are examples of statistical
learning algorithms, and statistical learning refers to the set of tools used
for understanding data and explaining the behavior
of statistical data. Broadly speaking,
we can distinguish two types of models in
statistical learning. The first are
supervised algorithms, and in contrast to
the supervised ones, we also have
unsupervised algorithms. What are the differences? Supervised
statistical learning means that we estimate, or say we predict, an output based on inputs. A very
simple example: you want to predict the wage of workers based on
their education and their gender, and we would get what we call
a regression problem. The second problem
is related to this, but it's called a classification problem because in this case, the outcome variable, the
output, is a binary variable. It's a dummy variable that
takes on the values one and zero. For example, we only want to predict ups and downs
of the S&P 500. If we use, for example, macroeconomic variables X one, X two, et cetera, and we want to estimate the ups and
downs, ones and zeros, we have a
classification problem. The same holds for
the credit card data: if we want
to forecast default or
no default, one and zero,
that's a classification problem. In unsupervised statistical
learning, we want to group input observations with
no output variable. The example we have here is
a clustering problem. We have data on customers, say, age, income, education, and we want to group them to see certain clusters of customers which share common properties. For example, one group
are the middle-aged men, a second group are young
women, a third group is maybe all persons with low income, and we get
different clusters. This is an unsupervised statistical learning
problem because, as you see, we have no output, we have no outcome
variable Y that we want to predict.
In this case, we have no way to see how these different clusters differ when it comes
to an output variable. They only differ based
on their inputs. Okay. Though this is a
very short introduction, let's now turn to simple
linear regression, which you know from your
introduction to statistics. Now, linear regression
is the foundation for many modern supervised
statistical learning approaches. We are now looking at
supervised algorithms, meaning that we have an outcome. In the setting of simple
linear regression, we want to predict a
quantitative response or a metric response Y on the basis of a single
predictor variable X, which means that we have a
regression function in which Y is approximated using two parameters, beta
zero and beta one. We assume a linear
relation between X and Y, meaning that Y is
approximately equal to beta zero
plus beta one times X. We say that Y is regressed on X. This is a little bit
different in German, where we actually switch
Y and X and say that X is regressed on Y, but in English it's Y that is
regressed on X. We have two unknown parameters
or coefficients: beta zero, which is called the intercept, and beta one, which
is called the slope. Those are the
intercept and slope of the linear
regression function. It's a line, and they
are estimated by minimizing the residual
sum of squares, or RSS. As a result, we have an estimate for beta zero, beta zero hat, and an estimate
for beta one, beta one hat. If we now have input data X, we can take those estimated
parameters, enter X, and we get an estimate
for Y, Y hat. Now, how should we estimate this regression function,
this linear line? Very simply, by minimizing
the residual sum of squares. We start with
the prediction y i hat. We have a certain number
of observations of X, we look at this
regression function, and we have those estimates. Those are the predictions, y i hat. We also have
the observations y i, and by comparing y i and y i hat, we get what we call
the i-th residual. That's the error. That's
also why it's called e i. It's the error
between our prediction, y i hat, and the actual
observation, y i. These are the
different residuals from our estimated linear
regression function. We now square those
errors to make sure that negative and positive errors don't cancel each other out. If we square those errors
and sum them up, we get the RSS in equation four, which is the residual
sum of squares. It's the sum of the
squared errors. Here, you can see one
very simple example based on the advertising data set from the James, Witten, Hastie, and Tibshirani textbook on
statistical learning. The blue line
is the regression line. We have TV and sales, TV being the predictor and sales being our
outcome variable, and you see all those
different red dots. Those are the
actual observations, and the blue line is the
estimated regression line. The small gray lines
highlight the errors. We now want to choose the
blue line in such a way that the sum of the
squared gray lines becomes minimal.
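To make the quantity behind the gray lines concrete, here is a minimal Python sketch of the residual sum of squares for a candidate line. The data is a hypothetical toy sample, not the advertising data, and the function name rss is an assumption chosen for illustration.

```python
def rss(beta0, beta1, xs, ys):
    """Residual sum of squares for the candidate line y = beta0 + beta1 * x."""
    return sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical toy data (assumption), roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

close_line = rss(0.0, 2.0, xs, ys)  # a line that runs through the points
flat_line = rss(5.0, 0.0, xs, ys)   # a horizontal line that ignores x

print(f"RSS of y = 2x: {close_line:.2f}")
print(f"RSS of y = 5:  {flat_line:.2f}")
```

The line that follows the points has a far smaller sum of squared residuals than the flat line. Least squares picks, among all possible intercepts and slopes, the pair for which this sum is smallest.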
This sum is the RSS. So how should we estimate this? Well, if you remember
your statistics classes, minimization of the RSS in this case of simple
linear regression leads to the well-known
OLS estimator, the ordinary least
squares estimator, for beta one and beta zero. X bar and Y bar are the
sample means of X and Y. So you take your
observations x i and y i, you take the sample means, enter them into these two
equations five and six, and you get the OLS estimates
for beta zero and beta one. Actually, this can be done on
most modern calculators. They usually have functions
for simple linear regression. Okay. Now here, in this case, again, the response
variable is sales. The predictor is the TV
advertising budget. As you can see here, we have beta zero and beta
one on the X and Y axes. This is a contour plot of the RSS, and the red dot here marks the OLS estimates.
You can see that this is actually the minimum, so it's very simple
to find the minimal value of the RSS
for beta zero and beta one. Okay, and in the three-dimensional plot of the RSS, as
you can see here, it's a very smooth function. We have beta one and beta zero on the X and Y axes and the RSS on the Z axis in the
three-dimensional case. Here, you have the minimum. Okay. We get the resulting OLS
regressions, and we can do this for each predictor. In this case, we
have three predictors, so we can run three estimations and look at the TV
advertising budget, the radio advertising budget, and the newspaper
advertising budget. We can already
see there seems to be a strong linear relation between TV advertising
and sales. It looks a little less linear for radio
advertising and sales, and there's probably not a linear relation between newspaper advertising and sales. But these are three distinct
regression analyses. Again, we are in the case of
simple linear regression, meaning that
the outcome variable Y is
regressed on X one
in the first plot,
on X two in the second plot, and on
X three in the third plot. It's not a multivariate
linear regression. It's still simple
linear regression. As you remember from your
statistics classes, you probably know
that we need to look for measures of the accuracy of the coefficient estimates. We need to assess the
goodness of fit of our model. If we assume that we have a regression function
with Y being equal to beta zero plus beta one
times X plus an error, how should we assess the accuracy of those
coefficient estimates, beta zero hat and beta one hat? Well, we have to look at the standard errors of
the OLS coefficients. Because the parameters are estimated
from a sample, we do have standard errors, and the standard
errors are given by equations seven and eight. As you can see, what we need
is the sample mean of X, and we also need the
variance of epsilon, which is the variance
of the residual terms, the errors. We have to
estimate sigma squared, because this is the only
thing we do not know from our sample: we know the x i, X bar, and n, the
number of observations. The only thing that is left
unknown is sigma squared, the variance of the
residual terms, and we estimate sigma via the residual standard error. We do not know how sigma looks
in the population, but we can estimate sigma based on our sample and
on the sample errors. For the residual
standard error, RSE, we take the residual
sum of squares, divide it by n minus two, and take the square root. That gives us the
residual standard error.
That's our estimate of sigma. What we do next is put it into the equation
for the standard error of the
coefficient estimates. And then we get the standard errors of
our two coefficients. But what should we do with this? Well, if we have the
standard errors of beta zero hat and beta one hat, we can construct confidence intervals for
beta zero and beta one. The 95% confidence
intervals for those two parameters
are given, more or less, by the estimate plus and minus two times the standard error
of that coefficient. And how does this help us? Well, actually, later on, we will look at
significance tests, and significance tests
usually work like this. You have a hypothesized
parameter value. Let's say we want to test the hypothesis that
beta one is zero. There's no relation
between X and Y, meaning that the
coefficient is zero. So we take zero plus and minus
two times the standard error, and we get an
interval around zero. If our parameter estimate lies in this interval, then at the 5% significance level the parameter is not significantly
different from zero, meaning that
we cannot reject the hypothesis that beta one is zero and there's
no linear relation. So we can use
confidence intervals for significance testing. That is shown here. For example, the null hypothesis H zero
is that beta one is zero, versus H one, that beta one
is not equal to zero. Now we take the
standard error of beta one hat and we
use this t statistic: beta one hat minus zero, divided by the
standard error. So the t statistic
measures the number of standard deviations that
our parameter estimate is away from zero. If it's far from zero, it is likely that the parameter
is actually not equal to zero. If it's close to zero, then we cannot
reject H zero. Okay. Now, this is what we do for the
parameter estimates, but how should we assess the accuracy of
the model itself? Assume that we rejected the hypothesis that beta
one is equal to zero, so we cannot reject a linear
relation between X and Y. How can we then assess the
accuracy of the model itself? We can take the
residual standard error. Recall that associated with each individual observation is an error term e i, or
epsilon in the population. Even if the coefficients
were known, these error terms would still exist, and this is why it's not
a perfect line. If it were a perfect line, there would be no stochastic
behavior in the data, and we wouldn't need regression analysis in the first place. But assume that we have a linear relation
between X and Y, and the residual
term exists and is simply noise added to
our linear regression. Even then we have
these error terms e i, and they
prevent us from perfectly predicting Y from X. Thus we estimate the
standard deviation of epsilon via the residual standard error, which is given by
the residual sum of squares divided by n
minus two, and then the square root. How should
we assess this? Well, the RSE provides an absolute measure
of the lack of fit of the model, and it is given
in units of Y. The problem is that the RSE depends
on the scale of Y. The solution in this
case is that we scale it. We take R squared, which you probably know from your statistics introduction. R squared is the proportion of variance explained by the model. R squared is equal to one minus the RSS divided by the TSS, which is the total
sum of squares: simply Y minus Y bar, squared, and summed
up over all observations. This gives us the total sum of squares, the total variation in Y, and one minus the
residual sum of squares divided by
the total sum of squares gives you the
proportion of variance explained by our simple
linear regression model. Also note that R squared
is equal in this case to the squared correlation
between X and Y. Actually, this is why it's
called R squared. It's the squared correlation
between X and Y. This was a very
short introduction to simple linear regression. We'll come back to this
in some instances. But obviously, we want
to step our game up to multiple linear regression, and we'll look at this
in the next video.
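The quantities from this video can be computed together in one short Python sketch: the OLS estimates, the residual standard error, the standard errors of the coefficients, the approximate 95% confidence interval, the t statistic, and R squared. The formulas are the standard textbook ones for simple linear regression; the data is a hypothetical toy sample and the function name simple_ols is an assumption, so this illustrates the calculation rather than reproducing the slides.

```python
import math


def simple_ols(xs, ys):
    """Simple linear regression of y on x with the usual accuracy measures."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

    beta1 = sxy / sxx               # slope
    beta0 = y_bar - beta1 * x_bar   # intercept

    rss = sum((y - beta0 - beta1 * x) ** 2 for x, y in zip(xs, ys))
    tss = sum((y - y_bar) ** 2 for y in ys)
    rse = math.sqrt(rss / (n - 2))  # estimate of sigma

    se_beta1 = rse / math.sqrt(sxx)
    se_beta0 = rse * math.sqrt(1.0 / n + x_bar ** 2 / sxx)

    return {
        "beta0": beta0,
        "beta1": beta1,
        "rse": rse,
        "se_beta0": se_beta0,
        "se_beta1": se_beta1,
        # Rule-of-thumb interval: estimate plus/minus two standard errors.
        "ci_beta1": (beta1 - 2 * se_beta1, beta1 + 2 * se_beta1),
        "t_beta1": beta1 / se_beta1,  # test of H0: beta1 = 0
        "r2": 1.0 - rss / tss,
    }


# Hypothetical toy data (assumption), not the advertising data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

fit = simple_ols(xs, ys)
for name, value in fit.items():
    print(name, value)
```

In this sketch the interval uses the factor two from the lecture; an exact interval would use the corresponding quantile of the t distribution with n minus two degrees of freedom, which matters for a sample as small as this one.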
9. 9 Multivariate Linear Regression Fundamentals: Hello, everyone, and
welcome back to our class in Artificial Intelligence
and Machine Learning in Finance. We are now at Chapter 3.3, in which we want to discuss
multiple linear regression. In the
previous videos, we've seen simple linear regression,
in which we use a simple line to predict a response variable
based on one predictor. Now, as you can imagine, multiple linear regression
is simply the extension of simple linear
regression by including more than one
X variable. Simple linear
regression is useful,
and so is multiple
linear regression. In practice, we almost always have
more than one predictor. In machine
learning, we might actually end up
using thousands of different variables,
thousands of predictors. The question now
is what we can do. Two possible approaches
come to mind. The first one is to simply estimate several
simple regressions. For example, if you
have ten predictors, estimate ten simple
linear regressions. The problem is that if you have correlations between
those predictors, and this will almost
always be the case, these correlations will bias the results in the sense that the coefficient on
one predictor variable will be biased
upward or downward. So we misestimate
the effect of a single
predictor, because the estimated effect might simply be due to correlation with a
different predictor. The second alternative
is much better. It's estimating a multivariate or multiple linear regression. That is, we include more
than one predictor variable. We have our response, which in this case is a
metric response Y, and we want to predict Y on the basis of several predictors, X one, X two, X three, and so on until X p.
We have p variables, thus p plus one
parameters: beta one, beta two, beta three, up to beta p for the predictors, and beta
zero for the intercept. We say that Y is regressed on the set of predictors,
X one, X two, and so on, with those p plus one unknown parameters
or coefficients, beta zero
being the intercept and beta one to beta p being the slopes for
the different predictors. They are estimated again by minimizing the residual
sum of squares. So we again use OLS, the ordinary least
squares method, to estimate those coefficients. And if you've already
taken a statistics class, you will probably know that
the vector of coefficients beta hat is given by
the inverse of X transpose times X, multiplied by X transpose times Y. X is the matrix of the observations
for X one until X p. You take the matrix
X, transpose it, multiply
it with X itself, take the inverse, and then
multiply it with the transpose of X times Y, and you get those coefficients in matrix notation. If you write out the
vector of coefficients, the result is that you can predict Y using those
coefficient estimates, beta zero hat plus beta one hat times X one, et cetera, and your
observations X one until X p. That is equation 50. Now, in the
multivariate case, in multiple linear regression, we no longer have a line as in
the two-dimensional case. In the
three-dimensional case, where we have two predictors
and one response variable, we get a plane, as
you can see here. Again, we estimate this plane, which is shown here in
blue and green colors. The
observations we have are the red dots, and the black lines show
the estimation errors. So we are
choosing the plane such that the sum of squared
errors is minimal. This is the three-dimensional
case, where we have two predictors. If we have more predictors, we get a linear hyperplane. Well, let's have a look
at this in R. We estimate the regression coefficients,
and we are using the Carseats dataset
from the ISLR package, which is the companion package for the statistical
learning textbook. What we are trying to do
here is predict product sales based on
advertising budget, community income
level, average age, and average education
in those communities. The way we do this
is, first of all, we load the library
MASS, and we load ISLR, which includes the data. Then we use the lm function, which stands for linear
model in R. This is the very short command for linear regression analysis in R. As you can see
from the command, the syntax is as follows: Sales is explained and predicted by using
Advertising plus Income plus Age plus Education. We're using the Carseats data. We are
fitting a linear model, and this is written into
our new variable lm.fit. I could also have called it results
or results.fit. So lm.fit is the object that includes the
fitted linear model. By using the summary
command on lm.fit, we can see what the result is. The function call
was the formula: Sales is explained by
using Advertising, Income, Age, and Education,
using the Carseats data. The residuals are shown here, so we can see the median,
the first quartile, the third quartile,
and the minimum and the maximum of the residuals after
having fitted the model, and these are the
coefficient estimates. You see the intercept,
advertising, and so on; these are beta zero, beta one, and
two, three, and four. You can see the estimates
for the coefficients, the standard errors, the t values, and here you can see which of these coefficients are
significantly different from zero. Now, be a little bit
careful: R has this feature that, well, in research papers,
we usually write three stars for significance
at the 1% level, two stars for the 5% level, and
one star for the 10% level. This is different
in R. As you can see here, three stars actually means
significant at 0.1%, two stars is 1%, and one star is 5%. So in a
research paper, you would probably have
to add another star, for example, in this case. We can see advertising
is highly statistically significant, as is
income and as is age. Education is not significantly
different from zero, so it seems that our
predictor education has no power to explain the car
seat sales in this data sample. You can see,
that's on the next page, I think, the number
of observations. You can see the multiple R
squared is close to 15%. We have an adjusted
R squared of 14%. I'll talk about it later. And we get an F statistic and
p-value for the whole model. So where should we go next? We should check: is
at least one of the predictors useful in
predicting the response? Well, as you've already seen from the R output, it seems three out
of four predictors are significantly
different from zero, so they are useful to
explain our response. We can use the
F statistic for the question whether at least one of the predictors is useful in
predicting the response. That is a question about the model itself. Then
we have to decide: do we need all or only a subset
of the predictors? Here we get to the question
of variable selection. Then, how well does the
model fit the data? We have to again look at
the RSE and the R squared. And last but not least, given a set of predictor values, what response value
should we predict, and how accurate are
our predictions? We are back to forecasting.
For example, in the
one-dimensional case, which I show with my cursor here,
we have these observations and we've estimated a
regression line. If we get a new X value,
for example here, then the point on the
regression line at that X is the
predicted and forecasted value. And if the new X lies outside the observed range,
we forecast on the extension
of the line. Now, let's talk about this. The first question: is at least
one predictor useful? We are using the F statistic. We're testing the
null hypothesis that all those
slopes are zero, that not even one is
different from zero. So the null hypothesis is
beta one equals beta two, and so on, equals beta
p, equals zero. We're testing this hypothesis using the following
F statistic. Again, we are going back to the total sum of squares and
the residual sum of squares, and we are simply using the
F statistic from the simple linear
regression case, now applied in the
multivariate case. If there is no relationship between the
response and the predictors, the F statistic should
be close to one. If the alternative hypothesis H A is true, then F should be
greater than one. That's the F statistic we've already seen here,
and as you can see, it is rather far away
from one. The F statistic is also
converted into a p-value, and it seems that
the null hypothesis is actually rejected. Okay. Now, which predictors
are significant? We've already seen this in
the R output. Let's assume that
not all predictors have an
insignificant coefficient; which predictors, then,
are significant? Again, as in the simple
linear regression case, we estimate t statistics
and p-values. We calculate a t statistic for each predictor, and the
square of each t statistic is also the corresponding
F statistic. Based on those t statistics,
which are also given here, in this column, we can then compute
p-values, and you can see
that three out of the four predictors are
significant in our regression. However, this is one difference to the linear regression models you've probably seen
in econometrics. In the case of machine learning, when
you have big data
that you want to analyze, it
might be that the number of variables is larger,
or extremely larger, than the number of observations.
You have thousands of predictors and only, let's say,
one or two thousand observations. So there are more
coefficients beta j to estimate than observations
from which to estimate them. A very simple example:
100 variables and only 50 observations. Two results
come out of this. First of all, we cannot fit the regression using OLS. What I've shown you before, the linear model in matrix
notation where you're using the OLS estimator, can no longer be used in a case with more variables
than observations. Second, we cannot use
the F statistic. You need to keep this in mind when you have an application
where the number of predictors is larger or much larger than the number
of observations. So we have checked whether at least one
predictor is significant. We can use t statistics
and p-values based on these t statistics to check which predictors
are significant. In the next video, we are going to decide on the question of which
predictors we should choose.
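The matrix formula for the coefficients and the F statistic for the whole model can be sketched in plain Python (the lecture uses lm in R on the Carseats data). Everything specific below is an assumption for illustration: the data is simulated from y = 1 + 2*x1 - 3*x2 plus noise, and the helper names solve and multiple_ols are hypothetical. The sketch solves the normal equations (X'X) beta = X'y by elimination instead of forming the inverse explicitly, which gives the same coefficients.

```python
import random


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            # Happens, for example, with more predictors than observations.
            raise ValueError("X'X is (numerically) singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * w for v, w in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def multiple_ols(rows, ys):
    """OLS with intercept; returns coefficients, RSS, and the F statistic."""
    X = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    k = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * y for x, y in zip(X, ys)) for i in range(k)]
    beta = solve(xtx, xty)

    n, p = len(ys), k - 1
    y_bar = sum(ys) / n
    rss = sum((y - sum(b * v for b, v in zip(beta, x))) ** 2
              for x, y in zip(X, ys))
    tss = sum((y - y_bar) ** 2 for y in ys)
    # H0: all slopes are zero. F is near 1 under H0, much larger otherwise.
    f_stat = ((tss - rss) / p) / (rss / (n - p - 1))
    return beta, rss, f_stat


# Simulated data (assumption): two predictors, true coefficients 1, 2, -3.
rng = random.Random(0)
rows = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(200)]
ys = [1.0 + 2.0 * x1 - 3.0 * x2 + rng.gauss(0, 1) for x1, x2 in rows]

beta, rss, f_stat = multiple_ols(rows, ys)
print("coefficients:", [round(b, 3) for b in beta])
print("F statistic:", round(f_stat, 1))
```

With 200 observations and two predictors the system has a unique solution. With more predictors than observations, X'X no longer has an inverse, which is the reason the lecture gives for OLS and the F statistic breaking down in that setting.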
10. 10 Multivariate Linear Regression Feature Selection: Hello, everyone, and welcome
back to our class in artificial intelligence and
machine learning in finance. We are still in our discussion of multiple linear regression, and after having seen
the definition of multiple linear regression
and how it is estimated, we want to have a
look at the question of how to select the variables
to use in our model. Now, this is slightly different from econometrics,
where usually you have an economic theory that guides you in which
predictors to select. But here, in the realm of statistical learning
and machine learning, it is rather a question
of which predictors can increase the validity and
the fit of your model. That is why we will
select variables without much theory, and rather based on the
question of how much
correlation we can
see between the response and the predictors, and
which predictors help us get an even better
fit of our overall model. So we need to select
the variables, and this is the process of determining which predictors
are associated with the response, which
we should include in our model, and which
are the significant ones. We have several alternative
approaches at our disposal. The first one is that we
compare the fit of all models and calculate measures of
the models' overall fit. You can use, for example, the Akaike information
criterion, AIC, the Bayesian information
criterion, or the adjusted R
squared of our models. But this would
mean, in a sense, a brute-force approach where we estimate all possible models, and this becomes infeasible quite quickly, as the number of different models
for p predictors is two to the power of p. So if we only had 30
variables, 30 predictors, we would get around 1
billion different models, so you will not be
able to do this, especially when you have
thousands of variables. What we need is an automated and efficient
procedure to select a subset of models.
There are three classical approaches for
variable selection. The first one is
forward selection. You start with the null model, which only includes the intercept and no
predictor at all. You estimate P simple in your
regressions and then add to the null model the variable from the regression with
the lowest RSS, and you continue adding variables in this manner until some stopping
rule is fulfilled. So for example, you
would start with an intercept and then
maybe add X three, then you would add X one, and maybe your stopping
rule is already fulfilled, so you will stop with
just two predictors, and that's forward selection. Now, in contrast, backward
selection is the opposite. You start with the model that contains all potential
P variables, and you remove the variable
with the largest P value, and then you down and try to see which are
the variables that are insignificant and you continue
removing variables in this manner again until some
stopping rule is fulfilled. Nick selection is a
mixture of the two. You start with the null model, you estimate P simple in your regressions and
add to the null model, the variable from the
regression with the lowest RSS. Then you remove variables
that turn insignificant when new variables are added to the model because of correlations
between the predictors. I've told you in the last video that you should be careful when including or actually excluding variables
from a regression because it might be that, for example, variable X
one is just picking up the correlation of X three with the response rather than
X one and the response. Because of spurious correlation
between predictors, it might be that the coefficient estimates on your predictors are biased if you omit
important variables. If you now include variables, and suddenly another variable
turns insignificant, then in mixed selection, you might want to
consider removing that variable because
other variables are picking up the
same correlation. And then you continue
adding and removing variables until, again, a stopping
rule is fulfilled. So let's look at this in R. We are using the credit card data, and you see the
summary statistics for those different variables. You see the ID, which is just the number
of the observation, going from 1 to 400. Then we have
the income, the credit card
limit, the rating, the number of cards, the
age of the customer, education, gender, student, married, ethnicity,
and balance, which is our response variable
in all these regressions. But we start with a dummy
variable for student. Are you a student or not? We use the credit data and we try to predict the
balance of the credit card. And again, we use lm, the linear model function that is also used for
multiple regression. You can see here that the
dummy variable for being a student is highly
statistically significant. We have a t value of 5.35, and yes, we should
include this variable. You can also see that
the adjusted R squared for this simple linear
regression is close to 7%, so around 7% of the total variance is
explained by our model. Now let's do this with
the credit card limit, and the t value is much, much larger in this case. Again, this variable is statistically significant
in our regression. We can also look at the
multiple and adjusted R squared. Actually, it doesn't make much sense that it's called multiple
R squared here, since it's a simple linear regression. The R squared is close to 75%, and we can now see that
the credit card limit seems to have much
more explanatory power when it comes to the
balance of the credit card. So if we
were to choose, we should include limit
rather than student. If we can include just
one more variable, then we should also
include student. And last but not
least, the same for income: a high t value,
statistically significant. You see here, we
have three stars. We have a t value of
almost ten and an R squared of 21%, so we should also include this
in our regression. We now look at the AIC. You can see the Akaike
information criterion for these three models,
and with the AIC, lower numbers are
better: a lower AIC signals a
better model fit. The third one is the model with income, but actually the second
one has the lowest AIC, and we can now see how multiple linear
regression would fare if we include all
three variables. Let's do this. We
estimate a linear model, balance as the response
and income plus student plus limit as predictors, and all three variables remain statistically
significant and we can now see that just including
these three variables gets us to an adjusted R
squared of almost 95%, and the Akaike
information criterion is also much lower than for the
simple linear regressions. So we can see, yes, we should actually estimate multiple linear regression
with all three variables. And if we were now to include more variables, at some point, I guess we would arrive at some variables
that do not help any more. Okay. So now, how do we assess
the overall model fit? We've already seen the
adjusted R squared. The idea here is that if you
remember the R squared from the simple linear regression, as a matter of fact, by
including more variables, the R squared can only go up. So what we need to do is in
order to prevent overfitting, we need to punish our model for excessively using variables. And this is done in the so
called adjusted R squared. As you can see, it's
basically the R squared, but it will decrease, actually, and it will be punished for the inclusion of
unnecessary predictors. So any variable, any predictor
that is included but doesn't really
add to the model's fit will actually lower
the adjusted R squared. This is why, here, it's important to see that
the adjusted R squared is very close to the
multiple R squared, so all variables are actually contributing to the model's overall fit,
and this is a good thing. So as you can see from
the adjusted R squared, we should use this
multiple model.
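To make the penalty concrete, here is a minimal Python sketch (the lecture itself works in R). It assumes the standard textbook definition, adjusted R squared = 1 - (1 - R squared) * (n - 1) / (n - p - 1), with n observations and p predictors. The function name and all numbers are made up for illustration and are not the lecture's regression output.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    dof = n_obs - n_predictors - 1
    if dof <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / dof


# Illustrative numbers only (hypothetical, not taken from the credit data output).
n_obs = 400
three_predictors = adjusted_r_squared(0.9500, n_obs, 3)
# Seven extra noise predictors push the plain R^2 up by a tiny amount ...
ten_predictors = adjusted_r_squared(0.9501, n_obs, 10)

# ... but the adjusted R^2 goes down, because the penalty outweighs the gain.
print(f"3 predictors:  adjusted R^2 = {three_predictors:.4f}")
print(f"10 predictors: adjusted R^2 = {ten_predictors:.4f}")
```

In this made-up case the plain R squared rises from 0.9500 to 0.9501, yet the adjusted value falls, which is the behaviour described above for unnecessary predictors.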
11. 11 Multivariate Linear Regression Advanced Concepts: Hello, everyone, and welcome
back to our class in artificial intelligence and
machine learning in finance. Now in this video, I want
to talk a little bit about some additional details of the multiple linear
regression model. In the last video,
we've already talked a little bit about how
to assess the model fit. We've seen that, in contrast
to the regular R squared, we should use the adjusted
R squared, because when including more and
more predictors, more variables,
in our regression model, the R squared can
only go up. It can only increase, it will never decrease, so that at some point we'll include
unnecessary predictors that are irrelevant, and actually, this
will hurt the model in the sense that
we get overfitting. We have a better fit on paper, but actually we are including
unnecessary complexity in our model. So that's why in the
adjusted R squared, we take the R squared
and we correct the R squared by the number of variables
we have included. So actually adding
noise variables to an already well fitted model decreases the RSS only
by a little bit. The adjusted R squared punishes the model for every included
unnecessary predictor, every noise variable that
is included in our model. In this light, we can see that for
the credit data, the adjusted
R squared increases up to some point, and beyond it including more and more predictors doesn't really change
the adjusted R squared. Here,
I think for seven predictors, we get an optimal point after which the adjusted R squared
doesn't change too much. Actually, we could already
have stopped here, maybe with four or even
five predictors, since including additional
predictors doesn't help us in explaining the variation in our
response variable. So this is important: in multiple
linear regressions, use the adjusted R squared
rather than the R squared. Okay. Now after having fitted our model and after having estimated the
coefficients beta zero, beta one, beta two, et cetera, we want
to predict values. We want to forecast. Now,
how to do this, very simple. We take equation 15, which said that we can take
the coefficient estimates. We can enter the
values for X one, X two, X three, and so on. Then we get a prediction, a forecasted value, Y hat.
This is what we will do. You take the estimated
coefficients beta hat zero, beta hat one, and so on. And you plug in values for your predictors. These can
be actual data, but you can also make
sort of a simulation. Then if you compute equation 15, you get your estimate, your predicted value, your forecast for the
response variable. Now remember that obviously
the forecast will include a bias and this bias will stem
from three sources. First of all, from imprecise
estimates of beta zero, beta one, beta two, et cetera. So even when using OLS, even when using
ordinary least squares, we will have some imprecision in the estimation of
our coefficients because we don't
have infinite data. We only have a sample
size of let's say, 1,000, 5,000, even
100,000 observations. This means that there is still
some estimation error that will include or induce a
bias in our forecast Y hat. Second of all, the assumption underlying all of this is that we are using a linear regression
function instead of the true function F.
Now in the population, there doesn't just seem to be,
there is a relation
between X and Y, between our predictors, or the
predictor variable, and our response variable, Y. This relation is the function F. In our modeling approach, we've assumed that F is a linear function, that we can use a linear
regression model. Now, this induces model bias. It might be that F is
actually non linear. If that is the case, obviously, we'll have a bias that stems from the fact that
we've chosen the wrong model. And last but not least, our third assumption here
is that we don't have an exact relation, in the
sense that Y is not, with 100% certainty, an exact function, linear
or nonlinear, of the predictors. Our assumption is that Y, the response variable,
is a function of our predictors plus some random noise,
and that's the error term. We have made some assumptions on the error term, for example, that it has an
expectation of zero and constant variance, but we still have
the error term. Even if our model is correct, if it's a linear function, even if we've estimated the
coefficients perfectly, there is still some
irreducible error that stems from the fact that some noise will always be in the data, and
that's the error term. Never expect your predicted
values to be 100% perfect, because we have three
sources of errors, three sources of biases
in our predicted values. Okay. Now, how can we extend the multiple
linear regression model? The first extension is
qualitative predictors. Now, in the examples before, we've already seen some of them, but most of the time we have metric variables,
quantitative variables. For example, it could be income, it could be the price of a
stock, et cetera. But some variables
are actually qualitative. For example, age groups or gender
in our credit card data. So how should we deal with
qualitative predictors? If the predictor is a qualitative variable,
it's pretty simple. We can use a dummy variable. For example, if we want to code gender in a very
simple way, actually, we can also include
more gender types, but if we only use
two, female and male, it's X i equal to one if the i-th person is female
and zero otherwise. Then we get a dummy variable for a predictor with
only two levels, and we can include it in
our regression just like any quantitative
variable. If we have K levels, we can code this
in the following way. We only use K minus
one dummy variables. For example, if we want
to code, let's say, age in a very simple
fashion, we can say that dummy variable
X one is one for age smaller than, let's say, 20 years
and zero otherwise. X two is one for age larger than 20 years
and smaller than 40 years, and zero otherwise. If, for example, we only have three levels, then we need two dummy variables. So if both of these are zero, it's clear that, in this case, the person
is older than 40 years. We have X one for
the young people, X two for the people between 20 and 40, and if both are zero, it's clear that the person
is older than 40 years. And we can do this on
and on, for example with X three being
one for age between 40 and 60. Then it's clear that if all three dummy variables are zero, it's a person that is
older than 60 years. This is the way to code
qualitative predictors, but you need to be careful when interpreting the
estimated coefficients. Now, if you have an estimated beta coefficient for let's say a metric variable, it's clear that
you can interpret this coefficient
as follows: for example, a one-unit change in the predictor leads to a change of beta units
in the response variable. Now here, it's
clear that you have to interpret the coefficient
as a switch from zero to one. What happens if you no longer
have a male observation, but a female observation? Then this is the interpretation
of the coefficient. Okay. So that is a
qualitative predictor. How can we extend the multiple linear regression model even further? By including so
called interaction terms. For example, imagine
predictor X one increases or decreases the effect of a different predictor
X two on our response. To measure this synergy effect, as it's called in marketing, we can use a so called
interaction term. We set up our regression
model: response equals intercept plus
slope one times X one plus slope two times X
two, plus a third coefficient, the interaction
term beta three, times the product
of X one and X two. This is how you can do it
with an interaction term. Again, be careful when interpreting the
coefficients beta one, beta two, and beta three. Beta three is the
interaction term. It
measures how the effect of X one on Y changes when X two changes. The coefficients beta one and beta two now also measure
something different. Actually, this is even
more important when you're using
qualitative predictors. Beta one measures the
effect of X one when X two is zero. This is the isolated
effect of X one if X two is zero, and beta two measures the
effect of X two on Y if X one is zero. This is a little bit tricky. Also, to correctly measure
the interaction term, you must include X one and X two in the regression
on their own as well. This equation here is correct in the sense that we
have included X one, X two, and their product. This is particularly interesting when you have an
interaction between a qualitative and a
quantitative predictor. For example, imagine X one is income X two is
gender, male, female. In this case, the
interaction term beta three would measure the effect of gender
on the effect of income on, say, health, if the
response is health. How much does gender affect the effect of income on
health? Quite interesting. Beta one and beta two would
be the isolated effects, for example, for a
male and female. That's quite interesting
when you combine a qualitative and
quantitative predictor. What else to do? Now we've seen we are talking
about a linear model. What to do if, for example, we see that the effect of X on Y is probably
a nonlinear one? Well, you can use a
polynomial regression. Let's say predictor X one
influences Y nonlinearly, we can use a
polynomial regression. That is, we include X one and the squared values of X one, X one squared. Now, note that including nonlinear
terms such as X one squared does not change the fact that this is still
a linear model. It's still linear in
the sense that we have the intercept plus
X one plus X two. We could have also written X one squared as, let's say, X two or X three. We could have redefined
our variable. We don't need to
know that this is actually a non linear
term of X one. So it's still a linear model. We can use OLS, doesn't change anything, but
it can lead to a better fit. Actually, using
higher polynomials can lead to such a
good fit that again, as was the case with
unnecessary noise variables, we get the problem
of overfitting. This is a problem,
but otherwise, if we suspect a non
linear influence, we can try polynomial
regression. Let's look at this in
R. First of all, we load the library ISLR, and the linear model
is quite simple. We use the car data, the Auto data, so we are trying to explain
miles per gallon, how many miles you can get out
of one gallon of gasoline. And we explain this
with horsepower plus the squared value of
horsepower. Let me highlight this for you.
Here it is. You need to use the function I. You cannot simply
include horsepower hat two because the hat has a different meaning in an R formula, so you need to use
this function I here. Not very nice highlighting, by the way. You can
see that this is a model Y equals
X plus X squared. This is the call.
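As a rough Python counterpart to the R call narrated above (presumably something of the form lm(mpg ~ horsepower + I(horsepower^2), data = Auto), which is reconstructed from the narration and not shown in the transcript), the sketch below fits a quadratic by ordinary least squares. It uses synthetic, noise-free data instead of the Auto data, and the helper names are hypothetical. The point it illustrates is that the squared term is just one more column of the design matrix, so the model stays linear in its coefficients.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(rows, y):
    """OLS coefficients from the normal equations (X'X) beta = X'y."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Synthetic data (not the Auto data): y = 50 - 4 x + 0.1 x^2, without noise.
x = [float(v) for v in range(1, 11)]
y = [50.0 - 4.0 * v + 0.1 * v * v for v in x]

# Columns: intercept, x, and x squared treated as an ordinary predictor.
design = [[1.0, v, v * v] for v in x]
beta = fit_ols(design, y)
print("intercept, linear, quadratic:", [round(b, 6) for b in beta])
```

Because the data are generated without noise, the fit should recover the three coefficients used to generate them, up to floating-point error.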
These are the results. You can see the residuals here, and then these are the
coefficient estimates. The coefficient estimates
here clearly show that the squared horsepower variable, the polynomial term, is also highly statistically significant
in our regression. So actually, if you were only to include
horsepower and then estimate a second model
with horsepower and its square values as
a second predictor, you would also see
that the multiple and adjusted R squared
increase considerably. So yes, it seems as if horsepower and a polynomial term
are needed to better explain the miles per gallon
response variable. Okay. Now, in this plot, you can see three
different models. You can see the linear OLS
fit, the linear model. You can see a
polynomial regression using a second degree polynomial and also a polynomial
term of degree five. Now, as I said, it is often the case
that there might be a non linear effect between predictor and
response variable. However, including more
and more polynomial terms in this regression will ultimately
lead to overfitting. This can be seen
with the green line. Actually, you can see
that such a line like the blue line is fully
sufficient to explain the data. You don't need something that is wiggly like the green one. So actually, including
horsepower squared is sufficient to
explain the data. You don't need a polynomial
term of degree five. So overfitting is the case
here with the green line, but also the linear model
underfits the data, which is quite clear. So it's okay, but a polynomial
regression is much, much better suited and
gets a better fit. Now, what else can happen in
multiple linear regressions? Actually, there are a couple of problems that I only want
to briefly comment on. The first one is non linearity of the response
predictor relationship. We have a non linear
relation between X and Y. We've seen this, one can use residual plots to identify
this problem and then, for example, use a
polynomial regression or more sophisticated models. The second problem is a likely correlation
of error terms. Now, the idea and
the assumption in the OLS model is that the
error terms are uncorrelated. If they are correlated,
we have a problem. Our model assumptions
are no longer true, and we can only use good causal inference, or good experimental
design if we have experimental data, to counter this, because usually
a correlation of error terms stems from the fact that we have omitted
variable bias, that is, we've missed out one
important predictor, or we have reverse causality: it's not only X that is driving Y, but actually Y is
also driving X. So we have reverse
causality, as we call it. And all of these
problems, omitted variable bias and
reverse causality, lead to a correlation
of error terms because the error terms are picking up something that
should have been included, and then we have a problem
in our estimation, and we use causal inference
to tackle this problem. We are likely not talking
about this in this lecture, but this is a huge problem
in empirical analysis. The third problem is a non
constant variance of error terms. One assumption is
homoscedasticity, meaning that the error terms
have a constant variance. If it is not constant,
for example, if we have an increasing variance with more and more observations, one central assumption of
the OLS model is violated, and we can use, for
example, weighted least squares. Then we have outliers and
high leverage points, outliers meaning extremal values of our response variable. And high leverage points being extremal values of one
predictor variable. What we can do is we
can try to detect these outliers and
high leverage points and then, for example, use winsorization or simply exclude these
data points from our analysis, with
all the problems that are attached to
these methods. And last but not least, we have collinearity,
or multicollinearity, meaning that X one and
X two, for example, share a high correlation, and this means that one of
the variables is redundant. You don't need X one and X two. The information that
is needed to explain our response variable is already included in
X one, for example. Then we can exclude X
two, we can drop it, or orthogonalize this
collinear predictor. And how can we decide
whether we have collinearity? We can use the so called
variance inflation factor, and I'll get to
this in just a bit. So let's start with the non linearity of the
response predictor relationship. We can look at the
residual plots. For example, here,
on the left is the linear fit, and the other one is the quadratic fit with
the polynomial regression. As you can see, the residuals
plotted against the fitted values should actually look
rather like this. In this case, we
would say, okay, there's no clear trend,
it's around zero, and we don't see that for very small or
very large fitted values, we have a certain trend
in the residuals. Now, with the linear fit, we can see the red line, that's the mean fit. You can see that,
yes, there is a trend. It starts out very high, goes back to zero and
then rises again. This is an indication of
a non linear relation. And if we do the same for
the quadratic fit, we can see it's almost constant, and this is
how it should be. From the residual plot of the polynomial regression,
we can see that yes, it seems
as if the relation is non linear and a quadratic model is much better suited
than a linear one. Now, let's go to the
second problem we have, which is the correlation
of the error terms. Now, I told you that this
usually stems from biases like omitted variable bias
reverse causality, and we cannot do
too much about this without using more sophisticated models for causal inference. However, how can we decide whether we have such
a problem here? To see this, this is a plot, these are three plots where
we have simulated data with different correlations between adjacent points and
adjacent residuals. Now, if we simulate data and our assumption of uncorrelated
error terms is fulfilled, that is, we have a correlation of zero, then you can see that we
have a residual here. The next one can be
here, can also be down. There is no clear
trend that, for example, this observation leads
to another observation, another residual,
being here or here. Now, if we increase the
correlation between the error terms to 50% or even 90%, you can see that we have trends, and it looks like a time series, and this is a problem. Now, here you can
see it's highly likely that when you have
a negative residual, the next one will
also be negative. And this is due to the fact that the error terms, by
construction in this case, are highly
positively correlated. So this is how you can try to decide whether there is a violation of
our assumption: use a residual plot. Now, the third one is non constant variance
of error terms. You can see one example where you have the fitted values and the residuals, and actually, if you only plot the
mean of the residuals, you would say, well,
there is no trend. However, the dispersion, the variance of the error
terms, increases dramatically across the data. And this is what you can see
from this funnel shape of the residuals, in contrast to the one you see on the right. Now, in the left picture,
in the left plot, we clearly have the problem
of heteroscedasticity. The variance of our residuals, of the error terms in the data,
increases from left to right. But one simple
remedy for this is to use the log of our response. Actually, if we do this simple transformation,
you can see that it almost looks
like the variance stays the same over
our full sample. So a very simple remedy. It might
not always be as simple as this, but this is, again,
how we can decide whether heteroscedasticity
seems to be a problem. Next, we have outliers. Now, outliers are
extreme values of Y. For example, this
observation here, number 20, as you can
see, is an outlier. Now, with the outliers, the problem is that, if we have a look at the
next plot here, with the
red line and the blue line, those are the linear fits for the data with and
without this outlier. And as you can
see, the blue line, for which we fitted
a linear model to the data and excluded the outlier, doesn't
change too much. Actually, the outlier for this response value doesn't really change our linear model. It doesn't change the linear regression line that
we are fitting. However, the problem is that including the outlier can
lead to other problems. For example, it will lead to an increase in the
residual standard error, so the RSE will increase. Actually, the adjusted R squared will decrease when
we include the outlier. The model fit is getting worse. So we could think about
excluding this outlier. You can also identify
this outlier by looking at the residuals
in the middle plot here: again, 20 is far away from
the remaining residuals. And if we studentize
the residual, that is, we take it and divide
it by its estimated standard deviation, it is still an outlier, so yes, one could think about
excluding this outlier. Same thing for the
high leverage point, which is number 41 here, observation 41, and this is an extremal value for
X for a predictor. Again, you can see
that actually, if you plot X one and X
two against each other, this is a high leverage point. Now with high leverage
points, you can see here that the regression line changes. In this case,
not too dramatically, but yes, it does change, and this is the problem
with high leverage points. That's why they are called
high leverage points. One single observation
can affect the estimation of our regression line, our regression model and the coefficients
dramatically. Actually, the
coefficient estimates are heavily affected by
high leverage points. Again, one could think about
excluding this observation. For this, we can also estimate the so called leverage
and can see observation 41 has a high leverage
in contrast to the outlier
observation number 20. Last but not least, collinearity. Collinearity means
that X one and X two, for example, are strongly
correlated with each other. And here we are using
the credit card data, and this is limit and
age, and as you can see, those two variables are not really strongly correlated
with each other. It's different with
limit and rating. As you can see, there's a strong linear
relation between the two, and including limit and rating in a regression model
will lead to problems. Why? Because if you remember the estimation of
the coefficients in the multiple linear
regression case, you remember that the
coefficient estimates involve the matrix X transpose times X, and then you have to invert
X transpose times X, and you will get
numerical problems. Now, if two variables are strongly correlated
with each other, you can imagine that
in linear algebra, these two variables form two
columns of our matrix X. If these two are almost
linearly dependent, the information in one column is
redundant. You can leave out one column. If you're not doing
this, the
problem is that inverting this matrix becomes
numerically unstable, and this is what will happen. So the coefficient estimates are strongly affected by this because you now run into numerical problems.
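A small Python sketch of why near-collinear columns make X transpose times X hard to invert. It is only an illustration under simplifying assumptions: two centered predictors and no intercept column, so X'X is a 2 by 2 matrix. The data and function names are invented for this example and have nothing to do with the credit data.

```python
def gram_2x2(x1, x2):
    """Entries of X'X for two centered predictors: (s11, s12, s22)."""
    m1 = sum(x1) / len(x1)
    m2 = sum(x2) / len(x2)
    c1 = [v - m1 for v in x1]
    c2 = [v - m2 for v in x2]
    s11 = sum(a * a for a in c1)
    s22 = sum(b * b for b in c2)
    s12 = sum(a * b for a, b in zip(c1, c2))
    return s11, s12, s22


def inverse_diagonal(s11, s12, s22):
    """Determinant of X'X and the diagonal of its inverse."""
    det = s11 * s22 - s12 * s12
    return det, (s22 / det, s11 / det)


x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2_unrelated = [2.0, -1.0, 1.0, -2.0, 1.0, -1.0]   # only weakly related to x1
wiggle = [0.01, -0.01, 0.01, -0.01, 0.01, -0.01]
x2_collinear = [2.0 * v + w for v, w in zip(x1, wiggle)]  # almost exactly 2 * x1

for label, x2 in [("unrelated", x2_unrelated), ("collinear", x2_collinear)]:
    s11, s12, s22 = gram_2x2(x1, x2)
    det, inv_diag = inverse_diagonal(s11, s12, s22)
    # det / (s11 * s22) equals 1 - r^2: the share of x2 not explained by x1.
    print(label, "1 - r^2 =", det / (s11 * s22), "inverse diagonal =", inv_diag)

det_u, inv_u = inverse_diagonal(*gram_2x2(x1, x2_unrelated))
det_c, inv_c = inverse_diagonal(*gram_2x2(x1, x2_collinear))
```

In the collinear case the determinant is close to zero relative to the size of the entries, so the entries of the inverse blow up. Since the coefficient variances are proportional to the diagonal of that inverse, small changes in the data can then move the estimates a lot.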
And this is shown here. If, for example, you have
beta limit and beta age, from the earlier plot we saw that those two variables, those two predictors,
are not collinear. And actually, it's
quite easy to find the OLS optimum, which is here. These are contour plots,
the ones for the RSS. Again,
you remember this, we are minimizing the RSS. You can see
that it's quite simple, for example for an optimization algorithm: if you
start out here, it's quite simple to find
this point, because this is the point where
the RSS becomes minimal. Now this is the same plot
for limit and rating. As you can see, the
contour plots become quite nasty and it's now
much more difficult to find the OLS estimates of
these two coefficients. Why? Because the function, the RSS function that
needs to be minimized, looks rather, how
should I put it? This is not as nice
to optimize as, for example, this left plot. So it's a numerical problem, collinearity or
multicollinearity. What should you
do? Well, you can simply leave out
limit or rating. The information is already included in the other variable. You can decide on this using the so called variance
inflation factor, and you can also
orthogonalize it. You regress X two on X one
and you use the residual. So you are only using the
information in X two, for example, that is
not already in X one. That's one possibility as well. So this is collinearity, and this is all I
wanted to talk about in this section on multiple
linear regression. In the next chapter, we'll look at penalized linear
regression models, especially lasso and ridge regression.
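The two remedies mentioned at the end of this section can be sketched in a few lines of Python. This is a minimal illustration, assuming only two predictors, so that the variance inflation factor of X two is 1 / (1 - R squared) from regressing X two on X one. The data and helper names are made up. A common rule of thumb treats a VIF above roughly 5 or 10 as a sign of problematic collinearity.

```python
def center(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def regress_on(target, predictor):
    """Simple OLS of target on predictor with intercept.

    Returns the slope, the residuals, and the R^2 of that regression.
    """
    t, p = center(target), center(predictor)
    slope = sum(a * b for a, b in zip(t, p)) / sum(b * b for b in p)
    residuals = [a - slope * b for a, b in zip(t, p)]
    rss = sum(r * r for r in residuals)
    tss = sum(a * a for a in t)
    return slope, residuals, 1.0 - rss / tss


def vif(r_squared):
    """Variance inflation factor: 1 / (1 - R^2)."""
    return 1.0 / (1.0 - r_squared)


# Made-up predictors: x2 is nearly 2 * x1, so the two are strongly collinear.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

slope, x2_orth, r2 = regress_on(x2, x1)
vif_x2 = vif(r2)

# The residual x2_orth is the part of x2 that is not already in x1:
# its inner product with the centered x1 is zero up to rounding.
overlap = sum(r * c for r, c in zip(x2_orth, center(x1)))
print("VIF of x2:", round(vif_x2, 1), "overlap with x1:", overlap)
```

Using x2_orth in place of x2 in the regression keeps only the information in X two that X one does not already carry, which is the orthogonalization idea described above.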
12. 12 Regularized Regression Methods: Hello, everyone. My name is Kibo Weiss. I'm a professor of
Finance here at Leipzig University,
and in this video, I want to show you how penalized linear
regression models work as part of our class in artificial intelligence and
machine learning in finance. So we've already seen multiple linear regression
in action and its details, and we now want to
move forward to penalized regression.
As the name suggests, you can imagine that it is a kind of regression analysis that penalizes in its
objective function certain aspects of our modeling. And if you recall the problem of model selection or
variable selection, we saw that one can do
subset variable selection, either forward or backward
stepwise selection. That is, you start, for example, with the null model and
then you slowly include more and more variables, or you start with the
model that includes all P predictors and you exclude those predictors that don't add too much to the model's fit. Now, the alternative
approach here, which is much more feasible than subset
variable selection, is shrinkage methods,
or alternatively, as we call them,
penalized regression. So what is penalized regression? Actually, in penalized
regression models, we use all P predictors. That is, we use all
variables that might explain our response variable and we
use them all in one model. And in contrast to subset
variable selection, the coefficient estimates are constrained, or, as we also
call it, regularized. This means, in other words, that the different algorithms shrink the coefficient estimates
towards zero so that some of the
variables actually have a coefficient that is close to zero or equal to zero, but all variables
are included from the start, and actually we'll see that in the case of ridge regression, all variables, all
predictors, actually stay in the model, because most of the coefficients will
be close to zero, but they will not
be equal to zero. These are penalized
regression models, and we'll start with so
called ridge regression. As a reminder, in ordinary
least squares, we try to minimize the
residual sum of squares, the RSS, which
here is given as the sum of the squared differences between y i, the
response variable, and the intercept plus the sum of the coefficients
times the predictors. Actually, if you look
at this equation, we can see the response here, and this is nothing but our estimate,
our prediction. This is, for example, f hat of X, which we could also call y hat i, and then we take the difference, we square it, and we sum it all up and we get
ordinary least squares. That's the case of
our previous section. Now in this variant of
OLS, ridge regression, we still minimize the RSS, but we include this
additional component, which is the tuning parameter lambda times the sum of the
squared coefficients, beta j. Be careful that we do not
include the intercept here, but we only include the slopes for all those P predictors. We start at j equal to one, and we do not touch beta zero. In other words, it's
simply the RSS, which we try to minimize, but we have this extra
component, lambda, the tuning parameter, times the sum of the
squared coefficients. Now, what does this mean? It means that, because we are trying
to minimize this, and the squared coefficients
are non negative and the tuning parameter
is also non negative, it's larger than or equal to zero, we are penalizing coefficients that are not equal to zero. Ideally we would see
that at some point, the coefficients will
tend to go to zero, and this is governed by the
tuning parameter lambda. Now, as you can
see, if lambda is chosen to be equal to zero,
then we get OLS. OLS, ordinary least squares,
is actually embedded in ridge regression as
the special case for lambda being equal to zero. That's ridge regression.
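The shrinkage can be seen in the simplest possible case. The Python sketch below assumes a single predictor and an unpenalized intercept, in which case minimizing the RSS plus lambda times the squared slope has the closed-form solution slope = S_xy / (S_xx + lambda), where S_xy and S_xx are the centered cross-product and sum of squares. The data and function name are invented for illustration.

```python
def ridge_slope(x, y, lam):
    """Ridge slope for one predictor with an unpenalized intercept.

    Minimizes sum (y_i - b0 - b1 * x_i)^2 + lam * b1^2,
    which gives b1 = S_xy / (S_xx + lam).
    """
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    return s_xy / (s_xx + lam)


# Made-up data, roughly y = 2 x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.2, 3.9, 6.1, 8.0, 9.8]

lambdas = [0.0, 1.0, 10.0, 100.0, 1000.0]
slopes = [ridge_slope(x, y, lam) for lam in lambdas]

for lam, slope in zip(lambdas, slopes):
    print(f"lambda = {lam:7.1f}  slope = {slope:.4f}")
```

With lambda equal to zero the slope is the ordinary least squares estimate. As lambda grows, the slope shrinks towards zero but never reaches it exactly for any finite lambda, which matches the statement that all predictors stay in the ridge model.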
As with OLS, ridge regression seeks to find a model that, first of all, fits the data well, so we're trying to minimize the RSS. But we also have this shrinkage penalty, lambda times the sum of the squared coefficients, and this is small when the parameters beta_j, that is, the slopes and not the intercept, are close to zero. The parameter lambda controls this process of shrinking the parameters towards zero. As I told you, if lambda is equal to zero, we get the ordinary least squares estimates, and as lambda tends to infinity, the ridge regression coefficients will all approach zero, because then the effect of this shrinkage penalty becomes infinitely high and the coefficients are shrunk towards zero. So this is what happens to the ridge regression coefficients. But as you can see, with every different lambda we get a different set of coefficients. So one has to be careful when comparing OLS with ridge regression, because OLS fits the model to the data only once, and we get one set of coefficients. With ridge regression, we repeat this for several values of lambda. We'll compute ten, 50, maybe even 100 models, and we get different coefficients for different values of the tuning parameter lambda. To see how this works, we can have a look at this plot of different coefficients
for different lambdas. On the left-hand side, you can see that we have lambda on the x-axis and the standardized coefficients for variables like income, limit, rating, and student on the y-axis. You remember, these were predictors used to predict the balance on a credit card, and we're using the Credit data from the James, Witten, Hastie and Tibshirani textbook. As you can see, if we start out here with lambda close to zero, those should be the OLS coefficients for those predictors. We also have a couple of other predictors here, but they are already very close to zero. If we now increase lambda, you can see that some of the predictors are pushed towards zero quite early. For example, income starts to decrease very early, and limit starts even before that, whereas student remains constant for some time and then goes down at some point. On the right here, you can see that at some point lambda is so high that all those regression slopes are close to zero. On the right-hand side, you can see just a different way of telling the story. You see the ridge regression coefficients, beta, for different lambdas, relative to the OLS coefficients. Here, where the two are the same, we have the OLS coefficients. Most of these predictors stay close to zero, but some, like income, limit and rating, are significant in OLS and also in the ridge regression. The question is, first of
all, what should we do, and what should we be careful about? With OLS, you might know that the coefficient estimates are scale equivariant, which means that if you multiply a predictor by a constant, the OLS coefficient is simply scaled by one over that constant, so the product of the coefficient and the predictor stays the same. This is not true for ridge regression. There, the coefficients can change substantially when the predictors are multiplied by a constant. Thus, we should always use standardized predictors. That is, we take our predictor values x_ij and divide them by the square root of the sample variance, that is, by the sample standard deviation. As you can see here, this is done for the j-th predictor: we take the standard deviation of the j-th predictor and divide each observation by it, so that each standardized predictor has a standard deviation of one in our sample. So that's what you should always do. Now, one could ask, and this is the
obvious question, why should we care? Why should we consider using ridge regression, and what are its advantages over ordinary least squares? Well, the answer lies in the bias-variance trade-off. Recall the mean squared error, which we've already seen: it was simply defined as one over n times the sum of the squared errors in our data. If we compute this on the training dataset, we get the training mean squared error. Later on, we'll see that we should actually take a model that has been fitted to the training data and check the mean squared error for a previously unseen test observation. Let's call it x_0 and y_0. Then we get the test MSE, which makes sure that we are not checking the validity of our model on the data we've used for fitting the model. Now recall the training or
the test mean squared error. Obviously, this is a metric of how well our model fits the data. Now, one can show the following if we are interested in the expected test mean squared error, that is, the error we expect on average if we square the error of a test observation that has not been used in fitting the model. This is on the left-hand side here; I'll highlight it for you. This expected test mean squared error can be decomposed into the variance of f hat at x_0, plus the squared bias of f hat at x_0, plus the variance of the error term. We cannot do much about the variance of the error term. But we can see that we have the variance of our fitted model at the test observation plus the squared bias. So, to minimize the test mean squared error on the left-hand side, we have two levers. The first one is variance. It refers to the amount by which our estimated function f hat would change if we estimated it using a different training dataset. Now, assume we have this dataset and this test observation. If we fit our model to this dataset and check the test mean squared error using this test observation, then obviously we get one mean squared error. But what would happen if we had a different training dataset and the same or a different test observation? Would we still get the same error? The variance refers to the fact that if we use different data samples for training our model, we will probably get a different estimate for the response at x_0, so we get a different
mean squared error. The bias, on the other hand, refers to the error that is introduced by using a model, for example a linear model, to approximate the unknown function f. Again, we want to estimate a function f, and we don't know ex ante whether f is linear. It could be polynomial, or any other non-linear function. So, given this training sample, what error is introduced by the assumption, made in OLS but also in ridge regression, that we can use a linear model to estimate f? That's the bias. So the expected test mean squared error can arise, first of all, because we have the wrong model, and that is the bias. Or it can arise even though we have the right model: f is approximated correctly by, for example, a linear model, but if we use different training data, we still get a slightly different approximation. You can see this quite easily if, for example, we assume that we have a training dataset that only includes n equal to five observations, and then again five different observations. It's clear that we might have the correct model, but because we are using a finite sample, with only five observations in our training sample, there is a lot of variability in our model estimates, and that's the variance. So we have two things we need to care about: variance and bias. And typically, if you decrease variance, you increase bias, and vice versa. You usually cannot reduce both at the same time; it's a trade-off. If you have lower variance, you will probably have higher bias. Now, this is shown
with simulated data. Let me get my red line here again. These black circles are the simulated observations. The true model is given by the black curve. It's a polynomial function, and that's the true model. But because we are simulating from this model, we still have some randomness in our observations, given by the black circles. Now we can use a linear model, that's the one in orange here, and we can use smoothing splines, which are very similar to, let's say, polynomials. You might know what a spline is. I don't want to go into detail here, but these are the smoothing splines, in blue and green. You can see the blue line is doing okay. It's pretty close to the black curve, so that's probably our best model. And you can see the green curve is close to the black circles, but quite far away from the black curve. It's close to the points here and here, but it's fitting the noise in the data, the noise introduced by simulating from the black curve. So what can we see on the right-hand side? We can see the flexibility on the x-axis and the mean squared error on the y-axis. The red curve is the test MSE, and the gray curve is the training MSE for those different models. Here you can see the OLS
model, the linear model. As you can see, the training MSE and the test MSE are quite close here, and both are quite high. If we use the smoothing spline, the blue curve, you can see that they are still quite close together, and lower. And we can reduce the training MSE further by using a much more flexible model. For example, we can make it very extreme and use a curve that goes through all these black points, but you can see it's rather wiggly, and you get the idea that you're just fitting noise. What happens is that the training MSE goes down because we're using a more flexible model, but as we are fitting the model to noise, if we now use a different dataset, suddenly the test mean squared error goes up. This shows you what happens. By moving from the blue curve to the green curve, we are decreasing the bias. There's no bias in this red curve now: it fits the data points perfectly well. But this significantly increases the variance, because if we now use the model on a different dataset, it no longer fits the data well. So that's the bias
variance trade-off. Coming back to ridge regression: if we now increase lambda, we decrease the flexibility of the ridge regression fit, so the variance decreases and the bias increases. This is shown here: we can see the squared bias in black, the variance in green, and the test MSE in purple, which, as we've seen before, can be decomposed into these components. This shows that the test MSE can be reduced by moving from OLS to this point, which gives us an optimal tuning parameter lambda, because now we can decrease the test MSE from, let's say, maybe 48 to maybe 35. This is what happens.
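For reference, the decomposition of the expected test mean squared error used in this argument can be written as below. This is the standard form; the symbol epsilon for the error term is an assumed name, the rest follows the notation used above.

```latex
\mathbb{E}\!\left[\bigl(y_0 - \hat{f}(x_0)\bigr)^{2}\right]
  \;=\;
  \operatorname{Var}\!\bigl(\hat{f}(x_0)\bigr)
  \;+\;
  \bigl[\operatorname{Bias}\!\bigl(\hat{f}(x_0)\bigr)\bigr]^{2}
  \;+\;
  \operatorname{Var}(\varepsilon)
```

Only the first two terms depend on the fitted model. Increasing lambda lowers the variance term and raises the squared bias term, and the best lambda is the one where their sum is smallest.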
Ridge regression allows us to exploit the bias-variance trade-off and to reduce the test MSE by choosing the tuning parameter that minimizes the test mean squared error. This is quite clear here from the ridge regression plot. Another advantage arises if p is larger than n, that is, if we have more predictors than observations. Then the OLS estimates do not even have a unique solution, and we might be inclined to use ridge regression in the first place. It also has computational advantages over OLS combined with best subset selection. In ridge regression, we use all predictors and fit one model per value of lambda. With OLS, choosing predictors would be done by best subset selection, that is, we would need to estimate two to the power of p models, and this quickly becomes infeasible. But ridge regression has one serious disadvantage. It will include all p predictors in the final model, and almost always the coefficients will not be exactly equal to zero. This is why one can then move on to the so-called lasso. Lasso is the least
absolute shrinkage and selection operator. Ridge regression, as we've seen, will always generate a model that includes relevant and irrelevant predictors, because most of the coefficients will not be equal to zero. However, we can remedy this by using the lasso. The lasso coefficients minimize the following quantity. Again, we minimize the RSS, but we use a slightly different penalty term here. We do not use the squared coefficients; we use the absolute values of the coefficients, with lambda again being a tuning parameter. Now, this seems to be almost equal to ridge regression. The difference, in short, is that the lasso uses the so-called L1 norm, the absolute values of the coefficients, whereas ridge regression uses the L2 norm to penalize coefficients that are too large. The advantage of the lasso is that using the L1 norm forces some coefficients to be exactly equal to zero. These predictors will not show up in our final model fit. Hence, the lasso performs variable selection, and the fitted models are sparse. They do not include all predictors, and this makes interpretation of the final result much easier than with ridge regression. This is what it looks
like for the credit data. We've seen that with ridge regression, those parameters slowly approach zero. But here you can see that, for example, the coefficient on income already becomes zero here, and so do all those other predictors. They have coefficients that are exactly equal to zero at this point, and this is the difference to ridge regression. To explain why this happens, we can simply look at this plot, which gives you the contour lines of the RSS. This point here, the black dot, is the OLS coefficient estimate. That's the minimum of the RSS if we don't add any constraints. What now happens with ridge regression and the lasso (this is the lasso, and this is the ridge regression) is that we're using the L2 and the L1 norm and we only allow coefficients that lie, for example, in this turquoise diamond or in this circle. To show you what happens with ridge regression: if we only allow these coefficients, then we are looking for the minimal RSS on the border of this circle. If we now increase the circle, we are again looking for the minimal RSS on its border. Let me redraw the last one; this one is too small, this one is too large. For example, this will happen: you can see that now we have a point where the circle touches a contour line, and that is the optimal point, the minimum including the penalty term. This would be beta one, and this here would be beta two. If we change our tuning parameter so that we have a larger circle, which corresponds to a smaller lambda, then it would probably touch a contour line that looks like this, and we would get different non-zero coefficients beta one and beta two. Now you can see why the lasso forces some coefficients to be zero. The idea is the same, but because we're using a diamond with corners on the axes, it will often touch a contour line of the RSS at a corner, that is, at a point where beta one or beta two is zero. In this case, beta one is zero and beta two is not equal to zero. This is why, geometrically speaking, the lasso forces some of the coefficients to be zero. Which performs best, ridge or lasso? Well, neither ridge
best rich or less? Well, neither rich
regression nor the lesser will universally
dominate the other. As a rule of funk, you
should use lesso if you have a relatively small
number of predictors that are expected to have
non zero coefficients. For example, if you 1,000 predictors and you
only expect five, ten, maybe 20 of them to
have non zero coefficients, you should use lesso because
you're not interested in hundreds of coefficients to be close to zero but
not equal to zero. Rich regression, you should use rich regression if the
response is a function of many predictors of let's say all 100 or 1,000 coefficients, predictors and all coefficients are roughly of equal size, then you should use
rich regression. Next, we'll have a look at the application of these
molets in practice.
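The difference between the two penalties can be made concrete in a simplified special case. The sketch below assumes orthonormal predictors and no intercept, which is an assumption made here for illustration and not the setting of the credit data. In that case, with the penalties written as lambda times the sum of squared or absolute coefficients, ridge scales every OLS coefficient by 1 / (1 + lambda), while the lasso moves it towards zero by lambda / 2 and sets it to exactly zero once it is within lambda / 2 of zero. The function names are hypothetical.

```python
def ridge_shrink(beta_ols, lam):
    """Ridge coefficient for orthonormal predictors: proportional shrinkage."""
    return beta_ols / (1.0 + lam)


def lasso_shrink(beta_ols, lam):
    """Lasso coefficient for orthonormal predictors: soft thresholding at lam / 2."""
    threshold = lam / 2.0
    if beta_ols > threshold:
        return beta_ols - threshold
    if beta_ols < -threshold:
        return beta_ols + threshold
    return 0.0  # small coefficients are set to exactly zero


# Hypothetical OLS coefficients: two large ones and two small ones.
ols_coefficients = [3.0, -1.5, 0.4, -0.2]
lam = 1.0

ridge = [ridge_shrink(b, lam) for b in ols_coefficients]
lasso = [lasso_shrink(b, lam) for b in ols_coefficients]

print("ridge:", ridge)  # all coefficients shrunk, none exactly zero
print("lasso:", lasso)  # the two small coefficients become exactly zero
```

With these inputs, ridge keeps all four predictors in the model with smaller coefficients, whereas the lasso drops the two weak ones, which is the variable selection behaviour described above.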
13. 13 Elastic Net and Cross Validation Techniques: Hello, everyone. Welcome back to our class in artificial intelligence and machine learning in finance. After having seen the lasso and ridge regression, one can ask oneself whether one can combine the two, as the most well-known types of penalized regression models. And yes, you can combine them to overcome some of the shortcomings of both the lasso and the ridge regression we've seen in the previous video. This combination is called elastic net regularization. The elastic net is quite simple. You take, again, the RSS, the residual sum of squares, and then you add the penalty terms from both the lasso and the ridge regression. As you know by now, these are the sum of the absolute values of the coefficients, seen here with my cursor, with tuning parameter lambda one, and the sum of the squared coefficients, with the second tuning parameter lambda two. Both are added to the residual sum of squares. Then, again, we minimize this penalized sum of squares, and the coefficients beta that we find are our coefficient estimates for the elastic net regularization. As I've said before, the elastic net overcomes some of the shortcomings of both the lasso and the ridge regression, and later on in the application we'll see all three: the lasso, the ridge regression, and the elastic net. Now, the question we haven't talked about is how to choose the tuning parameters for the lasso, the ridge regression, and of course the elastic net. Now, recall the
distinction between the test error rate and the training error rate. Recall that if we have a data sample, especially in machine learning and statistical learning, it could be that we already have a training dataset and a test dataset, or we artificially divide the sample up into training observations and test observations. Now, if a sufficiently large test dataset is available, then of course the training and the test error rate are easy to calculate. What are the test error rate and the training error rate? Well, if you fit a model and you look at the error of the model on your training data, that is, on the data you've used for fitting the model, then you have the training error rate. Usually, the training error rate will be small, because a measure of the training error, for example the RSS as a very simple measure of model fit, is exactly what is minimized in order to find the coefficients. If you then take the fitted model to new observations, to test observations, you can again calculate some metric of your errors, and you get the test error rate. If you have sufficiently large test data, then it is easy to calculate the training and the test error rate. Often, however, this is not the case. For example, you might only have a few test observations, only one test observation, or no test observations at all. A possible remedy in this scenario is the so-called validation set approach, which means that you randomly divide the available set of observations into two parts. You declare one part to be the training set and the second part to be the validation set, sometimes also called the holdout set. Then the model is fitted on the training set, and the fitted model is used to predict the responses for the observations in the validation set. This is the validation set approach. It's shown here. Imagine you have n observations, numbered one, two, three, and so on until n.
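A random split of this kind can be sketched in a few lines. This is a minimal illustration and not the code used in the course; the function name, the 50/50 split and the seeds are assumptions.

```python
import random


def validation_split(n_obs, train_fraction=0.5, seed=0):
    """Randomly split the indices 0..n_obs-1 into a training and a validation set."""
    indices = list(range(n_obs))
    rng = random.Random(seed)  # seeded so that the split is reproducible
    rng.shuffle(indices)
    n_train = int(round(train_fraction * n_obs))
    return sorted(indices[:n_train]), sorted(indices[n_train:])


# Two different random assignments of the same 20 observations.
train_idx, valid_idx = validation_split(n_obs=20, train_fraction=0.5, seed=1)
other_train, other_valid = validation_split(n_obs=20, train_fraction=0.5, seed=2)

print("training set:  ", train_idx)
print("validation set:", valid_idx)
```

Every observation lands in exactly one of the two sets, and a different seed produces a different assignment, so an error estimate computed on the validation set depends on which split happened to be drawn.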
And then you divide these n observations into the training set, shown in blue. For example, it could be observations seven, 22, 13, then maybe 50, 17, and so on. And then you have the validation set, shown in beige, which is used to calculate the test error rate. One drawback here is that the validation estimate of the test error rate can be highly variable, because which observations end up in the training set and which in the validation set is random. Depending on which observations enter the blue and the beige part, the test error rate can differ from one random assignment to the next. The next problem is that we initially had n observations, and now only a subset of the available data is used to fit the model. So we are wasting a lot of our data by holding it out and only using, let's say, half of our observations for training the model. This is a serious drawback, especially if we have limited data. So what can we do? We can do cross-validation to calculate
the test error rate. We have our n observations, and in the first setting we use the first observation, shown in beige, as our test observation, and we fit the model on the n minus one remaining observations, observations two, three, four, and so on until n. So we're using almost all the observations for training the model, and we estimate the error based on test observation number one. For example, it could be the mean squared error, which for a single observation is just the squared error. Then we do this another time. We fit a second model, and for this second model we use observations one, three, four, five, six, seven, et cetera until n for fitting the model, and we check the out-of-sample error, the test error, based on test observation number two. We get a different squared error, and so on and so on. We do this n times, until we are using the last observation as the test observation. This gives us n squared errors, n metrics for the test error, and we then simply average all these errors to get an estimate for the test error rate. Because we always leave one observation out, this is called leave-one-out cross-validation. It's quite simple
to extend this to the so-called k-fold cross-validation. We again start with n observations in our data sample, and now in k-fold cross-validation, for example five-fold cross-validation, we subdivide our data sample into five bins. As you can see, this is the first bin, the first set. This is used as the test data, and the model is fitted on the remaining observations. Then we do the same with the next bin, each bin being of size n divided by five in this case. In the end, we have k resulting mean squared errors, and we can average these to get an estimate for the test error rate. This is k-fold cross-validation, and again it is used to estimate the test error rate. Now, in the next video, I want to show you the application of the lasso, the ridge regression, and the elastic net, and also how you can choose the tuning parameters via cross-validation.
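The procedure described above can be sketched as follows. To keep the example self-contained, the model being evaluated is just the mean of the training responses, which is an assumption for illustration and stands in for any regression model. The folds are contiguous blocks; in practice the observations would usually be shuffled first. All names are hypothetical.

```python
def k_fold_indices(n_obs, k):
    """Split the indices 0..n_obs-1 into k contiguous folds of near-equal size."""
    fold_sizes = [n_obs // k + (1 if i < n_obs % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validated_mse(y, k):
    """Estimate the test MSE of a mean-only model by k-fold cross-validation."""
    fold_mses = []
    for test_idx in k_fold_indices(len(y), k):
        held_out = set(test_idx)
        # Fit on all observations outside the current fold.
        train_y = [value for i, value in enumerate(y) if i not in held_out]
        prediction = sum(train_y) / len(train_y)
        # Evaluate on the held-out fold only.
        squared_errors = [(y[i] - prediction) ** 2 for i in test_idx]
        fold_mses.append(sum(squared_errors) / len(squared_errors))
    return sum(fold_mses) / len(fold_mses)  # average over the k folds


y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
cv3 = cross_validated_mse(y, k=3)        # three folds of two observations each
loocv = cross_validated_mse(y, k=len(y)) # leave-one-out: one observation per fold

print("3-fold CV estimate of test MSE:", cv3)
print("leave-one-out estimate of test MSE:", loocv)
```

Leave-one-out cross-validation is simply the special case where k equals the number of observations. To choose a tuning parameter, the same loop would be run once per candidate lambda, and the lambda with the smallest averaged error would be kept.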
14. 14 Partial Least Squares in Practice: Hello, everyone, and welcome back to our class in artificial intelligence and machine learning in finance. In this video, I want to show you the application of penalized linear regression models, that is, ridge regression, the lasso, and the elastic net, and we'll do this in R. More precisely, we'll be using the risk analytics R package. If you still need to download it, you can access the package under this link, and we will also need a list of ticker symbols and some additional data for the stock market returns that you can download here. What we'll try to do is use numerous features, that is, numerous explanatory variables or regressors, to forecast equity returns. More precisely, we'll try to forecast the return on the Apple stock one day ahead. This is a very simple task in finance, forecasting equity returns, and ideally we'll identify those features in our data that can help explain the response variable, which in this case is just the one-day-ahead Apple stock return. This is our task, and the application is actually quite simple. But as you can see from the next couple of slides, this will take a while to do and to implement
all these functions. If you remember one of the first videos, you will recognize on this slide that we're not starting with the regression models themselves. We'll start with importing and preprocessing the data, because this is one of the more important steps and one that you will usually not find in many textbooks. Everyone always starts with the models, but you need to work with the data beforehand to bring it into a form that can be processed by the machine learning models. First of all, we have to start with an even more trivial task. We need to install the risk analytics package, and it cannot be installed directly from the CRAN mirror. You have to download it from GitHub, and thus you will probably need these two lines of code here: library devtools, and then the command that installs the risk analytics package from GitHub. Then you need to load it into memory in R.
We call library risk analytics, and we also need the snow and the quantmod packages. We need to set the file path for the downloaded CSV file that includes the NASDAQ ticker symbols, because we want to rename some of our variables, which do not have very practical names in R, to make them easier to interpret. So we download the NASDAQ ticker CSV file, and then set path NASDAQ to the file path where the NASDAQ CSV file is saved. Then we read the NASDAQ ticker symbols into company list, and we order the firms in descending order by market capitalization, which is included in this data. What we'll be doing next is choosing companies, and we'll only be working with those companies that have a very high market cap. That's why we download the data. We read the CSV file into company list and then sort the company names in descending order by market capitalization. So that's what happens
in lines 12 and 13. Next, we need to download the return data for those ten companies that have the highest market capitalization. Stock data will later include the stock returns; we define it as a data frame, and then we call get Yahoo data for those companies that are included in the object company list, with my cursor here. We only use ten companies. You can easily extend this application to more companies. You will later see why this probably doesn't make too much sense, but nevertheless, we'll be using ten companies. We're using data from 2010 to 2020, and we are using the stock returns. We'll also use macroeconomic variables. We get those via get macro data, and we save them into the object macro data. Next, we manually add the corresponding date to each row. This is a rather complicated command, and you can see the result is saved into the object dates: the row names of as matrix, get symbols, et cetera, et cetera, for several of our variables. And then, last but not least, we assign those dates as row names for the macro data. Okay. Now the variables that are contained
in this macro data object have variable names that are not too practical. For example, caret VIX, or caret GSPC for the S&P 500 returns, and there is another series for the changes in the three-month Treasury bill rate. We'll rename some of these variables. For example, instead of caret VIX, we'll simply call it VIX. Instead of caret GSPC, we'll call it SP 500. This is more intuitive than the previous names, and this is what we do in line nine: the column names of the object macro data are overwritten with VIX, SP 500, Real Estate, TR 3M, yield, and credit. By the way, these are the macroeconomic variables that we'll be using: the implied volatility index, that is, the volatility index on the S&P 500; the S&P 500 itself; the iShares Dow Jones US real estate return index; the changes in the three-month Treasury bill rate; the slope of the yield curve, corresponding to the spread between the ten-year Treasury rate and the three-month T-bill rate; and, last but not least, the changes in the credit spread between BAA-rated bonds and the Treasury rate. So this is a credit spread. Okay. Now, we also need to preprocess the
data in some other sense. We'll add a date column to both macro data and stock data, and then merge them via the merge command in line five to get a new data object, complete data. This includes the macroeconomic and the stock market data all in one data frame. Next, we redefine complete data as a tibble. I told you in a previous video that this is more convenient in many machine learning applications, in R at least. If we now print the first five rows of complete data, you can see the first five observations. You see the dates, and then, as columns, you see AAPL, which is the ticker symbol for the Apple stock, MSFT for Microsoft, Amazon, and so on, and we have 2,761 more rows. All in all, that's about 2,800 daily observations for the stocks and the macroeconomic data. You can see we have additional variables, for example JPM, which is JP Morgan, and JNJ, which is Johnson and Johnson. We'll see this one quite prominently in our results later on; nowadays everyone knows about this company. And you see VIX, SP 500, real estate, and so on, these other macroeconomic variables. We again extract the macroeconomic
variables from complete data. Because the dates in the original macro data object do not coincide with the dates in stock data, we extract them again from complete data to have corresponding dates in all our objects, and we now print the correlation matrix to get a first hint of what the data looks like. We'll see which variables are correlated. You can also do this for the stock data; you will see, of course, different correlations. It would also make sense to print out summary statistics, including the mean, the standard deviation or volatility of the data, the quantiles, the minimum and the maximum. They are useful to see if anything is off, for instance whether anything might be an outlier that needs to be dealt with. We could also have a look at the correlation between the macroeconomic and the stock market data. Here, I've only shown you the correlation matrix for the macroeconomic variables. You can also visualize this via the library corrplot, where the command is also called corrplot. Here you can see, for example, that there is a strong positive correlation between real estate and the S&P 500, and a slightly negative correlation between the VIX and the S&P 500 and between the VIX and real estate. This is just another, visual way of showing the correlations. Okay. Now, as you remember, in statistical learning,
machine learning, we need a training dataset and a test dataset to see the out-of-sample performance of our models. To do this, we first load the library lubridate and then split our data into training and test data. This looks quite complicated, but it's just taking the complete data data frame and applying some filters, for example, with my cursor here, the year being smaller than or equal to 2018. So that's the training data, and 2019 and 2020 are the test observations: data up to 2018 for training, two years for testing. Because we later want to predict the one-day-ahead returns on the Apple stock, we add another column, test data dollar forecast, which is the Apple return in the test data, and the same here for the training data, and we shift it one day ahead. By shifting the Apple returns one day ahead, you get the following: if this row were 1 January, then next to the observations of all our variables on 1 January, you would have the 2 January observation of the Apple stock as an additional column, the response variable that we want to forecast. That's test data dollar forecast and training data dollar forecast. Keep in mind that this is just the Apple stock return one day ahead; it could also be a quite different response variable that we want to explain with our data. Then, because some of the models need the response variable and the predictors in separate matrices, we do exactly this. We take training data and test data and split them into X train and Y train, and X test and Y test, here in lines three, four, six and seven, and we define four distinct matrices that we can use as input
objects for our um ones later. Last but at least, some of
our models do this anyway, but we want to use our own functions for the R squared and the RMSE. This is the R squared. As you can see, it takes as inputs the predicted and the actual values. From these you can calculate the RSS and the TSS, and the R squared is simply one minus RSS divided by TSS. That's what is returned in line five. The root mean squared error, in line nine, is just the square root of the mean of the squared differences between the predicted and the actual values. We'll start with OLS, with a simple linear regression model. We forecast the future Apple returns with past returns of the stock itself and the other variables. The object that we are defining, model, is lm, which is the R code for a linear model. Forecast, that's the column of the Apple returns one day ahead, is a function of, as you can see here, a dot. If we don't specify any variables and only put in a point, a full stop, this means that all variables that are included in training data should be used by the program as predictors. What happens is, as
you can see here, we are using a simple linear regression model that includes all predictors. We have the intercept, we have the Apple stock returns, Microsoft, Amazon, Google, Google again, JPMorgan, Johnson and Johnson, and this is where it becomes interesting. As you will see on the next slide, the stock return on the Johnson and Johnson stock is the only predictor that is statistically significant, at least at the 10% level. All the other variables are not statistically significant. It goes on with the five macroeconomic variables. The multiple R squared is quite low, and the adjusted R squared is even lower. It's almost 0.1%, and as we'll see later on, it doesn't get much better using ridge regression, the lasso or the elastic net. But what we can already see here is that none of the predictors really has much explanatory power for explaining and forecasting
future Apple returns. Remember that we haven't put any theory into this. We are only using all available data to forecast future stock returns. This is typical of machine learning. In an economic model, we would have argued that there are some variables that should have an impact on future stock returns. We probably would have used a time series econometric model. This is simply statistical learning. We're using all the data we have, we're using all the predictors we have, and it turns out that for some reason, the Johnson and Johnson stock return has a slight statistical significance in explaining future Apple stock returns. But as we can see from the R squared, it's not too helpful. Okay. We can then predict from this linear model, using test data as the new data, and compute the R squared and the root mean squared error. On the test data, as you can see here, the R squared becomes negative and the RMSE is 2.4%. This is rather dismal.
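The two performance measures described above can be sketched in a few lines. The lecture's own functions are written in R and are not reproduced in this transcript, so the following is a minimal Python sketch under assumptions: the function names and the return values in the example are hypothetical and chosen only to show how the out of sample R squared can become negative.

```python
import math

def r_squared(predicted, actual):
    """R squared = 1 - RSS / TSS, as described in the lecture."""
    mean_actual = sum(actual) / len(actual)
    rss = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # residual sum of squares
    tss = sum((a - mean_actual) ** 2 for a in actual)           # total sum of squares
    return 1.0 - rss / tss

def rmse(predicted, actual):
    """Root mean squared error: square root of the mean squared difference."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical daily returns, for illustration only (not the course data).
actual = [0.01, -0.02, 0.015, -0.005]
close_forecast = [0.008, -0.015, 0.012, -0.004]
wrong_sign_forecast = [-0.01, 0.02, -0.015, 0.005]

print(r_squared(close_forecast, actual), rmse(close_forecast, actual))
print(r_squared(wrong_sign_forecast, actual), rmse(wrong_sign_forecast, actual))
```

Whenever RSS exceeds TSS, the forecast does worse than simply predicting the mean of the actual values, and the R squared drops below zero. This is what a negative out of sample R squared indicates.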
As we can see, the model doesn't generalize well to the new data. As the in sample fit of the linear model was already not too good, it's not surprising that it doesn't perform better on new data, and actually the R squared becomes negative. Now, to use ridge or lasso regression, remember that we have one tuning parameter in both ridge and lasso, and two tuning parameters in the elastic net. We need to choose the tuning parameters such that the model generalizes well to new data, and this is done via cross validation. Remember what cross validation means. In leave one out cross validation, for example, you leave one observation out, fit the model on the remaining observations, and continue to do this until you've run through all the observations. In K fold cross validation, you split the training data into K parts, so called folds, and you usually choose five or ten folds. K fold cross validation then trains on all but the K-th part and validates the model performance on the K-th part, iterating over all K folds, and then estimates the root mean squared error, for example. Then you choose the hyperparameter that performs best. For example, in each model, you can try different values for the hyperparameter, which in this case is the tuning parameter, and then you choose the value that yields the lowest MSE. This is how it's done
for the ridge regression here in R. First of all, we're using the glmnet library. We use the set.seed command for reproducibility, so that if you do this again in R, you get the same results. There is some randomness, of course, in this process, but if you use the set.seed command, you can always reproduce your initial results. Then we find the optimal lambda parameter by cross validation. We are using ten fold cross validation. Each time we use 100 regularization parameters, tuning parameters, that are tested in each of these folds. And we start here: if we set alpha, the weighting to be used, to zero, this gives us ridge regression. Later on, if we set it to one, it will give us the lasso. So this is the command cv.glmnet, and in this cross validation for the ridge regression, we use the X train matrix and the Y train matrix, the training data, to select the parameter via cross validation. And after some time, you can see here in the lower part of this slide, we get an optimal lambda tuning parameter of 0.4556. We can then fit ridge regression,
which is very simple: glmnet with alpha equal to zero, so that we get ridge regression. Lambda is the tuning parameter. We use the optimal lambda selected by cross validation here, and as data, we use X train and Y train. Here you can see the coefficients of the ridge regression. If you remember the video on ridge regression, ridge regression usually includes non zero parameters for all predictors. So it's not surprising that we have all those parameters included in our model: Apple, Microsoft, Amazon, and so on, all these predictors are in our regression. You will also remember that if we set the parameter lambda to zero, we get OLS. If we set it to infinity, all the parameters will be forced to become zero. If we do this now with a high lambda and a low lambda, you will see what happens. If we set a high lambda, then we get this, and if we do it with a low lambda, we arrive at the OLS coefficients. Now, to see how ridge
regression fares in comparison to OLS, let's have a look at the in sample R squared. If we use it with the training data, you can see that the R squared is 0.0003, which is much lower than in the OLS model, where it was 0.007. Well, it's lower at a very low level, of course. For the out of sample performance, we again predict, using ridge regression now, and look at the R squared. It's minus 0.005 in comparison to minus 0.0138, so it's slightly better. Actually, the out of sample performance is slightly better when looking at the R squared, and it's also better when looking at the RMSE. This is what we find. We used ten fold cross validation to determine the
hyperparameter lambda. Why is this, by the way, called a hyperparameter and not a parameter? Well, it's a parameter that governs the training process. If it is a parameter that is estimated in the training process, it's a parameter. But if it changes the way the model learns from the data, it is called a hyperparameter. And lambda is a hyperparameter: it shifts the model from OLS to ridge regression, for example. Using this lambda, we obtain a linear model with coefficients that are smaller in absolute value than the OLS coefficients, and the OLS model achieved the highest in sample R squared. However, the out of sample performance was better for the ridge regression. Slightly, and at a very low level, but nevertheless, out of sample performance was better for the ridge regression. We can now do the
same for the lasso. Again we use the glmnet library. We use the set.seed command to be able to reproduce the results later on. Then cv lasso is, first of all, the cross validation selection of the hyperparameter. We set alpha to one, so it's not ridge, it's now lasso, and we get the optimal lambda for the lasso in line ten. We then fit the lasso regression model to the training data using alpha equal to one, so we're using the lasso, and we are using the optimal lambda. As you can see, several of our predictors are now actually left out of the model. This was the main difference between lasso and ridge. Lasso can exclude variables, whereas ridge regression usually will include all predictors. Here, not surprisingly, what is left in the model is the Apple stock return itself. This is indicative of the fact that yesterday's stock return of Apple will probably predict today's stock return on the Apple stock. This actually makes sense in an economic way, but it could of course have turned out in any other way. What we can see is that the in sample R squared is zero, and the out of sample performance measured by the R squared and the RMSE is again slightly better than in the linear model. For example, you can see the RMSE is 0.023 in comparison to 0.024 in the OLS model. So this is the lasso. And last but not least, we can also do the
same with the elastic net. Now, we use the caret library. Again, we set the seed for reproducibility. Then we need to select the two parameters. This is done via cross validation again. In the end, in lines eight, nine, and ten, you can see that we are training the elastic net using forecast, all variables, the training data, and the training control object for the parameters. You can see the optimal parameters, alpha and lambda, are 0.81 and 5.48. Then again, based on the optimal parameters from the previous slide, we can predict on the training data. The R squared remains at zero, and the out of sample performance is again slightly better. But if you compare, for example, the RMSE for the elastic net, 0.02395, with the 0.02395 for the lasso, it doesn't get better. It's actually the same for the lasso and the elastic net. This is an application of those penalized linear regression models. In the next section, we'll talk about classification, K nearest neighbors and support vector machines.
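The cross validated choice of lambda described in this section can be sketched as follows. This is not the glmnet code from the slides, which is not shown in the transcript; it is a minimal Python illustration under simplifying assumptions: a single predictor, no intercept, simulated data, and hypothetical function names. With one predictor the ridge slope has the closed form sum(x*y) / (sum(x*x) + lambda), which keeps the example short.

```python
import random

def fit_ridge_1d(x, y, lam):
    """Closed-form ridge slope for one predictor without intercept."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)

def mse(beta, x, y):
    return sum((yi - beta * xi) ** 2 for xi, yi in zip(x, y)) / len(x)

def k_fold_cv_lambda(x, y, lambdas, k=10, seed=1):
    """Return the lambda with the lowest average validation MSE over k folds."""
    rng = random.Random(seed)            # fixed seed, in the spirit of set.seed in R
    idx = list(range(len(x)))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    cv_error = {}
    for lam in lambdas:
        errors = []
        for fold in folds:
            held_out = set(fold)
            x_train = [x[i] for i in idx if i not in held_out]
            y_train = [y[i] for i in idx if i not in held_out]
            beta = fit_ridge_1d(x_train, y_train, lam)   # train on all but one fold
            errors.append(mse(beta, [x[i] for i in fold], [y[i] for i in fold]))
        cv_error[lam] = sum(errors) / k
    best = min(cv_error, key=cv_error.get)
    return best, cv_error

# Simulated data, for illustration only.
data_rng = random.Random(0)
x = [data_rng.gauss(0, 1) for _ in range(200)]
y = [0.5 * xi + data_rng.gauss(0, 1) for xi in x]
lambdas = [0.0, 0.1, 1.0, 10.0, 100.0, 1000.0]

best_lambda, cv_error = k_fold_cv_lambda(x, y, lambdas)
print(best_lambda)
print(fit_ridge_1d(x, y, 0.0), fit_ridge_1d(x, y, best_lambda))
```

Setting lambda to zero reproduces the OLS slope, and a very large lambda shrinks the slope toward zero, which mirrors the behaviour described for the ridge coefficients. Note that this sketch shuffles observations randomly into folds; for time series such as daily returns, one may prefer folds that respect the time order.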
15. 15 Bayesian and K Nearest Neighbors Classification: Hello everyone. My name is Clego Weiss. I'm a professor of finance here at Leipzig University, and welcome to this video on Artificial Intelligence and Machine Learning in Finance. In this video, we'll have a look at the K nearest neighbors and the Bayes classifier in order to classify observations based on training data and to predict into which class new test observations will fall. We'll later on look at support vector machines, but in this video, we'll start out with the K nearest neighbor classifier. If you recall the classification problem, we are always led to the question: how can we assess the model accuracy of a given classifier? That is, is it able to classify a new observation correctly, or does it classify it into a wrong class? This is obviously in contrast to regression methods, where we've already seen measures of model fit, and it's clear that for classification problems, this looks a little bit different. So we'll start with the
training error rate. This is used for qualitative responses, for example, default or non default, contract termination or no termination. With the training error rate, we mean the proportion of mistakes that the algorithm makes if we apply our estimate f hat to the training observations. It is defined as the average of an indicator function. This is I here in equation 21. The indicator function compares the observations y i of our qualitative response with the predictions: it is one if y i is not equal to our predicted value. y i hat is the predicted class label for the i-th observation, and we simply take the average across all these indicator functions. Basically, again, it's the proportion of mistakes that we make. This is the training error rate, computed on the training data. In time series terminology, we would say in sample. It's clear that we can also look at the out of sample predictive performance, and this, in statistical learning, is called the test error rate. The test error rate looks at the model accuracy of a classifier when the classifier that has been fitted is applied to new data. It is again the proportion of mistakes that we make if we apply our estimate to the test observations x zero, y zero. Again, it is defined as the average of a comparison of the responses y zero with whether we predict them correctly or not. It's the percentage of incorrectly classified test observations. Naturally, we are looking for classifiers that are able to reduce both the training and the test error rate. In many cases, we will see that the algorithms are not able to generalize well to new data, so you have a good in sample fit, a low training error rate, and a rather high test error rate. So what would we use
as a classifier? Well, the most basic classifier is the so called Bayes classifier. It's the classifier that assigns each observation to the most likely class given its predictor values. So we assign a test observation with predictor vector x zero to the j-th class for which the conditional probability that Y is in the j-th class, given that we've seen the predictor values x zero, is largest. In this case, the test error rate is minimized on average by our Bayes classifier. But this is only of use in theory, because for the Bayes classifier to work, the conditional distribution of Y given X, of the response given the predictor values, needs to be known, and in practice, this is never known. If we knew the conditional distribution of Y given X, we could simply calculate these conditional probabilities, and then we would know which class is the most likely outcome. But this is not the case, so we cannot actually use the Bayes classifier in practice. What can we do instead? The K nearest
neighbor classifier is closely related to the Bayes classifier. We start with a positive integer K, for example, five nearest neighbors or three nearest neighbors. We have an integer K and one test observation x zero. Starting from this test observation x zero, we identify the K points in the training data that are closest to x zero, and we call this set N zero. Then we estimate the conditional probability. Note that this is not known ex ante, but we can estimate the conditional probability for class j as the fraction of points in the set N zero whose response values equal j. The probability is estimated simply by taking the average of the indicator functions over the observations in the vicinity of x zero that belong to class j. This is why it's called K nearest neighbors. You take the observation x zero and look at the K, for example the three, nearest neighboring points. For example, if we have a five nearest neighbor classifier and we observe that, of the five nearest points to the test observation, one belongs to the class and four do not, then it's one out of five, and this is our estimated conditional probability for class j. We then apply Bayes' rule and classify the test observation x zero to the class with the largest probability, because we have now estimated these probabilities. This is what comes out, the Bayes
decision boundary. As you can see, if we know the conditional distribution of Y given X, then it's actually the best classifier we can achieve, because we know the conditional probabilities. You can see two classes in yellow and blue, and the purple line is the Bayes decision boundary. You can see that, for example, if we have a new observation that falls in here, we would classify it as the blue class. If it falls in here, then it's classified as the yellow class. This could be default or non default, contract termination or no termination, et cetera. With K nearest neighbors, it works like this. We have this point, for example, and if K is equal to three, we take the three nearest neighboring observations. One blue, two blue, one yellow. So this area is considered to be in the blue part. This here is the blue class, and we now do it again here. It would probably be one, two, and I would say this is the third point, so this is also blue. This is how you construct the decision boundaries, and what you get is, obviously, a non linear classifier with non linear classification boundaries. This is three nearest neighbors. You can also use five nearest neighbors or ten nearest neighbors, although with simulated data with so few observations that obviously doesn't make much sense. So this is K nearest neighbors. How does it compare to the
Bayes decision boundary? Well, you can see here, we've used ten nearest neighbors, and you can still see the Bayes decision boundary in purple. The K nearest neighbors boundary is a little bit more wiggly, as you can see here. It's close to the Bayes decision boundary, but it has more variance and less bias. Here is the same picture, but now for K equal to one and K equal to 100. You can see what happens: with K equal to one, it gets even more wiggly, and with K equal to 100, it's very coarse. It's a very coarse, almost linear decision boundary. Again, you can see the bias variance trade off in play. K is obviously a hyperparameter that needs to be chosen before applying the algorithm, usually via cross validation, to arrive at an optimal algorithm.
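The K nearest neighbors rule and the error rate from this video can be sketched together. This is a minimal Python illustration, assuming Euclidean distance and a tiny made-up training set; the function names and the data are hypothetical and not taken from the lecture slides.

```python
import math
from collections import Counter

def knn_probabilities(x0, train_x, train_y, k):
    """Estimate P(Y = j | X = x0) as the fraction of the k nearest points in class j."""
    by_distance = sorted(zip(train_x, train_y), key=lambda pair: math.dist(pair[0], x0))
    neighbor_labels = [label for _, label in by_distance[:k]]   # the set N zero
    counts = Counter(neighbor_labels)
    return {label: count / k for label, count in counts.items()}

def knn_classify(x0, train_x, train_y, k):
    """Assign x0 to the class with the largest estimated probability."""
    probs = knn_probabilities(x0, train_x, train_y, k)
    return max(probs, key=probs.get)

def error_rate(predicted, actual):
    """Average of the indicator that the predicted label differs from the true label."""
    return sum(p != a for p, a in zip(predicted, actual)) / len(actual)

# Made-up two-class training data, for illustration only.
train_x = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (3.0, 3.0), (3.2, 2.9), (2.8, 3.1)]
train_y = ["blue", "blue", "blue", "yellow", "yellow", "yellow"]

print(knn_probabilities((1.1, 1.0), train_x, train_y, k=5))
print(knn_classify((3.0, 2.9), train_x, train_y, k=3))

# Training error rate with K = 1: every point is its own nearest neighbor.
fitted = [knn_classify(x, train_x, train_y, k=1) for x in train_x]
print(error_rate(fitted, train_y))
```

With K equal to one, the training error rate is zero by construction, which is the low bias, high variance end of the trade off. The test error rate would be computed with the same `error_rate` function on observations that were not used for training.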
16. 16 Maximum Margin Classification: Hi, everyone, and welcome
back to our class on artificial intelligence and machine learning in finance. In this video, we want
to have a look at support vector machines as another method for
classification. As you remember from the last video, and we'll see this in even more detail later on in the applications, classification is often used in finance, for example in credit risk management, where we want to classify good and bad loans, defaulting customers and non defaulting customers. In insurance economics and insurance management, it could be that we want to identify and classify those customers who are most prone to terminate their contract and to switch to another insurance company. Support vector machines are another way of doing classification. As such, they are quite similar to K nearest neighbors but more sophisticated. When we speak about support vector machines, this is usually the summary term for three distinct methods. We start out with the so called maximal margin classifier. We'll then move on to the support vector classifier. And if we extend the support vector classifier, we get the support vector machines, and we'll come to that probably in the next video. We have to start with
the basic definition of a so called hyperplane. What is a hyperplane? A hyperplane in the P dimensional space is a flat affine subspace of dimension P minus one. For example, in two dimensions, in the plane, a hyperplane is a line. For example, this is a hyperplane. This is a hyperplane, or even this is a hyperplane. It is a straight line in the two dimensional space. In three dimensional space, it's a plane. You can see how this goes if this is a three dimensional space. Let me sketch this a little bit for you. It could, for example, be this hyperplane, which cuts the z axis here. This would be a plane in the three dimensional space, with the coordinates x, y and z, and z being equal to, let's say, three. This is the hyperplane, this blue plane that cuts through the three dimensional space at z equal to three. As you can see, a hyperplane divides the P dimensional space into two halves, and it's defined by this equation. You have the linear combination of the coordinates, beta zero plus beta one times X one, and so on, until plus beta P times X P, and this needs to be equal to zero so that you get a hyperplane. So a hyperplane is quite simple. And you can now see why
we're using hyperplanes here at the very start: in the two dimensional case here, in the plane, such a line cuts through the plane and cuts it into two halves. For example, say we have some observations, here, here, here, here and here and maybe here, and we have certain features associated with these points. Then, for example, this red line could be a decision boundary, meaning that all the observations that fall into the blue space on top here are classified as blue points, and those that fall below this line are classified into the red class. So the hyperplane is used for classification, and how should we do this? As input data, we have an N times P data matrix of N observations in the P dimensional space. Each observation belongs to one class. That is, we have the qualitative response variables Y one to Y N, and minus one and one are the response variable's possible values. Minus one represents one class and one the other. We only have two classes right now. We can extend all these models to the case where we have more classes, like, for example, default, A rating, AA rating, and so on. It works well with ratings, but we start out with just two classes. Minus one is one class, one is the other one. We have a test observation with a P vector of observed features. This is X star, X one star up until X P star. As output, we get a classification of X star using a separating hyperplane. We use a hyperplane that cuts the P dimensional space into two halves, and then we can decide to which class these observations belong. It's quite simple, this classifier based on a separating hyperplane. If we, for example, have these blue and these red observations, you can see that all these lines separate the blue from the red points. We can then say, for example, if we use this line here, this separating hyperplane, then everything that is on top is class blue and everything below is class red, and we can use the hyperplane
for classification. It's very, very simple. Now, the first problem: a separating hyperplane doesn't necessarily need to exist. It could be that there isn't a separating hyperplane. A quite simple scenario is one where we mix all points, those blue and red points in here. If, for example, these are red points, and I add some blue points here, you can easily see without proof that it gets quite difficult to find a separating hyperplane. It has to be linear, that's the definition of a hyperplane. You can try to insert a separating hyperplane. It will not work. So there isn't one in this scenario. The second question: if one separating hyperplane exists, then we have an infinite number of such hyperplanes. You can see that I can add an infinite number of hyperplanes here, as long as each hyperplane still separates all these points. We have an infinite number. The question is, which one should we use? This is where we get to the
maximal margin classifier. The natural choice is the maximal margin hyperplane, which is the separating hyperplane that is farthest from the training observations. What we need to do is first compute the perpendicular distance from each training observation to a given separating hyperplane. I can show this if I delete some of my drawings here. If, for example, we use this one, we calculate the perpendicular distances to the hyperplane, this is more or less perpendicular, from each side, as you can see here. And then we try to find the separating hyperplane such that the smallest of these distances, which is called the margin, is maximal. This is probably not the maximal margin classifier, because you can see that now I've shifted the line, the hyperplane, to the left. If you look at the distances, this one here has now become larger, and this one here has become smaller. We're trying to fit the hyperplane somewhere in between here such that the closest blue points and the closest red points are equidistant from it. This is how to construct the maximal margin classifier. So we compute those distances. The smallest such distance is known as the margin, and then we maximize the margin. It's the separating hyperplane for which the margin is largest. So this is what we get, as you can see
here. The distance, the margin, is now maximal, and it's the same when considering the blue points and the red points. Now, the interesting fact here is that, as you can see, adding a red point here or adding a blue point here on the left doesn't actually change the maximal margin hyperplane and the classifier. Why is that? Because the maximal margin classifier only depends on this point, this point and this point, on these three points. This is why they are called support vectors: they support the hyperplane. In P dimensional space, these points are vectors, and they are the only points or vectors that determine the separating hyperplane. If we change the support vectors, we'll get a different classifier, a different separating hyperplane. But if you add any points on the left or right, outside the margin, the classifier will not change. This is why these are called support vectors. And now you can see why later on, it's called the support vector classifier and the support vector machine. However, this is still the maximal margin hyperplane and the maximal margin classifier. How can we construct
this in more detail? Well, the maximal margin hyperplane is the solution to a specific optimization problem. We need to maximize the margin, which is M. We are looking for those parameters beta zero, beta one, and so on, up to beta P, such that M is maximized, subject to two constraints. We sum the squared coefficients beta j, and they need to add up to one. And y i times the hyperplane expression, beta zero plus beta one times x i one and so on, needs to be larger than or equal to M. Remember that y i is either minus one or one, so this constraint 26 ensures that each observation will be on the correct side of the hyperplane, with some buffer, the margin M. This area in between here is the margin. Constraint 25, the normalization, ensures that the left hand side of 26 is exactly the perpendicular distance from the hyperplane, so each observation is at least a distance M from the hyperplane. This needs to be the case, and then we choose the coefficients, which determine the hyperplane, such that M is maximal. Now, this is a very
simple classifier. If a separating hyperplane exists, we can use the maximal margin classifier. But the question is, is this always the case? No. This is the non separable case. I've tried to sketch this before, but you can start anywhere here and try to find a hyperplane that separates these points. Well, no, there are still red points here. This is okay, but then we have one red point here, and you can see I cannot put a line through this plane without having some red points on the left and some blue points on the right side of this hyperplane. This is a problem. In the non separable case, we don't have a separating hyperplane and we cannot use the maximal margin classifier. Now, if a separating hyperplane exists, should we always use a classifier based on it? The answer, unfortunately, is no, because it's quite sensitive to new observations. As you can see here, for example, if this is the maximal margin classifier and I add a point, let's say, here, this doesn't change anything. But in this case, we've added one point here, and as you can see, the separating hyperplane does not move just a bit. Let's say this is the old one. We might say that keeping it is also okay and that this is one error we make. But the maximal margin classifier has to separate all blue from all red points. You can see it's extremely sensitive, and it suddenly gives us this decision boundary. Even if a separating hyperplane exists, it might be that we don't want this level of perfection, because then the bias will be low, but the variance will be quite high. This is why in the next step, in the next video, we'll extend the maximal margin classifier, allow for some degree of error, and get to the support vector classifier.
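The notions of a separating hyperplane and its margin can be made concrete with a short sketch. This is a minimal Python illustration, assuming two made-up classes in the plane and three hand-picked candidate hyperplanes; it only evaluates given hyperplanes and does not solve the optimization problem itself. All names and numbers are hypothetical.

```python
import math

def signed_distance(beta0, beta, x):
    """Perpendicular signed distance of x from the hyperplane beta0 + beta . x = 0."""
    norm = math.sqrt(sum(b * b for b in beta))   # dividing by the norm plays the role of the sum-to-one constraint
    return (beta0 + sum(b * xi for b, xi in zip(beta, x))) / norm

def classify(beta0, beta, x):
    """Class +1 on one side of the hyperplane, class -1 on the other."""
    return 1 if signed_distance(beta0, beta, x) > 0 else -1

def margin(beta0, beta, X, y):
    """Smallest value of y_i times signed distance; positive only if the hyperplane separates."""
    return min(yi * signed_distance(beta0, beta, xi) for xi, yi in zip(X, y))

# Made-up training data with labels -1 and +1, for illustration only.
X = [(1, 1), (2, 1), (1, 2), (4, 4), (5, 4), (4, 5)]
y = [-1, -1, -1, 1, 1, 1]

candidate_a = (-5.5, (1, 1))   # x1 + x2 = 5.5
candidate_b = (-4.0, (1, 1))   # x1 + x2 = 4, closer to the -1 class
candidate_c = (-1.5, (1, 0))   # x1 = 1.5, does not separate the classes

for beta0, beta in (candidate_a, candidate_b, candidate_c):
    print(beta0, beta, margin(beta0, beta, X, y))

print(classify(*candidate_a, (5, 5)))
```

Both `candidate_a` and `candidate_b` separate the two classes, but `candidate_a` has the larger margin, so the maximal margin criterion prefers it. A negative margin, as for `candidate_c`, means that at least one observation lies on the wrong side.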
17. 17 Understanding Support Vector Machines: Hi, everyone, and welcome
back to our class in artificial intelligence and
machine learning in finance. In the last video, we've seen the maximum margin classifier. If you look at this here, we saw that the maximum
margin classifier is actually quite
sensitive to new data. If we add new data points, then the decision boundary, which is the line
here, for example, here, is quite sensitive to these newly
added observations. The question is, can we
make this classifier a little less sensitive to new observations and a
little more forgiving? This is where we
actually now arrive at the so called support
vector classifier. It's not yet the
support vector machines that we want to talk
about in this section, but it's the support
vector classifier, which is also usually summarized under this name
support vector machines. The solution is to use a classifier that allows some observations, not all of course, to be incorrectly classified. You can see this
here, for example, if we add this observation
11 and this observation 12, even though 11 is
red and 12 is blue, the decision boundary doesn't
change too much due to these newly added
observations that make the decision boundary no longer a separating
hyperplane. So we want to have a little discretion when it comes to the classification
of the observations, and this is what the support
vector classifier does. The difference to the maximal margin classifier can be seen here in the optimization problem that leads to the support vector classifier, or the support vector hyperplane that we use to classify. Again, we have our margin M and our parameters beta zero, beta one, et cetera for the hyperplane, and now we have additional variables epsilon one through epsilon N. We maximize M with respect to these parameters, with the same constraint in line 28, that is, the squared coefficients all add up to one. The difference now is that the observations no longer need to be at least a distance M from the hyperplane on the correct side. We are no longer looking for a separating hyperplane; we have some discretion, and the error that is possible is included here through the so called slack variables, which allow observations to be on the wrong side, at least a little bit. So we would expect 11 to be on this side, but it's okay that it's actually a little bit off and a little bit above the hyperplane. This would then be captured in the slack variable. We demand, of course, that the slack variables epsilon i are non negative, and we have an additional hyperparameter, which is C, so that the sum of all those slack variables, in other words, the sum of the errors that we allow, needs to be smaller than or equal to C. C is of course a non negative tuning hyperparameter. It's chosen via cross validation, and M is again the margin. This is the support
vector classifier. We see four different
support vector hyperplanes in these plots. As you can see
from the title, these plots result from different choices for the tuning hyperparameter C. The largest value of C was
used in the top left panel. Smaller values were then used
in the top right, bottom left, and bottom right panels, so we have four different values.
When C is large, there is a
high tolerance for observations being on the
wrong side of the margin, and so the margin
will be very large. As C decreases, that is, as we reduce the maximum allowed
sum of errors, this tolerance for observations being on the wrong side of
the hyperplane decreases and the margin narrows. You can see this
here: this is the margin from this side and
from this side, and as you can see, it is reduced in each plot going
from top left to bottom right. Now, are we done? Well, actually, no. Is a linear classifier
always warranted? Well, if you
look at this plot, you can immediately see the problem. You can try your best
at using a hyperplane, which is a line here, to separate the
blue and red points. For example, if you
were to use this line, then you have those
observations here that
cause a problem. You can use a hyperplane like this, and it doesn't really help. What you really need
is something that is nonlinear. For example, something like this or something like
this could be useful. A linear hyperplane
is not warranted here, and we need a nonlinear
classifier. This is where we finally get, in the next couple of slides, to the support vector machines. While the support vector
classifier is a linear one, how can we
convert it to a classifier with a nonlinear
decision boundary? The first thing we could try is that we again
maximize the margin. We use the betas, the epsilons, and M as our parameters, and we now say we are not using a hyperplane. As
you can see here, this is again the response,
being minus one or one, and the margin, and we allow a certain error
through the slack variables. But here in parentheses, where previously we had
a linear hyperplane, we now allow a
polynomial function. So just like we did with
polynomial regression, we allow the decision boundary
to be a polynomial one, and this is one way of doing it. The problem is that it might
not be enough. With polynomial regression, we've seen that this adds some
flexibility to our model. Same here, but it might be that we need
more nonlinearity. How can we achieve this? First
of all, recall the standard, in this case the
Euclidean, inner product for two vectors x and x prime. This is
the scalar product, the Euclidean inner
product of two vectors, and we need this because the linear support vector
classifier can be represented with it. It's just a different
representation, a different way of showing what the support vector
classifier looks like, with n parameters alpha i.
The hyperplane is
given by beta zero plus the sum of the parameters alpha i times the inner
products of x and x i. In other words, we can use this representation
here, which is a linear function, a line, a plane, or whatever, and the hyperplane can also be represented
in this way. Instead of using
beta one times x one, beta two times x two, and so on, we can also
use the inner products. We get a different
set of parameters. Now these are the alpha i
and beta zero, of course. To estimate
these, we only need the n times (n - 1) / 2
inner products between all pairs of
training observations. At first, this
sounds like a lot. If we are looking at
a big data sample
with 1 million observations, then we would
have 1 million times almost 1 million, divided by two,
inner products, and with n observations we have
n parameters. However, alpha i is nonzero only for the support
vectors in the solution. You can again see why they're
called support vectors, and this means that
we actually don't need all those n
parameters alpha i, but we have a much
smaller number of parameters that we
need to represent the hyperplane from the
support vector classifier. Now, why do we need this
different representation for the linear support vector classifier? Well, the linear support
vector classifier can be extended in a
very simple way: we replace the inner product with a more general so-called
kernel. What is a kernel? A kernel K, at least in the
context of machine learning (the term has slightly
different meanings in different parts of
mathematics and statistics),
is a function that quantifies the similarity
of two observations. So we have two observations,
x one and x two, and a kernel is a function that in some way
measures their similarity; it could also be based on the distance
between those two points. This is what we call a kernel. In the case of the inner product,
this is, of course, a function that compares those two vectors in
Euclidean space. We need to replace
the inner product here with a different kernel, and then we get a more
general extension of the support
vector classifier. This is what finally is a
support vector machine. We can use different kernels. For example, we can
use the linear kernel. We can also use a polynomial
kernel of degree d, or, quite often, the so-called
radial kernel or radial basis function kernel, which is given in equation 37. These are different choices for comparing the vectors x i and x i prime. If we now substitute
the linear kernel with, say, the radial kernel, we get a different
classifier, and if we substitute the linear one
with a nonlinear kernel, the resulting
classifier is known as a support vector machine, or SVM. In equation 38, you see, we now only speak of
a general kernel K, and what is important is,
as I mentioned before, that we don't need all n parameters, because the parameters
alpha i will be nonzero only for
the support vectors. The set S that includes the indices of the support vectors
is usually much smaller than n, and in this representation we only need the indices in the set S. That's the set
of support vectors. That's a support vector machine. Substituting the
linear kernel with a nonlinear kernel gives us a much finer way of classifying
these observations. On the left-hand side, you
see a polynomial kernel. We have our observations
here, and we would have liked those red observations to be separated
from the blue ones, and the polynomial
kernel does this. As you can see
on the right-hand side, the radial basis function
kernel works even better. So these are support
vector machines. They are estimated
in a similar way to the support vector classifier.
All these classifiers, the maximal
margin classifier, the support vector classifier, and these more general
support vector machines, are usually
all summarized under the name of support
vector machines. So these are SVMs. That's the theory, and
in the next video, we'll be looking at
some applications.
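To make the kernel idea from this video concrete, here is a minimal sketch in plain Python rather than the R code used in this course. The polynomial and radial kernels are written in their common textbook forms, which is an assumption because the slide equations are not reproduced in this transcript. The support vectors, the alpha values, beta zero, and gamma are made-up illustrative numbers, not fitted values.

```python
import math


def linear_kernel(x, z):
    """Euclidean inner product <x, z>, the kernel of the linear classifier."""
    return sum(a * b for a, b in zip(x, z))


def polynomial_kernel(x, z, degree=2):
    """Polynomial kernel of degree d in the textbook form (1 + <x, z>)^d."""
    return (1.0 + linear_kernel(x, z)) ** degree


def radial_kernel(x, z, gamma=1.0):
    """Radial basis function kernel exp(-gamma * ||x - z||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


def decision_value(x, support_vectors, alphas, beta0, kernel):
    """f(x) = beta0 + sum over the support vectors of alpha_i * K(x, x_i)."""
    return beta0 + sum(a * kernel(x, sv) for a, sv in zip(alphas, support_vectors))


def classify(x, support_vectors, alphas, beta0, kernel):
    """Return +1 or -1 depending on the sign of the decision value."""
    value = decision_value(x, support_vectors, alphas, beta0, kernel)
    return 1 if value >= 0 else -1


# Hypothetical "fitted" model: two support vectors with made-up coefficients.
support_vectors = [[1.0, 1.0], [-1.0, -1.0]]
alphas = [1.0, -1.0]
beta0 = 0.0

for kernel in (linear_kernel, polynomial_kernel, radial_kernel):
    label = classify([0.9, 1.2], support_vectors, alphas, beta0, kernel)
    print(kernel.__name__, label)
```

Only the kernel argument changes between the three calls; the decision function itself stays the same, which is the point of writing the classifier in terms of inner products.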
18. 18 SVMs in Practice Part 1: Hi, everyone, and welcome
back to our class in artificial intelligence and
machine learning in finance. In the last videos, we've seen the K-nearest neighbor classifier and
the support vector machines, and we now want to apply these to a dataset of credit
card customers. It's not a dataset that
is concerned with defaults, which one would probably
suspect when hearing the words credit card, finance, and
classification, but we actually want to
predict customer churn. We want to see which
customers are more prone to cancel and terminate their contract
and which aren't. The data can be downloaded at Kaggle. If you go to this link here, you can find the data there. You can also find
various notebooks, usually in Python, that feature a wide variety of
different methods for
this classification task, not just K-nearest neighbors and
support vector machines. However, here, because we've
just seen those two models, we want to concentrate
on those two. With the support
vector machines, we are going to use
two variants, the linear one and one with a radial basis
function kernel. We've already seen on a previous slide
in our script that the radial basis function kernel achieves a nonlinear
decision boundary, and this is probably better
than the purely linear one. What is the motivation
behind this application? Well, this is even more relevant in
insurance management. In insurance management,
for example, if you think about
car insurance, the contracts usually
expire after one year, and then you have to
renew your contract. In Germany,
it's usually renewed automatically if you do
not cancel your policy. However, customers can cancel and switch to a cheaper
competitor, so
cancellation rates are a major concern and a major input for the calculation of premiums
in an insurance company. It's the same with the credit
card customers here. If you're the manager, you're concerned about
the question of whether an increasing fraction of your customers quit and
terminate their contracts. You want to slow
this down because obviously this is
bad for business, and you're trying to use machine learning methods, here K-nearest neighbors and
support vector machines, to identify those features
that are able to predict customer churn and
contract cancellation. For that, the manager needs a prediction of who's going to
terminate the contract, so this is what we're
going to do here. The dataset that is
available at Kaggle contains information on
roughly 10,000 customers. We have features
including age, salary, credit card limit, and others, and the
data are unbalanced, meaning that only about 16% of the customers in the dataset have actually canceled
their contract, while 84% have remained with
the credit card company. This complicates training and the interpretation of the predictive performance of the models. Obviously, this would be even
worse if only, let's say, two customers had canceled and 9,998 customers hadn't
canceled the contract. So this is a slight problem. We'll come back to this later. It might also be more favorable to contact a customer
who is not about to churn than to miss
a customer who is. So the question is, after having done the prediction and
the classification, what is the best way
to move forward? Should we contact
those customers that are more likely to cancel
their contracts, or should we
concentrate on those that are
close to cancellation, or on those that will never cancel their contract, in order
to maximize our profits? In this lecture, our primary
focus is on the models. We will therefore
not specifically address this issue,
meaning that only 16% have
canceled and 84% haven't. This is usually known as
the class imbalance problem, and you can look this up in
the literature and in textbooks. Now, for fitting the KNN
model and the SVMs, as well as the elastic net model we've seen in
the previous section, we rely on the R package caret by Kuhn (2020), and we will now briefly introduce this package
in more detail. We've already used it before, but here we are going to talk
about it a little more. caret is short for classification and
regression training. It bundles
activities related to model development into a
very streamlined process. It allows you to test different models with very
few changes to the code. We'll see later on in the R code that
we don't need to rewrite much of the code in order to switch from one
model to the next, and it offers automatic cross-validation
and parameter tuning. This package provides a
consistent modeling syntax. For example, by simply
changing the method argument, you can easily change the model from
K-nearest neighbors to an SVM. In total, it gives you the possibility to use more than 200 different
models from machine learning. Note that behind the scenes, the package is not performing
the modeling itself. It uses
the standard methods from R. For example, if you use lm, it simply refers to
the lm function from the stats package to estimate
a linear regression model. It only simplifies the syntax. For K-nearest neighbors as well, it relies on the class
package by Venables and Ripley. So it is a more convenient way
of doing this in R rather than
reinventing everything. For a complete
list of models, you can go to this link here in the script and access the
documentation of caret. Now, as before, we need to import and
preprocess the data first, and we're doing this by downloading the BankChurners
CSV from Kaggle. You can see the
link here. After having downloaded
it to your computer, you need to import it in R.
We use the package readr, and we read the CSV file into
this object, BankChurners. Now, if you read the
description at Kaggle, it's recommended to
drop the last two columns, which are not needed, and so this is what we are
doing in line six. BankChurners is overwritten
with BankChurners, simply using the columns from one to the number of columns
BankChurners has minus two, so we are dropping
the last two columns. We then print those data, the first seven observations, and you can see it's a table. It includes a client number, the attrition flag, the customer age, gender, and
17 more variables, and approximately 10,000 rows. We have 10,000 bank customers, and as you can see, we have 22
features for the customers. The variable description is also available from Kaggle. You can see, for example, that
CLIENTNUM, as is expected, is a unique
identifier for the customer. The attrition flag is a flag that is one if the account has been
closed and zero otherwise. So if it's zero, it's an
existing customer. If it's one, it is
a customer that has canceled his
or her contract. The dependent count is
the number of dependents, and gender, age, et cetera,
are self-explanatory. We will do some
exploratory analysis. For example, we'll see if
there are any missing values. If you use any and
is.na on BankChurners, you can check whether there
is any NA, not available. That's the missing-value marker in R, and it gives us FALSE back, so there are no missing values. We can also check
for any duplicates. That's any and then duplicated
on BankChurners. Again, it comes out FALSE, so we don't have any duplicates. To see the class imbalance, we use a table of the
attrition flag in BankChurners, divided by the number of rows, and then we round this,
and you can see, we have attrited customers, 16%, and existing customers, 84%. We can clearly see, yes, we have a class imbalance. We are not addressing this
problem in our application, but yes, it is a
problem and it can deteriorate the quality of all classification
models later on, but this is a topic
for another lecture. Now, we continue our
exploratory analysis. We analyze churning
customers by age. We use the package
dplyr. What we do is plot the attrition flag for some subsets, for example, by customer age; you can also do it
by gender, et cetera. The code is a little
bit complicated because the plots
are meant to look nice. If you look at
the next slide, you can see what
comes out of this. You can see the percentages
of attriting customers. It starts at approximately, I would say, 12%
for customers aged 20. Then it goes up for 30, 40, and 50 years, and it drops steeply for customers
aged 60 and 70. We can see, yes, age seems to have some influence
on the attrition flag. So it will probably be one of the predictors that
will stay in our models. It's the same here for customer
education: college, doctorate, graduate, high school, post-graduate, uneducated, and unknown. It seems that, at least
if you have a doctorate, the percentage of
attrition is much higher than for the rest
of the education classes. Again, this might
also be helpful later on in predicting
customer churn. Now, we do have some
categorical features. For example, customer education
is a categorical feature, and it cannot simply
be encoded as a numeric variable, as is sometimes possible for
ordinal variables. For example, if you have
rating scores, it's clear that a
score of, let's say, 50 is better than
a score of 30. With categorical
features, this is not possible because you
cannot really say that, for example, a
doctorate is twice as good as a high
school diploma. Customer education
might be encoded as six binary dummy variables. However, this raises
the dimensionality of the data significantly. If we do this for every
categorical feature, then we will end up with a huge number of
dummy variables. This is an especially
serious problem for nonparametric models like KNN classification because of the curse of dimensionality. We are
artificially increasing the dimensionality relative to the
few observations we have. We have 10,000 observations, but in this context, this
is not that much. The curse of dimensionality
means that if you increase the dimensionality
of your problem, the fixed number of
observations you have will at some point vanish in the huge space that is your
high-dimensional space. For simplicity, we therefore
drop all non-numeric variables. In practice, you should
obviously decide carefully for each
categorical variable whether it should be included or not. If we look at customer education, doctorate seems to
have an influence. You might think about using
one dummy variable that is one for doctorate and zero
for all the other classes. This might be a solution. With age, you could split
into two subsamples, below 60 and above 60, or below
50 and above 50. This might work. But
if you were to include a dummy variable for
each age in years, this would simply cause more problems than
it solves. Okay. We then create the
training and the test set. We remove all non-numeric features and the column CLIENTNUM, which we don't need because the client number
has no predictive value, and we create the
training and test set by randomly including 80% of the observations
in the training set and 20% of the observations
in the test set. So this is what we'll do.
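The random 80/20 split can be sketched as follows. This is a plain Python illustration, not the R code from the slides; the function name train_test_split, the toy rows, and the toy labels are assumptions made for this example, and the fixed seed only serves reproducibility.

```python
import random


def train_test_split(rows, labels, train_fraction=0.8, seed=2021):
    """Randomly assign a fixed fraction of observations to the training set."""
    if len(rows) != len(labels):
        raise ValueError("rows and labels must have the same length")
    indices = list(range(len(rows)))
    rng = random.Random(seed)  # fixed seed so the split can be reproduced
    rng.shuffle(indices)
    n_train = int(round(train_fraction * len(indices)))
    train_idx, test_idx = indices[:n_train], indices[n_train:]
    x_train = [rows[i] for i in train_idx]
    x_test = [rows[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


# Toy data standing in for the numeric customer features and the attrition flag.
rows = [[float(i), float(i % 7)] for i in range(50)]
labels = ["attrited" if i % 6 == 0 else "existing" for i in range(50)]

x_train, x_test, y_train, y_test = train_test_split(rows, labels)
print(len(x_train), len(x_test))
```

Predictors and labels are indexed with the same shuffled positions, so each observation keeps its own response after the split.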
We again set the seed to 2021 for reproducibility, and then we split our dataset into X train, X test, Y train, and Y test. The
attrition flag in BankChurners is in this case our
response variable. We have our four data samples, the test and the
training set, each for the predictors and
the response variable. Now, by default, K-nearest
neighbors is based on the Euclidean distance,
and to ensure that all features contribute equally to the measured distances, we scale the data based on the minimum and
maximum values. So we determine the per-column
minimum and maximum, and then we scale all variables, here X
train and X test, and we call these X train
scaled and X test scaled. We scale them so that,
as we can see here, there's a minimum and maximum of zero and one, respectively. So all variables in the training set have a minimum of zero and a maximum of one, and they are monotonically
scaled in between. As the scaling is based on
data from the training set, the same is not necessarily
true for the test set. You can see here that
if we compare the test scaled and
the train scaled data, they are slightly different. Okay. In the next video, we'll start with the K-nearest
neighbor classification and then do support
vector machines. But you can see that
it takes some time to preprocess the data in order to be able to fit these
models in the first place.
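The min-max scaling described in this video can be sketched as follows, again in plain Python rather than R. The helper names fit_min_max and apply_min_max and the toy numbers are assumptions for illustration. The sketch shows why the test set does not have to end up between zero and one: the minimum and maximum come from the training data only.

```python
def fit_min_max(train_rows):
    """Return the per-column minimum and maximum of the training data."""
    columns = list(zip(*train_rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def apply_min_max(rows, mins, maxs):
    """Scale rows with the given training minima and maxima."""
    scaled = []
    for row in rows:
        scaled.append([
            (value - lo) / (hi - lo) if hi > lo else 0.0  # guard constant columns
            for value, lo, hi in zip(row, mins, maxs)
        ])
    return scaled


# Toy data: column 0 could be age, column 1 a credit limit.
x_train = [[20.0, 1000.0], [40.0, 3000.0], [60.0, 5000.0]]
x_test = [[30.0, 6000.0], [70.0, 2000.0]]

mins, maxs = fit_min_max(x_train)
x_train_scaled = apply_min_max(x_train, mins, maxs)
x_test_scaled = apply_min_max(x_test, mins, maxs)

print(x_train_scaled)  # every column runs from 0 to 1
print(x_test_scaled)   # values above 1 are possible here
```

Reusing the training minima and maxima for the test set keeps information from the test data out of the preprocessing step.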
19. 19 SVMs in Practice Part 2: Hello, everyone, and
welcome to our class in artificial intelligence and machine learning in finance. In this video, we will continue our application of K-nearest neighbors and support vector machines
in the context of predicting class
labels related to customer churn, that is, the termination of
credit card contracts. We've already seen the
data and the data preprocessing,
and we want to start with K-nearest neighbors.
For K-nearest neighbors, we rely here on the
Euclidean distance. We could have also
used the Manhattan distance, the cosine distance, or any other metric that measures the distance
between two observations. And remember, K-nearest neighbors is
called K-nearest neighbors because we still need to decide whether we want to
use five neighbors, ten neighbors, 20
neighbors, et cetera. So the number of neighbors
is a hyperparameter. It's a parameter
that we need to set in order to train
our model, and we determine the optimal value via
ten-fold cross-validation. You can also do these
computations in parallel. As you will see later
on in R, we are using a cluster and parallel computing to speed up the estimation process. This is not really
necessary in this context, but I wanted to include it here to show you how you can use parallel computing to
get results quicker. We determine the optimal
parameter and evaluate the model's predictive
performance, that is, how well we are able to predict the positive
class label, based on the two metrics accuracy
and Cohen's kappa, and we'll have some more details on these two metrics later on. Now, we start again by loading the library caret, and we
set the TR control object, which is simply a container for some options on how
to train our model. By using the function
trainControl, we set the method to
cross-validation, CV. We want to do ten-fold
cross-validation, so number is equal to ten, and
then in line three, again, in order to be able to
replicate our results, we use the set.seed command. Then the cross-validation
is done in parallel. If you do encounter any
problems with the PSOCK cluster, then you should simply
comment this out and do the cross-validation
without it. The library we need
is doParallel, and CL is makePSOCKcluster with six. So we're using six
cores or threads, and then we register doParallel
with this object CL. Into this object, KNN model, we are writing the results
of our trained model. We train based on the X
train scaled data object. We use Y train as our response. We use KNN, K-nearest
neighbors, as the method. We set the tune grid to a grid that spans
one to ten. What does it mean? Well, the tune grid specifies the tuning parameter values to
train the model over. So we're considering one, two, three, and so on up to ten
neighbors. Then we are using our
cross-validation settings, TR control, and the metric that
is used to select the model
is the accuracy, and then we stop the cluster. We print the result after
having done our training. The optimal
number of neighbors is the same here for both accuracy and Cohen's kappa. In general, this
is not necessarily the case; in this
scenario it is. As we can see, we get the resampling results across tuning parameters, with accuracy and kappa for one, two, three, up to ten neighbors. Accuracy was used to select the optimal model, and
the final value for the model was six. We are ending up with a six-nearest-neighbor model
that we train on our data. To see the predictive
performance, we predict the class labels
for the test dataset. This is done using predict.
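As a rough illustration of what happens behind the training and prediction calls, here is a minimal K-nearest-neighbor sketch in plain Python with a simple cross-validation over candidate values of k. It is not the caret implementation: the function names, the interleaved fold assignment, the five folds, and the toy data are assumptions chosen to keep the example short.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two observations."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(x_train, y_train, x_new, k):
    """Predict one label by majority vote among the k nearest training points."""
    order = sorted(range(len(x_train)), key=lambda i: euclidean(x_train[i], x_new))
    votes = Counter(y_train[i] for i in order[:k])
    # On a tie, the label that was counted first (the nearer neighbor) wins.
    return votes.most_common(1)[0][0]


def cv_accuracy(x, y, k, folds=5):
    """Mean accuracy of k-NN over a simple interleaved cross-validation split."""
    scores = []
    for fold in range(folds):
        test_idx = [i for i in range(len(x)) if i % folds == fold]
        train_idx = [i for i in range(len(x)) if i % folds != fold]
        x_tr = [x[i] for i in train_idx]
        y_tr = [y[i] for i in train_idx]
        hits = sum(knn_predict(x_tr, y_tr, x[i], k) == y[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / len(scores)


def choose_k(x, y, candidates, folds=5):
    """Return the candidate k with the highest cross-validated accuracy."""
    return max(candidates, key=lambda k: cv_accuracy(x, y, k, folds))


# Toy scaled data: two well-separated groups of customers.
x = [[0.10, 0.20], [0.20, 0.10], [0.15, 0.30], [0.30, 0.25], [0.25, 0.10],
     [0.80, 0.90], [0.90, 0.80], [0.85, 0.70], [0.70, 0.75], [0.95, 0.90]]
y = ["existing"] * 5 + ["attrited"] * 5

best_k = choose_k(x, y, candidates=range(1, 6))
print("chosen k:", best_k)
print("prediction:", knn_predict(x, y, [0.20, 0.20], best_k))
```

In the lecture the same idea is applied with ten folds and a grid of one to ten neighbors; the toy example only uses smaller numbers.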
We use the fitted and trained KNN model object, and the new data is X test scaled
rather than X train scaled. We see that these
are the predictions, and then we can use the so-called
confusion matrix, also called error matrix, to compare our predicted response variables with the reference
data, which is Y test. We are actually doing an out-of-sample
prediction accuracy test, and this is done with the so-called confusion
matrix or error matrix, which is quite common
in machine learning. This is the confusion matrix. You can see here in
this first part a simple matrix that compares the
predicted class labels, in this case attrited
customers and existing customers, with the observations in
the reference dataset. We have attrited customers
and existing customers. As you can see, this is
fine, and this is fine. It means that
those that actually were attrited customers were also
predicted as attrited customers, and the existing ones, 1,672, were also predicted as existing. Not surprisingly,
if we use a red line here, this one is not good,
and this one is also bad for our
prediction accuracy. Why? Because these are
observations that fall into class one and are attributed
to class two and vice versa. These are the errors our
prediction has made. We can also see the accuracy, a confidence interval
for the accuracy, the no-information rate, Cohen's kappa, and
some other metrics, and we want to
comment on these now. Accuracy is
defined very simply as the percentage of correctly
classified observations. In our example,
not surprisingly, we have 189 plus 1,672 divided by the total
number of observations, which is 189, 1,672, and those that were erroneously attributed to one
of the two classes, and thus we get an
accuracy of 90%. The no-information
rate is defined as the largest proportion
of the observed classes, and in our example,
this corresponds to the proportion of
existing customers. We take 53 plus 1,672, again divided by the total
number of observations. It's the highest accuracy that can be achieved by a constant prediction,
meaning what? If, for example, we say attrited is one,
so termination is one and existing customer is zero, and let's say this
is our dataset, then we could simply predict a constant. Let's say one, one, one, one, one, one,
one, one, one. How many
observations would we get right? One, two, three, four, five out of
nine. The same could be done, let's
say, if we predict zero everywhere: one, two, three,
four, and so on. We are simply saying we don't
do any prediction at all, we simply set it all
constant to one or zero. The no-information rate
then is the highest accuracy that you can achieve
by a constant prediction. In this case, this would
be, let me just check,
the existing
customer, so no termination. This is the highest
accuracy we can achieve simply by setting all
predictions to one or to zero. You have to decide which
one is better, of course. Cohen's kappa is a measure of a classifier's
performance relative to how well the model would
have fared simply by chance. You therefore compare the
accuracy of the model to the hypothetical probability
of an agreement by chance. In our example, the no-information
rate was about 83.5%, so 83.49% of the observations belong to
existing customers, while 16.51% belong to
the attrited customers. The K-nearest neighbor
model classifies about 88% as existing and 11.71%
as attrited customers. The probability of an
agreement by chance can be calculated as follows: you take the 83% times the
88%, plus the 16% for the attrited
customers times the 11.71%, and this gives us 75.65%. Cohen's kappa is now defined as the accuracy minus
this probability, divided by one minus
this probability, and it gives you 0.5926. That's Cohen's kappa. When you compare two models, a higher kappa signals a better
predictive performance, and the maximum value is one. However, there is
no standardized way of interpreting its value. If you have two models, take the one with the higher
Cohen's kappa. A negative value of
kappa would signal that the model's predictions are worse than predicting by chance. This is even worse than, I guess, setting it
all constantly to one. Now we have some
additional metrics. The error costs of positives, in our example the customers who terminate their contracts, and negatives are usually different.
In this context, sensitivity and specificity are more informative than accuracy. We start with sensitivity, also called recall
or true positive rate. It is defined as the number of
correct positive predictions divided by the total number
of positives; that's 55%. Specificity is
the true negative rate, and it's the number of
correct negative predictions divided by the total number
of negatives; it's 96.93%. In our case, the K-nearest
neighbor classifier is good at predicting customers that do not terminate the
credit card contract, but bad at predicting
customers that do. What is the
possible reason? What's the problem here? Well, if you remember, we have a highly unbalanced data sample: only a few terminations and a lot of customers
who stay with us, so it's quite difficult, based on this data sample, to achieve
a higher sensitivity. That's the reason.
This is the example of the K nearest neighbor classifier
and in the next video, we'll look at the
support vector machine.
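The metrics discussed in this video can be recomputed from the four cells of the confusion matrix. The sketch below is plain Python, not the caret output. Three counts (189, 1,672, and 53) are quoted in the lecture; the fourth, 152 attrited customers predicted as existing, is not stated and is an assumption inferred from the quoted shares of 16.51% and 11.71%.

```python
def classification_metrics(tp, fn, fp, tn):
    """Accuracy, no-information rate, Cohen's kappa, sensitivity, specificity.

    The positive class is the attrited customer.
    """
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    actual_pos = (tp + fn) / total   # share of truly attrited customers
    actual_neg = (fp + tn) / total   # share of truly existing customers
    pred_pos = (tp + fp) / total     # share predicted as attrited
    pred_neg = (fn + tn) / total     # share predicted as existing
    # Best accuracy achievable by always predicting the larger class.
    no_information_rate = max(actual_pos, actual_neg)
    # Probability that prediction and reference agree purely by chance.
    chance_agreement = actual_pos * pred_pos + actual_neg * pred_neg
    kappa = (accuracy - chance_agreement) / (1.0 - chance_agreement)
    return {
        "accuracy": accuracy,
        "no_information_rate": no_information_rate,
        "chance_agreement": chance_agreement,
        "kappa": kappa,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# tp, tn, and fp are quoted in the lecture; fn = 152 is an inferred assumption.
metrics = classification_metrics(tp=189, fn=152, fp=53, tn=1672)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts the results should come out close to the figures quoted in the lecture, roughly 90% accuracy, a sensitivity of about 55%, a specificity of about 97%, and a kappa of about 0.59; small differences in the last digit come from the rounded percentages used in the spoken calculation.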
20. 20 SVMs in Practice Part 3: And welcome to this class in artificial intelligence and
machine learning in finance. In this video, we want
to continue our example of the application of classifiers
in a financial setting, and we are now going
to use support vector machines to
achieve almost the same results as the K-nearest neighbor classifier in the context of
predicting customer churn. We will fit a linear SVM, as well as a support
vector machine with a radial basis
function kernel, to the training data on customer churn in the
credit card data sample. We are going to predict the
labels for the test data, that is, contract
termination or no termination. For the linear support
vector machine, that is, without a kernel or,
more precisely, with the linear kernel, we determine the most
appropriate value for the hyperparameter C, which is also known as
cost, out of the grid two to the power of minus five, two to the power
of minus four, and so on. Again, as we've seen in the last video for the K-nearest
neighbor model, this can be done in parallel to speed up computation time. You don't need to do
this, but it is a good exercise to practice
parallel computing. The support vector machine with the radial basis function kernel has two hyperparameters, sigma and C. We do not provide a set
of possible values, but we simply let the
caret package in R try out ten reasonable
parameter combinations, and it chooses the
combinations itself. This is the parameter tuneLength in the syntax of R and
the caret package. So we start with the
hyperparameter tuning with the linear support
vector machine or the support
vector classifier. We set the seed in line one to be able to replicate the results later on. We again use train control as an option later on in the train command in line eight. In line two, we write down train control with cross validation as the method and five-fold cross validation in the object TR control. We set the grid as we described on the previous slide, and then we use six clusters using the PSOCK cluster in R. Then SVM linear is our object, and we train the support vector machine using the scaled X train data and y train for the response. As the method, we set svmLinear, and then this is the result and we stop the cluster.
We print out the results, and you can see the resampling results across the tuning parameters, with different values for C and then accuracy and kappa. We've seen that in the last video. Accuracy was used to select the optimal model, and it was optimal for a hyperparameter C equal to 0.03125. In this case, choosing C with regard to accuracy leads to a different value than choosing it with regard to Cohen's kappa, but we said that we want to use the accuracy in this case, and it's the standard and default setting in the caret package.
We can do the same with the radial basis function kernel. Again, we set the seed and initialize the parallel computing cluster, and then svmRadial is the option we need to set the method to, and the tune length is ten. We stop the cluster and get the results for the second support vector machine. Again, you get the resampling results from cross validation, and you see that the tuning parameter sigma was held constant at a value of 0.05. Accuracy was again used to select the optimal model, and the final values were sigma equal to 0.05 and C equal to eight. This is what comes out of the training of the model.
We predict the classes for the scaled test data, and we do this both for the linear classifier and the radial basis function support vector machine. Then we also print the confusion matrix for comparing our predictions with the test data sample as the reference, both for the linear support vector machine and the radial basis function kernel. This is the linear support vector machine. Again, you can see from this very simple matrix, 186 and 1679, that it looks almost the same as for the k-nearest neighbor. Let's compare this.
Let's go back some slides and you see 189 and 1672. Actually, it's almost the same accuracy, 90.27%, and the additional metrics here in the confusion matrix. For the radial basis function, you can see it is slightly better for existing customers, by one observation, but it's hugely better for attrited customers. As you can also see in the accuracy, it increases by almost 3% in contrast to the linear SVM. So the radial basis function kernel support vector machine actually does a much better job than the two competing models.
The predictive performance of the linear SVM is very similar to the k-nearest neighbor classifier in terms of accuracy, kappa, and sensitivity. The support vector machine with a nonlinear radial basis function kernel improves on this, shifting the accuracy to almost 94% and increasing kappa to 0.7647. This substantial increase is mainly due to an improvement in predicting attriting customers, as we've seen. So the sensitivity is actually higher in this case, almost 20% higher in comparison to the linear support vector machine. So yes, using a nonlinear model here makes sense: it allows us to increase the sensitivity and, as a result, the accuracy of our forecasting. So this is the example for classification, and in the next subsection, we want to look at regression trees.
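The accuracy, sensitivity and kappa figures quoted in this video can be recomputed from the four cells of a confusion matrix. The following is a minimal Python sketch, not the R and caret code used in the lecture. It takes the two correctly classified counts quoted for the linear SVM (186 and 1679) and assumes, as a labeled assumption, that the test sample holds 341 attrited and 1,725 existing customers, which are the totals mentioned in the later boosting video. The function name `binary_metrics` is made up for this illustration.

```python
def binary_metrics(tp, fn, fp, tn):
    """Accuracy, sensitivity, specificity and Cohen's kappa for a 2x2 confusion matrix."""
    n = tp + fn + fp + tn
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)  # share of attrited customers that are detected
    specificity = tn / (tn + fp)  # share of existing customers that are kept as such
    # Agreement expected by chance, computed from the row and column totals
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    kappa = (accuracy - expected) / (1 - expected)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "kappa": kappa,
    }


# Correct predictions quoted in the lecture for the linear SVM
TP, TN = 186, 1679
# Assumption: class totals of the test sample taken from the later boosting video
FN, FP = 341 - TP, 1725 - TN

metrics = binary_metrics(TP, FN, FP, TN)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Under this assumption the accuracy comes out at roughly 90.3%, in line with the 90.27% quoted above, while the sensitivity stays much lower. That gap is the effect of the class imbalance discussed in the video.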
21. 21 Decision Trees for Classification: Hello, everyone, and welcome back to our class in artificial intelligence and machine learning in finance. In this section, we want to look at regression trees as yet another method in machine learning for classification. That is, we want to classify customers. We want to classify the observations we have, and maybe even stocks in an asset pricing application, into classes. These could be good customers and bad customers, or customers who terminate their contract or are prone to terminate their contract and those who are not. For this, we want to use regression trees as another method in comparison to, for example, the k-nearest neighbors model and the support vector machines we've seen in section five. So this is a section on regression trees.
What are trees? Again, just like support vector machines, k-nearest neighbors, and many classification models, these models can be used for classification, but also for regression problems. It's simply a matter of changing the response variable from, for example, a binary variable to a metric one. These trees involve stratifying or segmenting the predictor space into a number of simple box-shaped regions. We've actually seen this with the support vector machines and the support vector classifier. What happened there was that we took the predictor space, for example the three-dimensional space, and cut it into two halves. So we stratified or segmented the predictor space. This segmenting of the predictor space is performed based on a set of splitting rules, and these rules can then also be summarized in a tree. In the end, what we are doing is setting up a set of rules that decide, for example: if you are a smoker, you're class one; if you're not a smoker, you're class two. This is, later on in one of our applications, a very powerful predictor; as you can imagine, it is about health.
So this is what a tree looks like on the right-hand side, and it corresponds to the segmentation on the left-hand side. You can see we have two predictors, X1 and X2, and we are segmenting the predictor space into one, two, three, four, five boxes. This corresponds to a tree that starts out with X1. If X1 is smaller than or equal to t1, that's the cutoff, then it yields boxes one and two. And how do we decide whether it's one or two? Well, if X2 is smaller than or equal to t2, then we get to R1 and we are here. If X2 is larger, then we are in R2, and we get this box. If X1 is actually larger than t1, then we are in this area. Then we still need to decide: is X1 smaller than or equal to t3? Then it's R3. If it's larger, then we are in this area, and we still have to decide whether it's R4 or R5, and we do this with this cutoff: it's either R5 or R4. This is the tree, and
this is the segmentation. Now, how do we do predictions? Well, for a specific observation, they are typically made by a majority rule, by majority vote, in classification, meaning if it's more than 50%, then it's on the right-hand side, and if it's less than 50%, then on the left-hand side, or by using the mean or mode of the training observations in regression analysis, in the region that corresponds to the given observation. For a given region, the prediction for every observation that falls into this region is, of course, the same. We get the same prediction.
How do we build a tree? Now, in theory, for making predictions as on the previous slide, the regions could have any shape. We could have selected circles, we could have selected rectangles, and we could have also said, okay, well, it looks like this. And this is one region.
This is another one. This is the third one and this is the next one. We could have done this. The problem is, of course, that this complicates things a lot. So for simplicity and interpretation, the predictor space is often divided into high-dimensional rectangles or boxes. These boxes are chosen such that the classification error, or the squared error of the residuals in regression analysis, is minimized, so we have an optimization problem again.
Now, if you think about what is possible, think about this predictor space as maybe an image and the boxes as pixels. Obviously, with a higher resolution, you get more pixels, you get more boxes. So, theoretically at least, it's possible to partition the predictor space into an infinite number of boxes. It is computationally infeasible to consider every possible partition of the feature space, even into a finite number of boxes. And if you decrease the size of the boxes, if you increase the resolution, you could even get to an infinite number of boxes. Therefore, one typically relies on a top-down, so-called greedy approach that is commonly referred to as recursive binary splitting.
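A single greedy step of recursive binary splitting can be sketched in a few lines. The Python snippet below is only an illustration under stated assumptions: it uses the regression case with the residual sum of squares as the cost, tries every predictor and every observed value as a cut point, and keeps the best split without looking further down the tree. The toy data and the names `rss` and `best_split` are hypothetical and do not come from the lecture.

```python
def rss(values):
    """Residual sum of squares around the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(X, y):
    """Try every predictor j and cut point s; return the split with the lowest total RSS."""
    best = None  # (total RSS, predictor index j, cut point s)
    for j in range(len(X[0])):
        for s in sorted({row[j] for row in X}):
            # Region 1: X_j < s, region 2: X_j >= s
            left = [yi for row, yi in zip(X, y) if row[j] < s]
            right = [yi for row, yi in zip(X, y) if row[j] >= s]
            if not left or not right:
                continue  # a split must leave observations on both sides
            total = rss(left) + rss(right)
            if best is None or total < best[0]:
                best = (total, j, s)
    return best


# Hypothetical toy data: the response jumps once the second predictor reaches 7
X = [[1, 2], [2, 8], [3, 3], [4, 9], [5, 1], [6, 7]]
y = [10.0, 20.0, 10.5, 19.5, 9.5, 20.5]

total, j, s = best_split(X, y)
print(f"split on predictor {j} at cut point {s}, total RSS {total:.2f}")
```

Growing a full tree means calling the same search again inside each of the two resulting regions, which is what makes the procedure recursive.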
So what do you do? It's top-down because, as we've seen on this slide here, you start at the top of the tree, where you have all observations and they belong to a single region. So you start with the full sample, and then you successively split the full feature space into halves, and then you go on top-down. And it's greedy because at each step, you don't consider what could be happening, let's say, two levels down. You only consider what is best right now. That's why it's greedy. You only consider the things at this box in this step. It means that if you now split, for example, your observations into smokers and non-smokers by majority rule, you consider only the best outcome for the minimization of the MSE, of the residual sum of squares, or, if you consider the classification error, which rule and which feature at this point get you the largest reduction of your cost function. That's why it's greedy. You don't consider what is optimal overall, or whether it would make sense to choose a different feature here, for example to say, okay, let's start out with age here and consider smoking there. If smoking gives you the best result here, then you start out with smoking and you don't consider the full tree, and this is why it's greedy.
Okay. Now, for performing recursive binary splitting, you first select the predictor Xj and the cutoff or cut point s such that the squared errors in regression, or the classification error rate in classification, are minimized over the resulting regions: the observations X with Xj smaller than s, and the observations X with Xj larger than or equal to s. In classification tasks, you can sometimes also use the Gini index or the so-called entropy as a criterion for making the tree splits. You can see in the footnote, you probably know the Gini coefficient, and the entropy is actually defined quite similarly. In this process, all predictors and all possible values for the cut point s are considered, and the process is then repeated in each of the resulting regions. As we've seen, you go down the tree and then you make another cut, and you look for the best predictor and the best cut point to further minimize the squared regression errors or the classification error rate at that level. You don't look one step ahead, but you do it in a greedy way. This continues until some
stopping criterion is met, for example, if you believe the classification error rate has been minimized sufficiently. By adding further splits to the tree, by going down further and growing the tree deeper, the squared regression errors or the classification error rate can only decrease or stay virtually the same. The problem is that this leads to overfitting. You can add layer after layer to your tree, but as the classification error rate cannot increase, it can only decrease, just like the R squared in regression analysis, the model that you're getting is way too complex. So indeed, you should use smaller trees with fewer splits. This might lead to lower variance and better interpretation at the cost of some minuscule increase in your bias. Remember the variance-bias trade-off: the variance will be lower, but the bias might be a little bit higher.
Another possible remedy to overfitting is to grow the tree only as long as the reduction in classification or regression error exceeds some threshold. This will lead to smaller trees, and this minimum required reduction can be described by a complexity parameter, balancing reductions in classification or regression error against the complexity of the model. In the application, we will see how this works. We will see that a high value of the complexity parameter leads to more shallow trees, while a low value yields deeper trees, and we will also see how, in some instances, if we take the basic models and don't look at overfitting, the models will not generalize well to new data. We will have huge overfitting. And again, as you might have imagined, the optimal value of the complexity parameter, which is a hyperparameter, is determined via cross validation.
So this is a tree versus a linear model. You can see that, for example, on the left-hand side, we have a standard classification, for example using a linear support vector machine. You see, we cut through this into the yellow and the green area. And obviously, if we have a linear boundary, using boxes and a regression tree leads to a large bias. You will see this will lead to a huge classification error. Why? Because this is probably classified wrongly, because the majority is yellow. This is probably classified as green. This will be an error, and this, and this. Why is that? Well, as you can see, even though we are using linear boxes, it might be that this doesn't fit the data well. However, if we have a nonlinear boundary, you can see that using a very simple linear support vector classifier here leads to a huge classification error, whereas we get an almost, well, actually, it is a perfect classification using the regression tree. In this case, we have a nonlinear decision boundary, and with the regression tree, although we are using linear boxes, rectangles, this is a perfect classification. So even though we are using linear objects, rectangles, it might actually be that the regression tree fits the data much better if we have a nonlinear decision boundary.
Now, what are the advantages of trees? They're easy to explain. I don't actually need any formula. They closely mirror human decision making. We are simply looking for boxes and making our decisions based on cutoff points: smoker, non-smoker; low age, high age. Trees can be displayed graphically, and they can handle qualitative predictors very simply: smoker, non-smoker; one, zero. The disadvantages are that single trees often exhibit an inferior predictive accuracy as compared to, let's say, support vector machines, and they can be very non-robust: a small change in the data might lead to a completely different tree. What you can do is aggregate many trees using methods like bagging, random forests, or boosting to improve the predictive performance, and we'll see this in our application. I'll come back to boosting and random forests in probably two or three videos. So these are the advantages and disadvantages. In the next video, we'll have a look at the applications.
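The interplay between recursive splitting, the Gini index and a complexity threshold can be made concrete with a small sketch. The Python code below is a simplified illustration and not the rpart implementation used later in the course: a split is kept only if it lowers the weighted Gini index by at least `cp`, which is a simplified stand-in for rpart's complexity parameter. The toy data (age and a smoker flag) and the function names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini index of a list of class labels: 1 minus the sum of squared class shares."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def grow(X, y, cp, n_total=None):
    """Recursive binary splitting. A split is kept only if it lowers the
    weighted Gini index by at least cp (a simplified complexity parameter)."""
    n_total = n_total or len(y)
    majority = Counter(y).most_common(1)[0][0]
    node_impurity = len(y) * gini(y) / n_total
    best = None  # (gain, predictor index j, cut point s)
    for j in range(len(X[0])):
        # Skipping the smallest value keeps both sides of every split non-empty
        for s in sorted({row[j] for row in X})[1:]:
            left = [yi for row, yi in zip(X, y) if row[j] < s]
            right = [yi for row, yi in zip(X, y) if row[j] >= s]
            children = (len(left) * gini(left) + len(right) * gini(right)) / n_total
            gain = node_impurity - children
            if best is None or gain > best[0]:
                best = (gain, j, s)
    if best is None or best[0] <= 0 or best[0] < cp:
        return majority  # leaf: predict by majority vote
    _, j, s = best
    left_idx = [i for i, row in enumerate(X) if row[j] < s]
    right_idx = [i for i, row in enumerate(X) if row[j] >= s]
    return (
        j,
        s,
        grow([X[i] for i in left_idx], [y[i] for i in left_idx], cp, n_total),
        grow([X[i] for i in right_idx], [y[i] for i in right_idx], cp, n_total),
    )


def depth(tree):
    """Number of split levels; a bare label is a leaf with depth 0."""
    if not isinstance(tree, tuple):
        return 0
    return 1 + max(depth(tree[2]), depth(tree[3]))


# Hypothetical toy data: columns are age and smoker (1 = yes)
X = [[25, 0], [30, 0], [35, 0], [40, 0], [55, 0], [60, 0], [30, 1], [45, 1], [65, 1]]
y = ["low", "low", "low", "low", "high", "high", "high", "high", "high"]

for cp in (0.01, 0.2, 0.5):
    print(f"cp = {cp}: depth {depth(grow(X, y, cp))}")
```

On this toy sample a larger `cp` yields a shallower tree, which mirrors the statement above that a high complexity parameter leads to more shallow trees.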
22. 22 Decision Trees for Classification Practical Example: Hello, everyone, and welcome to our class in artificial intelligence and machine learning in finance. In this video, we want to learn more about the applications of classification trees. We'll have a look at the use of regression trees, classification trees, tree models in general. And we will again use the credit card customer dataset from our previous examples, where we saw how it can be used to exemplify classification by support vector machines and k-nearest neighbor models. We are going to predict customer churn. It's a classification we want to forecast: we want to predict the termination of contracts by customers of our credit card company. To provide a more complete picture, we will also employ boosted decision trees. These are special ensembles of trees, similar in spirit to random forests. We'll see that the initial models, the initial classification trees, do not significantly improve on the performance of support vector machines and k-nearest neighbor models. But later on we will see that if we employ boosted decision trees, yes, we can achieve some increase in accuracy.
In the regression task later on, we will forecast health insurance premia. In this case, we have a metric response variable, in contrast to the binary one in the previous customer churn example, where it's only about termination or no termination. Both data samples are available at Kaggle, and if you take the cursor here, you can see that we've included the links at Kaggle for both the customer churn data and the insurance premium data. Okay. So we'll start with
the classification task. Again, a short reminder, if you haven't watched the other video, of what the credit card customer dataset is all about. It is a data sample of approximately 10,000 observations, and you, as the manager, are responsible for looking after the customers. You're worried that an increasing number of customers quit your services and terminate the contract, and you want to slow down customer churn. So the manager wants to proactively contact customers who are about to leave the credit card services in order to change their decision. You are actually trying to actively influence the behavior of your customers based on a prediction of whether he or she is likely to terminate the contract. Therefore, the manager needs predictions on who is going to quit the services, so you want to classify those observations. The dataset includes more than 10,000 observations. You have features like age, salary, credit card limit, and other features. There is no data missing, and the data is unbalanced, with only about 16% of the customers having churned. We've already talked about this class imbalance problem a little bit. It led to the fact that in the previous lecture, where we used support vector machines and k-nearest neighbor models, the accuracy wasn't perfect because of the few observations we have where customers actually terminated the contract.
So we start again with importing and preprocessing the data. You can also skip to slide 220 and the following ones, where you can see we did the same in our previous example. You can download the CSV file from this link, and you need to import the CSV file. Again, a reminder that in many instances when using R, a CSV file is actually the best choice because it includes as little additional graphical information and layout as possible. So it's the pure data. That's why CSV files are a nice input format for many statistics programs. We drop the last two columns. According to the dataset description, we don't need those two columns, and then bank churners is the object we create from the object we've imported. This is in line eight here; we highlight this for you. We are dropping the last two columns, so we are only using columns one to the number of columns minus two. For example, if the original object included ten columns, we are now using only the first eight. Then, as opposed to the k-nearest neighbor and the SVM example, we do not exclude categorical features. We can actually work with them here.
Next, we create the training set and the test set. We randomly select 80% of the observations for the training set, and the remaining 20% are included in the test set. We exclude the client number, which is a number for identifying individual customers and has no predictive value. It doesn't carry any economic information. We use the dplyr library. Again, we use the set.seed command in order to be able to replicate the results, and then we set up the training set and the test set. What we do is sample integers for our 80% and 20%, and then we select based on the client number. That's what we do in line seven. It is the same for the test set, and then of course we need the same for the response value: y train and y test are the response values in the training and the test sample. Now, decision trees do not require feature scaling because they are not sensitive to the variance in the data. That's why we do not need to scale our observations as we did in the previous lecture on support vector machines. Now, we fit the
classification tree. Again, as is common in many of our machine learning algorithms, we need to select the hyperparameter, which is the parameter that governs the training, the learning process. We again use the caret library, and we do this in parallel. We use cross validation; as you might know by now, trainControl is the function in caret to select the method for hyperparameter tuning, and what we do here is ten-fold cross validation. We set the seed in line four. If you again encounter any problems with the parallelization, just comment these two lines out. Then tree model is what we train based on X train with the response in y train. The method is rpart, which of course stands for partitioning. The tune length is 20: training trees is relatively fast, so we can consider more possible values. TR control is what we set in line three; that's our option for using ten-fold cross validation to select the hyperparameter. The metric that is used for training is the accuracy, and then we stop the cluster in line 12.
What happens here if we print out the tree model and the best tuning parameter? It is the complexity parameter, and the optimal value here is 0.00285 and so on. As you can see from the plot here, which comes out if we plot tree model, with an increasing complexity parameter you actually get lower accuracy from cross validation. This complexity parameter, cp, is the minimum improvement required at each node to make a further split. Remember that we're trying to train a tree here, meaning that at each level, at each node, you need to decide whether you want to go one level deeper or whether you say, okay, this is enough, the tree is deep enough, we have enough accuracy. So this parameter balances possible reductions in the classification error via further splits against the complexity of the model, which is the number of splits or the depth of the tree. A high value of cp, of the complexity parameter, leads to more shallow trees, while a low value yields deeper trees that might overfit. On the following slides, we'll exemplify this so that you can get an idea of what happens.
Now, the rpart.plot R package provides a rather convenient visualization of tree models. However, it requires a tree model that was fitted by rpart directly and not via the caret package. Therefore, we fit the same model again, using just a different package in R and with the optimal parameters determined in the hyperparameter tuning step. So we basically get the same model, just with a different package, but in this package, we are able to do some nice plots in R.
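The two-step pattern described here, choosing the complexity parameter by cross validation and then refitting one final model with the chosen value, can be sketched independently of caret and rpart. The Python code below is a deliberately small stand-in under stated assumptions: the "tree" is a single split on one feature, `cp` is the minimum gain in training accuracy required to keep that split, and the data, the grid and all function names are hypothetical.

```python
import random


def fit_stump(x, y, cp):
    """One-split classifier on a single feature. The split is kept only if it
    improves training accuracy over the majority vote by at least cp."""
    majority = max(sorted(set(y)), key=y.count)
    base = y.count(majority) / len(y)
    best = (base, None, None, None)  # (accuracy, cut, left label, right label)
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        cut = (lo + hi) / 2  # candidate cut points are midpoints between observed values
        left = [yi for xi, yi in zip(x, y) if xi < cut]
        right = [yi for xi, yi in zip(x, y) if xi >= cut]
        l_lab = max(sorted(set(left)), key=left.count)
        r_lab = max(sorted(set(right)), key=right.count)
        acc = (left.count(l_lab) + right.count(r_lab)) / len(y)
        if acc > best[0]:
            best = (acc, cut, l_lab, r_lab)
    acc, cut, l_lab, r_lab = best
    if cut is None or acc - base < cp:
        return lambda xi: majority  # no split: always predict the majority class
    return lambda xi: l_lab if xi < cut else r_lab


def cv_accuracy(x, y, cp, k=5, seed=1):
    """Mean accuracy of the stump over k cross-validation folds."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    scores = []
    for f in range(k):
        test = idx[f::k]
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        model = fit_stump([x[i] for i in train], [y[i] for i in train], cp)
        scores.append(sum(model(x[i]) == y[i] for i in test) / len(test))
    return sum(scores) / k


# Hypothetical, cleanly separated toy data
x = list(range(1, 11)) + list(range(21, 31))
y = [0] * 10 + [1] * 10

grid = [0.001, 0.1, 0.9]  # candidate complexity parameters (illustrative values)
scores = {cp: cv_accuracy(x, y, cp) for cp in grid}
best_cp = max(scores, key=scores.get)

# Refit once on the full training data with the selected value
final_model = fit_stump(x, y, best_cp)
print(scores, best_cp)
```

The refit at the end plays the role of the second fit in the lecture: the tuning step only selects the value, and the final model is trained separately with it.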
So this is what we do: we visualize this best tree model, and the tree model is initialized with rpart. Again, we need rpart.plot as the package, and then we can plot it. The result is shown on this slide. If you have the slides, you can zoom in there, but I can do this here as well and show you. We start out here with existing customer and 100%, then total trans CT, and we have total trans AMT, the total transaction amount, total relationship count, and you can see the numerous features we have. Customer age is quite clear, customer age being larger than or equal to 37, and then total revolving balance. These are all the cutoffs, and in the end, you get a tree that looks like this. So if you zoom in, or do it at home with the same code, you get this tree. As one can see, it is actually quite deep.
We will now raise the complexity parameter to obtain a more shallow tree. We multiply the complexity parameter from the current tree by 50 and do the same again, and what you get is on the next slide here. You can see total trans CT, that's the total transaction count, and the total revolving balance on the credit card. Again, these two are quite high up in the tree. But this is where the partitioning stops. That's the whole tree. We have the customer: if this total transaction count is smaller than 55, yes or no, we get a prediction here. We have attrited customer, existing customer, existing customer. This is the whole tree. It is quite shallow, and that's because we increased the complexity parameter. Okay. Now, let's talk a little
bit about the cutoff level. Classification in each node is, as I mentioned in the previous video, performed by majority vote. However, it is possible and sometimes even useful to choose a cutoff level that is different from 50%. So it does not have to be: above 50%, then it's class one, otherwise it's class zero. You can also vary this cutoff level, and we can influence the sensitivity and the specificity by choosing a different cutoff level. The accuracy closely follows the specificity due to the class imbalance in the data. This is shown for the training data and the deep tree. In this plot, you can see the specificity in red, the sensitivity in blue, and the accuracy in black. You can see what happens if you take different cutoff probabilities; you probably shouldn't use zero, or values close to zero or close to one. But as you can see, it's not constant across these different cutoff probabilities, and there might be a choice that increases the accuracy or the sensitivity of your model.
Okay. So what's the predictive performance of this classification tree? We rely on the model determined via cross validation, which is the deep tree. We don't use the shallow one with only two levels. Again, we use predict, and we print the confusion or error matrix based on the test data sample, and this is what we get. In the confusion matrix, we get 252 and 1677 customers who have been classified correctly. The accuracy thus is 93%, with this confidence interval, and there is corresponding information on the no information rate, kappa, sensitivity, and so on. The predictive performance is actually comparable to that of the support vector machine that uses a nonlinear radial kernel; see also slide 248. But we can do even more, and this is what we do in the next video. We will use random forests, we will use boosting to improve the prediction accuracy of this classification tree, and then also come back to the regression example.
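The effect of moving the cutoff away from 50% can be illustrated with a handful of made-up numbers. The Python sketch below assumes that a fitted model returns a churn probability per customer; the probabilities and labels are hypothetical and are not taken from the course dataset, and `rates_at_cutoff` is a name chosen for this illustration.

```python
def rates_at_cutoff(probs, labels, cutoff):
    """Classify as churn (1) when the predicted probability reaches the cutoff."""
    preds = [1 if p >= cutoff else 0 for p in probs]
    tp = sum(p == 1 and t == 1 for p, t in zip(preds, labels))
    tn = sum(p == 0 and t == 0 for p, t in zip(preds, labels))
    pos = sum(labels)
    neg = len(labels) - pos
    return {
        "sensitivity": tp / pos,
        "specificity": tn / neg,
        "accuracy": (tp + tn) / len(labels),
    }


# Hypothetical predicted churn probabilities; 3 churners (1) among 10 customers
probs = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.45, 0.40, 0.60, 0.90]
labels = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

for cutoff in (0.3, 0.5, 0.7):
    print(cutoff, rates_at_cutoff(probs, labels, cutoff))
```

In this toy sample a lower cutoff catches every churner at the price of more false alarms among existing customers, so sensitivity rises while specificity and accuracy fall. Because most customers are non-churners, accuracy moves with specificity, as noted above.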
23. 23 Gradient Boosted Classification Trees: Hello, everyone, and welcome to our class in artificial intelligence and machine learning in finance. In this video, I want to talk a little bit about some problems associated with classification trees and what can be done to improve the predictive accuracy and the predictive performance of decision and classification trees. As we've seen in the previous video, one problem associated with classification trees is overfitting. You can always use more levels. You can always use more nodes and go down even further, and what happens is that the feature space is partitioned into more and more boxes, and this necessarily leads to overfitting. This is a problem for deep trees, while shallow trees obviously might underfit the data. In the previous video, we saw an example where we only used two levels, two nodes, or actually three nodes, for partitioning the feature space, and this might lead to underfitting. So the variance-bias trade-off is very important to consider in the context of classification trees.
One possible solution to this problem is to construct multiple different trees at training time and produce forecasts based on the mode of the classes of the individual trees in classification, or the mean or median of their response variable in the regression example. This is an ensemble technique called random forest. Random forests can be constructed by introducing some sort of randomness into the construction of each tree, for example by choosing random subsamples or random variables and splits. This will then yield trees that are built independently of each other. This idea is usually referred to as bagging, or bootstrap aggregating.
Another approach is to use multiple trees in a so-called boosting framework, where we have weak learners, those are the shallow trees, and they are combined to yield a stronger estimator such that at each iteration step, the classification or regression error is reduced. This will yield so-called boosted classification trees or regression trees, in which the individual trees are no longer built independently of each other. For an illustration, I would advise you to watch these two YouTube videos, see my cursor. This one is the first, and this one is the second. There, the principles of boosting and bagging are explained quite nicely, actually. And on the next slides, we will apply boosted
decision trees to our classification problem, and we will rely on the XGBoost algorithm, which is quite famous in the context of boosting.
So we want to fit a boosted decision tree, again with the caret package. We do this in parallel for faster computation, with five-fold cross validation. So this is train control. As we're fitting boosted decision trees, it's quite computationally intensive, so we only use five-fold cross validation. If you start the parallel computing session before that and move this to a cluster, you can also try ten-fold cross validation. We set the seed, as is tradition, and set up X train and X test with the numeric data. We again exclude all non-numeric features, that is, all categorical features. The underlying XGBoost algorithm already uses parallel processing implicitly. Therefore, we only employ two parallel processes in caret, wrapped around the implicit parallelism that is included in XGBoost. We set the tune length equal to two. As caret selects five different parameters, this will then lead to 2 to the fifth power different parameter combinations that are considered in cross validation. So XGBoost model is our object. We train based on X train and y train. The method is xgbTree, so it's a classification tree that is boosted via XGBoost, and the rest is standard: train control, TR control, our options, cross validation, five-fold. The metric used to select the best model is accuracy, and then we stop the cluster. So this is what we get out: 100 rounds, the maximal depth is two, and some additional parameters.
Let's already consider the forecasting accuracy. We predict based on XGBoost model, the new data is X test, the reference is y test, and we look at the error matrix. Now we see that, where we previously had, I think, 170, we have an improvement compared to both the support vector machine and the unboosted classification tree. The accuracy now increases to 97%, but most importantly, the sensitivity increases. So it is a substantial improvement compared to all three previous models: the classification tree, the support vector machine, and k-nearest neighbors. You can see this in the increase in the sensitivity. It increased from 74% to 87% for the boosted tree ensemble. By relying on the predictions from the boosted classification tree, for almost 300 out of the 341 customers, we can rightly predict that they are about to quit the services. This leads to the increase in sensitivity, and then the manager could act on this. On the other hand, we would falsely approach only 26 out of 1,725 existing customers. So yes, this could be a way to move forward.
Now, in the next video, we want to take the same approach, decision trees, but use it in the context of a regression analysis. We need to use different data. Why? Because now we need a metric response variable.
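The idea behind boosting, weak learners that are added one after another so that each one reduces the remaining error, can be shown with a very small example. The Python sketch below is not XGBoost and not the caret call from the lecture. It assumes the simplest setting, squared error on a single feature, where each weak learner is a one-split tree (a stump) fitted to the current residuals. The data, the learning rate and the function names are hypothetical, and the stump fit expects at least two distinct feature values.

```python
def fit_stump(x, residuals):
    """Best single split on one feature for squared error; returns a predictor."""
    best = None  # (sum of squared errors, cut, left mean, right mean)
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        cut = (lo + hi) / 2
        left = [r for xi, r in zip(x, residuals) if xi < cut]
        right = [r for xi, r in zip(x, residuals) if xi >= cut]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, cut, lm, rm)
    _, cut, lm, rm = best
    return lambda xi: lm if xi < cut else rm


def boost(x, y, rounds=50, learning_rate=0.3):
    """Each round fits a stump to the current residuals and adds a damped copy of it."""
    base = sum(y) / len(y)
    stumps = []

    def predict(xi):
        return base + learning_rate * sum(s(xi) for s in stumps)

    errors = []  # training mean squared error before each round
    for _ in range(rounds):
        residuals = [yi - predict(xi) for xi, yi in zip(x, y)]
        errors.append(sum(r ** 2 for r in residuals) / len(y))
        stumps.append(fit_stump(x, residuals))
    return predict, errors


# Hypothetical toy data with three plateaus
x = list(range(10))
y = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 5.0, 5.1, 4.9]

predict, errors = boost(x, y)
print(f"training MSE: {errors[0]:.3f} at the start, {errors[-1]:.6f} at the last round")
```

Unlike the trees in a random forest, every stump here depends on the ones fitted before it, because it is trained on what they left unexplained. The training error can only go down this way, which is why the number of rounds and the tree depth are tuned by cross validation in the lecture rather than on the training data.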
24. 24 Decision Trees for Regression: Hello, everyone, and welcome back to our class in
Artificial Intelligence, Machine Learning and finance. In the last videos, we've seen classification trees
and we now want to use basically the same model for a different purpose for
regression analysis. So we'll continue
our application, but we now need a
different dataset. We can use a dataset
from Kaggle, but this time, it's one that
is on insurance premia. You can download the data
here under this link, and we want to predict
insurance premia, which are expenses for customers and potential
policyholders. We have data on close to 1,300 health insurance policies, and we have four numeric
features, age, BMI, number of children, and expenses, and three nominal features,
sex, smoker, and region. So obviously, we want to
forecast the expenses, the insurance premium,
and we are going to use age, BMI, number
of children, sex, smoker (yes or no), and region (not yes or no, but the different regions) to predict the
insurance premium. We need to import the data, so we use the readr library. We download the CSV file and import it. There seems
to be a duplicate entry, so actually, we need to
remove the duplicate entry. If we now use sum(is.na(insurance)),
we get zero. So there are no missing values, which is important
in this context, and we have a look at the insurance premia
to get a summary. As you can see, the mean is
13,000, probably dollars. The maximum is 63,000 and the minimum is 1,122,
so that's the data. We use the tibble library. We convert the data into a tibble and we print the first
eight observations and you can see we have female, male, male, male,
female, female, female, the body mass index,
number of children, smoker, yes or no, the regions, and
the different expenses. Now we create the
training and test data, which by now is pretty standard. Again, you can see that 80% of the observations go
into the training data and 20% into the test data: X train, y train, X test, y test, as is standard and as has been done
before a number of times. Again, we use the
caret package. We set the trainControl option to tenfold
cross-validation. We set the seed in line three, and as single trees
are fitted very fast, we do not need parallel
computing here, and we actually consider
various hyperparameters, so the
tune length is 100. The metric needs to
be different now. We cannot use accuracy, which is for classification, because this is a regression tree. y train is now a
metric variable. The function sees that,
even using rpart, recursive partitioning,
it sees that this is not a classification but
a regression task, and the metric then is the
root mean squared error. There cannot be
accuracy as before. Then we print the complexity
parameter, 0.0061. And this is what happens. You can see now that with an increasing complexity
parameter, the RMSE increases. It stays constant
for some time, then it jumps up. Now, to visualize
the optimal tree for the optimal
complexity parameter, you can see here that
on the first level, it's quite interesting
to see smoker, yes or no. Actually, what you can see here are the expenses, the insurance premia.
As you can see, if you're not a smoker, you are immediately in
this area where the premia
are relatively low. Then if you're young, it's the lowest premium; if you're older, then
it's a question of whether you're not as old or
whether you're rather old, and this is a
distinction made here. If you're a smoker and if you have a high
body mass index, and if you're old, then you are among the 5% of the observations that have
very high insurance premia. Now, what is the
predictive accuracy here? Again, we predict based
on the test data sample, and you can see the
R squared is 87%. The root mean squared error is 4,238 and the mean
absolute error is 2,900. Obviously, the R squared
is easiest to interpret, and we get that in this model, close to 87% of the
variance can be explained by this
regression tree. We could also use boosted
regression trees. However, XGBoost only works
with numerical features, and as we've seen, actually
smoker, yes or no, a categorical feature, a binary
variable, should be included because it seems
to be the feature that has the highest
predictive power. If we leave this out, this would lead to an inferior model. Now, we can do this,
but you can see the R squared immediately
drops to 9%. Instead, if you rely on bagging, via the method treebag,
and if you use all features, this does not in this
scenario lead to an improved performance compared to the single-tree model; the R squared drops to 81%. But it's a good thing
to try this out. This is a regression tree. Classification and regression
can both be done by trees. We've seen a few examples and some nice applications. In the next video, in the
next subsection, we will start with using and discussing a huge part of
artificial intelligence, which is neural networks.
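The three test-set measures quoted for the regression tree (R squared, root mean squared error, mean absolute error) can be sketched in a few lines of Python; the course itself computes them in R. The toy premia below are made-up numbers for illustration only, not values from the insurance dataset, and R squared is computed here as one minus the ratio of residual to total sum of squares. Some packages instead report the squared correlation between predictions and observations, which can give a slightly different value.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (rmse, mae, r2) for two equally long lists of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    mean_y = sum(y_true) / n
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)  # total sum of squares
    rmse = math.sqrt(ss_res / n)                   # root mean squared error
    mae = sum(abs(e) for e in errors) / n          # mean absolute error
    r2 = 1.0 - ss_res / ss_tot                     # share of variance explained
    return rmse, mae, r2

# Hypothetical observed and predicted premia (illustration only).
observed = [1000.0, 2000.0, 3000.0, 4000.0]
predicted = [1100.0, 1900.0, 3200.0, 3800.0]

rmse, mae, r2 = regression_metrics(observed, predicted)
print(f"RMSE = {rmse:.2f}, MAE = {mae:.2f}, R squared = {r2:.2f}")
```

RMSE and MAE are in the units of the response (here dollars), which is why R squared, being unit-free, is the easiest of the three to interpret.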
25. 25 Neural Network Architecture: Hello, everyone, and welcome
back to our class in artificial intelligence and
its application in finance. In this video, we
are going to start the subsection on neural
networks and to be precise, artificial neural networks
because these are obviously not
biological networks we are looking at,
but artificial ones. As I mentioned in one
of the first videos, artificial neural
networks try to mimic the behavior of the
human brain and the processes that happen in the human brain, encompassing
neurons and synapses. And consequently, the term neural network actually encompasses a large
class of models. Here, we're going to look at the
single-layer perceptron, or the single-hidden-layer
backpropagation network. It's usually just referred to as the single-layer perceptron. This is a very plain
vanilla neural network. There are more complicated,
more complex ones. We also shortly cover
deeper networks, so-called multilayer
perceptrons or MLPs. And after having seen regression and
classification trees, you probably
understand what is meant by deeper
networks and deeper models. And the same happens here
with neural networks. If you're more interested in neural networks and what can be done on top of neural networks, including the single-layer and the multilayer perceptrons, you can have a look at the
Hastie, Tibshirani, and Friedman and the Goodfellow, Bengio, and
Courville textbooks. Actually, both should be
available for free on the Internet, and you can learn more about artificial
neural networks. Now, as I mentioned
in the first video, actually, I think,
neural networks, artificial neural networks
are far from being new and have been around
for 80, 90 years. In fact, they date
back to the 1940s. They have since
gone by many names: they used to be part of cybernetics, then were known as connectionism, then were just neural networks, artificial neural networks. Nowadays, the field is commonly
referred to as deep learning. Why has it become more
important? Why have these models become
more sophisticated? Well, because we have more data
to train these models. We have big data available. We have more computing power, and as a result, these models have grown in size with the increasing power of
hardware and software. Nowadays, with the availability
of big datasets and the availability of large clusters and
powerful computers, these models can be put to better use than
was possible, say, in the 80s or 90s. And since then, neural
networks have helped to solve increasingly complicated
applications with increasing accuracy. What is the structure
of a neural network? Very simply, this is a single-hidden-layer
feedforward neural network. It's a single-layer
perceptron: not multiple layers, but the single-hidden-layer
perceptron. What happens is, let me
get my cursor here, we start out with
our features X one, X two, and so on until X P. Again, just like in the traditional linear
regression setting or the classification setting, we have P features. These could be, for example, age, gender,
income, or maybe home region. Then what we do in
the neural network, in the single-layer
feedforward neural network, is combine these
features linearly in the first and, in
this case, only hidden layer. So we're constructing
variables Z one, Z two, and so on until Z M, and we are linearly combining these features to form,
for example, Z one. Obviously, this is
basically a linear regression. We have coefficients
that in this case will be called weights, and these weights
need to be estimated. What we get is, one could say, a couple of regression
analysis results, a couple of variables
that are hidden. Why are they called hidden? Because they are
not observable. We can only observe Y, those are our responses, and
we can observe our features. For example, let's say this is the insurance premium
dataset: we have age, gender, income, and so on, and the response is the
insurance premium. In this example, we would
only have one response. But actually, these
models are able to predict not just one
response variable, but also multiple classes or
multiple metric variables. So in the very simple example of the insurance
premium dataset, we would only have Y one. We are going to predict
insurance premium, and then this would be
the insurance premium, and these would be the features. What we are doing is
inserting yet another layer in between. Instead of combining
the features directly, as we would do in a
regression analysis or in a regression tree, instead of directly using the
features to predict Y, we first compute these hidden and unobservable
variables Z one to Z M, and then we recombine these variables for
our prediction of Y. That's basically the only thing that is different
in comparison to a linear regression model or linear classification.
Why are we doing this? Well, by inserting this
additional hidden layer, we can
do more transformations of the data and thus we
get more flexible models. That's basically it. That's also why, if you take the Hastie,
Tibshirani, and Friedman textbook, it will tell you that, well, neural networks are
far from being magic. They are simply
nonlinear generalizations of regression models. So that's basically what at least the single-layer
perceptron is. It can be seen as a
two-stage regression or classification model (it can be used for both regression and classification) where we have the hidden features in this
hidden intermediate layer, and they are derived as a nonlinear function
of the inputs. We'll see in
just a bit what is done to the
features X one to X P in order to derive these hidden features Z one to
Z M. These hidden features are then used to model
the responses, the targets, Y one to Y K. For regression, you will usually set K equal to one and employ
only one output node. That's what we did here when thinking about the
insurance premium dataset. For classification
with K classes, we can use K outputs, so we have Y one, Y two, until Y K. They will be coded
as 0/1 binary variables. We could say this is, let's say, the class of low income, this is middle
income, and then we have a third class that is higher
income, customers maybe. Okay. Now, to the
formal description, we start out with
the hidden layer. That's the hidden
layer here. What we are doing is
using the features (let me use something else
to highlight this). We're using the features X. These are observable. We
combine them linearly. The intercept in this case is called a bias, and the regression coefficients
are called weights. We have a linear combination
of our features. Okay. We have a linear
combination of our features, and if we were to set sigma
to the identity function, then we would simply get a
linear regression model. Then Z one, Z two, et cetera, would simply be a
linear transformation of X, of the features. But sigma, which is the so-called
activation function, is often chosen to be the sigmoid function,
which is given here, and this is a
nonlinear function. So our linear combination of the features is nonlinearly transformed and
we get the hidden layer, and these hidden
units are then again linearly combined to yield T k, and then we can apply yet
another nonlinear function, g k, to T, and this is
our prediction for Y. This basically is, let me
use the red line here, this is basically our prediction
for Y k. It's no magic. It's very simple.
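The forward pass just described can be written down in a few lines. The following is a minimal sketch in Python (the course itself works in R), assuming a sigmoid activation in the hidden layer and either the identity (regression) or the softmax (classification) as the output function; the function name forward, the argument names, and all weight values are hypothetical and only serve the illustration.

```python
import math

def sigmoid(v):
    """Sigmoid activation: squeezes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))

def softmax(t):
    """Turn K real-valued scores into K probabilities that sum to one."""
    m = max(t)                                  # subtract max for stability
    exps = [math.exp(x - m) for x in t]
    total = sum(exps)
    return [e / total for e in exps]

def forward(x, alpha0, alpha, beta0, beta, task="regression"):
    # Hidden layer: Z_m = sigmoid(bias + linear combination of the features).
    z = [sigmoid(a0 + sum(w * xi for w, xi in zip(a, x)))
         for a0, a in zip(alpha0, alpha)]
    # Output layer: T_k = bias + linear combination of the hidden units.
    t = [b0 + sum(w * zi for w, zi in zip(b, z))
         for b0, b in zip(beta0, beta)]
    # Final transformation g: identity for regression, softmax otherwise.
    return softmax(t) if task == "classification" else t

# Hypothetical example: P = 3 features, M = 2 hidden units, K = 3 outputs.
x = [0.5, -1.0, 2.0]
alpha0 = [0.0, 0.1]
alpha = [[0.2, -0.4, 0.1], [-0.3, 0.5, 0.2]]
beta0 = [0.0, 0.0, 0.0]
beta = [[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]]

scores = forward(x, alpha0, alpha, beta0, beta, task="regression")
probs = forward(x, alpha0, alpha, beta0, beta, task="classification")
print("T_k:", scores)
print("softmax probabilities:", probs)
```

Replacing sigmoid by the identity function in this sketch would make the whole computation a composition of two linear maps, which is again a linear model in the inputs.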
Take the features, combine them linearly, apply
the sigmoid function to this. You get the hidden
layer. Recombine those hidden features
in a linear fashion. Then you can apply yet
another nonlinear function g to those transformed variables T k, and you get your
prediction for Y. That's the sigmoid function. One can also use the rectified
linear unit function, which is given by the
maximum of zero and v, but that's basically the
single-hidden-layer perceptron. Now, f k is the
model's prediction. We've seen that these
predictions are constructed as a linear combination of the hidden features, and we
have a final transformation g k. In regression
analysis, one usually does not use this
final transformation; g k is just the
identity function. Basically, what you're
saying is our prediction is T k and we don't
apply any g k to it. But in classification,
one usually uses the softmax function, and this is actually the same as
in the multinomial logit model you might have seen in
regression analysis. So that's the single-layer
perceptron. The units are called hidden
units in the hidden layer. They're not directly observable. They are learned from the
features X one through X P. These hidden features
are then used to produce our predictions of
the outputs, which are again observable. And when sigma is the
identity function, the whole model collapses to a linear model in the inputs. There's no big difference to a linear classification model
or linear regression model. So one actually
needs, for example, the sigmoid function to get a generalization of
the linear model and to have any extension
of the linear model; otherwise, it doesn't
make too much sense. And what are multi
layer perceptrons now? Well, it's very simple. If you do this again, if you insert yet another
layer and say, Okay, I'm combining these three
here and I'm combining these four maybe and we
get a second hidden layer. Then we have a two-layer
perceptron, and this will give
us a deeper model. If we add another and another, these are all called
multilayer perceptrons. Stacking multiple hidden layers on each other will give us the MLP. And if you want to see an excellent graphical illustration of a
neural network, if it's not clear by now, you can actually look at this video here on
YouTube, which is a very, very good example and an illustration of what
a neural network does. Let me comment on one thing and let me delete a little
bit of my drawings here. Why is it called a perceptron? Why is it called an
artificial neural network? Well, actually, this is what usually happens in the human brain, or at least that's what we think,
with neurons and synapses. The nodes are the neurons and
the edges are the synapses. What happens is we have a signal. Let's think about
this maybe if this is the human eye or human brain. We get a signal that starts out here and we
get a signal here. For example, we
have a male person. This is one and
for this feature, we have a zero, and here
we have a one, zero, zero. Now, from what we have learned in the training
process, we can see that, okay, if we have a one here
and if we have a one here, then this hidden neuron is
activated, so it's set to one. These are set to zero, and then
suddenly we can observe a one. What happens is that in
the training process, we see that if we have one
here, if we have one here, then we have one here, and then we set the parameters
which are the weights. Let me just check: in the first layer this should be alpha,
and then we have beta. This alpha and this beta are increased to make
sure that, if we have a one here and a one here,
then we get to this one. The signals come in, they are transported
via the synapses, and then we get to this point. If we have a signal coming
in here and here, maybe these are the weights
that have been trained, have been increased, and
then we get to this point. This is how the artificial neural network, and also the
human brain, works. We get signals coming
into the neurons. We have synapses that
have been trained, we remember certain things, and then we can decide
on whether, for example, we are looking at a cat or not. This is what is also
illustrated in the video, and you can see this especially when looking
at the sigmoid function. Again, this is the non
linear transformation of the linear combination
of the features. And what happens is, if actually this were
a linear function, then we would see, okay,
maybe it looks like this. We get a small value coming
in we get a small value. If it's a slightly higher value, we get a slightly higher
value, et cetera. But what is actually
done is we have a non linear
transformation that is also governed by
this parameter S. That's a scaling parameter
in the sigmoid function. And if you look at the extreme
example of S equal to ten, then actually what happens is you get the purple
function here, meaning that for
values below zero coming in, you have essentially no activation. The resulting function value is
close to zero. If it's a positive value, then it's close to one. You can see for this extreme choice of
the sigmoid function how this actually works. (Let me delete all this again.) This is pretty much
like a signaling function: if enough neurons, or the right neurons, are
being switched on, then the sigmoid function with this choice of scaling
parameter will lead to, let's say, one, zero,
zero, zero, zero. Some hidden units
are switched on, others are switched off, and now it becomes clear
how this works. Okay, so this is the structure
of a neural network. In the next video,
we are going to have a look at how to fit
these neural networks, which means we have to estimate and train those
parameters alpha and beta.
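The effect of the scaling parameter s discussed in this video can be checked numerically. This is a minimal Python sketch (the course itself works in R), assuming the scaling enters as sigmoid(s * v); the function name scaled_sigmoid and the grid of inputs are chosen only for illustration.

```python
import math

def scaled_sigmoid(v, s=1.0):
    """Sigmoid of s*v, written so that math.exp never overflows."""
    x = s * v
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

# Evaluate the activation on a small grid for a flat, a standard,
# and a very steep choice of the scaling parameter.
inputs = [-1.0, -0.5, 0.0, 0.5, 1.0]
for s in (0.5, 1.0, 10.0):
    values = [round(scaled_sigmoid(v, s), 3) for v in inputs]
    print(f"s = {s:>4}: {values}")
```

For small s the function is almost linear around zero, while for s equal to ten it behaves nearly like a switch between zero and one, which is the on/off picture of the hidden units given above.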
26. 26 Training Neural Networks: Hello, everyone, and welcome
back to our class in artificial intelligence and
machine learning in finance. In the last video, we've
seen the definition and the basic structure of an
artificial neural network. In this video, we want to shortly discuss the
question how to train, how to fit a neural network. For example, a simple
single layer perceptron. Same for the multi
layer perceptron. Obviously, the multilayer
perceptron needs even more time than
the single layer, but the principles are the same. Now, we've seen that
the neural network has quite a large number of
parameters. Why is that? Because we have the
parameters, the weights, the coefficients of the
linear combinations of our features in
the first layer, for the observable features. Then we have the hidden layer,
where we are recombining those combinations
of the features. Again, we have another set of coefficients, or weights as we call them in artificial
neural networks: alpha m, with m going
from one through M, and beta k, with k going
from one through K, the numbers of hidden units and outputs, and obviously
also the biases. These are the intercepts in
those linear combinations. This is actually quite a large number of parameters, and it can increase even more
if we add another hidden layer in the
multilayer perceptron. Now the parameters
are chosen such that the model predictions
fit the training data well, the same as in any
statistical learning model. But what can be done for training the model in the special case of a
neural network? Well, we have to
distinguish between regression analysis and
classification. For regression, we usually rely on the sum of squared errors as
a measure of fit. For example, we use
the cost function R of theta, the sum of
squared errors: y k, those are the observations
we have for the K outputs (remember that,
even though this is regression analysis, with
the neural network we can actually have more
than one response variable), minus our predictions. We square them, and then we add all this up for
all observations, but also for all outputs. Theta is the vector that contains all
trainable parameters, so all the weights of
the neural network. And we have N
training examples. For classification, you usually use the so-called cross
entropy or deviance: you take
your predictions, f k, based on the features, you take the logarithm, and you multiply it with the observed values
for the response, y k. Usually this will
be one or zero, and thus you get the errors. And then you also add it all up and take the
negative of this. Now the corresponding
classifier in this case is usually
the argmax function, that is, the class to which
the highest probability is assigned is chosen
as the prediction. Now, these are, in a sense, the error functions,
the cost functions. We need to minimize these to
train our neural network. The generic approach here
is gradient descent. You might know
gradient descent also from our computational
finance lecture. Gradient descent is the idea that, in order to
minimize a function (this is actually
generic optimization), you compute the gradient;
in the one-dimensional case, this is simply the
first derivative. You compute the
gradient and then you move in the direction
of the steepest descent, and that's given
by the negative gradient. Then you have an iterative
algorithm and you try to minimize the function by moving down against the
direction of the gradient. Now in the setting
of neural networks, gradient descent
is usually referred to as backpropagation,
quite famously, as the gradient can easily be derived by the chain
rule of differentiation, and this can be
done in a forward and backward sweep
over the network. For details, you
should have a look at the Hastie, Tibshirani, and
Friedman textbook. You can also, and I would
appreciate it if you do this and recommend it highly, take a look at these two links here. You will
find two videos, this one and this
one, in which the backpropagation algorithm
in the context of neural networks is
quite nicely explained. Now, calculating the gradient is based on the whole data sample. If you take a look
at this function R here, you can see that this is
based on all N observations, and to calculate the gradient, you have to go
through all the data, all the training observations. Again, to
calculate the gradient, you need to go through all examples or
training observations, and this is called
batch learning. This can be quite costly in
terms of computational time, because if the dataset
is quite large, computing the gradient
also needs a lot of time. Therefore, one usually relies on the
so-called stochastic gradient descent
or SGD algorithm, which is sometimes referred to
as mini-batch learning. What you do is you select
small random subsamples. You concentrate on a randomly
selected smaller subsample to update the network weights. As a consequence,
the computational burden for each iteration (this is an iterative procedure) does not increase
with the total number of training examples, because
you keep the size of those random
subsamples fixed. You can increase
the training data, but the mini-batches will
remain of the same size. The number of
training examples in each mini-batch is referred
to as the batch size, while a complete sweep over
the entire data sample of N observations is
referred to as one epoch. A neural network is typically trained over multiple epochs. You can see that, with this huge number of
parameters and the need to calculate the gradient
in each iteration to minimize our cost function, training a neural network with a large dataset usually
requires a lot of time. Remember that a
huge problem with neural networks is that, because of this huge number
of parameters, we have a huge flexibility. Overfitting becomes
a huge problem. We have numerous parameters, and neural networks are prone to overfitting at the global
minimum of the cost function. Now, while there are many means to mitigate overfitting,
for example, by using smaller batch sizes that have a regularizing
effect, there are also other
simpler methods that explicitly address the
problem of overfitting. And in the application, we
will see two of these methods. The first one is dropout, the other one is early stopping. Now, dropout is frequently used in a way where we take the non-output units and
randomly remove some of those non-output units from
the network during training. So we are reducing the number of units that are
active in each training step. Early stopping, in turn,
addresses the question of how long to train a
neural network: we have some stopping rule that determines that the
training is stopped when the model
performance starts to deteriorate on a validation set. So in a sense, we are already including the
validation set in our training. And if we see that the model seems to overfit,
that it seems to learn only from the training sample and doesn't
generalize well, we stop. So this is what we will
see in the application. But before we go to
the application, we quickly introduce an
extension of neural networks, which is the convolutional neural
network, in the next video.
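Mini-batches, epochs, and early stopping can be seen together in a small sketch. The following Python code (the course itself works in R) is an illustration under simplifying assumptions: a straight line w*x + b stands in for the network so that the gradient of the squared-error cost can be written by hand, the data are simulated, and the names train_sgd and patience as well as all default values are hypothetical choices, not the settings used later in the application.

```python
import random

def mse(w, b, data):
    """Mean squared error of the line w*x + b on a list of (x, y) pairs."""
    return sum((y - (w * x + b)) ** 2 for x, y in data) / len(data)

def train_sgd(train, valid, batch_size=8, lr=0.05, max_epochs=200,
              patience=5, seed=1):
    rng = random.Random(seed)
    data = list(train)                      # copy, so the caller's list is kept
    w, b = 0.0, 0.0
    best_loss, best_w, best_b = mse(w, b, valid), w, b
    stale_epochs, epochs_run = 0, 0
    for _ in range(max_epochs):
        rng.shuffle(data)                   # new random mini-batches each epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of the mean squared error on this mini-batch only.
            grad_w = sum(-2.0 * (y - (w * x + b)) * x for x, y in batch) / len(batch)
            grad_b = sum(-2.0 * (y - (w * x + b)) for x, y in batch) / len(batch)
            w -= lr * grad_w                # step against the gradient
            b -= lr * grad_b
        epochs_run += 1                     # one full sweep over the data
        valid_loss = mse(w, b, valid)
        if valid_loss < best_loss:          # validation performance improved
            best_loss, best_w, best_b = valid_loss, w, b
            stale_epochs = 0
        else:                               # no improvement in this epoch
            stale_epochs += 1
            if stale_epochs >= patience:    # early stopping rule
                break
    return best_w, best_b, best_loss, epochs_run

# Simulated data: y = 2x + 1 plus a little noise (illustration only).
rng = random.Random(0)
points = []
for _ in range(200):
    x = rng.uniform(-1.0, 1.0)
    points.append((x, 2.0 * x + 1.0 + rng.gauss(0.0, 0.1)))
train_set, valid_set = points[:160], points[160:]

w_hat, b_hat, valid_loss, epochs_run = train_sgd(train_set, valid_set)
print(f"w = {w_hat:.2f}, b = {b_hat:.2f}, "
      f"validation MSE = {valid_loss:.4f}, epochs = {epochs_run}")
```

The cost of one update depends only on the batch size, not on the 160 training observations, and training ends once the validation error has not improved for a few epochs in a row.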
27. 27 Intro to Convolutional Neural Networks: Hello, everyone. Welcome
back to our class in artificial intelligence and
machine learning and finance. In this short video,
we are going to have a look at convolutional
neural networks, which are a specialized
kind of neural network, an extension of the single- and multilayer
perceptrons we've already seen, and they're used for processing data with
a grid-like topology, very famously for images. Actually, this is frequently used in business and
also in finance when looking at
images, for example, to identify characters and numbers. It can be used for
very simple applications, if you want to
read into your computer receipts or any other type of
handwritten information. And they have been tremendously successful in practical
applications. So let's have a look at these. Now why are they
called convolutional? It indicates that CNNs use a mathematical operation
called convolution, which is a specialized
linear operation. Now, we don't need to
go into the details of what a convolution
is in mathematics, but they are neural networks that use convolution instead of general matrix
multiplication like in the single-layer or multilayer
perceptron. Remember that in the single- and the
multilayer perceptron, in a sense, if you leave
out the sigmoid function, if you only look at the
first layer, for example, and then later on at how the
hidden units are combined, these are linear operations, and they are linearly combined to yield
the output signals. Now, these are
linear operations, and if you've taken a basic introduction
to linear algebra, you know that linear
operations are equivalent, in a special
way, to matrices. So actually, what
we are doing in the single- and
multilayer perceptrons are matrix multiplications, and here we use a
convolution instead of the matrix multiplication in at least one of the layers. So this is the CNN. What are we doing? Actually, you have
to think about this. We'll start with a very simple image that
is not even an image. It's simply a grid of numbers. Think about this grid. This is the source, the image that we
want to transform. Now, what we are doing is we
look at these numbers: zero, zero, zero, zero, one, one, one, zero, two. We are using
these nine pixels, and we are transforming these nine pixels according
to the convolution. We're using the
convolution kernel, and what happens is, you
carve these nine pixels out. Now, the center element of the kernel is placed
over the source pixel. The source pixel is
then replaced with a weighted sum of itself
and nearby pixels. You can see the
calculation here. We don't need to go
through all of it. What happens is that we
get minus eight: we reduce all
this to minus eight, and this is the new pixel
value at the destination, and we leave out all
this other information. What we do next is shift this convolution kernel to the right and again to the right. Each time we use nine pixels and calculate the one here and the one
here, and the one here. You can see that
in each dimension, as we go through here, we are actually
losing two pixels, so we don't get these border pixels. We start here, with the minus eight,
go through these, and then we fill out all these boxes and
pixels in the middle. This is the convolution, very, very simplified. What
happens is, actually, the convolution kernels are
also often referred to as filters because, depending
on what kind of filter, what convolution you're using, you are extracting a
different specific feature. For example, here, in
this feature extraction, we are trying to detect edges. For example, you
can see that here, let me highlight this way here. You see that we have one color coming in from here
and then suddenly we have a much different
color at this edge, and we are trying to
detect these edges. We are not concentrating
on colors only on edges. If you apply this
feature extraction, this convolution kernel
here for edge detection, what you get is this picture, and you have one filter, one convolution that extracts
the edges of this picture. This is one layer of
information we get, and we can do many filters. We can apply many filters
to this picture and extract different features and then
later on, learn from this. Typically, many different
filters are used in each layer of a convolutional
neural network in parallel. Because this again
needs some time, this is done to extract different features
at the same time. As a result, the output of a convolutional layer has
the structure of a cuboid: you get the outputs of many
different filters, and then you can
recombine them to get a full idea of what the picture looks like and then train your neural network. Now some further
architectural components,
like for example pooling layers, will be covered later on
in the application, and there is an
extensive treatment in the Goodfellow, Bengio,
and Courville textbook. There you can also find information on other network architectures, such as
recurrent neural networks. These are way beyond the
scope of this lecture, but we will see the
multilayer perceptron, which is, again, the simple extension of the single-layer perceptron
with more hidden layers, and also the convolutional
neural network in the application. You now have a basic understanding of what the CNN and the multilayer
perceptron are, and for the details,
you can look this up. But it's very simple to
use these models in R, and we'll see this
in the application.
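The sliding-kernel computation described in this video can be sketched in plain Python (the course itself works in R). The 5-by-5 grid and the 3-by-3 edge-detection kernel below are assumptions made up for this illustration; they are not the grid or the kernel shown on the slide. As in most deep learning libraries, the kernel is applied without flipping it, which makes no difference for the symmetric kernel used here.

```python
def convolve2d_valid(image, kernel):
    """Slide a square kernel over the image; no padding at the borders."""
    k = len(kernel)
    out_rows = len(image) - k + 1        # a 3x3 kernel loses two rows ...
    out_cols = len(image[0]) - k + 1     # ... and two columns
    output = []
    for i in range(out_rows):
        row = []
        for j in range(out_cols):
            # Weighted sum of the k-by-k patch whose top-left corner is (i, j).
            total = 0
            for a in range(k):
                for b in range(k):
                    total += image[i + a][j + b] * kernel[a][b]
            row.append(total)
        output.append(row)
    return output

# Hypothetical "image": dark on the left, bright on the right (vertical edge).
image = [[0, 0, 1, 1, 1] for _ in range(5)]

# Hypothetical edge-detection filter: centre pixel minus its neighbours.
kernel = [[-1, -1, -1],
          [-1,  8, -1],
          [-1, -1, -1]]

result = convolve2d_valid(image, kernel)
for row in result:
    print(row)
```

The output is 3 by 3 for a 5-by-5 input, so two pixels are lost in each dimension, and the filter responds only near the column where the values change, while the uniformly bright region on the right gives zero.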
28. 28 Neural Networks in Practice: Hello, everyone,
and welcome back to our class in Artificial
Intelligence and Finance. In this video, we want to
look at the application of neural networks in a
very generic example that is frequently used
in economics and finance, and that is scanning documents. Usually these documents
only exist in paper form: we have
invoices, bank statements, printouts of static data,
business cards, receipts, et cetera, and a scanned image cannot be searched
in its native state. Thus, it is
common to digitize printed text to enable
electronic editing, searching, compact storage,
and online display, and for this, you need a neural network. The retrieved
text behind the image can also be fed directly into
further machine processes, for example, automatically
managing invoices, receipts, transactions. And you can imagine
that this is usually the starting point:
the digitization of these printed documents
that decades and years ago were only available
in printed form, but nowadays
can be digitized and used in a computer. Now, the technique
behind this is usually called optical character
recognition, OCR, and this is done by
neural networks. This is one of the
most basic applications of neural networks. One can also use
the Minsse pricing. I'll comment on that later on. But I think this is very
instructive because you can easily see how
the neural networks work. On the following slides, we
employ neural networks for the task of handwritten
digit recognition, a classification problem
with ten classes: zero, one, two,
three, four, five, six, seven, eight or nine. We have a handwritten note, and we want the
neural network to be able to decide:
is this a one? Is this a five, or is it a
zero, for example? We start with the multilayer
perceptron. It doesn't make too
much sense to use a single-layer perceptron here if we can use the multilayer one; the principle is the same. We will discuss some regularization techniques
to prevent the model from overfitting, and we will fit a convolutional neural
network to the data. For the application,
we now rely on the MNIST database that is provided within the
Keras R package. And this database contains
60,000 training and 10,000 test examples
of handwritten digits, and more details on this particular dataset can be accessed at this link here. So this is the MNIST
database in Keras. Okay. So we employ the Keras R package for
fitting the neural network. This package provides
an interface for the open source
Python library, Keras, which in turn acts as an interface for Google's
TensorFlow library. You might have
heard of TensorFlow, which is the library provided by Google for deep learning
and machine learning. The syntax in the
R package is very, very similar to
the Python syntax in the original library. Keras is one of the leading high-level
neural network APIs. It has a focus on enabling fast experimentation,
is quite user friendly, and also allows the training
of neural networks on both CPU and GPU without
changing the code, should we need additional
computation power here. It also supports arbitrary
network architectures, and it's quite appropriate for building essentially any
deep learning model. That's why we are using
it here, and also because we are able to stay
within R. Now, before we can start to
work with the data, we have to install
the Python Keras and TensorFlow back end first. This can mainly be done from within R. During the process, you might be asked which TensorFlow version you want
to have installed. If you use the default, this yields the CPU version, which
we'll use here. But please note that
installing Python Keras via the Keras R package seems not to work on
RStudio Server. So if you are using RStudio Server, please use your own computer instead,
as this doesn't work there. Before executing the
following lines of code, please also install Anaconda with the default
settings from here. That's Anaconda. Then
you install Keras, and this third line will also install Miniconda and
several Python packages, Keras, NumPy, et cetera. This is what we need in
the back end in order to be able to fit our multilayer perceptrons
and our neural networks. Again, we start by
importing the data. We load the library keras, and then mnist is dataset_mnist(). This downloads the database. The object size, divided by
1 million, is 219, so the size is 219 megabytes. This is not really large, but it might be too
much for, let's say, a regular notebook.
It might be that even this rather small dataset is too large for a
regular notebook. So this is why we are not using larger data samples
here in this lecture. If you have a look at
the structure of mnist, you can see it's a list of two: $train, that's one object,
and $test. And you see it is a very simple
structure. Why is that? The data are basically just digitized images. They are not features like in
the insurance risk premium data sample, where we saw we
have age, gender, et cetera. These are all images,
and they are not saved as images, as
JPEGs, obviously, but they are saved as
digitized images. For example, if we view mnist, that's our data sample, and in
the training dataset x, the first observation,
you can see this matrix. You can see a 28 by
28 grayscale image, and what you see is actually that these are all numbers. If we zoom in here, you can see, these
are all numbers. And you can imagine
what has been done. Actually, this is
one of the images. Let me zoom out. This
actually looks like this. This is one example. I would guess this is
supposed to be a five. Someone wrote down
a five and this is the digitized
version of this. If you check the image
label, or visualize it, yes, we would guess
that this is a five. Actually, yes, it is. So if we look at the response, the output value for this
observation, it's a five. And hopefully, later on, our models are able to train on the training data and then, if fed this image, determine that
this is supposed to be a five. Okay. Now, the data are stored in a
three-dimensional array. If you take a look at
this, you can see this is the first
dimension, one, then the second
one and the third one. The first dimension
is for the image. This is the first image, so we get a matrix in the
remaining two dimensions. So one and two. Actually, all the
data are stored in a three-dimensional array,
image by width and height. So actually, if we
were to access one, three, two, here's
my cursor: one, we get the
first image; three, row three; and column two, we would get this zero. So this is how the
data is stored. Now, to be fed into a
multi layer perceptron, this matrix has to be flattened. That is, it needs to be
transformed into a vector. Additionally, because
as you can see, the data are stored as
grayscale values 0-255, we need to transform
the grayscale values 0-255 to the range 0-1. So x_train is mnist$train$x, y_train
is mnist$train$y, and so on for the test data. We need to reshape the data. x_train and x_test are
x_train from before and x_test from before, and we reshape them
by flattening them. And then we rescale this by dividing all the numbers by 255. If you divide everything by 255, this whole range 0-255 is transformed to
the interval 0-1. That's in lines ten and 11. And then the y data, these are integer vectors with, obviously,
integers ranging 0-9: zero, one, two, three,
and so on and so on. For training, we encode the two vectors into
binary class matrices. This is done via the Keras
to_categorical function. You can see here the structure before transformation is five, zero, four, one, nine, two, one, three, four, these
are the values. This was our first image,
that was the five. And if we do this
to_categorical of y_train and ten, you can
see this is now a matrix. And we now have categorical or binary
variables. For example, if we start with the
first binary variable: is this a zero? For the first observation we have, no, so zero. Is this a zero? Yes,
so one. Is this a zero? No, no, no. So we get zero, one, zero, zero, zero,
zero, et cetera. We treat y_test in
the same manner. Next, in the next video, we are going to fit the multilayer perceptron
to this type of data.
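The preprocessing described above (flattening each 28 by 28 image into a vector of length 784, rescaling the grayscale values from 0-255 to 0-1, and turning the integer labels into binary indicator vectors) can be sketched in a few lines. This is a plain Python illustration, not the R code from the slides; the toy image and the helper names `flatten_and_rescale` and `one_hot` are made up for this sketch.

```python
# Toy stand-in for one MNIST image: 28 rows x 28 columns of grayscale values 0-255.
# (Illustrative data only; the lecture loads the real images via the Keras R package.)
ROWS, COLS, NUM_CLASSES = 28, 28, 10


def flatten_and_rescale(image):
    """Turn a 2-D list of grayscale values (0-255) into a flat list in [0, 1]."""
    return [pixel / 255 for row in image for pixel in row]


def one_hot(label, num_classes=NUM_CLASSES):
    """Encode an integer label as a binary indicator vector."""
    vec = [0] * num_classes
    vec[label] = 1
    return vec


image = [[0] * COLS for _ in range(ROWS)]
image[5][7] = 255   # one fully "inked" pixel
image[6][7] = 128   # one mid-gray pixel

x = flatten_and_rescale(image)
y = one_hot(5)      # label of the first training image in the lecture

print(len(x))       # 784
print(y)            # [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
```

Each row of the binary class matrix mentioned in the lecture corresponds to one such indicator vector, with the single 1 marking the digit.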
29. 29 Multilayer Perceptron Hands On Implementation: Hello, everyone, and
welcome back to our class in Artificial
Intelligence and finance. Now we want to use a multi layer perceptron in this example of images that
we need to categorize. We need to determine whether
the handwritten images are, for example, a five,
a three, or a nine. Now, we want to
build a first model, a multilayer perceptron, and the core data structure
in Keras is a model. The simplest type of model
is a sequential one, where potentially different kinds of layers are stacked
sequentially on each other. We start by defining a
sequential model via this code, keras_model_sequential(), and subsequently
add layers to it. Instead of the object-oriented
syntax in the original Python
Keras library, which is model.add, the R package
uses the pipe operator, which is a percent
sign, greater-than sign and percent sign, %>%, that we
are already familiar with from the dplyr package, where it is also used. Even shallow models can exhibit hundreds of
thousands of parameters. Therefore, be careful
of overfitting. We are going to discuss overfitting in the
next video in detail, and we are going
to have a look at a very simple multilayer
perceptron here in this one. This is the start. Let's build our first model: model one is
keras_model_sequential(). We define a sequential model. Then we add one hidden layer with 256 neurons and not the sigmoid, but the ReLU activation
function, and the input shape corresponds to the length of the
flattened images, which is 28 times 28, that is 784. So model one, and then we
need this pipe operator: layer_dense, with units 256, activation relu and
input shape 784. Again, the pipe operator, layer_dense, units ten,
activation softmax. This is the last layer, the output layer, with one
neuron for each class. Remember, if you go back, we saw that actually
we can have one y, to which all the
neurons go, or we can have more
than one output, so this would be y two. In this case, obviously, we have ten units because our result is one neuron for each class, reflecting the probability
for each digit zero, one, two, three, and so on. These are actually
the two layers. This is the hidden layer and
this is the output layer. Now, that's all we need. We can print out
the model summary, so we do summary of model one. It's a sequential model. You can see here the layers, dense and dense one, with 256 and ten neurons. For the first layer we get almost
201,000 parameters. The total number of
parameters is about 204,000. These are all trainable. You can see it's a huge number
of parameters that need to be estimated, and this is bound to suffer
from overfitting. Next, we compile the model
from the previous slides with an appropriate loss function,
optimizer and metric. Model one, pipe
operator, compile: the loss is the
categorical cross-entropy, the optimizer is this
optimizer_rmsprop(), and the metric is accuracy. We want to use the categorical
cross-entropy for minimization, for the optimization of
the neural network, and we are going to assess the performance of our model
by using the accuracy. So we're not
using, for example, the no-information rate. Now we can train the model. The results from
the training are saved in
the history object. History is model one, pipe operator, and then we
fit the neural network based on x_train and y_train. We use 50 epochs. Remember that one epoch
is one sweep over the whole dataset,
and with mini-batches the batch size of 64 is the number of images processed after which the parameters are
again updated. The validation split is 20%; this provides an indication of the generalization
performance of the model. That is, we split our
training data into 80% and 20%, and then we use the 20%
of our data for validation. Let's evaluate model one on
the test set. Result one is model one, pipe
operator, evaluate with x_test and y_test, and we print both the
loss and the accuracy. The accuracy is 98%. A pretty good model; it has
very high accuracy, but as we will see later on, there is overfitting. The model overfits
the training set. If we print out model one, piped into evaluate with
x_train and y_train, the accuracy is almost 100%, but we will later
on see it doesn't generalize well to new data. Well, we can also make
predictions now. The test set predictions are given by model one, pipe operator, predict
classes from x_test. And, for example,
for the first one, predictions one, for the
first image in the test set, the prediction is a seven, and we can see the true value
is actually zero, one, two, three, four,
five, six, seven. So yes, we would say
yes, this is a seven. Our prediction is a seven, and the true value is given
by this binary variable, and this is the binary
variable for a seven. This would be eight, and this would be a nine. So this works quite nicely, to see what happens. Based on the 204,000
parameters, as I mentioned, the model overfits
quite drastically, and we will discuss
this problem of overfitting and what to do
about overfitting in the next video.
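To make the output layer and the training setup from this video more concrete, here is a minimal sketch of the softmax activation, the categorical cross-entropy for a single image, and the mini-batch arithmetic. It is plain Python rather than the R code on the slides; the ten output scores are invented for illustration, and the function names are hypothetical helpers, not Keras functions.

```python
import math


def softmax(scores):
    """Turn ten raw output scores into probabilities that sum to one."""
    top = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(target, probs):
    """Loss for one observation; target is a one-hot vector."""
    return -sum(t * math.log(p) for t, p in zip(target, probs))


# Made-up output scores for one image; index 7 has the largest score.
scores = [0.1, -1.2, 0.3, 0.0, -0.5, 0.2, -2.0, 4.0, 0.4, 1.0]
probs = softmax(scores)
predicted_digit = probs.index(max(probs))   # class with the highest probability

target = [0] * 10
target[7] = 1                               # true label: seven
loss = categorical_cross_entropy(target, probs)
print(predicted_digit)                      # 7

# Mini-batch bookkeeping for the fit described above.
n_train = 60_000 - 60_000 * 20 // 100       # images left after the 20% validation split
updates_per_epoch = math.ceil(n_train / 64) # one parameter update per mini-batch
print(n_train, updates_per_epoch, updates_per_epoch * 50)  # 48000 750 37500
```

The loss is small when the probability assigned to the true digit is close to one and grows as that probability shrinks, which is why it is the quantity being minimized while accuracy is only reported.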
30. 30 Multilayer Perceptron Handling Overfitting: Hello, everyone, and
welcome back to our class in artificial intelligence and machine learning in finance. In the last video, we've seen
the multi layer perceptron and we used it to
do the following. If you skip back one slide, you can see that we trained
a multi layer perceptron in order to recognize
handwritten digits. As in this example, we
were able, for example, to predict from this
picture correctly that this was supposed to be a seven and it actually
was a seven. So this is the trained
model's prediction, and we haven't yet talked about the accuracy and,
actually, overfitting. If we plot the history of the model over the
50 epochs, you can see here in, well, green or turquoise, the loss and accuracy for the validation
set and, in orange, the loss and accuracy
for the training set. As you can see, the accuracy is actually extremely high,
here's my cursor. It is actually quite high, and the problem is, as
you can see here from the loss in the
validation set, that with more and more epochs that we use for training our
multilayer perceptron, the model doesn't generalize
well to new data. We can see that
the loss increases enormously and linearly
with every epoch, even though the accuracy
in the training dataset is close to 100%. So we can see overfitting
being a huge problem here, which is not surprising given the fact that we
have, I think, 204,000 parameters and far
fewer data observations. Actually, for each
data observation, we have more than one parameter, and this obviously yields
a highly flexible model, but this is causing the overfitting we
can see on this plot. It's obvious, when we look at the
previous slide where we compare the loss
and the accuracy in the training
and in the test data, that this first very simple multilayer perceptron
overfits the data. And what happens is the
model ends up memorizing the training sample
because we have more parameters than we
actually have observations. So we could simply store each of our observations in
one or more parameters, and thus it does not generalize well to previously unseen data. With the loss continuously decreasing on the training set, it starts increasing from epoch ten on the validation set. That's the 20% of the training sample
that we set aside via the
validation split parameter in our model fitting. As a consequence, for the
classification accuracy on the validation set, there's
also no further increase. For the training set, it's close to 100%, but not for the validation set. We have seen this here. You can see the
accuracy doesn't really increase anymore
after this point in the validation set and we cannot train the model much better for the training set, obviously. So a possible solution to
overfitting in neural networks. One possibility is
regularization, and this can be done for example via dropout and early stopping, and we will be doing this
in this example here and exemplifying how these
two procedures work. Dropout was proposed by Srivastava,
Hinton, Krizhevsky, Sutskever and Salakhutdinov in 2014. It's a powerful
regularization method that is applicable
to a broad family of models. It's computationally
inexpensive and it's frequently used in
the current literature. What dropout does is
train an ensemble of sub-networks
of a given neural network, in which non-output units are randomly removed
from the network. This is typically achieved by multiplying the outputs of the respective neurons by zero, and for each mini-batch, a different subsample of
hidden units is used. And then we calculate
the gradients and we do backpropagation through
these networks as usual. Now, early stopping, as the second alternative for regularization of
the neural network, addresses the question of how long to train
the neural network. Because we can see here
that at some point, sorry, at some point, probably here, we should
have stopped and said, Okay, well, this is enough. The accuracy in the
training sample will only increase slightly, but everything that
follows now is an increase in the loss
for the validation set. So we could have stopped early. That's what early
stopping is all about. Too little training might lead
to underfitting, if we had stopped earlier, let's say after
maybe five epochs or three epochs, and if we stop
too late, we get overfitting. Early stopping
proposes a compromise by stopping training at the point when performance on a validation set
starts to degrade. It's very simple, very
effective and widely used. Now we continue by adding a
dropout layer to model one. The dropout rate, specifying
the percentage of neurons excluded
per mini-batch, serves as a hyperparameter; here we choose a
dropout rate of 50%. We specify and fit
model two, again with keras_model_sequential() and the pipe operator. We have a layer_dense with 256 neurons,
the activation function is ReLU, then we also have
the layer_dropout, with a rate of 50%, and then the last
layer with ten binary response variables,
or ten outputs, and we're using softmax
as the activation. We have three layers now, and this is the summary
of our regularized model. Again, it's sequential
with those three layers. The dropout layer doesn't
have any parameters. Again, we get the same number
of parameters as before, so it's not really
about reducing the flexibility of the model, just using maybe different
data to fit the model. We compile this and fit
the regularized model. So again, we are using cross-entropy
as the loss function, the same optimizer as before, and to assess the performance, sorry, we use the accuracy. This is what we do with history: we fit the model on
x_train, y_train, with 50 epochs, the batch size is 64, and the validation
split is again 20% of our observations. And this is the result. As you can see, as before, we have a drop in
the loss at first, and it still increases
for the validation set. But it doesn't
increase like this. Actually, this difference
here and also this smaller difference
between the accuracy in the training and in
the validation set, these are the results of the
regularization via dropout. So those are the results. Now, accuracy in the validation set slightly increases with the number of epochs, while the loss over the validation
set only slightly increases. Overfitting is not a
major issue anymore. We can see that yes,
it still increases. It's not that the loss,
here's my cursor, is constant; it
still increases slightly until epoch 50, but it's not
a major issue as before. While the accuracy of
the regularized model, which is about 99%, is lower in the training set compared
to the original model, it is actually higher,
98.2%, in the test set. So generalization
performance has improved, and this is also reflected in a lower loss over the test set. So yes, it's a good way to regularize the neural network and to prevent it
from overfitting. And in the next video, we'll have a look at a deep MLP.
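The dropout mechanism described in this video, zeroing the outputs of randomly chosen hidden units with a fresh random selection for every mini-batch, can be sketched as follows. This is an illustrative plain Python version, not the internals of layer_dropout; the function name `dropout` and the stand-in activations are assumptions of this sketch. The rescaling of the surviving units by 1 / (1 - rate) is the common "inverted dropout" convention, which the lecture does not spell out.

```python
import random


def dropout(activations, rate, rng):
    """Inverted dropout for one forward pass during training.

    Each unit is zeroed with probability `rate`; surviving units are scaled
    by 1 / (1 - rate) so the expected activation stays unchanged.
    """
    keep_prob = 1.0 - rate
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


rng = random.Random(42)        # fixed seed only to make the sketch repeatable
hidden = [1.0] * 256           # stand-in activations of a 256-unit hidden layer

batch_1 = dropout(hidden, rate=0.5, rng=rng)   # mask for one mini-batch
batch_2 = dropout(hidden, rate=0.5, rng=rng)   # a fresh mask for the next one

dropped_1 = batch_1.count(0.0)
print(dropped_1, "of", len(hidden), "units dropped in the first mini-batch")
print(batch_1 != batch_2)      # compares the two masks
```

With a rate of 50%, roughly half of the 256 hidden units are switched off in each mini-batch, so every update trains a different sub-network while the number of trainable parameters stays the same.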
31. 31 Multilayer Perceptron Building Deep Models: Hello, everyone, and
welcome back to our class in artificial intelligence
and machine learning in finance. We fitted a multilayer
perceptron to our data example of
digitized digits, handwritten digits
that we wanted to digitize and to predict. And in this video, we are going to have a look at a deep model, or a
deeper model, after having seen how to deal with
overfitting in our data. We want to fit a deeper model, and therefore we add two additional hidden layers to our multilayer perceptron. Actually, before that it was only a single-layer
perceptron. We now add two additional
hidden layers and we reduce the number of
neurons by a factor of two in each consecutive layer, it gets more sparse as we
move upwards to the outputs. Again, we apply dropout
with a dropout rate of 50% to each hidden layer for regularization of
our neural network. We now have a truly
multilayer perceptron. It's model three. Again, it's specified sequentially, and as you can see,
we have layer_dense. We start with 256 neurons and the ReLU activation function, we use dropout, then
we use 128, again dropout, then 64 with dropout, and finally, we have our output layer with ten binary
response variables, and there we use the
activation function softmax. This is the summary. Actually, with the
multilayer perceptron, we now have 243,000 parameters. All of these are trainable. And as you can see, interestingly, in comparison to the single-layer
perceptron from before, the number of parameters
doesn't increase that extremely when adding a second
and third hidden layer. So we move from 204,000 to 243,000
parameters. We continue as before: we
compile and fit the model, again using cross-entropy
and accuracy, 50 epochs, batch size 64, and we use 80% for training and 20% of our available
data for validation. We visualize the
training process, and as we can see now, actually, the accuracy is quite high for both the training
and the validation set. And actually, the loss is also rather low for
both the training set, not surprisingly, but also
for the validation set. If we compare this to the
single-layer perceptron, actually, I have
to delete some of my drawings from
the previous video, and you can see now that yes, even though we're using
the same regularization, using two additional layers leads to a better performance. We now have a higher accuracy
and comparable loss. So to evaluate the model, we use our test data, x_test and y_test, and here we get an accuracy
of 97.7% for the deep model. Now, in the next video, we'll have a look at
early stopping as the second way of regularizing
multilayer perceptrons.
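The parameter counts quoted in the model summaries, roughly 204,000 for the model with one hidden layer and 243,000 for the deeper one, follow directly from the layer sizes: each dense layer has one weight per input-unit pair plus one bias per unit, and dropout layers add nothing. A short Python check, with `mlp_params` as a hypothetical helper name:

```python
def mlp_params(layer_sizes):
    """Total weights and biases of a fully connected network.

    layer_sizes lists the number of units per layer, input first.
    """
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


shallow = mlp_params([784, 256, 10])            # one hidden layer
deep = mlp_params([784, 256, 128, 64, 10])      # three hidden layers

print(shallow)          # 203530
print(deep)             # 242762
print(deep - shallow)   # 39232
```

Almost all parameters sit in the first hidden layer, because it connects 784 inputs to 256 units; the two extra hidden layers are much narrower, which is why adding them increases the total by less than 40,000.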
32. 32 Multilayer Perceptron Early Stopping Technique: Hello, everyone, and welcome
back to our class in artificial intelligence and
machine learning in finance. We are still looking at and talking about the multilayer
perceptron. We fit it to our dataset of handwritten digits that we want to recognize
and to predict. And we've seen dropout as one way of regularizing
a neural network. Remember that we saw that
even our very simple single-layer perceptron
had 204,000 parameters. The multilayer perceptron
with three hidden layers had close to 250,000 parameters, and they were prone
to overfitting. So we need some way of
regularizing those neural networks. And the second way of doing
this is early stopping, which is a very simple procedure
that actually decides when to stop the
training when overfitting
starts to come in. So now we want to fit the
model with early stopping. You can go back to slide 335 to
see what early stopping is; I just explained it. We again rely on the
previous model from the previous video, the
multilayer perceptron with three hidden layers, and for
illustrative purposes, we reduce regularization
by setting the dropout rate to
30%. Apart from this, it's the same model
specification as before. And so, for early stopping, we compile and fit the model. This will be model four. As you can see, we compile it
again using cross-entropy, the same optimizer, and accuracy as the
metric which is used to measure the predictive
performance of our model. In training with early stopping, this is where it's now
different from before. It's performed by introducing a so-called
callback, a callback that is monitoring the accuracy
on the validation set. We fit our model
four into history, the object, based on
x_train, y_train, 50 epochs, batch size 64, and 20% of the data is used
for the validation set. As before, that's the same. But now callbacks is a list with
callback_early_stopping(): we have to monitor the accuracy, the patience parameter will be five, and restore
best weights, that is, if we have seen that actually
the accuracy deteriorates, should we restore
the best weights, this option is set to true. These are the parameters. Monitor is the quantity
that needs to be monitored, in this case the
accuracy; patience is the number of epochs with no improvement after which
training is stopped; and restore best weights
is the option whether we should restore model weights from the epoch with
the best value of the monitored quantity,
or whether we keep the weights from the epoch where we stopped, let's say, the patience number of epochs later. These are the parameters, and this is the
training process. If we plot history, we can see that with less dropout,
lower dropout, and we could have done
this even without additionally using dropout
regularization, we can see that yes, the accuracy is quite high for the training
and validation set. We can also
see that the loss actually increases and has
been increasing, let's say, from epoch five through
close to epoch 23, I guess, in the
validation set. But this is where it stopped, and we've seen
before that actually this would increase to this point in epoch 50, with the accuracy
being just as high. This is probably
a good point. We could have also
stopped maybe here; for that we should have used
different parameters in our early stopping. Now, if we evaluate model four, we can see it has 97.8%
accuracy and a loss of close to 0.15. So the classification
accuracy has slightly improved compared
to the previous deep model. However, to assess whether the improvement is due
to early stopping, the lower dropout rate or
just happening by chance, one would have to perform
further analysis. It's not a 100% fair
comparison because we've changed the dropout rate as well to exemplify this effect
of early stopping. Obviously, you can
try different models and different choices
for the hyperparameters. Both deep models do not achieve a better performance than
the shallow model with dropout. So we can see that deeper
models are not better per se, if we take
the accuracy, and neural networks need to be carefully specified and trained. Deeper models might
perform better with more training examples,
with a different dropout rate, a different number of
neurons, layers, et cetera. We have a lot of ways of
changing the different models. On the following slides
and in the next video, we will consider a different
network architecture that is the convolutional
neural network. And this is especially
suited for data, like in this example, with a grid like structure such as the images we are
going to process here.
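The early stopping rule with a patience parameter and the option to restore the best weights, as used for model four, can be sketched independently of Keras. The Python function below is a simplified illustration, not the callback's actual implementation; the name `early_stopping` and the validation accuracies are invented for this example, and refinements such as a minimum required improvement are left out.

```python
def early_stopping(val_scores, patience, higher_is_better=True):
    """Return (epoch at which training stops, epoch with the best score).

    Training stops once `patience` consecutive epochs bring no improvement
    over the best validation score seen so far. Epochs are counted from 1.
    """
    best_score, best_epoch, wait = None, None, 0
    for epoch, score in enumerate(val_scores, start=1):
        if best_score is None:
            improved = True
        elif higher_is_better:
            improved = score > best_score
        else:
            improved = score < best_score
        if improved:
            best_score, best_epoch, wait = score, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_scores), best_epoch


# Hypothetical validation accuracies per epoch (not the lecture's numbers).
val_acc = [0.950, 0.960, 0.970, 0.975, 0.974, 0.975,
           0.973, 0.974, 0.972, 0.971, 0.970]

stopped_at, best_at = early_stopping(val_acc, patience=5)
print(stopped_at, best_at)   # 9 4
```

In this toy run the best validation accuracy occurs in epoch 4 and training halts five epochs later; restoring the best weights would then mean going back to the model from epoch 4 rather than keeping the one from epoch 9.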
33. 33 Convolutional Neural Networks Practical Example: Hello, everyone, and
welcome back to our class in artificial intelligence
and machine learning in finance. We've seen the single-layer and
the multilayer perceptron being used for
predicting digits, handwritten digits,
in our application, and we've also seen how we
can use regularization, via dropout and early stopping, to improve
on the accuracy of our models, and
we now want to use convolutional neural
networks as yet another alternative neural network model that can be used in this case. For the MLP models, we flattened the input data. That is, we transformed the two-dimensional images into vectors of grayscale
values 0-255. Actually, we then transformed all those grayscale values
to the interval 0-1. However, the convolutional
neural network exploits this grid like
structure of the images. We've seen how it works. It uses filters or
kernels that go through the image and
reduce the information. Therefore, we need
the training and test set in a slightly
different structure. What we are doing for
x_train and x_test, as you can see, is that we take
the mnist training set and the test set and we
rescale them again by dividing by 255. The dimension of
x_train is 60,000 images by 28 by 28, that is 28 pixels height
and 28 pixels width. And unlike the MLP, the CNN takes each image in three dimensions.
So the last dimension
typically has three values, the RGB channels for the color. And as we have grayscale images, we only need one channel. So what we are doing is adding one dimension. Actually, x_train then is
an array of dimension 60,000 images, 28 pixels, 28 pixels, one
channel, because it's a grayscale picture, and the same transformation
is done for the test set. Now the convolutional
neural networks can automatically learn a
large number of filters. In our case, in the first
convolutional layer, we'll use 32 and they will
be learned in parallel. Each filter provides
highly specific features that can be detected anywhere
on the input images. We've seen this example of a picture where one
filter was applied to see the edges without caring
too much about the colors, just trying to see
edges where suddenly something appears or
changes in the picture. In our example, we will apply filters of size three by three, just as we've seen
in our illustration of the CNN, and the
following lines of code specify the whole model as sequential and add the
first convolutional layer. Model CNN, that's what we are fitting, is keras_model_sequential() with layer_conv_2d,
two dimensions, 32 filters, and the kernel
size is three by three. The activation is, again, the ReLU function, and the input
shape is 28 by 28 by one. So, more or less the x and
the y axis of the picture, the height and the width, and
one channel for the color, which in this case is just gray. Now the output feature
map obtained from the various filters
is sensitive to the location of the features in the input. Pooling layers are an approach to
downsample feature maps, and they summarize
the presence of features in patches
of the feature map. Common pooling methods are
average and maximum pooling. Average pooling summarizes the average presence
of a feature, while maximum pooling
captures the most activated presence of a feature: for example, a cat is present in the respective part of the
image or it is not present. In our example, we use maximum pooling over
patches of size two by two, and this reduces each
26 by 26 feature map. Remember that in each dimension, we are actually losing two pixels, and thus we get a 26 by 26
pixel feature map, and maximum pooling
reduces this feature map obtained by the convolution
layer to a 13 by 13 map. We then add a pooling
layer: model CNN, layer_max_pooling_2d, where the
pool size is two by two. We add another convolutional
layer this time with 64 filters for detecting more detailed features
in the image, followed by another
maximum pooling layer. We're adding layer after layer to our convolutional
neural network. You can see here layer_conv_2d,
filters 64, kernel
size three by three, and layer_max_pooling_2d,
again, two by two. To complete the model, we feed the outputs from the
last convolutional layer, with its attached
pooling layer, into a dense layer to
perform classification. So this is actually the same as in our multilayer perceptron and in the single-layer
perceptron; it's the same structure. We have ten units, ten binary
variables for those digits, and the activation
function is softmax. But before the outputs from the convolution layer can be
fed into the dense layer, the 3D output has to be
flattened to one dimension. And this whole model architecture
then looks like this. We summarize model CNN and we have
these different layers. We have the first layer
with 320 parameters, then the second one,
which actually has close to 19,000 parameters, and the last
one has 16,000 parameters. In total, we get
35,000 parameters, much less than in the example of the
multilayer perceptron, but nothing comparable,
for example, to the linear classifiers or regression analysis
we've seen before. Now, the dimensionality of the layers. The input data, as we've seen, consist of 28 by 28
grayscale images. Applying filters of size
three by three to them, so using these convolutions
and these kernels, leads to the loss of two
pixels in each dimension. If you remember this
picture with my cursor, we've seen that actually
if this is the picture, and these are the pixels, we've seen that
actually we are using three by three kernels. For example, we are using these nine pixels
to calculate this one. We use the next ones, go one to the right, and then get
this one and this one. Finally, if this is
the last column, we get, let's say, this one. As you can see, we are losing this pixel and we are
losing this pixel in this dimension, going like this, and obviously we also
only compute this one, this one, this one, this
one, so in the end, we get a 26 by 26
matrix, a feature map. As in the first
convolution layer we apply 32
different filters, the output dimensionality is
then 26 by 26 times 32 different
filters that we apply. By maximum pooling with a
patch size of two by two, the dimensionality is
actually reduced to 13 by 13 times 32 filters, so that's what we
get in the end. In the next convolutional layer, we apply 64 different filters of size three by three
to the output. Again, the input feature
map loses two pixels in each dimension. In the end, we get 11
by 11 by 64 filters. Then maximum pooling with
patches of size two by two: the number of pixels in the
first two dimensions is essentially divided by two, yielding the output
dimension five by five by 64, and flattening this output tensor yields
a vector of length five times five times
64, that is 1,600. A very important feature of
convolutional neural networks is parameter sharing. This is achieved by
moving each filter over the picture, thereby employing the same parameters
at each location. We've seen that this three by three filter moves from
left to right, for example. As a consequence, the first convolutional
layer of our network only employs 320
different parameters, and this compares to almost
200,000 parameters in the first hidden layer of the multilayer or
single layer perceptron. Each three by three filter
involves nine parameters plus one bias parameter, and
because we employ 32 filters, this yields 320 parameters. This is why we don't
end up with almost 260,000 parameters as in
the multilayer perceptron, but stay at about 35,000. Pooling does not involve
any trainable parameters. The size of the
patches over which pooling is performed
is a hyperparameter, but in the pooling itself we don't have any
trainable parameters. Each of the 64
filters of size three by three from the second layer is applied to all
32 feature maps, so each filter has three times three times 32 weights plus one bias. This yields
roughly 18,000 parameters. Pooling and flattening don't involve any trainable
parameters, and the last dense layer
requires ten weights for each of the 1,600 neurons
that are mapped into the ten output neurons
for the ten digits, and this in the end yields
another 16,000 parameters. So in total, we have 35,000 compared to 204,000 parameters. The model has much
fewer parameters, but as we'll see, we don't
lose too much flexibility. Compiling and training are done in analogy to the
multilayer perceptron, but training is computationally
very expensive. We only perform ten epochs
in contrast to 50 epochs, and this has been done at our university computing
center on GPU clusters,
which are suitable for fitting large and
many neural networks. This one was done on four nodes with four Tesla V100 GPUs, so a lot of
computational power. So compiling and
training is done as before: we compile model CNN, the loss is cross-entropy, the
metric is accuracy, and we fit the model
using X train, y train, ten epochs, a batch size of 64, and again, 20% is in the validation set. So this is the result.
As we can see, accuracy is also very high. We are starting at 98%, and this goes up to 99%, close
to 100% accuracy. And if you look at
the validation set, it goes down for some
time. Remember, we would have to
look at 50 epochs to see where this is going, but even
for ten epochs, the accuracy is very good and we don't have such a high loss when we look at the
validation set: 99% accuracy, 3% loss. So the convolutional
neural network provides a higher forecasting
accuracy on the test set than any of the
multilayer perceptrons, and classification
accuracy could probably be improved
even further by tuning the model specification
or by running 50 epochs. But even with this
very simple example, you should see
that it works much better than the multilayer
perceptron. This is why a convolutional
neural network is usually used in practice in applications where we
are trying to analyze images and recognize characters
from optical data. Okay. So these were
the neural networks, and they can be applied in many different
applications: every time you have
a lot of data, you have your outputs,
and you want to do regression or classification. As you can see, they are highly flexible. They should be applied
to big datasets, but then you need to think
about regularization, that is, how to combat overfitting. In the last section
of our lecture, we're going to have a look at some different aspects
of AI and ML in finance, that is, the usage of AI and ML by companies for
regulatory purposes and by regulators and
supervisory agencies, and then systemic risk and also some ethical
considerations.
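The layer dimensions and parameter counts quoted in this video can be checked with a few lines of plain Python. This is only a sketch of the arithmetic, not the code used in the lecture: the helper names are made up for illustration, and it assumes stride one, no padding, a single grayscale input channel, and one bias per filter and per output unit.

```python
# Illustrative helpers (hypothetical names); no deep-learning library is used.

def conv_side(side, kernel=3):
    """Side length after a convolution with stride 1 and no padding."""
    return side - kernel + 1

def pool_side(side, patch=2):
    """Side length after non-overlapping max pooling (remainder is dropped)."""
    return side // patch

def conv_params(kernel, in_maps, filters):
    """Each filter has kernel*kernel*in_maps weights plus one bias."""
    return filters * (kernel * kernel * in_maps + 1)

def dense_params(inputs, units):
    """Each output unit has one weight per input plus one bias (assumed)."""
    return units * (inputs + 1)

shapes = []
side = 28                                  # 28 by 28 grayscale input
for filters in (32, 64):                   # two convolution + pooling stages
    side = conv_side(side)
    shapes.append((side, side, filters))   # after the convolution
    side = pool_side(side)
    shapes.append((side, side, filters))   # after max pooling
flat = side * side * 64                    # length of the flattened vector

params = {
    "conv1": conv_params(3, 1, 32),        # first convolutional layer
    "conv2": conv_params(3, 32, 64),       # second convolutional layer
    "dense": dense_params(flat, 10),       # output layer for ten digits
}
total = sum(params.values())

print(shapes)   # [(26, 26, 32), (13, 13, 32), (11, 11, 64), (5, 5, 64)]
print(flat)     # 1600
print(params)   # {'conv1': 320, 'conv2': 18496, 'dense': 16010}
print(total)    # 34826
```

Under these assumptions the exact total is 34,826, which is the "35,000" mentioned in the video, and the pooling and flattening steps contribute no parameters at all.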
34. 34 Regulatory Technology (RegTech): Hello everyone, and welcome
back to our class in artificial intelligence and
machine learning in finance. In this last section
of the lecture, we want to look at some non-mathematical topics related to the usage of machine
learning and artificial intelligence in
financial applications, and we'll start in this video
with RegTech and SupTech. What are these? Well,
we start with RegTech, which is probably more
common than SupTech. RegTech is the use of AI
and ML in the management of regulatory processes
within the financial industry. This is the case when banks, insurance companies, and financial service providers
use machine learning and artificial intelligence
to speed up, to improve, and to make more efficient
their internal processes
for complying with regulatory filings and
with laws and regulations. Thus, the main functions
are regulatory monitoring, regulatory reporting, and
regulatory compliance. Put differently, every time a bank, a financial institution, or an insurance company uses
AI and/or ML in order to improve regulatory
processes internally, this is what we call RegTech. Why do we even have a special
name for this, RegTech? Because many
regulatory processes have become more and
more complex. We have more regulation and more supervision of
financial institutions in the wake of the
financial crisis, and thus supervisors
and regulators are demanding more
and more information. There are more and more
rules that banks and insurance companies need
to abide by, and thus fulfilling all regulatory
requirements is a full-time job
for many banks nowadays, and they need to hire lawyers
and compliance officers. There's much room for improvement, because
supervisory
agencies are usually not that sophisticated when it comes to the technology that
is being used. In many cases, you will hear
from practitioners: we have to fill
out Excel sheets. You have huge Excel
sheets, for example under Solvency II in insurance companies,
that need to be filled out and sent
to regulators, and this
takes up a lot of time. In order to speed this
up, to make this process more efficient but also safer, you can use artificial
intelligence and machine learning. We have two
examples down here. This is not
advertising; these are two companies
that you can find easily
via Google Search: IdentityMind Global, anti-fraud and risk management for
digital transactions, and Trunomi, management of consent
for customer service data. There are numerous
other companies, and consulting companies also either provide services
related to AI and ML or provide consulting
on these topics, Deloitte and BearingPoint, just
to name two. So what are the different
market segments, to get a feeling of what
RegTech encompasses? First of all, profiling
and due diligence. That is, you collect and integrate data from multiple internal and external sources, and your aim is
to profile an entity, to confirm the identity
of a person or of a company, or to categorize them according to
regulatory requirements. Reporting and dashboards: again, you collect and integrate
data, and the aim here is to build
standardized reports for management, compliance
or regulatory purposes. Very often,
as I mentioned, you have Excel sheets that
need to be sent, for example, to EIOPA or national
insurance supervisors, and this is where you
want reporting to be automated and to be run much more efficiently
and less error-prone. Risk analytics: this
is where it gets a little more interesting from
a business perspective. You collect and integrate data, and the aim is to assess
the risk of fraud, market abuse or misconduct
at the transaction level. For example, in
investment banking, in trading, in the back office
and in the middle office, you might want to
use AI and ML tools to analyze the transactions
and to see whether the company has
too high a risk exposure, whether there might be a risk of fraud, and these sorts of things. Dynamic compliance is when you use machine learning methods to monitor
regulatory changes in order to ensure a flexible adaptation
of your policies, but more importantly,
of the processes that you have in
place. Otherwise, what would happen is that every time something in
regulation changes, you need to do this yourself or you need to hire
external consultants to change the processes
that have been put in place some time
before. Market monitoring: you
collect and integrate data, and you aim to match market-level adverse outcomes to regulatory or business rules. For example, you want to identify poor
product performance, market manipulation, et cetera. We see some of these things on the side of supervisors; we'll come to that
when we talk about SupTech. But obviously,
investment banks especially can also do this
themselves. So, some numbers on RegTech: it is nowadays still
dominated by startups; almost 70% of firms are
younger than five years, according to this global
RegTech industry report by the Cambridge Centre
for Alternative Finance and EY Japan. It employs just 44,000 people, but I think this is one part of the financial industry that will grow enormously
in the next years, because this is one
way to move forward and to combat the increase in regulatory and
supervisory requirements. It has about 5 billion
in annual revenue, and it raises a lot of
capital in external funding. So what is the
market environment? Well, the market and
regulatory environment is rated to be generally
favorable to RegTech companies. The pace of
regulatory change has increased ever since
the great financial crisis, and it will increase even more, especially now after
the Corona crisis. We have penalties for non-compliance with
regulatory rulings, and this has led to a
surge in the demand not just for compliance
officers and experts in compliance and banking
and insurance supervision, but of course also for
automated and reliable methods to speed these processes up. Some information on the
top ten RegTech markets: obviously the UK and the USA, but also, in Europe,
Luxembourg, Switzerland, and
Ireland, because these are the countries where
we see a lot of financial service companies
and providers having their headquarters within
the European Union or, in the case of
Switzerland, in Switzerland, mostly in Zurich,
and then Australia, Singapore,
Japan, Germany, France. But obviously, the UK and the United States are
leading the herd here. Now, who are the clients? Well, usually banks. 61% also target insurers, but it's mostly about banks, because after the great
financial crisis, most of the new regulation and the more stringent
supervision has hit banks. With Solvency II, it has also hit insurance companies
in the European Union. So most of these RegTech firms target banks and
then insurance companies, but some also do business with FinTechs, because if you are a startup, a financial
technology company, it makes sense to start with
automated processes from the beginning
rather than putting up processes that are
quite traditional, especially if you're
a FinTech company. And 50% of those RegTech companies also claim that
they have clients outside the financial
service sector, which makes sense:
large industrial companies usually have less regulation
and usually no supervision, but most of these companies
have similar problems, for example, the risk of transactions being
fraudulent or erroneous. Thus, every large
industrial company is also on the lookout for automated processes within its finance function. This means there is
substantial overlap between what we consider RegTech and, later on, SupTech companies, because
when we move to SupTech, where services are provided to
regulators and supervisors, some of
these processes can also be used by the
companies again. So what are the technologies and tools used by RegTech companies? Cloud computing,
machine learning, predictive data analysis, natural language
processing, deep learning, graph analysis,
image recognition, which we've seen with the
convolutional neural network, biometrics, crypto tokens, and virtual reality
just a little bit. It's usually about
cloud computing, machine learning, and
some related topics. Okay, now, regulators:
the regulators have to adapt to ever newer technology-enabled
financial services, which may present a challenge, especially for emerging
and developing economies. This is why some
regulators, like BaFin, have come up with something
that is called a sandbox, a regulatory sandbox.
What is that? It's a formal
program that admits certain financial
service providers and some business models which do not yet fully comply
with existing laws. The aim is to learn about the opportunities and risks that a particular innovation in financial technology,
for example, carries, and to develop
evidence-based policy, to see what regulators should do and what
they should not do. Imagine, for example, the introduction of
cryptocurrencies. This is something
that has come from the industry, from scientists, and we've seen a surge in cryptocurrencies, with
Bitcoin obviously being the most prominent one. At
least at the very start, these weren't regulated at all. They came from the industry, and then regulators and
supervisors said: okay, we need to think
about how to regulate this, if we should
regulate at all. And this can, for example, be done in the
regulatory sandbox, in order to allow one type of innovation in a limited
way, to closely monitor it, and to see what should be done and
what should not be done. Regulators like the UK FCA also have innovation
hubs or innovation offices. These are places where innovators
and regulators meet to discuss solutions
to the challenges of the financial sector. What has been done,
for example, in the UK were seven TechSprints, two-day events including industry representatives and
innovators, and these were, for example, on regulatory reporting, financial services
and mental health, anti-money laundering and
these kinds of topics. This is how regulators
and the industry try to discuss new ideas when it comes to new technologies
and new challenges. Okay. So this is RegTech, and in the next
video we'll switch to SupTech.
35. 35 Supervisory Technology and Systemic Risk: Hello, everyone, and welcome
back to our class in artificial intelligence and
machine learning in finance. In this short video, we're going to have a look
at SupTech in contrast to RegTech. To define SupTech: it's the
mirror image of RegTech, just on the side of financial supervisors
and regulators. It can be considered a
subdiscipline of RegTech, but it
can also be seen as an extension or
the mirror image on the side of supervisors
and regulators. It's artificial intelligence and machine learning techniques that are used by regulators and supervisors as part of their supervisory actions and
their supervisory conduct. We have financial authorities
like BaFin in Germany, the FCA in the UK and
other authorities, and whenever they use
big data, AI and machine learning to support the supervision of
financial institutions, this is what we call SupTech. So it's closely related to
RegTech, but obviously, companies have an incentive
to comply with regulations and to make this as efficient and
cost-efficient as possible,
whereas SupTech is supposed to not only make this process more efficient
on the side of supervisors, but also to identify
more patterns, more frauds, more potential problems, threats to financial
stability, et cetera. This is why the focus is
usually on misconduct analysis, on the reporting done by
financial institutions, and on managing all the data that is coming in to financial
supervisors. Actually, what you hear
from many consultants and representatives of
financial institutions is that they have no
idea what the
financial supervisors are doing with all the
data they are compiling. And with financial supervisory
agencies usually having less funding than their counterparts in
private industry, it probably makes sense
to invest in big data and AI and ML technologies in order to process
all this data. Examples are, for example,
the collection and management of detailed data
on loans in the euro area. You can go to this link at the
European Central Bank here, and to the
R2A accelerator, which works on accelerating SupTech solutions and
prototyping. That's another website
where you can get some initial information on this relatively
new field. It seems as if most supervisory agencies
are still looking into SupTech and trying out
at least some technologies, but this is not yet
widely recognized or widely distributed in the realm of financial
supervisors. There was one survey among 39 financial
authorities from 31 countries about the implementation
of SupTech strategies. What they found is
two broad approaches which
supervisors followed. The first one is a
specific SupTech roadmap based on the particular
needs of a department. This approach
tends to be more experimental:
departments within a financial supervisor say, we might be in need
of, let's say, a method to identify fraudulent transactions on the stock exchange, for
example, and then things are
experimented with. The second is an institution-wide
digital transformation and data-driven
innovation program, which is a much broader approach that usually encompasses the whole
supervisory agency, because the management and governors of
the agency have decided that they need a transformation
of their overall IT. And then of course, it makes
sense to concentrate on AI and ML technologies as part of that IT transformation
within the whole agency. And this is what they found
in this survey by the FSI: 50% said, we
have no strategy at all. So you can see
that this is still very experimental with
most supervisors. As a conclusion, SupTech is still in its infancy, but it's gaining
momentum, because we've seen that the institutions that are supervised and regulated are
investing in RegTech, so it makes sense that
supervisory agencies keep track of this
development and follow in the footsteps of the institutions they're
supposed to supervise. So even though this is still experimental or in the
developmental stage, we will see many financial supervisory agencies
investing more in ML and AI technologies as part of financial
supervision. The problem, of course, is that
these agencies usually have much less funding at their disposal than the companies
that are supervised. But if we take the
example of Germany and the recent blunder with
the Wirecard scandal, it's likely that BaFin will be reformed to some extent
and will probably also explore new ways to identify threats to financial stability and
financial misconduct. Another topic that is related to supervision and regulation in this context is systemic risk. Now, systemic risk, you probably know this
from other lectures, other classes on financial
supervision and regulation. Financial stability is one main goal of
financial supervision: supervisors aim to achieve financial stability to
prevent financial crises. The question is how the usage of AI and ML in finance is related
to systemic risk. Well, there are two sides
to this coin. One is that AI and ML can be used by financial
institutions, for example, to keep better track
of risk exposure and to manage risks more proficiently
in the bank or in the insurance company. Thus, the usage of
AI and ML can actually enhance financial
stability simply by doing a better job in, let's say, risk management
and trading. The other side of this coin is that AI and ML technologies can also themselves
cause systemic risk. The widespread use of AI and ML we will
see in the financial
sector might be a future driver
of systemic risk. And why is that? Well, especially when we're using it in regulatory
functions, first of all, AI is unable to reason about events it has not
yet been exposed to. We've seen in our modeling
that an important aspect of ML methods is their ability to
generalize to new data. Humans can draw on
a broad range of prior experience and
also have imagination. AI doesn't have that. It can only extrapolate
from what it has seen before, and we can only hope that a model that has
been properly trained doesn't overfit and generalizes well to new data. That's the problem we've seen
in many financial crises. Sometimes things
repeat themselves, but usually crises happen because we have something rather new happening, or a new combination of things
that leads to a crisis. Just like the COVID crisis we've just seen: this is also something
that mankind hasn't seen, at least, for decades
or even centuries. The second problem is that we do not
know how AI makes decisions. It is too complex
for us to follow. It is usually a black box, and it most certainly
is a black box to external stakeholders, so that usually only the companies
know how the AI methods work. Intransparency, or opacity, is never a good thing in
financial institutions. As soon as investors,
stakeholders, and debt holders do not
know what happens inside a bank or inside
an insurance company, this leads to uncertainty on the
part of investors, and this in the end might
lead to panicky reactions to news and thus to financial
crises. So AI
being a black box might in itself cause
some problems along the way when other
things also happen. The third factor is that AI is more likely to amplify
current cycles. It's more prone to procyclicality
than human regulators. The problem is that automation favors homogeneous
methodologies and standardization, and this leads to the problem of
procyclicality. We know this from regulation:
procyclical regulation means
that if markets go down, you go hard on banks and
you strengthen the crisis; you make
it worse by regulation. And as soon as you
have a boom phase, you loosen up regulation again and thus you
cause the next crisis. Anti-cyclical
regulation is what financial economists nowadays
usually believe to be a much
smarter approach, meaning that if the economy takes a turn for the
worse, you loosen regulation in order for the economy to
recover more quickly. And then in the boom phase, you tighten regulation
in order to prevent excessive
behavior by banks, excessive lending that would
cause the next crisis. Well, AI is unfortunately
quite procyclical. Fourth, the high predictability
and transparency of AI will enable individuals
to bet against it. This is a very
general principle, but imagine that all players, all investors in the stock
market, were AI methods and automated bots that
have been trained on data and that trade based on AI and machine
learning algorithms. Then you only need one human to understand how all these bots,
all these robo-advisors, robot traders, all
these algorithms work in tandem, and that human will be able to read those systems
quite easily, because all these models are predictable and usually quite transparent when it comes to the
way they function. So some problems
might be caused simply by the fact that AI is more
predictable in this sense. And here it is
interesting to look at the distinction between
exogenous and endogenous risk. Exogenous risk has
been made famous by Jón Daníelsson at LSE. Exogenous risk is caused by events
from the outside, much like an asteroid
falling on London, or on Berlin, or Washington. It is easy to measure. It's purely exogenous,
and you can't really do anything about it
when it comes to, let's say, financial
investments. Now, AI, and machine learning as
part of AI, is well suited for the evaluation and management of exogenous risk. You have signals coming
in, you train your models, you use large datasets, you use well-established statistical methods and
repeated events to train. So if you have enough data, if you've seen enough asteroids, if you've seen the data on past asteroid sightings
and astrophysical data, you can try to teach and train your model and then
predict future asteroids. Thus, AI is well suited
for micro regulation and internal risk management
of exogenous events. The problem is, you also
have endogenous risk. Endogenous risk is caused by events from
within the system. It starts when individuals and individual entities within the system stop
acting independently and synchronize their behavior. It's very difficult to measure. The prime example
by Jón Daníelsson from LSE is the
London Millennium Bridge, which was opened
for the millennium, in 2000. The Millennium Bridge was
specifically designed to withstand the wind that
flows along the River Thames; risk management was in place for the bridge to be
safe when it gets windy. The problem is that you have people
on the bridge. In the calculations
that were done, their steps were assumed to be random, so this shouldn't be a problem. But people on the
bridge, when it gets windy, start to counteract. They start to act in a synchronized way: because
the bridge starts to swing in the wind, no one
continues walking in the same random manner as they
would if there were no wind, but everyone starts to counteract the movements
of the bridge. These
counter-movements by the people on the
bridge, not the wind, caused the bridge
to become unstable, so that they actually
had to close the bridge for some time
and do some renovations. This is a prime example
of endogenous risk. You see, the risk
is not the wind; that's the exogenous risk. The endogenous risk comes from the
individual entities, the humans on the bridge, who stop acting independently and start to synchronize
their behavior. And artificial
intelligence methods are trained with various
games for situations that include interactions
between entities. We have full-information games: for example, in chess, all possible moves can be
known, and deep neural networks succeed in these
games because they assume their opponent
is their clone and knows the same. They do not prepare for endogenous
risk situations; they simply cover all bases when it comes to exogenous risk. We have incomplete-information
games: for example, in poker, the opponent's cards are
unknown and AI performs worse than human players
because it cannot theorize about the
opponent's intentions. It can only learn
from past moves. And we have cooperative games: success is even more
difficult for AI when cooperation leads to
multiple local optima, and this is, for example, the case in diplomacy
and game theory. This is where it becomes
quite difficult for AI to beat the human, and this will limit the use of AI
in these situations. Okay. So this is
what I wanted to talk about on SupTech
and systemic risk. There are more and more
discussions under way when it comes to systemic risk and the use of AI and ML
in finance. As you can imagine, AI and
ML will be used in trading. It will be used in
lending, and so on. In many instances, this will touch on
ethical considerations. I will talk about this
in the next video, but you can see that with AI and ML not being rolled out in all parts of the
financial industry yet, we cannot think about
all the possibilities where, in the future, AI and ML will maybe be a
cause of systemic risk, but supervisors and regulators obviously are already concerned about
potential threats to financial stability. In the next video, we'll have a final look at some
ethical considerations.
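The difference between independent and synchronized behavior in the Millennium Bridge example can be made concrete with a small toy calculation. This is a stylized sketch, not a model of the real bridge: it assumes every walker pushes sideways with the same unit strength, represents the timing of each push by a phase angle, and uses a fixed random seed only so the numbers are reproducible. All names are made up for illustration.

```python
import math
import random

def net_sway(phases):
    """Size of the combined sideways push when each walker adds a unit push
    whose timing is given by a phase angle (in radians)."""
    x = sum(math.cos(p) for p in phases)
    y = sum(math.sin(p) for p in phases)
    return math.hypot(x, y)

rng = random.Random(0)          # fixed seed, only for reproducibility
n_walkers = 1000

# Independent walkers: everybody steps with their own random timing,
# so most pushes cancel each other out.
independent = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_walkers)]

# Synchronized walkers: everybody reacts to the swinging bridge in step,
# so all pushes add up.
synchronized = [0.0] * n_walkers

sway_independent = net_sway(independent)
sway_synchronized = net_sway(synchronized)

print(f"independent walkers:  {sway_independent:.1f}")
print(f"synchronized walkers: {sway_synchronized:.1f}")
```

With independent timing the combined push grows roughly like the square root of the number of walkers, whereas with synchronized timing it grows with the number of walkers itself. That is the sense in which the risk comes from within the system and not from the wind.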
36. 36 Ethical Considerations in AI and ML: Hello everyone, and welcome
back to our class in artificial intelligence and machine learning in finance. In this last
video of our lecture, we are going to cover a
huge topic, and a topic that will become even bigger in the next couple
of years and decades. That is the ethical side of the use of AI and ML and
some ethical considerations. We'll start with some considerations
that are taken from a study by Thomas
et al. from this year, and it shows
that we have ethical risks at numerous levels when using
AI and machine learning. For example, if we
start with data, the training data must
be free of biases, and it should include all
relevant types of stimuli. For example, racial and sexist
biases should be excluded. This is one thing you need to consider when training
machine learning models. AI should only be deployed in environments it
has been trained in. That is, we shouldn't have any cross-use for a
different purpose. This might be problematic, especially in the
financial sector, if you think about models that have been trained
on, let's say, lending data or loan data and that are used in a
different context. And especially in Germany, privacy laws must be
taken into account. Data privacy needs
to be respected when collecting and
processing the data. Turning to the algorithm side, we should think about
unethical coding that might be introduced by the programmers and the
developers themselves; the coding should
be free of biases. That is, it is not
necessarily the case that an ethical algorithm becomes unethical due to the
selection of training data; it could be that
developers have already set up the model and set up the AI
such that, for example, when we consider the fact that most developers are white males, there are some biases already
included in the coding. And AI must be controlled for emerging biases
during its life cycle. Last but not least, the business use: the purpose of using AI should be
ethical in the first place. It shouldn't be, for example, to discriminate against certain
parts of the population, and the unintended impact of
AI can also be unethical. So these are some problems
associated with AI and ML, and we'll come back to some of these later on in more detail. We'll start with
racism and sexism in AI. Now, the machines and
the models themselves, they obviously have no bias, but the learning datasets, the training datasets
and algorithms are most likely as biased
as our current society. So most of the AI developers, I shall usually white males. They lack the perspectives
of minorities, and consequently, if
everything goes wrong, existing biases will be
introduced into the machines, and they can be amplified if we are not actively
working against them. And there are several
organizations, at least in the United States, that are working against
biased algorithms, for example, the ACLU, the American Civil
Liberties Union, and they, for example, fought on a wide
variety of issues, and they exposed Amazon's
Rekognition as racially biased. Also, the Algorithmic
Justice League, which was founded in 2016 with the mission to raise public
awareness of biases in AI, is one of these
organizations that have highlighted the dangers
of using AI algorithms in situations where
previously we had human interactions where
we had human managers, for example, that had
to make decisions on, let's say, loan
applications, et cetera. An algorithm is obviously not immune to biases, racism, and sexism just because it's
a machine. If the algorithm itself or the training data
include these biases, we might still end up with an AI method that
produces biased results. The study by Thomas et al. also gave some practical
recommendations. For example, one should
implement a general statement on the firm's or the institution's
intention for AI ethics, ideally an AI ethics
framework or a charter. This should be an extension to the firm's already existing
mission or purpose statement. Usually, you have this in place, especially
in large companies. One should implement an internal application
specific design plan, and you should regularly audit the processes so that if you see ethical risks and
some concerns that these programs might produce
unethical decisions, this should be flagged
and in the end, one can do something about this. And of course, one should
keep records of decisions concerning ethical
trade offs for transparency,
sometimes of course, it might be that you
need to do a trade off, this doesn't mean that you
have to act unethically, but in some cases, you need, of course, to, for example, weigh data privacy
concerns against business purposes and also
not just for your company, but also for the benefit
of your customers. And in this case, these
trade offs should be recorded just
for transparency. How are artificial
intelligence and ML regulated, especially from an
ethical perspective? Well, actually, this is
quite in its infancy. Artificial intelligence,
machine learning have not yet even
been clearly defined. We will later on see
the definition by German BaFin but this is
something in the making. For example, the European
Commission has stated in its digital finance
strategy that it intends to clarify together with the European
supervisory authorities by 2024 at the latest, so in almost three years, whether and how existing financial market
regulation should be applied to the use of big data and artificial
intelligence. Now BaFin has proposed
several principles for the general use of algorithms in decision making processes of financial firms. This is the link
to the document. You can find it under BaFin and then Big Data und künstliche
Intelligenz. These are principles
that represent preliminary considerations. It's not yet regulation, but these are some first
ideas by BaFin for minimum supervisory
requirements regarding the use of artificial
intelligence in supervised financial
institutions. And this is the definition
of AI according to BaFin. They say artificial intelligence is a combination of
large amounts of data, big data, computing resources,
and machine learning. This is rather a
different definition than in our very first lecture, but this is the one
that BaFin chose. So in machine learning, they say computers
are given the ability to learn from data
and experience on the basis of special algorithms and compared to
rule based methods, learning takes place without
the programmer specifying which results are to be derived from certain data
constellations and how. So big data plus machine learning and high
computational power. That's artificial
intelligence in the definition of German BaFin. And again, they mentioned
this not in regulations, but actually this is
basically a newsletter, a publication in
which they summarize their first ideas on what AI is, what it entails, and
what they might be doing in the future
on its regulation. They also say algorithms are rules of action that are usually
integrated into a computer program and solve an optimization problem
or a class of problems. And in addition to
the distinction according to the
type of algorithm, that is, how the problem
is technically solved, applications of
machine learning can also be differentiated
according to result types, the basic distinction
between classification, regression, and clustering,
and according to data types. That's what BaFin says. They also have some
overriding principles, and they mentioned this
in their publication. Again, this is also in a way typical for German
supervision and regulation: it usually states
the responsibility of top management,
that is, of senior management, just like for
risk management. Management is responsible for company wide strategies
and guidelines or policies for the use of algorithm based
decision making processes. We can find the same
kind of recommendation, regulation actually when it
comes to risk management. That management is responsible for overall enterprise
risk management. Now, potentials of such
processes as well as their limits and risk should be taken into account,
clearly stated, and a company wide
strategy for the use of algorithm based
decision processes should also be reflected
in the IT strategy. You can see that at this point, they have these
overriding principles that are very similar to the qualitative regulation we've seen in Basel two
and Basel three. Next, adequate risk and
outsourcing management. Now, financial institutions
should establish a risk management system
adapted to the use of algorithm based decision
making processes. If applications are sourced
from a service provider, management must also set up effective
outsourcing management. Clearly, if you're using AI and ML and you're
outsourcing this, then obviously you
should also keep track of your service provider. Responsibility, reporting,
and control structures must be clearly defined. When establishing
adequate risk management, one needs to consider the risks of algorithm based decision
making processes. That is, risk mitigating measures and processes should start exactly where the risk originates, according to the
polluter pays principle. Also, one should
prevent bias. Biases, that is,
systematic distortions of results in algorithm based
decision making processes, must be
avoided in order to be able to make business
decisions based on results that are not systematically
distorted and exclude the possibility of a systematic
bias based discrimination against certain customer groups. And also, and this is important
to stress, in the end, and probably this is
one of the reasons why financial supervisors
and regulators will be concerned about AI
and ML is in the end, if you have ethical
concerns within AI and ML in its use in
a financial institution, this might cause a
reputational risk. This might cause damage to
your company's reputation, and at that point, this will become a concern
for the regulator. And in accordance with the
polluter pays principle, the risk of bias must be
identified where it can arise. It must be analyzed and either eliminated or at
least mitigated. Now, some things need
to be regulated, but actually some other things
are already prohibited by law. So for some financial services, it is actually
stipulated by law that certain characteristics may not be used for differentiation, that is, for calculating risk
or calculating prices and premiums, and the danger of
discrimination exists if these characteristics are
replaced by an approximation. That is, for example,
if instead of using, let's say, gender or ethnicity, you suddenly use hometown
age and income groups, then in the end, it might lead to the fact
that algorithms will simply substitute one feature by three other features
which are correlated. Now, this again
would be associated with increased reputational
risks and also legal risks. So it's in the best interest of a financial institution to actually prevent
this from happening, and companies should establish statistical verification
processes that exclude discrimination and
such a substitution of features within
AI and ML processes. Now, this was true actually for all
financial institutions. We can also find some
more information from EIOPA, the European Union's insurance and pension fund
supervisory authority, and they had a group
on digital ethics, and they also have formulated
AI governance principles. So they are subdivided
into human oversight, robustness and performance,
data governance, record keeping, transparency
and explainability, fairness and non discrimination, and last but not least
principle of proportionality. Let's talk about human
oversight first. Now, again, this is for European insurance
companies, but obviously, most of the things
are also applicable to financial
institutions in general. Insurance firms should establish adequate levels of
human oversight, taking into account
the impact of specific AI use cases and other governance and
control measures in place. They should select the
level of human oversight, and the selection should be
proportionate to the nature, scale, and complexity
of the risk inherent to the specific AI use case
in that insurance company. Now different roles and
different responsibilities for the staff involved in AI processes should be clearly defined in policy documents. That's human oversight. Next, robustness and performance. Now the firm should
assess and monitor the performance of
the AI systems on an ongoing basis and
take due consideration of their limitations and
potential shortcomings. Now, performance metrics
should be adapted to the objective pursued and the
nature of the data used. You should check
whether you're actually achieving the goals
you've set yourself. Sound data management
obviously is key to ensure the performance
of AI systems, and they should produce
stable outcomes over time. Otherwise, it doesn't make too much sense from a
business perspective, but obviously also
for the regulator, and insurance firms
should develop resilient IT systems and infrastructure that cannot
be tampered with, for example. Next, data governance and
record keeping. Now, insurance firms should
adapt the data governance and record keeping
measures to the impact of specific AI use cases, and data used in AI models should be accurate,
complete and appropriate. Again, sound data
governance should be applied throughout the
AI model life cycle. The data used in AI models should be handled and stored
in a secure manner, obviously, because usually
especially in insurance, this is highly sensitive and confidential customer
data and appropriate records of the data and the
modeling methods should be kept to ensure reproduction
and traceability. Transparency and explainability. The funds should adapt the
types of actors boinations to specific AI use cases and to
the recipient stakeholders. Now, funds should adapt
their explanations to the different types
of stakeholders, and they should strive to
use explainable AI models in particular in high
impact AI use cases. Data used need to be
transparently communicated, and as a result, again, we need data security, data governance, and a
sensible data management. Fairness and non discrimination, sound and transparent governance processes are key to ensure fairness and non discrimination, especially when it comes
to the calculation of insurance premiums. Otherwise, this could lead
to reputational risks. Insurance firms should
conduct their business in a fair manner
when using AI and make reasonable
efforts to take into account the outcomes
of AI systems. Consumers not willing to share very personal and sensitive
data that are not strictly necessary for risk assessments should still have access to affordable
insurance coverage, and insurance firms
should respect the principle of human
autonomy by developing AI systems that support consumers in their decision
making process and avoid unfair nudging
practices introduced by the use of AI methods. And last but not least the
principle of proportionality. Insurance undertakings
should establish the necessary
governance measures that are proportionate
to the nature, scale, and complexity
of their operations. The AI use case impact assessment and
the governance measures should be proportionate to
the potential impact of a specific AI use case on consumers and/or
insurance firms. Insurance firms should then assess the combination of
measures put in place in order to ensure an ethical
and trustworthy use of AI within, for example, premium
calculation. These are the principles set
forth by EIOPA's GDE, the group on digital ethics, on
ethics and AI in insurance. You can see many of
these things can also be applied to the
financial institutions, and this will be a huge
topic in the years to come with more applications,
with more models, and even a higher utility stemming from the
use of AI and ML in finance, we will see more ethical
problems, more use cases, and obviously more
regulation and principles set forth by regulators
and supervisors like EIOPA. And thus, we are at the
end of the lecture. If you are interested in more of the literature and the tools and the references we've used, you will find all of
these on this last slide. I hope you've
enjoyed the lecture. You've learned a little bit
about the usage of AI and ML in finance, and good
luck with the exam.
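
As a supplement to the point made in the lecture about statistical verification processes that exclude the substitution of a prohibited characteristic by correlated features, here is a minimal sketch in Python. This is not a method prescribed by BaFin or EIOPA. It simply assumes that the protected attribute is available for testing purposes, encoded as 0 or 1, and it flags candidate features whose correlation with that attribute exceeds a threshold. The function names, the toy data, and the threshold of 0.5 are all hypothetical and chosen for illustration only.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equally long numeric lists (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def flag_proxy_features(protected, features, threshold=0.5):
    """Return {feature name: correlation} for features that track the protected attribute.

    protected: list of 0/1 values for the characteristic that may not be used.
    features:  dict mapping a feature name to its list of numeric values.
    threshold: hypothetical cut-off on the absolute correlation.
    """
    flagged = {}
    for name, values in features.items():
        r = pearson(protected, values)
        if abs(r) >= threshold:
            flagged[name] = r
    return flagged


# Toy data (invented): eight customers, protected attribute coded 0/1.
protected = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "income_band": [1, 2, 1, 2, 3, 4, 3, 4],  # moves with the protected attribute
    "age_band": [1, 2, 3, 4, 1, 2, 3, 4],     # unrelated to it
}

flagged = flag_proxy_features(protected, features)
for name, r in flagged.items():
    print(f"possible proxy: {name} (correlation {r:.2f})")
```

Note that this only checks one feature at a time. As mentioned in the lecture, an algorithm may substitute one prohibited feature by several correlated ones together, so a fuller check would also test how well all candidate features jointly predict the protected attribute.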