AI & Machine Learning in Finance: Build Predictive Models with R | Omar Koryakin | Skillshare

AI & Machine Learning in Finance: Build Predictive Models with R

Omar Koryakin, Metrology Manager

Watch this class and thousands more

Get unlimited access to every class
Taught by industry leaders & working professionals
Topics include illustration, design, photography, and more

Lessons in This Class

    • 1. course introduction promo video final (2:49)
    • 2. Getting Started with AI & ML in Finance (26:11)
    • 3. Sources of Financial Data output (8:31)
    • 4. Preparing and Cleaning Data (9:35)
    • 5. German Credit Dataset Part 1 (17:48)
    • 6. German Credit Dataset Part 2 (12:47)
    • 7. Generating Synthetic Data (9:26)
    • 8. Basic Linear Regression (17:15)
    • 9. Multivariate Linear Regression Fundamentals (14:14)
    • 10. Multivariate Linear Regression Feature Selection (9:43)
    • 11. Multivariate Linear Regression Advanced Concepts (30:21)
    • 12. Regularized Regression Methods (26:38)
    • 13. Elastic Net and Cross Validation Techniques (9:10)
    • 14. Partial Least Squares in Practice (26:59)
    • 15. Bayesian and K Nearest Neighbors Classification (9:59)
    • 16. Maximum Margin Classification (14:37)
    • 17. Understanding Support Vector Machines (13:20)
    • 18. SVMs in Practice Part 1 (15:35)
    • 19. SVMs in Practice Part 2 (12:31)
    • 20. SVMs in Practice Part 3 (7:00)
    • 21. Decision Trees for Classification (14:12)
    • 22. Decision Trees for Classification Practical Example (14:52)
    • 23. Gradient Boosted Classification Trees (6:51)
    • 24. Decision Trees for Regression (6:50)
    • 25. Neural Network Architecture (17:32)
    • 26. Training Neural Networks (8:38)
    • 27. Intro to Convolutional Neural Networks (6:40)
    • 28. Neural Networks in Practice (11:05)
    • 29. Multilayer Perceptron Hands On Implementation (6:48)
    • 30. Multilayer Perceptron Handling Overfitting (8:54)
    • 31. Multilayer Perceptron Building Deep Models (3:48)
    • 32. Multilayer Perceptron Early Stopping Technique (5:34)
    • 33. Convolutional Neural Networks Practical Example (13:45)
    • 34. Regulatory Technology (RegTech) (12:57)
    • 35. Supervisory Technology and Systemic Risk (17:50)
    • 36. Ethical Considerations in AI and ML (19:43)

Students: 2

Projects: --

About This Class

Disclaimer: This class is intended for educational purposes only and does not constitute investment, tax, or financial planning advice. The machine learning techniques and financial datasets used in this course are for academic learning. Always consult a qualified financial professional for personal financial decisions.

Master the application of artificial intelligence and machine learning to real-world financial problems. This comprehensive course bridges the gap between statistical theory and hands-on practice, teaching you how to build predictive models for credit risk assessment, customer churn prediction, stock return forecasting, and insurance premium estimation, all using real financial datasets and the R programming language.

What You Will Learn

  • How to source, clean, and prepare financial data for machine learning, including handling missing values, outliers, and structural breaks
  • Weight of Evidence (WOE) analysis for evaluating the predictive power of financial variables
  • Techniques for generating synthetic financial data using statistical distributions, Variational Autoencoders, and GANs
  • Linear and multivariate regression, including feature selection, regularization methods (Ridge, Lasso, Elastic Net), and cross-validation
  • Classification algorithms: K-Nearest Neighbors, Support Vector Machines (linear and RBF kernels), and decision trees
  • Ensemble methods: Random Forests and Gradient Boosted Trees (XGBoost) for superior predictive performance
  • Neural network fundamentals: architecture, training via backpropagation, and stochastic gradient descent
  • Building Multilayer Perceptrons and Convolutional Neural Networks in Keras/TensorFlow for tasks like handwritten digit recognition
  • Regularization techniques for neural networks: dropout and early stopping
  • The regulatory landscape of AI in finance: RegTech, SupTech, systemic risk, and ethical considerations including algorithmic bias and fairness

Why You Should Take This Class

AI and machine learning are transforming the financial industry, from algorithmic trading and automated credit decisions to regulatory compliance and fraud detection. Understanding these tools is no longer optional for finance professionals; it is essential. This course equips you with practical, job-ready skills by walking through complete machine learning workflows on real datasets: importing data, preprocessing, model fitting, hyperparameter tuning via cross-validation, and evaluating out-of-sample performance. You will also gain critical awareness of the ethical and regulatory dimensions that every practitioner must navigate.

Who This Class is For

This class is designed for finance professionals, data analysts, quantitative researchers, and students with an intermediate understanding of statistics and a basic familiarity with programming. Prior exposure to concepts like mean, variance, and linear regression will help you get the most out of this course. No prior experience with R is strictly required (the lessons introduce R syntax as you go), but some programming background is beneficial.

Materials / Resources

  • Software: R (free, open-source) and RStudio Desktop (free IDE); installation guidance is provided in the course
  • R Packages used: tidyverse, dplyr, caret, glmnet, rpart, rpart.plot, xgboost, keras, quantmod, corrplot, lubridate, and others (installed as needed during lessons)
  • Keras / TensorFlow: Required for the neural network lessons (Anaconda installation recommended)
  • Datasets: German Credit Dataset (UCI ML Repository), credit card customer churn data (Kaggle), MNIST handwritten digits (via Keras), NASDAQ stock data, health insurance premiums (Kaggle), Advertising and CarSeats datasets (ISLR package)
  • Recommended textbooks (free online): "An Introduction to Statistical Learning" by James, Witten, Hastie & Tibshirani; "Deep Learning" by Goodfellow, Bengio & Courville

Meet Your Teacher

Omar Koryakin

Metrology Manager

Teacher

I am a Principal Metrology Engineer and dedicated technical instructor with over a decade of experience in high-precision engineering, data science, and technology. My professional background spans medical devices, manufacturing, and tech, giving me a practical, real-world perspective that I bring to every course I teach.
I specialize in breaking down complex technical subjects, from Python and Data Science to mastering operating systems like Windows 11, into clear, actionable steps. My teaching philosophy is built on "learning by doing," ensuring that you not only understand the concepts but can confidently apply them to solve real-world challenges.
My courses are designed for learners who want to go beyond the basics and become true power users. Whether you are launching a new car...

Level: Intermediate

Transcripts

1. course introduction promo video final: The financial industry is no longer just about numbers on a balance sheet. It's about the patterns hidden within those numbers. Today, the world's most successful institutions aren't just making decisions based on intuition. They are building intelligent systems that can learn, adapt, and predict. Hello, I'm Omar Koryakin, and welcome to Financial Intelligence and Machine Learning: Mastering Predictive Systems and RegTech. Throughout my career, I've seen firsthand how AI and machine learning have transformed finance from a rule-based industry into a data-driven powerhouse. But there is a huge gap between knowing the theory and building a system that actually works. This course is designed to bridge that gap. We aren't just going to talk about algorithms; we are going to master them. We will move beyond traditional economic theory and let the data speak for itself. You will learn how to use machine learning not just as a tool, but as a competitive advantage, whether you are automating trades, rating bonds, or predicting tasks. Our journey begins with the foundations of AI and statistical modeling. We'll dive deep into predictive modeling, mastering everything from linear regression to gradient boosted trees and neural networks. Deep learning: understanding complex architectures like CNNs and how they apply to financial data. The RegTech revolution: exploring how AI is disrupting regulatory compliance, monitoring, and fraud detection. Ethics and risks: learning the vital considerations of bias, overfitting, and systemic risk in automated systems. If you are a student looking to enter the quant space, a financial professional aiming to upgrade your toolkit, or an AI enthusiast eager to apply your skills to the world's most complex markets, this course is for you.
We believe in an interdisciplinary approach, drawing inspiration from statistics, computer science, and even psychology to build the next generation of financial intelligence. The future of finance is intelligent, automated, and data-driven. The question is, are you ready to lead this transformative industry? Let's get started. I'll see you soon in the first lecture.

2. Getting Started with AI & ML in Finance: I want to give you a first idea of what artificial intelligence is, how we can define it, and what problems we are confronted with when using AI and ML in finance. Now, what is artificial intelligence? If you try to define it, it's best to define it in contrast to human or natural intelligence, natural intelligence being the intelligent behavior displayed by humans and animals. Artificial intelligence, or AI for short, is in contrast intelligence demonstrated by machines, for example by a computer, a robot, et cetera. You can also define artificial intelligence through the cognitive functions that machines mimic, functions that we would usually associate with the human mind and human behavior, such as learning and problem solving. This is one way of defining artificial intelligence. As you can see in the box below, machine learning is actually a subfield of AI. It's the part where we are trying to teach a machine: we are trying to teach a model to learn from input data to generate, for example, some type of output data. This brings us to learning, or just machine learning. And if we are using models that have more layers, we call this deep learning. Then, of course, artificial intelligence, just like natural intelligence, also includes the field of natural language processing. Suppose we are given, for example, an MP3, a video recording, or an audio recording of language that is spoken by a computer or by a human.
This needs to be processed in certain applications, and we need to put it into a form that we can work with on a computer. Next, we also have perception. You might know that certain electric cars have sensors that can detect other cars and humans on the street. This is perception. We need sensors, and then we need algorithms and models that can process this input data, which is actually big data, and make decisions: for example, in the car, make a left turn or make a right turn. Motion and manipulation is the same thing; here we are getting closer to what robots would do. And last but not least, there is also social intelligence and what we call affective computing. These are all different subfields of artificial intelligence, and machine learning is really just one part of AI, one that is used quite frequently in finance. That's why I included it in the title, but we will look at all these different subfields, more or less, in this class. Now, artificial intelligence is a lot of statistics, actually, but it's not just statistics. It obviously relies on statistical analysis. It needs to rely on computer science, because we're using computers and computer algorithms to make our decisions and to teach our machine to make decisions. But we also draw inspiration from psychology, and also from biology and medicine; this is where it all gets mixed up, really. We have statistical models that are built to mimic, for example, the neural networks that we know from biology and human medicine, and that are then used on a computer. So you have statistics coming up with a model that is related to biology and used on a computer, and of course engineering, and so on. This makes artificial intelligence very, very interesting, because it's interdisciplinary: we're using models that are part statistics, part computer science, a little bit of psychology, and so on.
Is artificial intelligence really a new field of research? Actually, no. The first examples of AI date back to ancient automatons. These were machines that were able to learn very, very basic things or to make certain decisions, and they were used very early in human history; at that time they were called automatons. But even modern methods, for example the neural networks that we will discuss in this class, were first proposed in the 1940s. So some foundations of AI are actually very old, but we have new statistical models and methods that have been developed over the last couple of years. We have also seen an increase in the available amounts of data, which is very important, because even though we might have had the methods 10, 20, or 30 years ago, we didn't have the data, and the beauty of AI really comes into play when we have a lot of data together with very good algorithms and very fast computers. That's the third bullet point here below: the increase in computer processing speeds has made it possible to use modern algorithms on huge amounts of data, which makes it even more interesting to use AI. And then, of course, there is the reduction in data storage costs. The combination of all four points, new methods, more data, faster computers, and lower data storage costs, has led to a vast increase in the interest in AI and machine learning in the applied sciences, even though the basic foundations of AI may have been laid as early as the 1940s, 50s, 60s, and 70s. Now, that's AI and ML in general. So what about AI and ML in finance? Here are just a couple of benefits we can reap by using AI and ML in finance. First, financial operations, which are usually trades, transactions in a stock market or a bond market, for example, are based on predefined rules.
So by automating these rules with AI and machine learning, we can reduce costs and increase speed. We can implement trading algorithms that make decisions on their own. This may reduce costs, obviously, but it can also increase speed, and by increasing speed we might be the traders in the market who are quickest to buy or sell, thereby increasing our profits or minimizing our losses. Second, financial decisions. Every financial institution and every investor has to decide, for example, whether to grant a loan, how to rate a bond, or whether or not to make an investment. These decisions usually require quick but also fact-based judgment calls. We shouldn't make them based on emotions. We should base all our financial decisions on hard facts: balance sheet data, income statement data, analysts' forecasts, fundamental data from the market, macroeconomic factors, et cetera. So if financial decisions do not rely so much on emotions but on hard facts, we can take those fact-based judgment calls and try to teach them to a machine. We can use AI and ML for automated decision making when, for example, granting a loan, rating a bond, or buying or selling a stock. AI and ML algorithms make these fact-based and hopefully objective decisions, and this is another advantage: they will always comply with the laws and regulations if we program them to do so. Even if we don't get an increase in speed or a reduction in costs, this can be very attractive to financial investors, especially to regulated and supervised financial institutions, because it takes out the human element, which is error-prone. Compared with a situation where humans make these decisions, we have an advantage, because the machines will always comply with the laws and regulations.
That's one advantage we get besides the increase in speed and the reduction in costs. In addition, AI and ML do not require economic theory; they simply use data to detect patterns. This is an advantage and a disadvantage at the same time. The advantage is that we are letting the data speak for itself, and we get patterns that we do not need to explain: this is just the reality we can observe in the market. I'm pretty sure researchers will disagree on this, and you will probably find researchers arguing both cases. You could argue that this is a disadvantage of AI and ML, because we do not have any economic theory that can explain these patterns; we are only modeling the patterns that we can observe in the market. The opposite argument is made, I think, in the textbook by López de Prado, who argues that this is a huge advantage of AI and ML, because you are not relying on any theory that in the end might not have been tested and might not be testable at all. You concentrate only on what you can observe and the patterns you can see in the data, and then hopefully AI and ML will tell you what these patterns are. Now I want to give you a very simple introductory example of machine learning, taken from the textbook by John Hull. You can download the data from John Hull's website; the link is here. The task is to predict people's salaries based on their age. The sample is of size 30, so N is equal to 30, and we will divide it into three datasets. The first one is a training set, the second one is a validation set, and the third one is a test set. We will use three models, a linear model, a polynomial model, and another polynomial model of higher order, to predict the salary of people, with X being the age. The linear model, let me highlight this for you.
The linear model simply assumes that we use people's age, X, to estimate salary, which is Y, and we need to estimate the parameters A and B. If we use a polynomial model of order two, we take age as a linear term, and we also take its square and include it in our model. Then again, we have a polynomial model of order five, where we have X, X squared, and X taken to the third, fourth, and fifth power. Very simple, and again, we are using three sets of data. Now, this is the training set. We have age 25 and a salary of 135,000, et cetera, age 55 and 260,000. These are very wealthy people, actually. If we plot the training set, at first you just see the data. One would think that it could be a model that looks like this. It could also be a model that looks like this one here, or a model that looks like this. That's why, absent any theory, we need a good procedure to train our machine learning algorithm and come up with a model that is able to explain the data we have. A very simple linear model would look like this. So we estimate this linear model. Later on, we will see that this is regression analysis, and we are estimating a linear model based on regression analysis in this case. The quadratic model would look like this. If we go back to our data, a quadratic model could be okay, but maybe it's rather a polynomial model of order five, which gives us a pretty good fit. We could have seen this from this plot here, but you can also see it in the data. The polynomial model of order five is the most flexible one, and it will yield the best fit. However, as you can see here, we have ups and downs and ups again. This might be indicative of what we call overfitting.
We have a model that is too flexible, and it will not only model the pattern in the data but might also be modeling the noise. We maybe have some errors, some outliers. In contrast, the linear model is not very flexible at all. It's rather simplistic, as you can see here in the blue line. It only has two parameters, so it might be too inflexible, and this is what we call underfitting. The solution is that we have to backtest our model using the second dataset, which we call the validation set. We check whether the model generalizes well to the validation dataset, and this is what is shown in the second data plot here. We have the ten points, the ten observations in our validation set, and what we then do is estimate the so-called root mean squared error. We take our three models, the linear and the two polynomial models, and first of all we estimate them based on the training set. What I can show you here, if we erase a little bit, is quite simple. We can take the differences between those points and our model, and those are the errors. For example, if I were to take a linear model, which would look like this, then I can show you the errors here. The errors are these differences, quite simply, and so on; you get the idea. If you now take these errors, square them, take the mean, and then take the square root, you get what we call the root mean squared error, the RMSE. In our training set, it's quite clear that the most flexible model, which in this case is the polynomial model of order five, gives the best fit: we have a root mean squared error of about 49,000 for the linear model, 32,000 for the polynomial model of order two, and 12,000 for the polynomial model of order five.
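The three models and the error measure described in words above can be written compactly as follows. This is a sketch in assumed notation: the lecture names only the parameters A and B of the linear model, so the coefficient names b_1 to b_5, the sample size n, and the fitted value \hat{y}_i are labels introduced here for illustration.

```latex
\begin{align*}
\text{Linear:} && y &= a + b\,x \\
\text{Polynomial, order two:} && y &= a + b_1 x + b_2 x^2 \\
\text{Polynomial, order five:} && y &= a + b_1 x + b_2 x^2 + b_3 x^3 + b_4 x^4 + b_5 x^5 \\
\text{Error measure:} && \mathrm{RMSE} &= \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}
\end{align*}
```

Here x is age, y is salary, and the sum runs over the n observations of whichever set (training, validation, or test) the RMSE is computed on.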
If we now do the same thing in the validation set, we take our models and look at where our validation set lies. We can see that in the validation set, the linear model gives almost the same root mean squared error; the difference is 259. For the polynomial model of order two, we have 33,000, and a difference between the training and validation RMSE of only 622. But for the polynomial model of order five, we get a huge difference. This shows that in this case we have overfitting. The polynomial model of order two now produces the best fit without overfitting the data. As you can see, the root mean squared error is considerably lower for this model than, for example, for the linear model, which means that it is the better model. And the difference between the training set and the validation set is also very small here. We would argue that this is the best model without overfitting; in the validation set, the polynomial model of order five again has a huge root mean squared error. We should use model two: it gives the best fit in the validation set and it doesn't overfit the data. Now, how accurate is the chosen model? We have the RMSE in the training set and in the validation set. Is it the first one? Is it the second one? Actually, neither of the two. The accuracy of our chosen model should not be measured on datasets that were used to fit or to choose the model. So what we need to do is use the third set of data, the test set. Remember that we have 30 observations. We estimated and trained our models using the first set. We validated and chose our model based on the second one, and then, to estimate the accuracy of our model, we have to use the third one. This is a very prototypical approach in machine learning, here with very simple models, of course: we have a dataset.
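The selection procedure just described (fit on the training set, choose on the validation set, report accuracy on the test set) can be sketched in a few lines. The class itself works in R; this is a minimal Python sketch using only the standard library, run on synthetic data generated inside the script, not on the textbook's dataset. The helper names (`fit_polynomial`, `predict`, `rmse`, `make_sample`) and the data-generating formula are assumptions made for illustration.

```python
import math
import random


def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations (X'X) b = X'y."""
    n = degree + 1
    # Augmented matrix [X'X | X'y] for the monomial basis 1, x, ..., x^degree.
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)]
         + [sum(y * x ** i for x, y in zip(xs, ys))]
         for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n + 1):
                a[row][k] -= factor * a[col][k]
    # Back substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(a[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (a[i][n] - tail) / a[i][i]
    return coeffs


def predict(coeffs, x):
    return sum(c * x ** i for i, c in enumerate(coeffs))


def rmse(xs, ys, coeffs):
    """Square the errors, take the mean, then the square root."""
    squared = [(y - predict(coeffs, x)) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(squared) / len(squared))


rng = random.Random(0)  # fixed seed so the synthetic sample is reproducible


def make_sample(n):
    """Synthetic (age, salary) data: a hump-shaped curve plus noise (assumed)."""
    ages = [rng.uniform(25, 70) for _ in range(n)]
    xs = [(age - 47.5) / 22.5 for age in ages]  # rescale age to roughly [-1, 1]
    ys = [250 + 40 * x - 120 * x ** 2 + rng.gauss(0, 25) for x in xs]
    return xs, ys


# 30 observations split into three sets of 10, as in the lecture.
train, valid, test = make_sample(10), make_sample(10), make_sample(10)

fits = {d: fit_polynomial(*train, d) for d in (1, 2, 5)}
train_rmse = {d: rmse(*train, c) for d, c in fits.items()}
valid_rmse = {d: rmse(*valid, c) for d, c in fits.items()}

best = min(valid_rmse, key=valid_rmse.get)  # choose on the validation set only
test_rmse = rmse(*test, fits[best])         # report accuracy on the test set only

for d in fits:
    print(f"degree {d}: train RMSE {train_rmse[d]:.1f}, validation RMSE {valid_rmse[d]:.1f}")
print(f"chosen degree: {best}, test RMSE: {test_rmse:.1f}")
```

The training RMSE can only fall as the degree rises, because each model nests the previous one; which degree wins on the validation set depends on the data, which is exactly why the choice is made there and not on the training set.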
We divide it into a training, a validation, and a test sample. Then we estimate the root mean squared error on the third set, the test set, and this gives us 34,275 as an estimate of the RMSE for our model two, the polynomial model of order two. Okay.

Now, a second machine learning example. This one is taken from the textbook by James, Witten, Hastie, and Tibshirani. We have defaults by customers of, for example, a bank or a credit card company, and we want to predict the probability of credit card default based on annual income, which is on the y-axis, and the monthly credit card balance, which is on the x-axis. The defaults are shown in orange and the non-defaults in blue. One can immediately see that there seems to be a pattern at work here, meaning that we can simply draw a line here and we would be very good at classifying our observations as defaults or non-defaults. Now, what does this mean? It would mean that a balance on our credit card below, let's say, 1,200 is okay: anyone with a credit card balance below 1,200 has a pretty high probability of not defaulting, whereas over here, a customer looks pretty much bound to default. This is a second task we will see quite often in machine learning. We want to train an algorithm to classify observations as good or bad, as ones or zeros, and so on, and we need models to do this for us. Here we have two predictors, balance and income, and of course we want to do this with many more predictors. In this case, if we were to use box plots, you could see that with income, which is on the y-axis, there's actually no real pattern. For any level of income, the probability of default is probably the same. You can see this here: for default and no default, the means and the quantiles are very close together.
Now with the balance on the credit card, there's a huge difference. As you can see here, no default lies down here, and default lies up here. Credit card balance seems to be a very good predictor of the probability of default, and this can then be used in a machine learning model or algorithm to make automated decisions on whether, for example, a new customer is bound to default or not. So what could we do? We can do the same as before. We could fit a linear model to the data that could look like this: default equals A plus B times income, with the two parameters A and B. But the problem is that, in contrast to our first example, where we were trying to estimate and forecast a quantitative variable, salary, we now want to estimate and forecast probabilities of default. With a linear model, the predicted probabilities can be negative, and this is a problem here. If we were to use this linear model, the blue line, the probabilities we are trying to estimate could suddenly be negative, and this should not be the case. We cannot use the linear model; we need a different one. The solution, which is a better idea, is to fit a so-called logistic function, a logistic model. We will see this later on in the lecture. The logistic model takes the conditional probability that we have a default, given the income, and divides it by one minus that probability; these are the odds. If we now take the natural logarithm of these odds, we get what we call the logit. This is the logit model, or logistic model. When we estimate it, as you can see here, all those probabilities are now between 0 and 1, and this is a more suitable, better model for forecasting those probabilities of default. This will actually be a huge part of this lecture: classification problems, classifiers, and machine learning algorithms that can classify loans, stocks, and investments. Okay, this was a short introduction.
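The contrast between the linear model and the logistic model can be made concrete with a few lines of code. This is a minimal Python sketch, not the R code used in the class; the coefficients `a` and `b` and the predictor values are hypothetical numbers chosen only to show the effect, not estimates from any dataset.

```python
import math


def linear_probability(x, a, b):
    """Linear model: p = a + b * x. Nothing keeps p inside [0, 1]."""
    return a + b * x


def logistic_probability(x, a, b):
    """Logistic model: p = 1 / (1 + exp(-(a + b * x))), always strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-(a + b * x)))


def logit(p):
    """Natural logarithm of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))


a, b = -0.5, 0.001                     # hypothetical coefficients
predictor = [0, 250, 1000, 2000]       # hypothetical predictor values

linear = [linear_probability(x, a, b) for x in predictor]
logistic = [logistic_probability(x, a, b) for x in predictor]

for x, p_lin, p_log in zip(predictor, linear, logistic):
    print(f"x = {x:5d}: linear {p_lin:6.2f}, logistic {p_log:5.3f}")

# The logit of the logistic probability recovers the linear predictor a + b * x.
recovered = logit(logistic_probability(1000, a, b))
```

With these numbers the linear model produces values below 0 and above 1, which cannot be probabilities, while the logistic model stays inside the unit interval. The last line shows why the model is called a logit model: taking the log-odds of its output gives back the linear expression a + b x.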
You should now know what AI is and that machine learning is a subfield of artificial intelligence, and you should have seen two pretty simple examples of how these models can be applied in economics, business, and especially in finance.

3. Sources of Financial Data output: Now, in this video, I would like to start with Chapter two: data sources, data generation, and data preprocessing. As you can see on this slide, of the three subjects we have, I would like to start with the data sources and data types that we will usually encounter in applications of AI and ML in finance. I want to give you an idea that it's no longer just about capital market data. We need additional data sources, and this leads us to new problems, because we need to merge all these different data sources and then preprocess the data in order to make the most of our machine learning and artificial intelligence algorithms. In some cases, last but not least, we are also led to the problem of having to generate new data that we can feed to our algorithms. So we'll start in this video with data sources and data types. We start with financial data only, and even financial data are quite diverse. They come at different levels of complexity, and they can be about prices, about indexes, about transactions, and so on. We'll shortly see what financial data are, but even financial data are quite diverse and have always been quite diverse. The data can be structured or unstructured. They come at low frequency or high frequency. Sometimes we have prices available only once a week, or we have prices that are sampled from transactional data at five-minute intervals, for example; then we get what we call intraday data or high-frequency data. At the other end, we might just have balance sheet data. Balance sheet data is only published every quarter, sometimes just once a year.
The data can also be publicly available, or it can be private. If it's published, if it's disclosed by firms, like a balance sheet or an annual report, it's publicly available. But sometimes we also have private data, data that is only available to one company and that is actually a business secret. Financial data can be complemented with alternative data. I call this alternative data; you could also just say non-financial data. Now, we've always worked with financial data in finance, obviously: with balance sheet and income statement data, with prices, with market data. Nowadays, in research, but also here in teaching and, of course, in practice, we are trying to complement financial data with additional data sources from the non-financial realm. These could be satellite images, data from Twitter, data from Facebook, or data from some other source that we don't know about yet or haven't thought about yet. The combination of financial data with non-financial data, first of all, is what makes artificial intelligence and machine learning so necessary, because we have big data, we have more data available, and we need more powerful algorithms and statistical tools to deal with this type of data. And second, this is also what makes artificial intelligence and machine learning so interesting in finance, because we can hopefully see more than by just looking at financial data. So more data sources lead to more data, and more data in turn leads to the need for AI and ML methods to process this data, but also, of course, for big data algorithms. But then different data sources, structured and unstructured, high and low frequency, all this makes data preprocessing necessary. We need to think about how we can combine and merge this type of data, how we can work with this amount of data most efficiently, and we need to make sure that our algorithms do not run into problems in between, so we have to preprocess the data.
Now what types of financial data do we have? We have fundamental data like balance sheet items, income statement items, and also macro variables that are usually published and processed by central banks. Obviously, we have market data: prices, yields for bonds, implied volatilities when it comes to options and other derivatives. We have transactional data, trading volume, dividends, coupons, open interest, quotes, cancellations, and so on. There's a huge universe of data available when it comes to market data. In the third category we have analytics, which uses fundamental and market data and extracts what we will later call features in machine learning. We have analyst recommendations, credit ratings, earnings expectations, and news sentiment. So this is data that has already been processed. It builds on market data and fundamental data, and it complements this data with, for example, the recommendation of a financial analyst. Then the fourth category is not financial, but it complements the other three categories. It's the alternative data section, for example images. These could be images of persons, of companies, of products, but also simply images of buildings. Then there are Google searches, Twitter, chats, and metadata. In this class, just like in big data analysis in finance in practice, we'll concentrate on all four types of data, financial and alternative, to make the most of AI and ML in our applications. Now, where do I get financial data? Usually from vendors like Bloomberg, Compustat, or Eikon, which used to be called Datastream, so I mention this here. These are professional data vendors, and the data usually only comes at a high price. So you have to pay a lot for Bloomberg, Compustat, Eikon, or Datastream, but you also get high-quality data. There are, of course, lower-quality data sources like Yahoo Finance, but don't expect too much from these free-of-charge sources.
It's just like in any area of life: if you pay more, you usually get a better product, and this is also the case here. But especially in machine learning, it's quite nice to see that there's now a huge community of practitioners and researchers who have published data, code, and algorithms, and you can access these. For example, you can find a lot on Kaggle or in the UCI Machine Learning Repository. UCI is the University of California at Irvine. If I click here, kaggle.com or archive.ics.uci.edu/ml, you get Kaggle or the UCI Machine Learning Repository. You find a lot of data and a lot of examples, and I encourage you to look these pages up and see what you can use yourself. You also get data from public government agencies, for example in the US and the European Union, and you can get census data. This is maybe not alternative data, as it has been used especially in economics for a long time, but it's a first step to complement traditional capital markets data with more alternative data samples. The same goes for the economics datasets from the World Bank. Throughout the class, we will introduce different sources of data, different data samples, and different databases, and you will learn how to import different kinds of data based on, say, CSV files, PDF files, but also image data, pictures, in our practical applications throughout the lecture. 4. 4 Preparing and Cleaning Data: Hi, everyone, and welcome back to our class in artificial intelligence and machine learning in finance. The topic for today is data preprocessing. We've talked about the different data sources we have at our disposal in finance nowadays. We have financial markets data, and we have alternative datasets, so we are complementing our data from, say, balance sheets, income statements, and markets with data from, let's say, Twitter and Facebook, but also geographic data, satellite images, et cetera.
So financial and alternative data nowadays are often quite incomplete. For example, you have missing values, or you have missing attributes for some data points. Sometimes we have granular data, sometimes we have aggregated data. So depending on what you want to achieve with your data analysis, you sometimes need to aggregate the data, or you need to disentangle, if that's possible, aggregated data back into the granular data they were created from. Financial and other data are often also noisy; they contain errors. You have outliers. The outliers may not be erroneous, but you may want to exclude them anyway because they might drive your results in a direction you don't want. The data could also be inconsistent. For example, they could contain discrepancies in codes or names. A very simple and trivial example is that many companies don't just have one single name. They have many legal entities, and all the subsidiaries and companies that belong to one large conglomerate share one part of the name but have slightly different names in different countries, and you need to make sure that these inconsistencies are taken care of. So data preprocessing is about resolving these issues: making sure you have complete data, you have no errors in your data, and everything is consistent. You need to transform raw data into a format that is understandable by the computer and by machine learning algorithms. Data preprocessing is key to good model performance. The first thing that could happen if you have errors in your data or incomplete data is that you might not be able to perform your analysis in the first place. Your algorithm might just stop. But that's actually the good case. The bad, the worst case is that your algorithm runs on the data and you don't see what the errors produce. You don't see the errors in your result. You only get an output that is biased by the errors you haven't identified before.
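The three issues just listed (missing values, outliers, inconsistent names) can be checked with a few lines of base R. This is only a sketch on an invented toy table; the column names and the 1.5 × IQR rule for flagging outliers are my assumptions, not something prescribed in the lecture.

```r
# Toy data, invented for illustration only.
loans <- data.frame(
  firm   = c("Acme AG", "ACME ag ", "Acme AG", "Beta GmbH", "Beta GmbH"),
  amount = c(1200, 1500, NA, 1300, 250000),
  stringsAsFactors = FALSE
)

# 1) Incomplete data: count missing values per column.
na_counts <- colSums(is.na(loans))

# 2) Noisy data: flag candidate outliers above Q3 + 1.5 * IQR.
q          <- quantile(loans$amount, c(0.25, 0.75), na.rm = TRUE)
upper      <- q[2] + 1.5 * (q[2] - q[1])
is_outlier <- !is.na(loans$amount) & loans$amount > upper

# 3) Inconsistent data: harmonise spelling variants of the same name.
loans$firm_clean <- toupper(trimws(loans$firm))

na_counts
loans[is_outlier, ]
unique(loans$firm_clean)
```

A flagged value is only a candidate: whether it is a data error or a genuine extreme observation still has to be decided by looking at the raw data.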
So we need data preprocessing, and this typically involves two steps. First, you need to understand the data. You need to make sure that you have a good feeling for what the data looks like, and this includes looking at the raw data, going the extra mile, looking at the Excel file, the CSV file, or whatever other format the raw data might come in, and getting a good understanding of what the data looks like. And then you have to prepare the data so that your machine learning and artificial intelligence algorithms can properly work on it. Now, if we have a look at what Lesmeister (2015) proposed, the main tasks of data understanding are, first of all, collecting the data. That's usually a huge task, especially if you think about alternative data. You need to work with the interfaces and the APIs of different data sources. You need to describe the data. In research, this is usually done by looking at descriptive statistics. You have to explore the data and verify the data quality. Look at the NAs, the not-availables, look at missing values, look at data values that are completely off the chart, and make sure that this is not a data error. In the best case, this is an outlier, and then you think about whether you should remove these outliers. These tasks are performed to make sure that the data are adequate to meet the goal that you want to achieve with your data analysis. Furthermore, by exploring the data, determining its sparseness, and identifying missing values, you get a better idea of which learning method might actually be appropriate for this kind of data sample. It might be that you thought of one method, and now, after looking at the data, you see: okay, maybe I should use a different angle. And verifying the data quality is critical. We are talking about algorithms and methods that are based on, and get their huge advantage from, working with big data.
And if you have a huge data sample and the quality is not good, well, it might be that the analysis is doomed from the very start. So it helps to understand who collects the data and how it is collected. You can try to identify incomplete or erroneous data. It might be that the data vendor already tells you: okay, we are rounding values, we don't have access to this sort of data, or for that variable, for that feature, we do have some missing values. It's also possible that the rules for collecting the data change over time. When this happens, it will lead to structural breaks in the data, which isn't a problem per se, but you need to be aware of this and you need to account for it in your analysis. In data preparation, you need to select the data, clean the data, construct the data, and integrate and format the data. In the simplest example, this would simply mean putting it all into an Excel file, a spreadsheet, and making sure that the spreadsheet at the end is in a format that is readable by, let's say, R or Stata or any other statistics program. The goal of these tasks is to get the data ready to use as input for the algorithms. In other words, make it ready to use in your statistical software. This includes merging data from different sources, feature engineering, and maybe further transformations you need to apply to make the data readable to the computer. Some algorithms require categorical variables, yes or no, to be formatted as one or zero, as true or false, or as factors. This sometimes seems to be very trivial, but it is one part of your data analysis that will take up a lot of time. And the data might be split into training, test, and validation sets. We saw that in the very first video. It's quite simple, I guess, but at least you need to remember this. Now, data understanding and data preparation are often not performed separately, but are interrelated.
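Two of the preparation steps just mentioned, recoding a yes/no variable and splitting the sample, can be sketched in base R as follows. The toy data, the seed, and the 60/20/20 split proportions are assumptions chosen for illustration.

```r
set.seed(1)  # arbitrary seed so that the split is reproducible

# Toy data, invented for illustration only.
raw <- data.frame(
  income  = c(30, 45, 52, 28, 61, 39, 47, 55, 33, 70),
  default = c("no", "no", "yes", "no", "no", "yes", "no", "no", "yes", "no"),
  stringsAsFactors = FALSE
)

# Categorical yes/no variable as a factor and as a 0/1 dummy.
raw$default_fct <- factor(raw$default, levels = c("no", "yes"))
raw$default_01  <- as.integer(raw$default == "yes")

# Random split into training (60%), validation (20%) and test (20%) sets.
idx   <- sample(nrow(raw))
train <- raw[idx[1:6], ]
valid <- raw[idx[7:8], ]
test  <- raw[idx[9:10], ]
```

Which of the two encodings you need depends on the algorithm and the package you are going to use later.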
As an example, we will later have a closer look at the German credit data set by Dua and Graff. The data are also available from the UCI Machine Learning Repository, along with a very detailed description of the data. So I encourage you to have a look at the data in the example at UCI, and we'll talk about this credit data set, this loan data set, in the next videos. Now, before we start with the example in the next video, let's talk about which statistics software we will be using in the remainder of the class. We're using R. Everything is performed in R. R is free software under the terms of the Free Software Foundation's GNU General Public License. My team and I would recommend that you use RStudio, the RStudio Desktop, which is just a little bit more comfortable and has a higher usability than the standard software that is distributed. And this is also free. You can find the link here. If I find my cursor, yes, here's the link. Alternatively, if you're studying at Leipzig, you can also download it from our computer center's website. So RStudio is the way to go here, and this is quite convenient. Now, R is a language and environment for statistical computing and graphics. It's one of the major languages for data science, it has been around for decades, and it's highly extensible via packages. That's why, with the new algorithms and new methods for machine learning and artificial intelligence, we've seen an increase in the number of packages that are purely devoted to these methods, and the standard statistics packages are obviously also around. There are, of course, alternatives, most notably Python libraries like scikit-learn, Keras, TensorFlow, et cetera, which are related and similar to some extent. But here we'll focus on R, and you can, of course, also use different software and different programming languages. We'll start with the practical example, but that will be done in the next video. Thank you. 5.
5 German Credit Dataset Part 1: Hello, everyone, and welcome back to our class in artificial intelligence and machine learning in finance. Now, in the last video, we looked at data preprocessing, and this is what we'll do here in this example. This is the German credit data example, taken from the UCI Machine Learning Repository. So you can actually download the same dataset from the Internet and try to go through the individual steps of this example yourself, and I would encourage you to do so. We'll start by importing the data, and we'll read the data by using the read.table command in R. Let me just highlight this here. We are reading it directly from the website at UCI, so you can see archive.ics.uci.edu, the UCI Machine Learning Repository. We are downloading the data into a data structure which, at this point, we call German credit. Makes sense to do so. And to get a first feeling of what the German credit data sample looks like, we simply use, in the fourth line, the command dim for dimension. The dimension of this German credit array is 1,000 rows and 21 columns. We have 1,000 observations of, I guess, 21 variables. At this point, we're not so sure whether the data consist of rows for observations and columns for the different variables. At this point this is actually an assumption, but we'll later see that it makes sense: yes, we have 1,000 observations of loans and borrowers, and we have 21 variables. Now, because this dataset includes a lot of features, we will first reduce the number of features to make them fit on one slide, just for expositional purposes. You can actually have a look at all 21 variables, and maybe in your exercises you will see that you can also use additional variables, additional features, for the purpose for which we are doing all this, that is, forecasting defaults and default rates. But here, for these slides, we wanted to make it fit onto the slide. We're first reducing the number of features.
This is what we are doing in the last line. We are just using columns one through five and 21, so we'll end up with just six columns, but we are keeping all the rows. We're taking German credit, reducing it to columns one, two, three, four, five, and 21, and then writing it back into our data array German credit. Now, to explore the structure of the data, we are using the command str, for structure, in R. If we enter str for German credit, we'll see that the type of object we have here is a data frame with 1,000 observations and six variables. The six variables are, as we've seen before, columns one, two, three, four, five, and 21. They have names which are given by V1, V2, V3, V4, V5, and V21. The first one is a factor variable. It has four levels, A11, A12, A13, and so on; actually, because we only have four levels, the last one is A14. And you can see one, two, four, one, one, four, and so on. These are the first observations for this column. V2 is an integer with six, 48, 12, and so on as first values. V3 is a factor with five levels. V4 is a factor with ten different levels, and so on. And most interestingly here, for V21, we have an integer that seems to only take on the values one and two, so one, two, and again one, one, two. This is probably a variable that looks like a dummy variable but doesn't really have a good coding, so we'll later have to switch that as well. Now, the str command, as I said, gives us valuable information at the very start of our analysis. We're seeing that it's a data frame with 1,000 observations and six variables, by construction, because we only cut out six columns from the initial German credit data frame, and the features are of type factor or of type integer. Now, the factors represent categorical or ordinal variables and have the advantage that category labels are stored only once.
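The import and first inspection steps can be sketched as follows. The object name german_credit and the URL in the comment are my assumptions (the exact address on the slide is not recoverable from the recording), and to keep the sketch runnable offline it reads three hand-typed rows in the file's format instead of the full file; treat their values as placeholders.

```r
# Assumed location of the raw file; the lecture only says it is read from the UCI website.
# url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data"
# german_credit <- read.table(url, stringsAsFactors = TRUE)

# Offline stand-in: three hand-typed rows with 21 whitespace-separated fields each.
sample_rows <- c(
  "A11 6 A34 A43 1169 A65 A75 4 A93 A101 4 A121 67 A143 A152 2 A173 1 A192 A201 1",
  "A12 48 A32 A43 5951 A61 A73 2 A92 A101 2 A121 22 A143 A152 1 A173 1 A191 A201 2",
  "A14 12 A34 A46 2096 A61 A74 2 A93 A101 3 A121 49 A143 A152 1 A172 2 A191 A201 1"
)
# stringsAsFactors = TRUE reproduces the factor columns described in the lecture;
# since R 4.0 the default is FALSE, which would give character columns instead.
german_credit <- read.table(text = sample_rows, stringsAsFactors = TRUE)

dim(german_credit)                              # rows and columns (1000 x 21 for the full file)

german_credit <- german_credit[, c(1:5, 21)]    # keep columns 1 to 5 and 21, all rows
str(german_credit)                              # types, levels and first values per column
```

With the full file, dim should report the 1,000 rows and 21 columns mentioned above before the column selection.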
This requires less memory in your workspace and can enable faster computations than if, in contrast, you had stored the full label as text for every observation. The different values of the factors are referred to as levels. For example, the first variable V1 has the levels A11, A12, A13, and A14. That's actually the way the data frame was coded and programmed. It doesn't really make sense in an economic way. We don't know what the levels are or what kind of variable this is. We need to find out what variable one actually is and what the levels mean. But as you can see here, this is how it's coded. And we can already see the first observations: one, two, four, one, one, four. Those are the values of the first variable for the first ten rows of observations. Now the variable or feature names, we usually call them features in machine learning, are not meaningful in an economic sense. Since it's a German credit data sample, we would expect the names to be something like income, marital status, gender, maybe location, earnings, assets. This would make sense, but from V1, V2, V3 we cannot infer what is meant by these variables. So in the next step, we need to rename them to understand what is meant by variable one, variable two, and so on. And even if the names were meaningful, one should always be skeptical about the provided labels. It could be that they are plain wrong. So again, check the data before starting your analysis and make sure that you know your data. You can actually look at the description here at the UCI website; there is a description of the variables.
And if you look through the description, you will see that, for example, attribute one is actually the status of the existing checking account. It's very old data, so the sums are in Deutsche Mark, and you can see how the four levels are coded: the first level, A11, is that the checking account is below zero. If it's between zero and 200 Deutsche Mark, it's level two. Level three is above 200 Deutsche Mark or salary assignments for at least one year, and the fourth level is that this customer doesn't have a checking account. The second attribute is the duration in months. And most importantly, if we now switch to attribute 21, the last variable we cut out here, the 21st column, this is actually the rating: one is good, two is bad. This is later on our outcome variable. We have our borrowers, and they have a good or a bad credit rating, one or two. It is more or less a dummy variable, but you should probably code it with ones and zeros rather than one and two, so we have to switch the coding later. Now, based on the documentation, we can assign meaningful variable names, and furthermore, we can transform the data frame data type in R into what we call a tibble, which in R is a more modern type of data frame. R was developed some decades ago, and it has evolved over all this time. Some things that were once included in the base version of R are nowadays a little bit outdated and not as fast and convenient as they could be. The same applies to the data frame, which is part of base R. A tibble, as a modern type of data frame, has some advantages over the standard data frame, as described in the tibble documentation in the tidyverse; for example, it has an enhanced print method. So we are changing it into a tibble, but we first use the documentation to assign meaningful column names.
So the colnames command in R, applied to German credit, renames the columns, and we are using checking account status, duration, credit history, purpose, amount, and rating as the new column names instead of V1, V2, and so on. Now let's look at the tibble. We first have to load the library tibble, and then we rewrite German credit as a tibble. We are taking German credit, which used to be a data frame, and with the command as_tibble we write it over German credit, and it's now a tibble. Let's print some first observations here. If we print German credit, we see it's a tibble with 1,000 rows and six columns, and the column types are factor, integer, factor, factor, integer, integer. And for example, in line six here, you can see the first observation, which has A11 as the first factor, then six months, A34, A43, 1169, and one. Yes, one is good, so this is a good rating. And then you have the remaining 990 rows. Now, the first example of analysis is summary statistics. I think I mentioned this in one of our previous videos. You should always have a look at the data itself, at some observations, but also at the summary statistics, which will give you the mean, the first and third quartile, the median, and maybe also the standard deviation or variance of the data. These summary statistics will give you a first impression of the sample, not just of individual observations but of the sample as a whole, and you will probably see whether you might have some outliers. For example, here, the mean rating being 1.3 shows you that most of these observations have a good rating and only a few have a bad rating; otherwise the mean would be closer to two.
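Renaming, converting to a tibble, and summarising can be sketched like this. The snake_case spelling of the column names and the four-row stand-in data frame are assumptions; the lecture only gives the names in spoken form and works on the full 1,000 rows.

```r
library(tibble)

# Small stand-in for the six-column data frame (values are placeholders).
german_credit <- data.frame(
  V1  = factor(c("A11", "A12", "A14", "A11")),
  V2  = c(6L, 48L, 12L, 42L),
  V3  = factor(c("A34", "A32", "A34", "A32")),
  V4  = factor(c("A43", "A43", "A46", "A42")),
  V5  = c(1169L, 5951L, 2096L, 7882L),
  V21 = c(1L, 2L, 1L, 1L)
)

# Meaningful names taken from the data description.
colnames(german_credit) <- c("checking_account_status", "duration",
                             "credit_history", "purpose", "amount", "rating")

# Overwrite the data frame with its tibble version.
german_credit <- as_tibble(german_credit)

print(german_credit)     # enhanced print method: dimensions and column types
summary(german_credit)   # min, quartiles, mean, max; level counts for factors
```

For the rating column, the minimum and maximum in the summary are a quick plausibility check: anything outside one and two would signal a coding problem.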
Obviously, if you were to see the minimum not being one or the maximum not being two, you would know that something's off, because this variable is defined to only have observations with ones and twos. Now, as you can see here with the amount, the maximum of around 18,000 might be an outlier. We have to check this. For credit history, you can see merely the counts for those five levels. Most observations seem to be around A32. Duration is between four and 72 months, which seems to be okay. So use the summary statistics to check your data and see if something's off, if something doesn't make sense economically. That is always a good starting point. Now next, we need to transform the integer variables into numeric, floating-point ones, because we want to facilitate later processing of the data. If we use floating-point numbers, this makes computations easier. For this and some further operations, we rely on the dplyr package. Here on this slide, you have some documentation from the website on which dplyr is provided. It's a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges. You can do all this in R without the help of this package, but this is much more convenient. For example, mutate is a function, a command, that adds new variables that are functions of existing variables. You can use existing variables and simply add new ones that are functions of the old ones. For example, if you need twice the amount, multiplying the amount by two could be one mutation of an existing variable. Select picks variables based on their names; filter picks cases based on their values. Summarize reduces multiple values down to a single summary, and arrange changes the ordering of the rows. These are very convenient functions if you have to do data manipulation over and over again, and this is why we're relying on this package here.
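The five dplyr verbs just described can be tried out on a small toy tibble. The data and the object names below are invented for illustration; mutate_if is used for the integer-to-numeric conversion (newer dplyr versions also offer across() for the same job).

```r
library(dplyr)

# Toy data, invented for illustration only.
credit <- tibble(
  duration = c(6L, 48L, 12L, 42L, 24L),
  amount   = c(1169L, 5951L, 2096L, 7882L, 4870L),
  rating   = c(1L, 2L, 1L, 1L, 2L)
)

# Turn every integer column into a numeric (floating-point) column.
credit_num <- credit %>% mutate_if(is.integer, as.numeric)

with_double <- credit_num %>% mutate(amount_x2 = 2 * amount)        # add a new variable
two_cols    <- credit_num %>% select(duration, amount)              # pick variables by name
long_loans  <- credit_num %>% filter(duration > 12)                 # pick cases by value
avg_amount  <- credit_num %>% summarise(mean_amount = mean(amount)) # reduce to one summary
sorted      <- credit_num %>% arrange(desc(amount))                 # reorder the rows
```

The pipe operator %>% passes the object on its left as the first argument to the function on its right, which is what makes these steps easy to chain.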
It's also part of the tidyverse, which is a collection of R packages that are specifically designed for data science and machine learning. And again, everything is based on the tibble data structure as the modern version of the data frame. So we'll adjust the variables, which were still quite inconvenient. We first transform the integer variables to numeric ones, to floating-point numbers, and we use the dplyr-style notation with the pipe operator. So German credit is now German credit, then the pipe operator, and then mutate_if, is.integer, as.numeric. We're changing the integers to numerics. Next, we transform rating into a factor. It's not one yet; if you go back, it's still an integer that has ones and twos. If we were to include a three, it wouldn't cause a problem. It would still accept a three, but the three wouldn't make sense. In order for rating to only have the ones and twos as levels, we need to change this from integer to a factor. So: German credit equals German credit, pipe operator, mutate, rating equals factor of rating. Actually, this is the original variable. We take the integer variable as input for the factor function and we say, okay, we mutate, and rating now has to be a factor based on the old rating variable, and the labels are now good and bad. So that's how we do it. Last but not least, we show the percentage of good and bad rated credit applications. We take table of German credit dollar rating, which is the rating column, divided by nrow, the number of rows in German credit. This is 1,000, but if we don't know this, we simply calculate the number of rows using the nrow command. As you can see, we have 70% good credits, good loans, and 30% bad loans. So on the next slide and in the next video, we will continue this data preprocessing and data manipulation analysis. Next, we'll concentrate on the weight-of-evidence ratio, but that's up to the next video. Thank you, and hopefully see you in the next video. 6.
6 German Credit Dataset Part 2: Hello, everyone, and welcome back to our class in machine learning and artificial intelligence in finance. Now in this video, we want to continue our practical example based on the German credit data, which is available from the Machine Learning Repository at UCI. We've already seen some summary statistics. We've done some data preprocessing in the sense that we've renamed the variables, the columns in our data array. We have transformed the data frame in R into what we call a tibble from the tidyverse, which is just a more convenient form of a data structure. It's a tibble, that's what it's called. And after having looked at the summary statistics, we now want to continue by using what is called the weight-of-evidence ratio, the WOE. The weight-of-evidence ratio is a first way of having a look at the explanatory power of some features, because in the end, in data science, we want to predict something. We want to forecast something. And as you might have guessed, if the data sample is called German credit data, it's probably because we are trying to forecast and predict default rates in a loan portfolio. For this, we are going to use the weight of evidence to get an understanding of the predictive power of some of our covariates, some of our features. Now, the weight of evidence encodes the relation between a categorical predictor variable and a binary target variable. So we have the binary target variable, which in our case is good rating or bad rating, and we have our numerous predictor variables, which need to be categorical. This weight-of-evidence ratio actually originated in the finance industry, in this very same setting: it was used to separate good from bad risks. It has also been in use in other areas for quite some time now, but it's still best known, I guess, in the finance industry and the insurance industry.
So we'll use weight-of-evidence ratios, and in our case, the ratio is defined as the logarithm of the number of non-events, which in this case is a good rating, so there is no default, divided by the number of events we are looking at, which is a bad rating. Now, ratios of non-events to events close to one indicate that the corresponding category of the covariate has no predictive power for the target value. This corresponds to a value near zero after having applied the logarithm. One should be careful in drawing conclusions from the ratio, as giving a loan to a defaulting customer, for example, is actually worse than not giving a loan to a non-defaulting potential customer. So this is still a purely data-driven approach to get a first glimpse of the predictive power of some of our covariates, and it is completely void of any economic theory. But again, it gives us a first impression and a first hint of what the data looks like. Now, the weight of evidence is calculated in different groups, in different subsamples that are formed based on the covariate of interest. For example, if we had gender as our covariate, this would be very simple, because gender will probably come in two, three, or maybe four levels. If we were only to use male and female, then we would only have two groups, two subsamples. Both subsamples would probably be of roughly equal size. There would be enough observations in each of these two groups, and we could easily estimate and calculate the weight of evidence. For categorical variables, the groups might be the categories themselves or a pooling of multiple smaller subcategories. For example, if we think of our data, we had the checking account status; we might also, for example, have income. Income will most likely be a floating-point or an integer variable, and everyone has a slightly different income. For example, you might have an income of 40,000 euros per year.
The next person might have 40,005 euros per year. So all these income observations would be slightly different. You need to pool them: based on those multiple smaller subcategories and observations, you arrive at larger categories, larger pools, so that there are enough observations in all of these pools. So for continuous variables, one most definitely has to create bins based on thresholds. For example, you could say: okay, income from zero to 40,000 euros, 40,000 to 80,000 euros, and everyone who has an income higher than 80,000 euros per year. Each category or bin should contain at least 5% of the observations to avoid the results being driven by noise or outliers. There need to be enough observations in each and every bin. Let's do this for the checking account status. We'll use the first attribute, which is a qualitative one. Remember that we have four levels: below zero Deutsche Mark, zero to 200 Deutsche Mark, more than 200 Deutsche Mark, and no checking account. For these four levels, we calculate the weight of evidence and some further simple ratios to compare to the weight of evidence. We are using the pipe operator again, let me highlight this here, from the dplyr package. We group by the checking account status; that's the new variable name we've given to what was V1. What we are also calculating is the percentage of total observations, which is simply the length of rating divided by the number of rows. The good rating is the mean of rating being good, and the weight of evidence, as defined on the previous slide, is the log of the sum of good ratings divided by the sum of bad ratings. Let's see what comes out of this. If we print those results, you can see that these four levels, A11, A12, and so on, have a percentage of total observations of 27, 27, six, and 39%. The good rating is 50%, 61%, and so on, and the weight of evidence is more tilted towards the extremes, towards zero and two.
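The calculation just described can be sketched with dplyr as follows. The toy data and the column names are invented for illustration, and the weight of evidence is computed exactly as defined in the lecture, that is, as the log of the number of good ratings over the number of bad ratings within each level.

```r
library(dplyr)

# Toy data, invented for illustration only (15 loans, three levels).
credit <- tibble(
  checking_account_status = c(rep("A11", 4), rep("A12", 5), rep("A14", 6)),
  rating = factor(c("good", "bad", "good", "bad",
                    "good", "good", "good", "bad", "bad",
                    "good", "good", "good", "good", "good", "bad"),
                  levels = c("good", "bad"))
)

woe_table <- credit %>%
  group_by(checking_account_status) %>%
  summarise(
    perc_total = n() / nrow(credit),                                 # share of all observations
    good_rate  = mean(rating == "good"),                             # share of good ratings
    woe        = log(sum(rating == "good") / sum(rating == "bad"))   # weight of evidence
  )

woe_table
```

A level with as many good as bad ratings gets a weight of evidence of exactly zero. A level without any bad rating would produce an infinite value, which is one more reason to make sure that every bin contains enough observations.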
If we plot this and compare the two statistics, the percentage of good ratings in these four bins and the weight of evidence for those four levels, we can see that, based on the percentage of good ratings, one could think, okay, there seems to be a difference between those four levels. It's increasing, so A14 seems to be a level that has more predictive power to explain default rates. But the differences between those four levels are not that extreme. If you look at the weight of evidence, however, you will see that the first level has almost no predictive power, whereas the fourth level, which, if you remember, is no checking account, you don't have any checking account at all, seems to be highly predictive of the credit rating. And this is also what we are looking for. So when looking at the weight of evidence, the third and fourth levels of this variable seem to predict the credit rating quite well, and the weight of evidence shows this more clearly than the percentage of good ratings. Let's turn to the loan duration. The first plot on the left shows you a box plot across all thousand observations. And if you divide this up into good and bad credit ratings, remember that rating is defined as a variable that takes on one and two as values, you can see that, okay, it seems as if, and this is only speculative if we are honest, a lower loan duration predicts a low value for rating, and a higher loan duration predicts a higher value of rating. So a low loan duration seems to imply a good rating, and the other way around. But this is only speculative, because the plot on the right isn't really helping here. Again, let's calculate the weight of evidence separately for each unique value of loan duration. We cannot really do this if we take this integer variable as it is.
As you can see from the plots here, the thousand observations are quite dispersed across the universe of all those loan durations, so we need to create bins. First of all, we calculate the good rating for each value of loan duration. Again, we take the mean of the good rating in each of these groups, we group by duration, and then we form larger groups on a yearly basis, do the same, and calculate the mean of the good rating. So on the left-hand side, this is before grouping. As you can see, there are some loan durations for which we don't even have an observation. And, well, one would say, yes, there seems to be a trend that looks like this, and this becomes much clearer as soon as we group our observations into yearly bins of loan duration. And you can see, yes, the percentage of good ratings seems to decrease the longer the loan duration is. Okay. Now finally, we also have a look at the credit history variable, and for brevity, we do not consider further variables. Remember that we actually had 20 covariates in our data sample. We could have used additional variables, but we have only shown this here for credit history, checking account status, and loan duration. So let's use credit history. It has five values: no credits taken; all credits at this bank paid back duly. Well, the second level, A31, will probably be the best predictor and the best level if we are interested in a good rating, because it means, yes, you've taken up loans and you've paid all those loans back in time. Then: existing credits and loans paid back duly till now; delay in paying off in the past; and critical account, so that's actually the worst state of this variable. Again, let's calculate the percentage of total observations, the mean of the good ratings, and the weight of evidence, and this is what comes out of this analysis.
You can see here it starts with 4%, 4%, 50%, 8%, 29%, and for the weight of evidence, it's even more extreme than what we've seen before. This is the percentage of good ratings. The weight of evidence looks like this, and it's much more extreme, and not surprisingly, the last status, which is that the account is critical, is highly predictive and has high explanatory power for explaining a bad credit rating. So this is the weight of evidence ratio, which can be used to study the explanatory power of some of our covariates in order to see which variables should be included in later models. So this is data preprocessing, and in the next video, we'll talk about data generation.
7. 7 Generating Synthetic Data: Hello, everyone, and welcome back to our class in artificial intelligence and machine learning in finance. We are now at Chapter 2.3, which deals with data generation. Now, we've already looked at data preprocessing. So why do we need data generation? Well, in many cases, machine learning techniques rely on synthetic instead of real data for training and testing purposes. Why is that? Well, in several instances, we need to preserve privacy. We have confidential data, for example, on credit card usage, credit card fraud, et cetera, and it should not be possible to derive any conclusions on the origins of the data or on, for example, a certain person. So we need to preserve privacy, and in some instances we also need additional data because we have too few data to run our algorithms. So synthetic data is artificial data, usually created by algorithms, sometimes via simulations, that should mirror the statistical properties of the original data as well as possible while not revealing information on real people. So we want to preserve privacy, we want to create training data for our algorithms, and we want to test our systems. This is why we need to generate data. Now, many training datasets are highly imbalanced, making classification tasks difficult.
In these cases, synthetic data generation is particularly important for building accurate machine learning models, and we'll see later on how this actually works. Let's at this point look at three different ways to generate data and at the types of synthetic data we can differentiate between. The first one is fully synthetic data. This is when the data is completely synthetic, completely artificial if you want to call it that, and it does not contain any original data entries, any original observations or values of certain variables. Thus, the joint density of the original data is estimated, and we sample random variables from the estimated density function. In other words, we take the original data, fit a statistical distribution to it, and then sample synthetic data from that fitted statistical distribution. In this case, the data has strong privacy protection, but the truthfulness of the data is obviously lost because it's fully synthetic. Now, partially synthetic data is the second possibility. We replace values of selected attributes that have a high risk of disclosure with synthetic data. In the simplest case, this could be that we replace the first name and the surname of our observations, for example, when it comes to credit card data. If we call everyone Mr. X or Mrs. X, this doesn't change the value of the data, but we preserve the privacy of the real people behind the original data. Now, disclosure risk is higher than with fully synthetic data, because, as you might imagine, it is often possible to identify persons not just by name, but also if you take, for example, age, gender, income, place of birth, date of birth, et cetera. So by combining different variables, you might still identify a person. So disclosure risk is higher than in the first case of fully synthetic data.
Then we have hybrid synthetic data, that is, the dataset is generated using both original and synthetic data. Each record in the original data is replaced by the nearest record in the synthetic data: you simulate from a fitted distribution, but you try to replace the original data by the synthetic data that is closest to the original observation. This method combines good privacy protection with high utility at the cost of more memory and processing time. Obviously, this takes longer, but on a modern computer, this shouldn't take too long. How should we generate synthetic data? We have three broad concepts: generating data from a known distribution, fitting a distribution to real data and then simulating from that distribution, and using deep learning. So what do we do in the first case? Well, you simply take a statistical distribution. You simply assume that the data comes, for example, from a Student's t, exponential, or normal distribution, and then you simulate random data from this a priori chosen distribution. The difference to the second method, where we fit a distribution to real data, is that here you simply assume the distribution and you assume it to be known. You make an assumption, for example, a normal distribution with mean two and standard deviation five. This is an assumption that is not validated by any estimation, by any look at the real data. If we now take this to the real data, that is, we still make an assumption on the parametric form of the distribution, but we estimate the parameters and fit the distribution to the data, then we are at the second method. If we have real data, you can determine the best-fit distribution chosen from a given parametric family of distributions. Usually it's parametric, and you can then generate synthetic data by Monte Carlo simulation.
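The first two approaches can be contrasted in a short R sketch. It is only an illustration under stated assumptions: the "real" data is itself simulated here, the normal family is assumed for the fit, and all variable names are made up for this example.

```r
set.seed(42)  # reproducibility of the illustration

# Stand-in for "real" data; in practice this would be an observed variable.
real <- rnorm(500, mean = 2, sd = 5)

# Method 1: simulate from an a priori assumed distribution, N(2, 5^2).
# Nothing is estimated; mean and standard deviation are simply assumed.
synthetic_assumed <- rnorm(500, mean = 2, sd = 5)

# Method 2: assume only the parametric family (normal), estimate its
# parameters from the data, then simulate from the fitted distribution.
mu_hat    <- mean(real)
sigma_hat <- sd(real)
synthetic_fitted <- rnorm(500, mean = mu_hat, sd = sigma_hat)

# Rough check of the distance between fitted and empirical distribution.
gof <- ks.test(real, "pnorm", mean = mu_hat, sd = sigma_hat)
print(gof$p.value)
```

Because the parameters are estimated from the same sample, the Kolmogorov-Smirnov p-value is only a rough indication here, not an exact test.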
The quality of the generated data obviously depends, first of all, on the selection of the parametric form of the distribution and also on the estimation. So we might want to try goodness-of-fit tests. We want to check how far the fitted distribution is from the empirical distribution function. And last but not least, you can also use a machine learning model such as decision trees, providing an approximation to non-classical distributions, for example, if you want to use a multimodal distribution, that is, one that has, say, two humps. In these cases, overfitting might be an issue. You have to be careful: if you use very complex distributions, you might get overfitting. You might be fitting the distribution also to the noise in the data. And as the third method, we have deep learning. Later in this lecture, we'll see different methods from deep learning. I only want to mention two of these here, deep generative models such as the variational autoencoder, VAE, or the generative adversarial network, GAN. Now, the VAE is an unsupervised method where the original data is first compressed by a so-called encoder into a more compact structure, and then the decoder generates a representation of the original data from the compressed data. Then the system is trained to minimize the differences between the outputs and the original data. As you can see from the word encoder, this is something that is also used in audio and visual compression. In the GAN model, we have two separate networks that are trained iteratively. The first network, which is called the generator, takes random sample data to create a synthetic dataset, and the second network, which is called the discriminator, then compares the synthetic data with the real dataset. The generator network is trained to make discriminating between the generated and the real data as hard as possible for the discriminator network.
In a sense, you want to make sure that the discriminating network, which could also be a real person, is not able to distinguish between the synthetic dataset and the original dataset. This is obviously what you want to achieve when it comes to data privacy. No one should be able to determine whether this is the original observation or a synthetic one. This is also frequently used for generating image data. If you click on the link here, you can watch a demonstration of the GAN on YouTube. I don't want to go into detail yet, but you should know that you can also use deep learning algorithms for data generation. Okay, so we've now talked enough about data sources, data generation, and data preprocessing, and next, we will start looking at AI and ML methods with their applications in finance.
8. 8 Basic Linear Regression: Hello, everyone, and welcome back to our class in artificial intelligence and machine learning in finance. After having talked a lot about data sources, data generation, and our introduction to the realm of ML, we've now finally arrived at the applications in finance, and we'll go through all these topics by first describing and discussing the statistical methods and then highlighting the applications in finance. Maybe a little bit surprisingly, we'll start with simple linear regression and multiple linear regression, because these are the building blocks for more sophisticated models in statistical learning. Our discussion of these different AI and ML methods is a little bit heavy on the machine learning methods, but we'll also see some artificial intelligence methods, and we'll discuss the statistical background and then the application. Now, all of these methods are examples of statistical learning algorithms, and statistical learning refers to the set of tools used for understanding data and explaining the behavior of statistical data.
Broadly speaking, we can distinguish two types of models in statistical learning. The first are supervised algorithms, and in contrast to the supervised ones, we also have unsupervised algorithms. What are the differences? Supervised statistical learning means that we estimate, or say we predict, an output based on inputs. A very simple example: you want to predict the wage of workers based on their education and their gender, and we would get what we call a regression problem. The second problem is related to this, but it's called a classification problem because in this case, the outcome variable, the output, is a binary variable. It's a dummy variable that takes on the values one and zero. For example, say we only want to predict ups and downs of the S&P 500. If we now use, for example, macroeconomic variables, X1, X2, et cetera, and we want to estimate the ups and downs, ones and zeros, we have a classification problem. It's the same with the credit card data, for example: we want to forecast default or no default, one and zero. That's a classification problem. In an unsupervised statistical learning algorithm, we want to group input observations with no output variable. The example we have here is what is called a clustering problem. We have data on customers, say, age, income, education, and we want to group them to see certain clusters of customers which share common properties. For example, one group are the middle-aged men, a second group are young women, a third group is maybe all persons with low income, and we get different clusters. And this is an unsupervised statistical learning problem because, as you see, we have no output, no outcome variable Y that we want to predict. And in this case, we have no way to see how these different clusters differ when it comes to an output variable. They only differ based on their inputs. Okay.
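A clustering problem of the kind just described could look as follows in R. This is only a sketch with made-up customer data; k-means is used here as one common clustering method, which is a choice made for this illustration and not something the lecture prescribes at this point.

```r
set.seed(3)

# Made-up customer data: two loose groups in age and income.
customers <- data.frame(
  age    = c(rnorm(50, 25, 3), rnorm(50, 50, 4)),
  income = c(rnorm(50, 30000, 4000), rnorm(50, 60000, 6000))
)

# Scale first, so that income does not dominate the distance merely
# because it is measured in much larger units than age.
km <- kmeans(scale(customers), centers = 2, nstart = 10)

# Cluster sizes; note that there is no outcome variable anywhere.
print(table(km$cluster))
```

The algorithm only sees the inputs, so the resulting groups describe similarity between customers, not differences in any outcome.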
So this was a very short introduction. Let's now turn to simple linear regression, which you know from your introduction to statistics. Now, linear regression is the foundation for many modern supervised statistical learning approaches. We are now looking at supervised algorithms, meaning that we have an outcome. In the setting of simple linear regression, we want to predict a quantitative response, or a metric response, Y on the basis of a single predictor variable X, which means that we have a regression function in which Y is approximated using two parameters, beta zero and beta one, and we assume a linear relation between X and Y, meaning that Y is approximately equal to beta zero plus beta one times X. We say that Y is regressed on X. This is a little bit different in German, where we actually switch the Y and X and say that X is regressed on Y, but it's Y that is regressed on X. We have two unknown parameters or coefficients: beta zero, which is called the intercept, and beta one, which is called the slope. Those are obviously the intercept and slope of the linear regression function. It's a line, and they are estimated by minimizing the residual sum of squares, or RSS. What we get as a result is an estimate for beta zero, beta zero hat, and an estimate for beta one, beta one hat. And if we now have input data X, we can use those estimated parameters, enter X, and we get an estimate for Y, Y hat. Now, how should we estimate this regression function, this line? Very simply by minimizing the residual sum of squares. So we'll start with the prediction Y hat i. We have a certain number of observations of X, we look at this regression function, and we have those estimates. The Y hat i are the predictions. We also have the observations Y i, and by comparing Y i and Y hat i, we get what we call the i-th residual. That's the error. That's also why it's called e. It's the error between our prediction, Y hat i, and the actual observation, Y i.
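Written out, the quantities described so far are the following. This is the standard textbook notation; the equation numbers used on the slides are not visible in the transcript and are therefore not reproduced.

```latex
Y \approx \beta_0 + \beta_1 X,
\qquad
\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i,
\qquad
e_i = y_i - \hat{y}_i, \quad i = 1, \dots, n .
```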
These are the different residuals from our estimated linear regression function. We now square those errors to make sure that negative and positive errors don't cancel each other out. If we square those errors and sum them up, we get the RSS in equation four, which is the residual sum of squares. It's the sum of the squared errors. Now, here you can see one very simple example based on the advertising dataset from the James, Witten, Hastie, and Tibshirani textbook on statistical learning. You see the blue line is the regression line. We have TV and sales, TV being the predictor and sales being our outcome variable, and you see all those different red dots. Those are the actual observations, and the blue line is the estimated regression line. Those small gray lines highlight the errors. We now want to choose the blue line in such a way that the sum of the squared gray lines becomes minimal. This is the RSS. So how should we estimate this? Well, if you remember your statistics classes, minimization of the RSS in this case of simple linear regression leads to the well-known OLS estimator, the ordinary least squares estimator, for beta one and beta zero. X bar and Y bar are the sample means of X and Y. So you take your observations x i, y i, you take the sample means, enter them into these two equations, five and six, and you get the OLS estimates for beta zero and beta one. Actually, this is done on most modern calculators. They usually have functions for simple linear regression. Okay. Now here, in this case, again, the response variable is sales. The predictor is the TV advertising budget. And as you can see here, we have beta zero and beta one on the X and Y axes, and the red dot marks the estimates. This is a contour plot of the RSS, and you can see that this is actually the minimum, so it's very simple to find the minimal value of the RSS for beta zero and beta one.
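The closed-form OLS estimates can be checked numerically. The sketch below uses simulated data as a stand-in for the advertising example (it is not the textbook dataset), and the formulas in the comments are the standard ones for simple linear regression, presumably what equations five and six on the slide show.

```r
set.seed(1)

# Simulated stand-in for the advertising example (not the textbook data).
tv    <- runif(100, 0, 300)
sales <- 7 + 0.05 * tv + rnorm(100, sd = 3)

x_bar <- mean(tv)
y_bar <- mean(sales)

# beta1_hat = sum((x_i - x_bar) * (y_i - y_bar)) / sum((x_i - x_bar)^2)
# beta0_hat = y_bar - beta1_hat * x_bar
beta1_hat <- sum((tv - x_bar) * (sales - y_bar)) / sum((tv - x_bar)^2)
beta0_hat <- y_bar - beta1_hat * x_bar

# Residual sum of squares as a function of the two parameters.
rss <- function(b0, b1) sum((sales - b0 - b1 * tv)^2)

# The built-in linear model should give the same estimates.
fit <- lm(sales ~ tv)
print(c(beta0_hat, beta1_hat))
print(coef(fit))
```

Evaluating `rss()` slightly away from the estimates gives a larger value, which is what the minimum in the contour plot expresses.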
Okay, and here is the three-dimensional plot of the RSS. As you can see, it's a very smooth function. We have beta one and beta zero on the X and Y axes and the RSS on the Z axis in the three-dimensional case. Here, you have the minimum. Okay. We get the resulting OLS regressions, and in this case, since we have three predictors, we can run three estimations and look at the TV advertising budget, the radio advertising budget, and the newspaper advertising budget. And we can already see that there seems to be a strong linear relation between TV advertising and sales. It looks a little less linear for radio advertising and sales, and there's probably not a linear relation between newspaper advertising and sales. But these are three distinct regression analyses. Again, we are in the case of simple linear regression, meaning that we are looking at the outcome variable Y being regressed on X1 in the first plot, Y being regressed on X2 in the second plot, and on X3 in the third plot. It's not a multivariate linear regression. It's still simple linear regression. As you remember from your statistics classes, you probably know that we need to look for measures of the accuracy of the coefficient estimates. We need to assess the goodness of fit of our model, and if we assume that we have a regression function with Y being equal to beta zero plus beta one times X plus an error, how should we assess the accuracy of those coefficient estimates, beta zero hat and beta one hat? Well, we have to look at the standard errors of the OLS coefficients, of the estimated parameters. Because they are estimated from a sample, we do have standard errors, and the standard errors are given by equations seven and eight.
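In the standard textbook form, which is presumably what equations seven and eight on the slide show, the squared standard errors of the two OLS coefficients are:

```latex
\mathrm{SE}(\hat{\beta}_0)^2
  = \sigma^2 \left[ \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right],
\qquad
\mathrm{SE}(\hat{\beta}_1)^2
  = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\sigma^2 = \mathrm{Var}(\varepsilon).
```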
As you can see, what we need is the sample mean of X, and we also need the variance of epsilon, which is the variance of the residual terms, the errors. We need to estimate sigma squared because this is the only thing we do not know: from our sample we know X bar, and we know n, the number of observations. The only thing that is left unknown is sigma squared, the variance of the residual terms, and we estimate sigma via the residual standard error. We do not know what sigma looks like in the population, but we can estimate sigma based on our sample and on the sample errors. We take the residual standard error, RSE: we take the residual sum of squares, divide it by n minus two, take the square root, and we get an estimate for the residual standard error. That's our estimate. What we do next is put it into the equation for the standard error of the coefficient estimates, and then we get the standard errors of our two coefficients. But what should we do with this? Well, if we have the standard errors of beta zero hat and beta one hat, we can construct confidence intervals for beta zero and beta one. The 95% confidence intervals for those two parameters are given, more or less, by the estimate plus and minus two times the standard error of that coefficient, and we get confidence intervals. And how does this help us? Well, later on, we will look at significance tests, and significance tests usually work like this. You have a hypothesized parameter value. Let's say we want to test the hypothesis that beta one is zero. There's no relation between X and Y, meaning that the coefficient is zero. So we take zero plus and minus two times the standard error, and we get an interval. If our parameter estimate lies in this interval, then, at the 95% confidence level, the parameter is not significantly different from zero, meaning that we cannot reject the hypothesis that beta one is zero and there's no linear relation.
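These steps can be traced by hand in R and compared with the built-in output. The data is simulated for illustration, and the interval uses the rule-of-thumb factor of two from the lecture, so it will differ slightly from the exact t-based interval that `confint()` returns.

```r
set.seed(1)

# Simulated data for illustration only.
x <- runif(100, 0, 300)
y <- 7 + 0.05 * x + rnorm(100, sd = 3)
n <- length(x)

fit <- lm(y ~ x)

# Residual standard error: sqrt(RSS / (n - 2)), the estimate of sigma.
rss <- sum(residuals(fit)^2)
rse <- sqrt(rss / (n - 2))

# Standard errors of intercept and slope, with sigma replaced by the RSE.
sxx   <- sum((x - mean(x))^2)
se_b0 <- rse * sqrt(1 / n + mean(x)^2 / sxx)
se_b1 <- rse / sqrt(sxx)

# Rule-of-thumb 95% confidence interval for the slope.
b1    <- unname(coef(fit)["x"])
ci_b1 <- c(b1 - 2 * se_b1, b1 + 2 * se_b1)

print(ci_b1)
print(confint(fit)["x", ])  # exact t-based interval for comparison
```

If the interval around the estimate does not contain zero, the hypothesis that beta one equals zero is rejected at roughly the 5% level, which is the same decision as checking whether the estimate lies within two standard errors of zero.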
So we can use confidence intervals for significance testing. That is shown here. For example, the hypothesis is H0, beta one is zero, versus H1, beta one is not equal to zero. Now we take the standard error of beta one hat and we use this t statistic: beta one hat minus zero, divided by the standard error. So the t statistic measures the number of standard deviations that our parameter estimate is away from zero. If it's far from zero, it is likely that the parameter is actually not equal to zero. If it's close to zero, then we cannot reject H0. Okay. Now, this is what we do for the parameter estimates, but how should we assess the accuracy of the model itself, assuming that we rejected the hypothesis that beta one is equal to zero? We cannot reject a linear relation between X and Y, but how can we assess the accuracy of the model itself? We can take the residual standard error. Recall that associated with each individual observation is an error term, e i, or epsilon in the population. Even if the coefficients were known, these error terms would still exist, and this is why it's not a perfect line. If it were a perfect line, there would be no stochastic behavior in the data, and we wouldn't need regression analysis in the first place. But assume that we have a linear relation between X and Y, and the residual term exists and is simply noise added to our linear regression. Even then, we do have these error terms e i, and these error terms prevent us from perfectly predicting Y from X. Thus we estimate the standard deviation of epsilon via the residual standard error, which is given by the residual sum of squares divided by n minus two, and then taking the square root. And how should we assess this? Well, the RSE provides an absolute measure of the lack of fit of the model, and it is given in units of Y. The problem is that the RSE depends on the scale of Y. The solution in this case is that we scale it.
We take R squared, which you probably know from your statistics introduction. R squared is the proportion of variance explained by the model. R squared is equal to one minus the RSS divided by the TSS, which is the total sum of squares: simply take Y minus Y bar, square it, and sum it up over all observations. This gives us the total sum of squares, the total variance, and one minus the residual sum of squares divided by the total sum of squares gives you the proportion of variance explained by our simple linear regression model. Also note that R squared is equal, in this case, to the squared correlation between X and Y. Actually, this is why it's called R squared. It's the squared correlation between X and Y. This was a very short introduction to simple linear regression. We'll come back to this in some instances. But obviously, we want to step our game up to multiple linear regression, and we'll look at this in the next video.
9. 9 Multivariate Linear Regression Fundamentals: Hello, everyone, and welcome back to our class in artificial intelligence and machine learning in finance. We are now at Chapter 3.3, in which we want to discuss multiple linear regression. In the previous videos, we've seen simple linear regression, in which we use a simple line to predict a response variable based on one predictor. Now, as you can imagine, multiple linear regression is simply the extension of simple linear regression by including more than one predictor variable. So simple linear regression is useful, and so is multiple linear regression. But in practice, we have more than one predictor all the time. In machine learning, we might end up using thousands of different variables, thousands of predictors. And the question now is, what can we do? Two possible approaches come to mind. The first one is to simply estimate several simple regressions. For example, if you have ten predictors, estimate ten simple linear regressions.
The problem is that if you have correlations between those predictors, and this will almost always be the case, these correlations will bias the results in the sense that the coefficient on one predictor variable will be biased upward or downward. So we have an over- or underestimation of the effect of one single predictor, because the estimated effect might simply be due to correlation with a different predictor. The second alternative is much better. It's estimating a multivariate or multiple linear regression. That is, we include more than one predictor variable. So we have our response, which in this case is a metric response, Y, and we want to predict Y on the basis of several predictors, X1, X2, X3, and so on until Xp. We have p variables, thus p plus one parameters: beta one, beta two, beta three, up until beta p for the predictors and beta zero for the intercept. We say that Y is regressed on the set of predictors X1, X2, and so on, and we have those p plus one unknown parameters or unknown coefficients, with beta zero being the intercept and the beta i being the slopes for the different parameters, the different predictors, sorry. They are estimated again by minimizing the residual sum of squares. So we again use OLS, the ordinary least squares method, to estimate those coefficients. And if you've already taken a statistics class, you will probably know that the vector of coefficients, beta hat, is given by the inverse of X transpose times X, multiplied by X transpose times Y. X is the matrix of the observations for X1 until Xp. You take the matrix of X observations, transpose it, multiply it with X itself, take the inverse, and then multiply it again with the transpose of X times Y, and you get those coefficients in matrix notation. The result, if you write out the vector of coefficients, is that you can predict Y using those coefficient estimates, beta zero hat plus beta one hat times X1, et cetera, and your observations X1 until Xp.
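The matrix formula can be tried out directly in R. The sketch below uses simulated predictors and made-up true coefficients; it solves the normal equations with `solve()` rather than forming the inverse explicitly, which gives the same result and is numerically more stable.

```r
set.seed(7)

# Simulated data: three predictors, made-up true coefficients.
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + 0 * x3 + rnorm(n)

# Design matrix with a column of ones for the intercept beta zero.
X <- cbind(1, x1, x2, x3)

# beta_hat = (X'X)^{-1} X'y, computed by solving (X'X) b = X'y.
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

# Predictions from the estimated coefficients.
y_hat <- X %*% beta_hat

# The built-in linear model gives the same coefficients.
fit <- lm(y ~ x1 + x2 + x3)
print(cbind(by_hand = as.numeric(beta_hat), lm = unname(coef(fit))))
```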
Equation 50. Now, in the multivariate case of multiple linear regression, we no longer have a line as in the two-dimensional case. In the three-dimensional case, where we have two predictors and one response variable, we get a plane, as you can see here. Again, we estimate this plane, which is shown here in blue and green colors. You see the observations we have, being the red dots, and the black lines show the estimation errors. So we minimize: we are choosing the plane such that the sum of squared errors is minimal. This is the three-dimensional case where we have two predictors. If we have more predictors, we get a linear hyperplane. Well, let's have a look at this in R. We estimate the regression coefficients, and we are using the Carseats dataset from the ISLR package, which is the companion package for the statistical learning textbook. What we want to do here is predict product sales based on advertising budget, community income level, average age, and average education in those communities. And the way we do this is, first of all, we load the library MASS for regressions. We load ISLR, which includes the data. And then we use the lm function, which is the linear model in R. This is just the very short command for linear regression analysis in R. And as you can see from the command, the syntax is as follows: Sales is explained and predicted by using Advertising plus Income plus Age plus Education. We're using the Carseats data. We are fitting a linear model, and this is now written into our new variable lm.fit. I could also have called it results or results.fit. So lm.fit is the object that holds the fitted linear model. By using the summary command on lm.fit, we can see what the result is. The function call was the formula Sales explained by Advertising, Income, Age, and Education, using the Carseats data.
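The commands described here can be reconstructed roughly as follows. The object name `lm.fit` is inferred from the audio and should be read as an assumption; the column names are those of the `Carseats` data in the ISLR package.

```r
library(MASS)   # loaded in the lecture; lm() itself ships with base R (stats)
library(ISLR)   # provides the Carseats data

# Regress Sales on four predictors; the object name is an assumption.
lm.fit <- lm(Sales ~ Advertising + Income + Age + Education, data = Carseats)

# Call, residual quantiles, coefficient table, R squared, and F statistic.
summary(lm.fit)
```

The coefficient table printed by `summary()` contains the estimates, standard errors, t values, and p-values that the lecture goes through next.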
The residuals are shown here, so we can see the median, the first quartile, the third quartile, and the minimum and maximum residuals after having fitted the model, and these are the coefficient estimates. You see the intercept, advertising, and so on; these are actually beta zero, beta one, two, three, and four. You can see the estimates for the coefficients, the standard errors, the t values, and here you can see which of these coefficients are significantly different from zero. Now, be a little bit careful. In research papers, we usually write three stars for significance at the 1% level, two stars for the 5% level, and one star for the 10% level. This is different in R. As you can see here, three stars means significant at 0.1%, two stars is 1%, and one star is 5%. So in a research paper, you would probably have to add another star, for example, in this case. We can see advertising is highly statistically significant, as is income and as is age. Education is not significantly different from zero, so it seems that our predictor education has no power to explain the car seat sales in this data sample. You can also see, that's on the next page, I think, the number of observations. You can see that the multiple R squared is close to 15%. We have an adjusted R squared of 14%. I'll talk about it later. And we get an F statistic and a p-value for the whole model. So where should we go next? We should check: is at least one of the predictors useful in predicting the response? Well, you've already seen from the R output that three out of four predictors seem significantly different from zero, so they are useful to explain our response. We can use the F statistic for the question whether at least one of the predictors is useful in predicting the response, that is, whether the model itself is useful. Then we have to decide: do we need all of the predictors or only a subset? And we get to the question of variable selection.
Then, how well does the model fit the data? Again, we have to look at the RSE and the R squared. And last but not least, given a set of predictor values, what response value should we predict, and how accurate is our prediction? We are back to forecasting. For example, let me draw a line with my cursor here: in the one-dimensional case, we have these observations and we have estimated a regression line. If we get a new x value, here for example, the predicted or forecasted value would lie on the estimated regression line, or on the extension of that line. Now, let's talk about these questions. First question: is at least one predictor useful? We use the F statistic. We test the null hypothesis that all slopes are zero, that is, that not even one of them is different from zero. So the null hypothesis is beta one equals beta two, and so on, equals beta p equals zero. We test this hypothesis using the following F statistic. Again, we go back to the total sum of squares and the residual sum of squares, and we simply take the F statistic from the simple linear regression case and apply it in the multivariate case. If there is no relationship between the response and the predictors, the F statistic should be close to one. If the alternative hypothesis HA is true, then F should be greater than one. That's the F statistic we've already seen here, and as you can see, it is rather far away from one. The F statistic is also converted into a p value, and it seems that the null hypothesis is rejected. Okay. Now, which predictors are significant? We've already seen in the R output that not all predictors have an insignificant coefficient, but which predictors are significant?
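The F statistic described above can be computed by hand from the total and residual sums of squares. The sketch below assumes the standard textbook form, F = ((TSS - RSS) / p) / (RSS / (n - p - 1)), and checks it against the value that summary() reports. The data are simulated, and the variable names and coefficients are hypothetical.

```r
set.seed(2)
n <- 200   # number of observations
p <- 3     # number of predictors
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
# Hypothetical setup: x1 and x2 matter, x3 is pure noise.
y <- 1 + 0.5 * X$x1 - 0.3 * X$x2 + rnorm(n)

fit <- lm(y ~ x1 + x2 + x3, data = X)

TSS <- sum((y - mean(y))^2)       # total sum of squares
RSS <- sum(residuals(fit)^2)      # residual sum of squares

# F statistic for H0: beta_1 = beta_2 = beta_3 = 0
F_manual <- ((TSS - RSS) / p) / (RSS / (n - p - 1))

# Convert the F statistic into a p value (upper tail of the F distribution)
p_value <- pf(F_manual, df1 = p, df2 = n - p - 1, lower.tail = FALSE)

F_manual
summary(fit)$fstatistic["value"]  # should agree with F_manual
p_value
```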
Again, as in the simple linear regression case, we estimate t statistics and p values. We calculate a t statistic for each predictor, and the square of each t statistic is also the corresponding F statistic. Those t statistics are given here, in this column. We can then convert those t values into p values, and you can see that three out of the four predictors are significant in our regression. However, there is one difference from the linear regression models you've probably seen in econometrics. In machine learning, and when you have big data to analyze, it might be that the number of variables is larger, or even much larger, than the number of observations. You may have thousands of predictors and only, let's say, one or two thousand observations. So there are more coefficients beta j to estimate than observations from which to estimate them. A very simple example: 100 variables and only 50 observations. Two results come out of this. First of all, we cannot fit the regression using multiple OLS. What I've shown you before, the linear model in matrix notation with the OLS estimator, can no longer be used when there are more variables than observations. And second, we cannot use the F statistic. You need to keep this in mind when you have an application where the number of predictors is larger or much larger than the number of observations. So we have checked whether at least one predictor is significant, and we can use t statistics and the p values based on them to check which predictors are significant. In the next video, we are going to decide which predictors we should choose. 10. 10 Multivariate Linear Regression Feature Selection: Hello, everyone, and welcome back to our class in artificial intelligence and machine learning in finance.
We are still in our discussion of multiple linear regression, and after having seen the definition of multiple linear regression and how it is estimated, we want to look at the question of how to select the variables to use in our model. Now, this is slightly different from econometrics, where you usually have an economic theory that guides you in which predictors to select. Here, in the realm of statistical learning and machine learning, it is rather a question of which predictors can increase the validity and the fit of your model. That is why we select variables without much theory, based rather on how much correlation we can see between the response and the predictors and which predictors help us get an even better fit of our overall model. So we need to select the variables, and this is the process of determining which predictors are associated with the response, which ones we should include in our model, and which are the significant ones. We have several alternative approaches at our disposal. The first one is that we compare the fit of all models and calculate measures of each model's overall fit. You can use, for example, the Akaike information criterion (AIC), the Bayesian information criterion, or the adjusted R squared of our models. But this would mean a brute-force approach where we estimate all possible models, and this becomes infeasible quite quickly, as the number of different models for p predictors is two to the power of p. So if we had only 30 predictors, we would get about 1 billion different models. You will not be able to do this, especially when you have thousands of variables. What we need is an automated and efficient procedure to select a subset of models. There are three classical approaches for variable selection. The first one is forward selection.
You start with the null model, which includes only the intercept and no predictor at all. You estimate p simple linear regressions and then add to the null model the variable from the regression with the lowest RSS, and you continue adding variables in this manner until some stopping rule is fulfilled. So, for example, you would start with an intercept and then maybe add X three, then you would add X one, and maybe your stopping rule is already fulfilled, so you stop with just two predictors. That's forward selection. In contrast, backward selection is the opposite. You start with the model that contains all p potential variables, and you remove the variable with the largest p value. You work your way down, checking which variables are insignificant, and you continue removing variables in this manner until, again, some stopping rule is fulfilled. Mixed selection is a mixture of the two. You start with the null model, you estimate p simple linear regressions, and you add to the null model the variable from the regression with the lowest RSS. Then you remove variables that turn insignificant when new variables are added to the model because of correlations between the predictors. I told you in the last video that you should be careful when including or excluding variables from a regression, because it might be that, for example, variable X one is just picking up the correlation of X three with the response rather than a relation between X one and the response. Because of spurious correlation between predictors, the coefficient estimates on your predictors may be biased if you omit important variables. If you now include variables and suddenly another variable turns insignificant, then in mixed selection you might want to consider removing that variable, because other variables are picking up the same correlation. And then you continue adding and removing variables until, again, a stopping rule is fulfilled. So let's look at this in R.
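The forward selection procedure described above can be sketched as a short loop. This is an illustration on simulated data, not the code from the video: the data frame dat, the variable names x1 to x6, and in particular the stopping rule (stop as soon as the AIC no longer improves) are assumptions made for this sketch.

```r
# Forward selection sketch on simulated data (all names are illustrative).
set.seed(3)
n <- 300
dat <- data.frame(matrix(rnorm(n * 6), n, 6))
names(dat) <- paste0("x", 1:6)
# Hypothetical truth: only x3 and x1 drive the response.
dat$y <- 2 + 1.5 * dat$x3 + 0.8 * dat$x1 + rnorm(n)

candidates <- setdiff(names(dat), "y")   # predictors not yet in the model
selected   <- character(0)               # predictors chosen so far
current    <- lm(y ~ 1, data = dat)      # null model: intercept only

repeat {
  if (length(candidates) == 0) break
  # Fit one model per remaining candidate and record its RSS.
  rss <- sapply(candidates, function(v) {
    f <- reformulate(c(selected, v), response = "y")
    sum(residuals(lm(f, data = dat))^2)
  })
  best     <- names(which.min(rss))      # candidate with the lowest RSS
  proposal <- lm(reformulate(c(selected, best), response = "y"), data = dat)
  # Assumed stopping rule: stop once the AIC no longer improves.
  if (AIC(proposal) >= AIC(current)) break
  selected   <- c(selected, best)
  candidates <- setdiff(candidates, best)
  current    <- proposal
}

selected          # variables in the order they were added
summary(current)  # the final selected model
```

Base R also offers step(), which automates forward, backward, and mixed selection using the AIC; the manual loop is shown here only because it mirrors the description step by step.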
We are using the credit card data, and you see the summary statistics for the different variables. You see the ID, which is just the number of the observation, going from 1 to 400. We have income, the credit card limit, the rating, the number of cards, the age of the customer, education, gender, student, married, ethnicity, and balance, which is our response variable in all these regressions. We start with a dummy variable for student: are you a student or not? We use the credit data and we try to predict the balance of the credit card. Again, we use lm, which is the linear model. You can see that the dummy variable for being a student is highly statistically significant. We have a t value of 5.35, so yes, we should include this variable. You can also see that the adjusted R squared for this simple linear regression is close to 7%, so around 7% of the total variance is explained by our model. Now let's do this with the credit card limit. The t value is much, much larger in this case, and again this variable is statistically significant in our regression. We can also see the multiple and the adjusted R squared. It doesn't really make sense that it's called multiple R squared here, since it's a simple linear regression. The R squared is close to 75%, and we can now see that the credit card limit seems to have much more explanatory power when it comes to the balance of the credit card. So if we had to choose, we should include limit rather than student. If we can include one more variable, then we should also include student. And last but not least, the same for income: a high t value, statistically significant. You see here we have three stars, a t value of almost ten, and an R squared of 21%, so we should also include this in our regression. We now look at the AIC.
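A sketch of this comparison: three single-predictor models and one model with all three predictors, compared by AIC and adjusted R squared. The data frame credit_sim is simulated with hypothetical coefficients and only borrows the column names of the credit data, so the numbers will not match the video.

```r
# Simulated stand-in for the credit data (assumption: NOT the real data set).
set.seed(4)
n <- 400
credit_sim <- data.frame(
  Income  = rgamma(n, shape = 2, scale = 22),
  Student = factor(sample(c("No", "Yes"), n, replace = TRUE,
                          prob = c(0.9, 0.1)))
)
credit_sim$Limit   <- 2000 + 60 * credit_sim$Income + rnorm(n, sd = 1500)
credit_sim$Balance <- -400 + 0.26 * credit_sim$Limit -
  7.5 * credit_sim$Income +
  420 * (credit_sim$Student == "Yes") + rnorm(n, sd = 100)

# Three simple regressions, one predictor each
m_student <- lm(Balance ~ Student, data = credit_sim)
m_limit   <- lm(Balance ~ Limit,   data = credit_sim)
m_income  <- lm(Balance ~ Income,  data = credit_sim)

# Multiple regression with all three predictors
m_all <- lm(Balance ~ Income + Limit + Student, data = credit_sim)

models <- list(student = m_student, limit = m_limit,
               income = m_income, all = m_all)

# Lower AIC and higher adjusted R squared indicate a better fit.
data.frame(
  AIC    = sapply(models, AIC),
  adj_R2 = sapply(models, function(m) summary(m)$adj.r.squared)
)
```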
You can see the Akaike information criterion for these three models. With the AIC, lower numbers are better, with a lower AIC signaling a better model fit. The second model has the lowest AIC. We can now see how multiple linear regression would fare if we include all three variables. Let's do this. We estimate a linear model with balance as the response and income, student, and limit as predictors. All three variables remain statistically significant, and we can see that just including these three variables gets us to an adjusted R squared of almost 95%, and the Akaike information criterion is also much lower than for the simple linear regressions. So we can see that yes, we should estimate a multiple linear regression with all three variables. And if we were now to include more variables, at some point, I guess, we would arrive at some variables that can... Okay. So now, how do we assess the overall model fit? We've already seen the adjusted R squared. If you remember the R squared from the simple linear regression, the idea here is that by including more variables, the R squared can only go up. So in order to prevent overfitting, we need to punish our model for using variables excessively. This is done in the so-called adjusted R squared. As you can see, it's basically the R squared, but it is punished for the inclusion of unnecessary predictors. Any predictor that is included but doesn't add enough to the model's fit will lower the adjusted R squared. This is why it's important to see here that the adjusted R squared is very close to the multiple R squared, so all variables are contributing to the model's overall fit, and this is a good thing. So as you can see from the adjusted R squared, we should use this multiple model. 11.
11 Multivariate Linear Regression Advanced Concepts: Hello, everyone, and welcome back to our class in artificial intelligence and machine learning in finance. In this video, I want to talk about some additional details of the multiple linear regression model. In the last video, we already talked a little bit about how to assess the model fit. We've seen that we should use the adjusted R squared rather than the regular R squared, because when we include more and more predictors in our regression model, the R squared can only go up. It can only increase and will never decrease, so at some point we'll include unnecessary predictors that are irrelevant, and this will reduce the quality of the model in the sense that we get overfitting. We have a better fit on paper, but we are including unnecessary complexity in our model. That's why, in the adjusted R squared, we take the R squared and correct it by the number of variables we have included. Adding noise variables to an already well-fitted model decreases the RSS only by a little bit. The adjusted R squared punishes the model for every unnecessary predictor, every noise variable that is included in our model. In this light, we can see that for the credit data, the adjusted R squared increases up to some point, and including more and more predictors beyond that doesn't really change it. Here, I think this is for seven predictors, we reach a point where the adjusted R squared doesn't change much anymore. We could already have stopped here, maybe with four or five predictors; including additional predictors doesn't help us explain the variation in our response variable. So this is important in multiple linear regressions: use the adjusted R squared rather than the R squared. Okay.
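A small simulation can illustrate this point. The sketch assumes the usual definition, adjusted R squared = 1 - (RSS / (n - p - 1)) / (TSS / (n - 1)), and adds up to 20 pure noise predictors to a model that already contains the one relevant predictor. All data and names here are made up for the illustration.

```r
set.seed(5)
n  <- 100
x1 <- rnorm(n)
y  <- 1 + 2 * x1 + rnorm(n)              # only x1 is relevant
noise <- matrix(rnorm(n * 20), n, 20)    # 20 predictors unrelated to y
colnames(noise) <- paste0("z", 1:20)

fit_stats <- function(k) {
  # Model with x1 plus the first k noise variables
  X <- data.frame(y = y, x1 = x1)
  if (k > 0) X <- cbind(X, noise[, 1:k, drop = FALSE])
  s <- summary(lm(y ~ ., data = X))
  c(k = k, R2 = s$r.squared, adj_R2 = s$adj.r.squared)
}

res <- t(sapply(0:20, fit_stats))
res   # R2 never decreases as k grows; adj_R2 is penalized for each extra predictor
```

In the resulting table, the R2 column can only stay the same or increase from row to row, while the adj_R2 column is free to fall when a noise variable adds too little.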
Now, after having fitted our model and estimated the coefficients beta zero, beta one, beta two, et cetera, we want to predict values. We want to forecast. How do we do this? It's very simple. We take equation 15, which says that we can take the coefficient estimates and enter the values for X one, X two, X three, and so on. Then we get a prediction, a forecasted value, Y hat. So you take the estimated coefficients beta hat zero, beta hat one, and so on, and you plug in values for your predictors. These can be actual data, but you can also run a sort of simulation. If you then compute equation 15, you get your predicted value, your forecast for the response variable. Now remember that the forecast will include a bias, and this bias stems from three sources. First of all, we have imprecise estimates of beta zero, beta one, beta two, et cetera. Even when using OLS, ordinary least squares, we will have some imprecision in the estimation of our coefficients because we don't have infinite data. We only have a sample of, let's say, 1,000, 5,000, or even 100,000 observations. This means that there is still some estimation error that will induce a bias in our forecast Y hat. Second, the assumption underlying all of this is that we are using a linear regression function instead of the true function f. In the population, there is a relation between X and Y, between our predictors and our response variable. This relation is the function f. In our modeling approach, we've assumed that f is a linear function, so that we can use a linear regression model. This induces model bias. It might be that f is actually nonlinear. If that is the case, we'll have a bias that stems from the fact that we've chosen the wrong model.
And last but not least, the third source: we don't have a clear relation in the sense that Y is, with 100% certainty, a function of X, linear or nonlinear. Our assumption is that the response variable Y is a function of our predictors plus some random noise, and that's the error term. We have made some assumptions about the error term, for example, that it has an expectation of zero and constant variance, but we still have the error term. Even if our model is correct, if it really is a linear function, and even if we've estimated the coefficients perfectly, there is still some irreducible error that stems from the fact that some noise will always be in the data, and that's the error term. So please never expect your predicted values to be 100% perfect; we have three sources of error, three sources of bias, in our predicted values. Okay. Now, how can we extend the multiple linear regression model? The first extension is qualitative predictors. In the examples before, we've already seen some of them, but most of the time we have metric, quantitative variables, for example income or the price of a stock. But some variables are qualitative, for example, gender in our credit card data. So how should we deal with qualitative predictors? If the predictor is a qualitative variable, it's pretty simple: we can use a dummy variable. For example, if we want to code gender in a very simple way (we could also include more gender types, but let's use only two, female and male), x i is equal to one if the i-th person is female and zero otherwise. Then we get a dummy variable for a predictor with only two levels, and we can include it in our regression just like any quantitative variable. If we have K levels, we can code this in the following way: we use only K minus one dummy variables.
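In R, this dummy coding happens automatically when a predictor is stored as a factor. The sketch below uses model.matrix to show the columns R builds: one dummy for a two-level factor and K minus one dummies for a K-level factor. The vectors gender and group are tiny made-up examples, and group with levels A, B, and C is a hypothetical three-level variable.

```r
# Two-level factor; "Male" is set as the baseline so that the dummy
# equals 1 for "Female", matching the coding described in the lecture.
gender <- factor(c("Male", "Female", "Female", "Male", "Female"))
gender <- relevel(gender, ref = "Male")

# Hypothetical three-level factor (K = 3); baseline is the first level, "A".
group <- factor(c("A", "B", "C", "B", "A"))

# Design matrix: intercept + 1 gender dummy + (3 - 1) group dummies
X <- model.matrix(~ gender + group)
X
colnames(X)
```

When a factor is passed to lm directly, the same coding is applied behind the scenes, so the dummies do not have to be created by hand.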
For example, if we want to code, let's say, age in a very simple fashion, we can say that dummy variable X one is one for an age below, let's say, 20 years and zero otherwise. X two is one for an age above 20 years and below 40 years, and zero otherwise. If, for example, we have only three levels, then we need two dummy variables. If both of these are zero, it's clear that the person is older than 40 years. We have X one for the young people and X two for the people between 20 and 40, and if both are zero, it's clear that the person is older than 40 years. We can go on like this, for example with X three being one for an age between 40 and 60. Then it's clear that if all three dummy variables are zero, it's a person who is older than 60 years. This is how to code qualitative predictors, but you need to be careful when interpreting the estimated coefficients. If you have an estimated beta coefficient for, let's say, a metric variable, you can interpret it as follows: a one-unit change in the predictor leads to a change of beta units in the response variable. Here, you have to interpret the coefficient as a switch from zero to one. What happens if you no longer have a male observation, but a female observation? This is the interpretation of the coefficient. Okay, so this is a qualitative predictor. We can extend the multiple linear regression model even further by including so-called interaction terms. For example, imagine predictor X one increases or decreases the effect of a different predictor X two on our response. To measure this synergy effect, as it's called in marketing, we can use a so-called interaction term. We set up our regression model as follows: the response equals the intercept plus slope one times X one, plus slope two times X two, plus a coefficient beta three, that's the interaction term, times the product of X one and X two.
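The interaction model just described, Y = beta0 + beta1 X1 + beta2 X2 + beta3 X1 X2 + error, can be fitted in R with the colon operator in the formula. The sketch uses simulated data in which x2 is a 0/1 dummy; the coefficients used to generate y are hypothetical.

```r
set.seed(7)
n  <- 500
x1 <- runif(n, 0, 10)        # quantitative predictor
x2 <- rbinom(n, 1, 0.5)      # qualitative predictor coded as a 0/1 dummy
# Hypothetical truth: the slope of x1 is 0.5 when x2 = 0 and 0.5 + 0.8 when x2 = 1.
y  <- 1 + 0.5 * x1 + 2 * x2 + 0.8 * x1 * x2 + rnorm(n)

# x1:x2 is the product term; both main effects are included as well.
# The shorthand y ~ x1 * x2 expands to exactly this formula.
fit_int <- lm(y ~ x1 + x2 + x1:x2)
coef(fit_int)
```

The coefficient labeled x1:x2 is the estimate of beta three, while the coefficients on x1 and x2 are the effects of each variable when the other one is zero.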
This is how you can do it with an interaction term. Again, be careful when interpreting the coefficients beta one, beta two, and beta three. Beta three belongs to the interaction term. It measures how the effect of X one on Y changes if X two changes. The coefficients beta one and beta two now also measure something different, and this is even more important when you're using qualitative predictors. Beta one measures the effect of X one on Y when X two is zero; this is the isolated effect of X one. And beta two measures the effect of X two on Y if X one is zero. This is a little bit tricky. Also, to correctly measure the interaction term, you must include X one and X two themselves in the regression as well. This equation here is correct in the sense that we have included X one, X two, and their product. This is particularly interesting when you have an interaction between a qualitative and a quantitative predictor. For example, imagine X one is income and X two is gender, male or female. In this case, the interaction term beta three would measure the effect of gender on the effect of income on, say, health, if the response is health. How much does gender affect the effect of income on health? Quite interesting. Beta one and beta two would be the isolated effects, for example, for male and female. That's quite interesting when you combine a qualitative and a quantitative predictor. What else can we do? We are talking about a linear model. What do we do if, for example, we see that the effect of X on Y is probably a nonlinear one? Well, you can use a polynomial regression. Let's say predictor X one influences Y nonlinearly. Then we include X one and the squared values of X one, X one squared. Now, note that including nonlinear terms such as X one squared does not change the fact that this is still a linear model. It's still linear in the sense that we have the intercept plus X one plus X two.
We could also have written X one squared as, let's say, X two or X three. We could have redefined our variable. We don't need to know that this is actually a nonlinear term of X one. So it's still a linear model. We can use OLS; it doesn't change anything, but it can lead to a better fit. Using higher-order polynomials can lead to such a good fit that, as was the case with unnecessary noise variables, we again get the problem of overfitting. This is a problem, but otherwise, if we suspect a nonlinear influence, we can try polynomial regression. Let's look at this in R. First of all, load the library ISLR. The linear model is quite simple. We use the Auto data, so we are trying to explain miles per gallon, how many miles you can get out of one gallon of gasoline. We explain this with horsepower plus the squared value of horsepower. Let me highlight this for you. Where is my marker? This is strange. Here it is. You need to use the function I. You cannot simply include horsepower hat two, because the hat has a different meaning in R formulas, so you need to use this function I here. Not very nice highlighting, by the way. You can see that this is a model Y equals X plus X squared. This is the call. These are the results. You can see the residuals here, and these are the coefficient estimates. The coefficient estimates clearly show that the squared horsepower variable, the polynomial term, is also highly statistically significant in our regression. If you were to include only horsepower and then estimate a second model with horsepower and its squared values as a second predictor, you would also see that the multiple and adjusted R squared increase considerably. So yes, it seems that horsepower and a polynomial term are needed to better explain the miles per gallon response variable. Okay. Now, in this plot, you can see three different models.
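A sketch of the polynomial regression and of the role of I(). To stay runnable without extra packages, it uses the mtcars data that ships with base R (columns mpg and hp) as a stand-in for the Auto data from the video, so the numbers differ from the lecture; the lecture's own call is shown as a comment. The sketch fits a linear, a quadratic, and a degree-five model.

```r
# Lecture's call, with ISLR installed:
#   library(ISLR)
#   lm(mpg ~ horsepower + I(horsepower^2), data = Auto)
# Stand-in used here (assumption): base R's mtcars, with hp as horsepower.

fit1 <- lm(mpg ~ hp, data = mtcars)             # linear fit
fit2 <- lm(mpg ~ hp + I(hp^2), data = mtcars)   # quadratic; I() protects the ^
fit5 <- lm(mpg ~ poly(hp, 5), data = mtcars)    # polynomial of degree five

# Without I(), ^ is formula algebra, so hp^2 collapses to hp alone:
# this model has 2 coefficients, not 3.
length(coef(lm(mpg ~ hp + hp^2, data = mtcars)))

summary(fit2)   # coefficient table including the squared term

# Compare adjusted R squared across the three models
sapply(list(linear = fit1, quadratic = fit2, degree5 = fit5),
       function(m) summary(m)$adj.r.squared)
```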
You can see the linear OLS fit, the linear model. You can see a polynomial regression using a second-degree polynomial, and also a polynomial term of degree five. As I said, it is often the case that there is a nonlinear effect between predictor and response variable. However, including more and more polynomial terms in this regression will ultimately lead to overfitting. This can be seen with the green line. You can see that a line like the blue one is fully sufficient to explain the data. You don't need something as wiggly as the green one. So including horsepower squared is sufficient to explain the data; you don't need a polynomial term of degree five. Overfitting is the case here with the green line, but the linear model also underfits the data, which is quite clear. It's okay, but a polynomial regression is much better suited and gets a better fit. Now, what else can happen in multiple linear regressions? There are a couple of problems that I only want to comment on briefly. The first one is nonlinearity of the response-predictor relationship: we have a nonlinear relation between X and Y. We've seen this. One can use residual plots to identify this problem and then, for example, use a polynomial regression or more sophisticated models. The second problem is a correlation of the error terms. The assumption in the OLS model is that the error terms are uncorrelated. If they are correlated, we have a problem. Our model assumptions are no longer true, and we can only use good causal inference, and good experimental design if we have experimental data, to counter this, because a correlation of error terms usually stems from omitted variable bias, where we've missed one important predictor, or from reverse causality. It's not only X that is driving Y, but Y is also driving X. So we have reverse causality, as we call it.
And all of these problems, omitted variable bias and reverse causality, lead to a correlation of error terms because the error terms are picking up something that should have been included. Then we have a problem in our estimation, and we use causal inference to tackle this problem. We will likely not talk about this in this lecture, but it is a huge problem in empirical analysis. The third problem is a non-constant variance of the error terms. One assumption is homoskedasticity, meaning that the error terms have a constant variance. If it is not constant, for example, if the variance increases across observations, one central assumption of the OLS model is violated, and we can use, for example, weighted least squares. Then we have outliers and high leverage points, outliers being extreme values of our response variable and high leverage points being extreme values of a predictor variable. What we can do is try to detect these outliers and high leverage points and then, for example, use winsorization or simply exclude these data points from our analysis, with all the problems that are attached to these methods. And last but not least, we have collinearity, or multicollinearity, meaning that X one and X two, for example, are highly correlated, and this means that one of the variables is obsolete. You don't need both X one and X two. The information needed to explain our response variable is already included in X one, for example. Then we can exclude X two, we can drop it, or we can orthogonalize this collinear predictor. And how can we decide whether we have collinearity? We can use the so-called variance inflation factor, and I'll get to this in just a bit. So let's start with the nonlinearity of the response-predictor relationship. We can look at the residual plots. For example, here, on the left it's the linear fit, and on the right it's the quadratic fit from the polynomial regression.
As you can see, the plot of the fitted values against the residuals should look rather like this one. In this case, we would say: okay, there's no clear trend, it's scattered around zero, and we don't see a certain trend in the residuals for very small or very large fitted values. Now, with the linear fit, we can see from the red line, that's the mean fit, that yes, there is a trend. It starts out very high, goes down to zero, and then rises again. This is an indication of a nonlinear relation. If we do the same for the residual plot of the polynomial regression, we can see it's almost constant, and this is how it should be. So from the residual plots we can see that yes, the relation seems to be nonlinear, and a quadratic model is much better suited than a linear one. Now, let's go to the second problem, which is the correlation of the error terms. I told you that this usually stems from biases like omitted variable bias or reverse causality, and we cannot do too much about it without using more sophisticated models for causal inference. However, how can we decide whether we have such a problem? To see this, here are three plots where we have simulated data with different correlations between adjacent residuals. If we simulate data and our assumption of uncorrelated error terms is fulfilled, so we have a correlation of zero, then you can see that we have a residual here, and the next one can be up here or can also be down there. There is no clear trend in the sense that one residual leads to the next residual being here or here. Now, if we increase the correlation between the error terms to 50% or even 90%, you can see that we have trends, and it looks like a time series, and this is a problem. Here you can see it's highly likely that when you have a negative residual, the next one will also be negative.
And this is due to the fact that the error terms, by construction in this case, are highly positively correlated. So this is how you can try to decide whether our assumption is violated: use a residual plot. Now, the third one is non-constant variance of the error terms. You can see one example where you have the fitted values and the residuals, and if you only plot the mean of the residuals, you would say: well, there is no trend. However, the dispersion, the variance of the error terms, increases dramatically across the data. This is what you can see from this funnel shape of the residuals, in contrast to the one you see on the right. In the left plot, we clearly have the problem of heteroscedasticity. The variance of our residuals, of the error terms in the data, increases from left to right. One simple remedy for this is to use the log of our response. If we do this simple transformation, you can see that it almost looks like the variance stays the same over our full sample. So it's a very simple remedy. It might not always be as simple as this, but this is, again, how we can decide whether heteroscedasticity seems to be a problem. Next, we have outliers. Outliers are extreme values of Y. For example, this observation here, number 20: as you can see, this is an outlier. Now, with outliers, the problem is the following. If we have a look at the plot here, sorry, the one with the red line and the blue line, those are the linear fits for the data with and without this outlier. As you can see, the blue line, for which we fitted a linear model to the data excluding the outlier, is not very different. The outlier in the response doesn't really change our linear model. It doesn't change the linear regression line that we are fitting. However, the problem is that including the outlier can lead to other problems.
For example, it will lead to an increase in the residual standard error, so the RSE will increase. The adjusted R squared will decrease when we include the outlier. The model fit is getting worse. So we could think about excluding this outlier. You can also identify this outlier by looking at the residuals in the middle plot here: again, number 20 is far away from the remaining residuals. And if we studentize the residual, that is, we take it and divide it by its standard deviation, it is still an outlier, so yes, one could think about excluding it. The same goes for the high leverage point, which is observation 41 here, and this is an extreme value of X, of a predictor. Again, you can see that if you plot X one and X two against each other, this is a high leverage point. Now, with high leverage points, you can see here that the regression line changes. In this case not too dramatically, but yes, it does change, and this is the problem with a high leverage point. That's why it's called a high leverage point: one single observation can dramatically affect the estimation of our regression line, of our regression model, and of the coefficients. The coefficient estimates are heavily affected by high leverage points. Again, one could think about excluding this observation. For this, we can also estimate the so-called leverage, and we can see that observation 41 has a high leverage, in contrast to the outlier, observation number 20. Last but not least, collinearity. Collinearity means that X one and X two, for example, are strongly correlated with each other. Here we are using the credit card data, and this is limit and age, and as you can see, those two variables are not really strongly correlated with each other. It's different with limit and rating. As you can see, there's a strong linear relation between the two, and including both limit and rating in a regression model will lead to problems. Why?
Because if you remember the estimation of the coefficients in the multiple linear regression case, the coefficient estimates involve the matrix X transpose times X, and you have to invert X transpose times X. This is where you get numerical problems. If two variables are strongly correlated with each other, you can imagine in terms of linear algebra that these two variables form two columns of our matrix X. If these two columns are almost linearly dependent, the information in one column is redundant, and you can leave out one column. If you don't do this, the problem is that inverting this matrix becomes numerically unstable. So the coefficient estimates are strongly affected by this, because you now run into numerical problems.
This is shown here. Take, for example, beta limit and beta age. From the previous plot, we saw that those two predictors are not collinear, and it's quite easy to find the OLS optimum, which is here. These contour lines are the ones for the RSS; you remember that we are minimizing the RSS. For an optimization algorithm, if you start out here, it's quite simple to find this point, because this is the point where the RSS becomes minimal. Now this is the same plot for limit and rating. As you can see, the contours become quite nasty, and it's now much more difficult to find the OLS estimate of these two coefficients. Why? Because the RSS function that needs to be minimized is not as nice to optimize as, for example, in the left plot. So collinearity, or multicollinearity, is a numerical problem. What should you do? Well, you can simply leave out limit or rating. The information is already included in the other variable.
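The numerical side of this argument can be illustrated with a minimal R sketch. The data below is simulated, not the Credit data from the slides; the variable names `limit`, `age` and `rating` only mimic the example and are assumptions. The sketch compares the condition number of X transpose times X for a nearly uncorrelated pair and for a nearly collinear pair, where a large condition number signals an unstable inversion.

```r
# Simulated stand-in for the credit example (all values are assumptions).
set.seed(1)
n      <- 200
limit  <- rnorm(n, mean = 5000, sd = 2000)
age    <- rnorm(n, mean = 55, sd = 15)        # unrelated to limit
rating <- 0.07 * limit + rnorm(n, sd = 5)     # almost a linear function of limit

# Standardize so that the comparison is not driven by units.
X_ok  <- scale(cbind(limit, age))
X_bad <- scale(cbind(limit, rating))

# Condition number of X'X: large values mean the inverse is numerically unstable.
kappa_ok  <- kappa(crossprod(X_ok),  exact = TRUE)
kappa_bad <- kappa(crossprod(X_bad), exact = TRUE)
print(c(limit_age = kappa_ok, limit_rating = kappa_bad))
```

For the nearly collinear pair the condition number should come out several orders of magnitude larger, which is the matrix-inversion problem described above.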
You can decide on this using the so-called variance inflation factor. You can also orthogonalize: you regress x two on x one and use the residual, so you are only using the information in x two that is not already in x one. That's one possibility as well. So this is collinearity, and this is all I wanted to talk about in this section on multiple linear regression. In the next chapter, we'll look at penalized linear regression models, especially lasso and ridge regression.
12. 12 Regularized Regression Methods: Hello, everyone. My name is Kibo Weiss. I'm a professor of finance here at Leipzig University, and in this video, I want to show you how penalized linear regression models work, as part of our class on artificial intelligence and machine learning in finance. We've already seen multiple linear regression in action and in detail, and we now want to move forward to penalized regression. As the name suggests, it is a kind of regression analysis that penalizes certain aspects of our modeling in its objective function. If you recall the problem of model selection or variable selection, we saw that one can do subset variable selection, either forward or backward stepwise selection. That is, you start, for example, with the null model and then slowly include more and more variables, or you start with the model that includes all p predictors and exclude those predictors that don't add much to the model's fit. The alternative approach, which is much more feasible than subset variable selection, is shrinkage methods, or, as we also call them, penalized regression. So what is penalized regression? In penalized regression models, we use all p predictors. That is, we use all variables that might explain our response variable, and we use them all in one model. In contrast to subset variable selection, the coefficient estimates are constrained, or, as we also say, regularized.
In other words, the different algorithms shrink the coefficient estimates towards zero, so that some of the variables have a coefficient that is close to zero or equal to zero, but all variables are included from the start. We'll see that in the case of ridge regression, all predictors stay in the model: most of the coefficients will be close to zero, but they will not be equal to zero. These are penalized regression models, and we'll start with the so-called ridge regression.
As a reminder, in ordinary least squares we try to minimize the residual sum of squares, the RSS, which is given as the sum of the squared differences between y i, the response variable, and the intercept plus the sum of the coefficients times the predictors. If you look at this equation, you can see the response here, and this part is nothing but our estimate, our prediction. This is f-hat of x, which we could also call y-hat i. We take the difference, we square it, we sum it all up, and we get ordinary least squares. That's the case from our previous section. In ridge regression, this variant of OLS, we still minimize the RSS, but we include an additional component, which is the tuning parameter lambda times the sum of the squared coefficients beta j. Be careful: we do not include the intercept here, only the slopes for the p predictors. We start at j equal to one, and we do not touch beta zero. In other words, it's the RSS that we try to minimize, plus this component, lambda, the tuning parameter, times the sum of the squared coefficients. Now, what does this mean? Note that the squared coefficients are non-negative and the tuning parameter is also non-negative: it's larger than or equal to zero.
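Written out, the two objective functions described here can be sketched as follows. The notation (n observations, p predictors, x_{ij} for the value of predictor j in observation i) is an assumption chosen to match the verbal description, since the slide itself is not part of the transcript.

```latex
\mathrm{RSS} = \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^2

\hat{\beta}^{\text{ridge}}_{\lambda}
  = \arg\min_{\beta_0, \dots, \beta_p}
    \Bigl\{ \mathrm{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2 \Bigr\},
  \qquad \lambda \ge 0
```

The penalty sum starts at j = 1, so the intercept beta zero is not penalized.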
This means that, as we are trying to minimize this, we are penalizing coefficients that are not equal to zero. Ideally, at some point the coefficients will tend to go to zero, and this is governed by the tuning parameter lambda. As you can see, if lambda is chosen to be equal to zero, then we get OLS. Ordinary least squares is embedded in ridge regression as the special case of lambda being equal to zero. As with OLS, ridge regression seeks a model that, first of all, fits the data well, so we are trying to minimize the RSS. But we also have this shrinkage penalty, lambda times the sum of the squared coefficients, and this is small when the parameters beta j, not the intercept but the slopes, are close to zero. The parameter lambda controls this process of shrinking the parameters towards zero. As I told you, if lambda is equal to zero, we get the ordinary least squares estimates, and as lambda tends to infinity, the ridge regression coefficients will all approach zero, because then the effect of the shrinkage penalty becomes infinitely high and the coefficients are shrunk to zero. So this is what happens to the ridge regression coefficients.
As you can see, with every different lambda we get a different set of coefficients. So one has to be careful when comparing OLS with ridge regression, because OLS fits the model to the data only once and we get one set of coefficients. With ridge regression, we repeat this for several values of lambda. We'll compute 10, 50, maybe even 100 models, and we get different coefficients for different values of the tuning parameter lambda. To see how this works, we can have a look at this plot of the coefficients for different lambdas. On the left-hand side, we have lambda on the x axis and the standardized coefficients for variables like income, limit, rating, and student.
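The behavior of the coefficients across lambda can be sketched in base R. This is a minimal illustration on simulated data, not the Credit data from the plot; the helper `ridge_coef` and the true coefficients are hypothetical. It uses the closed-form ridge solution for standardized predictors and a centered response, which is one way to compute the estimates and not necessarily what the class code does.

```r
set.seed(42)
n <- 100; p <- 4
X <- scale(matrix(rnorm(n * p), n, p))   # standardized predictors
beta_true <- c(3, -2, 0.5, 0)            # assumed "true" slopes for the simulation
y <- drop(X %*% beta_true + rnorm(n))
y <- y - mean(y)                         # centering y removes the intercept

# Hypothetical helper: closed-form ridge solution (X'X + lambda * I)^{-1} X'y.
ridge_coef <- function(X, y, lambda) {
  drop(solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y)))
}

# One model per lambda: each column holds one set of coefficients.
lambdas <- c(0, 1, 10, 100, 1000, 1e6)
path <- sapply(lambdas, function(l) ridge_coef(X, y, l))
colnames(path) <- paste0("lambda=", lambdas)
print(round(path, 3))

# For lambda = 0 the result should coincide with OLS (no intercept needed here).
ols <- unname(coef(lm(y ~ X - 1)))
```

The first column reproduces the OLS fit, and the columns shrink towards zero as lambda grows, without any coefficient becoming exactly zero.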
You remember these were predictors used to predict the balance on your credit card, and we're using the Credit data from the James, Witten, Hastie and Tibshirani textbook. If we start out here with lambda being close to zero, those should be the OLS coefficients for those predictors. We have a couple of other predictors here, but they are already very close to zero. You can see that as lambda increases, some of the coefficients are pushed towards zero quite early. For example, income starts to decrease very early, and limit starts even before that. In contrast, student remains constant for some time and then goes down at some point. On the right-hand side of this plot, you can see that at some point lambda is so high that all those regression slopes are close to zero.
The right-hand plot is just a different way of telling the story. You can see the ridge regression coefficients beta for different lambdas relative to the OLS coefficients. Here, where the two are the same, these are the OLS coefficients. Most of these predictors stay close to zero, but some, like income, limit and rating, are significant in OLS and also in the ridge regression.
The question is, first of all, what should we do and what should we be careful about? With OLS, you might know that the coefficient estimates are scale equivariant, which means that if you multiply a variable by a constant, the OLS coefficient will simply be scaled by the inverse of that constant. This is not true with ridge regression: the coefficients can change substantially when the predictors are multiplied by a constant. Thus, we should always use standardized predictors. That is, we take our predictor values x i j and we standardize them by dividing by the square root of the sample variance.
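The standardization described here can be written as the following formula. This is a sketch in which \bar{x}_j denotes the sample mean of predictor j and the variance uses the divisor n; both are notational assumptions, as the slide is not part of the transcript.

```latex
\tilde{x}_{ij} = \frac{x_{ij}}{\sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( x_{ij} - \bar{x}_j \right)^2}}
```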
And in this case, the standardized predictors in our sample each have a standard deviation of one. As you can see here, this is done for the j-th predictor: we take the standard deviation of the j-th predictor and divide each observation by it, so that the predictor then has a standard deviation of one. That's what you should always do.
Now, one could ask the obvious question: why should we care, why should we consider using ridge regression over OLS, and what are the advantages of ridge regression over ordinary least squares? Well, the answer lies in the bias-variance trade-off. Recall the mean squared error, which we've already seen. It was simply defined as one over n times the sum of the squared errors in our data. If we compute this on the training dataset, we get the training mean squared error. Later on, we'll see that we should instead take a model that has been fitted to the training data and check the mean squared error for a previously unseen test observation. Let's call it x zero and y zero. Then we get the test MSE, which makes sure that we are not checking the validity of our model on the data we've used for fitting it. Obviously, the training or the test mean squared error is a proxy, a metric of how well our model fits the data.
Now, one can show that the expected test mean squared error, that is, the squared error we expect on average for a test observation that has not been used in fitting the model, which is on the left-hand side here, can be decomposed into the variance of f-hat at x zero, plus the squared bias of f-hat at x zero, plus the variance of the error term. We cannot do much about the variance of the error term. But we can see that we have the variance of our fitted model at the test data plus the bias. So to minimize the test mean squared error on the left-hand side, we have two levers.
The first one is variance. It refers to the amount by which our estimated function f-hat would change if we estimated it using a different training dataset. Assume we have this dataset and this test observation. If we fit our model to this dataset and check the test mean squared error using this test observation, we obviously get one mean squared error. But what would happen if we had a different training dataset and the same or a different test observation? Would we still get the same error? The variance refers to the fact that if we use different data samples for training our model, we will probably get a different estimate of the response at x zero, and so a different mean squared error.
The bias, on the other hand, refers to the error that is introduced by using a model, for example a linear model, to approximate the unknown function f. Again, we want to estimate a function f, and we don't know per se whether f is linear. It could be polynomial; it could be any nonlinear function as well. So, given this training data sample, what error is introduced by the assumption, in OLS but also in ridge regression, that a linear model describes f? That's the bias. So the expected test mean squared error is due to the fact that, first of all, we might have the wrong model, which is the bias. And it could be due to the fact that, yes, we have the right model, f is approximated correctly by, for example, a linear model, but if we use different training data, we still get a slightly different approximation.
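The decomposition described verbally above can be written as follows, with (x_0, y_0) the unseen test observation, \hat{f} the fitted model and \varepsilon the error term. The symbols are assumptions chosen to match the spoken description.

```latex
\mathbb{E}\Bigl[ \bigl( y_0 - \hat{f}(x_0) \bigr)^2 \Bigr]
  = \operatorname{Var}\bigl( \hat{f}(x_0) \bigr)
  + \Bigl[ \operatorname{Bias}\bigl( \hat{f}(x_0) \bigr) \Bigr]^2
  + \operatorname{Var}(\varepsilon)
```

The last term is the irreducible part; only the first two terms depend on the choice of model.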
You can see this quite easily if, for example, we assume that we have a training dataset that only includes n equal to five observations, and then again five different observations. It's clear that we might have the correct model, but because we are using a finite sample with only five observations, there is a lot of variation in our model estimate, and that's the variance. So we have two things we need to care about, variance and bias. And one can show that, typically, if you decrease variance, you increase bias, and vice versa. You usually cannot reduce both at the same time; it's a trade-off. If you have lower variance, you will probably have higher bias.
This is shown here with simulated data. Let me get my red line here again. The simulated observations are those black circles. The true model is given by the black curve. It's a polynomial function, and that's the true model. But because we are simulating from this model, we still have some randomness in our observations, given by the black circles. Now we can use a linear model, that's the one in orange here, and we can use smoothing splines, which are very similar to, let's say, polynomials. You might know what a spline is; I don't want to go into detail here, but these are smoothing splines, polynomial splines, in blue and green. You can see the blue line is doing okay. It's pretty close to the black curve, so that's probably our best model. The green curve is close to the black circles, but it's quite far away from the black curve, as you can see here. It's close to the points here and here, but it is fitting the noise in the data, the noise introduced by simulating from the black curve. So what can we see on the right-hand side?
We can see the flexibility on the x axis and the mean squared error. The red curve is the test MSE, and the gray curve is the training MSE for those different models. Here you can see the OLS model, the linear model. The training MSE and the test MSE are quite close here and quite high. If we use the smoothing spline given by the blue curve, they are still quite close together and lower. We can reduce the training MSE further by using a much more flexible model. For example, we can make it very extreme and use a curve that goes through all these black points, but you can see it's rather wiggly, and you get my idea: you're just fitting noise. What will happen is that the training MSE goes down because we're using a more flexible model, but as we are fitting the model to noise, the error on a different dataset, that is, the test MSE, will increase. We are using a different dataset, and suddenly the test mean squared error goes up. By moving from the blue curve to the green curve, we are decreasing the bias. There's no bias in this red curve now. It's fitting the data points perfectly well, so we don't have any bias, but this significantly increases the variance, because if we now use the model on a different dataset, we suddenly have a model that no longer fits the data well. So that's the bias-variance trade-off.
Coming back to ridge regression: if we now increase lambda, we are decreasing the flexibility of the ridge regression, so the variance decreases and the bias increases. This is shown here: we can see the squared bias in black, the variance in green, and the test MSE, which, as we've seen before, can be decomposed into these components.
This now shows that the test MSE in purple can be reduced by moving from OLS to this point, which gives us an optimal tuning parameter lambda, because now we can decrease the test MSE from, let's say, maybe 48 to maybe 35. Ridge regression allows us to exploit the bias-variance trade-off and to reduce the test MSE by choosing the tuning parameter so that it minimizes the test mean squared error. Another advantage: if p is larger than n, if we have more predictors than observations, the OLS estimates do not even have a unique solution, so we might be inclined to use ridge regression in the first place. Ridge regression also has computational advantages over OLS combined with best subset selection. In ridge regression, we're using all predictors in a single model for each lambda. With OLS, variable selection would be done by best subset selection, that is, we would need to estimate two to the power of p models, and this quickly becomes infeasible.
Ridge regression has one serious disadvantage. It will include all p predictors in the final model, and almost always the coefficients will not be exactly equal to zero. This is why one can then move on to the so-called lasso. Lasso stands for least absolute shrinkage and selection operator. Ridge regression, as we've seen, will always generate a model that includes relevant and irrelevant predictors, because most of the coefficients will not be equal to zero. We can remedy this by using the lasso. The lasso coefficients minimize the following quantity. Again, we minimize the RSS, but we are using a slightly different penalty term. We are not using the squared coefficients, but the absolute values of the coefficients, with lambda again being a tuning parameter. Now, this seems to be almost equal to the ridge regression.
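The quantity that the lasso minimizes can be sketched as follows. The notation (n observations, p predictors, x_{ij} for predictor j in observation i) is an assumption matching the verbal description, since the slide is not part of the transcript.

```latex
\sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^2
  + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert,
  \qquad \lambda \ge 0
```

The only change relative to a squared-coefficient penalty is that \beta_j^2 is replaced by \lvert \beta_j \rvert.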
The difference, in short, is that the lasso uses the so-called L1 norm, the absolute values of the coefficients, whereas ridge regression uses the L2 norm to penalize coefficients that are too large. The advantage of the lasso is that using the L1 norm forces some coefficients to be exactly equal to zero. These predictors will not show up in our final model fit. Hence, the lasso performs variable selection, and the fitted models are sparse. They do not include all predictors, and this makes interpretation of the final result much easier than with ridge regression. This is what it looks like for the credit data. We've seen that with ridge regression, the parameters slowly approach zero. But here you can see that, for example, the coefficient on income already becomes zero here, and all those other predictors have coefficients that are exactly equal to zero at this point. This is the difference to ridge regression.
To explain why this happens, we can look at this plot, which gives you the contours of the RSS. This point here, the black dot, is the OLS coefficients; that's the minimum. If we don't add any constraints, we get OLS. What happens with ridge regression and the lasso is the following. This is the lasso, and this is the ridge regression. We're using the L2 and the L1 norm, and we only allow coefficients that are, for example, in this turquoise rectangle or in this sphere. To show you what happens with ridge regression: if we only allow these coefficients, then we are looking for a minimal RSS on the border of this sphere. If we now increase the sphere, we are again looking for a minimal RSS. The sphere can, for example, be too small or too large. For example, this will happen, and you can see that we now have a point where the sphere touches a contour line. That then is the optimal point, the minimal point with the penalty term.
This would be beta one and this here would be beta two. If we change our tuning parameter so that we have a larger sphere, then it probably touches a contour line that looks like this, and we get non-zero coefficients beta one and beta two. You can now also see why the lasso forces one coefficient to be zero. It's the same idea, but because we're using a rectangle, its corners lie on the axes, and it will typically touch a contour of the RSS at a point where beta one or beta two is zero. In this case, beta one is zero and beta two is not equal to zero. This is why, geometrically speaking, the lasso forces some of the coefficients to be zero.
Which performs best, ridge or lasso? Well, neither ridge regression nor the lasso will universally dominate the other. As a rule of thumb, you should use the lasso if you have a relatively small number of predictors that are expected to have non-zero coefficients. For example, if you have 1,000 predictors and you only expect 5, 10, maybe 20 of them to have non-zero coefficients, you should use the lasso, because you're not interested in hundreds of coefficients being close to zero but not equal to zero. You should use ridge regression if the response is a function of many predictors, let's say all 100 or 1,000 predictors, and all coefficients are roughly of equal size. Next, we'll have a look at the application of these models in practice.
13. 13 Elastic Net and Cross Validation Techniques: Hello, everyone. Welcome back to our class on artificial intelligence and machine learning in finance. After having seen the lasso and the ridge regression, one can ask whether one can combine the lasso and the ridge regression, the two most well-known types of penalized regression models.
Yes, you can combine them to overcome the shortcomings of both the lasso and the ridge regression we've seen in the previous video, and this combination is called elastic net regularization. The elastic net is quite simple. You take, again, the RSS, the residual sum of squares, and then you take the penalty terms from the lasso and the ridge regression. As you know, these two are the sum of the absolute coefficients, seen here with my cursor, with tuning parameter lambda one, and the sum of the squared coefficients with the second tuning parameter, lambda two. They are both added to the residual sum of squares, and then again we try to minimize the penalized sum of squares. When we have found the coefficients beta, we have our coefficient estimates for the elastic net regularization. As I've said before, the elastic net overcomes some of the shortcomings of both the lasso and the ridge regression, and later on in the application, we'll see all three: the lasso, the ridge regression and the elastic net.
Now, the question we haven't talked about is how to choose the tuning parameters for the lasso, the ridge regression and, of course, the elastic net. Recall the distinction between the test error rate and the training error rate. Recall that if we have a data sample, especially in machine learning and statistical learning, we usually divide it: it could be that we already have a training dataset and a test dataset, or we artificially split the sample into training observations and test observations. If a sufficiently large test dataset is available, then of course the training and the test error rate are easy to calculate. What are the test error rate and the training error rate? Well, you fit a model and you look at the error of the model on your training data.
If you measure the error on the data you've used for fitting the model, you have the training error rate. Usually, the training error rate will be very small. If you look at the RSS, for example, as a very simple measure of the model fit, the error on the training data is exactly what is minimized in order to find the coefficients, so this will of course be minimal. If you have fitted the model and take it to new observations, to test observations, you can again calculate some metric of the model fit, of your errors, and you get the test error rate. If you have sufficiently large test data, it is easy to calculate the training and the test error rate. Often, however, this is not the case. For example, you only have a few test observations, maybe only one, or no test observations at all.
A possible remedy in this scenario is the so-called validation set approach, which means that you randomly divide the available set of observations into two parts. You declare one part to be the training set and the second part to be the validation set, sometimes also called the holdout set. Then the model is fitted on the training set, and the fitted model is used to predict the responses for the observations in the validation set. This is the validation set approach, and it's shown here. Imagine you have n observations, numbered one, two, three, and so on until n. You divide these n observations into the training set, shown in blue, for example observations 7, 22, 13, then maybe 50, 17, and so on, and the validation set, shown in beige, which is used to calculate the test error rate. The first drawback is that the validation estimate of the test error rate can be highly variable, because which observations end up in the training set and which in the validation set is chosen at random.
Depending on which observations enter the blue and the beige part, the test error rate can differ from one random assignment to the next. The next problem is that we initially had n observations, and now only a subset of the available data is used to fit the model. So we are wasting a lot of our data by holding it out and only using, let's say, half of our observations for training the model. This is a serious drawback, especially if we have limited data. So what can we do? We can do cross validation to calculate the test error rate. We have our n observations, and in the first round, we use the first observation, which is shown in beige, as our test observation, and we fit the model on the n minus one remaining observations, observations two, three, four, and so on until n. So we're using almost all the observations for training the model, and we estimate the error based on test observation number one. For example, it could be the squared error. Then we do this another time. We fit a second model, and for this second model, we use observations one, three, four, five, six, seven, et cetera until n for fitting, and we check the out-of-sample error, the test error, based on test observation number two. We get a different squared error, and so on. We do this n times, until we are using the last observation as the test observation. This gives us n squared errors, n metrics for the test error, and we then simply average all these errors to get an estimate of the test error rate. Because we always leave one observation out, this is called leave-one-out cross validation. It's quite simple, and it's also quite simple to extend this to the so-called K-fold cross validation.
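The procedure just described can be sketched in a few lines of base R. This is a minimal illustration on simulated data with a plain linear model; the function name `cv_mse` and the data are hypothetical and not taken from the class materials. The same function covers leave-one-out cross validation (k equal to the number of observations) and the K-fold variant.

```r
set.seed(7)
n <- 60
dat <- data.frame(x = runif(n, -2, 2))
dat$y <- 1 + 2 * dat$x + rnorm(n, sd = 0.5)   # simulated data, for illustration only

# Hypothetical helper: estimate the test MSE of lm(y ~ x) by k-fold cross validation.
cv_mse <- function(data, k) {
  # Randomly assign each observation to one of k folds of (nearly) equal size.
  fold <- sample(rep(seq_len(k), length.out = nrow(data)))
  fold_mse <- sapply(seq_len(k), function(i) {
    fit  <- lm(y ~ x, data = data[fold != i, ])         # train without fold i
    pred <- predict(fit, newdata = data[fold == i, ])   # predict the held-out fold
    mean((data$y[fold == i] - pred)^2)
  })
  mean(fold_mse)   # average the k error estimates
}

loocv <- cv_mse(dat, k = nrow(dat))   # leave-one-out: every fold is one observation
cv5   <- cv_mse(dat, k = 5)           # five-fold cross validation
print(c(loocv = loocv, cv5 = cv5))
```

In a tuning setting, one would typically repeat such a loop for each candidate value of the tuning parameter and keep the value with the smallest cross-validated error.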
We again start with n observations in our data sample. In K-fold cross validation, for example five-fold cross validation, we subdivide our data sample into five bins. As you can see, this is the first bin, the first set. This is used as the test data, and the model is fitted on the remaining observations. Then we do the same with the next bin, each of size n divided by five in this case. In the end, we have K resulting mean squared errors, and we can average these to get an estimate of the test error rate. This is K-fold cross validation. In the next video, I want to show you the application of the lasso, the ridge regression and the elastic net, and also how you can choose the tuning parameters via cross validation.
14. 14 Partial Least Squares in Practice: Hello, everyone, and welcome back to our class on artificial intelligence and machine learning in finance. In this video, I want to show you the application of penalized linear regression models, that is, ridge regression, the lasso and the elastic net, and we'll do it in R. More precisely, we'll be using the risk analytics R package. If you still need to download it, you can access the package under this link, and we will also need a list of ticker symbols and some additional data for the stock market returns, which you can download here. We will use numerous features, that is, numerous explanatory variables or regressors, and we'll try to forecast equity returns. More precisely, we'll try to forecast the return on the Apple stock one day ahead. This is a very simple task in finance, forecasting equity returns, and ideally we'll identify those features in our data that can help explain the response variable, which in this case is just the one-day-ahead Apple stock return.
This is our task, and the application is quite simple. But as you can see from the next couple of slides, it will take a while to implement all these functions. If you remember one of the first videos, you will recognize on this slide that we're not starting with the regression models themselves. We start with importing and preprocessing the data, because this is one of the more important steps, and one that you will not find in many textbooks. Everyone always starts with the models, but you need to work with the data beforehand to bring it into a form that can be processed by the machine learning models. First of all, we have to start with an even more trivial task: we need to install the risk analytics package. It cannot be installed directly from the CRAN mirror. You have to download it from GitHub, so you will probably need these two lines here: load the devtools library, and then install the risk analytics package from GitHub. Then you need to load it into memory in R with library risk analytics. We also need the snow and the quantmod packages.
Next, we need to set the file path for the downloaded CSV file that includes the NASDAQ ticker symbols, because we want to rename some of our variables, which do not have very practical names in R, to make them easier to interpret. So we download the NASDAQ ticker CSV file and set the path NASDAQ to the file path where the NASDAQ CSV file is saved. Then we read the NASDAQ ticker symbols into company list. We order the firms in descending order by market capitalization, which is included in this data array. What we'll be doing next is choosing companies: we'll only be working with those companies that have a very high market cap. So that's why we download the data.
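The read-and-sort step described here can be sketched in base R. The block below is a self-contained illustration, not the code from the slides: the CSV content, the ticker symbols and the column names `Symbol`, `Name` and `MarketCap` are made-up assumptions, and the file is read from an inline string instead of the downloaded NASDAQ file.

```r
# Stand-in for the downloaded ticker file; in the video the CSV is read from disk.
# Symbols, names, values and column names are illustrative assumptions.
csv_text <- "Symbol,Name,MarketCap
AAA,Alpha Corp,120
BBB,Beta Inc,950
CCC,Gamma Ltd,430
DDD,Delta Plc,15"

company_list <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Sort in descending order by market capitalization.
company_list <- company_list[order(company_list$MarketCap, decreasing = TRUE), ]

# Keep only the largest companies (the video keeps ten; this toy file has four rows).
n_keep <- 2
top_companies <- head(company_list, n_keep)
print(top_companies$Symbol)
```

With a real file, `read.csv` would be called with the saved file path instead of the `text` argument.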
We read the CSV file into company list and then sort the company names in descending order by market capitalization. That's what happens in lines 12 and 30. Next, we need to download the return data for the ten companies that have the highest market capitalization. Stock data will later contain the stock returns. We define it as a data frame, and then we need to get the Yahoo data for those companies that are included in the object company list, with my cursor here. We only use ten companies. You can easily extend this application to more companies. You will see later why this probably doesn't make too much sense, but nevertheless, we'll be using ten companies. We're using data from 2010 to 2020, and we are using the stock returns. We'll also use macroeconomic variables. We get those via get macro data, and we save them into the object macro data. Next, we manually add the corresponding date to each row. This is a rather complicated command, and you can see that the result is saved into the object dates. Those are the row names, as.matrix, getSymbols, et cetera, for several of our variables. Last but not least, we assign those dates as row names for the macro data. Okay. Now, the variables that are contained in the object macro data have names that are not too practical. For example, ^VIX, or ^GSPC, which are the S&P 500 returns, and another one that holds the changes in the three-month Treasury bill rate, so we'll rename some of these variables. For example, instead of ^VIX, we'll simply call it VIX. Instead of ^GSPC, we'll call it SP 500. This is more intuitive than the previous names, and this is what we do in line nine. The column names of the object macro data are overwritten with VIX, SP 500, Real Estate, TR 3M, yield and credit. By the way, these are the macroeconomic variables that we'll be using: first, the implied volatility index.
This is the volatility index on the S&P 500. Then the S&P 500 itself, the iShares Dow Jones US Real Estate return index, then the slope of the yield curve, corresponding to the spread between the ten-year Treasury rate and the three-month T-bill rate, and last but not least the changes in the credit spread between BAA-rated bonds and the Treasury rate. So this is a credit spread. Okay. Now, we also need to preprocess the data in another sense. We'll add a date column to both macro data and stock data and then merge them via the merge command in line five to obtain a new data object, complete data. This includes the macroeconomic and the stock market data, all in one data frame. Next, we'll redefine complete data as a table. I told you in a previous video that this is more convenient in many machine learning applications, in R at least. If we now print the first five rows of complete data, you can see the first five observations. You see the dates, and then as columns you see AAPL, which is the ticker symbol for the Apple stock, MSFT for Microsoft, Amazon, Google, and so on, and we have 2,761 more rows. All in all, that's about 2,800 daily observations for the stocks and the macroeconomic data. You can see we have additional variables, for example JPM, which is JP Morgan, and JNJ, which is Johnson and Johnson. We'll actually see Johnson and Johnson quite prominently in our results later on. Nowadays everyone knows about this company. And you see VIX, SP 500, real estate and so on; these are the macroeconomic variables. We again extract the macroeconomic variables from complete data. Because the dates in the original macro data object do not coincide with the dates in stock data, we extract them again from complete data to have corresponding dates in all our objects, and we now print the correlation matrix. We see which variables are correlated, and this gives us a first idea of what the data looks like.
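As a rough illustration of what the correlation matrix contains, here is a minimal sketch in Python rather than the R code shown on the slide. The function names and the toy numbers are made up for illustration and are not the course data.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Pairwise correlations for a dict that maps column name to values."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical daily changes, invented purely for this example.
toy = {
    "SP500": [0.010, -0.020, 0.015, 0.000, -0.010],
    "VIX": [-0.050, 0.100, -0.060, 0.010, 0.040],
}
corr = correlation_matrix(toy)
for name, row in corr.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

Each entry lies between minus one and one, and the diagonal is one because every variable is perfectly correlated with itself.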
You can also do this for the stock data. You will, of course, see different correlations, and it would also make sense to print summary statistics: the mean, the standard deviation or volatility of the data, the quantiles, the minimum and the maximum. These are useful to see if anything is off, that is, whether anything might be an outlier that needs to be dealt with. We could also have a look at the correlation between the macroeconomic and the stock market data. Here, I've only shown you the correlation matrix for the macroeconomic variables. You can also visualize this via the library corrplot, where the command is also called corrplot. Here you can see, for example, that there is a strong positive correlation between real estate and the S&P 500, and a slightly negative correlation between the VIX and the S&P 500 and between the VIX and real estate. This is just another, visual way of showing the correlations. Okay. Now, as you remember, in statistical learning, in machine learning, we need a training dataset and a test dataset to see the out-of-sample performance of our models. To do this, we first load the library lubridate and then split our data into training and test data. This looks quite complicated, but actually it's just using the complete data data frame and applying some filters. For example, the year is smaller than or equal to, with my cursor here, 2018. So that's the training data, and 2019 and 2020 are the test observations. Data through 2018 is training data, the last two years are test data. Because we later want to predict the one-day-ahead returns on the Apple stock, we add another column. That is test data dollar forecast, which is set to the Apple returns in the test data, and the same for the training data, and we shift it one day ahead.
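The one-day-ahead shift can be sketched as follows. This is a minimal Python illustration, not the R code from the slide; the function name `add_one_day_ahead_target` and the three toy rows are hypothetical.

```python
def add_one_day_ahead_target(rows, column):
    """Pair each day's values with the next day's value of `column`.

    `rows` is a list of dicts ordered by date. The last row has no
    following day, so it is dropped.
    """
    shifted = []
    for today, tomorrow in zip(rows, rows[1:]):
        record = dict(today)
        record["forecast"] = tomorrow[column]
        shifted.append(record)
    return shifted

# Invented returns, for illustration only.
rows = [
    {"date": "2018-01-02", "AAPL": 0.010},
    {"date": "2018-01-03", "AAPL": -0.004},
    {"date": "2018-01-04", "AAPL": 0.007},
]
with_target = add_one_day_ahead_target(rows, "AAPL")
for record in with_target:
    print(record)
```

Every row now carries today's predictors together with tomorrow's Apple return in the `forecast` column, which is the response the models are asked to predict.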
We're using the Apple returns and shifting them one day ahead. For example, if this were 1 January, then for the observation of all our variables on 1 January, you would have the observation of the Apple stock on 2 January as an additional column, the response variable that we want to forecast. That's test data dollar forecast and training data dollar forecast. Keep in mind that this is just the Apple stock return one day ahead. It could also be a quite different response variable that we want to explain with our data. Then, because some of the models need the dependent and the explanatory variables, the response and the predictors, in separate matrices, we do exactly this. We take training data and test data and split them into X train and Y train and X test and Y test, here in lines three, four, six and seven. This defines four distinct matrices that we can use as input objects for our models later. Last but not least, some of our models do this anyway, but we want to use our own functions for the R squared and the RMSE. This is the R squared. As you can see, it takes as inputs the predicted and the actual values. Then you calculate the RSS and the TSS, and the R squared is simply one minus RSS divided by TSS. That's what is returned in line five. The root mean squared error is just the square root of the mean of the squared differences between the predicted and the actual values, in line nine. So we'll start with OLS, with a simple linear regression model. We forecast the future Apple returns with past returns of the stock itself and the other variables. The object that we are defining is the model: lm, linear model, that's the R code for a linear model. Forecast, that's the column of the Apple returns one day ahead, is modeled as a function of what you can see here.
If we don't specify any variables and only put in a point, a full stop, this means that all variables that are included in training data should be used by the program as predictors. What happens, as you can see here, is that we are using a linear regression model that includes all predictors. We have the intercept, we have the Apple stock returns, Microsoft, Amazon, Google, Google again, JP Morgan, and Johnson and Johnson, and this is where Johnson and Johnson becomes interesting. As you will see on the next slide, the return on the Johnson and Johnson stock is the only predictor that is statistically significant, at least at the 10% level. All the other variables are not statistically significant. It goes on with the five macroeconomic variables, and the multiple R squared is quite low. The adjusted R squared is even lower. It's almost 0.1%, and as we'll see later on, it doesn't get much better using ridge regression, the lasso or the elastic net, only slightly. What we can already see here is that none of the predictors really has much explanatory power for forecasting future Apple returns. Remember that we haven't put any theory into this. We are only using all available data to forecast future stock returns. This is typical of machine learning. In an economic model, we would have argued that there are some variables that should have an impact on future stock returns. We probably would have used a time series econometric model. This is simply statistical learning. We're using all the data we have, we're using all the predictors we have, and it turns out that for some reason the Johnson and Johnson stock return has a slight statistical significance in explaining future Apple stock returns. But as we can see from the R squared, it's not too helpful. Okay. We can then predict from this linear model, using test data as the new data, and compute the R squared and the root mean squared error.
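The two evaluation functions described earlier, R squared as one minus RSS divided by TSS and the root mean squared error, can be sketched like this. It is a minimal Python version, not the R functions from the slide, and the toy forecasts are invented to show that the R squared can become negative on new data.

```python
import math

def r_squared(predicted, actual):
    """R squared = 1 - RSS / TSS."""
    mean_actual = sum(actual) / len(actual)
    rss = sum((a - p) ** 2 for p, a in zip(predicted, actual))
    tss = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - rss / tss

def rmse(predicted, actual):
    """Square root of the mean squared difference."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))

# Invented returns, for illustration only.
actual = [0.01, -0.02, 0.00, 0.03]
mean_forecast = [sum(actual) / len(actual)] * len(actual)  # always predicts the mean
bad_forecast = [-0.01, 0.02, 0.00, -0.03]                  # gets every sign wrong

print(r_squared(mean_forecast, actual))  # no better than the mean, so R squared is zero
print(r_squared(bad_forecast, actual))   # worse than the mean, so R squared is negative
print(rmse(bad_forecast, actual))
```

A negative R squared on test data simply means the forecasts do worse than predicting the average of the actual values.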
On the test data, well, as you can see here, the R squared becomes negative and the RMSE is 2.4%. This is rather dismal. As we can see, the model doesn't generalize well to new data. As the in-sample fit of the linear model was already not too good, it's not surprising that it doesn't perform better on new data, and the R squared actually becomes negative. Now, to use ridge or lasso regression, remember that we have one tuning parameter in both ridge and lasso and two tuning parameters in the elastic net. We need to choose the tuning parameters such that the model generalizes well to new data, and this is done via cross-validation. Remember what cross-validation means. In leave-one-out cross-validation, for example, you leave one observation out, fit the model on the remaining observations, and continue to do this until you've run through all the observations. In K-fold cross-validation, you split the training data into K parts, so-called folds, and usually choose five or ten folds. K-fold cross-validation then trains on all but the k-th part and validates the model performance on the k-th part, iterating over all K folds and estimating, for example, the root mean squared error. Then you choose the hyperparameter that performs best. For each model, you can try different values for the hyperparameter, which in this case is the tuning parameter, and then you choose the value that yields the lowest MSE. This is how it's done for ridge regression here in R. First of all, we're using the glmnet library. We use the set.seed command for reproducibility, so that if you do this again in R, you get the same results. There is some randomness in this process, of course, but if you use the set.seed command, you can always reproduce your initial results. Then we find the optimal lambda parameter by cross-validation. We are using ten-fold cross-validation.
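The K-fold procedure just described can be sketched in a few lines. This is a minimal Python illustration under simplifying assumptions, not what glmnet does internally: the folds are contiguous blocks instead of random assignments, and the "model" is a hypothetical one that always predicts the training mean.

```python
def k_fold_indices(n, k):
    """Split the indices 0..n-1 into k contiguous folds of near-equal size."""
    base, extra = divmod(n, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validated_mse(xs, ys, k, fit, predict):
    """Train on all but one fold, validate on that fold, average the k errors."""
    errors = []
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        model = fit(train_x, train_y)
        mse = sum((ys[i] - predict(model, xs[i])) ** 2 for i in fold) / len(fold)
        errors.append(mse)
    return sum(errors) / k

# Toy data and a toy model that ignores x and predicts the training mean.
xs = list(range(10))
ys = [2.0 * x for x in xs]
fit_mean = lambda train_x, train_y: sum(train_y) / len(train_y)
predict_mean = lambda model, x: model

cv_error = cross_validated_mse(xs, ys, 5, fit_mean, predict_mean)
print(k_fold_indices(10, 5))
print(cv_error)
```

To tune a hyperparameter, one would compute this cross-validated error once per candidate value and keep the value with the lowest error.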
Each time we use 100 regularization parameters, tuning parameters that are tested in each of these folds. If we set alpha, the weighting to be used, to zero, this gives us ridge regression. Later on, if we set it to one, it will give us the lasso. So this is the command cv.glmnet, and for this cross-validation of the ridge regression we use the X train matrix and the Y train matrix, the training data, to select the parameter via cross-validation. After some time, you can see here in the lower part of this slide that we get an optimal lambda tuning parameter of 0.4556. We can then fit the ridge regression, which is very simple: glmnet with alpha equal to zero, so that we get ridge regression. Lambda is the tuning parameter. We use the optimal lambda selected by cross-validation here, and as data we use X train and Y train. Here you can see the coefficients of the ridge regression. If you remember the video on ridge regression, ridge regression usually includes non-zero parameters for all predictors. So it's not surprising that all those parameters are included in our model, and Apple, Microsoft, Amazon and so on, all these predictors are in our regression. You will also remember that if we set the parameter lambda to zero, we get OLS. If we let it go to infinity, all the parameters are forced to become zero. If we do this now with a high lambda and a low lambda, you will see what happens. If we set a high lambda, we get this, and if we do it with a low lambda, we arrive at the OLS coefficients. Now, to see how ridge regression fares in comparison to OLS, let's have a look at the in-sample R squared. If we compute it on the training data, you can see that the R squared is 0.0003, which is much lower than in the OLS model, where it was 0.007. Well, it's lower at a very low level, of course.
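The effect of lambda described above, with zero giving OLS and very large values forcing the coefficients toward zero, can be seen in a one-predictor sketch. This is a minimal Python illustration, not the glmnet call from the slide. It assumes centred data, no intercept, and the objective sum((y - b*x)^2) + lambda * b^2, which is scaled differently from glmnet's objective, so lambda values are not comparable.

```python
def ridge_slope(x, y, lam):
    """Slope that minimises sum((y - b*x)^2) + lam * b^2 (no intercept)."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + lam)

# Invented, centred toy data.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-1.9, -1.1, 0.1, 0.9, 2.0]

ols = ridge_slope(x, y, 0.0)          # lambda = 0 reproduces the OLS slope
shrunk = ridge_slope(x, y, 10.0)      # moderate lambda shrinks the slope
nearly_zero = ridge_slope(x, y, 1e9)  # huge lambda pushes it toward zero

print(ols, shrunk, nearly_zero)
```

Note that the slope shrinks but never becomes exactly zero for any finite lambda, which is why ridge regression keeps all predictors in the model.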
For the out-of-sample performance, we again predict, now using ridge regression, and look at the R squared. It's minus 0.005 in comparison to minus 0.0138, so it's slightly better. The out-of-sample performance is slightly better when looking at the R squared, and it's also better when looking at the RMSE. This is what we find when we use ten-fold cross-validation to determine the hyperparameter lambda. Why is this, by the way, called a hyperparameter and not a parameter? Well, it's a parameter that governs the training process. If it is a quantity that is estimated in the training process, such as a regression coefficient, it's a parameter. But if it changes the way the model learns from the data, it is called a hyperparameter. Lambda is a hyperparameter: it shifts the model from OLS to ridge regression, for example. Using this lambda, we obtain a linear model with coefficients that are smaller in absolute value than the OLS coefficients, and the OLS model achieved the highest in-sample R squared. However, the out-of-sample performance was better for the ridge regression, slightly and at a very low level, but nevertheless better. We can now do the same for the lasso. Again we use the glmnet library. We use the set.seed command to be able to reproduce the results later on. Then cv lasso is, first of all, the cross-validation selection of the hyperparameter. We set alpha to one, so it's not ridge, it's now the lasso, and we get the optimal lambda in line ten. We then fit the lasso regression model to the training data using alpha equal to one, so we're using the lasso, and we are using the optimal lambda. As you can see, several of our predictors are now left out of the model. This was the main difference between lasso and ridge. The lasso can exclude variables, whereas ridge regression will usually include all predictors. Here, not surprisingly, what is left in the model is the Apple stock return itself.
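Why the lasso can drop a predictor entirely, while ridge only shrinks it, can be seen in the one-predictor case, where the lasso solution is a soft-thresholded version of the OLS solution. The following is a minimal Python sketch under stated assumptions, not the glmnet fit from the slide: centred data, no intercept, and the objective 0.5 * sum((y - b*x)^2) + lambda * |b|, which is scaled differently from glmnet's objective.

```python
def soft_threshold(z, lam):
    """Move z toward zero by lam and set it to zero once it is inside [-lam, lam]."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def lasso_slope(x, y, lam):
    """Slope that minimises 0.5 * sum((y - b*x)^2) + lam * |b| (no intercept)."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return soft_threshold(sxy, lam) / sxx

# Invented, centred toy data.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-1.9, -1.1, 0.1, 0.9, 2.0]

for lam in (0.0, 2.0, 20.0):
    print(lam, lasso_slope(x, y, lam))
```

Once lambda exceeds the absolute value of sum(x*y), the slope is exactly zero, so the predictor is excluded from the model.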
This is indicative of the fact that yesterday's Apple stock return will probably predict today's return on the Apple stock. This makes sense in an economic way, but it could have turned out any other way, of course. What we can see is that the in-sample R squared is zero, and the out-of-sample performance measured by the R squared and the RMSE is again slightly better than in the linear model. For example, you can see that the RMSE is 0.023 in comparison to 0.024 in the OLS model. So this is the lasso. Last but not least, we can do the same with the elastic net. Now we use the caret library. Again, we set the seed for reproducibility. Then we need to select the two parameters, which is again done via cross-validation. In the end, in lines eight, nine and ten, you can see that we are training the elastic net using forecast, all variables, the training data, and the training control object for the parameters. You see that the optimal parameters, alpha and lambda, are 0.81 and 5.48. Then again, based on the optimal parameters from the previous slide, we can predict on the training data. The R squared remains at zero, and the out-of-sample performance is again slightly better than OLS. But if you compare, for example, the RMSE for the elastic net with the one for the lasso, both 2395, it doesn't get better. It's actually the same for the lasso and the elastic net. This is an application of those penalized linear regression models. In the next section, we'll talk about classification, K-nearest neighbors and support vector machines.
15. 15 Bayesian and K Nearest Neighbors Classification: Hello everyone. My name is Clego Weiss. I'm a professor of finance here at Leipzig University, and welcome to this video on artificial intelligence and machine learning in finance.
In this video, we'll have a look at K-nearest neighbors and the Bayes classifier in order to classify observations based on training data and to predict into which class new test observations will fall. We'll later look at support vector machines, but in this video we start out with the K-nearest neighbors classifier. If you recall the classification problem, we are always led to the question: how can we assess the accuracy of a given classifier? That is, is it able to classify a new observation correctly, or does it put it into the wrong class? This is obviously in contrast to regression methods, where we've already seen measures of model fit, and it's clear that for classification problems this looks a little different. So we'll start with the training error rate first. This is used for qualitative responses, for example default versus non-default, or contract termination versus no termination. By the training error rate we mean the proportion of mistakes that the algorithm makes if we apply our estimate f hat to the training observations. It is defined as the average of an indicator function. This is the I here in equation 21. It is an indicator function that compares the observation y i of our qualitative response with the prediction, and the indicator function is one if y i is not equal to our predicted value. Y i hat is the predicted class label for the i-th observation, and we simply take the average across all these indicator functions. Basically, again, it's the proportion of mistakes that we make. This is the training error rate, and it is computed on the training data. In time series terminology, we would say in sample. It's clear that we can also look at the out-of-sample predictive performance, and in statistical learning this is called the test error rate.
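Written out, the training error rate described in words above is the following, with n training observations, y_i the observed class and \hat{y}_i the predicted class label. The notation is a reconstruction from the spoken description, not a copy of the slide.

```latex
\text{training error rate} \;=\; \frac{1}{n} \sum_{i=1}^{n} I\left(y_i \neq \hat{y}_i\right),
\qquad
I\left(y_i \neq \hat{y}_i\right) =
\begin{cases}
1 & \text{if } y_i \neq \hat{y}_i, \\
0 & \text{if } y_i = \hat{y}_i.
\end{cases}
```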
The test error rate looks at the accuracy of a classifier when the fitted classifier is applied to new data. The test error rate is again the proportion of mistakes that we make, now when we apply our estimate to the test observations (x zero, y zero). Again, it is defined as the average of a comparison of the responses y zero with the predictions, that is, whether we predict them correctly or not. It's the percentage of incorrectly classified test observations. Naturally, we are looking for classifiers that reduce both the training and the test error rate, and in many cases we will see that the algorithms are not able to generalize well to new data, so you have a good in-sample fit, a low training error rate, and a rather high test error rate. So what would we use as a classifier? Well, the most basic classifier is the so-called Bayes classifier. It's the classifier that assigns each observation to the most likely class given its predictor values. So we assign a test observation with predictor vector x zero to the j-th class for which the conditional probability that Y is in the j-th class, given that we've seen the predictor values x zero, is largest. In this case, the test error rate is minimized on average by the Bayes classifier. But this is only of use in theory, because for the Bayes classifier to work, the conditional distribution of Y given X, of the response given the predictor values, needs to be known, and in practice it is never known. If we knew the conditional distribution of Y given X, we could simply calculate these conditional probabilities, and then we would know which predictors predict our outcome. But this is not the case, so we cannot actually use the Bayes classifier in practice.
What we can do instead is use the K-nearest neighbors classifier, which is closely related to the Bayes classifier. We start with a positive integer K, for example five nearest neighbors or three nearest neighbors. We have an integer K and one test observation x zero. Starting from this test observation x zero, we identify the K points in the training data that are closest to x zero, and we call this set N zero. Then we estimate the conditional probability. Note that this is not known ex ante, but we can estimate the conditional probability for class J as the fraction of points in the set N zero whose response values equal J. The probability is estimated simply by taking the average of the indicator functions over the observations in the vicinity of x zero, that is, by counting how many of them belong to class J. This is why it's called K-nearest neighbors. You take the observation x zero and look at the K closest neighboring points. For example, if we have a five-nearest-neighbors classifier and we observe that, of the five nearest points to the test observation, one belongs to the class and four do not, then it's one out of five, and this is our estimated conditional probability for class J. We then apply the Bayes rule and classify the test observation x zero to the class with the largest probability, because we now have estimates of these probabilities. This here is what comes out of the Bayes decision boundary. If we know the conditional distribution of Y given X, then it's the best classifier we can achieve, because we know the conditional probabilities. You can see two classes in yellow and blue, and the purple line is the Bayes decision boundary. You can see that, for example, if we have a new observation that falls in here, we would classify it as the blue class.
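The K-nearest neighbors procedure described above can be sketched as follows. This is a minimal Python illustration using Euclidean distance; the function names and the six toy points are invented for the example and are not taken from the slides.

```python
import math
from collections import Counter

def knn_class_probabilities(train_x, train_y, x0, k):
    """Estimate P(class | x0) as the share of each class among the k nearest points."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], x0))
    neighbours = order[:k]  # indices of the set N0
    counts = Counter(train_y[i] for i in neighbours)
    return {label: count / k for label, count in counts.items()}

def knn_predict(train_x, train_y, x0, k):
    """Assign x0 to the class with the largest estimated probability."""
    probabilities = knn_class_probabilities(train_x, train_y, x0, k)
    return max(probabilities, key=probabilities.get)

# Invented training points in two dimensions.
train_x = [(0, 0), (0, 1), (1, 0), (3, 3), (3, 4), (4, 3)]
train_y = ["blue", "blue", "blue", "yellow", "yellow", "yellow"]

print(knn_class_probabilities(train_x, train_y, (1, 1), 5))
print(knn_predict(train_x, train_y, (1, 1), 3))
print(knn_predict(train_x, train_y, (3.5, 3.5), 3))
```

With K equal to five, three of the five nearest points to (1, 1) are blue, so the estimated probability for blue is three out of five.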
If it falls in here, then it's classified as the yellow class. This could be default versus non-default, contract termination versus no termination, et cetera. With K-nearest neighbors, it works like this. We have this point, for example, and, if K is equal to three, we take the three nearest neighboring observations. One blue, two blue, one yellow. So this area is considered to be in the blue part. This here is, for example, the blue class. We now do it again here: it would probably be one, two, and I would say this is the third point, so this is also blue. This is how you construct the decision boundaries, and what you get is obviously a non-linear classifier with non-linear classification boundaries. This is three nearest neighbors. You can also use five nearest neighbors or ten nearest neighbors, although with this simulated data, with so few observations, that obviously doesn't make any sense. So this is K-nearest neighbors. How does it compare to the Bayes decision boundary? Well, you can see here that we've used ten nearest neighbors, and you can still see the Bayes decision boundary in purple. The K-nearest neighbors boundary is a little more wiggly, as you can see here. It's close to the Bayes decision boundary, but it varies a little more; it has more variance and less bias. Here is the same picture, but now for K equal to one and K equal to 100. With K equal to one, it gets even more wiggly, and with K equal to 100, it's very coarse. It's a very coarse, almost linear decision boundary. Again, you can see the bias-variance trade-off at play here, and K is obviously a hyperparameter that needs to be chosen before applying the algorithm, usually via cross-validation, to arrive at an optimal algorithm.
16. 16 Maximum Margin Classification: Hi, everyone, and welcome back to our class on artificial intelligence and machine learning in finance.
In this video, we want to have a look at support vector machines as another method for classification. As you remember from the last video, and we'll see this in even more detail later in the applications, classification is often used in finance, for example in credit risk management, where we want to classify good and bad loans, defaulting and non-defaulting customers. In insurance economics and insurance management, it could be that we want to identify those customers who are most prone to terminate their contract and switch to another insurance company. Support vector machines are another way of doing classification. As such, the method is quite similar to K-nearest neighbors, but more sophisticated. When we speak about support vector machines, this is usually the summary term for three distinct methods. We start out with the so-called maximal margin classifier. We'll then move on to the support vector classifier. If we extend the support vector classifier, we get the support vector machine, and we'll come to that probably in the next video. We have to start with the basic definition of a so-called hyperplane. What is a hyperplane? A hyperplane in the P-dimensional space is a flat affine subspace of dimension P minus one. For example, in two dimensions, in the plane, a hyperplane is a line. For example, this is a hyperplane. This is a hyperplane, and even this is a hyperplane. It is a straight line in the two-dimensional space. In three-dimensional space, it's a plane. You can see how this goes if this is a three-dimensional space. Well, I probably need to sketch this right now. It could, for example, be this hyperplane. It cuts the z axis here. Let me sketch this a little bit for you.
This here would, for example, be a plane in the three-dimensional space with, let's say, z being equal to three, where x, y and z are the three coordinates. This is the hyperplane, this blue plane that cuts through the three-dimensional space at z equal to three. As you can see, a hyperplane divides the P-dimensional space into two halves, and it's defined by this equation. You have the linear combination of the coordinates, beta zero plus beta one times X one, plus and so on, until plus beta P times X P, and this needs to be equal to zero so that you get a hyperplane. A hyperplane is quite simple. You can now see why we're using hyperplanes here at the very start: in the two-dimensional case, in the plane, such a line cuts the plane into two halves. For example, suppose we have some observations here, here, here, here and here, and maybe here, and we have certain features associated with these points. Then this red line, for example, could be a decision boundary. This means that all the observations that fall into the blue space on top here are classified as blue points, and those that fall below this line are classified into the red class. So the hyperplane is used for classification. How should we do this? Well, as input data, we have an N times P data matrix, that is, N observations in the P-dimensional space. Each observation belongs to one class. That is, we have the qualitative response variables Y one to Y N, and the possible values of the response are minus one and one. Minus one represents one class and one the other. We only have two classes right now. We can extend all these models to the case with more classes, for example default, A rating, AA rating and so on. It works well with ratings, but we start out with just two classes. Minus one is one class, one is the other one.
We have a test observation with a P-vector of observed features. This is X star, with entries X one star up to X P star. As output, we get a classification of X star using a separating hyperplane. We use a hyperplane that cuts the P-dimensional space into two halves, and then we can decide to which class the observation belongs. This classifier based on a separating hyperplane is quite simple. If we have, for example, these blue and these red observations, you can see that all these lines separate the blue from the red points. We can then say, for example, that if we use this line here as the separating hyperplane, then everything on top is class blue and everything below is class red, and we can use the hyperplane for classification. It's very, very simple. Now, one problem is that a separating hyperplane doesn't necessarily need to exist. It could be that there isn't one. A quite simple scenario is one where we mix all the points, the blue and the red ones. If these are red points, and I add some blue points here, you can easily see without proof that it gets quite difficult to find a separating hyperplane. Hyperplanes are linear, that's the definition of a hyperplane. You can try to insert a separating hyperplane, but it will not work. So there isn't one in this scenario. The other question is this: if one separating hyperplane exists, then we have an infinite number of such hyperplanes. You can see that I can add an infinite number of hyperplanes here, as long as each hyperplane still separates all these points. So the question is, which one should we use? This is where we get to the maximal margin classifier. The natural choice is the maximal margin hyperplane, which is the separating hyperplane that is farthest from the training observations.
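Classifying with a hyperplane amounts to checking on which side of it a point falls, that is, the sign of beta zero plus beta one times X one and so on. Here is a minimal Python sketch; the function names, the six toy points and the candidate hyperplanes are invented for illustration.

```python
def hyperplane_value(beta0, beta, x):
    """Evaluate beta0 + beta1*x1 + ... + betap*xp."""
    return beta0 + sum(b * xj for b, xj in zip(beta, x))

def classify(beta0, beta, x):
    """Return class 1 on the positive side of the hyperplane, class -1 otherwise."""
    return 1 if hyperplane_value(beta0, beta, x) > 0 else -1

def separates(beta0, beta, xs, ys):
    """A hyperplane separates the data if y * f(x) > 0 for every observation."""
    return all(y * hyperplane_value(beta0, beta, x) > 0 for x, y in zip(xs, ys))

# Invented two-dimensional data: class 1 lies top-left, class -1 bottom-right.
xs = [(1, 3), (2, 4), (1, 4), (3, 1), (4, 2), (4, 1)]
ys = [1, 1, 1, -1, -1, -1]

line_a = (0.0, (-1.0, 1.0))  # the line x2 = x1
line_b = (0.5, (-1.0, 1.0))  # a parallel line shifted toward class -1
line_c = (0.0, (1.0, 0.0))   # the vertical line x1 = 0

print(separates(*line_a, xs, ys), separates(*line_b, xs, ys), separates(*line_c, xs, ys))
print(classify(*line_a, (0, 5)), classify(*line_a, (5, 0)))
```

Both `line_a` and `line_b` separate the toy data, which illustrates why an additional criterion is needed to pick one separating hyperplane among many.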
What we need to do is first compute the perpendicular distance from each training observation to a given separating hyperplane. I can show this if I delete some of my drawings here. If we use this one, for example, we calculate the perpendicular distances, let me just see, this is almost perpendicular, to the hyperplane from each side. You can see this here. Then we try to find the separating hyperplane such that the smallest of these distances, which is called the margin, is maximal. This one is probably not the maximal margin classifier, because you can see that I've now shifted the line, the hyperplane, to the left, and this distance here has become larger while this one here has become smaller. What we're trying to do is fit the hyperplane somewhere in between, such that the nearest blue points and the nearest red points are equidistant from it. This is how to construct the maximal margin classifier. So we compute those distances. The smallest such distance is known as the margin, and then we maximize the margin. It's the separating hyperplane for which the margin is largest. So this is what we get. As you can see here, the margin is now maximal, and it's the same when considering the blue points and the red points. Now, the interesting fact is that, as you can see here, adding a red point here or adding a blue point here on the left doesn't change the maximal margin hyperplane and the classifier. Why is that? Because the maximal margin classifier only depends on this point, this point and this point, on these three points. This is why they are called support vectors: they support the hyperplane. In P-dimensional space, these points are vectors, and they are the only points, the only vectors, that determine the separating hyperplane.
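The margin of a given separating hyperplane, the smallest perpendicular distance from the training observations, can be computed directly. The sketch below is a minimal Python illustration with invented toy data. It only compares two hand-picked hyperplanes and does not solve the optimization that finds the maximal margin hyperplane.

```python
import math

def margin(beta0, beta, xs, ys):
    """Smallest signed perpendicular distance y * f(x) / ||beta|| over all points."""
    norm = math.sqrt(sum(b * b for b in beta))
    distances = [
        y * (beta0 + sum(b * xj for b, xj in zip(beta, x))) / norm
        for x, y in zip(xs, ys)
    ]
    return min(distances)

# Invented two-dimensional data with classes 1 and -1.
xs = [(1, 3), (2, 4), (1, 4), (3, 1), (4, 2), (4, 1)]
ys = [1, 1, 1, -1, -1, -1]

margin_a = margin(0.0, (-1.0, 1.0), xs, ys)  # the line x2 = x1
margin_b = margin(0.5, (-1.0, 1.0), xs, ys)  # the same line shifted toward class -1

# Adding a class-1 point far away from the boundary leaves the margin untouched.
margin_a_extra = margin(0.0, (-1.0, 1.0), xs + [(0.0, 10.0)], ys + [1])

print(margin_a, margin_b, margin_a_extra)
```

The shifted line has a smaller margin because it moves closer to one class, and the far-away extra point does not matter because only the closest points determine the margin.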
Changing the support vectors will give us a different classifier, a different separating hyperplane. But if you add any points further out on the left or right, the classifier will not change. This is why these are called support vectors, and you can see why, later on, we speak of the support vector classifier and the support vector machine. For now, however, this is still the maximal margin hyperplane and the maximal margin classifier. How can we construct it in more detail? Well, the maximal margin hyperplane is the solution to a specific optimization problem. We need to maximize the margin, which is M. We are looking for those parameters beta zero, beta one, and so on up to beta p, and M, such that M is maximized, subject to two constraints. We sum the squared coefficients beta j, and they need to add up to one. And y i times the hyperplane expression needs to be larger than or equal to M. Remember that y is either minus one or one, so this constraint, 26, ensures that each observation will be on the correct side of the hyperplane. We have some buffer, some margin M, that's clear. This area here, on both sides of the hyperplane, is the margin. Constraint 25, the normalization, makes this quantity a perpendicular distance, so that each observation is at least a distance M from the hyperplane. This needs to be the case, and then we maximize M and choose the coefficients, which determine the hyperplane, such that M is maximal. Now, this is a very simple classifier. If a separating hyperplane exists, we can use the maximal margin classifier. But the question is, is this always the case? No. This is the non-separable case. I've tried to sketch this before, but you can start anywhere here and try to find a hyperplane that separates these points.
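Written out, the optimization problem for the maximal margin hyperplane described above takes roughly the following form. This is a sketch in standard textbook notation, assuming n training observations with features x_{i1}, ..., x_{ip} and labels y_i in {-1, 1}; the tags 25 and 26 follow the constraint numbers mentioned in the lecture, and the slide itself is not reproduced here.

```latex
\begin{align}
& \max_{\beta_0, \beta_1, \ldots, \beta_p,\, M} \; M \notag \\
& \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 = 1, \tag{25} \\
& y_i \left( \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} \right) \ge M
  \quad \text{for all } i = 1, \ldots, n. \tag{26}
\end{align}
```

With the normalization in (25), the left-hand side of (26) is the signed perpendicular distance of observation i from the hyperplane, which is why M can be read as the margin.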
Well, no, there are still red points here. I can go this way and this is okay, but then we have one red point here, and you can see I cannot put a line through this plane without having some red points on the left and some blue points on the right side of the hyperplane. This is a problem. In the non-separable case, we don't have a separating hyperplane and we cannot use the maximal margin classifier. Now, if a separating hyperplane exists, should we always use a classifier based on it? The answer, unfortunately, is no, because it's quite sensitive to new observations. As you can see here, for example, if this is the maximal margin classifier and I add a point, let's say, here, this doesn't change anything. But in this case, we've added one point here, and as you can see, the separating hyperplane does not move just a bit. Let's say this is the old one, and let's say this would also be okay, with this one error that we make. But the maximal margin classifier has to separate all blue from all red points, so you can see it's extremely sensitive, and it suddenly gives us this decision boundary. Even if a separating hyperplane exists, it might be that we don't want this level of perfection, because then the bias will be low but the variance will be quite high. This is why, in the next step and in the next video, we'll extend the maximal margin classifier, allow for some degree of error, and get to the support vector classifier. 17. 17 Understanding Support Vector Machines: Hi, everyone, and welcome back to our class in artificial intelligence and machine learning in finance. In the last video, we've seen the maximal margin classifier. If you look at this here, we saw that the maximal margin classifier is actually quite sensitive to new data. If we add new data points, then the decision boundary, which is the line here, for example, is quite sensitive to these newly added observations.
The question is, can we make this classifier a little less sensitive to new observations and a little more forgiving? This is where we arrive at the so-called support vector classifier. It's not yet the support vector machine that we want to talk about in this section, but the support vector classifier is usually also summarized under the name support vector machines. The solution is to use a classifier that allows some observations, not all, of course, but some, to be incorrectly classified. You can see this here: if we add this observation 11 and this observation 12, then even though 11 is red and 12 is blue, the decision boundary doesn't change too much due to these newly added observations, which make the decision boundary no longer a separating hyperplane. So we want to have a little discretion when it comes to the classification of the observations, and this is what the support vector classifier does. The difference to the maximal margin classifier can be seen here, in the optimization that leads to the support vector classifier, or the support vector hyperplane that we use to classify. We again use our constant M. We have our parameters beta zero, beta one, et cetera, for the hyperplane, and now we have additional variables epsilon one through epsilon n, and we maximize M with respect to these parameters. We have the same constraint in line 28, that is, the squared coefficients all add up to one. The difference now is, first of all, that the observations no longer need to be at least M away on the correct side. We are no longer looking for a separating hyperplane; we have some discretion, and the error that is possible is included here through these so-called slack variables, which allow observations to be on the wrong side, at least a little bit.
So we would expect 11 to be on this side, but it's okay that it's a little bit off and a little bit above the hyperplane. This is then captured in the slack variable. We demand, of course, that the slack variables are non-negative, and we have an additional hyperparameter, C, such that the sum of all the slack variables, in other words the sum of the errors that we allow, needs to be smaller than or equal to C. C is, of course, a non-negative tuning hyperparameter. It's chosen via cross validation, and M is again the margin. This is the support vector classifier. We see four different support vector hyperplanes in these plots. As you can see from the titles, these plots result from different choices for the tuning hyperparameter C. The largest value of C was used in the top left panel, and smaller values were then used in the top right, bottom left, and bottom right panels, so we have four different values. When C is large, there is a high tolerance for observations being on the wrong side of the margin, and so the margin will be very large. As C decreases, that is, as we reduce the maximum allowed sum of errors, this tolerance for observations being on the wrong side of the margin decreases and the margin narrows. You can see this here: this is the margin from this side and from this side, and it is reduced in each plot going from top left to bottom right. Now, are we done? Well, actually, no. Is a linear classifier always warranted? If you look at this plot, you can immediately see the problem. You can try your best at using a hyperplane, which is a line here, to separate the blue and red points. If, for example, you were to use this line, then you have those observations here that cause a problem. You can use a hyperplane like this; it doesn't really help.
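For reference, the support vector classifier's optimization problem described above can be written out as follows. This is a sketch that assumes the standard textbook form, in which the margin requirement is relaxed to M(1 - epsilon_i); the lecture only names the ingredients (M, the betas, the slack variables, the budget C and constraint 28), so the exact layout of the slide may differ.

```latex
\begin{align}
& \max_{\beta_0, \ldots, \beta_p,\, \epsilon_1, \ldots, \epsilon_n,\, M} \; M \notag \\
& \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 = 1, \tag{28} \\
& y_i \left( \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} \right) \ge M (1 - \epsilon_i)
  \quad \text{for all } i = 1, \ldots, n, \notag \\
& \epsilon_i \ge 0, \qquad \sum_{i=1}^{n} \epsilon_i \le C. \notag
\end{align}
```

In this form, epsilon_i = 0 means observation i respects the margin, a value between 0 and 1 means it is inside the margin but on the correct side of the hyperplane, and a value above 1 means it is on the wrong side of the hyperplane. Setting C = 0 forces all slack variables to zero and recovers the maximal margin problem.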
What you really need is something that is non-linear. For example, something like this, or something like this, could be useful. A linear hyperplane is not warranted here, and we need a non-linear classifier. This is where we finally get, in the next couple of slides, to the support vector machines. The support vector classifier is a linear one. How can we convert it to a classifier with a non-linear decision boundary? The first thing we could try is that we again maximize the margin. We use the betas, the epsilons and M as our parameters, but we now say that we are not using a hyperplane. As you can see here, this is again the response, being minus one or one, and the margin, and we allow a certain error through the slack variables. But here in parentheses, where previously we had a linear hyperplane, we now allow a polynomial function. So just like we did with polynomial regression, we allow the decision boundary to be a polynomial one, and this is one way of doing it. The problem is that it might not be enough. With polynomial regression, we've seen that this adds some flexibility to our model. It's the same here, but it might be that we need more non-linearity. How can we achieve this? First of all, recall the standard inner product, in this case the Euclidean inner product, for two vectors x and x prime. This is the scalar product, the Euclidean inner product of two vectors, and we need it because the linear support vector classifier can be represented with it. It's just a different representation, a different way of showing what the support vector classifier looks like. With n parameters alpha i, the hyperplane is given by beta zero plus the sum of the parameters alpha i times the inner products of x and x i. In other words, we can use this representation. It's still a linear function, a line, a plane or whatever, and the hyperplane can also be represented in this way.
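The two formulas referred to here can be written out as follows, in standard notation and assuming n training observations x_1, ..., x_n with p features each. The symbol f(x) for the hyperplane expression is introduced only for this sketch.

```latex
\begin{align}
\langle x, x' \rangle &= \sum_{j=1}^{p} x_j x'_j, \\
f(x) &= \beta_0 + \sum_{i=1}^{n} \alpha_i \langle x, x_i \rangle .
\end{align}
```

The first line is the Euclidean inner product of two vectors; the second line is the linear support vector classifier expressed through inner products between a new point x and the training observations, with one parameter alpha_i per training observation.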
Instead of using beta one times x one, beta two times x two, and so on, we can also use the inner products. We get a different set of parameters. Now these are the alpha i and beta zero, of course. To estimate these, we only need the n times (n minus 1) divided by 2 inner products between all pairs of training observations. At first, this sounds like a lot. If we are looking at a big data sample with, say, 1 million observations, then we would have 1 million times almost 1 million, divided by 2, inner products, and with n observations we have n parameters. However, alpha i is non-zero only for the support vectors in the solution. You can again see why they're called support vectors. This means that we don't need all those n parameters alpha i; we need a much smaller number of parameters to represent the hyperplane of the support vector classifier. Now, why do we need this different representation for the linear support vector classifier? Because the linear support vector classifier can be extended in a very simple way: we replace the inner product with a more general so-called kernel. What is a kernel? The term has slightly different meanings in different parts of mathematics and statistics, but here, in the context of machine learning, a kernel K is a function that quantifies the similarity of two observations. So we have two observations, x one and x two, and a kernel is a function that in some way measures their similarity; this could also be based on the distance between those two points. This is what we call a kernel. In the case of the inner product, this is, of course, a function that compares those two vectors in Euclidean space. We need to replace the inner product.
Let me point with my cursor here: we need to replace the inner product here with a different kernel, and then we get a more general extension of the support vector classifier. This is what finally is a support vector machine. We can use different kernels. For example, we can use the linear kernel. We can also use a polynomial kernel of degree d, or, quite often, the so-called radial kernel or radial basis function kernel, which is given in equation 37. These are different choices for comparing two vectors x i and x i prime. If we substitute the linear kernel with a non-linear kernel, say the radial kernel, we get a different classifier, and the resulting classifier is known as a support vector machine, or SVM. In equation 38, you see that we now only speak of a general kernel K. What is important is, as I mentioned before, that we don't need all n parameters, because the parameters alpha i will be non-zero only for the support vectors. The set S that contains the indices of the support vectors is usually much smaller than n, and for this representation we only need the indices in the set S, the set of support vectors. That's a support vector machine: substituting the linear kernel with a non-linear kernel gives us a much finer way of classifying these observations. On the left-hand side, you see the polynomial kernel. We have our observations here, and we would have liked those red observations to be separated from the blue ones. The polynomial kernel does this. As you can see on the right-hand side, the radial basis function kernel works even better. So these are support vector machines. They are estimated in a similar way as the support vector classifier. All these classifiers, the maximal margin classifier, the support vector classifier, and these more general support vector machines, are usually summarized under the name support vector machines.
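The three kernels named above, and the kernel form of the classifier, can be sketched in plain Python (not the course's R code). The exact formulas on the slides are not reproduced in the transcript, so the common textbook forms are assumed here; the parameter gamma, the support vectors and the alpha values are hypothetical.

```python
import math


def linear_kernel(x, z):
    """Euclidean inner product of x and z."""
    return sum(a * b for a, b in zip(x, z))


def polynomial_kernel(x, z, d=2):
    """Polynomial kernel of degree d (assumed form: (1 + <x, z>)^d)."""
    return (1.0 + linear_kernel(x, z)) ** d


def radial_kernel(x, z, gamma=1.0):
    """Radial basis function kernel (assumed form: exp(-gamma * ||x - z||^2))."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def decision_value(x, beta0, alphas, support_vectors, kernel):
    """f(x) = beta0 + sum over support vectors of alpha_i * K(x, x_i)."""
    return beta0 + sum(a * kernel(x, sv) for a, sv in zip(alphas, support_vectors))


# Hypothetical fitted values: two support vectors with opposite-signed alphas.
support_vectors = [[0.0, 0.0], [2.0, 2.0]]
alphas = [1.0, -1.0]

value = decision_value([0.1, 0.1], 0.0, alphas, support_vectors, radial_kernel)
print(value)  # positive: the new point is far more similar to the first support vector
```

Only the support vectors enter the sum, which is the point made in the lecture: all other training observations have alpha equal to zero. Swapping `radial_kernel` for `linear_kernel` in the call gives back a linear classifier.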
So these are SVMs, that's the theory, and in the next video, we'll be looking at some applications. 18. 18 SVMs in Practice Part 1: Hi, everyone, and welcome back to our class in artificial intelligence and machine learning in finance. In the last videos, we've seen K nearest neighbors and support vector machines, and we now want to apply these to a dataset of credit card customers. It's not a dataset that is concerned with defaults, which one would probably suspect when hearing the words credit card, finance and classification, but we actually want to predict customer churn. We want to see which customers are more prone to cancel and terminate their contract and which aren't. The data can be downloaded from Kaggle. If you go to this link here, you can find the data at Kaggle. You can also find various notebooks, usually in Python, that feature a wide variety of methods for this classification task, not just K nearest neighbors and support vector machines. Here, however, because we've just seen those two models, we want to concentrate on those two. For the support vector machines, we are going to use two variants: the linear one and one with a radial basis function kernel. We've already seen on a previous slide in our script that the radial basis function kernel achieves a non-linear decision boundary, and this is probably better than the purely linear one. What is the motivation behind this application? Well, it is actually even more relevant in insurance management. For example, if you think about car insurance, the contracts usually expire after one year, and then you have to renew your contract. In Germany, it's usually renewed automatically if you do not cancel your policy.
However, customers can cancel and switch to a cheaper competitor, so cancellation rates are obviously a major concern and a major input for the calculation of premiums in an insurance company. It's the same with the credit card customers here. If you're the manager, you're concerned about whether an increasing fraction of your customers quit and terminate their contracts. You want to slow this down, because obviously this is bad for business, and you're trying to use machine learning methods, here K nearest neighbors and support vector machines, to identify those features that are able to predict customer churn and contract cancellation. Above all, the manager needs a prediction of who is going to terminate the contract, so this is what we're going to do here. The dataset that is available at Kaggle contains information on roughly 10,000 customers. We have features including age, salary, credit card limit, and others, and the data are unbalanced, meaning that only about 16% of the customers in the dataset have actually canceled their contract, while 84% have remained with the credit card company. This complicates training and the interpretation of the predictive performance of the models. Obviously, this would be even worse if only, let's say, two customers had canceled and 9,998 customers hadn't. So this is a slight problem; we'll come back to it later. It might also be more favorable to contact a customer who is not about to churn than to not contact a customer who is. So the question is, after having done the prediction and the classification, what is the best way to move forward? Should we contact those customers that are most likely to cancel their contracts, or should we concentrate on those that are close to cancellation, or on those that will never cancel their contract, in order to maximize our profits? In this lecture, our primary focus is on the models.
We will therefore not specifically address the so-called class imbalance problem, meaning that only 16% have canceled and 84% haven't. You can look this up in the literature and in textbooks. For fitting the KNN model and the SVMs, as well as the elastic net model we've seen in the previous section, we rely on the R package caret by Kuhn (2020), and we will now briefly introduce this package in more detail. We've already used it before, but here we are going to talk about it a little more. caret is short for classification and regression training. It bundles the activities related to model development into a very streamlined process. It allows you to test different models with very few changes to the code. We'll see later on in the R code that we don't need to rewrite much in order to switch from one model to the next, and it offers automatic cross validation and parameter tuning. The package provides a consistent modeling syntax. For example, by simply changing the method argument, you can easily change the model from K nearest neighbors to an SVM. In total, it gives you access to more than 200 different models from machine learning. Note that, behind the scenes, the package is not performing the modeling itself. It uses the standard methods from R. For example, if you use lm, it simply refers to the lm function from the stats package to estimate a linear regression model. It only simplifies the syntax. For K nearest neighbors as well, it relies on the class package by Venables and Ripley. So it is a more convenient way of doing this in R rather than reinventing everything. For a complete list of models, you can go to this link here in the script and access the documentation of caret.
Now, as before, we need to import and preprocess the data first, and we're doing this by downloading the BankChurners CSV from Kaggle. You can see the link here. After having downloaded it to your computer, you need to import it into R. We use the package readr, and we read the CSV file into this object, BankChurners. If you read the description at Kaggle, it's recommended to drop the last two columns, which are not needed, and this is what we are doing in line six. BankChurners is overwritten with BankChurners using only columns one to the number of columns minus two, so we are dropping the last two columns. We then print the data, the first seven observations, and you can see it's a table. It includes the client number, the attrition flag, the customer age and gender, and it has 17 more variables and approximately 10,000 rows. We have 10,000 bank customers and, as you can see, two, four, five, so we have 22 features for the customers. The variable description is also available from Kaggle. You can see, for example, that CLIENTNUM is, as expected, a unique identifier for the customer. The attrition flag is one if the account has been closed and zero otherwise. So if it's zero, it's an existing customer; if it's one, it is a customer who has canceled his or her contract. The dependent count is the number of dependents; gender, age, et cetera, are self-explanatory. We will do some exploratory analysis. For example, we'll see if there are any missing values. If you use any and is.na on BankChurners, you can check whether there is any NA, not available. That's how R marks missing data, and it gives us FALSE back, so there are no missing values. We can also check for duplicates; that's any and then duplicated on BankChurners. Again, it comes out FALSE, so we don't have any duplicates.
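The preprocessing steps just described (drop the last two columns, check for missing values, check for duplicates) can be sketched without any packages. This is a plain-Python illustration, not the course's R code with readr. The three rows and the two trailing column names are hypothetical stand-ins for the downloaded file, and the spellings CLIENTNUM, Attrition_Flag and Customer_Age are assumed from the column names mentioned in the lecture.

```python
import csv
import io

# Hypothetical stand-in for the downloaded CSV file; extra_1 and extra_2 play
# the role of the two trailing columns that the description recommends dropping.
raw = io.StringIO(
    "CLIENTNUM,Attrition_Flag,Customer_Age,extra_1,extra_2\n"
    "1001,Existing Customer,45,0.1,0.9\n"
    "1002,Attrited Customer,52,0.8,0.2\n"
    "1003,Existing Customer,38,0.2,0.8\n"
)

rows = list(csv.reader(raw))
header = rows[0][:-2]               # keep all columns except the last two
data = [row[:-2] for row in rows[1:]]

# Any empty cell counts as a missing value in this sketch.
has_missing = any(value == "" for row in data for value in row)

# Duplicated rows collapse when the rows are put into a set.
has_duplicates = len({tuple(row) for row in data}) < len(data)

print(header, has_missing, has_duplicates)
```

In the lecture's R workflow the same checks are done with any, is.na and duplicated on the imported table.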
To see the class imbalance, we use a table of the attrition flag, BankChurners$Attrition_Flag, divided by the number of rows, and then we round this, and you can see we have attrited customers, 16%, and existing customers, 84%. We can clearly see that, yes, we have a class imbalance. We are not addressing this problem in our application, but it is a problem and it can deteriorate the quality of all classification models later on. This is a topic for another lecture. Now, we continue our exploratory analysis. We analyze churning customers by age. We use the package dplyr. What we do is plot the attrition flag for some subsets, for example by customer age; you can also do it by gender, et cetera. The code is a little bit complicated because the plots are meant to look nice. If you look at the next slide, you can see what comes out of this. You can see the percentages of attrited customers. It starts at approximately, I would say, 12% for customers aged 20. Then it goes up for 30, 40 and 50 years, and it steeply drops for customers aged 60 and 70. We can see that age obviously seems to have some influence on the attrition flag, so it will probably be one of the predictors that stay in our models. It's the same here for customer education: college, doctorate, graduate, high school, post-graduate, uneducated and unknown. It seems that, at least if you have a doctorate, the percentage of attrition is much higher than for the rest of the education classes. Again, this might be helpful later on in predicting customer churn. Now, we do have some categorical features. For example, customer education is a categorical feature, and it cannot simply be encoded as a numeric variable, as is sometimes possible for ordinal variables. For example, if you have rating scores, it's clear that a score of, let's say, 50 is better than a score of 30.
With categorical features, this is not possible, because you cannot really say that, for example, a doctorate is twice as good as a high school diploma. Customer education could be encoded as six binary dummy variables. However, this raises the dimensionality of the data significantly. If we do this for every categorical feature, we end up with a huge number of dummy variables. This is an especially serious problem for nonparametric models like KNN classification because of the curse of dimensionality. We are artificially increasing the dimensionality while the number of observations stays the same. We have 10,000 observations, but in this context that is not too many. The curse of dimensionality means that if you increase the dimensionality of your problem, the fixed number of observations you have will at some point vanish in the huge high-dimensional space. For simplicity, we therefore drop all non-numeric variables. In practice, you should obviously decide carefully for each categorical variable whether it should be included or not. If we look at customer education, doctorate seems to have an influence. You might think about using one dummy variable that is one for doctorate and zero for all the other classes. This might be a solution. With age, you could split into two subsamples, below and above 60, or below and above 50. This might work. But if you were to include a dummy variable for each age, for each year, this would simply cause more problems than it solves. Okay. We then create the training and the test set. We remove all non-numeric features and the column CLIENTNUM; we don't need the client number, as it has no predictive value. We create the training and test set by randomly including 80% of the observations in the training set and 20% in the test set. We again set the seed to 2021 for reproducibility, and then we split our dataset into X train, X test, Y train and Y test.
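The random 80/20 split with a fixed seed can be sketched as follows. This is a plain-Python illustration, not the course's R code; the function name and the toy rows are hypothetical, and seeding Python's generator with 2021 will not reproduce the random draws that R produces with the same seed.

```python
import random


def train_test_split(rows, train_fraction=0.8, seed=2021):
    """Randomly assign train_fraction of the rows to the training set."""
    rng = random.Random(seed)          # fixed seed makes the split reproducible
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_train = int(round(train_fraction * len(rows)))
    train = [rows[i] for i in indices[:n_train]]
    test = [rows[i] for i in indices[n_train:]]
    return train, test


# Hypothetical data: 100 customers, each a (customer id, attrition label) pair.
rows = [(i, "Attrited" if i % 6 == 0 else "Existing") for i in range(100)]

train, test = train_test_split(rows)
print(len(train), len(test))
```

Every row ends up in exactly one of the two sets, and calling the function again with the same seed returns the same split. Predictors and response would then be separated within each set to obtain X train, X test, Y train and Y test.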
Obviously, the attrition flag in BankChurners is our response variable in this case. We have our four data samples: the test and the training set, each for the predictors and the response variable. Now, by default, K nearest neighbors is based on the Euclidean distance, and to ensure that all features contribute equally to the measured distances, we scale the data based on the minimum and maximum values. So we determine the per-column minimum and maximum, and then we scale all variables in X train and X test, and we call the results X train scaled and X test scaled. We scale so that, as we can see here, there's a minimum of zero and a maximum of one. All variables in the training set have a minimum of zero and a maximum of one, and they are scaled monotonically in between. As the scaling is based on data from the training set, the same is not necessarily true for the test set. You can see here, if we compare test scaled and train scaled, that it is slightly different. Okay. In the next video, we'll start with the K nearest neighbor classification and then do support vector machines. But you can see that it takes some time to preprocess the data in order to be able to fit these models in the first place. 19. 19 SVMs in Practice Part 2: Hello, everyone, and welcome to our class in artificial intelligence and machine learning in finance. In this video, we will continue our application of K nearest neighbors and support vector machines in the context of predicting class labels related to customer churn, that is, the termination of credit card contracts. We've already seen the data and the data preprocessing, and we want to start with K nearest neighbors. For K nearest neighbors, we rely here on the Euclidean distance. We could also have used the Manhattan distance, the cosine distance, or any other metric that measures the distance between two observations.
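Because the Euclidean distance is dominated by features with large ranges, the min-max scaling from the preprocessing step matters for K nearest neighbors. A minimal plain-Python sketch (not the course's R code) of scaling with training-set minima and maxima follows; the two features and their values are hypothetical.

```python
import math


def fit_min_max(train_rows):
    """Per-column minimum and maximum, computed on the training set only."""
    columns = list(zip(*train_rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def scale(rows, mins, maxs):
    """Map each value to (value - min) / (max - min); constant columns become 0."""
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(row, mins, maxs)]
        for row in rows
    ]


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


# Hypothetical features: customer age and credit limit.
x_train = [[20.0, 1000.0], [40.0, 3000.0], [60.0, 5000.0]]
x_test = [[30.0, 6000.0]]

mins, maxs = fit_min_max(x_train)
x_train_scaled = scale(x_train, mins, maxs)
x_test_scaled = scale(x_test, mins, maxs)   # uses the training minima and maxima

print(euclidean(x_train[0], x_train[1]))                # dominated by the credit limit
print(euclidean(x_train_scaled[0], x_train_scaled[1]))  # both features contribute equally
print(x_test_scaled)                                    # can fall outside [0, 1]
```

The test observation has a credit limit above the training maximum, so its scaled value exceeds one. This is the effect mentioned in the lecture: the zero-to-one range is guaranteed only for the training set.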
And remember, K nearest neighbors is called that because we still need to decide whether we want to use five neighbors, ten neighbors, 20 neighbors, et cetera. So the number of neighbors is a hyperparameter. It's a parameter that we need to set in order to train our model, and we determine the optimal value via ten-fold cross validation. You can also do these computations in parallel. As you will see later on in R, we are using a cluster and parallel computing to speed up the estimation process. This is not really necessary in this context, but I wanted to include it here to show you how you can use parallel computing to get results quicker. We determine the optimal parameter and evaluate the model's predictive performance, that is, how well we are able to predict the positive class label, based on the two metrics accuracy and Cohen's Kappa. We'll have some more details on these two metrics later on. Now, we start again by loading the library caret, and we set up the trControl object, which is simply a container for some options on how to train our model. Using the function trainControl, we set the method to cross validation, cv. We want to do ten-fold cross validation, so number is equal to ten. Then in line three, again, in order to be able to reproduce our results, we use the set.seed command. The cross validation is then done in parallel. If you encounter any problems with the PSOCK cluster, you should simply comment this out and do the cross validation without the cluster. The library needed is doParallel, and cl is makePSOCKcluster of six, so we're using six cores or threads, and then we call registerDoParallel with this object cl. Into this object, KNN model, we write the results of our trained model. We train based on the X train scaled data object. We use Y train as our response. We use knn, K nearest neighbors, as the method. We set the tune grid to a grid that expands from 1 to 10.
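To make the tuning procedure concrete, here is a minimal plain-Python sketch of choosing the number of neighbors by ten-fold cross validation over the grid 1 to 10. It is not the caret code used in the course and does not reproduce caret's resampling or tie-breaking details; the toy data, the function names and the seeds are hypothetical.

```python
import random
from collections import Counter


def knn_predict(train_x, train_y, x, k):
    """Majority vote among the k training points closest to x (Euclidean distance)."""
    nearest = sorted(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], x)),
    )[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


def cv_accuracy(xs, ys, k, folds=10, seed=2021):
    """Average held-out accuracy of a k nearest neighbor classifier over the folds."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    accuracies = []
    for f in range(folds):
        held_out = indices[f::folds]            # every folds-th index forms one fold
        held_out_set = set(held_out)
        fit_idx = [i for i in indices if i not in held_out_set]
        fit_x = [xs[i] for i in fit_idx]
        fit_y = [ys[i] for i in fit_idx]
        hits = sum(knn_predict(fit_x, fit_y, xs[i], k) == ys[i] for i in held_out)
        accuracies.append(hits / len(held_out))
    return sum(accuracies) / folds


# Hypothetical, well-separated toy data with two classes.
rng = random.Random(0)
xs = [[rng.gauss(0, 0.1), rng.gauss(0, 0.1)] for _ in range(30)]
xs += [[rng.gauss(1, 0.1), rng.gauss(1, 0.1)] for _ in range(30)]
ys = ["existing"] * 30 + ["attrited"] * 30

# Tune grid: 1 to 10 neighbors; pick the value with the highest CV accuracy.
results = {k: cv_accuracy(xs, ys, k) for k in range(1, 11)}
best_k = max(results, key=results.get)
print(results, best_k)
```

Each candidate number of neighbors is evaluated on observations that were held out when fitting, and the candidate with the highest average accuracy is selected, which mirrors the role of the tune grid and the accuracy metric in the lecture.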
What does this mean? The tune grid specifies the values of the tuning parameter over which the model is trained. So we're considering one, two, three, and so on, up to ten neighbors. Then we pass in our cross validation settings, trControl, and the metric that is used to select the model is accuracy. Then we stop the cluster. We print the result after having done our training. The optimal number of neighbors is actually the same here for both accuracy and Cohen's Kappa. In general, this is not necessarily the case, but in this scenario it is. As we can see, we get the resampling results across tuning parameters, with accuracy and Kappa for one, two, three, up to ten neighbors. Accuracy was used to select the optimal model, and the final value was k equal to six. We end up with a six nearest neighbor model that we have trained on our data. To see the predictive performance, we predict the class labels for the test dataset. This is done using predict. Where's my cursor? Yes. We use predict with the fitted and trained KNN model object, and the new data is X test scaled rather than X train scaled. These are the predictions, and we can then use the so-called confusion matrix, also called error matrix, to compare our predicted responses with the reference data, which is Y test. We are doing an out-of-sample test of prediction accuracy, and this is done with the confusion matrix, which is quite common in machine learning. This is the confusion matrix. Let me highlight this with my cursor again; here it is. In this first part, you can see a simple matrix that compares the predicted class labels, in this case attrited customers and existing customers, with the observations in the reference dataset. We have attrited customers and existing customers, and as you can see, this cell is fine, and this cell is fine.
It means that those who actually were attrited customers were also predicted as attrited customers, and the existing ones, 1,672, were also predicted as existing. Not surprisingly, if we use a red line here, this cell is not good, and this one is also bad for our prediction accuracy. Why? Because these are observations that fall into class one and are attributed to class two, and vice versa. These are the errors our prediction has made. We can also see the accuracy, a confidence interval for the accuracy, the no-information rate, Cohen's Kappa and some other metrics, and we want to comment on these now. Accuracy is defined very simply as the percentage of correctly classified observations. In our example, not surprisingly, we have 189 plus 1,672, divided by the total number of observations, which is 189, 1,672, and those that were erroneously attributed to one of the two classes, and thus we get an accuracy of 90%. The no-information rate is defined as the largest proportion of the observed classes, and in our example this corresponds to the proportion of existing customers. We take 53 plus 1,672, again divided by the total number of observations. It's the highest accuracy that can be achieved by a constant prediction. Meaning what? If, for example, we say attrited, termination, is one and existing customer is zero, and let's say this is our dataset, then we could simply predict a constant. Let's add a few more. Let's say one, one, one, one, one, one, one, one, one. How many observations would we get right? One, two, three, four, five out of two, four, six, nine. The same could be done if we say it's zero, the red x is zero: one, two, three, four, and so on. We are simply saying we don't do any prediction at all; we set it all constant to one or zero. The no-information rate is then the highest accuracy that you can achieve by such a constant prediction.
In this case, let me just check, the larger class is actually the existing customer, so no termination. This is the highest accuracy we can achieve simply by setting all predictions to one or to zero. You have to decide which one is better, of course.
Cohen's Kappa is a measure of a classifier's performance relative to how well the model would have fared simply by chance. You compare the accuracy of the model to the hypothetical probability of an agreement by chance. In our example, the no-information rate was about 83%: these observations belong to existing customers, while 16.51% belong to attrited customers. The k-nearest-neighbor model classifies 88% as existing and 11.71% as attrited customers. The probability of an agreement by chance is calculated by taking the 83% times the 88%, plus the 16.51% for the attrited customers times the 11.71%, and this gives us 75.65%. Cohen's Kappa is now defined as the accuracy minus this probability, divided by one minus this probability, and it gives you 0.5926. That's Cohen's Kappa. When you compare two models, a higher Kappa signals better predictive performance, and the maximum value is one. However, there is no standardized way of interpreting its value. If you have two models, take the one with the higher Cohen's Kappa. A negative value of Kappa would signal that the model's predictions are worse than predicting by chance. This is even worse than, I guess, setting everything constantly to one.
Now we have some additional metrics. The error costs of positives, in our example the customers who terminate their contracts, and of negatives are usually different. In this context, sensitivity and specificity are more informative than accuracy. We start with sensitivity, which is also called recall or the true positive rate.
It is defined as the number of correct positive predictions divided by the total number of positives, that's 55%. Specificity is the true negative rate: the number of correct negative predictions divided by the total number of negatives, which is 96.93%. In our case, the k-nearest-neighbor classifier is good at predicting customers who do not terminate the credit card contract, but bad at predicting customers who do. What is the possible reason? Well, if you remember, we have a highly unbalanced data sample: only a few terminations and a lot of customers who stay with us. So it's quite difficult to achieve a higher sensitivity based on this data sample. This is the example of the k-nearest-neighbor classifier, and in the next video we'll look at the support vector machine.
20. 20 SVMs in Practice Part 3: And welcome to this class in artificial intelligence and machine learning in finance. In this video, we want to continue our example of applying classifiers in a financial setting, and we are now going to use support vector machines to tackle the same problem as with the k-nearest-neighbor classifier: predicting customer churn. We will fit a linear SVM as well as a support vector machine with a radial basis function kernel to the training data on customer churn in the credit card data sample. Then we are going to predict the labels for the test data, that is, contract termination or no termination. For the linear support vector machine, that is, without a kernel, or rather with the linear kernel, we determine the most appropriate value for the hyperparameter C, which is also known as cost, out of the grid two to the power of minus five, two to the power of minus four, and so on. Again, as we've seen in the last video for the k-nearest-neighbor model, this can be done in parallel to speed up computation.
You don't need to do this, but it is a good exercise to practice parallel computing. The support vector machine with the radial basis function kernel has two hyperparameters, sigma and C. Here we do not provide a set of possible values; we simply let the caret package in R try out ten reasonable parameter combinations, and it chooses the combinations itself. This is the tuneLength parameter in the syntax of R and the caret package.
So we start with hyperparameter tuning for the linear support vector machine, or support vector classifier. We call set.seed in line one to be able to replicate the results later on. We again use trainControl, which is passed as an option to the train command in line eight. In line two, we call trainControl with cross-validation as the method and five folds, and store it in the object TR control. We set the grid as described on the previous slide, and then we use six worker processes via a PSOCK cluster in R. SVM linear is our object, and we train the support vector machine using the X train scaled data and Y train for the response. As the method we set svmLinear, then we get the result and stop the cluster.
We print the results, and you can see the resampling results across the tuning parameters, with different values for C and then accuracy and Kappa. We've seen that in the last video. Accuracy was used to select the optimal model, and it was optimal for a hyperparameter C equal to 0.03125. In this case, choosing C with regard to accuracy leads to a different value than choosing it with regard to Cohen's Kappa, but we said that we want to use accuracy here, and it's the default setting in the caret package. We can do the same with the radial basis function kernel.
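The tuning steps for the linear SVM might look roughly like the sketch below. It is not the code from the slides: the data is simulated, the object names are made up, only two workers are started instead of six, and the upper end of the grid for C is an assumption, since only the lower end is given in the video. It needs the caret, kernlab and doParallel packages to be installed.

```r
library(caret)       # train(), trainControl()
library(doParallel)  # makePSOCKcluster(), registerDoParallel()

set.seed(1)  # makes the simulated data and the resampling reproducible

# Simulated stand-in for the scaled training data (NOT the course dataset)
n <- 200
x_train_scaled <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
y_train <- factor(ifelse(x_train_scaled$x1 + x_train_scaled$x2 +
                           rnorm(n, sd = 0.5) > 0,
                         "Attrited", "Existing"))

# Five-fold cross-validation for hyperparameter selection
tr_control <- trainControl(method = "cv", number = 5)

# Grid for the cost parameter C; the upper exponent (2) is an assumption
svm_grid <- expand.grid(C = 2^(-5:2))

# Start a small PSOCK cluster so the resampling runs in parallel
cl <- makePSOCKcluster(2)
registerDoParallel(cl)

svm_linear <- train(x = x_train_scaled, y = y_train,
                    method = "svmLinear",   # linear-kernel SVM from kernlab
                    tuneGrid = svm_grid,
                    trControl = tr_control,
                    metric = "Accuracy")

stopCluster(cl)

print(svm_linear)      # resampling results for every value of C
svm_linear$bestTune    # the value of C with the highest cross-validated accuracy
```

For the radial basis function kernel, the same call with method = "svmRadial" and tuneLength = 10 in place of tuneGrid lets caret pick the candidate values itself.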
Again we call set.seed and initialize the parallel computing cluster, and then svmRadial is what we need to set the method to, and the tune length is ten. We stop the cluster and get the results for the second support vector machine. Again you get the resampling results from cross-validation, and you see that the tuning parameter sigma was held constant at a value of 0.05. Accuracy was again used to select the optimal model, and the final values were sigma equal to 0.05 and C equal to eight. This is what comes out of training the model.
We predict the classes for the scaled test data, and we do this for both the linear classifier and the radial basis function support vector machine. Then we also print the confusion matrix to compare our predictions with the Y test data as the reference, again for both the linear support vector machine and the radial basis function kernel.
This is the linear support vector machine. You can see from this very simple matrix, 186 and 1,679, that it looks almost the same as for the k-nearest-neighbor model. Let's compare: going back some slides, you see 189 and 1,672. It's almost the same accuracy, 90.27%, and the same holds for the additional metrics below the confusion matrix. For the radial basis function kernel, you can see it is slightly better for existing customers, by one observation, but it's hugely better for attrited customers. You can also see this in the accuracy, which increases by almost 3% compared with the linear SVM. So the radial basis function kernel support vector machine does a much better job than the two competing models.
The predictive performance of the linear SVM is very similar to the k-nearest-neighbor classifier in terms of accuracy, Kappa and sensitivity. The support vector machine with a nonlinear radial basis function kernel improves on this, shifting the accuracy to almost 94% and increasing Kappa to 0.7647. This substantial increase is mainly due to an improvement in predicting attrited customers, as we've seen.
So the sensitivity is actually higher in this case, almost 20% higher than for the linear support vector machine. So yes, using a nonlinear model here makes sense: it allows us to increase the sensitivity and, as a result, the accuracy of our forecasts. This is the example for classification, and in the next subsection we want to look at regression trees.
21. 21 Decision Trees for Classification: Hello, everyone, and welcome back to our class in artificial intelligence and machine learning in finance. In this section, we want to look at regression trees as yet another machine learning method for classification. That is, we want to classify customers, observations, maybe even stocks in an asset pricing application, into classes: good customers and bad customers, or customers who are prone to terminate their contract and those who are not. For this we want to use regression trees as another method alongside, for example, the k-nearest-neighbors model and the support vector machines we've seen in section five.
So this is a section on regression trees. What are trees? Just like support vector machines, k-nearest neighbors and many other classification models, these models can be used for classification but also for regression problems. It's simply a matter of changing the response variable from, for example, a binary variable to a metric one. Trees involve stratifying or segmenting the predictor space into a number of simple box-shaped regions. We've seen something similar with the support vector machines and the support vector classifier: there we took the predictor space, for example a three-dimensional space, and cut it into two halves. So we stratified or segmented the predictor space. This segmenting of the predictor space is performed based on a set of splitting rules, and these rules can then also be summarized in a tree.
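A set of splitting rules of this kind is nothing more than a chain of nested conditions. The following is a minimal sketch for two predictors and five regions; the function name, the cutoff values t1 to t4 and the choice of which predictor the last split uses are all made up for illustration.

```r
# Map an observation (x1, x2) to one of five box-shaped regions.
# The cutoffs t1..t4 are hypothetical values, chosen only for illustration.
assign_region <- function(x1, x2, t1 = 0.4, t2 = 0.5, t3 = 0.7, t4 = 0.3) {
  if (x1 <= t1) {
    # left part of the predictor space: split on x2
    if (x2 <= t2) "R1" else "R2"
  } else if (x1 <= t3) {
    # middle strip: no further split
    "R3"
  } else {
    # right part: one more split (assumed here to be on x2)
    if (x2 <= t4) "R4" else "R5"
  }
}

assign_region(0.2, 0.1)   # x1 <= t1 and x2 <= t2
assign_region(0.9, 0.9)   # x1 > t3 and x2 > t4
```

Every observation that lands in the same region receives the same prediction, so once the regions are fixed, predicting is just a lookup.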
In the end, what we are doing is setting up a set of rules that decide, for example: if you are a smoker, you're in class one; if you're not a smoker, you're in class two. Smoking is, later on in one of our applications, a very powerful predictor; as you can imagine, that application is about health.
So this is what a tree looks like on the right-hand side, and it corresponds to the segmentation on the left-hand side. We have two predictors, X1 and X2, and we are segmenting the predictor space into one, two, three, four, five boxes. This corresponds to a tree that starts out with X1. If X1 is smaller than or equal to t1, that's the cutoff, we end up in boxes one and two. How do we decide whether it's one or two? If X2 is smaller than or equal to t2, we get to R1 and we are here. If X2 is larger, we are in R2 and we get this box. If X1 is larger than t1, we are in this area. Then we still need to decide: if X1 is smaller than or equal to t3, it's R3, and if it's larger, we are in this area, and we still have to decide whether it's R4 or R5, which we do with this cutoff. This is the tree, and this is the segmentation.
Now, how do we make predictions? For a specific observation, they are typically made by majority vote in classification, meaning that the class with more than 50% of the training observations in the region wins, or, in regression analysis, by using the mean or mode of the training observations in the region that corresponds to the given observation. For a given region, the prediction for every observation that falls into it is, of course, the same.
How do we build a tree? In theory, for making predictions as on the previous slide, the regions could have any shape.
We could have chosen circles, we could have chosen rectangles, or we could have drawn something like this: this is one region, this is another one, this is the third one and this is the next one. The problem is, of course, that this complicates things a lot. So for simplicity and ease of interpretation, the predictor space is usually divided into high-dimensional rectangles or boxes. These boxes are chosen such that the classification error, or the sum of squared residuals in regression analysis, is minimized, so we have an optimization problem again.
Now think about this predictor space as an image and the boxes as pixels. With a higher resolution you get more pixels, so you get more boxes. Theoretically, at least, if you keep decreasing the size of the boxes you could partition the predictor space into an infinite number of them. And it is computationally infeasible to consider every possible partition of the feature space even into a finite number of boxes. Therefore one typically relies on a top-down, so-called greedy approach that is commonly referred to as recursive binary splitting.
So what do you do? It's top-down because, as we've seen on this slide, you start at the top of the tree, where all observations belong to a single region. You start with the full sample, and then you successively split the feature space into two parts and go on down the tree. And it's greedy because at each step you don't consider what could happen, say, two levels further down. You only consider what is best right now, for this box at this step.
That means you look for the feature and the rule that give you the largest reduction of your cost function at this point, whether that is the MSE, the residual sum of squares, or the classification error. If you split your observations, for example, into smokers and non-smokers, you consider only the best outcome at this step. You don't ask whether it would be better overall to choose a different feature here, for example to start out with age here and consider smoking further down. If smoking gives you the best result here, you start out with smoking and you don't consider the full tree. That is why it's greedy.
Now, to perform recursive binary splitting, you first select the predictor Xj and the cut point s such that the squared errors in regression, or the classification error rate in classification, are minimized over the two resulting regions: the observations with Xj smaller than s, and the observations with Xj larger than or equal to s. In classification tasks you can also use the Gini index or the so-called entropy as a criterion for making the tree splits. You can see both in the footnote; you probably know the Gini coefficient, and the entropy is defined quite similarly.
In this process, all predictors and all possible values for the cut point s are considered. The process is then repeated in each of the resulting regions: as we've seen, you go down the tree and make another cut, looking for the best predictor and best cut point to reduce the squared regression errors or the classification error rate further at that level. You don't look one step ahead; you do it in a greedy way. This continues until some stopping criterion is met.
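One step of this greedy search can be written down directly. The sketch below is only an illustration in base R, with made-up function names and a tiny invented dataset: it tries every predictor and every observed cut point, scores each candidate split by the size-weighted Gini index of the two resulting regions, and keeps the best one. A full tree would repeat the same search inside each region.

```r
# Gini index of a set of class labels: 0 means the node is pure
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# One greedy step: best predictor j and cut point s for the split
# {x_j < s} versus {x_j >= s}, judged by weighted Gini impurity
best_split <- function(x, y) {
  best <- list(feature = NA, cut = NA, impurity = Inf)
  for (j in names(x)) {
    # skip the smallest value, otherwise the left region would be empty
    for (s in sort(unique(x[[j]]))[-1]) {
      left  <- y[x[[j]] <  s]
      right <- y[x[[j]] >= s]
      imp <- (length(left) * gini(left) + length(right) * gini(right)) / length(y)
      if (imp < best$impurity) {
        best <- list(feature = j, cut = s, impurity = imp)
      }
    }
  }
  best
}

# Invented toy data: the label depends only on the smoker flag
toy <- data.frame(age    = c(25, 32, 41, 47, 52, 60),
                  smoker = c(0, 1, 0, 1, 1, 0))
label <- factor(c("low", "high", "low", "high", "high", "low"))

best_split(toy, label)   # the smoker split separates the classes perfectly
```

Because only the current step is scored, the search never checks whether a locally worse split would lead to a better tree further down, which is exactly the greedy behaviour described above.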
For example, you stop if you believe the classification error rate has been reduced sufficiently. By adding further splits, that is, by going down further and growing the tree deeper, the squared regression errors or the classification error rate can only decrease or stay virtually the same. The problem is that this leads to overfitting. You can add layer after layer to your tree, but since the classification error rate on the training data cannot increase, just like the R squared in regression analysis, the model you end up with is way too complex. So you should use smaller trees with fewer splits. This may lead to lower variance and better interpretability at the cost of a small increase in bias. Remember the bias-variance trade-off: variance will be lower, but the bias might be a little higher.
Another possible remedy for overfitting is to grow the tree only as long as the reduction in classification or regression error exceeds some threshold. This leads to smaller trees, and the minimum required reduction can be described by a complexity parameter that balances reductions in classification or regression error against the complexity of the model. In the application we will see how this works: a high value of the complexity parameter leads to more shallow trees, while a low value yields deeper trees. We will also see that in some instances, if we take the basic models and don't guard against overfitting, the models will not generalize well to new data. And as you might have imagined, the optimal value of the complexity parameter, which is a hyperparameter, is determined via cross-validation.
So this is a tree versus a linear model. On the left-hand side we have a standard classification, for example using a linear support vector machine. You see, we cut through this into the yellow and the green area.
And obviously, if we have a linear boundary, using boxes, that is, a regression tree, leads to a large bias. You can see that this leads to a large classification error. Why? Because this box is probably classified wrongly: the majority is yellow, and this one is probably classified as green. This will be an error, and so will this and this. The boxes simply don't fit this data well. However, if we have a nonlinear boundary, you can see that using a very simple linear support vector classifier leads to a huge classification error, whereas with the regression tree we get a perfect classification. So even though we are using rectangles, the regression tree may fit the data much better if we have a nonlinear decision boundary.
Now, what are the advantages of trees? They're easy to explain; you don't actually need any formula. They closely mirror human decision-making: we simply look for boxes and make our decisions based on cutoff points, smoker or non-smoker, low age or high age. Trees can be displayed graphically, and they can handle qualitative predictors very simply: smoker or non-smoker, one or zero. The disadvantages are that single trees often exhibit inferior predictive accuracy compared with, say, support vector machines, and that they can be very non-robust: a small change in the data might lead to a completely different tree. What you can do is aggregate many trees with methods like bagging, random forests or boosting to improve the predictive performance, and we'll see this in our application. I'll come back to boosting and random forests in two or three videos. So these are the advantages and disadvantages.
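To give a feeling for what aggregating many trees means, here is a minimal bagging sketch. It is an illustration under stated assumptions, not the approach used later in the class: the data is simulated with a circular, nonlinear class boundary, the number of trees is arbitrary, and the trees are fitted with the rpart package, which ships with R.

```r
library(rpart)

set.seed(1)

# Simulated data with a nonlinear (circular) class boundary
n <- 300
d <- data.frame(x1 = runif(n), x2 = runif(n))
d$y <- factor(ifelse((d$x1 - 0.5)^2 + (d$x2 - 0.5)^2 < 0.1, "in", "out"))
train <- d[1:200, ]
test  <- d[201:300, ]

# Bagging: fit one tree per bootstrap sample of the training data
n_trees <- 25
votes <- sapply(seq_len(n_trees), function(b) {
  boot <- train[sample(nrow(train), replace = TRUE), ]
  fit  <- rpart(y ~ x1 + x2, data = boot, method = "class")
  as.character(predict(fit, newdata = test, type = "class"))
})

# Aggregate: each test observation gets the class most trees voted for
bagged <- apply(votes, 1, function(v) names(which.max(table(v))))

mean(bagged == test$y)   # out-of-sample accuracy of the bagged ensemble
```

Each tree sees a slightly different sample, so the individual trees differ, and the vote averages out some of that instability.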
In the next video, we'll have a look at the applications.
22. 22 Decision Trees for Classification Practical Example: Hello, everyone, and welcome to our class in artificial intelligence and machine learning in finance. In this video, we want to learn more about the applications of classification trees. We'll look at the use of regression trees, classification trees, tree models in general. We will again use the credit card customer dataset from our previous examples, where we used it to illustrate classification with support vector machines and k-nearest-neighbor models. We are going to predict customer churn. It's a classification task: we want to predict the termination of contracts by customers of our credit card company. To provide a more complete picture, we will also employ boosted decision trees. These are ensembles of trees, related to random forests. We'll see that the initial models, the plain classification trees, do not significantly improve on the performance of support vector machines and k-nearest-neighbor models. But later on we will see that with boosted decision trees we can achieve some increase in accuracy. In the regression task later on, we will forecast health insurance premiums. In that case we have a metric response variable, in contrast to the binary one in the customer churn example, where it's only about termination or no termination. Both data samples are available at Kaggle, and on the slide we've included the Kaggle links for both the customer churn data and the insurance premium data.
So we'll start with the classification task. Again, a short reminder, in case you haven't watched the other video, of what the credit card customer dataset is all about.
It's a data sample of approximately 10,000 observations. You, as the manager, are responsible for looking after the customers, and you're worried that an increasing number of customers quit your services and terminate their contracts. You want to slow down customer churn. So the manager wants to proactively contact customers who are about to leave the credit card services in order to change their decision. You are trying to actively influence the behavior of your customers based on a prediction of whether he or she is likely to terminate the contract. Therefore the manager needs predictions on who is going to quit, so you want to classify those observations. The dataset includes more than 10,000 observations, with features like age, salary, credit card limit and others. No data is missing, and the data is unbalanced, with only about 16% of the customers having churned. We've already talked a little about this class imbalance problem. In the previous lecture, where we used support vector machines and k-nearest-neighbor models, it meant that the accuracy wasn't perfect, because we have few observations where customers actually terminated the contract.
So we start again with importing and preprocessing the data. You can also skip to slide 220 and the following ones, where we did the same in our previous example. You can download the CSV file from this link and then import it. Again a reminder: in many instances when using R, a CSV file is the best choice because it contains as little additional formatting and layout information as possible. It's the pure data, and that's why CSV files are a convenient input format for many statistics programs. We drop the last two columns; according to the dataset description, we don't need them. Bank churners is the object we create from the object we've imported, and this is in line eight here.
We highlight this for you: we are dropping the last two columns, so we only use columns one up to the number of columns minus two. For example, if the original object had ten columns, we would now use only the first eight. As opposed to the k-nearest-neighbor and SVM examples, we do not exclude categorical features; we can work with them here. Next we create the training set and the test set. We randomly select 80% of the observations for the training set, and the remaining 20% go into the test set. We exclude the client number, which only identifies individual customers and has no predictive value; it doesn't carry any economic information. We use the dplyr library. Again we use the set.seed command to be able to replicate the results, and then we create the training set and the test set. We sample integers for our 80% and 20%, and then we select rows based on the client number; that's what we do in line seven. The same goes for the test set, and of course we need the same for the response: Y train and Y test are the response values in the training and the test sample.
Now, decision trees do not require feature scaling because they are not sensitive to the variance in the data. That's why we do not need to scale our observations as we did in the previous lecture on support vector machines.
Now we fit the classification tree. As is common for many of our machine learning algorithms, we need to select a hyperparameter, which is a parameter that governs the training or learning process. We again use the caret library, and we do this in parallel. We use cross-validation: as you might know by now, trainControl is the function in caret that selects the method for hyperparameter tuning, and what we do here is ten-fold cross-validation. set.seed is in line four.
If you again encounter any problems with the parallelization, just comment these two lines out. Then tree model is the object we train, based on X train with the response in Y train. The method is rpart, which of course stands for recursive partitioning. The tune length is 20: training trees is relatively fast, so we can consider more candidate values. trControl is what we set in line three; that's our option for using ten-fold cross-validation to select the hyperparameter. The metric used for training is accuracy, and then we stop the cluster in line 12.
What happens if we print the tree model? The tuning parameter is the complexity parameter, and the optimal value here is 0.00285 and so on. As you can see from the plot, which comes out if we plot tree model, increasing the complexity parameter gives you lower cross-validated accuracy. This complexity parameter, cp, is the minimum improvement required at each node to make a further split. Remember that we're training a tree here, meaning that at each node you need to decide whether to go one level deeper or to say, okay, this is enough, the tree is deep enough. So this parameter balances possible reductions in the classification error via further splits against the complexity of the model, which is the number of splits or the depth of the tree. A high value of the complexity parameter leads to more shallow trees, while a low value yields deeper trees that might overfit. On the following slides we'll illustrate this so that you get an idea of what happens.
Now, the rpart.plot R package provides a rather convenient visualization of tree models. However, it requires a tree model that was fitted by rpart directly and not via the caret package.
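The effect of the complexity parameter is easy to try out with rpart directly. The sketch below is only an illustration: the data is simulated rather than the credit card dataset, both cp values are arbitrary, and count_splits is a small helper defined here, not an rpart function.

```r
library(rpart)

set.seed(1)

# Simulated two-class data (NOT the credit card dataset)
n <- 500
d <- data.frame(x1 = runif(n), x2 = runif(n))
d$y <- factor(ifelse(d$x1 + 0.5 * d$x2 + rnorm(n, sd = 0.2) > 0.8,
                     "Attrited", "Existing"))

# Same model, two values of the complexity parameter cp (both illustrative)
deep    <- rpart(y ~ ., data = d, method = "class",
                 control = rpart.control(cp = 0.001))
shallow <- rpart(y ~ ., data = d, method = "class",
                 control = rpart.control(cp = 0.1))

# Helper: number of internal (splitting) nodes in a fitted tree
count_splits <- function(fit) sum(fit$frame$var != "<leaf>")

c(deep = count_splits(deep), shallow = count_splits(shallow))

# With the rpart.plot package installed, either tree can be drawn:
# rpart.plot::rpart.plot(deep)
```

The tree fitted with the larger cp never has more splits than the one fitted with the smaller cp, because every split has to clear a higher minimum improvement.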
So we fit the same model again, using rpart directly and the optimal parameters determined in the hyperparameter tuning step. We basically get the same model, just from a different package, and with this package we are able to do some nice plots in R. So this is what we do: we visualize this best tree model. The tree model is fitted with rpart, we need rpart.plot as the package, and then we can plot it. The result is shown on this slide. If you have the slides, you can zoom in, but I can show you here as well. We start out at the top with existing customer and 100% of the observations, and the first split is on total trans CT. Then we have total trans AMT, the total transaction amount, total relationship count, and numerous other features such as customer age, which is quite clear, customer age being larger than or equal to 37, and total revolving balance. These are all the cutoffs, and in the end you get a tree that looks like this. If you zoom in, or run the same code at home, you get this tree. As one can see, it is quite deep.
We will now raise the complexity parameter to obtain a more shallow tree. We multiply the complexity parameter of the current tree by 50 and do the same again, and what you get is on the next slide. You can see total trans CT, that's the total transaction count, and the total revolving balance on the credit card. Again, these two are quite high up in the tree, but this is where the partitioning stops. That's the whole tree. If the total transaction count is smaller than 55, yes or no, we get a prediction here: attrited customer, existing customer, existing customer. The whole tree is quite shallow, and that's because we increased the complexity parameter.
Now, let's talk a little about the cutoff level. Classification in each node is, as I mentioned in the previous video, performed by majority vote.
However, it is possible and sometimes even useful to choose a cutoff level that is different from 50%. So the rule is no longer: above 50% it's class one, otherwise class zero. You can vary this cutoff level, and by doing so we can influence the sensitivity and the specificity. The accuracy closely follows the specificity due to the class imbalance in the data. This is shown for the training data and the deep tree. In this plot you can see the specificity in red, the sensitivity in blue and the accuracy in black. Looking at the different cutoff probabilities, you probably shouldn't use values close to zero or close to one. But as you can see, the metrics are not constant across cutoff probabilities, so there might be a choice that increases the accuracy or the sensitivity of your model.
So what's the predictive performance of this classification tree? We rely on the model determined via cross-validation, which is the deep tree. We don't use the shallow one with only two levels. Again we use predict and print the confusion matrix based on the test data sample, and this is what we get: 252 and 1,677 customers have been classified correctly. The accuracy is thus 93%, with this confidence interval, together with the corresponding information on the no-information rate, Kappa, sensitivity and so on. The predictive performance is comparable to that of the support vector machine with the nonlinear radial kernel; see also slide 248. But we can do even more, and this is what we do in the next video. We will use random forests and boosting to improve the prediction accuracy of this classification tree, and then also come back to the regression example.
23. 23 Gradient Boosted Classification Trees: Hello, everyone, and welcome to our class in artificial intelligence and machine learning in finance.
In this video, I want to talk a little about some problems associated with classification trees and what can be done to improve their predictive performance. As we've seen in the previous video, one problem associated with classification trees is overfitting. You can always use more levels and more nodes and go down even further. The feature space is then partitioned into more and more boxes, and this leads to overfitting. This is a problem for deep trees, while shallow trees might underfit the data. In the previous video we saw an example where we used only two levels, or three nodes, for partitioning the feature space, and this might lead to underfitting. So the bias-variance trade-off is very important to consider in the context of classification trees.
One possible solution is to construct multiple different trees at training time and produce forecasts based on the mode of the classes predicted by the individual trees in classification, or the mean or median of their predictions in regression. This ensemble technique is called a random forest. Random forests are constructed by introducing randomness into the construction of each tree, for example by choosing random subsamples, or random variables for the splits, and this yields trees that are built independently of each other. This idea is usually referred to as bagging, or bootstrap aggregating.
Another approach is to use multiple trees in a so-called boosting framework, where weak learners, those are the shallow trees, are combined to yield a stronger estimator, such that at each iteration step the classification or regression error is reduced.
This will yield so-called boosted classification trees or regression trees, in which the individual trees are no longer built independently of each other. To see an illustration, I would advise you to watch these two YouTube videos, see my cursor: this one is the first, and this one is the second. There, the principles of boosting and bagging are explained quite nicely, actually. On the next slides, we will apply boosted decision trees to our classification problem, and we will rely on the XGBoost algorithm, which is quite famous in the context of boosting. So we want to fit a boosted decision tree, again with the caret package. We do this in parallel for faster computation, with five-fold cross-validation. So this is train control. As fitting boosted decision trees is computationally quite intense, we only use five-fold cross-validation. If you start the parallel computing session before that and you move this to a cluster, you can also try ten-fold cross-validation. We set the seed, as is tradition, and work with X train and X test with the numeric data. We again exclude all non-numeric features, that is, all categorical features. The underlying XGBoost algorithm already uses parallel processing implicitly. Therefore, we only employ two parallel processes in caret, wrapped around the implicit parallelism that is included in XGBoost. We set the tune length equal to two. As caret selects five different parameters, this will then lead to two to the fifth power, that is 32, different parameter combinations that are considered in cross-validation. So XGBoost model is our object. We train based on X train and y train. The method is xgbTree, so it's a classification tree that is boosted via XGBoost, and the rest is standard: train control with our options, cross-validation, five-fold, and the metric used to select the best model is accuracy. Then we stop the cluster. So this is what we get out: 100 rounds, maximal depth is two, and some additional parameters.
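The size of the tuning problem mentioned above is easy to check by enumeration. This is a small plain-Python sketch, not the caret code from the class: the five parameter names and their candidate values are placeholders chosen only for illustration, and only the counts (two values for each of five parameters, five folds) come from the lesson.

```python
from itertools import product

# Hypothetical parameter names and values, chosen only for illustration.
grid = {
    "n_rounds": [50, 100],
    "max_depth": [1, 2],
    "learning_rate": [0.3, 0.4],
    "column_fraction": [0.6, 0.8],
    "row_fraction": [0.5, 1.0],
}

# Every combination of one value per parameter: 2 * 2 * 2 * 2 * 2 = 32.
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]

n_folds = 5
n_fits = len(combinations) * n_folds  # one model fit per combination and fold

print(len(combinations), n_fits)
```

With 32 combinations and five folds, cross-validation alone requires 160 boosted models, which is why the lesson keeps the tune length small.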
And let's already consider the forecasting accuracy. So we predict based on the XGBoost model, the new data is X test, the reference is y test, and we look at the error matrix. Now we see that, where we previously had, I think, 170, we have an improvement compared to both the support vector machine and the unboosted classification tree. The accuracy now increases to 97%, but most importantly, the sensitivity increases. So it is a substantial improvement compared to all three previous models: the classification tree, the support vector machine, and k-nearest neighbors. You can see this in the increase in the sensitivity. It increased from 74% to 87% for the boosted tree ensemble. By relying on the predictions from the boosted classification tree, for almost 300 out of the 341 customers, we can rightly predict that they are about to quit the services. This leads to the increase in sensitivity, and the manager could then act on this. On the other hand, we would falsely approach only 26 out of 1,725 existing customers. So yes, this could be a way to move forward. Now, in the next video, we want to take the same approach, decision trees, but use it in the context of a regression analysis. We need to use different data. Why? Because now we need a metric response variable.

24. 24 Decision Trees for Regression: Hello, everyone, and welcome back to our class in artificial intelligence and machine learning in finance. In the last videos, we've seen classification trees, and we now want to use basically the same model for a different purpose, for regression analysis. So we'll continue our application, but we now need a different dataset. We can again use a dataset from Kaggle, but this time, it's one on insurance premia.
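Looking back at the confusion-matrix figures quoted for the classification trees, the reported percentages can be reproduced from the counts. This plain-Python sketch assumes that the single tree and the boosted ensemble were evaluated on the same test sample of 341 churning and 1,725 remaining customers; the lesson does not state this explicitly, and the function name is hypothetical.

```python
def classification_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity and specificity from the four confusion-matrix cells."""
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # share of churners that are detected
        "specificity": tn / (tn + fp),   # share of staying customers left alone
    }


positives, negatives = 341, 1725  # assumed to be the same test sample for both models

# Single (unboosted) tree: 252 and 1,677 customers classified correctly.
tree = classification_metrics(tp=252, fn=positives - 252, tn=1677, fp=negatives - 1677)

# Boosted ensemble: only the 26 falsely approached customers are stated exactly.
boosted_specificity = (negatives - 26) / negatives

print({name: round(value, 3) for name, value in tree.items()})
print(round(boosted_specificity, 3))
```

Under that assumption the single tree comes out at about 93% accuracy and 74% sensitivity, which matches the figures mentioned in the lessons.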
You can download the data under this link. We want to predict insurance premia, which are expenses for customers and potential policyholders. We have data on close to 1,300 health insurance policies, with four numeric features, age, BMI, number of children, and expenses, and three nominal features, sex, smoker, and region. So obviously, we want to forecast the expenses, the insurance premium, and we are going to use age, BMI, number of children, sex, smoker yes or no, and region, which is not yes or no but the different regions, to predict the insurance premium. We need to import the data, so we use the readr library. We download the CSV file and import it. There seems to be a duplicate entry, so we need to remove the duplicate entry. If we now take the sum of is.na over insurance, we get zero. So there are no missing values, which is important in this context, and we have a look at the insurance premia to get a summary. As you can see, the mean is 13,000, probably dollars. The maximum is 63,000 and the minimum is 1,122, so that's the data. We use the tibble library. We convert the data into a tibble and we print the first eight observations, and you can see we have female, male, male, male, female, female, female, the body mass index, number of children, smoker yes or no, the regions, and the different expenses. Now we create the training and test data, which by now is pretty standard. Again, you can see that 80% of the observations go into the training data and 20% into the test data, with X train, y train, X test, and y test as standard, as has been done before a number of times. Again, we use the caret package. We set the train control option to cross-validation, ten-fold. We set the seed in line three, and as single trees are fitted very fast, we do not need parallel computing here. We actually consider various hyperparameters, so the tune length is 100. The metric needs to be different now.
We cannot use accuracy, which is for classification, because this is a regression tree. Y train is now a metric variable. Even when using rpart, the function sees that this is not a classification but a regression task, and the metric then is the root mean squared error. It cannot be accuracy as before. Then we print the complexity parameter, 0.0061. And this is what happens. You can see that with an increasing complexity parameter, the RMSE increases. It stays constant for some time, then it jumps up. Now, to visualize the optimal tree for the optimal complexity parameter: you can see here that on the first level, quite interestingly, the split is smoker yes or no. What you can see here are the expenses, the insurance premia. As you can see, if you're not a smoker, you are immediately in this area where the premia are relatively low. Then, if you're young, it's the lowest premium; if you're older, a further distinction is made here between not as old and rather old. If you're a smoker, and if you have a high body mass index, and if you're old, then you are among the 5% of the observations with very high insurance premia. Now, what is the predictive accuracy here? Again, we predict based on the test data sample, and you can see the R squared is 87%. The root mean squared error is 4,238 and the mean absolute error is 2,900. Obviously, the R squared is easiest to interpret, and we get that in this model, close to 87% of the variance can be explained by this regression tree. We could also use boosted regression trees. However, XGBoost only works with numerical features, and as we've seen, smoker yes or no, a categorical feature, a binary variable, should be included because it seems to be the feature that has the highest predictive power. If we leave this out, this leads to an inferior model. Now, we can do this, but you can see the R squared immediately drops to 9%.
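The three regression measures quoted above can be computed by hand. This plain-Python sketch uses four made-up premia and predictions, not the insurance data from the class, and the function name is hypothetical. It computes R squared as one minus the ratio of residual to total sum of squares; some packages instead report the squared correlation between predictions and observations, so values need not coincide exactly with a given library's output.

```python
import math


def regression_metrics(actual, predicted):
    """Root mean squared error, mean absolute error and R squared."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mean_actual = sum(actual) / n
    ss_res = sum(e ** 2 for e in errors)                    # unexplained variation
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)    # total variation
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "r_squared": 1 - ss_res / ss_tot,
    }


# Made-up illustrative values (assumption), in the same units as the premia.
actual = [1000.0, 2000.0, 3000.0, 4000.0]
predicted = [1100.0, 1900.0, 3200.0, 3800.0]

result = regression_metrics(actual, predicted)
print({name: round(value, 3) for name, value in result.items()})
```

RMSE and MAE are in the units of the response, here currency, while R squared is a unit-free share of explained variance, which is why it is the easiest of the three to interpret.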
Instead, if you rely on bagging, via the method treebag, and use all features, this does not in this scenario lead to an improved performance compared to the single tree model. The R squared drops to 81%, but it's a good thing to try this out. So this is a regression tree. Both classification and regression can be done by trees. We've seen a few examples and some nice applications. In the next video, in the next subsection, we will start discussing a huge part of artificial intelligence, which is neural networks.

25. 25 Neural Network Architecture: Hello, everyone, and welcome back to our class in artificial intelligence and its application in finance. In this video, we are going to start the subsection on neural networks, and to be precise, artificial neural networks, because these are obviously not biological networks we are looking at, but artificial ones. As I mentioned in one of the first videos, artificial neural networks try to mimic the behavior of the human brain and the processes that happen in the human brain, encompassing neurons and synapses. Consequently, the term neural network actually encompasses a large class of models. Here, we're going to look at the single layer perceptron, or the single hidden layer backpropagation network. It's usually just referred to as the single layer perceptron. This is a very plain vanilla neural network. There are more complicated, more complex ones. We also briefly cover deeper networks, so-called multilayer perceptrons or MLPs. After having seen regression and classification trees, you probably understand what is meant by deeper networks and deeper models. And the same happens here with neural networks.
If you're more interested in neural networks and what can be done on top of neural networks, including the single layer and the multilayer perceptrons, you can have a look at the Hastie, Tibshirani, and Friedman and the Goodfellow, Bengio, and Courville textbooks. Both should actually be available for free on the Internet, and you can learn more about artificial neural networks there. Now, as I mentioned in the first video, I think, artificial neural networks are far from being new and have been around for 80 years. In fact, they date back to the 1940s. They have since gone by many names: they used to be part of cybernetics, then the field was known as connectionism, then it was just neural networks, artificial neural networks. Nowadays, it is commonly referred to as deep learning. Why has it become more important? Why have these models become more sophisticated? Well, because we have more data to train these models. We have big data available. We have more computing power, and as a result, these models have grown in size with the increasing power of hardware and software. Nowadays, with the availability of big data samples and the availability of large clusters and powerful computers, these models can be put to better use than was possible, say, in the 80s or 90s. Since then, neural networks have helped to solve increasingly complicated applications with increasing accuracy. What is the structure of a neural network? Very simply, this is a single hidden layer feed-forward neural network. It's a single layer perceptron, so not multiple layers, but the single hidden layer perceptron. What happens is, let me get my cursor here, we start out with our features X1, X2, and so on until Xp. Again, just like in the traditional linear regression setting or the classification setting, we have p features. These could be, for example, age, gender, income, or maybe home region.
Then what we do in the neural network, in the single layer feed-forward neural network, is combine these features linearly in the first and, in this case, only hidden layer. So we're constructing variables Z1, Z2, and so on until ZM, and we are linearly combining these features to form, for example, Z1. Obviously, this is basically a linear regression. We have coefficients that, in this case, will be called weights, and these weights need to be estimated. What we get is, one could say, a couple of regression analysis results, a couple of variables that are hidden. Why are they called hidden? Because these are not observable. We can only observe Y, those are our responses, and we can observe our features. For example, let's say this is the insurance premium dataset: we have age, gender, income, and so on, and the response is the insurance premium. In this example, we would only have one response. But actually, these models are able to predict not just one response variable, but also multiple classes or multiple metric variables. So in the very simple example of the insurance premium dataset, we would only have Y1. We are going to predict insurance premia, so this would be the insurance premium, and these would be the features. What we are doing is inserting yet another layer in between. Instead of directly using the features to predict Y, as we would do in a regression analysis or in a regression tree, we first compute these hidden and unobservable variables Z1 to ZM, and then we are recombining these variables for our prediction of Y. That's basically the only thing that is different in comparison to a linear regression model or linear classification. Why are we doing this? Well, by inserting this additional hidden layer, we can do more transformations of the data, and thus we get more flexible models. That's basically it.
That's also why, if you take the Hastie, Tibshirani, and Friedman textbook, it will tell you that neural networks are far from being magic. They are simply a nonlinear generalization of the regression model. So that's basically what at least the single layer perceptron is. It can be seen as a two-stage regression or classification model, it can be used for both regression and classification, where we have the hidden features in this hidden intermediate layer, and they are derived as a nonlinear function of the inputs. We'll see in just a bit what is done to the features X1 to Xp in order to derive these hidden features Z1 to ZM. These hidden features are then used to model the responses, the targets, Y1 to YK. For regression, you will usually set K equal to one and employ only one output node. That's what we did here when thinking about the insurance premium dataset. For classification with K classes, we can use K output nodes, so we have Y1, Y2 until YK. They will then be coded as 0/1 binary variables. We could say this is, let's say, the class of low income, this is middle income, and then we have a third class that is higher income customers, maybe. Okay. Now, to the formal description. We start out with the hidden layer. That's the hidden layer here. What we are doing is, let me use something else to highlight this, we're using the features X. These are observable. We combine them linearly. The intercept in this case is called a bias, and the regression coefficients are called weights. So we have a linear combination of our features, and if we were to set sigma to the identity function, then we would simply get a linear regression model. Then Z1, Z2, et cetera, would simply be a linear transformation of X, of the features.
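Written out, the hidden layer described above is one activated linear combination per hidden unit. The notation below is an assumption in the sense that the slide itself is not part of the transcript; it follows the symbols used in the lessons, with alpha for the first-layer weights, alpha with index zero for the bias, and sigma for the activation function.

```latex
Z_m = \sigma\!\left(\alpha_{0m} + \alpha_m^{\top} X\right), \qquad m = 1, \dots, M,
\qquad X = (X_1, \dots, X_p)^{\top}.

% Special case: if \sigma is the identity, each hidden unit is just a linear function of the features.
Z_m = \alpha_{0m} + \alpha_m^{\top} X .
```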
But usually sigma, which is the so-called activation function, is chosen to be the sigmoid function, which is given here, and this is a nonlinear function. So our linear combination of the features is nonlinearly transformed and we get the hidden layer, and these hidden units are then again linearly combined to yield Tk. Then we can apply yet another nonlinear function, gk, to T, and this is our prediction for Y. Let me use the red line here: this is basically our prediction for Yk. It's no magic. It's very simple. Take the features, combine them linearly, and apply the sigmoid function to this. You get the hidden layer. Recombine those hidden features in a linear fashion. Then you can apply yet another nonlinear function g to those transformed variables Tk, and you get your prediction for Y. That's the sigmoid function. One can also use the rectified linear unit function, which is given by the maximum of zero and its argument, but that's basically the single hidden layer perceptron. Now, fk is the model's prediction. We've seen that these predictions are constructed as a linear combination of the hidden features, and we have a final transformation gk. In regression analysis, one usually does not use this final transformation, so gk is just the identity function. Basically, what you're saying is that our prediction is Tk and we don't apply any gk to it. In classification, however, one usually uses the softmax function, and this is actually the same as in the multilogit model you might have seen in regression analysis. So that's the single layer perceptron. The units in the hidden layer are called hidden units. They're not directly observable. They are learned from the features X1 through Xp. These hidden features are then used to produce our predictions of the outputs, which are again observable. And when sigma is the identity function, the whole model collapses to a linear model in the inputs.
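The recipe above (combine linearly, apply sigma, recombine linearly, apply g) can be written as a short forward pass. This is a plain-Python sketch with made-up weights for two features, three hidden units and two outputs; all names and numbers are assumptions for illustration, and no training takes place here.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def identity(v):
    return v


def softmax(ts):
    """Turn K scores into K probabilities that sum to one."""
    shift = max(ts)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(t - shift) for t in ts]
    total = sum(exps)
    return [e / total for e in exps]


def affine(biases, weights, inputs):
    """One linear combination (bias plus weighted sum) per row of weights."""
    return [b + sum(w * x for w, x in zip(row, inputs))
            for b, row in zip(biases, weights)]


def forward(x, alpha0, alpha, beta0, beta, activation=sigmoid, output=None):
    z = [activation(v) for v in affine(alpha0, alpha, x)]  # hidden units Z_1..Z_M
    t = affine(beta0, beta, z)                             # linear scores T_1..T_K
    return output(t) if output else t                      # g applied to T, if any


# Made-up weights (assumption): p = 2 features, M = 3 hidden units, K = 2 outputs.
alpha0 = [0.1, -0.2, 0.0]
alpha = [[0.5, -0.3], [0.2, 0.8], [-0.6, 0.1]]
beta0 = [0.0, 0.1]
beta = [[1.0, -1.0, 0.5], [-0.5, 0.3, 0.2]]

x = [1.0, 2.0]
class_probabilities = forward(x, alpha0, alpha, beta0, beta, output=softmax)
linear_scores = forward(x, alpha0, alpha, beta0, beta, activation=identity)
print([round(p, 4) for p in class_probabilities], [round(t, 4) for t in linear_scores])
```

Passing `activation=identity` and no output function reproduces the collapse mentioned above: the network is then just a linear (affine) function of the inputs, however many hidden units it has.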
Then there is no big difference to a linear classification model or linear regression model. So one actually needs, for example, the sigmoid function to get a generalization of the linear model, to have any extension of the linear model; otherwise, it doesn't make too much sense. And what are multilayer perceptrons now? Well, it's very simple. You do this again: you insert yet another layer and say, okay, I'm combining these three here and I'm combining these four maybe, and we get a second hidden layer. Then we have a two-layer perceptron, and this will give us a deeper model. If we add another and another, these are all called multilayer perceptrons. Stacking multiple hidden layers on each other will give us the MLP. And if you want to see an excellent graphical illustration of a neural network, if it's not clear by now, you can look at this video here on YouTube, which is a very, very good example and illustration of what a neural network does. Let me comment on one thing, and let me delete a little bit of my drawings here. Why is it called a perceptron? Why is it called an artificial neural network? Well, this follows what happens in the human brain, or at least what we think happens, with neurons and synapses. The nodes are the neurons and the edges are the synapses. What happens is that we have a signal. Let's think about this as if it were the human eye or the human brain. We get a signal that starts out here and we get a signal here. For example, we have a male person, so this is one, for this feature we have a zero, and here we have a one, zero, zero. Now, based on what was learned in the training process, we can see that, okay, if we have a one here and if we have a one here, then this hidden neuron is activated, so it's set to one.
These are set to zero, and then at the output we observe a one. What happens in the training process is that we see: if we have a one here and a one here, then we have a one here, and then we set the parameters, which are the weights. Let me just check: in the first layer this should be alpha, and then we have beta. This alpha and this beta are increased to make sure that if we have a one here and a one here, then we get to this one. The signals come in, they are transported via the synapses, and then we get to this point. If we have a signal coming in here and here, these are the weights that have been trained, that have been increased, and then we get to this point. This is how the artificial neural network, and also the human brain, works. We get signals coming into the neurons. We have synapses that have been trained, we remember certain things, and then we can decide whether, for example, we are looking at a cat. This is also illustrated in the video, and you can see it especially when looking at the sigmoid function. Again, this is the nonlinear transformation of the linear combination of the features. If this were a linear function, then we would see, okay, maybe it looks like this: if a small value comes in, we get a small value out; if it's a slightly higher value, we get a slightly higher value, et cetera. But what is actually done is a nonlinear transformation that is also governed by this parameter s. That's a scaling parameter in the sigmoid function. If you look at the extreme example of s equal to ten, then what you get is the purple function here, meaning that for all values below zero coming in, you have no activation. The resulting function value is zero. If it's a positive value, then it's a one.
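The effect of the scaling parameter can be checked numerically. This small plain-Python sketch evaluates the sigmoid with its argument multiplied by s; the function name and the evaluation points are assumptions for illustration.

```python
import math


def scaled_sigmoid(v, s=1.0):
    """Sigmoid with a scaling parameter s applied to its argument."""
    return 1.0 / (1.0 + math.exp(-s * v))


inputs = (-2.0, -0.5, 0.0, 0.5, 2.0)
for s in (0.5, 1.0, 10.0):
    print(s, [round(scaled_sigmoid(v, s), 3) for v in inputs])
```

For s equal to ten the values are already very close to zero for negative inputs and very close to one for positive inputs, so the function behaves almost like an on/off switch, while smaller s gives a much softer transition.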
You can see for this extreme choice of the sigmoid function how this actually works. Let me delete all this again. This is pretty much like a signaling function: if you have enough neurons, or the right neurons, being switched on, then the sigmoid function with this choice of scaling parameter will lead to, let's say, one, zero, zero, zero, zero. Some hidden units are switched on, others are switched off, and now it becomes clear how this works. Okay, so this is the structure of a neural network. In the next video, we are going to have a look at how to fit these neural networks, which means we have to estimate and train those parameters alpha and beta.

26. 26 Training Neural Networks: Hello, everyone, and welcome back to our class in artificial intelligence and machine learning in finance. In the last video, we've seen the definition and the basic structure of an artificial neural network. In this video, we want to briefly discuss the question of how to train, how to fit, a neural network, for example a simple single layer perceptron. The same holds for the multilayer perceptron. Obviously, the multilayer perceptron needs even more time than the single layer one, but the principles are the same. Now, we've seen that the neural network has quite a large number of parameters. Why is that? Because we have the parameters, the weights, the coefficients of the linear combinations of our observable features in the first layer. Then we have the hidden layer, where we are recombining those combinations of the features. Again, we have another set of coefficients, or weights, as we call them in artificial neural networks. So we have alpha m, with m ranging from one through M, and beta k, with k going from one through K, where M and K are the numbers of hidden units and outputs, and obviously also the biases. These are the intercepts in those linear combinations.
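The number of parameters just described follows directly from the layer sizes: each of the M hidden units has p weights plus a bias, and each of the K outputs has M weights plus a bias. This plain-Python sketch counts them; the function name and the example sizes are assumptions for illustration.

```python
def count_parameters(p, M, K):
    """Weights and biases of a single hidden layer network with p inputs, M hidden units, K outputs."""
    first_layer = M * (p + 1)    # alpha weights plus one bias per hidden unit
    second_layer = K * (M + 1)   # beta weights plus one bias per output
    return first_layer + second_layer


# Hypothetical sizes: a small regression network and a larger classification network.
small = count_parameters(p=6, M=10, K=1)
large = count_parameters(p=784, M=128, K=10)
print(small, large)
```

Even the modest second configuration already has more than 100,000 trainable parameters, which is why training time and overfitting become central concerns.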
This is actually quite a large number, and it can increase even more if we add another hidden layer in the multilayer perceptron. Now, the parameters are chosen such that the model predictions fit the training data well, the same as in any statistical learning model. But what can be done to train the model in the special case of a neural network? Well, we have to distinguish between regression analysis and classification. For regression, we usually rely on the sum of squared errors as a measure of fit. For example, we use the cost function R of theta, being the sum of squared errors: Yk, those are the observations we have for the K outputs, minus our predictions. Remember that even though this is regression analysis, for the neural network we can actually have more than one response variable. We square the errors, and then we add all this up for all observations, but also for all outputs. Theta is the vector that contains all trainable parameters, so all the weights of the neural network, and we have N training examples. For classification, you usually use the so-called cross-entropy or deviance: you take your predictions fk, based on the features, you take the logarithm, and you multiply it with the observed values for the response, Yk. Usually these will be one or zero, and thus you get the errors. Then you also add it all up and take the negative of this. The corresponding classifier in this case is usually the argmax function, that is, the class to which the highest probability is assigned is chosen as the prediction. Now, these are, in a sense, the error functions, the cost functions. We need to minimize these to train our neural network. The generic approach here is gradient descent. You might know gradient descent also from our computational finance lecture.
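The two cost functions and the classification rule described above can be written compactly. The notation is an assumption, since the slide is not part of the transcript: y with indices i and k is the observed value of output k for observation i, f is the network prediction, and theta collects all weights and biases.

```latex
% Regression: sum of squared errors over all N observations and all K outputs
R(\theta) = \sum_{k=1}^{K} \sum_{i=1}^{N} \bigl( y_{ik} - f_k(x_i) \bigr)^2

% Classification: cross-entropy (deviance), with y_{ik} \in \{0, 1\}
R(\theta) = - \sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \, \log f_k(x_i)

% Classifier: choose the class with the highest predicted probability
G(x) = \arg\max_{k} f_k(x)
```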
Gradient descent is a generic optimization idea: in order to minimize a function, you compute the gradient, which in the one-dimensional case is simply the first derivative. You compute the gradient and then you move in the direction of steepest descent, which is given by the negative gradient. Then you have an iterative algorithm, and you try to minimize the function by moving down, step by step, against the gradient. Now, in the setting of neural networks, gradient descent is usually, and quite famously, referred to as backpropagation, as the gradient can easily be derived by the chain rule of differentiation, and this can be done in a forward and a backward sweep over the network. For details, you should have a look at the Hastie, Tibshirani, and Friedman textbook. I would also appreciate it, and recommend it highly, if you take a look at these two links here. You will find two videos, this one and this one, in which the backpropagation algorithm in the context of neural networks is explained quite nicely. Now, the gradient is calculated based on the whole data sample. If you take a look at this function R here, you can see that it is based on all N observations, so to calculate the gradient, you have to go through all the data, all the training observations. This is called batch learning. It can be quite costly in terms of computational time, because if the dataset is large, computing the gradient also needs a lot of time. Therefore, one usually relies on the so-called stochastic gradient descent or SGD algorithm, which is sometimes referred to as mini-batch learning. What you do is select small random subsamples. You concentrate on a randomly selected smaller subsample to update the network weights.
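The mini-batch idea described above can be sketched on the simplest possible model. This plain-Python example fits a straight line by stochastic gradient descent on squared error, using made-up noise-free data; the function name, batch size, learning rate, and number of passes over the data are all assumptions for illustration, and a neural network would obtain its gradients by backpropagation instead of the explicit formulas used here.

```python
import random


def sgd_linear(xs, ys, batch_size=4, n_passes=200, learning_rate=0.05, seed=0):
    """Fit y = weight * x + bias by mini-batch gradient descent on squared error."""
    rng = random.Random(seed)
    weight, bias = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(n_passes):
        rng.shuffle(indices)  # new random mini-batches in every pass over the data
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad_w = grad_b = 0.0
            for i in batch:
                error = (weight * xs[i] + bias) - ys[i]
                grad_w += 2 * error * xs[i] / len(batch)  # derivative w.r.t. weight
                grad_b += 2 * error / len(batch)          # derivative w.r.t. bias
            # Move against the gradient computed from this mini-batch only.
            weight -= learning_rate * grad_w
            bias -= learning_rate * grad_b
    return weight, bias


# Made-up data (assumption): points on the line y = 3x + 1.
xs = [i / 10 for i in range(20)]
ys = [3.0 * x + 1.0 for x in xs]

weight, bias = sgd_linear(xs, ys)
print(round(weight, 3), round(bias, 3))
```

Each update looks at only four observations, so its cost does not depend on how many observations the full data sample contains.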
As a consequence, the computational burden for each iteration, and this is an iterative procedure, does not increase with the total number of training examples, because you keep the size of those random subsamples fixed. You can increase the training data, but the mini-batches will remain of the same size. The number of training examples in each mini-batch is referred to as the batch size, while a complete sweep over the entire data sample of N observations is referred to as one epoch. A neural network is typically trained over multiple epochs. You can see that, with this huge number of parameters and the need to calculate the gradient in each iteration to minimize our cost function, training a neural network on a large dataset usually requires a lot of time. Remember also that, because of this huge number of parameters, we have a huge flexibility, so overfitting becomes a huge problem. We have numerous parameters, and neural networks are prone to overfitting at the global minimum of the cost function. Now, while there are many means to mitigate overfitting, for example using smaller batch sizes, which have a regularizing effect, there are also other, simpler methods that explicitly address the problem of overfitting. In the application, we will see two of these methods. The first one is dropout, the other one is early stopping. Dropout is frequently used in such a way that we randomly remove non-output units from the network during training. So in each training step, we are reducing the number of units that are actually used to train the model. Early stopping addresses the question of how long to train a neural network: we have a stopping rule under which training is stopped when the model performance starts to deteriorate on a validation set. So in a sense, we are already including the validation set in our training.
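An early-stopping rule of the kind described above can be sketched as follows. This plain-Python example works on a made-up sequence of validation losses, one per epoch; the function name and the `patience` parameter (how many epochs without improvement are tolerated) are assumptions for illustration and are not taken from the lesson.

```python
def early_stopping_epoch(validation_losses, patience=2):
    """Return (epoch at which training stops, epoch with the best validation loss)."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch   # validation performance still improving
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch              # no improvement for `patience` epochs
    return len(validation_losses) - 1, best_epoch


# Made-up validation losses (assumption): improving first, then deteriorating.
losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
stopped_at, best_epoch = early_stopping_epoch(losses, patience=2)
print(stopped_at, best_epoch)
```

In practice one would keep the weights from the best epoch rather than those from the epoch at which training stopped.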
And if we see that the model seems to overfit, that it seems to learn only from the training sample and doesn't generalize well, we stop. So this is what we will see in the application. But before we go to the application, we quickly introduce an extension of neural networks, the convolutional neural network, in the next video.

27. 27 Intro to Convolutional Neural Networks: Hello, everyone. Welcome back to our class in artificial intelligence and machine learning in finance. In this short video, we are going to have a look at convolutional neural networks, which are a specialized kind of neural network, extending the single and multilayer perceptrons we've already seen. They're used for processing data with a grid-like topology, and very famously for images. This is actually frequently used in business and also in finance, when looking at images, for example to identify characters and numbers. It can be used for very simple applications, such as reading receipts or any other type of handwritten information into your computer. And they have been tremendously successful in practical applications. So let's have a look at these. Now, why are they called convolutional? The name indicates that CNNs use a mathematical operation called convolution, which is a specialized linear operation. We don't need to go into the details of what a convolution is in mathematics, but these are neural networks that use convolution instead of general matrix multiplication, as in the single layer or multilayer perceptron. Remember that in the single and the multilayer perceptron, if you leave out the sigmoid function, and you only look at the first layer, for example, and then later on at how the hidden units are combined, these are linear operations, and the units are linearly combined to yield the output signals.
Now, these are linear operations, and if you've taken a basic introduction to linear algebra, you know that linear operations correspond, in a certain way, to matrices. So what we are doing in the single and multilayer perceptrons are matrix multiplications, and CNNs use a convolution instead of the matrix multiplication in at least one of their layers. So this is a CNN. What are we doing? We'll start with a very simple image that is not even an image. It's simply a grid of numbers. Think about this grid. This is the source, the image that we want to transform. Now, look at these numbers: zero, zero, zero, zero, one, one, one, zero, two. We are using these nine pixels, and we are transforming them according to the convolution. We're using the convolution kernel, and what happens is that you carve these nine pixels out. The center element of the kernel is placed over the source pixel. The source pixel is then replaced with a weighted sum of itself and the nearby pixels. You can see the calculation here. We don't need to go through all of it; what happens is that we reduce all this to minus eight, and this is the new pixel value at the destination, and we leave out all the other information. What we do next is shift this convolution kernel to the right, and again to the right. Each time we use nine pixels and calculate the value here, and the one here, and the one here. You can see that in each dimension, as we go through, we are losing two pixels, so we don't get values at the border. We start here with the minus eight, go through these, and then we fill out all these boxes, these pixels in the middle.
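The sliding weighted sum described above can be reproduced in a few lines of plain Python. The nine source pixels are laid out row by row in the order they are read out in the lesson, which is an assumption about the slide, and the kernel is likewise an assumption, chosen so that this patch gives minus eight. As is common in deep learning, the kernel is applied without flipping it, which strictly speaking is a cross-correlation.

```python
def convolve_valid(image, kernel):
    """Slide a square kernel over the image; each output pixel is a weighted sum of a patch."""
    k = len(kernel)
    out_rows = len(image) - k + 1   # a 3x3 kernel loses two pixels per dimension
    out_cols = len(image[0]) - k + 1
    output = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            row.append(sum(kernel[i][j] * image[r + i][c + j]
                           for i in range(k) for j in range(k)))
        output.append(row)
    return output


# Assumed layout of the nine source pixels and an assumed kernel (see lead-in).
patch = [[0, 0, 0],
         [0, 1, 1],
         [1, 0, 2]]
kernel = [[4, 0, 0],
          [0, 0, 0],
          [0, 0, -4]]
single_pixel = convolve_valid(patch, kernel)

# A made-up 5x5 grid shows how the output shrinks to 3x3.
grid = [[(r * 5 + c) % 4 for c in range(5)] for r in range(5)]
feature_map = convolve_valid(grid, kernel)
print(single_pixel, len(feature_map), len(feature_map[0]))
```

Applying several different kernels to the same image and stacking the resulting feature maps gives the three-dimensional output of a convolutional layer.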
This is the convolution, very, very simplified. The convolution kernels are also often referred to as filters, because depending on what kind of filter, what convolution you're using, you are extracting a different specific feature. For example, here, in this feature extraction, we are trying to detect edges. Let me highlight this here. You see that we have one color coming in from here, and then suddenly we have a much different color at this edge, and we are trying to detect these edges. We are not concentrating on colors, only on edges. If you apply this feature extraction, this convolution kernel for edge detection, what you get is this picture: one filter, one convolution, that extracts the edges of the picture. This is one layer of information we get, and we can apply many filters to this picture, extract different features, and then later on learn from them. Typically, many different filters are used in parallel in each layer of a convolutional neural network. Because this again takes some time, it is done in parallel to extract different features at the same time. As a result, the output of a convolutional layer has the structure of a cuboid. You get the outputs of multiple filters, and you can recombine them to get a full idea of what the picture looks like and then train your neural network. Some further architectural components, such as pooling layers, will be covered later on in the application. There is an extensive treatment in the Goodfellow, Bengio, and Courville textbook, where you can also find information on other network architectures, such as recurrent neural networks. These are way beyond the scope of this lecture, but we will see the multilayer perceptron, which is, again, the simple extension of the single-layer perceptron with more hidden layers, and also the convolutional neural network.
You now have a basic understanding of what the CNN and the multilayer perceptron are. For the details, you can look this up, but it's very simple to use these models in R, and we'll see this in the application. 28. 28 Neural Networks in Practice: Hello, everyone, and welcome back to our class in Artificial Intelligence and Finance. In this video, we want to look at the application of neural networks in a very generic example that is frequently used in economics and finance, and that is scanning documents. Often these documents only exist in paper form: invoices, bank statements, printouts of static data, business cards, receipts, et cetera. A scanned image cannot be searched in its native state. Thus, it is common to digitize printed text to enable electronic editing, searching, compact storage, and online display, and for this, you need a neural network. The text retrieved from the image can also be fed directly into further machine processes, for example, automatically managing invoices, receipts, and transactions. You can imagine that this is usually the starting point: the digitization of documents that years and decades ago were only available in printed form, but that nowadays can be digitized and used in a computer. The technique behind this is usually called optical character recognition, OCR, and this is done by neural networks. This is one of the most basic applications of neural networks. One can also use them in pricing; I'll comment on that later on. But I think this example is very instructive, because you can easily see how the neural networks work. On the following slides, we employ neural networks for the task of handwritten digit recognition, a classification problem with ten classes: zero, one, two, three, four, five, six, seven, eight, or nine. We have a handwritten note, and we want the neural network to be able to decide: is this a one, is this a five, or is it a zero, for example.
We start with the multilayer perceptron. It doesn't make too much sense to use a single-layer perceptron here if we can use the multilayer one, and the principle is the same. We will then discuss some regularization techniques to prevent the model from overfitting, and we will fit a convolutional neural network to the data. For the whole application, we rely on the MNIST database that is provided within the keras R package. This database contains 60,000 training and 10,000 test examples of handwritten digits, and more details on this particular dataset can be accessed at this link here. So this is the MNIST database in Keras. We employ the keras R package for fitting the neural network. This package provides an interface to the open source Python library Keras, which in turn acts as an interface to Google's TensorFlow library. You might have heard of TensorFlow, which is the library provided by Google for deep learning and machine learning. The syntax in the R package is very similar to the Python syntax in the original library. Keras is one of the leading high-level neural network APIs. It has a focus on enabling fast experimentation, it is quite user friendly, and it allows the training of neural networks on both CPU and GPU without changing the code. We need additional computation power here. It also supports arbitrary network architectures, and it's appropriate for building essentially any deep learning model. That's why we are using it here, and it also lets us stay within R. Now, before we can start to work with the data, we have to install the Python Keras and TensorFlow backend first. This can mainly be done from within R. During the process, you might be asked which TensorFlow version you want to have installed. If you use the default, this yields the CPU version, which we'll use here. But please note that installing Python Keras via the keras R package seems not to work on RStudio Server.
So if you are using RStudio Server, please use your own computer instead, because this doesn't work there. Before executing the following lines of code, please also install Anaconda with the default settings from here. That's Anaconda. Then you call install_keras, and this third line will also install Miniconda and several Python packages, Keras, NumPy, et cetera. This is what we need in the backend in order to be able to fit our multilayer perceptrons and our neural networks. Again, we start by importing the data. We load the library keras, and then mnist is dataset_mnist. This downloads the database. The object size divided by 1 million is about 219, the size in megabytes. This is not really large, but even this rather small dataset might be too much for a regular notebook. This is why we are not using larger data samples in this lecture. If you have a look at the structure of mnist, you can see it's a list of two: $train, that's one object, and $test. You see it has a very simple structure. Why is that? The data are just digitized images. They are not features like in the insurance risk premium data sample, where we had age, gender, et cetera. These are all images, and they are obviously not saved as JPEGs, but as digitized images, like this. For example, if we view mnist, that's our data sample, and in the training dataset take the first observation of x, you can see this matrix. It is a 28 by 28 grayscale image, and what you see is that these are all numbers. If we zoom in here, you can see, these are all numbers. And you can imagine what has been done. This is one of the images. Let me zoom out. It looks like this. This is one example. I would guess this is supposed to be a five. Someone wrote down a five, and this is the digitized version of it.
If you check the image label and visualize it, we would guess that this is a five. And yes, it is. If we look at the response, the output value for this observation, it's a five. Hopefully, later on, our models are able to train on the training data and then, if fed this image, determine that this is supposed to be a five. Now, the data are stored in a three-dimensional array. If you take a look at this, you can see the first dimension, one, then the second one, and the third one. The first dimension is for the image. This is the first image, so we get a matrix in the remaining two dimensions. All the data are stored in a three-dimensional array: image by width and height. So if we were to access one, three, two, here's my cursor: with one, we get the first image, then row three and column two, and we would get this zero. So this is how the data is stored. Now, to be fed into a multilayer perceptron, this matrix has to be flattened. That is, it needs to be transformed into a vector. Additionally, because the data is stored as grayscale values from 0 to 255, we need to transform these values to the range 0 to 1. So x_train is mnist$train$x, y_train is mnist$train$y, and so on for the test data. We need to reshape the data. x_train and x_test are the x_train and x_test from before, and we reshape them by flattening. Then we rescale by dividing all the numbers by 255. If you divide everything by 255, the whole range 0 to 255 is transformed to the interval 0 to 1. That's in lines ten and 11. The y data are integer vectors, with integers ranging from 0 to 9: zero, one, two, three, and so on. For training, we encode the two vectors into binary class matrices. This is done via the keras to_categorical function.
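The three preprocessing steps just described (flatten, rescale by 255, encode the labels as binary class vectors) can be illustrated on a tiny example. This is a plain Python sketch rather than the keras R code used in the course; the 2 by 2 "image" and the helper names `flatten`, `rescale`, and `to_one_hot` are hypothetical and chosen only for illustration.

```python
def flatten(image):
    """Turn a 2-D grid of pixels into one long vector, row by row."""
    return [pixel for row in image for pixel in row]


def rescale(vector, max_value=255):
    """Map grayscale values from 0..255 to the interval 0..1."""
    return [pixel / max_value for pixel in vector]


def to_one_hot(label, num_classes=10):
    """Encode an integer label as a binary class vector."""
    return [1 if k == label else 0 for k in range(num_classes)]


# Hypothetical 2x2 grayscale image; an MNIST image would be 28x28,
# so its flattened vector would have 28 * 28 = 784 entries.
tiny_image = [
    [0, 255],
    [51, 102],
]

x = rescale(flatten(tiny_image))
y = to_one_hot(5)

print(x)  # four values between 0 and 1
print(y)  # a 1 in position 5 (counting from zero), 0 elsewhere
```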
You can see here the structure before the transformation: five, zero, four, one, nine, two, one, three, and so on. These are the values. Our first image was the five. If we apply to_categorical to y_train with ten classes, you can see this is now a matrix. We now have categorical, or binary, variables. For example, if we start with the zero column, the first binary variable: for the first observation, is this a zero? No, so zero. For the second, is this a zero? Yes, so one. Are the next ones a zero? No, no, no. So we get zero, one, zero, zero, zero, zero, et cetera. This is done for y_test in the same manner. In the next video, we are going to fit the multilayer perceptron to this type of data. 29. 29 Multilayer Perceptron Hands On Implementation: Hello, everyone, and welcome back to our class in Artificial Intelligence and finance. Now we want to use a multilayer perceptron in this example of images that we need to categorize. We need to determine whether the handwritten images are, for example, a five, a three, or a nine. We want to build a first model, a multilayer perceptron, and the core data structure in Keras is a model. The simplest type of model is a sequential one, where potentially different kinds of layers are stacked sequentially on top of each other. We start by defining a sequential model via keras_model_sequential and subsequently add layers to it. Instead of the object-oriented syntax of the original Python Keras library, which is model.add, the R package uses the pipe operator, %>%, which we are already familiar with from the dplyr package. Even shallow models can have hundreds of thousands of parameters. Therefore, be careful of overfitting. We are going to discuss overfitting in detail in the next video, and here we are going to have a look at a very simple multilayer perceptron. This is the start. Let's build our first model, model one, with keras_model_sequential.
We define a sequential model. This adds one hidden layer with 256 neurons and, not the sigmoid, but the ReLU activation function, and the input shape corresponds to the length of the flattened images, which is 28 by 28. So model one, then we need the pipe operator, layer_dense with 256 units, activation ReLU, and input shape 784. Again the pipe operator, layer_dense with ten units and activation softmax. This is the last layer, the output layer, with one neuron for each class. Remember, if you go back, we saw that we can have one Y, to which all the neurons go, or we can have more than one output, so this would be Y two. In this case, we have ten units, because our result is one neuron for each class, reflecting the probability for each digit zero, one, two, three, and so on. These are the two layers: this is the hidden layer and this is the output layer. That's all we need. We can print out the model summary, so we call summary on model one. It's a sequential model. You can see here the layers, dense and dense one, with 256 and ten neurons. The hidden layer alone has almost 201,000 parameters, and the total number of parameters is about 204,000. These are all trainable. You can see it's a huge number of parameters that need to be estimated, and this is bound to suffer from overfitting. Next, we compile the model from the previous slides with an appropriate loss function, optimizer, and metric. Model one, pipe operator, compile: the loss is the categorical cross-entropy, the optimizer is optimizer_rmsprop, and the metric is accuracy. We want to use the categorical cross-entropy as the quantity that is minimized when optimizing the neural network, and we are going to assess the performance of our model by using the accuracy. We're not using, for example, the no information rate. Now we can train the model. The results from the training steps are saved in the history object. History is model one, pipe operator, and then we fit the neural network.
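The parameter counts quoted from the model summary can be checked by hand. The sketch below is plain Python arithmetic, not the course's R code, and assumes the standard fully connected layer with one weight per input-output pair plus one bias per output neuron; `dense_params` is a hypothetical helper name.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


# Layer sizes taken from the lecture: 784 inputs, 256 hidden neurons, 10 outputs.
hidden_layer = dense_params(28 * 28, 256)
output_layer = dense_params(256, 10)
total = hidden_layer + output_layer

print(hidden_layer)  # hidden layer: "almost 201,000" in the lecture
print(output_layer)  # output layer
print(total)         # total: roughly 204,000
```

Under this assumption the hidden layer contributes 200,960 parameters and the output layer 2,570, which matches the rounded figures read off the summary.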
Based on x_train and y_train, we use 50 epochs. Remember that one epoch is one sweep over the whole dataset. We work with mini-batches: the batch size of 64 is the number of images processed after which the parameters are updated again. The validation split is 20%, and this provides an indication of the generalization performance of the model. That is, we split our training data into 80% and 20%, and we use the 20% for validation. Let's evaluate model one on the test set. Result one is model one, pipe operator, evaluate on x_test and y_test, and we print both the loss and the accuracy. The accuracy is 98%. Pretty good: the model has very high accuracy, but as we will see later on, it overfits the training set. If we print out model one, pipe operator, evaluate on x_train and y_train, the accuracy is almost 100%, but we will see later that it doesn't generalize as well to new data. We can also make predictions now. Predictions is given by model one, pipe operator, predict_classes on x_test. For example, take predictions one, for the first image in the test set, which is a seven. We can see the true value by counting the positions: zero, one, two, three, four, five, six, seven. So yes, this is a seven. Our prediction is a seven, and the true value is given by this binary variable, which is the binary variable for a seven. This would be eight, and this would be a nine. So this shows quite nicely what happens. Based on the 204,000 parameters, as I mentioned, the model overfits quite drastically, and we will discuss this problem of overfitting and what to do about it in the next video. 30. 30 Multilayer Perceptron Handling Overfitting: Hello, everyone, and welcome back to our class in artificial intelligence and machine learning in finance. In the last video, we've seen the multilayer perceptron and we used it to do the following.
If you skip back one slide, you can see that we trained a multilayer perceptron in order to recognize handwritten digits. In this example, we were able to predict correctly from this picture that this was supposed to be a seven, and it actually was a seven. So this is the trained model's prediction, but we haven't yet talked about the accuracy and about overfitting. If we plot the history of the model over the 50 epochs, you can see here in green, or turquoise, the loss and accuracy for the validation set, and in orange, the loss and accuracy for the training set. As you can see, the accuracy is extremely high. The problem is, as you can see from the loss in the validation set, that with more and more epochs used for training the multilayer perceptron, the model doesn't generalize well to new data. We can see that the loss increases enormously, almost linearly, with every epoch, even though the accuracy in the training dataset is close to 100%. So we can see overfitting being a huge problem here, which is not surprising given the fact that we have about 204,000 parameters and far fewer observations. For each observation, we have more than one parameter, and this yields a highly flexible model, which is causing the overfitting we can see on this plot. It's obvious, when we look at the previous slide where we compare the loss and the accuracy in the training and the test data, that this first, very simple multilayer perceptron overfits the data. What happens is that the model ends up memorizing the training sample, because we have more parameters than we have observations. We could simply store each of our observations in one or more parameters, and thus the model does not generalize well to previously unseen data.
With the loss continuously decreasing on the training set, it starts increasing from epoch ten on the validation set. That's the 20% of the training sample that we selected via the validation split parameter in our model fitting. As a consequence, there is also no further increase in the classification accuracy on the validation set. For the training set, it's close to 100%, but not for the validation set. We have seen this here. You can see that the accuracy doesn't really increase anymore after this point in the validation set, and we cannot train the model much better on the training set, obviously. So what is a possible solution to overfitting in neural networks? One possibility is regularization, and this can be done, for example, via dropout and early stopping. We will be doing this in this example and showing how these two procedures work. Dropout was proposed by Srivastava, Hinton, Krizhevsky, Sutskever, and Salakhutdinov in 2014. It's a powerful regularization method that is applicable to a broad family of models. It's computationally inexpensive, and it's frequently used in the current literature. Dropout trains an ensemble of subnetworks of a given neural network: non-output units are randomly removed from the network. This is typically achieved by multiplying the outputs of the respective neurons by zero, and for each mini-batch, a different subsample of hidden units is used. Then we calculate the gradients and do backpropagation through these networks as usual. Early stopping, as the second alternative for regularizing the neural network, addresses the question of how long to train the neural network. We can see here that at some point, probably here, we should have stopped and said, okay, this is enough.
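The dropout mechanism described above, multiplying the outputs of randomly chosen hidden units by zero for each mini-batch, can be sketched as follows. This is a simplified plain Python illustration, not the Keras implementation: the activations are made-up numbers, the helper name `dropout` is hypothetical, and the rescaling of the surviving units by 1 / (1 - rate), a common convention usually called inverted dropout, is an assumption that the lecture does not discuss.

```python
import random


def dropout(activations, rate, rng):
    """Zero each activation with probability `rate` (training time only).

    Surviving activations are scaled by 1 / (1 - rate) so that the expected
    sum of the layer's outputs stays the same (assumed convention).
    """
    keep_prob = 1.0 - rate
    output = []
    for a in activations:
        if rng.random() < rate:
            output.append(0.0)            # unit removed for this mini-batch
        else:
            output.append(a / keep_prob)  # unit kept and rescaled
    return output


rng = random.Random(0)  # fixed seed so the example is reproducible
hidden_outputs = [0.5, 1.2, 0.0, 2.0, 0.7, 0.3]  # hypothetical activations

# A fresh random mask is drawn for every mini-batch.
batch_1 = dropout(hidden_outputs, rate=0.5, rng=rng)
batch_2 = dropout(hidden_outputs, rate=0.5, rng=rng)
print(batch_1)
print(batch_2)
```

At prediction time no units are dropped, so the full network is used.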
The accuracy in the training sample will only increase slightly, but everything that follows now is an increase in the loss for the validation set. So we could have stopped early. That's what early stopping is all about. Too little training might lead to underfitting, if we had stopped earlier, let's say after five epochs or three epochs, and if we stop too late, we get overfitting. Early stopping proposes a compromise by stopping training at the point when performance on a validation set starts to degrade. It's very simple, very effective, and widely used. Now we continue by adding a dropout layer to model one. The dropout rate, specifying the percentage of neurons excluded per mini-batch, serves as a hyperparameter. Here we choose a dropout rate of 50%. We specify and fit model two, again with keras_model_sequential and the pipe operator. We have layer_dense with 256 neurons and the ReLU activation function, then we have layer_dropout with a rate of 50%, and then the last layer with ten binary response variables, or ten outputs, using softmax as the activation. We have three layers now, and this is the summary of our regularized model. Again, it's sequential, with those three layers. The dropout layer doesn't have any parameters. We get the same number of parameters as before, so it's not really about reducing the flexibility of the model, just about using different subsets of the units each time we fit the model. We compile this and fit the regularized model. Again, we are using cross-entropy as the loss function, the same optimizer as before, and the accuracy as the metric. This is what we store in history. We fit the model on x_train and y_train, with 50 epochs, a batch size of 64, and a validation split of again 20% of our observations. And this is the result. As you can see, as before, we have a drop in the loss at first, and it still increases for the validation set. But it doesn't increase like this.
This difference here, and also this smaller difference between the accuracy in the training and in the validation set, are the results of the regularization via dropout. So that's the result. Accuracy in the validation set increases slightly with the number of epochs, while the loss over the validation set only slightly increases. Overfitting is not a major issue anymore. We can see that yes, the loss still increases. Here's my cursor. It's not constant, it still increases slightly until epoch 50, but it's not a major issue as before. While the accuracy of the regularized model, about 99%, is lower in the training set compared to the original model, it is actually higher, 98.2%, in the test set. So generalization performance has improved, and this is also reflected in a lower loss over the test set. So yes, it's a good way to regularize the neural network and to prevent it from overfitting. In the next video, we'll have a look at a deeper model. 31. 31 Multilayer Perceptron Building Deep Models: Hello, everyone, and welcome back to our class in artificial intelligence and machine learning in finance. We fitted a multilayer perceptron to our data example of handwritten digits that we wanted to digitize and to predict. In this video, we are going to have a look at a deep model, or a deeper model, after having seen how to deal with overfitting in our data. We want to fit a deeper model, and therefore we add two additional hidden layers to our multilayer perceptron. Before that, it was really only a single-layer perceptron. We now add two additional hidden layers, and we reduce the number of neurons by a factor of two in each consecutive layer, so it gets more sparse as we move upwards to the outputs. Again, we apply dropout with a dropout rate of 50% to each hidden layer for regularization of our neural network. We now have a truly multilayer perceptron. It's model three.
Again, it's built sequentially, and as you can see, we have layer_dense. We start with 256 neurons and the ReLU activation function, we use dropout, then we use 128 neurons, again dropout, then 64, and finally we have our output layer with ten binary response variables, where we use the softmax activation function. This is the summary. With this multilayer perceptron, we now have 243,000 parameters. All of these are trainable. Interestingly, in comparison to the single-layer perceptron from before, the number of parameters doesn't increase that much when adding a second and third hidden layer. We move from 204,000 to 243,000 parameters. We continue as before. We compile and fit the model, again using cross-entropy and accuracy, 50 epochs, a batch size of 64, and we use 80% of our available data for training and 20% for validation. We visualize the training process, and as we can see now, the accuracy is quite high for both the training and the validation set. The loss is also rather low for both: for the training set, not surprisingly, but also for the validation set. If we compare this to the single-layer perceptron, and I have to delete some of my drawings from the previous video, you can see that even though we're using the same regularization, using two additional layers leads to a better performance. We now have a higher accuracy and a comparable loss. To evaluate the model, we use our test data, x_test and y_test, and here we get an accuracy of 97.7% for the deep model. In the next video, we'll have a look at early stopping as the second way of regularizing multilayer perceptrons. 32. 32 Multilayer Perceptron Early Stopping Technique: Hello, everyone, and welcome back to our class in artificial intelligence and machine learning in finance. We are still talking about the multilayer perceptron. We fitted it to our dataset of handwritten digits that we want to recognize and to predict.
And we've seen dropout as one way of regularizing a neural network. Remember that even our very simple single-layer perceptron had 204,000 parameters. The multilayer perceptron with three hidden layers had close to 250,000 parameters, and both were prone to overfitting. So we need some way of regularizing those neural networks. The second way of doing this is early stopping, which is a very simple procedure that decides when to stop the training when overfitting starts to set in. So now we want to fit the model with early stopping. You can go back to slide 335 to see what early stopping is. I just explained it. We again rely on the model from the previous video, the multilayer perceptron with three hidden layers, and for illustrative purposes, we reduce regularization by setting the dropout rate to 30%. Apart from this, it's the same model specification as before. With early stopping, we compile and fit the model. This will be model four. As you can see, we compile it again using cross-entropy, the same optimizer, and accuracy as the metric that is used to measure the predictive performance of our model. The training with early stopping is where it's now different from before. It's performed by introducing a so-called callback, a callback that is monitoring the accuracy on the validation set. We fit our model four and store the result in the history object, based on x_train and y_train, 50 epochs, a batch size of 64, and 20% of the data used for the validation set. As before, that's the same. But now callbacks is a list with callback_early_stopping. We have to monitor the accuracy, the patience parameter will be five, and then there is restore best weights: if we have seen that the accuracy deteriorates, should we restore the best weights? This option is set to true. These are the parameters. Monitor is the quantity that needs to be monitored.
In this case, that's the accuracy. Patience is the number of epochs with no improvement after which training is stopped, and restore best weights is the option whether we should restore the model weights from the epoch with the best value of the monitored quantity, or whether we keep the weights from the epoch at which training stopped, that is, patience epochs later. These are the parameters, and this is the training process. If we plot history, we can see that with lower dropout, and we could have done this even without additionally using dropout regularization, the accuracy is quite high for the training and validation set. We can also see that the loss in the validation set has been increasing, let's say, from epoch five through to close to epoch 23, I guess. But this is where it stopped, and we've seen before that the loss would otherwise keep increasing up to this point in epoch 50, with the accuracy being just as high. This is probably a good point. We could also have stopped maybe here, so perhaps we should have used different parameters in our early stopping. Now, if we evaluate model four, we can see it has 97.8% accuracy and a loss of close to 0.15. So the classification accuracy has slightly improved compared to the previous deep model. However, to assess whether the improvement is due to early stopping, the lower dropout rate, or just happening by chance, one would have to perform further analysis. It's not a 100% fair comparison, because we've changed the dropout rate as well to exemplify the effect of early stopping. Obviously, you can try different models and different choices for the hyperparameters. Both deep models do not achieve a better performance than the shallow model with dropout.
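The roles of the monitored quantity, the patience, and the restore-best-weights option can be made concrete with a small simulation. This is a plain Python sketch of the stopping rule only, not the Keras callback itself; the validation accuracies are made-up numbers, and the helper name `early_stopping` is hypothetical.

```python
def early_stopping(monitored_values, patience):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch values.

    Training stops once `patience` consecutive epochs have passed without
    an improvement over the best value seen so far. Epochs count from 1.
    """
    best_value = float("-inf")
    best_epoch = None
    epochs_without_improvement = 0
    for epoch, value in enumerate(monitored_values, start=1):
        if value > best_value:
            best_value, best_epoch = value, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(monitored_values), best_epoch


# Hypothetical validation accuracies for ten epochs.
val_accuracy = [0.90, 0.95, 0.97, 0.96, 0.97, 0.965, 0.96, 0.955, 0.95, 0.99]

stop_epoch, best_epoch = early_stopping(val_accuracy, patience=5)
print(stop_epoch, best_epoch)
```

With restore best weights switched on, the model would be reset to the weights from `best_epoch`; otherwise it would keep the weights from `stop_epoch`. The example also shows a limitation: the later improvement in epoch ten is never reached, which is why the patience is a hyperparameter worth varying.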
So we can see that deeper models are not better per se, if we take the accuracy as the measure. Neural networks need to be carefully specified and trained, and deeper models might perform better with more training examples, with a different dropout rate, a different number of neurons or layers, et cetera. We have a lot of ways of changing the different models. On the following slides and in the next video, we will consider a different network architecture, the convolutional neural network. This is especially suited for data with a grid-like structure, such as the images we are going to process in this example. 33. 33 Convolutional Neural Networks Practical Example: Hello, everyone, and welcome back to our class in artificial intelligence and machine learning in finance. We've seen the single and the multilayer perceptron being used for predicting handwritten digits in our application, and we've also seen how we can use regularization, via dropout and early stopping, to improve on the accuracy of our models. We now want to use convolutional neural networks as yet another alternative neural network model that can be used in this case. For the MLP models, we flattened the input data. That is, we transformed the two-dimensional images into vectors of grayscale values from 0 to 255, and we then transformed all those grayscale values to the interval 0 to 1. However, the convolutional neural network exploits the grid-like structure of the images. We've seen how it works. It uses filters, or kernels, that go through the image and then reduce the information. Therefore, we need the training and test set in a slightly different structure. What we are doing with x_train and x_test is this: we take the mnist training set and the test set, and we scale them again by 255. The dimension of x_train is 60,000 images by 28 by 28: 28 pixels height and 28 pixels width. And the CNN, unlike the MLP, takes images in three dimensions.
So the last dimension typically has three values, the RGB channels for the color. As we have grayscale images, we only need one channel. So what we are doing is adding one dimension. x_train then is an array of dimension 60,000 images by 28 pixels by 28 pixels by one channel, because it's a grayscale picture, and the same transformation is done for the test set. Now, convolutional neural networks can automatically learn a large number of filters. In our case, in the first convolutional layer, we'll use 32, and they will be learned in parallel. Each filter provides highly specific features that can be detected anywhere on the input images. We've seen this example of a picture where one filter was applied to see the edges without caring too much about the colors, just trying to see edges where suddenly something appears or changes in the picture. In our example, we will apply filters of size three by three, just as we've seen in our illustration of the CNN, and the following lines of code specify the whole model as sequential and add the first convolutional layer. Model CNN, that's what we are fitting: keras_model_sequential, layer_conv_2d with 32 filters, and the kernel size is three by three. The activation is, again, the ReLU function, and the input shape is 28 by 28 by one. So that's more or less the X and the Y axis of the picture, the height and the width, and one channel for the color, which in this case is just gray. Now, the output feature map obtained from the various filters is sensitive to the location of the features in the input. Pooling layers are an approach to downsample feature maps, and they summarize the presence of features in patches of the feature map. Common pooling methods are average and maximum pooling. Average pooling summarizes the average presence of a feature, while maximum pooling captures the most activated presence of a feature.
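The two pooling methods just named can be compared on a small feature map. This is a plain Python sketch with a made-up 4 by 4 feature map, not the Keras pooling layers; the helper name `pool2d` is hypothetical, and the patches are assumed to be non-overlapping, which corresponds to a stride equal to the patch size.

```python
def pool2d(feature_map, size, reduce_fn):
    """Summarize each non-overlapping size-by-size patch with `reduce_fn`."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    pooled = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Collect the values of the patch that starts at (i*size, j*size).
            patch = [
                feature_map[i * size + di][j * size + dj]
                for di in range(size)
                for dj in range(size)
            ]
            row.append(reduce_fn(patch))
        pooled.append(row)
    return pooled


def average(values):
    return sum(values) / len(values)


# Hypothetical 4x4 feature map produced by one filter.
feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]

max_pooled = pool2d(feature_map, size=2, reduce_fn=max)
avg_pooled = pool2d(feature_map, size=2, reduce_fn=average)
print(max_pooled)
print(avg_pooled)
```

With patches of size two by two, each dimension of the feature map is halved.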
For example, a cat is present in the respective part of the image, or it is not present. In our example, we use maximum pooling over patches of size two by two. Remember that in each dimension the convolution loses two pixels, so the convolutional layer gives us a 26 by 26 pixel feature map, and maximum pooling reduces each of these feature maps to a 13 by 13 map. So we add a pooling layer to model CNN: layer max pooling 2D, with a pool size of two by two. We add another convolutional layer, this time with 64 filters for detecting more detailed features in the image, followed by another maximum pooling layer. We're adding layer after layer to our convolutional neural network. You can see it here: layer conv 2D with 64 filters and a kernel size of three by three, and layer max pooling 2D, again two by two. To complete the model, we feed the outputs from the last convolutional layer, with its attached pooling layer, into a dense layer to perform classification. This is the same structure as in our multilayer perceptron and in the single-layer perceptron. We have ten units, ten binary variables, one for each digit, and the activation function is softmax. Before the outputs from the convolutional layer can be fed into the dense layer, the 3D output has to be flattened to one dimension. The whole model architecture then looks like this. We call summary on model CNN and see the different layers. The first convolutional layer has 320 parameters, the second one has close to 19,000 parameters, and the final dense layer has about 16,000 parameters. In total, we get about 35,000 parameters, far fewer than in the multilayer perceptron, but still many more than in, for example, the linear classifiers or the regression analyses we've seen before.
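The layer sizes and parameter counts quoted from the model summary can be checked by hand. The following is a minimal sketch in plain Python, not the Keras code used in the class; the helper names (`conv2d`, `max_pool2d`, `flatten`, `dense`, `trace_cnn`) are made up for this illustration, and it assumes unpadded three by three convolutions and non-overlapping two by two pooling, as described in the lecture.

```python
# Hand computation of output shapes and trainable parameters for the CNN
# described in the lecture. All function names are illustrative only.

def conv2d(shape, filters, kernel=3):
    """Unpadded convolution: each spatial dimension shrinks by kernel - 1."""
    h, w, c = shape
    out = (h - kernel + 1, w - kernel + 1, filters)
    # Every filter has kernel*kernel weights per input channel plus one bias.
    params = filters * (kernel * kernel * c + 1)
    return out, params

def max_pool2d(shape, pool=2):
    """Non-overlapping pooling: integer-divide height and width, no parameters."""
    h, w, c = shape
    return (h // pool, w // pool, c), 0

def flatten(shape):
    """Collapse the 3D tensor into one vector, no parameters."""
    h, w, c = shape
    return (h * w * c,), 0

def dense(shape, units):
    """Fully connected layer: one weight per input per unit, plus one bias per unit."""
    (n,) = shape
    return (units,), units * (n + 1)

def trace_cnn(input_shape=(28, 28, 1)):
    """Return (layer name, output shape, parameter count) for each layer."""
    layers = [
        ("conv2d_32", lambda s: conv2d(s, 32)),
        ("max_pool_1", max_pool2d),
        ("conv2d_64", lambda s: conv2d(s, 64)),
        ("max_pool_2", max_pool2d),
        ("flatten", flatten),
        ("dense_10", lambda s: dense(s, 10)),
    ]
    rows, shape = [], input_shape
    for name, layer in layers:
        shape, params = layer(shape)
        rows.append((name, shape, params))
    return rows

rows = trace_cnn()
for name, shape, params in rows:
    print(f"{name:<11} {str(shape):<13} {params:>7,}")
total_params = sum(params for _, _, params in rows)
print(f"total parameters: {total_params:,}")
```

Under these assumptions the counts come out as 320, 18,496 and 16,010, which matches the rounded figures of 320, close to 19,000 and about 16,000 from the summary, and a total of roughly 35,000.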
Now, the dimensionality of the layers. The input data, as we've seen, consists of 28 by 28 grayscale images. Applying filters of size three by three to them, that is, using these convolutions and kernels, leads to the loss of two pixels in each dimension. If you remember this picture, which I'll trace with my cursor: if this is the picture and these are the pixels, then with a three by three kernel we use these nine pixels to calculate this one output value. We then move one step to the right and get this one, and then this one. Finally, when we reach the last column, we get this one. As you can see, we lose this pixel and this pixel in the horizontal dimension, and the same happens in the vertical dimension, so in the end we get a 26 by 26 matrix, the feature map. As we apply 32 different filters in the first convolutional layer, the output dimensionality is then 26 by 26 by 32. By maximum pooling with a patch size of two by two, the dimensionality is reduced to 13 by 13 by 32. The next convolutional layer applies 64 different filters of size three by three to this output. Again, the input feature map loses two pixels in each dimension, so we get 11 by 11 by 64. Then comes maximum pooling with patches of size two by two. The number of pixels in the first two dimensions is essentially divided by two, rounded down, yielding the output dimension five by five by 64, and flattening this output tensor yields a vector of length five times five times 64, that is, 1,600. A very important feature of convolutional neural networks is parameter sharing. This is achieved by moving each filter over the picture, thereby employing the same parameters at each location. We've seen that this three by three filter moves from left to right, for example.
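The sliding-filter walk-through above can also be sketched in a few lines of plain Python. This is a toy illustration and not the class code: the image and the edge-detecting kernel are made up, the function names `convolve_valid` and `max_pool` are hypothetical, and in a real CNN the nine kernel weights would be learned rather than written by hand. It shows the same nine weights being reused at every position (parameter sharing), the loss of two pixels per dimension, and the halving by two by two maximum pooling.

```python
# Toy illustration of one 3x3 filter sliding over a 28x28 image, followed by
# 2x2 maximum pooling. Names and data are illustrative assumptions only.

def convolve_valid(image, kernel):
    """Slide the kernel over the image without padding.

    As in deep learning libraries, the kernel is not flipped (cross-correlation).
    The same kernel weights are reused at every position: parameter sharing.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def max_pool(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size patch."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    return [
        [
            max(feature_map[i * size + a][j * size + b]
                for a in range(size) for b in range(size))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

# Made-up image: dark left half, bright right half, values already in [0, 1].
image = [[0.0 if col < 14 else 1.0 for col in range(28)] for _ in range(28)]

# Hand-written vertical-edge kernel: 9 weights, shared across all positions.
kernel = [
    [-1.0, 0.0, 1.0],
    [-1.0, 0.0, 1.0],
    [-1.0, 0.0, 1.0],
]

feature_map = convolve_valid(image, kernel)
pooled = max_pool(feature_map)

print(len(feature_map), "x", len(feature_map[0]))  # feature map size
print(len(pooled), "x", len(pooled[0]))            # size after pooling
print(pooled[0])                                   # non-zero only at the edge
```

The feature map is non-zero only where the kernel straddles the edge between the dark and the bright half, and pooling keeps that response while shrinking the map.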
As a consequence, the first convolutional layer of our network only employs 320 different parameters, compared to almost 200,000 parameters in the first hidden layer of the multilayer or single-layer perceptron. Each three by three filter involves nine weights plus one bias parameter, and because we employ 32 filters, this yields 320 parameters. This is why we don't end up with almost 260,000 parameters as in the multilayer perceptron, but stay at about 35,000. Pooling does not involve any trainable parameters. The size of the patches over which pooling is performed is a hyperparameter, but the pooling itself has nothing to train. Each of the 64 filters of size three by three in the second convolutional layer is applied to all 32 feature maps, so each filter has three times three times 32 weights plus one bias, which yields about 18,500 parameters. Pooling and flattening again don't involve any trainable parameters, and the last dense layer requires ten weights for each of the 1,600 flattened values, which are mapped onto the ten output neurons for the ten digits, plus ten biases. This yields another 16,000 parameters or so. So in total, we have about 35,000 compared to 204,000 parameters. The model has far fewer parameters, but as we'll see, we don't lose too much flexibility. Compiling and training are done in analogy to the multilayer perceptron, but training is computationally very expensive. We only run ten epochs, in contrast to 50 epochs before, and this has been done at our university computer center on a GPU cluster that is suitable for fitting large and many neural networks. This one was run on four nodes with four Tesla V100 GPUs, so with a lot of computational power. As before, we compile model CNN with cross-entropy as the loss and accuracy as the metric, and we fit the model using X train and Y train, ten epochs, a batch size of 64, and again 20% of the data in the validation set. So this is the result. As we can see, accuracy is also very high.
We start at about 98% and end up at 99%, close to 100% accuracy. If you look at the validation set, the loss goes down for some time. Remember, we would have to look at 50 epochs to see where this is going, but even after ten epochs the accuracy is very good, and we don't have such a high loss on the validation set: about 99% accuracy and a loss of about 0.03. So the convolutional neural network provides a higher forecasting accuracy on the test set than any of the multilayer perceptrons, and classification accuracy could probably be improved even further by tuning the model specification and by running 50 epochs. But even with this very simple example, you can see that it works much better than the multilayer perceptron. This is why convolutional neural networks are usually used in practice in applications where we are trying to analyze images and recognize characters from optical data. Okay. So these were the neural networks, and they can be applied in many different settings: every time you have a lot of data, you have your outputs, and you want to do regression or classification. As you can see, they are highly flexible. They should be applied to big datasets, but then you need to think about regularization and how to combat overfitting. In the last section of our lecture, we're going to look at some different aspects of AI and ML in finance, namely the usage of AI and ML by companies for regulatory purposes and by regulators and supervisory agencies, then systemic risk, and also some ethical considerations. 34. 34 Regulatory Technology (RegTech): Hello everyone, and welcome back to our class in artificial intelligence and machine learning in finance. In this last section of the lecture, we want to look at some non-mathematical topics related to the usage of machine learning and artificial intelligence in financial applications, and we'll start in this video with RegTech and SupTech.
What are these? Well, we start with RegTech, which is probably the more common term. RegTech is the use of AI and ML in the management of regulatory processes within the financial industry. This is the case when banks, insurance companies, and financial service providers use machine learning and artificial intelligence to speed up, improve, and make more efficient their internal processes for complying with regulatory filings, laws, and regulations. The main functions are thus regulatory monitoring, regulatory reporting, and regulatory compliance. Put differently, every time a bank, a financial institution, or an insurance company uses AI or ML to improve regulatory processes internally, this is what we call RegTech. Why do we even have a special name for this? Because many regulatory processes have become more and more complex. We have more regulation and more supervision of financial institutions in the wake of the financial crisis, and supervisors and regulators are demanding more and more information. There are more and more rules that banks and insurance companies need to abide by, so fulfilling all regulatory requirements is a full-time job for many banks nowadays, and they need to hire lawyers and compliance officers. There is much room for improvement here, because supervisory agencies are usually not that sophisticated when it comes to the technology being used. In many cases, you will hear from practitioners: we have to fill out Excel sheets. There are huge Excel sheets, for example under Solvency II for insurance companies, that need to be filled out and sent to regulators, and this takes up a lot of time. So in order to speed this up and make this process more efficient but also safer, you can use artificial intelligence and machine learning. And we have two examples down here.
This is not advertising, but these are two companies that you can easily find via a Google search. For example, for anti-fraud and risk management for digital transactions, there is IdentityMind Global, and for the management of consent for customer data, there is Trunomi. There are numerous other companies, and consulting companies also either provide services related to AI and ML or provide consulting on these topics, Deloitte and BearingPoint, just to name two. So what are the different market segments, to get a feeling for what RegTech encompasses? First of all, profiling and due diligence. That is, you collect and integrate data from multiple internal and external sources. And what's your aim? To profile an entity, to confirm the identity of a person or a company, or to categorize them according to regulatory requirements. Reporting and dashboards: again, you collect and integrate data, and the aim here is to build standardized reports for management, compliance, or regulatory purposes. Very often, as I mentioned, you have Excel sheets that need to be sent, for example, to EIOPA or to national insurance supervisors, and this is where you want reporting to be automated and run much more efficiently and with fewer errors. Risk analytics: this is where it gets a little more interesting from a business perspective. You collect and integrate data, and the aim is to assess the risk of fraud, market abuse, or misconduct at the transaction level. For example, in investment banking, in trading, in the back office and the middle office, you might want to use AI and ML tools to analyze the transactions and to see whether the company has too high a risk exposure, whether there might be a risk of fraud, and these sorts of things.
Dynamic compliance is when you use machine learning methods to monitor regulatory changes and to ensure a flexible adaptation of your policies and, more importantly, of the processes that you have in place. Otherwise, every time something in regulation changes, you would need to do this yourself or hire external consultants to change the processes that were put in place some time before. Market monitoring: you collect and integrate data, and you aim to match market-level adverse outcomes to regulatory or business rules. For example, you want to identify poor product performance, market manipulation, et cetera. We see some of these things on the side of supervisors, and we'll come to that when we talk about SupTech. But obviously, investment banks in particular can do this themselves as well. So, some numbers on RegTech. It is nowadays still dominated by startups: almost 70% of firms are younger than five years, according to the Global RegTech Industry report by the Cambridge Centre for Alternative Finance and EY Japan. It employs just 44,000 people, but I think this is one part of the financial industry that will grow enormously in the next years, because this is one way to cope with the increase in regulatory and supervisory requirements. It has about 5 billion in annual revenue, and it raises a lot of capital in external funding. So what is the market environment? Well, the market and regulatory environment are rated as generally favorable to RegTech companies. The pace of regulatory change has increased ever since the great financial crisis, and it will increase even more, especially now after the Corona crisis.
So we have penalties for non-compliance with regulatory rulings, and this has led to a surge in the demand not just for compliance officers and experts in compliance and in banking and insurance supervision, but of course also for automated and reliable methods to speed these processes up. Some information on the top ten RegTech markets: obviously the UK and the USA, but also Luxembourg, Switzerland, and Ireland, because these are countries where a lot of financial service companies and providers have their headquarters, within the European Union or, in the case of Switzerland, mostly in Zurich. Then there are Australia, Singapore, Japan, Germany, and France. But obviously, the UK and the United States are leading the pack here. Now, who are the clients? Well, usually banks. 61% also target insurers, but it's mostly about banks, because after the great financial crisis most of the new regulation and the more stringent supervision has hit banks. With Solvency II, it has also hit insurance companies in the European Union. So most of these RegTech firms target banks and then insurance companies, but some also do business with FinTechs. That makes sense: if you are a startup, a financial technology company, it makes sense to start with automated processes from the beginning rather than setting up quite traditional processes. And 50% of those RegTech companies also claim that they have clients outside the financial services sector. That makes sense too, because large industrial companies usually face less regulation and usually no supervision, but most of them have similar problems, for example the risk of transactions being fraudulent or erroneous. Thus every large industrial company is also on the lookout for automated processes within its finance function.
This means there is substantial overlap between what we consider RegTech and, later on, SupTech companies. When we move to SupTech, which is where services are provided to regulators and supervisors, some of these processes can also be used by the companies again. So what are the technologies and tools used by RegTech companies? Cloud computing, machine learning, predictive data analysis, natural language processing, deep learning, graph analysis, image recognition, which we've seen with the convolutional neural network, biometrics, crypto tokens, and just a little bit of virtual reality. It's usually about cloud computing, machine learning, and some related topics. Okay, now to the regulators. Regulators have to adapt to ever newer technology-enabled financial services, which may present a challenge, especially for emerging and developing economies. This is why some regulators, like BaFin, have come up with something called a regulatory sandbox. What is that? It's a formal program that allows certain financial services providers to test business models which do not yet fully comply with existing laws. The aim is to learn about the opportunities and risks that a particular innovation in financial technology carries, and to develop evidence-based policy, to see what regulators should do and what they should not do. Imagine, for example, the introduction of cryptocurrencies. This is something that has come from the industry and from scientists, and we've seen a surge in cryptocurrencies, with Bitcoin obviously being the most prominent one. At least at the very start, these weren't regulated at all. They came from the industry, and then regulators and supervisors said: okay, we need to think about how to regulate this, if we should regulate it at all.
And this can, for example, be done in a regulatory sandbox, in order to allow one type of innovation in a limited way, to monitor it closely, and to see what should be done and what should not be done. Regulators like the UK FCA also have innovation hubs or innovation offices. These are places where innovators and regulators meet to discuss solutions to the challenges facing the financial sector. What has been done in the UK, for example, were seven TechSprints, two-day events including industry representatives and innovators, on topics such as regulatory reporting, financial services and mental health, anti-money laundering, and so on. This is how regulators and the industry try to discuss new ideas when it comes to new technologies and new challenges. Okay. So this is RegTech, and in the next video we'll switch to SupTech. 35. 35 Supervisory Technology and Systemic Risk: Hello, everyone, and welcome back to our class in artificial intelligence and machine learning in finance. In this short video, we're going to look at SupTech in contrast to RegTech. To define SupTech: it's the mirror image of RegTech on the side of financial supervisors and regulators. It can be considered a sub-discipline of RegTech, but it can also be seen as an extension of it, or as its mirror image on the side of supervisors and regulators. It covers the artificial intelligence and machine learning techniques that regulators and supervisors use as part of their supervisory actions and conduct. We have financial authorities like BaFin in Germany, the FCA in the UK, and other authorities, and whenever they use big data, AI, and machine learning to support the supervision of financial institutions, this is what we call SupTech.
So it's closely related to RegTech, but obviously, companies have an incentive to comply with regulations and to make this as efficient and cost-efficient as possible, whereas SupTech is supposed not only to make this process more efficient on the side of supervisors, but also to identify more patterns, more fraud, more potential problems, threats to financial stability, et cetera. This is why the focus is usually on misconduct analysis, on the reporting done by financial institutions, and on managing all the data that is coming in to financial supervisors. What you hear from many consultants and representatives of financial institutions is that they complain that they have no idea what the financial supervisors are doing with all the data they are compiling. And with financial supervisory agencies usually having less funding than their counterparts in private industry, it probably makes sense to invest in big data and AI and ML technologies in order to process all this data. One example is the collection and management of detailed data on loans in the euro area. You can go to this link at the European Central Bank here, and there is the R2A accelerator for accelerating SupTech solutions and prototyping. That's another website where you can get some initial information on this relatively new field. It seems as if most supervisory agencies are still looking into SupTech and trying out at least some technologies, but it is not yet widely established in the realm of financial supervisors. There was one survey among 39 financial authorities from 31 countries about the implementation of SupTech strategies. It identified two broad approaches which supervisors followed.
The first one is a specific SupTech roadmap based on the particular needs of a department. This approach tends to be more experimental: a department within a financial supervisor says, we might be in need of, let's say, an ML method to identify fraudulent transactions on the stock exchange, for example, and then things are experimented with. The second is an institution-wide digital transformation and data-driven innovation program. This is a much broader approach that usually encompasses the whole supervisory agency, because the management and governors of the agency have decided that they need a transformation of their overall IT. And then, of course, it makes sense to concentrate on AI and ML technologies as part of that IT transformation within the whole agency. And this is what they found in this study by the FSI: about 50% said, we have no strategy at all. So you can see that this is still very experimental with most supervisors. As a conclusion, SupTech is still in its infancy, but it's gaining momentum. We've seen that the institutions that are supervised and regulated are investing in RegTech, so it makes sense that supervisory agencies keep track of this development and follow in the footsteps of the institutions they're supposed to supervise. So even though this is still at an experimental or developmental stage, we will see many financial supervisory agencies investing more in ML and AI technologies as part of financial supervision. The problem, of course, is that these agencies usually have much less funding at their disposal than the companies that are supervised. But if we take the example of Germany and the recent blunder with the Wirecard scandal, it's likely that BaFin will be reformed to some extent and will probably also explore new ways to identify threats to financial stability and financial misconduct. Another topic that is related to supervision and regulation in this context is systemic risk.
Now, systemic risk you probably know from other lectures and classes on financial supervision and regulation. Financial stability is one main goal of financial supervision: supervisors aim to achieve financial stability in order to prevent financial crises. The question is how the usage of AI and ML in finance is related to systemic risk. Well, there are two sides to this coin. One, obviously, is that AI and ML can be used by financial institutions to keep better track of risk exposure and to manage risks more proficiently in the bank or in the insurance company. Thus, the usage of AI and ML can enhance financial stability simply by doing a better job in, let's say, risk management and trading. The other side of the coin is that AI and ML technologies can also cause systemic risk themselves. The widespread use of AI and ML that we will see in the financial sector might be a future driver of systemic risk. And why is that, especially when we're using it in regulatory functions? First of all, AI is unable to reason about events it has not yet been exposed to. We've seen in our modeling that an important aspect of ML methods is their ability to generalize to new data. Humans can draw on a broad range of prior experience and also have imagination. AI doesn't have that. It can only extrapolate from what it has seen before, and we can only hope that a model that has been properly trained doesn't overfit and generalizes well to new data. That's the problem. We've seen this in many financial crises. Sometimes things repeat themselves, but usually crises happen because something rather new happens, or a new combination of things leads to a crisis. Just like the COVID crisis we've just seen: this is also something mankind hasn't seen for decades or even centuries. The second problem is that we do not know how AI makes decisions.
It is too complex for us to follow. It is usually a black box, and it most certainly is a black box to external stakeholders, so that usually only the companies themselves know how their AI methods work. Intransparency, or opacity, is never a good thing in financial institutions. As soon as investors, stakeholders, and debt holders do not know what happens inside a bank or inside an insurance company, this leads to uncertainty on the part of investors, and this in the end might lead to panicky reactions to news and thus to financial crises. So AI being a black box might in itself cause some problems along the way. The third factor is that AI is more likely to amplify current cycles. It's more prone to procyclicality than human regulators. The problem is that automation favors homogeneous methodologies and standardization, and this leads to procyclicality. We know this from regulation: procyclical regulation means that if markets go down, you go hard on banks, and you thereby strengthen the crisis and make it worse through regulation. And as soon as you have a boom phase, you loosen regulation again and thus cause the next crisis. Anticyclical regulation is what financial economists nowadays usually believe to be a much smarter approach, meaning that if the economy takes a turn for the worse, you loosen regulation in order for the economy to recover more quickly. And then in the boom phase, you tighten regulation in order to prevent excessive behavior by banks, such as excessive lending that would cause the next crisis. Well, AI is unfortunately quite procyclical. Fourth, the high predictability and transparency of AI will enable individuals to bet against it.
This is a very general principle, but imagine that all players, all investors in the stock market, were AI methods and automated bots that have been trained on data and that trade based on AI and machine learning algorithms. Then you only need one human who understands how all these bots, all these robo-advisors and robot traders, all these algorithms work in tandem, and that human will be able to read those systems quite easily, because all these models are predictable and usually quite transparent in the way they function. So some problems might be caused simply by the fact that AI is more predictable in this sense. And here it is interesting to look at the distinction between exogenous and endogenous risk. This distinction has been made famous by Jón Daníelsson at LSE. Exogenous risk is caused by events from the outside, much like an asteroid falling on London, or on Berlin, or Washington. It is easy to measure. It's purely exogenous, and you can't really do anything about it when it comes to, let's say, financial investments. Now, AI, and machine learning as part of AI, is well suited for the evaluation and management of exogenous risk. You have signals coming in, you train your models, you use large datasets, and you use well-established statistical methods and repeated events to train. So if you have enough data, if you've seen enough asteroids, if you have the data on past asteroid sightings and astrophysical data, you can try to train your model and then predict future asteroids. Thus, AI is well suited for micro regulation and for the internal risk management of exogenous events. The problem is that you also have endogenous risk. Endogenous risk is caused by events from within the system. It starts when individuals and individual entities within the system stop acting independently and synchronize their behavior.
It's very difficult to measure. The prime example by Jón Daníelsson from LSE is the London Millennium Bridge, which was opened for the millennium in 2000. The Millennium Bridge was specifically designed to withstand the wind that blows along the river Thames, so risk management was in place for the bridge to be safe when it gets windy. Then there are the people standing on the bridge. In the calculations that were done, their movements were assumed to be random, so this shouldn't be a problem. But when it gets windy, the people on the bridge start to counteract. They start to act in a synchronized way: because the bridge starts to swing in the wind, no one is acting randomly anymore. No one continues walking in the same random manner as they would if there were no wind, but everyone starts to counteract the movements of the bridge. These counter-movements by the people on the bridge, not the wind, caused the bridge to become unstable, so that the bridge had to be closed for some time and modified. This is a prime example of endogenous risk. You see, the risk is not the wind. That's the exogenous risk. The endogenous risk comes from the individual entities, the humans on the bridge, who stop acting independently and start to synchronize their behavior. Now, artificial intelligence methods are trained on various game situations, including interactions between entities. First, there are full-information games. In chess, for example, all possible moves can be known, and deep neural networks succeed in these games because they assume that their opponent is their clone and knows the same things. They do not prepare for endogenous risk situations. They simply cover all bases when it comes to exogenous risk.
We also have incomplete-information games. In poker, for example, the opponents' cards are unknown, and AI performs worse than human players because it cannot theorize about the opponents' intentions. It can only learn from past moves. And we have cooperative games. Success is even more difficult for AI when cooperation leads to multiple local optima. This is, for example, the case in diplomacy and in game theory, and this is where it becomes quite difficult for AI to beat the human, which will limit the use of AI in these situations. Okay. So this is what I wanted to say about SupTech and systemic risk. There are more and more discussions under way when it comes to systemic risk and the use of AI and ML in finance. As you can imagine, AI and ML will be used in trading, in lending, and so on. In many of these instances, we'll touch on ethical considerations, and I will talk about these in the next video. But you can see that, with AI and ML not yet being rolled out in all parts of the financial industry, we cannot think through all the ways in which AI and ML may become a cause of systemic risk in the future. Supervisors and regulators, however, are obviously already concerned about potential threats to financial stability. In the next video, we'll have a final look at some ethical considerations. 36. 36 Ethical Considerations in AI and ML: Hello everyone, and welcome back to our class in artificial intelligence and machine learning in finance. In this last video of our lecture, we are going to cover a huge topic, and one that will become even bigger in the next couple of years and decades: the ethical side of the use of AI and ML. We'll start with some considerations taken from a study by Thomas et al. from this year, and it shows that we have ethical risks at numerous levels when using AI and machine learning.
For example, if we start with the data, the training data must be free of biases, and it should include all relevant types of stimuli. Racial or sexist biases, for example, should be excluded. This is one thing you need to consider when training machine learning models. AI should only be deployed in the environments it has been trained in; that is, we shouldn't have any cross-use for a different purpose. This might be problematic, especially in the financial sector, if you think about models that have been trained on, let's say, lending or loan data and are then used in a different context. And, especially in Germany, privacy laws must be taken into account: data privacy needs to be respected when collecting and processing the data.
Turning to the algorithm side, we should think about unethical coding that might be introduced by the programmers and developers themselves; the coding should be free of biases. That is, it is not necessarily the selection of training data that makes an ethical algorithm unethical. It could be that the developers have already set up the model such that biases are included in the coding itself, for example when we consider the fact that most developers are white males. AI must also be checked for emerging biases during its life cycle.
And last but not least, the business use: the purpose of using AI should be ethical in the first place. It shouldn't be, for example, to discriminate against certain parts of the population. The unintended impact of AI can also be unethical. So these are some problems associated with AI and ML, and we'll come back to some of these later on in more detail.
We'll start with racism and sexism in AI. Now, the machines and the models themselves obviously have no bias, but the training datasets and the algorithms are most likely as biased as our current society. Most AI developers are white males.
They lack the perspectives of minorities, and consequently, if everything goes wrong, existing biases will be introduced into the machines, and they can be amplified if we are not actively working against them. There are several organizations, at least in the United States, that are working against biased algorithms. The ACLU, the American Civil Liberties Union, for example, has fought on a wide variety of issues, and it exposed Amazon's Rekognition as racially biased. The Algorithmic Justice League, founded in 2016 with the mission to raise public awareness of biases in AI, is another of these organizations. They have highlighted the dangers of using AI algorithms in situations where we previously had human interactions, where human managers, for example, had to make decisions on, let's say, loan applications. A decision is obviously not immune to biases, racism and sexism just because it is made by a machine: if the algorithm itself or the training data include these biases, we might still end up with an AI method that produces biased results.
The study by Thomas et al. also gave some practical recommendations. For example, one should implement a general statement on the firm's or institution's intentions for AI ethics, ideally an AI ethics framework or charter. This should be an extension of the firm's already existing mission or purpose statement, which is usually in place, especially at large companies. One should implement an internal, application-specific design plan, and one should regularly audit the processes, so that ethical risks, and concerns that these programs might produce unethical decisions, are flagged and, in the end, one can do something about them.
And of course, one should keep records of decisions concerning ethical trade-offs, for transparency. Sometimes you need to make a trade-off. This doesn't mean that you have to act unethically, but in some cases you need, for example, to weigh data privacy concerns against business purposes, not just for your company but also for the benefit of your customers. In this case, these trade-offs should be recorded, just for transparency.
How are artificial intelligence and ML regulated, especially from an ethical perspective? Well, actually, this is still in its infancy. Artificial intelligence and machine learning have not even been clearly defined yet. We will later see the definition by the German BaFin, but this is something in the making. For example, the European Commission has stated in its digital finance strategy that it intends to clarify, together with the European supervisory authorities, by 2024 at the latest, so in almost three years, whether and how existing financial market regulation should be applied to the use of big data and artificial intelligence.
Now, BaFin has proposed several principles for the general use of algorithms in the decision-making processes of financial firms. This is the link to the document. You can find it under BaFin and then "Big Data und künstliche Intelligenz". These are principles that represent preliminary considerations. It's not yet regulation, but these are some first ideas by BaFin for minimum supervisory requirements regarding the use of artificial intelligence in supervised financial institutions.
And this is the definition of AI according to BaFin. They say artificial intelligence is a combination of large amounts of data, that is big data, computing resources, and machine learning. This is rather a different definition from the one in our very first lecture, but this is the one that BaFin chose.
In machine learning, they say, computers are given the ability to learn from data and experience on the basis of special algorithms. Compared to rule-based methods, learning takes place without the programmer specifying which results are to be derived from certain data constellations, and how. So big data plus machine learning plus high computational power: that's artificial intelligence in the definition of the German BaFin. And again, they state this not in a regulation; this is basically a newsletter, a publication in which they summarize their first ideas on what AI is, what it entails, and what they might be doing in the future on its regulation.
They also say that algorithms are rules of action that are usually integrated into a computer program and solve an optimization problem or a class of problems. In addition to the distinction according to the type of algorithm, that is, how the problem is technically solved, applications of machine learning can also be differentiated according to result types, with the basic distinction between classification, regression and clustering, and according to data types. That's what BaFin says.
They also have some overriding principles, which they mention in their publication. This is in a way typical for German supervision and regulation: it usually states the responsibility of top management. Senior management is responsible for company-wide strategies and guidelines or policies for the use of algorithm-based decision-making processes. We can find the same kind of regulation when it comes to risk management: management is responsible for overall enterprise risk management. The potential of such processes, as well as their limits and risks, should be taken into account and clearly stated, and a company-wide strategy for the use of algorithm-based decision processes should also be reflected in the IT strategy.
You can see that, at this point, they have these overriding principles that are very similar to the qualitative regulation we've seen in Basel II and Basel III.
Next, adequate risk and outsourcing management. Financial institutions should establish a risk management system adapted to the use of algorithm-based decision-making processes. If applications are sourced from a service provider, management must also have effective outsourcing management in place. Clearly, if you're using AI and ML and you're outsourcing this, then you should also keep track of your service provider. Responsibility, reporting and control structures must be clearly defined. When establishing adequate risk management, one needs to consider the risks of algorithmic decision-making processes. That is, risk-mitigating measures and processes should start exactly where the risk originates, according to the polluter pays principle.
Okay. Also, one should prevent bias. Bias, that is, the systematic distortion of results in algorithm-based decision-making processes, must be avoided in order to be able to make business decisions based on results that are not systematically distorted, and to exclude the possibility of systematic, bias-based discrimination against certain customer groups. And this is important to stress, and it is probably one of the reasons why financial supervisors and regulators will be concerned about AI and ML: if you have ethical problems in the use of AI and ML in a financial institution, this might cause reputational risk. It might damage your company's reputation, and at that point it becomes a concern for the regulator. In accordance with the polluter pays principle, the risk of bias must be identified where it can arise. It must be analyzed and either eliminated or at least mitigated.
Now, some things need to be regulated, but some other things are actually prohibited by law.
For some financial services, it is stipulated by law that certain characteristics may not be used for differentiation, that is, for calculating risk or for calculating prices and premiums. The danger of discrimination still exists if these characteristics are replaced by an approximation. For example, if instead of using, let's say, gender or ethnicity, you use hometown, age and income group, then the algorithm might simply substitute one feature by three other features that are correlated with it. This, again, would be associated with increased reputational risks and also legal risks. So it's in the best interest of a financial institution to prevent this from happening, and companies should establish statistical verification processes that exclude discrimination and such a substitution of features within AI and ML processes.
Now, this was true for all financial institutions. We can also find some more information from EIOPA, the European Union's insurance and pension fund supervisory authority. They had a group on digital ethics, which has formulated AI governance principles. These are subdivided into human oversight, robustness and performance, data governance and record keeping, transparency and explainability, fairness and non-discrimination, and, last but not least, the principle of proportionality.
Let's talk about human oversight first. Again, this is for European insurance companies, but most of these things are obviously also applicable to financial institutions in general. Insurance firms should establish adequate levels of human oversight, taking into account the impact of specific AI use cases and the other governance and control measures in place. The level of human oversight they select should be proportionate to the nature, scale and complexity of the risk inherent in the specific AI use case in that insurance company.
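Coming back to the substitution problem described above, where features such as hometown and income group stand in for a characteristic that may not be used: one simple way to sketch a statistical verification step is to ask how well the candidate features predict the protected characteristic. If they predict it much better than a blind guess, they can act as a proxy. The sketch below is Python with the standard library only. The function name `proxy_strength`, the field names and the toy records are assumptions made for illustration; they do not come from BaFin or from the lecture.

```python
from collections import Counter, defaultdict


def proxy_strength(records, proxy_keys, protected_key):
    """Return (baseline, achieved) accuracy of guessing the protected attribute.

    baseline: always guess the most common protected value overall.
    achieved: guess the most common protected value within each
              combination of proxy feature values (in-sample).
    """
    overall = Counter(r[protected_key] for r in records)
    baseline = max(overall.values()) / len(records)

    # Count protected values separately for every combination of proxy values.
    by_group = defaultdict(Counter)
    for r in records:
        group = tuple(r[k] for k in proxy_keys)
        by_group[group][r[protected_key]] += 1

    # Within each group, the best guess is the majority value.
    correct = sum(max(counts.values()) for counts in by_group.values())
    return baseline, correct / len(records)


# Hypothetical toy data: gender may not be used for pricing.
records = [
    {"hometown": "A", "income": "low", "gender": "f"},
    {"hometown": "A", "income": "low", "gender": "f"},
    {"hometown": "A", "income": "high", "gender": "m"},
    {"hometown": "B", "income": "high", "gender": "m"},
    {"hometown": "B", "income": "high", "gender": "m"},
    {"hometown": "B", "income": "low", "gender": "f"},
]

baseline, with_hometown = proxy_strength(records, ["hometown"], "gender")
_, with_both = proxy_strength(records, ["hometown", "income"], "gender")
print(baseline, with_hometown, with_both)
```

In this toy data, guessing blindly is right half of the time, while hometown and income group together recover gender for every record, so a pricing model using them could reproduce the prohibited differentiation. Because the accuracy is measured on the same records used to form the guesses, it overstates proxy strength on small samples; a real verification process would use held-out data and proper statistical tests.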
Different roles and responsibilities for the staff involved in AI processes should be clearly defined in policy documents. That's human oversight.
Robustness and performance. The firm should assess and monitor the performance of its AI systems on an ongoing basis and take due consideration of their limitations and potential shortcomings. Performance metrics should be adapted to the objective pursued and the nature of the data used: you should check whether you're actually achieving the goals you've set yourself. Sound data management is obviously key to ensuring the performance of AI systems, and they should produce stable outcomes over time. Otherwise, they don't make much sense from a business perspective, nor for the regulator. Insurance firms should also develop resilient IT systems and infrastructure that cannot be tampered with, for example.
Data governance and record keeping. Insurance firms should adapt their data governance and record keeping measures to the impact of specific AI use cases, and the data used in AI models should be accurate, complete and appropriate. Again, sound data governance should be applied throughout the AI model life cycle. The data used in AI models should be handled and stored in a secure manner, because, especially in insurance, this is usually highly sensitive and confidential customer data. Appropriate records of the data and the modeling methods should be kept to ensure reproducibility and traceability.
Transparency and explainability. Firms should adapt the types of explanations to specific AI use cases and to the recipient stakeholders. That is, firms should adapt their explanations to the different types of stakeholders, and they should strive to use explainable AI models, in particular in high-impact AI use cases. The data used need to be transparently communicated, and as a result, again, we need data security, data governance and sensible data management.
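The robustness principle above asks for outcomes that stay stable over time. One way to sketch such a check is the population stability index (PSI), a drift measure often used in credit scoring, which compares the distribution of model scores in a reference period with the current one. The principles discussed here do not prescribe any particular metric, so the choice of PSI, the function names, the bin edges and the toy scores below are all assumptions for illustration.

```python
import math
from bisect import bisect_right


def bin_shares(values, edges, floor=1e-6):
    """Share of values per bin; edges are the sorted inner bin boundaries.

    Empty bins get a small floor so the logarithm below stays defined.
    """
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect_right(edges, v)] += 1
    return [max(c / len(values), floor) for c in counts]


def population_stability_index(reference, current, edges):
    """PSI = sum over bins of (cur - ref) * ln(cur / ref)."""
    ref = bin_shares(reference, edges)
    cur = bin_shares(current, edges)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Hypothetical model scores at training time and in production.
reference_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
current_scores = [0.5, 0.6, 0.6, 0.7, 0.7, 0.8, 0.8, 0.9]
edges = [0.25, 0.5, 0.75]

psi_same = population_stability_index(reference_scores, reference_scores, edges)
psi_shift = population_stability_index(reference_scores, current_scores, edges)
print(psi_same, psi_shift)
```

Identical distributions give a PSI of zero, and every bin contributes a non-negative amount, so larger values indicate a larger shift. Which value should trigger a model review is a policy decision for the firm; rules of thumb exist in credit scoring practice, but none is assumed here.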
Fairness and non-discrimination. Sound and transparent governance processes are key to ensuring fairness and non-discrimination, especially when it comes to the calculation of insurance premiums. Otherwise, this could lead to reputational risks. Insurance firms should conduct their business in a fair manner when using AI and make reasonable efforts to take into account the outcomes of AI systems. Consumers who are not willing to share very personal and sensitive data that are not strictly necessary for risk assessments should still have access to affordable insurance coverage. And insurance firms should respect the principle of human autonomy by developing AI systems that support consumers in their decision-making process and avoid unfair nudging practices introduced by the use of AI methods.
And last but not least, the principle of proportionality. Insurance undertakings should establish the necessary governance measures, proportionate to the nature, scale and complexity of their operations. With AI use cases and use case impact assessments, the governance measures should be proportionate to the potential impact of a specific AI use case on consumers and on insurance firms. Insurance firms should then assess the combination of measures put in place in order to ensure an ethical and trustworthy use of AI within, for example, premium calculation.
These are the principles set forth by EIOPA's GDE, its group on digital ethics in insurance. You can see that many of these things can also be applied to financial institutions in general, and this will be a huge topic in the years to come. With more applications, more models and even higher utility stemming from the use of AI and ML in finance, we will see more ethical problems, more use cases, and obviously more regulation and principles set forth by regulators and supervisors like EIOPA. And with this, we are at the end of the lecture.
If you are interested in more of the literature, the tools and the references we've used, you will find all of these on this last slide. I hope you've enjoyed the lecture and have learned a little bit about the usage of AI and ML in finance. Good luck with the exam.