2021 Data Science Interview Preparation Guide | Dr. Gary White | Skillshare


2021 Data Science Interview Preparation Guide

Dr. Gary White, Senior Data Scientist


Lessons in This Class

73 Lessons (3h 27m)
    • 1. Promo

      1:49
    • 2. What is the bias-variance tradeoff?

      4:12
    • 3. How is KNN different from k-means?

      3:02
    • 4. How would you implement the k-means algorithm?

      2:53
    • 5. How would you implement the k-means algorithm?

      2:23
    • 6. What are the pros and cons of the k-means algorithm?

      4:17
    • 7. What does an ROC curve show?

      3:26
    • 8. What does an ROC curve show?

      2:51
    • 9. Define precision and recall.

      4:31
    • 10. What is k-fold cross validation?

      3:26
    • 11. Explain what a false positive and a false negative are.

      2:52
    • 12. When would you use random forests vs. SVM and why?

      1:22
    • 13. Why is dimension reduction important?

      3:19
    • 14. What is principal component analysis? Explain the sort of problems you would use PCA for.

      3:56
    • 15. What is principal component analysis? Explain the sort of problems you would use PCA for.

      3:41
    • 16. What are some of the steps for data wrangling and data cleaning before applying machine learning algorithms?

      3:30
    • 17. What is multicollinearity and how do we deal with it?

      1:26
    • 18. You are given a dataset on cancer detection. You have built a classification model and achieved an a

      2:25
    • 19. You are given a dataset on cancer detection. You have built a classification model and achieved an a

      4:12
    • 20. You have the 95th percentile of web server response times generated every 2 seconds for the last year

      3:10
    • 21. What is Bayes’ Theorem? How is it useful in a machine learning context?

      6:04
    • 22. What is ‘Naive’ in a Naive Bayes?

      2:02
    • 23. Explain the difference between L1 and L2 regularization.

      2:41
    • 24. What cross-validation technique would you use on a time series dataset?

      1:55
    • 25. How can time-series data be declared stationary? What statistical test would you use?

      3:48
    • 26. What are the main components of an ARIMA time series forecasting model?

      1:45
    • 27. How do you ensure you’re not overfitting with a model?

      3:00
    • 28. You are given a data set consisting of variables with a lot of missing values. How will you deal wit

      7:35
    • 29. You are given a data set consisting of variables with a lot of missing values. How will you deal wit

      2:07
    • 30. You are given a data set consisting of variables with a lot of missing values. How will you deal wit

      0:52
    • 31. Your organization has a website where visitors randomly receive one of two coupons. It is also possi

      1:37
    • 32. What is the difference between univariate, bivariate and multivariate analysis?

      1:14
    • 33. Explain the SVM algorithm.

      2:02
    • 34. Describe in brief any type of Ensemble Learning.

      3:36
    • 35. What is a Box-Cox Transformation?

      1:53
    • 36. What is the Central Limit Theorem?

      3:00
    • 37. What is sampling?

      1:34
    • 38. Give 4 examples of probability-based sampling methods and how they work

      2:54
    • 39. Give 4 examples of non probability-based sampling methods and how they work

      2:16
    • 40. What are the advantages and disadvantages of neural networks?

      4:39
    • 41. What in your opinion is the reason for the popularity of Deep Learning in recent times?

      4:26
    • 42. Why do we use convolutions for images rather than just FC layers?

      1:06
    • 43. What Is the Difference Between Epoch, Batch size, and Number of iterations in Deep Learning?

      1:21
    • 44. What makes CNNs translation invariant?

      2:15
    • 45. What are the 4 main types of layers used to build a CNN?

      2:56
    • 46. What are 3 types of spatial pooling that can be used?

      1:42
    • 47. What is the stride in convolutional layers?

      1:29
    • 48. What are vanishing and exploding gradients?

      1:33
    • 49. What are 4 possible solutions to vanishing and exploding gradients?

      2:55
    • 50. What is dropout for neural networks? What effect does dropout have?

      2:43
    • 51. Why do segmentation CNNs typically have an encoder-decoder style / structure?

      2:31
    • 52. What is batch normalization and why does it work?

      1:26
    • 53. Why would you use many small convolutional kernels such as 3x3 rather than a few large ones?

      0:42
    • 54. What is the idea behind GANs?

      2:32
    • 55. Why do we generally use the Softmax non-linearity function as the last operation in a network?

      2:12
    • 56. What is the following activation function? What are the advantages and disadvantages of this activat

      2:42
    • 57. What is the following activation function? What are the advantages and disadvantages of this activat

      1:54
    • 58. What is the following activation function? What are the advantages and disadvantages of this activat

      1:51
    • 59. What is backpropagation and how does it work?

      2:47
    • 60. What are the common hyperparameters related to neural network structure?

      1:17
    • 61. What are the common hyperparameters related to training neural networks?

      2:31
    • 62. What are 4 methods of hyperparameter tuning?

      4:23
    • 63. Using the above neural network key, state the name of the following network and give some basic info

      3:19
    • 64. Using the above neural network key, state the name of the following network and give some basic info

      2:12
    • 65. Using the above neural network key, state the name of the following network and give some basic info

      3:25
    • 66. Using the above neural network key, state the name of the following network and give some basic info

      4:27
    • 67. Using the above neural network key, state the name of the following network and give some basic info

      3:15
    • 68. Using the above neural network key, state the name of the following network and give some basic info

      4:09
    • 69. Using the above neural network key, state the name of the following network and give some basic info

      3:26
    • 70. Using the above neural network key, state the name of the following network and give some basic info

      2:17
    • 71. Using the above neural network key, state the name of the following network and give some basic info

      4:51
    • 72. Using the above neural network key, state the name of the following network and give some basic info

      3:06
    • 73. Using the above neural network key, state the name of the following network and give some basic info

      4:12

14 Students

-- Projects

About This Class

Being a data scientist is one of the most lucrative and future-proof careers, with Glassdoor naming it the best job in America for the third consecutive year, with great future growth prospects and a median base salary of $110,000. I have recently made the transition from being a PhD student in Computer Science to a Senior Data Scientist at a large tech company. In this course I give you all the questions and answers that I used to prepare for my data science interviews, as well as the questions and answers that I now expect when I am interviewing potential data science candidates. The course provides a complete list of 150+ questions and answers that you can expect in a typical data science interview, including questions on machine learning, neural networks and deep learning, statistics, practical experience, big data technologies, SQL, computer science, culture fit, questions for the interviewer and brainteasers.

What questions will you learn the answer to?

  • What is the bias-variance tradeoff?

  • How would you evaluate an algorithm on unbalanced data?

  • When would you use gradient descent (GD) over stochastic gradient descent (SGD), and vice-versa?

  • Why do segmentation CNNs typically have an encoder-decoder style / structure?

  • Why do we generally use the Softmax non-linearity function as the last operation in a network?

  • You randomly draw a coin from 100 coins: 1 unfair coin (head-head) and 99 fair coins (head-tail). You flip it 10 times. If the result is 10 heads, what is the probability that the coin is unfair?

  • Given the following statistics, what is the probability that a woman has cancer if she has a positive mammogram result? 1% of women have breast cancer, 90% of women who have breast cancer test positive on mammograms, and 8% of women who do not have breast cancer will also test positive.

  • Write a SQL query to get the second highest salary from the Employee table. If there is no second highest salary the query should return null.

  • What is the average time complexity to search an unsorted array?

  • Why do you want to work here?

  • How can you generate a random number between 1 and 7 with only a die?

About the instructor:

  • Senior Data Scientist at a large tech company

  • Recently finished PhD in Computer Science and moved to industry

  • 5+ years teaching experience at university level

Meet Your Teacher


Dr. Gary White

Senior Data Scientist


Hello, I am a senior data scientist from Ireland. I recently finished my PhD in Computer Science and I am hoping to teach classes that I would have liked to have had while I was a student. My research and teaching experience is in machine learning and data science. I also have experience working with distributed systems and now work in industry for a large tech company.




Transcripts

1. Promo: Welcome to this data science interview preparation guide. In this course, we're going to go through over 150 of the most common interview questions used to evaluate data science candidates. My name is Dr. Gary White. I am a senior data scientist at a large tech company and I have conducted a huge number of interviews for data analyst, data scientist and senior data scientist positions, so I know most of the commonly used interview questions and answers. After taking this course, you will be able to answer questions across different topics such as machine learning, neural networks and deep learning, statistics and SQL, and you will know some questions that you should ask the interviewer. The course is structured into ten sections, each with a number of questions and answers that we go through in detail. For example, in the first section, on machine learning, we cover what the bias-variance tradeoff is, give a clear textual explanation of what exactly it means, and show some pictures that highlight the difference between bias and variance. We also cover a number of other questions, such as what is 'naive' in the Naive Bayes algorithm. Then we move on to the next section, on neural networks and deep learning, with some of the most commonly used activation functions, as well as looking at different deep neural network architectures, including the most commonly used ones. We also have sections on statistics, practical experience, big data technologies, SQL, computer science, culture fit, questions you should ask the interviewer and brainteasers. Data science is one of the most lucrative and future-proof careers, with Glassdoor naming it the best job in America for the last three consecutive years, with great future prospects and a median base salary of $110,000. So if you are a student or professional looking to get a job in data science, enroll in the course now or try some of the available lectures to see the topics that we cover.

2. What is the bias-variance tradeoff?: In this question we're looking at what the bias-variance tradeoff is, so we're going to begin by defining both of these terms. Bias is the difference between the average prediction of our model and the correct value which we are trying to predict. Models with high bias pay very little attention to the training data and oversimplify the model, which leads to high error on both the training and test datasets. If we look at the model shown here, we can see that it has clearly been underfit. In this case we have high bias because we're trying to use a straight-line linear model to capture quadratic data, and it's obviously not doing a great job: there are large differences between the data points and the values the model predicts. This line isn't a good predictor for the quadratic data, so the model has high bias. Variance is the variability of the model's prediction for a given data point, coming from sensitivity to small fluctuations in the training data. High variance can cause the algorithm to model the random noise in the training data, leading to overfitting, which results in poor performance on data the model has not seen before, i.e. your test or validation data. As a result, such models perform well on the training data but have high error rates on the test data.
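As a quick illustration of the tradeoff (a minimal sketch, not taken from the class slides, using scikit-learn and made-up noisy quadratic data), fitting a straight line, a quadratic and a degree-15 polynomial to the same data shows high error everywhere for the underfit model, and low training error but high test error for the overfit one:

```python
# Minimal sketch (assumed example): underfitting vs. overfitting on noisy quadratic data.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, 60)).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(scale=1.0, size=60)      # quadratic signal + noise
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in (1, 2, 15):                                 # high bias, balanced, high variance
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_tr, y_tr)
    print(degree,
          round(mean_squared_error(y_tr, model.predict(X_tr)), 2),   # training error
          round(mean_squared_error(y_te, model.predict(X_te)), 2))   # test error
```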
If we look at the other example, with the model fit in red and the data points in blue, we can see a model that has been overfit: this is where we get high variance. Even though neighboring points are quite close together, the fitted curve makes a sudden jump down, then a sudden jump up, then down again, then another sudden jump up. It is capturing the random fluctuations in the training data rather than what the overall model should look like, and including that random noise increases the error on new data. The bias-variance tradeoff is about managing the complexity between these two characteristics. If our model is too simple and has very few parameters, it may have high bias and low variance. On the other hand, if our model has a large number of parameters, it is going to have high variance and low bias. So the bias-variance tradeoff is about finding a good balance between overfitting and underfitting the data. There is a tradeoff because we can't make an algorithm more complex and less complex at the same time. The third example shows what a good balance looks like: we capture the overall structure of the data without building the random fluctuations of individual points into the model, and we get a smooth line that will generalize well to unseen data, i.e. our test data. If you look at the graph at the bottom, you can see that as model complexity increases we get an increase in variance; this is when we start to overfit, because we have too many parameters, which increases the total error. As model complexity decreases, the bias starts to go up; this is when the model is too simple and doesn't capture the overall structure of the data. Where we want to get to is the point where the total error is minimized, the optimal model complexity, where the combination of variance and bias is reduced to the lowest possible point. That's basically what the bias-variance tradeoff is.

3. How is KNN different from k-means?: This question is very popular in data science interviews, and it checks your basic understanding of two very popular algorithms: how is KNN different from k-means? Although they both have a k in their name, they are totally different algorithms used for completely different purposes. K-nearest neighbors is a supervised classification algorithm, while k-means is an unsupervised clustering algorithm. The mechanisms may seem quite similar at first, since both use k points, but they are completely different. For k-nearest neighbors to work, you need labeled data points to act as the neighbors of the unlabeled point you want to classify. K-means clustering requires only a set of unlabeled points (and a choice of k); the algorithm takes the unlabeled points and gradually learns how to cluster them into groups by computing the mean distance between the points and the cluster centers. The critical difference is that k-NN needs labeled points and uses supervised learning, while k-means doesn't: it uses unsupervised learning. The way I typically remember it is with a small example. Say you have some blue data points and some pink data points, and you want to make a prediction for a new point. With k-nearest neighbors, we look at the k nearest neighbors; if we set k to three, the nearest neighbors would be points 1, 2 and 3, and using those we can make a prediction for the new point. If two of them are blue and one is pink, we would label the new point as blue. So k-nearest neighbors uses the supervised labels, blue or pink, to classify the new point. If we were using clustering instead, we wouldn't know anything about these labels. We would just try to create, say, two different clusters: we might cluster these points into one group and those points into another, and if we had three clusters instead of two, the points might be split into three groups. The important thing to remember is that KNN and k-means are completely different algorithms: KNN needs labeled data points and is supervised learning, while k-means doesn't; it's an unsupervised algorithm that gives you clusters.
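To make the supervised versus unsupervised distinction concrete, here is a minimal sketch (an assumed example, not part of the class materials) using scikit-learn: k-nearest neighbors is fit on labeled points, while k-means only ever sees the unlabeled points:

```python
# Minimal sketch (assumed example): k-NN needs labels, k-means does not.
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

X, y = make_blobs(n_samples=200, centers=2, random_state=42)

# Supervised: k-nearest neighbors is trained on labeled points.
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict(X[:1]))          # predicts a class label for a point

# Unsupervised: k-means only sees the points and learns its own clusters.
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
print(km.labels_[:5])              # cluster assignments, not true labels
```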
4. How would you implement the k-means algorithm?: Another question that is asked often is how you would implement the k-means clustering algorithm. First of all, we need to specify the number of clusters k that we're going to use. Once we have specified the number of clusters, we initialize the centroids by shuffling the dataset and then randomly selecting k data points as the centroids, without replacement. One important thing when using the k-means algorithm is that your initialization points have to be reasonably good, or it can lead to some strange results, so it's usually worth running it a number of times and also experimenting with different numbers of clusters k using the elbow method, which we'll talk about in the next question. The third step is iterative, and we keep going until there is no change in the centroids and no change in the labels assigned to the data points: first we compute the sum of the squared distances between the data points and all of the centroids, then we assign each data point to its closest centroid, and then we recompute the centroid of each cluster by taking the average of all the data points that belong to that cluster. We keep iterating through these steps until there is no change in the assignment of the data points. You can see this in the animation, which runs for about 13 iterations. Going back to the start, the red centroid, the big cross, slowly moves towards one cluster, and the yellow centroid moves towards another group of points.
The blue centroid stays roughly where it is and captures the remaining data. At each iteration each centroid captures more of the points that are closest to it, and the yellow centroid ends up with the data in its own region because all of those points are closer to it. We get quite good results here, and because the initialization was quite good, with the centroids spread out, these are the results we would typically expect. That's how we would implement the k-means clustering algorithm.

5. How would you implement the k-means algorithm?: One important aspect of using the k-means clustering algorithm is how you choose the number k, and one of the ways we can choose it is by using the elbow method. In cluster analysis, the elbow method is a heuristic we can use to determine the number of clusters in a dataset. If you remember, in the previous question I was talking about how we could cluster the data into two or three different clusters, but how would we know which is the most suitable number of clusters to use? The elbow method consists of plotting the distortion, which is the sum of squared differences between each data point and its assigned center, as a function of the number of clusters; the elbow of the curve is then the number of clusters to use. The same method can be used to choose the number of parameters in other data-driven models, such as the number of principal components in a dataset. In the example plot we have a list of values of k, the number of clusters we think we should use, against the distortion score, the sum of squared distances between the points and the centers they have been assigned to. With four clusters the distortion score is still quite high; we get a big drop by using five clusters and another big drop with six, which makes the clustering much more accurate. At seven we start to reach the elbow point: we stop getting big jumps in performance, and at 8, 9, 10 and 11 there is not a huge change compared with 7. So, using the distortion score across the different numbers of clusters, we place the elbow at k equal to seven. This is a useful way to calculate the optimal number of clusters to use for our data.
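The steps above translate fairly directly into code. This is a minimal from-scratch sketch (an assumed implementation, not the instructor's, and it skips edge cases such as empty clusters), followed by the elbow-method loop from this question using scikit-learn's inertia as the distortion score:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def kmeans(X, k, n_iters=100, seed=0):
    """From-scratch k-means following the steps above (no empty-cluster handling)."""
    rng = np.random.RandomState(seed)
    # 1. initialize centroids by picking k points at random, without replacement
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 2. assign each point to its closest centroid (squared Euclidean distance)
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # 3. recompute each centroid as the mean of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4. stop when the centroids (and therefore the assignments) no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X, _ = make_blobs(n_samples=300, centers=4, random_state=1)
labels, centroids = kmeans(X, k=4)

# Elbow method: print the distortion (inertia) for a range of k and look for
# the point where the improvement levels off.
for k in range(1, 9):
    print(k, round(KMeans(n_clusters=k, n_init=10, random_state=1).fit(X).inertia_, 1))
```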
6. What are the pros and cons of the k-means algorithm?: So what are the pros and cons of the k-means clustering algorithm? One of the pros is that it is simple to understand. This is becoming increasingly important for machine learning algorithms, especially in certain applications like healthcare, where we need algorithms that are simple to understand and where we must be able to explain why a particular person has been clustered into a certain group. You can imagine that in healthcare, if you are using these clusterings for some medical application, you need to be able to understand and explain why a particular person ended up in a group, because there may be serious consequences as a result of the analysis. So you need algorithms that are simple to understand and explainable, and explainable AI has become a really popular research field in recent years; one of the good things about k-means is that it is very simple to understand. Another pro of the algorithm is that it clusters quickly and is easy to implement. It is fast because it is quite a simple, iterative algorithm: you assign points to their nearest centroids and update, so you can get your clusters quite fast. One of the cons of the algorithm is that we need to pick the number of clusters, i.e. we need to choose the k clusters that our data is going to be grouped into. We saw in the previous question how we can do this with the elbow method, where we plot the number of clusters on the x-axis against an error metric and choose the elbow point, where we stop getting a big reduction in the error. Another con of the k-means algorithm is that it is sensitive to initialization and to outliers. Say, for example, we have groups of data points in a few different regions. If we initialize one centroid in one region and two other centroids quite close together, the clustering can run into issues: where two centroids are close there is competition between them, and one cluster may end up capturing only a few points while another starts invading its neighbor's space. If we had initialized the centroids further apart, we might have ended up with more suitable clusters. So one of the problems with k-means clustering is that it can be quite sensitive to initialization and to outliers. Another issue with the k-means algorithm is that it doesn't perform well when the clusters aren't simple, roughly spherical groups; for example, if we have two concentric circles and try to cluster them, k-means doesn't perform well. We also need to perform some standardization when we're using the algorithm.

7. What does an ROC curve show?: An ROC curve is constructed by plotting the true positive rate against the false positive rate. In the graph we have the true positive rate on the y-axis and the false positive rate on the x-axis. The true positive rate is the proportion of observations that were correctly predicted to be positive out of all the observations that are actually positive, i.e. the true positives over the true positives plus the false negatives. Similarly, the false positive rate is the proportion of observations that are incorrectly predicted to be positive out of all the negative observations, i.e. the false positives over the true negatives plus the false positives. For example, in medical testing, the true positive rate is the rate at which people are correctly identified as testing positive for the disease in question.
The ROC curve shows the trade-off between sensitivity, which is the true positive rate, and specificity, which is one minus the false positive rate. Classifiers that give curves closer to the top-left corner indicate better performance. As a baseline, a random classifier is expected to give points along the diagonal, which is where the false positive rate equals the true positive rate, and the closer the curve comes to this 45-degree diagonal of ROC space, the less accurate the test. We often want a numerical output, and one of the ways we can compare different classifiers and summarize their performance in a single measure is to calculate the area under the ROC curve, abbreviated to AUC. This gives you an actual number as opposed to the graphical interpretation. In the example we have the true positive rate on the y-axis and the false positive rate on the x-axis. The diagonal is basically a useless, no-skill model: its false positive rate equals its true positive rate. The logistic model is the next curve, and we can see that it captures quite a lot of the area under the curve, which is all of the area beneath it, but it could probably be improved. Say we had another model, some deep learning algorithm shown as the yellow line; it captures more of the space, including points that weren't captured by the logistic model, so it would have a greater area under the curve. The important things to remember for the ROC curve are that the y-axis is the true positive rate, the x-axis is the false positive rate, we want our algorithms to capture as much of this area as possible, and we can summarize that with the AUC metric.
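As a small example of how this is usually computed in practice (assumed code, not from the class, using scikit-learn and synthetic data), roc_curve returns the false positive and true positive rates and roc_auc_score gives the single-number AUC summary:

```python
# Minimal sketch (assumed example): ROC curve and AUC from predicted probabilities.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, probs)   # x-axis: FPR, y-axis: TPR
print("AUC:", roc_auc_score(y_te, probs))       # single-number summary of the curve
```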
8. What does an ROC curve show?: What is the difference between a type one error and a type two error (obviously, apart from one having a one in it and the other a two)? A type one error is when the null hypothesis is true but is rejected. This is a false positive error: asserting that something is true when it is actually false. If we take the classic story of the shepherd and the wolf, where the null hypothesis is that there is no wolf present, a type one error, or false positive, would be crying wolf, i.e. saying there is a wolf present when actually there is none. A type two error is when the null hypothesis is false but we erroneously fail to reject it. This is a false negative error, where the test indicates the condition failed when actually it held. In the shepherd and wolf example, this would be doing nothing, not crying wolf, when there actually was a wolf present. The table of error types against the null hypothesis is quite useful here. If the null hypothesis is true and we reject it, that is a type one error; if we fail to reject it, that is a true negative. If the null hypothesis is false and we reject it, that is a true positive; if we fail to reject it, i.e. we accept the null hypothesis even though it is false, that is a type two error, a false negative. Really, type one and type two errors are all about the null hypothesis and whether you accept or reject it. If the null hypothesis is true and you reject it, it's a type one error; if the null hypothesis is false but you fail to reject it, i.e. you accept it, then it's a false negative. The shepherd and the wolf example is an easy way of remembering this: a type one error, or false positive, is crying wolf, saying there is a wolf present when actually there is none, and a type two error is doing nothing, not crying wolf, when there actually is a wolf present. So just remember this table when you're trying to answer this question, and think about whether the null hypothesis is true and whether you rejected it or failed to reject it.

9. Define precision and recall.: Precision is also known as the positive predictive value. It is a measure of the number of accurate positives your model claims compared to the total number of positives it claims. The definition of precision is the true positives over the true positives plus the false positives, which is what we mean by comparing the number of true positives to the total number of positives your model claims: the total number of positives the model claims is the true positives plus the false positives. True positives are where our model predicts a positive and the actual value is positive; false positives are where the model predicts a positive but the actual value is negative. Recall is also known as the true positive rate: the number of positives the model claims compared to the actual number of positives in the data. If we look at the definition of recall, the numerator is again the true positives, but in this case we divide by the number of positives in the data, i.e. the true positives plus the false negatives. What this captures is how much of the positive data was recalled. The true positives, again, are where the label is positive and we predicted positive, while the false negatives are where the label was positive but we predicted negative. It can be easier to think of recall and precision in the context of a use case. For example, if we predicted ten apples and five oranges in a case that contains ten apples, then we would have perfect recall, because there are actually ten apples and we predicted all ten, but our precision is 66.7%, because out of the 15 items we predicted, only ten were actually correct. Another way of looking at this is with the figure showing the true positives and the false positives: precision measures how many of the selected items are relevant, i.e. the true positives over the true positives plus the false positives, and recall measures how many of the relevant items are selected. In both cases we are capturing the true positives on top; what makes them different is whether the denominator includes the false negatives or the false positives.
In recall we are capturing the false negatives along with the true positives, and for precision we are capturing the true positives over the true positives plus the false positives. A really good way to think about it is through these questions: for precision, we ask how many of the selected items are relevant; for recall, we ask how many of the relevant items are selected. I find that probably the easiest approach is just to remember the two formulas, precision = TP / (TP + FP) and recall = TP / (TP + FN), and derive everything back from there. The only thing that changes is that in precision the denominator includes the false positives, while in recall you swap the false positives for false negatives; the true positives always remain the same in both. So I would suggest memorizing these and then deriving everything back from these two formulas.
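A minimal sketch of these formulas (the counts come from the apples-and-oranges example above; the scikit-learn calls at the end are an assumed illustration, not from the class):

```python
# precision = TP / (TP + FP), recall = TP / (TP + FN)
tp, fp, fn = 10, 5, 0          # 15 predicted fruit, 10 real apples
precision = tp / (tp + fp)     # 0.667 -> how many selected items are relevant
recall = tp / (tp + fn)        # 1.0   -> how many relevant items are selected
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)

# The same metrics computed directly from labels with scikit-learn:
from sklearn.metrics import precision_score, recall_score, f1_score
y_true = [1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))
```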
10. What is k-fold cross validation?: In this question we're looking at what k-fold cross-validation is. Cross-validation more generally is a statistical method used to estimate the skill of machine learning models. In k-fold cross-validation, the procedure has a single parameter, k, that refers to the number of groups a given data sample is going to be split into; as such, the procedure is often called k-fold cross-validation, and when a specific value for k is chosen it may be used in place of k to reference the method. A really popular choice is k equal to ten, in which case it becomes 10-fold cross-validation. The general procedure for k-fold cross-validation is: first shuffle the dataset randomly, then split the dataset into the k groups we have specified. In the example shown here we have k equal to three and n equal to 12 data points. Then, for each unique group, we take that group as the hold-out or test dataset, use the remaining groups as the training dataset, fit the model on the training data and evaluate it on the test dataset. We retain the evaluation score and discard the model, and we summarize the skill of the model using the sample of evaluation scores. We get multiple evaluation scores by using different parts of the data for training and evaluation, and summarizing them all gives us a more robust evaluation of our algorithm. In the illustration the test data is in blue and the training data is in red, and the animation shows the procedure: we take our data, shuffle it, and because we've set k equal to three with 12 data points, each test fold contains four points and we train on the remaining eight. In the first iteration we train the model on eight points and test on the remaining four, and for the other folds we rotate which group of four is held out, training on the remainder each time, so the model is always trained and tested on different splits. That's what k-fold cross-validation is.
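A small sketch of the procedure (an assumed example, not from the class), using scikit-learn's KFold with k = 3 on 12 synthetic samples to mirror the illustration above:

```python
# Minimal sketch (assumed example): 3-fold cross-validation on 12 samples.
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=12, n_features=4, random_state=0)

kf = KFold(n_splits=3, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(X):
    print("train:", train_idx, "test:", test_idx)   # 8 train / 4 test per fold

# Summarize the model's skill as the set of per-fold scores and their mean.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)
print(scores, scores.mean())
```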
11. Explain what a false positive and a false negative are.: In this question we're asked to explain what a false positive and a false negative are. As a follow-on, we're also asked to provide an example of when false positives are more important than false negatives, and another example of when false negatives are more important than false positives. A false positive is an incorrect identification of the presence of a condition when it is actually absent, and a false negative is an incorrect identification of the absence of a condition when it is actually present. A good way to think of these is in terms of examples. An example of when a false positive is more important than a false negative is spam detection in your email: if you miss a job offer or an email from your boss because it was mistakenly identified as spam, there can be large consequences, whereas if you occasionally get an email from a Nigerian prince offering you money, you can quite easily ignore it or mark it as spam. If we falsely mark something as spam, that's a false positive, and it doesn't get included in your inbox; missing a job offer or an email from your boss that has been falsely identified as spam has a much bigger consequence than occasionally seeing a random spam letter. An example of when a false negative is more important than a false positive comes up a lot in healthcare applications, and is really important in screening for cancer. It is much worse to say that someone doesn't have cancer when they do than to say that someone does have cancer, do some additional tests, and later realize that they don't. In the first case we are getting a false negative, because we're saying that someone doesn't have cancer when they actually do. So there's a clear trade-off here, where false negatives are much more important than false positives. These are two good examples: false positives are more important than false negatives in the spam detection case, and false negatives are more important than false positives when screening for cancer, and in a lot of other medical applications.

12. When would you use random forests vs. SVM and why?: The next question is: when would you use random forests versus SVMs, and why would you make that choice? There are a couple of reasons why random forests may be a better choice of model than support vector machines (SVMs). One reason is that random forests allow you to determine feature importance, and one of the limitations of SVMs is that they are unable to do this. An additional benefit of random forests is that they are typically much quicker and simpler to build than an SVM; you can use a lot of different kernel functions with SVMs, but they can be quite complicated and take much longer to fit compared to random forests. Finally, for multiclass classification problems, SVMs require a one-versus-the-rest method, which is less scalable and more memory intensive. So if you want something that is going to be quick and simple, random forests are typically a better choice than SVM models.

13. Why is dimension reduction important?: In this question we're asked why dimensionality reduction is important. Just to clarify, dimensionality reduction is the process of reducing the number of features we have in our dataset. This is mainly important when you want to reduce the variance of your model, i.e. overfitting. One of the main advantages is that it reduces the time and storage space required: if we have far fewer features to fit in our dataset, it will take much less time to train our models, and the storage space required is also reduced greatly because we simply don't have as much data to store. Another advantage of dimension reduction is the removal of multicollinearity, which is when two or more explanatory variables are related, and this improves the interpretation of the parameters of the machine learning model. Say, for example, we're doing some fruit classification and the size of the fruit and the shape of the fruit are closely related: circular fruit tend to have a typical size, while fruit shaped more like a cucumber typically have a different size, so we get some multicollinearity between these two variables. If we reduce the number of features in the dataset, we can improve the interpretation of the parameters that are actually being used by the machine learning model. Another improvement we get with dimensionality reduction is that it makes it easier to visualize the data when it's reduced to very low dimensions such as 2D or 3D. It's almost impossible for us to really comprehend what data in 20 or 50 dimensions actually looks like, but if we use dimensionality reduction techniques and squeeze our data into two or three dimensions, we can identify clusters that are present. For example, if our data shows one cluster over here, another here and another there based on these latent variables, then we can do more analysis of why those clusters end up where they do. One final advantage of dimensionality reduction is that it avoids the curse of dimensionality. The curse of dimensionality is that if we have data in a high-dimensional space, say 15 or 50 dimensions, the distances between the data points become huge and it becomes really difficult to use standard Euclidean techniques, so it can be important to reduce the dimensionality to 2D or 3D, where these techniques perform much better.

14. What is principal component analysis? Explain the sort of problems you would use PCA for.: In this question, we're asked about a particular dimensionality reduction technique called principal component analysis. We're asked what principal component analysis is, and to explain the sort of problems you would use it for. Principal component analysis, or PCA, is a dimensionality reduction method.
It is often used to reduce the dimensionality of large datasets by transforming the large set of variables into a smaller one that still contains most of the information in the large dataset. That's an important point: we're not going to capture everything; we're capturing most of the important information while greatly reducing the number of features. Reducing the number of variables in a dataset comes at the expense of accuracy, because we're not capturing all of the information, but the trick in dimensionality reduction is to trade a little accuracy for simplicity, because smaller datasets are easier to explore and visualize and make data analysis much easier and faster for machine learning algorithms, without extraneous variables to process. The principal components are new variables constructed as linear combinations, or mixtures, of the initial variables. These combinations are done in such a way that the new variables, the principal components, are uncorrelated, and most of the information within the initial variables is squeezed or compressed into the first components. The idea is that ten-dimensional data gives you ten principal components, but PCA tries to push the maximum possible information into the first component, then the remaining information into the second component, and so on, until you get something like the scree plot shown here: on the y-axis we have the percentage of explained variance, and on the x-axis we have the number of principal components. You can see that we get a huge amount of explained variance from the first principal component and then it drops off almost exponentially: roughly 40% of the explained variance from the first principal component, just under 20% for the second, around 15% for the third, maybe seven or eight percent for the fourth, and five or six percent for the fifth. So the first five or so principal components capture most of the explained variance, and as we keep adding more we get a slightly better explanation of the data, but we also make the representation more complex, because we need to keep more of the principal components. To sum up, the idea of principal component analysis is very simple: we want to reduce the number of variables in the dataset while preserving as much information as possible. PCA is commonly used for compression purposes, to reduce the required memory and to speed up algorithms, as well as for visualization purposes, making it easier to summarize the data. It can be very useful for visualization, especially if you want to plot your data in two or three dimensions; you can use the principal components to achieve that. That's an example of some of the problems you would use PCA for.
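A minimal sketch of PCA in practice (an assumed example using scikit-learn and the Iris data, not from the class): the data is standardized, reduced to two components, and the explained variance ratio plays the role of the scree plot described above:

```python
# Minimal sketch (assumed example): PCA for compression and 2-D visualization.
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = StandardScaler().fit_transform(load_iris().data)   # standardize first
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)                            # coordinates for a 2-D plot
print(pca.explained_variance_ratio_)                   # share of variance per component
```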
15. What is principal component analysis? Explain the sort of problems you would use PCA for.: In this question we're looking at what the assumptions required for linear regression are, and what happens if those assumptions are violated. The assumptions are as follows. First of all, the sample data used to fit the model should be representative of the population: if we're using a subsample of the overall population data, it should be representative of the overall population. The second assumption is that the relationship between x and the mean of y is linear, and this often goes unchecked. When you're using a linear model, the relationship between x and y has to be linear; if, for example, your data is clearly non-linear, there's no way you're going to be able to fit a linear model that accurately captures it. If you do fit a linear model, it will produce some line, but it won't describe the data. One of the problems with a lot of statistical packages, programming tools and libraries is that you will always get an answer and you will always get a model, and if you don't look at the details of the accuracy or look a bit closer at your model, you can think it is a good model even though it bears absolutely no relationship to your data. So it's often important to do some analysis and visualization of your data before you actually train your model. The third assumption is that the variance of the residuals is the same for any value of x, which is known as homoscedasticity. This means the amount of variance should not change along the model: if, for example, the data points are quite close to the line at low values of x and then there is a huge amount of variance at higher values of x, this assumption is violated. Another important assumption is that the observations are independent of each other: there shouldn't be any relationship between the observations. And finally, for any fixed value of x, y should be normally distributed; we typically assume for a lot of these simpler models that the values are normally distributed. Extreme violations of these assumptions make the results redundant, and small violations result in greater bias or variance of the estimator. So even if you only have small violations of these assumptions, your results really aren't that valid unless you make sure, or at least do some statistical checks, on each of the assumptions that you're making.

16. What are some of the steps for data wrangling and data cleaning before applying machine learning algorithms?: So what are some of the steps for data wrangling and data cleaning before we apply our machine learning algorithms? There are many different steps in data wrangling and data cleaning; this is a list of some of the most common ones, and there are many more you could go into, but for answering this question it would be good to hit these four or five points. The first one is data profiling: almost everyone starts off by getting an understanding of their dataset. More specifically, you want to look at the shape of the dataset and get a description of the numerical variables. Another item that can help with data profiling is data visualization: sometimes it's useful to visualize your data with histograms, box plots and scatter plots to get a better understanding of the relationships between your variables and also to identify potential outliers or values that seem a bit strange in your data.
Another step we can take before applying our machine learning algorithms is standardization or normalization. Depending on the dataset we're working with and the machine learning method we decide to use, it may be useful to standardize or normalize the data so that the different scales of different variables don't negatively impact the performance of the model. Some machine learning algorithms are more susceptible to outliers and differing scales than others, and some handle them quite well, but you need to do that analysis first: you need to know whether your algorithm can handle them, and if it can't, you need to do some standardization or normalization. For example, if you're looking at house prices and one of the features is the number of rooms while another is the square footage, then because the square footage is such a large number compared to the number of rooms, even though they can be correlated, the square footage will basically dominate the number of rooms. So it's important to normalize or standardize your data if your machine learning algorithm isn't robust to data like this. The fourth point is how our model will handle null values. There are a number of ways to handle them: deleting rows with null values altogether; replacing the null values with the mean, the median or the mode; replacing the null values with a new category, for example 'unknown'; predicting the missing values; or using machine learning algorithms that can deal with null values. Again, some machine learning algorithms are perfectly suited to handling null values, but you need to make that check before you use one. Some other things we can include are removing irrelevant data, removing duplicates, and type conversion. We could also consider doing some dimensionality reduction, which could help us remove irrelevant features.

17. What is multicollinearity and how do we deal with it?: Multicollinearity exists when one independent variable is highly correlated with another independent variable in a multiple regression equation. This can be problematic because it undermines the statistical significance of an independent variable; basically you can think of it as counting a feature twice, because we have one feature that is highly correlated with another, and if we use both of these features we are essentially saying the exact same thing twice, which can skew the values our model produces. One of the ways we can deal with it is to use variance inflation factors (VIF) to determine whether there is multicollinearity between two independent variables. A standard benchmark is that if the VIF is greater than five, then multicollinearity exists, and we can then remove one of the variables involved. This is typically how we would deal with it in a multiple regression equation.
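A minimal sketch of this check (an assumed example with made-up house-price features, not from the class), using the variance_inflation_factor helper from statsmodels:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Hypothetical features; rooms and size_sqft are strongly correlated.
df = pd.DataFrame({
    "rooms":     [2, 3, 3, 4, 5, 6, 6, 7],
    "size_sqft": [40, 60, 65, 80, 100, 120, 125, 140],
    "age":       [30, 5, 12, 40, 8, 3, 25, 15],
})

X = add_constant(df)   # VIF is computed against a regression with an intercept
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)  # ignore the constant; a feature with VIF > 5 is a removal candidate
```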
18. You are given a dataset on cancer detection. You have built a classification model and achieved an accuracy of 98%.: This is a really interesting question, and one I really like to ask in a data science interview, because you want the candidate to be able to decipher what you're asking and figure out what the question behind the question is. Basically, you are told that you are given a dataset on cancer detection, you build a classification model, and this model achieves 98% accuracy. Is this model ready to be used in production? These results seem quite positive, and an accuracy of 98% is pretty amazing, you would think: has the model you've created revolutionized the field of cancer detection? But really, this is a question about imbalanced datasets and how you deal with them. Cancer detection results in imbalanced data, and on imbalanced datasets accuracy should not be used as the measure of performance, because it is easy to achieve high accuracy on imbalanced data with a useless model: just predict that nobody has cancer. For example, if you had 100 images, 96 without cancer and four with cancer, that model achieves 96% accuracy. So it's really a question about imbalanced datasets and how you evaluate on them, rather than about the amazing 98% accuracy you've achieved with the model. To evaluate the model's performance before it is ready for production, we should use the sensitivity (the true positive rate), the specificity (the true negative rate) and the F-measure to determine the class-wise performance of the classifier before we would be happy to use this model in production. It's an interesting question, but you need to understand what the interviewer is getting at when they ask it.
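A minimal sketch of the point above (assumed code using the 96/4 split from the example, not from the class): a classifier that always predicts "no cancer" gets 96% accuracy but zero recall and zero F1:

```python
# Minimal sketch (assumed example): accuracy is misleading on imbalanced data.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, f1_score

y_true = np.array([1] * 4 + [0] * 96)   # 4 cancerous, 96 non-cancerous images
y_pred = np.zeros(100, dtype=int)       # always predict "no cancer"

print(accuracy_score(y_true, y_pred))                   # 0.96 -> looks great
print(recall_score(y_true, y_pred, zero_division=0))    # 0.0  -> misses every cancer case
print(f1_score(y_true, y_pred, zero_division=0))        # 0.0
```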
So another alternative approach is to make a balanced dataset of our own by undersampling and oversampling. First of all, undersampling: this balances the dataset by reducing the size of the abundant class. This method is used when the quantity of data is sufficient; we keep all the samples in the rare class and randomly select an equal number of samples from the abundant class to create a new balanced dataset. Another technique we can use is oversampling: when the quantity of data is insufficient, oversampling is used to increase the number of rare samples. The new rare samples are generated by repetition, bootstrapping, or SMOTE (the synthetic minority oversampling technique). These are different techniques you can use to oversample the rare cases, and oversampling is particularly useful when the quantity of data is insufficient and you can't use undersampling. Finally, another approach is to use a cost function in the model that penalizes the wrong classification of the rare cases more than the wrong classification of the abundant cases. It's possible to design models that naturally generalize in favor of the rare class. For example, we can tweak an SVM model to penalize the wrong classification of rare cases by the same ratio that the class is underrepresented. We're basically just adding a greater penalty if we misclassify one of the rare classes compared to the abundant classes.

20. You have the 95th percentile of web server response times generated every 2 seconds for the last yea: So in this question, you have the 95th percentile of web server response times, generated every two seconds for the last year, and you want to aggregate the percentiles for the last day, week, and month. Is there a way that you can aggregate these percentiles? This was actually a question I faced in one of my work tasks before, and it's quite interesting, because you actually cannot do this: there's no way to aggregate these percentiles over different periods. A simple way to demonstrate why any attempt to aggregate percentiles by averaging them, whether using weighted averaging or not, is useless is to reason about the 100th percentile, i.e. the max. Suppose the 100th percentile was reported for each one-minute interval, each with the same overall event count, and the values were mostly small numbers with a single spike of 601. The average of that sequence might come out at something like 42, which has absolutely no relevance to the 100th percentile. Even though we're reporting these for each one-minute interval, we can't do any averaging on them, because it just doesn't make any sense: no amount of fancy averaging, weighted or not, will produce the correct answer for the 100th percentile of the overall 15-minute period. There's only one correct answer, and that is 601, which was the 100th percentile during the 15-minute period. So you can see that if we were reporting, in this case, the 95th percentile of web server response times, there is no averaging we can do over each of those minutes that will give us a valid summary metric; it just doesn't make any sense. The best way of thinking about it is that the 100th percentile is basically the max value that you have. The small simulation below makes the same point.
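Here is a small simulation of that argument with synthetic response times; every number in it is made up purely to illustrate the point.

```python
# Minimal sketch: why per-interval percentiles can't be averaged (synthetic data).
import numpy as np

rng = np.random.default_rng(1)
# Response times for 15 one-minute intervals, with one interval containing a big spike.
intervals = [rng.exponential(scale=20, size=1000) for _ in range(15)]
intervals[5][0] = 601.0   # a single very slow request of 601 ms

per_minute_max = [np.max(x) for x in intervals]   # the per-interval 100th percentiles
print("average of per-minute maxima:", np.mean(per_minute_max))
print("true 15-minute maximum:      ", np.max(np.concatenate(intervals)))
# The average of the per-interval maxima is far below 601; only the raw data
# (or the true max) gives the correct overall 100th percentile.
```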
So again, there's no amount of averaging that will give you the 100th percentile; it can only be calculated from the max value. Basically, if you're trying to summarize this data, you don't have enough information just from the percentiles for the last day, week, and month. You need all of the raw data to be present, and then you could calculate the 95th percentile using all of the data. But if you only have the already-aggregated percentiles, there's no way to aggregate those values over a longer period.

21. What is Bayes' Theorem? How is it useful in a machine learning context?: So in this question we're asked about Bayes' theorem and how it is useful in a machine learning context. Bayes' theorem gives you the probability of an event using what is known as prior knowledge. The classic formula for Bayes' theorem is

P(A|B) = P(B|A) P(A) / P(B)

so the probability of A given B is equal to the probability of B given that A has happened, times the probability of A, all over the probability of B. Another way that we can reformulate this is to expand the probability of B in the denominator:

P(A|B) = P(B|A) P(A) / [P(B|A) P(A) + P(B|not A) P(not A)]

This formulation can sometimes be easier to use in questions where you're given some negated values, but the first version is the classic formula of Bayes' theorem. For example, the risk of developing health problems is known to increase with age. Bayes' theorem allows the risk for an individual of a known age to be assessed more accurately by conditioning on their age, rather than simply assuming that the individual is typical of the population as a whole. Basically, it allows you to add a lot more context to your model. So even if 100% of patients with pancreatic cancer have a certain symptom, when someone has the same symptom it does not mean that this person has a 100% chance of having pancreatic cancer. The reason is that the incidence rate of pancreatic cancer is extremely low. Even if you have all the symptoms, because the prior probability of developing pancreatic cancer is so low, you're still not that likely to have pancreatic cancer (although you should obviously still go to a doctor and get it checked out). If we assume that the incidence of pancreatic cancer is 1 in 100,000, while 1 in 10,000 healthy individuals have the same symptoms worldwide, then the probability of having pancreatic cancer given the symptoms is only 9.1%; the other 90.9% could be false positives, falsely saying you have cancer. Based on the incidence rate, we can produce a table of the numbers per 100,000 people, broken down by symptoms and cancer:

- symptoms and cancer: 1
- symptoms but no cancer: 10
- cancer but no symptoms: 0
- no symptoms and no cancer: 99,989

Using these totals, we can then calculate the probability of having cancer when you have the symptoms using Bayes' theorem.
So in this case we want the probability of cancer given the symptoms, and by Bayes' theorem

P(cancer | symptoms) = P(symptoms | cancer) P(cancer) / P(symptoms)

and using the expanded formulation, the denominator becomes

P(symptoms) = P(symptoms | cancer) P(cancer) + P(symptoms | no cancer) P(no cancer).

Filling in the values: the probability of having the symptoms given that you have cancer is 1, and the probability of having cancer is 1 in 100,000, which is a very small value. In the denominator we have those same terms again, plus the probability of having the symptoms given that you don't have cancer, which from the table is 10 out of 99,999 (roughly 1 in 10,000), and the probability of not having cancer, which is 0.99999. This works out to about 1/11, which is roughly 9.1%. So this is how we can calculate that the probability of having pancreatic cancer given the symptoms is only 9.1%, and the other 90.9% are false positives. Bayes' theorem makes it easy to incorporate this prior knowledge of how likely something is, and then use that to assess how likely the person actually is to have cancer, even though they show all the individual symptoms. It's very good at capturing prior knowledge in particularly rare cases, such as cancer and medical diagnosis, where the symptom itself is usually something a lot more common.

22. What is 'Naive' in a Naive Bayes?: So in this question, we're asked what is naive in the Naive Bayes algorithm. Naive Bayes is naive because it makes the assumption that the features of a measurement are independent of each other, and this is naive because it's almost never true. In simple terms, a Naive Bayes classifier assumes that the presence or absence of a particular feature of a class is unrelated to the presence or absence of any other feature, given the class variable. For example, if we were building a classifier for fruit, a fruit might be considered an apple if it is red, round, and about four inches in diameter. Even if these features depend on each other, or on the existence of other features, a Naive Bayes classifier considers all of these properties to contribute independently to the probability that the fruit is an apple. In nature there would probably be some correlation between those features: whether a fruit is round will probably have some relation to its overall size, and the color can give some indication of the shape and size as well. So the Naive Bayes algorithm is naive because it assumes these features are totally independent of each other. But it can actually be quite a good classification algorithm, and it's certainly a good baseline to compare against more complicated models, because it's simple to use, easy to implement, and can achieve quite effective results; usually, if you want more accuracy, you can then move to more deep-learning-based techniques. A tiny baseline sketch is shown below.
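As a minimal sketch of using Naive Bayes as a quick baseline, here is a hypothetical fruit example with scikit-learn; the features, labels, and thresholds are invented purely for illustration.

```python
# Minimal sketch: Naive Bayes as a quick baseline classifier (hypothetical fruit features).
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
n = 300
diameter = rng.normal(loc=4.0, scale=1.0, size=n)     # inches
redness  = rng.uniform(0.0, 1.0, size=n)              # 0 = not red, 1 = very red
X = np.column_stack([diameter, redness])
y = ((diameter > 3.5) & (redness > 0.5)).astype(int)  # 1 = "apple", 0 = other fruit

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = GaussianNB().fit(X_train, y_train)
print("baseline accuracy:", clf.score(X_test, y_test))
# Each feature is treated as conditionally independent given the class,
# even though diameter and redness may well be correlated in reality.
```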
23. Explain the difference between L1 and L2 regularization.: So in this question, we're asked to explain the difference between L1 and L2 regularization. From a practical standpoint, L1 tends to shrink coefficients all the way to zero, whereas L2 tends to shrink coefficients more evenly. L1 regularization is therefore useful for feature selection, as we can drop any variables associated with coefficients that go to zero. L2 regularization, on the other hand, is useful when you have collinear or co-dependent features. To compare them, we can run some two-variable simulations of random quadratic loss functions at random locations and see how many coefficients end up at zero. There's no guarantee that these randomly generated loss functions represent real datasets, but it's a way to at least compare L1 and L2 regularization. If we start with symmetric loss functions, which look like bowls of various sizes and locations, we can compare how many zero coefficients appear under L1 versus L2 regularization. For the symmetric loss functions, L1 regularization gives about 66% zeros: all of the highlighted values are ones that have gone to zero, and you can see this leads to a great reduction in the number of variables we're using, which is why L1 regularization is useful for feature selection. If we look at the same loss functions with L2 regularization, only around 20% of the values have gone to zero. So the difference between the two regularization functions is how they change the values of the coefficients we're using, and in this case we see it in how many coefficients are set to zero: for L2 regularization we're only getting about 20% of values set to zero, but for L1 regularization we get about 66% of values set to zero on the same loss functions.

24. What cross-validation technique would you use on a time series dataset?: So in this question we're asked what cross-validation technique we would use on a time series dataset. We did look at some cross-validation techniques before, such as k-fold cross-validation, but you can see how that might not work with time series data: you can't divide the data up randomly, because there is a time component that forces us to move forward in time. So instead of using standard k-fold cross-validation, you have to pay attention to the fact that a time series is not randomly distributed data; it's inherently ordered chronologically. If a pattern emerges in later time periods, for example, your model may still pick up on it even if that effect doesn't hold in earlier years. So you want something like forward chaining, where you train on past data and then test on the forward-facing data, making sure you respect this inherent chronological order. For example, we can create fold one by training on one data point and testing on the next data point. For the second fold, we train on the first two data points and test on the third. For fold three, we train on three data points and test on the fourth; for fold four, we train on four data points and test on the fifth; and for fold five, we train on five data points and test on the sixth. The sketch below shows the same idea using scikit-learn.
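For reference, scikit-learn implements exactly this forward-chaining scheme as TimeSeriesSplit; a minimal sketch with six time-ordered observations looks like this.

```python
# Minimal sketch: forward-chaining cross-validation with scikit-learn's TimeSeriesSplit.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(6).reshape(-1, 1)   # six time-ordered observations
tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X), start=1):
    print(f"fold {fold}: train on {train_idx}, test on {test_idx}")
# fold 1: train on [0],           test on [1]
# ...
# fold 5: train on [0 1 2 3 4],   test on [5]
```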
So you can see it's just a slight variation on k-fold cross-validation where we need to respect this chronological order, and that's how you deal with cross-validation on time series datasets.

25. How can a time-series data be declared as stationery? What statistical test would you use?: So another question that is often asked around time series data is how a time series can be declared stationary, and what statistical test you would use to make that determination. A time series is stationary when the mean, variance, and covariance of the series are constant over time. There's a really good visual example of the principles of stationarity. Image A shows an example of a stationary series, and the other images each show one of the ways a dataset can fail to be stationary. In the second image, the dataset is non-stationary because there is a time-dependent mean: the mean of the data, shown by the yellow line, is increasing over time. One of the properties we need for stationarity is that the mean isn't varying over time, and that's why this dataset isn't stationary. Moving on to image C, that dataset is also not stationary, because there is a time-dependent variance: the data varies a lot around some points and much less towards the edges. And finally, the last dataset is non-stationary because there is a time-dependent covariance: in the middle, the frequency of the data is much higher, so the waves are much closer together, while at the edges the data is much more spread out. So this is a really good visual example that shows all the properties you need for a stationary dataset; basically, data that is consistent over time, with a stable mean. To put it more formally, a time series is not stationary if it has a time-dependent mean, a time-dependent variance, or a time-dependent covariance; the first green graph shows a time series that meets all of the conditions for stationarity. One of the tests we can use to check for stationarity is the augmented Dickey-Fuller test, which is one of the most popular. It checks whether the time series has a unit root (a unit root means the series behaves like a random walk with drift). If the p-value is greater than 0.05, we fail to reject the null hypothesis: the data has a unit root and is non-stationary. If the p-value is less than or equal to 0.05, we reject the null hypothesis, and in that case the data does not have a unit root and is stationary. So the statistical test we use is the augmented Dickey-Fuller test, and for a dataset to be stationary it must not have a time-dependent mean, variance, or covariance. A quick sketch of the test is shown below.
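A minimal sketch of running the augmented Dickey-Fuller test with statsmodels, on two synthetic series (one stationary, one a random walk), might look like this.

```python
# Minimal sketch: testing for stationarity with the augmented Dickey-Fuller test.
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(3)
stationary = rng.normal(size=500)                 # white noise: should look stationary
random_walk = np.cumsum(rng.normal(size=500))     # has a unit root: non-stationary

for name, series in [("white noise", stationary), ("random walk", random_walk)]:
    stat, pvalue = adfuller(series)[:2]
    verdict = "stationary" if pvalue <= 0.05 else "non-stationary"
    print(f"{name}: p-value = {pvalue:.4f} -> {verdict}")
```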
26. What are the main components of an ARIMA time series forecasting model?: So what are the main components of an ARIMA time series forecasting model? ARIMA stands for autoregressive integrated moving average. Non-seasonal ARIMA models are generally denoted ARIMA(p, d, q), where the parameters p, d, and q are non-negative integers. p is the order (the number of lags) of the autoregressive model; d is the degree of differencing, which is the number of times the past values have been subtracted from each other, usually to get rid of any time-varying mean in the data; and q is the order of the moving average model. So in an ARIMA model we have three main components: p, the number of time lags, d, the degree of differencing, and q, the order of the moving average model. An ARIMA(0,0,0) model is basically just a white noise model: we're not including any autoregressive components, there is no differencing, and there is no moving average component. An ARIMA(0,1,1) model without a constant is a basic exponential smoothing model, and an ARIMA(0,2,2) model is equivalent to Holt's linear method with additive errors, or double exponential smoothing.

27. How do you ensure you're not overfitting with a model?: So how do you ensure that you're not overfitting with a model? There are three main methods we can use to avoid overfitting. The first one is to keep the model simpler: we want to reduce the variance by taking fewer variables and parameters into account, thereby removing some of the noise in the training data. I would be a big fan of this particular method, especially when you're tackling a new machine learning problem: for your initial experiments you always want to start off with a very simple model. These simple models give you a rough estimate of the typical error you should be achieving before you start to use more complicated, say, deep learning techniques. For example, in time series forecasting you can use a very simple model called the persistence model. What this model does is predict that the next time series point will be exactly the same as the current one, and this gets updated every time new data arrives. So if you're doing weather forecasting and you want to forecast the weather for the next day, you just predict that tomorrow will have the same average temperature as today, and for the day after that, you predict it will be the same as tomorrow. Each time a new data point gets introduced, we update our information and make a prediction for the next day. It's a very simple model, but we can use it as a reference for comparison against more complicated LSTM- and RNN-based models. Another important technique we can use to avoid overfitting is cross-validation, such as k-fold cross-validation. This makes sure we're getting the most out of our dataset, and that we're training and testing on different parts of the data. It gives us a more comprehensive evaluation, because we're testing on k folds as opposed to just having one part of the dataset that we train on and one part that we test on. Before moving on to the final technique, there's a quick sketch of the persistence baseline below.
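Here is a minimal sketch of the persistence baseline described above, using a synthetic temperature series; the data and the choice of error metric (MAE) are assumptions for illustration.

```python
# Minimal sketch: a persistence (naive) baseline for time series forecasting,
# i.e. predict that tomorrow's value equals today's value (synthetic data).
import numpy as np

rng = np.random.default_rng(4)
temps = 15 + np.cumsum(rng.normal(0, 1, size=100))   # a synthetic daily temperature series

actual = temps[1:]        # tomorrow's values
predicted = temps[:-1]    # persistence forecast: just repeat today's value
mae = np.mean(np.abs(actual - predicted))
print("persistence baseline MAE:", round(mae, 3))
# Any LSTM/RNN or other complex model should beat this number to justify its complexity.
```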
The final technique that we can use is regularization, with techniques such as LASSO that penalize certain model parameters if they're likely to cause overfitting. This can use L1 or L2 regularization, and it will help you reduce the number of parameters you're taking into account in your model, which in turn reduces the likelihood that you end up with a model that has overfit.

28. You are given a data set consisting of variables with a lot of missing values. How will you deal wit: So in this question, we're given a dataset consisting of variables with a lot of missing values. How would you deal with this? You can see from this picture on handling missing data that there are a lot of different techniques we can use, and the techniques we choose depend on the size and the type of the data. We can divide them into a subset that looks at deletion and a subset that looks at imputation. If the dataset is already large, then we can use deletion, which is quite simple: we already have enough data, and if we have to remove some values it's not a big deal, because we can just use the rest of the data to build the model. Within deletion, we can delete rows, i.e. delete the particular instances in our dataset where we don't have values for all of the features. We can also look at pairwise deletion, which analyzes all cases in which the variables of interest are present, and thus maximizes the data available on an analysis-by-analysis basis. The final deletion technique is deleting columns, which means removing particular features that weren't really that useful for the problem we're trying to tackle, or that weren't well filled in. Say we were doing a regression problem on house prices and we wanted to estimate the value of a house, and one of the features was the number of trees planted in the garden. The problem is that some houses don't have gardens at all, and for those that do, the value was often left blank or not filled in accurately. In that case, we might just want to remove that feature and rely on the features we already have good values for, like the number of rooms or the square footage of the house. For smaller datasets, we don't really have the luxury of just deleting the instances and features that have a lot of missing values, so we need to look at imputation techniques instead. We can tackle this as a general problem or as a time series problem, and the first one we're going to look at is the time series problem. For time series, the method depends on whether the data has trend and seasonality. For data without trend and without seasonality, we can use mean, median, mode, and random sample imputation. For data with trend but without seasonality, we can use linear interpolation. And for data with both trend and seasonality, we can use seasonal adjustment plus interpolation. So if the data is a time series, we can interpolate the missing data depending on whether the time series has trend and seasonality. For general continuous data, we can use the mean, median, mode, multiple imputation, and linear regression to fill in the missing values.
So in this case, the highlighted values in the figure are the values that were filled in: we're basically filling in what we would have expected the trend to be between two particular points, making a prediction for what the value would have looked like in order to join up those points, using the trend and seasonality of the previous points in the time series. For general problems that have categorical data, we can use mode imputation as one of the methods, but this will introduce bias. Another option is that the missing values can be treated as a separate category by themselves: we create another category for the missing values and use it as a different level, which is the simplest method. We can also use prediction models, where we create a model to estimate the values that will substitute for the missing ones. In this case we divide our dataset into two sets: one with no missing values for the variable, which becomes the training set, and another with missing values, which becomes the test set, and we can use methods like logistic regression and ANOVA for prediction. The final method we can look at is multiple imputation. This is a general approach to the problem of missing data that is available in several commonly used statistical packages. It allows for the uncertainty about the missing data by creating several different plausible imputed datasets and appropriately combining the results from each of them. There are a few different parameters you can tweak in multiple imputation, but it's usually quite effective at making at least a reasonable approximation of the missing data. If we go back up for an overview, we can look at all the different techniques we've highlighted for dealing with this problem. Basically, how you handle the missing data will depend on the data. If you have enough, you can use very simple deletion techniques: just delete the rows or delete the columns, because you already have enough data for your model. If you don't have enough data, you have to start using these imputation techniques. For the time series problem, if the data has no trend and no seasonality, we can use the mean, the median, the mode, and random sample imputation; if the data has a trend but no seasonality, we can use linear interpolation; and if the data has both trend and seasonality, we can use seasonal adjustment plus interpolation. For the general problem of imputation, if we have categorical data, we can mark the missing values as their own level, use multiple imputation, or use logistic regression. And if we have a general continuous problem, we can use the same methods as in the time series case: the mean, the mode, multiple imputation, and linear regression. So these are the different techniques you can use to handle missing values, and it would be really good in the interview to ask these questions about the data: do we have enough data that we can use deletion, or is it a case where we need some slightly more advanced imputation techniques? A few of these options are sketched in code below.
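As a minimal sketch, a few of the simpler options above could look like this in pandas; the toy DataFrame and column names are invented for illustration.

```python
# Minimal sketch: a few of the imputation options above with pandas (toy data).
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "rooms": [3, 4, np.nan, 5, 4, np.nan],
    "city":  ["Dublin", None, "Cork", "Dublin", None, "Galway"],
    "price": [300_000, 350_000, 280_000, np.nan, 320_000, 290_000],
})

dropped = df.dropna()                                   # deletion: only if we have plenty of data
print("rows left after deletion:", len(dropped))

df["rooms"] = df["rooms"].fillna(df["rooms"].median())  # numeric: median imputation
df["city"]  = df["city"].fillna("unknown")              # categorical: treat missing as its own level
df["price"] = df["price"].interpolate()                 # time-ordered numeric: linear interpolation
print(df)
```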
29. You are given a data set consisting of variables with a lot of missing values. How will you deal wit: So what is the kernel trick and how is it useful? The kernel trick involves kernel functions that enable us to operate in higher-dimensional spaces without explicitly calculating the coordinates of the points within those spaces. Instead, kernel functions compute the inner products between all pairs of data points in the feature space. This gives them the very useful attribute of working with higher-dimensional representations while being computationally cheaper than the explicit calculation of those coordinates. Many algorithms can be expressed in terms of inner products, and using the kernel trick enables us to effectively run such algorithms in a high-dimensional space with lower-dimensional data. Take the example where we have red squares and green circles and we want to build a classifier that makes them separable. As you can see in the picture, we want to map the data from two-dimensional space to three-dimensional space, where we can find a decision surface that clearly divides the different classes: if we're able to map to that three-dimensional space, there's a very simple decision surface that separates the two classes. However, as we add more and more dimensions, computations within that space become more and more expensive, and this is where the kernel trick comes in. It allows us to operate in the original feature space without computing the coordinates of the data in the higher-dimensional space. So it's basically a more efficient way of exploiting that three-dimensional space to create a very simple decision surface separating the two classes, which is much simpler than what we could do with the lower-dimensional representation directly.

30. You are given a data set consisting of variables with a lot of missing values. How will you deal wit: So when would you use gradient descent over stochastic gradient descent, and vice versa? Gradient descent theoretically minimizes the error function better than stochastic gradient descent; however, stochastic gradient descent converges much faster once the dataset becomes large. This means that GD (gradient descent) is preferable for small datasets, while SGD is preferable for larger ones. In practice, however, we use SGD for most applications because it minimizes the error function well enough, while being much faster and more memory efficient for large datasets.

31. Your organization has a website where visitors randomly receive one of two coupons. It is also possi: So in this question, your organization has a website where visitors randomly receive one of two coupons, and it is also possible that visitors to the website will not receive a coupon. You have been asked to determine if offering a coupon to a website visitor has any impact on their purchase decisions. Which analysis method should you use? There are four possible options: a one-way ANOVA, k-means clustering, association rules, and Student's t-test. The answer I would go for here is the one-way ANOVA. In statistics, one-way analysis of variance (usually abbreviated to one-way ANOVA, or just ANOVA) is a technique that we can use to compare the means of three or more samples using the F distribution, as in the sketch below.
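A minimal sketch of that one-way ANOVA with SciPy, using synthetic purchase amounts for the two coupon groups and the no-coupon group, might look like this.

```python
# Minimal sketch: one-way ANOVA for the coupon question with SciPy (synthetic spend data).
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(5)
coupon_a  = rng.normal(loc=52, scale=10, size=200)   # purchase amounts with coupon A
coupon_b  = rng.normal(loc=55, scale=10, size=200)   # purchase amounts with coupon B
no_coupon = rng.normal(loc=50, scale=10, size=200)   # purchase amounts with no coupon

stat, pvalue = f_oneway(coupon_a, coupon_b, no_coupon)
print(f"F = {stat:.2f}, p-value = {pvalue:.4f}")
# A small p-value suggests the mean purchase amount differs between at least two groups.
```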
This technique can be used only for numerical data. Typically, however, the one-way ANOVA is used to test for differences among at least three groups, since the two-group case can be covered by a t-test. In this case there are three different groups: visitors randomly receive one of two coupons, so we have one group for each coupon, and we also have the possibility of a visitor not receiving a coupon. If there were just the two coupons, we could have used Student's t-test; here we need a one-way ANOVA because we have three different groups.

32. What is the differentiate between univariate, bivariate and multivariate analysis?: So what is the difference between univariate, bivariate, and multivariate analysis? Univariate analysis is descriptive statistical analysis that deals with one variable at a time. For example, a pie chart of sales by territory involves only one variable, and that analysis can be referred to as univariate analysis: we're just looking at the number of sales in a particular territory, over that one variable. Bivariate analysis attempts to understand the relationship between two variables at a time, as in the case of a scatter plot. For example, analyzing the volume of sales against spending can be considered an example of bivariate analysis. And finally, we have multivariate analysis, which deals with the study of more than two variables to understand the effect these variables have on the responses, and any multicollinearity between the individual variables.

33. Explain the SVM algorithm.: So in this question, we're asked to explain the SVM algorithm. SVM stands for support vector machine. It is a supervised machine learning algorithm which can be used for both regression and classification; however, it is mostly used for classification problems. If you have n features in your training dataset, SVM tries to plot the data in n-dimensional space, with the value of each feature being the value of a particular coordinate. The vectors (cases) that define the hyperplane are the support vectors, and these labeled points are the ones closest to the margin that we're using to divide the two classes; basically, we're trying to maximize this margin. A simple way to think of the SVM algorithm is that we define an optimal hyperplane and want to maximize the margin around it. We then extend this definition to non-linearly separable problems by adding a penalty for misclassifications, and we can map the data to a higher-dimensional space where it is easier to classify with linear decision surfaces; basically, we're reformulating the problem so that the data is mapped implicitly to that space. So the support vectors are the points closest to the separating line, and what we're trying to do is increase the space between the margins so that we get the greatest distance between the classes, which improves the accuracy of our classification or regression model. And that is basically how the SVM algorithm works. A minimal sketch, combined with the kernel trick from question 29, is shown below.
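Here is a minimal sketch of an SVM with and without a non-linear (RBF) kernel on a toy dataset that is not linearly separable; the dataset and parameters are illustrative assumptions.

```python
# Minimal sketch: an SVM classifier with an RBF kernel on data that isn't linearly separable.
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=400, factor=0.3, noise=0.08, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear").fit(X_train, y_train)
rbf_svm    = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)

print("linear kernel accuracy:", linear_svm.score(X_test, y_test))   # roughly chance level
print("RBF kernel accuracy:   ", rbf_svm.score(X_test, y_test))      # close to 1.0
```

The RBF kernel is doing exactly what the kernel trick describes: separating the concentric circles with what is effectively a linear surface in a higher-dimensional space, without ever computing those coordinates explicitly.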
34. Describe in brief any type of Ensemble Learning.: So this question asks us to describe in brief any type of ensemble learning. Ensemble learning has many types, but two of the most popular ensemble techniques are bagging and boosting, and these are the ones we're going to give a brief explanation of. Bagging tries to implement similar learners on small sample populations and then takes a mean of all of the predictions. In generalized bagging, you can use different learners on different parts of the population, and you expect this to help reduce the variance error. For bagging, we take our training data and divide it into k samples, sample 1, 2, 3, up to k, and we can use as many samples as we want; what we're doing here is random sampling with replacement. Then for each of these samples we train a model, so we build k models on the k samples. The most important thing is that we do this in parallel: each model is trained in parallel, and then we combine them using the average of each model's score. So basically we're combining all of these different models, which have been trained on different samples of the data. Boosting is a more iterative technique, which adjusts the weight of an observation based on the last classification. The important thing here is that it is iterative: instead of the data being divided into samples and the models being built in parallel, we have an iterative technique. If an observation is classified incorrectly, boosting tries to increase the weight of that observation, and vice versa. Boosting in general decreases the bias error and builds strong predictive models. One of its limitations, though, is that it may overfit on the training data, because you're always increasing the weight of samples that have been classified incorrectly, so it can tend to start overfitting on some of those incorrectly classified data points. So here's boosting: we have our original training data, we take a random sample with replacement from it, and then we build our model. From that model, we see which observations were classified incorrectly and increase the weight of those observations. Then we take another sample, train another model, see which ones are classified incorrectly, take another sample, train another model, and keep doing this for the k models. In this way we build k models on the k weighted samples, in a sequential fashion, and then using these k models we can output the weighted average of each model's score. So the main difference between a bagging ensemble and a boosting ensemble is that for bagging, you are doing all of this in parallel, dividing your samples in parallel with replacement, whereas for boosting, you're doing it sequentially: you take one sample and fit a model, you see which points are labeled incorrectly, and then you take another sample and build a model using the feedback from the first one. So those are two basic ensemble techniques, where you're using multiple models to create your final prediction. A quick sketch of both is shown below.
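A minimal sketch comparing a bagging ensemble and a boosting ensemble of decision trees in scikit-learn, on a synthetic dataset, might look like this.

```python
# Minimal sketch: bagging vs. boosting ensembles of decision trees in scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Bagging: k models trained in parallel on bootstrap samples, predictions averaged.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
# Boosting: models trained sequentially, re-weighting the misclassified observations.
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)

print("bagging CV accuracy: ", cross_val_score(bagging, X, y, cv=5).mean())
print("boosting CV accuracy:", cross_val_score(boosting, X, y, cv=5).mean())
```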
35. What is a Box-Cox Transformation?: So in this question we're asked what a Box-Cox transformation is. The Box-Cox transformation transforms our data so that it more closely resembles a normal distribution. Normality is an important assumption in many statistical techniques, and if your data isn't normal, applying a Box-Cox transformation means you are able to run a broader range of tests. In many statistical techniques we assume that the errors are normally distributed; this assumption allows us to construct confidence intervals and conduct hypothesis tests. By transforming the target variable, we can hopefully normalize our errors if they're not already normal. Additionally, transforming our variable can improve the predictive power of our models, because the transformation can remove some of the noise. Suppose we have a beta distribution where alpha equals one and beta equals three. If we plot this distribution, it is obviously not normal: most of the mass is near zero, and then it tails off into a long tail. If we apply a Box-Cox transformation, we can transform this original data into something close to a normal distribution, to the extent that the Box-Cox transformation permits. You can see that this has transformed our data into something which closely resembles the normal distribution: the classic bell shape, where most of the data is focused around the central point and the tails go out evenly on both sides. So that's what a Box-Cox transformation is useful for.

36. What is the Central Limit Theorem?: So in this question we're asked what the central limit theorem is. The central limit theorem states that the sampling distribution of the sample mean approaches a normal distribution as the sample size gets larger, no matter what the shape of the population distribution is. This holds especially true for sample sizes over 30. Here's what the central limit theorem is stating graphically: the picture below shows one of the simplest examples, rolling a fair die. The more dice you roll, the more the shape of the distribution of the means tends to look like a normal distribution. The central limit theorem is important because it is used in hypothesis testing and also to calculate confidence intervals. When n is equal to one, for a standard fair die, we have a uniform distribution: the value can be between 1 and 6, and we have a 1/6 probability of getting each value. When the number of dice goes up to two, our values are now between 2 and 12 (two ones would give a 2, and two sixes would give a 12), and we can already start to see how this affects the distribution: a lot of the values are around seven, which is the most likely total when rolling two dice, and we're starting to approach a more normal shape. The only way you can get a 2 is by rolling two ones, but you can get a seven by rolling a three and a four, a four and a three, a five and a two, a two and a five, a one and a six, or a six and a one, so there are multiple ways to get a seven, which is why its probability is the highest. When n is equal to three, we can see the effect even more: at the very limits we have 3 and 18, where 3 is only possible if you roll three ones, but there are multiple ways to get totals in the middle like 10 or 11, so the distribution fills out around the centre. And for n equal to four and n equal to five, we're getting something that really starts to look a lot like a normal distribution. The small simulation below shows the same effect.
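Here is a small simulation of the dice example; the number of experiments is an arbitrary choice for illustration.

```python
# Minimal sketch: simulating the dice example to see the central limit theorem in action.
import numpy as np

rng = np.random.default_rng(6)
for n in [1, 2, 3, 5, 30]:
    # 10,000 experiments: roll n fair dice and take the mean of each roll.
    means = rng.integers(1, 7, size=(10_000, n)).mean(axis=1)
    print(f"n={n:>2}: mean of sample means = {means.mean():.2f}, "
          f"std of sample means = {means.std():.2f}")
# As n grows, the distribution of the sample mean tightens around 3.5 and,
# if you histogram it, looks more and more like a normal (bell-shaped) curve.
```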
So basically what the theorem is saying is that no matter what your population distribution is, even one which is clearly not normal, when we take samples of size n, the sampling distribution of the mean will always approach this normal Gaussian curve.

37. What is sampling?: So in this question we're asked what sampling is. Sampling is a statistical analysis technique used to select, manipulate, and analyze a representative subset of data points in order to identify patterns and trends in the larger dataset being examined. It enables data scientists, predictive modelers, and other data analysts to work with a small, manageable amount of data about a statistical population, to build and run analytical models more quickly while still producing accurate findings. There are many different methods for drawing samples from data, and the ideal sampling method depends on the dataset and on the situation. Sampling can be based on probability, a bit like rolling a die and choosing the sample depending on what number comes up: we use random numbers that correspond to points in the dataset to ensure that there is no correlation between the points, and every new roll of the dice has a new chance of coming up. There are also non-probability-based methods, such as consecutive sampling, where data is collected from every sample that meets the criteria until the predetermined sample size is met. In the next two questions, we're going to give some examples of probability-based and non-probability-based sampling methods and how they work.

38. Give 4 examples of probability-based sampling methods and how they work: So in this question, we're asked to give four examples of probability-based sampling methods and how they work. The first method is simple random sampling, which is where we randomly sample subjects from the whole population. In the example, we have our whole population and we're basically drawing random lots to choose which person gets picked; you could say we have a 100-sided die, we roll it, and we see what number comes up. Another approach based on probability is systematic sampling: a sample is created by setting an interval at which to extract data from the larger population. For example, this could be selecting every tenth row in a spreadsheet of 200 items to create a sample of 20 rows to analyze. In the illustration we're taking every third person: we skip two people and select the next one, again and again, so it's a systematic sample based on the amount of data that we need. Another probability-based technique is stratified sampling. This is where we create subsets of the population based on a common factor, and then samples are randomly drawn from each subgroup using simple random sampling. So we create our subsets of the population based on a common factor, which could be age or hair color or eye color, and then, based on the subsets we've created, we use simple random sampling to choose someone from each group. And finally, we can use cluster sampling, which is where the larger dataset is divided into clusters based on a predefined factor.
Then a random sample of the clusters is analyzed. The sampling unit is the whole cluster: instead of sampling individuals from within each group, the researcher studies entire clusters, and that's how it differs from stratified sampling. In the illustration, the clusters could be based on latent features, which aren't easily explainable, or they could have been formed in some more explainable way, but for cluster sampling we're selecting entire clusters; we're not drawing a sample from within each cluster, we're just selecting whole clusters. That's how cluster sampling is different from stratified sampling. So those are four examples of probability-based sampling methods and how they work.

39. Give 4 examples of non probability-based sampling methods and how they work: So in this question, we're asked to give four examples of non-probability-based sampling techniques and how they work. Non-probability sampling techniques include convenience sampling, which is where data is collected from an easily accessible and available group. For a lot of studies based at universities, students are an easily accessible and available group, but this can bias the results of your experiment or analysis, because there may be concerns that are quite important to students but not so important to, say, the elderly or people with professional jobs. The next technique is consecutive sampling, which is where data is collected from every subject that meets the criteria until the predetermined sample size is met. So we have our predefined criteria, say women aged between 25 and 30 with black hair, and we keep collecting everyone who meets this criteria until our sample size is met. The next technique is purposive or judgmental sampling, which is where the researcher selects the data to be sampled based on predefined criteria: we define the criteria beforehand, and then we sample based on it. Finally, we have quota sampling, which is where the researcher ensures equal representation within the sample for all subgroups in the dataset or population, and random sampling is not used. For example, if you were doing a study on hair color and its effect on your life, you might want to ensure equal representation for all subgroups in the dataset or population, so you would need equal representation of people with black hair, blonde hair, red hair, and every other hair color. That's how you would use quota sampling.

40. What are the advantages and disadvantages of neural networks?: So in this section we're going to answer some questions on neural networks, and in particular deep learning. This has become a very popular topic in recent years in data science interviews, due to the increased popularity of neural networks and deep learning. One of the advantages of these approaches is that they have performed really well on a number of different problems, and that's why a number of companies are trying to recruit for these positions, and why, even as a data scientist, you're going to be expected to know some basic neural network techniques and how they can be applied to a range of different problems.
So one of the most basic questions that you can be asked about neural networks and deep learning is: what are the advantages and disadvantages of neural networks? Covering the advantages first, one of the things I mentioned previously is that they've achieved state-of-the-art performance on a range of problems. This includes image processing, language processing and translation, speech recognition, and time series forecasting. All of these problems are quite different and use quite different datasets: image processing is very different to speech recognition, which in turn is very different from time series forecasting and language translation. So even though these problems are quite different and use totally different datasets, neural networks have been shown to work quite well on all of them. Another advantage of neural networks is that they're quite robust to noise in the training data: even if the training data contains errors, it doesn't affect the final output too much. Another advantage you can mention is that what the network learns is stored across the network itself rather than in a database, so once you've trained your model, the loss of some of the original data does not stop it from working. One of the disadvantages of neural networks is that they're difficult to interpret. Even in a one-layer feedforward neural network, the number of parameters used is huge, and if you look at the most recent deep learning models, which can have hundreds of different layers, it can be almost impossible to interpret what is going on in the middle of those layers. This has led to a really popular field in deep learning which calls for better interpretability of these models and tries to explain why they're making certain decisions, but there's still ongoing research in this area, and it will probably be a good few years until we have something that's as interpretable as other machine learning models. Another disadvantage of neural networks is that they require a lot of data. One of their advantages is that they work quite well when they have a lot of data, but this becomes a problem when you don't: when you don't have a huge amount of data, other machine learning algorithms, which are much simpler, can perform quite well, and it may not be necessary to use a neural network or deep learning approach; it may even produce worse results. Another disadvantage of neural networks is that they can overfit easily if the amount of training data is insufficient. Because they're quite good at function approximation, they tend to overfit the training data if you don't apply regularization, either through some form of dropout or some other regularization technique, and this can make them perform quite poorly on the test dataset compared to the training dataset. One other disadvantage worth mentioning is that there's no ground truth for hyperparameter tweaking. For example, it is still not known in advance whether adding more layers will improve or hurt the model, and this is mainly worked out by brute force.
So this sort of comes back to that the models are hard to interpret. And so when you're building one of these models, it can be difficult to know whether adding more layers is improving or making your model's worse. And you really need to have this brute force approach where you test all of these possible different models and then choose the one that performs best. Am trigger validation dataset. 41. What in your opinion is the reason for the popularity of Deep Learning in recent times?: So in this question, we're asked to give her opinion for the popularity of deep learning in recent times. So deep learning has been around for quite a few years, but the major breakthroughs from these techniques have come in recent years. And so this is because of two main reasons. The first reason is the increase in the amount of data generated through various sources. So one of the things we know about deep learning is that it requires a huge amount of data to perform well and achieved these state of the art are amazing results. And so if we plot the performance here on the y axis and we have the amount of data on the x-axis. We can see for traditional older learning algorithms like linear regression, we would have this increase in performance as geometric data increased. And then we will reach this plateau where we wouldn't really see any increase in performance as the amount of data increased. But for deep learning algorithms, we can see that we're still getting this increase at the start. But as we keep drawing more and more data and niche where getting still better and better performance. And so that is why companies like Google have achieved such really good results with their deep learning algorithms is that because they have huge amounts of data to throw it out. And so if you've ever had any of those questions of where Google is asking you to select the traffic lights and an image or to select the boat. And they're then able to use this data, which becomes labeled data to train your deep neural networks, am Ash image recognition. And so really, that's why and data has become such a valuable commodity in recent times, is because it can be used to train these deep neural network algorithms, which then can be used in production. And so another one of the breakthroughs that has led to the deep learning being so popular in recent years is the growth in hardware resources required to run these models. So the GPUs that we have access to now are multiple times faster than they were in a five or ten years ago. And they help us to build bigger and deeper, deep learning models in comparatively less time thunder required previously. So if we were trying to a train some of the modern deep neural network models, say just using a single core CPU, we would never be able to use all the data that we currently have access to. So we need to be able to have these GPUs to run the computations in parallel to make use of the terabytes of data that we have access to. So yet, if we were trying to say train a modern AM ImageNet algorithm on a single core CPU. It would take tens of years to be able to do. And so GPUs are the reasons that we're able to train these models in a reasonable time. And so one other resource that I want to show you and to show you the growth of deep learning in recent years is Jeffrey Hinton's and Google Scholar page. So he is a very famous computer researcher and his work has been on machine learning, artificial intelligence, and deep learning. 
And so if you look at his most cited papers, you can see a paper on backpropagation (error propagation, as it's formally called) published in 1986. If you look at the number of citations his articles have received, you can see that although his work goes back to the 1980s and this article was published in 1986, most of the citations have come since 2012. This really shows the huge growth in popularity of his papers since 2012, even though he had been working on these problems since the 1980s. They've become very popular in the last eight years or so because we now have access to more data, and we have the hardware to actually train these models. So even though he had been working on this for a long time, it's in recent years that he has become very famous, and he now has an h-index of 158, which is very impressive.

42. Why do we use convolutions for images rather than just FC layers?: So in this question we're asked why we use convolutions for images rather than just fully connected layers. Fully connected layers don't exploit the local structure of images and are not equivariant under translation. That means that if a person in an image moves from the left to the right, a fully connected layer won't capture this translation. Convolutional layers, on the other hand, extract local features, since each output in the initial layers depends only on adjacent pixels of the previous layer, so convolutional layers are equivariant under translations, which means they are able to capture this translation. Another reason is the number of parameters: fully connected layers have a huge number of parameters, whereas convolutional layers have far fewer, since they share the kernel across patches of the whole input image.

43. What Is the Difference Between Epoch, Batch size, and Number of iterations in Deep Learning?: So what is the difference between epoch, batch size, and number of iterations in deep learning? One epoch is one forward pass and one backward pass over all of the training examples. The batch size is the number of training examples in one forward/backward pass; the higher the batch size, the more memory you'll need, because this is the number of examples you're fitting in one pass and you need to keep all of them in memory. That's why you're going to need more memory if you try to use a huge number of, say, images in one batch. And finally, you have the number of iterations: the number of iterations is the number of passes, each pass using a batch-size number of examples. Just to be clear, one pass equals one forward pass plus one backward pass; we do not count the forward pass and the backward pass as two different passes. As a quick example, if you have 1,000 training examples and your batch size is 500, then it will take two iterations to complete one epoch. The same arithmetic is sketched below.
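The arithmetic is trivial, but here it is as a sketch, including the case where the dataset doesn't divide evenly into batches (the 1,050-example figure is just an illustration).

```python
# Minimal sketch: the epoch / batch size / iteration arithmetic from the example above.
import math

n_examples = 1000
batch_size = 500
iterations_per_epoch = math.ceil(n_examples / batch_size)   # forward/backward passes per epoch
print("iterations per epoch:", iterations_per_epoch)          # 2

# If the dataset doesn't divide evenly, the last batch is simply smaller:
print("iterations per epoch for 1,050 examples:", math.ceil(1050 / batch_size))   # 3
```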
In the second stage, the resulting activations are passed through a non-linear activation function, such as sigmoid, tanh, or ReLU. In the third stage, we use a pooling operation to modify the output further. In the pooling operation, we replace the output of the network at a certain location with a summary statistic of the nearby outputs, such as the maximum in the case of max pooling, where we replace the output with the max. Hence, even if we shift the input slightly, it won't affect the values of most of the pooled outputs. So translation invariance is a useful property where the exact location of the object is not required. For example, if you're building a model to detect faces, all you need to detect is whether the eyes are present or not, so their exact position is not necessary, while in segmentation tasks the exact position is required. The use of pooling can be viewed as adding a strong prior that the function the layer learns must be invariant to translation, and when the prior is correct, it greatly improves the statistical efficiency of the network. And so here we can see an example of this translation property, where we have someone performing a pose in this part of the image, and if they move over to this other part of the image, we can still see that we're able to capture it with the deep neural network. So it's basically the property that no matter where in the image the action is performed, it will still be classified by the deep neural network architecture. The convolutional layers, and in particular the pooling operation, help to make CNNs translation invariant. 45. What are the 4 main types of layers used to build a CNN?: So in this question we're asked what the four main types of layers used to build a CNN are. Typically, there are four traditional layers used to build a CNN. The first one is the convolutional layer: in this layer, we perform a convolution operation, and this creates several smaller picture windows that go over the data. If we look at this input here of a car, we can see in our convolutional layer we have this little square box, and we're creating these convolutions as the square box moves over the image. The second layer is the ReLU layer: this brings non-linearity to the network and converts all the negative pixels to 0. The output is a rectified feature map, and this creates the features that are going to be used in our pooling layer, which is the next layer. The third layer is the pooling layer: pooling is basically a down-sampling operation that reduces the dimensionality of the feature map. You can see here we've got all our features, which have been created by doing this convolution, and then this is input into our pooling layer. From that, we're able to create more convolutions, ReLUs, and further pooling layers, which is why this is deep learning: most convolutional neural networks will have multiple of these layers, which learn deeper and deeper features. And finally, we'll have our fully connected layer, which is what we use to recognize and classify the objects in the image. So you can see here we have this split into feature learning and classification: these are all the different layers of our deep neural network, where we're creating all these different features through the layers of the architecture by taking these convolution plus ReLU plus pooling steps.
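To make this layer stack concrete, here is a minimal sketch of a small image classifier built from exactly these four layer types, including the fully connected classification stage described next. This is only an illustrative sketch, assuming the Keras API and an arbitrary 32x32 RGB input with 10 classes, not the specific architecture shown in the course figures.

```python
import tensorflow as tf
from tensorflow.keras import layers

# A minimal CNN using the four classic layer types:
# convolution, ReLU non-linearity, pooling, and fully connected layers.
model = tf.keras.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(32, 32, 3)),  # convolution + ReLU
    layers.MaxPooling2D((2, 2)),                                            # pooling (down-sampling)
    layers.Conv2D(64, (3, 3), activation="relu"),                           # deeper features
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                                                       # bridge to the classifier
    layers.Dense(64, activation="relu"),                                    # fully connected layer
    layers.Dense(10, activation="softmax"),                                 # class probabilities
])
model.summary()
```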
And then finally we have our classification stage here, where we have our fully connected layers. We would also usually apply a softmax activation function to the final layer, and this is what gives us the values that we want to predict. Say in this case we're doing image classification: the output size would be equal to the total number of classes we wanted to predict. So if we're just predicting modes of transport, maybe 50 or 100 different options, going from van all the way to bicycle. This is how we actually get the result for an image: we need this fully connected layer at the end to give us an answer to what is in the image, using all the features built in the feature learning stage. 46. What are 3 types of spatial pooling that can be used?: So what are the three types of spatial pooling that can be used? Pooling layers are used to reduce the number of parameters when the images are too large. Spatial pooling, also called sub-sampling or down-sampling, reduces the dimensionality of each feature map but retains the important information. Spatial pooling can be of three different types: max pooling, average pooling, and sum pooling. Max pooling takes the largest element from the rectified feature map, average pooling takes the average of all the elements in the feature map, and sum pooling takes the sum of all the elements in the feature map. So we can see here we have our single depth slice with our x and y axes, and a max pool with a 2x2 filter and a stride of two will have this output. Here we can see that this red colour contains these boxes, so we're basically dividing the input up into four different boxes. For max pooling, the highest value in this box is six, the highest value here is eight, here it is three, and this value here is four. That's basically how this sub-sampling or down-sampling works: we just choose whatever is the highest value in each of these individual boxes of our 2x2 filter with a stride of two. And for average pooling or sum pooling, we would instead take the average of these values or the sum of the values. So it's quite straightforward, but these are the three spatial pooling techniques that are typically used. 47. What is the stride in convolutional layers?: So in this question we're asked what the stride in convolutional layers is. The stride is the number of pixels the filter shifts over the input matrix. When the stride is one, we move the filters one pixel at a time; when the stride is two, we move the filters two pixels at a time, and so on. The figure below shows how this would work with a stride of two. You can see we have the four colours here: this red colour, then a yellow colour, a blue colour, and a green colour, and you can see the output of each of these filter positions. We start here and then we move on to this yellow position here, so we're shifting over two blocks each time, which is a stride of two. You can see this is also happening for the lower rows down here: we have this blue region and our green region, and again we shift over two blocks each time before we calculate the pooling, which can be average pooling, max pooling, or sum pooling. So the stride is basically how much your filter moves over each time.
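As a quick illustration of how a 2x2 pooling window with a stride of two behaves, here is a small NumPy sketch. The 4x4 input array is an assumed example, chosen so that the block maxima come out as 6, 8, 3 and 4 like the figure above; the same loop gives max, average and sum pooling just by swapping the reduction function.

```python
import numpy as np

def pool2d(x, size=2, stride=2, reduce=np.max):
    """Apply a pooling window of `size` x `size` with the given stride."""
    h, w = x.shape
    out = np.zeros(((h - size) // stride + 1, (w - size) // stride + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = x[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = reduce(window)
    return out

x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])

print(pool2d(x, reduce=np.max))   # [[6. 8.] [3. 4.]]  -> max pooling
print(pool2d(x, reduce=np.mean))  # average pooling
print(pool2d(x, reduce=np.sum))   # sum pooling
```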
48. What are vanishing and exploding gradients?: So in this question we're asked what vanishing and exploding gradients are. In a network of n hidden layers, n derivatives will be multiplied together. If the derivatives are large, the gradient will increase exponentially as we propagate back through the model until it eventually explodes, and this is what we call the problem of exploding gradients. You can imagine that as these derivatives keep getting multiplied together and are all fairly large, the product just becomes too large, and your model can't converge because the weight updates become too big. Alternatively, if the derivatives are small, the gradient will decrease exponentially as we propagate back through the model until it eventually vanishes, and this is the problem of vanishing gradients. This is the exact opposite problem: if our derivatives are very small, then as we keep multiplying them together they become increasingly small. The problem we get here is that as our models get deeper and deeper, it becomes more difficult to update the earlier layers of our network (those furthest from the output), because the amount we change them by becomes very, very small; it vanishes. So we're not getting useful features built into those layers, and that's what the vanishing gradient problem is. 49. What are 4 possible solutions to vanishing and exploding gradients?: So this question builds on the previous question on exploding and vanishing gradients: what are four possible solutions to vanishing and exploding gradients? First of all, we can reduce the number of layers, and this solution can be used in both scenarios, exploding and vanishing gradients. However, by reducing the number of layers in our network we give up some of our model's complexity, since having more layers makes it more capable of representing complex mappings, so we may compromise some accuracy by doing this. The second solution, which works for tackling exploding gradients, is gradient clipping. The idea of gradient clipping is very simple: if the gradient gets too large, we rescale it to keep it small. You can imagine that as our gradients reach some very large value, we basically set a maximum value that the gradient can reach, and once it exceeds this maximum we rescale it to keep it small. The third solution that we have for vanishing and exploding gradients is weight regularization: another approach is to check the size of the weights and apply a penalty to the network's loss function for very large weights. This is called weight regularization, and often an L1 (absolute weights) or L2 (squared weights) penalty can be used; this is the L1 and L2 regularization that we've had questions about before in the machine learning section. And the final way is to use long short-term memory networks, also called LSTMs. In recurrent neural networks, gradient explosion can occur given the inherent instability in the training of this type of network via backpropagation through time, which essentially transforms the recurrent network into a deep multi-layer perceptron. So exploding gradients can be reduced by using LSTM (long short-term memory) units and perhaps related gated-type neuron structures such as GRUs.
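To see the effect numerically, and to illustrate the gradient clipping idea from the second solution, here is a small NumPy sketch; the repeated-derivative values and the max-norm threshold are arbitrary numbers chosen purely for illustration.

```python
import numpy as np

# Multiplying many derivatives < 1 shrinks the gradient towards zero (vanishing),
# while multiplying many derivatives > 1 blows it up (exploding).
print(np.prod(np.full(50, 0.5)))  # ~8.9e-16 -> vanishing
print(np.prod(np.full(50, 1.5)))  # ~6.4e+08 -> exploding

def clip_by_norm(grad, max_norm=5.0):
    """Gradient clipping: rescale the gradient if its L2 norm exceeds max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([30.0, -40.0])  # an exploding gradient with norm 50
print(clip_by_norm(g))       # rescaled to norm 5, direction preserved
```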
So adopting LSTM memory units is now best practice for recurrent neural networks for sequence prediction, and LSTMs have become very popular for sequence prediction tasks. As well, gated-type neuron structures such as GRUs, which are slightly simpler structures, have also been quite popular in recent years for sequence prediction. So these are four possible solutions we can use for the vanishing and exploding gradient problems. 50. What is dropout for neural networks? What effect does dropout have?: So in this question we're asked what dropout is for neural networks and what effect dropout has. Dropout is a regularization technique where randomly selected neurons are ignored during training: they are dropped out at random. This means that their contribution to the activation of downstream neurons is temporarily removed on the forward pass, and weight updates are not applied to those neurons on the backward pass. As a neural network learns, neuron weights settle into their context within the network: the weights of neurons are tuned for specific features, providing some specialization. Neighbouring neurons can become reliant on this specialization, which, if taken too far, can result in a fragile model that is too specialized to the training data. This reliance on context for a neuron during training is referred to as complex co-adaptation. The effect of dropout is that the network becomes less sensitive to the specific weights of individual neurons, and this in turn results in a network that is better capable of generalization and less likely to overfit the training data. And so we can see here the big difference we get by applying dropout. We have our standard neural network, a fully connected network where each neuron is connected to the neurons in the next layer, and in this case all the weights are being used in the forward pass and the backward pass and we're fully utilizing each of these neurons, but it can become quite fragile: maybe only one of the neurons here is able to detect faces. In the dropped-out network, we drop some of these neurons at random, so if this was the only neuron capable of detecting faces, suddenly it's not available, and when we feed in an image, do backpropagation, and find that our image wasn't classified correctly, some of the other neurons are going to have to learn to detect faces as well. This really helps our model become more robust and less likely to overfit on the training data. Dropout is a technique used in a lot of different deep neural network models, and this is the reason why: it helps us to stop overfitting on our training data, which can be quite a big problem in deep neural network architectures.
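Here is a minimal sketch of the dropout mechanism just described, assuming a drop probability of 0.5 and a made-up activation vector. During training, a random binary mask zeroes out neurons (with the usual inverted-dropout rescaling so the expected activation stays the same); at inference time the layer is left untouched.

```python
import numpy as np

def dropout(activations, p_drop=0.5, training=True):
    """Inverted dropout: zero out a random subset of neurons during training."""
    if not training:
        return activations                      # no dropout at inference time
    mask = (np.random.rand(*activations.shape) > p_drop).astype(float)
    return activations * mask / (1.0 - p_drop)  # rescale so the expected value is unchanged

a = np.array([0.2, 1.5, -0.3, 0.8, 2.1])
print(dropout(a, p_drop=0.5))  # roughly half the values are zeroed on each call
```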
51. Why do segmentation CNNs typically have an encoder-decoder style / structure?: So why do segmentation CNNs typically have an encoder-decoder style structure? The encoder CNN can basically be thought of as a feature extraction network, while the decoder uses that information to predict the image segments by decoding the features and upscaling back to the original image size. If you don't know what segmentation is, it's where we have an input image and we're trying to segment it into different regions. So you can see here that this zebra crossing gets identified as a separate feature, we have our road area here, which is segmented as a different feature, we have our pedestrian walking area here, which is also segmented, and then we have our cars, which are segmented here, and the people, who are segmented in yellow. The typical structure we'll use for this is a convolutional encoder-decoder network. We take these input images, and we have our convolutional layers, plus batch normalization and ReLU, and then a pooling layer in each of these stages. Through these stages we eventually get to this point here, where we have a latent representation of the input, which is basically just the hidden features we're going to use to represent this image. Then we upsample the image back to its original size using these latent features, and finally we have this softmax layer. The softmax layer, as we've seen before, basically tells us what the output should be, and in this case the output is the classification of which segment each part of the image belongs to. So again our road has been classified into one segment, our footpath into another segment, we have a segmentation here for the sky area, it's missing a bit of detail on the trees and the building here, but we can also see that the cars have been segmented quite well. So that's why we use the encoder-decoder style for our segmentation problems. 52. What is batch normalization and why does it work?: So what is batch normalization and why does it work? Training deep neural networks is complicated by the fact that the distribution of each layer's inputs changes during training as the parameters of the previous layers change. The idea is to normalize the inputs of a layer in such a way that they have a mean activation of 0 and a standard deviation of 1. This is done for each individual mini-batch at each layer: we compute the mean and the variance of that mini-batch alone, then we normalize. This is analogous to how the inputs to networks are standardized, where we might do some preprocessing on our data so that the inputs are standardized, which can help the network to learn faster. So how does this help? We know that normalizing the inputs to a network helps it to learn. A network is just a series of layers, where the output of one layer becomes the input to the next, which means we can think of any layer in a neural network as the first layer of a smaller subsequent network. Thought of as a series of neural networks feeding into each other, we normalize the output of one layer before applying the activation function, and then feed it into the following layer or subnetwork.
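Here is a minimal NumPy sketch of that per-mini-batch normalization step, using a made-up 4x3 batch. Real batch norm layers also learn a per-feature scale (gamma) and shift (beta) and keep running statistics for inference; the gamma and beta below are included as assumptions for illustration rather than anything specific to the course material.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift."""
    mean = x.mean(axis=0)                  # per-feature mean over the batch
    var = x.var(axis=0)                    # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta            # learned scale and shift

batch = np.array([[1.0, 200.0, 0.1],
                  [2.0, 180.0, 0.3],
                  [3.0, 220.0, 0.2],
                  [4.0, 210.0, 0.4]])
out = batch_norm(batch, gamma=np.ones(3), beta=np.zeros(3))
print(out.mean(axis=0))  # ~0 for every feature
print(out.std(axis=0))   # ~1 for every feature
```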
53. Why would you use many small convolutional kernels such as 3x3 rather than a few large ones?: So why would you use many small convolutional kernels, such as 3x3, rather than a few large ones? There are two main reasons for this. First of all, you can use several smaller kernels rather than a few large ones to get the same receptive field and capture more spatial context, but with the smaller kernels you are using fewer parameters and computations. Secondly, because with smaller kernels you can use more filters, you'll be able to use more activation functions and thus have a more discriminative mapping function learned by your CNN. 54. What is the idea behind GANs?: So in this question we're asked what the idea behind GANs is. GANs, or generative adversarial networks, consist of two networks, D and G, where D is the discriminator network and G is the generative network. The goal is to create data, in this case images, for instance, which are indistinguishable from real images. Suppose we want to create an adversarial example of a cat: the network G will generate images, and the network D will then classify these images according to whether they are a cat or not. The cost function of G is constructed such that it tries to fool D, so G wants D to classify its output as always being a cat. We can see here in this bottom figure a more drawn-out example, where the green arrows are the forward propagation and the red arrows are the backward propagation. We have our two networks, our generative network here and our discriminative network here. First of all, we input random values, and the generative network is trained to maximize the final classification error, so it's trying to fool our discriminative network; the generated distribution and the true distribution are not compared directly. It's the job of the discriminator network to be trained to minimize the final classification error, and so the classification error is the basis for training both networks. One of the really interesting results of this is the generation of fake images, and there's a really interesting website called this person does not exist.com, which is very aptly named. You can see here that this image looks extremely realistic, but it's actually been generated by a generative adversarial network, and we can reload it a couple of times, and in each case we'll get a totally different person who looks absolutely photo-realistic. If you were to see this image, you would say it looked like a real human. So you can create hugely impressive results by using generative adversarial networks to create these images, and they can also be used for creating a number of other different kinds of images, such as images of animals. 55. Why we generally use Softmax non-linearity function as last operation in-network?: So in this question we're asked why we generally use the softmax non-linearity function as the last operation in a network. The reason is that it takes in a vector of real numbers, and these numbers can be positive, negative, whatever, with no constraints, and it returns a probability distribution. The formula for the softmax function is as follows: we apply the standard exponential function to each element of the output layer and then normalize these values by dividing by the sum of all the exponentials, so softmax(z)_i = exp(z_i) / sum_j exp(z_j), where the sigma symbol is the sum over all values from j equal to one up to the number of outputs. Doing so ensures that the sum of all of the exponentiated values adds up to one, and it should be clear that the output is a probability distribution: each element is non-negative and the sum over all the components is one.
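A tiny NumPy sketch of that formula, using the example output values discussed next; the exact logits are assumed from the figure, and the max is subtracted first purely for numerical stability.

```python
import numpy as np

def softmax(z):
    """Exponentiate each element and normalize by the sum of the exponentials."""
    e = np.exp(z - np.max(z))  # subtracting the max avoids overflow; the result is unchanged
    return e / e.sum()

logits = np.array([1.3, 5.1, 2.2, 0.7, 1.1])  # raw output-layer values (assumed example)
probs = softmax(logits)
print(np.round(probs, 2))  # -> [0.02 0.9  0.05 0.01 0.02], sums to 1
```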
And so if we look at our output layer and we have values of, say, 1.3, 5.1, 2.2, 0.7 and 1.1, and we apply the softmax activation function, taking the exponential of each of these values over the sum of the exponentials, you can see the probabilities that we would get, and this is what we would expect. Our value here of 5.1, which is much greater than the other values, has a probability of 0.9, and we can see that our lowest value, 0.7, has the lowest probability, and the values which were also quite low, 1.1 and 1.3, likewise have quite low probabilities. So you can see there's quite an intuitive mapping between using the softmax function and the probabilities that we get. 56. What is the following activation function? What are the advantages and disadvantages of this activation function?: So in this question we're asked about the following activation function, and what the advantages and disadvantages of this activation function are. The activation function we're talking about is here: you can see we have our y-axis, where the values are between 0 and 1, and our x-axis, where 0 is centred here and we go from minus six to six. This is a sigmoid, or logistic, activation function, and you can tell by the smooth gradient, it being equal to 0.5 at 0 and then going all the way up and approaching one. The first advantage we get with the sigmoid or logistic function is that it has a smooth gradient, so it prevents jumps in the output values; as you can see, it's quite a smooth gradient from the minimum to the maximum values. Another advantage is that the output values are bounded: as you can see, they are bounded between 0 and 1, so it normalizes the output of each neuron, and you can see here we gradually approach one. The final advantage of a sigmoid or logistic activation function is that for x above about 2 or below about minus 2, it tends to bring the y value, which is the prediction, to the edge of the curve, very close to one or zero. This enables clear predictions, so we can get quite confident results. One of the disadvantages of sigmoid or logistic functions is the vanishing gradient: for high or low values of x, there is almost no change to the prediction, causing a vanishing gradient problem, and this can result in the network refusing to learn further or being slow to reach an accurate prediction. As you can see, as we get out to values of minus six, or even minus four, the gradient comes very close to zero, which can lead to the vanishing gradient problem and our network refusing to learn. Another disadvantage of this activation function is that the outputs are not zero-centred: you can see that our centre is around 0.5, as opposed to being centred around 0. And the final disadvantage is that it's quite computationally expensive compared to some other activation functions, which we will see later on. 57. What is the following activation function? What are the advantages and disadvantages of this activation function?: So in this question we're asked about the following activation function here, and we're also asked what the advantages and disadvantages of this activation function are. You can see our activation function here is bounded by minus 10 to 10 on the x-axis and 0 to 10 on the y-axis.
So this activation function is a ReLU, or rectified linear unit. One of the advantages of this activation function is that it is computationally efficient, which allows the network to converge very quickly, and it is one of the reasons why the ReLU activation function is very popular in convolutional neural networks. A second advantage is that it is non-linear: although it looks like a linear function, the ReLU has a derivative function and allows for backpropagation, and this is very important for deep learning, to backpropagate the error through the network. One of the disadvantages it has, however, is the dying ReLU problem: when the inputs approach zero or are negative, the gradient of the function becomes zero, so the network cannot perform backpropagation for those units and cannot learn. As you can see here, there is no slope in this part of the plot, meaning the gradient is zero, so we can't backpropagate the errors and we can't learn anything, which is quite a limitation for some applications. But the advantages are that it's very computationally efficient and it's also non-linear. 58. What is the following activation function? What are the advantages and disadvantages of this activation function?: So in this question we're asked about the following activation function located here, and what the advantages and disadvantages of this activation function are. We can see that it's quite similar to the previous activation function we were looking at, and that's because it builds on top of it. This is a leaky ReLU, as opposed to the standard ReLU activation function, and you can see the change that has been made here: as opposed to the activation function going to zero when the x value is zero or negative, we are able to return some small value which can be used in backpropagation, and this prevents the dying ReLU problem. This variation of the ReLU has a small positive slope in the negative area, so it does enable backpropagation even for negative input values; otherwise, it behaves like the ReLU, so it is piecewise linear and also quite computationally efficient. One of the disadvantages of this variant is that the results are not consistent: leaky ReLU does not provide consistent predictions for negative input values, and this can be quite a disadvantage depending on the particular application of your neural network and whether it needs to have consistent predictions. That can be very important, say, for medical applications where you need consistent predictions, for example in breast cancer screening. And this is the function we can use to create the leaky ReLU, where we take the max of 0.1 multiplied by x, and x.
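Here is a small NumPy sketch of the three activation functions discussed in the last few questions, so the shapes and the dying-ReLU issue are easy to inspect; the 0.1 slope for the leaky ReLU follows the max(0.1x, x) definition above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # smooth, bounded between 0 and 1, not zero-centred

def relu(x):
    return np.maximum(0.0, x)         # zero (and zero gradient) for all negative inputs

def leaky_relu(x, slope=0.1):
    return np.maximum(slope * x, x)   # small positive slope for negative inputs

x = np.array([-6.0, -2.0, 0.0, 2.0, 6.0])
print(sigmoid(x))      # ~[0.0025 0.119 0.5 0.881 0.9975]
print(relu(x))         # [0. 0. 0. 2. 6.]
print(leaky_relu(x))   # [-0.6 -0.2  0.   2.   6. ]
```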
59. What is backpropagation and how does it work?: So what is backpropagation and how does it work? After a neural network is defined with its initial weights, a forward pass is performed to generate the initial prediction, and there is an error function which defines how far away the model is from the true prediction. There are many possible algorithms that can minimize the error function; for example, one could do a brute-force search over the weights to find those that generate the smallest error. However, for large neural networks, a training algorithm is needed that is very computationally efficient, and backpropagation is that algorithm: it discovers good weights relatively quickly, even for a network with millions of different weights. And so here we can get an overall flavour of the different stages in backpropagation. We have the input layer, our hidden layers, and our output layer, and there are four distinct stages that I'll go through here. First we have our forward pass: this is where our weights are initialized and the inputs from the training set are fed into the network; the forward pass is carried out and the model generates its initial prediction. Then this initial prediction is combined with the error function. The error function is computed by checking how far away the prediction is from the known true value, and once we have our known true value and our prediction, we can calculate how much we were off by, or the difference from our desired goal, which is captured here. Then we can backpropagate through each layer of the network, which is backpropagation with gradient descent. The backpropagation algorithm calculates how much the output values are affected by each of the weights in the model. To do this, it calculates partial derivatives, going back from the error function to a specific neuron and its weight; this provides complete traceability from the total error back to a specific weight and its contribution to that error. The result of backpropagation is a set of weight updates that reduce the error function. And finally we have the fourth stage, the weight update. The weights can be updated after every sample in the training set, but this is usually not practical; typically a batch of samples is run in one big forward pass and then backpropagation is performed on the aggregate results. The batch size and the number of batches used in training, called iterations, are important hyperparameters that can be tuned to get the best results. Running the training set once through the backpropagation process is called an epoch, which we saw in one of our previous questions. 60. What are the common hyperparameters related to neural network structure?: So this is a pretty straightforward question, in which we're asked what the common hyperparameters related to neural network structure are. You can give a number of different answers, but some of the common ones you would typically be expected to give are: the number of hidden layers you want to have in your neural network structure; whether the network contains dropout, where you could also mention the specific parameters you set dropout to, so the probability of each particular neuron being dropped out, and whether you have varying levels of dropout in each individual layer; the activation function, since we've gone through a number of different examples of activation functions, from sigmoid to ReLU, and those can be set as different choices; and finally your weight initialization, so whether the weights were all set to a common value, randomly initialized, or initialized using some particular function.
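These structural hyperparameters are easiest to see in code. Here is a minimal sketch, assuming the Keras API and arbitrary choices (two hidden layers of 64 units, ReLU activations, a 0.3 dropout rate, and He-normal weight initialization) purely to show where each hyperparameter lives.

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Dense(64, activation="relu", input_shape=(20,),  # hidden layer size + activation function
                 kernel_initializer="he_normal"),           # weight initialization scheme
    layers.Dropout(0.3),                                     # dropout probability for this layer
    layers.Dense(64, activation="relu",
                 kernel_initializer="he_normal"),            # second hidden layer
    layers.Dropout(0.3),
    layers.Dense(1, activation="sigmoid"),                   # output layer
])
model.summary()
```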
61. What are the common hyperparameters related to training neural networks?: So in this final question on hyperparameters, we're asked what the four methods of hyperparameter tuning are. The first method is manual hyperparameter tuning. An experienced operator can guess the parameter values that will achieve very high accuracy; this is mainly a result of trial and error, and if an operator has spent a very large amount of time training and tuning neural networks and has a lot of experience on one particular type of problem, say a computer vision problem, then they will usually have a pretty good guess as to what the parameters will be that achieve very high accuracy, or at least be able to put bounds on them that reduce the total amount of space that has to be searched. Another method we can use for hyperparameter tuning is grid search. This involves systematically testing multiple values of each hyperparameter and retraining the model for each combination. Basically, we create a grid of values between lower and upper bounds, and we systematically search through each of these values for each of the parameters. For example, we may have a dropout value between 0.2 and 0.7, and we would systematically iterate through each of the different values, so 0.3, 0.4, 0.5, 0.6, and see how the hyperparameter affects the results of our neural network. The third option is random search. Using random hyperparameter values can actually be much more efficient than manual or grid search, as it can lead us to explore values we may not have considered before, which might actually improve performance. The final method we can use for hyperparameter tuning is Bayesian optimization. This trains the model with different hyperparameter values over and over again and tries to observe the shape of the function generated by the different parameter values; it then extends this function to predict the best possible values. This method provides higher accuracy than random search, and Bayesian optimization has been shown to be quite effective in a number of different applications, so it's quite a good method for hyperparameter tuning. 62. What are 4 methods of hyperparameter tuning?: Okay, so in this question we're looking at different neural network architectures, and we're going to use a number of different cells in these architectures to make sure we understand the main characteristics of different neural network architectures. You can see the cells are broken up into a number of groups: we have our input cells, the different types of hidden cells, our output cells, then different types of recurrent and memory cells, and then we also have some kernel and convolutional layers. The first type of cells are the input cells. We can have the classic input cell, which is just this little circle; then we can have backfed input cells, which are the circle with a smaller black circle inside; and then we can also have noisy input cells, which is the yellow circle with a triangle inside of it. We can also have the hidden cells, which are denoted by this green circle: the classic hidden cell is just the green circle, then we can have probabilistic hidden cells, and the spiking hidden cell, which is a green circle with a triangle inside of it, and then a capsule cell, which is the circle with a black square inside of it. We can also then have the output cell, which is this red circle, and a red circle with a circle inside of it, which is the matched input-output cell; these can be used in autoencoders.
We then go on to different types of recurrent cells and memory cells, which are indicated by the blue circle. The memory cell is the blue circle with a black circle inside of it, and we also have the gated memory cell, which is the blue circle with a black triangle inside of it. And then we have the kernel and the convolutional or pooling layers. So all of these are the different types of cells that can be used to create deep neural networks, and we're going to see how they can be combined in some of these architectures. The first architecture we're going to look at is this classic architecture here, and it is basically one of the most simple architectures you can have in deep neural networks: a feed-forward neural network. You can see here we have the most simple cells, so basically we have our input cells, our hidden cell, and our output cells, and we're not using any of the other refinements of these cells, such as probabilistic hidden cells or spiking hidden cells; it's just the classic input cell, hidden cell, and output cell. These are very simple neural networks that feed information from front to back. The simplest practical application has two input cells and one output cell, and such networks can be used to model logic gates, so logic gates like AND and OR. We usually train feed-forward networks through backpropagation, giving the network paired datasets of what goes in and what we want to have coming out; this would be true or false in the case of the AND and OR gates. The error being back-propagated is often some variation of the difference between the input and the output, such as the mean squared error or the linear difference. Given that the network has enough hidden neurons, it can theoretically model the relationship between the input and output. Practically, their use is a lot more limited now, but they can be combined with other networks to form new neural networks. For example, you might have this same input and output, but multiple different hidden layers, and those might in turn contain multiple different hidden layers, so it would be a much more complicated architecture where this hidden node maybe represents 15 to 20 other hidden nodes combined in some architecture. Just having one hidden node is a bit simplistic compared to some of the state of the art, but they were one of the first approaches that were proposed. 63. Using the above neural network key, state the name of the following network and give some basic info: So in this question we're looking at this architecture here, and we can see that we have an input cell, some recurrent cells, and multiple different output cells. So this architecture is a recurrent neural network. Recurrent neural networks are sort of like the simple feed-forward neural networks, except they have a time twist: they are not stateless, they have connections between passes, so they have connections through time. Neurons are fed information not just from the previous layer but also from themselves in the previous pass. This means that the order in which you feed the input when training the network matters: feeding it "milk" and then "cookies" may yield different results compared to feeding it "cookies" and then "milk". One big problem with recurrent neural networks is the vanishing or exploding gradient problem, where, depending on the activation functions used, information rapidly gets lost over time.
And we've talked about this in previous questions, along with some of the solutions we can use to deal with it, and we will see some of the architectures that help with it, LSTMs and GRUs, in the upcoming questions. So just as very deep feed-forward networks lose information with depth, these networks can lose information when you go back over a long period of time. Intuitively, this wouldn't seem to be much of a problem, because these are just weights and not neuron states, but the weights through time are actually where the information from the past is stored: if the weights reach a value of 0 or 1,000,000, the previous states won't carry much information. RNNs can in principle be used in many different fields, because most forms of data that don't actually have a timeline (unlike, say, sound or video) can be represented as a sequence: a picture or a string of text can be fed one pixel or one character at a time. So the time-dependent weights are used for what comes before in the sequence, not for what happened x seconds before. You can see how we can turn a sequence of words into this sort of temporal pattern, where we identify the next character based on the previous characters. In general, recurrent neural networks are a very good choice for advancing or completing information, such as autocompletion, because you're reading new characters all the time and, based on that, you can update what you think the word is going to be. So you can see how it differs from the classic feed-forward neural network architecture: you still have these input cells and output cells, but now we have these time-dependent recurrent cells that feed back into each other through time, which is the main difference between the recurrent neural network and the classic network architecture. And we'll see in the next few questions the different architectures that we can use to take this temporal information into account, which also solve some of the problems of exploding and vanishing gradients. 64. Using the above neural network key, state the name of the following network and give some basic info: So in this question we're dealing with this architecture here. We can see it has the classic input cells and output cells, and in the middle we have these memory cells. We can see how it's slightly different to the previous architecture, which was the recurrent neural network: instead of just having recurrent cells, we have these memory cells, and that's the difference which leads to it being an LSTM, or long short-term memory network, as opposed to just a recurrent neural network. LSTMs, or long short-term memory networks, try to combat the vanishing and exploding gradient problem by introducing gates and an explicitly defined memory cell. These are inspired mostly by circuitry and not so much by biology. Each neuron has a memory cell and three gates: the input, the output, and the forget gate. The function of these gates is to safeguard the information by stopping or allowing the flow of it, and this helps to stop the gradients from becoming either too big or too small. The input gate determines how much of the information from the previous layer gets stored in the cell, and the output gate takes the job on the other end and determines how much of the next layer gets to know about the state of the cell.
The forget gate seems like an odd inclusion at first, but sometimes it's good to forget: if the network is learning a book and a new chapter begins, it may be necessary to forget some characters from the previous chapter, as maybe the story is going to change quite a lot now. The main applications for LSTMs are in complex sequences, such as writing Shakespeare-like text or composing primitive music. They can also be used in time series forecasting, where we want to forecast the next number in a sequence, say if we want to predict the stock market or the temperature the next day. Note that each of these gates has a weight to a cell in the previous neuron, so they typically require more resources to run. 65. Using the above neural network key, state the name of the following network and give some basic info: So in this question we're dealing with this architecture here. We have our input cells and output cells, and in the middle we have these gated memory cells, as opposed to the recurrent cells we had for our recurrent neural networks, or the memory cells we had in the previous question. You can see all of these architectures are quite similar, and it's just the cells in the middle that get changed: whether they're gated memory cells, recurrent cells, or memory cells. The gated memory cells are what give us this architecture, which is gated recurrent units, or GRUs. These are a slight variation on LSTMs, and you can see the architecture for all of these time-dependent neural networks is very similar; it's just the cell in the middle that changes, and even the difference between LSTM memory cells and gated recurrent memory cells is just a very slight variation. They have one fewer gate and are wired slightly differently: instead of having an input, output, and forget gate, they have an update gate. This update gate determines both how much information to keep from the last state and how much information to let in from the previous layer. The reset gate functions much like the forget gate of an LSTM, but it's located slightly differently. GRUs always send out their full state; they don't have an output gate. In most cases they function very similarly to LSTMs, with the biggest difference being that GRUs are slightly faster and easier to run, but also slightly less expressive. In practice this tends to cancel out, as you need a bigger network to regain some expressiveness, which then cancels out the performance benefits. In some cases where the extra expressiveness is not needed, GRUs can outperform LSTMs. All of these network architectures are used in pretty similar applications, usually time-dependent sequences where you want to do some time series forecasting or some autocompletion, and they're essentially just different variations of doing the same thing. You can talk about how the architectures are similar, and that it's basically these different cells in the middle of the network architectures that are the main difference between the traditional recurrent neural network, LSTMs, and GRUs.
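A minimal sketch of that "only the middle cell changes" point, assuming the Keras API, an arbitrary sequence length of 30 with one feature, and a single-value forecast output; swapping SimpleRNN for LSTM or GRU is the only line that differs between the three architectures.

```python
import tensorflow as tf
from tensorflow.keras import layers

def make_model(cell="lstm"):
    """Same input and output layers; only the recurrent cell in the middle changes."""
    recurrent = {"rnn": layers.SimpleRNN, "lstm": layers.LSTM, "gru": layers.GRU}[cell]
    return tf.keras.Sequential([
        recurrent(32, input_shape=(30, 1)),  # 30 time steps, 1 feature per step
        layers.Dense(1),                     # forecast the next value in the sequence
    ])

for cell in ["rnn", "lstm", "gru"]:
    print(cell, make_model(cell).count_params())  # LSTM/GRU have more parameters (their gates)
```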
And you can talk about the slight variations and trade-offs you have when using a GRU versus an LSTM: GRUs are slightly faster and easier to run, but also slightly less expressive. That's why I would say LSTMs are probably the most popular architecture out of these three, because they tend to be a bit more expressive, and as hardware keeps getting better, being slightly faster and easier to run isn't as important as it was before. 66. Using the above neural network key, state the name of the following network and give some basic info: So in this question we're talking about this architecture here, and we can see that this is quite different to the architectures we've been talking about before, the time-based architectures where we had recurrent cells or cells feeding back into each other through time, like the RNNs, the LSTMs, and the GRUs. In this case we have our input cells and our output cells, and we also have these hidden cells in the middle. You can see that there are a lot fewer hidden cells than there are in the input and output layers: in the input layer we have four of these cells, and in the hidden layer we only have two. So we can see that we're going to be doing some dimensionality reduction in the middle here, and then we're going to be trying to reconstruct the data that we had in the input cells. And we can see that these are not just the standard output cells we had previously: they are the same red, sort of orangey colour, but they have a circle inside them, which means it's a matched input-output cell. So this is an autoencoder architecture, where we're trying to basically match the output to the input that we received, and create this hidden layer, this latent layer, which is essentially the dimensionality reduction from which we're able to then reconstruct the same output. Autoencoders are somewhat similar to feed-forward neural networks; autoencoders are more like a different use of feed-forward neural networks than a fundamentally different architecture. The basic idea behind autoencoders is to encode information, as in compress, not encrypt, automatically, and that's where they derive their name from. The entire network resembles an hourglass shape; you can see here this sort of hourglass shape on its side, where this middle part is where the sand would be passing through. Autoencoders are also symmetric around the middle layer or layers (one or two, depending on an even or odd number of layers). So you can have multiple hidden layers in this architecture; for example, we could have three different hidden layers, so in the first layer we would have three different cells, which I'll just draw in a different colour, and then you need to match that on the opposite side, so you would need to have three different cells there too, and then you would have these two hidden layers in the middle. So you don't need to have just one hidden layer in the encoder network; you can have multiple different layers, but they need to be matched on either side. The smallest layer is always in the middle: this is the place where the information is most compressed, often described as the choke point of the network. Everything up to the middle part is called the encoding part, everything after the middle is the decoding part, and the middle is the code.
So everything on this side is your encoding layer, and everything on the other side is your decoding layer. We can train autoencoders using backpropagation by feeding in the input and setting the error to be the difference between the input and what came out: basically, we're trying to reconstruct on the output side exactly what we fed into the network, and then we can use this latent representation. Autoencoders can also be built symmetrically when it comes to weights, so the encoding weights are the same as the decoding weights. Autoencoders have a range of different applications: dimensionality reduction, image compression, image denoising, feature extraction, image generation, and they are also used in recommendation systems. So it's a very popular architecture, and it can be combined with a number of different neural network architectures; say, if you wanted to do some dimensionality reduction first and then maybe do some time series forecasting based on that, you might combine an autoencoder with a recurrent neural network or an LSTM-based architecture. 67. Using the above neural network key, state the name of the following network and give some basic info: So in this question we're talking about this architecture here, where we have our input and output cells, and we also have our hidden cells, in this case probabilistic hidden cells, as opposed to the traditional hidden cells that we had in our autoencoder architecture. We can see that the architecture is quite similar to the previous question on autoencoders, the only differences being the probabilistic hidden cells here, the green circles with the circle inside them, and that we have slightly more cells in the hidden part of our network, so it's not quite the same hourglass shape. So these are a slight variation on the traditional autoencoder architecture, and that's where they get their name from: these are variational autoencoders, or VAEs. They basically have the same architecture as autoencoders, but they're taught something else: an approximate probability distribution of the input samples. They rely on Bayesian mathematics regarding probabilistic inference and independence, as well as a reparameterization trick that is used to achieve this different representation. The inference and independence parts make sense intuitively, but they rely on somewhat complex mathematics. The basics come down to this: take influence into account. If one thing happens in one place and something else happens somewhere else, they are not necessarily related, and if they are not related, the error propagation should consider this. This is where the Bayesian mathematics comes in, and it's a useful approach because neural networks are, in a way, large graphs, so it helps if you can rule out the influence from some nodes as you go into the deeper layers. Variational autoencoders have become very popular quite recently because they are generative networks, so they can be used to generate a lot of new data, which can be things such as images, new songs, or even text. One of their most recent popular applications has been generating new faces from a dataset of faces, and giving those faces different emotional expressions. They can also be used for molecule design in drug discovery, and they have also been used to compose music by feeding in previously recorded songs.
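To tie the last two questions back to code, here is a minimal sketch of the hourglass encoder-code-decoder structure, assuming the Keras API, a 784-dimensional input, and a 32-dimensional code; a variational autoencoder would additionally make the code layer output a mean and a variance and sample from that distribution, which is omitted here.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Hourglass shape: wide input -> narrow code (the choke point) -> wide reconstruction.
autoencoder = tf.keras.Sequential([
    layers.Dense(128, activation="relu", input_shape=(784,)),  # encoder
    layers.Dense(32, activation="relu"),                       # the code / latent layer
    layers.Dense(128, activation="relu"),                      # decoder
    layers.Dense(784, activation="sigmoid"),                   # reconstruction of the input
])
# Trained to reproduce its own input, so the target equals the input:
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(x_train, x_train, epochs=10, batch_size=256)  # x_train assumed to exist
```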
And you can see how the applications of variational autoencoders might grow in the future: if you're a band that wants to record a new song, you could feed all your previous songs into this neural network architecture, and it comes up with, say, a hundred different candidate songs; you play through all of those and see if it has come up with something interesting, and then you continue to develop the most interesting songs in a sort of evolution-based approach, where you discard all the bad songs and keep pursuing the good ones, letting the variational autoencoder do most of the creating for you. 68. Using the above neural network key, state the name of the following network and give some basic info: So in this question we're talking about this architecture here, and you can see it's quite an interesting architecture that is very different to some of the architectures we've looked at before, such as the recurrent neural networks or the autoencoders, where you have a fairly simple mapping of your input cells, some hidden cells (possibly with a time component), and then your output cells. In this case we have these backfed input cells, and each cell is connected to every other cell in the network. You can see this generates a huge number of connections: this cell is connected to this cell, and this cell, and this cell, and the same goes for every other cell in the network. These networks are called Hopfield networks. They were much more popular in the eighties and have lost popularity in recent years, due to being so resource-consuming and due to the success of other types of architectures. A Hopfield network is a network where every neuron is connected to every other neuron; it's a completely entangled plate of spaghetti, and all of the nodes function as everything: each node is an input before training, hidden during training, and an output afterwards. The networks are trained by setting the value of the neurons to the desired pattern, after which the weights can be computed, and the weights do not change after this. Once trained for one or more patterns, the network will always converge to one of the learned patterns, because the network is only stable in those states. Each neuron has an activation threshold which scales to a temperature value; if the threshold is surpassed by the sum of the inputs, the neuron takes the form of one of two states, usually minus one or plus one, or sometimes zero or one. Updating the network can be done synchronously or, more commonly, one by one. If updated one by one, a fair random sequence is created to decide which cells update in which order (fair random meaning all options occur exactly once every n items). This is so you can tell when the network is stable, that is, done converging: once every cell has been updated and none of them changed, the network is stable, or annealed. These networks are often associated with memory because they converge to the most similar state to the input: if humans see half a table, we can imagine the other half, and this network will converge to a table if presented with half noise and half a table. Hopfield networks are a form of associative memory, just like the human mind, and basically the network is initially trained to store a number of different patterns.
And then it's able to recognize any of those learned patterns when exposed to partial or even corrupted information. The common applications are those where pattern recognition is useful: Hopfield networks have been used for image detection and recognition, enhancement of X-ray images, medical image restoration, and applications like that. They are not used as much now, due to the popularity of other network architectures, such as autoencoders and convolutional neural networks for image detection and recognition, where the convolutional layers and pooling layers reduce this huge number of connections, which has led to much improved accuracy in these particular applications. But they were big in the eighties, and it's important to understand the previous generation of architectures that led to convolutional neural networks and some of the architectures we use today. 69. Using the above neural network key, state the name of the following network and give some basic info: So in this question we're discussing this architecture here. We can see that it's quite similar to the architecture we were talking about in the previous question, which was the Hopfield network architecture, but in this case we have different labels on the cells: we have a group of input cells here in yellow, and a group of hidden cells here in green. This network architecture is called a Boltzmann machine, or BM for short. Boltzmann machines are a lot like Hopfield networks, but some neurons are marked as input neurons while others remain hidden. The input neurons become output neurons at the end of a full network update. It starts with random weights and learns through backpropagation, or more recently through contrastive divergence, where a Markov chain is used to determine the gradients between two informational gains. Compared to Hopfield networks, the neurons mostly have binary activation patterns, and, as hinted by being trained with a Markov chain, Boltzmann machines are stochastic networks. The training and running process of a Boltzmann machine is fairly similar to that of a Hopfield network: one sets the input neurons to certain values, after which the network is set free. While the network is free, the cells can take on any value, and we repeatedly go back and forth between the input and hidden neurons. The activation is controlled by a global temperature value which, when lowered, lowers the energy of the cells; this lower energy causes their activation patterns to stabilize, so the network reaches an equilibrium given the right temperature. Boltzmann machines with unconstrained connectivity have not proven to be useful for practical problems in machine learning or inference, but if the connectivity is properly constrained, the learning can be made efficient enough to be useful for practical problems, such as combinatorial optimization. And we can see here that, much like Hopfield networks, every node is connected to the other nodes within the architecture, which means that the number of connections, and the number of weights you have to train, grows quadratically as you add more nodes. One of the applications they had been used for was image recognition and classification.
But we can see how convolutional neural networks, through their pooling layers, have been able to reduce the dimensionality, and they also have a lot of nice properties, like the translation invariance that we talked about before. And so that's why these types of architectures aren't used that often now and have been replaced by architectures like convolutional neural networks. But we will also talk about restricted Boltzmann machines in the next question, which only have connections between the different types of nodes. So there are no connections between the hidden-to-hidden nodes, as you have in the traditional Boltzmann machines, and there are also no connections between the input-to-input cells. So there are only connections between different layers.

70. Using the above neural network key, state the name of the following network and give some basic info: In this question, we're looking at this architecture here. And we can see the similarity between our previous architecture, which was the Boltzmann machine, and this architecture here, where we have our input cells and our hidden cells, and there are multiple connections between each of them. So we can see that this input cell here is connected to all of these different hidden cells. But the only difference between them is that there are no connections between the input-to-input cells, and in the hidden cells here, there are no connections between these hidden-to-hidden cells. So this type of architecture is called a restricted Boltzmann machine. It's similar to the Boltzmann machine architecture, except that it is restricted in that there are no connections between the cells within each layer. Restricted Boltzmann machines are remarkably similar to Boltzmann machines, which isn't a big surprise, and therefore they're also similar to Hopfield networks. The biggest difference between Boltzmann machines and restricted Boltzmann machines is that restricted Boltzmann machines are more usable because they're more restricted. So in restricted Boltzmann machines, there are only connections between the hidden and visible units, and none between units of the same type. That's as I showed previously: there are no hidden-to-hidden or visible-to-visible connections. So RBMs can be trained like feedforward neural networks, with a twist: instead of passing data forward and then backpropagating, you forward pass the data and then backward pass the data back to the first layer, and after that you train with forward and backward propagation. So RBMs have found useful applications in a number of areas, and they can be used for dimensionality reduction, classification, collaborative filtering, feature learning, topic modeling, and even many-body quantum mechanics problems. They can be trained in either a supervised or unsupervised way depending on the task.

71. Using the above neural network key, state the name of the following network and give some basic info: So in this question, we're talking about this architecture here. And you can see here that we have our input nodes and we have these kernels, and then we have a series of convolutional layers, and then we have a series of these hidden layers, and finally we have our output layers as well. So we can see that this is forming a traditional convolutional neural network, a CNN, or a deep convolutional neural network, a DCNN; these two terms are used interchangeably.
And so you can see that we basically have two different stages: our feature extraction, where we're using all these different convolutional layers, and then our classification, where we have these more fully connected neurons, and these are used to take all of the different features that we've extracted through these different convolutional steps and do some image classification or image recognition. So convolutional neural networks (CNNs), or deep convolutional neural networks (DCNNs), are quite different from most other types of networks. They are primarily used for image processing, but can also be used for other types of inputs such as audio, and they've also been used in some text applications as well. A typical use case for CNNs is where you feed the network images and the network classifies the data. For example, you might expect it to output 'cat' if you give it an input picture of a cat, and 'dog' when you give it an input picture of a dog. So CNNs tend to start with an input scanner, which is not intended to parse all the training data at once. For example, to input an image of 200 by 200 pixels, you wouldn't want a layer with 40,000 nodes. Rather, you create a scanning input layer of, say, 20 by 20, and you feed in the first 20 by 20 pixels of the image, usually starting in the upper left corner. Once you've passed that input (and possibly used it for training), you feed in the next 20 by 20 pixels, and in this case you're moving the scanner one pixel to the right. Note that you wouldn't move the input 20 pixels (or whatever the scanner width is) over; you're not dissecting the image into different blocks of 20 by 20, but rather you're crawling over it. And we covered in previous questions things like the step size and the different widths of this scanner. So the input data is then fed through the convolutional layers instead of normal layers, where not all nodes are connected to all the other nodes. You can see the difference here in the connections between these convolutional layers, where we're doing our feature extraction, and the fully connected hidden and output layers here, where we're doing our classification. So each node only concerns itself with close neighboring cells, and how close depends on the implementation, but it's usually not more than a few. These convolutional layers also tend to shrink as they become deeper, so you can see here that our network is tending to shrink in size as it's becoming deeper and deeper. And this is mostly by an easily divisible factor of the input, so 20 would go to a layer of ten followed by a layer of five. This is why powers of two are commonly used here, as they can be divided cleanly and completely by definition, so we might have an input of 32 and then the next layers would be 16, 8, 4, 2, 1. And besides these convolutional layers, they also often feature pooling layers. Pooling is a way to filter out the details; a commonly found pooling technique is max pooling, where we take, say, two-by-two pixels and pass on the pixel with the most amount of red. To apply CNNs to audio, you basically feed in the input audio waves over the length of the clip, segment by segment. So real-world implementations of CNNs often glue a fully connected feedforward neural network to the end to process the data, which allows for non-linear abstractions. These networks are called DCNNs, but the names and abbreviations between the two are often used interchangeably; a minimal sketch of this kind of architecture follows below.
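Putting the two stages described above together, here is a minimal sketch of that kind of architecture in PyTorch, assuming small 32-by-32 grayscale inputs and ten output classes purely for illustration (the class name and layer sizes are my own, not from the course): convolution and max-pooling layers for feature extraction, followed by fully connected layers for classification, with the spatial size shrinking as the network gets deeper.

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    """Illustrative CNN: convolution + pooling for feature extraction,
    fully connected layers for classification."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),   # 32x32 -> 32x32
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 32x32 -> 16x16
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # 16x16 -> 16x16
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 16x16 -> 8x8
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 8 * 8, 64),
            nn.ReLU(),
            nn.Linear(64, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = SmallCNN()
dummy_batch = torch.randn(4, 1, 32, 32)   # batch of 4 fake grayscale images
print(model(dummy_batch).shape)           # torch.Size([4, 10])
```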
So the typical applications for CNNs or DCNNs are in image and video recognition, recommender systems, image classification, medical image analysis, natural language processing, and brain-computer interfaces. So we can see that they've been used in a variety of applications and have become very popular in recent years.

72. Using the above neural network key, state the name of the following network and give some basic info: So in this question, we're talking about this architecture here. It is quite similar to the previous recurrent neural networks that we've talked about: we have our input cells, our traditional output cells, and our recurrent cells. But we can see that the way that they're connected is very different to the previous architectures that we've looked at. So even in the previous convolutional neural network, you can see that there are very distinct layers and we have connections between each of those layers in a very organized pattern. But if we look at this architecture, which is an echo state network, we can see that the connections are not as organized. So, for example, this node here is connected to this node and this node, but we can see this node, which is in the same layer, is not connected to either of those nodes; it's connected to this node here and this node here. So there's quite a variation in which nodes are connected to which other nodes, which is not the same as the previous architectures that we've looked at, where they've each been connected in a very set pattern to each of the layers. And this was the same for all of the architectures that we looked at, including the previous recurrent neural networks and the CNNs. And you can say that this is just a variant on these architectures, with different connection types. So basically, echo state networks are yet another type of recurrent network. This one sets itself apart by having random connections between the neurons, so they're not organized into a neat set of layers, and they are trained differently. Instead of feeding the input and backpropagating the error, we feed the input, forward it and update the neurons for a while, and observe the output over time. So the input and output layers have a slightly unconventional role, as the input layer is used to prime the network and the output layer acts as an observer of the activation patterns that unfold over time. So during training, only the connections between the observer and the soup of hidden units are changed. And so this can be used for a number of traditional time series applications, and echo state networks have been used for time series data mining, time series forecasting, and sequence prediction, so sort of similar to recurrent neural networks and LSTMs. One area where they've been found to be quite good, and have even shown some improvements over LSTMs and RNNs, is predicting very dynamic sequences over time. So if you have a time series that changes quite a lot, then echo state networks can show improved performance compared to RNNs and LSTMs. And they also have quite a reduced training time compared to these architectures, because only the connections to the output observer are trained rather than all of the connections between the layers; a minimal sketch of this training setup follows below.
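Since echo state networks differ mainly in how they are trained, here is a minimal sketch, assuming a randomly connected reservoir whose internal weights stay fixed and a readout fitted by ordinary least squares (the toy sine-wave task, the reservoir size and the scaling values are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_reservoir(n_reservoir, spectral_radius=0.9):
    """Random, fixed internal connections, rescaled so the dynamics stay stable."""
    W = rng.normal(size=(n_reservoir, n_reservoir))
    W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))
    return W

def run_reservoir(W, W_in, inputs):
    """Drive the reservoir with the input sequence and record its states."""
    x = np.zeros(W.shape[0])
    states = []
    for u in inputs:
        x = np.tanh(W_in * u + W @ x)
        states.append(x.copy())
    return np.array(states)

# Toy task: one-step-ahead prediction of a sine wave.
t = np.arange(300)
signal = np.sin(0.2 * t)
n_res = 50
W = make_reservoir(n_res)
W_in = rng.normal(size=n_res)

states = run_reservoir(W, W_in, signal[:-1])
# Only the readout (the 'observer') is trained; the reservoir weights never change.
W_out, *_ = np.linalg.lstsq(states, signal[1:], rcond=None)
pred = states @ W_out
print(np.mean((pred - signal[1:]) ** 2))   # mean squared error of the prediction
```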
73. Using the above neural network key, state the name of the following network and give some basic info: So in this question, we're discussing this architecture here, where we have our input nodes, then a series of hidden nodes, then an output node, which is a matched input-output node; and then we have another series in our network, where we have these hidden nodes here, so another series of hidden nodes, and then we have our output nodes. And so you can see that there are two different stages in this network. And this is generally called a generative adversarial network, or a GAN. So generative adversarial networks are a different breed of network: they are essentially twins, two networks acting together. GANs consist of two networks, often a combination of feedforward and convolutional neural networks, with one tasked to generate content and the other to judge the content. And you can see how they can work together to make sure each network is achieving good results, because if you have one network that's generating content and another that's judging content, then you're able to get feedback in each iteration as to whether the content is good and also as to whether you're judging correctly. So the discriminative network receives either training data or generated content from the generative network. How well the discriminating network was able to correctly predict the data source is then used as part of the error for the generating network. So if the generating network is able to fool the discriminative network into thinking that it was getting the actual training data, then the generative network is working quite well. But if the discriminative network is always able to identify the source, then the generative network needs to improve on those samples it can identify. This creates a form of competition, where the discriminator is getting better at distinguishing real data from generated data, and the generator is learning to become less predictable to the discriminator. This works well in part because even quite complex noise-like patterns eventually become predictable, but generated content similar in features to the input data is harder to learn to distinguish. So GANs can be quite difficult to train, as you don't just have to train two networks, either of which can pose its own problems, as we saw with just RNNs or LSTMs, but their dynamics need to be balanced as well. If prediction or generation becomes too good compared to the other, a GAN won't converge, as there is an intrinsic divergence; and if the discriminator never learns to actually discriminate the output of the generative network, then the generative network will never improve. A minimal sketch of this generator-versus-discriminator training loop is shown at the end of this question. So GANs have a number of applications, including generating examples for image datasets, generating photographs of human faces, generating cartoon characters, text-to-image translation, semantic image to photo translation, generating new human poses, photos to emojis, video prediction, and 3D object generation. They're also used in image modification. So here we can take an input image and add different characteristics to this image. If we take the top row here, we can see that we can add blond hair, we can change the gender, and we can change whether we want this person to be aged or to have pale skin. And we can also do different things like changing the emotion.
So in this case, we have an input image here and we want to make this person look angry, so it sort of furrows the brows; and we can also make him look happy, so the eyes are more open and there's a bigger smile compared to the angry one; and you can also make them look fearful as well. So GANs are used in a lot of generation tasks, where we want to either generate some new images or generate some examples for a different dataset.
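To tie the generator-versus-discriminator idea above to something concrete, here is a minimal sketch of the GAN training loop in PyTorch on a toy one-dimensional data distribution. All of the network sizes, the learning rate and the toy Gaussian data are assumptions purely for illustration; this is not the setup behind the face-editing examples discussed above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy 'real' data: samples from a Gaussian the generator should learn to mimic.
def real_batch(batch_size=64):
    return torch.randn(batch_size, 1) * 0.5 + 2.0

generator = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
discriminator = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))

loss_fn = nn.BCEWithLogitsLoss()
g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)

for step in range(2000):
    # 1) Train the discriminator to tell real data from generated data.
    real = real_batch()
    fake = generator(torch.randn(64, 8)).detach()   # don't update G on this step
    d_loss = (loss_fn(discriminator(real), torch.ones(64, 1)) +
              loss_fn(discriminator(fake), torch.zeros(64, 1)))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # 2) Train the generator to fool the discriminator.
    fake = generator(torch.randn(64, 8))
    g_loss = loss_fn(discriminator(fake), torch.ones(64, 1))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()

# The generated samples should drift towards the real mean (around 2.0).
print(generator(torch.randn(1000, 8)).mean().item())
```

The two alternating steps are exactly the competition described above: the discriminator is pushed to separate real from generated samples, while the generator is pushed to make that separation harder, and the whole thing only works if neither side gets too far ahead of the other.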