Transcripts
1. Introduction: Artificial intelligence is one of the hottest fields at the moment, and people who work in this domain usually earn high salaries, but it is really hard to get in. We have designed this course to teach you how to crack the AI interview. We will go through many possible questions that were actually asked during real interviews and try to solve them together, with as few technicalities as possible. Our goal is to make the course accessible even for a beginner who is interested in working in AI. The course is structured in multiple parts. We will tackle interview questions from diverse topics, including data science, artificial intelligence, deep learning, and machine learning. We will also separate the discussion into three levels of difficulty: entry-level, medium, and hard. It is important to mention that almost all of the questions were collected from multiple open sources, and some of them are actually proposed by me, a postgraduate student who is currently studying and working in AI. Okay, let's get to work. This is how the storyboard looks. In the first class, we shall discuss some general questions that will appear regardless of your specialization. Then, as I said before, we will tackle questions from data science, artificial intelligence, deep learning, and machine learning, at the three levels of difficulty. I really think this will be useful to anyone attending a job interview in this field. See you in class.
2. General Questions: First question from the general set: what is the difference between data science, AI, deep learning, and machine learning? Don't they refer to the same thing? This is a very popular question, so you need to get it right. You have to identify that this is a relationship of inclusion: deep learning is a part of machine learning, which is a part of AI, which is generally considered a part of the data science field. The main idea behind machine learning is to find patterns in data and learn from the data. In deep learning, the model is usually represented by an artificial neural network, structured to mimic the human brain and effective for feature detection: it tries to extract the features itself and to adapt itself based on them. AI is a vast domain that incorporates not only machine learning but other fields like robotics, expert systems, and natural language processing. All of these form what we call data science, the science of learning from data. Second question from the general set: what are some limitations of deep learning? This question has the purpose of testing how wide your field of view is regarding a particular domain of data science. Deep learning requires large amounts of training data, can be easily fooled, and, most important of all, its success is purely empirical; deep learning models are often criticized as uninterpretable black boxes. Third question: how would you explain machine learning to a random person on the street? With this question, the interviewer wants to know to what extent you are able to explain complicated things in a simple manner. This skill is important in teamwork, as you may need to explain or clarify some technical things to your colleagues. Imagine you go to a car show with a friend. There are dozens of cars, plenty to look at. Your friend asks you what type of car you like from there.
Since you don't have immediate technical knowledge about all of them, you will classify them on the basis of some particular features, like the brand, the color of the car, the look of the car, or the type of the car. This way, you are actually devising clusters of cars with similar characteristics, without prior knowledge, grouping them together to find particular subsets that you really like. This is unsupervised learning. To immediately explain supervised learning as well, imagine that you cannot see the brand of a car and have to guess it based on the same features that you devised earlier. But you have a teacher, your friend, who tells you if you are right or wrong. This is supervised learning. Final topic: tell me about a cool AI project you developed. It is important for your employer to get a sense of how well you can present or explain past projects. Good communication is an important skill to have at every company. Be sure to say when you started working on the project, its duration, and the final product or results. Mention whether it was a personal project or a team project. Was it a research paper? Is your work on this project published somewhere? When talking in detail about the project, be sure not to go too in-depth regarding the technical details; still, it is important to mention which technologies you used, like the programming language or frameworks. You can end the presentation by emphasizing the things you learned while completing the project and some fun facts or particular results you got from it.
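The car analogy above can be sketched in a few lines of Python. The feature values, the grouping rule, and the guessing rule below are all invented for illustration; real clustering and classification would use proper algorithms rather than hand-written rules.

```python
# Toy illustration of the car analogy (made-up data).
cars = [
    {"brand": "A", "color": "red",  "type": "sports"},
    {"brand": "A", "color": "red",  "type": "sedan"},
    {"brand": "B", "color": "blue", "type": "sports"},
    {"brand": "B", "color": "blue", "type": "sedan"},
]

# "Unsupervised": group cars by a feature, with no labels involved.
clusters = {}
for car in cars:
    clusters.setdefault(car["color"], []).append(car["type"])
print(clusters)  # two groups of cars, one per color

# "Supervised": guess the hidden brand from the color; the "teacher"
# (the true labels) tells us how many guesses were right.
def guess_brand(car):
    return "A" if car["color"] == "red" else "B"

correct = sum(guess_brand(c) == c["brand"] for c in cars)
print(f"teacher says: {correct}/{len(cars)} correct")
```

In the unsupervised part, no brand information is ever consulted; in the supervised part, the true brands act as the teacher giving feedback.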
3. Data Science: Easy Questions: Hi, welcome to the data science part of this course, where we are going to discuss easy interview questions. In total, we will have nine different questions here. Let's start. The first question is: what do you understand by selection bias? When dealing with such questions, the interviewer will always appreciate it if you can explain the concept in your own words alongside giving a textbook definition. So what exactly is selection bias? Well, the definition describes it as a statistical error that introduces a bias in the sampling portion of an experiment. The way I would describe this statistical error is by imagining a model, represented by a survey, that is wrong in predicting the winner of an international football competition. It so happens that the majority of the people who took the survey used to gather the data were from a specific country that participated in the competition, so the results are biased toward that particular class of people, who obviously thought that their team would win. Next question: what do you understand by precision and recall? These two terms are critical in assessing a model's performance, so you must get this right. Recall is the ratio of the number of events you can correctly recall to the number of all correct events. However, if you recall more events than the true number, we can see how that formula is not that precise, as we will still get 100%. That is why precision was introduced: this formula gives the percentage of the recalled events that are actually true positives. As an example, let's say your teacher asks how many times you were late for class. The correct answer is three. However, you remember only two, so your recall is 66%, but your precision is 100%, a perfect score, because you did not have any false positives in your answer. In the same way, if you answer five times instead of three, your recall will be 100%, but your precision decreases to three out of five, that is, 60%. Next question: have you heard of the F1 score? Why is it used?
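The late-for-class example can be checked numerically. Here is a minimal sketch in plain Python; the event names are made up, and sets stand in for the lists of recalled and actual events.

```python
def precision_recall(recalled, actual):
    """Precision: fraction of recalled events that are correct.
    Recall: fraction of actual events that were recalled."""
    true_positives = len(recalled & actual)
    return true_positives / len(recalled), true_positives / len(actual)

actual = {"late_1", "late_2", "late_3"}   # you were really late three times

# You remember only two of them: perfect precision, recall 2/3.
p_two, r_two = precision_recall({"late_1", "late_2"}, actual)

# You answer five times, including two made-up ones: recall 100%, precision 3/5.
p_five, r_five = precision_recall({"late_1", "late_2", "late_3", "a", "b"}, actual)

print(p_two, r_two)    # 1.0 and about 0.66
print(p_five, r_five)  # 0.6 and 1.0
```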
This may come as a follow-up to the previous question. The F1 score is mathematically defined as the harmonic mean of precision and recall. In order to address the utility of such an expression, you need to bring up the simple average score, which is the arithmetic mean of precision and recall. We can easily observe that a classifier with a precision of 1 and a recall of 0 would have a simple average of 0.5, but an F1 score of 0. Therefore, it follows that the F1 score really punishes extreme values of either recall or precision. We use this metric because sometimes we might want to maximize the recall at the expense of precision. As an example, let's say that there is an infectious disease going around, and we want to correctly identify all the patients for a follow-up examination. In this case, our goal would be a recall near 100%, because it is absolutely critical to find all the patients who actually have the disease. Next question: what is a confusion matrix? Here, it is a good idea to start with the definition and the general representation. It is a table used for measuring the performance of a classification algorithm. A general representation for a binary classifier is presented in the figure on the left. It is enough to draw this in order to assure the interviewer that you really know what you are talking about. As an example, you can draw the table from the figure on the right. In this confusion matrix, out of the eight actual cats, the model predicted that three were dogs, and out of the five actual dogs, it predicted that two were cats. All correct predictions are located on the diagonal of the table, highlighted in bold, so it is easy to visually inspect the table for prediction errors. This also helps to compute the precision and recall. Next question: what is the difference between inductive and deductive learning? Here, a good example can nail this topic. Luckily, we found one on the web.
Let's imagine we want to reach the conclusion that fire burns things, and we want to teach that to somebody using some videos on the matter. We can show him multiple examples that prove how fire burns many objects. From these observations, one can make a generalization and assume that fire burns almost anything. That is inductive learning. Now, suppose instead that the person already knows the rule that fire burns things; from this conclusion, or rule, he can deduce that he, like everyone else, will get burned if he touches the fire. So inductive learning means that from observations we get a conclusion, while in deductive learning, from a conclusion we derive observations. The interviewer may want a more abstract approach, specific to data science. That's why you can draw this diagram: from data, with induction, we obtain a model, and from this model, with deduction, we get the prediction. Next question: what is collinearity, and does collinearity in a dataset affect the decision tree? Collinearity happens when two predictor variables, X1 and X2, in a multiple regression context have some correlation. To better understand this in relational algebra terms, we can say that two features are correlated if there is a functional dependency between them, for example, date of birth and age: from the birth date, we can easily determine the age. Collinearity means that the regression coefficients are not uniquely determined. In turn, it hurts the interpretability of the model, since the regression coefficients are not unique and have influences from other features. But it generally does not affect the predictive power of the model. That's why, when applying linear regression, you may want to use two different models: one for prediction and one for interpretation of the data. For the second part of the question: collinearity will not affect the predictions from decision trees. It can only affect the interpretability of the model, or the ability to make inferences based on the results. Next question: what is cluster sampling?
Even if you don't know the answer to this question at first, it is really easy to deduce what this term might mean. Sampling refers to selecting a subset of individuals from a statistical population. This is usually done in order to estimate features of the whole population more efficiently, because we work on less data. Clustering refers to grouping a set of individuals in such a way that the individuals from a particular group are more similar to each other than to those from the other groups. As an example, clustering appears in universities, as students choose to take only specific classes that are of some interest to them and hang out mostly with like-minded colleagues. Therefore, cluster sampling refers to selecting intact groups sharing similar characteristics within the defined population, in order to reduce the complexity of discovering hidden characteristics. Next topic: give me some examples of recommendation systems. Here, if you can give some popular methods of recommendation together with some concrete examples, it should be enough. One of them would be collaborative filtering; that is what web stores like Amazon use to recommend new products to buy. Another one can be content-based filtering; this is actually what Netflix does to recommend TV shows that you might like. Next topic: name some Python libraries for data analysis and scientific computation. Again, a straightforward answer will suffice. Name at least three or four from NumPy, SciPy, pandas, scikit-learn, Matplotlib, Bokeh, and Seaborn. The interviewer might also like to hear which of those are for a quick analysis and which can be used for more in-depth work. For quick analysis, Matplotlib is generally used; for publishing and presentation, use Bokeh; and for in-depth analysis, Seaborn would be an excellent choice. You have reached the end of this section. Usually at the end of a section, I will leave a question related to the material for the project.
This question is correlated with some that we have already discussed here. So: what is the difference between type I and type II errors? And as a follow-up: is it better to have too many type I errors or too many type II errors in a solution? Leave your answers in the project section. I will give a hint, though: it is related to the questions we had about true positive and false negative rates. Thank you for your attention.
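Before moving on: the cat/dog confusion matrix discussed in this section translates directly into code. Here is a minimal sketch in plain Python, with precision, recall, and the F1 score computed from the table; the counts are the ones from the example.

```python
# The cat/dog confusion matrix from this section: rows are actual
# classes, columns are predicted classes.
labels = ["cat", "dog"]
cm = {("cat", "cat"): 5, ("cat", "dog"): 3,   # 8 actual cats, 3 mistaken for dogs
      ("dog", "cat"): 2, ("dog", "dog"): 3}   # 5 actual dogs, 2 mistaken for cats

def precision(label):
    predicted = sum(cm[(a, label)] for a in labels)   # column sum
    return cm[(label, label)] / predicted

def recall(label):
    actual = sum(cm[(label, p)] for p in labels)      # row sum
    return cm[(label, label)] / actual

p_cat, r_cat = precision("cat"), recall("cat")
f1_cat = 2 * p_cat * r_cat / (p_cat + r_cat)          # harmonic mean
print(p_cat, r_cat, f1_cat)   # 5/7, 5/8, and their harmonic mean
```

Note how both metrics come straight from the diagonal entry divided by a row or column sum, which is exactly why the table makes them easy to inspect visually.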
4. Data Science: Medium Questions: Hello, welcome to this section, where we handle medium-difficulty questions. In total, we will discuss eight questions, and we will leave one out for the project. Let's begin. First topic: explain the difference between likelihood and probability. Follow-up: explain maximum likelihood. In everyday life, probability and likelihood mean the same thing, but in statistics they are different. When dealing with such probability and statistics questions, it is recommended to always pick a distribution before trying to devise an example to answer the question; the data distribution is very important. Okay, let's take a fixed normal distribution and try to explain these two terms on it. Mu is the mean, or expectation. If we increase the value of mu here, the curve, which is actually called the bell curve, will shift to the right, and if we decrease it, the curve will shift to the left. Sigma is the standard deviation; it governs how wide the bell is. Probability is the area under the curve of this fixed distribution: P(data | distribution). On the other hand, when talking about likelihood, the distribution is the variable part of the formula and the data is fixed: L(distribution | data). The likelihoods are the y-axis values for fixed data points, while the distribution can be moved. For example, the probability that we pick a number between 1 and 2 is this area. But say we have already picked 3: what is the likelihood of that happening given the current distribution? Well, it is this point over here, about 0.1. But what if we had this other normal distribution? It would be over 0.3. For the follow-up: we are given a set of observation points, random numbers that we picked. Maximum likelihood wants to find the perfect fit to that set of points within a type of distribution. We will stay with our normal distribution. We need a metric to find how well a distribution fits our set of points. Does this one look good? Nope.
But how about this one? Well, this seems much better. But why? It all comes down to the likelihood of our observations actually being observed. As all the observations are independent, we want to maximize this product over the mean and standard deviation, because these two solely define a normal distribution. Can you intuitively guess what the maximum likelihood values are for mu and sigma? For mu, the average of all the data points may be a good approximation, but for sigma, the true value gets a little more complicated. The interviewer might be satisfied with such an answer, but they might ask for a way to derive these expressions, in which case we need a little more math. This is a maximization problem in two variables, so we will need to find the critical points. That's why we need to compute the partial derivatives of our likelihood function over the set of observations. Taking the log of the likelihood will strongly help in computing the derivatives. Computing these is not hard; you need to know some basic calculus. Setting these two to 0, we will get our desired results. Second question: how do you know you are not overfitting with a model? Careful: this question does not ask for any overfitting prevention methods. It is about how you detect an overfitting problem. Overfitting occurs when we have low bias but high variance, and it can usually be observed in the train-loss to test-loss ratio. A small loss on the train dataset and a much higher loss on the test dataset is a clear sign of overfitting, and it generally causes great performance in the training phase followed by bad performance in the real world. Third question: how would you solve overfitting? I will name a few of the methods that help reduce overfitting in general. You can't go wrong with collecting more data or generating more data through augmentation. You can also try ensembling methods that average models. You can try using simpler models.
Regularization has proven itself to also reduce overfitting in many cases. You can also try cross-validation, or reducing the dimensionality through principal component analysis or autoencoders. Early stopping or using dropout also helps. Fourth question: how would you prevent overfitting? Many of the answers provided at the previous question still apply here. But you should remember that when you overfit, you actually capture the noise of the data, so using any method that reduces variance is a strategy for preventing that from happening. Fifth question: what is bias and what is variance? Bias refers to the simplifying assumptions made by a model to make the target function easier to learn. For example, in linear regression we do have high bias, because we assume a linear relationship, no autocorrelation, and multivariate normality. Variance is the amount by which the estimate of the target function would change if different training data were used. For example, suppose your whole dataset comes from data within your country: how much will your model change if we use the same kind of data but from a different country? Sixth question: suppose you found that your model is suffering from low bias and high variance. What algorithms do you think could tackle the situation, and why? With this question, the interviewer wants to know if you correctly deduce what it means to have low bias but high variance: it means overfitting, of course. Low bias occurs when the model's predicted values are near the actual values. To tackle a high variance problem, we can use a bagging algorithm, or we can lower model complexity by using regularization. Most probably the previous model was a non-linear one, and with all the variables in the dataset, the algorithm is facing difficulty in finding a meaningful signal. Another solution would be to use only the top n features from a variable importance chart. Seventh question: how do you screen for outliers, and what would you do in case you find one?
You can identify outliers by using linear model fitting or probabilistic and statistical models. For the second part, it depends: if they are not important to the data, you can just drop them. You can use winsorizing: this means setting a cap for the max and min of how far an outlier can go, and then resetting the value to that max or min if you find an extreme case. You can also use discriminative learning, like a Naive Bayes classifier, where outliers are better handled. Question eight: what are the differences between the standard error of the mean and the standard deviation? Well, if we look back at this picture, the standard deviation measures the amount of variability, or dispersion, of the individual data values around the mean. Now, for the standard error of the mean, let's take a random distribution, as we can see here. If we take 16 random numbers from this distribution and average them, 10 thousand times, we will obtain another distribution, called the distribution of the means. We are interested to see what standard deviation this distribution has. But not only for 16 samples and 10 thousand iterations: we want to estimate how the sample mean of the data compares to the true population mean. Well, the standard error of the mean answers exactly this question. It turns out that the standard deviation of the sampling distribution of the sample mean, which is called the standard error of the mean, is sigma over the square root of n, where n is the number of samples we extract from the original distribution. So if we take 16 samples over 10 thousand iterations, with sigma equal to 9.3, that distribution's standard deviation would be 9.3 over 4, which is 2.325, close to the value of 2.33 measured in the simulation. Obviously, the higher the number of iterations, the more precise the approximation. Why is it useful, though?
Well, if we do not know what the original data distribution looks like, at least we can approximate its standard deviation with the sampling method, following the standard error of the mean formula. You have reached the end of this section. Usually at the end of a section, I will leave a question related to the material for the project. This question is correlated with some that we have already discussed here. So: explain the bias-variance tradeoff in random forests. Leave your answers in the project section. I will give a hint: think about how every tree in this ensemble is built and how the features are managed. Thank you for your attention.
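As a closing illustration, the standard-error experiment from question eight can be simulated in a few lines. Sigma = 9.3, n = 16, and the 10 thousand iterations follow the numbers used above; the mean of 0 and the random seed are arbitrary choices for the sketch.

```python
import random
import statistics

random.seed(42)
sigma, n, iterations = 9.3, 16, 10_000

# Draw n samples from N(0, sigma), average them, and repeat many times
# to build the distribution of the means.
means = [statistics.fmean(random.gauss(0, sigma) for _ in range(n))
         for _ in range(iterations)]

observed_sem = statistics.stdev(means)       # measured from the simulation
theoretical_sem = sigma / n ** 0.5           # sigma / sqrt(n) = 2.325
print(observed_sem, theoretical_sem)
```

With more iterations, the measured value converges toward the sigma-over-root-n formula.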
5. Data Science: Hard Questions: Hello, welcome to the final section of data science interview preparation, where we handle hard questions. In total, we will discuss four questions, and we will leave one out for the project. Let's begin. First question: you are working with a time series dataset. Your goal is to build a high-accuracy model. You start with a decision tree algorithm, since you know it works fairly well on all sorts of data. Later, you try a time-series regression model and get higher accuracy. Can this happen? Why? The answer is yes, it can happen. This question has the purpose of testing that you know that linear regression works best for time series data fitting. Time-series data is often based on linearity, whereas the decision tree is known to work best on non-linear interactions, so it is more likely to capture the noise of the data. A linear regression model can provide a more robust prediction if the dataset satisfies its linearity assumptions. Second question: you are given a dataset. The dataset contains many variables, some of which are highly correlated, and you know about it. Your manager has asked you to run a principal component analysis. Would you remove the correlated variables? And as a follow-up question: what are the main steps of a principal component analysis? You have probably heard this question before; it is a known one. You have to know that PCA, principal component analysis, is an unsupervised, non-parametric statistical technique primarily used for dimensionality reduction. PCA actually assumes correlation in the data: it is because the feature set is usually correlated that the reduced feature set can effectively represent the original dataset. Therefore, the answer is no. Discarding the correlated features would have a substantial effect on the PCA, because in the presence of correlated variables, the variance explained by a particular component gets inflated. For the follow-up, you can mention the main steps like this: normalize the data.
Then create a covariance matrix, decompose it, and select the optimal number of principal components. Third question: explain the L1 and L2 regularization techniques, the differences between them, and the purpose they are used for. First of all, regularization is a technique to discourage the complexity of the model. We want this in order to reduce the variance and increase the bias, hence solving overfitting. It does this by penalizing the loss function, on the assumption that smaller weights generate simpler models. Let us take a typical loss function, the sum of squared errors. Here, theta is the vector of our parameters that we need to tune, and the optimization algorithm tries to minimize this expression. So, if we also want the weights to be as close to 0 as possible in order to make the model not overfit the noise of the data, what do we need to add here to be sure that that happens? Well, we can't just add the sum of the entries, because we don't want the loss to go to minus infinity. We could add the sum of the absolute values of theta, or the sum of the squares of the entries of theta. This is actually what regularization does, with the L1 and L2 norms. Using L1 regularization, we add the sum of absolute values multiplied by a regularization parameter, lambda, that determines how much we penalize the weights. On the other hand, using L2 regularization, we add the sum of squared values multiplied by the regularization parameter. Observe that when we make lambda close to 0, we actually don't punish the weights at all, and the loss function becomes the original one. But if lambda is large, then we force the weights to be really close to 0. The differences between these two are plenty. Both of them force the model parameters to be small, but the L1 norm has built-in feature selection: it assigns insignificant input features a weight of exactly 0 and useful features a non-zero weight. This is a direct consequence of using the absolute value instead of the squared value.
Meanwhile, the L2 norm does not make them 0, and thus gives a non-sparse solution. Also, the L2 norm is not robust to outliers, whereas L1 is robust to outliers but cannot learn complex patterns. Fourth topic: you are given a dataset consisting of variables having more than 30% missing values. How do you deal with them? Another very popular question. You can always remove them, if and only if they are not too important. Alternatively, we can check their distribution against the target variable, and if we find any pattern, we keep those missing values and assign them a new category while removing the others. Also, you might be able to infer the missing values by looking at similar variables in the dataset. You have reached the end of this section. Usually at the end of a section, I will leave a question related to the material for the project. This question is correlated with some that we have already discussed here. So: how can you tell that time series data is stationary? Leave your answers in the project section. I will give a hint: think about how the variance and the mean of the series change over time. Thank you for your attention.
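To close the regularization discussion: the difference between L1's exact zeros and L2's mere shrinking can be shown numerically. The two update rules below are the standard closed-form minimizers of the one-dimensional penalized problems (soft thresholding for L1, uniform rescaling for L2); the example weights and the lambda value are invented for illustration.

```python
def l1_shrink(w, lam):
    # argmin over x of 0.5 * (x - w)**2 + lam * |x|  ->  soft thresholding
    return max(abs(w) - lam, 0.0) * (1.0 if w > 0 else -1.0)

def l2_shrink(w, lam):
    # argmin over x of 0.5 * (x - w)**2 + lam * x**2  ->  uniform rescaling
    return w / (1.0 + 2.0 * lam)

weights = [3.0, 0.4, -0.2]   # one strong weight, two near-noise weights
lam = 0.5
l1_result = [l1_shrink(w, lam) for w in weights]
l2_result = [l2_shrink(w, lam) for w in weights]
print(l1_result)  # the small weights become exactly 0: built-in feature selection
print(l2_result)  # every weight shrinks, but none reaches 0
```

This is the absolute-value versus squared-value consequence mentioned above: the L1 penalty has a constant pull that snaps small weights to exactly zero, while the L2 penalty only rescales them.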
6. AI: Easy Questions: Hello, welcome to this section, where we handle easy AI interview questions. We will only discuss three questions here, and there will be no problem for the project in this section. Let's begin. First topic: name the different domains of AI. Straightforward answer; you do not need to give many details here: machine learning, deep learning, robotics, expert systems, fuzzy logic, and natural language processing. Second topic: explain the assessment that is used to test the intelligence of a machine. Here, the interviewer probably wants to know if you have heard about the Turing test, which is a method of inquiry for determining whether or not a computer is capable of thinking like a human. It is done with three terminals, each of which is physically separated from the other two. One terminal is operated by a computer, while the other two are operated by humans. During the test, one of the humans functions as the questioner, while the second human and the computer function as respondents. The questioner talks to both of them over multiple iterations, and at the end of each discussion, he is asked to decide which is which. In the long run, if the questioner makes the correct determination only about 50% of the time, the machine can be considered artificially intelligent. Third topic: what are stemming and lemmatization in natural language processing? Stemming shortens words by cutting off the end or the beginning of the word, taking into account a list of common prefixes and suffixes. This works in some cases, but not all the time; as we can see here, 'studies' loses its suffix and becomes 'studi', without a 'y'. Lemmatization, on the other hand, takes into consideration the morphological analysis of the word, using detailed dictionaries. The lemma is the base form of all the word's inflectional forms, whereas the stem does not need to be a valid word at all.
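The stemming versus lemmatization contrast can be shown with a toy sketch. The suffix rules and the mini dictionary below are invented for illustration; real systems use something like the Porter stemmer and full morphological dictionaries.

```python
# Toy stemmer: rewrite a known suffix. Real stemmers have many more rules.
RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]
# Toy lemmatizer: a tiny hand-made dictionary of base forms.
LEMMAS = {"studies": "study", "studying": "study"}

def stem(word):
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word

def lemmatize(word):
    return LEMMAS.get(word, word)

print(stem("studies"))       # 'studi' -- not a real word
print(lemmatize("studies"))  # 'study' -- the dictionary base form
```

The stemmer is fast but blind (it happily outputs non-words), while the lemmatizer is only as good as its dictionary, which mirrors the trade-off described above.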
7. AI: Medium Questions: Hello, welcome to this section, where we handle medium-difficulty AI interview questions. We will discuss only three questions here, and there is no problem for the project section. Let's begin. First question: what method do you know that is used for optimizing a minimax-based solution? Here, the interviewer wants to hear whether you know about the alpha-beta pruning method. Being only an optimization technique, it returns the same move as the standard minimax algorithm, but it removes all the nodes that cannot possibly affect the final decision. For example, if a decision has an expected return of max(3, min(2, c)), then the result will be 3 without computing the right member, because it is essentially the maximum between 3 and a value that we are sure is at most 2; we do not even care what values the remaining leaves may hold. Second question: what are the components of an expert system? A straightforward answer here would be enough: an expert system has a knowledge base, with advanced, specialized data on the matter; an inference engine, which is basically the decision-making brain; and a user interface. Third question: what is market basket analysis, and how can AI be used to perform it? Market basket analysis explains the combinations of products that frequently co-occur in transactions. This can be done in AI through algorithms like association rule mining, which is composed of if-then statements that help show the probabilities of relationships between data items; a well-known such algorithm is Apriori.
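The pruning example above can be sketched compactly: minimax with alpha-beta on a small hand-made game tree, where a visited list records which leaves are actually evaluated. The tree values are invented; the point is that one leaf of the right subtree is never touched.

```python
import math

visited = []  # records the order in which leaves are evaluated

def alphabeta(node, maximizing, alpha=-math.inf, beta=math.inf):
    if not isinstance(node, list):            # a leaf: just its payoff value
        visited.append(node)
        return node
    if maximizing:
        value = -math.inf
        for child in node:
            value = max(value, alphabeta(child, False, alpha, beta))
            alpha = max(alpha, value)
            if alpha >= beta:                 # beta cut-off: prune remaining children
                break
        return value
    value = math.inf
    for child in node:
        value = min(value, alphabeta(child, True, alpha, beta))
        beta = min(beta, value)
        if alpha >= beta:                     # alpha cut-off
            break
    return value

tree = [[3, 5], [2, 9]]      # result is max(min(3, 5), min(2, 9))
best = alphabeta(tree, True)
print(best)                   # 3, same answer plain minimax would give
print(visited)                # the leaf 9 is never evaluated: it was pruned
```

Once the right subtree's minimizer sees the leaf 2 with alpha already at 3, it stops: whatever the remaining leaf holds cannot change the decision.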
8. AI: Hard Questions: Hello, welcome to this section, where we handle hard AI interview questions. In total, we will discuss four questions, and we will leave one out for the project. Let's begin. First question: how would you filter out spam messages? It can be done through natural language processing. First, you might want to collect some sets of words that frequently appear together in spam emails, like '100% guaranteed', '100% free', 'free gift', 'extra cash', plus some context about the sender. You could check whether those words are among the keywords of the email, using maybe some AYLIEN or Watson API, and then obtain an abstract representation (word2vec) to feed as input into a deep neural network classifier. In general, we follow the same classic steps of a machine learning classification problem: data collection, data cleaning, data analysis, data modelling, model evaluation, and optimization. Second question: what are Bayesian networks? Can you provide a practical application where they would be good? A short answer is fine; only if the interviewer wants further explanation, you can take an example and draw one. A Bayesian network is a statistical model that represents a set of variables and their conditional dependencies in the form of a directed acyclic graph. An application can be identifying the most probable cause of a disease, having the symptoms as input. Third topic: explain autoencoders. Before answering this question, keep in mind some points you need to touch on: you should describe the architecture of a typical autoencoder, its purpose, and how it operates. Autoencoders are unsupervised learning models. We have an input layer, an output layer, and one or more hidden layers connecting them. The output layer has the same number of units as the input layer. The purpose is to reconstruct the network's own inputs while reducing the dimensionality, typically for learning generative models of data. Fourth question: what do you think are the main steps for face verification?
This is a very complex question, as it requires in-depth analysis through image processing and machine learning. So the first steps may be data collection, data cleaning, and data analysis. But how about the model, and how exactly will we process the data? You might be tempted to say that we shall use a CNN, as it performs very well on images, right? However, here it is not that simple: the mere presence of eyes, a nose, and lips is enough to trigger a CNN, which does not account for spatial hierarchy. A solution to this might be capsule neural networks. Another solution would be alignment: align features of the face, between the eyes and lips, for example. Or representation: reconstruct 3D models of the face to visualize, and then classify using maybe a residual network. All of this is because even for simple face detection, these problems appear; we really need to take into consideration the positioning of all the elements. You have reached the end of this section. Usually at the end of a section, I will leave a question related to the material for the project. This question is correlated with some that we have already discussed here. So: how can AI be used in detecting fraud? Leave your answers in the project section. I will give a hint: think about how you can identify anomalies and patterns in data. Thank you for your attention.
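As a small coda to the spam question from this section: the keyword-collection step can be sketched as a naive phrase counter. The phrase list and the 0.5 threshold are invented for illustration; a real filter would continue with the word2vec representation and a trained classifier, as described above.

```python
# Hand-picked phrases that often co-occur in spam (illustrative list).
SPAM_PHRASES = ["100% guaranteed", "100% free", "free gift", "extra cash"]

def spam_score(email, phrases=SPAM_PHRASES):
    """Fraction of the known spam phrases that appear in the email body."""
    body = email.lower()
    return sum(phrase in body for phrase in phrases) / len(phrases)

email = "Claim your FREE GIFT now, extra cash 100% guaranteed!"
score = spam_score(email)
print(score, "-> spam" if score >= 0.5 else "-> ham")  # made-up 0.5 threshold
```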
9. Deep Learning: Easy Questions: Hello, welcome to this section where we handle easy interview questions in the field of deep learning. In total, we will discuss four questions and we will leave one out for the project. Let's begin. First question. What is the role of weights and biases in neural networks? For a perceptron, the weights determine the slope of the classifier line. The bias allows us to shift the line towards the left or right. Normally, biases are treated like another weight, with an input value fixed at one. Second question. Why do we use non-linearity in a neural network? Very good question. Activation functions cannot always be linear, because neural networks with a linear activation function are effectively only one layer deep, regardless of how complex their architecture is. The input to a network is usually a linear transformation (inputs times weights, plus bias), but real-world problems are non-linear. To make the incoming data non-linear, we use a non-linear mapping called an activation function. Third topic: it is recommended to use the rectified linear unit rather than a sigmoid activation in the hidden layers of a neural network. Why? Yes, the problem with other activation functions like sigmoid or hyperbolic tangent as activations within hidden layers is that they saturate and kill gradients. The problem of vanishing or exploding gradients is pretty common in deep learning. Moreover, sigmoid outputs are not zero-centered. That is why the hyperbolic tangent non-linearity is preferred over sigmoid, and it is actually used in recurrent neural networks. Fourth topic, what is data normalization and why do we need it? Data normalization is a very important preprocessing step used to rescale values to fit in a specific range, to assure better convergence during backpropagation. It also prevents the exploding gradient problem. Usually we use statistical normalization involving the mean and standard deviation.
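As a quick sketch of that last point, statistical (z-score) normalization with the mean and standard deviation looks like this; the numbers are made up, with the second feature on a much larger scale than the first:

```python
import numpy as np

data = np.array([[1.0, 200.0],
                 [2.0, 400.0],
                 [3.0, 600.0]])

# z-score normalization: subtract the per-feature mean, divide by the
# per-feature standard deviation, so every feature ends up on the same scale
mean = data.mean(axis=0)
std = data.std(axis=0)
normalized = (data - mean) / std

print(normalized.mean(axis=0))  # approximately [0, 0]
print(normalized.std(axis=0))   # [1, 1]
```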
Lately, batch normalization was introduced as a form of normalization that is applied to hidden layers, with the same purpose of assuring convergence. You have reached the end of this section. Usually at the end of a section, I will leave the question related to the project. This question is correlated with some that we have already discussed here. So name some advantages of choosing the rectified linear unit as the activation function for hidden layers in a neural network, and, as a follow-up, some disadvantages. Leave your answers in the project section. I will give a hint. Think about these keywords: saturation, efficiency, vanishing/exploding gradient problem handling, and dead rectified linear units. Thank you for your attention.
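Coming back to the second question of this section, the claim that linear activations make a network effectively one layer deep is easy to verify numerically. This is a small sketch with made-up shapes and random weights:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))  # batch of 5 inputs with 4 features each

# Two "hidden layers" with the identity (linear) activation
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=3)
W2, b2 = rng.normal(size=(3, 2)), rng.normal(size=2)
two_layers = (x @ W1 + b1) @ W2 + b2

# They collapse into a single linear layer with weights W and bias b
W = W1 @ W2
b = b1 @ W2 + b2
one_layer = x @ W + b

print(np.allclose(two_layers, one_layer))  # True: the extra depth added nothing
```

This is exactly why we insert non-linear activation functions between layers.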
10. Deep Learning: Medium Questions: Hello, welcome to this section where we handle medium difficulty interview questions in the field of deep learning. We will discuss three questions. There is no problem left out for the project here. Let's begin. First question. What are the benefits of mini-batch gradient descent? Popular question. The truth is that mini-batch gradient descent is more efficient than stochastic gradient descent in general, because it generalizes better by finding flat minima. Mini-batches allow us to approximate the gradient of the entire training set, which helps us avoid local minima. Second question. What is a residual neural net? This is a fairly new concept, but this question might arise if the company is really using them in their projects. As a definition, a residual neural network is an artificial neural network of a kind that builds on constructs known from pyramidal cells in the cerebral cortex. Residual neural networks do this by utilizing skip connections or shortcuts to jump over some layers. This way, the gradient can flow directly from the input layer to the output layer if needed. This is the main characteristic of a residual neural network, so be sure you emphasize it correctly. The main idea behind the skip connections is to backpropagate through the identity function, which preserves the gradient by just using a vector addition. This way, we prevent some well-known problems in machine learning, like the vanishing gradient problem. Third question. In training a neural network, you notice that the loss does not decrease in the first few epochs. Why could that be? And how do you solve it? This can be caused by multiple issues: the learning rate is too small, we started in or near a local minimum, or the regularization parameter is too high. If we are already stuck in the local minimum, a solution would be to reinitialize the algorithm at that point with a higher step size to get out.
For the other possible causes, try adjusting the learning rate or the regularization parameter empirically, or by using some hyperparameter optimization technique like grid search.
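As a sketch of what such a grid search looks like in practice with scikit-learn; the synthetic dataset and the grid of regularization strengths here are arbitrary choices for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data
X, y = make_classification(n_samples=200, random_state=0)

# Exhaustively try several regularization strengths with 5-fold
# cross-validation; C is the inverse regularization parameter,
# so a small C means strong regularization
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
)
grid.fit(X, y)

print(grid.best_params_)  # the C value with the best cross-validated score
```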
11. Deep Learning: Hard Questions: Hello, welcome to this section where we handle hard interview questions in the field of deep learning. We will discuss three questions here. There is no problem left out for the project section. Let's begin. First question. What are some issues while training a recurrent neural net? Recurrent neural networks are computationally expensive, as backpropagation is applied for every time step (backpropagation through time, BPTT). The exploding or vanishing gradient problem can also appear frequently. Second question. What is a capsule neural network? A capsule neural network is a new type of CNN that improves on it by taking spatial hierarchies into account; this represented one of the biggest drawbacks of CNNs. Actually, higher-level features combine lower-level features as a weighted sum: activations of a preceding layer are multiplied by the following layer's neuron weights and added, before being passed to the activation non-linearity. Nowhere in this setup is there a pose (translational and rotational) relationship between the simpler features that make up a higher-level feature. The CNN approach to solve this issue is to use max pooling or successive convolutional layers that reduce the spatial size of the data flowing through the network and therefore increase the field of view of the higher layers' neurons, thus allowing them to detect higher-order features in a larger region of the input image. But max pooling nonetheless loses valuable information. Capsules introduce a new building block that can be used in deep learning to better model hierarchical relationships inside the internal knowledge representation of a neural network. The design of the capsule builds upon the design of the artificial neuron, but expands it to vector form to allow for more powerful representational capabilities. It also introduces matrix weights to encode important hierarchical relationships between features of different layers. Third question.
What is a restricted Boltzmann machine? A restricted Boltzmann machine is a generative stochastic artificial neural network that can learn a probability distribution over its set of inputs. It is therefore useful for dimensionality reduction, collaborative filtering, and feature learning. Restricted Boltzmann machines are shallow, two-layer neural nets that constitute the building blocks of deep belief networks. The first layer of the RBM is called the visible or input layer, and the second is the hidden layer. The restriction in a restricted Boltzmann machine is that there is no intra-layer communication: nodes within the same layer are not connected. Each node is a locus of computation that processes input and begins by making stochastic decisions about whether to transmit that input or not.
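scikit-learn ships a `BernoulliRBM` that illustrates the two-layer structure described above. A minimal sketch, with a tiny made-up binary dataset of four visible units:

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM

# Tiny binary dataset: 6 samples over 4 visible units
X = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
])

# Visible layer of 4 units, hidden layer of 2 units; by construction
# there are no connections inside a layer, only between the two layers
rbm = BernoulliRBM(n_components=2, learning_rate=0.1, n_iter=20, random_state=0)
rbm.fit(X)

# transform() gives the hidden-unit activation probabilities per sample,
# a 2-dimensional representation of the 4-dimensional input
hidden = rbm.transform(X)
print(hidden.shape)  # (6, 2)
```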
12. Machine Learning: Easy Questions: Hello, welcome to the section where we handle easy interview questions in the field of machine learning. We will discuss four questions here and we will leave one out for the project. Let's begin. First question. Which one is the most important to you: model accuracy or model performance? Tricky question. Express your opinion first, then come up with some examples to strengthen your point. You need to identify the inclusion between the two concepts. The model accuracy is a subset of the model performance and might not provide the full picture of how good the model really is. That is why we prefer the model performance obtained through multiple metrics, or the most important metric for that domain. For example, in classifying dog and cat pictures, the accuracy might be the most important metric. But for medical diagnostics, the recall or precision in identifying true positives would be the most important to determine the model performance. An example where the accuracy is really not a good indication of the model performance would be a classification problem where the classes are not statistically balanced. In this case, the model will in most cases output the majority category, simply because it has more samples of it. Second question. What's the difference between Gini impurity and entropy? Here, don't tackle only the mathematical difference in the formulas; be sure to emphasize what they really mean. Entropy measures the lack of information needed to reduce the uncertainty about the label, whereas Gini impurity is the probability of incorrectly classifying a random data point if we randomly labeled it according to the class distribution. Entropy measures, in general, the chaos in a space, and any decrease in entropy means an information gain, which is what we want. We can see a clear difference in the formulas. Third topic, in what scenario do we need to do cross-validation? Explain briefly the methods used to do cross-validation.
Cross-validation is used when we want to estimate how well the model fits unseen data, or when we are overfitting the data. Methods used to do so are: hold-out validation, in which you usually split the data into a 70% training and 30% test dataset, or you use a validation dataset during training to provide an unbiased evaluation of a model fit on the training dataset while tuning model hyperparameters; leave-one-out cross-validation, where you train on the whole dataset and leave only one data point out for the test; and k-fold cross-validation, a popular method when you have limited data, where you split the data into k subsets and then perform training on all the subsets, leaving one out for evaluation, so we iterate k times in total. Cross-validation also helps with feature selection and it reduces the bias towards a specific dataset. Fourth question. What's the difference between parametric and non-parametric models? In parametric models we have a fixed number of parameters and strong assumptions about the data. They require less data and are faster. Some examples include logistic regression and the Naive Bayes classifier. In non-parametric models, we have a flexible number of parameters; they require more data and make fewer assumptions about the data. For example, kNN or decision trees. You have reached the end of this section. Usually at the end of a section, I will leave the question related to the project. This question is correlated with some that we have already discussed here. So how do decision trees split the data? Explain a method to do so in detail. Leave your answers in the project section. I will give a hint. Think about the related question regarding Gini impurity and entropy. Thank you for your attention.
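K-fold cross-validation is a one-liner in scikit-learn. A sketch on the classic iris dataset, with a decision tree chosen arbitrarily as the model:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: train on 4 folds, evaluate on the held-out one,
# and rotate so every fold serves as the test set exactly once
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)

print(len(scores))      # 5 accuracy values, one per fold
print(scores.mean())    # the cross-validated estimate of performance
```

Averaging the five fold scores gives a far less dataset-specific estimate of performance than a single train/test split.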
13. Machine Learning: Medium Questions: Hello, welcome to this section where we handle medium difficulty interview questions in the field of machine learning. We will discuss five questions here, and we will leave one out for the project. Let's begin. First topic. Explain the ensemble learning technique in machine learning. Very important question. It can appear during the interview in many forms. Be careful to provide a detailed explanation for all three main methods in ensemble learning. Ensemble learning helps improve machine learning results and also helps against overfitting by combining several models. It was empirically proved that ensembles have a better predictive performance compared to a single model. That's why they're actually very popular in competitive machine learning. It combines several methods in order to decrease variance and bias and improve predictions: bagging, boosting, stacking. In ensemble learning through bagging, we actually perform bootstrap aggregation. We divide the dataset into subsets with repeated randomized sampling, we take the output of multiple learners, each with a subset as input, and we use voting for classification problems and averaging for regression problems to get the final prediction. It is important to mention here that combining stable learning methods is less advantageous, since the ensemble will not help improve generalization performance. For example, combining two kNNs, which are stable, would be weaker than combining a kNN and a decision tree, which is less stable. That is why random forest is a pretty good ensemble learning technique that provides good results. In boosting, we convert weak learners into strong learners, training multiple models sequentially, each one trying to correct the mistakes of the previous one. For instance, AdaBoost. In stacking, the outputs of the models are taken as inputs in another model that learns to map the relation between those outputs and the labels.
So here, basically, the models are wrapped inside a bigger model that tries to generalize from the outputs coming from the smaller models. Second question. What is a receiver operating characteristic curve and what does it represent? This is a fundamental tool for diagnostic test evaluation of a binary classifier as its discrimination threshold is varied. It plots the true positive rate (sensitivity) against the false positive rate (one minus specificity) for the different possible cutoff points of a diagnostic test. It is important because it shows the trade-off between sensitivity and specificity. We want the area under the curve to be as high as possible, to have the best precision in identifying true positives. If the plot, for example, is a diagonal line, the model has a 50% chance to be correct in its prediction. We do not want that. Another thing that is worth mentioning is that the slope of the tangent line at a cut point gives the likelihood ratio for that value of the test. Third topic, explain the bias-variance tradeoff in linear and non-linear ML algorithms, with examples. We kind of touched on this topic before, in the other sections. Linear algorithms usually come with high bias and low variance, whereas non-linear machine learning algorithms are exactly the opposite. Therefore, some low variance algorithms are linear regression and logistic regression, and some high variance ones are decision trees, kNNs, or support vector machines. Due to low bias and high variance, these algorithms are more likely to overfit. Moreover, in general, by decreasing variance we get an increase in bias. Fourth topic. What's the difference between hyperparameters and model parameters? Straightforward answer. The model parameters are learned from the training data on their own during training: weights, biases, split points in decision trees.
Hyperparameters are the parameters that determine the entire training process: learning rate, number of hidden layers, number of neurons in each layer. They are external to the model and their values cannot be directly learned from the data. Fifth topic, explain briefly the different algorithms used for hyperparameter optimization. There are many methods for hyperparameter optimization. Grid search, which is an exhaustive search through a manually specified subset of the hyperparameter space. Random search, which randomly samples the search space and evaluates sets drawn from a particular probability distribution. Bayesian optimization, which is an optimization method for noisy black-box functions: it builds a probabilistic model of the function mapping from the hyperparameter space to the objective evaluated on the validation set. Gradient-based optimization: for some problems it is possible to compute the gradient with respect to the hyperparameters and to optimize through gradient descent. Evolutionary optimization, which is optimization through genetic algorithms. Population-based optimization, which can be done through particle swarm optimization. You have reached the end of this section. Usually at the end of a section, I will leave the question related to the project. This question is correlated with some that we have already discussed here. So what machine learning concepts can be applied to targeted marketing? Leave your answers in the project section. I will give a hint. There are actually multiple questions from the AI and data science sections related to this. Remember market basket analysis and other recommendation systems. Thank you for your attention.
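Coming back to the second question of this section, the ROC curve and the area under it can be computed with scikit-learn. The labels and classifier scores below are made up for illustration:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up binary labels and classifier scores
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.1, 0.3, 0.35, 0.6, 0.4, 0.7, 0.8, 0.9])

# One (false positive rate, true positive rate) point per cutoff threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Area under the ROC curve: 0.5 is the diagonal (random guessing), 1.0 is perfect
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.9375
```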
14. Machine Learning: Hard Questions: Hello, welcome to this section where we handle hard interview questions in the field of machine learning. We will discuss four questions here and we will leave one out for the project. Let's begin. First topic. You ran two classifiers over a dataset of training data, a decision tree and a random forest with a single tree. You get higher accuracy from the random forest classifier. How is this possible? Here, the interviewer wants to test your knowledge of the random forest ensemble classifier. For an inexperienced fellow, those two might seem the same, but a random forest introduces two random factors that might boost the performance even when using one tree: it considers a random subset of the feature set, and it considers only a bootstrapped sub-sample of the whole training set. In some cases, these little nuances might lead to the higher overall performance the question asks about. Second topic, in image processing: what is a gray level co-occurrence matrix and how can it be used in machine learning? This special matrix is an image obtained from the original image that helps us extract features about the original image. By definition, an entry (i, j) in the matrix counts the number of pixel pairs in the original image whose gray levels are i and j at a given offset. So the GLCM of the original image on the right is this one. Once obtained, we can easily get the contrast, entropy, uniformity and homogeneity of the original image, as these are just some double sums of the GLCM entries. And these can be used as valuable inputs in machine learning classification problems. That's why this concept is important and very useful. Third topic, you are given a cancer detection dataset. Let's suppose that when you build a classification model, you achieve ninety-six percent accuracy. Why shouldn't you be happy about it? Is there something you can do to better the performance of the model? Popular question. It might be difficult to answer this properly.
First of all, the data might not be statistically balanced, as there are significantly more people without cancer than with cancer. However, this can be easily diagnosed with a confusion matrix. Hence, the accuracy here might not be a good indication of the model performance, since the distribution of positive cases is very low. We might want to use precision or recall, and analyze the receiver operating characteristic curve as well. In order to solve this issue regarding unbalanced data, we can use some random undersampling, or oversampling of the minority class with SMOTE. We might also want to try adding more data or treating missing and outlier values. Final question, what would be a process for classifying complex images like the ones used in the self-driving cars industry? A very general answer would be: image acquisition, then image preprocessing (smoothing, for example), then image segmentation (semantic segmentation or clustering), then feature extraction, then classification. You might also consider image enhancement or color space conversion, like CIELAB. You have reached the end of this section. Usually at the end of a section, I will leave the question related to the project. This question is correlated with some that we have already discussed here. You use a random forest for a classification problem because it is a generally good learning technique which provides good results. However, it seems the model is overfitting on the training data. What do you do? Leave your answers in the project section. As a hint, think about how the number of trees in the random forest might affect this. Consider also models with high bias. Thank you for your attention.
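The accuracy trap from the cancer question is easy to demonstrate with a made-up imbalanced dataset and a deliberately lazy model:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Made-up imbalanced data: 95 healthy patients (0), 5 with cancer (1)
y_true = np.array([0] * 95 + [1] * 5)

# A useless model that predicts "healthy" for everyone
y_pred = np.zeros(100, dtype=int)

print(accuracy_score(y_true, y_pred))   # 0.95 -- looks impressive...
print(recall_score(y_true, y_pred))     # 0.0  -- ...but it misses every cancer case
print(confusion_matrix(y_true, y_pred)) # [[95, 0], [5, 0]]
```

The confusion matrix immediately exposes the problem: all five positives sit in the false-negative cell, which is exactly why recall and the ROC curve matter here more than raw accuracy.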
15. Bonus: Statistical ML: Welcome to the bonus section. Here we will discuss five interview questions that might appear during an interview to test your knowledge about probabilities and statistics. Let's begin with our first question. Suppose you are given a dataset which has missing values spread along one standard deviation from the median. What percentage of the data would remain unaffected? I really like this question. Let's visualize what one standard deviation from the median really means (for a normal distribution, the median equals the mean). If we take a random normal distribution, we can see that the data spread in that interval is the green region, and that is indeed what concerns us. Now, intuitively, this might seem to be over 50% of the data, but we need to find the precise answer. There are two ways to do this. The first one is straightforward, but it might not be the most appreciated one. We have a rule in statistics called the 68-95-99.7 rule, which says that 68.27%, 95.45%, and 99.73% of the values lie within one, two, and three standard deviations of the mean, respectively. So the answer to the problem will be 100 minus 68.27, which is 31.73%, approximately. Great. However, the interviewer might ask how you reached such a number, and you might therefore need to prove this rule, at least some of it. So let's imagine we play darts on this distribution of data. The question is, what's the probability that we hit only that green region when throwing our dart? Well, the probability that we pick a number from the whole distribution is one, but the probability that we pick a number from our target interval is the area under the curve in that interval. So it's the integral from the mean minus the standard deviation to the mean plus the standard deviation of the density function of the normal distribution, which is this. With the obvious substitution, x minus mu over the standard deviation, we reach this expression.
There is no secret that this integral cannot be directly calculated; although it resembles the Gaussian integral, it must be approximated. And for this we can use Simpson's rule, as it is really easy to apply on the minus one to one interval with two breakpoints. We get this integral as roughly 1.74, and after dividing by the square root of two pi, this leads to 69%, a result pretty close to the actual result of 68.27%. Next question, how many people must be gathered together in a room before you can be certain that there is a greater than 50-50 chance that at least two of them have the same birthday? This is a nice but well-known question. The answer is a surprisingly low number. The trick is to take the probability that none of the guests share a birthday, so the result will be one minus that probability. How many ways are there to assign birthdays to n people without any restriction? Well, the answer is of course 365 to the power n. Now, how many ways are there to assign birthdays to n people such that there is no shared birthday? If we imagine the 365 days as empty holes, then we can place the first birthday in any hole, so 365 possibilities. After that, we have 364 empty spaces, so there are 365 multiplied by 364 ways for two people. Generalizing, we discover that this is a problem of counting permutations of 365 distinct objects taken n at a time, which is denoted by P(365, n) and has this general formula. So the no-shared-birthday probability will be a fraction having P(365, n) at the numerator and 365 to the power n at the denominator. Plotting this function, it seems that for n equal to 23 it already drops below 0.5, so the answer is a stunning 23 people. The next question: in a game, you are asked to roll two fair six-sided dice. If the sum of the values equals seven, then you win $21, but to participate you need to pay $5. Would you play this game? This is a pretty easy question. There is a one in six chance to actually land on a sum equal to seven, so the expected return if we play six games is one win.
In the long run, that means we actually win $21 for every six games played, but we also need to pay $30 for six games. Therefore, it is really not a good idea to play this game, as we will lose money considerably, at a rate of $1.50 per game, actually. The next question: a jar has 1,000 coins, of which 999 are fair and one is double-headed. Pick a coin at random and toss it ten times. Given that you see ten heads, what is the probability that the next toss of that coin is also a head? This question has the purpose of showcasing Bayes' theorem, and you should use it to prove to the interviewer that you have a solid foundation in basic probabilities and statistics. First, let's calculate the probability that we see ten heads; this will be useful later. This probability is equal to the probability of choosing the double-headed coin, plus the probability of choosing a normal coin multiplied by the probability that we actually get ten heads with that coin. So it is 0.001 plus 0.999 multiplied by one over two to the power ten, which leads to approximately 0.001976. Great. Now, what is the probability that we have picked the double-headed coin, knowing that we got ten heads? Well, using Bayes' theorem, it's the probability that we got ten heads knowing that we have picked the double-headed coin, which is one, multiplied by the probability that we picked the double-headed coin, which is 0.001, all over the probability that we see ten heads, which we calculated earlier. Plugging in the numbers, we get 0.506. To close up the problem, all we need to compute now is the weighted sum: the probability of selecting the double-headed coin in that condition, plus 0.5 multiplied by the probability of selecting a fair coin in that condition. And we get approximately 0.75, which means a 75% chance. The final question: let's suppose a miracle has a probability of one in 1 million of happening.
And to simplify, let's say that the miracle takes only 1 second to happen. Knowing that there are 7 billion people on Earth, what's the probability that in this exact second, while you read this question, no miracle has happened to any person? This is a very good question. It has the purpose of showing how likely small miracle events are at a global level. For one person, the probability that a miracle hasn't happened in this exact second is one minus one over 1 million, which is this. Therefore, as there are 7 billion people on the planet living their parallel lives, we multiply this quantity by itself 7 billion times. Using the helper information from the problem, this is roughly equal to 0.36 raised to the power 7,000, since the quantity raised to the power 1 million approaches 1 over e, about 0.36, and 7 billion divided by 1 million is 7,000. This is of course less than 0.5 raised to the power 7,000. Using this last inequality trick, we get that this probability is less than one over ten to the power 100; ten to the power 100 is actually called a googol. To understand how small this number is, it is a known fact that a googol is a bigger number than the number of elementary particles in the known universe. So really, this quantity is negligible: a miracle has most probably happened to someone in this exact second.
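Several of the answers in this section are easy to verify numerically. A short, self-contained sketch checking the birthday paradox, the double-headed coin, and the miracle bound:

```python
import math

# 1) Birthday paradox: find the smallest n with P(shared birthday) > 0.5
def shared_birthday_probability(n):
    p_distinct = 1.0
    for i in range(n):
        p_distinct *= (365 - i) / 365  # builds P(365, n) / 365**n incrementally
    return 1 - p_distinct

n = 1
while shared_birthday_probability(n) <= 0.5:
    n += 1
print(n)  # 23

# 2) Double-headed coin: P(next toss is heads | ten heads observed)
p_ten_heads = 0.001 * 1 + 0.999 * (1 / 2) ** 10      # total probability
p_double_given_ten = 0.001 / p_ten_heads             # Bayes' theorem
p_next_head = p_double_given_ten * 1 + (1 - p_double_given_ten) * 0.5
print(round(p_next_head, 2))  # 0.75

# 3) Miracle: (1 - 1e-6) ** 7_000_000_000 is far too small for floats,
# so compare its base-10 logarithm against the googol bound of -100
log10_p = 7_000_000_000 * math.log10(1 - 1e-6)
print(log10_p < -100)  # True: smaller than one over a googol
```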