
Showing posts from April, 2017

Is it possible to capture the correlation between a continuous and a categorical variable? If yes, how?

Is it possible to capture the correlation between a continuous and a categorical variable? If yes, how? Answer: Yes, we can use the ANCOVA (analysis of covariance) technique to capture the association between continuous and categorical variables.
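
A minimal sketch of one way to test such an association in practice: a one-way ANOVA with scipy on made-up salary data grouped by a categorical variable (group labels and values are illustrative, not from the original post):

```python
# Association between a categorical variable (group) and a continuous
# one (salary) via one-way ANOVA; data is synthetic.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
salary_a = rng.normal(50_000, 5_000, 100)  # continuous outcome, group "A"
salary_b = rng.normal(55_000, 5_000, 100)  # group "B"
salary_c = rng.normal(60_000, 5_000, 100)  # group "C"

# H0: all group means are equal, i.e. no association between the
# categorical grouping and the continuous outcome.
f_stat, p_value = stats.f_oneway(salary_a, salary_b, salary_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")  # tiny p => association
```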

What is the difference between covariance and correlation?

What is the difference between covariance and correlation? Answer: Correlation is the standardized form of covariance. Covariances are difficult to compare. For example: if we calculate the covariance of salary ($) and age (years), we get a number whose magnitude depends on the units of both variables, so covariances computed on different scales can't be compared. To combat such situations, we calculate correlation, which gives a value between -1 and 1 irrespective of the variables' scales.
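
A small NumPy illustration of the point (the age/salary numbers are made up):

```python
# Covariance depends on the units of both variables; correlation
# rescales it to [-1, 1] so different pairs can be compared.
import numpy as np

age = np.array([25, 30, 35, 40, 45], dtype=float)   # years
salary = np.array([40e3, 50e3, 65e3, 70e3, 90e3])   # dollars

cov = np.cov(age, salary)[0, 1]        # large, unit-dependent number
corr = np.corrcoef(age, salary)[0, 1]  # unit-free, between -1 and 1
# corr is cov divided by the product of the standard deviations
print(cov, corr)
```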

While working on a data set, how do you select important variables? Explain your methods.

While working on a data set, how do you select important variables? Explain your methods. Answer: The following variable-selection methods can be used (two of them are sketched in code after this list):
* Remove correlated variables prior to selecting important variables.
* Use linear regression and select variables based on p-values.
* Use forward selection, backward selection, or stepwise selection.
* Use Random Forest or XGBoost and plot a variable importance chart.
* Use lasso regression.
* Measure information gain for the available set of features and select the top n features accordingly.
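
A minimal sketch of two of these methods with scikit-learn on a synthetic dataset (dataset and parameter values are illustrative):

```python
# Variable selection via (a) random forest importances and
# (b) lasso's zeroed-out coefficients; data is synthetic.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV

X, y = make_regression(n_samples=500, n_features=20,
                       n_informative=5, random_state=0)

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
print("RF importances:", rf.feature_importances_.round(3))

lasso = LassoCV(cv=5).fit(X, y)
print("Lasso kept:", int((lasso.coef_ != 0).sum()), "of", X.shape[1], "features")
```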

Rise in global average temperature led to a decrease in the number of pirates around the world. Does that mean the decrease in the number of pirates caused the climate change?

Rise in global average temperature led to a decrease in the number of pirates around the world. Does that mean the decrease in the number of pirates caused the climate change? Answer: After reading this question, you should have understood that this is a classic case of "correlation vs. causation". No, we can't conclude that the decrease in the number of pirates caused the climate change, because there might be other factors (lurking or confounding variables) influencing this phenomenon. Therefore, there might be a correlation between global average temperature and the number of pirates, but based on this information we can't say that pirates died out because of the rise in global average temperature.

When is Ridge regression favorable over Lasso regression?

When is Ridge regression favorable over Lasso regression? Answer: You can quote ISLR's authors Hastie and Tibshirani, who assert that in the presence of a few variables with medium/large effects, lasso regression should be used, while in the presence of many variables with small/medium effects, ridge regression should be used. Conceptually, lasso regression (L1) does both variable selection and parameter shrinkage, whereas ridge regression (L2) only does parameter shrinkage and ends up including all the coefficients in the model. In the presence of correlated variables, ridge regression might be the preferred choice. Also, ridge regression works best in situations where the least squares estimates have high variance. Therefore, it depends on our model objective.
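
A quick sketch of the conceptual difference on synthetic data (the alpha values are arbitrary demo choices):

```python
# Lasso (L1) zeroes some coefficients (variable selection); ridge (L2)
# only shrinks them, keeping all variables in the model.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=10,
                       n_informative=3, noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("lasso zero coefficients:", int(np.sum(lasso.coef_ == 0)))  # usually > 0
print("ridge zero coefficients:", int(np.sum(ridge.coef_ == 0)))  # usually 0
```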

After analyzing the model, your manager has informed you that your regression model is suffering from multicollinearity. How would you check whether he's right? Without losing any information, can you still build a better model?

After analyzing the model, your manager has informed you that your regression model is suffering from multicollinearity. How would you check whether he's right? Without losing any information, can you still build a better model? Answer: To check for multicollinearity, we can create a correlation matrix to identify and remove variables having correlation above 75% (deciding a threshold is subjective). In addition, we can calculate the VIF (variance inflation factor) to check for the presence of multicollinearity. A VIF value <= 4 suggests no multicollinearity, whereas a value >= 10 implies serious multicollinearity. We can also use tolerance as an indicator of multicollinearity. But removing correlated variables might lead to loss of information. In order to retain those variables, we can use penalized regression models like ridge or lasso regression. We can also add some random noise to the correlated variables so that they become different from each other. But adding noise might reduce prediction accuracy, so this approach should be used carefully.
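
A minimal VIF check with statsmodels on deliberately correlated synthetic columns (column names and data are illustrative):

```python
# x1 and x2 are built to be nearly collinear, so their VIFs blow up;
# the independent x3 stays near 1.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=300)
x2 = 0.9 * x1 + rng.normal(scale=0.1, size=300)  # deliberately correlated
x3 = rng.normal(size=300)
X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

for i in range(1, X.shape[1]):  # skip the constant column
    print(X.columns[i], round(variance_inflation_factor(X.values, i), 2))
```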

You have built a multiple regression model. Your model R² isn't as good as you wanted. For improvement, you remove the intercept term, and your model R² improves from 0.3 to 0.8. Is it possible? How?

You have built a multiple regression model. Your model R² isn't as good as you wanted. For improvement, you remove the intercept term, and your model R² improves from 0.3 to 0.8. Is it possible? How? Answer: Yes, it is possible. We need to understand the significance of the intercept term in a regression model. The intercept term represents the model prediction without any independent variables, i.e. the mean prediction. The formula is R² = 1 − ∑(y − ŷ)² / ∑(y − ȳ)², where ŷ is the predicted value and ȳ is the mean of y. When the intercept term is present, the R² value evaluates your model with respect to this mean model. In the absence of an intercept term, there is no mean baseline to evaluate against, and the formula becomes R² = 1 − ∑(y − ŷ)² / ∑y². Since ∑y² is typically much larger than ∑(y − ȳ)², the ratio becomes smaller and the resulting R² is artificially higher.
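
A quick numerical illustration of the denominator effect, scoring the same fitted line against both baselines (data is synthetic):

```python
# With an intercept, R^2 compares the fit to the mean model; without
# one, software divides by sum(y^2) instead, which inflates R^2 when
# y sits far from zero.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2 * x + 50 + rng.normal(size=100)  # mean of y is far from zero

yhat = np.poly1d(np.polyfit(x, y, 1))(x)
ss_res = np.sum((y - yhat) ** 2)

r2_centered = 1 - ss_res / np.sum((y - y.mean()) ** 2)  # usual R^2
r2_uncentered = 1 - ss_res / np.sum(y ** 2)             # no-intercept form
print(r2_centered, r2_uncentered)  # the uncentered value is far higher
```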

How are True Positive Rate and Recall related? Write the equation.

How are True Positive Rate and Recall related? Write the equation. Answer: True Positive Rate = Recall. They are equal, both given by the formula TP / (TP + FN).

After spending several hours, you are now anxious to build a high-accuracy model. As a result, you build 5 GBM models, thinking a boosting algorithm would do the magic. Unfortunately, none of the models could perform better than the benchmark score. Finally, you decide to combine those models. Though ensemble models are known to return high accuracy, you are unfortunate. Where did you go wrong?

After spending several hours, you are now anxious to build a high-accuracy model. As a result, you build 5 GBM models, thinking a boosting algorithm would do the magic. Unfortunately, none of the models could perform better than the benchmark score. Finally, you decide to combine those models. Though ensemble models are known to return high accuracy, you are unfortunate. Where did you go wrong? Answer: As we know, ensemble learners are based on the idea of combining weak learners to create strong learners. But these learners provide superior results only when the combined models are uncorrelated. Since we used 5 GBM models and got no accuracy improvement, it suggests that the models are correlated. The problem with correlated models is that all the models provide the same information. For example: if model 1 has classified User1122 as 1, there is a high chance that model 2 and model 3 would have done the same, even if its actual value is 0. Therefore, ensemble learners are built on the premise of combining weak, uncorrelated models to obtain better predictions.
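
A quick diagnostic for this situation: measure how correlated the members' predictions actually are (models and data below are illustrative):

```python
# Five GBMs differing only by seed learn near-identical functions on
# the same data, so their predictions correlate at ~1.0.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

preds = [GradientBoostingClassifier(random_state=s).fit(X_tr, y_tr)
         .predict_proba(X_te)[:, 1] for s in range(5)]
print(np.corrcoef(preds).round(2))  # off-diagonal entries near 1
```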

You are given a data set. The data set contains many variables, some of which are highly correlated and you know about it. Your manager has asked you to run PCA. Would you remove correlated variables first? Why?

You are given a data set. The data set contains many variables, some of which are highly correlated and you know about it. Your manager has asked you to run PCA. Would you remove correlated variables first? Why? Answer: Chances are, you might be tempted to say no, but that would be incorrect. Discarding correlated variables has a substantial effect on PCA because, in the presence of correlated variables, the variance explained by a particular component gets inflated. For example: you have 3 variables in a data set, of which 2 are correlated. If you run PCA on this data set, the first principal component would exhibit twice the variance it would exhibit with uncorrelated variables. Also, adding correlated variables lets PCA put more importance on those variables, which is misleading.
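
A small demonstration of the inflation effect with scikit-learn's PCA (the data is synthetic):

```python
# Adding a column highly correlated with an existing one pushes more
# of the explained variance into the first principal component.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 500))
c = a + rng.normal(scale=0.05, size=500)  # c is almost a duplicate of a

pca_uncorr = PCA().fit(np.column_stack([a, b]))
pca_corr = PCA().fit(np.column_stack([a, b, c]))
print(pca_uncorr.explained_variance_ratio_.round(2))  # roughly even split
print(pca_corr.explained_variance_ratio_.round(2))    # first PC dominates
```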

You came to know that your model is suffering from low bias and high variance. Which algorithm should you use to tackle it? Why?

You came to know that your model is suffering from low bias and high variance. Which algorithm should you use to tackle it? Why? Answer: Low bias occurs when the model's predicted values are close to the actual values. In other words, the model becomes flexible enough to mimic the training data distribution. While that sounds like a great achievement, don't forget that a flexible model has no generalization capability. It means that when this model is tested on unseen data, it gives disappointing results. In such situations, we can use a bagging algorithm (like random forest) to tackle the high-variance problem. Bagging algorithms divide a data set into subsets made with repeated randomized sampling. Then, these samples are used to generate a set of models using a single learning algorithm. Later, the model predictions are combined using voting (classification) or averaging (regression). Also, to combat high variance, we can use a regularization technique, where higher model coefficients get penalized, thereby lowering model complexity.
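
A minimal bagging sketch matching the description above, wrapping a high-variance tree in scikit-learn's BaggingClassifier (parameters are illustrative):

```python
# Bootstrap samples -> one tree per sample -> majority-vote predictions.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

tree = DecisionTreeClassifier(random_state=0)  # low bias, high variance
bagged = BaggingClassifier(tree, n_estimators=100, random_state=0)

print("single tree:", cross_val_score(tree, X, y, cv=5).mean().round(3))
print("bagged     :", cross_val_score(bagged, X, y, cv=5).mean().round(3))
```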

You are assigned a new project which involves helping a food delivery company save more money. The problem is, the company's delivery team isn't able to deliver food on time. As a result, their customers get unhappy. And, to keep them happy, the company ends up delivering food for free. Which machine learning algorithm can save them?

You are assigned a new project which involves helping a food delivery company save more money. The problem is, the company's delivery team isn't able to deliver food on time. As a result, their customers get unhappy. And, to keep them happy, the company ends up delivering food for free. Which machine learning algorithm can save them? Answer: You might have started hopping through the list of ML algorithms in your mind. But wait! Such questions are asked to test your machine learning fundamentals. This is not a machine learning problem; it is a route optimization problem. A machine learning problem consists of three things: there exists a pattern; you cannot solve it mathematically (even by writing exponential equations); and you have data on it. Always look for these three factors to decide whether machine learning is the tool to solve a particular problem.

You are working on a time series data set. Your manager has asked you to build a high-accuracy model. You start with the decision tree algorithm, since you know it works fairly well on all kinds of data. Later, you try a time series regression model and get higher accuracy than the decision tree model. Can this happen? Why?

You are working on a time series data set. Your manager has asked you to build a high-accuracy model. You start with the decision tree algorithm, since you know it works fairly well on all kinds of data. Later, you try a time series regression model and get higher accuracy than the decision tree model. Can this happen? Why? Answer: Time series data is known to possess linearity. On the other hand, a decision tree algorithm is known to work best at detecting non-linear interactions. The reason the decision tree failed to provide robust predictions is that it couldn't map the linear relationship as well as a regression model did. Therefore, we learn that a linear regression model can provide robust predictions provided the data set satisfies its linearity assumptions.
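
One concrete way this shows up: a decision tree cannot extrapolate a trend beyond the training range, while a linear model can. A sketch on synthetic trending data:

```python
# A linear model continues the trend; the tree predicts a constant
# (its last leaf value) outside the range it was trained on.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

t = np.arange(100, dtype=float).reshape(-1, 1)  # time index
y = 3 * t.ravel() + np.random.default_rng(0).normal(scale=5, size=100)

lin = LinearRegression().fit(t, y)
tree = DecisionTreeRegressor(random_state=0).fit(t, y)

future = np.array([[120.0]])   # beyond the training range
print(lin.predict(future))     # continues the trend (~360)
print(tree.predict(future))    # stuck near the last training value (~300)
```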

Explain marginal likelihood in the context of the naiveBayes algorithm.

Explain marginal likelihood in the context of the naiveBayes algorithm. Answer: The marginal likelihood is the probability that the word 'FREE' is used in any message.

Explain likelihood in the context of the naiveBayes algorithm.

Explain likelihood in the context of the naiveBayes algorithm. Answer: Likelihood is the probability of classifying a given observation as 1 in the presence of some other variable. For example: the probability that the word 'FREE' was used in previous spam messages is the likelihood.

Explain prior probability in the context of the naiveBayes algorithm.

Explain prior probability in the context of the naiveBayes algorithm. Answer: Prior probability is simply the proportion of the dependent (binary) variable in the data set. It is the closest guess you can make about a class without any further information. For example: in a data set, the dependent variable is binary (1 and 0). The proportion of 1 (spam) is 70% and of 0 (not spam) is 30%. Hence, we can estimate that there is a 70% chance that any new email will be classified as spam.
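
Tying the three previous answers together with Bayes' rule for the spam/'FREE' example (the probability values are made up for illustration):

```python
# P(spam | 'FREE') = P('FREE' | spam) * P(spam) / P('FREE')
prior_spam = 0.70   # prior: proportion of spam in the data set
likelihood = 0.80   # likelihood: P('FREE' | spam), assumed value
marginal = 0.60     # marginal likelihood: P('FREE' in any message), assumed

posterior = likelihood * prior_spam / marginal
print(round(posterior, 3))  # ~0.933
```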

You are given a data set on cancer detection. You've built a classification model and achieved an accuracy of 96%. Why shouldn't you be happy with your model performance? What can you do about it?

You are given a data set on cancer detection. You've built a classification model and achieved an accuracy of 96%. Why shouldn't you be happy with your model performance? What can you do about it? Answer: If you have worked on enough data sets, you should deduce that cancer detection results in imbalanced data. In an imbalanced data set, accuracy should not be used as a measure of performance, because 96% (as given) might only reflect predicting the majority class correctly, while our class of interest is the minority class (4%): the people who actually got diagnosed with cancer. Hence, in order to evaluate model performance, we should use Sensitivity (True Positive Rate), Specificity (True Negative Rate), and the F measure to determine the class-wise performance of the classifier. If the minority class performance is found to be poor, we can undertake the following steps: we can use undersampling, oversampling, or SMOTE to make the data balanced; and we can alter the prediction threshold value to favor the minority class.
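
A short sketch computing the class-wise metrics mentioned above from a confusion matrix (the model and the 96/4 class mix are placeholders):

```python
# Accuracy looks great on imbalanced data; sensitivity exposes how the
# minority (positive) class is actually handled.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.96],  # ~96/4 split
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression().fit(X_tr, y_tr)
tn, fp, fn, tp = confusion_matrix(y_te, clf.predict(X_te)).ravel()

print("accuracy   :", (tp + tn) / (tp + tn + fp + fn))
print("sensitivity:", tp / (tp + fn))  # true positive rate
print("specificity:", tn / (tn + fp))  # true negative rate
```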

You are given a data set. The data set has missing values which spread along 1 standard deviation from the median. What percentage of data would remain unaffected? Why?

You are given a data set. The data set has missing values which spread along 1 standard deviation from the median. What percentage of data would remain unaffected? Why? Answer: This question has enough hints for you to start thinking! Since the missing values are spread around the median, let's assume it's a normal distribution. We know that in a normal distribution, ~68% of the data lies within 1 standard deviation of the mean (which coincides with the median and mode), which leaves ~32% of the data outside that range. Therefore, ~32% of the data would remain unaffected by missing values. Read More: Normal/Gaussian Distributions – Rishabh Shukla
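
The ~68% figure is easy to verify with scipy:

```python
# Mass of a normal distribution within one standard deviation
# of its center.
from scipy.stats import norm

within = norm.cdf(1) - norm.cdf(-1)
print(round(within, 4), round(1 - within, 4))  # ~0.6827 inside, ~0.3173 outside
```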

Is rotation necessary in PCA? If yes, why? What will happen if you don't rotate the components?

Is rotation necessary in PCA? If yes, why? What will happen if you don't rotate the components? Answer: Yes, rotation (orthogonal) is necessary because it maximizes the difference between the variance captured by the components. This makes the components easier to interpret. Not to forget, that's the motive of doing PCA, where we aim to select fewer components (than features) which can explain the maximum variance in the data set. Rotation doesn't change the relative location of the points; it only changes their actual coordinates. If we don't rotate the components, the effect of PCA will diminish and we'll have to select more components to explain the variance in the data set.

You are given a train data set having 1000 columns and 1 million rows. The data set is based on a classification problem. Your manager has asked you to reduce the dimension of this data so that model computation time can be reduced. Your machine has memory constraints. What would you do? (You are free to make practical assumptions.)

You are given a train data set having 1000 columns and 1 million rows. The data set is based on a classification problem. Your manager has asked you to reduce the dimension of this data so that model computation time can be reduced. Your machine has memory constraints. What would you do? (You are free to make practical assumptions.) Answer: Since we have limited RAM, we should close all other applications on our machine, including the web browser, so that most of the memory can be put to use.

Machine Learning Interview Questions Part #1

What is the difference between supervised and unsupervised machine learning? Answer: Supervised learning trains a model on labelled data; unsupervised learning finds structure in data without labels.
How is KNN different from k-means clustering? Answer: KNN is a supervised classification algorithm that labels an unlabeled point from its nearest labelled neighbors; k-means is an unsupervised clustering algorithm that groups unlabeled points by iteratively computing cluster means.
Explain how a ROC curve works. Answer: A ROC curve contrasts the true positive rate (sensitivity) against the false positive rate (fall-out, or false-alarm rate) across classification thresholds.

Google Interview Part #2

What is the time and space complexity of heapsort? Answer: O(n lg n) time, O(1) space.
What is the time and space complexity of merge sort? Answer: O(n lg n) time, O(n) space.
How would you split up a data set in order to choose from multiple models? Answer: In such a situation, you should split the data into three parts: a training set for building models, a validation set for choosing among trained models (also called the cross-validation set), and a test set for judging the final model (see the sketch after this list).
What is a Type 1 error? Answer: A false positive.
What is a Type 2 error? Answer: A false negative.
In statistics, how would you calculate precision? Answer: true_pos / (true_pos + false_pos)
In statistics, how would you calculate recall? Answer: true_pos / (true_pos + false_neg)
In statistics, what does precision measure? Answer: Precision measures how accurate our positive predictions are.
In statistics, what does recall measure? Answer...
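
The three-way split described above can be done with two calls to scikit-learn's train_test_split (the 60/20/20 proportions are an illustrative choice):

```python
# First split off 40% of the data, then halve it into validation
# and test sets, giving a 60/20/20 split.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=0)            # 60% train
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0)  # 20% val, 20% test

print(len(X_train), len(X_val), len(X_test))        # 600 200 200
```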

Google Interview (Time) Part #1

Time of L1 cache reference? Answer: 0.5 ns.
Time of branch mis-predict? Answer: 5 ns.
Time of L2 cache reference? Answer: 7 ns.
Time of mutex lock/unlock? Answer: 100 ns.
Time of main memory reference? Answer: 100 ns.
Time of compressing 1 kB with Zippy? Answer: 10 us.
Time of sending 2 kB over GigE? Answer: 20 us.
Time of reading 1 MB sequentially from memory? Answer: 250 us.
Round-trip time within the same data center? Answer: 500 us.
Time of disk seek? Answer: 10 ms.
Time to read 1 MB sequentially from the network? Answer: 10 ms.
Time to read 1 MB sequentially from disk? Answer: 30 ms.
RTT of a packet from CA to the Netherlands and back? Answer: 150 ms.

Google Interview - Distributed Systems

An RPC (remote procedure call) is initiated by the: Client.
RPC works between two processes. These processes may be: on the same computer, or on different computers connected by a network.
RPC: Remote Procedure Call.
The local operating system on the server machine passes the incoming packets to the: Server stub.
_____ is a framework for distributed objects on the Microsoft platform. Answer: DCOM
____ is a framework for distributed objects using Borland Delphi. Answer: DDObjects
____ is a framework for distributed components using a messaging paradigm. Answer: Jt
____ is a Sun specification for a distributed, shared memory. Answer: JavaSpaces
____ is a framework for distributed objects using the Python programming language. Answer: Pyro
The reduce function typically outputs a smaller set than what is input to it. Answer: True
If there are M partitions of the input, there are M map workers running simultaneously. True or Fals...

Google Adsense Terms Related Interview Questions

What are Placements? Locations on the Display Network where ads can appear (where Google ads can appear).
What is the Display Network (GDN)? The larger area where ads can appear, comprised of websites, videos, and apps that partner with Google.
What is Ad Rank? A value that's used to determine your ad position and whether your ads will show at all. Ad Rank is calculated using your bid amount, the components of Quality Score, and the expected impact of extensions and other ad formats.
What is Quality Score? A score based on expected clickthrough rate, ad relevance, and landing page experience.
What are Ad Extensions? A feature that shows extra business information with your ad, like an address, phone number, store rating, or more webpage links (what else the ad links to).
What is Max CPC Bid? Maximum cost-per-click; the maximum amount you're willing to pay for each click on your ad.
What is the Ad Auction? The process that happens with each Google sear...

Common Linux Commands for Sys Admins

man - The most important command in Linux; man shows information about other commands. You can start by running "man man" to find out more about the man command.
uptime - Tells you how long your system has been running.
w - Shows who is logged into your system and what they are currently doing.
users - Shows the usernames of users who are currently logged in to your system.
whoami - Prints the username of the user you are currently logged in as.
grep - Finds text in files and outputs the matching lines along with file names. You can search recursively in multiple files using -r, and output only file names using -l.
less - In case the output of a command or a file's contents is more than your screen can accommodate, you can view it in parts using less.
cat - Helps in displaying, copying, or combining text files.
pwd - Prints the absolute path of the current working directory.

Basic Google Interview Questions

What are the steps of the Linux boot process?
1. BIOS - Power On Self Test; loads the MBR.
2. MBR - Master Boot Record; loads and executes the GRUB boot loader.
3. GRUB - GRand Unified Bootloader; loads and executes the kernel and initrd image.
4. Kernel - Mounts the root file system as specified by the "root=" kernel parameter.
5. Init - Determines the run level.
6. Run Level Programs - Services and programs are loaded depending on the run level.
What is the MBR?
* Master Boot Record
* Located in the first sector of the bootable disk (/dev/hda or /dev/sda)
* 512 bytes: primary boot loader (446 bytes), partition table info (64 bytes), MBR validation check (2 bytes)
* Contains info about GRUB and executes the GRUB boot loader
What are the most common network protocols and ports?
* SSH - Port 22
* Telnet - Port 23
* SMTP - Port 25
* DNS - Port 53
* BOOTP - Port 67
* HTTP - Port 80
* HTTPS - Port 443
What is Apache? The most commonly used web server software.
What is MySQL...