Showing posts with the label Data Science Interview Questions and Answers

What are a few ways you can handle missing values for a feature in your data?

Answer: If only a few instances are affected, you can drop them. In many cases you can replace the missing values with the mean of the feature. In a time series problem you might use a neighboring value. You could also run a clustering algorithm and fill each gap with the average value for instances in the same cluster. Example: A health dataset has two columns, 'current weight' and 'heaviest weight ever'. Every instance has a current weight, but many are missing the heaviest weight. One way to handle this is to compute the average percentage difference between heaviest weight and current weight over the complete rows, then apply that percentage to each current weight to estimate the missing heaviest weight.
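
As a rough illustration of two of these options, here is a minimal sketch of mean imputation and of the ratio approach from the weight example. The records, the column names, and the helper functions `impute_with_mean` and `impute_with_ratio` are made up for this example.

```python
from statistics import mean

# Hypothetical records; None marks a missing 'heaviest weight ever' value.
records = [
    {"current_weight": 150.0, "heaviest_weight": 165.0},
    {"current_weight": 200.0, "heaviest_weight": 220.0},
    {"current_weight": 180.0, "heaviest_weight": None},
    {"current_weight": 120.0, "heaviest_weight": 132.0},
]

def impute_with_mean(rows, column):
    """Replace missing values in `column` with the mean of the observed values."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = mean(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]

def impute_with_ratio(rows, target, reference):
    """Fill missing `target` values using the average target/reference ratio."""
    ratios = [r[target] / r[reference] for r in rows if r[target] is not None]
    avg_ratio = mean(ratios)
    return [
        {**r, target: r[reference] * avg_ratio if r[target] is None else r[target]}
        for r in rows
    ]

mean_filled = impute_with_mean(records, "heaviest_weight")
ratio_filled = impute_with_ratio(records, "heaviest_weight", "current_weight")
```

In this toy data every complete row has a heaviest weight 10% above the current weight, so the ratio approach estimates about 198 for the missing row, while mean imputation fills in roughly 172 regardless of that person's current weight.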

Explain the difference between bagging and boosting ensemble models.

Answer: Both are ensemble methods that combine multiple models into a final model. A bagging model (for example, a random forest) builds each of its component models independently, without any information from the other models, and then aggregates their predictions. A boosted model is built sequentially: it fits a first model, then uses information from that model (usually its errors or residuals) to fit the next model, and so on.
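
To make the contrast concrete, here is a small sketch that uses one-split regression stumps as the component models. It is a toy, not how any particular library implements these methods; the function names, the data, and the settings (25 models, learning rate 0.5) are assumptions chosen for the example.

```python
import random
from statistics import mean

def fit_stump(xs, ys):
    """Fit a one-split regression stump: (threshold, left_value, right_value)."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not right:
            continue  # a split must leave points on both sides
        lv, rv = mean(left), mean(right)
        err = sum((y - lv) ** 2 for y in left) + sum((y - rv) ** 2 for y in right)
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    if best is None:  # all x values identical: predict the mean everywhere
        m = mean(ys)
        return (xs[0], m, m)
    return best[1:]

def predict_stump(stump, x):
    t, lv, rv = stump
    return lv if x <= t else rv

def bagging(xs, ys, n_models=25, seed=0):
    """Train each stump independently on a bootstrap sample, then average."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: mean([predict_stump(m, x) for m in models])

def boosting(xs, ys, n_models=25, learning_rate=0.5):
    """Train stumps in sequence, each one fitted to the current residuals."""
    base = mean(ys)
    residuals = [y - base for y in ys]
    models = []
    for _ in range(n_models):
        stump = fit_stump(xs, residuals)
        models.append(stump)
        residuals = [r - learning_rate * predict_stump(stump, x)
                     for x, r in zip(xs, residuals)]
    return lambda x: base + learning_rate * sum(predict_stump(m, x) for m in models)

def mse(predict, xs, ys):
    return mean([(y - predict(x)) ** 2 for x, y in zip(xs, ys)])

xs = [float(x) for x in range(10)]
ys = [x ** 2 for x in xs]  # made-up curved target
single = fit_stump(xs, ys)
bagged = bagging(xs, ys)
boosted = boosting(xs, ys)
```

Note that the loop in `bagging` never looks at earlier models, while the loop in `boosting` cannot start a round until the previous round's residuals are known.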

What are a few ways you can evaluate a linear regression model?

Answer: Common metrics are R^2 (the coefficient of determination), RMSE (root mean squared error), MAE (mean absolute error), and MAPE (mean absolute percentage error).
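
For reference, all four metrics can be computed in a few lines. The sketch below uses made-up actual and predicted values, and `regression_metrics` is a name chosen for this example. MAPE is undefined when an actual value is zero, so the sketch assumes there are none.

```python
from math import sqrt
from statistics import mean

def regression_metrics(y_true, y_pred):
    """Return R^2, RMSE, MAE and MAPE (in percent) for paired values."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    y_bar = mean(y_true)
    ss_res = sum(e ** 2 for e in errors)             # residual sum of squares
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)   # total sum of squares
    return {
        "r2": 1 - ss_res / ss_tot,
        "rmse": sqrt(ss_res / len(errors)),
        "mae": mean([abs(e) for e in errors]),
        # Assumes no actual value is zero.
        "mape": 100 * mean([abs(e / t) for e, t in zip(errors, y_true)]),
    }

y_true = [100.0, 200.0, 300.0, 400.0]   # made-up actual values
y_pred = [110.0, 190.0, 310.0, 390.0]   # made-up predictions
metrics = regression_metrics(y_true, y_pred)
```

Every prediction here is off by exactly 10, so RMSE and MAE are both 10, while MAPE (about 5.2%) weights the same absolute error more heavily on the smaller actual values.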

What is selection bias, why is it important and how can you avoid it?

Answer: Selection bias describes the situation where an analysis is conducted on a subset of the data (a sample) with the goal of drawing conclusions about the population, but the conclusions are likely to be wrong (biased) because the sample differs from the population in some important way. It is usually introduced by a sampling error: the selection for analysis is not properly randomized. You can avoid it by drawing a random sample of the population and checking that the sample resembles the population on several measures (age, gender, education).

You find your random forest model is overfitting the data. What can you change about your model to reduce this?

Answer: I would reduce the number of features available to each tree. This makes the trees more diverse and less likely to overfit to a particular feature. Additionally, increasing the minimum number of samples per leaf prevents any single tree from fitting the training data too closely.

What is the difference between a left, inner, and outer join?

Answer: An outer join combines all rows from the tables, even when a row has no match in one of them. An inner join leaves you with only the rows that are present in all of the joined tables. Finally, a left join keeps every row from the original (left) table, even if it has no match in the additionally joined tables, and drops rows that appear only in those other tables.
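
The three behaviours are easy to see on two tiny tables. The sketch below models each table as a dictionary keyed by the join column, which assumes keys are unique on both sides (real joins also handle duplicates). The tables and the `join` helper are invented for the example, and `None` stands in for a null.

```python
# Hypothetical tables, keyed by customer id.
customers = {1: "Ana", 2: "Ben", 3: "Cy"}     # left table: id -> name
orders = {2: "book", 3: "lamp", 4: "desk"}    # right table: customer id -> item

def join(left, right, how):
    """Join two one-row-per-key tables; missing values become None."""
    if how == "inner":
        keys = left.keys() & right.keys()   # keys present in both tables
    elif how == "left":
        keys = left.keys()                  # every key from the left table
    elif how == "outer":
        keys = left.keys() | right.keys()   # every key from either table
    else:
        raise ValueError(f"unknown join type: {how}")
    return [(k, left.get(k), right.get(k)) for k in sorted(keys)]

inner_rows = join(customers, orders, "inner")
left_rows = join(customers, orders, "left")
outer_rows = join(customers, orders, "outer")
```

Customer 1 has no order and order 4 has no customer, so the inner join returns two rows, the left join three (customer 1 with a `None` item), and the outer join all four.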

What are two models you can use for classification problems, and when would you use one instead of the other?

Answer: Logistic regression and random forests are two models that can be used for classification. Often I would try both and see which one performs better. If the output of the model must be easily interpretable, I would choose logistic regression over the random forest because it produces coefficients that I can interpret.

What are the basic assumptions to be made for linear regression?

Answer: The assumptions of linear regression are (1) a linear association between the input and output variables, (2) normally distributed errors, and (3) independence of the error term from the inputs. Constant error variance (homoscedasticity) is commonly listed as well.

How would you explain a linear regression to a business executive?

Answer: Linear regression models are used to show or predict the relationship between variables or factors. The factor being predicted (the one the equation solves for) is called the dependent variable. The factors used to predict its value are called the independent variables. You can use linear regression to predict a continuous variable (salary) while taking into account the variables that explain it (education, experience, occupation). You may have heard something along the lines of "Women in the US earn 77% of what men earn, but if you account for different factors like experience, occupation, etc., that number becomes 91%."
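
If the executive wants to see what sits behind the explanation, a one-variable version fits in a few lines. This sketch fits a straight line by ordinary least squares to made-up salary figures; the numbers and the `fit_line` helper are purely illustrative.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for one input: returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Made-up data: years of experience vs. salary in thousands of dollars.
years = [1, 3, 5, 7, 9]
salary = [42, 50, 58, 66, 74]

slope, intercept = fit_line(years, salary)
predicted_at_6_years = intercept + slope * 6
```

Here the fitted slope is 4 and the intercept is 38, which reads in plain language as "a starting salary of about $38k, plus about $4k for each additional year of experience."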

What does P-value signify about the statistical data?

Answer: The p-value, or calculated probability, is the probability of obtaining results at least as extreme as those observed when the null hypothesis (H0) of a study question is true. In layman's terms, it answers the question: if there were really no effect, how often would random sampling alone produce results like mine? A small p-value means such results would be rare under chance alone, which counts as evidence against the null hypothesis.
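
One way to build intuition is a permutation test, which estimates a p-value directly from that definition: shuffle the group labels many times and count how often chance alone produces a difference at least as large as the observed one. This is only a sketch with made-up measurements; the function name, the number of permutations, and the seed are assumptions.

```python
import random
from statistics import mean

def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign group labels at random
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add one to both counts so the estimate is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)

# Made-up measurements for two groups.
group_a = [12.1, 11.8, 12.5, 12.9, 12.2, 12.7]
group_b = [10.2, 10.9, 10.4, 11.0, 10.6, 10.1]
p_value = permutation_p_value(group_a, group_b)
```

Because every value in the first group is larger than every value in the second, very few random relabelings reproduce a gap that big, so the estimated p-value is small.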

What is the difference between Supervised Learning and Unsupervised Learning?

Answer: In supervised learning you know what your target variable is, and your data set has labels for that variable. The goal is therefore to learn a function that, given sample inputs and their desired outputs, best approximates the relationship between input and output observable in the data. Unsupervised learning, on the other hand, has no labeled outputs, so its goal is to infer the natural structure present within a set of data points.

What is the goal of A/B Testing?

Answer: A/B testing is statistical hypothesis testing for a randomized experiment with two variants, A and B. The objective is to detect any difference in outcomes between the A group and the B group. For example, if you want to test whether a new landing page leads to more sales, you would set up an A/B test where half of the visitors see the old page and half see the new page. Then you use a statistical test to see whether the actions of those visitors were different.
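
For the landing page example, one common choice of statistical test is a two-proportion z-test on the conversion rates. The sketch below is one possible version, with invented visitor and sales counts; it relies on the normal approximation, which assumes reasonably large samples.

```python
from math import erf, sqrt

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for a difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)   # rate under "no difference"
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Standard normal CDF via the error function.
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    return z, 2 * (1 - cdf)

# Made-up counts: 5,000 visitors per page; 200 sales on the old page, 260 on the new.
z, p_value = two_proportion_z_test(200, 5000, 260, 5000)
```

With these counts the new page converts at 5.2% versus 4.0%, and the test gives a z-score a little under 3, so the difference would be hard to attribute to chance at the usual 5% level.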

What is a recommendation engine? How does it work?

Answer: Recommendation engines are data filtering tools that use algorithms and data to recommend the most relevant items to a particular user. In simple terms, they are an automated form of a "shop counter guy": you ask him for a product, and he shows you not only that product but also related ones you could buy. There are three main types:
- Collaborative Filtering
- Content-Based Filtering
- Hybrid Recommendation Systems
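
As a sketch of the first type, here is a very small user-based collaborative filter: it scores items a user has not rated by the ratings of other users, weighted by how similar those users are. The ratings, the user names, and the scoring rule are all made up for illustration; production systems are considerably more involved.

```python
from math import sqrt

# Hypothetical ratings on a 1-5 scale: user -> {item: rating}.
ratings = {
    "alice": {"laptop": 5, "mouse": 4, "keyboard": 5},
    "bob": {"laptop": 5, "mouse": 5, "keyboard": 4, "monitor": 5},
    "carol": {"novel": 5, "bookmark": 4, "mouse": 1},
}

def cosine_similarity(a, b):
    """Cosine similarity between two users' rating dictionaries."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[item] * b[item] for item in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

def recommend(user, all_ratings, top_n=3):
    """Rank unseen items by similarity-weighted ratings from other users."""
    scores = {}
    for other, other_ratings in all_ratings.items():
        if other == user:
            continue
        similarity = cosine_similarity(all_ratings[user], other_ratings)
        for item, rating in other_ratings.items():
            if item not in all_ratings[user]:
                scores[item] = scores.get(item, 0.0) + similarity * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

suggestions = recommend("alice", ratings)
```

Alice's ratings look much more like Bob's than Carol's, so Bob's monitor ranks first and Carol's items trail far behind.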

Explain cross-validation, both the process and why you do it.

Answer: Cross-validation is an effective tool for measuring the accuracy of your model and checking whether it is underfitting or overfitting. It is also useful for choosing the model's hyperparameters: you use cross-validation to determine which settings give the lowest test error. It works by splitting your data into multiple groups, then training your model on some of the groups and validating it on another, typically rotating so that each group is used for validation once.
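
The process can be sketched as plain k-fold cross-validation. In the example below, two candidate models (a constant mean predictor and a fitted line) are compared by their average validation error; the data, the function names, and the choice of five folds are assumptions for illustration.

```python
import random
from statistics import mean

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, validation) index lists for k shuffled folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

def cross_val_mse(xs, ys, fit, k=5):
    """Average validation MSE of the model returned by `fit` across k folds."""
    fold_errors = []
    for train, validation in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        fold_errors.append(mean([(ys[i] - model(xs[i])) ** 2 for i in validation]))
    return mean(fold_errors)

def fit_mean(train_x, train_y):
    """Baseline model: always predict the training mean."""
    m = mean(train_y)
    return lambda x: m

def fit_line(train_x, train_y):
    """Least-squares line through the training points."""
    x_bar, y_bar = mean(train_x), mean(train_y)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(train_x, train_y))
             / sum((x - x_bar) ** 2 for x in train_x))
    return lambda x: y_bar + slope * (x - x_bar)

xs = [float(x) for x in range(20)]
ys = [2 * x + 1 for x in xs]   # made-up, exactly linear data
mse_mean = cross_val_mse(xs, ys, fit_mean)
mse_line = cross_val_mse(xs, ys, fit_line)
```

Each model is scored only on points it did not see during training, so the lower cross-validated error of the line is a fairer comparison than training error would be.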

What is the curse of dimensionality?

Answer: As you increase the number of dimensions in your feature space, standard computational and statistical techniques become less effective. Your models require more computational power to fit and more observations of data. When fitting a model, you assume that the data sample is representative of the population. The more features you have relative to the number of data instances, the less confidently you can say that the assumptions hold.
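
One well-known symptom is that distances lose contrast in high dimensions: the nearest and farthest neighbors of a point end up almost equally far away, which hurts distance-based methods. The sketch below measures this on uniformly random points; the point counts, dimensions, and the `relative_contrast` helper are choices made for this example.

```python
import random
from math import dist

def relative_contrast(n_points, n_dims, seed=0):
    """(farthest - nearest) / nearest distance from a random query point."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    query = [rng.random() for _ in range(n_dims)]
    distances = [dist(query, p) for p in points]
    return (max(distances) - min(distances)) / min(distances)

contrast_low = relative_contrast(200, 2)      # 2 features
contrast_high = relative_contrast(200, 500)   # 500 features, same number of points
```

With the same 200 points, the contrast in 2 dimensions is far larger than in 500 dimensions, where every point sits at nearly the same distance from the query.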

What is regularization and what kind of problems does regularization solve?

Answer: Regularization helps prevent you from overfitting your model. It does this by adding a penalty term for the size of the model's coefficients to the loss being minimized, as in L1 (lasso) and L2 (ridge) regularization.
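
The effect of the penalty is easiest to see in the simplest case: ridge (L2) regression with a single input and no intercept, where the penalized least-squares slope has the closed form sum(x*y) / (sum(x^2) + penalty). The data and the `ridge_slope` helper below are made up for this sketch.

```python
def ridge_slope(xs, ys, penalty):
    """Slope minimizing sum((y - w*x)^2) + penalty * w^2 (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + penalty)

# Made-up data that roughly follows y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# A penalty of 0 is ordinary least squares; larger penalties shrink the slope.
slopes = {penalty: ridge_slope(xs, ys, penalty) for penalty in (0.0, 10.0, 100.0)}
```

As the penalty grows, the coefficient is pulled toward zero (about 1.99, 1.49, and 0.46 here), trading some fit to the training data for a simpler model.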

Please explain the bias-variance tradeoff.

Answer: The bias-variance tradeoff is essentially a question of how complex you would like to make your model. The more complex your model, the more its fit can vary with the particular sample of data. That is high variance, and you could be overfitting. A simpler model reduces this risk, but it increases the chance of underfitting and of being biased towards the features selected for your model.

What are various steps involved in an analytics project?

Answer:
- Look at the big picture
- Get the data
- EDA
- Data Prep (cleaning and feature engineering)
- Select a model
- Fine-tune your model (test metrics and hyperparameter tuning)
- Present your solution
- Launch, monitor, and maintain your system

Why did you switch careers to become a data scientist?

Answer: 30-second elevator pitch.