Posts

Google Interview

What is the time and space complexity of heapsort?
O(n lg n) time, O(1) space

What is the time and space complexity of merge sort?
O(n lg n) time, O(n) space

How would you split up a data set in order to choose from multiple models?
In such a situation, you should split the data into three parts: a training set for building models, a validation set for choosing among trained models (called the cross-validation set), and a test set for judging the final model.

What is a Type 1 error?
A false positive

What is a Type 2 error?
A false negative

In statistics, how would you calculate precision?
true_pos / (true_pos + false_pos)

In statistics, how would you calculate recall?
true_pos / (true_pos + false_neg)

In statistics, what does precision measure?
Precision measures how accurate our positive predictions are.

In statistics, what does recall measure?
Recall measures what fraction of the positives our model identified.

How would you calculate the F1 score?
...
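The precision and recall formulas above can be checked with a few lines of Python. This is a minimal sketch: the function names and the confusion counts are made up for illustration and do not come from the original post.

```python
def precision(true_pos: int, false_pos: int) -> float:
    """Fraction of positive predictions that were actually positive."""
    return true_pos / (true_pos + false_pos)


def recall(true_pos: int, false_neg: int) -> float:
    """Fraction of actual positives that the model identified."""
    return true_pos / (true_pos + false_neg)


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
true_pos, false_pos, false_neg = 8, 2, 4

p = precision(true_pos, false_pos)  # 8 / (8 + 2)
r = recall(true_pos, false_neg)     # 8 / (8 + 4)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.80 recall=0.67
```

With these counts the model is right about 80% of the time when it predicts a positive, but it only finds two thirds of the real positives.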

Give examples of bad and good visualizations

Bad visualization:
- Pie charts: comparisons between items are difficult when values are encoded as area, especially when there are many items
- Color choice for classes: heavy use of red, orange and blue. Readers may assume the colors mean good (blue) versus bad (orange and red), when they are only associated with a specific segment
- 3D charts: can distort perception and therefore skew the data

Good visualization:
- Using a solid line in a line chart: dashed and dotted lines can be distracting
- Heat map with a single color: some colors stand out more than others, giving more weight to that data. A single color with varying shades shows the intensity better
- Adding a trend line (regression line) to a scatter plot helps the reader see trends

Do you know a few "rules of thumb" used in statistical or computer science? Or in business analytics?

Pareto rule:
- 80% of the effects come from 20% of the causes
- 80% of the sales come from 20% of the customers

Computer science:
- "simple and inexpensive beats complicated and expensive" - Rod Elder

Finance, rule of 72:
- Estimates the time needed for an investment to double: divide 72 by the annual interest rate in percent
- $100 at a rate of 9%: 72/9 = 8 years

Rule of three (Economics):
- There tend to be three major competitors in a free market within one industry
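The rule of 72 is an approximation, and a short calculation shows how close it gets. The sketch below assumes annual compounding, which the original text does not specify; the function names are illustrative.

```python
import math


def rule_of_72(rate_percent: float) -> float:
    """Approximate doubling time in years using the rule of 72."""
    return 72 / rate_percent


def exact_doubling_time(rate_percent: float) -> float:
    """Doubling time in years, assuming interest compounds once per year."""
    return math.log(2) / math.log(1 + rate_percent / 100)


approx = rule_of_72(9)
exact = exact_doubling_time(9)
print(f"rule of 72: {approx:.2f} years, exact: {exact:.2f} years")
# rule of 72: 8.00 years, exact: 8.04 years
```

At 9% the shortcut is off by about two weeks, which is why it is popular for mental arithmetic.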

Explain the difference between "long" and "wide" format data. Why would you use one or the other?

- Long: one column containing the values and another column listing the context of the value
  Fam_id  year  fam_inc
- Wide: each different variable in a separate column
  Fam_id  fam_inc96  fam_inc97  fam_inc98

Long vs. wide:
- Data manipulations are much easier when data is in the wide format: summarize, filter
- Program requirements: the software or procedure you use may expect one format or the other
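The two layouts hold the same information, which a small conversion makes concrete. The sketch below uses plain Python dictionaries; the column names come from the example above, while the income figures and the helper names wide_to_long and long_to_wide are hypothetical. It also assumes the two-digit suffixes refer to years in the 1900s.

```python
# Hypothetical income figures; only the column names come from the text.
wide = [
    {"Fam_id": 1, "fam_inc96": 40000, "fam_inc97": 40500, "fam_inc98": 41000},
    {"Fam_id": 2, "fam_inc96": 45000, "fam_inc97": 45400, "fam_inc98": 45800},
]


def wide_to_long(rows):
    """Turn each fam_incYY column into its own (Fam_id, year, fam_inc) row."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column.startswith("fam_inc"):
                year = 1900 + int(column[len("fam_inc"):])  # assumes 19YY
                long_rows.append(
                    {"Fam_id": row["Fam_id"], "year": year, "fam_inc": value}
                )
    return long_rows


def long_to_wide(rows):
    """Collect all rows of one family back into a single wide row."""
    by_family = {}
    for row in rows:
        wide_row = by_family.setdefault(row["Fam_id"], {"Fam_id": row["Fam_id"]})
        wide_row[f"fam_inc{row['year'] % 100:02d}"] = row["fam_inc"]
    return list(by_family.values())


long_data = wide_to_long(wide)
for row in long_data[:3]:
    print(row)
```

Two wide rows become six long rows, and converting back reproduces the original table.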

What is your definition of big data?

Big data is high volume, high velocity and/or high variety information assets that require new forms of processing
- Volume: big data doesn't sample, just observes and tracks what happens
- Velocity: big data is often available in real-time
- Variety: big data comes from texts, images, audio, video...

Difference big data/business intelligence:
- Business intelligence uses descriptive statistics with data with high density information to measure things, detect trends etc.
- Big data uses inductive statistics (statistical inference) and concepts from non-linear system identification to infer laws (regression, classification, clustering) from large data sets with low density information to reveal relationships and dependencies or to perform prediction of outcomes or behaviors

Is it better to spend 5 days developing a 90% accurate solution, or 10 days for 100% accuracy?

Depends on the context?
- "Premature optimization is the root of all evil"
- At the beginning: a quick-and-dirty model is better
- Optimization comes later

Other answer:
- Depends on the context
- Is error acceptable? Think of fraud detection or quality assurance

How to detect individual paid accounts shared by multiple users?

- Check geographical region: a login from Paris on Friday morning and a login from Tokyo on Friday evening
- Bandwidth consumption: a user goes over some high limit
- Counter of live sessions: 100 sessions per day (about 4 per hour) seems more than one person can do
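The geographical check above can be sketched as an "impossible travel" test: if two consecutive logins imply a speed no traveler could reach, the account is flagged. Everything specific here is an assumption for illustration: the 900 km/h threshold, the login tuple layout, the timestamps (taken to be UTC) and the function names.

```python
import math
from datetime import datetime

EARTH_RADIUS_KM = 6371.0
MAX_SPEED_KMH = 900.0  # assumed threshold, roughly airliner cruising speed


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (latitude, longitude) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def impossible_travel(logins, max_speed_kmh=MAX_SPEED_KMH):
    """Return consecutive login pairs whose implied speed exceeds the limit.

    logins: list of (utc_datetime, latitude, longitude) tuples.
    """
    ordered = sorted(logins, key=lambda login: login[0])
    flagged = []
    for first, second in zip(ordered, ordered[1:]):
        hours = (second[0] - first[0]).total_seconds() / 3600
        distance = haversine_km(first[1], first[2], second[1], second[2])
        # Simultaneous logins from two different places are also suspicious.
        if (hours == 0 and distance > 0) or (
            hours > 0 and distance / hours > max_speed_kmh
        ):
            flagged.append((first, second))
    return flagged


# Hypothetical logins: Paris on Friday morning, Tokyo on Friday evening.
logins = [
    (datetime(2024, 1, 5, 8, 0), 48.8566, 2.3522),     # Paris
    (datetime(2024, 1, 5, 17, 0), 35.6762, 139.6503),  # Tokyo
]
flagged = impossible_travel(logins)
print(len(flagged), "suspicious login pair(s)")
```

A real system would also have to allow for VPNs and imprecise IP geolocation, so a flag like this is better treated as a signal to combine with the bandwidth and session counters than as proof of sharing.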