
Google Interview

What is the time and space complexity of heapsort? O(n lg n) time, O(1) space. What is the time and space complexity of merge sort? O(n lg n) time, O(n) space. How would you split up a data set in order to choose from multiple models? In such a situation, split the data into three parts: a training set for building models, a validation set (also called the cross-validation set) for choosing among trained models, and a test set for judging the final model. What is a Type 1 error? A false positive. What is a Type 2 error? A false negative. In statistics, how would you calculate precision? true_pos / (true_pos + false_pos) In statistics, how would you calculate recall? true_pos / (true_pos + false_neg) In statistics, what does precision measure? Precision measures how accurate our positive predictions are. In statistics, what does recall measure? Recall measures what fraction of the actual positives our model identified. How would you calculate the F1 score? F1 = 2 × precision × recall / (precision + recall), the harmonic mean of precision and recall.
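
A minimal R sketch of the precision/recall/F1 formulas above, using hypothetical confusion-matrix counts (the numbers are made up for illustration):

true_pos  <- 90                                   # hypothetical counts
false_pos <- 10
false_neg <- 30
precision <- true_pos / (true_pos + false_pos)    # 0.9
recall    <- true_pos / (true_pos + false_neg)    # 0.75
f1        <- 2 * precision * recall / (precision + recall)
c(precision = precision, recall = recall, F1 = f1)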

Give examples of bad and good visualizations

Give examples of bad and good visualizations Bad visualizations: - Pie charts: difficult to make comparisons between items when area is used, especially when there are lots of items - Color choice for classes: abundant use of red, orange and blue. Readers may think the colors mean good (blue) versus bad (orange and red) whereas they are just associated with a specific segment - 3D charts: can distort perception and therefore skew the data - Dashed and dotted lines in a line chart: they can be distracting (prefer a solid line) Good visualizations: - Heat map with a single color: some colors stand out more than others, giving more weight to that data; a single color with varying shades shows the intensity better - Adding a trend line (regression line) to a scatter plot helps the reader identify trends

Do you know a few "rules of thumb" used in statistics or computer science? Or in business analytics?

Do you know a few "rules of thumb" used in statistics or computer science? Or in business analytics? Pareto rule: - 80% of the effects come from 20% of the causes - 80% of the sales come from 20% of the customers Computer science: "simple and inexpensive beats complicated and expensive" - Rod Elder Finance, rule of 72: - Estimates the time needed for an investment to double - $100 at a rate of 9%: 72/9 = 8 years Rule of three (economics): - There are always three major competitors in a free market within one industry

Explain the difference between "long" and "wide" format data. Why would you use one or the other?

Explain the difference between "long" and "wide" format data. Why would you use one or the other? - Long: one column containing the values and another column listing the context of each value (e.g. Fam_id, year, fam_inc) - Wide: each variable in a separate column (e.g. Fam_id, fam_inc96, fam_inc97, fam_inc98) Long vs wide: - Some data manipulations (summarize, filter) are much easier when the data is in the wide format - Program requirements: some functions or tools expect one format or the other

What is your definition of big data?

What is your definition of big data? Big data is high volume, high velocity and/or high variety information assets that require new forms of processing - Volume: big data doesn't sample, just observes and tracks what happens - Velocity: big data is often available in real-time - Variety: big data comes from texts, images, audio, video... Difference big data/business intelligence: - Business intelligence uses descriptive statistics with data with high density information to measure things, detect trends etc. - Big data uses inductive statistics (statistical inference) and concepts from non-linear system identification to infer laws (regression, classification, clustering) from large data sets with low density information to reveal relationships and dependencies or to perform prediction of outcomes or behaviors

Is it better to spend 5 days developing a 90% accurate solution, or 10 days for 100% accuracy?

Is it better to spend 5 days developing a 90% accurate solution, or 10 days for 100% accuracy? It depends on the context: - "Premature optimization is the root of all evil" - At the beginning, a quick-and-dirty model is better - Optimize later Other considerations: - Is a 10% error acceptable? In fraud detection or quality assurance it may not be

How to detect individual paid accounts shared by multiple users?

How to detect individual paid accounts shared by multiple users? - Check the geographical region: a Friday-morning login from Paris and a Friday-evening login from Tokyo - Bandwidth consumption: a user going over some high limit - Counter of live sessions: 100 sessions per day (4 per hour) seems more than one person can do

How would you come up with a solution to identify plagiarism?

How would you come up with a solution to identify plagiarism? - Vector space model approach - Represent the documents (the suspect and the original ones) as vectors of terms - Terms: n-grams, with n from 1 up to as large as practical (to detect passage-level plagiarism) - Measure the similarity between both documents - Similarity measures: cosine distance, Jaro-Winkler, Jaccard - Declare plagiarism above a certain threshold

Explain Tufte's concept of "chart junk"

Explain Tufte's concept of "chart junk" All visual elements in charts and graphs that are not necessary to comprehend the information represented, or that distract the viewer from this information. Examples of unnecessary elements include: - Unnecessary text - Heavy or dark grid lines - Ornamented chart axes - Pictures - Backgrounds - Unnecessary dimensions - Elements depicted out of scale to one another - 3-D simulations in line or bar charts

What is POC (proof of concept)?

What is POC (proof of concept)? -A realization of a certain method to demonstrate its feasibility -In engineering: a rough prototype of a new idea is often constructed as a proof of concept

How frequently must an algorithm be updated?

How frequently must an algorithm be updated? You want to update an algorithm when: - You want the model to evolve as data streams through the infrastructure - The underlying data source is changing - Example: a retail store model that must remain accurate as the business grows - You are dealing with non-stationarity Some options: - Incremental algorithms: the model is updated every time it sees a new training example Note: simple, and you always have an up-to-date model, but you can't incorporate data to different degrees. Sometimes mandatory: when data must be discarded once seen (privacy) - Periodic re-training in "batch" mode: simply buffer the relevant data and update the model every so often Note: more decisions and more complex implementations How frequently? - Is the sacrifice worth it? - Data horizon: how quickly do you need the most recent training example to be part of your model? - Data obsolescence: how long does it take before data is irrelevant to the model? ...

How to efficiently scrape web data, or collect tons of tweets?

How to efficiently scrape web data, or collect tons of tweets? - Python example - Requesting and fetching the webpage in code: httplib2 module - Parsing the content and getting the necessary info: BeautifulSoup from the bs4 package - Twitter API: a Python wrapper for performing API requests; it handles all the OAuth and API queries in a single Python interface - MongoDB as the database - PyMongo: the Python wrapper for interacting with the MongoDB database - Cron jobs: a time-based scheduler to run scripts at specific intervals; helps avoid the "rate limit exceeded" error

What is the life cycle of a data science project?

What is the life cycle of a data science project? 1. Data acquisition Acquiring data from both internal and external sources, including social media or web scraping. In a steady state, data extraction routines should be in place, and new sources, once identified, would be acquired following the established processes 2. Data preparation Also called data wrangling: cleaning the data and shaping it into a suitable form for later analyses. Involves exploratory data analysis and feature extraction. 3. Hypothesis & modelling Like in data mining, but with all the data rather than samples. Applying machine learning techniques to all the data. A key sub-step: model selection. This involves preparing a training set for model candidates, and validation and test sets for comparing model performances, selecting the best performing model, gauging model accuracy and preventing overfitting 4. Evaluation & interpretation Steps 2 to 4 are repeated a number of times as ne...

What is star schema? Lookup tables?

What is star schema? Lookup tables? The star schema is a traditional database schema with a central (fact) table (the "observations", with database "keys" for joining with satellite tables, and with several fields encoded as IDs). Satellite tables map IDs to physical names or descriptions and can be joined to the central fact table using the ID fields; these tables are known as lookup tables, and are particularly useful in real-time applications, as they save a lot of memory. Sometimes star schemas involve multiple layers of summarization (summary tables, from granular to less granular) to retrieve information faster. Lookup tables: - An array that replaces runtime computations with a simpler array indexing operation

Is it better to have 100 small hash tables or one big hash table, in memory, in terms of access speed (assuming both fit within RAM)? What do you think about in-database analytics?

Is it better to have 100 small hash tables or one big hash table, in memory, in terms of access speed (assuming both fit within RAM)? What do you think about in-database analytics? Hash tables: - Average-case O(1) lookup time - Lookup time doesn't depend on size Even in terms of memory: - O(n) memory - Space scales linearly with the number of elements - Lots of small hash tables won't take up significantly less space than one larger one In-database analytics: - Integration of data analytics into data warehousing functionality - Much faster, and corporate information is more secure: it doesn't leave the enterprise data warehouse Good for real-time analytics: fraud detection, credit scoring, transaction processing, pricing and margin analysis, behavioral ad targeting and recommendation engines

Compare R and Python

Compare R and Python R - Focuses on better, user friendly data analysis, statistics and graphical models - The closer you are to statistics, data science and research, the more you might prefer R - Statistical models can be written with only a few lines in R - The same piece of functionality can be written in several ways in R - Mainly used for standalone computing or analysis on individual servers - Large number of packages, for anything! Python - Used by programmers that want to delve into data science - The closer you are working in an engineering environment, the more you might prefer Python - Coding and debugging is easier mainly because of the nice syntax - Any piece of functionality is always written the same way in Python - When data analysis needs to be implemented with web apps - Good tool to implement algorithms for production use

Provide examples of machine-to-machine communications

Provide examples of machine-to-machine communications Telemedicine - Heart patients wear a specialized monitor which gathers information about the state of their heart - The collected data is sent to an implanted electronic device which delivers electric shocks to correct abnormal rhythms Product restocking - Vending machines message the distributor whenever an item is running out of stock

Examples of NoSQL architecture

Examples of NoSQL architecture - Key-value: in a key-value NoSQL database, all of the data consists of an indexed key and a value. Cassandra, DynamoDB - Column-based: designed for storing data tables as sections of columns of data rather than as rows of data. HBase, SAP HANA - Document database: maps a key to some document that contains structured information; the key is used to retrieve the document. MongoDB, CouchDB - Graph database: designed for data whose relations are well represented as a graph, with interconnected elements and an undetermined number of relations between them. Neo4j

How to optimize algorithms? (parallel processing and/or faster algorithms). Provide examples for both

How to optimize algorithms? (parallel processing and/or faster algorithms). Provide examples for both "Premature optimization is the root of all evil" - Donald Knuth Parallel processing: for instance in R on a single machine (see the sketch below) - doParallel and foreach packages - doParallel: the parallel backend, registers n cores of the machine - foreach: assigns tasks to each core - Using Hadoop on a single node - Using Hadoop on multiple nodes Faster algorithms: - In computer science a Pareto-like principle applies: 90% of the execution time is spent executing 10% of the code - Data structures: affect performance - Caching: avoid unnecessary work - Improvements at the source code level For instance: on early C compilers, WHILE(something) was slower than FOR(;;), because WHILE evaluated "something" and then had a conditional jump which tested if it was true, while FOR had an unconditional jump.
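
A hedged sketch of the doParallel/foreach pattern mentioned above, assuming both packages are installed; the sqrt call stands in for a heavier computation:

library(doParallel)
library(foreach)
registerDoParallel(cores = 2)                    # parallel backend with 2 cores
res <- foreach(i = 1:8, .combine = c) %dopar% {  # tasks distributed across the cores
  sqrt(i)                                        # placeholder for real work
}
stopImplicitCluster()
res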

The homicide rate in Scotland fell last year to 99 from 115 the year before. Is this reported change really noteworthy?

The homicide rate in Scotland fell last year to 99 from 115 the year before. Is this reported change really noteworthy? - Treating the homicides as independent events, a Poisson distribution is a reasonable model - An approximate 95% interval for the expected count is 115 ± 2×√115 = 115 ± 21.4 ≈ [94, 137] - Since 99 falls well inside this interval, it's not reasonable to conclude that there has been a reduction in the true rate
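
A quick check of that Poisson-based interval in R:

lambda_hat <- 115                            # last year's count, used as the rate estimate
interval   <- lambda_hat + c(-2, 2) * sqrt(lambda_hat)
interval                                     # roughly 93.6 to 136.4; 99 falls inside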

Geiger counter records 100 radioactive decays in 5 minutes. Find an approximate 95% interval for the number of decays per hour.

Geiger counter records 100 radioactive decays in 5 minutes. Find an approximate 95% interval for the number of decays per hour. - Start by finding a 95% interval for the number of decays in a 5-minute period - The estimated standard deviation is √100 = 10 - So the interval is 100 ± 1.96×10 = 100 ± 19.6 = [80.4, 119.6] - Multiplying by 12 to convert to decays per hour: [964.8, 1435.2]

You are running for office and your pollster polled hundred people. 56 of them claimed they will vote for you. Can you relax?

You are running for office and your pollster polled hundred people. 56 of them claimed they will vote for you. Can you relax? Quick answer: - Intervals take the form p̂ ± z × √(p̂(1−p̂)/n) - We know that p(1−p) is maximized at p = 1/2, and z = 1.96 ≈ 2 is the relevant quantile for a 95% confidence interval - So p̂ ± 1/√n is a quick, conservative estimate of the interval - Here: 1/√100 = 0.1, so the 95% interval is roughly [46%, 66%] - Since the interval contains 50%, it's not enough: you can't relax
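
The same calculation in R, both the quick 1/√n rule and the usual Wald interval:

p_hat <- 56 / 100
quick <- p_hat + c(-1, 1) / sqrt(100)                                # [0.46, 0.66]
exact <- p_hat + c(-1, 1) * 1.96 * sqrt(p_hat * (1 - p_hat) / 100)   # about [0.463, 0.657]
rbind(quick, exact)   # both intervals contain 0.50, so you cannot relax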

Consider the number of people that show up at a bus station is Poisson with mean 2.5/h. What is the probability that at most three people show up in a four hour period?

Consider the number of people that show up at a bus station is Poisson with mean 2.5/h. What is the probability that at most three people show up in a four hour period? X ~ Poisson(λ = 2.5 × t), so over t = 4 hours, λ = 10 R code: ppois(3, lambda = 2.5*4) ## [1] 0.01033605

A random variable X is normal with mean 1020 and standard deviation 50. Calculate P(X>1200)

A random variable X is normal with mean 1020 and standard deviation 50. Calculate P(X>1200) X ~ N(1020, 50) The standardized quantile: z = (1200 − 1020)/50 = 3.6 R code: pnorm(3.6, lower.tail = F) ## [1] 0.0001591086

You flip a biased coin (p(head)=0.8) five times. What's the probability of getting three or more heads?

You flip a biased coin (p(head)=0.8) five times. What's the probability of getting three or more heads? 5 trials, p = 0.8 P(3 or more heads) = C(5,3)×0.8³×0.2² + C(5,4)×0.8⁴×0.2¹ + C(5,5)×0.8⁵×0.2⁰ ≈ 0.94
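
The same binomial calculation in R:

sum(dbinom(3:5, size = 5, prob = 0.8))             # 0.94208
# equivalently: pbinom(2, size = 5, prob = 0.8, lower.tail = FALSE)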

Infection rates at a hospital above 1 infection per 100 person-days at risk are considered high. A hospital had 10 infections over the last 1787 person-days at risk. Give the p-value of the correct one-sided test of whether the hospital is below the standard

Infection rates at a hospital above 1 infection per 100 person-days at risk are considered high. A hospital had 10 infections over the last 1787 person-days at risk. Give the p-value of the correct one-sided test of whether the hospital is below the standard One-sided test, assume a Poisson distribution H0: lambda = 0.01; H1: lambda < 0.01 (the hospital is below the standard) The p-value is P(X ≤ 10) with expected count 1787 × 0.01 = 17.87 R code: ppois(10, 1787*0.01) ## [1] 0.03237153

An HIV test has a sensitivity of 99.7% and a specificity of 98.5%. A subject from a population of prevalence 0.1% receives a positive test result. What is the precision of the test (i.e the probability he is HIV positive)?

An HIV test has a sensitivity of 99.7% and a specificity of 98.5%. A subject from a population of prevalence 0.1% receives a positive test result. What is the precision of the test (i.e the probability he is HIV positive)? Bayes rule: P(Actu+ | Pred+) = P(Pred+ | Actu+) × P(Actu+) / [P(Pred+ | Actu+) × P(Actu+) + P(Pred+ | Actu−) × P(Actu−)] We have: sensitivity × prevalence / [sensitivity × prevalence + (1 − specificity) × (1 − prevalence)] = 0.997 × 0.001 / (0.997 × 0.001 + 0.015 × 0.999) ≈ 0.062 So despite the very good test, only about 6% of people who test positive are actually HIV positive, because the prevalence is so low.

What is A/B testing?

What is A/B testing? -Two-sample hypothesis testing -Randomized experiments with two variants: A and B -A: control; B: variation -User-experience design: identify changes to web pages that increase clicks on a banner -Current website: control; NULL hypothesis -New version: variation; alternative hypothesis

What are confounding variables?

What are confounding variables? -Extraneous variable in a statistical model that correlates directly or inversely with both the dependent and the independent variable -A spurious relationship is a perceived relationship between an independent variable and a dependent variable that has been estimated incorrectly -The estimate fails to account for the confounding factor -See Question 18 about root cause analysis

How do you control for biases?

How do you control for biases? -Choose a representative sample, preferably by a random method -Choose an adequate size of sample -Identify all confounding factors if possible -Identify sources of bias and include them as additional predictors in statistical analyses -Use randomization: by randomly recruiting or assigning subjects in a study, all our experimental groups have an equal chance of being influenced by the same bias Notes: - Randomization: in randomized control trials, research participants are assigned by chance, rather than by choice to either the experimental group or the control group. - Random sampling: obtaining data that is representative of the population of interest

When you sample, what bias are you inflicting?

When you sample, what bias are you inflicting? Selection bias: - An online survey about computer use is likely to attract people more interested in technology than is typical Undercoverage bias: - Sampling too few observations from a segment of the population Survivorship bias: - Observations at the end of the study are a non-random set of those present at the beginning of the investigation - In finance and economics: the tendency for failed companies to be excluded from performance studies because they no longer exist

How do you calculate needed sample size?

How do you calculate needed sample size? Estimate a population mean: - General formula: ME = t × s/√n (or ME = z × s/√n) - ME is the desired margin of error - t (or z) is the t score or z score needed for the confidence interval - s is the standard deviation Example: we would like to start a study to estimate the average internet usage of households in one week for our business plan. How many households must we randomly select to be 95% sure that the sample mean is within 1 minute of the true population mean? A previous survey of household usage has shown a standard deviation of 6.95 minutes. - z score corresponding to a 95% interval: 1.96 (the 97.5th percentile, α/2 = 0.025) - s = 6.95 - n = (z × s / ME)² = (1.96 × 6.95 / 1)² = 13.62² ≈ 186 Estimate a proportion: - Similar: ME = z × √(p(1−p)/n) Example: a professor in Harvard wants to determine the proportion of students who support gay marriage. She asks "how larg...
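
The household example worked out in R:

z  <- qnorm(0.975)        # 1.96 for a 95% interval
s  <- 6.95                # standard deviation from the previous survey
ME <- 1                   # desired margin of error, in minutes
ceiling((z * s / ME)^2)   # 186 households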

What is the Law of Large Numbers?

What is the Law of Large Numbers? - A theorem that describes the result of performing the same experiment a large number of times - Forms the basis of frequency-style thinking - It says that the sample mean, the sample variance and the sample standard deviation converge to what they are trying to estimate - Example: roll a die; the expected value is 3.5. Over a large number of rolls, the average converges to 3.5

Given two fair dice, what is the probability of getting scores that sum to 4? To 8?

Given two fair dice, what is the probability of getting scores that sum to 4? To 8? - Total: 36 combinations - Of these, 3 sum to 4: (1,3), (3,1), (2,2) - So: 3/36 = 1/12 - For a sum of 8: (2,6), (3,5), (4,4), (6,2), (5,3) - So: 5/36

Give an example where the median is a better measure than the mean

Give an example where the median is a better measure than the mean When the data is skewed or has extreme outliers. Example: household income; a few very large incomes pull the mean up, while the median better reflects a typical household.

What is root cause analysis? How to identify a cause vs. a correlation? Give examples

What is root cause analysis? How to identify a cause vs. a correlation? Give examples Root cause analysis: - Method of problem solving used for identifying the root causes or faults of a problem - A factor is considered a root cause if removal of it prevents the final undesirable event from recurring Identify a cause vs. a correlation: - Correlation: statistical measure that describes the size and direction of a relationship between two or more variables. A correlation between two variables doesn't imply that the change in one variable is the cause of the change in the values of the other variable - Causation: indicates that one event is the result of the occurrence of the other event; there is a causal relationship between the two events - Differences between the two types of relationships are easy to identify, but establishing a cause and effect is difficult Example: sleeping with one's shoes on is strongly correlated with waking up with a headache. Correlatio...

Give examples of data that does not have a Gaussian distribution, nor log-normal.

Give examples of data that does not have a Gaussian distribution, nor log-normal. Allocation of wealth among individuals Values of oil reserves among oil fields (many small ones, a small number of large ones)

Define: quality assurance, six sigma.

Define: quality assurance, six sigma. Quality assurance: - A way of preventing mistakes or defects in manufactured products or when delivering services to customers - In a machine learning context: anomaly detection Six sigma: - A set of techniques and tools for process improvement - 99.99966% of products are defect-free (3.4 defects per million) - 6 standard deviations between the process mean and the nearest specification limit

What is: lift, KPI, robustness, model fitting, design of experiments, 80/20 rule?

What is: lift, KPI, robustness, model fitting, design of experiments, 80/20 rule? Lift: It's a measure of the performance of a targeting model (or a rule) at predicting or classifying cases as having an enhanced response (with respect to the population as a whole), measured against a random-choice targeting model. Lift is simply: target response / average response. Suppose a population has an average response rate of 5% (to a mailing, for instance). A certain model (or rule) has identified a segment with a response rate of 20%; then lift = 20/5 = 4. Typically, the modeler seeks to divide the population into quantiles and rank the quantiles by lift. He can then consider each quantile, and by weighing the predicted response rate against the cost, decide whether to market to that quantile or not. "If we use the probability scores on customers, we can get 60% of the total responders we'd get mailing randomly by only mailing the top 30% of the scored customers." KPI: - Key perform...

There's one box - has 12 black and 12 red cards, 2nd box has 24 black and 24 red; if you want to draw 2 cards at random from one of the 2 boxes, which box has the higher probability of getting the same color? Can you tell intuitively why the 2nd box has a higher probability

There's one box - has 12 black and 12 red cards, 2nd box has 24 black and 24 red; if you want to draw 2 cards at random from one of the 2 boxes, which box has the higher probability of getting the same color? Can you tell intuitively why the 2nd box has a higher probability - The first card can be anything; the second card must match its color - Box 1: P(same color) = 11/23; box 2: P(same color) = 23/47 - Ratio B/A = (23/47) / (11/23) = 529/517 > 1, so the second (larger) box has the higher probability - Intuition: removing one card changes the color proportions less in the bigger box, so the second card is more likely to still match

You're about to get on a plane to Seattle. You want to know if you should bring an umbrella. You call 3 random friends of yours who live there and ask each independently if it's raining. Each of your friends has a 2/3 chance of telling you the truth and a 1/3 chance of messing with you by lying. All 3 friends tell you that "Yes" it is raining. What is the probability that it's actually raining in Seattle?

You're about to get on a plane to Seattle. You want to know if you should bring an umbrella. You call 3 random friends of yours who live there and ask each independently if it's raining. Each of your friends has a 2/3 chance of telling you the truth and a 1/3 chance of messing with you by lying. All 3 friends tell you that "Yes" it is raining. What is the probability that it's actually raining in Seattle? - All three answer "yes" only if all three are telling the truth or all three are lying - P(all tell the truth) = (2/3)³ = 8/27 - P(all lie) = (1/3)³ = 1/27 - Assuming a 50/50 prior on rain, Bayes' rule gives P(rain | all say yes) = (8/27) / (8/27 + 1/27) = 8/9

You are compiling a report for user content uploaded every month and notice a spike in uploads in October. In particular, a spike in picture uploads. What might you think is the cause of this, and how would you test it?

You are compiling a report for user content uploaded every month and notice a spike in uploads in October. In particular, a spike in picture uploads. What might you think is the cause of this, and how would you test it? - Halloween pictures? - Look at uploads in countries that don't observe Halloween as a sort of counterfactual analysis - Compare mean uploads in October against September with a hypothesis test

Explain likely differences between administrative datasets and datasets gathered from experimental studies. What are likely problems encountered with administrative data? How do experimental methods help alleviate these problems? What problem do they bring?

Explain likely differences between administrative datasets and datasets gathered from experimental studies. What are likely problems encountered with administrative data? How do experimental methods help alleviate these problems? What problem do they bring? Advantages: - Cost - Large coverage of the population - Captures individuals who may not respond to surveys - Regularly updated, allowing consistent time series to be built up Disadvantages: - Restricted to data collected for administrative purposes (limited to administrative definitions. For instance: incomes of a married couple rather than individuals, which can be more useful) - Lack of researcher control over content - Missing or erroneous entries - Quality issues (addresses may not be updated or only a postal code is provided) - Data privacy issues - Underdeveloped theories and methods (sampling methods...)

You have data on the durations of calls to a call center. Generate a plan for how you would code and analyze these data. Explain a plausible scenario for what the distribution of these durations might look like. How could you test, even graphically, whether your expectations are borne out?

You have data on the durations of calls to a call center. Generate a plan for how you would code and analyze these data. Explain a plausible scenario for what the distribution of these durations might look like. How could you test, even graphically, whether your expectations are borne out? 1. Exploratory data analysis - Histogram of durations - Histograms of durations per service type, per day of week, per hour of day (durations can be systematically longer from 10am to 1pm for instance), per employee... 2. Distribution: lognormal? 3. Test graphically with a QQ plot: sample quantiles of log(durations) vs normal quantiles
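
A sketch of that graphical check in R, on simulated durations (the rlnorm parameters are hypothetical, for illustration only):

set.seed(1)
durations <- rlnorm(1000, meanlog = 4, sdlog = 0.8)   # simulated call durations
hist(durations, breaks = 50, main = "Call durations")
qqnorm(log(durations)); qqline(log(durations))        # a straight line suggests lognormal is plausible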

How do you handle missing data? What imputation techniques do you recommend?

How do you handle missing data? What imputation techniques do you recommend? -If data missing at random: deletion has no bias effect, but decreases the power of the analysis by decreasing the effective sample size. -Recommended: Knn imputation, Gaussian mixture imputation.

What is an outlier? Explain how you might screen for outliers and what would you do if you found them in your dataset.

What is an outlier? Explain how you might screen for outliers and what would you do if you found them in your dataset.  Also, explain what an inlier is and how you might screen for them and what would you do if you found them in your dataset. Outliers: - An observation point that is distant from other observations - Can occur by chance in any distribution - Often, they indicate measurement error or a heavy-tailed distribution - Measurement error: discard them or use robust statistics - Heavy-tailed distribution: high skewness, can't use tools assuming a normal distribution - Three-sigma rules (normally distributed data): 1 in 22 observations will differ by twice the standard deviation from the mean - Three-sigma rules: 1 in 370 observations will differ by three times the standard deviation from the mean Three-sigma rules example: in a sample of 1000 observations, the presence of up to 5 observations deviating from the mean by more than three times the standard deviat...

Is mean imputation of missing data acceptable practice? Why or why not?

Is mean imputation of missing data acceptable practice? Why or why not? -Bad practice in general -If just estimating means: mean imputation preserves the mean of the observed data -Leads to an underestimate of the standard deviation -Distorts relationships between variables by "pulling" estimates of the correlation toward zero

Provide a simple example of how an experimental design can help answer a question about behavior. How does experimental data contrast with observational data?

Provide a simple example of how an experimental design can help answer a question about behavior. How does experimental data contrast with observational data? - You are researching the effect of music-listening on studying efficiency - You might divide your subjects into two groups: one listens to music and the other (control group) doesn't listen to anything - You give them a test - Then, you compare grades between the two groups Differences between observational and experimental data: - Observational data: measures the characteristics of a population by studying individuals in a sample, but doesn't attempt to manipulate or influence the variables of interest - Experimental data: applies a treatment to individuals and attempts to isolate the effects of the treatment on a response variable Observational data: find 100 women age 30 of which 50 have been smoking a pack a day for 10 years while the others have been smoke-free for 10 years. Measure lung capacity for ea...

Explain selection bias (with regard to a dataset, not variable selection). Why is it important? How can data management procedures such as missing data handling make it worse?

Explain selection bias (with regard to a dataset, not variable selection). Why is it important? How can data management procedures such as missing data handling make it worse? - Selection of individuals, groups or data for analysis in such a way that proper randomization is not achieved Types: - Sampling bias: systematic error due to a non-random sample of a population, causing some members to be less likely to be included than others - Time interval: a trial may be terminated early at an extreme value (for ethical reasons), but the extreme value is likely to be reached by the variable with the largest variance, even if all the variables have similar means - Data: "cherry picking", when specific subsets of the data are chosen to support a conclusion (citing examples of plane crashes as evidence that airline flight is unsafe, while ignoring the far more common flights that complete safely) - Studies: performing experiments and reporting only the most favorable results - ...

What is statistical power?

What is statistical power? - The sensitivity of a binary hypothesis test - The probability that the test correctly rejects the null hypothesis H0 when the alternative H1 is true - The ability of a test to detect an effect, if the effect actually exists - Power = P(reject H0 | H1 is true) - As power increases, the chance of a Type II error (false negative) decreases - Used in the design of experiments to calculate the minimum sample size required so that one can reasonably detect an effect, i.e.: "how many times do I need to flip a coin to conclude it is biased?" - Used to compare tests. Example: between a parametric and a non-parametric test of the same hypothesis
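
A minimal power-analysis sketch in base R for the coin example, using the two-proportion approximation as a rough stand-in (the 0.5 vs 0.6 and 80% power figures are assumptions for illustration):

power.prop.test(p1 = 0.5, p2 = 0.6, sig.level = 0.05, power = 0.8)   # returns the required n
# for a difference in means, power.t.test(delta = ..., sd = ..., power = 0.8) works the same way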

What is the Central Limit Theorem? Explain it. Why is it important?

What is the Central Limit Theorem? Explain it. Why is it important? The CLT states that the arithmetic mean of a sufficiently large number of iterates of independent random variables will be approximately normally distributed regardless of the underlying distribution. i.e: the sampling distribution of the sample mean is normally distributed. - Used in hypothesis testing - Used for confidence intervals - Random variables must be iid: independent and identically distributed - Finite variance

Explain what a long-tailed distribution is and provide three examples of relevant phenomena that have long tails. Why are they important in classification and regression problems?

Explain what a long-tailed distribution is and provide three examples of relevant phenomena that have long tails. Why are they important in classification and regression problems? -In long tailed distributions, a high frequency population is followed by a low frequency population, which gradually tails off asymptotically -Rule of thumb: majority of occurrences (more than half, and when Pareto principles applies, 80%) are accounted for by the first 20% items in the distribution -The least frequently occurring 80% of items are more important as a proportion of the total population -Zipf's law, Pareto distribution, power laws Examples: 1) Natural language - Given some corpus of natural language - The frequency of any word is inversely proportional to its rank in the frequency table - The most frequent word will occur twice as often as the second most frequent, three times as often as the third most frequent... - "The" accounts for 7% of all word occurrences (700...

How do you assess the statistical significance of an insight?

How do you assess the statistical significance of an insight? Statistical significance can be assessed using hypothesis testing: - State a null hypothesis, which is usually the opposite of what we wish to test (classifiers A and B perform equivalently, treatment A is equal to treatment B) - Then, choose a suitable statistical test and the statistic used to reject the null hypothesis - Also, choose a critical region for the statistic to lie in that is extreme enough for the null hypothesis to be rejected (p-value) - Calculate the observed test statistic from the data and check whether it lies in the critical region Common tests: - One-sample Z test - Two-sample Z test - One-sample t-test - Paired t-test - Two-sample pooled equal variances t-test - Two-sample unpooled unequal variances t-test with unequal sample sizes (Welch's t-test) - Chi-squared test for variances - Chi-squared test for goodness of fit - ANOVA (for instance: are the two regression model...

Imagine you have N pieces of rope in a bucket. You reach in and grab one end-piece, then reach in and grab another end-piece, and tie those two together. What is the expected value of the number of loops in the bucket?

Imagine you have N pieces of rope in a bucket. You reach in and grab one end-piece, then reach in and grab another end-piece, and tie those two together. What is the expected value of the number of loops in the bucket? - There are n entirely unattached pieces of rope in a bucket - A loop: any number of ropes tied in a closed chain - Suppose the expected number of loops for n−1 pieces of rope is L(n−1) - Consider the bucket with n pieces of rope; there are 2n rope ends - Pick an end of rope. Of the remaining 2n−1 ends, only one creates a loop (the other end of the same piece of rope); in that case there are then n−1 untied pieces of rope. The rest of the time, two separate pieces of rope are tied together and there are still effectively n−1 untied pieces of rope. The recurrence is therefore: - L(n) = 1/(2n−1) + L(n−1) - Clearly, L(1) = 1, so: - L(n) = Σ_{k=1..n} 1/(2k−1) = H(2n) − H(n)/2 - where H(k) is the k-th harmonic number - Since H(k) ≈ γ + ln k for large k, L(n) ≈ (ln n)/2 + ln 2 + γ/2, which grows very slowly with n
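
The closed form checked against a simulation in R (function names are illustrative):

expected_loops <- function(n) sum(1 / (2 * (1:n) - 1))   # L(n) = sum of 1/(2k-1)

simulate_loops <- function(n) {
  loops  <- 0
  pieces <- n                        # untied pieces remaining
  while (pieces > 0) {
    # the two grabbed ends belong to the same piece with probability 1/(2*pieces - 1)
    if (runif(1) < 1 / (2 * pieces - 1)) loops <- loops + 1
    pieces <- pieces - 1             # either way, one fewer untied piece remains
  }
  loops
}

set.seed(42)
c(formula = expected_loops(100), simulated = mean(replicate(2000, simulate_loops(100))))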

How do we multiply matrices?

How do we multiply matrices? - A ∈ R^(n×m) and B ∈ R^(m×p) - The product AB ∈ R^(n×p) has entries (AB)_ij = Σ_{k=1..m} A_ik B_kj

What is Ax = b? How to solve it?

What is Ax = b? How to solve it? - A matrix equation, i.e. a system of linear equations - If A is non-singular, x = A⁻¹b: compute the inverse of A, or better, solve the system directly - Can be done using Gaussian elimination
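
A tiny R example covering both questions: multiply with %*%, solve Ax = b with solve() (the 2×2 matrix is just an illustration):

A <- matrix(c(2, 1, 1, 3), nrow = 2)    # 2x2, non-singular
b <- c(1, 2)
x <- solve(A, b)                        # preferred over solve(A) %*% b: no explicit inverse
A %*% x                                 # recovers b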

What is the curse of dimensionality? How does it affect distance and similarity measures?

What is the curse of dimensionality? How does it affect distance and similarity measures? - Refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces - Common theme: when the number of dimensions increases, the volume of the space increases so fast that the available data becomes sparse - Issue with any method that requires statistical significance: the amount of data needed to support the result grows exponentially with the dimensionality - Issue when algorithms don't scale well to high dimensions, typically when their cost grows exponentially with the number of dimensions - Everything becomes far apart and difficult to organize Illustrative example: compare the volume of an inscribed hypersphere with radius r and dimension d to that of a hypercube with edges of length 2r - Volume of such a sphere: V_sphere = 2 r^d π^(d/2) / (d Γ(d/2)) - Volume of the cube: V_cube = (2r)^d As d (the space dimension) increases, the volume of the hypersphere becomes insignificant relative to that of the hypercube: nearly all of the high-dimensional volume sits in the corners, far from the centre
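
The sphere-to-cube volume ratio computed in R for a few dimensions:

sphere_to_cube <- function(d, r = 1) {
  v_sphere <- 2 * r^d * pi^(d / 2) / (d * gamma(d / 2))   # volume of the inscribed d-ball
  v_cube   <- (2 * r)^d                                   # volume of the hypercube
  v_sphere / v_cube
}
sapply(c(2, 5, 10, 20), sphere_to_cube)   # about 0.785, 0.164, 0.0025, 2.5e-8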

Is it beneficial to perform dimensionality reduction before fitting an SVM? Why or why not?

Is it beneficial to perform dimensionality reduction before fitting an SVM? Why or why not? - Yes, when the number of features is large compared to the number of observations (e.g. a document-term matrix) - The SVM will tend to perform better (and train faster) in the reduced space

Which kernels do you know? How to choose a kernel?

Which kernels do you know? How to choose a kernel? - Gaussian kernel - Linear kernel - Polynomial kernel - Laplace kernel - Esoteric kernels: string kernels, chi-square kernels - If the number of features is large (relative to the number of observations): SVM with a linear kernel; e.g. text classification with lots of words and a small training set - If the number of features is small and the number of observations is intermediate: Gaussian kernel - If the number of features is small and the number of observations is small: linear kernel

What is the maximal margin classifier? How this margin can be achieved?

What is the maximal margin classifier? How this margin can be achieved? - When the data can be perfectly separated using a hyperplane, there actually exists an infinite number of such hyperplanes - Intuition: a hyperplane can usually be shifted a tiny bit up, or down, or rotated, without coming into contact with any of the observations - Maximal margin classifier: choosing the hyperplane that is farthest from the training observations - This margin is achieved using support vectors

How do we train a logistic regression model? How do we interpret its coefficients?

How do we train a logistic regression model? How do we interpret its coefficients? log(odds) = log( P(y=1|x) / P(y=0|x) ) is modelled as a linear function of the input features, β^T x Minimization objective / cost function: - J(β) = −(1/m) Σ_{i=1..m} [ y_i log(h_β(x_i)) + (1−y_i) log(1−h_β(x_i)) ] - where h_β(x) = g(β^T x) and g(z) = 1/(1+e^(−z)) (the sigmoid function) - Intuition: if y_i = 0, the cost term is −log(1−h_β(x_i)), which goes to ∞ as h_β(x_i) approaches 1; conversely, when y_i = 1, the cost term is −log(h_β(x_i)), which goes to ∞ as h_β(x_i) approaches 0 Interpretation of the coefficients: the increase in log-odds for a one-unit increase of a predictor, all other predictors held fixed.
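
A minimal R sketch on simulated data (the true intercept −1 and slope 2 are made up for illustration):

set.seed(1)
x <- rnorm(200)
y <- rbinom(200, size = 1, prob = plogis(-1 + 2 * x))  # simulate from a logistic model
fit <- glm(y ~ x, family = binomial)
coef(fit)        # estimated log-odds intercept and slope
exp(coef(fit))   # odds ratios: multiplicative change in odds per one-unit increase in x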

What is random forest? Why is it good?

What is random forest? Why is it good? Random forest (intuition): - Underlying principle: several weak learners combined provide a strong learner - Builds several decision trees on bootstrapped training samples of the data - On each tree, each time a split is considered, a random sample of m predictors is chosen as split candidates out of all p predictors - Rule of thumb: at each split, m ≈ √p - Predictions: by the majority rule Why is it good? - Very good performance (decorrelates the trees) - Can model non-linear class boundaries - Generalization error for free: no cross-validation needed; the out-of-bag error gives an unbiased estimate of the generalization error as the trees are built - Generates variable importance
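
A hedged sketch with the randomForest package (assuming it is installed; iris ships with R):

library(randomForest)
set.seed(7)
fit <- randomForest(Species ~ ., data = iris, ntree = 500, importance = TRUE)
fit                      # prints the out-of-bag (OOB) error estimate, no cross-validation needed
importance(fit)          # variable importance
varImpPlot(fit)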

What impurity measures do you know?

What impurity measures do you know? Gini - Gini = 1 − Σ_j p_j² Information gain / deviance (entropy) - Entropy = −Σ_j p_j log2(p_j) - Better than Gini when the p_j are very small: multiplying very small numbers leads to rounding errors, so we take logs instead.
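
Both measures computed in R for a node with hypothetical class proportions:

p <- c(0.7, 0.2, 0.1)            # made-up class proportions at a node
gini    <- 1 - sum(p^2)          # 0.46
entropy <- -sum(p * log2(p))     # about 1.157
c(gini = gini, entropy = entropy)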

What is a decision tree?

What is a decision tree? 1. Take the entire data set as input 2. Search for a split that maximizes the "separation" of the classes. A split is any test that divides the data in two (e.g. if variable2 > 10) 3. Apply the split to the input data (divide step) 4. Re-apply steps 2 and 3 to each of the divided subsets 5. Stop when you meet some stopping criteria 6. (Optional) Clean up the tree when you went too far doing splits (called pruning) Finding a split: methods vary, from greedy search (e.g. C4.5) to randomly selecting attributes and split points (random forests) Purity measures: information gain, Gini coefficient, chi-squared values Stopping criteria: methods vary, from minimum size to particular confidence in prediction to a purity criterion threshold Pruning: reduced error pruning, out-of-bag error pruning (ensemble methods)
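
A quick classification tree in R with the rpart package (shipped with R), on the built-in iris data; the cp value for pruning is just an example:

library(rpart)
tree <- rpart(Species ~ ., data = iris, method = "class")
print(tree)                        # the splits chosen at each node
printcp(tree)                      # complexity table, useful for pruning
pruned <- prune(tree, cp = 0.1)    # prune back using a complexity-parameter threshold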

How to check if the regression model fits the data well?

How to check if the regression model fits the data well? R-squared / adjusted R-squared: - R² = (RSS_tot − RSS_res)/RSS_tot = RSS_reg/RSS_tot = 1 − RSS_res/RSS_tot - Describes the proportion of the total variation explained by the model - R² always increases when adding new variables: adjusted R² incorporates the model's degrees of freedom F test: - Evaluates the hypothesis H0: all regression coefficients are equal to zero vs H1: at least one isn't - Indicates whether the model as a whole is significant (whether R² is meaningful) RMSE: - Absolute measure of fit (whereas R² is a relative measure of fit)
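
All three diagnostics from a single fit in R (mtcars is built in; the predictors are just an example):

fit <- lm(mpg ~ wt + hp, data = mtcars)
summary(fit)                   # reports R-squared, adjusted R-squared and the overall F-test
sqrt(mean(residuals(fit)^2))   # RMSE on the training data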

What is collinearity and what to do with it? How to remove multicollinearity?

What is collinearity and what to do with it? How to remove multicollinearity? Collinearity/multicollinearity: - In multiple regression: when two or more predictors are highly correlated - They provide redundant information - In case of perfect multicollinearity, β̂ = (XᵀX)⁻¹Xᵀy doesn't exist: the design matrix isn't invertible - It doesn't affect the model as a whole and doesn't bias results - The standard errors of the regression coefficients of the affected variables tend to be large - The test of the hypothesis that a coefficient is equal to zero may lead to a failure to reject a false null hypothesis of no effect of the explanatory variable (Type II error) - Leads to overfitting Remove multicollinearity: - Drop some of the affected variables - Principal component regression: gives uncorrelated predictors - Combine the affected variables - Ridge regression - Partial least squares regression Detection of multicollinearity: - Large changes in the individual ...

What are the assumptions required for linear regression? What if some of these assumptions are violated?

What are the assumptions required for linear regression? What if some of these assumptions are violated? 1. The data used in fitting the model is representative of the population 2. The true underlying relation between x and y is linear 3. The variance of the residuals is constant (homoscedastic, not heteroscedastic) 4. The residuals are independent 5. The residuals are normally distributed - Predict y from x: 1) + 2) - Estimate the standard error of predictors: 1) + 2) + 3) - Get an unbiased estimation of y from x: 1) + 2) + 3) + 4) - Make probability statements, hypothesis tests involving slope and correlation, confidence intervals: 1) + 2) + 3) + 4) + 5) Note: - Contrary to common belief, linear regression doesn't assume anything about the distributions of x and y - It only makes assumptions about the distribution of the residuals - And this is only needed for statistical tests to be valid - Regression can be applied to many purposes, even if the errors are not norm...

Do we always need the intercept term in a regression model?

Do we always need the intercept term in a regression model? -It guarantees that the residuals have a zero mean -It guarantees the least squares slopes estimates are unbiased -the regression line floats up and down, by adjusting the constant, to a point where the mean of the residuals is zero

How would you define and measure the predictive power of a metric?

How would you define and measure the predictive power of a metric? - Predictive power of a metric: how accurately the metric predicts the empirical outcome it is meant to track - It is domain specific - Example: in a field like manufacturing, failure rates of tools are easily observable; a metric can be trained and its success measured as the deviation over time from the observed failure rates - In information security: if the metric says that an attack is coming and one should do X, did the recommendation stop the attack, or would the attack never have happened anyway?

Do you know / used data reduction techniques other than PCA? What do you think of step-wise regression? What kind of step-wise techniques are you familiar with?

Do you know / used data reduction techniques other than PCA? What do you think of step-wise regression? What kind of step-wise techniques are you familiar with? Data reduction techniques other than principal component analysis (PCA): - Partial least squares: like PCR (principal component regression) but chooses the principal components in a supervised way; gives higher weights to variables that are most strongly related to the response Step-wise regression: - The choice of predictive variables is carried out using a systematic procedure - Usually, it takes the form of a sequence of F-tests, t-tests, adjusted R-squared, AIC, BIC - At any given step, the model is fit using unconstrained least squares - Can get stuck in local optima - Better: Lasso Step-wise techniques: - Forward selection: begin with no variables, adding them when they improve a chosen model comparison criterion - Backward selection: begin with all the variables, removing them when it improves a cho...

What do you think about the idea of injecting noise in your data set to test the sensitivity of your models?

What do you think about the idea of injecting noise in your data set to test the sensitivity of your models? -Effect would be similar to regularization: avoid overfitting -Used to increase robustness

How can you prove that one improvement you've brought to an algorithm is really an improvement over not doing anything? Are you familiar with A/B testing?

How can you prove that one improvement you've brought to an algorithm is really an improvement over not doing anything? Are you familiar with A/B testing? Example with linear regression: - F-statistic (ANOVA): F = [(RSS1 − RSS2)/(p2 − p1)] / [RSS2/(n − p2)] - p1: number of parameters of model 1 - p2: number of parameters of model 2 - n: number of observations Under the null hypothesis that model 2 doesn't provide a significantly better fit than model 1, F will have an F distribution with (p2 − p1, n − p2) degrees of freedom. The null hypothesis is rejected if the F calculated from the data is greater than the critical value of the F distribution for the desired significance level. Others: AIC/BIC (regression), cross-validation: assessing test error on a test/validation set
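
The nested-model F test and information criteria in R, with mtcars standing in for real data:

fit1 <- lm(mpg ~ wt, data = mtcars)            # baseline model
fit2 <- lm(mpg ~ wt + hp, data = mtcars)       # model with the "improvement"
anova(fit1, fit2)                              # F test: does the extra term significantly reduce RSS?
AIC(fit1, fit2); BIC(fit1, fit2)               # information-criterion comparison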

Why is mean square error a bad measure of model performance? What would you suggest instead?

Why is mean square error a bad measure of model performance? What would you suggest instead? -see question 3 about metrics in regression -It puts too much emphasis on large deviations (squared) -Alternative: mean absolute deviation

Do you think 50 small decision trees are better than a large one? Why?

Do you think 50 small decision trees are better than a large one? Why? - Yes! - A more robust model (an ensemble of weak learners combines into a strong learner) - Better to improve a model by taking many small steps than a few large ones - If one tree is erroneous, its error is averaged out by the others - Less prone to overfitting

What are the drawbacks of linear model? Are you familiar with alternatives (Lasso, ridge regression)?

What are the drawbacks of linear model? Are you familiar with alternatives (Lasso, ridge regression)? - Assumes a linear relationship between predictors and response (and well-behaved, normally distributed errors for inference) - Can't be used directly for count outcomes or binary outcomes - Can't vary model flexibility: under- or overfitting problems - Alternatives: see question 4 about regularization

Why is naive Bayes so bad? How would you improve a spam detection algorithm that uses naive Bayes?

Why is naive Bayes so bad? How would you improve a spam detection algorithm that uses naive Bayes? -Naïve: the features are assumed independent/uncorrelated -Assumption not feasible in many cases -Improvement: decorrelate features (covariance matrix into identity matrix)

What is better: good data or good models? And how do you define "good"? Is there a universal good model? Are there any models that are definitely not so good?

What is better: good data or good models? And how do you define "good"? Is there a universal good model? Are there any models that are definitely not so good? -Good data is definitely more important than good models -If quality of the data wasn't of importance, organizations wouldn't spend so much time cleaning and preprocessing it! -Even for scientific purpose: good data (reflected by the design of experiments) is very important How do you define good? - good data: data relevant regarding the project/task to be handled - good model: model relevant regarding the project/task - good model: a model that generalizes on external data sets Is there a universal good model? - No, otherwise there wouldn't be the overfitting problem! - Algorithm can be universal but not the model - Model built on a specific data set in a specific organization could be ineffective in other data set of the same organization - Models have to be updated on a somewhat regular b...

What is: collaborative filtering, n-grams, cosine distance?

What is: collaborative filtering, n-grams, cosine distance? Collaborative filtering: - Technique used by some recommender systems - Filtering for information or patterns using techniques involving collaboration of multiple agents: viewpoints, data sources 1. A user expresses his/her preferences by rating items (movies, CDs...) 2. The system matches this user's ratings against other users' and finds the people with the most similar tastes 3. With similar users, the system recommends items that the similar users have rated highly but that this user has not yet rated n-grams: - Contiguous sequence of n items from a given sequence of text or speech - "Andrew is a talented data scientist" - Bi-grams: "Andrew is", "is a", "a talented"... - Tri-grams: "Andrew is a", "is a talented", "a talented data"... - An n-gram model models sequences using the statistical properties of n-grams; see: Shannon Game - More concisely...

How do you test whether a new credit risk scoring model works?

How do you test whether a new credit risk scoring model works? -Test on a holdout set -Kolmogorov-Smirnov test Kolmogorov-Smirnov test: - Non-parametric test - Compare a sample with a reference probability distribution or compare two samples - Quantifies a distance between the empirical distribution function of the sample and the cumulative distribution function of the reference distribution - Or between the empirical distribution functions of two samples - Null hypothesis (two-samples test): samples are drawn from the same distribution - Can be modified as a goodness of fit test - In our case: cumulative percentages of good, cumulative percentages of bad
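
A hedged sketch of the two-sample KS test on simulated credit scores (all numbers below are hypothetical):

set.seed(3)
scores_good <- rnorm(500, mean = 650, sd = 60)   # scores of customers who repaid
scores_bad  <- rnorm(200, mean = 600, sd = 60)   # scores of customers who defaulted
ks.test(scores_good, scores_bad)                 # D = maximum gap between the two empirical CDFs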

How do you know if one algorithm is better than other?

How do you know if one algorithm is better than other? -In terms of performance on a given data set? -In terms of performance on several data sets? -In terms of efficiency? In terms of performance on several data sets: - "Does learning algorithm A have a higher chance of producing a better predictor than learning algorithm B in the given context?" - "Bayesian Comparison of Machine Learning Algorithms on Single and Multiple Datasets", A. Lacoste and F. Laviolette - "Statistical Comparisons of Classifiers over Multiple Data Sets", Janez Demsar In terms of performance on a given data set: - One wants to choose between two learning algorithms - Need to compare their performances and assess the statistical significance One approach (Not preferred in the literature): - Multiple k-fold cross validation: run CV multiple times and take the mean and sd - You have: algorithm A (mean and sd) and algorithm B (mean and sd) - Is the difference meaning...

How do you take millions of users with 100's transactions each, amongst 10k's of products and group the users together in meaningful segments?

How do you take millions of users with 100's transactions each, amongst 10k's of products and group the users together in meaningful segments? 1. Some exploratory data analysis (get a first insight) -Transactions by date -Count of customers Vs number of items bought -Total items Vs total basket per customer -Total items Vs total basket per area 2. Create new features (per customer): Counts: -Total baskets (unique days) -Total items -Total spent -Unique product id Distributions: -Items per basket -Spent per basket -Product id per basket -Duration between visits -Product preferences: proportion of items per product cat per basket 3. Too many features, dimension-reduction? PCA? 4. Clustering: -PCA 5. Interpreting model fit -View the clustering by principal component axis pairs PC1 Vs PC2, PC2 Vs PC1. -Interpret each principal component regarding the linear combination it's obtained from; example: PC1=spendy axis (proportion of baskets conta...

When would you use random forests Vs SVM and why?

When would you use random forests Vs SVM and why? - In a case of a multi-class classification problem: SVM will require a one-against-all approach (memory intensive) - If one needs to know the variable importance (random forests can provide it) - If one needs to get a model fast (SVM takes long to tune: you need to choose the appropriate kernel and its parameters, for instance sigma and epsilon) - In a semi-supervised learning context (random forest with a dissimilarity measure): SVM only works in a supervised learning mode

What are feature vectors?

What are feature vectors? -n-dimensional vector of numerical features that represent some object -term occurrences frequencies, pixels of an image etc. -Feature space: vector space associated with these vectors

What does NLP stand for?

"Natural language processing"! -Interaction with human (natural) and computers languages -Involves natural language understanding Major tasks: - Machine translation - Question answering: "what's the capital of Canada?" - Sentiment analysis: extract subjective information from a set of documents, identify trends or public opinions in the social media - Information retrieval

Difference between Supervised learning and Unsupervised learning.

Supervised learning: inferring a function from labeled training data. Predictor measurements are associated with a response measurement; we wish to fit a model that relates the two, either to better understand the relationship between them (inference) or to accurately predict the response for future observations (prediction). Supervised learning methods: support vector machines, neural networks, linear regression, logistic regression, extreme gradient boosting. Supervised learning examples: predict the price of a house based on its area and size; churn prediction; predict the relevance of search engine results.

Unsupervised learning: inferring a function to describe the hidden structure of unlabeled data. We lack a response variable that can supervise our analysis. Methods: clustering, principal component analysis, singular value decomposition. Unsupervised learning examples: find customer segments; image segmentation; classify US senators by thei...
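
A toy sketch of the distinction (all numbers are invented): the supervised learner is fitted on labeled pairs (X, y), while the unsupervised one only receives X.

```python
# Hedged sketch: supervised fits X -> y, unsupervised only looks for structure in X.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# Supervised: house size (X) with known prices (y), then predict a new response.
X = np.array([[50], [80], [120], [200]])
y = np.array([150_000, 230_000, 340_000, 520_000])
model = LinearRegression().fit(X, y)      # learns from the labels
print(model.predict([[100]]))             # predicted price for a 100 m^2 house

# Unsupervised: no response variable, just group similar customers together.
customers = np.array([[1, 200], [2, 250], [30, 9000], [28, 8500]])
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(customers))
```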

Explain what a false positive and a false negative are.

Provide examples when false positives are more important than false negatives, false negatives are more important than false positives, and when these two types of errors are equally important.

False positive: improperly reporting the presence of a condition when it is not present in reality. Example: an HIV-positive test when the patient is actually HIV negative.

False negative: improperly reporting the absence of a condition when in reality it is present. Example: not detecting a disease when the patient actually has it.

When false positives are more important than false negatives: in a non-contagious disease, where a treatment delay doesn't have any long-term consequences but the treatment itself is grueling; an HIV test: psychological impact.

When false negatives are more important than false positives: if early treatment is important for good outcomes; in quality control: a defective item passes through the cracks!; software testing: a test to catch a virus has ...

Principal component analysis (PCA)

A statistical method that uses an orthogonal transformation to convert a set of observations of correlated variables into a set of values of linearly uncorrelated variables called principal components. Reduce the data from n to k dimensions: find the k vectors onto which to project the data so as to minimize the projection error.

Algorithm:
1) Preprocessing (standardization): PCA is sensitive to the relative scaling of the original variables.
2) Compute the covariance matrix Σ.
3) Compute the eigenvectors of Σ.
4) Choose k principal components so as to retain x% of the variance (typically x = 99).

Applications:
1) Compression
- Reduce the disk/memory needed to store data.
- Speed up the learning algorithm. Warning: the mapping should be defined only on the training set and then applied to the test set (see the sketch below).
2) Visualization: 2 or 3 principal components, so as to summarize data.

Limitations:
- PCA is not scale invariant.
- The directions with largest variance are assumed to be of most int...
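
A minimal sketch of this recipe with scikit-learn on a placeholder data set: standardize, keep enough components for ~99% of the variance, and define the mapping on the training set only.

```python
# Hedged sketch of the algorithm above: standardize, fit PCA on the training
# set only, keep ~99% of the variance, reuse the same mapping on the test set.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)          # stand-in data set
X_train, X_test = train_test_split(X, random_state=0)

scaler = StandardScaler().fit(X_train)               # step 1: standardization
pca = PCA(n_components=0.99).fit(scaler.transform(X_train))  # retain 99% variance
print("k =", pca.n_components_, "components retained")

Z_train = pca.transform(scaler.transform(X_train))   # mapping defined on train...
Z_test = pca.transform(scaler.transform(X_test))     # ...then applied to test
```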

How does ANCOVA differ from Blocking?

Both remove error from the DV, but ANCOVA also controls for variation in the IV associated with the covariate.

As with blocking: the effects of the covariate are subtracted from the error term, making it smaller. The covariate is a more powerful way to do this if the control variable is continuous, but it's conceptually the same.

Unlike blocking: treatment means are adjusted to account for differences on the covariate. Random assignment to IV conditions normally prevents differences in covariate means (confounds should be designed out). But in case the covariate does differ across groups, ANCOVA effectively partials out the effect of the covariate from the focal IV as well as from the error term.
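
A hedged sketch of an ANCOVA fit with the statsmodels formula API, assuming a hypothetical data frame with dv, treatment and covariate columns.

```python
# Hedged sketch: ANCOVA with a continuous covariate via the statsmodels
# formula API; the data file and column names are assumptions for illustration.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("experiment.csv")  # assumed columns: dv, treatment, covariate

# The covariate term pulls its variance out of the error term, and the
# treatment effect is estimated after adjusting for the covariate.
ancova = smf.ols("dv ~ C(treatment) + covariate", data=df).fit()
print(ancova.summary())
```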

Latent Semantic Indexing

An indexing and retrieval method that uses singular value decomposition to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text. Based on the principle that words that are used in the same contexts tend to have similar meanings. "Latent": the semantic associations between words are present not explicitly but only latently. For example: two synonyms may never occur in the same passage but should nonetheless have highly associated representations.

Latent Semantic Indexing is used for:
- Learning correct word meanings
- Subject matter comprehension
- Information retrieval
- Sentiment analysis (social network analysis)
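
A minimal sketch of the idea with scikit-learn: TF-IDF weights plus truncated SVD, which is how the library implements latent semantic analysis; the documents are made up.

```python
# Hedged sketch of latent semantic indexing: term-document matrix + truncated SVD.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the car is driven on the road",
    "the truck is driven on the highway",
    "a sentimental journey through literature",
]
X = TfidfVectorizer().fit_transform(docs)            # term-document weights
lsi = TruncatedSVD(n_components=2, random_state=0)   # SVD onto latent "concepts"
doc_topics = lsi.fit_transform(X)
print(doc_topics)  # the two vehicle documents end up close in the latent space
```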

Assume you need to generate a predictive model using multiple regression. Explain how you intend to validate this model?

Validation using R²: the % of variance retained by the model. Issue: R² always increases when adding variables.

$R^2 = \frac{RSS_{tot} - RSS_{res}}{RSS_{tot}} = \frac{RSS_{reg}}{RSS_{tot}} = 1 - \frac{RSS_{res}}{RSS_{tot}}$

Analysis of residuals:
- Heteroskedasticity (a relation between the variance of the model errors and the size of an independent variable's observations)
- Scatter plots of residuals Vs predictors
- Normality of errors
- Etc.: diagnostic plots

Out-of-sample evaluation: with cross-validation.
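
A minimal sketch of these checks on a placeholder data set: residuals Vs fitted values to look for heteroskedasticity, plus a cross-validated R² as the out-of-sample evaluation.

```python
# Hedged sketch: fit a multiple regression, inspect residuals, then do an
# out-of-sample check with cross-validation (data set is a stand-in).
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)
model = LinearRegression().fit(X, y)

residuals = y - model.predict(X)
plt.scatter(model.predict(X), residuals)   # look for heteroskedasticity / structure
plt.xlabel("fitted values")
plt.ylabel("residuals")
plt.show()

print("in-sample R^2:", model.score(X, y))
print("cross-validated R^2:",
      cross_val_score(model, X, y, cv=5, scoring="r2").mean())
```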

Explain what a local optimum is and why it is important in a specific context, such as K-means clustering. What are specific ways of determining if you have a local optimum problem? What can be done to avoid local optima?

A local optimum is a solution that is optimal within a neighboring set of candidate solutions, in contrast with a global optimum: the optimal solution among all candidate solutions.

K-means clustering context: it's proven that the objective cost function will always decrease until a local optimum is reached, so the result depends on the initial random cluster assignment.

Determining if you have a local optimum problem:
- Tendency of premature convergence
- Different initializations induce different optima

Avoiding local optima in a K-means context: repeat K-means and take the solution that has the lowest cost (see the sketch below).
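
A small sketch of the restart strategy on synthetic blobs: single random initializations land on different costs, and repeating K-means (what scikit-learn's n_init does) keeps the lowest one.

```python
# Hedged sketch: different random initializations can converge to different
# local optima; repeat K-means and keep the run with the lowest cost (inertia).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

costs = []
for seed in range(10):
    km = KMeans(n_clusters=4, n_init=1, init="random", random_state=seed).fit(X)
    costs.append(km.inertia_)                 # objective cost at convergence
print("costs across restarts:", [round(c, 1) for c in costs])

# In practice n_init does exactly this: run several times, keep the best solution.
best = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
print("best inertia:", best.inertia_)
```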

Explain what regularization is and why it is useful. What are the benefits and drawbacks of specific methods, such as ridge regression and lasso?

Regularization is used to prevent overfitting: it improves the generalization of a model and decreases the complexity of a model. It works by introducing a regularization term into a general loss function, i.e. adding a term to the minimization problem, which imposes Occam's razor on the solution.

Ridge regression: we use an L2 penalty when fitting the model using least squares. We add to the minimization problem an expression (the shrinkage penalty) of the form λ × ∑βⱼ². λ is the tuning parameter; it controls the bias-variance tradeoff and is chosen with cross-validation. A bit faster than the lasso:

$\hat{\beta}^{ridge} = \underset{\beta}{\arg\min} \left\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}$

The lasso: we use an L1 penalty when fitting the model using least squares. It can force regression coefficients to be exactly zero, so it is a feature selection method by itself.

$\hat{\beta}^{lasso} = \underset{\beta}{\arg\min} \left\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\}$
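
A minimal sketch of both penalties with scikit-learn on a placeholder data set, with the tuning parameter (λ above, called alpha in scikit-learn) chosen by cross-validation; note how the lasso zeroes out some coefficients.

```python
# Hedged sketch: ridge (L2) Vs lasso (L1), with alpha chosen by cross-validation.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)       # penalties are scale-sensitive

ridge = RidgeCV(alphas=np.logspace(-3, 3, 50), cv=5).fit(X, y)
lasso = LassoCV(cv=5, random_state=0).fit(X, y)

print("ridge alpha:", ridge.alpha_, "non-zero coefs:", np.sum(ridge.coef_ != 0))
print("lasso alpha:", lasso.alpha_, "non-zero coefs:", np.sum(lasso.coef_ != 0))
# Ridge shrinks coefficients toward zero; the lasso sets some exactly to zero.
```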

How to define/select metrics?

Type of task: regression? Classification? What is the business goal? What is the distribution of the target variable? What metric do we optimize for?

Regression: RMSE (root mean squared error), MAE (mean absolute error), WMAE (weighted mean absolute error), RMSLE (root mean squared logarithmic error).
Classification: recall, AUC, accuracy, misclassification error, Cohen's Kappa.

Common metrics in regression: mean squared error Vs mean absolute error. RMSE gives a relatively high weight to large errors, so the RMSE is most useful when large errors are particularly undesirable. The MAE is a linear score: all the individual differences are weighted equally in the average, so MAE is more robust to outliers than MSE.

$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$

$MAE = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$

Root mean squared logarithmic error: RMSLE penalizes an under-predicted estimate more than an over-predicted estimate (the opposite of RMSE).

$RMSLE = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \big( \log(p_i + 1) - \log(a_i + 1) \big)^2}$ ...
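
A minimal sketch computing these three regression metrics with scikit-learn on made-up predictions.

```python
# Hedged sketch: RMSE, MAE and RMSLE on placeholder true/predicted values.
import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             mean_squared_log_error)

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))      # penalizes large errors
mae = mean_absolute_error(y_true, y_pred)               # robust to outliers
rmsle = np.sqrt(mean_squared_log_error(y_true, y_pred)) # penalizes under-prediction
print(f"RMSE={rmse:.3f}  MAE={mae:.3f}  RMSLE={rmsle:.3f}")
```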

Is it better to design robust or accurate algorithms?

The ultimate goal is to design systems with good generalization capacity, that is, systems that correctly identify patterns in data instances not seen before. The generalization performance of a learning system strongly depends on the complexity of the model assumed.

If the model is too simple, the system can only capture the actual data regularities in a rough manner. In this case, the system has poor generalization properties and is said to suffer from underfitting.

By contrast, when the model is too complex, the system can identify accidental patterns in the training data that need not be present in the test set. These spurious patterns can be the result of random fluctuations or of measurement errors during the data collection process. In this case, the generalization capacity of the learning system is also poor. The learning system is said to be affected by overfitting. Spurious patterns, which are only present by accident in the data, tend to have complex forms. This is the...

What is cross-validation? How to do it right?

Cross-validation is a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the goal is prediction and one wants to estimate how accurately a model will perform in practice. The goal of cross-validation is to define a data set to test the model during the training phase (i.e. a validation data set) in order to limit problems like overfitting and to get an insight into how the model will generalize to an independent data set. Examples: leave-one-out cross-validation, K-fold cross-validation.

How to do it right?
- The training and validation data sets have to be drawn from the same population. Predicting stock prices: if the model is trained on a certain 5-year period, it's unrealistic to treat the subsequent 5-year period as a draw from the same population.
- Common mistake: steps such as choosing the kernel parameters of an SVM should be cross-validated as well (see the sketch below).
- Bias-variance trade-off for k-fold cross ...
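
A hedged sketch of the "common mistake" fix: the SVM kernel parameters are tuned inside the cross-validation (nested CV) rather than on the full data set beforehand; the data set is a placeholder.

```python
# Hedged sketch of nested cross-validation: the parameter search (inner CV) is
# itself wrapped inside an outer CV, so tuning does not leak into the estimate.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": [0.01, 0.1, 1]}
inner = GridSearchCV(pipeline, param_grid, cv=5)   # kernel parameters are CV'd too

outer_scores = cross_val_score(inner, X, y, cv=5)  # honest generalization estimate
print("nested CV accuracy: %.3f +/- %.3f" % (outer_scores.mean(), outer_scores.std()))
```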

Cross-validation

What is cross-validation? Cross-validation, sometimes called 'rotation estimation', is a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice. In a prediction problem, a model is usually given a dataset of known data on which training is run (the training dataset) and a dataset of unknown data (or first-seen data) against which the model is tested (the testing dataset). The goal of cross-validation is to define a dataset to "test" the model in the training phase (i.e., the validation dataset), in order to limit problems like overfitting, give an insight into how the model will generalize to an independent dataset (i.e., an unknown dataset, for instance from a real problem), etc.

How does cross-validation work? One round of cross-validation involves partiti...

Exploratory Factor Analysis (EFA)

Introduction to Exploratory Factor Analysis (EFA)
- The things psychologists are interested in are often unobservable.
- How psychologists have held onto their jobs: they are good at coming up with indirect ways of measuring things (questionnaires, experimental tasks, etc.), where directly observable responses reflect unobservable psychological constructs.
- Operationalise theoretical constructs: run statistical tests on observable measures and use them to discuss psychological constructs (e.g. differences in state anxiety scores described as "differences in anxiety"). We can only do this if the measures accurately reflect the psychological constructs being discussed. EFA is one way to check whether measures reflect the construct.

EFA uses:
- We often collect data on far more variables than we care to talk about.
- EFA simplifies complex data sets: it organises similar variables by assessing the shared variance in responses.
- Hypothetical constructs aren't directly measured in the study; EFA just mathematically examines which varia...