Explain selection bias (with regard to a dataset, not variable selection). Why is it important? How can data management procedures such as missing data handling make it worse?

August 28, 2018

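- Selection bias arises when individuals, groups, or data are selected for analysis in such a way that proper randomization is not achieved, so the sample is not representative of the population it is meant to describe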

Types:
- Sampling bias: systematic error due to a non-random sample of a population, causing some members to be less likely to be included than others
- Time interval: a trial may be terminated early at an extreme value (often for ethical reasons), but the extreme value is most likely to be reached by the variable with the largest variance, even if all the variables have similar means
- Data: "cherry picking", when specific subsets of the data are chosen to support a conclusion (for example, citing plane crashes as evidence that air travel is unsafe while ignoring the far more common flights that complete safely)
- Studies: performing many experiments and reporting only the most favorable results
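
Why is it important?
- It can lead to inaccurate or even erroneous conclusions
- Statistical methods generally cannot overcome it: collecting more data through the same biased selection process does not remove the bias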
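
The sampling bias described above can be illustrated with a small simulation. This is only a sketch under made-up assumptions: the population (normal with mean 50 and standard deviation 10) and the inclusion probabilities (0.8 for values above 50, 0.2 otherwise) are hypothetical numbers chosen for illustration, and the function name biased_sample_mean is not from any library.

```python
import random


def biased_sample_mean(n_population=100_000, seed=0):
    """Compare the true population mean with the mean of a non-random sample."""
    rng = random.Random(seed)

    # Hypothetical population: values drawn from a normal distribution.
    population = [rng.gauss(50.0, 10.0) for _ in range(n_population)]
    true_mean = sum(population) / len(population)

    # Biased selection (assumed): members with high values are four times
    # as likely to end up in the sample as members with low values.
    sample = [
        x for x in population
        if rng.random() < (0.8 if x > 50.0 else 0.2)
    ]
    sample_mean = sum(sample) / len(sample)
    return true_mean, sample_mean, len(sample)


true_mean, sample_mean, n_sampled = biased_sample_mean()
print(f"true mean:   {true_mean:.2f}")
print(f"sample mean: {sample_mean:.2f} (from {n_sampled} sampled members)")
```

Even though the sample contains tens of thousands of members, its mean sits well above the population mean. Increasing n_population would tighten the estimate around the wrong value rather than correct it.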

Why can data handling make it worse?
- Example: individuals who know or suspect that they are HIV positive are less likely to participate in HIV surveys
- Missing data handling amplifies this effect, because it fills in or drops the missing responses based on the observed respondents, who are mostly HIV negative
- Prevalence estimates will therefore be inaccurate, typically too low
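
The HIV survey example can be sketched as a simulation. All numbers here are assumptions for illustration only (a true prevalence of 10%, a 40% response rate among HIV-positive individuals, and an 80% response rate among HIV-negative individuals), and the naive imputation shown, filling each missing answer with a random draw from the observed answers, is just one simple stand-in for a missing data procedure.

```python
import random


def simulate_survey(n=200_000, true_prevalence=0.10,
                    p_respond_pos=0.4, p_respond_neg=0.8, seed=1):
    """Simulate a survey in which HIV-positive people respond less often."""
    rng = random.Random(seed)

    # True status of everyone in the (hypothetical) target population.
    status = [rng.random() < true_prevalence for _ in range(n)]

    # Response depends on status: this is the selection mechanism.
    responded = [
        rng.random() < (p_respond_pos if s else p_respond_neg)
        for s in status
    ]

    # Complete-case estimate: use respondents only.
    observed = [s for s, r in zip(status, responded) if r]
    complete_case = sum(observed) / len(observed)

    # Naive imputation: replace each missing answer with a random draw
    # from the observed answers, which are mostly HIV negative.
    imputed = [
        s if r else rng.choice(observed)
        for s, r in zip(status, responded)
    ]
    imputed_estimate = sum(imputed) / n

    actual = sum(status) / n
    return actual, complete_case, imputed_estimate


actual, complete_case, imputed_estimate = simulate_survey()
print(f"actual prevalence:      {actual:.3f}")
print(f"complete-case estimate: {complete_case:.3f}")
print(f"after naive imputation: {imputed_estimate:.3f}")
```

Under these assumptions both estimates come out at roughly half the actual prevalence. The imputed dataset looks complete and has the full sample size, so it appears more trustworthy than the respondent-only data while carrying exactly the same bias.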
