Outliers - reasons for screening data, Advanced Statistics

Outliers arise from data entry errors, from subjects who are not members of the population the sample is meant to represent, or from subjects who genuinely differ from the rest. Many statistical tests are quite sensitive to outliers, so the problem should be addressed before analysis.

Univariate outliers are easy to detect with z-scores, box plots, histograms, and similar tools. Cases with standard scores beyond +/-3 are treated as outliers (consider a cutoff of 4 if n > 100, or 2.5 if n < 10).
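As a rough illustration of the z-score rule, here is a minimal Python sketch. The function name z_score_outliers and the scores list are made up for this example, and the standard deviation used is the sample version (n - 1 denominator), which is an assumption since the text does not say which one to use.

```python
import statistics

def z_score_outliers(values, cutoff=3.0):
    """Return (index, value, z) for every value whose |z| exceeds the cutoff."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    flagged = []
    for i, v in enumerate(values):
        z = (v - mean) / sd
        if abs(z) > cutoff:
            flagged.append((i, v, z))
    return flagged

# Hypothetical data: 19 ordinary scores and one suspicious entry (120).
scores = [50, 52, 48, 51, 49, 53, 47, 50, 52, 48,
          51, 49, 50, 52, 48, 51, 49, 50, 53, 120]

outliers = z_score_outliers(scores)  # default cutoff of 3
for index, value, z in outliers:
    print(f"case {index}: value={value}, z={z:.2f}")
```

The cutoff is a parameter so that the stricter or looser limits suggested for large and small samples can be passed in instead of 3.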

Multivariate outliers are difficult to detect. Mahalanobis distance is one powerful technique to use in this case (discussed later). The (squared) distance is evaluated as a chi-square statistic with degrees of freedom equal to the number of variables in the analysis. A case whose chi-square value is significant at the p < 0.001 level is treated as an outlier.
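To make the chi-square criterion concrete, here is a small sketch restricted to two variables, so that the covariance matrix can be inverted by hand and no external library is needed. The function name, the data, and the restriction to two variables are all assumptions made for illustration. With two degrees of freedom the chi-square upper-tail probability is exp(-x / 2), so the p = 0.001 critical value is -2 ln(0.001), about 13.82.

```python
import math

def mahalanobis_sq_2d(points):
    """Squared Mahalanobis distance of each (x, y) point from the sample centroid."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Entries of the sample covariance matrix (n - 1 denominator).
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # d^2 = [dx dy] S^{-1} [dx dy]^T, with the 2x2 inverse written out.
        distances.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return distances

# Critical chi-square value for df = 2 (two variables) at p = 0.001.
CRITICAL = -2 * math.log(0.001)

# Hypothetical data: 24 cases where y tracks x closely, plus one case
# (12, 27) that is unremarkable on x alone but breaks the x-y pattern.
points = [(i, i + (1 if i % 2 == 0 else -1)) for i in range(1, 25)]
points.append((12, 27))

d2 = mahalanobis_sq_2d(points)
flagged = [i for i, d in enumerate(d2) if d > CRITICAL]
print(flagged)
```

For more than two variables the same idea applies, but the covariance matrix would need a general inverse and the critical value would come from a chi-square table or a statistics library.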

In most cases it is acceptable to drop the outlying value from the sample. If the researcher decides to keep it in the analysis, steps can be taken to reduce its relative influence.
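The text does not say which steps reduce an outlier's influence. One common option, shown here purely as an assumed example, is winsorizing: the most extreme values are pulled in to the nearest value that is kept. The function name winsorize and the data are hypothetical.

```python
def winsorize(values, k=1):
    """Replace the k smallest and k largest values with the nearest retained value."""
    if 2 * k >= len(values):
        raise ValueError("k is too large for this sample")
    ordered = sorted(values)
    low, high = ordered[k], ordered[-k - 1]
    # Clip every value into the [low, high] range, preserving the original order.
    return [min(max(v, low), high) for v in values]

# Hypothetical data with one extreme score.
scores = [50, 52, 48, 51, 49, 53, 47, 50, 52, 120]
adjusted = winsorize(scores, k=1)
print(adjusted)
```

Transforming the variable (for example, taking logarithms of a skewed measure) is another way to achieve a similar effect. Which approach is appropriate depends on the data and the analysis.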

Posted Date: 3/4/2013 6:22:24 AM | Location : United States






