Outliers - reasons for screening data, Advanced Statistics

Outliers arise for three main reasons: a data entry error, a subject who is not a member of the population the sample is meant to represent, or a subject who really is different from the rest of the sample. Statistical tests are quite sensitive to outliers, so the problem should be addressed before the analysis is run.

Univariate outliers are easy to detect (z-scores, box plots, histograms, etc.). Cases with standard scores beyond +/-3 are treated as outliers (consider a cutoff of 4 if n>100, or 2.5 if n<10).
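The z-score rule above can be sketched in a few lines of Python. This is only an illustration, not a prescribed procedure: the function name zscore_outliers, the default cutoff of 3, and the scores list are all made up for the example, and the sample standard deviation (n - 1 denominator) is an assumption, since the text does not say which one to use.

```python
from statistics import mean, stdev

def zscore_outliers(values, threshold=3.0):
    """Return one True/False flag per value: True if |z| exceeds threshold."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (assumed; n - 1 denominator)
    return [abs((v - m) / s) > threshold for v in values]

# Hypothetical data: 20 ordinary scores plus one extreme value at the end
scores = [50, 52, 48, 51, 49, 53, 47, 50, 52, 48,
          51, 49, 50, 52, 48, 51, 49, 50, 53, 47, 120]

flags = zscore_outliers(scores)
outliers = [v for v, flagged in zip(scores, flags) if flagged]
print(outliers)
```

For a large sample the threshold argument could be raised to 4, and for a very small one lowered to 2.5, as the rule of thumb suggests.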

Multivariate outliers are difficult to detect. Mahalanobis distance is one powerful technique to use in this case (discussed later). The squared distance of each case from the centroid is evaluated as a chi-square statistic with degrees of freedom equal to the number of variables in the analysis. A case whose chi-square value is significant at the p<0.001 level is treated as a multivariate outlier.
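As a rough sketch of the Mahalanobis check, the example below handles only the two-variable case so that it needs nothing beyond the Python standard library. The function names, the data, and the use of the sample covariance matrix are assumptions made for illustration. With two variables the covariance matrix can be inverted in closed form, and the chi-square distribution with 2 degrees of freedom has the survival function exp(-x/2), so the p<0.001 critical value is -2 ln(0.001), about 13.82. With more variables one would normally use a linear algebra library for the inverse and a statistics library for the chi-square critical value.

```python
import math
from statistics import mean

def mahalanobis_sq_2d(points):
    """Squared Mahalanobis distance of each (x, y) point from the centroid."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    n = len(points)
    mx, my = mean(xs), mean(ys)
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]] (n - 1 denominator)
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    d2 = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # d' S^-1 d, using the closed-form inverse of a 2x2 matrix
        d2.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return d2

def chi2_critical_df2(alpha=0.001):
    """Chi-square critical value for 2 degrees of freedom only."""
    return -2.0 * math.log(alpha)

# Hypothetical data: 30 cases close to the line y = x, plus one case that
# is unremarkable on x alone but far from the joint pattern
points = [(i, i + (1 if i % 2 == 0 else -1)) for i in range(30)]
points.append((15, -5))

cutoff = chi2_critical_df2(0.001)
distances = mahalanobis_sq_2d(points)
outlier_indices = [i for i, d in enumerate(distances) if d > cutoff]
print(outlier_indices)
```

The added case has an x value of 15, right in the middle of the range, so a univariate z-score on x would not flag it; it stands out only when both variables are considered together.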

In most cases, it is acceptable to drop the outlying case from the sample. If the researcher decides to keep the values in the analysis, steps can be taken to reduce their relative influence.
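The text does not say which steps it has in mind for reducing influence. One possibility, shown here purely as an assumption, is to pull extreme scores in to a fixed boundary (mean plus or minus 3 standard deviations) rather than delete them. The function name clip_to_sd_bounds and the data are hypothetical.

```python
from statistics import mean, stdev

def clip_to_sd_bounds(values, k=3.0):
    """Replace values beyond mean +/- k*SD with the nearest boundary value."""
    m, s = mean(values), stdev(values)  # sample SD (n - 1 denominator)
    lo, hi = m - k * s, m + k * s
    return [min(max(v, lo), hi) for v in values]

# Hypothetical data: 20 ordinary scores plus one extreme value at the end
scores = [50, 52, 48, 51, 49, 53, 47, 50, 52, 48,
          51, 49, 50, 52, 48, 51, 49, 50, 53, 47, 120]

adjusted = clip_to_sd_bounds(scores)
print(adjusted[-1])  # the extreme score is pulled in; the others are unchanged
```

The case stays in the sample, so n is preserved, but its distance from the mean no longer dominates the statistics computed from the data.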

Posted Date: 3/4/2013 6:22:24 AM | Location : United States






