Outliers - reasons for screening data, Advanced Statistics

Outliers - Reasons for Screening Data

Outliers arise from data entry errors, from subjects who are not members of the population the sample is meant to represent, or from subjects who really are different from the rest of the sample. Statistical tests are quite sensitive to outliers, so the problem should be addressed.

Univariate outliers are easy to detect with z-scores, box plots, histograms, and similar tools. As a rule of thumb, standard scores beyond +/-3 are treated as outliers (consider +/-4 if n>100, or +/-2.5 if n<10).
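The z-score rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name z_score_outliers, the sample standard deviation (n - 1 denominator), and the example scores are all made up here and are not part of the original notes.

```python
from statistics import mean, stdev


def z_score_outliers(values, threshold=3.0):
    """Return (index, value, z) for observations with |z| above threshold.

    Hypothetical helper: standardizes with the sample mean and the sample
    standard deviation (n - 1 denominator).
    """
    m = mean(values)
    s = stdev(values)
    if s == 0:
        return []  # all values identical, so nothing can stand out
    flagged = []
    for i, x in enumerate(values):
        z = (x - m) / s
        if abs(z) > threshold:
            flagged.append((i, x, z))
    return flagged


# Made-up example: 19 ordinary scores and one suspicious entry (index 19).
scores = [10, 11, 9, 10, 12, 8, 10, 11, 9, 10,
          10, 11, 9, 10, 12, 8, 10, 11, 9, 100]
for index, value, z in z_score_outliers(scores):
    print(f"index {index}: value {value}, z = {z:.2f}")
```

One reason a smaller cutoff makes sense for small samples: when z is computed with the sample standard deviation, no observation can have |z| larger than (n-1)/sqrt(n), which is below 3 for n of 10 or fewer. With so few cases, a cutoff of 3 could never flag anything.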

Multivariate outliers are more difficult to detect. Mahalanobis distance is one powerful technique to use in this case (discussed later). The squared distance is evaluated as a chi-square statistic with degrees of freedom equal to the number of variables in the analysis. Cases whose chi-square value is significant at the p<0.001 level are treated as outliers.
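As a rough sketch of the Mahalanobis screening, the Python below handles the simplest case of two variables. Everything named here (mahalanobis_outliers_2d, the alpha parameter, the example data) is an assumption made for illustration. The sketch is restricted to two variables because the chi-square upper-tail probability with 2 degrees of freedom has the closed form exp(-d2/2); with more variables one would normally invert the covariance matrix and look up the chi-square probability with a statistics library.

```python
from math import exp
from statistics import mean


def mahalanobis_outliers_2d(points, alpha=0.001):
    """Flag bivariate outliers by squared Mahalanobis distance.

    Hypothetical helper for exactly two variables. Returns a list of
    (index, d2, p_value) for cases whose p_value is below alpha.
    """
    n = len(points)
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = mean(xs), mean(ys)

    # Sample covariance matrix entries (n - 1 denominator).
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)

    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular")

    flagged = []
    for i, (x, y) in enumerate(points):
        dx, dy = x - mx, y - my
        # d2 = [dx dy] S^-1 [dx dy]^T with the 2x2 inverse written out.
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        # Chi-square upper-tail probability for 2 degrees of freedom.
        p_value = exp(-d2 / 2)
        if p_value < alpha:
            flagged.append((i, d2, p_value))
    return flagged


# Made-up data: y tracks x closely, except the case at index 12 (13, 3).
points = [(i + 1, i + 1 + (1 if i % 2 == 0 else -1)) for i in range(25)]
points[12] = (13, 3)
for index, d2, p in mahalanobis_outliers_2d(points):
    print(f"index {index}: d2 = {d2:.3f}, p = {p:.6f}")
```

In the made-up data, the case (13, 3) has unremarkable values on each variable taken separately; only the combination is unusual. That is the kind of case univariate screening tends to miss.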

In most cases, it is acceptable to drop the outlying value from the sample. If the researcher decides to keep the values in the analysis, steps can be taken to reduce the relative influence of the outliers.

Posted Date: 3/4/2013 6:22:24 AM | Location : United States






