Outliers - reasons for screening data, Advanced Statistics

Outliers arise for three main reasons: a data entry error, a subject who is not a member of the population that the sample is meant to represent, or a subject who is genuinely different from the rest. Statistical tests are quite sensitive to outliers, so the problem should be addressed before the analysis is run.

Univariate outliers are easy to detect with z-scores, box plots, histograms, and similar tools. Cases with standard scores beyond +/-3 are treated as outliers (consider a cutoff of 4 if n>100, or 2.5 if n<10).
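
The z-score rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names z_cutoff and univariate_outliers and the sample scores are made up for the example, and the sample standard deviation (n-1 in the denominator) is assumed because the text does not say which one to use.

```python
from statistics import mean, stdev


def z_cutoff(n):
    """Cutoff on |z| following the rule of thumb in the text.

    3 by default, 4 for n > 100, 2.5 for n < 10.
    """
    if n > 100:
        return 4.0
    if n < 10:
        return 2.5
    return 3.0


def univariate_outliers(values):
    """Return (index, value, z) for every case whose |z| exceeds the cutoff.

    Assumption: z-scores use the sample standard deviation (n - 1).
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    m = mean(values)
    s = stdev(values)
    if s == 0:
        return []  # all values identical, nothing can be extreme
    cutoff = z_cutoff(n)
    flagged = []
    for i, v in enumerate(values):
        z = (v - m) / s
        if abs(z) > cutoff:
            flagged.append((i, v, z))
    return flagged


# Hypothetical sample: 19 ordinary scores and one probable entry error.
scores = [10, 11, 9, 10, 12, 8, 10, 11, 9, 10,
          10, 11, 9, 10, 12, 8, 10, 11, 9, 100]
for index, value, z in univariate_outliers(scores):
    print(f"case {index}: value={value}, z={z:.2f}")
```

One thing to keep in mind with small samples: the largest z-score a single case can reach is (n-1)/sqrt(n), so with n=10 no case can ever exceed 3. That is consistent with the text's suggestion to lower the cutoff when n is small.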

Multivariate outliers are harder to detect. Mahalanobis distance is one powerful technique for this case (discussed later). The squared distance is evaluated as a chi-square statistic with degrees of freedom equal to the number of variables in the analysis. A case whose chi-square value is significant at the p<0.001 level is treated as an outlier.
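
As a rough illustration of the chi-square criterion, the sketch below handles the two-variable case using only the Python standard library. Everything named here (mahalanobis_sq_2d, chi2_sf_df2, multivariate_outliers_2d, the sample points) is hypothetical. It relies on the fact that with 2 degrees of freedom the chi-square upper-tail probability is exp(-x/2); for more variables a statistics library would be needed for the matrix inverse and the chi-square tail.

```python
import math
from statistics import mean


def mahalanobis_sq_2d(points):
    """Squared Mahalanobis distance of each (x, y) point from the sample centroid.

    Uses the sample covariance matrix (n - 1 denominator), inverted by hand
    because it is only 2 x 2.
    """
    n = len(points)
    if n < 3:
        raise ValueError("need at least three observations")
    mx = mean(p[0] for p in points)
    my = mean(p[1] for p in points)
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # d' S^-1 d with S^-1 = [[syy, -sxy], [-sxy, sxx]] / det
        distances.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return distances


def chi2_sf_df2(x):
    """Upper-tail probability of a chi-square variable with 2 degrees of freedom."""
    return math.exp(-x / 2.0)


def multivariate_outliers_2d(points, alpha=0.001):
    """Indices of cases whose squared distance is significant at p < alpha."""
    return [i for i, d2 in enumerate(mahalanobis_sq_2d(points))
            if chi2_sf_df2(d2) < alpha]


# Hypothetical sample: 40 cases where y tracks x closely, plus one case
# (5, 30) whose x and y are each within the observed range but whose
# combination does not fit the pattern.
points = [(i, i + (1 if i % 2 == 0 else -1)) for i in range(40)] + [(5, 30)]
print(multivariate_outliers_2d(points))
```

The added case would not stand out on either variable alone, which is why a multivariate check is needed. With two variables the p<0.001 cutoff corresponds to a squared distance of about 13.82.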

In most cases, it is acceptable to drop the outlying value from the sample. If the researcher decides to keep the value in the analysis, steps can be taken instead to reduce its relative influence.

Posted Date: 3/4/2013 6:22:24 AM | Location : United States
