Outliers - reasons for screening data, Advanced Statistics


Outliers arise for three main reasons: a data entry error, a subject who is not a member of the population that the sample is meant to represent, or a subject who really is different from the rest. Statistical tests are quite sensitive to outliers, so the problem should be addressed before the analysis is run.

Univariate outliers are easy to detect with z-scores, box plots, histograms, etc. As a rule of thumb, cases with standard scores beyond +/-3 are treated as outliers (consider a cutoff of 4 if n>100, or 2.5 if n<10).
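The z-score rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (zscore_cutoff, univariate_outliers) and the sample scores are made up for the example, and the sample standard deviation (n - 1 denominator) is assumed since the text does not say which one to use.

```python
from statistics import mean, stdev


def zscore_cutoff(n):
    """Rule-of-thumb cutoff from the text: 4 if n > 100, 2.5 if n < 10, else 3."""
    if n > 100:
        return 4.0
    if n < 10:
        return 2.5
    return 3.0


def univariate_outliers(values):
    """Return the indices of values whose |z-score| exceeds the cutoff."""
    cutoff = zscore_cutoff(len(values))
    m = mean(values)
    s = stdev(values)  # assumption: sample SD (n - 1 denominator)
    if s == 0:
        return []  # all values identical, so nothing can be an outlier
    return [i for i, v in enumerate(values) if abs((v - m) / s) > cutoff]


# Made-up illustrative data: 19 ordinary scores and one suspicious value.
scores = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
          12, 10, 11, 13, 12, 11, 12, 10, 13, 40]

print(univariate_outliers(scores))  # [19], i.e. the value 40
```

Here n = 20, so the cutoff is 3; the value 40 has a z-score of about 4.2 and is the only case flagged.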

Multivariate outliers are more difficult to detect. Mahalanobis distance is one powerful technique to use in this case (discussed later). The squared distance is evaluated as a chi-square statistic with degrees of freedom equal to the number of variables in the analysis. A case whose chi-square value is significant at the p<0.001 level is treated as an outlier.
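As a rough sketch of the Mahalanobis screening described above, the following Python example handles the special case of exactly two variables. That restriction is an assumption made to keep the example dependency-free: with 2 degrees of freedom the chi-square upper-tail probability has the closed form exp(-d2/2), so no statistics library is needed. The function names and the data are hypothetical.

```python
import math
from statistics import mean


def mahalanobis_sq_2d(points):
    """Squared Mahalanobis distance of each (x, y) point from the centroid."""
    n = len(points)
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = mean(xs), mean(ys)
    # Entries of the sample covariance matrix (n - 1 denominator).
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # d^2 = [dx dy] S^-1 [dx dy]^T, with the 2x2 inverse written out.
        distances.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return distances


def multivariate_outliers_2d(points, alpha=0.001):
    """Indices of points whose chi-square p-value (df = 2) is below alpha."""
    flagged = []
    for i, d2 in enumerate(mahalanobis_sq_2d(points)):
        p_value = math.exp(-d2 / 2.0)  # chi-square upper tail, valid for df = 2 only
        if p_value < alpha:
            flagged.append(i)
    return flagged


# Made-up data: 30 points close to the line y = x, plus one point whose
# x and y are each unremarkable on their own but whose combination is unusual.
points = [(float(i), i + (0.5 if i % 2 else -0.5)) for i in range(1, 31)]
points.append((15.5, 23.5))

print(multivariate_outliers_2d(points))  # [30], the appended point
```

The flagged case would pass a univariate z-score check on either variable, which is why multivariate outliers need a separate screen. For more than two variables, the same idea applies, but a general matrix inverse and a chi-square routine from a statistics library would be needed.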

In most cases, it is acceptable to drop the outlying case from the sample. If the researcher decides to keep such cases in the analysis, steps can also be taken to reduce their relative influence.
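The text does not name a specific way to reduce the influence of outliers. One common option, shown here purely as an assumption and not as the method the author has in mind, is winsorizing: the most extreme values are pulled in to the nearest remaining value instead of being dropped. The function name and data are hypothetical.

```python
def winsorize(values, k=1):
    """Replace the k smallest and k largest values with their nearest
    non-extreme neighbours, keeping the sample size unchanged."""
    if k < 0 or 2 * k >= len(values):
        raise ValueError("k must be non-negative and less than half the sample size")
    ordered = sorted(values)
    low, high = ordered[k], ordered[-k - 1]
    # Clamp every value into the range [low, high].
    return [min(max(v, low), high) for v in values]


# Made-up data with one extreme value.
scores = [10, 12, 11, 13, 40]
print(winsorize(scores))  # [11, 12, 11, 13, 13]
```

Note that this version trims both tails symmetrically, so the smallest value is adjusted as well; whether that is appropriate depends on the data.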

Posted Date: 3/4/2013 6:22:24 AM | Location : United States






