Data Stat: November 2008

Monday, November 24, 2008

Statistic Assumptions

• Normal distribution of data (which can be tested by using a normality test, such as the Shapiro-Wilk and Kolmogorov-Smirnov tests).

• Equality of variances (which can be tested by using the F test, the more robust Levene's test, Bartlett's test, or the Brown-Forsythe test); see the sketch after this list.

• Samples may be independent or dependent, depending on the hypothesis and the type of samples:

o Independent samples are usually two randomly selected groups

o Dependent samples are either two groups matched on some variable (for example, age) or are the same people being tested twice (called repeated measures)
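As an illustration, here is a minimal sketch in Python (assuming NumPy and SciPy are available; the data and variable names are hypothetical) of checking these assumptions on two simulated groups before running a t test:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=5.0, scale=1.0, size=30)  # hypothetical sample 1
group_b = rng.normal(loc=5.5, scale=1.0, size=30)  # hypothetical sample 2

# Normality: Shapiro-Wilk test on each group (a large p-value gives no
# evidence against normality).
print(stats.shapiro(group_a))
print(stats.shapiro(group_b))

# Equality of variances: Levene's test, which is robust to non-normality.
print(stats.levene(group_a, group_b))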

Since all calculations are done subject to the null hypothesis, it may be very difficult to come up with a reasonable null hypothesis that accounts for equal means in the presence of unequal variances. In the usual case, the null hypothesis is that the different treatments have no effect — this makes unequal variances untenable. In this case, one should forgo the ease of using this variant afforded by the statistical packages. See also Behrens–Fisher problem.
One scenario in which it would be plausible to have equal means but unequal variances is when the 'samples' represent repeated measurements of a single quantity, taken using two different methods. If systematic error is negligible (e.g. due to appropriate calibration) the effective population means for the two measurement methods are equal, but they may still have different levels of precision and hence different variances.

Determining type

For novices, the most difficult issue is often whether the samples are independent or dependent. Independent samples typically consist of two groups with no relationship. Dependent samples typically consist of a matched sample (or a "paired" sample) or one group that has been tested twice (repeated measures).
Dependent t-tests are also used for matched-paired samples, where two groups are matched on a particular variable. For example, if we examined the heights of men and women in a relationship, the two groups are matched on relationship status. This would call for a dependent t-test because it is a paired sample (one man paired with one woman). Alternatively, we might recruit 100 men and 100 women, with no relationship between any particular man and any particular woman; in this case we would use an independent samples test.
Another example of a matched sample would be to take two groups of students, match each student in one group with a student in the other group based on an achievement test result, and then examine how much each student reads. An example pair might be two students who scored 90 and 91, or two students who scored 45 and 40, on the same test. The hypothesis would be that students who did well on the test may or may not read more. Alternatively, we might recruit students with low scores and students with high scores in two groups and assess their reading amounts independently.
An example of a repeated measures t-test would be if one group were pre- and post-tested. (This example occurs in education quite frequently.) If a teacher wanted to examine the effect of a new set of textbooks on student achievement, (s)he could test the class at the beginning of the year (pretest) and at the end of the year (posttest). A dependent t-test would be used, treating the pretest and posttest as matched variables (matched by student).
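The following sketch (Python with SciPy; the scores are simulated and the names hypothetical) contrasts the two designs: a dependent (paired) t test on pretest/posttest scores from the same students, and an independent-samples t test on two unrelated groups:

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Repeated measures: the same 25 students tested before and after.
pretest = rng.normal(70, 8, size=25)
posttest = pretest + rng.normal(3, 4, size=25)   # each student changes a little
print(stats.ttest_rel(pretest, posttest))        # dependent (paired) t test

# Independent samples: two unrelated groups of students.
group_1 = rng.normal(70, 8, size=25)
group_2 = rng.normal(73, 8, size=25)
print(stats.ttest_ind(group_1, group_2))         # independent-samples t test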

Statistic Uses

Among the most frequently used t tests are the following (each is illustrated in the sketch after this list):

* A test of whether the mean of a normally distributed population has a value specified in a null hypothesis.
* A test of the null hypothesis that the means of two normally distributed populations are equal. Given two data sets, each characterized by its mean, standard deviation and number of data points, we can use some kind of t test to determine whether the means are distinct, provided that the underlying distributions can be assumed to be normal. All such tests are usually called Student's t tests, though strictly speaking that name should only be used if the variances of the two populations are also assumed to be equal; the form of the test used when this assumption is dropped is sometimes called Welch's t test. There are different versions of the t test depending on whether the two samples are
o unpaired, independent of each other (e.g., individuals randomly assigned into two groups, measured after an intervention and compared with the other group[4]), or
o paired, so that each member of one sample has a unique relationship with a particular member of the other sample (e.g., the same people measured before and after an intervention[4]).

If the calculated p-value is below the threshold chosen for statistical significance (usually the 0.10, the 0.05, or the 0.01 level), then the null hypothesis, which usually states that the two groups do not differ, is rejected in favor of an alternative hypothesis, which typically states that the groups do differ.

* A test of whether the slope of a regression line differs significantly from 0.
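A rough sketch of these three uses in Python with SciPy (synthetic data; all values are hypothetical):

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(10, 2, size=40)
y = rng.normal(12, 2, size=40)

# One-sample test: is the mean of x equal to the value 10 specified
# in the null hypothesis?
print(stats.ttest_1samp(x, popmean=10))

# Two-sample tests: Student's t test (equal variances assumed) and
# Welch's t test (that assumption dropped).
print(stats.ttest_ind(x, y, equal_var=True))    # Student's
print(stats.ttest_ind(x, y, equal_var=False))   # Welch's

# Slope test: does the slope of a regression line differ from 0?
predictor = rng.normal(0, 1, size=40)
response = 2.0 * predictor + rng.normal(0, 1, size=40)
print(stats.linregress(predictor, response))    # includes a p-value for the slope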

Once a t value is determined, a p-value can be found using a table of values from Student's t-distribution.
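For example, a small sketch (Python with SciPy; the t value and degrees of freedom are hypothetical) of obtaining a two-sided p-value from the t distribution instead of a printed table:

from scipy import stats

t_value = 2.3                                # hypothetical calculated t statistic
df = 18                                      # hypothetical degrees of freedom
p_value = 2 * stats.t.sf(abs(t_value), df)   # two-sided p-value
print(p_value)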

Sunday, November 23, 2008

Correlation



[Figure: several sets of (x, y) points, with the correlation coefficient of x and y for each set. The correlation reflects the noisiness and direction of a linear relationship (top row), but not the slope of that relationship (middle), nor many aspects of nonlinear relationships (bottom). The figure in the center has a slope of 0, but in that case the correlation coefficient is undefined because the variance of Y is zero.]

In probability theory and statistics, correlation (often measured as a correlation coefficient) indicates the strength and direction of a linear relationship between two random variables. That is in contrast with the usage of the term in colloquial speech, denoting any relationship, not necessarily linear. In general statistical usage, correlation or co-relation refers to the departure of two random variables from independence. In this broad sense there are several coefficients, measuring the degree of correlation, adapted to the nature of the data.
A number of different coefficients are used for different situations. The best known is the Pearson product-moment correlation coefficient, which is obtained by dividing the covariance of the two variables by the product of their standard deviations. Despite its name, it was first introduced by Francis Galton.[1]
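A minimal sketch (Python with NumPy and SciPy, synthetic data) of this definition, computing the Pearson coefficient as the covariance divided by the product of the standard deviations and comparing it with the library routine:

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(0, 1, size=200)
y = 0.8 * x + rng.normal(0, 0.6, size=200)    # linearly related, with noise

# Pearson r from its definition: cov(x, y) / (std(x) * std(y)).
r_manual = np.cov(x, y)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))

# The same coefficient from SciPy (which also returns a p-value).
r_scipy, p = stats.pearsonr(x, y)
print(r_manual, r_scipy)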

Statistical assumptions

When the number of measurements, N, is larger than the number of unknown parameters, k, and the measurement errors εi (see below) are normally distributed, then the excess of information contained in the (N - k) measurements is used to make the following statistical predictions about the unknown parameters:
• confidence intervals for the unknown parameters.

Independent measurements

Quantitatively, this is explained by the following example: consider a regression model with, say, three unknown parameters, β0, β1 and β2. Suppose an experimenter performs 10 repeated measurements, all at exactly the same value of the independent variable X. In this case, regression analysis fails to give a unique set of values for the three unknown parameters: the experimenter has not provided enough information. The best one can do is to calculate the average value of the dependent variable Y and its standard deviation.
If the experimenter had performed five measurements at X1, four at X2 and one at X3, where X1, X2 and X3 are different values of the independent variable X, then regression analysis would provide a unique solution for the unknown parameters β.
In the case of general linear regression (see below), the above statement is equivalent to the requirement that the matrix XᵀX is regular (that is, it has an inverse matrix).
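The sketch below (Python with NumPy; the values are hypothetical) illustrates this: with all ten measurements at the same X, the matrix XᵀX for a three-parameter model has rank 1 and is singular, while measurements spread over three distinct values of X make it regular:

import numpy as np

# Design matrix for a model with three parameters: beta0 + beta1*x + beta2*x**2.
def design(x):
    return np.column_stack([np.ones_like(x), x, x ** 2])

# Ten repeated measurements at exactly the same value of X.
x_same = np.full(10, 2.0)
X = design(x_same)
print(np.linalg.matrix_rank(X.T @ X))   # rank 1 < 3: X^T X is singular

# Five measurements at X1, four at X2 and one at X3 (three distinct values).
x_spread = np.array([1.0] * 5 + [2.0] * 4 + [3.0])
X = design(x_spread)
print(np.linalg.matrix_rank(X.T @ X))   # rank 3: X^T X is regular (invertible)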

Regression equation

It is convenient to assume an environment in which an experiment is performed: the dependent variable is then the outcome of a measurement.

The regression equation deals with the following variables:
• The unknown parameters denoted as β. This may be a scalar or a vector of length k.
• The independent variables, X.
• The dependent variable, Y.

The regression equation is a function of the variables X and β: Y ≈ f(X, β).

The user of regression analysis must make an intelligent guess about this function. Sometimes the form of this function is known; sometimes it must be found by trial and error.
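As a sketch of this trial-and-error step (Python with SciPy; the exponential form and the data are assumptions made purely for illustration), one can guess a form for f(X, β) and fit it by least squares:

import numpy as np
from scipy.optimize import curve_fit

# The user's guess for the form of the regression function f(X, beta).
def f(x, beta0, beta1):
    return beta0 * np.exp(beta1 * x)

rng = np.random.default_rng(4)
x = np.linspace(0, 2, 30)
y = 1.5 * np.exp(0.8 * x) + rng.normal(0, 0.1, size=x.size)   # hypothetical measurements

beta_hat, cov = curve_fit(f, x, y, p0=[1.0, 1.0])   # least-squares estimates of beta
print(beta_hat)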
Assume now that the vector of unknown parameters β is of length k. In order to perform a regression analysis, the user must provide information about the dependent variable Y:

• If the user performs the measurement N times, where N < k, regression analysis cannot be performed: not enough information is provided to do so.

• If the user performs N independent measurements, where N = k, then the problem reduces to solving a set of N equations with N unknowns β.

• If, on the other hand, the user provides results of N independent measurements, where N > k, regression analysis can be performed. Such a system is also called an overdetermined system.

In the last case the regression analysis provides the tools for:

1. Finding a solution for the unknown parameters β that will, for example, minimize the distance between the measured and predicted values of the dependent variable Y (the method of least squares).

2. Under certain statistical assumptions, using the surplus of information to provide statistical information about the unknown parameters β and the predicted values of the dependent variable Y.
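A minimal sketch of the overdetermined case (Python with NumPy, synthetic data): N = 10 measurements, k = 3 unknown parameters, solved by least squares.

import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(0, 5, 10)                        # N = 10 measurements
X = np.column_stack([np.ones_like(x), x, x**2])  # k = 3 parameters
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(0, 0.3, size=x.size)

# Least squares: the beta that minimizes ||y - X beta||^2.
beta_hat, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)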

Regression diagnostics

Once a regression model has been constructed, it may be important to confirm the goodness of fit of the model and the statistical significance of the estimated parameters. Commonly used checks of goodness of fit include the R-squared, analyses of the pattern of residuals and hypothesis testing. Statistical significance can be checked by an F-test of the overall fit, followed by t-tests of individual parameters.

Interpretations of these diagnostic tests rest heavily on the model assumptions. Although examination of the residuals can be used to invalidate a model, the results of a t-test or F-test are sometimes more difficult to interpret if the model's assumptions are violated. For example, if the error term does not have a normal distribution, in small samples the estimated parameters will not follow normal distributions, which complicates inference. With relatively large samples, however, a central limit theorem can be invoked such that hypothesis testing may proceed using asymptotic approximations.
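For instance, a sketch of these diagnostics (Python, assuming the statsmodels package; the data are synthetic): an ordinary least squares fit reports R-squared, an F-test of the overall fit, and t-tests of the individual parameters.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
x = rng.normal(0, 1, size=100)
y = 1.0 + 2.0 * x + rng.normal(0, 1, size=100)   # hypothetical data

X = sm.add_constant(x)                 # add the intercept column
model = sm.OLS(y, X).fit()
print(model.rsquared)                  # goodness of fit (R-squared)
print(model.fvalue, model.f_pvalue)    # F-test of the overall fit
print(model.tvalues, model.pvalues)    # t-tests of individual parameters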

Regression analysis


In statistics, regression analysis is a collective name for techniques for the modeling and analysis of numerical data consisting of values of a dependent variable (also called response variable or measurement) and of one or more independent variables (also known as explanatory variables or predictors). The dependent variable in the regression equation is modeled as a function of the independent variables, corresponding parameters ("constants"), and an error term. The error term is treated as a random variable. It represents unexplained variation in the dependent variable. The parameters are estimated so as to give a "best fit" of the data. Most commonly the best fit is evaluated by using the least squares method, but other criteria have also been used.

Regression can be used for prediction (including forecasting of time-series data), inference, hypothesis testing, and modeling of causal relationships. These uses of regression rely heavily on the underlying assumptions being satisfied. Regression analysis has been criticized as being misused for these purposes in many cases where the appropriate assumptions cannot be verified to hold.[1][2] One factor contributing to the misuse of regression is that it can take considerably more skill to critique a model than to fit a model.