REFERENCE FOR UNDERGRADUATE STATISTICS STUDENTS
Logic underlying Multiple Regression Analysis
The logic of multiple regression analysis begins with a baseline: without any
knowledge of the independent variables, an analyst can predict the dependent
variable (for example, the number of years a person would be sentenced to prison
for committing robbery) by simply using the mean of all the Y values. This,
unfortunately, results in a substantial amount of prediction error, referred to
as 'total variation.' Total variation is equal to the sum of the squared
differences between each observation and the grand mean of all the observations,
which is represented in formula form as: $\sum (Y_i - \bar{Y})^2$. Therefore,
the analyst will often introduce a second variable to decrease prediction error.
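As a minimal sketch of this baseline, the Python below computes total variation for a small set of sentence lengths. The function name and the data values are hypothetical and serve only to illustrate the formula.

```python
def total_variation(y):
    """Sum of squared differences between each Y value and the grand mean."""
    mean_y = sum(y) / len(y)
    return sum((yi - mean_y) ** 2 for yi in y)


# Hypothetical prison sentences in years (illustrative values only).
sentences = [2, 4, 5, 7, 12]
print(total_variation(sentences))
```

The larger this number, the more error the analyst accepts by predicting every case with the mean alone.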
A specific objective of multiple regression is to plot a line with the greatest
precision possible in order to minimize the amount of error variation in the
conditional scores of Y that occur around the line. To estimate this "best fit
line," the analyst must calculate the slope (b), also referred to as the
regression coefficient, and the intercept (a), also referred to as the
regression constant. In equation form, a straight line is represented as
$Y = a + bX$. To calculate the slope (b), the analyst uses the formula:
$b = \frac{N \sum XY - (\sum X)(\sum Y)}{N \sum X^2 - (\sum X)^2}$. To calculate
the intercept (a), the analyst uses the formula: $a = \bar{Y} - b\bar{X}$, where
$\bar{Y}$ and $\bar{X}$ are the means of Y and X. The next logical progression
under multiple regression is to
determine the proportion of variance explained (R²) in the model based on knowledge
of a second variable. The analyst accomplishes this by obtaining the Pearson
product moment correlation (r) between the variables. Once obtained, the coefficient
is squared. For example, if the obtained Pearson correlation were r = 0.84, the
corresponding square, (0.84)² = 0.7056, would result in a coefficient of
determination of approximately .71. This would mean that 71% of the variance in
the dependent variable is explained by the model. The analyst next must determine
if the obtained R² is statistically significant. To do so, the analyst must
partition the total sum of squares into its additive parts. First, the analyst
calculates the total sum of squares (the total variation) by taking the sum of
squared differences between each observed Y score and the mean of all Y scores.
Next, the analyst calculates the regression sum of squares, which is the sum of
squared differences between each predicted Y value and the mean of all Y scores.
Lastly, the analyst calculates the residual sum of squares, which is the sum of
squared differences between each observed Y score and the predicted value for Y.
The total sum of squares is thus the sum of the regression and residual sums of
squares, which is represented in equation form as:
$SS_{total} = SS_{regression} + SS_{residual}$.
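The slope and intercept formulas and the partition of the sums of squares can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a definitive implementation: the function names and the five data points are hypothetical, and only one independent variable is used.

```python
def fit_line(x, y):
    """Return intercept a and slope b from the raw-score formulas."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * (sum_x / n)  # a = mean of Y minus b times mean of X
    return a, b


def sums_of_squares(x, y):
    """Return (SS total, SS regression, SS residual) for the fitted line."""
    a, b = fit_line(x, y)
    mean_y = sum(y) / len(y)
    predicted = [a + b * xi for xi in x]
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_regression = sum((pi - mean_y) ** 2 for pi in predicted)
    ss_residual = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))
    return ss_total, ss_regression, ss_residual


# Hypothetical data (illustrative values only).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b = fit_line(x, y)
ss_total, ss_regression, ss_residual = sums_of_squares(x, y)
r_squared = ss_regression / ss_total  # proportion of variance explained

print(round(a, 3), round(b, 3))
print(round(ss_total, 3), round(ss_regression, 3), round(ss_residual, 3))
print(round(r_squared, 3))
```

For these values the regression and residual sums of squares add up to the total, and their ratio gives the same R² that squaring the Pearson correlation would.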
In order to complete the logical process of multiple regression analysis, the
analyst must calculate an F ratio. In formula form this appears as:
$F = \frac{SS_{regression} / df_{regression}}{SS_{residual} / df_{residual}}$
where df represents degrees of freedom. The regression degrees of freedom equal
the number of estimated coefficients (including the constant) minus 1, and the
residual degrees of freedom equal N minus the number of estimated coefficients.
The obtained F ratio is compared to the appropriate critical value from an F
table. If the obtained F exceeds the critical F table value, then
the analyst rejects the null hypothesis of no improvement (due to addition of
a second variable) and concludes that the proportion of variance explained by
the addition of a second variable is statistically significant as compared
to prediction from the mean alone.
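A short sketch of the F ratio calculation follows. It assumes the conventional degrees of freedom, namely k for the regression (the number of independent variables) and N - k - 1 for the residual; the function name and the input values are hypothetical. The critical value is still looked up in an F table, which the code does not attempt to reproduce.

```python
def f_ratio(ss_regression, ss_residual, n, k):
    """F = (SS regression / df regression) / (SS residual / df residual).

    n is the number of observations and k the number of independent
    variables, so the model estimates k + 1 coefficients (constant included).
    """
    df_regression = k
    df_residual = n - k - 1
    mean_square_regression = ss_regression / df_regression
    mean_square_residual = ss_residual / df_residual
    return mean_square_regression / mean_square_residual


# Hypothetical sums of squares from 5 observations and 1 independent variable.
print(f_ratio(ss_regression=3.6, ss_residual=2.4, n=5, k=1))
```

The analyst would compare the result against the tabled F value for (k, N - k - 1) degrees of freedom at the chosen significance level.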
What is the difference between an unstandardized (b) and standardized
(β) regression coefficient? What is the utility of each? What is the general
form for both an unstandardized as well as a standardized regression equation?
An unstandardized regression coefficient (b) is expressed in units of the variable
with which it is associated. Therefore, direct comparisons between the coefficients
of different variables are not appropriate. The analyst would be "comparing apples
to oranges." A standardized coefficient (β), on the other hand, allows for a direct
comparison between coefficients as to their power in explaining the dependent
variable, because all of the variables with which standardized coefficients are
associated are expressed in the same units (standard deviations). This allows the
analyst to compare "apples to apples."
The utility of the unstandardized regression coefficient is that it represents
the slope of the regression line, which is the change in Y for each unit change
in X, holding the other independent variables constant. The b regression
coefficient is an integral part of interpreting the regression variate. The
utility of the standardized regression coefficient is that it allows the analyst
to compare the relative effect of each particular independent variable on the
dependent variable.
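The general form of an unstandardized equation is
$Y = a + b_1X_1 + b_2X_2 + \dots + b_nX_n$. For example:
SENTENCE = 2.434 + 0.542 (dr_score) - 0.0358 (tm_disp) + 0.04864 (jail_tm)
The general form of a standardized equation, which has no constant because every
variable is expressed as a z-score, is
$Z_Y = \beta_1 Z_{X_1} + \beta_2 Z_{X_2} + \dots + \beta_n Z_{X_n}$. For example:
Zsentence = 0.291 (Zdr_score) - 0.177 (Ztm_disp) + 0.444 (Zjail_tm)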
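The link between the two kinds of coefficient can be sketched as follows. The sketch relies on the standard relationship β = b × (standard deviation of X / standard deviation of Y), which is not stated in the text above, and it uses a single hypothetical independent variable with made-up data. It then checks that regressing z-scores on z-scores gives the same value.

```python
import statistics


def slope(x, y):
    """Unstandardized slope b from deviations about the means."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


def z_scores(values):
    """Convert raw values to z-scores using the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


# Hypothetical data (illustrative values only).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b = slope(x, y)                                              # in units of Y per unit of X
beta_from_b = b * statistics.stdev(x) / statistics.stdev(y)  # rescaled coefficient
beta_from_z = slope(z_scores(x), z_scores(y))                # slope on z-scores

print(round(b, 4), round(beta_from_b, 4), round(beta_from_z, 4))
```

Both routes to the standardized coefficient agree, which is why standardized coefficients can be compared across variables measured in different units.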
What is the role of the regression constant (a)? Example provided.
The role of the regression constant (a) is to aid the researcher in the
estimation of the best-fit line. The intercept (regression constant) is used
along with the slope (b), the regression coefficient, to plot the line with the
greatest precision so that it is in the position that best minimizes the amount
of distance (error variation) in the conditional Y scores around the line. The
constant is the predicted value of Y when X equals zero. In a regression
equation, the regression constant has a specific place: $Y = a + bX$. The
calculation of the regression constant (intercept) appears as:
$a = \bar{Y} - b\bar{X}$. Where N = 10, $\sum Y$ = 75, $\sum X$ = 36 and
b = 2.432, the calculation would be represented as:
a = (75/10) - 2.432 (36/10) = -1.26 (rounded).
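The arithmetic in this example can be checked step by step. The sketch below assumes that 75 and 36 are the sums of the Y and X scores (which the division by N = 10 implies) and that 2.432 is the slope b; the variable names are hypothetical.

```python
n = 10        # number of observations
sum_y = 75    # sum of the Y scores
sum_x = 36    # sum of the X scores
b = 2.432     # slope (regression coefficient)

mean_y = sum_y / n       # 7.5
mean_x = sum_x / n       # 3.6
a = mean_y - b * mean_x  # intercept: mean of Y minus b times mean of X

print(round(a, 2))
```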
Compare and contrast the various methods available in SPSS for determining
the order in which variables are entered into a regression model. Which method
is most commonly used? Under this protocol, what criteria are used to determine
the order in which variables are entered into the model? Explain why these criteria
for variable selection are used.
There are various methods available in SPSS for determining
the order in which variables are entered into a regression model. These methods
include the confirmatory method, the sequential search method (which is
sub-divided into forward, backward and stepwise procedures) and the
combinatorial method.

The confirmatory method proceeds with the analyst specifying the variables to be
included in the model as well as the order in which each is entered into the
model. As compared to the confirmatory method, which is analyst driven and
subject to researcher bias, the sequential search method of variable selection
proceeds iteratively: the computer, not the researcher, is responsible for
variable selection and entry, and it selects for inclusion in the model the
variables that have the greatest explanatory influence. In contrast to the
confirmatory method, which is a "solo act," the sequential search method contains
three separate procedures that can be employed to aid the researcher in variable
selection: forward addition, backward elimination, and stepwise.

In the forward addition procedure, variables are entered into the model one at a
time according to their contribution to the prediction of the model. Once a
variable has been entered, it is retained even if it is no longer significant
after other variables are added. In the backward elimination procedure, all
variables are first entered into the model, and the variables that are found not
to make a significant contribution are then eliminated. In the stepwise
procedure, variables are entered into and removed from the model based upon the
statistical significance of their contribution. In this procedure it is possible
that an independent variable could be selected for inclusion at the first
iteration of the model but be removed at a subsequent iteration because its
contribution is no longer significant enough to be retained under the entry and
removal criteria.

In the combinatorial method of variable selection, all possible combinations of
the independent variables are considered.
The overall model strength is identified at each iteration of the model
based upon calculations that reveal which combinations of the independent variables
provide the highest level of prediction of the dependent variable. The
combinatorial method, like the confirmatory method, is a "solo act." However,
in contrast to the confirmatory method, the combinatorial method is not analyst
driven. The combinatorial method also differs from the sequential search and
confirmatory methods in that those methods consider variables individually,
while the combinatorial method considers variables simultaneously for their
effect on the regression model.

The most commonly used method is the sequential search method in general, and the
stepwise procedure specifically. Under the stepwise protocol the independent
variable that is the best predictor of the dependent variable is selected for
inclusion first. Other independent variables are then selected for addition based
upon the incremental explanatory power they can add to the regression model.
Independent variables whose partial correlation coefficients are not
statistically significant are not added to the model. As well, if an independent
variable's contribution no longer meets the established significance level after
other variables are added, it is dropped from the model.
These criteria for variable selection are used in order
to produce a prediction equation that has the best possibility of explaining
the dependent variable. If variables that are not significant are included
and retained in subsequent iterations of the regression model, the prediction
equation becomes less parsimonious and its coefficients less stable, which tends
to weaken its predictive power in new samples. Using the stepwise
protocol under the sequential search method provides the
analyst with a powerful tool for determining the prediction equation from the
overall regression model.
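The entry logic of a sequential search can be sketched in plain Python. This is a simplified illustration of the forward addition step only, not SPSS's algorithm: it enters the variable that reduces residual error the most, and stops when the partial F for the best candidate falls below a threshold. The threshold `f_to_enter=4.0`, the function names, the variable names and the data are all hypothetical, and the removal check that distinguishes the full stepwise procedure is omitted.

```python
def residual_ss(columns, y):
    """Residual sum of squares from regressing y on the columns plus a constant."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(rows[0])
    # Build the normal equations (X'X | X'y) as an augmented matrix.
    aug = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
           + [sum(r[i] * yi for r, yi in zip(rows, y))]
           for i in range(p)]
    # Solve by Gauss-Jordan elimination with partial pivoting
    # (assumes the predictors are not perfectly collinear).
    for c in range(p):
        pivot = max(range(c, p), key=lambda k: abs(aug[k][c]))
        aug[c], aug[pivot] = aug[pivot], aug[c]
        for k in range(p):
            if k != c:
                factor = aug[k][c] / aug[c][c]
                aug[k] = [v - factor * w for v, w in zip(aug[k], aug[c])]
    coef = [aug[i][p] / aug[i][i] for i in range(p)]
    fitted = [sum(c * v for c, v in zip(coef, r)) for r in rows]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))


def forward_addition(predictors, y, f_to_enter=4.0):
    """Return predictor names in the order they enter the model."""
    selected = []
    remaining = list(predictors)
    current_ss = residual_ss([], y)  # with no predictors this is the total variation
    while remaining:
        df_residual = len(y) - (len(selected) + 1) - 1
        if df_residual <= 0:
            break
        # Residual error left over if each candidate were added next.
        trial = {name: residual_ss([predictors[s] for s in selected + [name]], y)
                 for name in remaining}
        best = min(trial, key=trial.get)
        gain = current_ss - trial[best]
        if trial[best] == 0:
            partial_f = float("inf")
        else:
            partial_f = gain / (trial[best] / df_residual)
        if partial_f < f_to_enter:
            break  # the best remaining candidate does not add enough
        selected.append(best)
        remaining.remove(best)
        current_ss = trial[best]
    return selected


# Hypothetical data (illustrative values only).
predictors = {
    "prior_record": [1, 2, 3, 4, 5],
    "unrelated": [3, 1, 2, 1, 3],
}
sentence = [2, 4, 5, 4, 5]
print(forward_addition(predictors, sentence))
```

A fixed F threshold stands in here for the significance levels that a statistical package would apply; in practice the entry and removal criteria are set as probabilities.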
How an analyst "assesses model fit" in multiple regression analysis.
First, the analyst must select the method by which the regression model will be
estimated (discussed above). Second, the analyst must assess the statistical
significance of the overall model as concerns the model's ability to predict
the dependent variable. Third, the analyst must determine if any of the
observations have an undue influence (outlier, influential observation) on
the results. The individual variables must meet the assumptions of linearity,
constant variance (homoscedasticity), independence and normality. These assumptions
must be met by the regression variate as well.

Hypotheses concerning the model can be tested to ensure that the model is
representative of the population. Two specific tests are a test of the variation
explained (R², the coefficient of determination) and a test of each regression
coefficient. Multiple R represents the degree of correlation that exists between
the dependent variable and the linear combination of all independent variables
in the model. The coefficient of determination (R²) represents the proportion of
variance in the dependent variable explained by the linear combination of the
independent variables in the model. Adjusted R² is an estimate of the model's
explanatory power, adjusted for the number of independent variables and the
sample size, if it were to be applied to another sample from the same population.
The analyst also must pay attention to the standard error of the estimate (SEE),
which is the standard deviation of the observed dependent values around the
regression line. To test the hypothesis that the amount of variation in the
dependent variable explained by the regression model is more than that explained
by the mean alone (that R² is greater than zero), the analyst uses an F ratio.
The F test statistic in equation form appears as:
$F = \frac{SS_{regression} / df_{regression}}{SS_{residual} / df_{residual}}$
Degrees of freedom regression is equal to the number of estimated coefficients
(which includes the constant) - 1. Degrees of freedom residual is equal to N
- the number of estimated coefficients (constant included).
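The fit statistics named in this section can be sketched from the sums of squares. The formulas for Adjusted R² and the standard error of the estimate are the standard textbook ones and are not given in the text above, so treat them as assumptions; the function name and input values are hypothetical.

```python
import math


def fit_statistics(ss_total, ss_residual, n, k):
    """Return Multiple R, R squared, Adjusted R squared and SEE.

    n is the number of observations and k the number of independent variables.
    """
    df_residual = n - k - 1
    r_squared = 1 - ss_residual / ss_total
    multiple_r = math.sqrt(r_squared)
    # Penalize R squared for the number of predictors relative to sample size.
    adjusted_r_squared = 1 - (1 - r_squared) * (n - 1) / df_residual
    # Standard deviation of the observed values around the regression line.
    see = math.sqrt(ss_residual / df_residual)
    return multiple_r, r_squared, adjusted_r_squared, see


# Hypothetical sums of squares from 5 observations and 1 independent variable.
stats = fit_statistics(ss_total=6.0, ss_residual=2.4, n=5, k=1)
print([round(value, 4) for value in stats])
```

With so few observations the adjusted value drops well below R², which illustrates why the adjustment matters for small samples.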
Influential observations (such as outliers) that strongly influence the results
of the regression analysis must also be identified. An outlier has a large
residual value and can only be identified with respect to a specific regression
model. Leverage points, another form of influential observation, are
observations that are distinct from the rest based on the values of their
independent variables. Influential observations is a broad concept that covers
all observations that have a disproportionate effect on the results of the
regression analysis. It must be noted that not all outliers and leverage points
are influential observations. The analyst typically deletes extreme influential
observations but is careful not to delete observations that are representative
of the population.

An important consideration that should not be overlooked in assessing model fit
is the analyst's knowledge of the specific situation being studied. This allows
the analyst to correctly identify which variables to include and what the
appropriate signs and weights of the coefficients are. By employing these
methods the analyst can check that the model fits the data, which supports the
reliability and validity of the results.
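Residuals and leverage values, the two ingredients used to spot outliers and leverage points, can be sketched for the one-variable case. The leverage formula used here, 1/N plus the squared deviation of X from its mean divided by the sum of squared deviations, is the standard one for simple regression and is not given in the text above; the function name and data are hypothetical.

```python
def residuals_and_leverage(x, y):
    """Return (residuals, leverage values) for a one-variable regression."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    a = mean_y - b * mean_x
    # Large residuals point to outliers with respect to this model.
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    # High leverage marks cases with unusual values on the independent variable.
    leverage = [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]
    return residuals, leverage


# Hypothetical data: the last case has an unusually large X value.
x = [1, 2, 3, 4, 10]
y = [2, 3, 5, 4, 6]

residuals, leverage = residuals_and_leverage(x, y)
print([round(value, 2) for value in leverage])
```

The last case has by far the highest leverage, yet whether it is influential still depends on how much the fitted line changes when it is removed.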
How multiple regression analysis differs from other multivariate techniques
such as discriminant analysis and logistic regression.
Multiple regression is a multivariate statistical analysis
technique that is used by researchers to analyze the relationship between one
dependent variable (Y) and several independent variables (X). Discriminant analysis
is a multivariate statistical technique that is used by researchers to estimate
the relationship between a single dependent variable that is nonmetric (data
that are categorical) and a set of independent variables that are metric (data
that are interval and ratio). Logistic regression is a multivariate statistical
analysis technique that is used by researchers to estimate the relationship
between a binary (dichotomous) dependent variable that is nonmetric and a set of
independent variables.
Along with the differences in their general application,
discriminant analysis and logistic regression (LOGIT) differ from multiple
regression specifically in that multiple regression requires that the data
be metric or be appropriately transformed. Logistic regression allows for the
independent variables to contain nonmetric data through the use of dummy coding,
which converts nominal or ordinal categories into 0/1 indicator variables.
Logistic regression also allows the analyst to employ a binary dependent
variable, whereas multiple regression does not. Discriminant analysis differs
from multiple regression in that the dependent variable is nonmetric and the
independent variables are metric. Furthermore, while multiple regression entails
one metric dependent variable, multiple discriminant analysis can handle a
nonmetric dependent variable with more than two categories.
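Dummy coding, mentioned above as the way nonmetric independent variables are prepared, can be sketched as follows. The function name, the category labels and the choice of reference category are hypothetical; the sketch creates one 0/1 indicator for every category except the reference.

```python
def dummy_code(values, reference):
    """Return a dict of 0/1 indicator columns, one per non-reference category."""
    categories = sorted(set(values) - {reference})
    return {
        category: [1 if value == category else 0 for value in values]
        for category in categories
    }


# Hypothetical ordinal ratings for four cases; "low" is the reference category.
ratings = ["low", "high", "medium", "low"]
print(dummy_code(ratings, reference="low"))
```

Cases in the reference category score 0 on every indicator, so each coefficient is read as a difference from that category.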
GLOSSARY
a.) beta coefficient- A regression coefficient that has been standardized.
Regression coefficients are expressed in the units of measurement of the
particular variable with which they are associated. This makes direct comparison
difficult, so the regression coefficients are standardized so that all of the
variables can be compared directly on a common scale.
b.) coefficient of determination- R². The amount of variance of the dependent
variable explained by the predictor variables in a model, represented as a
proportion.
c.) dependence technique- Statistical procedure where one or more variables are
selected as dependent variables, to be explained by other, independent variables.
d.) dependent variable- Represented by the symbol "Y." Variable that responds
to a change in the independent variable or variables.
e.) homoscedasticity- The variance of the error terms of the dependent variable
is constant over the full range of values of the independent variables.
f.) imputation- A way of correcting for missing data. By employing an imputation
method, a researcher can estimate the value of the missing data/case by using
values of other valid variables in the model. An example would be 'Cold deck
imputation.'
g.) independent variable- Predictor variable. Represented by the symbol "X."
Variable suspected of causing a change in the dependent variable.
h.) interdependence technique- Statistical procedure where the variables are
not classified as dependent or independent. The researcher examines all the
variables simultaneously.
i.) kurtosis- The peakedness or flatness of a distribution. When a distribution
of data is placed on a graph and compared to a normal distribution, a
distribution with negative kurtosis will appear flat while a distribution with
positive kurtosis will appear "peaked." Calculated from the moments about the
mean of a distribution.
j.) linearity- Where a statistical model predicts values that appear in a straight
line. This is the result of the dependent variable displaying a constant unit
change for every unit change of the independent variable.
k.) metric data- Data that is interval or ratio in nature. Non-metric data
is nominal or ordinal in nature. Metric data describes an object by use of
a descriptor as well as by what degree the descriptor defines the object. For
example, a human being can be described by height and weight. Also called
'quantitative' data.
l.) missing at random- Missing data whose missingness is related to other
observed variables in the data set but not to the missing values themselves.
Often referred to as MAR when discussing issues of missing data.
m.) missing completely at random- Missing data whose missingness is unrelated to
any variable, observed or missing, so that cases with missing data cannot be
distinguished from cases with complete data. Often referred to as MCAR when
discussing issues of missing data.
n.) multicollinearity- The extent to which one variable can be explained by
the other variables in the analysis. If multicollinearity is too high, the
variate becomes difficult to interpret, because the researcher is not able to
distinguish the specific effect of one variable from that of the variables with
which it is highly correlated.
o.) non-metric data- Data that is nominal or ordinal in nature. Characteristic
properties that describe a subject, such as male or female, and qualities such
as high or low approval ratings of politicians. Also called 'qualitative' data.
p.) normal distribution- Probability distribution where the horizontal (X) axis
represents all of the possible values of the variable under consideration and
the vertical (Y) axis represents the probability of those values occurring. A
graphic depiction appears as a "bell curve."
q.) normality- The degree to which the distribution of the sample data
corresponds to a normal distribution.
r.) outlier- An observation that is substantially different from the other
observations, lying far from the rest of the sample distribution.
s.) practical versus statistical significance- Practical significance relates
to whether the results of analysis are large enough to have demonstrable
effects. Statistical significance relates to the ability to infer from the
sample to the general population.
t.) skew- A measure of the asymmetry of a distribution. A distribution that has
a few large values is positively skewed and "tails" off to the right. A
distribution that has a few small values is negatively skewed and tails off to
the left. Skewness values that lie outside a range of -1 to 1 indicate a
substantially skewed distribution.
u.) specification error- Relates to whether or not the researcher excludes pertinent
variables from the set of independent variables as well as includes variables
that are not necessary for inclusion in the model.
v.) stepwise protocol- Multivariate data analysis where the contribution of
each predictor variable is considered before the regression equation is developed.
The benefit of this procedure is that at each iteration of the model, variables
can be added or deleted.
w.) type I error- The error of incorrectly rejecting the null hypothesis. The
probability of making this error is referred to as "alpha."
x.) type II error- The error of incorrectly failing to reject the null
hypothesis. The probability of making this error is referred to as "beta."
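The glossary entries for skew and kurtosis note that both are calculated from the moments about the mean. A minimal sketch follows, using population moments (dividing by N) and reporting kurtosis relative to the normal distribution's value of 3. Statistical packages such as SPSS apply small-sample corrections, so their output will differ somewhat; the function name and data are hypothetical.

```python
def skewness_and_kurtosis(values):
    """Return (skewness, excess kurtosis) from moments about the mean."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second moment (variance)
    m3 = sum((v - mean) ** 3 for v in values) / n  # third moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth moment
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3  # 0 for a normal distribution
    return skewness, excess_kurtosis


# Hypothetical data: a symmetric set and a set with one large value.
print(skewness_and_kurtosis([1, 2, 3, 4, 5]))
print(skewness_and_kurtosis([1, 1, 1, 2, 10]))
```

The symmetric set has zero skewness and negative (flat) kurtosis, while the single large value in the second set produces positive skew.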