Criminal Justice


   REFERENCE FOR UNDERGRADUATE STATISTICS STUDENTS

Logic underlying Multiple Regression Analysis

The logic of multiple regression analysis begins with the observation that, without any knowledge of the independent variables, an analyst's best prediction of the dependent variable (for example, the number of years a person would be sentenced to prison for committing robbery) is simply the mean of all the Y values. This, unfortunately, results in a substantial amount of prediction error, referred to as 'total variation.' Total variation is equal to the sum of the squared differences between each observation and the grand mean for all of the observations, which is represented in formula form as: Σ (Yi - Ȳ)². Therefore, the analyst will often introduce a second variable (an independent variable, X) to decrease prediction error. 
      A specific objective of multiple regression is to plot a line with the greatest precision possible in order to minimize the amount of error variation in the conditional scores of Y that occur around the line. To estimate this "best fit line," the analyst must calculate the slope (b), also referred to as the regression coefficient, and the intercept (a), also referred to as the regression constant. In equation form, a straight line is represented as Y = a + b(X). To calculate the slope (b) the analyst uses the formula: b = [N(ΣXY) - (ΣX)(ΣY)] / [N(ΣX²) - (ΣX)²]. To calculate the intercept (a), the analyst uses the formula: a = Ȳ - b(X̄), where Ȳ and X̄ are the means of Y and X.
      The next logical progression under multiple regression is to determine the proportion of variance explained (R²) in the model based on knowledge of a second variable. In the two-variable case the analyst accomplishes this by obtaining the Pearson product moment correlation (r) between the variables. Once obtained, the coefficient is squared. For example, if the obtained Pearson correlation was r = 0.84, the corresponding square (.84)² would result in a coefficient of determination of .71. This would mean that 71% of the variance in the dependent variable is explained by the model. The analyst next must determine if the obtained R² is statistically significant. Therefore, the analyst must partition the sums of squares into their additive parts. To accomplish this, the analyst calculates the total sum of squares (SStotal) by taking the sum of squared differences between each observed Y score and the mean of all Y scores. Next, the analyst must calculate the regression sum of squares (SSregression). This is accomplished by obtaining the sum of squared differences between each predicted Y value and the mean of all Y scores. Lastly, the analyst must calculate the residual sum of squares (SSresidual), which is the sum of squared differences between each observed Y score and the predicted value for Y. The total sum of squares is thus the sum of the regression and residual sums of squares, which is represented in equation form as: SStotal = SSregression + SSresidual.
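As a minimal worked sketch of the calculations described above, the following Python snippet applies the slope and intercept formulas to a small set of made-up data and then partitions the sums of squares. The five X and Y values are hypothetical (they are not taken from any real sentencing data), and the variable names are illustrative only.

```python
# Hypothetical data: X = number of prior convictions, Y = sentence in years.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

sum_x = sum(x)
sum_y = sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)

# Slope: b = [N(sum XY) - (sum X)(sum Y)] / [N(sum X^2) - (sum X)^2]
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)

# Intercept: a = mean(Y) - b * mean(X)
mean_y = sum_y / n
a = mean_y - b * (sum_x / n)

# Predicted Y for every observed X, using Y = a + b(X).
predicted = [a + b * xi for xi in x]

# Partition the sums of squares.
ss_total = sum((yi - mean_y) ** 2 for yi in y)
ss_regression = sum((pi - mean_y) ** 2 for pi in predicted)
ss_residual = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))

# Proportion of variance explained.
r_squared = ss_regression / ss_total

print(f"b = {b:.3f}, a = {a:.3f}")
print(f"SStotal = {ss_total:.3f}")
print(f"SSregression = {ss_regression:.3f}, SSresidual = {ss_residual:.3f}")
print(f"R squared = {r_squared:.3f}")
```

With these hypothetical values the sketch gives b = 0.6 and a = 2.2, and SSregression (3.6) plus SSresidual (2.4) equals SStotal (6.0), so R² = 0.6.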
In order to complete the logical process of multiple regression analysis, the analyst must calculate an F ratio. In formula form this appears as:

  F = (SSregression / dfregression) / (SSresidual / dfresidual)

where df represents degrees of freedom. The regression degrees of freedom equal the number of estimated coefficients (constant included) minus 1, and the residual degrees of freedom equal N minus the number of estimated coefficients; together they sum to the total degrees of freedom, N - 1. The obtained F is compared to the appropriate critical value from an F table. If the obtained F exceeds the critical F table value, then the analyst rejects the null hypothesis of no improvement (due to addition of a second variable) and concludes that the proportion of variance explained by the addition of a second variable was statistically significant as compared to predicting with the mean alone.
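The F ratio can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a definitive implementation: the function name f_ratio and the input values are hypothetical, the degrees of freedom follow the definitions given later in this reference (number of estimated coefficients minus 1 for regression, N minus the number of estimated coefficients for residual), and the critical value 10.13 is the tabled F at alpha = .05 for 1 and 3 degrees of freedom.

```python
def f_ratio(ss_regression, ss_residual, n, n_coefficients):
    """Return (F, df_regression, df_residual).

    n_coefficients counts every estimated coefficient, constant included.
    """
    df_regression = n_coefficients - 1
    df_residual = n - n_coefficients
    ms_regression = ss_regression / df_regression  # mean square regression
    ms_residual = ss_residual / df_residual        # mean square residual
    return ms_regression / ms_residual, df_regression, df_residual


# Hypothetical sums of squares from a model with one predictor and N = 5.
f_obtained, df1, df2 = f_ratio(ss_regression=3.6, ss_residual=2.4,
                               n=5, n_coefficients=2)

F_CRITICAL = 10.13  # tabled F, alpha = .05, df = (1, 3)
reject_null = f_obtained > F_CRITICAL

print(f"F({df1}, {df2}) = {f_obtained:.2f}, reject null: {reject_null}")
```

In this made-up case the obtained F of 4.5 does not exceed the critical value, so the analyst would not reject the null hypothesis.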

What is the difference between an unstandardized (b) and a standardized (β) regression coefficient? What is the utility of each? What is the general form of both an unstandardized and a standardized regression equation?

      An unstandardized regression coefficient (b) is expressed in the units of the variable with which it is associated. Therefore, comparing the size of b coefficients across variables measured in different units is simply not appropriate. The analyst would be "comparing apples to oranges." A standardized coefficient (β), on the other hand, allows for a direct comparison between coefficients as to their power in explaining the dependent variable, because every variable has been converted to the same unit (standard deviations). This allows the analyst to compare "apples to apples." 
     The utility of the unstandardized regression coefficient is that it represents the slope of the regression line, which is the change in Y for each unit change in X, holding the other independent variables constant. The b regression coefficient is an integral part of interpreting the regression variate. The utility of the standardized regression coefficient is that it allows the analyst to compare the relative effect of each particular independent variable on the dependent variable. 
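The general form of an unstandardized equation is Y = a + b1(X1) + b2(X2) + ... + bk(Xk). For example: 
SENTENCE = 2.434 + 0.542 (dr_score) - 0.0358 (tm_disp) + 0.04864 (jail_tm)
The general form of a standardized equation is ZY = β1(ZX1) + β2(ZX2) + ... + βk(ZXk). It has no constant because every standardized variable has a mean of zero. For example:
Zsentence = 0.291 (Zdr_score) - 0.177 (Ztm_disp) + 0.444 (Zjail_tm) 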
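The relationship between the two kinds of coefficient can be sketched in Python. A standardized coefficient is obtained by rescaling the unstandardized one by the ratio of standard deviations, β = b × (sX / sY). The small data set below is hypothetical and is used only to illustrate the conversion; the function predict_sentence is likewise an illustrative name that simply evaluates the unstandardized example equation given above for one made-up case.

```python
from statistics import stdev

# Hypothetical data, used only to illustrate the conversion.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b = 0.6  # unstandardized slope of Y on X for these data

# Standardized coefficient: beta = b * (s_X / s_Y)
beta = b * (stdev(x) / stdev(y))
print(f"beta = {beta:.3f}")


def predict_sentence(dr_score, tm_disp, jail_tm):
    """Evaluate the unstandardized example equation from the text."""
    return 2.434 + 0.542 * dr_score - 0.0358 * tm_disp + 0.04864 * jail_tm


# One hypothetical case: the inputs are made-up values in their original units.
print(f"predicted sentence = {predict_sentence(5, 10, 20):.3f}")
```

Because the unstandardized equation works in the original units, it is the one to use for prediction; the standardized coefficients are the ones to compare with each other.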

What is the role of the regression constant (a)? Example provided.
     The role of the regression constant (a) is to aid the researcher in the estimation of the best-fit line. The constant is the intercept: the predicted value of Y when X equals zero. It is used along with the slope (b), the regression coefficient, to position the line so that it minimizes the amount of distance (error variation) in the conditional Y scores around the line. In a regression equation, the regression constant has a specific place: Y = a + b(X). The calculation of the regression constant (intercept) appears as: a = Ȳ - b(X̄). Where N = 10, ΣY = 75, ΣX = 36 and b = 2.432, the calculation would be represented as: a = (75/10) - 2.432 (36/10) = 7.5 - 8.7552 = -1.2552, or approximately -1.26. 
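The arithmetic in the example above can be checked with a short Python sketch. The variable names are illustrative; the values 75 and 36 are treated as the sums of the Y and X scores, which is what dividing each by N = 10 implies.

```python
n = 10        # number of observations
sum_y = 75    # sum of the Y scores
sum_x = 36    # sum of the X scores
b = 2.432     # slope (regression coefficient)

# Regression constant: a = mean(Y) - b * mean(X)
a = sum_y / n - b * (sum_x / n)

print(round(a, 2))  # -1.26
```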

Compare and contrast the various methods available in SPSS for determining the order in which variables are entered into a regression model. Which method is most commonly used? Under this protocol, what criteria are used to determine the order in which variables are entered into the model? Explain why these criteria for variable selection are used.

     There are various methods available in SPSS for determining the order in which variables are entered into a regression model. These methods include the confirmatory method, the sequential search method (which is sub-divided into forward, backward and stepwise procedures) and the combinatorial method. 
     The confirmatory method proceeds with the analyst specifying the variables to be included in the model as well as the order in which each is entered into the model. As compared to the confirmatory method, which is analyst driven and subject to researcher bias, the sequential search method of variable selection proceeds iteratively: the computer, not the researcher, is responsible for variable selection and entry, and it selects for inclusion the variables that have the greatest explanatory influence. In contrast to the confirmatory method, which is a "solo act," the sequential search method contains three separate procedures that can be employed to aid the researcher in variable selection. These are the forward addition method, the backward elimination method and the stepwise method. 
     In the forward addition procedure, variables are entered into the model one at a time and their contribution to the prediction of the model is considered. Once a variable has been entered it remains in the model, even if its contribution is no longer significant after other variables have been added. In the backward elimination procedure, all variables are first inserted into the model, and the specific variables that are found not to be significant in their contribution are then eliminated. In the stepwise method, variables are entered into and removed from the model based upon their strength. In this procedure it is possible that an independent variable could be selected for inclusion at the first iteration of the model but be removed at a subsequent iteration because its significance is not strong enough to be retained based upon the parameters of the model. 
     In the combinatorial method of variable selection all possible combinations of the independent variables are considered. The overall model strength is identified at each iteration of the model based upon calculations that reveal which combinations of the independent variables provide the highest level of prediction of the dependent variable. The combinatorial method, like the confirmatory method, is a "solo act." However, in contrast to the confirmatory method, the combinatorial method is not analyst driven. The combinatorial method also differs from the sequential search and confirmatory methods in that those two methods consider variables individually, while the combinatorial method considers variables simultaneously for their effect on the regression model. 
     The most commonly used method is the sequential search method in general, and the stepwise method specifically. Under the stepwise protocol the independent variable that is the best predictor of the dependent variable is selected for inclusion first. Other independent variables are selected for addition based upon the incremental explanatory power they are able to add to the regression model. Independent variables whose partial correlation coefficients are not statistically significant are not added to the model. As well, if an independent variable's predictive power falls below an established significance level it is dropped from the model. 
     These criteria for variable selection are used in order to produce a prediction equation that has the best possibility of explaining the dependent variable. If variables that are not significant are included and retained in subsequent iterations of the regression model, the prediction equation gains little explanatory power while becoming less parsimonious. Using the stepwise protocol under the sequential search method provides the analyst with a powerful tool for determining the prediction equation from the overall regression model. 
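The idea of entering variables by their incremental explanatory power can be sketched in Python. This is a simplified illustration of forward addition only (it never removes a variable, so it is not the full stepwise procedure), and it is not a reproduction of how SPSS implements selection. The data, the function names and the F-to-enter threshold of 4.0 are all hypothetical assumptions made for the example.

```python
def solve(matrix, vector):
    """Solve a small linear system by Gauss-Jordan elimination."""
    n = len(vector)
    m = [row[:] + [vector[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ss_residual(columns, y):
    """Residual sum of squares for y regressed on a constant plus columns."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(design[0])
    xtx = [[sum(row[p] * row[q] for row in design) for q in range(k)]
           for p in range(k)]
    xty = [sum(row[p] * yi for row, yi in zip(design, y)) for p in range(k)]
    coefs = solve(xtx, xty)
    fitted = [sum(c * v for c, v in zip(coefs, row)) for row in design]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))


def forward_selection(predictors, y, f_to_enter=4.0):
    """Add, one at a time, the predictor with the largest partial F."""
    selected, remaining = [], list(predictors)
    current_ss = ss_residual([], y)  # constant-only model: total variation
    while remaining:
        best = None
        for name in remaining:
            cols = [predictors[s] for s in selected + [name]]
            ss = ss_residual(cols, y)
            df_res = len(y) - (len(selected) + 2)  # N - estimated coefficients
            partial_f = (current_ss - ss) / (ss / df_res)
            if best is None or partial_f > best[1]:
                best = (name, partial_f, ss)
        if best[1] < f_to_enter:
            break  # no remaining variable adds enough explanatory power
        selected.append(best[0])
        remaining.remove(best[0])
        current_ss = best[2]
    return selected


# Hypothetical data: y depends almost entirely on x1.
predictors = {
    "x1": [1, 2, 3, 4, 5, 6, 7, 8],
    "x2": [2, 1, 4, 3, 6, 5, 8, 7],
    "x3": [5, 3, 6, 2, 7, 1, 8, 4],
}
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

selected = forward_selection(predictors, y)
print(selected)
```

With these made-up values x1 is entered first because it is the best single predictor, and the other two variables add too little incremental explanatory power to clear the assumed threshold.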

How an analyst "assesses model fit" in multiple regression analysis.
     First, the analyst must select the method by which the regression model will be estimated (discussed in the preceding section). Second, the analyst must assess the statistical significance of the overall model as concerns the model's ability in predicting the dependent variable. Third, the analyst must determine whether any of the observations exert an undue influence (outlier, influential observation) on the results. The individual variables must meet the assumptions of linearity, constant variance (homoscedasticity), independence and normality. These assumptions must be met by the regression variate as well. Hypotheses concerning the model can be tested to ensure that the model is representative of the population. Two specific tests are a test of the variation explained (R², the coefficient of determination) and a test of the individual coefficients. Multiple R represents the degree of correlation that exists between the dependent variable and the linear combination of all independent variables in the model. The coefficient of determination (R²) represents the proportion of variance in the dependent variable explained by the linear combination of the independent variables in the model. Adjusted R² corrects R² for the number of independent variables and the sample size, giving a more conservative estimate of the model's explanatory power if it were to be applied to another sample with similar parameters. 
The analyst also must pay attention to the standard error of the estimate (SEE). This statistic is the standard deviation of the observed dependent values around the regression line. To test the hypothesis that the amount of variation in the dependent variable explained by the regression model is more than would be explained by the mean alone (that R² is greater than zero), the analyst uses an F ratio. The F test statistic in equation form appears as:

  F = (Sum of squares regression / Degrees of freedom regression) / (Sum of squares residual / Degrees of freedom residual)

Degrees of freedom regression is equal to the number of estimated coefficients (which includes the constant) - 1. Degrees of freedom residual is equal to N - the number of estimated coefficients (constant included).
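The fit statistics named above can be computed from the sums of squares, as in the following Python sketch. The function name fit_summary and the input values are hypothetical. The formulas used are the standard ones: Adjusted R² = 1 - (1 - R²)(N - 1) / (N - k - 1), and SEE = the square root of (sum of squares residual / degrees of freedom residual), where k is the number of independent variables.

```python
import math


def fit_summary(ss_total, ss_residual, n, n_predictors):
    """Return R squared, Multiple R, Adjusted R squared and the SEE."""
    df_residual = n - n_predictors - 1  # N minus coefficients (constant included)
    r_squared = 1 - ss_residual / ss_total
    return {
        "r_squared": r_squared,
        "multiple_r": math.sqrt(r_squared),
        "adjusted_r_squared": 1 - (1 - r_squared) * (n - 1) / df_residual,
        "see": math.sqrt(ss_residual / df_residual),
    }


# Hypothetical model: one predictor, N = 5, SStotal = 6.0, SSresidual = 2.4.
summary = fit_summary(ss_total=6.0, ss_residual=2.4, n=5, n_predictors=1)
for name, value in summary.items():
    print(f"{name} = {value:.3f}")
```

In this made-up case R² is 0.6 but Adjusted R² is only about 0.467, which shows how strongly the adjustment penalizes a model fitted to very few observations.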
     Influential observations (such as outliers) that strongly influence the results of the regression analysis must also be identified. An outlier has a large residual value and can only be identified with respect to a specific regression model. Leverage points, another form of influential observation, are observations that are distinctive in the values of their independent variables. Influential observations is the broad concept: it contains all observations that have a disproportionate effect on the results of the regression analysis. It must be noted that not all outliers and leverage points are influential observations. The analyst typically deletes extreme influential observations but is cautious about deleting observations that are representative of the population. An important consideration that should not be overlooked in assessing model fit is the analyst's knowledge of the specific situation being studied. This allows the analyst to correctly identify which variables to include and what the appropriate signs and weights of the coefficients are. By employing these methods the analyst can check that the model fits the data, which supports the reliability and validity of the results.
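Leverage can be illustrated with a brief Python sketch for the one-predictor case, where the leverage of observation i is 1/N + (Xi - X̄)² / Σ(Xj - X̄)². The data are hypothetical, and the cutoff of twice the number of estimated coefficients divided by N is a common rule of thumb assumed here for illustration; it is not taken from this reference.

```python
# Hypothetical X values; the last case is far from the others.
x = [1, 2, 3, 4, 5, 20]
n = len(x)

mean_x = sum(x) / n
ss_x = sum((xi - mean_x) ** 2 for xi in x)

# Leverage of each observation in a one-predictor regression.
leverage = [1 / n + (xi - mean_x) ** 2 / ss_x for xi in x]

# Assumed rule of thumb: flag leverage above 2 * (coefficients) / N.
n_coefficients = 2  # slope plus constant
cutoff = 2 * n_coefficients / n
high_leverage = [i for i, h in enumerate(leverage) if h > cutoff]

for i, h in enumerate(leverage):
    print(f"case {i}: X = {x[i]}, leverage = {h:.3f}")
print("flagged cases:", high_leverage)
```

Only the last case is flagged here. Whether a flagged case is actually influential still depends on its residual, which is why leverage points and outliers are examined together.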

How multiple regression analysis differs from other multivariate techniques such as discriminant analysis and logistic regression.


     Multiple regression is a multivariate statistical analysis technique that is used by researchers to analyze the relationship between one metric dependent variable (Y) and several independent variables (X). Discriminant analysis is a multivariate statistical technique that is used by researchers to estimate the relationship between a single dependent variable that is nonmetric (data that are categorical) and a set of independent variables that are metric (data that are interval and ratio). Logistic regression is a multivariate statistical analysis technique that is used by researchers when the study employs a binary (dichotomous) dependent variable that is nonmetric. 
     Along with the differences in their general application, discriminant analysis and logistic regression (LOGIT) differ from multiple regression specifically in that multiple regression requires a metric dependent variable, and its independent variables must be metric or be appropriately transformed. Logistic regression allows for the independent variables to contain nonmetric data through the use of dummy coding, which transforms nominal or ordinal data into a set of 0/1 indicator variables. Logistic regression also allows the analyst to employ a binary dependent variable, whereas multiple regression does not. Discriminant analysis differs from multiple regression in that the dependent variable is nonmetric and the independent variables are metric. Furthermore, multiple regression entails one metric dependent variable, whereas multiple discriminant analysis can accommodate a nonmetric dependent variable with more than two categories.
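Dummy coding, mentioned above, can be sketched in Python as follows. The function name dummy_code, the variable offense and its categories are hypothetical. One category is chosen as the reference and receives no indicator of its own, so a variable with k categories becomes k - 1 indicator variables.

```python
def dummy_code(values, reference):
    """Turn a nominal variable into 0/1 indicators, omitting the reference."""
    categories = sorted(set(values) - {reference})
    return {
        f"is_{category}": [1 if v == category else 0 for v in values]
        for category in categories
    }


# Hypothetical nominal variable with three categories.
offense = ["robbery", "burglary", "assault", "robbery"]
indicators = dummy_code(offense, reference="assault")

for name, column in indicators.items():
    print(name, column)
```

A case in the reference category (here, assault) is identified by zeros on every indicator.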

GLOSSARY
a.) beta coefficient- A regression coefficient that has been standardized. 
Regression coefficients are expressed in the units of measurement of the 
particular variable which they are associated with. This makes direct comparison 
difficult, so the regression coefficients are standardized so that all of the 
variables being compared can be compared along the same unit lines directly.
b.) coefficient of determination- R². The amount of variance of the dependent variable 
explained by the predictor variables in a model, represented as a proportion.
c.) dependence technique- When one or more variables is selected as the dependent 
variable, in order to be explained by other independent variables.
d.) dependent variable- Represented by the symbol "Y." Variable that responds to a change in the independent variable or variables.
e.) homoscedasticity- The variance of the error terms of the dependent variable is 
constant over the full range of values of the independent variables.
f.) imputation- A way of correcting for missing data. By employing an imputation 
method, a researcher can estimate the value of the missing data/case by using the 
values of other valid variables in the model. An example would be 'Cold deck 
imputation.'
g.) independent variable- Predictor variable. Represented by the symbol "X." Variable suspected of causing a change in the dependent variable.
h.) interdependence technique- Statistical procedure where the variables are not classified as dependent or independent. The researcher examines all the variables simultaneously.
i.) kurtosis- The peakedness or flatness of a distribution. When a distribution of data is placed on a graph and compared to a normal distribution, a distribution with negative kurtosis will appear flat while a distribution with positive kurtosis will appear "peaked." Calculated from the moments about the mean of a distribution.
j.) linearity- Where a statistical model predicts values that appear in a straight line. This is the result of the dependent variable displaying a constant unit change for every unit change of the independent variable.
k.) metric data- Data that are interval or ratio in nature. Non-metric data are nominal or ordinal in nature. Metric data describe an object not only by a descriptor but also by the degree to which the descriptor applies to the object. For example, a human being can be described by their height and weight. 'Quantitative' data.
l.) missing at random- Missing data whose pattern of missingness is related to other observed variables in the data set but not to the values that are themselves missing. Often referred to as MAR when discussing issues of missing data.
m.) missing completely at random- Missing data whose pattern of missingness is unrelated to any variable, so that the cases with complete data are effectively a random subsample of all cases. Often referred to as MCAR when discussing issues of missing data.
n.) multicollinearity- The extent to which one independent variable can be explained by the other independent variables in the analysis. If multicollinearity is too high, it becomes difficult to interpret the variate, because the researcher is not able to distinguish the specific effect of one variable from the effect of the variables it is highly correlated with.
o.) non-metric data- Data that are nominal or ordinal in nature. Characteristic properties that describe a subject, such as male or female, and qualities such as high or low approval ratings of politicians. 'Qualitative' data.
p.) normal distribution- Probability distribution where the horizontal (X) axis represents all of the possible values of the variable under consideration and the vertical (Y) axis represents the probability of those values occurring. A graphic depiction would appear as a "bell curve."
q.) normality- The degree to which the distribution of the sample data corresponds to a normal distribution.
r.) outlier- An observation that is substantially different from the other observations. In regression, an observation with a large residual value.
s.) practical versus statistical significance- Practical significance relates to whether the results of the analysis show effects large enough to matter in application. Statistical significance relates to the ability to infer from the sample to the general population.
t.) skew- A distribution that has few large values is positively skewed and "tails" off to the right. A distribution that has few small values is negatively skewed and tails off to the left. Skewness values that lie outside a range of -1 to 1 indicate an extremely skewed distribution. 
u.) specification error- Relates to whether the researcher excludes pertinent variables from the set of independent variables or includes variables that are not necessary for inclusion in the model.
v.) stepwise protocol- Multivariate data analysis where the contribution of each predictor variable is considered before the regression equation is developed. The 
benefit of this procedure is that at each iteration of the model, variables can be 
added or deleted.
w.) type I error- Also referred to as "alpha." The probability that a researcher will 
incorrectly reject the null hypothesis.
x.) type II error- Also referred to as "beta." The probability that a researcher will incorrectly fail to reject the null hypothesis.
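
The skew and kurtosis entries in the glossary above can be illustrated with a short Python sketch that computes both from the moments about the mean. The function names and data are hypothetical, and these are the simple moment-based versions; statistical packages often apply a small-sample correction, so their output may differ slightly.

```python
def central_moment(values, order):
    """Average of (value - mean) raised to the given power."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)


def skewness(values):
    """Third moment divided by the second moment to the power 1.5."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Fourth moment over the squared second moment, minus 3 (normal = 0)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3


symmetric = [1, 2, 3, 4, 5]                 # evenly spread, no tail
right_tailed = [1, 2, 2, 3, 3, 3, 4, 4, 20]  # one large value

print(f"symmetric: skew = {skewness(symmetric):.2f}, "
      f"kurtosis = {excess_kurtosis(symmetric):.2f}")
print(f"right-tailed: skew = {skewness(right_tailed):.2f}, "
      f"kurtosis = {excess_kurtosis(right_tailed):.2f}")
```

The evenly spread values have zero skew and negative kurtosis (a flat shape), while the single large value in the second list produces a positive skew well outside the -1 to 1 range.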