
Section 4: Summarizing Data

- Descriptive Statistics
- Frequencies
- Crosstabulation

Section 5: Inferential Statistics

- Chi-Square
- *T* test
- Correlation
- Regression
- General Linear Model

This document is the second module of a four-module tutorial series. It
describes the use of SPSS to obtain descriptive and inferential statistics.
The first section introduces procedures for obtaining descriptive statistics,
frequency tables, and crosstabulations. The second section covers the
Chi-square test of independence, independent- and paired-samples *t* tests,
bivariate and partial correlations, regression, and the general linear model.
If you are not familiar with SPSS or need more information about how to get
SPSS to read your data, consult the first module of this four-part tutorial,
SPSS for Windows: Getting Started. This set of documents uses a sample dataset,
*Employee data.sav*, that SPSS provides. It can be found in the root SPSS
directory. If you installed SPSS in the default location, the file will be
located at C:\Program Files\SPSS\Employee Data.sav.

Some users prefer to use keystrokes to navigate through SPSS. Information on common keystrokes is available in our SPSS 10 for Windows Keystroke Manual.

**Section 4: Summarizing Data**

A common first step in data analysis is to summarize information about
variables in your dataset, such as the averages and variances of variables.
Several summary or descriptive statistics are available under the
*Descriptives* option available from the *Analyze* and *Descriptive
Statistics* menus:

**Analyze**
**Descriptive Statistics**
**Descriptives...**

After selecting the *Descriptives* option, the following dialog box will
appear:

This dialog box allows you to select the variables for which descriptive
statistics are desired. To select variables, first click on a variable name in
the box on the left side of the dialog box, then click on the arrow button that
will move those variables to the *Variable(s)* box. For example, the
variables *salbegin* and *salary* have been selected in this manner in
the above example. To view the available descriptive statistics, click on the
button labeled **Options**. This will produce the following dialog box:

Clicking on the boxes next to the statistics' names will result in these
statistics being displayed in the output for this procedure. In the above
example, only the default statistics have been selected (mean, standard
deviation, minimum, and maximum); however, there are several others that could
be selected. After selecting all of the statistics you desire, output can be
generated by first clicking on the **Continue** button in the *Options*
dialog box, then clicking on the **OK** button in the *Descriptives*
dialog box. The statistics that you selected will be printed in the Output
Viewer. For example, the selections from the preceding example would produce the
following output:

This output contains several pieces of information that can help you
understand the descriptive qualities of your data. The number of cases in
the dataset is recorded under the column labeled *N*. Information about the
range of variables is contained in the *Minimum* and *Maximum*
columns. For example, beginning salaries range from $9,000 to $79,980, whereas
current salaries range from $15,750 to $135,000. The average salary is contained
in the *Mean* column. Variability can be assessed by examining the values
in the *Std. Deviation* column. The standard deviation measures the amount of
variability in the distribution of a variable. Thus, the more the
individual data points differ from each other, the larger the standard deviation
will be. Conversely, if there is a great deal of similarity between data points,
the standard deviation will be quite small. The standard deviation describes the
typical amount by which values differ from the mean. For example, a starting
salary of $24,886.73 is one standard deviation above the mean in the
above example, in which the variable *salbegin* has a mean of $17,016.09 and
a standard deviation of $7,870.64. Examining differences in variability can be
useful for anticipating further analyses: in the above example, it is clear that
there is much greater variability in current salaries than in beginning
salaries. Because equality of variances is an assumption of many inferential
statistics, this information is important to a data analyst.
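If you want to check these statistics outside SPSS, they can be sketched in a few lines of Python. This is only an illustration: the salary list and the `describe` helper are hypothetical and are not taken from *Employee data.sav*. The last lines redo the "one standard deviation above the mean" arithmetic with the figures quoted above.

```python
import statistics

# Hypothetical beginning salaries; NOT taken from Employee data.sav.
salaries = [12000, 15750, 18000, 21000, 27000]

def describe(values):
    """Return the default Descriptives statistics for a list of numbers."""
    return {
        "N": len(values),
        "Minimum": min(values),
        "Maximum": max(values),
        "Mean": statistics.mean(values),
        # Sample standard deviation (n - 1 in the denominator).
        "Std. Deviation": statistics.stdev(values),
    }

summary = describe(salaries)
for name, value in summary.items():
    print(f"{name}: {value:,.2f}")

# One standard deviation above the mean, using the mean and standard
# deviation quoted in the text for beginning salary.
one_sd_above = 17016.09 + 7870.64
print(f"{one_sd_above:,.2f}")
```

The sketch uses the sample standard deviation (dividing by n - 1), which is assumed here to be the convention that matches the *Std. Deviation* column.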

While the *Descriptives* procedure described above is useful for
summarizing data with an underlying continuous distribution, it will not
prove helpful for interpreting categorical data. Instead, it is more useful
to investigate the numbers of cases that fall into various categories. For
example, the *Frequencies* option allows you to obtain the number of people
within each education level in the dataset. The
*Frequencies* procedure is found under the *Analyze* menu:

**Analyze**
**Descriptive Statistics**
**Frequencies...**

Selecting this menu item produces the following dialog box:

Select variables by clicking on them in the left box, then clicking the arrow
in between the two boxes. Frequencies will be obtained for all of the variables
in the box labeled *Variable(s)*. This is the only step necessary for
obtaining frequency tables; however, there are several other descriptive
statistics available, many of which are described in the preceding section. The
example in the above dialog box would produce the following output:

Clicking on the **Statistics** button produces a dialog box with several
additional descriptive statistics. Clicking on the **Charts** button produces
the following box, which allows you to graphically examine your data in several
different formats:

Each of the available options provides a visual display of the data. For
example, clicking on the *Histograms* button with its suboption, *With
normal curve*, will provide you with a chart similar to that shown below.
This will allow you to assess whether your data are normally distributed, which
is an assumption of several inferential statistics. You can also use the
*Explore* procedure, available from the *Descriptive Statistics* menu, to obtain
the *Kolmogorov-Smirnov test*, which is a hypothesis test to determine if
your data are normally distributed.
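The counting behind a frequency table can be sketched in Python as follows. The education values and the `frequency_table` helper are hypothetical; the sketch only illustrates the count, percent, and cumulative percent columns and does not reproduce SPSS output.

```python
from collections import Counter

# Hypothetical years-of-education values, one per employee.
educ = [12, 12, 15, 16, 12, 8, 15, 12, 16, 19]

def frequency_table(values):
    """Return rows of (value, count, percent, cumulative percent)."""
    counts = Counter(values)
    total = len(values)
    rows, cumulative = [], 0.0
    for value in sorted(counts):
        percent = 100.0 * counts[value] / total
        cumulative += percent
        rows.append((value, counts[value], percent, cumulative))
    return rows

table = frequency_table(educ)
for value, count, percent, cumulative in table:
    print(f"{value:>5} {count:>5} {percent:>7.1f} {cumulative:>7.1f}")
```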

While frequencies show the numbers of cases in each level of a categorical
variable, they do not give information about the relationship between
categorical variables. For example, frequencies can give you the number of men
and women in a company AND the number of people in each employment category, but
not the number of men and women IN each employment category. The
*Crosstabs* procedure is useful for investigating this type of information
because it can provide information about the intersection of two variables. The
number of men and women in each of three employment categories is one example of
information that can be crosstabulated. The *Crosstabs* procedure is found
in the *Analyze* menu in the Data Editor window:

**Analyze**
**Descriptive Statistics**
**Crosstabs...**

After selecting *Crosstabs* from the menu, the dialog box shown above
will appear on your monitor. The box on the left side of the dialog box contains
a list of all of the variables in the working dataset. Variables from this list
can be selected for rows, columns, or layers in a crosstabulation. For example,
selecting the variable *gender* for the rows of the table and *jobcat*
for the columns would produce a crosstabulation of gender by job category.

The options available by selecting the **Statistics** and **Cells** buttons
provide you with several additional output features. Selecting the
**Cells** button will produce a dialog box that allows you to add values
to your table. For example, the dialog box shown below illustrates an example in
which the *Expected* option in the *Counts* box and the *Row*,
*Column*, and *Total* options in the *Percentages* box have been
selected.

The combination of the two dialog boxes shown above will produce the following output table:

The crosstabulation statistics provide several interesting observations about the data. In the above table, there appears to be an association between gender and employment category as the expected values, which are the values expected by chance, and the actual counts are different from each other. The following section will discuss how to further examine this relationship with inferential statistics.
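To make the idea of observed and expected counts concrete, here is a minimal Python sketch of a crosstabulation. The eight records and the `crosstab` helper are hypothetical and far smaller than the real dataset. The expected count for a cell is computed as row total times column total divided by the number of cases, which is the value you would see if the two variables were independent.

```python
from collections import Counter

# Hypothetical (gender, job category) records; not the real dataset.
records = [
    ("Female", "Clerical"), ("Female", "Clerical"), ("Female", "Manager"),
    ("Male", "Clerical"), ("Male", "Custodial"), ("Male", "Manager"),
    ("Male", "Manager"), ("Female", "Clerical"),
]

def crosstab(pairs):
    """Count cases per (row, column) cell and return the cells plus margins."""
    cells = Counter(pairs)
    row_totals = Counter(row for row, _ in pairs)
    col_totals = Counter(col for _, col in pairs)
    return cells, row_totals, col_totals

cells, row_totals, col_totals = crosstab(records)
n = len(records)
for row in sorted(row_totals):
    for col in sorted(col_totals):
        observed = cells[(row, col)]
        # Expected count under independence: row total * column total / N.
        expected = row_totals[row] * col_totals[col] / n
        print(f"{row:<7} {col:<10} observed={observed} expected={expected:.2f}")
```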

**Section 5: Inferential Statistics**

The Chi-square test for independence is used in situations where you have two
*categorical variables*. A categorical variable is a qualitative variable
in which cases are classified in one and only one of the possible levels. A
classic example is gender, in which cases are classified in one of two possible
levels. The example in the above section, in which *Gender* and
*Employment Category* are crosstabulated using the SPSS
*Crosstabs* procedure, is an example of data with which you could conduct a
*Chi-square test of independence* testing the null hypothesis that there is
no relationship between the two variables.

For instance, you could conduct a test of the hypothesis that there is no
relationship between *Gender* and *Employment Category*. If this
hypothesis were true, you would expect the proportions of men and women
to be the same within each level of *Employment Category*. In other
words, there should be little difference between *observed* and *expected*
values, where the expected values represent the numbers that would be in
each cell if the variables were independent of each other. The difference
between observed and expected values is the basis of the Chi-square statistic:
it evaluates how likely differences as large as those observed would be under
the null hypothesis that the two variables are independent. The expected
values can be obtained by clicking
on the **Cells** button in the *Crosstabs* dialog box, as described in the
preceding section. Examining the table above, it appears that gender and
employment category are not independent of each other: there are more women
in clerical positions than would be expected
by chance, whereas there are more men in custodial and managerial positions than
would be expected by chance. Conducting a Chi-square test of independence will
tell us whether the observed pattern is statistically different from the pattern
expected due to chance.

The Chi-square test of independence can be obtained through the
*Crosstabs* dialog boxes that were used above to get a crosstabulation of
the data. After opening the *Crosstabs* dialog box as described in the
preceding section, click the **Statistics** button to get the following
dialog box:

By clicking on the box labeled *Chi-Square*, you will obtain the
Chi-square test of independence for the variables you have crosstabulated. This
will produce the following table in the Output Viewer:

Inspecting the table in the previous section, it appears that the two
variables, gender and employment category, are related to each other in some
way. This is suggested by the substantial differences between the observed
and expected counts: these differences represent the difference between the values
expected if gender and employment classification were independent of each other
(expected counts) and the actual numbers of cases in each cell (observed
counts). For example, if gender and employment classification were unrelated,
38.3 women would be expected in the manager classification, as
opposed to the observed number, 10. Here, the expected value of 38.3
reflects the fact that 45.6% of the cases in this dataset are women, so
45.6% of the 84 managers in the dataset would also be expected to be women if
gender and employment classification were independent of each other. The output
above provides a statistical test of the hypothesis that gender and
employment category are independent of each other. The large Chi-Square
statistic (79.28) and its small significance level (*p* < .001, which SPSS
displays as .000) indicate that it is very unlikely that these variables are
independent of each other. Thus, you can conclude that there is a relationship
between a person's gender and their employment classification.
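The arithmetic behind this test can be sketched in Python. The cell counts below are an assumption made for illustration: they were chosen to agree with the figures quoted above (84 managers, 10 of them women, and 45.6% women overall), but they are not copied from the output table. The `chi_square` helper is hypothetical; it sums (observed - expected)² / expected over all cells.

```python
# Cell counts assumed for illustration: chosen to agree with the figures
# quoted in the text (84 managers, 10 of them women, 45.6% women overall).
observed = {
    "Female": {"Clerical": 206, "Custodial": 0, "Manager": 10},
    "Male": {"Clerical": 157, "Custodial": 27, "Manager": 74},
}

def chi_square(table):
    """Return (statistic, degrees of freedom, expected counts)."""
    rows = list(table)
    cols = list(next(iter(table.values())))
    row_totals = {r: sum(table[r].values()) for r in rows}
    col_totals = {c: sum(table[r][c] for r in rows) for c in cols}
    n = sum(row_totals.values())
    # Expected count under independence: row total * column total / N.
    expected = {
        r: {c: row_totals[r] * col_totals[c] / n for c in cols} for r in rows
    }
    statistic = sum(
        (table[r][c] - expected[r][c]) ** 2 / expected[r][c]
        for r in rows
        for c in cols
    )
    df = (len(rows) - 1) * (len(cols) - 1)
    return statistic, df, expected

statistic, df, expected = chi_square(observed)
print(f"expected women managers: {expected['Female']['Manager']:.1f}")
print(f"chi-square = {statistic:.2f}, df = {df}")
```

With these assumed counts, the sketch gives an expected count of about 38.3 women managers and a statistic close to the 79.28 reported above. The significance level itself requires the Chi-square distribution, which is left to SPSS here.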

The *t* test is a useful technique for comparing mean values of two sets
of numbers. The comparison will provide you with a statistic for evaluating
whether the difference between two means is statistically significant. *T*
tests can be used either to compare two independent groups (independent-samples
*t* test) or to compare observations from two measurement occasions for the
same group (paired-samples *t* test). To conduct a *t* test, your data
should be a sample drawn from a continuous underlying distribution. If you are
using the *t* test to compare two groups, the groups should be randomly
drawn from normally distributed and independent populations. For example, if you
were comparing clerical and managerial salaries, the *independent
populations* are clerks and managers, which are two nonoverlapping groups. If
you have more than two groups or more than two variables in a single group that
you want to compare, you should use one of the General Linear Model
procedures in SPSS, which
are described below.

There are three types of *t* tests; the options are all located
in the *Compare Means* submenu of the *Analyze* menu:

**Analyze**
**Compare Means**
**One-Sample T Test...**
**Independent-Samples T Test...**
**Paired-Samples T Test...**

While each of these *t* tests compares mean values of two sets of
numbers, they are designed for distinctly different situations:

- The *one-sample t test* is used to compare a single sample with a population value. For example, a test could be conducted to compare the average salary of managers within a company with a value that was known to represent the national average for managers.
- The *independent-sample t test* is used to compare two groups' scores on the same variable. For example, it could be used to compare the salaries of clerks and managers to evaluate whether there is a difference in their salaries.
- The *paired-sample t test* is used to compare the means of two variables within a single group. For example, it could be used to see if there is a statistically significant difference between starting salaries and current salaries among the custodial staff in an organization.

To select variables for an independent-samples *t* test, first highlight them
by clicking on them in the box on the left. Then move them into the appropriate
box on the right by clicking on the arrow button in the center of the dialog
box. Your independent variable, which is the variable that defines the groups
being compared, should go in the *Grouping Variable* box. For example, because
employment categories are being compared in this analysis, the *jobcat* variable is
selected. However, because *jobcat* has more than two levels, you will need
to click on **Define Groups** to specify the two levels of *jobcat* that
you want to compare. This will produce another dialog box, as shown below:

Here, the groups to be compared are limited to the groups with the values 2
and 3, which represent the clerical and managerial groups. After selecting the
groups to be compared, click the **Continue** button, and then click the
**OK** button in the main dialog box. The above choices will produce the
following output:

The first output table,
labeled *Group Statistics*, displays descriptive statistics. The second
output table, labeled *Independent Samples Test*, contains the statistics
that are critical to evaluating the current research question. This table
contains two sets of analyses: the first assumes equal variances and the second
does not. To assess whether you should use the statistics for equal or unequal
variances, use the significance level associated with the value under the
heading *Levene's Test for Equality of Variances*. It tests the hypothesis
that the variances of the two groups are equal. A small value in the column
labeled *Sig.* indicates that this hypothesis should be rejected and that the
groups have unequal variances. In the above case, the small value in that
column indicates that the variances of the two groups, clerks and managers, are
not equal. Thus, you should use the statistics in the row labeled *Equal
variances not assumed*.

The SPSS output
reports a *t* statistic and *degrees of freedom* for all *t* test
procedures. Every unique value of the *t* statistic and its associated
degrees of freedom has a significance value. In the above example, which tests
the hypothesis that clerks and managers do not differ in their salaries, the
*t* statistic under the assumption of unequal variances has a value of
-16.3, and the degrees of freedom has a value of 89.6, with an associated
significance level of .000. The significance level tells us that a difference
this large would be very unlikely if clerical and managerial salaries did not
differ: specifically, less than one time in a thousand would we obtain a mean
difference of $33,038 or larger between these groups if there were really no
difference in their salaries.
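As a rough illustration of where the *t* statistic and the fractional degrees of freedom come from when equal variances are not assumed, here is a Python sketch. It assumes the unequal-variances row is computed with Welch's *t* and the Welch-Satterthwaite approximation to the degrees of freedom. The `welch_t` helper and both salary lists are hypothetical.

```python
import math
import statistics

def welch_t(group_a, group_b):
    """Return (t, degrees of freedom) without assuming equal variances."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    # Squared standard error contributed by each group.
    se2_a = statistics.variance(group_a) / n_a
    se2_b = statistics.variance(group_b) / n_b
    t = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t, df

# Hypothetical salaries (in dollars) for two non-overlapping groups.
clerical = [24000, 26500, 27750, 30000, 31500]
managers = [52000, 60000, 66000, 75000, 92000]

t, df = welch_t(clerical, managers)
print(f"t = {t:.2f}, df = {df:.1f}")
```

Because the first group has the lower mean, *t* is negative here, just as in the output discussed above. The degrees of freedom are generally not a whole number under this approximation.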

To obtain a paired-samples *t* test, select the menu items described
above and the following dialog box will appear:

The above example illustrates a *t* test between the variables
*salbegin* and *salary*, which represent employees' beginning salary
and their current salary. To set up a paired-samples *t* test as in the
above example, click on the two variables that you want to compare. The variable
names will appear in the section of the box labeled *Current Selections*.
When these variable names appear there, click the arrow in the middle of the
dialog box and they will appear in the *Paired Variables* box. Clicking the
**OK** button with the above variables selected will produce output for the
paired-samples *t* test. The following output is an example of the
statistics you would obtain from the above example.

As with the independent-samples
*t* test, there is a *t* statistic and degrees of freedom that
have a significance level associated with them. The *t* test in this example
tests the hypothesis that there is no difference between employees' beginning and
current salaries. The *t* statistic (35.04) and its associated
significance level (*p* < .001) indicate that this is not the case. In
fact, the observed mean difference of $17,403.48 between beginning and current
salaries would occur fewer than once in a thousand times if there really were no
difference between employees' beginning and current salaries.
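The paired-samples calculation can be sketched in Python: it is a one-sample *t* test on the differences within each pair. The `paired_t` helper and the five pairs of salaries are hypothetical. The sketch takes each difference as the second value minus the first, so the sign of *t* depends on the order in which the variables are supplied.

```python
import math
import statistics

def paired_t(first, second):
    """Return (t, degrees of freedom, mean difference) for paired values."""
    if len(first) != len(second):
        raise ValueError("paired samples need equal lengths")
    differences = [b - a for a, b in zip(first, second)]
    n = len(differences)
    mean_diff = statistics.mean(differences)
    # Standard error of the mean difference.
    se = statistics.stdev(differences) / math.sqrt(n)
    return mean_diff / se, n - 1, mean_diff

# Hypothetical beginning and current salaries for the same five employees.
salbegin = [12000, 15000, 18750, 21000, 27000]
salary = [21000, 30000, 32100, 40200, 57000]

t, df, mean_diff = paired_t(salbegin, salary)
print(f"mean difference = {mean_diff:,.2f}, t = {t:.2f}, df = {df}")
```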

Correlation is one of the most common forms of data analysis, both because it
can provide an analysis that stands on its own and because it underlies
many other analyses and can be a good way to support conclusions after
primary analyses have been completed. *Correlations* are a measure of the
linear relationship between two variables. A correlation coefficient has a value
ranging from -1 to 1. Values whose absolute value is closer to 1 indicate
that there is a strong relationship between the variables being correlated,
whereas values closer to 0 indicate that there is little or no linear
relationship. The sign of a correlation coefficient describes the type of
relationship between the variables being correlated. A positive correlation
coefficient indicates that there is a positive linear relationship between the
variables: as one variable increases in value, so does the other. An example of
two variables that are likely to be positively correlated is the number of days
a student attended class and test grades because, as the number of classes
attended increases in value, so do test grades. A negative value indicates a
negative linear relationship between variables: as one variable increases in
value, the other variable decreases in value. The number of days students miss
class and their test scores are likely to be negatively correlated because as
the number of days of missed classes increases, test scores typically decrease.
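The class attendance example can be sketched in Python. The `pearson_r` helper and all three lists are hypothetical. Days missed is constructed as 40 minus days attended, so the two correlations with test scores have the same size and opposite signs.

```python
import math

def pearson_r(x, y):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    # Sum of cross-products and sums of squares of deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: days attended, days missed, and test scores.
days_attended = [20, 25, 30, 35, 40]
days_missed = [20, 15, 10, 5, 0]
test_scores = [55, 62, 70, 74, 88]

r_attended = pearson_r(days_attended, test_scores)
r_missed = pearson_r(days_missed, test_scores)
print(f"attended vs. score: {r_attended:.3f}")
print(f"missed vs. score:   {r_missed:.3f}")
```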

To obtain a correlation in SPSS, start at
the *Analyze* menu. Select the *Correlate* option from this menu. By
selecting this menu item, you will see that there are three options for
correlating variables: (1) *Bivariate*, (2) *Partial*, and (3)
*Distances.* This document will cover the first two types of correlations.
The *bivariate correlation* is for situations where you are interested only
in the relationship between two variables. *Partial correlations* should be
used when you are measuring the association between two variables but want to
factor out the effect of one or more other variables.

To obtain a bivariate correlation, choose the following menu option:

**Analyze**
**Correlate**
**Bivariate...**

This will produce the following dialog box:

To obtain correlations, first click on the variable names in the variable
list on the left side of the dialog box. Next, click on the arrow between the
two white boxes, which will move the selected variables into the *Variables*
box. Each variable listed in the *Variables* box will be correlated with
every other variable in the box. For example, with the above selections, we
would obtain correlations between *Education Level* and *Current
Salary*, between *Education Level* and *Previous Experience*, and
between *Current Salary* and *Previous Experience*. We will maintain
the default options shown in the above dialog box in this example. The first
option to consider is the type of correlation coefficient. Pearson's is
appropriate for continuous data such as those in the above example, whereas the other
two correlation coefficients, Kendall's tau-b and Spearman's, are designed for
ranked data. The choice between a one- and two-tailed significance test in the
*Test of Significance* box should be determined by whether the hypothesis
you are testing makes a prediction about the direction of the effect between the
two variables: if you are predicting that there is a negative or
positive relationship between the variables, then the one-tailed test is
appropriate; if you are not making a directional prediction, you should use the
two-tailed test. The selections in the
above dialog box will produce the following output:

This output gives us a
correlation matrix for the three correlations requested in the above dialog box.
Note that despite there being nine cells in the above matrix, there are only
three correlation coefficients of interest: (1) the correlation between current
salary and educational level, (2) the correlation between previous experience and
educational level, and (3) the correlation between current salary and previous
experience. Only three of the nine correlations are of interest
because the diagonal consists of correlations of each variable with itself,
always resulting in a value of 1.00, and the values on each side of the diagonal
replicate the values on the opposite side of the diagonal. The first of the
three unique correlation coefficients shows that there is a positive correlation
between employees' number of years of education and their current salary. This
positive correlation coefficient (.661) indicates that there is a statistically
significant (*p* < .001) linear relationship between these two variables
such that the more education a person has, the larger that person's salary is.
Also observe that there is a statistically significant (*p* < .001)
negative correlation coefficient (-.252) for the association between education
level and previous experience, indicating that the linear relationship between
these two variables is one in which the values of one variable decrease as the
other increases. The third correlation coefficient (-.097) also indicates a
negative association, here between employees' current salaries and their previous work
experience, although this correlation is fairly weak.

The second type of
correlation listed under the *Correlate* menu item is the partial
correlation, which measures an association between two variables with the
effects of one or more other variables factored out. To obtain a partial
correlation, select the following menu item:

**Analyze**
**Correlate**
**Partial...**

This will produce the following dialog box:

Here, we have selected the variables we want to correlate as well as the
variable for which we want to control by first clicking on variable names to
highlight them on the left side of the box, then moving them to the boxes on the
right by clicking on the arrow immediately to the left of either the
*Variables* box or the *Controlling for* box. In this example, we are
correlating current salaries with years of education while controlling for
beginning salaries. Thus, we will have a measure of the association between
current salaries and years of education, while removing the association between
beginning salaries and the two variables we are correlating. The above example
will produce the following output:

`- - - P A R T I A L   C O R R E L A T I O N   C O E F F I C I E N T S - - -`

`Controlling for.. SALBEGIN`

`              SALARY      EDUC`

`SALARY        1.0000     .2810`
`              (    0)   (  471)`
`              P= .      P= .000`

`EDUC           .2810    1.0000`
`              (  471)   (    0)`
`              P= .000   P= .`

`(Coefficient / (D.F.) / 2-tailed Significance)`

`" . " is printed if a coefficient cannot be computed`

Notice that the correlation coefficient is
considerably smaller in the output above than in the bivariate correlation
example: the correlation between these variables was .661, whereas it is only
.281 in the partial correlation. Nevertheless, a statistically significant
association remains (the output reports a two-tailed significance of .000).

Partial correlations can be especially useful in situations where it is not obvious whether variables possess a unique relationship or whether several variables overlap with each other. For example, if you were attempting to correlate anxiety with job performance and stress with job performance, it would be useful to conduct partial correlations. You could correlate anxiety and a job performance measure while controlling for stress to determine if there were a unique relationship between anxiety and job performance or whether perhaps stress is highly correlated with anxiety--which would result in little remaining variance that could be uniquely attributed to the association between anxiety and job performance.
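When only one variable is controlled for, the partial correlation can be computed from the three bivariate correlations. The Python sketch below uses the standard first-order formula. The `partial_r` helper is hypothetical, and only the .661 value comes from this document: the two correlations involving beginning salary are made-up values, so the result is not expected to match the .281 in the output above.

```python
import math

def partial_r(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y, controlling for z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# r_xy = .661 is the salary-education correlation quoted in the text.
# The two correlations with beginning salary are hypothetical values.
r_salary_educ = 0.661
r_salary_salbegin = 0.88
r_educ_salbegin = 0.63

r_partial = partial_r(r_salary_educ, r_salary_salbegin, r_educ_salbegin)
print(f"{r_partial:.3f}")
```

When the control variable is strongly related to both of the variables being correlated, as in these made-up values, the partial correlation comes out well below the bivariate one.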

Regression is a technique that can be used to investigate the effect of one or more predictor variables on an outcome variable. Regression allows you to make statements about how well one or more independent variables will predict the value of a dependent variable. For example, if you were interested in investigating which variables in the employee database were good predictors of employees' current salaries, you could create a regression equation that would use several of the variables in the dataset to predict employees' salaries. By doing this, you will be able to make statements about whether variables such as employees' number of years of education, their starting salary, or their number of months on the job are good predictors of their current salaries.

To conduct a regression analysis, select the following from the
*Analyze* menu:

**Analyze**
**Regression**
**Linear...**

This will produce the following dialog box:

This dialog box illustrates an example regression equation. As with other
analyses, you select variables from the box on the left by clicking on them,
then moving them to the boxes on the right by clicking the arrow next to the box
where you want to enter a particular variable. Here, employees' current
salary has been entered as the dependent variable. In the *Independent(s)*
box, several predictor variables have been entered, including education level,
beginning salary, months since hire, and previous experience.

NOTE: Before you run a regression model, you should consider the method that
you use for selecting or rejecting variables in that model. The box labeled
*Method* allows you to select one of five methods: *Enter*,
*Remove*, *Forward*, *Backward*, and *Stepwise*.
Unfortunately, we cannot offer a comprehensive discussion of the characteristics
of each of these methods here, but you have several options regarding the method
you use to remove and retain predictor variables in your regression equation. In
this example, we will use the SPSS default
method, *Enter*, which is a standard approach in regression models. If you
have questions about which method is most appropriate for your data analysis,
consult a regression textbook or the SPSS help
facilities, or contact a consultant.

The following output assumes that only the default options have been
requested. If you have selected options from the *Statistics*,
*Plots*, or *Options* boxes, then you will have more output than is
shown below and some of your tables may contain additional statistics not shown
here.

The first table in the output, shown below, includes information about the
quantity of variance that is explained by your predictor variables. The first
statistic, *R*, is the multiple correlation coefficient between all of the
predictor variables and the dependent variable. In this model, the value is .90,
which indicates that there is a great deal of variance shared by the independent
variables and the dependent variable. The next value, *R* Square, is
simply the squared value of *R*. This is frequently used to describe the
goodness-of-fit or the amount of variance explained by a given set of predictor
variables. In this example, the value is .81, which indicates that 81% of the
variance in the dependent variable is explained by the independent variables in
the model.
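To see how *R* Square relates to a fitted line, here is a deliberately simplified Python sketch with a single predictor rather than the four used in this example. The `simple_regression` helper and both salary lists are hypothetical. *R* Square is computed as one minus the ratio of the residual sum of squares to the total sum of squares, and *R* is its square root.

```python
import math

def simple_regression(x, y):
    """Fit y = intercept + slope * x by least squares; return fit and R square."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R square = 1 - (residual sum of squares / total sum of squares).
    ss_total = sum((b - mean_y) ** 2 for b in y)
    ss_residual = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    r_square = 1 - ss_residual / ss_total
    return slope, intercept, r_square

# Hypothetical beginning and current salaries, in thousands of dollars.
salbegin = [12, 15, 18, 21, 27]
salary = [22, 29, 33, 41, 55]

slope, intercept, r_square = simple_regression(salbegin, salary)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
print(f"R = {math.sqrt(r_square):.3f}, R Square = {r_square:.3f}")
```

The same relationship holds for the values reported above: an *R* of .90 squared gives an *R* Square of .81.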

The second table in the output is an
ANOVA table that describes the
overall variance accounted for in the model. The *F* statistic represents a
test of the null hypothesis that the regression coefficients are all equal to
zero. Put another way,
this *F* statistic tests whether the *R* Square proportion of variance in the
dependent variable accounted for by the predictors is zero. If the null
hypothesis were true, there would be no regression
relationship between the dependent variable and the predictor variables.
Instead, it appears that the coefficients of the four predictor variables in the
present example are not all equal to zero and that the predictors could be used
to predict the dependent variable,
current salary, as is indicated by a large *F* value and a small
significance level.
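The link between *R* Square and the overall *F* statistic can be sketched in Python. The `f_from_r_square` helper is hypothetical. The *R* Square of .81 and the four predictors come from this example, but the case count of 474 is an assumption (it presumes no case is missing a value on any model variable). Because .81 is a rounded value, the result will not exactly match the *F* in the SPSS output.

```python
def f_from_r_square(r_square, n_cases, n_predictors):
    """Return (F, df regression, df residual) for the overall regression test."""
    df_regression = n_predictors
    df_residual = n_cases - n_predictors - 1
    f = (r_square / df_regression) / ((1 - r_square) / df_residual)
    return f, df_regression, df_residual

# R Square = .81 and four predictors come from the text; the case count
# of 474 is an assumption (no missing data on any model variable).
f, df1, df2 = f_from_r_square(0.81, 474, 4)
print(f"F({df1}, {df2}) = {f:.1f}")
```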

In addition to the coefficients, the *Coefficients* table also provides a
significance test for each of the independent variables in the model. The
significance test evaluates the null hypothesis that the unstandardized
regression coefficient for the predictor is zero when all other predictors are
held in the model.
This test is presented as a *t* statistic. For example, examining the
*t* statistic for the variable *Months Since Hire*, you can see that
it is associated with a significance value of .000, indicating that the null
hypothesis, which states that this variable's regression coefficient is zero when
the other predictors are in the model, can be rejected.

The majority of procedures used for conducting analysis of variance (ANOVA)
in SPSS can be
found under the *General Linear Model* (GLM) menu item in the
*Analyze* menu. Analysis of variance can be used in many situations to
determine whether there are differences between groups on one or
more outcome variables or whether a continuous variable is a good predictor of one or
more dependent variables. There are three varieties of the general linear
model available in SPSS:
univariate, multivariate, and repeated measures. The *univariate general
linear model* is used in situations where you have only a single dependent
variable, but may have several independent variables that can be fixed
between-subjects factors, random between-subjects factors, or covariates. The
*multivariate general linear model* is used in situations where there is
more than one dependent variable and independent variables are either fixed
between-subjects factors or covariates. The *repeated measures general linear
model* is used in situations where you have more than one measurement
occasion for a dependent variable and have fixed between-subjects factors or
covariates as independent variables. Because it is beyond the scope of this
document to cover all three varieties of the general linear model in detail, we
will focus on the univariate version of the general linear model with some
attention given to topics that are unique to the repeated measures general
linear model. Several features of the univariate general linear model are shared
by the other varieties provided in SPSS, so
understanding the univariate model will prove useful for understanding the other
GLM options.

The univariate general linear model is used to compare group means and to estimate the effect of covariates on a single dependent variable. For example, you may want to see if there are differences between men's and women's salaries in a sample of employee data. To do this, you would want to demonstrate that the average salary is significantly different between men and women. However, other factors could also affect a person's salary and need to be controlled for in such an analysis. Educational background and starting salary are two such variables. By including these variables in your analysis, you will be able to evaluate the differences between men's and women's salaries while controlling for the influence of these other variables.

To specify a univariate general linear model in SPSS, go to the *Analyze* menu and select *Univariate* from the *General Linear Model* submenu:

**Analyze**
**General Linear Model**
**Univariate...**

This will produce the following dialog box:

The above box demonstrates a model with multiple types of independent
variables. The variable, gender, has been designated as a *fixed factor*
because it contains all of the levels of interest.

In contrast, *random factors* are variables whose levels represent a random
sample of the possible levels that could be sampled. There are not any true
random factors in our dataset; therefore, this input box has been left blank
here. However, you could imagine a situation similar to the above example where
you sampled data from multiple corporations for your employee database. In that
case, you would have introduced a random factor into the model: the
corporation to which an employee belongs. Corporation is a random factor because
you would only be sampling a few of the many possible corporations to which you
would want to generalize your results.

The next input box contains the covariates in your model. A *covariate*
is a quantitative independent variable. Covariates are often entered in models
to reduce error variance: by removing the effects of the relationship between
the covariate and the dependent variable, you can often get a better estimate of
the amount of variance that is being accounted for by the factors in the model.
Covariates can also be used to measure the linear association between the
covariate and a dependent variable, as is done in regression models. In this
situation, a linear relationship indicates that the dependent variable increases
or decreases in value as the covariate increases or decreases in value.
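
To make the idea of removing a covariate's effect more concrete, here is a minimal Python sketch that adjusts group means for a single covariate using a common (pooled within-group) slope. It is an illustration under stated assumptions, not a reproduction of what SPSS computes internally: the function names and the tiny dataset (years of education and salary in thousands) are made up for the example.

```python
from statistics import mean


def pooled_slope(groups):
    """Pooled within-group slope of y on x; groups maps label -> list of (x, y)."""
    sxy = sxx = 0.0
    for pairs in groups.values():
        mx = mean(x for x, _ in pairs)
        my = mean(y for _, y in pairs)
        sxy += sum((x - mx) * (y - my) for x, y in pairs)
        sxx += sum((x - mx) ** 2 for x, _ in pairs)
    return sxy / sxx


def adjusted_means(groups):
    """Group means of y, adjusted to the grand mean of the covariate x."""
    slope = pooled_slope(groups)
    grand_x = mean(x for pairs in groups.values() for x, _ in pairs)
    result = {}
    for label, pairs in groups.items():
        group_x = mean(x for x, _ in pairs)
        group_y = mean(y for _, y in pairs)
        result[label] = group_y - slope * (group_x - grand_x)
    return result


# Hypothetical data: (years of education, salary in thousands).
data = {
    "m": [(12, 30), (16, 40), (20, 50)],
    "f": [(10, 22), (12, 26), (14, 30)],
}
adjusted = adjusted_means(data)
print(adjusted)
```

In this made-up example the raw group means differ by 14 (40 versus 26), but the adjusted means differ by only about 4.4, because part of the raw difference reflects the groups' different education levels.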

The box labeled *WLS Weight* can contain a variable that is used to
weight other variables in a weighted least-squares analysis. This procedure is
infrequently used, however, and is not discussed in any detail here.

The default model for the SPSS univariate GLM will include
main effects for all independent variables and will provide interaction terms
for all possible combinations of fixed and random factors. You may not want this
default model, or you may want to create interaction terms between your
covariates and some of the factors. In fact, if you intend to conduct an
analysis of covariance, you should test for interactions between covariates and
factors. Doing so will determine whether you have met the *homogeneity of
regression slopes* assumption, which states that the regression slopes for
all groups in your analysis are equal. This assumption is important because the
means for each group are adjusted by averaging the slopes for each group so that
group differences in the covariate are removed from the dependent variable.
Thus, it is assumed that the relationship between the covariate and the
dependent variable is the same at all levels of the independent variables. To
make changes in the default model, click on the **Model** button, which will
produce the following dialog box:

The first step for modifying the default model is to click on the button
labeled *Custom* to activate the grayed out areas of the dialog box. At
this point, you can begin to move variables in the *Factors &
Covariates* box into the *Model* box. First, move all of the main
effects into the *Model* box. The quickest way to do that is to
double-click on their names in the *Factors & Covariates* box. After
entering all of the main effects, you can begin building interaction terms. To
build the interactions, click on the arrow facing downwards in the *Build
Term(s)* section and select interaction, as shown in the figure above. After
you have selected the interaction, you can click on the names of the variables
with which you would like to build an interaction, then click on the arrow
facing right under the *Build Term(s)* heading. In the above example, the
*educ*gender* term has already been created. The *salbegin*gender*
and *salbegin*educ* terms can be created by highlighting two terms at a
time as shown above, then clicking on the right-facing arrow. Some of the other
options in the *Build Term(s)* list that you may find useful are the *All
n-way* options. For example, if you highlighted all three variables in the
*Factors & Covariates* box, you could create all three possible 2-way
interactions by selecting the *All 2-way* option from the *Build
Term(s)* drop-down menu, then clicking the right-facing arrow.
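
As a quick check on how many terms an *All n-way* option produces, the short Python sketch below enumerates every 2-way combination of the three variables used in this example. It only illustrates the counting; it does not interact with SPSS, and the list name `variables` is chosen here for the example.

```python
from itertools import combinations

# The three variables highlighted in the Factors & Covariates box.
variables = ["gender", "educ", "salbegin"]

# Every unordered pair of variables corresponds to one 2-way interaction term.
two_way_terms = ["*".join(pair) for pair in combinations(variables, 2)]
print(two_way_terms)  # ['gender*educ', 'gender*salbegin', 'educ*salbegin']
```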

If you are testing the homogeneity of regression slopes assumption, you
should examine your group by covariate interactions, as well as any covariate by
covariate interactions. In order to meet the ANCOVA assumption, these
interactions should not be significant. Examining the output from the example
above, you would expect to see nonsignificant effects for the *gender*educ* and
the *gender*salbegin* interaction effects:

Examining the interaction effects, you can see that none of them was
significant. The *gender*salbegin* effect has a small *F* statistic
(.660) and a large significance value (.417), the *educ*salbegin* effect
also has a small *F* statistic (1.808) and large significance value (.369),
and the *salbegin*educ* effect also has a small *F* statistic (1.493)
and large significance level (.222). Because all of these significance levels
are greater than .05, the homogeneity of regression slopes assumption has been
met and you can proceed with the ANCOVA.
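
The logic of a factor by covariate interaction test can be sketched for the simplest case of one factor and one covariate. The Python below compares a model with a separate slope in each group against a model with one common slope and computes the resulting *F* statistic. It is a simplified illustration, not the SPSS procedure itself: the function names and data are hypothetical, and no significance level is computed because that requires the *F* distribution, which SPSS reports for you.

```python
from statistics import mean


def sums_of_squares(pairs):
    """Return (Sxx, Sxy, Syy) about the group means for a list of (x, y)."""
    mx = mean(x for x, _ in pairs)
    my = mean(y for _, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxx, sxy, syy


def slopes_f(groups):
    """F test of equal slopes across groups (one factor, one covariate)."""
    parts = [sums_of_squares(pairs) for pairs in groups.values()]
    n_cases = sum(len(pairs) for pairs in groups.values())
    n_groups = len(groups)
    # Full model: each group has its own intercept and its own slope.
    sse_full = sum(syy - sxy ** 2 / sxx for sxx, sxy, syy in parts)
    # Reduced model: separate intercepts, but one slope shared by all groups.
    sxx_w = sum(p[0] for p in parts)
    sxy_w = sum(p[1] for p in parts)
    syy_w = sum(p[2] for p in parts)
    sse_reduced = syy_w - sxy_w ** 2 / sxx_w
    df_num = n_groups - 1
    df_den = n_cases - 2 * n_groups
    f_value = ((sse_reduced - sse_full) / df_num) / (sse_full / df_den)
    return f_value, df_num, df_den


# Hypothetical (covariate, outcome) pairs for two groups.
example = {
    "a": [(1, 2), (2, 3), (3, 5), (4, 6)],
    "b": [(1, 1), (2, 4), (3, 4), (4, 7)],
}
f_value, df1, df2 = slopes_f(example)
print(f"F({df1}, {df2}) = {f_value:.2f}")  # F(1, 4) = 0.80
```

A small *F* such as this one means that letting the slopes differ between groups barely improves the fit, which is the pattern you hope to see when checking the assumption.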

Knowing that the model does not violate the homogeneity of regression slopes
assumption, you can remove the interaction terms from the model by returning to
the *GLM Univariate* dialog box, clicking the **Model** button, and selecting
*Full Factorial*. This will return the model to its default form in which there
are no interactions with covariates. After you have done this, click **OK** in
the *GLM Univariate* dialog box to produce the following output:

The repeated measures version of the general linear model has many similarities to the univariate model described above. The key difference is that repeated measures models include multiple measurement occasions of the dependent variable, whereas the univariate model permits only a single dependent variable. You could fit a similar model with repeated measurements by using beginning salaries and current salaries as the repeated measurement occasions.
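
With only two measurement occasions and no other terms in the model, the within-subjects test reduces to a paired *t* test: the *F* statistic equals the squared paired *t*. The Python sketch below shows that relationship on made-up salaries (in thousands). It is an illustration only, and it assumes a model with no between-subjects factors or covariates, unlike the example that follows.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical beginning and current salaries for five employees.
begin = [20, 22, 25, 30, 28]
current = [26, 27, 33, 36, 38]

# Each employee's change in salary between the two occasions.
diffs = [c - b for b, c in zip(begin, current)]
n = len(diffs)

# Paired t statistic: mean difference over its standard error.
t_value = mean(diffs) / (stdev(diffs) / sqrt(n))

# With two occasions, the within-subjects F equals t squared, with (1, n - 1) df.
f_value = t_value ** 2
print(f"t({n - 1}) = {t_value:.3f}, F(1, {n - 1}) = {f_value:.2f}")
```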

To conduct this analysis, you should select the *Repeated Measures*
option from the *General Linear Model* submenu of the *Analyze* menu:

**Analyze**
**General Linear Model**
**Repeated Measures...**

Selecting this option will produce the following dialog box:

This dialog box is used for defining the repeated measures, or
within-subjects, dependent variables. You first give the within-subject factor a
name in the box labeled *Within-Subject Factor Name*. This name should be
something that describes the dependent variables you are grouping together. For
example, in this dialog box, salaries are being analyzed, so the within-subject
factor was given the name *salaries*. Next, specify the number of
levels, or number of measurement occasions, in the box labeled *Number of
Levels*. This is the number of times the dependent variable was measured.
Thus, in the present example, there are two measurement occasions for salary
because you are measuring beginning salaries and current salaries. After you
have filled in the *Within-Subject Factor Name* and the *Number of
Levels* input boxes, click the **Add** button, which will transfer the
information in the input boxes into the box below. Repeat this process until you
have specified all of your within-subject factors. Then, click on the
**Define** button, and the following dialog box will appear:

When this box initially appears, you will see a slot for each level of the
within-subject factor variables that you specified in the previous dialog box.
These slots are labeled numerically for each level of the within-subject factor
but do not contain variable names. You still need to specify which variable
fills each slot of the within-subject factors. To do this, click the variable's
name in the variable list on the left side of the dialog box. Next, click on the
arrow pointing towards the *Within-Subject Variables* box to move
the variable name from the list to the top slot in the within-subjects box. This
process has been completed for *salbegin*, the first level of the
*salaries* within-subject factor. The same process should be repeated for
*salary*, the variable representing an employee's current salary.

After you have completed the specifications for the within-subjects factors,
you can define your independent variables. Between-subjects factors, or fixed
factors, should be moved into the box labeled *Between-Subjects Factor(s)*
by first clicking on the variable name in the variable list, then clicking on
the arrow to the left of the *Between-Subjects Factor(s)* box. In this
example, gender has been selected as a between-subjects factor. Covariates, or
continuous predictor variables, can be moved into the *Covariates* box in
the same manner as the between-subjects factors. Above, *educ*, the
variable representing employees' number of years of education, has been
specified as a covariate.

Running the analysis will produce several output tables, but we will focus here
on the tables describing between-subject and within-subject effects. However,
the univariate analysis of variance tables may not always be appropriate. The
univariate tests have an additional assumption: the assumption of sphericity. If
this assumption is violated, you should use the multivariate output or adjust
your results using one of the correction factors in the SPSS output.
For a more detailed discussion of this topic, see the usage note, *Repeated
Measures ANOVA Using SPSS MANOVA*, in the section
"Within-Subjects Tests: The Univariate versus the Multivariate Approach." This
usage note can be found at http://www.utexas.edu/cc/rack/stat.html.

The following output contains the statistics for the effects in the model
specified in the above dialog boxes:

This table contains information about the within-subject factor,

The output for the repeated measures general linear model also provides
statistics for between-subject effects. In this example, the model contains two
between-subjects effects: employees' education level and their gender. Education
level was entered as a covariate in the model, and therefore the statistics
associated with it are a measure of the linear relationship between education
level and salaries. In contrast, the statistics for the between-subjects factor,
gender, represent a comparison between groups across all levels of the
within-subjects factors. Specifically, it is a comparison between males and
females on their salaries averaged over the beginning and current measurement
occasions. In the above example, both education level and gender are
statistically significant.
The *F* statistic (277.96) and significance level (*p* < .001)
associated with education level allow you to reject the null hypothesis that
there is not a linear relationship between education and salaries. By rejecting
the null hypothesis, you can conclude that there is a positive linear
relationship between the two variables, indicating that as number of years of
education increases, salaries do as well. The *F* statistic (55.79)
for gender and its associated significance level (*p* < .001) represent
a test of the null hypothesis that there are no group differences in salaries.
The significant *F* statistic indicates that you can reject this null
hypothesis and conclude that there is a statistically significant difference
between men's and women's salaries.

To learn more about SPSS, please proceed to the next SPSS tutorial.

14 September 2001

Statistical Support, a division of Research Consulting at ITS

Copyright 2003, UT Austin