Glossary of Statistical Terms
 Autocorrelation (same as Serial Correlation)
In its simplest form (first order), it is the correlation between the error terms at the
observation periods t and t-1 in a correctly specified LRM. Its presence is a violation of
the classical assumption of no serial correlation,
with serious consequences for the reliability of parameter estimates. The autocorrelation
problem is often encountered when using time series data in regression analysis. The form,
consequences, diagnostic tests, and remedies of this problem will be examined in
Econ 5853 & 6976.
More generally, autocorrelation in the context of time series analysis and forecasting refers
to the correlation between the values of a time series separated by a given lag l (i.e., between periods t and t-l,
for l = 1, 2, ...). The lag length l is the number of time periods
between a past value y_{t-l} of a series and the current
value y_{t}.
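As a rough illustration of the lag-l idea, the sketch below computes a sample autocorrelation for a short made-up series. The function name sample_autocorrelation and the data are assumptions for illustration only, and the estimator shown (deviations from the overall mean, divided by the total sum of squared deviations) is one common convention rather than the only one.

```python
def sample_autocorrelation(y, lag):
    """Sample autocorrelation of the series y at the given lag (lag >= 1)."""
    n = len(y)
    if not 1 <= lag < n:
        raise ValueError("lag must satisfy 1 <= lag < len(y)")
    mean = sum(y) / n
    # Cross-products of deviations for the pairs (y_t, y_{t-lag})
    numerator = sum((y[t] - mean) * (y[t - lag] - mean) for t in range(lag, n))
    # Total sum of squared deviations (zero for a constant series)
    denominator = sum((value - mean) ** 2 for value in y)
    return numerator / denominator


series = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]  # hypothetical trending series
r1 = sample_autocorrelation(series, 1)     # lag l = 1
r2 = sample_autocorrelation(series, 2)     # lag l = 2
print(r1, r2)
```

A steadily trending series like this one shows positive autocorrelation at short lags, which is typical of many economic time series.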
 Bar Graph or Chart
A graphical depiction of the frequency distribution, relative
frequency distribution, or percent frequency distribution of a
qualitative variable (or set of data).
Bernoulli or Binomial Experiment
A statistical experiment that conforms to all the properties of the Bernoulli
process, namely: (1) the experiment involves 'n' identical trials, (2) each trial results
in one of two possible outcomes, denoted as success or failure, (3) the probability of a
success (denoted as p) remains constant throughout the experiment, and (4) the outcome of
each trial is independent of those of the previous trials.
Binomial Probability Distribution
A table, graph, or function showing the probability of X successes in 'n'
Bernoulli trials. It is a discrete distribution because possible values are 0, 1, 2, 3,
..., n.
Binomial Probability Function
A mathematical function that describes the probability distribution of a Binomial Random
Variable; it is used for computing the probabilities of successes from 'n' Bernoulli
experiments.
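A minimal sketch of the binomial probability function, P(X = x) = C(n, x) p^x (1 - p)^(n - x), is given below. The function name binomial_pmf and the values n = 4 and p = 0.5 are assumptions chosen only to keep the numbers easy to check by hand.

```python
from math import comb


def binomial_pmf(x, n, p):
    """P(X = x) for a binomial random variable with n trials and success probability p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


n, p = 4, 0.5  # hypothetical experiment: four flips of a fair coin
# The binomial probability distribution as a table: x -> P(X = x)
distribution = {x: binomial_pmf(x, n, p) for x in range(n + 1)}
for x, prob in distribution.items():
    print(x, prob)
```

The probabilities over x = 0, 1, ..., n sum to one, as they must for any discrete probability distribution.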
 Causal Relationship
A relationship that holds true on the average due to the influences of uncertain
factors besides those of the IVs that are explicitly identifiable. For example, Quantity
Demanded of bread (in loaves) increases as the Price of bread
decreases, and conversely, ceteris paribus. The other things
held constant include a host of uncertain factors.
Class Interval
A range of values with specified lower and upper limits that contains a certain number of
cases or frequencies in a frequency distribution. Except for open-ended classes, each
class interval (denoted as j) for a quantitative variable must have a lower limit (LL_{j})
and an upper limit (UL_{j}). The difference between the upper and the lower limits is
the class size or width (W_{j}). The midpoint value
(sometimes called the class mark, and denoted as M_{j}) is the
arithmetic average of the two class limits for each class interval j, given as M_{j} = (LL_{j} + UL_{j})/2 or, alternatively,
as M_{j} = LL_{j} + ½ W_{j}.
Classical Assumptions
In regression analysis, the LRM is based on certain assumptions that must be met in order
for the OLS estimators to be the best available or BLU. These
assumptions include:
1. The regression model is linear in the coefficients B_{j} (j = 1, 2, 3, ..., K where K is the number of IVs in the
model).
2. The error term ε_{i} has a zero mean [i.e., E(ε_{i}) = 0, where E is the expected value operator].
3. All IVs are uncorrelated with ε_{i}.
4. Observations of ε_{i} are uncorrelated with each other (i.e., no
autocorrelation).
5. The ε_{i} has a constant variance (i.e., no heteroskedasticity).
6. No IV is a perfect linear function of any other IV(s) [i.e., no perfect multicollinearity].
7. The ε_{i} is normally distributed with E(ε_{i}) = 0 and a constant
variance σ²_{ε}.
Coefficient of Variation (CV)
A measure of relative dispersion for a data set, found by dividing the standard
deviation by the mean and multiplying by 100. It is used mainly for comparing
variability/dispersion in two or more sets of data when (in the case of two sets of data on
the variables X and Y):
 1. µ_{x} (or xbar) is not equal to µ_{y} (or ybar)
 2. X and Y are measured in different units
 3. X and Y are measured in the same units but the magnitudes are different (say, larger for X
than for Y).
If any of these conditions prevails and, say, CV_{x} is strictly greater
than CV_{y}, then it can be concluded that there is more variation in the X set of data
than in the Y set of data, where
CV_{x} = (σ_{x}/µ_{x}) × 100% [or (S_{x}/xbar) × 100% in the case
of a sample of data]; σ_{x} (or S_{x}) is the population (sample)
standard deviation of X, and µ_{x} (or xbar) is the population
(sample) mean of X. CV_{y} is computed in a similar manner using
σ_{y} (or S_{y}) and µ_{y} (or ybar).
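The comparison described above can be sketched as follows for two samples. The function name coefficient_of_variation and both data sets are hypothetical; the sample formula (S/xbar) × 100 is used because the data are treated as samples.

```python
from statistics import mean, stdev


def coefficient_of_variation(data):
    """Sample CV in percent: (S / xbar) * 100."""
    return stdev(data) / mean(data) * 100


x = [10, 12, 14, 16, 18]       # hypothetical sample with a small mean
y = [100, 101, 102, 103, 104]  # hypothetical sample with a much larger mean
cv_x = coefficient_of_variation(x)
cv_y = coefficient_of_variation(y)
print(cv_x, cv_y)
```

Here CV_{x} exceeds CV_{y}, so relative to their means the X values vary more than the Y values, even though the two sets are on very different scales.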
Collinearity
See Multicollinearity
Command Sequence
A sequence of program commands and related syntax executed through the menu system of the
Windows operating environment.
Continuity Correction Factor
A value of .5 that is added to and/or subtracted from a value of a Binomial
random variable X when the continuous normal probability distribution is used to
approximate the discrete binomial probability distribution.
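The effect of the continuity correction can be checked numerically. The sketch below compares the exact binomial probability P(X ≤ 8) with its normal approximation, with and without the 0.5 adjustment. The helper names and the values n = 20, p = 0.5, x = 8 are assumptions for illustration.

```python
from math import comb, erf, sqrt


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(z / sqrt(2)))


def binomial_cdf(x, n, p):
    """Exact P(X <= x) for a binomial random variable."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(x + 1))


n, p, x = 20, 0.5, 8             # hypothetical experiment
mu = n * p                       # mean of the binomial
sigma = sqrt(n * p * (1 - p))    # standard deviation of the binomial

exact = binomial_cdf(x, n, p)
approx_with_cc = normal_cdf((x + 0.5 - mu) / sigma)   # add .5 for P(X <= x)
approx_without_cc = normal_cdf((x - mu) / sigma)
print(exact, approx_with_cc, approx_without_cc)
```

For a "less than or equal to" probability the .5 is added to x; for a "greater than or equal to" probability it would be subtracted instead. In this example the corrected approximation is noticeably closer to the exact value.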
Continuous Variable
A variable that can assume any value in a given range with no gaps between
successive values. In theory, the range can be as wide as -infinity to +infinity.
Correlation Analysis
A statistical technique for measuring/quantifying the degree or strength of a linear
association between any two variables (in the case of a simple correlation or bivariate
analysis) or among many variables using the partial correlation coefficient while
controlling for the effects of one or more variables (in the case of a multiple
correlation or multivariate analysis). Note that a correlation coefficient is not an
appropriate summary statistic for assessing the degree of a nonlinear relationship.
Correlation Coefficient
A numerical measure of linear association between two variables that takes values from -1 (perfectly strong and indirect relationship) to +1 (perfectly strong and direct
relationship). Values near zero indicate a lack
of linear relationship. A matrix of these coefficients is called the correlation
matrix. It is always a symmetric matrix with ones
(i.e., unity) along the main diagonal.
Covariance
A numerical measure of linear association between two variables. Positive values indicate
a positive relationship, and negative values indicate a negative relationship.
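The link between the covariance and the correlation coefficient can be sketched in a few lines. The function names and the advertising/sales figures are made up for illustration, and the sample (n - 1) divisor is an assumption; the correlation is the covariance divided by the product of the two standard deviations.

```python
from math import sqrt


def sample_covariance(x, y):
    """Sample covariance of x and y, using the n - 1 divisor."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    return sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / (n - 1)


def correlation(x, y):
    """Correlation coefficient: covariance scaled by both standard deviations."""
    return sample_covariance(x, y) / sqrt(
        sample_covariance(x, x) * sample_covariance(y, y)
    )


advertising = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical data
sales = [2.0, 4.0, 5.0, 4.0, 5.0]        # hypothetical data
cov_xy = sample_covariance(advertising, sales)
r_xy = correlation(advertising, sales)
print(cov_xy, r_xy)
```

Unlike the covariance, the correlation coefficient is unit-free, which is why it is always bounded between -1 and +1.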
 Cumulative Frequency Distribution
A tabular summary of a set of quantitative data showing the number of items/cases having
values less than or equal to the upper class limit of each class interval. The cumulative
relative frequency distribution shows the fraction or proportion of the items
having values less than or equal to the upper class limit of each class; while the cumulative
percent frequency distribution shows the percentage of items/cases having values
less than or equal to the upper class limit of each class.
 Data
Measurements or facts that are collected from a statistical unit/entity of interest. They
are classified as quantitative (continuous and discrete) if they contain
numeric information, e.g., sales in dollars and cents (continuous) and the number of students in a
statistics class (discrete), or qualitative if they contain non-numeric
information, e.g., the gender of employees.
Descriptive Statistics
A branch of statistics that is concerned with the use of tabular, graphical, and numerical
methods to summarize data.
Deterministic Relationship
A relationship that holds true in a mathematical sense according to some preconceived rule
or formula. For example, A = WL describes
the relationship between the Area (A), the Width (W) and the Length (L) of a rectangle.
Distance Learning
"Distance Learning" is a general term used to cover the broad range of teaching
and learning events in which the student is separated (at a distance) from the instructor,
or other fellow learners. (Glenn Hoyle, Distance Learning on the Net, January
1997, p. 1). Basically, it is the desired outcome of Distance Education. If the WWW is the
delivery medium, the term "Web-Based Learning" (as the desired
outcome of Web-Based Education) is more appropriate.
 Estimator
An estimator is essentially a rule for computing the numerical value of a
statistic. In general, a good estimator should have certain desirable
properties, namely:
1. Unbiasedness: that is, its expected value must be equal
to the true unknown value of the parameter which it is designed to estimate.
2. Relative Efficiency: that is, its variance must be the
smallest possible when compared with the variances of all other competing estimators for
the same parameter.
3. Consistency: that is, its value must approach the true
value of the parameter that it is designed to estimate as the sample size 'n' increases.
4. Sufficiency: that is, it must use all of the information
contained in the sample of data that is used in the computation of its value.
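Two of these properties, unbiasedness and consistency, can be illustrated with a small simulation using the sample mean as the estimator. Everything in the sketch is an assumption made for illustration: the normal population, its parameters MU and SIGMA, the sample sizes, and the fixed random seed.

```python
import random

random.seed(0)          # fixed seed so the run is repeatable
MU, SIGMA = 50.0, 10.0  # hypothetical population mean and standard deviation


def sample_mean(n):
    """Draw a random sample of size n from the population and return its mean."""
    return sum(random.gauss(MU, SIGMA) for _ in range(n)) / n


# Unbiasedness: averaged over many small samples, the estimates centre on MU.
small_sample_means = [sample_mean(5) for _ in range(2000)]
average_of_means = sum(small_sample_means) / len(small_sample_means)

# Consistency: with a very large sample, a single estimate lies close to MU.
estimate_large_n = sample_mean(100_000)
print(average_of_means, estimate_large_n)
```

A simulation of this kind only suggests the properties; it does not prove them.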
Expected Value
A measure of the central tendency/location of a random variable.
Thus, E(X) = µ.
 Fact:
Verified data or sample evidence used along with probability theory to support
hypothesis testing procedures. For example, the statement "A 10% increase in
advertising expenditure resulted in a $2 million increase in sales" is a fact because
its validity is based on observed data on the two decision variables, namely, sales and
advertising expenditure.
Frequency Distribution
A table that shows the number of cases/items that fall in each of several
non-overlapping classes of the data. The numbers in each class are referred to as frequencies.
When the number of cases/items is expressed as a proportion in each class, the table
is referred to as a relative frequency distribution or a percentage
distribution.
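A frequency distribution, together with its relative and cumulative versions, can be built with a short loop. In the sketch below the function name frequency_table, the ages, and the class limits are all hypothetical, and each class is taken to include both of its limits.

```python
def frequency_table(values, classes):
    """Rows of (LL, UL, frequency, relative frequency, cumulative frequency)."""
    n = len(values)
    rows, cumulative = [], 0
    for lower, upper in classes:
        # Count the cases that fall in this class (both limits inclusive)
        frequency = sum(1 for v in values if lower <= v <= upper)
        cumulative += frequency
        rows.append((lower, upper, frequency, frequency / n, cumulative))
    return rows


ages = [21, 25, 33, 38, 41, 44, 47, 52, 58, 59]     # hypothetical data
classes = [(20, 29), (30, 39), (40, 49), (50, 59)]  # non-overlapping classes
table = frequency_table(ages, classes)
for row in table:
    print(row)
```

The last cumulative frequency equals the total number of cases, and the relative frequencies sum to one.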
 Grouped Data
Data that have been organized into a frequency distribution. Thus, for a variable X the
individual values (X_{i}) in the original data set are unobservable. The
distinction between grouped data and ungrouped data (data that has not been organized or
summarized in any manner) is important: the formulas for calculating basic statistics
(mode, median, mean, variance, and standard deviation) differ for the two types of
data.
 Histogram
A graphical depiction of the frequency distribution, relative frequency
distribution, or percent frequency distribution of a quantitative variable (or data).
Hypothesis:
A statement whose validity is to be determined from sample evidence (verified data). For
example, the statement "An increase in advertising expenditure will result in an
increase in sales" is a causal hypothesis that stipulates a positive relationship between
the two variables. As another example, the statement that the average height of all
females in the U.S., eighteen years and older, is 5' 6" is a hypothesis about the value of the parameter 'µ'.
Heteroskedasticity
It is the non-constancy of the error variance (σ²_{εi}) across
different observations, so that the error terms cannot be considered as being drawn
from distributions with a common variance. This is a violation of the classical
assumption of homoskedasticity (constant variance), an assumption which is not always realistic in
econometric practice. This problem is common when using cross-sectional data in regression
analysis. The nature, consequences, diagnostic tests, and remedies of the problem will be
examined in Econ 5853 & 6976.
Homoskedasticity
See Heteroskedasticity, and also the classical assumptions.
 Inferential Statistics
A branch of statistics that is concerned with the use of sample evidence and probability
theory to make safe generalizations about the characteristics of a population. The two main
aspects or sub-branches are interval estimation
and hypothesis testing.
 Mean
A numerical measure of central tendency/location in a set of data. For a population of
data, it is the value of the parameter µ; for a sample of data, it is the value of the
statistic Xbar.
The numerical measure is derived by summing all the values/numbers and then dividing the
sum by the number of observations ('N' in the case of a population or 'n' in the case of a
sample). The mean
understates (overstates) the true value of the central tendency if there is a
minimum (maximum) value outlier. Despite this flaw, the sample mean
(xbar) has some nice properties that make it the most reliable/popular estimator for
making inferences about the population mean µ
or central tendency. These properties include:
1. Unbiasedness: that is, its expected value is equal to the
true value of the parameter µ (which is always unknown).
2. Relative Efficiency: that is, its variance is the
smallest when compared with the variances of the competing summary measures
(e.g., the median) of central tendency or µ.
3. Consistency: that is, its value approaches the true
value of µ (which is always unknown) as the sample size 'n' increases.
4. Sufficiency: that is, it uses all of the information
contained in the sample of data from which its value is computed.
Median
The middlemost value when all the observed values are arranged in
ascending or descending numerical order. It is another measure of central
tendency in a given set of data. It is a better measure of central tendency than the mean
when there are outliers in the data set.
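The sensitivity of the mean, and the relative robustness of the median, to a maximum value outlier can be seen in a small example. The salary figures below are hypothetical.

```python
from statistics import mean, median

salaries = [30, 32, 35, 38, 40]  # hypothetical salaries, in $1,000s
with_outlier = salaries + [400]  # same data plus one maximum value outlier

mean_before, median_before = mean(salaries), median(salaries)
mean_after, median_after = mean(with_outlier), median(with_outlier)
print(mean_before, median_before)
print(mean_after, median_after)
```

The single outlier pulls the mean far above most of the observations, while the median moves only slightly.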
Midpoint Value
See Class Interval
Mode
The value that occurs most often in a set of data. It is also another
measure of central tendency in a given set of data. Modal Types: The distribution of a
data set is said to be unimodal if it
contains only one mode; it is said to be bimodal
if it contains two distinct modes; and it is said to be multimodal
if it contains more than two distinct modes.
Multicollinearity
It is a violation of the classical assumption that
the IVs not be linearly related to one another. Collinearity is often used to describe the
correlation between two IVs, especially in an LRM that involves only two IVs.
Multicollinearity refers to the correlation among two or more IVs in an LRM. This makes it
difficult to interpret the regression coefficient B_{j} as reflecting the partial
effect of X_{j} on the DV, since the other IVs cannot be held constant.
Note that multicollinearity does not depend on any theoretical or actual relationship
among any of the IVs; it depends on the existence of an approximate linear relationship in
the data set at hand. In other words, it is a problem often caused by the particular
sample available. The nature, consequences, and diagnostic tests of multicollinearity will be
examined in Econ 5853 & 6976.
 Normal Probability Distribution
A probability distribution of a continuous random variable. Its pdf is
bell-shaped and is determined by the two parameters, µ (mu) and
σ (sigma).
 Objective:
A statement of purpose. For example, the statement "The company wishes to increase
sales by 20% next quarter" expresses the objective of the firm.
Ordinary Least Squares (OLS) Method
A mathematical technique for estimating the sample regression equation to obtain the OLS
estimators, which are then used to make inferences about the regression
parameters. The technique involves the use of differentiation rules to minimize
the residual or error sum of squares (ESS). The derived estimators are BLU: they are
best (efficient) in that they have the smallest possible variance, linear in
that they can be expressed as linear functions of the DV, and unbiased in that their expected
values equal the true unknown values of the parameters which they are designed to
estimate, provided all the classical assumptions
are met.
See Estimator for a summary of the desirable properties of a good
estimator.
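For a simple regression with one IV, minimizing the ESS leads to the familiar closed-form solution: the slope is the sum of cross-products of deviations divided by the sum of squared deviations of the IV, and the intercept is ybar minus the slope times xbar. The sketch below applies these formulas to made-up data; the function name ols_simple and the variable names are assumptions.

```python
def ols_simple(x, y):
    """OLS intercept and slope for the simple regression of y (DV) on x (IV)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))  # cross-products
    sxx = sum((a - xbar) ** 2 for a in x)                     # squared deviations of x
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    return intercept, slope


x = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical IV values
y = [2.0, 4.0, 5.0, 4.0, 5.0]  # hypothetical DV values
intercept, slope = ols_simple(x, y)

# Residuals and the error sum of squares (ESS) that OLS minimizes
residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
ess = sum(e**2 for e in residuals)
print(intercept, slope, ess)
```

Any other choice of intercept and slope would give a larger ESS for these data. When the model includes an intercept, the OLS residuals also sum to zero.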
Outlier(s)
One or more data values that depart significantly from the rest of the values either by
being too big [maximum value outlier(s)] or
too small [minimum value outlier(s)].
Outliers can cause trouble with statistical analysis, so they should be identified and acted
on prior to analysis.
 Pie Chart
A graphical device for presenting qualitative data where the area of the whole
pie represents 100% of the data being studied and the slices (or subdivisions of the
circle) correspond to the relative frequency for each class (or subdivision or sector).
Population
The set of all elements in the universe of interest to the researcher. A frame
comprises the elementary units with the appropriate restrictions imposed on the target
population. A sample is a subset of the population or frame. When a
researcher gathers data from the whole population for a given measurement, it is called a
census (e.g., the U. S. population census every ten years with the restriction that those
eligible must be U. S. citizens, permanent residents are excluded). The population size is
often denoted as N ('n' for the sample size).
Parameter
A summary measure whose value is contained/embedded in a population of data. In
most instances this value is unknown; hence it must be estimated from that of the
corresponding sample statistic. For example, µ
is a parameter while the corresponding sample statistic is Xbar.
Probability distribution
A table, graph, or mathematical function that describes how the probabilities are
distributed over the values that the random variable of interest (X) can assume.
Probability Density Function (PDF)
A probability distribution of a continuous random variable. For example, if a
continuous random variable X is distributed as normal, then its mathematical function f(x)
is a pdf.
Probability Mass Function (PMF)
A probability distribution of a discrete random variable. For example, if a discrete
random variable X has a binomial distribution, then its mathematical function f(x) is a
pmf.
 Qualitative Data
Data that provide or contain non-numeric information; they serve merely as labels or names
for identifying special attributes of the statistical entity/unit of interest. Qualitative
data can be rendered numeric by coding the non-numeric values. A variable that assumes
qualitative values is called a Qualitative Variable. An example is the Gender of employees with the values Male
or Female.
Quantitative Data
Data that provide or contain information as to how much or how many; hence they
are always numeric. A variable that assumes quantitative values is called a Quantitative variable. An example is the Salary
or Experience (in years) of the employees.
 Random Sample
A sample drawn in such a way that each member of the population has an equal chance of being selected.
Random Variable
A variable that takes on different numerical values that are determined by chance. For
example, in an experiment of flipping a fair coin three times, if X denotes the random outcome
of the number of heads that could show up, then the possible values (x) are X: 0, 1, 2, 3. In this case, X
is a Discrete random variable because it assumes only a
finite sequence of values (with gaps between them). A random variable that assumes any
value in an interval or collection of intervals (a continuum, no gaps
between successive values) is called a Continuous random variable.
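The coin-flipping example can be worked out by listing every equally likely outcome. The sketch below, whose variable names are chosen purely for illustration, builds the probability distribution of X (the number of heads in three flips of a fair coin) and its expected value E(X).

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 2 x 2 x 2 = 8 equally likely outcomes of three flips, e.g. ('H', 'T', 'H')
outcomes = list(product("HT", repeat=3))

# X = number of heads in each outcome; count how often each value occurs
counts = Counter(outcome.count("H") for outcome in outcomes)

# Probability distribution of X as exact fractions
pmf = {x: Fraction(count, len(outcomes)) for x, count in sorted(counts.items())}

# Expected value: each possible value weighted by its probability
expected_value = sum(x * prob for x, prob in pmf.items())
print(pmf, expected_value)
```

The resulting probabilities are 1/8, 3/8, 3/8 and 1/8 for X = 0, 1, 2 and 3, and E(X) = 3/2, a value that X itself can never take.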
Regression Analysis
A statistical technique for measuring/quantifying the type of causal
relationship among variables; one of which is the Dependent Variable (DV)
while the others are the Independent Variables (IVs). The analysis is
called Simple Regression if there is only one IV in the model; it is
called Multiple Regression if there are two or more IVs in the model. A
regression model whether in the Simple or Multiple form can be used for prediction
purposes as well as for testing existing economic theories, among others. Regression analysis is the heart of Econometrics.
Some historical notes. The term regression was introduced by Francis
Galton (1886) in his famous paper in which he found that although there was a
tendency for tall parents to have tall children and for short parents to have short
children, the average height of children born of parents of a given height tended
to move or "regress" toward the average height in the population as a whole.
Galton's law of universal regression, as it later came to be known, was confirmed by his friend, Karl Pearson (1903), who used more than a thousand
records of heights of members of family groups.
 Sample
A subset of the population of interest to the researcher. The size is often denoted as n.
In practice, we will be interested in a random sample for the
purpose of making reasonable inferences about the population being studied/analyzed.
Sample Statistic
A summary measure/value computed from a sample of data. Thus, this value is always known.
For example, Xbar is a statistic whose value from a sample of size 'n' can be used to
make inferences (point or interval estimation) about the true unknown value of the
population mean µ. See the mean for a discussion of the desirable properties that make xbar a good
estimator of µ
Skewness
A measure of the symmetry or lack of it in a set of data as apparent from the shape of the
distribution. The three measures of shape are skewness, kurtosis, and box-and-whisker
plots. A distribution is said to be symmetric if the left half of the graph of the
distribution is the mirror image of the right half. If a distribution is skewed to the right (positive skewness), it must be the
case that the mean is greater than the median, which in turn is greater than the mode (i.e., mean > median > mode); in which case the
skewness coefficient is greater than zero. If a distribution is skewed
to the left (negative skewness), then the relationship is reversed; in
which case the coefficient is less than zero. If there is no
skewness, or the distribution is symmetric like the bell-shaped
normal curve, then the mean = median = mode.
Karl Pearson is credited with developing
at least two coefficients of skewness (S_{k}) that can be used to assess the degree
of skewness in a distribution. One is given as S_{k} = 3(µ - M_{d})/σ, where µ, M_{d}, and σ
are the population mean, median, and standard deviation, respectively. This is the same Pearson
that also developed the Coefficient
of Correlation, as well as the Pearson Chi-Square
statistic. Imagine how irrational the decision making process would have been
without these summary measures that allow us to uncover patterns and relationships
inherent in bodies of data.
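A minimal sketch of the Pearson coefficient S_{k} = 3(mean - median)/standard deviation is given below. The function name pearson_skewness and both data sets are hypothetical, and each data set is treated as a complete population, so the population standard deviation is used.

```python
from statistics import mean, median, pstdev


def pearson_skewness(data):
    """Pearson coefficient of skewness: 3 * (mean - median) / population std. dev."""
    return 3 * (mean(data) - median(data)) / pstdev(data)


right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]  # hypothetical data with a long right tail
symmetric = [1, 2, 3, 4, 5]               # hypothetical symmetric data

sk_right = pearson_skewness(right_skewed)
sk_symmetric = pearson_skewness(symmetric)
print(sk_right, sk_symmetric)
```

The coefficient is positive for the right-skewed data, where the mean exceeds the median, and zero for the symmetric data.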

 Standard Deviation
A measure of dispersion for a body/set of data, found by taking the positive
square root of the variance.

 Statistical Analysis (Types)
A statistical analysis is said to be Univariate if the
applicable technique involves only one statistical variable (e.g. finding the average age
of all female medical doctors in the U.S.); it is said to be Bivariate
if the applicable technique involves two variables (e.g. Simple Regression analysis of the
effect of Advertising Expenditure on Sales); and it is said to be Multivariate
if the applicable technique involves more than two variables (e.g. Multiple
Regression analysis of the effects of annual Family Income and Family Size on annual Family
Food Expenditure).
 Transformation
Replacing each data value by a different number (such as its logarithm) to facilitate
statistical analysis. The logarithm often transforms skewness into symmetry by stretching
the scale near zero, thus spreading out all the small values that had been bunched
together. It also pulls together the very large data values which had been thinly
scattered at the high end of the scale.
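The effect of a logarithmic transformation on a right-skewed data set can be sketched as follows. The income figures are invented for illustration and were deliberately chosen as powers of ten so that the result is easy to verify by hand.

```python
from math import log10
from statistics import mean, median

# Hypothetical incomes: bunched near the low end, thinly spread at the high end
incomes = [1_000, 10_000, 100_000, 1_000_000, 10_000_000]
logged = [log10(value) for value in incomes]  # base-10 logarithm of each value

print(mean(incomes), median(incomes))  # mean far above the median: right skew
print(mean(logged), median(logged))    # mean and median agree: symmetric
```

On the original scale the mean is many times larger than the median. After the transformation the values are evenly spaced and the two measures coincide.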
 Uniform Probability Distribution
A probability distribution in which equal probabilities are assigned to all values of a
random variable. The distribution can be a pdf (probability density
function) or a pmf (probability mass function) depending on whether the random
variable X is continuous or discrete.
 Variable
A characteristic or an attribute of the statistical unit/entity of interest with values
that are numeric (in the case of a quantitative variable) or non-numeric
(in the case of a qualitative variable). The standard notation for a
variable is X in the case of a univariate analysis, X
and Y in the case of a bivariate analysis, or X, Y and
Z in the case of a three-variable multivariate analysis.
Copyright© 1996, Ebenge Usip, all rights reserved.
Last revised:
Tuesday, July 09, 2013.
