## Glossary of Statistical Terms

Dr. Usip, Economics



# - A -

Autocorrelation (same as Serial Correlation)
In its simplest form (first order), it is the correlation between the error terms at observation periods t and t-1 in a correctly specified linear regression model (LRM). Its presence violates the classical assumption of no serial correlation, with serious consequences for the reliability of parameter estimates. The autocorrelation problem is often encountered when using time series data in regression analysis. The form, consequences, diagnostic tests, and remedies of this problem will be examined in Econ 5853 & 6976.

Generally, autocorrelation in the context of time series analysis and forecasting refers to the correlation between the values of a time series at a given lag l (i.e., between y(t) and y(t-l) for l = 1, 2, ...). The lag length l is the number of time periods skipped in associating a past value y(t-l) of the series with a current value y(t).
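The lag-l autocorrelation described above can be sketched in a few lines of Python using only the standard library; the series values below are made-up illustration data, not from any real data set.

```python
def autocorrelation(y, lag):
    """Sample autocorrelation of the series y at the given lag."""
    n = len(y)
    mean = sum(y) / n
    # Denominator: total sum of squared deviations from the mean
    denom = sum((v - mean) ** 2 for v in y)
    # Numerator: co-movement between the series and itself shifted by `lag`
    num = sum((y[t] - mean) * (y[t - lag] - mean) for t in range(lag, n))
    return num / denom

# A steadily trending series has a lag-1 autocorrelation close to +1
series = [2.0, 2.5, 3.1, 3.6, 4.2, 4.8, 5.1, 5.9]
r1 = autocorrelation(series, 1)
print(round(r1, 3))
```

An alternating series such as `[1, -1, 1, -1, ...]` would instead give a strongly negative lag-1 value.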

# - B -

Bar Graph or Chart
A graphical depiction of the frequency distribution, relative frequency distribution, or percent frequency distribution of a qualitative variable (or set of data).

Bernoulli or Binomial Experiment
A statistical experiment that conforms with all the properties of the Bernoulli process, namely: (1) the experiment involves 'n' identical trials, (2) each trial results in two possible outcomes denoted as success or as failure, (3) the probability of a success (denoted as p) remains constant throughout the experiment, and (4) the outcome of each trial is independent of those of the previous trials.

Binomial Probability Distribution
A table, graph, or function showing the probability of X successes in 'n' Bernoulli trials. It is a discrete distribution because its possible values are 0, 1, 2, 3, ..., n.

Binomial Probability Function
A mathematical function that describes the probability distribution of a Binomial Random Variable; it is used for computing the probabilities of successes from 'n' Bernoulli experiments.
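As a minimal sketch, the binomial probability function can be written directly from its textbook form, P(X = x) = C(n, x) p^x (1-p)^(n-x), using the standard library:

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x): probability of exactly x successes in n Bernoulli trials,
    each with success probability p."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Example: probability of exactly 2 heads in 3 flips of a fair coin
print(binomial_pmf(2, 3, 0.5))  # 3 * 0.25 * 0.5 = 0.375
```

Because the pmf covers every possible outcome, the probabilities for x = 0, 1, ..., n always sum to 1.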

# - C -

Causal Relationship
A relationship that holds true on the average due to the influences of uncertain factors besides those of the IVs that are explicitly identifiable. For example, Quantity Demanded of bread (in loaves) increases as the Price of bread decreases, and conversely, ceteris paribus. The other things held constant include income and a host of uncertain factors.

Class Interval
A range of values with specified lower and upper limits that contains a certain number of cases or frequencies in a frequency distribution. Except for open-ended classes, each class interval (denoted as j) for a quantitative variable must have a lower limit (LLj) and an upper limit (ULj). The difference between the upper and the lower limits is the class size or width (Wj). The midpoint value (sometimes called the class mark, denoted Mj) is the arithmetic average of the two class limits for each class interval j, given as Mj = (LLj + ULj)/2 or, alternatively, as LLj + ½Wj.
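The two midpoint formulas above are equivalent, which a couple of lines of Python can confirm for an illustrative class interval:

```python
def midpoint(lower, upper):
    # Mj = (LLj + ULj) / 2
    return (lower + upper) / 2

def midpoint_alt(lower, width):
    # Equivalent form: Mj = LLj + (1/2) * Wj
    return lower + width / 2

# Class interval 10-20: width W = 10, midpoint M = 15
assert midpoint(10, 20) == midpoint_alt(10, 20 - 10) == 15.0
```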

Classical Assumptions
In regression analysis, the LRM is based on certain assumptions that must be met in order for the OLS estimators to be the best available, or BLU (best, linear, unbiased). These assumptions include:
1. The regression model is linear in the coefficients Bj (j = 1, 2, 3, ..., K, where K is the number of IVs in the model).
2. The error term εi has a zero mean [i.e., E(εi) = 0, where E is the expected value operator].
3. All IVs are uncorrelated with εi.
4. Observations of εi are uncorrelated with each other (i.e., no autocorrelation).
5. The εi has a constant variance (i.e., no heteroskedasticity).
6. No IV is a perfect linear function of any other IV(s) [i.e., no perfect multicollinearity].
7. The εi is normally distributed with E(εi) = 0 and a constant variance σ²ε.

Coefficient of Variation (CV)
A measure of relative dispersion for a data set, found by dividing the standard deviation by the mean and multiplying by 100. It is used basically for comparing variability/dispersion in two or more sets of data when (in the case of two sets of data on the variables X and Y):
1. µx (or X-bar) is not equal to µy (or Y-bar)
2. X and Y are measured in different units
3. X and Y are measured in same units but the magnitudes are different (say larger for X than for Y).
If any of these conditions prevails and, say, CVx is strictly greater than CVy, then it can be concluded that there is more variation in the X data set than in the Y data set, where
CVx = (σx/µx) × 100% [or (Sx/x-bar) × 100% in the case of a sample of data]; σx (or Sx) is the population (sample) standard deviation of X, and µx (or x-bar) is the population (sample) mean of X. CVy is computed in a similar manner using σy (or Sy) and µy (or y-bar).
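A short sketch of the comparison, using hypothetical data measured in different units (salaries in dollars versus experience in years), where a raw standard deviation comparison would be meaningless:

```python
from statistics import mean, pstdev

def cv(data):
    """Coefficient of variation as a percentage: (sigma / mu) * 100."""
    return pstdev(data) / mean(data) * 100

# Hypothetical data in different units
salaries = [40_000, 52_000, 61_000, 48_000, 75_000]   # dollars
experience = [3, 7, 10, 5, 15]                         # years

print(f"CV(salary)     = {cv(salaries):.1f}%")
print(f"CV(experience) = {cv(experience):.1f}%")
```

Here CV(experience) exceeds CV(salary), so experience is relatively more dispersed even though its standard deviation in raw units is tiny by comparison.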

Collinearity
See Multicollinearity

Command Sequence
A sequence of program commands and related syntax executed through the menu system of the Windows operating environment.

Continuity Correction Factor
A value of .5 that is added to and/or subtracted from a value of a Binomial random variable X when the continuous normal probability distribution is used to approximate the discrete binomial probability distribution.
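The correction can be sketched with the standard library alone, using the error function for the normal CDF; the numbers below are an illustrative fair-coin example, not from the course materials:

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Normal approximation to P(X <= 55) for X ~ Binomial(n=100, p=0.5)
n, p = 100, 0.5
mu, sigma = n * p, sqrt(n * p * (1 - p))   # mu = 50, sigma = 5
# Continuity correction: the discrete value 55 covers the interval up to 55.5
approx = normal_cdf((55 + 0.5 - mu) / sigma)
print(round(approx, 4))  # ≈ 0.8643
```

Without the added .5, the approximation would evaluate the CDF at z = 1.0 instead of z = 1.1 and noticeably understate the probability.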

Continuous Variable
A variable that can assume any value in a given range with no gaps between successive values. In theory, the range can be as wide as ±infinity

Correlation Analysis
A statistical technique for measuring/quantifying the degree or strength of a linear association between any two variables (in the case of a simple correlation or bivariate analysis) or among many variables using the partial correlation coefficient while controlling for the effects of one or more variables (in the case of a multiple correlation or multivariate analysis). Note that a correlation coefficient is not an appropriate summary statistic for assessing the degree of a nonlinear relationship.

Correlation Coefficient
A numerical measure of linear association between two variables that takes values between -1 (a perfectly strong indirect relationship) and +1 (a perfectly strong direct relationship). Values near zero indicate a lack of linear relationship. A matrix of these coefficients is called the correlation matrix. It is always a symmetric matrix with ONES (i.e., unity) along the main diagonal.
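A minimal from-scratch sketch of the (Pearson) correlation coefficient, checked against the two perfect cases named above:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))  # co-deviation
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
print(pearson_r(x, [2, 4, 6, 8, 10]))   # perfectly direct: 1.0
print(pearson_r(x, [10, 8, 6, 4, 2]))   # perfectly indirect: -1.0
```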

Covariance
A numerical measure of linear association between two variables. Positive values indicate a positive relationship, and negative values indicate a negative relationship.

Cumulative Frequency Distribution
A tabular summary of a set of quantitative data showing the number of items/cases having values less than or equal to the upper class limit of each class interval. The cumulative relative frequency distribution shows the fraction or proportion of the items having values less than or equal to the upper class limit of each class; while the cumulative percent frequency distribution shows the percentage of items/cases having values less than or equal to the upper class limit of each class.
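The three cumulative variants (frequency, relative frequency, and percent) can be built with a running sum; the class intervals and counts below are made-up exam-score data for illustration:

```python
from itertools import accumulate

classes = ["60-69", "70-79", "80-89", "90-99"]
freq = [4, 10, 8, 3]
n = sum(freq)                       # 25 cases in total

cum_freq = list(accumulate(freq))   # [4, 14, 22, 25]
cum_rel = [f / n for f in cum_freq]         # cumulative relative frequency
cum_pct = [100 * r for r in cum_rel]        # cumulative percent frequency

for c, cf, cp in zip(classes, cum_freq, cum_pct):
    print(f"<= upper limit of {c}: {cf:2d} cases ({cp:.0f}%)")
```

The last class always accumulates to n cases (100%), which is a handy sanity check on the table.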

# - D -

Data
Measurements or facts that are collected from a statistical unit/entity of interest. They are classified as quantitative (continuous or discrete) if they contain numeric information (e.g., sales in dollars and cents is continuous; the number of students in a statistics class is discrete), or qualitative if they contain nonnumeric information (e.g., the gender of employees).

Descriptive Statistics
A branch of statistics that is concerned with the use of tabular, graphical, and numerical methods to summarize data.

Deterministic Relationship
A relationship that holds true in a mathematical sense according to some preconceived rule or formula. For example, A = WL describes the relationship between the Area (A), the Width (W) and the Length (L) of a rectangle.

Distance Learning
"Distance Learning" is a general term used to cover the broad range of teaching and learning events in which the student is separated (at a distance) from the instructor, or other fellow learners. (Glenn Hoyle, Distance Learning on the Net, January 1997, p. 1). Basically, it is the desired outcome of Distance Education. If the WWW is the delivery medium, the term "Web-Based Learning" (as the desired outcome of Web-Based Education) is more appropriate.

# - E -

Estimator
An estimator is essentially a rule for computing the numerical value of a statistic. In general, a good estimator should have certain desirable properties, namely:
1. Unbiasedness - That is, its expected value must be equal to the true unknown value of the parameter which it is designed to estimate.
2. Relative Efficiency - That is, its variance must be the smallest possible when compared with the variances of all other competing estimators for the same parameter.
3. Consistency - That is, its value must approach the true value of the parameter that it is designed to estimate as the sample size 'n' increases.
4. Sufficiency - That is, it must use all of the information contained in the sample of data in the computation of its value.

Expected Value
A measure of the central tendency/location of a random variable. Thus, E(X) = µ.

# - F -

Fact:
Verified data or sample evidence used along with probability theory to support hypothesis testing procedures. For example, the statement "A 10% increase in advertising expenditure resulted in a \$2 million increase in sales" is a fact because its validity is based on observed data on the two decision variables, namely, sales and advertising expenditure.

Frequency Distribution
A table that shows the number of cases/items that fall in each of several non-overlapping classes of the data. The numbers in each class are referred to as frequencies. When the cases/items in each class are expressed as proportions, the table is referred to as a relative frequency distribution or a percentage distribution.

# - G -

Grouped Data
Data that have been organized into a frequency distribution. Thus, for a variable X the individual values (Xi) in the original data set are unobservable. The distinction between grouped data and ungrouped data (data that has not been organized or summarized in any manner) is important: the formulas for calculating basic statistics (mode, median, mean, variance, and standard deviation) differ for the two types of data.

# - H -

Histogram
A graphical depiction of the frequency distribution, relative frequency distribution, or percent frequency distribution of a quantitative variable (or data).

Heteroskedasticity
It is the nonconstancy of the error variance (σ²εi) across observations, so that the error terms cannot be regarded as drawn from a single distribution with a common variance. This is a violation of the classical assumption of homoskedasticity (constant variance), which is not always realistic in econometric practice. This problem is common when using cross-sectional data in regression analysis. The nature, consequences, diagnostic tests, and remedies of the problem will be examined in Econ 5853 & 6976.

Homoskedasticity
See Heteroskedasticity, and also the Classical Assumptions.

Hypothesis:
A statement whose validity is to be determined from sample evidence (verified data). For example, the statement "An increase in advertising expenditure will result in an increase in sales" is a causal hypothesis that stipulates a positive relationship between the two variables. As another example, the statement that the average height of all females in the U.S., eighteen years and older, is 5'6" is a hypothesis about the value of the parameter 'µ'.

# - I -

Inferential Statistics
A branch of statistics that is concerned with the use of sample evidence and probability theory to make safe generalizations about the characteristics of a population. The two main aspects or sub-branches are interval estimation and hypothesis testing.


# - M -

Mean
A numerical measure of central tendency/location in a set of data. For a population of data, it is the value of the parameter µ; for a sample of data, it is the value of the statistic X-bar. The measure is derived by summing all the values and then dividing the sum by the number of observations ('N' in the case of a population, 'n' in the case of a sample). Caution: The mean understates (overstates) the true central tendency if there is a minimum-value (maximum-value) outlier. Despite this flaw, the sample mean (x-bar) has some nice properties that make it the most reliable/popular estimator for making inferences about the population mean µ or central tendency. These properties include:
1. Unbiasedness - That is, its expected value is equal to the true value of the parameter µ (which is always unknown)
2. Relative Efficiency - That is, its variance is the smallest when compared with the variances of the competing summary measures (e.g., the median) of central tendency or µ
3. Consistency - That is, its value approaches the true value of  µ (which is always unknown) as the sample size 'n' increases.
4. Sufficiency - That is, it uses all of the information contained in the sample of data in the computation of its value.

Median
The middlemost value when all the observed values are arranged in numerical order either ascending or descending manner. It is another measure of central tendency in a given set of data. It is a better measure of central tendency than the mean when there are outliers in the data set.

Mid-Point Value
see class interval

Mode
The value that occurs most often in a set of data. It is also another measure of central tendency in a given set of data. Modal Types: The distribution of a data set is said to be unimodal if it contains only one mode; it is said to be bimodal if it contains two distinct modes; and it is said to be multimodal if it contains more than two distinct modes.
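The three measures of central tendency, and the outlier caution noted under the Mean, can be illustrated with the standard library's `statistics` module on a small made-up sample:

```python
from statistics import mean, median, mode

data = [5, 7, 7, 8, 9]                        # small illustrative sample
print(mean(data), median(data), mode(data))   # 7.2 7 7

# A single maximum-value outlier pulls the mean up sharply
# but leaves the median nearly unchanged.
with_outlier = data + [100]
print(mean(with_outlier))     # ≈ 22.67, overstating the center
print(median(with_outlier))   # 7.5, still representative
```

This is why the median is preferred over the mean when outliers are present.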

Multicollinearity
It is a violation of the classical assumption that the IVs not be linearly related to one another. Collinearity is often used to describe the correlation between two IVs, especially in a LRM that involves only two IVs. Multicollinearity refers to the correlation among two or more IVs in a LRM. This makes it difficult to interpret the regression coefficient Bj as reflecting the partial effect of Xj on the DV, since the other IVs cannot be held constant.

Note that multicollinearity does not depend on any theoretical or actual relationship among any of the IVs; it depends on the existence of an approximate linear relationship in the data set at hand. In other words, it is a problem often caused by the particular sample available. The nature, consequences, and diagnostic tests of multicollinearity will be examined in Econ 5853 & 6976.

# - N -

Normal Probability Distribution
A probability distribution of a continuous random variable. Its pdf is bell-shaped and is determined by the two parameters, µ (mu) and σ (sigma).

# - O -

Objective:
A statement of purpose. For example, the statement "The company wishes to increase sales by 20% next quarter" expresses the objective of the firm.

Ordinary Least Squares (OLS) Method
A mathematical technique for estimating the sample regression equation to obtain the OLS estimators, which are then used to make inferences about the regression parameters. The technique uses differentiation rules to minimize the residual or error sum of squares (ESS). The derived estimators are BLU -- provided all the classical assumptions are met: Best in that they are efficient, having the smallest possible variance; Linear in that they can be expressed in terms of the DV; and Unbiased in that their expected values equal the true unknown values of the parameters they are designed to estimate.

See estimator for a summary of the desirable properties of a good estimator.
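For the simple (one-IV) case, the OLS estimates have closed forms that fall out of the minimization: b1 = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)² and b0 = ȳ - b1·x̄. A sketch using illustrative data:

```python
def ols_simple(x, y):
    """OLS intercept and slope for a simple (one-IV) regression."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # Slope: ratio of co-deviation to the IV's own squared deviation
    b1 = (sum((a - mx) * (b - my) for a, b in zip(x, y))
          / sum((a - mx) ** 2 for a in x))
    b0 = my - b1 * mx            # intercept passes through the means
    return b0, b1

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.0, 9.8]    # made-up data, roughly y = 2x
b0, b1 = ols_simple(x, y)
print(round(b0, 3), round(b1, 3))   # slope ≈ 1.95, intercept ≈ 0.15
```

The fitted line always passes through the point of means (x̄, ȳ), which the intercept formula guarantees by construction.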

Outlier(s)
One or more data values that depart significantly from the rest of the values, either by being too big [maximum-value outlier(s)] or too small [minimum-value outlier(s)]. Outliers can cause trouble with statistical analysis, so they should be identified and acted on prior to analysis.

# - P -

Pie Chart
A graphical device for presenting qualitative data where the area of the whole pie represents 100% of the data being studied, and the slices (or subdivisions of the circle) correspond to the relative frequency of each class (or subdivision or sector).

Parameter
A summary measure whose value is contained/embedded in a population of data. In most instances this value is unknown; hence it must be estimated from that of the corresponding sample statistic. For example, µ is a parameter, while the corresponding sample statistic is X-bar.

Population
The set of all elements in the universe of interest to the researcher. A frame comprises the elementary units with the appropriate restrictions imposed on the target population. A sample is a subset of the population or frame. When a researcher gathers data from the whole population for a given measurement, it is called a census (e.g., the U.S. population census every ten years, with the restriction that those eligible must be U.S. citizens; permanent residents are excluded). The population size is often denoted as N ('n' for the sample size).

Probability distribution
A table, graph, or mathematical function that describes how the probabilities are distributed over the values that the random variable of interest (X) can assume.

Probability Density Function (PDF)
A probability distribution of a continuous random variable. For example, if  a continuous random variable X is distributed as normal, then its mathematical function f(x) is a pdf.

Probability Mass Function (PMF)
A probability distribution of a discrete random variable. For example, if a discrete random variable X has a binomial distribution, then its mathematical function f(x) is a pmf.

# - Q -

Qualitative Data
Data that provide or contain non-numeric information; they serve merely as labels or names for identifying special attributes of the statistical entity/unit of interest. Qualitative data can be rendered numeric by coding the non-numeric values. A variable that assumes qualitative values is called a Qualitative Variable. An example is the Gender of employees with the values Male or Female.

Quantitative Data
Data that provide or contain information as to how much or how many; hence they are always numeric. A variable that assumes quantitative values is called a Quantitative variable. An example is the Salary or Experience (in years) of the employees.

# - R -

Random Sample
A sample drawn in such a way that each member of the population has an equal chance of being selected.

Random Variable
A variable that takes on different numerical values that are determined by chance. For example, in an experiment of flipping a fair coin thrice, if X denotes the random outcome of the number of heads that could show up then the possible values (xi) are X : xi = 0, 1, 2, 3 (read the : as such that). In this case, X is a Discrete random variable because it assumes only a finite sequence of values (with gaps between them). A random variable that assumes any value in an interval or collection of intervals (a continuum, no gaps between successive values) is called a Continuous random variable.

Regression Analysis
A statistical technique for measuring/quantifying the type of causal relationship among variables; one of which is the Dependent Variable (DV) while the others are the Independent Variables (IVs). The analysis is called Simple Regression if there is only one IV in the model; it is called Multiple Regression if there are two or more IVs in the model. A regression model whether in the Simple or Multiple form can be used for prediction purposes as well as for testing existing economic theories, among others. Regression analysis is the heart of Econometrics.

Some historical notes. The term regression was introduced by Francis Galton (1886) in his famous paper in which he found that although there was a tendency for tall parents to have tall children and for short parents to have short children, the average height of children born of parents of a given height tended to move or "regress" toward the average height in the population as a whole. Galton's law of universal regression, as it later came to be known, was confirmed by his friend, Karl Pearson (1903), who used more than a thousand records of heights of members of family groups.

# - S -

Sample
A subset of the population of interest to the researcher. The size is often denoted as n. In practice, we will be interested in a random sample for the purpose of making reasonable inferences about the population being studied/analyzed.

Sample Statistic
A summary measure/value computed from a sample of data. Thus, this value is always known. For example, X-bar is a statistic whose value from a sample of size 'n' can be used to make inferences (point or interval estimation) about the true unknown value of the population mean µ. See the Mean for a discussion of the desirable properties that make x-bar a good estimator of µ.

Skewness
A measure of the symmetry or lack of it in a set of data, as apparent from the shape of the distribution -- the three measures of shape are skewness, kurtosis, and box-and-whiskers plots. A distribution is said to be symmetric if the left half of the graph of the distribution is the mirror image of the right half. If a distribution is skewed to the right (positive skewness), it must be the case that the mean is greater than the median, which in turn is greater than the mode (i.e., mean > median > mode); in this case the skewness coefficient is greater than zero. If a distribution is skewed to the left (negative skewness), the relationship is reversed, and the coefficient is less than zero. If there is no skewness, i.e., the distribution is symmetric like the bell-shaped normal curve, then mean = median = mode.

Historical Notes: Karl Pearson is credited with developing at least two coefficients of skewness (Sk) that can be used to assess the degree of skewness in a distribution. One is given as Sk = [3(µ - Md)/σ], where µ and σ are the population mean and standard deviation, respectively, and Md is the median. This is the same Pearson who also developed the Coefficient of Correlation, as well as the Pearson Chi-Square statistic. Imagine how irrational the decision-making process would have been without these summary measures that allow us to uncover patterns and relationships inherent in bodies of data.
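Pearson's coefficient Sk = 3(mean - median)/sd is easy to compute directly; the data below are made-up, with one maximum-value outlier to induce right skew:

```python
from statistics import mean, median, pstdev

def pearson_skewness(data):
    """Pearson's skewness coefficient: Sk = 3(mean - median) / sd."""
    return 3 * (mean(data) - median(data)) / pstdev(data)

right_skewed = [1, 2, 2, 3, 3, 4, 12]        # maximum-value outlier
print(pearson_skewness(right_skewed) > 0)    # True: mean > median

symmetric = [1, 2, 3, 4, 5]                  # mean = median, so Sk = 0
print(pearson_skewness(symmetric))           # 0.0
```

The sign of Sk matches the rule stated above: positive when mean > median (skewed right), negative when mean < median (skewed left), zero when they coincide.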

Standard Deviation
A measure of dispersion for a body/set of data, found by taking the positive square root of the variance.

Statistical Analysis (Types)
A statistical analysis is said to be Univariate if the applicable technique involves only one statistical variable (e.g., finding the average age of all female medical doctors in the U.S.); it is said to be Bivariate if the applicable technique involves two variables (e.g., Simple Regression analysis of the effect of Advertising Expenditure on Sales); and it is said to be Multivariate if the applicable technique involves more than two variables (e.g., Multiple Regression analysis of the effects of annual Family Income and Family Size on annual Family Food Expenditure).

# - T -

Transformation
Replacing each data value by a different number (such as its logarithm) to facilitate statistical analysis. The logarithm often transforms skewness into symmetry by stretching the scale near zero, thus spreading out all the small values that had been bunched together. It also pulls together the very large data values which had been thinly scattered at the high end of the scale.
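The symmetrizing effect of the log transformation can be seen in the gap between the mean and the median, which shrinks sharply after transforming; the income figures are made-up illustration data:

```python
from math import log
from statistics import mean, median

# Right-skewed made-up data (e.g., incomes): one large value dominates
incomes = [18, 22, 25, 30, 35, 48, 220]
logged = [log(v) for v in incomes]

# Before: the mean sits well above the median (skewed right).
# After the log transform, the two are much closer together.
print(mean(incomes) - median(incomes))   # large positive gap
print(mean(logged) - median(logged))     # much smaller gap
```

Analyses that assume roughly symmetric data (and many that assume normality) therefore often work better on the transformed values.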

# - U -

Uniform Probability Distribution
A probability distribution in which equal probabilities are assigned to all values of a random variable. The distribution can be a pdf (probability density function) or a pmf (probability mass function) depending on whether the random variable X is continuous or discrete.

# - V -

Variable
A characteristic or an attribute of the statistical unit/entity of interest with values that are numeric (in the case of a quantitative variable) or non-numeric (in the case of a qualitative variable). The standard notation for a variable is X in the case of a univariate analysis, X and Y in the case of a bivariate analysis, or X, Y and Z in the case of a three-variable multivariate analysis.
