1.3 Descriptive Measures
This section summarizes the measures we use to describe data samples.
- Central Tendency: the central value of a data set
- Variation: the dispersion (scattering) around a central value
- Shape: the distribution pattern of the data values
These measures will be used repeatedly in our analyses, and will affect how confident we are in our conclusions.
To introduce you to some numerical results in R, we will continue with our light bulb scenario and add some actual data. Suppose we sampled 60 light bulbs from our population, turned them on, and timed each one until it burned out. If we recorded the lifetime of each light bulb, then we have a dataset (or data sample) of 60 observations on the lifetimes of light bulbs. This is what we will be using below.
1.3.1 Central Tendency
The Arithmetic or sample mean is the average value of a variable within the sample.
\[\bar{X}=\frac{\text{Sum of values}}{\text{Number of observations}}=\frac{1}{n} \sum\limits_{i=1}^n X_i\]
# the mean of our light bulb sample:
# First we load the data set, which gives us 60 observations of the
# lifetime of each light bulb in the sample. The variable is called
# "Lifetime" and can be inspected with the "list" command.
load("data/Lightbulb.Rdata")
list(Lifetime)
## [[1]]
## [1] 858.9164 797.2652 1013.5366 1064.8195 874.2275 825.1137 897.0879 924.0998 870.0674 966.2095
## [11] 955.1281 977.2073 888.1690 826.6483 776.7479 877.5691 998.7101 892.8178 886.0261 831.7615
## [21] 1082.9650 1034.9549 784.5026 919.2082 1049.1824 923.5767 907.7295 890.3758 856.4240 808.8035
## [31] 1009.7146 890.3709 930.9597 809.9274 919.9381 793.7455 919.9824 948.8593 810.6887 846.9573
## [41] 955.3873 833.2762 892.4969 973.1861 913.7650 928.6057 940.7637 964.4341 914.2733 880.3329
## [51] 831.5395 967.2442 1030.7598 857.5421 889.3689 1094.1440 927.7684 730.9976 918.8359 867.5931
(mean(Lifetime))
## [1] 907.5552
The average lifetime of our \(n = 60\) observed light bulbs is roughly 908 hours.
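To tie the formula to the function, here is a minimal sketch that computes the same mean by hand in base R:
# A hand-rolled version of the sample mean formula:
sum(Lifetime) / length(Lifetime)   # should match mean(Lifetime) above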
The median is the middle value of an ordered data set.
- If there is an odd number of values in a data set, the median is the middle value
- If there is an even number, the median is the average of the two middle values
# the median of our light bulb sample:
(median(Lifetime))
## [1] 902.4087
The median lifetime of our 60 observed light bulbs is 902 hours.
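To see the odd/even rule in action, here is a quick sketch on two tiny made-up vectors:
# Odd number of values: the median is the single middle value.
median(c(1, 3, 9))        # returns 3
# Even number of values: the median averages the two middle values.
median(c(1, 3, 9, 11))    # returns (3 + 9) / 2 = 6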
The mode is the most frequently appearing value in a set of observations. The mode might not exist if every observation is unique (i.e., no value repeats).
# There isn't a built in function for the mode in R... but we can create one.
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
(Mode(Lifetime))
## [1] 858.9164
The most commonly observed lifetime of our 60 observed light bulbs is 859 hours. Note that this doesn't really say anything about how many times this value has been observed, just that it has been observed more than any other value. We will not be using this measure as much as the previous two.
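If you are curious how often that modal value actually occurs, a small extension of the same tabulate() idea reports the count (a sketch, not part of the original analysis):
# Count how many times the modal value appears in the sample. With finely
# measured continuous data, every value may well be unique, making this 1.
max(tabulate(match(Lifetime, unique(Lifetime))))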
Percentiles break the ordered values of a sample into proportions.
- Quartiles split the data into 4 equal parts
- Deciles split the data into 10 equal parts
In general, the \((p \times 100)^{th}\) percentile is located at position \(p(n+1)\) in the ordered data.
A percentile delivers an observed value such that a given proportion of observations are less than or equal to that value. You can choose any percentile value you wish. For example, the code below calculates the 4th, 40th, and 80th percentiles of our light bulb sample.
# You can generate any percentile (e.g. the 4th, 40th, and 80th) using the quantile function:
(quantile(Lifetime, c(0.04, 0.40, 0.80)))
## 4% 40% 80%
## 787.8301 888.8889 966.4164
This result states that 4% of our observations are less than 788 hours, 40% of our observations are less than 889 hours, and 80% of our observations are less than 966 hours. Note that the median (being the middle-ranked observation) is by default the 50th percentile.
(quantile(Lifetime, 0.50))
## 50%
## 902.4087
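As a sketch, we can also verify the \(p(n+1)\) positional rule by hand for the median:
# With p = 0.50 and n = 60, the position is 0.50 * (60 + 1) = 30.5,
# i.e. the average of the 30th and 31st ordered observations.
sorted <- sort(Lifetime)
(sorted[30] + sorted[31]) / 2   # should match median(Lifetime)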
The main positional measures of a sample can be laid out in a Five-Number Summary:
- Minimum
- First Quartile (25th percentile)
- Second Quartile (median)
- Third Quartile (75th percentile)
- Maximum
summary(Lifetime)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 731.0 857.3 902.4 907.6 955.2 1094.1
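Base R also provides fivenum(), which returns exactly these five numbers (it uses Tukey's hinges, so the quartiles can differ slightly from those reported by quantile() for some sample sizes):
# Minimum, lower hinge, median, upper hinge, maximum:
fivenum(Lifetime)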
1.3.2 Variation
The sample variance measures the average (squared) amount of dispersion each individual observation has around the sample mean. This is a very important measure in statistics, so take some time to understand exactly what this equation is calculating. In particular, \(X\) is a variable and \(X_i\) is an arbitrary single observation from that group of data. Once the mean \((\bar{X})\) is calculated, \(X_i - \bar{X}\) is the difference between a single observation of \(X\) and the overall mean of \(X\). Sometimes this difference is negative \((X_i < \bar{X})\) and sometimes it is positive \((X_i > \bar{X})\) - which is why we need to square these differences before adding them all up. Once we obtain the average value of these squared differences, we get a sense of how widely the individual observations are scattered around the sample mean. If this value were zero, then every observation of \(X\) would be equal to \(\bar{X}\). The farther the value is from zero, the greater the average dispersion of individual values around the mean.
\[S^2=\frac{1}{n-1}\sum\limits_{i=1}^n(X_i-\bar{X})^2\]
The sample standard deviation is the square root of the sample variance and measures the average amount of dispersion in the same units as the mean. Taking the square root backs out the fact that we squared the differences \(X_i - \bar{X}\), since the variance is technically denominated in squared units.
\[S=\sqrt{S^2}\]
(var(Lifetime))
## [1] 6235.852
(sd(Lifetime))
## [1] 78.96741
The variance of our sample of light bulb lifetimes is 6236 squared-hours. After taking the square root of this number, we can conclude that the standard deviation of our sample is 79 hours. Is this standard deviation big or small? The answer to this comes when we get to statistical inference.
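To connect these numbers back to the formulas, here is a minimal hand-rolled sketch; both lines should reproduce the built-in results above:
# Hand-rolled versions of the variance and standard deviation formulas:
n <- length(Lifetime)
sum((Lifetime - mean(Lifetime))^2) / (n - 1)         # should match var(Lifetime)
sqrt(sum((Lifetime - mean(Lifetime))^2) / (n - 1))   # should match sd(Lifetime)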
Discussion
The term \((X_i-\bar{X})\) is squared because individual observations lie either above or below the mean. If you don't square the terms (making the negative values positive), then the deviations will sum to zero by construction.
The term \((n-1)\) appears in the denominator because this is a sample variance and not a population variance. In a population variance equation, \((n-1)\) is replaced with \(n\) because we know the population mean. Since we had to estimate the population mean (i.e., we used the sample mean), we deduct one degree of freedom. We will talk more about degrees of freedom later, but the rule of thumb is that we deduct a degree of freedom every time we build a sample statistic (like the sample variance) using another sample statistic (like the sample mean).
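The effect of the denominator is easy to see in code; a quick sketch comparing the two choices:
# The sample variance divides by (n - 1); a population-style variance divides by n.
n <- length(Lifetime)
mean((Lifetime - mean(Lifetime))^2)   # divides by n (slightly smaller)
var(Lifetime) * (n - 1) / n           # same value, recovered from var()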
The coefficient of variation is a relative measure which denotes the amount of scatter in the data relative to the mean.
The coefficient of variation is useful when comparing data on variables measured in different units or scales (because the CV reduces everything to percentages).
\[CV=\frac{S}{\bar{X}}*100\%\]
Take for example the Gross Domestic Product (i.e., output) for the states of California and Delaware.
library(readxl)
<- read_excel("data/CARGSP.xls")
CARGSP <- read_excel("data/DENGSP.xls")
DENGSP
<- CARGSP$CGDP
CGDP <- DENGSP$DGDP
DGDP
(mean(CGDP))
## [1] 2094764
(sd(CGDP))
## [1] 397103.9
(mean(DGDP))
## [1] 56996.58
(sd(DGDP))
## [1] 12395.78
A quick analysis of the real annual output observations from these two states between 1997 and 2020 suggests that the average annual output of California is 2,094,764 million dollars (with a standard deviation of 397,104 million) while that of Delaware is 56,997 million dollars (with a standard deviation of 12,396 million). These two states have many differences between them, and it is difficult to tell which state has more volatility in its output.
If we construct coefficients of variation:
(sd(CGDP)/mean(CGDP))*100
## [1] 18.95698
(sd(DGDP)/mean(DGDP))*100
## [1] 21.74828
We can now conclude that Delaware's standard deviation of output is almost 22% of its average output, while California's is about 19%. This would suggest that Delaware has the more volatile output, relatively speaking.
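Since we typed the same expression twice, a small helper function tidies this up (cv is a name we are introducing for illustration, not a base R function):
# A small helper for the coefficient of variation, in percent:
cv <- function(x) sd(x) / mean(x) * 100
cv(CGDP)   # California
cv(DGDP)   # Delaware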
1.3.3 Measures of shape
Comparing the mean and median of a sample will inform us of the skewness of the distribution.
- mean = median: a symmetric or zero-skewed distribution.
- mean > median: a positive-skewed or right-skewed distribution
- the right-tail is pulled in the positive direction
- mean < median: a negative-skewed or a left-skewed distribution
- the left-tail is pulled in the negative direction
The degree of skewness is indicative of outliers (extremely high or low values), which change the shape of a distribution.
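A quick simulated sketch makes the mean-versus-median rule concrete; an exponential sample has a long right tail, so its mean should exceed its median:
# An exponential sample is right-skewed by construction:
set.seed(1)                      # for reproducibility
skewed <- rexp(1000, rate = 1)   # long right tail
mean(skewed) > median(skewed)    # should be TRUE: the mean exceeds the median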
1.3.4 Covariance and Correlation
While we won’t be examining relationships between different variables until later on in the course, we can easily calculate and visualize these relationships.
The covariance measures the strength of the relationship between two variables. This measure is similar to a variance, but it can be either positive or negative depending on how the two variables move in relation to each other.
\[cov(X,Y)=\frac{1}{n-1}\sum\limits_{i=1}^n(X_i-\bar{X})(Y_i-\bar{Y})\]
The coefficient of correlation transforms the covariance into a relative measure:
\[corr(X,Y)=\frac{cov(X,Y)}{S_{X}S_{Y}}\]
The correlation transforms the covariance relationship into a measure between -1 and 1:
- \(corr(X,Y)=0\): There is no relationship between \(X\) and \(Y\)
- \(corr(X,Y)>0\): There is a positive relationship between \(X\) and \(Y\) - meaning that the two variables tend to move in the same direction
- \(corr(X,Y)<0\): There is a negative relationship between \(X\) and \(Y\) - meaning that the two variables tend to move in opposite directions
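To tie the covariance formula to R's built-in cov(), here is a quick sketch on two made-up toy vectors:
# Toy vectors (made up for illustration) to verify the covariance formula:
x <- c(1, 2, 3, 4)
y <- c(2, 4, 5, 8)
sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)   # should match cov(x, y)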
Extended Example:
This chapter concludes with a summary of all of the descriptive measures we discussed. Consider a dataset that is internal to R (called mtcars) that contains characteristics of 32 different automobiles. We will focus on two variables: the average miles per gallon (mpg) and the weight of the car (in thousands of pounds).
car <- mtcars # This loads the dataset and calls it car
# Let's examine mpg first:
summary(car$mpg)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.40 15.43 19.20 20.09 22.80 33.90
hist(car$mpg,20,col = "yellow")
# Variance:
(var(car$mpg))
## [1] 36.3241
# Standard deviation:
(sd(car$mpg))
## [1] 6.026948
The above analysis indicates the following:
- The average MPG in the sample is 20.09, while the median is 19.20. This indicates a slight positive skew in the distribution of observations.
- The lowest MPG is 10.40 while the highest is 33.90.
- The first quartile is 15.43 while the third is 22.80. This delivers the inter-quartile range (the middle 50% of the distribution).
- The standard deviation is 6.03, which delivers a 30 percent coefficient of variation (checked in the sketch below).
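Both of the last two bullets can be checked directly; IQR() is a base R function, and the second line just applies the CV formula from earlier:
IQR(car$mpg)                        # inter-quartile range: 22.80 - 15.43
sd(car$mpg) / mean(car$mpg) * 100   # coefficient of variation, about 30%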
# Let's now examine weight:
summary(car$wt)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.513 2.581 3.325 3.217 3.610 5.424
hist(car$wt,20,col = "pink")
# Variance:
(var(car$wt))
## [1] 0.957379
# Standard deviation:
(sd(car$wt))
## [1] 0.9784574
The above analysis indicates the following:
- The average weight in the sample is 3.22 thousand pounds, while the median is 3.33. This indicates a slight negative skew in the distribution of observations.
- The lowest weight is 1.51 while the highest is 5.42.
- The first quartile is 2.58 while the third is 3.61.
- The standard deviation is 0.98, which also delivers a 30 percent coefficient of variation.
# We can now look at the relation between mpg and weight
(cov(car$mpg, car$wt))
## [1] -5.116685
(cor(car$mpg, car$wt))
## [1] -0.8676594
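Consistent with the correlation formula from earlier, we can recover cor() by rescaling cov() with the two standard deviations (a quick check):
# Rescale the covariance by both standard deviations to recover the correlation:
cov(car$mpg, car$wt) / (sd(car$mpg) * sd(car$wt))   # should match cor() above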
plot(car$mpg,car$wt,
xlab = "Miles per gallon",
ylab = "Weight of car",
main = "A Scatter plot",
col = "cyan")
The negative correlation as well as the obviously negative relationship in the scatter-plot between the weight of a car and its miles per gallon should make intuitive sense - heavy cars are less efficient.
The Punchline
Suppose we want to learn about a relationship between a car’s weight and its fuel efficiency. Our sample is 32 automobiles, but our population is EVERY automobile (EVER). We would like to say something about the population mean Weight and MPG.
How does the sample variance give us confidence in making statements about the population mean when we're only given the sample? That's where inferential statistics comes in. Before we get into that, we will dig into the elements of collecting data (upon which our descriptive statistics are based) and using R (which we will use to calculate our descriptive statistics from our collected data).