4 summarize — Summary statistics
The idea of the mean is quite old (Plackett 1958), but its extension to a scheme of moment-based
measures was not developed until the end of the 19th century. Between 1893 and 1905, Pearson
discussed and named the standard deviation, skewness, and kurtosis, but he was not the first
to use any of these. Thiele (1889), in contrast, had earlier firmly grasped the notion that the
m_r provide a systematic basis for discussing distributions. However, even earlier anticipations
can also be found. For example, Euler in 1778 used m_2 and m_3 in passing in a treatment of
estimation (Hald 1998, 87), but seemingly did not build on that.
Similarly, the idea of the median is quite old. The history of the interquartile range is tangled
up with that of the probable error, a long-popular measure. Extending this in various ways to a
more general approach based on quantiles (to use a later term) occurred to several people in the
nineteenth century. Galton (1875) is a nice example, particularly because he seems so close to
the key idea of the quantiles as a function, which took another century to reemerge strongly.
Thorvald Nicolai Thiele (1838–1910) was a Danish scientist who worked in astronomy,
mathematics, actuarial science, and statistics. He made many pioneering contributions to statistics,
several of which were overlooked until recently. Thiele advocated graphical analysis of residuals,
checking for trends, symmetry of distributions, and changes of sign, and he even warned against
overinterpreting such graphs.
Example 2: summarize with the detail option
The detail option provides all the information of a default summarize and more. The format
of the output also differs, as shown here:
. summarize mpg, detail

                        Mileage (mpg)
-------------------------------------------------------------
      Percentiles      Smallest
 1%           12             12
 5%           14             12
10%           14             14       Obs                  74
25%           18             14       Sum of wgt.          74

50%           20                      Mean            21.2973
                        Largest       Std. dev.      5.785503
75%           25             34
90%           29             35       Variance       33.47205
95%           34             35       Skewness       .9487176
99%           41             41       Kurtosis       3.975005
As in the previous example, we see that the mean of mpg is 21.3 miles per gallon and that the standard
deviation is 5.79. We also see the various percentiles. The median of mpg (the 50th percentile) is 20
miles per gallon. The 25th percentile is 18, and the 75th percentile is 25.
When we performed summarize, we learned that the minimum and maximum were 12 and 41,
respectively. We now see that the four smallest values in our dataset are 12, 12, 14, and 14. The four
largest values are 34, 35, 35, and 41. The skewness of the distribution is 0.95, and the kurtosis is
3.98. (A normal distribution would have a skewness of 0 and a kurtosis of 3.)
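The numbers read off the table above can also be retrieved programmatically, because summarize leaves its results in r(). The following is a minimal sketch, not a definitive recipe. It assumes the example data are the automobile dataset shipped with Stata, loaded here with sysuse auto; the source does not say how the data were loaded. The interquartile range shown is simply the difference between the 75th and 25th percentiles discussed above (25 − 18).

```stata
* Assumption: the example uses the automobile data shipped with Stata
sysuse auto, clear

* Run the detailed summary; output is as shown above
summarize mpg, detail

* List everything summarize stored in r()
return list

* Pull out individual stored results
display "median   = " r(p50)
display "IQR      = " r(p75) - r(p25)
display "skewness = " %5.2f r(skewness)
display "kurtosis = " %5.2f r(kurtosis)
```

Stored results are overwritten by the next r-class command, so copy any value you need into a scalar or local macro before running another command.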
Skewness is a measure of the lack of symmetry of a distribution. If the distribution is symmetric,
the coefficient of skewness is 0. If the coefficient is negative, the median is usually greater than
the mean and the distribution is said to be skewed left. If the coefficient is positive, the median is
usually less than the mean and the distribution is said to be skewed right. Kurtosis (from the Greek
kyrtosis, meaning curvature) is a measure of peakedness of a distribution. The smaller the coefficient
of kurtosis, the flatter the distribution. The normal distribution has a coefficient of kurtosis of 3 and
provides a convenient benchmark.
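For reference, both coefficients can be written as ratios of moments about the mean. What follows is a sketch of the usual moment-ratio definitions for unweighted data. The symbols n, x_i, and \bar{x} (the number of observations, the ith value, and the sample mean) are notation introduced here, and the formulas are the standard definitions rather than something quoted from this section.

```latex
m_r = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^r, \qquad
\text{skewness} = \frac{m_3}{m_2^{3/2}}, \qquad
\text{kurtosis} = \frac{m_4}{m_2^{2}}
```

Under these definitions, a symmetric distribution has m_3 = 0 and hence a skewness of 0, and a normal distribution has a kurtosis of 3, which matches the benchmarks quoted above.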