Limitations of the Central Limit Theorem

Guest Essay by Kip Hansen — 17 December 2022

The Central Limit Theorem is particularly good and valuable when you have many measurements that give slightly different results.  Say, for instance, you wanted to know very precisely the length of a particular stainless-steel rod.  You measure it and get 502 mm.  You expected 500 mm.  So you measure it again:  498 mm.  And again and again: 499, 501. You check the conditions:  temperature the same each time?  You get a better, more precise ruler.  Measure again: 499.5, and again 500.2, and again 499.9 — one hundred times you measure.  You can't seem to get exactly the same result. Now you can use the Central Limit Theorem (hereafter CLT) to good effect.  Throw your hundred-plus measurements into a distribution chart or CLT calculator and you'll see your central value land very darned close to 500 mm, and you'll have an idea of the variation in the measurements.
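
If you would rather see that procedure in code than in a CLT calculator, here is a minimal sketch in Python.  The measurement values and the amount of scatter are invented for illustration; only the general recipe (take samples of the measurements, average each sample, then average the averages) follows the example above.

```python
import random
import statistics

random.seed(1)

# Invented stand-in for the rod example: 100 readings scattered around 500 mm.
measurements = [random.gauss(500.0, 1.0) for _ in range(100)]

# CLT-style recipe: draw many samples from the measurements, take the mean
# of each sample, then look at the mean (and spread) of those sample means.
sample_means = [statistics.mean(random.choices(measurements, k=10))
                for _ in range(1000)]

print("mean of the sample means:", round(statistics.mean(sample_means), 2))
print("spread (SD) of the sample means:", round(statistics.stdev(sample_means), 3))
```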

While the Law of Large Numbers is based on repeating the same experiment, or measurement, many times, and thus could be depended on in this exact instance, the CLT only requires a largish population (the overall data set) and the taking of the means of many samples of that data set.

It would take another post (possibly a book) to explain all the benefits and limitations of the Central Limit Theorem, but I will use a few examples to introduce the topic.

Example 1:

You take 100 measurements of the diameter of ball bearings produced by a machine on the same day.  You can calculate the mean and can estimate a variance in the data.  But you want a better idea, so you realize that you have 100 measurements from each Friday for the past year:  50 data sets of 100 measurements, which, if sampled, would give you fifty samples out of the 306 possible daily samples of the total 30,600 measurements you would have if you had measured 100 bearings every work day (six days a week, 51 weeks).

The Central Limit Theorem is about probability.  It will tell you what the most likely (probable) mean diameter is of all the ball bearings produced on that machine.  But if you are presented with only the mean and the SD, and not the full distribution, it will tell you very little about how many ball bearings are within specification and thus have value to the company.   The CLT cannot tell you how many or what percentage of the ball bearings would have been within the specifications (if measured when produced) and how many were outside spec (and thus useless).  Oh, and the Standard Deviation will not tell you either — it is not a measurement or a quantity, it is a creature of probability.
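
To make that concrete, here is a small sketch with invented numbers (the diameters and the spec limits are made up; the point is only that the in-spec count comes from the measurements themselves, not from the two summary numbers):

```python
import random
import statistics

random.seed(2)

# Invented example: 100 bearing diameters (mm) and invented spec limits.
diameters = [random.gauss(10.00, 0.02) for _ in range(100)]
low_spec, high_spec = 9.97, 10.03

mean = statistics.mean(diameters)
sd = statistics.stdev(diameters)
print(f"mean = {mean:.3f} mm, SD = {sd:.3f} mm")

# Counting in-spec parts needs the actual measurements (or an assumed
# distribution shape); the mean and SD alone cannot supply the count.
in_spec = sum(low_spec <= d <= high_spec for d in diameters)
print(f"within spec: {in_spec} of {len(diameters)}")
```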

Example 2:

The Khan Academy gives a fine example of the limitations of the Central Limit Theorem (albeit not intentionally) in the following lesson (watch the YouTube video if you like; it runs about ten minutes):

The image is the distribution diagram for our oddly loaded die (one of a pair of dice).  It is loaded to come up 1 or 6, or 3 or 4, but never 2 or 5; it is twice as likely to come up 1 or 6 as 3 or 4. The image shows a diagram of the expected distribution of the results of many rolls, with the ratios of two 1s, one 3, one 4, and two 6s. Taking the means of random samples drawn from 1000 rolls of this die (technically, building "the sampling distribution for the sample mean"), say samples of twenty rolls taken repeatedly, will eventually lead to a "normal distribution" with a fairly clearly visible (calculable) mean and SD.

Here, relying on the Central Limit Theorem, we return a mean of 3.5 (with some standard deviation).  (We take "the mean of this sampling distribution" – the mean of means, an average of averages.)

Now, if we take a fair die (one not loaded) and do the same thing, we will get the same mean of 3.5 (with some standard deviation).

Note:  These distributions of the frequencies of the sampled means are from 1000 random rolls (in Excel, using =RANDBETWEEN(1,6); the formula for the loaded die was modified as required), sampled every 25 rolls.  Had we sampled a data set of 10,000 random rolls, the distribution of the sampled means would narrow and the mean of the sampled means — 3.5 — would become more distinct.
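
For readers without Excel handy, here is a rough Python equivalent of that exercise (the loaded-die weighting follows the description above; the sample size of 25 matches the note):

```python
import random
import statistics

random.seed(3)

def fair_rolls(n):
    return [random.randint(1, 6) for _ in range(n)]

def loaded_rolls(n):
    # Loaded die: 1 and 6 twice as likely as 3 and 4; 2 and 5 never appear.
    return random.choices([1, 3, 4, 6], weights=[2, 1, 1, 2], k=n)

for name, rolls in (("fair", fair_rolls(1000)), ("loaded", loaded_rolls(1000))):
    # Mean of each consecutive group of 25 rolls, then the mean of those means.
    means = [statistics.mean(rolls[i:i + 25]) for i in range(0, len(rolls), 25)]
    print(f"{name:>6}: mean of sampled means = {statistics.mean(means):.2f}")
```

Both dice come out very close to 3.5, which is exactly the point of the next few paragraphs.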

The Central Limit Theorem works exactly as claimed.  If one collects enough samples (randomly selected data) from a population (or data set…) and finds the means of those samples, the means will tend towards a normal distribution – as we see in the charts above – and the values of the means tend towards the (in this case known) true mean.  In man-on-the-street language, the means are clumping in the center around the value of the mean at 3.5, making the characteristic "hump" of a Normal Distribution.  Remember, this resulting mean is really the "mean of the sampled means".

So, our fair die and our loaded die both produce approximately normal distributions when testing a 1000-random-roll data set and sampling the means.  The distribution of the means would improve – get closer to the known mean – if we had ten or one hundred times more random rolls and a correspondingly larger number of samples. Both the fair and the loaded die have the same mean (though slightly different variance or deviation). I say "known mean" because we can, in this case, know the mean by straightforward calculation: we have all the data points of the population and know the mean of the real-world distribution of the dice themselves.

In this setting, this is a true but almost totally useless result.   Any high school math nerd could have just looked at the dice, maybe made a few rolls with each, and told you the same:  the range of values is 1 through 6;  the width of the range is 5; the midpoint of the range is 1 + 2.5 = 3.5.  There is nothing more to discover by using the Central Limit Theorem against a data set of 1000 rolls of the one die – though it will also give you the approximate Standard Deviation – which is also almost entirely useless.

Why do I say useless?  Because context is important.  Dice are used for games involving chance (well, more properly, probability) in which it is assumed that the sides of the dice that land facing up do so randomly.  Further, each roll of a die or pair of dice is totally independent of any previous rolls.

Impermissible Values

As with all averages of every type, the means are just numbers. They may or may not have physically sensible meanings.

One simple example is that a single die will never ever come up at the mean value of 3.5.  The mean is correct but is not a possible (permissible) value for the roll of one die – never in a million rolls.

Our loaded die can only roll:  1, 3, 4 or 6.  Our fair die can only roll 1, 2, 3, 4, 5 or 6.  There just is no 3.5.

This is so basic and so universal that many will object to it as nonsense.  But there are many physical metrics that have impermissible values. The classic and tired old cliché is the average number of children per family being 2.4.  And we all know why: there are no ".4" children in any family – children come in whole numbers only.

However, if for some reason you want or need an approximate, statistically-derived mean for your intended purpose, then using the principles of the CLT is your ticket.  Remember, to get a true mean of a set of values, one must add all the values together and divide by the number of values.

The Central Limit Theorem method does not reduce uncertainty:

There is a common pretense (def: "something imagined or pretended") often used in science today, which treats a data set (all the measurements) as a sample, then takes samples of that sample, runs them through a CLT calculator, and calls the result a truer mean than the mean of the actual measurements.  Not only "truer", but more precise.  However, while the CLT value achieved may have a small standard deviation, that is not the same thing as more accuracy in the measurements or less uncertainty regarding what the actual mean of the data set would be.  If the data set is made up of uncertain measurements, then the true mean will be uncertain to the same degree.
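
A small sketch of that distinction, with invented numbers: the standard deviation of the sampled means shrinks as the samples pile up, but the scatter in the underlying measurements (and whatever uncertainty they carry) does not go anywhere.

```python
import random
import statistics

random.seed(4)

# Invented "measurements": 1000 values with a wide scatter (SD about 2 units).
data = [random.gauss(50.0, 2.0) for _ in range(1000)]

# CLT-style: means of many samples of 30 drawn from the data set.
sample_means = [statistics.mean(random.choices(data, k=30)) for _ in range(2000)]

print("SD of the measurements:  ", round(statistics.stdev(data), 2))          # ~2.0
print("SD of the sampled means: ", round(statistics.stdev(sample_means), 2))  # ~0.4
# The second, smaller number describes the scatter of the sample means,
# not the scatter (or the uncertainty) of the measurements themselves.
```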

Distribution of Values May be More Important

The Central Limit Theorem-provided mean would be of no use whatever when considering the use of this loaded die in gambling.   Why? …  Because the gambler wants to know how many times in a dozen die-rolls he can expect to get a "6", or, if rolling a pair of loaded dice, maybe a "7" or "11".  How much of an edge over the other gamblers does he gain if he introduces the loaded dice into the game when it's his roll?

(BTW: I was once a semi-professional stage magician, and I assure you, introducing a pair of loaded dice is easy on stage or  in a street game with all its distractions but nearly impossible in a casino.)

Let’s see this in frequency distributions of rolls of our dice, rolling just one die, fair and loaded (1000 simulated random rolls in Excel):

And if we are using a pair of fair or loaded dice (many games use two dice):

On the left, fair dice return more sevens than any other value.  You can see this is tending towards the mean (of two dice) as expected.  Two 1’s or two 6’s are rare for fair dice … as there is only a single unique combination each for the combined values of 2 and 12.  Lots of ways to get a 7.

Our loaded dice return even more 7's.  In fact, over twice as many 7's as any other number, almost 1-in-3 rolls.   Also, the loaded dice have a much better chance of rolling 2 or 12, about four times better than with fair dice.   The loaded dice never return 3 or 11.
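
Those counts are not a quirk of the simulation; the exact probabilities can be enumerated directly.  Here is a sketch, with the loaded-die weights taken from the description given earlier:

```python
from fractions import Fraction
from itertools import product

# Single-die face probabilities.
fair = {face: Fraction(1, 6) for face in range(1, 7)}
loaded = {1: Fraction(2, 6), 3: Fraction(1, 6), 4: Fraction(1, 6), 6: Fraction(2, 6)}

def pair_totals(die):
    """Exact probability of each total for a pair of identical dice."""
    dist = {}
    for a, b in product(die, repeat=2):
        dist[a + b] = dist.get(a + b, Fraction(0)) + die[a] * die[b]
    return dict(sorted(dist.items()))

print("fair pair:  ", {t: round(float(p), 3) for t, p in pair_totals(fair).items()})
print("loaded pair:", {t: round(float(p), 3) for t, p in pair_totals(loaded).items()})
# Loaded pair: P(7) = 10/36 (almost 1 roll in 3), P(2) = P(12) = 4/36
# versus 1/36 for fair dice, and totals of 3 and 11 never appear.
```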

Now here we see that if we depended on the statistical (CLT) central value of the means of the rolls to prove the dice were fair (which, remember, is 3.5 for both fair and loaded dice), we would have made a fatal error.  The house (the casino itself) expects the distribution on the left from a pair of fair dice and thus sets the rules to give the house a small percentage in its favor.

The gambler needs the actual distribution probability of the values of the rolls to make betting decisions.

If there are any dicing gamblers reading, please explain to non-gamblers in comments what an advantage this would be.

Finding and Using Means Isn’t Always What You Want

This insistence on using means approximated by way of the Central Limit Theorem (and its returned Standard Deviations) can create non-physical and useless results when misapplied.  The CLT means could have misled us into believing that the loaded dice were fair, since they share a common mean with fair dice. But the CLT is a tool of probability, not a pragmatic tool we can use to predict the values of measurements in the real world. The CLT does not predict or provide values – it only provides estimated means and estimated deviations from that mean, and these are just numbers.

Our Khan Academy teacher, almost in the hushed tones used to describe an extra-normal phenomenon, points out that taking random same-sized samples from a data set (a population of collected measurements, for instance) will also produce a Normal Distribution of the sampled sums!  The triviality of this fact should be apparent: if the "sums divided by the [same] number of components" (the means of the samples) are normally distributed, then the sums of the samples must also be normally distributed (basic algebra).
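
The algebra is one line: each sample's sum is its mean multiplied by the (fixed) sample size, so the histogram of sums is just the histogram of means stretched by a constant.  A throwaway check, using the same sort of simulated rolls as before:

```python
import random
import statistics

random.seed(5)

rolls = [random.randint(1, 6) for _ in range(1000)]
samples = [rolls[i:i + 25] for i in range(0, len(rolls), 25)]

means = [statistics.mean(s) for s in samples]
sums = [sum(s) for s in samples]

# Every sample sum is exactly 25 times that sample's mean, so the two
# distributions have the same shape; only the horizontal scale differs.
print(all(abs(total - 25 * m) < 1e-9 for total, m in zip(sums, means)))  # True
```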

In the Real World

Whether considering gambling with dice – loaded and fair – or evaluating the usability of ball bearings from the machinery we are examining, we may well find that the estimated means and deviations obtained by applying the CLT are not always what we need and might even mislead us.

If we need to know which, and how many, of our ball bearings will fit the bearing races of a tractor manufacturing customer, we will need some analysis system and quality assurance tool closer to reality.

If our gambler is going to bet his money on the throw of a pair of specially-prepared loaded dice, he needs the full potential distribution, not of the means, but the probability distribution of the throws.

Averages or Means:  One number to rule them all

Averages seem to be the sweetheart of data analysts of all stripes.  Oddly enough, even when they have a complete data set like daily high tides for the year, which they could just look at visually, they want to find the mean.

The mean water level, which happens to be 27.15 ft (rounded), does not tell us much.  The Mean High Water tells us more, but not nearly as much as the simple graph of the data points.  For those unfamiliar with astronomic tides, most tides are on a roughly 12.5-hour cycle, with a Higher High Tide (MHHW) and a less-high High Tide (MHW).  That explains what seem to be two traces above.

Note: the data points are actually a time series covering a small part of each cycle; in a graph like this we are pulling out the set of the two higher points and the two lower points.  One can see the usefulness of the different plottings above, each visually revealing more of the data than the other.

When launching my sailboat at a boat ramp near the station, the graph of the actual high-tide data points shows me that I need to catch the higher of the two high tides (Higher High Water), which sometimes gives me more than an extra two feet of water (over the mean) under the keel.  If I used the mean and attempted to launch on the lower of the two high tides (High Water), I could find myself with a whole foot less water than I expected; and if I had arrived with the boat expecting to pull it out with the boat trailer at the wrong point of the tide cycle, I could find five feet less water than at MHHW.  It is far easier to put the boat in or take it out at the highest of the tides.

With this view of the tides for a month, we can see that each of the two higher tides themselves have a little harmonic cycle, up and down.

Here we have the distribution of values of the high tides.  It doesn't tell us very much – almost nothing about the tides that is numerically useful – unless, of course, one only wants the means, which could be just as easily eyeball-guessed from the charts above or from this chart: we would get a vaguely useful "around 29 feet."

In this case, we have all the data points for the high tides at this station for the month, and we could just calculate the mean directly and exactly (within the limits of the measurements) if we needed it – which I doubt would be the case.   At least we would then have a true, precise mean (plus the measurement uncertainty, of course), but I think we would find that in many practical senses it is useless – in practice, we need the whole cycle, its values, and its timing.
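
If one did want that mean, the direct calculation is trivial once all the data points are in hand.  A sketch with invented tide heights (not the station's actual readings), which also shows what the single number hides:

```python
import statistics

# Invented stand-in for a stretch of high-tide readings (feet):
# the higher and the lower of the two daily highs, kept separate.
higher_highs = [31.2, 31.0, 30.8, 31.4, 31.1, 30.9]
lower_highs = [29.1, 28.8, 29.0, 29.3, 28.9, 29.2]

all_highs = higher_highs + lower_highs
print("mean of all the high tides:", round(statistics.mean(all_highs), 2))
print("mean of the higher daily highs:", round(statistics.mean(higher_highs), 2))
print("mean of the lower daily highs: ", round(statistics.mean(lower_highs), 2))
# The single overall mean papers over the roughly two-foot difference
# between the two daily highs -- the difference that matters at the ramp.
```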

Why One Number?

Finding means (averages) gives a one-number result.  Which is oh-so-much easier to look at and easier to understand than all that messy, confusing data!

In a previous post on a related topic, one commenter suggested we could use the CLT to find "the 2021 average maximum daily temperature at some fixed spot."  When asked why one would want to do so, the commenter replied "To tell if it is warmer regarding max temps than say 2020 or 1920, obviously."  [I particularly liked the 'obviously'.] Now, any physicists reading here?  Why does the requested single number — "2021 average maximum daily temperature" — not tell us much of anything that resembles "if it is warmer regarding max temps than say 2020 or 1920"?   If we also had a similar single number for the "1920 average maximum daily temperature" at the same fixed spot, we would only know whether our number for 2021 was higher or lower than the number for 1920.  We would not know if "it was warmer" (in regard to anything).

At the most basic level, the “average maximum daily temperature” is not a measurement of temperature or warmness at all, but rather, as the same commenter admitted, is “just a number”.

If that isn’t clear to you (and, admittedly, the relationship between temperature and “warmness” and “heat content of the air” can be tricky), you’ll have to wait for a future essay on the topic.

It might be possible to tell if there is some temperature trend at that fixed place using a fuller temperature record for that place … but comparing one single number with another single number does not do that.

And that is the major limitation of the Central Limit Theorem

The CLT is terrific at producing an approximate mean value of some population of data/measurements without having to calculate it directly from the full set of measurements.   It gives one a SINGLE NUMBER from a messy collection of hundreds, thousands, or millions of data points. It allows one to pretend that the single number (and its variation, as SDs) faithfully represents the whole data set/population-of-measurements. However, that is not true – it only gives the approximate mean, which is an average, and because it is an average (an estimated mean) it carries all of the limitations and disadvantages of all other types of averages.

The CLT is a model, a method, that will produce a Mean Value from ANY large enough set of numbers – the numbers do not need to be about anything real; they can be entirely random, with no validity about anything.  The CLT method pops out the estimated mean, closer and closer to a single value as more and more samples from the larger population are supplied to it.  Even when dealing with scientific measurements, the CLT will discover a mean (one that looks very precise when "the uncertainty of the mean" is attached) just as easily from sloppy measurements, from fraudulent measurements, from copy-and-pasted findings, from "just-plain-made-up" findings, from "I generated my finding using a random number generator" findings, and from findings with so much uncertainty that they can hardly be called measurements at all.
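
A quick way to see that last point: feed the same recipe nothing but random numbers and it still hands back a tidy, precise-looking result.  A sketch:

```python
import random
import statistics

random.seed(6)

# Pure noise: 10,000 uniform random numbers that measure nothing at all.
noise = [random.uniform(0.0, 100.0) for _ in range(10_000)]

# The same CLT recipe applied to the noise...
sample_means = [statistics.mean(random.choices(noise, k=50)) for _ in range(2000)]

mean_of_means = statistics.mean(sample_means)
sd_of_means = statistics.stdev(sample_means)
print(f"'mean' of the noise: {mean_of_means:.2f} +/- {sd_of_means:.2f}")
# ...and it dutifully reports a number near 50 with a small attached spread,
# even though there is nothing behind it but a random number generator.
```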

Bottom Lines:

1.   The CLT is useful if one has a large data set (many data points) and wishes, for some reason, to find an approximate mean of that data set.  Using the principles of the Central Limit Theorem (finding the means of multiple samples from the data set, making a distribution diagram of those means, and, with enough samples, finding the mean of the means) will point to the approximate mean and give an idea of the variance in the data.

2.  Since the result will be a mean, an average, and an approximate mean at that, all the caveats and cautions that apply to the use of averages apply to the result.

3.  The mean found through use of the CLT cannot and will not be less uncertain than the actual mean of the original uncertain measurements themselves.  However, it is almost universally claimed that "the uncertainty of the mean" (really the SD or some such) thus found is many times smaller than the uncertainty of the actual mean of the original measurements (or data points) of the data set.

This claim is so generally accepted and so firmly held as a Statisticians' Article of Faith that many commenting below will deride the idea of its falseness and present voluminous "proofs" from their statistical manuals to show that such methods do reduce uncertainty.

4.  When doing science and evaluating data sets, the urge to seek a “single number” to represent the large, messy, complex and complicated data sets is irresistible to many – and can lead to serious misunderstandings and even comical errors.

5.  It is almost always better to do a much more nuanced evaluation of a data set than simply finding and substituting a single number, such as a mean, and then pretending that that single number can stand in for the real data.

# # # # #

Author’s Comment:

"One Number to Rule Them All" as a principal, go-to-first approach in science has been disastrous for the reliability and trustworthiness of scientific research.

Substituting statistically-derived single numbers for actual data, even when the data itself is available and easily accessible, has been and is an endemic malpractice of today’s science.

I blame the ease of "computation without prior thought" – all too often we are looking for The Easy Way.  We throw data sets at computers loaded with analysis models and statistical software that are often barely understood, and way, way too often without real thought as to the caveats, limitations, and consequences of the various methodologies.

I am not the first or only one to recognize this (maybe one of the last), but the poor practices continue, and doubting the validity of these practices draws criticism and attacks.

I could be wrong now, but I don’t think so! (h/t Randy Newman)

# # # # #