There is a view in brand marketing research these days that random sampling is on its last legs. With tons of data in hand and real-time testing on the rise, many claim there is no need for random sampling anymore.
This view was even given an imprimatur of historical legitimacy recently in the bestselling book, Big Data: A Revolution That Will Transform How We Live, Work, and Think by Oxford professor Viktor Mayer-Schonberger and Economist reporter Kenneth Cukier. Early on in their book, the authors assert that “[s]ampling was a solution to the problem of information overload in an earlier age,” presumably, a problem that will no longer plague us in the post-revolutionary age of Big Data. The authors then go on to detail all the problems of, well, truth be told, bad samples, not proper random samples.
This conflation of bad sampling with random samples by Mayer-Schonberger and Cukier is yet another instance of the sort of confusion that has long characterized untutored discussions of sampling. Muddles like this are sure to get worse as Big Data, mobile devices, digital footprints and cloud computing come together in the very near future.
The misperception is that the need for random sampling is obviated by computers able to analyze yottabytes of data at speeds reaching dozens of petaflops. This misunderstanding can have unfortunate consequences for brand marketing, so a little clarification is in order.
As a start, let’s be clear on terminology. There are many types of random samples, but the type we generally think of is a simple random sample in which every element of the population being sampled has an equal probability of inclusion in the sample. It would be better to speak of probability samples than of random samples, but the points to be made here are the same, and the idea of a simple random sample is the easiest to bear in mind.
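For readers who like to see the idea concretely, here is a minimal Python sketch of a simple random sample. The customer IDs, the sample size and the function name are illustrative assumptions, not anything drawn from a real study.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct elements; each element has the same n/N chance of inclusion."""
    rng = random.Random(seed)          # seeded only so the draw can be repeated
    return rng.sample(population, n)   # sampling without replacement

# Hypothetical population of 10,000 customer IDs.
customers = list(range(10_000))
sample = simple_random_sample(customers, 500, seed=42)

print(len(sample), "customers drawn; inclusion probability =", 500 / len(customers))
```

Nothing about a customer (age, spend, location) influences whether that customer is drawn, which is the whole point of the method.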
The presumption of those now eulogizing random samples is that having data on the full population ensures a clear view of what’s going on. This is just not so, for at least three reasons.
The first reason is an important bit of statistical nuance that is almost always overlooked. It’s called power. Most of us are familiar with the concept of sampling error, which is the plus-or-minus error range attached to statistical estimates such as election polls. The mirror image of sampling error is power.
Sampling error ranges keep us from calling something real when it’s actually random chance. Power helps us detect what’s real when we might otherwise mistake it for random chance.
In statistical lingo, sampling error ranges are about avoiding false positives (i.e., not committing Type I error). Power is about avoiding false negatives (i.e., not committing Type II error).
Sampling error ranges and power trade off. When a sample has a wide margin of error (or a big plus-or-minus range), the sample itself has little power, hence, a poor ability to detect real differences. On the other hand, when a sample has a small margin of error, it has a lot of power, as you would expect from more precision.
But there’s a catch. You don’t want too much power because then every difference, no matter how minuscule, will be statistically significant, including differences that are actually random. But you don’t want too little power either because then the differences that are real will not show up as statistically significant. This is where the science of statistics becomes the art of research. One of the most important tasks in every study is figuring out the right balance of sampling error and power. In practical terms, this means determining the ideal sample size. Too big a sample means too much power; too small a sample means too much sampling error.
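The link between sample size, margin of error and power can be sketched in a few lines of Python. This is a rough illustration using the normal approximation for proportions; the function names, the 10% versus 12% response rates and the sample sizes are assumptions chosen for the example.

```python
import math
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def margin_of_error(p, n, confidence=0.95):
    """Plus-or-minus range for an estimated proportion p from a sample of size n."""
    z = Z.inv_cdf(0.5 + confidence / 2)
    return z * math.sqrt(p * (1 - p) / n)

def power_two_proportions(p1, p2, n_per_group, alpha=0.05):
    """Approximate power of a two-sided z-test to detect a true gap between p1 and p2."""
    z_crit = Z.inv_cdf(1 - alpha / 2)
    se = math.sqrt(p1 * (1 - p1) / n_per_group + p2 * (1 - p2) / n_per_group)
    shift = abs(p1 - p2) / se
    return Z.cdf(shift - z_crit) + Z.cdf(-shift - z_crit)

# As the sample grows, the margin of error shrinks and power rises.
for n in (100, 1_000, 10_000, 100_000):
    moe = margin_of_error(0.10, n)
    pwr = power_two_proportions(0.10, 0.12, n)
    print(f"n={n:>7,}  margin of error=+/-{moe:.4f}  power={pwr:.3f}")
```

Picking the sample size amounts to choosing the row in a table like this where the smallest difference worth acting on is detected reliably, and not much more.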
To put it another way, there is, indeed, such a thing as a dataset that is too big, a statistical reality that Big Data enthusiasts typically overlook. With too much data, every difference is statistically significant. Analysis of the data won’t separate real differences from chance differences because every difference will look to be real.
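A small sketch makes the point. Suppose two groups show response rates of 10.0% and 10.1%, a gap most marketers would consider too small to matter. The code below (a pooled two-proportion z-test under the normal approximation, with made-up rates and sample sizes) shows how the verdict on that same gap changes with nothing but the amount of data.

```python
import math
from statistics import NormalDist

def two_proportion_p_value(p1, p2, n_per_group):
    """Two-sided p-value for observed proportions p1 and p2 in equal-sized groups."""
    pooled = (p1 + p2) / 2
    se = math.sqrt(2 * pooled * (1 - pooled) / n_per_group)
    z = (p1 - p2) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# The observed gap is held fixed at one tenth of a percentage point.
for n in (1_000, 100_000, 10_000_000):
    p = two_proportion_p_value(0.100, 0.101, n)
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"n per group={n:>10,}  p-value={p:.4g}  -> {verdict}")
```

The difference itself never changes; only the sample size does. At Big Data scale the test stops discriminating, which is exactly the problem described above.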
Statistically speaking, data on an entire population can be the very same thing as a sample that is so big it has too much power. When too much data makes every difference statistically significant, statistical testing is of no help, so we are forced back on our own judgment. But this encourages us to indulge our inborn tendency to see patterns where none really exist.
This takes us to the second reason that Big Data does not ensure a clear view of what’s really going on. Random samples are often better than entire populations for figuring out what’s going on.
Rule number one of probability is that random events occur in clumps. What looks like a pattern to the naked eye is almost always just a random distribution. Unfortunately, we have a built-in bias for seeing structure and order where none exists. Human beings specialize in interpreting chance, and then telling very compelling stories that make these misinterpretations seem true. Even when we go out looking for evidence to put our beliefs to the test, we usually fall prey to confirmation bias.
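Clumping is easy to demonstrate. The sketch below flips a simulated fair coin 200 times and reports the longest streak of identical outcomes; the seed and the number of flips are arbitrary choices made for illustration.

```python
import random

def longest_run(flips):
    """Length of the longest streak of identical consecutive outcomes."""
    if not flips:
        return 0
    best = current = 1
    for prev, nxt in zip(flips, flips[1:]):
        current = current + 1 if nxt == prev else 1
        best = max(best, current)
    return best

rng = random.Random(7)
flips = [rng.choice("HT") for _ in range(200)]

print("".join(flips))
print("longest streak:", longest_run(flips))
```

Streaks of six or seven in a row are typical in a sequence of this length, even though every flip is independent. Anyone scanning the printout for a "hot hand" will find one.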
Having all the data can’t keep us from misinterpreting what we see. Time and again, research has shown that common sense fails us when it comes to analyzing large-scale phenomena. We need a process that protects us from ourselves.
The process to follow is one that begins with a look at all the data before us in the context of our past experience and prior knowledge. From this, we come up with hypotheses about what’s going on. Then we put these hypotheses to the test.
In today’s digital marketplace, Google’s A/B testing is often cited as best practice for junking hifalutin theory and just looking at the data to see what works. Even Big Data enthusiasts believe in A/B testing. But here’s the thing: the A and the B in A/B testing are both samples, and if they’re not random samples, then you can’t have confidence that the test results are reliable enough for significant brand marketing investments.
A sample is a subset of the population. Any division of the population is sampling (though not always random sampling). Splitting the population in two means two samples, even if those two parts are very large. There’s no rule that says a sample has to be small. The defining characteristic of a sample is that it is partial, and the defining characteristic of a good sample is that it is random.
Consider a very simplistic but illustrative example. Suppose you’re testing a new ad versus the existing ad in an A/B test in which everyone in the A cell is male and everyone in the B cell is female. No matter what the results are, gender can’t be ruled out as the explanation, so it would be imprudent to rely on these A/B test results to pick one ad over the other. Now, obviously, extreme sample biases like this would be detected and corrected. But the problem is that most potential biases either can’t be detected beforehand or aren’t known to be problems to watch out for. Random sampling is the only way to reliably eliminate the possibility that sample differences, not marketing differences, are the explanation for the observed results.
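Here is a minimal sketch of what random assignment buys you, using a made-up population in which gender is recorded. The function and field names are hypothetical; the same logic applies to traits nobody thought to record.

```python
import random

def random_ab_split(units, seed=None):
    """Shuffle the units and cut them into two equal-sized cells, A and B."""
    rng = random.Random(seed)
    shuffled = units[:]          # copy so the original order is left alone
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def share_male(cell):
    return sum(u["gender"] == "M" for u in cell) / len(cell)

# Hypothetical population: 10,000 people, half male and half female.
population = [{"id": i, "gender": "M" if i % 2 == 0 else "F"} for i in range(10_000)]
cell_a, cell_b = random_ab_split(population, seed=1)

print(f"share male in A: {share_male(cell_a):.3f}")
print(f"share male in B: {share_male(cell_b):.3f}")
```

Because chance alone decides who lands in which cell, both cells end up close to the population mix on gender, and by the same reasoning on every other characteristic, measured or not.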
The importance of random sampling for interpreting results brings us to the final reason for not yet counting out random sampling. Big Data analytics require random sampling. The choice posed between random sampling and Big Data is a false dichotomy. Even with Big Data, random sampling remains essential.
When working with samples, only results from random samples can be said with any degree of confidence to be true of the entire population. This is where Big Data enthusiasts jump in to proclaim that having all the data eliminates the need for sample projection, and thus the need for random samples. If you can see the entire population (at an affordable cost and in reasonable time), then, it is asked, why use only a small piece that generates results with a big error range around them?
This is where the logic starts to get circular. The reason that we can’t “see the entire population,” so to speak, is that what we see are patterns that often turn out to be nothing but random clumping. The only way to be sure we are not “seeing” structure and order in a random distribution is to put what we see to the test. Whether that test is an A/B field test or a laboratory simulation or a structural equation model, the population dataset is going to have to be split apart and the parts compared. Sampling is an inherent part of good explanation, not an alternative way of developing explanations that stands in contrast to explanations built on entire populations.
An A/B test is an obvious division of a population into sampled parts. But this is true of simulations and model-building, too. Model parameters are estimated by comparing those in a population who are high on a predictive variable against those who are low. Such a division of the population is always confounded by other factors besides the predictive variable, so random sampling is often needed to reliably estimate model parameters.
Models are only as useful as the accuracy of their predictions. Calculating the accuracy of results in order to calibrate model parameters almost always involves one or more random sampling methods such as holdout data or test cases or Bernoulli simulations.
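Holdout validation, for instance, is random sampling in plain sight. The sketch below sets aside a random 20% of a synthetic dataset, fits a deliberately crude one-threshold "model" on the rest and scores it on the held-out rows. The data, the threshold rule and all names are assumptions invented for the illustration.

```python
import random

def holdout_split(rows, holdout_fraction=0.2, seed=None):
    """Randomly set aside a fraction of the rows for testing."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * (1 - holdout_fraction))
    return shuffled[:cut], shuffled[cut:]

def fit_threshold(train):
    """Toy model: cut-off halfway between the average score of buyers and non-buyers."""
    buyers = [x for x, y in train if y == 1]
    others = [x for x, y in train if y == 0]
    return (sum(buyers) / len(buyers) + sum(others) / len(others)) / 2

def accuracy(threshold, rows):
    return sum((x > threshold) == (y == 1) for x, y in rows) / len(rows)

# Synthetic data: a 0-100 engagement score and a noisy purchase outcome.
rng = random.Random(0)
scores = [rng.uniform(0, 100) for _ in range(1_000)]
rows = [(x, 1 if x + rng.gauss(0, 10) > 50 else 0) for x in scores]

train, holdout = holdout_split(rows, holdout_fraction=0.2, seed=3)
threshold = fit_threshold(train)
print(f"threshold={threshold:.1f}  holdout accuracy={accuracy(threshold, holdout):.3f}")
```

The accuracy figure is trustworthy only because the held-out rows were chosen at random; a holdout set picked by convenience would carry the same risks as any other bad sample.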
Any simulation involves a finite number of iterations, though that number is often very large. So the successive iterations must include a sufficiently wide range of starting points to yield a proper distribution of simulated outcomes. This is generally accomplished by randomizing the initial conditions, which makes the final set of iterations a random sample of all possible outcomes.
In almost every way possible, Big Data analytics are rife with random sampling. Big Data draws on established statistical methodologies, and random sampling is a cornerstone of statistics.
But to note that random sampling is alive and well despite the obituaries being written about it is not to say that less data is better than more data. More data means more options and bigger brand marketing possibilities, particularly in execution and delivery. But figuring out what those options are and how to capitalize on those possibilities in a more competitive, more rapidly changing marketplace will take every bit of smarts and savvy we can bring to the table, of which random sampling is part and parcel.