When a small sample isn’t just a small sample

Pretty much every Padres fan on Twitter and the radio dial responded to the Padres’ 2016 start with some flavor of the “small sample size” argument. Even the Padres’ team carolers got in on the fun.

Let me tell you, there’s nothing I love better than when someone “learns” a topic on the very surface and then proceeds to apply it to situations where it does not, in fact, apply.

The Padres’ start to 2016 is one of these situations where it doesn’t apply, and it didn’t even apply after two shutouts (when I originally started writing this), let alone three.

Let me start with a “real life” example. Suppose I wanted you to tell me how good of a person Bob is based on a small amount of information. If I told you two activities out of 162 that Bob did, and they were walking the dog and brushing his teeth, you wouldn’t know shit about Bob’s quality as a human. But if the two activities I told you were: Bob farted in the elevator and Bob ate the last slice of pizza…well, Bob is much more likely to be a shitty person.

What made brushing teeth and dog walking a meaningless sample was the dual fact that the sample was small and the sampled data points were common in the overall population of both good and bad humans. Since both assholes and nice guys brush their teeth roughly the same amount, we didn’t learn anything by having that information.

Going back to baseball, in simple terms: bad teams get shut out far more frequently than good teams, and they are far more likely still to get shut out three times in a row.

Take, for example, the drastic difference between the 1998 Padres and the 2014 Padres. Intuitively, we know that the 2014 Padres were much more likely to get shut out twice in a row than the 1998 Padres: the 2014 Padres were shut out 19 times, while the 1998 Padres were shut out 7 times. And that’s before we even get to the very real difference in quality between the two teams that we know to be true.

Let’s suppose that those shutout totals are an exact reflection of the teams’ odds of getting shut out: a team of the 1998 Padres’ caliber will always be shut out exactly 7 times in 162 games and a team of the 2014 Padres’ caliber will always be shut out exactly 19 times. With this framework, it becomes a (simple) exercise in Bayesian probability to determine several relevant features of the Padres’ two-game sample in 2016.

Allow me to explain.

If the order of a team’s 162 game results is truly random, then there are 162 * 161 = 26,082 different ways to pick which game comes first and which comes second. Call each of those a simulation of the first two games of the season. If you randomly arrange, i.e. sample, the 2014 Padres’ results, 19 * 18 = 342 of those 26,082 simulations have the 2014 Padres getting shut out twice in a row to start the season. For the 1998 Padres, only 7 * 6 = 42 of those 26,082 simulations start the season with successive shutouts.

So, if I told you that I was holding a simulation that starts the season with two shutouts and told you that it belonged to either the 2014 Padres or the 1998 Padres, what are the odds that it belongs to the bad team?

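If we treat the two teams as equally likely before seeing any games, then since 342 of the 342 + 42 = 384 total simulations which start with consecutive shutouts belong to the 2014 Padres, the odds are 342 / 384 = 89%.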
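For anyone who wants to check the arithmetic, here is a minimal sketch of that two-team comparison in Python. It assumes exactly what the text assumes: fixed shutout totals of 19 and 7, randomly ordered schedules, and an even prior between the two teams. The variable names are mine and purely illustrative.

```python
from fractions import Fraction
from math import perm  # perm(n, k) = n * (n-1) * ... * (n-k+1), Python 3.8+

GAMES = 162
SHUTOUTS = {"2014 Padres": 19, "1998 Padres": 7}  # season totals from the text

# Ordered ways to choose which two games open the season: 162 * 161.
total_simulations = perm(GAMES, 2)

# Simulations that open with two shutouts, per team: 19 * 18 and 7 * 6.
shutout_starts = {team: perm(n, 2) for team, n in SHUTOUTS.items()}

# Assumption: both teams are equally likely beforehand, so the chance that a
# two-shutout start belongs to a team is just its share of those simulations.
all_shutout_starts = sum(shutout_starts.values())
posterior = {
    team: Fraction(count, all_shutout_starts)
    for team, count in shutout_starts.items()
}

for team, prob in posterior.items():
    print(f"{team}: {shutout_starts[team]} of {all_shutout_starts} = {float(prob):.1%}")
```

The 26,082 never enters the final ratio because it is the same denominator for both teams and cancels out.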

The thing is, we actually have a two-game simulation: it’s called 2016! And with just those two games we can conclude that it’s about eight times more likely (342 to 42) that the 2016 Padres are the 2014 Padres than the 1998 Padres. If we now expand out to three shutouts, the counts become 19 * 18 * 17 = 5,814 and 7 * 6 * 5 = 210, so the odds of it belonging to the 2014 Padres, rather than the 1998 Padres, are 5814 / (5814 + 210) = 96.5%.
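The same counting extends to any number of season-opening shutouts. Here is a rough sketch, again assuming fixed shutout totals of 19 and 7, random ordering, and a 50/50 prior between the two teams; the function names are hypothetical, made up for this illustration.

```python
from math import perm  # perm(n, k) counts ordered picks of k items from n

def shutout_start_ratio(k, bad_shutouts=19, good_shutouts=7):
    """How many times more often the bad team opens with k straight shutouts."""
    return perm(bad_shutouts, k) / perm(good_shutouts, k)

def chance_bad_team(k, bad_shutouts=19, good_shutouts=7):
    """Share of k-shutout starts that belong to the bad team (even prior assumed)."""
    bad = perm(bad_shutouts, k)
    good = perm(good_shutouts, k)
    return bad / (bad + good)

for k in (1, 2, 3):
    print(f"{k} shutout(s): ratio {shutout_start_ratio(k):.1f}, "
          f"chance it is the bad team {chance_bad_team(k):.1%}")
```

Each extra shutout multiplies the ratio by a bigger factor than the last (19/7, then 18/6, then 17/5), which is why the third shutout moves the needle so much.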

A small, but meaningful, sample may be a foreign concept to many, but it shouldn’t be. Let’s go back to another human example: pretend Bob killed a guy one day. The average day for Bob probably looks very similar to an average day for anyone else; if we sampled one day of his life, most of the time we’d have a completely innocuous sample. Some day back in second grade where he traded pogs and watched Scooby-Doo wouldn’t give us any clue, and we would be right to state that it was too small of a sample for us to judge Bob. However, if we just happened to sample the day Bob murdered a guy, we’d be super wrong to say “small sample”.

That’s because the specific results of that day and, going back to the baseball case, those games are rare events that occur significantly more often for bad people/teams than for good people/teams. In the Padres’ case, these weren’t three 7-3 losses. There is far more information to glean from three shutouts than from three 7-3 losses. Good and bad teams both post many 7-3 losses in a season.

Of course, this isn’t the whole equation. We have many more teams that we can compare the 2016 Padres to than just the 1998 and 2014 Padres. A true result would take a probability distribution across the three-game shutout simulation odds of every team ever, and weight the final win totals of those teams by their respective simulation odds. For example, if 1% of simulations came from a 100 win team, and 99% came from a 70 win team, we could project the Padres for 0.01 * 100 + 0.99 * 70 = 70.3 wins from just the shutout information alone.
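That projection is just a weighted average of win totals. A minimal sketch, using only the made-up 1% / 99% split from the example above (not real team data); the function name and the input format are assumptions for illustration.

```python
def projected_wins(mixture):
    """Weighted-average win total.

    mixture: list of (probability, wins) pairs, one per candidate team,
    where each probability is that team's share of the shutout-start
    simulations. The shares are expected to sum to 1.
    """
    total_share = sum(share for share, _ in mixture)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("simulation shares must sum to 1")
    return sum(share * wins for share, wins in mixture)

# Hypothetical split from the text: 1% from a 100 win team, 99% from a 70 win team.
example = [(0.01, 100), (0.99, 70)]
projection = projected_wins(example)
print(f"Projected wins: {projection:.1f}")
```

Doing this for real would mean swapping the two-entry list for one entry per historical team, which is the part I’m skipping.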

I’m not going to do that simulation, but I’m sure you get the point by now: small samples can still be meaningful, if the events sampled into the distribution are, by themselves, rare and significant.

Unfortunately for the 2016 Padres, that puts them in a meaningful, miserable spot.

