This past week I got around to a deeper reading of *Statistics Done Wrong: The Woefully Complete Guide* by Alex Reinhart.

This is the book that I wish Seth Stephens-Davidowitz had picked up at some point.

It’s all about how to screw up your experimental design and data analysis. It draws from copious examples, the majority of which come from *published studies* in *reputable journals* 😲.

Reinhart covers why and how a groundbreaking discovery might not be real. For example, that thing about women’s periods syncing up? There’s no definitive evidence to support that.

He covers how and why scientists might miss small but important effects—like additional pedestrian fatalities after legalizing right turns at red lights.

He covers why news articles depict doctors frequently changing their minds about which foods are good for you—several studies’ means might all live in the same ballpark, but two of their confidence intervals might include zero (indicating likely no effect) and the other three might have confidence intervals that *just barely* don’t include zero (indicating a likely significant effect).

This isn’t really a one-page set of notes because I have three pages on this book: two about the book itself, then a third one on some follow-up resources. But we can talk about this book in a couple of different parts.

Here we’ll cover the first six chapters. These six chapters’ examples come chiefly from classification problems: we have this mean in the control group, we have this other mean in the experimental group, and we need to classify our treatment as *effective* or *ineffective* based on the distance between those two means relative to our *certainty* about those means.

#### Confidence Intervals and Statistical Power

So we start off with some metrics to help us quantify certainty: *confidence intervals* and *statistical power*. The more examples we have in our dataset, the more *certain* we can be about the means we extract (thanks, law of large numbers!).

It turns out that confidence intervals can give us a much clearer picture than the commonly-referenced *p values* about our certainty in our data. A confidence interval says ‘We calculated this mean. Based on how much data we collected, the *true mean* of the phenomenon that this data *samples* could be anywhere in this range, with X percent certainty.’

For some reason, `scikit-learn` models don’t come standard with confidence interval calculations. The library has a number of pull requests and open issues about this, with a lot of discussion in them around how this works for high-dimensional data, whether it applies here or here, et cetera et cetera. (Start here, if you’re interested.) So far I have found only one released bolt-on solution for `scikit-learn` that gives you confidence intervals, and it only works for `RandomForestRegressor` and `RandomForestClassifier`. `scipy`’s `stats` module gives you a method for calculating a confidence interval with the normal distribution, which assumes that you have a large number of examples in your dataset (at least 120, and preferably thousands).

What if your dataset is teeny?

You can still use a confidence interval that employs the t-distribution, a flatter, wider distribution than the normal distribution that gives us a starting point for small sample sizes (generally described as 30 or fewer). The t-distribution’s critical values start to line up with the normal distribution’s z-scores right around 120 samples. Keep that in mind as you’re tallying up your example totals.

Here is some code to calculate a confidence interval with the t-distribution:
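If you want to watch that convergence happen, here is a minimal sketch that compares two-sided 95% critical values from the t-distribution against the normal distribution's. The sample sizes are ones I picked for illustration, not numbers from the book.

```python
from scipy.stats import norm, t

confidence = 0.95
upper_tail = 1.0 - (1.0 - confidence) / 2.0  # 0.975 for a two-sided 95% interval

normal_critical = norm.ppf(upper_tail)  # the familiar value of roughly 1.96

# Illustrative sample sizes (an assumption for this example).
for sample_size in (5, 30, 120, 1000):
    t_critical = t.ppf(upper_tail, sample_size - 1)
    print(f"n={sample_size:>4}: t={t_critical:.3f}, normal={normal_critical:.3f}, "
          f"gap={t_critical - normal_critical:.3f}")
```

The gap shrinks as the sample size grows, which is why the normal approximation gets away with ignoring it once you have plenty of data.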

```python
import math

import numpy as np
from scipy.stats import t


def confidence_interval_for(samples, confidence=0.95):
    sample_size = len(samples)
    degrees_freedom = sample_size - 1
    outlier_tails = (1.0 - confidence) / 2.0
    # t.ppf returns the (negative) lower-tail cutoff, so flip its sign.
    t_distribution_number = -1 * t.ppf(outlier_tails, degrees_freedom)

    # Standard error of the mean; ddof=1 gives the sample standard deviation.
    step_1 = np.std(samples, ddof=1) / math.sqrt(sample_size)
    step_2 = step_1 * t_distribution_number

    low_end = np.mean(samples) - step_2
    high_end = np.mean(samples) + step_2

    return low_end, high_end
```
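As a cross-check on a hand-rolled interval, SciPy's t-distribution object can do the same arithmetic through `t.interval`. This is a sketch rather than a replacement for the function above, and the measurements are made up for the example.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements, made up for this example.
samples = [4.1, 5.3, 4.8, 5.9, 5.0, 4.4]

sample_mean = np.mean(samples)
standard_error = stats.sem(samples)  # sample std (ddof=1) divided by sqrt(n)
degrees_freedom = len(samples) - 1

# The confidence level is passed positionally because its keyword name
# has changed across SciPy versions.
low_end, high_end = stats.t.interval(0.95, degrees_freedom,
                                     loc=sample_mean, scale=standard_error)
print(low_end, high_end)
```

If the two approaches disagree, the usual culprit is the standard deviation: `np.std` divides by n unless you pass `ddof=1`, while `stats.sem` divides by n - 1 by default.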

Suppose you have some data that you’ve grouped together like so:

And you want confidence intervals for all of your groupings.

I gotchu, boo:

```python
import math

from scipy.stats import t


def confidence_interval_for_collection(sample_size, standard_deviation, mean, confidence=0.95):
    degrees_freedom = [count - 1 for count in sample_size]
    outlier_tails = (1.0 - confidence) / 2.0
    # One critical value per group, since each group has its own degrees of freedom.
    t_distribution_number = [-1 * t.ppf(outlier_tails, df) for df in degrees_freedom]

    # standard_deviation should hold sample standard deviations (ddof=1).
    step_1 = [std / math.sqrt(count) for std, count in zip(standard_deviation, sample_size)]
    step_2 = [step * t_number for step, t_number in zip(step_1, t_distribution_number)]

    low_end = [mean_num - step_num for mean_num, step_num in zip(mean, step_2)]
    high_end = [mean_num + step_num for mean_num, step_num in zip(mean, step_2)]

    return low_end, high_end
```
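If your group summaries already live in NumPy arrays, the same calculation can be written without the list comprehensions, since `t.ppf` accepts arrays. A minimal sketch, with group sizes, means, and standard deviations that I invented for illustration:

```python
import numpy as np
from scipy.stats import t

# Hypothetical per-group summary statistics, made up for this example.
group_sizes = np.array([12, 30, 8])
group_means = np.array([5.0, 6.5, 4.2])
group_stds = np.array([1.1, 2.0, 0.9])  # sample standard deviations

confidence = 0.95
upper_tail = 1.0 - (1.0 - confidence) / 2.0

# t.ppf broadcasts over the array, so each group gets its own critical value.
critical_values = t.ppf(upper_tail, group_sizes - 1)
margins = critical_values * group_stds / np.sqrt(group_sizes)

low_ends = group_means - margins
high_ends = group_means + margins

for size, low, high in zip(group_sizes, low_ends, high_ends):
    print(f"n={size}: ({low:.2f}, {high:.2f})")
```

Notice that the smallest group gets the largest critical value, which is the t-distribution charging you extra for having less data.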

Statistical power, it turns out, is more complicated to calculate. To fully explain it, Reinhart points us to a 560-page book called *Statistical Power Analysis for the Behavioral Sciences*. To his credit (as well as the author’s and whichever soul scanned all 560 of those pages), it’s available for free online. I haven’t gotten through it yet. In the meantime, SAS and a number of other statistical analysis programs can help you with this, as could a statistical consultant. I’ll give you code when I have it.
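Until then, one rough stand-in that skips the closed-form math is simulation: fake the experiment many times with an effect you assume is real, and count how often a t-test notices it. This is a sketch under loud assumptions (normally distributed data, equal group sizes, a standard deviation of 1 in both groups), and `simulated_power` is a name I made up, not something from Reinhart's book or from SciPy.

```python
import numpy as np
from scipy.stats import ttest_ind


def simulated_power(effect_size, samples_per_group, alpha=0.05, trials=2000, seed=0):
    """Estimate power as the share of simulated experiments that reach p < alpha."""
    rng = np.random.default_rng(seed)
    # Each row is one simulated experiment. Both groups have a standard deviation
    # of 1, so effect_size is the gap between means in standard deviations.
    control = rng.normal(0.0, 1.0, size=(trials, samples_per_group))
    treatment = rng.normal(effect_size, 1.0, size=(trials, samples_per_group))
    p_values = ttest_ind(control, treatment, axis=1).pvalue
    return float(np.mean(p_values < alpha))


print(simulated_power(effect_size=0.5, samples_per_group=64))
print(simulated_power(effect_size=0.5, samples_per_group=20))
```

With a medium-sized effect, 64 examples per group should catch it roughly four times out of five, while 20 per group misses it more often than not. That second case is the underpowered study Reinhart warns about.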

#### T-Tests

We also have an important metric for quantifying *distance* between our means: t-tests.

You can have two means whose confidence intervals overlap that *still* possess a meaningful difference. Enter the t-test!

```python
import math

from scipy.stats import t


def t_test_for(num_samples_1, standard_deviation_1, mean_1,
               num_samples_2, standard_deviation_2, mean_2, confidence=0.95):
    alpha = 1 - confidence
    total_degrees_freedom = num_samples_1 + num_samples_2 - 2
    # Two-sided test: split alpha across both tails, since we compare abs(t_value) below.
    t_distribution_number = -1 * t.ppf(alpha / 2.0, total_degrees_freedom)

    degrees_freedom_1 = num_samples_1 - 1
    degrees_freedom_2 = num_samples_2 - 1

    sum_of_squares_1 = (standard_deviation_1 ** 2) * degrees_freedom_1
    sum_of_squares_2 = (standard_deviation_2 ** 2) * degrees_freedom_2

    # Pooled variance, which assumes the two groups share a common variance.
    combined_variance = (sum_of_squares_1 + sum_of_squares_2) / (degrees_freedom_1 + degrees_freedom_2)

    first_dividend_addend = combined_variance / float(num_samples_1)
    second_dividend_addend = combined_variance / float(num_samples_2)

    denominator = math.sqrt(first_dividend_addend + second_dividend_addend)
    numerator = mean_1 - mean_2
    t_value = float(numerator) / float(denominator)

    accept_null_hypothesis = abs(t_value) < abs(t_distribution_number)  # results are not significant

    return accept_null_hypothesis, t_value
```

We have written here a method that will tell us, firstly, whether to accept or reject the *null hypothesis*, which assumes no meaningful difference between the two sets of data we want to compare. I have named that output ‘accept_null_hypothesis’ because I don’t love the ubiquitous use of the confounding phrase ‘*reject the null hypothesis*’ in scientific inquiry. It’s a double negative (*reject* the *absence* of meaningful difference), which adds an unnecessary additional piece of mental acrobatics to the (already frequently herculean) task of determining what, exactly, the scientists are trying to say in their conclusion paragraph.

We are going with *accept* the *absence of* meaningful difference as the variable name for two reasons. First of all, we remove the double negative this way. Second of all, accepting the null hypothesis is (or should be) the outcome of the vast majority of scientific inquiry. Scientists, collectively, test a *whole bunch* of stuff to see what has an effect. Most of the things tried, it turns out, don’t have that effect. So our `accept_null_hypothesis` value will usually be true. When it’s false, we should sit up and pay attention.
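Back to the claim that two means can have overlapping confidence intervals and *still* differ meaningfully: here is a small sketch of exactly that situation. The summary statistics are invented for the example, `interval` is a throwaway helper, and the test itself comes from SciPy's `ttest_ind_from_stats` rather than the hand-written function above.

```python
from math import sqrt

from scipy.stats import t, ttest_ind_from_stats

# Hypothetical summary statistics for two groups, made up for this example.
n_1, mean_1, std_1 = 50, 10.0, 2.0
n_2, mean_2, std_2 = 50, 11.0, 2.0


def interval(n, mean, std, confidence=0.95):
    """Two-sided t-based confidence interval from summary statistics."""
    margin = t.ppf(1.0 - (1.0 - confidence) / 2.0, n - 1) * std / sqrt(n)
    return mean - margin, mean + margin


low_1, high_1 = interval(n_1, mean_1, std_1)
low_2, high_2 = interval(n_2, mean_2, std_2)
intervals_overlap = low_2 < high_1 and low_1 < high_2

# Pooled-variance (equal_var=True by default), two-sided t-test.
result = ttest_ind_from_stats(mean_1, std_1, n_1, mean_2, std_2, n_2)

print(f"group 1: ({low_1:.2f}, {high_1:.2f})")
print(f"group 2: ({low_2:.2f}, {high_2:.2f})")
print(f"overlap: {intervals_overlap}, t={result.statistic:.2f}, p={result.pvalue:.3f}")
```

The two intervals overlap, yet the p-value lands below 0.05. Eyeballing whether intervals touch is a stricter test than the t-test, so it can hide a difference that the t-test picks up.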

#### …which brings us to multiple comparisons

This is an important topic: Reinhart devotes three chapters in this first part of the book to it.

Multiple comparisons: when scientists compare the *same* examples to their dependent variable multiple times to look for correlates or causes.

When a study uses 1,000 cell samples, all of which came from one of the same two mice, we have multiple comparisons by *autocorrelation.*

When we study a variety of stock prices and include each of their year-over-year returns as separate data points in our model, we have multiple comparisons by taking multiple measurements.

And then there’s this:

Ah, yes. You may have 95% certainty that any one comparison *isn’t* statistically significant by fluke, but when you run a bunch of comparisons, eventually *one* of them *will* be a fluke. In fact, when you run 100 separate, independent comparisons on data with no real effects, the likelihood that *none* of them comes up significant by fluke drops to under 1% (0.95^100 ≈ 0.6%).
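The arithmetic behind that is short enough to run yourself. This sketch assumes the comparisons are independent and that none of them has a real effect, which is the worst case for fluke-hunting:

```python
alpha = 0.05  # chance that a single no-effect comparison looks significant by fluke

for comparisons in (1, 10, 20, 100):
    chance_no_flukes = (1 - alpha) ** comparisons
    print(f"{comparisons:>3} comparisons: "
          f"{chance_no_flukes:.1%} chance of zero flukes, "
          f"{1 - chance_no_flukes:.1%} chance of at least one")
```

Even at 20 comparisons, you are more likely than not to get at least one fluke.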

This is also what’s happening when you take a group of people with some incidence rate of cancer and you hand them a questionnaire with 100 questions on it: meat consumption, egg consumption, dairy consumption, soybean consumption, exercise, sleep habits, etc. Even if *none* of these questions has *anything* to do with cancer, at some point *one* of the answer sets will line up with who got cancer by pure fluke. Your questions could be about the most ridiculous garbage imaginable: favorite casual reading genre, self-described level of narcissism, third favorite pizza topping, color of favorite pair of shoes, and this is *still* going to happen.
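You can fake that questionnaire in a few lines. In this sketch the cancer outcomes and the answers are generated independently, so every "significant" question is a fluke by construction. The 500 respondents, the 20% incidence rate, and the numeric answer scale are all assumptions I made up for the demonstration.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
n_people, n_questions = 500, 100

# Who got cancer and how they answered are generated independently,
# so no question has any real relationship to the outcome.
got_cancer = rng.random(n_people) < 0.2
answers = rng.normal(size=(n_people, n_questions))

# One t-test per question: do the two groups answer differently?
p_values = ttest_ind(answers[got_cancer], answers[~got_cancer], axis=0).pvalue
fluke_questions = np.flatnonzero(p_values < 0.05)

print(f"{len(fluke_questions)} of {n_questions} unrelated questions look significant")
```

On average about five of the hundred questions will clear the p < 0.05 bar, and any one of them would make a perfectly plausible-sounding headline.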

In fact, Reinhart cites a memorable example in which researchers demonstrated exactly that. They determined, by making a staggering number of bogus comparisons and picking the most fortuitously significant-looking one, that listening to the song “When I’m Sixty-Four” made a randomly-assigned group of undergraduates a year and a half younger than another group that listened to the song “Kalimba,” when controlling for their fathers’ ages. In addition to demonstrating the problem with multiple comparisons, that study fattens my growing collection of evidence that, by comparison to plenty of other people, I’m not even *that* salty. Huzzah!

#### Conclusion

I’ve glossed over some of the important concepts in the first part of Reinhart’s book, so if you’re looking for something specific I recommend checking for it in the handwritten notes in the first photograph or, even better, picking up the book yourself! I went through the revised and expanded version, but the free version available online is also quite good and, in fact, I have consulted it on a few occasions during projects.

Bottom line: the distance between your control and experimental outcomes only matters relative to your certainty about their locations. Much of the math that goes into establishing the meaningfulness of study results boils down to striking a balance between those two things. And even when that balance is struck, there is still a chance that the seemingly meaningful result is a fluke or the too-small-to-establish-meaning difference is, in fact, meaningful.

The world is a messy place, and our data and models represent only *approximations* of it. Keep that in mind.

I’ve shared more notes on the second part of this book, if you’re interested.

#### If you like it when I talk about data science, you might also enjoy:

An Experiment in Making Error Analysis for Regression Models Suuuper Sweeeet

Chelsea Explains Gradient Descent with Crayons, Gets Shown Up by Mentor with Javascript Animation

The Software Side of Getting Good Data (third post in a series)