This week’s US rollout of the first COVID-19 vaccine is a major milestone, a true triumph for scientists, and a massive relief for the rest of us. But it’s also an excuse to revisit my least favorite paper published this year.
That paper, “Estimating the deep replicability of scientific findings using human and artificial intelligence,” was written by a team of researchers at Northwestern led by Brian Uzzi. It was published in PNAS on May 4, and its publication was accompanied by a glowing press release (“AI speeds up search for COVID-19 treatments and vaccines”) and received credulous coverage in outlets like Fortune and The Wall Street Journal.
One of my primary professional interests is using data analysis to systematically identify good science, so I was eager to dig into the paper. Unfortunately, I found that the paper is flawed and doesn’t support the Covid-related story that the authors and Northwestern shared with the media. My initial skepticism has been borne out: vaccines are now being distributed with (as far as I can tell) no help whatsoever from this particular bit of AI. Closer analysis will show that the paper isn’t convincing, that it has nothing to do with Covid, and that its lead author was reckless in how he promoted it.
Problem #1: The machine learning in the academic paper is flawed
The core of the paper is a machine learning model built by the authors that predicts whether or not a paper will replicate. To be technical about it, the model is trained on a dataset of 96 social science papers, 59 of which (61.4%) failed to replicate. The model takes the full text of the paper as an input, uses word embeddings and TF-IDF to convert each text to a 400-dimensional vector, and then feeds those vectors into an ensemble logistic regression/random forest model. The cross-validated results show an average accuracy of 0.69 across runs, compared to the baseline accuracy of 0.614. These are all standard techniques, but skilled machine learning practitioners are already raising their eyebrows about three points, which I’ll walk through after a quick sketch of the pipeline.
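To make that concrete, here is a minimal sketch of the kind of pipeline the paper describes. It is not the authors’ code: I’m assuming 200-dimensional pre-trained word vectors (so that a simple average concatenated with a TF-IDF-weighted average yields 400 dimensions), and the paper_vector helper, the placeholder data, and the hyperparameters are all my own guesses.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def paper_vector(tokens, embeddings, tfidf_weights):
    """Concatenate a simple average and a TF-IDF-weighted average of word vectors."""
    vecs = np.array([embeddings[t] for t in tokens if t in embeddings])
    w = np.array([tfidf_weights.get(t, 1.0) for t in tokens if t in embeddings])
    simple_avg = vecs.mean(axis=0)
    weighted_avg = (vecs * w[:, None]).sum(axis=0) / w.sum()
    return np.concatenate([simple_avg, weighted_avg])  # 2 x 200 = 400 dimensions

# Placeholder data standing in for the 96 labeled papers (~61.4% failed to replicate);
# with real features the paper reports ~0.69 accuracy against the 0.614 base rate.
rng = np.random.default_rng(0)
X = rng.normal(size=(96, 400))  # one 400-dimensional vector per paper
y = rng.random(96) < 0.386      # True = replicated

ensemble = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("rf", RandomForestClassifier(n_estimators=300, random_state=0))],
    voting="soft",
)
print(cross_val_score(ensemble, X, y, cv=10).mean())
```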
The authors don’t have enough data to build a reliable model. The authors have used just 96 examples to build a model with 400 input variables. As mentioned above, the model has two components: a logistic regressor and a random forest. A conventional rule of thumb is that logistic regression requires a minimum of 10 examples per variable, which would suggest that the authors need 40x more data. “Underdetermined” doesn’t even begin to describe the situation.
The data needs of random forests are harder to characterize. While geneticists routinely use random forests in settings with more variables than examples, their use case is typically focused on determining variable importance rather than on actually making predictions. And indeed, some research suggests that random forests need more than 200 examples per variable, or more than 800x the data the authors have.
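To make that arithmetic explicit (the rules of thumb are heuristics, not hard requirements, and the multipliers are back-of-the-envelope):

```python
n_papers, n_features = 96, 400

needed_logistic = 10 * n_features   # ~10 examples per variable for logistic regression
needed_forest = 200 * n_features    # ~200 examples per variable suggested for random forests

print(needed_logistic / n_papers)   # ~42x the available data
print(needed_forest / n_papers)     # ~833x the available data
```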
The bottom line is that you can’t build a reliable machine learning model on just 96 papers.
The model structure is too complicated. Model structure is the garden of forking paths for machine learning. Adjustments to a model can improve its performance on available data while reducing performance on unseen data. (And no, cross-validation alone doesn’t fix this!) The model structure the authors describe is reasonable enough, but it also includes some clearly arbitrary choices, like using both logistic regression and random forests (rather than just picking one) or aggregating word vectors using both simple averaging and TF-IDF (again, rather than just picking one). With just 96 examples in the dataset, each version of the model that the authors tried had a real chance of showing a cross-validation accuracy that looked like success despite arising from chance. In context, trying multiple model architectures is the equivalent of performing subgroup analyses.
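A quick simulation shows why this matters. The sketch below is my own: it generates pure-noise features with roughly the paper’s class balance, fits a handful of candidate models, and keeps the best cross-validated score. With only 96 examples, that winner can clear the 0.614 baseline by luck alone.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(96, 400))  # pure-noise "paper vectors"
y = rng.random(96) < 0.386      # ~38.6% "replicate"; the labels carry no signal

candidates = [
    LogisticRegression(max_iter=1000),
    LogisticRegression(C=0.1, max_iter=1000),
    RandomForestClassifier(n_estimators=300, random_state=0),
    RandomForestClassifier(max_depth=3, n_estimators=300, random_state=0),
]
scores = [cross_val_score(m, X, y, cv=5).mean() for m in candidates]
print(max(scores))  # the "best" architecture's CV accuracy, inflated by the selection itself
```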
The effect size is too small. Increasing accuracy from the baseline of 0.614 to 0.69 is too small an effect to achieve statistical significance, particularly in light of the small sample size and the large number of degrees of freedom in model design. The paper’s statistical analyses do generate pleasing p-values (p<0.001), but they only demonstrate that the model is effective on this particular set of papers.
What we’re actually interested in, though, is whether the model outperforms the baseline on unseen data (i.e., whether it has lower generalization error). Performing inference about generalization error is a challenging task (and there isn’t a single agreed-upon methodology).
But as a sanity check, consider the t-test we would use to determine, for example, whether one diagnostic test is more accurate than another when given to patients. The cynical baseline (predicting that nothing ever replicates) gives an accuracy of 0.614 on these 96 papers. The authors’ model achieves an accuracy of 0.69 on those same papers. That gives a one-tailed p-value of 0.134 — a delightful value for a paper that is itself about replicability.
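For the curious, here is a back-of-the-envelope version of that calculation: a pooled two-proportion z-test (essentially the large-sample form of the t-test described above), treating each of the 96 papers as a trial for both the baseline and the model.

```python
from scipy.stats import norm

n = 96
p_base, p_model = 0.614, 0.69  # cynical baseline vs. the model's reported accuracy

p_pool = (p_base + p_model) / 2
se = (p_pool * (1 - p_pool) * 2 / n) ** 0.5
z = (p_model - p_base) / se
print(norm.sf(z))              # one-tailed p-value, roughly 0.13
```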
This point isn't just pedantry; I'm genuinely unsure whether the model will actually outperform the cynical baseline on unseen data. I don't know what the base rate for replication is in the test sets. I did track down the replication status for one set (Row 2 in the paper) and found that 7 of the 8 results in that set failed to replicate, so our cynical baseline achieves an accuracy of 0.875 — outperforming the “AI” model by a wide margin on this admittedly small set.
Let me be very clear: these are fundamental problems. After reviewing the paper, I’m not confident that their machine learning model adds any value at all. It reflects poorly on PNAS that this paper made it through peer review. Unfortunately, general scientific journals, no matter how prestigious, don’t seem equipped to effectively review papers involving machine learning; Nature’s infamous paper on predicting earthquakes with deep learning, for example, was widely criticized in the machine learning community.
Problem #2: The paper has nothing to do with Covid
Let’s set aside every issue I’ve raised to this point and accept that the authors really can identify social science papers that are less likely to replicate. That still doesn’t make it relevant to Covid.
Their entire system is premised on picking up subtle linguistic markers that supposedly indicate when a researcher (perhaps subconsciously) believes she’s performing sub-par science. Uzzi compares the approach to reading body language.
But there’s no reason to believe that the linguistic “body language” of psychologists tells us anything about the body language of Covid-19 researchers. Psychology and virology are very different fields with different conventions even in normal times. The pandemic itself has undoubtedly impacted word choices, as papers written under extreme time pressure by researchers from around the world get shared to pre-print servers rather than being polished and published in journals. At a minimum, the model would have to be significantly adjusted to be applied to Covid research.
Problem #3: Northwestern and Brian Uzzi crossed the line promoting this paper
Self-promotion is a natural and even important part of science; good research doesn’t always get the attention it deserves. And certainly the decade-long AI boom has been driven forward by rosy projections about what AI can accomplish. But the paper’s lead author, Brian Uzzi, went too far in his efforts to promote it.
The paper was published just two months into the pandemic at a time when the trauma felt more acute than chronic. The uncertainty and fear fueled a desperation for anything that might end the ordeal. In that environment, putting out a press release entitled “AI speeds up search for COVID-19 treatments and vaccines” takes on a moral dimension.
The scientists and trial participants who brought us a vaccine in record time are heroes. Meanwhile, the Wall Street Journal coverage of this paper now has a correction appended:
Northwestern University researchers will make an artificial-intelligence tool designed to rate the promise of scientific papers on Covid-19 vaccines and treatments available when testing is completed. An earlier version of this article incorrectly said the tool would be available later this year.
Indeed.
Correction: Due to an oversight on my part, this post originally misrepresented the results the model achieved on the test data. It has been updated to correct that error.