Relevance, Validity, and Analytical Fiddling
In my AAEA presidential address, “Why Should I Believe Your Applied Economics?” (in press, AJAE Feb 2018), I called for research that is both relevant and valid. The toughest question from the audience asked how to handle cases where the researcher distorts results to boost their importance. In short, does analytical fiddling mean that the goals of relevance and validity are at loggerheads?
In a world where citations drive impact and novelty drives citations, the temptation is great for a researcher to fiddle analytically until hitting upon some result that is both novel and statistically significant. Analytical fiddling may take the form of p-hacking in statistical models or of choosing biased parameters in simulation or optimization models. But analytical fiddling also serves essential, benevolent purposes. Here are four follow-up thoughts to that very important question.
First, applied economists are making progress in defending against the diffusion of distorted research results. For hypothesis tests, the arsenal of defenses against biased results is growing. Some responses to p-hacking, for example, are not new. Statisticians going back to Carlo Emilio Bonferroni have highlighted the need to correct significance thresholds for multiple hypothesis tests. Unfortunately, few researchers acknowledge how many hypothesis tests they conduct in search of the final model specification(s) submitted for publication.
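The Bonferroni logic is simple arithmetic: with m tests, each individual test is judged against alpha/m rather than alpha. A minimal sketch, with invented p-values (the function name and numbers are illustrative, not drawn from any study cited here):

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Return, for each test, whether it survives the corrected threshold.

    With m tests, judging each against alpha / m caps the family-wise
    probability of at least one false positive at roughly alpha.
    """
    m = len(p_values)
    threshold = alpha / m
    return [p <= threshold for p in p_values]

# Four hypothetical p-values; the corrected threshold is 0.05 / 4 = 0.0125,
# so 0.04 no longer counts as significant even though it clears 0.05.
print(bonferroni_reject([0.003, 0.04, 0.012, 0.30]))
# -> [True, False, True, False]
```

The point of the example is the one the paragraph makes: a result that looks significant at the conventional 5% level may not survive once the full count of tests behind it is acknowledged.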
Happily, microeconometricians have made great strides in checking empirical validity. Identification strategies have become much more explicit, and results of alternative specifications are routinely supplied in main text and online supplements to journal articles (albeit at the notable cost of greater length). Equally important for economic experiments has been the rise of analysis registries. The American Economic Association’s Registry for Randomized Controlled Trials (RCTs) is a case in point. It enables RCT researchers to register their experimental design, analysis plan, and other details before undertaking the analysis, allaying concerns that statistically significant results were arrived at by p-hacking. As Angrist and Pischke (2010) observe, much has been accomplished to “take the Con out of Econometrics.” Although replication remains rare in econometrics, it is at least becoming more feasible.
Second, economists should not allow the focus on p-hacking to distract us from weighing the importance of Type I versus Type II errors. The recent literature on p-hacking and on the bias of the published literature toward statistically significant studies focuses on tests that limit Type I error to an alpha level of 5% or 1% (e.g., Simonsohn et al., 2014). The consequences of rejecting the null when it is true may be grave in instances such as costly medical cures. But the problem of false negatives is equally important.
Failing to reject the null when it is false (Type II error) can occur either because the significance threshold is set too stringently or because the power of the test is too low. For many economic research questions, a false negative may be worse than a false positive. Consider a recent PNAS paper that could not reject the null hypothesis that farmer net revenue was unchanged upon removing 10% of the cropped area from corn and soybean fields (Schulte et al., 2017). The lion’s share of the reported 95% confidence interval around net revenue change fell below zero. Had the authors focused on the risk of failing to detect a change in net revenue by narrowing the confidence interval to, say, 80%, would they have found that farmers stood to lose money four years in five? Weighing the cost of the wrong decision is hardly new (Manderscheid, 1965), but it is too important to neglect.
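The interval-width arithmetic behind that question can be sketched with invented numbers (not the Schulte et al. estimates): the same point estimate and standard error yield a 95% interval that straddles zero but an 80% interval that lies entirely below it.

```python
from statistics import NormalDist

def normal_ci(estimate, std_err, level):
    """Two-sided normal-approximation confidence interval."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # two-sided critical value
    return (estimate - z * std_err, estimate + z * std_err)

# Hypothetical mean change in net revenue and its standard error.
est, se = -25.0, 15.0
lo95, hi95 = normal_ci(est, se, 0.95)   # roughly (-54.4, 4.4): straddles zero
lo80, hi80 = normal_ci(est, se, 0.80)   # roughly (-44.2, -5.8): all below zero
print(f"95% CI: ({lo95:.1f}, {hi95:.1f})")
print(f"80% CI: ({lo80:.1f}, {hi80:.1f})")
```

At 95% confidence the estimated loss is "not significant"; at 80% the same data point to a loss. Which error matters more is a decision-theoretic question, not a statistical default.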
Equally important for avoiding Type II error is having sufficient sample size not only to distinguish whether an effect exists, but also to measure how big it is. A new meta-analysis of over 6,700 empirical studies in the Economic Journal reports that an overwhelming share lack the power to measure the effects that they wish to capture, typically resulting in overestimates (Ioannidis et al., 2017).
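The power-versus-sample-size arithmetic that the meta-analysis turns on can be sketched with the textbook normal approximation for a two-group comparison; the function and the effect sizes below are illustrative, not taken from Ioannidis et al.:

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate sample size per group to detect a standardized
    effect size d in a two-group comparison, using the normal
    approximation n = 2 * (z_{1-alpha/2} + z_{power})^2 / d^2."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    return ceil(2 * (z_a + z_b) ** 2 / d ** 2)

print(n_per_group(0.5))   # medium effect: about 63 per group
print(n_per_group(0.2))   # small effect: about 393 per group
```

Halving the effect size quadruples the required sample, which is why so many studies built to detect whether an effect exists are far too small to measure how big it is.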
Third, analytical fiddling in the statistical world has very respectable roots, not just rotten ones like p-hacking. In his 1978 book on specification searches with nonexperimental data, Edward Leamer builds a typology of specification searches. He calls his sixth and last type of search “post-data model construction to improve an existing model.” Such searches play a special role epistemologically. Leamer invokes Sherlock Holmes’ dictum that a hypothesis should not be developed until all the evidence is in hand. That notion underpins the practice of inductive hypothesis generation. Inductive reasoning often stimulates the development of theory that only later engenders testable hypotheses.
Fourth, analytical fiddling offers similar risks and benefits in normative models as in statistical ones. Normative models can be valuable for running scenarios on phenomena that have not occurred, meaning that statistical data are unavailable. Important current application areas include predicting economic consequences of policies to create international trade advantage and to mitigate or adapt to climate change. Many normative modeling exercises involve interaction between economics and other disciplines.
How to avert invalid results in normative modeling? Rigorous peer review is essential, but the “black box” nature of many simulation and optimization models can make review difficult. Ultimately, replicability is the key to confirming validity. The highly cited Searchinger et al. (2008) Science article drew attention both for recognizing the potential for land use change as a supply response to U.S. biofuels policy and for its empirical estimate that the global carbon footprint of that land use change was so large as to overwhelm the carbon emissions saved by increased U.S. domestic ethanol consumption. Subsequent analyses by peer models have not replicated that high estimate of global land use change. Replicability remains the best validity test—and perhaps the only one practicable.
Returning to the original question, does analytical fiddling put research relevance and validity at odds? Short answer: Not necessarily, but we must be vigilant. In experimental economics, recent innovations in professional norms are reducing the risk of reporting spuriously significant results, though more attention is needed to ensure adequate statistical power. But we should remember that analytical fiddling has real strengths. Re-analysis of data is a core function of research, and one that is indispensable for hypothesis generation. Especially with nonexperimental data, specification searches and normative model refinements can lead to the creation of new knowledge. Ideally, researchers do their work ethically and transparently. But we are all susceptible to temptation, so ultimately it takes the village of peer reviewers and fellow researchers to vet, replicate, and extend research results in order to ensure that research that appears relevant is also sound. With time, that village will separate the grain from the chaff in research results.
Angrist, J.D., and J.-S. Pischke. 2010. The credibility revolution in empirical economics: how better research design is taking the con out of econometrics. J. Econ. Persp. 24(2):3-30.
Ioannidis, J.P.A., T.D. Stanley, and H. Doucouliagos. 2017. The power of bias in economics research. Econ. J. 127:F236-F265.
Leamer, E.E. 1978. Specification Searches: Ad hoc Inference with Nonexperimental Data. New York: Wiley.
Manderscheid, L.V. 1965. Significance Levels--0.05, 0.01, or ? J. Farm Econ. 47(5):1381-1385.
Schulte, L.A., J. Niemi, M.J. Helmers, et al. 2017. Prairie strips improve biodiversity and the delivery of multiple ecosystem services from corn–soybean croplands. Proc. Nat. Acad. Sci. 114(42):11247-11252.
Searchinger, T., R. Heimlich, R.A. Houghton, et al. 2008. Use of U.S. croplands for biofuels increases greenhouse gases through emissions from land-use change. Science 319(5867):1238-1240.
Simonsohn, U., L.D. Nelson, and J.P. Simmons. 2014. P-curve: A key to the file-drawer. J. Exp. Psych.: Gen. 143(2):534-547.
Note: Text about Schulte et al (2017) was revised on Jan. 19, 2018.