Tracking the Credibility Revolution across Fields

Paul Goldsmith-Pinkham, Yale School of Management and NBER. Email: paul.goldsmith-pinkham@yale.edu. I thank Dana Scott, Pedro Sant'Anna, Nils Enevoldsen, and Esmée Zwiers for helpful comments and suggestions.

How far has the credibility revolution spread beyond applied microeconomics? I update Currie, Kleven, and Zwiers (2020) using approximately 44,000 papers—31,500 NBER working papers (1982–2025) and 12,300 articles from eleven top economics and finance journals (2011–2024)—measuring mentions of empirical methods through keyword matching. Three findings emerge. First, finance and macro/other fields differ substantially from applied micro in their mention of credibility revolution methods: as of 2024, 63 percent of applied micro papers mention experimental or quasi-experimental methods, compared to 47 percent in finance and 39 percent in macro/other. The current levels in finance and macro/other are comparable to where applied micro was in 2008–2010, though the long-run trajectories may differ. Second, growth outside applied micro is driven overwhelmingly by difference-in-differences; including DiD raises the share of finance papers mentioning any experimental or quasi-experimental method by roughly 55 percent versus 30 percent for applied micro. Other quasi-experimental methods—instrumental variables, regression discontinuity, experiments—have seen far less growth. Third, I document a striking gap between the methods studied in the Journal of Econometrics—where nonparametric estimation and asymptotic theory dominate—and those used by applied researchers, where DiD and identification strategies dominate. Published journal articles confirm these patterns are not artifacts of the NBER sample.

JEL Codes: C18, C81, B41
Keywords: Credibility revolution, difference-in-differences, text analysis, empirical methods, causal inference

Introduction

How far has the credibility revolution spread? Angrist and Pischke (2010) documented a sea change in how economists approach empirical work—a shift toward transparent research designs, explicit identification strategies, and credible causal inference. Currie, Kleven, and Zwiers (2020) showed that this shift was accelerating through the late 2010s, at least in applied microeconomics. But that analysis left open a basic question: are finance, macroeconomics, and other fields keeping pace, or has the revolution been narrower than it appears? Throughout this paper, I use "macro/other" to refer to the NBER field grouping that includes macroeconomics alongside several other programs; see Table 2 for the full composition.

I take up this question by extending Currie, Kleven, and Zwiers (2020)'s approach to a much larger corpus. Using keyword matching on the full text of approximately 44,000 economics papers—31,500 NBER working papers (1982–2025) and 12,300 articles from eleven top journals (2011–2024)—I track mentions of empirical methods across fields and over time. The expanded sample adds finance and macro/other fields, which were omitted from the original analysis, and supplements working papers with published journal articles. Because the analysis measures keyword mentions rather than verified use, the trends should be interpreted as tracking the diffusion of methodological language—a proxy for, but not identical to, actual method adoption.

The answer is clear: mentions of credibility revolution methods have spread unevenly across fields. I organize the findings around three main results.

First, finance and macro/other differ substantially from applied micro on most measures. As of 2024, 63 percent of applied micro papers mention experimental or quasi-experimental methods, compared to 47 percent in finance and 39 percent in macro/other (Table 3). In identification language, the current levels in finance and macro/other are comparable to where applied micro was in 2008–2010. The gap has shown little sign of closing.

Second, the credibility revolution outside applied micro has been—to a first approximation—a difference-in-differences revolution. Including DiD in the methods measure raises the finance share by roughly 55 percent versus 30 percent for applied micro. Other quasi-experimental tools—instrumental variables, regression discontinuity, RCTs—have seen far less growth in finance and macro. This reliance on a single method is striking given the recent econometrics literature highlighting sensitivities in DiD designs (Roth 2022; De Chaisemartin and d'Haultfoeuille 2020; Callaway, Goodman-Bacon, and Sant'Anna 2024).

Third, I document a pronounced gap between the methods studied in the Journal of Econometrics—where nonparametric estimation, bootstrap methods, and asymptotic theory dominate—and those used by applied researchers, where DiD and identification strategies are the dominant tools. The tools powering the credibility revolution and the theoretical literature developing new estimators occupy largely separate methodological spaces.

Two features of the analysis strengthen confidence in these patterns. First, published articles from top journals show trends that closely mirror the NBER data, with slightly higher rates of credibility revolution methods, consistent with a publication selection effect favoring methodologically rigorous papers. Second, a validation exercise using LLM-based classification shows that keyword matching agrees with these more sophisticated approaches 80–92 percent of the time for most method categories, at near-zero computational cost. Agreement is lower for broader categories such as identification strategy and structural models.

The paper proceeds as follows. Section 1 describes the data and methods. Section 2 presents the NBER working paper results. Section 3 extends the analysis to published articles from top journals. Section 4 examines the gap between econometric theory and applied practice. Section 5 concludes.

1. Data and Methods

I measure mentions of empirical methods over time following the approach in Currie, Kleven, and Zwiers (2020): I search the full text of each paper for keywords and regular expressions that capture the language of the credibility revolution (e.g., "threats to identification" or "identification strategy"). The Appendix lists the full set of keywords.
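The keyword search can be illustrated with a short sketch. This is not the replication code: the patterns are a simplified subset of the trigger phrases in Appendix A, the function name `flag_methods` is hypothetical, and the proximity rule for identification is implemented only for the case where "identif" comes before the target word.

```python
import re

# Simplified patterns adapted from a subset of the Appendix A trigger phrases.
# DiD is case sensitive in the appendix, so case variants are spelled out.
DID = re.compile(r"[Dd]ifferences?[- ]in[- ][Dd]iff|diff-in-diff|d-in-d|DiD")
EVENT_STUDY = re.compile(r"event[- ]stud", re.IGNORECASE)

# 'identif...' followed by a target word with at most two words in between.
# Assumption: only the order 'identif' -> target is handled in this sketch.
IDENT_PROXIMITY = re.compile(
    r"\bidentif\w*\W+(?:\w+\W+){0,2}?"
    r"(?:effect|response|impact|elasticit|parameter|coefficient)",
    re.IGNORECASE,
)
IDENT_PHRASES = re.compile(
    r"identification strategy|identifying assumption|"
    r"identifying variation|causal identification",
    re.IGNORECASE,
)


def flag_methods(text: str) -> dict:
    """Return one boolean per category: True if any trigger phrase appears."""
    return {
        "did": bool(DID.search(text)),
        "event_study": bool(EVENT_STUDY.search(text)),
        "identification": bool(
            IDENT_PROXIMITY.search(text) or IDENT_PHRASES.search(text)
        ),
    }
```

A paper counts toward a category's share if its flag is true, which is why the resulting series track mentions rather than verified use.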

NBER Working Papers

I collect the full text of approximately 31,500 NBER working papers from the NBER website, covering papers 1000 through the most recent available (1982–2025). Unlike Currie, Kleven, and Zwiers (2020), who focus exclusively on "applied micro" papers, I include all papers in the NBER working paper series. Each paper is associated with one or more of nineteen NBER research programs, which I use for field classification.

Text Processing

For each paper, I extract the full text, strip out the references section—identified by looking for section headers followed by high concentrations of "Journal" mentions—and apply the keyword search. I use the same keywords and regular expressions as Currie, Kleven, and Zwiers (2020), with appropriate case sensitivity for each category. The full list is in the Appendix.
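One way to implement the reference-stripping step is sketched below. The header names, the 2,000-character look-ahead window, the threshold of five "Journal" mentions, and the choice to cut at the first qualifying header are all assumptions made for illustration; the text above only states that headers followed by a high concentration of "Journal" mentions are used.

```python
import re

# Assumed header names; the actual list used in the paper is not given.
HEADER = re.compile(
    r"^[ \t]*(references|bibliography)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def strip_references(text: str, window: int = 2000, min_hits: int = 5) -> str:
    """Cut the text at the first header-like line that is followed by
    at least `min_hits` occurrences of 'Journal' within `window` characters."""
    for match in HEADER.finditer(text):
        following = text[match.end(): match.end() + window]
        if following.count("Journal") >= min_hits:
            return text[: match.start()]
    # No qualifying header found: leave the text unchanged.
    return text
```

Requiring the "Journal" concentration guards against cutting at an in-text line that merely reads "References" without a bibliography after it.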

Validation

I validate keyword matching against two external benchmarks. First, I compare keyword flags to the hand-coded method labels in Brodeur, Cook, and Heyes (2020), matching 357 papers across nine journals (2011–2020) by title. Treating Brodeur et al.'s labels as ground truth, keywords achieve high recall—99% for DiD and IV, 95% for RD—meaning they rarely miss a paper that uses a given method. Precision is lower (69–74% for DiD, IV, and RD), reflecting that keywords also flag papers that mention a method without using it as a primary research design. Second, I classify a stratified sample of 750 papers using two independent LLMs (Claude Haiku 4.5 and Qwen 3.5-122B). Both LLMs produce nearly identical positive rates for every method category, and agreement with keywords runs 80–92% for most categories. Full results appear in Appendix: Validation.
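For readers less familiar with the two metrics, the following sketch shows how precision and recall are computed when hand-coded labels are treated as ground truth. The function name and the six-paper toy example are hypothetical and are not drawn from the validation sample.

```python
def precision_recall(keyword_flags, hand_labels):
    """Compare keyword flags to hand-coded labels (treated as ground truth)."""
    pairs = list(zip(keyword_flags, hand_labels))
    tp = sum(1 for k, h in pairs if k and h)        # flagged and truly uses method
    fp = sum(1 for k, h in pairs if k and not h)    # flagged but only a mention
    fn = sum(1 for k, h in pairs if h and not k)    # uses method but not flagged
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return precision, recall


# Toy example (made-up data): keywords flag four papers, three of which
# are hand-coded as using the method; no hand-coded paper is missed.
flags = [True, True, True, True, False, False]
labels = [True, True, True, False, False, False]
print(precision_recall(flags, labels))  # (0.75, 1.0)
```

High recall with lower precision is the pattern one would expect from a measure that records any mention of a method, including mentions in papers whose primary design is something else.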

Field Classification

For journal articles, I classify papers into fields using a two-step procedure. First, field-specific journals are directly classified: AEJ Applied and AEJ Policy map to "Applied Micro," AEJ Macro to "Macro," AEJ Micro to "Micro Theory," the three finance journals (JF, JFE, RFS) to "Finance," and the Journal of Econometrics to "Econometrics." Second, for the general-interest journals (AER, QJE, JPE), I use JEL codes when available. Each paper's JEL code first letters determine its field: D, J, L, H, I, Q, R, or K codes map to "Applied Micro"; G codes to "Finance"; E or F codes to "Macro"; and C codes to "Econometrics." When a paper has JEL codes spanning multiple fields, I assign it to the first matching field in the priority order listed above. Papers without JEL codes—primarily from QJE and JPE, which do not report them—default to "General Econ."
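The two-step rule can be written compactly, as in the sketch below. The journal label strings and the function name `classify_field` are hypothetical. The sketch also assumes that a paper whose JEL codes match none of the listed letters falls back to "General Econ", which the description above does not specify.

```python
# Step 1: field journals map directly to a field (label strings are illustrative).
JOURNAL_FIELD = {
    "AEJ Applied": "Applied Micro",
    "AEJ Policy": "Applied Micro",
    "AEJ Macro": "Macro",
    "AEJ Micro": "Micro Theory",
    "JF": "Finance",
    "JFE": "Finance",
    "RFS": "Finance",
    "Journal of Econometrics": "Econometrics",
}

# Step 2: JEL first letters, checked in the priority order given in the text.
JEL_PRIORITY = [
    ("Applied Micro", set("DJLHIQRK")),
    ("Finance", {"G"}),
    ("Macro", {"E", "F"}),
    ("Econometrics", {"C"}),
]


def classify_field(journal: str, jel_codes: list) -> str:
    """Assign a field from the journal if possible, otherwise from JEL codes."""
    if journal in JOURNAL_FIELD:
        return JOURNAL_FIELD[journal]
    letters = {code.strip()[0].upper() for code in jel_codes if code.strip()}
    for field, field_letters in JEL_PRIORITY:
        if letters & field_letters:
            return field
    # No JEL codes (or, by assumption, no matching letters).
    return "General Econ"
```

Under this ordering, a general-interest paper with both G and J codes is assigned to Applied Micro, so the priority order matters for papers that span fields.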

**Figure 1.** NBER Working Paper Counts over Time. Panels: comparison of sample size to Currie et al. (2020) in applied micro; total papers in final sample over time. Data for Currie, Kleven, and Zwiers (2020) are taken from Appendix Figure B.I. in their paper. My sample ends in early 2025.

As Currie, Kleven, and Zwiers (2020) note in their replication package (Currie, Kleven, and Zwiers 2020b), PDF-to-text conversion introduces errors. To see how this affects my sample, I compare paper counts over time in the "applied micro" setting to Currie, Kleven, and Zwiers (2020) in Figure 1. My sample has more gaps in the 1990s—reflecting data processing errors for PDFs in that period—but coverage is close in the early 1980s and from 1999 onwards. Figure 2 provides a more direct check: I compare two headline estimates from Currie, Kleven, and Zwiers (2020) to mine. My estimates track well except in the late 1990s. I therefore focus on 2000 onwards for all results, leaving a sample of 24,702 papers.

**Figure 2.** Validation of measurement with Currie, Kleven, and Zwiers (2020). Panels: comparison of identification measure; comparison of experimental/quasi-experimental measure. Data for CKZ is taken from Figure 2 Panel A and B. I plot the raw (annual) measure, while CKZ data is a rolling five-year mean; the smoothing explains the slight visual discrepancy between the two series.

Each NBER working paper can be submitted to one or more of nineteen programs, and 55 percent list more than one. Specifically, 45 percent have one program, 32 percent have two, 15 percent have three, 5 percent have four, and 2 percent have five. Table 1 reports paper counts by program. The three largest programs are Labor Studies (applied micro), Public Economics (also applied micro), and Economic Fluctuations and Growth (macroeconomics).

**Table 1.** NBER Working Paper Series counts by program
NBER Program	Number of Papers
Applied Micro
Labor Studies	5,970
Public Economics	5,896
Economics of Health	3,641
International Trade and Investment	2,466
Children and Families	2,193
Industrial Organization	2,160
Economics of Education	2,105
Development Economics	1,955
Political Economy	1,869
Environment and Energy Economics	1,724
Economics of Aging	1,698
Finance
Asset Pricing	2,985
Corporate Finance	2,785
Macro/Others
Economic Fluctuations and Growth	5,645
International Finance and Macroeconomics	3,107
Monetary Economics	2,924
Productivity, Innovation, and Entrepreneurship	2,785
Development of the American Economy	1,675
Law and Economics	1,385

To compare across programs, I extend Currie, Kleven, and Zwiers (2020)'s classification. I define "finance" as Asset Pricing and Corporate Finance, and "macro/other" as the remaining programs. Table 2 defines these groupings.

**Table 2.** Breakdown of papers by field groupings
Field Group	Number of Papers
Applied Micro	18,288
Macro/Others	5,111
Finance	1,758
Finance + Macro/Others	1,692

Throughout the analysis, field and program labels are non-exclusive: a paper contributes to every program to which it is submitted. I focus on 2000 onwards for most results, leaving a sample of approximately 24,700 NBER papers. Table 3 provides a snapshot of the headline numbers.

**Table 3.** Summary of credibility revolution measures by field. Shares are computed from NBER working papers. "Exp./Quasi-exp." includes DiD, event studies, IV, RD, RCTs, lab experiments, and bunching. "Excl. DiD" excludes difference-in-differences and event studies.
Field	2016–2024					2000–2015
Field	$N$	Ident.	Exp./QE	DiD	Excl. DiD	$N$	Ident.	Exp./QE	DiD	Excl. DiD
Applied Micro	8,265	40.2%	58.3%	25.3%	45.8%	9,067	33.4%	42.9%	11.8%	37.1%
Finance	586	22.7%	35.8%	19.8%	23.2%	1,121	15.1%	22.5%	10.9%	14.2%
Macro/Others	2,514	25.1%	29.7%	12.5%	21.8%	4,047	17.6%	22.0%	6.3%	17.7%

2. Results from NBER Working Papers

Overall trends

Figure 3 presents the updated version of Currie, Kleven, and Zwiers (2020)'s Figure 2, now covering all NBER papers through May 2024. Currie, Kleven, and Zwiers (2020) use a five-year moving average; I present two-year moving averages throughout. Each panel shows field-specific trends as colored lines, with the overall aggregate as a dashed black line.
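As a small illustration of the smoothing, the sketch below computes a two-year moving average from annual shares. It assumes a trailing window (each year averaged with the year before it) and consecutive years; the paper does not say whether the window is trailing or centered, and the input values are made up.

```python
def moving_average(series: dict, window: int = 2) -> dict:
    """Trailing moving average over a {year: share} mapping.

    Assumes the years are consecutive; the first window - 1 years
    have no value because the window is incomplete.
    """
    years = sorted(series)
    smoothed = {}
    for i in range(window - 1, len(years)):
        span = years[i - window + 1: i + 1]
        smoothed[years[i]] = sum(series[y] for y in span) / window
    return smoothed


# Made-up annual shares, for illustration only.
annual = {2021: 0.25, 2022: 0.5, 2023: 0.75}
print(moving_average(annual))  # {2022: 0.375, 2023: 0.625}
```

A shorter window than the five-year average in Currie, Kleven, and Zwiers (2020) keeps more year-to-year variation visible at the cost of noisier lines.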

Nearly all trends continue in the direction Currie, Kleven, and Zwiers (2020) documented. The share of papers explicitly mentioning identification has risen overall, with growth slowing markedly since 2016 (panel a). The share mentioning any experimental or quasi-experimental method, by contrast, has continued to rise even after 2016 (panel b). This means identification language has saturated while mentions of specific methods keep growing. Administrative data (panel c) has also continued its upward trend.

But the aggregate trends mask substantial heterogeneity. Figure 3 previews the paper's central finding: mentions of credibility revolution methods have spread unevenly, with persistent gaps between applied micro on the one hand and finance and macro/other on the other.

**Figure 3.** Credibility revolution trends in NBER working papers (two-year moving averages). Colored lines show field-specific trends; dashed black line shows the overall aggregate. See Table 2 for field definitions and the Appendix for keyword definitions.

Comparison across fields

Figure 3 splits each variable by the three field groupings. The gaps are large and persistent. For identification, experimental and quasi-experimental methods, and administrative data, applied micro is well above both finance and macro/other. Applied micro's identification share has grown more slowly since 2017, reaching 46 percent by 2024, and remains 13–17 percentage points above finance and macro/other. For experimental and quasi-experimental methods, applied micro reaches 63 percent by 2024, while finance stands at 47 percent and macro/other at 39 percent (Table 3).

To put these gaps in context, it helps to ask where finance and macro/other stand today relative to applied micro in the past. In identification, the current levels in finance and macro/other are comparable to where applied micro was in 2008–2010. In experimental and quasi-experimental methods, finance is comparable to applied micro circa 2011–2012 and macro/other to applied micro circa 2008. Whether this reflects a lag that will eventually close or different long-run equilibria is an important open question.

Figure 4 presents method-specific trends by field. I start with difference-in-differences (panel a), which includes event studies. All three fields show steep growth, with applied micro leading. Finance is close behind—partly because the term "event study" captures financial event studies (abnormal return studies) that differ methodologically from DiD-style event studies. Appendix: DiD Decomposition decomposes this measure.

Panel (b), which plots synthetic controls, tells a different story: growth continued through 2020 but has since leveled off. Panel (c) examines Bartik and shift-share instruments (Goldsmith-Pinkham, Sorkin, and Swift 2020; Borusyak, Hull, and Jaravel 2022; Adão, Kolesár, and Morales 2019). Since 2013, this method has grown rapidly across all fields. Panel (d) plots the share mentioning instrumental variables, which has stayed roughly constant over time. In panel (e), applied micro leads in RCT mentions, with 20 percent of papers by 2024. In panel (f), applied micro leads finance and macro/other by about 7–8 percentage points in regression discontinuity mentions, but the share has flattened across all fields over the past eight years.

What accounts for the gap between applied micro and the other fields? One possibility is structural estimation. In Figure 5, macro/other and finance have a 7.5–10 percentage point higher share of structural estimation mentions. More revealing is panel (b), which isolates papers that mention structural estimation without also mentioning experimental or quasi-experimental methods. Here the gap widens: finance and macro/other papers are roughly twice as likely to fall in this category as applied micro papers. This means that when applied micro papers use structural models, they typically pair them with complementary research designs—a pattern far less common in finance and macro.

**Figure 5.** Structural models. Panel (a) reports the share of papers that mention structural model estimation. Panel (b) reports the share mentioning structural estimation without any experimental or quasi-experimental methods.

Breakdown across programs

The field-level averages mask important within-field variation. Figure 6 plots the share of papers mentioning identification and experimental/quasi-experimental methods across all nineteen programs using slope charts. Each line segment connects a program's 2000–2015 share (left) to its 2016–2024 share (right), colored by field.

Despite within-field heterogeneity, the cross-field pattern is strikingly consistent. Applied micro programs have higher identification shares than nearly all finance and macro/other programs, with the exceptions of Productivity, Innovation, and Entrepreneurship and Law and Economics. Within finance, there is a large gap between Asset Pricing and Corporate Finance.

**Figure 6.** Method mentions across NBER programs. Each line segment connects a program's 2000–2015 share (left) to its 2016–2024 share (right), colored by field. Steeper upward slopes indicate faster growth.

Which methods have driven the growth? Figure 7 presents a heatmap of the change in method share by program between 2000–2015 and 2016–2024. The answer is unambiguous: DiD accounts for most of the growth across programs. The share mentioning instrumental variables has stayed roughly constant. Regression discontinuity has risen only slightly. The credibility revolution in finance and macro has been, to a first approximation, a difference-in-differences revolution.

**Figure 7.** Change in method-specific mentions across NBER programs, 2016–2024 minus 2000–2015. Each cell shows the percentage-point change. Blue indicates growth, red indicates decline.

The dominance of difference-in-differences across fields

How much does this single method account for the overall growth? Figure 8 compares method shares with and without DiD.

Panel (a) breaks down the comparison by field. Over 2016–2024, including DiD raises finance's methods share by roughly 13 percentage points—a 56 percent increase—compared to a similar 13 percentage point increase for applied micro, which represents only a 29 percent increase because applied micro's baseline is much higher. Panel (b) decomposes the percentage increase by program. International Finance and Macroeconomics shows the largest increase, followed by Corporate Finance, Health Economics, and Asset Pricing. By contrast, applied micro programs with high overall method shares—such as Development Economics and Education—show relatively small increases from DiD, reflecting their diversified methodological portfolios.
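The distinction between a percentage-point change and a percent increase can be checked against the 2016–2024 columns of Table 3. The sketch below is only an arithmetic illustration using those published period averages; the function name is hypothetical, and the results (about 54 percent for finance and 27 percent for applied micro) are close to, but slightly below, the 56 and 29 percent quoted above. The difference presumably reflects rounding or details of the Figure 8 computation that Table 3 does not capture, and the sketch does not attempt to reconcile them.

```python
# (Exp./QE share, share excluding DiD), 2016-2024, copied from Table 3.
TABLE3_2016_2024 = {
    "Applied Micro": (58.3, 45.8),
    "Finance": (35.8, 23.2),
    "Macro/Others": (29.7, 21.8),
}


def did_contribution(with_did: float, without_did: float) -> tuple:
    """Return (percentage-point gain, percent increase) from including DiD."""
    pp_gain = with_did - without_did
    pct_increase = 100 * pp_gain / without_did
    return round(pp_gain, 1), round(pct_increase, 1)


for field, (with_did, without_did) in TABLE3_2016_2024.items():
    print(field, did_contribution(with_did, without_did))
# Applied Micro (12.5, 27.3)
# Finance (12.6, 54.3)
# Macro/Others (7.9, 36.2)
```

The two fields gain nearly the same number of percentage points, so the larger percent increase in finance comes entirely from its lower baseline excluding DiD.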

**Figure 8.** The dominance of difference-in-differences. Panel (a): experimental and quasi-experimental method shares by field, faceted by whether DiD is included or excluded. Panel (b): percentage increase in method share from including DiD, by NBER program (2016–2024).

3. Evidence from Top Journals

The NBER working paper series is a natural laboratory for studying methodological trends, but it has a limitation: NBER affiliates are a selected group. Do the patterns above survive in a different sample?

Overall trends across top journals

Figure 9 compares identification and DiD mentions across journal fields between 2011–2017 and 2018–2024. The field-level patterns closely mirror the NBER data. Applied micro journals show the highest rates of identification language and experimental/quasi-experimental methods, followed by finance, with macro trailing.

**Figure 9.** Method mentions across fields in top journals: 2011–2017 vs. 2018–2024. Each line connects a field's early-period share (left) to its late-period share (right). Panel (a) identification, (b) difference-in-differences.

Comparison across individual journals

Figure 10 compares mention rates across individual journals. AEJ Applied Economics and AEJ Economic Policy show the highest rates of credibility revolution methods. Among the general-interest journals, AER and QJE show higher rates than JPE. The finance journals show moderate adoption of DiD and identification language but lower rates of RD and experimental methods.

**Figure 10.** Heatmap of method mentions across individual journals (2011–2024). Color intensity reflects the share of papers mentioning each method category. Journals ordered by identification share.

NBER working papers vs. published articles

Could the NBER trends be artifacts of the working paper selection process? Figure 11 overlays the NBER and journal time series for key methods, matching by field. The trends are strikingly similar. Published articles show slightly higher rates of most credibility revolution methods—consistent with a selection effect where papers using transparent research designs are more likely to clear the bar at top journals.

**Figure 11.** NBER working papers vs. published journal articles: time series by field (2011–2024). Each panel compares NBER working papers (left facet) with published journals (right facet) for identification (left) and difference-in-differences (right).

4. Econometric Theory and Applied Practice

Having established that the credibility revolution has spread unevenly across applied fields, I now turn to a deeper question. The credibility revolution depends on tools developed by econometricians. If the revolution's reach has been uneven across applied fields, what about the field that supplies its theoretical infrastructure?

Panel (a) of Figure 12 shows which credibility revolution methods appear in the Journal of Econometrics. Most—DiD, event studies, RD, RCTs, administrative data, synthetic control—appear far less frequently than in applied journals. The exceptions are identification language and instrumental variables, reflecting the theoretical literature on these topics.

Panel (b) takes a data-driven approach: what does the Journal of Econometrics publish? I construct keyword lists for twenty candidate topic areas in econometric theory, drawn from the major sections of standard graduate econometrics textbooks and the journal's own subject classifications. The topics include nonparametric and semiparametric estimation, time series models, Bayesian methods, bootstrap and resampling, machine learning, panel data, limited dependent variables, quantile regression, kernel methods, forecasting, robust inference, weak identification, simulation, and asymptotic theory. I then rank the topics by prevalence in the Journal of Econometrics and show the top fifteen. Asymptotic theory and Monte Carlo simulation top the list, appearing in 86% and 65% of papers, respectively. The more informative contrasts involve substantive methods: nonparametric estimation (58%), time series models (54%), structural/GMM/MLE methods (54%), and Bayesian methods all appear at far higher rates than in applied journals.

**Figure 12.** Comparison of term prevalence: *Journal of Econometrics* vs. applied economics journals (2011–2024). Panel (a) shows credibility revolution methods; panel (b) shows the fifteen most prevalent topics in the *Journal of Econometrics*, ranked by share.

Figure 13 makes the full picture concrete. The Journal of Econometrics has a strikingly different methodological profile from every other journal in the sample.

**Figure 13.** Heatmap of method term prevalence across journals (2011–2024). Color intensity reflects the share of papers mentioning each term.

Three caveats are important. First, the Journal of Econometrics has been at its most influential when it engages directly with the credibility revolution's tools—the literatures on heterogeneous treatment effects (De Chaisemartin and d'Haultfoeuille 2020; Callaway, Goodman-Bacon, and Sant'Anna 2024), staggered DiD (Roth 2022; Rambachan and Roth 2023), and machine learning for causal inference have reshaped applied practice. Second, the gap could reflect productive intellectual specialization rather than misalignment. Third, the cross-field differences should not be read as implying that all fields should converge to the applied micro toolkit. Nakamura and Steinsson (2018) offer a thoughtful example of how credibility revolution thinking can be adapted to macroeconomic settings.

Why does this gap matter? Because the rare instances where the two literatures do intersect have been extraordinarily productive. The DiD robustness literature (Callaway, Goodman-Bacon, and Sant'Anna 2024; De Chaisemartin and d'Haultfoeuille 2020; Sun and Abraham 2021) moved from econometrics journals to widespread applied adoption in under five years. The gap documented here is thus not only a descriptive finding but also an opportunity.

5. Conclusion

The credibility revolution has continued to advance, but the picture is one of uneven progress rather than uniform transformation. Three patterns stand out.

First, credibility revolution methods remain most prevalent in applied microeconomics. Finance and macro/other have made real strides since the early 2000s, but they differ substantially from applied micro on most measures—with current levels comparable to where applied micro was roughly a decade ago. Whether these gaps reflect a lag that will close over time or different long-run equilibria is an important open question.

Second, outside applied micro, the credibility revolution has been—to a first approximation—a difference-in-differences revolution. Over 2016–2024, including DiD raises the finance methods share by roughly 55 percent versus 30 percent for applied micro. This concentration on a single method is noteworthy given the recent econometrics literature highlighting sensitivities in DiD designs. The rapid diffusion of methodological refinements suggests that the concentration on DiD may be less concerning if practitioners are adopting improved estimators alongside the research design itself.

Third, this pattern extends to the boundary between econometric theory and applied practice. The Journal of Econometrics and applied journals occupy largely separate methodological spaces, though the gap may partly reflect productive specialization.

Looking ahead, the dominance of DiD raises a question about the trajectory of the credibility revolution. The revolution's early promise was methodological pluralism—a toolkit of transparent research designs, each suited to different empirical settings. The data show that this pluralism has been more fully realized in applied micro than elsewhere. As finance and macroeconomics continue to adopt credible methods, there is value in diversifying beyond DiD, both to strengthen the robustness of individual studies and to expand the set of questions these fields can credibly address.

References

Adão, Rodrigo, Michal Kolesár, and Eduardo Morales. 2019. "Shift-Share Designs: Theory and Inference." The Quarterly Journal of Economics 134 (4): 1949–2010.
Angrist, Joshua D., and Alan B. Krueger. 1991. "Does Compulsory School Attendance Affect Schooling and Earnings?" The Quarterly Journal of Economics 106 (4): 979–1014.
Angrist, Joshua D., and Jörn-Steffen Pischke. 2009. Mostly Harmless Econometrics: An Empiricist's Companion. Princeton University Press.
Angrist, Joshua D., and Jörn-Steffen Pischke. 2010. "The Credibility Revolution in Empirical Economics: How Better Research Design Is Taking the Con out of Econometrics." Journal of Economic Perspectives 24 (2): 3–30.
Anthropic. 2025. "Claude Language Models." anthropic.com.
Arkhangelsky, Dmitry, Susan Athey, David A. Hirshberg, Guido W. Imbens, and Stefan Wager. 2021. "Synthetic Difference-in-Differences." American Economic Review 111 (12): 4088–4118.
Autor, David H., David Dorn, and Gordon H. Hanson. 2013. "The China Syndrome: Local Labor Market Effects of Import Competition in the United States." American Economic Review 103 (6): 2121–2168.
Bartik, Timothy J. 1991. Who Benefits from State and Local Economic Development Policies? W.E. Upjohn Institute for Employment Research.
Borusyak, Kirill, Peter Hull, and Xavier Jaravel. 2022. "Quasi-Experimental Shift-Share Research Designs." The Review of Economic Studies 89 (1): 181–213.
Boustan, Leah Platt. 2010. "Was Postwar Suburbanization 'White Flight'? Evidence from the Black Migration." The Quarterly Journal of Economics 125 (1): 417–443.
Brodeur, Abel, Nikolai Cook, and Anthony Heyes. 2020. "Methods Matter: p-Hacking and Publication Bias in Causal Analysis in Economics." American Economic Review 110 (11): 3634–3660.
Brodeur, Abel, Nikolai Cook, and Anthony Heyes. 2024. "Mass Reproducibility and Replicability: A New Hope." American Economic Review 114 (11): 3564–3610.
Callaway, Brantly, Andrew Goodman-Bacon, and Pedro H.C. Sant'Anna. 2024. "Difference-in-Differences with a Continuous Treatment." NBER Working Paper.
Currie, Janet, Henrik Kleven, and Esmée Zwiers. 2020. "Technology and Big Data Are Changing Economics: Mining Text to Track Methods." AEA Papers and Proceedings 110: 42–48.
Currie, Janet, Henrik Kleven, and Esmée Zwiers. 2020b. "Data and Code for Technology and Big Data Are Changing Economics." doi.org/10.3886/E120827V1.
De Chaisemartin, Clément, and Xavier d'Haultfoeuille. 2020. "Two-Way Fixed Effects Estimators with Heterogeneous Treatment Effects." American Economic Review 110 (9): 2964–2996.
de Chaisemartin, Clément, Xavier d'Haultfoeuille, Félix Pasquier, and Gonzalo Vazquez-Bare. 2022. "Difference-in-Differences Estimators for Treatments Continuously Distributed at Every Period." arXiv:2201.06898.
Garg, Nikhil, and Thiemo Fetzer. 2025. "Tracking the Credibility Revolution Across Fields Using LLMs." Working Paper.
Goldsmith-Pinkham, Paul, Isaac Sorkin, and Henry Swift. 2020. "Bartik Instruments: What, When, Why, and How." American Economic Review 110 (8): 2586–2624.
Imbens, Guido W., and Thomas Lemieux. 2008. "Regression Discontinuity Designs: A Guide to Practice." Journal of Econometrics 142 (2): 615–635.
Nakamura, Emi, and Jón Steinsson. 2018. "Identification in Macroeconomics." Journal of Economic Perspectives 32 (3): 59–86.
Rambachan, Ashesh, and Jonathan Roth. 2023. "A More Credible Approach to Parallel Trends." The Review of Economic Studies 90 (5): 2555–2591.
Roth, Jonathan. 2022. "Pretest with Caution: Event-Study Estimates after Testing for Parallel Trends." American Economic Review: Insights 4 (3): 305–322.
Roth, Jonathan, and Pedro H.C. Sant'Anna. 2023. "When Is Parallel Trends Sensitive to Functional Form?" Econometrica 91 (2): 737–747.
Sun, Liyang, and Sarah Abraham. 2021. "Estimating Dynamic Treatment Effects in Event Studies with Heterogeneous Treatment Effects." Journal of Econometrics 225 (2): 175–199.

Appendix

Appendix A: Search Categories and Trigger Phrases

Unless noted otherwise, the outcome is the fraction of papers with at least one phrase match. "Figure" and "Table" categories use average word count per paper.

Category	Trigger Phrases	Case Sens.	Wildcard	Cond. data
Administrative Data	'administrative data', 'admin data', 'administrative-data', 'admin-data', 'administrative record', 'admin record', 'administrative regist', 'admin regist', 'register data', 'registry data'	No	Yes	Yes
Big Data	'big data', 'big-data'	No	Yes	Yes
Binscatter	'binscatter', 'bin scatter', 'binned scatter'	No	Yes	No
Bunching	'bunching'	No	Yes	No
Clustering	'cluster'	No	Yes	Yes
Confidence Interval	'confidence interval'	No	Yes	Yes
Data	'data'	No	Yes	No
Difference-in-Differences	'Difference in Diff', 'difference in diff', 'Difference-in-Diff', 'difference-in-diff', 'Differences in Diff', 'differences in diff', 'Differences-in-Diff', 'differences-in-diff', 'diff-in-diff', 'd-in-d', 'DiD'	Yes	Yes	No
Event Study	'event stud', 'event-stud'	No	Yes	No
External Validity	'external validity', 'external-validity', 'externally valid', 'externally-valid'	No	Yes	No
Fixed Effects	'FE', 'Fixed Effect', 'fixed effect', 'Fixed Effects', 'fixed effects', 'Fixed-Effect', 'fixed-effect', 'Fixed-Effects', 'fixed-effects'	Yes	No	Yes
General Equilibrium	'general equilibr', 'general-equilibr'	No	Yes	No
Identification	Sentence structure: 'identif' in combination with 'effect', 'response', 'impact', 'elasticit', 'parameter', or 'coefficient' (max two words between). Also: 'causal identification', 'identification strategy', 'identification assumption', 'identifying assumption', 'identifying variation', 'partial identification', 'point identification', 'set identification', 'weak identification', etc.	No	Yes	No
Instrumental Variables	'Instrumental Variable', 'instrumental variable', 'Two Stage Least Squares', 'two stage least squares', '2SLS', 'TSLS', 'valid instrument', 'exogenous instrument', 'IV Estimat', 'IV estimat', 'exclusion restriction', 'weak first stage', 'simulated instrument', etc.	Yes	Yes	Yes
Lab Experiments	'Laboratory Experiment', 'lab experiment', 'Dictator Game', 'dictator game', 'Ultimatum Game', 'Trust Game', 'trust game', 'Public Good Game', 'Z-tree', 'zTree', 'ORSEE', 'show-up fee', etc.	Yes	Yes	No
Machine Learning	'machine learning', 'lasso', 'random forest'	No	Yes	No
Matching	'propensity score', 'propensity score matching', 'matching estimat', 'nearest neighbor matching', 'caliper matching', 'exact matching', 'kernel matching', 'inverse probability matching', etc.	No	Yes	Yes
Quasi- and Natural Experiments	'quasi experiment', 'quasi-experiment', 'quasiexperiment', 'natural experiment', 'natural-experiment'	No	Yes	No
RCTs	'Randomized Controlled Trial', 'randomized controlled trial', 'RCT', 'randomized experiment', 'randomised experiment', 'randomized evaluation', 'field experiment', 'Social Experiment', etc.	Yes	Yes	No
Regression Discontinuity	'Regression Discontinuit', 'regression discontinuit', 'Regression Kink', 'regression kink', 'RD Design', 'RD design', 'RD Estimat', 'RDD', 'RKD', etc.	Yes	Yes	No
Structural Model	Sentence structure: 'structural' + 'model'/'specification'/'estimate'/'parameter' within two sentences. Also: 'Structural Model', 'Method of Moments', 'BLP', 'GMM', 'Maximum Likelihood Estimat', 'MLE', etc.	Yes	Yes	No
Survey Data	Sentence structure: 'survey' and 'data' within two sentences.	No	Yes	Yes
Synthetic Control	'synthetic control'	No	Yes	Yes
Text Analysis	'natural language processing', 'text analys', 'computational linguistics', 'text data', 'text mining', 'tokeniz', etc.	No	Yes	No
Econometrics categories (Section 4)
Asymptotic Theory	'asymptot', 'large sample', 'convergence rate', 'consistency', 'limiting distribut'	No	Yes	No
Bayesian	'Bayesian', 'posterior distribut', 'prior distribut', 'Markov chain Monte Carlo', 'MCMC'	Mixed	Yes	No
Bootstrap	'bootstrap', 'resampl'	No	Yes	No
Nonparametric	'nonparametric', 'non-parametric', 'non parametric'	No	Yes	No
Time Series (VAR/GARCH)	'VAR', 'vector autoregress', 'ARMA', 'ARIMA', 'unit root', 'cointegrat', 'GARCH', 'ARCH', 'stationarity', 'impulse response'	Mixed	Yes	No
Simulation/Monte Carlo	'Monte Carlo', 'MCMC', 'Markov chain Monte', 'Gibbs sampl'	Mixed	Yes	No
Treatment Effects	'treatment effect', 'average treatment', 'causal effect'	No	Yes	No
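
A minimal sketch of how the "Case Sens." and "Wildcard" columns could translate into a matcher. The interpretation is an assumption rather than the paper's documented implementation: a wildcard phrase is taken to match as a word prefix ('event stud' matches "event studies"), while a non-wildcard phrase must match as a whole word. The names `build_pattern`, `CATEGORIES`, and `flag_paper` are hypothetical, and the "Cond. data" column is not modeled.

```python
import re

def build_pattern(phrases, case_sensitive, wildcard):
    """Compile one regex covering all trigger phrases of a category."""
    parts = []
    for phrase in phrases:
        escaped = re.escape(phrase)
        # Assumption: every phrase starts at a word boundary. With wildcard
        # matching the phrase may continue into a longer word; without it,
        # the phrase must also end at a word boundary.
        parts.append(r"\b" + escaped + ("" if wildcard else r"\b"))
    flags = 0 if case_sensitive else re.IGNORECASE
    return re.compile("|".join(parts), flags)

# Two rows taken from the table above (Fixed Effects list abbreviated).
CATEGORIES = {
    "Event Study": build_pattern(
        ["event stud", "event-stud"], case_sensitive=False, wildcard=True
    ),
    "Fixed Effects": build_pattern(
        ["FE", "Fixed Effect", "fixed effect", "Fixed Effects", "fixed effects"],
        case_sensitive=True,
        wildcard=False,
    ),
}

def flag_paper(text):
    """Return, per category, whether the text contains at least one match."""
    return {name: bool(pat.search(text)) for name, pat in CATEGORIES.items()}

if __name__ == "__main__":
    print(flag_paper("We estimate event studies with firm fixed effects."))
```

The fraction of papers with at least one match is then the mean of these flags across papers in a given year and field.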

Appendix B: LLM Validation of Keyword Matching

I classify a stratified sample of approximately 750 papers using Qwen3.5-122B-A10B-FP8 at temperature 0. For each paper, I provide the first 1,500 words and ask the model to identify which methods are actually used, as opposed to merely mentioned. I then compare the LLM classification with the keyword flags.

**Table 4.** Keyword vs. LLM Classification Agreement
Method	N	Accuracy	Precision	Recall	F1	κ
DiD	750	86.7%	47.1%	80.2%	59.3%	0.520
Event Study	750	86.5%	43.2%	64.3%	51.7%	0.442
IV	750	80.4%	37.6%	93.4%	53.6%	0.439
RD	750	91.6%	43.5%	95.9%	59.9%	0.559
RCT	750	86.8%	40.3%	94.1%	56.4%	0.500
Lab Experiment	750	92.7%	37.5%	85.7%	52.2%	0.489
Identification Strategy	750	65.6%	39.4%	75.0%	51.7%	0.288
Structural Model	750	69.1%	59.1%	36.0%	44.8%	0.250
Administrative Data	750	65.9%	79.4%	39.5%	52.8%	0.305

Notes: The keyword-based classification is treated as positive when any pattern matches. LLM classification uses Qwen3.5-122B-A10B-FP8 with temperature 0. Precision and recall are measured with the keyword flag as the classifier and the LLM label as ground truth.
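
The agreement statistics reported in the table can be computed from paired binary labels as sketched below. This uses the standard definitions of precision, recall, F1, accuracy, and Cohen's κ, with the keyword flag as the prediction and the LLM label as the reference. The function name `agreement_metrics` and the toy labels are illustrative assumptions, not the paper's code or data.

```python
def agreement_metrics(keyword_flags, reference_flags):
    """Agreement between keyword flags (prediction) and reference labels."""
    pairs = list(zip(keyword_flags, reference_flags))
    n = len(pairs)
    tp = sum(1 for k, r in pairs if k and r)
    fp = sum(1 for k, r in pairs if k and not r)
    fn = sum(1 for k, r in pairs if r and not k)
    tn = n - tp - fp - fn

    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else float("nan")
    accuracy = (tp + tn) / n

    # Cohen's kappa: observed agreement relative to the agreement expected
    # by chance given each classifier's marginal positive rate.
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    kappa = (accuracy - p_chance) / (1 - p_chance) if p_chance < 1 else float("nan")

    return {"precision": precision, "recall": recall, "f1": f1,
            "accuracy": accuracy, "kappa": kappa}

if __name__ == "__main__":
    # Toy labels for eight hypothetical papers.
    kw = [True, True, True, False, False, False, False, False]
    llm = [True, False, True, True, False, False, False, False]
    print(agreement_metrics(kw, llm))
```

High recall with low precision, as in most rows above, corresponds to many false positives: papers where a keyword appears but the LLM does not judge the method to be used.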

**Table 5.** Keyword vs. LLM Classification Agreement by Field
Method	Overall	Applied Micro	Finance	Macro/Others
DiD	85.7%	83.2%	90.1%	93.6%
Event Study	84.7%	83.2%	79.1%	90.4%
IV	80.8%	80.2%	84.6%	89.4%
RD	91.5%	88.6%	97.8%	96.8%
RCT	90.7%	88.2%	96.7%	100.0%
Lab Experiment	91.3%	88.6%	96.7%	98.9%
Identification Strategy	66.1%	61.7%	69.2%	81.9%
Structural Model	65.7%	69.7%	65.9%	57.4%
Administrative Data	67.7%	64.7%	68.1%	71.3%

Notes: Agreement rates between keyword-based and LLM-based classification, stratified by field.

**Figure A.1.** Agreement rates between keyword matching and LLM classification, by method. Where available, GPT-4o-mini accuracy from Garg and Fetzer (2025) is shown for comparison.

Benchmark against hand-coded classifications

**Table 6.** Keyword Classification vs. Brodeur, Cook, and Heyes (2024) Hand-Coded Benchmark
Method	N_hand	N_kw	Precision	Recall	F1	Accuracy	κ
DID	164	198	69.2%	83.5%	75.7%	82.4%	0.621
IV	164	270	57.8%	95.1%	71.9%	75.6%	0.526
RCT	108	134	67.2%	83.3%	74.4%	87.6%	0.663
RDD	65	94	64.9%	93.8%	76.7%	92.6%	0.725

Notes: Benchmark against hand-coded classifications from Brodeur, Cook, and Heyes (2024). 501 papers matched by title.
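
As a consistency check, the remaining columns of the table can be recovered from the counts N_hand and N_kw together with the reported recall. The sketch below assumes that all 501 matched papers enter every row and that rounding recall × N_hand recovers the true-positive count; both are assumptions, and `metrics_from_margins` is a hypothetical helper.

```python
def metrics_from_margins(n_total, n_hand, n_kw, recall):
    """Rebuild the confusion matrix from marginal counts and reported recall."""
    tp = round(recall * n_hand)  # assumption: rounding recovers the exact count
    fp = n_kw - tp               # keyword positives not in the hand-coded set
    fn = n_hand - tp             # hand-coded positives missed by keywords
    tn = n_total - tp - fp - fn

    precision = tp / n_kw
    accuracy = (tp + tn) / n_total
    # Chance agreement from the two sets of marginals (Cohen's kappa).
    p_chance = (n_kw * n_hand + (n_total - n_kw) * (n_total - n_hand)) / n_total**2
    kappa = (accuracy - p_chance) / (1 - p_chance)
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "accuracy": accuracy, "kappa": kappa}

if __name__ == "__main__":
    # DID row: N_hand = 164, N_kw = 198, recall = 83.5%, 501 matched papers.
    did = metrics_from_margins(501, 164, 198, 0.835)
    print({k: round(v, 3) for k, v in did.items()})
```

Under these assumptions the DID row implies 137 true positives, which reproduces the reported precision (69.2%), accuracy (82.4%), and κ (0.621).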

**Table 7.** Keyword vs. LLM Classification: Brodeur et al. Hand-Coded Benchmark (357 papers)
Method	Keywords Precision	Keywords Recall	Keywords F1	Qwen 3.5-122B Precision	Qwen 3.5-122B Recall	Qwen 3.5-122B F1
DID	68.0%	99.2%	80.7%	84.4%	78.6%	81.4%
IV	69.1%	100.0%	81.7%	90.3%	57.9%	70.6%
RDD	74.3%	94.5%	83.2%	88.9%	87.3%	88.1%
RCT	81.2%	83.9%	82.5%	91.6%	93.5%	92.6%

Notes: Both approaches benchmarked against hand-coded method labels from Brodeur, Cook, and Heyes (2020). Sample of 357 papers matched by journal, year, and title across nine journals (2011–2020).

Appendix C: Graphical Revolution

**Figure A.2.** Graphical revolution trends in NBER working papers (two-year moving averages). The ratio of figure mentions to table mentions continues upward across all fields.

Appendix D: DiD Decomposition — Strict DiD vs. Event Studies

The main text uses a composite "DiD" measure. This appendix decomposes strict DiD language from event study mentions.

Appendix E: Exclusive Field Classification

Papers are assigned to a single field based on their program affiliations. Cross-listed papers are excluded. Sample: 18,697 papers (11,828 Applied Micro, 1,758 Finance, 5,111 Macro/Others).
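
One way the exclusive assignment could be implemented is sketched below. The program-to-field mapping shown is purely illustrative (the actual grouping is the one in Table 2), and treating "cross-listed" as "programs spanning more than one field" is an assumption. `PROGRAM_TO_FIELD` and `exclusive_field` are hypothetical names.

```python
# Hypothetical mapping for illustration only; see Table 2 for the real grouping.
PROGRAM_TO_FIELD = {
    "LS": "Applied Micro",
    "PE": "Applied Micro",
    "CF": "Finance",
    "AP": "Finance",
    "EFG": "Macro/Others",
    "ME": "Macro/Others",
}

def exclusive_field(programs):
    """Return the single field a paper belongs to, or None if it is dropped.

    Assumption: a paper is kept only when all of its mapped programs fall in
    the same field. Unmapped program codes are ignored in this sketch.
    """
    fields = {PROGRAM_TO_FIELD[p] for p in programs if p in PROGRAM_TO_FIELD}
    return next(iter(fields)) if len(fields) == 1 else None

if __name__ == "__main__":
    print(exclusive_field(["LS", "PE"]))  # both programs map to one field
    print(exclusive_field(["LS", "CF"]))  # spans two fields, so excluded
```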

Appendix F: Rate of Change across Fields

Appendix G: Structural Model Measure — Broad vs. Narrow

**Figure A.6.** Structural model measures: broad (including GMM/MLE) vs. narrow (excluding GMM/MLE). Panels (a)–(b): all papers. Panels (c)–(d): papers without experimental/quasi-experimental mentions.

Appendix H: Denominator Composition

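Restricting to "empirical" papers (those mentioning at least one empirical method or data source) raises the 2024 shares mentioning experimental or quasi-experimental methods from 63/47/39 percent (unconditional) to 76/65/61 percent (conditional) for applied micro, finance, and macro/other, respectively. The gap between applied micro and macro/other narrows from 24 to 15 percentage points but persists.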
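
The arithmetic behind the comparison can be made explicit with a few lines. The shares are the 2024 figures quoted above, in the order applied micro, finance, macro/other; the helper name `gaps` is illustrative.

```python
# 2024 shares (percent) mentioning experimental or quasi-experimental methods.
unconditional = {"Applied Micro": 63, "Finance": 47, "Macro/Other": 39}
conditional = {"Applied Micro": 76, "Finance": 65, "Macro/Other": 61}

def gaps(shares, base="Applied Micro"):
    """Percentage-point gap between the base field and every other field."""
    return {field: shares[base] - share
            for field, share in shares.items() if field != base}

if __name__ == "__main__":
    print("unconditional:", gaps(unconditional))
    print("conditional:  ", gaps(conditional))
```

Conditioning on empirical papers shrinks the gap to applied micro from 16 to 11 percentage points for finance and from 24 to 15 for macro/other.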

Appendix I: Synthetic Control and Synthetic DiD

**Figure A.8.** Synthetic control and synthetic DiD mentions by field (two-year moving averages). The decline in synthetic control after 2020 is not fully explained by substitution toward synthetic DiD methods.

Appendix J: Top Journals — IV and Structural Model Trends

**Figure A.9.** Full time series of method mentions across fields in top journals (2011–2024). Three-year moving averages.

Appendix K: Journal Analysis — Excluding General Economics Journals

**Figure A.10.** Journal trends excluding general economics journals (field-specific journals only). Cross-field patterns hold.

Appendix L: Journal Text Extraction Coverage

**Table 8.** Text extraction coverage by journal and year. Each cell shows papers with extracted text / total papers. Shaded cells indicate coverage below 80%.
Journal	2011	2012	2013	2014	2015	2016	2017	2018	2019	2020	2021	2022	2023	2024	2025	Total
AEJ Applied	30/37	40/40	47/47	30/30	45/45	37/37	31/31	46/46	44/45	50/50	41/41	63/63	55/55	64/64	28/28	651/659 (99%)
AEJ Macro	31/32	30/31	36/36	24/24	43/43	30/30	24/24	32/32	40/40	48/48	46/46	56/56	52/52	50/50	30/30	572/574 (100%)
AEJ Micro	38/38	31/31	34/34	44/44	49/49	43/43	41/41	40/40	38/38	34/34	56/56	82/82	65/65	52/52	35/35	682/682 (100%)
AEJ Policy	29/30	37/37	44/44	45/45	46/46	39/39	51/52	48/48	54/54	51/51	57/57	63/64	64/64	64/64	48/48	740/743 (100%)
AER	260/274	260/264	257/259	255/257	261/262	273/274	244/264	112/113	134/134	120/120	115/115	114/115	95/96	111/111	87/87	2698/2745 (98%)
J. Econometrics	150/150	136/136	135/135	150/150	195/195	129/129	118/118	159/159	165/165	262/262	164/164	143/143	192/192	175/175	161/161	2434/2434 (100%)
J. Finance	61/63	93/94	102/105	78/82	100/103	85/130	71/84	87/104	85/103	88/98	81/90	77/86	94/100	95/98	74/89	1271/1429 (89%)
JFE	139/140	123/125	135/136	114/116	86/87	134/137	112/114	162/164	136/142	141/143	269/273	88/89	80/82	114/120	122/147	1955/2015 (97%)
JPE	24/36	33/43	31/44	29/46	41/55	33/55	72/97	89/121	66/89	75/97	69/93	86/115	80/115	79/108	73/93	880/1207 (73%)
QJE	46/47	56/56	36/45	38/46	39/47	45/54	51/52	32/32	39/40	50/51	48/48	42/42	54/54	38/45	38/48	652/707 (92%)
RFS	0/114	96/101	95/100	97/98	109/121	97/104	143/147	132/138	143/145	144/147	135/137	94/94	93/95	85/89	86/111	1549/1741 (89%)
Excluded from main analysis (robustness only):
Econometrica	1/70	0/99	0/89	0/83	42/86	54/77	61/82	61/84	58/83	87/111	91/113	91/115	75/100	64/89	58/79	743/1360 (55%)
R. Econ. Stud.	0/84	0/47	0/51	0/44	0/53	0/65	0/72	0/75	0/63	0/94	0/98	0/77	0/115	0/113	—	0/1051 (0%)
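
The row totals and the cells falling below the 80% threshold can be recomputed directly from the "extracted/total" cells. The sketch below does this for the JPE row; the helper names `parse_cells`, `coverage`, and `low_coverage_years` are illustrative and not part of the paper's pipeline.

```python
def parse_cells(cells):
    """Turn 'extracted/total' strings into integer pairs, skipping empty cells."""
    pairs = []
    for cell in cells:
        if "/" not in cell:  # placeholder for a year with no data
            continue
        extracted, total = cell.split("/")
        pairs.append((int(extracted), int(total)))
    return pairs

def coverage(cells):
    """Return (extracted, total, share) summed over all years."""
    pairs = parse_cells(cells)
    extracted = sum(e for e, _ in pairs)
    total = sum(t for _, t in pairs)
    return extracted, total, (extracted / total if total else 0.0)

def low_coverage_years(cells, first_year=2011, threshold=0.80):
    """Years whose coverage falls below the threshold (assumes no missing cells)."""
    return [first_year + i
            for i, (e, t) in enumerate(parse_cells(cells))
            if t and e / t < threshold]

# JPE row of the table, 2011 through 2025.
JPE = ("24/36 33/43 31/44 29/46 41/55 33/55 72/97 89/121 "
       "66/89 75/97 69/93 86/115 80/115 79/108 73/93").split()

if __name__ == "__main__":
    extracted, total, share = coverage(JPE)
    print(extracted, total, f"{share:.0%}")
    print(low_coverage_years(JPE))
```

For JPE, every year falls below the 80% threshold, consistent with the 73% overall coverage reported in the Total column.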

Appendix M: J. Econometrics — Time Trends in Credibility Revolution Methods