Tracking the Credibility Revolution across Fields

Paul Goldsmith-Pinkham Yale School of Management and NBER. Email: paul.goldsmith-pinkham@yale.edu. I thank Dana Scott, Pedro Sant'Anna, Nils Enevoldsen, and Esmée Zwiers for helpful comments and suggestions.


How far has the credibility revolution spread beyond applied microeconomics? I update Currie, Kleven, and Zwiers (2020) using approximately 44,000 papers—31,500 NBER working papers (1982–2025) and 12,300 articles from eleven top economics and finance journals (2011–2024)—measuring mentions of empirical methods through keyword matching. Three findings emerge. First, finance and macro/other fields differ substantially from applied micro in their mention of credibility revolution methods: as of 2024, 63 percent of applied micro papers mention experimental or quasi-experimental methods, compared to 47 percent in finance and 39 percent in macro/other. The current levels in finance and macro/other are comparable to where applied micro was in 2008–2010, though the long-run trajectories may differ. Second, growth outside applied micro is driven overwhelmingly by difference-in-differences; including DiD raises the share of finance papers mentioning any experimental or quasi-experimental method by roughly 55 percent versus 30 percent for applied micro. Other quasi-experimental methods—instrumental variables, regression discontinuity, experiments—have seen far less growth. Third, I document a striking gap between the methods studied in the Journal of Econometrics—where nonparametric estimation and asymptotic theory dominate—and those used by applied researchers, where DiD and identification strategies dominate. Published journal articles confirm these patterns are not artifacts of the NBER sample.

JEL Codes: C18, C81, B41
Keywords: Credibility revolution, difference-in-differences, text analysis, empirical methods, causal inference

Introduction

How far has the credibility revolution spread? Angrist and Pischke (2010) documented a sea change in how economists approach empirical work—a shift toward transparent research designs, explicit identification strategies, and credible causal inference. Currie, Kleven, and Zwiers (2020) showed that this shift was accelerating through the late 2010s, at least in applied microeconomics. But that analysis left open a basic question: are finance, macroeconomics, and other fields keeping pace, or has the revolution been narrower than it appears? Throughout this paper, I use "macro/other" to refer to the NBER field grouping that includes macroeconomics alongside several other programs; see Table 2 for the full composition.

I take up this question by extending Currie, Kleven, and Zwiers (2020)'s approach to a much larger corpus. Using keyword matching on the full text of approximately 44,000 economics papers—31,500 NBER working papers (1982–2025) and 12,300 articles from eleven top journals (2011–2024)—I track mentions of empirical methods across fields and over time. The expanded sample adds finance and macro/other fields, which were omitted from the original analysis, and supplements working papers with published journal articles. Because the analysis measures keyword mentions rather than verified use, the trends should be interpreted as tracking the diffusion of methodological language—a proxy for, but not identical to, actual method adoption.

The answer is clear: mentions of credibility revolution methods have spread unevenly across fields. I organize the findings around three main results.

First, finance and macro/other differ substantially from applied micro on most measures. As of 2024, 63 percent of applied micro papers mention experimental or quasi-experimental methods, compared to 47 percent in finance and 39 percent in macro/other (Table 3). In identification language, the current levels in finance and macro/other are comparable to where applied micro was in 2008–2010. The gap has shown little sign of closing.

Second, the credibility revolution outside applied micro has been—to a first approximation—a difference-in-differences revolution. Including DiD in the methods measure raises the finance share by roughly 55 percent versus 30 percent for applied micro. Other quasi-experimental tools—instrumental variables, regression discontinuity, RCTs—have seen far less growth in finance and macro. This reliance on a single method is striking given the recent econometrics literature highlighting sensitivities in DiD designs (Roth 2022; De Chaisemartin and d'Haultfoeuille 2020; Callaway, Goodman-Bacon, and Sant'Anna 2024).

Third, I document a pronounced gap between the methods studied in the Journal of Econometrics—where nonparametric estimation, bootstrap methods, and asymptotic theory dominate—and those used by applied researchers, where DiD and identification strategies are the dominant tools. The tools powering the credibility revolution and the theoretical literature developing new estimators occupy largely separate methodological spaces.

Two features of the analysis strengthen confidence in these patterns. Published articles from top journals show trends that closely mirror the NBER data, with slightly higher rates of credibility revolution methods, consistent with a publication selection effect favoring methodologically rigorous papers. And a validation exercise using LLM-based classification shows that keyword matching agrees with more sophisticated approaches at rates of 80–92 percent for most method categories, at near-zero computational cost, though agreement is lower for broader categories such as identification strategy and structural models.

The paper proceeds as follows. Section 1 describes the data and methods. Section 2 presents the NBER working paper results. Section 3 extends the analysis to published articles from top journals. Section 4 examines the gap between econometric theory and applied practice. Section 5 concludes.

1. Data and Methods

I measure mentions of empirical methods over time following the approach in Currie, Kleven, and Zwiers (2020): searching the full text of papers for keywords and regular expressions that capture the language of the credibility revolution (e.g. "threats to identification" or "identification strategy"). See the Appendix for the full set of keywords.
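As a minimal sketch of this procedure, the search reduces to applying a dictionary of regular expressions to each paper's text. The patterns below are illustrative stand-ins, not the full keyword list from the Appendix:

```python
import re

# Illustrative patterns only; the actual analysis uses the full keyword
# list in the Appendix. Note that 'DiD' must be matched case-sensitively
# to avoid flagging the ordinary word "did".
PATTERNS = {
    "identification": re.compile(
        r"identification strategy|identifying assumption|threats to identification",
        re.IGNORECASE,
    ),
    "did": re.compile(r"differences?[- ]in[- ]diff|\bDiD\b"),
    "rd": re.compile(r"regression discontinuit", re.IGNORECASE),
}

def flag_methods(text: str) -> dict:
    """Return {category: True/False} for whether each pattern appears."""
    return {name: bool(pat.search(text)) for name, pat in PATTERNS.items()}

sample = "Our identification strategy exploits a difference-in-differences design."
print(flag_methods(sample))  # {'identification': True, 'did': True, 'rd': False}
```

Each paper then contributes a vector of binary flags, and field-by-year shares are simple averages of those flags.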

NBER Working Papers

I collect the full text of approximately 31,500 NBER working papers from the NBER website, covering papers 1000 through the most recent available (1982–2025). Unlike Currie, Kleven, and Zwiers (2020), who focus exclusively on "applied micro" papers, I include all papers in the NBER working paper series. Each paper is associated with one or more of nineteen NBER research programs, which I use for field classification.

Top Journal Articles

I supplement the NBER data with articles from eleven leading economics and finance journals, covering 2011–2024: three general-interest economics journals (AER, QJE, JPE), the four American Economic Journals (Applied, Policy, Macro, Micro), three top finance journals (Journal of Finance, Review of Financial Studies, Journal of Financial Economics), and the Journal of Econometrics. I extract full text from published PDFs using PyMuPDF. For AER, I filter out Papers and Proceedings (P&P) articles using DOI patterns. I exclude the Review of Economic Studies (zero text extraction coverage) and Econometrica (near-zero text coverage for 2011–2014, partial thereafter) from the main analysis; Appendix: Coverage documents text extraction rates by journal and year. In total, the journal sample comprises approximately 12,300 articles.
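As one concrete piece of this pipeline, the Papers and Proceedings filter can be sketched as a DOI check. The pattern below assumes the AEA convention that P&P article DOIs carry a "p" in the article slug, with a separate "pandp" slug for the post-2018 AEA Papers and Proceedings journal; both the convention and the example DOIs are illustrative assumptions, not taken from the paper:

```python
import re

# Assumed AEA DOI convention: regular AER articles look like
# 10.1257/aer.YYYYNNNN, while P&P articles carry a "p" in the slug
# (10.1257/aer.pYYYYNNNN) or, after 2018, a "pandp" journal slug.
PANDP_DOI = re.compile(r"10\.1257/(?:aer\.p\d|pandp\.)")

def is_pandp(doi: str) -> bool:
    """Flag Papers and Proceedings articles by their DOI."""
    return bool(PANDP_DOI.search(doi))

dois = ["10.1257/aer.20160011", "10.1257/aer.p20171001", "10.1257/pandp.20181001"]
print([d for d in dois if not is_pandp(d)])  # ['10.1257/aer.20160011']
```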

Text Processing

For each paper, I extract the full text, strip out the references section—identified by looking for section headers followed by high concentrations of "Journal" mentions—and apply the keyword search. I use the same keywords and regular expressions as Currie, Kleven, and Zwiers (2020), with appropriate case sensitivity for each category. The full list is in the Appendix.
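A minimal version of the reference-stripping heuristic might look like the following; the window size and hit threshold are illustrative choices, not the values used in the actual pipeline:

```python
def strip_references(text: str, window: int = 2000, min_hits: int = 5) -> str:
    """Cut the text at the last references-style header if the material
    that follows is dense in the word "journal" (a bibliography signal).
    Window size and hit threshold are illustrative, not the paper's values.
    """
    lowered = text.lower()
    for header in ("references", "bibliography"):
        idx = lowered.rfind(header)
        if idx == -1:
            continue
        if lowered[idx:idx + window].count("journal") >= min_hits:
            return text[:idx]
    return text

body = "We estimate the effect of the reform using panel data. "
refs = "References\n" + "Smith, J. (2020). Journal of Examples 12, 1-20.\n" * 6
print(strip_references(body + refs) == body)  # True
```

Stripping references before searching matters because bibliographies mention method names (e.g. "difference-in-differences" in cited titles) without the paper itself using them.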

Validation

I validate keyword matching against two external benchmarks. First, I compare keyword flags to the hand-coded method labels in Brodeur, Cook, and Heyes (2020), matching 357 papers across nine journals (2011–2020) by title. Treating Brodeur et al.'s labels as ground truth, keywords achieve high recall—99% for DiD and IV, 95% for RD—meaning they rarely miss a paper that uses a given method. Precision is lower (69–74% for DiD, IV, and RD), reflecting that keywords also flag papers that mention a method without using it as a primary research design. Second, I classify a stratified sample of 750 papers using two independent LLMs (Claude Haiku 4.5 and Qwen 3.5-122B). Both LLMs produce nearly identical positive rates for every method category, and agreement with keywords runs 80–92% for most categories. Full results appear in Appendix: Validation.
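The precision and recall figures are the standard ones, treating the hand-coded labels as ground truth; a sketch, with hypothetical flags for illustration only:

```python
def precision_recall(keyword_flags, truth_flags):
    """Precision and recall of binary keyword flags against hand-coded
    ground-truth labels for one method category."""
    pairs = list(zip(keyword_flags, truth_flags))
    tp = sum(1 for k, t in pairs if k and t)
    fp = sum(1 for k, t in pairs if k and not t)
    fn = sum(1 for k, t in pairs if t and not k)
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    return precision, recall

# Hypothetical flags: keywords match 4 papers, hand-coding confirms 3
# as actual users and finds none that keywords missed.
kw =    [True, True, True, True, False, False]
truth = [True, True, True, False, False, False]
print(precision_recall(kw, truth))  # (0.75, 1.0)
```

High recall with lower precision is exactly the signature reported above: keywords rarely miss true users but also pick up papers that merely mention a method.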

Field Classification

For journal articles, I classify papers into fields using a two-step procedure. First, field-specific journals are directly classified: AEJ Applied and AEJ Policy map to "Applied Micro," AEJ Macro to "Macro," AEJ Micro to "Micro Theory," the three finance journals (JF, JFE, RFS) to "Finance," and the Journal of Econometrics to "Econometrics." Second, for the general-interest journals (AER, QJE, JPE), I use JEL codes when available. Each paper's JEL code first letters determine its field: D, J, L, H, I, Q, R, or K codes map to "Applied Micro"; G codes to "Finance"; E or F codes to "Macro"; and C codes to "Econometrics." When a paper has JEL codes spanning multiple fields, I assign it to the first matching field in the priority order listed above. Papers without JEL codes—primarily from QJE and JPE, which do not report them—default to "General Econ."
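A sketch of the JEL-based step of this rule, with the letters and priority order taken from the text (the function itself is illustrative, not the paper's code):

```python
# Priority order and JEL first letters as described in the text.
FIELD_PRIORITY = [
    ("Applied Micro", set("DJLHIQRK")),
    ("Finance", {"G"}),
    ("Macro", {"E", "F"}),
    ("Econometrics", {"C"}),
]

def classify_field(jel_codes) -> str:
    """Assign a field from JEL code first letters, taking the first
    matching field in priority order; no codes means "General Econ"."""
    letters = {code[0].upper() for code in jel_codes if code}
    for field, initials in FIELD_PRIORITY:
        if letters & initials:
            return field
    return "General Econ"

print(classify_field(["E44", "G12"]))  # 'Finance': G outranks E in the order
print(classify_field([]))              # 'General Econ'
```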


(a) Comparison of sample size to Currie, Kleven, and Zwiers (2020) in "applied micro"


(b) Total papers in final sample over time

Figure 1. NBER Working Paper Counts over Time. Data for Currie, Kleven, and Zwiers (2020) are taken from Appendix Figure B.I of their paper. My sample ends in early 2025.

As Currie, Kleven, and Zwiers (2020) note in their replication package (Currie, Kleven, and Zwiers 2020b), PDF-to-text conversion introduces errors. To see how this affects my sample, I compare paper counts over time in the "applied micro" setting to Currie, Kleven, and Zwiers (2020) in Figure 1. My sample has more gaps in the 1990s—reflecting data processing errors for PDFs in that period—but coverage is close in the early 1980s and from 1999 onwards. Figure 2 provides a more direct check: I compare two headline estimates from Currie, Kleven, and Zwiers (2020) to mine. My estimates track well except in the late 1990s. I therefore focus on 2000 onwards for all results, leaving a sample of 24,702 papers.


(a) Comparison of identification measure to Currie, Kleven, and Zwiers (2020) in "applied micro"


(b) Comparison of the combined experimental and quasi-experimental measure to Currie, Kleven, and Zwiers (2020) in "applied micro"

Figure 2. Validation of measurement against Currie, Kleven, and Zwiers (2020). Data for CKZ are taken from Figure 2, Panels A and B, of their paper. I plot the raw (annual) measure, while the CKZ data are a rolling five-year mean; the smoothing explains the slight visual discrepancy between the two series.

Each NBER working paper can be submitted to one or more of nineteen programs, and 55 percent list more than one: 45 percent have exactly one program, 32 percent have two, 15 percent have three, 5 percent have four, and 2 percent have five. Table 1 reports the breakdown by program. The three largest programs are Economic Fluctuations and Growth (macroeconomics), Public Economics (applied micro), and Labor Studies (also applied micro).

Table 1. NBER Working Paper Series counts by program
| NBER Program | Number of Papers |
|---|---|
| **Applied Micro** | |
| Labor Studies | 5,970 |
| Public Economics | 5,896 |
| Economics of Health | 3,641 |
| International Trade and Investment | 2,466 |
| Children and Families | 2,193 |
| Industrial Organization | 2,160 |
| Economics of Education | 2,105 |
| Development Economics | 1,955 |
| Political Economy | 1,869 |
| Environment and Energy Economics | 1,724 |
| Economics of Aging | 1,698 |
| **Finance** | |
| Asset Pricing | 2,985 |
| Corporate Finance | 2,785 |
| **Macro/Others** | |
| Economic Fluctuations and Growth | 5,645 |
| International Finance and Macroeconomics | 3,107 |
| Monetary Economics | 2,924 |
| Productivity, Innovation, and Entrepreneurship | 2,785 |
| Development of the American Economy | 1,675 |
| Law and Economics | 1,385 |

To compare across programs, I extend Currie, Kleven, and Zwiers (2020)'s classification. I define "finance" as Asset Pricing and Corporate Finance, and "macro/other" as the remaining programs. Table 2 defines these groupings.

Table 2. Breakdown of papers by field groupings
| Field Group | Number of Papers |
|---|---|
| Applied Micro | 18,288 |
| Macro/Others | 5,111 |
| Finance | 1,758 |
| Finance + Macro/Others | 1,692 |

Throughout the analysis, field and program labels are non-exclusive: a paper contributes to every program to which it is submitted. I focus on 2000 onwards for most results, leaving a sample of approximately 24,700 NBER papers. Table 3 provides a snapshot of the headline numbers.

Table 3. Summary of credibility revolution measures by field. Shares are computed from NBER working papers. "Exp./Quasi-exp." includes DiD, event studies, IV, RD, RCTs, lab experiments, and bunching. "Excl. DiD" excludes difference-in-differences and event studies.
| Field | Period | $N$ | Ident. | Exp./QE | DiD | Excl. DiD |
|---|---|---|---|---|---|---|
| Applied Micro | 2016–2024 | 8,265 | 40.2% | 58.3% | 25.3% | 45.8% |
| Applied Micro | 2000–2015 | 9,067 | 33.4% | 42.9% | 11.8% | 37.1% |
| Finance | 2016–2024 | 586 | 22.7% | 35.8% | 19.8% | 23.2% |
| Finance | 2000–2015 | 1,121 | 15.1% | 22.5% | 10.9% | 14.2% |
| Macro/Others | 2016–2024 | 2,514 | 25.1% | 29.7% | 12.5% | 21.8% |
| Macro/Others | 2000–2015 | 4,047 | 17.6% | 22.0% | 6.3% | 17.7% |

2. Results from NBER Working Papers

Overall trends

Figure 3 presents the updated version of Currie, Kleven, and Zwiers (2020)'s Figure 2, now covering all NBER papers through May 2024. Currie, Kleven, and Zwiers (2020) use a five-year moving average; I present two-year moving averages throughout. Each panel shows field-specific trends as colored lines, with the overall aggregate as a dashed black line.
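For concreteness, a two-year moving average of an annual share series can be computed as a trailing mean. Whether the paper's smoothing is trailing or centered is not stated, so the trailing convention here is an assumption, and the shares are hypothetical:

```python
def moving_average(series, window=2):
    """Trailing moving average; early years with fewer than `window`
    observations average over whatever is available."""
    out = []
    for i in range(len(series)):
        chunk = series[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

shares = [0.30, 0.34, 0.38, 0.40]  # hypothetical annual shares
print([round(x, 2) for x in moving_average(shares)])  # [0.3, 0.32, 0.36, 0.39]
```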

Nearly all trends continue in the direction Currie, Kleven, and Zwiers (2020) documented. The share of papers explicitly mentioning identification has risen overall, with growth slowing markedly since 2016 (panel a). The share mentioning any experimental or quasi-experimental method, by contrast, has continued to rise even after 2016 (panel b). This means identification language has saturated while mentions of specific methods keep growing. Administrative data (panel c) has also continued its upward trend.

But the aggregate trends mask substantial heterogeneity. Figure 3 previews the paper's central finding: mentions of credibility revolution methods have spread unevenly, with persistent gaps between applied micro on the one hand and finance and macro/other on the other.

Figure 3. Credibility revolution trends in NBER working papers (two-year moving averages). Colored lines show field-specific trends; dashed black line shows the overall aggregate. See Table 2 for field definitions and the Appendix for keyword definitions.

Comparison across fields

Figure 3 splits each variable by the three field groupings. The gaps are large and persistent. For identification, experimental and quasi-experimental methods, and administrative data, applied micro is well above both finance and macro/other. Applied micro's identification share has grown more slowly since 2017, reaching 46 percent by 2024, and remains 13–17 percentage points above finance and macro/other. For experimental and quasi-experimental methods, applied micro reaches 63 percent by 2024, while finance stands at 47 percent and macro/other at 39 percent (Table 3).

To put these gaps in context, it helps to ask where finance and macro/other stand today relative to applied micro in the past. In identification, the current levels in finance and macro/other are comparable to where applied micro was in 2008–2010. In experimental and quasi-experimental methods, finance is comparable to applied micro circa 2011–2012 and macro/other to applied micro circa 2008. Whether this reflects a lag that will eventually close or different long-run equilibria is an important open question.

Figure 4 presents method-specific trends by field. I start with difference-in-differences (panel a), which includes event studies. All three fields show steep growth, with applied micro leading. Finance is close behind—partly because the term "event study" captures financial event studies (abnormal return studies) that differ methodologically from DiD-style event studies. Appendix: DiD Decomposition decomposes this measure.

Panel (b) tells a different story: synthetic controls. Growth continued through 2020 but has since leveled off. Panel (c) examines Bartik and shift-share instruments (Goldsmith-Pinkham, Sorkin, and Swift 2020; Borusyak, Hull, and Jaravel 2022; Adão, Kolesár, and Morales 2019). Since 2013, this method has grown rapidly across all fields. Panel (d) plots the share mentioning instrumental variables, which has stayed roughly constant over time. In panel (e), applied micro leads in RCT mentions, with 20 percent of papers by 2024. In panel (f), applied micro leads finance and macro/other by about 7–8 percentage points in regression discontinuity mentions, but the share has flattened across all fields over the past eight years.

Figure 4. Method-specific trends by field (two-year moving averages). Note: y-axis ranges differ across panels to accommodate different prevalence levels.

What accounts for the gap between applied micro and the other fields? One possibility is structural estimation. In Figure 5, macro/other and finance have a 7.5–10 percentage point higher share of structural estimation mentions than applied micro. More revealing is panel (b), which isolates papers that mention structural estimation without also mentioning experimental or quasi-experimental methods. Here the gap widens: finance and macro/other papers are roughly twice as likely to fall in this category as applied micro papers. This means that when applied micro papers use structural models, they typically pair them with complementary research designs, a pattern far less common in finance and macro.


(a) Structural Models


(b) Structural Models without mention of experimental or quasi-experimental methods

Figure 5. Panel (a) reports the share of papers that mention structural model estimation. Panel (b) reports the share mentioning structural estimation without any experimental or quasi-experimental methods.

Breakdown across programs

The field-level averages mask important within-field variation. Figure 6 plots the share of papers mentioning identification and experimental/quasi-experimental methods across all nineteen programs using slope charts. Each line segment connects a program's 2000–2015 share (left) to its 2016–2024 share (right), colored by field.

Despite within-field heterogeneity, the cross-field pattern is strikingly consistent. Applied micro programs have higher identification shares than nearly all finance and macro/other programs, with the exceptions of Productivity, Innovation, and Entrepreneurship and Law and Economics. Within finance, there is a large gap between Asset Pricing and Corporate Finance.

Figure 6. Method mentions across NBER programs. Each line segment connects a program's 2000–2015 share (left) to its 2016–2024 share (right), colored by field. Steeper upward slopes indicate faster growth.

Which methods have driven the growth? Figure 7 presents a heatmap of the change in method share by program between 2000–2015 and 2016–2024. The answer is unambiguous: DiD accounts for most of the growth across programs. The share mentioning instrumental variables has stayed roughly constant. Regression discontinuity has risen only slightly. The credibility revolution in finance and macro has been, to a first approximation, a difference-in-differences revolution.

Figure 7. Change in method-specific mentions across NBER programs, 2016–2024 minus 2000–2015. Each cell shows the percentage-point change. Blue indicates growth, red indicates decline.

The dominance of difference-in-differences across fields

How much does this single method account for the overall growth? Figure 8 compares method shares with and without DiD.

Panel (a) breaks down the comparison by field. Over 2016–2024, including DiD raises finance's methods share by roughly 13 percentage points—a 56 percent increase—compared to a similar 13 percentage point increase for applied micro, which represents only a 29 percent increase because applied micro's baseline is much higher. Panel (b) decomposes the percentage increase by program. International Finance and Macroeconomics shows the largest increase, followed by Corporate Finance, Health Economics, and Asset Pricing. By contrast, applied micro programs with high overall method shares—such as Development Economics and Education—show relatively small increases from DiD, reflecting their diversified methodological portfolios.

Figure 8. The dominance of difference-in-differences. Panel (a): experimental and quasi-experimental method shares by field, faceted by whether DiD is included or excluded. Panel (b): percentage increase in method share from including DiD, by NBER program (2016–2024).

3. Evidence from Top Journals

The NBER working paper series is a natural laboratory for studying methodological trends, but it has a limitation: NBER affiliates are a selected group. Do the patterns above survive in a different sample?

Overall trends across top journals

Figure 9 compares identification and DiD mentions across journal fields between 2011–2017 and 2018–2024. The field-level patterns closely mirror the NBER data. Applied micro journals show the highest rates of identification language and experimental/quasi-experimental methods, followed by finance, with macro trailing.

Figure 9. Method mentions across fields in top journals: 2011–2017 vs. 2018–2024. Each line connects a field's early-period share (left) to its late-period share (right). Panel (a) identification, (b) difference-in-differences.

Comparison across individual journals

Figure 10 compares mention rates across individual journals. AEJ Applied Economics and AEJ Economic Policy show the highest rates of credibility revolution methods. Among the general-interest journals, AER and QJE show higher rates than JPE. The finance journals show moderate adoption of DiD and identification language but lower rates of RD and experimental methods.

Figure 10. Heatmap of method mentions across individual journals (2011–2024). Color intensity reflects the share of papers mentioning each method category. Journals ordered by identification share.

NBER working papers vs. published articles

Could the NBER trends be artifacts of the working paper selection process? Figure 11 overlays the NBER and journal time series for key methods, matching by field. The trends are strikingly similar. Published articles show slightly higher rates of most credibility revolution methods—consistent with a selection effect where papers using transparent research designs are more likely to clear the bar at top journals.

Figure 11. NBER working papers vs. published journal articles: time series by field (2011–2024). Each panel compares NBER working papers (left facet) with published journals (right facet) for identification (left) and difference-in-differences (right).

4. Econometric Theory and Applied Practice

Having established that the credibility revolution has spread unevenly across applied fields, I now turn to a deeper question. The credibility revolution depends on tools developed by econometricians. If the revolution's reach has been uneven across applied fields, what about the field that supplies its theoretical infrastructure?

Panel (a) of Figure 12 shows which credibility revolution methods appear in the Journal of Econometrics. Most—DiD, event studies, RD, RCTs, administrative data, synthetic control—appear far less frequently than in applied journals. The exceptions are identification language and instrumental variables, reflecting the theoretical literature on these topics.

Panel (b) takes a data-driven approach: what does the Journal of Econometrics publish? I construct keyword lists for twenty candidate topic areas in econometric theory, drawn from the major sections of standard graduate econometrics textbooks and the journal's own subject classifications—nonparametric and semiparametric estimation, time series models, Bayesian methods, bootstrap and resampling, machine learning, panel data, limited dependent variables, quantile regression, kernel methods, forecasting, robust inference, weak identification, simulation, and asymptotic theory—then rank by prevalence in the Journal of Econometrics and show the top fifteen. Asymptotic theory and Monte Carlo simulation top the list—appearing in 86% and 65% of papers respectively. The more informative contrasts involve substantive methods: nonparametric estimation (58%), time series models (54%), structural/GMM/MLE methods (54%), and Bayesian methods all appear at far higher rates than in applied journals.

Figure 12. Comparison of term prevalence: Journal of Econometrics vs. applied economics journals (2011–2024). Panel (a) shows credibility revolution methods; panel (b) shows the fifteen most prevalent topics in the Journal of Econometrics, ranked by share.

Figure 13 makes the full picture concrete. The Journal of Econometrics has a strikingly different methodological profile from every other journal in the sample.

Figure 13. Heatmap of method term prevalence across journals (2011–2024). Color intensity reflects the share of papers mentioning each term.

Three caveats are important. First, the Journal of Econometrics has been at its most influential when it engages directly with the credibility revolution's tools—the literatures on heterogeneous treatment effects (De Chaisemartin and d'Haultfoeuille 2020; Callaway, Goodman-Bacon, and Sant'Anna 2024), staggered DiD (Roth 2022; Rambachan and Roth 2023), and machine learning for causal inference have reshaped applied practice. Second, the gap could reflect productive intellectual specialization rather than misalignment. Third, the cross-field differences should not be read as implying that all fields should converge to the applied micro toolkit. Nakamura and Steinsson (2018) offer a thoughtful example of how credibility revolution thinking can be adapted to macroeconomic settings.

Why does this gap matter? Because the rare instances where the two literatures do intersect have been extraordinarily productive. The DiD robustness literature—Callaway, Goodman-Bacon, and Sant'Anna (2024), De Chaisemartin and d'Haultfoeuille (2020), Sun and Abraham (2021)—moved from econometrics journals to widespread applied adoption in under five years. The gap documented here thus represents an opportunity, not just a description.

5. Conclusion

The credibility revolution has continued to advance, but the picture is one of uneven progress rather than uniform transformation. Three patterns stand out.

First, credibility revolution methods remain most prevalent in applied microeconomics. Finance and macro/other have made real strides since the early 2000s, but they differ substantially from applied micro on most measures—with current levels comparable to where applied micro was roughly a decade ago. Whether these gaps reflect a lag that will close over time or different long-run equilibria is an important open question.

Second, outside applied micro, the credibility revolution has been—to a first approximation—a difference-in-differences revolution. Over 2016–2024, including DiD raises the finance methods share by roughly 55 percent versus 30 percent for applied micro. This concentration on a single method is noteworthy given the recent econometrics literature highlighting sensitivities in DiD designs. The rapid diffusion of methodological refinements suggests that the concentration on DiD may be less concerning if practitioners are adopting improved estimators alongside the research design itself.

Third, this pattern extends to the boundary between econometric theory and applied practice. The Journal of Econometrics and applied journals occupy largely separate methodological spaces, though the gap may partly reflect productive specialization.

Looking ahead, the dominance of DiD raises a question about the trajectory of the credibility revolution. The revolution's early promise was methodological pluralism—a toolkit of transparent research designs, each suited to different empirical settings. The data show that this pluralism has been more fully realized in applied micro than elsewhere. As finance and macroeconomics continue to adopt credible methods, there is value in diversifying beyond DiD, both to strengthen the robustness of individual studies and to expand the set of questions these fields can credibly address.

References

Appendix

Appendix A: Search Categories and Trigger Phrases

Unless noted otherwise, the outcome is the fraction of papers with at least one phrase match. "Figure" and "Table" categories use average word count per paper.

Category | Trigger Phrases | Case Sens. | Wildcard | Cond. data
Administrative Data | 'administrative data', 'admin data', 'administrative-data', 'admin-data', 'administrative record', 'admin record', 'administrative regist', 'admin regist', 'register data', 'registry data' | No | Yes | Yes
Big Data | 'big data', 'big-data' | No | Yes | Yes
Binscatter | 'binscatter', 'bin scatter', 'binned scatter' | No | Yes | No
Bunching | 'bunching' | No | Yes | No
Clustering | 'cluster' | No | Yes | Yes
Confidence Interval | 'confidence interval' | No | Yes | Yes
Data | 'data' | No | Yes | No
Difference-in-Differences | 'Difference in Diff', 'difference in diff', 'Difference-in-Diff', 'difference-in-diff', 'Differences in Diff', 'differences in diff', 'Differences-in-Diff', 'differences-in-diff', 'diff-in-diff', 'd-in-d', 'DiD' | Yes | Yes | No
Event Study | 'event stud', 'event-stud' | No | Yes | No
External Validity | 'external validity', 'external-validity', 'externally valid', 'externally-valid' | No | Yes | No
Fixed Effects | 'FE', 'Fixed Effect', 'fixed effect', 'Fixed Effects', 'fixed effects', 'Fixed-Effect', 'fixed-effect', 'Fixed-Effects', 'fixed-effects' | Yes | No | Yes
General Equilibrium | 'general equilibr', 'general-equilibr' | No | Yes | No
Identification | Sentence structure: 'identif' in combination with 'effect', 'response', 'impact', 'elasticit', 'parameter', or 'coefficient' (max two words between). Also: 'causal identification', 'identification strategy', 'identification assumption', 'identifying assumption', 'identifying variation', 'partial identification', 'point identification', 'set identification', 'weak identification', etc. | No | Yes | No
Instrumental Variables | 'Instrumental Variable', 'instrumental variable', 'Two Stage Least Squares', 'two stage least squares', '2SLS', 'TSLS', 'valid instrument', 'exogenous instrument', 'IV Estimat', 'IV estimat', 'exclusion restriction', 'weak first stage', 'simulated instrument', etc. | Yes | Yes | Yes
Lab Experiments | 'Laboratory Experiment', 'lab experiment', 'Dictator Game', 'dictator game', 'Ultimatum Game', 'Trust Game', 'trust game', 'Public Good Game', 'Z-tree', 'zTree', 'ORSEE', 'show-up fee', etc. | Yes | Yes | No
Machine Learning | 'machine learning', 'lasso', 'random forest' | No | Yes | No
Matching | 'propensity score', 'propensity score matching', 'matching estimat', 'nearest neighbor matching', 'caliper matching', 'exact matching', 'kernel matching', 'inverse probability matching', etc. | No | Yes | Yes
Quasi- and Natural Experiments | 'quasi experiment', 'quasi-experiment', 'quasiexperiment', 'natural experiment', 'natural-experiment' | No | Yes | No
RCTs | 'Randomized Controlled Trial', 'randomized controlled trial', 'RCT', 'randomized experiment', 'randomised experiment', 'randomized evaluation', 'field experiment', 'Social Experiment', etc. | Yes | Yes | No
Regression Discontinuity | 'Regression Discontinuit', 'regression discontinuit', 'Regression Kink', 'regression kink', 'RD Design', 'RD design', 'RD Estimat', 'RDD', 'RKD', etc. | Yes | Yes | No
Structural Model | Sentence structure: 'structural' + 'model'/'specification'/'estimate'/'parameter' within two sentences. Also: 'Structural Model', 'Method of Moments', 'BLP', 'GMM', 'Maximum Likelihood Estimat', 'MLE', etc. | Yes | Yes | No
Survey Data | Sentence structure: 'survey' and 'data' within two sentences. | No | Yes | Yes
Synthetic Control | 'synthetic control' | No | Yes | Yes
Text Analysis | 'natural language processing', 'text analys', 'computational linguistics', 'text data', 'text mining', 'tokeniz', etc. | No | Yes | No
Econometrics categories (Section 4)
Asymptotic Theory | 'asymptot', 'large sample', 'convergence rate', 'consistency', 'limiting distribut' | No | Yes | No
Bayesian | 'Bayesian', 'posterior distribut', 'prior distribut', 'Markov chain Monte Carlo', 'MCMC' | Mixed | Yes | No
Bootstrap | 'bootstrap', 'resampl' | No | Yes | No
Nonparametric | 'nonparametric', 'non-parametric', 'non parametric' | No | Yes | No
Time Series (VAR/GARCH) | 'VAR', 'vector autoregress', 'ARMA', 'ARIMA', 'unit root', 'cointegrat', 'GARCH', 'ARCH', 'stationarity', 'impulse response' | Mixed | Yes | No
Simulation/Monte Carlo | 'Monte Carlo', 'MCMC', 'Markov chain Monte', 'Gibbs sampl' | Mixed | Yes | No
Treatment Effects | 'treatment effect', 'average treatment', 'causal effect' | No | Yes | No
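The matching scheme in the table above can be sketched in a few lines. This is a hypothetical helper, not the paper's actual implementation, and it covers only two illustrative categories: each category carries trigger phrases, a case-sensitivity flag, and a wildcard flag, where wildcard triggers match as word prefixes (so a stem such as 'event stud' would catch both "event study" and "event studies").

```python
import re

# Illustrative subset of the category table (triggers abbreviated).
CATEGORIES = {
    "Difference-in-Differences": {
        "triggers": ["difference in diff", "diff-in-diff", "DiD"],
        "case_sensitive": True,
        "wildcard": True,
    },
    "Bunching": {
        "triggers": ["bunching"],
        "case_sensitive": False,
        "wildcard": True,
    },
}

def mentions(text, spec):
    """Return True if any trigger phrase appears in the text."""
    flags = 0 if spec["case_sensitive"] else re.IGNORECASE
    for trig in spec["triggers"]:
        # Wildcard triggers match as prefixes of longer words;
        # otherwise require a word boundary right after the phrase.
        tail = r"\w*" if spec["wildcard"] else r"\b"
        pattern = r"\b" + re.escape(trig) + tail
        if re.search(pattern, text, flags):
            return True
    return False

def classify(text):
    """Flag every category whose triggers appear in the text."""
    return {cat: mentions(text, spec) for cat, spec in CATEGORIES.items()}
```

For example, `classify("We use a diff-in-diff design and document bunching at the kink.")` flags both categories, while the case-sensitive 'DiD' trigger does not fire on the ordinary word "did".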
Appendix B: LLM Validation of Keyword Matching

I classify a stratified sample of approximately 750 papers using Qwen3.5-122B-A10B-FP8. For each paper, I provide the first 1,500 words and ask the model to identify which methods are actually used, as opposed to merely mentioned. I then compare the LLM classification with the keyword flags.

Table 4. Keyword vs. LLM Classification Agreement
Method | N | Accuracy | Precision | Recall | F1 | κ
DiD | 750 | 86.7% | 47.1% | 80.2% | 59.3% | 0.520
Event Study | 750 | 86.5% | 43.2% | 64.3% | 51.7% | 0.442
IV | 750 | 80.4% | 37.6% | 93.4% | 53.6% | 0.439
RD | 750 | 91.6% | 43.5% | 95.9% | 59.9% | 0.559
RCT | 750 | 86.8% | 40.3% | 94.1% | 56.4% | 0.500
Lab Experiment | 750 | 92.7% | 37.5% | 85.7% | 52.2% | 0.489
Identification Strategy | 750 | 65.6% | 39.4% | 75.0% | 51.7% | 0.288
Structural Model | 750 | 69.1% | 59.1% | 36.0% | 44.8% | 0.250
Administrative Data | 750 | 65.9% | 79.4% | 39.5% | 52.8% | 0.305
Administrative Data75065.9%79.4%39.5%52.8%0.305

Notes: Keyword-based classification treated as positive when any pattern matches. LLM classification uses Qwen3.5-122B-A10B-FP8 with temperature 0. Precision and recall measured with keyword as the classifier and LLM as ground truth.
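The agreement statistics reported in Table 4 can be computed from paired binary labels. The sketch below is a plain-Python illustration, not the paper's code, and the toy labels in the usage example are made up rather than drawn from the sample:

```python
def agreement_stats(keyword, llm):
    """Accuracy, precision, recall, F1, and Cohen's kappa, treating
    the keyword flag as the classifier and the LLM label as truth."""
    n = len(keyword)
    tp = sum(k and g for k, g in zip(keyword, llm))
    tn = sum((not k) and (not g) for k, g in zip(keyword, llm))
    fp = sum(k and (not g) for k, g in zip(keyword, llm))
    fn = sum((not k) and g for k, g in zip(keyword, llm))
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # Cohen's kappa compares observed agreement with the agreement
    # expected by chance, given each rater's marginal positive rate.
    p_yes = ((tp + fp) / n) * ((tp + fn) / n)
    p_no = ((tn + fn) / n) * ((tn + fp) / n)
    p_e = p_yes + p_no
    kappa = (accuracy - p_e) / (1 - p_e) if p_e < 1 else 1.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "kappa": kappa}

# Toy example: 8 papers, keyword flag vs. LLM label.
stats = agreement_stats(
    [True, True, True, False, False, False, True, False],
    [True, True, False, False, False, False, False, True],
)
```

On this toy input, tp=2, fp=2, tn=3, fn=1, giving accuracy 0.625, precision 0.5, recall 2/3, and κ = 0.25.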


Table 5. Keyword vs. LLM Classification Agreement by Field
Method | Overall | Applied Micro | Finance | Macro/Others
DiD | 85.7% | 83.2% | 90.1% | 93.6%
Event Study | 84.7% | 83.2% | 79.1% | 90.4%
IV | 80.8% | 80.2% | 84.6% | 89.4%
RD | 91.5% | 88.6% | 97.8% | 96.8%
RCT | 90.7% | 88.2% | 96.7% | 100.0%
Lab Experiment | 91.3% | 88.6% | 96.7% | 98.9%
Identification Strategy | 66.1% | 61.7% | 69.2% | 81.9%
Structural Model | 65.7% | 69.7% | 65.9% | 57.4%
Administrative Data | 67.7% | 64.7% | 68.1% | 71.3%

Notes: Agreement rates between keyword-based and LLM-based classification, stratified by field.

Figure A.1. Agreement rates between keyword matching and LLM classification, by method. Where available, GPT-4o-mini accuracy from Garg and Fetzer (2025) is shown for comparison.

Benchmark against hand-coded classifications

Table 6. Keyword Classification vs. Brodeur, Cook, and Heyes (2024) Hand-Coded Benchmark
Method | N (hand) | N (keyword) | Precision | Recall | F1 | Accuracy | κ
DID | 164 | 198 | 69.2% | 83.5% | 75.7% | 82.4% | 0.621
IV | 164 | 270 | 57.8% | 95.1% | 71.9% | 75.6% | 0.526
RCT | 108 | 134 | 67.2% | 83.3% | 74.4% | 87.6% | 0.663
RDD | 65 | 94 | 64.9% | 93.8% | 76.7% | 92.6% | 0.725

Notes: Benchmark against hand-coded classifications from Brodeur, Cook, and Heyes (2024). 501 papers matched by title.


Table 7. Keyword vs. LLM Classification: Brodeur et al. Hand-Coded Benchmark (357 papers)
Method | Keywords (Precision / Recall / F1) | Qwen 3.5-122B (Precision / Recall / F1)
DID | 68.0% / 99.2% / 80.7% | 84.4% / 78.6% / 81.4%
IV | 69.1% / 100.0% / 81.7% | 90.3% / 57.9% / 70.6%
RDD | 74.3% / 94.5% / 83.2% | 88.9% / 87.3% / 88.1%
RCT | 81.2% / 83.9% / 82.5% | 91.6% / 93.5% / 92.6%

Notes: Both approaches benchmarked against hand-coded method labels from Brodeur, Cook, and Heyes (2024). Sample of 357 papers matched by journal, year, and title across nine journals (2011–2020).

Appendix C: Graphical Revolution
Figure A.2. Graphical revolution trends in NBER working papers (two-year moving averages). The ratio of figure mentions to table mentions continues upward across all fields.
Appendix D: DiD Decomposition — Strict DiD vs. Event Studies

The main text uses a composite "DiD" measure. This appendix separates strict DiD language from event-study mentions.

Figure A.3. Decomposition of the DiD measure by field. Panel (a): strict DiD language only. Panel (b): any event study mention. Panel (c): event study mentions without strict DiD. Panel (d): event study mentions with "abnormal return" (financial event study proxy).
Appendix E: Exclusive Field Classification

Papers are assigned to a single field based on their program affiliations; cross-listed papers are excluded. Sample: 18,697 papers (11,828 Applied Micro, 1,758 Finance, 5,111 Macro/Others).

Figure A.4. Credibility revolution trends under exclusive field classification. The cross-field gaps are qualitatively similar and slightly wider.
Appendix F: Rate of Change across Fields
Figure A.5. Rate of change in credibility revolution measures by field (three-year moving average of first differences). The data are more consistent with different long-run equilibria than with simple convergence.
Appendix G: Structural Model Measure — Broad vs. Narrow
Figure A.6. Structural model measures: broad (including GMM/MLE) vs. narrow (excluding GMM/MLE). Panels (a)–(b): all papers. Panels (c)–(d): papers without experimental/quasi-experimental mentions.
Appendix H: Denominator Composition

Restricting the denominator to "empirical" papers (those mentioning at least one empirical method or data source) raises the 2024 shares mentioning experimental or quasi-experimental methods from 63/47/39 percent unconditionally to 76/65/61 percent conditionally, for applied micro, finance, and macro/other respectively. The cross-field gap narrows but clearly persists.
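The share of the cross-field gap attributable to denominator composition can be checked directly from the shares quoted above. The function name below is illustrative, not from the paper's code:

```python
def gap_explained(uncond, cond):
    """Fraction of the cross-field gap (max minus min share) that
    disappears after conditioning on empirical papers."""
    gap_u = max(uncond) - min(uncond)  # unconditional gap
    gap_c = max(cond) - min(cond)      # conditional gap
    return 1 - gap_c / gap_u

# 2024 shares for applied micro / finance / macro-other:
frac = gap_explained([0.63, 0.47, 0.39], [0.76, 0.65, 0.61])
```

The gap narrows from 24 to 15 percentage points, so `frac` is 0.375, consistent with the "roughly one-third" figure in the caption below.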

Figure A.7. Method-specific trends: all papers (left) vs. empirical papers only (right). Denominator composition accounts for roughly one-third of the cross-field gap.
Appendix I: Synthetic Control and Synthetic DiD
Figure A.8. Synthetic control and synthetic DiD mentions by field (two-year moving averages). The decline in synthetic control after 2020 is not fully explained by substitution toward synthetic DiD methods.
Appendix J: Top Journals — IV and Structural Model Trends
Figure A.9. Full time series of method mentions across fields in top journals (2011–2024). Three-year moving averages.
Appendix K: Journal Analysis — Excluding General Economics Journals
Figure A.10. Journal trends excluding general economics journals (field-specific journals only). Cross-field patterns hold.
Appendix L: Journal Text Extraction Coverage
Table 8. Text extraction coverage by journal and year. Each cell shows papers with extracted text / total papers. Shaded cells indicate coverage below 80%.
Journal | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | 2022 | 2023 | 2024 | 2025 | Total
AEJ Applied | 30/37 | 40/40 | 47/47 | 30/30 | 45/45 | 37/37 | 31/31 | 46/46 | 44/45 | 50/50 | 41/41 | 63/63 | 55/55 | 64/64 | 28/28 | 651/659 (99%)
AEJ Macro | 31/32 | 30/31 | 36/36 | 24/24 | 43/43 | 30/30 | 24/24 | 32/32 | 40/40 | 48/48 | 46/46 | 56/56 | 52/52 | 50/50 | 30/30 | 572/574 (100%)
AEJ Micro | 38/38 | 31/31 | 34/34 | 44/44 | 49/49 | 43/43 | 41/41 | 40/40 | 38/38 | 34/34 | 56/56 | 82/82 | 65/65 | 52/52 | 35/35 | 682/682 (100%)
AEJ Policy | 29/30 | 37/37 | 44/44 | 45/45 | 46/46 | 39/39 | 51/52 | 48/48 | 54/54 | 51/51 | 57/57 | 63/64 | 64/64 | 64/64 | 48/48 | 740/743 (100%)
AER | 260/274 | 260/264 | 257/259 | 255/257 | 261/262 | 273/274 | 244/264 | 112/113 | 134/134 | 120/120 | 115/115 | 114/115 | 95/96 | 111/111 | 87/87 | 2698/2745 (98%)
J. Econometrics | 150/150 | 136/136 | 135/135 | 150/150 | 195/195 | 129/129 | 118/118 | 159/159 | 165/165 | 262/262 | 164/164 | 143/143 | 192/192 | 175/175 | 161/161 | 2434/2434 (100%)
J. Finance | 61/63 | 93/94 | 102/105 | 78/82 | 100/103 | 85/130 | 71/84 | 87/104 | 85/103 | 88/98 | 81/90 | 77/86 | 94/100 | 95/98 | 74/89 | 1271/1429 (89%)
JFE | 139/140 | 123/125 | 135/136 | 114/116 | 86/87 | 134/137 | 112/114 | 162/164 | 136/142 | 141/143 | 269/273 | 88/89 | 80/82 | 114/120 | 122/147 | 1955/2015 (97%)
JPE | 24/36 | 33/43 | 31/44 | 29/46 | 41/55 | 33/55 | 72/97 | 89/121 | 66/89 | 75/97 | 69/93 | 86/115 | 80/115 | 79/108 | 73/93 | 880/1207 (73%)
QJE | 46/47 | 56/56 | 36/45 | 38/46 | 39/47 | 45/54 | 51/52 | 32/32 | 39/40 | 50/51 | 48/48 | 42/42 | 54/54 | 38/45 | 38/48 | 652/707 (92%)
RFS | 0/114 | 96/101 | 95/100 | 97/98 | 109/121 | 97/104 | 143/147 | 132/138 | 143/145 | 144/147 | 135/137 | 94/94 | 93/95 | 85/89 | 86/111 | 1549/1741 (89%)
Excluded from main analysis (robustness only):
Econometrica | 1/70 | 0/99 | 0/89 | 0/83 | 42/86 | 54/77 | 61/82 | 61/84 | 58/83 | 87/111 | 91/113 | 91/115 | 75/100 | 64/89 | 58/79 | 743/1360 (55%)
R. Econ. Stud. | 0/84 | 0/47 | 0/51 | 0/44 | 0/53 | 0/65 | 0/72 | 0/75 | 0/63 | 0/94 | 0/98 | 0/77 | 0/115 | 0/113 | | 0/1051 (0%)