Critiquing the METR Productivity Study
Everybody and their uncle is making lazy posts on a study of developer productivity
Anybody who has seen my posts or comments on LinkedIn within the GenAI and ML space might have observed a few things:
I’m neither strictly for nor against LLMs, but I’m not a fan of influencer-fed hype shoving things down a developer’s throat in ignorance of the context in which that particular developer works. The standard measure of productivity should be productivity itself, not imagined proxies for productivity.
I believe we’ll need more than merely DNN (Deep Neural Network) LLMs as the core of the token-generation and reasoning mechanism, but DNNs have made solid progress and thus are likely to stick around for a while.
I dislike crappy science, but I dislike crappy reporting about science even more than I dislike crappy science.
This is a post I had not intended to tackle so soon, but these crappy-reporting situations are coming up constantly. It doesn’t matter if you’re talking about the more mainstream media outlets, or the hives of self-promotion activity that you get in places like LinkedIn. It is past time somebody started at least skimming the papers they write about, instead of producing all these mindless confirmation-bias exercises we’re constantly exposed to. I don’t care if you’re anti-LLM or pro-LLM; if your side is right, it should still be right when you read and report on papers fairly. If you have to write lazy posts to justify your position, your position is weak sauce. Seriously, many influencers need to learn how to make a better case.
As a result, this is the start of an ongoing series I’m tagging “Research Critiques”. Collectively we should all be more informed than many influencers want us to be. The goal is to understand and draw intelligent conclusions in any direction a paper legitimately supports, not just mindlessly cheer for a team. I’m not going to reproduce the contents of these papers, but rather point out some key items and nudge you into at least skimming the paper yourself. What you will be seeing here won’t be the result of days of painstaking reading; it will just be the result of actually bothering to look at all. If you do the same, you can form your own conclusions about a paper. Your conclusions should be yours, not mine, and not those of some random influencer who never even looked at the paper. The practice is worthwhile, and doesn’t have to take long.
Enter the METR study: Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity. Everybody who is anti-LLM has been dog-piling on this with posts that do little to properly lay out where this study fits, which conclusions are reasonable to draw, which ones aren’t, and which ones even the authors themselves are not attempting to argue.
Summary
The study involves 16 developers and 246 tasks, with 143 hours of data recorded in total at 10-second resolution (although data cleaning reduced that to 84 hours).
Why do you care? You care because depending on the question asked, you might look at this as providing either 16 samples, or 246 samples, or 16 x 246 = 3936 samples, or 84 x (60 * 60 / 10) = 30,240 samples.
The 16 is easy to focus on, and it is reasonable to note that it is pretty small, but not all studies involve large numbers of participants. That alone does not invalidate a research effort. Sometimes it just means you might categorize the effort as exploratory research that should be followed up with additional studies before strong conclusions are accepted as truth. Research studies are expensive, so there is value in early exploratory research that helps better frame later studies.
One thing I like about this study is that it reports how much of developer time was related to coding; other papers I’ve seen did not. The answer here is 29%. This is good to know, because posts about how much of an impact LLMs had on coding usually only measure coding, not what that means per day of an employee’s work activity. If coding is 10% of your day, and an LLM helps you code 10% faster, then the LLM would only have about a 1% impact on your daily routine. Maybe there are other LLM impact numbers for the non-coding activity too, and if so they should be included and computed appropriately. Context matters.
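As a quick illustration of that arithmetic, here is a back-of-envelope sketch. Only the 29% coding share comes from the study; the 10% figures are the same made-up stand-ins used in the example above.

```python
# Back-of-envelope: how a coding-only speedup translates into a whole-day impact.
# Illustrative numbers only; 0.29 is the study's reported coding share.

def overall_speedup(coding_fraction: float, coding_speedup: float) -> float:
    """Day-level speedup when only the coding portion of the day gets faster."""
    # Non-coding time is unchanged; coding time shrinks by (1 + coding_speedup).
    new_day = (1 - coding_fraction) + coding_fraction / (1 + coding_speedup)
    return 1 / new_day - 1

print(overall_speedup(0.10, 0.10))  # ~0.009, i.e. roughly the 1% daily impact above
print(overall_speedup(0.29, 0.10))  # ~0.027, using the study's 29% coding share
```

The exact figures aren’t the point; the point is that a coding-only speedup gets diluted by however much of the day isn’t coding.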
The headline numbers you might see tossed around are these:
Self-estimates made by developers suggested a 20% productivity gain.
Actual productivity measured by the researchers showed a 19% productivity loss.
Before you run for your preferred flavor of torch or pitchfork, there is more to think about in the study.
Funding
It’s always worth taking a look at the backstory on funding for a paper. In the case of this study that’s provided in the Partnerships section of the METR About page. The connections to the AI industry seem modest, and roughly on par with the support academic institutions receive from many tech companies so that products or services can be used without needing the deep pockets of a commercial budget.
Study Design
Study participants performed work on their own pre-existing open-source projects. There is some good and perhaps some bad to this, but it’s a reasonable exercise. It eliminates questions of seeing a benefit only when doing greenfield coding. The study is measuring what it means to bring GenAI into an existing non-trivial situation. For experienced engineers our greenfield opportunities tend to be on the light side, so I appreciate this aspect of the study design. It makes what is measured likely to have relevance to more of us. The bad angle I guess I would argue is that the open-source projects chosen were high-profile (23,000+ stars), and such projects usually come with strong curation practices that are much less typical in the codebases most engineers work on.
The participants varied in their previous LLM experience, and the study captures this. This is an area where the sample size of 16 feels relevant. It seems a little on the light side for indicating how LLM experience influences the outcome. I would rather see something like at least 70 participants in 7 cohorts of 10, each cohort of increasing experience, in order to make really strong statements about how more or less AI experience mattered and to observe whether the impact of that variable capped out in either direction. This observation could apply to other variables, and follow-up studies should consider picking a smaller number of variables but exploring them more deeply. I’m not criticizing the paper over this; I’m suggesting that readers (and influencers) shouldn’t draw conclusions bigger than the data.
The tooling used was primarily Cursor Pro paired with Claude Sonnet. Only 44% of the developers had previous experience with the IDE. I consider this a weak aspect of the study. Any developer who has transitioned from a familiar to an unfamiliar IDE has experienced how much productivity you can lose for a few weeks. I found nothing in the study that tracked changes in productivity with respect to how far along participation was in the study. That seems like a material gap in reporting. With an almost 50/50 split in tooling experience, seeing whether productivity between the two groups changed as time passed would have allowed other data points to be better understood as measuring either LLM impact or unfamiliar-IDE impact. You could also make a similar case for comparing past LLM experience against study progress, to see whether the less-experienced “caught up” to their more experienced peers over time. Both of these variables have that whiff of being confounders for other measurements, which suggests there is benefit in taking more care with them.
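To make that concrete, here is a purely hypothetical sketch, not an analysis from the paper, of how one might check whether an unfamiliar-IDE learning curve is confounding the measured slowdown. The column names and synthetic data are invented; with the study’s per-task data you would use actual and forecasted completion times instead.

```python
# Hypothetical sketch: does the slowdown shrink as the study progresses, and does
# it shrink differently for developers who already knew the IDE? All data below is
# synthetic so the example runs end to end.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_tasks = 200
df = pd.DataFrame({
    "study_progress": rng.uniform(0, 1, n_tasks),            # 0 = first task, 1 = last
    "prior_cursor_experience": rng.integers(0, 2, n_tasks),  # 1 = used Cursor before
})
# Stand-in outcome: log(actual time / forecasted time); > 0 means slower than expected.
df["log_time_ratio"] = 0.2 - 0.1 * df["study_progress"] + rng.normal(0, 0.3, n_tasks)

# The interaction term is the interesting part: if the slowdown fades over the study
# mainly for the group new to the IDE, that looks like an IDE learning curve rather
# than a stable LLM effect.
model = smf.ols(
    "log_time_ratio ~ study_progress * prior_cursor_experience", data=df
).fit()
print(model.summary())
```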
Some Findings
There is a lot of material in the document, particularly in the Factor Analysis appendix (starting on page 18), so I want to pull out some items to show how this study can easily be viewed as “for” or “against” LLM use in coding, depending on how somebody decides to cherry-pick from it.
Section: Forecasted vs Actual Impact
This is a case where I believe you’re seeing the study through the lens of 16 x 246 = 3936 samples. Developers were not good at estimating the impact of AI, and experts not involved in the coding were even worse. I consider this an important result mostly for sifting through a lot of the online noise we see every day. Anecdotal reporting may be more driven by an emotional perspective than time-clock reality.
I’m not saying anything about LLM-supported productivity here, I’m saying something about human beings and all that weird wetware we have between our ears. The way to measure productivity is to measure productivity; anecdotes that aren’t reporting an actual carefully-examined productivity measurement are not themselves a productivity measurement.
Section: Extended Discussion
Page 17 provides a table of issues that the authors are very definitely not trying to make statements about. One in particular I believe warrants noticing:
We do not provide evidence that:
There are not ways of using existing AI systems more effectively to achieve positive speedup in our exact setting.
Clarification: Cursor does not sample many tokens from LLMs, it may not use optimal prompting/scaffolding, and domain/repository-specific training/finetuning/few-shot learning could yield positive speedup
To me this goes right back to my earlier comments about developer familiarity with the tooling. Developers who live with these tools evolve their CLAUDE.md file and care for it like a first-born child. Sometimes the level of detail can border on the silly because the prompting is as complicated as the artifacts that would be generated… but be that as it may, there is likely to be a big difference between those who live in this ecosystem and those just recently introduced to it, because the former may have in essence macro’d better outcomes as part of their routine workflow.
The authors absolutely do acknowledge issues like this and do so more than once (including the section “Below-average use of AI tools” starting on page 23), but I feel like this one is pretty integral to a study on GenAI productivity and should have been controlled for. As a mental-model comparison, imagine a study on Python developer productivity where you observed the time to create properly-baked wheels but didn’t take note of which developers already had experience with all the rough-edged pain of Python library and application packaging versus which ones were being dropped into the soup for the first time.
Section: High developer familiarity with repositories
The findings here (starting on page 18) probably won’t be surprising to experienced coders. Developers already well-versed in a particular code issue weren’t particularly helped by an LLM while working on it. The assistance from an LLM was more material when dealing with something unfamiliar and for which developers needed reference material outside the current codebase.
It seems reasonable to note that if a company wants to predict the impact of LLM use on any given development team, it likely depends on the rate of novel change in the codebase. As efforts shift to maintenance and operations, we accrue more context, and the opportunities for LLMs to be a net performance gain over human knowledge and skill that is sitting there ready to be tapped may shrink or turn negative. This might also apply to very deep expertise with specific technology components.
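As a rough thought experiment, and not anything the study measured, you could model the team-level effect as a blend of per-category impacts weighted by how much of the work is novel. Every number below is made up for illustration; the -0.19 merely echoes the study’s headline slowdown as a placeholder, and the +0.15 is invented.

```python
# Thought experiment only: blend a hypothetical benefit on novel work with a
# hypothetical penalty on highly familiar work, weighted by the share of novel work.
# None of these figures are results from the study.

def net_speedup(novel_fraction: float,
                speedup_on_novel: float = 0.15,
                speedup_on_familiar: float = -0.19) -> float:
    """Net speedup as a weighted blend of novel vs. familiar work."""
    return (novel_fraction * speedup_on_novel
            + (1 - novel_fraction) * speedup_on_familiar)

for novel in (0.1, 0.5, 0.9):
    print(f"{novel:.0%} novel work -> net {net_speedup(novel):+.1%}")
```

Under these made-up numbers the net effect flips from negative to positive as the share of novel work grows, which is the shape of the argument above, not a prediction.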
Section: Low AI Reliability
This material beginning on page 19 is worth noting. The developers in the study were being careful about PR quality in submissions, because these were codebases they were familiar with, invested in, and knew the community standards for. This is why I called out earlier that the 23,000+ star projects may not represent a typical engineering situation for study purposes.
The developers’ reporting is pretty clear about how much effort they had to put into the PRs, so it isn’t immediately apparent why this wasn’t more strongly reflected in their self-reported estimates of LLM performance impact. I believe the effort finding does at least speak to the differences in the performance anecdotes we hear contrasting experienced engineers and naïve coders using LLMs, which makes sense because the former group has stronger habits of being intolerant of low code quality. Whether that is good or bad is in the eye of the beholder, but it is going to be reflected in actual performance outcomes if engineers maintain that perspective. In the extreme case of popular open-source projects the engineers definitely must hold such views unless the project community decides to revise its policies, but that influence may be somewhat moderated in other domains.
Wrap-Up
There is more you can dig into by reading the study yourself, and you should. I hope that at a minimum you will see the study has some substance and some limitations, seems fair in its perspective on the issues it chose to examine, but is fundamentally an exploratory effort and should be considered in that light. Influencer opportunism should not be the basis for deciding its merits, or for painting it as having more meaning than it deserves or than the authors themselves intended.
The Experimentalist : Critiquing the METR Productivity Study © 2025 by Reid M. Pinchback is licensed under CC BY-SA 4.0