Using AI in the development of stdlib

A reflection on stdlib's participation in the 2025 METR study on AI's impact on open-source developer productivity.

[Header image: a cartoon fox looking confused while sitting at a wooden desk, staring at a laptop screen.]

I read the results of the recent METR study on "Impact of Early-2025 AI on Experienced Open-Source Developer Productivity" with great interest for two reasons. First, I have been an early adopter of LLM tools: in 2020, I was lucky enough to get access to the private beta of the OpenAI API from then-CTO Greg Brockman and explored the use of AI for education at Carnegie Mellon University. Second, stdlib participated in the METR study, so I was personally involved, working on randomized issues over several months, allowed to use AI for some tasks and forbidden from using it for others.

Given that stdlib's involvement is central to my perspective, it's worth providing some context on the project. stdlib is a comprehensive open-source standard library for JavaScript and Node.js, with a specific and ambitious goal: to be the fundamental library for numerical and scientific computing on the web. It is a large-scale project, with well over 5 million source lines of JavaScript, C, Fortran, and WebAssembly spread across thousands of independently consumable packages, and it brings the rigor of high-performance mathematics, statistics, and machine learning to the JavaScript ecosystem. Think of it as a foundational layer for data-intensive applications, similar to the roles NumPy and SciPy serve in the Python ecosystem. In short, stdlib isn't your average JavaScript project.

A Word of Thanks

Before diving into my reflection, I want to take the opportunity to thank the METR team, and especially Nate Rush, for giving stdlib the chance to participate in this study with two core stdlib developers, Muhammad Haris and myself. It was a great experience to work with the METR team, and I am eager to see the future studies they conduct. It is my conviction that, with the entire tech industry gripped by an AI gold rush, it is incredibly valuable to have a non-profit research institute like METR conducting studies that cut through the noise with actual data.

The Slowdown

The results of the METR study are surprising, clashing with some previously published and far more optimistic findings on the impact of generative AI (e.g., GitHub and Accenture's 2023 study on the impact of Copilot on developer productivity). Quoting from the Core Result section of the METR study page:

When developers are allowed to use AI tools, they take 19% longer to complete issues—a significant slowdown that goes against developer beliefs and expert forecasts. This gap between perception and reality is striking: developers expected AI to speed them up by 24%, and even after experiencing the slowdown, they still believed AI had sped them up by 20%.

Rather predictably, the results have led to a lot of discussion on Hacker News and other social channels, with parties on both sides lining up with their pitchforks.

The Perception Gap

I am part of the group of developers who, in the study's exit interview, estimated that they had been sped up by 20-30%. While I like to believe that my productivity didn't suffer when using AI for my tasks, it is quite possible that it helped me less than I anticipated, or even hampered my efforts.

But how can that be? Daily, we read about how AI is already revolutionizing the workplace or making software engineers redundant, with companies like Salesforce announcing that they won't hire any more software engineers and the online lender Klarna announcing that it was shuttering its entire human customer support operation in favor of AI.

Many of these stories have turned out to be more hyperbole than reality. Klarna still has human support, and Salesforce still has many engineering job listings. Sadly, some of these stories appear influenced by ulterior motives, such as Klarna's strategic positioning as an "AI-native" company to capture premium valuations ahead of its IPO amid the current AI wave.

However, I have been using AI tools daily for the past three years, both at work and outside, and find them immensely useful. How do I square these benefits with the study results?

On Study Design

When confronted with results that run counter to one's expectations, it is a natural instinct to attack the study and look for holes that explain away the result. For example, one could point to the small sample size of 16 developers. There is also the argument that the study was conducted in a very specific context, with experienced developers working on projects they are intimately familiar with.

There might also have been a subtle selection effect in the tasks themselves: since project maintainers proposed their own task lists, it is possible that those more experienced with AI subconsciously selected issues they believed were more amenable to an agentic workflow. One could also argue that the developers were subject to the Hawthorne effect, altering their behavior simply because they knew they were being video-recorded, perhaps over-relying on the AI tools for the sake of the experiment.

Finally, and perhaps most importantly, the experimental setup of requiring screen recordings and active time tracking for a single task enforced a synchronous workflow. This effectively locked developers into what I call "supervision mode", where they had to watch the agent work rather than being free to context-switch to another problem.

Some of these critiques, particularly the enforced "supervision" workflow, could directly contribute to the observed slowdown. But others, such as selecting "AI-friendly" tasks or over-relying on the tool to impress researchers, should have biased the results toward a speedup. This makes the final outcome even more notable. The direction of various potential biases is ambiguous at best, which is why we must look at the study's core design.

As a randomized controlled trial, the study follows the gold-standard experimental design for detecting causality. By randomizing individual tasks to "AI-allowed" or "AI-disallowed", the study isolates the effect of AI tooling. Instead of comparing one group of developers against a control group (where differences in skill could skew the results), it compares each developer against themselves. This "within-subjects" design controls for individual characteristics, from typing speed to experience with the project. With such a design, results are harder to write off as mere statistical noise, even with a smaller sample size.

Crucially, the tasks were defined before this randomization. This avoids a common pitfall where AI might simply produce more verbose code or encourage developers to break tasks into smaller pull requests, which can inflate some productivity metrics without representing more work getting done.

16 developers from several open-source projects might not sound like much, but, in total, we completed 246 tasks. To give a sense of the work involved: the tasks Haris and I worked on were not trivial, while still being hand-scoped so that they could be completed in a few hours or less. They were a mix of core feature development (such as adding new array, string, and BLAS functions), creating custom ESLint rules to enforce project-specific coding standards, enhancing our CI/CD pipelines with new automation, and fixing bugs from our issue tracker.

And while a single developer's performance on one task is likely correlated with their performance on others, so the estimates are less precise than the raw task count would suggest, it is quite notable that the effect was in the opposite direction from what economists, ML experts, and the developers themselves predicted (with the former two groups forecasting speedups closer to 40%). Moreover, the effect is quite large in magnitude. A quick back-of-the-envelope calculation reveals that if the true effect were a 40% speedup, the probability of observing a result this far in the opposite direction would be astronomically low.
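
To put a rough number on that claim, here is a purely illustrative calculation; the standard error is an assumed figure, not one taken from the paper. Suppose the speedup estimate has a standard error of about 10 percentage points. The gap between a forecast 40% speedup and the observed 19% slowdown is 0.40 − (−0.19) = 0.59, so

z ≈ 0.59 / 0.10 = 5.9 standard errors, and P(Z ≥ 5.9) ≈ 2 × 10⁻⁹

under a normal approximation. Even if the assumed standard error were twice as large, the observed result would still sit nearly three standard errors away from the forecast.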

In light of this, I have no reason to doubt the internal validity of the study and would venture that the measured effect is real within the context of the experiment. If one believed the chatter on social media and the hype merchants who two years ago were all shilling cryptocurrency (and maybe still are!) but have since switched to extolling the amazing speedups AI offers, then increases of 100%, 5x, or even 10x should have been in the cards. But this is definitively not what the study observed.

Embracing Agentic Development

The more important consideration for squaring my own experience with these results is external validity: how generalizable are the study's findings? The paper is a great read and touches on many possible criticisms and threats to external validity, and I won't belabor any of the points raised therein.

Instead, I will solely focus on my experience as a study participant and how I have been leveraging AI with success. I will also share my own hypotheses for why the performance of the developers in this sample was overall negatively affected by the use of AI.

To give some context: before participating in this study, I incorporated LLMs into my work in two main ways. As something of an early adopter, I used GitHub Copilot for auto-completion and inline suggestions, and I made heavy use of the ChatGPT and Anthropic Claude web apps by assembling relevant context, writing detailed prompts, and copying results back into my editor. Tools such as Repomix helped streamline the process of incorporating LLMs into my daily development workflow. This approach allowed me to review changes quickly, iterate on them by asking questions, and have the LLM make follow-up edits directly in a chat interface.

The METR study subsequently provided an excuse for me to delve into agentic programming and make Cursor an integral part of my workflow. I had used it briefly some time before but hadn't found the AI-generated results compelling enough to let it loose on any codebase I was working on. By then, however, Claude 3.7 Sonnet had come out, which is still one of the most powerful models for coding tasks. Due to some very encouraging results during early testing, I was eager to put it to work on a backlog of tooling that we wanted to build for stdlib, alongside various refactoring tasks and bug fixes.

One of my first impressions of Cursor this time around was the underlying LLM's rather impressive ability to follow the project's very specific coding standards and conventions and, when placed in agent mode, to automatically and reliably fix lint errors and iteratively resolve failing unit tests. This felt like another step change in capabilities, much like when OpenAI released GPT-3 Davinci in June 2020, which suddenly made feasible many use cases that previously would have broken down in any realistic scenario.

While I no longer use Cursor and have since switched to Claude Code (more on that later), I found Cursor straightforward to use, especially given that it is a fork of VS Code, which has been my IDE of choice for many years. I strongly doubt that inexperience with Cursor, which I shared with roughly half of the developers in the study, played a major role in the results. While I didn't have an extensive .cursorrules setup (which has since been deprecated in favor of project rules), I did add basic instructions and context about the project and made sure to index the stdlib codebase. Aside from that, further customization was neither possible nor necessary, as the Cursor agent was able to automatically pull in other files, look up function call signatures, and perform other operations for assembling context.

My experience with Cursor during the study was largely positive. As an example, I ended up working on several Bash scripts for our CI/CD pipeline, and Cursor definitely sped up my development workflow: I didn't have to look up the man page of jq for the eleventh time, given that I reach for this command-line JSON tool only once in a blue moon. With the AI agent's help, I could quickly generate a function like this one to check whether a GitHub issue has a specific label:

# Check if an issue has the "Tracking Issue" label.
#
# $1 - Issue number
is_tracking_issue() {
    local issue_number="$1"
    local response

    debug_log "Checking if issue #${issue_number} is a tracking issue"
    # Get the issue:
    if ! response=$(github_api "GET" "/repos/${repo_owner}/${repo_name}/issues/${issue_number}"); then
        echo "Warning: Failed to fetch issue #${issue_number}" >&2
        return 1
    fi

    # ...

    # Check if the issue has the "Tracking Issue" label:
    if echo "$response" | jq -r '.labels[].name' 2>/dev/null | grep -q "Tracking Issue"; then
        debug_log "Issue #${issue_number} is a tracking issue"
        return 0
    else
        debug_log "Issue #${issue_number} is not a tracking issue"
        return 1
    fi
}

The agent correctly assembled the jq -r '.labels[].name' filter to extract the label names from the JSON response, something that would have otherwise sent me to a documentation page for a few minutes. Each such lookup is only a small speed bump, but these moments add up. The AI handled the rote task of recalling obscure syntax, letting me focus on the actual logic.
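
For instance, running that filter against a minimal, made-up payload (the real GitHub API response contains many more fields) shows how the label names fall out:

# Hypothetical, trimmed-down issue payload:
echo '{"labels":[{"name":"Tracking Issue"},{"name":"Good First Issue"}]}' \
    | jq -r '.labels[].name'
# Prints:
#   Tracking Issue
#   Good First Issue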

My first takeaway is this: current LLMs are very powerful for tasks in domains you are not intimately familiar with, allowing you to move much more quickly. Agentic tools such as Cursor and Claude Code are also very helpful for quickly navigating and learning your way around a large codebase, letting you ask questions and explore in a natural way. Leveraging "deep research" provides another means of exploring a problem space more exhaustively, in a way that the search engines of old simply cannot match.

On the other hand, some tasks were very frustrating. For example, the Cursor agent wrote one ESLint rule almost fully in one shot, but on another it ran in circles, unable to figure out the correct algorithm. Multiple attempts to prompt it into fixing the bug were unsuccessful. It would have been better not to fall prey to the sunk cost fallacy and instead throw away the code, then either give the agent another shot or write it myself.

Cursor does have a neat checkpoint feature that lets you stop the agent at any time and revert to a prior state, something I wholeheartedly recommend using. It is a great way to avoid getting stuck in a loop of the agent trying to fix a bug that it cannot figure out.

I freely admit that I may have been a bit overeager about using AI for all of the AI-allowed tasks, partly due to my desire to learn to use Cursor productively, but also due to my general amazement at what these new technologies unlock. Perhaps, however, the METR study suggests that the question of whether a given task can be completed more efficiently by AI, or whether one would be better off completing it by hand, is far from settled.

The Blank Slate Problem

Aside from occasional inefficiencies and outright mistakes in the generated code, coding agents do not have access to all the implicit knowledge and conventions of a large, mature project, which often might not be written down. In his reflections on the study, John Whiles identifies a core conflict: an expert engineer's primary value isn't just writing code; it's holding a complete, evolving mental model of the entire system in their head. The agent does not have such a mental model. Every interaction starts from a blank slate.

It is possible that some of this can be mitigated with better, more targeted instructions. As usual, there is no free lunch: one has to actively invest in making one's codebase more accessible to coding agents. More generally, memory and learning remain unsolved problems for transformer-based LLMs, and changing that will likely require fundamental architectural advances.

The necessity of auditing the agent's code for mistakes created two major sources of friction: the cognitive drain of "babysitting" the AI and the time spent waiting for and reviewing its output. For every minute the agent spent running in circles on that ESLint rule, I was blocked, my attention monopolized by the need to supervise its flawed process. This synchronous, blocking workflow is exhausting and inefficient. It's the digital equivalent of shoulder-surfing an overconfident junior developer who has memorized everything there is to know about programming but cannot be trusted and who will make subtle mistakes that are hard to spot.

My advice: stay in the driver's seat during such pair programming and use the AI as a sparring partner to bounce ideas back and forth instead of yielding agency.

Delegate, Don't Supervise

Partly based on my experiences in the study, my workflow has evolved, and I have subsequently switched to using Anthropic's Claude Code. This has changed my interaction model from synchronous supervision to asynchronous delegation. I can now define a complex task via Claude Code's planning mode and then have the agent work on the task in the background. I can then turn my full attention elsewhere, be it attending a meeting, reviewing a colleague's code, or simply thinking through the next problem without interruption. Claude's work happens in parallel and is not a blocker to my own. The cognitive cost of babysitting is replaced by the much lower cost of reviewing a completed proposal later; if it didn't work out, I might just throw away the code and have the model try again, instead of engaging in a fruitless back and forth.

Claude Sonnet 4 and Opus 4 had not yet been released when the METR study was conducted, and, while they mark another improvement, especially with regard to tool use, the dynamics haven't fundamentally changed. The models still make mistakes and do not always implement things in an optimal or sound way, but they are now much better at following instructions and can work uninterrupted for longer periods of time.

At least for me, in contrast to those who frame coding agents as mere "stochastic parrots", I find myself absolutely amazed that, despite its warts and hiccups, we now have a technology that, given a set of instructions, is able to generate a fully formed pull request that correctly implements logic, adheres to style guidelines, and has a passing test suite. And, in the best cases, this can happen without any human intervention.

The First 80 Percent

We still need to reconcile the observed performance decrease with the fact that many developers, including myself, have been leveraging AI to get tasks done in a fraction of the time, tasks that would previously have taken them hours or days. I believe the Pareto principle is a helpful yardstick. Named after the Italian economist Vilfredo Pareto and commonly referred to as the 80/20 rule, it posits that roughly 80% of effects come from 20% of the causes. Coding agents can now generate code that mostly works, but "mostly" falls short if the goal is 100%.

In many instances, coding agents can easily accomplish the first 80% of a programming task: generating boilerplate, scaffolding logic, implementing core functionality, and writing a test suite. However, the final 20%, from handling tricky edge cases and adhering to unwritten architectural conventions to ensuring optimal performance and avoiding code duplication by reusing existing utilities, is where the complexity lies. This last mile still requires the developer's deep, stateful mental model of the project. The rub here is that, by using the AI agent, one may bypass all the little steps that are necessary for building that mental model.

But does it matter? When working on a crucial piece of a larger, complex system, it definitely does, and I would be hesitant to rely on generative AI. But when working on a well-defined, isolated piece of code with expected behavior for given inputs and outputs, why bother? The marginal cost of writing code (long recognized as only a small part of software engineering) is going to zero. If there is a problem with the code, it can simply be thrown away and rewritten. The code that AI agents now generate is of decent quality, well-documented, and capable of adhering to one's coding conventions.

This brings to mind the following quote from Kent Beck:

The value of 90% of my skills just dropped to $0. The leverage for the remaining 10% went up 1000x. I need to recalibrate.

This potential as a force multiplier is why I am long on AI, even though the METR study is a good reminder that we can all easily fall prey to cognitive biases.

In Thinking, Fast and Slow, Daniel Kahneman gives a classic example of bias driven by the availability heuristic: people overestimate plane crash risks due to vivid media coverage, making such events more "available" to memory than statistically riskier, yet routine, car crashes. Our judgment is swayed not by data, but by the ease of recall. In the case of working with AI agents, watching them build fully functioning tools in seconds is a very memorable and visceral experience. On the other hand, the slow, frustrating "death by a thousand cuts" of auditing, debugging, and correcting the AI's subtle mistakes is the equivalent of the mundane car crash. It's a distributed cost with no single dramatic moment.

Nevertheless, I have no reason to believe that this technology will not continue to improve, and I, for one, am excited about the possibilities. For any big and ambitious project, the number of tickets to close, features to implement, and bugs to fix vastly outstrips the available time and human bandwidth.

What Future Studies Should Tell Us

It remains to be seen whether the results of the METR study can be replicated. However, the study clearly demonstrated that experts and developers were overly optimistic about the impact of AI on productivity. This is an important insight that should inform future research.

In some ways, the study raises more questions than it answers. It looked at a very particular situation: seasoned experts working in the familiar territory of their own large, mature projects. Future studies by METR and others could vary these conditions. What happens when we throw developers into unfamiliar codebases, where, at least in my anecdotal experience, AI agents shine? What about junior developers or new contributors to an established open-source codebase? Under what conditions can AI act as a great equalizer, compressing the skill gap and providing a speed boost rather than a slowdown?

Furthermore, the current study centered on completion time, but faster isn't always better. One possible follow-up would be a blinded study where human experts review pull requests without knowing whether AI was involved. We could then measure things like the number of review cycles, the time spent in review, and the long-term maintainability of the code. This might shed light on when and how AI-assisted development ends up trading short-term speed for long-term technical debt.

Finally, the field of AI is still evolving at a rapid pace. The synchronous workflow that the study's setup encouraged could be fundamentally suboptimal. Exploring different interaction models, such as the asynchronous delegation workflow that I've moved to, could yield very different results.

How to Work With AI Now

What follows are my current recommendations for using AI in your daily workflow based on my experiences and the METR study.

Adopt an Asynchronous Workflow

The biggest drain from using AI is the cognitive load of "babysitting" it. Instead of watching the agent work, adopt an asynchronous model:

  • Define one or more tasks (e.g., running a set of commands to audit a codebase for lint errors and documentation mistakes), let AI agents work on them in the background (e.g., in separate Git worktrees of your repository; see the sketch after this list), and turn your attention elsewhere.
  • Review the completed task(s) later. If the output is flawed, it's often better to discard it and have the model try again with a better prompt rather than engaging in a frustrating back-and-forth.
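
As a minimal sketch of this pattern, assuming the Claude Code CLI's non-interactive -p (print) mode and using a hypothetical worktree name and audit prompt (depending on your permission settings, you may need to pre-approve the tools the agent is allowed to use):

# Create an isolated checkout so the agent cannot disturb your working tree:
git worktree add ../stdlib-audit develop
cd ../stdlib-audit

# Kick off the agent in the background and capture its output for later review:
nohup claude -p "Run the project's lint commands, fix any errors you find, and summarize the changes." \
    > ../stdlib-audit.log 2>&1 &

# Return to your own work; inspect the log and the worktree's diff once the agent finishes:
cd -

The point is not the exact commands but the shape of the workflow: the agent's work happens off to the side, and you only pay the much cheaper cost of reviewing a finished diff.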

Know What to Delegate

AI can now handle the first 80% of many programming tasks, but the final 20% often requires deep context. The key is to choose the right tasks for AI:

  • "Vibe Code" and Prototypes: use AI for mock-ups or small, isolated tools that can be thrown away. This is where the technology's speed offers a distinct advantage.
  • Verifiable Code: AI is excellent for tasks that can be fully verified against an existing, robust test suite. The tests act as a safety net to catch the subtle mistakes the AI might make; a quick verification pass is sketched after this list.
  • Boilerplate Code: AI can quickly generate boilerplate code, such as REST API endpoints or form validation, and can do so in a way that follows project conventions.
  • Learning and Navigation: use AI to quickly learn your way around a large codebase, document previously undocumented code, or to get help with tools you use infrequently. Asking LLMs questions can be much faster than hunting through documentation, particularly if that documentation is split across multiple resources.
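
For the verifiable-code case, a post-agent verification pass might look like the following; the lint and test commands are placeholders for whatever entry points your project actually uses:

# See which files the agent touched relative to the development branch:
git diff --stat develop...HEAD

# Let the project's own guardrails catch style violations and regressions:
npm run lint
npm test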

Use and Customize Claude Code

For tools such as Claude Code, customization is a helpful means of writing down any implicit knowledge about the project that is not readily accessible from the code alone.

  • Provide Proper Context: drag and drop relevant files (this can include images!) into the Claude Code window for the model to use as context for the task at hand. One approach I have found useful is to add TODO comments in the codebase with the required changes and then have Claude Code work on them. Use the planning mode to have the model think through the task and generate a plan you can approve before it jumps into implementation.
  • Use Project Memory: use CLAUDE.md files to give the model project-specific memory, specifically on its architecture and unwritten knowledge. You can have multiple CLAUDE.md files in different project sub-directories, and the model will intelligently pick up the most relevant one based on your current context.
  • Build Custom Tooling: use the Claude CLI to build small, automated tools, such as a review bot that flags typos as a daily cron job (a sketch follows at the end of this list). For fuzzy tasks such as pointing out typos or inconsistencies in a PR, it's best to let Claude generate output that can be verified by a human. For well-defined tasks that can be fully automated, it is better to have Claude produce code that runs deterministically and can be verified.
  • Set up Hooks to Automate Actions: hooks are a powerful new feature of Claude Code that allows you to run scripts and commands at different points in Claude's agentic lifecycle.
  • Automate Repetitive Actions: create custom slash commands for frequent, routine tasks. Below is an example stdlib:review-changed-packages command that I run to flag any possible errors in PRs recently merged to our development branch:
- Pull down the latest changes from the develop branch of the stdlib repository.
- Get all commits from the past $ARGUMENTS day(s) that were merged to the develop branch.
- Extract a list of @stdlib packages touched by those commits.
- Review the packages for any typos, bugs, violations of the stdlib style guidelines, or inconsistencies introduced by the changes.
- Fix any issues found during the review.
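
To make the "Build Custom Tooling" item above concrete, here is a sketch of such a daily review bot built on the Claude CLI's non-interactive -p mode; the schedule, paths, diff range, and prompt are all illustrative placeholders, not the actual bot I run:

#!/bin/bash
# review-bot.sh: ask Claude for a read-only review of the last day's changes.
# Example crontab entry to run it every morning at 06:00:
#   0 6 * * * /path/to/review-bot.sh
cd /path/to/stdlib || exit 1
git fetch origin develop

# Pipe the last day's worth of commits on the development branch to Claude and
# store the findings for a human to triage:
git log --since='1 day ago' -p origin/develop \
    | claude -p "Review this diff for typos, inconsistencies, and style guide violations. Report findings only; do not modify any files." \
    > "/tmp/stdlib-review-$(date +%F).txt"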

Final Thoughts

It's natural to attack a study whose results you don't like. A better response is to ask what those results might be telling you. For me, the lesson is that there is still a lot to learn about how to use this new, powerful, but often deeply weird and unpredictable technology. One mistake is treating it as the driver in a pair programming session that requires your constant attention. Instead, treat it like a batch process for grunt work, freeing you to focus on the problems that actually require a human brain.


Philipp Burckhardt is a data scientist and software engineer securing software supply chains at Socket and a core contributor to stdlib.


stdlib is an open source software project dedicated to providing a comprehensive suite of robust, high-performance libraries to accelerate your project's development and give you peace of mind knowing that you're depending on expertly crafted, high-quality software.

If you've enjoyed this post, give us a star 🌟 on GitHub and consider financially supporting the project. Your contributions and continued support help ensure the project's long-term success and are greatly appreciated!