Blog

  • Task time vs AI Success – a new Moore’s Law?

    A few folks have sent me this paper by Kwa et al., and commentary by Toby Ord: Is there a Half-Life for the Success Rates of AI Agents? — Toby Ord.

    First the paper. Kwa and co-authors note,


    However, existing benchmarks face several key limitations. First, they often consist of artificial rather than economically valuable tasks. Second, benchmarks are often adversarially selected for tasks that current models struggle with compared to humans, biasing the comparison to human performance. Most critically, individual benchmarks saturate increasingly quickly, and we lack a more general, intuitive, and quantitative way to compare between different benchmarks, which prevents meaningful comparison between models of vastly different capabilities (e.g., GPT-2 versus o1).

    They propose and test a new measure to track AI progress: the task-completion time horizon, i.e. the length of tasks, measured by how long they take a human, that models can finish at an X% success rate (e.g., 50%). The approach is grounded in the psychometric literature and item-response theory.

    For example, consider a task like reading complex clinical cases and answering multiple-choice questions. The metric would relate whether the AI can answer at some success rate (e.g., 50%) to the time it takes expert clinicians to answer the same questions.
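    As a rough sketch of how such a horizon could be estimated (my own illustration with made-up data, not the paper's code): fit a success-versus-task-length curve and read off where it crosses the chosen success rate.

    ```python
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Made-up data: how long each task takes a human expert (minutes),
    # and whether the model succeeded on that task.
    human_minutes = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240])
    model_success = np.array([1, 1, 1, 1, 1, 0, 1, 0, 0])

    # Fit success probability against log task length (an item-response-style curve).
    X = np.log(human_minutes).reshape(-1, 1)
    clf = LogisticRegression().fit(X, model_success)

    # The 50% time horizon is where the fitted curve crosses p = 0.5,
    # i.e. where intercept + coef * log(t) = 0.
    t50 = np.exp(-clf.intercept_[0] / clf.coef_[0][0])
    print(f"Estimated 50% time horizon: {t50:.0f} minutes")
    ```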

    The authors provide a nice graph showing a Moore's Law-like doubling of models' time horizons roughly every 7 months.

    Toby Ord puts it this way:


    The idea of measuring improvement in AI capabilities over time via time horizons at a chosen success rate is novel and interesting. AI forecasting is often hamstrung by the lack of a good measure for the y-axis of performance over time. We can track progress within a particular benchmark, but these are often solved in a couple of years, and we lack a good measure of underlying capability that can span multiple benchmarks. METR’s measure allows comparisons between very different kinds of tasks in a common currency (time it takes a human) and shows a strikingly clear trend line — suggesting it is measuring something real.

    He also notes that the decay of performance with task length is not linear but follows a constant hazard rate model (an exponential survival curve).


    If AI agent success-rates drop off with task length in this manner, then the 50% success rate time-horizon for each agent from Kwa et al. is precisely the half-life of that agent. As with the half-life of a radioisotope, this isn’t just the median lifespan, it is the median remaining lifespan starting at any time — something that is only possible for an exponential survival curve. Unlike for particles, this AI agent half-life would be measured not in clock time, but in how long it takes a human to complete the task.


    This constant hazard rate model would predict that the time horizon for an 80% success rate is about ⅓ of the time horizon for a 50% success rate. This is because the chance of surviving three periods with an 80% success rate = (0.8)³ = 0.512 ≈ 50%. More precisely, the time horizon for a success probability of p would be ln(p)/ln(q) times as long as one with success probability q. So an 80% time-horizon would be ln(0.8)/ln(0.5) = 0.322 times as long as the 50% time-horizon.


    One rationale for this constant hazard rate model for AI agents is that tasks require getting past a series of steps each of which could end your attempt, with the longer the duration of the task, the more such steps. More precisely, if tasks could be broken down into a long sequence of equal-length subtasks with a constant (and independent) chance of failure, such that to succeed in the whole task, the agent needs to succeed in all subtasks, then that would create an exponential survival curve. I.e. when Pr(Task) = Pr(Subtask 1 & Subtask 2 & … & Subtask N).
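    To make the arithmetic concrete, here is a small sketch of that constant-hazard relationship (my own illustration, not code from either author):

    ```python
    import math

    def horizon_ratio(p: float, q: float = 0.5) -> float:
        """Under a constant hazard rate, the time horizon at success rate p
        is ln(p)/ln(q) times the time horizon at success rate q."""
        return math.log(p) / math.log(q)

    print(horizon_ratio(0.80))  # ~0.322: the 80% horizon is about 1/3 of the 50% horizon
    print(horizon_ratio(0.99))  # ~0.014: a 99% horizon is far shorter still
    ```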

    Why is this useful? It allows predictions of what length/complexity of task a given model can complete at high accuracy (e.g., 99% success); it provides a meaningful way to assess the complexity of tasks that models (and agentic systems) can complete; and, if the 7-month doubling time holds up, it offers an approach to forecasting model performance in the future.
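    And if the roughly 7-month doubling continues, the extrapolation is simple exponential growth (a hypothetical illustration of the arithmetic, not a forecast from the paper):

    ```python
    def projected_horizon(current_minutes: float, months_ahead: float,
                          doubling_months: float = 7.0) -> float:
        """Extrapolate a time horizon forward, assuming the ~7-month doubling holds."""
        return current_minutes * 2 ** (months_ahead / doubling_months)

    # Hypothetical: a model with a 1-hour 50% horizon today, two years out.
    print(projected_horizon(60, 24))  # ~646 minutes, roughly 10-11 hours
    ```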

    This needs to continue to be tested; but if it holds up it may represent a kind of Moore’s Law for AI systems.

  • Vaccines are safe

    Dr. Jake Scott provides a good summary of what is known about vaccine safety, drawn from a database of studies he maintains.

    I’m a physician who has looked at hundreds of studies of vaccine safety, and here’s some of what RFK Jr. gets wrong

  • Software (3.0) will eat the world?

    In a refreshing break from the AI hype, Andrej Karpathy has a great talk on YouTube, given at Y Combinator's AI Startup School:

    Andrej Karpathy: Software Is Changing (Again) – YouTube.

    Among the highlights (for me):

    1. If writing traditional programs with bespoke code in IDEs by software engineers was Software 1.0, he frames two additional phases of software development. Software 2.0 was the broad dissemination and use of neural networks, such as those used in computer vision, classifiers, and networks trained through supervised learning. Software 3.0, which we are in now, is generative AI with natural language understanding. Andrej does not think we are at a time in which the software engineer is obsolete. LLMs are too unreliable: they are non-deterministic, lack guardrails, and don't actually think. But they can be a component of larger systems, in which things that were once hard and required hand coding are increasingly done by AI tools. Larger and larger sections of codebases will be able to be moved into a larger AI-based system (and Software 3.0 will eat older codebases).
    2. Large LLM vendors can be thought of as utilities (ubiquitous, an enabler of other tools, valued for stability and uptime, paid for by metered usage), fabs (manufacturers of commodities that are built into tools, requiring ongoing capital expense and extensive R&D), or operating systems (software ecosystems for developers to build on top of, with some switching costs to change).
    3. AI as a product is still not understood. Human-in-the-loop systems will produce the most value, as these can manage the AI tools and impose guardrails on the systems. The best UIs still remain to be built.

    I found the talk very pragmatic and a nice counterpoint to artificial superintelligence hype.

  • Orchestrating the H&P

    Nick Mokey at VentureBeat has a good discussion of an Oxford-authored paper that further shows the gaps in current LLM systems for healthcare. The authors of the paper (Bean et al.) found


    that while LLMs could correctly identify relevant conditions 94.9% of the time when directly presented with test scenarios, human participants using LLMs to diagnose the same scenarios identified the correct conditions less than 34.5% of the time.

    Perhaps even more notably, patients using LLMs performed even worse than a control group that was merely instructed to diagnose themselves using “any methods they would typically employ at home.” The group left to their own devices was 76% more likely to identify the correct conditions than the group assisted by LLMs.

    The Oxford study raises questions about the suitability of LLMs for medical advice and the benchmarks we use to evaluate chatbot deployments for various applications.

    Mokey notes,

    When we say an LLM can pass a medical licensing test, real estate licensing exam, or a state bar exam, we’re probing the depths of its knowledge base using tools designed to evaluate humans. However, these measures tell us very little about how successfully these chatbots will interact with humans.

    “The prompts were textbook (as validated by the source and medical community), but life and people are not textbook,” explains Dr. Volkheimer….

    Real customers use vague terms, express frustration, or describe problems in unexpected ways. The LLM, benchmarked only on clear-cut questions, gets confused and provides incorrect or unhelpful answers…

    This study serves as a critical reminder for AI engineers and orchestration specialists: if an LLM is designed to interact with humans, relying solely on non-interactive benchmarks can create a dangerous false sense of security about its real-world capabilities. If you’re designing an LLM to interact with humans, you need to test it with humans – not tests for humans.

    Medical students are taught the art of performing a history and physical. It is an integration of eliciting information, pattern matching on the fly, and listening with an inner empathy balanced by learned skepticism, all while building trust through bedside manner. Their final two years are spent building this critical skill. Some third- and fourth-year medical students struggle because, while they had always been the best at multiple-choice question (MCQ) assessments in every class they had ever taken, they are now being evaluated by peers on how they interact with and treat other people.

    In residency, this is taken further – in the fast pace of internship and beyond, one must learn how to optimize time management. What are the key facts this patient in front of me is conveying? How can I quickly assess their needs to make sure I can treat them effectively?

    We do not have the benchmarks needed to assess real-world AI performance in a patient setting. Healthcare benchmarks to date are like the tests a second-year med student has aced: MCQ-based knowledge assessments that indicate little about the most important elements of patient care.

    Additionally, we need to think about how we orchestrate effective experiences. How do we break down the components of a patient interaction and connect those elements? “AI” needs to be viewed as a system that orchestrates the multiple tasks the caregiver is managing in the encounter: eliciting a history as a trusted partner, pattern matching on the fly with guided follow-up questions, a targeted physical exam, and ordering the necessary and sufficient diagnostics for a workup.
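    One way to picture that kind of orchestration (a hypothetical sketch only: the step names and data structure are mine, and each step stands in for a model, tool, or human check, not any real product):

    ```python
    from dataclasses import dataclass, field
    from typing import Callable

    @dataclass
    class Encounter:
        """A toy container for what the encounter produces."""
        transcript: list = field(default_factory=list)
        findings: dict = field(default_factory=dict)
        orders: list = field(default_factory=list)

    def elicit_history(enc: Encounter) -> Encounter:
        # e.g., a conversational model asking guided follow-up questions
        enc.transcript.append("chief complaint, history of present illness, ...")
        return enc

    def targeted_exam(enc: Encounter) -> Encounter:
        # exam prompts driven by the working differential
        enc.findings["exam"] = "focused findings suggested by the history"
        return enc

    def order_workup(enc: Encounter) -> Encounter:
        # necessary-and-sufficient diagnostics, checked against guidelines
        enc.orders.append("targeted labs / imaging")
        return enc

    PIPELINE: list[Callable[[Encounter], Encounter]] = [
        elicit_history, targeted_exam, order_workup,
    ]

    def run_encounter() -> Encounter:
        enc = Encounter()
        for step in PIPELINE:
            enc = step(enc)  # each step could be a different model, tool, or human
        return enc
    ```

    The point is less the code than the shape: the unit of design is the encounter, not a single prompt.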

    This is why the paper from Apple last week matters for healthcare. The complexity of the healthcare challenge is like the Tower of Hanoi: seemingly just an extension of better pattern matching, but actually a problem of insight. It is not a scaling problem, but a systems problem. It won't be solved by bigger LLMs or more compute, but by better architectures that perhaps agents will address.

  • Tariffs and tough choices

    Hospital purchasing still buffeted by trade winds


    “Everyone in the supply chain, from hospitals to suppliers to manufacturers, is grappling with how to plan thoughtfully and proceed in a way that doesn’t either under- or over-correct for the potential impacts of these tariffs,” Akin Demehin, the American Hospital Association’s vice president of quality and patient safety policy, told Axios.

    “Are there going to be instances where those low margin products are just not worth manufacturing anymore?” Hendrickson said.

    I’ve heard anecdotally from hospital executives that tariffs are a threat to already-tight margins. Along with changes to indirect payments, cuts to grants, inflation, and general uncertainty, academic medical centers face tough financial decisions.

  • American Medicine’s Meagerness Paradox

    American Medicine’s Meagerness Paradox – The Health Care Blog

    Beautifully written piece by Mark-David Munk about the contradiction between the vast amounts of money that enter the healthcare system and the day-to-day, on-the-ground reality of patient care in important settings.


    Here, regulatory agencies have made themselves both expensive and indispensable. The American Board of Internal Medicine brought in $90 million in fees in 2023 (and there are 23 other specialty boards). The Joint Commission pulled in $208 million last year. Press Ganey, which owns a large part of the mandatory patient survey business, reportedly had revenues in the hundreds of millions of dollars (before they stopped reporting revenue figures after being bought by private equity). The medical journal business is especially galling: doctors write, edit, and review articles for free, yet those journals are locked behind paywalls. Elsevier’s parent company, with over 2,500 journals, generated £3.06 billion in revenue in 2023 with a 38% profit margin.

    Good luck saying no to all this. We’re stuck. Doctors have no choice but to be board certified. Hospitals must be surveyed. Expensive licenses and permits are non-negotiable. We pay what they ask, with increases year after year. On these cocooned organizations, we impose few demands, few hard bargains and few consequences for poor value. In my early years in medicine it felt like there was at least a veneer of plentitude. These days, I look at our worn clinic and our patients who respond with dignity as I explain that their insurer has rejected their fifteen-dollar pain medication prescription, again.

  • Has interoperability’s time finally come?

    FHIR-based interoperability has been tantalizingly close for over 5 years. Interoperability that renders data coming from EHRs and claims into a commodity, de-risks connecting to hospitals, and lets a company focus on the use case and value proposition of its product remains out of reach. I think that companies offering data integration services are safe for now.

    With the recent federal RFI on FHIR, digital transformation, and interoperability, I have been thinking about whether I would build a “FHIR-first” healthcare company if I were a new founder.

    In full disclosure, I’ve been skeptical about the impact of FHIR on the industry over the past few years. In a larger sense, I would say I have developed a deeper cynicism about standards-based data exchange in healthcare. This is the result of 15 years in public health and clinical informatics in which I have seen the promise and later disappointment of multiple iterations of standards. The barriers to success are often similar: lack of adoption; overly complex standards; variable implementations; addressing the letter, not the spirit, of new rules; and a lack of business models to support change. I would not, in 2009 at the start of meaningful use, have thought that in 2025 this would still be unsolved.

    Josh Mandel has been on a tear on LinkedIn, describing the many current barriers to true interoperability. His publicly available response to the federal RFI is a great read.

    Here are some of the things that are still risk points for digital health companies and will cost extra money and time if not thought through. For a new startup or business seeking to address a healthcare problem, what are the ease of implementation and the risks of using FHIR? Can I reliably count on a QHIN to have data coverage in the population of interest to me, and what will the cost be? Will the data be sufficiently standardized across the participants in that network that I can minimize the transformation work I need to do? And what is different now compared to the era of HIEs that makes QHINs more likely to be successful?

    Thinking about provider customers, can my integration team count on a health system or provider to be a strong enough partner to make FHIR worthwhile? Will they be USCDI v3 capable? Will my company’s support of FHIR standards be seen as a positive or a value proposition by CIO leaders?

    Perhaps the information blocking rules have made data exchange from EHR vendors necessary and available, enabling a market for integration and for clean, standardized data to exist.

    There also remains the issue of write access to the EHR. Providers want workflows that are in the EHR and seamless. How does a company deliver that to them? Given that FHIR write capability is really still at the prototype stage in EHRs, FHIR APIs today represent just a data source. Will CDS Hooks be available, or will we be the first implementation for a customer? Given all the tasks we need to do, I’m not sure I want to be first unless a) it scales broadly in the industry and b) it makes integration into workflow easy.
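    To make the “just a data source” point concrete, here is a minimal read-only sketch (the base URL, token, and patient ID are placeholders I made up): you can pull resources out over the standard search API, but nothing here writes back into the clinician’s workflow.

    ```python
    import requests

    # Hypothetical FHIR R4 endpoint and credentials (not a real system).
    BASE_URL = "https://ehr.example.org/fhir/R4"
    HEADERS = {"Authorization": "Bearer <token>", "Accept": "application/fhir+json"}

    # Standard FHIR search: recent lab Observations for one patient.
    resp = requests.get(
        f"{BASE_URL}/Observation",
        params={"patient": "example-patient-id", "category": "laboratory", "_count": 10},
        headers=HEADERS,
    )
    resp.raise_for_status()
    bundle = resp.json()

    for entry in bundle.get("entry", []):
        obs = entry["resource"]
        print(obs.get("code", {}).get("text"),
              obs.get("valueQuantity", {}).get("value"))
    ```

    Reading like this is well supported; pushing a result or recommendation back into the chart is where the prototype-stage gaps show up.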

    But having said all this, I must note that there has been slow and steady progress. Information blocking and TEFCA have improved the landscape of data sharing. Epic has a path for rapid app distribution. There is a generational change in available technology and customer expectations occurring, and persistent progress across administrations.

    So for a healthcare technology company, there could be value in putting some time and expertise on the table as an investment in a FHIR-based future. If the cards are dealt correctly, it could become a differentiator, an accelerant, and a means to leap ahead if data is a key asset. But you won’t be able to give up on HL7 v2.x just yet.

  • However, vibe coding is moving forward

    Software engineers are finding uses for LLMs and are deploying AI-assisted code to enterprises and consumers. The CTO of Cloudflare wrote and open-sourced an OAuth library using Claude (Max Mitchell | I Read All Of Cloudflare’s Claude-Generated Commits), and a self-professed “product and GTM guy” used Cursor to develop a photo management app for iOS in Swift that he sounds ready to put into the App Store. In both instances it was human + AI, and the generated code helped the human be faster and learn along the way.