Key Research on AI Capabilities and Impact

Last updated June 30, 2025

Dozens of AI-related papers get released daily through journals, arXiv, and labs. It’s hard to keep up. But every so often, significant research drops that shapes how I think about AI, and that I keep on file for reference. This page collects those findings, organized alphabetically by category and, within each category, from newest to oldest.

Coding

2025-03 | Coding task length doubles every seven months

  • Findings: Researchers proposed the “50% task-completion time horizon”—the human clock-time of tasks an AI agent finishes half the time. Claude 3.7 Sonnet already tackles ~50-minute jobs, beating GPT-4 and 94% of expert programmers. Horizon length doubled every seven months from 2019–2025, pointing to month-long software projects before 2030. Progress stems from sharper reasoning, richer tool use, and quicker error recovery, but raises autonomy and bio-risk concerns.

    • Update (April 2025): Fresh benchmarking with newer models shows the horizon now doubles roughly every four months, so the paper’s original seven-month estimate may understate current velocity.

  • Method: The team logged 2,529 hours of human work across 170 software-engineering, ML-research, and cybersecurity tasks (HCAST, RE-Bench, SWAA). Thirteen models spanning GPT-2 to o1 ran in identical agent scaffolds; logistic curves linked success to human task time, and hierarchical bootstraps traced trends. (A stripped-down version of the time-horizon fit is sketched after this entry.)

  • Reference: Kwa, Thomas, et al. “Measuring AI Ability to Complete Long Tasks.” arXiv, 30 Mar 2025. https://doi.org/10.48550/arXiv.2503.14499.
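
To make the time-horizon metric concrete, here is a minimal sketch of the core calculation (my own toy example, not the paper’s code): fit a logistic curve of agent success against the log of human task length, then read off the length at which predicted success crosses 50%. The per-task data below are invented; only the seven-month doubling time comes from the paper.

```python
# Toy "50% time horizon" estimate on invented data.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical tasks: how long each takes a human (minutes) and whether the agent succeeded.
human_minutes = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480])
agent_success = np.array([1, 1, 1, 1, 1, 1, 0, 1, 0, 0])

X = np.log2(human_minutes).reshape(-1, 1)
fit = LogisticRegression(C=10.0).fit(X, agent_success)

# The 50% horizon is the task length where intercept + coef * log2(minutes) = 0.
log2_horizon = -fit.intercept_[0] / fit.coef_[0, 0]
horizon = 2 ** log2_horizon
print(f"Estimated 50% time horizon: {horizon:.0f} human-minutes")

# Extrapolating the trend: with doubling time d (months), the horizon after m months grows by 2^(m/d).
d, m = 7, 24
print(f"Projected horizon in {m} months: {horizon * 2 ** (m / d):.0f} human-minutes")
```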

2025-02 | GitHub Copilot increases coding output 26%

  • Findings: Across three randomized trials at Microsoft, Accenture, and a Fortune 100 firm, software developers who used GitHub Copilot completed 26.08% more tasks per week than their peers. Gains were strongest for newer, less senior, and previously less productive developers—who also adopted the tool faster and stuck with it longer. More experienced developers saw smaller or no productivity benefits. Output quality did not decline, suggesting Copilot's speed boost did not come at the cost of reliability.

  • Method: In total, 4,867 developers participated in field experiments lasting 2–8 months. Treatment groups received access to Copilot; control groups did not, with rollout later extended to all participants. Developer productivity was measured using GitHub activity: pull requests, commits, builds, and build success rates. Imperfect compliance was handled using instrumental variable regression, weighted by treatment-control adoption gaps (a minimal two-stage version is sketched after this entry).

  • Reference: Cui, Kevin Zheyuan, et al. “The Effects of Generative AI on High-Skilled Work: Evidence from Three Field Experiments with Software Developers.” SSRN, Feb. 2025. https://dx.doi.org/10.2139/ssrn.4945566.
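
The instrumental-variable step is easier to see in code. Below is a minimal two-stage least squares sketch on simulated data (my own setup, not the authors’ specification): random assignment to Copilot access serves as the instrument for actual adoption, and the second stage recovers the effect of adoption on weekly output.

```python
# Toy two-stage least squares for imperfect compliance, on simulated data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 4000
assigned = rng.integers(0, 2, n)                            # randomized access to the tool
adopted = (assigned & (rng.random(n) < 0.7)).astype(float)  # imperfect take-up
weekly_output = 5 + 1.3 * adopted + rng.normal(0, 2, n)     # true adoption effect = 1.3

# Stage 1: predict adoption from the random assignment (the instrument).
stage1 = sm.OLS(adopted, sm.add_constant(assigned)).fit()
adopted_hat = stage1.predict(sm.add_constant(assigned))

# Stage 2: regress the outcome on predicted adoption.
stage2 = sm.OLS(weekly_output, sm.add_constant(adopted_hat)).fit()
print(stage2.params)  # slope should land near the true effect of 1.3
# A real analysis would correct the second-stage standard errors or use a dedicated IV routine.
```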

2023-02 | GitHub Copilot cuts coding time 56%

  • Findings: In a single-task test, professional freelance developers with GitHub Copilot finished an HTTP-server implementation in ≈71 min on average versus ≈161 min for the control group, a 55.8% speed-up (the arithmetic is worked out after this entry). The share of participants whose code passed all 12 unit tests was 7 pp higher for Copilot users, although the difference was not statistically significant. Benefits were largest for less-experienced and older programmers.

  • Method: 95 programmers recruited on Upwork were randomly assigned to treatment (Copilot, n=45) or control (n=50). Each wrote a JavaScript HTTP server; GitHub Classroom logged time-to-first-passing-submission and unit-test results automatically. Demographics and self-reported perceptions were collected in entry/exit surveys.

  • Reference: Peng, Sida, et al. “The Impact of AI on Developer Productivity: Evidence from GitHub Copilot.” arXiv, 13 Feb. 2023, https://doi.org/10.48550/arXiv.2302.06590.
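
As a quick sanity check on the headline number, the speed-up is just the relative drop in mean completion time; the rounded means reported above reproduce it almost exactly.

```python
# Percent speed-up from the (rounded) mean completion times reported above.
copilot_min, control_min = 71, 161
reduction = 1 - copilot_min / control_min
print(f"Time reduction: {reduction:.1%}; control took {control_min / copilot_min:.2f}x as long")
```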

Creativity

2024-02 | GPT-4 beats humans on creativity tests

  • Findings: In side-by-side trials on three staple divergent-thinking assessments—the Alternate Uses, Consequences, and Divergent Associations tasks—GPT-4 beat 151 human participants on every creativity metric. Even when the number of ideas was held constant, the model’s responses were markedly more original and up to seven times more elaborated, underscoring its capacity to deliver fresher, richer ideas than the average person.

  • Method: Researchers paired 151 GPT-4 chat sessions with 151 U.S. adults recruited via Prolific. Both groups completed the same tasks; GPT-4 was told to generate exactly as many ideas as its human counterpart for each prompt. Outputs were scored blind with the Open Creativity Scoring tool (GloVe semantic-distance metrics) and word-count analysis, then compared with standard statistical tests, confirming the AI’s clear advantage across all three tasks. (A toy version of semantic-distance scoring follows this entry.)

  • Reference: Hubert, Kent F., Kim N. Awa, and Darya L. Zabelina. “The Current State of Artificial Intelligence Generative Language Models Is More Creative Than Humans on Divergent Thinking Tasks.” Scientific Reports, vol. 14, article 3440, 10 Feb. 2024. https://doi.org/10.1038/s41598-024-53303-w.
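
Semantic-distance scoring is worth a small illustration: an idea counts as more original the farther its word vectors sit from the prompt. The sketch below uses pretrained GloVe vectors via gensim and invented example responses; it is a rough approximation of the idea, not the study’s scoring pipeline.

```python
# Rough semantic-distance "originality" score using pretrained GloVe word vectors.
import numpy as np
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-50")  # downloads a small pretrained model on first run

def originality(prompt_word, response_words):
    """Mean cosine distance between the prompt word and each response word."""
    p = glove[prompt_word]
    dists = []
    for w in response_words:
        v = glove[w]
        cosine = np.dot(p, v) / (np.linalg.norm(p) * np.linalg.norm(v))
        dists.append(1 - cosine)
    return float(np.mean(dists))

# Alternate Uses-style prompt: uses for a "brick".
print(originality("brick", ["wall", "house"]))    # obvious uses -> typically lower distance
print(originality("brick", ["music", "poetry"]))  # remote associations -> typically higher distance
```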

Education

2025-06 | AI tutor delivers 3.2 years of learning in six weeks

  • Findings: A World Bank-backed study found that Microsoft’s Copilot (powered by GPT-4) significantly improved learning among high school students in Nigeria. Over just six weeks, students who used the AI tutor scored higher on English tests and maintained their gains on regular school exams. The improvement was equivalent to moving an average student into the top third of the class. Girls and stronger students saw the biggest benefit, but everyone gained with more time spent using the tool. This was one of the first rigorous tests of generative AI in classrooms in the Global South. It showed that AI tutors can drive real academic progress even in low-resource settings—and do so at remarkably low cost. For every $100 spent, the program delivered the equivalent of 3.2 years of traditional learning. That places it among the most cost-effective education interventions ever recorded.

  • Method: Researchers ran a randomized controlled trial across nine public schools in Benin City. Students were randomly assigned to either attend AI-assisted after-school sessions or stick with traditional instruction. Those in the AI group met in computer labs twice a week, working in pairs while teachers offered support. The team compared outcomes across tests, controlled for starting scores, and verified the results through multiple statistical checks.

  • Reference: De Simone, Martín, et al. From Chalkboards to Chatbots: Evaluating the Impact of Generative AI on Learning Outcomes in Nigeria. World Bank Policy Research Working Paper 11125, May 2025. https://documents.worldbank.org/en/publication/documents-reports/documentdetail/099548105192529324.

2025-05 | ChatGPT significantly boosts learning outcomes

  • Findings: Across 51 classroom experiments, students who used ChatGPT outperformed peers by almost a full standard deviation (Hedges’ g=0.867), roughly the difference between the 80th and 50th percentile. Smaller but still meaningful lifts appeared in learners’ self-reported confidence (g=0.456) and higher-order thinking skills (g=0.457). The biggest academic gains showed up in problem-based courses and in programs that ran the AI for 4–8 weeks.

  • Method: The authors conducted a random-effects meta-analysis of 51 experimental and quasi-experimental studies (November 2022 to February 2025). They compared ChatGPT-assisted groups with traditional instruction, tested for publication bias, and ran moderator tests on grade level, course type, pedagogy, duration, ChatGPT’s role (tutor, partner, tool), and subject domain. Results remained robust after all bias checks. (The pooling logic and the percentile interpretation are sketched after this entry.)

  • Reference: Wang, Jin, and Wenxiang Fan. “The Effect of ChatGPT on Students’ Learning Performance, Learning Perception, and Higher-Order Thinking: Insights from a Meta-Analysis.” Humanities and Social Sciences Communications, vol. 12, article 621, 2025. https://doi.org/10.1057/s41599-025-04787-y.
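
Two pieces of the statistics here are easy to make concrete: converting a standardized effect into a percentile (which is where the “80th vs. 50th percentile” framing comes from), and pooling study-level effects with inverse-variance weights that include a between-study variance term. A compact sketch of both, with invented study data:

```python
# Percentile reading of a standardized effect, plus a DerSimonian-Laird random-effects
# pool over invented study-level effects (not the meta-analysis' actual data).
import numpy as np
from scipy.stats import norm

print(f"g = 0.867 puts the average treated student above {norm.cdf(0.867):.0%} of the comparison group")

g = np.array([0.4, 0.9, 1.2, 0.6, 0.8])       # hypothetical per-study Hedges' g values
v = np.array([0.05, 0.04, 0.10, 0.06, 0.03])  # hypothetical sampling variances

w_fe = 1 / v
g_fe = np.sum(w_fe * g) / np.sum(w_fe)               # fixed-effect pooled mean
q = np.sum(w_fe * (g - g_fe) ** 2)                   # heterogeneity statistic
c = np.sum(w_fe) - np.sum(w_fe ** 2) / np.sum(w_fe)
tau2 = max(0.0, (q - (len(g) - 1)) / c)              # between-study variance estimate

w = 1 / (v + tau2)                                   # random-effects weights
pooled = np.sum(w * g) / np.sum(w)
se = np.sqrt(1 / np.sum(w))
print(f"Pooled g = {pooled:.3f} (95% CI {pooled - 1.96 * se:.3f} to {pooled + 1.96 * se:.3f})")
```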

Emotional intelligence

2025-05 | LLMs beat human emotional intelligence scores

  • Findings: Large language models including GPT-4, o1, and others scored on average 25 percentage points higher than humans across five standardized emotional intelligence tests. These tests—used to assess things like leadership, HR, and counseling aptitude—measure how well someone understands emotions, manages them, and selects effective responses. The results show that LLMs now reason through emotional scenarios better than most people, with implications for customer service, coaching, hiring, mental health support, and other areas where high emotional intelligence is critical to success.

  • Method: The researchers administered five validated emotional intelligence assessments—STEM, STEU, GEMOK-Blends, GECo-Reg, and GECo-Mgmt—to six LLMs, each tested 10 times, and compared results to human baselines from the tests’ original development samples. Human scores averaged 56%; AIs, 81%.

  • Reference: Schlegel, Katja, et al. “Large Language Models Are Proficient in Solving and Creating Emotional Intelligence Tests.” Communications Psychology, vol. 3, 2025, article 80, Nature Portfolio, https://doi.org/10.1038/s44271-025-00258-x.

Employment

2023-07 | ChatGPT reduced freelancer gigs and pay

  • Findings: After ChatGPT’s public launch (November 2022), writing-focused freelancers on Upwork saw a rapid 2% drop in monthly jobs and a 5% fall in earnings, with both the likelihood of landing any job and the number of jobs per active worker sliding. Similar or larger declines hit designers after the release of image-generating models such as DALL-E 2. Notably, top-rated and higher-earning freelancers were not insulated; if anything, the losses skewed toward them, hinting that generative AI can compress skill premiums.

  • Method: The researchers scraped 92,547 single-occupation Upwork profiles and tracked monthly job counts and income from January 2022 to April 2023. Using a difference-in-differences design, they compared “treated” occupations (writing, editing, proofreading; later, visual design) with less-affected roles, controlling for freelancer and time fixed effects and clustering errors by occupation. Robustness checks—event studies, matching, wild bootstrap errors, and a parallel analysis of image-generation tools—reinforced the core result that generative AI substitutes rather than complements knowledge workers in the short run. (A toy version of the difference-in-differences setup follows this entry.)

  • Reference: Hui, Xiang, Oren Reshef, and Luofeng Zhou. “The Short-Term Effects of Generative Artificial Intelligence on Employment: Evidence from an Online Labor Market.” SSRN, 31 July 2023, https://dx.doi.org/10.2139/ssrn.4527336.
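
The difference-in-differences logic compares how outcomes changed for exposed occupations versus less-exposed ones after ChatGPT launched. Here is a minimal two-way fixed-effects version on simulated panel data (my own toy example, not the authors’ specification):

```python
# Toy difference-in-differences with worker and month fixed effects, on simulated data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_workers, n_months = 200, 16
df = pd.DataFrame([
    {"worker": w, "month": t,
     "treated": int(w < n_workers // 2),  # writing-type occupations
     "post": int(t >= 10)}                # months after ChatGPT's launch
    for w in range(n_workers) for t in range(n_months)
])
# Outcome: jobs per month, with a true post-launch drop of 0.4 for treated workers.
df["jobs"] = (5 + 0.5 * df["treated"] - 0.4 * df["treated"] * df["post"]
              + rng.normal(0, 1, len(df)))

did = smf.ols("jobs ~ treated:post + C(worker) + C(month)", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["worker"]})
print(did.params["treated:post"])  # DiD estimate; should land near -0.4
```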

Healthcare

2025-06 | AI outdiagnoses doctors with lower test costs

  • Findings: Microsoft’s “MAI-DxO” agent framework achieved 85.5% diagnostic accuracy and significantly reduced testing costs using a new benchmark based on 304 real-world NEJM clinical cases. Even the standalone OpenAI o3 model outperformed experienced physicians, suggesting that current frontier models already surpass human performance in complex diagnostic reasoning. Cost savings were substantial: MAI-DxO cut test costs by ~70% compared to o3 alone, and ~20% compared to human doctors.

  • Method: Researchers created the Sequential Diagnosis Benchmark (SDBench) by turning NEJM Clinicopathological Conference case studies into interactive simulations. AI models—and humans—had to query patients, order tests, and provide a final diagnosis, with each test incurring a virtual cost. Performance was evaluated on historical cases and a hold-out set of 56 new cases published after the models’ known training cut-offs to reduce contamination risk. (A skeleton of the cost-tracking episode loop follows this entry.)

  • Reference: Nori, Harsha, et al. “Sequential Diagnosis with Language Models.” arXiv, 27 June 2025. https://doi.org/10.48550/arXiv.2506.22405.
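
To make the benchmark’s mechanics concrete, each case plays out as an interactive episode: the agent asks questions or orders tests, every test adds to a running virtual bill, and the episode ends with a diagnosis scored against the case’s ground truth. The skeleton below is entirely hypothetical (the names, prices, and toy scoring rule are mine, not Microsoft’s implementation):

```python
# Skeleton of a sequential-diagnosis episode with a running virtual test budget.
# All names, prices, and the toy scoring rule are hypothetical.
from dataclasses import dataclass, field

TEST_PRICES = {"cbc": 20, "chest_ct": 400, "biopsy": 1200}  # made-up costs

@dataclass
class Episode:
    ground_truth: str
    findings: dict                       # test name -> simulated result text
    spent: int = 0
    log: list = field(default_factory=list)

    def order(self, test: str) -> str:
        self.spent += TEST_PRICES[test]
        result = self.findings.get(test, "unremarkable")
        self.log.append((test, result, self.spent))
        return result

    def finalize(self, diagnosis: str) -> dict:
        return {"correct": diagnosis.lower() == self.ground_truth.lower(),
                "total_cost": self.spent, "steps": len(self.log)}

ep = Episode(ground_truth="lymphoma", findings={"biopsy": "atypical lymphoid cells"})
ep.order("cbc")
ep.order("biopsy")
print(ep.finalize("Lymphoma"))  # {'correct': True, 'total_cost': 1220, 'steps': 2}
```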

2025-06 | AI does 12 years of medical review in 2 days

  • Findings: A fully automated LLM workflow reproduced and updated 12 Cochrane Reviews—work typically requiring 12 person-years—in under 48 hours. The system outperformed expert humans in both screening (96.7% sensitivity vs. 81.7%) and data extraction (93.1% accuracy vs. 79.7%), with a blinded panel siding with the AI over original Cochrane extractions in 69% of disagreements. Contrary to a common concern, most AI “errors” stemmed from inaccessible data, not hallucination.

  • Method: The researchers built a multi-agent pipeline (otto-SR) combining OpenAI’s GPT-4.1 for abstract/full-text screening and o3-mini-high for data extraction. It processed 146,276 citations across 12 reviews from the April 2024 issue of the Cochrane Database. Meta-analyses replicated original results, with newly included studies altering statistical conclusions in several cases. Performance was benchmarked against dual human reviewers and the commercial tool Elicit. (The sensitivity and accuracy calculations are illustrated after this entry.)

  • Reference: Cao, Christian, et al. “Automation of Systematic Reviews with Large Language Models.” medRxiv, 13 June 2025, https://doi.org/10.1101/2025.06.13.25329541.
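
The two headline metrics are simple to compute: screening sensitivity is the share of truly eligible studies the screener kept, and extraction accuracy is the share of data fields pulled correctly. A small worked example on invented labels:

```python
# Sensitivity of study screening and accuracy of data extraction, on invented labels.
import numpy as np

truly_include = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 0])  # gold-standard inclusion decisions
screener_says = np.array([1, 1, 1, 0, 0, 1, 0, 0, 1, 0])  # screener's decisions

tp = np.sum((truly_include == 1) & (screener_says == 1))
fn = np.sum((truly_include == 1) & (screener_says == 0))
print(f"Sensitivity: {tp / (tp + fn):.1%}")                 # 4 of 5 eligible kept -> 80.0%

fields_correct = np.array([1, 1, 1, 0, 1, 1, 1, 1])         # extracted fields vs. gold standard
print(f"Extraction accuracy: {fields_correct.mean():.1%}")  # 87.5%
```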

2024-10 | GPT-4 gives accurate diagnoses doctors ignore

  • Findings: In a single-blind trial, GPT-4 on its own scored a 92% diagnostic-reasoning grade, beating board-certified internists by 16 percentage points. Yet when those same physicians could see the chatbot’s ranked differentials, their own scores ticked up by just 2 pp—a statistically insignificant lift. In practice, doctors often stuck with their initial hunches even when the AI’s answer was demonstrably better, leaving most of the accuracy gain on the table.

  • Method: Fifty attendings and residents from three US academic centers were randomized to work through up to six real-case vignettes with either conventional references or conventional + GPT-4. A structured-reflection rubric captured differential breadth, supporting/contradictory evidence, next steps, and final diagnosis. Reviewers blinded to group assignment scored 244 physician case submissions; a separate arm ran GPT-4 alone on the same cases for comparison.

  • Reference: Goh, Ethan, et al. “Large Language Model Influence on Diagnostic Reasoning: A Randomized Clinical Trial.” JAMA Network Open, vol. 7, no. 10, 28 Oct. 2024, e2440969. https://dx.doi.org/10.1001/jamanetworkopen.2024.40969.

Innovation

2025-03 | GPT-4 lets one person out-innovate two

  • Findings: In a live test with 776 Procter & Gamble professionals, solo employees who used GPT-4 generated ideas 0.37 SD higher in quality than solo peers without AI—on par with two-person teams working unaided. When a pair used GPT-4, their odds of hitting the top-10% of ideas doubled versus human-only teams. AI users also worked 12–16% faster, wrote more detailed proposals, reported more positive emotions, and produced concepts that mixed R&D and commercial perspectives instead of staying in their functional lanes.

  • Method: The authors ran a preregistered, 2 × 2 field experiment inside P&G’s product-innovation “hackathon.” Participants were randomly assigned to one of four cells—solo or pair, with or without GPT-4 access—and tasked with real business challenges. Blind expert judges scored 550 final submissions on quality, novelty, feasibility and business impact; time-on-task, word count, sentiment shifts and AI-retention metrics provided additional data.

  • Reference: Dell’Acqua, Fabrizio, et al. “The Cybernetic Teammate: A Field Experiment on Generative AI Reshaping Teamwork and Expertise.” Harvard Business School Working Paper, no. 25-043, 28 Mar. 2025. SSRN, https://dx.doi.org/10.2139/ssrn.5188231.

Persuasion

2025-05 | AI outpersuades paid human persuaders

  • Findings: In a high-stakes experiment, Claude 3.5 Sonnet convinced people to change their minds more often than real humans who were paid to persuade. The AI boosted correct answers more than its human counterparts—and misled more effectively when asked to deceive. People followed the AI’s advice nearly 8 percentage points more often. This confirms AI’s growing power to sway opinion, for better or worse.

  • Method: Researchers ran a live-chat quiz with 1,242 US participants. In each round, one person answered a multiple-choice question while another—either a human or Claude 3.5—tried to influence their answer. Both sides had cash incentives. The experiment covered trivia, common misinformation, and near-term forecasts, testing both helpful and deceptive nudges. In total, the team analyzed over 12,000 messages.

  • Reference: Schoenegger, Philipp, et al. “Large Language Models Are More Persuasive Than Incentivized Human Persuaders.” arXiv, 14 May 2025, https://doi.org/10.48550/arXiv.2505.09662.

2025-04 | AI chatbots out-argue 98% of Reddit debaters

  • Findings: In a live test on r/ChangeMyView, AI-powered accounts persuaded the original poster in 17–18% of debates, five to six times the human success rate (~3%) and enough to place in the 98th–99th percentile of all users, including the subreddit’s top experts. No Redditor flagged the comments as machine-written, highlighting the ease with which AI can pass for, and outperform, human persuaders. (The working paper has since been withdrawn amid ethical debate.)

  • Method: Researchers posted 1,061 AI-generated replies to new discussions (November 2024 to March 2025) and analyzed the 478 that survived moderator deletions. Each post was randomly assigned to one of three treatments: Generic AI, Personalized AI (tailored to the poster’s inferred demographics), or Community-Aligned AI (fine-tuned on past winning arguments). All content was human-screened before release; the study was preregistered and cleared by the University of Zurich ethics board.

  • Reference: “Can AI Change Your View? Evidence from a Large-Scale Online Field Experiment.” Unpublished working paper, 18 Apr. 2025. https://regmedia.co.uk/2025/04/29/supplied_can_ai_change_your_view.pdf

Productivity

2025-05 | AI cuts task time by two-thirds

  • Findings: Generative AI users report completing workplace tasks in 30 minutes on average, down from 90 minutes without AI, a 3X speedup. The most dramatic gains occur in cognitively demanding work, with four- to five-fold time reductions in tasks like scientific research, negotiation, and reputation management. Daily use of generative AI has grown rapidly (from 30% to 43% of US workers in four months), but most engagement is light and episodic: under 15 hours a week, often in short “copilot” bursts. Productivity gains follow a U-shaped pattern by income: the biggest time savings appear among workers earning either less than $35K or more than $100K. Notably, 84% of users say generative AI helps them do their own work faster or better, while only 16% fully delegate tasks.

  • Method: Nationally representative two-wave survey of 4,278 US workers, fielded by IncQuery in December 2024 and again in March/April 2025. Responses were weighted to match Current Population Survey benchmarks by age, gender, race, education, occupation, and industry. The survey captured both adoption rates and self-reported productivity impacts across task categories. (A toy version of the benchmark-weighting step follows this entry.)

  • Reference: Hartley, Jonathan S., Filip Jolevski, Victor Melo, and Bennett Moore. “The Labor Market Effects of Generative Artificial Intelligence.” SSRN Working Paper No. 5136877, May 2025. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5136877
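
Weighting to Current Population Survey benchmarks means re-weighting respondents so the weighted sample matches known population shares. Real surveys rake over several variables at once; the sketch below shows the one-variable version of the idea with invented shares.

```python
# Post-stratification weights: population share divided by sample share, per cell.
import pandas as pd

sample = pd.DataFrame({"educ": ["hs", "hs", "college", "college", "college"],
                       "uses_ai": [0, 1, 1, 1, 0]})
population_share = {"hs": 0.55, "college": 0.45}  # hypothetical benchmark shares

sample_share = sample["educ"].value_counts(normalize=True)
sample["weight"] = sample["educ"].map(lambda e: population_share[e] / sample_share[e])

raw = sample["uses_ai"].mean()
weighted = (sample["uses_ai"] * sample["weight"]).sum() / sample["weight"].sum()
print(f"Unweighted adoption: {raw:.0%}, weighted: {weighted:.0%}")
```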

2023-11 | GPT-4 boosts call-center productivity

  • Findings: A GPT-4–powered chat-assist tool raised agent productivity by about 14%, measured as issues resolved per hour, with novice and lower-skill agents jumping 34%—effectively erasing several months of on-the-job learning gaps. Efficiency gains came from faster chats: average handle time fell roughly 12% (≈5 minutes on a 40-minute chat) while agents could juggle more simultaneous conversations. Service quality held steady or improved. Text-based sentiment analysis shows customer tone became significantly more positive (≈0.18 points—about half a standard deviation) even as Net Promoter Scores stayed flat, indicating speed gains were not bought at the cost of satisfaction. The tool particularly benefited new hires: their performance with AI matched peers who had six-plus months of experience and their attrition dropped roughly 40%.

  • Method: Researchers analyzed a staggered rollout of the real-time GPT-4 assistant to 5,179 chat agents at a Fortune 500 software company in the Philippines and the US. Difference-in-differences models with agent fixed effects compared treated and untreated agents over time, controlling for shift, product line, and tenure. Productivity (resolutions per hour), quality (resolution rate, NPS, sentiment), and labor outcomes (retention) were drawn from the firm’s systems and customer surveys over the nine-month deployment.

  • Reference: Brynjolfsson, Erik, Danielle Li, and Lindsey R. Raymond. “Generative AI at Work: Evidence from a Multisite Call-Center Field Experiment.” NBER Working Paper 31161, National Bureau of Economic Research, 2023. https://www.nber.org/system/files/working_papers/w31161/w31161.pdf

2023-09 | GPT-4 helps consultants do more, faster, better

  • Findings: In a live test with 758 Boston Consulting Group consultants, those using GPT-4 finished 12% more “inside-frontier” consulting tasks, worked 25% faster, and delivered work that external graders rated 40% higher in quality than a control group. Performance lifts were steepest for below-average performers (quality increased 43%), showing AI’s ability to level up the lower quartiles. A stress-test on a task deliberately beyond GPT-4’s capabilities (“outside-frontier”) exposed a 19-percentage-point drop in accuracy—underscoring that knowing when not to lean on AI matters.

  • Method: Researchers ran a preregistered field experiment inside BCG. Consultants were randomly assigned to work on 18 realistic “inside-frontier” tasks and one “outside-frontier” case in one of three conditions: no AI, GPT-4 only, or GPT-4 plus prompt-engineering tips. Outputs were blind-graded for quality; the platform captured completion rates and time-on-task, while pre-task assessments let the team track gains across skill levels.

  • Reference: Dell’Acqua, Fabrizio, et al. “Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of AI on Knowledge Worker Productivity and Quality.” Harvard Business School Working Paper, no. 24-013, 22 Sept. 2023. SSRN, https://dx.doi.org/10.2139/ssrn.4573321.

Science

2025-04 | AIs beat experts at virology troubleshooting

  • Findings: On the new Virology Capabilities Test (VCT)—322 multimodal questions written and vetted by PhD virologists—OpenAI’s o3 model hit 43.8% accuracy, outscoring 94% of human experts (experts averaged 22.1%). Four other frontier models (o4-mini, o1, Gemini 2.5 Pro, Claude 3.5) also surpassed the median virologist. The study warns that public-facing models now deliver expert-level guidance on dual-use lab methods, heightening biosecurity risk.

  • Method: Sixty-eight virologists created and peer-reviewed the VCT; 36 additional experts provided a baseline by answering only questions in their own specialties. Models took the test zero-shot in a multiple-response format; performance was compared to human baselines and analysed across text-only and image-dependent items.

  • Reference: Götting, Jasper, et al. “Virology Capabilities Test (VCT): A Multimodal Virology Q&A Benchmark.” arXiv, 29 Apr 2025. https://doi.org/10.48550/arXiv.2504.16137.

Simulation

2024-11 | AI simulations mirror real people

  • Findings: Researchers created AI “simulations” of 1,052 Americans by feeding GPT-4 detailed transcripts from two-hour interviews, then ran tests to compare the simulations and source humans. Results were striking: the AI agents reproduced human survey responses 85% as reliably as humans reproduced their own answers two weeks later, matched 80% of their personality traits, and made the same choices in economic games two-thirds of the time. This offers compelling evidence that with the right data, AI can be a powerful tool for social science and market research.

  • Method: The team ran AI voice interviews with a nationally representative sample, then used the full transcripts—plus expert summaries—as inputs to GPT-4. They tested how well the AI matched humans across 177 survey questions (from the General Social Survey), a standard personality test (the Big Five), five economic decision-making games, and five validated psychology experiments. They compared performance against agents built only from basic demographics or short self-written bios, and assessed bias using standard fairness metrics. (The normalized-accuracy calculation is illustrated after this entry.)

  • Reference: Park, Joon S., et al. “Generative Agent Simulations of 1,000 People.” arXiv, 15 Nov. 2024. https://doi.org/10.48550/arXiv.2411.10109.
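
The headline number is a normalized accuracy: how often the generative agent matches a person’s survey answers, divided by how often that person matches their own answers when re-asked two weeks later. A tiny worked example with made-up response vectors:

```python
# Normalized accuracy: agent-vs-human agreement divided by the human's own test-retest agreement.
import numpy as np

human_wave1 = np.array([1, 3, 2, 4, 1, 5, 2, 3])  # a person's survey answers (invented)
human_wave2 = np.array([1, 3, 2, 4, 2, 5, 2, 3])  # same person, two weeks later
agent       = np.array([1, 3, 2, 3, 1, 5, 3, 3])  # the generative agent's answers

raw_accuracy = np.mean(agent == human_wave1)            # agent vs. the original answers
self_consistency = np.mean(human_wave2 == human_wave1)  # person's own replication rate
print(f"Normalized accuracy: {raw_accuracy / self_consistency:.0%}")
```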

Writing

2024-11 | AI poetry is indistinguishable from human

  • Findings: In two online experiments, non-expert readers failed to tell GPT-4 poems from verses by Sylvia Plath, Walt Whitman, T.S. Eliot and seven other renowned poets. Identification accuracy was 46%—worse than chance—and participants actually tagged AI poems as “human-written” more often than the genuine works. When rating quality, rhythm, imagery and 11 other traits, readers scored the AI poems higher on 13 of 14 dimensions; only “originality” was a tie. The pattern flips if people are told a poem is AI-made—scores then drop—signaling a perception bias rather than a real quality gap.

  • Method: Researchers assembled 100 poems: 50 originals from ten famous English-language poets and 50 fresh GPT-4 compositions generated with a simple “write a short poem in the style of <poet>” prompt. In Study 1 (n=1,634), participants guessed authorship for ten poems each. In Study 2 (n=696), a new sample rated ten poems on 14 qualitative scales under three framing conditions: “told human,” “told AI,” or no authorship clue. Mixed-effects models tested recognition accuracy and rating differences. (A quick check of the below-chance accuracy claim is sketched after this entry.)

  • Reference: Porter, Brian, and Edouard Machery. “AI-Generated Poetry Is Indistinguishable from Human-Written Poetry and Is Rated More Favorably.” Scientific Reports, vol. 14, article 26133, 2024. https://www.nature.com/articles/s41598-024-76900-1.
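
The “worse than chance” claim boils down to a proportion test against 50%. The study used mixed-effects models; the quick check below uses a plain binomial test with an illustrative guess count.

```python
# Is ~46% identification accuracy reliably below the 50% chance level?
from scipy.stats import binomtest

n_guesses = 16340                # illustrative: 1,634 readers x 10 poems each
n_correct = int(0.46 * n_guesses)
result = binomtest(n_correct, n_guesses, p=0.5, alternative="less")
print(result.pvalue)             # tiny p-value -> accuracy genuinely below chance
```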


Note: I used AI extensively above to help summarize the papers and create the references.