Key Research on AI Capabilities and Impact

Last updated June 30, 2025

Dozens of AI-related papers get released daily through journals, arXiv, and labs. It’s hard to keep up. But every so often, significant research drops that shapes how I think about AI, and that I keep on file for reference. This page collects those findings, organized alphabetically by category and, within each category, from newest to oldest.

Coding

2025-03 | Coding task length doubles every seven months

  • Findings: Researchers proposed the “50% task-completion time horizon”—the human clock-time of tasks an AI agent finishes half the time. Claude 3.7 Sonnet already tackles ~50-minute jobs, beating GPT-4 and 94% of expert programmers. Horizon length doubled every seven months from 2019–2025, pointing to month-long software projects before 2030. Progress stems from sharper reasoning, richer tool use, and quicker error recovery, but raises autonomy and bio-risk concerns.

    • Update (April 2025): Fresh benchmarking with newer models shows the horizon now doubles roughly every four months, so the paper’s original seven-month estimate may understate current velocity.

  • Method: The team logged 2,529 hours of human work across 170 software-engineering, ML-research, and cybersecurity tasks (HCAST, RE-Bench, SWAA). Thirteen models spanning GPT-2 to o1 ran in identical agent scaffolds; logistic curves linked success to human task time, and hierarchical bootstraps traced trends. (A stripped-down version of the time-horizon fit is sketched after this entry.)

  • Reference: Kwa, Thomas, et al. “Measuring AI Ability to Complete Long Tasks.” arXiv, 30 Mar 2025. https://doi.org/10.48550/arXiv.2503.14499.
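
To make the time-horizon metric concrete, here is a minimal sketch of the core calculation (my own toy example, not the paper’s code): fit a logistic curve of agent success against the log of human task length, then read off the length at which predicted success crosses 50%. The per-task data below are invented; only the seven-month doubling time comes from the paper.

```python
# Toy "50% time horizon" estimate on invented data.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical tasks: how long each takes a human (minutes) and whether the agent succeeded.
human_minutes = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480])
agent_success = np.array([1, 1, 1, 1, 1, 1, 0, 1, 0, 0])

X = np.log2(human_minutes).reshape(-1, 1)
fit = LogisticRegression(C=10.0).fit(X, agent_success)

# The 50% horizon is the task length where intercept + coef * log2(minutes) = 0.
log2_horizon = -fit.intercept_[0] / fit.coef_[0, 0]
horizon = 2 ** log2_horizon
print(f"Estimated 50% time horizon: {horizon:.0f} human-minutes")

# Extrapolating the trend: with doubling time d (months), the horizon after m months grows by 2^(m/d).
d, m = 7, 24
print(f"Projected horizon in {m} months: {horizon * 2 ** (m / d):.0f} human-minutes")
```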

2025-02 | GitHub Copilot increases coding output 26%

  • Findings: Across three randomized trials at Microsoft, Accenture, and a Fortune 100 firm, software developers who used GitHub Copilot completed 26.08% more tasks per week than their peers. Gains were strongest for newer, less senior, and previously less productive developers—who also adopted the tool faster and stuck with it longer. More experienced developers saw smaller or no productivity benefits. Output quality did not decline, suggesting Copilot's speed boost did not come at the cost of reliability.

  • Method: In total, 4,867 developers participated in field experiments lasting 2–8 months. Treatment groups received access to Copilot; control groups did not, with rollout later extended to all participants. Developer productivity was measured using GitHub activity: pull requests, commits, builds, and build success rates. Imperfect compliance was handled using instrumental variable regression, weighted by treatment-control adoption gaps (a minimal two-stage version is sketched after this entry).

  • Reference: Cui, Kevin Zheyuan, et al. “The Effects of Generative AI on High-Skilled Work: Evidence from Three Field Experiments with Software Developers.” SSRN, Feb. 2025. https://dx.doi.org/10.2139/ssrn.4945566.
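
The instrumental-variable step is easier to see in code. Below is a minimal two-stage least squares sketch on simulated data (my own setup, not the authors’ specification): random assignment to Copilot access serves as the instrument for actual adoption, and the second stage recovers the effect of adoption on weekly output.

```python
# Toy two-stage least squares for imperfect compliance, on simulated data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 4000
assigned = rng.integers(0, 2, n)                            # randomized access to the tool
adopted = (assigned & (rng.random(n) < 0.7)).astype(float)  # imperfect take-up
weekly_output = 5 + 1.3 * adopted + rng.normal(0, 2, n)     # true adoption effect = 1.3

# Stage 1: predict adoption from the random assignment (the instrument).
stage1 = sm.OLS(adopted, sm.add_constant(assigned)).fit()
adopted_hat = stage1.predict(sm.add_constant(assigned))

# Stage 2: regress the outcome on predicted adoption.
stage2 = sm.OLS(weekly_output, sm.add_constant(adopted_hat)).fit()
print(stage2.params)  # slope should land near the true effect of 1.3
# A real analysis would correct the second-stage standard errors or use a dedicated IV routine.
```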

2023-02 | GitHub Copilot cuts coding time 56%

  • Findings: In a single-task test, professional freelance developers with GitHub Copilot finished an HTTP-server implementation in ≈71 min on average versus ≈161 min for the control group, a 55.8% speed-up (the arithmetic is worked out after this entry). The share of participants whose code passed all 12 unit tests was 7 pp higher for Copilot users, although the difference was not statistically significant. Benefits were largest for less-experienced and older programmers.

  • Method: 95 programmers recruited on Upwork were randomly assigned to treatment (Copilot, n=45) or control (n=50). Each wrote a JavaScript HTTP server; GitHub Classroom logged time-to-first-passing-submission and unit-test results automatically. Demographics and self-reported perceptions were collected in entry/exit surveys.

  • Reference: Peng, Sida, et al. “The Impact of AI on Developer Productivity: Evidence from GitHub Copilot.” arXiv, 13 Feb. 2023, https://doi.org/10.48550/arXiv.2302.06590.
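
As a quick sanity check on the headline number, the speed-up is just the relative drop in mean completion time; the rounded means reported above reproduce it almost exactly.

```python
# Percent speed-up from the (rounded) mean completion times reported above.
copilot_min, control_min = 71, 161
reduction = 1 - copilot_min / control_min
print(f"Time reduction: {reduction:.1%}; control took {control_min / copilot_min:.2f}x as long")
```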

Creativity

2024-02 | GPT-4 beats humans on creativity tests

  • Findings: In side-by-side trials on three staple divergent-thinking assessments—the Alternate Uses, Consequences, and Divergent Associations tasks—GPT-4 beat 151 human participants on every creativity metric. Even when the number of ideas was held constant, the model’s responses were markedly more original and up to seven times more elaborated, underscoring its capacity to deliver fresher, richer ideas than the average person.

  • Method: Researchers paired 151 GPT-4 chat sessions with 151 U.S. adults recruited via Prolific. Both groups completed the same tasks; GPT-4 was told to generate exactly as many ideas as its human counterpart for each prompt. Outputs were scored blind with the Open Creativity Scoring tool (GloVe semantic-distance metrics) and word-count analysis, then compared with standard statistical tests, confirming the AI’s clear advantage across all three tasks. (A toy version of semantic-distance scoring follows this entry.)

  • Reference: Hubert, Kent F., Kim N. Awa, and Darya L. Zabelina. “The Current State of Artificial Intelligence Generative Language Models Is More Creative Than Humans on Divergent Thinking Tasks.” Scientific Reports, vol. 14, article 3440, 10 Feb. 2024. https://doi.org/10.1038/s41598-024-53303-w.
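
Semantic-distance scoring is worth a small illustration: an idea counts as more original the farther its word vectors sit from the prompt. The sketch below uses pretrained GloVe vectors via gensim and invented example responses; it is a rough approximation of the idea, not the study’s scoring pipeline.

```python
# Rough semantic-distance "originality" score using pretrained GloVe word vectors.
import numpy as np
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-50")  # downloads a small pretrained model on first run

def originality(prompt_word, response_words):
    """Mean cosine distance between the prompt word and each response word."""
    p = glove[prompt_word]
    dists = []
    for w in response_words:
        v = glove[w]
        cosine = np.dot(p, v) / (np.linalg.norm(p) * np.linalg.norm(v))
        dists.append(1 - cosine)
    return float(np.mean(dists))

# Alternate Uses-style prompt: uses for a "brick".
print(originality("brick", ["wall", "house"]))    # obvious uses -> typically lower distance
print(originality("brick", ["music", "poetry"]))  # remote associations -> typically higher distance
```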

Education

2025-06 | AI tutor delivers 3.2 years of learning in six weeks

  • Findings: A World Bank-backed study found that Microsoft’s Copilot (powered by GPT-4) significantly improved learning among high school students in Nigeria. Over just six weeks, students who used the AI tutor scored higher on English tests and maintained their gains on regular school exams. The improvement was equivalent to moving an average student into the top third of the class. Girls and stronger students saw the biggest benefit, but everyone gained with more time spent using the tool. This was one of the first rigorous tests of generative AI in classrooms in the Global South. It showed that AI tutors can drive real academic progress even in low-resource settings—and do so at remarkably low cost. For every $100 spent, the program delivered the equivalent of 3.2 years of traditional learning. That places it among the most cost-effective education interventions ever recorded.

  • Method: Researchers ran a randomized controlled trial across nine public schools in Benin City. Students were randomly assigned to either attend AI-assisted after-school sessions or stick with traditional instruction. Those in the AI group met in computer labs twice a week, working in pairs while teachers offered support. The team compared outcomes across tests, controlled for starting scores, and verified the results through multiple statistical checks.

  • Reference: De Simone, Martín, et al. From Chalkboards to Chatbots: Evaluating the Impact of Generative AI on Learning Outcomes in Nigeria. World Bank Policy Research Working Paper 11125, May 2025. https://documents.worldbank.org/en/publication/documents-reports/documentdetail/099548105192529324.

2025-05 | ChatGPT significantly boosts learning outcomes

  • Findings: Across 51 classroom experiments, students who used ChatGPT outperformed peers by almost a full standard deviation (Hedges’ g=0.867), roughly the difference between the 80th and 50th percentile. Smaller but still meaningful lifts appeared in learners’ self-reported confidence (g=0.456) and higher-order thinking skills (g=0.457). The biggest academic gains showed up in problem-based courses and in programs that ran the AI for 4–8 weeks.

  • Method: The authors conducted a random-effects meta-analysis of 51 experimental and quasi-experimental studies (November 2022 to February 2025). They compared ChatGPT-assisted groups with traditional instruction, tested for publication bias, and ran moderator tests on grade level, course type, pedagogy, duration, ChatGPT’s role (tutor, partner, tool), and subject domain. Results remained robust after all bias checks. (The pooling logic and the percentile interpretation are sketched after this entry.)

  • Reference: Wang, Jin, and Wenxiang Fan. “The Effect of ChatGPT on Students’ Learning Performance, Learning Perception, and Higher-Order Thinking: Insights from a Meta-Analysis.” Humanities and Social Sciences Communications, vol. 12, article 621, 2025. https://doi.org/10.1057/s41599-025-04787-y.
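
Two pieces of the statistics here are easy to make concrete: converting a standardized effect into a percentile (which is where the “80th vs. 50th percentile” framing comes from), and pooling study-level effects with inverse-variance weights that include a between-study variance term. A compact sketch of both, with invented study data:

```python
# Percentile reading of a standardized effect, plus a DerSimonian-Laird random-effects
# pool over invented study-level effects (not the meta-analysis' actual data).
import numpy as np
from scipy.stats import norm

print(f"g = 0.867 puts the average treated student above {norm.cdf(0.867):.0%} of the comparison group")

g = np.array([0.4, 0.9, 1.2, 0.6, 0.8])       # hypothetical per-study Hedges' g values
v = np.array([0.05, 0.04, 0.10, 0.06, 0.03])  # hypothetical sampling variances

w_fe = 1 / v
g_fe = np.sum(w_fe * g) / np.sum(w_fe)               # fixed-effect pooled mean
q = np.sum(w_fe * (g - g_fe) ** 2)                   # heterogeneity statistic
c = np.sum(w_fe) - np.sum(w_fe ** 2) / np.sum(w_fe)
tau2 = max(0.0, (q - (len(g) - 1)) / c)              # between-study variance estimate

w = 1 / (v + tau2)                                   # random-effects weights
pooled = np.sum(w * g) / np.sum(w)
se = np.sqrt(1 / np.sum(w))
print(f"Pooled g = {pooled:.3f} (95% CI {pooled - 1.96 * se:.3f} to {pooled + 1.96 * se:.3f})")
```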

Emotional intelligence

2025-05 | LLMs beat human emotional intelligence scores

  • Findings: Large language models including GPT-4, o1, and others scored on average 25 percentage points higher than humans across five standardized emotional intelligence tests. These tests—used to assess things like leadership, HR, and counseling aptitude—measure how well someone understands emotions, manages them, and selects effective responses. The results show that LLMs now reason through emotional scenarios better than most people, with implications for customer service, coaching, hiring, mental health support, and other areas where high emotional intelligence is critical to success.

  • Method: The researchers administered five validated emotional intelligence assessments—STEM, STEU, GEMOK-Blends, GECo-Reg, and GECo-Mgmt—to six LLMs, each tested 10 times, and compared results to human baselines from the tests’ original development samples. Human scores averaged 56%; AIs, 81%.

  • Reference: Schlegel, Katja, et al. “Large Language Models Are Proficient in Solving and Creating Emotional Intelligence Tests.” Communications Psychology, vol. 3, 2025, article 80, Nature Portfolio, https://doi.org/10.1038/s44271-025-00258-x.

Employment

2023-07 | ChatGPT reduced freelancer gigs and pay

  • Findings: After ChatGPT’s public launch (November 2022), writing-focused freelancers on Upwork saw a rapid 2% drop in monthly jobs and a 5% fall in earnings, with both the likelihood of landing any job and the number of jobs per active worker sliding. Similar or larger declines hit designers after the release of image-generating models such as DALL-E 2. Notably, top-rated and higher-earning freelancers were not insulated; if anything, the losses skewed toward them, hinting that generative AI can compress skill premiums.

  • Method: The researchers scraped 92,547 single-occupation Upwork profiles and tracked monthly job counts and income from January 2022 to April 2023. Using a difference-in-differences design, they compared “treated” occupations (writing, editing, proofreading; later, visual design) with less-affected roles, controlling for freelancer and time fixed effects and clustering errors by occupation. Robustness checks—event studies, matching, wild bootstrap errors, and a parallel analysis of image-generation tools—reinforced the core result that generative AI substitutes rather than complements knowledge workers in the short run. (A toy version of the difference-in-differences setup follows this entry.)

  • Reference: Hui, Xiang, Oren Reshef, and Luofeng Zhou. “The Short-Term Effects of Generative Artificial Intelligence on Employment: Evidence from an Online Labor Market.” SSRN, 31 July 2023, https://dx.doi.org/10.2139/ssrn.4527336.
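
The difference-in-differences logic compares how outcomes changed for exposed occupations versus less-exposed ones after ChatGPT launched. Here is a minimal two-way fixed-effects version on simulated panel data (my own toy example, not the authors’ specification):

```python
# Toy difference-in-differences with worker and month fixed effects, on simulated data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_workers, n_months = 200, 16
df = pd.DataFrame([
    {"worker": w, "month": t,
     "treated": int(w < n_workers // 2),  # writing-type occupations
     "post": int(t >= 10)}                # months after ChatGPT's launch
    for w in range(n_workers) for t in range(n_months)
])
# Outcome: jobs per month, with a true post-launch drop of 0.4 for treated workers.
df["jobs"] = (5 + 0.5 * df["treated"] - 0.4 * df["treated"] * df["post"]
              + rng.normal(0, 1, len(df)))

did = smf.ols("jobs ~ treated:post + C(worker) + C(month)", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["worker"]})
print(did.params["treated:post"])  # DiD estimate; should land near -0.4
```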

Healthcare

2025-06 | AI outdiagnoses doctors with lower test costs

  • Findings: Microsoft’s “MAI-DxO” agent framework achieved 85.5% diagnostic accuracy and significantly reduced testing costs using a new benchmark based on 304 real-world NEJM clinical cases. Even the standalone OpenAI o3 model outperformed experienced physicians, suggesting that current frontier models already surpass human performance in complex diagnostic reasoning. Cost savings were substantial: MAI-DxO cut test costs by ~70% compared to o3 alone, and ~20% compared to human doctors.

  • Method: Researchers created the Sequential Diagnosis Benchmark (SDBench) by turning NEJM Clinicopathological Conference case studies into interactive simulations. AI models—and humans—had to query patients, order tests, and provide a final diagnosis, with each test incurring a virtual cost. Performance was evaluated on historical cases and a hold-out set of 56 new cases published after the models’ known training cut-offs to reduce contamination risk. (A skeleton of the cost-tracking episode loop follows this entry.)

  • Reference: Nori, Harsha, et al. “Sequential Diagnosis with Language Models.” arXiv, 27 June 2025. https://doi.org/10.48550/arXiv.2506.22405.
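
To make the benchmark’s mechanics concrete, each case plays out as an interactive episode: the agent asks questions or orders tests, every test adds to a running virtual bill, and the episode ends with a diagnosis scored against the case’s ground truth. The skeleton below is entirely hypothetical (the names, prices, and toy scoring rule are mine, not Microsoft’s implementation):

```python
# Skeleton of a sequential-diagnosis episode with a running virtual test budget.
# All names, prices, and the toy scoring rule are hypothetical.
from dataclasses import dataclass, field

TEST_PRICES = {"cbc": 20, "chest_ct": 400, "biopsy": 1200}  # made-up costs

@dataclass
class Episode:
    ground_truth: str
    findings: dict                       # test name -> simulated result text
    spent: int = 0
    log: list = field(default_factory=list)

    def order(self, test: str) -> str:
        self.spent += TEST_PRICES[test]
        result = self.findings.get(test, "unremarkable")
        self.log.append((test, result, self.spent))
        return result

    def finalize(self, diagnosis: str) -> dict:
        return {"correct": diagnosis.lower() == self.ground_truth.lower(),
                "total_cost": self.spent, "steps": len(self.log)}

ep = Episode(ground_truth="lymphoma", findings={"biopsy": "atypical lymphoid cells"})
ep.order("cbc")
ep.order("biopsy")
print(ep.finalize("Lymphoma"))  # {'correct': True, 'total_cost': 1220, 'steps': 2}
```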

2025-06 | AI does 12 years of medical review in 2 days

  • Findings: A fully automated LLM workflow reproduced and updated 12 Cochrane Reviews—work typically requiring 12 person-years—in under 48 hours. The system outperformed expert humans in both screening (96.7% sensitivity vs. 81.7%) and data extraction (93.1% accuracy vs. 79.7%), with a blinded panel siding with the AI over original Cochrane extractions in 69% of disagreements. Contrary to a common concern, most AI “errors” stemmed from inaccessible data, not hallucination.

  • Method: The researchers built a multi-agent pipeline (otto-SR) combining OpenAI’s GPT-4.1 for abstract/full-text screening and o3-mini-high for data extraction. It processed 146,276 citations across 12 reviews from the April 2024 issue of the Cochrane Database. Meta-analyses replicated original results, with newly included studies altering statistical conclusions in several cases. Performance was benchmarked against dual human reviewers and the commercial tool Elicit. (The sensitivity and accuracy calculations are illustrated after this entry.)

  • Reference: Cao, Christian, et al. “Automation of Systematic Reviews with Large Language Models.” medRxiv, 13 June 2025, https://doi.org/10.1101/2025.06.13.25329541.
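
The two headline metrics are simple to compute: screening sensitivity is the share of truly eligible studies the screener kept, and extraction accuracy is the share of data fields pulled correctly. A small worked example on invented labels:

```python
# Sensitivity of study screening and accuracy of data extraction, on invented labels.
import numpy as np

truly_include = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 0])  # gold-standard inclusion decisions
screener_says = np.array([1, 1, 1, 0, 0, 1, 0, 0, 1, 0])  # screener's decisions

tp = np.sum((truly_include == 1) & (screener_says == 1))
fn = np.sum((truly_include == 1) & (screener_says == 0))
print(f"Sensitivity: {tp / (tp + fn):.1%}")                 # 4 of 5 eligible kept -> 80.0%

fields_correct = np.array([1, 1, 1, 0, 1, 1, 1, 1])         # extracted fields vs. gold standard
print(f"Extraction accuracy: {fields_correct.mean():.1%}")  # 87.5%
```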

2024-10 | GPT-4 gives accurate diagnoses doctors ignore

  • Findings: In a single-blind trial, GPT-4 on its own scored a 92% diagnostic-reasoning grade, beating board-certified internists by 16 percentage points. Yet when those same physicians could see the chatbot’s ranked differentials, their own scores ticked up by just 2 pp—a statistically insignificant lift. In practice, doctors often stuck with their initial hunches even when the AI’s answer was demonstrably better, leaving most of the accuracy gain on the table.

  • Method: Fifty attendings and residents from three US academic centers were randomized to work through up to six real-case vignettes with either conventional references or conventional + GPT-4. A structured-reflection rubric captured differential breadth, supporting/contradictory evidence, next steps, and final diagnosis. Reviewers blinded to group assignment scored 244 physician case submissions; a separate arm ran GPT-4 alone on the same cases for comparison.

  • Reference: Goh, Ethan, et al. “Large Language Model Influence on Diagnostic Reasoning: A Randomized Clinical Trial.” JAMA Network Open, vol. 7, no. 10, 28 Oct. 2024, e2440969. https://dx.doi.org/10.1001/jamanetworkopen.2024.40969.

Innovation

2025-03 | GPT-4 lets one person out-innovate two

  • Findings: In a live test with 776 Procter & Gamble professionals, solo employees who used GPT-4 generated ideas 0.37 SD higher in quality than solo peers without AI—on par with two-person teams working unaided. When a pair used GPT-4, their odds of hitting the top-10% of ideas doubled versus human-only teams. AI users also worked 12–16% faster, wrote more detailed proposals, reported more positive emotions, and produced concepts that mixed R&D and commercial perspectives instead of staying in their functional lanes.

  • Method: The authors ran a preregistered, 2 × 2 field experiment inside P&G’s product-innovation “hackathon.” Participants were randomly assigned to one of four cells—solo or pair, with or without GPT-4 access—and tasked with real business challenges. Blind expert judges scored 550 final submissions on quality, novelty, feasibility and business impact; time-on-task, word count, sentiment shifts and AI-retention metrics provided additional data.

  • Reference: Dell’Acqua, Fabrizio, et al. “The Cybernetic Teammate: A Field Experiment on Generative AI Reshaping Teamwork and Expertise.” Harvard Business School Working Paper, no. 25-043, 28 Mar. 2025. SSRN, https://dx.doi.org/10.2139/ssrn.5188231.

Persuasion

2025-05 | AI outpersuades paid human persuaders

  • Findings: In a high-stakes experiment, Claude 3.5 Sonnet convinced people to change their minds more often than real humans who were paid to persuade. The AI boosted correct answers more than its human counterparts—and misled more effectively when asked to deceive. People followed the AI’s advice nearly 8 percentage points more often. This confirms AI’s growing power to sway opinion, for better or worse.

  • Method: Researchers ran a live-chat quiz with 1,242 US participants. In each round, one person answered a multiple-choice question while another—either a human or Claude 3.5—tried to influence their answer. Both sides had cash incentives. The experiment covered trivia, common misinformation, and near-term forecasts, testing both helpful and deceptive nudges. In total, the team analyzed over 12,000 messages.

  • Reference: Schoenegger, Philipp, et al. “Large Language Models Are More Persuasive Than Incentivized Human Persuaders.” arXiv, 14 May 2025, https://doi.org/10.48550/arXiv.2505.09662.

2025-04 | AI chatbots out-argue 98% of Reddit debaters

  • Findings: In a live test on r/ChangeMyView, AI-powered accounts persuaded the original poster in 17–18% of debates, five to six times the human success rate (~3%) and enough to place in the 98th–99th percentile of all users, including the subreddit’s top experts. No Redditor flagged the comments as machine-written, highlighting the ease with which AI can pass for, and outperform, human persuaders. (The working paper has since been withdrawn amid ethical debate.)

  • Method: Researchers posted 1,061 AI-generated replies to new discussions (November 2024 to March 2025) and analyzed the 478 that survived moderator deletions. Each post was randomly assigned to one of three treatments: Generic AI, Personalized AI (tailored to the poster’s inferred demographics), or Community-Aligned AI (fine-tuned on past winning arguments). All content was human-screened before release; the study was preregistered and cleared by the University of Zurich ethics board.

  • Reference: “Can AI Change Your View? Evidence from a Large-Scale Online Field Experiment.” Unpublished working paper, 18 Apr. 2025. https://regmedia.co.uk/2025/04/29/supplied_can_ai_change_your_view.pdf

Productivity

2025-05 | AI cuts task time by two-thirds

  • Findings: Generative AI users report completing workplace tasks in 30 minutes on average, down from 90 minutes without AI, a 3X speedup. The most dramatic gains occur in cognitively demanding work, with four- to five-fold time reductions in tasks like scientific research, negotiation, and reputation management. Daily use of generative AI has grown rapidly (from 30% to 43% of US workers in four months), but most engagement is light and episodic: under 15 hours a week, often in short “copilot” bursts. Productivity gains follow a U-shaped pattern by income: the biggest time savings appear among workers earning either less than $35K or more than $100K. Notably, 84% of users say generative AI helps them do their own work faster or better, while only 16% fully delegate tasks.

  • Method: Nationally representative two-wave survey of 4,278 US workers, fielded by IncQuery in December 2024 and again in March/April 2025. Responses were weighted to match Current Population Survey benchmarks by age, gender, race, education, occupation, and industry. The survey captured both adoption rates and self-reported productivity impacts across task categories. (A toy version of the benchmark-weighting step follows this entry.)

  • Reference: Hartley, Jonathan S., Filip Jolevski, Victor Melo, and Bennett Moore. “The Labor Market Effects of Generative Artificial Intelligence.” SSRN Working Paper No. 5136877, May 2025. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5136877
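
Weighting to Current Population Survey benchmarks means re-weighting respondents so the weighted sample matches known population shares. Real surveys rake over several variables at once; the sketch below shows the one-variable version of the idea with invented shares.

```python
# Post-stratification weights: population share divided by sample share, per cell.
import pandas as pd

sample = pd.DataFrame({"educ": ["hs", "hs", "college", "college", "college"],
                       "uses_ai": [0, 1, 1, 1, 0]})
population_share = {"hs": 0.55, "college": 0.45}  # hypothetical benchmark shares

sample_share = sample["educ"].value_counts(normalize=True)
sample["weight"] = sample["educ"].map(lambda e: population_share[e] / sample_share[e])

raw = sample["uses_ai"].mean()
weighted = (sample["uses_ai"] * sample["weight"]).sum() / sample["weight"].sum()
print(f"Unweighted adoption: {raw:.0%}, weighted: {weighted:.0%}")
```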

2023-11 | GPT-4 boosts call-center productivity

  • Findings: A GPT-4–powered chat-assist tool raised agent productivity by about 14%, measured as issues resolved per hour, with novice and lower-skill agents jumping 34%—effectively erasing several months of on-the-job learning gaps. Efficiency gains came from faster chats: average handle time fell roughly 12% (≈5 minutes on a 40-minute chat) while agents could juggle more simultaneous conversations. Service quality held steady or improved. Text-based sentiment analysis shows customer tone became significantly more positive (≈0.18 points—about half a standard deviation) even as Net Promoter Scores stayed flat, indicating speed gains were not bought at the cost of satisfaction. The tool particularly benefited new hires: their performance with AI matched peers who had six-plus months of experience and their attrition dropped roughly 40%.

  • Method: Researchers analyzed a staggered rollout of the real-time GPT-4 assistant to 5,179 chat agents at a Fortune 500 software company in the Philippines and the US. Difference-in-differences models with agent fixed effects compared treated and untreated agents over time, controlling for shift, product line, and tenure. Productivity (resolutions per hour), quality (resolution rate, NPS, sentiment), and labor outcomes (retention) were drawn from the firm’s systems and customer surveys over the nine-month deployment.

  • Reference: Brynjolfsson, Erik, Danielle Li, and Lindsey R. Raymond. “Generative AI at Work: Evidence from a Multisite Call-Center Field Experiment.” NBER Working Paper 31161, National Bureau of Economic Research, 2023. https://www.nber.org/system/files/working_papers/w31161/w31161.pdf

2023-09 | GPT-4 helps consultants do more, faster, better

  • Findings: In a live test with 758 Boston Consulting Group consultants, those using GPT-4 finished 12% more “inside-frontier” consulting tasks, worked 25% faster, and delivered work that external graders rated 40% higher in quality than a control group. Performance lifts were steepest for below-average performers (quality increased 43%), showing AI’s ability to level up the lower quartiles. A stress-test on a task deliberately beyond GPT-4’s capabilities (“outside-frontier”) exposed a 19-percentage-point drop in accuracy—underscoring that knowing when not to lean on AI matters.

  • Method: Researchers ran a preregistered field experiment inside BCG. Consultants were randomly assigned to work on 18 realistic “inside-frontier” tasks and one “outside-frontier” case in one of three conditions: no AI, GPT-4 only, or GPT-4 plus prompt-engineering tips. Outputs were blind-graded for quality; the platform captured completion rates and time-on-task, while pre-task assessments let the team track gains across skill levels.

  • Reference: Dell’Acqua, Fabrizio, et al. “Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of AI on Knowledge Worker Productivity and Quality.” Harvard Business School Working Paper, no. 24-013, 22 Sept. 2023. SSRN, https://dx.doi.org/10.2139/ssrn.4573321.

Science

2025-04 | AIs beat experts at virology troubleshooting

  • Findings: On the new Virology Capabilities Test (VCT)—322 multimodal questions written and vetted by PhD virologists—OpenAI’s o3 model hit 43.8% accuracy, outscoring 94% of human experts (experts averaged 22.1%). Four other frontier models (o4-mini, o1, Gemini 2.5 Pro, Claude 3.5) also surpassed the median virologist. The study warns that public-facing models now deliver expert-level guidance on dual-use lab methods, heightening biosecurity risk.

  • Method: Sixty-eight virologists created and peer-reviewed the VCT; 36 additional experts provided a baseline by answering only questions in their own specialties. Models took the test zero-shot in a multiple-response format; performance was compared to human baselines and analysed across text-only and image-dependent items.

  • Reference: Götting, Jasper, et al. “Virology Capabilities Test (VCT): A Multimodal Virology Q&A Benchmark.” arXiv, 29 Apr 2025. https://doi.org/10.48550/arXiv.2504.16137.

Simulation

2024-11 | AI simulations mirror real people

  • Findings: Researchers created AI “simulations” of 1,052 Americans by feeding GPT-4 detailed transcripts from two-hour interviews, then ran tests to compare the simulations and source humans. Results were striking: the AI agents reproduced human survey responses 85% as reliably as humans reproduced their own answers two weeks later, matched 80% of their personality traits, and made the same choices in economic games two-thirds of the time. This offers compelling evidence that with the right data, AI can be a powerful tool for social science and market research.

  • Method: The team ran AI voice interviews with a nationally representative sample, then used the full transcripts—plus expert summaries—as inputs to GPT-4. They tested how well the AI matched humans across 177 survey questions (from the General Social Survey), a standard personality test (the Big Five), five economic decision-making games, and five validated psychology experiments. They compared performance against agents built only from basic demographics or short self-written bios, and assessed bias using standard fairness metrics. (The normalized-accuracy calculation is illustrated after this entry.)

  • Reference: Park, Joon S., et al. “Generative Agent Simulations of 1,000 People.” arXiv, 15 Nov. 2024. https://doi.org/10.48550/arXiv.2411.10109.
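
The headline number is a normalized accuracy: how often the generative agent matches a person’s survey answers, divided by how often that person matches their own answers when re-asked two weeks later. A tiny worked example with made-up response vectors:

```python
# Normalized accuracy: agent-vs-human agreement divided by the human's own test-retest agreement.
import numpy as np

human_wave1 = np.array([1, 3, 2, 4, 1, 5, 2, 3])  # a person's survey answers (invented)
human_wave2 = np.array([1, 3, 2, 4, 2, 5, 2, 3])  # same person, two weeks later
agent       = np.array([1, 3, 2, 3, 1, 5, 3, 3])  # the generative agent's answers

raw_accuracy = np.mean(agent == human_wave1)            # agent vs. the original answers
self_consistency = np.mean(human_wave2 == human_wave1)  # person's own replication rate
print(f"Normalized accuracy: {raw_accuracy / self_consistency:.0%}")
```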

Writing

2024-11 | AI poetry is indistinguishable from human

  • Findings: In two online experiments, non-expert readers failed to tell GPT-4 poems from verses by Sylvia Plath, Walt Whitman, T.S. Eliot and seven other renowned poets. Identification accuracy was 46%—worse than chance—and participants actually tagged AI poems as “human-written” more often than the genuine works. When rating quality, rhythm, imagery and 11 other traits, readers scored the AI poems higher on 13 of 14 dimensions; only “originality” was a tie. The pattern flips if people are told a poem is AI-made—scores then drop—signaling a perception bias rather than a real quality gap.

  • Method: Researchers assembled 100 poems: 50 originals from ten famous English-language poets and 50 fresh GPT-4 compositions generated with a simple “write a short poem in the style of <poet>” prompt. In Study 1 (n=1,634), participants guessed authorship for ten poems each. In Study 2 (n=696), a new sample rated ten poems on 14 qualitative scales under three framing conditions: “told human,” “told AI,” or no authorship clue. Mixed-effects models tested recognition accuracy and rating differences. (A quick check of the below-chance accuracy claim is sketched after this entry.)

  • Reference: Porter, Brian, and Edouard Machery. “AI-Generated Poetry Is Indistinguishable from Human-Written Poetry and Is Rated More Favorably.” Scientific Reports, vol. 14, article 26133, 2024. https://www.nature.com/articles/s41598-024-76900-1.
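
The “worse than chance” claim boils down to a proportion test against 50%. The study used mixed-effects models; the quick check below uses a plain binomial test with an illustrative guess count.

```python
# Is ~46% identification accuracy reliably below the 50% chance level?
from scipy.stats import binomtest

n_guesses = 16340                # illustrative: 1,634 readers x 10 poems each
n_correct = int(0.46 * n_guesses)
result = binomtest(n_correct, n_guesses, p=0.5, alternative="less")
print(result.pvalue)             # tiny p-value -> accuracy genuinely below chance
```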


Note: I used AI extensively above to help summarize the papers and create the references.