SciAgentArena

Abstract

AI agents are increasingly being developed to accelerate scientific discovery, yet their practical capabilities in real research settings remain poorly understood. Existing benchmarks for AI agents rarely capture the complexity, heterogeneity, and extended reasoning required by scientific work, whereas benchmarks for scientific tasks often reduce research to static, direct problems. Here we introduce SciAgentArena, a systematic benchmark for evaluating AI agents in real-world scientific research scenarios drawn from emerging needs across multiple domains. SciAgentArena comprises approximately 200 tasks with stepwise verification and an interactive, agent-agnostic environment for assessing diverse AI agents. We find that current agents can contribute effectively to well-specified data-analysis workflows when task structure and evaluation criteria are clear, but their performance remains uneven across scientific contexts: agents struggle to generate genuinely novel insights, sustain self-directed exploration, and formulate robust solutions for open-ended research questions. We further characterize common failure modes across agents and identify opportunities for improving reliability, autonomy, and scientific reasoning.

~200 stepwise-verified tasks

18 AI agents evaluated (generalist · specialist · MLLM)

5 scientific domains

4 task categories

∞ open submissions (a living benchmark)

Overview

SciAgentArena separates the agent running framework from the evaluation framework, configuring a dedicated environment per agent and unifying I/O before computing metrics. This resolves configuration conflicts between agents and enables evaluation beyond accuracy — including stability, reliability, and cost.
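As a rough illustration of evaluating beyond accuracy, the sketch below aggregates repeated runs of one agent on one task into a mean score, a spread across identical runs (stability), a completion rate (reliability), and a mean cost. The `RunRecord` schema, the field names, and the example values are assumptions made for this sketch, not the benchmark's actual data model.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Dict, List, Optional


@dataclass
class RunRecord:
    """One agent run on one task after I/O normalization (hypothetical schema)."""
    agent: str
    task: str
    score: Optional[float]  # task metric in [0, 1]; None if the run produced no scorable output
    cost_usd: float         # cost attributed to this run


def summarize(runs: List[RunRecord]) -> Dict[str, float]:
    """Aggregate repeated runs of a single agent on a single task."""
    if not runs:
        raise ValueError("need at least one run to summarize")
    completed = [r.score for r in runs if r.score is not None]
    return {
        "mean_score": mean(completed) if completed else 0.0,
        # Spread across identical runs: lower means more stable.
        "score_std": pstdev(completed) if len(completed) > 1 else 0.0,
        # Fraction of runs that produced a scorable output.
        "completion_rate": len(completed) / len(runs),
        "mean_cost_usd": mean(r.cost_usd for r in runs),
    }


# Illustrative values only; these are not results from the benchmark.
runs = [
    RunRecord("agent_a", "deg_detection", 0.30, 0.50),
    RunRecord("agent_a", "deg_detection", 0.20, 0.70),
    RunRecord("agent_a", "deg_detection", None, 0.30),
]
summary = summarize(runs)
```

Keeping the run records separate from the scoring step mirrors the decoupling described above: any agent that can emit records in the shared format can be scored by the same function.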

Best-agent performance on representative tasks

While frontier models saturate math benchmarks (GPT 5.2 reaches 100% on AIME), the strongest agent on each SciAgentArena domain leaves clear room to improve.

hERG Prediction (Drug Discovery, AUROC): 68.1

DEG detection (Single-Cell Omics, F1): 29.4

Spatially variable genes (Spatial Omics, Jaccard): 47.6

Direct Diagnosis (EHR Modeling, Accuracy)

Polygenic Risk Score (Genetics, Accuracy)
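For readers unfamiliar with the set-based metrics named above, here is a minimal sketch of the Jaccard index and F1 score applied to a predicted gene set versus a reference gene set. The gene names are placeholders, and the benchmark's own scoring code may differ in details such as handling of empty sets or reporting on a 0 to 100 scale.

```python
from typing import Set


def jaccard(predicted: Set[str], truth: Set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    union = predicted | truth
    # Convention assumed here: two empty sets count as a perfect match.
    return len(predicted & truth) / len(union) if union else 1.0


def f1(predicted: Set[str], truth: Set[str]) -> float:
    """Harmonic mean of precision and recall over set membership."""
    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


# Placeholder gene identifiers, for illustration only.
predicted_genes = {"GENE_A", "GENE_B", "GENE_C"}
reference_genes = {"GENE_B", "GENE_C", "GENE_D", "GENE_E"}

jaccard_value = jaccard(predicted_genes, reference_genes)  # 2 shared / 5 total = 0.4
f1_value = f1(predicted_genes, reference_genes)            # precision 2/3, recall 1/2 -> 4/7
```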

Five scientific domains, from molecules to patients

The benchmark spans the full arc of biomedical research — from identifying problems and collecting data to understanding disease and developing treatments, and from the molecular and cellular scale up to the whole human body.

Molecule

Drug Discovery

ADMET prediction, protein–ligand binding affinity, lead optimization, drug-target interaction, and combination-synergy scoring.

Cell

Single-Cell Omics

Integration, clustering, differentially expressed genes, perturbation prediction, and trajectory inference.

Tissue

Spatial Omics

Spatial graph construction, spatially variable genes, and neighborhood enrichment on a dedicated page.

Patient

EHR Modeling

Clinical-code normalization, rare-event extraction, outcome prediction, and treatment recommendation scored via FHIR.

Genome

Genetics

Mendelian randomization, quality control (QC), and polygenic risk score (PRS) computation and pipelines over real GWAS summary statistics.

Cross-domain

Integrative Tasks

eQTL computation (genetics × omics), drug-target identification, and synthetic-lethality prediction across modalities.
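To make the PRS computation listed under Genetics concrete, the sketch below uses the standard additive form: a weighted sum of allele dosages, with per-variant effect sizes from GWAS summary statistics as the weights. The variant IDs and numbers are hypothetical, and real pipelines add steps this sketch omits (allele alignment, clumping or thresholding, and QC).

```python
from typing import Dict


def polygenic_risk_score(dosages: Dict[str, float],
                         effect_sizes: Dict[str, float]) -> float:
    """Additive PRS: sum of effect size times allele dosage over shared variants."""
    # Only variants present in both the genotype data and the summary statistics contribute.
    shared = dosages.keys() & effect_sizes.keys()
    return sum(effect_sizes[v] * dosages[v] for v in shared)


# Hypothetical variants and values, for illustration only.
effect_sizes = {"rs1": 0.10, "rs2": -0.05, "rs3": 0.20}   # per-allele effects from a GWAS
dosages = {"rs1": 2.0, "rs2": 1.0, "rs4": 0.0}            # effect-allele counts for one person

score = polygenic_risk_score(dosages, effect_sizes)       # 0.10 * 2 + (-0.05) * 1 = 0.15
```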

Four core capability categories

Every domain's tasks map onto the same reasoning, planning, and action requirements of real research.

Data Analysis

Solve a long-horizon analysis problem step by step, testing the ability to sustain multi-step workflows.

Optimization

Improve candidate solutions toward stated objectives — selecting methods or designing new strategies.

Discovery

Explore a research domain and propose new, scientifically grounded hypotheses and ideas.

Validity

Determine whether a proposed task is scientifically and technically feasible before executing it.

An interactive, extensible platform

Because code execution and evaluation are decoupled, the community can add agents and tasks simultaneously. Open a dedicated benchmark page, submit code or data, and receive automatic, step-wise evaluation — our goal is the "LeetCode" of scientific-agent design.

**Figure 2.** The SciAgentArena platform: a scientific roadmap from patient to molecule, the benchmark portal with seven benchmarks, and the submission and auto-evaluation system. (a) Scientific challenges span micro to macro scales: patients, genetics, spatial and single-cell omics, and drug discovery. (b) The interactive benchmarking portal with per-domain benchmark pages. (c) Solution submission and automatic evaluation.
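One way to picture step-wise evaluation of a submission is a set of named checkpoints, each with its own checker, scored independently of how the outputs were produced. The step names, checkers, and scoring rule (fraction of steps passed) below are assumptions for illustration; the platform's actual verification logic is not described here.

```python
from typing import Any, Callable, Dict

Checker = Callable[[Any], bool]


def evaluate_stepwise(outputs: Dict[str, Any],
                      checkers: Dict[str, Checker]) -> Dict[str, Any]:
    """Run one checker per step; a missing output or a failing checker counts as a failed step."""
    per_step: Dict[str, bool] = {}
    for step, check in checkers.items():
        if step not in outputs:
            per_step[step] = False
            continue
        try:
            per_step[step] = bool(check(outputs[step]))
        except Exception:
            # A checker that cannot process the output is treated as a failure.
            per_step[step] = False
    passed = sum(per_step.values())
    fraction = passed / len(checkers) if checkers else 0.0
    return {"per_step": per_step, "fraction_passed": fraction}


# Hypothetical three-step clustering task.
checkers = {
    "load_data": lambda n_cells: n_cells > 0,
    "cluster": lambda labels: len(set(labels)) >= 2,
    "report": lambda path: path.endswith(".csv"),
}
# The submission completed the first two steps but never wrote a report.
submission = {"load_data": 100, "cluster": [0, 0, 1]}

result = evaluate_stepwise(submission, checkers)
```

Scoring each step separately gives partial credit and shows where in a workflow an agent stopped making progress, which a single end-of-task score would hide.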

Key findings

Current AI agents are useful but uneven scientific collaborators.

No single agent dominates. Rankings are inconsistent across tasks and domains, exposing agent-specific biases and the need for stronger generalization.

Heterogeneous contributions. Agents excel at fixed-pipeline data analysis but struggle to optimize molecules and algorithms or to derive genuinely novel discoveries.

Conservative method selection. Agents converge to popular defaults — Leiden, Harmony, scVI, Wilcoxon, Moran's I, OLS — instead of adapting to the task.

Limited self-exploration & stability. Agents call tools passively, vary across identical runs, and share error patterns such as mismatched package versions.

Weak validity checking. Agents tend to execute requests even when the scientific premise is invalid — a form of sycophancy that undermines reliability.

A roadmap forward. Expand knowledge bases, supply richer prompts and tool awareness, add resource-aware decision-making, and keep humans in the loop.
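The findings above mention mismatched package versions as a shared error pattern. A minimal sketch of a pre-flight check an agent could run before writing code is shown below, using the standard library's `importlib.metadata`. The package names and pinned versions are placeholders, not requirements stated by the benchmark.

```python
from importlib import metadata
from typing import Dict


def check_versions(required: Dict[str, str]) -> Dict[str, str]:
    """Return {package: problem} for packages that are missing or not at the pinned version."""
    problems: Dict[str, str] = {}
    for package, pinned in required.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            problems[package] = "not installed"
            continue
        if installed != pinned:
            problems[package] = f"installed {installed}, expected {pinned}"
    return problems


# Placeholder pins; substitute the versions the task environment actually specifies.
problems = check_versions({"scanpy": "1.10.0", "anndata": "0.10.0"})
for package, problem in problems.items():
    print(f"{package}: {problem}")
```

An exact-match comparison is the simplest policy; a real setup might accept version ranges instead.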

Common failure modes & proposed fixes

A cross-domain taxonomy of recurring agent errors, distilled from running 18 agents over ~200 tasks.

Failure mode	Where it appears	Proposed fix
Version / environment-mismatched API or tool	Omics; Drug Discovery	Check package versions and pin signatures before coding.
Wrong library, namespace, or method wrapper	Omics; Drug Discovery	Use task-to-library mapping and API introspection.
Wrong data structure / output schema assumption	Omics	Inspect runtime objects and validate the final output.
Domain workflow over-generalization	Omics; EHR; Drug Discovery	Use domain-specific checklists before execution.
Insufficient data / context inspection	Omics; EHR; Drug Discovery; Genetics	Add a mandatory context-inspection stage.
Hallucinated valid actions	EHR	Ground every action in retrieved patient context.
Incomplete recommendation coverage	EHR	Map all active problems to required treatments.
Multi-objective optimization failure	Drug Discovery	Use constraint-first, window-aware optimization.
Tool availability ignored	Drug Discovery	Require tool calls with traceable evidence.
Executing ill-posed / invalid tasks	Drug Discovery; Omics; Genetics	Add premise checks and structured refusal when invalid.
Need for human–agent collaboration	Genetics	Request clarification on paths, covariates, phenotypes, PRS setup.

Adapted from Figure 6 of the paper — the full merged-error taxonomy.
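The "premise checks and structured refusal" fix in the table can be sketched as a small validation step that runs before any analysis and returns machine-readable reasons when the task is ill-posed. The example below checks a differential-expression request; the `PremiseReport` type, the function name, and the `min_cells` threshold are all illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PremiseReport:
    """Outcome of a premise check; `reasons` is empty when the task is feasible."""
    feasible: bool
    reasons: List[str] = field(default_factory=list)


def check_deg_premise(group_sizes: Dict[str, int], min_cells: int = 3) -> PremiseReport:
    """Refuse a differential-expression task whose comparison groups cannot support it."""
    reasons: List[str] = []
    if len(group_sizes) < 2:
        reasons.append("need at least two groups to compare")
    for group, n_cells in group_sizes.items():
        if n_cells < min_cells:  # threshold is an arbitrary choice for this sketch
            reasons.append(f"group '{group}' has {n_cells} cells, fewer than {min_cells}")
    return PremiseReport(feasible=not reasons, reasons=reasons)


# Hypothetical request: the control group is empty, so the comparison is ill-posed.
report = check_deg_premise({"treated": 120, "control": 0})
if not report.feasible:
    print("Refusing task:", "; ".join(report.reasons))
```

Returning reasons rather than a bare boolean lets an evaluator verify that the refusal was made for the right cause.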

Agents evaluated

18 agents spanning generalist LLMs, specialist scientific agents, and multimodal models — each run at maximal functionality with its most suitable backbone for a fair comparison.

GPT 5.2 · Gemini 3 Pro · Claude Sonnet 4.6 · Codex · Claude Code · ToolUniverse · CellForge · STELLA · AutoBA · Biomni · TxAgent · Medea · CACTUS · ChemToolAgent · DrugAgent · LIDDiA · DELTA · MRagent

Citation

If you find SciAgentArena useful, please cite our work.

@article{liu2026sciagentarena,
  title  = {Benchmarking AI Agents for Addressing Scientific Challenges Across Scales},
  author = {Liu, Tianyu and Wang, Allen Xin and Panescu, Antonia and Chen, Lisa Xinyi
            and Long, Wenxin and Wei, Xinyu and Jing, Yueqian and Zeng, Ziyao and others},
  year   = {2026},
  note   = {SciAgentArena},
  url    = {https://arxiv.org/abs/2606.12736}
}

Tasks & datasets: huggingface.co/datasets/iLOVE2D/SciAgentArena