HypoEvolve

Project repository on GitHub: JeffersonChen888/HypoEvolve

Benchmarking LLM-Generated Biological Hypotheses for Scientific Discovery

Jefferson Chen, Samuel Lee, Jieyuan Liu, Zhiting Hu, Zhen Wang
University of California, San Diego

jec068@ucsd.edu · hsl023@ucsd.edu · jil029@ucsd.edu · zhh019@ucsd.edu · zhw085@ucsd.edu

Elevator pitch

HypoEvolve is a multi-agent evolutionary system that refines LLM-generated biomedical hypotheses with the genetic-algorithm operators of selection, crossover, and mutation.
We show that evolutionary refinement yields substantial gains under external biological validation on drug repurposing (DepMap CRISPR dependency) and improves performance on Type 2 Diabetes (T2D) gene discovery.

Target audience / stakeholder: computational biology researchers, ML-for-science practitioners, and reviewers who need a systematic way to improve (not just generate) mechanistic hypotheses.

Key results

Figure 1. External validation comparison between single-pass prompting and HypoEvolve across cancer types (drug repurposing).

What we found:

Figure 2. Learning curve: LLM-evaluated fitness improves and then stabilizes across generations (drug repurposing).

Problem & scope

Problem

Single-pass LLM prompting produces one hypothesis with no systematic refinement. Unlike the iterative practice of human researchers, LLM workflows typically lack a selection pressure that preserves strong ideas and revises weak ones.

Scope boundaries

We do:

We do not:

Methods

System overview

HypoEvolve is a multi-agent GA-style loop:

Figure 3. The HypoEvolve framework: a multi-agent evolutionary loop used to iteratively refine hypotheses.

Evolutionary cycle per generation

Each generation follows:

  1. Review: generate component scores
  2. Selection: tournament selection + elitism
  3. Crossover: recombine mechanistic reasoning
  4. Mutation: controlled semantic edits
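The four steps above can be sketched as a single generation update. This is a minimal illustration, not the project's implementation: `review`, `crossover`, and `mutate` are hypothetical stand-ins for the LLM agents (here they operate on word counts and string halves so the example runs offline), and the tournament size `k` and elite count `n_elite` are assumed values.

```python
import random
from dataclasses import dataclass


@dataclass
class Hypothesis:
    text: str
    fitness: float = 0.0


def review(h: Hypothesis) -> float:
    # Stand-in for the LLM review agent: fitness is simply the word count.
    return float(len(h.text.split()))


def crossover(a: Hypothesis, b: Hypothesis) -> Hypothesis:
    # Stand-in for LLM recombination: first half of a joined to second half of b.
    wa, wb = a.text.split(), b.text.split()
    return Hypothesis(" ".join(wa[: len(wa) // 2] + wb[len(wb) // 2:]))


def mutate(h: Hypothesis, rng: random.Random, rate: float = 0.3) -> Hypothesis:
    # Stand-in for a controlled semantic edit: occasionally append one token.
    if rng.random() < rate:
        return Hypothesis(h.text + " refined")
    return h


def tournament(pop, rng, k=3):
    # Sample k distinct candidates and keep the fittest one.
    return max(rng.sample(pop, k), key=lambda h: h.fitness)


def next_generation(pop, rng, n_elite=1, k=3):
    # 1. Review: score every hypothesis in the current population.
    for h in pop:
        h.fitness = review(h)
    # 2. Selection: elitism copies the top n_elite hypotheses unchanged.
    ranked = sorted(pop, key=lambda h: h.fitness, reverse=True)
    children = ranked[:n_elite]
    # 2-4. Fill the remaining slots: tournament-select parents, recombine, mutate.
    while len(children) < len(pop):
        p1, p2 = tournament(pop, rng, k), tournament(pop, rng, k)
        children.append(mutate(crossover(p1, p2), rng))
    return children
```

Because the elite is carried over unchanged and the stand-in `review` is deterministic, the best score in this sketch cannot decrease from one generation to the next. With a stochastic LLM reviewer that guarantee would hold only if elite scores were cached rather than recomputed.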

Objective

We define hypothesis search as:

$h^* = \arg\max_h f(h)$

where fitness is:

$f(h) = w_c s_c + w_n s_n + w_q s_q$

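Here $s_c$, $s_n$, and $s_q$ are the correctness, novelty, and quality component scores from the review step, and $w_c$, $w_n$, and $w_q$ are their weights. The weights balance novelty and quality while maintaining correctness.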
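As a small worked illustration of the objective, the sketch below computes the weighted sum and picks the highest-scoring candidate. It reads the subscripts c, n, and q as correctness, novelty, and quality, which is inferred from the sentence above. The weight values and the candidate scores are hypothetical and chosen only to make the arithmetic easy to follow; the source does not state the weights used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComponentScores:
    correctness: float  # s_c
    novelty: float      # s_n
    quality: float      # s_q


def fitness(s: ComponentScores, w_c: float = 0.5, w_n: float = 0.25, w_q: float = 0.25) -> float:
    # f(h) = w_c * s_c + w_n * s_n + w_q * s_q (default weights are assumed values).
    return w_c * s.correctness + w_n * s.novelty + w_q * s.quality


# Hypothetical candidates, keyed by an illustrative label.
candidates = {
    "h1": ComponentScores(correctness=8, novelty=6, quality=4),
    "h2": ComponentScores(correctness=4, novelty=8, quality=8),
}

# h* = argmax_h f(h) over the candidate set.
best_label = max(candidates, key=lambda name: fitness(candidates[name]))
```

With these assumed weights, `h1` scores 0.5·8 + 0.25·6 + 0.25·4 = 6.5 and `h2` scores 0.5·4 + 0.25·8 + 0.25·8 = 6.0, so the more correct but less novel hypothesis is selected.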

Experiments & data

Task 1: Drug repurposing (oncology)

Given a cancer type, the system proposes:

External validation dataset: DepMap CRISPR dependency.
DepMap aggregates genome-scale CRISPR knockout screens across many cancer cell lines. Dependency scores indicate whether a gene is essential for cell survival in a given context. We use DepMap only after hypotheses are generated to measure whether implicated genes show strong dependency in the relevant cancer setting (external validation).

Leakage control: DepMap is not used in internal scoring or during evolution; it is used only for post-hoc evaluation.
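One way such a post-hoc check could look is sketched below. The source does not specify the validation metric, so this is an assumption: the function reports the fraction of implicated genes whose dependency score falls at or below a cutoff. The cutoff of -0.5 is a convention often used with DepMap gene effect scores (more negative means stronger dependency) and is adopted here only for illustration. The gene names and scores are placeholder values, not DepMap data.

```python
def dependency_hit_rate(genes, gene_effect, threshold=-0.5):
    """Fraction of implicated genes with a gene effect score <= threshold.

    genes:       gene symbols implicated by a finished hypothesis
    gene_effect: mapping of gene symbol -> dependency score for the relevant
                 cancer context (more negative = stronger dependency)
    threshold:   assumed cutoff for calling a gene a strong dependency
    """
    # Genes absent from the screen cannot be validated, so they are skipped.
    scored = [g for g in genes if g in gene_effect]
    if not scored:
        return 0.0
    hits = [g for g in scored if gene_effect[g] <= threshold]
    return len(hits) / len(scored)


# Placeholder scores for illustration only (not real DepMap values).
toy_effect = {"GENE_A": -1.0, "GENE_B": -0.25, "GENE_C": -0.75, "GENE_D": 0.0}

# Called only after evolution has finished, so the scores cannot leak into fitness.
rate = dependency_hit_rate(["GENE_A", "GENE_B", "GENE_C", "GENE_D", "GENE_X"], toy_effect)
```

In this toy case `GENE_X` is skipped because it has no score, and two of the four scored genes pass the cutoff, giving a hit rate of 0.5. Keeping the function separate from the review and selection code is one simple way to enforce the leakage control described above.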

Task 2: Type 2 Diabetes (T2D) gene discovery

We also evaluate HypoEvolve on identifying T2D-associated genes, comparing performance across different LLM backbones (as in the poster).

Figure 4. T2D results: single-pass prompting vs. HypoEvolve across different LLMs.

Interpretation

What changed across generations (qualitative):

What to trust vs. what to be cautious about:

Limitations & next steps

Limitations

Next steps
