Data generation in biology thread (also see martinborschjensen)

Bauer LeSavage

@bauer_lesavage

The training data market has exploded for LLMs and bio foundation models are next.

But biological data is extremely complex and requires a data generation playbook that prioritizes quality over immediate scale.

@_DimensionCap

Research article live now!

https://research.dimensioncap.com/p/on-training-data-for-bio-ai-models

11:08 AM · Jun 3, 2026 · 2,595 Views

Bauer LeSavage

@bauer_lesavage

·

1h

Special thank you to @RecursionChris, @nc_frey, @peytongreenside, @Ronalfa, and several others for insightful conversations throughout putting this article together!

Bauer LeSavage

@bauer_lesavage

·

1h

Cross-posted here as well:

Quote

Bauer LeSavage

@bauer_lesavage

·

1h

On Training Data for Bio AI Models

As we advance biological foundation models, which lessons from LLM data curation transfer, and which need rethinking?

Special thank you to Chris Gibson (@RecursionChris), Nathan Frey (@nc_frey),…

Ron Alfa

@Ronalfa

·

1h

Great piece, and very timely.

Elizabeth Hudson

@ClarkPolner

·

1h

yes!! the care point is great; always wanted to do a side by side of the @karpathy point about actually looking at your data (common crawl scrapes look wonkier than you think) with some bio data examples.

more interesting questions:

  • How much observational versus


Raimo

@Rhymo

·

51m

Great piece. Verification half is the hard half. We treat it as the product: write the phenotype from a structural prior first, then grade it on real trial readouts, not surrogates. We ran it on checkpoint non-response: one definition, three cancers, held!

From capacity.encounter.bio

Account

@X__Notes__

·

45m

With biology-physics integration remaining largely unexplored…

Andrew Rodriguez

@andrewjrod

·

1h

Very excited to read this! Thanks for sharing your thoughts 🙂

Kenny Workman

@kenbwork

·

12h

A few talks at the intersection of single cell biology x agent engineering:

  • Understanding library prep + sequencing is important
  • Repurposing data infrastructure for agent consumption
  • How to benchmark frontier agents on single cell analysis

BioAIDevs

@BioAIDevs

BIOS is building an in-house robotics program to bring physical synthesis, plate handling, and result readback inside the pipeline.

Today, wet lab runs route to @adaptyvbio via x402. The robotics program removes that dependency and closes the full autonomous loop from receptor to result inside one system.

BIOS already owns the full computational side end-to-end.

Three generative models, PXDesign, BoltzGen, and RFdiffusion3, produce 5,000 binder candidates per run. Scoring and molecular dynamics filter these down to 10 to 15 viable candidates.

Those candidates are commissioned for synthesis at Adaptyv Bio, paid machine-to-machine via x402, with every result publishing on-chain as it arrives.

The in-house robotics program brings the next step inside that same loop. A dexterous robotic arm is being developed to handle pipetting, transfer samples to well plates, and run binding assays without routing to an external CRO at each step.

When wet lab data returns, it feeds directly back into the binder generation step. The models learn which structural features survived physical testing and the next generation cycle starts from a stronger baseline.
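As a rough illustration of the loop described above (generate about 5,000 candidates, filter to roughly a dozen, test them, feed the results back), here is a minimal toy sketch. Everything in it is an assumption made for illustration: the `Candidate` type, the random scores, the simulated assay, and the per-model bias update are hypothetical stand-ins, not the BIOS implementation. No real generative model, molecular dynamics engine, or Adaptyv Bio API is called, and the per-model bias is only a crude proxy for "learning which structural features survived."

```python
import random
from dataclasses import dataclass

# Model names are taken from the post; nothing here calls the real models.
MODELS = ("PXDesign", "BoltzGen", "RFdiffusion3")

@dataclass(frozen=True)
class Candidate:
    model: str    # which generator proposed this binder (hypothetical field)
    score: float  # placeholder in-silico score, higher is better

def generate(n, bias, rng):
    """Stand-in for binder generation: draw n toy candidates.
    `bias` maps model name -> score offset carried over from earlier rounds."""
    out = []
    for i in range(n):
        model = MODELS[i % len(MODELS)]
        out.append(Candidate(model, rng.gauss(bias.get(model, 0.0), 1.0)))
    return out

def shortlist(candidates, k):
    """Stand-in for scoring + molecular dynamics filtering: keep the top k."""
    return sorted(candidates, key=lambda c: c.score, reverse=True)[:k]

def wet_lab(candidates, rng, hit_rate=0.3):
    """Stand-in for synthesis and a binding assay; hit_rate is made up."""
    return [(c, rng.random() < hit_rate) for c in candidates]

def update_bias(bias, results, step=0.1):
    """Toy feedback: nudge each model's offset up on a hit, down on a miss."""
    new_bias = dict(bias)
    for cand, bound in results:
        delta = step if bound else -step
        new_bias[cand.model] = new_bias.get(cand.model, 0.0) + delta
    return new_bias

def run_cycle(bias, rng, n=5000, k=12):
    """One generate -> filter -> test -> feed back cycle."""
    results = wet_lab(shortlist(generate(n, bias, rng), k), rng)
    return update_bias(bias, results), results

rng = random.Random(0)  # fixed seed so the toy run is repeatable
bias = {}
for _ in range(3):
    bias, results = run_cycle(bias, rng)
```

Each call to `run_cycle` starts from the bias left by the previous round, which is the only sense in which this sketch "starts from a stronger baseline."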

BIOS is building all three pillars in parallel:

  • Self-hosted foundation models (fine-tunable inside the CLI)
  • Operational wet-lab execution (currently via x402 & Adaptyv Bio)
  • In-house robotics (the third pillar under active development)

This creates a fully integrated, self-improving system.