The training data market has exploded for LLMs, and bio foundation models are next.
But biological data is extremely complex and requires a data generation playbook that prioritizes quality over immediate scale.
Research article live now!
https://research.dimensioncap.com/p/on-training-data-for-bio-ai-models
·
Special thank you to Chris Gibson (@RecursionChris), Nathan Frey (@nc_frey), and several others for insightful conversations while putting this article together!
·
Cross-posted here as well:
Bauer LeSavage
@bauer_lesavage
On Training Data for Bio AI Models
As we advance biological foundation models, which lessons from LLM data curation transfer, and which need rethinking?
Special thank you to Chris Gibson (@RecursionChris), Nathan Frey (@nc_frey),…
·
Great piece, and very timely.
·
yes!! the care point is great; always wanted to do a side by side of the point about actually looking at your data (Common Crawl scrapes look wonkier than you think) with some bio data examples.
more interesting questions:
- How much observational versus
·
Great piece. The verification half is the hard half. We treat it as the product: write the phenotype from a structural prior first, then grade it on real trial readouts, not surrogates. We ran it on checkpoint non-response: one definition, three cancers, held!
·
With biology-physics integration remaining largely unexplored…
·
Very excited to read this! Thanks for sharing your thoughts
