Comment by tennysont
Fun to see talk of "a compiler for DNA"---I've been hoping for that for a long time.
I have to admit, at a _glance_ this feels like a promising idea with few results and lots of marketing. I'll try to be clear about my confusion, feel free to explain if I'm off base.
- There's not a lot of talk of your "ground truth" for evaluations. Are you using mRNABench?
- Has you mRNABench paper been peer reviewed? You linked a preprint. (I know paper submission can be touch or stressful, and it's a superficial metric to be judged on!)
- Do any of your results suggest that this foundation model might be any good on out of sequence mRNA sequences? If not, then is the (current) model supposed to predict properties of natural mRNA sequences rather than of synthetic mRNA sequences?
- Did a lot mRNA sequences have experimental verification of their predicted properties? At a quick glance, I see this 66 number in the paper---but I truly have no idea.
I'm super happy to praise both incremental progress and putting forth a vision, I just also want to have a clear understanding of the current state-of-the-art as well!
> ground truth
Hey yes, the ground truth for our evaluations is measured experimental data. Our models are benchmarked using mRNABench, which aggregates results from high-throughput wet lab experiments.
Our goal, however, is to move beyond predicting existing experimental outcomes. We intend to design novel sequences and validate their function in our own lab. At that stage, the functional success of the RNA we design will become the ground truth.
> peer reviewed?
Both mRNA bench and Orthrus are in submission (at a big ML conference and a big name journal) - unfortunately the academic systems move slow but we're working on getting them out there.
> synthetic mRNA sequences
I think you're asking on generalizing out of distribution to unnatural sequences. There are two ways that we do this: (1) There are these screens called Massively Parallel Reporter Assays (MPRAs) and we eval for example on https://pubmed.ncbi.nlm.nih.gov/31267113/
Here all the sequences are synthetic and randomly designed and we do observe generalization. Ultimately it depends on the problem that we're tackling: some tasks like gene therapy design require endogenous sequences.
(2) The other angle is variant effect prediction (VEP). It can be thought of as a counterfactual prediction problem where you ask the model whether a small change in the input predicts a large change in the output. This is a good example of the study (https://www.biorxiv.org/content/10.1101/2025.02.11.637758v2)
> experimental verification of their predicted properties
all our model evaluations are predictions of experimental results! The datasets we use are collections of wet lab measurements, so the model is constantly benchmarked against ground-truth biology.
The evaluation method involves fitting a linear probe on the model's learned embeddings to predict the experimental signal. This directly tests whether the model's learned representation of an RNA sequence contains a linear combination of features that can predict its measured biological properties.
Thanks for the feedback I understand the caution around pre-prints. We believe a self-supervised learning approach is well-suited for this problem because it allows the model to first learn patterns from millions of unlabeled sequences before being fine-tuned on specific, and often smaller, experimental datasets.