COMET: A Comprehensive Cross-Molecular Benchmark for Language Model Evaluation and Tasks in Biological Sequence Understanding
This is the official codebase for the paper "COMET: A Comprehensive Cross-Molecular Benchmark for Language Model Evaluation and Tasks in Biological Sequence Understanding".
Key libraries:
- torch==1.13.1+cu117
- transformers==4.38.1
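A quick way to confirm that an environment matches the pins above is to compare them with the installed package metadata. The snippet below is a minimal sketch, not part of the COMET codebase: `PINS`, `installed_version`, and `check_pins` are hypothetical names introduced here for illustration. Note that a PyTorch build for a different CUDA version (or a CPU-only build) reports a different local version tag than `+cu117`, so it would show up as a mismatch.

```python
from importlib import metadata

# Pins copied from the "Key libraries" list above.
PINS = {"torch": "1.13.1+cu117", "transformers": "4.38.1"}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pins(pins, lookup=installed_version):
    """Map each package name to a tuple (expected, found, matches)."""
    report = {}
    for package, expected in pins.items():
        found = lookup(package)
        report[package] = (expected, found, found == expected)
    return report


if __name__ == "__main__":
    for package, (expected, found, ok) in check_pins(PINS).items():
        status = "ok" if ok else "MISMATCH"
        print(f"{package}: expected {expected}, found {found} [{status}]")
```

The `lookup` parameter exists only so the comparison can be exercised without installing anything; in normal use it can be left at its default.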
git clone https://github.com/terry-r123/Multi-omicsBechmark.git
Supported Task Categories:
- Enhancer-Promoter Interaction
- Enhancer Activity
- Gene Expression
- APA Isoform
- Programmable RNA Switches
- RNA Secondary Structure
- siRNA Efficiency
- Thermostability
- EC (Enzyme Commission number prediction)
- Contact (contact map prediction)
- DNA-Protein Folding
- CRISPR Off-Target Prediction
- RNA-Protein Interaction
📁 Datasets: Hugging Face
The project’s data directory is organized as follows:
├── downstream/
│ ├── dna_tasks
│ ├── rna_tasks
│ ├── prot_tasks
│ └── ......
├── model/
│ ├── dnabert2
│ ├── ntv2
│ ├── rnafm
│ ├── rnalm
│ ├── esm1b
│ ├── esm2
│ └── ......
├── scripts/
│ ├── single_molecule
│ ├── multi_molecule
│ ├── cross_molecule
│ └── opensources
└── README.md
Available models/embedders used in COMET:
- Common biology foundation models: DNABERT2, NTv2, RNA-FM, BEACON, ESM1b, ESM-2
- Naive models: CNN, ResNet, LSTM
- Unified biology foundation model: LucaOne
| Models | name | token | pos | length |
|---|---|---|---|---|
| DNABERT2 | dnabert2 | single | alibi | 1024 |
| NTv2 | ntv2 | single | rope | 1024 |
| RNA-FM | rna-fm | single | ape | 1024 |
| BEACON-B | rnalm | single | alibi | 1026 |
| ESM1b | esm1b | single | ape | 1024 |
| ESM2 | esm-2 | single | ape | 1024 |
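The table can also be kept as a small lookup structure when writing helper scripts. The sketch below simply mirrors the rows above; `ModelSpec`, `MODEL_SPECS`, and `get_spec` are hypothetical names, not part of the COMET codebase. It assumes that `pos` names the positional-encoding scheme and that `length` is the maximum input length, which the table itself does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    display_name: str  # "Models" column
    token: str         # "token" column, copied as-is
    pos: str           # "pos" column (assumed: positional-encoding scheme)
    length: int        # "length" column (assumed: maximum input length)


# Keys are the values of the "name" column.
MODEL_SPECS = {
    "dnabert2": ModelSpec("DNABERT2", "single", "alibi", 1024),
    "ntv2": ModelSpec("NTv2", "single", "rope", 1024),
    "rna-fm": ModelSpec("RNA-FM", "single", "ape", 1024),
    "rnalm": ModelSpec("BEACON-B", "single", "alibi", 1026),
    "esm1b": ModelSpec("ESM1b", "single", "ape", 1024),
    "esm-2": ModelSpec("ESM2", "single", "ape", 1024),
}


def get_spec(name):
    """Look up a model by its short name, listing valid names on failure."""
    try:
        return MODEL_SPECS[name]
    except KeyError:
        known = ", ".join(sorted(MODEL_SPECS))
        raise ValueError(f"unknown model name {name!r}; expected one of: {known}") from None
```

For example, `get_spec("rnalm").length` returns the BEACON-B entry's length from the table, and an unrecognized name raises a `ValueError` that lists the valid short names.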
To evaluate on a specific task, run the corresponding bash script under the scripts/ folder. For example, for the cross-molecule EC task:
bash scripts/cross-molecule/ec.sh
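The directory listing in this README spells the script folders with underscores (`cross_molecule`), while the command above uses a hyphen (`cross-molecule`). If the command fails with a "no such file" error, a small helper can try both spellings. This is a minimal sketch under that assumption; `find_task_script` is a hypothetical function, not part of the COMET codebase.

```python
from pathlib import Path


def find_task_script(repo_root, category, task):
    """Return the first existing script path for a category/task pair, or None.

    Tries the category as given, then with hyphens replaced by underscores,
    then with underscores replaced by hyphens.
    """
    variants = dict.fromkeys(
        [category, category.replace("-", "_"), category.replace("_", "-")]
    )
    for variant in variants:
        candidate = Path(repo_root) / "scripts" / variant / f"{task}.sh"
        if candidate.is_file():
            return candidate
    return None
```

When called from the repository root as `find_task_script(".", "cross-molecule", "ec")`, a non-`None` result can be passed to `subprocess.run(["bash", str(path)], check=True)` to launch the evaluation.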
This codebase is released under the Apache License 2.0. See the LICENSE file for more details.
💡 For questions or suggestions, feel free to open an issue or pull request.



