terry-r123/Cross-MolecularBenchmark


COMET: A Comprehensive Cross-Molecular Benchmark for Language Model Evaluation and Tasks in Biological Sequence Understanding

This is the official codebase for the paper "COMET: A Comprehensive Cross-Molecular Benchmark for Language Model Evaluation and Tasks in Biological Sequence Understanding".

COMET Overview


🔧 Prerequisites & Installation

Key libraries:

  • torch==1.13.1+cu117
  • transformers==4.38.1

Clone the repository:

git clone https://github.com/terry-r123/Multi-omicsBechmark.git
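
Because the library versions above are pinned, it can help to check the active environment before starting a long run. The following is a minimal sketch and not part of the repository: `PINNED` and `find_mismatches` are hypothetical names, and the check relies only on the standard-library `importlib.metadata`.

```python
from importlib import metadata

# Pinned versions copied from the "Key libraries" list above.
PINNED = {"torch": "1.13.1+cu117", "transformers": "4.38.1"}


def find_mismatches(pinned, get_version=metadata.version):
    """Return {package: (expected, found)} for packages that differ or are missing."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed in this environment
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

An empty result means both pinned packages are installed at the listed versions.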

🧪 Tasks and Datasets

Supported Task Categories:

🧬 DNA Tasks

  • Enhancer-Promoter Interaction
  • Enhancer Activity
  • Gene Expression

🧫 RNA Tasks

  • APA Isoform
  • Programmable RNA Switches
  • RNA Secondary Structure
  • siRNA Efficiency

🧬 Protein Tasks

  • Thermostability
  • EC (Enzyme Commission number prediction)
  • Contact (contact map prediction)

🔗 Cross-Molecular Tasks

  • DNA-Protein Folding
  • CRISPR Off-Target Prediction
  • RNA-Protein Interaction

📁 Datasets: Hugging Face


📂 Data Structure

The project’s data directory is organized as follows:

├── downstream/
│   ├── dna_tasks
│   ├── rna_tasks
│   ├── prot_tasks
│   └── ......
├── model/
│   ├── dnabert2
│   ├── ntv2
│   ├── rnafm
│   ├── rnalm
│   ├── esm1b
│   ├── esm2
│   └── ......
├── scripts/
│   ├── single_molecule
│   ├── multi_molecule
│   ├── cross_molecule
│   └── opensources
└── README.md
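
After downloading the datasets and model checkpoints, a quick layout check can confirm that the directories named in the tree are in place. This is a minimal sketch under stated assumptions: `EXPECTED_DIRS` and `missing_dirs` are hypothetical names, the list covers only the directories spelled out above (entries elided with "......" are not guessed), and the root path is assumed to be the repository checkout.

```python
from pathlib import Path

# Directories named explicitly in the tree above; elided entries are omitted.
EXPECTED_DIRS = [
    "downstream/dna_tasks",
    "downstream/rna_tasks",
    "downstream/prot_tasks",
    "model/dnabert2",
    "model/ntv2",
    "model/rnafm",
    "model/rnalm",
    "model/esm1b",
    "model/esm2",
    "scripts/single_molecule",
    "scripts/multi_molecule",
    "scripts/cross_molecule",
    "scripts/opensources",
]


def missing_dirs(root, expected=EXPECTED_DIRS):
    """Return the expected relative paths that are not directories under root."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).is_dir()]


if __name__ == "__main__":
    for rel in missing_dirs("."):
        print(f"missing: {rel}")
```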

🧠 Models

Available models/embedders used in COMET:

Common Biology Foundation Models: DNABERT2, NTv2, RNA-FM, BEACON, ESM1b, ESM-2
Naive Models: CNN, ResNet, LSTM
Unified Biology Foundation Model: LucaOne

⚙️ Model Settings

Model      Name      Token   Pos    Length
DNABERT2   dnabert2  single  alibi  1024
NTv2       ntv2      single  rope   1024
RNA-FM     rna-fm    single  ape    1024
BEACON-B   rnalm     single  alibi  1026
ESM1b      esm1b     single  ape    1024
ESM2       esm-2     single  ape    1024
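
The settings table can also be held as a small lookup structure, for instance when generating per-model configurations. This is only an illustrative sketch: `ModelSetting` and `MODEL_SETTINGS` are hypothetical names, the values are copied from the table above, and how the repository's scripts actually consume these settings is not shown here.

```python
from typing import NamedTuple


class ModelSetting(NamedTuple):
    name: str    # identifier from the "name" column
    token: str   # value from the "token" column
    pos: str     # value from the "pos" column (alibi, rope, or ape)
    length: int  # value from the "length" column


# Values copied row by row from the Model Settings table.
MODEL_SETTINGS = {
    "DNABERT2": ModelSetting("dnabert2", "single", "alibi", 1024),
    "NTv2": ModelSetting("ntv2", "single", "rope", 1024),
    "RNA-FM": ModelSetting("rna-fm", "single", "ape", 1024),
    "BEACON-B": ModelSetting("rnalm", "single", "alibi", 1026),
    "ESM1b": ModelSetting("esm1b", "single", "ape", 1024),
    "ESM2": ModelSetting("esm-2", "single", "ape", 1024),
}


def models_with_pos(pos):
    """Return the models whose "pos" column equals the given value."""
    return [model for model, s in MODEL_SETTINGS.items() if s.pos == pos]


if __name__ == "__main__":
    print(models_with_pos("alibi"))
```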

Results

Results of the unpaired cross-molecular experiments

image

Results of the native multi-molecular experiments

image

Results of the native multi-molecular experiments

image

🚀 Usage

🔁 Finetuning

To evaluate on a specific task, run the corresponding bash script under the scripts/ folder. For example, for the cross-molecule EC task:

bash scripts/cross-molecule/ec.sh
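
When launching several tasks in sequence, the same command can be built programmatically. The sketch below is a hypothetical wrapper, not part of the repository: `script_path` and `run_task` are illustrative names, and it assumes other tasks follow the same `scripts/<category>/<task>.sh` pattern as the EC example, which is the only script path confirmed above.

```python
import subprocess
from pathlib import Path


def script_path(category, task, scripts_root="scripts"):
    """Build the script path, assuming the <category>/<task>.sh pattern."""
    return Path(scripts_root) / category / f"{task}.sh"


def run_task(category, task, scripts_root="scripts", dry_run=False):
    """Run a task script with bash; with dry_run=True only return the command."""
    path = script_path(category, task, scripts_root)
    cmd = ["bash", path.as_posix()]
    if dry_run:
        return cmd
    if not path.is_file():
        raise FileNotFoundError(f"no such script: {path}")
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return cmd


if __name__ == "__main__":
    # Prints the command for the EC example without executing it.
    print(" ".join(run_task("cross-molecule", "ec", dry_run=True)))
```

Using `dry_run=True` first makes it easy to confirm the resolved path before committing to a long finetuning run.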

📜 License

This codebase is released under the Apache License 2.0. See the LICENSE file for more details.


💡 For questions or suggestions, feel free to open an issue or pull request.

