NucEL: Single-Nucleotide ELECTRA-Style Genomic Pre-training for Efficient and Interpretable Representations

NucEL is an ELECTRA-style language model for nucleotide sequence analysis, pre-trained on DNA at single-nucleotide resolution. It provides embeddings for DNA sequences and can be fine-tuned for a range of downstream genomic tasks.

Model Details

  • Model Type: Transformer-based sequence model
  • Domain: Genomics and nucleotide sequences
  • Architecture: Based on the ModernBERT architecture, adapted for nucleotide sequences
  • Parameters: 92.3M (F32, safetensors)

Features

  • Nucleotide-level tokenization and embedding
  • Pre-trained on the human genome
  • Optimized for biological sequence understanding
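Single-nucleotide tokenization means each base maps to its own token, so embeddings and attributions are available per position. A minimal character-level sketch of the idea (illustrative only; the vocabulary and ids here are assumptions, not the actual NucEL tokenizer, which also adds special tokens):

```python
# Illustrative single-nucleotide tokenization: one token id per base.
# NOTE: this toy vocabulary is an assumption, NOT NucEL's real vocab.
VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3}

def tokenize(sequence: str) -> list[int]:
    """Map a DNA string to one token id per nucleotide."""
    return [VOCAB[base] for base in sequence.upper()]

print(tokenize("ATCG"))  # one id per base -> [0, 3, 1, 2]
```

Because tokens align one-to-one with nucleotides, the model's output sequence length matches the input DNA length (plus any special tokens), which is what makes per-base interpretation straightforward.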

Usage

Basic Usage

from transformers import AutoModel, AutoTokenizer
import torch

# Load model and tokenizer
model = AutoModel.from_pretrained("FreakingPotato/NucEL", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("FreakingPotato/NucEL", trust_remote_code=True)
model.eval()

# Example DNA sequence
sequence = "ATCGATCGATCGATCG"

# Tokenize and run a forward pass without gradients
inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Per-token (per-nucleotide) embeddings
embeddings = outputs.last_hidden_state
print(f"Sequence embeddings shape: {embeddings.shape}")
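`last_hidden_state` gives one embedding per token; for sequence-level tasks such as classification, a common approach is mask-aware mean pooling into a single fixed-length vector. A minimal sketch using dummy tensors in place of real model outputs (the hidden size of 768 is an assumption for illustration):

```python
import torch

def mean_pool(last_hidden_state: torch.Tensor,
              attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings, ignoring padded positions."""
    mask = attention_mask.unsqueeze(-1).float()     # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(dim=1)  # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)        # avoid division by zero
    return summed / counts

# Dummy stand-ins for outputs.last_hidden_state and inputs["attention_mask"]
hidden = torch.randn(2, 16, 768)            # hidden size is assumed
mask = torch.ones(2, 16, dtype=torch.long)
mask[1, 10:] = 0                            # second sequence is padded

pooled = mean_pool(hidden, mask)
print(pooled.shape)  # torch.Size([2, 768])
```

In practice you would pass the real `outputs.last_hidden_state` and `inputs["attention_mask"]` from the snippet above; pooling only over unmasked positions keeps padding from diluting the sequence vector.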

Installation

pip install transformers torch
# Install any additional dependencies for your specific use case

Requirements

  • transformers >= 4.21.0
  • torch >= 1.9.0
  • Python >= 3.7

Citation

If you use NucEL in your research, please cite:

@misc{nucel2025,
  title={NucEL: Single-Nucleotide ELECTRA-Style Genomic Pre-training for Efficient and Interpretable Representations},
  author={Ke Ding and Brian Parker and Jiayu Wen},
  year={2025},
  howpublished={\url{https://huggingface.co/FreakingPotato/NucEL}}
}

License

This model is released under the Apache 2.0 License.
