# NucEL: Single-Nucleotide ELECTRA-Style Genomic Pre-training for Efficient and Interpretable Representations
NucEL is a specialized language model for nucleotide sequence analysis and genomic applications. It produces contextual embeddings for DNA sequences and can be fine-tuned for a range of downstream genomic tasks.
## Model Details

- Model Type: Transformer-based sequence model
- Domain: Genomics and nucleotide sequences
- Architecture: Based on the ModernBERT architecture, optimized for nucleotide sequences
## Features

- Single-nucleotide tokenization and embedding
- Pre-trained on the human genome
- Optimized for biological sequence understanding
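Single-nucleotide tokenization means each base in a DNA sequence maps to its own token. The toy character-level tokenizer below only illustrates that idea; the vocabulary, special tokens, and IDs here are hypothetical and will not match NucEL's actual tokenizer:

```python
# Toy single-nucleotide (character-level) tokenizer.
# NOTE: illustrative only -- NOT NucEL's real vocabulary or token IDs.
VOCAB = {"[CLS]": 0, "[SEP]": 1, "[UNK]": 2, "A": 3, "C": 4, "G": 5, "T": 6}

def tokenize(seq: str) -> list[int]:
    """Map each nucleotide to one token ID, adding [CLS]/[SEP] markers."""
    ids = [VOCAB["[CLS]"]]
    ids += [VOCAB.get(base, VOCAB["[UNK]"]) for base in seq.upper()]
    ids.append(VOCAB["[SEP]"])
    return ids

print(tokenize("ATCG"))  # [0, 3, 6, 4, 5, 1]
```

A sequence of length L thus yields L + 2 tokens, so per-token embeddings line up one-to-one with nucleotide positions.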
## Usage

### Basic Usage
```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load model and tokenizer (trust_remote_code is required for the custom architecture)
model = AutoModel.from_pretrained("FreakingPotato/NucEL", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("FreakingPotato/NucEL", trust_remote_code=True)

# Example DNA sequence
sequence = "ATCGATCGATCGATCG"

# Tokenize and encode
inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Per-nucleotide embeddings: (batch_size, sequence_length, hidden_size)
embeddings = outputs.last_hidden_state
print(f"Sequence embeddings shape: {embeddings.shape}")
```
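The model returns one embedding per nucleotide token. For tasks that need a single fixed-size vector per sequence, a common approach is attention-mask-aware mean pooling over `last_hidden_state`. The sketch below stands alone, using a random tensor in place of real model output; the shapes (including the hidden size of 768) are illustrative assumptions, not NucEL's confirmed dimensions:

```python
import torch

def mean_pool(hidden: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings over the sequence dimension, ignoring padding."""
    mask = mask.unsqueeze(-1).float()        # (batch, seq_len, 1)
    summed = (hidden * mask).sum(dim=1)      # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)  # avoid division by zero
    return summed / counts

# Dummy stand-ins for model(**inputs).last_hidden_state and inputs["attention_mask"]
batch, seq_len, dim = 2, 18, 768
hidden = torch.randn(batch, seq_len, dim)
mask = torch.ones(batch, seq_len, dtype=torch.long)
mask[1, 10:] = 0                             # second sequence is padded after 10 tokens

pooled = mean_pool(hidden, mask)
print(pooled.shape)                          # torch.Size([2, 768])
```

With real model output, pass `outputs.last_hidden_state` and `inputs["attention_mask"]` to `mean_pool` instead of the dummy tensors.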
## Installation

```bash
pip install transformers torch
# Install any additional dependencies for your specific use case
```
## Requirements
- transformers >= 4.21.0
- torch >= 1.9.0
- Python >= 3.7
## Citation

If you use NucEL in your research, please cite:

```bibtex
@misc{nucel2025,
  title={NucEL: Single-Nucleotide ELECTRA-Style Genomic Pre-training for Efficient and Interpretable Representations},
  author={Ke Ding and Brian Parker and Jiayu Wen},
  year={2025},
  howpublished={\url{https://huggingface.co/FreakingPotato/NucEL}}
}
```
## License
This model is released under the Apache 2.0 License.