NucEL: Single-Nucleotide ELECTRA-Style Genomic Pre-training for Efficient and Interpretable Representations

NucEL is an ELECTRA-style language model for nucleotide sequence analysis, pre-trained on DNA at single-nucleotide resolution. It provides embeddings for DNA sequences and can be fine-tuned for a range of downstream genomic tasks.

Model Details

  • Model Type: Transformer-based sequence model
  • Domain: Genomics and nucleotide sequences
  • Architecture: Based on the ModernBERT architecture, adapted for nucleotide sequences
  • Parameters: 92.3M (F32, safetensors)

Features

  • Nucleotide-level tokenization and embedding
  • Pre-trained on the human genome
  • Optimized for biological sequence understanding
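Single-nucleotide tokenization means each base maps to its own token, so embeddings and attributions are available per position. A minimal character-level sketch of the idea (illustrative only; the vocabulary and ids here are assumptions, not the actual NucEL tokenizer, which also adds special tokens):

```python
# Illustrative single-nucleotide tokenization: one token id per base.
# NOTE: this toy vocabulary is an assumption, NOT NucEL's real vocab.
VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3}

def tokenize(sequence: str) -> list[int]:
    """Map a DNA string to one token id per nucleotide."""
    return [VOCAB[base] for base in sequence.upper()]

print(tokenize("ATCG"))  # one id per base -> [0, 3, 1, 2]
```

Because tokens align one-to-one with nucleotides, the model's output sequence length matches the input DNA length (plus any special tokens), which is what makes per-base interpretation straightforward.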

Usage

Basic Usage

from transformers import AutoModel, AutoTokenizer
import torch

# Load model and tokenizer
model = AutoModel.from_pretrained("FreakingPotato/NucEL", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("FreakingPotato/NucEL", trust_remote_code=True)
model.eval()

# Example DNA sequence
sequence = "ATCGATCGATCGATCG"

# Tokenize and run a forward pass without gradients
inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Per-token (per-nucleotide) embeddings
embeddings = outputs.last_hidden_state
print(f"Sequence embeddings shape: {embeddings.shape}")
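`last_hidden_state` gives one embedding per token; for sequence-level tasks such as classification, a common approach is mask-aware mean pooling into a single fixed-length vector. A minimal sketch using dummy tensors in place of real model outputs (the hidden size of 768 is an assumption for illustration):

```python
import torch

def mean_pool(last_hidden_state: torch.Tensor,
              attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings, ignoring padded positions."""
    mask = attention_mask.unsqueeze(-1).float()     # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(dim=1)  # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)        # avoid division by zero
    return summed / counts

# Dummy stand-ins for outputs.last_hidden_state and inputs["attention_mask"]
hidden = torch.randn(2, 16, 768)            # hidden size is assumed
mask = torch.ones(2, 16, dtype=torch.long)
mask[1, 10:] = 0                            # second sequence is padded

pooled = mean_pool(hidden, mask)
print(pooled.shape)  # torch.Size([2, 768])
```

In practice you would pass the real `outputs.last_hidden_state` and `inputs["attention_mask"]` from the snippet above; pooling only over unmasked positions keeps padding from diluting the sequence vector.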

Installation

pip install transformers torch
# Install any additional dependencies for your specific use case

Requirements

  • transformers >= 4.21.0
  • torch >= 1.9.0
  • Python >= 3.7

Citation

If you use NucEL in your research, please cite:

@misc{nucel2025,
  title={NucEL: Single-Nucleotide ELECTRA-Style Genomic Pre-training for Efficient and Interpretable Representations},
  author={Ke Ding and Brian Parker and Jiayu Wen},
  year={2025},
  howpublished={\url{https://huggingface.co/FreakingPotato/NucEL}}
}

License

This model is released under the Apache 2.0 License.
