NoLBERT: A Time-Stamped Pre-Trained LLM

NoLBERT (No Lookahead(back) bias Bidirectional Encoder Representations from Transformers) is a foundational transformer-based language model trained on a small, time-restricted dataset to avoid both lookahead and lookback bias. Furthermore, to make the model accessible even on personal machines, we adopt the architecture of DeBERTaV3-base, which has a relatively small number of trainable parameters but performs well on linguistic benchmarks.

Lookahead bias is a fundamental challenge when researchers and practitioners use inferences from language models for forecasting. For example, when we ask a language model to infer the short-term return of a stock given a set of news articles, a concern is that the model may have been trained on data that includes future information beyond the point in time when the news articles were released. As a result, the nature of the task changes from drawing return-related inference from text to retrieving the date of the news articles and the realized returns of the particular stock shortly after that date. Consequently, this approach becomes invalid in practice when using such models to predict stock returns beyond the training data's coverage period. To frame the task as one of natural language inference, we pre-train a new text encoder using data strictly from 1976 to 1995. Therefore, our model exhibits no lookahead bias when backtesting trading strategies using data from 1996 onward or when performing other time series forecasting tasks using text data.

Another key feature of our model is that it also avoids lookback bias. In particular, after pre-training, the numerical representation provided by any model reflects a snapshot in time (although the exact time may not be well-defined). For example, in the early 1900s, the sentence “She is running a program” likely meant that the person was organizing an event. By contrast, in the late 20th century, the same sentence likely refers to someone executing computer code. Since a model learns from all of its training data to form text representations, if it is trained on data spanning a long time horizon, it becomes unclear which period the final encoded vector represents. In this example, if the model is trained on data from the entire 20th century, the resulting numerical representation may exhibit lookback bias when the intention is to analyze texts from more recent periods. To overcome this, we use a highly restricted time window: all of our model's training data are from 1976 to 1995, and our validation set is strictly from 1996.

Our model is trained on 1 billion words (1–2 billion tokens) from Parliament Q&As, TV show conversations, music lyrics, patents, FOMC documents, public access books, newspapers, election campaign documents, and research papers. The model is based on the base-size DeBERTaV3 architecture and a custom ByteLevelBPETokenizer trained on the same corpus.
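
For reference, a custom byte-level BPE tokenizer of this kind can be trained with the Hugging Face tokenizers library along the lines of the sketch below. This is illustrative only: the corpus paths, minimum frequency, and special tokens are assumptions, with the 30K vocabulary size taken from the benchmark table that follows.

from tokenizers import ByteLevelBPETokenizer

# Hypothetical paths to plain-text shards of the 1976-1995 corpus (not distributed here)
corpus_files = ["corpus/shard-0001.txt", "corpus/shard-0002.txt"]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=corpus_files,
    vocab_size=30_000,  # 30K vocabulary, matching the table below
    min_frequency=2,    # assumed cutoff for merge candidates
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Writes vocab.json and merges.txt, which can later be loaded through AutoTokenizer
tokenizer.save_model("nolbert-tokenizer")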

Our model achieves state-of-the-art performance while using less than 10% of the training data of comparable models:

| Model     | Vocabulary (K) | Backbone #Params (M) | CoLA | SST-2 | QQP  | MNLI | QNLI |
|-----------|----------------|----------------------|------|-------|------|------|------|
| FinBERT   | 30             | 110                  | 0.29 | 0.89  | 0.87 | 0.79 | 0.86 |
| StoriesLM | 30             | 110                  | 0.47 | 0.90  | 0.87 | 0.80 | 0.87 |
| NoLBERT   | 30             | 109                  | 0.43 | 0.91  | 0.91 | 0.82 | 0.89 |
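
For downstream tasks such as the classification benchmarks above, the encoder can be fine-tuned like any BERT-style checkpoint. The sketch below fine-tunes NoLBERT on SST-2 with the standard transformers Trainer; the data pipeline and hyperparameters are illustrative assumptions, not the configuration used to produce the table.

from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint_path = "alikLab/NoLBERT"
tokenizer = AutoTokenizer.from_pretrained(checkpoint_path, use_fast=True)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint_path, num_labels=2)

# SST-2 from the GLUE benchmark (binary sentiment classification)
dataset = load_dataset("glue", "sst2")

def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="nolbert-sst2",
    learning_rate=2e-5,               # assumed hyperparameters
    per_device_train_batch_size=32,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    tokenizer=tokenizer,              # enables dynamic padding via the default collator
)
trainer.train()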

Usage Examples

Masked Language Modeling

from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

# Use GPU if available, otherwise fall back to CPU
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
checkpoint_path = "alikLab/NoLBERT"

model = AutoModelForMaskedLM.from_pretrained(checkpoint_path).to(device)
tokenizer = AutoTokenizer.from_pretrained(checkpoint_path, use_fast=True)

text = "The day after Monday is<mask>."
inputs = tokenizer(text, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model(**inputs)

logits = outputs.logits

# Get the index of the mask token
mask_token_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]

# Get logits for the mask position
mask_logits = logits[0, mask_token_index, :].squeeze()

# Get top 10 predictions
top_10_token_ids = torch.topk(mask_logits, 10).indices
top_10_tokens = [tokenizer.decode(token_id) for token_id in top_10_token_ids]
top_10_probs = torch.softmax(mask_logits, dim=-1)[top_10_token_ids]

print("Top 10 most likely words:")
for i, (token, prob) in enumerate(zip(top_10_tokens, top_10_probs)):
    print(f"{i+1:2d}. {token:<12} (probability: {prob:.4f})")

Getting Text Embeddings

from transformers import AutoTokenizer, AutoModel
import torch

# Use GPU if available, otherwise fall back to CPU
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
checkpoint_path = "alikLab/NoLBERT"

# Use AutoModel instead of AutoModelForMaskedLM to get embeddings
model = AutoModel.from_pretrained(checkpoint_path).to(device)
tokenizer = AutoTokenizer.from_pretrained(checkpoint_path, use_fast=True)

text = "The day after Monday is Tuesday."
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True).to(device)

with torch.no_grad():
    outputs = model(**inputs)

# Get the hidden states
last_hidden_states = outputs.last_hidden_state

# Method 1: Use [CLS] token embedding (first token)
cls_embedding = last_hidden_states[0, 0, :]  # Shape: [hidden_size]

# Method 2: Mean pooling over all tokens (excluding padding)
attention_mask = inputs['attention_mask']
masked_embeddings = last_hidden_states * attention_mask.unsqueeze(-1)
mean_embedding = masked_embeddings.sum(dim=1) / attention_mask.sum(dim=1, keepdim=True)

print(f"CLS embedding shape: {cls_embedding.shape}")
print(f"Mean pooled embedding shape: {mean_embedding.shape}")
print(f"Text: {text}")
print(f"Embedding (first 10 dimensions): {cls_embedding[:10].tolist()}")

Citation

If you use this model in your research, please cite:

@misc{nolbert,
  author = {Ali Kakhbod and Peiyao Li},
  title = {NoLBERT: A Time-Stamped Pre-Trained LLM},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/alikLab/NoLBERT}},
}