# Yearly Word2Vec Embeddings (2005-2025)
Word2Vec models trained on single-year web data from the FineWeb dataset, capturing 21 years of language evolution.
## Overview
This collection enables research into semantic change, concept emergence, and language evolution over time. Each model is trained exclusively on data from a single year, providing precise temporal snapshots of language.
## Dataset: FineWeb

Models are trained on the FineWeb dataset, split into single-year subsets spanning 2005-2025 using the year extracted from each document's URL.
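The year-filtering code is not shipped with the models; a minimal sketch of the idea, assuming the year is parsed from each record's `url` field in the public FineWeb schema (the regex is illustrative, not the authors' actual extraction logic):

```python
import re
from datasets import load_dataset

YEAR_RE = re.compile(r"/(20[0-2][0-9])/")  # four-digit year embedded in the URL path

def iter_year_subset(year):
    """Stream FineWeb and yield documents whose URL contains the target year."""
    ds = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)
    for row in ds:
        match = YEAR_RE.search(row["url"])
        if match and int(match.group(1)) == year:
            yield row["text"]
```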
### Corpus Statistics by Year
Year | Corpus Size | Articles | Vocabulary |
---|---|---|---|
2005 | 2.3 GB | 689,905 | 23,344 |
2006 | 3.3 GB | 1,047,683 | 23,142 |
2007 | 4.5 GB | 1,468,094 | 22,998 |
2008 | 7.0 GB | 2,379,636 | 23,076 |
2009 | 9.3 GB | 3,251,110 | 23,031 |
2010 | 11.6 GB | 4,102,893 | 23,008 |
2011 | 12.5 GB | 4,446,823 | 23,182 |
2012 | 20.0 GB | 7,276,289 | 23,140 |
2013 | 15.7 GB | 5,626,713 | 23,195 |
2014 | 8.7 GB | 2,868,446 | 23,527 |
2015 | 8.7 GB | 2,762,626 | 23,349 |
2016 | 9.4 GB | 2,901,744 | 23,351 |
2017 | 10.1 GB | 3,085,758 | 23,440 |
2018 | 10.4 GB | 3,103,828 | 23,348 |
2019 | 10.9 GB | 3,187,052 | 23,228 |
2020 | 12.9 GB | 3,610,390 | 23,504 |
2021 | 14.3 GB | 3,903,312 | 23,296 |
2022 | 16.5 GB | 4,330,132 | 23,222 |
2023 | 21.6 GB | 5,188,559 | 23,278 |
2024 | 27.9 GB | 6,443,985 | 24,022 |
2025 | 16.6 GB | 3,625,629 | 24,919 |
## Model Architecture
All models use the same Word2Vec architecture with consistent hyperparameters:
- Embedding Dimension: 300
- Window Size: 15
- Min Count: 30
- Max Vocabulary Size: 50,000
- Negative Samples: 15
- Training Epochs: 20
- Workers: 48
- Batch Size: 100,000
- Training Algorithm: Skip-gram with negative sampling
The FineWeb data is processed with Trafilatura extraction, English filtering (language score > 0.65), quality filters, and MinHash deduplication. Training uses 48 workers on multi-core CPU systems.
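A minimal gensim training call matching the hyperparameters above. The corpus path is a placeholder, and mapping "Max Vocabulary Size" to `max_final_vocab` and "Batch Size" to `batch_words` is an assumption; the released models were trained with the authors' own pipeline.

```python
from gensim.models import Word2Vec

model = Word2Vec(
    corpus_file="corpus_2020.txt",  # placeholder: preprocessed single-year corpus, one sentence per line
    vector_size=300,
    window=15,
    min_count=30,
    max_final_vocab=50_000,
    negative=15,
    epochs=20,
    workers=48,
    batch_words=100_000,
    sg=1,  # skip-gram with negative sampling
)
model.wv.save("word2vec_2020.model")  # save only the KeyedVectors, as loaded in Usage below
```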
## Evaluation

Models are evaluated on the WordSim-353 word-similarity benchmark and the Google analogies dataset. More recent years show stronger similarity performance, consistent with their larger corpora.
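gensim ships copies of both test sets, so the protocol can be reproduced roughly as below; this is a sketch, not the script used to produce the published numbers.

```python
from gensim.models import KeyedVectors
from gensim.test.utils import datapath

wv = KeyedVectors.load("word2vec_2020.model")

# WordSim-353: rank correlation between model and human similarity judgments
_, spearman, oov = wv.evaluate_word_pairs(datapath("wordsim353.tsv"))
# Google analogies: accuracy on "a is to b as c is to ?" questions
analogy_score, _ = wv.evaluate_word_analogies(datapath("questions-words.txt"))

print(f"WordSim-353 Spearman: {spearman[0]:.3f} (OOV {oov:.1f}%), analogies: {analogy_score:.3f}")
```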
## Usage
### Installation

```bash
pip install gensim numpy
```
### Basic Usage

```python
from gensim.models import KeyedVectors

# Load a model for a specific year
model_2020 = KeyedVectors.load("word2vec_2020.model")
model_2024 = KeyedVectors.load("word2vec_2024.model")

# Find similar words
print(model_2020.most_similar("covid"))
print(model_2024.most_similar("covid"))

# Compare semantic drift: how much of a word's neighborhood carried over?
word = "technology"
similar_2020 = {w for w, _ in model_2020.most_similar(word, topn=10)}
similar_2024 = {w for w, _ in model_2024.most_similar(word, topn=10)}
print(f"Neighbor overlap for '{word}': {len(similar_2020 & similar_2024)}/10")
```
### Temporal Analysis
```python
# Study semantic drift over time
years = [2005, 2010, 2015, 2020, 2025]
models = {}
for year in years:
    models[year] = KeyedVectors.load(f"word2vec_{year}.model")

# Analyze how a word's meaning changed
word = "smartphone"
for year in years:
    similar = models[year].most_similar(word, topn=5)
    print(f"{year}: {[w for w, s in similar]}")
```
## Interactive Demo
Explore the temporal embeddings interactively and compare how word meanings evolved across years with our visualization tool: https://adameubanks.github.io/embeddings-over-time/
## Model Cards

Individual model cards are available for each year (2005-2025) at: https://huggingface.co/adameubanks/YearlyWord2Vec
## Research Applications
Yearly embeddings enable research in semantic change, cultural shifts, discourse evolution, and concept emergence across time periods.
## Citation
If you use these models in your research, please cite:
```bibtex
@misc{yearly_word2vec_2025,
  title={Yearly Word2Vec Embeddings: Language Evolution from 2005-2025},
  author={Adam Eubanks},
  year={2025},
  url={https://huggingface.co/adameubanks/YearlyWord2Vec},
  note={Trained on FineWeb dataset with single-year segmentation}
}
```
FineWeb Dataset Citation:
```bibtex
@inproceedings{penedo2024the,
  title={The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale},
  author={Guilherme Penedo and Hynek Kydl{\'\i}{\v{c}}ek and Loubna Ben allal and Anton Lozhkov and Margaret Mitchell and Colin Raffel and Leandro Von Werra and Thomas Wolf},
  booktitle={The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2024},
  url={https://openreview.net/forum?id=n6SCkn2QaG}
}
```
## Contributing
Report issues, suggest improvements, or share research findings using these models.
## License
MIT License. See LICENSE for details.