Yearly Word2Vec Embeddings (2005-2025)

Word2Vec models trained on single-year web data from the FineWeb dataset, capturing 21 years of language evolution.

Overview

This collection enables research into semantic change, concept emergence, and language evolution over time. Each model is trained exclusively on data from a single year, providing precise temporal snapshots of language.

Dataset: FineWeb

Models are trained on the FineWeb dataset, with documents assigned to single-year subsets (2005-2025) based on year information extracted from their URLs.
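The exact filtering heuristic is not documented here; a minimal illustrative sketch, assuming the HuggingFaceFW/fineweb dataset on the Hub and a simple path-based regex (both assumptions), might look like:

import re
from datasets import load_dataset

# Hypothetical heuristic: pull a year such as "/2020/" out of the document URL.
YEAR_RE = re.compile(r"/(20[0-2][0-9])/")

def url_year(example):
    match = YEAR_RE.search(example["url"])
    return int(match.group(1)) if match else None

# Stream FineWeb and keep only documents attributable to a single year.
fineweb = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)
fineweb_2020 = fineweb.filter(lambda ex: url_year(ex) == 2020)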

Corpus Statistics by Year

Year Corpus Size Articles Vocabulary
2005 2.3 GB 689,905 23,344
2006 3.3 GB 1,047,683 23,142
2007 4.5 GB 1,468,094 22,998
2008 7.0 GB 2,379,636 23,076
2009 9.3 GB 3,251,110 23,031
2010 11.6 GB 4,102,893 23,008
2011 12.5 GB 4,446,823 23,182
2012 20.0 GB 7,276,289 23,140
2013 15.7 GB 5,626,713 23,195
2014 8.7 GB 2,868,446 23,527
2015 8.7 GB 2,762,626 23,349
2016 9.4 GB 2,901,744 23,351
2017 10.1 GB 3,085,758 23,440
2018 10.4 GB 3,103,828 23,348
2019 10.9 GB 3,187,052 23,228
2020 12.9 GB 3,610,390 23,504
2021 14.3 GB 3,903,312 23,296
2022 16.5 GB 4,330,132 23,222
2023 21.6 GB 5,188,559 23,278
2024 27.9 GB 6,443,985 24,022
2025 16.6 GB 3,625,629 24,919

Model Architecture

All models use the same Word2Vec architecture with consistent hyperparameters (a training sketch follows the list):

  • Embedding Dimension: 300
  • Window Size: 15
  • Min Count: 30
  • Max Vocabulary Size: 50,000
  • Negative Samples: 15
  • Training Epochs: 20
  • Workers: 48
  • Batch Size: 100,000
  • Training Algorithm: Skip-gram with negative sampling
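A minimal gensim training sketch under these settings is shown below. The mapping of "Max Vocabulary Size" to max_final_vocab, of "Batch Size" to batch_words, and the corpus file path are assumptions, not the authors' exact script:

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Placeholder corpus: one whitespace-tokenized sentence per line (hypothetical path).
corpus_2020 = LineSentence("fineweb_2020.txt")

model = Word2Vec(
    sentences=corpus_2020,
    vector_size=300,         # Embedding Dimension
    window=15,               # Window Size
    min_count=30,            # Min Count
    max_final_vocab=50_000,  # assumed mapping for "Max Vocabulary Size"
    negative=15,             # Negative Samples
    epochs=20,               # Training Epochs
    workers=48,              # Workers
    batch_words=100_000,     # assumed mapping for "Batch Size"
    sg=1,                    # Skip-gram with negative sampling
)
model.wv.save("word2vec_2020.model")  # saved vectors load via KeyedVectors.load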

FineWeb data is processed with Trafilatura extraction, English-language filtering (score > 0.65), additional quality filters, and MinHash deduplication. Training uses 48 workers on multi-core CPU systems.
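The preprocessing pipeline itself is not reproduced here. A rough sketch of the extraction, language-filtering, and deduplication stages, assuming the trafilatura and datasketch libraries and an assumed LSH similarity threshold of 0.8:

import trafilatura
from datasketch import MinHash, MinHashLSH

lsh = MinHashLSH(threshold=0.8, num_perm=128)  # threshold is an assumption

def minhash_signature(text):
    m = MinHash(num_perm=128)
    for token in set(text.split()):
        m.update(token.encode("utf-8"))
    return m

def keep_document(doc_id, html, english_score):
    text = trafilatura.extract(html)           # main-content extraction
    if text is None or english_score <= 0.65:  # English filter from above
        return None
    sig = minhash_signature(text)
    if lsh.query(sig):                         # near-duplicate of a kept document
        return None
    lsh.insert(doc_id, sig)
    return text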

Evaluation

Models are evaluated on the WordSim-353 word-similarity benchmark and the Google word-analogy dataset. Years with larger corpora tend to show stronger similarity performance.
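Both benchmarks ship with gensim's bundled test data, so a comparable per-year check can be sketched as follows (output values are illustrative, not the reported results):

from gensim.models import KeyedVectors
from gensim.test.utils import datapath

model = KeyedVectors.load("word2vec_2020.model")

# WordSim-353: correlation between model and human similarity judgments.
pearson, spearman, oov_ratio = model.evaluate_word_pairs(datapath("wordsim353.tsv"))
print("WordSim-353 Spearman:", spearman, "OOV %:", oov_ratio)

# Google analogies: overall accuracy plus a per-section breakdown.
accuracy, sections = model.evaluate_word_analogies(datapath("questions-words.txt"))
print("Analogy accuracy:", accuracy)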

Usage

Installation

pip install gensim numpy

Loading Models

from gensim.models import KeyedVectors

# Load a model for a specific year
model_2020 = KeyedVectors.load("word2vec_2020.model")
model_2024 = KeyedVectors.load("word2vec_2024.model")

# Find similar words
print(model_2020.most_similar("covid"))
print(model_2024.most_similar("covid"))

# Compare semantic drift via neighbor overlap
word = "technology"
neighbors_2020 = {w for w, _ in model_2020.most_similar(word, topn=10)}
neighbors_2024 = {w for w, _ in model_2024.most_similar(word, topn=10)}
print(neighbors_2020 & neighbors_2024)  # neighbors stable across years

Temporal Analysis

from gensim.models import KeyedVectors

# Study semantic drift over time
years = [2005, 2010, 2015, 2020, 2025]
models = {}

for year in years:
    models[year] = KeyedVectors.load(f"word2vec_{year}.model")

# Analyze how a word's meaning changed
word = "smartphone"
for year in years:
    if word not in models[year].key_to_index:  # may be OOV in early years
        print(f"{year}: (not in vocabulary)")
        continue
    similar = models[year].most_similar(word, topn=5)
    print(f"{year}: {[w for w, s in similar]}")

🚀 Interactive Demo

Explore temporal embeddings interactively!

https://adameubanks.github.io/embeddings-over-time/

Compare how word meanings evolved across different years with our interactive visualization tool.

Model Cards

Individual model cards are available for each year (2005-2025) at: https://huggingface.co/adameubanks/YearlyWord2Vec

Research Applications

Yearly embeddings enable research in semantic change, cultural shifts, discourse evolution, and concept emergence across time periods.

Citation

If you use these models in your research, please cite:

@misc{yearly_word2vec_2025,
  title={Yearly Word2Vec Embeddings: Language Evolution from 2005-2025},
  author={Adam Eubanks},
  year={2025},
  url={https://huggingface.co/adameubanks/YearlyWord2Vec},
  note={Trained on FineWeb dataset with single-year segmentation}
}

FineWeb Dataset Citation:

@inproceedings{penedo2024the,
  title={The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale},
  author={Guilherme Penedo and Hynek Kydl{\'\i}{\v{c}}ek and Loubna Ben Allal and Anton Lozhkov and Margaret Mitchell and Colin Raffel and Leandro Von Werra and Thomas Wolf},
  booktitle={The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2024},
  url={https://openreview.net/forum?id=n6SCkn2QaG}
}

Contributing

Report issues, suggest improvements, or share research findings using these models.

License

MIT License. See LICENSE for details.
