Yearly Word2Vec Embeddings (2005-2025)

Word2Vec models trained on single-year web data from the FineWeb dataset, capturing 21 years of language evolution.

Overview

This collection enables research into semantic change, concept emergence, and language evolution over time. Each model is trained exclusively on data from a single year, providing precise temporal snapshots of language.

Dataset: FineWeb

Models are trained on the FineWeb dataset, with documents assigned to single-year subsets (2005-2025) based on year information extracted from their URLs.
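The exact filtering heuristic is not documented here; a minimal illustrative sketch, assuming the HuggingFaceFW/fineweb dataset on the Hub and a simple path-based regex (both assumptions), might look like:

import re
from datasets import load_dataset

# Hypothetical heuristic: pull a year such as "/2020/" out of the document URL.
YEAR_RE = re.compile(r"/(20[0-2][0-9])/")

def url_year(example):
    match = YEAR_RE.search(example["url"])
    return int(match.group(1)) if match else None

# Stream FineWeb and keep only documents attributable to a single year.
fineweb = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)
fineweb_2020 = fineweb.filter(lambda ex: url_year(ex) == 2020)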

Corpus Statistics by Year

Year Corpus Size Articles Vocabulary
2005 2.3 GB 689,905 23,344
2006 3.3 GB 1,047,683 23,142
2007 4.5 GB 1,468,094 22,998
2008 7.0 GB 2,379,636 23,076
2009 9.3 GB 3,251,110 23,031
2010 11.6 GB 4,102,893 23,008
2011 12.5 GB 4,446,823 23,182
2012 20.0 GB 7,276,289 23,140
2013 15.7 GB 5,626,713 23,195
2014 8.7 GB 2,868,446 23,527
2015 8.7 GB 2,762,626 23,349
2016 9.4 GB 2,901,744 23,351
2017 10.1 GB 3,085,758 23,440
2018 10.4 GB 3,103,828 23,348
2019 10.9 GB 3,187,052 23,228
2020 12.9 GB 3,610,390 23,504
2021 14.3 GB 3,903,312 23,296
2022 16.5 GB 4,330,132 23,222
2023 21.6 GB 5,188,559 23,278
2024 27.9 GB 6,443,985 24,022
2025 16.6 GB 3,625,629 24,919

Model Architecture

All models use the same Word2Vec architecture with consistent hyperparameters (a training sketch follows the list):

  • Embedding Dimension: 300
  • Window Size: 15
  • Min Count: 30
  • Max Vocabulary Size: 50,000
  • Negative Samples: 15
  • Training Epochs: 20
  • Workers: 48
  • Batch Size: 100,000
  • Training Algorithm: Skip-gram with negative sampling
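A minimal gensim training sketch under these settings is shown below. The mapping of "Max Vocabulary Size" to max_final_vocab, of "Batch Size" to batch_words, and the corpus file path are assumptions, not the authors' exact script:

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Placeholder corpus: one whitespace-tokenized sentence per line (hypothetical path).
corpus_2020 = LineSentence("fineweb_2020.txt")

model = Word2Vec(
    sentences=corpus_2020,
    vector_size=300,         # Embedding Dimension
    window=15,               # Window Size
    min_count=30,            # Min Count
    max_final_vocab=50_000,  # assumed mapping for "Max Vocabulary Size"
    negative=15,             # Negative Samples
    epochs=20,               # Training Epochs
    workers=48,              # Workers
    batch_words=100_000,     # assumed mapping for "Batch Size"
    sg=1,                    # Skip-gram with negative sampling
)
model.wv.save("word2vec_2020.model")  # saved vectors load via KeyedVectors.load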

FineWeb data is processed with Trafilatura extraction, English-language filtering (score > 0.65), additional quality filters, and MinHash deduplication. Training uses 48 workers on multi-core CPU systems.
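The preprocessing pipeline itself is not reproduced here. A rough sketch of the extraction, language-filtering, and deduplication stages, assuming the trafilatura and datasketch libraries and an assumed LSH similarity threshold of 0.8:

import trafilatura
from datasketch import MinHash, MinHashLSH

lsh = MinHashLSH(threshold=0.8, num_perm=128)  # threshold is an assumption

def minhash_signature(text):
    m = MinHash(num_perm=128)
    for token in set(text.split()):
        m.update(token.encode("utf-8"))
    return m

def keep_document(doc_id, html, english_score):
    text = trafilatura.extract(html)           # main-content extraction
    if text is None or english_score <= 0.65:  # English filter from above
        return None
    sig = minhash_signature(text)
    if lsh.query(sig):                         # near-duplicate of a kept document
        return None
    lsh.insert(doc_id, sig)
    return text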

Evaluation

Models are evaluated on the WordSim-353 word-similarity benchmark and the Google word-analogy dataset. Years with larger corpora tend to show stronger similarity performance.
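Both benchmarks ship with gensim's bundled test data, so a comparable per-year check can be sketched as follows (output values are illustrative, not the reported results):

from gensim.models import KeyedVectors
from gensim.test.utils import datapath

model = KeyedVectors.load("word2vec_2020.model")

# WordSim-353: correlation between model and human similarity judgments.
pearson, spearman, oov_ratio = model.evaluate_word_pairs(datapath("wordsim353.tsv"))
print("WordSim-353 Spearman:", spearman, "OOV %:", oov_ratio)

# Google analogies: overall accuracy plus a per-section breakdown.
accuracy, sections = model.evaluate_word_analogies(datapath("questions-words.txt"))
print("Analogy accuracy:", accuracy)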

Usage

Installation

pip install gensim numpy

Loading Models

from gensim.models import KeyedVectors

# Load a model for a specific year
model_2020 = KeyedVectors.load("word2vec_2020.model")
model_2024 = KeyedVectors.load("word2vec_2024.model")

# Find similar words
print(model_2020.most_similar("covid"))
print(model_2024.most_similar("covid"))

# Compare semantic drift via neighbor overlap
word = "technology"
neighbors_2020 = {w for w, _ in model_2020.most_similar(word, topn=10)}
neighbors_2024 = {w for w, _ in model_2024.most_similar(word, topn=10)}
print(neighbors_2020 & neighbors_2024)  # neighbors stable across years

Temporal Analysis

from gensim.models import KeyedVectors

# Study semantic drift over time
years = [2005, 2010, 2015, 2020, 2025]
models = {}

for year in years:
    models[year] = KeyedVectors.load(f"word2vec_{year}.model")

# Analyze how a word's meaning changed
word = "smartphone"
for year in years:
    if word not in models[year].key_to_index:  # may be OOV in early years
        print(f"{year}: (not in vocabulary)")
        continue
    similar = models[year].most_similar(word, topn=5)
    print(f"{year}: {[w for w, s in similar]}")

🚀 Interactive Demo

Explore temporal embeddings interactively!

https://adameubanks.github.io/embeddings-over-time/

Compare how word meanings evolved across different years with our interactive visualization tool.

Model Cards

Individual model cards are available for each year (2005-2025) at: https://huggingface.co/adameubanks/YearlyWord2Vec

Research Applications

Yearly embeddings enable research in semantic change, cultural shifts, discourse evolution, and concept emergence across time periods.

Citation

If you use these models in your research, please cite:

@misc{yearly_word2vec_2025,
  title={Yearly Word2Vec Embeddings: Language Evolution from 2005-2025},
  author={Adam Eubanks},
  year={2025},
  url={https://huggingface.co/adameubanks/YearlyWord2Vec},
  note={Trained on FineWeb dataset with single-year segmentation}
}

FineWeb Dataset Citation:

@inproceedings{penedo2024the,
  title={The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale},
  author={Guilherme Penedo and Hynek Kydl{\'\i}{\v{c}}ek and Loubna Ben Allal and Anton Lozhkov and Margaret Mitchell and Colin Raffel and Leandro Von Werra and Thomas Wolf},
  booktitle={The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2024},
  url={https://openreview.net/forum?id=n6SCkn2QaG}
}

Contributing

Report issues, suggest improvements, or share research findings using these models.

License

MIT License. See LICENSE for details.
