Let's see JEPA in action: a simplified image-based implementation that trains on a CPU, with live preview support - very satisfying to watch :)
I-JEPA is the image-based version of JEPA (Joint-Embedding Predictive Architecture, an alternative to autoregressive LLM architectures) pioneered by Professor Yann LeCun.
At a high level, I-JEPA predicts the representations of image segments (Target) from the representations of other segments within the same image (Context). It consists of three key components: a context encoder, a target encoder, and a predictor.
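To make those moving parts concrete, here is a minimal PyTorch sketch of one I-JEPA-style training step. It is not the actual implementation: the encoders, patching, masking, and hyperparameters are placeholder assumptions, kept only to show how the context encoder, the EMA-updated target encoder, and the predictor interact.

```python
# Minimal I-JEPA-style training step in PyTorch. The encoders, patching, and
# masking here are simplified placeholders, not the official architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 128
PATCH_DIM = 16 * 16 * 3  # flattened 16x16 RGB patches

def make_encoder():
    # Stand-in for a ViT-style encoder: maps flattened patches to embeddings.
    return nn.Sequential(nn.Linear(PATCH_DIM, EMBED_DIM), nn.GELU(),
                         nn.Linear(EMBED_DIM, EMBED_DIM))

context_encoder = make_encoder()
target_encoder = make_encoder()  # never trained by backprop, only EMA-updated
target_encoder.load_state_dict(context_encoder.state_dict())
for p in target_encoder.parameters():
    p.requires_grad_(False)

predictor = nn.Sequential(nn.Linear(EMBED_DIM, EMBED_DIM), nn.GELU(),
                          nn.Linear(EMBED_DIM, EMBED_DIM))

opt = torch.optim.AdamW(
    list(context_encoder.parameters()) + list(predictor.parameters()), lr=1e-3)

def training_step(patches, context_idx, target_idx, ema=0.996):
    """patches: (batch, num_patches, PATCH_DIM); idx lists pick context/target blocks."""
    ctx = context_encoder(patches[:, context_idx])        # encode visible context
    with torch.no_grad():
        tgt = target_encoder(patches[:, target_idx])      # encode masked-out targets
    # Predict each target representation from a pooled context representation.
    pred = predictor(ctx.mean(dim=1, keepdim=True)).expand_as(tgt)
    loss = F.smooth_l1_loss(pred, tgt)                    # loss lives in embedding space
    opt.zero_grad(); loss.backward(); opt.step()
    # Slowly move the target encoder toward the context encoder (EMA update).
    with torch.no_grad():
        for tp, cp in zip(target_encoder.parameters(), context_encoder.parameters()):
            tp.mul_(ema).add_(cp, alpha=1 - ema)
    return loss.item()

# Example call with random data, just to show the expected shapes.
patches = torch.randn(4, 64, PATCH_DIM)                   # 4 images, 64 patches each
loss = training_step(patches, context_idx=list(range(0, 48)),
                     target_idx=list(range(48, 64)))
```

The point to notice is that the loss is computed between representations rather than pixels, and that the target encoder only follows the context encoder through the EMA update, which is what keeps the prediction targets stable.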
Introducing Fineweb-Edu-Fortified: an enhanced Fineweb-Edu dataset.
This dataset is tailored for NLP tasks and streamlines model training by offering a more refined, deduplicated corpus. It is a good fit for startups and researchers looking for high-quality educational content to train, evaluate, or fine-tune AI models. The dataset is based on the Fineweb-Edu subset of the larger Fineweb dataset and includes:
- Exact-match deduplication across all crawls
- Embeddings for each row using the TaylorAI/bge-micro model
- A count column indicating duplication frequency
- Data from 95 Common Crawl crawls (2013-2024)
- Row count reduced from 1.279B to 0.324B after deduplication
- ~375B tokens (down from 1,320B in Fineweb-Edu)
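If you want to poke at it, here is a hedged loading sketch using the datasets library. The Hub dataset ID, the config handling, and the column names are assumptions based on the description above, so check the dataset card before relying on them.

```python
# Sketch: stream a few rows of Fineweb-Edu-Fortified without downloading everything.
# Dataset ID and column names are assumptions; verify them on the dataset card.
from datasets import load_dataset

ds = load_dataset(
    "airtrain-ai/fineweb-edu-fortified",  # assumed Hub ID; a crawl-specific config may be required
    split="train",
    streaming=True,                       # avoid materializing ~375B tokens locally
)

for row in ds.take(3):
    # Expected fields per the list above: text, an embedding vector, and a count column.
    print(row.get("count"), str(row.get("text", ""))[:100])
```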
Many thanks to the amazing @josh-sematic for his work on this project, the Fineweb/Fineweb-Edu team at Hugging Face for producing the original datasets and for their support during our work on Fineweb-Edu-Fortified, and also thanks to @underspirit for pointing out the reduction in dataset size that could be achieved via deduplication.
After a few attempts, I found that combining the information in this dataset with a good model (like meta-llama/Meta-Llama-3-8B-Instruct) opens the door to a myriad of chat adventures.
Stack:
- Haystack for orchestration
- llamafile to run our model locally
Check out the notebook: https://t.ly/y6jrZ (includes a bonus Mystery Character Quiz)
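For a rough idea of how the two pieces fit together (this is a sketch, not the notebook's exact code): llamafile serves an OpenAI-compatible endpoint locally, and Haystack's OpenAIGenerator can be pointed at it. The endpoint URL, model name, and placeholder API key below are assumptions about a default llamafile launch.

```python
# Rough sketch: point Haystack's OpenAIGenerator at a locally running llamafile,
# which exposes an OpenAI-compatible server (assumed default: http://localhost:8080).
from haystack.components.generators import OpenAIGenerator
from haystack.utils import Secret

generator = OpenAIGenerator(
    api_key=Secret.from_token("sk-no-key-required"),  # llamafile ignores the key
    model="Meta-Llama-3-8B-Instruct",                 # whatever model your llamafile wraps
    api_base_url="http://localhost:8080/v1",
)

result = generator.run(prompt="In one sentence, who is Sherlock Holmes?")
print(result["replies"][0])
```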
Thrilled to introduce Adam-mini, an optimizer that achieves on-par or better performance than AdamW with a 45% to 50% smaller memory footprint. Adam-mini can also achieve 49.5% higher throughput than AdamW on Llama2-7B pre-training.
The design of Adam-mini is inspired by certain Hessian structures we observed in Transformers.
Feel free to try it out! Switch to Adam-mini with the same hyperparameters as AdamW, and it should run with only about half the memory. We hope Adam-mini can help save time, cost, and energy in your tasks!
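As a hypothetical illustration of the drop-in swap (the import path and constructor arguments below are assumptions; check the Adam-mini repository for the exact API):

```python
# Hypothetical sketch of swapping AdamW for Adam-mini on a toy Transformer layer.
# Import path and constructor arguments are assumptions; consult the Adam-mini
# repository for the exact signature before using.
import torch
import torch.nn as nn
from adam_mini import Adam_mini  # assumed package / class name

model = nn.TransformerEncoderLayer(d_model=64, nhead=4)  # toy stand-in for a Transformer

# Before: optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4,
#                                       betas=(0.9, 0.95), weight_decay=0.1)
optimizer = Adam_mini(
    named_parameters=model.named_parameters(),  # Adam-mini partitions params per block
    lr=1e-4,                                    # reuse the same hyperparameters as AdamW
    betas=(0.9, 0.95),
    weight_decay=0.1,
    dim=64,                                     # hidden size of the toy model
    n_heads=4,                                  # attention heads of the toy model
)

# Training-loop usage is unchanged: loss.backward(); optimizer.step(); optimizer.zero_grad()
```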