HowToCollections
AI & ML interests
How To Collections for Learning and Teaching AI - Brain Development
Version 1
Perform a deep-dive synopsis in Markdown describing the datasets and input datasets used by two models in comparison, delving much deeper into recent papers and information on these datasets, and ideally finding the URL where each dataset or dataset paper can be viewed. Fix the article I have written below, starting with the datasets that were used to train the two models:
Language Models
BLOOM sets a new record as one of the most performant and efficient open-access AI models in science!
Comparison of Large Language Models
| Model Name | Model Size (in Parameters) |
|---|---|
| BigScience BLOOM (tr11-176B) | 176 billion |
| GPT-3 | 175 billion |
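To make these parameter counts concrete, here is a rough back-of-the-envelope sketch of the memory needed just to hold the weights (assuming 2 bytes per parameter in fp16; real serving footprints also include activations and, during training, optimizer state):

```python
def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Weight-only memory footprint in GB, e.g. fp16 = 2 bytes/param."""
    return n_params * bytes_per_param / 1e9

# BLOOM (176B) and GPT-3 (175B), fp16 weights only:
print(weight_memory_gb(176e9))  # 352.0 GB
print(weight_memory_gb(175e9))  # 350.0 GB

# int8 quantization (1 byte/param) halves the weight footprint:
print(weight_memory_gb(176e9, bytes_per_param=1))  # 176.0 GB
```

This is why both models require multi-GPU model parallelism (or aggressive quantization) for inference: the weights alone exceed any single accelerator's memory.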
GPT-3 Datasets
- Common Crawl (filtered)
- WebText2
- Books1
- Books2
- English Wikipedia
GPT-3 Datasets - Details
- Common Crawl: A filtered snapshot of web pages crawled from across the internet, updated regularly; the largest single component of GPT-3's training mixture.
- Language Models are Few-Shot Learners by Brown et al. (https://arxiv.org/abs/2005.14165)
- WebText2: An expanded version of WebText, the corpus of outbound links from high-karma Reddit posts that was used to pretrain GPT-2.
- Language Models are Unsupervised Multitask Learners by Radford et al.
- Books1 and Books2: Two internet-based book corpora whose exact contents OpenAI has not disclosed; they are often assumed to resemble BooksCorpus (also known as the Toronto Books Corpus), a collection of over 11,000 free, unpublished books from a variety of genres.
- Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books by Zhu et al. (https://arxiv.org/abs/1506.06724)
- English Wikipedia: A dump of English-language Wikipedia articles, included as a small, high-quality component of the mixture.
- OpenWebText: An open-source recreation of WebText by Gokaslan and Cohen, filtered to remove low-quality and spammy content; note that it was used to train open replications of GPT-2 and GPT-3, not GPT-3 itself.
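As an illustration of the WebText collection heuristic (keep outbound links from Reddit posts with at least 3 karma, using karma as a cheap quality proxy, per the GPT-2 paper), here is a minimal sketch over hypothetical (url, karma) pairs; the example data is made up:

```python
def webtext_candidates(posts, min_karma=3):
    """Keep outbound links from posts meeting the karma threshold,
    deduplicating URLs -- the quality heuristic behind WebText."""
    seen, kept = set(), []
    for url, karma in posts:
        if karma >= min_karma and url not in seen:
            seen.add(url)
            kept.append(url)
    return kept

posts = [
    ("https://example.com/a", 5),   # passes the threshold
    ("https://example.com/b", 1),   # filtered out: karma too low
    ("https://example.com/a", 7),   # filtered out: duplicate URL
]
print(webtext_candidates(posts))  # ['https://example.com/a']
```

The real pipeline of course also involves fetching, extracting, and deduplicating the page text; this sketch only shows the link-selection step.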
BigScience Model
Papers:
- BLOOM: A 176B-Parameter Open-Access Multilingual Language Model (https://arxiv.org/abs/2211.05100)
- Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism (https://arxiv.org/abs/1909.08053)
- 8-bit Optimizers via Block-wise Quantization (https://arxiv.org/abs/2110.02861)
- Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation (https://arxiv.org/abs/2108.12409)
- Other papers related to BigScience
- 217 other models optimized for use with BLOOM
Datasets:
- ROOTS: The 1.6 TB multilingual corpus on which BLOOM was trained, covering 46 natural languages and 13 programming languages.
- The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset by Laurençon et al.
- Note: most of the datasets below are benchmarks used to evaluate BLOOM rather than to train it.
- Universal Dependencies: A collection of annotated corpora for natural language processing in a range of languages, with a focus on dependency parsing.
- WMT 2014: The ninth edition of the Workshop on Statistical Machine Translation; its shared tasks on translating between English and several other languages remain standard machine translation benchmarks.
- The Pile: An 825 GiB English-language corpus of diverse text, curated by EleutherAI from 22 sources across the internet and elsewhere.
- The Pile: An 800GB Dataset of Diverse Text for Language Modeling by Gao et al. (https://arxiv.org/abs/2101.00027)
- HumanEval: A benchmark of 164 hand-written Python programming problems, each with a function signature, docstring, and unit tests, used to evaluate code generation.
- Evaluating Large Language Models Trained on Code by Chen et al. (https://arxiv.org/abs/2107.03374)
- FLORES-101: A benchmark of parallel sentences translated into 101 languages, designed for evaluating multilingual and low-resource machine translation.
- The FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation by Goyal et al.
- CrowS-Pairs: A dataset of sentence pairs designed for measuring social biases (e.g., stereotypes about race, gender, and religion) in masked language models.
- CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models by Nikita Nangia, Clara Vania, Rasika Bhalerao, and Samuel R. Bowman.
- WikiLingua: A cross-lingual abstractive summarization dataset of article/summary pairs in 18 languages, extracted from WikiHow.
- WikiLingua: A New Benchmark Dataset for Cross-Lingual Abstractive Summarization by Faisal Ladhak, Esin Durmus, Claire Cardie, and Kathleen McKeown.
- MTEB: The Massive Text Embedding Benchmark, which aggregates dozens of datasets across eight embedding task types (including classification, clustering, retrieval, and semantic similarity) in many languages.
- MTEB: Massive Text Embedding Benchmark by Niklas Muennighoff, Nouamane Tazi, Loïc Magne, and Nils Reimers.
- xP3: A multilingual collection of prompts and datasets covering 46 languages and 16 NLP tasks, used to finetune BLOOM into the instruction-following BLOOMZ.
- Crosslingual Generalization through Multitask Finetuning by Muennighoff et al.
- DiaBLa: A dataset of bilingual English-French written dialogues, designed for evaluating machine translation in informal dialogue.
- DiaBLa: A Corpus of Bilingual Spontaneous Written Dialogues for Machine Translation by Rachel Bawden et al.
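Code benchmarks like HumanEval, listed above, are scored with the unbiased pass@k estimator from the Codex paper: from n generated samples per problem, of which c pass the unit tests, it estimates the probability that at least one of k samples would pass. A direct transcription:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from 'Evaluating Large Language Models
    Trained on Code': 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Too few failing samples to fill a draw of k, so some draw
        # of k must contain a correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples generated, 3 pass the tests.
print(round(pass_at_k(10, 3, 1), 4))  # 0.3
print(round(pass_at_k(10, 3, 5), 4))  # 0.9167
```

The estimator is computed per problem and averaged over the benchmark; generating n > k samples reduces the variance of the estimate compared to drawing exactly k.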
Dataset Papers with Code
Deep RL ML Strategy
The AI strategies are:
- Language model preparation: supervised fine-tuning on human-generated demonstration text
- Reward model training: a prompts dataset and multiple model generations produce data for humans to rank
- Fine-tuning with a reinforcement learning reward plus a penalty for drifting from the original model's output distribution
- Proximal Policy Optimization (PPO) fine-tuning
- Variations: preference model pretraining
- Use of ranking datasets with sentiment signals (thumbs up/down distributions)
- Online versions that keep collecting feedback
- OpenAI: InstructGPT, where humans generate the language model training text
- DeepMind: advantage actor-critic in Sparrow and GopherCite
- Reward models trained on human preference feedback
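The reward-model step above is typically trained with a pairwise ranking loss: for a given prompt, the model should score the human-preferred completion higher than the rejected one. A framework-free sketch (the loss form follows the InstructGPT paper; the scalar scores here are made-up numbers standing in for reward-model outputs):

```python
import math

def pairwise_ranking_loss(r_chosen: float, r_rejected: float) -> float:
    """-log sigmoid(r_chosen - r_rejected): near zero when the reward
    model already ranks the preferred completion higher, large when the
    ranking is inverted."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# Hypothetical reward-model scores for two completions of one prompt:
print(round(pairwise_ranking_loss(2.0, 0.5), 4))  # 0.2014 (correct ranking, low loss)
print(round(pairwise_ranking_loss(0.5, 2.0), 4))  # 1.7014 (inverted ranking, high loss)
```

Minimizing this loss over many human-ranked pairs is what turns raw comparisons into a scalar reward signal usable by the RL stage.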
For more information on specific techniques and implementations, check out the following resources:
- OpenAI's InstructGPT paper, Training Language Models to Follow Instructions with Human Feedback by Ouyang et al., which details the supervised fine-tuning and RLHF pipeline
- DeepMind's paper Asynchronous Methods for Deep Reinforcement Learning by Mnih et al., which introduced the advantage actor-critic family of algorithms
- Deep Reinforcement Learning from Human Preferences by Christiano et al., which explains the approach of training reward models from human comparisons
- OpenAI's blog posts on fine-tuning and aligning language models with human feedback
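In the PPO fine-tuning stage covered by these resources, the quantity actually optimized is the reward-model score minus a KL-style penalty that keeps the policy close to the original supervised model. A minimal sketch of that reward shaping (the coefficient beta and all numeric values are illustrative, not taken from any specific paper):

```python
def shaped_reward(rm_score: float, logprob_policy: float,
                  logprob_ref: float, beta: float = 0.02) -> float:
    """RLHF-style shaped reward: reward-model score minus
    beta * (log pi(y|x) - log pi_ref(y|x)), penalizing drift
    from the reference (supervised) model."""
    kl_term = logprob_policy - logprob_ref
    return rm_score - beta * kl_term

# Illustrative numbers: the policy assigns its sample a higher logprob
# than the reference model does, so part of the reward-model score is
# paid back as a KL penalty: 1.0 - 0.02 * 3.0 = 0.94
print(shaped_reward(1.0, -12.0, -15.0))
```

Without this penalty the policy can "reward hack" by drifting into degenerate outputs the reward model happens to overrate; the KL term anchors it to the distribution it started from.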