MaskLID: Code-Switching Language Identification through Iterative Masking Paper • 2406.06263 • Published Jun 10 • 5
Article DuckDB: run SQL queries on 50,000+ datasets on the Hugging Face Hub Jun 7, 2023 • 4
CommonCatalog Collection Common Catalog, a dataset with Creative Commons licensed images and machine-generated caption pairs • 8 items • Updated May 16 • 14
The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only Paper • 2306.01116 • Published Jun 1, 2023 • 31
LexC-Gen: Generating Data for Extremely Low-Resource Languages with Large Language Models and Bilingual Lexicons Paper • 2402.14086 • Published Feb 21 • 9
Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model Paper • 2402.07827 • Published Feb 12 • 45
GIRT-Model: Automated Generation of Issue Report Templates Paper • 2402.02632 • Published Feb 4 • 1
GlotLID: Language Identification for Low-Resource Languages Paper • 2310.16248 • Published Oct 24, 2023 • 1
GlotScript: A Resource and Tool for Low Resource Writing System Identification Paper • 2309.13320 • Published Sep 23, 2023 • 1
Analytical Derivation and Comparison of Alarm Similarity Measures Paper • 2003.10600 • Published Mar 24, 2020 • 1
GIRT-Data: Sampling GitHub Issue Report Templates Paper • 2303.09236 • Published Mar 16, 2023 • 1
MenuCraft: Interactive Menu System Design with Large Language Models Paper • 2303.04496 • Published Mar 8, 2023 • 1
Wide-AdGraph: Detecting Ad Trackers with a Wide Dependency Chain Graph Paper • 2004.14826 • Published Apr 29, 2020 • 1
Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages Paper • 2305.12182 • Published May 20, 2023 • 1