A collection of source code from approximately 19,449 repositories, featuring:

- Diverse programming languages, including Python, JavaScript, C++, Java, Go, Rust, and dozens of others
- Metadata: repository names, file paths, detected programming language, licensing information, and file sizes
- High-quality source code files, filtered by line length for easier processing
- Compressed JSONL format, using Zstandard compression
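
As a rough illustration of how Zstandard-compressed JSONL data like this might be read, here is a minimal Python sketch. It assumes the third-party `zstandard` package is installed. The field names (`path`, `language`, `content`) and the line-length limit are hypothetical placeholders, since the actual schema and filtering threshold are not stated here, so check them against the real files before relying on this.

```python
import io
import json
from typing import Iterable, Iterator


def open_zst_jsonl(path: str) -> io.TextIOWrapper:
    """Open a Zstandard-compressed JSONL file as a UTF-8 text stream."""
    import zstandard  # third-party package: pip install zstandard

    raw = open(path, "rb")
    reader = zstandard.ZstdDecompressor().stream_reader(raw)
    return io.TextIOWrapper(reader, encoding="utf-8")


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Parse one JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def filter_by_line_length(
    records: Iterable[dict], max_len: int, text_key: str = "content"
) -> Iterator[dict]:
    """Keep records whose source text has no line longer than max_len.

    text_key is a hypothetical field name for the file contents.
    """
    for record in records:
        text = record.get(text_key, "")
        if all(len(line) <= max_len for line in text.splitlines()):
            yield record


if __name__ == "__main__":
    # In-memory sample standing in for a decompressed file (made-up data).
    sample = [
        '{"path": "a.py", "language": "Python", "content": "x = 1"}',
        '{"path": "b.js", "language": "JavaScript", "content": "' + "a" * 500 + '"}',
    ]
    for rec in filter_by_line_length(iter_records(sample), max_len=100):
        print(rec["path"], rec["language"])
```

With a real file, the same pipeline would be `filter_by_line_length(iter_records(open_zst_jsonl(p)), max_len=...)`, where `p` is the path to one of the compressed files. The stream is decompressed incrementally, so the whole file does not need to fit in memory.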