AI & ML interests

Omni Lingual Models

Recent Activity

ajibawa-2023 
posted an update 3 days ago
view post
Post
3284
Python-Code-Large
Dataset: ajibawa-2023/Python-Code-Large

Python-Code-Large is a large-scale corpus of Python source code comprising more than 2 million rows of Python code. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software engineering automation, and program analysis for the Python ecosystem.

By providing a high-volume, language-specific corpus, Python-Code-Large enables systematic experimentation in Python-focused model training, domain adaptation, and downstream code understanding tasks.

Python-Code-Large addresses the need for a dedicated Python-only dataset at substantial scale, enabling focused research across data science, backend systems, automation, scientific computing, and AI-driven Python environments.
  • 1 reply
·
ajibawa-2023 
posted an update 8 days ago
view post
Post
2532
PHP-Code-Large

Dataset: ajibawa-2023/PHP-Code-Large

PHP-Code-Large is a large-scale corpus of PHP source code comprising more than 12 million lines of PHP code. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software engineering automation, and static program analysis for the PHP ecosystem.

By providing a high-volume, language-specific corpus, PHP-Code-Large enables systematic experimentation in PHP-focused model training, domain adaptation, and downstream code understanding tasks.

PHP-Code-Large addresses the need for a dedicated PHP-only dataset at substantial scale, enabling focused research across backend systems, CMS platforms, APIs, and full-stack PHP environments.
ajibawa-2023 
posted an update 12 days ago
view post
Post
3241
JavaScript-Code-Large
ajibawa-2023/JavaScript-Code-Large

JavaScript-Code-Large is a large-scale corpus of JavaScript source code comprising around 5 million JavaScript files. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software engineering automation, and program analysis for the JavaScript ecosystem.

By providing a high-volume, language-specific corpus, JavaScript-Code-Large enables systematic experimentation in JavaScript-focused model training, domain adaptation, and downstream code understanding tasks.

JavaScript-Code-Large addresses the need for a dedicated JavaScript-only dataset at substantial scale, enabling focused research across frontend, backend, and full-stack JavaScript environments. .
ajibawa-2023 
posted an update 14 days ago
view post
Post
3125
Java-Code-Large ( ajibawa-2023/Java-Code-Large)

Java-Code-Large is a large-scale corpus of publicly available Java source code comprising more than 15 million java codes. The dataset is designed to support research in large language model (LLM) pretraining, code intelligence, software engineering automation, and program analysis.

By providing a high-volume, language-specific corpus, Java-Code-Large enables systematic experimentation in Java-focused model training, domain adaptation, and downstream code understanding tasks.
Locutusque 
posted an update 4 months ago
view post
Post
2770
🚀 AutoXLA - Accelerating Large Models on TPU
AutoXLA is an experimental library that automates the distribution, optimization, and quantization of large language models for TPUs using PyTorch/XLA. It extends the Hugging Face Transformers interface with TPU-aware features such as automatic sharding, custom attention kernels, and quantization-aware loading, making large-scale deployment and training both simpler and faster.
With quantization and Splash Attention kernels, AutoXLA achieves up to 4Ă— speedups over standard Flash Attention implementations, significantly improving throughput for both inference and training workloads.
Whether you’re experimenting with distributed setups (FSDP, 2D, or 3D sharding) or optimizing memory via LanguageModelQuantizer, AutoXLA is built to make scaling LLMs on TPU seamless.
⚠️ Note: This is an experimental repository. Expect rough edges! Please report bugs or unexpected behavior through GitHub issues.
đź”— GitHub Repository: https://github.com/Locutusque/AutoXLA