Shared Task Specification: "Small Models, Big Impact" Building Compact Sinhala & Tamil LLMs (≤ 8 B Parameters)
Task Overview & Objectives
Goal: Foster the development of compact, high-quality LLMs for Sinhala and Tamil through continual pre-training or fine-tuning of open-source models with ≤ 8 billion parameters.
Impact: Empower local NLP research and applications (chatbots, translation, sentiment analysis, educational tools) while lowering computational and storage barriers.
Who Should Participate:
  Students & Academic Teams: Showcase research on model adaptation, data augmentation, and multilingual/multitask training.
  Industry & Startups: Demonstrate practical performance in real-world pipelines; optimise inference speed and resource usage.
Allowed Base Models
Participants must choose one of the following (or any other fully open-source LLM with ≤ 8 B parameters).
Note: Proprietary or closed-license models (e.g., GPT-3 series, Claude) are not allowed.
Data Resources and Evaluation
Training Data (public):
  Sinhala: OSCAR-Sinhala, Wikipedia dumps, Common Crawl subsets.
  Tamil: OSCAR-Tamil, Tamil Wikipedia, CC100-Tamil.
Evaluation: Your LLM will be evaluated using intrinsic and extrinsic measures, as follows:
  Intrinsic evaluation using the perplexity score.
  Extrinsic evaluation using the appropriate MMLU metric. You can use the given MMLU dataset and compare results in zero-shot, few-shot, and fine-tuned settings.
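The two evaluation measures above can be illustrated with a minimal sketch. It assumes perplexity is computed as the exponential of the average negative log-likelihood per token, and that the MMLU metric is plain multiple-choice accuracy. The task description does not spell out either protocol, so both are assumptions, and the function names `perplexity` and `mmlu_accuracy` are hypothetical helpers, not part of any official evaluation script.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from natural-log token probabilities.

    Assumption: PPL = exp(-(1/N) * sum(log p_i)); the official protocol
    (tokenizer, context length, stride) is not specified in the task text.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


def mmlu_accuracy(predicted, gold):
    """Fraction of multiple-choice questions answered correctly."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("predicted and gold must be non-empty and equal length")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


if __name__ == "__main__":
    # A model that assigns probability 0.25 to every token has perplexity 4.
    log_probs = [math.log(0.25)] * 8
    print(round(perplexity(log_probs), 6))
    # Hypothetical predictions vs. gold answers for four questions.
    print(mmlu_accuracy(["A", "C", "B", "D"], ["A", "C", "D", "D"]))
```

Because the average is taken over tokens, perplexity values are only directly comparable between models that share a tokenizer; when comparing zero-shot, few-shot, and fine-tuned settings, the same question set should be scored in each.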
Submission Requirements
Model: HuggingFace-format upload.
Scripts and Notebooks: Should be uploaded to a GitHub or HuggingFace repository.
Technical Report (2-5 pages):
  Training details: data sources, training mechanism, epochs, batch size, learning rates.
  Resource usage: GPU time, list of hardware resources.
  Model evaluation.
  Analysis of strengths/limitations.
Rules & Fairness
Parameter Limit: Strict upper bound of 8 B parameters (model + adapter weights).
Data Usage: Only public/open-license data; no private data and no data scraped from behind a login.
Reproducibility: All code, data-prep scripts, and logs must be publicly accessible by the submission deadline.
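As a rough illustration of the parameter limit above, the sketch below totals base-model and adapter parameters and compares them with the 8 B bound. It assumes LoRA-style adapters, where a rank-r adapter on a d_in x d_out weight matrix adds r * (d_in + d_out) parameters. The rules do not name an adapter method, and the helper names and example shapes here are hypothetical.

```python
from math import prod

PARAM_LIMIT = 8_000_000_000  # 8 B, counted over model + adapter weights


def count_params(shapes):
    """Sum the element counts of all tensors, given a name -> shape mapping."""
    return sum(prod(shape) for shape in shapes.values())


def lora_params(d_in, d_out, rank):
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def within_limit(base_params, adapter_params, limit=PARAM_LIMIT):
    """True if the combined parameter count respects the bound."""
    return base_params + adapter_params <= limit


if __name__ == "__main__":
    # Hypothetical toy model: shapes are illustrative, not a real architecture.
    toy_shapes = {"embed": (32_000, 4_096), "proj": (4_096, 4_096), "bias": (4_096,)}
    base = count_params(toy_shapes)
    # Hypothetical adapter: rank 16 on the single 4096 x 4096 projection.
    adapters = lora_params(4_096, 4_096, rank=16)
    print(base, adapters, within_limit(base, adapters))
```

With a real checkpoint, the same total could be obtained by summing the element counts of the loaded tensors; the point of the sketch is only that adapter weights count toward the bound.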
How to Register & Contact
Registration Form: https://forms.gle/edzfpopVvKkkF6cH8
Contact: [email protected]
Phone: 076 981 1289