gemma-2-9b-it Singlish2English translation model

Model overview

This model is a fine-tuned version of google/gemma-2-9b-it, trained on over 20,000 Singlish-English text pairs from the health coaching (HC) sessions conducted in Singapore.

Custom dataset overview

To enable fine-tuning of open-source foundation LLM models, we curated HC dataset:

First, we collected audios from nearly 90 HC sessions involving three health coaches (Singaporean, Malaysian and European) and 40 Singaporean patients who are not adherent to taking cholesterol-lowering medication. These patients were recruited from three polyclinics in Singapore and received compensation for their time spent participating in our study.
Second, we used a fine-tuned ivabojic/whisper-medium-sing2eng-transcribe model to generate audio transcriptions.
Third, we employed GPT-4o mini to generate Singlish-to-English translations text pairs for these audio transcriptions.

The initial HC training dataset comprised GPT-generated translations for 5,000 original audio segments, each longer than 2 seconds. This dataset was then expanded by applying three additional rephrasing prompts to each original transcript, generating four translations per segment and increasing the total number of samples to 20,000. The HC validation dataset consist of 2,000 samples each, generated using a single prompt for rephrasing.

Table 1: Overview of the custom-created translation datasets.

Name	Samples	Total hours	Avg. duration (s)	Min (s)	Max (s)
HC_train	20,000	94.9	17.1	2.0	378.4
HC_valid	2,000	9.2	16.6	2.0	463.0

Evaluation

Evaluation was conducted on the NSC_{P36_conv} dataset containing 6,000 Singlish-to-English translations text pairs. Performance was measured using BLEU, comparing the fine-tuned model against the off-the-shelf gemma-2-9b-it baseline.

NSC_{P36_conv} bespoke dataset constructed from the Singapore National Speech Corpus (NSC). It is designed to capture the range and richness of Singlish conversational contexts.

Conversational and expressive speech includes:
Part 3: Natural dialogues on everyday topics between Singaporean speakers.
Part 5: Stylized recordings simulating debates, finance-related discussions, and emotional expressions (both positive and negative).
Part 6: Scenario-based dialogues, where speakers engage in topic-driven, semi-scripted interactions covering various themes.

Together, these components make NSC_{P36_conv} - a robust dataset for building translation models for Singlish.

ivabojic
/

gemma-2-9b-it-sing2eng-translate

gemma-2-9b-it Singlish2English translation model

Model overview

Custom dataset overview

Evaluation

Model tree for ivabojic/gemma-2-9b-it-sing2eng-translate