gemma-2-9b-it Singlish2English translation model

Hugging Face

Model overview

This model is a fine-tuned version of google/gemma-2-9b-it, trained on over 20,000 Singlish-English text pairs from the health coaching (HC) sessions conducted in Singapore.


Custom dataset overview

To enable fine-tuning of open-source foundation LLM models, we curated HC dataset:

  • First, we collected audios from nearly 90 HC sessions involving three health coaches (Singaporean, Malaysian and European) and 40 Singaporean patients who are not adherent to taking cholesterol-lowering medication. These patients were recruited from three polyclinics in Singapore and received compensation for their time spent participating in our study.
  • Second, we used a fine-tuned ivabojic/whisper-medium-sing2eng-transcribe model to generate audio transcriptions.
  • Third, we employed GPT-4o mini to generate Singlish-to-English translations text pairs for these audio transcriptions.

The initial HC training dataset comprised GPT-generated translations for 5,000 original audio segments, each longer than 2 seconds. This dataset was then expanded by applying three additional rephrasing prompts to each original transcript, generating four translations per segment and increasing the total number of samples to 20,000. The HC validation dataset consist of 2,000 samples each, generated using a single prompt for rephrasing.

Table 1: Overview of the custom-created translation datasets.

Name Samples Total hours Avg. duration (s) Min (s) Max (s)
HCtrain 20,000 94.9 17.1 2.0 378.4
HCvalid 2,000 9.2 16.6 2.0 463.0

Evaluation

Evaluation was conducted on the NSCP36_conv dataset containing 6,000 Singlish-to-English translations text pairs. Performance was measured using BLEU, comparing the fine-tuned model against the off-the-shelf gemma-2-9b-it baseline.

NSCP36_conv bespoke dataset constructed from the Singapore National Speech Corpus (NSC). It is designed to capture the range and richness of Singlish conversational contexts.

  • Conversational and expressive speech includes:
  • Part 3: Natural dialogues on everyday topics between Singaporean speakers.
  • Part 5: Stylized recordings simulating debates, finance-related discussions, and emotional expressions (both positive and negative).
  • Part 6: Scenario-based dialogues, where speakers engage in topic-driven, semi-scripted interactions covering various themes.

Together, these components make NSCP36_conv - a robust dataset for building translation models for Singlish.

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Model tree for ivabojic/gemma-2-9b-it-sing2eng-translate

Base model

google/gemma-2-9b
Finetuned
(296)
this model