metadata
title: Siswati-English Linguistic Translation Tool
emoji: 🔬
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 5.33.2
app_file: app.py
pinned: false
license: apache-2.0
tags:
  - translation
  - siswati
  - linguistics
  - african-languages
  - nlp
  - research
  - corpus-analysis
  - bantu-languages
  - m2m100
  - multilingual

🔬 Siswati-English Linguistic Translation Tool

A machine translation tool with built-in linguistic analysis, designed for linguists, researchers, and language documentation projects working with Siswati and English.

🌟 Features

🔄 Translation Capabilities

  • Bidirectional Translation: High-quality English ↔ Siswati translation
  • Advanced Model Architecture: Built on M2M100 transformer models
  • Batch Processing: Process multiple texts simultaneously for corpus analysis
  • Real-time Analysis: Instant linguistic metrics and feature detection

📊 Linguistic Analysis

  • Morphological Complexity: Word length, sentence structure analysis
  • Lexical Diversity: Vocabulary richness measurements
  • Language-Specific Features: Siswati agglutination, click consonants, tone markers
  • Translation Ratios: Comparative analysis between source and target languages
  • Statistical Metrics: Character count, word count, sentence segmentation

🔬 Research Tools

  • Translation History: Track and analyze translation patterns over time
  • CSV Export: Research-ready data export for statistical analysis
  • Corpus Management: Batch processing for linguistic corpora
  • Performance Metrics: Processing time and efficiency tracking
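
As a rough illustration of the CSV export idea, the sketch below serializes a list of translation records to CSV text. The `TranslationRecord` fields and the `history_to_csv` helper are hypothetical names chosen for this example; the columns the tool actually exports may differ.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class TranslationRecord:
    """One entry in a translation history (hypothetical schema)."""
    source_text: str
    translated_text: str
    direction: str            # e.g. "en-ss" or "ss-en"
    processing_time_s: float


def history_to_csv(records) -> str:
    """Serialize translation records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=[f.name for f in fields(TranslationRecord)],
        lineterminator="\n",
    )
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buffer.getvalue()
```

Fields that contain commas or quotes are quoted by the `csv` module, so the output can be loaded directly into a spreadsheet or a statistics package.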

🗣️ About Siswati

Siswati (also known as Swati or Swazi) is a Bantu language spoken by approximately 2.3 million people, primarily in:

  • 🇸🇿 Eswatini (Kingdom of Eswatini) - Official language
  • 🇿🇦 South Africa - One of 12 official languages

Linguistic Features

  • Language Family: Niger-Congo → Bantu → Southeast Bantu
  • Script: Latin alphabet
  • Characteristics: Agglutinative morphology, click consonants, tonal
  • ISO Code: ss (ISO 639-1), ssw (ISO 639-3)

🤖 Model Information

This tool uses state-of-the-art transformer models developed by the Data Science for Social Impact Research Group:

  • English → Siswati: dsfsi/en-ss-m2m100-combo
  • Siswati → English: dsfsi/ss-en-m2m100-combo

Both models are based on Meta's M2M100 architecture, fine-tuned specifically for Siswati-English translation pairs.
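
One way to load these checkpoints outside the app is the Hugging Face `transformers` translation pipeline. This is a minimal sketch, not the code in `app.py`: the `load_translator` and `translate` helpers are hypothetical, and it assumes the fine-tuned tokenizers keep the standard M2M100 language codes `en` and `ss`.

```python
# Model identifiers as listed above, keyed by (source, target) language code.
MODEL_IDS = {
    ("en", "ss"): "dsfsi/en-ss-m2m100-combo",
    ("ss", "en"): "dsfsi/ss-en-m2m100-combo",
}


def load_translator(src_lang: str, tgt_lang: str):
    """Build a translation pipeline for one direction (downloads the model)."""
    # Imported lazily so this module can be imported without transformers installed.
    from transformers import pipeline

    model_id = MODEL_IDS[(src_lang, tgt_lang)]
    # Assumption: the checkpoints accept the usual M2M100 codes "en" and "ss".
    return pipeline("translation", model=model_id, src_lang=src_lang, tgt_lang=tgt_lang)


def translate(translator, text: str) -> str:
    """Run a translation pipeline on one string and return the translated text."""
    return translator(text)[0]["translation_text"]
```

A typical call would be `translate(load_translator("en", "ss"), "Good morning")`. The first call downloads the model weights, so it needs network access and some disk space.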

🎯 Use Cases

For Linguists & Researchers

  • Language Documentation: Analyze translation patterns and linguistic features
  • Corpus Studies: Process large text collections with batch translation
  • Comparative Analysis: Study morphological and syntactic differences
  • Quality Assessment: Evaluate translation adequacy and fluency

For Educators & Students

  • Language Learning: Understand translation patterns and linguistic structures
  • Academic Research: Export data for statistical analysis and publications
  • Computational Linguistics: Study machine translation for low-resource languages

For Community & Cultural Projects

  • Language Preservation: Support Siswati language documentation efforts
  • Cultural Exchange: Facilitate communication between English and Siswati speakers
  • Content Translation: Assist in translating educational and cultural materials

🚀 Getting Started

  1. Single Translation: Enter text and select translation direction
  2. Batch Processing: Upload .txt files or paste multiple lines for corpus analysis
  3. Analysis Export: Use the research tools to export translation data as CSV
  4. Linguistic Study: Explore the real-time analysis features for detailed insights

📈 Linguistic Metrics Explained

Text Complexity

  • Word Count: Total number of words in the text
  • Character Count: Total characters including spaces and punctuation
  • Sentence Count: Number of sentences detected
  • Average Word Length: Mean character length per word
  • Lexical Diversity: Ratio of unique words to total words (vocabulary richness)
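
The metrics above can be approximated in a few lines of Python. This sketch assumes whitespace tokenization and a simple punctuation-based sentence split; the tool's own tokenizer and sentence splitter may differ, and `text_complexity` is a hypothetical helper name.

```python
import re
import string


def text_complexity(text: str) -> dict:
    """Compute basic text-complexity metrics for one text."""
    # Words: whitespace-separated tokens with surrounding punctuation stripped.
    words = [w.strip(string.punctuation) for w in text.split()]
    words = [w for w in words if w]
    # Sentences: non-empty chunks between runs of ., ! or ? (a rough heuristic).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    unique_words = {w.lower() for w in words}
    return {
        "word_count": len(words),
        "char_count": len(text),  # includes spaces and punctuation
        "sentence_count": len(sentences),
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "lexical_diversity": len(unique_words) / len(words) if words else 0.0,
    }
```

Lexical diversity computed this way is a type-token ratio, which tends to fall as texts get longer, so it is best compared across texts of similar length.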

Translation Analysis

  • Word Ratio: Target word count / Source word count
  • Character Ratio: Target character count / Source character count
  • Processing Time: Time taken for model inference
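
A minimal sketch of the ratio and timing calculations follows, assuming whitespace-separated words. `translation_ratios` and `timed` are hypothetical helpers, and `timed` can wrap any translation callable.

```python
import time


def translation_ratios(source: str, target: str) -> dict:
    """Return target/source ratios for word count and character count."""
    src_words = len(source.split())
    tgt_words = len(target.split())
    return {
        # Guard against empty source text to avoid division by zero.
        "word_ratio": tgt_words / src_words if src_words else 0.0,
        "char_ratio": len(target) / len(source) if source else 0.0,
    }


def timed(fn, *args):
    """Call fn(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start
```

Because Siswati is agglutinative, an English to Siswati translation would usually be expected to give a word ratio below 1 even when the character ratio is close to or above 1.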

Siswati-Specific Features

  • Agglutination Detection: Identification of potentially agglutinated words (>10 characters)
  • Click Consonants: Count of clicks (c, q, x sounds)
  • Tone Markers: Detection of acute (◌́) and grave (◌̀) accent marks
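
These three heuristics are purely orthographic and can be sketched as below. The function name `siswati_features` and the default threshold parameter are assumptions for illustration. The sketch counts click letters rather than phonetic clicks, and it treats any word longer than 10 characters as potentially agglutinated.

```python
import string
import unicodedata

CLICK_LETTERS = set("cqx")
TONE_MARKS = {"\u0301", "\u0300"}  # combining acute and combining grave accents


def siswati_features(text: str, long_word_threshold: int = 10) -> dict:
    """Count orthographic cues for agglutination, clicks, and tone marking."""
    words = [w.strip(string.punctuation) for w in text.split()]
    long_words = [w for w in words if len(w) > long_word_threshold]
    click_count = sum(1 for ch in text.lower() if ch in CLICK_LETTERS)
    # NFD splits precomposed letters such as "á" into a base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    tone_count = sum(1 for ch in decomposed if ch in TONE_MARKS)
    return {
        "long_words": long_words,
        "click_letters": click_count,
        "tone_marks": tone_count,
    }
```

Standard Siswati orthography rarely marks tone, so a tone-mark count of zero is the common case for everyday text.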

📚 Academic Usage

If you use this tool in your research, please cite the original models:

@misc{dsfsi-siswati-translation,
  title={Siswati-English Translation Models},
  author={Marivate, Vukosi and Lastrucci, Richard},
  year={2024},
  publisher={Data Science for Social Impact Research Group},
  url={https://github.com/dsfsi/}
}

🔗 Related Resources

🀝 Contributing

We welcome contributions from the linguistic and NLP communities! Areas of interest:

  • Improving translation quality
  • Adding more linguistic analysis features
  • Expanding to other African languages
  • Enhancing the user interface for research workflows

📄 License

This project is licensed under the Apache 2.0 License. The underlying models may have their own licensing terms - please check the individual model repositories.

🌍 Supporting African Languages

This tool is part of a broader effort to support African language technology and computational linguistics research. By providing advanced NLP tools for Siswati, we aim to:

  • Preserve and promote African languages in the digital age
  • Support linguistic research and documentation
  • Enable better communication across language barriers
  • Contribute to the development of multilingual AI systems

Built with ❤️ for the African NLP community