one-way-polyglot-15m-tied

A one-way polyglot language model trained to understand Japanese but generate only English.

Model Details

  • Architecture: LLaMA-based transformer
  • Parameters: 22,025,088 (22.0M)
  • Vocabulary: 16,384 tokens (bilingual SentencePiece)
  • Context Length: 512 tokens
  • Embedding Strategy: Tied
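
With tied embeddings, the LM head reuses the token embedding matrix rather than learning a separate output projection. A minimal sketch of how to check this after loading the checkpoint (repository id as in the Usage section below; the printed value is the expected one from the details above):

from transformers import LlamaForCausalLM

model = LlamaForCausalLM.from_pretrained("one-way-polyglot-15m-tied")

# Tied embeddings: the output projection shares its weight tensor with the
# input embedding table, so it adds no extra parameters.
assert model.get_input_embeddings().weight is model.get_output_embeddings().weight
print(model.config.vocab_size)  # 16384, per the details above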

Capabilities

  • Semantic Transfer: Understands Japanese input and generates contextually appropriate English
  • One-Way Constraint: Strong bias toward English-only generation
  • Name Transliteration: Can transliterate Japanese names to English (context-dependent)

Training Data

Trained on bilingual Japanese-English story data with masked loss on Japanese prefixes to enforce one-way generation.
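
The masking can be sketched as follows: labels copy the input ids, with positions covering the Japanese prefix set to -100 (the ignore index for cross-entropy), so only the English continuation contributes to the loss. This is a minimal illustration, not the actual training script; the example sentences and the absence of special tokens are assumptions.

import torch
from transformers import LlamaForCausalLM, AutoTokenizer

model = LlamaForCausalLM.from_pretrained("one-way-polyglot-15m-tied")
tokenizer = AutoTokenizer.from_pretrained("one-way-polyglot-15m-tied")

# Japanese prefix followed by its English continuation (illustrative pair)
ja_ids = tokenizer("昔々、赤い傘を持った少女がいました。", add_special_tokens=False)["input_ids"]
en_ids = tokenizer("Once upon a time, there was a girl with a red umbrella.", add_special_tokens=False)["input_ids"]

input_ids = torch.tensor([ja_ids + en_ids])
labels = input_ids.clone()
labels[:, :len(ja_ids)] = -100  # no loss on the Japanese prefix

loss = model(input_ids=input_ids, labels=labels).loss  # gradient flows only from English tokens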

Usage

from transformers import LlamaForCausalLM, AutoTokenizer

model = LlamaForCausalLM.from_pretrained("one-way-polyglot-15m-tied")
tokenizer = AutoTokenizer.from_pretrained("one-way-polyglot-15m-tied")

# Japanese input β†’ English output (primary use case)
prompt = "ζ˜”γ€…γ€θ΅€γ„ε‚˜γ‚’ζŒγ£γŸε°‘ε₯³γŒγ„γΎγ—γŸγ€‚"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# Mixed-language name transliteration
prompt = "ε€ͺιƒŽγ―ε…¬εœ’γ§θŠ±ε­γ¨ιŠγ‚“γ§γ„γΎγ—γŸγ€‚After playing, Taro told Hanako that"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# English-only input (handled via the tokenizer's case folding)
prompt = "Hello World"  # automatically normalized to lowercase
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Tokenizer Features

  • βœ… Case Folding: "Hello", "hello", and "HELLO" produce identical tokenization
  • βœ… Japanese Support: Full Japanese text support with proper normalization
  • βœ… No UNK Tokens: Proper handling of uppercase/lowercase English text
  • βœ… SentencePiece Compatibility: Built using proper Unigram model with normalization

Model Variants

This is part of a series exploring one-way polyglot capabilities:

  • 1.25M parameters (tied embeddings)
  • 8.5M parameters (tied embeddings)
  • 12.7M parameters (untied embeddings)
  • 15.7M parameters (tied embeddings)

License

Apache 2.0
