---
library_name: transformers
tags:
  - machine translation
  - english-german
  - english
  - german
  - bilingual
license: apache-2.0
datasets:
  - rewicks/english-german-data
language:
  - en
  - de
pipeline_tag: translation
---

Model Card for English-German MarianNMT Translation Model

This model is a simple bilingual English-German machine translation model trained with MarianNMT. It was converted to Hugging Face format using scripts derived from the Helsinki-NLP group. We collected most of the datasets listed via mtdata and filtered them. The processed data is also available on Hugging Face.
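The processed data can be loaded directly with the 🤗 datasets library. The snippet below is a minimal sketch for inspecting it; it only assumes the dataset ID listed in the metadata above and does not assume particular split or column names, which are simply printed.

from datasets import load_dataset

# Load the processed English-German data from the Hub.
ds = load_dataset("rewicks/english-german-data")

# Show the available splits and column names, then one example row.
print(ds)
first_split = next(iter(ds.values()))
print(first_split[0])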

We trained these models in order to develop a new ensembling algorithm. Agreement-Based Ensembling is an inference-time-only algorithm that allows for ensembling models with different vocabularies, without the need to learn additional parameters or alter the underlying models. Instead, the algorithm ensures that tokens generated by the ensembled models agree in their surface form. For more information, please check out our code available on GitHub, or read our paper on arXiv.
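To make the constraint concrete, the toy snippet below shows the surface-form agreement check at the heart of the idea: two partial hypotheses are compatible only if one detokenized string is a prefix of the other. This is only an illustrative sketch, not the full ensembling procedure, and the two tokenizer IDs are placeholder public models chosen solely because their vocabularies differ.

from transformers import AutoTokenizer

# Placeholder models: any two translation models with different vocabularies
# could stand in here; these are not the models from the paper.
tok_a = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
tok_b = AutoTokenizer.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")

sentence = "Die Quantenverschränkung überrascht Physiker."

# Each vocabulary produces its own segmentation of the same surface string.
print(tok_a.tokenize(sentence))
print(tok_b.tokenize(sentence))

def surfaces_agree(partial_a: str, partial_b: str) -> bool:
    # Two partial hypotheses agree if one detokenized string is a prefix of the other.
    return partial_a.startswith(partial_b) or partial_b.startswith(partial_a)

# During ensembled decoding, hypotheses are only extended while their
# detokenized prefixes remain compatible, regardless of how each model
# segmented them.
print(surfaces_agree("Das Wetter ist", "Das Wetter"))  # True: compatible prefixes
print(surfaces_agree("Das Wetter ist", "Das Klima"))   # False: surface forms diverge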

Model Details

Model Description

This is the model card of a 🤗 transformers model that has been pushed to the Hub. This model card has been automatically generated.

  • Developed and shared by: Rachel Wicks
  • Funded by: Johns Hopkins University
  • Model type: Transformer encoder-decoder
  • Language(s) (NLP): English, German
  • License: Apache 2.0

Model Sources

  • Paper: Coming Soon!

How to Get Started with the Model

The code below can be used to translate lines read from standard input (this is the baseline setup used in our paper).

import sys
import torch

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# The model ID (or local path) is passed as the first command-line argument.
model_id = sys.argv[1]

device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to(device)
model = model.eval()

# Translate each input line with beam search and print the best hypothesis.
for line in sys.stdin:
    line = line.strip()
    inputs = tokenizer(line, return_tensors="pt").to(device)
    translated_tokens = model.generate(
        **inputs,
        max_length=256,
        num_beams=5,
    )
    print(tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0])
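For example, assuming the script above is saved as translate.py (the filename is illustrative) and the first argument is this model's Hub ID or a local path, English lines can be piped through it:

echo "This is a test." | python translate.py <hub-id-or-path-of-this-model>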

Training Details

The data is available here. We use sotastream to stream training data over stdin and MarianNMT to train the model. The Marian configuration is available in this repository as config.yml.

Evaluation

BLEU on WMT24 is XX.
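A minimal sketch of how such a score can be computed with sacrebleu is shown below; it assumes detokenized system outputs and reference translations are stored one sentence per line in plain text files, and the file names are illustrative.

import sacrebleu

# Illustrative file names: one detokenized sentence per line, in parallel order.
with open("wmt24.en-de.hyp") as f:
    hypotheses = [line.strip() for line in f]
with open("wmt24.en-de.ref") as f:
    references = [line.strip() for line in f]

# corpus_bleu takes a list of hypotheses and a list of reference streams.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(bleu.score)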

Hardware

NVIDIA Titan RTX (24 GB)

Citation

BibTeX:

[More Information Needed]
