J-Moshi: A Japanese Full-duplex Spoken Dialogue System
Paper | Model | Demo | Training Code
J-Moshi is a full-duplex spoken dialogue model for Japanese. Built upon Moshi, an English 7B-parameter full-duplex spoken dialogue model, it was developed through additional training on Japanese spoken dialogue data. The model realizes natural turn-taking behaviors such as speech overlaps and backchannels in real time, similar to human-to-human conversation. For more details, please refer to our paper.
This repository provides the trained J-Moshi models and instructions for interacting with them. Audio samples generated by J-Moshi and the codebase used for training are also available.
Models
Two variants of J-Moshi are publicly available:
- nu-dialogue/j-moshi
  - A model based on kyutai/moshiko-pytorch-bf16, trained on large-scale Japanese spoken dialogue data.
- nu-dialogue/j-moshi-ext
  - A model based on kyutai/moshiko-pytorch-bf16, trained on large-scale Japanese spoken dialogue data plus augmented data synthesized using multi-stream TTS.
Each repository contains the following three model files (a minimal download sketch follows the list):
- model.safetensors
  - The main J-Moshi model weights.
- tokenizer_spm_32k_3.model
  - Text tokenizer: a Japanese SentencePiece model from rinna/japanese-gpt2-medium.
- tokenizer-e351c8d8-checkpoint125.safetensors
  - Audio tokenizer: the Mimi model from kyutai/moshiko-pytorch-bf16.
Interactive Demo
You can interact with J-Moshi using the official Moshi PyTorch implementation from Kyutai. For implementation details, please refer to the original Moshi repository kyutai-labs/moshi.
Installation
Python 3.10 or higher is required.
pip install "moshi<=0.2.2"
Usage
You can launch the web UI by running moshi.server. Specify the J-Moshi HuggingFace Hub repository (nu-dialogue/j-moshi or nu-dialogue/j-moshi-ext) using the --hf-repo option.
python -m moshi.server --hf-repo nu-dialogue/j-moshi-ext
Tips
- A Linux machine with a GPU with at least 24 GB of VRAM is required to run the demo; macOS is not supported. A quick GPU check is sketched below the tips.
- To prevent the model's speech output from echoing back into the input, please use earphones or headphones instead of speakers during the dialogue. Audio devices can be configured in the browser when accessing the web UI (served by default at http://localhost:8998).
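Before launching the server, you can optionally confirm that a suitable GPU is visible. This is a rough sanity check rather than part of the official setup; it only assumes PyTorch, which is installed as a dependency of the moshi package.

```python
import torch

# The demo requires a CUDA-capable Linux machine with roughly 24 GB of VRAM.
assert torch.cuda.is_available(), "No CUDA GPU found; the demo does not run on CPU or macOS."
props = torch.cuda.get_device_properties(0)
vram_gb = props.total_memory / 1024 ** 3
print(f"GPU: {props.name} ({vram_gb:.1f} GB VRAM)")
if vram_gb < 24:
    print("Warning: less than 24 GB of VRAM; the model may not fit in memory.")
```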
Training Details
J-Moshi was trained on the spoken dialogue corpora listed below. J-Moshi-ext was additionally trained on augmented data synthesized from text dialogue corpora.
Spoken dialogue corpora
- J-CHAT
- Japanese Callhome
- CSJ
- Travel Agency Dialogue Corpus
- Casual Dialogue Corpus (in-house)
- Consultation Dialogue Corpus (in-house)
Text dialogue corpora
Training was conducted using 128 NVIDIA V100 32GB GPUs.
Terms of Use
J-Moshi is released under CC BY-NC 4.0 and is intended for research purposes. The model is not intended for any malicious use, including impersonation or fraud. Note that its outputs may contain biases, inaccuracies, or offensive content derived from the training data. We assume no responsibility for any damages arising from its use.
Acknowledgements
This research was supported by JST Moonshot R&D, Grant Number JPMJMS2011. The casual dialogue corpus and consultation dialogue corpus were constructed in joint research with AISIN Corporation. We used the computational resources of the supercomputer "Flow" at the Information Technology Center, Nagoya University. Finally, we would like to thank Kyutai Labs for releasing Moshi's technical paper and model.
Citation
@inproceedings{ohashi2025jmoshi,
title={Towards a Japanese Full-duplex Spoken Dialogue System},
author={Ohashi, Atsumoto and Iizuka, Shinya and Jiang, Jingjing and Higashinaka, Ryuichiro},
booktitle={Proceedings of the 26th Interspeech Conference},
year={2025},
}