arxiv:2605.00777

LASE: Language-Adversarial Speaker Encoding for Indic Cross-Script Identity Preservation

Published on May 1

· Submitted by

Venkata Pushpak Teja Menta on May 4

Praxel

Upvote

Authors:

Abstract

A language-adversarial speaker encoder (LASE) is proposed to address cross-script voice cloning issues by training with contrastive loss and gradient-reversal learning to produce language-uninformative yet speaker-informative embeddings.

AI-generated summary

A speaker encoder used in multilingual voice cloning should treat the same speaker identically regardless of which script the audio was uttered in. Off-the-shelf encoders do not, and the failure is accent-conditional. On a 1043-pair Western-accented voice corpus across English, Hindi, Telugu, and Tamil, WavLM-base-plus-sv loses 0.082 absolute cosine similarity when the same voice changes script and ECAPA-TDNN loses 0.105. On a 1369-pair Indian-accented voice corpus, the gap shrinks to 0.006 (WavLM-SV) and 0.044 (ECAPA-TDNN). The leak is largest where it matters most for cross-script TTS: when a system projects a non-Indic-trained voice into Indic scripts. We present LASE (Language-Adversarial Speaker Encoder), a small projection head over frozen WavLM-base-plus trained with two losses: a supervised contrastive loss over voice identity, and a gradient-reversal cross-entropy against a 4-language classifier that pushes the embedding to be language-uninformative while remaining speaker-informative. Trained on 1118 quality-gated cross-script pairs synthesised from 8 commercial multilingual voices, LASE's residual gap is consistent with zero on both corpora (Delta = 0.013 Western, Delta = 0.026 Indian; both bootstrap 95% CIs include zero) and amplifies the cross-script-vs-floor margin 2.4-2.7x over both baselines. An ECAPA+GRL ablation shows the GRL objective improves either backbone but the WavLM choice contributes too. In synthetic multi-speaker diarisation, LASE matches ECAPA-TDNN on cross-script speaker recall (0.788 vs 0.789) with ~100x less training data. We release the r1 checkpoint, both corpora, and the bootstrap recipe.

View arXiv page View PDF Project page GitHub 1 Add to collection

Community

praxelhq

Paper submitter about 11 hours ago

LASE is a language-adversarial speaker encoder for cross-script identity preservation in Indic TTS. The problem: when a multilingual TTS clones the same voice across scripts (English → Hindi → Telugu → Tamil), speaker identity drifts measurably (within-script SECS 0.93 → cross-script 0.83, vs the across-speaker noise floor of 0.64). LASE wraps a frozen speaker-encoder backbone (WavLM-base-plus or ECAPA-TDNN) with a gradient-reversal-layer language classifier that strips language-specific signal from the speaker embedding while preserving identity. On a 1118-pair Indian-voice held-out, LASE r1 cuts cross-script drift 5× and shows 3× headroom over the within-script ceiling. Both the WavLM backbone and the GRL ablation contribute; ECAPA+GRL replicates the finding. Paper, code, and weights are open-source.

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Get this paper in your agent:

hf papers read 2605.00777

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2605.00777 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2605.00777 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2605.00777 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.