ChristophSchuhmann committed on
Commit 16f4426 · verified · 1 Parent(s): 5142e60

Add README.md

  1. README.md +97 -0
README.md ADDED
@@ -0,0 +1,97 @@
+ ---
+ library_name: onnxruntime
+ tags:
+ - snac
+ - onnx
+ - 24khz
+ - decoder
+ - browser
+ license: other
+ language:
+ - en
+ ---
+
+ # SNAC 24 kHz: Decoder as ONNX (browser-ready)
+
+ This repo provides **ONNX decoders** for the SNAC 24 kHz codec so you can decode SNAC tokens **on-device**, including **in the browser** with `onnxruntime-web`.
+
+ **Why?** If your TTS front-end is a decoder-only Transformer (e.g. Orpheus-style) that can stream out SNAC tokens fast and cheaply, you can keep synthesis private and responsive by decoding the audio **in the user’s browser/CPU** (or WebGPU when available).
+
+ > In a Colab CPU test, we saw ~**2.1× real-time** decoding for a longer file using the ONNX model (inference time only, excluding model load). Your mileage will vary with hardware and browser.
+
+ ---
+
+ ## Files
+
+ - **`snac24_int2wav_static.onnx`**: *int → wav* decoder
+   Inputs (int64):
+   - `codes0`: `[1, 12]`
+   - `codes1`: `[1, 24]`
+   - `codes2`: `[1, 48]`
+   Output:
+   - `audio`: `float32 [1, 1, 24576]` (24 kHz)
+
+   Shapes correspond to a **48-frame window**. Each frame is **512 samples**, so one window is 48 × 512 = **24576 samples**, or **1.024 s** at 24 kHz.
+   Token alignment: `L0*4 = L1*2 = L2*1 = shared_frames`, where `L0`, `L1`, `L2` are the lengths of `codes0`, `codes1`, `codes2` (here 12·4 = 24·2 = 48·1 = 48).
+
+ - **`snac24_latent2wav_static.onnx`**: *latent → wav* decoder
+   Input: `z` `float32 [1, 768, 48]` → Output: `audio [1, 1, 24576]`
+   Use this if you reconstruct the latent yourself (RVQ embeddings + 1×1 conv projections).
+
+ - **`snac24_codes.json`**: sample codes (for testing)
+
+ - **`snac24_quantizers.json`**: RVQ metadata/weights (stride + embeddings + 1×1 projections) to reconstruct `z` if needed.
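
The latent reconstruction mentioned in the list above (RVQ embeddings plus 1×1 projections) could look roughly like the sketch below. It is an illustration under stated assumptions, not the repo's code: the layout of `snac24_quantizers.json` is not documented here, so the field names (`stride`, `codebook`, `projWeight`, `projBias`) and their array layouts are hypothetical. The sketch also assumes that each level's projected embedding is repeated `stride` times along the frame axis and that the levels are summed.

```javascript
// Sketch only: the field names and array layouts below are hypothetical.
// Each level is assumed to look like:
//   { stride, codebook: number[][] (K x d), projWeight: number[][] (C x d), projBias: number[] (C) }
// codes[i] is a plain array of integer token ids for level i.
function reconstructLatent(levels, codes, frames) {
  const C = levels[0].projWeight.length;       // latent channels (768 for this model)
  const z = new Float32Array(C * frames);      // row-major [C, frames]
  levels.forEach((level, i) => {
    const { stride, codebook, projWeight, projBias } = level;
    if (codes[i].length * stride !== frames) {
      throw new Error(`level ${i}: ${codes[i].length} tokens * stride ${stride} != ${frames} frames`);
    }
    codes[i].forEach((code, t) => {
      const e = codebook[code];                // embedding lookup, length d
      for (let c = 0; c < C; c++) {
        // A 1x1 convolution is a per-timestep matrix-vector product plus bias.
        let v = projBias[c];
        for (let k = 0; k < e.length; k++) v += projWeight[c][k] * e[k];
        // Repeat the projected value `stride` times and accumulate across levels.
        for (let r = 0; r < stride; r++) z[c * frames + t * stride + r] += v;
      }
    });
  });
  return z;
}
```

Under these assumptions, the result would be wrapped as `new ort.Tensor('float32', z, [1, 768, 48])` and fed to `snac24_latent2wav_static.onnx` as `z`.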
+
+ ---
+
+ ## Browser (WASM/WebGPU) quickstart
+
+ Serve these files from a local server. For multithreaded WASM, the page must be cross-origin isolated (served with COOP/COEP headers); without isolation, WASM will typically run **single-threaded**.
+
+ ```html
+ <script src="https://cdn.jsdelivr.net/npm/onnxruntime-web/dist/ort.min.js"></script>
+ <script>
+ (async () => {
+   // Prefer WebGPU if available; else WASM
+   const providers = (typeof navigator.gpu !== 'undefined') ? ['webgpu','wasm'] : ['wasm'];
+   // Enable SIMD; threads only if crossOriginIsolated
+   ort.env.wasm.simd = true;
+   ort.env.wasm.numThreads = crossOriginIsolated ? (navigator.hardwareConcurrency||4) : 1;
+
+   const session = await ort.InferenceSession.create('snac24_int2wav_static.onnx', {
+     executionProviders: providers,
+     graphOptimizationLevel: 'all',
+   });
+
+   // Example: one 48-frame window (12/24/48 tokens). Replace with real codes.
+   const T0=12, T1=24, T2=48;
+   const feed = {
+     codes0: new ort.Tensor('int64', BigInt64Array.from(new Array(T0).fill(0), x=>BigInt(x)), [1,T0]),
+     codes1: new ort.Tensor('int64', BigInt64Array.from(new Array(T1).fill(0), x=>BigInt(x)), [1,T1]),
+     codes2: new ort.Tensor('int64', BigInt64Array.from(new Array(T2).fill(0), x=>BigInt(x)), [1,T2]),
+   };
+
+   const t0 = performance.now();
+   const out = await session.run(feed);
+   const t1 = performance.now();
+   const audio = out.audio.data; // flat Float32Array of 24576 samples (tensor shape [1,1,24576])
+
+   // Play it (24 kHz). Browsers may keep the context suspended until a user gesture.
+   const ctx = new (window.AudioContext||window.webkitAudioContext)({sampleRate:24000});
+   const buf = ctx.createBuffer(1, audio.length, 24000);
+   buf.copyToChannel(audio, 0);
+   const src = ctx.createBufferSource();
+   src.buffer = buf;
+   src.connect(ctx.destination);
+   src.start();
+
+   // providers[0] is the preferred EP; the runtime may have fallen back to a later one.
+   console.log({ preferredEP: providers[0], infer_ms: (t1-t0).toFixed(2), samples: audio.length });
+ })();
+ </script>
+ ```
+
+ ## Streaming note
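+
+ SNAC is streamable in principle. For practical low-latency TTS, emit ~200 ms of tokens, decode in ~100 ms, start playback, and continue decoding subsequent chunks; cross-fade a few ms between chunks to hide seams. Note that the static models in this repo take a fixed 48-frame (1.024 s) window, so a shorter chunk has to be padded to the window size before decoding.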
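
One way to approach the chunking and cross-fading described above is sketched below. The helper names (`tokenCounts`, `padToWindow`, `crossfadeAppend`), the padding token `0`, and the linear fade are assumptions for illustration; the source does not specify how chunks should be padded or blended. The cross-fade also assumes that consecutive chunks were decoded with some overlapping audio.

```javascript
const FRAME_SAMPLES = 512;   // samples per frame at 24 kHz (from the Files section)
const WINDOW_FRAMES = 48;    // frames per static window

// Tokens per level for a given frame count: L0*4 = L1*2 = L2*1 = frames.
function tokenCounts(frames) {
  if (frames % 4 !== 0) throw new Error('frames must be a multiple of 4');
  return { codes0: frames / 4, codes1: frames / 2, codes2: frames };
}

// Pad a short token list up to the static input length.
// Assumption: padding with token 0 is acceptable; the padded audio should be trimmed afterwards.
function padToWindow(codes, length, padToken = 0) {
  if (codes.length > length) throw new Error('too many tokens for one window');
  const out = new BigInt64Array(length).fill(BigInt(padToken));
  codes.forEach((c, i) => { out[i] = BigInt(c); });
  return out;
}

// Join two chunks whose last/first `overlap` samples cover the same time span,
// blending the overlap with a linear ramp.
function crossfadeAppend(prev, next, overlap) {
  const n = Math.min(overlap, prev.length, next.length);
  const out = new Float32Array(prev.length + next.length - n);
  out.set(prev.subarray(0, prev.length - n), 0);
  for (let i = 0; i < n; i++) {
    const w = (i + 1) / (n + 1);   // weight of the new chunk, rising toward 1
    out[prev.length - n + i] = (1 - w) * prev[prev.length - n + i] + w * next[i];
  }
  out.set(next.subarray(n), prev.length);
  return out;
}
```

At 24 kHz, a 5 ms overlap corresponds to 120 samples, so a call such as `crossfadeAppend(played, decoded, 120)` would blend roughly the "few ms" mentioned above.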
+
+ ## Threads / GPU
+
+ Multithreaded WASM requires cross-origin isolation (COOP/COEP). Without it, browsers typically run single-threaded.
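
For local testing, the isolation headers can be sent by a small static server. The following is a minimal Node.js sketch, not part of this repo: `createIsolatedServer`, the MIME table, and the port in the usage comment are hypothetical, and the code is meant for development only.

```javascript
const http = require('http');
const fs = require('fs');
const path = require('path');

// These two response headers make a page cross-origin isolated.
const ISOLATION_HEADERS = {
  'Cross-Origin-Opener-Policy': 'same-origin',
  'Cross-Origin-Embedder-Policy': 'require-corp',
};

const MIME = {
  '.html': 'text/html',
  '.js': 'text/javascript',
  '.json': 'application/json',
  '.wasm': 'application/wasm',
  '.onnx': 'application/octet-stream',
};

// Returns an http.Server that serves files under rootDir with the isolation headers.
function createIsolatedServer(rootDir) {
  const root = path.resolve(rootDir);
  return http.createServer((req, res) => {
    let urlPath;
    try {
      urlPath = decodeURIComponent(new URL(req.url, 'http://localhost').pathname);
    } catch (err) {
      res.writeHead(400, ISOLATION_HEADERS);
      res.end('Bad request');
      return;
    }
    const filePath = path.join(root, urlPath === '/' ? 'index.html' : urlPath);
    // Refuse any path that resolves outside the served directory.
    if (!filePath.startsWith(root + path.sep)) {
      res.writeHead(403, ISOLATION_HEADERS);
      res.end('Forbidden');
      return;
    }
    fs.readFile(filePath, (err, data) => {
      if (err) {
        res.writeHead(404, ISOLATION_HEADERS);
        res.end('Not found');
        return;
      }
      const type = MIME[path.extname(filePath)] || 'application/octet-stream';
      res.writeHead(200, { ...ISOLATION_HEADERS, 'Content-Type': type });
      res.end(data);
    });
  });
}

// Usage (hypothetical port): createIsolatedServer('.').listen(8080);
```

With `require-corp`, cross-origin resources such as a CDN-hosted script must themselves be served with suitable CORS or CORP headers, otherwise the browser will block them.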
+
+ WebGPU can accelerate decoding on desktop and mobile when the required kernels are supported; otherwise this model usually falls back to WASM.
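
The fallback can also be made explicit in application code, which makes it possible to log which provider actually loaded the model. The sketch below tries each execution provider in order; `createSessionWithFallback` is a hypothetical helper, and it assumes that `ort.InferenceSession.create` rejects when a provider cannot load the model.

```javascript
// Try each execution provider in order and return the first session that loads.
// `ort` is the onnxruntime-web namespace (passed in so the helper is easy to test).
async function createSessionWithFallback(ort, modelUrl, providers) {
  const errors = [];
  for (const ep of providers) {
    try {
      const session = await ort.InferenceSession.create(modelUrl, {
        executionProviders: [ep],
      });
      return { session, ep };   // ep is the provider that actually worked
    } catch (err) {
      errors.push(`${ep}: ${err && err.message ? err.message : err}`);
    }
  }
  throw new Error('No execution provider could load the model: ' + errors.join('; '));
}
```

A call such as `createSessionWithFallback(ort, 'snac24_int2wav_static.onnx', ['webgpu', 'wasm'])` would then return both the session and the name of the provider in use.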