NTPP: Generative Speech Language Modeling for Dual-Channel Spoken Dialogue via Next-Token-Pair Prediction
Code
arXiv
Multi-round dialogue

Interactive audio player with synchronized transcription; click any text segment to jump to that point in the audio.

Conditional Examples

Below are generation samples from our best model, compared with the ground-truth continuation and with a cascaded baseline (Whisper + Qwen LLM + TTS). Note that the synthesized speakers differ from the original ones; we deliberately choose speakers of the same gender as in the original speech.
You will hear a ding sound at the end of the prompt duration.
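
For reference, the sketch below shows how such a cascaded baseline can be assembled. It is a minimal sketch assuming the openai-whisper and transformers packages; the exact models, prompts, and TTS system used for the samples here may differ, and `synthesize_speech` is a placeholder.

```python
# Minimal sketch of a cascaded spoken-dialogue baseline (ASR -> LLM -> TTS).
import whisper
from transformers import AutoModelForCausalLM, AutoTokenizer

asr = whisper.load_model("base")
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")
llm = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-7B-Instruct")

def synthesize_speech(text: str):
    # Placeholder for any off-the-shelf TTS system.
    raise NotImplementedError

def cascaded_continuation(prompt_wav: str):
    # 1) Transcribe the spoken prompt with Whisper.
    text = asr.transcribe(prompt_wav)["text"]
    # 2) Generate a textual continuation with the LLM.
    ids = tok.apply_chat_template(
        [{"role": "user", "content": text}],
        add_generation_prompt=True,
        return_tensors="pt",
    )
    out = llm.generate(ids, max_new_tokens=128)
    reply = tok.decode(out[0, ids.shape[-1]:], skip_special_tokens=True)
    # 3) Synthesize the reply back to speech.
    return synthesize_speech(reply)
```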

ID | Original speech: Prompt, Ground Truth | Synthesized speech continuation: dGSLM, NTPP, Cascaded
0
1

Dataset Details (140K Hours)

Type Stages Dataset Hours
Pronunciation Recording 1 Common Voice 70,000 h
Video Audio 1 Gigaspeech 10,000 h
Spoken English Audio 1 Libri-light 60,000 h

Tokenizer Comparison

Audio Tokenizer Meaningfulness↑ Naturalness↑
Mimi 4.05 4.28
Vanilla RVQ 3.95 4.15
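
The Mimi row above uses the publicly released Mimi codec as the audio tokenizer. The snippet below is a minimal sketch of encoding and decoding with the Hugging Face port; the checkpoint name and API usage reflect the public release and are not our exact data pipeline.

```python
# Sketch: turn a waveform into RVQ tokens with Mimi and back.
import torch
from transformers import AutoFeatureExtractor, MimiModel

fe = AutoFeatureExtractor.from_pretrained("kyutai/mimi")
mimi = MimiModel.from_pretrained("kyutai/mimi")

def audio_to_tokens(waveform):
    # `waveform`: 1-D float array for one channel, already at the codec's sampling rate.
    inputs = fe(raw_audio=waveform, sampling_rate=fe.sampling_rate, return_tensors="pt")
    with torch.no_grad():
        return mimi.encode(inputs["input_values"]).audio_codes   # (batch, codebooks, frames)

def tokens_to_audio(codes):
    with torch.no_grad():
        return mimi.decode(codes).audio_values                   # reconstructed waveform
```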

Moshi Evaluation Details

To address potential unfairness arising from differences in context window size, we conducted additional experiments using Llama2 with a context window matched to that of the Helium LLM. These experiments also demonstrated lower multi-turn response latency than Moshi.

Model   Audio response latency for 5 turn-takings (ms)
Moshi 261.6
NTPP-Llama2 204.9
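
For reference, the code below is a hedged sketch of how such a latency number can be measured, assuming a hypothetical streaming interface `model.generate_next_frame(context)` that returns the first audio frame of the reply; our actual harness differs in detail.

```python
# Sketch: average audio response latency over 5 turn-takings.
import time

def mean_response_latency_ms(model, dialogue_turns, n_turns=5):
    context, latencies = [], []
    for user_turn in dialogue_turns[:n_turns]:
        context.append(user_turn)                           # feed the user's turn
        start = time.perf_counter()
        first_frame = model.generate_next_frame(context)    # first audio frame of the model's reply
        latencies.append((time.perf_counter() - start) * 1000.0)
        context.append(first_frame)                         # keep the dialogue going
    return sum(latencies) / len(latencies)
```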

Detailed discussion of Figure 6

Figure 6: Comparison of training convergence between audio-only and text-included continued pre-training

Our architecture's full weight sharing across modalities induces inter-modal gradient competition: each modality implicitly amplifies its parameter norms to gain dominance in the joint representation space. We therefore conducted two continued pre-training experiments. When continuing pre-training with only audio interaction objectives (i.e., without ASR-driven text supervision), the model reduces perplexity faster than the "w/ Text" configuration. This suggests that decoupling the text modality during the continued pre-training phase can mitigate cross-modal interference while maintaining dialogue competency.
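
Schematically, the two configurations differ only in whether an ASR-driven text term is added to the dual-channel audio objective. The sketch below is illustrative: the loss callables and the weight `lambda_text` are placeholders, not our exact implementation.

```python
# Illustrative composition of the two continued pre-training objectives in Figure 6.
# `audio_ntpp_loss` and `asr_text_loss` are placeholder callables for the
# next-token-pair loss on audio tokens and the ASR-transcript loss.
def continued_pretraining_loss(batch, audio_ntpp_loss, asr_text_loss=None, lambda_text=0.5):
    loss = audio_ntpp_loss(batch)                        # "w/o Text": audio interaction objective only
    if asr_text_loss is not None:                        # "w/ Text": shared weights also fit the text modality
        loss = loss + lambda_text * asr_text_loss(batch)
    return loss
```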

Detailed discussion of Figure 9

Figure 9: Ablation study comparing the performance of different training configurations: full two-stage NTPP, NTPP without first-stage pre-training, and NTPP without second-stage fine-tuning

The ablation results in Figure 9 support a simple claim: the first-stage training is essential before the second-stage NTPP training. Since this may cause some confusion, we further clarify the ablation with a slightly modified figure [Figure 9](audio-3059.pages.dev/figure9). The updated figure includes three curves: full two-stage NTPP (NTPP), NTPP without the first stage (NTPP w/o-1), and NTPP without the second stage (NTPP w/o-2). The results clearly show that skipping either stage leads to suboptimal performance, with the absence of first-stage pre-training being particularly detrimental. This validates our two-stage training strategy and highlights the importance of proper model initialization through pre-training before the NTPP fine-tuning phase.
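
For readers reproducing the ablation, the three configurations reduce to which stages are executed. The sketch below is purely illustrative; `stage1_pretrain` and `stage2_ntpp_finetune` stand in for our actual training routines.

```python
# Illustrative stage composition for the three curves in Figure 9.
def train(model, data, stage1_pretrain, stage2_ntpp_finetune, config="NTPP"):
    if config in ("NTPP", "NTPP w/o-2"):
        model = stage1_pretrain(model, data)         # first-stage pre-training (initialization)
    if config in ("NTPP", "NTPP w/o-1"):
        model = stage2_ntpp_finetune(model, data)    # second-stage NTPP fine-tuning
    return model
```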

Further explanation of the positional encoding

\[\begin{align} \label{eq:token_embed}
\mathbf{q} &= \mathbf{W}_{Q}[\mathbf{z}^{a}_{t}, \mathbf{z}^{b}_{t}] + [\mathbf{p}^{a}_{t}, \mathbf{p}^{b}_{t}] + [\mathbf{c}^{a}_{t}, \mathbf{c}^{b}_{t}], \quad (8)\\
\mathbf{k} &= \mathbf{W}_{K}[\mathbf{z}^{a}_{t}, \mathbf{z}^{b}_{t}] + [\mathbf{p}^{a}_{t}, \mathbf{p}^{b}_{t}] + [\mathbf{c}^{a}_{t}, \mathbf{c}^{b}_{t}], \quad (9)\\
S &= (\mathbf{S}^{a}_{1}, \mathbf{S}^{b}_{1}, \ldots, \mathbf{S}^{a}_{T}, \mathbf{S}^{b}_{T}) \\
&= ((s^{a}_{1,1}, \ldots, s^{a}_{1,D}), (s^{b}_{1,1}, \ldots, s^{b}_{1,D}), \ldots, (s^{a}_{T,1}, \ldots, s^{a}_{T,D}), (s^{b}_{T,1}, \ldots, s^{b}_{T,D})), \quad (10)\\
\mathbf{d}_{i} &= (\sin(2\pi i / D), \cos(2\pi i / D)). \quad (12)
\end{align}\]

The positional encoding we use is RoPE. However, NTPP must additionally specify which tokens share the same RoPE position (all tokens at the same timestep t), how to distinguish the two speaker channels (via the channel embeddings c), and how to differentiate tokens at different depths (via the cyclic depth encoding d). Equations 9, 10, and 12 correspond to these three embeddings. Note that Equations 8 and 9 follow the Llama implementation in how these terms are added to the queries and keys.
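
As an illustration only, the sketch below combines the three additive terms from Eqs. 8-9: a per-timestep position shared by both channels, a channel embedding, and the cyclic depth code of Eq. 12. The module names, shapes, and the use of a learned absolute position table (instead of applying RoPE rotations) are our own simplifications, not the NTPP implementation.

```python
# Illustrative combination of position (shared per timestep), channel, and depth terms.
import math
import torch
import torch.nn as nn

class PairPositionalTerms(nn.Module):
    def __init__(self, dim: int, depth: int, max_len: int = 4096):
        super().__init__()
        self.depth = depth                               # D: number of RVQ codebooks
        self.pos_emb = nn.Embedding(max_len, dim)        # same index for both channels at timestep t
        self.chan_emb = nn.Embedding(2, dim)             # channel a -> 0, channel b -> 1
        self.depth_proj = nn.Linear(2, dim, bias=False)  # lift (sin, cos) depth code to model dim

    def forward(self, t, channel, depth_idx):
        # Cyclic depth encoding d_i = (sin(2*pi*i/D), cos(2*pi*i/D))   (Eq. 12)
        angle = 2 * math.pi * depth_idx.float() / self.depth
        d = torch.stack([torch.sin(angle), torch.cos(angle)], dim=-1)
        # Sum of the three additive terms, to be added to the queries and keys (Eqs. 8-9).
        return self.pos_emb(t) + self.chan_emb(channel) + self.depth_proj(d)
```

Tokens from channels a and b at the same timestep t receive the same positional term, so the channel and depth terms are what disambiguate them.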

Sample page based on the HiFi-GAN demo page.