
Speech Technology

Channel address: @speechtech
Categories: Technologies
Language: English
Subscribers: 652

Ratings & Reviews

2.67 (3 reviews)


5 stars: 1
4 stars: 0
3 stars: 0
2 stars: 1
1 star: 1


The latest Messages

2023-06-28 03:15:19 Another semisup thing from Google, better ensembling than ROVER

https://arxiv.org/abs/2306.12012

Learning When to Trust Which Teacher for Weakly Supervised ASR

Aakriti Agrawal, Milind Rao, Anit Kumar Sahu, Gopinath Chennupati, Andreas Stolcke

Automatic speech recognition (ASR) training can utilize multiple experts as teacher models, each trained on a specific domain or accent. Teacher models may be opaque in nature since their architecture may not be known, or their training cadence may differ from that of the student ASR model. Still, the student models are updated incrementally using the pseudo-labels generated independently by the expert teachers. In this paper, we exploit supervision from multiple domain experts in training student ASR models. This training strategy is especially useful in scenarios where few or no human transcriptions are available. To that end, we propose a Smart-Weighter mechanism that selects an appropriate expert based on the input audio, and then trains the student model in an unsupervised setting. We show the efficacy of our approach using the LibriSpeech and LibriLight benchmarks and find an improvement of 4 to 25% over baselines that uniformly weight all the experts, use a single expert model, or combine experts using ROVER.
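The per-utterance expert selection described above can be sketched as a toy gating network: score each teacher from utterance-level audio features, then keep that teacher's pseudo-label for student training. The linear gate, feature shapes, and function names here are illustrative assumptions, not the paper's actual Smart-Weighter architecture.

```python
import math

def smart_weighter(features, weights, bias):
    """Toy gating network: score each teacher expert from utterance-level
    audio features and pick the most trusted one.
    weights is a list of per-teacher weight vectors, bias a per-teacher offset."""
    logits = [sum(f * w for f, w in zip(features, col)) + b
              for col, b in zip(weights, bias)]
    m = max(logits)                       # stabilize the softmax
    exp = [math.exp(l - m) for l in logits]
    z = sum(exp)
    probs = [e / z for e in exp]
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs

def select_pseudo_labels(utterance_feats, teacher_hyps, weights, bias):
    """Keep, per utterance, the pseudo-label from the expert the gate
    trusts most; the student is then trained on these labels."""
    return [hyps[smart_weighter(f, weights, bias)[0]]
            for f, hyps in zip(utterance_feats, teacher_hyps)]
```

In the real setup the gate is learned jointly with the student; here it is just a fixed linear scorer to make the selection step concrete.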
574 views, edited 00:15
2023-06-28 02:52:13 https://arxiv.org/abs/2306.13114

https://github.com/aixplain/NoRefER

A Reference-less Quality Metric for Automatic Speech Recognition via Contrastive-Learning of a Multi-Language Model with Self-Supervision
Kamer Ali Yuksel, Thiago Ferreira, Ahmet Gunduz, Mohamed Al-Badrashiny, Golara Javadi
The common standard for quality evaluation of automatic speech recognition (ASR) systems is reference-based metrics such as the Word Error Rate (WER), computed using manual ground-truth transcriptions that are time-consuming and expensive to obtain. This work proposes a multi-language referenceless quality metric, which allows comparing the performance of different ASR models on a speech dataset without ground-truth transcriptions. To estimate the quality of ASR hypotheses, a pre-trained language model (LM) is fine-tuned with contrastive learning in a self-supervised manner. In experiments conducted on several unseen test datasets consisting of outputs from top commercial ASR engines in various languages, the proposed referenceless metric obtains a much higher correlation with WER scores and their ranks than the perplexity metric from the state-of-the-art multi-lingual LM in all experiments, and also reduces WER by more than 7% when used for ensembling hypotheses. The fine-tuned model and experiments are made available for reproducibility: this https URL
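A minimal sketch of the kind of contrastive objective such a referenceless metric could be fine-tuned with: a pairwise hinge loss that ranks an (assumed) better hypothesis above a worse one, plus the reference-free ranking it enables at evaluation time. The function names and margin value are illustrative assumptions; the paper's exact loss may differ.

```python
def pairwise_contrastive_loss(score_better, score_worse, margin=1.0):
    """Hinge-style ranking loss: zero once the quality score of the
    better hypothesis exceeds the worse one by at least `margin`."""
    return max(0.0, margin - (score_better - score_worse))

def rank_hypotheses(scored):
    """Rank ASR hypotheses by a referenceless quality score,
    best first; no ground-truth transcript is needed."""
    return [hyp for hyp, _ in sorted(scored, key=lambda p: -p[1])]
```

In self-supervision the "better/worse" pairs can come from proxies (e.g. outputs of stronger vs. weaker models on the same audio) rather than from human labels.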
537 views, edited 23:52
2023-06-28 02:42:37 https://twitter.com/unilightwf/status/1673522880053940224
537 views, 23:42
2023-06-24 02:59:49 https://twitter.com/forthshinji/status/1672082306239176706

demo: https://aria-k-alethia.github.io/2023laughter-demo/
corpus: https://sites.google.com/site/shinnosuketakamichi/research-topics/laughter_corpus
source: https://github.com/Aria-K-Alethia/laughter-synthesis/
694 views, 23:59
2023-06-21 03:43:18 GPT-4 is an ensemble

https://twitter.com/soumithchintala/status/1671267150101721090

We shall see LLaMA ensembles soon
941 views, 00:43
2023-06-21 03:38:13
A device to track human activity from Meta/Facebook

https://ariatutorial2023.github.io/
734 views, edited 00:38
2023-06-17 10:13:46 Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale

https://ai.facebook.com/blog/voicebox-generative-ai-model-speech/
844 views, 07:13
2023-06-15 01:36:02 https://arxiv.org/abs/2306.07691

https://styletts2.github.io/

StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
Yinghao Aaron Li, Cong Han, Vinay S. Raghavan, Gavin Mischler, Nima Mesgarani
In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis. StyleTTS 2 differs from its predecessor by modeling styles as a latent random variable through diffusion models to generate the most suitable style for the text without requiring reference speech, achieving efficient latent diffusion while benefiting from the diverse speech synthesis offered by diffusion models. Furthermore, we employ large pre-trained SLMs, such as WavLM, as discriminators with our novel differentiable duration modeling for end-to-end training, resulting in improved speech naturalness. StyleTTS 2 surpasses human recordings on the single-speaker LJSpeech dataset and matches them on the multispeaker VCTK dataset as judged by native English speakers. Moreover, when trained on the LibriTTS dataset, our model outperforms previous publicly available models for zero-shot speaker adaptation. This work achieves the first human-level TTS on both single and multispeaker datasets, showcasing the potential of style diffusion and adversarial training with large SLMs. The audio demos and source code are available at this https URL.
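The abstract says pre-trained SLMs such as WavLM serve as discriminators for adversarial training. As an illustration only, here is a common hinge-style adversarial objective over discriminator scores; the actual StyleTTS 2 loss formulation is not given in this post and may well differ.

```python
def discriminator_hinge_loss(real_scores, fake_scores):
    """Discriminator side: push scores for real (human) speech above +1
    and scores for synthesized speech below -1."""
    loss_real = sum(max(0.0, 1.0 - s) for s in real_scores) / len(real_scores)
    loss_fake = sum(max(0.0, 1.0 + s) for s in fake_scores) / len(fake_scores)
    return loss_real + loss_fake

def generator_hinge_loss(fake_scores):
    """Generator (TTS) side: raise the discriminator's score on
    synthesized speech so it looks like human recordings."""
    return -sum(fake_scores) / len(fake_scores)
```

The appeal of an SLM discriminator is that the scores come from representations pre-trained on huge amounts of real speech, giving the generator a perceptually meaningful training signal.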
908 views, 22:36
2023-06-14 20:59:36 https://github.com/gweltou/vosk-br

A nice Breton model has been implemented for Vosk. A very valuable contribution! Please don't hesitate to add a star to that project!
644 views, edited 17:59
2023-06-14 17:44:41 Speech-to-Text Adapter and Speech-to-Entity Retriever Augmented LLMs for Speech Understanding

paper page: https://huggingface.co/papers/2306.07944

Large Language Models (LLMs) have been applied in the speech domain, often incurring a performance drop due to misalignment between speech and language representations. To bridge this gap, we propose a joint speech and language model (SLM) using a Speech2Text adapter, which maps speech into the text token embedding space without speech information loss. Additionally, using CTC-based blank-filtering, we can reduce the speech sequence length to that of text. On the speech MultiWoZ dataset (DSTC11 challenge), SLM largely improves the dialog state tracking (DST) performance (24.7% to 28.4% accuracy). Further, to address errors on rare entities, we augment SLM with a Speech2Entity retriever, which uses speech to retrieve relevant entities, and then adds them to the original SLM input as a prefix. With this retrieval-augmented SLM (ReSLM), the DST performance jumps to 34.6% accuracy. Moreover, augmenting the ASR task with the dialog understanding task improves the ASR performance from 9.4% to 8.5% WER.
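The blank-filtering step above, which shortens the frame-level speech sequence toward text length, can be sketched with the standard CTC collapse: drop blank frames and merge consecutive repeats. Taking `blank_id=0` and token-id inputs is an assumption for illustration; the paper filters embeddings at the frame positions the CTC head marks as non-blank.

```python
def ctc_blank_filter(frame_ids, blank_id=0):
    """Collapse a CTC frame-level prediction: skip blank frames and
    merge runs of the same token, leaving a roughly text-length sequence."""
    out = []
    prev = None
    for tok in frame_ids:
        if tok != blank_id and tok != prev:
            out.append(tok)
        prev = tok
    return out
```

Since speech runs at many frames per emitted token, this filtering is what lets the adapter hand the LLM a sequence comparable in length to its usual text input.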
881 views, 14:44