Speech Technology

Channel address: @speechtech
Categories: Technologies
Language: English
Subscribers: 652

Ratings & Reviews

2.67 (3 reviews)

5 stars: 1
4 stars: 0
3 stars: 0
2 stars: 1
1 star: 1


Latest 10 messages

2023-02-28 15:12:38 Nice Chinese chip with analog NPU (very power-efficient) and RISC core

https://en.witmem.com/wtm2101.html
2023-02-28 04:07:14 https://twitter.com/Maureendss/status/1630209732223852544

Exciting news! We just released ProsAudit, a prosodic benchmark for SSL models of speech.

It is now part of the Zero Resource Speech Challenge (track 4). The paper also includes results from a comparison with humans.

Check out the preprint: https://arxiv.org/pdf/2302.12057.pdf
2023-02-28 00:52:10 https://arxiv.org/abs/2302.12369

Factual Consistency Oriented Speech Recognition

Naoyuki Kanda, Takuya Yoshioka, Yang Liu

This paper presents a novel optimization framework for automatic speech recognition (ASR) with the aim of reducing hallucinations produced by an ASR model. The proposed framework optimizes the ASR model to maximize an expected factual consistency score between ASR hypotheses and ground-truth transcriptions, where the factual consistency score is computed by a separately trained estimator. Experimental results using the AMI meeting corpus and the VoxPopuli corpus show that the ASR model trained with the proposed framework generates ASR hypotheses that have significantly higher consistency scores with ground-truth transcriptions while maintaining word error rates close to those of cross-entropy-trained ASR models. Furthermore, it is shown that training the ASR models with the proposed framework improves the speech summarization quality as measured by the factual consistency of meeting conversation summaries generated by a large language model.
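
A minimal sketch of the expected-score objective described in the abstract, framed as minimum-risk training over an N-best list; the function and values below are illustrative assumptions, not the paper's code:

import torch

def expected_consistency_loss(hyp_log_probs, consistency_scores):
    # hyp_log_probs:      (N,) log p(hyp | audio) for an N-best list
    # consistency_scores: (N,) factual-consistency score of each hypothesis
    #                     against the reference, from the separate estimator
    weights = torch.softmax(hyp_log_probs, dim=0)  # renormalize over the list
    expected_score = (weights * consistency_scores).sum()
    return -expected_score  # minimizing this maximizes expected consistency

# Hypothetical usage with made-up numbers:
log_probs = torch.tensor([-3.2, -4.1, -5.0], requires_grad=True)
scores = torch.tensor([0.92, 0.75, 0.40])
expected_consistency_loss(log_probs, scores).backward()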
2023-02-28 00:40:27
How much smaller can you make your LM with overtraining?

This figure from Chinchilla gives you a clue about what to expect. Say you have C = 6e20.

If N = 350M, it performs on par with the compute-optimal loss L_opt at C = 1e20 (N_opt = 900M).

=> 6x training FLOPS for 2.5x less inference FLOPS

https://twitter.com/arankomatsuzaki/status/1630257908238696449
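
A quick back-of-the-envelope check of those numbers, using the standard C ≈ 6·N·D approximation (the token counts are implied, not quoted in the thread):

# C ~= 6 * N * D (training FLOPs ~= 6 x params x tokens)
C_small, N_small = 6e20, 350e6  # overtrained small model
C_opt, N_opt = 1e20, 900e6      # compute-optimal model with the same loss

print(C_small / C_opt)          # 6.0   -> 6x more training FLOPs
print(N_opt / N_small)          # ~2.57 -> ~2.5x cheaper inference per token
print(C_small / (6 * N_small))  # ~2.9e11 training tokens for the small model
print(C_opt / (6 * N_opt))      # ~1.9e10 tokens for the compute-optimal model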
2023-02-23 22:28:15 https://github.com/Plachtaa/VITS-fast-fine-tuning
2023-02-23 22:19:56 From Phil Woodland

https://arxiv.org/abs/2302.08579

Adaptable End-to-End ASR Models using Replaceable Internal LMs and Residual Softmax

Keqi Deng, Philip C. Woodland

End-to-end (E2E) automatic speech recognition (ASR) implicitly learns the token sequence distribution of paired audio-transcript training data. However, it still suffers from domain shifts from training to testing, and domain adaptation is still challenging. To alleviate this problem, this paper designs a replaceable internal language model (RILM) method, which makes it feasible to directly replace the internal language model (LM) of E2E ASR models with a target-domain LM in the decoding stage when a domain shift is encountered. Furthermore, this paper proposes a residual softmax (R-softmax) that is designed for CTC-based E2E ASR models to adapt to the target domain without re-training during inference. For E2E ASR models trained on the LibriSpeech corpus, experiments showed that the proposed methods gave a 2.6% absolute WER reduction on the Switchboard data and a 1.0% WER reduction on the AESRC2020 corpus while maintaining intra-domain ASR results.
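
A hedged sketch of what internal-LM replacement could look like at decoding time; the weights and names below are illustrative, in the spirit of the abstract, not the paper's actual formulation:

def rilm_score(log_p_e2e, log_p_internal_lm, log_p_target_lm,
               ilm_weight=1.0, lm_weight=1.0):
    # log_p_e2e:         log p(y | x) from the E2E ASR model
    # log_p_internal_lm: log p(y) under the model's source-domain internal LM
    # log_p_target_lm:   log p(y) under the external target-domain LM
    # Subtract the internal LM and add the target-domain LM, so hypotheses
    # are scored as if the model had been trained on target-domain text.
    return log_p_e2e - ilm_weight * log_p_internal_lm + lm_weight * log_p_target_lm

In beam search this score would be applied to each partial hypothesis at every decoding step.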
BigVGAN has been accepted at ICLR 2023.
Listen to the audio samples:
https://bigvgan-demo.github.io

A universal audio synthesis model, trained on speech only, works for out-of-distribution scenarios, e.g., unseen singing voices and music audio!

Code and models are released!
https://github.com/NVIDIA/BigVGAN

https://twitter.com/_weiping/status/1628210425480515584
2023-02-23 22:04:36 https://arxiv.org/abs/2302.10248

VoxSRC 2022: The Fourth VoxCeleb Speaker Recognition Challenge

Jaesung Huh, Andrew Brown, Jee-weon Jung, Joon Son Chung, Arsha Nagrani, Daniel Garcia-Romero, Andrew Zisserman

This paper summarises the findings from the VoxCeleb Speaker Recognition Challenge 2022 (VoxSRC-22), which was held in conjunction with INTERSPEECH 2022. The goal of this challenge was to evaluate how well state-of-the-art speaker recognition systems can diarise and recognise speakers from speech obtained "in the wild". The challenge consisted of: (i) the provision of publicly available speaker recognition and diarisation data from YouTube videos together with ground truth annotation and standardised evaluation software; and (ii) a public challenge and hybrid workshop held at INTERSPEECH 2022. We describe the four tracks of our challenge along with the baselines, methods, and results. We conclude with a discussion on the new domain-transfer focus of VoxSRC-22, and on the progression of the challenge from the previous three editions.
2023-02-23 21:41:21 From Google

https://arxiv.org/abs/2302.11186

UML: A Universal Monolingual Output Layer for Multilingual ASR

Chao Zhang, Bo Li, Tara N. Sainath, Trevor Strohman, Shuo-yiin Chang

Word-piece models (WPMs) are commonly used subword units in state-of-the-art end-to-end automatic speech recognition (ASR) systems. For multilingual ASR, due to the differences in written scripts across languages, multilingual WPMs bring the challenges of having overly large output layers and scaling to more languages. In this work, we propose a universal monolingual output layer (UML) to address such problems. Instead of one output node for only one WPM, UML re-associates each output node with multiple WPMs, one for each language, resulting in a smaller monolingual output layer shared across languages. Consequently, the UML makes it possible to switch the interpretation of each output node depending on the language of the input speech. Experimental results on an 11-language voice search task demonstrated the feasibility of using UML for high-quality and high-efficiency multilingual streaming ASR.
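
A toy illustration of the node re-interpretation idea; the vocabulary tables below are invented for the example:

# One shared output layer; each language re-interprets the same node ids
# through its own wordpiece table (tables here are made up).
uml_tables = {
    "en": ["_the", "_and", "ing", "s"],
    "fr": ["_le", "_et", "ment", "s"],
}

def decode_ids(node_ids, language):
    # Map shared softmax node indices to language-specific wordpieces.
    return [uml_tables[language][i] for i in node_ids]

print(decode_ids([0, 2], "en"))  # ['_the', 'ing']
print(decode_ids([0, 2], "fr"))  # ['_le', 'ment']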