
Speech Technology

Channel address: @speechtech
Categories: Technologies
Language: English
Subscribers: 652

Ratings & Reviews

2.67 (3 reviews)

5 stars: 1
4 stars: 0
3 stars: 0
2 stars: 1
1 star: 1


The latest messages (45)

2021-09-03 01:36:58 IBM Watson's transition to RNNT technology:

https://www.linkedin.com/posts/gakuto_watson-speech-to-text-how-to-plan-your-migration-activity-6839176747339137024-KLMA

Some features, such as grammars, were lost in the process.
218 views · 22:36
2021-09-02 23:20:07 https://twitter.com/javierjorgecano/status/1433429870478958605
224 views · 20:20
2021-09-01 01:52:49 ASAPP is looking for a Senior Speech Scientist

https://jobs.lever.co/asapp-2/0153c72c-a94e-4125-867e-a061ba285ec4
398 views · 22:52
2021-08-31 23:14:15 We released Vosk 0.3.31 and the US English model en-us-0.21

We fixed the rescoring strategy (HCLG - G + CARPA + RNNLM instead of HCLG - G + RNNLM before) and fixed the rescoring itself.

WER improvements:
tedlium: 6.93 -> 6.42
librispeech: 6.27 -> 5.43

Other models improved too. Please update.
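
For reference, a minimal decoding sketch with the updated model, using the vosk Python package (the file and model paths below are placeholders):

import json
import wave

from vosk import Model, KaldiRecognizer

# Placeholder path: point this at the unpacked en-us-0.21 model directory
model = Model("vosk-model-en-us-0.21")

wf = wave.open("test.wav", "rb")  # 16 kHz mono PCM recommended
rec = KaldiRecognizer(model, wf.getframerate())

# Feed the audio in chunks and collect the final transcript
while True:
    data = wf.readframes(4000)
    if not data:
        break
    rec.AcceptWaveform(data)

print(json.loads(rec.FinalResult())["text"])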
539 views (edited) · 20:14
2021-08-31 11:12:31 https://arxiv.org/abs/2108.13320

Neural HMMs are all you need (for high-quality attention-free TTS)
Shivam Mehta, Éva Székely, Jonas Beskow, Gustav Eje Henter
Neural sequence-to-sequence TTS has demonstrated significantly better output quality over classical statistical parametric speech synthesis using HMMs. However, the new paradigm is not probabilistic, and the use of non-monotonic attention both increases training time and introduces "babbling" failure modes that are unacceptable in production. In this paper, we demonstrate that the old and new paradigms can be combined to obtain the advantages of both worlds, by replacing the attention in Tacotron 2 with an autoregressive left-right no-skip hidden Markov model defined by a neural network. This leads to an HMM-based neural TTS model with monotonic alignment, trained to maximise the full sequence likelihood without approximations. We discuss how to combine innovations from both classical and contemporary TTS for best results. The final system is smaller and simpler than Tacotron 2 and learns to align and speak with fewer iterations, while achieving the same speech naturalness. Unlike Tacotron 2, it also allows easy control over speaking rate. Audio examples and code are available at the link below.

https://shivammehta007.github.io/Neural-HMM/
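
To make the "left-right no-skip" idea concrete, here is a toy sketch (not the paper's code) of the forward-algorithm log-likelihood for such an HMM. Monotonic alignment falls out of the transition structure, since each state can only repeat or advance by one:

import numpy as np

def forward_loglik(log_obs, log_stay, log_step):
    # log_obs: [T, N] per-frame log-likelihoods for N states
    # log_stay, log_step: [N] self-loop and advance log-probabilities
    T, N = log_obs.shape
    alpha = np.full(N, -np.inf)
    alpha[0] = log_obs[0, 0]  # left-right: must start in state 0
    for t in range(1, T):
        stay = alpha + log_stay
        # advance by exactly one state; skips are not allowed
        step = np.concatenate(([-np.inf], alpha[:-1] + log_step[:-1]))
        alpha = np.logaddexp(stay, step) + log_obs[t]
    return alpha[-1]  # and must end in the last state

In the paper's setting, the observation model and the transition probabilities would be produced by the neural network rather than fixed, but the monotonicity argument is the same.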
354 views · 08:12
2021-08-31 02:37:49 TTS and ASR get closer and closer

https://arxiv.org/abs/2108.12226

Injecting Text in Self-Supervised Speech Pretraining

Zhehuai Chen, Yu Zhang, Andrew Rosenberg, Bhuvana Ramabhadran, Gary Wang, Pedro Moreno

Self-supervised pretraining for Automated Speech Recognition (ASR) has shown varied degrees of success. In this paper, we propose to jointly learn representations during pretraining from two different modalities: speech and text. The proposed method, tts4pretrain, complements the power of contrastive learning in self-supervision with linguistic/lexical representations derived from synthesized speech, effectively learning from untranscribed speech and unspoken text. Lexical learning in the speech encoder is enforced through an additional sequence loss term that is coupled with contrastive loss during pretraining. We demonstrate that this novel pretraining method yields Word Error Rate (WER) reductions of 10% relative on the well-benchmarked LibriSpeech task over a state-of-the-art baseline pretrained with wav2vec2.0 only. The proposed method also serves as an effective strategy to compensate for the lack of transcribed speech, effectively matching the performance of 5000 hours of transcribed speech with just 100 hours of transcribed speech on the AMI meeting transcription task. Finally, we demonstrate WER reductions of up to 15% on an in-house Voice Search task over traditional pretraining. Incorporating text into encoder pretraining is complementary to rescoring with a larger or in-domain language model, resulting in an additional 6% relative reduction in WER.
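
A rough sketch of the joint objective described in the abstract (not the authors' code; the weight alpha is a hypothetical illustration, not a value from the paper):

def tts4pretrain_loss(contrastive_loss, sequence_loss, alpha=1.0):
    # contrastive_loss: wav2vec2.0-style self-supervised loss on speech
    # sequence_loss: ties the encoder to text rendered as synthesized speech
    # alpha: hypothetical weighting term, not from the paper
    return contrastive_loss + alpha * sequence_loss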
364 views (edited) · 23:37
2021-08-30 14:21:00 EasyCall Dysarthric speech corpus

The EasyCall corpus is a database of command speech recorded from healthy individuals and dysarthric patients. The dataset was collected through a collaboration between the Italian Institute of Technology (IIT), the University of Ferrara, and the Sant'Anna Hospital of Ferrara, and it aims to provide a new resource for future development of ASR-based assistive technologies. In particular, it may be exploited to develop a voice-controlled contact application for commercial smartphones and to improve dysarthric patients' ability to communicate with their families and caregivers.

It currently consists of 16,683 audio recordings from 21 healthy and 26 dysarthric speakers. For each speech-impaired individual, dysarthria was assessed by a neurologist through the Therapy Outcome Measure. The recordings focus on a small vocabulary of basic smartphone commands, such as "open contacts", "start call", and "end call". Specifically, these commands are the result of a survey administered to patients to evaluate which commands are most likely to be employed by dysarthric individuals using a speech-command-based contact application. In addition, the dataset includes a list of non-commands (i.e., words near/inside commands or phonetically close to commands) that can be leveraged to build a more robust ASR system.

PROJECT TEAM:
Rosanna Turrisi, Arianna Braccia, Marco Emanuele, Simone Giulietti, Luciano Fadiga, Mariachiara Sensi, Leonardo Badino.

https://www.isca-speech.org/archive/interspeech_2021/turrisi21_interspeech.html

http://neurolab.unife.it/easycallcorpus/
356 views (edited) · 11:21
2021-08-28 01:26:27 https://github.com/NVIDIA/NeMo/releases/tag/v1.3.0

NVIDIA Neural Modules 1.3.0

Added
RNNT Exportable to ONNX #2510 (see the sketch after this list)
Multi-batch inference support for speaker diarization #2522
DALI Integration for char/subword ASR #2567
VAD Postprocessing #2636
Perceiver encoder for NMT #2621
gRPC NMT server #2656
German ITN #2486
Russian TN and ITN #2519
Save/restore connector #2592
PTL 1.4+ #2600

Tutorial Notebooks
Non-English downstream NLP task #2532
RNNT Basics #2651

Bug Fixes
NMESC clustering for very small audio files #2566
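
As a pointer for the RNNT ONNX export added in #2510, a minimal sketch; the checkpoint name below is only an example, and any pretrained NeMo transducer model should work the same way:

import nemo.collections.asr as nemo_asr

# Example checkpoint name -- substitute your own RNNT/transducer model
model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained("stt_en_contextnet_256")

# Exportable models write an ONNX graph; the RNNT export may be split
# into separate encoder and decoder/joint graphs under the hood
model.export("rnnt.onnx")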
453 views (edited) · 22:26
2021-08-27 11:49:14 https://speech.fit.vutbr.cz/events/SMM21/

Interspeech starts only on Monday, but on Friday, 27 August, the satellite workshop Speech, Music and Mind 2021 (SMM21) takes place, with Pavel Matejka involved in the organization. It's virtual and free, but registration is needed.
439 views · 08:49
2021-08-27 00:32:11 https://ssw11.hte.hu/

The Speech Synthesis Workshop is going on right now (Aug 26 - Aug 28).
444 views · 21:32