
Speech Technology

Channel address: @speechtech
Categories: Technologies
Language: English
Subscribers: 652

Ratings & Reviews

2.67 average from 3 reviews (5 stars: 1, 4 stars: 0, 3 stars: 0, 2 stars: 1, 1 star: 1). Reviews can be left only by registered users; all reviews are moderated by admins.


The latest messages (39)

2022-07-17 17:43:35 We have just released a big Italian model for Vosk

https://alphacephei.com/vosk/models/vosk-model-it-0.22.zip

WER:

8.10 (cv test)
15.68 (mls)
11.23 (mtedx)

Small model WER:

16.88 (cv test)
25.87 (mls)
17.01 (mtedx)

We probably need to put all those models on Hugging Face.
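The WER figures above are word-level Levenshtein edit distances normalized by reference length. A minimal pure-Python sketch of the metric (illustrative only, not Vosk's own scoring script):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)
```

Production evaluations typically also apply text normalization before scoring, which this sketch omits.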
2022-07-15 12:23:42 Today we launched an alpha version of a transcription tool by @ccoreilly that uses Vosk in the browser to transcribe and edit speech recognition results. It uses the small Vosk models; we would appreciate it if you could take a look, try it, and give us feedback! https://otranscribe.bsc.es/

It is a fork of oTranscribe https://github.com/projecte-aina/oTranscribe-plus
2022-07-14 19:20:25 https://techcrunch.com/2022/07/14/flush-with-new-cash-assemblyai-looks-to-grow-its-ai-as-a-service-business
2022-07-13 16:13:08 https://arxiv.org/abs/2207.05071

Online Continual Learning of End-to-End Speech Recognition Models

Muqiao Yang, Ian Lane, Shinji Watanabe

Continual Learning, also known as Lifelong Learning, aims to continually learn from new data as it becomes available. While prior research on continual learning in automatic speech recognition has focused on the adaptation of models across multiple different speech recognition tasks, in this paper we propose an experimental setting for online continual learning for automatic speech recognition of a single task. Specifically focusing on the case where additional training data for the same task becomes available incrementally over time, we demonstrate the effectiveness of performing incremental model updates to end-to-end speech recognition models with an online Gradient Episodic Memory (GEM) method. Moreover, we show that with online continual learning and a selective sampling strategy, we can maintain an accuracy similar to retraining a model from scratch while requiring significantly lower computation costs. We have also verified our method with self-supervised learning (SSL) features.
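The core of GEM is a gradient constraint: if the gradient on new data conflicts with the gradient computed on the episodic memory (negative dot product), the new gradient is projected so it no longer increases the memory loss. A minimal single-constraint sketch (illustrative; the paper's online variant involves more machinery):

```python
def gem_project(g, g_ref):
    """Project gradient g so it does not conflict with the episodic-memory
    gradient g_ref (single-constraint GEM). Both are flat lists of floats."""
    dot = sum(a * b for a, b in zip(g, g_ref))
    if dot >= 0.0:
        return list(g)          # no conflict: keep the gradient unchanged
    norm2 = sum(b * b for b in g_ref)
    # Remove the component of g that points against g_ref.
    return [a - (dot / norm2) * b for a, b in zip(g, g_ref)]
```

With multiple memory constraints, the full GEM formulation solves a quadratic program instead of this closed-form projection.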
2022-07-13 16:08:51 https://www.muse-challenge.org/

The MuSe 2022 Multimodal Sentiment Analysis Challenge: Humor, Emotional Reactions, and Stress

Lukas Christ, Shahin Amiriparian, Alice Baird, Panagiotis Tzirakis, Alexander Kathan, Niklas Müller, Lukas Stappen, Eva-Maria Meßner, Andreas König, Alan Cowen, Erik Cambria, Björn W. Schuller

The Multimodal Sentiment Analysis Challenge (MuSe) 2022 is dedicated to multimodal sentiment and emotion recognition. For this year's challenge, we feature three datasets: (i) the Passau Spontaneous Football Coach Humor (Passau-SFCH) dataset, which contains audio-visual recordings of German football coaches labelled for the presence of humour; (ii) the Hume-Reaction dataset, in which reactions of individuals to emotional stimuli have been annotated with respect to seven emotional expression intensities; and (iii) the Ulm-Trier Social Stress Test (Ulm-TSST) dataset, comprising audio-visual data labelled with continuous emotion values (arousal and valence) of people in stressful dispositions. Using the introduced datasets, MuSe 2022 addresses three contemporary affective computing problems: in the Humor Detection Sub-Challenge (MuSe-Humor), spontaneous humour has to be recognised; in the Emotional Reactions Sub-Challenge (MuSe-Reaction), seven fine-grained 'in-the-wild' emotions have to be predicted; and in the Emotional Stress Sub-Challenge (MuSe-Stress), a continuous prediction of stressed emotion values is featured. The challenge is designed to attract different research communities, encouraging a fusion of their disciplines. Mainly, MuSe 2022 targets the communities of audio-visual emotion recognition, health informatics, and symbolic sentiment analysis. This baseline paper describes the datasets as well as the feature sets extracted from them. A recurrent neural network with LSTM cells is used to set competitive baseline results on the test partitions for each sub-challenge. We report an Area Under the Curve (AUC) of .8480 for MuSe-Humor, a mean (over 7 classes) Pearson's Correlation Coefficient of .2801 for MuSe-Reaction, and Concordance Correlation Coefficients (CCC) of .4931 and .4761 for valence and arousal, respectively, in MuSe-Stress.
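The CCC used for MuSe-Stress is Lin's concordance coefficient, which penalizes both low correlation and systematic offset between prediction and label. A small pure-Python sketch:

```python
def ccc(x, y):
    """Concordance Correlation Coefficient (Lin):
    2*cov(x,y) / (var(x) + var(y) + (mean(x) - mean(y))**2),
    using population (1/n) variance and covariance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return 2 * cov / (vx + vy + (mx - my) ** 2)
```

Unlike Pearson's correlation, CCC drops below 1 for predictions that are perfectly correlated but shifted or rescaled relative to the labels.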
2022-07-13 01:34:34 https://hal.archives-ouvertes.fr/hal-03712735/

Abstract: Evaluating automatic speech recognition (ASR) systems is a classical but difficult and still open problem, which often boils down to focusing only on the word error rate (WER). However, this metric suffers from many limitations and does not allow an in-depth analysis of automatic transcription errors. In this paper, we propose to study and understand the impact of rescoring using language models in ASR systems by means of several metrics often used in other natural language processing (NLP) tasks in addition to the WER. In particular, we introduce two measures related to morpho-syntactic and semantic aspects of transcribed words: 1) the POSER (Part-of-speech Error Rate), which should highlight the grammatical aspects, and 2) the EmbER (Embedding Error Rate), a measurement that modifies the WER by providing a weighting according to the semantic distance of the wrongly transcribed words. These metrics illustrate the linguistic contributions of the language models that are applied during a posterior rescoring step on transcription hypotheses.
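The EmbER idea can be sketched as follows: over already-aligned (reference, hypothesis) word pairs, a correct word costs nothing and a substitution costs the cosine distance between the two word embeddings, so semantically close errors are penalized less. This is an illustrative toy (with a made-up embedding table), not the paper's exact formulation:

```python
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def ember(aligned_pairs, emb):
    """EmbER-style score over pre-aligned (ref, hyp) word pairs:
    each substitution is weighted by the embedding cosine distance."""
    total = 0.0
    for ref, hyp in aligned_pairs:
        if ref != hyp:
            total += 1.0 - cosine(emb[ref], emb[hyp])
    return total / len(aligned_pairs)
```

A full implementation would also handle insertions and deletions and run the usual WER alignment first.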
2022-07-12 23:34:25 An interesting point to train TTS and ASR jointly

https://arxiv.org/abs/2207.04659

Speaker consistency loss and step-wise optimization for semi-supervised joint training of TTS and ASR using unpaired text data

Naoki Makishima, Satoshi Suzuki, Atsushi Ando, Ryo Masumura

In this paper, we investigate the semi-supervised joint training of text-to-speech (TTS) and automatic speech recognition (ASR), where a small amount of paired data and a large amount of unpaired text data are available. Conventional studies form a cycle called the TTS-ASR pipeline, where the multispeaker TTS model synthesizes speech from text with a reference speech and the ASR model reconstructs the text from the synthesized speech, after which both models are trained with a cycle-consistency loss. However, the synthesized speech does not reflect the speaker characteristics of the reference speech, and the synthesized speech becomes overly easy for the ASR model to recognize after training. This not only decreases the TTS model quality but also limits the ASR model improvement. To solve this problem, we propose improving the cycle-consistency-based training with a speaker consistency loss and step-wise optimization. The speaker consistency loss brings the speaker characteristics of the synthesized speech closer to those of the reference speech. In the step-wise optimization, we first freeze the parameters of the TTS model before both models are trained, to avoid over-adaptation of the TTS model to the ASR model. Experimental results demonstrate the efficacy of the proposed method.
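A speaker consistency loss of this kind is commonly built from speaker embeddings of the reference and synthesized utterances, e.g. one minus their cosine similarity. A minimal sketch, assuming the embeddings are already extracted (the paper's exact loss may differ):

```python
def speaker_consistency_loss(emb_ref, emb_syn):
    """1 - cosine similarity between the speaker embedding of the
    reference speech and that of the TTS output: zero when the
    synthesized voice matches the reference speaker's embedding."""
    dot = sum(a * b for a, b in zip(emb_ref, emb_syn))
    norm_r = sum(a * a for a in emb_ref) ** 0.5
    norm_s = sum(b * b for b in emb_syn) ** 0.5
    return 1.0 - dot / (norm_r * norm_s)
```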
2022-07-12 21:08:06 We have just released an updated small Italian model for Vosk

https://alphacephei.com/vosk/models/vosk-model-small-it-0.22.zip

WER:

16.88 (cv test)
25.87 (mls)
17.01 (mtedx)

Previous model WER:

31.62 (cv test)
35.31 (mls)
24.49 (mtedx)

A significant improvement.
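In relative terms, the update cuts WER by roughly a quarter to a half depending on the test set; a quick check of the numbers above:

```python
# Relative WER reduction of the new small Italian model over the previous one.
results = {
    "cv test": (31.62, 16.88),
    "mls":     (35.31, 25.87),
    "mtedx":   (24.49, 17.01),
}
for name, (old, new) in results.items():
    rel = 100.0 * (old - new) / old
    print(f"{name}: {rel:.1f}% relative WER reduction")
```

This gives about 46.6% relative on Common Voice, 26.7% on MLS, and 30.5% on mTEDx.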
2022-07-11 00:43:22 https://cmusphinx.github.io/2022/06/update/
A set of links on RNNT fast decoding strategies from brother Mddct

https://github.com/wenet-e2e/wenet/issues/1269
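The baseline these strategies speed up is greedy RNN-T decoding: at each encoder frame, emit symbols while the joint network's argmax is non-blank, then advance to the next frame. A toy sketch, where `joint(enc_t, last_symbol)` is a stand-in for the joint and prediction networks (real decoders thread the predictor's recurrent state instead of just the last symbol):

```python
def rnnt_greedy_decode(joint, enc_frames, blank=0, max_symbols_per_frame=10):
    """Greedy RNN-T decoding sketch: loop over encoder frames, emitting
    non-blank symbols until the joint network predicts blank."""
    hyp = []
    for enc_t in enc_frames:
        for _ in range(max_symbols_per_frame):   # cap emissions per frame
            probs = joint(enc_t, hyp[-1] if hyp else blank)
            k = max(range(len(probs)), key=lambda i: probs[i])
            if k == blank:
                break                            # blank: advance to next frame
            hyp.append(k)
    return hyp
```

The issue above collects tricks for making this loop fast in practice (batching frames, pruning, caching predictor states).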