
Speech Technology

Channel address: @speechtech
Categories: Technologies
Language: English
Subscribers: 652

The latest messages (43)

2021-12-16 00:33:24 Much better theory of emotion

https://arxiv.org/abs/2112.06603

Detecting Emotion Carriers by Combining Acoustic and Lexical Representations

Sebastian P. Bayerl, Aniruddha Tammewar, Korbinian Riedhammer, Giuseppe Riccardi

Personal narratives (PN) - spoken or written - are recollections of facts, people, events, and thoughts from one's own experience. Emotion recognition and sentiment analysis tasks are usually defined at the utterance or document level. However, in this work, we focus on Emotion Carriers (EC) defined as the segments (speech or text) that best explain the emotional state of the narrator ("loss of father", "made me choose"). Once extracted, such EC can provide a richer representation of the user state to improve natural language understanding and dialogue modeling. In previous work, it has been shown that EC can be identified using lexical features. However, spoken narratives should provide a richer description of the context and the users' emotional state. In this paper, we leverage word-based acoustic and textual embeddings as well as early and late fusion techniques for the detection of ECs in spoken narratives. For the acoustic word-level representations, we use Residual Neural Networks (ResNet) pretrained on separate speech emotion corpora and fine-tuned to detect EC. Experiments with different fusion and system combination strategies show that late fusion leads to significant improvements for this task.
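
To make the fusion contrast concrete, here is a minimal PyTorch sketch of the two strategies; module names, embedding dimensions, and the simple logit averaging in the late-fusion head are illustrative assumptions, not the authors' architecture.

import torch
import torch.nn as nn

class EarlyFusionTagger(nn.Module):
    """Concatenate acoustic and lexical word embeddings, then tag each word."""
    def __init__(self, d_acoustic=256, d_lexical=300, n_labels=2):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(d_acoustic + d_lexical, 128), nn.ReLU(),
            nn.Linear(128, n_labels))

    def forward(self, acoustic, lexical):  # each: (batch, words, dim)
        return self.classifier(torch.cat([acoustic, lexical], dim=-1))

class LateFusionTagger(nn.Module):
    """Score each modality separately, then average the per-word logits."""
    def __init__(self, d_acoustic=256, d_lexical=300, n_labels=2):
        super().__init__()
        self.acoustic_head = nn.Linear(d_acoustic, n_labels)
        self.lexical_head = nn.Linear(d_lexical, n_labels)

    def forward(self, acoustic, lexical):
        return 0.5 * (self.acoustic_head(acoustic) + self.lexical_head(lexical))

# Toy usage: 4 words per narrative, binary EC / not-EC tagging.
acoustic = torch.randn(1, 4, 256)  # e.g. ResNet word-level embeddings
lexical = torch.randn(1, 4, 300)   # e.g. pretrained word embeddings
print(EarlyFusionTagger()(acoustic, lexical).shape)  # torch.Size([1, 4, 2])
print(LateFusionTagger()(acoustic, lexical).shape)   # torch.Size([1, 4, 2])
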
2021-12-16 00:32:55 An overview paper from Google with a nice vision

Revisiting the Boundary between ASR and NLU in the Age of Conversational Dialog Systems

https://arxiv.org/abs/2112.05842

Manaal Faruqui, Dilek Hakkani-Tür
As more users across the world are interacting with dialog agents in their daily life, there is a need for better speech understanding that calls for renewed attention to the dynamics between research in automatic speech recognition (ASR) and natural language understanding (NLU). We briefly review these research areas and lay out the current relationship between them. In light of the observations we make in this paper, we argue that (1) NLU should be cognizant of the presence of ASR models being used upstream in a dialog system's pipeline, (2) ASR should be able to learn from errors found in NLU, (3) there is a need for end-to-end datasets that provide semantic annotations on spoken input, (4) there should be stronger collaboration between ASR and NLU research communities.
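
Point (1) can be made concrete: rather than training NLU only on gold transcripts, expose it to the hypotheses the upstream recognizer actually produces. A hypothetical sketch (the transcribe() call and the data layout are placeholders, not an API from the paper):

# Hypothetical sketch: harden NLU against upstream ASR errors by training
# on ASR hypotheses alongside gold transcripts. asr_model.transcribe() is
# a placeholder for whatever recognizer sits upstream in the pipeline.
def build_nlu_training_pairs(examples, asr_model):
    """examples: iterable of (audio, gold_transcript, intent_label) triples."""
    pairs = []
    for audio, gold_text, intent in examples:
        hypothesis = asr_model.transcribe(audio)  # 1-best ASR output
        pairs.append((gold_text, intent))         # clean view
        pairs.append((hypothesis, intent))        # noisy view the NLU will see
    return pairs
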
2021-12-15 23:33:41 PeoplesSpeech finally released

https://twitter.com/GregoryDiamos/status/1470899348070223873

with clean CC licenses allowing both academic and commercial use!

Special shout-out to Daniel Galvez, Mark Mazumder, and the whole team, who put in a huge effort to create this dataset.
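
A minimal sketch of pulling a few utterances, assuming the corpus is mirrored on the Hugging Face Hub under MLCommons/peoples_speech (the repo id, config name, and column names are assumptions; check the release announcement for the authoritative download):

from datasets import load_dataset

# Stream the corpus rather than downloading the full release up front.
ds = load_dataset("MLCommons/peoples_speech", "clean",
                  split="train", streaming=True)
sample = next(iter(ds))
print(sample["text"])                    # transcript (column name assumed)
print(sample["audio"]["sampling_rate"])  # decoded waveform metadata
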
2021-12-15 23:30:24 The average accuracy of our proposed method was 0.5648 for classifying four words

Decoding High-level Imagined Speech using Attention-based Deep Neural Networks

https://arxiv.org/abs/2112.06922
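
For scale, chance level on a balanced four-class task is 0.25, so 0.5648 is roughly 2.3x chance. A generic attention-pooling classifier of the kind the title suggests, as a sketch (input shapes and layer sizes are assumptions, not the paper's architecture):

import torch
import torch.nn as nn

class AttentionPoolClassifier(nn.Module):
    """Attention-pool a sequence of brain-signal feature frames, then
    classify one of four imagined words."""
    def __init__(self, d_in=64, d_hidden=128, n_classes=4):
        super().__init__()
        self.encoder = nn.GRU(d_in, d_hidden, batch_first=True)
        self.attn = nn.Linear(d_hidden, 1)      # scalar score per frame
        self.head = nn.Linear(d_hidden, n_classes)

    def forward(self, x):                       # x: (batch, frames, d_in)
        h, _ = self.encoder(x)                  # (batch, frames, d_hidden)
        w = torch.softmax(self.attn(h), dim=1)  # attention weights over frames
        return self.head((w * h).sum(dim=1))    # (batch, n_classes)

logits = AttentionPoolClassifier()(torch.randn(2, 100, 64))
print(logits.shape)  # torch.Size([2, 4])
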
2021-12-15 22:46:33 https://www.openslr.org/115/

Summary: a database of emotional speech intended to be open-sourced and used for synthesis and generation purposes. It contains data for male and female actors in English (https://github.com/numediart/EmoV-DB).
2021-12-15 22:32:52 The topic is interesting:

https://europe.naverlabs.com/job/joint-asr-and-repunctuation-for-better-machine-and-human-readable-transcripts-internship/
2021-12-15 22:30:47 https://www.apptek.com/post/asru-2021-kicks-off-this-week-in-cartagena-colombia
2021-12-11 20:23:53 The related paper is also interesting:

https://arxiv.org/pdf/2110.04109.pdf

Hierarchical Conditional End-to-End ASR with CTC and Multi-Granular Subword Units

Yosuke Higuchi, Keita Karube, Tetsuji Ogawa, Tetsunori Kobayashi

In end-to-end automatic speech recognition (ASR), a model is expected to implicitly learn representations suitable for recognizing a word-level sequence. However, the huge abstraction gap between input acoustic signals and output linguistic tokens makes it challenging for a model to learn the representations. In this work, to promote the word-level representation learning in end-to-end ASR, we propose a hierarchical conditional model that is based on connectionist temporal classification (CTC). Our model is trained by auxiliary CTC losses applied to intermediate layers, where the vocabulary size of each target subword sequence is gradually increased as the layer becomes close to the word-level output. Here, we make each level of sequence prediction explicitly conditioned on the previous sequences predicted at lower levels. With the proposed approach, we expect the proposed model to learn the word-level representations effectively by exploiting a hierarchy of linguistic structures. Experimental results on LibriSpeech-{100h, 960h} and TEDLIUM2 demonstrate that the proposed model improves over a standard CTC-based model and other competitive models from prior work. We further analyze the results to confirm the effectiveness of the intended representation learning with our model.
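
The central trick — auxiliary CTC losses at intermediate layers whose target vocabulary grows toward the word level — can be sketched as follows; layer types, vocabulary sizes, and loss weights are illustrative, and the explicit conditioning on lower-level predictions is omitted for brevity.

import torch
import torch.nn as nn

class HierarchicalCTCEncoder(nn.Module):
    """Encoder with an auxiliary CTC head after each block; target
    vocabularies grow from small subwords toward word-level units."""
    def __init__(self, d_model=256, vocab_sizes=(256, 1024, 8000)):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            for _ in vocab_sizes)
        self.heads = nn.ModuleList(nn.Linear(d_model, v) for v in vocab_sizes)

    def forward(self, x):                 # x: (batch, time, d_model)
        logits = []
        for block, head in zip(self.blocks, self.heads):
            x = block(x)
            logits.append(head(x))        # one logit tensor per granularity
        return logits

def hierarchical_ctc_loss(logits, targets, in_lens, tgt_lens,
                          weights=(0.3, 0.3, 1.0)):
    """Weighted sum of CTC losses, one per granularity level."""
    ctc = nn.CTCLoss(blank=0, zero_infinity=True)
    total = 0.0
    for w, lg, tgt, tl in zip(weights, logits, targets, tgt_lens):
        log_probs = lg.log_softmax(-1).transpose(0, 1)  # (time, batch, vocab)
        total = total + w * ctc(log_probs, tgt, in_lens, tl)
    return total
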
2021-12-11 20:10:41 I took a closer look at popular BPE segments. In my opinion, the segments selected by sentencepiece by default are too long. For example, if we dump a standard 5000-unit BPE vocabulary from ESPnet/WeNet, there are units like _ANIMALS. It is very hard for an acoustic model to detect such long units. It looks like sentencepiece needs some adjustment so that segments come out shorter and more suitable for acoustic models.

NeMo uses 129 units and thus has shorter chunks, but there are still units like "ve" or "think" which are not really acoustically motivated.
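
sentencepiece does expose a knob for this: a sketch of training a BPE model with a cap on piece length (the corpus path is a placeholder, and 4 characters is just an illustrative limit):

import sentencepiece as spm

# Train a BPE model whose longest piece is capped so that units stay
# short enough to be acoustically plausible ("corpus.txt" is a placeholder).
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="bpe_short",
    model_type="bpe",
    vocab_size=5000,
    max_sentencepiece_length=4,  # default is 16 characters
)

sp = spm.SentencePieceProcessor(model_file="bpe_short.model")
print(sp.encode("animals", out_type=str))  # short pieces instead of "_ANIMALS"
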

This paper confirms the point:

https://arxiv.org/abs/2104.09106

Acoustic Data-Driven Subword Modeling for End-to-End Speech Recognition

Wei Zhou, Mohammad Zeineldeen, Zuoyun Zheng, Ralf Schlüter, Hermann Ney

Subword units are commonly used for end-to-end automatic speech recognition (ASR), while a fully acoustic-oriented subword modeling approach is somewhat missing. We propose an acoustic data-driven subword modeling (ADSM) approach that adapts the advantages of several text-based and acoustic-based subword methods into one pipeline. With a fully acoustic-oriented label design and learning process, ADSM produces acoustic-structured subword units and acoustic-matched target sequence for further ASR training. The obtained ADSM labels are evaluated with different end-to-end ASR approaches including CTC, RNN-Transducer and attention models. Experiments on the LibriSpeech corpus show that ADSM clearly outperforms both byte pair encoding (BPE) and pronunciation-assisted subword modeling (PASM) in all cases. Detailed analysis shows that ADSM achieves acoustically more logical word segmentation and more balanced sequence length, and thus, is suitable for both time-synchronous and label-synchronous models. We also briefly describe how to apply acoustic-based subword regularization and unseen text segmentation using ADSM.
2021-12-09 23:41:32 From Max Ryabinin, Andrey Malinin, and Mark Gales: more on ensembles for OOD detection, with application to speech

https://openreview.net/forum?id=7S3RMGVS5vO
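
The standard ensemble-based decomposition behind this line of work splits total uncertainty into data and knowledge uncertainty; a small numpy sketch (not the paper's code):

import numpy as np

def uncertainty_decomposition(probs):
    """probs: (n_members, n_classes) ensemble of predictive distributions.
    Returns total, data (expected), and knowledge (mutual info) uncertainty."""
    eps = 1e-12
    mean = probs.mean(axis=0)
    total = -(mean * np.log(mean + eps)).sum()               # entropy of the mean
    data = -(probs * np.log(probs + eps)).sum(axis=1).mean() # mean of the entropies
    knowledge = total - data                                 # mutual information
    return total, data, knowledge

# Members that disagree -> high knowledge uncertainty, a typical OOD signal.
ens = np.array([[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]])
print(uncertainty_decomposition(ens))
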