
Speech Technology

Channel address: @speechtech
Categories: Technologies
Language: English
Subscribers: 652

Ratings & Reviews

2.67 (3 reviews)

5 stars: 1 / 4 stars: 0 / 3 stars: 0 / 2 stars: 1 / 1 star: 1


Latest Messages

2023-07-07 15:52:21 Besides diarization with tinydiarize, Whisper can also do audio tagging well

https://arxiv.org/abs/2307.03183

Whisper-AT: Noise-Robust Automatic Speech Recognizers are Also Strong General Audio Event Taggers

Yuan Gong, Sameer Khurana, Leonid Karlinsky, James Glass

In this paper, we focus on Whisper, a recent automatic speech recognition model trained with a massive 680k hour labeled speech corpus recorded in diverse conditions. We first show an interesting finding that while Whisper is very robust against real-world background sounds (e.g., music), its audio representation is actually not noise-invariant, but is instead highly correlated to non-speech sounds, indicating that Whisper recognizes speech conditioned on the noise type. With this finding, we build a unified audio tagging and speech recognition model Whisper-AT by freezing the backbone of Whisper, and training a lightweight audio tagging model on top of it. With <1% extra computational cost, Whisper-AT can recognize audio events, in addition to spoken text, in a single forward pass.
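
The recipe described in the abstract is: freeze the Whisper backbone and train a small tagging head on its encoder representations. A minimal sketch of that idea using the openai-whisper package (the head, pooling, and number of classes here are illustrative assumptions, not the authors' code):

```python
# Sketch: frozen Whisper encoder + lightweight audio-tagging head (illustrative only).
import torch
import torch.nn as nn
import whisper

whisper_model = whisper.load_model("base", device="cpu")  # frozen ASR backbone
for p in whisper_model.parameters():
    p.requires_grad = False

class TaggingHead(nn.Module):
    """Lightweight classifier on pooled Whisper encoder features."""
    def __init__(self, d_model: int, n_events: int = 527):  # 527 ~ AudioSet classes (assumption)
        super().__init__()
        self.proj = nn.Linear(d_model, n_events)

    def forward(self, enc_out: torch.Tensor) -> torch.Tensor:
        return self.proj(enc_out.mean(dim=1))  # mean-pool over time, then classify

head = TaggingHead(whisper_model.dims.n_audio_state)

audio = whisper.pad_or_trim(whisper.load_audio("example.wav"))
mel = whisper.log_mel_spectrogram(audio).unsqueeze(0)  # (1, 80, 3000)

with torch.no_grad():
    enc_out = whisper_model.encoder(mel)  # same forward pass the ASR decoder consumes
event_logits = head(enc_out)              # train only `head`, e.g. with BCE on tag labels
```
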
2023-07-07 14:50:11 The Cambridge team is always doing nice research

https://arxiv.org/abs/2307.03088

Label-Synchronous Neural Transducer for End-to-End ASR
Keqi Deng, Philip C. Woodland
Neural transducers provide a natural approach to streaming ASR. However, they augment output sequences with blank tokens which leads to challenges for domain adaptation using text data. This paper proposes a label-synchronous neural transducer (LS-Transducer), which extracts a label-level encoder representation before combining it with the prediction network output. Hence blank tokens are no longer needed and the prediction network can be easily adapted using text data. An Auto-regressive Integrate-and-Fire (AIF) mechanism is proposed to generate the label-level encoder representation while retaining the streaming property. In addition, a streaming joint decoding method is designed to improve ASR accuracy. Experiments show that compared to standard neural transducers, the proposed LS-Transducer gave a 10% relative WER reduction (WERR) for intra-domain Librispeech-100h data, as well as 17% and 19% relative WERRs on cross-domain TED-LIUM 2 and AESRC2020 data with an adapted prediction network.
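
The AIF mechanism belongs to the integrate-and-fire family: frame-level weights are accumulated and a label-level vector is emitted each time the running sum crosses a threshold. A minimal, non-autoregressive sketch of that accumulate-and-fire step (illustrative only; the paper's AIF additionally conditions the weights auto-regressively):

```python
# Sketch: integrate-and-fire aggregation from frame-level to label-level vectors.
import torch

def integrate_and_fire(enc: torch.Tensor, alpha: torch.Tensor, threshold: float = 1.0):
    """enc: (T, D) frame-level encoder outputs; alpha: (T,) non-negative weights.
    Accumulate weights over time and emit one label-level vector whenever the
    running sum crosses `threshold`, splitting the boundary frame proportionally."""
    labels, acc, vec = [], 0.0, torch.zeros(enc.size(1))
    for t in range(enc.size(0)):
        a = float(alpha[t])
        while acc + a >= threshold:          # this frame completes (at least) one label
            used = threshold - acc           # portion of the frame that finishes the label
            labels.append(vec + used * enc[t])
            a -= used
            acc, vec = 0.0, torch.zeros(enc.size(1))
        acc += a                             # leftover weight carries into the next label
        vec = vec + a * enc[t]
    return torch.stack(labels) if labels else enc.new_zeros(0, enc.size(1))

# Example: 8 frames of 4-dim features whose weights yield 3 label-level vectors.
enc = torch.randn(8, 4)
alpha = torch.tensor([0.4, 0.5, 0.4, 0.6, 0.3, 0.4, 0.3, 0.3])
print(integrate_and_fire(enc, alpha).shape)  # torch.Size([3, 4])
```
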
2023-07-05 18:25:14 Are you skilled at generating synthesized or converted speech samples? Are you concerned about the potential implications of deepfake speech? Are you interested in contributing to advancing the technology for detecting such 'fake' speech using machine learning?

If yes, you are warmly invited to contribute to the fifth edition of the ASVspoof (Automatic Speaker Verification and Spoofing Countermeasures) challenge! ASVspoof is centred on the challenge of designing spoofing-robust automatic speaker verification solutions and application-agnostic speech deepfake detectors.

You may join us either as a data provider (phase 1) or as a challenge participant (phase 2). We are now inviting expressions of interest from potential data contributors.

For further details, please refer to the ASVspoof 5 Evaluation Plan which can be downloaded from our website at: https://www.asvspoof.org/

Kind regards,
On behalf of the ASVspoof 5 organising committee
organisers@lists.asvspoof.org

July 1, 2023 - Phase 1 registration opens
July 1, 2023 - training and development data available
July 1, 2023 - TTS/VC adaptation and input data available
July 1, 2023 - surrogate ASV/CM available
July 15, 2023 - Phase 1 CodaLab platform opens
July 15 to September 15, 2023 - submit TTS/VC spoofed data
2023-07-05 14:24:11 The approach is reasonable at least

https://github.com/akashmjn/tinydiarize
2023-06-30 11:40:30 Another similar one with LLaMA

https://arxiv.org/abs/2306.16007

Prompting Large Language Models for Zero-Shot Domain Adaptation in Speech Recognition
Yuang Li, Yu Wu, Jinyu Li, Shujie Liu
The integration of Language Models (LMs) has proven to be an effective way to address domain shifts in speech recognition. However, these approaches usually require a significant amount of target domain text data for the training of LMs. Different from these methods, in this work, with only a domain-specific text prompt, we propose two zero-shot ASR domain adaptation methods using LLaMA, a 7-billion-parameter large language model (LLM). LLM is used in two ways: 1) second-pass rescoring: reranking N-best hypotheses of a given ASR system with LLaMA; 2) deep LLM-fusion: incorporating LLM into the decoder of an encoder-decoder based ASR system. Experiments show that, with only one domain prompt, both methods can effectively reduce word error rates (WER) on out-of-domain TedLium-2 and SPGISpeech datasets. Especially, the deep LLM-fusion has the advantage of better recall of entity and out-of-vocabulary words.
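
For the second-pass rescoring variant, a minimal sketch of the idea: score each N-best hypothesis with a causal LLM conditioned on a short domain prompt, then interpolate with the ASR score. The model name, prompt, and interpolation weight below are illustrative assumptions, not the paper's setup:

```python
# Sketch: domain-prompted N-best rescoring with a causal LLM (illustrative only).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # any causal LM works for this sketch
tok = AutoTokenizer.from_pretrained(model_name)
lm = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

domain_prompt = "The following is a transcript of a financial earnings call: "

@torch.no_grad()
def lm_logprob(hypothesis: str) -> float:
    """Average log-probability of the hypothesis tokens given the domain prompt."""
    prompt_ids = tok(domain_prompt, return_tensors="pt").input_ids.to(lm.device)
    hyp_ids = tok(hypothesis, add_special_tokens=False, return_tensors="pt").input_ids.to(lm.device)
    ids = torch.cat([prompt_ids, hyp_ids], dim=-1)
    logits = lm(ids).logits[:, :-1]               # logits at position t predict token t+1
    logprobs = torch.log_softmax(logits.float(), dim=-1)
    token_lp = logprobs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp[:, prompt_ids.size(1) - 1:].mean().item()  # score only the hypothesis

def rescore(nbest: list[tuple[str, float]], weight: float = 0.3) -> str:
    """nbest: (hypothesis, ASR log-score) pairs; returns the best after interpolation."""
    return max(nbest, key=lambda h: (1 - weight) * h[1] + weight * lm_logprob(h[0]))[0]
```
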
2023-06-30 11:20:03 From

https://arxiv.org/abs/2306.17103

LyricWhiz: Robust Multilingual Zero-shot Lyrics Transcription by Whispering to ChatGPT
Le Zhuo, Ruibin Yuan, Jiahao Pan, Yinghao Ma, Yizhi LI, Ge Zhang, Si Liu, Roger Dannenberg, Jie Fu, Chenghua Lin, Emmanouil Benetos, Wenhu Chen, Wei Xue, Yike Guo
We introduce LyricWhiz, a robust, multilingual, and zero-shot automatic lyrics transcription method achieving state-of-the-art performance on various lyrics transcription datasets, even in challenging genres such as rock and metal. Our novel, training-free approach utilizes Whisper, a weakly supervised robust speech recognition model, and GPT-4, today's most performant chat-based large language model. In the proposed method, Whisper functions as the "ear" by transcribing the audio, while GPT-4 serves as the "brain," acting as an annotator with a strong performance for contextualized output selection and correction. Our experiments show that LyricWhiz significantly reduces Word Error Rate compared to existing methods in English and can effectively transcribe lyrics across multiple languages. Furthermore, we use LyricWhiz to create the first publicly available, large-scale, multilingual lyrics transcription dataset with a CC-BY-NC-SA copyright license, based on MTG-Jamendo, and offer a human-annotated subset for noise level estimation and evaluation. We anticipate that our proposed method and dataset will advance the development of multilingual lyrics transcription, a challenging and emerging task.

Prompt to combine ASR results with GPT-4

Task: As a GPT-4 based lyrics transcription post-processor, your task is to analyze multiple ASR model-generated versions of a song’s lyrics and determine the most accurate version closest to the true lyrics. Also filter out invalid lyrics when all predictions are nonsense.

Input: The input is in JSON format:
{"prediction_1": "line1;line2;...", ...}

Output: Your output must be strictly in readable JSON format without any extra text:
{
"reasons": "reason1;reason2;...",
"closest_prediction":
"output": "line1;line2..."
}

Requirements: For the "reasons" field, you have to provide a reason for the choice of the "closest_prediction" field. For the "closest_prediction" field, choose the prediction key that is closest to the true lyrics. Only when all predictions greatly differ from each other or are completely nonsense or meaningless, which means that none of the predictions is valid, fill in "None" in this field. For the "output" field, you need to output the final lyrics of closest_prediction. If the "closest_prediction" field is "None", you should also output "None" in this field. The language of the input lyrics is English.
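
A minimal sketch of how this prompt might be wired up: run a few Whisper models to get multiple predictions, then ask GPT-4 to select and clean the closest one. Model choices and helper names are illustrative, and SYSTEM_PROMPT stands for the full prompt quoted above:

```python
# Sketch: multiple Whisper passes + GPT-4 post-processing with the prompt above (illustrative).
import json
import whisper
from openai import OpenAI

SYSTEM_PROMPT = "..."  # the post-processing prompt quoted above

def transcribe_candidates(audio_path: str, model_names=("small", "medium", "large")):
    """Produce multiple ASR hypotheses by running different Whisper models."""
    preds = {}
    for i, name in enumerate(model_names, start=1):
        result = whisper.load_model(name).transcribe(audio_path)
        preds[f"prediction_{i}"] = ";".join(seg["text"].strip() for seg in result["segments"])
    return preds

def pick_lyrics(preds: dict) -> dict:
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": json.dumps(preds)},
        ],
    )
    return json.loads(resp.choices[0].message.content)

# lyrics = pick_lyrics(transcribe_candidates("song.mp3"))
```
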
2023-06-29 19:54:22 A useful effort to collect Interspeech paper repos by https://github.com/DmitryRyumin

Please star/share and help fill in the remaining parts; it is a huge effort

https://github.com/DmitryRyumin/INTERSPEECH-2023-Papers

One could probably automate it.
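
A rough sketch of one way to automate part of the collection, querying the public GitHub search API for repositories mentioning Interspeech 2023. The query terms and filtering are illustrative; hits would still need manual matching against the actual paper list:

```python
# Sketch: find candidate Interspeech 2023 repos via the GitHub search API (illustrative).
import requests

def find_interspeech_repos(page: int = 1):
    resp = requests.get(
        "https://api.github.com/search/repositories",
        params={"q": "interspeech 2023", "sort": "stars", "order": "desc",
                "per_page": 50, "page": page},
        headers={"Accept": "application/vnd.github+json"},
        timeout=30,
    )
    resp.raise_for_status()
    return [(r["full_name"], r["html_url"], r["stargazers_count"])
            for r in resp.json()["items"]]

for name, url, stars in find_interspeech_repos():
    print(f"{stars:5d}  {name}  {url}")
```
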
2023-06-29 19:51:03 In the Interspeech 2023 program, Daniel Povey has a Johns Hopkins University affiliation (again)

https://interspeech2023.org/wp-content/uploads/2023/06/INTERSPEECH_2023_Booklet_v1.pdf
2023-06-29 18:38:53 UnitSpeech: Speaker-adaptive Speech Synthesis with Untranscribed Data (INTERSPEECH 2023)

https://github.com/gmltmd789/UnitSpeech

Demo

https://unitspeech.github.io/
2023-06-29 16:21:23 IWSLT 2023 program is available

https://iwslt.org/2023/program