
Speech Technology

Channel address: @speechtech
Categories: Technologies
Language: English
Subscribers: 652

Ratings & Reviews

2.67 average from 3 reviews (5 stars: 1, 4 stars: 0, 3 stars: 0, 2 stars: 1, 1 star: 1). Reviews can be left only by registered users and are moderated by admins.


The Latest Messages (12)

2023-02-11 02:56:52 https://huggingface.co/blog/speecht5
2023-02-10 22:55:06 https://github.com/openai/whisper/discussions/937

The Whisper model is now available in CTranslate2, a fast inference engine for Transformer models. The project supports many useful inference features such as CPU and GPU execution, asynchronous execution, multi-GPU execution, 8-bit quantization, etc.

You can find a usage example here.

Note that it does not currently implement the full transcription loop, only the model.decode part. So you would still need to implement the transcription logic from transcribe.py on top of it (iterate on each 30-second window, accumulate the context in the prompt, etc.).
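The windowing-plus-prompt loop from transcribe.py can be sketched as follows. This is a minimal sketch, not the actual implementation: decode_window is a hypothetical stand-in for the real CTranslate2 decode call, and the 224-token prompt budget is illustrative.

```python
# Sketch of the outer transcription loop that transcribe.py implements
# and that you would reproduce on top of the CTranslate2 model.
# `decode_window` is a hypothetical placeholder for the real decoder call.

WINDOW_SECONDS = 30

def decode_window(samples, prompt_tokens):
    # Placeholder: the real call runs the Whisper decoder on one
    # 30-second window, conditioned on the accumulated prompt tokens.
    return ["<token>"]

def transcribe(samples, sample_rate=16000, context_size=224):
    window = WINDOW_SECONDS * sample_rate
    prompt, output = [], []
    for start in range(0, len(samples), window):
        tokens = decode_window(samples[start:start + window], prompt)
        output.extend(tokens)
        # Accumulate context: previous output becomes the next prompt,
        # truncated to the model's prompt budget.
        prompt = (prompt + tokens)[-context_size:]
    return output
```

With the stub above, 65 seconds of audio is processed as three 30-second windows, each conditioned on the tokens produced so far.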

For example, here's the transcription time of 13 minutes of audio on a V100 for the same accuracy:

Implementation   Time with "small" model   Time with "medium" model
Baseline         1m37s                     3m16s
CTranslate2      0m25s                     0m42s
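Converting the table entries to seconds, the speedup works out to roughly 4-5x:

```python
# Back-of-the-envelope speedup from the timings above (in seconds).
baseline = {"small": 97, "medium": 196}      # 1m37s, 3m16s
ctranslate2 = {"small": 25, "medium": 42}    # 0m25s, 0m42s

speedup = {m: baseline[m] / ctranslate2[m] for m in baseline}
print(speedup)  # ~3.9x for "small", ~4.7x for "medium"
```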
2023-02-10 13:14:24 Learned about https://uberduck.ai/ from https://news.ycombinator.com/item?id=34736745

TTS is really popular these days
2023-02-10 01:48:44 It is interesting how quickly people implement ideas, like podcast transcripts with Whisper. Here is a selection:

https://podscript.ai/
https://podtext.ai/
https://podscription.app/
https://podsearch.page/

Discussion https://news.ycombinator.com/item?id=34727695
2023-02-09 23:16:55 CMU pubs are nice. High-quality TTS trained on YouTube data

https://github.com/b04901014/MQTTS

https://arxiv.org/abs/2302.04215

A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech

Li-Wei Chen, Shinji Watanabe, Alexander Rudnicky

Recent Text-to-Speech (TTS) systems trained on reading or acted corpora have achieved near human-level naturalness. The diversity of human speech, however, often goes beyond the coverage of these corpora. We believe the ability to handle such diversity is crucial for AI systems to achieve human-level communication. Our work explores the use of more abundant real-world data for building speech synthesizers. We train TTS systems using real-world speech from YouTube and podcasts. We observe the mismatch between training and inference alignments in mel-spectrogram-based autoregressive models, leading to unintelligible synthesis, and demonstrate that learned discrete codes within multiple code groups effectively resolve this issue. We introduce our MQTTS system whose architecture is designed for multiple code generation and monotonic alignment, along with the use of a clean silence prompt to improve synthesis quality. We conduct ablation analyses to identify the efficacy of our methods. We show that MQTTS outperforms existing TTS systems in several objective and subjective measures.
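The "learned discrete codes within multiple code groups" in the abstract is a multi-codebook vector quantization scheme. A minimal sketch of the lookup step, with toy hand-picked codebooks rather than MQTTS's learned ones:

```python
# Minimal multi-group (multi-codebook) vector quantization: split each
# feature vector into G groups and quantize each group against its own
# codebook by nearest-neighbour lookup. Codebook values are toy examples.

def quantize(vector, codebooks):
    group_size = len(vector) // len(codebooks)
    codes = []
    for g, codebook in enumerate(codebooks):
        chunk = vector[g * group_size:(g + 1) * group_size]
        # nearest code by squared Euclidean distance
        codes.append(min(
            range(len(codebook)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(chunk, codebook[i])),
        ))
    return codes  # one discrete index per group

codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],   # codebook for group 0
    [[0.0, 1.0], [1.0, 0.0]],   # codebook for group 1
]
print(quantize([0.9, 1.1, 0.1, 0.8], codebooks))  # -> [1, 0]
```

Each frame is thus represented by a small tuple of indices, one per group, which the autoregressive model then generates.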
2023-02-06 03:12:10 https://twitter.com/DrJimFan/status/1622276293776793600

Looks like many of you are ready to embrace the Year of Sound Waves!

Here’s a big and OPEN dataset for you to get your hands dirty on AI audio modeling: EPIC-SOUNDS, 78k segments of annotated, audible events and actions.

Downloadable here: https://epic-kitchens.github.io/epic-sounds/
2023-02-06 02:32:06 Similar recent thing from DeepMind

https://github.com/deepmind/transformer_grammars
2023-02-06 02:04:15 It is interesting that for tasks like NER, the latest research from Google has returned to structured prediction instead of pure transformers

https://github.com/lyutyuh/ASP

https://arxiv.org/abs/2210.14698

Autoregressive Structured Prediction with Language Models

Tianyu Liu, Yuchen Jiang, Nicholas Monath, Ryan Cotterell, Mrinmaya Sachan

Recent years have seen a paradigm shift in NLP towards using pretrained language models (PLMs) for a wide range of tasks. However, there are many difficult design decisions to represent structures (e.g. tagged text, coreference chains) in a way such that they can be captured by PLMs. Prior work on structured prediction with PLMs typically flattens the structured output into a sequence, which limits the quality of structural information being learned and leads to inferior performance compared to classic discriminative models. In this work, we describe an approach to model structures as sequences of actions in an autoregressive manner with PLMs, allowing in-structure dependencies to be learned without any loss. Our approach achieves the new state of the art on all the structured prediction tasks we looked at, namely named entity recognition, end-to-end relation extraction, and coreference resolution.
2023-02-04 02:58:47 https://twitter.com/alphacep/status/1621612504840273928

NeMo 1.15 is out now! There's a whole bunch of powerful ASR features in this release, including Hybrid CTC-RNNT models, Multi-blank Transducer, Multi-Head Attention Adapters, Conformer-Longformer inference, and a Beam Search API!

First, we discuss Hybrid CTC-RNNT models. We can train a single model with both losses and then perform inference with either decoder. It turns out we can attain better CTC results, and the CTC head converges 40-50% faster when jointly trained.
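The hybrid objective is an interpolation of the two losses over a shared encoder. A minimal sketch, assuming a simple weighted sum; the 0.3 weight and the stand-in loss values below are illustrative, not NeMo defaults:

```python
# Sketch of the hybrid training objective: one encoder feeds both a CTC
# head and an RNNT head, and their losses are mixed with a weight.
# The arguments stand in for the real per-batch loss values.

def hybrid_loss(ctc_loss, rnnt_loss, ctc_weight=0.3):
    # ctc_weight is illustrative; at inference time either head can be
    # used on its own, which is the point of the hybrid setup.
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * rnnt_loss

print(hybrid_loss(2.0, 1.0))  # 0.3 * 2.0 + 0.7 * 1.0 = 1.3
```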

Next up, we have Multi-blank Transducers supported in NeMo. This is an extension of the RNNT loss in which a blank can skip multiple timesteps at once, allowing for highly efficient inference, even at the sample level! Refer to the paper here
With this change, you can now easily train a multi-blank RNNT model and obtain both better WER and much faster inference than regular RNNT models.
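The source of the speedup can be sketched in a toy greedy decode: a "big blank" advances the time index by its duration instead of by one frame. The blank durations and the stub step function below are assumptions for illustration, not NeMo's implementation.

```python
# Toy greedy decode with multi-blank symbols: a big blank skips several
# frames at once, so fewer decoder steps are needed per utterance.
# `step` is a stub for the joint network's most-probable-symbol output.

BLANKS = {"<b1>": 1, "<b2>": 2, "<b4>": 4}  # blank symbol -> frames skipped

def greedy_decode(step, num_frames):
    t, tokens = 0, []
    while t < num_frames:
        sym = step(t, tokens)
        if sym in BLANKS:
            t += BLANKS[sym]    # big blanks advance time by their duration
        else:
            tokens.append(sym)  # label emissions do not advance time
    return tokens

def fake_step(t, tokens):
    # Emits one label at frame 0, then only the largest blank.
    return "hi" if (t == 0 and not tokens) else "<b4>"

print(greedy_decode(fake_step, 12))  # covers 12 frames in 3 blank steps
```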

Next up, we now support Multi-Head Attention Adapters in NeMo ASR. With this approach, any NeMo module can be retrofitted into an adapter module. We see significant parameter efficiency compared to the Houlsby Adapter. With the newly updated scripts for adapter training, you can now easily train either Linear adapters or MHA adapters from the same script. More details can be found in the PR

Long-form audio transcription has long been a challenge for Conformer-based ASR models because of the attention component. So we now support Longformer-based transcription, even for pre-trained models! You can use the transcribe_speech script for this. We find that if you further finetune the model after conversion to Longformer attention, you can recover most of the WER and still get excellent long-audio transcription of up to 30-40 minutes in a single forward pass.

A long-requested feature is beam search support in NeMo ASR in an easy-to-use way. So we unified CTC beam search with external libraries behind the simple model.transcribe() method! You can simply update the config and then transcribe.

We also begin supporting AIStore, a framework for terabyte-scale datasets, as a scalable solution for training ASR models on enormous real-world datasets.
2023-02-03 12:50:05 Some Zipformer ideas (multistream is nice):

https://github.com/k2-fsa/icefall/issues/837#issuecomment-1412312846