
Speech Technology

Channel address: @speechtech
Categories: Technologies
Language: English
Subscribers: 652

Ratings & Reviews

2.67 (3 reviews)


5 stars: 1 · 4 stars: 0 · 3 stars: 0 · 2 stars: 1 · 1 star: 1


Latest Messages (38)

2022-07-20 23:50:58 https://github.com/xinjli/transphone
2022-07-20 22:35:39 System that ranks 1st in DCASE 2022 Challenge Task 5: Few-shot Bioacoustic Event Detection

https://github.com/haoheliu/DCASE_2022_Task_5

https://arxiv.org/abs/2207.07773

Segment-level Metric Learning for Few-shot Bioacoustic Event Detection

Haohe Liu, Xubo Liu, Xinhao Mei, Qiuqiang Kong, Wenwu Wang, Mark D. Plumbley

Few-shot bioacoustic event detection is a task that detects the occurrence time of a novel sound given a few examples. Previous methods employ metric learning to build a latent space with the labeled part of different sound classes, also known as positive events. In this study, we propose a segment-level few-shot learning framework that utilizes both the positive and negative events during model optimization. Training with negative events, which are larger in volume than positive events, can increase the generalization ability of the model. In addition, we use transductive inference on the validation set during training for better adaptation to novel classes. We conduct ablation studies on our proposed method with different setups on input features, training data, and hyper-parameters. Our final system achieves an F-measure of 62.73 on the DCASE 2022 challenge task 5 (DCASE2022-T5) validation set, outperforming the performance of the baseline prototypical network 34.02 by a large margin. Using the proposed method, our submitted system ranks 2nd in DCASE2022-T5. The code of this paper is fully open-sourced at this https URL.
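The segment-level metric-learning idea above can be illustrated minimally: build one prototype from the few labeled positive segments, a second prototype from the abundant negative segments, and score a query segment by its distance to each. A toy NumPy sketch under those assumptions (all names hypothetical; this is not the authors' implementation, which additionally learns the embedding and uses transductive inference):

```python
import numpy as np

def prototype(segments):
    """Mean embedding of a set of segment-level feature vectors."""
    return np.mean(segments, axis=0)

def classify_segment(query, pos_proto, neg_proto):
    """P(event) for a query segment: softmax over negative squared
    Euclidean distances to the positive and negative prototypes."""
    logits = np.array([
        -np.sum((query - pos_proto) ** 2),
        -np.sum((query - neg_proto) ** 2),
    ])
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return probs[0]

# Toy data: positive segments cluster near 1, negatives near 0.
rng = np.random.default_rng(0)
pos = rng.normal(1.0, 0.1, size=(5, 8))   # few labeled positive segments
neg = rng.normal(0.0, 0.1, size=(50, 8))  # negatives are larger in volume
query = rng.normal(1.0, 0.1, size=8)      # unseen positive-like segment
p = classify_segment(query, prototype(pos), prototype(neg))
```

The explicit negative prototype is the point of the paper's framework: the plain prototypical-network baseline models only the positive classes, while the negatives carry most of the training volume.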
2022-07-19 18:00:30 https://twitter.com/justin_salamon/status/1549094639193321472
2022-07-19 17:39:31 Excellent community submission for Korean ASR using the NeMo Conformer Transducer, achieving greedy decoding scores that surpass the state of the art in academic references.

https://huggingface.co/eesungkim/stt_kr_conformer_transducer_large
2022-07-18 16:27:05 https://github.com/MTG/Podcastmix
2022-07-17 23:51:04 Important direction overall

https://ai.googleblog.com/2022/07/towards-reliability-in-deep-learning.html
2022-07-17 23:20:19 https://github.com/microsoft/SpeechT5
2022-07-17 22:09:52 More recent papers in this interesting area:

https://arxiv.org/abs/2207.07073

Efficient spike encoding algorithms for neuromorphic speech recognition

Sidi Yaya Arnaud Yarga, Jean Rouat, Sean U. N. Wood

Spiking Neural Networks (SNN) are known to be very effective for neuromorphic processor implementations, achieving orders of magnitude improvements in energy efficiency and computational latency over traditional deep learning approaches. Comparable algorithmic performance was recently made possible as well with the adaptation of supervised training algorithms to the context of SNN. However, information including audio, video, and other sensor-derived data are typically encoded as real-valued signals that are not well-suited to SNN, preventing the network from leveraging spike timing information. Efficient encoding from real-valued signals to spikes is therefore critical and significantly impacts the performance of the overall system. To efficiently encode signals into spikes, both the preservation of information relevant to the task at hand as well as the density of the encoded spikes must be considered. In this paper, we study four spike encoding methods in the context of a speaker independent digit classification system: Send on Delta, Time to First Spike, Leaky Integrate and Fire Neuron and Bens Spiker Algorithm. We first show that all encoding methods yield higher classification accuracy using significantly fewer spikes when encoding a bio-inspired cochleagram as opposed to a traditional short-time Fourier transform. We then show that two Send On Delta variants result in classification results comparable with a state of the art deep convolutional neural network baseline, while simultaneously reducing the encoded bit rate. Finally, we show that several encoding methods result in improved performance over the conventional deep learning baseline in certain cases, further demonstrating the power of spike encoding algorithms in the encoding of real-valued signals and that neuromorphic implementation has the potential to outperform state of the art techniques.
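Of the four encoders studied, Send on Delta is the simplest to illustrate: a spike is emitted whenever the signal has moved more than a threshold δ away from its value at the last spike, with separate UP and DOWN channels for the sign of the change. A minimal sketch for a 1-D signal (names hypothetical; not the paper's code):

```python
import numpy as np

def send_on_delta(signal, delta):
    """Send-on-Delta encoding: emit an UP (or DOWN) spike whenever the
    signal rises (or falls) by at least `delta` relative to its value
    at the previous spike; the reference is updated at each spike."""
    up = np.zeros(len(signal), dtype=int)
    down = np.zeros(len(signal), dtype=int)
    ref = signal[0]
    for i, x in enumerate(signal):
        if x - ref >= delta:
            up[i] = 1
            ref = x
        elif ref - x >= delta:
            down[i] = 1
            ref = x
    return up, down

# A rising-then-falling ramp yields sparse UP spikes on the way up
# and sparse DOWN spikes on the way down.
t = np.linspace(0.0, 1.0, 100)
sig = np.concatenate([t, t[::-1]])
up, down = send_on_delta(sig, delta=0.1)
```

The trade-off the paper studies is visible even here: a larger `delta` preserves less detail but yields far fewer spikes, lowering the encoded bit rate.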
2022-07-17 22:04:58 It is common knowledge that content and speaker information have different representations in deep models; this paper confirms it once more:

https://arxiv.org/abs/2207.06867

Deep versus Wide: An Analysis of Student Architectures for Task-Agnostic Knowledge Distillation of Self-Supervised Speech Models

Takanori Ashihara, Takafumi Moriya, Kohei Matsuura, Tomohiro Tanaka

Self-supervised learning (SSL) is seen as a very promising approach with high performance for several speech downstream tasks. Since the parameters of SSL models are generally so large that training and inference require a lot of memory and computational cost, it is desirable to produce compact SSL models without a significant performance degradation by applying compression methods such as knowledge distillation (KD). Although the KD approach is able to shrink the depth and/or width of SSL model structures, there has been little research on how varying the depth and width impacts the internal representation of the small-footprint model. This paper provides an empirical study that addresses the question. We investigate the performance on SUPERB while varying the structure and KD methods so as to keep the number of parameters constant; this allows us to analyze the contribution of the representation introduced by varying the model architecture. Experiments demonstrate that a certain depth is essential for solving content-oriented tasks (e.g. automatic speech recognition) accurately, whereas a certain width is necessary for achieving high performance on several speaker-oriented tasks (e.g. speaker identification). Based on these observations, we identify, for SUPERB, a more compressed model with better performance than previous studies.
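The paper compresses SSL models via knowledge distillation. As background, the standard logit-level KD loss is a temperature-softened KL divergence between teacher and student outputs, scaled by T²; the sketch below is that generic formulation, not the paper's task-agnostic objective:

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax (numerically stabilized)."""
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) over temperature-softened distributions,
    scaled by T^2 so gradients keep a comparable magnitude across T."""
    p = softmax(teacher_logits, T)  # soft teacher targets
    q = softmax(student_logits, T)  # student predictions
    return T * T * np.sum(p * (np.log(p) - np.log(q)))

# A student that matches the teacher incurs zero loss; a flat
# (uninformative) student incurs a strictly larger one.
teacher = [2.0, 0.5, -1.0]
loss_match = distillation_loss(teacher, teacher)
loss_off = distillation_loss([0.0, 0.0, 0.0], teacher)
```

The paper's finding is about where to spend the student's fixed parameter budget under such a loss: depth helps content-oriented tasks like ASR, width helps speaker-oriented tasks.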
2022-07-17 19:42:21 You can try the new Italian model here:

https://huggingface.co/spaces/alphacep/asr
