2021-12-11 20:10:41
I took a closer look at popular BPE segments. In my opinion the segments selected by sentencepiece by default are too long. For example, if we dump a standard 5000-unit BPE vocab from Espnet/Wenet, there will be units like _ANIMALS. It is very hard for an acoustic model to detect such long inputs. Looks like some adjustment is required for sentencepiece so that segments are shorter and more suitable for acoustic models.
Nemo uses 129 units and thus has shorter chunks, but there are still units like "ve" or "think" which are not really acoustically motivated.
This paper confirms it:
https://arxiv.org/abs/2104.09106
Acoustic Data-Driven Subword Modeling for End-to-End Speech Recognition
Wei Zhou, Mohammad Zeineldeen, Zuoyun Zheng, Ralf Schlüter, Hermann Ney
Subword units are commonly used for end-to-end automatic speech recognition (ASR), while a fully acoustic-oriented subword modeling approach is somewhat missing. We propose an acoustic data-driven subword modeling (ADSM) approach that adapts the advantages of several text-based and acoustic-based subword methods into one pipeline. With a fully acoustic-oriented label design and learning process, ADSM produces acoustic-structured subword units and acoustic-matched target sequence for further ASR training. The obtained ADSM labels are evaluated with different end-to-end ASR approaches including CTC, RNN-Transducer and attention models. Experiments on the LibriSpeech corpus show that ADSM clearly outperforms both byte pair encoding (BPE) and pronunciation-assisted subword modeling (PASM) in all cases. Detailed analysis shows that ADSM achieves acoustically more logical word segmentation and more balanced sequence length, and thus, is suitable for both time-synchronous and label-synchronous models. We also briefly describe how to apply acoustic-based subword regularization and unseen text segmentation using ADSM.
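A quick way to see the problem is to look at the length distribution of a dumped vocab. Below is an illustrative sketch with a hypothetical toy vocab standing in for a real sentencepiece dump (e.g. from `spm_export_vocab`); "▁" is sentencepiece's word-boundary marker, and the length cap of 4 characters is an arbitrary choice for the example:

```python
# Sketch: inspect the length distribution of a dumped BPE vocab.
# The vocab list is a toy stand-in for a real sentencepiece dump.
from collections import Counter

vocab = ["▁ANIMALS", "▁the", "▁think", "ve", "ing", "s", "▁a"]

def unit_length(unit: str) -> int:
    """Length of a unit in characters, ignoring the word-boundary marker."""
    return len(unit.lstrip("▁"))

lengths = Counter(unit_length(u) for u in vocab)
too_long = [u for u in vocab if unit_length(u) > 4]

print(sorted(lengths.items()))  # (length, count) pairs
print(too_long)                 # units an acoustic model may struggle with
```

As for the adjustment itself: sentencepiece exposes a `max_sentencepiece_length` training option (default 16) that caps unit length in characters at training time, e.g. `spm_train --model_type=bpe --max_sentencepiece_length=4 ...` — though whether a hard character cap is the right proxy for acoustic suitability is exactly what the paper above questions.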