Language-independent speech recognition

From Novoyuuparosk Wiki

Modern speech recognition, or automatic speech recognition (ASR), usually depends on a language-specific Hidden Markov Model for application-level accuracy. The language independence of speech recognition concerns the step before the probabilistic model: converting audio (waveforms) into signal sequences that then go into the probabilistic model.

Several approaches exist. To be very honest I haven't read a lot of literature so I will not brag about what is commonplace. Here I will simply introduce how I think about it.

Purpose

Usually, when an ASR system is built around a target language, the intermediate representation (phonemes) is also determined by the language in question. For example, a Japanese ASR might classify vowels into only 5 possibilities, A/I/U/E/O (not in their finest IPA form, but you get the idea if you know a tad of Japanese).

This hasn't proved to be a fundamental shortcoming of ASR when dealing with accented or multilingual situations. However, there exists a niche for accurate accent reproduction and thus precise transcription. The niche actually comes from me, and I have always struggled to prove that anyone else really needs this.

Viability check

To put things more plainly, I want to recreate heavily accented singing with vocal synthesisers such as CeVIO, Synthesizer V, VOCALOID, or you name it. These synthesisers usually have 'voice banks' which are capable of making sounds from phoneme notation for one or several (in the case of Synthesizer V) languages. If there is an intermediate notation capable of bidirectional translation (transcription) from and to the different notation systems used by all the synthesisers, it will theoretically be possible to recreate any pronunciation in any synthesiser. Of course, the biggest challenge is that the conversion from this universal, omnipotent notation to a vocal synthesiser notation is lossy, and often very lossy (because many voice banks are Japanese, and the Japanese phoneme inventory is rather limited). However, English and Spanish voice banks offer a wider base coverage of phonemes, and newer engines (e.g. CeVIO and VOCALOID beta-studio) provide means to alter a given vowel over a wider range.
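
As a rough illustration of why the conversion is lossy, here is a minimal sketch of a round trip from an intermediate notation to a five-vowel bank and back. Everything in it is an assumption made for the example: the IPA-style symbols only stand in for the intermediate notation, the A/I/U/E/O bank is written as lowercase letters, and the mapping tables are hypothetical rather than taken from any synthesiser's documentation.

```python
# Hypothetical many-to-one table: intermediate symbol -> five-vowel bank symbol.
# The pairings are illustrative guesses, not a real synthesiser's phoneme set.
TO_BANK = {
    "a": "a", "ɑ": "a", "æ": "a",
    "i": "i", "ɪ": "i",
    "u": "u", "ɯ": "u",
    "e": "e", "ɛ": "e",
    "o": "o", "ɔ": "o",
}

# The reverse table can only return one representative per bank symbol,
# which is exactly where information is lost.
FROM_BANK = {"a": "a", "i": "i", "u": "u", "e": "e", "o": "o"}


def to_bank(symbols):
    """Translate intermediate symbols into bank symbols."""
    return [TO_BANK[s] for s in symbols]


def from_bank(symbols):
    """Translate bank symbols back into intermediate symbols."""
    return [FROM_BANK[s] for s in symbols]


def round_trip_loss(symbols):
    """Return the positions whose symbol changes after a full round trip."""
    restored = from_bank(to_bank(symbols))
    return [i for i, (before, after) in enumerate(zip(symbols, restored))
            if before != after]
```

With these tables, `round_trip_loss(["æ", "i", "ɔ"])` reports positions 0 and 2: "æ" and "ɔ" collapse into "a" and "o" and cannot be recovered. A bank with wider coverage would simply have fewer many-to-one rows.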

Projected implementation

One example would be using IPA phonemes (e.g. written in X-SAMPA) as this intermediate form, which is not what I am doing. The reasons are as follows:

  • No generic voice-to-(full-)IPA model is available. Most of the trained or optimised models are for a subset of IPA i.e. for a certain language.
  • IPA phonemes are still discrete, while the human oral cavity acts in a continuous way. IPA phonemes also overlap and sometimes even contradict each other.

A value-based notation is proposed. Based on the source-filter concept, a phoneme (only vowels, as far as I've got) is shaped by the oral cavity acting as a filter. Each articulatory component contributes to the filter, and some are more important than others. I've selected 4 parameters which can be quantified.

  • Front/Back - A value from 0.0 to 1.0. The position of the tongue (tip) relative to the tongue root.
  • Open/Close - A value from 0.0 to 1.0. Describes how far the palate is from the
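
A minimal sketch of how such a value-based vowel could be stored and then mapped (lossily) onto a voice bank's limited inventory. It covers only the two parameters named above. The direction of each scale (0.0 = front and close, 1.0 = back and open), the `Vowel` and `nearest` names, and the coordinates given to the five bank vowels are all assumptions for illustration, not measured values.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Vowel:
    front_back: float  # assumed: 0.0 = fully front, 1.0 = fully back
    open_close: float  # assumed: 0.0 = fully close, 1.0 = fully open

    def __post_init__(self):
        # Both parameters are defined on the range 0.0 to 1.0.
        for name in ("front_back", "open_close"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be within [0.0, 1.0], got {value}")


# Placeholder coordinates for a five-vowel bank (hypothetical, not measured).
BANK = {
    "a": Vowel(0.5, 1.0),
    "i": Vowel(0.0, 0.0),
    "u": Vowel(1.0, 0.0),
    "e": Vowel(0.2, 0.5),
    "o": Vowel(0.9, 0.5),
}


def nearest(vowel, bank=BANK):
    """Return the bank label closest to `vowel` by Euclidean distance."""
    point = (vowel.front_back, vowel.open_close)
    return min(
        bank,
        key=lambda label: math.dist(
            point, (bank[label].front_back, bank[label].open_close)
        ),
    )
```

For example, `nearest(Vowel(0.1, 0.1))` picks "i" under these placeholder coordinates. Plain Euclidean distance treats both parameters as equally important; since some articulatory components matter more than others, a weighted distance may be a better fit.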

What has been done and what has not