Academic works: Difference between revisions

From Novoyuuparosk Wiki
I am Wan Ziyu, a doctoral course student at the [https://hci-lab.jp/ Human-Computer Interaction Laboratory], Hokkaido University. I have a relatively formal education in opto-electrical engineering and some computer science, and an informal self-education in linguistics and phonetics.
 
I '''am not very proud to say''' that I have no publications as of now (whenever you are looking at this page, that is).


However, I feel it is equally important to introduce what I am currently doing and what I have done, on the slight chance that you are interested.


== Research strengths and, well, abilities ==
I don't have any particularly notable certificates or qualifications, so take everything I list here with a pinch of salt.


* Digital signal processing, with Python (librosa / NumPy / SciPy) and some C++ (iPlug2 for creating VST plug-ins)
* Neural networks (rather basic) with TensorFlow / Keras. Also a tad of PyTorch, but I hate migrating between toolkits.
* Some HTML / JavaScript for unpretty utility webpages.
* Some bash / Python for task automation.
* LLM prompt composition, generic LLM utilisation.
* Basic welding and electrician skills.
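To give a flavour of the DSP bullet above, here is a minimal, standard-library-only Python sketch (the actual work uses librosa / NumPy / SciPy; everything below is an illustrative toy, not project code): a naive DFT that locates the dominant frequency bin of a test sine.

```python
import cmath
import math

def dft_magnitudes(signal):
    """Naive O(N^2) discrete Fourier transform, returning per-bin magnitudes."""
    n = len(signal)
    return [abs(sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n)]

# A sine completing exactly 4 cycles per window: energy lands in bin 4.
n = 32
signal = [math.sin(2 * math.pi * 4 * t / n) for t in range(n)]
mags = dft_magnitudes(signal)
peak_bin = max(range(n // 2), key=lambda k: mags[k])
print(peak_bin)  # 4
```

In practice one would of course reach for <code>numpy.fft.rfft</code> or librosa's STFT helpers rather than an O(N²) loop; the toy only shows the shape of the task.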


== Research topics / interests ==


=== Language-independent speech recognition ===
To capture the phonemes (sounds) themselves rather than language-specific text, a language-independent speech recognition system is proposed.


''See: [[Language-independent speech recognition]]''
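To caricature what "sounds rather than text" means, here is a toy Python sketch (the reference formant values are rough textbook averages and the whole table is illustrative, not part of the actual system): map measured acoustics to the nearest IPA vowel, instead of to any one language's orthography.

```python
# Toy language-independent "recogniser": given the first two formant
# frequencies (Hz) measured from audio, return the nearest vowel from a
# small IPA reference table by squared Euclidean distance in (F1, F2).
REFERENCE_FORMANTS = {  # vowel: (F1, F2), approximate textbook averages
    'i': (280, 2250),
    'a': (700, 1200),
    'u': (300, 870),
}

def nearest_vowel(f1, f2):
    return min(REFERENCE_FORMANTS,
               key=lambda v: (REFERENCE_FORMANTS[v][0] - f1) ** 2
                           + (REFERENCE_FORMANTS[v][1] - f2) ** 2)

print(nearest_vowel(290, 2100))  # 'i'
print(nearest_vowel(650, 1100))  # 'a'
```

The output is a phonetic symbol usable with any target language or notation, which is the point of keeping the front end language-independent; a real system would of course work with continuous articulatory parameters rather than a three-entry lookup.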


=== LLM-powered conversational (car) navigation agent ===
Based on the half-faith, half-fact (proportions may vary) premise that vehicle navigation is a social, conversational task, I think it might be good to enable navigation software to hold a conversation with the driver as well.


''See: [[LLM navigation agent]]''


=== (Pro?)active agent guidance ===
A spin-off from the navigator idea. Making the agent capable of actively asking for information might help the human user form more structured, concise and solid input.


''See: [[Active agent guidance]]''

Latest revision as of 07:34, 11 October 2023
