Voicepeak API

Regarding the title: more like 'Using VOICEPEAK via command line'.

References:

Apparently, neither AHS nor Dreamtonics bothers to provide extensive documentation or a manual for using VOICEPEAK without the GUI. For comparison, VOICEVOX doesn't do great either, but at least you can get a grip by scavenging through what they have in the repo.

So, half of this is a translation of the aforementioned Japanese blogs; the other half is what I found out by trying.

Basic usage

./voicepeak.exe [OPTION..]

Starting with ./voicepeak.exe -h:

 -s, --say Text               Text to say
 -t, --text File              Text file to say
 -o, --out File               Path of output file
 -n, --narrator Name          Name of narrator, check --list-narrator
 -e, --emotion Expr           Emotion expression, for example:
                              happy=50,sad=50. Also check --list-emotion
     --list-narrator          Print narrator list
     --list-emotion Narrator  Print emotion list for given narrator
 -h, --help                   Print help
     --speed Value            Speed (50 - 200)
     --pitch Value            Pitch (-300 - 300)

One thing to notice is that command line execution is painfully slow, probably because an unseen GUI gets initialized and terminated every time voicepeak.exe is called. I don't know the details.
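To put a number on that overhead, a single invocation can be timed. This is a minimal sketch: time_command is a helper name made up for this page, and the install path in the commented example is an assumption that should be adjusted to your machine.

```python
import subprocess
import time


def time_command(args):
    """Run a command, wait for it to exit, and return the elapsed seconds."""
    start = time.perf_counter()
    # check=False: we only care about wall-clock time, not the exit code.
    subprocess.run(args, check=False)
    return time.perf_counter() - start


# Example (assumed install path, adjust as needed):
# exe = "C:/Program Files/VOICEPEAK/voicepeak.exe"
# print(time_command([exe, "-s", "こんにちは", "-o", "timing_test.wav"]))
```

Timing the same short text a few times in a row should show whether the cost is mostly fixed startup overhead or grows with the text length.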

Some further breakdown on the options and arguments:

-s, --say Text: The Text part is essentially a string. Reference #1 says the maximum length is 140 characters (probably determined by Twitter, just a wild guess) so generally nothing comically long.

-t, --text File: Basically the same as -s, but reads the text to speak from a text file.

-o, --out File: Specifies the name (path) of the output file. If not supplied, the default is 'output.wav' at the current (shell) directory.

-n, --narrator Name: Specifies the narrator (character). I think this defaults to 'the first one in the narrator list' but I have only 1 narrator in Koharuri so I don't know.

-e, --emotion Expr: Emotion ratios. Only works if the narrator (manually selected or default) is compatible with the given emotions. Note that Koharuri has totally different emotion names than the standard 6 nameless voices. I'll list these in another section on this page.

--list-narrator: Returns a list of available, locally installed Narrators. The names can and should be used when a Narrator is required as an additional argument, such as in -n, --narrator.

--list-emotion Narrator: Returns a list of all possible emotion handle / variable names / tags / you name it for the given Narrator. See some of the results below if you don't want to do this all the time.
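The two list options can be called once and cached instead of being run by hand every time. A minimal sketch follows. It assumes the names are printed one per line on stdout and that the output decodes as UTF-8; neither is documented, so check against your own install. The helper names and the default exepath are made up for illustration.

```python
import subprocess

# Assumed install path; adjust to your machine.
DEFAULT_EXE = "C:/Program Files/VOICEPEAK/voicepeak.exe"


def parse_list_output(text):
    """Split CLI output into non-empty, stripped lines (assumes one name per line)."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def list_narrators(exepath=DEFAULT_EXE):
    """Return the locally installed narrator names via --list-narrator."""
    result = subprocess.run(
        [exepath, "--list-narrator"],
        capture_output=True, text=True, encoding="utf-8", errors="replace",
    )
    return parse_list_output(result.stdout)


def list_emotions(narrator, exepath=DEFAULT_EXE):
    """Return the emotion names for one narrator via --list-emotion."""
    result = subprocess.run(
        [exepath, "--list-emotion", narrator],
        capture_output=True, text=True, encoding="utf-8", errors="replace",
    )
    return parse_list_output(result.stdout)
```

Since every call pays the slow startup cost, it makes sense to store the results in a dict at program start rather than query them per utterance.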

-h, --help: Displays the help which is also quoted above.

--speed Value, --pitch Value: Speech-related parameters. According to the help text, speed ranges from 50 to 200 and pitch from -300 to 300.
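When these values come from another program, it helps to check them against the ranges in the help text before calling the executable. A minimal sketch, with prosody_args as a made-up helper name. Whether the CLI accepts a negative pitch passed as a separate argument (for example "-50") is not verified here.

```python
def prosody_args(speed=None, pitch=None):
    """Build the --speed / --pitch portion of the argument list.

    Ranges follow the help text: speed 50 to 200, pitch -300 to 300.
    Options left as None are omitted so VOICEPEAK uses its defaults.
    """
    args = []
    if speed is not None:
        if not 50 <= speed <= 200:
            raise ValueError(f"speed must be within 50..200, got {speed}")
        args += ["--speed", str(speed)]
    if pitch is not None:
        if not -300 <= pitch <= 300:
            raise ValueError(f"pitch must be within -300..300, got {pitch}")
        args += ["--pitch", str(pitch)]
    return args
```

The returned list can be appended to whatever argument list is passed to subprocess.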

Emotions

As of 2023/09/26 (VOICEPEAK v1.2.6)

Koharu Rikka

 hightension
 livid
 lamenting
 despising
 narration

Generic VOICEPEAK voices

The 'Japanese Male/Female 1/2/3' Voices. Also one called 'Japanese Female Child'.

 happy
 fun
 angry
 sad
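
The lists above can be used to build and sanity-check the -e expression before calling the executable. A minimal sketch: the constant and function names are made up for this page, and the emotion names are just the ones listed here for v1.2.6, so re-check with --list-emotion on other versions.

```python
# Emotion names as listed on this page (VOICEPEAK v1.2.6).
KOHARU_RIKKA_EMOTIONS = {"hightension", "livid", "lamenting", "despising", "narration"}
GENERIC_EMOTIONS = {"happy", "fun", "angry", "sad"}


def build_emotion_expr(emotions, allowed=None):
    """Turn {"happy": 50, "sad": 50} into "happy=50,sad=50".

    If `allowed` is given, raise ValueError for any name not in it.
    """
    if allowed is not None:
        unknown = set(emotions) - set(allowed)
        if unknown:
            raise ValueError(f"unsupported emotion names: {sorted(unknown)}")
    return ",".join(f"{name}={value}" for name, value in emotions.items())
```

For example, build_emotion_expr({"hightension": 80}, KOHARU_RIKKA_EMOTIONS) gives the string to pass after -e for Koharu Rikka.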

Other characters

I don't have 'em so I don't know.

How to use, for example, in Python

Apparently, when you use VOICEPEAK from the command line, you're rarely typing the commands by hand; in practice you call the executable from another program.

VOICEPEAKをPythonから呼び出す ("Calling VOICEPEAK from Python") provides a simple example of a Python wrapper. For archive reasons I'll also steal the code and post it here, with the comments translated.

import os
import subprocess
import winsound

def playVoicePeak(script, narrator="Japanese Female 1", happy=50, sad=50, angry=50, fun=50):
    """
    Have a VOICEPEAK narrator read arbitrary text aloud.
    script: text to read (string)
    narrator: name of the narrator (string)
    happy: degree of happiness
    sad: degree of sadness
    angry: degree of anger
    fun: degree of fun
    """
    # Path to voicepeak.exe
    exepath = "C:/Program Files/VOICEPEAK/voicepeak.exe"
    # Output path for the wav file
    outpath = "output.wav"
    # Build the argument list
    args = [
        exepath,
        "-s", script,
        "-n", narrator,
        "-o", outpath,
        "-e", f"happy={happy},sad={sad},angry={angry},fun={fun}"
    ]
    # Start the process
    process = subprocess.Popen(args)

    # Wait until the process has finished
    process.communicate()

    # Play the audio
    winsound.PlaySound(outpath, winsound.SND_FILENAME)

    # Delete the wav file
    os.remove(outpath)

This is based on the 6 nameless voices with their shared set of emotions. If you want to use for example Koharuri, you'll need to fix the emotion-related portions in this code.

I might do that later but currently it's as is and apparently can't correctly adjust Koharu Rikka's emotions.
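
One way to make the emotion part narrator-agnostic is to pass the emotions as a dict instead of four fixed keyword arguments. This is a minimal sketch, not a tested replacement for the wrapper above: build_voicepeak_args is a made-up name, the default exepath is the same assumed install path, and the narrator string "Koharu Rikka" in the example is an assumption that should be checked against --list-narrator.

```python
def build_voicepeak_args(script, narrator, emotions=None, outpath="output.wav",
                         exepath="C:/Program Files/VOICEPEAK/voicepeak.exe"):
    """Build the argument list for voicepeak.exe without running it.

    emotions: dict of emotion name -> value, using whatever names
    --list-emotion reports for the chosen narrator.
    """
    args = [exepath, "-s", script, "-n", narrator, "-o", outpath]
    if emotions:
        # Same "name=value,name=value" format shown in the help text.
        expr = ",".join(f"{name}={value}" for name, value in emotions.items())
        args += ["-e", expr]
    return args


# Example (narrator name is an assumption, verify with --list-narrator):
# args = build_voicepeak_args("こんにちは", "Koharu Rikka",
#                             {"hightension": 70, "narration": 30})
# subprocess.run(args)
```

Keeping the argument building separate from the process call also makes it easy to print the exact command when something goes wrong.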

Also note that this example is using winsound for playback, and it's not cross-platform.

This is probably not the most time-efficient approach, but the major limitation comes from VOICEPEAK which doesn't have a stream output option (also understandable).

Thoughts

If each call stays within the 140 limit (presumably characters, not words), and the upstream text provider function / program can pre-process the text to speak into smaller chunks, short sentences, phrases etc., command line VOICEPEAK is not that slow.
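
The pre-processing step could look something like this. A minimal sketch: chunk_text is a made-up name, the limit of 140 is the figure from Reference #1, and len() counts Unicode code points, which may or may not match how VOICEPEAK counts. It prefers breaking after sentence-ending punctuation and only hard-splits a sentence that is itself over the limit.

```python
import re


def chunk_text(text, limit=140):
    """Split text into chunks of at most `limit` characters.

    Breaks after Japanese/ASCII sentence-ending punctuation or newlines
    where possible; a single over-long sentence is cut at the limit.
    """
    # Zero-width split keeps the punctuation attached to its sentence.
    pieces = re.split(r"(?<=[。！？.!?\n])", text)
    chunks, current = [], ""
    for piece in pieces:
        if not piece.strip():
            continue  # skip empty or whitespace-only pieces
        while len(piece) > limit:
            # Flush what we have, then cut the long sentence at the limit.
            if current:
                chunks.append(current)
                current = ""
            chunks.append(piece[:limit])
            piece = piece[limit:]
        if len(current) + len(piece) <= limit:
            current += piece
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed to -s in its own call, writing to a separate output file per chunk so the results can be played back in order.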