The speech generated by Google Text-to-Speech has background noise that sounds like a mouse click

thanhdx · 04-27-2024 04:04 AM

I am developing an AI voice assistant project with a TTS feature that uses Google Text-to-Speech. However, the output of synthesizeSpeech contains background noise that sounds like a mouse click.

Has anyone had a similar problem, or is there a workaround? Thanks a lot.

This is my current config:

    // Assumes: const { TextToSpeechClient, protos } = require('@google-cloud/text-to-speech');
    // and that this.client is a TextToSpeechClient instance.
    const request = {
        input: { text: text },
        voice: {
            languageCode: "en-US",
            name: "en-US-Studio-O",
            ssmlGender: protos.google.cloud.texttospeech.v1.SsmlVoiceGender.FEMALE
        },
        audioConfig: {
            audioEncoding: protos.google.cloud.texttospeech.v1.AudioEncoding.LINEAR16,
            sampleRateHertz: 24000,
            effectsProfileId: ['handset-class-device'],
            speakingRate: 1.45
        },
    };

    const [response] = await this.client.synthesizeSpeech(request);

Audio sample: https://jmp.sh/s/SQM0UJsOBlGikL5aFuJG

I tried other values for audioEncoding, sampleRateHertz, and effectsProfileId, but the noise is still there.

Update 27/4: I tried both ElevenLabs and Google TTS with the input text "Hello". On playback, the Google TTS audio contains a click noise, while the ElevenLabs audio does not.
Audio base64 string here: https://drive.google.com/file/d/1DG5KxvllqaQHJj6FK0L5Ovj3zHiLreIU/view?usp=sharing

lsolatorio

Hi @thanhdx,

Thank you for joining our community.

The clicking sound at the beginning of your audio file might be caused by the WAV header bytes being played back as if they were audio samples. You can find a helpful discussion on Stack Overflow that explores this issue.

The telephony standard for audio is 8-bit PCM mono uLaw (MULAW) with a sampling rate of 8 kHz. The payload of the media message should not contain the audio file type header bytes, so it's essential to understand the WAV file header fields so that you can strip them off before sending the audio data to the user.

A standard WAV file header comprises the following fields:

Positions  Sample value         Description
1-4        "RIFF"               Marks the file as a RIFF file. Each character is 1 byte long.
5-8        File size (integer)  Size of the overall file minus 8 bytes, in bytes (32-bit integer). Typically, you'd fill this in after creation.
9-12       "WAVE"               File type header. For our purposes, it always equals "WAVE".
13-16      "fmt "               Format chunk marker. Includes a trailing space.
17-20      16                   Length of the format data that follows (positions 21-36).
21-22      1                    Type of format (1 is PCM). 2-byte integer.
23-24      2                    Number of channels. 2-byte integer.
25-28      44100                Sample rate. 32-bit integer. Common values are 44100 (CD) and 48000 (DAT). Sample rate = number of samples per second, or hertz.
29-32      176400               Byte rate: (Sample Rate * BitsPerSample * Channels) / 8.
33-34      4                    Block align: (BitsPerSample * Channels) / 8. 1 = 8-bit mono; 2 = 8-bit stereo or 16-bit mono; 4 = 16-bit stereo.
35-36      16                   Bits per sample.
37-40      "data"               "data" chunk header. Marks the beginning of the data section.
41-44      File size (data)     Size of the data section.
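
To make the layout concrete, here is a minimal sketch in Node.js that decodes those fields from a buffer. The function names parseWavHeader and buildSampleHeader are made up for this illustration, and the sketch assumes the canonical 44-byte PCM layout described above (real files may carry extra chunks). Note that the table's positions are 1-based, while buffer offsets are 0-based.

```javascript
// Sketch: decode the fields of a canonical 44-byte PCM WAV header.
// Table position 1 corresponds to buffer offset 0.
function parseWavHeader(buf) {
  if (buf.length < 44) throw new Error('buffer shorter than 44 bytes');
  return {
    riff: buf.toString('ascii', 0, 4),       // "RIFF"
    fileSizeMinus8: buf.readUInt32LE(4),     // overall file size minus 8
    wave: buf.toString('ascii', 8, 12),      // "WAVE"
    fmt: buf.toString('ascii', 12, 16),      // "fmt " (trailing space)
    fmtLength: buf.readUInt32LE(16),         // 16 for PCM
    audioFormat: buf.readUInt16LE(20),       // 1 = PCM
    channels: buf.readUInt16LE(22),
    sampleRate: buf.readUInt32LE(24),
    byteRate: buf.readUInt32LE(28),
    blockAlign: buf.readUInt16LE(32),
    bitsPerSample: buf.readUInt16LE(34),
    dataTag: buf.toString('ascii', 36, 40),  // "data" in the canonical layout
    dataSize: buf.readUInt32LE(40),
  };
}

// Hypothetical helper: build a header from the table's sample values
// so the parser can be exercised without a real audio file.
function buildSampleHeader(sampleRate, channels, bitsPerSample, dataSize) {
  const h = Buffer.alloc(44);
  h.write('RIFF', 0, 'ascii');
  h.writeUInt32LE(36 + dataSize, 4);
  h.write('WAVE', 8, 'ascii');
  h.write('fmt ', 12, 'ascii');
  h.writeUInt32LE(16, 16);
  h.writeUInt16LE(1, 20);
  h.writeUInt16LE(channels, 22);
  h.writeUInt32LE(sampleRate, 24);
  h.writeUInt32LE((sampleRate * bitsPerSample * channels) / 8, 28); // byte rate
  h.writeUInt16LE((bitsPerSample * channels) / 8, 32);              // block align
  h.writeUInt16LE(bitsPerSample, 34);
  h.write('data', 36, 'ascii');
  h.writeUInt32LE(dataSize, 40);
  return h;
}

const header = parseWavHeader(buildSampleHeader(44100, 2, 16, 0));
console.log(header.byteRate, header.blockAlign); // prints "176400 4"
```

The byte rate and block align printed here match the sample values 176400 and 4 in the table.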

The table above indicates that the header is typically 44 bytes long. This suggests you could potentially skip or remove the first 44 bytes to eliminate the clicking sound.
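
As a rough sketch of that idea in Node.js, the helper below returns only the sample data. The name stripWavHeader is made up for this example. Rather than hard-coding 44 bytes, it walks the RIFF chunks until it finds "data", since a header can be longer than 44 bytes when extra chunks are present. It assumes the input is a Buffer.

```javascript
// Sketch: return only the audio samples from a RIFF/WAVE buffer.
// Walks the chunk list to find "data" instead of assuming a fixed 44 bytes.
function stripWavHeader(buf) {
  const isWav =
    buf.length >= 12 &&
    buf.toString('ascii', 0, 4) === 'RIFF' &&
    buf.toString('ascii', 8, 12) === 'WAVE';
  if (!isWav) return buf; // no WAV header found: assume the audio is already raw

  let offset = 12; // first chunk starts after "RIFF", the size field, and "WAVE"
  while (offset + 8 <= buf.length) {
    const id = buf.toString('ascii', offset, offset + 4);
    const size = buf.readUInt32LE(offset + 4);
    const start = offset + 8;
    if (id === 'data') {
      // Clamp in case the declared size overruns the buffer.
      const end = Math.min(start + size, buf.length);
      return buf.subarray(start, end);
    }
    offset = start + size + (size % 2); // chunks are padded to an even length
  }
  throw new Error('no "data" chunk found');
}

// Demo: a 44-byte header followed by four bytes of sample data.
const demo = Buffer.alloc(48);
demo.write('RIFF', 0, 'ascii');
demo.writeUInt32LE(40, 4);
demo.write('WAVE', 8, 'ascii');
demo.write('fmt ', 12, 'ascii');
demo.writeUInt32LE(16, 16);
demo.write('data', 36, 'ascii');
demo.writeUInt32LE(4, 40);
demo.set([1, 2, 3, 4], 44);

console.log(stripWavHeader(demo).length); // prints 4
```

With the config from the question, this would be applied to response.audioContent before playback or streaming, assuming it arrives as a Buffer or Uint8Array (wrap a Uint8Array with Buffer.from first). If the player expects a complete WAV file instead, the header should be kept and the click likely has a different cause.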

I hope I was able to provide you with useful insights.
