PI24 - Real-Time Speech Synthesis

Authors:

Introduction

The goal of this project is to develop a low-latency, high-quality audio synthesis real-time text-to-speech (TTS) system. The project is set up to take advantage of microservices concepts in order to facilitate modularity, scalability, and adaptability. The architecture guarantees effective functioning and facilitates future extensions by breaking the system up into discrete parts, each of which is in charge of carrying out particular functions.

The system was designed to be able to run on a large variety of hardware, from small devices to large cloud servers.

Project Scope

This project's main objective was to create a reliable TTS system that could perform the following tasks:

Process text inputs:

Could accept inputs in the form of structured SSML (Speech Synthesis Markup Language) or natural language text.
Prepares the input for synthesis by apllying advanced normalization and techniques.

Generate Real-Time Audio:

Capable of producing audio streams with minimal latency.
Ensures high-quality audio output, with a focus on naturalness and intelligibility.

Supports Modular Deployment:

The system is designed to be modular, allowing for easy integration of new components and updates on existing ones.
The system can be deployed in a microservices architecture, allowing for scalability and adaptability.

Utilizability:

The system is designed to be usable by a wide range of users, from developers to end-users.
The system offers a simple API for easy integration into other systems.

Facilitates Customization:

The system is designed to be easily customizable, allowing for the integration of new voices, languages, and other features.
The system is designed to be easily extendable, allowing for the integration of new components and features.

The project was designed to meet real-world demands for customizable, efficient, and scalable voice synthesis, making it ideal for applications such as virtual assistants, customer service bots, and accessible technology solutions.

Architecture

The system was designed to be simple, modular, and scalable. The architecture is based on a microservices approach, with each component responsible for a specific task. This allows for easy integration of new components and updates on existing ones. The system is designed to be able to run on a large variety of hardware, from small devices to large cloud servers.

Architecture

In the diagram above, it is possible to see the main components of the system:

Black Arrows: Represent the flow of "text" between the components.
Blue Arrows: Represent the flow of audio between the components.
Components:
- Frontend: The user interface, responsible for sending requests to the api and displaying the results.
- API: One of the possible interfaces of the system, responsible for processing requests from the frontend and sending them to the server.
- Client: The client is a Terminal-based interface that allows users to interact with the system without being dependent on the API.
- Proxy: The proxy is responsible for routing requests to the server and returning the results to the client and/or API.
- Server: The server is responsible for processing requests from the proxy and generating the audio output.
- Normalizer: The normalizer is responsible for processing the input text and preparing it for synthesis.

Docker

Each component is incapsulated in a Docker container, allowing for easy deployment and scaling. The provided docker-compose file allows for easy deployment of the system on a single machine.

Flow

The system, as designed, is composed of several components, each responsible for a specific task.

App Flow

Based on the architecture diagram, the flow of the system is as follows:

Frontend API flow:

The user interacts with the system through the Frontend, which sends requests to the API.
The API processes the requests and sends them to the Proxy.
The Proxy routes the requests to an available Server.
The Server sends the text input to the Normalizer.
The Normalizer processes the input and prepares it for synthesis and sends it back to the Server.
The Server uses the normalized text to generate the audio. Then sends it back to the Proxy.
The Proxy returns the results to the API.
The API sends the results back to the Frontend, which displays the results to the user.

Client flow:

The user interacts with the system through the Client , which sends requests directly to the Proxy.
The Proxy routes the requests to an available Server.
The Server sends the text input to the Normalizer.
The Normalizer processes the input and prepares it for synthesis and sends it back to the Server.
The Server uses the normalized text to generate the audio. Then sends it back to the Proxy.
The Proxy returns the results to the Client.
The Client displays the results to the user.

Data Flow

The flow of data between these components is crucial for the system to function correctly. The following diagram illustrates the flow of data between the components of the system:

Data Flow

As the diagram shows, the data flow is as follows:

1. Input:

The system accepts two types of inputs:

Natural Text: Plain text provided by the user.
SSML Text: Structured input using Speech Synthesis Markup Language (not currently implemented).

2. Normalizer:

The input text is sent to the Normalizer, which standardizes it for further processing. For example:

Expanding abbreviations.
Converting numbers into words.

3. TTS Model:

The normalized text is then processed by the TTS Model, which converts the text into audio data. This includes:

Generating phonetic representations.
Applying prosody to ensure naturalness.

4. Output:

The audio data is finalized and saved as an output file (e.g., .wav or .mp3), ready for playback and/or further processing.

Current Limitations

SSML Support: While the diagram includes an SSML Parser, this functionality is not yet implemented. All inputs must currently be provided as plain text.

Streaming: The system generates complete audio files and sends them directly to the user, but streaming capabilities could be added in future iterations.

Communication

Main Components Communication

To handle the comunications between the main components, the system uses gRPC as the communication protocol. This allows for fast and efficient communication between the components, ensuring that the system can handle the real-time requirements of the audio synthesis process.

The use of gRPC also allows for a technology-agnostic approach to the system, as it can be used with a wide variety of programming languages and platforms.

Frontend API Communication

To handle the communication between the Frontend and the API, the system uses HTTP as the communication protocol. This allows for easy integration with web-based applications and ensures that the system can be easily accessed by a wide variety of devices.

Requirements

The system requirements define the functional and non-functional aspects that the text-to-audio conversion system must meet. These requirements guide the development process, ensuring that the system meets the user's needs and expectations.

Functional Requirements

Text-to-Audio Conversion: The system must generate audible audio from text inputs.
Integration: The system must integrate seamlessly with existing virtual assistant infrastructures.
Language Support: The system must support generating audio in Portuguese.
Voice Customization: Users must be able to choose and customize voices for their needs.
Dynamic Intonation: The system must adjust intonation to emphasize keywords in responses.
Emotion Adaptation: The system should modify the voice's tone based on conversational context.
Multi-Format Text Support: Inputs in HTML, Markdown, plain text, and emails must be processed effectively.

Non-Functional Requirements

Performance: The system must generate responses in under 1.5 seconds.
Scalability: It must support multiple simultaneous users without degradation.
Availability: The system must be operational 99% of the time.
Clarity: The generated audio must be clear and easy to understand.
Naturalness: Voices must sound natural and human-like, avoiding robotic tones.
Redundancy: Mechanisms should ensure uninterrupted service even in case of failures.
Maintainability: The system must have modular, well-documented code for easy updates.
Multilingual Capability: It should support multiple languages to cater to diverse markets.
Legal Compliance: All voices used must adhere to licensing and copyright laws.

Constraints

Integration with Virtual Assistants: The system must integrate seamlessly with existing virtual assistants, preserving user experience.
Independent Deployment: It must operate autonomously without third-party dependencies.

How to install

This guide will walk you through the steps to install the application.

⛔ At least 25GB of free space is required to install the application. ⛔

Components

Docker

For ease of use, we have provided a Dockerfile that contains all the necessary dependencies to run the application. This is the recommended way to run the application.

Prerequisites

Installation

Clone the repository:
```
git clone https://github.com/ddu72/PI
```
Change to the project directory:
```
cd PI
```
Build the Docker image:
```
docker-compose build
```
Start the target services:
```
docker-compose up [proxy, server, api, normalizer] -d
```
!Note: The -d flag is optional and runs the services in detached mode.

!Note: The each service will start its dependencies automatically, for example, the proxy service will start all the server services and the server service will start the normalizer service.

The services are now running and ready to be used. See How to Use for more information.

Stopping the Services

To stop the services, run the following command:

docker-compose down

Removing the Docker Image

To remove the Docker image, run the following command:

docker-compose down --rmi all

Standalone Program

The Intelex Module can be used as a standalone program (Single Service).

Prerequisites

Installation

Clone the repository:
```
git clone https://github.com/ddu72/PI
```
Change to the project directory:
```
cd PI
```
Download the pre-trained models:
```
sh download_models.sh
```
!Note: This will download the pre-trained models to the model directory. This may take a while depending on your internet connection. (Aprox. 7GB)

Install the requirements:

pip install -r environments/server_requirements.txt

The standalone program is now ready to be used. See How to Use for more information.

How to use

This section will guide you through the process of using the system, either as a standalone program or as a dockerized version.

Components

Docker

To run the application (Client-side) using Docker, follow the steps below.

Prerequisites

Docker
Docker Compose
Docker images built (see How to Install)

Usage

Verify that the Docker images have been built and the "main" services are running:

docker compose ps

!Note: The output should be similar to the following:

PI> docker compose ps
NAME                   IMAGE                      COMMAND                  SERVICE      CREATED          STATUS          PORTS
normalizer_container   tts-agentifai:normalizer   "python app_normaliz…"   normalizer   13 seconds ago   Up 12 seconds   50051/tcp
proxy_container        tts-agentifai:proxy        "python app_proxy -d"    proxy        13 seconds ago   Up 11 seconds   0.0.0.0:50051->50051/tcp, :::50051->50051/tcp
server_container1      tts-agentifai:server       "python app_server"      server_1     13 seconds ago   Up 12 seconds   50051/tcp
server_container2      tts-agentifai:server       "python app_server"      server_2     13 seconds ago   Up 12 seconds   50051/tcp
server_container3      tts-agentifai:server       "python app_server"      server_3     13 seconds ago   Up 12 seconds   50051/tcp

!Note: The normalizer_container, proxy_container, and server_containerX services should be running.

Run endpoints:
1. Client:
```
docker compose run -e PROXY_SERVER_PORT={PROXY_SERVER_PORT} -e PROXY_SERVER_ADDRESS={PROXY_SERVER_ADDRESS} client
```
  !Note: The PROXY_SERVER_PORT and PROXY_SERVER_ADDRESS are optional and default to 50051 and proxy_container, respectively.
  
  !Note: The client will be available in the opened terminal.
2. Frontend and API:
```
docker compose run frontend
```
  !Note: The frontend will be available at http://localhost:3000 and the API will be available at http://localhost:5000.
The output will be displayed in the terminal.

Standalone Usage

This allows the user to synthesize text using the Intelex Module. This version is a standalone (Single Service) version of the implementation.

Prerequisites

(see Standalone Installation Pre-requisites)

Steps

Run the program:

python intlex.py [TEXT] [CONFIG] --output [OUTPUT] --lang [LANG] --kwargs [KWARGS]

The output will be saved in the output file if provided, otherwise it will be stored in the default location.

Arguments and Options

For a more detailed explanation of the arguments and options for the standalone version, see the Intlex Model section.

TEXT (Required)

This argument is used to pass the text to be synthesized.

CONFIG (Optional)

This argument is used to pass the configuration file. The default configuration file is in the config directory.

OUTPUT (Optional)

This argument is used to pass the output file. The default output file is output.wav.

LANG (Optional)

This argument is used to pass the language of the text to be synthesized. The default language is pt.

KWARGS (Optional)

This argument is used to pass additional arguments to the synthesizer. The arguments should be passed as a dictionary in the following format:

--kwargs '{"temperature":0.01,"top_p":0.4,"top_k":60}'

Updating

Model training

The xtts_v2_pt_colab_pro_notebook.ipynb notebook is a comprehensive script designed for fine-tuning and training the XTTS (Extended Text-to-Speech) model. It is structured for execution in Google Colab or locally, leveraging a dataset in the LJSpeech format and integrating with the Coqui TTS framework for model training and management.

Dataset Format

The XTTS training process requires the dataset to be structured following the LJSpeech format, adapted for single-speaker audio data. Below is a detailed explanation of the format and its requirements.

Directory Structure

The dataset should follow this structure:

dataset
├── metadata.csv
└── wavs
    ├── audio_0.wav
    ├── audio_1.wav
    ├── audio_2.wav
    ├── audio_3.wav
    ├── audio_4.wav
    ├── audio_5.wav
    ├── audio_6.wav
    ├── audio_7.wav
    ├── audio_8.wav
    ├── audio_9.wav
    └── audio_N.wav

Components

metadata.csv

A CSV file that maps each audio file to its corresponding text transcription.

Format: Each row contains:

filename (without extension): The name of the audio file.
transcription: The text spoken in the audio file.
normalizer transcription: The text spoken in the audio file.

Example:

audio_0|Hello, welcome to the XTTS training system.|Hello, welcome to the XTTS training system.
audio_1|This is an example of a training sentence with numbers 12345.| This is an example of a training sentence with numbers one two three four five.
audio_n|High-quality audio is essential for better results.|High-quality audio is essential for better results.

The delimiter is a pipe (|).

wavs/ Directory
- Contains all audio files in WAV format.
- Requirements:
  - Sampling rate: 22,050 Hz (recommended).
  - Single-channel (mono) audio.
  - Bit depth: 16-bit PCM encoding.

Requirements for Training

The dataset must consist of:
- Single Speaker: All audio recordings must originate from the same speaker to ensure consistency in the model's output.
- Clear Transcriptions: Text transcriptions should accurately match the spoken content in the audio files. Any incorrectly text matched will "Poison" the model.
Audio file naming should correspond to the filename column in the metadata.csv file.

Example Dataset Entry

For the dataset structure shown above, the metadata.csv might look like this:

audio_0|Hello, welcome to the XTTS training system.|Hello, welcome to the XTTS training system.
audio_1|This is an example of a training sentence with numbers 12345.| This is an example of a training sentence with numbers one two three four five.
audio_n|High-quality audio is essential for better results.|High-quality audio is essential for better results.

Notes

Ensure all filenames and paths are consistent between metadata.csv and the audio files.
Verify the audio quality and transcriptions for best results during training.
If using a dataset in another format, preprocess it into this structure before starting training.

File Structure and Execution Flow

1. Environment Setup

Dependency Installation:
- Installs necessary tools:
  - espeak-ng: For phoneme generation in Portuguese.
  - ffmpeg: For audio processing.
- Clones and installs the Coqui TTS repository to provide training utilities.
- Ensures all dependencies are installed with pip.

Environment Variables:

Defines key dataset and model paths:

DATASET_URL = "http://mayorx.xyz/Media/agentifai/podcast_dataset.zip"
DATASET_PATH = "podcast_dataset.zip"
DATASET_DIR = "dataset/"
DATASET_NAME = "bitalk_podcast"
DATASET_LANG = "pt"

2. Dataset Management

Dataset Download and Preparation:
- Fetches the dataset from a predefined URL and extracts it into a structured directory.
- Supports single-speaker datasets formatted similarly to LJSpeech:
  - WAV audio files.
  - Corresponding metadata with text transcripts.
Input Voice Configuration:
- Allows downloading or setting input voices for use during fine-tuning.

Commands used:

wget $DATASET_URL -O $DATASET_PATH
unzip $DATASET_PATH
rm $DATASET_PATH

3. Model Configuration

Trainer Setup:
- Imports and configures Trainer and TrainerArgs from Coqui TTS to manage the training process.
- Defines BaseDatasetConfig to specify dataset properties.
Logging and Checkpoints:
- Logs training progress using TensorBoard.
- Saves model checkpoints to a defined directory.

Sample Configuration:

RUN_NAME = "XTTS_FineTuned"
PROJECT_NAME = "XTTS_trainer"
DASHBOARD_LOGGER = "tensorboard"
CHECKPOINT_DIR = "./checkpoints"

4. Training Execution

Executes the training loop with Coqui TTS, fine-tuning the model based on the specified dataset.
Allows flexibility in hyperparameters for optimization.

5. Output

Model Artifacts:
- Fine-tuned XTTS model saved in the checkpoint directory.
Logs:
- Training metrics and progress logged for monitoring via TensorBoard.

Prerequisites

Hardware

GPU Required: A CUDA-enabled GPU is strongly recommended for efficient training.

Software

Python 3.x.
Necessary Python packages installed:
- espeak-ng
- ffmpeg
- torch, numpy, and other dependencies via Coqui TTS.

Dataset

A dataset formatted similarly to LJSpeech:
- Audio files in WAV format.
- A metadata file containing transcripts for each audio file.

Usage Instructions

Setup Environment:
- Install the required dependencies by running the notebook's initial cells.
Download Dataset:
- Provide a dataset URL or manually place your dataset in the dataset/ directory.
Configure Training:
- Adjust the training parameters in the notebook as needed:
  - Dataset path.
  - Checkpoint directory.
  - Logging options.
Run Training:
- Execute the training cells sequentially.
Monitor Progress:
- Use TensorBoard to visualize training metrics.
Save and Deploy:
- Use the final model checkpoint for deployment or further evaluation.

System Components

In this section, we will discuss the system components, how they function, and how they interact with each other.

Components

Communication

The components communicate with each other in the following ways:

With the use of GRPC's:
- API <---> Proxy
- Client <---> Proxy
- Proxy <---> Server
- Server <---> Normalizer
With the use of HTTP Requests and Responses:
- Frontend <---> API

Intlex Standalone Module

The Intlex Module serves as the standalone implementation of the text-to-speech system, allowing users to directly process text and generate audio files without using the microservices architecture. It provides flexibility for single-instance use cases.

Usage

To use the Intlex Module, you can run the following command:

python intlex.py [TEXT] [CONFIG] --output [OUTPUT] --lang [LANG] --kwargs [KWARGS]

The output will be saved in the output file if provided, otherwise it will be stored in the default output file.

Arguments and Options

TEXT: Text to be synthesized (required)
CONFIG: Configuration file (optional)
OUTPUT: Output file (optional)
LANG: Language [pt, en] (optional)
KWARGS: Additional arguments (optional)

Options:

Configuration File

The configuration file contains the settings for the synthesis process, such as the voice to clone, model information and configuration. If not provided, the default configuration will be used.

{
  "normalization_config": "config/normalization_rules.toml",
  "audio_path": {
    "pt": "inputs/voices/marcelo.wav",
    "en": "inputs/voices/eng_morgan_freeman.wav"
  },
  "storage_directory": "inputs/voices",
  "xtts_config": "model/training/GPT_XTTS_Finetuned/config.json",
  "model": {
    "checkpoint_dir": null,
    "eval": true,
    "strict": true,
    "speaker_file_path": null,
    "checkpoint_path": "model/training/GPT_XTTS_Finetuned/best_model.pth",
    "vocab_path": "model/training/XTTS_v2.0_original_model_files/vocab.json",
    "use_deepspeed": false
  }
}

Each field in the configuration file is described below:

normalization_config: Path to the normalization configuration file to be used by the model
audio_path: Default input voices for conversion, the field is a dictionary with the keys being the voices' languages and the values being the paths to the audio files
- pt: Path to the Portuguese reference audio file
- en: Path to the English reference audio file
storage_directory: Path to the folder where the reference audio files are stored
xtts_config: Path to the XTTS configuration file to be used by the trained model
model: Model configuration to be used for conversion, the fields are according to the trained model to facilitate configuration
- checkpoint_dir: Path to the directory where the model checkpoints are stored
- eval: Whether the model is in evaluation mode
- strict: Whether to strictly enforce the model configuration
- speaker_file_path: Path to the file containing the speaker information
- checkpoint_path: Path to the model checkpoint file
- vocab_path: Path to the vocabulary file
- use_deepspeed: Whether to use DeepSpeed for training

Language

The language option allows the user to specify the language of the text to be synthesized. The available languages are:

pt: Portuguese
en: English

The model suports more languages, but the available voices are only in Portuguese and English. If the user wants to use another language, they must provide the voice file and update the configuration file accordingly.

Additional Arguments for the voice synthesis

The kwargs option allows the user to specify additional arguments for the voice synthesis process. The available arguments are:

length_penalty: Length penalty for the synthesis process
repetition_penalty: Repetition penalty for the synthesis process
top_k: Top k tokens to consider for the synthesis process
top_p: Top p tokens to consider for the synthesis process
do_sample: Whether to sample the tokens
temperature: Temperature for the sampling process

The default values for the additional arguments are:

{
  "length_penalty": 1.0,
  "repetition_penalty": 2.5,
  "top_k": 40,
  "top_p": 0.5,
  "do_sample": true,
  "temperature": 0.7
}

The format for the kwargs option is a dictionary with the keys being the argument names and the values being the argument values.

py intlex.py TEXT --kwargs '{"temperature":0.01,"top_p":0.4,"top_k":60}'

This will set the temperature to 0.01, top_p to 0.4 and top_k to 60.

Example

To synthesize the text "Hello, World!" in English, you can run the following command:

python intlex.py "Hello, World!" --lang en

This will generate an audio file with the synthesized speech in English.

Frontend

The frontend is the user interface of the application. It is responsible for rendering the user interface and handling user interactions. The frontend is built using React.

Requirements

The frontend requires the following:

ENV: The environment variables must be defined in the .env file or as environment variables. The following variables must be defined:
- PORT: The port on which the frontend will run.
- REACT_APP_API_IP_PORT: The port of the API server.
- REACT_APP_API_IP_ADDRESS: The address of the API server.
Dependencies: The frontend requires the following dependencies:
- Node.js
- npm
- axios
- react-audio-player

Usage

To run the frontend, you can do the following:

Define the api address and port in the .env file or as an environment variable. The serve port can also be defined. For example:
```
export PORT=3000
export REACT_APP_API_IP_PORT=5000
export REACT_APP_API_IP_ADDRESS={api_address}
```
Install the required dependencies by running:
```
cd app
npm install
```
Start the frontend by running: bash cd app npm start The frontend will be available at http://localhost:3000.

Comunication

The frontend communicates with the API to send the text to be synthesized and receive the audio file. The API is responsible for processing the text and generating the audio file.

The frontend sends a POST request to the API with the text to be synthesized. The API processes the request and when completed, the audio file is sent to the frontend. The frontend then receives the audio file and displays it to the user.

Request

{
  "text": "O exame de Época Especial realiza-se no dia 10 de Julho, às 9:00, na sala Ed.2 1.03.",
  "language": "pt"
}

Response

[BINARY DATA]

Aspect

The frontend is composed of only one page, which contains the text input and the audio player. The user can input the text to be synthesized and click the "Enviar" button to generate the audio file. The audio player will display the audio file and allow the user to play, pause, and stop the audio.

Frontend

Frontend With Text

Frontend Waiting

Frontend With Audio Player

API

The API is a RESTful service that provides access to the system's functionality. It is built using Flask, a Python microframework for web development. The API is responsible for handling requests from the client, processing them by sending them to the proxy, and returning the appropriate response.

Endpoints

The API has the following endpoint:

/api/tts: This endpoint is used to syntesize text to speech. It accepts a POST request with a JSON payload containing the text to be synthesized and the language to use. The API sends the text to the proxy, which then sends it to the TTS engine. The TTS engine synthesizes the text and returns the audio file, which is then returned to the client.

POST `/api/tts`

Request

{
  "text": "Hello, how are you?",
  "language": "en"
}

Response

UklGRvYAAABXQVZFZm10...IBAAAAABAAEAQB8AAEAfAAAA==

Requirements

The API requires the following:

ENV: The environment variables must be defined in the .env file or as environment variables. The following variables must be defined:
- PORT: The port on which the API server will run.
- PROXY_SERVER_PORT: The port on which the proxy server is running.
- PROXY_SERVER_ADDRESS: The address of the proxy server.
Dependencies: The API requires the following dependencies:
- Python requirements

Usage

To use the API, you need to start the Flask server by doing the following:

Define the proxy address and port in the .env file or as an environment variable. The serve port can also be defined, just remember to update the frontend REACT_APP_API_PORT to the correct one. For example:
```
export PORT=5000
export PROXY_SERVER_PORT=50051
export PROXY_SERVER_ADDRESS={proxy_address}
```

Install the required dependencies by running:

pip install -r enviroments/api_requirements.txt

Start the Flask server by running:
```
python app_api
```

The API will start running on the specified port, and you can now send requests to it.

Client

The client is the user interface of the application. It is a terminal-based application that allows users to interact with the system. The client is responsible for sending requests directly to the Proxy and displaying the responses to the user.

Functionality

The client provides the following functionality:

User authentication: The client allows users to log in to the system using their token.
User interaction: The client provides a command-line interface for users to interact with the system. Users can send requests to the Proxy and receive responses.
Response display: The client displays the responses received from the Proxy in a user-friendly format.

Implementation

The client is a rudimentary terminal-based application made in Python. It reads user input from the command line, sends requests to the Proxy, and displays the responses to the user. The client uses the grpc library to communicate with the Proxy using gRPC.

The client is a standalone component that can be run on any machine with Python installed. It requires only the grpcs module/folder to be present in the same directory.

Requirements

The client requires the following dependencies:

Python 3.12 or higher
Python packages

Usage

To run the client, follow these steps:

Clone the repository:
```
git clone https://github.com/ddu72/PI
```

Define the environment variables:

export USER_TOKEN={user_token}
export PROXY_SERVER_PORT={proxy_host}
export PROXY_SERVER_ADDRESS={proxy_port}

Install the required packages:

pip install -r environments/client_requirements.txt

Run the client:
```
 python app_client
```
Follow the instructions displayed on the terminal to interact with the system.

Aspect

Client Aspect

Client with Text

Client Audio List

Normalization

The Normalization Service is a microservice responsible for pre-processing text inputs to ensure a standardized and consistent format before they are sent to the TTS synthesis process. It uses a modular design, allowing flexibility in managing text transformations such as punctuation adjustments, number-to-word conversions, date formatting, and abbreviation expansions.

This service supports multiple languages and can be configured via a rules.toml file, making it easy to adapt to new use cases or languages.

Functionality

Key Features:

Text Normalization:
Converts raw text into a predictable, synthesis-ready format. Handles various textual elements including punctuation, numbers, and abbreviations.
Multi-Language Support:
Adapts normalization rules based on the language (e.g., pt for Portuguese, en for English).
gRPC Communication:
Offers a Normalize method to process text requests from the TTS server and returns a NormalizeResponse with the transformed text.
Modularity:
Rules are implemented as independent modules, which can be enabled or disabled through the configuration file.

Architecture

Core Logic:

The Normalizer class is the entry point for applying normalization rules.
Rules are dynamically loaded based on the configuration file.

Rule Modules:

Each rule is a separate class inheriting from a NormalizationRule base class.

gRPC Service:

File: normalizer.py
Defines the NormalizerService and handles requests via gRPC.

Implementation

File: normalizer.py

The NormalizerService initializes the Normalizer class with a rule configuration and exposes the Normalize method:

class NormalizerService(normalizer_pb2_grpc.NormalizerServiceServicer):
    def __init__(self, args):
        if args.rules is None:
            args.rules = NORMALIZER_RULES
        self.normalizer = Normalizer(args.rules)

    def Normalize(self, request, context):
        normalized_text = self.normalizer.normalize_text(request.text.strip())
        return normalizer_pb2.NormalizeResponse(normalized_text=normalized_text)

The service listens on a configurable port (NORMALIZER_PORT) and communicates with other components using gRPC.

Rules