NeMo Conversational AI Translator

Introduction

This article introduces NVIDIA's conversational AI French-to-English translator, built with the NeMo conversational AI toolkit. We will walk through an example of translating spoken French into spoken English using three of NeMo's collections:
– ASR: Automatic Speech Recognition
– NLP: Natural Language Processing
– TTS: Text-To-Speech synthesis
Python code is given along the way.

Installing the NeMo package

With our good old friend pip, we can install NVIDIA NeMo for our conversational AI project as follows:

!python -m pip install "git+https://github.com/NVIDIA/NeMo.git@r1.4.0#egg=nemo_toolkit[all]"

Importing modules

We will use three modules from the NeMo collections: ASR, NLP, and TTS. These collections are the ingredients of our conversational AI translator. IPython is used only to listen to specific pieces of audio.

import nemo
import nemo.collections.asr as nemo_asr
import nemo.collections.nlp as nemo_nlp
import nemo.collections.tts as nemo_tts
import IPython

Models

We will be using pretrained models from NVIDIA's NGC catalog. NVIDIA generously makes a large number of high-quality models available. To make our lives easier, we can check which models exist with a simple call such as:

nemo_tts.models.HifiGanModel.list_available_models()

The call above lists the models available for the HiFi-GAN model class in NeMo's TTS collection. You should get the following response:

[PretrainedModelInfo(
 	pretrained_model_name=tts_hifigan,
 	description=This model is trained on LJSpeech audio sampled at 22050Hz and mel spectrograms generated from Tacotron2, TalkNet, and FastPitch. This model has been tested on generating female English voices with an American accent.,
 	location=https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_hifigan/versions/1.0.0rc1/files/tts_hifigan.nemo,
 	class_=<class 'nemo.collections.tts.models.hifigan.HifiGanModel'>
 )]
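Each entry in that response carries a `pretrained_model_name` field, so the listing can be reduced to just the names. The helper below is a minimal sketch: `available_model_names` is a hypothetical name, not part of NeMo, and it assumes only that the class exposes `list_available_models()` as shown above.

```python
def available_model_names(model_class):
    """Return the sorted pretrained model names that a model class advertises.

    Assumes model_class.list_available_models() returns objects with a
    pretrained_model_name attribute, as in the response shown above.
    """
    # Some classes may return None instead of an empty list, so guard for it.
    infos = model_class.list_available_models() or []
    return sorted(info.pretrained_model_name for info in infos)
```

In the notebook, this would be called as `available_model_names(nemo_tts.models.HifiGanModel)`, and likewise for the ASR and NMT classes used later.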

Loading pretrained models onto the GPU with CUDA

CUDA is a parallel computing platform and programming model developed by NVIDIA for general computing on graphics processing units (GPUs). With CUDA, developers can dramatically speed up computing applications by harnessing the power of GPUs. No training is needed here: we load four pretrained models and move each one to the GPU with .cuda(), as follows:

asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name='stt_fr_quartznet15x5').cuda()
nmt_model = nemo_nlp.models.MTEncDecModel.from_pretrained(model_name='nmt_fr_en_transformer12x2').cuda()
spectrogram_generator = nemo_tts.models.FastPitchModel.from_pretrained(model_name='tts_en_fastpitch').cuda()
vocoder = nemo_tts.models.HifiGanModel.from_pretrained(model_name='tts_hifigan').cuda()
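The `.cuda()` calls above raise an error on a machine without a CUDA-capable GPU. One hedged alternative is to pick the device once and move each model with `.to(device)`, which should work because NeMo models are PyTorch modules. `pick_device` below is a hypothetical helper, not part of NeMo.

```python
def pick_device():
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch  # PyTorch is pulled in as a NeMo dependency
    except ImportError:
        # Without PyTorch there is no GPU backend to use at all.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Models will be placed on: {device}")
```

Each loading line would then end in `.to(device)` instead of `.cuda()`, for example `nemo_asr.models.EncDecCTCModel.from_pretrained(model_name='stt_fr_quartznet15x5').to(device)`. Expect inference on the CPU to be noticeably slower.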

Reading online French Audio

I will be using the lightbulblanguages website as a source of French audio. We can grab any piece of audio with wget and then play it in the notebook:

!wget 'https://www.lightbulblanguages.co.uk/resources/audio/trente.mp3'
audio_sample = 'trente.mp3'
IPython.display.Audio(audio_sample)

IPython should render an audio player for the clip.

In case your French is rusty, the speaker is counting from 30 all the way up to 39.

Transcribe

We will now use our ASR model to transcribe the audio to text. The transcribe method takes a list of audio file paths:

transcribed_text = asr_model.transcribe([audio_sample])
print(transcribed_text)

You should see output like the following:

Transcribing: 100%
1/1 [00:01<00:00, 1.32s/it]
['trente trente et un trente deux trente trois trente quatre trente cinq trente six trente sept trente huit trente neuf']

Translating using NVIDIA’s NMT

We will now use the NMT model to translate the transcribed text from French to English:

english_text = nmt_model.translate(transcribed_text)
print(english_text)
['Thirty One Thirty Two Thirty Three Thirty Four Thirty Five Thirty Six Thirty Seven Thirty E']

And voilà, we now have our text translated from French to English.

Text to Speech

Now we convert the text above to speech in two steps: 1) text to spectrogram, and 2) spectrogram to audio. This is easily accomplished as follows:

parseText = spectrogram_generator.parse(english_text[0])
spectrogram = spectrogram_generator.generate_spectrogram(tokens=parseText)
audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)
audioOutput = audio.to('cpu').detach().numpy()

You can play the audio with

IPython.display.Audio(audioOutput, rate=22050)
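To keep the result outside the notebook, the waveform can also be written to a WAV file. The sketch below uses only Python's standard library; `save_wav` is a hypothetical helper, and it assumes the samples are floats in the range [-1.0, 1.0] at the same 22050 Hz rate used for playback.

```python
import struct
import wave


def save_wav(samples, path, rate=22050):
    """Write mono float samples in [-1.0, 1.0] to a 16-bit PCM WAV file."""
    frames = bytearray()
    for sample in samples:
        # Clip to the valid range so out-of-range values cannot overflow int16.
        clipped = max(-1.0, min(1.0, float(sample)))
        frames += struct.pack("<h", int(clipped * 32767))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)     # mono
        wav_file.setsampwidth(2)     # 2 bytes per sample = 16-bit audio
        wav_file.setframerate(rate)  # samples per second
        wav_file.writeframes(bytes(frames))
    return path
```

Assuming `audioOutput` holds a batch containing a single waveform, a call such as `save_wav(audioOutput[0], 'translated.wav')` would write it to disk.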

And we are done.
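Putting the steps together, the whole chain from French audio to English audio can be wrapped in one function. This is a minimal sketch rather than an official NeMo API: `speech_to_speech` is a hypothetical name, and it calls only the methods already used in this article on the four models loaded earlier.

```python
def speech_to_speech(audio_path, asr_model, nmt_model, spectrogram_generator, vocoder):
    """Run ASR -> NMT -> TTS on one audio file.

    Returns a tuple (french_text, english_text, audio), where audio is
    whatever the vocoder produces (a tensor in the NeMo case).
    """
    # Step 1: speech to French text (a list with one transcript per input file).
    french_text = asr_model.transcribe([audio_path])
    # Step 2: French text to English text (a list with one translation per input).
    english_text = nmt_model.translate(french_text)
    # Step 3: English text to spectrogram, then spectrogram to waveform.
    tokens = spectrogram_generator.parse(english_text[0])
    spectrogram = spectrogram_generator.generate_spectrogram(tokens=tokens)
    audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)
    return french_text[0], english_text[0], audio
```

With the models from this article, a call would look like `french, english, audio = speech_to_speech(audio_sample, asr_model, nmt_model, spectrogram_generator, vocoder)`, after which `audio.to('cpu').detach().numpy()` gives an array that IPython can play.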

Summary

All in all, we saw how to use NVIDIA's NeMo toolkit to build a conversational AI French-to-English translator that chains speech recognition, machine translation, and speech synthesis. You can explore other languages by checking the models available on NGC, or by training on your own custom data.
