Text Manipulation: Extracting, Translating, and Converting to Audio

A friend of mine was asking me about the technical complexity of a children’s picture book reading robot, the Luka. The robot is used to read children’s books in both English and Chinese. It looks at the pages and then reads the book aloud in English or Chinese.

The Luka Children's Book Reading Companion Robot
The Luka Children’s Book Reading Companion Robot

This product has three primary requirements:

  • Extracting text from an image
  • Translating the content between languages
  • Converting the text to audio

As a proof of concept, I used the following image as an example use case.

My Mumm Book Sample Page
My Mumm Book Sample Page

I leveraged several APIs to replicate the functionality. The sample code is available here in a Google Colab notebook.

Environment Setup

Google Colab has the ability to be run with a CPU, GPU or TPU. For this example, it is recommended to change the runtime type and select “GPU” as the hardware accelerator.

Additionally, some of the APIs used require an AWS developer account and user permissions. AWS provides detailed documentation for setting up an Administrator account and generating secrets.

Extracting Text From An Image

The first step is to extract the text from the image. Amazon’s Rekognition API has a detect_text() function which will do this, but it only works on ISO basic Latin script characters.

Instead, the EasyOCR library by JaidedAI is a free utility which supports 80+ supported languages and all popular writing scripts including Latin, Chinese, Arabic, Devanagari, and Cyrillic.

First, download the sample image and display it. Once the image is loaded in the Google Colab instance and EasyOCR is installed, configure the reader. The reader takes arguments for language models to load.

reader = easyocr.Reader(['ch_sim','en']) # need to run only once to load model into memory

Then pass the image filename and extract the text.

result = reader.readtext('my_mumm_ex.jpg', detail = 0)

The following text is detected.


Translating The Text

Next, the text needs to be translated from Chinese characters to English using Googletrans. Googletrans is a free python library that uses the Google Translate Ajax API.

Load the translator and pass the text string and destination language.

translator = Translator()
result = translator.translate(text_ch, dest='en')

Store the translated text in a new variable.

text_en = result.text

Print the translated text to see the output.

Also a very juggling stuntman

Converting Text to Audio

The final step is to convert the text to audio. Amazon Polly is a Text-to-Speech (TTS) cloud service that converts text into lifelike speech. Amazon Polly supports multiple languages and includes a variety of lifelike voices.

First, the connection to Polly is set up given the AWS access credentials and preferred region.

aws_access_key_id = AWS_ACCESS_KEY_ID,
aws_secret_access_key = AWS_SECRET_ACCESS_KEY,
region_name = AWS_REGION_ID)

Next, the English text will be converted. The synthesize_speech function takes a specific voice ID, an output file format, and the text in English as an input.

response = polly_client.synthesize_speech(VoiceId='Joanna',
Text = text_en)

The file can be written and saved to the Google Colab instance.

file = open('book_eng.mp3', 'wb')

The file can be downloaded from the Google Colab instance to the host machine. The file is available here.

The text can also be converted using a Chinese voice ID.

response = polly_client.synthesize_speech(VoiceId='Zhiyu',
Text = text_en)

This file is available here.