Project Lingua Franca: Democratizing Information through Unified Optical Character Recognition and Neural Machine Translation

Suvaditya Mukherjee, Aaryadev Chandra, Krisha Chemburkar

December, 2023


Abstract

Document translation remains an unsolved problem when dependable translations across languages are required. The process is time-consuming and requires human intervention, primarily because current methods cannot develop an understanding of the local context within a document. It is also an accessibility problem, as a large section of the human population does not know English. In our work, we demonstrate how our proposed method, termed Project Lingua Franca, can generate document translations cheaply and fairly efficiently, with no prior training and in zero-shot settings.

Type

Conference paper

Publication

In IEEE International Conference on Modelling Simulation & Intelligent Computing

We ingest a PDF document and convert it into an image-based representation. This representation undergoes several pre-processing steps that enhance image quality and make the characters as prominent as possible. We then use the Tesseract module to perform quick and effective optical character recognition in English. The output of the OCR submodule is passed to the text pre-processing submodule, which runs it through a token encoder; the resulting tokens are collated into a single input vector. For translation, we use the state-of-the-art SeamlessM4T model in T2TT (Text-to-Text Translation) mode, with the target language chosen by the user from the languages supported in the original paper. The tokenized vectors are passed to the model, which generates predicted tokens in the target language; these are then decoded into human-readable text. From here, the pipeline branches: BLEU and SacreBLEU scores are calculated during training, while the final output is formatted into a document, converted into a PDF, and saved to the local disk for the user to retrieve.
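
The flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the `Pipeline` class, its field names, and the stand-in stages in the demo are hypothetical, and the `bleu` function is a simplified, unsmoothed sentence-level BLEU with whitespace tokenization rather than SacreBLEU. In a real setup, the `ocr` stage would wrap Tesseract and the `translate` stage would wrap SeamlessM4T in T2TT mode.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Pipeline:
    """Hypothetical container: one callable per stage of the described flow."""
    rasterize: Callable[[str], List[object]]  # PDF path -> page images
    preprocess: Callable[[object], object]    # page image -> cleaned image
    ocr: Callable[[object], str]              # cleaned image -> English text
    translate: Callable[[str, str], str]      # (text, target language) -> text

    def run(self, pdf_path: str, tgt_lang: str) -> List[str]:
        # Each page flows through the stages in the order given in the text.
        pages = [self.preprocess(page) for page in self.rasterize(pdf_path)]
        texts = [self.ocr(page) for page in pages]
        return [self.translate(text, tgt_lang) for text in texts]


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hypothesis: str, reference: str, max_n: int = 4) -> float:
    """Simplified sentence-level BLEU: uniform weights, no smoothing."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        total = sum(hyp_counts.values())
        # Clip each hypothesis n-gram count by its count in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # without smoothing, any empty precision gives zero
        log_precision += math.log(matched / total) / max_n
    # Brevity penalty: penalize hypotheses shorter than the reference.
    brevity = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return brevity * math.exp(log_precision)


# Stand-in stages (assumptions for illustration only); real stages would call
# a PDF rasterizer, image clean-up, Tesseract, and SeamlessM4T respectively.
demo = Pipeline(
    rasterize=lambda path: ["page-1", "page-2"],
    preprocess=lambda page: page,
    ocr=lambda page: f"text of {page}",
    translate=lambda text, lang: f"[{lang}] {text}",
)
outputs = demo.run("paper.pdf", "fra")
score = bleu("the cat sat on the mat", "the cat sat on the mat")
```

Keeping each stage behind a plain callable means the OCR engine or the translation model can be swapped without touching the rest of the flow, and the scoring branch stays independent of the document-formatting branch.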

Suvaditya Mukherjee

ML @ USC Institute of Creative Technologies & USC School of Cinematic Arts | MS CS-AI @ USC Viterbi | Google Developer Expert (Machine Learning)