Dependable document translation across languages remains an unsolved problem. The process is time-consuming and requires human intervention, primarily because current methods fail to capture the local context within a document. It is also an accessibility problem, since a large share of the world's population does not know English. In this work, we demonstrate that our proposed method, termed Project Lingua Franca, can generate document translations cheaply and fairly efficiently, with no prior training, in zero-shot settings.
We ingest a PDF document and convert it into an image-based representation. This representation undergoes several pre-processing steps that enhance image quality and make the characters as prominent as possible. We then use Tesseract to perform quick and effective optical character recognition (OCR) in English. The OCR output is passed to the text pre-processing submodule, which runs it through a token encoder and collates the resulting tokens into a single input vector. For translation, we use the state-of-the-art SeamlessM4T model in T2TT (Text-to-Text Translation) mode, with the target language chosen by the user from among the languages supported in the original paper. The tokenized vectors are passed to the model, which generates predicted tokens in the target language; these are then decoded into human-readable text. From here, the pipeline branches: we calculate the BLEU and SacreBLEU scores during training, while the final output is formatted into a document, converted into a PDF, and saved to the local disk for the user to retrieve.
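The OCR and translation stages described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not our exact implementation: the helper names (`ocr_pdf`, `clean_ocr_text`, `split_into_chunks`, `translate_chunks`), the grayscale-only image pre-processing, the 400-character chunk limit, and the `facebook/hf-seamless-m4t-medium` checkpoint are all choices made for the example. It assumes the `pdf2image`, `pytesseract`, and Hugging Face `transformers` packages are installed; BLEU scoring and PDF export are omitted.

```python
import re
from typing import List


def ocr_pdf(pdf_path: str, dpi: int = 300) -> str:
    """Render each PDF page to an image and run English OCR on it."""
    # Imported lazily so the text helpers below work without these packages.
    from pdf2image import convert_from_path
    import pytesseract

    page_texts = []
    for page in convert_from_path(pdf_path, dpi=dpi):
        gray = page.convert("L")  # assumed pre-processing: grayscale only
        page_texts.append(pytesseract.image_to_string(gray, lang="eng"))
    return "\n".join(page_texts)


def clean_ocr_text(raw: str) -> str:
    """Rejoin words hyphenated across line breaks and collapse whitespace."""
    text = re.sub(r"-\n(?=\w)", "", raw)  # "transla-\ntion" -> "translation"
    text = re.sub(r"\s+", " ", text)      # newlines, repeated spaces -> one space
    return text.strip()


def split_into_chunks(text: str, max_chars: int = 400) -> List[str]:
    """Group whole sentences into chunks of roughly max_chars characters.

    A single sentence longer than max_chars is kept intact as its own chunk.
    """
    chunks: List[str] = []
    current = ""
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def translate_chunks(
    chunks: List[str],
    tgt_lang: str = "fra",
    checkpoint: str = "facebook/hf-seamless-m4t-medium",  # assumed checkpoint
) -> List[str]:
    """Translate English chunks with SeamlessM4T in text-to-text mode."""
    from transformers import AutoProcessor, SeamlessM4TForTextToText

    processor = AutoProcessor.from_pretrained(checkpoint)
    model = SeamlessM4TForTextToText.from_pretrained(checkpoint)

    translated = []
    for chunk in chunks:
        inputs = processor(text=chunk, src_lang="eng", return_tensors="pt")
        output_tokens = model.generate(**inputs, tgt_lang=tgt_lang)
        translated.append(
            processor.decode(output_tokens[0].tolist(), skip_special_tokens=True)
        )
    return translated
```

A typical call chain under these assumptions would be `translate_chunks(split_into_chunks(clean_ocr_text(ocr_pdf("paper.pdf"))), tgt_lang="fra")`, where `paper.pdf` is a placeholder path. Splitting on sentence boundaries keeps each model input short, at the cost of discarding context that spans chunk boundaries.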