Machine Translation in Python (with code and explanation)
Note that this article has snippets of code. For the full code base click here.
The need for sequence to sequence models
Sequence to Sequence models are encoder-decoder networks whose architecture can be leveraged for complex machine learning tasks like Machine Translation, Text Summarization and chatbots.
Preprocessing
Guidelines
- Remove special characters (@, %, #, …)
- Add a space between punctuation and words (Good morning! → Good morning !)
- Add <start> and <end> tokens to each sentence
- Create word index and reverse word index (using the keras tokenizer)
- Pad each sentence to maximum length
Sample Input
esta es mi vida.
Sample output
<start> esta es mi vida . <end>
Code
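The guidelines above can be sketched as follows. This is a minimal stand-in for the full preprocessing code: the regexes keep only unaccented letters and basic punctuation for simplicity, and `build_word_index` mimics what the Keras Tokenizer produces (function names here are illustrative, not the article's original code).

```python
import re

def preprocess_sentence(sentence):
    """Apply the preprocessing guidelines above to one sentence."""
    sentence = sentence.lower().strip()
    # add a space between words and punctuation ("morning!" -> "morning !")
    sentence = re.sub(r"([?.!,¿])", r" \1 ", sentence)
    sentence = re.sub(r"\s+", " ", sentence)
    # remove special characters, keeping letters and basic punctuation
    # (for simplicity this also drops accented characters)
    sentence = re.sub(r"[^a-z¿?.!,]+", " ", sentence).strip()
    # add <start> and <end> tokens
    return "<start> " + sentence + " <end>"

def build_word_index(sentences):
    """Word index and reverse word index (a stand-in for the Keras
    Tokenizer), reserving id 0 for padding."""
    vocab = sorted({word for s in sentences for word in s.split()})
    word_index = {word: i + 1 for i, word in enumerate(vocab)}
    reverse_index = {i: word for word, i in word_index.items()}
    return word_index, reverse_index

def pad_sequence(seq, max_len):
    """Pad a list of word ids with zeros up to max_len."""
    return seq + [0] * (max_len - len(seq))
```

Running `preprocess_sentence("esta es mi vida.")` reproduces the sample output above.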
Modelling the network
Discussing forward propagation
- Convert all the words in the source sentence to one-hot vectors whose length equals the vocab size.
- Pass the one-hot vectors through a trainable 256-dimension embedding layer (Click here to know more about embedding layers and word vectors).
- The embedded vectors are passed through RNN layers (1024 units).
- An encoding of the source sentence is generated.
- The encoding is passed to the decoder, where it acts as the initial hidden state of the decoder RNN.
- The decoder is trained by passing one word at a time.
- The model is expected to predict the next word, which is then fed to the RNN layer (the actual next word is fed during training, while the predicted word is fed during prediction).
- The model stops once it encounters the <end> token or reaches the maximum target-sentence length determined by the dataset.
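The decoding loop in the last few steps can be sketched in plain Python. Here `predict_next` is a hypothetical stand-in for one decoder RNN step plus softmax, and the `teacher_forcing` flag switches between feeding the actual next word (training) and the model's own prediction (prediction):

```python
def decode(encoder_state, target_words, predict_next, teacher_forcing, max_len=16):
    """Decoding loop with the two stopping conditions described above."""
    state, word, output = encoder_state, "<start>", []
    for t in range(max_len):
        state, predicted = predict_next(state, word)
        output.append(predicted)
        if predicted == "<end>":  # stop at the <end> token...
            break                 # ...or after max_len words (loop bound)
        # training feeds the actual next word; prediction feeds the model's own output
        word = target_words[t + 1] if teacher_forcing else predicted
    return output
```

With a perfectly trained `predict_next`, both modes produce the same output; they differ when the model makes a mistake, which teacher forcing prevents from compounding during training.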
Discussing matrix shapes
Interestingly, matrix shapes are rarely discussed alongside modelling, yet I feel they are a very important part of understanding networks in ML.
The letters in parentheses () indicate the layers. Using these letter denotations, we shall look into the matrix shapes.
We'll also consider the batch size (64) to avoid confusion when going through the code.
Let max length of source sentence = 16 (after padding)
Number of units in RNN = 1024
{we will consider all 16 timesteps together because we have the entire source sentence}
A -> (64, 16, Vocab_size)
B -> (64, 16, 256)
C -> (64, 1, 1024)
Hi -> (64, 1, 1024) {we will consider each word one by one in the decoder, as the previous word is used to form the next word; hence we will not use the max length here}
D -> (64, 1, Vocab_size)
E -> (64, 1, 256)
F -> (64, 1, 1024)
G -> (64, 1, Vocab_size) {apply softmax here to get the output word}
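These shapes can be sanity-checked with a quick NumPy sketch. The vocab size of 1000 and the random weight matrices are arbitrary stand-ins, and the RNNs themselves are omitted; only the shapes flowing between the layers are shown:

```python
import numpy as np

batch, max_len, vocab_size = 64, 16, 1000   # 1000 is an arbitrary stand-in
embed_dim, units = 256, 1024

# encoder side: the whole padded sentence at once
A = np.zeros((batch, max_len, vocab_size))       # one-hot source words
embed = np.random.randn(vocab_size, embed_dim)   # embedding matrix
B = A @ embed                                    # embedded source sentence
C = np.zeros((batch, 1, units))                  # encoder state after the RNN

# decoder side: one word at a time
Hi = C                                           # hidden state handed to the decoder
D = np.zeros((batch, 1, vocab_size))             # one-hot previous target word
E = D @ embed                                    # embedded target word
F = np.zeros((batch, 1, units))                  # decoder RNN output
W_out = np.random.randn(units, vocab_size)       # output projection
logits = F @ W_out
G = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)  # softmax over the vocab
```

Each array's `.shape` matches the corresponding letter in the list above.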
Sample input
<start> esta es mi vida . <end>
Sample output
<start> this is my life . <end>
Coding the Encoder Network
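A minimal sketch of the encoder as a `tf.keras.Model` subclass, in the spirit of the TensorFlow NMT tutorial. The class and method names are my own choices, not necessarily the article's original code:

```python
import tensorflow as tf

class Encoder(tf.keras.Model):
    """GRU encoder: word ids -> 256-d embeddings -> 1024-unit GRU."""
    def __init__(self, vocab_size, embedding_dim=256, units=1024):
        super().__init__()
        self.units = units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(units,
                                       return_sequences=True,
                                       return_state=True)

    def call(self, x, hidden):
        x = self.embedding(x)                 # (batch, max_len, 256)
        output, state = self.gru(x, initial_state=hidden)
        return output, state                  # state becomes the decoder's hidden state

    def initial_state(self, batch_size):
        return tf.zeros((batch_size, self.units))
```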
Coding the Decoder Network
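The decoder mirrors the encoder but processes one word per call and projects the GRU output onto the vocabulary. Again, this is an illustrative sketch rather than the original code:

```python
import tensorflow as tf

class Decoder(tf.keras.Model):
    """GRU decoder: predicts the next target word from the previous word
    and the hidden state handed over by the encoder."""
    def __init__(self, vocab_size, embedding_dim=256, units=1024):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(units,
                                       return_sequences=True,
                                       return_state=True)
        self.fc = tf.keras.layers.Dense(vocab_size)  # logits; softmax lives in the loss

    def call(self, x, hidden):
        x = self.embedding(x)                        # (batch, 1, 256)
        output, state = self.gru(x, initial_state=hidden)
        return self.fc(output), state                # (batch, 1, vocab_size), (batch, units)
```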
Loss Function
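A common formulation is sparse categorical cross-entropy that masks out the zero-padded positions so padding never contributes to the gradient (the exact reduction used here is an assumption):

```python
import tensorflow as tf

loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction="none")

def loss_function(real, pred):
    """Cross-entropy over one decoder step, ignoring padding (word id 0)."""
    mask = tf.cast(tf.not_equal(real, 0), pred.dtype)  # 0 where real is padding
    return tf.reduce_mean(loss_object(real, pred) * mask)
```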
Training
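One teacher-forced training step can be sketched as below. The dimensions are tiny hypothetical stand-ins so the sketch runs quickly, and the layers are kept as plain variables rather than the Encoder/Decoder classes to stay self-contained; it is a minimal sketch, not the article's original training loop:

```python
import tensorflow as tf

# tiny stand-in sizes (hypothetical; the article uses 256-d embeddings, 1024 units)
vocab_size, embed_dim, units, batch, max_len = 50, 8, 16, 4, 6

embedding = tf.keras.layers.Embedding(vocab_size, embed_dim)
encoder_gru = tf.keras.layers.GRU(units, return_state=True)
decoder_gru = tf.keras.layers.GRU(units, return_state=True)
fc = tf.keras.layers.Dense(vocab_size)
optimizer = tf.keras.optimizers.Adam()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

def train_step(src, tgt):
    """One teacher-forced training step over a batch."""
    with tf.GradientTape() as tape:
        _, state = encoder_gru(embedding(src))     # encode the source sentence
        loss = 0.0
        for t in range(tgt.shape[1] - 1):
            # feed the actual word tgt[:, t]; ask the model for tgt[:, t + 1]
            out, state = decoder_gru(embedding(tgt[:, t:t + 1]),
                                     initial_state=state)
            loss += loss_object(tgt[:, t + 1], fc(out))
    variables = (embedding.trainable_variables + encoder_gru.trainable_variables
                 + decoder_gru.trainable_variables + fc.trainable_variables)
    grads = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(grads, variables))
    return float(loss)
```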
Results
Translate
Input: <start> ¿ todavia estan en casa ? <end>
Predicted translation: are you at home ? <end>
Problem with the current Encoder Decoder Network
By the time the encoder reaches the last word of the input sentence, much of the information about the earlier words has been lost: the entire sentence is squeezed into a single fixed-length encoding, and vanishing gradients make long-range dependencies hard to learn.
This especially affects longer input sentences.
Solution
We want the encoder to remember the important words of the input sentence, which affect how the decoder decodes the encoding. For this, we use the attention mechanism.
A detailed description of attention follows in Part 2.