Machine Translation in Python (with code and explanation)

Amay Gada · Published in Analytics Vidhya · 3 min read · Feb 5, 2022


Note that this article has snippets of code. For the full code base click here.

The need for sequence to sequence models

Sequence to Sequence models are encoder-decoder networks whose architecture can be leveraged for complex machine learning tasks like machine translation, text summarization and chatbots.

Seq2Seq models in a nutshell

Preprocessing

Guidelines

  1. Remove special characters (@, %, #, …)
  2. Add space between punctuation and words. (Good morning! -> Good morning !)
  3. Add <start> and <end> tokens to each sentence
  4. Create word index and reverse word index (using the keras tokenizer)
  5. Pad each sentence to maximum length

Sample Input

esta es mi vida.

Sample output

<start> esta es mi vida . <end>

Code
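The full preprocessing code is in the linked code base; below is a minimal sketch of the five guidelines above, assuming the Keras tokenizer (the function names are illustrative):

import re
import tensorflow as tf

def preprocess_sentence(s):
    s = s.lower().strip()
    s = re.sub(r"([?.!,¿])", r" \1 ", s)        # add space between punctuation and words
    s = re.sub(r"[^a-zA-Z?.!,¿]+", " ", s)      # remove special characters (@, %, #, ...)
    s = re.sub(r"\s+", " ", s).strip()
    return "<start> " + s + " <end>"            # add <start> and <end> tokens

def tokenize(sentences):
    # create the word index / reverse word index with the Keras tokenizer
    tokenizer = tf.keras.preprocessing.text.Tokenizer(filters="")
    tokenizer.fit_on_texts(sentences)
    tensor = tokenizer.texts_to_sequences(sentences)
    # pad each sentence to the maximum length in the corpus
    tensor = tf.keras.preprocessing.sequence.pad_sequences(tensor, padding="post")
    return tensor, tokenizer

print(preprocess_sentence("esta es mi vida."))   # <start> esta es mi vida . <end>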

Modelling the network

Discussing forward propagation

  1. Convert all words in the source sentence to one-hot vectors whose dimension equals the vocab size
  2. Put the one-hot vectors through a trainable 256-dimensional embedding layer (Click here to know more about embedding layers and word vectors). A short sketch of this step follows the list.
  3. The embedded vectors are passed through RNN layers (1024 units).
  4. An encoding of the source sentence is generated.
  5. The encoding is passed to the decoder, where it acts as the initial hidden state for the decoder RNN.
  6. The decoder is trained by passing one word at a time.
  7. The model is expected to predict the next word, which is then passed back to the RNN layer. (The actual word is passed during training, while the predicted word is passed during prediction.)
  8. The model stops once it encounters the <end> token, or once the max length of the target sentence, as determined by the dataset, is reached.
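Steps 1 and 2 deserve a small illustration: in Keras you normally pass integer word indices rather than explicit one-hot vectors, and the Embedding layer lookup is mathematically the same as multiplying a one-hot vector by the embedding matrix. A sketch of that equivalence (the vocab size here is illustrative):

import tensorflow as tf

vocab_size, embedding_dim = 5000, 256
embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)

word_index = tf.constant([42])                            # integer index of one word
via_lookup = embedding(word_index)                        # (1, 256) lookup; also builds the layer

one_hot = tf.one_hot(word_index, depth=vocab_size)        # (1, vocab_size)
via_matmul = tf.matmul(one_hot, embedding.embeddings)     # (1, 256) one-hot times the weight matrix

print(bool(tf.reduce_all(tf.abs(via_lookup - via_matmul) < 1e-6)))   # True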

Discussing matrix shapes

Interestingly, not many people discuss matrix shapes when discussing modelling, and I feel it is a very important part of understanding networks in ML.

The letters in round brackets denote the layers of the network. Using these letters, we shall look at the matrix shapes.

We’ll also include the batch size (64) to avoid confusion when going through the code.

Let max length of the source sentence = 16 (after padding)

Number of units in RNN = 1024

{We consider all 16 positions together because we have the entire source sentence.}
A -> (64, 16, Vocab_size)
B -> (64, 16, 256)
C -> (64, 1, 1024)
Hi -> (64, 1, 1024)
{We consider each word one by one in the decoder, since we use the previous word to form the next word. Hence we do not use the max length here.}
D -> (64, 1, Vocab_size)
E -> (64, 1, 256)
F -> (64, 1, 1024)
G -> (64, 1, Vocab_size) {apply softmax here to get the output word}
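These shapes can be sanity-checked with a few lines of TensorFlow. The snippet below assumes the RNN layer is a GRU (as in the sketches further down) and pushes dummy tensors through the encoder-side layers:

import tensorflow as tf

batch_size, max_length, vocab_size = 64, 16, 5000     # vocab_size is illustrative

embedding = tf.keras.layers.Embedding(vocab_size, 256)
rnn = tf.keras.layers.GRU(1024, return_sequences=True, return_state=True)

# integer word indices stand in for the one-hot vectors of layer A
source = tf.random.uniform((batch_size, max_length), maxval=vocab_size, dtype=tf.int32)

embedded = embedding(source)          # B -> (64, 16, 256)
outputs, state = rnn(embedded)        # per-step outputs: (64, 16, 1024)
print(embedded.shape, outputs.shape, state.shape)
# (64, 16, 256) (64, 16, 1024) (64, 1024)
# the final state corresponds to the encoding C / Hi (the singleton time axis is dropped)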

Sample input

<start> esta es mi vida . <end>

Sample output

<start> this is my life . <end>

Coding the Encoder Network

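A minimal sketch of such an encoder (assuming a GRU for the RNN layer; the full version is in the linked code base):

import tensorflow as tf

class Encoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim=256, units=1024, batch_size=64):
        super().__init__()
        self.batch_size = batch_size
        self.units = units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer="glorot_uniform")

    def call(self, x, hidden):
        x = self.embedding(x)                             # (batch, max_len, 256)
        output, state = self.gru(x, initial_state=hidden)
        return output, state                              # state: the encoding of the source sentence

    def initialize_hidden_state(self):
        return tf.zeros((self.batch_size, self.units))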

Coding the Decoder Network

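A matching decoder sketch; it processes one target word per call, as described in the forward propagation steps (again a GRU is assumed, and the softmax of step G is folded into the loss by returning raw logits here):

import tensorflow as tf

class Decoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim=256, units=1024):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer="glorot_uniform")
        self.fc = tf.keras.layers.Dense(vocab_size)       # G: scores over the target vocabulary

    def call(self, x, hidden):
        x = self.embedding(x)                             # x: (batch, 1) -> (batch, 1, 256)
        output, state = self.gru(x, initial_state=hidden)
        logits = self.fc(tf.squeeze(output, axis=1))      # (batch, vocab_size)
        return logits, state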

Loss Function

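Because every target sentence is padded to the maximum length, the loss has to ignore the padding positions. A sketch of such a masked loss, assuming sparse categorical cross-entropy over the decoder logits:

import tensorflow as tf

loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction="none")

def loss_function(real, pred):
    # real: (batch,) integer target words, pred: (batch, vocab_size) decoder logits
    mask = tf.math.logical_not(tf.math.equal(real, 0))    # index 0 is the padding token
    loss = loss_object(real, pred)
    loss *= tf.cast(mask, dtype=loss.dtype)               # zero out the padded positions
    return tf.reduce_mean(loss)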

Training
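The training code in the linked code base wires these pieces together. A sketch of a single training step with teacher forcing (the actual target word, not the model's prediction, is fed to the decoder, matching point 7 of the forward propagation); the Adam optimizer and the helper names are assumptions:

import tensorflow as tf

optimizer = tf.keras.optimizers.Adam()

def train_step(source, target, encoder, decoder, targ_tokenizer):
    loss = 0.0
    enc_hidden = encoder.initialize_hidden_state()
    with tf.GradientTape() as tape:
        enc_output, enc_hidden = encoder(source, enc_hidden)
        dec_hidden = enc_hidden                                 # the encoding initialises the decoder
        start_id = targ_tokenizer.word_index["<start>"]
        dec_input = tf.fill((target.shape[0], 1), start_id)
        for t in range(1, target.shape[1]):                     # walk the target one word at a time
            predictions, dec_hidden = decoder(dec_input, dec_hidden)
            loss += loss_function(target[:, t], predictions)
            dec_input = tf.expand_dims(target[:, t], 1)         # teacher forcing
    variables = encoder.trainable_variables + decoder.trainable_variables
    gradients = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(gradients, variables))
    return loss / int(target.shape[1])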

Results


Translate

Input:                     <start> ¿ todavia estan en casa ? <end> 
Predicted translation: are you at home ? <end>
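At prediction time the decoder feeds its own previous output back in (point 7 again) and stops at <end> or at the maximum target length. A sketch of that greedy decoding loop, reusing the preprocessing and model sketches above (the length limits are illustrative):

import tensorflow as tf

def translate(sentence, encoder, decoder, inp_tokenizer, targ_tokenizer,
              max_length_inp=16, max_length_targ=11):
    sentence = preprocess_sentence(sentence)
    inputs = inp_tokenizer.texts_to_sequences([sentence])
    inputs = tf.keras.preprocessing.sequence.pad_sequences(
        inputs, maxlen=max_length_inp, padding="post")

    hidden = tf.zeros((1, 1024))                             # batch of one sentence, 1024 RNN units
    enc_output, dec_hidden = encoder(tf.convert_to_tensor(inputs), hidden)

    dec_input = tf.expand_dims([targ_tokenizer.word_index["<start>"]], 0)
    result = []
    for _ in range(max_length_targ):
        logits, dec_hidden = decoder(dec_input, dec_hidden)
        predicted_id = int(tf.argmax(logits[0]).numpy())
        word = targ_tokenizer.index_word.get(predicted_id, "")
        result.append(word)
        if word == "<end>":                                  # stop at the <end> token
            break
        dec_input = tf.expand_dims([predicted_id], 0)        # feed the prediction back in
    return " ".join(result)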

Problem with the current Encoder Decoder Network

By the time the encoder encodes the last word in the input sentence, a lot of the information related to the earlier words has been lost due to vanishing gradients.

This usually affects longer input sentences.

Solution

We want the encoder to remember the important words of the input sentence, which affect how the decoder decodes the encoding. For this we use the attention mechanism.

A detailed description of attention follows in Part 2.

For the entire code base visit:
