Machine Translation in Python (with code and explanation)
Note that this article has snippets of code. For the full code base click here.
The need for sequence to sequence models
Sequence to Sequence models are encoder-decoder networks whose architecture can be leveraged for complex machine learning tasks like Machine Translation, Text Summarization and chatbots.
Preprocessing
Guidelines
- Remove special characters (@, %, #, …)
- Add a space between punctuation and words (Good morning! → Good morning !)
- Add <start> and <end> tokens to each sentence
- Create word index and reverse word index (using the keras tokenizer)
- Pad each sentence to maximum length
Sample Input
esta es mi vida.
Sample output
<start> esta es mi vida . <end>
Code
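The guidelines above can be sketched as follows. This is a minimal stand-in for the full preprocessing code: the regexes keep only unaccented letters and basic punctuation for simplicity, and `build_word_index` mimics what the Keras Tokenizer produces (function names here are illustrative, not the article's original code).

```python
import re

def preprocess_sentence(sentence):
    """Apply the preprocessing guidelines above to one sentence."""
    sentence = sentence.lower().strip()
    # add a space between words and punctuation ("morning!" -> "morning !")
    sentence = re.sub(r"([?.!,¿])", r" \1 ", sentence)
    sentence = re.sub(r"\s+", " ", sentence)
    # remove special characters, keeping letters and basic punctuation
    # (for simplicity this also drops accented characters)
    sentence = re.sub(r"[^a-z¿?.!,]+", " ", sentence).strip()
    # add <start> and <end> tokens
    return "<start> " + sentence + " <end>"

def build_word_index(sentences):
    """Word index and reverse word index (a stand-in for the Keras
    Tokenizer), reserving id 0 for padding."""
    vocab = sorted({word for s in sentences for word in s.split()})
    word_index = {word: i + 1 for i, word in enumerate(vocab)}
    reverse_index = {i: word for word, i in word_index.items()}
    return word_index, reverse_index

def pad_sequence(seq, max_len):
    """Pad a list of word ids with zeros up to max_len."""
    return seq + [0] * (max_len - len(seq))
```

Running `preprocess_sentence("esta es mi vida.")` reproduces the sample output above.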
Modelling the network
Discussing forward propagation
- Convert all the words in the source sentence to one-hot vectors whose length equals the vocab size.
- Pass the one-hot vectors through a trainable 256-dimension embedding layer (Click here to know more about embedding layers and word vectors).
- The embedded vectors are passed through RNN layers (1024 units).
- An encoding of the source sentence is generated.
- The encoding is passed to the decoder, where it acts as the initial hidden state of the decoder RNN.
- The decoder is trained by passing one word at a time.
- The model is expected to predict the next word, which is then fed to the RNN layer (the actual next word is fed during training, while the predicted word is fed during prediction).
- The model stops once it encounters the <end> token or reaches the maximum target-sentence length determined by the dataset.
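The decoding loop in the last few steps can be sketched in plain Python. Here `predict_next` is a hypothetical stand-in for one decoder RNN step plus softmax, and the `teacher_forcing` flag switches between feeding the actual next word (training) and the model's own prediction (prediction):

```python
def decode(encoder_state, target_words, predict_next, teacher_forcing, max_len=16):
    """Decoding loop with the two stopping conditions described above."""
    state, word, output = encoder_state, "<start>", []
    for t in range(max_len):
        state, predicted = predict_next(state, word)
        output.append(predicted)
        if predicted == "<end>":  # stop at the <end> token...
            break                 # ...or after max_len words (loop bound)
        # training feeds the actual next word; prediction feeds the model's own output
        word = target_words[t + 1] if teacher_forcing else predicted
    return output
```

With a perfectly trained `predict_next`, both modes produce the same output; they differ when the model makes a mistake, which teacher forcing prevents from compounding during training.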
Discussing matrix shapes
Interestingly, matrix shapes are rarely discussed alongside modelling, yet I feel they are a very important part of understanding networks in ML.
The letters in parentheses () indicate the layers. Using these letter denotations, we shall look into the matrix shapes.
We'll also consider the batch size (64) to avoid confusion when going through the code.
Let max length of source sentence = 16 (after padding)
Number of units in RNN = 1024
{we will consider all 16 timesteps together because we have the entire source sentence}
A -> (64, 16, Vocab_size)
B -> (64, 16, 256)
C -> (64, 1, 1024)
Hi -> (64, 1, 1024) {we will consider each word one by one in the decoder, as the previous word is used to form the next word; hence we will not use the max length here}
D -> (64, 1, Vocab_size)
E -> (64, 1, 256)
F -> (64, 1, 1024)
G -> (64, 1, Vocab_size) {apply softmax here to get the output word}
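These shapes can be sanity-checked with a quick NumPy sketch. The vocab size of 1000 and the random weight matrices are arbitrary stand-ins, and the RNNs themselves are omitted; only the shapes flowing between the layers are shown:

```python
import numpy as np

batch, max_len, vocab_size = 64, 16, 1000   # 1000 is an arbitrary stand-in
embed_dim, units = 256, 1024

# encoder side: the whole padded sentence at once
A = np.zeros((batch, max_len, vocab_size))       # one-hot source words
embed = np.random.randn(vocab_size, embed_dim)   # embedding matrix
B = A @ embed                                    # embedded source sentence
C = np.zeros((batch, 1, units))                  # encoder state after the RNN

# decoder side: one word at a time
Hi = C                                           # hidden state handed to the decoder
D = np.zeros((batch, 1, vocab_size))             # one-hot previous target word
E = D @ embed                                    # embedded target word
F = np.zeros((batch, 1, units))                  # decoder RNN output
W_out = np.random.randn(units, vocab_size)       # output projection
logits = F @ W_out
G = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)  # softmax over the vocab
```

Each array's `.shape` matches the corresponding letter in the list above.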
Sample input
<start> esta es mi vida . <end>
Sample output
<start> this is my life . <end>
Coding the Encoder Network
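A minimal sketch of the encoder as a `tf.keras.Model` subclass, in the spirit of the TensorFlow NMT tutorial. The class and method names are my own choices, not necessarily the article's original code:

```python
import tensorflow as tf

class Encoder(tf.keras.Model):
    """GRU encoder: word ids -> 256-d embeddings -> 1024-unit GRU."""
    def __init__(self, vocab_size, embedding_dim=256, units=1024):
        super().__init__()
        self.units = units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(units,
                                       return_sequences=True,
                                       return_state=True)

    def call(self, x, hidden):
        x = self.embedding(x)                 # (batch, max_len, 256)
        output, state = self.gru(x, initial_state=hidden)
        return output, state                  # state becomes the decoder's hidden state

    def initial_state(self, batch_size):
        return tf.zeros((batch_size, self.units))
```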
Coding the Decoder Network
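The decoder mirrors the encoder but processes one word per call and projects the GRU output onto the vocabulary. Again, this is an illustrative sketch rather than the original code:

```python
import tensorflow as tf

class Decoder(tf.keras.Model):
    """GRU decoder: predicts the next target word from the previous word
    and the hidden state handed over by the encoder."""
    def __init__(self, vocab_size, embedding_dim=256, units=1024):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(units,
                                       return_sequences=True,
                                       return_state=True)
        self.fc = tf.keras.layers.Dense(vocab_size)  # logits; softmax lives in the loss

    def call(self, x, hidden):
        x = self.embedding(x)                        # (batch, 1, 256)
        output, state = self.gru(x, initial_state=hidden)
        return self.fc(output), state                # (batch, 1, vocab_size), (batch, units)
```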
Loss Function
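A common formulation is sparse categorical cross-entropy that masks out the zero-padded positions so padding never contributes to the gradient (the exact reduction used here is an assumption):

```python
import tensorflow as tf

loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction="none")

def loss_function(real, pred):
    """Cross-entropy over one decoder step, ignoring padding (word id 0)."""
    mask = tf.cast(tf.not_equal(real, 0), pred.dtype)  # 0 where real is padding
    return tf.reduce_mean(loss_object(real, pred) * mask)
```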
Training
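One teacher-forced training step can be sketched as below. The dimensions are tiny hypothetical stand-ins so the sketch runs quickly, and the layers are kept as plain variables rather than the Encoder/Decoder classes to stay self-contained; it is a minimal sketch, not the article's original training loop:

```python
import tensorflow as tf

# tiny stand-in sizes (hypothetical; the article uses 256-d embeddings, 1024 units)
vocab_size, embed_dim, units, batch, max_len = 50, 8, 16, 4, 6

embedding = tf.keras.layers.Embedding(vocab_size, embed_dim)
encoder_gru = tf.keras.layers.GRU(units, return_state=True)
decoder_gru = tf.keras.layers.GRU(units, return_state=True)
fc = tf.keras.layers.Dense(vocab_size)
optimizer = tf.keras.optimizers.Adam()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

def train_step(src, tgt):
    """One teacher-forced training step over a batch."""
    with tf.GradientTape() as tape:
        _, state = encoder_gru(embedding(src))     # encode the source sentence
        loss = 0.0
        for t in range(tgt.shape[1] - 1):
            # feed the actual word tgt[:, t]; ask the model for tgt[:, t + 1]
            out, state = decoder_gru(embedding(tgt[:, t:t + 1]),
                                     initial_state=state)
            loss += loss_object(tgt[:, t + 1], fc(out))
    variables = (embedding.trainable_variables + encoder_gru.trainable_variables
                 + decoder_gru.trainable_variables + fc.trainable_variables)
    grads = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(grads, variables))
    return float(loss)
```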
Results
Translate
Input: <start> ¿ todavia estan en casa ? <end>
Predicted translation: are you at home ? <end>
Problem with the current Encoder Decoder Network
By the time the encoder reaches the last word of the input sentence, much of the information about the earlier words has been lost: the entire sentence is squeezed into a single fixed-length encoding, and vanishing gradients make long-range dependencies hard to learn.
This especially affects longer input sentences.
Solution
We want the encoder to remember the important words of the input sentence, which affect how the decoder decodes the encoding. For this, we use the attention mechanism.
A detailed description of attention follows in Part 2.