Short notes on Mogrifier LSTM:

Overall objective of the paper: how do we inject context into the input embeddings of an LSTM? Note that the hidden state of the LSTM does contain context-specific info, but how do we have the hidden state play a bigger role in shaping the embeddings the LSTM consumes?

The paper takes a fresh look at the traditional LSTM and the “gates” inside it. First, the vanilla LSTM:

  f = σ(W_f x + U_f h_prev + b_f)      (forget gate)
  i = σ(W_i x + U_i h_prev + b_i)      (input gate)
  j = tanh(W_j x + U_j h_prev + b_j)   (candidate cell update)
  o = σ(W_o x + U_o h_prev + b_o)      (output gate)
  c = f ⊙ c_prev + i ⊙ j
  h = o ⊙ tanh(c)
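
For concreteness, a minimal NumPy sketch of one step of this cell. The per-gate weight containers (W, U, b as dicts) are my labels chosen to match the equations above, not the paper’s exact parameterization:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_cell(x, h_prev, c_prev, W, U, b):
        # gate names f, i, j, o follow the equations in the note above
        f = sigmoid(W["f"] @ x + U["f"] @ h_prev + b["f"])  # forget gate
        i = sigmoid(W["i"] @ x + U["i"] @ h_prev + b["i"])  # input gate
        j = np.tanh(W["j"] @ x + U["j"] @ h_prev + b["j"])  # candidate cell update
        o = sigmoid(W["o"] @ x + U["o"] @ h_prev + b["o"])  # output gate
        c = f * c_prev + i * j   # mix old cell state with gated candidate
        h = o * np.tanh(c)       # expose gated cell state as hidden state
        return h, c

    # tiny smoke test with random weights
    rng = np.random.default_rng(0)
    n_in, n_hid = 4, 3
    W = {g: rng.normal(size=(n_hid, n_in)) for g in "fijo"}
    U = {g: rng.normal(size=(n_hid, n_hid)) for g in "fijo"}
    b = {g: np.zeros(n_hid) for g in "fijo"}
    h, c = lstm_cell(rng.normal(size=n_in), np.zeros(n_hid), np.zeros(n_hid), W, U, b)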

Some insights:

  • the i gate essentially “scales” the rows of the weight matrix W_j. What? Look at the equation: i multiplies tanh(W_j x + U_j h_prev + b_j) elementwise, so component i_k scales the k-th output, which (up to the tanh nonlinearity) is the same as scaling the k-th row of W_j.

  • motivated by the above, equip the LSTM with gates that also scale the columns of the weight matrices. Gating the input x elementwise before it hits a weight matrix W does exactly that, since W(g ⊙ x) = (W diag(g)) x. The Mogrifier therefore has x and h_prev gate each other for a few alternating rounds before the LSTM proper runs (see the sketch after this list).
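
A minimal NumPy sketch of this mutual gating (the “mogrification”), assuming the paper’s 2σ(·) gates, alternating rounds, and 5 rounds as a typical setting; Q and R follow the paper’s notation, while the mogrify helper and the demo shapes are my own:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def mogrify(x, h_prev, Q, R, rounds=5):
        # Odd rounds gate x by h_prev; even rounds gate h_prev by x.
        # Gating an *input* elementwise is column scaling, since
        # W @ (g * x) == (W * g) @ x, in contrast to the row scaling
        # of the usual output-side gates: i * (W @ x) == (W * i[:, None]) @ x.
        for k in range(1, rounds + 1):
            if k % 2 == 1:
                x = 2 * sigmoid(Q[k // 2] @ h_prev) * x
            else:
                h_prev = 2 * sigmoid(R[k // 2 - 1] @ x) * h_prev
        return x, h_prev

    # the gated x and h_prev then feed the vanilla LSTM cell unchanged
    rng = np.random.default_rng(0)
    n_in, n_hid = 4, 3
    Q = [rng.normal(size=(n_in, n_hid)) for _ in range(3)]  # rounds 1, 3, 5
    R = [rng.normal(size=(n_hid, n_in)) for _ in range(2)]  # rounds 2, 4
    x, h = mogrify(rng.normal(size=n_in), rng.normal(size=n_hid), Q, R)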

References:

  1. Melis, Kočiský & Blunsom, “Mogrifier LSTM”, ICLR 2020 (arXiv:1909.01792)
  2. Mogrifier LSTM on OpenReview