Different types of Attention
\(s_t\) is the target hidden state at decoding step \(t\), and \(h_i\) is the \(i\)-th of the \(T\) source hidden states; each is a column vector of shape (n, 1). \(c_t\) is the final context vector, and \(\alpha_{t,i}\) is the alignment score between target step \(t\) and source position \(i\).
\[\begin{aligned} c_t&=\sum_{i=1}^{T} \alpha_{t,i}h_i \\ \alpha_{t,i}&= \frac{\exp(score(s_t,h_i))}{\sum_{j=1}^{T} \exp(score(s_t,h_j))} \end{aligned}\]
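As a minimal sketch of these two equations (using NumPy; the helper name `attention` and the generic `score_fn` argument are illustrative, not from any particular library), the alignment weights are a softmax over the scores and the context vector is the weighted sum of the source states:

```python
import numpy as np

def attention(score_fn, s_t, H):
    """Context vector c_t for a single target state s_t.

    s_t: target hidden state, shape (n,)
    H:   source hidden states, shape (T, n), one row per source position
    score_fn(s_t, h_i) -> scalar alignment score
    """
    scores = np.array([score_fn(s_t, h_i) for h_i in H])  # (T,)
    alpha = np.exp(scores - scores.max())                 # numerically stable softmax
    alpha /= alpha.sum()                                   # alignment weights alpha_{t,i}
    c_t = alpha @ H                                        # weighted sum of source states
    return c_t, alpha
```

Most of the score functions below can be plugged in directly as `score_fn`.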
Global (Soft) vs. Local (Hard) #
Global attention takes all source hidden states into account, while local attention only uses a subset of the source hidden states.
Content-based vs. Location-based #
Content-based attention uses both the source hidden states and the target hidden state, while location-based attention uses only the target hidden state.
Here are several popular score functions:
Dot-Product #
\[score(s_t,h_i)=s_t^Th_i\]
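A one-line NumPy sketch of this score (the function name is illustrative), which can be passed to the `attention` helper above:

```python
import numpy as np

def dot_score(s_t, h_i):
    # score(s_t, h_i) = s_t^T h_i
    return s_t @ h_i
```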
Scaled Dot-Product #
\[score(s_t,h_i)=\frac{s_t^Th_i}{\sqrt{n}}\] where \(n\) is the dimension of the vectors. Google’s Transformer uses a similar scaling factor when computing self-attention: \(score=\frac{QK^T}{\sqrt{d_k}}\), where \(d_k\) is the key dimension.
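A sketch of the scaled version, assuming 1-D NumPy vectors:

```python
import numpy as np

def scaled_dot_score(s_t, h_i):
    n = s_t.shape[0]                 # vector dimension
    return (s_t @ h_i) / np.sqrt(n)  # scaling keeps large dot products from saturating the softmax
```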
Location-based #
\[score(s_t,h_i)=softmax(W_as_t)\]
Note that the score depends only on the target hidden state \(s_t\): \(W_a\) maps \(s_t\) to one weight per source position, so its shape is (T, n) for a source sequence of length T.
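A sketch of this variant, assuming \(W_a\) has shape (T, n). Unlike the other score functions, it returns the whole alignment vector at once (the softmax is built in), so it replaces rather than feeds the normalization step in the `attention` helper above:

```python
import numpy as np

def location_alignment(W_a, s_t):
    # scores depend only on the target state s_t; one raw score per source position
    scores = W_a @ s_t                 # W_a: (T, n), s_t: (n,) -> (T,)
    e = np.exp(scores - scores.max())  # numerically stable softmax
    return e / e.sum()                 # alignment weights alpha_{t,:}
```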
General #
\[score(s_t,h_i)=s_t^TW_ah_i\]
\(W_a\)’s shape is (n, n).
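A sketch with an explicit \(W_a\) parameter (names are illustrative):

```python
import numpy as np

def general_score(W_a, s_t, h_i):
    # score(s_t, h_i) = s_t^T W_a h_i, with W_a of shape (n, n)
    return s_t @ W_a @ h_i
```

To use it with the `attention` helper above, fix \(W_a\) first, e.g. `attention(lambda s, h: general_score(W_a, s, h), s_t, H)`.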
Concat #
\[score(s_t,h_i)=v_a^T\tanh(W_a[s_t;h_i])\]
\(v_a\)’s shape is (x, 1) and \(W_a\)’s shape is (x, 2n), since the concatenation \([s_t;h_i]\) has shape (2n, 1); here x is the size of the hidden layer. This is essentially a neural network with one hidden layer that scores each source-target pair.
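A sketch of the concat score under the shapes given above (W_a as an (x, 2n) matrix, v_a stored as a 1-D array of length x):

```python
import numpy as np

def concat_score(v_a, W_a, s_t, h_i):
    # score(s_t, h_i) = v_a^T tanh(W_a [s_t; h_i])
    z = np.concatenate([s_t, h_i])  # (2n,)
    return v_a @ np.tanh(W_a @ z)   # hidden layer of size x, then a linear read-out
```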
When I was doing a slot-filling project, I compared these mechanisms, and concat attention produced the best results.