Different Types of Attention

$$s_t$$ is the target hidden state at step $$t$$ and $$h_i$$ is the $$i$$-th source hidden state; each has shape (n,1). $$c_t$$ is the final context vector, and $$\alpha_{t,i}$$ is the attention weight, i.e. the alignment score normalized over the $$m$$ source positions.

$$\begin{aligned} c_t&=\sum_{i=1}^m \alpha_{t,i}h_i \\ \alpha_{t,i}&= \frac{\exp(score(s_t,h_i))}{\sum_{j=1}^m \exp(score(s_t,h_j))} \end{aligned}$$
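
As a rough illustration of the two formulas above, here is a minimal pure-Python sketch. It assumes a plain dot product as the score function; the helper names (`softmax`, `dot`, `attend`) and the toy vectors are made up for this example and do not come from any library.

```python
import math

def softmax(scores):
    """Normalize raw scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def attend(s_t, H, score=dot):
    """Return (weights, context) for target state s_t over source states H."""
    alpha = softmax([score(s_t, h_i) for h_i in H])
    n = len(H[0])
    # Context vector: weighted sum of the source states.
    c_t = [sum(a * h_i[k] for a, h_i in zip(alpha, H)) for k in range(n)]
    return alpha, c_t

# Toy example (assumed values): three source states of dimension 2.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
s_t = [1.0, 0.0]
alpha, c_t = attend(s_t, H)
```

The first and third source states get the same raw score here, so they receive equal weight, and the context vector leans toward them.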

Global(Soft) VS Local(Hard)

Global attention takes all source hidden states into account, while local attention uses only a subset of them.
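
To make the difference concrete, the sketch below computes weights both ways. It assumes a dot-product score and a fixed window for the local case; the `center` and `half_width` parameters are hypothetical names for this example, and how the window position is chosen is left out.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def global_weights(s_t, H):
    """Weights over every source state."""
    return softmax([dot(s_t, h) for h in H])

def local_weights(s_t, H, center, half_width):
    """Weights over a window of source states; positions outside get 0."""
    lo = max(0, center - half_width)
    hi = min(len(H), center + half_width + 1)
    window = softmax([dot(s_t, h) for h in H[lo:hi]])
    return [0.0] * lo + window + [0.0] * (len(H) - hi)

# Toy example (assumed values): five source states of dimension 2.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [0.0, 2.0]]
s_t = [1.0, 0.0]
glob = global_weights(s_t, H)
loc = local_weights(s_t, H, center=2, half_width=1)
```

With `center=2` and `half_width=1`, only positions 1 to 3 can receive weight; the first and last source states are ignored entirely.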

Content-based VS Location-based

Content-based attention scores each source hidden state against the target hidden state, so it uses both. Location-based attention computes the weights from the target hidden state alone.

Here are several popular attention mechanisms:

Dot-Product

$score(s_t,h_i)=s_t^Th_i$

Scaled Dot-Product

$score(s_t,h_i)=\frac{s_t^Th_i}{\sqrt{n}}$ where n is the dimension of the vectors. Google’s Transformer model uses a similar scaling factor when calculating self-attention: $$score=\frac{QK^T}{\sqrt{d_k}}$$, where $$d_k$$ is the dimension of the keys.
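
A small sketch of why the scaling matters: with large n, raw dot products grow large and the softmax saturates. The vectors below are assumed toy values chosen to exaggerate the effect, and the helper names are made up for this example.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def scaled_dot(s_t, h_i):
    """Dot product divided by sqrt(n), where n is the vector dimension."""
    return dot(s_t, h_i) / math.sqrt(len(s_t))

n = 64
s_t = [1.0] * n
H = [[1.0] * n, [0.5] * n, [0.0] * n]

raw = softmax([dot(s_t, h) for h in H])            # scores 64, 32, 0
scaled = softmax([scaled_dot(s_t, h) for h in H])  # scores 8, 4, 0
```

The unscaled weights put almost all of the mass on the first state, while the scaled version leaves a noticeably larger share for the second.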

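Location-Based

$\alpha_t=softmax(W_as_t)$

Here the weights $$\alpha_t$$ come directly from $$s_t$$, so no per-position score against $$h_i$$ is computed.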
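
As a rough sketch of the formula above, the code below computes the weights from $$s_t$$ alone. It assumes $$W_a$$ has one row per source position; that shape and the toy values are assumptions for this example.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def location_weights(W_a, s_t):
    """Weights depend only on s_t; no h_i appears anywhere."""
    return softmax(matvec(W_a, s_t))

# Hypothetical W_a: 3 source positions, n = 2.
W_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
s_t = [2.0, 0.0]
alpha = location_weights(W_a, s_t)
```

Because no $$h_i$$ is involved, two inputs with different source sentences but the same $$s_t$$ would get identical weights.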

General

$score(s_t,h_i)=s_t^TW_ah_i$

$$W_a$$'s shape is (n,n).
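
A minimal sketch of the general score, with two hand-picked matrices standing in for a learned $$W_a$$ (the values and helper names are assumptions for this example). With the identity matrix the score reduces to the plain dot product.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def general_score(s_t, W_a, h_i):
    """Compute s_t^T W_a h_i."""
    return dot(s_t, matvec(W_a, h_i))

identity = [[1.0, 0.0], [0.0, 1.0]]
swap = [[0.0, 1.0], [1.0, 0.0]]  # exchanges the two components of h_i
s_t = [1.0, 2.0]
h_i = [3.0, 4.0]

same_as_dot = general_score(s_t, identity, h_i)  # 1*3 + 2*4 = 11.0
swapped = general_score(s_t, swap, h_i)          # 1*4 + 2*3 = 10.0
```

In a trained model $$W_a$$ is learned, which lets the score compare the two states in a transformed space rather than component by component.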

Concat

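$score(s_t,h_i)=v_a^T\tanh(W_a[s_t;h_i])$

$$v_a$$'s shape is (x,1), and $$W_a$$'s shape is (x,2n), because $$[s_t;h_i]$$ stacks two (n,1) vectors into one of shape (2n,1); x is a hidden size you choose. This is similar to a neural network with one hidden layer.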
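
The concat score can be sketched as a tiny one-hidden-layer network. The code below assumes $$W_a$$ maps the concatenated vector of length 2n to x hidden units, i.e. shape (x,2n); the weights are arbitrary toy values rather than learned parameters.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def concat_score(s_t, h_i, W_a, v_a):
    """Compute v_a^T tanh(W_a [s_t; h_i])."""
    stacked = s_t + h_i  # list concatenation gives a vector of length 2n
    hidden = [math.tanh(z) for z in matvec(W_a, stacked)]
    return dot(v_a, hidden)

# Assumed sizes: n = 2, x = 3, so W_a has shape (3, 4).
W_a = [[0.5, 0.0, 0.5, 0.0],
       [0.0, 0.5, 0.0, 0.5],
       [0.1, 0.1, 0.1, 0.1]]
v_a = [1.0, 1.0, 1.0]
score = concat_score([1.0, 0.0], [1.0, 0.0], W_a, v_a)
```

Since tanh is bounded by 1 in absolute value, the score can never exceed the sum of the absolute values of $$v_a$$, unlike the unbounded dot-product variants.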

When I worked on a slot-filling project, I compared these mechanisms, and concat attention produced the best results.