This title was written by GPT-3, see how here...

Is attention really all you need?

Stumbling forward in the darkness of research, attention mechanisms are a flash of light so rarely encountered when dreaming up new ideas.

<aside> 💡 *Why are people excited about attention?

Attention mechanisms have received more creative energy than almost any other part of deep learning. Why are researchers so inspired?*

I'm an applied machine learning grad student at the University of Illinois at Urbana-Champaign, and I'm looking to learn and teach the best current research in machine learning. Follow along on my journey into the depths of attention mechanisms. Hold onto your hat and save all the links along the way :)

</aside>

Pictured: visualizing where a self-driving car is looking (in red). Remember: attention allows neural networks (NNs) to ignore parts of their input.

Pictured: green and red bounding boxes visualize neural network attention in classification and visualization. Attention in image generation allows NNs to focus resources on creating high-resolution improvements in a specific area.

Motivation

I had to write this to understand the many creative ways ML designers are ignoring their training data. Attention == ignoring part of your input.

From AI Summer

Before attention, convolutional neural nets (CNNs) were the most impactful idea in machine learning. Astonishingly, I view CNNs as a primitive form of attention 🤯 and that's why I'm so excited.

CNN kernel:

Attention:
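
To make the comparison concrete, here is a minimal toy sketch (my own illustration, with made-up variable names, not code from any paper): a convolution applies the same fixed kernel weights to every local window, while attention computes input-dependent weights over all positions, which is what lets the model emphasize some positions and effectively ignore the rest.

```python
import numpy as np

x = np.random.randn(6, 4)  # toy input: 6 positions, 4 features each

# CNN kernel: a FIXED, local weighting. The same learned weights slide over
# every window, regardless of what the input actually contains.
kernel = np.random.randn(3, 4)                     # window of 3 positions
conv_out = np.array([(x[i:i + 3] * kernel).sum()   # weighted sum over each window
                     for i in range(len(x) - 2)])

# Attention: an INPUT-DEPENDENT, global weighting. Each position scores every
# other position; a softmax turns the scores into weights that can be near zero,
# which is exactly how the model "ignores" parts of its input.
scores = x @ x.T / np.sqrt(x.shape[1])             # pairwise similarity scores
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # row-wise softmax
attn_out = weights @ x                             # each output is a weighted mix of all positions

print(conv_out.shape, attn_out.shape)              # (4,) and (6, 4)
```

The key difference: `kernel` is identical for every input, while `weights` is recomputed from the input itself.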

Why you should read this post (desired outcomes):

<aside> 👉 **TL;DR**

Attention enables models to form richer, denser internal representations. The model learns the salient parts and ignores the rest. Dense internal representations mean more efficient use of smaller models.

Invariant forms of attention allow models to zero-shot (i.e. instantly) adapt to changing inputs. Keep an eye out for better and better forms of this.

Attention makes models more robust to adversarial attacks (and substantially less likely to be fooled by imperceptible noise added to images).

Finally, attention aids explainability, though much work remains.

🧠👉 Read the highest-value next steps below to learn these concepts.

</aside>

Outline

https://media.giphy.com/media/1Le126ucKd4oQoy4Ni/giphy.gif

The short list of attention