Attention on the Sphere
Transformers have proven to be remarkably capable architectures. For image data, the trick is simple: chop an image into patches, treat each patch as a token, and let attention do the rest. This is the core idea behind Vision Transformers (Dosovitskiy et al., 2020), and the reason it works is that attention can learn both local and global relationships between patches simultaneously.
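To make the patch-to-token step concrete, here is a minimal sketch (not from the post) of how an image can be split into non-overlapping patches and flattened into token vectors, using NumPy; the function name `patchify` and the patch size are illustrative choices, and a real ViT would follow this with a learned linear embedding.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into non-overlapping flattened patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C):
    one token vector per patch, ready for a learned linear projection.
    """
    H, W, C = image.shape
    assert H % patch_size == 0 and W % patch_size == 0, "image must tile evenly"
    p = patch_size
    # Break each spatial axis into (grid, within-patch) factors...
    patches = image.reshape(H // p, p, W // p, p, C)
    # ...group the two grid axes together, then the within-patch axes...
    patches = patches.transpose(0, 2, 1, 3, 4)
    # ...and flatten each patch into a single token vector.
    return patches.reshape(-1, p * p * C)

# A 224x224 RGB image with 16x16 patches yields 196 tokens of dimension 768,
# the standard ViT-Base configuration.
tokens = patchify(np.zeros((224, 224, 3)), 16)
print(tokens.shape)  # (196, 768)
```

Attention then operates on this flat sequence of 196 tokens with no built-in notion of which patches were spatial neighbors, which is exactly why it can pick up local and global relationships alike.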
