What are transformers?

March 13, 2023

Neural networks are by far the most powerful form of artificial intelligence we have today. They represent complex problems by approximating them as massive mathematical functions. For example, a neural network that determines whether a photo contains a cat might process the image’s pixels through a series of linear transformations and nonlinear operations, known as “layers” to deep learning researchers. The final result would be a single number between zero and one representing the probability that the photo contains a cat.
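To make that concrete, here is a minimal sketch of such a network in Python with NumPy. It is purely illustrative: the “image” is an imaginary 8x8 grid of pixels, and the weights are random rather than learned from data, so the output is meaningless except as a demonstration of the layered structure.

```python
import numpy as np

def relu(x):
    # Nonlinear "activation": keeps positive values, zeroes out negatives.
    return np.maximum(0, x)

def sigmoid(x):
    # Squashes any number into the range (0, 1).
    return 1 / (1 + np.exp(-x))

# Hypothetical 8x8 grayscale image, flattened into 64 pixel values.
rng = np.random.default_rng(0)
pixels = rng.random(64)

# Two "layers": each is a linear transformation (matrix multiply plus bias)
# followed by a nonlinearity. A real network would learn these weights
# from thousands of labeled photos; here they are random.
W1, b1 = rng.normal(size=(32, 64)), np.zeros(32)
W2, b2 = rng.normal(size=(1, 32)), np.zeros(1)

hidden = relu(W1 @ pixels + b1)              # first layer
cat_probability = sigmoid(W2 @ hidden + b2)  # a single number between 0 and 1

print(f"P(cat) = {cat_probability[0]:.3f}")
```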

Dr. Pulkit Tandon, a postdoctoral researcher in electrical engineering at Stanford, frequently works with natural language processing models as part of his investigations into AI-based data representations. His research covers a wide array of topics within information theory and artificial intelligence.

“Fully connected neural networks are a step beyond linear models: they add some nonlinearities to it, which really adds to their expressive powers,” Dr. Tandon said. “That means the kinds of functions it can predict based on the data it has been given.”

Although a large enough neural network with nonlinearities can approximate any mathematical function, different mathematical techniques lend themselves better to different types of problems. For example, a mathematical operator known as a convolution effectively acts as a sliding window across a sequence of data, considering each point in the context of its neighbors. This property means that so-called convolutional neural networks work much better for image and video processing than standard fully connected networks, which include only linear operators and basic nonlinear activations.
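A one-dimensional convolution shows the sliding-window idea in just a few lines; image networks apply the same idea in two dimensions. This sketch is illustrative, with a made-up signal and a simple averaging kernel:

```python
import numpy as np

def conv1d(signal, kernel):
    # Slide the kernel across the signal; at each position, combine a
    # point with its neighbors via a weighted sum.
    k = len(kernel)
    return np.array([
        np.dot(signal[i:i + k], kernel)
        for i in range(len(signal) - k + 1)
    ])

signal = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 4.0, 3.0])
kernel = np.array([1/3, 1/3, 1/3])  # a 3-point moving average

print(conv1d(signal, kernel))  # each output summarizes a 3-point window
```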

“It can be shown that fully connected neural networks are universal function approximators,” Dr. Tandon said. “But they are really a waste of resources, so there are many different architectures with inductive biases, many of which are motivated by how humans learn.”

Transformer models are neural networks with specialized operations that make it possible for them to intelligently process sequential data such as text and audio. Unlike older recurrent neural networks, transformers attempt to mimic the human method of paying attention to important details while ignoring irrelevant information.

Specifically, transformers use a mechanism known as self-attention to analyze each new datapoint in a sequence — for example, each word in a sentence — in the context of the rest of the sequence. This allows them to capture the structure of their inputs better than prior architectures could, which helps them accurately extract meaning from language. It also gives them the uncanny ability to recall specific details from much earlier in a long input.
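In its simplest form, self-attention scores how relevant every token in a sequence is to every other token, then rebuilds each token as a weighted mix of the rest. The following is a bare-bones sketch of that computation, with random stand-in vectors and weight matrices; a real transformer learns these weights and adds several other components, such as multiple attention heads and feed-forward layers:

```python
import numpy as np

def softmax(x):
    # Normalizes a row of scores into weights that sum to 1.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Each row of X is one token (e.g. one word) as a vector.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv         # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # relevance of each token to each other token
    weights = softmax(scores)                # attention weights: each row sums to 1
    return weights @ V                       # each token becomes a weighted mix of the others

# Four hypothetical tokens, each represented by an 8-dimensional vector.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))

out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8): same sequence length, each token now context-aware
```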

“For language prediction in the past, it was really hard to keep track of infinite history,” Dr. Tandon said. “What attention says is that at each point of time, humans do not look at the whole input; they probably focus on something very specific. If you have a huge amount of text that you need to read, maybe you basically contextualize different areas of this text in different ways in your head. Then you pay specific attention to very particular words.”
