Let’s Read: Transformer Models, Part 1

Large Language Models (LLMs) are a hot topic today, but few people know even the basics of how they work. I work in data science, but I also didn’t really know how they work. In this series, I’d like to go through the foundational paper that defined the Transformer model on which LLMs are based.

“Attention is all you need” by Ashish Vaswani et al. from the Proceedings of the 31st International Conference on Neural Information Processing Systems, December 2017. https://dl.acm.org/doi/10.5555/3295222.3295349 (publicly accessible)

This series aims to be understandable to a non-technical audience, but will discuss at least some of the technical details. If the technical parts are too difficult, please ask for clarification in the comments. You’re also welcome to just read the TL;DR parts, which should contain the essential points.

Defining the problem

The paper’s abstract explains that they’re trying to create a sequence transduction model. Let’s break down each of those words.

“Model” is a basic concept that would be taken for granted by data scientists, but I should probably explain lest I overestimate the average person’s familiarity. A machine learning model is a program that takes inputs and produces outputs. However, unlike a traditional program, whose logic is explicitly written by a programmer, a model has a set of parameters that are derived in a process called training. Most commonly, a model is trained by providing it with a collection of inputs paired with desired outputs. After a model is trained, it can take new inputs whose desired output is unknown, and make predictions about the outputs.
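For readers who like to see things in code, here is a minimal sketch of the train-then-predict idea in Python. Everything in it is made up for illustration: the function names, the four training examples, and the choice of a straight-line model with two parameters. Real models have vastly more parameters, but the workflow is the same.

```python
# A toy model with two parameters (slope and intercept).
# The data and names here are hypothetical, chosen only for illustration.

def train(inputs, outputs):
    """Derive the parameters from example input/output pairs (least squares)."""
    n = len(inputs)
    mean_x = sum(inputs) / n
    mean_y = sum(outputs) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(inputs, outputs))
    variance = sum((x - mean_x) ** 2 for x in inputs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(params, x):
    """Apply the trained parameters to a new input."""
    slope, intercept = params
    return slope * x + intercept

# Training: inputs paired with desired outputs (here they follow y = 2x + 1).
train_inputs = [1.0, 2.0, 3.0, 4.0]
train_outputs = [3.0, 5.0, 7.0, 9.0]
params = train(train_inputs, train_outputs)

# Prediction: a new input whose desired output was never provided.
prediction = predict(params, 10.0)
print(params, prediction)
```

Note that nobody wrote "multiply by 2 and add 1" into the program. The training step worked that out from the examples.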

A “sequence model” is simply a model that deals with data in sequences. For example, suppose we’re trying to predict the likelihood of someone paying off a loan. We might look at their income, current debt, and length of credit history, but this is not a sequence model because those are just three numbers with no natural sequential ordering. However, if we look at a person’s debt during one month, and then the next month, and the month after that, then this data is sequential.

Models that deal with language are sequence models, because language consists of one word after another, in a sequence.

I spent some time looking up “transduction” (I recommend this source), and in this context it basically refers to problems where both input and output are sequences. This of course includes LLMs that generate text in response to a text prompt. However, I gather that the authors are interested in the more specific problem of translating text into another language, since their examples are about translating English to German and French.

TL;DR: The authors are interested in creating a model that takes sequential data input (such as text), and produces sequential data output (such as a translation).

Previous solutions

According to the paper’s introduction, the state-of-the-art approach to sequence transduction is the recurrent neural network (RNN). It also mentions long short-term memory and gated RNNs, but we won’t discuss those except to say that they’re modifications of RNNs. The authors propose the Transformer model, which is also a neural network, but not a recurrent neural network.

A neural network is a type of machine learning model which contains a collection of “neurons”. Each neuron takes a number of inputs and produces an output, a simple little machine learning model all by itself. The complexity emerges from the connections between neurons. Often the neurons are organized in a set of layers, with the output of the first layer being used as the inputs for the second layer, the output of the second layer being used as the inputs for the third layer, and so on. All layers are trained simultaneously.
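As a rough sketch of what "neurons organized in layers" can look like, here is a tiny two-layer network in Python. The specific neuron formula (a weighted sum squashed by a sigmoid function) is one common choice that I'm assuming for illustration, and all the weights are made-up numbers. In a real network they would be set by training.

```python
import math

def neuron(inputs, weights, bias):
    """One neuron: a weighted sum of its inputs, squashed into the range (0, 1)."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-total))

def layer(inputs, weight_rows, biases):
    """A layer is a list of neurons that all read the same inputs."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]

# Hypothetical parameters: layer 1 has two neurons, layer 2 has one.
layer1_weights = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
layer1_biases = [0.0, 0.1]
layer2_weights = [[1.0, -1.0]]
layer2_biases = [0.2]

features = [1.0, 2.0, 3.0]  # three input numbers
hidden = layer(features, layer1_weights, layer1_biases)  # output of first layer
output = layer(hidden, layer2_weights, layer2_biases)    # becomes input to second
print(hidden, output)
```

The only thing connecting the two layers is that the list returned by the first call is passed into the second call.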

To use a specific example, suppose we’re making a model to classify images, perhaps to distinguish chihuahuas from muffins. The final layer of neurons is making the dog/muffin prediction, but what of the lower layers of neurons? Conceptually, the lowest layers are identifying simple concepts such as edges, and higher layers are combining simple concepts into more complex concepts, like noses and blueberries. However, we never explicitly tell the neurons what to do; they determine for themselves what to do during training. In practice, neuron tasks are not well organized, and it may be difficult to explain what any particular neuron is doing.

Neural networks can be quite powerful, but from a developer standpoint, they present a problem: how should the neurons be connected? For example, how many layers should there be, and how many neurons in each layer? The way that neurons are connected is referred to as the architecture.

RNNs are an architecture used for sequence models. At each step of the sequence, the network stores a hidden state, and that hidden state is included among the inputs for the next step of the sequence. When analyzing a sentence, we imagine that the hidden state somehow encodes “everything in the sentence up to this point”. This hidden state is used to predict the next word, as well as a new hidden state, which encodes “everything in the sentence up to this point, including one more word”.
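The hidden state idea can be sketched in a few lines of Python. This is a simplified illustration under my own assumptions: the update formula (weighted sums passed through tanh), the sizes, the weights, and the toy "words" are all hypothetical, and I've left out the part that predicts the next word so that the hidden state is easier to follow.

```python
import math

def rnn_step(hidden, x, w_hidden, w_input, bias):
    """Compute the new hidden state from the previous hidden state and the current input."""
    new_hidden = []
    for i in range(len(hidden)):
        total = bias[i]
        total += sum(w_hidden[i][j] * hidden[j] for j in range(len(hidden)))
        total += sum(w_input[i][k] * x[k] for k in range(len(x)))
        new_hidden.append(math.tanh(total))
    return new_hidden

# Hypothetical parameters: hidden state of size 2, inputs of size 2.
W_HIDDEN = [[0.5, -0.3], [0.2, 0.4]]
W_INPUT = [[1.0, 0.0], [0.0, 1.0]]
BIAS = [0.0, 0.0]

def run_rnn(sequence):
    """Feed the sequence in one step at a time, carrying the hidden state along."""
    hidden = [0.0, 0.0]  # nothing has been read yet
    states = []
    for x in sequence:   # each step needs the hidden state from the step before
        hidden = rnn_step(hidden, x, W_HIDDEN, W_INPUT, BIAS)
        states.append(hidden)
    return states

# Three toy "words", each represented as a pair of numbers.
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
states = run_rnn(sentence)

# The same words in a different order give a different final hidden state.
reordered_states = run_rnn([sentence[1], sentence[0], sentence[2]])
print(states[-1], reordered_states[-1])
```

Notice that the loop cannot skip ahead: step three cannot start until step two has produced its hidden state.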

The problem with recurrent neural networks, according to the paper’s introduction, is that training them is hard to speed up. Rather than running many computations in parallel with one another, RNNs inherently require them to be run sequentially, one after another, because each step needs the hidden state from the step before it.

An alternate solution is a so-called “attention mechanism” (to be defined later). The paper claims that previous work primarily used attention mechanisms in conjunction with RNNs. The main point of the paper is that we can use an attention mechanism by itself, without any recurrence. Thus the title of the paper, “Attention is all you need”.

TL;DR: Previously, sequence transduction problems were addressed with recurrent neural networks. The problem is that recurrence requires running a lot of computational processes in sequence rather than in parallel. The authors propose the Transformer model, a neural network that removes recurrence and replaces it with an “attention mechanism”.

So far, we’ve only gotten through the introduction! That’s just four paragraphs, but of course we also had to cover a lot of background knowledge. I’m pausing here to give you a chance to ask questions, give feedback, and look at the paper if you so choose.

See part 2


  1. JM says

    Conceptually, the lowest layers are identifying simple concepts such as edges, and higher layers are combining simple concepts into more complex concepts, like noses and blueberries. However, we never explicitly tell the neurons what to do; they determine for themselves what to do during training. In practice, neuron tasks are not well organized, and it may be difficult to explain what any particular neuron is doing.

    I think your conceptually is a bit off here. Conceptually it may be organized the way you describe if the neural network is analyzing the picture like a human would but there is no guarantee that is the case. The neural network might develop a model where some property of the distributions of dark space is key to identification or settle on using only skin texture properties.
    The note about it being hard to explain how a particular network works doesn’t really matter for describing the process but has a lot of real world impact. A lot of legal cases involving neural networks have featured that issue. Companies that provide neural networks often use the difficulty of understanding the neural network as cover for problems.

  2. says

    Yes, I’m trying to gesture in the same direction that you’re talking about. The blueberry vs noses example is just a cartoon. In practice, the meaning of any given neuron is likely “nothing in particular”, at least nothing that would correspond to a human-identifiable concept. That’s because the neurons aren’t trained to identify human concepts, they’re trained to identify whatever would be most useful to the other neurons. (That said, convolutional neural networks are known to identify edges in their lowest layers.)

    I am not an expert in neural networks, but I do deal with model explainability in a professional capacity! Neural networks are generally regarded as one of the least explainable kinds of models, but I would tell you that even more basic models like decision trees don’t really have satisfactory explanations when they have a similarly large number of inputs. The core issue is that complex functions with lots of inputs just aren’t very explainable, because “explanations” are a philosophical concept based on everyday human experience, which simply doesn’t involve reading large matrices of numbers.

  3. says

    In case anyone gets confused, when I talk about “model explainability”, that doesn’t mean “explanation of how this type of model works”. We’re not confused about how neural networks work! “Explainability” refers to explanations of how any particular model works. For example, if I train a model to translate English to German, how does the model do that? This is called a “global” explanation. In contrast, a “local” explanation is something that explains how the model translated any particular sentence into German.
