Transformer Model
RNN models augmented with attention mechanisms saw a significant improvement in their performance. However, recurrent models are, by their nature, difficult to scale. The self-attention mechanism soon proved to be so powerful that it did not even require recurrent sequential processing!
The introduction of transformers by the Google Brain team in 2017 is perhaps one of the most important inflection points in the history of LLMs. A transformer is a deep learning model that adopts the self-attention mechanism and processes the entire input all at once.
Attention is a technique to enhance some parts of the input data while diminishing other parts. The motivation behind this is that the network should devote more focus to the important parts of the data.
The Transformer model (specifically multi-headed self-attention) has now become the foundation for many state-of-the-art LLMs such as BERT, ChatGPT, and Google Bard.
• The Transformer relies only on the attention mechanism to remember and relate information.
• It has no recurrence, so it can be trained faster using parallel computation.
• Attention refers to the features the model focuses on most in order to find correlations or to generate human-like output (see the sketch just below this list).
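To make this concrete, here is a minimal NumPy sketch of the scaled dot-product self-attention at the heart of the Transformer. The shapes (4 tokens, 8 dimensions) are toy values for illustration, not the paper's settings.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # similarity of each query to each key
    weights = softmax(scores, axis=-1)  # each row: how much one token attends to the others
    return weights @ V, weights

# Self-attention: Q, K, and V all come from the same input sequence.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))             # 4 tokens, 8-dimensional representations
out, w = scaled_dot_product_attention(x, x, x)
print(w.round(2))                       # each row sums to 1: the model's "focus" over positions
```

Multi-headed attention simply runs several of these in parallel on different learned projections of the input and concatenates the results.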
The architecture of the Transformer consists of an encoder-decoder mechanism. The original paper describes six encoders and six decoders. Each encoder consists of a multi-headed self-attention layer and a feed-forward layer. Each decoder consists of two attention layers: one is a masked multi-headed self-attention layer, and the other is a multi-headed cross-attention layer that attends to the encoder's output. Additionally, the decoder includes a feed-forward layer.
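As a quick way to see this stack in code, PyTorch ships this exact architecture as a built-in module; the hyperparameters below (6 + 6 layers, model width 512, 8 heads, feed-forward width 2048) are the ones from the original paper, while the input tensors are random stand-ins.

```python
import torch
import torch.nn as nn

# The original configuration: 6 encoder layers, 6 decoder layers,
# model width 512, 8 attention heads, feed-forward width 2048.
model = nn.Transformer(
    d_model=512,
    nhead=8,
    num_encoder_layers=6,
    num_decoder_layers=6,
    dim_feedforward=2048,
    batch_first=True,
)

src = torch.rand(1, 10, 512)  # a batch of one "sentence" of 10 token embeddings
tgt = torch.rand(1, 7, 512)   # the decoder-side sequence
out = model(src, tgt)         # the encoder output feeds every decoder layer's cross-attention
print(out.shape)              # torch.Size([1, 7, 512])
```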
Parallelization in the transformer is achieved by feeding all the words of a sentence to the network at the same time; the whole sequence then passes through the encoder and decoder together.
To handle the order or sequence of words without using recurrence, the input to the transformer is embedded and positional encoding is applied on top of the word embeddings. Positional encoding provides information about the location of each word in the sentence.
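Here is a minimal NumPy sketch of the sinusoidal positional encoding defined in the original paper; the sequence length and embedding size below are illustrative.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(...)."""
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]     # (1, d_model/2)
    angles = pos / np.power(10000.0, (2 * i) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)             # odd dimensions get cosine
    return pe

embeddings = np.random.randn(10, 512)            # 10 word embeddings of width 512
x = embeddings + positional_encoding(10, 512)    # position information added on top
```

Because every position gets a unique pattern of values, the model can tell the same word in position 1 apart from that word in position 9 even though all words are processed at once.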
The encoders and decoders are connected to each other. The input passes through all the encoders, and the final encoder's output is then fed to every decoder. The output of the decoder stack is passed to a linear layer and a softmax layer, which produce the final output probabilities.
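A rough sketch of that final projection step, assuming an illustrative vocabulary size of 32,000; the decoder output here is random stand-in data.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

d_model, vocab_size = 512, 32000                 # vocab_size is an assumed, illustrative value
W = np.random.randn(d_model, vocab_size) * 0.02  # weights of the final linear layer
decoder_out = np.random.randn(7, d_model)        # stand-in for the decoder stack's output

logits = decoder_out @ W               # linear layer: one score per vocabulary word
probs = softmax(logits, axis=-1)       # softmax: a probability distribution per position
print(probs.shape, probs[0].sum())     # (7, 32000)  ~1.0
```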
In addition to the basic structure, the Transformer architecture includes normalization layers, specifically Layer Normalization (which, unlike batch normalization, normalizes across each token's features and so suits variable-length sequences), and skip connections. The skip connections let each layer's input bypass the layer and be added back to its output, which helps the model retain important information and keeps gradients flowing during training.
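A minimal sketch of how these two pieces wrap every sub-layer in the post-norm pattern LayerNorm(x + Sublayer(x)); the tanh stand-in for the sub-layer is purely illustrative.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token's feature vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def sublayer(x, fn):
    """Post-norm residual wrapper: LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + fn(x))   # the skip connection is the `x +` term

x = np.random.randn(10, 512)                 # (seq_len, d_model) activations
y = sublayer(x, lambda h: np.tanh(h))        # any (seq_len, d_model) -> same-shape sub-layer
```

In the real model, `fn` would be the multi-headed attention layer or the feed-forward layer.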