Language Models Part 3

Transformer Model and its derivative Large Language Models

stjayaprakash
Aug 18, 2023

Transformer Model

RNN models with attention mechanisms saw a significant improvement in performance. However, recurrent models are, by their nature, difficult to scale. The self-attention mechanism soon proved so powerful that it did not even require recurrent sequential processing.

The introduction of transformers by the Google Brain team in 2017 is perhaps one of the most important inflection points in the history of LLMs. A transformer is a deep learning model that adopts the self-attention mechanism and processes the entire input all at once.

Attention is a technique to enhance some parts of the input data while diminishing other parts. The motivation behind this is that the network should devote more focus to the important parts of the data.

The Transformer model (specifically, multi-headed self-attention) has become the foundation for many state-of-the-art LLMs such as BERT, ChatGPT, and Google Bard.

• The Transformer uses only the attention mechanism to remember relationships between tokens.

• It has no recurrence, so it can be trained faster using a parallel approach.

• Attention refers to the features the model focuses on most in order to find correlations or to generate human-like output (a minimal sketch of this computation follows this list).
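
As a rough illustration (not from the post itself), here is a minimal NumPy sketch of scaled dot-product self-attention, the computation inside each attention head; the function and variable names are illustrative:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Return a weighted mix of values, where weights come from query-key similarity."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax over the keys
    return weights @ V                                         # emphasize some inputs, diminish others

# Toy example: 4 tokens with 8-dimensional representations
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)                    # self-attention: Q, K, V from the same input
print(out.shape)                                               # (4, 8)
```

Multi-headed attention simply runs several such computations in parallel on different learned projections of the input and concatenates the results.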

The architecture of the Transformer consists of an encoder-decoder mechanism. The original paper describes six encoders and six decoders. Each encoder consists of a multi-headed self-attention layer and a feed-forward layer. Each decoder consists of two attention layers: a masked multi-headed self-attention layer and a multi-headed attention layer similar to the one used in the encoder, followed by a feed-forward layer.
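For a concrete feel of that configuration, here is a hedged sketch using PyTorch's built-in nn.Transformer with the paper's layer counts; the tensor shapes and sizes are illustrative, not taken from the post:

```python
import torch
import torch.nn as nn

# Configuration from the original paper: 6 encoder layers, 6 decoder layers,
# model dimension 512, 8 attention heads, feed-forward dimension 2048.
model = nn.Transformer(
    d_model=512,
    nhead=8,
    num_encoder_layers=6,
    num_decoder_layers=6,
    dim_feedforward=2048,
    batch_first=True,
)

src = torch.rand(1, 10, 512)   # a batch with 10 source-token embeddings
tgt = torch.rand(1, 7, 512)    # a batch with 7 target-token embeddings
out = model(src, tgt)
print(out.shape)               # torch.Size([1, 7, 512])
```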

Parallelization in the Transformer is achieved by feeding all the words of a sentence to the network at the same time; they then pass through the encoder and decoder together.

To handle the order or sequence of words without using recurrence, the input to the transformer is embedded and positional encoding is applied on top of the word embeddings. Positional encoding provides information about the location of each word in the sentence.
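A minimal sketch, assuming NumPy, of the sinusoidal positional encoding from the original paper; the sequence length and model dimension below are illustrative:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding: each position gets a unique pattern of sines and cosines."""
    positions = np.arange(seq_len)[:, None]                        # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                             # (1, d_model)
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                          # even dimensions: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])                          # odd dimensions: cosine
    return pe

# The encoding is added to the word embeddings so the model knows each word's position
pe = positional_encoding(seq_len=50, d_model=512)
print(pe.shape)   # (50, 512)
```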

The encoders and decoders are connected to each other: the input passes through all the encoders, and the final encoder output is fed to each of the decoders. The decoder output is then passed to a linear layer and a softmax layer, which produce the final output.
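As an illustrative sketch (the vocabulary size and dimensions are assumed, not from the post), the final projection and softmax might look like this in PyTorch:

```python
import torch
import torch.nn as nn

d_model, vocab_size = 512, 30000              # illustrative sizes
decoder_output = torch.rand(1, 7, d_model)    # (batch, target_len, d_model)

project = nn.Linear(d_model, vocab_size)      # linear layer maps each position to vocabulary logits
logits = project(decoder_output)
probs = torch.softmax(logits, dim=-1)         # softmax turns logits into a probability distribution
next_tokens = probs.argmax(dim=-1)            # most likely token at each target position
print(probs.shape, next_tokens.shape)         # torch.Size([1, 7, 30000]) torch.Size([1, 7])
```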


In addition to the basic structure, the Transformer architecture includes normalization layers, specifically Layer Normalization (an improvement over batch normalization), and skip connections. These normalization layers and skip connections help the model retain important information and optimize performance.
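A minimal sketch, assuming PyTorch, of the post-norm residual pattern (add the skip connection, then apply Layer Normalization) wrapped around a sub-layer; the class name and sizes are illustrative:

```python
import torch
import torch.nn as nn

class ResidualNormBlock(nn.Module):
    """Residual wrapper around a Transformer sub-layer: output = LayerNorm(x + sublayer(x))."""
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        return self.norm(x + self.sublayer(x))   # skip connection, then layer normalization

# Wrapping the position-wise feed-forward sub-layer (sizes are illustrative)
ffn = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))
block = ResidualNormBlock(512, ffn)
x = torch.rand(1, 10, 512)
print(block(x).shape)    # torch.Size([1, 10, 512])
```

The skip connection lets the original representation flow past each sub-layer unchanged, which helps the model retain information and makes optimization easier.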
