Transformers Models
Foundational blocks for various natural language processing (NLP) and generative AI tasks
Transformers are a type of deep learning model used for a wide range of natural language processing (NLP) and generative AI tasks. Introduced in the seminal paper "Attention Is All You Need" by Vaswani et al. (2017), they have since become the foundational building block of modern NLP and generative AI systems.
Transformers use the self-attention mechanism to learn contextual relationships between words in a sentence or text sequence, allowing the model to weigh the importance of each word based on its context. This ability to capture the interplay of words in a sequence is what lets the model generate human-like text and perform remarkably well on a variety of generative AI tasks.
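In the paper's formulation, attention is computed as Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, where the queries Q, keys K, and values V are projections of the input; in self-attention all three come from the same sequence. Below is a minimal NumPy sketch of this computation (the toy shapes and random data are illustrative only, not from the paper):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute attention weights and the weighted sum of values.

    Q, K, V: arrays of shape (seq_len, d_k); in self-attention they
    are all derived from the same input sequence.
    """
    d_k = Q.shape[-1]
    # Similarity of every position with every other position,
    # scaled by sqrt(d_k) to keep the dot products in a stable range.
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over the last axis turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output position is a weighted average of the value vectors.
    return weights @ V

# Toy example: a "sentence" of 3 tokens with 4-dimensional embeddings.
x = np.random.randn(3, 4)
out = scaled_dot_product_attention(x, x, x)  # self-attention: Q = K = V = x
print(out.shape)  # (3, 4)
```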
The Transformer architecture uses an encoder-decoder structure without relying on recurrence or convolutions for output generation. The encoder maps the input sequence to a sequence of continuous representations, which are then fed to the decoder. The decoder uses the encoder's output together with its own previously generated outputs to produce the output sequence. This design enables efficient sequence-to-sequence modelling without the limitations of traditional recurrent or convolutional structures.
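For orientation, PyTorch ships this encoder-decoder design as a built-in module; the sketch below instantiates nn.Transformer with the base hyperparameters from the paper, with random tensors standing in for real token embeddings:

```python
import torch
import torch.nn as nn

# nn.Transformer implements the original encoder-decoder architecture.
# The hyperparameters below match the paper's base model.
model = nn.Transformer(
    d_model=512,           # embedding / sub-layer output dimension
    nhead=8,               # attention heads
    num_encoder_layers=6,  # N = 6 encoder layers
    num_decoder_layers=6,  # N = 6 decoder layers
)

src = torch.randn(10, 32, 512)  # (source length, batch, d_model)
tgt = torch.randn(20, 32, 512)  # (target length, batch, d_model)
out = model(src, tgt)           # decoder output: (20, 32, 512)
print(out.shape)
```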
Encoder: The encoder has 6 identical layers (N = 6). Each layer comprises two sub-layers: the first is a multi-head self-attention mechanism, and the second is a position-wise fully connected feed-forward network. A residual connection surrounds each sub-layer, followed by layer normalization, so the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. All sub-layers and embedding layers produce outputs of dimension d_model = 512, enabling efficient information flow through the residual connections.
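A minimal PyTorch sketch of one such encoder layer (dropout is omitted for brevity; the feed-forward inner dimension of 2048 is the paper's base value):

```python
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer: self-attention and feed-forward sub-layers,
    each wrapped as LayerNorm(x + Sublayer(x))."""

    def __init__(self, d_model=512, nhead=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Sub-layer 1: multi-head self-attention, residual + LayerNorm.
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + attn_out)
        # Sub-layer 2: position-wise feed-forward, residual + LayerNorm.
        x = self.norm2(x + self.ff(x))
        return x
```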
Decoder: The decoder also consists of 6 identical layers (N = 6). Each layer has three sub-layers: in addition to the multi-head self-attention mechanism and the position-wise fully connected feed-forward network present in the encoder layers, the decoder adds a third sub-layer that performs multi-head attention over the output of the encoder stack. As in the encoder, residual connections and layer normalization surround each sub-layer. Notably, the self-attention sub-layer in the decoder is masked to prevent positions from attending to subsequent positions. Combined with offsetting the output embeddings by one position, this ensures that predictions for position i depend only on the known outputs at positions less than i, preventing information leakage from future positions during training.
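The restriction on attending to subsequent positions is typically implemented as a causal mask applied to the attention scores. A small sketch of such a mask (the boolean convention follows PyTorch's attn_mask, where True marks a blocked connection):

```python
import torch

def causal_mask(size):
    """Mask that blocks attention from position i to positions > i.

    True entries are disallowed; nn.MultiheadAttention accepts this
    as attn_mask to implement the decoder's masked self-attention.
    """
    return torch.triu(torch.ones(size, size, dtype=torch.bool), diagonal=1)

print(causal_mask(4))
# tensor([[False,  True,  True,  True],
#         [False, False,  True,  True],
#         [False, False, False,  True],
#         [False, False, False, False]])
```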
Transformer models process input sequences in parallel, making them faster to train than recurrent neural networks (RNNs) for many NLP tasks. They are highly effective across a range of NLP tasks, including language modelling, text classification, question answering, and machine translation.
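As one illustration, assuming the Hugging Face transformers library is installed, pretrained Transformer models for such tasks can be used in a few lines:

```python
# Illustrative only: Hugging Face's `transformers` library exposes
# pretrained Transformer models for these tasks via pipelines.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier("Transformers process whole sequences in parallel."))

translator = pipeline("translation_en_to_fr")
print(translator("Attention is all you need."))
```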
OpenAI's GPT-3 and GPT-4 are based on the Transformer architecture.
References: Vaswani et al., "Attention Is All You Need," Advances in Neural Information Processing Systems (NIPS), 2017. https://arxiv.org/abs/1706.03762