3 LLM Architectures

Abhinav Kimothi
3 min readJul 24, 2023

Transformers form the backbone of today's revolutionary Large Language Models (LLMs)

While LLMs like GPT-4, Llama 2 & Falcon seem to do an excellent job across a variety of tasks, an LLM's performance on a particular task is a direct result of its underlying architecture.

There are three variations of the transformer architecture that power different LLMs.

1️⃣ Autoencoders — In autoencoders, the decoder part of the transformer is discarded after pre-training and only the encoder is used to generate the output. The widely popular BERT and RoBERTa models are based on this architecture and perform well on sentiment analysis and text classification. These models are trained using a process called MLM, or masked language modeling.
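As a minimal sketch of an encoder-only model in action, the snippet below uses the Hugging Face transformers library (an assumption; the article does not name a specific toolkit) and the bert-base-uncased checkpoint to fill in a masked token, mirroring the MLM objective these models are pre-trained on.

```python
# Assumes the Hugging Face transformers library is installed (pip install transformers).
from transformers import pipeline

# BERT is an encoder-only (autoencoder-style) model pre-trained with masked language modeling.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# Ask the encoder to predict the hidden token; [MASK] is BERT's mask placeholder.
for prediction in fill_mask("The movie was absolutely [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```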

2️⃣ Autoregressors — Modern LLMs like the GPT series and BLOOM are autoregressors. In this architecture, the decoder part is retained and the encoder part is discarded after pre-training. While text generation is the most natural use case for autoregressors, they perform exceptionally well on a wide variety of tasks, and most modern LLMs follow this design. These models are trained using a process called causal language modeling.
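A quick illustrative sketch of causal language modeling in practice: the example below uses the transformers text-generation pipeline with gpt2 (chosen here only as a small, openly available decoder-only checkpoint, not one prescribed by the article) to continue a prompt token by token.

```python
from transformers import pipeline

# GPT-2 is a decoder-only (autoregressive) model trained with causal language modeling:
# it predicts the next token given all the previous tokens.
generator = pipeline("text-generation", model="gpt2")

output = generator(
    "Transformers form the backbone of",
    max_new_tokens=20,   # limit how much new text is generated
    do_sample=True,      # sample rather than always picking the most likely token
)
print(output[0]["generated_text"])
```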

3️⃣ Sequence-to-Sequence — The genesis of the transformer lies in sequence-to-sequence models. These models retain both the encoder and the decoder and can be trained in multiple ways; one method is span corruption and reconstruction. They are best suited for language translation. The T5 and BART families of models are sequence-to-sequence models.
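To round out the three variants, here is a small sketch of an encoder-decoder model used for translation, again via the transformers pipeline API with t5-small (an illustrative checkpoint choice under the same assumptions as the earlier snippets).

```python
from transformers import pipeline

# T5 is an encoder-decoder (sequence-to-sequence) model; translation is a natural fit,
# since the encoder reads the source sentence and the decoder generates the target.
translator = pipeline("translation_en_to_de", model="t5-small")

result = translator("Transformers power the current generation of language models.")
print(result[0]["translation_text"])
```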

