Unravel LLM building blocks

Yaniv Vaknin
5 min read · Aug 22, 2023

The total amount of data created, captured, copied, and consumed globally is forecast to increase rapidly, reaching 120 zettabytes in 2023. By 2025, global data creation is expected to reach 180 zettabytes.

According to estimates, 80 to 90 percent of data is unstructured (e.g. text, images), meaning it is unorganized and inaccessible to conventional data analytics tools.

Created by DALL-E

LLMs (Large Language Models) need a unified way to consume, analyze, search, and train on this volume of data. To do so, they rely on a common representation of unstructured data: embeddings.

This topic has been the subject of many papers. In this article, my aim is to provide you with the essential foundations and practical code examples that demystify the building blocks of LLMs. This will give you the fundamental knowledge and a smooth route through technical discussions such as “How many tokens was GPT-4 trained on?” or “What is OpenAI’s ada (hint: embedding) model?”

Image by author

What is an embedding?

Google’s researchers set the stage with Word2vec and the foundational paper “Attention Is All You Need”. These papers built the pillars of the emerging shared language: embeddings.

Word and sentence embeddings are the building blocks of language models. Embeddings map items of unstructured data to high-dimensional real vectors.

Word embeddings are vector representations of words, where each dimension represents a different feature or aspect of the word’s meaning.

Word embedding — 50 features (Source)

Embeddings are typically created using neural network models that learn to map words or tokens to a high-dimensional vector space.
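To make this concrete, here is a toy sketch. The 4-dimensional vectors below are invented for illustration (real embeddings are learned by a model and have hundreds or thousands of dimensions), but they show the key idea that related words point in similar directions:

import numpy as np

# Hand-made toy "embeddings", purely illustrative; not produced by any model.
toy_embeddings = {
    "dog":    np.array([0.9, 0.1, 0.8, 0.2]),
    "cat":    np.array([0.8, 0.2, 0.9, 0.1]),
    "banana": np.array([0.1, 0.9, 0.2, 0.8]),
}

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine of the angle between two vectors: close to 1.0 means similar direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(toy_embeddings["dog"], toy_embeddings["cat"]))     # high (~0.99)
print(cosine_similarity(toy_embeddings["dog"], toy_embeddings["banana"]))  # low  (~0.33)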

Embeddings vs. Tokens

OK, that’s great, but how do we decide how to split the input data? This is easy for short, frequent words, but for rare words we need a splitting mechanism.

This is where tokenization comes into play: each word is split into one or more tokens, and each token is then embedded into a vector representation.

Tokens are common sequences of characters found in text. As a rule of thumb, one token is roughly 4 characters, or about 0.75 words on average; for example, 100 tokens correspond to roughly 75 English words.

To calculate how many tokens are included in your text, you can use tiktoken, a fast BPE (Byte Pair Encoding) tokenizer for use with OpenAI models.

The open source version of tiktoken can be installed from PyPI:

pip install tiktoken

This function returns the number of tokens in a text string:

import tiktoken

def num_tokens_from_string(string: str, encoding_name: str) -> int:
    # Look up the encoding (e.g. cl100k_base) and count the encoded tokens.
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

If we want to know how many tokens are in the sentence “tiktoken is awesome!”, we can call the function above as follows; in this case we use the cl100k_base encoding.

print(num_tokens_from_string("tiktoken is awesome!", "cl100k_base"))
6

Popular OpenAI models like gpt-4, gpt-3.5-turbo, and text-embedding-ada-002 use the cl100k_base encoding.

You can also use the visual tokenizer tool provided by OpenAI. For example, the sentence “Mary Poppins Supercalifragilisticexpialidocious” comprises 15 tokens.
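If you prefer to inspect the split programmatically rather than in the browser, here is a small sketch using tiktoken’s encoding_for_model helper, which looks up the encoding used by a given model; the exact split is whatever the cl100k_base vocabulary produces:

import tiktoken

# gpt-4, gpt-3.5-turbo and text-embedding-ada-002 all resolve to cl100k_base.
encoding = tiktoken.encoding_for_model("gpt-4")

text = "Mary Poppins Supercalifragilisticexpialidocious"
token_ids = encoding.encode(text)

print(len(token_ids))                              # number of tokens
print([encoding.decode([t]) for t in token_ids])   # the individual token strings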

Now that we understand how to represent a word as a vector of numbers (a word embedding) and have a mechanism for splitting text into tokens, “sentence embeddings” come into play. A sentence embedding is just like a word embedding, except it associates every sentence with a vector of numbers. In addition, we encode the position of each word in the input sentence into a vector and add it to the word vector; this way, the same word at a different position in a sentence is encoded differently.
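To make the positional part concrete, here is a minimal NumPy sketch of the fixed sinusoidal positional encoding from “Attention Is All You Need”; the toy word_vectors array below is just a stand-in for real token embeddings:

import numpy as np

def sinusoidal_positional_encoding(seq_len: int, dim: int) -> np.ndarray:
    # PE(pos, 2i) = sin(pos / 10000^(2i/dim)), PE(pos, 2i+1) = cos(pos / 10000^(2i/dim))
    positions = np.arange(seq_len)[:, np.newaxis]        # shape (seq_len, 1)
    div_terms = 10000 ** (np.arange(0, dim, 2) / dim)    # shape (dim/2,)
    pe = np.zeros((seq_len, dim))
    pe[:, 0::2] = np.sin(positions / div_terms)
    pe[:, 1::2] = np.cos(positions / div_terms)
    return pe

# Toy stand-in for real token embeddings: 4 tokens, 8 dimensions each.
word_vectors = np.random.rand(4, 8)
position_aware = word_vectors + sinusoidal_positional_encoding(4, 8)
print(position_aware.shape)   # (4, 8): the same word at a different position gets a different vector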

The following code generates a vector of 384 features for each of the sentences [“I like dogs”, “I like cats”]. Keep in mind that different models produce different numbers of features (roughly 96 to 4,096).

Consider the following tradeoffs when choosing your model:

More features -> higher recall (accuracy) -> more memory (higher latency / slower performance)

pip install -U sentence-transformers

from sentence_transformers import SentenceTransformer

sentences = ["I like dogs", "I like cats"]

# all-MiniLM-L6-v2 maps each sentence to a 384-dimensional vector.
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embeddings = model.encode(sentences)
print(embeddings)

This is a truncated view of the output:

[[-5.84880412e-02 -2.79528778e-02  6.88477978e-02  2.85013970e-02
  -6.79694414e-02 -2.25785817e-03  7.21920580e-02 -5.46673220e-03
   1.04574956e-01  5.83925620e-02  7.27049336e-02 -6.51337430e-02
   ...
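A natural next step is to compare the two vectors. Here is a short sketch using the library’s util.cos_sim helper; the exact score depends on the model, but the two sentences should come out as highly similar:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embeddings = model.encode(["I like dogs", "I like cats"])

# Cosine similarity between the two 384-dimensional sentence vectors;
# semantically close sentences score close to 1.0.
print(util.cos_sim(embeddings[0], embeddings[1]))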

Summary

This article discusses the increasing amount of data generated globally and the need for a unified way to consume unstructured data and train LLMs on it. It emphasizes the importance of embeddings, which are vector representations of unstructured data such as words or sentences. Tokenization is introduced as the mechanism for splitting input data, especially infrequent words. Tokens are sequences of characters found in text, and they are embedded into vector representations. Tiktoken is a tokenizer for OpenAI models, and the article provides an example code snippet that counts the number of tokens in a text string. Additionally, it introduces the concept of sentence embeddings, which associate every sentence with a vector of numbers, and provides code for generating them with the SentenceTransformer library.

If someone asks you :-)

  • According to this article, OpenAI trained GPT-4 on ~13 trillion tokens.
  • text-embedding-ada-002 is an embedding model from OpenAI that replaces the separate models for text search, text similarity, and code search.
