Cracking the Code: The Language LLMs Speak

Ignacio Correcher Sánchez
5 min read · Sep 7, 2024


Have you ever wondered how large language models like GPT-4 actually “speak” your language? It all comes down to a fascinating process called tokenization. This behind-the-scenes magic breaks your text into smaller pieces called tokens, which can be entire words, parts of words, or even single letters. These tokens are the building blocks that allow the model to understand and generate language, turning complex text into something it can process. Today, tokenization is essential to how LLMs crack the code of human language, though some research is being conducted to bypass this process entirely. In this article, we dive deep into how tokenization works, why it’s essential, and how it acts as the silent powerhouse behind today’s most advanced language models.

TL;DR

This article examines how large language models like GPT-4 use tokenization based on Byte Pair Encoding (BPE) combined with regular expressions (regex) to process language more effectively. GPT-2 also used a regex-based splitter, but GPT-4’s pattern handles punctuation, spaces, numbers, and compound words more efficiently. These refinements lead to enhanced performance in tasks like coding and multilingual processing. The article traces the evolution of tokenization from GPT-2 to GPT-4 and explains how the latest innovations in regex-based tokenization contribute to the model’s superior capabilities.

Byte Pair Encoding Algorithm

Byte Pair Encoding (BPE) is arguably the most popular algorithm for tokenization, currently used in some of the most advanced LLMs like OpenAI’s GPT-4.

The core idea behind BPE is to repeatedly replace the most frequently occurring pair of tokens with a new token, stopping once the vocabulary reaches the desired size (or no pair occurs more than once). Let’s walk through a simple example to clarify how the algorithm works, followed by a short code sketch of a single merge step.

#Start
text = aaabdaaabac
vocab = {}

#Iteration 1
text = XabdXabac
vocab = {X: aa}

#Iteration 2
text = XYdXYac
vocab = {X: aa, Y: ab}

#Iteration 3
text = ZdZac
vocab = {X: aa, Y: ab, Z: XY}
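
To make the walkthrough concrete, here is a minimal sketch in plain Python of a single merge step applied to the example string. The helper name merge and the variable names are my own, not from any library.

from collections import Counter

def merge(seq, pair, new_symbol):
    # Replace every non-overlapping occurrence of `pair` with `new_symbol`.
    out, i = [], 0
    while i < len(seq):
        if i < len(seq) - 1 and (seq[i], seq[i + 1]) == pair:
            out.append(new_symbol)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

seq = list("aaabdaaabac")
pair_counts = Counter(zip(seq, seq[1:]))
top_pair = max(pair_counts, key=pair_counts.get)  # ('a', 'a'), which occurs 4 times
seq = merge(seq, top_pair, "X")
print("".join(seq))  # XabdXabac, matching Iteration 1 above

Repeating the same count-and-merge step yields the later iterations; when two pairs tie in frequency, the choice between them is arbitrary.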

Classic BPE Implementation Idea

Before implementing BPE, some groundwork is needed.

First, we require a substantial amount of text to train the encoder. We then encode this text as UTF-8 and work with the resulting raw bytes as a list of integers, so instead of dealing with characters, we work with numbers, and a lot of them.

Our vocabulary starts with 256 elements, one for each possible byte value from 0 to 255. At this point, we set a goal for the vocabulary size, determining how large we want it to be. From this, we can calculate the number of merges required using the formula n_merges = vocab_size - 256.
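
As a quick illustration of this setup, here is a minimal sketch; the sample string and the target vocabulary size of 512 are arbitrary choices, not values used by any real model.

text = "Byte Pair Encoding works on raw bytes, not characters."  # stand-in for a large training corpus
ids = list(text.encode("utf-8"))  # raw byte values, e.g. [66, 121, 116, 101, ...]

vocab_size = 512                  # desired final vocabulary size (illustrative)
n_merges = vocab_size - 256       # merges to learn on top of the 256 base byte tokens
print(len(ids), n_merges)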

We can break the process into three main steps:

  1. Identify the most frequently occurring pair in the long sequence of numbers.
  2. Merge every occurrence of that pair into a new token, assigning it the next unused token ID (256 for the first merge, 257 for the second, and so on).
  3. Iterate until we reach our desired vocabulary size (a minimal sketch of this loop follows the list).
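
Putting the three steps together, here is a minimal, unoptimized training loop in the spirit of the classic implementation. The function names are my own, and ties between equally frequent pairs are broken arbitrarily.

from collections import Counter

def get_pair_counts(ids):
    # Step 1: count how often each adjacent pair of token IDs occurs.
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    # Step 2: replace every occurrence of `pair` with the new token ID.
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, vocab_size):
    # Learn vocab_size - 256 merges on top of the 256 raw byte tokens.
    ids = list(text.encode("utf-8"))
    merges = {}  # (pair of IDs) -> new token ID
    for i in range(vocab_size - 256):
        counts = get_pair_counts(ids)
        if not counts:
            break
        pair = max(counts, key=counts.get)  # most frequent pair
        new_id = 256 + i                    # next unused token ID
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id               # step 3: repeat until vocab_size is reached
    return merges

print(train_bpe("aaabdaaabac" * 100, vocab_size=260))  # four learned merges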

Visualization

Tiktokenizer capture of GPT-2 Tokenization

Modern BPE Implementation with Regex

In state-of-the-art models, BPE implementation differs slightly from the classical approach.

The goal of tokenization is to help the LLM understand words and sentences the way humans do. With the basic algorithm, chunks like “dog.”, “dog!”, and “dog?” can each end up merged into entirely different tokens, because nothing stops merges from crossing the boundary between a word and its punctuation. Humans easily recognize the noun “dog” followed by different punctuation marks, but LLMs need some help.

Simply feeding the model a raw string of numbers won’t allow it to make such distinctions. To address this, modern tokenization methods use a regex pattern to split the text into manageable chunks that account for punctuation, whitespace, and other variations before any merges are applied. For example, GPT-4 uses the following regex pattern in its tokenizer:

import regex as re

GPT4_SPLIT_PATTERN = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""
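
To see what this pattern does in practice, the short continuation below (reusing re and GPT4_SPLIT_PATTERN from the snippet above; the sample sentence is arbitrary) prints the chunks that BPE merges are later applied to, never across.

text = "Hello world!!  How's it going in 2024?"
print(re.findall(GPT4_SPLIT_PATTERN, text))
# ['Hello', ' world', '!!', ' ', ' How', "'s", ' it', ' going', ' in', ' ', '202', '4', '?']

Note how the contraction, the punctuation, the extra space, and the digits each land in their own chunk.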

Visualization

Tiktokenizer capture of GPT-4 Tokenization

Comparing tokenization in GPT-2 and GPT-4 reveals significant differences, especially in the handling of spaces. While GPT-2 assigns each space its own token, GPT-4 merges a run of spaces into a single token, except for the last space, which becomes part of the next token.

These improvements make GPT-4 particularly effective in tasks such as Python coding, where whitespace is significant. The new splitting scheme allows Python code to be tokenized more efficiently, using fewer tokens to represent the same text and improving the model’s grasp of the code’s structure.
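
One way to observe this yourself is with OpenAI’s tiktoken library, assuming it is installed; “gpt2” and “cl100k_base” are the encodings used by GPT-2 and GPT-4, respectively. This is a minimal sketch, and the code sample is arbitrary.

import tiktoken

code = "def greet(name):\n        return f'Hello, {name}!'"  # note the 8-space indent

gpt2_enc = tiktoken.get_encoding("gpt2")         # GPT-2 tokenizer
gpt4_enc = tiktoken.get_encoding("cl100k_base")  # GPT-4 tokenizer

print("GPT-2 tokens:", len(gpt2_enc.encode(code)))
print("GPT-4 tokens:", len(gpt4_enc.encode(code)))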

Conclusions

Now that we’ve covered the basics of tokenization, we can answer some of the frequently asked questions I’ve seen on X (formerly Twitter):

  • Why can’t ChatGPT spell words correctly?
    This is due to tokenization. The model sees tokens rather than individual characters, so character-level tasks like spelling are hard for it. To work around this, separate the characters with spaces so that each one becomes its own token. For example, instead of asking it to spell “strawberry”, write it as “s t r a w b e r r y”.
  • Why can’t ChatGPT perform simple arithmetic?
    Again, tokenization plays a role. Numbers are split into tokens in ways that don’t line up with their digits, so the model never reliably sees the individual digits it would need for precise arithmetic, which can cause errors in simple calculations.
  • How did GPT-4 improve on Python-based tasks compared to GPT-2?
    Tokenization is a key factor. Aside from larger context windows, GPT-4 manages spaces more effectively, which is crucial in Python code.
  • How does GPT-4 handle punctuation better than GPT-2?
    GPT-4 uses regex-based splitting, allowing it to separate words from punctuation more reliably. In GPT-2, chunks like “dog.”, “dog!”, and “dog?” may each be tokenized differently, whereas GPT-4 consistently identifies the word “dog” as one piece and treats the punctuation marks as distinct tokens.
  • What improvements were made in space handling in GPT-4?
    GPT-4 merges runs of spaces into single tokens, reducing the number of tokens required. GPT-2 treated each space as a separate token, which was inefficient. GPT-4’s regex pattern, by contrast, collapses whitespace, enabling the model to fit more content within its context limit.
  • How does GPT-4 tokenize numbers and special characters more efficiently?
    GPT-4’s regex pattern handles non-alphabetic characters like numbers and symbols in a smarter way. It groups digits into chunks of at most three, so a number like “123” stays together as a single chunk rather than being broken into arbitrary pieces as it might be in GPT-2, improving consistency and reducing token usage (see the snippet after this list).
  • What’s the impact of regex on handling rare or compound words?
    GPT-4’s tokenization keeps rare and compound words intact, unlike GPT-2, which might break them into inefficient parts. This leads to better understanding of domain-specific language and languages with complex word forms, enhancing the model’s overall comprehension.
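
To illustrate the number handling mentioned in the list above, here is a small, self-contained sketch reusing the GPT-4 split pattern from earlier; it shows how digit runs are capped at three before any BPE merges happen (the example string is arbitrary).

import regex as re

GPT4_SPLIT_PATTERN = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""

print(re.findall(GPT4_SPLIT_PATTERN, "Order 1234567 shipped"))
# ['Order', ' ', '123', '456', '7', ' shipped']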

GitHub Repo

For anyone interested in a simple implementation of the BPE algorithm (with and without regex), a GitHub repository is available. It includes training data, test data, and a test Jupyter Notebook for comparing both algorithms.

Resources

· Language Models are Unsupervised Multitask Learners by Radford et al.

· tiktokenizer website to visualize tokens from different LLMs.

· Let’s build the GPT Tokenizer by Andrej Karpathy
