Lossless Tokenizer via Byte-level BPE with Tiktoken
OpenAI’s GPT-2 tokenizer is among the first to handle tokenization in a completely lossless way, meaning that there is no unknown token. In my opinion, OpenAI’s vision for the generality of GPT really shines through in the tokenizer. In this blog we will analyze tiktoken, the tokenizer behind the GPT models.
We will describe how tiktoken encodes a text into tokens. There are three main stages: (1) extracting special tokens that we never want to be broken up into smaller pieces; (2) pre-tokenization based on a pre-defined regular expression pattern, which resembles breaking the text up into words; (3) if such a pre-token is not itself a token in the vocabulary, using byte-level BPE to break the pre-token into smaller pieces.
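Step (1) is not shown in the educational Python code we walk through below, but as a quick aside, here is a small sketch of how the released tiktoken library exposes it: special tokens are matched before pre-tokenization and must be explicitly allowed at encode time (the <|endoftext|> token shown here belongs to the cl100k_base encoding).

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# By default, special-token text in the input raises an error, so user input
# cannot smuggle in control tokens:
# enc.encode("<|endoftext|>")  # raises ValueError

# Explicitly allowing the special token maps it to a single reserved id,
# so it is never broken up by pre-tokenization or BPE.
print(enc.encode("<|endoftext|>", allowed_special={"<|endoftext|>"}))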
Pre-Tokenization
Let’s look at the code that performs step (2). Note that step (1) is omitted in the educational Python tiktoken code, but it is in the Rust code here. Below is the Python encode function taken from tiktoken.
def encode(self, text: str, visualise: Optional[str] = "colour") -> tuple[list[int], list[str]]:
    """Encodes a string into tokens.

    >>> enc.encode("hello world")
    [388, 372]
    """
    # Use the regex to split the text into (approximately) words
    words = self._pat.findall(text)  # pre-tokens based on word boundary rules
    tokens = []
    for word in words:
        # Turn each word into tokens, using the byte pair encoding algorithm
        word_bytes = word.encode("utf-8")
        word_tokens = bpe_encode(self.mergeable_ranks, word_bytes, visualise=visualise)
        tokens.extend(word_tokens)
    # Modified from the stock educational code: also return the pre-tokens
    return tokens, words
For this encode function, a text (still a string, not bytes) is broken into “words”, which we will call pre-tokens. For GPT-4 models, the tokenizer’s name is cl100k_base. The regular expression pattern self._pat is defined below.
>>> from tiktoken._educational import *
>>> enc = SimpleBytePairEncoding.from_tiktoken("cl100k_base")
>>> enc._pat
regex.Regex("(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+", flags=regex.V0)
GPT-4’s short description of this regular expression is given below (long description here).
This regex captures common contractions (like 's, 't, etc.) in a case-insensitive manner, sequences of letters possibly preceded by a non-letter, non-number character, sequences of 1 to 3 numbers, sequences of non-letter, non-number characters possibly followed by newlines, sequences of whitespace ending with newlines, whitespace not followed by non-whitespace, or any sequence of whitespace characters.
Let’s see some examples of how the regex breaks up text. Below, we can see that the pattern defines rules for word boundaries, such as how to separate non-whitespace from whitespace, and also imposes certain structure such as a space prefix (the space placed right before a non-whitespace character, as in “ x”, “ +”, “ y”).
>>> import regex
>>> pat = regex.compile(r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+")
>>> pat.findall("hello worlddddd")
['hello', ' worlddddd']
>>> pat.findall("def add(x, y):\n\treturn x + y")
['def', ' add', '(x', ',', ' y', '):\n', '\treturn', ' x', ' +', ' y']
Byte-level BPE
Base Vocabulary
BPE builds a vocabulary in a bottom-up fashion, merging tokens starting from a base vocabulary. Traditional BPE starts with a set of characters. The set of all possible characters is very large and still growing: the Unicode standard has over 100,000 characters. This makes it very difficult to build a lossless tokenizer on top of characters.
However, these Unicode characters are composed of smaller elements: bytes. Since there are only 256 possible byte values, and any text can be represented as a sequence of bytes (e.g., via UTF-8), we can build a lossless tokenizer with absolutely no unknown token. This is a neat tokenizer design that GPT was among the first to adopt (if not the first). Before GPT, many models used all sorts of tricks, such as text normalization, to manage unknown tokens.
Let’s see some examples of the byte representation of a few Unicode characters. A typical English character such as ‘a’ is represented by a single byte. However, a Japanese character such as ‘カ’ is represented by 3 bytes, and the emoji 🐱 by 4 bytes.
>>> "a".encode('utf-8')
b'a'
>>> "🐱".encode('utf-8')
b'\xf0\x9f\x90\xb1'
>>> "カ".encode('utf-8')
b'\xe3\x82\xab'
In tiktoken, these 256 bytes are used as the base vocabulary. Even if the tokenizer has never observed a character or phrase during the training stage, that phrase can still be encoded as a sequence of base bytes.
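As a quick sanity check, here is a minimal sketch (assuming the tiktoken package is installed; the gibberish string is just an arbitrary example) showing that encoding and then decoding any string, even one the tokenizer surely never saw during training, reproduces it exactly:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "qzxjv 🐱カ never-seen gibberish ↯↯"
tokens = enc.encode(text)
# Every byte of the UTF-8 encoding is covered by the 256 base tokens,
# so the round trip is exact and no <unk> token is ever needed.
assert enc.decode(tokens) == text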
Encoding
Now, let’s investigate the bpe_encode function. Each pre-token is broken up into smaller pieces using byte-level BPE. First, the pre-token is converted into a list of single bytes (parts in the code below). Then, for each adjacent pair of parts, we check whether the concatenated pair is in the vocabulary (mergeable_ranks); if it is, we obtain its rank. We go through all the pairs, and the adjacent pair with the smallest rank is selected to be merged. In the code below, this is done by enumerating over the zip of parts[:-1] and parts[1:], which walks through all adjacent pairs. We then merge the selected pair, leave the other parts intact, and repeat the process. Observe that each part can grow longer than 1 byte through this iterative merging. We stop when no more merges are possible (no adjacent pair of parts is in the vocabulary anymore).
def bpe_encode(
    mergeable_ranks: dict[bytes, int], input: bytes, visualise: Optional[str] = "colour"
) -> list[int]:
    parts = [bytes([b]) for b in input]
    while True:
        # See the intermediate merges play out!
        if visualise:
            if visualise in ["colour", "color"]:
                visualise_tokens(parts)
            elif visualise == "simple":
                print(parts)

        # Iterate over all pairs and find the pair we want to merge the most
        min_idx = None
        min_rank = None
        for i, pair in enumerate(zip(parts[:-1], parts[1:])):
            rank = mergeable_ranks.get(pair[0] + pair[1])
            if rank is not None and (min_rank is None or rank < min_rank):
                min_idx = i
                min_rank = rank

        # If there were no pairs we could merge, we're done!
        if min_rank is None:
            break
        assert min_idx is not None

        # Otherwise, merge that pair and leave the rest unchanged. Then repeat.
        parts = parts[:min_idx] + [parts[min_idx] + parts[min_idx + 1]] + parts[min_idx + 2 :]

    if visualise:
        print()

    tokens = [mergeable_ranks[part] for part in parts]
    return tokens
Below is an example where hello worlddddd is the input for encoding. From the pre-tokenization above, we saw that it is split into two pre-tokens, hello and ‘ worlddddd’ (note the leading space), which is why BPE works on hello first. Throughout the merging process, observe that BPE merges the pair with the lowest rank first and keeps building up longer parts. Note that the process is deterministic: given the pre-built vocabulary and a text, the same sequence of merges will always be performed.
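Since the coloured visualisation is not reproduced here, below is a small sketch (using the unmodified educational classes from tiktoken) of how to print the intermediate merges yourself:

from tiktoken._educational import SimpleBytePairEncoding

enc = SimpleBytePairEncoding.from_tiktoken("cl100k_base")
# visualise="simple" prints the list of parts after each merge,
# first for "hello", then for " worlddddd".
enc.encode("hello worlddddd", visualise="simple")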
Interestingly, we can also see that the emoji is encoded with 3 tokens for its 4 bytes (rather inefficient, which suggests that other tokens are more important, i.e., have lower rank). The Japanese character カ, however, can be represented with only 1 token despite being 3 bytes; it likely appears frequently enough that it made it into the vocabulary itself.
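This is easy to check with the released library (a small sketch; the exact counts depend on the cl100k_base vocabulary):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for ch in ["🐱", "カ"]:
    n_bytes = len(ch.encode("utf-8"))
    n_tokens = len(enc.encode(ch))
    print(f"{ch!r}: {n_bytes} bytes -> {n_tokens} token(s)")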
Training a BPE Tokenizer
To train a BPE tokenizer (that is, to obtain a vocabulary), we iterate through a text corpus, pre-tokenize it, and then use the resulting bag of words (each word, or pre-token, is a sequence of bytes) as the data that will be iteratively merged.
First, we add the base bytes (all 256 of them) to the vocabulary. Then, we iterate: we count the occurrences of each adjacent pair of parts, and the highest-frequency pair (a, b) is added to the vocabulary as the new token ab. The data is then processed by merging every occurrence of adjacent a, b into ab. That way, at each stage, all the parts are in the vocabulary. We repeat until the vocabulary reaches the desired size, as sketched below.
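Below is a minimal sketch of such a training loop. This is my own simplified illustration, not tiktoken’s actual training code; the function name train_bpe_sketch and its arguments are made up for this example, and a real trainer would be far more optimized.

from collections import Counter

def train_bpe_sketch(corpus, pat, vocab_size: int) -> dict[bytes, int]:
    """Sketch of BPE training: `corpus` is a list of strings and `pat` is a
    compiled pre-tokenization regex (e.g. the cl100k_base pattern above)."""
    # Start with the 256 base bytes as the vocabulary.
    ranks: dict[bytes, int] = {bytes([i]): i for i in range(256)}
    # Pre-tokenize; each pre-token becomes a list of parts, initially single bytes.
    words = [
        [bytes([b]) for b in word.encode("utf-8")]
        for text in corpus
        for word in pat.findall(text)
    ]
    while len(ranks) < vocab_size:
        # Count occurrences of every adjacent pair of parts across the data.
        counts = Counter()
        for parts in words:
            for pair in zip(parts[:-1], parts[1:]):
                counts[pair] += 1
        if not counts:
            break  # nothing left to merge
        # Add the most frequent pair (a, b) to the vocabulary as token ab ...
        (a, b), _ = counts.most_common(1)[0]
        token = a + b
        ranks[token] = len(ranks)
        # ... and merge every adjacent occurrence of (a, b) in the data,
        # so that all parts remain tokens that are in the vocabulary.
        new_words = []
        for parts in words:
            merged, i = [], 0
            while i < len(parts):
                if i + 1 < len(parts) and parts[i] == a and parts[i + 1] == b:
                    merged.append(token)
                    i += 2
                else:
                    merged.append(parts[i])
                    i += 1
            new_words.append(merged)
        words = new_words
    return ranks

The returned dictionary plays the same role as mergeable_ranks in the encoding code above.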
Appendix
We show the modified educational tiktoken code here.