The smallest units, or chunks, of text that a model processes

What is a token?

In the context of large language models, a 'token' is the smallest unit of text that a model processes. LLMs use tokens to process and generate language. A token can be as short as one character, as long as a word, or an even larger chunk of text, such as a phrase, depending on the model and its configuration.

Tokens serve as a bridge between human language and a structure that AI models can understand. Many modern language models, such as GPT models, are trained on and operate over tokens rather than raw text. Each model is designed to handle a limited number of tokens at once, a limit often called its context window.
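The token limit can be illustrated with a minimal sketch. It assumes whitespace-separated words stand in for tokens and that the oldest tokens are dropped first; the function name fit_to_context and the limit of 4 are hypothetical, and real systems may truncate differently.

```python
def fit_to_context(tokens, max_tokens):
    """Keep only the most recent max_tokens tokens (an assumed truncation policy)."""
    if max_tokens <= 0:
        return []
    # Slicing from the end keeps the newest tokens and drops the oldest ones.
    return tokens[-max_tokens:]


# Whitespace splitting is a stand-in for a real tokenizer.
tokens = "the quick brown fox jumps over the lazy dog".split()
window = fit_to_context(tokens, 4)
print(window)  # ['over', 'the', 'lazy', 'dog']
```

Anything that falls outside the window is simply not seen by the model in that call, which is why long inputs are often shortened or summarized first.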

Each input provided to the model is broken down into tokens and analyzed, and the resulting representation is used to create a response. The response is built from tokens as well: the model generates one token at a time, each based on the input and the tokens generated so far.
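The loop described above can be sketched in a few lines. Everything here is a toy assumption: the five-entry vocabulary, the lookup table inside toy_next_token (a stand-in for a real model, which would compute probabilities over its whole vocabulary), and the "<eos>" end marker are all hypothetical.

```python
# Toy vocabulary mapping each token to an integer ID (hypothetical).
VOCAB = {"<eos>": 0, "the": 1, "cat": 2, "sat": 3, "down": 4}
ID_TO_TOKEN = {i: t for t, i in VOCAB.items()}


def encode(text):
    """Break the input into tokens and map them to IDs."""
    return [VOCAB[word] for word in text.lower().split()]


def decode(ids):
    """Map token IDs back to text."""
    return " ".join(ID_TO_TOKEN[i] for i in ids)


def toy_next_token(ids):
    """Stand-in for a model: choose the next ID from the tokens seen so far."""
    table = {(1,): 2, (1, 2): 3, (1, 2, 3): 4}
    # Unknown contexts end the sequence.
    return table.get(tuple(ids), VOCAB["<eos>"])


def generate(prompt, max_new_tokens=10):
    ids = encode(prompt)
    for _ in range(max_new_tokens):
        next_id = toy_next_token(ids)  # one new token per step
        if next_id == VOCAB["<eos>"]:
            break
        ids.append(next_id)  # the new token becomes part of the context
    return decode(ids)


print(generate("the"))  # the cat sat down
```

Each generated token is appended to the sequence before the next step, so every later choice can depend on it.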

Types of tokens:

Here are some types of tokens used in large language models:

  • Word Tokens: These represent individual words or phrases in the text, like "house."
  • Sub-word Tokens: Words can be divided into smaller sub-word units. For instance, "speaking" can be segmented into "speak" and "ing."
  • Punctuation Tokens: Tokens that signify various punctuation marks, such as commas (","), periods ("."), and others.
  • Special Tokens: Unique symbols like "[CLS]" (classification token), "[SEP]" (separator token), or "[MASK]" (mask token) have specific roles within the model.
  • Number Tokens: Numbers written in the text are tokenized as well. For example, "10" might be represented as a single token or split into the digits "1" and "0," depending on the tokenizer.
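To make these categories concrete, here is a rough sketch that splits a string and labels each piece. It is a hand-written illustration, not how production tokenizers work: real tokenizers typically learn their sub-word units from data, whereas the regular expression, the classify function, and the one-item suffix list below are assumptions chosen for the example.

```python
import re

SPECIAL = {"[CLS]", "[SEP]", "[MASK]"}
SUFFIXES = ("ing",)  # hypothetical suffix list used for naive sub-word splitting

# Order matters: special tokens first, then digits, letters, and single punctuation marks.
TOKEN_RE = re.compile(r"\[(?:CLS|SEP|MASK)\]|\d+|[A-Za-z]+|[^\w\s]")


def classify(text):
    """Return (token, type) pairs for a piece of text."""
    labeled = []
    for piece in TOKEN_RE.findall(text):
        if piece in SPECIAL:
            labeled.append((piece, "special"))
        elif piece.isdigit():
            labeled.append((piece, "number"))
        elif piece.isalpha():
            suffix = next(
                (s for s in SUFFIXES if piece.endswith(s) and len(piece) > len(s)),
                None,
            )
            if suffix:
                # Split the word into a stem and a suffix, e.g. "speak" + "ing".
                labeled.append((piece[: -len(suffix)], "sub-word"))
                labeled.append((suffix, "sub-word"))
            else:
                labeled.append((piece, "word"))
        else:
            labeled.append((piece, "punctuation"))
    return labeled


for token, kind in classify("[CLS] speaking of 10 houses, stop. [SEP]"):
    print(f"{token!r}: {kind}")
```

The suffix rule is deliberately naive (it would also split "ring" into "r" and "ing"), which is one reason learned sub-word vocabularies are preferred in practice.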

