LLM Tokenizer Generation
The goal of this project is to design and implement a tokenizer for large language models (LLMs). Tokenizers play a crucial role in NLP pipelines by converting text into manageable units (tokens) that models can understand. The tokenizer should balance efficiency, compatibility with LLMs, and the ability to handle diverse text inputs, and it should split text into tokens using the byte pair encoding (BPE) algorithm.
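To make the goal concrete, here is a minimal sketch of the core BPE training loop in Python. All names are illustrative and no particular API is prescribed: the idea is to count adjacent symbol pairs across the corpus, merge the most frequent pair into a new token, and repeat until the vocabulary reaches the target size.

```python
from collections import Counter

def train_bpe(corpus_words, target_vocab_size):
    """Illustrative BPE training loop: repeatedly merge the most
    frequent adjacent symbol pair until the vocabulary is large enough."""
    # Represent each word as a tuple of single-character symbols.
    words = Counter(tuple(word) for word in corpus_words)
    vocab = {sym for word in words for sym in word}  # start from characters
    merges = []
    while len(vocab) < target_vocab_size:
        # Count every adjacent pair of symbols, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # nothing left to merge
        best = max(pairs, key=pairs.get)  # most frequent pair
        merged = best[0] + best[1]        # new token text
        merges.append(best)
        vocab.add(merged)
        # Rewrite every word, replacing occurrences of the best pair.
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return vocab, merges
```

In the parallel and distributed modes, the pair-counting step is the natural candidate for splitting across workers, since per-word counts can be computed independently and summed.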
Running the Program
- The program can be run in different modes (sequential, parallel, distributed) by specifying a parameter.
- The user can specify a folder containing `*.txt` files, which should be used to generate the tokenizer.
- The output of the program is a JSON file containing mappings between token IDs (numbers) and the character combinations (text); an example follows this list.
- The program should support loading an external tokenizer from a JSON file; once loaded, the program should let the user encode and decode text (a loading sketch also follows the list).
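For illustration only, the vocabulary JSON might map token IDs to token strings as in the comment below; the exact key layout and all function names here are assumptions, not requirements of the assignment. The `encode` shown is a greedy longest-match simplification, whereas a real BPE encoder replays the learned merges.

```python
import json

# Hypothetical vocabulary file shape (id -> token text), e.g.:
#   {"0": "a", "1": "b", "256": "th", "257": "the"}
def load_tokenizer(path):
    with open(path, encoding="utf-8") as f:
        id_to_token = {int(k): v for k, v in json.load(f).items()}
    token_to_id = {v: k for k, v in id_to_token.items()}
    return id_to_token, token_to_id

def decode(ids, id_to_token):
    """Map token IDs back to text by concatenating their strings."""
    return "".join(id_to_token[i] for i in ids)

def encode(text, token_to_id):
    """Greedy longest-match encoding; a simplification of replaying
    BPE merges, shown only to illustrate the intended interface."""
    ids, i = [], 0
    max_len = max(len(t) for t in token_to_id)
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in token_to_id:
                ids.append(token_to_id[piece])
                i += length
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return ids
```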
Testing
The report must include extensive testing and an explanation of the results (numeric and graphical). All three versions (sequential, parallel, and distributed) must be tested. The tests should be performed without encoding/decoding large files. The parameters that influence runtime are the number of text files used to generate the tokenizer and the complexity of the vocabulary. Consequently, each needs to be tested independently to show how the implementation scales. Present the results with informative charts/figures and explain them in detail. The implementation should be tested in the following two ways:
Testing by Limiting Text
- Limit the amount of text from which the tokenizer is generated (it is recommended to measure corpus size in words). Set the vocabulary complexity to a fixed size.
- Run the program multiple times, increasing the amount of text until the program takes too long to complete.
- Every configuration should be run at least three times, and the average runtime should be considered when analyzing results (the timing sketch below illustrates this).
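A small timing harness along these lines could drive both experiments. Here `benchmark`, `build_tokenizer`, and the parameter names `corpus_words`/`vocab_size` are hypothetical placeholders for your own implementation, and the sweep values are examples only.

```python
import time
from statistics import mean

def benchmark(build_tokenizer, configs, runs=3):
    """Run each configuration `runs` times and report the mean runtime."""
    results = {}
    for config in configs:
        times = []
        for _ in range(runs):
            start = time.perf_counter()
            build_tokenizer(**config)  # the swept parameter varies per config
            times.append(time.perf_counter() - start)
        results[tuple(sorted(config.items()))] = mean(times)
    return results

# Fixed vocabulary, increasing corpus size (values are examples).
results = benchmark(
    build_tokenizer,
    configs=[{"corpus_words": n, "vocab_size": 4_000}
             for n in (10_000, 50_000, 100_000, 500_000)],
)
```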
Testing by Limiting Vocabulary Complexity
- Fix the corpus size (the amount of text the tokenizer is built from).
- Run the program multiple times, increasing the vocabulary complexity (the number of tokens to assign).
- Every configuration should be run at least three times, and the average runtime should be considered when analyzing results (see the usage snippet below).
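Assuming the hypothetical `benchmark` harness sketched after the previous list, this sweep simply fixes the corpus and varies the vocabulary size (values again are examples):

```python
# Fixed corpus, increasing vocabulary complexity (values are examples).
results = benchmark(
    build_tokenizer,
    configs=[{"corpus_words": 100_000, "vocab_size": v}
             for v in (1_000, 2_000, 4_000, 8_000)],
)
```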