Build a modern LLM from scratch. Every line commented. Explained like we are five.
Updated May 24, 2026 - Jupyter Notebook
Ungreedy subword tokenizer and vocabulary trainer for Python, Go & JavaScript
TokenScript schema, specs and paper
Frames iOS: making native card payments simple
Open morphology for Finnish
A Python 3 module that provides functions for splitting identifiers found in source code files.
Taiwanese Hokkien Transliterator and Tokeniser
Built a complete search engine by creating an inverted index on the 2018 Wikipedia corpus (72 GB). It returns the top search results for the given query words.
Taiwanese Hokkien Transliterator and Tokeniser
Frames Android: making native card payments simple
Educational chatbot trained on a large intents file (education, psychology, sociology, and economics topics). Tokenisation, a Keras LSTM network, extended training, and a simple inference loop for responding to intents.
eqCAT — Tokenised Parametric Earthquake Catastrophe Bond | ERC-20 smart contract enabling fractional on-chain access to the $64B insurance-linked securities market | Solidity 0.8.24 • OpenZeppelin • Hardhat
This project consists of creating a system or tool that can understand a textual description of a process and transform it into an image.
HIPAA-regulated AI integration architecture for Impowr's healthcare platform, serving at-risk populations. Deterministic tokenization on Azure (SQL, Blob, OpenAI) for secure ingestion of sensitive health data; NLP post-processing translates AI outputs into clinical insights for non-technical staff.
Exploratory data analysis and Natural Language Processing techniques applied to a corpus of emails to identify the key topics and their frequencies
An objective diagnostic framework that measures digital maturity
A hybrid tokenization framework that combines coarse semantic context tokens with fine-grained sub-word tokens to improve narrative cohesion and creativity in Large Language Models (LLMs). This research builds upon Meta's Large Concept Models (LCMs) by exploring a quantized, hierarchical generation paradigm ("big picture to small").
A search engine that returns customised recipes ranked by one of three sorting algorithms. Pre-processing and an inverted index improve speed.