Very insightful blog post! IMO tokenization is a part of NLP pipelines that receives way less attention than it should. As an aside, while reading the summary of SuperBPE, I realized that space-agnostic tokenization and other recent improvements go way back to pre-LLM times: