The smallest unit into which a corpus is divided, typically each word form and each punctuation mark (note, however, that contracted forms such as can’t are counted as two tokens in English: ca + n’t). Thus, a corpus contains more tokens than orthographic words.

A tool that divides text into tokens is called a tokeniser, and the process, tokenisation.
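As a minimal sketch, a tokeniser of this kind can be written with a single regular expression. The example below is illustrative only, not any particular tool: it splits off punctuation and follows the convention mentioned above of treating an English contraction such as can’t as two tokens (ca + n’t); other contraction types are handled only crudely via the 'x pattern.

```python
import re

def tokenise(text):
    """Split text into tokens: word forms, punctuation marks,
    and contracted forms counted as two tokens.

    Illustrative sketch only; real tokenisers handle many more cases.
    """
    pattern = re.compile(
        r"\w+(?=n't)"   # stem before a contracted "not", e.g. "ca" in "can't"
        r"|n't"         # the contracted "not" itself, as its own token
        r"|'\w+"        # other clitics such as 's, 're, 'll
        r"|\w+"         # ordinary word forms
        r"|[^\w\s]"     # each punctuation mark as a separate token
    )
    return pattern.findall(text)

# "I can't swim." has 3 orthographic words but yields 5 tokens:
# ['I', 'ca', "n't", 'swim', '.']
```

Because the contraction and each punctuation mark each count as a token, the token count of a text is always at least as large as its word count, as the entry states.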