Version: 0.2.4

Class: BasicTokenizer

text/BasicTokenizer.BasicTokenizer

Constructors

constructor

new BasicTokenizer(config)

Constructs a BasicTokenizer object.

Parameters

| Name | Type | Description |
| --- | --- | --- |
| config | BasicTokenizerConfig | A basic tokenizer configuration object that specifies options such as the non-splittable symbols, lowercasing, and custom punctuation. |

Methods

tokenize

tokenize(text): string[]

Tokenizes text using basic operations: lowercasing, whitespace trimming, and punctuation splitting. Normally used to clean text before passing it to other tokenizers (e.g. wordpiece).

Parameters

| Name | Type | Description |
| --- | --- | --- |
| text | string | The text to be processed |

Returns

string[]
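
The three operations named above (lowercasing, whitespace trimming, punctuation splitting) can be illustrated with a minimal standalone sketch. This is not the library's implementation and it does not use BasicTokenizerConfig: the function `basicTokenizeSketch`, its `lowercase` parameter, and the `ASCII_PUNCTUATION` constant are hypothetical names introduced only for illustration, and the sketch assumes that only ASCII punctuation is split out.

```typescript
// Hypothetical illustration only; NOT the BasicTokenizer implementation.
// Assumption: punctuation means the ASCII punctuation characters listed here.
const ASCII_PUNCTUATION = "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~";

function basicTokenizeSketch(text: string, lowercase: boolean = true): string[] {
  // Step 1: optional lowercasing.
  const normalized = lowercase ? text.toLowerCase() : text;
  const tokens: string[] = [];

  // Step 2: trim the ends and split on runs of whitespace.
  const words = normalized.trim().split(/\s+/);

  // Step 3: emit every punctuation character as its own token.
  for (const word of words) {
    let current = "";
    for (let i = 0; i < word.length; i++) {
      const ch = word.charAt(i);
      if (ASCII_PUNCTUATION.indexOf(ch) !== -1) {
        if (current.length > 0) {
          tokens.push(current);
          current = "";
        }
        tokens.push(ch);
      } else {
        current += ch;
      }
    }
    if (current.length > 0) {
      tokens.push(current);
    }
  }
  return tokens;
}

console.log(basicTokenizeSketch("  Hello, World!  "));
// prints [ 'hello', ',', 'world', '!' ]
```

The resulting tokens are the kind of cleaned input a subword tokenizer such as wordpiece would then process. The actual class may differ in how it handles Unicode punctuation and any symbols configured as non-splittable.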