Version: 0.2.4

Class: WordPieceTokenizer

text/WordpieceTokenizer.WordPieceTokenizer

Constructors

constructor

new WordPieceTokenizer(config)

Constructs a tokenizer from a WordPieceTokenizerConfig object.

Parameters

| Name | Type | Description |
| --- | --- | --- |
| config | WordPieceTokenizerConfig | A tokenizer configuration object that specifies the vocabulary, special tokens, and related options. |

Methods

decode

decode(tokenIds): string

Decodes an array of token IDs to a string using the vocabulary.

Parameters

| Name | Type | Description |
| --- | --- | --- |
| tokenIds | number[] | An array of token IDs derived from the output of the model. |

Returns

string

a string decoded from the output of the model
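
As a rough illustration of what decoding with a vocabulary can look like, the sketch below maps each ID back to its token and merges continuation pieces into the preceding word. It is not the library's implementation: `decodeIds`, `idToToken`, the `##` continuation prefix, the `[UNK]` fallback, and joining words with single spaces are all assumptions made for this example.

```typescript
// Hypothetical helper for illustration only; not part of WordPieceTokenizer.
function decodeIds(
  tokenIds: number[],
  idToToken: string[], // assumed layout: index = token ID, value = token text
  unknownToken: string = "[UNK]",
): string {
  const words: string[] = [];
  for (const id of tokenIds) {
    // Fall back to the unknown token for IDs outside the vocabulary.
    const known = id >= 0 && id < idToToken.length;
    const token = known ? idToToken[id] : unknownToken;
    if (token.indexOf("##") === 0 && words.length > 0) {
      // Continuation piece: append to the previous word without the "##" prefix.
      words[words.length - 1] += token.slice(2);
    } else {
      // Start of a new word.
      words.push(token);
    }
  }
  return words.join(" ");
}

// Toy vocabulary, made up for this example.
const idToToken = ["[UNK]", "un", "##aff", "##able", "the"];
const decoded = decodeIds([4, 1, 2, 3], idToToken);
console.log(decoded); // prints "the unaffable"
```

The actual decode method may treat whitespace and special tokens differently, so its output should be checked against the real vocabulary in use.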


encode

encode(text): number[]

Encodes the raw text input of an NLP model into an array of numbers that can be converted to a tensor.

Parameters

| Name | Type | Description |
| --- | --- | --- |
| text | string | The raw input of the model. |

Returns

number[]

An array of numbers that can be used to create a tensor as model input with the torch.tensor API.
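
To make the token-to-number step concrete, here is a minimal sketch of a vocabulary lookup, assuming the text has already been split into word pieces. `encodePieces`, the `Record<string, number>` vocabulary shape, and the `[UNK]` fallback are hypothetical and chosen for this example; the library's encode method takes raw text and may behave differently.

```typescript
// Hypothetical helper for illustration only; not part of WordPieceTokenizer.
function encodePieces(
  pieces: string[],
  vocab: Record<string, number>, // assumed shape: token text -> token ID
  unknownToken: string = "[UNK]",
): number[] {
  const unknownId = vocab[unknownToken];
  return pieces.map((piece) =>
    // Use an own-property check so words like "constructor" are not
    // accidentally matched against Object.prototype.
    Object.prototype.hasOwnProperty.call(vocab, piece) ? vocab[piece] : unknownId,
  );
}

// Toy vocabulary, made up for this example.
const encodeVocab: Record<string, number> = {
  "[UNK]": 0,
  un: 1,
  "##aff": 2,
  "##able": 3,
  the: 4,
};
const ids = encodePieces(["the", "un", "##aff", "##able", "zzz"], encodeVocab);
console.log(ids); // prints [ 4, 1, 2, 3, 0 ]
```

An array like `ids` is the kind of plain number array that the description above says can be turned into a tensor.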


tokenize

tokenize(text): string[]

Tokenizes a piece of text into its word pieces. This uses a greedy longest-match-first algorithm to perform tokenization using the given vocabulary.

Parameters

| Name | Type | Description |
| --- | --- | --- |
| text | string | The raw input of the model. |

Returns

string[]

An array of vocabulary tokens representing the input text.
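
The greedy longest-match-first strategy described for tokenize can be sketched as follows. This is a simplified illustration under stated assumptions, not the library's code: `wordPieceTokenize` is a hypothetical function, words are split on whitespace only, continuation pieces are assumed to carry a `##` prefix, and a word with no matching pieces is replaced by an assumed `[UNK]` token.

```typescript
// Hypothetical helper for illustration only; not part of WordPieceTokenizer.
function wordPieceTokenize(
  text: string,
  vocab: Record<string, number>, // assumed shape: token text -> token ID
  unknownToken: string = "[UNK]",
): string[] {
  const inVocab = (t: string): boolean =>
    Object.prototype.hasOwnProperty.call(vocab, t);
  const output: string[] = [];
  // Whitespace split only; real tokenizers usually normalize text and
  // split punctuation before this step.
  for (const word of text.trim().split(/\s+/)) {
    if (word.length === 0) continue;
    const pieces: string[] = [];
    let start = 0;
    let failed = false;
    while (start < word.length) {
      let end = word.length;
      let match: string | null = null;
      // Try the longest remaining substring first, then shrink from the right.
      while (start < end) {
        const candidate = (start > 0 ? "##" : "") + word.slice(start, end);
        if (inVocab(candidate)) {
          match = candidate;
          break;
        }
        end -= 1;
      }
      if (match === null) {
        // No prefix of the remainder is in the vocabulary.
        failed = true;
        break;
      }
      pieces.push(match);
      start = end;
    }
    if (failed) {
      output.push(unknownToken);
    } else {
      output.push(...pieces);
    }
  }
  return output;
}

// Toy vocabulary, made up for this example.
const tokenizeVocab: Record<string, number> = {
  "[UNK]": 0,
  un: 1,
  "##aff": 2,
  "##able": 3,
  the: 4,
};
const pieces = wordPieceTokenize("unaffable the xyz", tokenizeVocab);
console.log(pieces); // prints [ 'un', '##aff', '##able', 'the', '[UNK]' ]
```

Because the match is greedy, each step takes the longest piece available at the current position even if a different split would use fewer pieces overall.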