virtex.data.tokenizers

class virtex.data.tokenizers.SentencePieceBPETokenizer(model_path: str)

Bases: object

A tokenizer based on SentencePiece with a BPE sub-routine. It encodes caption strings into lists of tokens.

Parameters: model_path – Path to the .model file trained by SentencePiece.

get_vocab_size() → int: Return the number of tokens in the vocabulary (including special tokens).

token_to_id(token: str) → int: Get the integer ID of a string token (the ID of <unk> if the token does not exist).

id_to_token(token_id: int) → str: Get the string token of an integer ID (<unk> if the ID does not exist).

encode(text: str) → List[int]: Convert a text string to a list of integer token IDs.

decode(token_ids: List[int]) → str: Convert a sequence of token IDs to a text string.
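
Example: the five methods above can be exercised together in an encode/decode round trip. The sketch below is illustrative only. `CaptionTokenizer`, `ToyWhitespaceTokenizer`, and `round_trip` are hypothetical names that are not part of virtex; the toy class mirrors the documented signatures but splits on whitespace rather than running BPE, so that the snippet runs without a trained .model file.

```python
from typing import Dict, List, Protocol


class CaptionTokenizer(Protocol):
    """Hypothetical structural type covering the five documented methods."""

    def get_vocab_size(self) -> int: ...
    def token_to_id(self, token: str) -> int: ...
    def id_to_token(self, token_id: int) -> str: ...
    def encode(self, text: str) -> List[int]: ...
    def decode(self, token_ids: List[int]) -> str: ...


class ToyWhitespaceTokenizer:
    """Stand-in for illustration, NOT part of virtex: whitespace split, no BPE."""

    def __init__(self, words: List[str]) -> None:
        # ID 0 is reserved for <unk> in this toy vocabulary (an assumption
        # made here; a trained SentencePiece model defines its own IDs).
        self._id_to_token: List[str] = ["<unk>"] + list(words)
        self._token_to_id: Dict[str, int] = {
            token: index for index, token in enumerate(self._id_to_token)
        }

    def get_vocab_size(self) -> int:
        return len(self._id_to_token)

    def token_to_id(self, token: str) -> int:
        # Unknown tokens fall back to the ID of <unk>.
        return self._token_to_id.get(token, 0)

    def id_to_token(self, token_id: int) -> str:
        # Out-of-range IDs fall back to the <unk> token.
        if 0 <= token_id < len(self._id_to_token):
            return self._id_to_token[token_id]
        return "<unk>"

    def encode(self, text: str) -> List[int]:
        return [self.token_to_id(token) for token in text.split()]

    def decode(self, token_ids: List[int]) -> str:
        return " ".join(self.id_to_token(token_id) for token_id in token_ids)


def round_trip(tokenizer: CaptionTokenizer, caption: str) -> str:
    """Encode a caption to IDs, check them against the vocabulary, decode back."""
    token_ids = tokenizer.encode(caption)
    vocab_size = tokenizer.get_vocab_size()
    if not all(0 <= token_id < vocab_size for token_id in token_ids):
        raise ValueError("encode() produced an ID outside the vocabulary")
    return tokenizer.decode(token_ids)


tokenizer = ToyWhitespaceTokenizer(["a", "dog", "on", "the", "beach"])
restored = round_trip(tokenizer, "a dog on the beach")
with_unknown = round_trip(tokenizer, "a cat on the beach")  # "cat" is out of vocabulary
```

With virtex installed, a SentencePieceBPETokenizer built from your own trained .model file should be usable in place of the toy class, since `round_trip` relies only on the documented methods. A real BPE model may normalize text, so the decoded string is not guaranteed to match the input exactly.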