virtex.modules.textual_heads


A textual head accepts visual features from the visual backbone and performs task-specific modeling (captioning, classification, etc.) to predict an output distribution over vocabulary tokens for one or multiple time-steps in the batch.

class virtex.modules.textual_heads.TextualHead(visual_feature_size: int, vocab_size: int, hidden_size: int)[source]

Bases: torch.nn.modules.module.Module

Base class for all textual heads. All child classes could simply inherit from Module directly; however, this base class is kept for uniform type annotations.

Parameters
  • visual_feature_size – Size (number of channels) of the input features from the visual backbone.

  • vocab_size – Number of tokens in the output vocabulary.

  • hidden_size – Size of the token embedding vectors, or hidden state vector of the language model.

property textual_feature_size

Size of the last dimension of the output right before the output linear layer (which predicts a distribution over vocabulary tokens). This is typically the same as hidden_size for most modules. This property is used when adding more modules on top of this head.
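A minimal sketch of a custom subclass using the interface documented above; the pooling strategy and attribute names here are illustrative and not part of the library:

import torch
from torch import nn
from virtex.modules.textual_heads import TextualHead

class MeanPoolTextualHead(TextualHead):
    # Hypothetical head: average-pool visual features spatially, then
    # project them to a distribution over vocabulary tokens.
    def __init__(self, visual_feature_size: int, vocab_size: int, hidden_size: int):
        super().__init__(visual_feature_size, vocab_size, hidden_size)
        self.visual_projection = nn.Linear(visual_feature_size, hidden_size)
        # textual_feature_size (hidden_size here) feeds the output layer.
        self.output = nn.Linear(self.textual_feature_size, vocab_size)

    def forward(self, visual_features, caption_tokens=None, caption_lengths=None):
        # (batch_size, channels, height, width) -> (batch_size, channels)
        pooled = visual_features.mean(dim=(2, 3))
        return self.output(self.visual_projection(pooled))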

class virtex.modules.textual_heads.LinearTextualHead(visual_feature_size: int, vocab_size: int, **kwargs)[source]

Bases: virtex.modules.textual_heads.TextualHead

A textual head containing a single linear layer projecting from the visual feature size to the output vocabulary size.

Parameters
  • visual_feature_size – Size (number of channels) of the input features from the visual backbone.

  • vocab_size – Number of tokens in the output vocabulary.

forward(visual_features: torch.Tensor, caption_tokens: Optional[torch.Tensor] = None, caption_lengths: Optional[torch.Tensor] = None) torch.Tensor[source]

Project visual features directly to predict a distribution over vocabulary tokens through a single linear layer. This textual head ignores the arguments caption_tokens and caption_lengths; they are accepted only for API consistency.

Parameters

visual_features – A tensor of shape (batch_size, channels, height, width) containing features from the visual backbone.

Returns

A tensor of shape (batch_size, vocab_size) containing output vocabulary logits.
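A minimal usage sketch with dummy inputs, assuming the tensor shapes documented above; the feature size, vocabulary size, and spatial dimensions are illustrative:

import torch
from virtex.modules.textual_heads import LinearTextualHead

textual_head = LinearTextualHead(visual_feature_size=2048, vocab_size=10000)

# Dummy features shaped (batch_size, channels, height, width).
visual_features = torch.randn(8, 2048, 7, 7)
logits = textual_head(visual_features)
print(logits.shape)  # expected: torch.Size([8, 10000])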

class virtex.modules.textual_heads.TransformerDecoderTextualHead(visual_feature_size: int, vocab_size: int, hidden_size: int, num_layers: int, attention_heads: int, feedforward_size: int, dropout: float = 0.1, norm_first: bool = False, mask_future_positions: bool = True, max_caption_length: int = 30, padding_idx: int = 0)[source]

Bases: virtex.modules.textual_heads.TextualHead

A textual head composed of four main modules: (1) an input projection (linear layer) for visual features to match their size with textual features, (2) word and positional embeddings for input captions, (3) a unidirectional transformer decoder, and (4) an output projection (linear layer) to predict a distribution over vocabulary tokens. The word embedding weights are tied with the output projection; the latter still has its own learnable bias.

Note

For the “bicaptioning” pretraining task, our textual head (as defined in the paper) must have two transformer decoders: one to decode the caption in each direction. Each object of this class, however, always contains a single transformer decoder.

Refer to the BidirectionalCaptioningModel source to understand how an object of this class is cloned, with the embedding and output weights tied, for bicaptioning.

Hence, while there are two objects of this class, they pragmatically form a single textual head as a whole, according to the terminology used in the paper.

Parameters
  • visual_feature_size – Size (number of channels) of the input features from the visual backbone.

  • vocab_size – Number of tokens in the output vocabulary.

  • hidden_size – Size of the token embedding vectors, or hidden state vector of the language model.

  • num_layers – Number of layers in the transformer.

  • attention_heads – Number of attention heads in the transformer.

  • feedforward_size – Size of feedforward layers in the transformer.

  • dropout – Dropout probability for transformer (applied after layernorm).

  • norm_first – Whether to apply layer normalization before or after the attention/feedforward layers. The former are pre-norm variants (like GPT-2) and the latter are post-norm variants (like BERT). Default is post-norm (False).

  • mask_future_positions – Whether to mask future positions for self-attention over caption tokens. This must be True for captioning (and bicaptioning) tasks to prevent the language model from cheating, and False for masked language modeling, as the self-attention should consider all tokens.

  • max_caption_length – Maximum length of input captions; this is used to create a fixed positional embedding lookup table.

  • padding_idx – Token index of [PAD] token, word embedding for these tokens will be a vector of zeroes (and not trainable).
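
An instantiation sketch; the hyperparameter values below are illustrative and not the defaults of any particular VirTex config:

from virtex.modules.textual_heads import TransformerDecoderTextualHead

textual_head = TransformerDecoderTextualHead(
    visual_feature_size=2048,     # e.g. channels of a ResNet-50 backbone
    vocab_size=10000,
    hidden_size=512,
    num_layers=1,
    attention_heads=8,
    feedforward_size=2048,
    dropout=0.1,
    mask_future_positions=True,   # required for (bi)captioning
    max_caption_length=30,
    padding_idx=0,
)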

static _init_weights(module)[source]

Initialize weights like BERT: weights drawn from a normal distribution with mean 0.0 and standard deviation 0.02, biases set to zero.
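A sketch of such an initialization, assuming BERT-style behaviour as described above; the exact module checks in the library may differ:

import torch.nn as nn

def init_weights(module):
    # Linear and embedding weights ~ N(0.0, 0.02); linear biases set to zero.
    if isinstance(module, (nn.Linear, nn.Embedding)):
        module.weight.data.normal_(mean=0.0, std=0.02)
    if isinstance(module, nn.Linear) and module.bias is not None:
        module.bias.data.zero_()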

forward(visual_features: torch.Tensor, caption_tokens: torch.Tensor, caption_lengths: torch.Tensor) torch.Tensor[source]

Given (projected) visual features from the visual backbone and caption tokens, predict the output logits for the next time-step.

Parameters
  • visual_features – A tensor of shape (batch_size, channels, height, width) containing features from the visual backbone.

  • caption_tokens – A tensor of shape (batch_size, max_caption_length) of caption tokens, right-padded with padding_idx.

  • caption_lengths – A tensor of shape (batch_size, ) containing lengths of caption tokens in the batch.

Returns

A tensor of shape (batch_size, max_caption_length, vocab_size) containing output vocabulary logits for each time-step.
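A forward-pass sketch with dummy inputs, continuing from the instantiation sketch above; shapes follow the documentation, and the values are illustrative:

import torch

batch_size, max_caption_length = 8, 30
visual_features = torch.randn(batch_size, 2048, 7, 7)
caption_tokens = torch.randint(1, 10000, (batch_size, max_caption_length))
caption_lengths = torch.full((batch_size,), max_caption_length, dtype=torch.long)

output_logits = textual_head(visual_features, caption_tokens, caption_lengths)
print(output_logits.shape)  # expected: torch.Size([8, 30, 10000])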

static make_future_mask(size: int, dtype: torch.dtype, device: torch.device) torch.Tensor

Generate a mask for “future” positions. Masked positions will be negative infinity. This mask is critical for causal language modeling.
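
A sketch of a mask satisfying this contract (not necessarily the library's exact implementation): entries above the diagonal are set to -inf so that position i cannot attend to any position j > i.

import torch

def future_mask(size: int, dtype: torch.dtype, device: torch.device) -> torch.Tensor:
    return torch.triu(
        torch.full((size, size), float("-inf"), dtype=dtype, device=device),
        diagonal=1,
    )

print(future_mask(4, torch.float32, torch.device("cpu")))
# tensor([[0., -inf, -inf, -inf],
#         [0., 0., -inf, -inf],
#         [0., 0., 0., -inf],
#         [0., 0., 0., 0.]])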