virtex.config


class virtex.config.Config(config_file: Optional[str] = None, override_list: List[Any] = [])[source]

Bases: object

This class provides package-wide configuration management. It is a nested dict-like structure, with nested keys accessible as attributes. It contains sensible default values, which can be overridden first by a YAML file and then by a list of attribute names and values.

An instantiated object is immutable: modifying any attribute afterwards is illegal. Required parameter values must instead be overridden through the config_file or override_list arguments.

Parameters
  • config_file – Path to a YAML file containing config parameters.

  • override_list – A list of alternating attribute names and values. These overrides are applied after those from the YAML file.

Examples

Let a YAML file named "config.yaml" specify these parameters to override:

OPTIM:
  BATCH_SIZE: 512
  LR: 0.01

>>> _C = Config("config.yaml", ["OPTIM.BATCH_SIZE", 1024])
>>> _C.OPTIM.LR  # default: 0.001
0.01
>>> _C.OPTIM.BATCH_SIZE  # default: 256, file: 512
1024
dump(file_path: str)[source]

Save config at the specified file path.

Parameters

file_path – Path to save config file (YAML).
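
For example, the fully resolved config can be written back out after overriding (hypothetical paths for illustration; this assumes ``dump`` writes the YAML and returns nothing):

>>> _C = Config("config.yaml", ["OPTIM.BATCH_SIZE", 1024])
>>> _C.dump("/tmp/resolved_config.yaml")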

Config References

from fvcore.common.config import CfgNode as CN

_C = CN()

# Random seed for reproducibility across runs.
_C.RANDOM_SEED = 0
# Train with Automatic Mixed Precision (native PyTorch).
_C.AMP = True
# Set CUDNN deterministic flag (torch.backends.cudnn.deterministic).
# Setting this will ensure exact results on every run at the cost of a
# slight slowdown. Good for debugging.
_C.CUDNN_DETERMINISTIC = False
# Set CUDNN benchmark flag (torch.backends.cudnn.benchmark). Enables
# CUDNN to select the fastest implementation for operations based on GPU.
# May change results (in decimals) on different hardware, but faster
# to train. Turn off while debugging.
_C.CUDNN_BENCHMARK = True
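
# The seed and CUDNN flags above map directly onto Python/NumPy/PyTorch
# switches. A minimal sketch of how a training script might consume them
# (illustrative; not part of the config defaults):

def apply_reproducibility_flags(config=_C):
    import random
    import numpy as np
    import torch

    random.seed(config.RANDOM_SEED)
    np.random.seed(config.RANDOM_SEED)
    torch.manual_seed(config.RANDOM_SEED)
    torch.backends.cudnn.deterministic = config.CUDNN_DETERMINISTIC
    torch.backends.cudnn.benchmark = config.CUDNN_BENCHMARK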

# ---------------------------------------------------------------------
#   Data paths and parameters related to dataloading.
# ---------------------------------------------------------------------
_C.DATA = CN()

# Path to the dataset root, structured as described in the README. Path
# is assumed to be relative to the project root.
_C.DATA.ROOT = "datasets/coco"
# Path to the .model file generated by ``sentencepiece``.
_C.DATA.TOKENIZER_MODEL = "datasets/vocab/coco_10k.model"

# Handy config params for vocab size and indices of special tokens.
# While these can be picked up from the tokenizer, having these in
# the config makes it easy to create a model without instantiating too
# many tokenizer instances (especially when not needed, e.g. model zoo).
# These must match what is present in ``TOKENIZER_MODEL`` above.
_C.DATA.VOCAB_SIZE = 10000
# Index of out-of-vocabulary (and padding) token.
_C.DATA.UNK_INDEX = 0
# Index of the start-of-sentence [SOS] token.
_C.DATA.SOS_INDEX = 1
# Index of the end-of-sentence [EOS] token.
_C.DATA.EOS_INDEX = 2
# Index of the word masking token. While not used for captioning, having
# this extra token makes it possible to train an MLM model without
# re-creating a new vocab mapping.
_C.DATA.MASK_INDEX = 3

# Size of the image (square) to crop from the original input image.
_C.DATA.IMAGE_CROP_SIZE = 224
# Maximum length of input caption (number of tokens).
# Longer captions will be truncated to this length.
_C.DATA.MAX_CAPTION_LENGTH = 30
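
# A sketch of how the special token indices and the length cap combine
# when preparing a single caption (illustrative; actual batching lives
# in the dataloaders, and the helper below is hypothetical):

def prepare_caption(token_ids, config=_C):
    max_len = config.DATA.MAX_CAPTION_LENGTH
    # Reserve two positions for [SOS] and [EOS], truncating if needed.
    ids = [config.DATA.SOS_INDEX] + list(token_ids)[: max_len - 2]
    ids.append(config.DATA.EOS_INDEX)
    # Pad to fixed length with the UNK (= padding) index.
    return ids + [config.DATA.UNK_INDEX] * (max_len - len(ids))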

# List of image transforms (pre-processing and data augmentation) to be
# applied sequentially (always or randomly) during training and
# validation. Refer to ``virtex/factories.py`` for all possible transforms.
_C.DATA.IMAGE_TRANSFORM_TRAIN = [
    "random_resized_crop",
    "horizontal_flip",
    "color_jitter",
    "normalize",
]
_C.DATA.IMAGE_TRANSFORM_VAL = [
    "smallest_resize",
    "center_crop",
    "normalize",
]
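
# The names above roughly correspond to standard torchvision transforms.
# A sketch of an equivalent validation pipeline (an assumption for
# illustration only; the package builds transforms via its own factories):

def build_val_transform(crop_size=224):
    from torchvision import transforms
    return transforms.Compose([
        transforms.Resize(256),            # "smallest_resize"
        transforms.CenterCrop(crop_size),  # "center_crop"
        transforms.ToTensor(),
        # "normalize" with the usual ImageNet statistics.
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])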

# Hyper-parameters for the masked LM pretraining task. These are only
# used when ``MODEL.NAME`` is "masked_lm".
_C.DATA.MASKED_LM = CN()
# Fraction of tokens to choose for masking; this must be less than 1.
_C.DATA.MASKED_LM.MASK_PROPORTION = 0.15
# Probability to replace chosen tokens with the [MASK] token.
_C.DATA.MASKED_LM.MASK_PROBABILITY = 0.85
# Probability to replace chosen tokens with a random token.
_C.DATA.MASKED_LM.REPLACE_PROBABILITY = 0.10
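
# How the three probabilities interact, in the style of BERT masking
# (a sketch under that assumption; the hypothetical helper below is not
# the package's dataloader code):

def mask_tokens(token_ids, config=_C):
    import random
    out = list(token_ids)
    for i in range(len(out)):
        # Choose each token for masking with MASK_PROPORTION.
        if random.random() < config.DATA.MASKED_LM.MASK_PROPORTION:
            r = random.random()
            if r < config.DATA.MASKED_LM.MASK_PROBABILITY:
                out[i] = config.DATA.MASK_INDEX  # replace with [MASK]
            elif r < (config.DATA.MASKED_LM.MASK_PROBABILITY
                      + config.DATA.MASKED_LM.REPLACE_PROBABILITY):
                out[i] = random.randrange(config.DATA.VOCAB_SIZE)  # random token
            # Otherwise (remaining 5%) keep the original token.
    return out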

# ---------------------------------------------------------------------
#   Model architecture: visual backbone and textual head.
# ---------------------------------------------------------------------
_C.MODEL = CN()

# Name of model, based on pretraining task.
# Possible choices: {"token_classification", "multilabel_classification",
# "captioning", "bicaptioning", "masked_lm", "virtex"}
_C.MODEL.NAME = "virtex"

_C.MODEL.VISUAL = CN()
# Name of visual backbone. Possible choices: {"blind", "torchvision"}
# Models from torchvision can be specified as shown below.
_C.MODEL.VISUAL.NAME = "torchvision::resnet50"
# Number of channels in pooled spatial features of visual backbone.
_C.MODEL.VISUAL.FEATURE_SIZE = 2048
# Whether to load ImageNet pretrained weights into visual backbone.
_C.MODEL.VISUAL.PRETRAINED = False
# Whether to keep visual backbone frozen and train only textual head.
_C.MODEL.VISUAL.FROZEN = False

_C.MODEL.TEXTUAL = CN()
# Name of textual head. Set to "none" for MODEL.NAME = "*_classification".
# Possible choices: {"transdec_postnorm", "transdec_prenorm"}.
# Architectural hyper-parameters are encoded in the name, as shown below
# (see also the parsing sketch after this block).
_C.MODEL.TEXTUAL.NAME = "transdec_postnorm::L1_H2048_A32_F8192"
# L = Number of layers in the transformer.
# H = Hidden size of the transformer (embeddings, attention features).
# A = Number of attention heads in the transformer.
# F = Size of feedforward layers in the transformer.
# Typically, we have (A = H / 64) and (F = 4 * H).

# Dropout probability for embedding, hidden features in textual head.
_C.MODEL.TEXTUAL.DROPOUT = 0.1
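
# A sketch of how such a name string could be parsed into architectural
# hyper-parameters (hypothetical helper, not the package's factory code):

def parse_textual_name(name):
    import re
    arch, spec = name.split("::")
    fields = {k: int(v) for k, v in re.findall(r"([LHAF])(\d+)", spec)}
    return arch, fields

# parse_textual_name("transdec_postnorm::L1_H2048_A32_F8192")
# -> ("transdec_postnorm", {"L": 1, "H": 2048, "A": 32, "F": 8192})
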
_C.MODEL.DECODER = CN()
# What algorithm to use for decoding. Supported values: {"beam_search",
# "nucleus_sampling"}.
_C.MODEL.DECODER.NAME = "beam_search"
# Number of beams to decode (1 = greedy decoding). Ignored when decoding
# through nucleus sampling.
_C.MODEL.DECODER.BEAM_SIZE = 5
# Size of nucleus for sampling predictions. Ignored when decoding through
# beam search.
_C.MODEL.DECODER.NUCLEUS_SIZE = 0.9
# Maximum length of decoded caption. Decoding may end earlier when [EOS]
# token is sampled.
_C.MODEL.DECODER.MAX_DECODING_STEPS = _C.DATA.MAX_CAPTION_LENGTH
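
# A sketch of one step of nucleus (top-p) sampling with p = NUCLEUS_SIZE
# (illustrative; the package ships its own decoder implementations):

def nucleus_sample_step(logits, p=0.9):
    import torch
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = probs.sort(dim=-1, descending=True)
    # Keep the smallest prefix of tokens whose cumulative mass reaches p.
    keep = sorted_probs.cumsum(dim=-1) - sorted_probs < p
    keep[..., 0] = True  # always keep at least the top token
    sorted_probs = sorted_probs * keep
    # multinomial renormalizes the surviving (non-zero) probabilities.
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_idx.gather(-1, choice)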

# ---------------------------------------------------------------------
#   Optimization hyper-parameters, default values are for pretraining
#   our best model on bicaptioning task (COCO Captions).
# ---------------------------------------------------------------------
_C.OPTIM = CN()

# Name of optimizer to use. Supported values: {"sgd", "adamw"}.
# AdamW uses default (beta1, beta2) values from PyTorch.
_C.OPTIM.OPTIMIZER_NAME = "sgd"
# Momentum coefficient for SGD. Ignored for AdamW.
_C.OPTIM.SGD_MOMENTUM = 0.9
# Weight decay coefficient for the optimizer.
_C.OPTIM.WEIGHT_DECAY = 0.0001
# Regex pattern of params for which there will be no weight decay.
_C.OPTIM.NO_DECAY = ".*textual.(embedding|transformer).*(norm.*|bias)"
# Max gradient norm for clipping to avoid exploding gradients.
_C.OPTIM.CLIP_GRAD_NORM = 10.0
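
# A sketch of how the regex splits parameters into decay / no-decay
# groups (illustrative; ``model`` is assumed to be a torch.nn.Module):

def make_param_groups(model, config=_C):
    import re
    pattern = re.compile(config.OPTIM.NO_DECAY)
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        (no_decay if pattern.match(name) else decay).append(param)
    return [
        {"params": decay, "weight_decay": config.OPTIM.WEIGHT_DECAY},
        {"params": no_decay, "weight_decay": 0.0},
    ]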

# Wrap our optimizer with Lookahead (https://arxiv.org/abs/1907.08610).
_C.OPTIM.LOOKAHEAD = CN()
_C.OPTIM.LOOKAHEAD.USE = True
_C.OPTIM.LOOKAHEAD.ALPHA = 0.5
_C.OPTIM.LOOKAHEAD.STEPS = 5
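
# Lookahead keeps a second, "slow" copy of the weights phi and, every
# STEPS inner-optimizer updates of the fast weights theta, interpolates:
#     phi <- phi + ALPHA * (theta - phi);  then  theta <- phi
# (update rule from the paper linked above).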

# We set different learning rates for the CNN (visual backbone) and the
# rest of the model. The CNN LR is typically much higher when training
# from scratch. Both LRs follow the same warmup-decay schedule.

# Total batch size (will be distributed evenly across GPUs).
_C.OPTIM.BATCH_SIZE = 256
# Max learning rate for the CNN (visual backbone).
_C.OPTIM.CNN_LR = 0.2
# Max learning rate for the rest of the model.
_C.OPTIM.LR = 0.001
# Number of iterations to train for; batches are randomly sampled.
_C.OPTIM.NUM_ITERATIONS = 500000

# Number of steps at the start of training for linear LR warmup.
_C.OPTIM.WARMUP_STEPS = 10000
# Learning rate annealing schedule for decay after warmup.
# Possible choices: {"none", "linear", "cosine", "multistep"}.
_C.OPTIM.LR_DECAY_NAME = "cosine"
# Steps to decay LR for "multistep" schedule.
_C.OPTIM.LR_STEPS = []
# Factor to multiply with LR for "multistep" schedule.
_C.OPTIM.LR_GAMMA = 0.1
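
# A sketch of the LR multiplier these values describe for the default
# "cosine" choice (illustrative; the package builds its schedulers in
# its own factories):

def lr_multiplier(step, config=_C):
    import math
    warmup = config.OPTIM.WARMUP_STEPS
    total = config.OPTIM.NUM_ITERATIONS
    if step < warmup:
        return step / max(1, warmup)  # linear warmup from 0 to 1
    progress = (step - warmup) / max(1, total - warmup)
    return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay to 0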