This module is a collection of metrics commonly used during pretraining and downstream evaluation. The two main classes here are TopkAccuracy and CocoCaptionsEvaluator.

Parts of this module (tokenize(), cider() and spice()) are adapted from coco-captions evaluation code.

class virtex.utils.metrics.TopkAccuracy(k: int = 1)[source]

Bases: object

Top-K classification accuracy. This class can accumulate per-batch accuracy that can be retrieved at the end of evaluation. Targets and predictions are assumed to be integers (long tensors).

If used in DistributedDataParallel, results need to be aggregated across GPU processes outside this class.


Parameters

k – k for computing Top-K accuracy.
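To illustrate what this class does, here is a minimal, framework-free sketch of accumulating Top-K accuracy over batches. The real class operates on long tensors, and the method names below are illustrative stand-ins, not the documented API.

```python
class TopkAccuracySketch:
    """Accumulate Top-K classification accuracy across batches (sketch)."""

    def __init__(self, k: int = 1):
        self.k = k
        self.correct = 0
        self.total = 0

    def update(self, scores, targets):
        # scores: per-example lists of per-class scores; targets: int labels.
        for row, target in zip(scores, targets):
            # Indices of the k highest-scoring classes for this example.
            topk = sorted(range(len(row)), key=row.__getitem__, reverse=True)[: self.k]
            self.correct += int(target in topk)
            self.total += 1

    def get_metric(self):
        # Fraction of examples whose target appeared in the top-k predictions.
        return self.correct / max(self.total, 1)
```

As the note above says, when training with DistributedDataParallel the `correct` and `total` counts would still need to be summed across processes before computing the final ratio.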

class virtex.utils.metrics.CocoCaptionsEvaluator(gt_annotations_path: str)[source]

Bases: object

A helper class to evaluate caption predictions in COCO format. This uses cider() and spice(), which exactly follow the original COCO Captions evaluation protocol.


Parameters

gt_annotations_path – Path to ground truth annotations in COCO format (typically this would be COCO Captions val2017 split).

evaluate(preds: List[Dict[str, Any]]) → Dict[str, float][source]

Compute CIDEr and SPICE scores for predictions.


Parameters

preds – List of per-instance predictions in COCO Captions format: [ {"image_id": int, "caption": str} ...].


Returns

Computed metrics; a dict with keys {"CIDEr", "SPICE"}.
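A usage sketch of the documented interface follows. The annotation path and image ids are placeholders, and the evaluation call is left commented out because it requires a real ground-truth annotations file; only the shape of `preds` is exercised here.

```python
# from virtex.utils.metrics import CocoCaptionsEvaluator

# One prediction per image, in the COCO Captions format expected by evaluate().
preds = [
    {"image_id": 42, "caption": "a dog runs across a grassy field"},
    {"image_id": 73, "caption": "two people riding bikes on a street"},
]

# Hypothetical path; point this at your COCO Captions val2017 annotations.
# evaluator = CocoCaptionsEvaluator("datasets/coco/annotations/captions_val2017.json")
# metrics = evaluator.evaluate(preds)  # dict with keys {"CIDEr", "SPICE"}
```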

virtex.utils.metrics.tokenize(image_id_to_captions: Dict[int, List[str]]) → Dict[int, List[str]][source]

Given a mapping of image id to a list of corresponding captions, tokenize captions in place according to the Penn Treebank Tokenizer. This method assumes the presence of the Stanford CoreNLP JAR file in the directory of this module.
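The following sketch only illustrates the input/output mapping shape of this function. It is not the real tokenizer: the actual implementation shells out to the Stanford CoreNLP PTBTokenizer JAR, whereas this stand-in merely lowercases and strips punctuation.

```python
import re
from typing import Dict, List


def tokenize_sketch(image_id_to_captions: Dict[int, List[str]]) -> Dict[int, List[str]]:
    # Crude approximation of PTB tokenization: lowercase, keep alphanumeric
    # tokens, and rejoin with single spaces (the real output is also a string
    # of space-separated tokens per caption).
    out = {}
    for image_id, captions in image_id_to_captions.items():
        out[image_id] = [
            " ".join(re.findall(r"[a-z0-9]+", caption.lower()))
            for caption in captions
        ]
    return out
```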

virtex.utils.metrics.cider(predictions: Dict[int, List[str]], ground_truth: Dict[int, List[str]], n: int = 4, sigma: float = 6.0) → float[source]

Compute CIDEr score given ground truth captions and predictions.
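To clarify what CIDEr measures, here is a simplified sketch of its core idea: cosine similarity between TF-IDF-weighted n-gram vectors of a candidate caption and its references. The real cider() additionally averages over n-gram sizes 1..n, applies a length-difference Gaussian penalty controlled by sigma, and uses corpus-level document frequencies; this sketch computes document frequency over the reference set only and handles a single n.

```python
from collections import Counter
from math import log, sqrt


def ngrams(tokens, n):
    # All contiguous n-grams of a token list, as tuples.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def cider_sketch(candidate: str, references, n: int = 1) -> float:
    # Document frequency of each n-gram over the reference set
    # (a stand-in for the corpus-level DF used by the real metric).
    df = Counter()
    for ref in references:
        df.update(set(ngrams(ref.split(), n)))
    num_docs = len(references)

    def tfidf(tokens):
        counts = Counter(ngrams(tokens, n))
        # Smoothed IDF so unseen n-grams still get a finite weight.
        return {g: c * log((num_docs + 1) / (df[g] + 1)) for g, c in counts.items()}

    cand_vec = tfidf(candidate.split())
    # Average cosine similarity of the candidate against each reference.
    sims = []
    for ref in references:
        ref_vec = tfidf(ref.split())
        dot = sum(cand_vec.get(g, 0.0) * w for g, w in ref_vec.items())
        norm = sqrt(sum(v * v for v in cand_vec.values())) * sqrt(
            sum(v * v for v in ref_vec.values())
        )
        sims.append(dot / norm if norm else 0.0)
    return sum(sims) / len(sims)
```

The TF-IDF weighting is what distinguishes CIDEr from plain n-gram overlap: n-grams that appear in many references (e.g. "a", "the") receive low weight, so agreement on rare, informative n-grams dominates the score.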

virtex.utils.metrics.spice(predictions: Dict[int, List[str]], ground_truth: Dict[int, List[str]]) → float[source]

Compute SPICE score given ground truth captions and predictions.