Releases: Lightning-AI/torchmetrics

JOSS paper

10 Feb 17:04

[0.7.2] - 2022-02-10


  • Minor patches in JOSS paper.

Improve mAP performance

03 Feb 20:42

[0.7.1] - 2022-02-03


  • Used torch.bucketize in calibration error when torch>1.8 for faster computations (#769)
  • Improve mAP performance (#742)


  • Fixed check for available modules (#772)
  • Fixed Matthews correlation coefficient when the denominator is 0 (#781)
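
To make the zero-denominator case concrete, here is a minimal plain-Python sketch of the binary Matthews correlation coefficient computed from confusion-matrix counts. The function name `binary_mcc` is hypothetical, and returning 0 when the denominator vanishes is an assumed convention; the note above only states that this case was fixed, not which value is returned.

```python
import math


def binary_mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Binary Matthews correlation coefficient from confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Assumed convention: a degenerate confusion matrix (for example,
        # only one class present) yields 0 instead of a division error.
        return 0.0
    return numerator / denominator


# All samples are true positives, so every product in the denominator is 0.
degenerate = binary_mcc(tp=5, tn=0, fp=0, fn=0)
# A perfect classifier on a balanced set.
perfect = binary_mcc(tp=2, tn=2, fp=0, fn=0)
```

Without the guard, the degenerate case would raise `ZeroDivisionError` in plain Python (or produce `nan` in tensor code).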


@Borda, @ramonemiliani93, @SkafteNicki, @twsl

If we forgot someone due to not matching commit email with GitHub account, let us know :]

New NLP metrics and improved API

17 Jan 18:33

We are excited to announce that TorchMetrics v0.7 is now publicly available. This release is pretty significant. It includes several new metrics (mainly for NLP), naming and import changes, general improvements to the API, and some other great features. TorchMetrics now has more than 60 metrics, and the package is more user-friendly than ever.

NLP metrics - Text package

The text package has been part of TorchMetrics since v0.5. With the growing capability of language generation models, there is a real need for reliable evaluation metrics. With several added metrics and a unified API, TorchMetrics makes using them even easier. TorchMetrics v0.7 adds several machine translation metrics: chrF, chrF++, Translation Edit Rate, and Extended Edit Distance. It also adds Match Error Rate, Word Information Lost, Word Information Preserved, and SQuAD evaluation metrics. Last but not least, the ROUGE score can now be evaluated against multiple references.

Argument unification

Importantly, all text metrics now assume the input order preds, target, and both are available as explicit keyword arguments. Any different argument naming used before v0.7 is deprecated and will be removed completely in v0.8.
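
As an illustration of the unified signature, the following plain-Python sketch computes a word error rate with `preds` first and `target` second. The function `toy_word_error_rate` is a hypothetical stand-in written for this example, not the library implementation; only the argument names and their order are taken from the text above.

```python
from typing import Sequence


def toy_word_error_rate(preds: Sequence[str], target: Sequence[str]) -> float:
    """Word-level edit distance summed over sentences, divided by reference length."""
    errors = 0
    total = 0
    for pred, ref in zip(preds, target):
        pred_words, ref_words = pred.split(), ref.split()
        # Levenshtein distance via dynamic programming, one row at a time.
        prev = list(range(len(ref_words) + 1))
        for i, pred_word in enumerate(pred_words, start=1):
            cur = [i]
            for j, ref_word in enumerate(ref_words, start=1):
                cur.append(
                    min(
                        prev[j] + 1,  # extra word in the prediction
                        cur[j - 1] + 1,  # word missing from the prediction
                        prev[j - 1] + (pred_word != ref_word),  # substitution
                    )
                )
            prev = cur
        errors += prev[-1]
        total += len(ref_words)
    return errors / total


# Keyword arguments make the call robust to argument order.
score = toy_word_error_rate(target=["the cat sat down"], preds=["the cat sat"])
```

Passing `preds` and `target` by keyword, as in the last line, is the safest way to write code that behaves the same before and after the reordering.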

Import and naming changes

TorchMetrics v0.7 brings changes, some extensive and some minor, to how metrics should be imported. The import changes take effect directly in v0.7, meaning that you will most likely need to change the import statement for some specific metrics. All naming changes follow our standard deprecation process: in v0.7, any renamed metric still works under its old name but emits a warning asking you to use the new name. From v0.8, the old metric names will no longer be available.
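
A renamed metric that keeps working under its old name can be pictured as a thin alias that forwards to the new name. The sketch below uses only the standard `warnings` module; the bodies of `f1_score` and `f1` are simplified stand-ins written for this example (binary labels, plain lists) and are an assumption about the general pattern, not the library's actual code.

```python
import warnings
from typing import Sequence


def f1_score(preds: Sequence[int], target: Sequence[int]) -> float:
    """New name: binary F1 computed from label lists (simplified stand-in)."""
    tp = sum(1 for p, t in zip(preds, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(preds, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(preds, target) if p == 0 and t == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1(preds: Sequence[int], target: Sequence[int]) -> float:
    """Old name: still works, but points the caller to the new name."""
    warnings.warn(
        "`f1` was renamed to `f1_score` and the old name will be removed.",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller's line
    )
    return f1_score(preds, target)
```

Running a test suite with `python -W error::DeprecationWarning` turns such warnings into errors, which helps locate old names before they disappear.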

[0.7.0] - 2022-01-17


  • Added NLP metrics:
    • MatchErrorRate (#619)
    • WordInfoLost and WordInfoPreserved (#630)
    • SQuAD (#623)
    • CHRFScore (#641)
    • TranslationEditRate (#646)
    • ExtendedEditDistance (#668)
  • Added MultiScaleSSIM into image metrics (#679)
  • Added Signal to Distortion Ratio (SDR) to audio package (#565)
  • Added MinMaxMetric to wrappers (#556)
  • Added ignore_index to retrieval metrics (#676)
  • Added support for multi references in ROUGEScore (#680)
  • Added a default VSCode devcontainer configuration (#621)


  • Scalar metrics will now consistently have additional dimensions squeezed (#622)
  • Metrics having third party dependencies removed from global import (#463)
  • BLEUScore now takes untokenized input, consistent with all the other text metrics (#640)
  • Arguments reordered for TER, BLEUScore, SacreBLEUScore, CHRFScore; the expected input order is now predictions first and target second (#696)
  • Changed dtype of metric state from torch.float to torch.long in ConfusionMatrix to accommodate larger values (#715)
  • Unify preds, target input argument's naming across all text metrics (#723, #727)
    • bert, bleu, chrf, sacre_bleu, wip, wil, cer, ter, wer, mer, rouge, squad
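
The ConfusionMatrix dtype change listed above can be motivated with a short sketch. A 32-bit float has a 24-bit significand, so it cannot represent every integer above 2**24, and a count stored that way can silently stop growing. The example round-trips values through 32-bit floats with the standard `struct` module; the helper name `to_float32` is hypothetical.

```python
import struct


def to_float32(value: int) -> float:
    """Round-trip an integer through a 32-bit float, as a float32 state would store it."""
    return struct.unpack("f", struct.pack("f", float(value)))[0]


limit = 2**24  # 16_777_216, the last point where every integer is representable

exact = to_float32(limit)  # still exact
off_by_one = to_float32(limit + 1)  # rounds back down to 2**24

# A 64-bit integer counter has no such gap at this magnitude.
as_integer = limit + 1
```

In other words, a float32 counter that reaches 16,777,216 ignores further increments of 1, while an integer state keeps counting.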


  • Renamed IoU -> Jaccard Index (#662)
  • Renamed text WER metric: (#714)
    • functional.wer -> functional.word_error_rate
    • WER -> WordErrorRate
  • Renamed correlation coefficient classes: (#710)
    • MatthewsCorrcoef -> MatthewsCorrCoef
    • PearsonCorrcoef -> PearsonCorrCoef
    • SpearmanCorrcoef -> SpearmanCorrCoef
  • Renamed audio STOI metric: (#753, #758)
    • audio.STOI -> audio.ShortTimeObjectiveIntelligibility
  • Renamed audio PESQ metrics: (#751)
    • audio.PESQ -> audio.PerceptualEvaluationSpeechQuality
  • Renamed audio SDR metrics: (#711)
    • functional.sdr -> functional.signal_distortion_ratio
    • functional.si_sdr -> functional.scale_invariant_signal_distortion_ratio
    • SDR -> SignalDistortionRatio
    • SI_SDR -> ScaleInvariantSignalDistortionRatio
  • Renamed audio SNR metrics: (#712)
    • functional.snr -> functional.signal_noise_ratio
    • functional.si_snr -> functional.scale_invariant_signal_noise_ratio
    • SNR -> SignalNoiseRatio
    • SI_SNR -> ScaleInvariantSignalNoiseRatio
  • Renamed F-score metrics: (#731, #740)
    • functional.f1 -> functional.f1_score
    • F1 -> F1Score
    • functional.fbeta -> functional.fbeta_score
    • FBeta -> FBetaScore
  • Renamed Hinge metric: (#734)
    • functional.hinge -> functional.hinge_loss
    • Hinge -> HingeLoss
  • Renamed image PSNR metrics: (#732)
    • functional.psnr -> functional.peak_signal_noise_ratio
    • PSNR -> PeakSignalNoiseRatio
  • Renamed audio PIT metric: (#737)
    • functional.pit -> functional.permutation_invariant_training
    • PIT -> PermutationInvariantTraining
  • Renamed image SSIM metric: (#747)
    • functional.ssim -> functional.structural_similarity_index_measure
    • SSIM -> StructuralSimilarityIndexMeasure
  • Renamed detection MAP to MeanAveragePrecision metric (#754)
  • Renamed Fidelity & LPIPS image metric: (#752)
    • image.FID -> image.FrechetInceptionDistance
    • image.KID -> image.KernelInceptionDistance
    • image.LPIPS -> image.LearnedPerceptualImagePatchSimilarity


  • Removed embedding_similarity metric (#638)
  • Removed argument concatenate_texts from wer metric (#638)
  • Removed arguments newline_sep and decimal_places from rouge metric (#638)


  • Fixed MetricCollection kwargs filtering when no kwargs are present in update signature (#707)


@ashutoshml, @Borda, @cuent, @Fariborzzz, @getgaurav2, @janhenriklambrechts, @justusschock, @karthikrangasai, @lucadiliello, @mahinlma, @mathemusician, @mona0809, @mrleu, @puhuk, @quancs, @SkafteNicki, @stancld, @twsl

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Fixing mAP on GPU

15 Dec 16:40

[0.6.2] - 2021-12-15


  • Fixed an error caused by torch.sort not supporting bool dtype on CUDA (#665)
  • Fixed mAP to properly check whether ground truths are empty (#684)
  • Fixed initialization of tensors to be on the correct device for MAP metric (#673)


@OlofHarrysson, @tkupek, @twsl

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Own mAP implementation

06 Dec 09:43

[0.6.1] - 2021-12-06


  • Migrate MAP metrics from pycocotools to PyTorch (#632)
  • Use torch.topk instead of torch.argsort in retrieval precision for speedup (#627)


  • Fix empty predictions in MAP metric (#594, #610, #624)
  • Fix edge case of AUROC with average=weighted on GPU (#606)
  • Fixed forward in compositional metrics (#645)


@Callidior, @SkafteNicki, @tkupek, @twsl, @zuoxingdong

If we forgot someone due to not matching commit email with GitHub account, let us know :]

More metrics than ever

28 Oct 22:42

[0.6.0] - 2021-10-28

We are excited to announce that TorchMetrics v0.6 is now publicly available. TorchMetrics v0.6 does not focus on specific domains but adds a ton of new metrics to several domains, increasing the number of metrics in the repository to over 60! Not only has v0.6 added metrics within already covered domains, but it also adds support for two new ones: pairwise metrics and detection.

Pairwise Metrics

TorchMetrics v0.6 offers a new set of metrics in its functional backend for calculating pairwise distances. Given a tensor X with shape [N,d] (N observations, each in d dimensions), a pairwise metric calculates an [N,N] matrix whose entry (i,j) compares row i of X with row j.
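
The shape contract described above can be sketched in plain Python, using nested lists in place of tensors. The function name `toy_pairwise_euclidean` is hypothetical and exists only for this illustration; it mirrors the [N,d] to [N,N] idea and is not the library's tensor implementation.

```python
import math
from typing import List, Sequence


def toy_pairwise_euclidean(x: Sequence[Sequence[float]]) -> List[List[float]]:
    """Return an N x N matrix whose entry (i, j) is the distance between rows i and j."""
    return [[math.dist(row_i, row_j) for row_j in x] for row_i in x]


# Three observations (N=3) in two dimensions (d=2).
points = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
distances = toy_pairwise_euclidean(points)
```

The result is symmetric with zeros on the diagonal, since each row is at distance 0 from itself.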


TorchMetrics v0.6 now includes a detection package that provides the MAP metric. The implementation essentially wraps pycocotools, ensuring that we get the correct value, with the added benefit of scaling to multiple devices (like any other metric in TorchMetrics).

New additions

  • In the audio package, we have two new metrics: Perceptual Evaluation of Speech Quality (PESQ) and Short-Time Objective Intelligibility (STOI). Both metrics can be used to assess speech quality.

  • In the retrieval package, we also have two new metrics: R-precision and Hit-rate. R-precision is the precision at the R-th position of the ranking, where R is the number of documents relevant to the query (at that position, precision and recall coincide). The hit rate is the fraction of queries for which at least one relevant document appears among the top-k results.

  • The text package also receives an update in the form of two new metrics: Sacre BLEU score and character error rate. Sacre BLEU score provides a more systematic way of comparing BLEU scores across tasks. The character error rate is similar to the word error rate but compares the predicted and reference sentences character by character instead of word by word.

  • The regression package got a single new metric in the form of the Tweedie deviance score metric. Deviance scores are generally a better measure of fit than measures such as squared error when trying to model data coming from highly skewed distributions.

  • Finally, we have added five new metrics for simple aggregation: SumMetric, MeanMetric, MinMetric, MaxMetric, CatMetric. All five metrics take in a single input (either native Python floats or torch.Tensor) and keep track of the sum, average, min, etc. These new aggregation metrics are especially useful in combination with self.log from lightning if you want to log something other than the average of the metric you are tracking.
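
The two retrieval metrics in the list above can be sketched for a single query in plain Python. The function names, the handling of queries with no relevant documents, and the per-query formulation are assumptions made for this illustration; in practice the per-query values would be averaged over all queries.

```python
from typing import Sequence


def toy_r_precision(scores: Sequence[float], relevant: Sequence[int]) -> float:
    """Precision among the top-R ranked documents, where R is the number of relevant ones."""
    r = sum(relevant)
    if r == 0:
        return 0.0  # assumed convention for queries without relevant documents
    ranked = sorted(zip(scores, relevant), key=lambda pair: pair[0], reverse=True)
    return sum(rel for _, rel in ranked[:r]) / r


def toy_hit_rate(scores: Sequence[float], relevant: Sequence[int], k: int) -> float:
    """1.0 if at least one relevant document is ranked in the top k, else 0.0."""
    ranked = sorted(zip(scores, relevant), key=lambda pair: pair[0], reverse=True)
    return 1.0 if any(rel for _, rel in ranked[:k]) else 0.0


# One query with four candidate documents, two of them relevant.
scores = [0.9, 0.8, 0.3, 0.1]
relevant = [1, 0, 1, 0]
rp = toy_r_precision(scores, relevant)
hit = toy_hit_rate(scores, relevant, k=1)
```

Here R is 2, and only one of the two top-ranked documents is relevant, so the R-precision is one half, while the hit rate at k=1 is 1 because the top document is relevant.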

Detail changes


  • Added audio metrics:
    • Perceptual Evaluation of Speech Quality (PESQ) (#353)
    • Short-Time Objective Intelligibility (STOI) (#353)
  • Added Information retrieval metrics:
    • RetrievalRPrecision (#577)
    • RetrievalHitRate (#576)
  • Added NLP metrics:
    • SacreBLEUScore (#546)
    • CharErrorRate (#575)
  • Added other metrics:
    • Tweedie Deviance Score (#499)
    • Learned Perceptual Image Patch Similarity (LPIPS) (#431)
  • Added MAP (mean average precision) metric to new detection package (#467)
  • Added support for float targets in nDCG metric (#437)
  • Added average argument to AveragePrecision metric for reducing multi-label and multi-class problems (#477)
  • Added MultioutputWrapper (#510)
  • Added metric sweeping:
    • higher_is_better as constant attribute (#544)
    • higher_is_better to rest of codebase (#584)
  • Added simple aggregation metrics: SumMetric, MeanMetric, CatMetric, MinMetric, MaxMetric (#506)
  • Added pairwise submodule with metrics (#553)
    • pairwise_cosine_similarity
    • pairwise_euclidean_distance
    • pairwise_linear_similarity
    • pairwise_manhatten_distance


  • AveragePrecision now outputs the macro average by default for multilabel and multiclass problems (#477)
  • half, double, float will no longer change the dtype of the metric states. Use metric.set_dtype instead (#493)
  • Renamed AverageMeter to MeanMetric (#506)
  • Changed is_differentiable from property to a constant attribute (#551)
  • ROC and AUROC will no longer throw an error when either the positive or negative class is missing. Instead, they return a score of 0 and give a warning


  • Deprecated torchmetrics.functional.self_supervised.embedding_similarity in favour of new pairwise submodule


  • Removed dtype property (#493)


  • Fixed bug in F1 with average='macro' and ignore_index!=None (#495)
  • Fixed bug in pit by using the returned first result to initialize device and type (#533)
  • Fixed SSIM metric using too much memory (#539)
  • Fixed bug where device property was not properly updated when the metric was a child of a module (#542)


@an1lam, @Borda, @karthikrangasai, @lucadiliello, @mahinlma, @obus, @quancs, @SkafteNicki, @stancld, @tkupek

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Own NLP implementations

01 Sep 21:28

[0.5.1] - 2021-08-30


  • Added device and dtype properties (#462)
  • Added TextTester class for robustly testing text metrics (#450)


  • Added support for float targets in nDCG metric (#437)


  • Removed rouge-score as dependency for text package (#443)
  • Removed jiwer as dependency for text package (#446)
  • Removed bert-score as dependency for text package (#473)


  • Fixed ranking of samples in SpearmanCorrCoef metric (#448)
  • Fixed bug where compositional metrics were unable to sync because of type mismatch (#454)
  • Fixed metric hashing (#478)
  • Fixed BootStrapper metrics not working on GPU (#462)
  • Fixed the semantic ordering of kernel height and width in SSIM metric (#474)


@justusschock, @karthikrangasai, @kingyiusuen, @obus, @SkafteNicki, @stancld

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Text-related (NLP) metrics

10 Aug 14:22

[0.5.0] - 2021-08-09

This release includes general improvements to the library and new metrics within the NLP domain.

Natural language processing is arguably one of the most exciting areas of machine learning, with models such as BERT, RoBERTa, and GPT-3 really pushing what automated text translation, recognition, and generation systems are capable of.

With the introduction of these models, many metrics have been proposed that measure how well these models perform. TorchMetrics v0.5 includes 4 such metrics: BERT score, BLEU, ROUGE and WER.

Detail changes


  • Added Text-related (NLP) metrics:
  • Added MetricTracker wrapper metric for keeping track of the same metric over multiple epochs (#238)
  • Added other metrics:
    • Symmetric Mean Absolute Percentage error (SMAPE) (#375)
    • Calibration error (#394)
    • Permutation Invariant Training (PIT) (#384)
  • Added support in nDCG metric for target with values larger than 1 (#349)
  • Added support for negative targets in nDCG metric (#378)
  • Added None as reduction option in CosineSimilarity metric (#400)
  • Allowed passing labels in (n_samples, n_classes) to AveragePrecision (#386)


  • Moved psnr and ssim from functional.regression.* to functional.image.* (#382)
  • Moved image_gradient from functional.image_gradients to functional.image.gradients (#381)
  • Moved R2Score from regression.r2score to regression.r2 (#371)
  • Pearson metric now only stores 6 statistics instead of all predictions and targets (#380)
  • Use torch.argmax instead of torch.topk when k=1 for better performance (#419)
  • Moved check for number of samples in R2 score to support single sample updating (#426)


  • Renamed r2score -> r2_score and kldivergence -> kl_divergence in functional (#371)
  • Moved bleu_score from functional.nlp to functional.text.bleu (#360)


  • Removed restriction that threshold has to be in (0,1) range to support logit input (#351, #401)
  • Removed restriction that preds could not be bigger than num_classes to support logit input (#357)
  • Removed modules regression.psnr and regression.ssim (#382)
  • Removed (#379):
    • function functional.mean_relative_error
    • num_thresholds argument in BinnedPrecisionRecallCurve


  • Fixed bug where classification metrics with average='macro' would lead to wrong result if a class was missing (#303)
  • Fixed weighted, multi-class AUROC computation to allow for 0 observations of some class, as contribution to final AUROC is 0 (#376)
  • Fixed that _forward_cache and _computed attributes are also moved to the correct device if metric is moved (#413)
  • Fixed calculation in IoU metric when using ignore_index argument (#328)


@BeyondTheProof, @Borda, @CSautier, @discort, @edwardclem, @gagan3012, @hugoperrin, @karthikrangasai, @paul-grundmann, @quancs, @rajs96, @SkafteNicki, @vatch123

If we forgot someone due to not matching commit email with GitHub account, let us know :]

Fixing DDP sync

05 Jul 16:02

[0.4.1] - 2021-07-05



  • Fixed DDP by adding is_sync logic to Metric (#339)

Multimedia - audio & image quality

29 Jun 13:01



The first highlight of v0.4.0 is a set of 3 new metrics for evaluating audio data: scale-invariant signal-to-distortion ratio, scale-invariant signal-to-noise ratio, and signal-to-noise ratio. All these metrics take a predicted audio tensor and a target tensor, both with the shape [...,time], and calculate the metric over the time axis.
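
As a rough illustration of reducing over the time axis, the sketch below computes a signal-to-noise ratio in decibels for one pair of 1-D signals, treating the difference between target and prediction as noise. The function name `toy_signal_noise_ratio` and the plain-list inputs are assumptions for this example; for inputs of shape [...,time], the same reduction would be applied to every leading index.

```python
import math
from typing import Sequence


def toy_signal_noise_ratio(preds: Sequence[float], target: Sequence[float]) -> float:
    """SNR in dB: target power divided by the power of the residual (target - preds)."""
    signal_power = sum(t * t for t in target)
    noise_power = sum((t - p) ** 2 for p, t in zip(preds, target))
    # Note: a perfect prediction gives noise_power == 0 and would need special handling.
    return 10 * math.log10(signal_power / noise_power)


target = [1.0, -1.0, 1.0, -1.0]
preds = [0.9, -0.9, 0.9, -0.9]  # every sample is off by 10% of the amplitude
snr_db = toy_signal_noise_ratio(preds, target)

# A batch of shape [2, time] reduces to one value per row.
batch_snr = [toy_signal_noise_ratio(p, t) for p, t in zip([preds, preds], [target, target])]
```

With a residual amplitude of one tenth of the signal, the power ratio is about 100, which corresponds to roughly 20 dB.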


Version v0.4.0 also includes a completely new image package. Since its initial 0.2.0 release, TorchMetrics has had both PSNR and SSIM in its regression module, metrics that can be used to evaluate image quality.
With the image module, we are adding three new metrics for evaluating the quality of generative models (such as GANs): Inception score (IS), Fréchet inception distance (FID) and kernel inception distance (KID).

More Functionality

In addition to the new audio and image package, we also want to highlight a couple of features:

  • Addition of MeanAbsolutePercentageError (MAPE) metric to the regression package. Useful in regression settings where you want to focus on the relative instead of absolute error.
  • Addition of KLDivergence metric to the classification package. Useful for measuring the distance between probability distributions like the ones outputted in variational auto-encoders.
  • Addition of CosineSimilarity metric to the regression package. Useful for calculating the angle between two embedding vectors in domains such as metric learning.
  • As requested by multiple users, Accuracy, Precision, Recall, FBeta, F1, StatScore, Hamming, ConfusionMatrix now directly support that predictions can be unnormalized, e.g. logits from your model. No need to call .softmax(dim=-1) anymore!
  • All modular metrics now have both sync and sync_context methods that allow the user full control over when metric states are synced. Note that we still automatically do this whenever calling the compute method.
  • The is_differentiable property has been adopted by many more of our metrics!
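
The point about unnormalized predictions in the list above can be checked with a small plain-Python sketch. For metrics that only depend on the highest-scoring class, softmax does not change the result, because it preserves the ordering of the scores within each row. The helpers `softmax`, `argmax` and `toy_accuracy` are hypothetical and written for this example; the argument does not cover metrics that apply a fixed probability threshold.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


def toy_accuracy(scores: Sequence[Sequence[float]], target: Sequence[int]) -> float:
    """Fraction of rows whose highest-scoring class matches the target."""
    correct = sum(argmax(row) == label for row, label in zip(scores, target))
    return correct / len(target)


logits = [[2.0, -1.0, 0.5], [0.1, 0.2, 3.0], [-0.3, 1.2, 0.4]]
target = [0, 2, 2]
probabilities = [softmax(row) for row in logits]

from_logits = toy_accuracy(logits, target)
from_probabilities = toy_accuracy(probabilities, target)
```

Both calls give the same accuracy (two of the three rows are classified correctly), which is why the explicit softmax step is redundant for this kind of metric.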


Big thanks to all community members for their contributions and feedback.
A special thanks to @quancs for leading the development of the new audio package.

[0.4.0] - 2021-06-24


  • Added Cosine Similarity metric (#305)
  • Added Specificity metric (#210)
  • Added add_metrics method to MetricCollection for adding additional metrics after initialization (#221)
  • Added pre-gather reduction in the case of dist_reduce_fx="cat" to reduce communication cost (#217)
  • Added better error message for AUROC when num_classes is not provided for multiclass input (#244)
  • Added support for unnormalized scores (e.g. logits) in Accuracy, Precision, Recall, FBeta, F1, StatScore, Hamming, ConfusionMatrix metrics (#200)
  • Added MeanAbsolutePercentageError(MAPE) metric. (#248)
  • Added squared argument to MeanSquaredError for computing RMSE (#249)
  • Added FID metric (#213)
  • Added is_differentiable property to ConfusionMatrix, F1, FBeta, Hamming, Hinge, IOU, MatthewsCorrcoef, Precision, Recall, PrecisionRecallCurve, ROC, StatScores (#253)
  • Added audio metrics: SNR, SI_SDR, SI_SNR (#292)
  • Added Inception Score metric to image module (#299)
  • Added KID metric to image module (#301)
  • Added sync and sync_context methods for manually controlling when metric states are synced (#302)
  • Added KLDivergence metric (#247)


  • Forward cache is reset when reset method is called (#260)
  • Improved per-class metric handling for imbalanced datasets for precision, recall, precision_recall, fbeta, f1, accuracy, and specificity (#204)
  • Decorated MetricCollection forward with torch.jit.unused (#307)
  • Renamed the num_thresholds argument of binned metrics to thresholds, for manually controlling the thresholds (#322)


  • Deprecated torchmetrics.functional.mean_relative_error (#248)
  • Deprecated num_thresholds argument in BinnedPrecisionRecallCurve (#322)


  • Removed argument is_multiclass (#319)


  • Fixed AUC to also support higher-dimensional inputs when all but one dimension are of size 1 (#242)
  • Fixed dtype of modular metrics after reset has been called (#243)
  • Fixed calculation in matthews_corrcoef to correctly match formula (#321)


@AnselmC, @arvindmuralie77, @bhadreshpsavani, @Borda, @GiannisVagionakis, @hassiahk, @IgorHoholko, @johannespitz, @justusschock, @maximsch2, @pranjaldatta, @quancs, @simran2905, @SkafteNicki, @tchaton

If we forgot someone due to not matching commit email with GitHub account, let us know :]