On Buggy Resizing Libraries and Surprising Subtleties in FID Calculation

Gaurav Parmar    Richard Zhang    Jun-Yan Zhu   

Carnegie Mellon University     Adobe Research



We investigate the sensitivity of the Fréchet Inception Distance (FID) score to inconsistent and often incorrect implementations across different image processing libraries. The FID score is widely used to evaluate generative models, but each FID implementation uses a different low-level image processing pipeline. Image resizing functions in commonly used deep learning libraries often introduce aliasing artifacts. We observe that numerous subtle choices need to be made for FID calculation, and that a lack of consistency in these choices can lead to vastly different FID scores. In particular, we show that the following choices are significant: (1) which image resizing library to use, (2) which interpolation kernel to use, and (3) which encoding to use when representing images. We additionally outline numerous common pitfalls that should be avoided and provide recommendations for computing the FID score accurately. We provide an easy-to-use, optimized implementation of our proposed recommendations in the accompanying code.
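For reference, FID is conventionally defined as the Fréchet distance between two Gaussians fitted to Inception features of the real and the generated images. The symbols below are notation introduced here for illustration: \mu_r, \Sigma_r are the feature mean and covariance of the real images, and \mu_g, \Sigma_g those of the generated images.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2 \left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

Because the features are extracted after the images are resized and decoded, any change in those preprocessing steps shifts the means and covariances and therefore the score.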

Leaderboard For Common Tasks

We compute the FID scores both with the method used in each original paper and with the Clean-FID proposed here. All values are computed using 10 evaluation runs. We also provide an API to query the results directly from the pip package. The arguments `model_name`, `dataset_name`, `dataset_res`, `dataset_split`, and `task_name` can be used to filter the results.



arXiv 2104.11222, 2021.


Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. "On Buggy Resizing Libraries and Surprising Subtleties in FID Calculation", in arXiv, 2021.

Buggy Resizing Operations in Commonly Used Libraries

We downsample an input image of a circle (left) by a factor of 8 using commonly used resizing methods from different image processing libraries. The Lanczos, bicubic, and bilinear implementations in PIL (top row) antialias correctly. The other implementations introduce aliasing artifacts. Many evaluation metrics for generative models, such as FID, rely on such a downsampling step.


The inconsistencies among implementations can have a drastic effect on the evaluation metrics. The table below shows that FFHQ dataset images resized with the bicubic implementations from other libraries (OpenCV, PyTorch, TensorFlow) have a large FID score (≥ 6) when compared to the same images resized with the correctly implemented PIL bicubic filter. Other correctly implemented filters from PIL (Lanczos, bilinear, box) all result in a relatively small FID score (≤ 0.75). Note that since TF 2.0, the new flag `antialias` can produce results close to PIL. However, the flag defaults to `False` and was not set in the existing TF-FID repo.
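To see why a missing antialiasing filter matters, the following is a minimal one-dimensional sketch in pure Python. It is not the code of any library named above: the function names `downsample_naive` and `downsample_box` are hypothetical, and a simple box filter stands in for the filters PIL applies. It downsamples a fine stripe pattern by a factor of 8, once by keeping every 8th sample and once by averaging each block of 8 samples.

```python
def downsample_naive(signal, factor):
    """Keep every `factor`-th sample with no low-pass filter (prone to aliasing)."""
    return signal[::factor]


def downsample_box(signal, factor):
    """Average each block of `factor` samples (a box antialiasing filter)."""
    return [
        sum(signal[i:i + factor]) / factor
        for i in range(0, len(signal) - factor + 1, factor)
    ]


# Stripes alternating 0, 1 with a 2-pixel period: far finer detail than
# an 8x smaller signal can represent.
stripes = [i % 2 for i in range(64)]

naive = downsample_naive(stripes, 8)
box = downsample_box(stripes, 8)

print("naive:", naive)  # [0, 0, 0, 0, 0, 0, 0, 0]
print("box:  ", box)    # [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]
```

The unfiltered version only ever lands on the dark stripes and reports a black signal, whereas the averaged version reports the mid-gray that the pattern actually has at the coarser scale. The same mechanism, in two dimensions, produces the artifacts seen around the circle.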


Effects of JPEG compression on FID

We show a sample image from the FFHQ dataset saved with different JPEG compression ratios. The images are perceptually indistinguishable from each other, yet they yield large FID scores. The FID score under each image is calculated between all FFHQ images saved with the corresponding JPEG setting and the same images saved as PNG.

Below, we study the effect of JPEG compression for StyleGAN2 models trained on the FFHQ dataset (left) and the LSUN outdoor Church dataset (right). Note that the LSUN dataset images were collected with JPEG compression (quality 75), whereas the FFHQ images were collected as PNG. Interestingly, for the LSUN dataset, the best FID score (3.48) is obtained when the generated images are compressed with JPEG quality 87.
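The pixel-level effect of the encoding choice can be checked with a small round-trip experiment. The sketch below assumes Pillow is installed and uses a deterministic synthetic image rather than FFHQ or LSUN data; the helper names `roundtrip` and `max_abs_diff` are hypothetical. It only measures pixel differences and does not compute FID.

```python
import io

from PIL import Image


def roundtrip(image, fmt, **save_kwargs):
    """Encode `image` into an in-memory file and decode it again."""
    buffer = io.BytesIO()
    image.save(buffer, format=fmt, **save_kwargs)
    buffer.seek(0)
    return Image.open(buffer).convert("RGB")


def max_abs_diff(a, b):
    """Largest per-channel difference between two same-sized RGB images."""
    return max(abs(x - y) for x, y in zip(a.tobytes(), b.tobytes()))


# Deterministic synthetic test image (an assumption made for this sketch).
width, height = 64, 64
pixels = bytes(
    (x * 37 + y * 91 + c * 53) % 256
    for y in range(height)
    for x in range(width)
    for c in range(3)
)
original = Image.frombytes("RGB", (width, height), pixels)

png_error = max_abs_diff(original, roundtrip(original, "PNG"))
jpeg_error = max_abs_diff(original, roundtrip(original, "JPEG", quality=75))

print("PNG max error: ", png_error)   # 0, since PNG is lossless
print("JPEG max error:", jpeg_error)  # nonzero, since JPEG is lossy
```

PNG reproduces every pixel exactly, while JPEG alters pixel values even when the change is hard to see. Differences of this kind in the real or generated image set are what the FID comparisons above pick up.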



We thank Jaakko Lehtinen and Assaf Shocher for bringing attention to this issue and for helpful discussion. We thank Sheng-Yu Wang, Nupur Kumari, Kangle Deng, and Andrew Liu for useful discussions. We thank William S. Peebles, Shengyu Zhao, and Taesung Park for proofreading our manuscript. We are grateful for the support of Adobe, Naver Corporation, and Sony Corporation.