Super-Resolution in image processing means upsampling an image, that is, interpolating between its pixels. It can be interpreted as the opposite of downsampling. To enlarge an image, the values of the additional pixels between the original pixels have to be predicted. One of the simplest and most traditional ways to do this is bicubic interpolation. New methods have evolved in recent years, and approaches based on neural networks now outperform the classical interpolation methods on the standard benchmarks.

Visual Example for Better Understanding

Figure 1: Low resolution image.

Figure 2: Interpolated low resolution image.

Figure 3: High resolution image.

Figure 1 shows a picture with 25x25 pixels, whereas Figure 3 shows the same image in its original resolution of 100x100 pixels. Figure 2 shows the image from Figure 1 upscaled with bicubic interpolation. The upscaled image is blurred compared to the original-resolution image (cf. Figure 3). This effect is caused by inaccurate prediction of the new pixel values. The aim of a Super-Resolution neural network is to learn to predict the missing pixel values of the upscaled image as well as possible.

Metrics

In order to describe the quality of the upscaling method, it is necessary to define a metric that describes the similarity between the predicted (upscaled) image and the ground truth (full resolution) image. This section describes some of the metrics commonly used for problems of this nature.

PSNR

Peak Signal-to-Noise Ratio (PSNR) is a commonly used metric to define the similarity between two images. It is calculated using the Mean Squared Error (MSE) of the pixels and the maximum possible pixel value (MAX_I, e.g. 255 for 8-bit images) as follows:

PSNR = 10 \cdot \log_{10} \left( \frac{MAX_I^2}{MSE} \right)

PSNR is expressed in decibels (dB). A high PSNR value corresponds to a high similarity between the two images, and a low value to a low similarity. (4)
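As a minimal sketch of the PSNR formula, the following Python snippet computes MSE and PSNR for two small grayscale images stored as lists of rows. The function names and the default of 255 for MAX_I (8-bit images) are assumptions made for this illustration and do not come from the cited sources.

```python
import math

def mse(img_a, img_b):
    """Mean squared error between two equally sized 2D images (lists of rows)."""
    if len(img_a) != len(img_b) or any(len(ra) != len(rb) for ra, rb in zip(img_a, img_b)):
        raise ValueError("images must have the same shape")
    total = 0.0
    count = 0
    for row_a, row_b in zip(img_a, img_b):
        for a, b in zip(row_a, row_b):
            total += (a - b) ** 2
            count += 1
    return total / count

def psnr(img_a, img_b, max_value=255.0):
    """PSNR in dB. Identical images have MSE = 0 and therefore infinite PSNR."""
    err = mse(img_a, img_b)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / err)

ground_truth = [[52, 55], [61, 59]]
prediction = [[54, 55], [60, 57]]
print(mse(ground_truth, prediction))             # 2.25
print(round(psnr(ground_truth, prediction), 2))  # 44.61
```

Because the MSE appears in the denominator, halving the error raises the PSNR by about 3 dB.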

SSIM

The structural similarity (SSIM) index was developed to improve on traditional metrics such as PSNR, which have been shown to be inconsistent with human visual perception. It takes the luminance, contrast and structure of both images into account. (3)

The SSIM index is calculated on various windows of an image. The measure between two windows x and y of common size N×N is:

\mathrm{SSIM}(x,y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}

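with:

  • \mu_x, \mu_y: the mean pixel value of x and of y
  • \sigma_x^2, \sigma_y^2: the variance of x and of y
  • \sigma_{xy}: the covariance of x and y
  • c_1 = (k_1 L)^2, c_2 = (k_2 L)^2: two constants that stabilize the division when the denominator is small
  • L: the dynamic range of the pixel values (255 for 8-bit images)
  • k_1 = 0.01 and k_2 = 0.03 by default (3)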
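A single-window SSIM computation can be sketched as follows. The snippet assumes the commonly used defaults k_1 = 0.01, k_2 = 0.03 and a dynamic range of L = 255, and it uses the population variance (division by N rather than N-1); these choices, like the function name, are assumptions for illustration. In practice the index is usually averaged over many windows of the image.

```python
def ssim_window(x, y, dynamic_range=255.0, k1=0.01, k2=0.03):
    """SSIM between two equally sized windows given as flat lists of pixel values."""
    if len(x) != len(y) or not x:
        raise ValueError("windows must be non-empty and of equal size")
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    # Population variance and covariance (an assumption of this sketch).
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    # Stabilizing constants c_1 and c_2.
    c1 = (k1 * dynamic_range) ** 2
    c2 = (k2 * dynamic_range) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator

window = [10, 20, 30, 40]
reversed_window = [40, 30, 20, 10]
print(ssim_window(window, window))           # identical windows: 1.0 up to rounding
print(ssim_window(window, reversed_window))  # inverted structure: negative value
```

Identical windows give a value of 1, while windows with the same mean and variance but inverted structure give a negative value, which a metric based only on pixel statistics such as the mean would not distinguish.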

 

More metrics can be found in literature:

  • IFC (Information Fidelity Criterion)
  • NQM (Noise Quality Measure)
  • WPSNR (Weighted Peak Signal-to-Noise Ratio)
  • MS-SSIM (Multi-Scale Structural Similarity)

 

 

Classical Approaches

Three classical approaches are briefly described below:

Bicubic Interpolation

Bicubic interpolation considers the 16 surrounding pixels (a 4x4 neighborhood) to predict each new pixel value.

(5)
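The 4x4 neighborhood can be illustrated with a small sketch. It assumes the widely used cubic convolution kernel with parameter a = -0.5 and clamps coordinates at the image border; the article does not specify which bicubic variant was used for the figures, so both choices (and the function names) are assumptions.

```python
import math

def cubic_kernel(t, a=-0.5):
    """Cubic convolution weight for a sample at distance t (non-zero only for |t| < 2)."""
    t = abs(t)
    if t <= 1:
        return (a + 2) * t ** 3 - (a + 3) * t ** 2 + 1
    if t < 2:
        return a * t ** 3 - 5 * a * t ** 2 + 8 * a * t - 4 * a
    return 0.0

def bicubic_sample(img, x, y):
    """Interpolate a 2D image (list of rows) at (x, y) from its 4x4 neighborhood."""
    h, w = len(img), len(img[0])
    x0, y0 = math.floor(x), math.floor(y)
    value = 0.0
    for j in range(-1, 3):          # 4 rows
        wy = cubic_kernel(y - (y0 + j))
        row = img[min(max(y0 + j, 0), h - 1)]  # clamp at the border
        for i in range(-1, 3):      # 4 columns, 16 pixels in total
            wx = cubic_kernel(x - (x0 + i))
            value += wx * wy * row[min(max(x0 + i, 0), w - 1)]
    return value

# A horizontal ramp: the pixel value equals the column index.
ramp = [[float(col) for col in range(6)] for _ in range(6)]
print(bicubic_sample(ramp, 2.5, 2.5))  # approximately 2.5
```

At integer positions the weights reduce to 1 for the pixel itself and 0 for its neighbors, so the original pixel values are reproduced.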

 

Bilinear Interpolation

Bilinear interpolation considers the 4 surrounding pixels (a 2x2 neighborhood) to predict each new pixel value.

(6)
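A minimal sketch of bilinear interpolation at a single position is shown here. The function name and the border handling (clamping the position to the image) are assumptions for this illustration.

```python
import math

def bilinear_sample(img, x, y):
    """Interpolate a 2D image (list of rows) at the real-valued position (x, y)."""
    h, w = len(img), len(img[0])
    # Clamp the position to the image so that all four neighbors exist.
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = math.floor(x), math.floor(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    # Interpolate along x on the two rows, then along y between the results.
    top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
    bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
    return top * (1 - fy) + bottom * fy

image = [[0, 10], [20, 30]]
print(bilinear_sample(image, 0.5, 0.5))  # 15.0
```

The result at the center of the four pixels is simply their average, which explains the smoothing effect of this method.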

Nearest Neighbor

The nearest neighbor method simply copies the value of the nearest original pixel to each new pixel.

(7)
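For an integer scale factor, nearest neighbor upscaling amounts to repeating every pixel. The following sketch assumes an integer factor and an image stored as a list of rows; the function name is hypothetical.

```python
def upscale_nearest(img, factor):
    """Upscale a 2D image (list of rows) by an integer factor using nearest neighbor."""
    h, w = len(img), len(img[0])
    # Output pixel (r, c) takes the value of source pixel (r // factor, c // factor).
    return [[img[r // factor][c // factor] for c in range(w * factor)]
            for r in range(h * factor)]

small = [[1, 2],
         [3, 4]]
for row in upscale_nearest(small, 2):
    print(row)
# [1, 1, 2, 2]
# [1, 1, 2, 2]
# [3, 3, 4, 4]
# [3, 3, 4, 4]
```

No new pixel values are created, which is why this method produces blocky rather than blurred results.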

Approaches with Neural Networks

SRCNN (Super Resolution Convolutional Neural Network)

Figure 4: Architecture of SRCNN. Three convolutional layers in total. (2)


SRCNN consists of 3 convolutional layers with filter sizes of 9x9 for the first layer, 1x1 for the second layer and 5x5 for the last layer (the 9-1-5 configuration described in (2)). The first layer generates 64 feature maps, the second 32, and the last one generates the output image. Figure 4 shows color images, but in practice only grayscale (luminance) images are used to train and apply the network. The first-layer filters can be interpreted as feature detectors for corners, lines, etc. They are visualized in Figure 5.

Figure 5: Filters of the first convolution. (2)
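To give a feeling for how small this network is, the number of filter weights can be counted with a short sketch. The layer configuration used here (9x9, 1x1 and 5x5 filters with 64 and 32 feature maps on a single input channel) is an assumption based on the 9-1-5 setting of reference (2); bias terms are left out, and the helper name is hypothetical.

```python
def conv_weights(kernel_size, in_channels, out_channels):
    """Number of filter weights of one convolutional layer (biases not counted)."""
    return kernel_size * kernel_size * in_channels * out_channels

# (kernel size, input channels, output channels) for each layer; assumed 9-1-5 setting.
srcnn_layers = [
    (9, 1, 64),   # feature extraction from the single-channel input
    (1, 64, 32),  # non-linear mapping between feature spaces
    (5, 32, 1),   # reconstruction of the output image
]

total = sum(conv_weights(k, c_in, c_out) for k, c_in, c_out in srcnn_layers)
print(total)  # 8032
```

Most of the weights sit in the first layer (5184 of 8032), which is the layer whose filters are shown in Figure 5.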

VDSR (Very Deep Super Resolution)

Figure 6: Network architecture of VDSR. Multiple convolutional layers followed by addition of the input image. (1)

The VDSR network consists of 20 convolutional layers. The input and output images share the same size. This is achieved by zero-padding in every convolution. The key element is residual learning, which is applied by adding the input image to the output of the last convolutional layer. In this way the network only learns the difference between the low and the high resolution image. This makes sense because both images share the same low-frequency content, which therefore does not need to be learned in the training process.
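The idea of residual learning can be sketched with plain lists. The snippet assumes that the network input is the low resolution image already interpolated to the target size, so that input and ground truth have the same shape; the helper names and the pixel values are made up for illustration.

```python
def subtract(img_a, img_b):
    """Pixel-wise difference img_a - img_b of two equally sized 2D images."""
    return [[a - b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(img_a, img_b)]

def add(img_a, img_b):
    """Pixel-wise sum of two equally sized 2D images."""
    return [[a + b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(img_a, img_b)]

interpolated = [[100, 102], [98, 101]]  # upscaled low resolution input (assumed values)
high_res = [[104, 101], [97, 105]]      # ground truth

# Training target: only the detail that the interpolation missed.
residual = subtract(high_res, interpolated)
print(residual)  # [[4, -1], [-1, 4]]

# At inference time the predicted residual is added back to the input.
restored = add(interpolated, residual)
print(restored == high_res)  # True
```

The residual values are small and centered around zero, which is easier for the network to fit than the full range of pixel values.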

The filter size of each convolution except the first one is 3x3x64. Since each of the 20 layers with 3x3 filters enlarges the receptive field by 2 pixels, the receptive field of the network is 41x41 pixels. Each convolution except the last one generates 64 feature maps, some of which are visualized in Figure 6. Data augmentation via rotation and flipping is used for training. To speed up training, the training data is decomposed into patches of size 41x41. This also increases the amount of training data: out of 291 images, approximately 140,000 patches can be generated.
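The receptive field and the effect of zero-padding can be checked with a short calculation. The sketch assumes stride 1 in every layer and a padding of 1 pixel per side for 3x3 filters; the function names are hypothetical.

```python
def receptive_field(num_layers, kernel_size=3):
    """Receptive field (in pixels per dimension) of stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel_size - 1)

def output_size(input_size, kernel_size=3, padding=1):
    """Output size of one stride-1 convolution with symmetric zero-padding."""
    return input_size + 2 * padding - kernel_size + 1

print(receptive_field(20))  # 41

# With a padding of 1, a 3x3 convolution keeps the image size unchanged,
# so a 41x41 patch is still 41x41 after all 20 layers.
size = 41
for _ in range(20):
    size = output_size(size)
print(size)  # 41
```

Without padding, each layer would shrink the image by 2 pixels per dimension, and a 41x41 patch would be reduced to a single pixel after 20 layers.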

To help the network converge, gradient clipping and L2 regularization are used. The learning rate is reduced by a factor of 10 (i.e., multiplied by 0.1) every 20 epochs, which improves performance.
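The learning rate schedule and a simple form of gradient clipping can be sketched as follows. The initial learning rate of 0.1 and the clipping limit are placeholder values chosen for this illustration, and the clipping shown is a generic clip-by-value; the exact clipping rule used for VDSR is not described in this article.

```python
def step_learning_rate(epoch, base_lr=0.1, decay=0.1, step=20):
    """Learning rate that is multiplied by `decay` every `step` epochs."""
    return base_lr * decay ** (epoch // step)

def clip_gradients(gradients, limit):
    """Clip every gradient value to the range [-limit, limit]."""
    return [max(-limit, min(limit, g)) for g in gradients]

for epoch in (0, 19, 20, 40):
    print(epoch, step_learning_rate(epoch))
# The rate stays constant for epochs 0-19, drops to one tenth at epoch 20,
# and to one hundredth at epoch 40.

print(clip_gradients([-5.0, 0.2, 3.0], limit=1.0))  # [-1.0, 0.2, 1.0]
```

Clipping bounds the size of each update, which allows a comparatively high initial learning rate without the training diverging.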

Comparison of Different Methods and State-of-the-Art Performance

Figure 7 compares several super-resolution approaches. The benchmark datasets used are standard in today's literature. All networks are trained with Set291, a set of 291 natural images.

Figure 7: Benchmark table for different super-resolution approaches. (1)


Figure 8 visualizes the performance of state-of-the-art techniques. There is only a slight difference between the ground truth image on the left and the predicted image on the right.

Figure 8: Results of different super-resolution approaches on an example image. (1)

Applications

Image Super-Resolution is used in many areas such as:

  • Surveillance
  • Remote Sensing
  • Medical Imaging (e.g., ultrasonic images, X-ray images)
  • Video Standard Conversion (e.g., SD to HD)
  • Photo Cameras (e.g., postprocessing of images)
  • Printing (e.g., enhancing the print quality of low resolution images on paper)
  • Biometrics (e.g., fingerprint/face recognition)
  • Commercial (e.g., barcode reading)
  • Military (e.g., tracking and detecting)
  • Satellite Imaging (e.g., weather forecasting)

Literature

  1. https://arxiv.org/abs/1511.04587 (Accurate Image Super-Resolution Using Very Deep Convolutional Networks; Jiwon Kim, Jung Kwon Lee, Kyoung Mu Lee)
  2. http://personal.ie.cuhk.edu.hk/~ccloy/files/eccv_2014_deepresolution.pdf (Learning a Deep Convolutional Network for Image Super-Resolution; Chao Dong, Chen Change Loy, Kaiming He and Xiaoou Tang)
  3. https://en.wikipedia.org/wiki/Structural_similarity
  4. https://en.wikipedia.org/wiki/Peak_signal-to-noise_ratio
  5. https://upload.wikimedia.org/wikipedia/commons/f/f5/Interpolation-bicubic.svg
  6. https://upload.wikimedia.org/wikipedia/commons/d/dd/Interpolation-bilinear.svg
  7. https://upload.wikimedia.org/wikipedia/commons/2/27/Interpolation-nearest.svg

https://github.com/huangzehao/caffe-vdsr (Implementation of VDSR in Caffe)

http://live.ece.utexas.edu/publications/2004/hrs_ieeetip_2004_infofidel.pdf (More about Information Fidelity Criterion (IFC))

Comment

  1. Unknown user (ga69taq) says:

    General problems and suggestions:
    Please refer to your images with figure labels if you wish to explain them in your text.

    • Do not use phrases such as "below", "above", "following" etc.
    • Add image sources to your bibliography/weblinks instead of under the image without a hyperlink.
    • Do not use phrases such as "one can do..." or "one can see" instead say "it is possible to..." or any passive formulation.
    • You might want to add the used databases to your bibliography as well.
    • For all mathematical expressions including the image sizes I would recommend using math inlines.

    Consistency problems and suggestions:

    • Your capitalization is inconsistent throughout the article.
    • Your bibliography style is inconsistent with the rest of the wiki.

    Corrections (note that some of the corrections contain more than one change):

    • in Image Processing means
    • in image processing means
    • can be understood as
    • can be interpreted as
    • additional pixels in between the original pixels.
    • additional pixels between the original pixels.
    • One of the easiest ways and also kind of a traditional method
    • One of the easiest ways and also a traditional method
    • all other methods yet.
    • all other methods developed so far.
    • The upper left image shows
    • Figure x shows (I will omit this type of correction from now on since this is a recurring error that is already mentioned in the consistency section)
    • in full resolution 100x100 pixels.
    • in the original resolution of 100x100 pixels.
    • This effect results from incorrect prediction
    • This effect is caused by incorrect prediction
    • is to learn how to predict as good as possible the missing pixel values for the upscaled image.
    • is learning to predict the missing pixel values for the upscaled image as good as possible.
    • To describe the quality
    • In order to describe the quality (Avoid starting sentences with "to".)
    • In the followin one can find a few common used metrics for this kind of problems.
    • In this section, some of the commonly used metrics, that are employed for problems of this nature, are described.
    • is a common used metric to define the similarity between to images.
    • is a commonly used metric to define the similarity between two images.
    • as one can see below:
    • (One does not simply see.)(Also, avoid using "below".)
    • a low similarity respectivly.
    • a low similarity respectively.
    • The structural similarity index was developed
    • The structural similarity index is developed
    • which have proven to
    • which have been proven to
    • It takes luminance, contrast an structure of both images
    • It takes luminance, contrast and structure of both images
    • The measure between two windows x and y of common size N×N is: (EQUATION)
    • (The sentence is correct, the math seems wrong with this equation though)
    • SRCNN contains of 3 convolutional layers
    • SRCNN consists of 3 convolutional layers
    • Filtersizes 9x9 for the first layer,
    • filter sizes of 9x9 for the first layer,
    • 1x1 for the second and
    • 1x1 for the second layer and
    • The first layer generates 64 Feature Maps, the second 32 and the last one.
    • The first layer generates 64 feature maps, the second 32 and the last one. (last one does what?)
    • but in practical only grayscale images
    • but in practice only grayscale images
    • The first layer filters can be understand as feature detectors,
    • The first layer filters can be interpreted as feature detectors;
    • The VDSR network contains of 20 convolutional layers.
    • The VDSR network consists of 20 convolutional layers.
    • Input and Output Image share
    • The input and output images share
    • and thus don´t need to be considered
    • and thus, they do not need to be considered
    • The Filtersize of each convolution exept the first is 3x3x64.
    • The filter size of each convolutional layer except the first one is 3x3x64.
    • is thus 41x41 pixels.
    • is therefore 41x41 pixels.
    • generates 64 Feature Maps, a few are visualized
    • generates 64 feature maps, some of which visualized
    • Data Augmentation with rotation and flipping is used for training.
    • Data augmentation via rotation and flipping is used for training.
    • In order to gain speed
    • In order to accelerate
    • approximately 140.000 Patches can be generated.
    • approximately 140.000 patches can be generated.
    • To help converge the network,
    • In order to aid the the network to converge,
    • is decreased every 20 Epochs
    • is decreased every 20 epochs
    • by a factor of 0.1, which gains performance.
    • by a factor of 0.1, which improves performance.
    • Mentioned and descriped two networks more detailed, one can see the difference in architecture and performance.
    • (????)
    • The Datasets can be found as standart in todays literature.
    • The datasets can be found as standard in today's literature.
    • Set291, an ImageSet containing 291 natural images.
    • Set291, an image set containing 291 natural images. (Might be a good idea to add a reference to the Set291)
    • The Image below visualizes the performance of state of the art techniques.
    • The image below visualizes the performance of the state of the art techniques.
    • Medical Imaging (i.e. ultrasonic images, y-ray-images,
    • Medical Imaging (i.e., ultrasonic images, x-ray images)
    • (Also the capitalization in this bullet list seems entirely random, please fix that.)
    • (As Sebastian pointed out to me, "i.e." is always followed by a comma in American English.)

    Final comments:

    • It is important to note that some of the corrections / suggestions are subjective and for you to decide
    • Most of the issues I pointed out under the general topics are omitted in the correction list but I am pretty sure I still missed some stuff. Hopefully the next review will catch any mistakes I skipped.