[Paper] Deep learning for image super-resolution: A survey (2020)

22 Dec 2020 #cv

Wang, Zhihao, Jian Chen, and Steven CH Hoi. “Deep learning for image super-resolution: A survey.” IEEE transactions on pattern analysis and machine intelligence 43.10 (2020): 3365-3387.

Paper Link

Introduction

Super-resolution (SR) is the process of enhancing the resolution of images, transforming low-resolution (LR) images to high-resolution (HR) images.
SR is an ill-posed problem because many different HR images can correspond to the same LR image.
Deep learning has significantly advanced SR, with approaches like CNNs (SRCNN) and GANs (SRGAN).

Problem Setting and Terminology

Problem Definition: Developing a super-resolution model to approximate HR images from LR inputs.
Image Quality Assessment (IQA): Methods include subjective human perception and objective computational techniques, classified into full-reference, reduced-reference, and no-reference methods.

Supervised Super-Resolution

SR Framework

Pre-Upsampling Framework: Uses traditional upsampling followed by deep neural networks (e.g., SRCNN).
Post-Upsampling Framework: Performs most of the computation in LR space and places learnable upsampling layers at the end of an end-to-end model.
Progressive Upsampling Framework: Utilizes cascades of CNNs for step-by-step refinement of images.
Iterative Up-and-Down Sampling: Alternates upsampling and downsampling steps (e.g., DBPN, SRFBN) to better capture the mutual dependency between LR and HR images.
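
To make the difference between the pre- and post-upsampling frameworks concrete, here is a rough cost sketch. It counts multiply-accumulates for a stack of convolutions run either on the enlarged image (pre-upsampling) or on the LR image (post-upsampling). The layer count, channel width, kernel size, and image size are illustrative assumptions, not values from the survey, and the cost of the upsampling step itself is ignored.

```python
def conv_macs(h, w, c_in, c_out, k):
    """Multiply-accumulates for one k x k convolution (stride 1, same padding)."""
    return h * w * c_in * c_out * k * k

def body_macs(h, w, scale, pre_upsampling, layers=8, channels=64, k=3):
    """Cost of `layers` identical convolutions; all sizes are hypothetical."""
    if pre_upsampling:
        # Every layer operates on the already enlarged image.
        h, w = h * scale, w * scale
    return layers * conv_macs(h, w, channels, channels, k)

pre = body_macs(64, 64, scale=4, pre_upsampling=True)
post = body_macs(64, 64, scale=4, pre_upsampling=False)
print(pre // post)  # 16, i.e. scale ** 2
```

Under these assumptions the convolutional body is `scale ** 2` times more expensive when it runs in HR space.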

Upsampling Methods

Interpolation-Based

Includes nearest-neighbor, bilinear, and bicubic interpolation. These are traditional techniques used to resize images before the advent of deep learning-based methods.
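
As a small illustration of the first two interpolation methods, the sketch below upsamples a grayscale image stored as a list of rows. The function names and the half-pixel coordinate convention used for the bilinear case are my own assumptions; libraries differ in how they align pixel grids.

```python
def nearest_upsample(img, s):
    """Repeat each pixel s times along both axes."""
    h, w = len(img), len(img[0])
    return [[img[y // s][x // s] for x in range(w * s)] for y in range(h * s)]

def bilinear_upsample(img, s):
    """Blend the four nearest input pixels (half-pixel centers, clamped at edges)."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h * s):
        fy = min(max((y + 0.5) / s - 0.5, 0.0), h - 1)
        y0 = int(fy)
        y1 = min(y0 + 1, h - 1)
        wy = fy - y0
        row = []
        for x in range(w * s):
            fx = min(max((x + 0.5) / s - 0.5, 0.0), w - 1)
            x0 = int(fx)
            x1 = min(x0 + 1, w - 1)
            wx = fx - x0
            top = img[y0][x0] * (1 - wx) + img[y0][x1] * wx
            bottom = img[y1][x0] * (1 - wx) + img[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out

lr = [[0, 10], [20, 30]]
print(nearest_upsample(lr, 2)[0])   # [0, 0, 10, 10]
print(bilinear_upsample(lr, 2)[0])  # [0.0, 2.5, 7.5, 10.0]
```

Nearest-neighbor produces blocky copies, while bilinear produces a smooth ramp between the original pixel values.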

Learning-Based

Utilizes transposed convolution layers and sub-pixel layers for end-to-end learning.

Transposed Convolution Layer (Deconvolution Layer): Performs the opposite shape transformation of a normal convolution, i.e., it predicts a possible input from feature maps sized like a convolution output. It increases the resolution by inserting zeros between pixels and then performing convolution.
- This method enlarges the image size while maintaining a connectivity pattern, but it can cause uneven overlapping on each axis, leading to checkerboard-like artifacts that can affect SR performance.
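
The uneven overlap can be seen in a 1-D toy example. The sketch below writes the transposed convolution in its scatter form (each input value adds a scaled copy of the kernel to the output), which is equivalent to the zero-insertion view. With an all-ones input and kernel, each output value simply counts how many kernel copies overlap there. The function name and the sizes are assumptions chosen for illustration.

```python
def transposed_conv1d(x, kernel, stride):
    """1-D transposed convolution without padding, in scatter form."""
    k = len(kernel)
    out = [0.0] * ((len(x) - 1) * stride + k)
    for i, v in enumerate(x):
        for j, wgt in enumerate(kernel):
            out[i * stride + j] += v * wgt
    return out

# Kernel size 3 is not divisible by stride 2: overlap alternates between 1 and 2.
uneven = transposed_conv1d([1.0] * 4, [1.0] * 3, stride=2)
# Kernel size 4 is divisible by stride 2: overlap is uniform away from the borders.
even = transposed_conv1d([1.0] * 4, [1.0] * 4, stride=2)
print(uneven)  # [1.0, 1.0, 2.0, 1.0, 2.0, 1.0, 2.0, 1.0, 1.0]
print(even)    # [1.0, 1.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 1.0, 1.0]
```

The alternating pattern in the first output is the 1-D analogue of the checkerboard artifact; in 2-D the effect occurs on both axes.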
Sub-Pixel Layer (PixelShuffle): Generates a plurality of channels by convolution and then reshapes them.
- Given an input of size $(h \times w \times c)$, the convolution generates $s^2$ times as many channels, where $s$ is the scaling factor. The intermediate output of size $(h \times w \times s^2c)$ is then reshaped (shuffled) to $(sh \times sw \times c)$.
- This method maintains a larger receptive field than the transposed convolution layer, providing more contextual and realistic details. However, the distribution of the receptive field can be uneven, leading to artifacts near the boundaries of different blocks.
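
The reshape step of the sub-pixel layer can be sketched in plain Python. The input is a nested list in $(h \times w \times s^2c)$ layout as in the text. The channel ordering used here (output channel first, then the row and column offset inside each $s \times s$ block) is one common convention and is an assumption of this sketch.

```python
def pixel_shuffle(feat, s):
    """Rearrange an (h, w, s*s*c) nested list into (s*h, s*w, c)."""
    h, w = len(feat), len(feat[0])
    c = len(feat[0][0]) // (s * s)
    out = [[[0] * c for _ in range(w * s)] for _ in range(h * s)]
    for y in range(h):
        for x in range(w):
            for ch in range(c):
                for dy in range(s):
                    for dx in range(s):
                        # Each group of s*s channels fills one s x s output block.
                        src = ch * s * s + dy * s + dx
                        out[y * s + dy][x * s + dx][ch] = feat[y][x][src]
    return out

feat = [[[0, 1, 2, 3]]]        # h = 1, w = 1, s^2 * c = 4
print(pixel_shuffle(feat, 2))  # [[[0], [1]], [[2], [3]]]
```

No arithmetic happens in this step; all learning is done by the convolution that produces the $s^2c$ channels.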

Network Design


Residual Learning

Simplifies learning by focusing on the residuals between LR and HR images instead of learning a direct mapping. This approach reduces the complexity of the transformation task.
By learning only the difference (residuals) between the input and the target image, the model can focus on fine details, resulting in better performance and faster convergence.
Example: The ResNet architecture uses residual blocks to enhance the ability of very deep networks to learn effectively without vanishing gradients.
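
A minimal sketch of global residual learning, assuming the LR image has already been interpolated to the target size: the network is trained to predict `HR - upsampled`, and the final output adds that prediction back. The function names and the tiny example values are hypothetical.

```python
def residual_target(hr, upsampled):
    """Training target: the detail missing from the interpolated image."""
    return [[h - u for h, u in zip(hr_row, up_row)]
            for hr_row, up_row in zip(hr, upsampled)]

def reconstruct(upsampled, predicted_residual):
    """Final SR output: interpolated image plus the predicted residual."""
    return [[u + r for u, r in zip(up_row, res_row)]
            for up_row, res_row in zip(upsampled, predicted_residual)]

hr = [[10, 12], [11, 30]]
upsampled = [[10, 11], [11, 26]]   # stand-in for a bicubic result
residual = residual_target(hr, upsampled)
print(residual)                          # [[0, 1], [0, 4]]
print(reconstruct(upsampled, residual))  # [[10, 12], [11, 30]]
```

Most residual values are close to zero, which is why the target is easier to fit than the full HR image.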

Recursive Learning

Repeatedly applies the same modules to capture higher-level features while maintaining a manageable number of parameters.
Allows the network to refine features iteratively, leading to more detailed and accurate image reconstructions.
Example: Deep Recursive Convolutional Network (DRCN) utilizes a single convolutional layer applied multiple times to expand the receptive field without increasing the number of parameters significantly.
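
The parameter sharing can be illustrated with a 1-D toy: one kernel is applied repeatedly, so the receptive field grows with each step while the number of weights stays fixed. This is only a sketch of the idea, with no nonlinearity and with names I made up; it is not the DRCN architecture itself.

```python
def conv1d_same(x, kernel):
    """1-D convolution with zero padding so the output length matches the input."""
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(padded[i + j] * kernel[j] for j in range(k)) for i in range(len(x))]

def recursive_block(x, kernel, steps):
    """Apply the same kernel `steps` times (shared weights)."""
    for _ in range(steps):
        x = conv1d_same(x, kernel)
    return x

def receptive_field(kernel_size, steps):
    """Receptive field of `steps` stacked stride-1 convolutions."""
    return 1 + steps * (kernel_size - 1)

impulse = [0.0] * 4 + [1.0] + [0.0] * 4
response = recursive_block(impulse, [1.0, 1.0, 1.0], steps=3)
print(response)               # [0.0, 1.0, 3.0, 6.0, 7.0, 6.0, 3.0, 1.0, 0.0]
print(receptive_field(3, 3))  # 7
```

The impulse spreads over 7 positions after three passes, yet only 3 weights are ever stored.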

Multi-Path Learning

Local Multi-Path Learning
- Extracts features through multiple parallel paths which then get fused to provide better modeling capabilities. This approach helps in capturing different aspects of the image simultaneously.
- Different paths can focus on various scales or types of features, which are then combined to improve the overall representation.
- Example: Multi-scale Residual Network (MSRN) uses multiple convolutional layers with different kernel sizes to capture multi-scale features.
Scale-Specific Multi-Path Learning
- Involves having separate paths for different scaling factors within a single network, allowing the network to handle multiple scales more effectively.
- Example: MDSR (Multi-Scale Deep Super-Resolution) shares most network parameters but has scale-specific layers to handle different upscaling factors.

Dense Connections

Enhances gradient flow and feature reuse by connecting each layer to all subsequent layers in a feed-forward fashion, so every layer receives the feature maps of all preceding layers as input. This ensures that gradients can flow directly to earlier layers, improving learning efficiency.
Promotes feature reuse, leading to more efficient and compact networks.
Example: DenseNet connects each layer to all subsequent layers within a dense block, facilitating better feature propagation and reducing the risk of vanishing gradients.
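
The connectivity pattern of a dense block can be sketched with flat feature lists: each layer sees the concatenation of the block input and all earlier layer outputs. The toy layers below (a simple sum) are placeholders I chose for illustration, not real convolutions.

```python
def dense_block(x, layers):
    """Run `layers` with dense connectivity and return all features concatenated."""
    features = [list(x)]
    for layer in layers:
        # Each layer receives every feature produced so far.
        concat = [v for f in features for v in f]
        features.append(layer(concat))
    return [v for f in features for v in f]

# Hypothetical layer: emits one new feature, the sum of everything it sees.
toy_layer = lambda feats: [sum(feats)]
print(dense_block([1, 2], [toy_layer, toy_layer]))  # [1, 2, 3, 6]
```

The second layer's input is `[1, 2, 3]`, i.e. it reuses the block input directly as well as the first layer's output.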

Group Convolution

Splits the input channels into groups and performs convolutions within each group. This reduces the computational complexity and number of parameters.
Often used in lightweight models to balance performance and efficiency.
Example: Xception and MobileNet architectures use depthwise separable convolutions, a type of group convolution, to reduce the number of parameters and computation.
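
The parameter savings follow from simple counting. The sketch below compares a standard convolution, a group convolution, and a depthwise separable convolution (depthwise plus 1x1 pointwise). Biases are ignored, and the channel counts are arbitrary example values.

```python
def conv_params(c_in, c_out, k, groups=1):
    """Weights in a k x k convolution; each output channel sees c_in / groups inputs."""
    assert c_in % groups == 0 and c_out % groups == 0
    return k * k * (c_in // groups) * c_out

def depthwise_separable_params(c_in, c_out, k):
    """Depthwise k x k convolution (groups = c_in) followed by a 1 x 1 pointwise convolution."""
    return conv_params(c_in, c_in, k, groups=c_in) + conv_params(c_in, c_out, 1)

print(conv_params(64, 64, 3))                 # 36864
print(conv_params(64, 64, 3, groups=4))       # 9216
print(depthwise_separable_params(64, 64, 3))  # 4672
```

With `g` groups the weight count drops by a factor of `g`; the depthwise separable variant goes further by factoring spatial and cross-channel mixing.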

Pyramid Pooling

Uses pooling operations at multiple scales to capture both global and local context information. This helps in understanding the image at different resolutions.
Example: PSPNet (Pyramid Scene Parsing Network) uses pyramid pooling to aggregate contextual information from different scales, which is then combined to enhance the feature representation.
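
A minimal sketch of the pooling stage, assuming average pooling into a fixed number of bins per scale. The bin boundaries follow one common adaptive-pooling convention (floor for the start, ceiling for the end), which is my assumption. A full pyramid pooling module would also upsample and concatenate these maps, which is omitted here.

```python
def _ceil_div(a, b):
    return -(-a // b)

def adaptive_avg_pool(img, bins):
    """Average-pool a 2-D list into a bins x bins grid."""
    h, w = len(img), len(img[0])
    out = []
    for by in range(bins):
        y0, y1 = (by * h) // bins, _ceil_div((by + 1) * h, bins)
        row = []
        for bx in range(bins):
            x0, x1 = (bx * w) // bins, _ceil_div((bx + 1) * w, bins)
            cells = [img[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            row.append(sum(cells) / len(cells))
        out.append(row)
    return out

def pyramid_pool(img, bin_sizes=(1, 2, 4)):
    """Pool the same image at several scales, from global (1 bin) to local."""
    return {b: adaptive_avg_pool(img, b) for b in bin_sizes}

img = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
pyramid = pyramid_pool(img)
print(pyramid[1])  # [[7.5]]
print(pyramid[2])  # [[2.5, 4.5], [10.5, 12.5]]
```

The 1-bin level summarizes the whole image, while finer levels keep progressively more spatial detail.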

Attention Mechanisms

Channel Attention
- Focuses on the interdependencies between feature channels. It assigns different weights to different channels, enhancing important features and suppressing less useful ones.
- Example: Squeeze-and-Excitation Networks (SENet) uses a squeeze operation to aggregate feature maps across spatial dimensions, followed by an excitation operation that recalibrates channel-wise feature responses.
Spatial Attention
- Focuses on the spatial location of important features. It assigns weights to different spatial regions, allowing the model to focus on relevant areas of the image.
- Example: Convolutional Block Attention Module (CBAM) combines channel and spatial attention to improve representation by focusing on meaningful parts of the image.
Non-Local Attention
- Captures long-range dependencies between distant pixels. This is particularly useful for super-resolution tasks where global context is important.
- Example: Non-local Neural Networks use a self-attention mechanism to compute relationships between all pairs of positions in the feature map, allowing the model to capture global context and dependencies.
Combined Attention
- Some networks combine multiple types of attention mechanisms to leverage the strengths of each. For instance, combining channel and spatial attention can provide a more comprehensive attention mechanism.
- Example: The Residual Channel Attention Network (RCAN) embeds channel attention modules within a residual network structure, enhancing the network’s ability to capture important features for image super-resolution.
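
The squeeze and excitation steps behind channel attention can be sketched without any framework. Feature maps are given as a list of channels, and the two small weight matrices `w1` and `w2` stand in for the learned fully connected layers; their values below are made up for illustration.

```python
import math

def channel_attention(feat, w1, w2):
    """feat: list of C channels (each h x w). w1: (C/r) x C, w2: C x (C/r)."""
    # Squeeze: global average pooling gives one number per channel.
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feat]
    # Excitation: bottleneck layer with ReLU, then sigmoid gate per channel.
    hidden = [max(0.0, sum(w * z for w, z in zip(row, squeezed))) for row in w1]
    scales = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
              for row in w2]
    # Recalibration: multiply every channel by its gate value.
    scaled = [[[v * s for v in row] for row in ch] for ch, s in zip(feat, scales)]
    return scaled, scales

feat = [[[1.0, 1.0], [1.0, 1.0]], [[0.0, 0.0], [0.0, 0.0]]]
w1 = [[1.0, 1.0]]       # hypothetical weights, reduction to 1 hidden unit
w2 = [[2.0], [-2.0]]    # hypothetical weights
scaled, scales = channel_attention(feat, w1, w2)
print(scales)
```

With these weights the first channel receives a gate above 0.5 and the second a gate below 0.5, so the first is emphasized and the second suppressed.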

Learning Strategies

Loss Functions: Early methods used pixel-wise L2 loss, while newer approaches incorporate more complex losses like content loss, adversarial loss, and perceptual loss to improve the quality of the reconstructed images.
Training Techniques: Techniques such as curriculum learning, multi-supervision, and progressive learning are used to enhance the training process and improve model performance.
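
For reference, the pixel-wise L2 loss mentioned above is just the mean squared difference over all pixels. The sketch below also includes the L1 variant (mean absolute difference) for comparison; that addition and the function names are mine, and the content, adversarial, and perceptual losses need a trained network and are not shown.

```python
def l2_loss(pred, target):
    """Mean squared error over all pixels of two same-sized 2-D lists."""
    diffs = [p - t for pr, tr in zip(pred, target) for p, t in zip(pr, tr)]
    return sum(d * d for d in diffs) / len(diffs)

def l1_loss(pred, target):
    """Mean absolute error over all pixels of two same-sized 2-D lists."""
    diffs = [p - t for pr, tr in zip(pred, target) for p, t in zip(pr, tr)]
    return sum(abs(d) for d in diffs) / len(diffs)

pred = [[0, 2], [4, 6]]
target = [[0, 0], [0, 0]]
print(l2_loss(pred, target))  # 14.0
print(l1_loss(pred, target))  # 3.0
```

The squared term makes L2 penalize large errors much more heavily than small ones.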

Unsupervised Super-Resolution

Unsupervised methods do not rely on paired LR-HR datasets. Instead, they use generative models and adversarial training to learn the mapping from LR to HR images. Techniques include CycleGAN-style training, which learns the LR-to-HR mapping together with the reverse HR-to-LR mapping from unpaired images.

Domain-Specific Super-Resolution

Domain-specific methods focus on specific applications such as face SR, text SR, and medical image SR. These methods leverage domain knowledge to improve the quality of SR in specific contexts.

Benchmark Datasets and Performance Evaluation

Several benchmark datasets are used for evaluating SR models, including Set5, Set14, BSD100, and Urban100. Common evaluation metrics include Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM).

Metrics

While PSNR is widely used, it does not always correlate well with human perception of image quality. SSIM addresses this by considering luminance, contrast, and structure.

PSNR is one of the most popular reconstruction quality measures for lossy transformations. In the context of SR, it is defined via the maximum pixel value ($L$) and the mean squared error (MSE) between the images.
\[PSNR=10\cdot\log_{10}\left(\frac{L^2}{\frac{1}{N}\sum_{i=1}^N(I(i)-\hat{I}(i))^2}\right)\]
- $I(i)$ and $\hat{I}(i)$ represent the pixel values of the original and reconstructed images, respectively, and $N$ is the total number of pixels.
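
The PSNR definition translates directly into code. This sketch assumes 8-bit images, so $L = 255$ by default, and returns infinity when the images are identical; both choices are assumptions of the example.

```python
import math

def psnr(ref, rec, max_val=255.0):
    """PSNR in dB between two same-sized 2-D lists of pixel values."""
    diffs = [a - b for ra, rb in zip(ref, rec) for a, b in zip(ra, rb)]
    mse = sum(d * d for d in diffs) / len(diffs)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)

ref = [[10, 20], [30, 40]]
rec = [[11, 19], [31, 41]]      # every pixel is off by 1, so MSE = 1
print(round(psnr(ref, rec), 2))  # 48.13
```

Because PSNR depends only on the pixel-wise MSE, two reconstructions with very different visual quality can receive the same score.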
SSIM measures the structural similarity between images based on independent comparisons of luminance, contrast, and structure. It builds on the assumption that the human visual system (HVS) is highly adapted to extracting image structures.
\[SSIM(I,\hat{I})=\frac{(2\mu_I\mu_{\hat{I}}+C_1)(2\sigma_{I\hat{I}}+C_2)}{(\mu_I^2+\mu_{\hat{I}}^2+C_1)(\sigma_I^2+\sigma_{\hat{I}}^2+C_2)}\]
- $\mu_I$ and $\mu_{\hat{I}}$ are the mean pixel values of the original and reconstructed images, respectively. $\sigma_I^2$ and $\sigma_{\hat{I}}^2$ are the variances, and $\sigma_{I\hat{I}}$ is the covariance of $I$ and $\hat{I}$. $C_1$ and $C_2$ are constants that stabilize the division when the denominators are small.
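
A minimal sketch of the SSIM formula computed once over the whole image. Several details are assumptions of this example: the constants are set as $C_1=(k_1L)^2$ and $C_2=(k_2L)^2$ with the commonly used $k_1=0.01$, $k_2=0.03$, variances use a $1/N$ divisor, and no local windows are used. Practical implementations usually compute SSIM over sliding windows and average the results.

```python
def ssim_global(a, b, max_val=255.0, k1=0.01, k2=0.03):
    """Single-window SSIM between two same-sized 2-D lists of pixel values."""
    x = [v for row in a for v in row]
    y = [v for row in b for v in row]
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((v - mu_x) ** 2 for v in x) / n
    var_y = sum((v - mu_y) ** 2 for v in y) / n
    cov = sum((p - mu_x) * (q - mu_y) for p, q in zip(x, y)) / n
    c1, c2 = (k1 * max_val) ** 2, (k2 * max_val) ** 2
    num = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return num / den

checker = [[0, 255], [255, 0]]
inverted = [[255, 0], [0, 255]]
print(ssim_global(checker, checker))   # 1.0 up to rounding
print(ssim_global(checker, inverted))  # negative: structures are anti-correlated
```

The inverted checkerboard has the same mean and variance as the original, so only the covariance term separates the two cases.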

Challenges and Future Directions

Scalability: Developing SR models that can handle varying scales and resolutions efficiently.
Real-World Applications: Enhancing SR models to perform well on real-world images with diverse degradation.
Efficiency: Reducing computational complexity and memory usage while maintaining high performance.
Generality: Creating SR models that generalize well across different types of images and domains.
Perceptual Quality: Improving the perceptual quality of SR images, ensuring that they are visually appealing and free from artifacts.

Conclusion

The survey paper provides an in-depth review of deep learning-based super-resolution techniques, categorizing them into supervised, unsupervised, and domain-specific methods. It discusses various network architectures, upsampling techniques, and learning strategies, highlighting the advancements and challenges in the field. The paper also covers benchmark datasets and performance evaluation metrics, providing a comprehensive overview of the current state of image super-resolution research.
