SPADE

SPADE (Spatially-Adaptive Normalization)

Paper: Semantic Image Synthesis with Spatially-Adaptive Normalization (CVPR 2019)
Code: NVlabs/SPADE

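SPADE is a normalization technique for semantic image synthesis. Its main goal is to stop conventional normalization layers (such as Batch Norm and Instance Norm) from washing away the semantic layout information during generation. The generator built on SPADE is commonly known as GauGAN.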

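1. Core Motivation

Before SPADE, the dominant approaches to semantic image synthesis (generating a realistic image from a semantic segmentation mask), such as Pix2PixHD, fed the semantic mask directly into the network as its input. The intermediate layers of these networks, however, usually contain normalization layers.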

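Problem: standard normalization operations (BN, IN, LN) rescale feature maps to zero mean and unit variance. This tends to wash away the semantic information in the input features, especially when the semantic mask contains large uniform regions. For example, a large area of grass and a large area of sky end up with the same distribution after normalization.
Result: to compensate for this information loss, earlier models had to rely on very deep network structures or more elaborate tricks.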
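The wash-away effect can be reproduced with a few lines of plain Python. This is only a sketch: `instance_norm` is a hypothetical helper that normalizes a single channel, and the constant values 3.0 and 7.0 are assumed stand-ins for the activations a convolution might produce on an all-sky and an all-grass mask.

```python
import math


def instance_norm(channel, eps=1e-5):
    """Normalize one 2-D channel to zero mean and unit variance."""
    values = [v for row in channel for v in row]
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return [[(v - mu) / math.sqrt(var + eps) for v in row] for row in channel]


# Spatially constant activations for two different labels (assumed values).
sky = [[3.0] * 4 for _ in range(4)]
grass = [[7.0] * 4 for _ in range(4)]

norm_sky = instance_norm(sky)
norm_grass = instance_norm(grass)

# Both channels collapse to all zeros, so the label can no longer be recovered.
print(norm_sky == norm_grass)  # True
```

After normalization the two inputs are indistinguishable, which is the information loss described above.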

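2. How the SPADE Module Works

SPADE (SPatially-Adaptive DE-normalization) is a new conditional normalization method. Instead of learning a single set of per-channel affine parameters ($\gamma, \beta$) shared by all spatial positions, it generates $\gamma$ and $\beta$ dynamically for every spatial position from the input semantic mask.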

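Formula

Let $h$ denote the activations (feature map) of a given layer, with batch size $N$, $C$ channels, and spatial size $H \times W$. The output activation at index $(n, c, y, x)$ is: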

$$ \gamma_{c,y,x}(m) \cdot \frac{h_{n,c,y,x} - \mu_c}{\sigma_c} + \beta_{c,y,x}(m) $$

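where:

$\mu_c, \sigma_c$ are the mean and standard deviation of the activations in channel $c$ (computed as in the chosen parameter-free normalization, e.g. Instance Norm or Sync Batch Norm).
Key point: $\gamma_{c,y,x}(m)$ and $\beta_{c,y,x}(m)$ are functions of the input semantic mask $m$.
Both are produced from the semantic mask $m$ by a simple two-layer convolutional network. Different positions in the image (for example the sky region and the grass region) are therefore de-normalized with different scale and shift parameters.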
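As a worked example of the formula, the sketch below applies it to a single channel in plain Python. It makes one loud simplification: the lookup tables `GAMMA` and `BETA` are hypothetical per-label constants standing in for the outputs of the convolutional network, whereas in the real module each value also depends on the neighbouring labels. The function name `spade_channel` is likewise illustrative.

```python
import math

# One channel of a feature map with N=1, H=2, W=4.
h = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
# Semantic mask m: label 0 ("sky") on the top row, label 1 ("grass") below.
m = [[0, 0, 0, 0],
     [1, 1, 1, 1]]

# Hypothetical modulation values per label (assumed, not learned here).
GAMMA = {0: 0.5, 1: 2.0}
BETA = {0: -1.0, 1: 1.0}


def spade_channel(h, m, eps=1e-5):
    """Compute gamma(m) * (h - mu_c) / sigma_c + beta(m) for one channel."""
    values = [v for row in h for v in row]
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    sigma = math.sqrt(var + eps)
    return [
        [GAMMA[m[y][x]] * (h[y][x] - mu) / sigma + BETA[m[y][x]]
         for x in range(len(h[y]))]
        for y in range(len(h))
    ]


out = spade_channel(h, m)
for row in out:
    print([round(v, 3) for v in row])
```

The normalization statistics are shared by the whole channel, while the scale and shift change from the sky row to the grass row.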

Official Implementation

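import re

import torch.nn as nn
import torch.nn.functional as F
# SynchronizedBatchNorm2d ships with the NVlabs/SPADE repository.
from models.networks.sync_batchnorm import SynchronizedBatchNorm2d


# Creates SPADE normalization layer based on the given configuration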
# SPADE consists of two steps. First, it normalizes the activations using
# your favorite normalization method, such as Batch Norm or Instance Norm.
# Second, it applies scale and bias to the normalized output, conditioned on
# the segmentation map.
# The format of |config_text| is spade(norm)(ks), where
# (norm) specifies the type of parameter-free normalization.
#       (e.g. syncbatch, batch, instance)
# (ks) specifies the size of kernel in the SPADE module (e.g. 3x3)
# Example |config_text| will be spadesyncbatch3x3, or spadeinstance5x5.
# Also, the other arguments are
# |norm_nc|: the #channels of the normalized activations, hence the output dim of SPADE
# |label_nc|: the #channels of the input semantic map, hence the input dim of SPADE
class SPADE(nn.Module):
    def __init__(self, config_text, norm_nc, label_nc):
        super().__init__()

        assert config_text.startswith('spade')
        parsed = re.search(r'spade(\D+)(\d)x\d', config_text)
        param_free_norm_type = str(parsed.group(1))
        ks = int(parsed.group(2))

        if param_free_norm_type == 'instance':
            self.param_free_norm = nn.InstanceNorm2d(norm_nc, affine=False)
        elif param_free_norm_type == 'syncbatch':
            self.param_free_norm = SynchronizedBatchNorm2d(norm_nc, affine=False)
        elif param_free_norm_type == 'batch':
            self.param_free_norm = nn.BatchNorm2d(norm_nc, affine=False)
        else:
            raise ValueError('%s is not a recognized param-free norm type in SPADE'
                             % param_free_norm_type)

        # The dimension of the intermediate embedding space. Yes, hardcoded.
        nhidden = 128

        pw = ks // 2
        self.mlp_shared = nn.Sequential(
            nn.Conv2d(label_nc, nhidden, kernel_size=ks, padding=pw),
            nn.ReLU()
        )
        self.mlp_gamma = nn.Conv2d(nhidden, norm_nc, kernel_size=ks, padding=pw)
        self.mlp_beta = nn.Conv2d(nhidden, norm_nc, kernel_size=ks, padding=pw)

    def forward(self, x, segmap):

        # Part 1. generate parameter-free normalized activations
        normalized = self.param_free_norm(x)

        # Part 2. produce scaling and bias conditioned on semantic map
        segmap = F.interpolate(segmap, size=x.size()[2:], mode='nearest')
        actv = self.mlp_shared(segmap)
        gamma = self.mlp_gamma(actv)
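        beta = self.mlp_beta(actv)

        # Note: this differs slightly from the formula in the paper. Using (1 + gamma)
        # makes the layer learn a residual scale, so the normalized features are not
        # wiped out when gamma is close to zero (as they would be without the "1 +").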
        # apply scale and bias
        out = normalized * (1 + gamma) + beta

        return out
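The `config_text` format can be checked in isolation. The sketch below reuses the regular expression from `SPADE.__init__` inside a hypothetical helper, `parse_spade_config` (not part of the official code), and also returns the padding that follows from `pw = ks // 2`.

```python
import re


def parse_spade_config(config_text):
    """Hypothetical helper mirroring the parsing done in SPADE.__init__."""
    parsed = re.search(r'spade(\D+)(\d)x\d', config_text)
    if parsed is None:
        raise ValueError('%s is not a valid SPADE config string' % config_text)
    norm_type = parsed.group(1)
    ks = int(parsed.group(2))
    pw = ks // 2  # "same" padding for odd kernel sizes
    return norm_type, ks, pw


print(parse_spade_config('spadesyncbatch3x3'))  # ('syncbatch', 3, 1)
print(parse_spade_config('spadeinstance5x5'))   # ('instance', 5, 2)
```

Note that the pattern captures a single digit and only the first of the two sizes, so a string such as `spadebatch11x11` does not match at all, and `spadeinstance3x5` would yield a kernel size of 3.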

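3. GauGAN Architecture

GauGAN is the generator architecture built from SPADE modules.

Input: the generator no longer takes the semantic mask as its input; it starts from a randomly sampled noise vector (or an all-zero tensor). Changing the noise changes the style of the generated image (multi-modal synthesis).
Semantic injection: the semantic mask $m$ is no longer fed to the input layer; it is injected into every layer of the network through the SPADE modules.
Structure:
- It contains multiple ResNet blocks.
- The BN/IN layers in each ResNet block are replaced with SPADE.
- Upsampling layers progressively increase the resolution.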
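The structure above can be sketched as a toy PyTorch generator. Everything here is an illustrative assumption rather than the official architecture: the class names `MiniSPADE`, `SPADEResBlock`, and `TinyGenerator`, the channel widths, the two-block depth, and the 16x16 output size are all made up for the example, and details of the real model are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MiniSPADE(nn.Module):
    """Simplified SPADE layer: instance norm plus mask-conditioned scale and bias."""

    def __init__(self, norm_nc, label_nc, nhidden=32):
        super().__init__()
        self.norm = nn.InstanceNorm2d(norm_nc, affine=False)
        self.shared = nn.Sequential(
            nn.Conv2d(label_nc, nhidden, 3, padding=1), nn.ReLU())
        self.gamma = nn.Conv2d(nhidden, norm_nc, 3, padding=1)
        self.beta = nn.Conv2d(nhidden, norm_nc, 3, padding=1)

    def forward(self, x, segmap):
        # Resize the mask to the resolution of the current feature map.
        segmap = F.interpolate(segmap, size=x.shape[2:], mode='nearest')
        actv = self.shared(segmap)
        return self.norm(x) * (1 + self.gamma(actv)) + self.beta(actv)


class SPADEResBlock(nn.Module):
    """Residual block whose normalization layers are SPADE layers."""

    def __init__(self, fin, fout, label_nc):
        super().__init__()
        self.norm_0 = MiniSPADE(fin, label_nc)
        self.conv_0 = nn.Conv2d(fin, fout, 3, padding=1)
        self.norm_1 = MiniSPADE(fout, label_nc)
        self.conv_1 = nn.Conv2d(fout, fout, 3, padding=1)
        # 1x1 convolution on the skip path only when the channel count changes.
        self.skip = (nn.Conv2d(fin, fout, 1, bias=False)
                     if fin != fout else nn.Identity())

    def forward(self, x, segmap):
        dx = self.conv_0(F.leaky_relu(self.norm_0(x, segmap), 0.2))
        dx = self.conv_1(F.leaky_relu(self.norm_1(dx, segmap), 0.2))
        return self.skip(x) + dx


class TinyGenerator(nn.Module):
    """Noise vector in, image out; the mask enters only through SPADE layers."""

    def __init__(self, label_nc, z_dim=16, nf=8):
        super().__init__()
        self.nf = nf
        self.fc = nn.Linear(z_dim, 4 * nf * 4 * 4)
        self.block_0 = SPADEResBlock(4 * nf, 2 * nf, label_nc)
        self.block_1 = SPADEResBlock(2 * nf, nf, label_nc)
        self.to_rgb = nn.Conv2d(nf, 3, 3, padding=1)

    def forward(self, z, segmap):
        x = self.fc(z).view(-1, 4 * self.nf, 4, 4)
        x = self.block_0(x, segmap)
        x = F.interpolate(x, scale_factor=2)  # 4x4 -> 8x8
        x = self.block_1(x, segmap)
        x = F.interpolate(x, scale_factor=2)  # 8x8 -> 16x16
        return torch.tanh(self.to_rgb(F.leaky_relu(x, 0.2)))


# Usage: a random 5-class label map, one-hot encoded to (N, label_nc, H, W).
labels = torch.randint(0, 5, (2, 16, 16))
segmap = F.one_hot(labels, num_classes=5).permute(0, 3, 1, 2).float()
gen = TinyGenerator(label_nc=5)
img = gen(torch.randn(2, 16), segmap)
print(img.shape)  # torch.Size([2, 3, 16, 16])
```

The same mask is passed to every block and resized inside each SPADE layer, so the layout is re-injected at every resolution while the noise vector only sets the starting features.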

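4. Advantages and Comparison

Property	Pix2PixHD	SPADE (GauGAN)
Normalization	Instance Norm	Spatially-Adaptive Norm
Where semantics enter	Input layer only	Injected at every layer
Semantic preservation	Weaker, details easily lost	Strong, closely follows the layout
Style control	Harder	Easy (via the input noise)
Parameter count	Larger	Smaller (more efficient)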

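5. Summary

By tying the affine parameters of the normalization layers to the spatial semantic layout, SPADE addresses the loss of semantic information in deep networks. It produces higher-quality images and follows the user-provided semantic layout more faithfully, and it is widely regarded as a milestone in conditional image generation.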