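Common Operators

Intermediate Layers

Conv (Convolutional Layer)

The convolutional layer is a core operator in computer vision, used to extract image features. The most common variant is 2D convolution, nn.Conv2d.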

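Principle

A learnable kernel (filter) slides over the input feature map. At each position the kernel and the window it covers are multiplied element-wise and summed. Strictly speaking this is a cross-correlation, since the kernel is not flipped. Each output value therefore summarizes one local region of the input.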
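The sliding-window computation described above can be sketched by hand and compared against PyTorch. This is a minimal illustration, assuming a single channel, no padding, and stride 1; the helper name cross_correlate2d is hypothetical and not part of PyTorch.

```python
import torch
import torch.nn.functional as F

def cross_correlate2d(x, k):
    """Naive single-channel 2D cross-correlation (no padding, stride 1)."""
    kh, kw = k.shape
    out_h = x.shape[0] - kh + 1
    out_w = x.shape[1] - kw + 1
    out = torch.zeros(out_h, out_w)
    for i in range(out_h):
        for j in range(out_w):
            # Element-wise product of the current window and the kernel, then sum.
            out[i, j] = (x[i:i + kh, j:j + kw] * k).sum()
    return out

x = torch.arange(16, dtype=torch.float32).reshape(4, 4)
k = torch.tensor([[1.0, 0.0], [0.0, -1.0]])

manual = cross_correlate2d(x, k)
# F.conv2d expects input [N, C, H, W] and weight [out_C, in_C, kH, kW].
reference = F.conv2d(x[None, None], k[None, None])[0, 0]
print(torch.allclose(manual, reference))  # True
```

The explicit loops are only for readability; real implementations use heavily optimized kernels.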

PyTorch Implementation

torch.nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True, padding_mode='zeros', device=None, dtype=None)

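Parameters

in_channels (int): Number of channels in the input feature map, e.g. 3 for an RGB image.
out_channels (int): Number of channels in the output feature map (i.e. the number of filters).
kernel_size (int or tuple): Size of the kernel. 3 means 3x3; (3, 5) means 3x5.
stride (int or tuple): Step size of the sliding kernel. Default: 1. A stride > 1 shrinks the output (downsampling).
padding (int, tuple or str): Amount of padding added to each side of the input. Often used to keep the output the same size as the input ("same" padding).
- Default: 0.
- 'valid': no padding.
- 'same': pads so that the output size equals the input size. PyTorch only supports this mode with stride=1.
dilation (int or tuple): Dilation rate for dilated convolution. Default: 1 (ordinary convolution). Larger values widen the receptive field without adding parameters.
groups (int): Number of groups for grouped convolution. Default: 1. Both in_channels and out_channels must be divisible by groups.
- groups=1: ordinary convolution.
- groups=in_channels with out_channels=in_channels: depthwise convolution.
bias (bool): Whether to add a learnable bias. Default: True. When the convolution is followed directly by BatchNorm it is usually set to False: BN subtracts the per-channel mean, which cancels any constant bias, and BN's own affine shift takes over that role.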
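To make the effect of groups concrete, the following sketch counts weights for an ordinary, a grouped, and a depthwise convolution. The helper count_params and the channel sizes are illustrative assumptions; bias is disabled so that only kernel weights are counted.

```python
import torch.nn as nn

def count_params(module):
    """Total number of learnable scalars in a module."""
    return sum(p.numel() for p in module.parameters())

# Weight shape is [out_channels, in_channels / groups, kH, kW].
standard = nn.Conv2d(16, 32, kernel_size=3, bias=False)              # 32 * 16 * 3 * 3
grouped = nn.Conv2d(16, 32, kernel_size=3, groups=4, bias=False)     # 32 * 4 * 3 * 3
depthwise = nn.Conv2d(16, 16, kernel_size=3, groups=16, bias=False)  # 16 * 1 * 3 * 3

print(count_params(standard))   # 4608
print(count_params(grouped))    # 1152
print(count_params(depthwise))  # 144
```

Each filter only sees in_channels / groups input channels, which is where the savings come from.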

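Output Size Formula

Given input size $(H_{in}, W_{in})$ and output size $(H_{out}, W_{out})$:

$$ H_{out} = \left\lfloor \frac{H_{in} + 2 \times \text{padding}[0] - \text{dilation}[0] \times (\text{kernel\_size}[0] - 1) - 1}{\text{stride}[0]} + 1 \right\rfloor $$

$$ W_{out} = \left\lfloor \frac{W_{in} + 2 \times \text{padding}[1] - \text{dilation}[1] \times (\text{kernel\_size}[1] - 1) - 1}{\text{stride}[1]} + 1 \right\rfloor $$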
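The formula can be turned into a small helper for checking layer configurations before building a model. This is a sketch for one spatial dimension; the function name conv_out_size is an assumption made for illustration.

```python
import math

def conv_out_size(size_in, kernel_size, stride=1, padding=0, dilation=1):
    """Output length along one spatial dimension, following the formula above."""
    return math.floor(
        (size_in + 2 * padding - dilation * (kernel_size - 1) - 1) / stride + 1
    )

print(conv_out_size(32, kernel_size=3, stride=1, padding=1))  # 32 (size preserved)
print(conv_out_size(32, kernel_size=3, stride=2, padding=1))  # 16 (halved)
print(conv_out_size(32, kernel_size=3, dilation=2))           # 28 (effective kernel is 5)
```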

Example

import torch
import torch.nn as nn

# 1. Basic convolution: spatial size preserved (same padding)
# Input: [Batch, 3, 32, 32] -> Output: [Batch, 16, 32, 32]
# Padding = (Kernel - 1) / 2 = 1
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1)
x = torch.randn(1, 3, 32, 32)
out = conv(x)
print(f"Shape: {out.shape}") # torch.Size([1, 16, 32, 32])

# 2. Downsampling (stride=2)
# Input: [Batch, 16, 32, 32] -> Output: [Batch, 32, 16, 16]
conv_down = nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1)
out_down = conv_down(out)
print(f"Downsampled Shape: {out_down.shape}") # torch.Size([1, 32, 16, 16])

# 3. Depthwise separable convolution
# Two steps: depthwise (groups=in_channels) + pointwise (1x1 conv)
dw_conv = nn.Conv2d(16, 16, kernel_size=3, groups=16) # Depthwise
pw_conv = nn.Conv2d(16, 32, kernel_size=1)             # Pointwise

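Matmul / Linear (Fully Connected Layer & Matrix Multiplication)

Used for linear transformations of features and for classification heads.

1. `torch.matmul`

Principle: matrix multiplication, with broadcasting over the leading (batch) dimensions.
Usage: torch.matmul(input, other), or the @ operator.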

Example:

import torch
# 2D matrix multiplication
# [B, N] x [N, M] -> [B, M]
A = torch.randn(10, 20)
B = torch.randn(20, 30)
C = torch.matmul(A, B) # Size: [10, 30]

# 3D batched matmul (common in attention)
# [B, H, W] x [B, W, K] -> [B, H, K]
x = torch.randn(2, 3, 4)
y = torch.randn(2, 4, 5)
z = torch.matmul(x, y) # Size: [2, 3, 5]
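The broadcasting behavior mentioned above is easiest to see when the two operands have different numbers of dimensions. The sketch below uses arbitrary shapes chosen for illustration: a 2D matrix is reused for every batch element, and a 1D vector is treated as a matrix-vector product.

```python
import torch

batch = torch.randn(2, 3, 4)   # [B, H, W]
shared = torch.randn(4, 5)     # [W, K], no batch dimension

# The 2D operand is broadcast across the batch dimension.
out = torch.matmul(batch, shared)
print(out.shape)  # torch.Size([2, 3, 5])

# Same result as multiplying each batch element separately.
expected = torch.stack([batch[0] @ shared, batch[1] @ shared])
print(torch.allclose(out, expected, atol=1e-5))  # True

# A 1D second operand gives a batched matrix-vector product.
v = torch.randn(4)
print(torch.matmul(batch, v).shape)  # torch.Size([2, 3])
```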

2. `nn.Linear`

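Principle: applies the linear transformation $y = xA^T + b$ to the input, where $A$ is the weight matrix of shape [out_features, in_features].
Parameters:
- in_features: size of each input sample.
- out_features: size of each output sample.
- bias: whether to include a bias term. Default: True.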

Example:

import torch
import torch.nn as nn
# Input: [Batch, 512] -> Output: [Batch, 10] (e.g. a classification task)
fc = nn.Linear(in_features=512, out_features=10)
x = torch.randn(32, 512)
out = fc(x)
print(out.shape) # torch.Size([32, 10])
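To connect nn.Linear back to the formula $y = xA^T + b$, the following sketch reproduces the layer's output with an explicit matrix product. The small sizes (4 inputs, 3 outputs) are chosen only for illustration.

```python
import torch
import torch.nn as nn

fc = nn.Linear(in_features=4, out_features=3)
x = torch.randn(2, 4)

# The weight is stored as [out_features, in_features], hence the transpose.
print(fc.weight.shape)  # torch.Size([3, 4])
print(fc.bias.shape)    # torch.Size([3])

manual = x @ fc.weight.T + fc.bias
print(torch.allclose(fc(x), manual, atol=1e-6))  # True
```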

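Pooling (Pooling Layer)

Used to downsample feature maps, which reduces computation in later layers and enlarges the receptive field. Pooling layers have no learnable parameters of their own.

nn.MaxPool2d (max pooling): takes the maximum value in each window. Keeps the most salient features (e.g. texture).
nn.AvgPool2d (average pooling): takes the mean of each window. Keeps overall features such as background.

Parameters

For nn.MaxPool2d(kernel_size, stride=None, padding=0, ...):

kernel_size: window size.
stride: step size. Defaults to kernel_size.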

Example:

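import torch
import torch.nn as nn
x = torch.randn(1, 64, 32, 32)  # Input: [1, 64, 32, 32]
pool = nn.MaxPool2d(kernel_size=2, stride=2)
out = pool(x)
print(out.shape)  # torch.Size([1, 64, 16, 16]) (spatial size halved)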
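The difference between max and average pooling is easier to see on a tiny hand-written input. The 4x4 values below are made up for illustration; each 2x2 window is reduced to a single number.

```python
import torch
import torch.nn as nn

x = torch.tensor([[1.0, 2.0, 5.0, 6.0],
                  [3.0, 4.0, 7.0, 8.0],
                  [9.0, 10.0, 13.0, 14.0],
                  [11.0, 12.0, 15.0, 16.0]]).reshape(1, 1, 4, 4)

# stride defaults to kernel_size, so the windows do not overlap.
max_out = nn.MaxPool2d(kernel_size=2)(x)
avg_out = nn.AvgPool2d(kernel_size=2)(x)

print(max_out[0, 0])
# tensor([[ 4.,  8.],
#         [12., 16.]])
print(avg_out[0, 0])
# tensor([[ 2.5000,  6.5000],
#         [10.5000, 14.5000]])
```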

Output Layers

Softmax

Basics

Converts the outputs of N neurons into a predicted probability distribution.

Principle

Formula:

$$ \text{Softmax}(x_{i}) = \frac{\exp(x_i)}{\sum_j \exp(x_j)} $$

This guarantees that every output lies in (0, 1) and that the outputs sum to 1.

PyTorch Implementation

torch.nn.Softmax(dim=None)

dim (int): the dimension along which to normalize.
- For a 2D input [Batch, Classes], dim=1 is the usual choice.

import torch
import torch.nn as nn

softmax = nn.Softmax(dim=1)
logits = torch.tensor([[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]])
probs = softmax(logits)

print(probs)
# tensor([[0.0900, 0.2447, 0.6652],
#         [0.0900, 0.2447, 0.6652]])
print(probs.sum(dim=1)) 
# tensor([1., 1.])

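Notes

Numerical stability: computing $\exp(x)$ directly overflows easily for large logits. PyTorch's CrossEntropyLoss combines LogSoftmax and NLLLoss internally, so when training a classifier the model's last layer should not apply Softmax; it should output raw logits. Adding Softmax before CrossEntropyLoss would normalize twice and distort the loss.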
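The overflow problem and the usual remedy can be demonstrated in a few lines. This is a sketch with deliberately extreme logits chosen for illustration: it shows the naive formula failing, the max-subtraction trick (which leaves the result unchanged because the shift cancels in the ratio), and the equivalence between CrossEntropyLoss and LogSoftmax followed by NLLLoss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.tensor([[1000.0, 1001.0, 1002.0]])

# Naive softmax: exp(1000) overflows to inf in float32, and inf / inf is nan.
naive = torch.exp(logits) / torch.exp(logits).sum(dim=1, keepdim=True)
print(naive)  # tensor([[nan, nan, nan]])

# Subtracting the row maximum keeps every exponent <= 0.
shifted = logits - logits.max(dim=1, keepdim=True).values
stable = torch.exp(shifted) / torch.exp(shifted).sum(dim=1, keepdim=True)
print(torch.allclose(stable, F.softmax(logits, dim=1)))  # True

# CrossEntropyLoss on raw logits matches LogSoftmax followed by NLLLoss.
target = torch.tensor([2])
ce = nn.CrossEntropyLoss()(logits, target)
nll = nn.NLLLoss()(F.log_softmax(logits, dim=1), target)
print(torch.allclose(ce, nll))  # True
```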

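Temperature in LLMs

Principle

During LLM (large language model) generation with sampling, temperature is a key hyperparameter. It is applied directly to the logits before Softmax.

The formula becomes:

$$ p_i = \frac{\exp(x_i / T)}{\sum_j \exp(x_j / T)} $$

where $T$ is the temperature.

T > 1 (high temperature):
- Differences between logits shrink (divided by a number greater than 1).
- The distribution flattens, approaching uniform as T grows.
- Effect: the model more often picks tokens other than the most likely one, so output is more random and diverse, often described as more creative.
T < 1 (low temperature):
- Differences between logits grow (divided by a number less than 1).
- The distribution sharpens (the strong get stronger).
- Effect: the model favors the most likely token, so output is more deterministic and conservative. As T approaches 0 this approaches greedy (argmax) decoding.
T = 1: standard Softmax.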

PyTorch Implementation

import torch
import torch.nn.functional as F

logits = torch.tensor([1.0, 2.0, 4.0])

# 1. T = 1.0 (Standard)
probs_std = F.softmax(logits / 1.0, dim=0)
print(f"T=1.0: {probs_std.numpy()}") 
# T=1.0: [0.042 0.114 0.844] (rounded; baseline distribution)

# 2. T = 0.5 (Low Temperature - More Deterministic)
probs_low = F.softmax(logits / 0.5, dim=0)
print(f"T=0.5: {probs_low.numpy()}")
# T=0.5: [0.002 0.018 0.980] (rounded; differences amplified, the largest logit almost always wins)

# 3. T = 2.0 (High Temperature - More Random)
probs_high = F.softmax(logits / 2.0, dim=0)
print(f"T=2.0: {probs_high.numpy()}")
# T=2.0: [0.140 0.231 0.629] (rounded; differences reduced, probabilities more even)
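The probabilities above only matter once a token is actually drawn from them. The following is a minimal sketch of temperature sampling, assuming the three logits stand in for a tiny vocabulary; the helper sample_with_temperature is hypothetical, and real decoders typically combine temperature with other filters such as top-k or top-p.

```python
import torch
import torch.nn.functional as F

def sample_with_temperature(logits, temperature=1.0, generator=None):
    """Draw one index from softmax(logits / temperature)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    probs = F.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1, generator=generator).item()

logits = torch.tensor([1.0, 2.0, 4.0])
gen = torch.Generator().manual_seed(0)  # fixed seed for repeatable draws

for t in (0.5, 1.0, 2.0):
    draws = [sample_with_temperature(logits, t, gen) for _ in range(1000)]
    counts = torch.bincount(torch.tensor(draws), minlength=3)
    # Higher temperatures should spread the counts more evenly across indices.
    print(f"T={t}: {counts.tolist()}")
```

Dividing by T before the softmax is the only temperature-specific step; the sampling itself is an ordinary categorical draw.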
