现有模型信息记录

底模SD1.4

BK-SDM

https://github.com/Nota-NetsPresso/BK-SDM 数据来源：BK-SDM ModelCard

训练信息

训练集：BK-SDM数据

Training Data
- BK-SDM: 212,776 image-text pairs (i.e., 0.22M pairs) from LAION-Aesthetics V2 6.5+.
- BK-SDM-2M: 2,256,472 image-text pairs (i.e., 2.3M pairs) from LAION-Aesthetics V2 6.25+.
Hardware: A single NVIDIA A100 80GB GPU
Gradient Accumulations: 4
Batch: 256 (=4×64)
Optimizer: AdamW
Learning Rate: a constant learning rate of 5e-5 for 50K-iteration pretraining

测试信息

从数据来源拷贝整理测试集：MS COCO 30k，可以通过BK-SDM的脚本下载或者见文件链接

The following table shows the results on 30K samples from the MS-COCO validation split. After generating 512×512 images with the PNDM scheduler and 25 denoising steps, we downsampled them to 256×256 for evaluating generation scores.

Zero-shot MS-COCO 256×256 30K

Our models were drawn at the 50K-th training iteration.

Model	FID↓	IS↑	CLIP Score↑ (ViT-g/14)	# Params, U-Net	# Params, Whole SDM
Stable Diffusion v1.4	13.05	36.76	0.2958	0.86B	1.04B
BK-SDM-Base (Ours)	15.76	33.79	0.2878	0.58B	0.76B
BK-SDM-Base-2M (Ours)	14.81	34.17	0.2883	0.58B	0.76B
BK-SDM-Small (Ours)	16.98	31.68	0.2677	0.49B	0.66B
BK-SDM-Small-2M (Ours)	17.05	33.10	0.2734	0.49B	0.66B
BK-SDM-Tiny (Ours)	17.12	30.09	0.2653	0.33B	0.50B
BK-SDM-Tiny-2M (Ours)	17.53	31.32	0.2690	0.33B	0.50B
The following figure depicts synthesized images with some MS-COCO captions.

Effect of Different Data Sizes for Training BK-SDM-Small

Increasing the number of training pairs improves the IS and CLIP scores over training progress. The MS-COCO 256×256 30K benchmark was used for evaluation. Furthermore, with the growth in data volume, visual results become more favorable (e.g., better image-text alignment and clear distinction among objects).

Additional Visual Examples

Personalized Generation (Full Finetuning)

To show the applicability of our lightweight SD backbones, we use DreamBooth finetuning for personalized generation.

Each subject is marked as “a [identifier] [class noun]” (e.g., “a [V] dog”).
Our BK-SDMs can synthesize the input subjects in different backgrounds while preserving their appearance.

底模SD1.5

segmind/tiny-sd

segmind/tiny-sd · Hugging Face

训练信息

code: segmind/distill-sd: Segmind Distilled diffusion (github.com)

Base Model: SG161222/Realistic_Vision_V4.0_noVAE · Hugging Face
Training Data
- Subset of recastai/LAION-art-EN-improved-captions dataset
- fantasyfish/laion-art（官方示例给的数据集）
Steps: 125000
Gradient Accumulations: 4
Batch: 128 (=4×32)
Mixed-precision: fp16
Image resolution: 512
Learning Rate: 1e-4

segmind/portrait-finetuned

segmind/portrait-finetuned · Hugging Face

大部分信息同segmind/tiny-sd，不过是在portrait images上面finetune过的模型。

训练信息

Base Model: segmind/tiny-sd · Hugging Face
Training Data
- 没细说，只说是portrait images，7k images
Steps: 131000
Gradient Accumulations: 4
Batch: 128 (=4×32)
Mixed-precision: fp16
Image resolution: 768
Learning Rate: 1e-4

测试信息汇总

模型	FID-mscoco 30k	FID-mscoco 5k	Unet Params	Text Encoder Params	Image Decoder Parms	Whole Parms
runwayml/stable-diffusion-v1-5	13.8705	19.4507	0.860B	0.123B	0.05B	1.032B
segmind/tiny-sd	17.7266	23.0478	0.323B	0.123B	0.05B	0.495B
tiny-sd-lcm-6400	22.1493	27.4531	0.323B	0.123B	0.05B	0.495B