MLflow

MLflow is an open-source platform for managing the machine-learning lifecycle. It aims to streamline ML development, covering experiment tracking, code packaging, model deployment, and more.

Core Components

MLflow consists of four core components:

  1. MLflow Tracking: records and queries experiment data, including code versions, parameters, metrics, and output files (artifacts).
  2. MLflow Projects: a standard format for packaging data-science code, ensuring reusability and reproducibility across environments.
  3. MLflow Models: a standard format for packaging machine-learning models so they can be deployed to many downstream tools (e.g. Docker, Apache Spark, AWS SageMaker).
  4. MLflow Model Registry: a centralized model store for managing the full model lifecycle, including versioning, stage transitions (Staging -> Production), and annotations.

Installation and Configuration

Kubernetes Deployment

Download the Helm chart

# Add the community-charts repo (if not already added)
helm repo add community-charts https://community-charts.github.io/helm-charts
# Pull the mlflow helm chart to the local machine
helm pull community-charts/mlflow --version 1.8.0
# Unpack it
tar -xf mlflow-1.8.0.tgz

Configuration

A Kubernetes deployment needs the following pieces configured:

  • Database (PostgreSQL)
  • Artifact storage (MinIO here); also enable MLflow's proxied artifact access, otherwise every client must be given its own S3 credentials via environment variables.
  • Ingress (for external access)
# -- Mlflow database connection settings
backendStore:
  # -- Specifies if you want to run database migration
  databaseMigration: true
  # -- Add an additional init container, which checks for database availability
  databaseConnectionCheck: false
  # -- Specifies the default sqlite path
  defaultSqlitePath: ":memory:"
  postgres:
    # -- Specifies if you want to use postgres backend storage
    enabled: true
    # -- Postgres host address. e.g. your RDS or Azure Postgres Service endpoint
    host: "<your_database_host>" # required
    # -- Postgres service port
    port: 5432 # required
    # -- mlflow database name created before in the postgres instance
    database: "mlflow" # required
    # -- postgres database user name which can access to mlflow database
    user: "postgres" # required
    # -- postgres database user password which can access to mlflow database
    password: "your_data_base_password" # required
    # -- postgres database connection driver. e.g.: "psycopg2"
    driver: ""
...

artifactRoot:
  # -- Specifies if you want to enable proxied artifact storage access
  proxiedArtifactStorage: true
  ...
  # -- Specifies if you want to use AWS S3 Mlflow Artifact Root
  s3:
    enabled: true
    # -- S3 bucket name
    bucket: "mlflow" # required
    # -- S3 bucket folder. If you want to use root level, please don't set anything.
    path: "" # optional
    # -- AWS IAM user AWS_ACCESS_KEY_ID which has attached policy for access to the S3 bucket
    awsAccessKeyId: "xxxxxxxxxxxxxx" # (awsAccessKeyId and awsSecretAccessKey) or roleArn serviceaccount annotation required
    # -- AWS IAM user AWS_SECRET_ACCESS_KEY which has attached policy for access to the S3 bucket
    awsSecretAccessKey: "xxxxxxxxxxxxxxxxxxxxxxxxxx" # (awsAccessKeyId and awsSecretAccessKey) or roleArn serviceaccount annotation required
    # -- Existing secret for AWS IAM user AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY secrets.
    existingSecret:
      # -- This is for setting up the AWS IAM user secrets existing secret name.
      name: ""
      # -- This is for setting up the key for AWS_ACCESS_KEY_ID secret. If it's set, awsAccessKeyId will be ignored.
      keyOfAccessKeyId: ""
      # -- This is for setting up the key for AWS_SECRET_ACCESS_KEY secret. If it's set, awsSecretAccessKey will be ignored.
      keyOfSecretAccessKey: ""


# The flag below lets mlflow proxy artifact uploads/downloads,
# so clients do not need their own S3 credentials.
extraFlags:
  - serveArtifacts

# Point the S3 endpoint at the MinIO service
extraEnvVars:
  MLFLOW_S3_ENDPOINT_URL: "http://<your_host>:9000"
  MLFLOW_S3_IGNORE_TLS: "true"  # skip TLS certificate verification

# nginx ingress configuration
ingress:
  # -- Specifies if you want to create an ingress access
  enabled: true
  # -- New style ingress class name. Only possible if you use K8s 1.18.0 or later version
  className: "nginx"
  # -- Additional ingress annotations
  annotations:
    kubernetes.io/ingress.class: nginx
    nginx.ingress.kubernetes.io/upstream-vhost: "localhost"
    nginx.ingress.kubernetes.io/proxy-body-size: "0" 
    # kubernetes.io/tls-acme: "true"
  hosts:
    - host: <your_domain>
      paths:
        - path: /
          # -- Ingress path type
          pathType: ImplementationSpecific
  # -- Ingress tls configuration for https access
  tls:
   - secretName: mlflow-tls
     hosts:
       - <your_domain>
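
With the values above saved to a file, the chart can be installed. A sketch, assuming the chart was unpacked to ./mlflow as in the pull step; the release name, namespace, and values file name are placeholders:

```shell
# Install (or upgrade) the release with the customized values
helm upgrade --install mlflow ./mlflow \
  --namespace mlflow --create-namespace \
  -f values.yaml
```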

Local Deployment

Installation

Install MLflow with pip:

pip install mlflow

Starting the Service

1. Local startup (development and debugging)

Run it directly in a terminal; by default this starts a UI service locally, with data stored in the local mlruns directory.

mlflow ui
# Default address: http://127.0.0.1:5000

2. Remote / production startup

In production you usually specify a backend store and an artifact root explicitly.

  • Backend Store: stores experiment metadata (parameters, metrics, etc.); usually MySQL or PostgreSQL.
  • Artifact Root: stores large files (model weights, images, etc.); usually S3, HDFS, or SFTP.
mlflow server \
    --host 0.0.0.0 \
    --port 5000 \
    --backend-store-uri mysql+pymysql://user:password@host:port/dbname \
    --default-artifact-root s3://my-mlflow-bucket/ \
    --serve-artifacts

Usage Guide (Python)

1. Set the Tracking URI

At the start of your code, point MLflow at the tracking server.

import mlflow

# When running `mlflow ui` locally this is usually unnecessary; the default is ./mlruns
# For a remote server:
mlflow.set_tracking_uri("http://your-mlflow-server:5000")

# Set the experiment name; it is created automatically if it does not exist
mlflow.set_experiment("My_Experiment_Name")

2. Logging Experiments

Use the mlflow.start_run() context manager to open a run.

import mlflow
import torch
import torch.nn as nn
import torch.optim as optim

# Define a simple PyTorch model
class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.linear = nn.Linear(1, 1)

    def forward(self, x):
        return self.linear(x)

# Open a run
with mlflow.start_run():
    # 1. Log parameters
    lr = 0.01
    epochs = 10
    mlflow.log_param("learning_rate", lr)
    mlflow.log_param("epochs", epochs)

    # Train the model (sketch)
    model = SimpleModel()
    criterion = nn.MSELoss()
    optimizer = optim.SGD(model.parameters(), lr=lr)

    # ... training loop ...
    # loss = ...

    # 2. Log metrics
    mlflow.log_metric("loss", 0.123)  # placeholder loss value

    # 3. Log the model
    # Use mlflow.pytorch to log a PyTorch model
    mlflow.pytorch.log_model(model, "model")

    # 4. Log arbitrary files (artifacts)
    with open("output.txt", "w") as f:
        f.write("Hello MLFlow with PyTorch")
    mlflow.log_artifact("output.txt")

3. Model Registry

When logging a model, you can register it in the Model Registry directly.

# Option 1: register while logging the model
mlflow.pytorch.log_model(
    pytorch_model=model,
    artifact_path="model",
    registered_model_name="MyPyTorchModel"
)

# Option 2: register an existing run's model via the API
result = mlflow.register_model(
    "runs:/<run_id>/model",
    "MyPyTorchModel"
)

4. Loading Models

Load a model from MLflow for inference.

import mlflow
import torch

# Load a specific model version
# Note: the loaded object is the original PyTorch model (nn.Module)
model = mlflow.pytorch.load_model("models:/MyPyTorchModel/1")

# Or load the model currently in a given stage
model_prod = mlflow.pytorch.load_model("models:/MyPyTorchModel/Production")

# Run inference
model.eval()
with torch.no_grad():
    data = torch.tensor([[1.0]])  # example input for the 1-feature model
    prediction = model(data)