MLflow
MLflow is an open-source platform for managing the machine learning lifecycle. It aims to streamline ML development, covering experiment tracking, code packaging, model deployment, and more.
Core Components
MLflow consists of the following four core components:
- MLflow Tracking: records and queries experiment data, including code version, parameters, metrics, and output files (artifacts).
- MLflow Projects: a standard format for packaging data science code, ensuring reusability and reproducibility across environments.
- MLflow Models: a standard format for packaging machine learning models, supporting deployment to many downstream tools (e.g. Docker, Apache Spark, AWS SageMaker).
- MLflow Model Registry: a centralized model store for managing the full model lifecycle, including versioning, stage transitions (Staging -> Production), and annotations.
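Of these components, Projects is driven by an MLproject file at the project root. A minimal sketch (project name, entry point, and script name are illustrative):

```yaml
name: my_project            # illustrative project name
python_env: python_env.yaml # alternatively conda_env or docker_env
entry_points:
  main:
    parameters:
      learning_rate: {type: float, default: 0.01}
      epochs: {type: int, default: 10}
    command: "python train.py --lr {learning_rate} --epochs {epochs}"
```

Such a project can then be executed reproducibly with `mlflow run . -P learning_rate=0.1`.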
Installation and Configuration
Kubernetes Deployment
Download the Helm chart
# Add the community-charts repo if it is not already present
# (repo URL assumed; adjust if yours differs)
helm repo add community-charts https://community-charts.github.io/helm-charts
# Pull the mlflow helm chart to the local machine
helm pull community-charts/mlflow --version 1.8.0
# Unpack it
tar -xf mlflow-1.8.0.tgz
Configuration
A Kubernetes deployment needs the following pieces configured:
- Database (PostgreSQL)
- Artifact storage (MinIO here); additionally, configure MLflow to proxy artifact uploads/downloads, otherwise every client has to supply S3 credentials through environment variables.
- Ingress (for external access)
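The values.yaml sections that follow cover these pieces. Once they are edited, the unpacked chart can be installed in one command — a sketch; the release name, namespace, and chart directory are placeholders:

```shell
# Install (or upgrade) the chart with the customized values
helm upgrade --install mlflow ./mlflow \
  -f values.yaml \
  --namespace mlflow --create-namespace
```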
# -- MLflow database connection settings
backendStore:
  # -- Specifies if you want to run database migration
  databaseMigration: true
  # -- Add an additional init container, which checks for database availability
  databaseConnectionCheck: false
  # -- Specifies the default sqlite path
  defaultSqlitePath: ":memory:"
  postgres:
    # -- Specifies if you want to use postgres backend storage
    enabled: true
    # -- Postgres host address. e.g. your RDS or Azure Postgres Service endpoint
    host: "<your_database_host>" # required
    # -- Postgres service port
    port: 5432 # required
    # -- mlflow database name created beforehand in the postgres instance
    database: "mlflow" # required
    # -- postgres database user name which can access the mlflow database
    user: "postgres" # required
    # -- postgres database user password which can access the mlflow database
    password: "<your_database_password>" # required
    # -- postgres database connection driver. e.g.: "psycopg2"
    driver: ""
...
artifactRoot:
  # -- Specifies if you want to enable proxied artifact storage access
  proxiedArtifactStorage: true
  ...
  # -- Specifies if you want to use AWS S3 Mlflow Artifact Root
  s3:
    # -- Specifies if you want to use AWS S3 Mlflow Artifact Root
    enabled: true
    # -- S3 bucket name
    bucket: "mlflow" # required
    # -- S3 bucket folder. If you want to use root level, please don't set anything.
    path: "" # optional
    # -- AWS IAM user AWS_ACCESS_KEY_ID which has attached policy for access to the S3 bucket
    awsAccessKeyId: "xxxxxxxxxxxxxx" # (awsAccessKeyId and awsSecretAccessKey) or roleArn serviceaccount annotation required
    # -- AWS IAM user AWS_SECRET_ACCESS_KEY which has attached policy for access to the S3 bucket
    awsSecretAccessKey: "xxxxxxxxxxxxxxxxxxxxxxxxxx" # (awsAccessKeyId and awsSecretAccessKey) or roleArn serviceaccount annotation required
    # -- Existing secret for AWS IAM user AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY secrets.
    existingSecret:
      # -- This is for setting up the AWS IAM user secrets existing secret name.
      name: ""
      # -- This is for setting up the key for AWS_ACCESS_KEY_ID secret. If it's set, awsAccessKeyId will be ignored.
      keyOfAccessKeyId: ""
      # -- This is for setting up the key for AWS_SECRET_ACCESS_KEY secret. If it's set, awsSecretAccessKey will be ignored.
      keyOfSecretAccessKey: ""
# The flag below makes the MLflow server proxy artifact uploads/downloads;
# without it, every client would also need S3 credentials configured.
extraFlags:
  - serveArtifacts
# Point the S3 endpoint at the MinIO service
extraEnvVars:
  MLFLOW_S3_ENDPOINT_URL: "http://<your_host>:9000"
  MLFLOW_S3_IGNORE_TLS: "true" # skip TLS certificate verification
# nginx ingress configuration
ingress:
  # -- Specifies if you want to create an ingress access
  enabled: true
  # -- New style ingress class name. Only possible if you use K8s 1.18.0 or later version
  className: "nginx"
  # -- Additional ingress annotations
  annotations:
    kubernetes.io/ingress.class: nginx
    nginx.ingress.kubernetes.io/upstream-vhost: "localhost"
    nginx.ingress.kubernetes.io/proxy-body-size: "0"
    # kubernetes.io/tls-acme: "true"
  hosts:
    - host: <your_domain>
      paths:
        - path: /
          # -- Ingress path type
          pathType: ImplementationSpecific
  # -- Ingress tls configuration for https access
  tls:
    - secretName: mlflow-tls
      hosts:
        - <your_domain>
Local Deployment
Installation
Install MLflow with pip:
pip install mlflow
Starting the Service
1. Local startup (for development and debugging)
Run this directly in a terminal; it starts a local UI service, with data stored in the local mlruns directory.
mlflow ui
# Default address: http://127.0.0.1:5000
2. Remote/Production Startup
In production, you usually need to specify a backend store and an artifact root explicitly.
- Backend Store: stores experiment metadata (parameters, metrics, etc.), typically MySQL or PostgreSQL.
- Artifact Root: stores large files (model weights, images, etc.), typically S3, HDFS, or SFTP.
mlflow server \
  --host 0.0.0.0 \
  --port 5000 \
  --backend-store-uri mysql+pymysql://user:password@host:port/dbname \
  --default-artifact-root s3://my-mlflow-bucket/ \
  --serve-artifacts
Usage Guide (Python)
1. Set the Tracking URI
At the start of your code, point MLflow at the tracking server.
import mlflow

# When running a local "mlflow ui", this is usually unnecessary; the default is ./mlruns
# For a remote server:
mlflow.set_tracking_uri("http://your-mlflow-server:5000")
# Set the experiment name; it is created automatically if it doesn't exist
mlflow.set_experiment("My_Experiment_Name")
2. Logging an Experiment
Use the mlflow.start_run() context manager to open a run.
import mlflow
import mlflow.pytorch
import torch
import torch.nn as nn
import torch.optim as optim

# Define a simple PyTorch model
class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.linear = nn.Linear(1, 1)

    def forward(self, x):
        return self.linear(x)

# Open a run
with mlflow.start_run():
    # 1. Log parameters
    lr = 0.01
    epochs = 10
    mlflow.log_param("learning_rate", lr)
    mlflow.log_param("epochs", epochs)

    # Train the model (pseudo-code)
    model = SimpleModel()
    criterion = nn.MSELoss()
    optimizer = optim.SGD(model.parameters(), lr=lr)
    # ... training loop ...
    # loss = ...

    # 2. Log metrics
    mlflow.log_metric("loss", 0.123)  # placeholder loss value

    # 3. Log the model using the mlflow.pytorch flavor
    mlflow.pytorch.log_model(model, "model")

    # 4. Log arbitrary files (artifacts)
    with open("output.txt", "w") as f:
        f.write("Hello MLFlow with PyTorch")
    mlflow.log_artifact("output.txt")
3. Model Registry
When logging a model, it can be registered to the Model Registry directly.
# Option 1: register at log_model time
mlflow.pytorch.log_model(
    pytorch_model=model,
    artifact_path="model",
    registered_model_name="MyPyTorchModel"
)

# Option 2: register a model from an existing run via the API
result = mlflow.register_model(
    "runs:/<run_id>/model",
    "MyPyTorchModel"
)
4. Loading a Model
Load a model from MLflow for inference.
import mlflow
import mlflow.pytorch
import torch

# Load a specific version of the model
# Note: this returns the original PyTorch model object (nn.Module)
model = mlflow.pytorch.load_model("models:/MyPyTorchModel/1")
# Or load the model currently in a given stage
model_prod = mlflow.pytorch.load_model("models:/MyPyTorchModel/Production")

# Run a prediction
model.eval()
with torch.no_grad():
    # assuming data is a tensor
    prediction = model(data)