Deploying MLflow, a Machine Learning Experiment Tracking Platform, on a BandwagonHost VPS

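MLflow is an open-source platform from Databricks for managing the machine learning lifecycle. Its core features are experiment tracking, a model registry, project packaging, and model deployment. With an MLflow Tracking Server running on a BandwagonHost VPS, team members can log and compare experiment results in one place and manage model versions, which covers a large part of everyday MLOps practice.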

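1. MLflow core components

MLflow Tracking: records experiment parameters, metrics, and artifacts, and provides a Web UI for comparing runs.
MLflow Models: a standardized model packaging format that supports multiple deployment targets.
MLflow Model Registry: model version management and lifecycle control.
MLflow Projects: packaging of reproducible run environments for experiments.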

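2. Requirements

Operating system: Ubuntu 20.04 or later. Ubuntu 20.04 ships Python 3.8 by default, so Ubuntu 22.04 or later is the simpler choice.
Memory: at least 1 GB, 2 GB or more recommended.
Python: Python 3.9 or later.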
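
Before installing anything, it can help to confirm that the interpreter on the VPS meets the Python requirement above. The following is a minimal sketch; the helper name `meets_requirement` is made up for this example.

```python
import sys

# Minimum version taken from the requirements list above.
MIN_PYTHON = (3, 9)

def meets_requirement(version_info, minimum=MIN_PYTHON):
    """Return True if the (major, minor, ...) version is at least `minimum`."""
    return tuple(version_info[:2]) >= minimum

if meets_requirement(sys.version_info):
    print("Python version is sufficient")
else:
    print(f"Python {sys.version_info.major}.{sys.version_info.minor} is too old, "
          f"need {MIN_PYTHON[0]}.{MIN_PYTHON[1]} or later")
```

Run it with the same `python3` that will later create the virtual environment.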

3. Installing MLflow

apt update && apt upgrade -y
apt install python3 python3-pip python3-venv -y

mkdir -p /opt/mlflow && cd /opt/mlflow
python3 -m venv venv
source venv/bin/activate

pip install mlflow

4. Starting the Tracking Server

4.1 Basic mode (local storage)

mkdir -p /opt/mlflow/mlruns /opt/mlflow/artifacts

mlflow server \
  --host 0.0.0.0 \
  --port 5000 \
  --backend-store-uri sqlite:////opt/mlflow/mlflow.db \
  --default-artifact-root /opt/mlflow/artifacts
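
The number of slashes in the SQLite URI matters. MLflow hands the backend store URI to SQLAlchemy, where the scheme is `sqlite:///` followed by the file path, so an absolute path such as /opt/mlflow/mlflow.db ends up with four slashes, while three slashes mean a path relative to the working directory. A small sketch (the helper `sqlite_uri` is hypothetical, not part of MLflow):

```python
def sqlite_uri(db_path: str) -> str:
    """Build a SQLAlchemy-style SQLite URI from a filesystem path.

    The scheme is "sqlite:///" plus the path, so an absolute path
    (which itself starts with "/") produces four slashes in total.
    """
    return "sqlite:///" + db_path

print(sqlite_uri("/opt/mlflow/mlflow.db"))  # sqlite:////opt/mlflow/mlflow.db
print(sqlite_uri("mlflow.db"))              # sqlite:///mlflow.db (relative path)
```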

4.2 PostgreSQL backend (recommended for production)

apt install postgresql postgresql-contrib -y
pip install psycopg2-binary

# Create the database user and the database
sudo -u postgres psql -c "CREATE USER mlflow WITH PASSWORD 'mlflow_password';"
sudo -u postgres psql -c "CREATE DATABASE mlflow_db OWNER mlflow;"

mlflow server \
  --host 0.0.0.0 \
  --port 5000 \
  --backend-store-uri postgresql://mlflow:mlflow_password@localhost/mlflow_db \
  --default-artifact-root /opt/mlflow/artifacts
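
The example password above contains only safe characters. A real password with characters such as `@` or `/` has to be percent-encoded before it goes into the URI, otherwise the URI is parsed incorrectly. A minimal sketch using only the standard library; the function name `postgres_uri` and the sample password are assumptions for illustration:

```python
from urllib.parse import quote_plus

def postgres_uri(user, password, host, database, port=None):
    """Assemble a postgresql:// URI, percent-encoding user and password."""
    netloc = f"{quote_plus(user)}:{quote_plus(password)}@{host}"
    if port is not None:
        netloc += f":{port}"
    return f"postgresql://{netloc}/{database}"

# The credentials from this tutorial pass through unchanged.
print(postgres_uri("mlflow", "mlflow_password", "localhost", "mlflow_db"))
# A password with reserved characters is encoded.
print(postgres_uri("mlflow", "p@ss/word", "localhost", "mlflow_db", port=5432))
```

The resulting string can be passed to `--backend-store-uri` as is.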

5. Logging experiments

Use the MLflow API in the training script to log experiment data:

pip install scikit-learn numpy

cat > /opt/mlflow/train_example.py <<'EOF'
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

# Point the client at the Tracking Server
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("random-forest-experiment")

# Generate sample data (fixed seeds keep the runs comparable)
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train one model per parameter combination
for n_estimators in [50, 100, 200]:
    for max_depth in [5, 10, 20]:
        with mlflow.start_run():
            # Log the parameters
            mlflow.log_param("n_estimators", n_estimators)
            mlflow.log_param("max_depth", max_depth)

            # Train the model
            model = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, random_state=42)
            model.fit(X_train, y_train)

            # Evaluate and log the metrics
            predictions = model.predict(X_test)
            accuracy = accuracy_score(y_test, predictions)
            f1 = f1_score(y_test, predictions)

            mlflow.log_metric("accuracy", accuracy)
            mlflow.log_metric("f1_score", f1)

            # Log the model
            mlflow.sklearn.log_model(model, "model")

            print(f"n={n_estimators}, depth={max_depth}: accuracy={accuracy:.4f}, f1={f1:.4f}")
EOF

python3 train_example.py
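
The two nested loops in the script produce nine runs, one per parameter combination. When more parameters are added, nesting gets unwieldy; one possible alternative is to expand a parameter grid with `itertools.product`. This is only a sketch of the idea, and `expand_grid` is a hypothetical helper:

```python
from itertools import product

# Same values as in the training script above.
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [5, 10, 20],
}

def expand_grid(grid):
    """Yield one dict per combination of the values in `grid`."""
    keys = list(grid)
    for values in product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))

combinations = list(expand_grid(param_grid))
print(len(combinations))   # 9
print(combinations[0])     # {'n_estimators': 50, 'max_depth': 5}
```

Each dict can then be logged in one call with `mlflow.log_params(params)` and unpacked into the estimator with `RandomForestClassifier(**params, random_state=42)`.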

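6. Using the Web UI

Open http://your-ip:5000 to reach the MLflow Web UI. Its main features are:

Experiment list: browse all experiments and their runs.
Run comparison: select several runs and compare their metrics.
Parameter search: filter runs by parameters and metrics to find the best one.
Artifact viewer: download model files and other artifacts.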
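
The same search is also available outside the browser through the Tracking Server's REST API. The sketch below builds a request for the `runs/search` endpoint that returns the run with the highest accuracy first, and extracts run IDs from the response. The function names are made up for this example, the experiment ID has to be looked up in the Web UI, and the response layout follows the MLflow REST API documentation as far as I know it, so treat it as an assumption to verify against your server version.

```python
import json
import urllib.request

def build_search_payload(experiment_id, metric="accuracy", max_results=1):
    """Request body for POST /api/2.0/mlflow/runs/search, best run first."""
    return {
        "experiment_ids": [str(experiment_id)],
        "order_by": [f"metrics.{metric} DESC"],
        "max_results": max_results,
    }

def extract_runs(response, metric="accuracy"):
    """Turn a runs/search response into (run_id, metric_value) pairs."""
    results = []
    for run in response.get("runs", []):
        metrics = {m["key"]: m["value"] for m in run["data"].get("metrics", [])}
        results.append((run["info"]["run_id"], metrics.get(metric)))
    return results

def search_best_run(base_url, experiment_id, metric="accuracy"):
    """Send the search request to a running Tracking Server."""
    body = json.dumps(build_search_payload(experiment_id, metric)).encode()
    request = urllib.request.Request(
        f"{base_url}/api/2.0/mlflow/runs/search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as resp:
        return extract_runs(json.load(resp), metric)
```

With the server from this tutorial running, `search_best_run("http://localhost:5000", "1")` would return a list with the best run's ID and accuracy, assuming the experiment has ID 1.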

7. Registering a model

cat > /opt/mlflow/register_model.py <<'EOF'
import mlflow
from mlflow import MlflowClient

mlflow.set_tracking_uri("http://localhost:5000")

# Pick the best run to register
best_run_id = "your_best_run_id"  # copy this from the Web UI
model_uri = f"runs:/{best_run_id}/model"

# Register the model
result = mlflow.register_model(model_uri, "RandomForestClassifier")
print(f"Model version: {result.version}")

# Mark the model as production-ready.
# Note: model stages are deprecated since MLflow 2.9 in favor of aliases.
client = MlflowClient()
client.transition_model_version_stage(
    name="RandomForestClassifier",
    version=result.version,
    stage="Production"
)
EOF
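
Recent MLflow releases deprecate model stages and recommend model version aliases instead. The following sketch shows what the promotion step could look like with an alias. The helper `promote_with_alias` and the alias name "champion" are choices made for this example; `set_registered_model_alias` is the `MlflowClient` method it relies on.

```python
def promote_with_alias(client, name, version, alias="champion"):
    """Point `alias` at a model version and return a loadable model URI.

    `client` is expected to be an mlflow.MlflowClient (or any object
    with the same set_registered_model_alias method).
    """
    client.set_registered_model_alias(name, alias, str(version))
    # Models referenced by alias use the "models:/<name>@<alias>" form.
    return f"models:/{name}@{alias}"
```

In the registration script this would replace the stage transition, for example `promote_with_alias(MlflowClient(), "RandomForestClassifier", result.version)`. The returned URI can be passed to `mlflow.sklearn.load_model`.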

8. Docker deployment

Deploy MLflow and PostgreSQL with Docker Compose (Docker must be installed first; see the Docker installation tutorial):

cat > /opt/mlflow/docker-compose.yml <<'EOF'
version: '3.8'
services:
  postgres:
    image: postgres:15
    environment:
      POSTGRES_USER: mlflow
      POSTGRES_PASSWORD: mlflow_password
      POSTGRES_DB: mlflow_db
    volumes:
      - postgres_data:/var/lib/postgresql/data
    restart: unless-stopped

  mlflow:
    image: ghcr.io/mlflow/mlflow:latest
    ports:
      - "5000:5000"
    volumes:
      - mlflow_artifacts:/opt/artifacts
    environment:
      MLFLOW_BACKEND_STORE_URI: postgresql://mlflow:mlflow_password@postgres/mlflow_db
      MLFLOW_DEFAULT_ARTIFACT_ROOT: /opt/artifacts
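    # The official image does not bundle a PostgreSQL driver, so install one before starting the server.
    command: bash -c "pip install psycopg2-binary && mlflow server --host 0.0.0.0 --port 5000 --backend-store-uri postgresql://mlflow:mlflow_password@postgres/mlflow_db --default-artifact-root /opt/artifacts"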
    depends_on:
      - postgres
    restart: unless-stopped

volumes:
  postgres_data:
  mlflow_artifacts:
EOF

docker compose up -d

9. Configuring a systemd service

cat > /etc/systemd/system/mlflow.service <<EOF
[Unit]
Description=MLflow Tracking Server
After=network.target postgresql.service

[Service]
Type=simple
User=root
WorkingDirectory=/opt/mlflow
ExecStart=/opt/mlflow/venv/bin/mlflow server --host 0.0.0.0 --port 5000 --backend-store-uri sqlite:////opt/mlflow/mlflow.db --default-artifact-root /opt/mlflow/artifacts
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable mlflow
systemctl start mlflow
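
Once the service is started, a quick way to confirm that the server is answering is to request its `/health` endpoint. A minimal sketch with the standard library follows; the helper names are assumptions for this example.

```python
import urllib.request

def health_url(base_url):
    """Return the health-check URL for a Tracking Server base URL."""
    return base_url.rstrip("/") + "/health"

def is_healthy(base_url, timeout=3.0):
    """Return True if the server answers /health with HTTP 200."""
    try:
        with urllib.request.urlopen(health_url(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers refused connections, timeouts, and HTTP error statuses.
        return False

print(is_healthy("http://127.0.0.1:5000"))
```

If this prints False, `systemctl status mlflow` and `journalctl -u mlflow` are the next places to look.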

10. Nginx reverse proxy

apt install nginx -y

cat > /etc/nginx/sites-available/mlflow <<'EOF'
server {
    listen 80;
    server_name mlflow.yourdomain.com;

    location / {
        proxy_pass http://127.0.0.1:5000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        client_max_body_size 500M;
    }
}
EOF
ln -s /etc/nginx/sites-available/mlflow /etc/nginx/sites-enabled/
nginx -t && systemctl reload nginx
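
With the proxy in place, training scripts on other machines should point at the domain rather than at localhost:5000. MLflow reads the `MLFLOW_TRACKING_URI` environment variable, so the address does not have to be hard-coded. The sketch below shows one way to resolve the address with a fallback; `resolve_tracking_uri` is a hypothetical helper.

```python
import os

def resolve_tracking_uri(default="http://localhost:5000"):
    """Use MLFLOW_TRACKING_URI if it is set, otherwise fall back to `default`."""
    uri = os.environ.get("MLFLOW_TRACKING_URI", "").strip()
    return (uri or default).rstrip("/")

print(resolve_tracking_uri())
```

A script would then call `mlflow.set_tracking_uri(resolve_tracking_uri())`, and each team member would export `MLFLOW_TRACKING_URI=http://mlflow.yourdomain.com` in their shell.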

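11. Troubleshooting

Artifact upload fails

Make sure the artifact directory exists and that the user running the server can write to it. With remote storage, check network connectivity as well. Note that a local path passed to --default-artifact-root is resolved on the client side, so clients on other machines cannot write to it; in that case, consider starting the server with --artifacts-destination so that uploads are proxied through the Tracking Server.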
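
The directory check can be scripted. This sketch reports the two most common problems for a local artifact directory; the function name is an assumption, and the result reflects the user who runs the script, so run it as the same user as the MLflow server.

```python
import os

def check_artifact_dir(path):
    """Return a list of problems found with a local artifact directory."""
    problems = []
    if not os.path.isdir(path):
        problems.append(f"{path} does not exist or is not a directory")
    elif not os.access(path, os.W_OK | os.X_OK):
        problems.append(f"{path} is not writable by the current user")
    return problems

# An empty list means no problem was detected.
print(check_artifact_dir("/opt/mlflow/artifacts"))
```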

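Database is locked

SQLite does not handle highly concurrent access well. If several people use the server at the same time, switch to the PostgreSQL backend.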

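Summary

MLflow offers a complete solution for managing machine learning experiments. A Tracking Server on a BandwagonHost VPS gives the team centralized experiment management and a shared place to collaborate. Combined with Jupyter Notebook for developing experiments and HuggingFace for loading pretrained models, it forms a complete ML workflow.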