Deploying a HuggingFace Model Inference Service on a BandwagonHost VPS

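HuggingFace is one of the largest open-source machine learning communities and model hubs, hosting hundreds of thousands of pretrained models for natural language processing, computer vision, speech recognition, and other fields. Deploying a HuggingFace model on a BandwagonHost VPS gives you a private AI inference service that does not depend on third-party APIs and keeps your data on your own server. This tutorial covers several deployment options, from a simple Python script to a production-grade inference service.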

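1. Requirements

Operating system: Ubuntu 20.04 or later.
Memory: at least 2 GB; 4 GB or more for larger models.
Disk: at least 20 GB of free space (model files are often large).
Python: Python 3.9 or later. Ubuntu 20.04 ships Python 3.8 by default, so on that release install a newer Python or use Ubuntu 22.04 or later.

BandwagonHost VPS plans are CPU-only, which suits inference with small and medium-sized models. For large language models, choose a quantized version. Pick a plan with enough memory and disk on the plans page.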
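
The memory guidance above can be turned into rough arithmetic. The sketch below is a back-of-the-envelope estimate, not a measurement: it multiplies a parameter count by the bytes per parameter for a given precision and adds an assumed 20% overhead for activations and runtime buffers. The function name, the overhead factor, and the approximate parameter counts (about 66 million for DistilBERT, 82 million for distilgpt2, 1.1 billion for TinyLlama) are illustrative assumptions.

```python
# Rough estimate of the RAM needed to hold model weights for inference.
# Assumption: weights dominate memory; the overhead factor is a guess.

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def estimate_model_memory_gb(num_params, precision="fp32", overhead=0.2):
    """Return an approximate memory footprint in GB (1 GB = 1e9 bytes)."""
    weight_bytes = num_params * BYTES_PER_PARAM[precision]
    return weight_bytes * (1 + overhead) / 1e9

# Approximate parameter counts (illustrative values).
models = {
    "DistilBERT": 66_000_000,
    "distilgpt2": 82_000_000,
    "TinyLlama": 1_100_000_000,
}

for name, params in models.items():
    fp32 = estimate_model_memory_gb(params, "fp32")
    int8 = estimate_model_memory_gb(params, "int8")
    print(f"{name}: ~{fp32:.2f} GB in fp32, ~{int8:.2f} GB in int8")
```

Under these assumptions TinyLlama needs roughly 5.3 GB in fp32 but about 1.3 GB in INT8, which is why a quantized version is the more realistic choice on a small plan.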

2. Installing Base Dependencies

apt update && apt upgrade -y
apt install python3 python3-pip python3-venv git -y

mkdir -p /opt/huggingface && cd /opt/huggingface
python3 -m venv venv
source venv/bin/activate

3. Installing the Transformers Library

pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install transformers accelerate sentencepiece protobuf

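Because the BandwagonHost VPS has no GPU, we install the CPU-only build of PyTorch, which takes less disk space than the default build.

3.1 Downloading and Using a Model

The Transformers pipeline interface loads and runs a model in a few lines: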

python3 <<'PYEOF'
from transformers import pipeline

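# Text classification example (the model is named explicitly instead of relying on the task default)
classifier = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")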
result = classifier("I love using this VPS service!")
print(result)

# Text generation example
generator = pipeline("text-generation", model="distilgpt2")
result = generator("The future of AI is", max_length=50)
print(result)
PYEOF
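
The pipeline calls above return plain Python lists of dictionaries. As a minimal sketch, assuming the usual output shape of the sentiment-analysis pipeline (a list of dictionaries with label and score keys), the helper below turns such a result into a readable line. The sample value is hand-written for illustration, not real model output, and format_prediction is a hypothetical helper name.

```python
def format_prediction(result, min_score=0.5):
    """Summarize one classification result of the form {'label': ..., 'score': ...}."""
    label = result["label"]
    score = float(result["score"])
    confidence = "confident" if score >= min_score else "uncertain"
    return f"{label} ({score:.1%}, {confidence})"

# Hand-written sample in the shape the classifier returns (not real output).
sample = [{"label": "POSITIVE", "score": 0.9987}]

for item in sample:
    print(format_prediction(item))
```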

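3.2 Pre-downloading Models

To avoid waiting for a download the first time the service needs a model, fetch it to local disk in advance. A model saved with local_dir is not placed in the default cache, so pass that directory path as the model argument when loading it: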

pip install huggingface_hub

python3 -c "
from huggingface_hub import snapshot_download
snapshot_download(repo_id='distilbert-base-uncased-finetuned-sst-2-english',
                  local_dir='/opt/huggingface/models/distilbert-sst2')
"

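One way to make use of the pre-downloaded copy is to prefer the local directory when it exists and fall back to the Hub repository ID otherwise. The sketch below assumes that a usable model directory contains a config.json file; resolve_model_source is a hypothetical helper, not part of Transformers or huggingface_hub.

```python
from pathlib import Path

def resolve_model_source(local_dir, repo_id):
    """Return local_dir if it looks like a downloaded model, else repo_id."""
    path = Path(local_dir)
    # Assumption: a downloaded Transformers model directory contains config.json.
    if (path / "config.json").is_file():
        return str(path)
    return repo_id

source = resolve_model_source(
    "/opt/huggingface/models/distilbert-sst2",
    "distilbert-base-uncased-finetuned-sst-2-english",
)
print(f"Loading model from: {source}")
# The result can then be passed to the pipeline, for example:
#   classifier = pipeline("sentiment-analysis", model=source)
```
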
4. Building an Inference API with FastAPI

Wrapping the models in a REST API makes them easy to call from other applications:

pip install fastapi uvicorn

Create the API service file /opt/huggingface/api_server.py:

cat > /opt/huggingface/api_server.py <<'EOF'
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI(title="HuggingFace Inference API")

# Load the models once at startup
classifier = pipeline("sentiment-analysis")
generator = pipeline("text-generation", model="distilgpt2")

class TextInput(BaseModel):
    text: str
    max_length: int = 100

# Plain "def" endpoints: FastAPI runs them in a worker thread, so a slow
# CPU-bound model call does not block the event loop.
@app.post("/classify")
def classify_text(payload: TextInput):
    result = classifier(payload.text)
    return {"result": result}

@app.post("/generate")
def generate_text(payload: TextInput):
    result = generator(payload.text, max_length=payload.max_length)
    return {"result": result}

@app.get("/health")
async def health_check():
    return {"status": "ok"}
EOF

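Start the service. Binding to 0.0.0.0 exposes the API on every network interface without authentication, so restrict port 8000 with a firewall, or bind to 127.0.0.1 behind a reverse proxy, if it should not be public: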

cd /opt/huggingface
source venv/bin/activate
uvicorn api_server:app --host 0.0.0.0 --port 8000
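
Once the server is running, other applications can call it over HTTP. The following is a minimal client sketch using only the Python standard library; it assumes the service listens on http://localhost:8000 and accepts the text and max_length fields defined by TextInput. The helper names build_request and call_api are illustrative.

```python
import json
import urllib.request

def build_request(base_url, endpoint, text, max_length=100):
    """Build a POST request matching the TextInput model of api_server.py."""
    payload = json.dumps({"text": text, "max_length": max_length}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{endpoint.lstrip('/')}",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def call_api(request, timeout=30):
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))

req = build_request("http://localhost:8000", "/classify", "I love using this VPS service!")
print(req.get_method(), req.full_url)
# With the server running: print(call_api(req))
```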

5. Deploying with Docker

Docker simplifies deployment. First make sure Docker is installed (see the Docker installation tutorial):

mkdir -p /opt/huggingface-docker && cd /opt/huggingface-docker

cat > Dockerfile <<'EOF'
FROM python:3.11-slim

WORKDIR /app
RUN pip install torch --index-url https://download.pytorch.org/whl/cpu && \
    pip install transformers fastapi uvicorn sentencepiece

COPY api_server.py .

EXPOSE 8000
CMD ["uvicorn", "api_server:app", "--host", "0.0.0.0", "--port", "8000"]
EOF

cp /opt/huggingface/api_server.py .
docker build -t hf-inference .
docker run -d --name hf-api -p 8000:8000 -v hf-cache:/root/.cache/huggingface hf-inference

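6. Using Text Generation Inference (TGI)

HuggingFace maintains TGI, a dedicated text generation inference engine with optimized serving performance. TGI is designed primarily for GPU servers, so check the TGI documentation for CPU support before relying on it on a CPU-only VPS: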

docker run -d --name tgi \
  -p 8080:80 \
  -v /opt/tgi-data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id distilgpt2 \
  --max-input-length 512 \
  --max-total-tokens 1024

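TGI also offers an OpenAI-compatible chat completions endpoint in addition to its native API. The test below uses the native /generate endpoint:

# Test the TGI service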
curl http://localhost:8080/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs":"What is machine learning?","parameters":{"max_new_tokens":100}}'
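
The two limits passed to TGI above interact: the prompt may use at most 512 tokens, and prompt tokens plus max_new_tokens must fit within 1024. The sketch below shows that arithmetic when building a request body. It is only an illustration: the helper names are hypothetical, and the prompt token count is passed in by hand, whereas a real client would obtain it from the model's tokenizer.

```python
def clamp_max_new_tokens(input_tokens, requested_new_tokens,
                         max_input_length=512, max_total_tokens=1024):
    """Return the largest allowed max_new_tokens for a request, or raise."""
    if input_tokens > max_input_length:
        raise ValueError(
            f"prompt has {input_tokens} tokens, limit is {max_input_length}"
        )
    budget = max_total_tokens - input_tokens
    return min(requested_new_tokens, budget)

def build_generate_payload(prompt, input_tokens, requested_new_tokens):
    """Build the JSON body for the /generate endpoint used above."""
    return {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": clamp_max_new_tokens(input_tokens, requested_new_tokens)
        },
    }

# input_tokens=6 is a stand-in value, not a real tokenizer count.
print(build_generate_payload("What is machine learning?", input_tokens=6, requested_new_tokens=100))
print(clamp_max_new_tokens(500, 800))  # only 524 tokens remain in the budget
```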

7. Configuring a Systemd Service

cat > /etc/systemd/system/hf-inference.service <<'EOF'
[Unit]
Description=HuggingFace Inference API
After=network.target

[Service]
Type=simple
User=root
WorkingDirectory=/opt/huggingface
ExecStart=/opt/huggingface/venv/bin/uvicorn api_server:app --host 0.0.0.0 --port 8000
Restart=always
RestartSec=10
Environment=HF_HOME=/opt/huggingface/models

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable hf-inference
systemctl start hf-inference
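
Because the models are loaded at startup, the service may take some time before it answers requests. A small polling script can confirm that it has come up. The sketch below assumes the /health endpoint from api_server.py on port 8000; the function names, retry count, and delay are arbitrary illustrative choices.

```python
import time
import urllib.error
import urllib.request

def is_healthy(url, timeout=5):
    """Return True if the URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False

def wait_until_healthy(check, retries=30, delay=2.0, sleep=time.sleep):
    """Call check() up to `retries` times; return the successful attempt number, or 0."""
    for attempt in range(1, retries + 1):
        if check():
            return attempt
        sleep(delay)
    return 0

# Example (needs the service running):
#   attempt = wait_until_healthy(lambda: is_healthy("http://localhost:8000/health"))
#   print("service is up" if attempt else "service did not come up")
```

The check and sleep functions are passed in as arguments so the retry loop can be exercised without a live server.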

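8. Performance Tips

Choose small models: on CPU, prefer lightweight models such as DistilBERT or TinyLlama.
Use ONNX Runtime: converting a model to ONNX format can noticeably speed up CPU inference.
Quantize the model: INT8 quantization reduces memory use and speeds up inference; see the model quantization tutorial.
Batch requests: combine several inference requests into one batch to raise throughput.
Warm up the service: send a warm-up request after startup so the first real request is not slowed by one-time initialization.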

# Install ONNX Runtime to speed up inference (quoted so the shell does not interpret the brackets)
pip install "optimum[onnxruntime]"

python3 -c "
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer
model = ORTModelForSequenceClassification.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english', export=True)
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')
print('ONNX model loaded successfully')
"

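The batching and warm-up tips above can be sketched in a few lines. This is an illustration under stated assumptions, not a tuned implementation: fake_predict stands in for a real model call (a Transformers pipeline also accepts a list of strings), and the batch size of 8 is an arbitrary starting point to adjust by measurement.

```python
import time

def chunked(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def predict_in_batches(predict_fn, texts, batch_size=8):
    """Run predict_fn on each batch and return the results in input order."""
    results = []
    for batch in chunked(texts, batch_size):
        results.extend(predict_fn(batch))
    return results

def warm_up(predict_fn, sample_text="warm-up request"):
    """Send one throwaway request and return how long it took in seconds."""
    start = time.perf_counter()
    predict_fn([sample_text])
    return time.perf_counter() - start

# Stand-in for a real model call; replace with the actual pipeline object.
def fake_predict(batch):
    return [{"text": text, "length": len(text)} for text in batch]

texts = [f"request {i}" for i in range(20)]
print(len(predict_in_batches(fake_predict, texts, batch_size=8)))
print(f"warm-up took {warm_up(fake_predict):.6f}s")
```
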
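9. Troubleshooting

Slow model downloads

BandwagonHost's overseas data centers usually connect to HuggingFace directly at good speed. If downloads are slow or fail, use a mirror site, or download the model elsewhere and upload it to the server.

Out of memory (OOM)

Large models can easily exhaust memory. One fix is to add swap space. Swap helps keep the process from being killed, but inference that spills into swap runs much slower than in RAM:

# Add swap space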
fallocate -l 4G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
echo '/swapfile none swap sw 0 0' >> /etc/fstab
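
To confirm that the swap space is active and to see how much memory is left for a model, the values in /proc/meminfo can be read programmatically. The sketch below parses that file's format; the sample text is hand-written for illustration rather than taken from a real server, and the helper names are hypothetical.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo content into a dict of {field: kilobytes}."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values

def summarize_memory(info):
    """Return available RAM and total swap in MiB."""
    return {
        "available_mib": info.get("MemAvailable", 0) // 1024,
        "swap_total_mib": info.get("SwapTotal", 0) // 1024,
    }

# Hand-written sample in /proc/meminfo format (not real measurements).
sample = """MemTotal:        2028464 kB
MemAvailable:    1536000 kB
SwapTotal:       4194304 kB
SwapFree:        4194304 kB
"""
print(summarize_memory(parse_meminfo(sample)))
# On the VPS itself: summarize_memory(parse_meminfo(open("/proc/meminfo").read()))
```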

Summary

Deploying a HuggingFace model inference service on a BandwagonHost VPS is a quick way to build a private backend for AI applications. For more complex applications, it can be combined with the LangChain framework or a LlamaIndex RAG system.