Deploying an LLM Inference Service with vLLM
Deploying an LLM inference service involves challenges on several fronts, including compute resource requirements, latency and throughput, and cost control. vLLM is a fast, flexible, and easy-to-use library for LLM inference and serving. Its PagedAttention algorithm optimizes how the attention mechanism's keys and values are stored, saving memory that can then be used to batch more requests and thereby significantly increasing serving throughput. This lets vLLM keep operating costs under control, serving more users with high throughput and low latency on limited compute resources.
This example deploys an LLM inference service using an MLService and the vLLM framework, with the model stored on a PVC.
Compared with deploying an LLM inference service with FastChat, this example uses a more efficient inference backend, as well as an MLService suitable for production environments.
Preparation
In your project, create a PVC named vllm with a size of at least 30 GiB, then create a Notebook, also named vllm, that mounts this PVC (any image type and template will do).
Enter the Notebook or connect to it remotely, open a terminal, and run the following commands to clone the t9k/examples repository:
cd ~
git clone https://github.com/t9k/examples.git
Then download the model to be deployed from the Hugging Face Hub or ModelScope. Here we use the CodeLlama-7b-Instruct-hf model as an example:
# Method 1: if Hugging Face is directly accessible
huggingface-cli download codellama/CodeLlama-7b-Instruct-hf \
--local-dir CodeLlama-7b-Instruct-hf --local-dir-use-symlinks False
# Method 2: for users in China, ModelScope has better network connectivity
pip install modelscope
python -c \
"from modelscope import snapshot_download; snapshot_download('AI-ModelScope/CodeLlama-7b-Instruct-hf')"
mv .cache/modelscope/hub/AI-ModelScope/CodeLlama-7b-Instruct-hf .
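Before moving on to deployment, you can optionally verify that the model files landed in the PVC. The following is a minimal sketch, assuming the model was downloaded into ~/CodeLlama-7b-Instruct-hf as in the commands above:
# Sanity check: list the model directory and confirm the config and weight files exist.
# The directory name is an assumption based on the download commands above.
from pathlib import Path

model_dir = Path.home() / "CodeLlama-7b-Instruct-hf"
for p in sorted(model_dir.iterdir()):
    print(p.name)

assert (model_dir / "config.json").exists(), "config.json is missing"
assert any(p.suffix in (".safetensors", ".bin") for p in model_dir.iterdir()), \
    "no weight files (*.safetensors / *.bin) found"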
Deployment
Here vLLM is deployed as an OpenAI-API-compatible server, so it can act as a drop-in replacement for applications that use the OpenAI API.
Create an MLServiceRuntime and an MLService with the following YAML configuration files to deploy the service:
cd examples/deployments/vllm
kubectl apply -f mlservice-runtime.yaml
kubectl create -f mlservice.yaml
Watch the service until it becomes ready:
kubectl get -f mlservice.yaml -w
Once its READY value becomes true, the service is ready to use. Pulling the image for the first time may take a while, depending on the cluster's network conditions.
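If you prefer to wait for the service in a script rather than watching kubectl output, a rough sketch like the one below polls the MLService until its status exposes an address URL (the same .status.address.url field read in the next section). Treat this only as a convenience; the READY column remains the authoritative signal, and the helper name wait_for_mlservice is hypothetical:
# Poll the MLService status until an address URL appears, then return it.
# Assumes kubectl is available and configured inside the Notebook environment.
import subprocess
import time

def wait_for_mlservice(manifest: str = "mlservice.yaml", timeout: int = 1800) -> str:
    deadline = time.time() + timeout
    while time.time() < deadline:
        result = subprocess.run(
            ["kubectl", "get", "-f", manifest, "-o", "jsonpath={.status.address.url}"],
            capture_output=True, text=True,
        )
        url = result.stdout.strip()
        if url:
            return url
        time.sleep(10)
    raise TimeoutError(f"{manifest} did not report an address within {timeout}s")

print(wait_for_mlservice())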
Using the Inference Service
Still in the Notebook's terminal, use the curl command to send chat or text-completion requests:
address=$(kubectl get -f mlservice.yaml -ojsonpath='{.status.address.url}')
# Chat
curl ${address}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "codellama-7b",
    "messages": [{"role": "user", "content": "hello"}],
    "temperature": 0.5
  }'

# Text completion
curl ${address}/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "codellama-7b",
    "prompt": "Long long ago, there was",
    "max_tokens": 100,
    "temperature": 0.5
  }'
The returned responses are similar to:
Response
{
  "id": "cmpl-5915c46dc6054ecfa4d57d07225c1264",
  "object": "chat.completion",
  "created": 5101130,
  "model": "codellama-7b",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": " Hello! It's nice to meet you. Is there something I can help you with or would you like to chat?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 10,
    "total_tokens": 37,
    "completion_tokens": 27
  }
}
{
  "id": "cmpl-afbd703626c44a12ad192d0861fadd6e",
  "object": "text_completion",
  "created": 5101200,
  "model": "codellama-7b",
  "choices": [
    {
      "index": 0,
      "text": " a time when the world was dark and cold, and little light entered.\n\nA young girl named Kanna was born in this world. She was born with a burden on her back.\n\nKanna grew up in a small village, surrounded by snow and ice. The villagers were poor, and they lived in miserable huts. They were cold and hungry all the time.\n\nBut Kanna was different. She had a special gift. She could make light",
      "logprobs": null,
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 7,
    "total_tokens": 107,
    "completion_tokens": 100
  }
}
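Because the server is OpenAI-API-compatible, it can also be called from code instead of curl. Below is a minimal sketch using the openai Python package (not part of the original example; install it with pip install openai). It assumes the service address has been exported as the ADDRESS environment variable, i.e. the URL read from the MLService status above:
# Send the same chat request through the OpenAI-compatible endpoint.
import os
from openai import OpenAI

address = os.environ["ADDRESS"]  # assumed exported from the shell, e.g. export ADDRESS=$address
client = OpenAI(base_url=f"{address}/v1", api_key="EMPTY")  # no authentication is configured

response = client.chat.completions.create(
    model="codellama-7b",
    messages=[{"role": "user", "content": "hello"}],
    temperature=0.5,
)
print(response.choices[0].message.content)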
Writing Code
Now let the model play to its strengths and write a Python program that converts Roman numerals to integers. The chat transcript of one attempt follows:
Chat transcript
User:
Roman numerals are represented by seven different symbols: I, V, X, L, C, D and M.
Symbol Value
I 1
V 5
X 10
L 50
C 100
D 500
M 1000
For example, 2 is written as II in Roman numeral, just two ones added together. 12 is written as XII, which is simply X + II. The number 27 is written as XXVII, which is XX + V + II.
Roman numerals are usually written largest to smallest from left to right. However, the numeral for four is not IIII. Instead, the number four is written as IV. Because the one is before the five we subtract it making four. The same principle applies to the number nine, which is written as IX. There are six instances where subtraction is used:
- I can be placed before V (5) and X (10) to make 4 and 9.
- X can be placed before L (50) and C (100) to make 40 and 90.
- C can be placed before D (500) and M (1000) to make 400 and 900.
Given a roman numeral, convert it to an integer.
Solve in Python:
class Solution:
    def romanToInt(self, s: str) -> int:
CodeLlama-7b-Instruct-hf:
Here is a possible implementation of the romanToInt function:
class Solution:
    def romanToInt(self, s: str) -> int:
        roman_numerals = {
            'I': 1,
            'V': 5,
            'X': 10,
            'L': 50,
            'C': 100,
            'D': 500,
            'M': 1000
        }
        result = 0
        for i in range(len(s)):
            if i > 0 and roman_numerals[s[i]] > roman_numerals[s[i - 1]]:
                result -= roman_numerals[s[i - 1]]
            result += roman_numerals[s[i]]
        return result
This implementation uses a dictionary to map each Roman numeral to its corresponding integer value. It then iterates over the characters in the input string, and for each character, it adds or subtracts the corresponding integer value depending on whether it is greater than or less than the previous character. The resulting integer value is returned at the end.
For example, if the input string is "XII", the function will first map the characters to their integer values:
X -> 10
I -> 1
I -> 1
Then it will add the values of the characters in order:
10 + 1 + 1 = 12
So the function will return 12.
Note that this implementation assumes that the input string is a valid Roman numeral, and that it does not contain any invalid characters. If the input string contains invalid characters, the function will raise a KeyError exception. You can add error handling to the function as needed to handle such cases.
Thanks to vLLM's high-throughput inference service, this solution was generated in only 6 seconds. Unfortunately, it is not correct and fails LeetCode's tests.
Incidentally, with the same prompt (the prompt format differs slightly because CodeLlama-7b-Python-hf is not instruction-tuned), the solutions produced by CodeLlama-13b-Instruct-hf and CodeLlama-7b-Python-hf do pass, as do those from GPT-3.5, GPT-4, and Bard.
You can try deploying larger models of the Code Llama family yourself and have them write more complex code.
Extension: Deploying Other LLMs
The same method can be used to deploy other supported models. For example, to switch the model deployed in this example from CodeLlama-7b-Instruct-hf to Mistral-7B-Instruct-v0.1, you only need to:
- Download the model files of Mistral-7B-Instruct-v0.1:
# Method 1: if Hugging Face is directly accessible
# login required
huggingface-cli download mistralai/Mistral-7B-Instruct-v0.1 \
--local-dir Mistral-7B-Instruct-v0.1 --local-dir-use-symlinks False
# Method 2: for users in China, ModelScope has better network connectivity
pip install modelscope
python -c \
"from modelscope import snapshot_download; snapshot_download('AI-ModelScope/Mistral-7B-Instruct-v0.1')"
mv .cache/modelscope/hub/AI-ModelScope/Mistral-7B-Instruct-v0___1 ./Mistral-7B-Instruct-v0.1
- Make the following changes to the MLService's YAML configuration file, then create it again:
$ diff --color -u mlservice.yaml mlservice-mistral.yaml
--- mlservice.yaml
+++ mlservice-mistral.yaml
@@ -1,22 +1,22 @@
 apiVersion: tensorstack.dev/v1beta1
 kind: MLService
 metadata:
-  name: codellama-7b
+  name: mistral-7b
 spec:
   # scheduler:
   #   t9kScheduler:
   #     queue: default
-  default: codellama-7b
+  default: mistral-7b
   releases:
-    - name: codellama-7b
+    - name: mistral-7b
       predictor:
         minReplicas: 1
         model:
           runtime: vllm-openai
           parameters:
             MODEL_PATH: /var/lib/t9k/model
-            MODEL_NAME: codellama-7b
-          modelUri: pvc://vllm/CodeLlama-7b-Instruct-hf
+            MODEL_NAME: mistral-7b
+          modelUri: pvc://vllm/Mistral-7B-Instruct-v0.1
         containersResources:
           - name: user-container
             resources:
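Once the new MLService reports READY, it can be queried exactly as before, only with the new manifest and model name. Here is a small sketch reusing the hypothetical openai-based client from the previous section; the manifest name mlservice-mistral.yaml and model name mistral-7b come from the diff above, while the prompt is just an illustration:
# Query the Mistral service through the OpenAI-compatible endpoint.
import subprocess
from openai import OpenAI

address = subprocess.run(
    ["kubectl", "get", "-f", "mlservice-mistral.yaml", "-o", "jsonpath={.status.address.url}"],
    capture_output=True, text=True,
).stdout.strip()

client = OpenAI(base_url=f"{address}/v1", api_key="EMPTY")
reply = client.chat.completions.create(
    model="mistral-7b",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
)
print(reply.choices[0].message.content)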