
Triton + TensorRT-LLM

admin · 1 week ago (03-29) · AI技术 · 15

convert checkpoint

```shell
python3 /mnt/TensorRT-LLM/examples/qwen/convert_checkpoint.py \
    --model_dir /Qwen2.5-0.5B-Instruct \
    --output_dir ./trtllm_checkpoint_fp16
```

```
[TensorRT-LLM] TensorRT-LLM version: 0.17.0.post1
0.17.0.post1 198it [00:03, 62.91it/s]
Total time of converting checkpoints: 00:00:04
```

build engine

https://nvidia.github.io/TensorRT-LLM/commands/trtllm-build.html

```shell
trtllm-build \
    --checkpoint_dir ./trtllm_checkpoint_fp16 \
    --output_dir ./trt_engines/ \
    --gemm_plugin
```

Clone the backend repo and copy the model templates into a local model repository:

```shell
git clone https://github.com/triton-inference-server/tensorrtllm_backend
mkdir ./triton_model_repo
cp -r ./tensorrtllm_backend/all_models/inflight_batcher_llm/* ./triton_model_repo/
```

Create a shell script that fills in the model configuration templates, set the parameter values for your deployment, and run it:

```shell
ENGINE_DIR=/opt/tritonserver/trt_engines/
TOKENIZER_DIR=/Qwen2.5-0.5B-Instruct
MODEL_FOLDER=/opt/tritonserver/triton_model_repo
TRITON_MAX_BATCH_SIZE=
INSTANCE_COUNT=
MAX_QUEUE_DELAY_MS=
MAX_QUEUE_SIZE=
FILL_TEMPLATE_SCRIPT=/mnt/tensorrtllm_backend/tools/fill_template.py
DECOUPLED_MODE=
LOGITS_DATATYPE=

python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/ensemble/config.pbtxt triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},logits_datatype:${LOGITS_DATATYPE}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/preprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},preprocessing_instance_count:${INSTANCE_COUNT}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},engine_dir:${ENGINE_DIR},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MS},batching_strategy:inflight_fused_batching,max_queue_size:${MAX_QUEUE_SIZE},encoder_input_features_data_type:TYPE_FP16,logits_datatype:${LOGITS_DATATYPE}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/postprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},postprocessing_instance_count:${INSTANCE_COUNT}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},bls_instance_count:${INSTANCE_COUNT},logits_datatype:${LOGITS_DATATYPE}
```
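Under the hood, fill_template.py just substitutes placeholder parameters in the config.pbtxt templates with the values you pass on the command line. A minimal Python sketch of that kind of substitution (illustrative only, not the actual script; the parameter names and values are example assumptions):

```python
# Sketch of a fill_template-style substitution: replace ${name}
# placeholders in a config template with supplied values.
from string import Template

template = Template(
    "triton_max_batch_size: ${triton_max_batch_size}\n"
    "logits_datatype: ${logits_datatype}\n"
)

# Example values; in the real script these come from "key:value" CLI args.
params = {"triton_max_batch_size": "64", "logits_datatype": "TYPE_FP32"}
rendered = template.substitute(params)
print(rendered)
```

The real script edits each config.pbtxt in place (`-i`), which is why the fill commands above are run once per model directory.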

tritonserver deployment

```shell
tritonserver --model-repository ./triton_model_repo/
```

Call the ensemble / tensorrt_llm_bls model with curl:

```shell
curl -X POST localhost:8000/v2/models/ensemble/generate \
    -d '{"text_input": "你知道deepseek吗,写一首诗表达对大模型技术发展的惊叹", "max_tokens": 200, "bad_words": "", "stop_words": ""}'
```

```json
{
  "model_name": "ensemble",
  "model_version": "...",
  "sequence_end": ...,
  "sequence_id": ...,
  "sequence_start": ...,
  "text_output": "你知道deepseek吗,写一首诗表达对大模型技术发展的惊叹和敬畏之情。\n深海深处,深海深处,\n大模型技术,如星辰般璀璨。\n它在数据海洋中航行,\n探索未知,创造奇迹。\n在无尽的宇宙中穿梭。\n在文字的海洋中航行。\n它用深度学习,创造奇迹,\n在数据的海洋中航行。\n它用自然语言处理,创造奇迹,\n在语言的海洋中航行。\n..."
}
```

The output then degenerates into the same two couplets repeating until the 200-token limit is reached.
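The same request can be issued from Python instead of curl. A minimal sketch of building the payload (a running server at localhost:8000 is assumed, so the HTTP call itself is left commented out):

```python
import json

# Payload for Triton's /v2/models/ensemble/generate endpoint,
# mirroring the curl call above.
payload = {
    "text_input": "你知道deepseek吗,写一首诗表达对大模型技术发展的惊叹",
    "max_tokens": 200,
    "bad_words": "",
    "stop_words": "",
}
body = json.dumps(payload, ensure_ascii=False)

# With the server running (assumption):
# import requests
# r = requests.post("http://localhost:8000/v2/models/ensemble/generate",
#                   data=body.encode("utf-8"))
# print(r.json()["text_output"])
print(body)
```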

trtllm-serve deployment

This method exposes an OpenAI-compatible API, much like `vllm serve`:

https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/apps

There are three ways to launch it, each taking a different model format:

```shell
# from the Hugging Face model
trtllm-serve /Qwen2.5-0.5B-Instruct

# from the converted checkpoint
trtllm-serve ./trtllm_checkpoint_fp16/ --tokenizer /Qwen2.5-0.5B-Instruct/

# from the built engine
trtllm-serve ./trt_engines/ --tokenizer /Qwen2.5-0.5B-Instruct/
```

Launching from the Hugging Face model builds the engine on the fly:

```
[TensorRT-LLM] TensorRT-LLM version: 0.17.0.post1
/usr/local/lib/python3.12/dist-packages/torch/utils/cpp_extension.py:2011: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(...)
Loading Model: Loading HF model to memory
Loading Model: Building TRT-LLM engine
Loading model done.
Total latency: ...
```

Launching from the pre-built engine loads it directly:

```
[TensorRT-LLM] TensorRT-LLM version: 0.17.0.post1
/usr/local/lib/python3.12/dist-packages/torch/utils/cpp_extension.py:2011: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(...)
[TensorRT-LLM] Engine version 0.17.0.post1 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM] Refreshed the MPI local session
[TensorRT-LLM] ... is using GPU ...
[TensorRT-LLM] Fix optionalParams: KV cache reuse disabled because model was not built with paged context FMHA support
[TensorRT-LLM] TRTGptModel maxNumSequences: ...
[TensorRT-LLM] TRTGptModel maxBatchSize: ...
[TensorRT-LLM] TRTGptModel maxBeamWidth: ...
[TensorRT-LLM] TRTGptModel maxSequenceLen: ...
[TensorRT-LLM] TRTGptModel maxDraftLen: ...
[TensorRT-LLM] TRTGptModel mMaxAttentionWindowSize: ...
[TensorRT-LLM] TRTGptModel enableTrtOverlap: ...
[TensorRT-LLM] TRTGptModel normalizeLogProbs: ...
[TensorRT-LLM] TRTGptModel maxNumTokens: ...
[TensorRT-LLM] TRTGptModel maxInputLen: min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM] TRTGptModel If model is encoder, maxInputLen would be reset for trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens)
[TensorRT-LLM] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM] Context Chunking Scheduler Policy: None
[TensorRT-LLM] Loaded engine size: ...
[TensorRT-LLM] Inspecting the engine to identify potential runtime issues
[TensorRT-LLM] The profiling verbosity of the engine does not allow this analysis to proceed. Re-build the engine with 'detailed' profiling verbosity to get diagnostics.
[TensorRT-LLM] [MemUsageChange] ... execution context memory.
[TensorRT-LLM] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU ...
[TensorRT-LLM] [MemUsageChange] Allocated ... GB GPU memory for runtime buffers.
[TensorRT-LLM] [MemUsageChange] Allocated ... GB GPU memory ...
[TensorRT-LLM] Memory usage when calculating max tokens in paged kv cache: total: ... GiB, available: ... GiB
[TensorRT-LLM] Number of blocks in KV cache primary pool: ...
[TensorRT-LLM] Number of blocks in KV cache secondary pool: ..., onboard blocks to primary memory before reuse: ...
[TensorRT-LLM] KV cache block reuse is disabled
[TensorRT-LLM] Max KV cache pages per sequence: ...
[TensorRT-LLM] Number of tokens per block: ...
[TensorRT-LLM] [MemUsageChange] Allocated ... for max tokens in paged KV cache ...
INFO:     Started server process
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://localhost:8000 (Press CTRL+C to quit)
```
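The KV-cache figures in the log follow simple block arithmetic: the number of cache pages a sequence can need is its maximum length divided by the tokens-per-block, rounded up. A sketch with assumed example values (a 64-token block and 2048-token max sequence are placeholders, not numbers taken from this log):

```python
import math

# Paged KV cache stores each sequence's keys/values in fixed-size
# blocks ("pages"); a sequence needs ceil(seq_len / block_size) pages.
tokens_per_block = 64   # assumed example value
max_seq_len = 2048      # assumed example value

# Corresponds to the "Max KV cache pages per sequence" log line.
max_pages_per_sequence = math.ceil(max_seq_len / tokens_per_block)
print(max_pages_per_sequence)  # → 32
```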

```shell
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "/Qwen2.5-0.5B-Instruct",
        "prompt": "大模型为什么这么厉害?",
        "max_tokens": 16,
        "temperature": 0
    }'
```

```json
{
  "id": "cmpl-b8f4c498683b46e4ab1fe2e2de20a32",
  "object": "text_completion",
  "model": "Qwen2.5-0.5B-Instruct",
  "choices": [
    {
      "text": "为什么能比人类更好?\n在人工智能领域,我们经常看到各种各样的",
      "logprobs": ...,
      "finish_reason": ...,
      "stop_reason": ...
    }
  ],
  "usage": {"prompt_tokens": ..., "total_tokens": ..., "completion_tokens": ...}
}
```

```shell
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "/Qwen2.5-0.5B-Instruct",
        "messages": [{"role": "system", "content": "You are a helpful assistant."},
                     {"role": "user", "content": "什么促进了大模型发展?"}],
        "max_tokens": 50,
        "temperature": 0
    }'
```

```json
{
  "id": "chatcmpl-b861cc9725fd42ffa5baf6a406b4f624",
  "object": "chat.completion",
  "model": "Qwen2.5-0.5B-Instruct",
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "大模型是指深度学习模型,它在计算机视觉、自然语言处理、语音识别等领域取得了显著的进展。促进大模型发展的因素包括:\n1. 数据量的增加:随着大数据和云计算技术的发展,数据量的增加使得",
        "tool_calls": ...
      },
      "logprobs": ...,
      "finish_reason": ...,
      "stop_reason": ...
    }
  ],
  "usage": {"prompt_tokens": ..., "total_tokens": ..., "completion_tokens": ...}
}
```

triton + OpenAI-compatible frontend deployment

Another way to expose an OpenAI-compatible API; it is still in beta:

https://github.com/triton-inference-server/server/tree/main/python/openai

```shell
python3 /mnt/server/python/openai/openai_frontend/main.py \
    --model-repository ./triton_model_repo/ \
    --tokenizer /Qwen2.5-0.5B-Instruct/
```

Call it with the model name "tensorrt_llm_bls":

```shell
curl http://localhost:9000/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "tensorrt_llm_bls",
        "messages": [{"role": "user", "content": "Say this is a test!"}]
    }'
```


```json
{
  "id": "cmpl-41ca7405-f3f6-11ef-9c8a-4753e2d4625b",
  "choices": [
    {
      "finish_reason": ...,
      "message": {
        "role": "assistant",
        "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant.\nSay this is a test!\nI'm sorry, but I'm Qwen, an artificial intelligence language model created",
        "tool_calls": ...,
        "function_call": ...
      },
      "logprobs": ...
    }
  ],
  "created": 1740542446,
  "model": "tensorrt_llm_bls",
  "system_fingerprint": ...,
  "object": "chat.completion"
}
```

Benchmark with genai-perf:

```shell
pip install genai-perf
```

```shell
MODEL="tensorrt_llm_bls"
TOKENIZER="/Qwen2.5-0.5B-Instruct/"

genai-perf profile \
    -m ${MODEL} \
    --tokenizer ${TOKENIZER} \
    --service-kind openai \
    --endpoint-type chat \
    -u localhost:9000 \
    --synthetic-input-tokens-mean \
    --synthetic-input-tokens-stddev \
    --output-tokens-mean \
    --output-tokens-stddev \
    --request-count 50 \
    --warmup-request-count 10 \
    --concurrency 1 \
    --streaming
# --streaming requires DECOUPLED_MODE=true in the fill_template script above
```

```
genai_perf.parser:115 - Profiling these models: tensorrt_llm_bls
genai_perf.subcommand.common:208 - Running Perf Analyzer : 'perf_analyzer -m tensorrt_llm_bls --async --input-data artifacts/tensorrt_llm_bls-openai-chat-concurrency1/inputs.json -i http --concurrency-range 1 --endpoint v1/chat/completions --service-kind openai -u localhost:9000 --request-count 50 --warmup-request-count 10 --profile-export-file artifacts/tensorrt_llm_bls-openai-chat-concurrency1/profile_export.json --measurement-interval 10000 --stability-percentage 999'
```

Concurrency = 1 output:

Concurrency = 10 output:

Time to first token (TTFT)

During testing, response latency grew linearly with load and there was no concurrency benefit at all. After some pointers from a colleague, the following points need attention.
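The symptom is easy to recognize in numbers: if requests are being serialized, total wall time scales with the number of concurrent requests, while with working in-flight batching it stays close to a single request's latency. A toy calculation (the 2-second per-request latency is an assumed example, not a measured value):

```python
# Toy model of the observed symptom: serialized vs batched execution.
per_request_latency_s = 2.0   # assumed example value
concurrency = 10

# No batching: requests queue behind each other, so latency grows linearly.
serialized_wall_time = concurrency * per_request_latency_s

# Ideal in-flight batching: all 10 requests overlap (ignoring batching overhead).
batched_wall_time = per_request_latency_s

print(serialized_wall_time)  # → 20.0
print(batched_wall_time)     # → 2.0
```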

First, when building the engine, remember to enable the paged KV cache (`--kv_cache_type paged`) and paged context FMHA (`--use_paged_context_fmha enable`):

```shell
trtllm-build \
    --checkpoint_dir ./trtllm_checkpoint_fp16 \
    --output_dir ./trt_engines/ \
    --gemm_plugin \
    --max_batch_size \
    --max_seq_len \
    --max_num_tokens \
    --max_input_len \
    --remove_input_padding \
    --kv_cache_type paged \
    --use_paged_context_fmha enable \
    --gpt_attention_plugin \
    --mamba_conv1d_plugin \
    --nccl_plugin \
    --moe_plugin \
    --context_fmha \
    --reduce_fusion \
    --use_fused_mlp
```

Second, in the config.pbtxt of the postprocessing / preprocessing / tensorrt_llm_bls models, raise the instance count so that INSTANCE_COUNT >= TRITON_MAX_BATCH_SIZE.

https://github.com/triton-inference-server/tensorrtllm_backend?tab=readme-ov-file#modify-the-model-configuration

NOTE: It is recommended to match the number of pre/post_instance_counts with triton_max_batch_size for better performance.

For the larger Qwen2.5-32B model, convert the checkpoint the same way:

```shell
python3 /mnt/TensorRT-LLM/examples/qwen/convert_checkpoint.py \
    --model_dir /Qwen2.5-32B-Instruct \
    --output_dir ./trtllm_checkpoint_fp16
```

Specify the GPU ids, and make the mpirun -n parameter match the setting above:

```shell
CUDA_VISIBLE_DEVICES=...,6,7 mpirun --allow-run-as-root -n ... tritonserver --model-repository ./triton_model_repo/

CUDA_VISIBLE_DEVICES=...,6,7 trtllm-serve ./trt_engines/ --tokenizer /Qwen2.5-32B-Instruct/

CUDA_VISIBLE_DEVICES=... mpirun --allow-run-as-root -n ... python3 /mnt/triton/python/openai/openai_frontend/main.py --model-repository ./triton_model_repo/ --tokenizer /Qwen2.5-32B-Instruct/
```
