triton + tensorrt

Convert the HF checkpoint to the TensorRT-LLM format:

```shell
python3 /mnt/TensorRT-LLM/examples/qwen/convert_checkpoint.py \
    --model_dir /Qwen2.5-0.5B-Instruct \
    --output_dir ./trtllm_checkpoint_fp16
```

Output:

```
[TensorRT-LLM] TensorRT-LLM version: 0.17.0.post1
0.17.0.post1 198it [00:03, 62.91it/s]
Total time of converting checkpoints: 00:00:04
```
build engine

Build the engine (see https://nvidia.github.io/TensorRT-LLM/commands/trtllm-build.html):

```shell
trtllm-build --checkpoint_dir ./trtllm_checkpoint_fp16 \
    --output_dir ./trt_engines/ \
    --gemm_plugin
```
Prepare the Triton model repository from the tensorrtllm_backend templates:

```shell
git clone https://github.com/triton-inference-server/tensorrtllm_backend
mkdir ./triton_model_repo
cp -r ./tensorrtllm_backend/all_models/inflight_batcher_llm/* ./triton_model_repo/
```
Create a shell script that fills in the model configuration parameters, then run it:
```shell
ENGINE_DIR=/opt/tritonserver/trt_engines/
TOKENIZER_DIR=/Qwen2.5-0.5B-Instruct
MODEL_FOLDER=/opt/tritonserver/triton_model_repo
TRITON_MAX_BATCH_SIZE=
INSTANCE_COUNT=
MAX_QUEUE_DELAY_MS=
MAX_QUEUE_SIZE=
FILL_TEMPLATE_SCRIPT=/mnt/tensorrtllm_backend/tools/fill_template.py
DECOUPLED_MODE=
LOGITS_DATATYPE=

python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/ensemble/config.pbtxt triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},logits_datatype:${LOGITS_DATATYPE}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/preprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},preprocessing_instance_count:${INSTANCE_COUNT}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},engine_dir:${ENGINE_DIR},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MS},batching_strategy:inflight_fused_batching,max_queue_size:${MAX_QUEUE_SIZE},encoder_input_features_data_type:TYPE_FP16,logits_datatype:${LOGITS_DATATYPE}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/postprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},postprocessing_instance_count:${INSTANCE_COUNT}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},bls_instance_count:${INSTANCE_COUNT},logits_datatype:${LOGITS_DATATYPE}
```
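fill_template.py substitutes the `${key}` placeholders inside each config.pbtxt from a `key:value,key:value` argument string. A simplified Python sketch of that substitution, for intuition only (the stanza below is illustrative, not copied from the repo):

```python
import re

def fill_template(template: str, substitutions: str) -> str:
    """Replace ${key} placeholders using a 'key:value,key:value' string
    (a simplified stand-in for what fill_template.py does to config.pbtxt)."""
    params = dict(pair.split(":", 1) for pair in substitutions.split(","))
    # Unknown keys are left untouched, as ${key}
    return re.sub(r"\$\{(\w+)\}", lambda m: params.get(m.group(1), m.group(0)), template)

pbtxt = ('max_batch_size: ${triton_max_batch_size}\n'
         'parameters { key: "gpt_model_path" value: { string_value: "${engine_dir}" } }')
print(fill_template(pbtxt, "triton_max_batch_size:8,engine_dir:/opt/tritonserver/trt_engines/"))
```

Note the `split(":", 1)`: only the first colon separates key from value, so path values such as `engine_dir:/opt/...` survive intact.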
tritonserver deployment

```shell
tritonserver --model-repository ./triton_model_repo/
```
Call the ensemble / tensorrt_llm_bls models with curl:

```shell
curl -X POST localhost:8000/v2/models/ensemble/generate -d \
  '{"text_input": "你知道deepseek吗,写一首诗表达对大模型技术发展的惊叹", "max_tokens": 200, "bad_words": "", "stop_words": ""}'
```

Response (excerpt):

```
{
  "model_name": "ensemble",
  "model_version": ...,
  "sequence_end": ...,
  "sequence_id": ...,
  "sequence_start": ...,
  "text_output": "你知道deepseek吗,写一首诗表达对大模型技术发展的惊叹和敬畏之情。
深海深处,深海深处,
大模型技术,如星辰般璀璨。
它在数据海洋中航行,
探索未知,创造奇迹。
在无尽的宇宙中穿梭。
在文字的海洋中航行。
它用深度学习,创造奇迹,
在数据的海洋中航行。
它用自然语言处理,创造奇迹,
在语言的海洋中航行。
它用深度学习,创造奇迹,
在数据的海洋中航行。
它用自然语言处理,创造奇迹,
在语言的海洋中航行。
它用深度学习,创造奇迹,
在数据的海洋中航行。
它用自然语言处理,创造奇迹,
在语言的海洋中航行。
它用深度学习,创造奇迹,
在数据的海洋中航行。
```
trtllm-serve deployment

This starts an OpenAI-compatible API (https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/apps). It can be launched in three ways, each taking a different model format, much like `vllm serve`:

```shell
# from the HF model directory
trtllm-serve /Qwen2.5-0.5B-Instruct

# from the converted checkpoint
trtllm-serve ./trtllm_checkpoint_fp16/ --tokenizer /Qwen2.5-0.5B-Instruct/

# from the built engine
trtllm-serve ./trt_engines/ --tokenizer /Qwen2.5-0.5B-Instruct/
```
Startup log when serving the checkpoint (excerpt):

```
[TensorRT-LLM] TensorRT-LLM version: ...
/usr/local/lib/python3.12/dist-packages/torch/utils/cpp_extension.py:2011: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(...)
Loading Model: Loading HF model to memory
Loading Model: Building TRT-LLM engine
Loading model done. Total latency: ...
```

Startup log when serving the prebuilt engine (excerpt):

```
[TensorRT-LLM] TensorRT-LLM version: ...
[TensorRT-LLM] Engine version ...0.post1 found in the config file, assuming engine built by new builder API.
[TensorRT-LLM] Refreshed the MPI local session
[TensorRT-LLM] Rank ... is using GPU ...
[TensorRT-LLM] Fix optionalParams: KV cache reuse disabled because model was not built with paged context FMHA support
[TensorRT-LLM] TRTGptModel maxNumSequences: ...
[TensorRT-LLM] TRTGptModel maxBatchSize: ...
[TensorRT-LLM] TRTGptModel maxBeamWidth: ...
[TensorRT-LLM] TRTGptModel maxSequenceLen: ...
[TensorRT-LLM] TRTGptModel maxDraftLen: ...
[TensorRT-LLM] TRTGptModel mMaxAttentionWindowSize: ...
[TensorRT-LLM] TRTGptModel enableTrtOverlap: ...
[TensorRT-LLM] TRTGptModel normalizeLogProbs: ...
[TensorRT-LLM] TRTGptModel maxNumTokens: ...
[TensorRT-LLM] TRTGptModel maxInputLen: maxSequenceLen - 1 = ..., maxNumTokens = ... since context FMHA and usePackedInput are enabled
[TensorRT-LLM] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens)
[TensorRT-LLM] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM] Context Chunking Scheduler Policy: None
[TensorRT-LLM] Loaded engine size: ...
[TensorRT-LLM] Inspecting the engine to identify potential runtime issues
[TensorRT-LLM] The profiling verbosity of the engine does not allow this analysis to proceed. Re-build the engine with 'detailed' profiling verbosity to get more diagnostics.
[TensorRT-LLM] [MemUsageChange] ... execution context memory
[TensorRT-LLM] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU ...
[TensorRT-LLM] [MemUsageChange] ... GB GPU memory for runtime buffers
[TensorRT-LLM] [MemUsageChange] ... GB GPU memory ...
[TensorRT-LLM] Memory usage when calculating max tokens in paged kv cache: total: ... GiB, available: ... GiB
[TensorRT-LLM] Number of blocks in KV cache primary pool: ...
[TensorRT-LLM] Number of blocks in KV cache secondary pool: ..., onboard blocks to primary memory before reuse: ...
[TensorRT-LLM] KV cache block reuse is disabled
[TensorRT-LLM] Max KV cache pages per sequence: ...
[TensorRT-LLM] Number of tokens per block: ...
[TensorRT-LLM] [MemUsageChange] ... GB for max tokens in paged KV cache
INFO:     Started server process [...]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://localhost:8000 (Press CTRL+C to quit)
```
Query the completions endpoint:

```shell
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "/Qwen2.5-0.5B-Instruct",
        "prompt": "大模型为什么这么厉害?",
        "max_tokens": 16,
        "temperature": 0
    }'
```

Response (excerpt):

```
{
  "id": "cmpl-b8f4c498683b46e4ab1fe2e2de20a32",
  "object": "text_completion",
  "model": "Qwen2.5-0.5B-Instruct",
  "choices": [{"text": "为什么能比人类更好?\n在人工智能领域,我们经常看到各种各样的", "logprobs": ..., "finish_reason": ..., "stop_reason": ...}],
  "usage": {"prompt_tokens": ..., "total_tokens": ..., "completion_tokens": ...}
}
```
Query the chat completions endpoint:

```shell
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "/Qwen2.5-0.5B-Instruct",
        "messages": [{"role": "system", "content": "You are a helpful assistant."},
                     {"role": "user", "content": "什么促进了大模型发展?"}],
        "max_tokens": 50,
        "temperature": 0
    }'
```

Response (excerpt):

```
{
  "id": "chatcmpl-b861cc9725fd42ffa5baf6a406b4f624",
  "object": "chat.completion",
  "model": "Qwen2.5-0.5B-Instruct",
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "大模型是指深度学习模型,它在计算机视觉、自然语言处理、语音识别等领域取得了显著的进展。促进大模型发展的因素包括:\n1. 数据量的增加:随着大数据和云计算技术的发展,数据量的增加使得",
      "tool_calls": ...
    },
    "logprobs": ...,
    "finish_reason": ...,
    "stop_reason": ...
  }],
  "usage": {"prompt_tokens": ..., "total_tokens": ..., "completion_tokens": ...}
}
```
triton + OpenAI API deployment

Another OpenAI-compatible deployment option, still in beta (https://github.com/triton-inference-server/server/tree/main/python/openai):

```shell
python3 /mnt/server/python/openai/openai_frontend/main.py \
    --model-repository ./triton_model_repo/ \
    --tokenizer /Qwen2.5-0.5B-Instruct/
```

Call it with the "tensorrt_llm_bls" model:

```shell
MODEL="tensorrt_llm_bls"
curl -s http://localhost:9000/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "'${MODEL}'",
        "messages": [{"role": "user", "content": "Say this is a test!"}]
    }'
```
"cmpl-41ca7405-f3f6-11ef-9c8a-4753e2d4625b"
"finish_reason"
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.
Say this is a test!
I'm sorry, but I'm Qwen, an artificial intelligence language model created"
"tool_calls"
"assistant"
"function_call"
"logprobs"
1740542446
"tensorrt_llm_bls"
"system_fingerprint"
"chat.completion"
Benchmark with genai-perf:

```shell
pip install genai-perf

MODEL="tensorrt_llm_bls"
TOKENIZER="/Qwen2.5-0.5B-Instruct/"

genai-perf profile \
    -m ${MODEL} \
    --tokenizer ${TOKENIZER} \
    --service-kind openai \
    --endpoint-type chat \
    -u localhost:9000 \
    --synthetic-input-tokens-mean ... \
    --synthetic-input-tokens-stddev ... \
    --output-tokens-mean ... \
    --output-tokens-stddev ... \
    --request-count 50 \
    --warmup-request-count 10 \
    --concurrency 1 \
    --streaming
# --streaming requires DECOUPLED_MODE=true in the config script above
```

genai-perf output:

```
genai_perf.parser:115 - Profiling these models: tensorrt_llm_bls
genai_perf.subcommand.common:208 - Running Perf Analyzer : 'perf_analyzer -m tensorrt_llm_bls --async --input-data artifacts/tensorrt_llm_bls-openai-chat-concurrency1/inputs.json -i http --concurrency-range 1 --endpoint v1/chat/completions --service-kind openai -u localhost:9000 --request-count 50 --warmup-request-count 10 --profile-export-file artifacts/tensorrt_llm_bls-openai-chat-concurrency1/profile_export.json --measurement-interval 10000 --stability-percentage 999'
```
Output at concurrency = 1:

Output at concurrency = 10:

Time to first token (TTFT):

During testing, response latency grew linearly with concurrency, i.e. there was no concurrency benefit at all. After some pointers from a colleague, the following points need attention.
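That linear growth is the signature of requests being processed one at a time: with no in-flight batching, latency at concurrency c is roughly c times the single-request latency, whereas a batching server keeps it close to flat. A toy model of the two behaviours (the 2 s single-request latency and 5% per-request batching overhead are made-up numbers):

```python
def serialized_latency(concurrency: int, t_single: float) -> float:
    """No batching: concurrent requests queue up, so latency scales linearly."""
    return concurrency * t_single

def batched_latency(concurrency: int, t_single: float, overhead: float = 0.05) -> float:
    """In-flight batching: requests run together, paying only a small per-request overhead.
    The 5% overhead figure is an arbitrary assumption for illustration."""
    return t_single * (1 + overhead * (concurrency - 1))

t_single = 2.0  # made-up single-request latency in seconds
print(serialized_latency(10, t_single))  # 20.0 -- the linear blow-up seen in the test
print(batched_latency(10, t_single))     # 2.9  -- roughly flat once batching works
```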
When building the engine, remember to enable these two parameters:

```shell
--kv_cache_type
--use_paged_context_fmha
```
```shell
trtllm-build \
    --checkpoint_dir ./trtllm_checkpoint_fp16 \
    --output_dir ./trt_engines/ \
    --gemm_plugin \
    --max_batch_size \
    --max_seq_len \
    --max_num_tokens \
    --max_input_len \
    --remove_input_padding \
    --kv_cache_type \
    --use_paged_context_fmha \
    --gpt_attention_plugin \
    --mamba_conv1d_plugin \
    --nccl_plugin \
    --moe_plugin \
    --context_fmha \
    --reduce_fusion \
    --use_fused_mlp
```
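With a paged KV cache enabled, the pool sizes reported in the startup log come from simple per-token arithmetic. A back-of-the-envelope sketch for Qwen2.5-0.5B in FP16; the attention config (24 layers, 2 KV heads via GQA, head_dim 64) and the 64-token block size are assumptions, so treat the numbers as illustrative:

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies: K and V tensors, per layer, per KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Assumed Qwen2.5-0.5B-Instruct attention config (illustrative; check the model's config.json)
per_token = kv_bytes_per_token(num_layers=24, num_kv_heads=2, head_dim=64)
tokens_per_block = 64  # assumed block size ("Number of tokens per block" in the log)
print(per_token)                      # 12288 bytes, i.e. 12 KiB per token
print(per_token * tokens_per_block)   # 786432 bytes per KV cache block
```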
The instance count in the config.pbtxt of the postprocessing / preprocessing / tensorrt_llm_bls models needs to satisfy INSTANCE_COUNT >= TRITON_MAX_BATCH_SIZE. See https://github.com/triton-inference-server/tensorrtllm_backend?tab=readme-ov-file#modify-the-model-configuration:

NOTE: It is recommended to match the number of pre/post_instance_counts with triton_max_batch_size for better performance
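If the configs were already generated with too small a count, the `count` field in each model's instance_group stanza can be bumped directly. A sketch on an illustrative stanza (not copied from the real repo):

```python
import re

# Illustrative instance_group stanza in config.pbtxt syntax (not the real file)
pbtxt = """instance_group [
  {
    count: 1
    kind: KIND_CPU
  }
]"""

# Raise the instance count to match a triton_max_batch_size of 8
updated = re.sub(r"count:\s*\d+", "count: 8", pbtxt)
print(updated)
```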
For a larger model (Qwen2.5-32B-Instruct), convert the checkpoint the same way:

```shell
python3 /mnt/TensorRT-LLM/examples/qwen/convert_checkpoint.py \
    --model_dir /Qwen2.5-32B-Instruct \
    --output_dir ./trtllm_checkpoint_fp16
```

Specify the GPU IDs, and make sure the mpirun -n argument matches the setting above:
```shell
CUDA_VISIBLE_DEVICES=...,6,7 mpirun --allow-run-as-root -n ... \
    tritonserver --model-repository ./triton_model_repo/

CUDA_VISIBLE_DEVICES=...,6,7 trtllm-serve ./trt_engines/ --tokenizer /Qwen2.5-32B-Instruct/

CUDA_VISIBLE_DEVICES=... mpirun --allow-run-as-root -n ... \
    python3 /mnt/triton/python/openai/openai_frontend/main.py \
    --model-repository ./triton_model_repo/ \
    --tokenizer /Qwen2.5-32B-Instruct/
```
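As a rough sanity check on why the 32B model needs multiple GPUs: under tensor parallelism the FP16 weights are sharded evenly across ranks. The parameter count (~32.8B) and tp_size of 4 below are assumptions for illustration:

```python
def weight_gb_per_gpu(n_params: float, dtype_bytes: int, tp_size: int) -> float:
    """Weight memory per GPU when parameters are sharded evenly across tp_size ranks."""
    return n_params * dtype_bytes / tp_size / 1e9

# Assumed: ~32.8e9 params (Qwen2.5-32B), FP16 (2 bytes), tp_size 4 -- illustrative values
print(round(weight_gb_per_gpu(32.8e9, 2, 4), 1))  # 16.4 GB per GPU, before KV cache/buffers
```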
