triton + tensorrt

Convert the HF checkpoint to the TensorRT-LLM format:

```shell
python3 /mnt/TensorRT-LLM/examples/qwen/convert_checkpoint.py \
    --model_dir /Qwen2.5-0.5B-Instruct \
    --output_dir ./trtllm_checkpoint_fp16
```

Output:

```
[TensorRT-LLM] TensorRT-LLM version: 0.17.0.post1
0.17.0.post1 198it [00:03, 62.91it/s]
Total time of converting checkpoints: 00:00:04
```
build engine

Build the engine (see https://nvidia.github.io/TensorRT-LLM/commands/trtllm-build.html):

```shell
trtllm-build --checkpoint_dir ./trtllm_checkpoint_fp16 \
    --output_dir ./trt_engines/ \
    --gemm_plugin
```
Prepare the Triton model repository from the tensorrtllm_backend templates:

```shell
git clone https://github.com/triton-inference-server/tensorrtllm_backend
mkdir ./triton_model_repo
cp -r ./tensorrtllm_backend/all_models/inflight_batcher_llm/* ./triton_model_repo/
```
Create a shell script that fills in the model configuration parameters, then run it:
```shell
ENGINE_DIR=/opt/tritonserver/trt_engines/
TOKENIZER_DIR=/Qwen2.5-0.5B-Instruct
MODEL_FOLDER=/opt/tritonserver/triton_model_repo
TRITON_MAX_BATCH_SIZE=
INSTANCE_COUNT=
MAX_QUEUE_DELAY_MS=
MAX_QUEUE_SIZE=
FILL_TEMPLATE_SCRIPT=/mnt/tensorrtllm_backend/tools/fill_template.py
DECOUPLED_MODE=
LOGITS_DATATYPE=

python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/ensemble/config.pbtxt triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},logits_datatype:${LOGITS_DATATYPE}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/preprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},preprocessing_instance_count:${INSTANCE_COUNT}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},engine_dir:${ENGINE_DIR},max_queue_delay_microseconds:${MAX_QUEUE_DELAY_MS},batching_strategy:inflight_fused_batching,max_queue_size:${MAX_QUEUE_SIZE},encoder_input_features_data_type:TYPE_FP16,logits_datatype:${LOGITS_DATATYPE}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/postprocessing/config.pbtxt tokenizer_dir:${TOKENIZER_DIR},triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},postprocessing_instance_count:${INSTANCE_COUNT}
python3 ${FILL_TEMPLATE_SCRIPT} -i ${MODEL_FOLDER}/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${TRITON_MAX_BATCH_SIZE},decoupled_mode:${DECOUPLED_MODE},bls_instance_count:${INSTANCE_COUNT},logits_datatype:${LOGITS_DATATYPE}
```
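fill_template.py substitutes the `${key}` placeholders inside each config.pbtxt from a `key:value,key:value` argument string. A simplified Python sketch of that substitution, for intuition only (the stanza below is illustrative, not copied from the repo):

```python
import re

def fill_template(template: str, substitutions: str) -> str:
    """Replace ${key} placeholders using a 'key:value,key:value' string
    (a simplified stand-in for what fill_template.py does to config.pbtxt)."""
    params = dict(pair.split(":", 1) for pair in substitutions.split(","))
    # Unknown keys are left untouched, as ${key}
    return re.sub(r"\$\{(\w+)\}", lambda m: params.get(m.group(1), m.group(0)), template)

pbtxt = ('max_batch_size: ${triton_max_batch_size}\n'
         'parameters { key: "gpt_model_path" value: { string_value: "${engine_dir}" } }')
print(fill_template(pbtxt, "triton_max_batch_size:8,engine_dir:/opt/tritonserver/trt_engines/"))
```

Note the `split(":", 1)`: only the first colon separates key from value, so path values such as `engine_dir:/opt/...` survive intact.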
tritonserver deployment

```shell
tritonserver --model-repository ./triton_model_repo/
```
Call the ensemble / tensorrt_llm_bls models with curl:

```shell
curl -X POST localhost:8000/v2/models/ensemble/generate -d \
  '{"text_input": "你知道deepseek吗,写一首诗表达对大模型技术发展的惊叹", "max_tokens": 200, "bad_words": "", "stop_words": ""}'
```

Response (excerpt):

```
{
  "model_name": "ensemble",
  "model_version": ...,
  "sequence_end": ...,
  "sequence_id": ...,
  "sequence_start": ...,
  "text_output": "你知道deepseek吗,写一首诗表达对大模型技术发展的惊叹和敬畏之情。
深海深处,深海深处,
大模型技术,如星辰般璀璨。
它在数据海洋中航行,
探索未知,创造奇迹。
在无尽的宇宙中穿梭。
在文字的海洋中航行。
它用深度学习,创造奇迹,
在数据的海洋中航行。
它用自然语言处理,创造奇迹,
在语言的海洋中航行。
它用深度学习,创造奇迹,
在数据的海洋中航行。
它用自然语言处理,创造奇迹,
在语言的海洋中航行。
它用深度学习,创造奇迹,
在数据的海洋中航行。
它用自然语言处理,创造奇迹,
在语言的海洋中航行。
它用深度学习,创造奇迹,
在数据的海洋中航行。
```
trtllm-serve deployment

This starts an OpenAI-compatible API (https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/apps). It can be launched in three ways, each taking a different model format, much like `vllm serve`:

```shell
# from the HF model directory
trtllm-serve /Qwen2.5-0.5B-Instruct

# from the converted checkpoint
trtllm-serve ./trtllm_checkpoint_fp16/ --tokenizer /Qwen2.5-0.5B-Instruct/

# from the built engine
trtllm-serve ./trt_engines/ --tokenizer /Qwen2.5-0.5B-Instruct/
```
Startup log when serving the checkpoint (excerpt):

```
[TensorRT-LLM] TensorRT-LLM version: ...
/usr/local/lib/python3.12/dist-packages/torch/utils/cpp_extension.py:2011: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(...)
Loading Model: Loading HF model to memory
Loading Model: Building TRT-LLM engine
Loading model done. Total latency: ...
```

Startup log when serving the prebuilt engine (excerpt):

```
[TensorRT-LLM] TensorRT-LLM version: ...
[TensorRT-LLM] Engine version ...0.post1 found in the config file, assuming engine built by new builder API.
[TensorRT-LLM] Refreshed the MPI local session
[TensorRT-LLM] Rank ... is using GPU ...
[TensorRT-LLM] Fix optionalParams: KV cache reuse disabled because model was not built with paged context FMHA support
[TensorRT-LLM] TRTGptModel maxNumSequences: ...
[TensorRT-LLM] TRTGptModel maxBatchSize: ...
[TensorRT-LLM] TRTGptModel maxBeamWidth: ...
[TensorRT-LLM] TRTGptModel maxSequenceLen: ...
[TensorRT-LLM] TRTGptModel maxDraftLen: ...
[TensorRT-LLM] TRTGptModel mMaxAttentionWindowSize: ...
[TensorRT-LLM] TRTGptModel enableTrtOverlap: ...
[TensorRT-LLM] TRTGptModel normalizeLogProbs: ...
[TensorRT-LLM] TRTGptModel maxNumTokens: ...
[TensorRT-LLM] TRTGptModel maxInputLen: maxSequenceLen - 1 = ..., maxNumTokens = ... since context FMHA and usePackedInput are enabled
[TensorRT-LLM] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens)
[TensorRT-LLM] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM] Context Chunking Scheduler Policy: None
[TensorRT-LLM] Loaded engine size: ...
[TensorRT-LLM] Inspecting the engine to identify potential runtime issues
[TensorRT-LLM] The profiling verbosity of the engine does not allow this analysis to proceed. Re-build the engine with 'detailed' profiling verbosity to get more diagnostics.
[TensorRT-LLM] [MemUsageChange] ... execution context memory
[TensorRT-LLM] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU ...
[TensorRT-LLM] [MemUsageChange] ... GB GPU memory for runtime buffers
[TensorRT-LLM] [MemUsageChange] ... GB GPU memory ...
[TensorRT-LLM] Memory usage when calculating max tokens in paged kv cache: total: ... GiB, available: ... GiB
[TensorRT-LLM] Number of blocks in KV cache primary pool: ...
[TensorRT-LLM] Number of blocks in KV cache secondary pool: ..., onboard blocks to primary memory before reuse: ...
[TensorRT-LLM] KV cache block reuse is disabled
[TensorRT-LLM] Max KV cache pages per sequence: ...
[TensorRT-LLM] Number of tokens per block: ...
[TensorRT-LLM] [MemUsageChange] ... GB for max tokens in paged KV cache
INFO:     Started server process [...]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://localhost:8000 (Press CTRL+C to quit)
```
Query the completions endpoint:

```shell
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "/Qwen2.5-0.5B-Instruct",
        "prompt": "大模型为什么这么厉害?",
        "max_tokens": 16,
        "temperature": 0
    }'
```

Response (excerpt):

```
{
  "id": "cmpl-b8f4c498683b46e4ab1fe2e2de20a32",
  "object": "text_completion",
  "model": "Qwen2.5-0.5B-Instruct",
  "choices": [{"text": "为什么能比人类更好?\n在人工智能领域,我们经常看到各种各样的", "logprobs": ..., "finish_reason": ..., "stop_reason": ...}],
  "usage": {"prompt_tokens": ..., "total_tokens": ..., "completion_tokens": ...}
}
```
Query the chat completions endpoint:

```shell
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "/Qwen2.5-0.5B-Instruct",
        "messages": [{"role": "system", "content": "You are a helpful assistant."},
                     {"role": "user", "content": "什么促进了大模型发展?"}],
        "max_tokens": 50,
        "temperature": 0
    }'
```

Response (excerpt):

```
{
  "id": "chatcmpl-b861cc9725fd42ffa5baf6a406b4f624",
  "object": "chat.completion",
  "model": "Qwen2.5-0.5B-Instruct",
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "大模型是指深度学习模型,它在计算机视觉、自然语言处理、语音识别等领域取得了显著的进展。促进大模型发展的因素包括:\n1. 数据量的增加:随着大数据和云计算技术的发展,数据量的增加使得",
      "tool_calls": ...
    },
    "logprobs": ...,
    "finish_reason": ...,
    "stop_reason": ...
  }],
  "usage": {"prompt_tokens": ..., "total_tokens": ..., "completion_tokens": ...}
}
```
triton + OpenAI API deployment

Another OpenAI-compatible deployment option, still in beta (https://github.com/triton-inference-server/server/tree/main/python/openai):

```shell
python3 /mnt/server/python/openai/openai_frontend/main.py \
    --model-repository ./triton_model_repo/ \
    --tokenizer /Qwen2.5-0.5B-Instruct/
```

Call it with the "tensorrt_llm_bls" model:

```shell
MODEL="tensorrt_llm_bls"
curl -s http://localhost:9000/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "'${MODEL}'",
        "messages": [{"role": "user", "content": "Say this is a test!"}]
    }'
```
"cmpl-41ca7405-f3f6-11ef-9c8a-4753e2d4625b"
"finish_reason"
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.
Say this is a test!
I'm sorry, but I'm Qwen, an artificial intelligence language model created"
"tool_calls"
"assistant"
"function_call"
"logprobs"
1740542446
"tensorrt_llm_bls"
"system_fingerprint"
"chat.completion"
Benchmark with genai-perf:

```shell
pip install genai-perf

MODEL="tensorrt_llm_bls"
TOKENIZER="/Qwen2.5-0.5B-Instruct/"

genai-perf profile \
    -m ${MODEL} \
    --tokenizer ${TOKENIZER} \
    --service-kind openai \
    --endpoint-type chat \
    -u localhost:9000 \
    --synthetic-input-tokens-mean ... \
    --synthetic-input-tokens-stddev ... \
    --output-tokens-mean ... \
    --output-tokens-stddev ... \
    --request-count 50 \
    --warmup-request-count 10 \
    --concurrency 1 \
    --streaming
# --streaming requires DECOUPLED_MODE=true in the config script above
```

genai-perf output:

```
genai_perf.parser:115 - Profiling these models: tensorrt_llm_bls
genai_perf.subcommand.common:208 - Running Perf Analyzer : 'perf_analyzer -m tensorrt_llm_bls --async --input-data artifacts/tensorrt_llm_bls-openai-chat-concurrency1/inputs.json -i http --concurrency-range 1 --endpoint v1/chat/completions --service-kind openai -u localhost:9000 --request-count 50 --warmup-request-count 10 --profile-export-file artifacts/tensorrt_llm_bls-openai-chat-concurrency1/profile_export.json --measurement-interval 10000 --stability-percentage 999'
```
Output at concurrency = 1:

Output at concurrency = 10:

Time to first token (TTFT):

During testing, response latency grew linearly with concurrency, i.e. there was no concurrency benefit at all. After some pointers from a colleague, the following points need attention.
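That linear growth is the signature of requests being processed one at a time: with no in-flight batching, latency at concurrency c is roughly c times the single-request latency, whereas a batching server keeps it close to flat. A toy model of the two behaviours (the 2 s single-request latency and 5% per-request batching overhead are made-up numbers):

```python
def serialized_latency(concurrency: int, t_single: float) -> float:
    """No batching: concurrent requests queue up, so latency scales linearly."""
    return concurrency * t_single

def batched_latency(concurrency: int, t_single: float, overhead: float = 0.05) -> float:
    """In-flight batching: requests run together, paying only a small per-request overhead.
    The 5% overhead figure is an arbitrary assumption for illustration."""
    return t_single * (1 + overhead * (concurrency - 1))

t_single = 2.0  # made-up single-request latency in seconds
print(serialized_latency(10, t_single))  # 20.0 -- the linear blow-up seen in the test
print(batched_latency(10, t_single))     # 2.9  -- roughly flat once batching works
```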
When building the engine, remember to enable these two parameters:

```shell
--kv_cache_type
--use_paged_context_fmha
```
```shell
trtllm-build \
    --checkpoint_dir ./trtllm_checkpoint_fp16 \
    --output_dir ./trt_engines/ \
    --gemm_plugin \
    --max_batch_size \
    --max_seq_len \
    --max_num_tokens \
    --max_input_len \
    --remove_input_padding \
    --kv_cache_type \
    --use_paged_context_fmha \
    --gpt_attention_plugin \
    --mamba_conv1d_plugin \
    --nccl_plugin \
    --moe_plugin \
    --context_fmha \
    --reduce_fusion \
    --use_fused_mlp
```
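With a paged KV cache enabled, the pool sizes reported in the startup log come from simple per-token arithmetic. A back-of-the-envelope sketch for Qwen2.5-0.5B in FP16; the attention config (24 layers, 2 KV heads via GQA, head_dim 64) and the 64-token block size are assumptions, so treat the numbers as illustrative:

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies: K and V tensors, per layer, per KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Assumed Qwen2.5-0.5B-Instruct attention config (illustrative; check the model's config.json)
per_token = kv_bytes_per_token(num_layers=24, num_kv_heads=2, head_dim=64)
tokens_per_block = 64  # assumed block size ("Number of tokens per block" in the log)
print(per_token)                      # 12288 bytes, i.e. 12 KiB per token
print(per_token * tokens_per_block)   # 786432 bytes per KV cache block
```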
The instance count in the config.pbtxt of the postprocessing / preprocessing / tensorrt_llm_bls models needs to satisfy INSTANCE_COUNT >= TRITON_MAX_BATCH_SIZE. See https://github.com/triton-inference-server/tensorrtllm_backend?tab=readme-ov-file#modify-the-model-configuration:

NOTE: It is recommended to match the number of pre/post_instance_counts with triton_max_batch_size for better performance
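If the configs were already generated with too small a count, the `count` field in each model's instance_group stanza can be bumped directly. A sketch on an illustrative stanza (not copied from the real repo):

```python
import re

# Illustrative instance_group stanza in config.pbtxt syntax (not the real file)
pbtxt = """instance_group [
  {
    count: 1
    kind: KIND_CPU
  }
]"""

# Raise the instance count to match a triton_max_batch_size of 8
updated = re.sub(r"count:\s*\d+", "count: 8", pbtxt)
print(updated)
```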
For a larger model (Qwen2.5-32B-Instruct), convert the checkpoint the same way:

```shell
python3 /mnt/TensorRT-LLM/examples/qwen/convert_checkpoint.py \
    --model_dir /Qwen2.5-32B-Instruct \
    --output_dir ./trtllm_checkpoint_fp16
```

Specify the GPU IDs, and make sure the mpirun -n argument matches the setting above:
```shell
CUDA_VISIBLE_DEVICES=...,6,7 mpirun --allow-run-as-root -n ... \
    tritonserver --model-repository ./triton_model_repo/

CUDA_VISIBLE_DEVICES=...,6,7 trtllm-serve ./trt_engines/ --tokenizer /Qwen2.5-32B-Instruct/

CUDA_VISIBLE_DEVICES=... mpirun --allow-run-as-root -n ... \
    python3 /mnt/triton/python/openai/openai_frontend/main.py \
    --model-repository ./triton_model_repo/ \
    --tokenizer /Qwen2.5-32B-Instruct/
```
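As a rough sanity check on why the 32B model needs multiple GPUs: under tensor parallelism the FP16 weights are sharded evenly across ranks. The parameter count (~32.8B) and tp_size of 4 below are assumptions for illustration:

```python
def weight_gb_per_gpu(n_params: float, dtype_bytes: int, tp_size: int) -> float:
    """Weight memory per GPU when parameters are sharded evenly across tp_size ranks."""
    return n_params * dtype_bytes / tp_size / 1e9

# Assumed: ~32.8e9 params (Qwen2.5-32B), FP16 (2 bytes), tp_size 4 -- illustrative values
print(round(weight_gb_per_gpu(32.8e9, 2, 4), 1))  # 16.4 GB per GPU, before KV cache/buffers
```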
