Enable maximum power mode
bluesanta@ubuntu:~$ sudo nvpmodel -m 0
NVPM WARN: Golden image context is already created
NVPM WARN: Reboot required for changing to this power mode: 0
NVPM WARN: DO YOU WANT TO REBOOT NOW? enter YES/yes to confirm:
yes
NVPM WARN: rebooting..
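After the reboot it can be worth confirming that mode 0 is really active. A minimal sketch, assuming nvpmodel -q prints the mode name followed by the numeric mode ID on the last line; parse_power_mode_id is a hypothetical helper, not part of nvpmodel.

```shell
# Hypothetical helper: read `nvpmodel -q` output on stdin and print the
# numeric mode ID, assuming it is the last non-empty line of the output.
# Exits non-zero if that line is not a plain number.
parse_power_mode_id() {
    awk 'NF { last = $0 } END { if (last ~ /^[0-9]+$/) print last; else exit 1 }'
}

# Usage on the device (commented out so the snippet itself has no side effects):
#   sudo nvpmodel -q | parse_power_mode_id
```

If the printed ID is not 0, the mode switch did not take effect and the nvpmodel step would need to be repeated.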
Enable Jetson Clocks: lock the clocks at maximum (the --fan option additionally pins the fan speed to maximum)
bluesanta@ubuntu:~$ sudo jetson_clocks
Set up NVMe swap: even with 64GB of memory the build can run short, so 32GB or more of swap is recommended
bluesanta@ubuntu:~$ sudo fallocate -l 32G /swapfile
bluesanta@ubuntu:~$ sudo chmod 600 /swapfile
bluesanta@ubuntu:~$ sudo mkswap /swapfile
Setting up swapspace version 1, size = 32 GiB (34359734272 bytes)
no label, UUID=fd9a09b8-5892-4904-aa0a-30b37166c229
bluesanta@ubuntu:~$ sudo swapon /swapfile
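The swapon above only lasts until the next reboot. One common way to keep the swap file active is an /etc/fstab entry. This step is not in the original notes, and add_swap_entry is a hypothetical helper. It appends the entry only when the file does not already list the swap path, so it is safe to run twice.

```shell
# Hypothetical helper: append a swap entry for path $1 to the fstab-style
# file $2, unless a line starting with that path is already present.
add_swap_entry() {
    swap_path=$1
    fstab_file=$2
    if grep -q "^${swap_path}[[:space:]]" "$fstab_file" 2>/dev/null; then
        return 0    # entry already there, nothing to do
    fi
    printf '%s none swap sw 0 0\n' "$swap_path" >> "$fstab_file"
}

# Usage in a root shell on the device (commented out; edits a system file):
#   add_swap_entry /swapfile /etc/fstab
#   swapon --show    # lists /swapfile if the swap is active
```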
Build vLLM from source (optimized)
Remove the old version and leftover build artifacts
(.venv) bluesanta@ubuntu:~/llm$ pip uninstall -y vllm
(.venv) bluesanta@ubuntu:~/llm/vllm$ rm -rf build
Check the CUDA architecture (Compute Capability)
(.venv) bluesanta@ubuntu:~/llm$ nvidia-smi --query-gpu=compute_cap --format=csv,noheader
8.7
Set Jetson-specific build environment variables
(.venv) bluesanta@ubuntu:~/llm$ export TORCH_CUDA_ARCH_LIST="8.7"
(.venv) bluesanta@ubuntu:~/llm$ export VLLM_TARGET_DEVICE="cuda"
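Instead of typing the value by hand, the compute capability reported by nvidia-smi can be fed straight into the variable. A sketch, assuming the first line of the nvidia-smi output belongs to the GPU to build for; first_compute_cap is a hypothetical helper.

```shell
# Hypothetical helper: print the first non-empty line of stdin with
# leading and trailing whitespace removed (for example "8.7").
first_compute_cap() {
    awk 'NF { gsub(/^[ \t]+|[ \t]+$/, ""); print; exit }'
}

# Usage on the device (commented out; requires nvidia-smi):
#   TORCH_CUDA_ARCH_LIST=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | first_compute_cap)
#   export TORCH_CUDA_ARCH_LIST
```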
Install vLLM
(.venv) bluesanta@ubuntu:~/llm$ git clone https://github.com/vllm-project/vllm.git
(.venv) bluesanta@ubuntu:~/llm$ cd vllm
(.venv) bluesanta@ubuntu:~/llm/vllm$ pip install setuptools_scm
(.venv) bluesanta@ubuntu:~/llm/vllm$ pip install --upgrade pip setuptools setuptools-scm wheel
(.venv) bluesanta@ubuntu:~/llm/vllm$ sudo apt install -y ninja-build
(.venv) bluesanta@ubuntu:~/llm/vllm$ MAX_JOBS=6 pip install -e .
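MAX_JOBS caps how many compile jobs run in parallel, which is the main lever against running out of memory during the build. The notes do not say how the value 6 was chosen. The sketch below is one possible rule of thumb, not the author's method: take the smaller of the core count and the memory divided by an assumed per-job budget. pick_max_jobs and the per-job figure are hypothetical.

```shell
# Hypothetical helper: choose a parallel job count from the CPU core count ($1),
# the memory in GiB ($2), and an assumed per-job memory budget in GiB ($3).
pick_max_jobs() {
    cores=$1
    mem_gib=$2
    per_job_gib=$3
    by_mem=$(( mem_gib / per_job_gib ))
    if [ "$by_mem" -lt 1 ]; then
        by_mem=1    # always allow at least one job
    fi
    if [ "$by_mem" -lt "$cores" ]; then
        echo "$by_mem"
    else
        echo "$cores"
    fi
}

# Usage (commented out; the 10 GiB per job is a guess, tune it for your build):
#   MAX_JOBS=$(pick_max_jobs "$(nproc)" 64 10) pip install -e .
```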
Check the version
(.venv) bluesanta@ubuntu:~/llm$ vllm --version
0.20.2rc1.dev93+g51f22dcfd.cu126
Run vLLM
(.venv) bluesanta@ubuntu:~/llm/vllm$ vllm serve /home/bluesanta/llm/models/gemma-4-31B-it-FP8-block --tensor-parallel-size 1 --max-model-len 4096 --max-num-batched-tokens 4096 --gpu-memory-utilization 0.7 --quantization compressed-tensors --dtype bfloat16 --enforce-eager --speculative-config '{"model": "/home/bluesanta/llm/models/gemma-4-31B-it-assistant", "num_speculative_tokens": 5}'
(APIServer pid=29780) INFO 05-07 19:54:39 [utils.py:306]
(APIServer pid=29780) INFO 05-07 19:54:39 [utils.py:306] █ █ █▄ ▄█
(APIServer pid=29780) INFO 05-07 19:54:39 [utils.py:306] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.20.2rc1.dev93+g51f22dcfd
(APIServer pid=29780) INFO 05-07 19:54:39 [utils.py:306] █▄█▀ █ █ █ █ model /home/bluesanta/llm/models/gemma-4-31B-it-FP8-block
(APIServer pid=29780) INFO 05-07 19:54:39 [utils.py:306] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=29780) INFO 05-07 19:54:39 [utils.py:306]
(APIServer pid=29780) INFO 05-07 19:54:39 [utils.py:240] non-default args: {'model_tag': '/home/bluesanta/llm/models/gemma-4-31B-it-FP8-block', 'model': '/home/bluesanta/llm/models/gemma-4-31B-it-FP8-block', 'dtype': 'bfloat16', 'max_model_len': 4096, 'quantization': 'compressed-tensors', 'enforce_eager': True, 'gpu_memory_utilization': 0.7, 'max_num_batched_tokens': 4096, 'speculative_config': {'model': '/home/bluesanta/llm/models/gemma-4-31B-it-assistant', 'num_speculative_tokens': 5}}
(APIServer pid=29780) INFO 05-07 19:54:39 [model.py:563] Resolved architecture: Gemma4ForConditionalGeneration
(APIServer pid=29780) INFO 05-07 19:54:39 [model.py:1692] Using max model len 4096
(APIServer pid=29780) INFO 05-07 19:54:41 [nixl_utils.py:20] Setting UCX_RCACHE_MAX_UNRELEASED to '1024' to avoid a rare memory leak in UCX when using NIXL.
(APIServer pid=29780) WARNING 05-07 19:54:41 [nixl_utils.py:34] NIXL is not available
(APIServer pid=29780) WARNING 05-07 19:54:41 [nixl_utils.py:44] NIXL agent config is not available
(APIServer pid=29780) INFO 05-07 19:54:42 [model.py:563] Resolved architecture: Gemma4MTPModel
(APIServer pid=29780) INFO 05-07 19:54:42 [model.py:1692] Using max model len 262144
(APIServer pid=29780) WARNING 05-07 19:54:42 [speculative.py:672] Enabling num_speculative_tokens > 1 will run multiple times of forward on same MTP layer,which may result in lower acceptance rate
(APIServer pid=29780) INFO 05-07 19:54:42 [speculative.py:858] Overriding draft model max model len from 262144 to 4096
(APIServer pid=29780) INFO 05-07 19:54:42 [scheduler.py:239] Chunked prefill is enabled with max_num_batched_tokens=4096.
(APIServer pid=29780) INFO 05-07 19:54:42 [config.py:101] Gemma4 model has heterogeneous head dimensions (head_dim=256, global_head_dim=512). Forcing TRITON_ATTN backend to prevent mixed-backend numerical divergence.
(APIServer pid=29780) INFO 05-07 19:54:42 [vllm.py:844] Asynchronous scheduling is enabled.
(APIServer pid=29780) WARNING 05-07 19:54:42 [vllm.py:900] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(APIServer pid=29780) WARNING 05-07 19:54:42 [vllm.py:918] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(APIServer pid=29780) INFO 05-07 19:54:42 [kernel.py:212] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['vllm_c', 'native'], fused_add_rms_norm=['vllm_c', 'native'])
(APIServer pid=29780) WARNING 05-07 19:54:42 [vllm.py:1406] max_num_scheduled_tokens is set to 4096 based on the speculative decoding settings. This may lead to suboptimal performance. Consider increasing max_num_batched_tokens to accommodate the additional draft token slots, or decrease num_speculative_tokens or max_num_seqs.
(APIServer pid=29780) INFO 05-07 19:54:42 [vllm.py:1093] Cudagraph is disabled under eager mode
(APIServer pid=29780) WARNING 05-07 19:54:42 [cuda.py:233] Forcing --disable_chunked_mm_input for models with multimodal-bidirectional attention.
(APIServer pid=29780) INFO 05-07 19:54:42 [compilation.py:303] Enabled custom fusions: norm_quant, act_quant
WARNING 05-07 19:55:07 [nixl_utils.py:34] NIXL is not available
WARNING 05-07 19:55:07 [nixl_utils.py:44] NIXL agent config is not available
(EngineCore pid=29812) INFO 05-07 19:55:07 [core.py:109] Initializing a V1 LLM engine (v0.20.2rc1.dev93+g51f22dcfd) with config: model='/home/bluesanta/llm/models/gemma-4-31B-it-FP8-block', speculative_config=SpeculativeConfig(method='mtp', model='/home/bluesanta/llm/models/gemma-4-31B-it-assistant', num_spec_tokens=5), tokenizer='/home/bluesanta/llm/models/gemma-4-31B-it-FP8-block', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=compressed-tensors, quantization_config=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/home/bluesanta/llm/models/gemma-4-31B-it-FP8-block, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': , 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['+quant_fp8', 'all', '+quant_fp8'], 'ir_enable_torch_wrap': False, 'splitting_ops': [], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_vision_items_per_batch': 0, 
'encoder_cudagraph_max_frames_per_batch': None, 'compile_sizes': [], 'compile_ranges_endpoints': [4096], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': , 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False, 'fuse_act_padding': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['vllm_c', 'native'], fused_add_rms_norm=['vllm_c', 'native']), enable_flashinfer_autotune=False, moe_backend='auto')
(EngineCore pid=29812) INFO 05-07 19:55:12 [parallel_state.py:1410] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://192.168.0.237:46069 backend=nccl
(EngineCore pid=29812) INFO 05-07 19:55:12 [parallel_state.py:1723] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=29812) INFO 05-07 19:55:14 [topk_topp_sampler.py:45] Using FlashInfer for top-p & top-k sampling.
(EngineCore pid=29812) WARNING 05-07 19:55:14 [__init__.py:204] min_p and logit_bias parameters won't work with speculative decoding.
(EngineCore pid=29812) INFO 05-07 19:55:14 [gpu_model_runner.py:4842] Starting to load model /home/bluesanta/llm/models/gemma-4-31B-it-FP8-block...
(EngineCore pid=29812) INFO 05-07 19:55:14 [vllm.py:844] Asynchronous scheduling is enabled.
(EngineCore pid=29812) WARNING 05-07 19:55:14 [vllm.py:900] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(EngineCore pid=29812) WARNING 05-07 19:55:14 [vllm.py:918] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(EngineCore pid=29812) INFO 05-07 19:55:14 [kernel.py:212] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['vllm_c', 'native'], fused_add_rms_norm=['vllm_c', 'native'])
(EngineCore pid=29812) WARNING 05-07 19:55:14 [vllm.py:1406] max_num_scheduled_tokens is set to 4096 based on the speculative decoding settings. This may lead to suboptimal performance. Consider increasing max_num_batched_tokens to accommodate the additional draft token slots, or decrease num_speculative_tokens or max_num_seqs.
(EngineCore pid=29812) INFO 05-07 19:55:14 [vllm.py:1093] Cudagraph is disabled under eager mode
(EngineCore pid=29812) INFO 05-07 19:55:14 [compilation.py:303] Enabled custom fusions: norm_quant, act_quant
(EngineCore pid=29812) INFO 05-07 19:55:15 [cuda.py:308] Using AttentionBackendEnum.TRITON_ATTN backend.
(EngineCore pid=29812) INFO 05-07 19:55:16 [cuda.py:308] Using AttentionBackendEnum.TRITON_ATTN backend.
(EngineCore pid=29812) INFO 05-07 19:55:23 [weight_utils.py:904] Filesystem type for checkpoints: EXT4. Checkpoint size: 30.98 GiB. Available RAM: 17.28 GiB.
(EngineCore pid=29812) INFO 05-07 19:55:23 [weight_utils.py:934] Auto-prefetch is disabled because the filesystem (EXT4) is not a recognized network FS (NFS/Lustre) and the checkpoint size (30.98 GiB) exceeds 90% of available RAM (17.28 GiB).
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:19<00:19, 19.16s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:24<00:00, 10.97s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:24<00:00, 12.20s/it]
(EngineCore pid=29812)
(EngineCore pid=29812) INFO 05-07 19:55:48 [default_loader.py:391] Loading weights took 24.73 seconds
(EngineCore pid=29812) WARNING 05-07 19:55:48 [marlin_utils_fp8.py:97] Your GPU does not have native support for FP8 computation but FP8 quantization is being used. Weight-only FP8 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads.
(EngineCore pid=29812) INFO 05-07 19:55:57 [gpu_model_runner.py:4866] Loading drafter model...
(EngineCore pid=29812) WARNING 05-07 19:55:57 [vllm.py:900] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(EngineCore pid=29812) WARNING 05-07 19:55:57 [vllm.py:918] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(EngineCore pid=29812) INFO 05-07 19:55:57 [kernel.py:212] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['vllm_c', 'native'], fused_add_rms_norm=['vllm_c', 'native'])
(EngineCore pid=29812) INFO 05-07 19:55:57 [vllm.py:1093] Cudagraph is disabled under eager mode
(EngineCore pid=29812) WARNING 05-07 19:55:57 [vllm.py:900] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(EngineCore pid=29812) WARNING 05-07 19:55:57 [vllm.py:918] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(EngineCore pid=29812) INFO 05-07 19:55:57 [kernel.py:212] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['vllm_c', 'native'], fused_add_rms_norm=['vllm_c', 'native'])
(EngineCore pid=29812) INFO 05-07 19:55:57 [vllm.py:1093] Cudagraph is disabled under eager mode
(EngineCore pid=29812) INFO 05-07 19:55:57 [weight_utils.py:904] Filesystem type for checkpoints: EXT4. Checkpoint size: 0.87 GiB. Available RAM: 16.89 GiB.
(EngineCore pid=29812) INFO 05-07 19:55:57 [weight_utils.py:927] Auto-prefetch is disabled because the filesystem (EXT4) is not a recognized network FS (NFS/Lustre). If you want to force prefetching, start vLLM with --safetensors-load-strategy=prefetch.
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 1.40it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 1.40it/s]
(EngineCore pid=29812)
(EngineCore pid=29812) INFO 05-07 19:55:58 [default_loader.py:391] Loading weights took 0.76 seconds
(EngineCore pid=29812) WARNING 05-07 19:55:58 [llm_base_proposer.py:1375] Draft model does not support multimodal inputs, falling back to text-only mode
(EngineCore pid=29812) INFO 05-07 19:55:58 [llm_base_proposer.py:1487] Detected MTP model. Sharing target model embedding weights with the draft model.
(EngineCore pid=29812) INFO 05-07 19:55:58 [gemma4.py:171] Gemma4 MTP: keeping draft model's own lm_head (draft_dim != backbone_dim).
(EngineCore pid=29812) INFO 05-07 19:55:58 [gemma4.py:330] Gemma4 MTP: draft layer 0 (sliding_attention) -> language_model.model.layers.58.self_attn.attn
(EngineCore pid=29812) INFO 05-07 19:55:58 [gemma4.py:330] Gemma4 MTP: draft layer 1 (sliding_attention) -> language_model.model.layers.58.self_attn.attn
(EngineCore pid=29812) INFO 05-07 19:55:58 [gemma4.py:330] Gemma4 MTP: draft layer 2 (sliding_attention) -> language_model.model.layers.58.self_attn.attn
(EngineCore pid=29812) INFO 05-07 19:55:58 [gemma4.py:330] Gemma4 MTP: draft layer 3 (full_attention) -> language_model.model.layers.59.self_attn.attn
(EngineCore pid=29812) INFO 05-07 19:55:59 [gpu_model_runner.py:4944] Model loading took 33.05 GiB memory and 43.758278 seconds
(EngineCore pid=29812) INFO 05-07 19:55:59 [gpu_model_runner.py:5905] Encoder cache will be initialized with a budget of 4096 tokens, and profiled with 1 video items of the maximum feature size.
(EngineCore pid=29812) WARNING 05-07 19:56:23 [op.py:290] Priority not set for op rms_norm, using native implementation.
(EngineCore pid=29812) INFO 05-07 19:57:18 [gpu_worker.py:460] Available KV cache memory: 9.01 GiB
(EngineCore pid=29812) INFO 05-07 19:57:18 [kv_cache_utils.py:1710] GPU KV cache size: 10,690 tokens
(EngineCore pid=29812) INFO 05-07 19:57:18 [kv_cache_utils.py:1711] Maximum concurrency for 4,096 tokens per request: 2.61x
(EngineCore pid=29812) INFO 05-07 19:57:20 [kernel_warmup.py:44] Skipping FlashInfer autotune because it is disabled.
(EngineCore pid=29812) INFO 05-07 19:57:23 [jit_monitor.py:54] Kernel JIT monitor activated — Triton JIT compilations during inference will be logged as warnings.
(EngineCore pid=29812) INFO 05-07 19:57:23 [core.py:306] init engine (profile, create kv cache, warmup model) took 84.31 s
(EngineCore pid=29812) WARNING 05-07 19:57:24 [vllm.py:900] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(EngineCore pid=29812) WARNING 05-07 19:57:24 [vllm.py:918] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(EngineCore pid=29812) INFO 05-07 19:57:24 [kernel.py:212] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['vllm_c', 'native'], fused_add_rms_norm=['vllm_c', 'native'])
(EngineCore pid=29812) INFO 05-07 19:57:24 [vllm.py:1093] Cudagraph is disabled under eager mode
(APIServer pid=29780) INFO 05-07 19:57:24 [api_server.py:613] Supported tasks: ['generate']
(APIServer pid=29780) WARNING 05-07 19:57:24 [model.py:1449] Default vLLM sampling parameters have been overridden by the model's `generation_config.json`: `{'temperature': 1.0, 'top_k': 64, 'top_p': 0.95}`. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=29780) INFO 05-07 19:57:28 [hf.py:483] Detected the chat template content format to be 'openai'. You can set `--chat-template-content-format` to override this.
(APIServer pid=29780) INFO 05-07 19:58:24 [base.py:224] Multi-modal warmup completed in 55.545s
(APIServer pid=29780) INFO 05-07 19:58:24 [base.py:224] Readonly multi-modal warmup completed in 0.170s
(APIServer pid=29780) INFO 05-07 19:58:24 [api_server.py:617] Starting vLLM server on http://0.0.0.0:8000
(APIServer pid=29780) INFO 05-07 19:58:24 [launcher.py:37] Available routes are:
(APIServer pid=29780) INFO 05-07 19:58:24 [launcher.py:46] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=29780) INFO 05-07 19:58:24 [launcher.py:46] Route: /docs, Methods: GET, HEAD
(APIServer pid=29780) INFO 05-07 19:58:24 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=29780) INFO 05-07 19:58:24 [launcher.py:46] Route: /redoc, Methods: GET, HEAD
(APIServer pid=29780) INFO 05-07 19:58:24 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=29780) INFO 05-07 19:58:24 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=29780) INFO 05-07 19:58:24 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=29780) INFO 05-07 19:58:24 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=29780) INFO 05-07 19:58:24 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=29780) INFO 05-07 19:58:24 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=29780) INFO 05-07 19:58:24 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=29780) INFO 05-07 19:58:24 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=29780) INFO 05-07 19:58:24 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=29780) INFO 05-07 19:58:24 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=29780) INFO 05-07 19:58:24 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=29780) INFO 05-07 19:58:24 [launcher.py:46] Route: /v1/chat/completions/batch, Methods: POST
(APIServer pid=29780) INFO 05-07 19:58:24 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=29780) INFO 05-07 19:58:24 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=29780) INFO 05-07 19:58:24 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=29780) INFO 05-07 19:58:24 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=29780) INFO 05-07 19:58:24 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=29780) INFO 05-07 19:58:24 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=29780) INFO 05-07 19:58:24 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=29780) INFO 05-07 19:58:24 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=29780) INFO 05-07 19:58:24 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=29780) INFO 05-07 19:58:24 [launcher.py:46] Route: /generative_scoring, Methods: POST
(APIServer pid=29780) INFO 05-07 19:58:24 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=29780) INFO 05-07 19:58:24 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=29780) INFO: Started server process [29780]
(APIServer pid=29780) INFO: Waiting for application startup.
(APIServer pid=29780) INFO: Application startup complete.
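In the log above, startup takes close to four minutes (19:54:39 to 19:58:24), so a script that sends requests right after launching the server would fail. A sketch of a readiness loop against the /health route listed in the log; wait_for_health and its retry parameters are hypothetical.

```shell
# Hypothetical helper: poll URL $1 up to $2 times, sleeping $3 seconds between
# attempts. Returns 0 as soon as the endpoint answers with a success status,
# and 1 if every attempt fails.
wait_for_health() {
    url=$1
    attempts=$2
    delay=$3
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if curl -fsS -o /dev/null "$url"; then
            return 0
        fi
        i=$(( i + 1 ))
        sleep "$delay"
    done
    return 1
}

# Usage (commented out; needs the server from the transcript to be starting up):
#   wait_for_health http://localhost:8000/health 60 5 && echo "server is ready"
```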
Run vLLM (memory adjusted)
(.venv) bluesanta@ubuntu:~/llm$ vllm serve /home/bluesanta/llm/models/gemma-4-31B-it-FP8-block --tensor-parallel-size 1 --max-model-len 2048 --max-num-batched-tokens 4096 --gpu-memory-utilization 0.8 --quantization compressed-tensors --dtype bfloat16 --enforce-eager --speculative-config '{"model": "/home/bluesanta/llm/models/gemma-4-31B-it-assistant", "num_speculative_tokens": 5}'
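For reference, the first run's log reported a GPU KV cache of 10,690 tokens and a maximum concurrency of 2.61x for 4,096-token requests, which matches a plain division of the two numbers. The sketch below reproduces that figure; kv_concurrency is a hypothetical helper. The KV cache size under the adjusted settings is not shown in these notes and would have to be read from the new startup log.

```shell
# Hypothetical helper: divide the KV cache size in tokens ($1) by the
# per-request token length ($2) and print the ratio with two decimals.
kv_concurrency() {
    awk -v cache="$1" -v len="$2" 'BEGIN { printf "%.2f\n", cache / len }'
}

# Figures taken from the first run's log (10,690 tokens, 4,096 per request).
kv_concurrency 10690 4096    # prints 2.61
```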
Run vLLM (model gemma-4-26B-A4B-it-FP8-Dynamic): optimized run
(.venv) bluesanta@ubuntu:~/llm$ vllm serve /home/bluesanta/llm/models/gemma-4-26B-A4B-it-FP8-Dynamic --tensor-parallel-size 1 --max-model-len 16384 --max-num-batched-tokens 16384 --gpu-memory-utilization 0.8 --quantization compressed-tensors --dtype bfloat16 --enforce-eager --speculative-config '{"model": "/home/bluesanta/llm/models/gemma-4-26B-A4B-it-assistant", "num_speculative_tokens": 5}'
Test
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "/home/bluesanta/llm/models/gemma-4-31B-it-FP8-block",
"messages": [{"role": "user", "content": "한글은 누가 만들었어?"}],
"max_tokens": 100
}'
bluesanta@ubuntu:~/llm$ curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "/home/bluesanta/llm/models/gemma-4-31B-it-FP8-block",
"messages": [{"role": "user", "content": "한글은 누가 만들었어?"}],
"max_tokens": 100
}'
{"id":"chatcmpl-8c7ff513de335bf2","object":"chat.completion","created":1778151721,"model":"/home/bluesanta/llm/models/gemma-4-31B-it-FP8-block","choices":[{"index":0,"message":{"role":"assistant","content":"한글은 조선 시대의 제4대 왕인 **세종대왕**이 만들었습니다.\n\n세종대왕은 백성들이 한자를 배우기 어려워 자신의 생각을 글로 표현하지 못하는 것을 안타깝게 여겨, 누구나 쉽게 배우고 쓸 수 있는 우리만의 글자인 **'훈민정음(訓民正音)'**을 창제하셨습니다.\n\n1443년에 완성되어 1446년에 세상에 반","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":"vllm-0.20.2rc1.dev93+g51f22dcfd-887d36ad","usage":{"prompt_tokens":21,"total_tokens":121,"completion_tokens":100,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
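The raw JSON is hard to read, and it is easy to miss that finish_reason is "length": the answer was cut off at the 100-token max_tokens limit. A sketch that prints only the assistant text and the finish reason, assuming python3 is on the PATH; summarize_chat_response is a hypothetical helper.

```shell
# Hypothetical helper: read a chat-completion JSON response on stdin and print
# the assistant text followed by the finish reason (Python standard library only).
summarize_chat_response() {
    python3 -c '
import json, sys
resp = json.load(sys.stdin)
choice = resp["choices"][0]
print(choice["message"]["content"])
print("finish_reason:", choice["finish_reason"])
'
}

# Usage (commented out; needs the running server, request body as in the test above):
#   curl -s http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d @request.json | summarize_chat_response
```

Here request.json is a placeholder file name for the request body. Raising max_tokens in that body should let the answer finish on its own.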