
Create and activate a virtual environment

bluesanta@localhost:~$ mkdir llm
bluesanta@bluesanta-B550M-Pro-RS:~$ cd Application/stable_diffusion/
bluesanta@bluesanta-B550M-Pro-RS:~/Application/stable_diffusion$ python3 -m venv .venv
bluesanta@bluesanta-B550M-Pro-RS:~/Application/stable_diffusion$ source .venv/bin/activate
(.venv) bluesanta@bluesanta-B550M-Pro-RS:~/Application/stable_diffusion$

Install PyTorch and dependent packages

(.venv) bluesanta@bluesanta-B550M-Pro-RS:~/Application/stable_diffusion$ pip install --upgrade pip
(.venv) bluesanta@bluesanta-B550M-Pro-RS:~/Application/stable_diffusion$ pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130 --no-cache-dir
(.venv) bluesanta@bluesanta-B550M-Pro-RS:~/Application/stable_diffusion$ pip install --upgrade xformers --no-cache-dir

Verify the installation

(.venv) bluesanta@bluesanta-B550M-Pro-RS:~/Application/stable_diffusion$ python -c "import torch; import xformers; print('CUDA available:', torch.cuda.is_available()); print('GPU name:', torch.cuda.get_device_name(0)); print('xFormers version:', xformers.__version__)"
CUDA available: True
GPU name: NVIDIA GeForce RTX 4090
xFormers version: 0.0.29.post3
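The one-liner above fails with an ImportError if either package is missing. As a rough sketch (the helper name `package_status` is made up for illustration), the same check can be done without importing the heavy modules, so it also reports which package is absent:

```python
import importlib.util
from importlib import metadata


def package_status(names):
    """Return {name: version, "unknown", or None} for each top-level package."""
    status = {}
    for name in names:
        if importlib.util.find_spec(name) is None:
            status[name] = None  # not importable in this environment
            continue
        try:
            status[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            status[name] = "unknown"  # importable, but no distribution metadata
    return status


if __name__ == "__main__":
    for name, version in package_status(["torch", "xformers"]).items():
        print(f"{name}: {version if version else 'NOT INSTALLED'}")
```

Run it with the virtual environment active; it only inspects installed packages and does not touch the GPU, so the CUDA check above is still needed.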

Install the remaining packages ComfyUI requires

(.venv) bluesanta@bluesanta-B550M-Pro-RS:~/Application/stable_diffusion$ cd ComfyUI/
(.venv) bluesanta@bluesanta-B550M-Pro-RS:~/Application/stable_diffusion/ComfyUI$ pip install -r requirements.txt
(.venv) bluesanta@bluesanta-B550M-Pro-RS:~/Application/stable_diffusion/ComfyUI$ pip install gguf lm-eval
(.venv) bluesanta@bluesanta-B550M-Pro-RS:~/Application/stable_diffusion/ComfyUI$ pip install librosa omegaconf piexif ultralytics aiofiles facexlib lpips fal-client runwayml blend_modes loguru segment-anything dynamicprompts wget ftfy hydra-core iopath pydantic-settings google-genai sounddevice reportlab timm yacs py3langid gdown opencv-contrib-python toml deepdiff surrealist dashscope numexpr ollama easydict boto3 google-generativeai redis google-cloud-storage PyPDF2 replicate pymupdf pypinyin addict albumentations glitch-this hangul-romanize yapf scipy

Install packages for face analysis and restoration (ReActor, etc.)

(.venv) bluesanta@bluesanta-B550M-Pro-RS:~/Application/stable_diffusion/ComfyUI$ sudo apt update && sudo apt install cmake g++ -y
(.venv) bluesanta@bluesanta-B550M-Pro-RS:~/Application/stable_diffusion/ComfyUI$ pip install insightface

Fix GIMM-VFI and RMBG node errors

CuPy & ONNX Runtime: install pinned, stable versions that support CUDA 12

(.venv) bluesanta@bluesanta-B550M-Pro-RS:~/Application/stable_diffusion/ComfyUI$ pip install cupy-cuda12x
(.venv) bluesanta@bluesanta-B550M-Pro-RS:~/Application/stable_diffusion/ComfyUI$ pip uninstall onnxruntime onnxruntime-gpu -y
(.venv) bluesanta@bluesanta-B550M-Pro-RS:~/Application/stable_diffusion/ComfyUI$ pip install onnxruntime-gpu==1.19.0 --extra-index-url https://pypi.org/simple

Upgrade transformers to fix Kosmos2 VLM node errors

(.venv) bluesanta@bluesanta-B550M-Pro-RS:~/Application/stable_diffusion/ComfyUI$ pip install --upgrade transformers

Install sam2

(.venv) bluesanta@bluesanta-B550M-Pro-RS:~/Application/stable_diffusion/ComfyUI$ pip install git+https://github.com/facebookresearch/segment-anything-2.git

Run ComfyUI

(.venv) bluesanta@bluesanta-B550M-Pro-RS:~/Application/stable_diffusion/ComfyUI$ python main.py --listen 0.0.0.0 --novram

Install openclaude

C:\Users\bluesanta>npm install -g @gitlawb/openclaude

Install the latest version of openclaude

C:\Users\bluesanta>npm install -g @gitlawb/openclaude@latest

Check the installed version

C:\Users\bluesanta>openclaude --version
0.13.0 (OpenClaude)

Environment settings

C:\test>set CLAUDE_CODE_USE_OPENAI=1
C:\test>set OPENAI_API_KEY=sk-jetson-qwen36
C:\test>set OPENAI_BASE_URL=http://192.168.0.235:8000/v1
C:\test>set OPENAI_MODEL=Qwen3.6-35B-A3B-Q5_K_M.gguf
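The same four variables are needed on every machine that runs openclaude, and `set` only applies to the current cmd session. The sketch below (the helper `render_env` is hypothetical, not part of openclaude) prints the lines for either Windows cmd or a POSIX shell from one dictionary, using the values shown above:

```python
import shlex

# Values copied from the steps above; adjust host, key and model as needed.
OPENCLAUDE_ENV = {
    "CLAUDE_CODE_USE_OPENAI": "1",
    "OPENAI_API_KEY": "sk-jetson-qwen36",
    "OPENAI_BASE_URL": "http://192.168.0.235:8000/v1",
    "OPENAI_MODEL": "Qwen3.6-35B-A3B-Q5_K_M.gguf",
}


def render_env(env, shell):
    """Return one assignment line per variable for the given shell."""
    if shell == "cmd":
        # cmd takes everything after '=' literally, so no quoting is added.
        return [f"set {key}={value}" for key, value in env.items()]
    if shell == "bash":
        # shlex.quote adds single quotes only when the value needs them.
        return [f"export {key}={shlex.quote(value)}" for key, value in env.items()]
    raise ValueError(f"unsupported shell: {shell}")


if __name__ == "__main__":
    for line in render_env(OPENCLAUDE_ENV, "cmd"):
        print(line)
```

Passing `"bash"` instead yields `export` lines, with quoting added for values that contain spaces.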

Run

C:\test>openclaude
Warning: ignoring saved provider profile. Codex auth is required for codexplan. Set CODEX_API_KEY, choose Codex OAuth in /provider or put auth.json at C:\Users\bluesanta\.codex\auth.json.
 
  ████████╗ ████████╗ ████████╗ ██╗  ██╗
  ██╔═══██║ ██╔═══██║ ██╔═════╝ ███╗ ██║
  ██║   ██║ ████████║ ██████╗   ████╗██║
  ██║   ██║ ██╔═════╝ ██╔═══╝   ██╔████║
  ████████║ ██║       ████████╗ ██║ ╚███║
  ╚═══════╝ ╚═╝       ╚═══════╝ ╚═╝  ╚══╝
 
  ████████╗ ██╗      ████████╗ ██╗   ██╗ ███████╗  ████████╗
  ██╔═════╝ ██║      ██╔═══██║ ██║   ██║ ██╔═══██╗ ██╔═════╝
  ██║       ██║      ████████║ ██║   ██║ ██║   ██║ ██████╗
  ██║       ██║      ██╔═══██║ ██║   ██║ ██║   ██║ ██╔═══╝
  ████████╗ ████████╗██║   ██║ ╚██████╔╝ ███████╔╝ ████████╗
  ╚═══════╝ ╚═══════╝╚═╝   ╚═╝  ╚═════╝  ╚══════╝  ╚═══════╝
 
  ✦ Any model. Every tool. Zero limits. ✦
 
╔════════════════════════════════════════════════════════════╗
│ Provider  OpenAI                                           │
│ Model     Qwen3.6-35B-A3B-Q5_K_M.gguf                      │
│ Endpoint  http://192.168.0.235:8000/v1                     │
╠════════════════════════════════════════════════════════════╣
│ ● cloud    Ready — type /help to begin                     │
╚════════════════════════════════════════════════════════════╝
  openclaude v0.13.0
 
 
─────────────────────────────────────────────────────────────────────────────────────────────
> 
─────────────────────────────────────────────────────────────────────────────────────────────
  ? for shortcuts

Install bun

PS C:\WINDOWS\System32> powershell -c "irm bun.sh/install.ps1 | iex"

Download the openclaude source

C:\test>git clone https://github.com/Gitlawb/openclaude.git

Move to the openclaude-vscode directory

C:\test>cd openclaude
C:\test\openclaude>cd vscode-extension
C:\test\openclaude\vscode-extension>cd openclaude-vscode

Build

C:\test\openclaude\vscode-extension\openclaude-vscode>bun add -d @vscode/vsce
bun add v1.3.14 (0d9b296a)
 
installed @vscode/vsce@3.9.1 with binaries:
 - vsce
 
293 packages installed [50.82s]
 
Blocked 1 postinstall. Run `bun pm untrusted` for details.
 
C:\test\openclaude\vscode-extension\openclaude-vscode>bunx vsce package
 WARNING  LICENSE, LICENSE.md, or LICENSE.txt not found
Do you want to continue? [y/N] y
 INFO  Files included in the VSIX:
openclaude-vscode-0.2.0.vsix
├─ [Content_Types].xml
├─ extension.vsixmanifest
└─ extension/
   ├─ package.json [4.99 KB]
   ├─ readme.md [2.34 KB]
   ├─ media/
   │  └─ openclaude.svg [0.43 KB]
   ├─ src/
   │  ├─ extension.js [38.7 KB]
   │  ├─ presentation.js [5.92 KB]
   │  ├─ state.js [10.93 KB]
   │  └─ chat/
   │     ├─ chatProvider.js [21.92 KB]
   │     ├─ chatRenderer.js [47.16 KB]
   │     ├─ diffController.js [2.66 KB]
   │     ├─ messageParser.js [4.5 KB]
   │     ├─ processManager.js [5.7 KB]
   │     ├─ protocol.js [4.64 KB]
   │     └─ sessionManager.js [8.61 KB]
   └─ themes/
      └─ OpenClaude-Terminal-Black.json [2.76 KB]
 
 DONE  Packaged: C:\test\openclaude\vscode-extension\openclaude-vscode\openclaude-vscode-0.2.0.vsix (16 files, 42.77 KB)

 


havenoammo/Qwen3.6-35B-A3B-MTP-GGUF

Download the llama.cpp source

(.venv) bluesanta@ubuntu:~/llm$ git clone https://github.com/ggml-org/llama.cpp.git
(.venv) bluesanta@ubuntu:~/llm$ cd llama.cpp

Fetch the latest remote changes

(.venv) bluesanta@ubuntu:~/llm/llama.cpp$ git fetch origin
From https://github.com/ggml-org/llama.cpp
 * [new tag]             b9151      -> b9151

Fetch PR #22673 into a local branch

PR #22673 ("llama + spec: MTP Support") adds a speculative decoding feature.

(.venv) bluesanta@ubuntu:~/llm/llama.cpp$ git fetch origin pull/22673/head:pr-22673
From https://github.com/ggml-org/llama.cpp
 * [new tag]             b9156      -> b9156

Check out master and reset to the chosen upstream commit

(.venv) bluesanta@ubuntu:~/llm/llama.cpp$ git checkout master
Already on 'master'
Your branch is up to date with 'origin/master'.
(.venv) bluesanta@ubuntu:~/llm/llama.cpp$ git reset --hard 856c3adac
HEAD is now at 856c3adac hexagon: eliminate scalar VTCM loads via HVX splat helpers (#22993)

Merge the PR on top (non-fast-forward)

(.venv) bluesanta@ubuntu:~/llm/llama.cpp$ git merge --no-ff pr-22673 -m "Merge [PR #22673](https://github.com/ggml-org/llama.cpp/pull/22673): llama + spec: MTP Support"
Merge made by the 'ort' strategy.
 .devops/intel.Dockerfile                                                     |    20 +-
 .editorconfig                                                                |     8 -
 .gitattributes                                                               |     4 -
 
 delete mode 100644 tools/server/public/index.html
 delete mode 100644 tools/server/public/loading.html

Build llama-server

(.venv) bluesanta@ubuntu:~/llm/llama.cpp$ cmake -B build -DGGML_CUDA=ON
bluesanta@ubuntu:~/llm/llama.cpp$ cmake --build build --config Release -j$(nproc)

Install llama-server

bluesanta@ubuntu:~/llm/llama.cpp$ sudo cmake --install build

Check the llama-server version

bluesanta@ubuntu:~/llm/llama.cpp$ ./build/bin/llama-server --version
version: 9173 (0672285b2)
built with GNU 11.4.0 for Linux aarch64

unsloth/Qwen3.6-35B-A3B-MTP-GGUF

Download the source

bluesanta@ubuntu:~/llm$ git clone https://github.com/ggml-org/llama.cpp.git llama.cpp-22673-mtp
bluesanta@ubuntu:~/llm$ cd llama.cpp-22673-mtp
bluesanta@ubuntu:~/llm/llama.cpp-22673-mtp$ git fetch origin pull/22673/head:pr-22673-mtp
remote: Enumerating objects: 158, done.
remote: Counting objects: 100% (128/128), done.
remote: Compressing objects: 100% (9/9), done.
remote: Total 158 (delta 119), reused 119 (delta 119), pack-reused 30 (from 2)
Receiving objects: 100% (158/158), 158.55 KiB | 12.20 MiB/s, done.
Resolving deltas: 100% (124/124), completed with 32 local objects.
From https://github.com/ggml-org/llama.cpp
 * [new ref]             refs/pull/22673/head -> pr-22673-mtp
bluesanta@ubuntu:~/llm/llama.cpp-22673-mtp$ git checkout pr-22673-mtp
Switched to branch 'pr-22673-mtp'
bluesanta@ubuntu:~/llm/llama.cpp-22673-mtp$ git status
On branch pr-22673-mtp
nothing to commit, working tree clean
bluesanta@ubuntu:~/llm/llama.cpp-22673-mtp$ git merge pr-22673-mtp -m "Merge MTP support from PR #22673"

Build

bluesanta@ubuntu:~/llm/llama.cpp-22673-mtp$ cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
bluesanta@ubuntu:~/llm/llama.cpp-22673-mtp$ cmake --build build -j$(nproc)

Install llama-server

bluesanta@ubuntu:~/llm/llama.cpp-22673-mtp$ sudo cmake --install build

Check the llama-server version

bluesanta@ubuntu:~/llm/llama.cpp-22673-mtp$ ./build/bin/llama-server --version
version: 9173 (0672285b2)
built with GNU 11.4.0 for Linux aarch64

Download the source

(.venv) bluesanta@gx10-3b16:~/llm$ git clone https://github.com/antirez/ds4.git
(.venv) bluesanta@gx10-3b16:~/llm$ cd ds4/

Download the model

(.venv) bluesanta@gx10-3b16:~/llm/ds4$ ./download_model.sh q2-imatrix 
Downloading DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf
from https://huggingface.co/antirez/deepseek-v4-gguf
If the download stops, run the same command again to resume it.
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1479  100  1479    0     0    279      0  0:00:05  0:00:05 --:--:--   344
100 80.7G  100 80.7G    0     0  51.7M      0  0:26:37  0:26:37 --:--:-- 51.4M
Linked ./ds4flash.gguf -> /home/bluesanta/llm/ds4/gguf/DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf
 
Done.

Build

(.venv) bluesanta@gx10-3b16:~/llm/ds4$ make cuda-spark

Run

bluesanta@gx10-3b16:~/llm/ds4$ ./ds4 -p "모델 이름 알려죠"
ds4: context buffers 751.71 MiB (ctx=32768, backend=cuda, prefill_chunk=2048, raw_kv_rows=2304, compressed_kv_rows=8194)
ds4: CUDA backend initialized on NVIDIA GB10 (sm_121)
ds4: CUDA registered 80.76 GiB model mapping for device access

ds4: CUDA startup model cache prepared 80.76 GiB of tensor spans in 0.000s
ds4: cuda backend initialized for graph diagnostics
We need to answer the user's query. The user asked "모델 이름 알려죠" which is Korean for "Tell me the model name" or "What's your model name?" So we need to respond with the model name. The assistant should state its name. Typically, the assistant might say something like "저는 DeepSeek입니다." But need to check the context. The user didn't specify which model. Probably the assistant is a DeepSeek model. So answer accordingly.
저는 DeepSeek 모델입니다. 도움이 필요하시면 언제든지 물어보세요! 😊
ds4: prefill: 9.25 t/s, generation: 4.39 t/s

Create the service file

bluesanta@gx10-3b16:~/llm/ds4$ sudo vi /etc/systemd/system/ds4-server.service
[Unit]
Description=DS4 LLM Server
After=network.target

[Service]
Type=simple
User=bluesanta
WorkingDirectory=/home/bluesanta/llm/ds4
ExecStart=/home/bluesanta/llm/ds4/ds4-server --host 0.0.0.0 --ctx 100000 --kv-disk-dir /tmp/ds4-kv --kv-disk-space-mb 8192
Restart=on-failure
RestartSec=10
Environment="PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"

# Logging settings
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target
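Before enabling the unit it can help to sanity-check the file. The following is only a rough sketch: it parses the unit text with Python's `configparser` (an approximation of the systemd format, fine for simple files like this one) and flags a few common mistakes. The function name `lint_unit` and the checks it performs are illustrative choices; `systemd-analyze verify` remains the authoritative tool.

```python
import configparser

UNIT_TEXT = """\
[Unit]
Description=DS4 LLM Server
After=network.target

[Service]
Type=simple
User=bluesanta
WorkingDirectory=/home/bluesanta/llm/ds4
ExecStart=/home/bluesanta/llm/ds4/ds4-server --host 0.0.0.0 --ctx 100000 --kv-disk-dir /tmp/ds4-kv --kv-disk-space-mb 8192
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target
"""


def lint_unit(text):
    """Return a list of human-readable problems found in a simple unit file."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.optionxform = str  # systemd keys are case-sensitive
    parser.read_string(text)

    problems = []
    for section in ("Unit", "Service", "Install"):
        if not parser.has_section(section):
            problems.append(f"missing [{section}] section")
    # This sketch expects a plain absolute path and ignores prefixes such as '-'.
    if not parser.get("Service", "ExecStart", fallback="").startswith("/"):
        problems.append("ExecStart should begin with an absolute path")
    if not parser.get("Install", "WantedBy", fallback=""):
        problems.append("no WantedBy, so 'systemctl enable' has nothing to link")
    return problems


if __name__ == "__main__":
    print(lint_unit(UNIT_TEXT) or "no problems found")
```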

Enable the service

bluesanta@gx10-3b16:~/llm/ds4$ sudo systemctl enable ds4-server
Created symlink /etc/systemd/system/multi-user.target.wants/ds4-server.service → /etc/systemd/system/ds4-server.service.

Start the service

bluesanta@gx10-3b16:~/llm/ds4$ sudo systemctl start ds4-server

Check the service status

bluesanta@gx10-3b16:~/llm/ds4$ sudo systemctl status ds4-server
● ds4-server.service - DS4 LLM Server
     Loaded: loaded (/etc/systemd/system/ds4-server.service; enabled; preset: enabled)
     Active: active (running) since Thu 2026-05-21 23:15:40 KST; 19s ago
   Main PID: 953632 (ds4-server)
      Tasks: 3 (limit: 153548)
     Memory: 634.1M (peak: 634.1M)
        CPU: 5.399s
     CGroup: /system.slice/ds4-server.service
             └─953632 /home/bluesanta/llm/ds4/ds4-server --host 0.0.0.0 --ctx 100000 --kv-disk-dir /tmp/ds4-kv --kv-disk-sp>
 
 5월 21 23:15:40 gx10-3b16 ds4-server[953632]: ds4: CUDA host registration skipped: operation not supported
 5월 21 23:15:41 gx10-3b16 ds4-server[953632]: ds4: CUDA loading model tensors into device cache
 5월 21 23:15:44 gx10-3b16 ds4-server[953632]: ds4: CUDA loading model tensors 16.02 GiB cached
 5월 21 23:15:48 gx10-3b16 ds4-server[953632]: ds4: CUDA loading model tensors 32.06 GiB cached
 5월 21 23:15:52 gx10-3b16 ds4-server[953632]: ds4: CUDA loading model tensors 48.02 GiB cached
 5월 21 23:15:55 gx10-3b16 ds4-server[953632]: ds4: CUDA loading model tensors 64.06 GiB cached
 5월 21 23:15:59 gx10-3b16 ds4-server[953632]: ds4: CUDA loading model tensors 80.04 GiB cached
 5월 21 23:15:59 gx10-3b16 ds4-server[953632]: ds4: CUDA startup model cache prepared 80.76 GiB of tensor spans in 19.009s
 5월 21 23:15:59 gx10-3b16 ds4-server[953632]: ds4: cuda backend initialized for graph diagnostics
 5월 21 23:15:59 gx10-3b16 ds4-server[953632]: 0521 23:15:59 ds4-server: context buffers 1896.58 MiB (ctx=100000, backend=c>

Check the service logs

bluesanta@gx10-3b16:~/llm/ds4$ sudo journalctl -u ds4-server -f

Verify

curl http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model":"deepseek-v4-flash",
    "messages":[{"role":"user","content":"List three Redis design principles."}],
    "stream":true
  }'
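The curl call above can also be issued from Python using only the standard library. This is a minimal sketch that assumes the server returns the usual OpenAI-style chat completion JSON (`choices[0].message.content`); the function names are illustrative, and streaming is turned off to keep the parsing simple.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt):
    """Build (but do not send) a POST request for /chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # the curl example streams; a single JSON reply is easier to parse
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(base_url, model, prompt, timeout=300):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Assumed OpenAI-compatible response shape.
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    print(ask("http://127.0.0.1:8000/v1", "deepseek-v4-flash",
              "List three Redis design principles."))
```

The generous timeout reflects the slow prefill and generation speeds reported earlier for this model.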

openclaude settings

export CLAUDE_CODE_USE_OPENAI=1
export OPENAI_API_KEY=deepseek-v4-flash
export OPENAI_BASE_URL=http://192.168.0.240:8000/v1
export OPENAI_MODEL="DeepSeek V4 Flash"

Check the JetPack version

bluesanta@ubuntu:~$ sudo apt show nvidia-jetpack
Package: nvidia-jetpack
Version: 6.2.2+b24
Priority: standard
Section: metapackages
Source: nvidia-jetpack (6.2.2)
Maintainer: NVIDIA Corporation
Installed-Size: 199 kB
Depends: nvidia-jetpack-runtime (= 6.2.2+b24), nvidia-jetpack-dev (= 6.2.2+b24)
Homepage: http://developer.nvidia.com/jetson
Download-Size: 29.3 kB
APT-Sources: https://repo.download.nvidia.com/jetson/common r36.5/main arm64 Packages
Description: NVIDIA Jetpack Meta Package

Check the CUDA version

bluesanta@ubuntu:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Aug_14_10:14:07_PDT_2024
Cuda compilation tools, release 12.6, V12.6.68
Build cuda_12.6.r12.6/compiler.34714021_0

Create a virtual environment

bluesanta@bluesanta-desktop:~$ cd llm
bluesanta@bluesanta-desktop:~/llm$ python -m venv .venv
bluesanta@bluesanta-desktop:~/llm$ source .venv/bin/activate
(.venv) bluesanta@bluesanta-desktop:~/llm$ 

Download the model (Hugging Face)

(.venv) bluesanta@ubuntu:~/llm$ pip install -U "huggingface_hub[cli]"
(.venv) bluesanta@ubuntu:~/llm$ hf download Qwen/Qwen3.6-35B-A3B --local-dir ~/llm/models/Qwen3.6-35B-A3B

Convert the original model to GGUF (FP16)

(.venv) bluesanta@gx10-3b16:~/llm/llama.cpp$ python convert_hf_to_gguf.py ~/llm/models/Qwen3.6-35B-A3B --outfile ~/llm/models/Qwen3.6-35B-A3B-F16.gguf --outtype f16
 
INFO:gguf.gguf_writer:Writing the following files:
INFO:gguf.gguf_writer:/home/bluesanta/llm/models/Qwen3.6-35B-A3B-F16.gguf: n_tensors = 733, total_size = 69.4G
Writing: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 69.4G/69.4G [01:36<00:00, 718Mbyte/s]
INFO:hf-to-gguf:Model successfully exported to /home/bluesanta/llm/models/Qwen3.6-35B-A3B-F16.gguf

Quantize (Q5_K_M)

(.venv) bluesanta@gx10-3b16:~/llm/llama.cpp$ llama-quantize --leave-output-tensor ~/llm/models/Qwen3.6-35B-A3B-F16.gguf ~/llm/models/Qwen3.6-35B-A3B-Q5_K_M.gguf Q5_K_M
 
[ 731/ 733] blk.39.ffn_up_exps.weight            - [  2048,    512,    256,      1], type =    f16, converting to q5_K .. size =   512.00 MiB ->   176.00 MiB
[ 732/ 733] blk.39.ffn_up_shexp.weight           - [  2048,    512,      1,      1], type =    f16, converting to q5_K .. size =     2.00 MiB ->     0.69 MiB
[ 733/ 733] blk.39.post_attention_norm.weight    - [  2048,      1,      1,      1], type =    f32, size =    0.008 MiB
llama_model_quantize_impl: model size  = 66152.24 MiB (16.01 BPW)
llama_model_quantize_impl: quant size  = 24145.21 MiB (5.84 BPW)
 
main: quantize time = 473736.51 ms
main:    total time = 473736.51 ms
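The two summary lines in the log are consistent with each other: bits per weight (BPW) scales with file size, so the quantized BPW can be estimated from the size ratio. A small worked check using only the numbers printed above (the function name is illustrative):

```python
def quantized_bpw(original_mib, quantized_mib, original_bpw):
    """Estimate bits per weight after quantization from the size ratio."""
    return original_bpw * quantized_mib / original_mib


ORIGINAL_MIB = 66152.24   # "model size" from the log (F16)
QUANTIZED_MIB = 24145.21  # "quant size" from the log (Q5_K_M)

ratio = QUANTIZED_MIB / ORIGINAL_MIB
bpw = quantized_bpw(ORIGINAL_MIB, QUANTIZED_MIB, 16.01)

print(f"size ratio   : {ratio:.3f}")  # roughly 0.365 of the F16 size
print(f"estimated BPW: {bpw:.2f}")    # agrees with the 5.84 BPW in the log
```

The result sits above a nominal 5 bits because not every tensor is stored as q5_K; for example, the last line of the tensor listing shows a norm weight left in f32.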

Generate the multimodal projector (mmproj) file

(.venv) bluesanta@gx10-3b16:~/llm/llama.cpp$ python convert_hf_to_gguf.py ~/llm/models/Qwen3.6-35B-A3B --outfile  ~/llm/models/Qwen3.6-35B-A3B-mmproj-f16.gguf --outtype f16 --mmproj
INFO:hf-to-gguf:Loading model: Qwen3.6-35B-A3B
INFO:hf-to-gguf:Model architecture: Qwen3_5MoeForConditionalGeneration
 
INFO:gguf.gguf_writer:/home/bluesanta/llm/models/Qwen3.6-35B-A3B-mmproj-f16.gguf: n_tensors = 334, total_size = 899.3M
Writing: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 899M/899M [00:00<00:00, 1.48Gbyte/s]
INFO:hf-to-gguf:Model successfully exported to /home/bluesanta/llm/models/Qwen3.6-35B-A3B-mmproj-f16.gguf

Verify with llama.cpp

(.venv) bluesanta@ubuntu:~/llm$ llama-cli -m ~/llm/models/Qwen3.6-35B-A3B-Q5_K_M.gguf  -n 512 --color on -ngl 99 -p "당신은 유능한 AI 어시스턴트입니다. Jetson AGX Orin에 대해 설명해주세요."
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 62827 MiB):
  Device 0: Orin, compute capability 8.7, VMM: yes, VRAM: 62827 MiB
 
Loading model...  
 
 
▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀
 
build      : b9123-927dada6c
model      : Qwen3.6-35B-A3B-Q5_K_M.gguf
modalities : text
 
available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read <file>        add a text file
  /glob <pattern>     add text files using globbing pattern
 
 
> 당신은 유능한 AI 어시스턴트입니다. Jetson AGX Orin에 대해 설명해주세요.
 
[Start thinking]
Here's a thinking process:
 
1.  **Analyze User Input:**
   - **User:** "당신은 유능한 AI 어시스턴트입니다. Jetson AGX Orin 에 대해 설명해주세요."
   - **Language:** Korean
   - **Topic:** Jetson AGX Orin (NVIDIA's edge AI computing platform)
   - **Request:** Explain/describe it
 
2.  **Identify Key Information Needed:**
   - What is it? (Overview/Definition)
   - Key Specifications (CPU, GPU, Memory, AI Performance, I/O, Power)
   - Target Applications/Use Cases
   - Key Features/Advantages
   - Ecosystem/Software Support
   - Comparison/Context (if relevant, but keep it concise)
   - Structure the response logically in Korean
 
3.  **Gather Facts (Internal Knowledge up to 2024):**
   - *Manufacturer:* NVIDIA
   - *Series:* Jetson family (edge AI)
   - *Architecture:* Ampere GPU architecture, Carmel CPU cores
   - *AI Performance:* Up to 275 TOPS (INT8)
   - *GPU:* 1024-core NVIDIA Ampere architecture GPU with 32 Tensor cores
   - *CPU:* 12-core ARM64 CPU (8 high-performance + 4 high-efficiency)
   - *Memory:* 64GB LPDDR5 (102.4 GB/s bandwidth)
   - *Storage/Connectivity:* PCIe 4.0, NVMe SSD support, USB 3.2, Ethernet, MIPI CSI-2, etc.
   - *Power:* 60W max (configurable), TDP 60W
   - *Use Cases:* Autonomous mobile robots, smart cameras, medical devices, industrial automation, drones, etc.
   - *Software:* JetPack SDK, CUDA, cuDNN, TensorRT, DeepStream, Isaac ROS, Docker support, Linux-based
   - *Form Factors:* Developer Kit, Industrial Module (for OEMs), Edge Cloud Gateway
   - *Successor/Context:* Follows Xavier AGX, successor to Jetson AGX Xavier, part of NVIDIA's edge AI lineup
 
4.  **Structure the Response (in Korean):**
   - 소개 (Overview)
   - 주요 스펙 (Key Specs
 
[ Prompt: 111.5 t/s | Generation: 30.9 t/s ]
 
> 

Run the API server

(.venv) bluesanta@bluesanta-desktop:~/llm/llama.cpp$ llama-server -m ~/llm/models/Qwen3.6-35B-A3B-Q4_K_M.gguf --host 0.0.0.0 --port 8080 -ngl 99 -c 32768 -np 1
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 62840 MiB):
  Device 0: Orin, compute capability 8.7, VMM: yes, VRAM: 62840 MiB
build_info: b8851-e365e658f
system_info: n_threads = 12 (n_threads_batch = 12) / 12 | CUDA : ARCHS = 500,610,700,750,800,860,890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | DOTPROD = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
init: using 11 threads for HTTP server
start: binding port with default address family
 
srv    load_model: loading model '/home/bluesanta/llm/models/Qwen3.6-35B-A3B-Q4_K_M.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
llama_params_fit_impl: projected to use 21098 MiB of device memory vs. 52336 MiB of free device memory
llama_params_fit_impl: will leave 31238 >= 1024 MiB of free device memory, no changes needed
llama_params_fit: successfully fit params to free device memory
llama_params_fit: fitting params to free memory took 1.10 seconds
llama_model_load_from_file_impl: using device CUDA0 (Orin) (0000:00:00.0) - 52362 MiB free
llama_model_loader: loaded meta data with 44 key-value pairs and 733 tensors from /home/bluesanta/llm/models/Qwen3.6-35B-A3B-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
 
init: chat template, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant

'
srv          init: init: chat template, thinking = 1
main: model loaded
main: server is listening on http://0.0.0.0:8080
main: starting the main loop...
srv  update_slots: all slots are idle
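The `llama_params_fit_impl` lines in the log amount to a simple subtraction: free device memory minus projected usage, compared against a 1024 MiB margin. A tiny sketch reproducing that arithmetic with the logged numbers (the function is illustrative and is not llama.cpp's actual implementation):

```python
def fit_headroom(free_mib, projected_mib, min_margin_mib=1024):
    """Return (leftover MiB, True if the leftover meets the margin)."""
    leftover = free_mib - projected_mib
    return leftover, leftover >= min_margin_mib


if __name__ == "__main__":
    leftover, ok = fit_headroom(free_mib=52336, projected_mib=21098)
    print(f"leftover: {leftover} MiB, fits: {ok}")
```

With the logged values the leftover is 31238 MiB, which matches the "will leave 31238 >= 1024 MiB" message, so the server keeps the requested parameters unchanged.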

Create the service file

(.venv) bluesanta@bluesanta-desktop:~/llm$ sudo vi /etc/systemd/system/llama.service

Text only

[Unit]
Description=Llama.cpp Server Service
After=network.target

[Service]
# User account
User=bluesanta
Group=bluesanta
LimitMEMLOCK=infinity
WorkingDirectory=/opt/llama.cpp

# Optimized launch command
# -m /home/bluesanta/llm/models/Qwen3.6-35B-A3B-Q4_K_M.gguf
# -m /home/bluesanta/llm/models/gemma-4-26b-Q4_K_M.gguf
# --ctx-size 262144
ExecStart=/usr/local/bin/llama-server \
    -m /home/bluesanta/llm/models/Qwen3.6-35B-A3B-Q5_K_M.gguf \
    --host 0.0.0.0 \
    --port 8000 \
    --ctx-size 262144 \
    --n-gpu-layers 99 \
    --flash-attn on \
    --spec-draft-n-max 2 \
    --mlock \
    --cont-batching \
    --metrics
    
# Restart automatically when the process exits
# Restart=always
# RestartSec=5

[Install]
WantedBy=multi-user.target

With --mmproj applied

[Unit]
Description=Llama.cpp Server Service
After=network.target

[Service]
# User account
User=bluesanta
Group=bluesanta
LimitMEMLOCK=infinity
WorkingDirectory=/opt/llama.cpp

# Optimized launch command
# -m /home/bluesanta/llm/models/Qwen3.6-35B-A3B-Q4_K_M.gguf
# -m /home/bluesanta/llm/models/gemma-4-26b-Q4_K_M.gguf
# -m /home/bluesanta/llm/models/Qwen3.6-35B-A3B-Q5_K_M.gguf
# --ctx-size 262144
# Change --ctx-size to 65536 (64k) or 32768 (32k)
# --spec-type ngram-mod,draft-mtp --spec-draft-n-max 4
ExecStart=/usr/local/bin/llama-server \
    -m /home/bluesanta/llm/models/Qwen3.6-35B-A3B-Q5_K_M.gguf \
    --mmproj /home/bluesanta/llm/models/Qwen3.6-35B-A3B-mmproj-f16.gguf \
    --host 0.0.0.0 \
    --port 8000 \
    --ctx-size 131072 \
    --n-gpu-layers 99 \
    --flash-attn on \
    --spec-draft-n-max 2 \
    --mlock \
    --cont-batching \
    --metrics


# Restart automatically when the process exits
# Restart=always
# RestartSec=5

[Install]
WantedBy=multi-user.target

Reload the service definitions

(.venv) bluesanta@ubuntu:~/llm/llama.cpp$ sudo systemctl daemon-reload

Start the service

(.venv) bluesanta@ubuntu:~/llm/llama.cpp$ sudo systemctl start llama

Check the service status

(.venv) bluesanta@ubuntu:~/llm/llama.cpp$ sudo systemctl status llama

Check the service logs

(.venv) bluesanta@ubuntu:~/llm/llama.cpp$ sudo journalctl -u llama.service -f

Verify

(.venv) bluesanta@ubuntu:~/llm$ curl http://localhost:8000/completion -H "Content-Type: application/json" -d '{
  "prompt": "Jetson AGX Orin의 장점 3가지는?",
  "n_predict": 256
}'
{"index":0,"content":"\n\n\n\n\n\nNVIDIA Jetson AGX Orin은 에지 AI(Edge AI) 애플리케이션을 위한 최상위 성능의 임베디드 컴퓨팅 모듈로, 기존 제품 대비 뛰어난 성능과 효율성을 자랑합니다. 주요 장점 3가지는 다음과 같습니다:\n\n1. **뛰어난 AI 추론 성능 (500 TOPS)**  \n   Jetson AGX Orin은 최대 **500 TOPS**(초당 500조 회 연산)의 INT8 추론 성능을 제공합니다. 이는 이전 세대인 Jetson Xavier NX 대비 약 **20배 이상** 향상된 것으로, 대규모 딥러닝 모델(예: YOLO, ResNet, Vision Transformer 등)을 실시간으로 처리할 수 있어 복잡한 비전 AI, 로봇 공학, 자율 주행 등 고성능이 요구되는 애플리케이션에 이상적입니다.\n\n2. **높은 전력 효율성 대비 고성능**  \n   최대 60W까지 전력 소비를 지원하지만, 필요에 따라 **5W부터 60W까지 유연하게 전력 구성**이 가능합니다. 이는 제한된 전력과 열 설계(Power & Thermal Budget) 환경에서도 고성능을 유지하면서도 에너지 효율","tokens":[],"id_slot":3,"stop":true,"model":"Qwen3.6-35B-A3B-Q5_K_M.gguf","tokens_predicted":256,"tokens_evaluated":13,"generation_settings":{"seed":4294967295,"temperature":1.0,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":20,"top_p":0.949999988079071,"min_p":0.05000000074505806,"top_n_sigma":-1.0,"xtc_probability":0.0,"xtc_threshold":0.10000000149011612,"typical_p":1.0,"repeat_last_n":64,"repeat_penalty":1.0,"presence_penalty":0.0,"frequency_penalty":0.0,"dry_multiplier":0.0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":262144,"dry_sequence_breakers":["\n",":","\"","*"],"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"stop":[],"max_tokens":256,"n_predict":256,"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":false,"logit_bias":[],"n_probs":0,"min_keep":0,"grammar":"","grammar_lazy":false,"grammar_triggers":[],"preserved_tokens":[],"chat_format":"Content-only","reasoning_format":"deepseek","reasoning_in_content":false,"generation_prompt":"","samplers":["penalties","dry","top_n_sigma","top_k","typ_p","top_p","min_p","xtc","temperature"],"speculative.types":"none","timings_per_token":false,"post_sampling_probs":false,"backend_sampling":false,"lora":[]},"prompt":"Jetson AGX Orin의 장점 
3가지는?","has_new_line":true,"truncated":false,"stop_type":"limit","stopping_word":"","tokens_cached":268,"timings":{"cache_n":0,"prompt_n":13,"prompt_ms":434.387,"prompt_per_token_ms":33.41438461538461,"prompt_per_second":29.927230787293357,"predicted_n":256,"predicted_ms":8410.958,"predicted_per_token_ms":32.8553046875,"predicted_per_second":30.436485356364873}}
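The `timings` object at the end of the response already contains the throughput figures, and they follow directly from the token counts and elapsed milliseconds. A short sketch that recomputes them from the values in the response above:

```python
# Subset of the "timings" object from the response above.
TIMINGS = {
    "prompt_n": 13,
    "prompt_ms": 434.387,
    "predicted_n": 256,
    "predicted_ms": 8410.958,
}


def tokens_per_second(n_tokens, elapsed_ms):
    """Convert a token count and elapsed milliseconds to tokens per second."""
    return n_tokens / (elapsed_ms / 1000.0)


prompt_tps = tokens_per_second(TIMINGS["prompt_n"], TIMINGS["prompt_ms"])
gen_tps = tokens_per_second(TIMINGS["predicted_n"], TIMINGS["predicted_ms"])

print(f"prompt    : {prompt_tps:.2f} tokens/s")
print(f"generation: {gen_tps:.2f} tokens/s")
```

Both values agree with the `prompt_per_second` and `predicted_per_second` fields reported by the server, at about 29.9 and 30.4 tokens per second.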

openclaude settings

export CLAUDE_CODE_MAX_OUTPUT_TOKENS=64000
export CLAUDE_CODE_USE_OPENAI=1
export OPENAI_API_KEY=sk-jetson-qwen36
export OPENAI_BASE_URL=http://192.168.0.235:8000/v1
export OPENAI_MODEL=Qwen3.6-35B-A3B-Q5_K_M.gguf

Check the current swap

bluesanta@ubuntu:~$ free -h
               total        used        free      shared  buff/cache   available
Mem:            61Gi       8.8Gi        50Gi       8.0Mi       1.7Gi        51Gi
Swap:           30Gi       311Mi        30Gi

Add a 32 GB physical swap file

bluesanta@ubuntu:~$ sudo fallocate -l 32G /swapfile
bluesanta@ubuntu:~$ sudo chmod 600 /swapfile
bluesanta@ubuntu:~$ sudo mkswap /swapfile
mkswap: /swapfile: warning: wiping old swap signature.
Setting up swapspace version 1, size = 32 GiB (34359734272 bytes)
no label, UUID=66a0c946-6432-4abd-a118-536a222f9e16
bluesanta@ubuntu:~$ sudo swapon /swapfile

Register it in fstab so it is enabled automatically at boot

bluesanta@ubuntu:~$ echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
/swapfile none swap sw 0 0

Confirm the addition

bluesanta@ubuntu:~$ free -h
               total        used        free      shared  buff/cache   available
Mem:            61Gi       8.8Gi        50Gi       8.0Mi       1.7Gi        51Gi
Swap:           62Gi       309Mi        62Gi

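Change the swappiness setting (performance tuning)

The value is currently 60. Lowering it to 10 makes the kernel much less inclined to use swap (zram/SSD), so physical RAM (61 GiB) is used as fully as possible and inference speed holds up.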

bluesanta@ubuntu:~$ sudo sysctl vm.swappiness=10
vm.swappiness = 10
bluesanta@ubuntu:~$ echo 'vm.swappiness=10' | sudo tee -a /etc/sysctl.conf
vm.swappiness=10
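Because `tee -a` appends, running the command more than once leaves duplicate `vm.swappiness` lines in /etc/sysctl.conf. The sketch below finds the value that would take effect, assuming later entries override earlier ones when the file is applied top to bottom; the function name and sample text are made up for illustration.

```python
def effective_sysctl(text, key):
    """Return the last value assigned to `key` in sysctl.conf-style text."""
    value = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blank lines and comments
        name, sep, val = line.partition("=")
        if sep and name.strip() == key:
            value = val.strip()  # keep overwriting so the last entry wins
    return value


SAMPLE = """\
# earlier experiment
vm.swappiness=60
vm.swappiness = 10
"""

if __name__ == "__main__":
    print(effective_sysctl(SAMPLE, "vm.swappiness"))
```

To check the real file, pass in the contents of /etc/sysctl.conf; the live value can be read with `sysctl vm.swappiness`.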

Run vllm

(.venv) bluesanta@ubuntu:~/llm$ vllm serve /home/bluesanta/llm/models/gemma-4-26B-A4B-it-FP8-Dynamic --tensor-parallel-size 1 --max-model-len 32768 --max-num-batched-tokens 32768 --gpu-memory-utilization 0.8 --quantization compressed-tensors --dtype bfloat16 --enforce-eager --speculative-config '{"model": "/home/bluesanta/llm/models/gemma-4-26B-A4B-it-assistant", "num_speculative_tokens": 5}' --enable-auto-tool-choice --tool-call-parser hermes
(.venv) bluesanta@ubuntu:~/llm$ vllm serve /home/bluesanta/llm/models/gemma-4-26B-A4B-it-FP8-Dynamic --tensor-parallel-size 1 --max-model-len 36864 --max-num-batched-tokens 36864 --gpu-memory-utilization 0.8 --quantization compressed-tensors --dtype bfloat16 --enforce-eager --speculative-config '{"model": "/home/bluesanta/llm/models/gemma-4-26B-A4B-it-assistant", "num_speculative_tokens": 5}' --enable-auto-tool-choice --tool-call-parser hermes

Run vllm (*10)

(.venv) bluesanta@ubuntu:~/llm$ vllm serve /home/bluesanta/llm/models/gemma-4-26B-A4B-it-FP8-Dynamic --tensor-parallel-size 1 --max-model-len 46080 --max-num-batched-tokens 46080 --gpu-memory-utilization 0.8 --quantization compressed-tensors --dtype bfloat16 --enforce-eager --speculative-config '{"model": "/home/bluesanta/llm/models/gemma-4-26B-A4B-it-assistant", "num_speculative_tokens": 5}' --enable-auto-tool-choice --tool-call-parser hermes

openclaude settings

export CLAUDE_CODE_MAX_OUTPUT_TOKENS=23000
export CLAUDE_CODE_USE_OPENAI=1
export OPENAI_API_KEY=test
export OPENAI_BASE_URL=http://192.168.0.235:8000/v1
export OPENAI_MODEL=/home/bluesanta/llm/models/gemma-4-26B-A4B-it-FP8-Dynamic

Run vllm (*11)

(.venv) bluesanta@ubuntu:~/llm$ vllm serve /home/bluesanta/llm/models/gemma-4-26B-A4B-it-FP8-Dynamic --tensor-parallel-size 1 --max-model-len 50688 --max-num-batched-tokens 50688 --gpu-memory-utilization 0.8 --quantization compressed-tensors --dtype bfloat16 --enforce-eager --speculative-config '{"model": "/home/bluesanta/llm/models/gemma-4-26B-A4B-it-assistant", "num_speculative_tokens": 5}' --enable-auto-tool-choice --tool-call-parser hermes

openclaude settings

export CLAUDE_CODE_MAX_OUTPUT_TOKENS=24000
export CLAUDE_CODE_USE_OPENAI=1
export OPENAI_API_KEY=test
export OPENAI_BASE_URL=http://192.168.0.235:8000/v1
export OPENAI_MODEL=/home/bluesanta/llm/models/gemma-4-26B-A4B-it-FP8-Dynamic
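The (*10) and (*11) labels are not explained in the text. One reading, inferred purely from the numbers, is that they are multipliers of a 4608-token step, since 4608 × 10 = 46080 and 4608 × 11 = 50688 match the `--max-model-len` values used. The sketch below checks that reading and confirms that each `CLAUDE_CODE_MAX_OUTPUT_TOKENS` value stays below the matching context length; the 4608 step is an assumption.

```python
STEP = 4608  # assumed step size, inferred from 46080 / 10 and 50688 / 11


def context_length(multiplier, step=STEP):
    """Context length for a given multiplier under the assumed step."""
    return multiplier * step


# Values copied from the two labelled runs above.
RUNS = {
    10: {"max_model_len": 46080, "max_output_tokens": 23000},
    11: {"max_model_len": 50688, "max_output_tokens": 24000},
}

if __name__ == "__main__":
    for multiplier, cfg in RUNS.items():
        matches = context_length(multiplier) == cfg["max_model_len"]
        remaining = cfg["max_model_len"] - cfg["max_output_tokens"]
        print(f"*{multiplier}: len={cfg['max_model_len']} "
              f"matches_step={matches} tokens_left_for_prompt={remaining}")
```

Under this reading, a further run labelled (*12) would use 55296, but whether that fits in memory would have to be tested.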

Select the Extensions icon

Install the Continue extension

Select the Continue button at the bottom right

Select Open chat

Select "Select model" -> select "+ Add Chat model"

Select config file

Set the contents of config.yaml

name: Local Config
version: 1.0.0
schema: v1
models:
  - name: gemma-4-26B-A4B-it-FP8-Dynamic
    provider: vllm
    model: /home/bluesanta/llm/models/gemma-4-26B-A4B-it-FP8-Dynamic
    apiBase: http://192.168.0.237:8000

Enter a prompt in the Chat window

Create the index.html file

 

Apply 버튼 선택

index.html 저장

테스트


Install Python 3.11

bluesanta@ubuntu:~/llm$ sudo apt install python3.11 python3.11-venv

Create a virtual environment

bluesanta@bluesanta-AI-Series:~$ cd llm
bluesanta@bluesanta-AI-Series:~/llm$ python3.11 -m venv .ui-env
bluesanta@bluesanta-AI-Series:~/llm$ source .ui-env/bin/activate
(.ui-env) bluesanta@bluesanta-AI-Series:~/llm$ 

Install open-webui

(.ui-env) bluesanta@bluesanta-AI-Series:~/llm$ pip install --upgrade pip
(.ui-env) bluesanta@bluesanta-AI-Series:~/llm$ pip install open-webui

Create the service file

(.ui-env) bluesanta@bluesanta-AI-Series:~/llm$ sudo vi /etc/systemd/system/open-webui.service
[Unit]
Description=Open WebUI Service
After=network.target

[Service]
Type=simple
# Change to your actual system account name (e.g. bluesanta)
User=bluesanta
Group=bluesanta

# Path where Open WebUI data will be stored
WorkingDirectory=/home/bluesanta/llm
# Be sure to verify the path to the open-webui executable inside the virtual environment
ExecStart=/home/bluesanta/llm/.ui-env/bin/open-webui serve

# Environment variables
Environment=PYTHONPATH=/home/bluesanta/llm/.ui-env
Environment=PORT=8080
# If the vLLM server is already running, set the default connection address (optional)
Environment=OPENAI_API_BASE_URL=http://192.168.0.237:8000/v1
Environment=OPENAI_API_KEY=none

Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
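
A typo in ExecStart or in an Environment= line only shows up once the service fails to start, so a quick check of the unit file before enabling it can save a restart loop. The following is a rough Python sketch, not a systemd parser: it assumes one KEY=VALUE pair per Environment= line, as in the unit above, and the helper names parse_unit and check_unit are hypothetical.

```python
from pathlib import Path


def parse_unit(text: str) -> dict:
    """Collect ExecStart and Environment= pairs from systemd unit text."""
    info = {"exec_start": None, "env": {}}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        if line.startswith("ExecStart="):
            info["exec_start"] = line.split("=", 1)[1]
        elif line.startswith("Environment="):
            # Assumption: exactly one KEY=VALUE per line, no quoting.
            key, _, value = line.split("=", 1)[1].partition("=")
            info["env"][key] = value
    return info


def check_unit(info: dict) -> list[str]:
    """Return human-readable problems; an empty list means nothing obvious."""
    problems = []
    exec_start = info["exec_start"]
    if not exec_start:
        problems.append("ExecStart is missing")
    elif not Path(exec_start.split()[0]).is_file():
        problems.append(f"executable not found: {exec_start.split()[0]}")
    port = info["env"].get("PORT", "")
    if port and not port.isdigit():
        problems.append(f"PORT is not a number: {port}")
    return problems
```

Reading /etc/systemd/system/open-webui.service with Path.read_text() and passing the text through parse_unit and check_unit should give an empty list for the unit shown above, provided the virtual environment exists at that path.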

Enable the service

(.ui-env) bluesanta@bluesanta-AI-Series:~/llm$ sudo systemctl enable open-webui.service
Created symlink /etc/systemd/system/multi-user.target.wants/open-webui.service → /etc/systemd/system/open-webui.service.

Start the service

(.ui-env) bluesanta@bluesanta-AI-Series:~/llm$ sudo systemctl start open-webui

Check the service status

(.ui-env) bluesanta@bluesanta-AI-Series:~/llm$ sudo systemctl status open-webui
● open-webui.service - Open WebUI Service
     Loaded: loaded (/etc/systemd/system/open-webui.service; enabled; preset: enabled)
     Active: active (running) since Thu 2026-05-07 23:54:12 KST; 28s ago
   Main PID: 9094 (open-webui)
      Tasks: 83 (limit: 73866)
     Memory: 716.5M (peak: 716.9M)
        CPU: 12.000s
     CGroup: /system.slice/open-webui.service
             └─9094 /home/bluesanta/llm/.ui-env/bin/python3 /home/bluesanta/llm/.ui-env/bin/open-webui serve

 5월 07 23:54:17 bluesanta-AI-Series open-webui[9094]: ------------------------+------------+--+-
 5월 07 23:54:17 bluesanta-AI-Series open-webui[9094]: embeddings.position_ids | UNEXPECTED |  |
 5월 07 23:54:17 bluesanta-AI-Series open-webui[9094]: Notes:
 5월 07 23:54:17 bluesanta-AI-Series open-webui[9094]: - UNEXPECTED:        can be ignored when loading from different task/architecture; not ok if you expect identical arch.
 5월 07 23:54:18 bluesanta-AI-Series open-webui[9094]: INFO:     Started server process [9094]
 5월 07 23:54:18 bluesanta-AI-Series open-webui[9094]: INFO:     Waiting for application startup.
 5월 07 23:54:18 bluesanta-AI-Series open-webui[9094]: 2026-05-07 23:54:18.351 | INFO     | open_webui.utils.logger:start_logger:194 - GLOBAL_LOG_LEVEL: INFO
 5월 07 23:54:18 bluesanta-AI-Series open-webui[9094]: 2026-05-07 23:54:18.351 | INFO     | open_webui.main:lifespan:659 - Installing external dependencies of functions and tools...
 5월 07 23:54:18 bluesanta-AI-Series open-webui[9094]: 2026-05-07 23:54:18.362 | INFO     | open_webui.utils.plugin:install_frontmatter_requirements:407 - No requirements found in fro>
 5월 07 23:54:18 bluesanta-AI-Series open-webui[9094]: 2026-05-07 23:54:18.362 | INFO     | open_webui.utils.automations:scheduler_worker_loop:172 - Scheduler worker started (poll int>

Enable maximum power mode

bluesanta@ubuntu:~$ sudo nvpmodel -m 0
NVPM WARN: Golden image context is already created
NVPM WARN: Reboot required for changing to this power mode: 0
NVPM WARN: DO YOU WANT TO REBOOT NOW? enter YES/yes to confirm:
yes
NVPM WARN: rebooting..

Enable Jetson Clocks: lock the fan speed and clocks at their maximum

bluesanta@ubuntu:~$ sudo jetson_clocks

Set up NVMe swap: even with 64 GB of memory, at least 32 GB of swap is recommended in case memory runs short during the build

bluesanta@ubuntu:~$ sudo fallocate -l 32G /swapfile
bluesanta@ubuntu:~$ sudo chmod 600 /swapfile
bluesanta@ubuntu:~$ sudo mkswap /swapfile
Setting up swapspace version 1, size = 32 GiB (34359734272 bytes)
no label, UUID=fd9a09b8-5892-4904-aa0a-30b37166c229
bluesanta@ubuntu:~$ sudo swapon /swapfile
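
To confirm the swap is active before starting a long build, the kernel's own accounting in /proc/meminfo can be read. A small Python sketch, with hypothetical helper names, that extracts SwapTotal and converts it to GiB (swapon alone does not survive a reboot; a persistent setup normally also needs an /etc/fstab entry):

```python
def swap_total_gib(meminfo_text: str) -> float:
    """Return SwapTotal from /proc/meminfo-style text, in GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("SwapTotal:"):
            kib = int(line.split()[1])  # the kernel reports this value in kB (KiB)
            return kib / (1024 * 1024)
    raise ValueError("SwapTotal not found")


def read_meminfo(path: str = "/proc/meminfo") -> str:
    """Read the live memory statistics (Linux only)."""
    with open(path, encoding="ascii") as fh:
        return fh.read()
```

On the machine above, swap_total_gib(read_meminfo()) should come out just under 32 when /swapfile is the only swap device, since mkswap reserves one page for its header.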

Build vLLM from source (optimized)

Remove the old version and leftover build artifacts

(.venv) bluesanta@ubuntu:~/llm$ pip uninstall -y vllm
(.venv) bluesanta@ubuntu:~/llm/vllm$ rm -rf build

Check the CUDA architecture (Compute Capability)

(.venv) bluesanta@ubuntu:~/llm$ nvidia-smi --query-gpu=compute_cap --format=csv,noheader
8.7

Set Jetson-specific build environment variables

(.venv) bluesanta@ubuntu:~/llm$ export TORCH_CUDA_ARCH_LIST="8.7"
(.venv) bluesanta@ubuntu:~/llm$ export VLLM_TARGET_DEVICE="cuda"
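
The value for TORCH_CUDA_ARCH_LIST is simply the compute capability that nvidia-smi reported. As a sketch of how the two steps connect, the Python below turns the query output into the variable's value; the function names are hypothetical, and joining several capabilities with ";" assumes the usual PyTorch convention for this variable.

```python
import subprocess


def arch_list(smi_output: str) -> str:
    """Turn `nvidia-smi --query-gpu=compute_cap` output into a TORCH_CUDA_ARCH_LIST value."""
    caps = []
    for line in smi_output.splitlines():
        cap = line.strip()
        if cap and cap not in caps:  # drop blanks and duplicate GPUs
            caps.append(cap)
    if not caps:
        raise ValueError("no compute capability reported")
    return ";".join(caps)


def query_compute_caps() -> str:
    """Run the same nvidia-smi query as above and return its raw output."""
    return subprocess.run(
        ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
        check=True, capture_output=True, text=True,
    ).stdout
```

On this board, arch_list(query_compute_caps()) would give "8.7". The result still has to be exported in the shell that runs pip, because a variable set inside a Python process does not propagate back to its parent shell.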

Install vLLM

(.venv) bluesanta@ubuntu:~/llm$ git clone https://github.com/vllm-project/vllm.git
(.venv) bluesanta@ubuntu:~/llm$ cd vllm
(.venv) bluesanta@ubuntu:~/llm/vllm$ pip install setuptools_scm
(.venv) bluesanta@ubuntu:~/llm/vllm$ pip install --upgrade pip setuptools setuptools-scm wheel
(.venv) bluesanta@ubuntu:~/llm/vllm$ sudo apt install -y ninja-build
(.venv) bluesanta@ubuntu:~/llm/vllm$ MAX_JOBS=6 pip install -e .
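
MAX_JOBS caps how many compile jobs run in parallel, which is what keeps the build from exhausting memory. The post uses 6 without saying how it was chosen, so the heuristic below is only an illustration: it bounds the job count by CPU count and by an assumed memory cost per job. The function pick_max_jobs and the gib_per_job figure are hypothetical and are not taken from the vLLM documentation.

```python
def pick_max_jobs(total_mem_gib: float, cpu_count: int, gib_per_job: float = 8.0) -> int:
    """Cap parallel compile jobs by CPU count and an assumed per-job memory cost."""
    # gib_per_job is a hypothetical tuning number; adjust it after watching a real build.
    by_memory = int(total_mem_gib // gib_per_job)
    return max(1, min(cpu_count, by_memory))


if __name__ == "__main__":
    # Example inputs only: 64 GiB of RAM, 12 cores, 10 GiB budgeted per job.
    print(pick_max_jobs(64, 12, gib_per_job=10))
```

If the build is still killed by the out-of-memory handler, lowering MAX_JOBS further is the simplest fix.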

Check the version

(.venv) bluesanta@ubuntu:~/llm$ vllm --version
0.20.2rc1.dev93+g51f22dcfd.cu126

Run vLLM

(.venv) bluesanta@ubuntu:~/llm/vllm$ vllm serve /home/bluesanta/llm/models/gemma-4-31B-it-FP8-block --tensor-parallel-size 1 --max-model-len 4096 --max-num-batched-tokens 4096 --gpu-memory-utilization 0.7 --quantization compressed-tensors --dtype bfloat16 --enforce-eager --speculative-config '{"model": "/home/bluesanta/llm/models/gemma-4-31B-it-assistant", "num_speculative_tokens": 5}'
(APIServer pid=29780) INFO 05-07 19:54:39 [utils.py:306] 
(APIServer pid=29780) INFO 05-07 19:54:39 [utils.py:306]        █     █     █▄   ▄█
(APIServer pid=29780) INFO 05-07 19:54:39 [utils.py:306]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.20.2rc1.dev93+g51f22dcfd
(APIServer pid=29780) INFO 05-07 19:54:39 [utils.py:306]   █▄█▀ █     █     █     █  model   /home/bluesanta/llm/models/gemma-4-31B-it-FP8-block
(APIServer pid=29780) INFO 05-07 19:54:39 [utils.py:306]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=29780) INFO 05-07 19:54:39 [utils.py:306] 
(APIServer pid=29780) INFO 05-07 19:54:39 [utils.py:240] non-default args: {'model_tag': '/home/bluesanta/llm/models/gemma-4-31B-it-FP8-block', 'model': '/home/bluesanta/llm/models/gemma-4-31B-it-FP8-block', 'dtype': 'bfloat16', 'max_model_len': 4096, 'quantization': 'compressed-tensors', 'enforce_eager': True, 'gpu_memory_utilization': 0.7, 'max_num_batched_tokens': 4096, 'speculative_config': {'model': '/home/bluesanta/llm/models/gemma-4-31B-it-assistant', 'num_speculative_tokens': 5}}
(APIServer pid=29780) INFO 05-07 19:54:39 [model.py:563] Resolved architecture: Gemma4ForConditionalGeneration
(APIServer pid=29780) INFO 05-07 19:54:39 [model.py:1692] Using max model len 4096
(APIServer pid=29780) INFO 05-07 19:54:41 [nixl_utils.py:20] Setting UCX_RCACHE_MAX_UNRELEASED to '1024' to avoid a rare memory leak in UCX when using NIXL.
(APIServer pid=29780) WARNING 05-07 19:54:41 [nixl_utils.py:34] NIXL is not available
(APIServer pid=29780) WARNING 05-07 19:54:41 [nixl_utils.py:44] NIXL agent config is not available
(APIServer pid=29780) INFO 05-07 19:54:42 [model.py:563] Resolved architecture: Gemma4MTPModel
(APIServer pid=29780) INFO 05-07 19:54:42 [model.py:1692] Using max model len 262144
(APIServer pid=29780) WARNING 05-07 19:54:42 [speculative.py:672] Enabling num_speculative_tokens > 1 will run multiple times of forward on same MTP layer,which may result in lower acceptance rate
(APIServer pid=29780) INFO 05-07 19:54:42 [speculative.py:858] Overriding draft model max model len from 262144 to 4096
(APIServer pid=29780) INFO 05-07 19:54:42 [scheduler.py:239] Chunked prefill is enabled with max_num_batched_tokens=4096.
(APIServer pid=29780) INFO 05-07 19:54:42 [config.py:101] Gemma4 model has heterogeneous head dimensions (head_dim=256, global_head_dim=512). Forcing TRITON_ATTN backend to prevent mixed-backend numerical divergence.
(APIServer pid=29780) INFO 05-07 19:54:42 [vllm.py:844] Asynchronous scheduling is enabled.
(APIServer pid=29780) WARNING 05-07 19:54:42 [vllm.py:900] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(APIServer pid=29780) WARNING 05-07 19:54:42 [vllm.py:918] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(APIServer pid=29780) INFO 05-07 19:54:42 [kernel.py:212] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['vllm_c', 'native'], fused_add_rms_norm=['vllm_c', 'native'])
(APIServer pid=29780) WARNING 05-07 19:54:42 [vllm.py:1406] max_num_scheduled_tokens is set to 4096 based on the speculative decoding settings. This may lead to suboptimal performance. Consider increasing max_num_batched_tokens to accommodate the additional draft token slots, or decrease num_speculative_tokens or max_num_seqs.
(APIServer pid=29780) INFO 05-07 19:54:42 [vllm.py:1093] Cudagraph is disabled under eager mode
(APIServer pid=29780) WARNING 05-07 19:54:42 [cuda.py:233] Forcing --disable_chunked_mm_input for models with multimodal-bidirectional attention.
(APIServer pid=29780) INFO 05-07 19:54:42 [compilation.py:303] Enabled custom fusions: norm_quant, act_quant
WARNING 05-07 19:55:07 [nixl_utils.py:34] NIXL is not available
WARNING 05-07 19:55:07 [nixl_utils.py:44] NIXL agent config is not available
(EngineCore pid=29812) INFO 05-07 19:55:07 [core.py:109] Initializing a V1 LLM engine (v0.20.2rc1.dev93+g51f22dcfd) with config: model='/home/bluesanta/llm/models/gemma-4-31B-it-FP8-block', speculative_config=SpeculativeConfig(method='mtp', model='/home/bluesanta/llm/models/gemma-4-31B-it-assistant', num_spec_tokens=5), tokenizer='/home/bluesanta/llm/models/gemma-4-31B-it-FP8-block', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=compressed-tensors, quantization_config=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/home/bluesanta/llm/models/gemma-4-31B-it-FP8-block, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': , 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['+quant_fp8', 'all', '+quant_fp8'], 'ir_enable_torch_wrap': False, 'splitting_ops': [], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_vision_items_per_batch': 0, 
'encoder_cudagraph_max_frames_per_batch': None, 'compile_sizes': [], 'compile_ranges_endpoints': [4096], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': , 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False, 'fuse_act_padding': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['vllm_c', 'native'], fused_add_rms_norm=['vllm_c', 'native']), enable_flashinfer_autotune=False, moe_backend='auto')
(EngineCore pid=29812) INFO 05-07 19:55:12 [parallel_state.py:1410] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://192.168.0.237:46069 backend=nccl
(EngineCore pid=29812) INFO 05-07 19:55:12 [parallel_state.py:1723] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=29812) INFO 05-07 19:55:14 [topk_topp_sampler.py:45] Using FlashInfer for top-p & top-k sampling.
(EngineCore pid=29812) WARNING 05-07 19:55:14 [__init__.py:204] min_p and logit_bias parameters won't work with speculative decoding.
(EngineCore pid=29812) INFO 05-07 19:55:14 [gpu_model_runner.py:4842] Starting to load model /home/bluesanta/llm/models/gemma-4-31B-it-FP8-block...
(EngineCore pid=29812) INFO 05-07 19:55:14 [vllm.py:844] Asynchronous scheduling is enabled.
(EngineCore pid=29812) WARNING 05-07 19:55:14 [vllm.py:900] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(EngineCore pid=29812) WARNING 05-07 19:55:14 [vllm.py:918] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(EngineCore pid=29812) INFO 05-07 19:55:14 [kernel.py:212] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['vllm_c', 'native'], fused_add_rms_norm=['vllm_c', 'native'])
(EngineCore pid=29812) WARNING 05-07 19:55:14 [vllm.py:1406] max_num_scheduled_tokens is set to 4096 based on the speculative decoding settings. This may lead to suboptimal performance. Consider increasing max_num_batched_tokens to accommodate the additional draft token slots, or decrease num_speculative_tokens or max_num_seqs.
(EngineCore pid=29812) INFO 05-07 19:55:14 [vllm.py:1093] Cudagraph is disabled under eager mode
(EngineCore pid=29812) INFO 05-07 19:55:14 [compilation.py:303] Enabled custom fusions: norm_quant, act_quant
(EngineCore pid=29812) INFO 05-07 19:55:15 [cuda.py:308] Using AttentionBackendEnum.TRITON_ATTN backend.
(EngineCore pid=29812) INFO 05-07 19:55:16 [cuda.py:308] Using AttentionBackendEnum.TRITON_ATTN backend.
(EngineCore pid=29812) INFO 05-07 19:55:23 [weight_utils.py:904] Filesystem type for checkpoints: EXT4. Checkpoint size: 30.98 GiB. Available RAM: 17.28 GiB.
(EngineCore pid=29812) INFO 05-07 19:55:23 [weight_utils.py:934] Auto-prefetch is disabled because the filesystem (EXT4) is not a recognized network FS (NFS/Lustre) and the checkpoint size (30.98 GiB) exceeds 90% of available RAM (17.28 GiB).
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:19<00:19, 19.16s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:24<00:00, 10.97s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:24<00:00, 12.20s/it]
(EngineCore pid=29812) 
(EngineCore pid=29812) INFO 05-07 19:55:48 [default_loader.py:391] Loading weights took 24.73 seconds
(EngineCore pid=29812) WARNING 05-07 19:55:48 [marlin_utils_fp8.py:97] Your GPU does not have native support for FP8 computation but FP8 quantization is being used. Weight-only FP8 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads.
(EngineCore pid=29812) INFO 05-07 19:55:57 [gpu_model_runner.py:4866] Loading drafter model...
(EngineCore pid=29812) WARNING 05-07 19:55:57 [vllm.py:900] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(EngineCore pid=29812) WARNING 05-07 19:55:57 [vllm.py:918] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(EngineCore pid=29812) INFO 05-07 19:55:57 [kernel.py:212] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['vllm_c', 'native'], fused_add_rms_norm=['vllm_c', 'native'])
(EngineCore pid=29812) INFO 05-07 19:55:57 [vllm.py:1093] Cudagraph is disabled under eager mode
(EngineCore pid=29812) WARNING 05-07 19:55:57 [vllm.py:900] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(EngineCore pid=29812) WARNING 05-07 19:55:57 [vllm.py:918] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(EngineCore pid=29812) INFO 05-07 19:55:57 [kernel.py:212] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['vllm_c', 'native'], fused_add_rms_norm=['vllm_c', 'native'])
(EngineCore pid=29812) INFO 05-07 19:55:57 [vllm.py:1093] Cudagraph is disabled under eager mode
(EngineCore pid=29812) INFO 05-07 19:55:57 [weight_utils.py:904] Filesystem type for checkpoints: EXT4. Checkpoint size: 0.87 GiB. Available RAM: 16.89 GiB.
(EngineCore pid=29812) INFO 05-07 19:55:57 [weight_utils.py:927] Auto-prefetch is disabled because the filesystem (EXT4) is not a recognized network FS (NFS/Lustre). If you want to force prefetching, start vLLM with --safetensors-load-strategy=prefetch.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  1.40it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  1.40it/s]
(EngineCore pid=29812) 
(EngineCore pid=29812) INFO 05-07 19:55:58 [default_loader.py:391] Loading weights took 0.76 seconds
(EngineCore pid=29812) WARNING 05-07 19:55:58 [llm_base_proposer.py:1375] Draft model does not support multimodal inputs, falling back to text-only mode
(EngineCore pid=29812) INFO 05-07 19:55:58 [llm_base_proposer.py:1487] Detected MTP model. Sharing target model embedding weights with the draft model.
(EngineCore pid=29812) INFO 05-07 19:55:58 [gemma4.py:171] Gemma4 MTP: keeping draft model's own lm_head (draft_dim != backbone_dim).
(EngineCore pid=29812) INFO 05-07 19:55:58 [gemma4.py:330] Gemma4 MTP: draft layer 0 (sliding_attention) -> language_model.model.layers.58.self_attn.attn
(EngineCore pid=29812) INFO 05-07 19:55:58 [gemma4.py:330] Gemma4 MTP: draft layer 1 (sliding_attention) -> language_model.model.layers.58.self_attn.attn
(EngineCore pid=29812) INFO 05-07 19:55:58 [gemma4.py:330] Gemma4 MTP: draft layer 2 (sliding_attention) -> language_model.model.layers.58.self_attn.attn
(EngineCore pid=29812) INFO 05-07 19:55:58 [gemma4.py:330] Gemma4 MTP: draft layer 3 (full_attention) -> language_model.model.layers.59.self_attn.attn
(EngineCore pid=29812) INFO 05-07 19:55:59 [gpu_model_runner.py:4944] Model loading took 33.05 GiB memory and 43.758278 seconds
(EngineCore pid=29812) INFO 05-07 19:55:59 [gpu_model_runner.py:5905] Encoder cache will be initialized with a budget of 4096 tokens, and profiled with 1 video items of the maximum feature size.
(EngineCore pid=29812) WARNING 05-07 19:56:23 [op.py:290] Priority not set for op rms_norm, using native implementation.
(EngineCore pid=29812) INFO 05-07 19:57:18 [gpu_worker.py:460] Available KV cache memory: 9.01 GiB
(EngineCore pid=29812) INFO 05-07 19:57:18 [kv_cache_utils.py:1710] GPU KV cache size: 10,690 tokens
(EngineCore pid=29812) INFO 05-07 19:57:18 [kv_cache_utils.py:1711] Maximum concurrency for 4,096 tokens per request: 2.61x
(EngineCore pid=29812) INFO 05-07 19:57:20 [kernel_warmup.py:44] Skipping FlashInfer autotune because it is disabled.
(EngineCore pid=29812) INFO 05-07 19:57:23 [jit_monitor.py:54] Kernel JIT monitor activated — Triton JIT compilations during inference will be logged as warnings.
(EngineCore pid=29812) INFO 05-07 19:57:23 [core.py:306] init engine (profile, create kv cache, warmup model) took 84.31 s
(EngineCore pid=29812) WARNING 05-07 19:57:24 [vllm.py:900] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(EngineCore pid=29812) WARNING 05-07 19:57:24 [vllm.py:918] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(EngineCore pid=29812) INFO 05-07 19:57:24 [kernel.py:212] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['vllm_c', 'native'], fused_add_rms_norm=['vllm_c', 'native'])
(EngineCore pid=29812) INFO 05-07 19:57:24 [vllm.py:1093] Cudagraph is disabled under eager mode
(APIServer pid=29780) INFO 05-07 19:57:24 [api_server.py:613] Supported tasks: ['generate']
(APIServer pid=29780) WARNING 05-07 19:57:24 [model.py:1449] Default vLLM sampling parameters have been overridden by the model's `generation_config.json`: `{'temperature': 1.0, 'top_k': 64, 'top_p': 0.95}`. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=29780) INFO 05-07 19:57:28 [hf.py:483] Detected the chat template content format to be 'openai'. You can set `--chat-template-content-format` to override this.
(APIServer pid=29780) INFO 05-07 19:58:24 [base.py:224] Multi-modal warmup completed in 55.545s
(APIServer pid=29780) INFO 05-07 19:58:24 [base.py:224] Readonly multi-modal warmup completed in 0.170s
(APIServer pid=29780) INFO 05-07 19:58:24 [api_server.py:617] Starting vLLM server on http://0.0.0.0:8000
(APIServer pid=29780) INFO 05-07 19:58:24 [launcher.py:37] Available routes are:
(APIServer pid=29780) INFO 05-07 19:58:24 [launcher.py:46] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=29780) INFO 05-07 19:58:24 [launcher.py:46] Route: /docs, Methods: GET, HEAD
(APIServer pid=29780) INFO 05-07 19:58:24 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=29780) INFO 05-07 19:58:24 [launcher.py:46] Route: /redoc, Methods: GET, HEAD
(APIServer pid=29780) INFO 05-07 19:58:24 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=29780) INFO 05-07 19:58:24 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=29780) INFO 05-07 19:58:24 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=29780) INFO 05-07 19:58:24 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=29780) INFO 05-07 19:58:24 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=29780) INFO 05-07 19:58:24 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=29780) INFO 05-07 19:58:24 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=29780) INFO 05-07 19:58:24 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=29780) INFO 05-07 19:58:24 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=29780) INFO 05-07 19:58:24 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=29780) INFO 05-07 19:58:24 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=29780) INFO 05-07 19:58:24 [launcher.py:46] Route: /v1/chat/completions/batch, Methods: POST
(APIServer pid=29780) INFO 05-07 19:58:24 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=29780) INFO 05-07 19:58:24 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=29780) INFO 05-07 19:58:24 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=29780) INFO 05-07 19:58:24 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=29780) INFO 05-07 19:58:24 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=29780) INFO 05-07 19:58:24 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=29780) INFO 05-07 19:58:24 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=29780) INFO 05-07 19:58:24 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=29780) INFO 05-07 19:58:24 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=29780) INFO 05-07 19:58:24 [launcher.py:46] Route: /generative_scoring, Methods: POST
(APIServer pid=29780) INFO 05-07 19:58:24 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=29780) INFO 05-07 19:58:24 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=29780) INFO:     Started server process [29780]
(APIServer pid=29780) INFO:     Waiting for application startup.
(APIServer pid=29780) INFO:     Application startup complete.
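
Two lines in the log above explain how much headroom this configuration has: "GPU KV cache size: 10,690 tokens" and "Maximum concurrency for 4,096 tokens per request: 2.61x". In this run the second figure matches a plain division of the first by the max model length. The Python sketch below reproduces that arithmetic from log text; treat it as an approximation, since vLLM's own calculation may account for per-layer details. The helper name kv_concurrency is hypothetical.

```python
import re

KV_SIZE = re.compile(r"GPU KV cache size: ([\d,]+) tokens")
MAX_LEN = re.compile(r"Using max model len (\d+)")


def kv_concurrency(log_text: str) -> float:
    """Estimate how many full-length requests fit in the KV cache."""
    kv_tokens = int(KV_SIZE.search(log_text).group(1).replace(",", ""))
    # Assumption: the first "Using max model len" line belongs to the target
    # model; the draft model's line comes later in the log.
    max_len = int(MAX_LEN.search(log_text).group(1))
    return kv_tokens / max_len
```

With 10,690 cached tokens and a 4,096-token limit the ratio is about 2.61, so fewer than three full-length requests fit at once. This is why the later launches adjust --max-model-len and --gpu-memory-utilization.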

Run vLLM (memory adjusted)

(.venv) bluesanta@ubuntu:~/llm$ vllm serve /home/bluesanta/llm/models/gemma-4-31B-it-FP8-block --tensor-parallel-size 1 --max-model-len 2048 --max-num-batched-tokens 4096 --gpu-memory-utilization 0.8 --quantization compressed-tensors --dtype bfloat16 --enforce-eager --speculative-config '{"model": "/home/bluesanta/llm/models/gemma-4-31B-it-assistant", "num_speculative_tokens": 5}'

Run vLLM (model gemma-4-26B-A4B-it-FP8-Dynamic): optimized run

(.venv) bluesanta@ubuntu:~/llm$ vllm serve /home/bluesanta/llm/models/gemma-4-26B-A4B-it-FP8-Dynamic --tensor-parallel-size 1 --max-model-len 16384 --max-num-batched-tokens 16384 --gpu-memory-utilization 0.8 --quantization compressed-tensors --dtype bfloat16 --enforce-eager --speculative-config '{"model": "/home/bluesanta/llm/models/gemma-4-26B-A4B-it-assistant", "num_speculative_tokens": 5}'

Test

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/home/bluesanta/llm/models/gemma-4-31B-it-FP8-block",
    "messages": [{"role": "user", "content": "한글은 누가 만들었어?"}],
    "max_tokens": 100
  }'
bluesanta@ubuntu:~/llm$ curl http://localhost:8000/v1/chat/completions   -H "Content-Type: application/json"   -d '{
    "model": "/home/bluesanta/llm/models/gemma-4-31B-it-FP8-block",
    "messages": [{"role": "user", "content": "한글은 누가 만들었어?"}],                      
    "max_tokens": 100
  }'
{"id":"chatcmpl-8c7ff513de335bf2","object":"chat.completion","created":1778151721,"model":"/home/bluesanta/llm/models/gemma-4-31B-it-FP8-block","choices":[{"index":0,"message":{"role":"assistant","content":"한글은 조선 시대의 제4대 왕인 **세종대왕**이 만들었습니다.\n\n세종대왕은 백성들이 한자를 배우기 어려워 자신의 생각을 글로 표현하지 못하는 것을 안타깝게 여겨, 누구나 쉽게 배우고 쓸 수 있는 우리만의 글자인 **'훈민정음(訓民正音)'**을 창제하셨습니다.\n\n1443년에 완성되어 1446년에 세상에 반","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":"vllm-0.20.2rc1.dev93+g51f22dcfd-887d36ad","usage":{"prompt_tokens":21,"total_tokens":121,"completion_tokens":100,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
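
The response above ends with "finish_reason":"length" and "completion_tokens":100, which means the answer was cut off by max_tokens rather than finished by the model. The same request can be sent from Python with the standard library, and the sketch below also flags that truncation. The function names are hypothetical; the route and the JSON fields are the ones visible in the curl command and its output.

```python
import json
import urllib.request


def build_payload(model: str, prompt: str, max_tokens: int = 100) -> bytes:
    """Encode the same JSON body that the curl command sends."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return json.dumps(body, ensure_ascii=False).encode("utf-8")


def summarize(response: dict) -> tuple[str, bool]:
    """Return (answer text, truncated?) from a chat.completion response."""
    choice = response["choices"][0]
    return choice["message"]["content"], choice["finish_reason"] == "length"


def ask(base_url: str, model: str, prompt: str, max_tokens: int = 100) -> tuple[str, bool]:
    """POST to /v1/chat/completions (needs a running server)."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=build_payload(model, prompt, max_tokens),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return summarize(json.load(resp))
```

Calling ask("http://localhost:8000", "/home/bluesanta/llm/models/gemma-4-31B-it-FP8-block", "한글은 누가 만들었어?") returns the text and a flag; when the flag is True, a larger max_tokens is needed for a complete answer.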

Source

Physical hardware connection

bluesanta@gx10-f6a7:~$ ibdev2netdev
roceP2p1s0f0 port 1 ==> enP2p1s0f0np0 (Down)
roceP2p1s0f1 port 1 ==> enP2p1s0f1np1 (Up)
rocep1s0f0 port 1 ==> enp1s0f0np0 (Down)
rocep1s0f1 port 1 ==> enp1s0f1np1 (Up)

Download the netplan configuration file

bluesanta@gx10-f6a7:~$ sudo wget -O /etc/netplan/40-cx7.yaml https://github.com/NVIDIA/dgx-spark-playbooks/raw/main/nvidia/connect-two-sparks/assets/cx7-netplan.yaml

Set the file permissions

bluesanta@gx10-f6a7:~$ sudo chmod 600 /etc/netplan/40-cx7.yaml

Apply the netplan configuration

bluesanta@gx10-f6a7:~$ sudo netplan apply

Check the IP addresses

Server 1

bluesanta@gx10-f6a7:~$ ip addr show enp1s0f1np1
4: enp1s0f1np1:  mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 30:c5:99:bb:f6:a9 brd ff:ff:ff:ff:ff:ff
    inet 169.254.74.248/16 brd 169.254.255.255 scope link noprefixroute enp1s0f1np1
       valid_lft forever preferred_lft forever
    inet6 fe80::32c5:99ff:febb:f6a9/64 scope link 
       valid_lft forever preferred_lft forever

Server 2

bluesanta@gx10-3b16:~$ ip addr show enp1s0f1np1
4: enp1s0f1np1:  mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 30:c5:99:3e:3b:18 brd ff:ff:ff:ff:ff:ff
    inet 169.254.74.57/16 brd 169.254.255.255 scope link noprefixroute enp1s0f1np1
       valid_lft forever preferred_lft forever
    inet6 fe80::32c5:99ff:fe3e:3b18/64 scope link 
       valid_lft forever preferred_lft forever
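
Both nodes received 169.254.x.x addresses, which are link-local addresses, so the two machines can only reach each other if they sit on the same /16 network. A short Python check of that condition, using the addresses printed above (the function name same_link_local_net is hypothetical):

```python
import ipaddress


def same_link_local_net(cidr_a: str, cidr_b: str) -> bool:
    """True if both interface addresses are link-local and share one network."""
    a = ipaddress.ip_interface(cidr_a)
    b = ipaddress.ip_interface(cidr_b)
    return a.ip.is_link_local and b.ip.is_link_local and a.network == b.network


if __name__ == "__main__":
    # Addresses of server 1 and server 2 from the `ip addr show` output.
    print(same_link_local_net("169.254.74.248/16", "169.254.74.57/16"))  # True
```

If this prints False for your own addresses, the interfaces were probably configured by different mechanisms and the netplan file should be rechecked.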

Set up passwordless SSH authentication

Download the script

bluesanta@gx10-f6a7:~$ wget https://github.com/NVIDIA/dgx-spark-playbooks/raw/refs/heads/main/nvidia/connect-two-sparks/assets/discover-sparks

Server 1

bluesanta@gx10-f6a7:~$ bash ./discover-sparks
Found: 169.254.74.248 (gx10-f6a7.local)
Found: 169.254.74.57 (gx10-3b16.local)
 
Setting up shared SSH access across all nodes...
You may be prompted for your password on each node.
Configuring 169.254.74.248...
  ✓ Successfully configured 169.254.74.248 with shared key
Configuring 169.254.74.57...
  ✓ Successfully configured 169.254.74.57 with shared key
 
Shared SSH setup complete!
All nodes can now SSH to each other using the shared key (id_ed25519_shared).

Server 2

bluesanta@gx10-3b16:~$ bash ./discover-sparks
Found: 169.254.74.248 (gx10-f6a7.local)
Found: 169.254.74.57 (gx10-3b16.local)
 
Setting up shared SSH access across all nodes...
You may be prompted for your password on each node.
Configuring 169.254.74.248...
  ✓ Successfully configured 169.254.74.248 with shared key
Configuring 169.254.74.57...
  ✓ Successfully configured 169.254.74.57 with shared key
 
Shared SSH setup complete!
All nodes can now SSH to each other using the shared key (id_ed25519_shared).