
Check the current swap

bluesanta@ubuntu:~$ free -h
               total        used        free      shared  buff/cache   available
Mem:            61Gi       8.8Gi        50Gi       8.0Mi       1.7Gi        51Gi
Swap:           30Gi       311Mi        30Gi
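
`free -h` only shows the total; `swapon --show` or `cat /proc/swaps` lists the individual devices behind it. As a rough sketch, the helper below sums the Size column of a `/proc/swaps`-style listing, assuming sizes are reported in KiB as the kernel does. The name `swap_total_gib` is made up for this example.

```shell
# Hypothetical helper: print total swap in GiB from a /proc/swaps-style file.
# Columns are: Filename Type Size Used Priority (Size is in KiB).
swap_total_gib() {
  awk 'NR > 1 { kib += $3 } END { printf "%.1f\n", kib / 1048576 }' "${1:-/proc/swaps}"
}
```

Calling `swap_total_gib` with no argument reads the live `/proc/swaps`.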

Add a 32GB physical swap file

bluesanta@ubuntu:~$ sudo fallocate -l 32G /swapfile
bluesanta@ubuntu:~$ sudo chmod 600 /swapfile
bluesanta@ubuntu:~$ sudo mkswap /swapfile
mkswap: /swapfile: warning: wiping old swap signature.
Setting up swapspace version 1, size = 32 GiB (34359734272 bytes)
no label, UUID=66a0c946-6432-4abd-a118-536a222f9e16
bluesanta@ubuntu:~$ sudo swapon /swapfile
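
`fallocate` worked here, but the swapon(8) man page notes that preallocated files can be treated as files with holes on some filesystems, and it names `dd` from `/dev/zero` as the most portable way to create a swap file. A minimal sketch of that fallback follows; the helper name `make_swap_backing_file` and its size-in-MiB argument are assumptions for illustration, and it only creates the file (it does not run `mkswap` or `swapon`).

```shell
# Hypothetical helper: create a swap backing file of SIZE_MIB mebibytes.
# Tries fallocate first, falls back to dd when fallocate is missing or fails.
make_swap_backing_file() {
  swap_path=$1
  swap_size_mib=$2
  if ! fallocate -l "${swap_size_mib}M" "$swap_path" 2>/dev/null; then
    # Write real zero-filled blocks so the file has no holes.
    dd if=/dev/zero of="$swap_path" bs=1M count="$swap_size_mib" 2>/dev/null || return 1
  fi
  # Swap files should be readable and writable by the owner only.
  chmod 600 "$swap_path"
}
```

For a real `/swapfile` this has to run as root, and 32G corresponds to a size of 32768 MiB.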

Register it in fstab so it is enabled automatically at boot

bluesanta@ubuntu:~$ echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
/swapfile none swap sw 0 0
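
Since `tee -a` always appends, running the command twice leaves a duplicate entry in `/etc/fstab`. One way to guard against that, sketched here with a made-up helper name `ensure_line`, is to append only when the exact line is not already present.

```shell
# Hypothetical helper: append LINE to FILE only if that exact line is missing.
ensure_line() {
  wanted_line=$1
  target_file=$2
  # -x matches whole lines, -F treats the pattern as a fixed string.
  grep -qxF -- "$wanted_line" "$target_file" 2>/dev/null \
    || printf '%s\n' "$wanted_line" >> "$target_file"
}
```

For the root-owned `/etc/fstab`, the same idea as a one-liner would be `grep -qxF '/swapfile none swap sw 0 0' /etc/fstab || echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab`.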

Verify the addition

bluesanta@ubuntu:~$ free -h
               total        used        free      shared  buff/cache   available
Mem:            61Gi       8.8Gi        50Gi       8.0Mi       1.7Gi        51Gi
Swap:           62Gi       309Mi        62Gi

Change the swappiness setting (performance tuning)

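Lower the value from the current 60 to 10 so that the kernel avoids swap (zram/SSD) for as long as it can and keeps the working set in physical RAM (61GiB), which helps maintain inference speed. A low value does not disable swap; it only makes the kernel less willing to swap pages out.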

bluesanta@ubuntu:~$ sudo sysctl vm.swappiness=10
vm.swappiness = 10
bluesanta@ubuntu:~$ echo 'vm.swappiness=10' | sudo tee -a /etc/sysctl.conf
vm.swappiness=10
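
`sysctl vm.swappiness=10` changes the running kernel right away, while the line in `/etc/sysctl.conf` is what gets applied at boot. The file is applied top to bottom, so if the key appears more than once (easy to end up with after repeated `tee -a`), the last assignment wins. A small sketch to check which value a file would end up setting; the helper name `last_sysctl_value` is an assumption for this example.

```shell
# Hypothetical helper: print the last value assigned to KEY in a sysctl-style file.
# Lines starting with # or ; are comments and are skipped.
last_sysctl_value() {
  awk -F= -v k="$1" '
    /^[ \t]*[#;]/ { next }
    { name = $1; gsub(/[ \t]/, "", name) }
    name == k { val = $2; gsub(/[ \t]/, "", val); found = val }
    END { if (found != "") print found }
  ' "$2"
}
```

For example, `last_sysctl_value vm.swappiness /etc/sysctl.conf` shows the boot-time value, and `cat /proc/sys/vm/swappiness` shows the value currently in effect.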

Run vllm

(.venv) bluesanta@ubuntu:~/llm$ vllm serve /home/bluesanta/llm/models/gemma-4-26B-A4B-it-FP8-Dynamic --tensor-parallel-size 1 --max-model-len 32768 --max-num-batched-tokens 32768 --gpu-memory-utilization 0.8 --quantization compressed-tensors --dtype bfloat16 --enforce-eager --speculative-config '{"model": "/home/bluesanta/llm/models/gemma-4-26B-A4B-it-assistant", "num_speculative_tokens": 5}' --enable-auto-tool-choice --tool-call-parser hermes
(.venv) bluesanta@ubuntu:~/llm$ vllm serve /home/bluesanta/llm/models/gemma-4-26B-A4B-it-FP8-Dynamic --tensor-parallel-size 1 --max-model-len 36864 --max-num-batched-tokens 36864 --gpu-memory-utilization 0.8 --quantization compressed-tensors --dtype bfloat16 --enforce-eager --speculative-config '{"model": "/home/bluesanta/llm/models/gemma-4-26B-A4B-it-assistant", "num_speculative_tokens": 5}' --enable-auto-tool-choice --tool-call-parser hermes

Run vllm (*10)

(.venv) bluesanta@ubuntu:~/llm$ vllm serve /home/bluesanta/llm/models/gemma-4-26B-A4B-it-FP8-Dynamic --tensor-parallel-size 1 --max-model-len 46080 --max-num-batched-tokens 46080 --gpu-memory-utilization 0.8 --quantization compressed-tensors --dtype bfloat16 --enforce-eager --speculative-config '{"model": "/home/bluesanta/llm/models/gemma-4-26B-A4B-it-assistant", "num_speculative_tokens": 5}' --enable-auto-tool-choice --tool-call-parser hermes
export CLAUDE_CODE_MAX_OUTPUT_TOKENS=23000
export CLAUDE_CODE_USE_OPENAI=1
export OPENAI_API_KEY=test
export OPENAI_BASE_URL=http://192.168.0.235:8000/v1
export OPENAI_MODEL=/home/bluesanta/llm/models/gemma-4-26B-A4B-it-FP8-Dynamic
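
In vLLM, `--max-model-len` bounds the prompt and the generated tokens together, so the output cap set on the client decides how much room is left for the prompt. The arithmetic can be sketched as below, assuming the client sends `CLAUDE_CODE_MAX_OUTPUT_TOKENS` as its per-request output limit; the helper name `prompt_budget` is made up for illustration.

```shell
# Hypothetical helper: tokens left for the prompt once the output cap is reserved.
prompt_budget() {
  max_model_len=$1
  max_output_tokens=$2
  if [ "$max_output_tokens" -ge "$max_model_len" ]; then
    echo "output cap must be smaller than max-model-len" >&2
    return 1
  fi
  echo $((max_model_len - max_output_tokens))
}
```

With the values above, `prompt_budget 46080 23000` gives 23080 tokens of prompt room.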

Run vllm (*11)

(.venv) bluesanta@ubuntu:~/llm$ vllm serve /home/bluesanta/llm/models/gemma-4-26B-A4B-it-FP8-Dynamic --tensor-parallel-size 1 --max-model-len 50688 --max-num-batched-tokens 50688 --gpu-memory-utilization 0.8 --quantization compressed-tensors --dtype bfloat16 --enforce-eager --speculative-config '{"model": "/home/bluesanta/llm/models/gemma-4-26B-A4B-it-assistant", "num_speculative_tokens": 5}' --enable-auto-tool-choice --tool-call-parser hermes
export CLAUDE_CODE_MAX_OUTPUT_TOKENS=24000
export CLAUDE_CODE_USE_OPENAI=1
export OPENAI_API_KEY=test
export OPENAI_BASE_URL=http://192.168.0.235:8000/v1
export OPENAI_MODEL=/home/bluesanta/llm/models/gemma-4-26B-A4B-it-FP8-Dynamic
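
Before pointing a client at the server, it can help to confirm that the endpoint answers. vLLM's OpenAI-compatible server exposes a `/v1/models` route, so a quick smoke test might look like the sketch below. The function name `list_models` is an assumption, and it defaults to the `OPENAI_BASE_URL` and `OPENAI_API_KEY` values exported above.

```shell
# Hypothetical helper: query the OpenAI-compatible /models route of the server.
list_models() {
  base_url=${1:-$OPENAI_BASE_URL}
  api_key=${2:-$OPENAI_API_KEY}
  # ${base_url%/} drops one trailing slash so the path is not doubled.
  curl -sS -H "Authorization: Bearer ${api_key}" "${base_url%/}/models"
}
```

If the server is up, the response should list the model id, which here is the path passed to `vllm serve` and the value to use for `OPENAI_MODEL`.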