Check current swap
bluesanta@ubuntu:~$ free -h
               total        used        free      shared  buff/cache   available
Mem:            61Gi       8.8Gi        50Gi       8.0Mi       1.7Gi        51Gi
Swap:           30Gi       311Mi        30Gi
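To see the same total without reading the `free` table by eye, the sizes listed in `/proc/swaps` can be summed. This is only a rough sketch: `swap_total_gib` is a made-up helper name, and it assumes the usual `/proc/swaps` layout where the third column is the size in KiB.

```shell
# Hypothetical helper: sum the Size column (KiB) of a /proc/swaps-style
# table and print the total in GiB with one decimal place.
# $1: file to read (defaults to /proc/swaps)
swap_total_gib() {
    LC_ALL=C awk 'NR > 1 { kib += $3 }
                  END    { printf "%.1f\n", kib / 1048576 }' "${1:-/proc/swaps}"
}
```

Calling `swap_total_gib` with no argument reads the live table. `swapon --show` lists the individual devices if a per-device breakdown is needed.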
Add a 32 GB physical swap file
bluesanta@ubuntu:~$ sudo fallocate -l 32G /swapfile
bluesanta@ubuntu:~$ sudo chmod 600 /swapfile
bluesanta@ubuntu:~$ sudo mkswap /swapfile
mkswap: /swapfile: warning: wiping old swap signature.
Setting up swapspace version 1, size = 32 GiB (34359734272 bytes)
no label, UUID=66a0c946-6432-4abd-a118-536a222f9e16
bluesanta@ubuntu:~$ sudo swapon /swapfile
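If `swapon` rejects a file created with `fallocate` (the swapon(8) man page warns that preallocated files can look like files with holes on some filesystems), writing the file with `dd` is the more portable route. The sketch below is an alternative to the commands above, not what was run here. `run` and `make_swapfile` are hypothetical helper names, and `DRY_RUN` is an assumed switch that prints the commands instead of executing them.

```shell
# Hypothetical helpers, not part of the original steps.
# With DRY_RUN set, run() only prints the command it was given.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Create and enable a swap file by writing zeros with dd.
# $1: path (e.g. /swapfile), $2: size in GiB (e.g. 32)
make_swapfile() {
    path=$1
    size_gib=$2
    run sudo dd if=/dev/zero of="$path" bs=1M count=$((size_gib * 1024)) status=progress &&
    run sudo chmod 600 "$path" &&
    run sudo mkswap "$path" &&
    run sudo swapon "$path"
}
```

Running `DRY_RUN=1 make_swapfile /swapfile 32` first shows the four commands without touching the disk. Writing 32 GiB of zeros with `dd` takes noticeably longer than `fallocate`.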
Register the swap file in fstab so it is enabled automatically at boot
bluesanta@ubuntu:~$ echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
/swapfile none swap sw 0 0
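Running the `tee -a` command above a second time adds a duplicate line to `/etc/fstab`. A small guard avoids that. This is a sketch only: `append_once` is a hypothetical helper, and it falls back to `sudo tee` only when the file is not writable by the current user.

```shell
# Hypothetical helper: append a line to a file only if that exact line
# is not already present.
# $1: target file (e.g. /etc/fstab), $2: line to add
append_once() {
    file=$1
    line=$2
    if grep -qxF -- "$line" "$file"; then
        return 0                      # exact line already there; nothing to do
    fi
    if [ -w "$file" ]; then
        printf '%s\n' "$line" >> "$file"
    else
        printf '%s\n' "$line" | sudo tee -a "$file" > /dev/null
    fi
}
```

For this post the call would be `append_once /etc/fstab '/swapfile none swap sw 0 0'`.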
Verify the added swap
bluesanta@ubuntu:~$ free -h
               total        used        free      shared  buff/cache   available
Mem:            61Gi       8.8Gi        50Gi       8.0Mi       1.7Gi        51Gi
Swap:           62Gi       309Mi        62Gi
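Change the swappiness setting (performance tuning)
The value is currently 60. Lowering it to 10 makes the kernel much less willing to push pages out to swap (zram/SSD), so the system keeps working from physical RAM (61 GiB) for as long as possible and inference speed is maintained. Swap is not disabled; it is still used once RAM is actually under pressure.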
bluesanta@ubuntu:~$ sudo sysctl vm.swappiness=10
vm.swappiness = 10
bluesanta@ubuntu:~$ echo 'vm.swappiness=10' | sudo tee -a /etc/sysctl.conf
vm.swappiness=10
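Appending to `/etc/sysctl.conf` works, but each rerun adds another copy of the line. One alternative is a single-setting drop-in under `/etc/sysctl.d`, which is simply overwritten on every run. The helper name `write_sysctl_dropin` and the `99-<key>.conf` file name are assumptions made for this sketch.

```shell
# Hypothetical helper: write one sysctl setting to its own drop-in file.
# Overwriting the file (instead of appending) keeps reruns idempotent.
# $1: directory (normally /etc/sysctl.d), $2: key, $3: value
# Prints the path of the file it wrote.
write_sysctl_dropin() {
    dir=$1
    key=$2
    value=$3
    target="$dir/99-$key.conf"        # assumed file name, pick any *.conf
    if [ -w "$dir" ]; then
        printf '%s = %s\n' "$key" "$value" > "$target"
    else
        printf '%s = %s\n' "$key" "$value" | sudo tee "$target" > /dev/null
    fi
    echo "$target"
}
```

After `write_sysctl_dropin /etc/sysctl.d vm.swappiness 10`, running `sudo sysctl --system` reloads all configuration files, and `cat /proc/sys/vm/swappiness` shows the value in effect.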
Run vLLM
(.venv) bluesanta@ubuntu:~/llm$ vllm serve /home/bluesanta/llm/models/gemma-4-26B-A4B-it-FP8-Dynamic --tensor-parallel-size 1 --max-model-len 32768 --max-num-batched-tokens 32768 --gpu-memory-utilization 0.8 --quantization compressed-tensors --dtype bfloat16 --enforce-eager --speculative-config '{"model": "/home/bluesanta/llm/models/gemma-4-26B-A4B-it-assistant", "num_speculative_tokens": 5}' --enable-auto-tool-choice --tool-call-parser hermes
(.venv) bluesanta@ubuntu:~/llm$ vllm serve /home/bluesanta/llm/models/gemma-4-26B-A4B-it-FP8-Dynamic --tensor-parallel-size 1 --max-model-len 36864 --max-num-batched-tokens 36864 --gpu-memory-utilization 0.8 --quantization compressed-tensors --dtype bfloat16 --enforce-eager --speculative-config '{"model": "/home/bluesanta/llm/models/gemma-4-26B-A4B-it-assistant", "num_speculative_tokens": 5}' --enable-auto-tool-choice --tool-call-parser hermes
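The launch commands in this post differ only in the value passed to `--max-model-len` and `--max-num-batched-tokens`. To make it easier to try different context lengths, the common part can be wrapped in a function. This is just a convenience sketch: `serve_gemma` and `MODEL_DIR` are names introduced here, while every flag is copied from the commands above.

```shell
# Directory holding both the main model and the speculative (assistant) model.
MODEL_DIR=/home/bluesanta/llm/models

# Hypothetical wrapper: start vllm with the context length given as $1.
# The same number is used for --max-model-len and --max-num-batched-tokens,
# as in the original commands.
serve_gemma() {
    max_len=${1:?usage: serve_gemma MAX_LEN}
    vllm serve "$MODEL_DIR/gemma-4-26B-A4B-it-FP8-Dynamic" \
        --tensor-parallel-size 1 \
        --max-model-len "$max_len" \
        --max-num-batched-tokens "$max_len" \
        --gpu-memory-utilization 0.8 \
        --quantization compressed-tensors \
        --dtype bfloat16 \
        --enforce-eager \
        --speculative-config "{\"model\": \"$MODEL_DIR/gemma-4-26B-A4B-it-assistant\", \"num_speculative_tokens\": 5}" \
        --enable-auto-tool-choice \
        --tool-call-parser hermes
}
```

With the virtual environment active, `serve_gemma 32768` and `serve_gemma 36864` correspond to the two commands above.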
Run vLLM (*10)
(.venv) bluesanta@ubuntu:~/llm$ vllm serve /home/bluesanta/llm/models/gemma-4-26B-A4B-it-FP8-Dynamic --tensor-parallel-size 1 --max-model-len 46080 --max-num-batched-tokens 46080 --gpu-memory-utilization 0.8 --quantization compressed-tensors --dtype bfloat16 --enforce-eager --speculative-config '{"model": "/home/bluesanta/llm/models/gemma-4-26B-A4B-it-assistant", "num_speculative_tokens": 5}' --enable-auto-tool-choice --tool-call-parser hermes
export CLAUDE_CODE_MAX_OUTPUT_TOKENS=23000
export CLAUDE_CODE_USE_OPENAI=1
export OPENAI_API_KEY=test
export OPENAI_BASE_URL=http://192.168.0.235:8000/v1
export OPENAI_MODEL=/home/bluesanta/llm/models/gemma-4-26B-A4B-it-FP8-Dynamic
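Before pointing a client at the server, it can help to confirm that the endpoint in `OPENAI_BASE_URL` is reachable. vLLM's OpenAI-compatible server answers `GET /v1/models`, so a quick probe might look like the sketch below. `check_server` is a hypothetical helper, and the 5-second timeout is an arbitrary choice.

```shell
# Hypothetical helper: succeed if the OpenAI-compatible endpoint responds
# to GET <base-url>/models. Requires OPENAI_BASE_URL to be set; the API key
# falls back to the placeholder value used in this post.
check_server() {
    curl -fsS --max-time 5 \
        -H "Authorization: Bearer ${OPENAI_API_KEY:-test}" \
        "${OPENAI_BASE_URL:?set OPENAI_BASE_URL first}/models" > /dev/null
}
```

A typical use is `check_server && echo "server is up"`. Dropping the `> /dev/null` prints the JSON model list, which should include the model path given in `OPENAI_MODEL`.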
Run vLLM (*11)
(.venv) bluesanta@ubuntu:~/llm$ vllm serve /home/bluesanta/llm/models/gemma-4-26B-A4B-it-FP8-Dynamic --tensor-parallel-size 1 --max-model-len 50688 --max-num-batched-tokens 50688 --gpu-memory-utilization 0.8 --quantization compressed-tensors --dtype bfloat16 --enforce-eager --speculative-config '{"model": "/home/bluesanta/llm/models/gemma-4-26B-A4B-it-assistant", "num_speculative_tokens": 5}' --enable-auto-tool-choice --tool-call-parser hermes
export CLAUDE_CODE_MAX_OUTPUT_TOKENS=24000
export CLAUDE_CODE_USE_OPENAI=1
export OPENAI_API_KEY=test
export OPENAI_BASE_URL=http://192.168.0.235:8000/v1
export OPENAI_MODEL=/home/bluesanta/llm/models/gemma-4-26B-A4B-it-FP8-Dynamic