Sources
- GitHub - antirez/ds4: DeepSeek 4 Flash local inference engine for Metal and CUDA · GitHub
- antirez/ds4 - DeepSeek V4 Flash local inference engine for Metal | GeekNews
Download the source
(.venv) bluesanta@gx10-3b16:~/llm$ git clone https://github.com/antirez/ds4.git
(.venv) bluesanta@gx10-3b16:~/llm$ cd ds4/
Download the model
(.venv) bluesanta@gx10-3b16:~/llm/ds4$ ./download_model.sh q2-imatrix
Downloading DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf
from https://huggingface.co/antirez/deepseek-v4-gguf
If the download stops, run the same command again to resume it.
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 1479 100 1479 0 0 279 0 0:00:05 0:00:05 --:--:-- 344
100 80.7G 100 80.7G 0 0 51.7M 0 0:26:37 0:26:37 --:--:-- 51.4M
Linked ./ds4flash.gguf -> /home/bluesanta/llm/ds4/gguf/DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf
Done.
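The download log above ends by linking ./ds4flash.gguf to the file under gguf/. A quick check that the link exists and resolves to a real file might look like the sketch below. The function name `check_model_link` is hypothetical and not part of ds4; only the link path comes from the log.

```shell
# check_model_link [PATH]
# Hypothetical helper (not part of ds4): succeeds only if PATH is a
# symlink whose target exists. PATH defaults to the link from the log above.
check_model_link() {
  link=${1:-./ds4flash.gguf}
  if [ -L "$link" ] && [ -e "$link" ]; then
    # -e follows the symlink, so a dangling link fails this branch
    echo "ok: $link -> $(readlink "$link")"
  else
    echo "missing or dangling: $link" >&2
    return 1
  fi
}
```

Running `check_model_link` from the ds4 directory before building is enough; if it fails, rerunning the download command is the resume path the script itself suggests.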
Build
(.venv) bluesanta@gx10-3b16:~/llm/ds4$ make cuda-spark
Run
bluesanta@gx10-3b16:~/llm/ds4$ ./ds4 -p "모델 이름 알려죠"
ds4: context buffers 751.71 MiB (ctx=32768, backend=cuda, prefill_chunk=2048, raw_kv_rows=2304, compressed_kv_rows=8194)
ds4: CUDA backend initialized on NVIDIA GB10 (sm_121)
ds4: CUDA registered 80.76 GiB model mapping for device access
ds4: CUDA startup model cache prepared 80.76 GiB of tensor spans in 0.000s
ds4: cuda backend initialized for graph diagnostics
We need to answer the user's query. The user asked "모델 이름 알려죠" which is Korean for "Tell me the model name" or "What's your model name?" So we need to respond with the model name. The assistant should state its name. Typically, the assistant might say something like "저는 DeepSeek입니다." But need to check the context. The user didn't specify which model. Probably the assistant is a DeepSeek model. So answer accordingly.
저는 DeepSeek 모델입니다. 도움이 필요하시면 언제든지 물어보세요! 😊
ds4: prefill: 9.25 t/s, generation: 4.39 t/s
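The final stats line gives a rough feel for response times. As a sketch, assuming the line keeps the format shown above, the generation rate can be pulled out and turned into an estimate for a longer answer. Both function names are hypothetical helpers, not ds4 commands.

```shell
# ds4_gen_rate: read ds4 output on stdin and print the generation rate
# (tokens per second) from a line like
#   ds4: prefill: 9.25 t/s, generation: 4.39 t/s
# Assumes the stats line format shown in the run above.
ds4_gen_rate() {
  sed -n 's/.*generation: \([0-9.]*\) t\/s.*/\1/p'
}

# ds4_eta TOKENS RATE: print the estimated seconds needed to generate
# TOKENS tokens at RATE tokens per second, rounded to a whole second.
ds4_eta() {
  awk -v n="$1" -v r="$2" 'BEGIN { printf "%.0f\n", n / r }'
}
```

At the 4.39 t/s measured here, `ds4_eta 500 4.39` estimates roughly 114 seconds for a 500-token reply. A single short prompt is a small sample, so treat the figure as indicative only.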
Create the service file
bluesanta@gx10-3b16:~/llm/ds4$ sudo vi /etc/systemd/system/ds4-server.service
[Unit]
Description=DS4 LLM Server
After=network.target
[Service]
Type=simple
User=bluesanta
WorkingDirectory=/home/bluesanta/llm/ds4
ExecStart=/home/bluesanta/llm/ds4/ds4-server --host 0.0.0.0 --ctx 100000 --kv-disk-dir /tmp/ds4-kv --kv-disk-space-mb 8192
Restart=on-failure
RestartSec=10
Environment="PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
# Logging settings
StandardOutput=journal
StandardError=journal
[Install]
WantedBy=multi-user.target
Enable the service
bluesanta@gx10-3b16:~/llm/ds4$ sudo systemctl enable ds4-server
Created symlink /etc/systemd/system/multi-user.target.wants/ds4-server.service → /etc/systemd/system/ds4-server.service.
Start the service
bluesanta@gx10-3b16:~/llm/ds4$ sudo systemctl start ds4-server
Check the service status
bluesanta@gx10-3b16:~/llm/ds4$ sudo systemctl status ds4-server
● ds4-server.service - DS4 LLM Server
Loaded: loaded (/etc/systemd/system/ds4-server.service; enabled; preset: enabled)
Active: active (running) since Thu 2026-05-21 23:15:40 KST; 19s ago
Main PID: 953632 (ds4-server)
Tasks: 3 (limit: 153548)
Memory: 634.1M (peak: 634.1M)
CPU: 5.399s
CGroup: /system.slice/ds4-server.service
└─953632 /home/bluesanta/llm/ds4/ds4-server --host 0.0.0.0 --ctx 100000 --kv-disk-dir /tmp/ds4-kv --kv-disk-sp>
5월 21 23:15:40 gx10-3b16 ds4-server[953632]: ds4: CUDA host registration skipped: operation not supported
5월 21 23:15:41 gx10-3b16 ds4-server[953632]: ds4: CUDA loading model tensors into device cache
5월 21 23:15:44 gx10-3b16 ds4-server[953632]: ds4: CUDA loading model tensors 16.02 GiB cached
5월 21 23:15:48 gx10-3b16 ds4-server[953632]: ds4: CUDA loading model tensors 32.06 GiB cached
5월 21 23:15:52 gx10-3b16 ds4-server[953632]: ds4: CUDA loading model tensors 48.02 GiB cached
5월 21 23:15:55 gx10-3b16 ds4-server[953632]: ds4: CUDA loading model tensors 64.06 GiB cached
5월 21 23:15:59 gx10-3b16 ds4-server[953632]: ds4: CUDA loading model tensors 80.04 GiB cached
5월 21 23:15:59 gx10-3b16 ds4-server[953632]: ds4: CUDA startup model cache prepared 80.76 GiB of tensor spans in 19.009s
5월 21 23:15:59 gx10-3b16 ds4-server[953632]: ds4: cuda backend initialized for graph diagnostics
5월 21 23:15:59 gx10-3b16 ds4-server[953632]: 0521 23:15:59 ds4-server: context buffers 1896.58 MiB (ctx=100000, backend=c>
Follow the service logs
bluesanta@gx10-3b16:~/llm/ds4$ sudo journalctl -u ds4-server -f
Test the API
curl http://127.0.0.1:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model":"deepseek-v4-flash",
"messages":[{"role":"user","content":"List three Redis design principles."}],
"stream":true
}'
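The status log earlier shows the model cache taking about 19 seconds to prepare, so a request sent right after `systemctl start` may not get a connection yet. A minimal sketch of a wait loop follows. `wait_for_http` is a hypothetical helper, and the URL in the usage comment assumes the same 127.0.0.1:8000 address used in the curl call above.

```shell
# wait_for_http URL [TRIES] [DELAY_SECONDS]
# Hypothetical helper: returns 0 as soon as the server sends any HTTP
# response at URL, or 1 after TRIES failed attempts.
wait_for_http() {
  url=$1
  tries=${2:-30}
  delay=${3:-2}
  while [ "$tries" -gt 0 ]; do
    # Without -f, curl exits 0 on any HTTP status (even 404),
    # so this checks reachability only, not endpoint correctness.
    if curl -s -o /dev/null --max-time 5 "$url"; then
      return 0
    fi
    tries=$((tries - 1))
    sleep "$delay"
  done
  return 1
}

# Usage (assumed address from the curl example above):
#   wait_for_http http://127.0.0.1:8000/ && echo "ds4-server is reachable"
```

The loop only proves that something answers on the port. Whether the server accepts connections before the model has finished loading is not confirmed here, so the first real request is still the definitive test.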
openclaude configuration
export CLAUDE_CODE_USE_OPENAI=1
export OPENAI_API_KEY=deepseek-v4-flash
export OPENAI_BASE_URL=http://192.168.0.240:8000/v1
export OPENAI_MODEL="DeepSeek V4 Flash"