
Source

Download the source

(.venv) bluesanta@gx10-3b16:~/llm$ git clone https://github.com/antirez/ds4.git
(.venv) bluesanta@gx10-3b16:~/llm$ cd ds4/

Download the model

(.venv) bluesanta@gx10-3b16:~/llm/ds4$ ./download_model.sh q2-imatrix 
Downloading DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf
from https://huggingface.co/antirez/deepseek-v4-gguf
If the download stops, run the same command again to resume it.
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1479  100  1479    0     0    279      0  0:00:05  0:00:05 --:--:--   344
100 80.7G  100 80.7G    0     0  51.7M      0  0:26:37  0:26:37 --:--:-- 51.4M
Linked ./ds4flash.gguf -> /home/bluesanta/llm/ds4/gguf/DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf
 
Done.
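
Before building, it can be worth confirming that the ./ds4flash.gguf link created by the script resolves to a complete GGUF file. The sketch below is a hypothetical helper, not something ds4 ships: it follows the symlink, reports the size, and checks for the four ASCII bytes "GGUF" that a GGUF file starts with.

```python
import os

GGUF_MAGIC = b"GGUF"  # a GGUF file begins with these four ASCII bytes


def inspect_model(path):
    """Return (resolved_path, size_in_gib, has_gguf_magic) for a model file.

    Hypothetical helper for illustration only; it is not part of ds4.
    """
    resolved = os.path.realpath(path)            # follow the ds4flash.gguf symlink
    size_gib = os.path.getsize(resolved) / 2**30
    with open(resolved, "rb") as f:
        has_magic = f.read(4) == GGUF_MAGIC
    return resolved, size_gib, has_magic
```

Calling inspect_model("./ds4flash.gguf") from the ds4 directory should return a path under gguf/ and a size close to the 80.7G that curl reported. A much smaller size would suggest an interrupted download, in which case the script's own advice applies: run the same command again to resume.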

Build

(.venv) bluesanta@gx10-3b16:~/llm/ds4$ make cuda-spark

Run

bluesanta@gx10-3b16:~/llm/ds4$ ./ds4 -p "모델 이름 알려죠"
ds4: context buffers 751.71 MiB (ctx=32768, backend=cuda, prefill_chunk=2048, raw_kv_rows=2304, compressed_kv_rows=8194)
ds4: CUDA backend initialized on NVIDIA GB10 (sm_121)
ds4: CUDA registered 80.76 GiB model mapping for device access

ds4: CUDA startup model cache prepared 80.76 GiB of tensor spans in 0.000s
ds4: cuda backend initialized for graph diagnostics
We need to answer the user's query. The user asked "모델 이름 알려죠" which is Korean for "Tell me the model name" or "What's your model name?" So we need to respond with the model name. The assistant should state its name. Typically, the assistant might say something like "저는 DeepSeek입니다." But need to check the context. The user didn't specify which model. Probably the assistant is a DeepSeek model. So answer accordingly.
저는 DeepSeek 모델입니다. 도움이 필요하시면 언제든지 물어보세요! 😊
ds4: prefill: 9.25 t/s, generation: 4.39 t/s
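
The last line of the run reports prefill and generation speed in tokens per second. As a rough sketch, the snippet below pulls those two numbers out of that line and turns them into a time estimate. The function names are made up for this post, and the estimate assumes both rates stay constant, which a single short prompt may not represent well.

```python
import re

# Matches the summary line printed by the run above, e.g.
# "ds4: prefill: 9.25 t/s, generation: 4.39 t/s"
STATS_RE = re.compile(r"prefill:\s*([\d.]+)\s*t/s,\s*generation:\s*([\d.]+)\s*t/s")


def parse_speeds(line):
    """Return (prefill_tps, generation_tps), or None if the line does not match."""
    m = STATS_RE.search(line)
    if m is None:
        return None
    return float(m.group(1)), float(m.group(2))


def estimate_seconds(prompt_tokens, output_tokens, prefill_tps, generation_tps):
    """Rough wall-clock estimate, assuming both rates stay constant."""
    return prompt_tokens / prefill_tps + output_tokens / generation_tps
```

For example, estimate_seconds(1000, 500, *parse_speeds(line)) gives a ballpark figure for a 1000-token prompt with a 500-token reply at the measured rates.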

Create the service file

bluesanta@gx10-3b16:~/llm/ds4$ sudo vi /etc/systemd/system/ds4-server.service
[Unit]
Description=DS4 LLM Server
After=network.target

[Service]
Type=simple
User=bluesanta
WorkingDirectory=/home/bluesanta/llm/ds4
ExecStart=/home/bluesanta/llm/ds4/ds4-server --host 0.0.0.0 --ctx 100000 --kv-disk-dir /tmp/ds4-kv --kv-disk-space-mb 8192
Restart=on-failure
RestartSec=10
Environment="PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"

# Logging settings
StandardOutput=journal
StandardError=journal

[Install]
WantedBy=multi-user.target
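
A typo in ExecStart only shows up when the service fails to start, so a quick sanity check of the unit file can save a restart loop. The sketch below uses Python's configparser, which happens to cope with a simple unit file like this one (it is not a full systemd parser), and assumes every option in ExecStart has the form "--name value" as above. The helper name is hypothetical.

```python
import configparser
import shlex


def exec_start_options(unit_text):
    """Return the ExecStart binary and a dict of its "--name value" options."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.optionxform = str              # systemd keys are case-sensitive
    parser.read_string(unit_text)
    argv = shlex.split(parser["Service"]["ExecStart"])
    options = {}
    i = 1
    while i < len(argv):
        # Assumption: each option is followed by exactly one value.
        if argv[i].startswith("--") and i + 1 < len(argv):
            options[argv[i][2:]] = argv[i + 1]
            i += 2
        else:
            i += 1
    return argv[0], options
```

Feeding it the contents of /etc/systemd/system/ds4-server.service should return the ds4-server path plus the host, ctx, kv-disk-dir and kv-disk-space-mb values. Two things are worth a second look in that output: --host 0.0.0.0 listens on every network interface, and /tmp is cleared at boot on many systems, so anything written under /tmp/ds4-kv may not survive a restart.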

Enable the service

bluesanta@gx10-3b16:~/llm/ds4$ sudo systemctl enable ds4-server
Created symlink /etc/systemd/system/multi-user.target.wants/ds4-server.service → /etc/systemd/system/ds4-server.service.

Start the service

bluesanta@gx10-3b16:~/llm/ds4$ sudo systemctl start ds4-server

Check the service status

bluesanta@gx10-3b16:~/llm/ds4$ sudo systemctl status ds4-server
● ds4-server.service - DS4 LLM Server
     Loaded: loaded (/etc/systemd/system/ds4-server.service; enabled; preset: enabled)
     Active: active (running) since Thu 2026-05-21 23:15:40 KST; 19s ago
   Main PID: 953632 (ds4-server)
      Tasks: 3 (limit: 153548)
     Memory: 634.1M (peak: 634.1M)
        CPU: 5.399s
     CGroup: /system.slice/ds4-server.service
             └─953632 /home/bluesanta/llm/ds4/ds4-server --host 0.0.0.0 --ctx 100000 --kv-disk-dir /tmp/ds4-kv --kv-disk-sp>
 
 5월 21 23:15:40 gx10-3b16 ds4-server[953632]: ds4: CUDA host registration skipped: operation not supported
 5월 21 23:15:41 gx10-3b16 ds4-server[953632]: ds4: CUDA loading model tensors into device cache
 5월 21 23:15:44 gx10-3b16 ds4-server[953632]: ds4: CUDA loading model tensors 16.02 GiB cached
 5월 21 23:15:48 gx10-3b16 ds4-server[953632]: ds4: CUDA loading model tensors 32.06 GiB cached
 5월 21 23:15:52 gx10-3b16 ds4-server[953632]: ds4: CUDA loading model tensors 48.02 GiB cached
 5월 21 23:15:55 gx10-3b16 ds4-server[953632]: ds4: CUDA loading model tensors 64.06 GiB cached
 5월 21 23:15:59 gx10-3b16 ds4-server[953632]: ds4: CUDA loading model tensors 80.04 GiB cached
 5월 21 23:15:59 gx10-3b16 ds4-server[953632]: ds4: CUDA startup model cache prepared 80.76 GiB of tensor spans in 19.009s
 5월 21 23:15:59 gx10-3b16 ds4-server[953632]: ds4: cuda backend initialized for graph diagnostics
 5월 21 23:15:59 gx10-3b16 ds4-server[953632]: 0521 23:15:59 ds4-server: context buffers 1896.58 MiB (ctx=100000, backend=c>
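
Note that the unit is "active (running)" from 23:15:40, while the model cache is only ready at 23:15:59. With Type=simple, systemd treats the service as started as soon as the process is launched, so a client that connects right after systemctl start may be too early. A minimal sketch of a wait loop follows, assuming the server listens on port 8000 as in the curl command further down. The log does not show whether ds4-server opens the port before or after loading the model, so a successful connect is a necessary check, not a guarantee of readiness.

```python
import socket
import time


def wait_for_port(host, port, timeout_s=120.0, interval_s=1.0):
    """Poll until a TCP connection to host:port succeeds or timeout_s elapses.

    Hypothetical helper; returns True on the first successful connect,
    False if the deadline passes first.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Connection refused or timed out: the server is not accepting yet.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

A deployment script could call wait_for_port("127.0.0.1", 8000) after starting the service and only then send the first request.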

View the service logs

bluesanta@gx10-3b16:~/llm/ds4$ sudo journalctl -u ds4-server -f

Verify the server

curl http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model":"deepseek-v4-flash",
    "messages":[{"role":"user","content":"List three Redis design principles."}],
    "stream":true
  }'
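
The same request can be sent from Python with only the standard library. This is a sketch under one assumption: that with "stream":true the server answers with OpenAI-style server-sent events, meaning "data: {...}" lines that carry choices[].delta.content and a final "data: [DONE]". That format is not shown in this post, so check it against the actual curl output before relying on it. The function names are made up for the example.

```python
import json
import urllib.request


def iter_content(lines):
    """Yield text fragments from OpenAI-style SSE lines (assumed format)."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue                          # skip blank separator lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


def stream_chat(prompt, base_url="http://127.0.0.1:8000/v1", model="deepseek-v4-flash"):
    """POST the same body as the curl command and yield streamed text pieces."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }).encode("utf-8")
    req = urllib.request.Request(
        base_url + "/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        yield from iter_content(resp)         # the response iterates line by line


# Usage (needs the running server):
# for piece in stream_chat("List three Redis design principles."):
#     print(piece, end="", flush=True)
```

Keeping the parser separate from the HTTP call means it can be tried on captured curl output without a running server.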

openclaude configuration

export CLAUDE_CODE_USE_OPENAI=1
export OPENAI_API_KEY=deepseek-v4-flash
export OPENAI_BASE_URL=http://192.168.0.240:8000/v1
export OPENAI_MODEL="DeepSeek V4 Flash"