Jetson AGX Orin (64GB): Qwen3.6-35B-A3B Quantization (Q5_K_M)

파란크리스마스 2026. 5. 13. 23:21

Source

Installing PyTorch on Jetson AGX Orin ::: 대학원 생존일지

Check the JetPack version

bluesanta@ubuntu:~$ sudo apt show nvidia-jetpack
Package: nvidia-jetpack
Version: 6.2.2+b24
Priority: standard
Section: metapackages
Source: nvidia-jetpack (6.2.2)
Maintainer: NVIDIA Corporation
Installed-Size: 199 kB
Depends: nvidia-jetpack-runtime (= 6.2.2+b24), nvidia-jetpack-dev (= 6.2.2+b24)
Homepage: http://developer.nvidia.com/jetson
Download-Size: 29.3 kB
APT-Sources: https://repo.download.nvidia.com/jetson/common r36.5/main arm64 Packages
Description: NVIDIA Jetpack Meta Package

Check the CUDA version

bluesanta@ubuntu:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Aug_14_10:14:07_PDT_2024
Cuda compilation tools, release 12.6, V12.6.68
Build cuda_12.6.r12.6/compiler.34714021_0
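
To check the CUDA release from a script instead of reading it by eye, the release number can be extracted from the `nvcc --version` text. This is a minimal sketch, assuming the output keeps the `release X.Y` wording shown above; the function name `parse_cuda_release` is made up for illustration.

```python
import re


def parse_cuda_release(nvcc_output: str) -> tuple[int, int]:
    """Return (major, minor) from the text printed by `nvcc --version`."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("no 'release X.Y' field found in nvcc output")
    return int(match.group(1)), int(match.group(2))


# Sample line copied from the log above.
sample = "Cuda compilation tools, release 12.6, V12.6.68"
print(parse_cuda_release(sample))
```

On the device itself, the input could come from `subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout`.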

Create a virtual environment

bluesanta@bluesanta-desktop:~$ cd llm
bluesanta@bluesanta-desktop:~/llm$ python -m venv .venv
bluesanta@bluesanta-desktop:~/llm$ source .venv/bin/activate
(.venv) bluesanta@bluesanta-desktop:~/llm$

Download the model (Hugging Face)

(.venv) bluesanta@ubuntu:~/llm$ pip install -U "huggingface_hub[cli]"
(.venv) bluesanta@ubuntu:~/llm$ hf download Qwen/Qwen3.6-35B-A3B --local-dir ~/llm/models/Qwen3.6-35B-A3B
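
The download and the conversion steps that follow write several large files into the same directory tree, so it may be worth checking free disk space first. The sketch below uses the file sizes reported in the logs further down, rounded up. The size of the Hugging Face checkpoint is not stated in this post, so the value used for it is an assumption, as is the helper name `has_room`.

```python
import shutil

GIB = 1024 ** 3


def has_room(path: str, required_gib: float, margin_gib: float = 10.0) -> bool:
    """True if the filesystem holding `path` has required_gib + margin_gib free."""
    free_gib = shutil.disk_usage(path).free / GIB
    return free_gib >= required_gib + margin_gib


# Sizes of the files produced later in this post, rounded up (GiB).
# ASSUMPTION: the Hugging Face checkpoint is taken to be about as large as
# the F16 GGUF; the post does not state its size.
hf_checkpoint_gib = 70.0  # assumed
f16_gguf_gib = 70.0       # log reports 69.4G
q5_gguf_gib = 24.0        # log reports 24145.21 MiB
mmproj_gib = 1.0          # log reports 899.3M

total = hf_checkpoint_gib + f16_gguf_gib + q5_gguf_gib + mmproj_gib
# Run this from ~/llm so that "." is on the filesystem that will hold the models.
print(f"need about {total:.0f} GiB free:", has_room(".", total))
```

The F16 GGUF is only an intermediate file, so it can be deleted after quantization if space is tight.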

Convert the original model to GGUF (FP16)

(.venv) bluesanta@gx10-3b16:~/llm/llama.cpp$ python convert_hf_to_gguf.py ~/llm/models/Qwen3.6-35B-A3B --outfile ~/llm/models/Qwen3.6-35B-A3B-F16.gguf --outtype f16
 
INFO:gguf.gguf_writer:Writing the following files:
INFO:gguf.gguf_writer:/home/bluesanta/llm/models/Qwen3.6-35B-A3B-F16.gguf: n_tensors = 733, total_size = 69.4G
Writing: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 69.4G/69.4G [01:36<00:00, 718Mbyte/s]
INFO:hf-to-gguf:Model successfully exported to /home/bluesanta/llm/models/Qwen3.6-35B-A3B-F16.gguf
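
A quick sanity check on the exported file is to read its fixed-size GGUF header and compare the tensor count with the `n_tensors = 733` reported in the log. This is a minimal sketch based on the GGUF header layout (4-byte magic, 32-bit version, 64-bit tensor count, 64-bit metadata key/value count, all little-endian); `read_gguf_header` is a hypothetical helper, and the demo writes a synthetic header to a temporary file instead of touching the real model.

```python
import os
import struct
import tempfile


def read_gguf_header(path: str) -> dict:
    """Read the fixed-size GGUF header: magic, version, tensor and KV counts."""
    with open(path, "rb") as f:
        raw = f.read(24)
    if len(raw) < 24 or raw[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    version, n_tensors, n_kv = struct.unpack("<IQQ", raw[4:24])
    return {"version": version, "n_tensors": n_tensors, "n_kv": n_kv}


# Demo on a synthetic 24-byte header (not a real model file).
with tempfile.NamedTemporaryFile(delete=False, suffix=".gguf") as tmp:
    tmp.write(b"GGUF" + struct.pack("<IQQ", 3, 733, 44))
print(read_gguf_header(tmp.name))
os.remove(tmp.name)
```

Pointing the function at `~/llm/models/Qwen3.6-35B-A3B-F16.gguf` (after expanding `~`) should report the same tensor count as the converter log.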

Quantization (Q5_K_M)

(.venv) bluesanta@gx10-3b16:~/llm/llama.cpp$ llama-quantize --leave-output-tensor ~/llm/models/Qwen3.6-35B-A3B-F16.gguf ~/llm/models/Qwen3.6-35B-A3B-Q5_K_M.gguf Q5_K_M
 
[ 731/ 733] blk.39.ffn_up_exps.weight            - [  2048,    512,    256,      1], type =    f16, converting to q5_K .. size =   512.00 MiB ->   176.00 MiB
[ 732/ 733] blk.39.ffn_up_shexp.weight           - [  2048,    512,      1,      1], type =    f16, converting to q5_K .. size =     2.00 MiB ->     0.69 MiB
[ 733/ 733] blk.39.post_attention_norm.weight    - [  2048,      1,      1,      1], type =    f32, size =    0.008 MiB
llama_model_quantize_impl: model size  = 66152.24 MiB (16.01 BPW)
llama_model_quantize_impl: quant size  = 24145.21 MiB (5.84 BPW)
 
main: quantize time = 473736.51 ms
main:    total time = 473736.51 ms
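
The summary lines can be cross-checked with a little arithmetic: a file size and its bits per weight (BPW) together imply a weight count, and the two sizes give the compression ratio. The sketch below only reuses numbers from the log above; `params_from_size` is an illustrative helper name.

```python
MIB = 1024 ** 2


def params_from_size(size_mib: float, bpw: float) -> float:
    """Number of weights implied by a file size and its bits per weight."""
    return size_mib * MIB * 8 / bpw


f16_mib, f16_bpw = 66152.24, 16.01  # "model size" line
q5_mib, q5_bpw = 24145.21, 5.84     # "quant size" line
quantize_ms = 473736.51             # "quantize time" line

ratio = q5_mib / f16_mib
print(f"weights (from F16 line):   {params_from_size(f16_mib, f16_bpw) / 1e9:.2f} B")
print(f"weights (from quant line): {params_from_size(q5_mib, q5_bpw) / 1e9:.2f} B")
print(f"size ratio: {ratio:.3f}")
print(f"quantize time: {quantize_ms / 60000:.1f} min")
```

Both lines imply roughly 35 billion weights, in line with the "35B" in the model name, and the quantized file is a little over a third of the F16 size. The 5.84 BPW figure is an average over all tensors; `--leave-output-tensor` keeps the output tensor unquantized and the norm tensors stay in f32 (as the last tensor line shows), which likely pulls the average up.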

Generate the multimodal projector (mmproj) file

(.venv) bluesanta@gx10-3b16:~/llm/llama.cpp$ python convert_hf_to_gguf.py ~/llm/models/Qwen3.6-35B-A3B --outfile  ~/llm/models/Qwen3.6-35B-A3B-mmproj-f16.gguf --outtype f16 --mmproj
INFO:hf-to-gguf:Loading model: Qwen3.6-35B-A3B
INFO:hf-to-gguf:Model architecture: Qwen3_5MoeForConditionalGeneration
 
INFO:gguf.gguf_writer:/home/bluesanta/llm/models/Qwen3.6-35B-A3B-mmproj-f16.gguf: n_tensors = 334, total_size = 899.3M
Writing: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 899M/899M [00:00<00:00, 1.48Gbyte/s]
INFO:hf-to-gguf:Model successfully exported to /home/bluesanta/llm/models/Qwen3.6-35B-A3B-mmproj-f16.gguf

Verify with llama.cpp

(.venv) bluesanta@ubuntu:~/llm$ llama-cli -m ~/llm/models/Qwen3.6-35B-A3B-Q5_K_M.gguf  -n 512 --color on -ngl 99 -p "당신은 유능한 AI 어시스턴트입니다. Jetson AGX Orin에 대해 설명해주세요."
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 62827 MiB):
  Device 0: Orin, compute capability 8.7, VMM: yes, VRAM: 62827 MiB
 
Loading model...  
 
 
▄▄ ▄▄
██ ██
██ ██  ▀▀█▄ ███▄███▄  ▀▀█▄    ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██    ██    ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
                                    ██    ██
                                    ▀▀    ▀▀
 
build      : b9123-927dada6c
model      : Qwen3.6-35B-A3B-Q5_K_M.gguf
modalities : text
 
available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read <file>        add a text file
  /glob <pattern>     add text files using globbing pattern
 
 
> 당신은 유능한 AI 어시스턴트입니다. Jetson AGX Orin에 대해 설명해주세요.
 
[Start thinking]
Here's a thinking process:
 
1.  **Analyze User Input:**
   - **User:** "당신은 유능한 AI 어시스턴트입니다. Jetson AGX Orin 에 대해 설명해주세요."
   - **Language:** Korean
   - **Topic:** Jetson AGX Orin (NVIDIA's edge AI computing platform)
   - **Request:** Explain/describe it
 
2.  **Identify Key Information Needed:**
   - What is it? (Overview/Definition)
   - Key Specifications (CPU, GPU, Memory, AI Performance, I/O, Power)
   - Target Applications/Use Cases
   - Key Features/Advantages
   - Ecosystem/Software Support
   - Comparison/Context (if relevant, but keep it concise)
   - Structure the response logically in Korean
 
3.  **Gather Facts (Internal Knowledge up to 2024):**
   - *Manufacturer:* NVIDIA
   - *Series:* Jetson family (edge AI)
   - *Architecture:* Ampere GPU architecture, Carmel CPU cores
   - *AI Performance:* Up to 275 TOPS (INT8)
   - *GPU:* 1024-core NVIDIA Ampere architecture GPU with 32 Tensor cores
   - *CPU:* 12-core ARM64 CPU (8 high-performance + 4 high-efficiency)
   - *Memory:* 64GB LPDDR5 (102.4 GB/s bandwidth)
   - *Storage/Connectivity:* PCIe 4.0, NVMe SSD support, USB 3.2, Ethernet, MIPI CSI-2, etc.
   - *Power:* 60W max (configurable), TDP 60W
   - *Use Cases:* Autonomous mobile robots, smart cameras, medical devices, industrial automation, drones, etc.
   - *Software:* JetPack SDK, CUDA, cuDNN, TensorRT, DeepStream, Isaac ROS, Docker support, Linux-based
   - *Form Factors:* Developer Kit, Industrial Module (for OEMs), Edge Cloud Gateway
   - *Successor/Context:* Follows Xavier AGX, successor to Jetson AGX Xavier, part of NVIDIA's edge AI lineup
 
4.  **Structure the Response (in Korean):**
   - 소개 (Overview)
   - 주요 스펙 (Key Specs
 
[ Prompt: 111.5 t/s | Generation: 30.9 t/s ]
 
>

Run the API server (the log below was captured with the Q4_K_M file)

(.venv) bluesanta@bluesanta-desktop:~/llm/llama.cpp$ llama-server -m ~/llm/models/Qwen3.6-35B-A3B-Q4_K_M.gguf --host 0.0.0.0 --port 8080 -ngl 99 -c 32768 -np 1
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 62840 MiB):
  Device 0: Orin, compute capability 8.7, VMM: yes, VRAM: 62840 MiB
build_info: b8851-e365e658f
system_info: n_threads = 12 (n_threads_batch = 12) / 12 | CUDA : ARCHS = 500,610,700,750,800,860,890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | DOTPROD = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
init: using 11 threads for HTTP server
start: binding port with default address family
 
srv    load_model: loading model '/home/bluesanta/llm/models/Qwen3.6-35B-A3B-Q4_K_M.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
llama_params_fit_impl: projected to use 21098 MiB of device memory vs. 52336 MiB of free device memory
llama_params_fit_impl: will leave 31238 >= 1024 MiB of free device memory, no changes needed
llama_params_fit: successfully fit params to free device memory
llama_params_fit: fitting params to free memory took 1.10 seconds
llama_model_load_from_file_impl: using device CUDA0 (Orin) (0000:00:00.0) - 52362 MiB free
llama_model_loader: loaded meta data with 44 key-value pairs and 733 tensors from /home/bluesanta/llm/models/Qwen3.6-35B-A3B-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
 
init: chat template, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant

'
srv          init: init: chat template, thinking = 1
main: model loaded
main: server is listening on http://0.0.0.0:8080
main: starting the main loop...
srv  update_slots: all slots are idle
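
Once the server reports that it is listening, it can be queried through its OpenAI-compatible chat endpoint. This is a minimal standard-library sketch, assuming the server from the command above is reachable on port 8080; the helper names `build_chat_request` and `chat` are made up for illustration.

```python
import json
import urllib.request


def build_chat_request(base_url: str, prompt: str, max_tokens: int = 256) -> urllib.request.Request:
    """Build a POST request for llama-server's OpenAI-compatible chat endpoint."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, prompt: str) -> str:
    """Send the request and return the assistant's reply text (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(base_url, prompt), timeout=300) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Building the request does not touch the network; only chat() does.
request = build_chat_request("http://localhost:8080", "Describe the Jetson AGX Orin.")
print(request.get_method(), request.full_url)
```

With the server running, `chat("http://localhost:8080", "...")` returns the reply text. Because the log shows `thinking = 1`, part of the token budget may go to reasoning, which the server may return in a separate field rather than in `content`.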

Create the service file

(.venv) bluesanta@bluesanta-desktop:~/llm$ sudo vi /etc/systemd/system/llama.service

Text only

[Unit]
Description=Llama.cpp Server Service
After=network.target

[Service]
# User account
User=bluesanta
Group=bluesanta
LimitMEMLOCK=infinity
WorkingDirectory=/opt/llama.cpp

# Optimized launch command
# -m /home/bluesanta/llm/models/Qwen3.6-35B-A3B-Q4_K_M.gguf \
# -m /home/bluesanta/llm/models/gemma-4-26b-Q4_K_M.gguf
# --ctx-size 262144
ExecStart=/usr/local/bin/llama-server \
    -m /home/bluesanta/llm/models/Qwen3.6-35B-A3B-Q5_K_M.gguf \
    --host 0.0.0.0 \
    --port 8000 \
    --ctx-size 262144 \
    --n-gpu-layers 99 \
    --flash-attn on \
    --spec-draft-n-max 2 \
    --mlock \
    --cont-batching \
    --metrics
    
# Automatic restart when the process exits
# Restart=always
# RestartSec=5

[Install]
WantedBy=multi-user.target

With --mmproj applied

[Unit]
Description=Llama.cpp Server Service
After=network.target

[Service]
# User account
User=bluesanta
Group=bluesanta
LimitMEMLOCK=infinity
WorkingDirectory=/opt/llama.cpp

# Optimized launch command
# -m /home/bluesanta/llm/models/Qwen3.6-35B-A3B-Q4_K_M.gguf \
# -m /home/bluesanta/llm/models/gemma-4-26b-Q4_K_M.gguf
# -m /home/bluesanta/llm/models/Qwen3.6-35B-A3B-Q5_K_M.gguf \
# --ctx-size 262144
# Change --ctx-size to 65536 (64k) or 32768 (32k)
# --spec-type ngram-mod,draft-mtp --spec-draft-n-max 4
ExecStart=/usr/local/bin/llama-server \
    -m /home/bluesanta/llm/models/Qwen3.6-35B-A3B-Q5_K_M.gguf \
    --mmproj /home/bluesanta/llm/models/Qwen3.6-35B-A3B-mmproj-f16.gguf \
    --host 0.0.0.0 \
    --port 8000 \
    --ctx-size 131072 \
    --n-gpu-layers 99 \
    --flash-attn on \
    --spec-draft-n-max 2 \
    --mlock \
    --cont-batching \
    --metrics


# Automatic restart when the process exits
# Restart=always
# RestartSec=5

[Install]
WantedBy=multi-user.target
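
Both unit files pass `--metrics`, which makes llama-server expose a Prometheus-style `/metrics` endpoint. As a rough sketch of how that text could be read from a script, the parser below handles plain `name value` sample lines. It assumes no trailing timestamps, and the metric names and values in the sample are invented for illustration rather than taken from the server.

```python
def parse_prometheus_text(text: str) -> dict[str, float]:
    """Parse 'name value' sample lines, skipping comments and blank lines."""
    samples: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated field; the rest is the name.
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples


# Hypothetical sample text; these names and values are made up.
sample_metrics = """\
# HELP example_tokens_total Illustrative counter.
# TYPE example_tokens_total counter
example_tokens_total 1024
example_tokens_per_second 30.5
"""
print(parse_prometheus_text(sample_metrics))
```

With the service running, the real text could be fetched from `http://localhost:8000/metrics` and passed to the same function.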

Reload the service configuration

(.venv) bluesanta@ubuntu:~/llm/llama.cpp$ sudo systemctl daemon-reload

Start the service

(.venv) bluesanta@ubuntu:~/llm/llama.cpp$ sudo systemctl start llama

Check the service status

(.venv) bluesanta@ubuntu:~/llm/llama.cpp$ sudo systemctl status llama

Check the service logs

(.venv) bluesanta@ubuntu:~/llm/llama.cpp$ sudo journalctl -u llama.service -f
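
Loading a model of this size takes a while after `systemctl start`, so a script that talks to the service may want to wait until it is ready. llama-server provides a `/health` endpoint that answers with HTTP 200 once the model is loaded. The polling loop below is a sketch under that assumption; the function names, timeout, and interval are arbitrary choices for illustration.

```python
import time
import urllib.error
import urllib.request


def probe_health(url: str) -> bool:
    """Return True when GET <url> answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status such as 503.
        return False


def wait_until_ready(url: str, timeout_s: float = 600.0, interval_s: float = 2.0,
                     probe=probe_health, sleep=time.sleep) -> bool:
    """Poll `probe(url)` until it succeeds or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

For the service defined above, the call would be `wait_until_ready("http://localhost:8000/health")`. The `probe` and `sleep` parameters exist only so the loop can be exercised without a running server.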

Verification

(.venv) bluesanta@ubuntu:~/llm$ curl http://localhost:8000/completion -H "Content-Type: application/json" -d '{
  "prompt": "Jetson AGX Orin의 장점 3가지는?",
  "n_predict": 256
}'
{"index":0,"content":"\n\n\n\n\n\nNVIDIA Jetson AGX Orin은 에지 AI(Edge AI) 애플리케이션을 위한 최상위 성능의 임베디드 컴퓨팅 모듈로, 기존 제품 대비 뛰어난 성능과 효율성을 자랑합니다. 주요 장점 3가지는 다음과 같습니다:\n\n1. **뛰어난 AI 추론 성능 (500 TOPS)**  \n   Jetson AGX Orin은 최대 **500 TOPS**(초당 500조 회 연산)의 INT8 추론 성능을 제공합니다. 이는 이전 세대인 Jetson Xavier NX 대비 약 **20배 이상** 향상된 것으로, 대규모 딥러닝 모델(예: YOLO, ResNet, Vision Transformer 등)을 실시간으로 처리할 수 있어 복잡한 비전 AI, 로봇 공학, 자율 주행 등 고성능이 요구되는 애플리케이션에 이상적입니다.\n\n2. **높은 전력 효율성 대비 고성능**  \n   최대 60W까지 전력 소비를 지원하지만, 필요에 따라 **5W부터 60W까지 유연하게 전력 구성**이 가능합니다. 이는 제한된 전력과 열 설계(Power & Thermal Budget) 환경에서도 고성능을 유지하면서도 에너지 효율","tokens":[],"id_slot":3,"stop":true,"model":"Qwen3.6-35B-A3B-Q5_K_M.gguf","tokens_predicted":256,"tokens_evaluated":13,"generation_settings":{"seed":4294967295,"temperature":1.0,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":20,"top_p":0.949999988079071,"min_p":0.05000000074505806,"top_n_sigma":-1.0,"xtc_probability":0.0,"xtc_threshold":0.10000000149011612,"typical_p":1.0,"repeat_last_n":64,"repeat_penalty":1.0,"presence_penalty":0.0,"frequency_penalty":0.0,"dry_multiplier":0.0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":262144,"dry_sequence_breakers":["\n",":","\"","*"],"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"stop":[],"max_tokens":256,"n_predict":256,"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":false,"logit_bias":[],"n_probs":0,"min_keep":0,"grammar":"","grammar_lazy":false,"grammar_triggers":[],"preserved_tokens":[],"chat_format":"Content-only","reasoning_format":"deepseek","reasoning_in_content":false,"generation_prompt":"","samplers":["penalties","dry","top_n_sigma","top_k","typ_p","top_p","min_p","xtc","temperature"],"speculative.types":"none","timings_per_token":false,"post_sampling_probs":false,"backend_sampling":false,"lora":[]},"prompt":"Jetson AGX Orin의 장점 
3가지는?","has_new_line":true,"truncated":false,"stop_type":"limit","stopping_word":"","tokens_cached":268,"timings":{"cache_n":0,"prompt_n":13,"prompt_ms":434.387,"prompt_per_token_ms":33.41438461538461,"prompt_per_second":29.927230787293357,"predicted_n":256,"predicted_ms":8410.958,"predicted_per_token_ms":32.8553046875,"predicted_per_second":30.436485356364873}}
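
The raw response is hard to read, but the useful parts are the `content` field and the `timings` object. The sketch below extracts them from a `/completion` reply; the field names are the ones visible in the response above, while `summarize_completion` and the shortened sample are illustrative.

```python
import json


def summarize_completion(response_text: str) -> dict:
    """Pull the generated text and throughput figures out of a /completion reply."""
    body = json.loads(response_text)
    timings = body.get("timings", {})
    return {
        "content": body.get("content", ""),
        "stop_type": body.get("stop_type"),
        "prompt_tps": timings.get("prompt_per_second"),
        "generation_tps": timings.get("predicted_per_second"),
    }


# Trimmed copy of the response above (content shortened to "...").
sample_response = json.dumps({
    "content": "...",
    "stop_type": "limit",
    "timings": {
        "prompt_per_second": 29.927230787293357,
        "predicted_per_second": 30.436485356364873,
    },
})
summary = summarize_completion(sample_response)
print(f"{summary['generation_tps']:.1f} tokens/s, stopped by: {summary['stop_type']}")
```

A `stop_type` of `limit` means generation ended because `n_predict` was reached, which is why the answer above breaks off mid-sentence.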

export CLAUDE_CODE_MAX_OUTPUT_TOKENS=64000
export CLAUDE_CODE_USE_OPENAI=1
export OPENAI_API_KEY=sk-jetson-qwen36
export OPENAI_BASE_URL=http://192.168.0.235:8000/v1
export OPENAI_MODEL=Qwen3.6-35B-A3B-Q5_K_M.gguf
