Check the JetPack version
bluesanta@ubuntu:~$ sudo apt show nvidia-jetpack
Package: nvidia-jetpack
Version: 6.2.2+b24
Priority: standard
Section: metapackages
Source: nvidia-jetpack (6.2.2)
Maintainer: NVIDIA Corporation
Installed-Size: 199 kB
Depends: nvidia-jetpack-runtime (= 6.2.2+b24), nvidia-jetpack-dev (= 6.2.2+b24)
Homepage: http://developer.nvidia.com/jetson
Download-Size: 29.3 kB
APT-Sources: https://repo.download.nvidia.com/jetson/common r36.5/main arm64 Packages
Description: NVIDIA Jetpack Meta Package
Check the CUDA version
bluesanta@ubuntu:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Aug_14_10:14:07_PDT_2024
Cuda compilation tools, release 12.6, V12.6.68
Build cuda_12.6.r12.6/compiler.34714021_0
Create a virtual environment
bluesanta@bluesanta-desktop:~$ cd llm
bluesanta@bluesanta-desktop:~/llm$ python -m venv .venv
bluesanta@bluesanta-desktop:~/llm$ source .venv/bin/activate
(.venv) bluesanta@bluesanta-desktop:~/llm$
Download the model (Hugging Face)
(.venv) bluesanta@ubuntu:~/llm$ pip install -U "huggingface_hub[cli]"
(.venv) bluesanta@ubuntu:~/llm$ hf download Qwen/Qwen3.6-35B-A3B --local-dir ~/llm/models/Qwen3.6-35B-A3B
Convert the original model to GGUF (FP16)
(.venv) bluesanta@gx10-3b16:~/llm/llama.cpp$ python convert_hf_to_gguf.py ~/llm/models/Qwen3.6-35B-A3B --outfile ~/llm/models/Qwen3.6-35B-A3B-F16.gguf --outtype f16
INFO:gguf.gguf_writer:Writing the following files:
INFO:gguf.gguf_writer:/home/bluesanta/llm/models/Qwen3.6-35B-A3B-F16.gguf: n_tensors = 733, total_size = 69.4G
Writing: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 69.4G/69.4G [01:36<00:00, 718Mbyte/s]
INFO:hf-to-gguf:Model successfully exported to /home/bluesanta/llm/models/Qwen3.6-35B-A3B-F16.gguf
Quantize (Q5_K_M)
(.venv) bluesanta@gx10-3b16:~/llm/llama.cpp$ llama-quantize --leave-output-tensor ~/llm/models/Qwen3.6-35B-A3B-F16.gguf ~/llm/models/Qwen3.6-35B-A3B-Q5_K_M.gguf Q5_K_M
[ 731/ 733] blk.39.ffn_up_exps.weight - [ 2048, 512, 256, 1], type = f16, converting to q5_K .. size = 512.00 MiB -> 176.00 MiB
[ 732/ 733] blk.39.ffn_up_shexp.weight - [ 2048, 512, 1, 1], type = f16, converting to q5_K .. size = 2.00 MiB -> 0.69 MiB
[ 733/ 733] blk.39.post_attention_norm.weight - [ 2048, 1, 1, 1], type = f32, size = 0.008 MiB
llama_model_quantize_impl: model size = 66152.24 MiB (16.01 BPW)
llama_model_quantize_impl: quant size = 24145.21 MiB (5.84 BPW)
main: quantize time = 473736.51 ms
main: total time = 473736.51 ms
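The two size lines in the llama-quantize log can be cross-checked with a little arithmetic. The sketch below is only an illustration (the function names are made up here and are not part of llama.cpp): it derives an approximate weight count from each reported size and bits-per-weight (BPW) figure, and the fraction of the F16 size that the Q5_K_M file keeps. The BPW values in the log are rounded, so the results are approximate.

```python
MIB = 1024 * 1024  # bytes per MiB

def estimate_params(size_mib: float, bits_per_weight: float) -> float:
    """Approximate number of weights implied by a model size and its BPW."""
    return size_mib * MIB * 8 / bits_per_weight

def size_ratio(quant_mib: float, full_mib: float) -> float:
    """Fraction of the original size that remains after quantization."""
    return quant_mib / full_mib

# Values copied from the llama-quantize log above.
f16_mib, f16_bpw = 66152.24, 16.01
q5_mib, q5_bpw = 24145.21, 5.84

if __name__ == "__main__":
    print(f"params from F16 : {estimate_params(f16_mib, f16_bpw) / 1e9:.2f} B")
    print(f"params from Q5  : {estimate_params(q5_mib, q5_bpw) / 1e9:.2f} B")
    print(f"size ratio      : {size_ratio(q5_mib, f16_mib):.3f}")
```

Both estimates come out close to 34.7 billion weights, which agrees with the "35B" in the model name, and the Q5_K_M file is roughly 36.5% of the F16 size.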
Generate the multimodal projector (mmproj) file
(.venv) bluesanta@gx10-3b16:~/llm/llama.cpp$ python convert_hf_to_gguf.py ~/llm/models/Qwen3.6-35B-A3B --outfile ~/llm/models/Qwen3.6-35B-A3B-mmproj-f16.gguf --outtype f16 --mmproj
INFO:hf-to-gguf:Loading model: Qwen3.6-35B-A3B
INFO:hf-to-gguf:Model architecture: Qwen3_5MoeForConditionalGeneration
INFO:gguf.gguf_writer:/home/bluesanta/llm/models/Qwen3.6-35B-A3B-mmproj-f16.gguf: n_tensors = 334, total_size = 899.3M
Writing: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 899M/899M [00:00<00:00, 1.48Gbyte/s]
INFO:hf-to-gguf:Model successfully exported to /home/bluesanta/llm/models/Qwen3.6-35B-A3B-mmproj-f16.gguf
Verify with llama.cpp
(.venv) bluesanta@ubuntu:~/llm$ llama-cli -m ~/llm/models/Qwen3.6-35B-A3B-Q5_K_M.gguf -n 512 --color on -ngl 99 -p "당신은 유능한 AI 어시스턴트입니다. Jetson AGX Orin에 대해 설명해주세요."
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 62827 MiB):
Device 0: Orin, compute capability 8.7, VMM: yes, VRAM: 62827 MiB
Loading model...
▄▄ ▄▄
██ ██
██ ██ ▀▀█▄ ███▄███▄ ▀▀█▄ ▄████ ████▄ ████▄
██ ██ ▄█▀██ ██ ██ ██ ▄█▀██ ██ ██ ██ ██ ██
██ ██ ▀█▄██ ██ ██ ██ ▀█▄██ ██ ▀████ ████▀ ████▀
██ ██
▀▀ ▀▀
build : b9123-927dada6c
model : Qwen3.6-35B-A3B-Q5_K_M.gguf
modalities : text
available commands:
/exit or Ctrl+C stop or exit
/regen regenerate the last response
/clear clear the chat history
/read add a text file
/glob add text files using globbing pattern
> 당신은 유능한 AI 어시스턴트입니다. Jetson AGX Orin에 대해 설명해주세요.
[Start thinking]
Here's a thinking process:
1. **Analyze User Input:**
- **User:** "당신은 유능한 AI 어시스턴트입니다. Jetson AGX Orin 에 대해 설명해주세요."
- **Language:** Korean
- **Topic:** Jetson AGX Orin (NVIDIA's edge AI computing platform)
- **Request:** Explain/describe it
2. **Identify Key Information Needed:**
- What is it? (Overview/Definition)
- Key Specifications (CPU, GPU, Memory, AI Performance, I/O, Power)
- Target Applications/Use Cases
- Key Features/Advantages
- Ecosystem/Software Support
- Comparison/Context (if relevant, but keep it concise)
- Structure the response logically in Korean
3. **Gather Facts (Internal Knowledge up to 2024):**
- *Manufacturer:* NVIDIA
- *Series:* Jetson family (edge AI)
- *Architecture:* Ampere GPU architecture, Carmel CPU cores
- *AI Performance:* Up to 275 TOPS (INT8)
- *GPU:* 1024-core NVIDIA Ampere architecture GPU with 32 Tensor cores
- *CPU:* 12-core ARM64 CPU (8 high-performance + 4 high-efficiency)
- *Memory:* 64GB LPDDR5 (102.4 GB/s bandwidth)
- *Storage/Connectivity:* PCIe 4.0, NVMe SSD support, USB 3.2, Ethernet, MIPI CSI-2, etc.
- *Power:* 60W max (configurable), TDP 60W
- *Use Cases:* Autonomous mobile robots, smart cameras, medical devices, industrial automation, drones, etc.
- *Software:* JetPack SDK, CUDA, cuDNN, TensorRT, DeepStream, Isaac ROS, Docker support, Linux-based
- *Form Factors:* Developer Kit, Industrial Module (for OEMs), Edge Cloud Gateway
- *Successor/Context:* Follows Xavier AGX, successor to Jetson AGX Xavier, part of NVIDIA's edge AI lineup
4. **Structure the Response (in Korean):**
- 소개 (Overview)
- 주요 스펙 (Key Specs
[ Prompt: 111.5 t/s | Generation: 30.9 t/s ]
>
Run the API server
(.venv) bluesanta@bluesanta-desktop:~/llm/llama.cpp$ llama-server -m ~/llm/models/Qwen3.6-35B-A3B-Q4_K_M.gguf --host 0.0.0.0 --port 8080 -ngl 99 -c 32768 -np 1
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 62840 MiB):
Device 0: Orin, compute capability 8.7, VMM: yes, VRAM: 62840 MiB
build_info: b8851-e365e658f
system_info: n_threads = 12 (n_threads_batch = 12) / 12 | CUDA : ARCHS = 500,610,700,750,800,860,890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | DOTPROD = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
init: using 11 threads for HTTP server
start: binding port with default address family
srv load_model: loading model '/home/bluesanta/llm/models/Qwen3.6-35B-A3B-Q4_K_M.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
llama_params_fit_impl: projected to use 21098 MiB of device memory vs. 52336 MiB of free device memory
llama_params_fit_impl: will leave 31238 >= 1024 MiB of free device memory, no changes needed
llama_params_fit: successfully fit params to free device memory
llama_params_fit: fitting params to free memory took 1.10 seconds
llama_model_load_from_file_impl: using device CUDA0 (Orin) (0000:00:00.0) - 52362 MiB free
llama_model_loader: loaded meta data with 44 key-value pairs and 733 tensors from /home/bluesanta/llm/models/Qwen3.6-35B-A3B-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
init: chat template, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
'
srv init: init: chat template, thinking = 1
main: model loaded
main: server is listening on http://0.0.0.0:8080
main: starting the main loop...
srv update_slots: all slots are idle
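Once the log shows the server listening, it can be queried through llama-server's OpenAI-compatible chat endpoint. The following is a minimal sketch using only the Python standard library; the helper names `build_chat_request` and `chat` are hypothetical, and the host and port are assumed to match the command above (port 8080).

```python
import json
import urllib.request

def build_chat_request(base_url: str, prompt: str, max_tokens: int = 256) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(base_url: str, prompt: str, max_tokens: int = 256) -> str:
    """Send the prompt and return the assistant's reply text."""
    request = build_chat_request(base_url, prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=300) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]

# Example call (requires the server above to be running):
# print(chat("http://localhost:8080", "Describe the Jetson AGX Orin in one sentence."))
```

Because the server log reports `thinking = 1`, a small `max_tokens` budget may be used up by the reasoning phase before any answer text is produced, so a larger value may be needed in practice.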
Create the service file
(.venv) bluesanta@bluesanta-desktop:~/llm$ sudo vi /etc/systemd/system/llama.service
Text only
[Unit]
Description=Llama.cpp Server Service
After=network.target
[Service]
# User account
User=bluesanta
Group=bluesanta
LimitMEMLOCK=infinity
WorkingDirectory=/opt/llama.cpp
# Optimized launch command
# -m /home/bluesanta/llm/models/Qwen3.6-35B-A3B-Q4_K_M.gguf \
# -m /home/bluesanta/llm/models/gemma-4-26b-Q4_K_M.gguf
# --ctx-size 262144
ExecStart=/usr/local/bin/llama-server \
-m /home/bluesanta/llm/models/Qwen3.6-35B-A3B-Q5_K_M.gguf \
--host 0.0.0.0 \
--port 8000 \
--ctx-size 262144 \
--n-gpu-layers 99 \
--flash-attn on \
--spec-draft-n-max 2 \
--mlock \
--cont-batching \
--metrics
# Automatically restart when the process exits
# Restart=always
# RestartSec=5
[Install]
WantedBy=multi-user.target
With --mmproj applied
[Unit]
Description=Llama.cpp Server Service
After=network.target
[Service]
# User account
User=bluesanta
Group=bluesanta
LimitMEMLOCK=infinity
WorkingDirectory=/opt/llama.cpp
# Optimized launch command
# -m /home/bluesanta/llm/models/Qwen3.6-35B-A3B-Q4_K_M.gguf \
# -m /home/bluesanta/llm/models/gemma-4-26b-Q4_K_M.gguf
# -m /home/bluesanta/llm/models/Qwen3.6-35B-A3B-Q5_K_M.gguf \
# --ctx-size 262144
# Change --ctx-size to 65536 (64k) or 32768 (32k)
# --spec-type ngram-mod,draft-mtp --spec-draft-n-max 4
ExecStart=/usr/local/bin/llama-server \
-m /home/bluesanta/llm/models/Qwen3.6-35B-A3B-Q5_K_M.gguf \
--mmproj /home/bluesanta/llm/models/Qwen3.6-35B-A3B-mmproj-f16.gguf \
--host 0.0.0.0 \
--port 8000 \
--ctx-size 131072 \
--n-gpu-layers 99 \
--flash-attn on \
--spec-draft-n-max 2 \
--mlock \
--cont-batching \
--metrics
# Automatically restart when the process exits
# Restart=always
# RestartSec=5
[Install]
WantedBy=multi-user.target
Reload the systemd configuration
(.venv) bluesanta@ubuntu:~/llm/llama.cpp$ sudo systemctl daemon-reload
Start the service
(.venv) bluesanta@ubuntu:~/llm/llama.cpp$ sudo systemctl start llama
Check the service status
(.venv) bluesanta@ubuntu:~/llm/llama.cpp$ sudo systemctl status llama
View the service logs
(.venv) bluesanta@ubuntu:~/llm/llama.cpp$ sudo journalctl -u llama.service -f
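Loading a model of this size takes a while after `systemctl start`, so a client script may want to wait until the server is actually ready. llama-server exposes a `/health` endpoint that answers 200 once the model is loaded and 503 while it is still loading. The sketch below polls it; the function names, timeout and interval are assumptions chosen for illustration, and port 8000 is taken from the unit file above.

```python
import time
import urllib.error
import urllib.request
from typing import Callable

def probe_health(base_url: str) -> int:
    """Return the HTTP status of GET /health, or 0 if the server is unreachable."""
    try:
        with urllib.request.urlopen(base_url.rstrip("/") + "/health", timeout=5) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 503 while the model is still loading
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, timeout, and similar

def wait_until_ready(
    base_url: str,
    timeout_s: float = 600.0,
    interval_s: float = 2.0,
    probe: Callable[[str], int] = probe_health,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until /health answers 200 or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(base_url) == 200:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)

# Example call (requires the service above to be running):
# print(wait_until_ready("http://localhost:8000"))
```

The `probe` and `sleep` parameters exist only so the loop can be exercised without a live server.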
Verify
(.venv) bluesanta@ubuntu:~/llm$ curl http://localhost:8000/completion -H "Content-Type: application/json" -d '{
"prompt": "Jetson AGX Orin의 장점 3가지는?",
"n_predict": 256
}'
{"index":0,"content":"\n\n\n\n\n\nNVIDIA Jetson AGX Orin은 에지 AI(Edge AI) 애플리케이션을 위한 최상위 성능의 임베디드 컴퓨팅 모듈로, 기존 제품 대비 뛰어난 성능과 효율성을 자랑합니다. 주요 장점 3가지는 다음과 같습니다:\n\n1. **뛰어난 AI 추론 성능 (500 TOPS)** \n Jetson AGX Orin은 최대 **500 TOPS**(초당 500조 회 연산)의 INT8 추론 성능을 제공합니다. 이는 이전 세대인 Jetson Xavier NX 대비 약 **20배 이상** 향상된 것으로, 대규모 딥러닝 모델(예: YOLO, ResNet, Vision Transformer 등)을 실시간으로 처리할 수 있어 복잡한 비전 AI, 로봇 공학, 자율 주행 등 고성능이 요구되는 애플리케이션에 이상적입니다.\n\n2. **높은 전력 효율성 대비 고성능** \n 최대 60W까지 전력 소비를 지원하지만, 필요에 따라 **5W부터 60W까지 유연하게 전력 구성**이 가능합니다. 이는 제한된 전력과 열 설계(Power & Thermal Budget) 환경에서도 고성능을 유지하면서도 에너지 효율","tokens":[],"id_slot":3,"stop":true,"model":"Qwen3.6-35B-A3B-Q5_K_M.gguf","tokens_predicted":256,"tokens_evaluated":13,"generation_settings":{"seed":4294967295,"temperature":1.0,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":20,"top_p":0.949999988079071,"min_p":0.05000000074505806,"top_n_sigma":-1.0,"xtc_probability":0.0,"xtc_threshold":0.10000000149011612,"typical_p":1.0,"repeat_last_n":64,"repeat_penalty":1.0,"presence_penalty":0.0,"frequency_penalty":0.0,"dry_multiplier":0.0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":262144,"dry_sequence_breakers":["\n",":","\"","*"],"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"stop":[],"max_tokens":256,"n_predict":256,"n_keep":0,"n_discard":0,"ignore_eos":false,"stream":false,"logit_bias":[],"n_probs":0,"min_keep":0,"grammar":"","grammar_lazy":false,"grammar_triggers":[],"preserved_tokens":[],"chat_format":"Content-only","reasoning_format":"deepseek","reasoning_in_content":false,"generation_prompt":"","samplers":["penalties","dry","top_n_sigma","top_k","typ_p","top_p","min_p","xtc","temperature"],"speculative.types":"none","timings_per_token":false,"post_sampling_probs":false,"backend_sampling":false,"lora":[]},"prompt":"Jetson AGX Orin의 장점 3가지는?","has_new_line":true,"truncated":false,"stop_type":"limit","stopping_word":"","tokens_cached":268,"timings":{"cache_n":0,"prompt_n":13,"prompt_ms":434.387,"prompt_per_token_ms":33.41438461538461,"prompt_per_second":29.927230787293357,"predicted_n":256,"predicted_ms":8410.958,"predicted_per_token_ms":32.8553046875,"predicted_per_second":30.436485356364873}}
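The `timings` object at the end of the response carries enough information to recompute the throughput figures by hand. This small sketch (helper names are illustrative only) turns the token counts and millisecond durations from the response above into tokens per second.

```python
def tokens_per_second(token_count: int, elapsed_ms: float) -> float:
    """Throughput in tokens per second from a count and a duration in ms."""
    return token_count / (elapsed_ms / 1000.0)

def summarize_timings(timings: dict) -> dict:
    """Recompute prompt and generation throughput from a `timings` object."""
    return {
        "prompt_tps": tokens_per_second(timings["prompt_n"], timings["prompt_ms"]),
        "generation_tps": tokens_per_second(timings["predicted_n"], timings["predicted_ms"]),
    }

# Values copied from the response above.
timings = {"prompt_n": 13, "prompt_ms": 434.387, "predicted_n": 256, "predicted_ms": 8410.958}
summary = summarize_timings(timings)
print(f"prompt: {summary['prompt_tps']:.1f} t/s, generation: {summary['generation_tps']:.1f} t/s")
# prints "prompt: 29.9 t/s, generation: 30.4 t/s"
```

The generation rate of about 30.4 t/s is in line with the 30.9 t/s that llama-cli reported earlier. The prompt rate is much lower than llama-cli's 111.5 t/s, most likely because a 13-token prompt is too short for fixed overhead to be amortized.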
Client environment variables
export CLAUDE_CODE_MAX_OUTPUT_TOKENS=64000
export CLAUDE_CODE_USE_OPENAI=1
export OPENAI_API_KEY=sk-jetson-qwen36
export OPENAI_BASE_URL=http://192.168.0.235:8000/v1
export OPENAI_MODEL=Qwen3.6-35B-A3B-Q5_K_M.gguf
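Before pointing a client at these variables, it can help to confirm that the base URL resolves to the running server. The sketch below builds a request for the model list (`/v1/models`, which llama-server provides) from `OPENAI_BASE_URL` and `OPENAI_API_KEY`. The function name is hypothetical. Since the unit files above do not pass `--api-key`, the key is presumably just a placeholder for clients that insist on one.

```python
import os
import urllib.request
from typing import Mapping

def models_request(env: Mapping[str, str] = os.environ) -> urllib.request.Request:
    """Build a GET <OPENAI_BASE_URL>/models request from the environment."""
    base_url = env["OPENAI_BASE_URL"].rstrip("/")
    headers = {}
    api_key = env.get("OPENAI_API_KEY")
    if api_key:
        # Sent as a standard bearer token, as OpenAI-style clients do.
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(base_url + "/models", headers=headers, method="GET")

# Example call (requires the exports above and a reachable server):
# with urllib.request.urlopen(models_request(), timeout=10) as response:
#     print(response.read().decode("utf-8"))
```

If the request succeeds, the model name reported by the server is the value to compare against `OPENAI_MODEL`.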