Reference
Downloading the source
orangepi@orangepi-desktop:~/Llama$ source .venv/bin/activate
(.venv) orangepi@orangepi-desktop:~/Llama$ git clone https://github.com/notpunchnox/rkllama
Update the Python packaging tools to their latest versions
(.venv) orangepi@orangepi-desktop:~/Llama/rkllama$ pip install --upgrade pip setuptools wheel
Installing RKLLama
(.venv) orangepi@orangepi-desktop:~/Llama/rkllama$ python -m pip install .
Successfully built rkllama
Installing collected packages: zipp, Werkzeug, typing-inspection, safetensors, requests, regex, pyyaml, python-dotenv, pydantic-core, itsdangerous, hf-xet, h11, exceptiongroup, click, blinker, annotated-types, rknn-toolkit-lite2, pydantic, importlib_metadata, huggingface_hub, httpcore, Flask, anyio, tokenizers, httpx, flask-cors, transformers, diffusers, rkllama
Attempting uninstall: requests
Found existing installation: requests 2.32.5
Uninstalling requests-2.32.5:
Successfully uninstalled requests-2.32.5
Successfully installed Flask-2.3.2 Werkzeug-3.1.4 annotated-types-0.7.0 anyio-4.12.0 blinker-1.9.0 click-8.3.1 diffusers-0.36.0 exceptiongroup-1.3.1 flask-cors-6.0.1 h11-0.16.0 hf-xet-1.2.0 httpcore-1.0.9 httpx-0.28.1 huggingface_hub-0.36.0 importlib_metadata-8.7.0 itsdangerous-2.2.0 pydantic-2.12.5 pydantic-core-2.41.5 python-dotenv-1.2.1 pyyaml-6.0.3 regex-2025.11.3 requests-2.31.0 rkllama-0.0.52 rknn-toolkit-lite2-2.3.2 safetensors-0.7.0 tokenizers-0.22.1 transformers-4.57.3 typing-inspection-0.4.2 zipp-3.23.0
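To confirm that the package really landed in the virtual environment, the installed version can be checked from Python. This is a minimal sketch using only the standard library; the expected version 0.0.52 comes from the install log above.
from importlib.metadata import version

# Should print 0.0.52, matching the "Successfully installed ... rkllama-0.0.52" line above
print(version("rkllama"))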
Running the RKLLama server
(.venv) orangepi@orangepi-desktop:~/Llama/rkllama$ mkdir ~/Llama/models
(.venv) orangepi@orangepi5pro:~/Llama/rkllama$ rkllama_server --debug --models ~/Llama/models
Disabling PyTorch because PyTorch >= 2.1 is required but found 2.0.1
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
2025-12-09 22:16:51,896 - rkllama.worker - INFO - Models Monitor running.
2025-12-09 22:16:51,911 - rkllama.config - INFO - Created directory: /home/orangepi/Llama/.venv/lib/python3.10/site-packages/rkllama/config/data
2025-12-09 22:16:51,911 - rkllama.config - INFO - Created directory: /home/orangepi/Llama/.venv/lib/python3.10/site-packages/rkllama/config/temp
Start the API at http://localhost:8080
* Serving Flask app 'rkllama.server.server'
* Debug mode: off
2025-12-09 22:16:51,914 - werkzeug - INFO - WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
* Running on all addresses (0.0.0.0)
* Running on http://127.0.0.1:8080
* Running on http://192.168.0.217:8080
2025-12-09 22:16:51,914 - werkzeug - INFO - Press CTRL+C to quit
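Before pulling any models, it is worth checking from a second shell that the API answers. A minimal sketch using requests (installed as an rkllama dependency); it only assumes that the root endpoint responds, which the "GET / HTTP/1.1" 200 line in the access log later in this post confirms.
import requests

# The server binds to 0.0.0.0:8080, so a LAN address also works (e.g. http://192.168.0.217:8080)
resp = requests.get("http://localhost:8080/")
print(resp.status_code, resp.text[:200])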
Downloading models
punchnox/Tinnyllama-1.1B-rk3588-rkllm-1.1.4
(.venv) orangepi@orangepi-desktop:~/Llama$ rkllama_client pull
Repo ID ( example: punchnox/Tinnyllama-1.1B-rk3588-rkllm-1.1.4 ): punchnox/Tinnyllama-1.1B-rk3588-rkllm-1.1.4
File ( example: TinyLlama-1.1B-Chat-v1.0-rk3588-w8a8-opt-0-hybrid-ratio-0.5.rkllm ): TinyLlama-1.1B-Chat-v1.0-rk3588-w8a8-opt-0-hybrid-ratio-0.5.rkllm
Custom Model Name ( example: tinyllama-chat:1.1b ): tinyllama-chat:1.1b
Downloading TinyLlama-1.1B-Chat-v1.0-rk3588-w8a8-opt-0-hybrid-ratio-0.5.rkllm (1126.29 MB)...
Progress: [##################################################] 100.00%
Download complete.
(.venv) orangepi@orangepi-desktop:~/Llama$ rkllama_client list
Available models:
- tinyllama-chat:1.1b
c01zaut/Qwen2.5-3B-Instruct-RK3588-1.1.4
(.venv) orangepi@orangepi-desktop:~/Llama$ rkllama_client pull
Repo ID ( example: punchnox/Tinnyllama-1.1B-rk3588-rkllm-1.1.4 ): c01zaut/Qwen2.5-3B-Instruct-RK3588-1.1.4
File ( example: TinyLlama-1.1B-Chat-v1.0-rk3588-w8a8-opt-0-hybrid-ratio-0.5.rkllm ): Qwen2.5-3B-Instruct-rk3588-w8a8-opt-0-hybrid-ratio-0.5.rkllm
Custom Model Name ( example: tinyllama-chat:1.1b ): Qwen2.5-3B-Instruct-RK3588:1.1.4
Downloading Qwen2.5-3B-Instruct-rk3588-w8a8-opt-0-hybrid-ratio-0.5.rkllm (3565.17 MB)...
Progress: [##################################################] 100.00%
Download complete.
(.venv) orangepi@orangepi-desktop:~/Llama$ rkllama_client list
Available models:
- Qwen2.5-3B-Instruct-RK3588:1.1.4
- tinyllama-chat:1.1b
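Each pulled model ends up in its own subdirectory under the directory passed with --models, together with its Modelfile. A quick look at what is on disk (a minimal sketch; ~/Llama/models is the directory created above):
import os

models_dir = os.path.expanduser("~/Llama/models")
for name in sorted(os.listdir(models_dir)):
    path = os.path.join(models_dir, name)
    if os.path.isdir(path):
        # e.g. the .rkllm weights file plus the generated Modelfile
        print(name, "->", sorted(os.listdir(path)))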
Downloading the tokenizer files of c01zaut/Qwen2.5-3B-Instruct-RK3588-1.1.4 for offline use
(.venv) orangepi@orangepi5pro:~/Llama$ huggingface-cli download \
c01zaut/Qwen2.5-3B-Instruct-RK3588-1.1.4 \
--local-dir ~/Llama/models/Qwen2.5-3B-Instruct-RK3588\:1.1.4/ \
--include "tokenizer.*" "vocab.*" "merges.*" "special_tokens_map.json" "tokenizer_config.json" "config.json"
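With those files sitting next to the .rkllm weights, the tokenizer can be loaded with no network access at all. A minimal sketch; transformers is already installed as an rkllama dependency.
import os
from transformers import AutoTokenizer

model_dir = os.path.expanduser("~/Llama/models/Qwen2.5-3B-Instruct-RK3588:1.1.4")
# local_files_only=True guarantees nothing is fetched from the Hugging Face Hub
tok = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True, local_files_only=True)
print(type(tok).__name__, tok.vocab_size)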
Running the model
Error when running the model
2025-12-16 12:14:03,387 - werkzeug - INFO - 127.0.0.1 - - [16/Dec/2025 12:14:03] "GET / HTTP/1.1" 200 -
FROM: Qwen2.5-3B-Instruct-rk3588-w8a8-opt-0-hybrid-ratio-0.5.rkllm
HuggingFace Path: c01zaut/Qwen2.5-3B-Instruct-RK3588-1.1.4
2025-12-16 12:14:03,440 - rkllama.rkllm - DEBUG - Initializing RKLLM model from /home/orangepi/Llama/models/Qwen2.5-3B-Instruct-RK3588:1.1.4/Qwen2.5-3B-Instruct-rk3588-w8a8-opt-0-hybrid-ratio-0.5.rkllm with options: {'temperature': '0.5', 'num_ctx': '16384', 'max_new_tokens': '16384', 'top_k': '7', 'top_p': '0.5', 'repeat_penalty': '1.1', 'frequency_penalty': '0.0', 'presence_penalty': '0.0', 'mirostat': '0', 'mirostat_tau': '3', 'mirostat_eta': '0.1', 'from': '"Qwen2.5-3B-Instruct-rk3588-w8a8-opt-0-hybrid-ratio-0.5.rkllm"', 'huggingface_path': '"c01zaut/Qwen2.5-3B-Instruct-RK3588-1.1.4"', 'system': '""', 'enable_thinking': 'False'}
I rkllm: rkllm-runtime version: 1.2.3, rknpu driver version: 0.9.8, platform: RK3588
I rkllm: loading rkllm model from /home/orangepi/Llama/models/Qwen2.5-3B-Instruct-RK3588:1.1.4/Qwen2.5-3B-Instruct-rk3588-w8a8-opt-0-hybrid-ratio-0.5.rkllm
E rkllm: max_context[16384] must be less than the model's max_context_limit[4096]
2025-12-16 12:14:04,137 - rkllama.worker - ERROR - Failed creating the worker for model 'Qwen2.5-3B-Instruct-RK3588:1.1.4': Failed to initialize RKLLM model: -1
2025-12-16 12:14:04,145 - werkzeug - INFO - 127.0.0.1 - - [16/Dec/2025 12:14:04] "POST /load_model HTTP/1.1" 400 -
Editing the Modelfile
The error above shows that this model was converted with a max_context_limit of 4096, so NUM_CTX in the Modelfile has to be lowered to 4096 or less (MAX_NEW_TOKENS is reduced to match):
(.venv) orangepi@orangepi5pro:~/Llama$ vi models/Qwen2.5-3B-Instruct-RK3588\:1.1.4/Modelfile
NUM_CTX=4096
MAX_NEW_TOKENS=2048
Running the model
(.venv) orangepi@orangepi5pro:~/Llama/models/Qwen2.5-3B-Instruct-RK3588:1.1.4$ rkllama_client run Qwen2.5-3B-Instruct-RK3588:1.1.4
Model Qwen2.5-3B-Instruct-RK3588:1.1.4 loaded successfully.
Available commands:
/help : Displays this help menu.
/clear : Clears the current conversation history.
/cls or /c : Clears the console content.
/set stream : Enables stream mode.
/unset stream : Disables stream mode.
/set verbose : Enables verbose mode.
/unset verbose : Disables verbose mode.
/set system : Modifies the system message.
exit : Exits the conversation.
You: Hello
Assistant: Hello there! How can I assist you today? Whether it's answering questions, helping with tasks, or just chatting, feel free to let me know how I can help.
You: 안녕
Assistant: 글렌치니# 안녕하세요! 무엇을 도와드릴까요?
You: 세종대왕 알려죠
Assistant: Qwen는 역사적인 세부 사항에 대해 제공하거나 질문에 답변하지 못합니다. 하지만 세종대왕은 조선 시대의 유명한 왕으로, 그의 공과 업적을 알고 싶으시다면 말씀해 주세요. 다른 궁금하신 점이 있으신가요?
You:
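The loaded model can also be queried non-interactively over HTTP. The sketch below assumes the Ollama-compatible /api/chat endpoint advertised in the rkllama README; only POST /load_model is confirmed by the access log in this post, so treat the route and payload as an assumption and check the README for your version.
import requests

# Assumed Ollama-style chat request; verify the route against the rkllama README
payload = {
    "model": "Qwen2.5-3B-Instruct-RK3588:1.1.4",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": False,
}
resp = requests.post("http://localhost:8080/api/chat", json=payload)
print(resp.status_code)
print(resp.text[:500])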
Offline handling
A .rkllm model does not embed a tokenizer; the runtime simply receives the prompt as a plain string. AutoTokenizer, on the other hand, depends on Hugging Face Hub calls or cached files, which causes problems in a network-blocked offline environment. For that reason, the following describes a prompt-handling approach for .rkllm models that avoids AutoTokenizer's network-dependent download path and works entirely from local files.
Environment setup -> restart the server
Export the variables below in the shell that will launch rkllama_server, then restart the server:
$ export HF_HUB_OFFLINE=1
$ export TRANSFORMERS_OFFLINE=1
Editing the server_utils.py source
(.venv) orangepi@orangepi5pro:~/Llama$ vi .venv/lib/python3.10/site-packages/rkllama/api/server_utils.py
@staticmethod
def prepare_prompt(model_name, messages, system="", tools=None, enable_thinking=False):
    """Prepare prompt with proper system handling"""
    # Modification start
    # Get the model-specific tokenizer: either from local files (offline) or from the
    # Hugging Face repo specified in the Modelfile (online)
    hf_offline = os.getenv("HF_HUB_OFFLINE", "0")
    print(f"[ENV] HF_HUB_OFFLINE = {hf_offline}")

    if hf_offline == "1":
        print("👉 HuggingFace OFFLINE mode enabled")
        rkllama_config_models = rkllama.config.get_path("models")
        print(f"model_name = {model_name}, rkllama_config_models = {rkllama_config_models}")
        # tokenizer = AutoTokenizer.from_pretrained("/home/orangepi/Llama/models/Qwen2.5-3B-Instruct-RK3588:1.1.4", trust_remote_code=True, local_files_only=True)
        tokenizer_model_path = os.path.join(rkllama_config_models, model_name)
        print(f"tokenizer_model_path = {tokenizer_model_path}")
        if os.path.isdir(tokenizer_model_path):
            # The tokenizer files downloaded earlier sit next to the .rkllm file,
            # so load them without touching the network
            tokenizer = AutoTokenizer.from_pretrained(tokenizer_model_path, trust_remote_code=True, local_files_only=True)
        else:
            # Fail fast instead of hitting an UnboundLocalError on tokenizer below
            raise FileNotFoundError(f"Model path not found: {tokenizer_model_path}")
    else:
        print("👉 HuggingFace ONLINE mode enabled")
        model_in_hf = get_property_modelfile(model_name, "HUGGINGFACE_PATH", rkllama.config.get_path("models")).replace('"', '').replace("'", "")
        # Get the tokenizer configured for the model
        tokenizer = AutoTokenizer.from_pretrained(model_in_hf, trust_remote_code=True)
    # Modification end

    supports_system_role = "raise_exception('System role not supported')" not in tokenizer.chat_template
    if system and supports_system_role:
        prompt_messages = [{"role": "system", "content": system}] + messages
    else:
        prompt_messages = messages

    prompt_tokens = tokenizer.apply_chat_template(prompt_messages, tools=tools, tokenize=True, add_generation_prompt=True, enable_thinking=enable_thinking)
    return tokenizer, prompt_tokens, len(prompt_tokens)
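The offline branch can also be exercised outside the server, which makes it easy to confirm the local tokenizer files are sufficient before restarting rkllama_server. A minimal sketch that mirrors the patched code above: the same environment switch, the same local_files_only load, and the same apply_chat_template call.
import os
from transformers import AutoTokenizer

os.environ["HF_HUB_OFFLINE"] = "1"   # the same switch the patched prepare_prompt checks

model_dir = os.path.expanduser("~/Llama/models/Qwen2.5-3B-Instruct-RK3588:1.1.4")
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True, local_files_only=True)

messages = [{"role": "user", "content": "Hello"}]
prompt_tokens = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True)
print(len(prompt_tokens), prompt_tokens[:10])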