Introducing the new OpenCL™ GPU Backend in llama.cpp for Qualcomm Adreno GPUs

The Qualcomm Technologies team is thrilled to announce a new OpenCL-based backend for the llama.cpp project. Well optimized for Qualcomm Adreno GPUs in Snapdragon SoCs, this work marks a significant milestone in our continuing efforts to improve the performance and versatility of llama.cpp, a well-recognized project that targets large language models (LLMs) and has been actively evolving within the open-source community. The Adreno OpenCL backend for llama.cpp is now officially upstreamed to the open-source community via CodeLinaro.
What are the benefits of leveraging OpenCL for Adreno?
OpenCL (Open Computing Language) is a framework for writing programs that execute across heterogeneous platforms, including CPUs, GPUs, and other processors. By leveraging OpenCL, we can tap into the computational power of Adreno GPUs, which are widely used in many mobile devices. This integration allows us to optimize llama.cpp for better performance and efficiency on these devices.
Key features and benefits
- Enhanced Performance: The new backend significantly boosts the performance of llama.cpp on Adreno GPUs, enabling faster computations and more efficient processing.
- Broader Compatibility: Although the backend has been highly optimized for Adreno GPUs, it should run on any GPU that supports the OpenCL 3.0 standard with subgroup support.
- High Flexibility: Users may modify and optimize the backend for different GPUs, because the current solution relies only on standard OpenCL features. For example, the backend can be extended with vendor extensions targeting other GPUs.
- Open-Source Collaboration: This update is a testament to the power of open-source collaboration. We have worked closely with the community so that this backend meets the needs of developers and users alike.
The new backend leverages the capabilities of OpenCL to offload computationally intensive tasks to the GPU, freeing up the CPU for other operations. This parallel processing capability is particularly beneficial for applications that require high computational power, such as machine learning.
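To gauge whether a non-Adreno GPU is likely to meet the OpenCL 3.0 and subgroup requirement mentioned above, one option is to inspect the report printed by the `clinfo` utility. The sketch below is a heuristic under stated assumptions, not part of llama.cpp: the helper name `check_opencl_device` is hypothetical, it assumes `clinfo`'s usual "Device Version" line format, and it treats the `cl_khr_subgroups` extension string as the indicator of subgroup support.

```shell
# Hypothetical helper: reads `clinfo` text on stdin and succeeds only if the
# report mentions an OpenCL 3.x device version and the cl_khr_subgroups
# extension. It scans the whole report, so on multi-device systems the two
# matches may come from different devices.
check_opencl_device() {
  info=$(cat)
  printf '%s\n' "$info" | grep -Eq 'Device Version +OpenCL 3\.' || return 1
  printf '%s\n' "$info" | grep -q 'cl_khr_subgroups' || return 1
}
```

A typical use would be `clinfo | check_opencl_device && echo "looks compatible"`. A passing check only suggests compatibility; it says nothing about performance on that GPU.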
Tested llama.cpp models and platforms
The team has rigorously tested llama.cpp with various large language models to confirm its robustness and performance. These tests include:
- Meta’s Llama models, including Llama 2 and Llama 3 models with 7 billion (7B) and 8B parameters, among others.
- Gemma 1 and 2 2B models, and Phi-3 mini.
- Mistral 7B models
- Bilingual models such as Qwen 1 and 2 7B, and Baichuan 7B.
The backend has been tested on many premium devices powered by Snapdragon SoCs:
- Laptops running Windows 11 with Snapdragon X Elite and Snapdragon X Plus chips
- Android smartphones powered by Snapdragon 8 Gen 1, 2, 3, and the latest Snapdragon 8 Elite
These tests demonstrate the backend’s capability to handle diverse and complex models efficiently across different hardware configurations.
How to build and run llama.cpp on Android and Snapdragon X Elite with Windows on Snapdragon®
llama.cpp with the Adreno® OpenCL backend has been well optimized for Android devices powered by the Qualcomm Snapdragon 8 Gen 1, 2, 3, and Elite mobile platforms, as well as for the Snapdragon® X Elite Compute Platform running Windows 11. Here are the instructions to build and run llama.cpp on the two platforms.
Steps for Android
List of prerequisite software (other versions may work) and hardware
- Ubuntu 22.04
- Python3, CMake, Make and Ninja
- C/C++ compiler
- Android NDK version 26.3.11579264, installed in /opt/android-sdk/ndk/26.3.11579264/
- An Android device powered by Qualcomm Snapdragon 8 Gen 1, 2, 3, or Elite mobile platforms.
Prepare OpenCL
The files required for building against OpenCL are not included in the NDK distribution. Users must download the OpenCL headers and the ICD loader, both freely available, from the official Khronos® OpenCL repositories. These files are then used along with the Android NDK to build the llama.cpp executables.
Obtain the official OpenCL headers
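The exact commands for this step are not reproduced here. A minimal sketch, assuming the NDK path from the prerequisites list, a Linux x86-64 NDK sysroot layout, and the public Khronos `OpenCL-Headers` repository on GitHub; the function name `install_opencl_headers` is hypothetical:

```shell
# Assumed NDK location (from the prerequisites list); adjust to your install.
NDK=/opt/android-sdk/ndk/26.3.11579264
SYSROOT="$NDK/toolchains/llvm/prebuilt/linux-x86_64/sysroot"

# Clone the Khronos headers into the current directory and copy the CL/
# header directory into the NDK sysroot so the cross-compiler can find it.
install_opencl_headers() {
  git clone https://github.com/KhronosGroup/OpenCL-Headers &&
  cp -r OpenCL-Headers/CL "$SYSROOT/usr/include"
}
```

Calling `install_opencl_headers` from a scratch directory performs the clone and copy. Writing into the NDK directory may require elevated permissions, depending on how the NDK was installed.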
Build the OpenCL ICD loader
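The exact commands for this step are not reproduced here. A minimal sketch, assuming the OpenCL headers are already in the NDK sysroot, the NDK path from the prerequisites list, and the public Khronos `OpenCL-ICD-Loader` repository; the function name `build_icd_loader`, the `build_ndk` directory name, and the API level 24 are illustrative choices:

```shell
# Assumed NDK location (from the prerequisites list); adjust to your install.
NDK=/opt/android-sdk/ndk/26.3.11579264
SYSROOT="$NDK/toolchains/llvm/prebuilt/linux-x86_64/sysroot"

# Cross-compile the ICD loader for 64-bit Arm Android with the NDK toolchain
# file, then copy libOpenCL.so into the sysroot so later builds can link it.
build_icd_loader() {
  git clone https://github.com/KhronosGroup/OpenCL-ICD-Loader &&
  cmake -S OpenCL-ICD-Loader -B OpenCL-ICD-Loader/build_ndk -G Ninja \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_TOOLCHAIN_FILE="$NDK/build/cmake/android.toolchain.cmake" \
    -DOPENCL_ICD_LOADER_HEADERS_DIR="$SYSROOT/usr/include" \
    -DANDROID_ABI=arm64-v8a \
    -DANDROID_PLATFORM=24 \
    -DANDROID_STL=c++_shared &&
  cmake --build OpenCL-ICD-Loader/build_ndk &&
  cp OpenCL-ICD-Loader/build_ndk/libOpenCL.so \
    "$SYSROOT/usr/lib/aarch64-linux-android"
}
```

The copied library is only used at link time; on the device, the application loads the vendor-provided OpenCL library.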
Build llama.cpp with the Adreno OpenCL backend
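The exact commands for this step are not reproduced here. A minimal sketch, assuming the OpenCL headers and `libOpenCL.so` are already in the NDK sysroot and that the upstream repository is cloned from GitHub; the function name `build_llama_android` and the API level `android-28` are illustrative, while `GGML_OPENCL` is the CMake option that enables the OpenCL backend:

```shell
# Assumed NDK location (from the prerequisites list); adjust to your install.
NDK=/opt/android-sdk/ndk/26.3.11579264

# Configure and build llama.cpp for 64-bit Arm Android with OpenCL enabled.
# Output goes to llama.cpp/build, i.e. build/bin relative to the repository.
build_llama_android() {
  git clone https://github.com/ggml-org/llama.cpp &&
  cmake -S llama.cpp -B llama.cpp/build -G Ninja \
    -DCMAKE_TOOLCHAIN_FILE="$NDK/build/cmake/android.toolchain.cmake" \
    -DANDROID_ABI=arm64-v8a \
    -DANDROID_PLATFORM=android-28 \
    -DBUILD_SHARED_LIBS=OFF \
    -DGGML_OPENCL=ON &&
  cmake --build llama.cpp/build
}
```

Building static libraries (`BUILD_SHARED_LIBS=OFF`) keeps the number of files that must be copied to the device small.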
If built successfully, the executable will be located at build/bin
Steps for Snapdragon X Elite with Windows on Snapdragon®
List of prerequisite software (other versions may work) and hardware
- Visual Studio 2022 (Community or Professional edition)
- Python3, CMake and Ninja
- LLVM 19 (can be downloaded from https://github.com/llvm/llvm-project/releases/tag/llvmorg-19.1.0)
- A laptop powered by Snapdragon X Elite
Prepare OpenCL
The OpenCL headers and ICD loader can be obtained using an approach similar to the one for the Android environment. For simplicity, we assume the OpenCL files are installed in C:\OpenCL.
OpenCL headers
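The exact commands for this step are not reproduced here. A minimal sketch, assuming a POSIX-style shell on Windows (for example Git Bash) with `git`, `cmake`, and `ninja` on the PATH; in PowerShell the same `git` and `cmake` invocations apply with that shell's syntax. The function name `install_opencl_headers_windows` is hypothetical, and the install prefix follows the C:\OpenCL assumption above:

```shell
# Assumed install prefix (from the text above); CMake accepts forward slashes.
OPENCL_PREFIX="C:/OpenCL"

# Clone the Khronos headers and install them under the prefix, with the
# repository's own tests switched off to keep the build minimal.
install_opencl_headers_windows() {
  git clone https://github.com/KhronosGroup/OpenCL-Headers &&
  cmake -S OpenCL-Headers -B OpenCL-Headers/build -G Ninja \
    -DBUILD_TESTING=OFF \
    -DOPENCL_HEADERS_BUILD_TESTING=OFF \
    -DOPENCL_HEADERS_BUILD_CXX_TESTS=OFF \
    -DCMAKE_INSTALL_PREFIX="$OPENCL_PREFIX" &&
  cmake --build OpenCL-Headers/build --target install
}
```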
OpenCL ICD loader
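The exact commands for this step are not reproduced here. A minimal sketch, assuming the OpenCL headers are already installed under C:\OpenCL and a POSIX-style shell on Windows (for example Git Bash) with the prerequisite tools on the PATH; the function name `build_icd_loader_windows` is hypothetical:

```shell
# Assumed install prefix (from the text above); CMake accepts forward slashes.
OPENCL_PREFIX="C:/OpenCL"

# Build the ICD loader against the installed headers (found through
# CMAKE_PREFIX_PATH) and install it under the same prefix.
build_icd_loader_windows() {
  git clone https://github.com/KhronosGroup/OpenCL-ICD-Loader &&
  cmake -S OpenCL-ICD-Loader -B OpenCL-ICD-Loader/build -G Ninja \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_PREFIX_PATH="$OPENCL_PREFIX" \
    -DCMAKE_INSTALL_PREFIX="$OPENCL_PREFIX" &&
  cmake --build OpenCL-ICD-Loader/build --target install
}
```

Installing headers and loader under one prefix means a single `CMAKE_PREFIX_PATH` entry is enough for later builds to find both.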
Build llama.cpp
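The exact commands for this step are not reproduced here. A minimal sketch, assuming OpenCL is installed under C:\OpenCL, LLVM's clang is on the PATH, and a POSIX-style shell on Windows (for example Git Bash); the function name `build_llama_windows` is hypothetical, and the toolchain file `cmake/arm64-windows-llvm.cmake` is the one shipped in the llama.cpp repository:

```shell
# Assumed OpenCL location (from the text above); CMake accepts forward slashes.
OPENCL_PREFIX="C:/OpenCL"

# Configure and build llama.cpp for Windows on Arm with OpenCL enabled.
# The subshell keeps the caller's working directory unchanged; the relative
# toolchain path is resolved by CMake against the source directory.
build_llama_windows() (
  git clone https://github.com/ggml-org/llama.cpp &&
  cd llama.cpp &&
  cmake -B build -G Ninja \
    -DCMAKE_TOOLCHAIN_FILE=cmake/arm64-windows-llvm.cmake \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_PREFIX_PATH="$OPENCL_PREFIX" \
    -DBUILD_SHARED_LIBS=OFF \
    -DGGML_OPENCL=ON &&
  cmake --build build
)
```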
If built successfully, the executable will be located at build\bin
Launch the executable
Here is an example of how to run the llama.cpp executable:
./llama-cli -m ggml-model-qwen1.5-7b-chat-Q4_0.gguf -b 128 -ngl 99 -c 2048 -p "Hello"
Note that the Adreno OpenCL backend is currently optimized for weights that use the Q4_0 quantization scheme. Optimization for weights using other schemes, such as FP16 and Q6, is in progress, and we will provide updates soon.
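Because the backend currently favors Q4_0 weights, a model stored in another format may first need to be converted. The sketch below shows one way to do that with llama.cpp's `llama-quantize` tool and then run the result with the same options as the example above. The file names and the function names `quantize_q4_0` and `run_q4_0` are hypothetical, and both executables are assumed to be in the current directory:

```shell
# Hypothetical file names; substitute your own model files.
F16_MODEL="model-f16.gguf"
Q4_MODEL="model-Q4_0.gguf"

# Convert an FP16 GGUF model to the Q4_0 quantization scheme.
quantize_q4_0() {
  ./llama-quantize "$F16_MODEL" "$Q4_MODEL" Q4_0
}

# Run the quantized model with the options from the example above:
#   -m    model file
#   -b    batch size used for prompt processing
#   -ngl  number of layers to offload to the GPU (99 exceeds the layer
#         count of the tested 7B models, so every layer is offloaded)
#   -c    context size in tokens
#   -p    prompt text
run_q4_0() {
  ./llama-cli -m "$Q4_MODEL" -b 128 -ngl 99 -c 2048 -p "Hello"
}
```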
Future Work
The Qualcomm team is working on bringing more Adreno-specific features into the OpenCL backend. Adreno GPUs support a wide range of extensions that enable better performance and power efficiency. For instance, we support features such as integer dot product and on-chip global memory (please refer to the Adreno SDK from Qualcomm Developer).
Conclusion
The addition of the OpenCL GPU backend for Adreno GPUs is a significant step forward for llama.cpp. We are excited to see how this enhancement will be utilized by the community and look forward to your feedback.
Want to know more? Join our Discord community to engage with Qualcomm Technologies’ experts in real time, connect with fellow developers and join exclusive virtual events.
Author: Hongqiang Wang is a Principal Engineer/Manager at the GPU Research Team in Qualcomm, where he spearheads initiatives to enhance GPGPU/OpenCL solutions for Adreno GPUs. His focus spans both architecture and programming, with a keen eye on advancing Adreno’s AI/ML capabilities and addressing traditional image/video GPGPU use cases for edge devices. He represents Qualcomm in the Khronos OpenCL and SYCL working groups and serves as the primary author of the OpenCL General Programming and Optimization for Adreno GPUs.