Hexagon SDK(cDSP)
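The Qualcomm Hexagon SDK provides tools, software, and documentation that help developers make use of the Hexagon DSPs on Qualcomm devices. Programming the CPU and a DSP together in this way is known as heterogeneous computing.
Installation
SDK installation guide: link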
# List the names of all installable packages
qpm-cli --product-list
# Install the SDK on Linux
qpm-cli --install hexagonsdk5.x --config installConfig.json
# Install tools
qpm-cli --extract hexagon8.7 --config installConfig.json
# Install Halide tools
qpm-cli --extract halide2.4 --config installConfig.json
installConfig.json (the "Addons" entry is optional):
{
"CustomInstallPath" : "/myTools/SDK/Hexagon",
"Addons" : ["HexagonSDK5x_FullNDK", "HexagonSDK5x_Eclipse"]
}
Documentation
Documentation is in the docs folder of the SDK, and additional programming references are in docs/pdf. (Copies are kept in gitea.)
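Hardware information
The V75 programming guide is docs/pdf/80-N2040-57_AA_Hexagon_V75_Programmers_Reference_Manual.pdf
Architecture overview
Processors
Qualcomm Technologies, Inc. (QTI) offers a wide range of chip solutions. Its mobile products are organized into five tiers. The top tiers are the SM8xxx series (premium) and the SM7xxx series (high tier); the lower tiers are the SM6xx, SM4xx, and SM2xx series.
The tiers differ mainly in their scalable computing resources, such as the CPU, GPU, and DSPs. Moving from the lower tiers to the higher ones, the number of processors, their complexity, and their clock frequencies all increase. For the detailed differences, see Qualcomm's product documentation.
Taking the SM8150 as an example, the chip contains a Kryo (/ˈkriː.oʊ/) CPU, an Adreno (/əˈdriːnoʊ/) 640 GPU, and four Hexagon DSPs: the sDSP (sensor; SLPI, Sensor Low Power Island), the mDSP (modem), the aDSP (audio), and the cDSP (compute). Of these, the cDSP is the only one that can run an unsigned PD, that is, code that has not been signed for the device.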

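The compute DSP (cDSP) is intended for compute-intensive tasks such as image processing, computer vision, and camera streaming. It also provides a set of fixed-point vector instructions called the Hexagon Vector eXtensions (HVX). Starting with Lahaina (the internal code name of the Qualcomm SM8350 SoC), the cDSP is also called the Qualcomm Hexagon Tensor Processor (HTP), reflecting its efficient handling of neural-network tensor data; the Hexagon SDK uses the name HTP for the cDSP as well.
Compared with the CPU, the cDSP runs at a lower clock frequency but offers more parallelism at the instruction level. RPC (remote procedure call) is the usual mechanism for offloading work to the DSP.
The table below lists the characteristics of the cDSPs in some of the chips supported by the Hexagon SDK: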
| Chip name | cDSP | Turbo L1 | Turbo | Nominal | HVX | HMX | L2 | VTCM |
|---|---|---|---|---|---|---|---|---|
| Waipio | V69 | 1.5 GHz | 1.4 GHz | 1.2 GHz | 4 | 1 | 1 MB | 8 MB |
| Lahaina | V68 | 1.5 GHz | 1.4 GHz | 1.2 GHz | 4 | 1 | 1 MB | 4 MB |
Architecture overview: the figure below shows how the processing units in the cDSP connect to the memory caches (V66):

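A cDSP typically contains several DSP hardware threads (four or six). Every hardware thread has access to the Hexagon scalar instructions, which perform fixed-point and floating-point operations on single 32-bit registers or on register pairs.
Each data unit executes either one 64-bit load/store operation or one 32-bit scalar ALU operation.
Before V66, all execution units shared the floating-point and multiplier resources, so only one floating-point multiply could execute per clock cycle across all of them. From V66 onward, each cluster (the green blocks in the figure; one cluster holds two threads) has its own floating-point and multiplier resources.
As the figure shows, a cluster is a pair of threads (Thread 0 & 1, and Thread 2 & 3). Within a cluster, the two threads usually issue instruction packets on alternating clock cycles, because most instructions need at least two cycles to complete. Ideally, each cluster completes one instruction per DSP clock cycle.
Hexagon HVX unit: HVX is a coprocessor that adds 128-byte (1024-bit) vector processing capability. The scalar hardware threads access the HVX coprocessor through an HVX context (that is, an HVX register file).
Hexagon HMX unit: HMX is a matrix engine (introduced with Lahaina) that provides high-throughput convolution operations. The Hexagon SDK does not directly expose the instructions of this unit; the QNN SDK does, and neural-network workloads rely heavily on the HMX engine.
Memory subsystem
32-bit address space
The figure below gives an overview of the DSP memory subsystem:
The cDSP has a two-level cache hierarchy. The L1 cache is accessible only to the scalar unit, while the L2 cache is accessed by both the scalar unit and the HVX coprocessor.
The L1 cache is write-through only, which keeps the caches coherent in hardware. If an HVX store hits a line in L1, that line is invalidated (and is reloaded from memory on a later access).
The vector unit supports several kinds of load/store instructions, including unaligned vector accesses and per-byte conditional stores.
The cDSP also includes a TCM (Tightly Coupled Memory) called VTCM (vector tightly coupled memory).
The Hexagon DSP uses a unified, byte-addressable memory with a single 32-bit virtual address space (so a DSP process can address at most 4 GB, code and data combined), and it is little-endian.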
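To make the last paragraph concrete, here is a minimal, portable C sketch of the two facts it states: a 32-bit byte address reaches 2^32 bytes, and little-endian storage puts the least significant byte at the lowest address. The names `ADDRESS_SPACE_BYTES`, `store_le32`, and `load_le32` are made up for this illustration and are not part of the Hexagon SDK.

```c
#include <stdint.h>

/* Bytes reachable with a 32-bit byte address: 2^32 = 4 GiB. */
const uint64_t ADDRESS_SPACE_BYTES = UINT64_C(1) << 32;

/* Store a 32-bit value in little-endian order:
 * the least significant byte goes to the lowest address. */
void store_le32(uint8_t *dst, uint32_t value)
{
    dst[0] = (uint8_t)(value & 0xFFu);
    dst[1] = (uint8_t)((value >> 8) & 0xFFu);
    dst[2] = (uint8_t)((value >> 16) & 0xFFu);
    dst[3] = (uint8_t)((value >> 24) & 0xFFu);
}

/* Read the value back from its little-endian byte layout. */
uint32_t load_le32(const uint8_t *src)
{
    return (uint32_t)src[0]
         | ((uint32_t)src[1] << 8)
         | ((uint32_t)src[2] << 16)
         | ((uint32_t)src[3] << 24);
}
```

Storing `0x11223344` this way yields the byte sequence `44 33 22 11`, which is the layout to expect when inspecting DSP buffers byte by byte.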
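Virtual memory
QuRT manages virtual memory and its mapping to physical memory.
Registers
The Hexagon processor has two kinds of registers: general registers and control registers.
There are thirty-two 32-bit general registers (named R0 - R31), which can be accessed as single 32-bit registers or as aligned 64-bit register pairs. The general registers hold pointer, scalar, vector, and accumulator data.
The control registers are special-purpose registers such as the program counter, the status register, and the loop registers.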
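The "aligned 64-bit register pair" idea for the general registers can be modelled in a few lines of portable C. This is only a sketch: it assumes the usual Hexagon pair notation such as R1:0 and R3:2, where the odd register holds the upper word and the even register the lower word. The helper names `make_pair` and `is_aligned_pair` are hypothetical.

```c
#include <stdint.h>

/* Value of a register pair such as R1:0: the odd register supplies
 * the upper 32 bits and the even register the lower 32 bits. */
uint64_t make_pair(uint32_t odd_reg_value, uint32_t even_reg_value)
{
    return ((uint64_t)odd_reg_value << 32) | even_reg_value;
}

/* A pair is aligned when it starts at an even register index and
 * uses the next register as its upper half: R1:0, R3:2, ..., R31:30. */
int is_aligned_pair(unsigned hi_index, unsigned lo_index)
{
    return (lo_index % 2u == 0u) && (hi_index == lo_index + 1u) && (hi_index < 32u);
}
```

Under this model R1:0 and R31:30 are valid pairs, while R2:1 is not, because it would start at an odd register.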
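Instruction sequencer
The instruction sequencer processes one packet of one to four instructions per clock cycle. If a packet contains more than one instruction, those instructions execute in parallel.
Data types
Fixed-point
The Hexagon processor operates on 8-, 16-, 32-, and 64-bit fixed-point data, covering signed and unsigned integers as well as fixed-point fractions.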
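As an illustration of what "fixed-point fraction" and "saturation" mean in practice, the following portable C sketch implements Q15 arithmetic (16-bit values scaled by 2^15). It is a host-side model under stated assumptions, not Hexagon code: the function names are invented here, and the rescale uses division (truncating toward zero) where hardware would normally shift.

```c
#include <stdint.h>

#define Q15_ONE 32768 /* scale factor 2^15: the int16 value 16384 means 0.5 */

/* Clamp a 32-bit intermediate result into the int16 range. */
int16_t saturate16(int32_t x)
{
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return (int16_t)x;
}

/* Multiply two Q15 fractions: widen to 32 bits, rescale by 2^15
 * (division truncates toward zero), then saturate. */
int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t product = (int32_t)a * (int32_t)b;
    return saturate16(product / Q15_ONE);
}

/* Saturating Q15 addition: overflow clamps instead of wrapping. */
int16_t q15_add_sat(int16_t a, int16_t b)
{
    return saturate16((int32_t)a + (int32_t)b);
}
```

For example, `q15_mul(16384, 16384)` represents 0.5 × 0.5 and returns 8192 (0.25), while `q15_mul(-32768, -32768)` would be +1.0, which Q15 cannot represent, so it saturates to 32767.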
Scalar operations
- Multiplication of 16-bit, 32-bit, and complex data
- Addition and subtraction of 16-, 32-, and 64-bit data (with and without saturation)
- Logical operations on 32- and 64-bit data (AND, OR, XOR, NOT)
- Shifts on 32- and 64-bit data (arithmetic and logical)
- Min/max, negation, absolute value, parity, norm, swizzle
- Compares of 8-, 16-, 32-, and 64-bit data
- Sign and zero extension (8- and 16- to 32-bit, 32- to 64-bit)
- Bit manipulation
- Predicate operations
Vector operations
- Multiplication (halfwords, word by half, vector reduce, dual multiply)
- Addition and subtraction of word and halfword data
- Shifts on word and halfword data (arithmetic and logical)
- Min/max, average, negative average, absolute difference, absolute value
- Compares of word, halfword, and byte data
- Reduce, sum of absolute differences on unsigned bytes
- Special-purpose data arrangement (such as pack, splat, shuffle, align, saturate, splice, truncate, complex conjugate, complex rotate, zero extend)
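Floating-point
The processor handles 32-bit IEEE single-precision floating-point numbers, which are stored in the general registers.
Floating-point operations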
- Addition and subtraction
- Multiplication (with optional scaling)
- Min/max/compare
- Reciprocal/square root approximation
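- Format conversion
TODO (to be continued)
DSP architecture support
From my testing so far, compatibility is backward: a phone that supports v75 can also run libraries built for the older architecture versions listed below.
| DSP architecture | v75 | v73 | v69 | v68 | v66 | v65 |
|---|---|---|---|---|---|---|
| Chip | SM8650 | SM8550 | | | | |
DSP code names and the corresponding chip part numbers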
| Code Name | Part number |
|---|---|
| Lanai | SM8650-AB |
| Kailua | SM8550 |
| Waipio | SM8450 |
| Lahaina | SM8350 |
| Palima | SM8475 |
Feature matrix
docs/reference/feature_matrix.html
| Targets | Simulator | Lanai | Divar | Lahaina/Cedros/Kodiak | Waipio/Palima/Fillmore | Clarence | Bitra | Agatti | Kailua | Netrani | Tofino | Strait | SXR2130P | QCS405 | QCS403 | QCS8550 | QCS610 | QCS605 | QRB5165 | ENEL | Camano | SW5100 | Waipio LE | Anorak | Neo | QCM6490 | Palawan |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Operating System | |||||||||||||||||||||||||||
| AP | None | LA | LA | LA | LA | LA | LA | LA | LA | LA | LA | LA | LA | LE | LE | LE | LA/LE | LA/LE | UBUNTU/LE | UBUNTU | LA | LA | LE | LA | LA/LE | LE | LA |
| DSP | QuRT | QuRT | QuRT | QuRT | QuRT | QuRT | QuRT | QuRT | QuRT | QuRT | QuRT | QuRT | QuRT | QuRT | QuRT | QuRT | QuRT | QuRT | QuRT | QuRT | QuRT | QuRT | QuRT | QuRT | QuRT | QuRT | QuRT |
| DSPs Supported | |||||||||||||||||||||||||||
| aDSP | Yes (>=v65) | Yes (v73) | Yes (v66) | Yes (v66) | Yes (v66) | Yes (v66) | Yes (v66) | Yes (v66) | Yes (v73) | Yes (v66) | Yes (v66) | Yes (v66) | Yes (v66) | Yes (v66) | Yes (v66) | Yes (v73) | Yes (v66) | Yes (v65) | Yes (v66) | No | Yes (v73) | Yes (v66) | Yes (v66) | Yes (v66) | Yes (v73) | Yes (v66) | Yes (v73) |
| cDSP | Yes (>=v65) | Yes (v75) | Yes (v66) | Yes (v68) | Yes (v69) | No | Yes (v66) | No | Yes (v73) | Yes (v73) | Yes (v69) | Yes (v66) | Yes (v66) | Yes (v66) | Yes (v66) | Yes (v73) | Yes (v66) | Yes (v65) | Yes (v66) | Yes (v66) | Yes (v73) | No | Yes (v69) | Yes (v69) | Yes (v73) | Yes (v68) | Yes (v73) |
| sDSP | No | No | No | Yes (v66) (For Lahaina) No (For Cedros/Kodiak) | Yes (v66) (For Waipio/Palima) No (For Filmore) | No | No | No | No | No | Yes (v66) | No | Yes (v66) | No | No | No | No | No | Yes (v66) | No | No | No | Yes (v66) | No | No | No | No |
| mDSP | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No |
| Tools Version | |||||||||||||||||||||||||||
| Hexagon Tools | N/A | 8.6(aDSP) 8.7(cDSP) | 8.2 | 8.4 | 8.5 | 8.5 | 8.3 | 8.2 | 8.6 | 8.5(aDSP) 8.6(cDSP) | 8.5 | 8.3 | 8.3 | 8.2 | 8.2 | 8.6 | 8.2 | 8.1 | 8.3 | 8.4 | 8.6 | 8.2 | 8.5 | 8.5 | 8.6 | 8.4 | 8.6 |
| Language | |||||||||||||||||||||||||||
| C++98/11/14 | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| C++17 | Yes | Yes | No | Yes | Yes | Yes | No | No | Yes | Yes | Yes | No | No | No | No | No | No | No | No | No | Yes | No | Yes | Yes | Yes | Yes | Yes |
| Assembly and intrinsics | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Halide | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Debugging | |||||||||||||||||||||||||||
| LLDB | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | No | No | Yes | No | No | No | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| logcat | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| printf() | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Profiling | |||||||||||||||||||||||||||
| SysMon Profiler (UI) | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | No | No | No | Yes | No | No | No | Yes | Yes | No | Yes | No (LE) Yes (LA) | No | Yes |
| Hexagon Trace Analyzer | Yes | Yes | No | Yes | Yes | No | No | No | Yes | Yes | Yes | No | No | No | No | Yes | No | No | No | No | Yes | No | Yes | Yes | Yes | No | Yes |
| SysMonApp (command line interface) | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes |
| itrace | No | Yes | No | Yes | Yes | No | No | No | Yes | Yes | Yes | No | No | No | No | No | No | No | No | No | Yes | No | No | Yes | Yes(LA)/No(LE) | No | No |
| Hexagon Profiler | Yes | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| Hardware Features | |||||||||||||||||||||||||||
| Integer/fixed-point HVX | Yes | cDSP | cDSP | cDSP | cDSP | No | cDSP | No | cDSP | cDSP | cDSP | cDSP | cDSP | cDSP | cDSP | cDSP | cDSP | cDSP | cDSP | Yes | cDSP | No | cDSP | cDSP | cDSP | cDSP | cDSP |
| Floating-point HVX | Yes | cDSP | No | cDSP | cDSP | No | No | No | cDSP | cDSP | cDSP | No | No | No | No | cDSP | No | No | No | No | cDSP | No | cDSP | cDSP | cDSP | cDSP | cDSP |
| HMX | Yes | cDSP | No | cDSP | cDSP | No | No | No | cDSP | cDSP | cDSP | No | No | No | No | cDSP | No | No | No | No | cDSP | No | cDSP | cDSP | cDSP | cDSP | cDSP |
| DCVS v3 | No | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | No | Yes | No | No | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| FastRPC Domains | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| CPZ | No | cDSP | cDSP | cDSP | cDSP | No | cDSP | No | cDSP | cDSP | cDSP | cDSP | cDSP | cDSP | cDSP | cDSP | cDSP | cDSP | cDSP | Yes | cDSP | No | cDSP | cDSP | cDSP | cDSP | cDSP |
| VTCM APIs | Yes | cDSP | cDSP | cDSP | cDSP | No | cDSP | No | cDSP | cDSP | cDSP | cDSP | cDSP | cDSP | cDSP | cDSP | cDSP | cDSP | cDSP | Yes | cDSP | No | cDSP | cDSP | cDSP | cDSP | cDSP |
| Cache locking API v2 | Yes | cDSP | cDSP | cDSP | cDSP | No | cDSP | No | cDSP | cDSP | cDSP | cDSP | cDSP | No | No | cDSP | cDSP | cDSP | cDSP | Yes | cDSP | No | cDSP | cDSP | cDSP | cDSP | cDSP |
| Unsigned PD | No | cDSP | cDSP | cDSP | cDSP | No | cDSP | No | cDSP | cDSP | cDSP | cDSP | cDSP | No | No | cDSP | cDSP | cDSP | cDSP | Yes | cDSP | No | cDSP/aDSP | cDSP | cDSP | cDSP | cDSP |
| Compute resource manager API | Yes | cDSP | cDSP | cDSP | cDSP | No | cDSP | No | cDSP | cDSP | cDSP | cDSP | cDSP | cDSP | No | cDSP | No | No | cDSP | No | cDSP | No | cDSP | cDSP | cDSP | cDSP | cDSP |
| IO Coherency | No | Yes | No | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | No | Yes | No | No | Yes | Yes | No | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes |
| Max. concurrent FastRPC user PDs on aDSP | N/A | 2 | 2 | 1 (For Lahaina) 2 (For Cedros/Kodiak) | 1 (For Waipio/Palima) 4 (For Filmore) | 4 | 4 | 2 | 9 | 3 | 1 | 3 | 1 | 4 | 4 | 9 | 5 | 1 | 1 | N/A | 1 | 2 | 4 | 3 | 5 | 1 | 1 |
| Max. concurrent FastRPC user PDs on cDSP | N/A | 11 | 4 | 10 | 10 | N/A | 6 | N/A | 10 | 10 | 10 | 6 | 6 | 4 | 4 | 12 | 4 | 7 | 6 | 6 | 10 | N/A | 12 | 13 | 8 | 10 | 10 |
| Max. concurrent FastRPC user PDs on sDSP | N/A | N/A | N/A | 4 (For Lahaina) N/A (For Cedros/Kodiak) | 4 (For Waipio/Palima) N/A (For Filmore) | N/A | N/A | N/A | N/A | N/A | 4 | N/A | 4 | N/A | N/A | N/A | N/A | N/A | 4 | N/A | N/A | N/A | 5 | N/A | N/A | N/A | N/A |
| DSP libraries | |||||||||||||||||||||||||||
| QHL | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| QHL_HVX | Yes | Yes | Yes | Yes | Yes | No | Yes | No | Yes | Yes | Yes | Yes | Yes | No | No | Yes | No | No | Yes | No | Yes | No | No | No | No | Yes | Yes |
| qprintf | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| DSP worker pool | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Compute Libraries | |||||||||||||||||||||||||||
| asyncdspq | No | No | No | No | No | No | Yes | No | No | No | No | No | Yes | No | No | No | Yes | Yes | Yes | No | No | No | No | No | No | No | No |
| fastCV | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes |
| imagedspq | No | No | No | cDSP | cDSP | No | cDSP | No | No | No | cDSP | No | cDSP | No | No | No | cDSP | cDSP | cDSP | Yes | No | No | cDSP | cDSP | cDSP | cDSP | No |
| Base SDK examples | |||||||||||||||||||||||||||
| Android_app | No | Yes | No | No | Yes (For Waipio) No (For Filmore/Palima) | No | No | No | Yes | Yes | Yes | No | No | No | No | No | No | No | No | No | Yes | No | No | No | Yes (LA) No (LE) | No | Yes |
| asyncdspq_example | No | No | No | Yes (For Lahaina / Kodiak) No (For Cedros) | Yes | No | Yes | No | No | No | Yes | No | Yes | Yes | No | No | No (Unsigned PD) Yes (Signed PD) | Yes | Yes | No | Yes | No | Yes | Yes | No | No | Yes |
| Asynchronous DSP Packet Queue | No | Yes | No | Yes | Yes | No | No | No | Yes | Yes | Yes | No | No | No | No | Yes | No | No | No | No | Yes | No | Yes | Yes | Yes | Yes | Yes |
| Calculator | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Calculator_c++ | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| calculator_c++_apk | No | Yes | Yes | Yes | Yes | No | Yes | No | Yes | Yes | Yes | Yes | No | No | No | No | No | No | No | No | Yes | No | Yes | Yes | Yes | Yes | Yes |
| gtest | Yes | Yes | Yes | Yes | Yes | No | Yes | No | Yes | Yes | Yes | Yes | No | No | No | Yes | No | Yes | No | No | Yes | No | Yes | Yes | No (LE)/Yes (LA) | No | Yes |
| HAP_example | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| itrace | No | Yes | No | Yes | Yes | No | No | No | Yes | Yes | Yes | No | No | No | No | No | No | No | No | No | Yes | No | No | Yes | Yes(LA)/No(LE) | No | No |
| LPI_example | Yes | Yes | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No |
| Multithreading | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| OEM Configuration | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Profiling | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| QHL | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | No | Yes | No | No | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| QHL_HMX | Yes | Yes | No | Yes (For Lahaina / Kodiak) No (For Cedros) | Yes | No | No | No | Yes | Yes | Yes | No | No | No | No | Yes | No | No | No | No | Yes | No | Yes | Yes | No | Yes | Yes |
| QHL_HVX | Yes | Yes | Yes | Yes | Yes | No | Yes | No | Yes | Yes | Yes | Yes | Yes | No | No | Yes | No | No | No | No | Yes | No | Yes | Yes | Yes | Yes | Yes |
| Qprintf | Yes | Yes | cDSP | cDSP | cDSP | No | cDSP | No | Yes | Yes | cDSP | cDSP | cDSP | cDSP | cDSP | Yes | cDSP | Yes | cDSP | Yes | Yes | No | Yes | Yes | Yes | cDSP | Yes |
| synxexample | No | Yes | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | No | Yes | Yes |
| Compute examples | |||||||||||||||||||||||||||
| Benchmark | Yes | Yes | Yes | Yes | Yes | No | Yes | No | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes | No | Yes | Yes | Yes | No | Yes |
| Camera CHI | No | Yes | No | Yes | Yes (For Waipio/Palima) No (For Filmore) | No | No | No | Yes | Yes | Yes | No | No | No | No | Yes | No | No | No | No | Yes | No | Yes | Yes | Yes | Yes | Yes |
| Camera streamer | No | Yes | No | Yes (For Lahaina / Cedros) No (For Kodiak) | Yes (For Waipio) No (For Filmore/Palima) | No | No | No | No | No | Yes | No | No | No | No | Yes | No | No | No | No | No | No | Yes | Yes | No | No | No |
| Corner detect | No | Yes | Yes | Yes | Yes | No | Yes | No | Yes | Yes | Yes | Yes | Yes | No | No | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes |
| Image DSPQ | No | Yes | No | Yes | Yes | No | Yes | No | Yes | Yes | Yes | No | Yes | No | No | Yes | Yes | Yes | Yes | No | Yes | No | Yes | Yes | Yes | Yes | Yes |
| UBWC DMA | Yes | Yes | No | Yes | Yes | No | Yes | No | No | No | Yes | No | Yes | No | No | Yes | No | No | Yes | Yes | No | No | Yes | No | No | Yes | No |
| User DMA | No | Yes | No | Yes | Yes | No | No | No | Yes | Yes | Yes | No | No | No | No | Yes | No | No | No | No | Yes | No | Yes | Yes | Yes | Yes | Yes |
| Compute Resource Manager Sample | No | Yes | No | Yes | Yes | No | No | No | Yes | Yes | Yes | No | No | No | No | Yes | No | No | No | No | Yes | No | Yes | Yes | Yes | Yes | Yes |
| QAIC Features | |||||||||||||||||||||||||||
| Stub-Skel version mismatch check | No | Yes | No | No | No | No | No | No | Yes | Yes | No | No | No | No | No | Yes | No | No | No | No | Yes | No | No | Yes | Yes | No | Yes |
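Programming
Getting started
As the figure below shows, a DSP program is split into two parts. On the application side, client calls go through the stub, which is compiler-generated code that interfaces with the DSP RPC driver.
Note that in this model offloading to the DSP is serial, and a FastRPC call blocks until it returns. On the DSP side, the interface is handled by the auto-generated skel code, which unmarshals the parameters and connects the DSP RPC framework to the application's implementation running on the DSP.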
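Build & run
Compilers
The HLOS-side program is built with the stock clang/clang++ from the NDK.
The Hexagon-side shared library is built with the compilers from HEXAGON_Tools: hexagon-clang/hexagon-clang++.
The same applies when building Halide.
Build commands
For details, see:
docs/tools/build.html#custom-toolchain
The build here mainly uses the CMake build system in the Hexagon SDK. To keep the arguments consistent between cmake and make.d, it is driven through the build_cmake executable:
# The executable is located at <Hexagon-SDK>/build/cmake/Ubuntu/build_cmake
build_cmake <action> [Options]
# Alternatively, run the bash script directly. It is equivalent to build_cmake, except that source_dir must be given by hand.
# The bash script is easier to read and easier to extend with custom behavior.
<Hexagon-SDK>/build/cmake/cmake_configure.bash source_dir <action> [Options]
The CMake build system supports the following actions: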
| action | Description |
|---|---|
| hexagon | Build a dynamic DSP lib.so |
| hexagonsim | Build a dynamic DSP lib or executable for hexagon and run on simulator |
| android | Build the android executable |
| ubuntuARM | Build the ubuntuARM executable |
| <CUSTOM_NAME> | User-defined |
| help | Print help information |
| Hexagon Options | Acceptable Values( * denotes defaults ) | Description |
|---|---|---|
| DSP_ARCH | v65*, v66, v68, v69, v73, v75 | Target Variant |
| NO_QURT_INC | 0*, 1 | Do not include QuRT as a dependency when NO_QURT_INC=1 |
| HLOS Options | Acceptable Values( * denotes defaults ) | Description |
|---|---|---|
| HLOS_ARCH | 32, 64* | HLOS architecture variant |
| DOMAIN_FLAG | 0, 1, 2, 3* | Select the fastrpc domain, 3 means cDSP. |
| Other Options | Acceptable Values( * denotes defaults ) | Description |
|---|---|---|
| BUILD | ReleaseG*, Debug, Release | Build Variant |
| VERBOSE | 0*,1 | If VERBOSE=1, it displays all the outputs from the build process. If VERBOSE is not defined, the build system displays only error messages. |
| TREE | 0*,1 | TREE=0 cleans project build directory. TREE=1 cleans project build directory and dependency build directories |
| -gMake | | CMake Build uses Makefile Generator instead of default Ninja Generator when provided. |
| -j | <num> | Enables to build the project using multiple threads. Default value of <num> is 1. |
CMakeLists helper functions
Standard CMake helper functions used in the CMakeLists files:
| Helper function | Usage summary | Documentation |
|---|---|---|
| find_library | Find a library in the specified locations | find_library |
| add_dependencies | Add a dependency between top level targets | add_dependencies |
| add_library | Add a library target, built from the sources specified | add_library |
| target_link_directories | Specify the search paths for the linker | target_link_directories |
| target_link_libraries | Specify the libraries to be used to link a given target | target_link_libraries |
| add_custom_command | Add a custom build rule to the generated build system | add_custom_command |
| ExternalProject_Add | Build a target from sources outside of the current CMake project | ExternalProject |
| include_directories | Add the given directories to the include path of current CMake project | include_directories |
| target_include_directories | Add the directories to the include path for a given target | target_include_directories |
| add_executable | Add an executable target, built from the sources specified | add_executable |
| set_target_properties | Set properties for the given targets | set_target_properties |
| target_compile_definitions | Add compile definitions for a given target | target_compile_definitions |
| file | Perform operations on the file system | file |
| include | Include external build rules to the current project | include |
CMake helper functions provided by the Hexagon SDK, from the file $(HEXAGON_SDK_ROOT)/build/cmake/hexagon_fun.cmake:
| Helper function | Usage summary | Example |
|---|---|---|
| build_idl(<idlFile> <target>) | Set up a custom_target to build <idlFile> using qaic IDL compiler and also add the custom_target created as the dependency of <target> | build_idl(inc/calculator calculator) |
| add_external_project(<target> SOURCE_DIR <library_source_dir> BYPRODUCTS <list_variable> [ADDITIONAL_CMAKE_ARGS "<cmake args>"]) | Takes the source directory of the library and list of binaries generated as arguments and calls cmake's ExternalProject_Add() with all the cmake variables needed by Hexagon SDK build system. ADDITIONAL_CMAKE_ARGS is an optional argument to pass cmake variables other than the default set by this function. BYPRODUCTS list is compulsory for Ninja Generator and is ignored with Makefile Generator. | set(LIBS_GENERATED ${QHL_DIR}/${V}/libqhblas.a ${QHL_DIR}/${V}/libqhmath.a) add_external_project(qhl-target SOURCE_DIR $(HEXAGON_SDK_ROOT)/libs/qhl BYPRODUCTS LIBS_GENERATED) |
| link_options(<target>) | Set up the architecture-specific linker flags for the <target> | link_options(calculator_device) |
| link_custom_library(<target> <custom_library>) | Build the <custom_library> and link to the target. To build the <custom_library> from source, use the BUILD_SOURCE flag. For example link_custom_library(<target> <custom_library> BUILD_SOURCE). <custom_library> can take one of the following values: (rpcmem, atomic, test_util, rtld, qhl, qhl_hvx) | link_custom_library(calculator_device rpcmem) |
| choose_dsprpc("<domain>" <target>) | Take <domain> as argument and return the corresponding remote library name in <target>. <domain> takes values from 0-3 and the default is 3 (CDSP). | choose_dsprpc("3" calculator) |
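The domain-to-library mapping that `choose_dsprpc` performs can be pictured with a small C lookup. This is a sketch under assumptions: the document only states that 3 means cDSP, so the numbering of the other three domains and the library base names (`adsprpc`, `mdsprpc`, `sdsprpc`, `cdsprpc`) are assumed here, and the function name `rpc_library_for_domain` is hypothetical.

```c
#include <stddef.h>

/* Domain numbers as used by DOMAIN_FLAG. Only "3 = cDSP" comes from
 * the tables above; the other three values are an assumption. */
enum {
    DOMAIN_ADSP = 0,
    DOMAIN_MDSP = 1,
    DOMAIN_SDSP = 2,
    DOMAIN_CDSP = 3
};

/* Return the base name of the remote RPC library for a domain,
 * or NULL when the domain number is not one of 0-3. */
const char *rpc_library_for_domain(int domain)
{
    switch (domain) {
    case DOMAIN_ADSP: return "adsprpc";
    case DOMAIN_MDSP: return "mdsprpc";
    case DOMAIN_SDSP: return "sdsprpc";
    case DOMAIN_CDSP: return "cdsprpc";
    default:          return NULL;
    }
}
```

With the default domain 3, the host program would link against the cDSP variant of the RPC library.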
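stub and skel
Stub and skel are RPC (Remote Procedure Call) concepts: they are the generated code and interfaces that carry the communication between the client and the server.
stub:
- Generated code on the client side that wraps a local call and turns it into a remote call. It packs the client's arguments into a message and sends the message to the remote server.
skel (skeleton):
- Generated code on the server side that receives the remote call request from the client, unpacks the arguments, and calls the actual local function to carry out the request.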
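Running
When running on the device, the library paths of the stub and the skel must be specified. For example, suppose the folder containing the executable looks like this:
.
├── dsp
│ └── libimgproc_skel.so # skel
├── imgproc_exe # main program, entry point of the host-side code
└── libimgproc_stub.so # stub
1 directory, 3 files
Run the main program with the following commands:
export LD_LIBRARY_PATH=./ # folder containing the stub
export DSP_LIBRARY_PATH=./dsp # folder containing the skel
./imgproc_exe # run the program; the environment variables above must be set for the DSP code to load
RPC (Remote Procedure Call)
See docs/software/ipc/rpc.html
RPC lets a program call a procedure that runs remotely without dealing with the details of the remote execution. FastRPC is the RPC mechanism that lets the CPU call functions on the DSP.
A FastRPC interface is defined in an IDL file, and the QAIC compiler (ipc/fastrpc/qaic) generates the header file and the stub and skel code from it. The header and the stub are built into the CPU-side executable; the header and the skel are compiled and linked into the DSP library.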
FastRPC architecture
Terminology:
Application: User mode process that initiates the remote invocation
Stub: Auto-generated code that takes care of marshaling parameters and runs on the CPU
FastRPC user driver on CPU: User mode library that is used by the stub code to do remote invocations
FastRPC Kernel Driver: Receives the remote invocations from the client, queues them up with the FastRPC DSP driver, and then waits for the response after signaling the remote side
FastRPC DSP Driver: Dequeues the messages sent by the FastRPC kernel driver and dispatches them for processing
FastRPC user driver on DSP: User mode code that includes a shell executable to run in the user protection domain (PD) on the DSP and complete the remote invocations to the skel library
Skel: Auto-generated code that un-marshals parameters and invokes the user-defined implementation of the function that runs on the DSP
User PD: User protection domain on the DSP that provides the environment to run the user code
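FastRPC execution flow
FastRPC follows the classic proxy pattern. The interface object (the stub) and the implementation object (the skel) live on different processors (the CPU and the DSP). FastRPC clients see only the stub object, and the skel object is invoked on the DSP side by the FastRPC framework.
- The CPU side calls the stub version of the function, and the stub code turns the function call into an RPC message.
- On the CPU side, the stub code calls into the FastRPC framework, which queues the message.
- The FastRPC framework on the CPU sends the queued message to the FastRPC framework on the DSP.
- The FastRPC framework on the DSP dispatches the call to the skel code.
- On the DSP side, the skel code unmarshals the parameters and calls the implementation of the method.
- On the DSP side, the skel code waits for the implementation to finish and encodes the return value into the return message.
- On the DSP side, the skel code calls the FastRPC DSP framework to queue the return message for sending to the CPU.
- The FastRPC framework on the DSP sends the return message to the FastRPC framework on the CPU.
- The FastRPC framework on the CPU identifies the matching stub code and dispatches the return value to it.
- The stub code unmarshals the return value and hands it back to the user program.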
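The ten steps above can be mimicked on a single machine to show where marshaling and unmarshaling happen. The sketch below is a toy model in portable C, not the code QAIC generates: the message layout `rpc_message`, the method id, and every function name are invented for illustration, and the "transport" is a direct blocking call standing in for the kernel driver and the DSP driver.

```c
#include <stdint.h>
#include <string.h>

#define MAX_PAYLOAD 256

/* Hypothetical wire format for one RPC message. */
typedef struct {
    uint32_t method_id;
    uint32_t payload_len;          /* bytes used in payload */
    uint8_t  payload[MAX_PAYLOAD];
} rpc_message;

enum { METHOD_SUM = 1 };

/* "DSP side": the user-written implementation. */
int sum_impl(const int32_t *vec, uint32_t len, int64_t *res)
{
    int64_t acc = 0;
    for (uint32_t i = 0; i < len; i++) acc += vec[i];
    *res = acc;
    return 0;
}

/* Skel: unmarshal the arguments, call the implementation,
 * and marshal the result into the reply message. */
int sum_skel(const rpc_message *req, rpc_message *rsp)
{
    int32_t vec[MAX_PAYLOAD / sizeof(int32_t)];
    int64_t res = 0;
    if (req->method_id != METHOD_SUM || req->payload_len > MAX_PAYLOAD) return -1;
    memcpy(vec, req->payload, req->payload_len);
    int status = sum_impl(vec, req->payload_len / (uint32_t)sizeof(int32_t), &res);
    rsp->method_id = req->method_id;
    rsp->payload_len = (uint32_t)sizeof(res);
    memcpy(rsp->payload, &res, sizeof(res));
    return status;
}

/* Stand-in for the FastRPC drivers: a direct call that blocks
 * until the "remote" side has produced its reply. */
int transport_invoke(const rpc_message *req, rpc_message *rsp)
{
    return sum_skel(req, rsp);
}

/* Stub: marshal the arguments, invoke the transport, unmarshal the reply. */
int sum_stub(const int32_t *vec, uint32_t len, int64_t *res)
{
    rpc_message req, rsp;
    if ((size_t)len * sizeof(int32_t) > MAX_PAYLOAD) return -1;
    req.method_id = METHOD_SUM;
    req.payload_len = (uint32_t)(len * sizeof(int32_t));
    memcpy(req.payload, vec, req.payload_len);
    int status = transport_invoke(&req, &rsp);
    if (status != 0) return status;
    memcpy(res, rsp.payload, sizeof(*res));
    return 0;
}
```

The caller only ever sees `sum_stub`, which has the same shape as a local function; that is the proxy pattern described above. In real FastRPC the buffers cross a processor boundary instead of being copied within one process.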
Android libraries
/system/vendor path is only supported on Android O and later.
| Software component | Description |
|---|---|
| /system/vendor/lib/lib*rpc.so or /vendor/lib/lib*rpc.so, where * is adsp, cdsp, or sdsp | Shared object library to be linked with the user-space vendor application that is invoking the remote procedure call. This library interfaces with the kernel driver to initiate the remote invocation to the aDSP, cDSP, or sDSP. |
| /system/lib/lib*rpc_system.so, where * is adsp, cdsp, or sdsp | Shared object library that is to be linked with the user-space system application that is invoking the remote procedure call. This library interfaces with the kernel driver to initiate the remote invocation to the aDSP, cDSP, or sDSP. This library is applicable for system applications on Android P and onwards. |
PD
IDL Compiler
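See docs/reference/idl.html
The interfaces of the DSP platform and of every FastRPC program are defined in IDL. Here is an example IDL file:
#include "AEEStdDef.idl"
#include "remote.idl"
interface calculator : remote_handle64 {
    long sum(in sequence<long> vec, rout long long res);
    long max(in sequence<long> vec, rout long res);
};
Memory management
Reference: docs/software/ipc/rpc.html#memory-management
A typical Qualcomm SoC contains several processors and other hardware cores that all share the same external memory. Each processor or core reaches physical memory through an MMU, and the MMUs control which memory each core may access. A simplified view of the memory architecture is shown below:
The DSP has two MMUs:
- internal MMU: located inside the DSP and managed by the DSP operating system. It governs access to the DSP's internal memory (such as VTCM) and to the DSP subsystem registers.
- system MMU (SMMU): sits between the DSP and the system bus and physical memory, and is managed by the CPU. All accesses to physical memory and to hardware registers go through the SMMU.
Normally each core runs with its own address space and cores cannot access each other's memory, but shared memory can be set up so that cores share data.
Each DSP program runs in its own virtual address space (VA). QuRT manages the DSP VAs and keeps processes isolated from one another and from the operating system. Because the Hexagon DSP uses a 32-bit virtual address space, each process is limited to 4 GB. If a process touches an address that is not mapped in the DSP MMU, the DSP-side process is killed (page fault).
The DSP address space is mapped through the DSP PA (physical address space) onto the system address space rather than directly onto physical memory; only after passing through the SMMU (System Memory Management Unit) does an address reach the physical address space. The SMMU is managed by the CPU, so if the DSP touches an address that is not mapped in the SMMU, it triggers an SMMU page fault that the CPU handles.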
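The two translation stages and their two different failure modes can be sketched with a pair of toy page tables. Everything here is an illustrative assumption: the 4 KiB page size, the eight-entry tables, the type and function names, and the idea of representing each MMU as a flat array. It only models the lookup order described above.

```c
#include <stdint.h>

#define PAGE_SHIFT 12u        /* assume 4 KiB pages */
#define NUM_PAGES  8u         /* tiny tables, enough for a demo */
#define UNMAPPED   UINT32_MAX

typedef enum {
    XLATE_OK,
    XLATE_DSP_PAGE_FAULT,     /* miss in the DSP MMU: the DSP process is killed */
    XLATE_SMMU_FAULT          /* miss in the SMMU: handled by the CPU */
} xlate_status;

typedef struct {
    uint32_t dsp_mmu[NUM_PAGES]; /* VA page -> DSP PA page (managed by QuRT) */
    uint32_t smmu[NUM_PAGES];    /* DSP PA page -> physical page (managed by the CPU) */
} address_maps;

/* Translate a DSP virtual address to a physical address in two stages. */
xlate_status translate(const address_maps *m, uint32_t va, uint64_t *phys)
{
    uint32_t vpage  = va >> PAGE_SHIFT;
    uint32_t offset = va & ((1u << PAGE_SHIFT) - 1u);

    /* Stage 1: internal DSP MMU. */
    if (vpage >= NUM_PAGES || m->dsp_mmu[vpage] == UNMAPPED)
        return XLATE_DSP_PAGE_FAULT;
    uint32_t dsp_pa_page = m->dsp_mmu[vpage];

    /* Stage 2: system MMU. */
    if (dsp_pa_page >= NUM_PAGES || m->smmu[dsp_pa_page] == UNMAPPED)
        return XLATE_SMMU_FAULT;

    *phys = ((uint64_t)m->smmu[dsp_pa_page] << PAGE_SHIFT) | offset;
    return XLATE_OK;
}
```

A buffer shared between the CPU and the DSP has to be present in both tables; a mapping that exists only in the first stage still faults, just at the SMMU rather than in the DSP.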
Compute Resource Management
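See the SDK document docs/software/system_performance/resource_management.html
Multi-sessions
See the SDK document docs/software/ipc/rpc.html#multi-sessions
According to the documentation, the CPU-side program can create one or more sessions on one or more DSPs. Sessions are isolated from one another; one benefit is that a session can be restarted on its own, for example after it fails.
Intrinsics
To make programming easier (time-critical sections would otherwise usually be written in assembly), the C compiler provides intrinsics, which express Hexagon processor instructions directly in C.
For example:
// As I understand it, this is a standalone program: it runs only on the DSP and exchanges no data with the CPU.
#include "hexagon_protos.h"
int main()
{
    long long v1 = 0xFFFF0000FFFF0000LL;
    long long v2 = 0x0000FFFF0000FFFFLL;
    long long result;
    // Find the minimum for each half-word in 64-bit vector
    result = Q6_P_vminh_PP(v1,v2);
}
Function name conventions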
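Q6_Vub_vmax_VubVub:
- Vub denotes an HVX vector of unsigned bytes.
- vmax denotes an element-wise vector maximum.
- The first Vub says the function returns a Vub; the two that follow say it takes two Vub arguments.
// Vd.ub=vmax(Vu.ub,Vv.ub)
for (i = 0; i < VELEM(8); i++) {
    Vd.ub[i] = (Vu.ub[i] > Vv.ub[i]) ? Vu.ub[i] : Vv.ub[i];
}
Q6_V_valign_VVI:
- V denotes an HVX vector.
- valign aligns two vectors by a given number of bytes; see section 6.10 of 80-N2040-58_AA_Hexagon_V75_HVX_Programmers_Reference_Manual.pdf for details.
- VVI says the arguments are two HVX vectors and an immediate.
// Vd=valign(Vu,Vv,#u3)
for(i = 0; i < VWIDTH; i++) {
    Vd.ub[i] = (i+#u>=VWIDTH) ? Vu.ub[i+#u-VWIDTH] : Vv.ub[i+#u];
}
Q6_Wh_vmpaacc_WhWubRb:
- vmpaacc denotes a weighted vector multiply with accumulation.
- Wh says the function returns an HVX_VectorPair (equivalent to two HVX_Vectors) of halfwords.
- WhWubRb says the first argument is an HVX_VectorPair of halfwords (the accumulator), the second is an HVX_VectorPair of unsigned bytes, and the third is a Word32 scalar holding four signed bytes.
// Vxx.h+=vmpa(Vuu.ub,Rt.b)
for (i = 0; i < VELEM(16); i++) {
    Vxx.v[0].h[i] += (Vuu.v[0].uh[i].ub[0] * Rt.b[0]) + (Vuu.v[1].uh[i].ub[0] * Rt.b[1]);
    Vxx.v[1].h[i] += (Vuu.v[0].uh[i].ub[1] * Rt.b[2]) + (Vuu.v[1].uh[i].ub[1] * Rt.b[3]);
}
Q6_Wh_vmpy_VubRb:
- vmpy denotes a vector multiply.
// Vdd.h=vmpy(Vu.ub,Rt.b)
for (i = 0; i < VELEM(16); i++) {
    Vdd.v[0].h[i] = (Vu.uh[i].ub[0] * Rt.b[(2*i+0)%4]);
    Vdd.v[1].h[i] = (Vu.uh[i].ub[1] * Rt.b[(2*i+1)%4]);
}
Notation
Vd.h: Vd is a vector register; .h means the elements are 16-bit halfwords (integers, not half-precision floats).
Vu.ub: Vu is a vector register; .ub means the elements are unsigned 8-bit bytes.
Vu.w: .w means the elements are 32-bit words.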
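To check one's reading of these names without a device, the lane-wise behavior can be reproduced in portable C. The functions below are host-side reference models written for this note (the `_model` names are made up); they follow the pseudo-code quoted above for vmax and valign, and treat vminh as a signed minimum over the four 16-bit lanes of a 64-bit value. They are not the intrinsics themselves and say nothing about performance.

```c
#include <stdint.h>

#define HVX_BYTES 128 /* one HVX vector: 128 bytes = 1024 bits */

/* Sign-extend a 16-bit lane without relying on narrowing casts. */
static int32_t sign_extend16(uint16_t v)
{
    return (int32_t)(v ^ 0x8000u) - 0x8000;
}

/* Model of Q6_P_vminh_PP: signed minimum of each 16-bit lane. */
uint64_t vminh_model(uint64_t a, uint64_t b)
{
    uint64_t out = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint16_t x = (uint16_t)(a >> (16 * lane));
        uint16_t y = (uint16_t)(b >> (16 * lane));
        uint16_t m = (sign_extend16(x) < sign_extend16(y)) ? x : y;
        out |= (uint64_t)m << (16 * lane);
    }
    return out;
}

/* Model of Q6_Vub_vmax_VubVub: per-byte unsigned maximum. */
void vmax_ub_model(const uint8_t *vu, const uint8_t *vv, uint8_t *vd)
{
    for (int i = 0; i < HVX_BYTES; i++)
        vd[i] = (vu[i] > vv[i]) ? vu[i] : vv[i];
}

/* Model of Q6_V_valign_VVI: bytes of Vv starting at `shift`,
 * followed by the first `shift` bytes of Vu. vd must not alias vu or vv. */
void valign_model(const uint8_t *vu, const uint8_t *vv, unsigned shift, uint8_t *vd)
{
    shift %= HVX_BYTES;
    for (unsigned i = 0; i < HVX_BYTES; i++)
        vd[i] = (i + shift >= HVX_BYTES) ? vu[i + shift - HVX_BYTES] : vv[i + shift];
}
```

Applied to the operands of the earlier `Q6_P_vminh_PP` example (`0xFFFF0000FFFF0000` and `0x0000FFFF0000FFFF`), the model returns all ones: in every lane one operand is `0xFFFF`, which is -1 as a signed halfword and therefore smaller than 0.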
Halide
The front-end language is C++; see the tutorial.
A more detailed walkthrough is in the Halide tutorial.
Folder structure:
├── dsp
│ ├── conv3x3_run.cpp # DSP-side function that calls the Halide-generated function
│ └── print.cpp
├── halide
│ └── conv3x3_generator.cpp # builds the generator, which then produces $*_halide
├── host
│ └── main.cpp # CPU-side caller
├── includes
│ └── conv3x3_halide.schedule.h # generated automatically by autotune_loop.py
├── Makefile
└── rpc
    └── conv3x3.idl # FastRPC interface file, used to generate skel.c and stub.c
Build flow of a Halide project:
graph TD
C0(hexagon-clang/++)
C1(ndk clang/++)
C2(linux clang/++)
G_DEP(libHalide.a Halide.h GenGen.cpp)
A0(qaic)
A0 --> A1(conv3x3.idl)
A1 --> A2(conv3x3_skel.c)
A1 --> A3(conv3x3_stub.c)
C0 --> A2 --> A20(libconv3x3_skel.so)
C1 --> A3 --> A21(libconv3x3_stub.so)
C2 --> G0(conv3x3_generator.cpp)
G_DEP --> G0
G0 --> G1(conv3x3_generator)
G1 --> G2(conv3x3_halide.h conv3x3_halide.o)
B0(halide::autotune_loop.py optional)
G1 --> B0
G2 --> B0
B0 --> B1(conv3x3_halide.schedule.h)
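Halide is a new DSL (Domain Specific Language) for image processing and computational photography.
Fast image-processing pipelines are hard to write: 1) defining each stage of the pipeline is hard, and 2) optimizing the pipeline is hard as well, since it involves vectorization, multi-threading, and tiling. Expressing parallelism, tiling, and other optimizations in a conventional programming language is difficult too.
Halide separates the algorithm from the schedule (how each pipeline stage is mapped onto compute resources), which makes it faster to implement and to experiment with a pipeline. Both the algorithm and the schedule are written by the programmer, and the compiler generates highly optimized code from those explicit definitions.
standalone & offload
- standalone: as I understand it, a program that runs entirely on the DSP. It does not communicate over FastRPC, may have a main function, and needs no IDL interface. It is invoked through the implementation in Hexagon-SDK/libs/run_main_on_hexagon (I have not gotten this to work yet and do not know what the problem is).
- offload: the mode used everywhere else in these notes. It requires the host-side (HLOS) caller, the device-side (DSP) implementation, and the IDL interface (the FastRPC wrapper).
Examples
For the multi-threading test results, see [[2024-07-02]]
Problems and solutions
mini-dm suddenly stops printing anything
Reboot the phone. My guess is that a related system task died, or that the earlier attempt to use adb logcat caused the problem.
FastRPC returns 44 (AEE_EINVHANDLE)
After comparing, the cause appears to be that remote_session_control was not called to enable the unsigned PD: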
const int domain_id = 3; // use cDSP with unsignedPD by default.
// Open unsignedPD by default.
if (remote_session_control) {
struct remote_rpc_control_unsigned_module data;
data.domain = domain_id;
data.enable = 1;
if (AEE_SUCCESS !=
(nErr = remote_session_control(DSPRPC_CONTROL_UNSIGNED_MODULE,
(void *)&data, sizeof(data)))) {
printf("ERROR 0x%x: remote_session_control failed\n", nErr);
goto bail;
}
} else {
nErr = AEE_EUNSUPPORTED;
printf("ERROR 0x%x: remote_session_control interface is not supported "
"on this device\n",
nErr);
goto bail;
}
After this change, testing shows that the call goes through normally.


