OpenCL

架构和概念

工作组(work-group)和工作项(work-item)

工作组和工作项是并行计算模型的基本概念,他们帮助组织和管理大规模并行计算任务。

工作项(work-item)

  • 工作项是OpenCL中的基本单元,相当于在并行计算中执行一个线程。
  • 每个工作项都有一个唯一的全局ID,用于标识它在整个并行计算中的位置。
  • 工作项可以是1D、2D、3D的,这取决于问题的维度。
  • 在内核函数中,可以使用get_global_id来获取工作项的全局ID。

工作组(work-group)

  • 工作组是一组工作项,作为一个整体被调度执行。
  • 每个工作组中的工作项内也有唯一的标识符,通过get_local_id获取,获取的是该工作项在当前工作组中的id。工作组的id可以通过get_group_id来获取。

默认在不设置的情况下,opencl会分配最大的工作组大小,同时工作组是按照全局工作项的顺序分配工作组的。

并且全局ID和工作组局部ID的关系如下:

GlobalID = WorkGroupID * WorkGroupSize + LocalID

工作组的最大大小可以通过opencl device的信息获取:

#include "opencl.hpp" // use CLHPP
...

std::vector<cl::Platform> platforms;
cl::Platform::get(&platforms);

cl::Platform& plat = platforms[0];
std::vector<cl::Device> devices;
plat.getDevices(CL_DEVICE_TYPE_GPU, &devices);

cl::Device& device = devices[0];
size_t maxWorkGroupSize = device.getInfo<CL_DEVICE_MAX_WORK_GROUP_SIZE>();
std::vector<size_t> maxWorkItemSizes =
  device.getInfo<CL_DEVICE_MAX_WORK_ITEM_SIZES>();
std::cout << "Max Work Group Size: " << maxWorkGroupSize << std::endl;
std::cout << "Max Work Item Sizes: " << maxWorkItemSizes[0] << " "
		<< maxWorkItemSizes[1] << " " << maxWorkItemSizes[2] << std::endl;

并行架构

Compute Unit(CU)

OpenCL是异构并行计算架构,系统中有一台主机和若干个计算设备,这里计算设备就称之为CU(Compute Unit),OpenCL中的CU类比到CPU中相当于一个核心,是硬件中的一个物理单元。

CU主要负责执行从OpenCL内核中分配的工作组,每个计算单元可以同时处理多个工作组,并且以并行的方式执行计算任务。实际执行计算任务的是PE。可以通过device的CL_DEVICE_MAX_COMPUTE_UNITS来获取计算单元的个数。

#include "opencl.hpp" // use CLHPP
...
std::vector<cl::Platform> platforms;
cl::Platform::get(&platforms);

cl::Platform& plat = platforms[0];
std::vector<cl::Device> devices;
plat.getDevices(CL_DEVICE_TYPE_GPU, &devices);

cl::Device& device = devices[0];

size_t maxGPUCoreSize = device.getInfo<CL_DEVICE_MAX_COMPUTE_UNITS>();
std::cout << "Max Compute Unit Size: " << maxGPUCoreSize << std::endl;

Processing Element(PE)

PE是每个CU中的基本执行单元,没有找到直接的方法获取。

编程相关

常用库

OpenCL Header

地址: https://github.com/KhronosGroup/OpenCL-Headers

官方的OpenCL头文件。

OpenCL CPP Header

地址: https://github.com/KhronosGroup/OpenCL-CLHPP

官方的OpenCL cpp头文件。

OpenCL ICD loader

地址: https://github.com/KhronosGroup/OpenCL-ICD-Loader

 Installable Client Driver (ICD),可以让用户直接链接一个可安装的loader(也就是OpenCL动态库的一个wrapper),而不是链接特定版本的OpenCL动态库。

kernel function

// vstore3: 将包含三个元素的向量存储到内存中。
// - `data`: 要存储的向量数据。
// - `offset`: 存储到数组中的起始位置的偏移量。
// - `p`: 指向内存位置的指针。
void vstore3(<type> data, size_t offset, __global <type>* p)

// vload3: 从内存中加载三个元素到一个向量。
// - `offset`: 从数组中加载数据的起始位置的偏移量。
// - `p`: 指向内存位置的指针。
<type3> vload3(size_t offset, __global const <type>* p)

// convert_int_rtz:float浮点数转换为int整数,策略为向零舍入(而不是四舍五入)。
int convert_int_rtz (float x)

// convert_uchar4_sat_rtz: 向零舍入 + 饱和操作(大于uchar最大值,截断)
uchar4 convert_uchar4_sat_rtz (float4 x)

引用计数

OpenCL使用引用计数的机制来管理对象的声明周期,包括上下文,命令队列,内存对象等。这种机制确保对象在不使用时可以安全释放,防止内存泄漏或者无效引用。每个OpenCL对象都有一个引用计数,当对象被创建或者某个操作需要引用该对象的时候,引用计数增加。当不再需要适用对象的时候,引用计数减少。当引用计数降至零的时候,对象被销毁时,释放其资源。

OpenCL提供了几个函数来管理对象的引用计数:

  • clRetain*函数:增加对象的引用计数。
  • clRelease*函数:减少对象的引用计数,如果引用计数降至零,释放该对象。

以下是一些常用的clRetain*clRelease*函数:

  • clRetainContext / clReleaseContext
  • clRetainCommandQueue / clReleaseCommandQueue
  • clRetainMemObject / clReleaseMemObject
  • clRetainKernel / clReleaseKernel
  • clRetainProgram / clReleaseProgram
  • clRetainEvent / clReleaseEvent

理论上来讲,创建clKernel会增加clProgram的引用计数,所以只要clKernel对象还在使用,clProgram对象的资源就不会被释放

效率问题

Android端(高通芯片)

目前测试下来,OpenCL对于计算的并行加速很有效,但是对于内存拷贝而言(不论是从cl_mem到cl_mem,还是从CPU到cl_mem),并无直接优势。原因在于在高通芯片中,不论是NPU、CPU、GPU底层使用的内存硬件都是同一套,只是虚拟地址转换不同,相当于内存的带宽是固定的,拷贝速度受到:寻址、拷贝指令的执行速率、硬件缓存/寄存器速度相关。

内存

cl_mem

cl_mem是一个通用内存对象,它可以表示一个缓冲区(buffer)、一个图像(image),或者其他类型的内存对象。cl_mem是OpenCL中内存对象的通用句柄,可以用于不同类型的内存资源。

clCreateImage和clCreateBuffer

clCreateBuffer用于创建通用缓冲区对象。这些缓冲区对象主要用于存储线性数组,可以是任意类型的数据,例如浮点数、整数、结构体等,通常为线性一维的数据。

cl_mem buffer = clCreateBuffer(context, CL_MEM_READ_WRITE, size, NULL, &err);

clCreateImage用于创建图像对象。这些图像对象主要用于存储多维数据,例如2D图像或者3D图像。图像对象支持在内核中进行高效的读写操作,特别是在处理图像数据的时候。

支持更高效的内存读写操作,特别是对图像处理的优化,例如边缘处理和滤波操作,需要指定图像的格式例如(RGBA、RGB、单通道)以及数据类型(无符号整数,)

cl_image_format format;
format.image_channel_order = CL_RGBA;
format.image_channel_data_type = CL_UNSIGNED_INT8;

cl_image_desc desc;
desc.image_type = CL_MEM_OBJECT_IMAGE2D;
desc.image_width = width;
desc.image_height = height;
desc.image_depth = 0;
desc.image_array_size = 1;
desc.image_row_pitch = 0;
desc.image_slice_pitch = 0;
desc.num_mip_levels = 0;
desc.num_samples = 0;
desc.buffer = NULL;

cl_mem image = clCreateImage(context, CL_MEM_READ_WRITE, &format, &desc, NULL, &err);

内存共享

目的是减少内存拷贝,降低耗时。

OpenCL-OpenGL sharing

OpenCL和OpenGL的buffer是可以共享的,可以减少一次内存拷贝:Sharing Data Between OpenCL and OpenGL (apple.com)

OpenCL SVM(Shared Virtual Memory)

OpenCL2.0引入的新的特性,用于简化主机和设备之间的内存共享和数据传输,主要有以下特性:

  1. Fine-Grained Buffer SVM:
    • 支持细粒度的内存共享,主机和设备可以并发访问共享内存中的数据,适用于需要频繁交互的应用。
  2. Fine-Grained System SVM:
    • 进一步扩展到系统级别,允许主机和所有支持SVM的设备共享系统中的数据。这需要硬件支持。
  3. Coarse-Grained Buffer SVM:
    • 支持粗粒度的内存共享,主机和设备在执行数据传输前后需要显示的同步。这种方式更适合批量处理和较少的频繁数据交互的应用。

local memory和同步

上下文(Context)

上下文封装了OpenCL执行程序所需要的各种资源。一般来说,上下文和内存是向后兼容的,也就是说较新的OpenCL实现是能够识别和使用较老版本的OpenCL的设备或者上下文的。

设备(Devices)

  • 上下文可以包含一个或者多个OpenCL设备(CPU、GPU、FPGA等)。所有与上下文相关的操作都在这些设备上执行。
  • 上下文负责管理这些设备之间的资源分配和数据共享。

内存对象

  • 上下文管理所有内存对象(缓冲区和图像)。这些内存对象在上下文的设备间共享。
  • 使用上下文来创建内存对象,可以确保这些对象在上下文所有设备上是可访问的。(也就是每个上下文有独立的内存地址,不同上下文之间是不互通的)。

程序和内核(Program&Kernel)

  • 上下文用于创建和管理OpenCL程序和内核。程序对象包含了要在设备上执行的OpenCL C代码,而内核对象是从程序中提取出来并执行的函数。
  • 上下文确保这些程序和内核可以在相关设备上正确的编译和执行。

同步和事件管理(Sync&Event)

  • 上下文管理所有与同步相关的对象,包括事件对象。事件对象用于在命令队列中同步操作。
  • 上下文确保在多个设备或者命令队列之间执行任务时的同步和依赖关系。

命令队列(CommandQueue)

  • 虽然命令对立不在上下文中创建,但是每个命令队列都与一个上下文关联。命令队列是向设备提交命令(如内存拷贝、内核执行)的接口。
  • 上下文确保命令队列的命令在相关的设备上按照顺序执行。

编程的注意事项

  1. 指针的index不要用short,会越界截断的。

常见问题

AddressSanitizer不兼容

使用AddressSanitizer的话,涉及到OpenCL的函数调用会报内存泄漏,并且cl的函数调用会返回失败,以下是一个最简单的测试demo:

#include <iostream>
#include <string>
#include "CL/cl.h"

int main() {
  cl_uint num_platforms;
  cl_int status = clGetPlatformIDs(0, nullptr, &num_platforms);

  // 如果包含AddressSanitizer,返回-1001 0,并且会报内存泄漏
  // 如果不包含AddressSanitizer,则返回0 1
  std::cout << status << std::endl;
  std::cout << num_platforms << std::endl;

  return 0;
}
cmake_minimum_required(VERSION 3.14)
project(opencl_test)

set(SOURCE_DIR ${CMAKE_CURRENT_SOURCE_DIR}/src)
include_directories(${SOURCE_DIR})

set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -fsanitize=address -fno-omit-frame-pointer -g")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fsanitize=address -fno-omit-frame-pointer -g")

add_executable(test ${SOURCE_DIR}/test.cpp)
target_link_libraries(test OpenCL)

内存泄漏日志:

=================================================================
==2358598==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 392 byte(s) in 1 object(s) allocated from:
    #0 0x7f0537735b38 in calloc (/lib/x86_64-linux-gnu/libasan.so.4+0xdfb38)
    #1 0x7f0532ea264c  (<unknown module>)
    #2 0x7f0532d526da  (<unknown module>)
    #3 0x7f0537450fd4  (/usr/local/cuda-12.1/targets/x86_64-linux/lib/libOpenCL.so.1+0x2fd4)

Direct leak of 56 byte(s) in 1 object(s) allocated from:
    #0 0x7f0537735b38 in calloc (/lib/x86_64-linux-gnu/libasan.so.4+0xdfb38)
    #1 0x7f0532e9db26  (<unknown module>)
    #2 0x7f0532ea17aa  (<unknown module>)
    #3 0x7f0532d526da  (<unknown module>)
    #4 0x7f0537450fd4  (/usr/local/cuda-12.1/targets/x86_64-linux/lib/libOpenCL.so.1+0x2fd4)

Direct leak of 56 byte(s) in 1 object(s) allocated from:
    #0 0x7f0537735b38 in calloc (/lib/x86_64-linux-gnu/libasan.so.4+0xdfb38)
    #1 0x7f0532e9da9e  (<unknown module>)
    #2 0x7f0532ea17aa  (<unknown module>)
    #3 0x7f0532d526da  (<unknown module>)
    #4 0x7f0537450fd4  (/usr/local/cuda-12.1/targets/x86_64-linux/lib/libOpenCL.so.1+0x2fd4)

Direct leak of 56 byte(s) in 1 object(s) allocated from:
    #0 0x7f0537735b38 in calloc (/lib/x86_64-linux-gnu/libasan.so.4+0xdfb38)
    #1 0x7f0532e9d9f6  (<unknown module>)
    #2 0x7f0532ea17aa  (<unknown module>)
    #3 0x7f0532d526da  (<unknown module>)
    #4 0x7f0537450fd4  (/usr/local/cuda-12.1/targets/x86_64-linux/lib/libOpenCL.so.1+0x2fd4)

SUMMARY: AddressSanitizer: 560 byte(s) leaked in 4 allocation(s).

静态资源释放

在某些平台的OpenCL的实现,OpenCL资源(包括cl_context、cl_program等)如果是静态资源(比如单例模式下),系统自动释放的时候可能会遇到崩溃问题。

以下是在AOSP 13(Pixel 6)上遇到的问题(之前在小米上老版系统上也遇到过,升级系统之后解决了):

报的错误是使用了已经释放掉的锁,目前猜测是OpenCL资源管理中存在静态锁,并且该静态资源释放在我的静态cl资源释放之前,导致我这边释放CL资源发生崩溃。

最终解决办法:程序退出前执行资源释放(避免自动释放)。可以使用std::atexit函数辅助。

代码片段

C++端

OpenCL函数动态加载

// Copyright (C) 2019 FaceUnity Inc. All rights reserved.

#ifndef FUAI_OPTIMIZED_GPU_CL_OPENCL_WRAPPER_H_
#define FUAI_OPTIMIZED_GPU_CL_OPENCL_WRAPPER_H_

#include "CL/cl.h"
#include "CL/cl_egl.h"
#include "CL/cl_ext.h"
#include "CL/cl_gl.h"
#if (defined(ANDROID) || defined(__ANDROID__))
#include "CL/cl_ext_qcom.h"
#endif 
#include "CL/cl_platform.h"

namespace fuai {

#define LoadQcomExtensionFunctions()

#define DEFINE_QCOM_FUNCTION_PTRS

bool LoadOpenCL();

bool CloseOpenCL();

typedef cl_int(CL_API_CALL* PFN_clGetPlatformIDs)(
    cl_uint /* num_entries */, cl_platform_id* /* platforms */,
    cl_uint* /* num_platforms */) CL_API_SUFFIX__VERSION_1_0;
typedef cl_int(CL_API_CALL* PFN_clGetPlatformInfo)(
    cl_platform_id /* platform */, cl_platform_info /* param_name */,
    size_t /* param_value_size */, void* /* param_value */,
    size_t* /* param_value_size_ret */) CL_API_SUFFIX__VERSION_1_0;
typedef cl_int(CL_API_CALL* PFN_clGetDeviceIDs)(
    cl_platform_id /* platform */, cl_device_type /* device_type */,
    cl_uint /* num_entries */, cl_device_id* /* devices */,
    cl_uint* /* num_devices */) CL_API_SUFFIX__VERSION_1_0;
typedef cl_int(CL_API_CALL* PFN_clGetDeviceInfo)(
    cl_device_id /* device */, cl_device_info /* param_name */,
    size_t /* param_value_size */, void* /* param_value */,
    size_t* /* param_value_size_ret */) CL_API_SUFFIX__VERSION_1_0;
typedef cl_int(CL_API_CALL* PFN_clCreateSubDevices)(
    cl_device_id /* in_device */,
    const cl_device_partition_property* /* properties */,
    cl_uint /* num_devices */, cl_device_id* /* out_devices */,
    cl_uint* /* num_devices_ret */) CL_API_SUFFIX__VERSION_1_2;
typedef cl_int(CL_API_CALL* PFN_clRetainDevice)(cl_device_id /* device */)
    CL_API_SUFFIX__VERSION_1_2;
typedef cl_int(CL_API_CALL* PFN_clReleaseDevice)(cl_device_id /* device */)
    CL_API_SUFFIX__VERSION_1_2;
typedef cl_context(CL_API_CALL* PFN_clCreateContext)(
    const cl_context_properties* /* properties */, cl_uint /* num_devices */,
    const cl_device_id* /* devices */,
    void(CL_CALLBACK* /* pfn_notify */)(const char*, const void*, size_t,
                                        void*),
    void* /* user_data */,
    cl_int* /* errcode_ret */) CL_API_SUFFIX__VERSION_1_0;
typedef cl_context(CL_API_CALL* PFN_clCreateContextFromType)(
    const cl_context_properties* /* properties */,
    cl_device_type /* device_type */,
    void(CL_CALLBACK* /* pfn_notify*/)(const char*, const void*, size_t, void*),
    void* /* user_data */,
    cl_int* /* errcode_ret */) CL_API_SUFFIX__VERSION_1_0;
typedef cl_int(CL_API_CALL* PFN_clRetainContext)(cl_context /* context */)
    CL_API_SUFFIX__VERSION_1_0;
typedef cl_int(CL_API_CALL* PFN_clReleaseContext)(cl_context /* context */)
    CL_API_SUFFIX__VERSION_1_0;
typedef cl_int(CL_API_CALL* PFN_clGetContextInfo)(
    cl_context /* context */, cl_context_info /* param_name */,
    size_t /* param_value_size */, void* /* param_value */,
    size_t* /* param_value_size_ret */) CL_API_SUFFIX__VERSION_1_0;
typedef cl_command_queue(CL_API_CALL* PFN_clCreateCommandQueueWithProperties)(
    cl_context /* context */, cl_device_id /* device */,
    const cl_queue_properties* /* properties */,
    cl_int* /* errcode_ret */) CL_API_SUFFIX__VERSION_2_0;
typedef cl_int(CL_API_CALL* PFN_clRetainCommandQueue)(
    cl_command_queue /* command_queue */) CL_API_SUFFIX__VERSION_1_0;
typedef cl_int(CL_API_CALL* PFN_clReleaseCommandQueue)(
    cl_command_queue /* command_queue */) CL_API_SUFFIX__VERSION_1_0;
typedef cl_int(CL_API_CALL* PFN_clGetCommandQueueInfo)(
    cl_command_queue /* command_queue */,
    cl_command_queue_info /* param_name */, size_t /* param_value_size */,
    void* /* param_value */,
    size_t* /* param_value_size_ret */) CL_API_SUFFIX__VERSION_1_0;
typedef cl_mem(CL_API_CALL* PFN_clCreateBuffer)(
    cl_context /* context */, cl_mem_flags /* flags */, size_t /* size */,
    void* /* host_ptr */, cl_int* /* errcode_ret */) CL_API_SUFFIX__VERSION_1_0;
typedef cl_mem(CL_API_CALL* PFN_clCreateSubBuffer)(
    cl_mem /* buffer */, cl_mem_flags /* flags */,
    cl_buffer_create_type /* buffer_create_type */,
    const void* /* buffer_create_info */,
    cl_int* /* errcode_ret */) CL_API_SUFFIX__VERSION_1_1;
typedef cl_mem(CL_API_CALL* PFN_clCreateImage)(
    cl_context /* context */, cl_mem_flags /* flags */,
    const cl_image_format* /* image_format */,
    const cl_image_desc* /* image_desc */, void* /* host_ptr */,
    cl_int* /* errcode_ret */) CL_API_SUFFIX__VERSION_1_2;
typedef cl_mem(CL_API_CALL* PFN_clCreatePipe)(
    cl_context /* context */, cl_mem_flags /* flags */,
    cl_uint /* pipe_packet_size */, cl_uint /* pipe_max_packets */,
    const cl_pipe_properties* /* properties */,
    cl_int* /* errcode_ret */) CL_API_SUFFIX__VERSION_2_0;
typedef cl_int(CL_API_CALL* PFN_clRetainMemObject)(cl_mem /* memobj */)
    CL_API_SUFFIX__VERSION_1_0;
typedef cl_int(CL_API_CALL* PFN_clReleaseMemObject)(cl_mem /* memobj */)
    CL_API_SUFFIX__VERSION_1_0;
typedef cl_int(CL_API_CALL* PFN_clGetSupportedImageFormats)(
    cl_context /* context */, cl_mem_flags /* flags */,
    cl_mem_object_type /* image_type */, cl_uint /* num_entries */,
    cl_image_format* /* image_formats */,
    cl_uint* /* num_image_formats */) CL_API_SUFFIX__VERSION_1_0;
typedef cl_int(CL_API_CALL* PFN_clGetMemObjectInfo)(
    cl_mem /* memobj */, cl_mem_info /* param_name */,
    size_t /* param_value_size */, void* /* param_value */,
    size_t* /* param_value_size_ret */) CL_API_SUFFIX__VERSION_1_0;
typedef cl_int(CL_API_CALL* PFN_clGetImageInfo)(
    cl_mem /* image */, cl_image_info /* param_name */,
    size_t /* param_value_size */, void* /* param_value */,
    size_t* /* param_value_size_ret */) CL_API_SUFFIX__VERSION_1_0;
typedef cl_int(CL_API_CALL* PFN_clGetPipeInfo)(
    cl_mem /* pipe */, cl_pipe_info /* param_name */,
    size_t /* param_value_size */, void* /* param_value */,
    size_t* /* param_value_size_ret */) CL_API_SUFFIX__VERSION_2_0;
typedef cl_int(CL_API_CALL* PFN_clSetMemObjectDestructorCallback)(
    cl_mem /* memobj */,
    void(CL_CALLBACK* /*pfn_notify*/)(cl_mem /* memobj */, void* /*user_data*/),
    void* /*user_data */) CL_API_SUFFIX__VERSION_1_1;
typedef void*(CL_API_CALL* PFN_clSVMAlloc)(
    cl_context /* context */, cl_svm_mem_flags /* flags */, size_t /* size */,
    cl_uint /* alignment */)CL_API_SUFFIX__VERSION_2_0;
typedef void(CL_API_CALL* PFN_clSVMFree)(cl_context /* context */,
                                         void* /* svm_pointer */)
    CL_API_SUFFIX__VERSION_2_0;
typedef cl_sampler(CL_API_CALL* PFN_clCreateSamplerWithProperties)(
    cl_context /* context */,
    const cl_sampler_properties* /* normalized_coords */,
    cl_int* /* errcode_ret */) CL_API_SUFFIX__VERSION_2_0;
typedef cl_int(CL_API_CALL* PFN_clRetainSampler)(cl_sampler /* sampler */)
    CL_API_SUFFIX__VERSION_1_0;
typedef cl_int(CL_API_CALL* PFN_clReleaseSampler)(cl_sampler /* sampler */)
    CL_API_SUFFIX__VERSION_1_0;
typedef cl_int(CL_API_CALL* PFN_clGetSamplerInfo)(
    cl_sampler /* sampler */, cl_sampler_info /* param_name */,
    size_t /* param_value_size */, void* /* param_value */,
    size_t* /* param_value_size_ret */) CL_API_SUFFIX__VERSION_1_0;
typedef cl_program(CL_API_CALL* PFN_clCreateProgramWithSource)(
    cl_context /* context */, cl_uint /* count */, const char** /* strings */,
    const size_t* /* lengths */,
    cl_int* /* errcode_ret */) CL_API_SUFFIX__VERSION_1_0;
typedef cl_program(CL_API_CALL* PFN_clCreateProgramWithBinary)(
    cl_context /* context */, cl_uint /* num_devices */,
    const cl_device_id* /* device_list */, const size_t* /* lengths */,
    const unsigned char** /* binaries */, cl_int* /* binary_status */,
    cl_int* /* errcode_ret */) CL_API_SUFFIX__VERSION_1_0;
typedef cl_program(CL_API_CALL* PFN_clCreateProgramWithBuiltInKernels)(
    cl_context /* context */, cl_uint /* num_devices */,
    const cl_device_id* /* device_list */, const char* /* kernel_names */,
    cl_int* /* errcode_ret */) CL_API_SUFFIX__VERSION_1_2;
typedef cl_int(CL_API_CALL* PFN_clRetainProgram)(cl_program /* program */)
    CL_API_SUFFIX__VERSION_1_0;
typedef cl_int(CL_API_CALL* PFN_clReleaseProgram)(cl_program /* program */)
    CL_API_SUFFIX__VERSION_1_0;
typedef cl_int(CL_API_CALL* PFN_clBuildProgram)(
    cl_program /* program */, cl_uint /* num_devices */,
    const cl_device_id* /* device_list */, const char* /* options */,
    void(CL_CALLBACK* /* pfn_notify */)(cl_program /* program */,
                                        void* /* user_data */),
    void* /* user_data */) CL_API_SUFFIX__VERSION_1_0;
typedef cl_int(CL_API_CALL* PFN_clCompileProgram)(
    cl_program /* program */, cl_uint /* num_devices */,
    const cl_device_id* /* device_list */, const char* /* options */,
    cl_uint /* num_input_headers */, const cl_program* /* input_headers */,
    const char** /* header_include_names */,
    void(CL_CALLBACK* /* pfn_notify */)(cl_program /* program */,
                                        void* /* user_data */),
    void* /* user_data */) CL_API_SUFFIX__VERSION_1_2;
typedef cl_program(CL_API_CALL* PFN_clLinkProgram)(
    cl_context /* context */, cl_uint /* num_devices */,
    const cl_device_id* /* device_list */, const char* /* options */,
    cl_uint /* num_input_programs */, const cl_program* /* input_programs */,
    void(CL_CALLBACK* /* pfn_notify */)(cl_program /* program */,
                                        void* /* user_data */),
    void* /* user_data */,
    cl_int* /* errcode_ret */) CL_API_SUFFIX__VERSION_1_2;
typedef cl_int(CL_API_CALL* PFN_clUnloadPlatformCompiler)(
    cl_platform_id /* platform */) CL_API_SUFFIX__VERSION_1_2;
typedef cl_int(CL_API_CALL* PFN_clGetProgramInfo)(
    cl_program /* program */, cl_program_info /* param_name */,
    size_t /* param_value_size */, void* /* param_value */,
    size_t* /* param_value_size_ret */) CL_API_SUFFIX__VERSION_1_0;
typedef cl_int(CL_API_CALL* PFN_clGetProgramBuildInfo)(
    cl_program /* program */, cl_device_id /* device */,
    cl_program_build_info /* param_name */, size_t /* param_value_size */,
    void* /* param_value */,
    size_t* /* param_value_size_ret */) CL_API_SUFFIX__VERSION_1_0;
typedef cl_kernel(CL_API_CALL* PFN_clCreateKernel)(
    cl_program /* program */, const char* /* kernel_name */,
    cl_int* /* errcode_ret */) CL_API_SUFFIX__VERSION_1_0;
#if (defined(ANDROID) || defined(__ANDROID__))
typedef cl_recording_qcom (CL_API_CALL * PFN_clNewRecordingQCOM)
        ( cl_command_queue, cl_int *);
typedef cl_int(CL_API_CALL *PFN_clEndRecordingQCOM)(cl_recording_qcom);
typedef cl_int(CL_API_CALL *PFN_clReleaseRecordingQCOM)(cl_recording_qcom);
typedef cl_int (CL_API_CALL *PFN_clEnqueueRecordingQCOM) (cl_command_queue    /** command_queue */,
                       cl_recording_qcom   /** recording */,

                       size_t              /** number of recorded args being updated */,
                       const cl_array_arg_qcom * /** recorded arg to update */,

                       size_t               /** Number of global offsets to update */,
                       const cl_offset_qcom * /** Array  offsets to update */,

                       size_t             /** number of global workgroups being updated */,
                       const cl_workgroup_qcom * ,

                       size_t             ,
                       const cl_workgroup_qcom * ,
                       cl_uint        ,
                       const cl_event * ,
                       cl_event *        );
#endif
typedef cl_int(CL_API_CALL* PFN_clCreateKernelsInProgram)(
    cl_program /* program */, cl_uint /* num_kernels */,
    cl_kernel* /* kernels */,
    cl_uint* /* num_kernels_ret */) CL_API_SUFFIX__VERSION_1_0;
typedef cl_int(CL_API_CALL* PFN_clRetainKernel)(cl_kernel /* kernel */)
    CL_API_SUFFIX__VERSION_1_0;
typedef cl_int(CL_API_CALL* PFN_clReleaseKernel)(cl_kernel /* kernel */)
    CL_API_SUFFIX__VERSION_1_0;
typedef cl_int(CL_API_CALL* PFN_clSetKernelArg)(
    cl_kernel /* kernel */, cl_uint /* arg_index */, size_t /* arg_size */,
    const void* /* arg_value */) CL_API_SUFFIX__VERSION_1_0;
typedef cl_int(CL_API_CALL* PFN_clSetKernelArgSVMPointer)(
    cl_kernel /* kernel */, cl_uint /* arg_index */,
    const void* /* arg_value */) CL_API_SUFFIX__VERSION_2_0;
typedef cl_int(CL_API_CALL* PFN_clSetKernelExecInfo)(
    cl_kernel /* kernel */, cl_kernel_exec_info /* param_name */,
    size_t /* param_value_size */,
    const void* /* param_value */) CL_API_SUFFIX__VERSION_2_0;
typedef cl_int(CL_API_CALL* PFN_clGetKernelInfo)(
    cl_kernel /* kernel */, cl_kernel_info /* param_name */,
    size_t /* param_value_size */, void* /* param_value */,
    size_t* /* param_value_size_ret */) CL_API_SUFFIX__VERSION_1_0;
typedef cl_int(CL_API_CALL* PFN_clGetKernelArgInfo)(
    cl_kernel /* kernel */, cl_uint /* arg_indx */,
    cl_kernel_arg_info /* param_name */, size_t /* param_value_size */,
    void* /* param_value */,
    size_t* /* param_value_size_ret */) CL_API_SUFFIX__VERSION_1_2;
typedef cl_int(CL_API_CALL* PFN_clGetKernelWorkGroupInfo)(
    cl_kernel /* kernel */, cl_device_id /* device */,
    cl_kernel_work_group_info /* param_name */, size_t /* param_value_size */,
    void* /* param_value */,
    size_t* /* param_value_size_ret */) CL_API_SUFFIX__VERSION_1_0;
typedef cl_int(CL_API_CALL* PFN_clWaitForEvents)(
    cl_uint /* num_events */,
    const cl_event* /* event_list */) CL_API_SUFFIX__VERSION_1_0;
typedef cl_int(CL_API_CALL* PFN_clGetEventInfo)(
    cl_event /* event */, cl_event_info /* param_name */,
    size_t /* param_value_size */, void* /* param_value */,
    size_t* /* param_value_size_ret */) CL_API_SUFFIX__VERSION_1_0;
typedef cl_event(CL_API_CALL* PFN_clCreateUserEvent)(cl_context /* context */,
                                                     cl_int* /* errcode_ret */)
    CL_API_SUFFIX__VERSION_1_1;
typedef cl_int(CL_API_CALL* PFN_clRetainEvent)(cl_event /* event */)
    CL_API_SUFFIX__VERSION_1_0;
typedef cl_int(CL_API_CALL* PFN_clReleaseEvent)(cl_event /* event */)
    CL_API_SUFFIX__VERSION_1_0;
typedef cl_int(CL_API_CALL* PFN_clSetUserEventStatus)(
    cl_event /* event */,
    cl_int /* execution_status */) CL_API_SUFFIX__VERSION_1_1;
typedef cl_int(CL_API_CALL* PFN_clSetEventCallback)(
    cl_event /* event */, cl_int /* command_exec_callback_type */,
    void(CL_CALLBACK* /* pfn_notify */)(cl_event, cl_int, void*),
    void* /* user_data */) CL_API_SUFFIX__VERSION_1_1;
typedef cl_int(CL_API_CALL* PFN_clGetEventProfilingInfo)(
    cl_event /* event */, cl_profiling_info /* param_name */,
    size_t /* param_value_size */, void* /* param_value */,
    size_t* /* param_value_size_ret */) CL_API_SUFFIX__VERSION_1_0;
typedef cl_int(CL_API_CALL* PFN_clFlush)(cl_command_queue /* command_queue */)
    CL_API_SUFFIX__VERSION_1_0;
typedef cl_int(CL_API_CALL* PFN_clFinish)(cl_command_queue /* command_queue */)
    CL_API_SUFFIX__VERSION_1_0;
typedef cl_int(CL_API_CALL* PFN_clEnqueueReadBuffer)(
    cl_command_queue /* command_queue */, cl_mem /* buffer */,
    cl_bool /* blocking_read */, size_t /* offset */, size_t /* size */,
    void* /* ptr */, cl_uint /* num_events_in_wait_list */,
    const cl_event* /* event_wait_list */,
    cl_event* /* event */) CL_API_SUFFIX__VERSION_1_0;
typedef cl_int(CL_API_CALL* PFN_clEnqueueReadBufferRect)(
    cl_command_queue /* command_queue */, cl_mem /* buffer */,
    cl_bool /* blocking_read */, const size_t* /* buffer_offset */,
    const size_t* /* host_offset */, const size_t* /* region */,
    size_t /* buffer_row_pitch */, size_t /* buffer_slice_pitch */,
    size_t /* host_row_pitch */, size_t /* host_slice_pitch */, void* /* ptr */,
    cl_uint /* num_events_in_wait_list */,
    const cl_event* /* event_wait_list */,
    cl_event* /* event */) CL_API_SUFFIX__VERSION_1_1;
typedef cl_int(CL_API_CALL* PFN_clEnqueueWriteBuffer)(
    cl_command_queue /* command_queue */, cl_mem /* buffer */,
    cl_bool /* blocking_write */, size_t /* offset */, size_t /* size */,
    const void* /* ptr */, cl_uint /* num_events_in_wait_list */,
    const cl_event* /* event_wait_list */,
    cl_event* /* event */) CL_API_SUFFIX__VERSION_1_0;
typedef cl_int(CL_API_CALL* PFN_clEnqueueWriteBufferRect)(
    cl_command_queue /* command_queue */, cl_mem /* buffer */,
    cl_bool /* blocking_write */, const size_t* /* buffer_offset */,
    const size_t* /* host_offset */, const size_t* /* region */,
    size_t /* buffer_row_pitch */, size_t /* buffer_slice_pitch */,
    size_t /* host_row_pitch */, size_t /* host_slice_pitch */,
    const void* /* ptr */, cl_uint /* num_events_in_wait_list */,
    const cl_event* /* event_wait_list */,
    cl_event* /* event */) CL_API_SUFFIX__VERSION_1_1;
typedef cl_int(CL_API_CALL* PFN_clEnqueueFillBuffer)(
    cl_command_queue /* command_queue */, cl_mem /* buffer */,
    const void* /* pattern */, size_t /* pattern_size */, size_t /* offset */,
    size_t /* size */, cl_uint /* num_events_in_wait_list */,
    const cl_event* /* event_wait_list */,
    cl_event* /* event */) CL_API_SUFFIX__VERSION_1_2;
typedef cl_int(CL_API_CALL* PFN_clEnqueueCopyBuffer)(
    cl_command_queue /* command_queue */, cl_mem /* src_buffer */,
    cl_mem /* dst_buffer */, size_t /* src_offset */, size_t /* dst_offset */,
    size_t /* size */, cl_uint /* num_events_in_wait_list */,
    const cl_event* /* event_wait_list */,
    cl_event* /* event */) CL_API_SUFFIX__VERSION_1_0;
typedef cl_int(CL_API_CALL* PFN_clEnqueueCopyBufferRect)(
    cl_command_queue /* command_queue */, cl_mem /* src_buffer */,
    cl_mem /* dst_buffer */, const size_t* /* src_origin */,
    const size_t* /* dst_origin */, const size_t* /* region */,
    size_t /* src_row_pitch */, size_t /* src_slice_pitch */,
    size_t /* dst_row_pitch */, size_t /* dst_slice_pitch */,
    cl_uint /* num_events_in_wait_list */,
    const cl_event* /* event_wait_list */,
    cl_event* /* event */) CL_API_SUFFIX__VERSION_1_1;
typedef cl_int(CL_API_CALL* PFN_clEnqueueReadImage)(
    cl_command_queue /* command_queue */, cl_mem /* image */,
    cl_bool /* blocking_read */, const size_t* /* origin[3] */,
    const size_t* /* region[3] */, size_t /* row_pitch */,
    size_t /* slice_pitch */, void* /* ptr */,
    cl_uint /* num_events_in_wait_list */,
    const cl_event* /* event_wait_list */,
    cl_event* /* event */) CL_API_SUFFIX__VERSION_1_0;
typedef cl_int(CL_API_CALL* PFN_clEnqueueWriteImage)(
    cl_command_queue /* command_queue */, cl_mem /* image */,
    cl_bool /* blocking_write */, const size_t* /* origin[3] */,
    const size_t* /* region[3] */, size_t /* input_row_pitch */,
    size_t /* input_slice_pitch */, const void* /* ptr */,
    cl_uint /* num_events_in_wait_list */,
    const cl_event* /* event_wait_list */,
    cl_event* /* event */) CL_API_SUFFIX__VERSION_1_0;
typedef cl_int(CL_API_CALL* PFN_clEnqueueFillImage)(
    cl_command_queue /* command_queue */, cl_mem /* image */,
    const void* /* fill_color */, const size_t* /* origin[3] */,
    const size_t* /* region[3] */, cl_uint /* num_events_in_wait_list */,
    const cl_event* /* event_wait_list */,
    cl_event* /* event */) CL_API_SUFFIX__VERSION_1_2;
typedef cl_int(CL_API_CALL* PFN_clEnqueueCopyImage)(
    cl_command_queue /* command_queue */, cl_mem /* src_image */,
    cl_mem /* dst_image */, const size_t* /* src_origin[3] */,
    const size_t* /* dst_origin[3] */, const size_t* /* region[3] */,
    cl_uint /* num_events_in_wait_list */,
    const cl_event* /* event_wait_list */,
    cl_event* /* event */) CL_API_SUFFIX__VERSION_1_0;
typedef cl_int(CL_API_CALL* PFN_clEnqueueCopyImageToBuffer)(
    cl_command_queue /* command_queue */, cl_mem /* src_image */,
    cl_mem /* dst_buffer */, const size_t* /* src_origin[3] */,
    const size_t* /* region[3] */, size_t /* dst_offset */,
    cl_uint /* num_events_in_wait_list */,
    const cl_event* /* event_wait_list */,
    cl_event* /* event */) CL_API_SUFFIX__VERSION_1_0;
typedef cl_int(CL_API_CALL* PFN_clEnqueueCopyBufferToImage)(
    cl_command_queue /* command_queue */, cl_mem /* src_buffer */,
    cl_mem /* dst_image */, size_t /* src_offset */,
    const size_t* /* dst_origin[3] */, const size_t* /* region[3] */,
    cl_uint /* num_events_in_wait_list */,
    const cl_event* /* event_wait_list */,
    cl_event* /* event */) CL_API_SUFFIX__VERSION_1_0;
typedef void*(CL_API_CALL* PFN_clEnqueueMapBuffer)(
    cl_command_queue /* command_queue */, cl_mem /* buffer */,
    cl_bool /* blocking_map */, cl_map_flags /* map_flags */,
    size_t /* offset */, size_t /* size */,
    cl_uint /* num_events_in_wait_list */,
    const cl_event* /* event_wait_list */, cl_event* /* event */,
    cl_int* /* errcode_ret */)CL_API_SUFFIX__VERSION_1_0;
typedef void*(CL_API_CALL* PFN_clEnqueueMapImage)(
    cl_command_queue /* command_queue */, cl_mem /* image */,
    cl_bool /* blocking_map */, cl_map_flags /* map_flags */,
    const size_t* /* origin[3] */, const size_t* /* region[3] */,
    size_t* /* image_row_pitch */, size_t* /* image_slice_pitch */,
    cl_uint /* num_events_in_wait_list */,
    const cl_event* /* event_wait_list */, cl_event* /* event */,
    cl_int* /* errcode_ret */)CL_API_SUFFIX__VERSION_1_0;
typedef cl_int(CL_API_CALL* PFN_clEnqueueUnmapMemObject)(
    cl_command_queue /* command_queue */, cl_mem /* memobj */,
    void* /* mapped_ptr */, cl_uint /* num_events_in_wait_list */,
    const cl_event* /* event_wait_list */,
    cl_event* /* event */) CL_API_SUFFIX__VERSION_1_0;
typedef cl_int(CL_API_CALL* PFN_clEnqueueMigrateMemObjects)(
    cl_command_queue /* command_queue */, cl_uint /* num_mem_objects */,
    const cl_mem* /* mem_objects */, cl_mem_migration_flags /* flags */,
    cl_uint /* num_events_in_wait_list */,
    const cl_event* /* event_wait_list */,
    cl_event* /* event */) CL_API_SUFFIX__VERSION_1_2;
typedef cl_int(CL_API_CALL* PFN_clEnqueueNDRangeKernel)(
    cl_command_queue /* command_queue */, cl_kernel /* kernel */,
    cl_uint /* work_dim */, const size_t* /* global_work_offset */,
    const size_t* /* global_work_size */, const size_t* /* local_work_size */,
    cl_uint /* num_events_in_wait_list */,
    const cl_event* /* event_wait_list */,
    cl_event* /* event */) CL_API_SUFFIX__VERSION_1_0;
typedef cl_int(CL_API_CALL* PFN_clEnqueueNativeKernel)(
    cl_command_queue /* command_queue */,
    void(CL_CALLBACK* /*user_func*/)(void*), void* /* args */,
    size_t /* cb_args */, cl_uint /* num_mem_objects */,
    const cl_mem* /* mem_list */, const void** /* args_mem_loc */,
    cl_uint /* num_events_in_wait_list */,
    const cl_event* /* event_wait_list */,
    cl_event* /* event */) CL_API_SUFFIX__VERSION_1_0;
typedef cl_int(CL_API_CALL* PFN_clEnqueueMarkerWithWaitList)(
    cl_command_queue /* command_queue */, cl_uint /* num_events_in_wait_list */,
    const cl_event* /* event_wait_list */,
    cl_event* /* event */) CL_API_SUFFIX__VERSION_1_2;
typedef cl_int(CL_API_CALL* PFN_clEnqueueBarrierWithWaitList)(
    cl_command_queue /* command_queue */, cl_uint /* num_events_in_wait_list */,
    const cl_event* /* event_wait_list */,
    cl_event* /* event */) CL_API_SUFFIX__VERSION_1_2;
typedef cl_int(CL_API_CALL* PFN_clEnqueueSVMFree)(
    cl_command_queue /* command_queue */, cl_uint /* num_svm_pointers */,
    void*[] /* svm_pointers[] */,
    void(CL_CALLBACK* /*pfn_free_func*/)(cl_command_queue /* queue */,
                                         cl_uint /* num_svm_pointers */,
                                         void*[] /* svm_pointers[] */,
                                         void* /* user_data */),
    void* /* user_data */, cl_uint /* num_events_in_wait_list */,
    const cl_event* /* event_wait_list */,
    cl_event* /* event */) CL_API_SUFFIX__VERSION_2_0;
typedef cl_int(CL_API_CALL* PFN_clEnqueueSVMMemcpy)(
    cl_command_queue /* command_queue */, cl_bool /* blocking_copy */,
    void* /* dst_ptr */, const void* /* src_ptr */, size_t /* size */,
    cl_uint /* num_events_in_wait_list */,
    const cl_event* /* event_wait_list */,
    cl_event* /* event */) CL_API_SUFFIX__VERSION_2_0;
typedef cl_int(CL_API_CALL* PFN_clEnqueueSVMMemFill)(
    cl_command_queue /* command_queue */, void* /* svm_ptr */,
    const void* /* pattern */, size_t /* pattern_size */, size_t /* size */,
    cl_uint /* num_events_in_wait_list */,
    const cl_event* /* event_wait_list */,
    cl_event* /* event */) CL_API_SUFFIX__VERSION_2_0;
typedef cl_int(CL_API_CALL* PFN_clEnqueueSVMMap)(
    cl_command_queue /* command_queue */, cl_bool /* blocking_map */,
    cl_map_flags /* flags */, void* /* svm_ptr */, size_t /* size */,
    cl_uint /* num_events_in_wait_list */,
    const cl_event* /* event_wait_list */,
    cl_event* /* event */) CL_API_SUFFIX__VERSION_2_0;
typedef cl_int(CL_API_CALL* PFN_clEnqueueSVMUnmap)(
    cl_command_queue /* command_queue */, void* /* svm_ptr */,
    cl_uint /* num_events_in_wait_list */,
    const cl_event* /* event_wait_list */,
    cl_event* /* event */) CL_API_SUFFIX__VERSION_2_0;
typedef void*(CL_API_CALL* PFN_clGetExtensionFunctionAddressForPlatform)(
    cl_platform_id /* platform */,
    const char* /* func_name */)CL_API_SUFFIX__VERSION_1_2;
typedef cl_mem(CL_API_CALL* PFN_clCreateImage2D)(
    cl_context /* context */, cl_mem_flags /* flags */,
    const cl_image_format* /* image_format */, size_t /* image_width */,
    size_t /* image_height */, size_t /* image_row_pitch */,
    void* /* host_ptr */, cl_int* /* errcode_ret */);
typedef cl_mem(CL_API_CALL* PFN_clCreateImage3D)(
    cl_context /* context */, cl_mem_flags /* flags */,
    const cl_image_format* /* image_format */, size_t /* image_width */,
    size_t /* image_height */, size_t /* image_depth */,
    size_t /* image_row_pitch */, size_t /* image_slice_pitch */,
    void* /* host_ptr */, cl_int* /* errcode_ret */);
typedef cl_int(CL_API_CALL* PFN_clEnqueueMarker)(
    cl_command_queue /* command_queue */, cl_event* /* event */);
typedef cl_int(CL_API_CALL* PFN_clEnqueueWaitForEvents)(
    cl_command_queue /* command_queue */, cl_uint /* num_events */,
    const cl_event* /* event_list */);
typedef cl_int(CL_API_CALL* PFN_clEnqueueBarrier)(
    cl_command_queue /* command_queue */);
typedef cl_int(CL_API_CALL* PFN_clUnloadCompiler)();
typedef void*(CL_API_CALL* PFN_clGetExtensionFunctionAddress)(
    const char* /* func_name */);
typedef cl_command_queue(CL_API_CALL* PFN_clCreateCommandQueue)(
    cl_context /* context */, cl_device_id /* device */,
    cl_command_queue_properties /* properties */, cl_int* /* errcode_ret */);
typedef cl_sampler(CL_API_CALL* PFN_clCreateSampler)(
    cl_context /* context */, cl_bool /* normalized_coords */,
    cl_addressing_mode /* addressing_mode */, cl_filter_mode /* filter_mode */,
    cl_int* /* errcode_ret */);
typedef cl_int(CL_API_CALL* PFN_clEnqueueTask)(
    cl_command_queue /* command_queue */, cl_kernel /* kernel */,
    cl_uint /* num_events_in_wait_list */,
    const cl_event* /* event_wait_list */, cl_event* /* event */);

// OpenGL sharing
typedef cl_mem(CL_API_CALL* PFN_clCreateFromGLBuffer)(cl_context, cl_mem_flags,
                                                      cl_GLuint, int*);
typedef cl_mem(CL_API_CALL* PFN_clCreateFromGLTexture)(
    cl_context /* context */, cl_mem_flags /* flags */, cl_GLenum /* target */,
    cl_GLint /* miplevel */, cl_GLuint /* texture */,
    cl_int* /* errcode_ret */) CL_API_SUFFIX__VERSION_1_2;
typedef cl_int(CL_API_CALL* PFN_clEnqueueAcquireGLObjects)(
    cl_command_queue /* command_queue */, cl_uint /* num_objects */,
    const cl_mem* /* mem_objects */, cl_uint /* num_events_in_wait_list */,
    const cl_event* /* event_wait_list */, cl_event* /* event */);
typedef cl_int(CL_API_CALL* PFN_clEnqueueReleaseGLObjects)(
    cl_command_queue /* command_queue */, cl_uint /* num_objects */,
    const cl_mem* /* mem_objects */, cl_uint /* num_events_in_wait_list */,
    const cl_event* /* event_wait_list */,
    cl_event* /* event */) CL_API_SUFFIX__VERSION_1_0;

// cl_khr_egl_event extension

// CLeglDisplayKHR is an opaque handle to an EGLDisplay
typedef void* CLeglDisplayKHR;

// CLeglSyncKHR is an opaque handle to an EGLSync object
typedef void* CLeglSyncKHR;

typedef cl_event(CL_API_CALL* PFN_clCreateEventFromEGLSyncKHR)(
    cl_context /* context */, CLeglSyncKHR /* sync */,
    CLeglDisplayKHR /* display */, cl_int* /* errcode_ret */);

// EGL sharing
typedef cl_mem(CL_API_CALL* PFN_clCreateFromEGLImageKHR)(
    cl_context /*context*/, CLeglDisplayKHR /*display*/,
    CLeglImageKHR /*image*/, cl_mem_flags /*flags*/,
    const cl_egl_image_properties_khr* /*properties*/, cl_int* /*errcode_ret*/);
typedef cl_int(CL_API_CALL* PFN_clEnqueueAcquireEGLObjectsKHR)(
    cl_command_queue /*command_queue*/, cl_uint /*num_objects*/,
    const cl_mem* /*mem_objects*/, cl_uint /*num_events_in_wait_list*/,
    const cl_event* /*event_wait_list*/, cl_event* /*event*/);
typedef cl_int(CL_API_CALL* PFN_clEnqueueReleaseEGLObjectsKHR)(
    cl_command_queue /*command_queue*/, cl_uint /*num_objects*/,
    const cl_mem* /*mem_objects*/, cl_uint /*num_events_in_wait_list*/,
    const cl_event* /*event_wait_list*/, cl_event* /*event*/);

// cl_khr_command_buffer
typedef cl_command_buffer_khr(CL_API_CALL* PFN_clCreateCommandBufferKHR)(
    cl_uint /*num_queues*/, const cl_command_queue* /*queues*/,
    const cl_command_buffer_properties_khr* /*properties*/,
    cl_int* /*errcode_ret*/);

typedef cl_int(CL_API_CALL* PFN_clRetainCommandBufferKHR)(
    cl_command_buffer_khr /*command_buffer*/);

typedef cl_int(CL_API_CALL* PFN_clReleaseCommandBufferKHR)(
    cl_command_buffer_khr /*command_buffer*/);

typedef cl_int(CL_API_CALL* PFN_clFinalizeCommandBufferKHR)(
    cl_command_buffer_khr /*command_buffer*/);

typedef cl_int(CL_API_CALL* PFN_clEnqueueCommandBufferKHR)(
    cl_uint /*num_queues*/, cl_command_queue* /*queues*/,
    cl_command_buffer_khr /*command_buffer*/,
    cl_uint /*num_events_in_wait_list*/, const cl_event* /*event_wait_list*/,
    cl_event* /*event*/);

typedef cl_int(CL_API_CALL* PFN_clCommandNDRangeKernelKHR)(
    cl_command_buffer_khr /*command_buffer*/,
    cl_command_queue /*command_queue*/,
    const cl_ndrange_kernel_command_properties_khr* /*properties*/,
    cl_kernel /*kernel*/, cl_uint /*work_dim*/,
    const size_t* /*global_work_offset*/, const size_t* /*global_work_size*/,
    const size_t* /*local_work_size*/, cl_uint /*num_sync_points_in_wait_list*/,
    const cl_sync_point_khr* /*sync_point_wait_list*/,
    cl_sync_point_khr* /*sync_point*/,
    cl_mutable_command_khr* /*mutable_handle*/);

typedef cl_int(CL_API_CALL* PFN_clGetCommandBufferInfoKHR)(
    cl_command_buffer_khr /*command_buffer*/,
    cl_command_buffer_info_khr /*param_name*/, size_t /*param_value_size*/,
    void* /*param_value*/, size_t* /*param_value_size_ret*/);

extern PFN_clGetPlatformIDs clGetPlatformIDs;
extern PFN_clGetPlatformInfo clGetPlatformInfo;
extern PFN_clGetDeviceIDs clGetDeviceIDs;
extern PFN_clGetDeviceInfo clGetDeviceInfo;
extern PFN_clCreateSubDevices clCreateSubDevices;
extern PFN_clRetainDevice clRetainDevice;
extern PFN_clReleaseDevice clReleaseDevice;
extern PFN_clCreateContext clCreateContext;
extern PFN_clCreateContextFromType clCreateContextFromType;
extern PFN_clRetainContext clRetainContext;
extern PFN_clReleaseContext clReleaseContext;
extern PFN_clGetContextInfo clGetContextInfo;
extern PFN_clCreateCommandQueueWithProperties
    clCreateCommandQueueWithProperties;
extern PFN_clRetainCommandQueue clRetainCommandQueue;
extern PFN_clReleaseCommandQueue clReleaseCommandQueue;
extern PFN_clGetCommandQueueInfo clGetCommandQueueInfo;
extern PFN_clCreateBuffer clCreateBuffer;
extern PFN_clCreateSubBuffer clCreateSubBuffer;
extern PFN_clCreateImage clCreateImage;
extern PFN_clCreatePipe clCreatePipe;
extern PFN_clRetainMemObject clRetainMemObject;
extern PFN_clReleaseMemObject clReleaseMemObject;
extern PFN_clGetSupportedImageFormats clGetSupportedImageFormats;
extern PFN_clGetMemObjectInfo clGetMemObjectInfo;
extern PFN_clGetImageInfo clGetImageInfo;
extern PFN_clGetPipeInfo clGetPipeInfo;
extern PFN_clSetMemObjectDestructorCallback clSetMemObjectDestructorCallback;
extern PFN_clSVMAlloc clSVMAlloc;
extern PFN_clSVMFree clSVMFree;
extern PFN_clCreateSamplerWithProperties clCreateSamplerWithProperties;
extern PFN_clRetainSampler clRetainSampler;
extern PFN_clReleaseSampler clReleaseSampler;
extern PFN_clGetSamplerInfo clGetSamplerInfo;
extern PFN_clCreateProgramWithSource clCreateProgramWithSource;
extern PFN_clCreateProgramWithBinary clCreateProgramWithBinary;
extern PFN_clCreateProgramWithBuiltInKernels clCreateProgramWithBuiltInKernels;
extern PFN_clRetainProgram clRetainProgram;
extern PFN_clReleaseProgram clReleaseProgram;
extern PFN_clBuildProgram clBuildProgram;
extern PFN_clCompileProgram clCompileProgram;
#if (defined(ANDROID) || defined(__ANDROID__))
extern PFN_clNewRecordingQCOM clNewRecordingQCOM;
extern PFN_clEndRecordingQCOM clEndRecordingQCOM;
extern PFN_clEnqueueRecordingQCOM clEnqueueRecordingQCOM;
extern PFN_clReleaseRecordingQCOM clReleaseRecordingQCOM;
#endif
extern PFN_clLinkProgram clLinkProgram;
extern PFN_clUnloadPlatformCompiler clUnloadPlatformCompiler;
extern PFN_clGetProgramInfo clGetProgramInfo;
extern PFN_clGetProgramBuildInfo clGetProgramBuildInfo;
extern PFN_clCreateKernel clCreateKernel;
extern PFN_clCreateKernelsInProgram clCreateKernelsInProgram;
extern PFN_clRetainKernel clRetainKernel;
extern PFN_clReleaseKernel clReleaseKernel;
extern PFN_clSetKernelArg clSetKernelArg;
extern PFN_clSetKernelArgSVMPointer clSetKernelArgSVMPointer;
extern PFN_clSetKernelExecInfo clSetKernelExecInfo;
extern PFN_clGetKernelInfo clGetKernelInfo;
extern PFN_clGetKernelArgInfo clGetKernelArgInfo;
extern PFN_clGetKernelWorkGroupInfo clGetKernelWorkGroupInfo;
extern PFN_clWaitForEvents clWaitForEvents;
extern PFN_clGetEventInfo clGetEventInfo;
extern PFN_clCreateUserEvent clCreateUserEvent;
extern PFN_clRetainEvent clRetainEvent;
extern PFN_clReleaseEvent clReleaseEvent;
extern PFN_clSetUserEventStatus clSetUserEventStatus;
extern PFN_clSetEventCallback clSetEventCallback;
extern PFN_clGetEventProfilingInfo clGetEventProfilingInfo;
extern PFN_clFlush clFlush;
extern PFN_clFinish clFinish;
extern PFN_clEnqueueReadBuffer clEnqueueReadBuffer;
extern PFN_clEnqueueReadBufferRect clEnqueueReadBufferRect;
extern PFN_clEnqueueWriteBuffer clEnqueueWriteBuffer;
extern PFN_clEnqueueWriteBufferRect clEnqueueWriteBufferRect;
extern PFN_clEnqueueFillBuffer clEnqueueFillBuffer;
extern PFN_clEnqueueCopyBuffer clEnqueueCopyBuffer;
extern PFN_clEnqueueCopyBufferRect clEnqueueCopyBufferRect;
extern PFN_clEnqueueReadImage clEnqueueReadImage;
extern PFN_clEnqueueWriteImage clEnqueueWriteImage;
extern PFN_clEnqueueFillImage clEnqueueFillImage;
extern PFN_clEnqueueCopyImage clEnqueueCopyImage;
extern PFN_clEnqueueCopyImageToBuffer clEnqueueCopyImageToBuffer;
extern PFN_clEnqueueCopyBufferToImage clEnqueueCopyBufferToImage;
extern PFN_clEnqueueMapBuffer clEnqueueMapBuffer;
extern PFN_clEnqueueMapImage clEnqueueMapImage;
extern PFN_clEnqueueUnmapMemObject clEnqueueUnmapMemObject;
extern PFN_clEnqueueMigrateMemObjects clEnqueueMigrateMemObjects;
extern PFN_clEnqueueNDRangeKernel clEnqueueNDRangeKernel;
extern PFN_clEnqueueNativeKernel clEnqueueNativeKernel;
extern PFN_clEnqueueMarkerWithWaitList clEnqueueMarkerWithWaitList;
extern PFN_clEnqueueBarrierWithWaitList clEnqueueBarrierWithWaitList;
extern PFN_clEnqueueSVMFree clEnqueueSVMFree;
extern PFN_clEnqueueSVMMemcpy clEnqueueSVMMemcpy;
extern PFN_clEnqueueSVMMemFill clEnqueueSVMMemFill;
extern PFN_clEnqueueSVMMap clEnqueueSVMMap;
extern PFN_clEnqueueSVMUnmap clEnqueueSVMUnmap;
extern PFN_clGetExtensionFunctionAddressForPlatform
    clGetExtensionFunctionAddressForPlatform;
extern PFN_clCreateImage2D clCreateImage2D;
extern PFN_clCreateImage3D clCreateImage3D;
extern PFN_clEnqueueMarker clEnqueueMarker;
extern PFN_clEnqueueWaitForEvents clEnqueueWaitForEvents;
extern PFN_clEnqueueBarrier clEnqueueBarrier;
extern PFN_clUnloadCompiler clUnloadCompiler;
extern PFN_clGetExtensionFunctionAddress clGetExtensionFunctionAddress;
extern PFN_clCreateCommandQueue clCreateCommandQueue;
extern PFN_clCreateSampler clCreateSampler;
extern PFN_clEnqueueTask clEnqueueTask;

// OpenGL sharing
extern PFN_clCreateFromGLBuffer clCreateFromGLBuffer;
extern PFN_clCreateFromGLTexture clCreateFromGLTexture;
extern PFN_clEnqueueAcquireGLObjects clEnqueueAcquireGLObjects;
extern PFN_clEnqueueReleaseGLObjects clEnqueueReleaseGLObjects;

// cl_khr_egl_event extension
extern PFN_clCreateEventFromEGLSyncKHR clCreateEventFromEGLSyncKHR;

// EGL sharing
extern PFN_clCreateFromEGLImageKHR clCreateFromEGLImageKHR;
extern PFN_clEnqueueAcquireEGLObjectsKHR clEnqueueAcquireEGLObjectsKHR;
extern PFN_clEnqueueReleaseEGLObjectsKHR clEnqueueReleaseEGLObjectsKHR;

// cl_khr_command_buffer extension
extern PFN_clCreateCommandBufferKHR clCreateCommandBufferKHR;
extern PFN_clRetainCommandBufferKHR clRetainCommandBufferKHR;
extern PFN_clReleaseCommandBufferKHR clReleaseCommandBufferKHR;
extern PFN_clFinalizeCommandBufferKHR clFinalizeCommandBufferKHR;
extern PFN_clEnqueueCommandBufferKHR clEnqueueCommandBufferKHR;
extern PFN_clCommandNDRangeKernelKHR clCommandNDRangeKernelKHR;
extern PFN_clGetCommandBufferInfoKHR clGetCommandBufferInfoKHR;

// For convenient image creation
// It uses clCreateImage if it available (clCreateImage available since cl 1.2)
// otherwise it will use legacy clCreateImage2D
cl_mem CreateImage2DLegacy(cl_context context, cl_mem_flags flags,
                           const cl_image_format* image_format,
                           const cl_image_desc* image_desc,
                           const void* host_ptr, cl_int* errcode_ret);

// It uses clCreateImage if it available (clCreateImage available since cl 1.2)
// otherwise it will use legacy clCreateImage3D
cl_mem CreateImage3DLegacy(cl_context context, cl_mem_flags flags,
                           const cl_image_format* image_format,
                           const cl_image_desc* image_desc, void* host_ptr,
                           cl_int* errcode_ret);
bool OpenCLSupported();
#endif

} // namespace fuai
#if defined(_WIN32)
#define __WINDOWS__
#endif

#ifdef __WINDOWS__
#include <windows.h>
#else
#include <dlfcn.h>
#endif

#include "opencl_wrapper.h"
#include <iostream>
#include <string>

namespace fuai {

#ifdef __WINDOWS__
static HMODULE libopencl_handle = nullptr;
#else
static void *libopencl_handle = nullptr;
#endif

#ifdef __ANDROID__
#define LoadFunction(function)                                                 \
  if (use_wrapper) {                                                           \
    function = reinterpret_cast<PFN_##function>(loadOpenCLPointer(#function)); \
  } else {                                                                     \
    function = reinterpret_cast<PFN_##function>(dlsym(libopencl, #function));  \
  }
#elif defined(__WINDOWS__)
#define LoadFunction(function)                                                 \
  function =                                                                   \
      reinterpret_cast<PFN_##function>(GetProcAddress(libopencl, #function));
#else
#define LoadFunction(function)                                                 \
  function = reinterpret_cast<PFN_##function>(dlsym(libopencl, #function));
#endif

#ifdef __WINDOWS__
void LoadOpenCLFunctions(HMODULE libopencl);
#else
void LoadOpenCLFunctions(void *libopencl, bool use_wrapper);
#endif

bool LoadOpenCLImpl() {
#ifdef __WINDOWS__
  libopencl_handle = LoadLibraryA("OpenCL.dll");
  if (libopencl_handle) {
    LoadOpenCLFunctions(libopencl_handle);
    return true;
  } else {
    DWORD error_code = GetLastError();
    std::string msg =
        "Can not open OpenCL library on this device, error code - ";
    msg += std::to_string(error_code);
    std::cerr << msg << std::endl;
    return false;
  }
#else
#ifdef __ANDROID__
  // Pixel phone or auto?
  libopencl_handle = dlopen("libOpenCL-pixel.so", RTLD_NOW | RTLD_LOCAL);
  if (!libopencl_handle) {
    libopencl_handle = dlopen("libOpenCL-car.so", RTLD_NOW | RTLD_LOCAL);
  }
  if (libopencl_handle) {
    typedef void (*enableOpenCL_t)();
    enableOpenCL_t enableOpenCL = reinterpret_cast<enableOpenCL_t>(
        dlsym(libopencl_handle, "enableOpenCL"));
    enableOpenCL();
    LoadOpenCLFunctions(libopencl_handle, true);
    return true;
  }
#endif
#ifdef __APPLE__
  static const char *kClLibName =
      "/System/Library/Frameworks/OpenCL.framework/OpenCL";
#else
  static const char *kClLibName = "libOpenCL.so";
#endif
  libopencl_handle = dlopen(kClLibName, RTLD_NOW | RTLD_LOCAL);
  if (libopencl_handle) {
    LoadOpenCLFunctions(libopencl_handle, false);
    return true;
  }
  // record error
  std::string error(dlerror());
  std::string msg = "Can not open OpenCL library on this device - ";
  msg += error;
  std::cerr << msg << std::endl;
  return false;
#endif
}

bool LoadOpenCL() {
  static bool init = false;
  if (!init) {
    init = LoadOpenCLImpl();
  }
  return init;
}

bool CloseOpenCLImpl() {
#ifdef __WINDOWS__
  if (!FreeLibrary(libopencl_handle)) {
    DWORD error_code = GetLastError();
    std::string msg =
        "Can not release OpenCL library on this device, error code - ";
    msg += std::to_string(error_code);
    std::cerr << msg << std::endl;
    return false;
  }
#else
  // Pixel phone or auto?
  if (dlclose(libopencl_handle) != 0) {
    std::string err(dlerror());
    std::string msg = "libOpenCL dlclose failed." + err;
    std::cerr << msg << std::endl;
    return false;
  }
#endif
  libopencl_handle = nullptr;
  return true;
}
bool CloseOpenCL() { return CloseOpenCLImpl(); }
#ifdef __WINDOWS__
void LoadOpenCLFunctions(HMODULE libopencl) {
#else
void LoadOpenCLFunctions(void *libopencl, bool use_wrapper) {
#ifdef __ANDROID__
  typedef void *(*loadOpenCLPointer_t)(const char *name);
  loadOpenCLPointer_t loadOpenCLPointer;
  if (use_wrapper) {
    loadOpenCLPointer = reinterpret_cast<loadOpenCLPointer_t>(
        dlsym(libopencl, "loadOpenCLPointer"));
  }
#endif
#endif

  LoadFunction(clGetPlatformIDs);
  LoadFunction(clGetPlatformInfo);
  LoadFunction(clGetDeviceIDs);
  LoadFunction(clGetDeviceInfo);
  LoadFunction(clCreateSubDevices);
  LoadFunction(clRetainDevice);
  LoadFunction(clReleaseDevice);
  LoadFunction(clCreateContext);
  LoadFunction(clCreateContextFromType);
  LoadFunction(clRetainContext);
  LoadFunction(clReleaseContext);
  LoadFunction(clGetContextInfo);
  LoadFunction(clCreateCommandQueueWithProperties);
  LoadFunction(clRetainCommandQueue);
  LoadFunction(clReleaseCommandQueue);
  LoadFunction(clGetCommandQueueInfo);
  LoadFunction(clCreateBuffer);
  LoadFunction(clCreateSubBuffer);
  LoadFunction(clCreateImage);
  LoadFunction(clCreatePipe);
  LoadFunction(clRetainMemObject);
  LoadFunction(clReleaseMemObject);
  LoadFunction(clGetSupportedImageFormats);
  LoadFunction(clGetMemObjectInfo);
  LoadFunction(clGetImageInfo);
  LoadFunction(clGetPipeInfo);
  LoadFunction(clSetMemObjectDestructorCallback);
  LoadFunction(clSVMAlloc);
  LoadFunction(clSVMFree);
  LoadFunction(clCreateSamplerWithProperties);
  LoadFunction(clRetainSampler);
  LoadFunction(clReleaseSampler);
  LoadFunction(clGetSamplerInfo);
  LoadFunction(clCreateProgramWithSource);
  LoadFunction(clCreateProgramWithBinary);
  LoadFunction(clCreateProgramWithBuiltInKernels);
  LoadFunction(clRetainProgram);
  LoadFunction(clReleaseProgram);
  LoadFunction(clBuildProgram);
  LoadFunction(clCompileProgram);
  LoadFunction(clLinkProgram);
  LoadFunction(clUnloadPlatformCompiler);
  LoadFunction(clGetProgramInfo);
  LoadFunction(clGetProgramBuildInfo);
  LoadFunction(clCreateKernel);
  LoadFunction(clCreateKernelsInProgram);
  LoadFunction(clRetainKernel);
  LoadFunction(clReleaseKernel);
  LoadFunction(clSetKernelArg);
#if (defined(ANDROID) || defined(__ANDROID__))
  LoadFunction(clNewRecordingQCOM);
  LoadFunction(clEndRecordingQCOM);
  LoadFunction(clEnqueueRecordingQCOM);
  LoadFunction(clReleaseRecordingQCOM);
#endif
  LoadFunction(clSetKernelArgSVMPointer);
  LoadFunction(clSetKernelExecInfo);
  LoadFunction(clGetKernelInfo);
  LoadFunction(clGetKernelArgInfo);
  LoadFunction(clGetKernelWorkGroupInfo);
  LoadFunction(clWaitForEvents);
  LoadFunction(clGetEventInfo);
  LoadFunction(clCreateUserEvent);
  LoadFunction(clRetainEvent);
  LoadFunction(clReleaseEvent);
  LoadFunction(clSetUserEventStatus);
  LoadFunction(clSetEventCallback);
  LoadFunction(clGetEventProfilingInfo);
  LoadFunction(clFlush);
  LoadFunction(clFinish);
  LoadFunction(clEnqueueReadBuffer);
  LoadFunction(clEnqueueReadBufferRect);
  LoadFunction(clEnqueueWriteBuffer);
  LoadFunction(clEnqueueWriteBufferRect);
  LoadFunction(clEnqueueFillBuffer);
  LoadFunction(clEnqueueCopyBuffer);
  LoadFunction(clEnqueueCopyBufferRect);
  LoadFunction(clEnqueueReadImage);
  LoadFunction(clEnqueueWriteImage);
  LoadFunction(clEnqueueFillImage);
  LoadFunction(clEnqueueCopyImage);
  LoadFunction(clEnqueueCopyImageToBuffer);
  LoadFunction(clEnqueueCopyBufferToImage);
  LoadFunction(clEnqueueMapBuffer);
  LoadFunction(clEnqueueMapImage);
  LoadFunction(clEnqueueUnmapMemObject);
  LoadFunction(clEnqueueMigrateMemObjects);
  LoadFunction(clEnqueueNDRangeKernel);
  LoadFunction(clEnqueueNativeKernel);
  LoadFunction(clEnqueueMarkerWithWaitList);
  LoadFunction(clEnqueueBarrierWithWaitList);
  LoadFunction(clEnqueueSVMFree);
  LoadFunction(clEnqueueSVMMemcpy);
  LoadFunction(clEnqueueSVMMemFill);
  LoadFunction(clEnqueueSVMMap);
  LoadFunction(clEnqueueSVMUnmap);
  LoadFunction(clGetExtensionFunctionAddressForPlatform);
  LoadFunction(clCreateImage2D);
  LoadFunction(clCreateImage3D);
  LoadFunction(clEnqueueMarker);
  LoadFunction(clEnqueueWaitForEvents);
  LoadFunction(clEnqueueBarrier);
  LoadFunction(clUnloadCompiler);
  LoadFunction(clGetExtensionFunctionAddress);
  LoadFunction(clCreateCommandQueue);
  LoadFunction(clCreateSampler);
  LoadFunction(clEnqueueTask);

  // OpenGL sharing
  LoadFunction(clCreateFromGLBuffer);
  LoadFunction(clCreateFromGLTexture);
  LoadFunction(clEnqueueAcquireGLObjects);
  LoadFunction(clEnqueueReleaseGLObjects);

  // cl_khr_egl_event extension
  LoadFunction(clCreateEventFromEGLSyncKHR);

  // EGL sharing
  LoadFunction(clCreateFromEGLImageKHR);
  LoadFunction(clEnqueueAcquireEGLObjectsKHR);
  LoadFunction(clEnqueueReleaseEGLObjectsKHR);

  // cl_khr_command_buffer extension
  LoadFunction(clCreateCommandBufferKHR);
  LoadFunction(clRetainCommandBufferKHR);
  LoadFunction(clReleaseCommandBufferKHR);
  LoadFunction(clFinalizeCommandBufferKHR);
  LoadFunction(clEnqueueCommandBufferKHR);
  LoadFunction(clCommandNDRangeKernelKHR);
  LoadFunction(clGetCommandBufferInfoKHR);

  LoadQcomExtensionFunctions();
}

// No OpenCL support, do not set function addresses
PFN_clGetPlatformIDs clGetPlatformIDs;
PFN_clGetPlatformInfo clGetPlatformInfo;
PFN_clGetDeviceIDs clGetDeviceIDs;
PFN_clGetDeviceInfo clGetDeviceInfo;
PFN_clCreateSubDevices clCreateSubDevices;
PFN_clRetainDevice clRetainDevice;
PFN_clReleaseDevice clReleaseDevice;
PFN_clCreateContext clCreateContext;
PFN_clCreateContextFromType clCreateContextFromType;
PFN_clRetainContext clRetainContext;
PFN_clReleaseContext clReleaseContext;
PFN_clGetContextInfo clGetContextInfo;
PFN_clCreateCommandQueueWithProperties clCreateCommandQueueWithProperties;
PFN_clRetainCommandQueue clRetainCommandQueue;
PFN_clReleaseCommandQueue clReleaseCommandQueue;
PFN_clGetCommandQueueInfo clGetCommandQueueInfo;
PFN_clCreateBuffer clCreateBuffer;
PFN_clCreateSubBuffer clCreateSubBuffer;
PFN_clCreateImage clCreateImage;
PFN_clCreatePipe clCreatePipe;
PFN_clRetainMemObject clRetainMemObject;
PFN_clReleaseMemObject clReleaseMemObject;
PFN_clGetSupportedImageFormats clGetSupportedImageFormats;
PFN_clGetMemObjectInfo clGetMemObjectInfo;
PFN_clGetImageInfo clGetImageInfo;
PFN_clGetPipeInfo clGetPipeInfo;
PFN_clSetMemObjectDestructorCallback clSetMemObjectDestructorCallback;
PFN_clSVMAlloc clSVMAlloc;
PFN_clSVMFree clSVMFree;
PFN_clCreateSamplerWithProperties clCreateSamplerWithProperties;
PFN_clRetainSampler clRetainSampler;
PFN_clReleaseSampler clReleaseSampler;
PFN_clGetSamplerInfo clGetSamplerInfo;
PFN_clCreateProgramWithSource clCreateProgramWithSource;
PFN_clCreateProgramWithBinary clCreateProgramWithBinary;
PFN_clCreateProgramWithBuiltInKernels clCreateProgramWithBuiltInKernels;
PFN_clRetainProgram clRetainProgram;
PFN_clReleaseProgram clReleaseProgram;
PFN_clBuildProgram clBuildProgram;
PFN_clCompileProgram clCompileProgram;
PFN_clLinkProgram clLinkProgram;
PFN_clUnloadPlatformCompiler clUnloadPlatformCompiler;
PFN_clGetProgramInfo clGetProgramInfo;
PFN_clGetProgramBuildInfo clGetProgramBuildInfo;
PFN_clCreateKernel clCreateKernel;
PFN_clCreateKernelsInProgram clCreateKernelsInProgram;
PFN_clRetainKernel clRetainKernel;
PFN_clReleaseKernel clReleaseKernel;
PFN_clSetKernelArg clSetKernelArg;
PFN_clSetKernelArgSVMPointer clSetKernelArgSVMPointer;
PFN_clSetKernelExecInfo clSetKernelExecInfo;
PFN_clGetKernelInfo clGetKernelInfo;
PFN_clGetKernelArgInfo clGetKernelArgInfo;
PFN_clGetKernelWorkGroupInfo clGetKernelWorkGroupInfo;
PFN_clWaitForEvents clWaitForEvents;
PFN_clGetEventInfo clGetEventInfo;
PFN_clCreateUserEvent clCreateUserEvent;
PFN_clRetainEvent clRetainEvent;
PFN_clReleaseEvent clReleaseEvent;
PFN_clSetUserEventStatus clSetUserEventStatus;
PFN_clSetEventCallback clSetEventCallback;
PFN_clGetEventProfilingInfo clGetEventProfilingInfo;
PFN_clFlush clFlush;
PFN_clFinish clFinish;
#if (defined(ANDROID) || defined(__ANDROID__))
PFN_clNewRecordingQCOM clNewRecordingQCOM;
PFN_clEndRecordingQCOM clEndRecordingQCOM;
PFN_clEnqueueRecordingQCOM clEnqueueRecordingQCOM;
PFN_clReleaseRecordingQCOM clReleaseRecordingQCOM;
#endif
PFN_clEnqueueReadBuffer clEnqueueReadBuffer;
PFN_clEnqueueReadBufferRect clEnqueueReadBufferRect;
PFN_clEnqueueWriteBuffer clEnqueueWriteBuffer;
PFN_clEnqueueWriteBufferRect clEnqueueWriteBufferRect;
PFN_clEnqueueFillBuffer clEnqueueFillBuffer;
PFN_clEnqueueCopyBuffer clEnqueueCopyBuffer;
PFN_clEnqueueCopyBufferRect clEnqueueCopyBufferRect;
PFN_clEnqueueReadImage clEnqueueReadImage;
PFN_clEnqueueWriteImage clEnqueueWriteImage;
PFN_clEnqueueFillImage clEnqueueFillImage;
PFN_clEnqueueCopyImage clEnqueueCopyImage;
PFN_clEnqueueCopyImageToBuffer clEnqueueCopyImageToBuffer;
PFN_clEnqueueCopyBufferToImage clEnqueueCopyBufferToImage;
PFN_clEnqueueMapBuffer clEnqueueMapBuffer;
PFN_clEnqueueMapImage clEnqueueMapImage;
PFN_clEnqueueUnmapMemObject clEnqueueUnmapMemObject;
PFN_clEnqueueMigrateMemObjects clEnqueueMigrateMemObjects;
PFN_clEnqueueNDRangeKernel clEnqueueNDRangeKernel;
PFN_clEnqueueNativeKernel clEnqueueNativeKernel;
PFN_clEnqueueMarkerWithWaitList clEnqueueMarkerWithWaitList;
PFN_clEnqueueBarrierWithWaitList clEnqueueBarrierWithWaitList;
PFN_clEnqueueSVMFree clEnqueueSVMFree;
PFN_clEnqueueSVMMemcpy clEnqueueSVMMemcpy;
PFN_clEnqueueSVMMemFill clEnqueueSVMMemFill;
PFN_clEnqueueSVMMap clEnqueueSVMMap;
PFN_clEnqueueSVMUnmap clEnqueueSVMUnmap;
PFN_clGetExtensionFunctionAddressForPlatform
    clGetExtensionFunctionAddressForPlatform;
PFN_clCreateImage2D clCreateImage2D;
PFN_clCreateImage3D clCreateImage3D;
PFN_clEnqueueMarker clEnqueueMarker;
PFN_clEnqueueWaitForEvents clEnqueueWaitForEvents;
PFN_clEnqueueBarrier clEnqueueBarrier;
PFN_clUnloadCompiler clUnloadCompiler;
PFN_clGetExtensionFunctionAddress clGetExtensionFunctionAddress;
PFN_clCreateCommandQueue clCreateCommandQueue;
PFN_clCreateSampler clCreateSampler;
PFN_clEnqueueTask clEnqueueTask;

// OpenGL sharing
PFN_clCreateFromGLBuffer clCreateFromGLBuffer;
PFN_clCreateFromGLTexture clCreateFromGLTexture;
PFN_clEnqueueAcquireGLObjects clEnqueueAcquireGLObjects;
PFN_clEnqueueReleaseGLObjects clEnqueueReleaseGLObjects;

// cl_khr_egl_event extension
PFN_clCreateEventFromEGLSyncKHR clCreateEventFromEGLSyncKHR;

// EGL sharing
PFN_clCreateFromEGLImageKHR clCreateFromEGLImageKHR;
PFN_clEnqueueAcquireEGLObjectsKHR clEnqueueAcquireEGLObjectsKHR;
PFN_clEnqueueReleaseEGLObjectsKHR clEnqueueReleaseEGLObjectsKHR;

// cl_khr_command_buffer extension
PFN_clCreateCommandBufferKHR clCreateCommandBufferKHR;
PFN_clRetainCommandBufferKHR clRetainCommandBufferKHR;
PFN_clReleaseCommandBufferKHR clReleaseCommandBufferKHR;
PFN_clFinalizeCommandBufferKHR clFinalizeCommandBufferKHR;
PFN_clEnqueueCommandBufferKHR clEnqueueCommandBufferKHR;
PFN_clCommandNDRangeKernelKHR clCommandNDRangeKernelKHR;
PFN_clGetCommandBufferInfoKHR clGetCommandBufferInfoKHR;

DEFINE_QCOM_FUNCTION_PTRS

cl_mem CreateImage2DLegacy(cl_context context, cl_mem_flags flags,
                           const cl_image_format *image_format,
                           const cl_image_desc *image_desc,
                           const void *host_ptr, cl_int *errcode_ret) {
  if (0) { // clCreateImage available since OpenCL 1.2
    return clCreateImage(context, flags, image_format, image_desc,
                         const_cast<void *>(host_ptr), errcode_ret);
  } else {
    return clCreateImage2D(context, flags, image_format,
                           image_desc->image_width, image_desc->image_height,
                           image_desc->image_row_pitch,
                           const_cast<void *>(host_ptr), errcode_ret);
  }
}

cl_mem CreateImage3DLegacy(cl_context context, cl_mem_flags flags,
                           const cl_image_format *image_format,
                           const cl_image_desc *image_desc, void *host_ptr,
                           cl_int *errcode_ret) {
  if (clCreateImage) { // clCreateImage available since OpenCL 1.2
    return clCreateImage(context, flags, image_format, image_desc, host_ptr,
                         errcode_ret);
  } else {
    return clCreateImage3D(context, flags, image_format,
                           image_desc->image_width, image_desc->image_height,
                           image_desc->image_depth, image_desc->image_row_pitch,
                           image_desc->image_slice_pitch, host_ptr,
                           errcode_ret);
  }
}
bool OpenCLSupported() { return LoadOpenCL(); }
} // namespace fuai

原生OpenCL

pipeline

#include <iostream>
#include <string>
#include <vector>

#include "opencl_wrapper.h"
const char *mad_kernel_source =
    R"(__kernel void mad_8uTo8u_elm4(__global const uchar *srcptr,
                              __global uchar *dstptr, int elm_size,
                              float multipler, float addend) {
  int offset = get_global_id(0) * 4;
  if (offset + 4 <= elm_size) {
    float4 pix = convert_float4(vload4(0, srcptr + offset));
    pix.x = mad(pix.x, multipler, addend);
    pix.y = mad(pix.y, multipler, addend);
    pix.z = mad(pix.z, multipler, addend);
    pix.w = mad(pix.w, multipler, addend);
    vstore4(convert_uchar4_sat_rtz(pix), 0, dstptr + offset);
  } else if (offset + 3 == elm_size) {
    float3 pix = convert_float3(vload3(0, srcptr + offset));
    pix.x = mad(pix.x, multipler, addend);
    pix.y = mad(pix.y, multipler, addend);
    pix.z = mad(pix.z, multipler, addend);
    vstore3(convert_uchar3_sat_rtz(pix), 0, dstptr + offset);
  } else if (offset + 2 == elm_size) {
    float2 pix = convert_float2(vload2(0, srcptr + offset));
    pix.x = mad(pix.x, multipler, addend);
    pix.y = mad(pix.y, multipler, addend);
    vstore2(convert_uchar2_sat_rtz(pix), 0, dstptr + offset);
  } else if (offset + 1 == elm_size) {
    (dstptr + offset)[0] = convert_uchar_sat_rtz(
        mad(convert_float((srcptr + offset)[0]), multipler, addend));
  }
})";

int main() {
  std::cout << "Loaded opencl." << std::endl;
  fuai::LoadOpenCL();

  cl_platform_id platform_id = nullptr;
  cl_uint num_platforms;
  cl_int ret = fuai::clGetPlatformIDs(1, &platform_id, &num_platforms);

  if (ret != CL_SUCCESS) {
    std::cerr << "Failed to get OpenCL platform. Error: " << ret << std::endl;
    return EXIT_FAILURE;
  }

  // device_id
  cl_device_id device_id = nullptr;
  cl_uint num_devices;
  ret = fuai::clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_GPU, 1, &device_id,
                             &num_devices);
  if (ret != CL_SUCCESS) {
    std::cerr << "Failed to get OpenCL device. Error: " << ret << std::endl;
    return EXIT_FAILURE;
  }
  // context
  cl_context context =
      fuai::clCreateContext(nullptr, 1, &device_id, nullptr, nullptr, &ret);
  if (ret != CL_SUCCESS) {
    std::cerr << "Failed to create OpenCL context. Error: " << ret << std::endl;
    fuai::clReleaseDevice(device_id);
    return EXIT_FAILURE;
  }
  cl_command_queue command_queue =
      fuai::clCreateCommandQueueWithProperties(context, device_id, 0, &ret);
  if (ret != CL_SUCCESS) {
    std::cerr << "Failed to create command queue. Error: " << ret << std::endl;
    fuai::clReleaseContext(context);
    fuai::clReleaseDevice(device_id);
    return EXIT_FAILURE;
  }

  cl_program program = fuai::clCreateProgramWithSource(
      context, 1, &mad_kernel_source, nullptr, &ret);
  if (ret != CL_SUCCESS) {
    std::cout << "Failed to get program from source. Error code: " << ret
              << std::endl;
    return EXIT_FAILURE;
  }
  ret = fuai::clBuildProgram(program, 1, &device_id, nullptr, nullptr, nullptr);
  if (ret != CL_SUCCESS) {
    std::cout << "Failed to build OpenCL program. Error code: " << ret
              << std::endl;
    return EXIT_FAILURE;
  }
  cl_kernel kernel = fuai::clCreateKernel(program, "mad_8uTo8u_elm4", &ret);
  if (ret != CL_SUCCESS) {
    std::cout << "Failed to create OpenCL kernel. Error code: " << ret
              << std::endl;
    return EXIT_FAILURE;
  }

  // Create input/output buffer
  const int image_width = 512;
  const int image_height = 512;
  const int image_channel = 3;
  const int image_data_size = image_width * image_height * image_channel;

  std::vector<uint8_t> input_image(image_data_size, 128);
  std::vector<uint8_t> output_image(image_data_size, 0);

#if 1
  // CL_MEM_READ_ONLY, CL_MEM_WRITE_ONLY, CL_MEM_READ_WRITE
  // CL_MEM_COPY_HOST_PTR is important for initing data from host buffer.
  cl_mem cl_input_mem = fuai::clCreateBuffer(
      context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, image_data_size,
      input_image.data(), &ret); // copy from host ptr into cl_mem
  // Or use the below
#else
  cl_mem cl_input_mem = fuai::clCreateBuffer(context, CL_MEM_READ_ONLY,
                                             image_data_size, nullptr, &ret);
  ret = fuai::clEnqueueWriteBuffer(command_queue, cl_input_mem, true, 0,
                                   image_data_size, input_image.data(), 0,
                                   nullptr, nullptr);
#endif
  cl_mem cl_output_mem = fuai::clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                                              image_data_size, nullptr, &ret);

  float multipler = 0.2;
  float addend = 10;

  // result will be:
  // 128 * 0.2 + 10 = 35.6

  fuai::clSetKernelArg(kernel, 0, sizeof(cl_mem), &cl_input_mem);
  fuai::clSetKernelArg(kernel, 1, sizeof(cl_mem), &cl_output_mem);
  fuai::clSetKernelArg(kernel, 2, sizeof(int), &image_data_size);
  fuai::clSetKernelArg(kernel, 3, sizeof(float), &multipler);
  fuai::clSetKernelArg(kernel, 4, sizeof(float), &addend);

  const size_t global_worker_size =
      image_data_size / 4 + (image_data_size % 4 > 0);
  const size_t local_worker_size = 1;

  ret = fuai::clEnqueueNDRangeKernel(command_queue, kernel, 1, nullptr,
                                     &global_worker_size, &local_worker_size, 0,
                                     nullptr, nullptr);
  if (ret != CL_SUCCESS) {
    std::cerr << "clEnqueueNDRangeKernel failed: " << ret << std::endl;
    return EXIT_FAILURE;
  }
  fuai::clFinish(command_queue);

  // Read the data out.
  ret = fuai::clEnqueueReadBuffer(command_queue, cl_output_mem, true, 0,
                                  image_data_size, output_image.data(), 0,
                                  nullptr, nullptr);
  std::cout << int(output_image[0]) << ", " << int(output_image[1])
            << std::endl;

  ret = fuai::clReleaseMemObject(cl_input_mem);
  ret = fuai::clReleaseMemObject(cl_output_mem);

  fuai::clReleaseKernel(kernel);
  fuai::clReleaseProgram(program);
  // 释放
  if (command_queue) {
    fuai::clReleaseCommandQueue(command_queue);
    command_queue = nullptr;
  }
  if (context) {
    fuai::clReleaseContext(context);
    context = nullptr;
  }
  if (device_id) {
    fuai::clReleaseDevice(device_id);
    device_id = nullptr;
  }
  fuai::CloseOpenCL();
  std::cout << "Closed opencl." << std::endl;
  return 0;
}

check extension

bool checkExtensionSupported(cl_device_id device_id, , const std::string &ext_name) {
  if (extensions_ == "") {
    size_t ext_len = 0;
    cl_int ret = fuai::clGetDeviceInfo(device_id, CL_DEVICE_EXTENSIONS, 0, nullptr,
                                 &ext_len);
    if (ret != CL_SUCCESS) {
      printf("Failed to get extensions size. Error: %d", ret);
      return false;
    }

    std::vector<char> ext_buf(ext_len, '\0');
    ret = fuai::clGetDeviceInfo(device_id, CL_DEVICE_EXTENSIONS, ext_len,
                          ext_buf.data(), nullptr);
    if (ret != CL_SUCCESS) {
      printf("Failed to get device extensions. Error: %d", ret);
      extensions_ = "";
      return false;
    }
    extensions_ = std::string(ext_buf.begin(), ext_buf.end());
  }
  return extensions_.find(ext_name) != std::string::npos;
}

CLHPP

#include <iostream>
#include <string>
#include <vector>

const char *mad_kernel_source =
    R"(__kernel void mad_8uTo8u_elm4(__global const uchar *srcptr,
                              __global uchar *dstptr, int elm_size,
                              float multipler, float addend) {
  int offset = get_global_id(0) * 4;
  if (offset + 4 <= elm_size) {
    float4 pix = convert_float4(vload4(0, srcptr + offset));
    pix.x = mad(pix.x, multipler, addend);
    pix.y = mad(pix.y, multipler, addend);
    pix.z = mad(pix.z, multipler, addend);
    pix.w = mad(pix.w, multipler, addend);
    vstore4(convert_uchar4_sat_rtz(pix), 0, dstptr + offset);
  } else if (offset + 3 == elm_size) {
    float3 pix = convert_float3(vload3(0, srcptr + offset));
    pix.x = mad(pix.x, multipler, addend);
    pix.y = mad(pix.y, multipler, addend);
    pix.z = mad(pix.z, multipler, addend);
    vstore3(convert_uchar3_sat_rtz(pix), 0, dstptr + offset);
  } else if (offset + 2 == elm_size) {
    float2 pix = convert_float2(vload2(0, srcptr + offset));
    pix.x = mad(pix.x, multipler, addend);
    pix.y = mad(pix.y, multipler, addend);
    vstore2(convert_uchar2_sat_rtz(pix), 0, dstptr + offset);
  } else if (offset + 1 == elm_size) {
    (dstptr + offset)[0] = convert_uchar_sat_rtz(
        mad(convert_float((srcptr + offset)[0]), multipler, addend));
  }
})";

#define CL_HPP_ENABLE_EXCEPTIONS
#ifdef CL_TARGET_OPENCL_VERSION
#undef CL_TARGET_OPENCL_VERSION
#define CL_TARGET_OPENCL_VERSION 200
#endif
#define CL_HPP_MINIMUM_OPENCL_VERSION 200
#define CL_HPP_TARGET_OPENCL_VERSION 200
#include "opencl.hpp"

int main() {
  fuai::LoadOpenCL();
  {
    fuai::cl::Device device;
    fuai::cl::Context context;
    fuai::cl::CommandQueue commandQueue;
    try {
      std::vector<fuai::cl::Platform> platforms;
      fuai::cl::Platform::get(&platforms);
      for (const auto &plat : platforms) {
        std::vector<fuai::cl::Device> devices;
        plat.getDevices(CL_DEVICE_TYPE_GPU, &devices);
        for (const auto &d : devices) {
          if (d.getInfo<CL_DEVICE_AVAILABLE>()) {
            std::cout << "Creating context using Platform: "
                      << plat.getInfo<CL_PLATFORM_NAME>() << std::endl;
            // Get the first available device.
            device = d;
            context = fuai::cl::Context(device);
            commandQueue = fuai::cl::CommandQueue(context, device);
          }
        }
      }
    } catch (fuai::cl::Error &clError) {
      std::cout << "OpenCL error: " << clError.what() << ", " << clError.err()
                << std::endl;
      return EXIT_FAILURE;
    }

    std::string buildOpts = "";
    fuai::cl::Program program(
        context, fuai::cl::Program::Sources(
                     std::vector<std::string>({mad_kernel_source})));
    try {
      program.build(device, buildOpts);
    } catch (const fuai::cl::Error &err) {
      std::string log = program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(device);
      std::string error_msg = "Failed to build program.\nBuild log: " + log;
      std::cerr << error_msg << std::endl;
      return EXIT_FAILURE;
    }

    fuai::cl::Kernel kernel;
    try {
      // Get the kernel from program.
      kernel = fuai::cl::Kernel(program, "mad_8uTo8u_elm4");
    } catch (const fuai::cl::Error &err) {
      std::string error_msg = "Failed to get kernel: mad_8uTo8u_elm4";
      error_msg = error_msg + err.what() + std::to_string(err.err());
      std::cerr << error_msg << std::endl;
      return EXIT_FAILURE;
    }

    // Create input/output buffer
    const int image_width = 512;
    const int image_height = 512;
    const int image_channel = 3;
    const int image_data_size = image_width * image_height * image_channel;

    std::vector<uint8_t> input_image(image_data_size, 128);
    std::vector<uint8_t> output_image(image_data_size, 0);

#if 0
    std::cout << "Copy from host ptr" << std::endl;
    fuai::cl::Buffer cl_input_mem(
        context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
        sizeof(uint8_t) * image_data_size, input_image.data());
    fuai::cl::Buffer cl_output_mem(context, CL_MEM_WRITE_ONLY,
                                   sizeof(uint8_t) * image_data_size, nullptr);
#else
    std::cout << "Write host data into mem" << std::endl;
    fuai::cl::Buffer cl_input_mem(context, CL_MEM_READ_ONLY,
                                  sizeof(uint8_t) * image_data_size, nullptr);
    fuai::cl::Buffer cl_output_mem(context, CL_MEM_WRITE_ONLY,
                                   sizeof(uint8_t) * image_data_size, nullptr);
    commandQueue.enqueueWriteBuffer(cl_input_mem, true, 0, image_data_size,
                                    input_image.data(), nullptr, nullptr);
#endif
    float multipler = 0.2;
    float addend = 10;

    try {
      kernel.setArg(0, cl_input_mem);
      kernel.setArg(1, cl_output_mem);
      kernel.setArg(2, image_data_size);
      kernel.setArg(3, multipler);
      kernel.setArg(4, addend);
    } catch (const fuai::cl::Error &err) {
      std::cerr << "failed to setArg: " << err.what() << ", " << err.err()
                << std::endl;
      return EXIT_FAILURE;
    }
    const size_t global_worker_size =
        image_data_size / 4 + (image_data_size % 4 > 0);
    // const size_t local_worker_size = 1;
    try {
      commandQueue.enqueueNDRangeKernel(kernel, fuai::cl::NullRange,
                                        fuai::cl::NDRange(global_worker_size),
                                        fuai::cl::NDRange(), nullptr, nullptr);
    } catch (const fuai::cl::Error &err) {
      std::cerr << "failed to enqueueNDRangeKernel, " << err.what() << ", "
                << err.err() << std::endl;
      return EXIT_FAILURE;
    }
    commandQueue.enqueueReadBuffer(cl_output_mem, true, 0, image_data_size,
                                   output_image.data(), nullptr, nullptr);

    std::cout << int(output_image[0]) << ", " << int(output_image[1])
              << std::endl;
  }

  // 需要在所有OpenCL函数调用前释放
  fuai::CloseOpenCL();
}

内核实现

mad

__kernel void mad_8uTo32f_elm4(__global const uchar* srcptr,
                               __global float* dstptr, int elm_size,
                               float multipler, float addend) {
  int offset = get_global_id(0) * 4;
  if (offset + 4 <= elm_size) {
    float4 pix = convert_float4(vload4(0, srcptr + offset));
    pix.x = mad(pix.x, multipler, addend);
    pix.y = mad(pix.y, multipler, addend);
    pix.z = mad(pix.z, multipler, addend);
    pix.w = mad(pix.w, multipler, addend);
    vstore4(pix, 0, dstptr + offset);
  } else if (offset + 3 == elm_size) {
    float3 pix = convert_float3(vload3(0, srcptr + offset));
    pix.x = mad(pix.x, multipler, addend);
    pix.y = mad(pix.y, multipler, addend);
    pix.z = mad(pix.z, multipler, addend);
    vstore3(pix, 0, dstptr + offset);
  } else if (offset + 2 == elm_size) {
    float2 pix = convert_float2(vload2(0, srcptr + offset));
    pix.x = mad(pix.x, multipler, addend);
    pix.y = mad(pix.y, multipler, addend);
    vstore2(pix, 0, dstptr + offset);
  } else if (offset + 1 == elm_size) {
    (dstptr + offset)[0] =
        mad(convert_float((srcptr + offset)[0]), multipler, addend);
  }
}

resize

__kernel void resize_nearest_8uc4(__global const uchar *srcptr, int src_step,
                                  int src_offset, int src_rows, int src_cols,
                                  __global uchar *dstptr, int dst_step,
                                  int dst_offset, int dst_rows, int dst_cols,
                                  float ifx, float ify) {
  int dx = get_global_id(0);
  int dy = get_global_id(1);

  if (dx < dst_cols && dy < dst_rows) {
    float s1 = dx * ifx;
    float s2 = dy * ify;
    int sx = min(convert_int_rtz(s1), src_cols - 1);
    int sy = min(convert_int_rtz(s2), src_rows - 1);

    int src_index = mad24(sy, src_step, mad24(sx, 4, src_offset));
    int dst_index = mad24(dy, dst_step, mad24(dx, 4, dst_offset));
    vstore4(vload4(0, srcptr + src_index), 0, dstptr + dst_index);
  }
}

#define INC(x, l) min(x + 1, l - 1)
#define INTER_RESIZE_COEF_BITS 11
#define INTER_RESIZE_COEF_SCALE (1 << INTER_RESIZE_COEF_BITS)
#define CAST_BITS (INTER_RESIZE_COEF_BITS << 1)

__kernel void resize_linear_8uc4(__global const uchar *srcptr, int src_step,
                                 int src_offset, int src_rows, int src_cols,
                                 __global uchar *dstptr, int dst_step,
                                 int dst_offset, int dst_rows, int dst_cols,
                                 float ifx, float ify) {
  int dx = get_global_id(0);
  int dy = get_global_id(1);

  if (dx < dst_cols && dy < dst_rows) {
    float sx = ((dx + 0.5f) * ifx - 0.5f), sy = ((dy + 0.5f) * ify - 0.5f);
    int x = floor(sx), y = floor(sy);

    float u = sx - x, v = sy - y;

    if (x < 0)
      x = 0, u = 0;
    if (x >= src_cols)
      x = src_cols - 1, u = 0;
    if (y < 0)
      y = 0, v = 0;
    if (y >= src_rows)
      y = src_rows - 1, v = 0;

    int y_ = INC(y, src_rows);
    int x_ = INC(x, src_cols);

    u = u * INTER_RESIZE_COEF_SCALE;
    v = v * INTER_RESIZE_COEF_SCALE;

    int U = rint(u);
    int V = rint(v);
    int U1 = rint(INTER_RESIZE_COEF_SCALE - u);
    int V1 = rint(INTER_RESIZE_COEF_SCALE - v);

    int4 lt = convert_int4(
        vload4(0, srcptr + mad24(y, src_step, mad24(x, 4, src_offset))));
    int4 rt = convert_int4(
        vload4(0, srcptr + mad24(y, src_step, mad24(x_, 4, src_offset))));
    int4 lb = convert_int4(
        vload4(0, srcptr + mad24(y_, src_step, mad24(x, 4, src_offset))));
    int4 rb = convert_int4(
        vload4(0, srcptr + mad24(y_, src_step, mad24(x_, 4, src_offset))));

    // Use int to simulate the linear inter.
    int4 val = mul24((int4)mul24(U1, V1), lt) + mul24((int4)mul24(U, V1), rt) +
               mul24((int4)mul24(U1, V), lb) + mul24((int4)mul24(U, V), rb);

    // (val + 0.5s) / s
    vstore4(convert_uchar4((val + (1 << (CAST_BITS - 1))) >> CAST_BITS), 0,
            dstptr + mad24(dy, dst_step, mad24(dx, 4, dst_offset)));
  }
}

warpaffine

__constant short4 vec_offset = (short4)(0, 1, 2, 3);
#define GET_VAL_C4(x, y)                                     \
  ((x) < 0 || (x) >= src_cols || (y) < 0 || (y) >= src_rows) \
      ? (uchar4)scalar                                       \
      : vload4(0, src + y * src_step + x * 4)

#define GET_VAL_2xC4(x, y)                                   \
  ((x) < 0 || (x) >= src_cols || (y) < 0 || (y) >= src_rows) \
      ? (uchar8)scalar                                       \
      : vload8(0, src + y * src_step + x * 4)

__kernel void warpAffine_nearest_8uc4(__global const uchar* src, int src_step,
                                      int src_offset, int src_rows,
                                      int src_cols, __global uchar* dst,
                                      int dst_step, int dst_offset,
                                      int dst_rows, int dst_cols,
                                      __constant float* M, uchar scalar_) {
  short dst_x = get_global_id(0);
  short dst_y = get_global_id(1);
  uchar scalar = convert_uchar_sat_rte(scalar_);

  if (dst_x >= (short)dst_cols || dst_y >= (short)dst_rows) return;

  /* { M0, M1, M2 }
   * { M3, M4, M5 }
   */
  short src_x, src_y;
  src_x =
      convert_short_sat_rte(M[0] * (float)dst_x + M[1] * (float)dst_y + M[2]);
  src_y =
      convert_short_sat_rte(M[3] * (float)dst_x + M[4] * (float)dst_y + M[5]);

  uchar4 pix = GET_VAL_C4(src_x, src_y);
  vstore4(pix, 0, dst + dst_y * dst_step + dst_x * 4);
}


uchar4 read_pixels_c4(__global const uchar* src, short tx, short ty,
                      int src_offset, int src_step, int src_cols, int src_rows,
                      float s, float t, uchar scalar) {
  uchar4 pt[2];
  uchar4 pb[2];

  short bx, by;
  bx = tx + 1;
  by = ty + 1;

  if (tx >= 0 && (tx + 1) < src_cols && ty >= 0 && ty < src_rows) {
    int src_index = ty * src_step + tx * 4 + src_offset;
    pt[0] = vload4(0, src + src_index);
    pt[1] = vload4(0, src + src_index + 4);
  } else {
    pt[0] = GET_VAL_C4(tx, ty);
    pt[1] = GET_VAL_C4(bx, ty);
  }

  if (tx >= 0 && (tx + 1) < src_cols && by >= 0 && by < src_rows) {
    int src_index = by * src_step + tx * 4 + src_offset;
    pb[0] = vload4(0, src + src_index);
    pb[1] = vload4(0, src + src_index + 4);
  } else {
    pb[0] = GET_VAL_C4(tx, by);
    pb[1] = GET_VAL_C4(bx, by);
  }

  // interpolate and get the points.
  float4 tl, tr, bl, br, pixel;
  tl = convert_float4(pt[0]);
  tr = convert_float4(pt[1]);
  bl = convert_float4(pb[0]);
  br = convert_float4(pb[1]);
  pixel =
      tl * (1 - s) * (1 - t) + tr * s * (1 - t) + bl * (1 - s) * t + br * s * t;

  return convert_uchar4_sat_rte(pixel);
}
__kernel void warpAffine_linear_8uc4(__global const uchar* src, int src_step,
                                     int src_offset, int src_rows, int src_cols,
                                     __global uchar* dst, int dst_step,
                                     int dst_offset, int dst_rows, int dst_cols,
                                     __constant float* M, uchar scalar_) {
  int dst_x = get_global_id(0);
  int dst_y = get_global_id(1);
  uchar scalar = convert_uchar_sat_rte(scalar_);

  if (dst_x >= dst_cols || dst_y >= dst_rows) return;

  /* { M0, M1, M2 }
   * { M3, M4, M5 }
   */

  float src_fx, src_fy;
  src_fx = M[0] * (float)dst_x + M[1] * (float)dst_y + M[2];
  src_fy = M[3] * (float)dst_x + M[4] * (float)dst_y + M[5];

  float s, t;
  s = round((src_fx - floor(src_fx)) * 32.0f) / 32.0f;
  t = round((src_fy - floor(src_fy)) * 32.0f) / 32.0f;

  short src_x, src_y;
  src_x = convert_short_sat_rtn(src_fx);
  src_y = convert_short_sat_rtn(src_fy);

  uchar4 pixel = read_pixels_c4(src, src_x, src_y, src_offset, src_step,
                                src_cols, src_rows, s, t, scalar);

  int dst_index = dst_y * dst_step + dst_x * 4 + dst_offset;
  vstore4(pixel, 0, dst + dst_index);
}