Qualifiers for kernel functions.


__attribute__((work_group_size_hint(X, Y, Z)))
__attribute__((reqd_work_group_size(X, Y, Z)))


The __kernel (or kernel) qualifier declares a function to be a kernel that can be executed by an application on an OpenCL device(s). The following rules apply to functions that are declared with this qualifier:

  • It can be executed on the device only

  • It can be called by the host

  • It is just a regular function call if a __kernel function is called by another kernel function.

Kernel functions with variables declared inside the function with the __local or local qualifier can be called by the host using appropriate APIs such as clEnqueueNDRangeKernel.

The __kernel and kernel names are reserved for use as functions qualifiers and shall not be used otherwise.

Optional Attribute Qualifiers

The kernel qualifier can be used with the keyword attribute__ to declare additional information about the kernel function as described below.

The optional attributevec_type_hint(<type>) is a hint to the compiler and is intended to be a representation of the computational width of the kernel, and should serve as the basis for calculating processor bandwidth utilization when the compiler is looking to autovectorize the code. In the attributevec_type_hint(<type>) qualifier <type> is one of the built-in vector types or the constituent scalar element types. If vec_type_hint (<type>) is not specified, the kernel is assumed to have the attribute__vec_type_hint(int) qualifier.

Implicit in autovectorization is the assumption that any libraries called from the __kernel must be recompilable at run time to handle cases where the compiler decides to merge or separate workitems. This probably means that such libraries can never be hard coded binaries or that hard coded binaries must be accompanied either by source or some retargetable intermediate representation. This may be a code security question for some.

For example, where the developer specified a width of float4, the compiler should assume that the computation usually uses up 4 lanes of a float vector, and would decide to merge work-items or possibly even separate one work-item into many threads to better match the hardware capabilities. A conforming implementation is not required to autovectorize code, but shall support the hint. A compiler may autovectorize, even if no hint is provided. If an implementation merges N work-items into one thread, it is responsible for correctly handling cases where the number of global or local work-items in any dimension modulo N is not zero.

If for example, a kernel function is declared with attributevec_type_hint (float4) (meaning that most operations in the kernel are explicitly vectorized using float4) and the kernel is running using IntelĀ® Advanced Vector Instructions (IntelĀ® AVX) which implements a 8-float-wide vector unit, the autovectorizer might choose to merge two work-items to one thread, running a second work-item in the high half of the 256-bit AVX register.

As another example, a Power4 machine has two scalar double precision floating-point units with an 6-cycle deep pipe. An autovectorizer for the Power4 machine might choose to interleave six kernels declared with the attributevec_type_hint (double2) qualifier into one hardware thread, to ensure that there is always 12-way parallelism available to saturate the FPUs. It might also choose to merge 4 or 8 work-items (or some other number) if it concludes that these are better choices, due to resource utilization concerns or some preference for divisibility by 2.

The optional attributework_group_size_hint(X, Y, Z) is a hint to the compiler and is intended to specify the work-group size that may be used i.e. value most likely to be specified by the local_work_size argument to clEnqueueNDRangeKernel. For example the attributework_group_size_hint(1, 1, 1) is a hint to the compiler that the kernel will most likely be executed with a work-group size of 1.

The optional attributereqd_work_group_size(X, Y, Z) is the work-group size that must be used as the local_work_size argument to clEnqueueNDRangeKernel. This allows the compiler to optimize the generated code appropriately for this kernel.

If Z is one, the work_dim argument to clEnqueueNDRangeKernel can be 2 or 3. If Y and Z are one, the work_dim argument to clEnqueueNDRangeKernel can be 1, 2 or 3.

The optional attributenosvm qualifier can be used with a pointer variable to inform the compiler that the pointer does not refer to a shared virtual memory region.


// autovectorize assuming float4 as the
// basic computation width
__kernel __attribute__((vec_type_hint(float4)))
void foo( __global float4 *p ) { .... }

// autovectorize assuming double as the
// basic computation width
__kernel __attribute__((vec_type_hint(double)))
void foo( __global float4 *p ){ .... }

// autovectorize assuming int (default)
// as the basic computation width
void foo( __global float4 *p ){ .... }

Also see