Compile Options
The options listed below are used with the compile_options dictionary argument of the @polyblocks_jit_* decorators. As an example:
@polyblocks_jit_torch(compile_options={
    'mm_conv_precision_mode': 'fp16',
    'target': 'nvgpu'
})
def compute(x, y):
    return x @ y
When using the PolyBlocks compiler from a terminal/shell, supply the options listed below with underscores replaced by hyphens; for example, user_device_memory becomes -user-device-memory on the command line.
General options
device_one_time_load
    Load the device kernel only once (the first time), irrespective of how many times the kernel is launched.

mm_conv_precision_mode
    Precision for add/mul in convolution and matmul; possible values are fp16, bf16, fp32, int32, mixed_fp16_fp32, mixed_bf16_fp32, and mixed_int8_int32. The default mode for GPUs with tensor cores is mixed_fp16_fp32, with float16 and bfloat16 input data types supported. For the 'cpu' target, and for GPU targets with tensor core targeting disabled, the default behavior is to maintain the existing precision as dictated by the input/output types of a particular matmul/convolution operation.

mlir_disable_fast_math
    Disable fast math when mapping math ops to lower-level target intrinsics. Fast math is enabled by default.

num_iters
    The number of iterations for which the spec will be run. Each run uses the same input. Meant for testing. Default: 1.

on_gpu_tensors
    Generate libraries while expecting input and output tensors to already be present on the GPU (polyblock_gen_library has to be specified as well).

polyblocks_disable_affine_fusion
    Disable affine fusion for PolyBlocks code generation. Affine fusion is enabled by default.

polyblocks_gpu_no_tensor_core
    Disable targeting tensor cores for PolyBlocks GPU compilation. Tensor cores are enabled by default.

polyblocks_disable_reduce_reduce_fusion
    Disable fusion across multiple reduction operators (e.g., matmul-softmax-matmul).

polyblocks_disable_unsafe_math_canonicalizations
    Disable canonicalizations of math operations that are not guaranteed to be numerically stable.

strict_target
    Ensure that all kernels are compiled for the specified target. Disabled by default.

target
    The hardware class to target. The supported targets are 'cpu' (CPUs), 'nvgpu' (NVIDIA GPUs), and 'amdgpu' (AMD GPUs). The default target is 'cpu'.

torch_polyblocks_select_graphs
    Selectively run only the PyTorch graphs mentioned in this list via PolyBlocks.

torch_polyblocks_filter_graphs
    Selectively skip the PyTorch graphs mentioned in this list; they are not run via PolyBlocks.
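Several of these options can be combined in a single compile_options dictionary. A minimal sketch, assuming PyTorch, a CUDA-capable machine, and that polyblocks_jit_torch is importable from a polyblocks module (the import path is an assumption):

import torch
from polyblocks import polyblocks_jit_torch  # import path is an assumption

# Target an NVIDIA GPU and request mixed-precision tensor-core matmuls.
@polyblocks_jit_torch(compile_options={
    'target': 'nvgpu',
    'mm_conv_precision_mode': 'mixed_fp16_fp32',
})
def attention_like(a, b, c):
    # matmul-softmax-matmul: the reduce-reduce fusion pattern noted above.
    return torch.softmax(a @ b, dim=-1) @ c

a = torch.randn(64, 64, device='cuda', dtype=torch.float16)
b = torch.randn(64, 64, device='cuda', dtype=torch.float16)
c = torch.randn(64, 64, device='cuda', dtype=torch.float16)
out = attention_like(a, b, c)

Here float16 inputs are used to match the mixed_fp16_fp32 mode, which supports float16 and bfloat16 input data types.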
AOT-specific options
aot
    Generate libraries from PolyBlocks/MLIR compilation. Disabled by default.

aot_name
    Use the specified name for the main generated function. This name is also used as the prefix for the emitted artifacts. Default: polyblocks_mlir_artifact.

device
    Specify a particular target device/chip to compile or cross-compile for (in AOT mode). Examples for 'nvgpu' include 'rtx3090', 'rtx4090', 'a10', 'a40', 'a100', and 'orin_nano'. Any number of chip configurations can be added to the JSON file 'devices.json'.

device_one_time_load
    Enabled by default; load the device kernel only once (the first time), irrespective of how many times the kernel is launched.

on_gpu_tensors
    Generate libraries while expecting input and output tensors to already be present on the GPU.

user_device_memory
    Generate libraries while expecting the user to provide a pool of device-allocated memory through the trailing argument of the generated function; the generated code does not perform any device allocations but simply uses the supplied pool of memory. Whenever user_device_memory is specified, a caller of the generated function is also expected to pass a pointer to the relevant CUDA stream (cudaStream_t) as the trailing argument; as a result, the last two arguments of the generated library's signature have to be the device memory buffer and the CUDA stream pointer whenever -user-device-memory is specified. See the sketch after this list.
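To make the user_device_memory calling convention concrete, here is a hedged Python sketch that loads a generated library via ctypes. The artifact file name (derived from the default aot_name, polyblocks_mlir_artifact), the leading tensor arguments, and the pool size are all assumptions, since the actual signature depends on the compiled function; only the trailing (device memory buffer, CUDA stream) pair is dictated by user_device_memory:

import ctypes
import torch

# File name is an assumption based on the default aot_name prefix.
lib = ctypes.CDLL('./polyblocks_mlir_artifact.so')

# Leading arguments are illustrative; the real ones depend on the compiled spec.
x = torch.randn(1024, device='cuda')
y = torch.empty(1024, device='cuda')

# User-provided pool of device memory (size is an assumption) and the stream.
pool = torch.empty(1 << 20, dtype=torch.uint8, device='cuda')
stream = torch.cuda.current_stream().cuda_stream  # raw cudaStream_t value

lib.polyblocks_mlir_artifact(
    ctypes.c_void_p(x.data_ptr()),
    ctypes.c_void_p(y.data_ptr()),
    ctypes.c_void_p(pool.data_ptr()),  # trailing arg 1: device memory buffer
    ctypes.c_void_p(stream),           # trailing arg 2: CUDA stream pointer
)
torch.cuda.synchronize()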