AOT Compilation

The PolyBlocks compiler supports ahead-of-time (AOT) compilation: the artifacts of PolyBlocks/MLIR compilation can be turned into libraries for offline use via the -aot flag. With this flag, instead of JIT-compiling and executing the compiled code, the compiler writes the object files and their headers to disk. The generated library can be called from any C/C++ application or from any language that can use a C API. Its only dependencies are the MLIR runtime utils library (a thin wrapper) and the runtime for the target: the CUDA runtime on NVIDIA GPUs, the ROCm runtime on AMD GPUs, or the OpenMP runtime on multicore CPUs.

The generated function takes its inputs followed by its outputs. It neither takes nor transfers ownership of any buffer: the caller allocates and frees the memory for both the input and the output data. When compiling for GPUs, -on-gpu-tensors can be used together with -aot to specify that the passed buffers already reside on the GPU.

AOT Features

  • User-provided memory pool: The user can provide pre-allocated memory and have the generated library use that memory pool instead of allocating and deallocating on its own.

  • User-supplied stream (CUDA stream): This allows the user to provide an existing CUDA stream. User-provided memory pool and stream go together.

  • On-GPU tensors: The user can specify that the inputs and outputs of the generated function already reside on the device (GPU). The generated library will not perform any host-to-device transfers.
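
As a rough sketch of how the memory-pool and stream features fit together, the Python snippet below allocates a device pool of the size reported in a generated header's "Requires ... bytes" line, creates a CUDA stream, and passes both as the trailing arguments of the generated function. It assumes the entry point has the four-pointer shape shown in the headers later in this section, that the `void *stream` argument accepts a `cudaStream_t`, and that synchronizing the stream after the call is harmless. The helper names and the `lib_path`, `cudart_path`, and `fn_name` parameters are illustrative and not part of PolyBlocks; only `cudaMalloc`, `cudaStreamCreate`, `cudaStreamSynchronize`, `cudaStreamDestroy`, and `cudaFree` are real CUDA runtime calls.

```python
import ctypes


def bytes_to_mb(num_bytes):
    """Convert a byte count to MB, using 1 MB = 1024 * 1024 bytes."""
    return num_bytes / (1024 * 1024)


def check_cuda(status, what):
    """Raise if a CUDA runtime call returned a non-zero cudaError_t."""
    if status != 0:
        raise RuntimeError(f"{what} failed with CUDA error code {status}")


def run_with_user_pool(lib_path, cudart_path, fn_name, img, out, pool_bytes):
    """Call a generated function with a caller-owned device pool and stream.

    lib_path, cudart_path and fn_name are placeholders supplied by the caller.
    img and out are caller-owned ctypes buffers (or raw addresses).
    """
    cudart = ctypes.CDLL(cudart_path)
    lib = ctypes.CDLL(lib_path)

    # Assumed signature: (inputs..., outputs..., user_device_memory, stream).
    fn = getattr(lib, fn_name)
    fn.argtypes = [ctypes.c_void_p] * 4
    fn.restype = None

    pool = ctypes.c_void_p()
    stream = ctypes.c_void_p()
    check_cuda(
        cudart.cudaMalloc(ctypes.byref(pool), ctypes.c_size_t(pool_bytes)),
        "cudaMalloc",
    )
    try:
        check_cuda(cudart.cudaStreamCreate(ctypes.byref(stream)), "cudaStreamCreate")
        try:
            fn(img, out, pool, stream)
            # Wait for any work that was enqueued on the caller's stream.
            check_cuda(cudart.cudaStreamSynchronize(stream), "cudaStreamSynchronize")
        finally:
            cudart.cudaStreamDestroy(stream)
    finally:
        # The pool belongs to the caller, so the caller frees it.
        cudart.cudaFree(pool)
```

The `bytes_to_mb` helper reproduces the MB figure printed next to the byte count in the generated headers, which can serve as a quick sanity check before allocating the pool.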

Command-line flags for AOT


An AOT-compilation example is shown below. The relevant files can be found in the playground.

# Run the TF spec through the PolyBlocks/MLIR compiler with the -aot flag.
$ python ../ -gpu -H 1080 -W 1920 -aot \
  -aot-name polyblocks -skip-tf-xla -skip-tf-standard

# The PolyBlocks/MLIR compilation artifact is written to
# `polyblocks_mlir_artifact.o`, and the header to include is generated as
# `polyblocks_mlir_artifact.h` (default names).
/// Signature of the PolyBlocks-generated function.
/// Type/shape of img is memref<1080x1920x3xi8>.
/// Type/shape of identity_RetVal is memref<1076x1916x3xf32>.
extern "C" void unsharp_mask_polyblocks(int8_t *img, float *out);

Example of a header generated with AOT compilation with user-supplied memory and stream:

/// Signature of the PolyBlocks-generated function.
/// Requires 200933568 bytes (1.916252e+02 MB) of memory to be supplied via a
/// trailing argument.
/// Type/shape of img is memref<1080x1920x3xi8>.
/// Type/shape of out is memref<1076x1916x3xf32>.
/// Type/shape of user_device_memory is memref<80674880xi8>.
extern "C" void unsharp_mask_polyblocks(int8_t *img, float *out,
                                        int8_t *user_device_memory,
                                        void *stream);
# Compile and link with it. The driver file has a few lines to call the
# generated function.
$ clang++ -O3 -I/ws/release/llvm-mlir/include unsharp_mask_driver.cpp \
polyblocks_mlir_artifact.o -L/usr/local/cuda/lib64 -lcuda -lcudart \
-L/ws/release/llvm-mlir/lib -lmlir_cuda_runtime -o unsharp_mask
$ ./unsharp_mask

# Generate a shared object.
$ clang++ -O3 -fPIC -shared polyblocks_mlir_artifact.o \
-I/usr/local/cuda/include -L/ws/release/llvm-mlir/lib -lmlir_cuda_runtime \
-L/usr/local/cuda/lib64 -lcuda -lcudart -o
# Call it from Python, passing it `numpy` arrays using `ctypes` or the
# numpy to memref helpers (for dynamically shaped memrefs) available in
# the MLIR Python bindings.

The runs above achieve full GPU-accelerated execution, with performance identical to that of the JIT-compiled run from the Python TensorFlow specification.
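
As a rough illustration of the `ctypes` route mentioned in the comments above, the sketch below allocates caller-owned buffers matching the two memref shapes in the first generated header and passes them to the entry point as bare pointers. The shared-object path (`lib_path`) is a placeholder, since the output name is not given above, and the helper names are illustrative rather than part of PolyBlocks.

```python
import ctypes

# Shapes taken from the comments in the first generated header.
IMG_SHAPE = (1080, 1920, 3)  # memref<1080x1920x3xi8>
OUT_SHAPE = (1076, 1916, 3)  # memref<1076x1916x3xf32>


def num_elements(shape):
    """Number of elements in a statically shaped buffer."""
    count = 1
    for dim in shape:
        count *= dim
    return count


def allocate_buffers():
    """Allocate zero-initialized input and output buffers owned by the caller."""
    img = (ctypes.c_int8 * num_elements(IMG_SHAPE))()
    out = (ctypes.c_float * num_elements(OUT_SHAPE))()
    return img, out


def run_unsharp_mask(lib_path, img, out):
    """Load the shared object at lib_path (placeholder) and call the function."""
    lib = ctypes.CDLL(lib_path)
    fn = lib.unsharp_mask_polyblocks
    # Inputs first, then outputs, as in the generated header.
    fn.argtypes = [ctypes.POINTER(ctypes.c_int8), ctypes.POINTER(ctypes.c_float)]
    fn.restype = None
    fn(img, out)
```

Because the library never takes ownership of the buffers, `img` and `out` stay valid after the call and are released by Python once the caller drops its references to them.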

AOT compilation works the same way for PyTorch. The example below uses the annotation/API-based approach.

from polyblocks.torch_polyblocks_compiler import polyblocks_aot_torch
import torch

class ConvBiasAddRelu(torch.nn.Module):
    def __init__(self, *args, **kwargs) -> None:
        super().__init__(*args, **kwargs)
        self.conv = torch.nn.Conv2d(
            in_channels=3, out_channels=1024, kernel_size=32, stride=1, bias=True
        )

    def forward(self, input: torch.Tensor) -> torch.Tensor:
        return torch.nn.functional.relu(self.conv(input))

input_img = torch.rand(1, 3, 224, 224)
with torch.no_grad():
    model = ConvBiasAddRelu().eval()
    polyblocks_aot_torch(compile_options={'target': 'nvgpu'})(model)(input_img)
/// Signature of the PolyBlocks-generated function.
/// Requires 324921344 bytes (3.098691e+02 MB) of memory to be supplied via the
/// trailing argument.
/// Type/shape of input is memref<1x3x224x224xf32>.
/// Type/shape of bias is memref<1024xf32>.
/// Type/shape of weight is memref<1024x3x32x32xf32>.
/// Type/shape of output0 is memref<1x1024x193x193xf32>.
/// Type/shape of user_device_memory is memref<324921344xi8>.
/// Type/shape of stream is !gpu.async.token.
extern "C" void conv_bias_add_relu(float *input, float *bias, float *weight,
                                   float *output0, int8_t *user_device_memory,
                                   void *stream);
$ clang++ -O3 -fPIC -shared polyblocks_mlir_artifact.o \
       -L... -lmlir_cuda_runtime -o
# Link to this shared object from any C/C++/CUDA app or from any language with
# a C-compatible API.
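
The output shape reported in the generated header can be cross-checked against the module definition. The sketch below applies the standard convolution output-size formula to the 224x224 input and the 32x32 kernel from the PyTorch example, and computes the byte size of the `f32` output buffer that the caller has to provide. The function names are illustrative, not part of PolyBlocks.

```python
def conv2d_output_size(in_size, kernel_size, stride=1, padding=0, dilation=1):
    """Output length along one spatial dimension of a 2-D convolution."""
    return (in_size + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


def buffer_bytes(shape, itemsize=4):
    """Byte size of a dense buffer; itemsize defaults to 4 for f32."""
    total = itemsize
    for dim in shape:
        total *= dim
    return total


# Values from the PyTorch example: input 1x3x224x224, 1024 filters of 32x32.
side = conv2d_output_size(224, 32)
output_shape = (1, 1024, side, side)
output_bytes = buffer_bytes(output_shape)
```

Here `side` evaluates to 193, matching the `1x1024x193x193` output shape in the header, so the caller needs an output buffer of 152571904 bytes.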