Custom C++: Systolic GEMM
This example describes how to integrate an existing design using the C++ flow.
We are going to integrate an existing design: the Systolic GEMM processor provided by Xilinx in their Vitis Accel Examples repository on GitHub.
The Xilinx Systolic GEMM processor is described as a function in C++. The function has the following signature:
void mmult(const int* a, // Read-Only Matrix A
           const int* b, // Read-Only Matrix B
           int* c,       // Output Result
           int a_row,    // Matrix A Row Size
           int a_col,    // Matrix A Col Size
           int b_col     // Matrix B Col Size
           )
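For reference, this function performs a plain matrix multiplication: C, of shape a_row x b_col, is the product of A (a_row x a_col) and B (a_col x b_col). The following NumPy model is our own illustration, not part of the Xilinx sources; it mirrors the np.matmul check we will run at the end of this example:
import numpy as np

# Software reference of what mmult computes: a plain matrix multiply.
a_row, a_col, b_col = 16, 16, 16
a = np.random.randint(255, size=(a_row, a_col), dtype=np.uint32)
b = np.random.randint(255, size=(a_col, b_col), dtype=np.uint32)
c = np.matmul(a, b)  # shape (a_row, b_col), the result we expect from the hardware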
We need to take special note of the function parameters as these will define the interfaces we need to make available in our project. We want to replicate these so that we can integrate this processor into a Sabana image without any modifications.
These are the steps we will follow to successfully package this module in a Sabana image:
  • Create a Sabana project with a set of interfaces tailored to the Systolic GEMM module
  • Integrate the source files that describe the GEMM module
  • Build the image
  • Create a Sabana program to interact with a deployed instance
Creating the project
The first step is to create a new Sabana project and to add the required interfaces to match the function signature of the GEMM module. For that we will use the Sabana CLI tool:
  • We will use the C++ flow
  • We will select start from scratch rather than using an example
  • We will add the interfaces to match the function's signature
Start the process with the sabana new command:
sabana new c_axi_systolic_gemm_16x16_int
First select the Cpp flow:
Welcome to sabana, let's create a project
? Select a language ›
Verilog
❯ Cpp
Now select the Start from scratch flow so we can add the specific interfaces we need to integrate the mmult function signature into our image:
Welcome to sabana, let's create a project
✔ Select a language · Cpp
? Select a starting point ›
Matrix multiply accelerator (GEMM)
❯ Start from scratch
Now we need to add each interface. For each we will have to provide the interface name, its type, and direction.
For convenience, here is a table that outlines all the interfaces we will add to the project, together with their respective parameters. The columns are ordered following the questions prompted by the Sabana CLI tool.
Resource Type    Name     IO       Data Type
Value            a_row    input    int
Value            a_col    input    int
Value            b_col    input    int
Buffer           a        input    int
Buffer           b        input    int
Buffer           c        output   int
For registers select the type Value:
Welcome to sabana, let's create a project
✔ Select a language · Cpp
✔ Select a starting point · Start from scratch
? Resource type ›
❯ Value
Buffer
Done adding resources
For memory buffers select the type Buffer:
Welcome to sabana, let's create a project
✔ Select a language · Cpp
✔ Select a starting point · Start from scratch
? Resource type ›
Value
❯ Buffer
Done adding resources
After adding all the interfaces your command shell should look like this:
Welcome to sabana, let's create a project
✔ Select a language · Cpp
✔ Select a starting point · Start from scratch
✔ Resource type · Value
✔ Name · a_row
✔ IO · Input
✔ DataType · int
✔ Resource type · Value
✔ Name · a_col
✔ IO · Input
✔ DataType · int
✔ Resource type · Value
✔ Name · b_col
✔ IO · Input
✔ DataType · int
✔ Resource type · Buffer
✔ Name · a
✔ IO · Input
✔ DataType · int
✔ Resource type · Buffer
✔ Name · b
✔ IO · Input
✔ DataType · int
✔ Resource type · Buffer
✔ Name · c
✔ IO · Output
✔ DataType · int
✔ Resource type · Done adding resources
Integrate source files
Download the mmult.cpp source file from the Xilinx GitHub repository and place it in the src directory, next to the sabana.cc file.
To be able to include this function in the sabana.cc file we need a header file. Create a new header file, mmult.h, with the following contents:
#ifndef __MMULT_H
#define __MMULT_H

extern "C" void mmult(const int* a, // Read-Only Matrix A
                      const int* b, // Read-Only Matrix B
                      int* c,       // Output Result
                      int a_row,    // Matrix A Row Size
                      int a_col,    // Matrix A Col Size
                      int b_col     // Matrix B Col Size
                      );

#endif // __MMULT_H
Now we only need to call this function from the sabana.cc file. First, include the header file at the top of the file (around line 18):
#include <stdio.h>
#include <string.h>

// Include the header file we just created
#include "mmult.h"
Now include a call to the mmult function just under the "your code here" comment:
// your code here
#pragma HLS inline recursive
// Double check that the order of the arguments
// matches the function signature in mmult.cpp
mmult(a, b, c, a_row, a_col, b_col);
We use the HLS inline recursive pragma to avoid generating a new level of hierarchy in the design, which sometimes enables additional compiler optimizations.
Build the image
After adding the relevant source files and integrating the module with the top level in sabana.cc, we are ready to build the image:
sabana push -d
We use the -d flag in the command above so the build runs detached and we get the terminal back.
Building this image should take around 20 minutes. In the meantime we can get everything else in place to test the image once it is ready.
Creating a program
When we created the project, the CLI provided a template Python test script located in the tests directory. It is a good starting point, but not enough to interact with our image. To send requests to the instance and get back the matrix multiplication results we need to add the following sections:
  • Add the set of requests to be sent to the instance to a program:
    • Create a Program object
    • Populate the program with requests
  • Deploy an instance and execute the program
  • Check the results
Let's start by creating the requests we will send to the instance.
Adding requests to a program
As we know, a Sabana program is a collection of requests to be sent to the instance. The requests depend on the underlying hardware of the image. These are the steps we have to follow to interact with our Systolic GEMM instance:
  • Allocate MMIO memory to interact with registers
  • Allocate Buffers for every memory buffer required by the module
  • Write the parameters of the matrices as required by the function
  • Write the input matrices to be multiplied
  • Start the multiplication function
  • Wait for the multiplication to finish
  • Read the results
The following snippet implements the pseudocode above.
# declare the data
dt = np.uint32
n = 16
m = 255
shape = (n, n)
bufa = "bufferA"  # Link this buffer to mmio register 0x28
bufb = "bufferB"  # Link this buffer to mmio register 0x34
bufc = "bufferC"  # Link this buffer to mmio register 0x40
a = np.random.randint(m, size=shape, dtype=dt)
b = np.random.randint(m, size=shape, dtype=dt)
cols = np.array([n], dt)
start = np.ones([1], dt)
done = np.array([14], dt)

# create program
program = Program()
# allocate memory
program.mmio_alloc(name="c0", size=0x00010000, base_address=0xA0000000)
program.buffer_alloc(name=bufa, size=a.nbytes, mmio_name="c0", mmio_offset=0x28)
program.buffer_alloc(name=bufb, size=b.nbytes, mmio_name="c0", mmio_offset=0x34)
program.buffer_alloc(name=bufc, size=b.nbytes, mmio_name="c0", mmio_offset=0x40)
# write inputs
program.mmio_write(cols, name="c0", offset=0x10)
program.mmio_write(cols, name="c0", offset=0x18)
program.mmio_write(cols, name="c0", offset=0x20)
program.buffer_write(a, name=bufa, offset=0)
program.buffer_write(b, name=bufb, offset=0)
# start execution
program.mmio_write(start, name="c0", offset=0x0)
# wait for processing
program.mmio_wait(done, name="c0", offset=0x0, timeout=4)
# readout results
program.buffer_read(name=bufc, offset=0, dtype=dt, shape=a.shape)
program.mmio_dealloc(name="c0")
program.buffer_dealloc(name=bufa)
program.buffer_dealloc(name=bufb)
program.buffer_dealloc(name=bufc)
Copy the snippet to the beginning of the test_main function.
Note how we are binding the a, b, and c buffers to their respective pointer registers. The offsets of these registers can be found in the project file.
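If you want to confirm these offsets yourself, you can dump the project file and look for the pointer registers. The snippet below is only a generic sketch, assuming the project file is the sabana.json referenced later in this script; the exact schema may differ:
import json
from pathlib import Path

# Print the project file so the pointer-register offsets
# (0x28, 0x34 and 0x40 in this example) can be located by eye.
project_file = Path(__file__).resolve().parent.parent.joinpath("sabana.json")
print(json.dumps(json.loads(project_file.read_text()), indent=2))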
Deploying an instance and executing the program
With the set of requests prepared, we can now create an Instance object. The idea is to:
  • Create an Instance object so we get a handle to deploy the instance
  • Request an instance to be deployed (up)
  • Request the execution of the program we created (execute)
  • Request the instance to be terminated (down)
The following snippet of code implements the pseudocode above:
# deploy instance
image_file = Path(__file__).resolve().parent.parent.joinpath("sabana.json")
inst = Instance(image_file=image_file)
inst.up()

# run program
responses = inst.execute(program)

# terminate instance
inst.down()
Paste the snippet above just under the previous snippet inside of the test_main function.
Checking the results
At this point in the script the responses from the instance are stored in the responses variable. It is a list containing one numpy array for each read request we issued in the program.
In our example program above we issued a single read request, so we can access the resulting matrix from the multiplication and check the results like this:
# check results
assert np.array_equal(responses[0], np.matmul(a, b))
print("Check OK!")
Paste the snippet above just under the previous snippet inside of the test_main function. This is the last snippet to paste.
Running the program
At this point our Python script is ready to be used:
python3 tests/test_c_axi_systolic_gemm_16x16_int.py
The script should execute without exceptions and print the Check OK! message.
Congratulations! You have successfully packaged a hardware module described in C++ as a Sabana image!
GitHub
Just in case you need it, you can find the source code for this example in our examples repository on GitHub:
sabana-examples/c_axi_systolic_gemm_16x16_int at main · sabanaio/sabana-examples
This is the source repository of the AMD Xilinx Systolic Array example:
Vitis_Accel_Examples/cpp_kernels/systolic_array at master · Xilinx/Vitis_Accel_Examples
Next steps
Now that we have successfully packaged this image we are able to use it.
  • You can share your Sabana image string so others are able to deploy instances of it.
Another step you can take is to create a driver script for the image. The script we put together before works best for a unit test, where the inputs are fixed and we check the output at the end. However, it is not very convenient for someone else who just wants to use your hardware as part of their application.
The driver script is an adaptation layer: it receives the inputs from an application, for example two numpy arrays, and returns the result of the operation, another numpy array. A minimal sketch is shown below.
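Here is a minimal driver sketch along those lines. It wraps the same requests we used in the test into a function that takes two numpy arrays and returns their product. The import path and the function name are assumptions on our part; reuse whatever your generated test script imports, and keep the register offsets from your own project file (0x10/0x18/0x20 for the sizes, 0x28/0x34/0x40 for the buffer pointers, as above):
import numpy as np
from sabana import Instance, Program  # assumed import path, match your test script

def mmult_16x16(inst, a, b):
    # Multiply two 16x16 matrices on an already deployed instance
    # and return the resulting numpy array.
    dt = np.uint32
    n = a.shape[0]
    a32 = a.astype(dt)
    b32 = b.astype(dt)
    program = Program()
    program.mmio_alloc(name="c0", size=0x00010000, base_address=0xA0000000)
    program.buffer_alloc(name="bufferA", size=a32.nbytes, mmio_name="c0", mmio_offset=0x28)
    program.buffer_alloc(name="bufferB", size=b32.nbytes, mmio_name="c0", mmio_offset=0x34)
    program.buffer_alloc(name="bufferC", size=a32.nbytes, mmio_name="c0", mmio_offset=0x40)
    program.mmio_write(np.array([n], dt), name="c0", offset=0x10)  # a_row
    program.mmio_write(np.array([n], dt), name="c0", offset=0x18)  # a_col
    program.mmio_write(np.array([n], dt), name="c0", offset=0x20)  # b_col
    program.buffer_write(a32, name="bufferA", offset=0)
    program.buffer_write(b32, name="bufferB", offset=0)
    program.mmio_write(np.ones([1], dt), name="c0", offset=0x0)  # start
    program.mmio_wait(np.array([14], dt), name="c0", offset=0x0, timeout=4)  # wait for done
    program.buffer_read(name="bufferC", offset=0, dtype=dt, shape=(n, n))
    program.mmio_dealloc(name="c0")
    program.buffer_dealloc(name="bufferA")
    program.buffer_dealloc(name="bufferB")
    program.buffer_dealloc(name="bufferC")
    responses = inst.execute(program)
    return responses[0]
An application would then create the Instance, call up() once, invoke mmult_16x16 as many times as needed, and finally call down().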
For more details on Sabana programs consult the Program reference page:
For another interesting project, head to our next example: