## **Generating Compiler Backends from Formal Models of Hardware**

**Gus Smith's Generals Exam** 

May 6th, 2022



To discuss how we'll generate compiler backends from formal models of hardware, we first need to talk about compilers themselves.



So...this is a compiler. What does a compiler do?



A compiler translates high-level code down to machine code that can be run on hardware.



Compilers are generally composed of a few layers: a frontend, some target independent optimization, and finally, a backend. (Build)

The compiler backend is what does target-specific optimization and code generation for the target hardware.





The compiler backend produces machine code,

(Build)

Which is what runs on the target hardware.

(Build)

Our focus today will be on compiler backends and the hardware they target.







To compile to hardware, the compiler backend needs to know about the underlying hardware its compiling to.



In reality, this is implemented by building models of the target hardware into the components of the compiler.

(Build)

For example, the memory planning component of the compiler will contain a model of the hardware's memory hierarchy,

(Build)

While the instruction selector will contain a model of the hardware interface or ISA.







Importantly, though, these models of the underlying hardware are generally *implicit*. They are not explicitly coded descriptions or simulators of the underlying hardware, but instead, implicit descriptions like heuristics and hard-coded optimizations.



In this paper, the authors perform a study of how various minor code transformations affect the quality of output of major compilers GCC, ICC, and Clang. They find that small changes in the input program lead to large changes in the performance of the compiled result. One of their conclusions is that (Build)

Inaccurate vectorization models in these major compilers are a primary source of low-quality compiler results.

These vectorization models, which are heuristics for when code should be vectorized for a specific architecture, are a great example of implicit hardware models that exist in major compilers.





So, as this paper highlights, relying on implicit models in compilers can produce (Build)

Imperfectly optimized, non-performant code.

But this is not the only downside of implicit models.





Implicit models also cause many headaches during development. Modifying implicit models often requires the time and attention of experts, as it can be unclear where modifications need to be made, and modifications can have unintended side affects.



Lastly and most importantly, implicit models can be an insidious source of bugs. If the implicit models are incorrect, then the code they generate will be buggy. Worse still, fixing bugs in these implicit models is made harder by the fact that they are implicit.











16

I hypothesize that...

...that is, the optimizer is able to automatically compose small facts about the underlying hardware into large and complex optimizations.

16

• gives rise to emergent optimizations,

16

- gives rise to emergent optimizations,
- reduces development time, and

16

- gives rise to emergent optimizations,
- reduces development time, and
- enables verification.





This thesis is very much so in line with a lot of Norman Ramsey's work on automatically generated instruction selectors.

## (Build)

However, unlike Ramsey, we will be applying these ideas not on CPUs, but on fixed function accelerators, and programmable hardware.





So let's discuss the structure of this talk.



I've broken this talk into three primary sections:

What I'll show: namely, my stated thesis,

How I've shown it so far, which is the work I've already done towards demonstrating this thesis,

And

How I'll finish, which is where I propose my final project.



So I just finished explaining what I will show; Now, let's jump into how I've shown it so far.



With that, let's talk about glenside.


This work was done with my colleagues here at UW and was published at MAPS 2021.



Because we're about to get a bit into the weeds, I want to first summarize the high-level story of Glenside.

Glenside is a tensor IR\* built for equality saturation.

23

\* intermediate representation

Glenside is a tensor IR\* built for equality saturation.

Glenside enables users to model hardware accelerators as program rewrites.

23

\* intermediate representation

Glenside is a tensor IR\* built for equality saturation.

Glenside enables users to model hardware accelerators as program rewrites.

These rewrites, in concert with Glenside's built-in rewrites, automatically discover ways to map machine learning workloads to accelerators.

23

\* intermediate representation



When we began designing glenside, we had three primary requirements

First, the language must be pure,

Second, the language must be low-level, so that we can actually reason about hardware. And lastly, the language should avoid binding, which makes term rewriting much easier.



Three design requirements for Glenside:

1. The language must be **pure**-a necessary requirement for equality saturation.

24

Three design requirements for Glenside:

1. The language must be **pure**—a necessary requirement for equality saturation.

24

2. The language must be **low-level**, letting us reason about hardware.

Three design requirements for Glenside:

1. The language must be **pure**—a necessary requirement for equality saturation.

24

- 2. The language must be **low-level**, letting us reason about hardware.
- 3. The language must **not use binding,** making term rewriting much easier.



So, as always, I think it's easiest to start with an example.

So let's begin with the most common kernel in deep learning -matrix multiplication!



Remember, our goals are to represent multiplication in a way that...



We want to represent matrix multiplication in a way that

26

1. is pure,

We want to represent matrix multiplication in a way that

26

- 1. is pure,
- 2. is low-level, and

We want to represent matrix multiplication in a way that

26

- 1. is pure,
- 2. is low-level, and
- 3. avoids binding.

Given matrices A and B, pair each row of A with each column of B, compute their dot products, and arrange the results back into a matrix.

27

Just to remind ourselves, here is the matrix multiplication algorithm. Given matrices A and B...



This simple English description of the algorithm immediately suggests a possible pure, low-level, binder-free implementation.



First, we view the matrix A, on the left, as a list of its rows, and the matrix B, on the right, as a list of its columns.





We then pair each row from A with each column from B by taking the Cartesian product of the two lists.



The result is a list of row-column pairs.



Finally, we map the dot product operator over this list of row-column pairs.



Our result is a list of scalar values!



But there's a problem



A 2D matrix times a 2D matrix should produce a 2D matrix.

However, our algorithm produces a one-dimensional list of values!

Even if the values are correct, we're missing a core piece of information: the shape!



Let's step back in time to understand where the shape information got lost.





By the time we're mapping the dot product operator over our list, the list is already a flat list of pairs.



As is the case in the step before.



But at this step, just before we take the cartesian product, all of the original shape information seems to be present. Specifically, we still have A and B in their two dimensional forms, as a list of rows and columns, respectively. So shape information is present here, but when we step forward again...









...the shape information is lost.

Our 2D matrices have been flattened into a one-dimensional list, with no way to recover the shapes of the original matrices.




So it seems that the Cartesian product operator is destroying our shape information!



Thus, we introduce a new, two dimensional Cartesian product operator.



This new cartesian product operator preserves shape information.

Now, the cartesian product of the three rows of A and the two columns of B are a three-by-two matrix of pairs.



But now, when we try to map our dot product operator over our new list, we run into another problem.



Our map operator maps the dot product over the wrong dimension!



Thus, we also introduce a new map operator.



This map operator knows to map the dot product two dimensions deep, correctly mapping the dot product onto the row-column pairs.



Finally, we get out what we expected:



A two dimensional matrix!





So what happened here?

Cartesian product 2D and map 2D hard-code which dimensions are iterated over and which dimensions are computed on.

But if the shapes of A and B change, we'll need entirely new operators, for example, map3D.

So can we encode this information—that is, which dimensions are iterated over and which dimensions are computed on—directly in the tensor itself? Unsurprisingly, the answer is yes!

This is exactly what access patterns do.





...but if tensor shapes change, we'll need entirely new operators!







Access patterns are the core innovation of Glenside.

Access patterns simply add a bit more information on top of a traditional tensor.

Whereas a tensor is defined by a tuple of integers called a shape-in this case, three comma four...



An access pattern is defined by a pair of tuples, where the first tuple holds the access dimensions, or the dimensions that are iterated over, and the second tuple holds the compute dimensions, or the dimensions which are computed on.

Concatenating the tuple gives the shape of the underlying tensor.

An access pattern simply represents a view over a tensor, conveying how an algorithm computes over the tensor.

For example, this access pattern is viewing our three comma four shaped tensor as a three-length vector of four length vectors.

But if we shift the dimensions of the access pattern over,







We can view the same tensor as a three-comma-four shaped tensor of scalars. And if we shift the dimensions in the other direction,



We can view the tensor as a scalar-shaped tensor, in which that quote unquote scalar, i.e. a single element, is a single three-comma-four shaped tensor.



Thus, each of these different access patterns represents a different view of the same underlying tensor, each view being potentially useful depending on the algorithm processing the tensor.



Now, once we've defined access patterns, we can redefine a bunch of common tensor and list operators using access pattern semantics—all of these operators collectively form the Glenside IR. There's a bunch of detail about the operators in the paper, and we'll see a few of them in our case studies.

| (transpose                                                                                                      | ;                    | ((N, O, H', W'), ())                                                              | (compute dotProd                                               | ;          | ((M, O), ())                                             |
|-----------------------------------------------------------------------------------------------------------------|----------------------|-----------------------------------------------------------------------------------|----------------------------------------------------------------|------------|----------------------------------------------------------|
| (squeeze                                                                                                        | ;                    | ((N, H', W', O), ())                                                              | (cartProd                                                      | ;          | ((M, O), (2, N))                                         |
| (compute dotProd                                                                                                | ;                    | ((N, 1, H', W', O), ())                                                           | (access activations 1)                                         | ;          | ((M), (N))                                               |
| (cartProd                                                                                                       | ;                    | $((N, 1, H', W', O), (2, C, K_h, K_w))$                                           | (transpose                                                     | ;          | ((O), (N))                                               |
| (windows                                                                                                        | ;                    | $((N, 1, H', W'), (C, K_h, K_w))$                                                 | (access weights 1)                                             | ;          | ((N), (O))                                               |
| (access activations 1)                                                                                          | ;                    | ((N), (C, H, W))                                                                  | (list 1 0))))                                                  |            |                                                          |
| (shape C Kh Kw)<br>(shape 1 Sh Sw))                                                                             |                      | (b) Matrix multiplication.                                                        |                                                                |            |                                                          |
| (access weights 1)))                                                                                            | :                    | $((O), (C, K_h, K_w))$                                                            | (compute reduceMax                                             | ; (        | ((N, C, H', W'), ())                                     |
| 1)                                                                                                              | ,                    |                                                                                   | (windows                                                       | ; (        | $((N,C,H',W'), (K_h,K_w))$                               |
| (list 0 3 1 2))                                                                                                 |                      |                                                                                   | (access activations 2)                                         | ; (        | ((N,C), (H,W))                                           |
|                                                                                                                 |                      |                                                                                   | (shape Kh Kw)                                                  |            |                                                          |
|                                                                                                                 |                      | 1.0                                                                               | (shape Sh Sw)))                                                |            |                                                          |
| (a) 2D convolution.                                                                                             |                      |                                                                                   | (c) Max pooling.                                               |            |                                                          |
| <b>Figure 2.</b> Common tensor kerne<br>with their access pattern shape<br>are filter height/width; $S_h/S_w$ a | els f<br>. N<br>re s | from machine learning expressed is batch size; <i>H/W</i> are spatial dim trides. | in Glenside. Lines containing<br>nension sizes; C/O are input/ | acc<br>out | cess patterns are annotated put channel count; $K_h/K_w$ |
|                                                                                                                 |                      |                                                                                   |                                                                |            | _                                                        |
|                                                                                                                 |                      |                                                                                   |                                                                |            |                                                          |

And then once we define Glenside, we can use it to represent common kernels in machine learning. Not just matrix multiplication, but more complex kernels like 1, 2, and 3d convolution.



So we've defined Glenside, and we see that it can be used to represent kernels in deep learning.







To answer that, we're going to move to the next section of my talk, where I talk about Glenside's in a project called 3LA



So just to keep the map in mind, we're moving on to the last part of how I've shown my thesis so far.



3LA is a collaborative project between us here at UW and our colleagues at Princeton and Harvard. We are currently resubmitting the work to ASPLOS, but for now, we have a paper on Arxiv.



To save time, I'm not going to go into too much detail about the internals of 3LA; instead, I'm mostly going to focus on Glenside's role. However, I wanted to begin by giving the high-level motivation for 3LA. Simulating, verifying, and compiling workloads on custom accelerators is hard.

Simulating, verifying, and compiling workloads on custom accelerators is hard.

3LA is a toolkit which makes it easier, by compiling workloads to the ILA simulation and verification framework.

Simulating, verifying, and compiling workloads on custom accelerators is hard.

3LA is a toolkit which makes it easier, by compiling workloads to the ILA simulation and verification framework.

Glenside is a key component of 3LA, where it is used to discover mappings of workloads to accelerators.



The Instruction Level Abstraction, or ILA, is a specification system developed by our collaborators at Princeton. ILA allows hardware developers to formally specify the behavior of their hardware by defining an ISA-like interface.

This interface is portable across platforms, easy to target from compilers, and provides both verification and simulation abilities out of the box.






The core goal of the 3LA project is to compile deep learning workloads down to ILA instructions, so that we can do simulation, verification, and compilation to our custom hardware.

But the question of how to actually map the computations within deep learning models down to ILA instructions

(Build)

Is not an easy problem.

(Build)

And a spoiler alert here: we're going to use Glenside to solve this problem.

But first, let's talk about what we tried before Glenside.







Before using glenside to map models down to accelerators, we attempted to use TVM's bring your own codegen, or BYOC. BYOC allows you to match patterns

(Build)

Such as this, within machine learning models.

This pattern matches what's called linear layer: a dense followed by a bias\_add.







Using BYOC, we attempt to map five workloads to two different accelerators: VTA and FlexASR, which both accelerate linear layers. As you can see, with the exception of ResMLP and Transformer on VTA, BYOC finds very few mapping opportunities. And, as we'll see, it's not because the opportunities aren't there!



Why aren't we finding many matches? In many cases, it is because of simple mismatches. For example, this snippet of code from Mobilenet represents a linear layer, yet it will not match our pattern (Build)

as the add should be a bias\_add.

Many small mismatches like this exist within the workloads. One solution would be to write additional BYOC patterns, to match the subtle variations in the workload. (Build)

But a more sustainable solution would be to somehow make these rewrites more flexible.







This is where we will use Glenside.



Using Glenside and equality saturation, we implement what we call flexible matching, where small exploratory rewrites provided by Glenside expose many more possible hardware mappings.







In Proceedings of the 36th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages, pp. 264-276. 2009.

Basic idea:

instead of *destructively* rewriting a program with a predetermined list of program rewrites,

run all rewrites simultaneously and repeatedly,

Ross Tate, Michael Stepp, Zachary Tatlock, and Sorin Lerner. "Equality saturation: a new approach to optimization." In Proceedings of the 36th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages, pp. 264-276. 2009.

Basic idea:

instead of *destructively* rewriting a program with a predetermined list of program rewrites,

run all rewrites simultaneously and repeatedly,

and keep all of the discovered versions of the program!

71

Ross Tate, Michael Stepp, Zachary Tatlock, and Sorin Lerner. "Equality saturation: a new approach to optimization." In Proceedings of the 36th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages, pp. 264-276. 2009.





So let's look at a simple example. Imagine we have a small math language with the following rewrites.

We know that x times 1 equals 1.

We know that x divided by x also equals 1.

We know that x times two can be implemented as x left shifted by 1.

And finally, we know that we can re-associate multiplication and division.











So let's use these rewrites to simplify a times two divided by two down to (Build)

a.





First, we reassociate the multiplication and the division, so that we have two-divided by two.



Then, we simplify two divided by two.



Finally, we simplify a times 1 to just a.





Imagine for example...



That we ran this rewrite first, and converted a times two to a left shifted by one. Well at this point, we're stuck! We can't do any more rewrites, and we won't be able to simplify this expression.







So why not keep around all discovered versions of the program?




Using an egraph, when we apply our rewrite to a times 2,



We still get our a left shifted by 1 expression, but (Build) We keep around the a times two expression! As a result (Build)







This is implemented within a library called egg, developed here at UW by Max Willsey.



So coming back to 3LA. Equality saturation allows us to use small exploratory rewrites to discover many equivalencies within the egraph, which expose more possible mappings down to hardware.



So what do these rewrites look like?

Well, we have two types of rewrites.

(Build)

The first type of rewrite capture accelerator semantics as rewrites which rewrite glenside expressions to accelerator calls, and (Build)

The second type of rewrite, our exploratory rewrites, are just general purpose rewrites over Glenside, provided by Glenside itself!







So what does it look like when we actually run these rewrites?



The exploratory rewrites will learn equivalencies about our deep learning model, and eventually,







|         | EfficientNet  | MobileNet V2 | ResMLP | ResNet-20 | Transformer |
|---------|---------------|--------------|--------|-----------|-------------|
| VTA     | 0 <b>→</b> 35 | 1 → 41       | 38     | 2 → 22    | 66          |
| FlexASR | 0 <b>→</b> 35 | 0 → 41       | 0 → 38 | 2 → 22    | 0 → 66      |
|         |               |              |        |           |             |
|         |               |              |        |           |             |
|         |               |              |        |           |             |
|         |               |              |        |           |             |
|         |               |              |        |           |             |
|         |               |              |        |           |             |
|         |               |              |        |           |             |

With flexible matching, we can drastically increase the number of matches for each workload on each accelerator over BYOC.



One example of a mapping that Glenside finds is the im2col transformation, which allows complex kernels like 2d convolution to be run on matrix multiplication accelerators.

This mapping is uncovered by these three exploratory rewrites, which are general purpose rewrites over Glenside itself. These rewrites do not explicitly encode the im2col transformation, but when used all together, they emergently rediscover the transformation.



So that is the 3LA project. (Build) But what does it show about Glenside?



Automatically generating compiler backends from explicit, formal hardware models

95

- gives rise to emergent optimizations,
- reduces development time, and
- enables verification.















Let's recall what is happening within Glenside as we're running rewrites and searching for hardware mappings. As we run rewrites, we are discovering equivalencies within the egraph.







And eventually, we find places to map in calls to our hardware













This is the core idea behind lakeroad. (Build)
Lakeroad uses similar techniques to Glenside (i.e. equality saturation) to map computation to custom hardware—in this case, FPGAs.

Lakeroad uses similar techniques to Glenside (i.e. equality saturation) to map computation to custom hardware—in this case, FPGAs.

However, Lakeroad additionally uses what it discovers to propose entirely *new* hardware primitives!





Fpga stands for field programmable gate array









FPGAs are composed of three primary types of devices:

First and most importantly, logic gates, from which FPGAs get their names.

Second, memory. There are memory units scattered throughout the device.

Lastly, digital signal processors or DSPs. These are more complex computational units; often they are small programmable processors! They can implement complex datatypes like floating point or bfloat.









For the purposes of this proposal, we will be focusing only on the programmable logic. Don't get me wrong: DSPs and memory are crucial in implementing good hardware designs, but for this proposal, we will test our ideas on programmable logic.

So what is programmable logic? In most cases, it is not and or gates like I have shown here, but instead, hardware blocks called look-up tables. (Build)

A look up table is just that—a table.

It takes some number of inputs and outputs some number of outputs, in this case, four inputs and one output.

Based on the inputs, it just looks up the output in its internal table.

(Build)

In this case, we've programmed this lookup table to implement an and on i0 and i1.

## (Build)

Lookup tables come in all shapes and sizes: for example, the lookup tables on the Xilinx Ultrascale+ FPGA architecture have six inputs and two outputs.















Now, let's get into how designs are compiled for FPGAs.



"UltraScale+ devices employ DSP blocks that are rated at 891MHz for the fastest speed grade. Nonetheless, large designs implemented on FPGAs typically achieve system frequencies lower than 400MHz."

Lavin, Chris, and Alireza Kaviani. "RapidWright: Enabling custom crafted implementations for FPGAs." 2018 IEEE 26th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2018.

111

In a paper written by Xilinx engineers, the authors state.

So things must be pretty bad if Xilinx engineers themselves are bemoaning the fact that FPGA compilers fail to compile effectively for their hardware!



The reason for the poor performance of FPGA compilers is that, among other reasons, FPGA compilers perform a complex dance, in which they compile the input hardware design from high-level behavioral verilog all the way down to a low level gate representation, and then attempt to raise it back up to the level of abstraction of FPGAs.

Recent works (Reticle!) have attempted a more direct, software-compiler-like approach.







To implement this, we need an FPGA "ISA": the lowest-level IR which gets converted to FPGA-ready Verilog.



New FPGA compiler toolchains specify their ISAs explicitly!

## 'comb' Dialect

Types and operations for comb dialect This dialect defines the comb dialect, which is intended to be a generic representation of combinational logic outside of a particular use-case.

| • | Operation | definition |
|---|-----------|------------|

- <u>comb.add (::circt::comb::AddOp)</u>
- <u>comb.and (::circt::comb::AndOp)</u>
- <u>comb.concat (::circt::comb::ConcatOp)</u>
- <u>comb.divs (::circt::comb::DivSOp)</u>
- <u>comb.divu (::circt::comb::DivUOp)</u>
  <u>comb.extract (::circt::comb::ExtractOp)</u>
- <u>comb.icmp (::circt::comb::lCmpOp)</u>
- <u>comb.mods (::circt::comb::ModSOp)</u>
- comb.modu (::circt::comb::ModUOp)
- <u>comb.mul (::circt::comb::MulOp)</u>
- <u>comb.mux (::circt::comb::MuxOp)</u>
- comb.or (::circt::comb::OrOp)
- <u>comb.parity (::circt::comb::ParityOp)</u>
- <u>comb.replicate (::circt::comb::ReplicateOp)</u>
- <u>comb.shl (::circt::comb::ShlOp)</u>
- <u>comb.shrs (::circt::comb::ShrSOp)</u>
- comb.shru (::circt::comb::ShrUOp)
- comb.sub (::circt::comb::SubOp)
- <u>comb.xor (::circt::comb::XorOp)</u>

| ះ 5   | cbc6be6d4 - calyx / primitives / core.futil                                                         |                  |              | Go to   | file |       |
|-------|-----------------------------------------------------------------------------------------------------|------------------|--------------|---------|------|-------|
| ۲     | rachitnigam @reset interface port (#579) 📖 🗸                                                        | Latest commit 15 | 4becf on Jul | 1, 2021 | Юні  | story |
| ጸኣ 4  | contributors  💮 🍓 🤤                                                                                 |                  |              |         |      |       |
| 100 1 | ines (93 sloc) 2.87 KB                                                                              | Ra               | w Blame      |         | ġ    |       |
|       | extern "core.sv" {                                                                                  |                  |              |         |      |       |
|       | /// Primitives                                                                                      |                  |              |         |      |       |
|       | <pre>primitive std_const&lt;"share"=1&gt;[WIDTH, VALUE]() -&gt; (out: WIDTH);</pre>                 |                  |              |         |      |       |
|       | primitive std_slice<"share"=1>[IN_WIDTH, OUT_WIDTH](in: IN_WIDTH) -> (out: OUT_WIDTH);              |                  |              |         |      |       |
|       | primitive std_pad<"share"=1>[IN_WIDTH, OUT_WIDTH](in: IN_WIDTH) -> (out: OUT_WIDTH);                |                  |              |         |      |       |
|       |                                                                                                     |                  |              |         |      |       |
|       | /// Logical operators                                                                               |                  |              |         |      |       |
|       | primitive std_not<"share"=1>[WIDTH](in: WIDTH) -> (out: WIDTH);                                     |                  |              |         |      |       |
|       | primitive std_and<"share"=1>[WIDTH](left: WIDTH, right: WIDTH) -> (out: WIDTH);                     |                  |              |         |      |       |
| 10    | primitive std_or<"share"=1>[WIDTH](left: WIDTH, right: WIDTH) -> (out: WIDTH);                      |                  |              |         |      |       |
| 11    | primitive std_xor<"share"=1>[WIDTH](left: WIDTH, right: WIDTH) -> (out: WIDTH);                     |                  |              |         |      |       |
| 12    |                                                                                                     |                  |              |         |      |       |
| 13    | /// Numerical Operators                                                                             |                  |              |         |      |       |
| 14    | <pre>primitive std_add&lt;"share"=1&gt;[WIDTH](left: WIDTH, right: WIDTH) -&gt; (out: WIDTH);</pre> |                  |              |         |      |       |
| 15    | <pre>primitive std_sub&lt;"share"=1&gt;[WIDTH](left: WIDTH, right: WIDTH) -&gt; (out: WIDTH);</pre> |                  |              |         |      |       |
| 16    | <pre>primitive std_gt&lt;"share"=1&gt;[WIDTH](left: WIDTH, right: WIDTH) -&gt; (out: 1);</pre>      |                  |              |         |      |       |
| 17    | <pre>primitive std_lt&lt;"share"=1&gt;[WIDTH](left: WIDTH, right: WIDTH) -&gt; (out: 1);</pre>      |                  |              |         |      |       |
| 18    | <pre>primitive std_eq&lt;"share"=1&gt;[WIDTH](left: WIDTH, right: WIDTH) -&gt; (out: 1);</pre>      |                  |              |         |      |       |
| 19    | <pre>primitive std_neq&lt;"share"=1&gt;[WIDTH](left: WIDTH, right: WIDTH) -&gt; (out: 1);</pre>     |                  |              |         |      |       |

Reticle compiles designs to these ISA instructions, and then those instructions get converted to FPGA-specific Verilog.



So we know ISAs are necessary for these new FPGA compilation toolchains.

But how do we choose the ISA?

But how do we choose the ISA? And how do we implement it?

But how do we choose the ISA? And how do we implement it? Currently: by hand!


Why is choosing an ISA by hand bad?





A key optimization for FPGAs is packing or fusing lookup tables.

This diagram shows three lookup tables being used to implement the instructions a, b, and c.

(Build)

If the optimizer discovers that a and c can fit into a single lookup table,

(Build)

It will combine them! This is called packing or fusion.

This is a key optimization in improving FPGA performance and packing larger designs onto the FPGA.















Okay, so choosing an ISA by hand is bad.



Implementing ISAs by hand is infeasible for large ISAs—and a great source of bugs!



This is an example from Reticle.

In Reticle, Luis has written rewrites which map high level Reticle constructs, such as this add,

(Build)

To low level, FPGA specific implementations of those constructs. This is that same add, implemented using eight lookup tables on a Xilinx FPGA. (Build)

Luis wrote all of these rewrites by hand for Reticle!

|                                                                                                                  | Xilinx 7-series implementation of 8-bit add:           |
|------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------|
| <pre>8-bit add ISA instruction:<br/>pat add_i8(a:i8, b:i8) -&gt; (y:i8) {<br/>y:i8 = add(a, b) @lut;<br/>}</pre> | <pre>imp add i8[1, 2](a:i8, b:i8) -&gt; (y:i8) {</pre> |
|                                                                                                                  | t0:bool = ext[0](a);                                   |
|                                                                                                                  | t1:bool = ext[1](a);                                   |
|                                                                                                                  | t2:bool = ext[2](a);                                   |
|                                                                                                                  | t3:bool = ext[3](a);                                   |
|                                                                                                                  | t4:bool = ext[4](a);                                   |
|                                                                                                                  | t5:bool = ext[5](a);                                   |
|                                                                                                                  | t6:bool = ext[6](a);                                   |
|                                                                                                                  | t7:bool = ext[7](a);                                   |
|                                                                                                                  | t8:bool = ext[0](b);                                   |
|                                                                                                                  | t9:bool = ext[1](b);                                   |
|                                                                                                                  | t10:bool = ext[2](b);                                  |
|                                                                                                                  | t11:bool = ext[3](b);                                  |
|                                                                                                                  | <pre>t12:bool = ext[4](b);</pre>                       |
|                                                                                                                  | t13:bool = ext[5](b);                                  |
|                                                                                                                  | t14:bool = ext[6](b);                                  |
|                                                                                                                  | t15:bool = ext[7](b);                                  |
|                                                                                                                  | t16:bool = lut2[6](t0, t8) @a6(??, ??);                |
|                                                                                                                  | t17:bool = lut2[6](t1, t9) @b6(??, ??);                |
|                                                                                                                  | t18:bool = lut2[6](t2, t10) @c6(??, ??);               |
|                                                                                                                  | t19:bool = lut2[6](t3, t11) @d6(??, ??);               |
|                                                                                                                  | t20:bool = lut2[6](t4, t12) @e6(??, ??);               |
|                                                                                                                  | t21:bool = lut2[6](t5, t13) @f6(??, ??);               |
|                                                                                                                  | t22:bool = lut2[6](t6, t14) @g6(??, ??);               |
|                                                                                                                  | t23:bool = lut2[6](t7, t15) @h6(??, ??);               |
|                                                                                                                  | t24:i8 = cat(t16, t17, t18, t19, t20, t21, t22, t23);  |
|                                                                                                                  | y:i8 = carryadd(a, t24) @c8(??, ??);                   |
|                                                                                                                  | }                                                      |



This is slow, especially if we have multiple backends and many fused instructions!



We introduce Lakeroad, a tool for automatically *defining* and *implementing* ISAs for FPGAs.



Lakeroad is split into two halves: ISA exploration, which chooses an ISA, and ISA implementation synthesis, which, given an ISA, produces an FPGA-specific implementation.



So let's begin with ISA exploration.

Core idea of ISA exploration: define the ISA from instructions found in real designs.





Internally, ISA exploration has two stages: instruction enumeration, in which we enumerate all instruction seen in all hardware designs, and instruction filtering, in which we pare down the list of instructions to our final ISA.



A simple example: enumerate the instructions present in not (a and b)



Naive approach: instructions are just the subexpressions!





Then the subexpressions would be,

First, the expression itself, not (i0 and i1), where a and b are replaced with generic placeholders i0 and i1, And second, and of i0 and i1.



But this naive approach misses an instruction!





Specifically, the not instruction.



So this naive approach doesn't work. So how do we capture all instructions?

Unsurprisingly, I'm going to do it with rewrites, and use egg!





First I'll describe what the rewrites do at a high level, and then we'll look at the rewrites running in an egraph.

In general, the rewrites say that, to convert a node into an instruction, we have to decide which of its children we will convert to placeholder arguments like i0 and i1, and which children we will keep as their full expressions.



So if we look at the and expression,


The only option we have is to convert its children to placeholders.



But when we look at the not expression, now we have a choice!



We can either choose to keep the and expression, which produce this instruction, or



We can replace the and with a placeholder.





Running the rewrites over an egraph looks something like this.



First, we'll rewrite the and expression into the application of the i0 and i1 instruction on arguments a and b.



Then we'll rewrite the not expression similarly, to the application of the not i0 instruction onto a and b.

Finally, when we see that we have two instructions in sequence, that is a not instruction being applied to the result of an and instruction,



We have a rewrite that fuses them into a single fused instruction, not i0 and i1.



From here, it's easy to pull out the enumerated instructions from the egraph.





So for example, if we apply one of Demorgan's laws...



...onto our egraph...



...we get a completely new expression. If we are running the enumeration rewrites simultaneously, the rewrites will enumerate the instructions in this new expression.





We can pull all of the instructions we found out of the egraph-this becomes our list of potential instructions for the ISA.



Next, we filter the instructions to produce our final ISA.



In this step, we filter down the potentially long list of instructions based on user preference.



For example, we might filter out this instruction as being too specialized.

In this toy example, we're likely not going to want to filter out any of these instructions, but as our list of instructions gets long, filtering will be important to keep the size of the isa reasonable.





At the core of the ISA implementation synthesizer is a tool called Rosette.

For all inputs a and b,

For all inputs a and b, find an implementation of low-level-FPGA-impl such that

For all inputs a and b,
find an implementation of low-level-FPGA-impl such that
 low-level-FPGA-impl(a, b) == high-level-instr-impl(a, b)

For all inputs a and b,
find an implementation of low-level-FPGA-impl such that
 low-level-FPGA-impl(a, b) == high-level-instr-impl(a, b)

To do so, we need to define the high-level semantics of the instruction,

For all inputs a and b,
find an implementation of low-level-FPGA-impl such that
 low-level-FPGA-impl(a, b) == high-level-instr-impl(a, b)

To do so, we need to define the **high-level semantics** of the instruction, and the **low-level semantics** of the FPGA.





| 4.3.2 Bitwise Operators    | 4.3.2 Bitwise Operators                                                       |
|----------------------------|-------------------------------------------------------------------------------|
| bvnot                      |                                                                               |
| bvand                      | $(hunst x) \rightarrow (hityester n)$                                         |
| bvor                       | x : (hitvector n)                                                             |
| bvxor                      |                                                                               |
| bvshl                      | Returns the bitwise negation of the given bitvector value.                    |
| bvlshr                     | Evenue                                                                        |
| bvashr                     | Examples:                                                                     |
| 4.3.3 Arithmetic Operators | > (bvnot (bv -1 4))                                                           |
| bvneg                      | (bv #x0 4)                                                                    |
| bvadd                      | > (bvnot (bv 0 4))<br>(bv #vf 4)                                              |
| bvsub                      | > (define-symbolic b boolean?)                                                |
| bymul                      | > (bvnot (if b 0 (bv 0 4))); This typechecks only when b is false,            |
| bysdiv                     | (bv #xf 4)                                                                    |
| bysdiv                     | > (vc) ; so Rosette emits a corresponding assertion.                          |
|                            | (vc #t (! b))                                                                 |
| bvsrem                     |                                                                               |
| bvurem                     | $(bvand x \dots) \rightarrow (bitvector n)$ procedure                         |
| bvsmod                     | x : (bitvector n)                                                             |
| 4.3.4 Conversion           | $(\mathbf{bvor} \ x \ \dots +) \rightarrow (\mathbf{bitvector} \ \mathbf{n})$ |
| operators                  | <pre>x : (bitvector n)</pre>                                                  |
| concat                     | $(bvxor x \dots) \rightarrow (bitvector n)$                                   |



i0 and i1  $\longrightarrow$  (bvand i0 i1) not (i0 and i1)  $\longrightarrow$  (bvnot (bvand i0 i1)) not i0  $\longrightarrow$  (bvnot i0) i0 or i1  $\longrightarrow$  (bvor i0 i1)

175



To capture architecture-level semantics of FPGAs, we simply build an *interpreter* for each FPGA component!

177



## (define (lut memory inputs)

(let\* ([inputs (zero-extend inputs (bitvector (length (bitvector->bits memory))))])
 (extract 0 0 (bvlshr memory inputs))))

178






For all inputs a and b, find an implementation of low-level-and-impl such that low-level-and-impl(a,b) == high-level-and-impl(a,b)

For all inputs a and b, find an implementation of low-level-and-impl such that low-level-and-impl(a,b) == (bvand a b)

For all inputs a and b, find a setting of memory such that (ultrascale-plus-lut6-2 memory (list a b)) == (bvand a b)

For all inputs a and b, find a setting of memory such that (ultrascale-plus-lut6-2 memory (list a b)) == (bvand a b)

184

So we plug that into rosette, and rosette gives us an answer!















To support new FPGA architectures... ...just provide an interpreter!





So we demonstrated synthesizing for Xilinx ultrascale+, but if I wanted to synthesize

## (Build)

For lattice ECP5, which is another FPGA, I would just need to provide an interpreter for ECP5's lookup tables.





1. Insert it into the egraph

- 1. Insert it into the egraph
- 2. Enumerate its instructions via rewrites

- 1. Insert it into the egraph
- 2. Enumerate its instructions via rewrites
- 3. Extract an implementation composed of instructions in our ISA

- 1. Insert it into the egraph
- 2. Enumerate its instructions via rewrites
- 3. Extract an implementation composed of instructions in our ISA

191

4. Output Verilog

- 1. Insert it into the egraph
- 2. Enumerate its instructions via rewrites
- 3. Extract an implementation composed of instructions in our ISA

191

4. Output Verilog

If the design isn't covered with the current ISA, we can:

- 1. Insert it into the egraph
- 2. Enumerate its instructions via rewrites
- 3. Extract an implementation composed of instructions in our ISA
- 4. Output Verilog

If the design isn't covered with the current ISA, we can:

Run rewrites to find alternative implementations of the design

- 1. Insert it into the egraph
- 2. Enumerate its instructions via rewrites
- 3. Extract an implementation composed of instructions in our ISA
- 4. Output Verilog

If the design isn't covered with the current ISA, we can:

Run rewrites to find alternative implementations of the design

191

• Find a minimal set of new instructions to add to the ISA



So that is lakeroad: a tool that automatically defines and implements ISAs for FPGAs.

Automatically generating compiler backends from explicit, formal hardware models

- gives rise to emergent optimizations,
- reduces development time, and
- enables verification.









## **Proposed Evaluation**

## **Proposed Evaluation**

Paper 1: ISA Implementation Synthesis

## **Proposed Evaluation**

Paper 1: ISA Implementation Synthesis

 Goal: Evaluate the quality of synthesized implementations; demonstrate ability to support new FPGAs.
Paper 1: ISA Implementation Synthesis

• Goal: Evaluate the quality of synthesized implementations; demonstrate ability to support new FPGAs.

195

• We will synthesize three ISAs (Reticle, Calyx, MLIR Comb) for three FPGAs (UltraScale+, ECP5, SOFA)

Paper 1: ISA Implementation Synthesis

• Goal: Evaluate the quality of synthesized implementations; demonstrate ability to support new FPGAs.

195

• We will synthesize three ISAs (Reticle, Calyx, MLIR Comb) for three FPGAs (UltraScale+, ECP5, SOFA)

Paper 2: ISA Exploration and Lakeroad End-to-End

Paper 1: ISA Implementation Synthesis

• Goal: Evaluate the quality of synthesized implementations; demonstrate ability to support new FPGAs.

195

 We will synthesize three ISAs (Reticle, Calyx, MLIR Comb) for three FPGAs (UltraScale+, ECP5, SOFA)

Paper 2: ISA Exploration and Lakeroad End-to-End

• Goal: Demonstrate ability to enumerate a large space of interesting instructions; demonstrate fast compilation using the egraph.

Paper 1: ISA Implementation Synthesis

- Goal: Evaluate the quality of synthesized implementations; demonstrate ability to support new FPGAs.
- We will synthesize three ISAs (Reticle, Calyx, MLIR Comb) for three FPGAs (UltraScale+, ECP5, SOFA)

Paper 2: ISA Exploration and Lakeroad End-to-End

- Goal: Demonstrate ability to enumerate a large space of interesting instructions; demonstrate fast compilation using the egraph.
- We will run Lakeroad end-to-end on a large corpus of hardware benchmarks (from sources like MachSuite)



Automatically generating compiler backends from explicit, formal hardware models

- gives rise to emergent optimizations,
- reduces development time, and
- enables verification.

So far, I have provided evidence for this thesis through Glenside and its application in 3LA.

So far, I have provided evidence for this thesis through Glenside and its application in 3LA.

I plan to demonstrate this thesis once more through Lakeroad.





June 2022: Submit Lakeroad part 2 to 2nd round of ASPLOS June 2022: Resubmit 3LA paper to 2nd round of ASPLOS

June 2022: Submit Lakeroad part 2 to 2nd round of ASPLOS June 2022: Resubmit 3LA paper to 2nd round of ASPLOS October 2022: Submit Lakeroad part 1 to 3rd round of ASPLOS

June 2022: Submit Lakeroad part 2 to 2nd round of ASPLOS

199

June 2022: Resubmit 3LA paper to 2nd round of ASPLOS

October 2022: Submit Lakeroad part 1 to 3rd round of ASPLOS

Autumn Quarter 2022: Submit 3LA verification paper

June 2022: Submit Lakeroad part 2 to 2nd round of ASPLOS June 2022: Resubmit 3LA paper to 2nd round of ASPLOS October 2022: Submit Lakeroad part 1 to 3rd round of ASPLOS Autumn Quarter 2022: Submit 3LA verification paper Winter Quarter 2023: Fulfill final TA requirement

June 2022: Submit Lakeroad part 2 to 2nd round of ASPLOS June 2022: Resubmit 3LA paper to 2nd round of ASPLOS October 2022: Submit Lakeroad part 1 to 3rd round of ASPLOS Autumn Quarter 2022: Submit 3LA verification paper Winter Quarter 2023: Fulfill final TA requirement Winter/Spring Quarter 2023: Deal with Lakeroad and 3LA resubmissions

June 2022: Submit Lakeroad part 2 to 2nd round of ASPLOS June 2022: Resubmit 3LA paper to 2nd round of ASPLOS October 2022: Submit Lakeroad part 1 to 3rd round of ASPLOS Autumn Quarter 2022: Submit 3LA verification paper Winter Quarter 2023: Fulfill final TA requirement Winter/Spring Quarter 2023: Deal with Lakeroad and 3LA resubmissions Spring Quarter 2023: Write thesis and defend

















## Outline

- Motivating Example: Matrix Multiplication
- Access Pattern Definition
- Case Studies
  - Reimplementing Matrix Multiplication with Access Patterns
  - Implementing 2D Convolution with Access Patterns
  - Hardware Mapping as Program Rewriting
  - Flexible Hardware Mapping with Equality Saturation

Now that we've defined our core abstraction, access patterns, we will demonstrate how access patterns and the Glenside IR enable rich term rewriting over tensor programs.

208

We begin by first showing how Glenside cleanly represents two core deep learning kernels: matrix multiplication and 2D convolution.

We then return to our original task, and show how Glenside can be used to implement hardware mapping via program rewriting.

Finally, using equality saturation, a modern program rewriting framework, we demonstrate how Glenside can discover ways to flexibly map other kernels onto hardware that was designed for a specific purpose: in this case, mapping 2D convolution to matrix multiplication hardware.

# Outline

- Motivating Example: Matrix Multiplication
- Access Pattern Definition
- Case Studies
  - Reimplementing Matrix Multiplication with Access Patterns

209

- Implementing 2D Convolution with Access Patterns
- Hardware Mapping as Program Rewriting
- Flexible Hardware Mapping with Equality Saturation

Let's begin by reimplementing matrix multiplication with access patterns and the Glenside IR.

Given matrices A and B, pair each row of A with each column of B, compute their dot products, and arrange the results back into a matrix.

210

Remember, we defined our matrix multiplication algorithm as,



We will now implement matrix multiplication with Glenside.

In these slides, the Glenside expression will appear on the left, and the access pattern shape of the Glenside expression on that line will appear as a lisp-style comment on the right.

To begin implementing matrix multiplication, we first need to access A as a list of its rows, and access B as a list of its columns.

Accessing A as a list of its rows is straightforward. Given a tensor and a dimension index, the access operator converts a tensor into an access pattern by splitting the tensor's shape at the given dimension index. In this case, A is a three-comma-four shaped tensor, so accessing it at dimension 1 produces this access pattern shape. This access pattern views A as a three-length vector of four-length vectors—in other words, a list of the rows of A.





Next, we must access B as a list of its columns. Accessing B at dimension 1, however, gives a list of B's rows.



Thus, we employ transpose, which we call an access pattern transformer in Glenside.

Given an access pattern and a list of integers, transpose reorders the dimensions of the access pattern.

In this case, we simply swap the dimensions of the access pattern, giving us a 2-length vector of 4-length vectors, or a list of B's columns.





Now that we have a list of A's rows and a list of B's columns, we use Glenside's cartesian product transformer to create every row-column pair. Notice that the resulting access pattern is a three-comma-two shaped matrix, each element of which is a two-comma-four shaped matrix. The three-comma-two shape preserves the shape information from A and B, and will give us the final shape of the output matrix. The two-comma-four shape represents a pair of 4-length vectors—one row of A paired with one column of B.





Finally, we compute the dot product of each row-column pair. Glenside's compute expression maps a given operator over the compute dimensions of the access pattern, i.e. the second shape tuple, completely ignoring the access dimensions, i.e. the first shape tuple. In this case, the dot product operator reduces each pair of 4-length vectors to a scalar.

And that is our full representation of matrix multiplication in Glenside! As you can see, our representation here is pure, low-level, and conveniently avoids relying on binding.


## Outline

- Motivating Example: Matrix Multiplication
- Access Pattern Definition
- Case Studies
  - Reimplementing Matrix Multiplication with Access Patterns

216

- Implementing 2D Convolution with Access Patterns
- Hardware Mapping as Program Rewriting
- Flexible Hardware Mapping with Equality Saturation

Next, we will demonstrate the expressiveness of access patterns by implementing an entirely different kernel – 2D convolution – with Glenside.



Let's first get a brief refresher on how 2D convolution works.

The inputs to 2D convolution are a batch of image or activation tensors, and a list of weight or filter tensors.



In this example, we will show what happens with just one set of activations and one filter, as the same process is repeated for every set of activations and every filter.



The filter is applied onto a window of the activations by computing an element wise multiplication and sum between the filter and the window of activations.





This element wise multiplication and sum is computed over every possible window of the activations.







This same process is completed for each filter in the set, producing one output channel for each filter.



To implement 2D convolution in Glenside, we begin by accessing the weights as a list of 3 dimensional filters. In this case, there are O filters, where O will be our number of output dimensions, and each filter is of shape C, Kh, Kw, where C represents our number of input channels and Kh and Kw are our filter height and width.



Next we access our activations as a batch of N 3 dimensional images, where N is our batch size, and H and W are the input height and width.



Now we must form windows over our activations. As this is a common pattern, also appearing for example in max and average pooling, Glenside provides the windows transformer for this purpose.



The windows transformer takes an access pattern, a window shape, and a list of strides.



The result of the windows transformer is a new batch of images of shape H',W', where at each location in each new image, there is a C, Kh, Kw shaped window, representing each possible window of each image in the batch.



Next, we use the cartesian product transformer to pair every window of every image with every filter.



Finally, we compute the dot product of each window-filter pair, which element wise multiplies each 3D window with each 3D filter and sums the results to a single scalar value.

This represents the core of the Glenside implementation of 2d convolution.



Optionally, we can also remove and rearrange some dimensions so that the output layout matches the input layout.

## Outline

- Motivating Example: Matrix Multiplication
- Access Pattern Definition
- Case Studies
  - Reimplementing Matrix Multiplication with Access Patterns
  - Implementing 2D Convolution with Access Patterns
  - Hardware Mapping as Program Rewriting
  - Flexible Hardware Mapping with Equality Saturation

Now that we've shown that Glenside can represent common machine learning kernels, let's understand how it can be used for term rewriting. We begin by addressing our original goal.

232



Recall our original idea: that hardware mapping is a program rewriting problem. Can we turn our hardware designer's description of her accelerator into a searchable pattern?







With Glenside, we can!

In Glenside, we can express her hardware as the following pattern, which matches a dot product mapped over the cartesian product of two access patterns a0 and a1. Importantly, we can even express the fact that her hardware expects both access patterns to be vectors of vectors.



We can then directly rewrite this pattern to hardware invocations, represented here as a new systolic array expression. We can even convey hardware parameters such as the number of systolic array rows and columns.

```
((?n), (?rows))
and ?a1 is of shape
((?cols), (?rows))
        (compute<sup>sy</sup>dotPrfoday ?rows ?cols ?a0 ?a1)
        (cartProd
        (access A 1)
        (transpose
        (access B 1)
        (list 1 0))))
```

Now, we can use this rewrite to map our matrix multiplication implementation to her systolic array.

```
((?n), (?rows))
and ?a1 is of shape
((?cols), (?rows))
(compute<sup>sy</sup>dotProday ?rows ?cols ?a0 ?a1)
        (cartProd
        (access A 1)
        (transpose
        (access B 1)
        (list 1 0))))
```

## Outline

- Motivating Example: Matrix Multiplication
- Access Pattern Definition
- Case Studies
  - Reimplementing Matrix Multiplication with Access Patterns
  - Implementing 2D Convolution with Access Patterns
  - Hardware Mapping as Program Rewriting
  - Flexible Hardware Mapping with Equality Saturation

We have shown how we can use Glenside to formulate a program rewrite to perform hardware mapping. Yet mapping a matrix multiplication to matrix multiplication hardware is perhaps not all that impressive. To show the true power of program rewriting with Glenside, we will now demonstrate the *flexible* mapping of a 2D convolution to matrix multiplication hardware.

239



Looking at the Glenside implementations of 2d convolution—on the left—and matrix multiplication—on the right—we can see a remarkably similar structure.

| <pre>(windows<br/>(access activ<br/>(shape C Kh K<br/>(shape 1 Sh Sw))<br/>(access weights 1)))<br/>1)<br/>(list 0 3 1 2))</pre> | <pre>(compute dotProd<br/>(cartProd<br/>(access A 1)<br/>(transpose<br/>(access B 1)<br/>(list 1 0))))</pre> |
|----------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------|
|                                                                                                                                  | 241                                                                                                          |

Specifically, their core computation is the same: we take the cartesian product of two access patterns, and compute a dot product over the result. The primary difference is in how we form the input access patterns.



Knowing that 2D convolution is so similar to matrix multiplication, it begs the question: can we apply our systolic array rewrite?



As it stands, the answer is no. The systolic array rewrite expects access patterns which are vectors of vectors, while the access patterns present in 2D convolution are much more complicated.



However, if we could somehow flatten the access patterns to be vectors of vectors, we could apply our rewrite. So the question is, can we flatten our access patterns?



The answer is yes, and it begins with a very simple rewrite. This rewrite matches any access pattern, flattening and immediately reshaping the access pattern back to its original shape.

On its surface, it may seem impractical that this rewrite is applied to every possible access pattern. However, using the equality saturation program rewriting framework provided by the egg library, we can efficiently apply this rewrite in all possible locations and store all resulting programs.



When we apply this rewrite, one of the resulting programs is the following.



Notice the reshape and flatten transformers on the two access patterns.

However, our access pattern shapes haven't changed, as even though the access patterns are flattened, they are then immediately reshaped back to their original shapes.




To remedy this, we need to bubble the reshapes to the top.

```
  these rewrites "bubble" reshape through cartProd and compute dotProd
(cartProd
  (reshape ?a0 ?shape0)
  (reshape ?a1 ?shape1)) * (reshape (cartProd ?a0 ?a1) ?newShape)

  (compute dotProd
  (reshape ?a ?shape)) * (reshape (compute dotProd ?a) ?newShape)
```

To do so, we can use these two rewrites, which express the composition commutativity of reshape with cartesian product and compute dot product. Note that these rewrites express general properties of Glenside, and are not specific to the task at hand.

Once we apply these rewrites, our program looks like this.



The reshapes have been moved out, and the access patterns are now flattened.



We can now use our rewrite to map 2D convolution to our matrix multiplication hardware!

This transformation is not new—this is what's called the im2col transformation. With a set of three simple rewrites expressing general properties of Glenside, we are able to rediscover im2col!













Glenside is an actively maintained Rust library! Please try it out and feel free to open issues if you have questions or find bugs.



Lastly, I'd like to thank everyone who worked on this project, my labs, and the funding agencies that made it possible.



Thank you for listening, and I'm excited to take your questions!