Notes on "SCALE-Sim: Systolic CNN Accelerator Simulator" from Samajdar et al.
I’m reading this paper with these goals:
- I want to understand when multiple systolic arrays make sense. It must depend on the workload. What kinds of workloads benefit? And what’s the strategy for running them on multiple systolic arrays? Do we parallelize a single layer, or do we run multiple layers at once?
- I’d also like to understand what the design space of dataflows looks like. I’ve been having this thought that that’s where the real design space is, and that it might be an interesting place to explore software–hardware codesign.
We provide the first open-source, cycle-accurate DNN accelerator simulator based on the systolic-array architecture.
This is surprising to me!
When it comes to memory: larger means more power/area, but could be worth it when there’s a lot of reuse. Smaller means more spilling to DRAM depending on the workload, which might actually incur more DRAM activity and thus more power.
SCALE-SIM is a simulator that provides a publicly available modeling infrastructure for systolic array CNN accelerators. SCALE-SIM enables designers to quickly iterate over and validate their upcoming designs with respect to the various optimization goals for their respective implementation points.
So they’re selling this (at least partly) as a tool/infrastructure paper. A feeling I’ve been having is that, with hardware, it seems very difficult to build abstractions that will work for everyone, which is exactly what you need to do if you’re going to build a flexible simulation framework. So I’m excited to see how they do it!
The “convolutions and matrix multiplications are 90% of CNN inference” statistic apparently comes from the Eyeriss paper. Good to remember, as I’m sure I’ll need it someday.
Reading their description of systolic arrays led me to a question that I can’t believe I haven’t asked: are GPUs implemented with systolic arrays? My assumption is no, but I’m unsure what the differences are. My guess is that the individual processing units in a GPU are more complicated, and the network between units is not fixed the way a systolic array’s is. Oh, and the data reuse/data movement aspect of systolic arrays is central to their design (in fact, it’s where their name comes from!)
Reading this reminds me of a core research idea of mine: nothing in hardware matters without the workload. The workload defines exactly what is run. So the fact that the workload does not drive design even more directly than it currently does is shocking. Workloads do drive design. Architects profile workloads to make decisions. But the fact that that loop isn’t tighter and more automated is the perplexing thing.
Output stationary: each PE handles a single output. Rows are assigned to channels, that is, one row contains outputs from the same channel. I’d guess that implies that columns are spatial?
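To pin down that guess, here’s a toy sketch (entirely my own, not from the paper; the function and argument names are made up) where PE (r, c) holds the output for channel r at spatial position c:

```python
import numpy as np

# A toy 4x4 "array": rows hold output channels, columns hold spatial positions.
ROWS, COLS = 4, 4

def output_stationary_tile(weight_rows, ifmap_cols):
    """weight_rows[r]: the flattened filter for output channel r
       ifmap_cols[c] : the flattened input window for spatial output position c
       PE (r, c) keeps its single partial sum in place and accumulates the whole
       dot product over time; here that collapses into one np.dot call."""
    out = np.zeros((ROWS, COLS))
    for r in range(ROWS):
        for c in range(COLS):
            out[r, c] = np.dot(weight_rows[r], ifmap_cols[c])
    return out

# 9-element flattened 3x3 filters and windows, just to exercise the function
weights = np.random.rand(ROWS, 9)
windows = np.random.rand(COLS, 9)
print(output_stationary_tile(weights, windows).shape)   # (4, 4)
```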
I’m more familiar with weight stationary because it’s what we use.
Unsure when input stationary makes sense, compared to weight stationary.
When the convolution operation is formulated as successive dot-product operations, three reuse patterns are immediately evident.
- Each convolution window uses the same filter matrix to generate pixels corresponding to a given output channel
- Adjacent convolution windows share portions of the input matrix if the stride is smaller than the window dimension
- To generate an output pixel in different output channels, different filter matrices use the same convolution window.
In short, there are spatial, temporal, and spatio-temporal reuse patterns in CNN inference.
I like how straightforward this description is. It makes the dataflow designs pretty obvious. Yet it’s important not to let the simplicity of this list convince you that these are the only possible dataflows. Eyeriss’s row stationary exploits multiple of these reuse patterns.
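To keep the three patterns straight, here’s a naive conv2d loop nest (my own sketch, not from the paper) with each reuse annotated where it shows up:

```python
import numpy as np

def conv2d(ifmap, filters, stride=1):
    """Naive conv2d. ifmap is (C, H, W); filters is (K, C, R, S)."""
    C, H, W = ifmap.shape
    K, _, R, S = filters.shape
    out_h = (H - R) // stride + 1
    out_w = (W - S) // stride + 1
    out = np.zeros((K, out_h, out_w))
    for k in range(K):                    # one filter per output channel
        for y in range(out_h):
            for x in range(out_w):
                # Reuse 1: filters[k] is the same for every (y, x) window.
                # Reuse 2: if stride < R or stride < S, this window overlaps
                #          the windows at (y-1, x) and (y, x-1).
                # Reuse 3: the window below is re-read for every k to produce
                #          the K output channels at position (y, x).
                window = ifmap[:, y*stride:y*stride+R, x*stride:x*stride+S]
                out[k, y, x] = np.sum(window * filters[k])
    return out

# Tiny smoke test: 3 input channels, 8x8 image, 4 filters of 3x3
print(conv2d(np.random.rand(3, 8, 8), np.random.rand(4, 3, 3, 3)).shape)  # (4, 6, 6)
```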
Double-buffering: there’s a working set and an idle set. The working set is actively used by the array while new data is loaded into the idle set; when the working set is exhausted, the two swap roles.
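A minimal sketch of the ping-pong idea, in my own pseudocode rather than anything from SCALE-Sim:

```python
def run_double_buffered(tiles, load, compute):
    """tiles: list of tile descriptors; load(buf, tile) fills a buffer;
    compute(buf) consumes a filled buffer. In hardware the load into the idle
    set and the compute on the working set happen in the same cycles; here
    they run sequentially, but the role swap is the same."""
    buffers = [{}, {}]                               # stand-ins for the two SRAM banks
    working = 0
    load(buffers[working], tiles[0])                 # prime the first working set
    for i in range(len(tiles)):
        idle = 1 - working
        if i + 1 < len(tiles):
            load(buffers[idle], tiles[i + 1])        # prefetch next tile into idle set
        compute(buffers[working])                    # array reads the working set
        working = idle                               # swap roles for the next tile

# Toy usage
run_double_buffered(
    tiles=list(range(4)),
    load=lambda buf, t: buf.update(tile=t),
    compute=lambda buf: print("computing on tile", buf["tile"]),
)
```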
They make a great point about contemporary work ignoring system integration concerns. I will take their word for it, if only because it was easy enough for us to ignore system integration concerns ourselves. They model system behavior as the behavior seen over the system bus:
SCALE-Sim allows for modeling the main memory behavior by generating accurate read and write bandwidths of the interface, which can then be fed into a DRAM simulator, e.g. DRAMSim2.
I’m still a bit confused; does SCALE-Sim generate an estimate for the bandwidths needed?
SCALE-Sim internally takes an inside out implementation approach. Specifically, the simulator assumes that the compute units are always used to the maximum possible utilization—as dictated by dataflow mapping, and never stall waiting for data.
This is interesting. Scott does something similar to this as well.
SCALE-Sim works the following way:
- SRAM read traces are generated, assuming that there are no stalls for reading data.
- Because systolic array operation is predictable, output traces (i.e. SRAM write traces) can be generated once we figure out the input/read traces.
- Given the read and write traces and the SRAM configuration (e.g. the size), the DRAM traces can be determined.
- Finally, the DRAM traces can be used to estimate power and bandwidth requirements.
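My mental model of the third step, as a toy (and definitely not SCALE-Sim’s actual mechanism, which models double-buffered scratchpads rather than an LRU buffer): given a read trace and an SRAM size, count what has to come from DRAM.

```python
from collections import OrderedDict

def dram_reads_from_sram_trace(read_trace, sram_size_words):
    """Toy stand-in: replay an SRAM read trace (word addresses in cycle order)
    through a buffer of sram_size_words entries and count how many reads would
    have to be filled from DRAM. Only meant to show the shape of the analysis:
    trace in, DRAM traffic out."""
    sram = OrderedDict()
    dram_reads = 0
    for addr in read_trace:
        if addr in sram:
            sram.move_to_end(addr)            # reuse out of SRAM
        else:
            dram_reads += 1                   # miss: fetch from DRAM
            sram[addr] = True
            if len(sram) > sram_size_words:
                sram.popitem(last=False)      # evict least-recently-used word
    return dram_reads

# A 3x3 filter (9 words) reused every cycle: only 9 compulsory DRAM reads
trace = [t % 9 for t in range(100)]
print(dram_reads_from_sram_trace(trace, sram_size_words=16))   # 9
```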
They note how they validate against an RTL model; I’m unsure how much value to assign to that, given that they could simply have designed their simulator around the RTL design.
Skipping some sections to get to results.
At a glance, it seems Output Stationary (OS) outperforms the other two dataflows in every aspect. However, it might be worth noting that implementing stall-free OS hardware might not be trivial.
I think they have a sentence somewhere arguing why the fact that their tool assumes no stalls (which isn’t realistic) is okay. But sentences like the above sentence make me doubt it!
We performed an experiment to explore the tradeoffs between the two approaches [scaling up vs scaling out; monolithic vs spatial]. We start from an 8x8 array (64PEs) and increase the number of compute elements to 16384, multiplying by 4 in each step. For scaling-up each step corresponds to doubling the length and breadth of the array, while for scaling-out each step is just quadrupling the number of 8x8 arrays. For the purposes of this case study, we divide the workload for scale-out along the output channels, i.e., the different filters are assigned to different nodes thus different nodes generating different output channels. Alternate partitioning strategies exist, and in fact the best strategy may differ from layer to layer depending on the number of filters vs channels. We do not add any arbitration or bandwidth constraints on the interconnect connecting the arrays in the scale-out mode; instead SCALE-SIM’s outputs for SRAM read bandwidth requirement determines the bandwidth requirement for this interconnect. [emphasis my own]
The first two emphasized phrases tell me that I’m not going to learn much about scheduling from this paper. It seems like they make a single static scheduling decision, which they even admit may not be optimal. Unfortunately, that’s what I need to learn; I need to understand how complex the space of schedules may be.
I interpreted the third emphasized phrase as saying that they don’t factor in any overhead from a theoretical on-chip network connecting the systolic arrays in the scale-out case. I could go either way on this; I don’t know much about on-chip networks for accelerators, and I could equally believe that such a network adds a roughly constant overhead (and so can be ignored) or that it adds a complex, non-negligible overhead that is simply hard to determine, so ignoring it was an engineering decision.
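To make their partitioning concrete to myself, a quick sketch (the function name and the round-robin order are my own; the paper only says different filters are assigned to different nodes) of dividing filters across the 8x8 arrays:

```python
def partition_filters(num_filters, num_nodes):
    """Split output channels (filters) across nodes, per the static strategy
    described in the quote above. Returns one filter-index list per node."""
    assignment = [[] for _ in range(num_nodes)]
    for k in range(num_filters):
        assignment[k % num_nodes].append(k)
    return assignment

print(partition_filters(64, 16))   # 64 filters over 16 arrays: 4 channels each
print(partition_filters(8, 16))    # 8 filters over 16 arrays: half the nodes idle
```

The second call is why they hedge that the best strategy may differ per layer: a layer with fewer filters than nodes leaves most of the scaled-out arrays idle under this partitioning.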
After talking with Joseph, I reflected that the “scale out” evaluation might have been a little more fair than what I first thought. If the design they are assuming is some number of 8x8 systolic arrays which are fed activations and weights simultaneously, compute simultaneously, and then write their results back simultaneously, without actually needing communication between systolic arrays, then perhaps any overheads being ignored would apply equally to all systolic arrays, and thus ignoring them is okay. However, even in this very simple setup this could easily not be the case: if some systolic arrays can reuse data from one timeslot to the next, but others cannot, then the overheads are presumably not the same over all systolic arrays in the timeslot. I don’t know, honestly! It does suggest even more strongly that I need to try and compile a conv2d for multiple systolic arrays by hand.
Reflecting on Reading Goals
My goals were:
- I want to understand when multiple systolic arrays make sense. It must depend on the workload. What kinds of workloads benefit? And what’s the strategy for running them on multiple systolic arrays? Do we parallelize a single layer, or do we run multiple layers at once?
- I’d also like to understand what the design space of dataflows looks like. I’ve been having this thought that that’s where the real design space is, and that it might be an interesting place to explore software–hardware codesign.
Regarding the first goal: this paper does contain data about where multiple systolic arrays work, which I should dig into more deeply. However, I had concerns about just how far into the results we should read. Specifically, the way they simulate multiple systolic arrays is by simply assuming that all systolic arrays are 8x8. Furthermore, they don’t seem to consider on-chip network overheads; I’m unclear on whether such overheads just don’t exist, whether they were safe to omit, or whether they were unsafe to omit. I’m hoping other papers (e.g. MAESTRO) will get into this.
Regarding the second goal, understanding the design space of dataflows: they laid out the types of sharing that exist, fundamentally, when considering the conv2d algorithm. This was useful, as it directly leads to the input/weight/output stationary dataflows: the design points at the three extremes. Then the question is, is there an interstitial space of design points between those three extremes? The existence of the row stationary dataflow suggests the answer is yes. A next step here would be to understand the three dataflows at the extremes, plus row stationary, and see if they can all be represented in some unifying framework which might then allow for the expression of new dataflows.