There has been substantial research interest in graph processing systems. To best guide that research, it is important to understand the workload characteristics of graph processing to understand its corresponding implementation and architectural needs. There is some "conventional wisdom" about the properties of graph algorithms, but it is important to actually profile workloads to confirm or challenge these assumptions.
In this workload characterization study, we use hardware performance counters to identify and quantify bottlenecks of full-size graph processing workloads running on real hardware (dual-socket Intel Ivy Bridge). We use 3 high-performance implementations (GAP, Galois, Ligra) of 5 different graph kernels executing on 5 different input graphs. This site is a summary of the main single-thread results, but we encourage those interested to consult our paper for details.
Main Results
Graph workloads have diverse performance characteristics
There is no single "representative" graph workload. A few graph workloads actually are memory bandwidth-bound (bottom-right of Figure 1), a few are compute-bound (upper-left of Figure 1), and many are in the middle (neither compute-bound nor memory bandwidth-bound). The most important determinant of graph workload characteristics is typically the input graph and surprisingly not the implementation or even the graph kernel.
Memory bandwidth is usually not the bottleneck
Other bottlenecks will typically limit performance before the raw off-chip memory bandwidth does.
Graph workloads have locality
Graph workloads should not simply be thought of as "random pointer-chasing," as in practice, only a small fraction of memory accesses miss all of the on-chip caches and need to access DRAM (Figure 2).
Reorder buffer size limits memory bandwidth utilization
Due to the locality, last-level cache misses are typically rare, causing a considerable number of instructions to be executed for each cache-missing instruction. These instructions that do not generate cache misses fill up the reorder buffer, allowing only a few cache-missing instructions into the buffer at a time. These few cache-missing instructions generate insufficient memory level parallelism to fully utilize the platform's memory bandwidth. A simple back-of-the-envelope model can bound this limit (Figure 2).
References
"Locality Exists in Graph Processing: Workload Characterization on an Ivy Bridge Server", Scott Beamer, Krste Asanović, and David Patterson, International Symposium on Workload Characterization (IISWC), Atlanta, October 2015. PDFIEEE Best Paper Award