Performance metrics

ReFrame reports performance logs at the end of successful tests. These performance metrics are the ones instrumented within the code or benchmark. For microbenchmarks, they tend to be the metrics of interest, such as Gflops, memory bandwidth and latencies, whereas for application benchmarks they tend to be higher-level metrics such as CPU wall time.

To optimise the codes, we need lower-level metrics than just CPU wall time. One way to obtain them is to profile the code, but profiling has very high overheads. A solution that falls between these two extremes is to monitor several CPU, memory and network related metrics while the benchmark tests run. Time series data of such metrics can be used to identify bottlenecks and hotspots relatively quickly and give developers some insight into the performance of the codes.
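The monitoring idea can be illustrated with a minimal sketch. It is not the toolkit's implementation: the function name sample_while_running and the probe callable are hypothetical, introduced only to show how a metric can be sampled at a fixed interval while a benchmark process runs, yielding a time series.

```python
import subprocess
import time
from typing import Callable, List, Sequence, Tuple


def sample_while_running(
    cmd: Sequence[str],
    probe: Callable[[], float],
    interval: float = 1.0,
) -> Tuple[int, List[Tuple[float, float]]]:
    """Run `cmd` and record (elapsed_seconds, probe()) until it exits.

    `probe` is a hypothetical, user-supplied callable that returns one
    metric value per call (for example a CPU or memory reading).
    """
    samples: List[Tuple[float, float]] = []
    start = time.monotonic()
    proc = subprocess.Popen(cmd)
    try:
        while True:
            # Take one sample, then wait up to `interval` for the process.
            samples.append((time.monotonic() - start, probe()))
            try:
                proc.wait(timeout=interval)
                break  # process finished
            except subprocess.TimeoutExpired:
                continue  # still running: take the next sample
    finally:
        # Do not leave the child running if sampling fails midway.
        if proc.poll() is None:
            proc.kill()
            proc.wait()
    return proc.returncode, samples
```

On a Unix system, passing a probe such as lambda: os.getloadavg()[0] would record the one-minute load average at each step. A real monitor would collect many metrics per sample, but the cost per sample stays small compared with full profiling.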

A toolkit has been implemented to extract several CPU metrics such as usage, time, memory consumption and bandwidth. It also monitors low-level metrics from the perf stat command, such as FLOPS and L2/L3 bandwidth. In the future, perf record profiling output will also be added to the toolkit so that a lower-level profile of the code is available. More details on the toolkit and the metrics it reports can be found in the documentation.

Note

We mainly collect these performance metrics for level 1 and level 2 benchmarks (but not exclusively). Typically, level 0 benchmarks use mini kernels, and the performance metrics are already instrumented inside the benchmark code. Level 1 and level 2 benchmarks, on the other hand, are more complex and application oriented, so time series data of several performance measures are desirable for understanding the bottlenecks of the code.

Currently, all the metrics recorded from each benchmark are saved in HDF5 format. The file can be found at perfmetrics/{system}/{partition}/{environment}/{test}/test.h5. Each run creates two tables in the HDF store, named cpu_metrics_<job_id> and perf_metrics_<job_id>, where job_id is the ID of the batch scheduler job. These tables can be imported into a Pandas dataframe for plotting and other post-processing.
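As a rough sketch of that import step, the helpers below build the file path and table names described above and read both tables with pandas.read_hdf. The helper names and the root argument (the directory assumed to contain perfmetrics/) are hypothetical, and reading HDF5 with Pandas requires the PyTables package.

```python
from pathlib import Path


def metrics_file(root, system, partition, environment, test):
    """Path of the HDF5 file for one test.

    `root` is an assumption: the directory that holds perfmetrics/.
    """
    return (Path(root) / "perfmetrics" / system / partition
            / environment / test / "test.h5")


def table_keys(job_id):
    """Names of the two tables written for one batch scheduler job."""
    return {
        "cpu": f"cpu_metrics_{job_id}",
        "perf": f"perf_metrics_{job_id}",
    }


def load_metrics(h5_path, job_id):
    """Read both tables of one job into Pandas dataframes."""
    # Imported here so the path helpers work without pandas installed.
    import pandas as pd

    return {
        name: pd.read_hdf(str(h5_path), key=key)
        for name, key in table_keys(job_id).items()
    }
```

The returned dataframes can then be plotted or post-processed as usual, for instance with DataFrame.plot(). The column names depend on what the toolkit records, so they are best checked in its documentation.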