Using perf_event with the ARM PMU inside gem5

From the gem5 source code and some publications, I know that the ARM PMU is partially implemented in gem5.

I have a binary that uses perf_event to access the PMU on a Linux-based OS running on an ARM processor. Could it use perf_event inside a gem5 full-system simulation running a Linux kernel on the ARM ISA?

So far, I haven’t found the right way to do it. If someone knows, I would be very grateful!


Answer

Context

I was not able to use the Performance Monitoring Unit (PMU) because of an unimplemented feature in gem5. The reference on the mailing list can be found here. After a personal patch, the PMU is accessible through perf_event. Fortunately, a similar patch will soon land in an official gem5 release, as can be seen here. The patch will be described in another answer, because of the limit on the number of links allowed in a single message.

How to use the PMU

C source code

This is a minimal working example of C source code using perf_event to count the number of branches mispredicted by the branch predictor unit during a specific task:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include <string.h>
#include <errno.h>

#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

int main(int argc, char **argv) {
    /* File descriptor used to read mispredicted branches counter. */
    static int perf_fd_branch_miss;
    
    /* Initialize our perf_event_attr, representing one counter to be read.
       Declaring it static guarantees it is zero-initialized, as required
       for the fields we do not set explicitly. */
    static struct perf_event_attr attr_branch_miss;
    attr_branch_miss.size = sizeof(attr_branch_miss);
    attr_branch_miss.exclude_kernel = 1;
    attr_branch_miss.exclude_hv = 1;
    attr_branch_miss.exclude_callchain_kernel = 1;
    /* On a real system, you can do like this: */
    attr_branch_miss.type = PERF_TYPE_HARDWARE;
    attr_branch_miss.config = PERF_COUNT_HW_BRANCH_MISSES;
    /* On a gem5 system, you have to do like this: */
    attr_branch_miss.type = PERF_TYPE_RAW;
    attr_branch_miss.config = 0x10;
    
    /* Open the file descriptor corresponding to this counter. The counter
       should start at this moment. */
    if ((perf_fd_branch_miss = syscall(__NR_perf_event_open, &attr_branch_miss, 0, -1, -1, 0)) == -1)
        fprintf(stderr, "perf_event_open fail %d %d: %s\n", perf_fd_branch_miss, errno, strerror(errno));
    
    /* Workload: the specific task we want to profile. */
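    /* Hypothetical stand-in for the real workload: a simple loop with a
       data-dependent branch. Replace it with your own task to profile. */
    volatile uint64_t dummy = 0;
    for (uint32_t i = 0; i < 100000; i++)
        if ((i * 2654435761u) & 0x40)
            dummy += i;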

    /* Get and close the performance counters. */
    uint64_t counter_branch_miss = 0;
    read(perf_fd_branch_miss, &counter_branch_miss, sizeof(counter_branch_miss));
    close(perf_fd_branch_miss);

    /* Display the result. */
    printf("Number of mispredicted branches: %" PRIu64 "\n", counter_branch_miss);

    return 0;
}

I will not go into the details of how to use perf_event; good resources are available here, here, here, and here. However, here are a few notes about the code above:

  • On real hardware, when using perf_event with common events (events that are available on many architectures), it is recommended to use the PERF_TYPE_HARDWARE type together with macros like PERF_COUNT_HW_BRANCH_MISSES for the number of mispredicted branches, PERF_COUNT_HW_CACHE_MISSES for the number of cache misses, and so on (see the manual page for a list). This is a best practice that keeps the code portable.
  • On a gem5 simulated system, currently (v20.0), the C source code has to use the PERF_TYPE_RAW type and an architectural event ID to identify an event. Here, 0x10 is the ID of event 0x0010, BR_MIS_PRED, "Mispredicted or not predicted branch", described in the ARMv8-A Reference Manual (here). The manual describes all events available on real hardware; however, not all of them are implemented in gem5. To see the list of events implemented in gem5, refer to the src/arch/arm/ArmPMU.py file. In that file, the line self.addEvent(ProbeEvent(self,0x10, bpred, "Misses")) corresponds to the declaration of the counter described in the manual. This is not the normal behavior, so gem5 should eventually be patched to allow using PERF_TYPE_HARDWARE; a small compile-time workaround is sketched below.
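To keep a single source file usable both on real hardware and under gem5, the event selection can be turned into a compile-time switch. The fragment below is only a sketch that would replace the two event-selection assignments in the MWE above; the GEM5 macro is a hypothetical flag you pass yourself (e.g. -DGEM5 on the compiler command line), not something defined by gem5 or perf_event.

/* Sketch: choose the event encoding at compile time. GEM5 is a hypothetical,
   user-defined macro (e.g. cc -DGEM5 ...), not provided by gem5. */
#ifdef GEM5
    /* gem5 (v20.0): raw architectural event 0x10 = BR_MIS_PRED. */
    attr_branch_miss.type = PERF_TYPE_RAW;
    attr_branch_miss.config = 0x10;
#else
    /* Real hardware: portable generic event. */
    attr_branch_miss.type = PERF_TYPE_HARDWARE;
    attr_branch_miss.config = PERF_COUNT_HW_BRANCH_MISSES;
#endif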

gem5 simulation script

This is not an entire MWE script (it would be too long!), only the portion that needs to be added to a full-system script to use the PMU. We use an ArmSystem as the system, with the RealView platform.

For each ISA (we use an ARM ISA here) of each CPU (e.g., a DerivO3CPU) in our cluster (which is a SubSystem class), we add a PMU with a unique interrupt number and the already-implemented architectural events. An example of this function can be found in configs/example/arm/devices.py.

To choose an interrupt number, pick a free PPI interrupt in the platform interrupt mapping. Here, we choose PPI n°20, according to the RealView interrupt map (src/dev/arm/RealView.py). Since PPI interrupts are private to each Processing Element (PE, which corresponds to a core in our context), the same interrupt number can be used for all PEs without any conflict. To learn more about PPI interrupts, see the GIC guide from ARM here.

Here, we can see that interrupt n°20 is not used by the system (excerpt from RealView.py):

Interrupts:
      0- 15: Software generated interrupts (SGIs)
     16- 31: On-chip private peripherals (PPIs)
        25   : vgic
        26   : generic_timer (hyp)
        27   : generic_timer (virt)
        28   : Reserved (Legacy FIQ)

We pass our system components (dtb, itb, etc.) to addArchEvents to link the PMU with them, so that the PMU exposes the internal counters (called probes) of these components to the system.

for cpu in system.cpu_cluster.cpus:
    for isa in cpu.isa:
        isa.pmu = ArmPMU(interrupt=ArmPPI(num=20))
        # Add the architectural events implemented in gem5. We can
        # discover which events are implemented by looking at the file
        # "ArmPMU.py".
        isa.pmu.addArchEvents(
            cpu=cpu, dtb=cpu.dtb, itb=cpu.itb,
            icache=getattr(cpu, "icache", None),
            dcache=getattr(cpu, "dcache", None),
            l2cache=getattr(system.cpu_cluster, "l2", None))