Skip to content
Advertisement

Why malloc is faster than static memory allocation in my test program?

I have a test program. I get this result when executing it in ubuntu trusty 64 bit.

malloc time:9571

static time:45587

Why malloc is faster than static memory allocation,or my test program is wrong?

The test program is like this.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define TIME 10000

int data[1024] = { 1,2,3,4,5,6,6,7,8,5,4,3,2,3 };
int st[TIME][1024];
int main(void) {
    int i = 0;
    int time = 0;
    struct timeval tv1,tv2;

    /* test for malloc */
    memset(&tv1,0,sizeof(tv1));
    memset(&tv2,0,sizeof(tv2));
    gettimeofday(&tv1,NULL);
    for(i=0;i<TIME;i++) {
        void * p = malloc(4096);
        memset(p,0,4096);
        memcpy(p,data,sizeof(data));
        free(p);
        p = NULL;
    }
    gettimeofday(&tv2,NULL);
    time = ((tv2.tv_sec - tv1.tv_sec) * 1000000 +
        (tv2.tv_usec - tv1.tv_usec));
    printf("malloc time:%dn",time);

    /* test for static memory allocation */
    memset(&tv1,0,sizeof(tv1));
    memset(&tv2,0,sizeof(tv2));
    gettimeofday(&tv1,NULL);
    for(i=0;i<TIME;i++) {
        memset(st[i],0,4096);
        memcpy(st[i],data,sizeof(data));
    }
    gettimeofday(&tv2,NULL);
    time = ((tv2.tv_sec - tv1.tv_sec) * 1000000 +
        (tv2.tv_usec - tv1.tv_usec));
    printf("static time:%dn",time);

    return 0;
}

Advertisement

Answer

That benchmark is essentially meaningless because most of what it is measuring has little relationship with the use of the two memory regions.

When your program starts up (that is, when execution of main commences), the default-initialized data segment (that is, the 40 megabytes of st[40000][1024]) has not yet been mapped into physical memory. The virtual memory addresses have been flagged for lazy mapping into zero-initialized memory, but this will not happen until the program actually attempts to reference those addresses. Each such reference requires kernel intervention to adjust the virtual memory map (as well as zero-initializing the page of physical memory). After that intervention, that page of virtual memory is mapped, but not any other page in the same data segment. So as you pass over the st array, you will generate a large number of page faults, each of which takes a significant amount of time. The vast majority of the time you’re measuring in the “static memory” test consists of these kernel traps.

On the other hand, the first call to malloc will cause the standard library to initialize the memory allocation system. While that initialization is not too complicated, it is not free either. So most of the time you measure in the “malloc’ed memory” test consists of that initialization.

In order to do a meaningful benchmark, you need to make sure that all lazy initialization has been completed before you start measuring. One way of doing that is to perform the benchmark several times in the same executable, and discard the first (or first few) repetitions.

As an illustration, I modified your benchmark trivially by adding the following loop around the entire contents of main (except the return statement):

for (int reps = 0; reps < 4; ++reps) {
  printf("Repetition %dn", reps);
  /* Body of main goes here */
}

This resulted in the following output:

Repetition 0:
malloc time:9584
static time:26923
Repetition 1:
malloc time:2467
static time:4360
Repetition 2:
malloc time:2463
static time:4332
Repetition 3:
malloc time:2413
static time:4609

Note the difference between the times measured in the “warm-up” iteration (repetition 0) and the remaining ones.

That still leaves a difference between the dynamic and static memory tests. Here, it’s worth noting that the two tests use memory in different ways. The malloc test (probably) reuses the same buffer on every iteration, since after you free a block of memory, the next malloc of that size will probably immediately return it. The static memory test, on the other hand, cycles through the entire 40MB allocation. A better comparison would be to either malloc a buffer of the same size as st and cycle through it using the same code as the static test, or make st a single vector of 1024 ints (like the malloc) and reuse it using the same code as the malloc test. In other words, to compare two possible approaches, minimize the differences between the two. (I’ll leave that as an exercise.)

If you make those suggested changes, you will likely find that the difference reduces to noise. But it is possible that there will remain some consistent (but small) difference, which will reflect the variety of hard-to-control for differences between the two loops, including alignment of code and data, and the precise details of the CPU cache. These minor differences would fit into the category of “coincidence”, as ably articulated in @seb’s answer to this question. (While I think that it is important to understand avoidable pitfalls in benchmarking, I would emphasize that @seb’s advice in that question is unquestionably correct.)

User contributions licensed under: CC BY-SA
7 People found this is helpful
Advertisement