Why does adding a single line of code via a preprocessor macro cause such a large speed change?
#include <stdio.h>
#include <time.h>
#include <set>
#include <vector>
#include <list>
#include <iostream>
#include <algorithm>
#include <functional>

#ifndef LINUX
#include <windows.h>
#else
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <sys/time.h>
#endif

using namespace std;

const int SIX_MILLION = 6000000;
const int ONE_MILLION = 1000000;

unsigned long long randint()
{
    return (unsigned long long)(rand() & 0xFFFF) << 16 | (unsigned long long)(rand() & 0xFFFF);
}

#ifdef LINUX
int time_substract(struct timeval *result, struct timeval *begin, struct timeval *end)
{
    if (begin->tv_sec > end->tv_sec)
        return -1;
    if ((begin->tv_sec == end->tv_sec) && (begin->tv_usec > end->tv_usec))
        return -2;
    result->tv_sec = (end->tv_sec - begin->tv_sec);
    result->tv_usec = (end->tv_usec - begin->tv_usec);
    if (result->tv_usec < 0)
    {
        result->tv_sec--;
        result->tv_usec += ONE_MILLION;
    }
    return 0;
}
#endif

double time_it(function<void()> func, int times = 3)
{
#ifndef LINUX
    LARGE_INTEGER lpStart[1], lpEnd[1], lpFreq[1];
    vector<double> result(times);
    ::QueryPerformanceFrequency(lpFreq);
    for (int i = 0; i < times; ++i)
    {
        ::QueryPerformanceCounter(lpStart);
        func();
        ::QueryPerformanceCounter(lpEnd);
        result[i] = (lpEnd[0].QuadPart - lpStart[0].QuadPart) * ONE_MILLION / double(lpFreq[0].QuadPart);
    }
    nth_element(result.begin(), result.begin() + (times / 2), result.end());
    return result[times / 2];
#else
    struct timeval start, stop, diff;
    vector<double> result(times);
    memset(&start, 0, sizeof(struct timeval));
    memset(&stop, 0, sizeof(struct timeval));
    memset(&diff, 0, sizeof(struct timeval));
    for (int i = 0; i < times; ++i)
    {
        gettimeofday(&start, 0);
        func();
        gettimeofday(&stop, 0);
        time_substract(&diff, &start, &stop);
        result[i] = (diff.tv_sec * 1000000 + double(diff.tv_usec));
    }
    nth_element(result.begin(), result.begin() + (times / 2), result.end());
    return result[times / 2];
#endif
}

size_t prepare_data(set<unsigned long long> &data, unsigned int size, bool strict = false)
{
    data.clear();
    for (unsigned int i = 0; i < size; i++)
    {
        data.insert(strict ? i * 3 : randint());
    }
    return data.size();
}

int main()
{
    srand((unsigned int)time(NULL));
    set<unsigned long long> a;
    set<unsigned long long> b;
    vector<unsigned long long> result(SIX_MILLION);
    double res;

#ifdef TEST
    prepare_data(a, SIX_MILLION);
#endif
    prepare_data(a, SIX_MILLION / 2, true);
    prepare_data(b, SIX_MILLION / 2);
    res = time_it([&a, &b, &result]() {
        auto iter = set_intersection(a.begin(), a.end(), b.begin(), b.end(), result.begin());
        result.resize(iter - result.begin());
    });
    cout << "duration: " << res << " microseconds,set a size: " << a.size()
         << " set b size: " << b.size()
         << " set result size: " << result.size() << endl;
    return 0;
}
ubuntu@host:~/test_intersection$ g++ -std=c++11 -O3 -DLINUX main1.cpp -o main1
ubuntu@host:~/test_intersection$ ./main1
duration: 62080 microseconds,set a size: 2998917 set b size: 3000000 set result size: 2087
ubuntu@host:~/test_intersection$ g++ -std=c++11 -O3 -DLINUX -DTEST main1.cpp -o main1
ubuntu@host:~/test_intersection$ ./main1
duration: 362546 microseconds,set a size: 2998985 set b size: 3000000 set result size: 2149
Answer
I get the same result as you on my Ubuntu droplet hosted at DigitalOcean. It has fairly limited RAM. Prior to running the test, it had about 220MB free (output of /usr/bin/free -tm):
# free -tm
              total        used        free      shared  buff/cache   available
Mem:            493         153         220          14         119         298
Swap:             0           0           0
Total:          493         153         220
When I run the slow test, I can watch the available memory get completely soaked up.
# free -tm
              total        used        free      shared  buff/cache   available
Mem:            493         383          10          14          99          69
Swap:             0           0           0
Total:          493         383          10
Just in case the clear() method kept all that memory reserved internally, I tried instead:
data = std::move( std::set<unsigned long long>() );
But this made little difference.
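For reference, an equivalent idiom (my sketch, not something the original answer tried) is swapping the set with an empty temporary, which likewise guarantees the old tree is destroyed and its nodes handed back to the allocator:

// Swap-with-a-temporary idiom: the temporary steals the old tree and
// destroys it (returning every node to the allocator) at the end of
// the full expression.
std::set<unsigned long long>().swap(data);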
So one of my original suspicions is that you have fragmented your memory by exhausting it with a data structure like std::set, which performs lots of allocations to build a tree and then frees them in an unspecified order (due to the arrangement of nodes in the tree).
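To make that allocation pattern concrete, here is a minimal sketch of my own (not part of the original answer): plugging a C++11 minimal allocator into std::set shows roughly one heap allocation per inserted element, which is exactly the kind of churn that can shred a small heap.

#include <cstddef>
#include <iostream>
#include <set>

static std::size_t g_allocs = 0; // illustration only: counts heap requests

template <typename T>
struct CountingAlloc {
    using value_type = T;
    CountingAlloc() = default;
    template <typename U> CountingAlloc(const CountingAlloc<U>&) {}
    T* allocate(std::size_t n) {
        ++g_allocs;                                   // one call per tree node
        return static_cast<T*>(::operator new(n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t) { ::operator delete(p); }
};
template <typename T, typename U>
bool operator==(const CountingAlloc<T>&, const CountingAlloc<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const CountingAlloc<T>&, const CountingAlloc<U>&) { return false; }

int main() {
    std::set<unsigned long long, std::less<unsigned long long>,
             CountingAlloc<unsigned long long>> s;
    for (unsigned long long i = 0; i < 100000; ++i)
        s.insert(i * 3);
    // Expect allocations to be roughly equal to the element count: every
    // node is a separate heap request, so building and tearing down
    // millions of them stresses the allocator.
    std::cout << "elements: " << s.size()
              << ", allocations: " << g_allocs << std::endl;
    return 0;
}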
To simulate this, I replaced the TEST section with code that performed a lot of allocations and then released them in a different order (by stepping over the list using a prime-number stride).
#ifdef TEST
    //prepare_data(a, SIX_MILLION);
    {
        std::vector<void*> mem(SIX_MILLION);
        for (auto &val : mem)
            val = malloc(24);
        for (int i = 0, p = 0, step = 499739; i < SIX_MILLION; i++)
        {
            p = (p + step) % SIX_MILLION;
            free(mem[p]);
        }
    }
#endif
Allocations of 24 bytes were sufficient to stress the memory allocator on my system, leading to similar results to those you have described. I found that if I free the values in a more predictable order (i.e. walking through from first to last), this did not have the same effect on performance.
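A sketch of that "predictable order" variant (my reconstruction; the answer describes it but does not show the code): the strided free loop is simply replaced with a first-to-last walk.

#ifdef TEST
    {
        std::vector<void*> mem(SIX_MILLION);
        for (auto &val : mem)
            val = malloc(24);
        // Release the blocks in the same first-to-last order they were
        // allocated; this ordering did not reproduce the slowdown.
        for (int i = 0; i < SIX_MILLION; i++)
            free(mem[i]);
    }
#endif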
So I would say the final explanation for this is that you are a victim of memory fragmentation. You filled up your memory with lots of small allocations, and then freed them in a random order. You then built new data sets whose nodes ended up scattered across the fragmented heap, so they suffered from poor cache locality. This had a measurably severe impact on performance when it came to computing an expensive intersection of these two data sets.