Performance in C/C++

Inline and macros

Inlining short, frequently called functions is very important for performance.


In C you can use macros:

#define sqr(x) ((x)*(x))

But be careful: macros are dangerous.

For example, the parentheses around x are not optional at all. A macro is blindly substituted as text before compilation, so with a naive definition like #define sqr(x) x*x, the call sqr(a-b) expands to a-b*a-b, which operator precedence parses as a-(b*a)-b, not (a-b)*(a-b). (The outer parentheses matter too: without them, 1/sqr(a) would expand to 1/(a)*(a).)

Moreover, if the argument of the macro is a function call, that call will be executed twice!

Inline functions

With inline functions you have all the advantages (they are as fast as macros, see the Instruction timings section) and none of these traps. That's why in C++ you MUST use inline functions instead of macros.

You can even use templates to get a generic function:

template <typename T>
inline T sqr(T x) { return x*x; }

And use it like:

int x = 5;
int y = sqr(x);
int y2 = sqr<int>(x);
Remark 1

If you want to use an inline function from several source files, you must define it in the header file.

Remark 2

Methods that are defined inside the body of a class are automatically inlined, e.g.:

class Point {
  int x, y;
public:
  int getX() { return x; } // this function is automatically inlined!
  int getY() { return y; } // this function is automatically inlined!
};

Safer macros with GCC extensions

GCC (and Clang) provide statement expressions and typeof, which let you create macros without the dangers of classical macros. Note that these are GNU extensions rather than part of the C99 standard itself; with gcc they are enabled by default, or together with C99 features via the -std=gnu99 flag.

Then you can define the sqr macro as follows:

#define sqr(x) ({ typeof(x) xx = (x); xx*xx; })

Instruction timings

You can see complete results here.

Compiler options

Using the right optimization options of your compiler for your final release is extremely important. As long as you compile with debug options, no optimization is done: inline functions are not inlined, and overhead is added.

For VisualStudio, just compile in Release mode instead of Debug mode.

For gcc/g++, don't use the -g option, and use the -O1, -O2 or -O3 optimization flag (in increasing order of optimization).


Find the bottleneck

The 80-20 rule also holds in programming: 80% of the time is spent in 20% of your code. Therefore you have to find which code those 20% correspond to; it is useless to try to optimize anything other than those 20%.



Valgrind is a Linux memory debugger and profiler, and is essential to everyone coding in C++ (what C++ programmer never spent a night trying to debug a segfault?). It can detect all memory errors, even when there is no explicit segfault (so it is more deterministic), and it gives the exact line where the problem is:

valgrind <your-program> <args>

It can also show you where memory leaks are (memory which is never freed), and you can ask for more details with:

valgrind --leak-check=full <your-program> <args>

Valgrind also comes with a tool which is very useful to optimize your code: cachegrind. cachegrind tells you where cache misses occur in your CPU cache (i.e. where your CPU needs to fetch data from RAM because it is not in its cache). This is extremely important: for example, if you multiply big matrices, you can divide the time needed by over 20 just by changing the order of operations to reduce cache misses.

valgrind --tool=cachegrind <your-program>

How to optimize the use of cache memory (pdf) [EN]

There are other tools and a lot of options; see man valgrind.


gprof is a profiling tool that tells you how many times each function is called and how much time is spent in each function.

To use it :

  • compile your program with the -pg flag for both compiling and linking (and of course the classic -g debug flag)
  • run your program (it will generate a log file gmon.out)
  • run gprof <your-program> gmon.out. You can try the --graph option, which includes the time spent in children calls for each function.

Other profilers

sysprof and oprofile are other profilers; they use a kernel module. Sysprof even has a GUI, and shows the time spent in all running functions and programs (system-wide). If you want details about the scheduling of processes (which process was running when, on which CPU), you can use trace-cmd and its front-end KernelShark.


You might think that you can find which lines are taking time by removing them and seeing how much faster the program runs, but this doesn't work when optimizations are enabled: once the few lines you removed are gone, the compiler may be able to eliminate a lot more code as useless.

Memory usage

Dynamic vs Static

Keep in mind that dynamic allocations with new are very expensive operations. Moreover, the heap where dynamic objects are stored is subject to fragmentation.

Static allocation on the stack is faster, but you need to know the maximum size at compile time. You can also allocate statically in global variables, which is even faster, but the memory is then permanently used.

A nice way to deal with the trade-off between dynamic allocations and static allocations is to allocate memory by blocks.


Looking at the following two structures, it would appear that both contain the same data and take up the same space:

 struct foo1 {
  char a;
  int  b;
  char c;
  int  d;
  char e;
  char f;
 };

 struct foo2 {
  int  b;
  int  d;
  char a;
  char c;
  char e;
  char f;
 };
Due to the way data is aligned by the compiler, however, the two structures above are not the same size. On a system with 32-bit ints, foo1 will be 20 bytes, while foo2 will be a mere 12 bytes.

In general, you want to arrange the variables such that, within a given 32-bit (or 64-bit) word, as many variables as possible are packed together.

Natural data structures

On a 32-bit processor it can be slower to use 8-bit or 16-bit variables than 32-bit variables, because:

  • there may be data alignment problems
  • the sign bit is usually the most significant bit, and the CPU has extra work to do to handle it

Therefore all simple counters should use the natural word size of the CPU.


Since the Pentium III with SSE instructions, you can do 4 float add/sub/mul/div/sqrt operations in one instruction, and since the Pentium 4 with SSE2 instructions, you can do 4 integer add/sub operations in one instruction. As all fairly recent CPUs provide these instruction sets, it can be tempting to use them to speed up your program.

However there are two problems:

  • these operations are quite fast anyway (especially because each is a single instruction), so if you need too much data reorganization to lay the operands out as a contiguous vector, the overhead quickly cancels what you gain by parallelizing the operations.
  • data must be 16-byte aligned, which can prevent you from using raw data directly (e.g. if you want to compute a Haar feature whose size is not a multiple of 4 on an integral image), which brings back the previous problem…

But still, when you can use them, it can be worth the pain, especially with divisions or sqrt, which are especially cycle-consuming, e.g. to compute 4 parabolic interpolations simultaneously:

#include <xmmintrin.h> // SSE intrinsics

struct SSE_f
{
	typedef float v4sf __attribute__((vector_size(16)));
	union { v4sf v; float f[4]; };
};

inline void parabolicInterpolation4(
	const SSE_f &x0, const SSE_f &y0, const SSE_f &x1, const SSE_f &y1, const SSE_f &x2, const SSE_f &y2,
	SSE_f &extremum_x, SSE_f &extremum_y, SSE_f &a, SSE_f &b, SSE_f &c)
{
	SSE_f x01; x01.v = _mm_sub_ps(x0.v, x1.v);
	SSE_f x02; x02.v = _mm_sub_ps(x0.v, x2.v);
	SSE_f x12; x12.v = _mm_sub_ps(x1.v, x2.v);
	SSE_f t0; t0.v = _mm_div_ps(y0.v, _mm_mul_ps(x01.v, x02.v));
	SSE_f t1; t1.v = _mm_div_ps(y1.v, _mm_mul_ps(x01.v, x12.v));
	SSE_f t2; t2.v = _mm_div_ps(y2.v, _mm_mul_ps(x02.v, x12.v));
	a.v = _mm_add_ps(_mm_sub_ps(t0.v, t1.v), t2.v);
	b.v = _mm_sub_ps(_mm_mul_ps(_mm_add_ps(x0.v, x2.v), t1.v),
	      _mm_add_ps(_mm_mul_ps(_mm_add_ps(x1.v, x2.v), t0.v),
	                 _mm_mul_ps(_mm_add_ps(x0.v, x1.v), t2.v)));
	c.v = _mm_add_ps(_mm_sub_ps(
		_mm_mul_ps(_mm_mul_ps(x1.v, x2.v), t0.v),
		_mm_mul_ps(_mm_mul_ps(x0.v, x2.v), t1.v)),
		_mm_mul_ps(_mm_mul_ps(x0.v, x1.v), t2.v));
	extremum_x.v = _mm_div_ps(b.v, _mm_mul_ps(a.v, _mm_set1_ps(-2.0f)));
	extremum_y.v = _mm_sub_ps(c.v, _mm_div_ps(_mm_mul_ps(b.v, b.v), _mm_mul_ps(a.v, _mm_set1_ps(4.0f))));
}

It takes roughly 60% less time than computing them sequentially: on a whole 1.2 Mpix image, it takes 20 ms instead of 50 ms.


More simple names are available in GCC headers:

  • SSE for float (<xmmintrin.h>, operand type __m128): _mm_add_ps, _mm_sub_ps, _mm_mul_ps, _mm_div_ps, _mm_sqrt_ps, _mm_rsqrt_ps, _mm_rcp_ps, _mm_load_ps, _mm_store_ps, _mm_set1_ps, _mm_setr_ps
  • SSE2 for int (<emmintrin.h>, operand type __m128i): _mm_add_epi32, _mm_sub_epi32, _mm_set1_epi32, _mm_setr_epi32
  • SSE2 int/float conversions: __builtin_ia32_cvtdq2ps, __builtin_ia32_cvtps2dq

And you have to compile with the GCC flags -msse and -msse2, or a -march value that supports them.

Measuring performance


Under Linux

Measuring total time

There is a timer with microsecond precision:

#include <sys/time.h>

struct timeval tv;
gettimeofday(&tv, NULL);
unsigned long microseconds = tv.tv_sec*1000000UL + tv.tv_usec; // beware overflows

Measuring cpu time

The function getrusage measures the CPU time allocated to your process. It is therefore less sensitive to perturbations from other running processes:

#include <sys/resource.h>

struct rusage ru;
getrusage(RUSAGE_SELF, &ru);
unsigned long kernel_microseconds = ru.ru_stime.tv_sec*1000000UL + ru.ru_stime.tv_usec;
unsigned long   user_microseconds = ru.ru_utime.tv_sec*1000000UL + ru.ru_utime.tv_usec;
unsigned long  total_microseconds = kernel_microseconds + user_microseconds;

When you time your code, you should use user_microseconds, because kernel time is not properly part of your algorithm.


Under Windows

Measuring total time

You can use the QueryPerformanceCounter function which is very precise too.

A simple wrapper is available here

Measuring cpu time

FIXME Does getrusage work here?

programming/c-cpp-performance.txt · Last modified: 2013/09/19 16:40 (external edit)
CC Attribution-Share Alike 4.0 International