====== Performance in C/C++ ======
===== Inline and macros =====
Inlining short functions which are often called is very important for performance.
==== Macros ====
In C you can use macros :
#define sqr(x) (x)*(x)
But be careful, **macros are very dangerous**.
For example parenthesis around ''x'' are not optional at all. Indeed macros are stupidly replaced before compilation, and ''x*x'' called with ''a-b'' will be replaced by ''a-b*a-b'' which is not what you want because of operators priorities !
Moreover if the argument of the macro is a function call, this call will be done two times !
==== Inline functions ====
With inline functions you have all the advantages (it is as fast as macros, see [[instruction-timings]]) and none of these traps. **That's why in C++ you MUST use inline functions instead of macros**.
You can even use templates to have a generic function :
template
inline T sqr(T x) { return x*x; }
And use it like :
int x = 5;
int y = sqr(x);
int y2 = sqr(x);
== Remark 1 ==
If you want to export your inline functions in several files, you must define it in the header file.
== Remark 2 ==
Methods which are defined inside the body of a class are automatically inlined, eg :
class Point
{
int x,y;
public:
int getX() { return x; } // this function is automatically inlined !
int getY() { return y; } // this function is automatically inlined !
}
==== Secure macros with C99 ====
The C99 standard allows a lot of new features, among which the possibility to create macros without the dangers of classical macros.
To enable C99 standard, add the ''-std=c99'' or ''-std=gnu99'' flag to the compiler command line with gcc.
Then you can define the ''sqr'' macro as follow :
#define sqr(x) ({ typeof(x) xx=(x); xx*xx; })
===== Instruction timings =====
You can see complete results [[.instruction-timings|here]].
===== Compiler options =====
Using the good optimization options of your compiler for your final release is extremely important. Indeed as long as you compile with debug options no optimization is done, inline functions are not inlined, and overhead is added.
For VisualStudio, just compile in Release mode instead of Debug mode.
For gcc/g++ don't use ''-g'' option, and use ''-O1'', ''-O2'' or ''-O3'' optimization flag (increasing optimization).
References:
* [[http://www.redhat.com/magazine/011sep05/features/gcc/]]
===== Find the bottleneck =====
The 80-20 rule is also true when programming : 80% of the time is spend in 20% of your code. Therefore you have to find to which code correspond these 20%, it is useless to try to optimize something else that these 20%.
References:
* [[http://www.redhat.com/magazine/012oct05/features/gcc/]]
==== Valgrind ====
Valgrind is a Linux memory profiler, and is essential to everyone coding in C++ (which C++ programmer never spent a night trying to debug a segfault ?). It can detect all memory errors, even if there is no explicit segfault (therefore it is more deterministic), and it gives the exact line where is the problem :
valgrind
It can also show you where are memory leaks (memory which is not freed), and you can ask more details with :
valgrind --leak-check=full
There is also a tool with valgrind which is very useful to optimize your code, cachegrind. ''cachegrind'' tells you where are //cache misses// in your CPU cache memory (ie where your CPU needs to grab data in RAM because it is not in its cache). This is extremely important, for example if you do a matrix product with big matrices, you can divide the time needed by over 20 just by changing the order of operations to reduce cache misses.
valgrind --tool=cachegrind
[[http://citeseer.ist.psu.edu/cache/papers/cs/28135/http:zSzzSzwww10.informatik.uni-erlangen.dezSzResearchzSzProjectszSzDiMEzSzpubszSzgi_03.pdf/kowarschik03overview.pdf|How to optimize the use of cache memory (pdf) [EN] ]]
There are other tools and a lot of options, look ''man valgrind''.
==== gprof ====
''gprof'' is a profiling tool that tells you how many times is called a function or how much time is spend in a function.
To use it :
* compile your program with the '-pg' flag for both compiling and linking (and of course the '-g' classic debug flag)
* run your program (it will generate a log file ''gmon.out'')
* run ''gprof gmon.out''. You can try the ''--graph'' option which will include the time spent in children calls for each function.
==== Other profilers ====
sysprof and oprofile are other profilers which use a kernel module. Sysprof has even a GUI, and shows the time spent in all running functions and programs (system wide). If you want details about scheduling of the processes (which process was running when on which cpu), you can use trace-cmd and its front-end KernelShark.
==== Traps ====
You could think that you can find which lines are taking time by removing them and see how much less time it takes, but it won't work if optimizations are enabled, because the compiler can remove a lot more of code if it is not useful because of the few lines that you removed.
===== Memory usage =====
==== Dynamic vs Static ====
You have to keep in mind that dynamic allocations with ''new'' are **very** expensive operations. Moreover the heap where are stored dynamic objects is subject to fragmentation.
Static allocation on the stack are faster, but you need to know the maximum size at compilation time. You can also do static allocations on the heap in global variables, which is even faster, but memory is permanently used.
A nice way to deal with the trade-off between dynamic allocations and static allocations is to allocate memory by blocks.
==== Alignment ====
In looking at the following two structures, it would appear that both
contain the same data, and take up the same space:
struct foo1 {
char a;
int b;
char c;
int d;
char e;
char f;
};
struct foo2 {
int b;
int d;
char a;
char c;
char e;
char f;
};
Due to the way that data gets aligned by the compiler, however, the
above two structures are **not** the same. On a 32-bit int system,
foo1 will be 20 bytes, while foo2 will be a mere 12 bytes.
In general, you want to keep the variables arranged such that for a given
set of 32 bits (or 64 bits), that as many variables as possible are
packed into that space.
==== Natural data structures ====
On a 32-bits processor it can be slower to use 8-bits or 16-bits variables than 32-bits variables, because :
* maybe there are data alignment problems
* the sign bit is usually the most significant bit, and the CPU have extra work to do to detect it
Therefore all simple counters should be done in the natural word size of the CPU.
====== Vectorialization ======
Since Pentium III with SSE instructions, you can do 4 float add/sub/mul/div/sqrt operations in one instruction, and since Pentium 4 with SSE2 instructions, you can do 4 integer add/sub operations in one operation. As all fairly recent CPUs provide these instruction sets, it can be tempting to use them to speed up your program.
However there are two problems:
* these operations are anyway quite fast (especially because it is a single operation), so if you need to do too much data reorganization to have them as a contiguous vector, it will quickly create too much overhead that cancels what you win by parallelizing the operations.
* data **must** be 16 bytes aligned, so it can prevent you to directly use raw data (eg if you want to compute a haar feature with size not multiple of 4 with an integral image), which raises the previous problem...
But still when you can use it, it can worth the pain, especially with divisions or sqrt that are especially cycles consuming, eg to compute 4 parabolic interpolations simultaneously:
struct SSE_f
{
typedef float v4sf __attribute__((vector_size(16)));
union { v4sf v; float f[4]; };
};
inline void parabolicInterpolation4(
const SSE_f &x0, const SSE_f &y0, const SSE_f &x1, const SSE_f &y1, const SSE_f &x2, const SSE_f &y2,
SSE_f &extremum_x, SSE_f &extremum_y, SSE_f &a, SSE_f &b, SSE_f &c)
{
SSE_f x01; x01.v = _mm_sub_ps(x0.v, x1.v);
SSE_f x02; x02.v = _mm_sub_ps(x0.v, x2.v);
SSE_f x12; x12.v = _mm_sub_ps(x1.v, x2.v);
SSE_f t0; t0.v = _mm_div_ps(y0.v, _mm_mul_ps(x01.v, x02.v));
SSE_f t1; t1.v = _mm_div_ps(y1.v, _mm_mul_ps(x01.v, x12.v));
SSE_f t2; t2.v = _mm_div_ps(y2.v, _mm_mul_ps(x02.v, x12.v));
a.v = _mm_add_ps(_mm_sub_ps(t0, t1.v), t2.v);
b.v = _mm_sub_ps(_mm_mul_ps(_mm_add_ps(x0.v, x2.v), t1.v),
_mm_add_ps(_mm_mul_ps(_mm_add_ps(x1.v, x2.v), t0.v),
_mm_mul_ps(_mm_add_ps(x0.v, x1.v), t2.v)));
c.v = _mm_add_ps(_mm_sub_ps(
_mm_mul_ps(_mm_mul_ps(x1.v, x2.v), t0.v),
_mm_mul_ps(_mm_mul_ps(x0.v, x2.v), t1.v)),
_mm_mul_ps(_mm_mul_ps(x0.v, x1.v), t2.v));
extremum_x.v = _mm_div_ps(b.v, _mm_mul_ps(a.v, _mm_set1_ps(-2.0f)));
extremum_y.v = _mm_sub_ps(c.v, _mm_div_ps(_mm_mul_ps(b.v, b.v), _mm_mul_ps(a.v, _mm_set1_ps(4.0f))));
}
It takes roughly 60% less time than to compute them sequentially, which means that when done on a whole 1.2 Mpix image, it takes 20ms instead of 50ms.
References:
* [[http://gcc.gnu.org/onlinedocs/gcc-4.4.2/gcc/X86-Built_002din-Functions.html#X86-Built_002din-Functions]] for a list of available instructions in GCC
* [[http://softpixel.com/~cwright/programming/simd/sse.php]] for a quick description of SSE instructions
* [[http://www.tommesani.com/Docs.html]] for a more detailed description
More simple names are available in GCC headers:
* SSE for float (, operand type _ _mm128): _mm_add_ps, _mm_sub_ps, _mm_mul_ps, _mm_div_ps, _mm_sqrt_ps, _mm_rsqrt_ps, _mm_rcp_ps, _mm_load_ps, _mm_store_ps, _mm_set1_ps, _mm_setr_ps
* SSE2 for int (, operand type _ _mm128i): _mm_add_epi32, _mm_sub_epi32, _mm_set1_epi32, _mm_setr_epi32
* SSE2 int/float conversions: _ _builtin_ia32_cvtdq2ps, _ _builtin_ia32_cvtps2dq
And you have to compile with GCC flags -msse and -msse2, or one -march that supports it.
====== Measuring performance ======
===== Linux =====
==== Measuring total time ====
There is a timer with a microsecond precision :
#include
struct timeval tv;
gettimeofday(&tv,NULL);
unsigned microseconds = tv.tv_sec*1000000 + tv.tv_usec; // beware overflows
==== Measuring cpu time ====
The function ''getrusage'' measures the cpu time allocated to you process. Therefore it is less sensible to perturbations of other running processes :
#include
struct rusage ru;
getrusage(RUSAGE_SELF, &ru);
unsigned kernel_microseconds = ru.ru_stime.tv_sec*1000000 + ru.ru_stime.tv_usec
unsigned user_microseconds = ru.ru_utime.tv_sec*1000000 + ru.ru_utime.tv_usec
unsigned total_microseconds = kernel_microseconds + user_microseconds;
When you time your code, you should use "user_microseconds", because kernel charge is not properly part of your algorithm.
===== Windows =====
==== Measuring total time ====
You can use the ''QueryPerformanceCounter'' function which is very precise too.
A simple wrapper is available [[http://www.codeproject.com/datetime/perftimer.asp|here]]
==== Measuring cpu time ====
FIXME Does ''getrusage'' works ??