====== Performance in C/C++ ======
  
===== Inline functions and macros =====
  
Inlining short functions which are often called is very important for performance.

In C, the traditional way to do this is with a macro :
<code c>
#define sqr(x) ((x)*(x))
</code>
  
But be careful, **macros are very dangerous**.

For example, the parentheses around ''x'' are not optional at all. Macros are blindly substituted by the preprocessor before compilation, so ''x*x'' called with ''a-b'' would be replaced by ''a-b*a-b'', which because of operator precedence is not what you want!

Moreover, if the argument of the macro is a function call, the call will be executed twice!
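
A minimal, self-contained illustration of both traps (the names here are illustrative, not from a real library) :
<code c>
#include <stdio.h>

#define SQR_BAD(x) x*x          /* missing parentheses */
#define SQR(x)     ((x)*(x))    /* parenthesized, but still evaluates x twice */

static int calls = 0;
static int f(void) { calls++; return 3; }

int main(void)
{
    int a = 5, b = 2;
    printf("%d\n", SQR_BAD(a - b));           /* a - b*a - b = 5 - 10 - 2 = -7, not 9 */
    printf("%d\n", SQR(f()));                 /* prints 9...                          */
    printf("f was called %d times\n", calls); /* ...but f was called 2 times          */
    return 0;
}
</code>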
  
==== Inline functions ====

With inline functions you have all the advantages (they are as fast as macros, see [[instruction-timings]]) and none of these traps. **That's why in C++ you MUST use inline functions instead of macros.**
  
You can even use templates to have a generic function :
<code cpp>
template <typename T>
inline T sqr(T x) { return x*x; }
</code>

The compiler can deduce the template parameter itself, so both of the following calls work :
<code cpp>
int x = 5;
int y = sqr(x);
int y2 = sqr<int>(x);
</code>
  
  
==== Secure macros with C99 ====

C99 and the GNU extensions bring a lot of new features, among which ''typeof'' and statement expressions (''({ ... })''), which make it possible to create macros without the dangers of classical macros.

With gcc, enable them by adding the ''-std=gnu99'' flag to the compiler command line (''typeof'' and statement expressions are GNU extensions, so plain ''-std=c99'' is not enough).

Then you can define the ''sqr'' macro as follows :
<code c>
#define sqr(x) ({ typeof(x) xx = (x); xx*xx; })
</code>
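
With this version both traps disappear : the argument is copied once into a local variable, so operator precedence cannot bite and the argument is evaluated exactly once. A quick sketch (''f'' and ''calls'' are illustrative) :
<code c>
#define sqr(x) ({ typeof(x) xx = (x); xx*xx; })

static int calls = 0;
static int f(void) { calls++; return 3; }

int main(void)
{
    int r1 = sqr(5 - 2); /* 9 : precedence is no longer an issue */
    int r2 = sqr(f());   /* 9, and f() is evaluated only once    */
    return r1 - r2;      /* calls == 1 at this point             */
}
</code>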
  
===== Instruction timings =====

You can see the complete results [[.instruction-timings|here]].

===== Compiler options =====

For gcc/g++, don't use the ''-g'' option, and use the ''-O1'', ''-O2'' or ''-O3'' optimization flags (in increasing order of optimization).
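
For instance (file names are illustrative) :
<code bash>
g++ -g -o myprog myprog.cpp    # unoptimized debug build, slow
g++ -O2 -o myprog myprog.cpp   # optimized build, use this one when measuring performance
</code>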

References:
  * [[http://www.redhat.com/magazine/011sep05/features/gcc/]]

===== Find the bottleneck =====

The 80-20 rule also holds when programming : 80% of the time is spent in 20% of your code. Therefore you have to find which code those 20% correspond to; it is useless to optimize anything else.

References:
  * [[http://www.redhat.com/magazine/012oct05/features/gcc/]]

==== Valgrind ====

Valgrind is a Linux memory profiler, and is essential to anyone coding in C++ (what C++ programmer has never spent a night trying to debug a segfault ?). It can detect all memory errors even when there is no explicit segfault (which makes it more deterministic), and it gives the exact line where the problem is :
<code bash>
valgrind <your-program> <args>
</code>

It can also show you where the memory leaks are (memory which is never freed), and you can ask for more details with :
<code bash>
valgrind --leak-check=full <your-program> <args>
</code>

valgrind also ships with a tool which is very useful to optimize your code : ''cachegrind''. It tells you where the //cache misses// happen (i.e. where your CPU has to fetch data from RAM because it is not in its cache). This is extremely important : for example, in a product of big matrices, you can divide the time needed by over 20 just by changing the order of the operations to reduce cache misses, as in the sketch below.
<code bash>
valgrind --tool=cachegrind <your-program>
</code>
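
To illustrate the matrix example, here is a minimal sketch (not from the original page) : with the ''i-k-j'' loop order, the innermost loop walks both ''b'' and ''c'' sequentially in memory, whereas the naive ''i-j-k'' order jumps ''n'' floats at every step when reading ''b'', which misses the cache on every iteration for large ''n''.

<code cpp>
// c = a * b, where all matrices are n*n, stored row-major in flat arrays.
void multiply_naive(const float* a, const float* b, float* c, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            float sum = 0;
            for (int k = 0; k < n; k++)
                sum += a[i*n + k] * b[k*n + j]; // b is read column-wise: poor locality
            c[i*n + j] = sum;
        }
}

void multiply_cache_friendly(const float* a, const float* b, float* c, int n)
{
    for (int i = 0; i < n*n; i++) c[i] = 0;
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            const float aik = a[i*n + k];
            for (int j = 0; j < n; j++)
                c[i*n + j] += aik * b[k*n + j]; // b and c are both read sequentially
        }
}
</code>

Running both versions on big matrices under cachegrind shows the difference immediately in the data cache miss counts.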

[[http://citeseer.ist.psu.edu/cache/papers/cs/28135/http:zSzzSzwww10.informatik.uni-erlangen.dezSzResearchzSzProjectszSzDiMEzSzpubszSzgi_03.pdf/kowarschik03overview.pdf|How to optimize the use of cache memory (pdf) [EN] ]]

There are other tools and a lot of options; see ''man valgrind''.

==== gprof ====

''gprof'' is a profiling tool that tells you how many times each function is called and how much time is spent in it.

To use it :
  * compile your program with the ''-pg'' flag, for both compiling and linking (and of course the classic ''-g'' debug flag)
  * run your program (it will generate a log file, ''gmon.out'')
  * run ''gprof <your-program> gmon.out'', as in the example below. You can try the ''--graph'' option, which also counts the time spent in the children of each function.
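
For example, for a single-file program (file and program names are illustrative) :
<code bash>
g++ -g -pg -O2 -o myprog myprog.cpp   # -pg must be passed at compile and link time
./myprog                              # writes gmon.out in the current directory
gprof myprog gmon.out                 # flat profile
gprof --graph myprog gmon.out         # call graph, including time spent in children
</code>
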
==== Other profilers ====

sysprof and oprofile are other profilers; they rely on a kernel module. Sysprof even has a GUI, and shows the time spent in all running functions and programs (system-wide). If you want details about the scheduling of processes (which process was running when, and on which CPU), you can use trace-cmd and its front-end KernelShark.

==== Traps ====

You might think that you can find which lines are costing time by removing them and seeing how much faster the program runs. But this does not work when optimizations are enabled : once those few lines are gone, the compiler may detect that much more code has become useless and remove it as well.

===== Memory usage =====
==== Dynamic vs Static ====

You have to keep in mind that dynamic allocations with ''new'' are **very** expensive operations. Moreover, the heap where dynamic objects are stored is subject to fragmentation.

Static allocations on the stack are faster, but you need to know the maximum size at compilation time. You can also make static allocations on the heap with global variables, which is even faster, but then the memory is permanently used.

A nice way to deal with the trade-off between dynamic and static allocations is to allocate memory by blocks.
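
A minimal sketch of the idea (the ''Arena'' class is illustrative, not part of any library) : pay for one big allocation up front, then hand out pieces of it by simply bumping an offset.

<code cpp>
#include <cstddef>

// One big upfront allocation; each "allocation" is then a simple offset bump.
class Arena
{
public:
    explicit Arena(std::size_t bytes)
        : buffer_(new char[bytes]), size_(bytes), used_(0) {}
    ~Arena() { delete[] buffer_; }

    void* allocate(std::size_t bytes)
    {
        std::size_t aligned = (used_ + 7) & ~std::size_t(7); // keep 8-byte alignment
        if (aligned + bytes > size_) return 0;               // block exhausted
        used_ = aligned + bytes;
        return buffer_ + aligned;
    }

    void reset() { used_ = 0; } // "frees" everything at once, no fragmentation

private:
    char*       buffer_;
    std::size_t size_;
    std::size_t used_;
};
</code>

The limitation is that objects allocated this way are not freed individually, so this fits workloads where many small objects live and die together (per frame, per request, ...).
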
==== Alignment ====

Looking at the following two structures, it would appear that both contain the same data and take up the same space:
<code cpp>
struct foo1 {
  char a;
  int  b;
  char c;
  int  d;
  char e;
  char f;
};
</code>

<code cpp>
struct foo2 {
  int  b;
  int  d;
  char a;
  char c;
  char e;
  char f;
};
</code>

Due to the way the compiler aligns data, however, the two structures above are **not** the same. On a system with 32-bit ''int'', ''foo1'' will be 20 bytes, while ''foo2'' will be a mere 12 bytes.

In general, you want to arrange the members so that as many of them as possible are packed into each 32-bit (or 64-bit) word, typically by ordering them from largest to smallest.
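
You can verify the sizes on your own platform with a quick check (typical output shown for a system with 32-bit ''int'') :
<code cpp>
#include <cstdio>

struct foo1 { char a; int b; char c; int d; char e; char f; };
struct foo2 { int b; int d; char a; char c; char e; char f; };

int main()
{
    std::printf("foo1: %u bytes, foo2: %u bytes\n",
                (unsigned)sizeof(foo1), (unsigned)sizeof(foo2));
    // typically prints: foo1: 20 bytes, foo2: 12 bytes
    return 0;
}
</code>
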
==== Natural data structures ====

On a 32-bit processor it can be slower to use 8-bit or 16-bit variables than 32-bit variables, because :
  * there may be data alignment problems
  * the sign bit is usually the most significant bit, and the CPU has extra work to do to handle it
Therefore all simple counters should use the natural word size of the CPU.

====== Vectorization ======

Since the Pentium III and its SSE instructions, you can do 4 float add/sub/mul/div/sqrt operations in one instruction, and since the Pentium 4 and SSE2, you can do 4 integer add/sub operations in one instruction. As all fairly recent CPUs provide these instruction sets, it can be tempting to use them to speed up your program.

However there are two problems:
  * these operations are quite fast anyway (precisely because each one is a single instruction), so if you need too much data reorganization to lay the operands out as a contiguous vector, the overhead quickly cancels what you gain by parallelizing the operations.
  * the data **must** be 16-byte aligned, which can prevent you from using raw data directly (e.g. if you want to compute a Haar feature whose size is not a multiple of 4 over an integral image), which brings back the previous problem...

But still, when you can use it, it can be worth the pain, especially with divisions or sqrt, which cost many cycles, e.g. to compute 4 parabolic interpolations simultaneously:

<code cpp>
#include <xmmintrin.h> // SSE intrinsics (_mm_*_ps)

struct SSE_f
{
	typedef float v4sf __attribute__((vector_size(16)));
	union { v4sf v; float f[4]; };
};

inline void parabolicInterpolation4(
	const SSE_f &x0, const SSE_f &y0, const SSE_f &x1, const SSE_f &y1, const SSE_f &x2, const SSE_f &y2,
	SSE_f &extremum_x, SSE_f &extremum_y, SSE_f &a, SSE_f &b, SSE_f &c)
{
	SSE_f x01; x01.v = _mm_sub_ps(x0.v, x1.v);
	SSE_f x02; x02.v = _mm_sub_ps(x0.v, x2.v);
	SSE_f x12; x12.v = _mm_sub_ps(x1.v, x2.v);
	SSE_f t0; t0.v = _mm_div_ps(y0.v, _mm_mul_ps(x01.v, x02.v));
	SSE_f t1; t1.v = _mm_div_ps(y1.v, _mm_mul_ps(x01.v, x12.v));
	SSE_f t2; t2.v = _mm_div_ps(y2.v, _mm_mul_ps(x02.v, x12.v));
	a.v = _mm_add_ps(_mm_sub_ps(t0.v, t1.v), t2.v);
	b.v = _mm_sub_ps(_mm_mul_ps(_mm_add_ps(x0.v, x2.v), t1.v),
	      _mm_add_ps(_mm_mul_ps(_mm_add_ps(x1.v, x2.v), t0.v),
	                 _mm_mul_ps(_mm_add_ps(x0.v, x1.v), t2.v)));
	c.v = _mm_add_ps(_mm_sub_ps(
		_mm_mul_ps(_mm_mul_ps(x1.v, x2.v), t0.v),
		_mm_mul_ps(_mm_mul_ps(x0.v, x2.v), t1.v)),
		_mm_mul_ps(_mm_mul_ps(x0.v, x1.v), t2.v));
	extremum_x.v = _mm_div_ps(b.v, _mm_mul_ps(a.v, _mm_set1_ps(-2.0f)));
	extremum_y.v = _mm_sub_ps(c.v, _mm_div_ps(_mm_mul_ps(b.v, b.v), _mm_mul_ps(a.v, _mm_set1_ps(4.0f))));
}
</code>

It takes roughly 60% less time than computing them sequentially: over a whole 1.2 Mpix image, it takes 20ms instead of 50ms.

References:
  * [[http://gcc.gnu.org/onlinedocs/gcc-4.4.2/gcc/X86-Built_002din-Functions.html#X86-Built_002din-Functions]] for the list of instructions available in GCC
  * [[http://softpixel.com/~cwright/programming/simd/sse.php]] for a quick description of the SSE instructions
  * [[http://www.tommesani.com/Docs.html]] for a more detailed description

Simpler names are available in the GCC headers:
  * SSE for float (''<xmmintrin.h>'', operand type ''%%__m128%%''): _mm_add_ps, _mm_sub_ps, _mm_mul_ps, _mm_div_ps, _mm_sqrt_ps, _mm_rsqrt_ps, _mm_rcp_ps, _mm_load_ps, _mm_store_ps, _mm_set1_ps, _mm_setr_ps
  * SSE2 for int (''<emmintrin.h>'', operand type ''%%__m128i%%''): _mm_add_epi32, _mm_sub_epi32, _mm_set1_epi32, _mm_setr_epi32
  * SSE2 int/float conversions: %%__builtin_ia32_cvtdq2ps%%, %%__builtin_ia32_cvtps2dq%%

And you have to compile with the GCC flags ''-msse'' and ''-msse2'', or with a ''-march'' that implies them.
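
For instance (file names are illustrative) :
<code bash>
g++ -O2 -msse -msse2 -o interp interp.cpp
g++ -O2 -march=native -o interp interp.cpp   # or enable everything the build machine supports
</code>
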
====== Measuring performance ======
===== Linux =====
==== Measuring total time ====

There is a timer with microsecond precision :
<code cpp>
#include <sys/time.h>

struct timeval tv;
gettimeofday(&tv, NULL);

unsigned microseconds = tv.tv_sec*1000000 + tv.tv_usec; // beware of overflows
</code>
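
In practice you time an interval with two calls, and computing the difference in a 64-bit integer avoids the overflow mentioned above (a minimal sketch) :
<code cpp>
#include <sys/time.h>
#include <cstdio>

int main()
{
    struct timeval start, end;
    gettimeofday(&start, NULL);
    // ... code to measure ...
    gettimeofday(&end, NULL);

    long long elapsed_us = (long long)(end.tv_sec - start.tv_sec) * 1000000
                         + (end.tv_usec - start.tv_usec);
    std::printf("%lld microseconds\n", elapsed_us);
    return 0;
}
</code>
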
==== Measuring cpu time ====

The function ''getrusage'' measures the cpu time allotted to your process, so it is less sensitive to perturbations from other running processes :
<code cpp>
#include <sys/resource.h>

struct rusage ru;
getrusage(RUSAGE_SELF, &ru);

unsigned kernel_microseconds = ru.ru_stime.tv_sec*1000000 + ru.ru_stime.tv_usec;
unsigned   user_microseconds = ru.ru_utime.tv_sec*1000000 + ru.ru_utime.tv_usec;
unsigned  total_microseconds = kernel_microseconds + user_microseconds;
</code>

When you time your code, you should use ''user_microseconds'', because the kernel load is not properly part of your algorithm.

===== Windows =====
==== Measuring total time ====

You can use the ''QueryPerformanceCounter'' function, which is very precise too.

A simple wrapper is available [[http://www.codeproject.com/datetime/perftimer.asp|here]].
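
If you prefer calling it directly, here is a minimal sketch; the counter ticks are converted to microseconds with the frequency returned by ''QueryPerformanceFrequency'' :
<code cpp>
#include <windows.h>
#include <cstdio>

int main()
{
    LARGE_INTEGER freq, start, end;
    QueryPerformanceFrequency(&freq); // ticks per second
    QueryPerformanceCounter(&start);
    // ... code to measure ...
    QueryPerformanceCounter(&end);

    double elapsed_us = (end.QuadPart - start.QuadPart) * 1000000.0 / freq.QuadPart;
    std::printf("%f microseconds\n", elapsed_us);
    return 0;
}
</code>
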
==== Measuring cpu time ====

FIXME Does ''getrusage'' work on Windows ?