This page shows ''programming:c-cpp-performance'' (created 2006/12/14 13:23 by cyril, last revised 2013/02/26 15:14 by cyril).
====== Performance in C/C++ ======
===== Inline and macros =====
Inlining short functions which are often called is very important for performance.
<code cpp>
#define sqr(x) ((x)*(x))
</code>
But be careful, macros are very dangerous.

For example, parentheses around the arguments are essential: without them, ''sqr(a+1)'' would expand to ''a+1*a+1'', which is not the square.
Moreover, if the argument of the macro is a function call, this call will be done twice!

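The double-evaluation trap can be demonstrated with a short program (the macro and helper names below are illustrative, not from the original page):

<code c>
#include <stdio.h>

#define SQR(x) ((x)*(x))   /* classical macro: expands its argument textually */

static int calls = 0;
static int next(void) { return ++calls; }

int main(void)
{
    /* SQR(next()) expands to ((next())*(next())): next() runs twice */
    int y = SQR(next());
    printf("y=%d calls=%d\n", y, calls);
    return 0;
}
</code>

The evaluation order of the two operands of ''*'' is unspecified, but the product and the call count are the same either way: ''next()'' is called twice and the result is not the square of a single value.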
==== Inline functions ====
With inline functions you have all the advantages (it is as fast as macros, see [[instruction-timings]]) and none of these traps.
You can even use templates to have a generic function:
<code cpp>
template <typename T>
inline T sqr(T x) { return x*x; }
</code>
<code cpp>
int x = 5;
int y = sqr(x);
int y2 = sqr<int>(x); // explicit template argument
</code>
==== Secure macros with C99 ====

The C99 era brought a lot of new features, among which the possibility to create macros without the dangers of classical macros. (Strictly speaking, ''typeof'' and statement expressions are GNU extensions rather than part of the C99 standard itself.)

To enable them with gcc, compile with ''-std=gnu99'' (plain ''-std=c99'' disables these extensions).

Then you can define the ''sqr'' macro so that its argument is evaluated only once:
<code c>
#define sqr(x) ({ typeof(x) xx = (x); xx*xx; })
</code>
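To check that the statement-expression version really evaluates its argument once, here is a small test (illustrative names; ''__typeof__'' is used so it compiles in any gcc mode):

<code c>
#include <stdio.h>

/* GNU C statement expression: the argument is evaluated exactly once */
#define sqr(x) ({ __typeof__(x) xx = (x); xx * xx; })

static int calls = 0;
static int next(void) { return ++calls; }

int main(void)
{
    int y = sqr(next());   /* next() runs only once */
    printf("y=%d calls=%d\n", y, calls);
    return 0;
}
</code>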
===== Instruction timings =====
You can see complete results [[.instruction-timings|here]].
===== Compiler options =====
For gcc/g++ don't use ''-O0'' (the default, no optimization): compile with ''-O2'' or ''-O3'' when performance matters.
===== Find the bottleneck =====

The 80-20 rule is also true in programming: 80% of the time is spent in 20% of your code. Therefore you have to find which code these 20% correspond to; it is useless to try to optimize anything else.

==== Valgrind ====

Valgrind is a Linux memory profiler, and is essential to everyone coding in C++ (which C++ programmer never spent a night trying to debug a segfault?). It can detect memory errors even if there is no explicit segfault (therefore it is more deterministic). Simply run:
<code bash>
valgrind <your_program>
</code>

It can also show you where memory leaks are (memory which is never freed), and you can ask for more details with:
<code bash>
valgrind --leak-check=full <your_program>
</code>
Valgrind also ships a tool which is very useful to optimize your code: cachegrind. It simulates the CPU caches and reports where cache misses happen:
<code bash>
valgrind --tool=cachegrind <your_program>
</code>

The KCachegrind GUI can be used to browse the generated profile.

There are other tools and a lot of options; see ''valgrind --help'' and the Valgrind manual.

==== gprof ====

''gprof'' shows how much time is spent in each function of your program.

To use it:
  * compile your program with the ''-pg'' gcc option
  * run your program (it will generate a log file ''gmon.out'' in the current directory)
  * run ''gprof your_program gmon.out'' to display the time spent in each function
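The three steps above can be sketched as shell commands (''myprog'' is a placeholder name, not from the original page):

<code bash>
# compile with profiling instrumentation (-pg), keeping optimizations on
gcc -O2 -pg -o myprog myprog.c

# run normally: this writes gmon.out in the current directory
./myprog

# analyze: flat profile and call graph
gprof ./myprog gmon.out | head -n 20
</code>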
==== Other profilers ====

sysprof and oprofile are other profilers, which use a kernel module. Sysprof even has a GUI, and shows the time spent in all running functions and programs (system-wide). If you want details about the scheduling of processes (which process was running when, and on which CPU), you can use trace-cmd and its front-end KernelShark.
==== Traps ====

You could think that you can find which lines are taking time by removing them and seeing how much faster the program runs, but this does not work when optimizations are enabled: once those few lines are removed, the compiler may eliminate much more code that has become useless, so the measured difference is misleading.
===== Memory usage =====
==== Dynamic vs Static ====

You have to keep in mind that dynamic allocations with ''malloc''/''new'' are slow.

Static allocations on the stack are faster, but you need to know the maximum size at compilation time. You can also allocate statically in global variables, which is even faster, but the memory is then permanently used.

A nice way to deal with the trade-off between dynamic and static allocations is to allocate memory by blocks.
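A minimal sketch of such block-wise allocation, assuming hypothetical names (''Arena'', ''arena_alloc''); to keep the sketch short, exhausted blocks are deliberately leaked, whereas a real pool would chain them so they can be released together:

<code c>
#include <stdio.h>
#include <stdlib.h>

/* One malloc of BLOCK_SIZE bytes serves many small requests. */
#define BLOCK_SIZE 4096

typedef struct {
    char  *block;  /* current block */
    size_t used;   /* bytes used in the current block */
} Arena;

static void *arena_alloc(Arena *a, size_t n)
{
    n = (n + 7) & ~(size_t)7;                  /* keep 8-byte alignment */
    if (a->block == NULL || a->used + n > BLOCK_SIZE) {
        a->block = malloc(BLOCK_SIZE);         /* simplified: old block leaks */
        a->used  = 0;
    }
    void *p = a->block + a->used;
    a->used += n;
    return p;
}

int main(void)
{
    Arena a = { NULL, 0 };
    int *xs = arena_alloc(&a, 100 * sizeof(int)); /* first request mallocs   */
    int *ys = arena_alloc(&a, 100 * sizeof(int)); /* second one is malloc-free */
    xs[0] = 1; ys[99] = 2;
    printf("%d %d\n", xs[0], ys[99]);
    return 0;
}
</code>

This way the cost of ''malloc'' is paid once per block instead of once per object.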

==== Alignment ====

Looking at the following two structures, it would appear that both contain the same data and take up the same space:
<code cpp>
struct foo1 {
  char a;
  int b;
  char c;
  int d;
  char e;
  char f;
};
</code>

<code cpp>
struct foo2 {
  int b;
  int d;
  char a;
  char c;
  char e;
  char f;
};
</code>

Due to the way that data gets aligned by the compiler, however, the above two structures are **not** the same. On a system with 32-bit ints, foo1 will be 20 bytes, while foo2 will be a mere 12 bytes.

In general, you want to arrange the variables so that as many of them as possible are packed into each 32-bit (or 64-bit) slot.
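You can verify this with ''sizeof'' (the exact numbers depend on the ABI; 20 and 12 are typical where ''int'' is 4 bytes):

<code c>
#include <stdio.h>

struct foo1 { char a; int b; char c; int d; char e; char f; };
struct foo2 { int b; int d; char a; char c; char e; char f; };

int main(void)
{
    /* foo1 needs padding after a, c, and e/f; foo2 packs the chars together */
    printf("foo1: %zu bytes\n", sizeof(struct foo1));
    printf("foo2: %zu bytes\n", sizeof(struct foo2));
    return 0;
}
</code>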

==== Natural data structures ====

On a 32-bit processor it can be slower to use 8-bit or 16-bit variables than 32-bit variables, because:
  * there may be data alignment problems
  * the sign bit is usually the most significant bit, and the CPU has extra work to do to detect it
Therefore all simple counters should use the natural word size of the CPU.

====== Vectorization ======

Since the Pentium III, SSE instructions can apply the same operation to 4 floats simultaneously (and SSE2 to 4 ints).

However there are two problems:
  * these operations are anyway quite fast (especially because each is a single instruction), so if you need too much data reorganization to present the operands as a contiguous vector, the overhead quickly cancels what you win by parallelizing the operations.
  * data **must** be 16-byte aligned, which can prevent you from directly using raw data (e.g. if you want to compute a Haar feature whose size is not a multiple of 4 over an integral image), which brings back the previous problem...
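The 16-byte alignment requirement can be met statically with GCC's ''aligned'' attribute, or dynamically with POSIX ''posix_memalign'' instead of plain ''malloc''; a minimal sketch:

<code c>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

int main(void)
{
    /* static storage: ask the compiler for 16-byte alignment */
    static float buf_static[4] __attribute__((aligned(16)));

    /* dynamic storage: posix_memalign instead of plain malloc */
    float *buf_dyn = NULL;
    if (posix_memalign((void **)&buf_dyn, 16, 4 * sizeof(float)) != 0)
        return 1;

    /* both addresses are multiples of 16, so both offsets print as 0 */
    printf("%u %u\n",
           (unsigned)((uintptr_t)buf_static % 16),
           (unsigned)((uintptr_t)buf_dyn % 16));
    free(buf_dyn);
    return 0;
}
</code>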

Still, when you can use it, it can be worth the pain, especially with divisions or ''sqrt'', which consume many cycles; e.g. to compute 4 parabolic interpolations simultaneously:

<code cpp>
#include <xmmintrin.h>

struct SSE_f
{
  typedef float v4sf __attribute__((vector_size(16)));
  union { v4sf v; float f[4]; };
};

// Fits y = a*x^2 + b*x + c through (x0,y0), (x1,y1), (x2,y2)
// for 4 independent point triples at once (Lagrange form).
inline void parabolicInterpolation4(
  const SSE_f &x0, const SSE_f &y0, const SSE_f &x1, const SSE_f &y1,
  const SSE_f &x2, const SSE_f &y2,
  SSE_f &a, SSE_f &b, SSE_f &c, SSE_f &extremum_x, SSE_f &extremum_y)
{
  SSE_f x01; x01.v = _mm_sub_ps(x0.v, x1.v);
  SSE_f x02; x02.v = _mm_sub_ps(x0.v, x2.v);
  SSE_f x12; x12.v = _mm_sub_ps(x1.v, x2.v);
  SSE_f t0; t0.v = _mm_div_ps(y0.v, _mm_mul_ps(x01.v, x02.v));
  SSE_f t1; t1.v = _mm_div_ps(y1.v, _mm_mul_ps(x01.v, x12.v));
  SSE_f t2; t2.v = _mm_div_ps(y2.v, _mm_mul_ps(x02.v, x12.v));
  a.v = _mm_add_ps(_mm_sub_ps(t0.v, t1.v), t2.v);
  b.v = _mm_sub_ps(_mm_mul_ps(_mm_add_ps(x0.v, x2.v), t1.v),
        _mm_add_ps(_mm_mul_ps(_mm_add_ps(x1.v, x2.v), t0.v),
                   _mm_mul_ps(_mm_add_ps(x0.v, x1.v), t2.v)));
  c.v = _mm_add_ps(_mm_sub_ps(
          _mm_mul_ps(_mm_mul_ps(x1.v, x2.v), t0.v),
          _mm_mul_ps(_mm_mul_ps(x0.v, x2.v), t1.v)),
          _mm_mul_ps(_mm_mul_ps(x0.v, x1.v), t2.v));
  // extremum at x = -b/(2a), y = c - b^2/(4a)
  extremum_x.v = _mm_div_ps(b.v, _mm_mul_ps(_mm_set1_ps(-2.0f), a.v));
  extremum_y.v = _mm_sub_ps(c.v, _mm_div_ps(_mm_mul_ps(b.v, b.v),
                                            _mm_mul_ps(_mm_set1_ps(4.0f), a.v)));
}
</code>

It takes roughly 60% less time than computing them sequentially.
Simpler names for these intrinsics are available in the GCC headers:
  * SSE for float (''<xmmintrin.h>'')
  * SSE2 for int (''<emmintrin.h>'')
  * SSE2 int/float conversions (also in ''<emmintrin.h>'')

And you have to compile with the GCC flags ''-msse'' and ''-msse2'', or a ''-march'' that supports them.
====== Measuring performance ======
===== Linux =====
==== Measuring total time ====

There is a timer with a microsecond precision:
<code cpp>
#include <sys/time.h>

struct timeval tv;
gettimeofday(&tv, NULL);

unsigned microseconds = tv.tv_sec*1000000 + tv.tv_usec; // beware overflows
</code>
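A self-contained sketch of timing a piece of code with ''gettimeofday'', using 64-bit arithmetic to avoid the overflow:

<code c>
#include <stdio.h>
#include <sys/time.h>

int main(void)
{
    struct timeval start, end;
    gettimeofday(&start, NULL);

    volatile long s = 0;                 /* some work to measure */
    for (long i = 0; i < 1000000; i++)
        s += i;

    gettimeofday(&end, NULL);

    /* 64-bit difference: no overflow even for long runs */
    long long us = (end.tv_sec - start.tv_sec) * 1000000LL
                 + (end.tv_usec - start.tv_usec);
    printf("elapsed: %lld us\n", us);
    return us < 0;                       /* sanity: elapsed time is non-negative */
}
</code>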

==== Measuring cpu time ====
The function ''getrusage'' gives the CPU time (user and kernel) consumed by your process:
<code cpp>
#include <sys/resource.h>

struct rusage ru;
getrusage(RUSAGE_SELF, &ru);

unsigned kernel_microseconds = ru.ru_stime.tv_sec*1000000 + ru.ru_stime.tv_usec;
unsigned user_microseconds = ru.ru_utime.tv_sec*1000000 + ru.ru_utime.tv_usec;
unsigned total_microseconds = kernel_microseconds + user_microseconds;
</code>

When you time your code, you should prefer the CPU time over the total (wall-clock) time, so that other processes running on the machine do not distort the measurement.
===== Windows =====
==== Measuring total time ====

You can use the ''QueryPerformanceCounter'' function, with ''QueryPerformanceFrequency'' to convert the tick count to seconds.

==== Measuring cpu time ====
FIXME Find the Windows equivalent of ''getrusage'' for CPU time (''GetProcessTimes''?).