Differences

This shows you the differences between two versions of the page.

Link to this comparison view

programming:c-cpp-performance [2008/10/21 16:14]
cyril
programming:c-cpp-performance [2013/09/19 16:40]
Line 1: Line 1:
-====== Performance in C/C++ ====== 
  
-===== Inline and macros ===== 
- 
-Inlining short functions which are often called is very important for performance. 
- 
-==== Macros ==== 
- 
-In C you can use macros : 
-<code c> 
-#define sqr(x) (x)*(x) 
-</code> 
- 
-But be careful, **macros are very dangerous**.  
- 
-For example parenthesis around ''x'' are not optional at all. Indeed macros are stupidly replaced before compilation, and ''x*x'' called with ''a-b'' will be replaced by ''a-b*a-b'' which is not what you want because of operators priorities ! 
- 
-Moreover if the argument of the macro is a function call, this call will be done two times ! 
- 
- 
- 
-==== Inline functions ==== 
- 
-With inline functions you have all the advantages (it is as fast as macros, see [[instruction-timings]]) and none of these traps. **That's why in C++ you MUST use inline functions instead of macros**.  
- 
-You can even use templates to have a generic function : 
-<code cpp> 
-template <typename T> 
-inline T sqr(T x) { return x*x; } 
-</code> 
- 
-And use it like : 
-<code cpp> 
-int x = 5; 
-int y = sqr(x); 
-int y2 = sqr<int>(x); 
-</code> 
- 
-== Remark 1 == 
-If you want to export your inline functions in several files, you must define it in the header file. 
- 
-== Remark 2 == 
-Methods which are defined inside the body of a class are automatically inlined, eg : 
-<code cpp> 
-class Point 
-{ 
-  int x,y; 
- public: 
-  int getX() { return x; } // this function is automatically inlined ! 
-  int getY() { return y; } // this function is automatically inlined ! 
-} 
-</code> 
- 
-==== Secure macros with C99 ==== 
- 
-The C99 standard allows a lot of new features, among which the possibility to create macros without the dangers of classical macros. 
- 
-To enable C99 standard, add the ''-std=c99'' or ''-std=gnu99'' flag to the compiler command line with gcc. 
- 
-Then you can define the ''sqr'' macro as follow : 
-<code c> 
-#define sqr(x) ({ typeof(x) xx=(x); xx*xx; }) 
-</code> 
- 
-===== Instruction timings ===== 
- 
-You can see complete results [[.instruction-timings|here]]. 
- 
- 
-===== Compiler options ===== 
- 
-Using the good optimization options of your compiler for your final release is extremely important. Indeed as long as you compile with debug options no optimization is done, inline functions are not inlined, and overhead is added. 
- 
-For VisualStudio, just compile in Release mode instead of Debug mode. 
- 
-For gcc/g++ don't use ''-g'' option, and use ''-O1'', ''-O2'' or ''-O3'' optimization flag (increasing optimization). 
- 
-References: 
-  * [[http://www.redhat.com/magazine/011sep05/features/gcc/]] 
- 
-===== Find the bottleneck ===== 
- 
-The 80-20 rule is also true when programming : 80% of the time is spend in 20% of your code. Therefore you have to find to which code correspond these 20%, it is useless to try to optimize something else that these 20%. 
- 
-References: 
-  * [[http://www.redhat.com/magazine/012oct05/features/gcc/]] 
- 
- 
-==== Valgrind ==== 
- 
-Valgrind is a Linux memory profiler, and is essential to everyone coding in C++ (which C++ programmer never spent a night trying to debug a segfault ?). It can detect all memory errors, even if there is no explicit segfault (therefore it is more deterministic), and it gives the exact line where is the problem : 
-<code bash> 
-valgrind <your-program> <args> 
-</code> 
- 
-It can also show you where are memory leaks (memory which is not freed), and you can ask more details with : 
-<code bash> 
-valgrind --leak-check=full <your-program> <args> 
-</code> 
- 
-There is also a tool with valgrind which is very useful to optimize your code, cachegrind. ''cachegrind'' tells you where are //cache misses// in your CPU cache memory (ie where your CPU needs to grab data in RAM because it is not in its cache). This is extremely important, for example if you do a matrix product with big matrices, you can divide the time needed by over 20 just by changing the order of operations to reduce cache misses. 
-<code bash> 
-valgrind --tool=cachegrind <your-program> 
-</code> 
- 
-[[http://citeseer.ist.psu.edu/cache/papers/cs/28135/http:zSzzSzwww10.informatik.uni-erlangen.dezSzResearchzSzProjectszSzDiMEzSzpubszSzgi_03.pdf/kowarschik03overview.pdf|How to optimize the use of cache memory (pdf) [EN] ]] 
- 
-There are other tools and a lot of options, look ''man valgrind''. 
- 
- 
- 
-==== gprof ==== 
- 
-''gprof'' is a profiling tool that tells you how many times is called a function or how much time is spend in a function. 
- 
-To use it : 
-  * compile your program with the '-pg' flag for both compiling and linking (and of course the '-g' classic debug flag) 
-  * run your program (it will generate a log file ''gmon.out'') 
-  * run ''gprof gmon.out''. You can try the ''--graph'' option which will include the time spent in children calls for each function. 
- 
-==== Other profilers ==== 
- 
-sysprof and oprofile are other profiles which use a kernel module. Sysprof has even a GUI, and shows the time spent in all running functions and programs. 
- 
-==== Traps ==== 
- 
-You could think that you can find which lines are taking time by removing them and see how much less time it takes, but it won't work if optimizations are enabled, because the compiler can remove a lot more of code if it is not useful because of the few lines that you removed. 
- 
-===== Memory usage ===== 
-==== Dynamic vs Static ==== 
- 
-You have to keep in mind that dynamic allocations with ''new'' are **very** expensive operations. Moreover the heap where are stored dynamic objects is subject to fragmentation. 
- 
-Static allocation on the stack are faster, but you need to know the maximum size at compilation time. You can also do static allocations on the heap in global variables, which is even faster, but memory is permanently used. 
- 
-A nice way to deal with the trade-off between dynamic allocations and static allocations is to allocate memory by blocks. 
- 
- 
-==== Alignment ==== 
- 
-In looking at the following two structures, it would appear that both  
-contain the same data, and take up the same space: 
-<code cpp> 
- struct foo1 { 
-  char a; 
-  int  b; 
-  char c;  
-  int  d;  
-  char e; 
-  char f; 
- }; 
-</code> 
- 
-<code cpp> 
- struct foo2 { 
-  int b; 
-  int d; 
-  char a; 
-  char c; 
-  char e;  
-  char f; 
- }; 
-</code> 
- 
-Due to the way that data gets aligned by the compiler, however, the 
-above two structures are **not** the same.  On a 32-bit int system, 
-foo1 will be 20 bytes, while foo2 will be a mere 12 bytes. 
- 
-In general, you want to keep the variables arranged such that for a given 
-set of 32 bits (or 64 bits), that as many variables as possible are 
-packed into that space. 
- 
-==== Natural data structures ==== 
- 
-On a 32-bits processor it can be slower to use 8-bits or 16-bits variables than 32-bits variables, because : 
-  * maybe there are data alignment problems 
-  * the sign bit is usually the most significant bit, and the CPU have extra work to do to detect it 
-Therefore all simple counters should be done in the natural word size of the CPU. 
- 
- 
-====== Measuring performance ====== 
-===== Linux ===== 
-==== Measuring total time ==== 
- 
-There is a timer with a microsecond precision : 
-<code cpp> 
-#include <sys/time.h> 
-struct timeval tv; 
-struct timezone tz; 
-gettimeofday(&tv,&tz); 
- 
-unsigned microseconds = tv.tv_sec*1000000 + tv.tv_usec; // beware overflows 
-</code> 
- 
- 
- 
-==== Measuring cpu time ==== 
-The function ''getrusage'' measures the cpu time allocated to you process. Therefore it is less sensible to perturbations of other running processes : 
-<code cpp> 
-#include <sys/resource.h> 
-struct rusage ru; 
-getrusage(RUSAGE_SELF, &ru); 
- 
-unsigned kernel_microseconds = ru.ru_stime.tv_sec*1000000 + ru.ru_stime.tv_usec 
-unsigned   user_microseconds = ru.ru_utime.tv_sec*1000000 + ru.ru_utime.tv_usec 
-unsigned total_microseconds = kernel_microseconds + user_microseconds; 
-</code> 
- 
-When you time your code, you should use "user_microseconds", because kernel charge is not properly part of your algorithm. 
- 
-===== Windows ===== 
-==== Measuring total time ==== 
- 
-You can use the ''QueryPerformanceCounter'' function which is very precise too. 
- 
-A simple wrapper is available [[http://www.codeproject.com/datetime/perftimer.asp|here]] 
- 
-==== Measuring cpu time ==== 
-FIXME Does ''getrusage'' works ?? 
programming/c-cpp-performance.txt ยท Last modified: 2013/09/19 16:40 (external edit)
CC Attribution-Share Alike 4.0 International
Driven by DokuWiki Recent changes RSS feed Valid CSS Valid XHTML 1.0