Differences

This shows you the differences between two versions of the page.

--- programming:c-cpp-performance [2008/01/30 07:37]
127.0.0.1 external edit
+++ programming:c-cpp-performance [2013/09/19 16:40] (current)
@@ Line 26: / Line 26: @@
 You can even use templates to have a generic function :
 <code cpp>
-<template typename T>
+template <typename T>
 inline T sqr(T x) { return x*x; }
 </code>
@@ Line 33: / Line 33: @@
 <code cpp>
 int x = 5;
-int y = sqr<int>(x);
+int y = sqr(x);
+int y2 = sqr<int>(x);
 </code>
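As a minimal sketch of the two call styles in this hunk, the block below shows the template argument being deduced from the call and being given explicitly. The helper name sqrDemo is hypothetical and only exists for the illustration.

```cpp
// Generic square: the compiler instantiates one version per argument type.
template <typename T>
inline T sqr(T x) { return x * x; }

// Hypothetical helper, only here to show both call styles side by side.
inline double sqrDemo()
{
    int    i = sqr(5);          // T deduced as int
    double d = sqr(1.5);        // T deduced as double
    double e = sqr<double>(3);  // T forced to double, 3 is converted to 3.0
    return i + d + e;           // 25 + 2.25 + 9
}
```

Explicit template arguments are mostly useful when the deduced type is not the one you want, as in the last call.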
@@ Line 120: / Line 121: @@
 ==== Other profilers ====
-sysprof and oprofile are other profiles which use a kernel module. Sysprof has even a GUI, and shows the time spent in all running functions and programs.
+sysprof and oprofile are other profilers which use a kernel module. Sysprof even has a GUI, and shows the time spent in all running functions and programs (system-wide). If you want details about process scheduling (which process was running when, and on which CPU), you can use trace-cmd and its front-end KernelShark.
 ==== Traps ====
@@ Line 177: / Line 178: @@
 Therefore all simple counters should be done in the natural word size of the CPU.
+====== Vectorization ======
+Since the Pentium III with SSE instructions, you can do 4 float add/sub/mul/div/sqrt operations in one instruction, and since the Pentium 4 with SSE2 instructions, you can do 4 integer add/sub operations in one instruction. As all fairly recent CPUs provide these instruction sets, it can be tempting to use them to speed up your program.
+However, there are two problems:
+  * these operations are already quite fast (each one is a single instruction), so if you need to reorganize a lot of data to get it into a contiguous vector, the overhead quickly cancels what you gain by parallelizing the operations.
+  * data **must** be 16-byte aligned, which can prevent you from using raw data directly (e.g. if you want to compute a Haar feature whose size is not a multiple of 4 on an integral image), which brings back the previous problem...
+But when you can use it, it can be worth the pain, especially with divisions or square roots, which consume many cycles, e.g. to compute 4 parabolic interpolations simultaneously:
+<code cpp>
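+#include <xmmintrin.h> // SSE intrinsics (_mm_sub_ps, _mm_mul_ps, ...)
+// 4 packed floats, accessible as one vector (v) or as individual lanes (f)
+struct SSE_f
+{
+	typedef float v4sf __attribute__((vector_size(16)));
+	union { v4sf v; float f[4]; };
+};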
+inline void parabolicInterpolation4(
+	const SSE_f &x0, const SSE_f &y0, const SSE_f &x1, const SSE_f &y1, const SSE_f &x2, const SSE_f &y2,
+	SSE_f &extremum_x, SSE_f &extremum_y, SSE_f &a, SSE_f &b, SSE_f &c)
+{
+	SSE_f x01; x01.v = _mm_sub_ps(x0.v, x1.v);
+	SSE_f x02; x02.v = _mm_sub_ps(x0.v, x2.v);
+	SSE_f x12; x12.v = _mm_sub_ps(x1.v, x2.v);
+	SSE_f t0; t0.v = _mm_div_ps(y0.v, _mm_mul_ps(x01.v, x02.v));
+	SSE_f t1; t1.v = _mm_div_ps(y1.v, _mm_mul_ps(x01.v, x12.v));
+	SSE_f t2; t2.v = _mm_div_ps(y2.v, _mm_mul_ps(x02.v, x12.v));
+	a.v = _mm_add_ps(_mm_sub_ps(t0.v, t1.v), t2.v);
+	b.v = _mm_sub_ps(_mm_mul_ps(_mm_add_ps(x0.v, x2.v), t1.v),
+	      _mm_add_ps(_mm_mul_ps(_mm_add_ps(x1.v, x2.v), t0.v),
+	                 _mm_mul_ps(_mm_add_ps(x0.v, x1.v), t2.v)));
+	c.v = _mm_add_ps(_mm_sub_ps(
+		_mm_mul_ps(_mm_mul_ps(x1.v, x2.v), t0.v),
+		_mm_mul_ps(_mm_mul_ps(x0.v, x2.v), t1.v)),
+		_mm_mul_ps(_mm_mul_ps(x0.v, x1.v), t2.v));
+	extremum_x.v = _mm_div_ps(b.v, _mm_mul_ps(a.v, _mm_set1_ps(-2.0f)));
+	extremum_y.v = _mm_sub_ps(c.v, _mm_div_ps(_mm_mul_ps(b.v, b.v), _mm_mul_ps(a.v, _mm_set1_ps(4.0f))));
+}
+</code>
+This takes roughly 60% less time than computing them sequentially: on a whole 1.2 Mpix image, it takes 20ms instead of 50ms.
+References:
+  * [[http://gcc.gnu.org/onlinedocs/gcc-4.4.2/gcc/X86-Built_002din-Functions.html#X86-Built_002din-Functions]] for a list of available instructions in GCC
+  * [[http://softpixel.com/~cwright/programming/simd/sse.php]] for a quick description of SSE instructions
+  * [[http://www.tommesani.com/Docs.html]] for a more detailed description
+Simpler names are available in GCC headers:
+  * SSE for float (<xmmintrin.h>, operand type _ _m128): _mm_add_ps, _mm_sub_ps, _mm_mul_ps, _mm_div_ps, _mm_sqrt_ps, _mm_rsqrt_ps, _mm_rcp_ps, _mm_load_ps, _mm_store_ps, _mm_set1_ps, _mm_setr_ps
+  * SSE2 for int (<emmintrin.h>, operand type _ _m128i): _mm_add_epi32, _mm_sub_epi32, _mm_set1_epi32, _mm_setr_epi32
+  * SSE2 int/float conversions: _ _builtin_ia32_cvtdq2ps, _ _builtin_ia32_cvtps2dq
+You also have to compile with the GCC flags -msse and -msse2, or with an -march value that enables them.
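As a small usage sketch of the intrinsics listed above, the function below processes two float arrays 4 elements at a time. It assumes an x86 target with SSE enabled. The name sqrtAdd is hypothetical. It uses _mm_loadu_ps and _mm_storeu_ps, the unaligned counterparts of _mm_load_ps and _mm_store_ps, so the arrays do not need to be 16-byte aligned; on older CPUs in particular, the unaligned variants can be noticeably slower than the aligned ones.

```cpp
#include <cmath>        // std::sqrt for the scalar tail
#include <xmmintrin.h>  // SSE: __m128 and the _mm_*_ps intrinsics

// Hypothetical helper: out[i] = sqrt(a[i]) + b[i] for i in [0, n).
inline void sqrtAdd(const float *a, const float *b, float *out, int n)
{
    int i = 0;
    // Vector part: 4 floats per iteration, no alignment requirement.
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(_mm_sqrt_ps(va), vb));
    }
    // Scalar tail for the remaining n % 4 elements.
    for (; i < n; ++i)
        out[i] = std::sqrt(a[i]) + b[i];
}
```

If you control the allocation and can guarantee 16-byte alignment, switching to _mm_load_ps and _mm_store_ps is the natural next step.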
 ====== Measuring performance ======
@@ Line 186: / Line 240: @@
 #include <sys/time.h>
 struct timeval tv;
-struct timezone tz;
+gettimeofday(&tv,NULL);
-gettimeofday(&tv,&tz);
 unsigned microseconds = tv.tv_sec*1000000 + tv.tv_usec; // beware overflows