Project

General

Profile

Actions

Bug #544

closed

execution time for enabled tracepoints has increased in 2.2.0-rc2 compared to 2.1.1

Added by Stanislav Vovk almost 11 years ago. Updated about 10 years ago.

Status:
Won't fix
Priority:
Normal
Target version:
-
Start date:
05/27/2013
Due date:
% Done:

0%

Estimated time:

Description

Commit id (2.2.0-rc2) ============
userspace-rcu : 8ff97cc3836c6a2739b62af0a859511c9c284e4f
lttng-ust : 996aeadece97a55e7f48ca1e73958a53de62ed4f
lttng-tools : 7f2a80589c0b04627eaa351a9debff202836f930

Commit id (2.1.1) ============
userspace-rcu : 3ff667f1ed404609bed536a8b926b90385153090
lttng-ust (2.1.2): 3f7a0876566a432ad2cc2b27a2725641e36da92d
lttng-tools : 959f036852e1eef40b2fb201a5b2914285a0abbe

Problem Description: ====================
Basically when comparing two above versions on PowerPC(LSI ACP34xx) it seems that time to execute one enabled tracepoint has increased by a factor of ~2,3. Test was done as following. One tracepoint is executed inside a loop a million times. Time is taken before entering the loop and right after exiting the loop. Then a difference and average time is calculated. Application sleeps 2 seconds at start(before time i taken) to wait that everything is initialized.

Same test were run in the same environment and build with same toolchain and sysroot, the only thing different is LTTng version. For detailed environment specifications see further down.

LTTng was configured as below,
root@du1:~# lttng list s1
Tracing session s1: [active]
Trace path: /root/out

=== Domain: UST global ===

Channels:
-------------
- ch1: [enabled]

Attributes:
overwrite mode: 1
subbufers size: 4096
number of subbufers: 4
switch timer interval: 0
read timer interval: 200
output: mmap()
Events:
sample_component:trace1 (type: tracepoint) [enabled]

Just to get an idea of execution times look at output from test program below.

LTTng 2.1.1:
root@du1:~# ./ltt-test-2.1.1 -l 1000000 -t 3
Benchmark test starts now...
Time it took to loop 1000000 times : 3 s 438450550 ns | [ 3438 ns per tp ]
Time it took to loop 1000000 times : 3 s 548152600 ns | [ 3548 ns per tp ]
Time it took to loop 1000000 times : 3 s 486106110 ns | [ 3486 ns per tp ]

LTTng 2.2.0-rc2:
root@du1:~# ./ltt-test-2.2.0 -l 1000000 -t 3
Benchmark test starts now...
Time it took to loop 1000000 times : 8 s 566419650 ns | [8566 ns per tp]
Time it took to loop 1000000 times : 8 s 594758380 ns | [8594 ns per tp]
Time it took to loop 1000000 times : 8 s 550842610 ns | [8550 ns per tp]

Target Environment ==================
root@du1:~# uname -a
Linux du1 3.4.34-rt40-00001-g7f4955b #1 SMP PREEMPT RT Wed, 22 May 2013 12:58:10 +0200 ppc GNU/Linux

root@du1:~# /lib/libc.so.6
GNU C Library (Wind River Linux Sourcery CodeBench 4.6a-98) stable release version 2.15, by Roland McGrath et al.
Copyright (C) 2012 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 4.6.3.
Compiled on a Linux 3.4.11 system on 2013-01-15.
Available extensions:
crypt add-on version 2.1 by Michael Glad and others
GNU Libidn by Simon Josefsson
Native POSIX Threads Library by Ulrich Drepper et al
BIND-8.2.3-T5B
libc ABIs: UNIQUE IFUNC
For bug reporting instructions, please see:
<>.

root@du1:~# cat /proc/cpuinfo
processor : 0
cpu : 476dd2
clock : 1600.000000MHz
revision : 32.192 (pvr 11a5 20c0)
bogomips : 200.00

processor : 1
cpu : 476dd2
clock : 1600.000000MHz
revision : 32.192 (pvr 11a5 20c0)
bogomips : 200.00

processor : 2
cpu : 476dd2
clock : 1600.000000MHz
revision : 32.192 (pvr 11a5 20c0)
bogomips : 200.00

processor : 3
cpu : 476dd2
clock : 1600.000000MHz
revision : 32.192 (pvr 11a5 20c0)
bogomips : 200.00

total bogomips : 800.00
timebase : 100000000
platform : ACP3400DUS41
model : ibm,acpx1-4xx
Memory : 1792 MB

Is problem reproducible? =========================
Always

Perf stats ==========================
See attached.


Files

perf-stats.txt (10.2 KB) perf-stats.txt Stanislav Vovk, 05/27/2013 04:25 AM
sample.tgz (4.64 KB) sample.tgz Stanislav Vovk, 05/29/2013 03:33 AM
perf-annotate-2.1.1.txt (300 KB) perf-annotate-2.1.1.txt Stanislav Vovk, 05/30/2013 07:47 AM
perf-annotate-2.1-stable.txt (684 KB) perf-annotate-2.1-stable.txt Stanislav Vovk, 05/30/2013 07:47 AM
perf-report-2.1.1.txt (25.4 KB) perf-report-2.1.1.txt Stanislav Vovk, 05/30/2013 07:47 AM
perf-report-2.1-stable.txt (49.1 KB) perf-report-2.1-stable.txt Stanislav Vovk, 05/30/2013 07:47 AM
sample2.tgz (3.43 KB) sample2.tgz simple test program Stanislav Vovk, 07/30/2013 09:51 AM
Actions #1

Updated by Mathieu Desnoyers almost 11 years ago

  • Project changed from LTTng to LTTng-UST
Actions #2

Updated by Mathieu Desnoyers almost 11 years ago

FYI, the reports given by "perf record appname" and then "perf annotate" and "perf report" would be more helpful here: they provide a breakdown (per function) of the most used functions.

Actions #3

Updated by Mathieu Desnoyers almost 11 years ago

  • Assignee set to Mathieu Desnoyers

I tested with the following conditions:

x86-64
2-core Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz with hyperthreading

I compared master branch and stable-2.1:

master branch:
userspace-rcu 8ff97cc3836c6a2739b62af0a859511c9c284e4f
lttng-ust 996aeadece97a55e7f48ca1e73958a53de62ed4f
lttng-tools 7f2a80589c0b04627eaa351a9debff202836f930

stable-2.1:
userspace-rcu 8ff97cc3836c6a2739b62af0a859511c9c284e4f
lttng-ust 75ecf151e7571c682bb302e8d2206e22bdaa62b2
lttng-tools 8a78c3baad2d83207c672381826668751b340b29

With:

  508  lttng create
  509  lttng enable-channel test -u --overwrite
  510  lttng enable-event -u -a -c test
  511  lttng start

and then running 3 times:

time ./hello
(hello test program within tests/hello in lttng-ust)

The results are always within:

master:

compudj@thinkos:~/git/lttng-ust/tests/hello$ (git:master $)> time ./hello 
Hello, World!
Tracing...  done.

real    0m1.040s
user    0m0.700s
sys    0m0.308s
compudj@thinkos:~/git/lttng-ust/tests/hello$ (git:master $)> time ./hello 
Hello, World!
Tracing...  done.

real    0m1.123s
user    0m0.672s
sys    0m0.416s
compudj@thinkos:~/git/lttng-ust/tests/hello$ (git:master $)> time ./hello 
Hello, World!
Tracing...  done.

real    0m1.005s
user    0m0.704s
sys    0m0.280s

Stable 2.1:

compudj@thinkos:~/git/lttng-ust/tests/hello$ (git:stable-2.1 $)> time ./hello 
Hello, World!
Tracing...  done.

real    0m1.171s
user    0m0.732s
sys    0m0.404s
compudj@thinkos:~/git/lttng-ust/tests/hello$ (git:stable-2.1 $)> time ./hello 
Hello, World!
Tracing...  done.

real    0m1.108s
user    0m0.708s
sys    0m0.376s
compudj@thinkos:~/git/lttng-ust/tests/hello$ (git:stable-2.1 $)> time ./hello 
Hello, World!
Tracing...  done.

real    0m1.015s
user    0m0.708s
sys    0m0.288s

Can you try reproducing with these commits ?

Thanks,

Mathieu

Actions #4

Updated by Stanislav Vovk almost 11 years ago

Regarding Perf tool, we have problems with it on our target, it doesn't work correct and it won't record any samples. So I cant provide 'perf report' and 'perf annotate'. If we'll get it working any time soon I'll provide the needed info.

I tested commits suggested and I got the same results as you did. The time did not differ as previously between these commits. See below,

master:

root@du1:~# time ./ltt-test-2.2.0 -l 1000000 -t 1
Benchmark test starts now...
Time it took to loop 1000000 times : 8 s 678479180 ns | [8678 ns per tp]

real 0m10.750s
user 0m7.924s
sys 0m0.436s
root@du1:~# time ./ltt-test-2.2.0 -l 1000000 -t 1
Benchmark test starts now...
Time it took to loop 1000000 times : 9 s 70089336 ns | [9070 ns per tp]

real 0m11.108s
user 0m8.152s
sys 0m0.464s
root@du1:~# time ./ltt-test-2.2.0 -l 1000000 -t 1
Benchmark test starts now...
Time it took to loop 1000000 times : 8 s 682388555 ns | [8682 ns per tp]

real 0m10.722s
user 0m7.976s
sys 0m0.408s

2.1-stable:

root@du1:~# time ./ltt-test-2.1-s -l 1000000 -t 1
Benchmark test starts now...
Time it took to loop 1000000 times : 9 s 258012743 ns | [ 9258 ns per tp ]

real 0m11.290s
user 0m8.424s
sys 0m0.556s
root@du1:~# time ./ltt-test-2.1-s -l 1000000 -t 1
Benchmark test starts now...
Time it took to loop 1000000 times : 9 s 465164986 ns | [ 9465 ns per tp ]

real 0m11.495s
user 0m8.392s
sys 0m0.456s

Going back to the problem though, it seems that the difference can now be narrowed down to just a few commits, between tags stable-2.1 and v2.1.2 in UST the difference in time is still visible, as before.

Can you compare these 2 commits?

Ust:
stable-2.1 : 75ecf151e7571c682bb302e8d2206e22bdaa62b2
v2.1.2 : 3f7a0876566a432ad2cc2b27a2725641e36da92d
(other commit id's are same as in initial mail, )

stable-2.1:

root@du1:~# time ./ltt-test-2.1-s -l 1000000 -t 1
Benchmark test starts now...
Time it took to loop 1000000 times : 8 s 958886297 ns | [ 8958 ns per tp ]

real 0m10.990s
user 0m8.240s
sys 0m0.416s
root@du1:~# time ./ltt-test-2.1-s -l 1000000 -t 1
Benchmark test starts now...
Time it took to loop 1000000 times : 9 s 405790878 ns | [ 9405 ns per tp ]

real 0m11.436s
user 0m8.428s
sys 0m0.508s

v2.1.2:

root@du1:~# time ./ltt-test-2.1-s -l 1000000 -t 1
Benchmark test starts now...
Time it took to loop 1000000 times : 3 s 633527800 ns | [ 3633 ns per tp ]

real 0m5.661s
user 0m3.188s
sys 0m0.304s
root@du1:~# time ./ltt-test-2.1-s -l 1000000 -t 1
Benchmark test starts now...
Time it took to loop 1000000 times : 3 s 706052203 ns | [ 3706 ns per tp ]

real 0m5.733s
user 0m3.172s
sys 0m0.384s

Thanks,
Stanislav

Actions #5

Updated by Mathieu Desnoyers almost 11 years ago

> Can you compare these 2 commits?
> 
> Ust:
> stable-2.1 : 75ecf151e7571c682bb302e8d2206e22bdaa62b2
> v2.1.2 : 3f7a0876566a432ad2cc2b27a2725641e36da92d
> (other commit id's are same as in initial mail, )

On my Intel i7:

commit 75ecf151e7571c682bb302e8d2206e22bdaa62b2 (stable-2.1)

compudj@thinkos:~/git/lttng-ust/tests/hello$ (git:(75ecf15...) $)> time ./hello Hello, World!
Tracing...  done.

real    0m1.082s
user    0m0.736s
sys    0m0.316s
compudj@thinkos:~/git/lttng-ust/tests/hello$ (git:(75ecf15...) $)> time ./hello Hello, World!
Tracing...  done.

real    0m1.006s
user    0m0.712s
sys    0m0.276s
compudj@thinkos:~/git/lttng-ust/tests/hello$ (git:(75ecf15...) $)> time ./hello Hello, World!
Tracing...  done.

real    0m1.102s
user    0m0.748s
sys    0m0.332s

commit 3f7a0876566a432ad2cc2b27a2725641e36da92d (v2.1.2)

compudj@thinkos:~/git/lttng-ust/tests/hello$ (git:(v2.1.2) $)> time ./hello 
Hello, World!
Tracing...  done.

real    0m1.032s
user    0m0.616s
sys    0m0.396s
compudj@thinkos:~/git/lttng-ust/tests/hello$ (git:(v2.1.2) $)> time ./hello 
Hello, World!
Tracing...  done.

real    0m1.132s
user    0m0.788s
sys    0m0.320s
compudj@thinkos:~/git/lttng-ust/tests/hello$ (git:(v2.1.2) $)> time ./hello 
Hello, World!
Tracing...  done.

real    0m1.107s
user    0m0.764s
sys    0m0.324s

No significant difference between the two.

I suspect it could be your test program. Can you attach it to the bug
report ?

The only commit that could possibly have a performance impact would be
the following, but I don't quite see how it can be an issue:

commit 88b66998c135f5df377a1933f17fa6652c05eeae
Author: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Date:   Tue Mar 26 08:18:18 2013 -0400

    Fix: _LGPL_SOURCE rcu dereference fix

    Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

diff --git a/include/lttng/tracepoint-rcu.h
b/include/lttng/tracepoint-rcu.h
index bae1279..ee936ff 100644
--- a/include/lttng/tracepoint-rcu.h
+++ b/include/lttng/tracepoint-rcu.h
@@ -31,7 +31,7 @@

 #define tp_rcu_read_lock_bp    rcu_read_lock_bp
 #define tp_rcu_read_unlock_bp  rcu_read_unlock_bp
-#define tp_rcu_dereference_bp  rcu_dereference_bp
+#define tp_rcu_dereference_bp  rcu_dereference
 #define TP_RCU_LINK_TEST()     1

 #else  /* _LGPL_SOURCE */

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

Actions #6

Updated by Stanislav Vovk almost 11 years ago

Attaching source for my test program. Executed with options '-l' and '-t' so only code inside main() is run.

Updated by Stanislav Vovk almost 11 years ago

We managed to get perf record work with some events, recorded following events,
./perf record -e cpu-clock,task-clock,page-faults,minor-faults:u,major-faults,cs,migrations,alignment-faults,emulation-faults

Attaching 'report' and 'annotate'.

/Stanislav

Actions #8

Updated by Mathieu Desnoyers almost 11 years ago

Hi Stanislav,

I've tried your test program in sample.tgz, and got:

compudj@thinkos:~/test/cpp/sample$ ./ltt-test -l -t
Benchmark test starts now...
Floating point exception

Looking at the code, it seems that:

struct timespec average(struct timespec *av, int s)
{
  ...
  temp.tv_nsec = (temp.tv_nsec + temp.tv_sec * 1000000000) / s;
  ...
}

is probably the culprit. Also, you might want to reconsider these calculations in the light of the tv_nsec type, which is possibly a 32-bit integer on your architecture. And I don't quite get what this line is trying to achieve.

Thanks,

Mathieu

Actions #9

Updated by Mathieu Desnoyers almost 11 years ago

  • Status changed from New to Feedback
Actions #10

Updated by Stanislav Vovk over 10 years ago

Hi Mathieu,

Now I got back from vacation and took a look at this issue. The program code I sent you earlier was bigger then it should have been. So I simplified it to just a few lines to avoid errors and misunderstanding. Attaching it here. If you find some time try comparing the above two versions of ust please do.
Run the test program with
./ltt-test -l 1000000

Got these times running this program:

Ust master :

root@du1:~# time ./ltt-test -l 1000000
Benchmark test starts now...
Time it took to loop 1000000 times : 7 s 917002836 ns | [ 7917 ns per tp ]

real 0m7.937s
user 0m7.608s
sys 0m0.252s
root@du1:~# time ./ltt-test -l 1000000
Benchmark test starts now...
Time it took to loop 1000000 times : 7 s 805285265 ns | [ 7805 ns per tp ]

real 0m7.827s
user 0m7.432s
sys 0m0.232s

Ust 2.1.1 :

root@du1:~# time ./ltt-test_2.1 -l 1000000
Benchmark test starts now...
Time it took to loop 1000000 times : 3 s 312656932 ns | [ 3312 ns per tp ]

real 0m3.340s
user 0m3.032s
sys 0m0.228s
root@du1:~# time ./ltt-test_2.1 -l 1000000
Benchmark test starts now...
Time it took to loop 1000000 times : 3 s 276565088 ns | [ 3276 ns per tp ]

real 0m3.303s
user 0m2.924s
sys 0m0.300s

Actions #11

Updated by Mathieu Desnoyers over 10 years ago

Hi Stanislav,

I tried you program on two Intel machines without seeing different results between the two set of commits you tried.

It stays around 700ns on my i7 laptop, and it's about 1360ns on an Intel Xeon 2.0GHz.

Machine cpuinfo:

processor    : 0
vendor_id    : GenuineIntel
cpu family    : 6
model        : 58
model name    : Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz
stepping    : 9
microcode    : 0x17
cpu MHz        : 1200.000
cache size    : 4096 KB
physical id    : 0
siblings    : 4
core id        : 0
cpu cores    : 2
apicid        : 0
initial apicid    : 0
fpu        : yes
fpu_exception    : yes
cpuid level    : 13
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms
bogomips    : 5786.71
clflush size    : 64
cache_alignment    : 64
address sizes    : 36 bits physical, 48 bits virtual
power management:
processor    : 0
vendor_id    : GenuineIntel
cpu family    : 6
model        : 23
model name    : Intel(R) Xeon(R) CPU           E5405  @ 2.00GHz
stepping    : 6
microcode    : 0x606
cpu MHz        : 2000.169
cache size    : 6144 KB
physical id    : 0
siblings    : 4
core id        : 0
cpu cores    : 4
apicid        : 0
initial apicid    : 0
fpu        : yes
fpu_exception    : yes
cpuid level    : 10
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl vmx tm2 ssse3 cx16 xtpr pdcm dca sse4_1 lahf_lm dtherm tpr_shadow vnmi flexpriority
bogomips    : 4001.01
clflush size    : 64
cache_alignment    : 64
address sizes    : 38 bits physical, 48 bits virtual
power management:

We have a couple of choices to try to narrow down this regression, but I think that the easiest, considering that I cannot reproduce it here, is that you try to bissect which commit from which project introduced the regression on PowerPC. It would be very much appreciated.

Otherwise, it will have to wait until we get our own PowerPC up and running here.

Thanks,

Mathieu

Actions #12

Updated by Stanislav Vovk over 10 years ago

I see what you mean.
PowerPC is not really our final product HW, we just use it while waiting for real hw to come which is ARM Cortex-A15. We plan to utilize the new hw in the end of august. While waiting for A15 HW I can hopefully if time allows me try to narrow down this to a certain commit. Of course I will also test if this issue exists on A15 later.

So I guess for now we can put this on hold for a while.

Cheers,
Stanislav

Actions #13

Updated by Mathieu Desnoyers about 10 years ago

  • Status changed from Feedback to Won't fix

Please reopen of this issue can be reproduced with more recent versions.

Actions

Also available in: Atom PDF