Bug #497
closedlttng 2.2.0rc1: Size of session directory is very inconsistent when repeating the same test condition several time
100%
Description
Problem Description: ==================== * Not sure which project (ust or tools) this one should be reported to... * (Streaming is used when this issue has been encountered) When repeating the same test condition a couple of time, we notice that the total filesize of the session directory varies quite a bit and a propper explaination can not be deducted for why they differ. It ranges from ~300KB to 1.4MB even if the session only contain a close amount of trace logs. The following summary is extracted from the printouts of the last 2 lines of the test instructions below at different run number: : babeltrace $TC | grep -c $DOMAIN_EVENT du -h $TC | tail -n 1 --switch-timer Run# Number of lines Size of session seen by babeltrace directory ============== ==== ================== =============== 1000000 1 1760 1.4M 1000000 2 2002 0.38M 1000000 3 1691 1.5M 0 1 1655 1.4M 0 2 2064 0.24M 0 3 1943 0.78M * When babeltrace run1 (switchTimer 1sec) above, the total file size of the output of babeltrace is only about 400K. Why is the corresponding CTF binary take more than triple that size ? * When trying to run the above without streaming, another fault is encountered (will be reported in a separated bug) and therefore, no comparision can be made. Is problem reproducible ? ========================= * yes How to reproduce (if reproducible): =================================== export TC=LTTNGgbf_017_8kAppFirst_1s_Pid_buf16k_100ps_run1 DOMAIN_EVENT="com_ericsson_cba_trace_testapp100_hightraf:HundredPerSec" sleep 1 sar -u 1 300 >> $TC.load.log & netstat -etanp | head -n 2 ; netstat -etanp |egrep 'lttng' sleep 1 echo "Start to launch apps when no session is active." >> $TC.load.log #---------------------------------------------------------------------- date; time for n in $(seq 1 8000); do (/home/test_apps/TestApp_100psOnly 200 np &); done #---------------------------------------------------------------------- echo "End of launching apps when no session is active" >> $TC.load.log ps -ef |grep -c TestApp sleep 1 #---------------------------- Get PID to trace on -------------------------------------------------- export my_pid=$(ps -e |grep TestApp | tail -n 1 | awk {'print $1'}); echo $my_pid sleep 10 ps -ef |egrep 'lttng' lttng-relayd -C tcp://0.0.0.0:53000 -D tcp://0.0.0.0:53001 -o /cluster/temp/log/$TC & echo "Create the session." >> $TC.load.log date; lttng create $TC -U net://192.168.0.1:53000:53001/./ lttng list sleep 5 echo "Enable channel...." >> $TC.load.log date; lttng enable-channel chan1 -u -s $TC --buffers-uid --subbuf-size 16384 --switch-timer 1000000 echo "Enable channel done." >> $TC.load.log echo "Enable channel metadata...." >> $TC.load.log date; lttng enable-channel metadata -u -s $TC --buffers-uid --subbuf-size 16384 --switch-timer 1000000 echo "Enable channel metadata done." >> $TC.load.log sleep 5 lttng add-context -t vpid -u -s $TC echo "Enable Event...." >> $TC.load.log date; lttng enable-event -s $TC -u -c chan1 $DOMAIN_EVENT --filter "PID == $my_pid" echo "Enable Event done." >> $TC.load.log sleep 5 echo "Activate the session." >> $TC.load.log #---------------------------------------------------- activate session ----------------------- date; lttng start; echo "Activate session Done." >> $TC.load.log sleep 1 lttng list $TC sleep 1 ps -e |grep -c TestApp dir -R $TC/SC-1/ust/uid/0/64-bit sleep 1 dir -R $TC/SC-1/ust/uid/0/64-bit sleep 1 dir -R $TC/SC-1/ust/uid/0/64-bit sleep 1 dir -R $TC/SC-1/ust/uid/0/64-bit sleep 1 dir -R $TC/SC-1/ust/uid/0/64-bit sleep 5 dir -R $TC/SC-1/ust/uid/0/64-bit echo "Stopping the session..." >> $TC.load.log #---------------------------------------------------- De-activate session ----------------------- lttng stop echo "Stopping done." >> $TC.load.log sleep 1 dir -R $TC/SC-1/ust/uid/0/64-bit sleep 5 echo "Destroying the session..." >> $TC.load.log lttng destroy echo "Destroy done! " >> $TC.load.log sleep 5 dir -R $TC/SC-1/ust/uid/0/64-bit pkill relayd babeltrace $TC | head -n 10 sleep 1 dir /cluster/dumps sleep 5 echo "Killing TestApp" >> $TC.load.log ps -e |grep -c TestApp pkill TestApp echo "Killing Done ." >> $TC.load.log ps -e |grep -c TestApp sleep 5 pkill sar cat $TC.load.log rm $TC.load.log babeltrace $TC | grep -c $DOMAIN_EVENT du -h $TC | tail -n 1 Commit used: ============ babeltrace : 740aad2 (HEAD, origin/master, origin/HEAD) Show token in CTF parser error lttng-tools : 500c239 (HEAD, tag: v2.2.0-rc1, origin/master, origin/HEAD) Update version to v2.2.0-rc1 lttng-ust : 9c915ee (HEAD, tag: v2.2.0-rc1, origin/master, origin/HEAD) Version 2.2.0-rc1 userspace-rcu : d107390 (HEAD, origin/master, origin/HEAD, globalBuffer) Add tab to output in ... Any other information: ====================== -
Files
Updated by Mathieu Desnoyers over 11 years ago
- Project changed from LTTng-UST to LTTng-tools
Updated by David Goulet over 11 years ago
- Status changed from New to Confirmed
- Target version set to 2.2
Updated by Mathieu Desnoyers over 11 years ago
Can you attach a copy of:
- one small trace,
- one "too large" trace,
It might help us in our investigation.
Updated by Mathieu Desnoyers over 11 years ago
- Status changed from Confirmed to Feedback
Updated by Mathieu Desnoyers over 11 years ago
- Status changed from Feedback to In Progress
My guess is that the space is wasted by padding added when issuing the periodical timer.
For instance, let's take an example: let's assume buffers have about 40 bytes of packet header (approximation), and let's suppose we log an event that weight 30 bytes.
If we have a program tracing one such event per second, and a switch-timer every second, on a machine with 4kB pages, each event will weight 4kB, most of this wasted to padding: 4096-40-30 = 4026 bytes of padding per event.
One possible solution to this, now that babeltrace supports packets and traces smaller than the page size (since recent babeltrace commit ec32346491f3e11fbc802a04f7dfd3259e6a8b17), we could change the tracer output so it can write traces without padding. We'd have to think of the impact on splice() used for kernel tracing, but at least for user-space tracing, and tracing through network it should not degrade performance significantly.
Updated by Tan le tran over 11 years ago
- File run4_run5_terminal.log run4_run5_terminal.log added
- File run4.tar run4.tar added
- File run5.tar run5.tar added
I did not have the copies of the session directories requested. I have re-run the scenario (run-4 and run-5). Their corresponding CTF files and terminal logs are attached .
Updated by Mathieu Desnoyers over 11 years ago
Hi Tan,
Please try with commit:
commit d92ff3ef25e70f4ffad582e24bfb581cfdcea04f
Author: David Goulet <dgoulet@efficios.com>
Date: Fri Apr 12 10:21:51 2013 -0400
Fix: use channel per domain default values
It ensures the "--read-timer" is set to 0 for UST default. (note: it's read-timer, not switch-timer, that had the wrong default value there).
The amount of padding you are seeing can be caused by having --switch-timer set to a non-zero value. What is weird, though, is that your traces generated with --switch-timer set as 0 also have lots of padding.
So could you make sure you try both with --switch-timer as default (for ust: 0), and with a --switch-timer 1000000 ?
Also, can you provide your test application ?
Updated by Mathieu Desnoyers over 11 years ago
- Status changed from In Progress to Feedback
Updated by Tan le tran over 11 years ago
- File Apr22_bug497_sessionDirs.tar Apr22_bug497_sessionDirs.tar added
- File Apr22_runs_terminal.log Apr22_runs_terminal.log added
- File TestApp_100perSecOnly.tar TestApp_100perSecOnly.tar added
Our lttng has been rebuilt with latest from origin/master as of apr-22 except rcu is one commit behind (origin/stable-0.7): rcu : 8a97780 (HEAD) Fix: tests/api.h use cpuset.h ust : a16877a (HEAD, origin/master, origin/HEAD) Typo fix in README tools : 6bf5e7c (HEAD, origin/master, origin/HEAD) Fix: remove mention of trace ... babeltrace : d088404 (HEAD, origin/master, origin/HEAD) Use objstack for AST allocation When default --switchTimer is used (run7): ========================================== Scenario is to launch 8000 instances of TestApp, then activate the session (tracing on an event of a specific PID that would produce 100 hits/sec). At each sec (for the next 10 sec), check the file size of the session dir. Stop the session (it takes roughly about 4sec to stop). Check the file size of the session dir again. Check babeltrace to see how many events got recorded per CPU and compare that number with the corresponding channel file size. (complete scenario can be seen from the attached terminal log) This is what has been observed: file size after size after # of events 1st 10 sec stop for this cpu ==== ========== ========== ============ : chan1_2 32K ~53K 929 : chan1_8 0 ~1081K 650 chan1_9 16K ~20K 442 : Notice the huge file size of chan1_8 for a relatively small number of events captured. This only occurs when "lttng stop" is given. When --switchTimer 1000000 is used (run1): ========================================== Same scenario as above. file size after size after # of events 1st 10 sec stop for this cpu ==== ========== ========== ============ : chan1_5 ~28K 28K 273 chan1_6 ~12K 516K 337 chan1_7 ~45K 45K 510 : Notice the file size of chan1_6 is 10 times bigger than chan1_7, but has a lot less log data. File size are drastically increased when the session is stopped. (note, from the log, it shows that it takes about 4 sec to perform "lttng stop").
Updated by Mathieu Desnoyers over 11 years ago
- File fix-per-uid-flush.patch fix-per-uid-flush.patch added
- Assignee set to Mathieu Desnoyers
Hi Tan,
The patch attached should fix the issue. Can you give it a try ? As soon as we receive confirmation that it indeed fix the problem, we'll push it into lttng-tools.
Thanks!
Mathieu
Updated by Tan le tran over 11 years ago
Hi Mathieu,
The patch seems to fix the problem reported in this bug.
It can be pushed now.
Thanks,
Tan
Updated by Mathieu Desnoyers over 11 years ago
- Status changed from Feedback to Resolved
- % Done changed from 0 to 100
Applied in changeset b34cbebfc8e60a678782bcffdf4ff8fc12de037e.