Project

General

Profile

Actions

Bug #497

closed

lttng 2.2.0rc1: Size of session directory is very inconsistent when repeating the same test condition several time

Added by Tan le tran about 11 years ago. Updated almost 11 years ago.

Status:
Resolved
Priority:
Normal
Target version:
Start date:
04/09/2013
Due date:
% Done:

100%

Estimated time:

Description

Problem Description:
====================
 * Not sure which project (ust or tools) this one should be reported to...

 * (Streaming is used when this issue has been encountered)
   When repeating the same test condition a couple of time, we notice that the total filesize
   of the session directory varies quite a bit and a propper explaination can not be deducted 
   for why they differ. It ranges from ~300KB to 1.4MB even if the session only contain 
   a close amount of trace logs.

   The following summary is extracted from the printouts of the last 2 lines 
   of the test instructions below at different run number:
       :
       babeltrace $TC | grep -c $DOMAIN_EVENT
       du -h $TC | tail -n 1 

       --switch-timer   Run#   Number of lines     Size of session
                               seen by babeltrace  directory
       ==============   ====   ==================  ===============
         1000000        1      1760                1.4M
         1000000        2      2002                0.38M
         1000000        3      1691                1.5M
         0              1      1655                1.4M
         0              2      2064                0.24M
         0              3      1943                0.78M

 * When babeltrace run1 (switchTimer 1sec) above, the total file size
   of the output of babeltrace is only about 400K. Why is the corresponding
   CTF binary take more than triple that size ?

 * When trying to run the above without streaming, another fault is encountered
   (will be reported in a separated bug) and therefore, no comparision can be made.

Is problem reproducible ?
=========================
  * yes 

How to reproduce (if reproducible):
===================================
       export TC=LTTNGgbf_017_8kAppFirst_1s_Pid_buf16k_100ps_run1
       DOMAIN_EVENT="com_ericsson_cba_trace_testapp100_hightraf:HundredPerSec" 

       sleep 1
       sar -u 1 300 >> $TC.load.log &
       netstat -etanp | head -n 2 ; netstat -etanp |egrep 'lttng'
       sleep 1
       echo "Start to launch apps when no session is active." >> $TC.load.log
       #----------------------------------------------------------------------
       date; time for n in $(seq 1 8000); do (/home/test_apps/TestApp_100psOnly 200 np &); done
       #----------------------------------------------------------------------
       echo "End of launching apps when no session is active" >> $TC.load.log
       ps -ef |grep -c TestApp
       sleep 1

       #---------------------------- Get PID to trace on --------------------------------------------------
       export my_pid=$(ps -e |grep TestApp | tail -n 1 | awk {'print $1'}); echo $my_pid
       sleep 10

       ps -ef |egrep 'lttng'
       lttng-relayd  -C tcp://0.0.0.0:53000 -D tcp://0.0.0.0:53001  -o /cluster/temp/log/$TC &
       echo "Create the session." >> $TC.load.log
       date; lttng create $TC -U net://192.168.0.1:53000:53001/./
       lttng list
       sleep 5
       echo "Enable channel...." >> $TC.load.log
       date; lttng enable-channel chan1    -u -s $TC --buffers-uid --subbuf-size 16384 --switch-timer 1000000
       echo "Enable channel done." >> $TC.load.log
       echo "Enable channel metadata...." >> $TC.load.log
       date; lttng enable-channel metadata -u -s $TC --buffers-uid --subbuf-size 16384 --switch-timer 1000000 
       echo "Enable channel metadata done." >> $TC.load.log
       sleep 5
       lttng add-context -t vpid -u -s $TC
       echo "Enable Event...." >> $TC.load.log
       date; lttng enable-event -s $TC -u -c chan1 $DOMAIN_EVENT --filter "PID == $my_pid" 
       echo "Enable Event done." >> $TC.load.log
       sleep 5

       echo "Activate the session." >> $TC.load.log
       #---------------------------------------------------- activate session -----------------------
       date; lttng start;  
       echo "Activate session Done." >> $TC.load.log
       sleep 1
       lttng list $TC
       sleep 1
       ps -e |grep -c TestApp
       dir -R $TC/SC-1/ust/uid/0/64-bit  
       sleep 1       
       dir -R $TC/SC-1/ust/uid/0/64-bit  
       sleep 1       
       dir -R $TC/SC-1/ust/uid/0/64-bit  
       sleep 1       
       dir -R $TC/SC-1/ust/uid/0/64-bit  
       sleep 1       
       dir -R $TC/SC-1/ust/uid/0/64-bit  
       sleep 5
       dir -R $TC/SC-1/ust/uid/0/64-bit  
       echo "Stopping the session..." >> $TC.load.log
       #---------------------------------------------------- De-activate session -----------------------
       lttng stop
       echo "Stopping done." >> $TC.load.log
       sleep 1       
       dir -R $TC/SC-1/ust/uid/0/64-bit 
       sleep 5
       echo "Destroying the session..." >> $TC.load.log
       lttng destroy       
       echo "Destroy done! " >> $TC.load.log
       sleep 5       
       dir -R $TC/SC-1/ust/uid/0/64-bit 
       pkill relayd

       babeltrace $TC | head -n 10
       sleep 1
       dir /cluster/dumps
       sleep 5
       echo "Killing TestApp" >> $TC.load.log
       ps -e |grep -c TestApp
       pkill TestApp
       echo "Killing Done ." >> $TC.load.log
       ps -e |grep -c TestApp
       sleep 5
       pkill sar

       cat $TC.load.log

       rm $TC.load.log
       babeltrace $TC | grep -c $DOMAIN_EVENT
       du -h $TC | tail -n 1 

Commit used:
============
babeltrace    : 740aad2 (HEAD, origin/master, origin/HEAD) Show token in CTF parser error
lttng-tools   : 500c239 (HEAD, tag: v2.2.0-rc1, origin/master, origin/HEAD) Update version to v2.2.0-rc1
lttng-ust     : 9c915ee (HEAD, tag: v2.2.0-rc1, origin/master, origin/HEAD) Version 2.2.0-rc1
userspace-rcu : d107390 (HEAD, origin/master, origin/HEAD, globalBuffer) Add tab to output in ...

Any other information:
======================
- 


Files

TC17_node16_8KAppFirst_noTimer_TraceOn1pid_buf16k_100ps_a_b_c.log (97.9 KB) TC17_node16_8KAppFirst_noTimer_TraceOn1pid_buf16k_100ps_a_b_c.log Runs with streaming, SwitchTimer 0 Tan le tran, 04/09/2013 09:52 AM
TC17_node16_8KAppFirst_Timer1s_TraceOn1pid_buf16k_100ps_a_b_c.log (98.5 KB) TC17_node16_8KAppFirst_Timer1s_TraceOn1pid_buf16k_100ps_a_b_c.log Runs with streaming, SwitchTimer 1sec Tan le tran, 04/09/2013 09:52 AM
run4_run5_terminal.log (66 KB) run4_run5_terminal.log terminal log for run4 and 5 Tan le tran, 04/12/2013 08:50 AM
run4.tar (21.2 KB) run4.tar session dir Tan le tran, 04/12/2013 08:50 AM
run5.tar (24.5 KB) run5.tar session dir Tan le tran, 04/12/2013 08:50 AM
Apr22_bug497_sessionDirs.tar (78.9 KB) Apr22_bug497_sessionDirs.tar Session directories with ctf files Tan le tran, 04/22/2013 02:07 PM
Apr22_runs_terminal.log (130 KB) Apr22_runs_terminal.log Terminal logs Tan le tran, 04/22/2013 02:07 PM
TestApp_100perSecOnly.tar (415 KB) TestApp_100perSecOnly.tar TestApp_100perSecOnly (used in this bug report) Tan le tran, 04/22/2013 02:07 PM
fix-per-uid-flush.patch (5.83 KB) fix-per-uid-flush.patch Mathieu Desnoyers, 04/26/2013 03:27 PM
Actions #1

Updated by Mathieu Desnoyers about 11 years ago

  • Project changed from LTTng-UST to LTTng-tools
Actions #2

Updated by David Goulet about 11 years ago

  • Status changed from New to Confirmed
  • Target version set to 2.2
Actions #3

Updated by Mathieu Desnoyers about 11 years ago

Can you attach a copy of:

- one small trace,
- one "too large" trace,

It might help us in our investigation.

Actions #4

Updated by Mathieu Desnoyers about 11 years ago

  • Status changed from Confirmed to Feedback
Actions #5

Updated by Mathieu Desnoyers about 11 years ago

  • Status changed from Feedback to In Progress

My guess is that the space is wasted by padding added when issuing the periodical timer.

For instance, let's take an example: let's assume buffers have about 40 bytes of packet header (approximation), and let's suppose we log an event that weight 30 bytes.

If we have a program tracing one such event per second, and a switch-timer every second, on a machine with 4kB pages, each event will weight 4kB, most of this wasted to padding: 4096-40-30 = 4026 bytes of padding per event.

One possible solution to this, now that babeltrace supports packets and traces smaller than the page size (since recent babeltrace commit ec32346491f3e11fbc802a04f7dfd3259e6a8b17), we could change the tracer output so it can write traces without padding. We'd have to think of the impact on splice() used for kernel tracing, but at least for user-space tracing, and tracing through network it should not degrade performance significantly.

Updated by Tan le tran about 11 years ago

I did not have the copies of the session directories requested.

I have re-run the scenario (run-4 and run-5). Their corresponding CTF files and
terminal logs are attached .

Actions #7

Updated by Mathieu Desnoyers about 11 years ago

Hi Tan,

Please try with commit:

commit d92ff3ef25e70f4ffad582e24bfb581cfdcea04f
Author: David Goulet <>
Date: Fri Apr 12 10:21:51 2013 -0400

Fix: use channel per domain default values

It ensures the "--read-timer" is set to 0 for UST default. (note: it's read-timer, not switch-timer, that had the wrong default value there).

The amount of padding you are seeing can be caused by having --switch-timer set to a non-zero value. What is weird, though, is that your traces generated with --switch-timer set as 0 also have lots of padding.

So could you make sure you try both with --switch-timer as default (for ust: 0), and with a --switch-timer 1000000 ?

Also, can you provide your test application ?

Actions #8

Updated by Mathieu Desnoyers about 11 years ago

  • Status changed from In Progress to Feedback

Updated by Tan le tran almost 11 years ago

Our lttng has been rebuilt with latest from origin/master as of apr-22
except rcu is one commit behind (origin/stable-0.7):

   rcu        : 8a97780 (HEAD) Fix: tests/api.h use cpuset.h
   ust        : a16877a (HEAD, origin/master, origin/HEAD) Typo fix in README
   tools      : 6bf5e7c (HEAD, origin/master, origin/HEAD) Fix: remove mention of trace ...
   babeltrace : d088404 (HEAD, origin/master, origin/HEAD) Use objstack for AST allocation

When default --switchTimer is used (run7):
==========================================
  Scenario is to launch 8000 instances of TestApp, then 
  activate the session (tracing on an event of a specific PID that 
  would produce 100 hits/sec). 
  At each sec (for the next 10 sec), check the file size of the session dir.
  Stop the session (it takes roughly about 4sec to stop).
  Check the file size of the session dir again.
  Check babeltrace to see how many events got recorded per CPU and
  compare that number with the corresponding channel file size.
  (complete scenario can be seen from the attached terminal log)

  This is what has been observed:

  file     size after   size after   # of events
           1st 10 sec   stop         for this cpu
  ====     ==========   ==========   ============
  :
  chan1_2     32K           ~53K         929
  :
  chan1_8      0          ~1081K         650
  chan1_9     16K           ~20K         442
  :

  Notice the huge file size of chan1_8 for a relatively small number
  of events captured. This only occurs when "lttng stop" is given.

When --switchTimer 1000000 is used (run1):
==========================================
  Same scenario as above.

  file     size after   size after   # of events
           1st 10 sec   stop         for this cpu
  ====     ==========   ==========   ============
  :
  chan1_5     ~28K        28K          273
  chan1_6     ~12K       516K          337
  chan1_7     ~45K        45K          510
  :

  Notice the file size of chan1_6 is 10 times bigger than
  chan1_7, but has a lot less log data.

  File size are drastically increased when the session is stopped.
  (note, from the log, it shows that it takes about 4 sec to 
   perform "lttng stop").

Actions #10

Updated by Mathieu Desnoyers almost 11 years ago

Hi Tan,

The patch attached should fix the issue. Can you give it a try ? As soon as we receive confirmation that it indeed fix the problem, we'll push it into lttng-tools.

Thanks!

Mathieu

Actions #11

Updated by Tan le tran almost 11 years ago

Hi Mathieu,

The patch seems to fix the problem reported in this bug.
It can be pushed now.
Thanks,
Tan

Actions #12

Updated by Mathieu Desnoyers almost 11 years ago

  • Status changed from Feedback to Resolved
  • % Done changed from 0 to 100
Actions

Also available in: Atom PDF