Bug #539

closed

lttng-tools 2.2.0rc2: Possible memory leak from sessionD?

Added by Tan le tran over 11 years ago. Updated over 11 years ago.

Status:
Resolved
Priority:
Critical
Assignee:
-
Target version:
2.2
Start date:
05/22/2013
Due date:
% Done:

0%

Estimated time:

Description


Commit used:
============
userspace-rcu : 56e676d (HEAD, origin/stable-0.7) Document: rculfhash destroy and resize side-effect in 0.7
lttng-ust     : 352fce3 (HEAD, origin/master, origin/HEAD) Remove 0.x TODO
lttng-tools   : b31398b (HEAD, origin/master, origin/HEAD) Fix: increment UST channel refcount at stream creation
babeltrace    : 9eaf254 (HEAD, tag: v1.0.3, origin/stable-1.0) Version 1.0.3

Problem Description:
====================
 * There seems to be a memory leak in sessionD when many short-lived instrumented apps are launched
   over a period of time. At any time, we have about 700 instances of the instrumented app present.
   Each of those instances has only a 3 sec lifespan.

   After about a 3 minute run, the value under the "%MEM" column increases from 0.0 to 0.4 for
   sessionD:

   [SC-1:Xen43 Wed May 22 06:29:29/cluster/temp/tdlt] # top -b -n 1 | head -n 7 | tail -n 1; top -b -n 1 |grep -i sessiond
     PID USER      PR  NI  VIRT  RES  SHR S   %CPU %MEM    TIME+  COMMAND
   13616 root      20   0 85160  872  604 S      0  0.0   0:00.00 lttng-sessiond
   :
   :
   [SC-1:Xen43 Wed May 22 06:32:34/cluster/temp/tdlt] # top -b -n 1 | head -n 7 | tail -n 1; top -b -n 1 |grep -i sessiond
     PID USER      PR  NI  VIRT  RES  SHR S   %CPU %MEM    TIME+  COMMAND
   13616 root      20   0  190m 7292  688 S      7  0.4   0:05.73 lttng-sessiond
   :
   :
   [SC-1:Xen43 Wed May 22 06:37:19/cluster/temp/tdlt] # top
   top - 06:37:25 up 1 day, 16:20,  2 users,  load average: 2.60, 3.14, 1.55
   Tasks: 187 total,   1 running, 186 sleeping,   0 stopped,   0 zombie
   Cpu(s):  0.1%us,  0.5%sy,  0.0%ni, 99.2%id,  0.1%wa,  0.1%hi,  0.0%si,  0.0%st
   Mem:   2049828k total,   210928k used,  1838900k free,     2808k buffers
   Swap:  4200992k total,   784808k used,  3416184k free,    42348k cached
   :

 * On a different machine, we observed the following (note the total mem size of
   this machine is different from the one above):

   [SC-1:Node16 Tue May 21 13:22:33/cluster/temp/tdlt/Stability/May12] # top
   top - 13:23:15 up 11 days,  4:41,  2 users,  load average: 0.00, 0.01, 0.05
   Tasks: 288 total,   1 running, 287 sleeping,   0 stopped,   0 zombie
   Cpu(s):  0.0%us,  0.1%sy,  0.0%ni, 99.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
   Mem:     24149M total,     3461M used,    20687M free,      372M buffers
   Swap:     4102M total,        0M used,     4102M free,     2411M cached
   :

   [SC-1:Node16 Tue May 21 13:32:09/cluster/temp/tdlt/Stability/May12] # top -b -n 1 | head -n 7 | tail -n 1; top -b -n 1 |egrep -i "lttng|trace" 
      PID USER      PR  NI  VIRT  RES  SHR S   %CPU %MEM    TIME+  COMMAND
    15012 root      20   0  311m 137m  912 S      0  0.6   2:46.43 lttng-sessiond
    15373 root      20   0 82128 1484  956 S      0  0.0   1:06.09 lttng-consumerd
    18322 root      20   0 42264 2268 1808 S      0  0.0   1:11.81 TraceEa
    18342 root      20   0 42264  912  444 S      0  0.0   0:45.82 TraceEa
    20874 root      20   0  190m 3808 2596 S      0  0.0   0:40.07 trace_c
    20880 root      20   0 47220 2948 2040 S      0  0.0   0:10.89 trace_p

Is the problem reproducible?
============================
  * yes 

How to reproduce (if reproducible):
===================================
  1)_ Kill the current sessionD and relaunch a new one
  2)_ top -b -n 1 | head -n 7 | tail -n 1; top -b -n 1 |grep -i sessiond
  3)_ lttng list -u; lttng list    #--- no instrumented app running, no session available
  4)_ Run multiple instances of the instrumented app (each instance has only a 3 sec lifespan)
      so that there are around 700 instances present at any time.
      ex:
          for a in $(seq 1 1234); do (for n in $(seq 1 100); do (/home/test_apps/TestApp_100perSecOnly 3 np > /dev/null &); done; usleep 200000); done &
  5)_ Once in a while, repeat step-2 above (a small monitoring loop is sketched below).
      You will see that the number under "%MEM" starts increasing for sessionD.
  6)_ Stop the process from step-4. No more instrumented apps should be present.
  7)_ Repeat step-2. The %MEM used still remains.
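  For convenience, the step-2 check can be repeated with a small monitoring loop along
  these lines (a sketch only, not part of the original test procedure; it assumes the
  daemon process is named lttng-sessiond and that ps is available):

      # sample sessiond memory usage every 30 seconds
      while true; do
          date
          ps -o pid,rss,vsz,pmem,comm -C lttng-sessiond
          sleep 30
      done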

Any other information:
======================
-   

Files

terminal.log (13.4 KB) - terminal log - Tan le tran, 05/22/2013 07:14 AM
bug539.diff (5.13 KB) - David Goulet, 06/06/2013 03:50 PM
Jun10_retest.log (11.5 KB) - terminal log when the above test was carried out (update #14) - Tan le tran, 06/10/2013 07:42 AM
patch539_jun10_gitdiff.log (7.46 KB) - git diff after applying the patch - Tan le tran, 06/10/2013 11:44 AM
patch539_jun10_terminal.log (14.7 KB) - terminal log showing ps printout when the TC is executed - Tan le tran, 06/10/2013 11:44 AM
Actions #1

Updated by Jérémie Galarneau over 11 years ago

We're trying to reproduce the issue on our end.

Actions #2

Updated by Tan le tran over 11 years ago

Hi Jérémie,

Did you have a chance to look into this issue?
Thanks,
Tan

Actions #3

Updated by Jérémie Galarneau over 11 years ago

David has tried to reproduce the issue without success... However, we have identified another bug that may be related. We are still looking into it.
Could you run the sessiond under valgrind in "--leak-check=full" mode?

David has rebased a Valgrind-support patch on the current lttng-tools HEAD. It's available in his personal repository.

:dgoulet/lttng-tools-dev.git
branch: valgrind
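For reference, one possible invocation for such a run might look like the following (a
sketch only; --leak-check=full is the option requested above, the log file name is
arbitrary, and -vvv can be dropped if the output is too verbose):

    valgrind --leak-check=full --log-file=sessiond-valgrind.log lttng-sessiond -vvv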

Actions #4

Updated by Tan le tran over 11 years ago

Hi Jeremie,

Unfortunately, I cannot reproduce this with Valgrind. When sessionD is running under Valgrind, it is too slow to respond to the registration of the instrumented apps, causing many "Error: Timed out waiting for lttng-sessiond" messages, which in turn cause the instrumented apps to hang (ie: they cannot execute their main even after LTTNG_UST_REGISTER_TIMEOUT has elapsed). This new behaviour has been reported in bug-545.

I will try to rerun with a lower launch rate (ie: launch fewer instances of the instrumented app per second) and see
if the MEM usage still increases in the "top" printout.
Regards,
Tan

Actions #5

Updated by Jérémie Galarneau over 11 years ago

Okay. Reproducing it under a lighter load will absolutely make it easier to debug... Keep me posted.
Thanks for reporting #545 too!

Actions #6

Updated by Tan le tran over 11 years ago

When trying with a lower rate, the problem did not happen.
ex:
  for a in $(seq 1 28800); do (for n in $(seq 1 100); do (/home/test_apps/TestApp_100perSecOnly 3 np > /dev/null &);\
       done; usleep 800000); done &

  I ran the above for 8 hrs, launching 100 instances every 0.8 sec. The "top" printouts before and after
  the 8 hour run are pretty much the same (from the RES and %MEM columns):
  [SC-1:Node16 Mon May 27 15:40:05/cluster/temp/tdlt] # top -b -n 1 | head -n 7 | tail -n 1; top -b -n 1 |grep -i sessiond
  PID USER      PR  NI  VIRT  RES  SHR S   %CPU %MEM    TIME+  COMMAND
  712 root      20   0  157m 6892  664 S      2  0.0   0:00.28 lttng-sessiond
  :
  [SC-1:Node16 Tue May 28 06:51:22/cluster/temp/tdlt] # top -b -n 1 | head -n 7 | tail -n 1; top -b -n 1 |grep -i sessiond
  PID USER      PR  NI  VIRT  RES  SHR S   %CPU %MEM    TIME+  COMMAND
  712 root      20   0  236m 8720  708 S      0  0.0   7:08.17 lttng-sessiond

When changing the rate to 100 instances per 0.2 sec, I see the value under RES increase to 12 MB and stay there
even after no more apps are being launched (NOTE: there is no session defined, no session active):
ex:
    for a in $(seq 1 28800); do (for n in $(seq 1 100); do (/home/test_apps/TestApp_100perSecOnly 3 np > /dev/null &); \
       done; usleep 200000); done &

  <before launching apps>
  [SC-1:Node16 Tue May 28 06:51:22/cluster/temp/tdlt] # top -b -n 1 | head -n 7 | tail -n 1; top -b -n 1 |grep -i sessiond
    PID USER      PR  NI  VIRT  RES  SHR S   %CPU %MEM    TIME+  COMMAND
    712 root      20   0  236m 8720  708 S      0  0.0   7:08.17 lttng-sessiond
  :
  :
  <after launching apps>
  [SC-1:Node16 Tue May 28 06:52:45/cluster/temp/tdlt] # top -b -n 1 | head -n 7 | tail -n 1; top -b -n 1 |grep -i sessiond
  PID USER      PR  NI  VIRT  RES  SHR S   %CPU %MEM    TIME+  COMMAND
  712 root      20   0  235m  12m  708 S      7  0.1   7:08.52 lttng-sessiond

Actions #7

Updated by David Goulet over 11 years ago

That's interesting. I'm betting on an error path that does not clean up memory when some limit is reached.

During those tests with a large number of applications, do you know if any errors occur, such as the open file limit being reached for instance?

Actions #8

Updated by Tan le tran over 11 years ago

No, I did not observe any error while doing the test.

Actions #9

Updated by David Goulet over 11 years ago

  • Target version set to 2.2
Actions #10

Updated by David Goulet over 11 years ago

We've been running this use case (and variations) on a 16-core machine for more than 50 minutes and we are unable to reproduce it.

So, can you confirm that during this test NO sessions are created and the applications just register and unregister? Also, are you running health checks during that time?

I'm trying to find any "variables" that can change the state of the session daemon.

If none of the above apply, we'll have to come on site to help you guys. We have a small library that keeps track of malloc/free calls and their call sites.

Actions #11

Updated by Tan le tran over 11 years ago


We confirm that during this test (as described earlier in this bug-report):
  _ Our TraceEA process performs the following every 5 sec (a rough shell approximation is sketched at the end of this update):
        1)_ Call the following APIs:
             lttng_health_check(LTTNG_HEALTH_CMD)
             lttng_health_check(LTTNG_HEALTH_APP_MANAGE)
             lttng_health_check(LTTNG_HEALTH_APP_REG)
             lttng_health_check(LTTNG_HEALTH_CONSUMER)
        2)_ Create a dummy session via the following API calls:
             lttng_create_session()
             lttng_destroy_session()

             (NOTE: no channel or event is enabled)

  _ However, when we turn off the above TraceEA process, the same thing
    is still observed (ie: after a few minutes' run, we stop launching
    the apps, and the top or ps command printouts indicate that the resident
    memory for sessiond has increased by around 20 times its original
    size.)

  _ Other than the above, only a bunch of instrumented applications are
    launched and terminated. No real session (with enabled events) is
    created and started.
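  A rough shell approximation of the dummy-session part of that 5 sec poll, using the
  lttng CLI instead of the liblttng-ctl calls listed above (sketch only; the session
  name is arbitrary and there is no CLI equivalent for lttng_health_check):

        while true; do
            lttng create dummy_session > /dev/null    # lttng_create_session()
            lttng destroy dummy_session > /dev/null   # lttng_destroy_session()
            sleep 5
        done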

 
Actions #12

Updated by David Goulet over 11 years ago

Can this be reproduced with the session daemon in -vvv?

Actually, even just "-v" would be of great help. If not, I'll come on site to help you hunt this one.

Actions #13

Updated by David Goulet over 11 years ago

I'm pretty confident that this will fix the memory leak.

You'll have to apply this on master HEAD, at least at commit cfa9a5a2b4a96e0d6a9eeddd2622a6d7c173b7ac; I just pushed a fix that is needed for this patch to work.

Actions #14

Updated by Tan le tran over 11 years ago

Hi David,

We have loaded the latest lttng-tools and run the test. We still see the value under the "RSS" column increase for lttng-sessiond.

  Commit used:
  ============
   tools       : cfa9a5a (HEAD, origin/master, origin/HEAD) Fix: poll compat layer .....
   ust         : 844a160 (HEAD, origin/stable-2.2) Fix: liblttng-ust process startup hang....
   userspace   : 264716f (HEAD, origin/stable-0.7, stable-0.7) Fix: Use a filled signal ....

  The following values are from the RSS column of the command "ps auxwwwf":

 RSS     Description
 ---     -----------
 912     - After "pkill sessiond" and a brand new sessiond is launched
 7272    - After:
             for a in $(seq 1 40); do (for n in $(seq 1 100); do (/home/test_apps/TestApp_100perSecOnly 3 np > /dev/null &); done;

 110940  - After running a case (described below) where session activation is involved:
           1)_ Start launching TestApp as follows to give an average of 1000 TestApp
               instances present at any given time.
                for a in $(seq 1 100);do (date; ps -ef |grep -c TestApp; usleep 500000;  \
                for n in $(seq 1 300);do \
                (LTTNG_UST_REGISTER_TIMEOUT=100 /home/test_apps/TestApp_100perSecOnly 3 np > /dev/null 2>&1 & );\
                done); \
                done
           3)_ Sleep for 5 sec to let the rate of apps launched and terminated per sec stabilize.
           4)_ Create multiple sessions and trace on HundredPerSec, filtering on "OnePs < 10"
               (this should satisfy all PIDs; see the CLI sketch after these steps).
           5)_ Activate only ONE of the above sessions.
           6)_ Wait for 10 sec.
           7)_ Deactivate the session.
           8)_ Collect processor load if possible.
           9)_ Stop, then convert the session.
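           Steps 4 to 7 roughly correspond to a CLI sequence like the following (a sketch
           only; the session name and the provider prefix of the HundredPerSec event are
           placeholders, not taken from the report):

               lttng create leaktest
               lttng enable-event -u -s leaktest 'provider:HundredPerSec' --filter 'OnePs < 10'
               lttng start leaktest      # step 5: activate one session
               sleep 10                  # step 6
               lttng stop leaktest       # step 7: deactivate
               lttng destroy leaktest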

  Can we use the value under the RSS column (from the ps command) as an indication of a memory leak, or have we misinterpreted that value?

Actions #15

Updated by David Goulet over 11 years ago

Well, yes, RSS is the resident set size of the program. I can see in the log that it goes up to 100 MB... The head commit of tools does not show the fix I've provided (commit: adbbaabc265e354b7986b48815ee556f1a07769b). Is this because you might have forgotten it, or did you just apply the diff?

Actions #16

Updated by Tan le tran over 11 years ago

Ouch !

My Bad....my verrrry bad !

I thought I just needed the latest commit ..... didn't see the patch :-(
Applying the patch now ... will update you soon.

Actions #17

Updated by David Goulet over 11 years ago

No problem. :)

If it is still visible after using the patch, I'll come on site hunt down this thing!

Actions #18

Updated by Tan le tran over 11 years ago

Hi David,

I have applied the patch (from update #13) on top of commit cfa9a5a.
Unfortunately, we still observe that RSS increases after the test.

Please, see attached log for further detail.
Regards,
Tan

Actions #19

Updated by Tan le tran over 11 years ago

After loading lttng-tools v2.2.3 and lttng-ust 2.2.1, the above memory leak is no longer observed.
Many thanks for the great support.

Actions #20

Updated by David Goulet over 11 years ago

  • Status changed from Confirmed to Resolved