Bug #539
Closed - lttng-tools 2.2.0rc2: Possible memory leak from sessionD?
Description
Commit used:
============
userspace-rcu : 56e676d (HEAD, origin/stable-0.7) Document: rculfhash destroy and resize side-effect in 0.7
lttng-ust     : 352fce3 (HEAD, origin/master, origin/HEAD) Remove 0.x TODO
lttng-tools   : b31398b (HEAD, origin/master, origin/HEAD) Fix: increment UST channel refcount at stream creation
babeltrace    : 9eaf254 (HEAD, tag: v1.0.3, origin/stable-1.0) Version 1.0.3

Problem Description:
====================
* There seems to be a memory leak in sessionD when many short-lived instrumented apps are launched over a period of time. At any time, we have about 700 instances of the instrumented app present. Each instance has only a 3-sec lifespan. After about a 3-minute run, the value in the "%MEM" column increases from 0.0 to 0.4 for sessionD:

    [SC-1:Xen43 Wed May 22 06:29:29/cluster/temp/tdlt]
    # top -b -n 1 | head -n 7 | tail -n 1; top -b -n 1 |grep -i sessiond
      PID USER  PR NI  VIRT  RES SHR S %CPU %MEM   TIME+ COMMAND
    13616 root  20  0 85160  872 604 S    0  0.0 0:00.00 lttng-sessiond
    :
    :
    [SC-1:Xen43 Wed May 22 06:32:34/cluster/temp/tdlt]
    # top -b -n 1 | head -n 7 | tail -n 1; top -b -n 1 |grep -i sessiond
      PID USER  PR NI  VIRT  RES SHR S %CPU %MEM   TIME+ COMMAND
    13616 root  20  0  190m 7292 688 S    7  0.4 0:05.73 lttng-sessiond
    :
    :
    [SC-1:Xen43 Wed May 22 06:37:19/cluster/temp/tdlt]
    # top
    top - 06:37:25 up 1 day, 16:20, 2 users, load average: 2.60, 3.14, 1.55
    Tasks: 187 total, 1 running, 186 sleeping, 0 stopped, 0 zombie
    Cpu(s): 0.1%us, 0.5%sy, 0.0%ni, 99.2%id, 0.1%wa, 0.1%hi, 0.0%si, 0.0%st
    Mem:  2049828k total,  210928k used, 1838900k free,  2808k buffers
    Swap: 4200992k total,  784808k used, 3416184k free, 42348k cached
    :

* On a different machine, we observed the following (note that the total memory size of this machine is different from the one above):

    [SC-1:Node16 Tue May 21 13:22:33/cluster/temp/tdlt/Stability/May12]
    # top
    top - 13:23:15 up 11 days, 4:41, 2 users, load average: 0.00, 0.01, 0.05
    Tasks: 288 total, 1 running, 287 sleeping, 0 stopped, 0 zombie
    Cpu(s): 0.0%us, 0.1%sy, 0.0%ni, 99.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
    Mem:  24149M total,  3461M used, 20687M free,  372M buffers
    Swap:  4102M total,     0M used,  4102M free, 2411M cached
    :
    [SC-1:Node16 Tue May 21 13:32:09/cluster/temp/tdlt/Stability/May12]
    # top -b -n 1 | head -n 7 | tail -n 1; top -b -n 1 |egrep -i "lttng|trace"
      PID USER  PR NI  VIRT  RES  SHR S %CPU %MEM   TIME+ COMMAND
    15012 root  20  0  311m 137m  912 S    0  0.6 2:46.43 lttng-sessiond
    15373 root  20  0 82128 1484  956 S    0  0.0 1:06.09 lttng-consumerd
    18322 root  20  0 42264 2268 1808 S    0  0.0 1:11.81 TraceEa
    18342 root  20  0 42264  912  444 S    0  0.0 0:45.82 TraceEa
    20874 root  20  0  190m 3808 2596 S    0  0.0 0:40.07 trace_c
    20880 root  20  0 47220 2948 2040 S    0  0.0 0:10.89 trace_p

Is problem reproducible ?
=========================
* yes

How to reproduce (if reproducible):
===================================
1)_ Kill the current sessionD and relaunch a new one.
2)_ top -b -n 1 | head -n 7 | tail -n 1; top -b -n 1 |grep -i sessiond
3)_ lttng list -u; lttng list    #--- no instrumented app running, no session available
4)_ Run multiple instances of the instrumented app (each instance has only a 3-sec lifespan) so that around 700 instances are present at any time. ex:
    for a in $(seq 1 1234); do (for n in $(seq 1 100); do (/home/test_apps/TestApp_100perSecOnly 3 np > /dev/null &); done; usleep 200000); done &
5)_ Once in a while, repeat step 2 above. The value in the "%MEM" column keeps increasing for sessionD.
6)_ Stop the process from step 4. No more instrumented apps should be present.
7)_ Repeat step 2. The %MEM used still remains.

Any other information:
======================
-
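For reference, the resident memory of the session daemon can also be sampled over time with a small loop such as the one below (a sketch only; it assumes a single lttng-sessiond process and a procps ps that supports the -C and -o rss= options):

    # Sample the RSS (in KB) of lttng-sessiond every 10 seconds
    while true; do
        echo "$(date '+%H:%M:%S') rss_kb=$(ps -C lttng-sessiond -o rss= | head -n 1)"
        sleep 10
    done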
Updated by Jérémie Galarneau over 11 years ago
We're trying to reproduce the issue on our end.
Updated by Tan le tran over 11 years ago
Hi Jérémie,
Did you have a chance to look into this issue?
Thanks,
Tan
Updated by Jérémie Galarneau over 11 years ago
David has tried to reproduce the issue without success... However, we have identified another bug that may be related. We are still looking into it.
Could you run the sessiond under valgrind in "--leak-check=full" mode?
David has rebased a Valgrind-support patch on the current lttng-tools HEAD. It's available on his personal repository.
git@github.com:dgoulet/lttng-tools-dev.git
branch: valgrind
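If it helps, one possible way to run the session daemon under valgrind with full leak checking is sketched below (assumptions only: the valgrind branch above is built and installed, and the daemon is kept in the foreground so valgrind can print its leak report when the daemon exits):

    # Stop any running session daemon first
    pkill lttng-sessiond

    # Run the daemon in the foreground under valgrind; the leak report is
    # written to the log when the daemon is terminated (e.g. with SIGTERM)
    valgrind --leak-check=full lttng-sessiond -vvv > sessiond-valgrind.log 2>&1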
Updated by Tan le tran over 11 years ago
Hi Jeremie,
Unfortunately, I cannot reproduce this with Valgrind. The reason is that when sessionD runs under Valgrind, it is too slow to respond to the registration of the instrumented apps, causing many "Error: Timed out waiting for lttng-sessiond" errors, which in turn cause the instrumented apps to hang (ie: they cannot execute their main even after the LTTNG_UST_REGISTER_TIMEOUT has elapsed). This new behaviour has been reported in bug-545.
I will try to rerun with a lower launch rate (ie: launch fewer instances of the instrumented app per second) and see if I can still see the increase in MEM usage in the "top" printout.
Regards,
Tan
Updated by Jérémie Galarneau over 11 years ago
Okay. Reproducing it under a lighter load will absolutely make it easier to debug... Keep me posted.
Thanks for reporting #545 too!
Updated by Tan le tran over 11 years ago
When trying with a lower rate, the problem did not happen. ex:

    for a in $(seq 1 28800); do (for n in $(seq 1 100); do (/home/test_apps/TestApp_100perSecOnly 3 np > /dev/null &); done; usleep 800000); done &

I ran the above for 8 hrs, launching 100 instances every 0.8 sec. The "top" printouts before and after the 8-hour run are pretty much the same (see the RES and %MEM columns):

    [SC-1:Node16 Mon May 27 15:40:05/cluster/temp/tdlt]
    # top -b -n 1 | head -n 7 | tail -n 1; top -b -n 1 |grep -i sessiond
      PID USER PR NI  VIRT  RES SHR S %CPU %MEM   TIME+ COMMAND
      712 root 20  0  157m 6892 664 S    2  0.0 0:00.28 lttng-sessiond
    :
    [SC-1:Node16 Tue May 28 06:51:22/cluster/temp/tdlt]
    # top -b -n 1 | head -n 7 | tail -n 1; top -b -n 1 |grep -i sessiond
      PID USER PR NI  VIRT  RES SHR S %CPU %MEM   TIME+ COMMAND
      712 root 20  0  236m 8720 708 S    0  0.0 7:08.17 lttng-sessiond

When changing the rate to 100 instances per 0.2 sec, I see the value under RES increase to 12 MB and stay there even after no more apps are being launched (NOTE: there is no session defined, no session active). ex:

    for a in $(seq 1 28800); do (for n in $(seq 1 100); do (/home/test_apps/TestApp_100perSecOnly 3 np > /dev/null &); done; usleep 200000); done &

    <before launching apps>
    [SC-1:Node16 Tue May 28 06:51:22/cluster/temp/tdlt]
    # top -b -n 1 | head -n 7 | tail -n 1; top -b -n 1 |grep -i sessiond
      PID USER PR NI  VIRT  RES SHR S %CPU %MEM   TIME+ COMMAND
      712 root 20  0  236m 8720 708 S    0  0.0 7:08.17 lttng-sessiond
    :
    :
    <after launching apps>
    [SC-1:Node16 Tue May 28 06:52:45/cluster/temp/tdlt]
    # top -b -n 1 | head -n 7 | tail -n 1; top -b -n 1 |grep -i sessiond
      PID USER PR NI  VIRT  RES SHR S %CPU %MEM   TIME+ COMMAND
      712 root 20  0  235m  12m 708 S    7  0.1 7:08.52 lttng-sessiond
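The two runs above differ only in the usleep interval, so a parameterized form of the launcher may be convenient for trying intermediate rates (a sketch only; the app path and counts are the ones used above):

    #!/bin/sh
    # Usage: ./launch_apps.sh <interval_in_microseconds>   e.g. 800000 or 200000
    INTERVAL_US=${1:-200000}
    for a in $(seq 1 28800); do
        for n in $(seq 1 100); do
            /home/test_apps/TestApp_100perSecOnly 3 np > /dev/null &
        done
        usleep "$INTERVAL_US"
    done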
Updated by David Goulet over 11 years ago
That's interesting. I'm betting on an error path that does not clean up memory when some limit is reached.
During those tests with a large number of applications, do you know if any errors occurred, like the open-file limit being reached, for instance?
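For what it's worth, a quick check of whether the daemon is approaching its open-file limit during the test could look like this (a sketch; it assumes a single lttng-sessiond process and a readable /proc):

    SESSIOND_PID=$(pgrep -o lttng-sessiond)
    # Number of file descriptors currently open by the daemon
    ls /proc/$SESSIOND_PID/fd | wc -l
    # Soft and hard limits on open files for that process
    grep "Max open files" /proc/$SESSIOND_PID/limits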
Updated by Tan le tran over 11 years ago
No, I did not observe any error while doing the test.
Updated by David Goulet over 11 years ago
We've been running this use case (and variations) on a 16-core machine for more than 50 minutes and we are unable to reproduce it.
So, can you confirm that during this test there are NO sessions created and the applications just register and unregister? Also, are you running health checks during that time?
I'm trying to find any "variables" that can change the state of the session daemon.
If none of the above apply, we'll have to come on site to help you guys. We have a small library that keeps track of malloc/free calls and their call sites.
Updated by Tan le tran over 11 years ago
We confirm the following about this test (as described earlier in this bug report):

_ Our TraceEA process performs the following at every 5-sec interval:
  1)_ Call the following APIs:
      lttng_health_check(LTTNG_HEALTH_CMD)
      lttng_health_check(LTTNG_HEALTH_APP_MANAGE)
      lttng_health_check(LTTNG_HEALTH_APP_REG)
      lttng_health_check(LTTNG_HEALTH_CONSUMER)
  2)_ Create a dummy session via the following API calls:
      lttng_create_session()
      lttng_destroy_session()
      (NOTE: no channel or event being enabled)
_ However, when we turn off the above TraceEA process, the same thing is still observed (ie: after a few minutes of running, we stop launching the apps, and the top or ps printouts indicate that the resident memory of sessiond has increased to around 20 times its original size).
_ Other than the above, only a bunch of instrumented applications are launched and terminate. No real session (with enabled events) is created and started.
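For reference, a rough command-line approximation of that 5-second session churn is sketched below (this is not the actual TraceEA code, which calls the liblttng-ctl API directly; the session name is arbitrary and no equivalent is shown for the health-check calls):

    # Create and immediately destroy a dummy session every 5 seconds,
    # without enabling any channel or event
    while true; do
        lttng create dummy_session > /dev/null
        lttng destroy dummy_session > /dev/null
        sleep 5
    done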
Updated by David Goulet over 11 years ago
Can this be reproduced with the session daemon running in -vvv?
Actually, even just "-v" would be of great help. If not, I'll come on site to help you hunt this one down.
Updated by David Goulet over 11 years ago
- File bug539.diff bug539.diff added
- Status changed from New to Confirmed
I'm pretty confident that this will fix the memory leak.
You'll have to apply this on master HEAD, at least at commit cfa9a5a2b4a96e0d6a9eeddd2622a6d7c173b7ac; I just pushed a fix that is needed for this patch to work.
Updated by Tan le tran over 11 years ago
- File Jun10_retest.log Jun10_retest.log added
Hi David,

We have loaded the latest lttng-tools and run the test. We still see the value under the "RSS" column increase for lttng-sessiond.

Commit used:
============
tools     : cfa9a5a (HEAD, origin/master, origin/HEAD) Fix: poll compat layer .....
ust       : 844a160 (HEAD, origin/stable-2.2) Fix: liblttng-ust process startup hang....
userspace : 264716f (HEAD, origin/stable-0.7, stable-0.7) Fix: Use a filled signal ....

The following values are taken from the RSS column of the command "ps auxwwwf":

RSS      Description
---      -----------
912      - After "pkill sessiond" and a brand new sessiond is launched
7272     - After: for a in $(seq 1 40); do (for n in $(seq 1 100); do (/home/test_apps/TestApp_100perSecOnly 3 np > /dev/null &); done;
110940   - After running a case (described below) where session activation is involved:

1)_ Start launching TestApp as follows to give an average of 1000 TestApp instances present at any given time:
    for a in $(seq 1 100);do (date; ps -ef |grep -c TestApp; usleep 500000; \
    for n in $(seq 1 300);do \
    (LTTNG_UST_REGISTER_TIMEOUT=100 /home/test_apps/TestApp_100perSecOnly 3 np > /dev/null 2>&1 & );\
    done); \
    done
3)_ Sleep for 5 sec to reach the average number of apps launched and terminated per sec.
4)_ Create multiple sessions and trace on HundredPerSec, filtering on "OnePs < 10" (this should satisfy all PIDs).
5)_ Activate only ONE of the above sessions.
6)_ Wait for 10 sec.
7)_ Deactivate the session.
8)_ Collect processor load if possible.
9)_ Stop then convert the session.

Can we use the value under the RSS column (from the ps command) as an indication of a memory leak, or have we misinterpreted that value?
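For context, steps 4) through 7) above correspond roughly to the following lttng commands (a sketch only; the real test harness drives this differently, and the event name HundredPerSec, the filter field OnePs and the session name are simply taken from the description above):

    lttng create leak_test_session
    # Enable the UST event with a filter that every PID should satisfy
    lttng enable-event -u HundredPerSec --filter 'OnePs < 10' -s leak_test_session
    lttng start leak_test_session
    sleep 10
    lttng stop leak_test_session
    lttng destroy leak_test_session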
Updated by David Goulet over 11 years ago
Well, yes, RSS is the resident set size of the program. I can see in the log that it goes up to 100 MB... The head commit of tools does not include the fix I provided (commit: adbbaabc265e354b7986b48815ee556f1a07769b). Is this because you might have forgotten it, or did you only apply the diff?
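A quick way to check whether that fix is present in a lttng-tools checkout (assuming a git working tree):

    # Prints "fix present" if commit adbbaabc... is an ancestor of the current HEAD
    git merge-base --is-ancestor adbbaabc265e354b7986b48815ee556f1a07769b HEAD \
        && echo "fix present" || echo "fix missing"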
Updated by Tan le tran over 11 years ago
Ouch !
My Bad....my verrrry bad !
I thought I just needed the latest commit..... I didn't see the patch :-(
Applying the patch now... will update you soon.
Updated by David Goulet over 11 years ago
No problem. :)
If it is still visible after using the patch, I'll come on site to hunt this thing down!
Updated by Tan le tran over 11 years ago
- File patch539_jun10_gitdiff.log patch539_jun10_gitdiff.log added
- File patch539_jun10_terminal.log patch539_jun10_terminal.log added
Hi David,
I have applied the patch (from update #13) on top of commit cfa9a5a.
Unfortunately, we still observe that RSS increases after the test.
Please see the attached logs for further detail.
Regards,
Tan
Updated by Tan le tran over 11 years ago
After loading lttng-tools v2.2.3 and lttng-ust 2.2.1, the memory leak described above is no longer observed.
Many thanks for the great support.
Updated by David Goulet over 11 years ago
- Status changed from Confirmed to Resolved