Bug #415
Status: Closed
Incomplete CTF and metadata files after session has been stopped
Description
consumerd is not done flushing all the metadata and CTF files towards relayd even after sessiond has replied to the lttng_stop_tracing call.
The scenario I'm running is as follows:
- 4-node cluster (SC-1/SC-2/PL-3/PL-4) with one sessiond/consumerd pair per node
- One relayd in node SC-1 which receives data from the consumerd in each of the 4 nodes
- 3 TestApps are running at all times (4 processes total, since one of the apps forks a child) in each node
- One cluster-wide session (same session is started on each node) is activated for 10 seconds with one event that produces 10 hits per second from one of the apps
The result is that most of the files have file size 0.
Just for troubleshooting purposes I have introduced a 30-second delay in our code after sessiond has returned from lttng_stop_tracing and before killing relayd (we start a new relayd when each cluster-wide session is created and kill it when the session is stopped). As a result, we get more data, but the data is still incomplete.
I have included a log from the 30-second delay scenario, but I can also provide the normal log without the delay, which shows most of the files with zero file size, if required. The issue is easily reproducible. I have not tried verbose mode yet, but I will try it later and attach the output to this bug report.
I have assigned high priority to this bug because of our approaching code-freeze date.
Here is my software level:
CURRENT HEAD: foss/babeltrace d0acc96 (HEAD, origin/master, origin/HEAD) Provides a basic pkg-config file for libbabeltrace
CURRENT HEAD: foss/lttng-tools bb63afd (HEAD, origin/master, origin/HEAD) Fix: for librelayd, fix negative reply ret code
CURRENT HEAD: foss/lttng-ust 45f399e (HEAD, origin/master, origin/HEAD) Adapt internal files and examples to TRACEPOINT_INCLUDE
CURRENT HEAD: foss/userspace-rcu b1f69e8 (HEAD, origin/master, origin/HEAD) documentation: fix rcu-api.txt duplicates
Updated by Jesus Garcia almost 12 years ago
Quick Update: I tried the same scenario without traffic control and I still see the issue of CTF and metadata files with 0 file size.
Updated by David Goulet almost 12 years ago
Any chance you can run this experiment with logs on the session daemon side? (-vvv --verbose-consumer)
I might have found why this issue occurs, but confirming it with your logs would help.
Thanks!
Updated by David Goulet almost 12 years ago
Hi Jesus,
I need more information on a theory I have.
First, can you tell me when the applications are started (before or after the lttng start command)?
Second, in the example above, I would like you (or someone :) to tell me what the metadata file contains for an incomplete trace. You should have the file magic number at the beginning, "57 1d d1 75", followed by either some data or zeroes.
Please, for each incomplete trace, can you give me this information? It would explain a LOT!
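If it helps, a quick way to get this is to dump the first bytes of the file (a plain "hexdump -C metadata | head" shows the same thing); here is a minimal C sketch, where the hard-coded path is just a placeholder:

#include <stdio.h>

/*
 * Minimal sketch: print the first 16 bytes of a metadata file so the
 * magic mentioned above ("57 1d d1 75") can be checked. The path is a
 * placeholder for the metadata file of an incomplete trace.
 */
int main(void)
{
    unsigned char buf[16];
    size_t i, n;
    FILE *f = fopen("metadata", "rb");

    if (!f) {
        perror("fopen");
        return 1;
    }
    n = fread(buf, 1, sizeof(buf), f);
    for (i = 0; i < n; i++) {
        printf("%02x ", buf[i]);
    }
    printf("\n");
    fclose(f);
    return 0;
}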
Also, the lifetime of the applications is very important here: when are they started, and when do they end?
Thanks!
Updated by Jesus Garcia almost 12 years ago
- File bug415_verbose_no_tc.txt added
- File metadata added
Hi David,
Please see the attached log file for traces with verbose output. Verbose mode has only been turned on in one of the nodes in the cluster (SC-1), which is the node running relayd. On that node there is also a sessiond and consumerd, of course.
The instrumented apps are running at all times (before, during, and after the life time of the session).
Most of the metadata files have zero file size, but here is an example of one with some data. Some of the corresponding CTF files were missing data. The file is attached.
/storage/no-backup/tracecc-apr9010500/TestProf_1/PL-4/ust/TestApp_type1-13831-20121219-105039:
total 16
drwxr-x--- 2 root root 4096 Dec 19 10:50 .
drwxr-x--- 6 root root 4096 Dec 19 10:50 ..
-rwxr-xr-x 1 root root 0 Dec 19 10:50 channel0_0
-rwxr-xr-x 1 root root 4096 Dec 19 10:50 channel0_1
-rwxr-xr-x 1 root root 0 Dec 19 10:50 channel0_2
-rwxr-xr-x 1 root root 0 Dec 19 10:50 channel0_3
-rwxr-xr-x 1 root root 4096 Dec 19 10:50 metadata
Thanks,
Jesus
Updated by Jesus Garcia almost 12 years ago
Just in case I was not clear: the 4 apps are never killed. I launch them before I run the trace and I never stop them.
Updated by David Goulet almost 12 years ago
Ok, so with applications running before and after the experiment, this is good. You should have everything, since lttng start makes sure every registered application is set up before going further.
The problem I see in the logs is that your relayd dies just after the destroy command, while there is still data that needs to be streamed. I think this is because, between a stop and a destroy, the UST tracer can provide more padding... but apart from that, it does not explain the data files of size 0.
The problem starts at line 1354 of the log. The consumer asks for a close_stream command, which is initiated when the streams are destroyed on the consumer side by the lttng destroy command. This is OK, since we are basically ending the communication, but the relayd died before that point.
So here is my question: in your experiment, when are you killing the relayd? I suppose after the "lttng stop" command has been issued and has returned for all nodes?
Updated by Jesus Garcia almost 12 years ago
I experimented with delaying the killing of relayd last night and I still saw the zero-file-size issue. I tried with as much as a 30-second delay, which I consider a very long time given that only one low-throughput event from one app is being hit over a period of 10 seconds.
For the amount of data being produced by this session, how long do you estimate we need to delay the killing of relayd?
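Rather than tuning a fixed delay, would polling the data pending state be the intended way to gate the relayd kill? Here is a minimal sketch of what I have in mind, assuming lttng_data_pending() from liblttng-ctl is the right call for this; the 100 ms polling interval is just a placeholder:

#include <stdio.h>
#include <unistd.h>
#include <lttng/lttng.h>

/*
 * Sketch only: stop the session, then poll until no more data is
 * pending before we tear down the relayd.
 */
static int stop_then_wait_for_flush(const char *session_name)
{
    int ret;

    ret = lttng_stop_tracing(session_name);
    if (ret < 0) {
        fprintf(stderr, "stop failed: %s\n", lttng_strerror(ret));
        return ret;
    }

    /* lttng_data_pending() returns 1 while data is still pending, 0 once flushed. */
    while ((ret = lttng_data_pending(session_name)) == 1) {
        usleep(100000); /* placeholder poll interval: 100 ms */
    }
    if (ret < 0) {
        fprintf(stderr, "data pending check failed: %s\n", lttng_strerror(ret));
        return ret;
    }

    /* Only at this point would we kill the relayd. */
    return 0;
}

If lttng_stop_tracing() is already supposed to block until that state is reached, then I guess this loop should be a no-op and the delay should not be needed at all.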
Updated by Jesus Garcia almost 12 years ago
Here is a new log for the same scenario but with a 40-second delay before killing relayd. I do not see the "Relayd close stream replied error" message, but I still see files with zero file size, even for the app with the traced event.
Updated by David Goulet almost 12 years ago
- File bug415-relayd.diff bug415-relayd.diff added
- Status changed from New to Confirmed
- Assignee set to David Goulet
- Priority changed from High to Critical
- Target version changed from 2.1 pre to 2.1 stable
We might have found the problem causing this.
The data pending check on the relayd side was off by one, returning a NOT pending value for streams that were still in use. This made the stop command return quite rapidly while data was still being consumed, so killing the relayd some time afterwards left the data incomplete.
Note that with the bandwidth limited to 100 kbit/s, the stop can take quite a long time for a large number of streams; here it takes more than 5 minutes for around 50 streams. It also depends on the application throughput and on how many concurrent lttng commands you have that need to interact with the streams on the consumerd (enable/start/stop/destroy).
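For context, the bug has the general shape below; this is an illustrative sketch only, with made-up names, not the actual relayd code from the attached patch:

#include <stdint.h>

/*
 * Illustrative only -- hypothetical names, not the actual relayd code.
 * The relayd answers a data pending query for a stream by comparing the
 * last sequence number it has received against the one the consumer
 * asks about.
 */
static int stream_data_pending(uint64_t last_received_seq, uint64_t queried_seq)
{
    /*
     * Off-by-one shape: shifting the queried sequence number by one makes
     * a stream whose last packet has not arrived yet look NOT pending, so
     * the stop command returns too early:
     *
     *     return last_received_seq < queried_seq - 1;
     *
     * Corrected shape: the stream stays pending until the relayd has
     * received at least the queried sequence number.
     */
    return last_received_seq < queried_seq;
}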
Could you please test this patch and see if it helps?
Thanks!
Updated by David Goulet almost 12 years ago
- Status changed from Confirmed to Resolved
Fixed with commits f9dfc3d95caf3a7d927d58f252006390e168a1c2 and 39df6d9f42d8644d9ee25c8c02517b2cef6f5941.