Bug #415 (closed)

Incomplete CTF and metadata files after session has been stopped

Added by Jesus Garcia over 11 years ago. Updated over 11 years ago.

Status: Resolved
Priority: Critical
Assignee: David Goulet
Target version: 2.1 stable
Start date: 12/18/2012
Due date:
% Done: 0%
Estimated time:
Description

consumerd is not done flushing all the metadata and CTF files towards relayd even after sessiond has replied to the lttng_stop_tracing call.
The scenario I'm running is as follows:
- 4-node cluster (SC-1/SC-2/PL-3/PL-4) with one sessiond/consumerd pair per node
- One relayd in node SC-1 which receives data from the consumerd in each of the 4 nodes
- 3 TestApps are running at all times in each node (4 processes total, since one of the apps forks a child)
- One cluster-wide session (same session is started on each node) is activated for 10 seconds with one event that produces 10 hits per second from one of the apps

The result is that most of the files have file size 0.

Just for troubleshooting purposes I have introduced a 30-second delay in our code after sessiond has returned from lttng_stop_tracing and before killing relayd (we start a new relayd when each cluster-wide session is created and kill it when the session is stopped). As a result, we get more data, but the data is still incomplete.

I have included a log from the 30-second delay scenario; if required, I can also provide the normal log without the delay, which shows that most of the files have zero file size. The issue is easily reproducible. I have not tried verbose mode yet, but I will do so later and attach the output to this bug report.

I have assigned high priority to this bug because we are close to our code-freeze date.
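
For reference, the stop/teardown sequence in our code looks roughly like the sketch below (using the liblttng-ctl C API; the helper name and the fixed sleep are illustrative of the troubleshooting delay described above, not our exact code):

/*
 * Rough sketch only -- not our exact code. Stop the session, wait the
 * troubleshooting delay, then destroy the session before relayd is killed.
 * Uses the liblttng-ctl C API from <lttng/lttng.h>.
 */
#include <lttng/lttng.h>
#include <stdio.h>
#include <unistd.h>

static int stop_and_teardown(const char *session_name)
{
        int ret;

        /* Returns once sessiond has replied to the stop command. */
        ret = lttng_stop_tracing(session_name);
        if (ret < 0) {
                fprintf(stderr, "stop failed: %s\n", lttng_strerror(ret));
                return ret;
        }

        /* Troubleshooting delay added before relayd is killed. */
        sleep(30);

        ret = lttng_destroy_session(session_name);
        if (ret < 0) {
                fprintf(stderr, "destroy failed: %s\n", lttng_strerror(ret));
        }

        /* relayd is killed here, outside of liblttng-ctl. */
        return ret;
}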

Here is my software level:
CURRENT HEAD: foss/babeltrace d0acc96 (HEAD, origin/master, origin/HEAD) Provides a basic pkg-config file for libbabeltrace
CURRENT HEAD: foss/lttng-tools bb63afd (HEAD, origin/master, origin/HEAD) Fix: for librelayd, fix negative reply ret code
CURRENT HEAD: foss/lttng-ust 45f399e (HEAD, origin/master, origin/HEAD) Adapt internal files and examples to TRACEPOINT_INCLUDE
CURRENT HEAD: foss/userspace-rcu b1f69e8 (HEAD, origin/master, origin/HEAD) documentation: fix rcu-api.txt duplicates


Files

Incomplete_data_30sec_delay_before_killing_relayd.txt (59 KB) Jesus Garcia, 12/19/2012 12:24 AM
bug415_verbose_no_tc.txt (439 KB) Jesus Garcia, 12/19/2012 11:08 AM
metadata (4 KB) Jesus Garcia, 12/19/2012 11:08 AM
bug415_verbose_no_tc_40sec_delay.txt (208 KB) Jesus Garcia, 12/19/2012 12:27 PM
bug415-relayd.diff (514 Bytes) David Goulet, 12/19/2012 01:32 PM
Actions #1

Updated by Jesus Garcia over 11 years ago

Quick Update: I tried the same scenario without traffic control and I still see the issue of CTF and metadata files with 0 file size.

Actions #2

Updated by David Goulet over 11 years ago

Any chance you can run this experiment with logs on the session daemon
side? (-vvv --verbose-consumer)

I might have found why this issue occurs but confirming it with your
logs would help.

Thanks!

Actions #3

Updated by David Goulet over 11 years ago

Hi Jesus,

I need more information on a theory I have.

First, can you tell me when the applications are started (before or
after lttng start)?

Second, in the example above, I would like you (or someone :) to tell me
what the metadata file contains for each incomplete trace. You should
have the magic number at the beginning of the file: "57 1d d1 75",
followed by either some data or zeroes.

Please give me this information for each incomplete trace, because it
would explain a LOT!

Also, the lifetime of the applications is very important here: when are
they started, and when do they end?

Thanks!
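
(Any quick way to dump the first bytes is fine -- for instance "hexdump -C metadata | head" -- or a small check like the sketch below; the sketch is purely illustrative.)

/*
 * Illustrative sketch: print the first 4 bytes of a CTF metadata file
 * so they can be compared against the expected magic "57 1d d1 75".
 */
#include <stdio.h>

int main(int argc, char **argv)
{
        unsigned char buf[4];
        FILE *f;

        if (argc < 2) {
                fprintf(stderr, "usage: %s <metadata file>\n", argv[0]);
                return 1;
        }
        f = fopen(argv[1], "rb");
        if (!f) {
                perror("fopen");
                return 1;
        }
        if (fread(buf, 1, sizeof(buf), f) != sizeof(buf)) {
                fprintf(stderr, "file too short or unreadable\n");
                fclose(f);
                return 1;
        }
        printf("%02x %02x %02x %02x\n", buf[0], buf[1], buf[2], buf[3]);
        fclose(f);
        return 0;
}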

Actions #4

Updated by Jesus Garcia over 11 years ago

Hi David,

Please see the attached log file for traces with verbose mode. Verbose was only turned on on one of the nodes in the cluster (SC-1), which is the node running relayd. That node is, of course, also running a sessiond and consumerd.

The instrumented apps are running at all times (before, during, and after the lifetime of the session).

Most of the metadata files have zero file size, but here is an example of one with some data. Some of the corresponding CTF files were missing data. The file is attached.

/storage/no-backup/tracecc-apr9010500/TestProf_1/PL-4/ust/TestApp_type1-13831-20121219-105039:
total 16
drwxr-x--- 2 root root 4096 Dec 19 10:50 .
drwxr-x--- 6 root root 4096 Dec 19 10:50 ..
-rwxr-xr-x 1 root root 0 Dec 19 10:50 channel0_0
-rwxr-xr-x 1 root root 4096 Dec 19 10:50 channel0_1
-rwxr-xr-x 1 root root 0 Dec 19 10:50 channel0_2
-rwxr-xr-x 1 root root 0 Dec 19 10:50 channel0_3
-rwxr-xr-x 1 root root 4096 Dec 19 10:50 metadata

Thanks,
Jesus

Actions #5

Updated by Jesus Garcia over 11 years ago

Just in case I was not clear: the 4 apps are never killed. I launch them before I run the trace and I never stop them.

Actions #6

Updated by David Goulet over 11 years ago

OK, so with applications running before and after the experiment, this is
good. You should have everything, since lttng start makes sure every
registered application is set up before going further.

The problem I see in the logs is that your relayd dies just after the
destroy command, while there is still data that needs to be streamed. I
think this is because, between a stop and a destroy, the UST tracer can
provide more padding... but apart from that, it does not explain the data
files set to 0.

The problem starts on line 1354. The consumer asks for a close_stream
command, which is initiated by the streams being destroyed on the
consumer side by the lttng destroy command. This is OK, since we are
basically ending the communication, but the relayd died before that.

So here is my question: in your experiment, when are you killing the
relayd? I suppose after the "lttng stop" command has been issued and has
returned on all nodes?

Actions #7

Updated by Jesus Garcia over 11 years ago

I experimented with delaying the killing of relayd last night and I still saw the zero-file-size issue. I tried with as much as a 30-second delay, which I consider a very long time given that only one low-throughput event from one app is being hit over a period of 10 seconds.

For the amount of data being produced by this session, how long do you estimate we need to delay the killing of relayd?
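
(If there is a way to ask whether data is still in flight, we could poll that instead of guessing a delay -- something like the sketch below, assuming liblttng-ctl exposes a call such as lttng_data_pending(). This is just an illustration, not our code.)

/*
 * Illustration only: replace the fixed delay with a poll, assuming a
 * liblttng-ctl call such as lttng_data_pending() is available
 * (returns 1 while data is still pending, 0 once everything is flushed,
 * negative on error).
 */
#include <lttng/lttng.h>
#include <unistd.h>

static void wait_until_flushed(const char *session_name)
{
        int ret;

        do {
                ret = lttng_data_pending(session_name);
                if (ret < 0) {
                        break;  /* error: give up rather than loop forever */
                }
                if (ret == 1) {
                        usleep(200000); /* 200 ms between polls */
                }
        } while (ret == 1);
}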

Actions #8

Updated by Jesus Garcia over 11 years ago

Here is a new log for the same scenario but with a 40-second delay before killing relayd. I do not see the "Relayd close stream replied error" message, but I still see files with zero file size, even for the app with the traced event.

Actions #9

Updated by David Goulet over 11 years ago

  • File bug415-relayd.diff bug415-relayd.diff added
  • Status changed from New to Confirmed
  • Assignee set to David Goulet
  • Priority changed from High to Critical
  • Target version changed from 2.1 pre to 2.1 stable

We might have found the problem causing this.

The data-pending check on the relayd side was off by one, returning a NOT pending value for streams that were still in use. This made the stop command return quite rapidly while data was still being consumed, so killing the relayd some time afterwards left the data incomplete.
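
Schematically (this is not the actual relayd code), the kind of off-by-one in question looks like the sketch below: the stream is reported as not pending one packet too early.

/*
 * Schematic illustration only -- not the actual relayd code.
 * relayd tracks the last packet sequence number written to disk for a
 * stream (prev_seq) and compares it with the last sequence number the
 * consumer reported sending (last_net_seq_num).
 */
#include <stdint.h>

struct relay_stream_sketch {
        uint64_t prev_seq;      /* last packet sequence number written to disk */
};

/* Returns 1 if data is still pending for this stream, 0 otherwise. */
static int stream_data_pending(const struct relay_stream_sketch *stream,
                uint64_t last_net_seq_num)
{
        /*
         * Off-by-one variant: declares the stream "not pending" one
         * packet too early, so the stop command returns while data is
         * still being streamed:
         *
         *     return stream->prev_seq + 1 < last_net_seq_num;
         */

        /* Data is pending until prev_seq has caught up with last_net_seq_num. */
        return stream->prev_seq < last_net_seq_num;
}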

Note that with the bandwidth limited to 100 kbit/s, the stop can take quite long for a large number of streams; here it takes more than 5 minutes for around 50 streams. It also depends on the application throughput and on how many concurrent lttng commands you have that need to interact with the streams on the consumerd (enable/start/stop/destroy).

Could you please test this patch and see if it helps?

Thanks!

Actions #10

Updated by David Goulet over 11 years ago

  • Status changed from Confirmed to Resolved

Fixed with commits f9dfc3d95caf3a7d927d58f252006390e168a1c2 and 39df6d9f42d8644d9ee25c8c02517b2cef6f5941.
