Double PID registering and unregistering race
With strace and Python using our libc wrapper, Python registers once with the session daemon. Some time later, a second registration comes in with the same PID, followed by an unregister. Since the same PID is used, the second registration triggers an assert() failure in our lttng_ht_add_unique call.
You can reproduce this behaviour with lttng-tools commit 82541c3400f9568835938b7c2c6ce5e18b5817c0 and lttng-ust commit d8de13549b80d40b0c823e43e81afd55266f2fe5, with the libc wrapper installed. With this particular script (also reproducible with gwibber-service), Python loads every *-libc.so found in ldconfig -p, and therefore loads our library automatically. A bug has been reported to the Python developers.
strace /usr/bin/python /usr/lib/desktopcouch/desktopcouch-service > /tmp/output.txt 2>&1
will hang the process. On Ctrl+C, you'll hit the issue (assert).
The "real" viable solution is planned for the 2.1 or a later stable release.
We have to remove the ust_app_sock_key_map and replace it with a hash table of applications that uses the FD as key. Each ust_app structure will have a node pointer to a hash table indexed by PID and FD. So, on a double PID registration, we'll use the direct lookup by PID, use add_replace in the hash table, and clean up the old node. This prevents the PID-FD lookup race when the unregister happens just after the replace and before the close(fd).
We'll have to add the lttng_ht_add_replace function to the lttng_ht internal library.
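To make the intended add_replace semantics concrete, here is a minimal sketch on a toy chained hash table keyed by PID. It is not the lttng_ht implementation: struct app_node, struct app_table, table_add_replace and table_lookup are hypothetical names invented for this example, and there is no RCU or locking. The point is only that the insert never fails on a duplicate PID; it hands the previous node back to the caller for cleanup.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

#define TABLE_SIZE 64

/* Hypothetical stand-in for struct ust_app: only what the sketch needs. */
struct app_node {
	pid_t pid;              /* lookup key */
	int sock;               /* registration socket of this application */
	struct app_node *next;  /* bucket chaining */
};

struct app_table {
	struct app_node *buckets[TABLE_SIZE];
};

static size_t bucket_of(pid_t pid)
{
	return (size_t) pid % TABLE_SIZE;
}

/*
 * Insert node, keyed by its PID. If a node with the same PID is already
 * present, unlink it and return it so the caller can clean it up.
 * Return NULL when no node was replaced.
 */
static struct app_node *table_add_replace(struct app_table *t,
		struct app_node *node)
{
	struct app_node **link;
	struct app_node *old = NULL;
	size_t idx;

	assert(t != NULL && node != NULL);
	idx = bucket_of(node->pid);

	for (link = &t->buckets[idx]; *link != NULL; link = &(*link)->next) {
		if ((*link)->pid == node->pid) {
			old = *link;
			*link = old->next;  /* unlink the old node */
			old->next = NULL;
			break;
		}
	}

	/* Push the new node at the head of its bucket. */
	node->next = t->buckets[idx];
	t->buckets[idx] = node;
	return old;
}

/* Direct lookup by PID; returns NULL when the PID is not registered. */
static struct app_node *table_lookup(const struct app_table *t, pid_t pid)
{
	struct app_node *n = t->buckets[bucket_of(pid)];

	while (n != NULL && n->pid != pid) {
		n = n->next;
	}
	return n;
}
```

On a double registration, the caller would get the first node back from table_add_replace() and be responsible for closing its socket and freeing it, while lookups by PID already see the new node.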
For the 2.0 stable release, we will simply remove the assert from the add_unique, clean up the old node (free() and close(fd)), and go on. This is valid since we still have a single thread handling registration and unregistration (sock lookup and sock close).
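As a rough illustration of that workaround, the sketch below handles a duplicate PID by closing the old socket and freeing the old entry instead of asserting. It is not the session daemon code: struct app, the fixed-size apps array, register_app and find_app are hypothetical names for this example. It assumes one thread does all registration and unregistration, and it does not model any cleanup deferred to another thread.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define MAX_APPS 16

/* Hypothetical registry entry: one registered application. */
struct app {
	pid_t pid;
	int sock;
};

/* Toy registry standing in for the real hash table. */
static struct app *apps[MAX_APPS];

/* Return the entry registered for pid, or NULL. */
static struct app *find_app(pid_t pid)
{
	for (size_t i = 0; i < MAX_APPS; i++) {
		if (apps[i] != NULL && apps[i]->pid == pid) {
			return apps[i];
		}
	}
	return NULL;
}

/*
 * Register (pid, sock). On a duplicate PID, clean up the old entry
 * (close its socket, free it) and keep going instead of asserting.
 * Return 0 on success, -1 on allocation failure or full registry.
 */
static int register_app(pid_t pid, int sock)
{
	struct app *new_app;

	assert(sock >= 0);
	new_app = malloc(sizeof(*new_app));
	if (new_app == NULL) {
		return -1;
	}
	new_app->pid = pid;
	new_app->sock = sock;

	for (size_t i = 0; i < MAX_APPS; i++) {
		if (apps[i] != NULL && apps[i]->pid == pid) {
			/* Double registration: drop the old node. */
			close(apps[i]->sock);
			free(apps[i]);
			apps[i] = new_app;
			return 0;
		}
	}

	for (size_t i = 0; i < MAX_APPS; i++) {
		if (apps[i] == NULL) {
			apps[i] = new_app;
			return 0;
		}
	}

	free(new_app);
	return -1;
}
```

Because the old socket is closed inside the registration path, a later unregister for the same PID must not close that descriptor again; the sketch only stays correct while a single thread owns both paths.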
Updated by David Goulet about 8 years ago
We are actually not able to provide a quick fix for the 2.0 stable release, since we hit a possible race between the close(fd) done when unregistering and the close(fd) done in the call_rcu thread (when cleaning up the old node).
I'll proceed with the pre20 release since this bug is a big corner case that does not stop the train :).
We'll try to push a fix during the 2.0 release candidate period.