There is a problem about tombstone, which it will fail to
generate tombstone file in some scenarios due to socket
communication exception.
Reproduce step:
step 1: reboot device
step 2: ps -ef |grep zygote , get the pid of zygote64
(Attention: zygote64 should never been killed or reboot,
otherwise we can get the tombstone file)
step 3: kill -5 pid of zygote64
step 4: cd data/tombstones/, and could not find the tombstone
file of zygote64.
[Cause Analysis]
1. There are following logs by logcat:
11-19 15:38:43.789 569 569 F libc : Fatal signal 5 (SIGTRAP),
code 0 (SI_USER) in tid 569 (main), pid 569 (main)
11-19 15:38:43.829 6115 6115 I crash_dump64: obtaining output
fd from tombstoned, type: kDebuggerdTombstone
11-19 15:38:43.830 569 5836 I Zygote : Process 6114 exited
cleanly (0)
11-19 15:38:43.830 777 777 I /system/bin/tombstoned: received
crash request for pid 569
11-19 15:38:43.831 6115 6115 I crash_dump64: performing dump of
process 569 (target tid = 569)
...
11-19 15:38:43.937 777 777 W /system/bin/tombstoned: crash
socket received short read of length 0 (expected 12)
2. The last log was print by function of crash_request_cb in
file of tombstoned.cpp, following related code:
rc = TEMP_FAILURE_RETRY(read(sockfd, &request, sizeof(request)));
if (rc == -1) {
PLOG(WARNING) << "failed to read from crash socket";
goto fail;
} else if (rc != sizeof(request)) {
LOG(WARNING) << "crash socket received short read of length " << rc << " (expected "
<< sizeof(request) << ")";
goto fail;
}
Tombstoned read message by socket, and now the message length is
zero. Some socket communication exception occurs at that time.
We try to let crash_dump resend the socket message when the
communication is abnormal. Just as this CL.
Test: 1 reboot device
2 ps -ef |grep zygote , get the pid of zygote64
(Attention: zygote64 should never been killed or reboot,
otherwise we can get the tombstone file)
3 kill -5 pid of zygote64
4 cd data/tombstones/, and could find the tombstone file of
zygote64.
Change-Id: Ic152b081024d6c12f757927079fd221b63445b18
Make XOM related crashes a little less mysterious by adding an abort
cause explaining the crash.
Bug: 77958880
Test: Abort cause in tombstone for a XOM-related crash.
Change-Id: I7af1bc251d9823bc755ad98d8b3b87c12bbaecba
Previously, when we received simultaneous dump requests, we were CASing
a file descriptor value into a variable, and then failing to close it
if the CAS failed.
Bug: http://b/118412443
Test: debuggerd_test
Change-Id: I075c35a239426002eb9416da3d268c3d1a18e9d2
TEMP_FAILURE_RETRY's result was unused for the call to read(), so now
mark it as such to silence a possible unused result warning. For
__read_chk(), this function is an internal implementation detail of
FORTIFY in Bionic. Under clang-tidy, FORTIFY checks are actually
removed, so this now results in an unknown function being called. The
code should not be explicitly depending on an implementation detail, but
we can just suppress the failing case to retain test coverage of the
actual implementation.
Bug: http://b/110779387
Test: Build using WITH_TIDY=1
Change-Id: If83ac1d6f3b6dc32c0d0fb56d8e675e53b586f78
Previously, if an intercept ends before we ask for a file descriptor
when doing a backtrace, we'll create a tombstone file instead.
Bug: http://b/114139908
Bug: http://b/115349586
Test: debuggerd_test32
Change-Id: I23c7bb8ae5a982a4374a862d0a4f17bee03eb1d9
system_server is sometimes failing to dump with the following error:
libdebuggerd_client: received packet of unexpected length from tombstoned: expected 128, received -1
Improve the logging to try to figure out what's going on.
Bug: http://b/114139908
Test: treehugger
Change-Id: Iee1bdc0891b9fc7bd80a330495ec22a530febddb
Make it possible for code such as fdsan that generates debugging
tombstones via raise(DEBUGGER_SIGNAL) to pass an abort message as well.
Bug: http://b/112770187
Test: debuggerd_test
Change-Id: Idc34263241c18033573e466da3a45aa6f716ddb3
This commit only prints the raw value of the owner tag, pretty-printing
will come in a follow-up commit.
Test: debuggerd `pidof adbd`
Test: static_crasher fdsan_file + manual inspection of tombstone
Change-Id: Idb7375a12e410d5b51e6fcb6885d4beb20bccd0e
Pass the address of the fdsan table down to crash_dump so that we can
dump the fdsan table along with the open file descriptor list.
Test: debuggerd_test
Test: manually ran an old static_crasher
Change-Id: Icbac5487109f2db1e1061c4d46de11b016b299e3
adbd has been built as a static executable since the same binary was
copied to the recovery partition where shared library is not supported.
However, since we now support shared library in the recovery partition,
adbd is built as a dynamic executable.
In addition, the dependency from adbd to libdebuggerd_handler is removed
as debuggerd is handled by the dynamic linker.
A few more modules in /system/core are marked as recovery_available:
true as they are transitive dependencies of the dynamic linker.
This change also includes ld.config.recovery.txt which is the linker
config file for the recovery mode. It is installed to /etc/ld.config.txt
and contains linker namespace config for the dynamic binaries under
/sbin.
Bug: 63673171
Test: `adb reboot recovery; adb devices` shows the device ID
Test: Select 'mount /system' in the recovery mode, then `adb shell`.
$ lsof -p `pidof adbd` shows that libm.so, libc.so, etc. are loaded from
the /lib directory.
Change-Id: I363d5a787863f1677ee40afb5d5841321ddaae77
Include the illegal instruction in the header if we get a
SIGILL. Otherwise (since these tend to be one-off bit flips), we don't
usually have any information to try to confirm our suspicion that any
given instance is actually a one-off bit flip.
Also add `SIGILL` as a crasher option to easily generate such crashes.
Before:
signal 4 (SIGILL), code 1 (ILL_ILLOPC), fault addr 0xab1456da
After:
signal 4 (SIGILL), code 1 (ILL_ILLOPC), fault addr 0xab1456da (*pc=0xe7f0def0)
Bug: http://b/77274448
Test: ran crasher
Change-Id: I5f8dedca5eea2b117b1b1e48430214b38e1366ed
adbd (and its dependencies) are marked as recovery_available:true so
that recovery version of the binary is built separately from the one for
system partition. This allows us to stop copying the system version to
the recovery partition and also opens up the way to enable shared
libraries in the recovery partition. Then we can also build adbd as a
dynamic executable.
Bug: 79146551
Test: m -j adbd.recovery
Change-Id: Ib95614c7435f9d0afc02a0c7d5ae1a94e439e32a
Switch from _exit to raising SIGABRT when we recurse in the fallback
handler, so that waiters see an abort instead of a regular exit.
Bug: http://b/79717060
Test: debuggerd_test32
Test: debuggerd_test64
Change-Id: Iddee1cb1b759690adf07bbb8cd0fda2faac87571
* New lld could create files that map to non-zero
offset at run time.
Test: debuggerd_test
Bug: 79590156
Change-Id: I12db0ebef489ba8a1e648a29d214f8d3c3703996
We can't actually link an unlinked file back onto disk if it wasn't
opened with O_TMPFILE. Switch to using a temporary filename instead.
Bug: http://b/77729983
Test: agampe
Change-Id: I1970497114f0056065a1ba65f6358f08b51ec551
70d8f28945 broke a test that was not
expecting to see the new detail about the signal's sender.
Bug: http://b/78594105
Test: ran tests
Change-Id: Idfa3a53b9e664308efdba560ffbb1401c1904530
We have a LOG(FATAL) that can potentially happen before we turn off
SIGABRT. Move the signal handler defusing to the very start of main.
Bug: http://b/77920633
Test: treehugger
Change-Id: I7a2f2a0f2bed16e54467388044eca254102aa6a0
Suicide doesn't change:
signal 6 (SIGABRT), code -6 (SI_TKILL), fault addr --------
But homicide now looks like this (this is `sleep 666` killed by
`kill -SEGV` as root:
signal 11 (SIGSEGV), code 0 (SI_USER from pid 4446, uid 0), fault addr --------
Bug: http://b/78594105
Test: manual
Change-Id: I8c2feafba8cc5a3db85e8250004d428a464c5d9e
Instead of creating tombstone FDs in place and passing them out to
crash_dump directly, create them as O_TMPFILEs and link them into place
when crash_dump reports success, to avoid creating empty tombstones
in cases like an aborting thread racing with another thread that
manages to cleanly exit_group before the dump finishes.
Bug: http://b/77729983
Test: debuggerd_test
Test: adb shell 'for x in `seq 0 50`; do crasher; done'
Change-Id: I31ce4fd4a524abf8bde57152450209483d9d0ba9
Let the logging implementation be the imposer of limits.
Bug: http://b/64759619
Test: debuggerd_test
Change-Id: I8bc73bf2301ce071668993b740880224846a4e75
The recent change to detect missing source files broke reading makefiles
for mips, since this didn't specify a source file.
Bug: 73904572
Test: lunch aosp_mips-eng; m nothing
Test: lunch aosp_arm-eng; m crash_dump.policy
Test: lunch aosp_arm64-eng; m crash_dump.policy
Test: lunch aosp_x86-eng; m crash_dump.policy
Test: lunch aosp_x86_64-eng; m crash_dump.policy
Change-Id: I28864b5af59267f1ab83084128f2c59b04039374
Calls to abort() will always result in our signal handler being called,
because abort will manually unblock SIGABRT before raising it. This
can lead to deadlock when handling address space exhaustion in the
fallback handler. To fix this, switch our mutex to a recursive mutex,
and manually keep track of our lock count.
Bug: http://b/72929749
Test: debuggerd_test --gtest_filter="CrasherTest.seccomp_crash_oom"
Change-Id: I609f263ce93550350b17757189326b627129d4a7
Add a comment explaining why we define PROT_READ/PROT_WRITE, even
though a current libminijail supports both cosntants.
Bug: http://b/73273658
Test: treehugger
Change-Id: I51c1be1b1b569e94dbc9045a90bc28221b7dc9c7
When generating crash_dump.*.policy, replace PROT_READ and PROT_WRITE
to numeric constants to make the policy backward compatible with old
libminijail.so.
Bug: 73273658
Test: use the new policy in OMR1 devices
Change-Id: I936a733340ad4df8aef6562c03eb10c29ffdada2
A race condition occurs when one thread takes more than a second to get
scheduled to handle the signal we send to ask it to dump its stack.
When this happens, the main thread will continue on, close the fd, and
then ask the next thread to dump, but the slow thread will then wake up
and try to write to the new thread's fd, or trigger an assertion in
__linker_enable_fallback_allocator.
Do a few things to make this less bad:
- encode both target tid and fd in the shared atomic, so that we know
who each fd is for
- switch __linker_enable_fallback_allocator to return success instead
of aborting, and bail out if it's already in use
- write to the output fd right when we get to it, instead of doing it
whenever the dumping code decides to, to reduce the likelihood that
the timeout expires
Test: debuggerd_test
Change-Id: Ife0f6dae388b601e7f991605f14d7a0274013f6b
Commit 3e235911 in bionic switched LP32's sigaction implementation over
to using the rt_sigaction syscall, matching LP64. Update our seccomp
policy to match.
Bug: http://b/73119572
Test: debuggerd_test32
Change-Id: I0a662a1c874298d434468d2dcdb4ebf9f276110c
Use the art dex file library to read the dex data.
Add unit tests for the UnwindDexFile code.
Bug: 72070049
Test: All unit tests continue to pass.
Test: Dumped the backtrace of the 137-cfi test while running in interpreter
Test: mode and verified that the stack trace is correct. Did this on host
Test: and for arm/arm64.
Change-Id: Ia6f343318c5dd6968a954015a7d59fdf101575b0
The stack dump was not printing leading zeros for data after the
change to remove uintptr_t types from the libbacktrace API.
Bug: 65682279
Test: Created an arm tombstone and an arm64 tombstone and verified
Test: that the stack data has leading zeros.
Change-Id: I1fbec2c4fa7c8b0fab18894c5628d18c5a580299
In order to support the offline unwinding properly, get rid of the
usage of non-fixed type uintptr_t from all API calls.
In addition, completely remove the old local and remote unwinding code
that used libunwind.
The next step will be to move the offline unwinding to the new unwinder.
Bug: 65682279
Test: Ran unit tests for libbacktrace/debuggerd.
Test: Ran debuggerd -b on a few arm and arm64 processes.
Test: Ran crasher and crasher64 and verified tombstones look correct.
Change-Id: Ib0c6cee3ad6785a102b74908a3d8e5e93e5c6b33
Set and restore PR_SET_PTRACER when performing a dump, so that when
Android is running on a kernel that has the Yama LSM enabled (and the
value of ptrace_scope is > 0), crash_dump can attach to processes and
print nice, symbolized stack traces.
Bug: 70992745
Test: kill -6 `pidof surfaceflinger` && logcat -d -b crash
# in both sailfish and Chrome OS
Change-Id: If4646442c6000fdcc69cf4ab95fdc71ae74baaaf
The abort message was accidentally relocated to be printed below the
registers, backtrace, and stack, which isn't very helpful. Move it back
to its rightful place.
Test: treehugger
Change-Id: I8aa5b63e58081f27ccdb42481fed8d9eb3a892a4
When a process crashes, both ActivityManager and init will try to kill
its process group when they notice. The recent change to minimize the
amount of time a process is paused results in crash dumps being killed
before they finish as a result of this. Since anything that needs to be
low-latency is probably not going to be too happy if it crashes, just
wait for completion whenever we're processing a real crash.
Bug: http://b/70343110
Test: debuggerd_test
Change-Id: I894bb06efd264b1ba005df06f7326a72f4b767bb
Add some helper macros that perform regex string matching to
<android-base/test_utils.h>.
Test: libbase_test32/64 on host
Change-Id: I1b0f03dc73f8b4fdfb8ac6c75d59ef421e0e9640
Add a benchmark to measure how long we pause a process when dumping.
Bug: http://b/62112103
Test: manually ran it
Change-Id: Iceec2f722915b0ae26144c86dcbeb35793f963da
Reduce the amount of time that a process remains paused by pausing its
threads, fetching their registers, and then performing unwinding on a
copy of its address space. This also works around a kernel change
that's in 4.9 that prevents ptrace from reading memory of processes
that we don't have immediate permissions to ptrace (even if we
previously ptraced them).
Bug: http://b/62112103
Bug: http://b/63989615
Test: treehugger
Change-Id: I7b9cc5dd8f54a354bc61f1bda0d2b7a8a55733c4
Add a static GetLoadBias method to the Elf object that only reads just
enough to get the load bias.
Add a method to MapInfo that gets the load bias. First attempt to get
it if the elf object already exists. If no elf object was created, use
the new static method to get the load bias.
In BacktraceMap, add a custom iterator so that when code dereferences
a map element, that's when the load bias will be retrieved if it hasn't
already been set.
Bug: 69871050
Test: New unit tests, verify tombstones have non-zero load bias values for
Test: libraries with a non-zero load bias.
Change-Id: I125f4abc827589957fce2f0df24b0f25d037d732
Always check to see if the fallback handler has been called and is
not trying to dump a specific thread.
Bug: 69110957
Test: Verified on a system where the prctl value changes, that before the
Test: change it dumps multiple tombstones, and after the change it
Test: works as expected.
Test: Ran debuggerd unit tests.
Test: Dumped process using debuggerd -b <PID> and debuggerd <PID>.
Change-Id: Id98bbe96cced9335f7c3e17088bb4ab2ad2e7a64
Nobody is looking at the mismatches, and it can cause problems
with tombstone parsers.
Also, fix the dump_header_info test and remove unused properties_fake.cpp.
Test: Ran unit tests, verified tombstones still work.
Change-Id: I4261646016b4e84b26a5aee72f3227f1ce48ec9a
This is needed if they will ever handle ro. properties that have
values longer than 92 characters.
Bug: 23102347
Bug: 34954705
Test: read and write properties with value length > 92 characters
Change-Id: I44aa135c97ec010f12162c30f743387810ae2c5d
Update the tests to match new output (and stop pluralizing '1 entries').
Test: `debuggerd_test{32,64} --gtest_filter="TombstoneTest.*" on hikey960
Change-Id: I16b0335715303252fad3a35d6a053a50fefdac30
Tombstones (especially ones with lots of VMAs) are regularly truncated.
We can at least show the number of VMAs, though, for anyone interested
in knowing whether they got close to the default 64Ki limit.
Bug: http://b/66911122
Test: ran crasher, examined tombstone
Change-Id: I286db66f28f132307d573dbe5164efc969dc6ddc
Apply the same fix from c2e98f63 to intercept_manager.cpp.
Bug: http://b/64543673
Test: debuggerd_test
Change-Id: Ibfb919e059fa62f8336cfc1426d03ef015590136
If a function crashes by jumping into unexecutable code, the old method
could not unwind through that. Add a fallback method to set the pc from
the default return address location.
In addition, add a new finished check for steps. This will provide a method
to indicate that this step is the last step. This prevents cases where
the fallback method might be triggered incorrectly.
Update the libbacktrace code to unwind using the new methodology.
Update the unwind tool to use the new unwind methodology.
Add a new option to crasher that calls through a null function.
Create a new object, Unwinder, that encapsulates the a basic unwind. For now,
libbacktrace will still use the custom code.
Added new unit tests to cover the new cases. Also add a test that
crashes calling a nullptr as a function, and then has call frames in
the signal stack.
Bug: 65842173
Test: Pass all unit tests, verify crasher dumps properly.
Change-Id: Ia18430ab107e9f7bdf0e14a9b74710b1280bd7f4
Android.bp assumed only an armv7-a-neon core needs to set HAS_VFP_D32.
In fact, an armv8 core also has 32 double-word floating point registers
for A32 and T32 ISAs (AArch32 or 32-bit armv8).
Bug: 65568426
Test: lunch aosp_arm64; emulator # on oc-mr1-dev; boot to home screen.
Check crashglue.o actually uses VFP_D16-31 for 32-bit armv8 core.
Change-Id: I34584a27fa24a55bb4809ccd7f99a8122971df0e
The order of arguments is wrong - we're passing flags=static_cast<unsigned>(-1)
and backlog=LEV_OPT_CLOSE_ON_FREE (which is 2).
On versions of libevent prior to 2.1.8, this ends up accidentally setting
OPT_LEAVE_SOCKETS_BLOCKING, OPT_CLOSE_ON_EXEC, OPT_REUSABLE and OPT_THREADSAFE
and limiting our backlog to two. These unintentional changes are relatively
benign; we never make our sockets block, we never exec, we never reuse
sockets and the additional locking overhead should be negligible. The
backlog of two might be a problem in theory, but there haven't been any
reports of issues caused by it.
Things get worse on 2.1.8 - that version introduces several new flags,
one of which is OPT_DISABLED. This disables the new listener by default,
which means that our event loop returns early because it has no active listeners
for any of its events.
Bug: 64543673
Test: Manual.
Change-Id: I9954bc7fe1af761de1a950d935dd2e6ce7e2c5f5