Commit graph

35 commits

Author SHA1 Message Date
Mark Salyzyn
6d9e515905 llkd: Use more inclusive language
Documentation is synchronized to match external, to ease updating.

blacklist is replaced with ignorelist or ignore depending on context.

Test: none
Change-Id: I6db7ad321684759e3c5ac1f66f940b6e8a5709a0
2020-06-16 10:28:14 -07:00
Woody Lin
e5a09d8da5 llkd: Print thread group before panic the kernel
Debugging information that explains why a process is stuck as a
zombie can be found by observing the call trace and running status of
the threads in the zombie process's thread group.

Bug: 154667692
Change-Id: Icd7fa2161e88b08fd5ce0d5dc3a3790ed4ac02d1
2020-04-24 02:55:00 +00:00
Treehugger Robot
0a86d01080 Merge changes Ibb4b4ca4,I31572afa
* changes:
  llkd: test: llkd.sleep also check for __arm64_sys_openat
  llkd: requires sys_admin permissions
2020-01-17 15:39:45 +00:00
Marco Ballesio
38c735e6ef llkd: ignore frozen processes
Verify a process's frozen state by reading its freezer cgroup value, and
do not consider it loop-locked if it is frozen.

Bug: 145698592
Test: llkd_unit_test
Test: Manually froze a few processes and waited for llkd timeout, verifying that
      no processes are killed, no reboot or ramdump occur and no llkd events are
      logged.
Change-Id: Iea02cd86dbd1df0e6658d02581aa4bb9b658f107
2020-01-16 22:55:37 +00:00
Mark Salyzyn
92f7bbfbe5 llkd: test: llkd.sleep also check for __arm64_sys_openat
The 4.19 kernel reports __arm64_sys_openat instead of SyS_openat, so
the test for llkd.sleep needs to check for that as well.

Signed-off-by: Mark Salyzyn <salyzyn@google.com>
Bug: 147486902
Test: llkd_unit_test
Change-Id: Ibb4b4ca45391e35fd03fcb8e7ccea01f547b76e1
2020-01-15 09:08:48 -08:00
Mark Salyzyn
be2e2f2beb llkd: requires sys_admin permissions
As a result of commit f8a00cef17206ecd1b30d3d9f99e10d9fa707aa7
("proc: restrict kernel stack dumps to root")
the userdebug feature where llkd can monitor for live lock
signatures in the stack traces broke.

So the userdebug variant of llkd now requires sys_admin permissions.

Signed-off-by: Mark Salyzyn <salyzyn@google.com>
Test: llkd_unit_test
Bug: 147486902
Change-Id: I31572afa08daa490a69783855bce55313eaed96c
2020-01-15 09:08:48 -08:00
Mark Salyzyn
16649d22ca llkd: do not call sync()
sync() will never return if the io subsystem is locked up, drop it.

Test: llkd_unit_test
Bug: 122263600
Change-Id: Ib378124415ce94da987d73391b027dc10317dbe9
2019-01-10 12:52:35 -08:00
Mark Salyzyn
da2aeb0c42 llkd: handle 'adbd shell setsid' to preserve adbd
A zombie setsid process occurs when adb shell setsid <command> is
issued.  However, llkd can only detect whether it is the result of a
kernel livelock by killing the associated parent, which would be adbd,
terminating the adb connection(s).  Special-case this condition in
order to preserve adbd for debugging purposes.

We parse <parent>&<child> in ro.llk.blacklist.parent as this
association, thus adbd&[setsid] covers this special case.
Ampersand was selected because it is never part of a process name.
However, a setprop in the shell requires it to be escaped or quoted;
the init rc file, where this is normally specified, does not have this
issue.
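
As a rough illustration of the <parent>&<child> form, the sketch below splits one list entry on the ampersand. The names ParentChild and splitParentChild are hypothetical and are not taken from llkd; the actual parsing may differ.

```cpp
#include <string>

// Hypothetical result type (not from llkd): one ro.llk.blacklist.parent entry.
struct ParentChild {
    std::string parent;  // text before '&', or the whole entry if no '&'
    std::string child;   // text after '&', empty if no '&'
    bool paired;         // true only for the <parent>&<child> form
};

// Hypothetical helper: split an entry such as "adbd&[setsid]".
ParentChild splitParentChild(const std::string& entry) {
    const std::string::size_type pos = entry.find('&');
    if (pos == std::string::npos) {
        return ParentChild{entry, "", false};
    }
    return ParentChild{entry.substr(0, pos), entry.substr(pos + 1), true};
}
```

With this sketch, "adbd&[setsid]" yields parent "adbd" and child "[setsid]", while a plain entry such as "[kthreadd]" is returned unpaired.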

getComm() is effectively pure, so hold on to the return value for
sake of efficiency.

This also reverts commit 599958d114
which granted adbd blanket parent immunity from monitoring on
userdebug builds.  The new logic is a more refined means of
preserving the live lock checking associated with adbd and allows
the operation to be performed on user builds.

POC: date ; adb shell setsid sleep 900 ; date
The bug is present if the elapsed time is less than 15 minutes;
otherwise it is solved.

Test: llkd_unit_test
Bug: 120983740
Change-Id: I6442463a48499d925a3a074423a24a1622905559
2019-01-04 13:49:48 -08:00
Mark Salyzyn
8a5f081763 llkd: enhance list properties
Because of the limited length of properties, and to ease the
complexity of product and vendor adjustments, the comma separated
list properties will use a leading comma to preserve the defaults
and add or subtract entries with + and - prefixes respectively.

Without the leading comma, the list is explicitly specified as before.
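
The leading-comma rule can be sketched as follows. This is a minimal illustration, not the llkd parser: applyListProperty is a hypothetical name, it assumes a non-empty property value, and treating an unprefixed entry in a leading-comma list as an addition is an assumption.

```cpp
#include <algorithm>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch (not the llkd code). Applies a property value to the
// built-in defaults:
//   "a,b"     -> explicit list, defaults discarded
//   ",+c,-a"  -> keep the defaults, add c, remove a
// Assumes `value` is non-empty.
std::vector<std::string> applyListProperty(std::vector<std::string> defaults,
                                           const std::string& value) {
    const bool keepDefaults = !value.empty() && value[0] == ',';
    std::vector<std::string> result;
    if (keepDefaults) result = std::move(defaults);

    std::istringstream in(value);
    std::string item;
    while (std::getline(in, item, ',')) {
        if (item.empty()) continue;  // e.g. the field before the leading comma
        const bool remove = keepDefaults && item[0] == '-';
        const bool add = keepDefaults && item[0] == '+';
        if (remove || add) item.erase(0, 1);  // strip the prefix
        if (item.empty()) continue;
        std::vector<std::string>::iterator it =
                std::find(result.begin(), result.end(), item);
        if (remove) {
            if (it != result.end()) result.erase(it);
        } else if (it == result.end()) {
            result.push_back(item);  // avoid duplicate entries
        }
    }
    return result;
}
```

For example, with defaults cma_alloc and bit_wait_io, the value ",+SyS_openat,-cma_alloc" leaves bit_wait_io and SyS_openat, whereas "SyS_openat" alone replaces the defaults entirely.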

Cleanup:
- use empty() instead of size() == 0 (or the converse if != 0)
- if (unlikely) pprocp can not be allocated, do a to_string(ppid) check

For testing, observe the llkd_unit_test output below before and after
to confirm the effect of the leading comma, for example:

livelock: ro.llk.stack=wait_on_page_bit_killable,bit_wait_io,\
                       __get_user_pages,cma_alloc
livelock: ro.llk.stack=...,SyS_openat,...

Test: llkd_unit_test
Bug: 120983740
Change-Id: Ia3d164c2fdac5295a474c6c1294a34e4ae9d0b61
2019-01-04 11:43:15 -08:00
Mark Salyzyn
b658ffa2f3 llkd: check stack for wait_on_page_bit_killable
User process in S state blocked by deadlock in I/O system

wait_on_page_bit is covered by regular D state tracking.

Bug: 120776455
Test: long term stability on multiple devices
Change-Id: Icdb99b8095f384cb440f0f2bdeba86c5991b9ef4
2019-01-04 10:46:05 -08:00
Mark Salyzyn
22e05fb5a4 llkd: report stack signature matched
Adjusted debugging messaging to add clarity.  Report _which_ stack
signature matched that triggered the kernel panic.  Reduce the noise
associated with missing /stack to VERBOSE as that is for development
debugging only.

Test: observe during unit test we see something like following logs:

livelock: Found SyS_openat in stack for pid XXX
livelock: S 120.000s XXX->YYY port-bridge [kill]
livelock: Killing '/vendor/bin/port-bridge' (XXX) to check forward\
          scheduling progress in S state for\
          '/vendor/bin/port-bridge' (YYY)
. . .
livelock: Found SyS_openat in stack for pid XXXXX
livelock: S 120.000s XXXXX->XXXXX llkd_unit_test [kill]
livelock: Killing '/data/nativetest64/llkd_unit_test/llkd_unit_test\
          (XXXXX) to check forward scheduling progress in S state

Test: llkd_unit_test
Bug: 33808187
Change-Id: Ifac7dd9a656208563bb20e28739abb741358d964
2019-01-04 10:46:05 -08:00
Mark Salyzyn
eb9b11c096 Merge "llkd: adbd is allowed to be a bad parent on userdebug" 2018-12-17 21:51:29 +00:00
Mark Salyzyn
00289c9c79 Merge "llkd: make 100% sure process that triggers panic still exists" 2018-12-14 20:35:08 +00:00
Mark Salyzyn
599958d114 llkd: adbd is allowed to be a bad parent on userdebug
If an errant process is identified, and the parent is adbd,
on userdebug builds bypass monitoring the process.

Test: unit tests
Bug: 120983740
Change-Id: I2af0797504326b0e95aa4377918b35d17716a6a2
2018-12-14 08:44:33 -08:00
Chih-Hung Hsieh
1b7b7979af Fix performance-for-range-copy warnings
Bug: 30413223
Test: make with WITH_TIDY=1 DEFAULT_GLOBAL_TIDY_CHECKS=-*,performance*
Change-Id: I3ad102f2b0f971266d57488a3bd57d312f7ee3e6
2018-12-11 10:51:13 -08:00
Mark Salyzyn
fbc3a75ef4 llkd: make 100% sure process that triggers panic still exists
There is time between inspection, filtration, determination and
dumping before triggering the panic, so make 100% sure the process
still exists.  If we had one false start, but another process triggers
a panic in the same pass, then recognize that we have already
dumped the data and skip the dump for the later ones.

Test: llkd unit test.
Bug: 33808187
Bug: 120378563
Bug: 120229612
Change-Id: Iacaf82a3d58e5a3c18edcff3c8fa540b21da36f1
2018-12-10 13:34:02 -08:00
Mark Salyzyn
b3418a2255 llkd: do not crash kernel if llkd stops running
Today, assume llkd is not hardened enough to 100% guarantee that
lack of progress in inspection loops is a direct result of a
livelock condition affecting llkd itself.  Log a fatal alarm to
make init restart llkd instead for the time being.

ToDo: develop trust in llkd regarding sigalarm causes.

Test: compile
Bug: 119781757
Change-Id: I668dc1773898da6c95aad7221724b16f1684b067
2018-11-19 15:26:20 -08:00
Mark Salyzyn
3c3b14d0de llkd: stutter pre-panic message to both last kernel and last logcat
Test: compile
Bug: 118712403
Change-Id: I9067b9335b2685169bcf8b1dc0248f7ff4315046
2018-11-01 08:14:03 -07:00
Mark Salyzyn
53e782d531 llkd: Add waiting task to dump
Consider reporting d (lock dump) and w (waiting tasks) after
t (stack dump) as the ramoops buffer could be overflowed.

Test: compile
Bug: 118712403
Change-Id: I64fac7e13c14a1cbc45c9e35fe7746f9b778dcf4
2018-11-01 08:13:53 -07:00
Mark Salyzyn
bd7c856507 llkd: add ro.llk.sysrq_t
Allow sysrq stack trace dump to be disabled by ro.llk.sysrq_t.

Default is true if not on a limited memory device ro.config.low_ram.
Value is true if the property value is "eng" and on a userdebug or
eng device, signaled by the ro.debuggable set to 1.

Test: compile
Bug: 118712403
Change-Id: I02e999dc640125b6a08dca10077716e5d006da49
2018-11-01 08:13:44 -07:00
Mark Salyzyn
bb1256a728 llkd: add bit_wait_io to stack monitoring
This will discover if the I/O is starved.

Add the ability to search for " <symbol>.cfi+0x".
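
A minimal sketch of such a search is shown below. The helper name stackHasSymbol is hypothetical, and the plain " <symbol>+0x" form is an assumption about the match that existed before this change; the input is assumed to be the text of a kernel stack dump.

```cpp
#include <string>

// Hypothetical helper (not the llkd code): report whether a kernel stack
// dump mentions `symbol` as " <symbol>+0x" (assumed pre-existing form) or
// as " <symbol>.cfi+0x" (the form named in this change).
bool stackHasSymbol(const std::string& stack, const std::string& symbol) {
    return stack.find(" " + symbol + "+0x") != std::string::npos ||
           stack.find(" " + symbol + ".cfi+0x") != std::string::npos;
}
```

The leading space and the trailing "+0x" keep a symbol such as bit_wait_io from matching longer names like __bit_wait_io or shorter ones like bit_wait.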

Cleaned up README.md to reflect current defaults.

Test: none
Bug: 113648929
Change-Id: I990a54f99de536406fd752a490e60f962380d71a
2018-11-01 08:13:33 -07:00
Mark Salyzyn
e81ede85c7 llkd: Skip apexd for process checks
apexd is a sensitive daemon, and the ability to ptrace this domain is
restricted by SELinux policy.  apexd spawns a binder thread which
makes matching difficult, as we would instead need to use
/system/bin/apexd as the blacklist key.

Change llkd to also check for a match on the basename of the
executable path.  This removes a gotcha when creating a blacklist
key.
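
The basename match can be pictured with the sketch below. Both helper names are hypothetical, and the real comparison in llkd may consider more name sources than the executable path alone.

```cpp
#include <string>

// Hypothetical helper: final path component, e.g. "apexd" for
// "/system/bin/apexd".
std::string baseName(const std::string& path) {
    const std::string::size_type slash = path.rfind('/');
    return slash == std::string::npos ? path : path.substr(slash + 1);
}

// Hypothetical check: a list key matches either the full executable path
// or just its basename.
bool matchesKey(const std::string& exePath, const std::string& key) {
    return key == exePath || key == baseName(exePath);
}
```

Under this sketch, either "apexd" or "/system/bin/apexd" works as a key for the same process.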

Without this change, llkd continues to generate SELinux denials of

type=1400 audit(0.0:1764): avc: denied { ptrace } for comm="llkd" scontext=u:r:llkd:s0 tcontext=u:r:apexd:s0 tclass=process permissive=0

Commit 5390b9add4 was originally intended
to fix these denials, but it seems to have had no effect and the denials
are still being generated.  This change will fix it.

Test: none
Change-Id: I00aa10dfff30c65a120ad30582b820e2d4b1bb38
2018-10-22 16:11:02 -07:00
Nick Kralevich
5390b9add4 llkd: Do not check apexd by default for stack
apexd is now blocked by sepolicy, so skip checking it to
prevent an avc warning.

See system/sepolicy commit ac097ac4c7718f8593f2b6b96a93a776984ec7c4

Addresses the following SELinux denial:

type=1400 audit(0.0:386): avc: denied { ptrace } for comm="llkd" scontext=u:r:llkd:s0 tcontext=u:r:apexd:s0 tclass=process permissive=0

Test: manual
Change-Id: Iad24447c8200e915ac8397a8f84923feebc20613
2018-10-15 09:17:40 -07:00
Mark Salyzyn
2cfc8d9b78 llkd: Do not check ueventd by default for stack
ueventd is now blocked by sepolicy, so skip checking it to
prevent an avc warning.

Test: manual
Bug: 33808187
Change-Id: I7b7a53604b83ee18a47db38d7a8260ab4226d531
2018-10-08 10:32:51 -07:00
Mark Salyzyn
6cc230f587 Merge changes I49c9f064,I946e8564
* changes:
  llkd: Add cma_alloc stack symbol checking
  llkd: Add __get_user_page stack symbol checking
2018-10-08 14:22:18 +00:00
Tom Cherry
e0bc5a9aa2 Use only signed/unsigned numbers with ParseInt/ParseUint respectively
Test: build
Change-Id: I4d950d4aa8d24c90d1fc9b1cbea0f324aeed56a3
2018-10-05 14:30:39 -07:00
Mark Salyzyn
00b2ce7005 llkd: Add cma_alloc stack symbol checking
Add ro.llk.stack to list a set of symbols that should rarely happen
but if persistent in multiple checks, indicates a live lock condition.
At ro.llk.stack.timeout_ms the process is sent a kill, if it remains,
then panic the kernel.
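
The escalation described above (persist, then kill, then panic) might be tracked per process roughly as follows. StackWatch, Action and update are hypothetical names, and the reset-on-absence rule is an assumption about what "persistent in multiple checks" means.

```cpp
#include <chrono>

using std::chrono::milliseconds;

enum class Action { None, Kill, Panic };

// Hypothetical per-process tracker (not the llkd data structure).
struct StackWatch {
    milliseconds seen{0};  // how long a listed symbol has been observed
    bool killed = false;   // a kill was already sent in this episode

    // Call once per sampling pass for a process that still exists.
    // `symbolPresent`: a listed symbol was found in the stack this pass.
    // `elapsed`: time since the previous pass.
    // `timeout`: stands in for ro.llk.stack.timeout_ms.
    Action update(bool symbolPresent, milliseconds elapsed,
                  milliseconds timeout) {
        if (!symbolPresent) {  // assumed: absence resets the episode
            seen = milliseconds{0};
            killed = false;
            return Action::None;
        }
        seen += elapsed;
        if (seen < timeout) return Action::None;
        if (!killed) {
            killed = true;
            return Action::Kill;
        }
        return Action::Panic;  // still stuck on the symbol after the kill
    }
};
```

Because there is no ABA detection, a sketch like this cannot tell one long stall from repeated short visits that happen to coincide with each sample, which is why only rarely seen symbols are suitable.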

There is no ABA detection in the paths, the condition for the
stack symbol being present instantaneously must be its rarity of
being caught.  If a livelock occurs in the path of the symbol, then
it is possible more than one path could be stuck in the state, but
the best candidate symbols are found underneath a lock resulting in
only one process being the culprit, and the best aim.  There may be
processes that induce a look of persistence, if so the symbol is not
a candidate for checking.

Adding cma_alloc to the default list.  It is not behind a lock, so
multiple references can happen.  The hope is the first one to spin or
wait gets the kill, but there is the possibility that both will get
the kill.  It is unknown if this will escalate to a kernel panic at
this time.  It is also suspected that an RT task could cause this by
starving the background worker, and llkd could suffer a similar fate
as it is SCHED_BATCH policy.

Test: compile
Bug: 33808187
Bug: 111910505
Bug: 80502612
Change-Id: I49c9f0646d627869144c5c1ca32272515ed60f7b
2018-08-30 13:53:28 -07:00
Mark Salyzyn
a9afe5933d llkd: Add __get_user_page stack symbol checking
Add ro.llk.stack to list a set of symbols that should rarely happen
but if persistent in multiple checks, indicates a live lock condition.
At ro.llk.stack.timeout_ms the process is sent a kill, if it remains,
then panic the kernel.

There is no ABA detection in the paths, the condition for the
stack symbol being present instantaneously must be its rarity of
being caught.  If a livelock occurs in the path of the symbol, then
it is possible more than one path could be stuck in the state, but
the best candidate symbols are found underneath a lock resulting in
only one process being the culprit, and the best aim.  There may be
processes that induce a look of persistence, if so the symbol is not
a candidate for checking.

The current candidate is __get_user_pages, after mm is locked,
should be a very short reference to look up a page, but can be
longer term if starved, or a condition causes a conflicting loop.

Test: compile
Bug: 33808187
Change-Id: I946e85641e59229b7491e929fcab5f1240794254
2018-08-30 13:53:28 -07:00
Mark Salyzyn
96505fad80 llkd: Add stack symbol checking
The feature outlined here is only available on userdebug or eng builds.
It is blocked on other builds for security reasons because it requires
ptrace capabilities.

Add ro.llk.stack to list a set of symbols that should rarely happen
but if persistent in multiple checks, indicates a live lock condition.
At ro.llk.stack.timeout_ms the process is sent a kill, if it remains,
then panic the kernel.

There is no ABA detection in the paths, the condition for the
stack symbol being present instantaneously must be its rarity of
being caught.  If a livelock occurs in the path of the symbol, then
it is possible more than one path could be stuck in the state, but
the best candidate symbols are found underneath a lock resulting in
only one process being the culprit, and the best aim.  There may be
processes that induce a look of persistence, if so the symbol is not
a candidate for checking.

Add ro.llk.blacklist.process.stack to list process names we want
to skip checking.  This configuration parameter is also used to
prevent sepolicy noise when trying to acquire stacks from non
ptrace'able services.

Test: gTest llkd_unit_tests
Bug: 33808187
Bug: 111910505
Bug: 80502612
Change-Id: Ie71221e371b189bbdda2a1155d47826997842dcc
2018-08-30 13:53:19 -07:00
Mark Salyzyn
4832a8bd76 llkd: clear PR_SET_DUMPABLE
Test: compile
Bug: 33808187
Bug: 111910505
Bug: 80502612
Change-Id: I21ed937d79b3eb81b67ad145664ea82413fb65fd
2018-08-28 13:13:50 -07:00
Mark Salyzyn
acecaf7216 llkd: llkSplit should prevent empty entries
Add "false" as an option fed into llkSplit to be equivalent to empty,
as a truly empty list is replaced with the internal defaults.  Ensure
that no empty entries are added to the returned list.  Add some
additional provisos to README.md, as well as the explanation of what
"false" means for the associated properties.

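The rules above can be sketched as two small functions. This is an illustration only: splitList and resolveList are hypothetical names, and the real llkSplit signature and return type may differ.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical re-creation of the described behavior (not llkSplit itself).
std::vector<std::string> splitList(const std::string& value) {
    std::vector<std::string> out;
    if (value == "false") return out;  // "false" means an explicitly empty list
    std::istringstream in(value);
    std::string item;
    while (std::getline(in, item, ',')) {
        if (!item.empty()) out.push_back(item);  // never add empty entries
    }
    return out;
}

// A truly empty property value falls back to the internal defaults.
std::vector<std::string> resolveList(const std::string& value,
                                     const std::vector<std::string>& defaults) {
    return value.empty() ? defaults : splitList(value);
}
```

The distinction matters because an unset property and a deliberately emptied list would otherwise be indistinguishable.
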
Test: llkd_unit_test
Bug: 33808187
Bug: 111910505
Bug: 80502612
Change-Id: Iac0457ea1f6cd559b0875f9871dbae839001276d
2018-08-13 14:43:30 +00:00
Mark Salyzyn
52e54a68e5 llkd: switch to std::literals
Replace std::string("<string>") with "<string>"s
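
For readers unfamiliar with the suffix, a small before/after comparison follows. The function names and string contents are made up for illustration.

```cpp
#include <string>

using namespace std::literals;  // brings operator""s into scope (C++14)

// Before: explicit std::string construction.
std::string joinBefore() {
    return std::string("livelock") + ": " + "zombie";
}

// After: the s suffix yields a std::string directly.
std::string joinAfter() {
    return "livelock"s + ": " + "zombie";
}
```

Both functions build the same string; the literal form is simply shorter at each call site.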

Test: build
Bug: 33808187
Bug: 111910505
Bug: 80502612
Change-Id: I8c7915b26719081d93f486c22d9a40b4ad548085
2018-08-09 07:58:08 -07:00
Mark Salyzyn
afd66f2fd3 llkd: bootstat: propagate detailed livelock canonical boot reason
Report kernel_panic,sysrq,livelock,<state> reboot reason via last
dmesg (pstore console).  Add ro.llk.killtest property, which will
allow reliable ABA platforms to drop kill test and go directly
to kernel panic.  This should also allow some manual unit testing
of the canonical boot reason report.

New canonical boot reasons from llkd are:
- kernel_panic,sysrq,livelock,alarm llkd itself locked up (Hail Mary)
- kernel_panic,sysrq,livelock,driver uninterruptible D state
- kernel_panic,sysrq,livelock,zombie uninterruptible Z state

Manual test assumptions:
- llkd is built by the platform and landed on system partition
- unit test is built and landed in /data/nativetest (could
  land in /data/nativetest64, adjust test correspondingly)
- llkd not enabled, ro.llk.enable and ro.llk.killtest
  are not set by platform allowing test to adjust all the
  configuration properties and start llkd.
- or, llkd is enabled, ro.llk.enable is true, and killtest is
  disabled, ro.llk.killtest is false, setup by the platform.
  This breaks the go/apct generic operations of the unit test
  for llk.zombie and llk.driver as kernel panic results
  requiring manual intervention otherwise.  If test moves to
  go/apct, then we will be forced to bypass these tests under
  this condition (but allow them to run if ro.llk.killtest
  is "off" so specific testing above/below can be run).

for i in driver zombie; do
        adb shell su root setprop ro.llk.killtest off
        adb shell /data/nativetest/llkd_unit_test/llkd_unit_test --gtest_filter=llkd.${i}
        adb wait-for-device
        adb shell su root setprop ro.llk.killtest off
        sleep 60
        adb shell getprop sys.boot.reason
        adb shell /data/nativetest/llkd_unit_test/llkd_unit_test --gtest_filter=llkd.${i}
done

Test: llkd_unit_test (see test assumptions)
Bug: 33808187
Bug: 72838192
Change-Id: I2b24875376ddfdbc282ba3da5c5b3567de85dbc0
2018-04-18 14:02:16 -07:00
Mark Salyzyn
d035dbbecf llkd: default enabled for userdebug
If LLK_ENABLE_DEFAULT is false, check "ro.llk.enable" for the value
"eng", which is also the default when the property is not set, and in
that case check whether this is a userdebug build to establish a
default of true for enable.  The same applies to ro.khungtask.enable.
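
One way to read this rule is as a three-way property, sketched below. resolveEnable is a hypothetical name, `debuggable` stands in for the userdebug build check, and the accepted spellings of true are assumptions.

```cpp
#include <string>

// Hypothetical resolver (not the llkd code). `value` is the raw property
// string, empty when the property is not set.
bool resolveEnable(const std::string& value, bool debuggable) {
    // "eng", or an unset property, defers to the build type.
    if (value.empty() || value == "eng") return debuggable;
    // Assumed spellings of an explicit true; anything else is false.
    return value == "true" || value == "1";
}
```

So an unset property enables the daemon on a debuggable build and leaves it off elsewhere, while an explicit value wins in either case.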

Test: llkd_unit_test report eng status on "userdebug" or "user" builds
Bug: 33808187
Bug: 72838192
Change-Id: I2adb23c7629dccaa2856c50bccbf4e363703c82c
2018-04-18 14:02:05 -07:00
Mark Salyzyn
f089e1403b llkd: add live-lock daemon
Introduce a standalone live-lock daemon (llkd), to catch kernel
or native user space deadlocks and take mitigating actions.  Will
also configure [khungtaskd] to fortify the actions.

If a thread is in D or Z state with no forward progress for longer
than ro.llk.timeout_ms, or ro.llk.[D|Z].timeout_ms, kill the process
or parent process respectively.  If another scan shows the same
process continues to exist, then we have a confirmed live-lock
condition and need to panic.  Panic the kernel in a manner that
provides the greatest bugreporting detail about the condition.  Add an
alarm self-watchdog, in case llkd itself ever locks up, set to double
the expected time to flow through the main loop.  Sampling is every
ro.llk_sample_ms.
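
As an illustration of the first step, the sketch below extracts the state letter from a /proc/<pid>/stat line, whose layout is "pid (comm) state ...". Reading the state this way is an assumption about how a scanner could obtain it, and both helper names are hypothetical.

```cpp
#include <string>

// Hypothetical helper: return the one-character state from a
// /proc/<pid>/stat line. comm may contain spaces or parentheses, so the
// search starts from the last ')'. Returns '\0' for a malformed line.
char procState(const std::string& statLine) {
    const std::string::size_type close = statLine.rfind(')');
    if (close == std::string::npos || close + 2 >= statLine.size()) {
        return '\0';
    }
    return statLine[close + 2];  // skip the ')' and the following space
}

// The two states monitored according to the description above.
bool isMonitoredState(char state) {
    return state == 'D' || state == 'Z';
}
```

A scanner would call this once per sampling pass and only start timing a thread while isMonitoredState stays true.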

Default will not monitor init, or [kthreadd] and all that [kthreadd]
spawns.  This reduces the effectiveness of llkd by limiting its
coverage.  If, in the future, there is value in covering kthreadd-spawned
threads, the requirement will be to code drivers so that they do not
remain in a persistent 'D' state, or that they have mechanisms to
recover the thread should it be killed externally.  Then the
blacklists can be adjusted accordingly if these conditions are met.

An accompanying gTest set has been added, and will set up a persistent
D or Z process, with and without forward progress, but not in a
live-lock state because that would require a buggy kernel, or a module
or kernel modification to stimulate.

Android Properties llkd responds to (*_ms parms are in milliseconds):
- ro.config.low_ram default false, if true do not sysrq t (dump
  all threads).
- ro.llk.enable default false, allow live-lock daemon to be enabled.
- ro.khungtask.enable default false, allow [khungtaskd] to be enabled.
- ro.llk.mlockall default true, allow mlock'd live-lock daemon.
- ro.khungtask.timeout default 12 minutes.
- ro.llk.timeout_ms default 10 minutes, D or Z maximum timelimit,
  double this value and it sets the alarm watchdog for llkd.
- ro.llk.D.timeout_ms default ro.llk.timeout_ms, D maximum timelimit.
- ro.llk.Z.timeout_ms default ro.llk.timeout_ms, Z maximum timelimit.
- ro.llk.check_ms default 2 minutes sampling interval
  (ro.llk.timeout_ms / 5) for threads in D or Z state.
- ro.llk.blacklist.process default 0,1,2 (kernel, init and
  [kthreadd]), and process names (/comm or /cmdline) init,[kthreadd],
  lmkd,lmkd.llkd,llkd,[khungtaskd],watchdogd,[watchdogd],
  [watchdogd/0] ...
- ro.llk.blacklist.parent default 0,2 (kernel and [kthreadd]) and
  "[kthreadd]".  A comma separated list of process ids, /comm names
  or /cmdline names.
- ro.llk.blacklist.uid default <empty>, comma separated list of
  uid numbers or names from getpwuid/getpwnam.
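
The relationships between the timing defaults in the list above can be written out as a short calculation. LlkTimeouts and deriveTimeouts are hypothetical names used only for this sketch.

```cpp
#include <chrono>

// Hypothetical grouping of the derived values (not an llkd type).
struct LlkTimeouts {
    std::chrono::milliseconds timeout;   // ro.llk.timeout_ms
    std::chrono::milliseconds check;     // default sampling interval
    std::chrono::milliseconds watchdog;  // alarm watchdog for llkd itself
};

// The sampling default is timeout / 5 and the watchdog is double the timeout.
LlkTimeouts deriveTimeouts(std::chrono::milliseconds timeout) {
    return {timeout, timeout / 5, timeout * 2};
}
```

With the 10 minute default this gives a 2 minute sampling interval and a 20 minute watchdog, matching the values listed.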

Test: llkd_unit_test
Bug: 33808187
Bug: 72838192
Change-Id: I32e8aa78aef10834e093265d0f3ed5b4199807c6
2018-04-18 14:01:56 -07:00