Remove the Service::SetSigchldFd() method. Make the Service::GetSigchldFd()
create a signalfd for SIGCHLD. This makes it possible to use a SIGCHLD
signalfd in unit tests.
Change-Id: I0b41caa8f46c79f4d400e49aaba5227fad53c251
Signed-off-by: Bart Van Assche <bvanassche@google.com>
Reduce the time spent in WaitToBeReaped() by waiting for SIGCHLD instead
of waiting for 50 ms.
Bug: 308687042
Change-Id: I5e259fdd22dec68e45d27205def2fc6463c06ca3
Signed-off-by: Bart Van Assche <bvanassche@google.com>
The max_processes calculation is incorrect for KillProcessGroup because
the set of processes in cgroup.procs can differ between the multiple
reads in the implementation. Luckily the exact value isn't very
important because it's just logged. Remove max_processes from the API
and remove the warning about the new behavior in Android 11.
Note that we still always LOG(INFO) that any cgroup is being killed.
Bug: 301871933
Change-Id: I8e449f5089d4a48dbc1797b6d979539e87026f43
OEM can add self-owned groups, but the system init cannot support if the group numbers are over than 12, relax some restrictions as appropriate.
Bug: b/296826987
Signed-off-by: Haichao Li <liuhc3@motorola.com>
Change-Id: I231d9f6c82e93c08bc97ca32df70e5b28760acbc
This CL allows restart_period to be set to a value shorter than 5s.
Previously this was prohibited to rate limit crashing services. That
behavior is considered to be a bit too conservative because some
services don't crash, but exit deliverately.
adbd is the motivating example. When adb root or adb unroot is
requested, it changes its mode of operation (via sysprop), exits itself,
and restarts (by init) to enter into the mode. However, due to the 5s
delay, the mode change can complete no earlier than 5 seconds after adbd
was started last time. This can slow the mode change when it is
requested right after the boot.
With this CL, restart_period can be set to a value smaller than 5. And
services like adbd can make use of it. However, in ordef to rate limit
crashing service, the default is enforced if the service was crashed
last time. In addition, such intended restart is not counted as crashes
when monitoring successive crashes during booting.
Bug: 286061817
Test: /packages/modules/Virtualization/vm/vm_shell.sh start-microdroid \
--auto-connect -- --protected
* with this change: within 2s
* without this change: over 6s
Change-Id: I1b3f0c92d349e8c8760821cf50fb69997b67b242
NOTE: in master, but should be submitted in AOSP.
Waiting to hear from security folks. Also might
need cleanup.
Not currently done. Seems errorprone.
Bug: 276813155
Test: boot, check logs
Change-Id: I7cbc39b282889dd582f06a8eedc38ae637c8edec
If a service specifies gentle_kill, attempt to stop it will send SIGTERM
instead of SIGKILL. After 200ms, it will issue a SIGKILL.
Bug: 249043036
Test: atest CtsInitTestCases:init#GentleKill
Added in next patch
Change-Id: Ieb0e4e24d31780aca1cf291f9d21d49cee181cf2
If a service that runs under root doesn't have the capabilities field in
it's definition, then it will inherit all the capabilities that init
has.
This change adds a linter to detect such services and ask developers to
explicitly specify capabilities that their service needs. If service
doesn't require any capabilities then empty capabilities fields should
be added in the service definition.
The actual access control list on what capabilities a process can use is
controlled by the SELinux, so inheriting all the init capabilities is
not a security issue here. However, asking services to explicitly
specify the capabilities they need is a good defense-in-depth mechanism.
So far this linter only checks the services on /system partition.
All currently offending services are added to the exempt list. I will
work on fixing some of them in the follow-up changes.
Bug: 249796710
Test: m dist
Change-Id: I2db06af165ae320a9c5086756067dceef20cd28d
Service objects have external state (the child process) and hence must
not be duplicated. Disable the copy constructor and the assignment
operator to prevent that these objects get duplicated accidentally.
Bug: 213617178
Change-Id: Ia5391154b94eca7f12be69eabcdf3f173fc06452
Signed-off-by: Bart Van Assche <bvanassche@google.com>
Revert commit 14f9c15e05 ("init: Add more diagnostics for signalfd
hangs") because:
* That commit was intented to help with root-causing b/223076262.
* The root cause of b/223076262 has been fixed (not blocking SIGCHLD
in all threads in the init process).
Test: Treehugger
Change-Id: I586663ec0588e74a9d58512f7f31155398cf4f52
Signed-off-by: Bart Van Assche <bvanassche@google.com>
Multiple tests in CtsInitTestCases, e.g. RebootTest#StopServicesSIGKILL,
can trigger the following race condition:
* A service is started. This involves calling fork() and also to call
RunService() in the child process. RunService() calls setpgid().
* Service::Stop() is called and calls KillProcessGroup().
KillProcessGroup() calls kill(-pgid, SIGKILL) before the child process
has called setpgid(). pgid is the process ID of the child process. The
kill() call fails because setpgid() has not yet been called.
Fix this race condition by adding a setpgid() call in the parent process
and by waiting from the parent until the child has called setsid() if a
console is attached.
Bug: 213617178
Change-Id: Ieb9e6908df725447e3695ed66bb8bd30e4e38aa9
Signed-off-by: Bart Van Assche <bvanassche@google.com>
It is nontrivial to derive from the implementation of class Service
which members are not modified. Hence this CL that documents this by
declaring these members 'const'.
Change-Id: I27b907a1c7044376d5c5393a29050c66cbdab7bf
Signed-off-by: Bart Van Assche <bvanassche@google.com>
Multiple tests in CtsInitTestCases, e.g. RebootTest#StopServicesSIGKILL,
can trigger the following race condition:
* A service is started. This involves calling fork() and also to call
RunService() in the child process. RunService() calls setpgid().
* Service::Stop() is called and calls KillProcessGroup().
KillProcessGroup() calls kill(-pgid, SIGKILL) before the child process
has called setpgid(). pgid is the process ID of the child process. The
kill() call fails because setpgid() has not yet been called.
Fix this race condition by adding a setpgid() call in the parent process
and by waiting from the parent until the child has called setsid() if a
console is attached.
Bug: 213617178
Test: Cuttlefish + atest 'CtsInitTestCases'
Change-Id: I6931cd579e607c247b4f79a5b375455ca3d52e29
Signed-off-by: Bart Van Assche <bvanassche@google.com>
Multiple tests in CtsInitTestCases, e.g. RebootTest#StopServicesSIGKILL,
can trigger the following race condition:
* A service is started. This involves calling fork() and also to call
RunService() in the child process. RunService() calls setpgid().
* Service::Stop() is called and calls KillProcessGroup().
KillProcessGroup() calls kill(-pgid, SIGKILL) before the child process
has called setpgid(). pgid is the process ID of the child process. The
kill() call fails because setpgid() has not yet been called.
Fix this race condition by adding a setpgid() call in the parent process
and by waiting from the parent until the child has called setsid() if a
console is attached.
Bug: 213617178
Test: Cuttlefish + atest 'CtsInitTestCases'
Change-Id: I4c55790c2dcde8716b860aecd57708d51a081086
Signed-off-by: Bart Van Assche <bvanassche@google.com>
Prepare for introducing a second interprocess communication channel by
introducing the class InterprocessFifo. Stop using std::unique_ptr<> for
holding the pipe file descriptors. Handle EOF consistently.
Bug: 213617178
Change-Id: Ic0cf18d3d8ea61b8ee17e64de8a9df2736e26728
Signed-off-by: Bart Van Assche <bvanassche@google.com>
Passed apex file name to service. The file name will be parsed
to determine 1) whether the service is from an apex; 2) apex name
Bug: 236090201
Change-Id: I2c292c0c067f4bf44bb25b1f80e4f972b94f7258
servicemanager/hwservicemanager are pre-apexd services but still wants
to see VINTF fragments from APEXes, especially from /data.
Like ueventd, these services need to be started in "default" mount
namespace.
Bug: 237672865
Test: m && boot
Change-Id: I0266c5be5530a1a07f8ffa23a26186d45a55613f
This adds three more diagnostics to stuck exec services:
1. /proc/pid/fds is dumped
2. /proc/pid/status is dumped
3. HandleSignalFd is called to see if a SIGCHLD got stuck somewhere
Bug: 223076262
Test: while (1) in linkerconfig
Ignore-AOSP-First: diagnostics
Change-Id: Ida601d86e18be9d49b143fb88b418cbc171ecac6
This adds two new diagnostics. First, signalfd reads are now non-blocking. If the read takes more than 10 seconds, we log an error.
Second, init now wakes up from epoll() every 10 seconds. If it waits on an "exec" command for more than 10 seconds, it logs an error.
This change will be reverted as soon as we get feedback.
Bug: 223076262
Test: device boots
Change-Id: I7ee98d159599217a641b3de2564a92c2435f57ef
The Service::Start() method is so long that its length negatively
affects readability of the code. Hence this patch that splits
Service::Start().
Test: Booted Android in Cuttlefish.
Change-Id: I5a6f587ecc5e6470137de6cceda7e685bce28ced
Signed-off-by: Bart Van Assche <bvanassche@google.com>
The Service::Start() method is so long that its length negatively
affects readability of the code. Hence this patch that splits
Service::Start().
Test: Booted Android in Cuttlefish.
Change-Id: I972f4e60844bb0d133b1cca1fd4e06bb89fc5f37
Signed-off-by: Bart Van Assche <bvanassche@google.com>
The Service::Start() method is so long that its length negatively
affects readability of the code. Hence this patch that splits
Service::Start().
Test: Booted Android in Cuttlefish.
Change-Id: Ib8e1e87fbd335520cbe3aac2a88d250fcf3b4ff0
Signed-off-by: Bart Van Assche <bvanassche@google.com>
Remove the class_start_post_data and class_reset_post_data commands,
since they aren't used anymore. They were only used on devices that
used FDE (Full Disk Encryption), via actions in rootdir/init.rc. These
actions have been removed, since support for FDE has been removed.
There is no use case for these commands in vendor init scripts either.
Keep the mark_post_data command, since DoUserspaceReboot() uses the
post-data service flag even on non-FDE devices.
Bug: 191796797
Change-Id: Ibcd97543daa724feb610546b5fc2a0dd7f1e62e7
Currently there is no socket for daemon instances launched during the
selinux phase of init. We don't create any sockets due to the complexity
of the required sepolicy.
This workaround will allow us to create the socket with very minimal
sepolicy changes. init will launch a one-off instance of snapuserd in
"proxy" mode, and then the following steps will occur:
1. The proxy daemon will be given two sockets, the "normal" socket that
snapuserd clients would connect to, and a "proxy" socket.
2. The proxy daemon will listen on the proxy socket.
3. The first-stage daemon will wake up and connect to the proxy daemon
as a client.
4. The proxy will send the normal socket via SCM_RIGHTS, then exit.
5. The first-stage daemon can now listen and accept on the normal
socket.
Ordering of these events is achieved through a snapuserd.proxy_ready
property.
Some special-casing was needed in init to make this work. The snapuserd
socket owned by snapuserd_proxy is placed into a "persist" mode so it
doesn't get deleted when snapuserd_proxy exits. There's also a special
case method to create a Service object around a previously existing pid.
Finally, first-stage init is technically on a different updateable
partition than snapuserd. Thus, we add a way to query snapuserd to see
if it supports socket handoff. If it does, we communicate this
information through an environment variable to second-stage init.
Bug: 193833730
Test: manual test
Change-Id: I1950b31028980f0138bc03578cd455eb60ea4a58
This reverts commit 1c51525f66 because it
accidentally made reboot_on_failure be a no-op for all services. This
is because Reap() itself calls KillProcessGroup() on devices with a
vendor level >= R, which in turn sets SVC_STOPPING. I had overlooked
this somehow, probably because I didn't consider that a service can
consist of multiple processes.
It turns out that real FDE devices don't actually need the above commit
because FDE devices aren't allowed to have updatable apexes enabled, and
without updatable apexes enabled, apexd exits automatically and
therefore doesn't have to be stopped. This can be verified by using the
aosp_cf_x86_phone_noapex build target, rather than aosp_cf_x86_phone
which I had used for testing before. So just revert it for now.
Bug: 194370048
Change-Id: I90eddf2a87397449b241e5acaaa8d4a4241d73a9
Add a new service flag SVC_STOPPING which tracks whether a service is
being manually stopped by init, and make the "reboot_on_failure" service
setting not apply when SVC_STOPPING is set.
This is needed for devices that use FDE, because otherwise the device
reboots during the following init script fragment:
on property:vold.decrypt=trigger_shutdown_framework
class_reset late_start
class_reset main
class_reset_post_data core
class_reset_post_data hal
... because that stops all services, including apexd which has been
marked with reboot_on_failure since
https://android-review.googlesource.com/c/platform/system/apex/+/1325212.
So init was killing apexd, then rebooting the device because apexd
"failed" due to having been killed. Making reboot_on_failure not apply
when init stops a service itself fixes the problem.
This is one of a set of changes that is needed to get FDE working again
so that devices that launched with FDE can be upgraded to Android 12.
Bug: 186165644
Test: Tested FDE on Cuttlefish
Change-Id: I599f7ba107e6c126e8f31d0ae659f0ae672a25e4
Any service which is executed when Runtime apex is mounted, but
linkerconfig is not updated can fail to be executed due to missing
information in ld.config.txt. This change updates init to have a status
variable which contains if current mount namespace is default
and APEX is not ready from ld.config.txt, and use bootstrap namespace if
it is not ready.
Bug: 181348374
Test: cuttlefish boot succeeded
Change-Id: Ia574b1fad2110d4e68586680dacbe6137186546e
Also includes new --out_<partition> flags for
system,system_ext,product,vendor,odm
to allow host_init_verifier to work with a collection of init rc files.
Test: host_init_verifier --out_system=... --out_vendor=...
where vendor contains an init rc file that overrides a service
present in system. Observe parse failure and non-zero exit.
Bug: 163089173
Change-Id: I520fef613e0036df8a7d47a98d47405eaa969110
The critical services can now using the interface `critical
[window=<fatal crash window mins>] [target=<fatal reboot target>]` to
setup the timing window that when there are more than 4 crashes in it,
the init will regard it as a fatal system error and reboot the system.
Config `window=${zygote.critical_window.minute:-off}' and
`target=zygote-fatal' for all system-server services, so platform that
configures ro.boot.zygote_critical_window can escape the system-server
crash-loop via init fatal handler.
Bug: 146818493
Change-Id: Ib2dc253616be6935ab9ab52184a1b6394665e813
Introduce new command to allow setting task profiles from inside .rc
script. This is to replace usage of writepid when a service is trying
to join a cgroup. Usage example from a .rc file:
service surfaceflinger /system/bin/surfaceflinger
task_profiles HighPerformance
Bug: 155419956
Test: change .rc file and confirm task profile is applied
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Change-Id: I0add9c3b363a7cb1ea89778780896cae1c8a303c
Some services are lazy HALs on some platforms and not lazy HALs on
others; this is known at runtime by hwservicemanager, so this change
adds these properties to allow hwservicemanager to turn one oneshot
(for lazy HALs). It may also be required to make a lazy HAL not lazy
anymore, and oneshot_off is provided for this.
Bug: 147841742
Test: new unit test that turn on and off oneshot on a service (bootanim)
and observes that it follows the expected behavior
Change-Id: I79524e2c9a5008f90c8d3bc40920fde00602a439
This is apparently causing problems with reboot.
This reverts commit 7205c62933.
Bug: 150863651
Test: build
Change-Id: Ib8a4835cdc8358a54c7acdebc5c95038963a0419
A previous change moved property_service into its own thread, since
there was otherwise a deadlock whenever a process called by init would
try to set a property. This new thread, however, would send a message
via a blocking socket to init for each property that it received,
since init may need to take action depending on which property it is.
Unfortunately, this means that the deadlock is still possible, the
only difference is the socket's buffer must be filled before init deadlocks.
There are possible partial solutions here: the socket's buffer may be
increased or property_service may only send messages for the
properties that init will take action on, however all of these
solutions still lead to eventual deadlock. The only complete solution
is to handle these messages asynchronously.
This change, therefore, adds the following:
1) A lock for instructing init to reboot
2) A lock for waiting on properties
3) A lock for queueing new properties
4) A lock for any actions with ServiceList or any Services, enforced
through thread annotations, particularly since this code was not
designed with the intention of being multi-threaded.
Bug: 146877356
Bug: 148236233
Test: boot
Test: kill hwservicemanager without deadlock
Change-Id: I84108e54217866205a48c45e8b59355012c32ea8
Such services will be re-parsed and added back to the service list
during post-fs-data stage.
Test: adb reboot userspace
Test: atest CtsInitTestCases
Bug: 145669993
Bug: 135984674
Change-Id: Ibb393dfe0f101c4ebe37bc763733fd5d981d3691
~2007 a change was added that would allow oneshot services to
daemonize by not killing their process group. This was a hack at the
time, and should certainly not be needed now. I've resisted removing
the behavior however, as it hadn't caused any issues.
Recently, it was detected that the cgroups that these processes belong
to, would exist forever and therefore leak memory. Instead of simply
removing the cgroups when empty, this provides a good opportunity to
do the right thing and fix this behavior once and for all.
The new (correct) behavior only happens for devices with vendor images
built for Android R or later. Init will log a warning to dmesg when
it detects this difference in behavior has occurred.
Bug: 144545923
Test: boot CF/Coral and see no difference in behavior.
Test: boot CF with a service that daemonizes and see the warning.
Change-Id: I333a2e25a541ec0114ac50ab8ae7f1ea3f055447
* Refactored code around stopping services a little bit to reuse it
between full reboot and userspace reboot.
* Add a scope_guard to fallback to full reboot in case userspace reboot
fails.
* In case of userspace reboot init will also wait for services to be
terminated/killed and log the ones that didn't react to
SIGTERM/SIGKILL in time.
* If some of the services didn't react to SIGKILL, fail userspace reboot.
Test: adb reboot userspace
Bug: 135984674
Change-Id: I820c7bc406169333b0f929f0eea028d8384eb2ac
This replaces the recently added `exec_reboot_on_failure` builtin, since
it'll be cleaner to extend service definitions than extending `exec`.
This is in line with what we decided when adding `exec_start` instead
of extending `exec` to add parameters for priority.
Test: `exec_start` a service with a reboot_on_failure option and watch
the system reboot appropriately when the service is not found and when
the service terminates with a non-zero exit code.
Change-Id: I332bf9839fa94840d159a810c4a6ba2522189d0b
Host init verifier already checks that the names and number of
arguments for builtins are correct, but it can check more. This
change ensures that property expansions are well formed, and that
arguments that can be parsed on the host are correct. For example it
checks that UIDs and GIDs exist, that numerical values can be parsed,
and that rlimit strings are correct.
Test: build
Change-Id: Ied8882498a88a9f8324db6b8d1020aeeccc8177b