a4959aa6f8
Following on from the towlower()/towupper() changes, add benchmarks for most of <ctype.h>, rewrite the tests to cover the entire defined range for all of these functions, and then reimplement most of the functions.
The old table-based implementation is mostly a bad idea on modern hardware, with only ispunct() showing a significant benefit compared to any other way I could think of writing it, and isalnum() a marginal but still convincingly genuine benefit.
My new benchmarks make an effort to test an example from each relevant range of characters to avoid, say, accidentally optimizing the behavior of `isalnum('0')` at the expense of `isalnum('z')`.
Interestingly, clang is able to generate what I believe to be the optimal implementations from the most readable code, which is impressive. It certainly matched or beat all my attempts to be clever!
The BSD table-based implementations made a special case of EOF despite having a `_ctype_` table that's offset by 1 to include EOF at index 0. I'm not sure why they didn't take advantage of that, but removing the explicit check for EOF measurably improves the generated code on arm and arm64, so even the two functions that still use the table benefit from this rewrite.
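The two styles discussed above can be sketched as follows. This is an illustration only, not bionic's actual code: the names `range_islower`, `range_ispunct`, `PunctTable`, `kPunctTable`, and `table_ispunct` are hypothetical, the table is derived from the range checks purely for demonstration, and the sketch assumes ASCII and `EOF == -1`. It shows a range-check predicate next to a table lookup whose table is offset by one, so that EOF lands on index 0 and needs no separate branch.

```cpp
#include <cassert>
#include <cstdio>  // EOF

static_assert(EOF == -1, "this sketch assumes EOF is -1");

// Range checks: the straightforward, readable style.
constexpr bool range_islower(int c) { return c >= 'a' && c <= 'z'; }
constexpr bool range_isupper(int c) { return c >= 'A' && c <= 'Z'; }
constexpr bool range_isdigit(int c) { return c >= '0' && c <= '9'; }
constexpr bool range_isgraph(int c) { return c >= '!' && c <= '~'; }

// In ASCII, punctuation is "graphic but not alphanumeric", which takes
// several comparisons. This is the kind of predicate where a table can pay off.
constexpr bool range_ispunct(int c) {
  return range_isgraph(c) && !range_islower(c) && !range_isupper(c) && !range_isdigit(c);
}

// Hypothetical table with 1 + 256 entries: index 0 is EOF, index c + 1 is character c.
struct PunctTable {
  bool is_punct[257];
};

constexpr PunctTable make_punct_table() {
  PunctTable t{};  // all false, so the EOF slot at index 0 stays false
  for (int c = 0; c < 256; ++c) t.is_punct[c + 1] = range_ispunct(c);
  return t;
}

constexpr PunctTable kPunctTable = make_punct_table();

// Valid for c in [EOF, 255]. The +1 offset means EOF needs no special case.
inline bool table_ispunct(int c) {
  assert(c >= EOF && c <= 255);
  return kPunctTable.is_punct[c + 1];
}
```

With this layout, `table_ispunct(EOF)` is an ordinary lookup of index 0 rather than an extra comparison before the lookup.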
Here are the benchmark results:
arm64 before:
BM_ctype_isalnum_n 3.73 ns 3.73 ns 183727137
BM_ctype_isalnum_y1 3.82 ns 3.81 ns 186383058
BM_ctype_isalnum_y2 3.73 ns 3.72 ns 187809830
BM_ctype_isalnum_y3 3.78 ns 3.77 ns 181383055
BM_ctype_isalpha_n 3.75 ns 3.75 ns 189453927
BM_ctype_isalpha_y1 3.76 ns 3.75 ns 184854043
BM_ctype_isalpha_y2 4.32 ns 3.78 ns 186326931
BM_ctype_isascii_n 2.49 ns 2.48 ns 275583822
BM_ctype_isascii_y 2.51 ns 2.51 ns 282123915
BM_ctype_isblank_n 3.11 ns 3.10 ns 220472044
BM_ctype_isblank_y1 3.20 ns 3.19 ns 226088868
BM_ctype_isblank_y2 3.11 ns 3.11 ns 220809122
BM_ctype_iscntrl_n 3.79 ns 3.78 ns 188719938
BM_ctype_iscntrl_y1 3.72 ns 3.71 ns 186209237
BM_ctype_iscntrl_y2 3.80 ns 3.80 ns 184315749
BM_ctype_isdigit_n 3.76 ns 3.74 ns 188334682
BM_ctype_isdigit_y 3.78 ns 3.77 ns 186249335
BM_ctype_isgraph_n 3.99 ns 3.98 ns 177814143
BM_ctype_isgraph_y1 3.98 ns 3.95 ns 175140090
BM_ctype_isgraph_y2 4.01 ns 4.00 ns 178320453
BM_ctype_isgraph_y3 3.96 ns 3.95 ns 175412814
BM_ctype_isgraph_y4 4.01 ns 4.00 ns 175711174
BM_ctype_islower_n 3.75 ns 3.74 ns 188604818
BM_ctype_islower_y 3.79 ns 3.78 ns 154738238
BM_ctype_isprint_n 3.96 ns 3.95 ns 177607734
BM_ctype_isprint_y1 3.94 ns 3.93 ns 174877244
BM_ctype_isprint_y2 4.02 ns 4.01 ns 178206135
BM_ctype_isprint_y3 3.94 ns 3.93 ns 175959069
BM_ctype_isprint_y4 4.03 ns 4.02 ns 176158314
BM_ctype_isprint_y5 3.95 ns 3.94 ns 178745462
BM_ctype_ispunct_n 3.78 ns 3.77 ns 184727184
BM_ctype_ispunct_y 3.76 ns 3.75 ns 187947503
BM_ctype_isspace_n 3.74 ns 3.74 ns 185300285
BM_ctype_isspace_y1 3.77 ns 3.76 ns 187202066
BM_ctype_isspace_y2 3.73 ns 3.73 ns 184105959
BM_ctype_isupper_n 3.81 ns 3.80 ns 185038761
BM_ctype_isupper_y 3.71 ns 3.71 ns 185885793
BM_ctype_isxdigit_n 3.79 ns 3.79 ns 184965673
BM_ctype_isxdigit_y1 3.76 ns 3.75 ns 188251672
BM_ctype_isxdigit_y2 3.79 ns 3.78 ns 184187481
BM_ctype_isxdigit_y3 3.77 ns 3.76 ns 187635540
arm64 after:
BM_ctype_isalnum_n 3.37 ns 3.37 ns 205613810
BM_ctype_isalnum_y1 3.40 ns 3.39 ns 204806361
BM_ctype_isalnum_y2 3.43 ns 3.43 ns 205066077
BM_ctype_isalnum_y3 3.50 ns 3.50 ns 200057128
BM_ctype_isalpha_n 2.97 ns 2.97 ns 236084076
BM_ctype_isalpha_y1 2.97 ns 2.97 ns 236083626
BM_ctype_isalpha_y2 2.97 ns 2.97 ns 236084246
BM_ctype_isascii_n 2.55 ns 2.55 ns 272879994
BM_ctype_isascii_y 2.46 ns 2.45 ns 286522323
BM_ctype_isblank_n 3.18 ns 3.18 ns 220431175
BM_ctype_isblank_y1 3.18 ns 3.18 ns 220345602
BM_ctype_isblank_y2 3.18 ns 3.18 ns 220308509
BM_ctype_iscntrl_n 3.10 ns 3.10 ns 220344270
BM_ctype_iscntrl_y1 3.10 ns 3.07 ns 228973615
BM_ctype_iscntrl_y2 3.07 ns 3.07 ns 229192626
BM_ctype_isdigit_n 3.07 ns 3.07 ns 228925676
BM_ctype_isdigit_y 3.07 ns 3.07 ns 229182934
BM_ctype_isgraph_n 2.66 ns 2.66 ns 264268737
BM_ctype_isgraph_y1 2.66 ns 2.66 ns 264445277
BM_ctype_isgraph_y2 2.66 ns 2.66 ns 264327427
BM_ctype_isgraph_y3 2.66 ns 2.66 ns 264427480
BM_ctype_isgraph_y4 2.66 ns 2.66 ns 264155250
BM_ctype_islower_n 2.66 ns 2.66 ns 264421600
BM_ctype_islower_y 2.66 ns 2.66 ns 264341148
BM_ctype_isprint_n 2.66 ns 2.66 ns 264415198
BM_ctype_isprint_y1 2.66 ns 2.66 ns 264268793
BM_ctype_isprint_y2 2.66 ns 2.66 ns 264419205
BM_ctype_isprint_y3 2.66 ns 2.66 ns 264205886
BM_ctype_isprint_y4 2.66 ns 2.66 ns 264440797
BM_ctype_isprint_y5 2.72 ns 2.72 ns 264333293
BM_ctype_ispunct_n 3.52 ns 3.51 ns 198956572
BM_ctype_ispunct_y 3.38 ns 3.38 ns 201661792
BM_ctype_isspace_n 3.39 ns 3.39 ns 206896620
BM_ctype_isspace_y1 3.39 ns 3.39 ns 206569020
BM_ctype_isspace_y2 3.39 ns 3.39 ns 206564415
BM_ctype_isupper_n 2.76 ns 2.75 ns 254227134
BM_ctype_isupper_y 2.76 ns 2.75 ns 254235314
BM_ctype_isxdigit_n 3.60 ns 3.60 ns 194418653
BM_ctype_isxdigit_y1 2.97 ns 2.97 ns 236082424
BM_ctype_isxdigit_y2 3.48 ns 3.48 ns 200390011
BM_ctype_isxdigit_y3 3.48 ns 3.48 ns 202255815
arm32 before:
BM_ctype_isalnum_n 4.77 ns 4.76 ns 129230464
BM_ctype_isalnum_y1 4.88 ns 4.87 ns 147939321
BM_ctype_isalnum_y2 4.74 ns 4.73 ns 145508054
BM_ctype_isalnum_y3 4.81 ns 4.80 ns 144968914
BM_ctype_isalpha_n 4.80 ns 4.79 ns 148262579
BM_ctype_isalpha_y1 4.74 ns 4.73 ns 145061326
BM_ctype_isalpha_y2 4.83 ns 4.82 ns 147642546
BM_ctype_isascii_n 3.74 ns 3.72 ns 186711139
BM_ctype_isascii_y 3.79 ns 3.78 ns 183654780
BM_ctype_isblank_n 4.20 ns 4.19 ns 169733252
BM_ctype_isblank_y1 4.19 ns 4.18 ns 165713363
BM_ctype_isblank_y2 4.22 ns 4.21 ns 168776265
BM_ctype_iscntrl_n 4.75 ns 4.74 ns 145417484
BM_ctype_iscntrl_y1 4.82 ns 4.81 ns 146283250
BM_ctype_iscntrl_y2 4.79 ns 4.78 ns 148662453
BM_ctype_isdigit_n 4.77 ns 4.76 ns 145789210
BM_ctype_isdigit_y 4.84 ns 4.84 ns 146909458
BM_ctype_isgraph_n 4.72 ns 4.71 ns 145874663
BM_ctype_isgraph_y1 4.86 ns 4.85 ns 142037606
BM_ctype_isgraph_y2 4.79 ns 4.78 ns 145109612
BM_ctype_isgraph_y3 4.75 ns 4.75 ns 144829039
BM_ctype_isgraph_y4 4.86 ns 4.85 ns 146769899
BM_ctype_islower_n 4.76 ns 4.75 ns 147537637
BM_ctype_islower_y 4.79 ns 4.78 ns 145648017
BM_ctype_isprint_n 4.82 ns 4.81 ns 147154780
BM_ctype_isprint_y1 4.76 ns 4.76 ns 145117604
BM_ctype_isprint_y2 4.87 ns 4.86 ns 145801406
BM_ctype_isprint_y3 4.79 ns 4.78 ns 148043446
BM_ctype_isprint_y4 4.77 ns 4.76 ns 145157619
BM_ctype_isprint_y5 4.91 ns 4.90 ns 147810800
BM_ctype_ispunct_n 4.74 ns 4.73 ns 145588611
BM_ctype_ispunct_y 4.82 ns 4.81 ns 144065436
BM_ctype_isspace_n 4.78 ns 4.77 ns 147153712
BM_ctype_isspace_y1 4.73 ns 4.72 ns 145252863
BM_ctype_isspace_y2 4.84 ns 4.83 ns 148615797
BM_ctype_isupper_n 4.75 ns 4.74 ns 148276631
BM_ctype_isupper_y 4.80 ns 4.79 ns 145529893
BM_ctype_isxdigit_n 4.78 ns 4.77 ns 147271646
BM_ctype_isxdigit_y1 4.74 ns 4.74 ns 145142209
BM_ctype_isxdigit_y2 4.83 ns 4.82 ns 146398497
BM_ctype_isxdigit_y3 4.78 ns 4.77 ns 147617686
arm32 after:
BM_ctype_isalnum_n 4.35 ns 4.35 ns 161086146
BM_ctype_isalnum_y1 4.36 ns 4.35 ns 160961111
BM_ctype_isalnum_y2 4.36 ns 4.36 ns 160733210
BM_ctype_isalnum_y3 4.35 ns 4.35 ns 160897524
BM_ctype_isalpha_n 3.67 ns 3.67 ns 189377208
BM_ctype_isalpha_y1 3.68 ns 3.67 ns 189438146
BM_ctype_isalpha_y2 3.75 ns 3.69 ns 190971186
BM_ctype_isascii_n 3.69 ns 3.68 ns 191029191
BM_ctype_isascii_y 3.68 ns 3.68 ns 191011817
BM_ctype_isblank_n 4.09 ns 4.09 ns 171887541
BM_ctype_isblank_y1 4.09 ns 4.09 ns 171829345
BM_ctype_isblank_y2 4.08 ns 4.07 ns 170585590
BM_ctype_iscntrl_n 4.08 ns 4.07 ns 170614383
BM_ctype_iscntrl_y1 4.13 ns 4.11 ns 171495899
BM_ctype_iscntrl_y2 4.19 ns 4.18 ns 165255578
BM_ctype_isdigit_n 4.25 ns 4.24 ns 165237008
BM_ctype_isdigit_y 4.24 ns 4.24 ns 165256149
BM_ctype_isgraph_n 3.82 ns 3.81 ns 183610114
BM_ctype_isgraph_y1 3.82 ns 3.81 ns 183614131
BM_ctype_isgraph_y2 3.82 ns 3.81 ns 183616840
BM_ctype_isgraph_y3 3.79 ns 3.79 ns 183620182
BM_ctype_isgraph_y4 3.82 ns 3.81 ns 185740009
BM_ctype_islower_n 3.75 ns 3.74 ns 183619502
BM_ctype_islower_y 3.68 ns 3.68 ns 190999901
BM_ctype_isprint_n 3.69 ns 3.68 ns 190899544
BM_ctype_isprint_y1 3.68 ns 3.67 ns 190192384
BM_ctype_isprint_y2 3.67 ns 3.67 ns 189351466
BM_ctype_isprint_y3 3.67 ns 3.67 ns 189430348
BM_ctype_isprint_y4 3.68 ns 3.68 ns 189430161
BM_ctype_isprint_y5 3.69 ns 3.68 ns 190962419
BM_ctype_ispunct_n 4.14 ns 4.14 ns 171034861
BM_ctype_ispunct_y 4.19 ns 4.19 ns 168308152
BM_ctype_isspace_n 4.50 ns 4.50 ns 156250887
BM_ctype_isspace_y1 4.48 ns 4.48 ns 155124476
BM_ctype_isspace_y2 4.50 ns 4.50 ns 155077504
BM_ctype_isupper_n 3.68 ns 3.68 ns 191020583
BM_ctype_isupper_y 3.68 ns 3.68 ns 191015669
BM_ctype_isxdigit_n 4.50 ns 4.50 ns 156276745
BM_ctype_isxdigit_y1 3.28 ns 3.27 ns 214729725
BM_ctype_isxdigit_y2 4.48 ns 4.48 ns 155265129
BM_ctype_isxdigit_y3 4.48 ns 4.48 ns 155216846
I've also corrected a small mistake in the documentation for isxdigit().
Test: tests and benchmarks
Change-Id: I4a77859f826c3fc8f0e327e847886882f29ec4a3
Bionic Benchmarks
[TOC]
libc benchmarks (bionic-benchmarks)
`bionic-benchmarks` is a command line tool for measuring the runtimes of libc functions. It is built on top of Google Benchmark with some additions to organize tests into suites.
Device benchmarks
$ mmma bionic/benchmarks
$ adb root
$ adb sync data
$ adb shell /data/benchmarktest/bionic-benchmarks/bionic-benchmarks
$ adb shell /data/benchmarktest64/bionic-benchmarks/bionic-benchmarks
By default, `bionic-benchmarks` runs all of the benchmarks in alphabetical order. Pass `--benchmark_filter=getpid` to run just the benchmarks with "getpid" in their name.
Host benchmarks
See the `benchmarks/run-on-host.sh` script. The host benchmarks can be run with 32-bit or 64-bit Bionic, or the host glibc.
XML suites
Suites are stored in the `suites/` directory and can be chosen with the command line flag `--bionic_xml`.
To choose a specific XML file, use the `--bionic_xml=FILE.XML` option. By default, this option searches for the XML file in the `suites/` directory. If the file doesn't exist in that directory, it is looked up relative to the current directory. If the option specifies the full path to an XML file, such as `/data/nativetest/suites/example.xml`, that path is used as-is.
If no XML file is specified through the command-line option, the default is to use `suites/full.xml`. However, for the host bionic benchmarks (`bionic-benchmarks-glibc`), the default is to use `suites/host.xml`.
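The lookup order described above can be sketched roughly as follows. This is not the tool's actual code: `resolve_suite_path`, its parameters, and the injected `exists` callback are hypothetical names chosen for illustration, and the callback stands in for a real file-existence check.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical helper: returns the path that would be opened for --bionic_xml=<arg>.
//   arg           value given on the command line ("" if the flag was omitted)
//   suites_dir    directory holding the bundled suites, e.g. "suites"
//   default_suite "full.xml" on device, "host.xml" for bionic-benchmarks-glibc
//   exists        reports whether a file exists (injected so the sketch is testable)
std::string resolve_suite_path(const std::string& arg, const std::string& suites_dir,
                               const std::string& default_suite,
                               const std::function<bool(const std::string&)>& exists) {
  assert(!suites_dir.empty() && !default_suite.empty());
  if (arg.empty()) return suites_dir + "/" + default_suite;  // no flag: use the default suite
  if (arg[0] == '/') return arg;                             // full path: used as-is
  const std::string in_suites = suites_dir + "/" + arg;
  if (exists(in_suites)) return in_suites;                   // first choice: the suites/ directory
  return arg;                                                // otherwise: relative to the current directory
}
```

Under these assumptions, `--bionic_xml=host.xml` resolves to `suites/host.xml` when that file exists, while an unknown name such as `mine.xml` falls back to `./mine.xml`.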
XML suite format
The format for a benchmark is:
<fn>
<name>BM_sample_benchmark</name>
<cpu><optional_cpu_to_lock></cpu>
<iterations><optional_iterations_to_run></iterations>
<args><space separated list of function args|shorthand></args>
</fn>
XML-specified values for iterations and cpu take precedence over those specified via command line (via `--bionic_iterations` and `--bionic_cpu`, respectively).
To make small changes in runs, you can also schedule benchmarks by passing in their name and a space-separated list of arguments via the `--bionic_extra` command line flag, e.g. `--bionic_extra="BM_string_memcpy AT_COMMON_SIZES"` or `--bionic_extra="BM_string_memcmp 32 8 8"`.
Note that a benchmark will run normally if extra arguments are passed in, but it will fail with a segfault if too few are passed in.
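As a rough illustration of the `--bionic_extra` value format (a benchmark name followed by space-separated arguments), here is a minimal sketch of how such a string could be split. `ExtraBenchmark` and `parse_bionic_extra` are hypothetical names, not functions from `bionic_benchmarks.cpp`, and the sketch does not check that the argument count matches what the benchmark expects.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical holder for one --bionic_extra value.
struct ExtraBenchmark {
  std::string name;               // e.g. "BM_string_memcmp"
  std::vector<std::string> args;  // e.g. {"32", "8", "8"} or {"AT_COMMON_SIZES"}
};

// Splits "NAME ARG1 ARG2 ..." on whitespace. The first token is the benchmark
// name; everything after it is passed through as an argument or shorthand.
ExtraBenchmark parse_bionic_extra(const std::string& value) {
  std::istringstream in(value);
  ExtraBenchmark result;
  in >> result.name;
  assert(!result.name.empty());  // a benchmark name is required
  for (std::string arg; in >> arg;) result.args.push_back(arg);
  return result;
}
```

Because too few arguments lead to a segfault, a real caller would want to compare `args.size()` against the benchmark's expected argument count before scheduling it.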
Shorthand
For the sake of brevity, multiple runs can be scheduled in one XML element by putting one of the following in the args field:
NUM_PROPS
MATH_COMMON
AT_ALIGNED_<ONE|TWO>BUF
AT_<any power of two between 2 and 16384>_ALIGNED_<ONE|TWO>BUF
AT_COMMON_SIZES
Definitions for these can be found in bionic_benchmarks.cpp, and example usages can be found in the suites directory.
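To make the `AT_<power of two>_ALIGNED_<ONE|TWO>BUF` pattern concrete, the sketch below only validates a shorthand name against that pattern. It does not reproduce what the shorthand expands to; for that, see `bionic_benchmarks.cpp`. The function names `is_valid_alignment` and `parse_aligned_shorthand` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True if n is a power of two between 2 and 16384 inclusive.
bool is_valid_alignment(unsigned long n) {
  return n >= 2 && n <= 16384 && (n & (n - 1)) == 0;
}

static bool ends_with(const std::string& s, const std::string& suffix) {
  return s.size() >= suffix.size() &&
         s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Hypothetical check for "AT_<N>_ALIGNED_ONEBUF" / "AT_<N>_ALIGNED_TWOBUF".
// On success, stores N in *alignment and 1 or 2 in *buffers.
bool parse_aligned_shorthand(const std::string& s, unsigned long* alignment, int* buffers) {
  assert(alignment != nullptr && buffers != nullptr);
  const std::string prefix = "AT_";
  const std::string one = "_ALIGNED_ONEBUF";
  const std::string two = "_ALIGNED_TWOBUF";
  if (s.compare(0, prefix.size(), prefix) != 0) return false;
  const bool is_one = ends_with(s, one);
  if (!is_one && !ends_with(s, two)) return false;
  const std::size_t suffix_len = one.size();  // both suffixes have the same length
  if (s.size() <= prefix.size() + suffix_len) return false;  // no room for a number
  const std::string digits = s.substr(prefix.size(), s.size() - prefix.size() - suffix_len);
  if (digits.find_first_not_of("0123456789") != std::string::npos) return false;
  if (digits.size() > 5) return false;  // 16384 has five digits; also avoids overflow
  const unsigned long n = std::stoul(digits);
  if (!is_valid_alignment(n)) return false;
  *alignment = n;
  *buffers = is_one ? 1 : 2;
  return true;
}
```

Note that the unnumbered `AT_ALIGNED_ONEBUF` form is a separate shorthand in the list above, so this particular check deliberately rejects it.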
Unit Tests
`bionic-benchmarks` also has its own set of unit tests, which can be run from the binary in `/data/nativetest[64]/bionic-benchmarks-tests`.
Process startup time (bionic-spawn-benchmarks)
The `spawn/` subdirectory has a few benchmarks measuring the time used to start simple programs (e.g. Toybox's `true` and `sh -c true`). Run it on a device like so:
m bionic-spawn-benchmarks
adb root
adb sync data
adb shell /data/benchmarktest/bionic-spawn-benchmarks/bionic-spawn-benchmarks
adb shell /data/benchmarktest64/bionic-spawn-benchmarks/bionic-spawn-benchmarks
Google Benchmark reports both a real-time figure ("Time") and a CPU usage figure. For these benchmarks, the CPU measurement only counts time spent in the thread calling `posix_spawn`, not that spent in the spawned process. The real-time is probably more useful, and it is the figure used to determine the iteration count.
Locking the CPU frequency seems to improve the results of these benchmarks significantly, and it reduces variability.
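The difference between the two figures can be seen in a small standalone sketch. This is not how the spawn benchmarks are implemented; `SpawnTiming` and `time_one_spawn` are hypothetical names, and the sketch simply reads a monotonic clock and the calling thread's CPU clock around one `posix_spawn` plus `waitpid`.

```cpp
#include <cassert>
#include <spawn.h>
#include <sys/wait.h>
#include <time.h>

extern char** environ;

// Hypothetical result type for one spawn-and-wait cycle.
struct SpawnTiming {
  double real_ns;        // wall-clock time, includes the child's whole lifetime
  double thread_cpu_ns;  // CPU time charged to the calling thread only
  int wait_status;       // status from waitpid, or -1 if the spawn failed
};

static double to_ns(const timespec& ts) {
  return static_cast<double>(ts.tv_sec) * 1e9 + static_cast<double>(ts.tv_nsec);
}

// Spawns argv[0] (a full path) once, waits for it, and reports both clocks.
SpawnTiming time_one_spawn(char* const argv[]) {
  assert(argv != nullptr && argv[0] != nullptr);
  timespec real0, real1, cpu0, cpu1;
  clock_gettime(CLOCK_MONOTONIC, &real0);
  clock_gettime(CLOCK_THREAD_CPUTIME_ID, &cpu0);
  pid_t pid;
  int status = -1;
  if (posix_spawn(&pid, argv[0], nullptr, nullptr, argv, environ) == 0) {
    waitpid(pid, &status, 0);  // the child's work shows up in real time only
  }
  clock_gettime(CLOCK_THREAD_CPUTIME_ID, &cpu1);
  clock_gettime(CLOCK_MONOTONIC, &real1);
  return {to_ns(real1) - to_ns(real0), to_ns(cpu1) - to_ns(cpu0), status};
}
```

While the parent is blocked in `waitpid`, its thread CPU clock barely advances, which is why the CPU column understates the cost of starting a process.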
Google Benchmark notes
Repetitions
Google Benchmark uses two settings to control how many times to run each benchmark, "iterations" and "repetitions". By default, the repetition count is one. Google Benchmark runs the benchmark a few times to determine a sufficiently-large iteration count.
Google Benchmark can optionally run a benchmark repeatedly and report statistics (median, mean, standard deviation) for the runs. To do so, pass the `--benchmark_repetitions` option, e.g.:
# ./bionic-benchmarks --benchmark_filter=BM_stdlib_strtoll --benchmark_repetitions=4
...
-------------------------------------------------------------------
Benchmark Time CPU Iterations
-------------------------------------------------------------------
BM_stdlib_strtoll 27.7 ns 27.7 ns 25290525
BM_stdlib_strtoll 27.7 ns 27.7 ns 25290525
BM_stdlib_strtoll 27.7 ns 27.7 ns 25290525
BM_stdlib_strtoll 27.8 ns 27.7 ns 25290525
BM_stdlib_strtoll_mean 27.7 ns 27.7 ns 4
BM_stdlib_strtoll_median 27.7 ns 27.7 ns 4
BM_stdlib_strtoll_stddev 0.023 ns 0.023 ns 4
There are 4 runs, each with 25290525 iterations. Measurements for the individual runs can be suppressed if they aren't needed:
# ./bionic-benchmarks --benchmark_filter=BM_stdlib_strtoll --benchmark_repetitions=4 --benchmark_report_aggregates_only
...
-------------------------------------------------------------------
Benchmark Time CPU Iterations
-------------------------------------------------------------------
BM_stdlib_strtoll_mean 27.8 ns 27.7 ns 4
BM_stdlib_strtoll_median 27.7 ns 27.7 ns 4
BM_stdlib_strtoll_stddev 0.043 ns 0.043 ns 4
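For reference, the aggregate rows correspond to ordinary summary statistics over the per-repetition times. The sketch below shows one way to compute them; the function names are hypothetical, and it assumes the sample (n - 1) form of the standard deviation, which may not match Google Benchmark's exact implementation. Feeding it the rounded values printed above will not reproduce the reported stddev, because the tool works from unrounded measurements.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Arithmetic mean of the per-repetition times.
double run_mean(const std::vector<double>& runs) {
  assert(!runs.empty());
  double sum = 0.0;
  for (double r : runs) sum += r;
  return sum / static_cast<double>(runs.size());
}

// Median: middle element, or the average of the two middle elements.
double run_median(std::vector<double> runs) {  // by value: sorted locally
  assert(!runs.empty());
  std::sort(runs.begin(), runs.end());
  const std::size_t mid = runs.size() / 2;
  return runs.size() % 2 == 1 ? runs[mid] : (runs[mid - 1] + runs[mid]) / 2.0;
}

// Sample standard deviation (divides by n - 1); 0 for fewer than two runs.
double run_stddev(const std::vector<double>& runs) {
  if (runs.size() < 2) return 0.0;
  const double m = run_mean(runs);
  double sum_sq = 0.0;
  for (double r : runs) sum_sq += (r - m) * (r - m);
  return std::sqrt(sum_sq / static_cast<double>(runs.size() - 1));
}
```

A small stddev relative to the mean, as in the output above, suggests the repetitions are consistent enough that a single run is representative.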
CPU frequencies
To get consistent results between runs, it can sometimes be helpful to restrict a benchmark to specific cores, or to lock cores at specific frequencies. Some phones have a big.LITTLE core setup, or at least allow some cores to run at higher frequencies than others.
A core can be selected for `bionic-benchmarks` using the `--bionic_cpu` option or using the `taskset` utility. For example, a Pixel 3 device has 4 Kryo 385 Silver cores followed by 4 Gold cores:
blueline:/ # /data/benchmarktest64/bionic-benchmarks/bionic-benchmarks --benchmark_filter=BM_stdlib_strtoll --bionic_cpu=0
...
------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------
BM_stdlib_strtoll 64.2 ns 63.6 ns 11017493
blueline:/ # /data/benchmarktest64/bionic-benchmarks/bionic-benchmarks --benchmark_filter=BM_stdlib_strtoll --bionic_cpu=4
...
------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------
BM_stdlib_strtoll 21.8 ns 21.7 ns 33167103
A similar result can be achieved using `taskset`. The first parameter is a bitmask of core numbers to pass to `sched_setaffinity`:
blueline:/ # taskset f /data/benchmarktest64/bionic-benchmarks/bionic-benchmarks --benchmark_filter=BM_stdlib_strtoll
...
------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------
BM_stdlib_strtoll 64.3 ns 63.6 ns 10998697
blueline:/ # taskset f0 /data/benchmarktest64/bionic-benchmarks/bionic-benchmarks --benchmark_filter=BM_stdlib_strtoll
...
------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------
BM_stdlib_strtoll 21.3 ns 21.2 ns 33094801
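The masks used above are hexadecimal: bit N set means core N is allowed. A minimal sketch of that decoding follows; `cores_from_mask` is a hypothetical helper written for illustration and is limited to 64 cores.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Decodes a hexadecimal taskset-style mask into the core numbers it selects.
// Bit 0 is core 0, bit 1 is core 1, and so on (up to 64 cores in this sketch).
std::vector<int> cores_from_mask(const std::string& hex_mask) {
  assert(!hex_mask.empty() && hex_mask.size() <= 16);  // at most 64 bits
  const std::uint64_t mask = std::stoull(hex_mask, nullptr, 16);
  std::vector<int> cores;
  for (int core = 0; core < 64; ++core) {
    if (mask & (std::uint64_t{1} << core)) cores.push_back(core);
  }
  return cores;
}
```

So `f` selects cores 0 to 3 (the Silver cores on the Pixel 3 in this example) and `f0` selects cores 4 to 7 (the Gold cores), which matches the timings shown.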
To lock the CPU frequency, use the sysfs interface at `/sys/devices/system/cpu/cpu*/cpufreq/`. Changing the scaling governor to `performance` suppresses the warning that Google Benchmark otherwise prints:
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
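Before a run, it can be handy to check which governor a core is using. The sketch below reads a governor file and compares it with `performance`. The helper names are hypothetical, and the per-core file name `scaling_governor` under the cpufreq directory is the standard Linux cpufreq sysfs entry rather than something stated in this README.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Returns the first line of the given file, or "" if it cannot be read.
// Intended for paths such as
//   /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
std::string read_governor(const std::string& path) {
  assert(!path.empty());
  std::ifstream in(path);
  std::string governor;
  if (in) std::getline(in, governor);  // getline drops the trailing newline
  return governor;
}

// True if the file names the "performance" governor.
bool is_performance_governor(const std::string& path) {
  return read_governor(path) == "performance";
}
```

A check like this would need to be repeated for each `cpu*` directory, since governors can differ between cores or clusters.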
Some devices have a `perf-setup.sh` script that locks CPU and GPU frequencies. Some TradeFed benchmarks appear to be using the script. For more information:
- run `get_build_var BOARD_PERFSETUP_SCRIPT`
- run `m perf-setup.sh` to install the script into `${OUT}/data/local/tmp/perf-setup.sh`
- see: https://android.googlesource.com/platform/platform_testing/+/refs/heads/master/scripts/perf-setup/