platform_bionic/libc/arch-arm64/generic/bionic
Sebastian Pop ed9bfc4616 [AArch64] Optimized memcmp
Patch written by Wilco Dijkstra submitted for review to newlib:
https://sourceware.org/ml/newlib/2017/msg00524.html

This is an optimized memcmp for AArch64.  This is a complete rewrite
using a different algorithm.  The previous version split into cases
where both inputs were aligned, the inputs were mutually aligned and
unaligned using a byte loop.  The new version combines all these cases,
while small inputs of less than 8 bytes are handled separately.

This allows the main code to be sped up using unaligned loads since
there are now at least 8 bytes to be compared.  After the first 8 bytes,
align the first input.  This ensures each iteration does at most one
unaligned access and mutually aligned inputs behave as aligned.
After the main loop, process the last 8 bytes using unaligned accesses.

This improves performance of (mutually) aligned cases by 25% and
unaligned by >500% (yes >6 times faster) on large inputs.

2017-06-28  Wilco Dijkstra  <wdijkstr@arm.com>

        * bionic/libc/arch-arm64/generic/bionic/memcmp.S (memcmp):
                Rewrite of optimized memcmp.

GLIBC benchtests/bench-memcmp.c performance comparison for Cortex-A53:

Length    1, alignment  1/ 1:        153%
Length    1, alignment  1/ 1:        119%
Length    1, alignment  1/ 1:        154%
Length    2, alignment  2/ 2:        121%
Length    2, alignment  2/ 2:        140%
Length    2, alignment  2/ 2:        121%
Length    3, alignment  3/ 3:        105%
Length    3, alignment  3/ 3:        105%
Length    3, alignment  3/ 3:        105%
Length    4, alignment  4/ 4:        155%
Length    4, alignment  4/ 4:        154%
Length    4, alignment  4/ 4:        161%
Length    5, alignment  5/ 5:        173%
Length    5, alignment  5/ 5:        173%
Length    5, alignment  5/ 5:        173%
Length    6, alignment  6/ 6:        145%
Length    6, alignment  6/ 6:        145%
Length    6, alignment  6/ 6:        145%
Length    7, alignment  7/ 7:        125%
Length    7, alignment  7/ 7:        125%
Length    7, alignment  7/ 7:        125%
Length    8, alignment  8/ 8:        111%
Length    8, alignment  8/ 8:        130%
Length    8, alignment  8/ 8:        124%
Length    9, alignment  9/ 9:        160%
Length    9, alignment  9/ 9:        160%
Length    9, alignment  9/ 9:        150%
Length   10, alignment 10/10:        170%
Length   10, alignment 10/10:        137%
Length   10, alignment 10/10:        150%
Length   11, alignment 11/11:        160%
Length   11, alignment 11/11:        160%
Length   11, alignment 11/11:        160%
Length   12, alignment 12/12:        146%
Length   12, alignment 12/12:        168%
Length   12, alignment 12/12:        156%
Length   13, alignment 13/13:        167%
Length   13, alignment 13/13:        167%
Length   13, alignment 13/13:        173%
Length   14, alignment 14/14:        167%
Length   14, alignment 14/14:        168%
Length   14, alignment 14/14:        168%
Length   15, alignment 15/15:        168%
Length   15, alignment 15/15:        173%
Length   15, alignment 15/15:        173%
Length    1, alignment  0/ 0:        134%
Length    1, alignment  0/ 0:        127%
Length    1, alignment  0/ 0:        119%
Length    2, alignment  0/ 0:        94%
Length    2, alignment  0/ 0:        94%
Length    2, alignment  0/ 0:        106%
Length    3, alignment  0/ 0:        82%
Length    3, alignment  0/ 0:        87%
Length    3, alignment  0/ 0:        82%
Length    4, alignment  0/ 0:        115%
Length    4, alignment  0/ 0:        115%
Length    4, alignment  0/ 0:        122%
Length    5, alignment  0/ 0:        127%
Length    5, alignment  0/ 0:        119%
Length    5, alignment  0/ 0:        127%
Length    6, alignment  0/ 0:        103%
Length    6, alignment  0/ 0:        100%
Length    6, alignment  0/ 0:        100%
Length    7, alignment  0/ 0:        82%
Length    7, alignment  0/ 0:        91%
Length    7, alignment  0/ 0:        87%
Length    8, alignment  0/ 0:        111%
Length    8, alignment  0/ 0:        124%
Length    8, alignment  0/ 0:        124%
Length    9, alignment  0/ 0:        136%
Length    9, alignment  0/ 0:        136%
Length    9, alignment  0/ 0:        136%
Length   10, alignment  0/ 0:        136%
Length   10, alignment  0/ 0:        135%
Length   10, alignment  0/ 0:        136%
Length   11, alignment  0/ 0:        136%
Length   11, alignment  0/ 0:        136%
Length   11, alignment  0/ 0:        135%
Length   12, alignment  0/ 0:        136%
Length   12, alignment  0/ 0:        136%
Length   12, alignment  0/ 0:        136%
Length   13, alignment  0/ 0:        135%
Length   13, alignment  0/ 0:        136%
Length   13, alignment  0/ 0:        136%
Length   14, alignment  0/ 0:        136%
Length   14, alignment  0/ 0:        136%
Length   14, alignment  0/ 0:        136%
Length   15, alignment  0/ 0:        136%
Length   15, alignment  0/ 0:        136%
Length   15, alignment  0/ 0:        136%
Length    4, alignment  0/ 0:        115%
Length    4, alignment  0/ 0:        115%
Length    4, alignment  0/ 0:        115%
Length   32, alignment  0/ 0:        127%
Length   32, alignment  7/ 2:        395%
Length   32, alignment  0/ 0:        127%
Length   32, alignment  0/ 0:        127%
Length    8, alignment  0/ 0:        111%
Length    8, alignment  0/ 0:        124%
Length    8, alignment  0/ 0:        124%
Length   64, alignment  0/ 0:        128%
Length   64, alignment  6/ 4:        475%
Length   64, alignment  0/ 0:        131%
Length   64, alignment  0/ 0:        134%
Length   16, alignment  0/ 0:        128%
Length   16, alignment  0/ 0:        119%
Length   16, alignment  0/ 0:        128%
Length  128, alignment  0/ 0:        129%
Length  128, alignment  5/ 6:        475%
Length  128, alignment  0/ 0:        130%
Length  128, alignment  0/ 0:        129%
Length   32, alignment  0/ 0:        126%
Length   32, alignment  0/ 0:        126%
Length   32, alignment  0/ 0:        126%
Length  256, alignment  0/ 0:        127%
Length  256, alignment  4/ 8:        545%
Length  256, alignment  0/ 0:        126%
Length  256, alignment  0/ 0:        128%
Length   64, alignment  0/ 0:        171%
Length   64, alignment  0/ 0:        171%
Length   64, alignment  0/ 0:        174%
Length  512, alignment  0/ 0:        126%
Length  512, alignment  3/10:        585%
Length  512, alignment  0/ 0:        126%
Length  512, alignment  0/ 0:        127%
Length  128, alignment  0/ 0:        129%
Length  128, alignment  0/ 0:        128%
Length  128, alignment  0/ 0:        129%
Length 1024, alignment  0/ 0:        125%
Length 1024, alignment  2/12:        611%
Length 1024, alignment  0/ 0:        126%
Length 1024, alignment  0/ 0:        126%
Length  256, alignment  0/ 0:        128%
Length  256, alignment  0/ 0:        127%
Length  256, alignment  0/ 0:        128%
Length 2048, alignment  0/ 0:        125%
Length 2048, alignment  1/14:        625%
Length 2048, alignment  0/ 0:        125%
Length 2048, alignment  0/ 0:        125%
Length  512, alignment  0/ 0:        126%
Length  512, alignment  0/ 0:        127%
Length  512, alignment  0/ 0:        127%
Length 4096, alignment  0/ 0:        125%
Length 4096, alignment  0/16:        125%
Length 4096, alignment  0/ 0:        125%
Length 4096, alignment  0/ 0:        125%
Length 1024, alignment  0/ 0:        126%
Length 1024, alignment  0/ 0:        126%
Length 1024, alignment  0/ 0:        126%
Length 8192, alignment  0/ 0:        125%
Length 8192, alignment 63/18:        636%
Length 8192, alignment  0/ 0:        125%
Length 8192, alignment  0/ 0:        125%
Length   16, alignment  1/ 2:        317%
Length   16, alignment  1/ 2:        317%
Length   16, alignment  1/ 2:        317%
Length   32, alignment  2/ 4:        395%
Length   32, alignment  2/ 4:        395%
Length   32, alignment  2/ 4:        398%
Length   64, alignment  3/ 6:        475%
Length   64, alignment  3/ 6:        475%
Length   64, alignment  3/ 6:        477%
Length  128, alignment  4/ 8:        479%
Length  128, alignment  4/ 8:        479%
Length  128, alignment  4/ 8:        479%
Length  256, alignment  5/10:        543%
Length  256, alignment  5/10:        539%
Length  256, alignment  5/10:        543%
Length  512, alignment  6/12:        585%
Length  512, alignment  6/12:        585%
Length  512, alignment  6/12:        585%
Length 1024, alignment  7/14:        611%
Length 1024, alignment  7/14:        611%
Length 1024, alignment  7/14:        611%

The performance measured on the bionic-benchmarks on a hikey
board with a new benchmark for unaligned memcmp submitted for
review at https://android-review.googlesource.com/414860

The base is with the libc from /system/lib64. The bionic libc
with this patch is in /data.

hikey:/data # export LD_LIBRARY_PATH=/system/lib64
hikey:/data # ./bionic-benchmarks --benchmark_filter=BM_string_memcmp
Run on (8 X 2.4 MHz CPU s)
Benchmark                                Time           CPU Iterations
----------------------------------------------------------------------
BM_string_memcmp/8                      30 ns         30 ns   22955680    251.07MB/s
BM_string_memcmp/64                     57 ns         57 ns   12349184   1076.99MB/s
BM_string_memcmp/512                   305 ns        305 ns    2297163   1.56496GB/s
BM_string_memcmp/1024                  571 ns        571 ns    1225211   1.66912GB/s
BM_string_memcmp/8k                   4307 ns       4306 ns     162562   1.77177GB/s
BM_string_memcmp/16k                  8676 ns       8675 ns      80676   1.75887GB/s
BM_string_memcmp/32k                 19233 ns      19230 ns      36394   1.58695GB/s
BM_string_memcmp/64k                 36986 ns      36984 ns      18952   1.65029GB/s
BM_string_memcmp_aligned/8             199 ns        199 ns    3519166   38.3336MB/s
BM_string_memcmp_aligned/64            386 ns        386 ns    1810734   158.073MB/s
BM_string_memcmp_aligned/512          1735 ns       1734 ns     403981   281.525MB/s
BM_string_memcmp_aligned/1024         3200 ns       3200 ns     218838   305.151MB/s
BM_string_memcmp_aligned/8k          25084 ns      25080 ns      28180   311.507MB/s
BM_string_memcmp_aligned/16k         51730 ns      51729 ns      13521   302.057MB/s
BM_string_memcmp_aligned/32k        103228 ns     103228 ns       6782   302.727MB/s
BM_string_memcmp_aligned/64k        207117 ns     207087 ns       3450   301.806MB/s
BM_string_memcmp_unaligned/8           339 ns        339 ns    2070998   22.5302MB/s
BM_string_memcmp_unaligned/64         1392 ns       1392 ns     502796   43.8454MB/s
BM_string_memcmp_unaligned/512        9194 ns       9194 ns      76133   53.1104MB/s
BM_string_memcmp_unaligned/1024      18325 ns      18323 ns      38206   53.2963MB/s
BM_string_memcmp_unaligned/8k       148579 ns     148574 ns       4713   52.5831MB/s
BM_string_memcmp_unaligned/16k      298169 ns     298120 ns       2344   52.4118MB/s
BM_string_memcmp_unaligned/32k      598813 ns     598797 ns       1085    52.188MB/s
BM_string_memcmp_unaligned/64k     1196079 ns    1196083 ns        540   52.2539MB/s

hikey:/data # export LD_LIBRARY_PATH=/data
hikey:/data # ./bionic-benchmarks --benchmark_filter=BM_string_memcmp

Benchmark                                Time           CPU Iterations
----------------------------------------------------------------------
BM_string_memcmp/8                      27 ns         27 ns   26198166   286.069MB/s
BM_string_memcmp/64                     45 ns         45 ns   15553753   1.32443GB/s
BM_string_memcmp/512                   242 ns        242 ns    2892423   1.97049GB/s
BM_string_memcmp/1024                  455 ns        455 ns    1537290   2.09436GB/s
BM_string_memcmp/8k                   3446 ns       3446 ns     203295   2.21392GB/s
BM_string_memcmp/16k                  7567 ns       7567 ns      92582   2.01657GB/s
BM_string_memcmp/32k                 16081 ns      16081 ns      43524    1.8977GB/s
BM_string_memcmp/64k                 31029 ns      31028 ns      22565   1.96712GB/s
BM_string_memcmp_aligned/8             184 ns        184 ns    3800912   41.3654MB/s
BM_string_memcmp_aligned/64            287 ns        287 ns    2438835    212.65MB/s
BM_string_memcmp_aligned/512          1370 ns       1370 ns     511014   356.498MB/s
BM_string_memcmp_aligned/1024         2543 ns       2543 ns     275253   384.006MB/s
BM_string_memcmp_aligned/8k          20413 ns      20411 ns      34306   382.764MB/s
BM_string_memcmp_aligned/16k         42908 ns      42907 ns      16132   364.158MB/s
BM_string_memcmp_aligned/32k         88902 ns      88886 ns       8087   351.574MB/s
BM_string_memcmp_aligned/64k        173016 ns     173007 ns       4122   361.258MB/s
BM_string_memcmp_unaligned/8           212 ns        212 ns    3304163   36.0243MB/s
BM_string_memcmp_unaligned/64          361 ns        361 ns    1941597   169.279MB/s
BM_string_memcmp_unaligned/512        1754 ns       1753 ns     399210   278.492MB/s
BM_string_memcmp_unaligned/1024       3308 ns       3308 ns     211622   295.243MB/s
BM_string_memcmp_unaligned/8k        27227 ns      27225 ns      25637   286.964MB/s
BM_string_memcmp_unaligned/16k       55877 ns      55874 ns      12455   279.645MB/s
BM_string_memcmp_unaligned/32k      112397 ns     112366 ns       6200    278.11MB/s
BM_string_memcmp_unaligned/64k      223493 ns     223482 ns       3127   279.665MB/s

Test: bionic-benchmarks --benchmark_filter='BM_string_memcmp*'

Change-Id: Ia16a8cf69c68b8c0533f025f03b925c9883bb708
2017-11-03 13:21:07 -04:00
..
__memcpy_chk.S Split our FORTIFY implementation into libc_fortify 2017-07-24 14:20:16 -07:00
memchr.S libc: clean up ARM64 copyright notices 2017-05-04 12:59:53 -04:00
memcmp.S [AArch64] Optimized memcmp 2017-11-03 13:21:07 -04:00
memcpy.S Split our FORTIFY implementation into libc_fortify 2017-07-24 14:20:16 -07:00
memcpy_base.S libc: ARM64: update memset/strlen/memcpy/memmove to newlib/cortex-strings 2016-11-28 19:35:12 +00:00
memmove.S libc: remove bcopy from memmove on 64-bit architectures 2015-08-17 22:06:12 +00:00
memset.S libc: ARM64: fix memset for non-standard ZVA sizes 2017-05-16 11:29:49 +01:00
stpcpy.S Add optimized stpcpy. 2014-06-30 12:48:13 -07:00
strchr.S libc: clean up ARM64 copyright notices 2017-05-04 12:59:53 -04:00
strcmp.S bionic: arm64: generic: strcmp: align to 64B cache line 2017-03-20 17:54:29 +00:00
strcpy.S Add optimized stpcpy. 2014-06-30 12:48:13 -07:00
string_copy.S Regenerate the bionic NOTICE files. 2014-07-07 15:42:06 -07:00
strlen.S libc: ARM64: update memset/strlen/memcpy/memmove to newlib/cortex-strings 2016-11-28 19:35:12 +00:00
strncmp.S Add ARMv8 optimized string handling functions based on cortex-strings 2014-03-06 14:59:51 -08:00
strnlen.S Add ARMv8 optimized string handling functions based on cortex-strings 2014-03-06 14:59:51 -08:00
wmemmove.S Add optimized AArch64 versions of bcopy and wmemmove based on memmove 2014-05-23 18:49:57 -07:00