As requested in the bug. This also rips __memcpy_chk out of memcpy.S,
which lets us cut down on copypasta (all of the implementations look
identical).
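For reference, the shared C implementation is little more than a bounds
check in front of the real memcpy. A minimal sketch, assuming a
bionic-style __fortify_fatal abort (the helper name and message wording
here are illustrative, not the committed code):

    #include <stddef.h>
    #include <string.h>

    // Illustrative prototype; bionic's fatal-error helper differs in detail.
    extern void __fortify_fatal(const char* fmt, ...) __attribute__((noreturn));

    void* __memcpy_chk(void* dst, const void* src, size_t count, size_t dst_len) {
      // Abort if the caller is about to write past the known buffer size.
      if (count > dst_len) {
        __fortify_fatal("memcpy: prevented %zu-byte write into %zu-byte buffer",
                        count, dst_len);
      }
      return memcpy(dst, src, count);
    }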
Bug: 12231437
Test: mma on aosp_{arm,arm64,mips,x86,x86_64} internal master;
checkbuild on bullhead internal master; CtsBionicTestCases on bullhead.
No new failures.
Change-Id: I88c39ca166bacde0b692aa3063e743bb046a5d2f
There are a few instructions deprecated on armv8 that result in lots
of warnings. Add an arch directive so that these warnings go away.
This doesn't cause any problems because the instructions still
execute properly.
Bug: 38319728
Test: Built all of these assembler files and verified the warnings are gone.
Change-Id: If063defdd16f290c01975233c8d257d1b2005e76
Stream-mode detection for L1 in the A7 core fails for addresses that are
not cache-line-size (64-byte) aligned. This leads to destination data
being cached unnecessarily. This A7 issue has been confirmed by ARM.
The issue is solved by aligning the destination address to a 64-byte
boundary before entering the main loop of the memcpy routine.
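In C terms, the fix copies a short head until the destination pointer
reaches a 64-byte boundary and only then enters the unrolled main loop.
A rough sketch (the real routine is hand-written assembly; this only
shows the shape):

    #include <stddef.h>
    #include <stdint.h>

    void* sketch_memcpy(void* dst, const void* src, size_t n) {
      char* d = dst;
      const char* s = src;
      // Head: copy until d sits on a 64-byte cache-line boundary.
      while (n > 0 && ((uintptr_t)d & 63) != 0) {
        *d++ = *s++;
        n--;
      }
      // Main loop: whole cache lines, so A7 stream-mode detection works.
      while (n >= 64) {
        for (int i = 0; i < 64; ++i) d[i] = s[i];
        d += 64;
        s += 64;
        n -= 64;
      }
      // Tail: remaining bytes.
      while (n > 0) {
        *d++ = *s++;
        n--;
      }
      return dst;
    }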
Though we get a lower micro_bench memcpy score when the L1 cache is
bypassed, this is desirable because it avoids unnecessary eviction of
other processes' data from L1, which is better for overall system
performance. The higher micro_bench memcpy numbers for sub-64-byte
alignment look good, but they come at the cost of L1 cache pollution:
during memcpy/memset, unnecessary data is pulled into the L1 cache,
evicting other processes' data. For example, during memset(0) the L1
cache fills with zeros, which should be avoided.
Additionally, there is another cortex-A7 issue that impacts performance
for all alignments and all Android Wear versions: the A7's store buffer
is 32 bytes, which limits back-to-back 32-byte stores. In the current
implementation, back-to-back 32-byte writes cause CPU stalls. This is
solved by interleaving loads and stores, which avoids the stalls by
making efficient use of the A7's internal load and store buffers.
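The idea, rendered in C: software-pipeline the loop so the loads for the
next 32-byte block are issued before the stores of the current block,
instead of issuing the stores back to back. This sketch assumes
32-byte-aligned pointers and a whole number of blocks; the real fix is
instruction scheduling in the assembly:

    #include <stddef.h>
    #include <stdint.h>

    // nblocks = number of 32-byte blocks (4 x uint64_t each).
    void copy_blocks(uint64_t* d, const uint64_t* s, size_t nblocks) {
      if (nblocks == 0) return;
      // Prime the pipeline: load the first block.
      uint64_t a = s[0], b = s[1], c = s[2], e = s[3];
      while (--nblocks > 0) {
        s += 4;
        // Issue the next block's loads first...
        uint64_t a2 = s[0], b2 = s[1], c2 = s[2], e2 = s[3];
        // ...then store the previous block, so stores never run
        // back to back into the 32-byte store buffer.
        d[0] = a; d[1] = b; d[2] = c; d[3] = e;
        d += 4;
        a = a2; b = b2; c = c2; e = e2;
      }
      d[0] = a; d[1] = b; d[2] = c; d[3] = e;  // drain the pipeline
    }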
Change-Id: Ie5f12f2bb5d86f627686730416279057e4f5f6d0
Test: Changed the angler target to use cortex-a7 and compiled.
Test: Booted this version on angler and ran bionic-unit-tests.
Change-Id: Ice7f6ea38a2569582161a8e659d7877918c1a45a
Our FORTIFY _chk functions' implementations were very repetitive and verbose
but not very helpful. We'd also screwed up and put the SSIZE_MAX checks where
they would never fire unless you actually had a buffer as large as half your
address space, which probably doesn't happen very often.
Factor out the duplication and take the opportunity to actually show details
like how big the overrun buffer was, or by how much it was overrun.
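Concretely, the common path can collapse into one small helper that
every _chk function calls, with both sizes in the abort message. A
sketch of the shape (the helper name and format string are
illustrative):

    #include <stddef.h>

    extern void __fortify_fatal(const char* fmt, ...) __attribute__((noreturn));

    // One shared check instead of a hand-rolled copy in every _chk function.
    static inline void __check_buffer_access(const char* fn, const char* action,
                                             size_t claimed, size_t actual) {
      if (claimed > actual) {
        __fortify_fatal("%s: prevented %zu-byte %s %zu-byte buffer",
                        fn, claimed, action, actual);
      }
    }

    // Hypothetical caller:
    //   __check_buffer_access("memset", "write into", n, dst_len);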
Also remove the obsolete FORTIFY event logging.
Also remove the unused __libc_fatal_no_abort.
This change doesn't improve the diagnostics from the optimized assembler
implementations.
Change-Id: I176a90701395404d50975b547a00bd2c654e1252
Add an optimized memset that is ~20% faster for cortex-a7 and
cortex-a53.
Add a 32-bit optimized cortex-a53 memcpy that is ~20% faster
on cached data.
Fix the cortex-a15 __str{cat,cpy}_chk.S and memcpy_base.S to remove
the phony functions, since they aren't needed any more. Then add
a direct include of these for cortex-a53.
Verified the new functions by stepping through all of the major
paths and verifying the backtrace is still correct.
Bug: 22696180
Change-Id: Iec92a3f82d51243cca76c9aff9f35d920ff865ae