platform_bionic

Chitti Babu Theegala cbfdc7f905 Fix streaming(memcpy) performance on Cortex-A7	2016-12-19 15:11:43 -08:00
bionic	Fix streaming(memcpy) performance on Cortex-A7	2016-12-19 15:11:43 -08:00

Chitti Babu Theegala cbfdc7f905 Fix streaming(memcpy) performance on Cortex-A7

Stream-mode detection for L1 on the Cortex-A7 core fails for addresses
that are not aligned to the cache-line size (64 bytes).
As a result, destination data gets cached unnecessarily.
ARM has confirmed this A7 issue.

The fix is to align the destination address to 64 bytes before
entering the loop in the memcpy routine.
Although the micro_bench memcpy score is lower when the L1 cache is bypassed,
this is desirable: it avoids needlessly evicting other processes' data
from L1, which is good for overall system performance.
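
The destination alignment described above can be sketched in C. This is
only an illustrative model, not the bionic routine, which is written in
ARM assembly. The names bytes_to_align and copy_dst_aligned are
hypothetical, the byte-wise head and tail loops are a simplification, and
the inner memcpy call merely stands in for the block copy of one line.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHE_LINE 64u  /* L1 line size named in the commit message */

/* Hypothetical helper: bytes needed to reach the next 64-byte boundary. */
static size_t bytes_to_align(const void *p)
{
    return (size_t)(-(uintptr_t)p & (CACHE_LINE - 1));
}

/* Hypothetical C model of the copy structure (assumption, not bionic code). */
static void *copy_dst_aligned(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Head: copy bytes until the destination sits on a line boundary. */
    size_t head = bytes_to_align(d);
    if (head > n)
        head = n;
    n -= head;
    while (head--)
        *d++ = *s++;

    /* Main loop: every block now starts on a 64-byte destination boundary. */
    while (n >= CACHE_LINE) {
        assert(((uintptr_t)d & (CACHE_LINE - 1)) == 0);
        memcpy(d, s, CACHE_LINE);  /* stands in for the real block copy */
        d += CACHE_LINE;
        s += CACHE_LINE;
        n -= CACHE_LINE;
    }

    /* Tail: fewer than 64 bytes remain. */
    while (n--)
        *d++ = *s++;

    return dst;
}
```

Only the destination is aligned here; the source may remain unaligned,
which matches the commit's focus on how destination data is cached.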

The higher micro_bench memcpy numbers for less than 64-byte alignment
come at the cost of L1 cache pollution. During memcpy/memset, the L1
cache is filled with unnecessary data, which evicts other processes'
data from L1.
For example, during memset(0) the L1 cache gets filled with 0s, which
should be avoided.

Additionally, a second Cortex-A7 issue impacts performance
for all alignments and all Android Wear versions:
the store buffer on the A7 is 32 bytes, which limits back-to-back 32-byte stores.
In the current implementation, back-to-back 32-byte writes cause CPU stalls.
Interleaving loads and stores solves this issue.
It avoids CPU stalls during memcpy by making efficient use of the
A7's internal load and store buffers.
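
The interleaving idea can be illustrated with a small C sketch. It is an
assumption-laden model rather than the committed assembly: the chunk16
type, load16, store16 and copy_interleaved are hypothetical names, the
16-byte chunk size is chosen only for illustration, and a C compiler is
free to reorder these accesses, so the ordering shown is guaranteed only
in hand-written assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 16-byte chunk standing in for one vector register. */
typedef struct { uint64_t lo, hi; } chunk16;

static chunk16 load16(const unsigned char *p)
{
    chunk16 c;
    memcpy(&c, p, sizeof c);
    return c;
}

static void store16(unsigned char *p, chunk16 c)
{
    memcpy(p, &c, sizeof c);
}

/* Copies n bytes, placing a load between the two 16-byte stores of each
 * 32-byte step instead of issuing 32 bytes of stores back to back. */
static void *copy_interleaved(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* memcpy semantics: the two regions must not overlap. */
    uintptr_t da = (uintptr_t)dst, sa = (uintptr_t)src;
    assert(da + n <= sa || sa + n <= da);

    while (n >= 32) {
        chunk16 a = load16(s);
        store16(d, a);               /* first 16-byte store */
        chunk16 b = load16(s + 16);  /* load separates the two stores */
        store16(d + 16, b);          /* second 16-byte store */
        s += 32;
        d += 32;
        n -= 32;
    }

    /* Tail: fewer than 32 bytes remain. */
    while (n--)
        *d++ = *s++;

    return dst;
}
```

The back-to-back form would instead load both chunks first and then issue
both stores in a row, which is the pattern the commit message says stalls
on the A7.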

Change-Id: Ie5f12f2bb5d86f627686730416279057e4f5f6d0
