platform_bionic/libc/bionic/wchar.cpp


/* $OpenBSD: citrus_utf8.c,v 1.6 2012/12/05 23:19:59 deraadt Exp $ */
/*-
* Copyright (c) 2002-2004 Tim J. Robbins
* All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
* 1. Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
* 2. Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in the
* documentation and/or other materials provided with the distribution.
*
* THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
* ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
* OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
* HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
* LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
* OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
* SUCH DAMAGE.
*/
#include <errno.h>
#include <string.h>
#include <sys/param.h>
#include <uchar.h>
#include <wchar.h>
#include "private/bionic_mbstate.h"
//
// This file is basically OpenBSD's citrus_utf8.c but rewritten to not require a
// 12-byte mbstate_t so we're backwards-compatible with our LP32 ABI where
// mbstate_t was only 4 bytes.
//
// The state is the UTF-8 sequence. We only support sequences of at most 4
// bytes, so the LP32 mbstate_t already has enough space (of the 4 available
// bytes we only need 3, since we never need to store the entire sequence in
// the intermediate state).
//
// The C standard leaves the conversion state undefined after a bad conversion.
// To avoid unexpected failures due to the possible use of the internal private
// state we always reset the conversion state when encountering illegal
// sequences.
//
// We also implement the POSIX interface directly rather than being accessed via
// function pointers.
//
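The comment above describes keeping the partial UTF-8 sequence itself as the conversion state. The following is a minimal sketch of that idea, not the contents of `private/bionic_mbstate.h`: `TinyState`, `bytes_so_far`, `tiny_decode`, `kIllegal` and `kIncomplete` are hypothetical names introduced for illustration, and the decoder deliberately skips the overlong, surrogate and range checks a real implementation needs. Unlike `mbrtowc()`, it also returns 1 rather than 0 for a NUL byte.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical 4-byte conversion state holding the bytes of a partial UTF-8
// sequence. At most 3 bytes are ever pending, because the 4th byte of a
// 4-byte sequence completes the character.
struct TinyState {
  uint8_t seq[4] = {0, 0, 0, 0};
};

constexpr size_t kIllegal = static_cast<size_t>(-1);
constexpr size_t kIncomplete = static_cast<size_t>(-2);

// Number of pending bytes. A stashed lead byte is >= 0xc0 and a continuation
// byte is >= 0x80, so a zero slot always means "empty".
size_t bytes_so_far(const TinyState& st) {
  return st.seq[2] ? 3 : st.seq[1] ? 2 : st.seq[0] ? 1 : 0;
}

// Always reset the state on a bad sequence, as the comment above recommends.
static size_t reset_illegal(TinyState& st) {
  st = TinyState{};
  return kIllegal;
}

// Total sequence length implied by a lead byte, or 0 if it cannot start one.
static size_t sequence_length(uint8_t lead) {
  if (lead < 0x80) return 1;
  if ((lead & 0xe0) == 0xc0) return 2;
  if ((lead & 0xf0) == 0xe0) return 3;
  if ((lead & 0xf8) == 0xf0) return 4;
  return 0;
}

// Consumes up to n bytes from s. Returns the number of bytes consumed once a
// code point is complete, kIncomplete if more input is needed, or kIllegal.
size_t tiny_decode(char32_t* out, const char* s, size_t n, TinyState& st) {
  uint8_t bytes[4];
  size_t have = bytes_so_far(st);
  for (size_t i = 0; i < have; i++) bytes[i] = st.seq[i];

  size_t used = 0;
  if (have == 0) {
    if (n == 0) return kIncomplete;
    bytes[have++] = static_cast<uint8_t>(s[used++]);
  }

  size_t total = sequence_length(bytes[0]);
  if (total == 0) return reset_illegal(st);

  while (have < total) {
    if (used == n) {
      // Out of input: stash the partial sequence (never more than 3 bytes).
      for (size_t i = 0; i < have; i++) st.seq[i] = bytes[i];
      return kIncomplete;
    }
    uint8_t b = static_cast<uint8_t>(s[used++]);
    if ((b & 0xc0) != 0x80) return reset_illegal(st);
    bytes[have++] = b;
  }

  // Assemble the code point from the payload bits of each byte.
  static const uint8_t lead_mask[] = {0, 0x7f, 0x1f, 0x0f, 0x07};
  char32_t cp = bytes[0] & lead_mask[total];
  for (size_t i = 1; i < total; i++) cp = (cp << 6) | (bytes[i] & 0x3f);

  st = TinyState{};
  *out = cp;
  return used;
}
```

Feeding the three bytes of U+20AC one call at a time leaves one, then two bytes pending before the final call completes the character, so the fourth slot of the state is never needed.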
// Commit (2021-11-16 20:03:19 +01:00): Optimize the mbs fast path slightly.
//
// From a logcat profile:
//
//   |--95.06%-- convertPrintable(char*, char const*, unsigned long)
//   |          |--13.95%-- [hit in function]
//   |          |
//   |          |--35.96%-- mbrtoc32
//   |          |          |--82.72%-- [hit in function]
//   |          |          |
//   |          |          |--11.07%-- mbsinit
//   |          |          |
//   |          |          |--5.96%-- @plt
//
// I think we'd assumed that mbsinit() would be inlined, but since these
// functions aren't all in wchar.cpp it wasn't being. This change moves the
// implementation into a (more clearly named) inline function so we can
// trivially reclaim that 11%+6%.
//
// Benchmarks before:
//
//   Benchmark                        Time           CPU   Iterations
//   ----------------------------------------------------------------
//   BM_stdlib_mbrtowc_1           8.03 ns       7.95 ns     87144997
//   BM_stdlib_mbrtowc_2           22.0 ns       21.8 ns     32002437
//   BM_stdlib_mbrtowc_3           30.0 ns       29.7 ns     23517699
//   BM_stdlib_mbrtowc_4           37.4 ns       37.1 ns     18895204
//   BM_stdlib_mbstowcs_ascii    792373 ns     782484 ns          890 bytes_per_second=609.389M/s
//   BM_stdlib_mbstowcs_wide   15836785 ns   15678316 ns           44 bytes_per_second=30.4138M/s
//
// Benchmarks after:
//
//   Benchmark                        Time           CPU   Iterations
//   ----------------------------------------------------------------
//   BM_stdlib_mbrtowc_1           5.76 ns       5.72 ns    121863813
//   BM_stdlib_mbrtowc_2           17.1 ns       16.9 ns     41487260
//   BM_stdlib_mbrtowc_3           24.2 ns       24.0 ns     29141629
//   BM_stdlib_mbrtowc_4           30.3 ns       30.1 ns     23229291
//   BM_stdlib_mbstowcs_ascii    783506 ns     775389 ns          903 bytes_per_second=614.965M/s
//   BM_stdlib_mbstowcs_wide   12787003 ns   12672642 ns           55 bytes_per_second=37.6273M/s
//
// Bug: http://b/206523398
// Test: treehugger
// Change-Id: If8c6c39880096ddd2cbd323c68dca82e9849ace6
int mbsinit(const mbstate_t* ps) {
  return ps == nullptr || mbstate_is_initial(ps);
}
size_t mbrtowc(wchar_t* pwc, const char* s, size_t n, mbstate_t* ps) {
  static mbstate_t __private_state;
  mbstate_t* state = (ps == nullptr) ? &__private_state : ps;

  // Our wchar_t is UTF-32.
  return mbrtoc32(reinterpret_cast<char32_t*>(pwc), s, n, state);
}
size_t mbsnrtowcs(wchar_t* dst, const char** src, size_t nmc, size_t len, mbstate_t* ps) {
  static mbstate_t __private_state;
  mbstate_t* state = (ps == nullptr) ? &__private_state : ps;
  size_t i, o, r;

  // The fast paths in the loops below are not safe if an ASCII
  // character appears as anything but the first byte of a
  // multibyte sequence. Check now to avoid doing it in the loops.
  if (nmc > 0 && mbstate_bytes_so_far(state) > 0 && static_cast<uint8_t>((*src)[0]) < 0x80) {
    return mbstate_reset_and_return_illegal(EILSEQ, state);
  }

  // Measure only?
  if (dst == nullptr) {
    for (i = o = 0; i < nmc; i += r, o++) {
      if (static_cast<uint8_t>((*src)[i]) < 0x80) {
        // Fast path for plain ASCII characters.
        if ((*src)[i] == '\0') {
          return mbstate_reset_and_return(o, state);
        }
        r = 1;
      } else {
        r = mbrtowc(nullptr, *src + i, nmc - i, state);
        if (r == BIONIC_MULTIBYTE_RESULT_ILLEGAL_SEQUENCE) {
          return mbstate_reset_and_return_illegal(EILSEQ, state);
        }
        if (r == BIONIC_MULTIBYTE_RESULT_INCOMPLETE_SEQUENCE) {
          return mbstate_reset_and_return_illegal(EILSEQ, state);
        }
        if (r == 0) {
          return mbstate_reset_and_return(o, state);
        }
      }
    }
    return mbstate_reset_and_return(o, state);
  }

  // Actually convert, updating `dst` and `src`.
  for (i = o = 0; i < nmc && o < len; i += r, o++) {
    if (static_cast<uint8_t>((*src)[i]) < 0x80) {
      // Fast path for plain ASCII characters.
      dst[o] = (*src)[i];
      r = 1;
      if ((*src)[i] == '\0') {
        *src = nullptr;
        return mbstate_reset_and_return(o, state);
      }
    } else {
      r = mbrtowc(dst + o, *src + i, nmc - i, state);
      if (r == BIONIC_MULTIBYTE_RESULT_ILLEGAL_SEQUENCE) {
        *src += i;
        return mbstate_reset_and_return_illegal(EILSEQ, state);
      }
      if (r == BIONIC_MULTIBYTE_RESULT_INCOMPLETE_SEQUENCE) {
        *src += nmc;
        return mbstate_reset_and_return_illegal(EILSEQ, state);
      }
      if (r == 0) {
        *src = nullptr;
        return mbstate_reset_and_return(o, state);
      }
    }
  }
  *src += i;
  return mbstate_reset_and_return(o, state);
}
size_t mbsrtowcs(wchar_t* dst, const char** src, size_t len, mbstate_t* ps) {
  return mbsnrtowcs(dst, src, SIZE_MAX, len, ps);
}
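The "measure only" branch (`dst == nullptr`) supports the usual two-pass calling pattern: ask for the length first, then convert into a buffer of exactly that size. The sketch below shows that pattern from the caller's side using only the standard `std::mbsrtowcs()`; `to_wide` is a hypothetical helper name, not part of this file. The input in the usage note is plain ASCII so the result does not depend on the current locale on other libcs.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <string>

// Hypothetical helper: converts a NUL-terminated multibyte string to a
// std::wstring. Returns false if the input contains an invalid sequence.
bool to_wide(const char* s, std::wstring* out) {
  // Pass 1: measure. With a null destination, `len` is ignored and the
  // return value is the number of wide characters, excluding the terminator.
  std::mbstate_t st{};
  const char* p = s;
  size_t n = std::mbsrtowcs(nullptr, &p, 0, &st);
  if (n == static_cast<size_t>(-1)) return false;

  // Pass 2: convert into storage of exactly that size, starting again from
  // the beginning with a fresh state.
  out->resize(n);
  st = std::mbstate_t{};
  p = s;
  size_t m = std::mbsrtowcs(&(*out)[0], &p, n, &st);
  return m == n;
}
```

For example, `to_wide("hello", &w)` returns true and leaves `w` equal to `L"hello"`. Because `len` is exactly the measured length, the second pass stops before the terminator, so `p` is left pointing at the NUL rather than being set to `nullptr`.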
// Commit (2023-06-15 22:17:08 +02:00): Expose tzalloc()/localtime_rz()/mktime_z()/tzfree().
//
// * Rationale
//
// The question often comes up of how to use multiple time zones in C code. If
// you're single-threaded, you can just use setenv() to manipulate $TZ. toybox
// does this, for example. But that's not thread-safe in two distinct ways:
// firstly, getenv() is not thread-safe with respect to modifications to the
// environment (and between the way putenv() is specified and the existence of
// environ, it's not obvious how to fully fix that), and secondly the _caller_
// needs to ensure that no other threads are using tzset() or any function that
// behaves "as if" tzset() was called (which is neither easy to determine nor
// easy to ensure).
//
// This isn't a bigger problem because most of the time the right answer is to
// stop pretending that libc is at all suitable for any i18n, and switch to
// icu4c instead. (The NDK icu4c headers do not include ucal_*, so this is not
// a realistic option for most applications.)
//
// But what if you're somewhere in between? Like the rust chrono library, for
// example? What then? Currently their "least worst" option is to reinvent the
// entire wheel and read our tzdata files. Which isn't a great solution for
// anyone, for obvious maintainability reasons.
//
// So it's probably time we broke the catch-22 here and joined NetBSD in
// offering a less broken API than standard C has for the last 40 years. Sure,
// any would-be caller will have to have a separate "is this Android?" and even
// "is this API level >= 35?" path, but that will fix itself sometime in the
// 2030s when developers can just assume "yes, it is", whereas if we keep
// putting off exposing anything, this problem never gets solved. (No-one's
// bothered to try to implement the std::chrono::time_zone functionality in
// libc++ yet, but they'll face a similar problem if/when they do.)
//
// * Implementation
//
// The good news is that tzcode already implements these functions, so there's
// relatively little here.
//
// I've chosen not to expose `struct state` because `struct __timezone_t` makes
// for clearer error messages, given that compiler diagnostics will show the
// underlying type name (`struct __timezone_t*`) rather than the typedef name
// (`timezone_t`) that's used in calling code.
//
// I've moved us over to FreeBSD's wcsftime() rather than keep the OpenBSD one
// building --- I've long wanted to only have one implementation here, and
// FreeBSD is already doing the "convert back and forth, calling the non-wide
// function in the middle" dance that I'd hoped to get round to doing myself
// someday. This should mean that our strftime() and wcsftime() behaviors can't
// easily diverge in future, plus macOS/iOS are mostly FreeBSD, so any bugs will
// likely be interoperable with the other major mobile operating system, so
// there's something nice for everyone there!
//
// The FreeBSD wcsftime() implementation includes a wcsftime_l() implementation,
// so that's one stub we can remove. The flip side of that is that it uses
// mbsrtowcs_l() and wcsrtombs_l() which we didn't previously have. So expose
// those as aliases of mbsrtowcs() and wcsrtombs().
//
// Bug: https://github.com/chronotope/chrono/issues/499
// Test: treehugger
// Change-Id: Iee1b9d763ead15eef3d2c33666b3403b68940c3c
__strong_alias(mbsrtowcs_l, mbsrtowcs);
size_t wcrtomb(char* s, wchar_t wc, mbstate_t* ps) {
  static mbstate_t __private_state;
  mbstate_t* state = (ps == nullptr) ? &__private_state : ps;

  // Our wchar_t is UTF-32.
  return c32rtomb(s, static_cast<char32_t>(wc), state);
}
size_t wcsnrtombs(char* dst, const wchar_t** src, size_t nwc, size_t len, mbstate_t* ps) {
  static mbstate_t __private_state;
  mbstate_t* state = (ps == nullptr) ? &__private_state : ps;
  if (!mbstate_is_initial(state)) {
    return mbstate_reset_and_return_illegal(EILSEQ, state);
  }

  char buf[MB_LEN_MAX];
  size_t i, o, r;

  // Measure only?
  if (dst == nullptr) {
    for (i = o = 0; i < nwc; i++, o += r) {
      wchar_t wc = (*src)[i];
      if (static_cast<uint32_t>(wc) < 0x80) {
        // Fast path for plain ASCII characters.
        if (wc == 0) {
          return o;
        }
        r = 1;
      } else {
        r = wcrtomb(buf, wc, state);
        if (r == BIONIC_MULTIBYTE_RESULT_ILLEGAL_SEQUENCE) {
          return r;
        }
      }
    }
    return o;
  }

  // Actually convert, updating `dst` and `src`.
  for (i = o = 0; i < nwc && o < len; i++, o += r) {
    wchar_t wc = (*src)[i];
    if (static_cast<uint32_t>(wc) < 0x80) {
      // Fast path for plain ASCII characters.
      dst[o] = wc;
      if (wc == 0) {
        *src = nullptr;
        return o;
      }
      r = 1;
    } else if (len - o >= sizeof(buf)) {
      // Enough space to translate in-place.
      r = wcrtomb(dst + o, wc, state);
      if (r == BIONIC_MULTIBYTE_RESULT_ILLEGAL_SEQUENCE) {
        *src += i;
        return r;
      }
    } else {
      // May not be enough space; use temp buffer.
      r = wcrtomb(buf, wc, state);
      if (r == BIONIC_MULTIBYTE_RESULT_ILLEGAL_SEQUENCE) {
        *src += i;
        return r;
      }
      if (r > len - o) {
        break;
      }
      memcpy(dst + o, buf, r);
    }
  }
  *src += i;
  return o;
}
size_t wcsrtombs(char* dst, const wchar_t** src, size_t len, mbstate_t* ps) {
  return wcsnrtombs(dst, src, SIZE_MAX, len, ps);
}
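The way `*src` is updated lets a caller convert a long wide string in fixed-size chunks: after a call that runs out of room, `*src` points at the first unconverted wide character, and once the terminator has been converted it is set to `nullptr`. A minimal sketch of that calling pattern follows, using only the standard `std::wcsrtombs()`; `ChunkResult` and `convert_chunk` are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>

// Hypothetical result type: bytes written by one call, and whether the
// terminator was reached (signalled by *src becoming nullptr).
struct ChunkResult {
  size_t bytes;
  bool done;
};

// Converts as much of `*src` as fits into `dst[0..len)`, advancing `*src`.
// `bytes` is (size_t)-1 if a wide character cannot be represented.
ChunkResult convert_chunk(char* dst, size_t len, const wchar_t** src, std::mbstate_t* st) {
  size_t r = std::wcsrtombs(dst, src, len, st);
  return {r, *src == nullptr};
}
```

With a 3-byte buffer and the ASCII input `L"hello"`, the first call writes `hel` and leaves `*src` three characters in; the second call writes `lo` plus the terminator, reports 2 bytes, and sets `*src` to `nullptr`.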
__strong_alias(wcsrtombs_l, wcsrtombs);