From: Pádraig Brady
Date: Fri, 18 May 2018 04:41:46 +0000 (-0700)
Subject: wc: optimize processing of ASCII in multi byte locales
X-Git-Tag: v8.30~25
X-Git-Url: http://git.ipfire.org/cgi-bin/gitweb.cgi?a=commitdiff_plain;h=77517a99179b728c6369341b0d36568bac5d7914;p=thirdparty%2Fcoreutils.git

wc: optimize processing of ASCII in multi byte locales

===== Benchmark setup (on GNU/Linux) ====
$ yes áááááááááááááááááááá | head -n100000 > mbc.txt
$ yes 12345678901234567890 | head -n100000 > num.txt

===== Before ====
$ time src/wc -Lm < mbc.txt
real    0m0.186s
$ time src/wc -m < mbc.txt
real    0m0.186s
$ time src/wc -Lm < num.txt
real    0m0.055s
$ time src/wc -m < num.txt
real    0m0.056s

==== After ====
$ time src/wc -Lm < mbc.txt
real    0m0.196s
$ time src/wc -m < mbc.txt
real    0m0.173s
$ time src/wc -Lm < num.txt
real    0m0.031s
$ time src/wc -m < num.txt
real    0m0.028s

* src/wc.c (wc): Only call wide variant functions like
iswprint() and wcwidth() for non is_basic() characters.
I.E. non ISO C "basic character set" characters.
This is especially significant on OSX where wcwidth()
is very expensive (about 10x in tests).
* NEWS: Mention the improvement.
Suggested by Eric Fischer.
---

diff --git a/NEWS b/NEWS
index 101afc0809..2020ab6e37 100644
--- a/NEWS
+++ b/NEWS
@@ -55,6 +55,9 @@ GNU coreutils NEWS -*- outline -*-
   version of XFS.  stat -f --format=%T now reports the file system type,
   and tail -f uses inotify.
 
+  wc avoids redundant processing of ASCII text in multibyte locales,
+  which is especially significant on macOS.
+
 
 * Noteworthy changes in release 8.29 (2017-12-27) [stable]
 
diff --git a/src/wc.c b/src/wc.c
index 0c72042a0b..2034c42bee 100644
--- a/src/wc.c
+++ b/src/wc.c
@@ -379,6 +379,7 @@ wc (int fd, char const *file_x, struct fstatus *fstatus, off_t current_pos)
             {
               wchar_t wide_char;
               size_t n;
+              bool wide = true;
 
               if (!in_shift && is_basic (*p))
                 {
@@ -386,6 +387,7 @@ wc (int fd, char const *file_x, struct fstatus *fstatus, off_t current_pos)
                      mbrtowc().  */
                   n = 1;
                   wide_char = *p;
+                  wide = false;
                 }
               else
                 {
@@ -419,9 +421,7 @@ wc (int fd, char const *file_x, struct fstatus *fstatus, off_t current_pos)
                       n = 1;
                     }
                 }
-              p += n;
-              bytes_read -= n;
-              chars++;
+
               switch (wide_char)
                 {
                 case '\n':
@@ -445,17 +445,33 @@ wc (int fd, char const *file_x, struct fstatus *fstatus, off_t current_pos)
                   in_word = false;
                   break;
                 default:
-                  if (iswprint (wide_char))
+                  if (wide && iswprint (wide_char))
                     {
-                      int width = wcwidth (wide_char);
-                      if (width > 0)
-                        linepos += width;
+                      /* wcwidth can be expensive on OSX for example,
+                         so avoid if uneeded.  */
+                      if (print_linelength)
+                        {
+                          int width = wcwidth (wide_char);
+                          if (width > 0)
+                            linepos += width;
+                        }
                       if (iswspace (wide_char))
                         goto mb_word_separator;
                       in_word = true;
                     }
+                  else if (!wide && isprint (to_uchar (*p)))
+                    {
+                      linepos++;
+                      if (isspace (to_uchar (*p)))
+                        goto mb_word_separator;
+                      in_word = true;
+                    }
                   break;
                 }
+
+              p += n;
+              bytes_read -= n;
+              chars++;
             }
           while (bytes_read > 0);
 
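
For readers who want to try the idea outside of wc.c, here is a minimal standalone sketch of the same dispatch: 7-bit bytes take a narrow <ctype.h> path, and only the remaining bytes go through mbrtowc() and the wide-character tests. It is not the code from this patch. The names scan_buffer, scan_result and is_basic_sketch are hypothetical, is_basic_sketch is a simplified stand-in for gnulib's is_basic(), and the width function is passed in as a parameter (a caller would typically pass POSIX wcwidth) so the sketch depends only on ISO C headers. Tab, carriage return and form feed handling, word counting, and carrying an incomplete sequence across reads are all omitted.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical result type for this sketch; not part of wc.c.  */
struct scan_result
{
  size_t chars;       /* characters counted */
  size_t max_width;   /* widest line seen, in columns */
  size_t wide_calls;  /* bytes that had to go through mbrtowc() */
};

/* Simplified stand-in for gnulib's is_basic(): accept any 7-bit byte.
   The real macro tests membership in the ISO C basic character set.  */
static bool
is_basic_sketch (char c)
{
  return ((unsigned char) c & 0x80) == 0;
}

/* Scan LEN bytes at BUF.  WIDTH_FN is only consulted when WANT_WIDTH is
   true and the character took the wide path; it may be NULL.  */
struct scan_result
scan_buffer (char const *buf, size_t len, bool want_width,
             int (*width_fn) (wchar_t))
{
  struct scan_result r = { 0, 0, 0 };
  size_t linepos = 0;
  mbstate_t state;
  char const *p = buf;

  assert (buf != NULL);
  memset (&state, 0, sizeof state);

  while (len > 0)
    {
      size_t n = 1;
      bool in_shift = !mbsinit (&state);

      if (!in_shift && is_basic_sketch (*p))
        {
          /* Fast path: narrow character tests only.  */
          unsigned char c = (unsigned char) *p;
          if (c == '\n')
            {
              if (linepos > r.max_width)
                r.max_width = linepos;
              linepos = 0;
            }
          else if (isprint (c))
            linepos++;
        }
      else
        {
          wchar_t wc;
          size_t ret = mbrtowc (&wc, p, len, &state);
          r.wide_calls++;
          if (ret == (size_t) -1 || ret == (size_t) -2)
            {
              /* Invalid or incomplete sequence: skip one byte without
                 counting a character (a simplification of wc.c).  */
              memset (&state, 0, sizeof state);
              p++;
              len--;
              continue;
            }
          n = ret ? ret : 1;
          /* Only pay for the width lookup when the caller needs it.  */
          if (want_width && width_fn && iswprint ((wint_t) wc))
            {
              int w = width_fn (wc);
              if (w > 0)
                linepos += (size_t) w;
            }
        }

      p += n;
      len -= n;
      r.chars++;
    }

  if (linepos > r.max_width)
    r.max_width = linepos;
  return r;
}
```

For pure ASCII input such as the num.txt benchmark file, wide_calls stays at zero, which is the case the patch speeds up. To measure real column widths, a caller would call setlocale (LC_ALL, "") first and pass wcwidth as width_fn, which on glibc requires defining _XOPEN_SOURCE before including <wchar.h>.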