git.ipfire.org Git - thirdparty/coreutils.git/log
2 weeks ago
Pádraig Brady [Fri, 20 Mar 2026 14:49:44 +0000 (14:49 +0000)] 
cut: refactor find_bytesearch_field_terminator to be stateful

Allows better/simpler avoidance of repeated line/delim scans

TODO: speed up our really slow cut_fields_mb_any.
Compare for example:
  time src/cut -w  -f1 ll.in >/dev/null #14s
  time src/cut -d, -f1 ll.in >/dev/null #.1s
Could adjust so that LC_ALL=C does memchr2(space,tab) ?

2 weeks ago
Pádraig Brady [Fri, 20 Mar 2026 14:21:27 +0000 (14:21 +0000)]
cut: avoid repeated searches for line_delim in the multi-byte delim case

TODO: Refactor all this into find_bytesearch_field_terminator.
Also handle in the delim_length==1 case.

2 weeks ago
Pádraig Brady [Fri, 20 Mar 2026 13:52:37 +0000 (13:52 +0000)] 
cut: refactor all byte search to find_bytesearch_field_terminator

TODO: Perhaps also add search only fields mode
to avoid rescans of very long lines

2 weeks ago
Pádraig Brady [Fri, 20 Mar 2026 13:42:43 +0000 (13:42 +0000)] 
cut: optimize -f when finished processing fields for a line

TODO: simplify and compare perf

2 weeks ago
Pádraig Brady [Fri, 20 Mar 2026 13:41:37 +0000 (13:41 +0000)]
cut: optimize -f for the common case of single byte delimiters

* TODO: perf comparison

2 weeks ago
Pádraig Brady [Tue, 17 Mar 2026 14:17:06 +0000 (14:17 +0000)] 
cut: optimize -d '?' in UTF-8 case

Ensure all ASCII delimiters are processed with byte search in UTF-8.

2 weeks ago
Pádraig Brady [Tue, 17 Mar 2026 13:30:36 +0000 (13:30 +0000)] 
cut: merge cut_fields and cut_fields_bytesearch

TODO: See why this is much slower:
time LC_ALL=C.UTF-8 src/cut -f1 -dc as.in > /dev/null

2 weeks ago
Pádraig Brady [Mon, 16 Mar 2026 22:23:02 +0000 (22:23 +0000)] 
cut: refactor -f to byte search and character processing

Not sure about this at all.
Only worthwhile if can also remove cut_fields_line_delim.

Refactored src/cut.c so the old UTF-8 byte-search path is now the
general byte-search field engine for all safe byte-search cases:
ordinary single-byte delimiters and valid UTF-8 delimiters. The old
cut_fields path did not go away completely; it is now
cut_fields_line_delim and is used only when the field delimiter equals
the record delimiter, because -d $'\n' and -z -d '' have different
semantics from normal line-based field splitting.

As part of that, I also folded the duplicated “start selected field”
logic into a shared helper, and renamed the byte-search helpers to match
their broader use. The current dispatcher in src/cut.c is now:
whitespace parser, then line-delimiter field mode, then byte-search
field mode, then the decoded multibyte parser.

2 weeks ago
Pádraig Brady [Mon, 16 Mar 2026 20:47:23 +0000 (20:47 +0000)] 
cut: fix 25% perf regression mentioned in previous change

* src/cut.c: Prefer strstr() over memmem() as the former is
optimized per platform on GLIBC, whereas the latter is only
optimized for S390 as of glibc-2.42.

TODO: look at merging cut_fields and cut_fields_mb_utf8
as they're both byte search
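
A minimal sketch of the strstr() approach described above, not the code in src/cut.c. The helper name find_delim is hypothetical, and the sketch assumes the buffer is NUL-terminated and contains no embedded NUL bytes; input with embedded NULs would still need a length-based search such as memmem().

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: locate the (possibly multi-byte) delimiter DELIM
   in BUF using strstr().  BUF must be NUL-terminated; LEN is the number
   of bytes the caller considers valid.  Returns NULL if DELIM does not
   occur entirely within the first LEN bytes.  */
const char *
find_delim (const char *buf, size_t len, const char *delim)
{
  const char *hit = strstr (buf, delim);

  /* strstr() stops at the first NUL, so only the upper bound
     against LEN needs checking here.  */
  if (hit && (size_t) (hit - buf) + strlen (delim) <= len)
    return hit;
  return NULL;
}
```

Calling find_delim (line, line_len, "\xc3\xa7") on a line containing "ç" returns a pointer to the first byte of that delimiter.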

2 weeks ago
Pádraig Brady [Mon, 16 Mar 2026 14:11:08 +0000 (14:11 +0000)] 
cut: use bounded memory in utf8 mode when possible

TODO: See why a bit slower than old code

$ time src/cut.old -f1 -dç mb.in >/dev/null
real 0m0.136s
user 0m0.096s
sys 0m0.039s

$ time src/cut.new -f1 -dç mb.in >/dev/null
real 0m0.170s
user 0m0.139s
sys 0m0.030s

2 weeks ago
Pádraig Brady [Mon, 16 Mar 2026 12:45:59 +0000 (12:45 +0000)] 
cut: add utf8 helper to mbbuf

* gl/lib/mbbuf.h: To safely search a bounded buffer,
without needing to have unbounded memory with getndelim2.

TODO: rename mbbuf_utf8_safe_prefix to mbbuf_fill_utf8

2 weeks ago
Pádraig Brady [Mon, 16 Mar 2026 11:11:59 +0000 (11:11 +0000)] 
cut: faster utf8 processing

TODO: improve to use bounded memory where possible

2 weeks ago
Pádraig Brady [Fri, 13 Mar 2026 19:48:05 +0000 (19:48 +0000)] 
cut: support -F as an alias for -f -w -O ' '

To improve compatibility with toybox/busybox scripts.

2 weeks ago
Pádraig Brady [Fri, 13 Mar 2026 15:08:34 +0000 (15:08 +0000)] 
maint: cut: refactor buffered and ordinary field scanning

* src/cut.c: Merge scan_mb_field and read_mb_field_to_buffer

2 weeks ago
Pádraig Brady [Fri, 13 Mar 2026 14:57:42 +0000 (14:57 +0000)] 
cut: support --whitespace-delimited=trimmed

Support ignoring leading and trailing whitespace.
E.g. this matches awk's default field splitting mode.

* src/cut.c
* tests/cut/cut.pl: Add test cases.
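
The trimmed, awk-like splitting mode can be sketched as follows. This is an illustration only: ws_field and is_blank are hypothetical names, only ASCII space and tab count as blanks here, and the line is assumed to exclude its trailing newline. The real implementation is locale-aware.

```c
#include <stddef.h>

/* Assumption: only ASCII space and tab separate fields.  */
static int
is_blank (char c)
{
  return c == ' ' || c == '\t';
}

/* Hypothetical helper: locate 1-based field N in LINE[0..LEN).
   Leading and trailing blanks are ignored and each run of blanks
   is a single separator.  On success store the field length in
   *FIELD_LEN and return a pointer to the field; otherwise NULL.  */
const char *
ws_field (const char *line, size_t len, size_t n, size_t *field_len)
{
  size_t i = 0;

  for (size_t field = 1; ; field++)
    {
      while (i < len && is_blank (line[i]))   /* Skip a separator run.  */
        i++;
      if (i == len)
        return NULL;                          /* No more fields.  */

      size_t start = i;
      while (i < len && !is_blank (line[i]))  /* Scan the field body.  */
        i++;

      if (field == n)
        {
          *field_len = i - start;
          return line + start;
        }
    }
}
```

With the input "  foo \t bar  ", field 1 is "foo", field 2 is "bar", and there is no empty leading or trailing field.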

2 weeks ago
Pádraig Brady [Fri, 13 Mar 2026 11:05:40 +0000 (11:05 +0000)] 
cut: support -O as an alias for --output-delimiter

To improve compatibility with toybox/busybox scripts.

* doc/coreutils.texi (cut invocation): Add -O description.
* src/cut.c: Support -O as well as --output-delimiter
* tests/cut/cut.pl: Adjust one case to use -O.

2 weeks ago
Pádraig Brady [Wed, 11 Mar 2026 22:15:38 +0000 (22:15 +0000)] 
doc: cut: adjust for multi-byte support

* doc/coreutils.texi (cut invocation): Remove the note about
-c being the same as -b.

2 weeks ago
Pádraig Brady [Thu, 12 Mar 2026 18:58:46 +0000 (18:58 +0000)] 
cut: refactor multi-byte updates

* src/cut.c: 160 fewer lines

Helpers extracted (replacing repeated inline patterns):
- write_line_delim(), write_pending_line_delim(), reset_item_line()
  - line boundary code used by cut_bytes{,_no_split}, cut_characters
- write_selected_item()
  - output-delimiter + write logic used by all three byte/char functions
- reset_field_line()
  - field line reset used by cut_fields_mb_any

Field functions unified via cut_fields_mb_any(stream, whitespace_mode):
- struct mbfield_parser encapsulates the whitespace vs.
  fixed-delimiter state (saved char, mode flag)
- mbfield_get_char() - dispatches to saved-char or direct read
- mbfield_terminator()
  - returns FIELD_{DATA,DELIMETER,LINE_DELIMITER} based on mode
- read_mb_field_to_buffer()
  - replaces the two duplicated first-field buffering loops
- scan_mb_field(mbbuf, parser, pending, write_field)
  - replaces the four duplicated field scan loops
  (print+skip × two modes) with a single function and a write_field bool
- cut_fields_mb and cut_fields_ws are now trivial wrappers

2 weeks ago
Pádraig Brady [Thu, 12 Mar 2026 17:27:00 +0000 (17:27 +0000)] 
cut: implement -n to avoid outputting partial characters

Both the i18n patch and FreeBSD/macOS support this option.
They do differ in behavior somewhat as the i18n patch
may output more bytes than requested.

  $ printf '\xc3\xa9b\n' | i18n-cut -n -b1
  é

There is also a bug in the i18n patch with multi-byte
at the start of a line:

  $ printf '\xc3\xa9b\n' | i18n-cut -n -b1-2
  éb

We follow the FreeBSD behavior since it seems more
useful to have -b be a hard limit, rather than a soft limit.
This also reduces the possibility of duplicate character output
with separate cut invocations with non-overlapping byte ranges.

* src/cut.c (cut_bytes_no_split): A new function
similar to cut_characters, to handle multi-byte characters
with byte limit semantics.
* tests/cut/cut.pl: Add test cases.
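
The hard-limit semantics can be sketched for the simplest case, a range starting at byte 1 in UTF-8 input. This is not the cut_bytes_no_split code: whole_chars_within is a hypothetical helper, and it treats any run of continuation bytes (10xxxxxx) as belonging to the preceding character without further validation.

```c
#include <stddef.h>

/* Hypothetical helper: return how many leading bytes of BUF[0..LEN)
   may be output so that no more than LIMIT bytes are written and no
   UTF-8 character is split.  */
size_t
whole_chars_within (const unsigned char *buf, size_t len, size_t limit)
{
  size_t end = 0;

  while (end < len)
    {
      /* Find the start of the next character: skip the lead byte,
         then any continuation bytes.  */
      size_t next = end + 1;
      while (next < len && (buf[next] & 0xC0) == 0x80)
        next++;

      if (next > limit)   /* This character would cross the limit.  */
        break;
      end = next;
    }
  return end;
}
```

For the bytes of "éb\n" (0xC3 0xA9 'b' '\n'), a limit of 1 yields 0 bytes because "é" needs two, while a limit of 2 yields exactly the two bytes of "é".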

2 weeks ago
Pádraig Brady [Thu, 12 Mar 2026 16:09:39 +0000 (16:09 +0000)] 
tests: cut: add a test for divergence from i18n patch

* tests/cut/cut.pl: We don't fall back to byte mode
upon invalid uni-byte delimiter.

2 weeks ago
Pádraig Brady [Thu, 12 Mar 2026 15:50:04 +0000 (15:50 +0000)] 
tests: cut: add case currently failing for coreutils-i18n patch

* tests/cut/cut.pl: Test for extraneous character output with:
printf 'aéb\n' | cut -s -d 'é' -f1 | od -tx1

2 weeks ago
Pádraig Brady [Thu, 12 Mar 2026 15:04:29 +0000 (15:04 +0000)] 
tests: cut: check multi-byte output delimiter

* tests/cut/cut.pl: Add a test case.

2 weeks ago
Pádraig Brady [Thu, 12 Mar 2026 18:36:27 +0000 (18:36 +0000)] 
cut: adjust error message to be less specific

* src/cut.c (main): Cater for both misplaced -w and -d.

2 weeks ago
Pádraig Brady [Wed, 11 Mar 2026 22:42:45 +0000 (22:42 +0000)] 
cut: implement -w,--whitespace-delimited

* src/cut.c (cut_fields_ws): A new function handling both
uni-byte and multi-byte cases.
* tests/cut/cut.pl: Add test cases.

2 weeks ago
Pádraig Brady [Wed, 11 Mar 2026 22:06:43 +0000 (22:06 +0000)] 
cut: support single byte -d with GB18030 input

* src/cut.c
* tests/cut/mb-non-utf8.sh
* tests/local.mk

2 weeks ago
Pádraig Brady [Wed, 11 Mar 2026 21:23:24 +0000 (21:23 +0000)] 
cut: support single byte -d that may be part of multi-byte

Note this is a slight divergence from the i18n patch
as that switched to uni-byte for any single byte delimiter
that is not valid multi-byte.

That results in possibly splitting in the middle of
a valid multi-byte character.

Instead we only split on a single byte when they're
not part of a multi-byte character.

* src/cut.c
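
A rough sketch of splitting only on a byte that stands alone, restricted to UTF-8 for brevity. Both function names are hypothetical, the decoder checks only lead and continuation byte patterns (it does not reject overlong or surrogate encodings), and the real code relies on locale-aware decoding rather than this hand-rolled scan.

```c
#include <stddef.h>

/* Length of the UTF-8 sequence starting at P with AVAIL bytes
   available, or 1 if the bytes do not form a complete sequence.  */
static size_t
utf8_seq_len (const unsigned char *p, size_t avail)
{
  size_t need = (p[0] < 0x80 ? 1
                 : (p[0] & 0xE0) == 0xC0 ? 2
                 : (p[0] & 0xF0) == 0xE0 ? 3
                 : (p[0] & 0xF8) == 0xF0 ? 4 : 1);

  if (need > avail)
    return 1;
  for (size_t i = 1; i < need; i++)
    if ((p[i] & 0xC0) != 0x80)
      return 1;
  return need;
}

/* Hypothetical helper: index of the first DELIM byte in BUF[0..LEN)
   that is not inside a multi-byte character, or LEN if none.  */
size_t
find_standalone_byte (const unsigned char *buf, size_t len,
                      unsigned char delim)
{
  size_t i = 0;

  while (i < len)
    {
      size_t n = utf8_seq_len (buf + i, len - i);
      if (n == 1 && buf[i] == delim)
        return i;
      i += n;   /* Step over the whole character.  */
    }
  return len;
}
```

With delimiter byte 0xA9 and input bytes 'a' 0xC3 0xA9 'b' 0xA9 'c', the 0xA9 inside "é" is skipped and the match is the stray byte at index 4.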

2 weeks ago
Pádraig Brady [Wed, 11 Mar 2026 21:12:04 +0000 (21:12 +0000)] 
cut: support multi-byte field delimiters

* src/cut.c
* tests/cut/cut.pl

2 weeks ago
Pádraig Brady [Wed, 11 Mar 2026 20:50:23 +0000 (20:50 +0000)] 
cut: support multi-byte input with -c

* src/cut.c
* tests/cut/cut.pl

2 weeks ago
Pádraig Brady [Fri, 13 Mar 2026 15:43:21 +0000 (15:43 +0000)] 
maint: cut: refactor output calls

* src/cut.c (cut_fields): Refactor calls to fwrite() and putchar()

2 weeks ago
Pádraig Brady [Sat, 21 Mar 2026 18:53:24 +0000 (18:53 +0000)]
tests: cut: ensure no unnecessary buffering

* tests/misc/write-errors.sh: Ensure we write output when possible.

2 weeks ago
Pádraig Brady [Fri, 13 Mar 2026 11:58:24 +0000 (11:58 +0000)] 
doc: cut: reorder --complement alphabetically in help

* src/cut.c (usage): Move placement of --complement description.
* doc/coreutils.texi (cut invocation): Likewise.

2 weeks ago
Pádraig Brady [Wed, 11 Mar 2026 22:17:08 +0000 (22:17 +0000)] 
doc: cut: clarify description of -b and -c

* src/cut.c (usage): State the arguments are positions,
in case users may think they were values.

2 weeks ago
Pádraig Brady [Sun, 5 Apr 2026 12:04:03 +0000 (13:04 +0100)] 
build: update to latest gnulib

Pick up mbrto{c32,wc} optimizations on UTF-8 on GLIBC.
Note configure.ac defines the required GNULIB_WCHAR_SINGLE_LOCALE.
This speeds up wc -m by 2.6x, when processing non ASCII chars,
and will similarly speed up per character processing
in the impending cut multi-byte implementation.
* NEWS: Mention the wc -m speed improvement.

2 weeks ago
Collin Funk [Sat, 4 Apr 2026 19:44:14 +0000 (12:44 -0700)] 
basename: avoid duplicate strlen calls on the suffix

    $ ltrace -c ./src/basename-prev -s a $(seq 100000) > /dev/null
    % time     seconds  usecs/call     calls      function
    ------ ----------- ----------- --------- --------------------
     50.00   30.030316          75    400000 strlen
    [...]
    $ ltrace -c ./src/basename -s a $(seq 100000) > /dev/null
    % time     seconds  usecs/call     calls      function
    ------ ----------- ----------- --------- --------------------
     42.88   22.413953          74    300001 strlen
    [...]

* src/basename.c (remove_suffix, perform_basename): Add a length
argument for the suffix and use it instead of strlen.
(main): Calculate the suffix length. Refactor code to avoid calling
perform_basename in multiple places.
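
The idea of computing the suffix length once and passing it down can be sketched like this. The signature below is an assumption for illustration and need not match the one in src/basename.c.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Sketch: strip SUFFIX from the end of NAME, using the caller-supplied
   SUFFIX_LEN instead of calling strlen (suffix) for every name.
   As with basename, a name identical to the suffix is left alone.
   Returns true if NAME was shortened.  */
bool
remove_suffix (char *name, const char *suffix, size_t suffix_len)
{
  size_t name_len = strlen (name);

  if (0 < suffix_len && suffix_len < name_len
      && memcmp (name + name_len - suffix_len, suffix, suffix_len) == 0)
    {
      name[name_len - suffix_len] = '\0';
      return true;
    }
  return false;
}
```

A caller processing many names would compute strlen (suffix) once before the loop and pass the result to every call.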

2 weeks ago
Paul Eggert [Fri, 3 Apr 2026 01:53:34 +0000 (18:53 -0700)] 
date: simplify -u by not calling putenv

* src/date.c (TZSET): Remove; no longer needed.
(main): Simplify -u’s implementation by passing "UTC0" to tzalloc,
rather than by setting TZ in the environment and then calling getenv.
The old way of doing things dates back to before we had tzalloc.
* configure.ac (LOCALTIME_CACHE): Remove; no longer needed.

2 weeks ago
Paul Eggert [Wed, 1 Apr 2026 21:44:00 +0000 (14:44 -0700)] 
build: update gnulib submodule to latest

2 weeks ago
Paul Eggert [Wed, 1 Apr 2026 21:43:40 +0000 (14:43 -0700)] 
maint: avoid sigaction lock overhead

* configure.ac (GNULIB_SIGACTION_SINGLE_THREAD):
Define to avoid unnecessary locking in Gnulib sigaction.  See:
https://lists.gnu.org/r/bug-gnulib/2026-04/msg00008.html

2 weeks ago
Paul Eggert [Wed, 1 Apr 2026 18:56:18 +0000 (11:56 -0700)] 
maint: avoid Gnulib modules mbiter, mbiterf

* bootstrap.conf (avoided_gnulib_modules): Avoid mbiter and
mbiterf, for the same reason we avoid mbuiter and mbuiterf: these
modules are not needed because (due to mcel-prefer) we use mcel in
preference to mbiter/mbiterf/mbuiter/mbuiterf.

2 weeks ago
Paul Eggert [Wed, 1 Apr 2026 16:54:50 +0000 (09:54 -0700)] 
build: update gnulib submodule to latest

2 weeks ago
oech3 [Wed, 1 Apr 2026 11:37:09 +0000 (20:37 +0900)] 
tests: dd: ensure memory exhaustion is handled gracefully

* tests/dd/no-allocate.sh: Ensure we exit 1 upon mem allocation failure.
Also check other buffer size edge cases.
https://github.com/uutils/coreutils/issues/11436
https://github.com/uutils/coreutils/issues/11580
https://github.com/coreutils/coreutils/pull/235

2 weeks ago
Pádraig Brady [Wed, 1 Apr 2026 12:23:54 +0000 (13:23 +0100)] 
tests: dd: avoid false failure with no controlling terminal

* tests/dd/misc.sh: test -w /dev/tty is not a strong enough check,
we need to actually open /dev/tty to ensure it's available.
It's not available under setsid for example.

2 weeks ago
oech3 [Tue, 31 Mar 2026 06:57:58 +0000 (15:57 +0900)] 
tests: dd: check that erroneous seeks are not done in output

* tests/dd/misc.sh: Add test case for of=/dev/tty.
The same occurs for /dev/stdout, but that varies
in the test harness so is best avoided.
https://github.com/coreutils/coreutils/pull/234

3 weeks ago
oech3 [Mon, 30 Mar 2026 09:07:42 +0000 (18:07 +0900)] 
tests: coreutils: ensure empty arg is diagnosed

* tests/misc/coreutils.sh: Add a test case.
https://github.com/coreutils/coreutils/pull/232

3 weeks ago
Collin Funk [Sun, 29 Mar 2026 01:57:49 +0000 (18:57 -0700)] 
date: avoid calling putenv multiple times unnecessarily

Adding environment variables can become quite expensive in some
admittedly unlikely situations.

    $ for i in $(seq 10000); do export A$i=A$i; done
    $ time ./src/date-prev -u $(yes -- -u | head -n 100000)
    Sun Mar 29 01:59:49 AM UTC 2026

    real 0m3.753s
    user 0m3.684s
    sys 0m0.050s
    $ time ./src/date -u $(yes -- -u | head -n 100000)
    Sun Mar 29 02:00:00 AM UTC 2026

    real 0m0.061s
    user 0m0.022s
    sys 0m0.045s

* src/date.c (main): Only add TZ=UTC0 to the environment once.

3 weeks ago
Collin Funk [Sat, 28 Mar 2026 19:48:38 +0000 (12:48 -0700)] 
maint: remove unnecessary return statements

* src/env.c (initialize_signals): Remove return at the end of the
function.
* src/who.c (print_runlevel): Likewise.

3 weeks ago
Collin Funk [Sat, 28 Mar 2026 19:45:14 +0000 (12:45 -0700)] 
who: avoid locking standard output for each user with the -q option

* src/who.c (list_entries_who): Prefer putchar and fputs to printf.
Simplify separator tracking.

3 weeks ago
Collin Funk [Sat, 28 Mar 2026 05:45:56 +0000 (22:45 -0700)] 
doc: tty: mention the removal of the -s option from POSIX

* doc/coreutils.texi (tty invocation): Mention that POSIX.1-2001 removed
the -s option and that portable scripts can redirect standard out to
/dev/null instead.

3 weeks ago
oech3 [Thu, 12 Mar 2026 09:33:08 +0000 (18:33 +0900)] 
tests: env/env.sh: improve portability

* tests/env/env.sh: Make more portable by avoiding references to our
build dir, and avoiding names that may cause false matches in
multi-call binaries.
https://github.com/coreutils/coreutils/pull/216

3 weeks ago
Pádraig Brady [Wed, 25 Mar 2026 17:33:47 +0000 (17:33 +0000)] 
od: suppress address output on read error

We don't output an address for `od missing` or `od --strings .`,
so be consistent and suppress the address for `od .`.

* src/od.c (dump): Only output an address if no errors
or the offset is non zero.

3 weeks ago
oech3 [Wed, 25 Mar 2026 17:08:51 +0000 (02:08 +0900)] 
tests: od: ensure -j1 /dev/null succeeds

Users may be using this to convert bases.

* tests/od/od-j.sh: Add a test case.
https://github.com/coreutils/coreutils/pull/228

3 weeks ago
Collin Funk [Tue, 24 Mar 2026 04:44:05 +0000 (21:44 -0700)] 
tests: truncate: don't rely on errno being EISDIR

* tests/truncate/multiple-files.sh: Only check that an error is printed
instead of an exact message.
Reported by Bruno Haible.

3 weeks ago
oech3 [Mon, 23 Mar 2026 13:28:22 +0000 (22:28 +0900)] 
tests: yes: support more zero-copy related syscalls

* tests/misc/yes.sh: Disable other related zero-copy syscalls
to ensure better testing of future or other implementations.
https://github.com/coreutils/coreutils/pull/227

4 weeks ago
Collin Funk [Tue, 24 Mar 2026 02:32:21 +0000 (19:32 -0700)] 
maint: remove some unnecessary casts

* src/sort.c (begfield, limfield): Remove size_t casts.

4 weeks ago
Sylvestre Ledru [Fri, 20 Mar 2026 14:17:40 +0000 (15:17 +0100)] 
tests: cut: add test for -z with NUL delimiter and -s flag

* tests/cut/cut.pl (zerot-7): New test.
Identified https://github.com/uutils/coreutils/pull/11394
https://github.com/coreutils/coreutils/pull/226

4 weeks ago
Sylvestre Ledru [Thu, 19 Mar 2026 21:25:14 +0000 (22:25 +0100)] 
tests: tr: add test for invalid character class name

* tests/tr/tr.pl (invalid-class): New test.
Identified: https://github.com/uutils/coreutils/pull/11398
https://github.com/coreutils/coreutils/pull/225

4 weeks ago
Chris Down [Mon, 23 Mar 2026 07:55:53 +0000 (15:55 +0800)] 
sort: speed up keyed field sorting significantly using memchr

When sort is invoked with an explicit field separator with `-t SEP`,
begfield() and limfield() scan for the separator to locate boundaries.
Right now the implementation there uses a loop that iterates over bytes
one by one, which is not ideal since we must scan past many bytes of
non-separator data one byte at a time.

Let's replace each of these loops with memchr(). On glibc systems,
memchr() uses SIMD to scan 16 bytes per step (NEON on aarch64) or 32
bytes per step (AVX2 on x86_64), rather than 1 byte at a time, so any
field longer than a handful of bytes stands to benefit quite
significantly.

Using the following input data:

  awk 'BEGIN {
      srand(42)
      for (i = 1; i <= 500000; i++)
          printf "%*d,%*d,%d\n", 4+int(rand()*9), 0,
                                 4+int(rand()*9), 0, int(rand()*10000)
  }' > short_csv_500k

  awk 'BEGIN {
      for (i = 1; i <= 500000; i++)
          printf "%100d,%100d,%d\n", 0, 0, int(rand()*10000)
  }' > wide_csv_500k

One can benchmark with:

  hyperfine --warmup 10 --runs 50 \
    "LC_ALL=C sort_before -t, -k3,3n short_csv_500k > /dev/null" \
    "LC_ALL=C sort_after -t, -k3,3n short_csv_500k > /dev/null"

  hyperfine --warmup 10 --runs 50 \
    "LC_ALL=C sort_before -t, -k3,3n wide_csv_500k > /dev/null" \
    "LC_ALL=C sort_after -t, -k3,3n wide_csv_500k > /dev/null"

  hyperfine --warmup 10 --runs 50 \
    "LC_ALL=C sort_before wide_csv_500k > /dev/null" \
    "LC_ALL=C sort_after wide_csv_500k > /dev/null"

The results on i9-14900HX x86_64 with -O2:

  sort -t, -k3,3n (500K lines, 4-12 byte short fields):
    Before: 123.1 ms    After: 108.1 ms    (-12.2%)

  sort -t, -k3,3n (500K lines, 100 byte wide fields):
    Before: 243.5 ms    After: 165.9 ms    (-31.9%)

  sort (default, no -k, 500K lines):
    Before: 141.6 ms    After: 141.8 ms    (unchanged)

And on M1 Pro aarch64 with -O2:

  sort -t, -k3,3n (500K lines, 4-12 byte short fields):
    Before: 98.0 ms     After: 92.3 ms     (-5.8%)

  sort -t, -k3,3n (500K lines, 100 byte wide fields):
    Before: 240.8 ms    After: 183.0 ms    (-24.0%)

  sort (default, no -k, 500K lines):
    Before: 145.6 ms    After: 145.6 ms    (unchanged)

Looking at profiling, the improvement is larger on x86_64 in these runs
because glibc's memchr uses AVX2 to scan 32 bytes per step versus 16
bytes per step with NEON on aarch64.
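
The replacement of a byte-by-byte loop with memchr() can be sketched as below. This is a simplified illustration, not the begfield()/limfield() code: field_start is a hypothetical helper that ignores sort's key options and assumes N is at least 1.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return a pointer to the start of 1-based
   field N in LINE[0..LEN), where fields are separated by SEP.
   Returns LINE + LEN if the line has fewer than N fields.  */
const char *
field_start (const char *line, size_t len, char sep, size_t n)
{
  const char *p = line;
  const char *lim = line + len;

  for (size_t i = 1; i < n; i++)
    {
      /* One memchr() call skips a whole field, letting libc use
         its vectorized scan instead of a per-byte comparison.  */
      const char *s = memchr (p, sep, (size_t) (lim - p));
      if (!s)
        return lim;
      p = s + 1;
    }
  return p;
}
```

For the line "aa,bbb,c" with separator ',', field 3 starts at offset 7 after two memchr() calls, regardless of how wide the first two fields are.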

4 weeks ago
Collin Funk [Sun, 22 Mar 2026 20:32:52 +0000 (13:32 -0700)] 
maint: fix an incomplete sentence

* tests/pwd/argument.sh: Fix the test description.
Reported by G. Branden Robinson.

4 weeks ago
Collin Funk [Sun, 22 Mar 2026 05:16:59 +0000 (22:16 -0700)] 
tests: pwd: test the behavior when given an argument

* tests/pwd/argument.sh: New file.
* tests/local.mk (all_tests): Add the new test.

4 weeks ago
Collin Funk [Sat, 21 Mar 2026 22:36:46 +0000 (15:36 -0700)] 
tac: avoid unnecessary standard output buffering

This removes a tiny amount of overhead:

    $ seq 10000000 > input
    $ perf stat -e cpu-clock --repeat 1000 taskset 1 ./src/tac \
        input 2>&1 > /dev/null | grep -F 'seconds time'
              0.095707 +- 0.000223 seconds time elapsed  ( +-  0.23% )
    $ perf stat -e cpu-clock --repeat 1000 taskset 1 ./src/tac-prev \
        input 2>&1 > /dev/null | grep -F 'seconds time'
             0.1009378 +- 0.0000995 seconds time elapsed  ( +-  0.10% )

* src/tac.c (output): Use full_write instead of fwrite since we already
buffer the output ourselves.

4 weeks ago
Collin Funk [Sat, 21 Mar 2026 19:19:21 +0000 (12:19 -0700)] 
tests: rm: fix a test that would sometimes hang

* tests/rm/dash-hint.sh: Add the file name argument to grep, as I
intended when adding this test.

4 weeks ago
Collin Funk [Sat, 21 Mar 2026 08:07:28 +0000 (01:07 -0700)] 
tac: promptly diagnose write errors

This patch also fixes a bug where 'tac' would print a vague error on
some inputs:

    $ seq 10000 | ./src/tac-prev > /dev/full
    tac-prev: write error
    $ seq 10000 | ./src/tac > /dev/full
    tac: write error: No space left on device

In this case ferror (stdout) is true, but errno has been set back to
zero by a successful fclose (stdout) call.

* src/tac.c (output): Call write_error() if fwrite fails.
* tests/misc/io-errors.sh: Check that 'tac' prints a detailed write
error.
* NEWS: Mention the improvement.

4 weeks ago
Pádraig Brady [Sat, 21 Mar 2026 12:37:20 +0000 (12:37 +0000)] 
tests: support checking for specific write errors

* tests/misc/io-errors.sh: Support checking for a specific error
in commands that don't run indefinitely.  Currently all the explicitly
listed commands output a specific error and do not need to be tagged.

4 weeks ago
Collin Funk [Sat, 21 Mar 2026 02:46:04 +0000 (19:46 -0700)] 
tests: nl: check that all files are processed

* tests/nl/multiple-files.sh: New file.
* tests/local.mk (all_tests): Add the new test.

4 weeks ago
Collin Funk [Fri, 20 Mar 2026 06:14:52 +0000 (23:14 -0700)] 
test: truncate: improve the test added in the previous commit

* tests/truncate/multiple-files.sh: Check that nothing is printed to
standard output and that standard error has the correct error.

4 weeks ago
Collin Funk [Fri, 20 Mar 2026 05:54:42 +0000 (22:54 -0700)] 
tests: truncate: check that all files are processed

* tests/truncate/multiple-files.sh: New file.
* tests/local.mk (all_tests): Add the new test.

4 weeks ago
Collin Funk [Wed, 18 Mar 2026 06:06:16 +0000 (23:06 -0700)] 
sort,split,yes: ensure pipe and pipe2 don't open standard descriptors

* bootstrap.conf (gnulib_modules): Add pipe2-safer.
* cfg.mk (sc_require_unistd_safer): New rule for 'make syntax-check'.
* gl/lib/fd-reopen.c: Include unistd--.h instead of unistd.h.
* src/sort.c: Include unistd--.h.
* src/split.c: Likewise.
* src/yes.c: Likewise.

5 weeks ago
Pádraig Brady [Mon, 16 Mar 2026 22:34:58 +0000 (22:34 +0000)] 
tests: dd: fix false failure on NetBSD 10

* tests/dd/partial-write.sh: Skip the test if
nothing written at all, as was seen on NetBSD 10.
Reported by Bruno Haible.

5 weeks ago
Pádraig Brady [Mon, 16 Mar 2026 22:25:42 +0000 (22:25 +0000)] 
tests: ls: fix false failure on FreeBSD

* tests/ls/non-utf8-hidden.sh: Avoid sorting in ls, to avoid:
ls: cannot compare file names ...: Illegal byte sequence
seen on FreeBSD 14.
Reported by Bruno Haible.

5 weeks ago
Collin Funk [Mon, 16 Mar 2026 22:04:24 +0000 (15:04 -0700)] 
maint: tee: remove an affirm call to silence coverity

* src/iopoll.c (write_wait): Don't check that an unsigned integer is
always greater than or equal to zero since that is always true.

5 weeks ago
Collin Funk [Sun, 15 Mar 2026 04:04:12 +0000 (21:04 -0700)] 
wc: make sure input buffer for neon 'wc -l' is aligned

* src/wc_neon.c (wc_lines_neon): Use alignas.

5 weeks ago
Collin Funk [Sun, 15 Mar 2026 03:21:53 +0000 (20:21 -0700)] 
tee: prefer file descriptors over streams

We disable buffering on the streams anyways, so we were effectively
calling the write system call previously despite using streams.

* src/iopoll.h (fclose_wait, fwrite_wait): Remove declarations.
(close_wait, write_wait): Add declarations.
* src/iopoll.c (fwait_for_nonblocking_write, fclose_wait, fwrite_wait):
Remove functions.
(wait_for_nonblocking_write): New function based on
fwait_for_nonblocking_write.
(close_wait): New function based on fclose_wait.
(write_wait): New function based on fwrite_wait.
* src/tee.c: Include fcntl--.h. Don't include stdio--.h.
(get_next_out): Operate on file descriptors instead of streams.
(fail_output): Likewise. Remove clearerr call since we no longer call
fwrite on stdout.
(tee_files): Operate on file descriptors instead of streams. Remove
calls to setvbuf.

5 weeks ago
Collin Funk [Sat, 14 Mar 2026 03:37:10 +0000 (20:37 -0700)] 
timeout: don't exit immediately if the parent is the init process

* src/timeout.c (main): Save the process ID before creating a child
process. Check if the result of getppid is different than the saved
process ID instead of checking if it is 1.
* tests/timeout/init-parent.sh: New file.
* tests/local.mk (all_tests): Add the new test.
* NEWS: Mention the bug fix. Also mention that this change allows
'timeout' to work when reparented by a subreaper process instead of
init.

5 weeks ago
Pádraig Brady [Fri, 13 Mar 2026 10:27:40 +0000 (10:27 +0000)] 
doc: fix missing '=' in texi option descriptions

* doc/coreutils.texi (cut invocation, fold invocation):
Fix missing '=' before option parameters.

5 weeks ago
Pádraig Brady [Wed, 11 Mar 2026 15:39:20 +0000 (15:39 +0000)] 
dd: always diagnose partial writes on write failure

* src/dd.c (dd_copy): Increment the partial write count upon failure.
* tests/dd/partial-write.sh: Add a new test.
* tests/local.mk: Reference the new test.
* NEWS: Mention the bug fix.
Fixes https://bugs.gnu.org/80583

5 weeks ago
Pádraig Brady [Wed, 11 Mar 2026 15:57:22 +0000 (15:57 +0000)] 
doc: clarify a recent NEWS item

* NEWS: It was ambiguous as to whether we quoted a range of
observed throughputs.  Clarify this was the old and new
throughput on a single test system.

5 weeks ago
Collin Funk [Wed, 11 Mar 2026 06:08:57 +0000 (23:08 -0700)] 
doc: NEWS: adjust 'wc -l' aarch64 benchmark after recent commit

After commit e0190a9d1 (wc: improve aarch64 Neon optimization for
'wc -l', 2026-03-09), on an Ampere eMAG machine:

    $ yes | head -n 10000000000 > input
    $ (time ./src/wc -l input)
    10000000000 input

    real 0m3.447s
    user 0m1.533s
    sys 0m1.913s
    $ (export GLIBC_TUNABLES='glibc.cpu.hwcaps=-ASIMD,-AVX2,-AVX512F'; \
       time ./src/wc -l input)
    10000000000 input

    real 0m15.758s
    user 0m14.039s
    sys 0m1.720s

* NEWS: Mention the improved benchmark.

5 weeks ago
Collin Funk [Tue, 10 Mar 2026 07:08:12 +0000 (00:08 -0700)] 
tests: rm: check for hints when running 'rm -foo'

* tests/rm/dash-hint.sh: New file.
* tests/local.mk (all_tests): Add the new test.

5 weeks ago
Pádraig Brady [Tue, 10 Mar 2026 20:14:42 +0000 (20:14 +0000)] 
maint: adjust to placate coverity

* src/system.h (c32issep): Adjust to more standard layout.

5 weeks ago
Pádraig Brady [Sat, 7 Mar 2026 14:23:38 +0000 (14:23 +0000)] 
yes: use a zero-copy implementation via (vm)splice

A good reference for the concepts used here is:
https://mazzo.li/posts/fast-pipes.html
We don't consider huge pages or busy loops here,
but use vmsplice(), and splice() to get significant speedups:

  i7-5600U-laptop $ taskset 1 yes | taskset 2 pv > /dev/null
  ... [4.98GiB/s]
  i7-5600U-laptop $ taskset 1 src/yes | taskset 2 pv > /dev/null
  ... [34.1GiB/s]

  IBM,9043-MRX $ taskset 1 yes | taskset 2 pv > /dev/null
  ... [11.6GiB/s]
  IBM,9043-MRX $ taskset 1 src/yes | taskset 2 pv > /dev/null
  ... [175GiB/s]

Also throughput to file (on BTRFS) was seen to increase significantly,
with a Fedora 43 laptop improving from 690MiB/s to 1.1GiB/s.

* bootstrap.conf: Ensure sys/uio.h is present.
This was an existing transitive dependency.
* m4/jm-macros.m4: Define HAVE_SPLICE appropriately.
We assume vmsplice() is available if splice() is, as they
were introduced at the same time to Linux and glibc.
* src/yes.c (repeat_pattern): A new function to efficiently
duplicate a pattern in a buffer with memcpy calls that double in size.
This also makes the setup for the existing write() path more efficient.
(pipe_splice_size): A new function to increase the kernel pipe buffer
if possible, and use an appropriately sized buffer based on that (25%).
(splice_write): A new function to call vmsplice() when outputting
to a pipe, and also splice() if outputting to a non-pipe.
(main): Adjust to always calling write on the minimal buffer first,
then trying vmsplice(), then falling back to write from bigger buffer.
* tests/misc/yes.sh: Verify the non-pipe output case,
and the vmsplice() fallback to write() case.
* NEWS: Mention the improvement.
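
For illustration only, the doubling memcpy idea described for
repeat_pattern can be sketched as below.  This is not the code in
src/yes.c: the name repeat_pattern_sketch, its signature and its return
value are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the repeat_pattern idea; not the real function.
   BUF already holds one copy of a PAT_LEN byte pattern at its start.
   Fill BUF with as many whole copies as fit in BUF_SIZE bytes.
   Each memcpy copies the already filled prefix, so the filled region
   doubles on every pass until the last, shorter copy.
   Return the number of bytes filled, always a multiple of PAT_LEN.  */
size_t
repeat_pattern_sketch (char *buf, size_t buf_size, size_t pat_len)
{
  assert (0 < pat_len && pat_len <= buf_size);

  size_t target = buf_size - buf_size % pat_len;  /* whole copies only */
  size_t filled = pat_len;

  while (filled < target)
    {
      /* Copy at most what is already filled, so that the source
         and destination ranges never overlap.  */
      size_t remaining = target - filled;
      size_t chunk = filled <= remaining ? filled : remaining;
      memcpy (buf + filled, buf, chunk);
      filled += chunk;
    }

  return filled;
}
```

A caller would store the pattern once, for example "y\n", and then call
repeat_pattern_sketch (buf, sizeof buf, 2).  The number of memcpy calls
grows with the logarithm of the buffer size rather than linearly.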

5 weeks agoall: use more consistent blank character determination
Pádraig Brady [Mon, 9 Mar 2026 22:23:12 +0000 (22:23 +0000)] 
all: use more consistent blank character determination

* src/system.h (c32issep): A new function that is essentially
iswblank() on GLIBC platforms, and iswspace() with exceptions elsewhere.
* src/expand.c: Use it instead of c32isblank().
* src/fold.c: Likewise.
* src/join.c: Likewise.
* src/numfmt.c: Likewise.
* src/unexpand.c: Likewise.
* src/uniq.c: Likewise.
* NEWS: Mention the improvement.

5 weeks agocksum: fix tagged output on 32 bit platforms
Pádraig Brady [Tue, 10 Mar 2026 14:47:25 +0000 (14:47 +0000)] 
cksum: fix tagged output on 32 bit platforms

Fix an unreleased issue due to the recent change
to using idx_t in commit v9.10-91-g02983e493.

* src/cksum.c (output_file): Cast the idx_t before passing to printf.

6 weeks agowc: improve aarch64 Neon optimization for 'wc -l'
Collin Funk [Tue, 10 Mar 2026 02:32:27 +0000 (19:32 -0700)] 
wc: improve aarch64 Neon optimization for 'wc -l'

    $ yes abcdefghijklmnopqrstuvwxyz | head -n 200000000 > input
    $ time ./src/wc-prev -l input
    200000000 input

    real 0m1.240s
    user 0m0.456s
    sys 0m0.784s
    $ time ./src/wc -l input
    200000000 input

    real 0m0.936s
    user 0m0.141s
    sys 0m0.795s

* configure.ac: Use unsigned char for the buffer to avoid potential
compiler warnings. Check for the functions being used in src/wc_neon.c
after this patch.
* src/wc_neon.c (wc_lines_neon): Use vreinterpretq_s8_u8 to convert 0xff
into -1 instead of using bitwise AND instructions to convert it into 1.
Perform the pairwise addition and lane extraction once every 8192 bytes
instead of once every 64 bytes.
Thanks to Lasse Collin for spotting this and reviewing a draft of this
patch.
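
The idea can be modelled in portable scalar C, as sketched below.  This
is not the Neon code in src/wc_neon.c and uses no intrinsics: the
function name, the 16 lane layout and the flush interval of 127 blocks
are assumptions chosen so that the 8 bit accumulators in this model
cannot overflow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum
{
  LANES = 16,        /* bytes per simulated vector; an assumption */
  MAX_BLOCKS = 127   /* blocks per flush, keeps each lane <= INT8_MAX */
};

/* Hypothetical scalar model of the technique; not the real function.
   Count the newline bytes in the LEN bytes at BUF.  */
size_t
count_newlines_sketch (const unsigned char *buf, size_t len)
{
  assert (buf != NULL || len == 0);

  size_t total = 0;
  size_t i = 0;

  while (len - i >= LANES)
    {
      int8_t acc[LANES] = { 0 };

      for (int blocks = 0; blocks < MAX_BLOCKS && len - i >= LANES;
           blocks++, i += LANES)
        for (int lane = 0; lane < LANES; lane++)
          {
            /* A vector compare yields 0xFF for a match, 0x00 otherwise.  */
            uint8_t mask = buf[i + lane] == '\n' ? 0xFF : 0x00;
            /* Viewed as a signed byte 0xFF is -1 on the usual two's
               complement platforms, so subtracting it adds 1 with no
               separate AND to turn the mask into 1.  */
            acc[lane] = (int8_t) (acc[lane] - (int8_t) mask);
          }

      /* The horizontal sum is done once per group of blocks,
         not once per block.  */
      for (int lane = 0; lane < LANES; lane++)
        total += (size_t) acc[lane];
    }

  /* Handle the tail that is shorter than one block.  */
  for (; i < len; i++)
    total += buf[i] == '\n';

  return total;
}
```

Deferring the horizontal sum is the part that corresponds to reducing
once every 8192 bytes instead of once every 64 bytes.  The flush
interval in this model is smaller only because of its 8 bit lanes.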

6 weeks agotests: expand: fix false failure on various systems
Pádraig Brady [Mon, 9 Mar 2026 21:01:27 +0000 (21:01 +0000)] 
tests: expand: fix false failure on various systems

* tests/expand/mb.sh: Use $LOCALE_FR_UTF8 rather than
hardcoding "en_US.UTF-8".
* tests/unexpand/mb.sh: Likewise.
Reported by Bruno Haible.

6 weeks agobuild: update to latest gnulib
Pádraig Brady [Mon, 9 Mar 2026 13:14:54 +0000 (13:14 +0000)] 
build: update to latest gnulib

* src/ls.c: Adjust for renamed acl permissions member.

6 weeks agomaint: remove duplicate names from THANKS
Collin Funk [Sun, 8 Mar 2026 01:08:13 +0000 (17:08 -0800)] 
maint: remove duplicate names from THANKS

* .mailmap: Prefer the most recently used email address from each commit
author.

6 weeks agomaint: prefer memset_explicit to explicit_bzero
Collin Funk [Sun, 8 Mar 2026 00:16:01 +0000 (16:16 -0800)] 
maint: prefer memset_explicit to explicit_bzero

The explicit_bzero function is a common extension, but memset_explicit
was standardized in C23. It will likely become more portable in the
future, and Gnulib provides an implementation if needed.

* bootstrap.conf (gnulib_modules): Add memset_explicit. Remove
explicit_bzero.
* gl/lib/randint.c (randint_free): Use memset_explicit instead of
explicit_bzero.
* gl/lib/randread.c (randread_free_body): Likewise.

6 weeks agoexpand,unexpand: support multi-byte input
Lukáš Zaoral [Fri, 6 Mar 2026 14:13:17 +0000 (14:13 +0000)] 
expand,unexpand: support multi-byte input

* src/expand.c: Use mbbuf to support multi-byte input.
* src/unexpand.c: Likewise.
* tests/expand/mb.sh: New multi-byte test.
* tests/unexpand/mb.sh: Likewise.
* tests/local.mk: Reference new tests.
* NEWS: Mention the improvement.

6 weeks agomaint: shred: fix typo in comment
Weixie Cui [Sat, 7 Mar 2026 02:01:17 +0000 (10:01 +0800)] 
maint: shred: fix typo in comment

* src/shred.c: Fix "then" -> "than" in comment.

6 weeks agomaint: dd: fix typo in comment
Weixie Cui [Fri, 6 Mar 2026 13:05:55 +0000 (21:05 +0800)] 
maint: dd: fix typo in comment

* src/dd.c: Fix "that that" -> "that the" in comment.

6 weeks agobuild: update gnulib submodule to latest
Collin Funk [Fri, 6 Mar 2026 09:09:45 +0000 (01:09 -0800)] 
build: update gnulib submodule to latest

6 weeks agobuild: update gnulib submodule to latest
Collin Funk [Fri, 6 Mar 2026 06:24:38 +0000 (22:24 -0800)] 
build: update gnulib submodule to latest

6 weeks agomaint: touch: reduce variable scope
Collin Funk [Thu, 5 Mar 2026 07:40:03 +0000 (23:40 -0800)] 
maint: touch: reduce variable scope

* src/touch.c (main): Declare variables where they are used instead of
at the start of the function.

6 weeks agomaint: chown,chgrp: reduce variable scope
Collin Funk [Thu, 5 Mar 2026 07:34:45 +0000 (23:34 -0800)] 
maint: chown,chgrp: reduce variable scope

* src/chown-core.c (describe_change, restricted_chown)
(change_file_owner, chown_files): Declare variables where they are used
instead of at the start of the function.
* src/chown.c (main): Likewise.

6 weeks agoinstall: allow the combination of --compare and --preserve-timestamps
Collin Funk [Sun, 1 Mar 2026 23:31:28 +0000 (15:31 -0800)] 
install: allow the combination of --compare and --preserve-timestamps

* NEWS: Mention the improvement.
* src/install.c (enum copy_status): New type to let the caller know if
the copy was performed or skipped.
(copy_file): Return the new type instead of bool. Reduce variable scope.
(install_file_in_file): Only strip the file if the copy was
performed. Update the timestamps if the copy was skipped.
(main): Don't error when --compare and --preserve-timestamps are
combined.
* tests/install/install-C.sh: Add some test cases.

6 weeks agocksum: use more defensive escaping for --check
Pádraig Brady [Sat, 28 Feb 2026 11:09:26 +0000 (11:09 +0000)] 
cksum: use more defensive escaping for --check

cksum --check is often the first interaction
users have with possibly untrusted downloads, so we should try
to be as defensive as possible when processing it.

Specifically we currently only escape \n characters in file names
presented in checksum files being parsed with cksum --check.
This gives some possibility of dumping arbitrary data to the terminal
when checking downloads from an untrusted source.
This change gives these advantages:

  1. Avoids dumping arbitrary data to vulnerable terminals
  2. Avoids visual deception with ansi codes hiding checksum failures
  3. More secure if users copy and paste file names from --check output
  4. Simplifies programmatic parsing

Note this changes programmatic parsing, but given the original
format was so awkward to parse, I expect that's extremely rare.
I was not able to find an example in the wild at least.
To parse the new format from the shell, you can do something like:

  cksum -c checksums | while IFS= read -r line; do
    case $line in
      *': FAILED')
        filename=$(eval "printf '%s' ${line%: FAILED}")
        cp -v "$filename" /quarantine
        ;;
    esac
  done

This change also slightly reduces the size of the sum(1) utility.
This change also applies to md5sum, sha*sum, and b2sum.

* src/cksum.c (digest_check): Call quotef() instead of
cksum(1) specific quoting.
* tests/cksum/md5sum-bsd.sh: Adjust accordingly.
* doc/coreutils.texi (cksum general options): Describe the
shell quoting used for problematic file names.
* NEWS: Mention the change in behavior.
Reported by: Aaron Rainbolt

6 weeks agomaint: tests: refactor uses of bad_unicode()
Pádraig Brady [Wed, 4 Mar 2026 17:57:54 +0000 (17:57 +0000)] 
maint: tests: refactor uses of bad_unicode()

* init.cfg: Use 0xFF rather than 0xC3 everywhere.
* tests/fold/fold-characters.sh: Reuse bad_unicode().
* tests/tac/tac-locale.sh: Likewise.

6 weeks agofold: fix output truncation with 0xFF bytes in input
Pádraig Brady [Wed, 4 Mar 2026 16:56:48 +0000 (16:56 +0000)] 
fold: fix output truncation with 0xFF bytes in input

On signed char platforms, 0xFF was converted to -1
which matches MBBUF_EOF, causing fold to stop processing.

* NEWS: Mention the bug fix.
* gl/lib/mbbuf.h: Avoid sign extension on signed char platforms.
* tests/fold/fold-characters.sh: Adjust test case.
Reported at https://src.fedoraproject.org/rpms/coreutils/pull-request/20
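
A minimal sketch of the sign extension issue follows.  It is not the
code from gl/lib/mbbuf.h: EOF_SKETCH stands in for MBBUF_EOF, and the
two helper names are hypothetical.

```c
#include <assert.h>

enum { EOF_SKETCH = -1 };  /* hypothetical stand-in for MBBUF_EOF */

/* Widening a signed char sign extends it, so the byte 0xFF becomes -1
   and can no longer be told apart from the end of input marker.  */
int
widen_naive (signed char c)
{
  return c;
}

/* Converting through unsigned char first keeps every byte value in
   the range 0..255, which can never collide with EOF_SKETCH.  */
int
widen_safe (signed char c)
{
  int v = (unsigned char) c;
  assert (0 <= v && v <= 255);
  return v;
}
```

Here widen_naive ((signed char) -1) compares equal to EOF_SKETCH, while
widen_safe returns 255 for the same byte.  This is the same reason the
C library's getc returns the byte as an unsigned char converted to int.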

6 weeks agotests: date: add timezone conversion test
Sylvestre Ledru [Sat, 14 Feb 2026 19:08:12 +0000 (20:08 +0100)] 
tests: date: add timezone conversion test

* tests/date/date.pl: Add the test case.
Add test case for https://github.com/uutils/coreutils/issues/10800
to verify `date -u -d '10:30 UTC-05'` converts to 15:30 UTC.

6 weeks agotests: date: add edge cases for modifiers
Sylvestre Ledru [Fri, 27 Feb 2026 08:16:00 +0000 (09:16 +0100)] 
tests: date: add edge cases for modifiers

* tests/date/date.pl: Add the test case.
Add test cases for https://github.com/uutils/coreutils/issues/10957

6 weeks agotests: cut: add test case for newline delimiter with -s flag
Sylvestre Ledru [Wed, 4 Mar 2026 10:57:10 +0000 (11:57 +0100)] 
tests: cut: add test case for newline delimiter with -s flag

* tests/cut/cut.pl: Add a new test case.
https://github.com/coreutils/coreutils/pull/211