This patch set updates cut(1) to be multi-byte aware.
It also reduces interface divergence across implementations.
Multi-byte awareness was added to the existing -c, -n, and -d options.
Also considered for compatibility are the -w, -F, and -O options,
as these are present on at least two other common implementations.
= Interface / New functionality =
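| option | macOS | i18n | uutils | Toybox | Busybox | GNU |
| ------ | ----- | ---- | ------ | ------ | ------- | --- |
| -c     | x     | x    | x      | x      | x       | x   |
| -n     | x     | x    |        |        |         | x   |
| -w     | x     |      | x      |        |         | x   |
| -F     |       |      |        | x      | x       | x   |
| -O     |       |      |        | x      | x       | x   |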
-c is needed anyway, as it is specified by all, including POSIX.
-n is also needed, as it is specified by i18n/macOS/POSIX.
-w is somewhat less important, but seeing as it's
on two other common platforms (and its functionality is
provided on two more), providing it is worthwhile for compat.
-F and -O are really just aliases for other options,
so they are trivial to add, and probably worthwhile for compatibility.
Interface / functionality notes:
There is a slight divergence between -n implementations.
There was already a difference between FreeBSD and i18n, and
we've aligned with the more sensible FreeBSD implementation.
Note the i18n -n implementation is otherwise buggy in any case,
so I doubt this will be a practical compatibility concern.
Moreover, -n is specified by POSIX, and that specification matches
the FreeBSD behavior.
Specifically, our -n will not output a character unless the
byte range encompasses _the end_ of the multi-byte character.
I.e., the end of the -b range is a limit that is never exceeded, which
ensures we don't output the same character from separate cut
invocations that do not have overlapping byte ranges.
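
The rule above can be modelled in a few lines. This is only a sketch
in Python, not the C code from the patch set: the function name
cut_b_n and its 1-based inclusive range arguments are hypothetical,
and valid UTF-8 input is assumed.

```python
def cut_b_n(line: bytes, first: int, last: int) -> bytes:
    """Model of `cut -b first-last -n` for one line of valid UTF-8.

    A character is emitted only if its final byte falls within the
    1-based, inclusive byte range [first, last].
    """
    out = bytearray()
    pos = 0  # number of bytes consumed so far
    for ch in line.decode("utf-8"):
        enc = ch.encode("utf-8")
        pos += len(enc)  # pos is now the 1-based index of ch's last byte
        if first <= pos <= last:
            out += enc
    return bytes(out)


line = "aéb".encode("utf-8")  # bytes: 61 c3 a9 62
print(cut_b_n(line, 1, 2))    # é ends at byte 3, so only the 'a' is output
print(cut_b_n(line, 3, 4))    # é and b are output here instead
```

With the non-overlapping ranges 1-2 and 3-4, each character is output
exactly once, which is the property described above.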
-d <regex> from toybox is not implemented.
That's edge case functionality IMHO and not well suited to cut(1).
This functionality is supported by awk, and regex functionality
is best restricted to awk I think.
cut is a significant part of the i18n patch, so it will be good
to avoid that downstream divergence. Unfortunately there were
no tests with the cut i18n implementation.
Note the i18n cut implementation used fread() and so was
not responsive to new data < BUFSIZ, whereas this implementation
uses read() and thus is responsive to data as it becomes available.
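
The difference in responsiveness can be shown with a pipe. The
following Python sketch is illustrative only: BUFSIZE is a stand-in
for the C library's BUFSIZ (whose real value is platform specific),
and the helpers read_available and read_full are hypothetical names
that mimic read(2) and a full-buffer fread(3) respectively.

```python
import os

BUFSIZE = 8192  # assumed stand-in for BUFSIZ


def read_available(fd: int) -> bytes:
    """Like read(2): return as soon as any data is available."""
    return os.read(fd, BUFSIZE)


def read_full(fd: int) -> bytes:
    """Like fread(buf, 1, BUFSIZE, fp): block until BUFSIZE bytes or EOF."""
    chunks = []
    remaining = BUFSIZE
    while remaining > 0:
        chunk = os.read(fd, remaining)
        if not chunk:  # EOF
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


r, w = os.pipe()
os.write(w, b"a,b\n")     # a short line arrives; the writer stays open
print(read_available(r))  # returns the short line without waiting for more
os.write(w, b"c,d\n")
os.close(w)               # read_full() can only return once EOF is seen
print(read_full(r))
os.close(r)
```

Calling read_full() while the writer is still open and has sent less
than BUFSIZE bytes would block, which is the unresponsive behavior
noted above.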
= Performance =
General performance notes:
We prefer byte searching (with -d) as that can be much faster
than character by character processing, and it's supported
on single byte and UTF-8 charsets. We also use byte searching
with -w on uni-byte locales.
This was seen to give a performance increase of over 100x compared
to the i18n patch.
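
Byte searching is safe in UTF-8 because the encoding of one character
never appears inside the encoding of another. A small Python sketch
of the idea follows; the function names are hypothetical, and missing
fields simply yield an empty result here, which is a simplification
of what cut(1) does.

```python
def field_by_bytes(line: bytes, delim: bytes, n: int) -> bytes:
    """Return the n-th (1-based) field using a plain byte search."""
    fields = line.split(delim)
    return fields[n - 1] if n <= len(fields) else b""


def field_by_chars(line: bytes, delim: str, n: int) -> bytes:
    """Same result, but decoding every character first."""
    fields = line.decode("utf-8").split(delim)
    return fields[n - 1].encode("utf-8") if n <= len(fields) else b""


line = "éé,áá,çç".encode("utf-8")
# An ASCII delimiter and a multi-byte delimiter both work bytewise.
assert field_by_bytes(line, b",", 2) == field_by_chars(line, ",", 2)
assert (field_by_bytes(line, "ç".encode("utf-8"), 1)
        == field_by_chars(line, "ç", 1))
print(field_by_bytes(line, b",", 2).decode("utf-8"))
```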
Where we do use per character processing, we avoid conversion to
wide char when processing ASCII data (mcel provides this optimization).
This was seen to give a 14x performance increase over the i18n patch.
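
The ASCII short circuit can be pictured as follows. This is a Python
sketch of the idea only, not mcel's actual C implementation: the names
utf8_len and count_chars are hypothetical, valid UTF-8 is assumed, and
the decode() call merely stands in for the wide char conversion.

```python
def utf8_len(lead: int) -> int:
    """Length in bytes of a multi-byte UTF-8 sequence, from its lead byte."""
    if lead >= 0xF0:
        return 4
    if lead >= 0xE0:
        return 3
    return 2


def count_chars(buf: bytes) -> int:
    """Count characters, skipping the conversion step for ASCII bytes."""
    count = 0
    i = 0
    while i < len(buf):
        if buf[i] < 0x80:
            # Fast path: an ASCII byte is always one complete character.
            i += 1
        else:
            # Slow path: convert the whole multi-byte sequence.
            n = utf8_len(buf[i])
            buf[i:i + n].decode("utf-8")
            i += n
        count += 1
    return count


print(count_chars(b"eeeaae"))
print(count_chars("éééááé".encode("utf-8")))
```

Both calls count six characters, but only the second one ever takes
the slow path.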
We prefer memchr() and strstr() as these are tuned for specific
platforms on glibc, even if memchr2() or memmem()
are algorithmically better.
We maintain the important memory behavior
of only buffering when necessary.
Performance testing:
There are _lots_ of combinations and optimization opportunities.
I performance tested this patch set with the following setup:
$ yes | head -n10M > sl.in
$ yes $(yes eeeaae | head -n10K | paste -s -d,) | head -n10K > ll.in
$ yes $(yes eeeaae | head -n9 | paste -s -d,) | head -n1M > as.in
$ yes $(yes éééááé | head -n9 | paste -s -d,) | head -n1M \
> mb.in
$ for type in sl ll as mb; do
    cat $type.in >/dev/null;  # prime the page cache
    for imp in '' src/; do  # '' maps to the system i18n ver on Fedora
      echo ============ "${imp:-i18n}" $type ==============;
      for d in -d, -dc -d, -dç -w -b -c; do
        fields='-f1 -f10 -f100'
        test "$d" = "-b" && { fields='-b1 -b10 -b100'; d=''; }
        test "$d" = "-c" && { fields='-c1 -c10 -c100'; d=''; }
        for f in $fields; do
          for loc in C C.UTF-8; do
            # Skip -b for UTF-8 as it is no different
            test "$loc" = C.UTF-8 && echo "$f" | grep -q -- -b \
              && continue
            # Skip multi-byte delimiters for C as they are not allowed
            test "$loc" = C && test $(echo -n "$d" | wc -c) -ge 4 \
              && continue
            # Only benchmark option combinations the implementation accepts
            LC_ALL=$loc ${imp}cut $f $d /dev/null 2>/dev/null &&
              hyperfine -m2 -M4 \
                "LC_ALL=$loc ${imp}cut $f $d $type.in >/dev/null" ||
              printf 'Benchmark 1: %s\n unsupported\n\n' \
                "LC_ALL=$loc ${imp}cut $f $d $type.in >/dev/null"
          done;
        done;
      done;
    done;
  done
After a little post-processing of the results, we get:
-- cut-i18n
| command | sl | ll | as | mb |
| --------------- | -------- | -------- | -------- | -------- |
| C -f1 -d, | 66.3 ms | 1.605 s | 145.9 ms | 366.4 ms |
| UTF8 -f1 -d, | 65.8 ms | 1.593 s | 145.8 ms | 370.0 ms |
| C -f10 -d, | 301.4 ms | 1.590 s | 161.8 ms | 126.7 ms |
| UTF8 -f10 -d, | 303.5 ms | 1.599 s | 161.8 ms | 124.6 ms |
| C -f100 -d, | 300.6 ms | 1.596 s | 162.1 ms | 126.7 ms |
| UTF8 -f100 -d, | 301.3 ms | 1.595 s | 162.0 ms | 124.9 ms |
| C -f1 -dc | 66.6 ms | 1.845 s | 179.1 ms | 365.7 ms |
| UTF8 -f1 -dc | 73.8 ms | 1.878 s | 179.1 ms | 363.1 ms |
| C -f10 -dc | 300.7 ms | 349.8 ms | 76.0 ms | 125.3 ms |
| UTF8 -f10 -dc | 300.4 ms | 347.2 ms | 75.7 ms | 124.8 ms |
| C -f100 -dc | 300.1 ms | 348.1 ms | 76.5 ms | 125.5 ms |
| UTF8 -f100 -dc | 300.8 ms | 348.7 ms | 76.4 ms | 125.8 ms |
| UTF8 -f1 -d, | 563.5 ms | 21.775 s | 1.963 s | 1.665 s |
| UTF8 -f10 -d, | 833.6 ms | 20.504 s | 2.022 s | 1.612 s |
| UTF8 -f100 -d, | 825.2 ms | 20.448 s | 2.009 s | 1.616 s |
| UTF8 -f1 -dç | 563.7 ms | 21.827 s | 1.964 s | 2.319 s |
| UTF8 -f10 -dç | 825.3 ms | 21.713 s | 2.011 s | 2.248 s |
| UTF8 -f100 -dç | 831.6 ms | 20.505 s | 2.019 s | 2.276 s |
| C -f1 -w | - | - | - | - |
| UTF8 -f1 -w | - | - | - | - |
| C -f10 -w | - | - | - | - |
| UTF8 -f10 -w | - | - | - | - |
| C -f100 -w | - | - | - | - |
| UTF8 -f100 -w | - | - | - | - |
| C -b1 | 60.8 ms | 1.596 s | 154.8 ms | 313.7 ms |
| C -b10 | 51.6 ms | 1.594 s | 154.3 ms | 310.8 ms |
| C -b100 | 51.4 ms | 1.594 s | 153.0 ms | 312.2 ms |
| C -c1 | 60.7 ms | 1.597 s | 153.8 ms | 313.0 ms |
| UTF8 -c1 | 526.5 ms | 14.662 s | 1.362 s | 1.573 s |
| C -c10 | 51.8 ms | 1.591 s | 153.3 ms | 311.4 ms |
| UTF8 -c10 | 436.9 ms | 14.450 s | 1.336 s | 1.563 s |
| C -c100 | 51.0 ms | 1.593 s | 152.7 ms | 313.2 ms |
| UTF8 -c100 | 426.7 ms | 14.429 s | 1.344 s | 1.551 s |
-- src/cut
| command | sl | ll | as | mb |
| --------------- | -------- | -------- | -------- | -------- |
| C -f1 -d, | 4.6 ms | 108.2 ms | 45.4 ms | 24.2 ms |
| UTF8 -f1 -d, | 4.8 ms | 108.4 ms | 45.4 ms | 24.5 ms |
| C -f10 -d, | 4.5 ms | 109.3 ms | 123.7 ms | 24.3 ms |
| UTF8 -f10 -d, | 4.9 ms | 114.1 ms | 124.1 ms | 24.5 ms |
| C -f100 -d, | 4.7 ms | 119.2 ms | 124.1 ms | 24.5 ms |
| UTF8 -f100 -d, | 4.8 ms | 120.0 ms | 125.1 ms | 24.5 ms |
| C -f1 -dc | 4.4 ms | 120.5 ms | 11.9 ms | 24.1 ms |
| UTF8 -f1 -dc | 4.9 ms | 120.5 ms | 12.1 ms | 24.6 ms |
| C -f10 -dc | 4.7 ms | 125.3 ms | 11.8 ms | 24.1 ms |
| UTF8 -f10 -dc | 4.8 ms | 126.7 ms | 12.0 ms | 24.4 ms |
| C -f100 -dc | 4.6 ms | 127.0 ms | 11.9 ms | 24.3 ms |
| UTF8 -f100 -dc | 4.7 ms | 126.4 ms | 12.0 ms | 24.4 ms |
| UTF8 -f1 -d, | 6.0 ms | 169.4 ms | 15.6 ms | 67.4 ms |
| UTF8 -f10 -d, | 6.1 ms | 173.9 ms | 15.6 ms | 237.2 ms |
| UTF8 -f100 -d, | 6.1 ms | 174.0 ms | 15.6 ms | 237.8 ms |
| UTF8 -f1 -dç | 6.3 ms | 170.8 ms | 15.7 ms | 32.2 ms |
| UTF8 -f10 -dç | 6.0 ms | 172.9 ms | 15.9 ms | 32.1 ms |
| UTF8 -f100 -dç | 6.7 ms | 173.1 ms | 15.5 ms | 32.3 ms |
| C -f1 -w | 159.6 ms | 170.1 ms | 69.1 ms | 98.9 ms |
| UTF8 -f1 -w | 128.1 ms | 2.525 s | 246.5 ms | 1.086 s |
| C -f10 -w | 183.3 ms | 199.2 ms | 74.6 ms | 105.0 ms |
| UTF8 -f10 -w | 130.3 ms | 2.659 s | 276.5 ms | 1.099 s |
| C -f100 -w | 183.8 ms | 202.5 ms | 74.1 ms | 103.6 ms |
| UTF8 -f100 -w | 130.1 ms | 2.663 s | 276.6 ms | 1.097 s |
| C -b1 | 65.0 ms | 110.2 ms | 22.4 ms | 35.6 ms |
| C -b10 | 48.7 ms | 109.6 ms | 24.2 ms | 36.7 ms |
| C -b100 | 48.7 ms | 110.6 ms | 19.0 ms | 36.6 ms |
| C -c1 | 65.8 ms | 109.5 ms | 22.4 ms | 35.6 ms |
| UTF8 -c1 | 63.2 ms | 1.130 s | 116.9 ms | 610.2 ms |
| C -c10 | 48.7 ms | 109.8 ms | 24.3 ms | 36.8 ms |
| UTF8 -c10 | 39.7 ms | 1.133 s | 118.7 ms | 610.0 ms |
| C -c100 | 48.3 ms | 110.7 ms | 18.9 ms | 36.7 ms |
| UTF8 -c100 | 39.4 ms | 1.141 s | 115.0 ms | 598.8 ms |
In summary, compared to the i18n patch we're now at least as fast
in all cases, and much faster in most cases.
We can see the -f byte searching performing well,
being 120x faster in the no matching delimiter case,
to at least 3x faster in the matching delimiter case.
When we resort to per character processing we also compare well,
being 14x faster in the ASCII processing case
(due to mcel short-circuiting the wide char conversion).
Note the mb.in results above also show a 2x win
in per-character processing cases, but the i18n patch would also
have picked up that win, as it was achieved separately from this patch set:
https://lists.gnu.org/r/coreutils/2026-03/msg00117.html
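
For reference, speedup ratios can be re-derived from the two tables.
The sketch below uses Python and a hypothetical helper named speedup;
the timings are copied by hand from the ll.in column of the cut-i18n
and src/cut tables above, converted to seconds.

```python
def speedup(old_s: float, new_s: float) -> float:
    """How many times faster the new timing is than the old one."""
    return old_s / new_s


# (row label, cut-i18n seconds, src/cut seconds), ll.in column
rows = [
    ("UTF8 -f1 -dç", 21.827, 0.1708),  # delimiter never matches in ll.in
    ("C -f1 -d,", 1.605, 0.1082),      # delimiter matches
    ("UTF8 -c1", 14.662, 1.130),       # per character processing of ASCII
]
for label, old, new in rows:
    print(f"{label:14} {speedup(old, new):6.1f}x")
```

Other rows and columns give different ratios, so these three are
examples rather than a full summary of the tables.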
** New Features
+ 'cut' now supports multi-byte input and delimiters. Consequently
+ the -c option is now honored, and no longer an alias for -b, and
+ the -n option is now honored, and no longer ignored.
+ Also the -d option supports multi-byte delimiters.
+
+ 'cut' adds new options for better compatibility:
+ The -w,--whitespace-delimited option was added to support blank aligned fields
+ and for better compatibility with FreeBSD/macOS.
+ The -O option was added as an alias for the --output-delimiter option,
+ for better compatibility with busybox/toybox.
+ The -F option was added as an alias for -w -O ' '
+ for better compatibility with busybox/toybox.
+
'date --date' now parses dot delimited dd.mm.yy format common in Europe.
This is in addition to the already supported mm/dd/yy and yy-mm-dd formats.