From 00381d22d454ab8a2b095658bf55416b4695dc8c Mon Sep 17 00:00:00 2001
From: Assaf Gordon <assafgordon@gmail.com>
Date: Mon, 27 Feb 2017 02:25:31 -0500
Subject: [PATCH] doc: expand 'join' info section

* doc/coreutils.texi (join invocation): Expand section to
add examples and more details.
Suggested by Dan Jacobson in https://bugs.gnu.org/25870
---
 doc/coreutils.texi | 424 +++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 389 insertions(+), 35 deletions(-)

diff --git a/doc/coreutils.texi b/doc/coreutils.texi
index 3a8517cf5b..a649c088b8 100644
--- a/doc/coreutils.texi
+++ b/doc/coreutils.texi
@@ -6111,48 +6111,27 @@ Either @var{file1} or @var{file2} (but not both) can be @samp{-},
 meaning standard input.  @var{file1} and @var{file2} should be
 sorted on the join fields.
 
-@vindex LC_COLLATE
-Normally, the sort order is that of the
-collating sequence specified by the @env{LC_COLLATE} locale.  Unless
-the @option{-t} option is given, the sort comparison ignores blanks at
-the start of the join field, as in @code{sort -b}.  If the
-@option{--ignore-case} option is given, the sort comparison ignores
-the case of characters in the join field, as in @code{sort -f}.
-
-The @command{sort} and @command{join} commands should use consistent
-locales and options if the output of @command{sort} is fed to
-@command{join}.  You can use a command like @samp{sort -k 1b,1} to
-sort a file on its default join field, but if you select a non-default
-locale, join field, separator, or comparison options, then you should
-do so consistently between @command{join} and @command{sort}.
-If @samp{join -t ''} is specified then the whole line is considered which
-matches the default operation of sort.
-
-If the input has no unpairable lines, a GNU extension is
-available; the sort order can be any order that considers two fields
-to be equal if and only if the sort comparison described above
-considers them to be equal.  For example:
-
 @example
+@group
 $ cat file1
-a a1
-c c1
-b b1
+a 1
+b 2
+e 5
+
 $ cat file2
-a a2
-c c2
-b b2
+a X
+e Y
+f Z
+
 $ join file1 file2
-a a1 a2
-c c1 c2
-b b1 b2
+a 1 X
+e 5 Y
+@end group
 @end example
 
-@set JOIN_COMMAND
-@checkOrderOption{join}
-@clear JOIN_COMMAND
 
-The defaults are:
+@noindent
+@command{join}'s default behavior (when no options are given):
 @itemize
 @item the join field is the first field in each line;
 @item fields in the input are separated by one or more blanks, with leading
@@ -6162,6 +6141,18 @@ blanks on the line ignored;
 fields from @var{file1}, then the remaining fields from @var{file2}.
 @end itemize
 
+
+@menu
+* General options in join::      Options which affect general program behavior.
+* Sorting files for join::       Using @command{sort} before @command{join}.
+* Working with fields::          Joining on different fields.
+* Paired and unpaired lines::    Controlling @command{join}'s field matching.
+* Header lines::                 Working with header lines in files.
+* Set operations::               Union, Intersection and Difference of files.
+@end menu
+
+@node General options in join
+@subsection General options
 The program accepts the following options.  Also see @ref{Common options}.
 
 @table @samp
@@ -6262,6 +6253,369 @@ Print a line for each unpairable line in file @var{file-number}
 
 @exitstatus
 
+@set JOIN_COMMAND
+@checkOrderOption{join}
+@clear JOIN_COMMAND
+
+
+
+@node Sorting files for join
+@subsection Pre-sorting
+
+@command{join} requires sorted input files.  Each input file should be
+sorted on the key (the field, or column, number) used by
+@command{join}.  The recommended sorting command is @samp{sort -k 1b,1}
+(assuming the desired key is in the first column).
+
+@noindent Typical usage:
+@example
+@group
+$ sort -k 1b,1 file1 > file1.sorted
+$ sort -k 1b,1 file2 > file2.sorted
+$ join file1.sorted file2.sorted > file3
+@end group
+@end example
+
+@vindex LC_COLLATE
+Normally, the sort order is that of the
+collating sequence specified by the @env{LC_COLLATE} locale.  Unless
+the @option{-t} option is given, the sort comparison ignores blanks at
+the start of the join field, as in @code{sort -b}.  If the
+@option{--ignore-case} option is given, the sort comparison ignores
+the case of characters in the join field, as in @code{sort -f}:
+
+@example
+@group
+$ sort -k 1bf,1 file1 > file1.sorted
+$ sort -k 1bf,1 file2 > file2.sorted
+$ join --ignore-case file1.sorted file2.sorted > file3
+@end group
+@end example
+
+The @command{sort} and @command{join} commands should use consistent
+locales and options if the output of @command{sort} is fed to
+@command{join}.  You can use a command like @samp{sort -k 1b,1} to
+sort a file on its default join field, but if you select a non-default
+locale, join field, separator, or comparison options, then you should
+do so consistently between @command{join} and @command{sort}.
+
+@noindent To avoid any locale-related issues, it is recommended to use the
+@samp{C} locale for both commands:
+
+@example
+@group
+$ LC_ALL=C sort -k 1b,1 file1 > file1.sorted
+$ LC_ALL=C sort -k 1b,1 file2 > file2.sorted
+$ LC_ALL=C join file1.sorted file2.sorted > file3
+@end group
+@end example
+
+
+@node Working with fields
+@subsection Working with fields
+
+Use @option{-1} and @option{-2} to set the key field for each input file.
+Ensure the preceding @command{sort} commands operated on the same fields.
+
+@noindent
+The following example joins two files, using the values from the seventh
+field of the first file and the third field of the second file:
+
+@example
+@group
+$ sort -k 7b,7 file1 > file1.sorted
+$ sort -k 3b,3 file2 > file2.sorted
+$ join -1 7 -2 3 file1.sorted file2.sorted > file3
+@end group
+@end example
+
+@noindent
+If the field number is the same for both files, use @option{-j}:
+
+@example
+@group
+$ sort -k4b,4 file1 > file1.sorted
+$ sort -k4b,4 file2 > file2.sorted
+$ join -j4    file1.sorted file2.sorted > file3
+@end group
+@end example
+
+@noindent
+Both @command{sort} and @command{join} operate on whitespace-delimited
+fields. To specify a different delimiter, use @option{-t} in @emph{both}:
+
+@example
+@group
+$ sort -t, -k3b,3 file1 > file1.sorted
+$ sort -t, -k3b,3 file2 > file2.sorted
+$ join -t, -j3    file1.sorted file2.sorted > file3
+@end group
+@end example
+
+@noindent
+To specify a tab (@sc{ascii} 0x09) character instead of whitespace, use
+@footnote{The @code{$'\t'} syntax is supported in most modern shells.
+For older shells, use a literal tab.}:
+
+@example
+@group
+$ sort -t$'\t' -k3b,3 file1 > file1.sorted
+$ sort -t$'\t' -k3b,3 file2 > file2.sorted
+$ join -t$'\t' -j3    file1.sorted file2.sorted > file3
+@end group
+@end example
+
+
+@noindent
+If @samp{join -t ''} is specified, the whole line is used as the join field,
+which matches the default operation of @command{sort}:
+
+@example
+@group
+$ sort file1 > file1.sorted
+$ sort file2 > file2.sorted
+$ join -t'' file1.sorted file2.sorted > file3
+@end group
+@end example
+
+
+@node Paired and unpaired lines
+@subsection Controlling @command{join}'s field matching
+
+In this section the @command{sort} commands are omitted for brevity.
+Sorting the files before joining is still required.
+
+@command{join}'s default behavior is to print only lines common to
+both input files. Use @option{-a} and @option{-v} to print unpairable lines
+from one or both files.
+
+@noindent
+All examples below use the following two (pre-sorted) input files:
+
+@multitable @columnfractions .5 .5
+@item
+@example
+$ cat file1
+a 1
+b 2
+@end example
+
+@tab
+@example
+$ cat file2
+a A
+c C
+@end example
+@end multitable
+
+
+@c TODO: Find better column widths that work for both HTML and PDF
+@c       and disable indentation of @example.
+@multitable @columnfractions 0.5 0.5
+
+@headitem Command @tab Outcome
+
+
+@item
+@example
+$ join file1 file2
+a 1 A
+@end example
+@tab
+common lines
+(@emph{intersection})
+
+
+
+@item
+@example
+$ join -a 1 file1 file2
+a 1 A
+b 2
+@end example
+@tab
+common lines @emph{and} unpaired
+lines from the first file
+
+
+@item
+@example
+$ join -a 2 file1 file2
+a 1 A
+c C
+@end example
+@tab
+common lines @emph{and} unpaired lines from the second file
+
+
+@item
+@example
+$ join -a 1 -a 2 file1 file2
+a 1 A
+b 2
+c C
+@end example
+@tab
+all lines (paired and unpaired) from both files
+(@emph{union}).
+@*
+See the note below regarding @option{-o auto}.
+
+
+@item
+@example
+$ join -v 1 file1 file2
+b 2
+@end example
+@tab
+unpaired lines from the first file
+(@emph{difference})
+
+
+@item
+@example
+$ join -v 2 file1 file2
+c C
+@end example
+@tab
+unpaired lines from the second file
+(@emph{difference})
+
+
+@item
+@example
+$ join -v 1 -v 2 file1 file2
+b 2
+c C
+@end example
+@tab
+unpaired lines from both files, omitting common lines
+(@emph{symmetric difference}).
+
+
+@end multitable
+
+@noindent
+The @option{-o auto -e X} options are useful when dealing with unpaired lines.
+The following example prints all lines (common and unpaired) from both files.
+Without @option{-o auto} it is not easy to discern which fields originate from
+which file:
+
+@example
+$ join -a 1 -a 2 file1 file2
+a 1 A
+b 2
+c C
+
+$ join -o auto -e X -a 1 -a 2 file1 file2
+a 1 A
+b 2 X
+c X C
+@end example
+
+
+If the input has no unpairable lines, a GNU extension is
+available; the sort order can be any order that considers two fields
+to be equal if and only if the sort comparison described above
+considers them to be equal.  For example:
+
+@example
+@group
+$ cat file1
+a a1
+c c1
+b b1
+
+$ cat file2
+a a2
+c c2
+b b2
+$ join file1 file2
+a a1 a2
+c c1 c2
+b b1 b2
+@end group
+@end example
+
+
+@node Header lines
+@subsection Header lines
+
+The @option{--header} option can be used when the files to join
+have a header line which is not sorted:
+
+@example
+@group
+$ cat file1
+Name     Age
+Alice    25
+Charlie  34
+
+$ cat file2
+Name   Country
+Alice  France
+Bob    Spain
+
+$ join --header -o auto -e NA -a1 -a2 file1 file2
+Name     Age   Country
+Alice    25    France
+Bob      NA    Spain
+Charlie  34    NA
+@end group
+@end example
+
+
+To sort a file with a header line, use @sc{GNU} @command{sed -u}.
+The following example sorts the files but keeps the first line of each
+file in place:
+
+@example
+@group
+$ ( sed -u 1q ; sort -k1b,1 ) < file1 > file1.sorted
+$ ( sed -u 1q ; sort -k1b,1 ) < file2 > file2.sorted
+$ join --header -o auto -e NA -a1 -a2 file1.sorted file2.sorted > file3
+@end group
+@end example
+
+@node Set operations
+@subsection Union, Intersection and Difference of files
+
+Combine @command{sort}, @command{uniq} and @command{join} to
+perform the equivalent of set operations on files:
+
+@c From https://www.pixelbeat.org/cmdline.html#sets
+@multitable @columnfractions 0.5 0.5
+@headitem Command @tab Outcome
+@item @code{sort -u file1 file2}
+@tab Union of unsorted files
+
+@item @code{sort file1 file2 | uniq -d}
+@tab Intersection of unsorted files
+
+@item @code{sort file1 file1 file2 | uniq -u}
+@tab Difference of unsorted files
+
+@item @code{sort file1 file2 | uniq -u}
+@tab Symmetric Difference of unsorted files
+
+@item @code{join -t'' -a1 -a2 file1 file2}
+@tab Union of sorted files
+
+@item @code{join -t'' file1 file2}
+@tab Intersection of sorted files
+
+@item @code{join -t'' -v2 file1 file2}
+@tab Difference of sorted files
+
+@item @code{join -t'' -v1 -v2 file1 file2}
+@tab Symmetric Difference of sorted files
+
+@end multitable
+
+All examples above operate on entire lines and not on specific fields:
+@command{sort} without @option{-k} and @command{join -t''} both consider
+entire lines as the key.
+
 
 @node Operating on characters
 @chapter Operating on characters
-- 
2.47.3