'\" t
.\" From Henry Spencer's regex package (as found in the apache
.\" distribution). The package carries the following copyright:
.\"
.\" Copyright 1992, 1993, 1994 Henry Spencer. All rights reserved.
.\" %%%LICENSE_START(MISC)
.\" This software is not subject to any license of the American Telephone
.\" and Telegraph Company or of the Regents of the University of California.
.\"
.\" Permission is granted to anyone to use this software for any purpose
.\" on any computer system, and to alter it and redistribute it, subject
.\" to the following restrictions:
.\"
.\" 1. The author is not responsible for the consequences of use of this
.\" software, no matter how awful, even if they arise from flaws in it.
.\"
.\" 2. The origin of this software must not be misrepresented, either by
.\" explicit claim or by omission. Since few users ever read sources,
.\" credits must appear in the documentation.
.\"
.\" 3. Altered versions must be plainly marked as such, and must not be
.\" misrepresented as being the original software. Since few users
.\" ever read sources, credits must appear in the documentation.
.\"
.\" 4. This notice may not be removed or altered.
.\" %%%LICENSE_END
.\"
.\" In order to comply with `credits must appear in the documentation'
.\" I added an AUTHOR paragraph below - aeb.
.\"
.\" In the default nroff environment there is no dagger \(dg.
.\"
.\" 2005-05-11 Removed discussion of `[[:<:]]' and `[[:>:]]', which
.\" appear not to be in the glibc implementation of regcomp
.\"
.ie t .ds dg \(dg
.el .ds dg (!)
.TH regex 7 (date) "Linux man-pages (unreleased)"
.SH NAME
regex \- POSIX.2 regular expressions
.SH DESCRIPTION
Regular expressions ("RE"s),
as defined in POSIX.2, come in two forms:
modern REs (roughly those of
.BR egrep (1);
POSIX.2 calls these "extended" REs)
and obsolete REs (roughly those of
.BR ed (1);
POSIX.2 "basic" REs).
Obsolete REs mostly exist for backward compatibility in some old programs;
they will be discussed at the end.
POSIX.2 leaves some aspects of RE syntax and semantics open;
"\*(dg" marks decisions on these aspects that
may not be fully portable to other POSIX.2 implementations.
.P
A (modern) RE is one\*(dg or more nonempty\*(dg \fIbranches\fR,
separated by \[aq]|\[aq].
It matches anything that matches one of the branches.
.P
A branch is one\*(dg or more \fIpieces\fR, concatenated.
It matches a match for the first, followed by a match for the second,
and so on.
.P
A piece is an \fIatom\fR possibly followed
by a single\*(dg \[aq]*\[aq], \[aq]+\[aq], \[aq]?\[aq], or \fIbound\fR.
An atom followed by \[aq]*\[aq]
matches a sequence of 0 or more matches of the atom.
An atom followed by \[aq]+\[aq]
matches a sequence of 1 or more matches of the atom.
An atom followed by \[aq]?\[aq]
matches a sequence of 0 or 1 matches of the atom.
.P
A \fIbound\fR is \[aq]{\[aq] followed by an unsigned decimal integer,
possibly followed by \[aq],\[aq]
possibly followed by another unsigned decimal integer,
always followed by \[aq]}\[aq].
The integers must lie between 0 and
.B RE_DUP_MAX
(255\*(dg) inclusive,
and if there are two of them, the first may not exceed the second.
An atom followed by a bound containing one integer \fIi\fR
and no comma matches
a sequence of exactly \fIi\fR matches of the atom.
An atom followed by a bound
containing one integer \fIi\fR and a comma matches
a sequence of \fIi\fR or more matches of the atom.
An atom followed by a bound
containing two integers \fIi\fR and \fIj\fR matches
a sequence of \fIi\fR through \fIj\fR (inclusive) matches of the atom.
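.P
As a minimal sketch, assuming a C program that uses
.BR regcomp (3)
and
.BR regexec (3),
the three forms of bound can be tried out as follows.
The helper name
.I ere_matches
is hypothetical, and treating a compile failure as "no match"
is a simplification made for this example.

```c
#include <assert.h>
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: does the extended RE `pattern` match anywhere
 * in `text`?  A pattern that fails to compile counts as "no match". */
static bool ere_matches(const char *pattern, const char *text)
{
    regex_t re;
    bool found;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return false;
    found = (regexec(&re, text, 0, NULL, 0) == 0);
    regfree(&re);
    return found;
}
```

With the RE anchored at both ends,
"\fI\[ha]a{2}$\fP" should accept only "aa",
"\fI\[ha]a{2,}$\fP" two or more letters,
and "\fI\[ha]a{2,3}$\fP" "aa" or "aaa".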
.P
An atom is a regular expression enclosed in "\fI()\fP"
(matching a match for the regular expression),
an empty set of "\fI()\fP" (matching the null string)\*(dg,
a \fIbracket expression\fR (see below),
\[aq].\[aq] (matching any single character),
\[aq]\[ha]\[aq] (matching the null string at the beginning of a line),
\[aq]$\[aq] (matching the null string at the end of a line),
a \[aq]\e\[aq] followed by one of the characters "\fI\[ha].[$()|*+?{\e\fP"
(matching that character taken as an ordinary character),
a \[aq]\e\[aq] followed by any other character\*(dg
(matching that character taken as an ordinary character,
as if the \[aq]\e\[aq] had not been present\*(dg),
or a single character with no other significance (matching that character).
A \[aq]{\[aq] followed by a character other than a digit
is an ordinary character,
not the beginning of a bound\*(dg.
It is illegal to end an RE with \[aq]\e\[aq].
.P
A \fIbracket expression\fR is a list of characters enclosed in "\fI[]\fP".
It normally matches any single character from the list (but see below).
If the list begins with \[aq]\[ha]\[aq],
it matches any single character
(but see below) \fInot\fR from the rest of the list.
If two characters in the list are separated by \[aq]\-\[aq], this is shorthand
for the full \fIrange\fR of characters between those two (inclusive) in the
collating sequence,
for example, "\fI[0\-9]\fP" in ASCII matches any decimal digit.
It is illegal\*(dg for two ranges to share an
endpoint, for example, "\fIa\-c\-e\fP".
Ranges are very collating-sequence-dependent,
and portable programs should avoid relying on them.
.P
To include a literal \[aq]]\[aq] in the list, make it the first character
(following a possible \[aq]\[ha]\[aq]).
To include a literal \[aq]\-\[aq], make it the first or last character,
or the second endpoint of a range.
To use a literal \[aq]\-\[aq] as the first endpoint of a range,
enclose it in "\fI[.\fP" and "\fI.]\fP"
to make it a collating element (see below).
With the exception of these and some combinations using \[aq][\[aq] (see next
paragraphs), all other special characters, including \[aq]\e\[aq], lose their
special significance within a bracket expression.
.P
Within a bracket expression, a collating element (a character,
a multicharacter sequence that collates as if it were a single character,
or a collating-sequence name for either)
enclosed in "\fI[.\fP" and "\fI.]\fP" stands for the
sequence of characters of that collating element.
The sequence is a single element of the bracket expression's list.
A bracket expression containing a multicharacter collating element
can thus match more than one character,
for example, if the collating sequence includes a "ch" collating element,
then the RE "\fI[[.ch.]]*c\fP" matches the first five characters
of "chchcc".
.P
Within a bracket expression, a collating element enclosed in "\fI[=\fP" and
"\fI=]\fP" is an equivalence class, standing for the sequences of characters
of all collating elements equivalent to that one, including itself.
(If there are no other equivalent collating elements,
the treatment is as if the enclosing delimiters
were "\fI[.\fP" and "\fI.]\fP".)
For example, if o and \(^o are the members of an equivalence class,
then "\fI[[=o=]]\fP", "\fI[[=\(^o=]]\fP",
and "\fI[o\(^o]\fP" are all synonymous.
An equivalence class may not\*(dg be an endpoint
of a range.
.P
Within a bracket expression, the name of a \fIcharacter class\fR enclosed
in "\fI[:\fP" and "\fI:]\fP" stands for the list
of all characters belonging to that
class.
Standard character class names are:
.P
.RS
.TS
l l l.
alnum	digit	punct
alpha	graph	space
blank	lower	upper
cntrl	print	xdigit
.TE
.RE
.P
These stand for the character classes defined in
.BR wctype (3).
A locale may provide others.
A character class may not be used as an endpoint of a range.
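.P
A character class can be exercised with a short C sketch,
assuming the
.BR regcomp (3)
interface and the default "C" locale.
The helper name
.I first_digit_run
and its return convention are hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: find the first run of digits in `text`.
 * Returns the run's length and stores its start offset in *start,
 * or returns -1 if there is no digit (or the RE fails to compile). */
static int first_digit_run(const char *text, int *start)
{
    regex_t re;
    regmatch_t m;
    int len = -1;

    assert(text != NULL && start != NULL);
    /* The class name sits inside its own [: :] within the brackets. */
    if (regcomp(&re, "[[:digit:]]+", REG_EXTENDED) != 0)
        return -1;
    if (regexec(&re, text, 1, &m, 0) == 0) {
        *start = (int)m.rm_so;
        len = (int)(m.rm_eo - m.rm_so);
    }
    regfree(&re);
    return len;
}
```

Using the class name rather than a range such as "\fI[0\-9]\fP"
avoids depending on the collating sequence.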
.\" As per http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=295666
.\" The following does not seem to apply in the glibc implementation
.\" .P
.\" There are two special cases\*(dg of bracket expressions:
.\" the bracket expressions "\fI[[:<:]]\fP" and "\fI[[:>:]]\fP" match
.\" the null string at the beginning and end of a word respectively.
.\" A word is defined as a sequence of
.\" word characters
.\" which is neither preceded nor followed by
.\" word characters.
.\" A word character is an
.\" .I alnum
.\" character (as defined by
.\" .BR wctype (3))
.\" or an underscore.
.\" This is an extension,
.\" compatible with but not specified by POSIX.2,
.\" and should be used with
.\" caution in software intended to be portable to other systems.
.P
In the event that an RE could match more than one substring of a given
string,
the RE matches the one starting earliest in the string.
If the RE could match more than one substring starting at that point,
it matches the longest.
Subexpressions also match the longest possible substrings, subject to
the constraint that the whole match be as long as possible,
with subexpressions starting earlier in the RE taking priority over
ones starting later.
Note that higher-level subexpressions thus take priority over
their lower-level component subexpressions.
.P
Match lengths are measured in characters, not collating elements.
A null string is considered longer than no match at all.
For example,
"\fIbb*\fP" matches the three middle characters of "abbbc",
"\fI(wee|week)(knights|nights)\fP"
matches all ten characters of "weeknights",
when "\fI(.*).*\fP" is matched against "abc" the parenthesized subexpression
matches all three characters, and
when "\fI(a*)*\fP" is matched against "bc"
both the whole RE and the parenthesized
subexpression match the null string.
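.P
The earliest-then-longest rule for the overall match
can be observed with a C sketch along the following lines,
assuming an implementation of
.BR regexec (3)
that follows the rule.
The helper name
.I match_span
is hypothetical, and it reports only the span of the whole match,
not of the subexpressions.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: report the span of the overall match of the
 * extended RE `pattern` in `text` as byte offsets [*so, *eo).
 * Returns 0 on a match, nonzero otherwise. */
static int match_span(const char *pattern, const char *text, int *so, int *eo)
{
    regex_t re;
    regmatch_t m;
    int rc;

    assert(pattern != NULL && text != NULL && so != NULL && eo != NULL);
    rc = regcomp(&re, pattern, REG_EXTENDED);
    if (rc != 0)
        return rc;
    rc = regexec(&re, text, 1, &m, 0);
    if (rc == 0) {
        *so = (int)m.rm_so;   /* offset of the first matched byte */
        *eo = (int)m.rm_eo;   /* offset just past the last matched byte */
    }
    regfree(&re);
    return rc;
}
```

For "\fIbb*\fP" against "abbbc" the expected span is offsets 1 to 4,
and for "\fI(wee|week)(knights|nights)\fP" against "weeknights"
it is offsets 0 to 10.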
.P
If case-independent matching is specified,
the effect is much as if all case distinctions had vanished from the
alphabet.
When an alphabetic that exists in multiple cases appears as an
ordinary character outside a bracket expression, it is effectively
transformed into a bracket expression containing both cases,
for example, \[aq]x\[aq] becomes "\fI[xX]\fP".
When it appears inside a bracket expression, all case counterparts
of it are added to the bracket expression, so that, for example, "\fI[x]\fP"
becomes "\fI[xX]\fP" and "\fI[\[ha]x]\fP" becomes "\fI[\[ha]xX]\fP".
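.P
In the
.BR regcomp (3)
interface, case-independent matching is requested with the
.B REG_ICASE
flag.
The following C sketch, whose helper name
.I icase_matches
is hypothetical, assumes the default "C" locale.

```c
#include <assert.h>
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: match the extended RE `pattern` against `text`
 * with case distinctions ignored.  Compile errors count as "no match". */
static bool icase_matches(const char *pattern, const char *text)
{
    regex_t re;
    bool found;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED | REG_ICASE | REG_NOSUB) != 0)
        return false;
    found = (regexec(&re, text, 0, NULL, 0) == 0);
    regfree(&re);
    return found;
}
```

Under that flag, "\fI\[ha][\[ha]x]$\fP" should reject both "x" and "X",
in line with the "\fI[\[ha]xX]\fP" reading given above.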
.P
No particular limit is imposed on the length of REs\*(dg.
Programs intended to be portable should not employ REs longer
than 256 bytes,
as an implementation can refuse to accept such REs and remain
POSIX-compliant.
.P
Obsolete ("basic") regular expressions differ in several respects.
\[aq]|\[aq], \[aq]+\[aq], and \[aq]?\[aq] are
ordinary characters and there is no equivalent
for their functionality.
The delimiters for bounds are "\fI\e{\fP" and "\fI\e}\fP",
with \[aq]{\[aq] and \[aq]}\[aq] by themselves ordinary characters.
The parentheses for nested subexpressions are "\fI\e(\fP" and "\fI\e)\fP",
with \[aq](\[aq] and \[aq])\[aq] by themselves ordinary characters.
\[aq]\[ha]\[aq] is an ordinary character except at the beginning of the
RE or\*(dg the beginning of a parenthesized subexpression,
\[aq]$\[aq] is an ordinary character except at the end of the
RE or\*(dg the end of a parenthesized subexpression,
and \[aq]*\[aq] is an ordinary character if it appears at the beginning of the
RE or the beginning of a parenthesized subexpression
(after a possible leading \[aq]\[ha]\[aq]).
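.P
With
.BR regcomp (3),
omitting
.B REG_EXTENDED
selects the basic syntax,
so the same pattern text can be compiled both ways to see the difference.
The helper name
.I re_matches
in this C sketch is hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: match `pattern` against `text`, compiling it as
 * a basic RE when cflags is 0 or as an extended RE when cflags contains
 * REG_EXTENDED.  Compile errors count as "no match". */
static bool re_matches(const char *pattern, const char *text, int cflags)
{
    regex_t re;
    bool found;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return false;
    found = (regexec(&re, text, 0, NULL, 0) == 0);
    regfree(&re);
    return found;
}
```

Compiled as a basic RE, "\fI\[ha]a+$\fP" should match only the
two-character string "a+",
whereas with
.B REG_EXTENDED
it matches one or more letters "a".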
.P
Finally, there is one new type of atom, a \fIback reference\fR:
\[aq]\e\[aq] followed by a nonzero decimal digit \fId\fR
matches the same sequence of characters
matched by the \fId\fRth parenthesized subexpression
(numbering subexpressions by the positions of their opening parentheses,
left to right),
so that, for example, "\fI\e([bc]\e)\e1\fP" matches "bb" or "cc" but not "bc".
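.P
The back-reference example can be checked with a C sketch such as the
following, which compiles the pattern as a basic RE through
.BR regcomp (3).
The helper name
.I has_doubled_bc
is hypothetical.
Note that each backslash of the RE is doubled inside the C string literal.

```c
#include <assert.h>
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if `text` contains "bb" or "cc", found
 * with a basic RE whose back reference repeats whatever the first
 * parenthesized subexpression matched. */
static bool has_doubled_bc(const char *text)
{
    regex_t re;
    bool found;

    assert(text != NULL);
    /* cflags of 0 selects basic syntax. */
    if (regcomp(&re, "\\([bc]\\)\\1", 0) != 0)
        return false;
    found = (regexec(&re, text, 0, NULL, 0) == 0);
    regfree(&re);
    return found;
}
```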
.SH BUGS
Having two kinds of REs is a botch.
.P
The current POSIX.2 spec says that \[aq])\[aq] is an ordinary character in
the absence of an unmatched \[aq](\[aq];
this was an unintentional result of a wording error,
and change is likely.
Avoid relying on it.
.P
Back references are a dreadful botch,
posing major problems for efficient implementations.
They are also somewhat vaguely defined
(does
"\fIa\e(\e(b\e)*\e2\e)*d\fP" match "abbbd"?).
Avoid using them.
.P
POSIX.2's specification of case-independent matching is vague.
The "one case implies all cases" definition given above
is current consensus among implementors as to the right interpretation.
.\" As per http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=295666
.\" The following does not seem to apply in the glibc implementation
.\" .P
.\" The syntax for word boundaries is incredibly ugly.
.SH AUTHOR
.\" Sigh... The page license means we must have the author's name
.\" in the formatted output.
This page was taken from Henry Spencer's regex package.
.SH SEE ALSO
.BR grep (1),
.BR regex (3)
.P
POSIX.2, section 2.8 (Regular Expression Notation).