.\" From Henry Spencer's regex package (as found in the apache
.\" distribution). The package carries the following copyright:
.\"
.\" Copyright 1992, 1993, 1994 Henry Spencer. All rights reserved.
.\" %%%LICENSE_START(MISC)
.\" This software is not subject to any license of the American Telephone
.\" and Telegraph Company or of the Regents of the University of California.
.\"
.\" Permission is granted to anyone to use this software for any purpose
.\" on any computer system, and to alter it and redistribute it, subject
.\" to the following restrictions:
.\"
.\" 1. The author is not responsible for the consequences of use of this
.\" software, no matter how awful, even if they arise from flaws in it.
.\"
.\" 2. The origin of this software must not be misrepresented, either by
.\" explicit claim or by omission. Since few users ever read sources,
.\" credits must appear in the documentation.
.\"
.\" 3. Altered versions must be plainly marked as such, and must not be
.\" misrepresented as being the original software. Since few users
.\" ever read sources, credits must appear in the documentation.
.\"
.\" 4. This notice may not be removed or altered.
.\" %%%LICENSE_END
.\"
.\" In order to comply with `credits must appear in the documentation'
.\" I added an AUTHOR paragraph below - aeb.
.\"
.\" In the default nroff environment there is no dagger \(dg.
.\"
.\" 2005-05-11 Removed discussion of `[[:<:]]' and `[[:>:]]', which
.\" appear not to be in the glibc implementation of regcomp
.\"
.ie t .ds dg \(dg
.el .ds dg (!)
.TH REGEX 7 2009-01-12 "" "Linux Programmer's Manual"
.SH NAME
regex \- POSIX.2 regular expressions
.SH DESCRIPTION
Regular expressions ("RE"s),
as defined in POSIX.2, come in two forms:
modern REs (roughly those of
.IR egrep ;
POSIX.2 calls these "extended" REs)
and obsolete REs (roughly those of
.BR ed (1);
POSIX.2 "basic" REs).
Obsolete REs mostly exist for backward compatibility in some old programs;
they will be discussed at the end.
POSIX.2 leaves some aspects of RE syntax and semantics open;
"\*(dg" marks decisions on these aspects that
may not be fully portable to other POSIX.2 implementations.
.PP
A (modern) RE is one\*(dg or more nonempty\*(dg \fIbranches\fR,
separated by \(aq|\(aq.
It matches anything that matches one of the branches.
.PP
A branch is one\*(dg or more \fIpieces\fR, concatenated.
It matches a match for the first, followed by a match for the second, etc.
.PP
A piece is an \fIatom\fR possibly followed
by a single\*(dg \(aq*\(aq, \(aq+\(aq, \(aq?\(aq, or \fIbound\fR.
An atom followed by \(aq*\(aq
matches a sequence of 0 or more matches of the atom.
An atom followed by \(aq+\(aq
matches a sequence of 1 or more matches of the atom.
An atom followed by \(aq?\(aq
matches a sequence of 0 or 1 matches of the atom.
.PP
A \fIbound\fR is \(aq{\(aq followed by an unsigned decimal integer,
possibly followed by \(aq,\(aq
possibly followed by another unsigned decimal integer,
always followed by \(aq}\(aq.
The integers must lie between 0 and
.B RE_DUP_MAX
(255\*(dg) inclusive,
and if there are two of them, the first may not exceed the second.
An atom followed by a bound containing one integer \fIi\fR
and no comma matches
a sequence of exactly \fIi\fR matches of the atom.
An atom followed by a bound
containing one integer \fIi\fR and a comma matches
a sequence of \fIi\fR or more matches of the atom.
An atom followed by a bound
containing two integers \fIi\fR and \fIj\fR matches
a sequence of \fIi\fR through \fIj\fR (inclusive) matches of the atom.
.PP
An atom is a regular expression enclosed in "\fI()\fP"
(matching a match for the regular expression),
an empty set of "\fI()\fP" (matching the null string)\*(dg,
a \fIbracket expression\fR (see below), \(aq.\(aq
(matching any single character), \(aq^\(aq (matching the null string at the
beginning of a line), \(aq$\(aq (matching the null string at the
end of a line), a \(aq\e\(aq followed by one of the characters
"\fI^.[$()|*+?{\e\fP"
(matching that character taken as an ordinary character),
a \(aq\e\(aq followed by any other character\*(dg
(matching that character taken as an ordinary character,
as if the \(aq\e\(aq had not been present\*(dg),
or a single character with no other significance (matching that character).
A \(aq{\(aq followed by a character other than a digit is an ordinary
character, not the beginning of a bound\*(dg.
It is illegal to end an RE with \(aq\e\(aq.
.PP
A \fIbracket expression\fR is a list of characters enclosed in "\fI[]\fP".
It normally matches any single character from the list (but see below).
If the list begins with \(aq^\(aq,
it matches any single character
(but see below) \fInot\fR from the rest of the list.
If two characters in the list are separated by \(aq\-\(aq, this is shorthand
for the full \fIrange\fR of characters between those two (inclusive) in the
collating sequence,
for example, "\fI[0\-9]\fP" in ASCII matches any decimal digit.
It is illegal\*(dg for two ranges to share an
endpoint, for example, "\fIa-c-e\fP".
Ranges are very collating-sequence-dependent,
and portable programs should avoid relying on them.
.PP
To include a literal \(aq]\(aq in the list, make it the first character
(following a possible \(aq^\(aq).
To include a literal \(aq\-\(aq, make it the first or last character,
or the second endpoint of a range.
To use a literal \(aq\-\(aq as the first endpoint of a range,
enclose it in "\fI[.\fP" and "\fI.]\fP"
to make it a collating element (see below).
With the exception of these and some combinations using \(aq[\(aq (see next
paragraphs), all other special characters, including \(aq\e\(aq, lose their
special significance within a bracket expression.
.PP
Within a bracket expression, a collating element (a character,
a multicharacter sequence that collates as if it were a single character,
or a collating-sequence name for either)
enclosed in "\fI[.\fP" and "\fI.]\fP" stands for the
sequence of characters of that collating element.
The sequence is a single element of the bracket expression's list.
A bracket expression containing a multicharacter collating element
can thus match more than one character,
for example, if the collating sequence includes a "ch" collating element,
then the RE "\fI[[.ch.]]*c\fP" matches the first five characters
of "chchcc".
.PP
Within a bracket expression, a collating element enclosed in "\fI[=\fP" and
"\fI=]\fP" is an equivalence class, standing for the sequences of characters
of all collating elements equivalent to that one, including itself.
(If there are no other equivalent collating elements,
the treatment is as if the enclosing delimiters
were "\fI[.\fP" and "\fI.]\fP".)
For example, if o and \o'o^' are the members of an equivalence class,
then "\fI[[=o=]]\fP", "\fI[[=\o'o^'=]]\fP",
and "\fI[o\o'o^']\fP" are all synonymous.
An equivalence class may not\*(dg be an endpoint
of a range.
.PP
Within a bracket expression, the name of a \fIcharacter class\fR enclosed
in "\fI[:\fP" and "\fI:]\fP" stands for the list
of all characters belonging to that
class.
Standard character class names are:
.PP
.RS
.TS
l l l.
alnum	digit	punct
alpha	graph	space
blank	lower	upper
cntrl	print	xdigit
.TE
.RE
.PP
These stand for the character classes defined in
.BR wctype (3).
A locale may provide others.
A character class may not be used as an endpoint of a range.
.\" As per http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=295666
.\" The following does not seem to apply in the glibc implementation
.\" .PP
.\" There are two special cases\*(dg of bracket expressions:
.\" the bracket expressions "\fI[[:<:]]\fP" and "\fI[[:>:]]\fP" match
.\" the null string at the beginning and end of a word respectively.
.\" A word is defined as a sequence of
.\" word characters
.\" which is neither preceded nor followed by
.\" word characters.
.\" A word character is an
.\" .I alnum
.\" character (as defined by
.\" .BR wctype (3))
.\" or an underscore.
.\" This is an extension,
.\" compatible with but not specified by POSIX.2,
.\" and should be used with
.\" caution in software intended to be portable to other systems.
.PP
In the event that an RE could match more than one substring of a given
string,
the RE matches the one starting earliest in the string.
If the RE could match more than one substring starting at that point,
it matches the longest.
Subexpressions also match the longest possible substrings, subject to
the constraint that the whole match be as long as possible,
with subexpressions starting earlier in the RE taking priority over
ones starting later.
Note that higher-level subexpressions thus take priority over
their lower-level component subexpressions.
.PP
Match lengths are measured in characters, not collating elements.
A null string is considered longer than no match at all.
For example,
"\fIbb*\fP" matches the three middle characters of "abbbc",
"\fI(wee|week)(knights|nights)\fP"
matches all ten characters of "weeknights",
when "\fI(.*).*\fP" is matched against "abc" the parenthesized subexpression
matches all three characters, and
when "\fI(a*)*\fP" is matched against "bc"
both the whole RE and the parenthesized
subexpression match the null string.
.PP
If case-independent matching is specified,
the effect is much as if all case distinctions had vanished from the
alphabet.
When an alphabetic that exists in multiple cases appears as an
ordinary character outside a bracket expression, it is effectively
transformed into a bracket expression containing both cases,
for example, \(aqx\(aq becomes "\fI[xX]\fP".
When it appears inside a bracket expression, all case counterparts
of it are added to the bracket expression, so that, for example, "\fI[x]\fP"
becomes "\fI[xX]\fP" and "\fI[^x]\fP" becomes "\fI[^xX]\fP".
.PP
No particular limit is imposed on the length of REs\*(dg.
Programs intended to be portable should not employ REs longer
than 256 bytes,
as an implementation can refuse to accept such REs and remain
POSIX-compliant.
.PP
Obsolete ("basic") regular expressions differ in several respects.
\(aq|\(aq, \(aq+\(aq, and \(aq?\(aq are
ordinary characters and there is no equivalent
for their functionality.
The delimiters for bounds are "\fI\e{\fP" and "\fI\e}\fP",
with \(aq{\(aq and \(aq}\(aq by themselves ordinary characters.
The parentheses for nested subexpressions are "\fI\e(\fP" and "\fI\e)\fP",
with \(aq(\(aq and \(aq)\(aq by themselves ordinary characters.
\(aq^\(aq is an ordinary character except at the beginning of the
RE or\*(dg the beginning of a parenthesized subexpression,
\(aq$\(aq is an ordinary character except at the end of the
RE or\*(dg the end of a parenthesized subexpression,
and \(aq*\(aq is an ordinary character if it appears at the beginning of the
RE or the beginning of a parenthesized subexpression
(after a possible leading \(aq^\(aq).
.PP
Finally, there is one new type of atom, a \fIback reference\fR:
\(aq\e\(aq followed by a nonzero decimal digit \fId\fR
matches the same sequence of characters
matched by the \fId\fRth parenthesized subexpression
(numbering subexpressions by the positions of their opening parentheses,
left to right),
so that, for example, "\fI\e([bc]\e)\e1\fP" matches "bb" or "cc" but not "bc".
.SH BUGS
Having two kinds of REs is a botch.
.PP
The current POSIX.2 spec says that \(aq)\(aq is an ordinary character in
the absence of an unmatched \(aq(\(aq;
this was an unintentional result of a wording error,
and change is likely.
Avoid relying on it.
.PP
Back references are a dreadful botch,
posing major problems for efficient implementations.
They are also somewhat vaguely defined
(does
"\fIa\e(\e(b\e)*\e2\e)*d\fP" match "abbbd"?).
Avoid using them.
.PP
POSIX.2's specification of case-independent matching is vague.
The "one case implies all cases" definition given above
is current consensus among implementors as to the right interpretation.
.\" As per http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=295666
.\" The following does not seem to apply in the glibc implementation
.\" .PP
.\" The syntax for word boundaries is incredibly ugly.
.SH AUTHOR
.\" Sigh... The page license means we must have the author's name
.\" in the formatted output.
This page was taken from Henry Spencer's regex package.
.SH SEE ALSO
.BR grep (1),
.BR regex (3)
.PP
POSIX.2, section 2.8 (Regular Expression Notation).