Commit | Line | Data |
---|---|---|
a1eaacb1 | 1 | '\" t |
fea681da MK |
2 | .\" From Henry Spencer's regex package (as found in the apache |
3 | .\" distribution). The package carries the following copyright: | |
4 | .\" | |
5 | .\" Copyright 1992, 1993, 1994 Henry Spencer. All rights reserved. | |
1a459d04 | 6 | .\" %%%LICENSE_START(MISC) |
fea681da MK |
7 | .\" This software is not subject to any license of the American Telephone |
8 | .\" and Telegraph Company or of the Regents of the University of California. | |
c13182ef | 9 | .\" |
fea681da MK |
10 | .\" Permission is granted to anyone to use this software for any purpose |
11 | .\" on any computer system, and to alter it and redistribute it, subject | |
12 | .\" to the following restrictions: | |
c13182ef | 13 | .\" |
fea681da MK |
14 | .\" 1. The author is not responsible for the consequences of use of this |
15 | .\" software, no matter how awful, even if they arise from flaws in it. | |
c13182ef | 16 | .\" |
fea681da MK |
17 | .\" 2. The origin of this software must not be misrepresented, either by |
18 | .\" explicit claim or by omission. Since few users ever read sources, | |
19 | .\" credits must appear in the documentation. | |
c13182ef | 20 | .\" |
fea681da MK |
21 | .\" 3. Altered versions must be plainly marked as such, and must not be |
22 | .\" misrepresented as being the original software. Since few users | |
23 | .\" ever read sources, credits must appear in the documentation. | |
c13182ef | 24 | .\" |
fea681da | 25 | .\" 4. This notice may not be removed or altered. |
8ff7380d | 26 | .\" %%%LICENSE_END |
c13182ef | 27 | .\" |
fea681da MK |
28 | .\" In order to comply with `credits must appear in the documentation' |
29 | .\" I added an AUTHOR paragraph below - aeb. | |
30 | .\" | |
31 | .\" In the default nroff environment there is no dagger \(dg. | |
bf6c69c9 MK |
32 | .\" |
33 | .\" 2005-05-11 Removed discussion of `[[:<:]]' and `[[:>:]]', which | |
34 | .\" appear not to be in the glibc implementation of regcomp | |
35 | .\" | |
fea681da MK |
36 | .ie t .ds dg \(dg |
37 | .el .ds dg (!) | |
4c1c5274 | 38 | .TH regex 7 (date) "Linux man-pages (unreleased)" |
fea681da | 39 | .SH NAME |
4dec66f9 | 40 | regex \- POSIX.2 regular expressions |
fea681da | 41 | .SH DESCRIPTION |
324633ae | 42 | Regular expressions ("RE"s), |
4dec66f9 | 43 | as defined in POSIX.2, come in two forms: |
fea681da | 44 | modern REs (roughly those of |
f43a9ed0 | 45 | .BR egrep (1); |
324633ae | 46 | POSIX.2 calls these "extended" REs) |
fea681da MK |
47 | and obsolete REs (roughly those of |
48 | .BR ed (1); | |
324633ae | 49 | POSIX.2 "basic" REs). |
fea681da MK |
50 | Obsolete REs mostly exist for backward compatibility in some old programs; |
51 | they will be discussed at the end. | |
fa203d85 | 52 | POSIX.2 leaves some aspects of RE syntax and semantics open; |
333a424b | 53 | "\*(dg" marks decisions on these aspects that |
fa203d85 | 54 | may not be fully portable to other POSIX.2 implementations. |
c6d039a3 | 55 | .P |
aa796481 | 56 | A (modern) RE is one\*(dg or more nonempty\*(dg \fIbranches\fR, |
b957f81f | 57 | separated by \[aq]|\[aq]. |
fea681da | 58 | It matches anything that matches one of the branches. |
c6d039a3 | 59 | .P |
fea681da | 60 | A branch is one\*(dg or more \fIpieces\fR, concatenated. |
f78ed33a MK |
61 | It matches a match for the first, followed by a match for the second, |
62 | and so on. | |
c6d039a3 | 63 | .P |
fea681da | 64 | A piece is an \fIatom\fR possibly followed |
b957f81f AC |
65 | by a single\*(dg \[aq]*\[aq], \[aq]+\[aq], \[aq]?\[aq], or \fIbound\fR. |
66 | An atom followed by \[aq]*\[aq] | |
c45660d7 | 67 | matches a sequence of 0 or more matches of the atom. |
b957f81f | 68 | An atom followed by \[aq]+\[aq] |
c45660d7 | 69 | matches a sequence of 1 or more matches of the atom. |
b957f81f | 70 | An atom followed by \[aq]?\[aq] |
c45660d7 | 71 | matches a sequence of 0 or 1 matches of the atom. |
c6d039a3 | 72 | .P |
b957f81f AC |
73 | A \fIbound\fR is \[aq]{\[aq] followed by an unsigned decimal integer, |
74 | possibly followed by \[aq],\[aq] | |
fea681da | 75 | possibly followed by another unsigned decimal integer, |
b957f81f | 76 | always followed by \[aq]}\[aq]. |
097585ed MK |
77 | The integers must lie between 0 and |
78 | .B RE_DUP_MAX | |
79 | (255\*(dg) inclusive, | |
fea681da MK |
80 | and if there are two of them, the first may not exceed the second. |
81 | An atom followed by a bound containing one integer \fIi\fR | |
82 | and no comma matches | |
83 | a sequence of exactly \fIi\fR matches of the atom. | |
84 | An atom followed by a bound | |
85 | containing one integer \fIi\fR and a comma matches | |
86 | a sequence of \fIi\fR or more matches of the atom. | |
87 | An atom followed by a bound | |
88 | containing two integers \fIi\fR and \fIj\fR matches | |
89 | a sequence of \fIi\fR through \fIj\fR (inclusive) matches of the atom. | |
c6d039a3 | 90 | .P |
c45660d7 MK |
91 | An atom is a regular expression enclosed in "\fI()\fP" |
92 | (matching a match for the regular expression), | |
333a424b | 93 | an empty set of "\fI()\fP" (matching the null string)\*(dg, |
d6c1998e AC |
94 | a \fIbracket expression\fR (see below), |
95 | \[aq].\[aq] (matching any single character), | |
96 | \[aq]\[ha]\[aq] (matching the null string at the beginning of a line), | |
97 | \[aq]$\[aq] (matching the null string at the end of a line), | |
98 | a \[aq]\e\[aq] followed by one of the characters "\fI\[ha].[$()|*+?{\e\fP" | |
fea681da | 99 | (matching that character taken as an ordinary character), |
b957f81f | 100 | a \[aq]\e\[aq] followed by any other character\*(dg |
fea681da | 101 | (matching that character taken as an ordinary character, |
b957f81f | 102 | as if the \[aq]\e\[aq] had not been present\*(dg), |
fea681da | 103 | or a single character with no other significance (matching that character). |
d6c1998e AC |
104 | A \[aq]{\[aq] followed by a character other than a digit |
105 | is an ordinary character, | |
106 | not the beginning of a bound\*(dg. | |
b957f81f | 107 | It is illegal to end an RE with \[aq]\e\[aq]. |
c6d039a3 | 108 | .P |
333a424b | 109 | A \fIbracket expression\fR is a list of characters enclosed in "\fI[]\fP". |
fea681da | 110 | It normally matches any single character from the list (but see below). |
a1e9245d | 111 | If the list begins with \[aq]\[ha]\[aq], |
fea681da MK |
112 | it matches any single character |
113 | (but see below) \fInot\fR from the rest of the list. | |
b957f81f | 114 | If two characters in the list are separated by \[aq]\-\[aq], this is shorthand |
fea681da MK |
115 | for the full \fIrange\fR of characters between those two (inclusive) in the |
116 | collating sequence, | |
333a424b | 117 | for example, "\fI[0\-9]\fP" in ASCII matches any decimal digit. |
fea681da | 118 | It is illegal\*(dg for two ranges to share an |
e2a71fb3 | 119 | endpoint, for example, "\fIa\-c\-e\fP". |
fea681da MK |
120 | Ranges are very collating-sequence-dependent, |
121 | and portable programs should avoid relying on them. | |
c6d039a3 | 122 | .P |
b957f81f | 123 | To include a literal \[aq]]\[aq] in the list, make it the first character |
a1e9245d | 124 | (following a possible \[aq]\[ha]\[aq]). |
b957f81f | 125 | To include a literal \[aq]\-\[aq], make it the first or last character, |
fea681da | 126 | or the second endpoint of a range. |
b957f81f | 127 | To use a literal \[aq]\-\[aq] as the first endpoint of a range, |
c45660d7 MK |
128 | enclose it in "\fI[.\fP" and "\fI.]\fP" |
129 | to make it a collating element (see below). | |
b957f81f AC |
130 | With the exception of these and some combinations using \[aq][\[aq] (see next |
131 | paragraphs), all other special characters, including \[aq]\e\[aq], lose their | |
fea681da | 132 | special significance within a bracket expression. |
c6d039a3 | 133 | .P |
fea681da | 134 | Within a bracket expression, a collating element (a character, |
ae03dc66 | 135 | a multicharacter sequence that collates as if it were a single character, |
fea681da | 136 | or a collating-sequence name for either) |
333a424b | 137 | enclosed in "\fI[.\fP" and "\fI.]\fP" stands for the |
fea681da MK |
138 | sequence of characters of that collating element. |
139 | The sequence is a single element of the bracket expression's list. | |
ae03dc66 | 140 | A bracket expression containing a multicharacter collating element |
fea681da | 141 | can thus match more than one character; |
333a424b MK |
142 | for example, if the collating sequence includes a "ch" collating element, |
143 | then the RE "\fI[[.ch.]]*c\fP" matches the first five characters | |
144 | of "chchcc". | |
c6d039a3 | 145 | .P |
333a424b MK |
146 | Within a bracket expression, a collating element enclosed in "\fI[=\fP" and |
147 | "\fI=]\fP" is an equivalence class, standing for the sequences of characters | |
fea681da MK |
148 | of all collating elements equivalent to that one, including itself. |
149 | (If there are no other equivalent collating elements, | |
c45660d7 MK |
150 | the treatment is as if the enclosing delimiters |
151 | were "\fI[.\fP" and "\fI.]\fP".) | |
acf10c16 BIG |
152 | For example, if o and \(^o are the members of an equivalence class, |
153 | then "\fI[[=o=]]\fP", "\fI[[=\(^o=]]\fP", | |
154 | and "\fI[o\(^o]\fP" are all synonymous. | |
fea681da MK |
155 | An equivalence class may not\*(dg be an endpoint |
156 | of a range. | |
c6d039a3 | 157 | .P |
fea681da | 158 | Within a bracket expression, the name of a \fIcharacter class\fR enclosed |
c45660d7 MK |
159 | in "\fI[:\fP" and "\fI:]\fP" stands for the list |
160 | of all characters belonging to that | |
fea681da MK |
161 | class. |
162 | Standard character class names are: | |
c6d039a3 | 163 | .P |
fea681da | 164 | .RS |
34f2dcd0 ER |
165 | .TS |
166 | l l l. | |
fea681da MK |
167 | alnum digit punct |
168 | alpha graph space | |
169 | blank lower upper | |
170 | cntrl print xdigit | |
34f2dcd0 | 171 | .TE |
fea681da | 172 | .RE |
c6d039a3 | 173 | .P |
fea681da MK |
174 | These stand for the character classes defined in |
175 | .BR wctype (3). | |
176 | A locale may provide others. | |
177 | A character class may not be used as an endpoint of a range. | |
bf6c69c9 MK |
178 | .\" As per http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=295666 |
179 | .\" The following does not seem to apply in the glibc implementation | |
c6d039a3 | 180 | .\" .P |
bf6c69c9 | 181 | .\" There are two special cases\*(dg of bracket expressions: |
c45660d7 MK |
182 | .\" the bracket expressions "\fI[[:<:]]\fP" and "\fI[[:>:]]\fP" match |
183 | .\" the null string at the beginning and end of a word respectively. | |
bf6c69c9 MK |
184 | .\" A word is defined as a sequence of |
185 | .\" word characters | |
186 | .\" which is neither preceded nor followed by | |
187 | .\" word characters. | |
188 | .\" A word character is an | |
189 | .\" .I alnum | |
190 | .\" character (as defined by | |
191 | .\" .BR wctype (3)) | |
192 | .\" or an underscore. | |
193 | .\" This is an extension, | |
4dec66f9 | 194 | .\" compatible with but not specified by POSIX.2, |
bf6c69c9 MK |
195 | .\" and should be used with |
196 | .\" caution in software intended to be portable to other systems. | |
c6d039a3 | 197 | .P |
fea681da MK |
198 | In the event that an RE could match more than one substring of a given |
199 | string, | |
200 | the RE matches the one starting earliest in the string. | |
201 | If the RE could match more than one substring starting at that point, | |
202 | it matches the longest. | |
203 | Subexpressions also match the longest possible substrings, subject to | |
204 | the constraint that the whole match be as long as possible, | |
205 | with subexpressions starting earlier in the RE taking priority over | |
206 | ones starting later. | |
207 | Note that higher-level subexpressions thus take priority over | |
208 | their lower-level component subexpressions. | |
c6d039a3 | 209 | .P |
fea681da MK |
210 | Match lengths are measured in characters, not collating elements. |
211 | A null string is considered longer than no match at all. | |
212 | For example, | |
333a424b | 213 | "\fIbb*\fP" matches the three middle characters of "abbbc", |
c45660d7 MK |
214 | "\fI(wee|week)(knights|nights)\fP" |
215 | matches all ten characters of "weeknights", | |
333a424b | 216 | when "\fI(.*).*\fP" is matched against "abc" the parenthesized subexpression |
fea681da | 217 | matches all three characters, and |
c45660d7 MK |
218 | when "\fI(a*)*\fP" is matched against "bc" |
219 | both the whole RE and the parenthesized | |
fea681da | 220 | subexpression match the null string. |
c6d039a3 | 221 | .P |
fea681da MK |
222 | If case-independent matching is specified, |
223 | the effect is much as if all case distinctions had vanished from the | |
224 | alphabet. | |
225 | When an alphabetic that exists in multiple cases appears as an | |
226 | ordinary character outside a bracket expression, it is effectively | |
227 | transformed into a bracket expression containing both cases, | |
b957f81f | 228 | for example, \[aq]x\[aq] becomes "\fI[xX]\fP". |
fea681da | 229 | When it appears inside a bracket expression, all case counterparts |
333a424b | 230 | of it are added to the bracket expression, so that, for example, "\fI[x]\fP" |
a1e9245d | 231 | becomes "\fI[xX]\fP" and "\fI[\[ha]x]\fP" becomes "\fI[\[ha]xX]\fP". |
c6d039a3 | 232 | .P |
fea681da MK |
233 | No particular limit is imposed on the length of REs\*(dg. |
234 | Programs intended to be portable should not employ REs longer | |
235 | than 256 bytes, | |
236 | as an implementation can refuse to accept such REs and remain | |
237 | POSIX-compliant. | |
c6d039a3 | 238 | .P |
324633ae | 239 | Obsolete ("basic") regular expressions differ in several respects. |
b957f81f | 240 | \[aq]|\[aq], \[aq]+\[aq], and \[aq]?\[aq] are |
c45660d7 | 241 | ordinary characters and there is no equivalent |
fea681da | 242 | for their functionality. |
31a6818e | 243 | The delimiters for bounds are "\fI\e{\fP" and "\fI\e}\fP", |
b957f81f | 244 | with \[aq]{\[aq] and \[aq]}\[aq] by themselves ordinary characters. |
31a6818e | 245 | The parentheses for nested subexpressions are "\fI\e(\fP" and "\fI\e)\fP", |
b957f81f | 246 | with \[aq](\[aq] and \[aq])\[aq] by themselves ordinary characters. |
a1e9245d | 247 | \[aq]\[ha]\[aq] is an ordinary character except at the beginning of the |
fea681da | 248 | RE or\*(dg the beginning of a parenthesized subexpression, |
b957f81f | 249 | \[aq]$\[aq] is an ordinary character except at the end of the |
fea681da | 250 | RE or\*(dg the end of a parenthesized subexpression, |
b957f81f | 251 | and \[aq]*\[aq] is an ordinary character if it appears at the beginning of the |
fea681da | 252 | RE or the beginning of a parenthesized subexpression |
a1e9245d | 253 | (after a possible leading \[aq]\[ha]\[aq]). |
c6d039a3 | 254 | .P |
fea681da | 255 | Finally, there is one new type of atom, a \fIback reference\fR: |
b957f81f | 256 | \[aq]\e\[aq] followed by a nonzero decimal digit \fId\fR |
fea681da MK |
257 | matches the same sequence of characters |
258 | matched by the \fId\fRth parenthesized subexpression | |
259 | (numbering subexpressions by the positions of their opening parentheses, | |
260 | left to right), | |
31a6818e | 261 | so that, for example, "\fI\e([bc]\e)\e1\fP" matches "bb" or "cc" but not "bc". |
fea681da MK |
262 | .SH BUGS |
263 | Having two kinds of REs is a botch. | |
c6d039a3 | 264 | .P |
b957f81f AC |
265 | The current POSIX.2 spec says that \[aq])\[aq] is an ordinary character in |
266 | the absence of an unmatched \[aq](\[aq]; | |
fea681da MK |
267 | this was an unintentional result of a wording error, |
268 | and change is likely. | |
269 | Avoid relying on it. | |
c6d039a3 | 270 | .P |
fea681da MK |
271 | Back references are a dreadful botch, |
272 | posing major problems for efficient implementations. | |
273 | They are also somewhat vaguely defined | |
274 | (does | |
31a6818e | 275 | "\fIa\e(\e(b\e)*\e2\e)*d\fP" match "abbbd"?). |
fea681da | 276 | Avoid using them. |
c6d039a3 | 277 | .P |
fa203d85 | 278 | POSIX.2's specification of case-independent matching is vague. |
324633ae | 279 | The "one case implies all cases" definition given above |
fea681da | 280 | is the current consensus among implementors as to the right interpretation. |
4f020e78 MK |
281 | .\" As per http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=295666 |
282 | .\" The following does not seem to apply in the glibc implementation | |
c6d039a3 | 283 | .\" .P |
4f020e78 | 284 | .\" The syntax for word boundaries is incredibly ugly. |
e0c674cd MK |
285 | .SH AUTHOR |
286 | .\" Sigh... The page license means we must have the author's name | |
287 | .\" in the formatted output. | |
288 | This page was taken from Henry Spencer's regex package. | |
47297adb | 289 | .SH SEE ALSO |
845d36d6 | 290 | .BR grep (1), |
e37e3282 | 291 | .BR regex (3) |
c6d039a3 | 292 | .P |
e37e3282 | 293 | POSIX.2, section 2.8 (Regular Expression Notation). |
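
The matching rules described in the page (leftmost match first, then longest; bounds; back references in obsolete REs) can be checked against the POSIX `regcomp()` and `regexec()` interface documented in regex(3). The following is a minimal sketch, not part of the page itself: `match_span` is a hypothetical helper name introduced here for illustration, and it reports only the overall match, not the parenthesized subexpressions.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of any library): compile `pattern` with
 * the given regcomp() flags, search `text`, and store the byte offsets
 * of the overall match in *start and *end.
 *
 * Pass REG_EXTENDED in cflags for modern ("extended") REs, or 0 for
 * obsolete ("basic") REs.
 *
 * Returns 1 on a match, 0 on no match (or a matcher failure),
 * and -1 if the pattern does not compile.
 */
static int
match_span(const char *pattern, int cflags, const char *text,
           int *start, int *end)
{
    regex_t     re;
    regmatch_t  m[1];
    int         rc;

    assert(pattern != NULL && text != NULL);
    assert(start != NULL && end != NULL);

    if (regcomp(&re, pattern, cflags) != 0)
        return -1;

    /* m[0] receives the offsets of the whole match. */
    rc = regexec(&re, text, 1, m, 0);
    regfree(&re);
    if (rc != 0)
        return 0;

    *start = (int) m[0].rm_so;
    *end = (int) m[0].rm_eo;
    return 1;
}
```

With this helper, `match_span("bb*", REG_EXTENDED, "abbbc", &s, &e)` should report offsets 1 and 4, which are the three middle characters mentioned in the page. Passing 0 instead of `REG_EXTENDED` selects obsolete REs, so the back reference example would be written as the C string `"\\([bc]\\)\\1"`.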