Commit | Line | Data |
---|---|---|
fea681da MK |
1 | .\" From Henry Spencer's regex package (as found in the apache |
2 | .\" distribution). The package carries the following copyright: | |
3 | .\" | |
4 | .\" Copyright 1992, 1993, 1994 Henry Spencer. All rights reserved. | |
5 | .\" This software is not subject to any license of the American Telephone | |
6 | .\" and Telegraph Company or of the Regents of the University of California. | |
c13182ef | 7 | .\" |
fea681da MK |
8 | .\" Permission is granted to anyone to use this software for any purpose |
9 | .\" on any computer system, and to alter it and redistribute it, subject | |
10 | .\" to the following restrictions: | |
c13182ef | 11 | .\" |
fea681da MK |
12 | .\" 1. The author is not responsible for the consequences of use of this |
13 | .\" software, no matter how awful, even if they arise from flaws in it. | |
c13182ef | 14 | .\" |
fea681da MK |
15 | .\" 2. The origin of this software must not be misrepresented, either by |
16 | .\" explicit claim or by omission. Since few users ever read sources, | |
17 | .\" credits must appear in the documentation. | |
c13182ef | 18 | .\" |
fea681da MK |
19 | .\" 3. Altered versions must be plainly marked as such, and must not be |
20 | .\" misrepresented as being the original software. Since few users | |
21 | .\" ever read sources, credits must appear in the documentation. | |
c13182ef | 22 | .\" |
fea681da | 23 | .\" 4. This notice may not be removed or altered. |
c13182ef | 24 | .\" |
fea681da MK |
25 | .\" In order to comply with `credits must appear in the documentation' |
26 | .\" I added an AUTHOR paragraph below - aeb. | |
27 | .\" | |
28 | .\" In the default nroff environment there is no dagger \(dg. | |
bf6c69c9 MK |
29 | .\" |
30 | .\" 2005-05-11 Removed discussion of `[[:<:]]' and `[[:>:]]', which | |
31 | .\" appear not to be in the glibc implementation of regcomp | |
32 | .\" | |
fea681da MK |
33 | .ie t .ds dg \(dg |
34 | .el .ds dg (!) | |
5a25ff94 | 35 | .TH REGEX 7 2007-12-12 "" "Linux Programmer's Manual" |
fea681da | 36 | .SH NAME |
4dec66f9 | 37 | regex \- POSIX.2 regular expressions |
fea681da | 38 | .SH DESCRIPTION |
324633ae | 39 | Regular expressions ("RE"s), |
4dec66f9 | 40 | as defined in POSIX.2, come in two forms: |
fea681da MK |
41 | modern REs (roughly those of |
42 | .BR egrep (1); | |
324633ae | 43 | POSIX.2 calls these "extended" REs) |
fea681da MK |
44 | and obsolete REs (roughly those of |
45 | .BR ed (1); | |
324633ae | 46 | POSIX.2 "basic" REs). |
fea681da MK |
47 | Obsolete REs mostly exist for backward compatibility in some old programs; |
48 | they will be discussed at the end. | |
fa203d85 | 49 | POSIX.2 leaves some aspects of RE syntax and semantics open; |
fea681da | 50 | `\*(dg' marks decisions on these aspects that |
fa203d85 | 51 | may not be fully portable to other POSIX.2 implementations. |
fea681da | 52 | .PP |
c382a365 | 53 | A (modern) RE is one\*(dg or more nonempty\*(dg \fIbranches\fR, |
fea681da MK |
54 | separated by `|'. |
55 | It matches anything that matches one of the branches. | |
56 | .PP | |
57 | A branch is one\*(dg or more \fIpieces\fR, concatenated. | |
58 | It matches a match for the first, followed by a match for the second, etc. | |
59 | .PP | |
60 | A piece is an \fIatom\fR possibly followed | |
61 | by a single\*(dg `*', `+', `?', or \fIbound\fR. | |
62 | An atom followed by `*' matches a sequence of 0 or more matches of the atom. | |
63 | An atom followed by `+' matches a sequence of 1 or more matches of the atom. | |
64 | An atom followed by `?' matches a sequence of 0 or 1 matches of the atom. | |
65 | .PP | |
66 | A \fIbound\fR is `{' followed by an unsigned decimal integer, | |
67 | possibly followed by `,' | |
68 | possibly followed by another unsigned decimal integer, | |
69 | always followed by `}'. | |
097585ed MK |
70 | The integers must lie between 0 and |
71 | .B RE_DUP_MAX | |
72 | (255\*(dg) inclusive, | |
fea681da MK |
73 | and if there are two of them, the first may not exceed the second. |
74 | An atom followed by a bound containing one integer \fIi\fR | |
75 | and no comma matches | |
76 | a sequence of exactly \fIi\fR matches of the atom. | |
77 | An atom followed by a bound | |
78 | containing one integer \fIi\fR and a comma matches | |
79 | a sequence of \fIi\fR or more matches of the atom. | |
80 | An atom followed by a bound | |
81 | containing two integers \fIi\fR and \fIj\fR matches | |
82 | a sequence of \fIi\fR through \fIj\fR (inclusive) matches of the atom. | |
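The three bound forms can be tried out with a short sketch. It uses Python's `re` module, which is not a POSIX.2 implementation; the assumption here is only that it accepts the same `{i}`, `{i,}`, and `{i,j}` notation for these simple patterns. The helper name `matches_whole` is hypothetical.

```python
import re

def matches_whole(pattern, text):
    """Return True if the pattern matches the entire text, not just a prefix."""
    return re.fullmatch(pattern, text) is not None

# The three bound forms described above, applied to the atom `a`:
EXACTLY_TWO = "a{2}"     # one integer, no comma: exactly 2 matches
TWO_OR_MORE = "a{2,}"    # one integer and a comma: 2 or more matches
TWO_TO_THREE = "a{2,3}"  # two integers: 2 through 3 matches, inclusive
```

For instance, `matches_whole(TWO_TO_THREE, "aaa")` holds while `matches_whole(TWO_TO_THREE, "aaaa")` does not, because four repetitions exceed the upper integer.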
83 | .PP | |
84 | An atom is a regular expression enclosed in `()' (matching a match for the | |
85 | regular expression), | |
86 | an empty set of `()' (matching the null string)\*(dg, | |
87 | a \fIbracket expression\fR (see below), `.' | |
88 | (matching any single character), `^' (matching the null string at the | |
89 | beginning of a line), `$' (matching the null string at the | |
90 | end of a line), a `\e' followed by one of the characters | |
91 | `^.[$()|*+?{\e' | |
92 | (matching that character taken as an ordinary character), | |
93 | a `\e' followed by any other character\*(dg | |
94 | (matching that character taken as an ordinary character, | |
95 | as if the `\e' had not been present\*(dg), | |
96 | or a single character with no other significance (matching that character). | |
97 | A `{' followed by a character other than a digit is an ordinary | |
98 | character, not the beginning of a bound\*(dg. | |
99 | It is illegal to end an RE with `\e'. | |
100 | .PP | |
101 | A \fIbracket expression\fR is a list of characters enclosed in `[]'. | |
102 | It normally matches any single character from the list (but see below). | |
103 | If the list begins with `^', | |
104 | it matches any single character | |
105 | (but see below) \fInot\fR from the rest of the list. | |
106 | If two characters in the list are separated by `\-', this is shorthand | |
107 | for the full \fIrange\fR of characters between those two (inclusive) in the | |
108 | collating sequence; | |
75b94dc3 | 109 | for example, `[0\-9]' in ASCII matches any decimal digit. |
fea681da | 110 | It is illegal\*(dg for two ranges to share an |
75b94dc3 | 111 | endpoint, for example, `a\-c\-e'.
fea681da MK |
112 | Ranges are very collating-sequence-dependent, |
113 | and portable programs should avoid relying on them. | |
114 | .PP | |
115 | To include a literal `]' in the list, make it the first character | |
116 | (following a possible `^'). | |
117 | To include a literal `\-', make it the first or last character, | |
118 | or the second endpoint of a range. | |
119 | To use a literal `\-' as the first endpoint of a range, | |
120 | enclose it in `[.' and `.]' to make it a collating element (see below). | |
121 | With the exception of these and some combinations using `[' (see next | |
122 | paragraphs), all other special characters, including `\e', lose their | |
123 | special significance within a bracket expression. | |
124 | .PP | |
125 | Within a bracket expression, a collating element (a character, | |
126 | a multi-character sequence that collates as if it were a single character, | |
127 | or a collating-sequence name for either) | |
128 | enclosed in `[.' and `.]' stands for the | |
129 | sequence of characters of that collating element. | |
130 | The sequence is a single element of the bracket expression's list. | |
c13182ef | 131 | A bracket expression containing a multi-character collating element |
fea681da | 132 | can thus match more than one character;
75b94dc3 | 133 | for example, if the collating sequence includes a `ch' collating element, |
fea681da MK |
134 | then the RE `[[.ch.]]*c' matches the first five characters |
135 | of `chchcc'. | |
136 | .PP | |
137 | Within a bracket expression, a collating element enclosed in `[=' and | |
138 | `=]' is an equivalence class, standing for the sequences of characters | |
139 | of all collating elements equivalent to that one, including itself. | |
140 | (If there are no other equivalent collating elements, | |
141 | the treatment is as if the enclosing delimiters were `[.' and `.]'.) | |
142 | For example, if o and \o'o^' are the members of an equivalence class, | |
143 | then `[[=o=]]', `[[=\o'o^'=]]', and `[o\o'o^']' are all synonymous. | |
144 | An equivalence class may not\*(dg be an endpoint | |
145 | of a range. | |
146 | .PP | |
147 | Within a bracket expression, the name of a \fIcharacter class\fR enclosed | |
148 | in `[:' and `:]' stands for the list of all characters belonging to that | |
149 | class. | |
150 | Standard character class names are: | |
151 | .PP | |
152 | .RS | |
153 | .nf | |
154 | .ta 3c 6c 9c | |
155 | alnum digit punct | |
156 | alpha graph space | |
157 | blank lower upper | |
158 | cntrl print xdigit | |
159 | .fi | |
160 | .RE | |
161 | .PP | |
162 | These stand for the character classes defined in | |
163 | .BR wctype (3). | |
164 | A locale may provide others. | |
165 | A character class may not be used as an endpoint of a range. | |
bf6c69c9 MK |
166 | .\" As per http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=295666 |
167 | .\" The following does not seem to apply in the glibc implementation | |
168 | .\" .PP | |
169 | .\" There are two special cases\*(dg of bracket expressions: | |
170 | .\" the bracket expressions `[[:<:]]' and `[[:>:]]' match the null string at | |
171 | .\" the beginning and end of a word respectively. | |
172 | .\" A word is defined as a sequence of | |
173 | .\" word characters | |
174 | .\" which is neither preceded nor followed by | |
175 | .\" word characters. | |
176 | .\" A word character is an | |
177 | .\" .I alnum | |
178 | .\" character (as defined by | |
179 | .\" .BR wctype (3)) | |
180 | .\" or an underscore. | |
181 | .\" This is an extension, | |
4dec66f9 | 182 | .\" compatible with but not specified by POSIX.2, |
bf6c69c9 MK |
183 | .\" and should be used with |
184 | .\" caution in software intended to be portable to other systems. | |
fea681da MK |
185 | .PP |
186 | In the event that an RE could match more than one substring of a given | |
187 | string, | |
188 | the RE matches the one starting earliest in the string. | |
189 | If the RE could match more than one substring starting at that point, | |
190 | it matches the longest. | |
191 | Subexpressions also match the longest possible substrings, subject to | |
192 | the constraint that the whole match be as long as possible, | |
193 | with subexpressions starting earlier in the RE taking priority over | |
194 | ones starting later. | |
195 | Note that higher-level subexpressions thus take priority over | |
196 | their lower-level component subexpressions. | |
197 | .PP | |
198 | Match lengths are measured in characters, not collating elements. | |
199 | A null string is considered longer than no match at all. | |
200 | For example, | |
201 | `bb*' matches the three middle characters of `abbbc', | |
202 | `(wee|week)(knights|nights)' matches all ten characters of `weeknights', | |
203 | when `(.*).*' is matched against `abc' the parenthesized subexpression | |
204 | matches all three characters, and | |
205 | when `(a*)*' is matched against `bc' both the whole RE and the parenthesized | |
206 | subexpression match the null string. | |
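The four examples in this paragraph can be checked mechanically. The sketch below uses Python's `re` module as a stand-in, which is an assumption of convenience: Python's engine is not a POSIX.2 implementation and prefers the first alternative that leads to a match rather than the longest one, so agreement on these particular examples should not be read as agreement in general. The helper name `first_match` is hypothetical.

```python
import re

def first_match(pattern, text):
    """Return (whole_match, subexpression_matches) for the earliest match,
    or None if the pattern matches nowhere in the text."""
    m = re.search(pattern, text)
    if m is None:
        return None
    return m.group(0), m.groups()

# The examples from the paragraph above, as (pattern, subject) pairs.
EXAMPLES = [
    ("bb*", "abbbc"),
    ("(wee|week)(knights|nights)", "weeknights"),
    ("(.*).*", "abc"),
    ("(a*)*", "bc"),
]
```

Calling `first_match(*pair)` for each entry in `EXAMPLES` returns the matched text together with what each parenthesized subexpression captured. A program that depends on the longest-match rule itself should test against an actual POSIX implementation such as the one described in regex(3).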
207 | .PP | |
208 | If case-independent matching is specified, | |
209 | the effect is much as if all case distinctions had vanished from the | |
210 | alphabet. | |
211 | When an alphabetic character that exists in multiple cases appears as an | |
212 | ordinary character outside a bracket expression, it is effectively | |
213 | transformed into a bracket expression containing both cases; | |
75b94dc3 | 214 | for example, `x' becomes `[xX]'. |
fea681da | 215 | When it appears inside a bracket expression, all case counterparts |
75b94dc3 | 216 | of it are added to the bracket expression, so that, for example, `[x]' |
fea681da MK |
217 | becomes `[xX]' and `[^x]' becomes `[^xX]'. |
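The "one case implies all cases" rewriting can be illustrated by comparing a pattern matched case-independently with its expanded bracket form matched normally. This is a sketch using Python's `re` module and its `re.IGNORECASE` flag, not a POSIX.2 implementation; the helper name `same_behaviour` is hypothetical.

```python
import re

def same_behaviour(flagged, expanded, samples):
    """True if `flagged` matched case-independently agrees with `expanded`
    matched case-sensitively on every sample string."""
    return all(
        bool(re.fullmatch(flagged, s, re.IGNORECASE))
        == bool(re.fullmatch(expanded, s))
        for s in samples
    )

# Pairs taken from the paragraph above: original form, expanded form.
PAIRS = [("x", "[xX]"), ("[x]", "[xX]"), ("[^x]", "[^xX]")]
```

Checking each pair in `PAIRS` against the one-character samples `"x"`, `"X"`, `"y"`, and `"Y"` shows the two forms accepting and rejecting the same strings, including the negated case, where `[^x]` under case-independent matching rejects `X`.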
218 | .PP | |
219 | No particular limit is imposed on the length of REs\*(dg. | |
220 | Programs intended to be portable should not employ REs longer | |
221 | than 256 bytes, | |
222 | as an implementation can refuse to accept such REs and remain | |
223 | POSIX-compliant. | |
224 | .PP | |
324633ae | 225 | Obsolete ("basic") regular expressions differ in several respects. |
fea681da MK |
226 | `|', `+', and `?' are ordinary characters and there is no equivalent |
227 | for their functionality. | |
228 | The delimiters for bounds are `\e{' and `\e}', | |
229 | with `{' and `}' by themselves ordinary characters. | |
230 | The parentheses for nested subexpressions are `\e(' and `\e)', | |
231 | with `(' and `)' by themselves ordinary characters. | |
232 | `^' is an ordinary character except at the beginning of the | |
233 | RE or\*(dg the beginning of a parenthesized subexpression, | |
234 | `$' is an ordinary character except at the end of the | |
235 | RE or\*(dg the end of a parenthesized subexpression, | |
236 | and `*' is an ordinary character if it appears at the beginning of the | |
237 | RE or the beginning of a parenthesized subexpression | |
238 | (after a possible leading `^'). | |
4f020e78 | 239 | .PP |
fea681da | 240 | Finally, there is one new type of atom, a \fIback reference\fR: |
c382a365 | 241 | `\e' followed by a nonzero decimal digit \fId\fR |
fea681da MK |
242 | matches the same sequence of characters |
243 | matched by the \fId\fRth parenthesized subexpression | |
244 | (numbering subexpressions by the positions of their opening parentheses, | |
245 | left to right), | |
75b94dc3 | 246 | so that, for example, `\e([bc]\e)\e1' matches `bb' or `cc' but not `bc'. |
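The back-reference example can be reproduced in a few lines. The sketch uses Python's `re` module, which is an assumption of convenience rather than a POSIX.2 implementation: Python writes grouping parentheses without backslashes, so the obsolete RE `\([bc]\)\1` is spelled `([bc])\1` there. The names `DOUBLED` and `is_doubled` are hypothetical.

```python
import re

# Python spelling of the obsolete RE \([bc]\)\1 : one `b` or `c`,
# followed by whatever the first parenthesized subexpression matched.
DOUBLED = re.compile(r"([bc])\1")

def is_doubled(text):
    """Return True if the whole text is `bb` or `cc`."""
    return DOUBLED.fullmatch(text) is not None
```

`is_doubled("bb")` and `is_doubled("cc")` hold, while `is_doubled("bc")` does not: the back reference repeats the characters actually matched, not the bracket expression.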
fea681da MK |
247 | .SH BUGS |
248 | Having two kinds of REs is a botch. | |
249 | .PP | |
fa203d85 | 250 | The current POSIX.2 spec says that `)' is an ordinary character in |
fea681da MK |
251 | the absence of an unmatched `('; |
252 | this was an unintentional result of a wording error, | |
253 | and change is likely. | |
254 | Avoid relying on it. | |
255 | .PP | |
256 | Back references are a dreadful botch, | |
257 | posing major problems for efficient implementations. | |
258 | They are also somewhat vaguely defined | |
259 | (does | |
260 | `a\e(\e(b\e)*\e2\e)*d' match `abbbd'?). | |
261 | Avoid using them. | |
262 | .PP | |
fa203d85 | 263 | POSIX.2's specification of case-independent matching is vague. |
324633ae | 264 | The "one case implies all cases" definition given above |
fea681da | 265 | is the current consensus among implementors as to the right interpretation.
4f020e78 MK |
266 | .\" As per http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=295666 |
267 | .\" The following does not seem to apply in the glibc implementation | |
268 | .\" .PP | |
269 | .\" The syntax for word boundaries is incredibly ugly. | |
fd7f0a7f MK |
270 | .\" .SH AUTHOR |
271 | .\" This page was taken from Henry Spencer's regex package. | |
e37e3282 MK |
272 | .SH "SEE ALSO" |
273 | .BR regex (3) | |
274 | .PP | |
275 | POSIX.2, section 2.8 (Regular Expression Notation). |