Commit | Line | Data |
---|---|---|
ef416fc2 | 1 | |
2 | ||
3 | ||
4 | ||
5 | ||
6 | ||
7 | Network Working Group | |
8 | Request for Comments: 3986 | |
9 | STD: 66 | |
10 | Updates: 1738 | |
11 | Obsoletes: 2732, 2396, 1808 | |
12 | Category: Standards Track | |
13 | T. Berners-Lee (W3C/MIT), R. Fielding (Day Software), L. Masinter (Adobe Systems), January 2005 | |
14 | ||
15 | ||
16 | Uniform Resource Identifier (URI): Generic Syntax | |
17 | ||
18 | Status of This Memo | |
19 | ||
20 | This document specifies an Internet standards track protocol for the | |
21 | Internet community, and requests discussion and suggestions for | |
22 | improvements. Please refer to the current edition of the "Internet | |
23 | Official Protocol Standards" (STD 1) for the standardization state | |
24 | and status of this protocol. Distribution of this memo is unlimited. | |
25 | ||
26 | Copyright Notice | |
27 | ||
28 | Copyright (C) The Internet Society (2005). | |
29 | ||
30 | Abstract | |
31 | ||
32 | A Uniform Resource Identifier (URI) is a compact sequence of | |
33 | characters that identifies an abstract or physical resource. This | |
34 | specification defines the generic URI syntax and a process for | |
35 | resolving URI references that might be in relative form, along with | |
36 | guidelines and security considerations for the use of URIs on the | |
37 | Internet. The URI syntax defines a grammar that is a superset of all | |
38 | valid URIs, allowing an implementation to parse the common components | |
39 | of a URI reference without knowing the scheme-specific requirements | |
40 | of every possible identifier. This specification does not define a | |
41 | generative grammar for URIs; that task is performed by the individual | |
42 | specifications of each URI scheme. | |
43 | ||
44 | ||
45 | ||
46 | ||
47 | ||
48 | ||
49 | ||
50 | ||
51 | ||
52 | ||
53 | ||
54 | ||
55 | ||
56 | ||
57 | ||
58 | Berners-Lee, et al. Standards Track [Page 1] | |
60 | RFC 3986 URI Generic Syntax January 2005 | |
61 | ||
62 | ||
63 | Table of Contents | |
64 | ||
65 | 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4 | |
66 | 1.1. Overview of URIs . . . . . . . . . . . . . . . . . . . . 4 | |
67 | 1.1.1. Generic Syntax . . . . . . . . . . . . . . . . . 6 | |
68 | 1.1.2. Examples . . . . . . . . . . . . . . . . . . . . 7 | |
69 | 1.1.3. URI, URL, and URN . . . . . . . . . . . . . . . 7 | |
70 | 1.2. Design Considerations . . . . . . . . . . . . . . . . . 8 | |
71 | 1.2.1. Transcription . . . . . . . . . . . . . . . . . 8 | |
72 | 1.2.2. Separating Identification from Interaction . . . 9 | |
73 | 1.2.3. Hierarchical Identifiers . . . . . . . . . . . . 10 | |
74 | 1.3. Syntax Notation . . . . . . . . . . . . . . . . . . . . 11 | |
75 | 2. Characters . . . . . . . . . . . . . . . . . . . . . . . . . . 11 | |
76 | 2.1. Percent-Encoding . . . . . . . . . . . . . . . . . . . . 12 | |
77 | 2.2. Reserved Characters . . . . . . . . . . . . . . . . . . 12 | |
78 | 2.3. Unreserved Characters . . . . . . . . . . . . . . . . . 13 | |
79 | 2.4. When to Encode or Decode . . . . . . . . . . . . . . . . 14 | |
80 | 2.5. Identifying Data . . . . . . . . . . . . . . . . . . . . 14 | |
81 | 3. Syntax Components . . . . . . . . . . . . . . . . . . . . . . 16 | |
82 | 3.1. Scheme . . . . . . . . . . . . . . . . . . . . . . . . . 17 | |
83 | 3.2. Authority . . . . . . . . . . . . . . . . . . . . . . . 17 | |
84 | 3.2.1. User Information . . . . . . . . . . . . . . . . 18 | |
85 | 3.2.2. Host . . . . . . . . . . . . . . . . . . . . . . 18 | |
86 | 3.2.3. Port . . . . . . . . . . . . . . . . . . . . . . 22 | |
87 | 3.3. Path . . . . . . . . . . . . . . . . . . . . . . . . . . 22 | |
88 | 3.4. Query . . . . . . . . . . . . . . . . . . . . . . . . . 23 | |
89 | 3.5. Fragment . . . . . . . . . . . . . . . . . . . . . . . . 24 | |
90 | 4. Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 | |
91 | 4.1. URI Reference . . . . . . . . . . . . . . . . . . . . . 25 | |
92 | 4.2. Relative Reference . . . . . . . . . . . . . . . . . . . 26 | |
93 | 4.3. Absolute URI . . . . . . . . . . . . . . . . . . . . . . 27 | |
94 | 4.4. Same-Document Reference . . . . . . . . . . . . . . . . 27 | |
95 | 4.5. Suffix Reference . . . . . . . . . . . . . . . . . . . . 27 | |
96 | 5. Reference Resolution . . . . . . . . . . . . . . . . . . . . . 28 | |
97 | 5.1. Establishing a Base URI . . . . . . . . . . . . . . . . 28 | |
98 | 5.1.1. Base URI Embedded in Content . . . . . . . . . . 29 | |
99 | 5.1.2. Base URI from the Encapsulating Entity . . . . . 29 | |
100 | 5.1.3. Base URI from the Retrieval URI . . . . . . . . 30 | |
101 | 5.1.4. Default Base URI . . . . . . . . . . . . . . . . 30 | |
102 | 5.2. Relative Resolution . . . . . . . . . . . . . . . . . . 30 | |
103 | 5.2.1. Pre-parse the Base URI . . . . . . . . . . . . . 31 | |
104 | 5.2.2. Transform References . . . . . . . . . . . . . . 31 | |
105 | 5.2.3. Merge Paths . . . . . . . . . . . . . . . . . . 32 | |
106 | 5.2.4. Remove Dot Segments . . . . . . . . . . . . . . 33 | |
107 | 5.3. Component Recomposition . . . . . . . . . . . . . . . . 35 | |
108 | 5.4. Reference Resolution Examples . . . . . . . . . . . . . 35 | |
109 | 5.4.1. Normal Examples . . . . . . . . . . . . . . . . 36 | |
110 | 5.4.2. Abnormal Examples . . . . . . . . . . . . . . . 36 | |
119 | 6. Normalization and Comparison . . . . . . . . . . . . . . . . . 38 | |
120 | 6.1. Equivalence . . . . . . . . . . . . . . . . . . . . . . 38 | |
121 | 6.2. Comparison Ladder . . . . . . . . . . . . . . . . . . . 39 | |
122 | 6.2.1. Simple String Comparison . . . . . . . . . . . . 39 | |
123 | 6.2.2. Syntax-Based Normalization . . . . . . . . . . . 40 | |
124 | 6.2.3. Scheme-Based Normalization . . . . . . . . . . . 41 | |
125 | 6.2.4. Protocol-Based Normalization . . . . . . . . . . 42 | |
126 | 7. Security Considerations . . . . . . . . . . . . . . . . . . . 43 | |
127 | 7.1. Reliability and Consistency . . . . . . . . . . . . . . 43 | |
128 | 7.2. Malicious Construction . . . . . . . . . . . . . . . . . 43 | |
129 | 7.3. Back-End Transcoding . . . . . . . . . . . . . . . . . . 44 | |
130 | 7.4. Rare IP Address Formats . . . . . . . . . . . . . . . . 45 | |
131 | 7.5. Sensitive Information . . . . . . . . . . . . . . . . . 45 | |
132 | 7.6. Semantic Attacks . . . . . . . . . . . . . . . . . . . . 45 | |
133 | 8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 46 | |
134 | 9. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 46 | |
135 | 10. References . . . . . . . . . . . . . . . . . . . . . . . . . . 46 | |
136 | 10.1. Normative References . . . . . . . . . . . . . . . . . . 46 | |
137 | 10.2. Informative References . . . . . . . . . . . . . . . . . 47 | |
138 | A. Collected ABNF for URI . . . . . . . . . . . . . . . . . . . . 49 | |
139 | B. Parsing a URI Reference with a Regular Expression . . . . . . 50 | |
140 | C. Delimiting a URI in Context . . . . . . . . . . . . . . . . . 51 | |
141 | D. Changes from RFC 2396 . . . . . . . . . . . . . . . . . . . . 53 | |
142 | D.1. Additions . . . . . . . . . . . . . . . . . . . . . . . 53 | |
143 | D.2. Modifications . . . . . . . . . . . . . . . . . . . . . 53 | |
144 | Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 | |
145 | Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 60 | |
146 | Full Copyright Statement . . . . . . . . . . . . . . . . . . . . . 61 | |
147 | ||
174 | ||
175 | 1. Introduction | |
176 | ||
177 | A Uniform Resource Identifier (URI) provides a simple and extensible | |
178 | means for identifying a resource. This specification of URI syntax | |
179 | and semantics is derived from concepts introduced by the World Wide | |
180 | Web global information initiative, whose use of these identifiers | |
181 | dates from 1990 and is described in "Universal Resource Identifiers | |
182 | in WWW" [RFC1630]. The syntax is designed to meet the | |
183 | recommendations laid out in "Functional Recommendations for Internet | |
184 | Resource Locators" [RFC1736] and "Functional Requirements for Uniform | |
185 | Resource Names" [RFC1737]. | |
186 | ||
187 | This document obsoletes [RFC2396], which merged "Uniform Resource | |
188 | Locators" [RFC1738] and "Relative Uniform Resource Locators" | |
189 | [RFC1808] in order to define a single, generic syntax for all URIs. | |
190 | It obsoletes [RFC2732], which introduced syntax for an IPv6 address. | |
191 | It excludes portions of RFC 1738 that defined the specific syntax of | |
192 | individual URI schemes; those portions will be updated as separate | |
193 | documents. The process for registration of new URI schemes is | |
194 | defined separately by [BCP35]. Advice for designers of new URI | |
195 | schemes can be found in [RFC2718]. All significant changes from RFC | |
196 | 2396 are noted in Appendix D. | |
197 | ||
198 | This specification uses the terms "character" and "coded character | |
199 | set" in accordance with the definitions provided in [BCP19], and | |
200 | "character encoding" in place of what [BCP19] refers to as a | |
201 | "charset". | |
202 | ||
203 | 1.1. Overview of URIs | |
204 | ||
205 | URIs are characterized as follows: | |
206 | ||
207 | Uniform | |
208 | ||
209 | Uniformity provides several benefits. It allows different types | |
210 | of resource identifiers to be used in the same context, even when | |
211 | the mechanisms used to access those resources may differ. It | |
212 | allows uniform semantic interpretation of common syntactic | |
213 | conventions across different types of resource identifiers. It | |
214 | allows introduction of new types of resource identifiers without | |
215 | interfering with the way that existing identifiers are used. It | |
216 | allows the identifiers to be reused in many different contexts, | |
217 | thus permitting new applications or protocols to leverage a pre- | |
218 | existing, large, and widely used set of resource identifiers. | |
219 | ||
231 | Resource | |
232 | ||
233 | This specification does not limit the scope of what might be a | |
234 | resource; rather, the term "resource" is used in a general sense | |
235 | for whatever might be identified by a URI. Familiar examples | |
236 | include an electronic document, an image, a source of information | |
237 | with a consistent purpose (e.g., "today's weather report for Los | |
238 | Angeles"), a service (e.g., an HTTP-to-SMS gateway), and a | |
239 | collection of other resources. A resource is not necessarily | |
240 | accessible via the Internet; e.g., human beings, corporations, and | |
241 | bound books in a library can also be resources. Likewise, | |
242 | abstract concepts can be resources, such as the operators and | |
243 | operands of a mathematical equation, the types of a relationship | |
244 | (e.g., "parent" or "employee"), or numeric values (e.g., zero, | |
245 | one, and infinity). | |
246 | ||
247 | Identifier | |
248 | ||
249 | An identifier embodies the information required to distinguish | |
250 | what is being identified from all other things within its scope of | |
251 | identification. Our use of the terms "identify" and "identifying" | |
252 | refers to this purpose of distinguishing one resource from all | |
253 | other resources, regardless of how that purpose is accomplished | |
254 | (e.g., by name, address, or context). These terms should not be | |
255 | mistaken as an assumption that an identifier defines or embodies | |
256 | the identity of what is referenced, though that may be the case | |
257 | for some identifiers. Nor should it be assumed that a system | |
258 | using URIs will access the resource identified: in many cases, | |
259 | URIs are used to denote resources without any intention that they | |
260 | be accessed. Likewise, the "one" resource identified might not be | |
261 | singular in nature (e.g., a resource might be a named set or a | |
262 | mapping that varies over time). | |
263 | ||
264 | A URI is an identifier consisting of a sequence of characters | |
265 | matching the syntax rule named <URI> in Section 3. It enables | |
266 | uniform identification of resources via a separately defined | |
267 | extensible set of naming schemes (Section 3.1). How that | |
268 | identification is accomplished, assigned, or enabled is delegated to | |
269 | each scheme specification. | |
270 | ||
271 | This specification does not place any limits on the nature of a | |
272 | resource, the reasons why an application might seek to refer to a | |
273 | resource, or the kinds of systems that might use URIs for the sake of | |
274 | identifying resources. This specification does not require that a | |
275 | URI persists in identifying the same resource over time, though that | |
276 | is a common goal of all URI schemes. Nevertheless, nothing in this | |
287 | specification prevents an application from limiting itself to | |
288 | particular types of resources, or to a subset of URIs that maintains | |
289 | characteristics desired by that application. | |
290 | ||
291 | URIs have a global scope and are interpreted consistently regardless | |
292 | of context, though the result of that interpretation may be in | |
293 | relation to the end-user's context. For example, "http://localhost/" | |
294 | has the same interpretation for every user of that reference, even | |
295 | though the network interface corresponding to "localhost" may be | |
296 | different for each end-user: interpretation is independent of access. | |
297 | However, an action made on the basis of that reference will take | |
298 | place in relation to the end-user's context, which implies that an | |
299 | action intended to refer to a globally unique thing must use a URI | |
300 | that distinguishes that resource from all other things. URIs that | |
301 | identify in relation to the end-user's local context should only be | |
302 | used when the context itself is a defining aspect of the resource, | |
303 | such as when an on-line help manual refers to a file on the end- | |
304 | user's file system (e.g., "file:///etc/hosts"). | |
305 | ||
306 | 1.1.1. Generic Syntax | |
307 | ||
308 | Each URI begins with a scheme name, as defined in Section 3.1, that | |
309 | refers to a specification for assigning identifiers within that | |
310 | scheme. As such, the URI syntax is a federated and extensible naming | |
311 | system wherein each scheme's specification may further restrict the | |
312 | syntax and semantics of identifiers using that scheme. | |
313 | ||
314 | This specification defines those elements of the URI syntax that are | |
315 | required of all URI schemes or are common to many URI schemes. It | |
316 | thus defines the syntax and semantics needed to implement a scheme- | |
317 | independent parsing mechanism for URI references, by which the | |
318 | scheme-dependent handling of a URI can be postponed until the | |
319 | scheme-dependent semantics are needed. Likewise, protocols and data | |
320 | formats that make use of URI references can refer to this | |
321 | specification as a definition for the range of syntax allowed for all | |
322 | URIs, including those schemes that have yet to be defined. This | |
323 | decouples the evolution of identification schemes from the evolution | |
324 | of protocols, data formats, and implementations that make use of | |
325 | URIs. | |
326 | ||
327 | A parser of the generic URI syntax can parse any URI reference into | |
328 | its major components. Once the scheme is determined, further | |
329 | scheme-specific parsing can be performed on the components. In other | |
330 | words, the URI generic syntax is a superset of the syntax of all URI | |
331 | schemes. | |
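
As a rough illustration of such a scheme-independent parser, the following Python sketch splits a URI reference into its major components using the regular expression given in Appendix B of this specification. The names `split_uri_reference` and `UriParts` are illustrative assumptions, not part of the specification, and the pattern only decomposes its input; it does not validate it.

```python
import re
from typing import NamedTuple, Optional

# Regular expression from Appendix B of RFC 3986. Every group is optional,
# so the pattern matches any string; it breaks input down, it does not validate.
_URI_REFERENCE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


class UriParts(NamedTuple):  # hypothetical container name
    scheme: Optional[str]
    authority: Optional[str]
    path: str
    query: Optional[str]
    fragment: Optional[str]


def split_uri_reference(reference: str) -> UriParts:
    """Split a URI reference into generic components without consulting the scheme."""
    match = _URI_REFERENCE.match(reference)
    assert match is not None  # the pattern cannot fail to match
    return UriParts(
        scheme=match.group(2),
        authority=match.group(4),
        path=match.group(5),
        query=match.group(7),
        fragment=match.group(9),
    )


if __name__ == "__main__":
    print(split_uri_reference("ldap://[2001:db8::7]/c=GB?objectClass?one"))
```

Absent components come back as `None` rather than as empty strings, which keeps "no query" distinguishable from "empty query". Scheme-specific processing can then be applied to the individual fields once the scheme is known.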
332 | ||
343 | 1.1.2. Examples | |
344 | ||
345 | The following example URIs illustrate several URI schemes and | |
346 | variations in their common syntax components: | |
347 | ||
348 | ftp://ftp.is.co.za/rfc/rfc1808.txt | |
349 | ||
350 | http://www.ietf.org/rfc/rfc2396.txt | |
351 | ||
352 | ldap://[2001:db8::7]/c=GB?objectClass?one | |
353 | ||
354 | mailto:John.Doe@example.com | |
355 | ||
356 | news:comp.infosystems.www.servers.unix | |
357 | ||
358 | tel:+1-816-555-1212 | |
359 | ||
360 | telnet://192.0.2.16:80/ | |
361 | ||
362 | urn:oasis:names:specification:docbook:dtd:xml:4.1.2 | |
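
To see how the common components vary across these examples, one could run them through a generic splitter. The sketch below uses `urllib.parse.urlsplit` from the Python standard library, which is one existing implementation and not the parser defined by this specification; it reports absent components as empty strings. The helper name `has_authority` is an assumption made for illustration.

```python
from urllib.parse import urlsplit

# The example URIs listed above.
EXAMPLES = [
    "ftp://ftp.is.co.za/rfc/rfc1808.txt",
    "http://www.ietf.org/rfc/rfc2396.txt",
    "ldap://[2001:db8::7]/c=GB?objectClass?one",
    "mailto:John.Doe@example.com",
    "news:comp.infosystems.www.servers.unix",
    "tel:+1-816-555-1212",
    "telnet://192.0.2.16:80/",
    "urn:oasis:names:specification:docbook:dtd:xml:4.1.2",
]


def has_authority(uri: str) -> bool:
    """Return True when urlsplit finds a non-empty authority after "//"."""
    return urlsplit(uri).netloc != ""


if __name__ == "__main__":
    for uri in EXAMPLES:
        parts = urlsplit(uri)
        print(f"{parts.scheme:7} authority={parts.netloc!r:20} path={parts.path!r}")
```

Under these assumptions, the "ftp", "http", "ldap", and "telnet" examples carry an authority component, while "mailto", "news", "tel", and "urn" place everything after the scheme in the path.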
363 | ||
364 | ||
365 | 1.1.3. URI, URL, and URN | |
366 | ||
367 | A URI can be further classified as a locator, a name, or both. The | |
368 | term "Uniform Resource Locator" (URL) refers to the subset of URIs | |
369 | that, in addition to identifying a resource, provide a means of | |
370 | locating the resource by describing its primary access mechanism | |
371 | (e.g., its network "location"). The term "Uniform Resource Name" | |
372 | (URN) has been used historically to refer to both URIs under the | |
373 | "urn" scheme [RFC2141], which are required to remain globally unique | |
374 | and persistent even when the resource ceases to exist or becomes | |
375 | unavailable, and to any other URI with the properties of a name. | |
376 | ||
377 | An individual scheme does not have to be classified as being just one | |
378 | of "name" or "locator". Instances of URIs from any given scheme may | |
379 | have the characteristics of names or locators or both, often | |
380 | depending on the persistence and care in the assignment of | |
381 | identifiers by the naming authority, rather than on any quality of | |
382 | the scheme. Future specifications and related documentation should | |
383 | use the general term "URI" rather than the more restrictive terms | |
384 | "URL" and "URN" [RFC3305]. | |
385 | ||
399 | 1.2. Design Considerations | |
400 | ||
401 | 1.2.1. Transcription | |
402 | ||
403 | The URI syntax has been designed with global transcription as one of | |
404 | its main considerations. A URI is a sequence of characters from a | |
405 | very limited set: the letters of the basic Latin alphabet, digits, | |
406 | and a few special characters. A URI may be represented in a variety | |
407 | of ways; e.g., ink on paper, pixels on a screen, or a sequence of | |
408 | character encoding octets. The interpretation of a URI depends only | |
409 | on the characters used and not on how those characters are | |
410 | represented in a network protocol. | |
411 | ||
412 | The goal of transcription can be described by a simple scenario. | |
413 | Imagine two colleagues, Sam and Kim, sitting in a pub at an | |
414 | international conference and exchanging research ideas. Sam asks Kim | |
415 | for a location to get more information, so Kim writes the URI for the | |
416 | research site on a napkin. Upon returning home, Sam takes out the | |
417 | napkin and types the URI into a computer, which then retrieves the | |
418 | information to which Kim referred. | |
419 | ||
420 | There are several design considerations revealed by the scenario: | |
421 | ||
422 | o A URI is a sequence of characters that is not always represented | |
423 | as a sequence of octets. | |
424 | ||
425 | o A URI might be transcribed from a non-network source and thus | |
426 | should consist of characters that are most likely able to be | |
427 | entered into a computer, within the constraints imposed by | |
428 | keyboards (and related input devices) across languages and | |
429 | locales. | |
430 | ||
431 | o A URI often has to be remembered by people, and it is easier for | |
432 | people to remember a URI when it consists of meaningful or | |
433 | familiar components. | |
434 | ||
435 | These design considerations are not always in alignment. For | |
436 | example, it is often the case that the most meaningful name for a URI | |
437 | component would require characters that cannot be typed into some | |
438 | systems. The ability to transcribe a resource identifier from one | |
439 | medium to another has been considered more important than having a | |
440 | URI consist of the most meaningful of components. | |
441 | ||
442 | In local or regional contexts and with improving technology, users | |
443 | might benefit from being able to use a wider range of characters; | |
444 | such use is not defined by this specification. Percent-encoded | |
445 | octets (Section 2.1) may be used within a URI to represent characters | |
446 | outside the range of the US-ASCII coded character set if this | |
455 | representation is allowed by the scheme or by the protocol element in | |
456 | which the URI is referenced. Such a definition should specify the | |
457 | character encoding used to map those characters to octets prior to | |
458 | being percent-encoded for the URI. | |
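
The following Python sketch suggests why the character encoding has to be specified: the same character yields different percent-encoded octets under different encodings. It uses `quote` and `unquote` from the standard `urllib.parse` module; the choice of "é" and of the UTF-8 and Latin-1 encodings is only an illustrative assumption.

```python
from urllib.parse import quote, unquote

text = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE, outside US-ASCII

# Map the character to octets with a chosen encoding, then percent-encode them.
utf8_form = quote(text, safe="", encoding="utf-8")      # two octets: C3 A9
latin1_form = quote(text, safe="", encoding="latin-1")  # one octet: E9

if __name__ == "__main__":
    print(utf8_form, latin1_form)
    # The character is only recovered when the same encoding is assumed.
    print(unquote(utf8_form, encoding="utf-8") == text)
    print(unquote(latin1_form, encoding="latin-1") == text)
```

A receiver that assumes a different encoding from the producer would recover different characters from the same octets, which is why the scheme or protocol element needs to state the mapping.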
459 | ||
460 | 1.2.2. Separating Identification from Interaction | |
461 | ||
462 | A common misunderstanding of URIs is that they are only used to refer | |
463 | to accessible resources. The URI itself only provides | |
464 | identification; access to the resource is neither guaranteed nor | |
465 | implied by the presence of a URI. Instead, any operation associated | |
466 | with a URI reference is defined by the protocol element, data format | |
467 | attribute, or natural language text in which it appears. | |
468 | ||
469 | Given a URI, a system may attempt to perform a variety of operations | |
470 | on the resource, as might be characterized by words such as "access", | |
471 | "update", "replace", or "find attributes". Such operations are | |
472 | defined by the protocols that make use of URIs, not by this | |
473 | specification. However, we do use a few general terms for describing | |
474 | common operations on URIs. URI "resolution" is the process of | |
475 | determining an access mechanism and the appropriate parameters | |
476 | necessary to dereference a URI; this resolution may require several | |
477 | iterations. To use that access mechanism to perform an action on the | |
478 | URI's resource is to "dereference" the URI. | |
479 | ||
480 | When URIs are used within information retrieval systems to identify | |
481 | sources of information, the most common form of URI dereference is | |
482 | "retrieval": making use of a URI in order to retrieve a | |
483 | representation of its associated resource. A "representation" is a | |
484 | sequence of octets, along with representation metadata describing | |
485 | those octets, that constitutes a record of the state of the resource | |
486 | at the time when the representation is generated. Retrieval is | |
487 | achieved by a process that might include using the URI as a cache key | |
488 | to check for a locally cached representation, resolution of the URI | |
489 | to determine an appropriate access mechanism (if any), and | |
490 | dereference of the URI for the sake of applying a retrieval | |
491 | operation. Depending on the protocols used to perform the retrieval, | |
492 | additional information might be supplied about the resource (resource | |
493 | metadata) and its relation to other resources. | |
494 | ||
495 | URI references in information retrieval systems are designed to be | |
496 | late-binding: the result of an access is generally determined when it | |
497 | is accessed and may vary over time or due to other aspects of the | |
498 | interaction. These references are created in order to be used in the | |
499 | future: what is being identified is not some specific result that was | |
500 | obtained in the past, but rather some characteristic that is expected | |
501 | to be true for future results. In such cases, the resource referred | |
502 | to by the URI is actually a sameness of characteristics as observed | |
511 | over time, perhaps elucidated by additional comments or assertions | |
512 | made by the resource provider. | |
513 | ||
514 | Although many URI schemes are named after protocols, this does not | |
515 | imply that use of these URIs will result in access to the resource | |
516 | via the named protocol. URIs are often used simply for the sake of | |
517 | identification. Even when a URI is used to retrieve a representation | |
518 | of a resource, that access might be through gateways, proxies, | |
519 | caches, and name resolution services that are independent of the | |
520 | protocol associated with the scheme name. The resolution of some | |
521 | URIs may require the use of more than one protocol (e.g., both DNS | |
522 | and HTTP are typically used to access an "http" URI's origin server | |
523 | when a representation isn't found in a local cache). | |
524 | ||
525 | 1.2.3. Hierarchical Identifiers | |
526 | ||
527 | The URI syntax is organized hierarchically, with components listed in | |
528 | order of decreasing significance from left to right. For some URI | |
529 | schemes, the visible hierarchy is limited to the scheme itself: | |
530 | everything after the scheme component delimiter (":") is considered | |
531 | opaque to URI processing. Other URI schemes make the hierarchy | |
532 | explicit and visible to generic parsing algorithms. | |
533 | ||
534 | The generic syntax uses the slash ("/"), question mark ("?"), and | |
535 | number sign ("#") characters to delimit components that are | |
536 | significant to the generic parser's hierarchical interpretation of an | |
537 | identifier. In addition to aiding the readability of such | |
538 | identifiers through the consistent use of familiar syntax, this | |
539 | uniform representation of hierarchy across naming schemes allows | |
540 | scheme-independent references to be made relative to that hierarchy. | |
541 | ||
542 | It is often the case that a group or "tree" of documents has been | |
543 | constructed to serve a common purpose, wherein the vast majority of | |
544 | URI references in these documents point to resources within the tree | |
545 | rather than outside it. Similarly, documents located at a particular | |
546 | site are much more likely to refer to other resources at that site | |
547 | than to resources at remote sites. Relative referencing of URIs | |
548 | allows document trees to be partially independent of their location | |
549 | and access scheme. For instance, it is possible for a single set of | |
550 | hypertext documents to be simultaneously accessible and traversable | |
551 | via each of the "file", "http", and "ftp" schemes if the documents | |
552 | refer to each other with relative references. Furthermore, such | |
553 | document trees can be moved, as a whole, without changing any of the | |
554 | relative references. | |
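
A small Python sketch can make this location independence concrete. It resolves one relative reference against base URIs in the "http", "ftp", and "file" schemes using `urllib.parse.urljoin` from the standard library, which is one implementation of reference resolution and is used here only for illustration. The host names, paths, and the helper name `resolve_all` are hypothetical.

```python
from typing import List
from urllib.parse import urljoin

RELATIVE = "../b/page.html"  # hypothetical reference inside a document tree

# The same document tree, reachable under three different schemes (assumed names).
BASES = [
    "http://www.example.com/docs/a/index.html",
    "ftp://ftp.example.com/docs/a/index.html",
    "file:///srv/docs/a/index.html",
]


def resolve_all(relative: str) -> List[str]:
    """Resolve one relative reference against every base URI."""
    return [urljoin(base, relative) for base in BASES]


if __name__ == "__main__":
    for target in resolve_all(RELATIVE):
        print(target)
```

The relative reference itself never changes; only the base URI differs, so the tree can be moved or served through another scheme without editing its documents.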
555 | ||
556 | A relative reference (Section 4.2) refers to a resource by describing | |
557 | the difference within a hierarchical name space between the reference | |
558 | context and the target URI. The reference resolution algorithm, | |
567 | presented in Section 5, defines how such a reference is transformed | |
568 | to the target URI. As relative references can only be used within | |
569 | the context of a hierarchical URI, designers of new URI schemes | |
570 | should use a syntax consistent with the generic syntax's hierarchical | |
571 | components unless there are compelling reasons to forbid relative | |
572 | referencing within that scheme. | |
573 | ||
574 | NOTE: Previous specifications used the terms "partial URI" and | |
575 | "relative URI" to denote a relative reference to a URI. As some | |
576 | readers misunderstood those terms to mean that relative URIs are a | |
577 | subset of URIs rather than a method of referencing URIs, this | |
578 | specification simply refers to them as relative references. | |
579 | ||
580 | All URI references are parsed by generic syntax parsers when used. | |
581 | However, because hierarchical processing has no effect on an absolute | |
582 | URI used in a reference unless it contains one or more dot-segments | |
583 | (complete path segments of "." or "..", as described in Section 3.3), | |
584 | URI scheme specifications can define opaque identifiers by | |
585 | disallowing use of slash characters, question mark characters, and | |
586 | the URIs "scheme:." and "scheme:..". | |
587 | ||
588 | 1.3. Syntax Notation | |
589 | ||
590 | This specification uses the Augmented Backus-Naur Form (ABNF) | |
591 | notation of [RFC2234], including the following core ABNF syntax rules | |
592 | defined by that specification: ALPHA (letters), CR (carriage return), | |
593 | DIGIT (decimal digits), DQUOTE (double quote), HEXDIG (hexadecimal | |
594 | digits), LF (line feed), and SP (space). The complete URI syntax is | |
595 | collected in Appendix A. | |
596 | ||
597 | 2. Characters | |
598 | ||
599 | The URI syntax provides a method of encoding data, presumably for the | |
600 | sake of identifying a resource, as a sequence of characters. The URI | |
601 | characters are, in turn, frequently encoded as octets for transport | |
602 | or presentation. This specification does not mandate any particular | |
603 | character encoding for mapping between URI characters and the octets | |
604 | used to store or transmit those characters. When a URI appears in a | |
605 | protocol element, the character encoding is defined by that protocol; | |
606 | without such a definition, a URI is assumed to be in the same | |
607 | character encoding as the surrounding text. | |
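
The distinction between URI characters and the octets that carry them can be sketched in Python as follows. The URI and the two encodings are arbitrary choices for illustration; the point is only that different octet sequences can represent the same sequence of URI characters.

```python
uri = "http://www.example.com/"  # hypothetical URI

# Two different octet representations of the same characters.
ascii_octets = uri.encode("ascii")
utf16_octets = uri.encode("utf-16-be")

if __name__ == "__main__":
    print(len(ascii_octets), len(utf16_octets))
    # Decoding with the matching encoding yields the identical URI.
    print(ascii_octets.decode("ascii") == utf16_octets.decode("utf-16-be"))
```

Comparing or interpreting URIs therefore operates on the decoded characters, not on whichever octets happened to transport them.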
608 | ||
609 | The ABNF notation defines its terminal values to be non-negative | |
610 | integers (codepoints) based on the US-ASCII coded character set | |
611 | [ASCII]. Because a URI is a sequence of characters, we must invert | |
612 | that relation in order to understand the URI syntax. Therefore, the | |
623 | integer values used by the ABNF must be mapped back to their | |
624 | corresponding characters via US-ASCII in order to complete the syntax | |
625 | rules. | |
626 | ||
627 | A URI is composed from a limited set of characters consisting of | |
628 | digits, letters, and a few graphic symbols. A reserved subset of | |
629 | those characters may be used to delimit syntax components within a | |
630 | URI while the remaining characters, including both the unreserved set | |
631 | and those reserved characters not acting as delimiters, define each | |
632 | component's identifying data. | |
633 | ||
634 | 2.1. Percent-Encoding | |
635 | ||
636 | A percent-encoding mechanism is used to represent a data octet in a | |
637 | component when that octet's corresponding character is outside the | |
638 | allowed set or is being used as a delimiter of, or within, the | |
639 | component. A percent-encoded octet is encoded as a character | |
640 | triplet, consisting of the percent character "%" followed by the two | |
641 | hexadecimal digits representing that octet's numeric value. For | |
642 | example, "%20" is the percent-encoding for the binary octet | |
643 | "00100000" (ABNF: %x20), which in US-ASCII corresponds to the space | |
644 | character (SP). Section 2.4 describes when percent-encoding and | |
645 | decoding are applied. | |
646 | ||
647 | pct-encoded = "%" HEXDIG HEXDIG | |
648 | ||
649 | The uppercase hexadecimal digits 'A' through 'F' are equivalent to | |
650 | the lowercase digits 'a' through 'f', respectively. If two URIs | |
651 | differ only in the case of hexadecimal digits used in percent-encoded | |
652 | octets, they are equivalent. For consistency, URI producers and | |
653 | normalizers should use uppercase hexadecimal digits for all percent- | |
654 | encodings. | |
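
As an illustration only, the triplet form and the case-equivalence rule can be sketched in Python. The function names are hypothetical and not part of this specification; the sketch assumes the input is already a syntactically valid URI string.

```python
import re

def pct_encode_octet(octet: int) -> str:
    """Return the "%" HEXDIG HEXDIG triplet for one octet, in uppercase hex."""
    if not 0 <= octet <= 0xFF:
        raise ValueError("an octet must be in the range 0..255")
    return f"%{octet:02X}"

def uppercase_pct_triplets(uri: str) -> str:
    """Uppercase the hex digits of every percent-encoded triplet in a URI."""
    return re.sub(r"%[0-9A-Fa-f]{2}", lambda m: m.group(0).upper(), uri)

def same_except_pct_case(a: str, b: str) -> bool:
    """True if two URIs differ at most in the case of percent-encoding hex digits."""
    return uppercase_pct_triplets(a) == uppercase_pct_triplets(b)
```

Only the triplets are touched; the case of every other character is left alone, since other parts of a URI may be case-sensitive.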
655 | ||
656 | 2.2. Reserved Characters | |
657 | ||
658 | URIs include components and subcomponents that are delimited by | |
659 | characters in the "reserved" set. These characters are called | |
660 | "reserved" because they may (or may not) be defined as delimiters by | |
661 | the generic syntax, by each scheme-specific syntax, or by the | |
662 | implementation-specific syntax of a URI's dereferencing algorithm. | |
663 | If data for a URI component would conflict with a reserved | |
664 | character's purpose as a delimiter, then the conflicting data must be | |
665 | percent-encoded before the URI is formed. | |
666 | ||
679 | reserved = gen-delims / sub-delims | |
680 | ||
681 | gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@" | |
682 | ||
683 | sub-delims = "!" / "$" / "&" / "'" / "(" / ")" | |
684 | / "*" / "+" / "," / ";" / "=" | |
685 | ||
686 | The purpose of reserved characters is to provide a set of delimiting | |
687 | characters that are distinguishable from other data within a URI. | |
688 | URIs that differ in the replacement of a reserved character with its | |
689 | corresponding percent-encoded octet are not equivalent. Percent- | |
690 | encoding a reserved character, or decoding a percent-encoded octet | |
691 | that corresponds to a reserved character, will change how the URI is | |
692 | interpreted by most applications. Thus, characters in the reserved | |
693 | set are protected from normalization and are therefore safe to be | |
694 | used by scheme-specific and producer-specific algorithms for | |
695 | delimiting data subcomponents within a URI. | |
696 | ||
697 | A subset of the reserved characters (gen-delims) is used as | |
698 | delimiters of the generic URI components described in Section 3. A | |
699 | component's ABNF syntax rule will not use the reserved or gen-delims | |
700 | rule names directly; instead, each syntax rule lists the characters | |
701 | allowed within that component (i.e., not delimiting it), and any of | |
702 | those characters that are also in the reserved set are "reserved" for | |
703 | use as subcomponent delimiters within the component. Only the most | |
704 | common subcomponents are defined by this specification; other | |
705 | subcomponents may be defined by a URI scheme's specification, or by | |
706 | the implementation-specific syntax of a URI's dereferencing | |
707 | algorithm, provided that such subcomponents are delimited by | |
708 | characters in the reserved set allowed within that component. | |
709 | ||
710 | URI producing applications should percent-encode data octets that | |
711 | correspond to characters in the reserved set unless these characters | |
712 | are specifically allowed by the URI scheme to represent data in that | |
713 | component. If a reserved character is found in a URI component and | |
714 | no delimiting role is known for that character, then it must be | |
715 | interpreted as representing the data octet corresponding to that | |
716 | character's encoding in US-ASCII. | |
717 | ||
718 | 2.3. Unreserved Characters | |
719 | ||
720 | Characters that are allowed in a URI but do not have a reserved | |
721 | purpose are called unreserved. These include uppercase and lowercase | |
722 | letters, decimal digits, hyphen, period, underscore, and tilde. | |
723 | ||
724 | unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" | |
725 | ||
735 | URIs that differ in the replacement of an unreserved character with | |
736 | its corresponding percent-encoded US-ASCII octet are equivalent: they | |
737 | identify the same resource. However, URI comparison implementations | |
738 | do not always perform normalization prior to comparison (see Section | |
739 | 6). For consistency, percent-encoded octets in the ranges of ALPHA | |
740 | (%41-%5A and %61-%7A), DIGIT (%30-%39), hyphen (%2D), period (%2E), | |
741 | underscore (%5F), or tilde (%7E) should not be created by URI | |
742 | producers and, when found in a URI, should be decoded to their | |
743 | corresponding unreserved characters by URI normalizers. | |
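
A normalizer step of this kind might look like the following Python sketch. The name decode_unreserved is hypothetical; the unreserved set is taken from the ABNF rule above, and triplets for any other octet are left exactly as found.

```python
import re
import string

# unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

def decode_unreserved(uri: str) -> str:
    """Decode only those percent-encoded octets that map to unreserved characters."""
    def replace(match):
        char = chr(int(match.group(1), 16))
        # Reserved and non-ASCII octets stay percent-encoded.
        return char if char in UNRESERVED else match.group(0)
    return re.sub(r"%([0-9A-Fa-f]{2})", replace, uri)
```

For instance, decode_unreserved("%7Euser/%41%2Fb") yields "~user/A%2Fb": the tilde and the letter are decoded, while the encoded slash is preserved.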
744 | ||
745 | 2.4. When to Encode or Decode | |
746 | ||
747 | Under normal circumstances, the only time when octets within a URI | |
748 | are percent-encoded is during the process of producing the URI from | |
749 | its component parts. This is when an implementation determines which | |
750 | of the reserved characters are to be used as subcomponent delimiters | |
751 | and which can be safely used as data. Once produced, a URI is always | |
752 | in its percent-encoded form. | |
753 | ||
754 | When a URI is dereferenced, the components and subcomponents | |
755 | significant to the scheme-specific dereferencing process (if any) | |
756 | must be parsed and separated before the percent-encoded octets within | |
757 | those components can be safely decoded, as otherwise the data may be | |
758 | mistaken for component delimiters. The only exception is for | |
759 | percent-encoded octets corresponding to characters in the unreserved | |
760 | set, which can be decoded at any time. For example, the octet | |
761 | corresponding to the tilde ("~") character is often encoded as "%7E" | |
762 | by older URI processing implementations; the "%7E" can be replaced by | |
763 | "~" without changing its interpretation. | |
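
The ordering requirement can be illustrated with a minimal Python sketch. It assumes the standard library function urllib.parse.unquote as the decoder; both helper names are hypothetical and exist only to contrast the two orders of operation.

```python
from urllib.parse import unquote

def segments_split_then_decode(path: str) -> list:
    """Separate on the "/" delimiter first, then decode each segment."""
    return [unquote(segment) for segment in path.split("/")]

def segments_decode_then_split(path: str) -> list:
    """Wrong order: a decoded "%2F" can no longer be told apart from a delimiter."""
    return unquote(path).split("/")
```

For the path "/a%2Fb/c", the first function returns the segments "", "a/b", and "c", whereas the second returns "", "a", "b", and "c", mistaking data for a delimiter.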
764 | ||
765 | Because the percent ("%") character serves as the indicator for | |
766 | percent-encoded octets, it must be percent-encoded as "%25" for that | |
767 | octet to be used as data within a URI. Implementations must not | |
768 | percent-encode or decode the same string more than once, as decoding | |
769 | an already decoded string might lead to misinterpreting a percent | |
770 | data octet as the beginning of a percent-encoding, or vice versa in | |
771 | the case of percent-encoding an already percent-encoded string. | |
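
Both failure modes can be reproduced with a short Python sketch, assuming the standard library functions urllib.parse.quote and urllib.parse.unquote stand in for an implementation's encoder and decoder. The helper names are hypothetical.

```python
from urllib.parse import quote, unquote

def encode_once(data: str) -> str:
    """Percent-encode data a single time, as a URI producer should."""
    return quote(data, safe="")

def encode_twice(data: str) -> str:
    """Wrong: the "%" introduced by the first pass is itself encoded again."""
    return quote(quote(data, safe=""), safe="")

def decode_twice(encoded: str) -> str:
    """Wrong: a literal "%" recovered by the first pass is read as an indicator."""
    return unquote(unquote(encoded))
```

With the data string "%41", a single encoding gives "%2541" and a single decoding restores "%41"; decoding a second time produces "A", which is not the original data.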
772 | ||
773 | 2.5. Identifying Data | |
774 | ||
775 | URI characters provide identifying data for each of the URI | |
776 | components, serving as an external interface for identification | |
777 | between systems. Although the presence and nature of the URI | |
778 | production interface are hidden from clients that use its URIs (and are | |
779 | thus beyond the scope of the interoperability requirements defined by | |
780 | this specification), it is a frequent source of confusion and errors | |
781 | in the interpretation of URI character issues. Implementers have to | |
782 | be aware that there are multiple character encodings involved in the | |
791 | production and transmission of URIs: local name and data encoding, | |
792 | public interface encoding, URI character encoding, data format | |
793 | encoding, and protocol encoding. | |
794 | ||
795 | Local names, such as file system names, are stored with a local | |
796 | character encoding. URI producing applications (e.g., origin | |
797 | servers) will typically use the local encoding as the basis for | |
798 | producing meaningful names. The URI producer will transform the | |
799 | local encoding to one that is suitable for a public interface and | |
800 | then transform the public interface encoding into the restricted set | |
801 | of URI characters (reserved, unreserved, and percent-encodings). | |
802 | Those characters are, in turn, encoded as octets to be used as a | |
803 | reference within a data format (e.g., a document charset), and such | |
804 | data formats are often subsequently encoded for transmission over | |
805 | Internet protocols. | |
806 | ||
807 | For most systems, an unreserved character appearing within a URI | |
808 | component is interpreted as representing the data octet corresponding | |
809 | to that character's encoding in US-ASCII. Consumers of URIs assume | |
810 | that the letter "X" corresponds to the octet "01011000", and even | |
811 | when that assumption is incorrect, there is no harm in making it. A | |
812 | system that internally provides identifiers in the form of a | |
813 | different character encoding, such as EBCDIC, will generally perform | |
814 | character translation of textual identifiers to UTF-8 [STD63] (or | |
815 | some other superset of the US-ASCII character encoding) at an | |
816 | internal interface, thereby providing more meaningful identifiers | |
817 | than those resulting from simply percent-encoding the original | |
818 | octets. | |
819 | ||
820 | For example, consider an information service that provides data, | |
821 | stored locally using an EBCDIC-based file system, to clients on the | |
822 | Internet through an HTTP server. When an author creates a file with | |
823 | the name "Laguna Beach" on that file system, the "http" URI | |
824 | corresponding to that resource is expected to contain the meaningful | |
825 | string "Laguna%20Beach". If, however, that server produces URIs by | |
826 | using an overly simplistic raw octet mapping, then the result would | |
827 | be a URI containing "%D3%81%87%A4%95%81@%C2%85%81%83%88". An | |
828 | internal transcoding interface fixes this problem by transcoding the | |
829 | local name to a superset of US-ASCII prior to producing the URI. | |
830 | Naturally, proper interpretation of an incoming URI on such an | |
831 | interface requires that percent-encoded octets be decoded (e.g., | |
832 | "%20" to SP) before the reverse transcoding is applied to obtain the | |
833 | local name. | |
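
The two outcomes in this example can be reproduced with a Python sketch. It assumes that the EBCDIC file system uses code page 037 (Python codec "cp037") and that UTF-8 is the superset of US-ASCII chosen for transcoding; neither choice is stated by this specification, and the function names are hypothetical.

```python
from urllib.parse import quote, unquote_to_bytes

def raw_octet_mapping(local_name: str) -> str:
    """Overly simplistic: percent-encode the EBCDIC octets directly."""
    ebcdic_octets = local_name.encode("cp037")
    # "@" is kept unencoded only to reproduce the string shown in the text;
    # octet 0x40 is the EBCDIC space but reads as "@" in US-ASCII.
    return quote(ebcdic_octets, safe="@")

def transcoded_mapping(local_name: str) -> str:
    """Transcode to UTF-8 first, then percent-encode what is not unreserved."""
    return quote(local_name.encode("utf-8"), safe="")

def local_octets_from_uri(uri_text: str) -> bytes:
    """Incoming direction: decode percent-encodings, then reverse the transcoding."""
    return unquote_to_bytes(uri_text).decode("utf-8").encode("cp037")
```

For the name "Laguna Beach", raw_octet_mapping gives "%D3%81%87%A4%95%81@%C2%85%81%83%88" and transcoded_mapping gives "Laguna%20Beach".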
834 | ||
835 | In some cases, the internal interface between a URI component and the | |
836 | identifying data that it has been crafted to represent is much less | |
837 | direct than a character encoding translation. For example, portions | |
838 | of a URI might reflect a query on non-ASCII data, or numeric | |
847 | coordinates on a map. Likewise, a URI scheme may define components | |
848 | with additional encoding requirements that are applied prior to | |
849 | forming the component and producing the URI. | |
850 | ||
851 | When a new URI scheme defines a component that represents textual | |
852 | data consisting of characters from the Universal Character Set [UCS], | |
853 | the data should first be encoded as octets according to the UTF-8 | |
854 | character encoding [STD63]; then only those octets that do not | |
855 | correspond to characters in the unreserved set should be percent- | |
856 | encoded. For example, the character A would be represented as "A", | |
857 | the character LATIN CAPITAL LETTER A WITH GRAVE would be represented | |
858 | as "%C3%80", and the character KATAKANA LETTER A would be represented | |
859 | as "%E3%82%A2". | |
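
The three representations can be checked with a few lines of Python, assuming urllib.parse.quote as the encoder. The helper name is hypothetical; passing safe="" means every octet outside the unreserved set is percent-encoded.

```python
from urllib.parse import quote

def encode_ucs_text(text: str) -> str:
    """Encode text as UTF-8 octets, then percent-encode all but unreserved characters."""
    return quote(text, safe="", encoding="utf-8")

LATIN_CAPITAL_A = "A"
LATIN_CAPITAL_A_WITH_GRAVE = "\u00c0"
KATAKANA_LETTER_A = "\u30a2"
```

Applied to the three characters named above, encode_ucs_text returns "A", "%C3%80", and "%E3%82%A2", respectively.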
860 | ||
861 | 3. Syntax Components | |
862 | ||
863 | The generic URI syntax consists of a hierarchical sequence of | |
864 | components referred to as the scheme, authority, path, query, and | |
865 | fragment. | |
866 | ||
867 | URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ] | |
868 | ||
869 | hier-part = "//" authority path-abempty | |
870 | / path-absolute | |
871 | / path-rootless | |
872 | / path-empty | |
873 | ||
874 | The scheme and path components are required, though the path may be | |
875 | empty (no characters). When authority is present, the path must | |
876 | either be empty or begin with a slash ("/") character. When | |
877 | authority is not present, the path cannot begin with two slash | |
878 | characters ("//"). These restrictions result in five different ABNF | |
879 | rules for a path (Section 3.3), only one of which will match any | |
880 | given URI reference. | |
881 | ||
882 | The following are two example URIs and their component parts: | |
883 | ||
884 |          foo://example.com:8042/over/there?name=ferret#nose | |
885 |          \_/   \______________/\_________/ \_________/ \__/ | |
886 |           |           |            |            |        | | |
887 |        scheme     authority       path        query   fragment | |
888 |           |   _____________________|__ | |
889 |          / \ /                        \ | |
890 |          urn:example:animal:ferret:nose | |
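
The same decomposition can be obtained programmatically. The Python sketch below uses the regular expression given in Appendix B of this specification for separating a URI reference into its components; the function name and the dictionary layout are hypothetical. It performs no validation.

```python
import re

# Regular expression from Appendix B; groups 2, 4, 5, 7, and 9 hold the components.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

def split_uri(reference: str) -> dict:
    """Split a URI reference into its five generic components."""
    match = URI_RE.match(reference)
    return {
        "scheme": match.group(2),
        "authority": match.group(4),   # None when no "//" is present
        "path": match.group(5),        # always defined, possibly empty
        "query": match.group(7),
        "fragment": match.group(9),
    }
```

A component that is absent is reported as None, which keeps it distinct from a component that is present but empty.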
891 | ||
903 | 3.1. Scheme | |
904 | ||
905 | Each URI begins with a scheme name that refers to a specification for | |
906 | assigning identifiers within that scheme. As such, the URI syntax is | |
907 | a federated and extensible naming system wherein each scheme's | |
908 | specification may further restrict the syntax and semantics of | |
909 | identifiers using that scheme. | |
910 | ||
911 | Scheme names consist of a sequence of characters beginning with a | |
912 | letter and followed by any combination of letters, digits, plus | |
913 | ("+"), period ("."), or hyphen ("-"). Although schemes are case- | |
914 | insensitive, the canonical form is lowercase and documents that | |
915 | specify schemes must do so with lowercase letters. An implementation | |
916 | should accept uppercase letters as equivalent to lowercase in scheme | |
917 | names (e.g., allow "HTTP" as well as "http") for the sake of | |
918 | robustness but should only produce lowercase scheme names for | |
919 | consistency. | |
920 | ||
921 | scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ) | |
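
A direct translation of this rule into Python might look as follows. The function names are hypothetical; the sketch accepts either case on input and produces lowercase on output, in line with the advice above.

```python
import re

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+\-.]*")

def is_valid_scheme(name: str) -> bool:
    """True if the whole string matches the scheme rule."""
    return SCHEME_RE.fullmatch(name) is not None

def normalize_scheme(name: str) -> str:
    """Accept uppercase for robustness, produce lowercase for consistency."""
    if not is_valid_scheme(name):
        raise ValueError(f"not a valid scheme name: {name!r}")
    return name.lower()
```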
922 | ||
923 | Individual schemes are not specified by this document. The process | |
924 | for registration of new URI schemes is defined separately by [BCP35]. | |
925 | The scheme registry maintains the mapping between scheme names and | |
926 | their specifications. Advice for designers of new URI schemes can be | |
927 | found in [RFC2718]. URI scheme specifications must define their own | |
928 | syntax so that all strings matching their scheme-specific syntax will | |
929 | also match the <absolute-URI> grammar, as described in Section 4.3. | |
930 | ||
931 | When presented with a URI that violates one or more scheme-specific | |
932 | restrictions, the scheme-specific resolution process should flag the | |
933 | reference as an error rather than ignore the unused parts; doing so | |
934 | reduces the number of equivalent URIs and helps detect abuses of the | |
935 | generic syntax, which might indicate that the URI has been | |
936 | constructed to mislead the user (Section 7.6). | |
937 | ||
938 | 3.2. Authority | |
939 | ||
940 | Many URI schemes include a hierarchical element for a naming | |
941 | authority so that governance of the name space defined by the | |
942 | remainder of the URI is delegated to that authority (which may, in | |
943 | turn, delegate it further). The generic syntax provides a common | |
944 | means for distinguishing an authority based on a registered name or | |
945 | server address, along with optional port and user information. | |
946 | ||
947 | The authority component is preceded by a double slash ("//") and is | |
948 | terminated by the next slash ("/"), question mark ("?"), or number | |
949 | sign ("#") character, or by the end of the URI. | |
950 | ||
959 | authority = [ userinfo "@" ] host [ ":" port ] | |
960 | ||
961 | URI producers and normalizers should omit the ":" delimiter that | |
962 | separates host from port if the port component is empty. Some | |
963 | schemes do not allow the userinfo and/or port subcomponents. | |
964 | ||
965 | If a URI contains an authority component, then the path component | |
966 | must either be empty or begin with a slash ("/") character. Non- | |
967 | validating parsers (those that merely separate a URI reference into | |
968 | its major components) will often ignore the subcomponent structure of | |
969 | authority, treating it as an opaque string from the double-slash to | |
970 | the first terminating delimiter, until such time as the URI is | |
971 | dereferenced. | |
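
When the subcomponents are eventually needed, a non-validating split of the authority string could be sketched in Python as shown here. The function name and return convention are hypothetical. The sketch assumes a well-formed authority and takes care not to treat a colon inside a bracketed IP literal as the port delimiter.

```python
from typing import Optional, Tuple

def split_authority(authority: str) -> Tuple[Optional[str], str, Optional[str]]:
    """Split authority into (userinfo, host, port) without validating any part."""
    userinfo, at_sign, hostport = authority.rpartition("@")
    if not at_sign:
        userinfo = None            # no "@" delimiter, so no userinfo
    host, colon, port = hostport.rpartition(":")
    if not colon or "]" in port:
        # Either no ":" at all, or the last ":" lies inside "[" ... "]".
        host, port = hostport, None
    return userinfo, host, port
```

An empty port after a trailing ":" is returned as an empty string, so that a producer or normalizer can recognize it and omit the delimiter.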
972 | ||
973 | 3.2.1. User Information | |
974 | ||
975 | The userinfo subcomponent may consist of a user name and, optionally, | |
976 | scheme-specific information about how to gain authorization to access | |
977 | the resource. The user information, if present, is followed by a | |
978 | commercial at-sign ("@") that delimits it from the host. | |
979 | ||
980 | userinfo = *( unreserved / pct-encoded / sub-delims / ":" ) | |
981 | ||
982 | Use of the format "user:password" in the userinfo field is | |
983 | deprecated. Applications should not render as clear text any data | |
984 | after the first colon (":") character found within a userinfo | |
985 | subcomponent unless the data after the colon is the empty string | |
986 | (indicating no password). Applications may choose to ignore or | |
987 | reject such data when it is received as part of a reference and | |
988 | should reject the storage of such data in unencrypted form. The | |
989 | passing of authentication information in clear text has proven to be | |
990 | a security risk in almost every case where it has been used. | |
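
One way an application might follow this rendering advice is sketched below in Python. The function name and the fixed "****" placeholder are assumptions chosen for illustration; the specification does not prescribe how the hidden data is to be displayed.

```python
def render_userinfo(userinfo: str) -> str:
    """Return a display form that hides any data after the first colon."""
    user, colon, secret = userinfo.partition(":")
    if not colon:
        return user                # no colon: nothing to hide
    if secret == "":
        return user + ":"          # empty string after the colon: no password
    return user + ":" + "****"     # placeholder instead of the clear text
```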
991 | ||
992 | Applications that render a URI for the sake of user feedback, such as | |
993 | in graphical hypertext browsing, should render userinfo in a way that | |
994 | is distinguished from the rest of a URI, when feasible. Such | |
995 | rendering will assist the user in cases where the userinfo has been | |
996 | misleadingly crafted to look like a trusted domain name | |
997 | (Section 7.6). | |
998 | ||
999 | 3.2.2. Host | |
1000 | ||
1001 | The host subcomponent of authority is identified by an IP literal | |
1002 | encapsulated within square brackets, an IPv4 address in dotted- | |
1003 | decimal form, or a registered name. The host subcomponent is case- | |
1004 | insensitive. The presence of a host subcomponent within a URI does | |
1005 | not imply that the scheme requires access to the given host on the | |
1006 | Internet. In many cases, the host syntax is used only for the sake | |
1015 | of reusing the existing registration process created and deployed for | |
1016 | DNS, thus obtaining a globally unique name without the cost of | |
1017 | deploying another registry. However, such use comes with its own | |
1018 | costs: domain name ownership may change over time for reasons not | |
1019 | anticipated by the URI producer. In other cases, the data within the | |
1020 | host component identifies a registered name that has nothing to do | |
1021 | with an Internet host. We use the name "host" for the ABNF rule | |
1022 | because that is its most common purpose, not its only purpose. | |
1023 | ||
1024 | host = IP-literal / IPv4address / reg-name | |
1025 | ||
1026 | The syntax rule for host is ambiguous because it does not completely | |
1027 | distinguish between an IPv4address and a reg-name. In order to | |
1028 | disambiguate the syntax, we apply the "first-match-wins" algorithm: | |
1029 | If host matches the rule for IPv4address, then it should be | |
1030 | considered an IPv4 address literal and not a reg-name. Although host | |
1031 | is case-insensitive, producers and normalizers should use lowercase | |
1032 | for registered names and hexadecimal addresses for the sake of | |
1033 | uniformity, while only using uppercase letters for percent-encodings. | |
1034 | ||
1035 | A host identified by an Internet Protocol literal address, version 6 | |
1036 | [RFC3513] or later, is distinguished by enclosing the IP literal | |
1037 | within square brackets ("[" and "]"). This is the only place where | |
1038 | square bracket characters are allowed in the URI syntax. In | |
1039 | anticipation of future, as-yet-undefined IP literal address formats, | |
1040 | an implementation may use an optional version flag to indicate such a | |
1041 | format explicitly rather than rely on heuristic determination. | |
1042 | ||
1043 | IP-literal = "[" ( IPv6address / IPvFuture ) "]" | |
1044 | ||
1045 | IPvFuture = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" ) | |
1046 | ||
1047 | The version flag does not indicate the IP version; rather, it | |
1048 | indicates future versions of the literal format. As such, | |
1049 | implementations must not provide the version flag for the existing | |
1050 | IPv4 and IPv6 literal address forms described below. If a URI | |
1051 | containing an IP-literal that starts with "v" (case-insensitive), | |
1052 | indicating that the version flag is present, is dereferenced by an | |
1053 | application that does not know the meaning of that version flag, then | |
1054 | the application should return an appropriate error for "address | |
1055 | mechanism not supported". | |
1056 | ||
1057 | A host identified by an IPv6 literal address is represented inside | |
1058 | the square brackets without a preceding version flag. The ABNF | |
1059 | provided here is a translation of the text definition of an IPv6 | |
1060 | literal address provided in [RFC3513]. This syntax does not support | |
1061 | IPv6 scoped addressing zone identifiers. | |
1062 | ||
1071 | A 128-bit IPv6 address is divided into eight 16-bit pieces. Each | |
1072 | piece is represented numerically in case-insensitive hexadecimal, | |
1073 | using one to four hexadecimal digits (leading zeroes are permitted). | |
1074 | The eight encoded pieces are given most-significant first, separated | |
1075 | by colon characters. Optionally, the least-significant two pieces | |
1076 | may instead be represented in IPv4 address textual format. A | |
1077 | sequence of one or more consecutive zero-valued 16-bit pieces within | |
1078 | the address may be elided, omitting all their digits and leaving | |
1079 | exactly two consecutive colons in their place to mark the elision. | |
1080 | ||
1081 | IPv6address =                            6( h16 ":" ) ls32 | |
1082 |             /                       "::" 5( h16 ":" ) ls32 | |
1083 |             / [               h16 ] "::" 4( h16 ":" ) ls32 | |
1084 |             / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32 | |
1085 |             / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32 | |
1086 |             / [ *3( h16 ":" ) h16 ] "::"    h16 ":"   ls32 | |
1087 |             / [ *4( h16 ":" ) h16 ] "::"              ls32 | |
1088 |             / [ *5( h16 ":" ) h16 ] "::"              h16 | |
1089 |             / [ *6( h16 ":" ) h16 ] "::" | |
1090 | ||
1091 | ls32 = ( h16 ":" h16 ) / IPv4address | |
1092 | ; least-significant 32 bits of address | |
1093 | ||
1094 | h16 = 1*4HEXDIG | |
1095 | ; 16 bits of address represented in hexadecimal | |
1096 | ||
1097 | A host identified by an IPv4 literal address is represented in | |
1098 | dotted-decimal notation (a sequence of four decimal numbers in the | |
1099 | range 0 to 255, separated by "."), as described in [RFC1123] by | |
1100 | reference to [RFC0952]. Note that other forms of dotted notation may | |
1101 | be interpreted on some platforms, as described in Section 7.4, but | |
1102 | only the dotted-decimal form of four octets is allowed by this | |
1103 | grammar. | |
1104 | ||
1105 | IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet | |
1106 | ||
1107 | dec-octet = DIGIT ; 0-9 | |
1108 | / %x31-39 DIGIT ; 10-99 | |
1109 | / "1" 2DIGIT ; 100-199 | |
1110 | / "2" %x30-34 DIGIT ; 200-249 | |
1111 | / "25" %x30-35 ; 250-255 | |
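
The IPv4address and dec-octet rules translate naturally into a regular expression. The Python sketch below is one such translation; the names are hypothetical. A check of this kind can serve as the first step of the "first-match-wins" decision between IPv4address and reg-name.

```python
import re

# dec-octet alternatives: 250-255, 200-249, 100-199, 10-99, 0-9
DEC_OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])"

# IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet
IPV4_RE = re.compile(rf"{DEC_OCTET}(?:\.{DEC_OCTET}){{3}}")

def is_ipv4address(host: str) -> bool:
    """True if the whole host string matches the IPv4address rule."""
    return IPV4_RE.fullmatch(host) is not None
```

Because dec-octet has no alternative with a leading zero, a string such as "01.2.3.4" does not match and would be treated as a reg-name.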
1112 | ||
1113 | A host identified by a registered name is a sequence of characters | |
1114 | usually intended for lookup within a locally defined host or service | |
1115 | name registry, though the URI's scheme-specific semantics may require | |
1116 | that a specific registry (or fixed name table) be used instead. The | |
1117 | most common name registry mechanism is the Domain Name System (DNS). | |
1118 | A registered name intended for lookup in the DNS uses the syntax | |
1127 | defined in Section 3.5 of [RFC1034] and Section 2.1 of [RFC1123]. | |
1128 | Such a name consists of a sequence of domain labels separated by ".", | |
1129 | each domain label starting and ending with an alphanumeric character | |
1130 | and possibly also containing "-" characters. The rightmost domain | |
1131 | label of a fully qualified domain name in DNS may be followed by a | |
1132 | single "." and should be if it is necessary to distinguish between | |
1133 | the complete domain name and some local domain. | |
1134 | ||
1135 | reg-name = *( unreserved / pct-encoded / sub-delims ) | |
1136 | ||
1137 | If the URI scheme defines a default for host, then that default | |
1138 | applies when the host subcomponent is undefined or when the | |
1139 | registered name is empty (zero length). For example, the "file" URI | |
1140 | scheme is defined so that no authority, an empty host, and | |
1141 | "localhost" all mean the end-user's machine, whereas the "http" | |
1142 | scheme considers a missing authority or empty host invalid. | |
1143 | ||
1144 | This specification does not mandate a particular registered name | |
1145 | lookup technology and therefore does not restrict the syntax of reg- | |
1146 | name beyond what is necessary for interoperability. Instead, it | |
1147 | delegates the issue of registered name syntax conformance to the | |
1148 | operating system of each application performing URI resolution, and | |
1149 | that operating system decides what it will allow for the purpose of | |
1150 | host identification. A URI resolution implementation might use DNS, | |
1151 | host tables, yellow pages, NetInfo, WINS, or any other system for | |
1152 | lookup of registered names. However, a globally scoped naming | |
1153 | system, such as DNS fully qualified domain names, is necessary for | |
1154 | URIs intended to have global scope. URI producers should use names | |
1155 | that conform to the DNS syntax, even when use of DNS is not | |
1156 | immediately apparent, and should limit these names to no more than | |
1157 | 255 characters in length. | |
1158 | ||
1159 | The reg-name syntax allows percent-encoded octets in order to | |
1160 | represent non-ASCII registered names in a uniform way that is | |
1161 | independent of the underlying name resolution technology. Non-ASCII | |
1162 | characters must first be encoded according to UTF-8 [STD63], and then | |
1163 | each octet of the corresponding UTF-8 sequence must be percent- | |
1164 | encoded to be represented as URI characters. URI producing | |
1165 | applications must not use percent-encoding in host unless it is used | |
1166 | to represent a UTF-8 character sequence. When a non-ASCII registered | |
1167 | name represents an internationalized domain name intended for | |
1168 | resolution via the DNS, the name must be transformed to the IDNA | |
1169 | encoding [RFC3490] prior to name lookup. URI producers should | |
1170 | provide these registered names in the IDNA encoding, rather than a | |
1171 | percent-encoding, if they wish to maximize interoperability with | |
1172 | legacy URI resolvers. | |
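
The two representations of a non-ASCII registered name can be compared with a Python sketch. It assumes urllib.parse.quote for the percent-encoded form and Python's built-in "idna" codec, which implements the [RFC3490] encoding, for the IDNA form. The function names and the sample name are hypothetical.

```python
from urllib.parse import quote

SUB_DELIMS = "!$&'()*+,;="

def reg_name_percent_encoded(name: str) -> str:
    """UTF-8 encode the name, then percent-encode each non-ASCII octet."""
    return quote(name, safe=SUB_DELIMS, encoding="utf-8")

def reg_name_idna(name: str) -> str:
    """Transform the name to its IDNA (ASCII-compatible) encoding."""
    return name.encode("idna").decode("ascii")

SAMPLE_NAME = "b\u00fccher.example"   # contains LATIN SMALL LETTER U WITH DIAERESIS
```

For the sample name, the percent-encoded form is "b%C3%BCcher.example" and the IDNA form is "xn--bcher-kva.example"; the latter is the one a legacy resolver can look up without further processing.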
1173 | ||
1183 | 3.2.3. Port | |
1184 | ||
1185 | The port subcomponent of authority is designated by an optional port | |
1186 | number in decimal following the host and delimited from it by a | |
1187 | single colon (":") character. | |
1188 | ||
1189 | port = *DIGIT | |
1190 | ||
1191 | A scheme may define a default port. For example, the "http" scheme | |
1192 | defines a default port of "80", corresponding to its reserved TCP | |
1193 | port number. The type of port designated by the port number (e.g., | |
1194 | TCP, UDP, SCTP) is defined by the URI scheme. URI producers and | |
1195 | normalizers should omit the port component and its ":" delimiter if | |
1196 | port is empty or if its value would be the same as that of the | |
1197 | scheme's default. | |
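
A producer or normalizer could apply this rule as in the following Python sketch. The DEFAULT_PORTS table is a hypothetical stand-in containing only the "http" default mentioned above; a real implementation would take defaults from each scheme's specification. Comparing ports numerically is likewise an assumption of the sketch.

```python
from typing import Optional

# Hypothetical table; only the "http" default is stated in the text.
DEFAULT_PORTS = {"http": 80}

def join_host_port(scheme: str, host: str, port: Optional[str]) -> str:
    """Return host, followed by ":" port only when the port is worth keeping."""
    if port is None or port == "":
        return host                          # empty port: omit the ":" as well
    if DEFAULT_PORTS.get(scheme.lower()) == int(port):
        return host                          # same value as the scheme default
    return f"{host}:{port}"
```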
1198 | ||
1199 | 3.3. Path | |
1200 | ||
1201 | The path component contains data, usually organized in hierarchical | |
1202 | form, that, along with data in the non-hierarchical query component | |
1203 | (Section 3.4), serves to identify a resource within the scope of the | |
1204 | URI's scheme and naming authority (if any). The path is terminated | |
1205 | by the first question mark ("?") or number sign ("#") character, or | |
1206 | by the end of the URI. | |
1207 | ||
1208 | If a URI contains an authority component, then the path component | |
1209 | must either be empty or begin with a slash ("/") character. If a URI | |
1210 | does not contain an authority component, then the path cannot begin | |
1211 | with two slash characters ("//"). In addition, a URI reference | |
1212 | (Section 4.1) may be a relative-path reference, in which case the | |
1213 | first path segment cannot contain a colon (":") character. The ABNF | |
1214 | requires five separate rules to disambiguate these cases, only one of | |
1215 | which will match the path substring within a given URI reference. We | |
1216 | use the generic term "path component" to describe the URI substring | |
1217 | matched by the parser to one of these rules. | |
1218 | ||
1219 | path = path-abempty ; begins with "/" or is empty | |
1220 | / path-absolute ; begins with "/" but not "//" | |
1221 | / path-noscheme ; begins with a non-colon segment | |
1222 | / path-rootless ; begins with a segment | |
1223 | / path-empty ; zero characters | |
1224 | ||
1225 | path-abempty = *( "/" segment ) | |
1226 | path-absolute = "/" [ segment-nz *( "/" segment ) ] | |
1227 | path-noscheme = segment-nz-nc *( "/" segment ) | |
1228 | path-rootless = segment-nz *( "/" segment ) | |
1229 | path-empty = 0<pchar> | |
1230 | ||
1239 | segment = *pchar | |
1240 | segment-nz = 1*pchar | |
1241 | segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" ) | |
1242 | ; non-zero-length segment without any colon ":" | |
1243 | ||
1244 | pchar = unreserved / pct-encoded / sub-delims / ":" / "@" | |
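
As an illustration only, the path rules above can be transcribed into
regular expressions. The following Python sketch is not part of this
specification: the names PATH_RULES and matching_path_rules are
hypothetical, and the character classes for unreserved, pct-encoded,
and sub-delims are assumed from Section 2. Viewed in isolation, a path
string can satisfy more than one rule, so the sketch reports every rule
that matches; the single rule that applies within a given URI reference
is selected by the surrounding grammar (whether a scheme or authority
is present).

```python
import re

# Character classes assumed from Section 2 of this specification.
UNRESERVED = r"[A-Za-z0-9\-._~]"
PCT_ENCODED = r"%[0-9A-Fa-f]{2}"
SUB_DELIMS = r"[!$&'()*+,;=]"

PCHAR = rf"(?:{UNRESERVED}|{PCT_ENCODED}|{SUB_DELIMS}|[:@])"
SEGMENT = rf"(?:{PCHAR})*"
SEGMENT_NZ = rf"(?:{PCHAR})+"
# Same as segment-nz, but without the colon.
SEGMENT_NZ_NC = rf"(?:{UNRESERVED}|{PCT_ENCODED}|{SUB_DELIMS}|@)+"

# One regular expression per ABNF path rule, in the order listed above.
PATH_RULES = {
    "path-abempty": rf"(?:/{SEGMENT})*",
    "path-absolute": rf"/(?:{SEGMENT_NZ}(?:/{SEGMENT})*)?",
    "path-noscheme": rf"{SEGMENT_NZ_NC}(?:/{SEGMENT})*",
    "path-rootless": rf"{SEGMENT_NZ}(?:/{SEGMENT})*",
    "path-empty": r"",
}


def matching_path_rules(path):
    """Return the names of all path rules that match the whole string."""
    return [
        name
        for name, pattern in PATH_RULES.items()
        if re.fullmatch(pattern, path)
    ]
```

For instance, "this:that" matches only path-rootless under this sketch,
which is consistent with the restriction that the first segment of a
relative-path reference cannot contain a colon.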
1245 | ||
1246 | A path consists of a sequence of path segments separated by a slash | |
1247 | ("/") character. A path is always defined for a URI, though the | |
1248 | defined path may be empty (zero length). Use of the slash character | |
1249 | to indicate hierarchy is only required when a URI will be used as the | |
1250 | context for relative references. For example, the URI | |
1251 | <mailto:fred@example.com> has a path of "fred@example.com", whereas | |
1252 | the URI <foo://info.example.com?fred> has an empty path. | |
1253 | ||
1254 | The path segments "." and "..", also known as dot-segments, are | |
1255 | defined for relative reference within the path name hierarchy. They | |
1256 | are intended for use at the beginning of a relative-path reference | |
1257 | (Section 4.2) to indicate relative position within the hierarchical | |
1258 | tree of names. This is similar to their role within some operating | |
1259 | systems' file directory structures to indicate the current directory | |
1260 | and parent directory, respectively. However, unlike in a file | |
1261 | system, these dot-segments are only interpreted within the URI path | |
1262 | hierarchy and are removed as part of the resolution process (Section | |
1263 | 5.2). | |
1264 | ||
1265 | Aside from dot-segments in hierarchical paths, a path segment is | |
1266 | considered opaque by the generic syntax. URI-producing applications | |
1267 | often use the reserved characters allowed in a segment to delimit | |
1268 | scheme-specific or dereference-handler-specific subcomponents. For | |
1269 | example, the semicolon (";") and equals ("=") reserved characters are | |
1270 | often used to delimit parameters and parameter values applicable to | |
1271 | that segment. The comma (",") reserved character is often used for | |
1272 | similar purposes. For example, one URI producer might use a segment | |
1273 | such as "name;v=1.1" to indicate a reference to version 1.1 of | |
1274 | "name", whereas another might use a segment such as "name,1.1" to | |
1275 | indicate the same. Parameter types may be defined by scheme-specific | |
1276 | semantics, but in most cases the syntax of a parameter is specific to | |
1277 | the implementation of the URI's dereferencing algorithm. | |
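
The generic syntax assigns no meaning to such parameters, so any parser
for them is a convention of the producer. As a hedged illustration,
the Python sketch below handles only the semicolon and equals
convention described above; the function name split_segment_params is
hypothetical and the parameter syntax it accepts is an assumption, not
something defined by this specification.

```python
def split_segment_params(segment):
    """Split 'name;k=v;flag' into ('name', {'k': 'v', 'flag': None}).

    Assumes the producer delimits parameters with ";" and values with
    "="; other producers may use "," or another convention entirely.
    """
    name, *params = segment.split(";")
    parsed = {}
    for param in params:
        key, equals, value = param.partition("=")
        # A parameter without "=" is recorded with no value.
        parsed[key] = value if equals else None
    return name, parsed
```

A segment written with a different convention, such as "name,1.1", is
returned unchanged as the name with no parameters.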
1278 | ||
1279 | 3.4. Query | |
1280 | ||
1281 | The query component contains non-hierarchical data that, along with | |
1282 | data in the path component (Section 3.3), serves to identify a | |
1283 | resource within the scope of the URI's scheme and naming authority | |
1284 | (if any). The query component is indicated by the first question | |
1285 | mark ("?") character and terminated by a number sign ("#") character | |
1286 | or by the end of the URI. | |
1287 | ||
1295 | query = *( pchar / "/" / "?" ) | |
1296 | ||
1297 | The characters slash ("/") and question mark ("?") may represent data | |
1298 | within the query component. Beware that some older, erroneous | |
1299 | implementations may not handle such data correctly when it is used as | |
1300 | the base URI for relative references (Section 5.1), apparently | |
1301 | because they fail to distinguish query data from path data when | |
1302 | looking for hierarchical separators. However, as query components | |
1303 | are often used to carry identifying information in the form of | |
1304 | "key=value" pairs and one frequently used value is a reference to | |
1305 | another URI, it is sometimes better for usability to avoid percent- | |
1306 | encoding those characters. | |
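
The delimiting rule for the query can be sketched in a few lines of
Python. This is an illustrative, non-validating sketch; the function
name extract_query and the example URIs in the usage note are
hypothetical. It first discards anything from the first number sign
onward and then splits at the first question mark, so that any later
"/" or "?" characters remain part of the query data.

```python
def extract_query(uri):
    """Return the query component, or None if the URI has no query."""
    # The fragment starts at the first "#"; a "?" after it is not a query.
    before_fragment, _, _ = uri.partition("#")
    # The query starts after the first "?" and runs to the end.
    _, question_mark, query = before_fragment.partition("?")
    return query if question_mark else None
```

Applied to "http://example.com/search?next=http://example.org/a?b=c#top",
the sketch yields "next=http://example.org/a?b=c": the slashes and the
second question mark are carried as data.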
1307 | ||
1308 | 3.5. Fragment | |
1309 | ||
1310 | The fragment identifier component of a URI allows indirect | |
1311 | identification of a secondary resource by reference to a primary | |
1312 | resource and additional identifying information. The identified | |
1313 | secondary resource may be some portion or subset of the primary | |
1314 | resource, some view on representations of the primary resource, or | |
1315 | some other resource defined or described by those representations. A | |
1316 | fragment identifier component is indicated by the presence of a | |
1317 | number sign ("#") character and terminated by the end of the URI. | |
1318 | ||
1319 | fragment = *( pchar / "/" / "?" ) | |
1320 | ||
1321 | The semantics of a fragment identifier are defined by the set of | |
1322 | representations that might result from a retrieval action on the | |
1323 | primary resource. The fragment's format and resolution are therefore | |
1324 | dependent on the media type [RFC2046] of a potentially retrieved | |
1325 | representation, even though such a retrieval is only performed if the | |
1326 | URI is dereferenced. If no such representation exists, then the | |
1327 | semantics of the fragment are considered unknown and are effectively | |
1328 | unconstrained. Fragment identifier semantics are independent of the | |
1329 | URI scheme and thus cannot be redefined by scheme specifications. | |
1330 | ||
1331 | Individual media types may define their own restrictions on or | |
1332 | structures within the fragment identifier syntax for specifying | |
1333 | different types of subsets, views, or external references that are | |
1334 | identifiable as secondary resources by that media type. If the | |
1335 | primary resource has multiple representations, as is often the case | |
1336 | for resources whose representation is selected based on attributes of | |
1337 | the retrieval request (a.k.a., content negotiation), then whatever is | |
1338 | identified by the fragment should be consistent across all of those | |
1339 | representations. Each representation should either define the | |
1340 | fragment so that it corresponds to the same secondary resource, | |
1341 | regardless of how it is represented, or should leave the fragment | |
1342 | undefined (i.e., not found). | |
1343 | ||
1351 | As with any URI, use of a fragment identifier component does not | |
1352 | imply that a retrieval action will take place. A URI with a fragment | |
1353 | identifier may be used to refer to the secondary resource without any | |
1354 | implication that the primary resource is accessible or will ever be | |
1355 | accessed. | |
1356 | ||
1357 | Fragment identifiers have a special role in information retrieval | |
1358 | systems as the primary form of client-side indirect referencing, | |
1359 | allowing an author to specifically identify aspects of an existing | |
1360 | resource that are only indirectly provided by the resource owner. As | |
1361 | such, the fragment identifier is not used in the scheme-specific | |
1362 | processing of a URI; instead, the fragment identifier is separated | |
1363 | from the rest of the URI prior to a dereference, and thus the | |
1364 | identifying information within the fragment itself is dereferenced | |
1365 | solely by the user agent, regardless of the URI scheme. Although | |
1366 | this separate handling is often perceived to be a loss of | |
1367 | information, particularly for accurate redirection of references as | |
1368 | resources move over time, it also serves to prevent information | |
1369 | providers from denying reference authors the right to refer to | |
1370 | information within a resource selectively. Indirect referencing also | |
1371 | provides additional flexibility and extensibility to systems that use | |
1372 | URIs, as new media types are easier to define and deploy than new | |
1373 | schemes of identification. | |
1374 | ||
1375 | The characters slash ("/") and question mark ("?") are allowed to | |
1376 | represent data within the fragment identifier. Beware that some | |
1377 | older, erroneous implementations may not handle this data correctly | |
1378 | when it is used as the base URI for relative references (Section | |
1379 | 5.1). | |
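
The separation of the fragment prior to a dereference can be pictured
with the following Python sketch. It is illustrative only; the function
name split_for_dereference is hypothetical. The first element of the
result is what scheme-specific processing would see, and the second is
the fragment retained by the user agent, with None standing for an
undefined fragment and the empty string for a fragment that is present
but empty.

```python
def split_for_dereference(uri):
    """Split a URI into (URI without fragment, fragment or None)."""
    # The fragment begins after the first "#" and runs to the end,
    # so "/" and "?" after that point are fragment data.
    primary, number_sign, fragment = uri.partition("#")
    return primary, (fragment if number_sign else None)
```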
1380 | ||
1381 | 4. Usage | |
1382 | ||
1383 | When applications make reference to a URI, they do not always use the | |
1384 | full form of reference defined by the "URI" syntax rule. To save | |
1385 | space and take advantage of hierarchical locality, many Internet | |
1386 | protocol elements and media type formats allow an abbreviation of a | |
1387 | URI, whereas others restrict the syntax to a particular form of URI. | |
1388 | We define the most common forms of reference syntax in this | |
1389 | specification because they impact and depend upon the design of the | |
1390 | generic syntax, requiring a uniform parsing algorithm in order to be | |
1391 | interpreted consistently. | |
1392 | ||
1393 | 4.1. URI Reference | |
1394 | ||
1395 | URI-reference is used to denote the most common usage of a resource | |
1396 | identifier. | |
1397 | ||
1398 | URI-reference = URI / relative-ref | |
1399 | ||
1407 | A URI-reference is either a URI or a relative reference. If the | |
1408 | URI-reference's prefix does not match the syntax of a scheme followed | |
1409 | by its colon separator, then the URI-reference is a relative | |
1410 | reference. | |
1411 | ||
1412 | A URI-reference is typically parsed first into the five URI | |
1413 | components, in order to determine what components are present and | |
1414 | whether the reference is relative. Then, each component is parsed | |
1415 | for its subparts and their validation. The ABNF of URI-reference, | |
1416 | along with the "first-match-wins" disambiguation rule, is sufficient | |
1417 | to define a validating parser for the generic syntax. Readers | |
1418 | familiar with regular expressions should see Appendix B for an | |
1419 | example of a non-validating URI-reference parser that will take any | |
1420 | given string and extract the URI components. | |
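
For readers who prefer code, the first parsing pass can be sketched in
Python by using the regular expression given in Appendix B. The sketch
is non-validating and illustrative only; the names URI_REFERENCE and
parse_reference are hypothetical. A component whose delimiter is absent
is returned as None (undefined), whereas a component that is present
but empty is returned as the empty string.

```python
import re

# The non-validating expression from Appendix B. Groups 2, 4, 5, 7, and 9
# capture scheme, authority, path, query, and fragment respectively.
URI_REFERENCE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?",
    re.DOTALL,
)


def parse_reference(reference):
    """Return (scheme, authority, path, query, fragment).

    Every part of the expression is optional, so it matches any string.
    """
    match = URI_REFERENCE.match(reference)
    return (
        match.group(2),
        match.group(4),
        match.group(5),
        match.group(7),
        match.group(9),
    )
```

Because the expression does not validate, a string such as "this:that"
is reported with a scheme of "this"; deciding whether each component is
well formed is left to the second pass described above.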
1421 | ||
1422 | 4.2. Relative Reference | |
1423 | ||
1424 | A relative reference takes advantage of the hierarchical syntax | |
1425 | (Section 1.2.3) to express a URI reference relative to the name space | |
1426 | of another hierarchical URI. | |
1427 | ||
1428 | relative-ref = relative-part [ "?" query ] [ "#" fragment ] | |
1429 | ||
1430 | relative-part = "//" authority path-abempty | |
1431 | / path-absolute | |
1432 | / path-noscheme | |
1433 | / path-empty | |
1434 | ||
1435 | The URI referred to by a relative reference, also known as the target | |
1436 | URI, is obtained by applying the reference resolution algorithm of | |
1437 | Section 5. | |
1438 | ||
1439 | A relative reference that begins with two slash characters is termed | |
1440 | a network-path reference; such references are rarely used. A | |
1441 | relative reference that begins with a single slash character is | |
1442 | termed an absolute-path reference. A relative reference that does | |
1443 | not begin with a slash character is termed a relative-path reference. | |
1444 | ||
1445 | A path segment that contains a colon character (e.g., "this:that") | |
1446 | cannot be used as the first segment of a relative-path reference, as | |
1447 | it would be mistaken for a scheme name. Such a segment must be | |
1448 | preceded by a dot-segment (e.g., "./this:that") to make a relative- | |
1449 | path reference. | |
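
The terminology of this section can be summarized by a small
classifier. The Python sketch below is illustrative and does not
validate its input; the names SCHEME_PREFIX and classify_reference are
hypothetical, and the scheme pattern is assumed from Section 3.1.

```python
import re

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ), followed by ":"
SCHEME_PREFIX = re.compile(r"[A-Za-z][A-Za-z0-9+.\-]*:")


def classify_reference(reference):
    """Name the form of a URI reference according to its prefix."""
    if SCHEME_PREFIX.match(reference):
        return "URI"
    if reference.startswith("//"):
        return "network-path reference"
    if reference.startswith("/"):
        return "absolute-path reference"
    return "relative-path reference"
```

Under this sketch "this:that" is classified as a URI, because its
prefix looks like a scheme, whereas "./this:that" is classified as a
relative-path reference.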
1450 | ||
1463 | 4.3. Absolute URI | |
1464 | ||
1465 | Some protocol elements allow only the absolute form of a URI without | |
1466 | a fragment identifier. For example, defining a base URI for later | |
1467 | use by relative references calls for an absolute-URI syntax rule that | |
1468 | does not allow a fragment. | |
1469 | ||
1470 | absolute-URI = scheme ":" hier-part [ "?" query ] | |
1471 | ||
1472 | URI scheme specifications must define their own syntax so that all | |
1473 | strings matching their scheme-specific syntax will also match the | |
1474 | <absolute-URI> grammar. Scheme specifications will not define | |
1475 | fragment identifier syntax or usage, regardless of its applicability | |
1476 | to resources identifiable via that scheme, as fragment identification | |
1477 | is orthogonal to scheme definition. However, scheme specifications | |
1478 | are encouraged to include a wide range of examples, including | |
1479 | examples that show use of the scheme's URIs with fragment identifiers | |
1480 | when such usage is appropriate. | |
1481 | ||
1482 | 4.4. Same-Document Reference | |
1483 | ||
1484 | When a URI reference refers to a URI that is, aside from its fragment | |
1485 | component (if any), identical to the base URI (Section 5.1), that | |
1486 | reference is called a "same-document" reference. The most frequent | |
1487 | examples of same-document references are relative references that are | |
1488 | empty or include only the number sign ("#") separator followed by a | |
1489 | fragment identifier. | |
1490 | ||
1491 | When a same-document reference is dereferenced for a retrieval | |
1492 | action, the target of that reference is defined to be within the same | |
1493 | entity (representation, document, or message) as the reference; | |
1494 | therefore, a dereference should not result in a new retrieval action. | |
1495 | ||
1496 | Normalization of the base and target URIs prior to their comparison, | |
1497 | as described in Sections 6.2.2 and 6.2.3, is allowed but rarely | |
1498 | performed in practice. Normalization may increase the set of same- | |
1499 | document references, which may be of benefit to some caching | |
1500 | applications. As such, reference authors should not assume that a | |
1501 | slightly different, though equivalent, reference URI will (or will | |
1502 | not) be interpreted as a same-document reference by any given | |
1503 | application. | |
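
A minimal Python sketch of the comparison follows. It assumes that the
reference has already been resolved to its target URI, and it performs
no normalization, so equivalent but differently spelled URIs are
reported as different documents. The function name is_same_document is
hypothetical.

```python
def is_same_document(target_uri, base_uri):
    """Compare two URIs while ignoring any fragment component."""
    target_without_fragment = target_uri.partition("#")[0]
    base_without_fragment = base_uri.partition("#")[0]
    # Plain string comparison: no case or percent-encoding normalization.
    return target_without_fragment == base_without_fragment
```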
1504 | ||
1505 | 4.5. Suffix Reference | |
1506 | ||
1507 | The URI syntax is designed for unambiguous reference to resources and | |
1508 | extensibility via the URI scheme. However, as URI identification and | |
1509 | usage have become commonplace, traditional media (television, radio, | |
1510 | newspapers, billboards, etc.) have increasingly used a suffix of the | |
1519 | URI as a reference, consisting of only the authority and path | |
1520 | portions of the URI, such as | |
1521 | ||
1522 | www.w3.org/Addressing/ | |
1523 | ||
1524 | or simply a DNS registered name on its own. Such references are | |
1525 | primarily intended for human interpretation rather than for machines, | |
1526 | with the assumption that context-based heuristics are sufficient to | |
1527 | complete the URI (e.g., most registered names beginning with "www" | |
1528 | are likely to have a URI prefix of "http://"). Although there is no | |
1529 | standard set of heuristics for disambiguating a URI suffix, many | |
1530 | client implementations allow them to be entered by the user and | |
1531 | heuristically resolved. | |
1532 | ||
1533 | Although this practice of using suffix references is common, it | |
1534 | should be avoided whenever possible and should never be used in | |
1535 | situations where long-term references are expected. The heuristics | |
1536 | noted above will change over time, particularly when a new URI scheme | |
1537 | becomes popular, and are often incorrect when used out of context. | |
1538 | Furthermore, they can lead to security issues along the lines of | |
1539 | those described in [RFC1535]. | |
1540 | ||
1541 | As a URI suffix has the same syntax as a relative-path reference, a | |
1542 | suffix reference cannot be used in contexts where a relative | |
1543 | reference is expected. As a result, suffix references are limited to | |
1544 | places where there is no defined base URI, such as dialog boxes and | |
1545 | off-line advertisements. | |
1546 | ||
1547 | 5. Reference Resolution | |
1548 | ||
1549 | This section defines the process of resolving a URI reference within | |
1550 | a context that allows relative references so that the result is a | |
1551 | string matching the <URI> syntax rule of Section 3. | |
1552 | ||
1553 | 5.1. Establishing a Base URI | |
1554 | ||
1555 | The term "relative" implies that a "base URI" exists against which | |
1556 | the relative reference is applied. Aside from fragment-only | |
1557 | references (Section 4.4), relative references are only usable when a | |
1558 | base URI is known. A base URI must be established by the parser | |
1559 | prior to parsing URI references that might be relative. A base URI | |
1560 | must conform to the <absolute-URI> syntax rule (Section 4.3). If the | |
1561 | base URI is obtained from a URI reference, then that reference must | |
1562 | be converted to absolute form and stripped of any fragment component | |
1563 | prior to its use as a base URI. | |
1564 | ||
1575 | The base URI of a reference can be established in one of four ways, | |
1576 | discussed below in order of precedence. The order of precedence can | |
1577 | be thought of in terms of layers, where the innermost defined base | |
1578 | URI has the highest precedence. This can be visualized graphically | |
1579 | as follows: | |
1580 | ||
1581 | .----------------------------------------------------------. | |
1582 | | .----------------------------------------------------. | | |
1583 | | | .----------------------------------------------. | | | |
1584 | | | | .----------------------------------------. | | | | |
1585 | | | | | .----------------------------------. | | | | | |
1586 | | | | | | <relative-reference> | | | | | | |
1587 | | | | | `----------------------------------' | | | | | |
1588 | | | | | (5.1.1) Base URI embedded in content | | | | | |
1589 | | | | `----------------------------------------' | | | | |
1590 | | | | (5.1.2) Base URI of the encapsulating entity | | | | |
1591 | | | | (message, representation, or none) | | | | |
1592 | | | `----------------------------------------------' | | | |
1593 | | | (5.1.3) URI used to retrieve the entity | | | |
1594 | | `----------------------------------------------------' | | |
1595 | | (5.1.4) Default Base URI (application-dependent) | | |
1596 | `----------------------------------------------------------' | |
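
The layering shown in the figure amounts to choosing the innermost base
URI that is defined. As a hedged sketch in Python, with the function
name select_base_uri and its parameter names being hypothetical, and
with None standing for a layer that defines no base URI:

```python
def select_base_uri(embedded=None, encapsulating=None,
                    retrieval=None, default=None):
    """Return the highest-precedence base URI that is defined.

    The arguments correspond to Sections 5.1.1 through 5.1.4, in order
    of decreasing precedence.
    """
    for candidate in (embedded, encapsulating, retrieval, default):
        if candidate is not None:
            return candidate
    raise ValueError("no base URI can be established")
```

How each layer obtains its candidate (for example, from the content's
own syntax or from the retrieval context) is outside the scope of the
sketch.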
1597 | ||
1598 | 5.1.1. Base URI Embedded in Content | |
1599 | ||
1600 | Within certain media types, a base URI for relative references can be | |
1601 | embedded within the content itself so that it can be readily obtained | |
1602 | by a parser. This can be useful for descriptive documents, such as | |
1603 | tables of contents, which may be transmitted to others through | |
1604 | protocols other than their usual retrieval context (e.g., email or | |
1605 | USENET news). | |
1606 | ||
1607 | It is beyond the scope of this specification to specify how, for each | |
1608 | media type, a base URI can be embedded. The appropriate syntax, when | |
1609 | available, is described by the data format specification associated | |
1610 | with each media type. | |
1611 | ||
1612 | 5.1.2. Base URI from the Encapsulating Entity | |
1613 | ||
1614 | If no base URI is embedded, the base URI is defined by the | |
1615 | representation's retrieval context. For a document that is enclosed | |
1616 | within another entity, such as a message or archive, the retrieval | |
1617 | context is that entity. Thus, the default base URI of a | |
1618 | representation is the base URI of the entity in which the | |
1619 | representation is encapsulated. | |
1620 | ||
1631 | A mechanism for embedding a base URI within MIME container types | |
1632 | (e.g., the message and multipart types) is defined by MHTML | |
1633 | [RFC2557]. Protocols that do not use the MIME message header syntax, | |
1634 | but that do allow some form of tagged metadata to be included within | |
1635 | messages, may define their own syntax for defining a base URI as part | |
1636 | of a message. | |
1637 | ||
1638 | 5.1.3. Base URI from the Retrieval URI | |
1639 | ||
1640 | If no base URI is embedded and the representation is not encapsulated | |
1641 | within some other entity, then, if a URI was used to retrieve the | |
1642 | representation, that URI shall be considered the base URI. Note that | |
1643 | if the retrieval was the result of a redirected request, the last URI | |
1644 | used (i.e., the URI that resulted in the actual retrieval of the | |
1645 | representation) is the base URI. | |
1646 | ||
1647 | 5.1.4. Default Base URI | |
1648 | ||
1649 | If none of the conditions described above apply, then the base URI is | |
1650 | defined by the context of the application. As this definition is | |
1651 | necessarily application-dependent, failing to define a base URI by | |
1652 | using one of the other methods may result in the same content being | |
1653 | interpreted differently by different types of applications. | |
1654 | ||
1655 | A sender of a representation containing relative references is | |
1656 | responsible for ensuring that a base URI for those references can be | |
1657 | established. Aside from fragment-only references, relative | |
1658 | references can only be used reliably in situations where the base URI | |
1659 | is well defined. | |
1660 | ||
1661 | 5.2. Relative Resolution | |
1662 | ||
1663 | This section describes an algorithm for converting a URI reference | |
1664 | that might be relative to a given base URI into the parsed components | |
1665 | of the reference's target. The components can then be recomposed, as | |
1666 | described in Section 5.3, to form the target URI. This algorithm | |
1667 | provides definitive results that can be used to test the output of | |
1668 | other implementations. Applications may implement relative reference | |
1669 | resolution by using some other algorithm, provided that the results | |
1670 | match what would be given by this one. | |
1671 | ||
1687 | 5.2.1. Pre-parse the Base URI | |
1688 | ||
1689 | The base URI (Base) is established according to the procedure of | |
1690 | Section 5.1 and parsed into the five main components described in | |
1691 | Section 3. Note that only the scheme component is required to be | |
1692 | present in a base URI; the other components may be empty or | |
1693 | undefined. A component is undefined if its associated delimiter does | |
1694 | not appear in the URI reference; the path component is never | |
1695 | undefined, though it may be empty. | |
1696 | ||
1697 | Normalization of the base URI, as described in Sections 6.2.2 and | |
1698 | 6.2.3, is optional. A URI reference must be transformed to its | |
1699 | target URI before it can be normalized. | |
1700 | ||
1701 | 5.2.2. Transform References | |
1702 | ||
1703 | For each URI reference (R), the following pseudocode describes an | |
1704 | algorithm for transforming R into its target URI (T): | |
1705 | ||
1706 | -- The URI reference is parsed into the five URI components | |
1707 | -- | |
1708 | (R.scheme, R.authority, R.path, R.query, R.fragment) = parse(R); | |
1709 | ||
1710 | -- A non-strict parser may ignore a scheme in the reference | |
1711 | -- if it is identical to the base URI's scheme. | |
1712 | -- | |
1713 | if ((not strict) and (R.scheme == Base.scheme)) then | |
1714 |    undefine(R.scheme); | |
1715 | endif; | |
1716 | ||
1743 | if defined(R.scheme) then | |
1744 |    T.scheme    = R.scheme; | |
1745 |    T.authority = R.authority; | |
1746 |    T.path      = remove_dot_segments(R.path); | |
1747 |    T.query     = R.query; | |
1748 | else | |
1749 |    if defined(R.authority) then | |
1750 |       T.authority = R.authority; | |
1751 |       T.path      = remove_dot_segments(R.path); | |
1752 |       T.query     = R.query; | |
1753 |    else | |
1754 |       if (R.path == "") then | |
1755 |          T.path = Base.path; | |
1756 |          if defined(R.query) then | |
1757 |             T.query = R.query; | |
1758 |          else | |
1759 |             T.query = Base.query; | |
1760 |          endif; | |
1761 |       else | |
1762 |          if (R.path starts-with "/") then | |
1763 |             T.path = remove_dot_segments(R.path); | |
1764 |          else | |
1765 |             T.path = merge(Base.path, R.path); | |
1766 |             T.path = remove_dot_segments(T.path); | |
1767 |          endif; | |
1768 |          T.query = R.query; | |
1769 |       endif; | |
1770 |       T.authority = Base.authority; | |
1771 |    endif; | |
1772 |    T.scheme = Base.scheme; | |
1773 | endif; | |
1774 | ||
1775 | T.fragment = R.fragment; | |
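
The pseudocode above can be exercised with the following Python
sketch. It is one possible transcription under stated assumptions, not
a normative implementation: the function name resolve is hypothetical,
parse uses the non-validating expression of Appendix B, and merge and
remove_dot_segments follow the descriptions in Sections 5.2.3 and
5.2.4. Undefined components are represented by None, and the result is
the tuple of target components (scheme, authority, path, query,
fragment) rather than a recomposed string.

```python
import re

_URI_REFERENCE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?",
    re.DOTALL,
)


def parse(reference):
    """Split into (scheme, authority, path, query, fragment)."""
    m = _URI_REFERENCE.match(reference)
    return m.group(2), m.group(4), m.group(5), m.group(7), m.group(9)


def merge(base_authority, base_path, ref_path):
    """Merge a relative path with the base path (Section 5.2.3)."""
    if base_authority is not None and base_path == "":
        return "/" + ref_path
    return base_path[: base_path.rfind("/") + 1] + ref_path


def remove_dot_segments(path):
    """Remove "." and ".." segments (Section 5.2.4)."""
    inp, out = path, []  # out holds segments with their leading "/"
    while inp:
        if inp.startswith("../"):
            inp = inp[3:]
        elif inp.startswith("./"):
            inp = inp[2:]
        elif inp.startswith("/./"):
            inp = inp[2:]
        elif inp == "/.":
            inp = "/"
        elif inp.startswith("/../") or inp == "/..":
            inp = "/" + inp[4:]
            if out:
                out.pop()
        elif inp in (".", ".."):
            inp = ""
        else:
            end = inp.find("/", 1)
            if end == -1:
                end = len(inp)
            out.append(inp[:end])
            inp = inp[end:]
    return "".join(out)


def resolve(reference, base, strict=True):
    """Transform a reference into its target components (T)."""
    b_scheme, b_auth, b_path, b_query, _ = parse(base)
    r_scheme, r_auth, r_path, r_query, r_frag = parse(reference)
    if not strict and r_scheme == b_scheme:
        r_scheme = None  # non-strict: ignore a scheme equal to the base's
    if r_scheme is not None:
        t = (r_scheme, r_auth, remove_dot_segments(r_path), r_query)
    elif r_auth is not None:
        t = (b_scheme, r_auth, remove_dot_segments(r_path), r_query)
    elif r_path == "":
        query = r_query if r_query is not None else b_query
        t = (b_scheme, b_auth, b_path, query)
    else:
        if r_path.startswith("/"):
            path = remove_dot_segments(r_path)
        else:
            path = remove_dot_segments(merge(b_auth, b_path, r_path))
        t = (b_scheme, b_auth, path, r_query)
    return t + (r_frag,)
```

With a base of "http://a/b/c/d;p?q", resolving "../g" under this sketch
gives the components ("http", "a", "/b/g", None, None).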
1776 | ||
1777 | 5.2.3. Merge Paths | |
1778 | ||
1779 | The pseudocode above refers to a "merge" routine for merging a | |
1780 | relative-path reference with the path of the base URI. This is | |
1781 | accomplished as follows: | |
1782 | ||
1783 | o If the base URI has a defined authority component and an empty | |
1784 | path, then return a string consisting of "/" concatenated with the | |
1785 | reference's path; otherwise, | |
1786 | ||
1799 | o return a string consisting of the reference's path component | |
1800 | appended to all but the last segment of the base URI's path (i.e., | |
1801 | excluding any characters after the right-most "/" in the base URI | |
1802 | path, or excluding the entire base URI path if it does not contain | |
1803 | any "/" characters). | |
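
The two cases above can be written as a short Python sketch. It is
illustrative only; the parameter names are hypothetical, and an
undefined authority is represented by None. Note that the result may
still contain dot-segments, which are removed in a separate step.

```python
def merge(base_authority, base_path, reference_path):
    """Merge a relative-path reference with the base URI's path."""
    if base_authority is not None and base_path == "":
        # Base has an authority and an empty path: root the reference.
        return "/" + reference_path
    # Keep everything up to and including the right-most "/" of the
    # base path; if there is no "/", rfind gives -1 and nothing is kept.
    last_slash = base_path.rfind("/")
    return base_path[: last_slash + 1] + reference_path
```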
1804 | ||
1805 | 5.2.4. Remove Dot Segments | |
1806 | ||
1807 | The pseudocode also refers to a "remove_dot_segments" routine for | |
1808 | interpreting and removing the special "." and ".." complete path | |
1809 | segments from a referenced path. This is done after the path is | |
1810 | extracted from a reference, whether or not the path was relative, in | |
1811 | order to remove any invalid or extraneous dot-segments prior to | |
1812 | forming the target URI. Although there are many ways to accomplish | |
1813 | this removal process, we describe a simple method using two string | |
1814 | buffers. | |
1815 | ||
1816 | 1. The input buffer is initialized with the now-appended path | |
1817 | components and the output buffer is initialized to the empty | |
1818 | string. | |
1819 | ||
1820 | 2. While the input buffer is not empty, loop as follows: | |
1821 | ||
1822 | A. If the input buffer begins with a prefix of "../" or "./", | |
1823 | then remove that prefix from the input buffer; otherwise, | |
1824 | ||
1825 | B. if the input buffer begins with a prefix of "/./" or "/.", | |
1826 | where "." is a complete path segment, then replace that | |
1827 | prefix with "/" in the input buffer; otherwise, | |
1828 | ||
1829 | C. if the input buffer begins with a prefix of "/../" or "/..", | |
1830 | where ".." is a complete path segment, then replace that | |
1831 | prefix with "/" in the input buffer and remove the last | |
1832 | segment and its preceding "/" (if any) from the output | |
1833 | buffer; otherwise, | |
1834 | ||
1835 | D. if the input buffer consists only of "." or "..", then remove | |
1836 | that from the input buffer; otherwise, | |
1837 | ||
1838 | E. move the first path segment in the input buffer to the end of | |
1839 | the output buffer, including the initial "/" character (if | |
1840 | any) and any subsequent characters up to, but not including, | |
1841 | the next "/" character or the end of the input buffer. | |
1842 | ||
1843 | 3. Finally, the output buffer is returned as the result of | |
1844 | remove_dot_segments. | |
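
The steps above translate almost line for line into Python. The sketch
below is one possible rendering that uses two strings as the buffers;
the variable names are hypothetical, and the comments refer to the
lettered steps.

```python
def remove_dot_segments(path):
    """Remove "." and ".." segments by using two string buffers."""
    input_buffer, output_buffer = path, ""          # step 1
    while input_buffer:                             # step 2
        if input_buffer.startswith("../"):          # 2A
            input_buffer = input_buffer[3:]
        elif input_buffer.startswith("./"):         # 2A
            input_buffer = input_buffer[2:]
        elif input_buffer.startswith("/./"):        # 2B
            input_buffer = input_buffer[2:]
        elif input_buffer == "/.":                  # 2B
            input_buffer = "/"
        elif input_buffer.startswith("/../") or input_buffer == "/..":
            # 2C: replace the prefix with "/" and drop the last output
            # segment together with its preceding "/" (if any).
            input_buffer = "/" + input_buffer[4:]
            cut = max(output_buffer.rfind("/"), 0)
            output_buffer = output_buffer[:cut]
        elif input_buffer in (".", ".."):           # 2D
            input_buffer = ""
        else:                                       # 2E
            end = input_buffer.find("/", 1)
            if end == -1:
                end = len(input_buffer)
            output_buffer += input_buffer[:end]
            input_buffer = input_buffer[end:]
    return output_buffer                            # step 3
```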
1845 | ||
1855 | Note that dot-segments are intended for use in URI references to | |
1856 | express an identifier relative to the hierarchy of names in the base | |
1857 | URI. The remove_dot_segments algorithm respects that hierarchy by | |
1858 | removing extra dot-segments rather than treating them as an error or | |
1859 | leaving them to be misinterpreted by dereference implementations. | |
1860 | ||
1861 | The following illustrates how the above steps are applied for two | |
1862 | examples of merged paths, showing the state of the two buffers after | |
1863 | each step. | |
1864 | ||
1865 | STEP   OUTPUT BUFFER       INPUT BUFFER | |
1866 | ||
1867 | 1 :                        /a/b/c/./../../g | |
1868 | 2E:    /a                  /b/c/./../../g | |
1869 | 2E:    /a/b                /c/./../../g | |
1870 | 2E:    /a/b/c              /./../../g | |
1871 | 2B:    /a/b/c              /../../g | |
1872 | 2C:    /a/b                /../g | |
1873 | 2C:    /a                  /g | |
1874 | 2E:    /a/g | |
1875 | ||
1876 | STEP   OUTPUT BUFFER       INPUT BUFFER | |
1877 | ||
1878 | 1 :                        mid/content=5/../6 | |
1879 | 2E:    mid                 /content=5/../6 | |
1880 | 2E:    mid/content=5       /../6 | |
1881 | 2C:    mid                 /6 | |
1882 | 2E:    mid/6 | |
1883 | ||
1884 | Some applications may find it more efficient to implement the | |
1885 | remove_dot_segments algorithm by using two segment stacks rather than | |
1886 | strings. | |
1887 | ||
1888 | Note: Beware that some older, erroneous implementations will fail | |
1889 | to separate a reference's query component from its path component | |
1890 | prior to merging the base and reference paths, resulting in an | |
1891 | interoperability failure if the query component contains the | |
1892 | strings "/../" or "/./". | |
1893 | ||
1911 | 5.3. Component Recomposition | |
1912 | ||
1913 | Parsed URI components can be recomposed to obtain the corresponding | |
1914 | URI reference string. Using pseudocode, this would be: | |
1915 | ||
1916 | result = "" | |
1917 | ||
1918 | if defined(scheme) then | |
1919 |    append scheme to result; | |
1920 |    append ":" to result; | |
1921 | endif; | |
1922 | ||
1923 | if defined(authority) then | |
1924 |    append "//" to result; | |
1925 |    append authority to result; | |
1926 | endif; | |
1927 | ||
1928 | append path to result; | |
1929 | ||
1930 | if defined(query) then | |
1931 |    append "?" to result; | |
1932 |    append query to result; | |
1933 | endif; | |
1934 | ||
1935 | if defined(fragment) then | |
1936 |    append "#" to result; | |
1937 |    append fragment to result; | |
1938 | endif; | |
1939 | ||
1940 | return result; | |
1941 | ||
1942 | Note that we are careful to preserve the distinction between a | |
1943 | component that is undefined, meaning that its separator was not | |
1944 | present in the reference, and a component that is empty, meaning that | |
1945 | the separator was present and was immediately followed by the next | |
1946 | component separator or the end of the reference. | |
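
The same procedure can be written as a short Python sketch. The function name "recompose" and the use of None to represent an undefined component are assumptions of this example, chosen so that an undefined component and an empty one remain distinguishable.

```python
from typing import Optional


def recompose(scheme: Optional[str], authority: Optional[str], path: str,
              query: Optional[str], fragment: Optional[str]) -> str:
    """Rebuild a URI reference; None means the component is undefined."""
    result = ""
    if scheme is not None:
        result += scheme + ":"
    if authority is not None:
        result += "//" + authority
    result += path                  # the path is always defined, possibly empty
    if query is not None:
        result += "?" + query       # an empty query still emits "?"
    if fragment is not None:
        result += "#" + fragment    # an empty fragment still emits "#"
    return result
```

With this representation, a reference whose query is empty recomposes to "g?", whereas one whose query is undefined recomposes to "g".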
1947 | ||
1948 | 5.4. Reference Resolution Examples | |
1949 | ||
1950 | Within a representation with a well-defined base URI of | |
1951 | ||
1952 | http://a/b/c/d;p?q | |
1953 | ||
1954 | a relative reference is transformed to its target URI as follows. | |
1955 | ||
1967 | 5.4.1. Normal Examples | |
1968 | ||
1969 | "g:h" = "g:h" | |
1970 | "g" = "http://a/b/c/g" | |
1971 | "./g" = "http://a/b/c/g" | |
1972 | "g/" = "http://a/b/c/g/" | |
1973 | "/g" = "http://a/g" | |
1974 | "//g" = "http://g" | |
1975 | "?y" = "http://a/b/c/d;p?y" | |
1976 | "g?y" = "http://a/b/c/g?y" | |
1977 | "#s" = "http://a/b/c/d;p?q#s" | |
1978 | "g#s" = "http://a/b/c/g#s" | |
1979 | "g?y#s" = "http://a/b/c/g?y#s" | |
1980 | ";x" = "http://a/b/c/;x" | |
1981 | "g;x" = "http://a/b/c/g;x" | |
1982 | "g;x?y#s" = "http://a/b/c/g;x?y#s" | |
1983 | "" = "http://a/b/c/d;p?q" | |
1984 | "." = "http://a/b/c/" | |
1985 | "./" = "http://a/b/c/" | |
1986 | ".." = "http://a/b/" | |
1987 | "../" = "http://a/b/" | |
1988 | "../g" = "http://a/b/g" | |
1989 | "../.." = "http://a/" | |
1990 | "../../" = "http://a/" | |
1991 | "../../g" = "http://a/g" | |
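
One way to exercise a resolver against a selection of these examples is sketched below in Python. The sketch uses urljoin() from the standard urllib.parse module as the resolver under test. That a particular Python release reproduces every row is an assumption to be verified, not something this specification guarantees. The names BASE, NORMAL_EXAMPLES, and mismatches are hypothetical.

```python
from urllib.parse import urljoin

BASE = "http://a/b/c/d;p?q"

# A selection of the normal examples: relative reference -> expected target.
NORMAL_EXAMPLES = {
    "g:h": "g:h",
    "g": "http://a/b/c/g",
    "./g": "http://a/b/c/g",
    "g/": "http://a/b/c/g/",
    "/g": "http://a/g",
    "//g": "http://g",
    "?y": "http://a/b/c/d;p?y",
    "g?y": "http://a/b/c/g?y",
    "#s": "http://a/b/c/d;p?q#s",
    "g#s": "http://a/b/c/g#s",
    ";x": "http://a/b/c/;x",
    "": "http://a/b/c/d;p?q",
    ".": "http://a/b/c/",
    "..": "http://a/b/",
    "../g": "http://a/b/g",
    "../..": "http://a/",
    "../../g": "http://a/g",
}


def mismatches(base: str, examples: dict) -> dict:
    """Return {reference: actual} for every example the resolver gets wrong."""
    wrong = {}
    for reference, expected in examples.items():
        actual = urljoin(base, reference)
        if actual != expected:
            wrong[reference] = actual
    return wrong
```

An empty result from mismatches(BASE, NORMAL_EXAMPLES) indicates agreement with the rows listed in the table.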
1992 | ||
1993 | 5.4.2. Abnormal Examples | |
1994 | ||
1995 | Although the following abnormal examples are unlikely to occur in | |
1996 | normal practice, all URI parsers should be capable of resolving them | |
1997 | consistently. Each example uses the same base as that above. | |
1998 | ||
1999 | Parsers must be careful in handling cases where there are more ".." | |
2000 | segments in a relative-path reference than there are hierarchical | |
2001 | levels in the base URI's path. Note that the ".." syntax cannot be | |
2002 | used to change the authority component of a URI. | |
2003 | ||
2004 | "../../../g" = "http://a/g" | |
2005 | "../../../../g" = "http://a/g" | |
2006 | ||
2023 | Similarly, parsers must remove the dot-segments "." and ".." when | |
2024 | they are complete components of a path, but not when they are only | |
2025 | part of a segment. | |
2026 | ||
2027 | "/./g" = "http://a/g" | |
2028 | "/../g" = "http://a/g" | |
2029 | "g." = "http://a/b/c/g." | |
2030 | ".g" = "http://a/b/c/.g" | |
2031 | "g.." = "http://a/b/c/g.." | |
2032 | "..g" = "http://a/b/c/..g" | |
2033 | ||
2034 | Less likely are cases where the relative reference uses unnecessary | |
2035 | or nonsensical forms of the "." and ".." complete path segments. | |
2036 | ||
2037 | "./../g" = "http://a/b/g" | |
2038 | "./g/." = "http://a/b/c/g/" | |
2039 | "g/./h" = "http://a/b/c/g/h" | |
2040 | "g/../h" = "http://a/b/c/h" | |
2041 | "g;x=1/./y" = "http://a/b/c/g;x=1/y" | |
2042 | "g;x=1/../y" = "http://a/b/c/y" | |
2043 | ||
2044 | Some applications fail to separate the reference's query and/or | |
2045 | fragment components from the path component before merging it with | |
2046 | the base path and removing dot-segments. This error is rarely | |
2047 | noticed, as typical usage of a fragment never includes the hierarchy | |
2048 | ("/") character and the query component is not normally used within | |
2049 | relative references. | |
2050 | ||
2051 | "g?y/./x" = "http://a/b/c/g?y/./x" | |
2052 | "g?y/../x" = "http://a/b/c/g?y/../x" | |
2053 | "g#s/./x" = "http://a/b/c/g#s/./x" | |
2054 | "g#s/../x" = "http://a/b/c/g#s/../x" | |
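
The separation step that such applications omit can be sketched in a few lines of Python. The function name "split_reference" is hypothetical, and the sketch handles only the query and fragment delimiters, leaving any scheme or authority inside the first element of the result.

```python
from typing import Optional, Tuple


def split_reference(ref: str) -> Tuple[str, Optional[str], Optional[str]]:
    """Split off fragment, then query; None marks an undefined component."""
    # The fragment starts at the first "#", so it is removed first.
    rest, hash_mark, fragment = ref.partition("#")
    # The query starts at the first "?" of what remains.
    path_part, question_mark, query = rest.partition("?")
    return (path_part,
            query if question_mark else None,
            fragment if hash_mark else None)
```

Only the first element of the result would then be merged with the base path and passed to dot-segment removal, so that "y/../x" in a query or "s/./x" in a fragment is left untouched.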
2055 | ||
2056 | Some parsers allow the scheme name to be present in a relative | |
2057 | reference if it is the same as the base URI scheme. This is | |
2058 | considered to be a loophole in prior specifications of partial URI | |
2059 | [RFC1630]. Its use should be avoided but is allowed for backward | |
2060 | compatibility. | |
2061 | ||
2062 | "http:g" = "http:g" ; for strict parsers | |
2063 | / "http://a/b/c/g" ; for backward compatibility | |
2064 | ||
2079 | 6. Normalization and Comparison | |
2080 | ||
2081 | One of the most common operations on URIs is simple comparison: | |
2082 | determining whether two URIs are equivalent without using the URIs to | |
2083 | access their respective resource(s). A comparison is performed every | |
2084 | time a response cache is accessed, a browser checks its history to | |
2085 | color a link, or an XML parser processes tags within a namespace. | |
2086 | Extensive normalization prior to comparison of URIs is often used by | |
2087 | spiders and indexing engines to prune a search space or to reduce | |
2088 | duplication of request actions and response storage. | |
2089 | ||
2090 | URI comparison is performed for some particular purpose. Protocols | |
2091 | or implementations that compare URIs for different purposes will | |
2092 | often be subject to differing design trade-offs with regard to how | |
2093 | much effort should be spent in reducing aliased identifiers. This | |
2094 | section describes various methods that may be used to compare URIs, | |
2095 | the trade-offs between them, and the types of applications that might | |
2096 | use them. | |
2097 | ||
2098 | 6.1. Equivalence | |
2099 | ||
2100 | Because URIs exist to identify resources, presumably they should be | |
2101 | considered equivalent when they identify the same resource. However, | |
2102 | this definition of equivalence is not of much practical use, as there | |
2103 | is no way for an implementation to compare two resources unless it | |
2104 | has full knowledge or control of them. For this reason, | |
2105 | determination of equivalence or difference of URIs is based on string | |
2106 | comparison, perhaps augmented by reference to additional rules | |
2107 | provided by URI scheme definitions. We use the terms "different" and | |
2108 | "equivalent" to describe the possible outcomes of such comparisons, | |
2109 | but there are many application-dependent versions of equivalence. | |
2110 | ||
2111 | Even though it is possible to determine that two URIs are equivalent, | |
2112 | URI comparison is not sufficient to determine whether two URIs | |
2113 | identify different resources. For example, an owner of two different | |
2114 | domain names could decide to serve the same resource from both, | |
2115 | resulting in two different URIs. Therefore, comparison methods are | |
2116 | designed to minimize false negatives while strictly avoiding false | |
2117 | positives. | |
2118 | ||
2119 | In testing for equivalence, applications should not directly compare | |
2120 | relative references; the references should be converted to their | |
2121 | respective target URIs before comparison. When URIs are compared to | |
2122 | select (or avoid) a network action, such as retrieval of a | |
2123 | representation, fragment components (if any) should be excluded from | |
2124 | the comparison. | |
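
As a hedged illustration of this advice, the Python sketch below resolves two references against a base and drops their fragments before comparing. It relies on urljoin() and urldefrag() from the standard urllib.parse module. The function name "same_retrieval_target" is hypothetical, and simple string equality stands in for whichever comparison method the application actually uses.

```python
from urllib.parse import urldefrag, urljoin


def same_retrieval_target(base: str, ref_a: str, ref_b: str) -> bool:
    """Compare two references by target URI, ignoring fragments."""
    # Convert each reference to its target URI first.
    target_a = urljoin(base, ref_a)
    target_b = urljoin(base, ref_b)
    # Exclude the fragment, since it does not affect the network action.
    return urldefrag(target_a).url == urldefrag(target_b).url
```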
2125 | ||
2135 | 6.2. Comparison Ladder | |
2136 | ||
2137 | A variety of methods are used in practice to test URI equivalence. | |
2138 | These methods fall into a range, distinguished by the amount of | |
2139 | processing required and the degree to which the probability of false | |
2140 | negatives is reduced. As noted above, false negatives cannot be | |
2141 | eliminated. In practice, their probability can be reduced, but this | |
2142 | reduction requires more processing and is not cost-effective for all | |
2143 | applications. | |
2144 | ||
2145 | If this range of comparison practices is considered as a ladder, the | |
2146 | following discussion will climb the ladder, starting with practices | |
2147 | that are cheap but have a relatively higher chance of producing false | |
2148 | negatives, and proceeding to those that have higher computational | |
2149 | cost and lower risk of false negatives. | |
2150 | ||
2151 | 6.2.1. Simple String Comparison | |
2152 | ||
2153 | If two URIs, when considered as character strings, are identical, | |
2154 | then it is safe to conclude that they are equivalent. This type of | |
2155 | equivalence test has very low computational cost and is in wide use | |
2156 | in a variety of applications, particularly in the domain of parsing. | |
2157 | ||
2158 | Testing strings for equivalence requires some basic precautions. | |
2159 | This procedure is often referred to as "bit-for-bit" or | |
2160 | "byte-for-byte" comparison, which is potentially misleading. Testing | |
2161 | strings for equality is normally based on pair comparison of the | |
2162 | characters that make up the strings, starting from the first and | |
2163 | proceeding until both strings are exhausted and all characters are | |
2164 | found to be equal, until a pair of characters compares unequal, or | |
2165 | until one of the strings is exhausted before the other. | |
2166 | ||
2167 | This character comparison requires that each pair of characters be | |
2168 | put in comparable form. For example, should one URI be stored in a | |
2169 | byte array in EBCDIC encoding and the second in a Java String object | |
2170 | (UTF-16), bit-for-bit comparisons applied naively will produce | |
2171 | errors. It is better to speak of equality on a character-for- | |
2172 | character basis rather than on a byte-for-byte or bit-for-bit basis. | |
2173 | In practical terms, character-by-character comparisons should be done | |
2174 | codepoint-by-codepoint after conversion to a common character | |
2175 | encoding. | |
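
The point can be made concrete with a small Python sketch. The codec names "cp037" (one EBCDIC code page) and "utf-16-le" are assumptions standing in for the encodings mentioned above, and the helper name "same_characters" is hypothetical.

```python
uri = "http://example.com/"

# The same characters stored under two different encodings.
ebcdic_bytes = uri.encode("cp037")
utf16_bytes = uri.encode("utf-16-le")


def same_characters(a: bytes, a_encoding: str,
                    b: bytes, b_encoding: str) -> bool:
    """Decode both inputs to Unicode strings, then compare by code point."""
    return a.decode(a_encoding) == b.decode(b_encoding)
```

Here the two byte sequences differ, yet same_characters(ebcdic_bytes, "cp037", utf16_bytes, "utf-16-le") compares the decoded characters and so is not misled by the storage format.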
2176 | ||
2177 | False negatives are caused by the production and use of URI aliases. | |
2178 | Unnecessary aliases can be reduced, regardless of the comparison | |
2179 | method, by consistently providing URI references in an already- | |
2180 | normalized form (i.e., a form identical to what would be produced | |
2181 | after normalization is applied, as described below). | |
2182 | ||
2191 | Protocols and data formats often limit some URI comparisons to simple | |
2192 | string comparison, based on the theory that people and | |
2193 | implementations will, in their own best interest, be consistent in | |
2194 | providing URI references, or at least consistent enough to negate any | |
2195 | efficiency that might be obtained from further normalization. | |
2196 | ||
2197 | 6.2.2. Syntax-Based Normalization | |
2198 | ||
2199 | Implementations may use logic based on the definitions provided by | |
2200 | this specification to reduce the probability of false negatives. | |
2201 | This processing is moderately higher in cost than character-for- | |
2202 | character string comparison. For example, an application using this | |
2203 | approach could reasonably consider the following two URIs equivalent: | |
2204 | ||
2205 | example://a/b/c/%7Bfoo%7D | |
2206 | eXAMPLE://a/./b/../b/%63/%7bfoo%7d | |
2207 | ||
2208 | Web user agents, such as browsers, typically apply this type of URI | |
2209 | normalization when determining whether a cached response is | |
2210 | available. Syntax-based normalization includes such techniques as | |
2211 | case normalization, percent-encoding normalization, and removal of | |
2212 | dot-segments. | |
2213 | ||
2214 | 6.2.2.1. Case Normalization | |
2215 | ||
2216 | For all URIs, the hexadecimal digits within a percent-encoding | |
2217 | triplet (e.g., "%3a" versus "%3A") are case-insensitive and therefore | |
2218 | should be normalized to use uppercase letters for the digits A-F. | |
2219 | ||
2220 | When a URI uses components of the generic syntax, the component | |
2221 | syntax equivalence rules always apply; namely, that the scheme and | |
2222 | host are case-insensitive and therefore should be normalized to | |
2223 | lowercase. For example, the URI <HTTP://www.EXAMPLE.com/> is | |
2224 | equivalent to <http://www.example.com/>. The other generic syntax | |
2225 | components are assumed to be case-sensitive unless specifically | |
2226 | defined otherwise by the scheme (see Section 6.2.3). | |
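
A minimal Python sketch of these two rules follows. It splits the URI with the regular expression given in Appendix B of this specification. The function name "normalize_case" is hypothetical, and treating everything after the last "@" of the authority as host and port is a simplifying assumption of the example.

```python
import re

# Component-splitting expression from Appendix B.
URI_PARTS = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")
PCT_TRIPLET = re.compile(r"%[0-9A-Fa-f]{2}")


def normalize_case(uri: str) -> str:
    """Lowercase scheme and host; uppercase hex digits in percent-encodings."""
    match = URI_PARTS.match(uri)
    scheme, authority, path, query, fragment = match.group(2, 4, 5, 7, 9)
    result = ""
    if scheme is not None:
        result += scheme.lower() + ":"
    if authority is not None:
        # Leave userinfo alone; only host (and the numeric port) is lowered.
        userinfo, at_sign, host_port = authority.rpartition("@")
        result += "//" + userinfo + at_sign + host_port.lower()
    result += path
    if query is not None:
        result += "?" + query
    if fragment is not None:
        result += "#" + fragment
    # Uppercase the hex digits of every percent-encoding triplet.
    return PCT_TRIPLET.sub(lambda m: m.group(0).upper(), result)
```

Under these assumptions, "HTTP://www.EXAMPLE.com/" becomes "http://www.example.com/", while the path of a URI such as "mailto:Joe@Example.COM" is left unchanged because the generic syntax treats it as case-sensitive.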
2227 | ||
2228 | 6.2.2.2. Percent-Encoding Normalization | |
2229 | ||
2230 | The percent-encoding mechanism (Section 2.1) is a frequent source of | |
2231 | variance among otherwise identical URIs. In addition to the case | |
2232 | normalization issue noted above, some URI producers percent-encode | |
2233 | octets that do not require percent-encoding, resulting in URIs that | |
2234 | are equivalent to their non-encoded counterparts. These URIs should | |
2235 | be normalized by decoding any percent-encoded octet that corresponds | |
2236 | to an unreserved character, as described in Section 2.3. | |
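
This rule can be sketched in Python as follows. The function name "decode_unreserved" is hypothetical. The unreserved set (ALPHA, DIGIT, "-", ".", "_", "~") is taken from Section 2.3, and all other percent-encodings are deliberately left as they are.

```python
import re
import string

UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
PCT_TRIPLET = re.compile(r"%([0-9A-Fa-f]{2})")


def decode_unreserved(uri: str) -> str:
    """Decode only those percent-encoded octets that map to unreserved chars."""
    def replace(match):
        char = chr(int(match.group(1), 16))
        # Reserved and non-ASCII octets must stay percent-encoded.
        return char if char in UNRESERVED else match.group(0)
    return PCT_TRIPLET.sub(replace, uri)
```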
2237 | ||
2247 | 6.2.2.3. Path Segment Normalization | |
2248 | ||
2249 | The complete path segments "." and ".." are intended only for use | |
2250 | within relative references (Section 4.1) and are removed as part of | |
2251 | the reference resolution process (Section 5.2). However, some | |
2252 | deployed implementations incorrectly assume that reference resolution | |
2253 | is not necessary when the reference is already a URI and thus fail to | |
2254 | remove dot-segments when they occur in non-relative paths. URI | |
2255 | normalizers should remove dot-segments by applying the | |
2256 | remove_dot_segments algorithm to the path, as described in | |
2257 | Section 5.2.4. | |
2258 | ||
2259 | 6.2.3. Scheme-Based Normalization | |
2260 | ||
2261 | The syntax and semantics of URIs vary from scheme to scheme, as | |
2262 | described by the defining specification for each scheme. | |
2263 | Implementations may use scheme-specific rules, at further processing | |
2264 | cost, to reduce the probability of false negatives. For example, | |
2265 | because the "http" scheme makes use of an authority component, has a | |
2266 | default port of "80", and defines an empty path to be equivalent to | |
2267 | "/", the following four URIs are equivalent: | |
2268 | ||
2269 | http://example.com | |
2270 | http://example.com/ | |
2271 | http://example.com:/ | |
2272 | http://example.com:80/ | |
2273 | ||
2274 | In general, a URI that uses the generic syntax for authority with an | |
2275 | empty path should be normalized to a path of "/". Likewise, an | |
2276 | explicit ":port", for which the port is empty or the default for the | |
2277 | scheme, is equivalent to one where the port and its ":" delimiter are | |
2278 | elided and thus should be removed by scheme-based normalization. For | |
2279 | example, the second URI above is the normal form for the "http" | |
2280 | scheme. | |
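
For the "http" scheme, the two rules just described might be sketched in Python as shown here. The function name "normalize_http" is hypothetical, the splitting expression is the one from Appendix B, and the sketch ignores every other normalization step (case, percent-encoding, dot-segments).

```python
import re

# Component-splitting expression from Appendix B.
URI_PARTS = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def normalize_http(uri: str) -> str:
    """Drop an empty or default port and turn an empty path into "/"."""
    match = URI_PARTS.match(uri)
    scheme, authority, path, query, fragment = match.group(2, 4, 5, 7, 9)
    if scheme is None or scheme.lower() != "http" or authority is None:
        return uri                       # not an http URI with an authority
    host, colon, port = authority.rpartition(":")
    if colon and port in ("", "80"):
        authority = host                 # elide ":" and the redundant port
    if path == "":
        path = "/"
    result = scheme + "://" + authority + path
    if query is not None:
        result += "?" + query            # an empty query keeps its "?"
    if fragment is not None:
        result += "#" + fragment
    return result
```

Each of the four URIs listed above is expected to map to "http://example.com/" under this sketch.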
2281 | ||
2282 | Another case where normalization varies by scheme is in the handling | |
2283 | of an empty authority component or empty host subcomponent. For many | |
2284 | scheme specifications, an empty authority or host is considered an | |
2285 | error; for others, it is considered equivalent to "localhost" or the | |
2286 | end-user's host. When a scheme defines a default for authority and a | |
2287 | URI reference to that default is desired, the reference should be | |
2288 | normalized to an empty authority for the sake of uniformity, brevity, | |
2289 | and internationalization. If, however, either the userinfo or port | |
2290 | subcomponents are non-empty, then the host should be given explicitly | |
2291 | even if it matches the default. | |
2292 | ||
2293 | Normalization should not remove delimiters when their associated | |
2294 | component is empty unless licensed to do so by the scheme | |
2303 | specification. For example, the URI "http://example.com/?" cannot be | |
2304 | assumed to be equivalent to any of the examples above. Likewise, the | |
2305 | presence or absence of delimiters within a userinfo subcomponent is | |
2306 | usually significant to its interpretation. The fragment component is | |
2307 | not subject to any scheme-based normalization; thus, two URIs that | |
2308 | differ only by the suffix "#" are considered different regardless of | |
2309 | the scheme. | |
2310 | ||
2311 | Some schemes define additional subcomponents that consist of case- | |
2312 | insensitive data, giving an implicit license to normalizers to | |
2313 | convert this data to a common case (e.g., all lowercase). For | |
2314 | example, URI schemes that define a subcomponent of path to contain an | |
2315 | Internet hostname, such as the "mailto" URI scheme, cause that | |
2316 | subcomponent to be case-insensitive and thus subject to case | |
2317 | normalization (e.g., "mailto:Joe@Example.COM" is equivalent to | |
2318 | "mailto:Joe@example.com", even though the generic syntax considers | |
2319 | the path component to be case-sensitive). | |
2320 | ||
2321 | Other scheme-specific normalizations are possible. | |
2322 | ||
2323 | 6.2.4. Protocol-Based Normalization | |
2324 | ||
2325 | Substantial effort to reduce the incidence of false negatives is | |
2326 | often cost-effective for web spiders. Therefore, they implement even | |
2327 | more aggressive techniques in URI comparison. For example, if they | |
2328 | observe that a URI such as | |
2329 | ||
2330 | http://example.com/data | |
2331 | ||
2332 | redirects to a URI differing only in the trailing slash | |
2333 | ||
2334 | http://example.com/data/ | |
2335 | ||
2336 | they will likely regard the two as equivalent in the future. This | |
2337 | kind of technique is only appropriate when equivalence is clearly | |
2338 | indicated by both the result of accessing the resources and the | |
2339 | common conventions of their scheme's dereference algorithm (in this | |
2340 | case, use of redirection by HTTP origin servers to avoid problems | |
2341 | with relative references). | |
2342 | ||
2359 | 7. Security Considerations | |
2360 | ||
2361 | A URI does not in itself pose a security threat. However, as URIs | |
2362 | are often used to provide a compact set of instructions for access to | |
2363 | network resources, care must be taken to properly interpret the data | |
2364 | within a URI, to prevent that data from causing unintended access, | |
2365 | and to avoid including data that should not be revealed in plain | |
2366 | text. | |
2367 | ||
2368 | 7.1. Reliability and Consistency | |
2369 | ||
2370 | There is no guarantee that once a URI has been used to retrieve | |
2371 | information, the same information will be retrievable by that URI in | |
2372 | the future. Nor is there any guarantee that the information | |
2373 | retrievable via that URI in the future will be observably similar to | |
2374 | that retrieved in the past. The URI syntax does not constrain how a | |
2375 | given scheme or authority apportions its namespace or maintains it | |
2376 | over time. Such guarantees can only be obtained from the person(s) | |
2377 | controlling that namespace and the resource in question. A specific | |
2378 | URI scheme may define additional semantics, such as name persistence, | |
2379 | if those semantics are required of all naming authorities for that | |
2380 | scheme. | |
2381 | ||
2382 | 7.2. Malicious Construction | |
2383 | ||
2384 | It is sometimes possible to construct a URI so that an attempt to | |
2385 | perform a seemingly harmless, idempotent operation, such as the | |
2386 | retrieval of a representation, will in fact cause a possibly damaging | |
2387 | remote operation. The unsafe URI is typically constructed by | |
2388 | specifying a port number other than that reserved for the network | |
2389 | protocol in question. The client unwittingly contacts a site running | |
2390 | a different protocol service, and data within the URI contains | |
2391 | instructions that, when interpreted according to this other protocol, | |
2392 | cause an unexpected operation. A frequent example of such abuse has | |
2393 | been the use of a protocol-based scheme with a port component of | |
2394 | "25", thereby fooling user agent software into sending an unintended | |
2395 | or impersonating message via an SMTP server. | |
2396 | ||
2397 | Applications should prevent dereference of a URI that specifies a TCP | |
2398 | port number within the "well-known port" range (0 - 1023) unless the | |
2399 | protocol being used to dereference that URI is compatible with the | |
2400 | protocol expected on that well-known port. Although IANA maintains a | |
2401 | registry of well-known ports, applications should make such | |
2402 | restrictions user-configurable to avoid preventing the deployment of | |
2403 | new services. | |
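
A check of this kind could look like the following Python sketch. The table DEFAULT_ALLOWED and the function "may_dereference" are hypothetical. The port assignments shown are only examples, and a real application would let the user extend or override them.

```python
from typing import Mapping, Optional, Set

# Example table: scheme -> well-known ports considered compatible with it.
DEFAULT_ALLOWED = {"http": {80}, "https": {443}, "ftp": {21}}


def may_dereference(scheme: str, port: Optional[int],
                    allowed: Optional[Mapping[str, Set[int]]] = None) -> bool:
    """Refuse well-known ports that do not match the scheme's protocol."""
    table = DEFAULT_ALLOWED if allowed is None else allowed
    if port is None or port > 1023:
        return True                      # outside the well-known range
    return port in table.get(scheme.lower(), set())
```

With the default table, an "http" URI naming port 25 would be refused, whereas a user who has deliberately configured {"http": {80, 25}} would be allowed to proceed.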
2404 | ||
2415 | When a URI contains percent-encoded octets that match the delimiters | |
2416 | for a given resolution or dereference protocol (for example, CR and | |
2417 | LF characters for the TELNET protocol), these percent-encodings must | |
2418 | not be decoded before transmission across that protocol. Transfer of | |
2419 | the percent-encoding, which might violate the protocol, is less | |
2420 | harmful than allowing decoded octets to be interpreted as additional | |
2421 | operations or parameters, perhaps triggering an unexpected and | |
2422 | possibly harmful remote operation. | |
2423 | ||
2424 | 7.3. Back-End Transcoding | |
2425 | ||
2426 | When a URI is dereferenced, the data within it is often parsed by | |
2427 | both the user agent and one or more servers. In HTTP, for example, a | |
2428 | typical user agent will parse a URI into its five major components, | |
2429 | access the authority's server, and send it the data within the | |
2430 | authority, path, and query components. A typical server will take | |
2431 | that information, parse the path into segments and the query into | |
2432 | key/value pairs, and then invoke implementation-specific handlers to | |
2433 | respond to the request. As a result, a common security concern for | |
2434 | server implementations that handle a URI, either as a whole or split | |
2435 | into separate components, is proper interpretation of the octet data | |
2436 | represented by the characters and percent-encodings within that URI. | |
2437 | ||
2438 | Percent-encoded octets must be decoded at some point during the | |
2439 | dereference process. Applications must split the URI into its | |
2440 | components and subcomponents prior to decoding the octets, as | |
2441 | otherwise the decoded octets might be mistaken for delimiters. | |
2442 | Security checks of the data within a URI should be applied after | |
2443 | decoding the octets. Note, however, that the "%00" percent-encoding | |
2444 | (NUL) may require special handling and should be rejected if the | |
2445 | application is not expecting to receive raw data within a component. | |
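
The ordering described above (split, then decode, then check) is illustrated by this Python sketch, which uses unquote() from the standard urllib.parse module. The function name "path_segments" is hypothetical, and rejecting NUL outright is an assumption suited to applications that do not expect raw data in a path.

```python
from urllib.parse import unquote


def path_segments(raw_path: str) -> list:
    """Split a path on "/" first, then percent-decode each segment."""
    decoded = [unquote(segment) for segment in raw_path.split("/")]
    for segment in decoded:
        # Security checks run on the decoded data.
        if "\x00" in segment:
            raise ValueError("NUL octet in path segment")
    return decoded
```

For the path "/a%2Fb/c", splitting first keeps "a/b" as a single segment. Decoding the whole path before splitting would instead produce three segments, because the decoded "/" would be mistaken for a delimiter.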
2446 | ||
2447 | Special care should be taken when the URI path interpretation process | |
2448 | involves the use of a back-end file system or related system | |
2449 | functions. File systems typically assign an operational meaning to | |
2450 | special characters, such as the "/", "\", ":", "[", and "]" | |
2451 | characters, and to special device names like ".", "..", "...", "aux", | |
2452 | "lpt", etc. In some cases, merely testing for the existence of such | |
2453 | a name will cause the operating system to pause or invoke unrelated | |
2454 | system calls, leading to significant security concerns regarding | |
2455 | denial of service and unintended data transfer. It would be | |
2456 | impossible for this specification to list all such significant | |
2457 | characters and device names. Implementers should research the | |
2458 | reserved names and characters for the types of storage device that | |
2459 | may be attached to their applications and restrict the use of data | |
2460 | obtained from URI components accordingly. | |
2461 | ||
2471 | 7.4. Rare IP Address Formats | |
2472 | ||
2473 | Although the URI syntax for IPv4address only allows the common | |
2474 | dotted-decimal form of IPv4 address literal, many implementations | |
2475 | that process URIs make use of platform-dependent system routines, | |
2476 | such as gethostbyname() and inet_aton(), to translate the string | |
2477 | literal to an actual IP address. Unfortunately, such system routines | |
2478 | often allow and process a much larger set of formats than those | |
2479 | described in Section 3.2.2. | |
2480 | ||
2481 | For example, many implementations allow dotted forms of three | |
2482 | numbers, wherein the last part is interpreted as a 16-bit quantity | |
2483 | and placed in the right-most two bytes of the network address (e.g., | |
2484 | a Class B network). Likewise, a dotted form of two numbers means | |
2485 | that the last part is interpreted as a 24-bit quantity and placed in | |
2486 | the right-most three bytes of the network address (Class A), and a | |
2487 | single number (without dots) is interpreted as a 32-bit quantity and | |
2488 | stored directly in the network address. Adding further to the | |
2489 | confusion, some implementations allow each dotted part to be | |
2490 | interpreted as decimal, octal, or hexadecimal, as specified in the C | |
2491 | language (i.e., a leading 0x or 0X implies hexadecimal; a leading 0 | |
2492 | implies octal; otherwise, the number is interpreted as decimal). | |
2493 | ||
2494 | These additional IP address formats are not allowed in the URI syntax | |
2495 | due to differences between platform implementations. However, they | |
2496 | can become a security concern if an application attempts to filter | |
2497 | access to resources based on the IP address in string literal format. | |
2498 | If this filtering is performed, literals should be converted to | |
2499 | numeric form and filtered based on the numeric value, and not on a | |
2500 | prefix or suffix of the string form. | |
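
To show why filtering on the string form is fragile, the Python sketch below parses the rare formats described above into a 32-bit number and then filters on that number. The parser "parse_loose_ipv4" and the filter "is_loopback" are hypothetical names. The parser mimics the behavior attributed above to routines such as inet_aton() and does not call them, and actual platform routines may differ in detail.

```python
import ipaddress
import re

_HEX = re.compile(r"0[xX][0-9a-fA-F]+")
_OCT = re.compile(r"0[0-7]*")
_DEC = re.compile(r"[1-9][0-9]*")


def _part_value(part: str) -> int:
    """Interpret one dotted part as hexadecimal, octal, or decimal."""
    if _HEX.fullmatch(part):
        return int(part, 16)
    if _OCT.fullmatch(part):
        return int(part, 8)
    if _DEC.fullmatch(part):
        return int(part, 10)
    raise ValueError("not a number: %r" % part)


def parse_loose_ipv4(text: str) -> int:
    """Return the 32-bit value of a 1- to 4-part IPv4 literal."""
    parts = text.split(".")
    if not 1 <= len(parts) <= 4:
        raise ValueError("wrong number of parts")
    *leading, last = [_part_value(p) for p in parts]
    last_bits = 8 * (4 - len(leading))   # the last part fills the remainder
    if any(v > 0xFF for v in leading) or last >= 1 << last_bits:
        raise ValueError("part out of range")
    address = 0
    for value in leading:
        address = (address << 8) | value
    return (address << last_bits) | last


def is_loopback(text: str) -> bool:
    """Filter on the numeric value rather than on a string prefix."""
    return ipaddress.IPv4Address(parse_loose_ipv4(text)).is_loopback
```

A filter that only rejected strings beginning with "127." would miss forms such as "2130706433" or "0x7f.1", which denote the same loopback address once converted to numeric form.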
2501 | ||
2502 | 7.5. Sensitive Information | |
2503 | ||
2504 | URI producers should not provide a URI that contains a username or | |
2505 | password that is intended to be secret. URIs are frequently | |
2506 | displayed by browsers, stored in clear text bookmarks, and logged by | |
2507 | user agent history and intermediary applications (proxies). A | |
2508 | password appearing within the userinfo component is deprecated and | |
2509 | should be considered an error (or simply ignored) except in those | |
2510 | rare cases where the 'password' parameter is intended to be public. | |
2511 | ||
2512 | 7.6. Semantic Attacks | |
2513 | ||
2514 | Because the userinfo subcomponent is rarely used and appears before | |
2515 | the host in the authority component, it can be used to construct a | |
2516 | URI intended to mislead a human user by appearing to identify one | |
2517 | (trusted) naming authority while actually identifying a different | |
2518 | authority hidden behind the noise. For example | |
2519 | ||
2527 | ftp://cnn.example.com&story=breaking_news@10.0.0.1/top_story.htm | |
2528 | ||
2529 | might lead a human user to assume that the host is 'cnn.example.com', | |
2530 | whereas it is actually '10.0.0.1'. Note that a misleading userinfo | |
2531 | subcomponent could be much longer than the example above. | |
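
A user agent can recover the components it would actually use by parsing the URI instead of reading it left to right. The Python sketch below does so with urlsplit() from the standard urllib.parse module. The function name "authority_parts" is hypothetical.

```python
from urllib.parse import urlsplit


def authority_parts(uri: str) -> dict:
    """Report the userinfo and host that a parser extracts from a URI."""
    parts = urlsplit(uri)
    return {"userinfo": parts.username, "host": parts.hostname}
```

For the example above, the parsed host is "10.0.0.1" and the text before the "@" is reported as userinfo, which is the distinction a user agent could surface when rendering the URI.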
2532 | ||
2533 | A misleading URI, such as that above, is an attack on the user's | |
2534 | preconceived notions about the meaning of a URI rather than an attack | |
2535 | on the software itself. User agents may be able to reduce the impact | |
2536 | of such attacks by distinguishing the various components of the URI | |
2537 | when they are rendered, such as by using a different color or tone to | |
2538 | render userinfo if any is present, though there is no panacea. More | |
2539 | information on URI-based semantic attacks can be found in [Siedzik]. | |
2540 | ||
2541 | 8. IANA Considerations | |
2542 | ||
2543 | URI scheme names, as defined by <scheme> in Section 3.1, form a | |
2544 | registered namespace that is managed by IANA according to the | |
2545 | procedures defined in [BCP35]. No IANA actions are required by this | |
2546 | document. | |
2547 | ||
2548 | 9. Acknowledgements | |
2549 | ||
2550 | This specification is derived from RFC 2396 [RFC2396], RFC 1808 | |
2551 | [RFC1808], and RFC 1738 [RFC1738]; the acknowledgements in those | |
2552 | documents still apply. It also incorporates the update (with | |
2553 | corrections) for IPv6 literals in the host syntax, as defined by | |
2554 | Robert M. Hinden, Brian E. Carpenter, and Larry Masinter in | |
2555 | [RFC2732]. In addition, contributions by Gisle Aas, Reese Anschultz, | |
2556 | Daniel Barclay, Tim Bray, Mike Brown, Rob Cameron, Jeremy Carroll, | |
2557 | Dan Connolly, Adam M. Costello, John Cowan, Jason Diamond, Martin | |
2558 | Duerst, Stefan Eissing, Clive D.W. Feather, Al Gilman, Tony Hammond, | |
2559 | Elliotte Harold, Pat Hayes, Henry Holtzman, Ian B. Jacobs, Michael | |
2560 | Kay, John C. Klensin, Graham Klyne, Dan Kohn, Bruce Lilly, Andrew | |
2561 | Main, Dave McAlpin, Ira McDonald, Michael Mealling, Ray Merkert, | |
2562 | Stephen Pollei, Julian Reschke, Tomas Rokicki, Miles Sabin, Kai | |
2563 | Schaetzl, Mark Thomson, Ronald Tschalaer, Norm Walsh, Marc Warne, | |
2564 | Stuart Williams, and Henry Zongaro are gratefully acknowledged. | |
2565 | ||
2566 | 10. References | |
2567 | ||
2568 | 10.1. Normative References | |
2569 | ||
2570 | [ASCII] American National Standards Institute, "Coded Character | |
2571 | Set -- 7-bit American Standard Code for Information | |
2572 | Interchange", ANSI X3.4, 1986. | |
2573 | ||
2583 | [RFC2234] Crocker, D. and P. Overell, "Augmented BNF for Syntax | |
2584 | Specifications: ABNF", RFC 2234, November 1997. | |
2585 | ||
2586 | [STD63] Yergeau, F., "UTF-8, a transformation format of | |
2587 | ISO 10646", STD 63, RFC 3629, November 2003. | |
2588 | ||
2589 | [UCS] International Organization for Standardization, | |
2590 | "Information Technology - Universal Multiple-Octet Coded | |
2591 | Character Set (UCS)", ISO/IEC 10646:2003, December 2003. | |
2592 | ||
2593 | 10.2. Informative References | |
2594 | ||
2595 | [BCP19] Freed, N. and J. Postel, "IANA Charset Registration | |
2596 | Procedures", BCP 19, RFC 2978, October 2000. | |
2597 | ||
2598 | [BCP35] Petke, R. and I. King, "Registration Procedures for URL | |
2599 | Scheme Names", BCP 35, RFC 2717, November 1999. | |
2600 | ||
2601 | [RFC0952] Harrenstien, K., Stahl, M., and E. Feinler, "DoD Internet | |
2602 | host table specification", RFC 952, October 1985. | |
2603 | ||
2604 | [RFC1034] Mockapetris, P., "Domain names - concepts and facilities", | |
2605 | STD 13, RFC 1034, November 1987. | |
2606 | ||
2607 | [RFC1123] Braden, R., "Requirements for Internet Hosts - Application | |
2608 | and Support", STD 3, RFC 1123, October 1989. | |
2609 | ||
2610 | [RFC1535] Gavron, E., "A Security Problem and Proposed Correction | |
2611 | With Widely Deployed DNS Software", RFC 1535, | |
2612 | October 1993. | |
2613 | ||
2614 | [RFC1630] Berners-Lee, T., "Universal Resource Identifiers in WWW: A | |
2615 | Unifying Syntax for the Expression of Names and Addresses | |
2616 | of Objects on the Network as used in the World-Wide Web", | |
2617 | RFC 1630, June 1994. | |
2618 | ||
2619 | [RFC1736] Kunze, J., "Functional Recommendations for Internet | |
2620 | Resource Locators", RFC 1736, February 1995. | |
2621 | ||
2622 | [RFC1737] Sollins, K. and L. Masinter, "Functional Requirements for | |
2623 | Uniform Resource Names", RFC 1737, December 1994. | |
2624 | ||
2625 | [RFC1738] Berners-Lee, T., Masinter, L., and M. McCahill, "Uniform | |
2626 | Resource Locators (URL)", RFC 1738, December 1994. | |
2627 | ||
2628 | [RFC1808] Fielding, R., "Relative Uniform Resource Locators", | |
2629 | RFC 1808, June 1995. | |
2630 | ||
2639 | [RFC2046] Freed, N. and N. Borenstein, "Multipurpose Internet Mail | |
2640 | Extensions (MIME) Part Two: Media Types", RFC 2046, | |
2641 | November 1996. | |
2642 | ||
2643 | [RFC2141] Moats, R., "URN Syntax", RFC 2141, May 1997. | |
2644 | ||
2645 | [RFC2396] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform | |
2646 | Resource Identifiers (URI): Generic Syntax", RFC 2396, | |
2647 | August 1998. | |
2648 | ||
2649 | [RFC2518] Goland, Y., Whitehead, E., Faizi, A., Carter, S., and D. | |
2650 | Jensen, "HTTP Extensions for Distributed Authoring -- | |
2651 | WEBDAV", RFC 2518, February 1999. | |
2652 | ||
2653 | [RFC2557] Palme, J., Hopmann, A., and N. Shelness, "MIME | |
2654 | Encapsulation of Aggregate Documents, such as HTML | |
2655 | (MHTML)", RFC 2557, March 1999. | |
2656 | ||
2657 | [RFC2718] Masinter, L., Alvestrand, H., Zigmond, D., and R. Petke, | |
2658 | "Guidelines for new URL Schemes", RFC 2718, November 1999. | |
2659 | ||
2660 | [RFC2732] Hinden, R., Carpenter, B., and L. Masinter, "Format for | |
2661 | Literal IPv6 Addresses in URL's", RFC 2732, December 1999. | |
2662 | ||
2663 | [RFC3305] Mealling, M. and R. Denenberg, "Report from the Joint | |
2664 | W3C/IETF URI Planning Interest Group: Uniform Resource | |
2665 | Identifiers (URIs), URLs, and Uniform Resource Names | |
2666 | (URNs): Clarifications and Recommendations", RFC 3305, | |
2667 | August 2002. | |
2668 | ||
2669 | [RFC3490] Faltstrom, P., Hoffman, P., and A. Costello, | |
2670 | "Internationalizing Domain Names in Applications (IDNA)", | |
2671 | RFC 3490, March 2003. | |
2672 | ||
2673 | [RFC3513] Hinden, R. and S. Deering, "Internet Protocol Version 6 | |
2674 | (IPv6) Addressing Architecture", RFC 3513, April 2003. | |
2675 | ||
2676 | [Siedzik] Siedzik, R., "Semantic Attacks: What's in a URL?", | |
2677 | April 2001, <http://www.giac.org/practical/gsec/ | |
2678 | Richard_Siedzik_GSEC.pdf>. | |
2679 | ||
2695 | Appendix A. Collected ABNF for URI | |
2696 | ||
2697 | URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ] | |
2698 | ||
2699 | hier-part = "//" authority path-abempty | |
2700 | / path-absolute | |
2701 | / path-rootless | |
2702 | / path-empty | |
2703 | ||
2704 | URI-reference = URI / relative-ref | |
2705 | ||
2706 | absolute-URI = scheme ":" hier-part [ "?" query ] | |
2707 | ||
2708 | relative-ref = relative-part [ "?" query ] [ "#" fragment ] | |
2709 | ||
2710 | relative-part = "//" authority path-abempty | |
2711 | / path-absolute | |
2712 | / path-noscheme | |
2713 | / path-empty | |
2714 | ||
2715 | scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ) | |
2716 | ||
2717 | authority = [ userinfo "@" ] host [ ":" port ] | |
2718 | userinfo = *( unreserved / pct-encoded / sub-delims / ":" ) | |
2719 | host = IP-literal / IPv4address / reg-name | |
2720 | port = *DIGIT | |
2721 | ||
2722 | IP-literal = "[" ( IPv6address / IPvFuture ) "]" | |
2723 | ||
2724 | IPvFuture = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" ) | |
2725 | ||
2726 | IPv6address = 6( h16 ":" ) ls32 | |
2727 | / "::" 5( h16 ":" ) ls32 | |
2728 | / [ h16 ] "::" 4( h16 ":" ) ls32 | |
2729 | / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32 | |
2730 | / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32 | |
2731 | / [ *3( h16 ":" ) h16 ] "::" h16 ":" ls32 | |
2732 | / [ *4( h16 ":" ) h16 ] "::" ls32 | |
2733 | / [ *5( h16 ":" ) h16 ] "::" h16 | |
2734 | / [ *6( h16 ":" ) h16 ] "::" | |
2735 | ||
2736 | h16 = 1*4HEXDIG | |
2737 | ls32 = ( h16 ":" h16 ) / IPv4address | |
2738 | IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet | |
2739 | ||
2751 | dec-octet = DIGIT ; 0-9 | |
2752 | / %x31-39 DIGIT ; 10-99 | |
2753 | / "1" 2DIGIT ; 100-199 | |
2754 | / "2" %x30-34 DIGIT ; 200-249 | |
2755 | / "25" %x30-35 ; 250-255 | |
2756 | ||
2757 | reg-name = *( unreserved / pct-encoded / sub-delims ) | |
2758 | ||
2759 | path = path-abempty ; begins with "/" or is empty | |
2760 | / path-absolute ; begins with "/" but not "//" | |
2761 | / path-noscheme ; begins with a non-colon segment | |
2762 | / path-rootless ; begins with a segment | |
2763 | / path-empty ; zero characters | |
2764 | ||
2765 | path-abempty = *( "/" segment ) | |
2766 | path-absolute = "/" [ segment-nz *( "/" segment ) ] | |
2767 | path-noscheme = segment-nz-nc *( "/" segment ) | |
2768 | path-rootless = segment-nz *( "/" segment ) | |
2769 | path-empty = 0<pchar> | |
2770 | ||
2771 | segment = *pchar | |
2772 | segment-nz = 1*pchar | |
2773 | segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" ) | |
2774 | ; non-zero-length segment without any colon ":" | |
2775 | ||
2776 | pchar = unreserved / pct-encoded / sub-delims / ":" / "@" | |
2777 | ||
2778 | query = *( pchar / "/" / "?" ) | |
2779 | ||
2780 | fragment = *( pchar / "/" / "?" ) | |
2781 | ||
2782 | pct-encoded = "%" HEXDIG HEXDIG | |
2783 | ||
2784 | unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" | |
2785 | reserved = gen-delims / sub-delims | |
2786 | gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@" | |
2787 | sub-delims = "!" / "$" / "&" / "'" / "(" / ")" | |
2788 | / "*" / "+" / "," / ";" / "=" | |
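
The dec-octet and IPv4address rules collected above limit each decimal
octet to the range 0-255 and exclude leading zeros. The following is a
minimal sketch, not a normative implementation, of how those two rules
might be transcribed into a Python regular expression. The names
DEC_OCTET, IPV4ADDRESS, and is_ipv4address are illustrative assumptions
and do not appear in this specification.

```python
import re

# Each alternative mirrors one line of the dec-octet rule:
# 250-255, 200-249, 100-199, 10-99, 0-9 (hypothetical constant name).
DEC_OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])"

# IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet
IPV4ADDRESS = re.compile(rf"{DEC_OCTET}(?:\.{DEC_OCTET}){{3}}")


def is_ipv4address(text: str) -> bool:
    """Return True if text matches the IPv4address rule in its entirety."""
    return IPV4ADDRESS.fullmatch(text) is not None
```

Because the whole string must match, a value such as "01.2.3.4" is
rejected: no dec-octet alternative permits a leading zero.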
2789 | ||
2790 | Appendix B. Parsing a URI Reference with a Regular Expression | |
2791 | ||
2792 | As the "first-match-wins" algorithm is identical to the "greedy" | |
2793 | disambiguation method used by POSIX regular expressions, it is | |
2794 | natural and commonplace to use a regular expression for parsing the | |
2795 | potential five components of a URI reference. | |
2796 | ||
2797 | The following line is the regular expression for breaking down a | |
2798 | well-formed URI reference into its components. | |
2799 | ||
2807 | ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))? | |
2808 |  12            3  4          5       6  7        8 9 | |
2809 | ||
2810 | The numbers in the second line above are only to assist readability; | |
2811 | they indicate the reference points for each subexpression (i.e., each | |
2812 | paired parenthesis). We refer to the value matched for subexpression | |
2813 | <n> as $<n>. For example, matching the above expression to | |
2814 | ||
2815 | http://www.ics.uci.edu/pub/ietf/uri/#Related | |
2816 | ||
2817 | results in the following subexpression matches: | |
2818 | ||
2819 | $1 = http: | |
2820 | $2 = http | |
2821 | $3 = //www.ics.uci.edu | |
2822 | $4 = www.ics.uci.edu | |
2823 | $5 = /pub/ietf/uri/ | |
2824 | $6 = <undefined> | |
2825 | $7 = <undefined> | |
2826 | $8 = #Related | |
2827 | $9 = Related | |
2828 | ||
2829 | where <undefined> indicates that the component is not present, as is | |
2830 | the case for the query component in the above example. Therefore, we | |
2831 | can determine the value of the five components as | |
2832 | ||
2833 | scheme = $2 | |
2834 | authority = $4 | |
2835 | path = $5 | |
2836 | query = $7 | |
2837 | fragment = $9 | |
2838 | ||
2839 | Going in the opposite direction, we can recreate a URI reference from | |
2840 | its components by using the algorithm of Section 5.3. | |
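
As an illustration only, the regular expression of this appendix can be
applied in Python as sketched below. The function names
split_uri_reference and recompose are assumptions made for this example;
recompose follows the component recomposition steps of Section 5.3. An
undefined component is represented by None, which keeps it distinct from
a component that is present but empty.

```python
import re

# The regular expression given in this appendix, unchanged.
URI_REFERENCE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_uri_reference(reference):
    """Split a URI reference into its five components ($2, $4, $5, $7, $9)."""
    match = URI_REFERENCE.match(reference)  # every group is optional
    return {
        "scheme": match.group(2),
        "authority": match.group(4),
        "path": match.group(5),
        "query": match.group(7),
        "fragment": match.group(9),
    }


def recompose(parts):
    """Rebuild a URI reference from its components (Section 5.3)."""
    result = ""
    if parts["scheme"] is not None:
        result += parts["scheme"] + ":"
    if parts["authority"] is not None:
        result += "//" + parts["authority"]
    result += parts["path"]
    if parts["query"] is not None:
        result += "?" + parts["query"]
    if parts["fragment"] is not None:
        result += "#" + parts["fragment"]
    return result
```

Applied to the example reference above, the query component is None and
the fragment is "Related"; recomposing the parts yields the original
string.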
2841 | ||
2842 | Appendix C. Delimiting a URI in Context | |
2843 | ||
2844 | URIs are often transmitted through formats that do not provide a | |
2845 | clear context for their interpretation. For example, there are many | |
2846 | occasions when a URI is included in plain text; examples include text | |
2847 | sent in email, USENET news, and on printed paper. In such cases, it | |
2848 | is important to be able to delimit the URI from the rest of the text, | |
2849 | and in particular from punctuation marks that might be mistaken for | |
2850 | part of the URI. | |
2851 | ||
2852 | In practice, URIs are delimited in a variety of ways, but usually | |
2853 | within double-quotes "http://example.com/", angle brackets | |
2854 | <http://example.com/>, or just by using whitespace: | |
2855 | ||
2863 | http://example.com/ | |
2864 | ||
2865 | These wrappers do not form part of the URI. | |
2866 | ||
2867 | In some cases, extra whitespace (spaces, line-breaks, tabs, etc.) may | |
2868 | have to be added to break a long URI across lines. The whitespace | |
2869 | should be ignored when the URI is extracted. | |
2870 | ||
2871 | No whitespace should be introduced after a hyphen ("-") character. | |
2872 | Because some typesetters and printers may (erroneously) introduce a | |
2873 | hyphen at the end of line when breaking it, the interpreter of a URI | |
2874 | containing a line break immediately after a hyphen should ignore all | |
2875 | whitespace around the line break and should be aware that the hyphen | |
2876 | may or may not actually be part of the URI. | |
2877 | ||
2878 | Using <> angle brackets around each URI is especially recommended as | |
2879 | a delimiting style for a reference that contains embedded whitespace. | |
2880 | ||
2881 | The prefix "URL:" (with or without a trailing space) was formerly | |
2882 | recommended as a way to help distinguish a URI from other bracketed | |
2883 | designators, though it is not commonly used in practice and is no | |
2884 | longer recommended. | |
2885 | ||
2886 | For robustness, software that accepts user-typed URIs should attempt | |
2887 | to recognize and strip both delimiters and embedded whitespace. | |
2888 | ||
2889 | For example, the text | |
2890 | ||
2891 | Yes, Jim, I found it under "http://www.w3.org/Addressing/", | |
2892 | but you can probably pick it up from <ftp://foo.example. | |
2893 | com/rfc/>. Note the warning in <http://www.ics.uci.edu/pub/ | |
2894 | ietf/uri/historical.html#WARNING>. | |
2895 | ||
2896 | contains the URI references | |
2897 | ||
2898 | http://www.w3.org/Addressing/ | |
2899 | ftp://foo.example.com/rfc/ | |
2900 | http://www.ics.uci.edu/pub/ietf/uri/historical.html#WARNING | |
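
The extraction described above might be sketched in Python as follows.
This is a simplified illustration under stated assumptions: it
recognizes only double-quote and angle-bracket delimiters, strips all
embedded whitespace, drops a leading "URL:" prefix, and keeps a
candidate only if it begins with something shaped like a scheme. The
name extract_delimited_uris is hypothetical, and the sketch does not
attempt the hyphenation heuristics discussed earlier in this appendix.

```python
import re

# A candidate is whatever sits between <...> or "..." (assumed delimiters).
DELIMITED = re.compile(r'<([^<>]*)>|"([^"]*)"')
WHITESPACE = re.compile(r"\s+")
# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ), followed by ":"
SCHEME_PREFIX = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*:")


def extract_delimited_uris(text):
    """Return delimited URIs found in text, with wrappers and whitespace removed."""
    uris = []
    for match in DELIMITED.finditer(text):
        candidate = match.group(1) if match.group(1) is not None else match.group(2)
        candidate = WHITESPACE.sub("", candidate)  # undo line breaks in long URIs
        if candidate.startswith("URL:"):
            candidate = candidate[4:]  # formerly recommended prefix
        if SCHEME_PREFIX.match(candidate):
            uris.append(candidate)
    return uris
```

Applied to the example text above, the function returns the three URI
references listed, in the order in which they appear.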
2901 | ||
2919 | Appendix D. Changes from RFC 2396 | |
2920 | ||
2921 | D.1. Additions | |
2922 | ||
2923 | An ABNF rule for URI has been introduced to correspond to one common | |
2924 | usage of the term: an absolute URI with optional fragment. | |
2925 | ||
2926 | IPv6 (and later) literals have been added to the list of possible | |
2927 | identifiers for the host portion of an authority component, as | |
2928 | described by [RFC2732], with the addition of "[" and "]" to the | |
2929 | reserved set and a version flag to anticipate future versions of IP | |
2930 | literals. Square brackets are now specified as reserved within the | |
2931 | authority component and are not allowed outside their use as | |
2932 | delimiters for an IP literal within host. In order to make this | |
2933 | change without changing the technical definition of the path, query, | |
2934 | and fragment components, those rules were redefined to directly | |
2935 | specify the characters allowed. | |
2936 | ||
2937 | As [RFC2732] defers to [RFC3513] for definition of an IPv6 literal | |
2938 | address, which, unfortunately, lacks an ABNF description of | |
2939 | IPv6address, we created a new ABNF rule for IPv6address that matches | |
2940 | the text representations defined by Section 2.2 of [RFC3513]. | |
2941 | Likewise, the definition of IPv4address has been improved in order to | |
2942 | limit each decimal octet to the range 0-255. | |
2943 | ||
2944 | Section 6, on URI normalization and comparison, has been completely | |
2945 | rewritten and extended by using input from Tim Bray and discussion | |
2946 | within the W3C Technical Architecture Group. | |
2947 | ||
2948 | D.2. Modifications | |
2949 | ||
2950 | The ad-hoc BNF syntax of RFC 2396 has been replaced with the ABNF of | |
2951 | [RFC2234]. This change required all rule names that formerly | |
2952 | included underscore characters to be renamed with a dash instead. In | |
2953 | addition, a number of syntax rules have been eliminated or simplified | |
2954 | to make the overall grammar more comprehensible. Specifications that | |
2955 | refer to the obsolete grammar rules may be understood by replacing | |
2956 | those rules according to the following table: | |
2957 | ||
2975 | +----------------+--------------------------------------------------+ | |
2976 | | obsolete rule  | translation                                      | | |
2977 | +----------------+--------------------------------------------------+ | |
2978 | | absoluteURI    | absolute-URI                                     | | |
2979 | | relativeURI    | relative-part [ "?" query ]                      | | |
2980 | | hier_part      | ( "//" authority path-abempty /                  | | |
2981 | |                |         path-absolute ) [ "?" query ]            | | |
2982 | |                |                                                  | | |
2983 | | opaque_part    | path-rootless [ "?" query ]                      | | |
2984 | | net_path       | "//" authority path-abempty                      | | |
2985 | | abs_path       | path-absolute                                    | | |
2986 | | rel_path       | path-rootless                                    | | |
2987 | | rel_segment    | segment-nz-nc                                    | | |
2988 | | reg_name       | reg-name                                         | | |
2989 | | server         | authority                                        | | |
2990 | | hostport       | host [ ":" port ]                                | | |
2991 | | hostname       | reg-name                                         | | |
2992 | | path_segments  | path-abempty                                     | | |
2993 | | param          | *<pchar excluding ";">                           | | |
2994 | |                |                                                  | | |
2995 | | uric           | unreserved / pct-encoded / ";" / "?" / ":"       | | |
2996 | |                |  / "@" / "&" / "=" / "+" / "$" / "," / "/"       | | |
2997 | |                |                                                  | | |
2998 | | uric_no_slash  | unreserved / pct-encoded / ";" / "?" / ":"       | | |
2999 | |                |  / "@" / "&" / "=" / "+" / "$" / ","             | | |
3000 | |                |                                                  | | |
3001 | | mark           | "-" / "_" / "." / "!" / "~" / "*" / "'"          | | |
3002 | |                |  / "(" / ")"                                     | | |
3003 | |                |                                                  | | |
3004 | | escaped        | pct-encoded                                      | | |
3005 | | hex            | HEXDIG                                           | | |
3006 | | alphanum       | ALPHA / DIGIT                                    | | |
3007 | +----------------+--------------------------------------------------+ | |
3008 | ||
3009 | Use of the above obsolete rules for the definition of scheme-specific | |
3010 | syntax is deprecated. | |
3011 | ||
3012 | Section 2, on characters, has been rewritten to explain what | |
3013 | characters are reserved, when they are reserved, and why they are | |
3014 | reserved, even when they are not used as delimiters by the generic | |
3015 | syntax. The mark characters that are typically unsafe to decode, | |
3016 | including the exclamation mark ("!"), asterisk ("*"), single-quote | |
3017 | ("'"), and open and close parentheses ("(" and ")"), have been moved | |
3018 | to the reserved set in order to clarify the distinction between | |
3019 | reserved and unreserved and, hopefully, to answer the most common | |
3020 | question of scheme designers. Likewise, the section on | |
3021 | percent-encoded characters has been rewritten, and URI normalizers | |
3022 | are now given license to decode any percent-encoded octets | |
3031 | corresponding to unreserved characters. In general, the terms | |
3032 | "escaped" and "unescaped" have been replaced with "percent-encoded" | |
3033 | and "decoded", respectively, to reduce confusion with other forms of | |
3034 | escape mechanisms. | |
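
The license given to normalizers, as described in the preceding
paragraph, can be illustrated with a short Python sketch. It decodes
only those percent-encoded octets that correspond to unreserved
characters and leaves every other percent-encoded triplet untouched.
The names UNRESERVED, PCT_ENCODED, and decode_unreserved are assumptions
chosen for this example, not part of this specification.

```python
import re

# unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)

# pct-encoded = "%" HEXDIG HEXDIG
PCT_ENCODED = re.compile(r"%([0-9A-Fa-f]{2})")


def decode_unreserved(uri):
    """Decode percent-encoded octets that represent unreserved characters."""

    def replace(match):
        char = chr(int(match.group(1), 16))
        # Reserved and non-ASCII octets stay percent-encoded.
        return char if char in UNRESERVED else match.group(0)

    return PCT_ENCODED.sub(replace, uri)
```

The substitution is a single pass, so an encoded percent sign such as
"%25" is never decoded and cannot combine with the digits that follow
it to form a new triplet.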
3035 | ||
3036 | The ABNF for URI and URI-reference has been redesigned to make them | |
3037 | more friendly to LALR parsers and to reduce complexity. As a result, | |
3038 | the layout form of syntax description has been removed, along with | |
3039 | the uric, uric_no_slash, opaque_part, net_path, abs_path, rel_path, | |
3040 | path_segments, rel_segment, and mark rules. All references to | |
3041 | "opaque" URIs have been replaced with a better description of how the | |
3042 | path component may be opaque to hierarchy. The relativeURI rule has | |
3043 | been replaced with relative-ref to avoid unnecessary confusion over | |
3044 | whether relative references are a subset of URI. The ambiguity | |
3045 | regarding the parsing of URI-reference as a URI or a relative-ref | |
3046 | with a colon in the first segment has been eliminated through the | |
3047 | use of five separate path matching rules. | |
3048 | ||
3049 | The fragment identifier has been moved back into the section on | |
3050 | generic syntax components and within the URI and relative-ref rules, | |
3051 | though it remains excluded from absolute-URI. The number sign ("#") | |
3052 | character has been moved back to the reserved set as a result of | |
3053 | reintegrating the fragment syntax. | |
3054 | ||
3055 | The ABNF has been corrected to allow the path component to be empty. | |
3056 | This also allows an absolute-URI to consist of nothing after the | |
3057 | "scheme:", as is present in practice with the "dav:" namespace | |
3058 | [RFC2518] and with the "about:" scheme used internally by many WWW | |
3059 | browser implementations. The ambiguity regarding the boundary | |
3060 | between authority and path has been eliminated through the use of | |
3061 | five separate path matching rules. | |
3062 | ||
3063 | Registry-based naming authorities that use the generic syntax are now | |
3064 | defined within the host rule. This change allows current | |
3065 | implementations, where whatever name provided is simply fed to the | |
3066 | local name resolution mechanism, to be consistent with the | |
3067 | specification. It also removes the need to re-specify DNS name | |
3068 | formats here. Furthermore, it allows the host component to contain | |
3069 | percent-encoded octets, which is necessary to enable | |
3070 | internationalized domain names to be provided in URIs, processed in | |
3071 | their native character encodings at the application layers above URI | |
3072 | processing, and passed to an IDNA library as a registered name in the | |
3073 | UTF-8 character encoding. The server, hostport, hostname, | |
3074 | domainlabel, toplabel, and alphanum rules have been removed. | |
3075 | ||
3076 | The algorithm for resolving relative references in [RFC2396] has been | |
3077 | rewritten with pseudocode for this revision to improve clarity and | |
3078 | fix the following issues: | |
3079 | ||
3087 | o [RFC2396] section 5.2, step 6a, failed to account for a base URI | |
3088 | with no path. | |
3089 | ||
3090 | o Restored the behavior of [RFC1808] where, if the reference | |
3091 | contains an empty path and a defined query component, the target | |
3092 | URI inherits the base URI's path component. | |
3093 | ||
3094 | o The determination of whether a URI reference is a same-document | |
3095 | reference has been decoupled from the URI parser, simplifying the | |
3096 | URI processing interface within applications in a way consistent | |
3097 | with the internal architecture of deployed URI processing | |
3098 | implementations. The determination is now based on comparison to | |
3099 | the base URI after transforming a reference to absolute form, | |
3100 | rather than on the format of the reference itself. This change | |
3101 | may result in more references being considered "same-document" | |
3102 | under this specification than there would be under the rules given | |
3103 | in RFC 2396, especially when normalization is used to reduce | |
3104 | aliases. However, it does not change the status of existing | |
3105 | same-document references. | |
3106 | ||
3107 | o Separated the path merge routine into two routines: merge, for | |
3108 | describing combination of the base URI path with a relative-path | |
3109 | reference, and remove_dot_segments, for describing how to remove | |
3110 | the special "." and ".." segments from a composed path. The | |
3111 | remove_dot_segments algorithm is now applied to all URI reference | |
3112 | paths in order to match common implementations and to improve the | |
3113 | normalization of URIs in practice. This change only impacts the | |
3114 | parsing of abnormal references and same-scheme references wherein | |
3115 | the base URI has a non-hierarchical path. | |
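
The two routines named in the last item above can be sketched in Python
as follows. This is an illustrative transcription of the merge routine
(Section 5.2.3) and the remove_dot_segments routine (Section 5.2.4),
offered as an assumption-laden example rather than as a reference
implementation. The parameter names are hypothetical, and an undefined
authority is represented by None.

```python
def merge(base_authority, base_path, reference_path):
    """Combine a base URI path with a relative-path reference."""
    if base_authority is not None and base_path == "":
        return "/" + reference_path
    # Keep the base path up to and including its last "/", if any.
    return base_path[: base_path.rfind("/") + 1] + reference_path


def remove_dot_segments(path):
    """Remove the special "." and ".." segments from a composed path."""
    output = []
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = path[2:]  # "/./" becomes "/"
        elif path == "/.":
            path = "/"
        elif path.startswith("/../"):
            path = path[3:]  # "/../" becomes "/"
            if output:
                output.pop()  # drop the previously emitted segment
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment, with its leading "/" if present,
            # from the input to the output.
            start = 1 if path.startswith("/") else 0
            end = path.find("/", start)
            if end == -1:
                output.append(path)
                path = ""
            else:
                output.append(path[:end])
                path = path[end:]
    return "".join(output)
```

For example, merging the base path "/b/c/d;p" with the reference "../g"
gives "/b/c/../g", which remove_dot_segments reduces to "/b/g".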
3116 | ||
3117 | Index | |
3118 | ||
3119 | A | |
3120 | ABNF 11 | |
3121 | absolute 27 | |
3122 | absolute-path 26 | |
3123 | absolute-URI 27 | |
3124 | access 9 | |
3125 | authority 17, 18 | |
3126 | ||
3127 | B | |
3128 | base URI 28 | |
3129 | ||
3130 | C | |
3131 | character encoding 4 | |
3132 | character 4 | |
3133 | characters 8, 11 | |
3134 | coded character set 4 | |
3135 | ||
3143 | D | |
3144 | dec-octet 20 | |
3145 | dereference 9 | |
3146 | dot-segments 23 | |
3147 | ||
3148 | F | |
3149 | fragment 16, 24 | |
3150 | ||
3151 | G | |
3152 | gen-delims 13 | |
3153 | generic syntax 6 | |
3154 | ||
3155 | H | |
3156 | h16 20 | |
3157 | hier-part 16 | |
3158 | hierarchical 10 | |
3159 | host 18 | |
3160 | ||
3161 | I | |
3162 | identifier 5 | |
3163 | IP-literal 19 | |
3164 | IPv4 20 | |
3165 | IPv4address 19, 20 | |
3166 | IPv6 19 | |
3167 | IPv6address 19, 20 | |
3168 | IPvFuture 19 | |
3169 | ||
3170 | L | |
3171 | locator 7 | |
3172 | ls32 20 | |
3173 | ||
3174 | M | |
3175 | merge 32 | |
3176 | ||
3177 | N | |
3178 | name 7 | |
3179 | network-path 26 | |
3180 | ||
3181 | P | |
3182 | path 16, 22, 26 | |
3183 | path-abempty 22 | |
3184 | path-absolute 22 | |
3185 | path-empty 22 | |
3186 | path-noscheme 22 | |
3187 | path-rootless 22 | |
3188 | path-abempty 16, 22, 26 | |
3189 | path-absolute 16, 22, 26 | |
3190 | path-empty 16, 22, 26 | |
3199 | path-rootless 16, 22 | |
3200 | pchar 23 | |
3201 | pct-encoded 12 | |
3202 | percent-encoding 12 | |
3203 | port 22 | |
3204 | ||
3205 | Q | |
3206 | query 16, 23 | |
3207 | ||
3208 | R | |
3209 | reg-name 21 | |
3210 | registered name 20 | |
3211 | relative 10, 28 | |
3212 | relative-path 26 | |
3213 | relative-ref 26 | |
3214 | remove_dot_segments 33 | |
3215 | representation 9 | |
3216 | reserved 12 | |
3217 | resolution 9, 28 | |
3218 | resource 5 | |
3219 | retrieval 9 | |
3220 | ||
3221 | S | |
3222 | same-document 27 | |
3223 | sameness 9 | |
3224 | scheme 16, 17 | |
3225 | segment 22, 23 | |
3226 | segment-nz 23 | |
3227 | segment-nz-nc 23 | |
3228 | sub-delims 13 | |
3229 | suffix 27 | |
3230 | ||
3231 | T | |
3232 | transcription 8 | |
3233 | ||
3234 | U | |
3235 | uniform 4 | |
3236 | unreserved 13 | |
3237 | URI grammar | |
3238 | absolute-URI 27 | |
3239 | ALPHA 11 | |
3240 | authority 18 | |
3241 | CR 11 | |
3242 | dec-octet 20 | |
3243 | DIGIT 11 | |
3244 | DQUOTE 11 | |
3245 | fragment 24 | |
3246 | gen-delims 13 | |
3255 | h16 20 | |
3256 | HEXDIG 11 | |
3257 | hier-part 16 | |
3258 | host 19 | |
3259 | IP-literal 19 | |
3260 | IPv4address 20 | |
3261 | IPv6address 20 | |
3262 | IPvFuture 19 | |
3263 | LF 11 | |
3264 | ls32 20 | |
3265 | OCTET 11 | |
3266 | path 22 | |
3267 | path-abempty 22 | |
3268 | path-absolute 22 | |
3269 | path-empty 22 | |
3270 | path-noscheme 22 | |
3271 | path-rootless 22 | |
3272 | pchar 23 | |
3273 | pct-encoded 12 | |
3274 | port 22 | |
3275 | query 24 | |
3276 | reg-name 21 | |
3277 | relative-ref 26 | |
3278 | reserved 13 | |
3279 | scheme 17 | |
3280 | segment 23 | |
3281 | segment-nz 23 | |
3282 | segment-nz-nc 23 | |
3283 | SP 11 | |
3284 | sub-delims 13 | |
3285 | unreserved 13 | |
3286 | URI 16 | |
3287 | URI-reference 25 | |
3288 | userinfo 18 | |
3289 | URI 16 | |
3290 | URI-reference 25 | |
3291 | URL 7 | |
3292 | URN 7 | |
3293 | userinfo 18 | |
3294 | ||
3311 | Authors' Addresses | |
3312 | ||
3313 | Tim Berners-Lee | |
3314 | World Wide Web Consortium | |
3315 | Massachusetts Institute of Technology | |
3316 | 77 Massachusetts Avenue | |
3317 | Cambridge, MA 02139 | |
3318 | USA | |
3319 | ||
3320 | Phone: +1-617-253-5702 | |
3321 | Fax: +1-617-258-5999 | |
3322 | EMail: timbl@w3.org | |
3323 | URI: http://www.w3.org/People/Berners-Lee/ | |
3324 | ||
3325 | ||
3326 | Roy T. Fielding | |
3327 | Day Software | |
3328 | 5251 California Ave., Suite 110 | |
3329 | Irvine, CA 92617 | |
3330 | USA | |
3331 | ||
3332 | Phone: +1-949-679-2960 | |
3333 | Fax: +1-949-679-2972 | |
3334 | EMail: fielding@gbiv.com | |
3335 | URI: http://roy.gbiv.com/ | |
3336 | ||
3337 | ||
3338 | Larry Masinter | |
3339 | Adobe Systems Incorporated | |
3340 | 345 Park Ave | |
3341 | San Jose, CA 95110 | |
3342 | USA | |
3343 | ||
3344 | Phone: +1-408-536-3024 | |
3345 | EMail: LMM@acm.org | |
3346 | URI: http://larry.masinter.net/ | |
3347 | ||
3367 | Full Copyright Statement | |
3368 | ||
3369 | Copyright (C) The Internet Society (2005). | |
3370 | ||
3371 | This document is subject to the rights, licenses and restrictions | |
3372 | contained in BCP 78, and except as set forth therein, the authors | |
3373 | retain all their rights. | |
3374 | ||
3375 | This document and the information contained herein are provided on an | |
3376 | "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS | |
3377 | OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET | |
3378 | ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, | |
3379 | INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE | |
3380 | INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED | |
3381 | WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. | |
3382 | ||
3383 | Intellectual Property | |
3384 | ||
3385 | The IETF takes no position regarding the validity or scope of any | |
3386 | Intellectual Property Rights or other rights that might be claimed to | |
3387 | pertain to the implementation or use of the technology described in | |
3388 | this document or the extent to which any license under such rights | |
3389 | might or might not be available; nor does it represent that it has | |
3390 | made any independent effort to identify any such rights. Information | |
3391 | on the IETF's procedures with respect to rights in IETF Documents can | |
3392 | be found in BCP 78 and BCP 79. | |
3393 | ||
3394 | Copies of IPR disclosures made to the IETF Secretariat and any | |
3395 | assurances of licenses to be made available, or the result of an | |
3396 | attempt made to obtain a general license or permission for the use of | |
3397 | such proprietary rights by implementers or users of this | |
3398 | specification can be obtained from the IETF on-line IPR repository at | |
3399 | http://www.ietf.org/ipr. | |
3400 | ||
3401 | The IETF invites any interested party to bring to its attention any | |
3402 | copyrights, patents or patent applications, or other proprietary | |
3403 | rights that may cover technology that may be required to implement | |
3404 | this standard. Please address the information to the IETF at ietf- | |
3405 | ipr@ietf.org. | |
3406 | ||
3407 | ||
3408 | Acknowledgement | |
3409 | ||
3410 | Funding for the RFC Editor function is currently provided by the | |
3411 | Internet Society. | |
3412 | ||