From: Fred Drake
Date: Tue, 24 Sep 2002 21:01:07 +0000 (+0000)
Subject: Another try at clarifying what goes into and comes out of Unicode objects.
X-Git-Tag: v2.2.2b1~119
X-Git-Url: http://git.ipfire.org/gitweb.cgi?a=commitdiff_plain;h=31940037d34bfacebef37837fc467e465ff437d5;p=thirdparty%2FPython%2Fcpython.git

Another try at clarifying what goes into and comes out of Unicode objects.
---

diff --git a/Doc/lib/libfuncs.tex b/Doc/lib/libfuncs.tex
index e47e01e8def8..fc00dd4197d6 100644
--- a/Doc/lib/libfuncs.tex
+++ b/Doc/lib/libfuncs.tex
@@ -526,10 +526,6 @@ def my_import(name):
 \begin{funcdesc}{len}{s}
   Return the length (the number of items) of an object. The argument
   may be a sequence (string, tuple or list) or a mapping (dictionary).
-  In the case of Unicode strings, \function{len()} returns the number
-  of storage units, not abstract characters. In particular, when a
-  surrogate pair is encountered, each component of the pair is counted
-  as a separate character.
 \end{funcdesc}
 
 \begin{funcdesc}{list}{\optional{sequence}}
diff --git a/Doc/ref/ref2.tex b/Doc/ref/ref2.tex
index 00152aeb0f80..6c8267b40d00 100644
--- a/Doc/ref/ref2.tex
+++ b/Doc/ref/ref2.tex
@@ -376,29 +376,48 @@ those used by Standard C. The recognized escape sequences are:
 \index{Standard C}
 \index{C}
 
-\begin{tableii}{l|l}{code}{Escape Sequence}{Meaning}
-\lineii{\e\var{newline}} {Ignored}
-\lineii{\e\e} {Backslash (\code{\e})}
-\lineii{\e'} {Single quote (\code{'})}
-\lineii{\e"} {Double quote (\code{"})}
-\lineii{\e a} {\ASCII{} Bell (BEL)}
-\lineii{\e b} {\ASCII{} Backspace (BS)}
-\lineii{\e f} {\ASCII{} Formfeed (FF)}
-\lineii{\e n} {\ASCII{} Linefeed (LF)}
-\lineii{\e N\{\var{name}\}}
-       {Character named \var{name} in the Unicode database (Unicode only)}
-\lineii{\e r} {\ASCII{} Carriage Return (CR)}
-\lineii{\e t} {\ASCII{} Horizontal Tab (TAB)}
-\lineii{\e u\var{xxxx}} {Character with 16-bit hex value \var{xxxx} (Unicode only)}
-\lineii{\e U\var{xxxxxxxx}}{Character with 32-bit hex value \var{xxxxxxxx} (Unicode only)}
-\lineii{\e v} {\ASCII{} Vertical Tab (VT)}
-\lineii{\e\var{ooo}} {\ASCII{} character with octal value \var{ooo}}
-\lineii{\e x\var{hh}} {\ASCII{} character with hex value \var{hh}}
-\end{tableii}
+\begin{tableiii}{l|l|c}{code}{Escape Sequence}{Meaning}{Notes}
+\lineiii{\e\var{newline}} {Ignored}{}
+\lineiii{\e\e} {Backslash (\code{\e})}{}
+\lineiii{\e'} {Single quote (\code{'})}{}
+\lineiii{\e"} {Double quote (\code{"})}{}
+\lineiii{\e a} {\ASCII{} Bell (BEL)}{}
+\lineiii{\e b} {\ASCII{} Backspace (BS)}{}
+\lineiii{\e f} {\ASCII{} Formfeed (FF)}{}
+\lineiii{\e n} {\ASCII{} Linefeed (LF)}{}
+\lineiii{\e N\{\var{name}\}}
+        {Character named \var{name} in the Unicode database (Unicode only)}{}
+\lineiii{\e r} {\ASCII{} Carriage Return (CR)}{}
+\lineiii{\e t} {\ASCII{} Horizontal Tab (TAB)}{}
+\lineiii{\e u\var{xxxx}}
+        {Character with 16-bit hex value \var{xxxx} (Unicode only)}{(1)}
+\lineiii{\e U\var{xxxxxxxx}}
+        {Character with 32-bit hex value \var{xxxxxxxx} (Unicode only)}{(2)}
+\lineiii{\e v} {\ASCII{} Vertical Tab (VT)}{}
+\lineiii{\e\var{ooo}} {\ASCII{} character with octal value \var{ooo}}{(3)}
+\lineiii{\e x\var{hh}} {\ASCII{} character with hex value \var{hh}}{(4)}
+\end{tableiii}
 \index{ASCII@\ASCII}
 
-As in Standard C, up to three octal digits are accepted. However,
-exactly two hex digits are taken in hex escapes.
+\noindent
+Notes:
+
+\begin{itemize}
+\item[(1)]
+  Individual code units which form parts of a surrogate pair can be
+  encoded using this escape sequence.
+\item[(2)]
+  Any Unicode character can be encoded this way, but characters
+  outside the Basic Multilingual Plane (BMP) will be encoded using a
+  surrogate pair if Python is compiled to use 16-bit code units (the
+  default). Individual code units which form parts of a surrogate
+  pair can be encoded using this escape sequence.
+\item[(3)]
+  As in Standard C, up to three octal digits are accepted.
+\item[(4)]
+  Unlike in Standard C, at most two hex digits are accepted.
+\end{itemize}
+
 
 Unlike Standard \index{unrecognized escape sequence}C,
 all unrecognized escape sequences are left in the string unchanged,
@@ -427,7 +446,7 @@ When an \character{r} or \character{R} prefix is used in conjunction
 with a \character{u} or \character{U} prefix, then the \code{\e uXXXX}
 escape sequence is processed while \emph{all other backslashes are
 left in the string}. For example, the string literal
-\code{ur"\e u0062\e n"} consists of three Unicode characters:
+\code{ur"\e{}u0062\e n"} consists of three Unicode characters:
 `LATIN SMALL LETTER B', `REVERSE SOLIDUS', and `LATIN SMALL LETTER N'.
 Backslashes can be escaped with a preceding backslash; however, both
 remain in the string. As a result, \code{\e uXXXX} escape sequences
diff --git a/Doc/ref/ref3.tex b/Doc/ref/ref3.tex
index bc98d41608cd..94db3e929c8e 100644
--- a/Doc/ref/ref3.tex
+++ b/Doc/ref/ref3.tex
@@ -286,15 +286,19 @@ Or perhaps someone can propose a better rule?)
 \bifuncindex{ord}
 
 \item[Unicode]
-The items of a Unicode object are Unicode characters. A Unicode
-character is represented by a Unicode object of one item and can hold
-a 16-bit value representing a Unicode ordinal. The built-in functions
+The items of a Unicode object are Unicode code units. A Unicode code
+unit is represented by a Unicode object of one item and can hold
+either a 16-bit or 32-bit value representing a Unicode ordinal (the
+maximum value for the ordinal is given in \code{sys.maxunicode}, and
+depends on how Python is configured at compile time). Surrogate pairs
+may be present in the Unicode object, and will be reported as two
+separate items. The built-in functions
 \function{unichr()}\bifuncindex{unichr} and
-\function{ord()}\bifuncindex{ord} convert between characters and
+\function{ord()}\bifuncindex{ord} convert between code units and
 nonnegative integers representing the Unicode ordinals as defined in
 the Unicode Standard 3.0. Conversion from and to other encodings are
 possible through the Unicode method \method{encode} and the built-in
-function \function{unicode()}\bifuncindex{unicode}.
+function \function{unicode()}.\bifuncindex{unicode}
 \obindex{unicode}
 \index{character}
 \index{integer}
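
Illustration (not part of the commit): the documentation changed above describes Python 2.2 Unicode objects, whose items are 16-bit code units by default. The following is a minimal sketch for a modern Python 3 interpreter (3.3 or later), where the items of a str are whole code points instead. To approximate what len() reported on a 16-bit build, it counts UTF-16 code units. The helper names utf16_code_units and surrogate_pair are hypothetical and exist only for this example, and the raw_unicode_escape codec stands in for the old ur"..." literal, which Python 3 no longer accepts.

```python
import sys


def utf16_code_units(text):
    """Count UTF-16 code units in text (hypothetical helper).

    Assumption: this approximates len() on a 16-bit build as described
    in the patched docs, where each half of a surrogate pair is one item.
    """
    return len(text.encode("utf-16-le")) // 2


def surrogate_pair(ch):
    """Split one code point above U+FFFF into (high, low) surrogates."""
    offset = ord(ch) - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


bmp_char = "\u0062"         # \uXXXX escape: LATIN SMALL LETTER B
astral_char = "\U00010000"  # \UXXXXXXXX escape: first code point outside the BMP

code_point_count = len(astral_char)              # 1 on Python 3.3+
code_unit_count = utf16_code_units(astral_char)  # 2: one surrogate pair
high, low = surrogate_pair(astral_char)          # 0xD800, 0xDC00

# Notes (3) and (4): up to three octal digits, at most two hex digits.
octal_then_digit = "\1011"  # "\101" is 'A', the trailing '1' is literal
hex_then_digit = "\x411"    # "\x41" is 'A', the trailing '1' is literal

# Stand-in for ur"\u0062\n": only \uXXXX is processed, and the
# backslash before 'n' stays in the string.
raw_decoded = b"\\u0062\\n".decode("raw_unicode_escape")

print(hex(sys.maxunicode), code_point_count, code_unit_count)
print([hex(high), hex(low)], list(raw_decoded))
```

On a current interpreter sys.maxunicode no longer varies with build options, so the code-unit count has to be computed explicitly, as above, rather than read from len().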