Another try at clarifying what goes into and comes out of Unicode objects.

author Fred Drake <fdrake@acm.org>

Tue, 24 Sep 2002 21:01:07 +0000 (21:01 +0000)

committer Fred Drake <fdrake@acm.org>

Tue, 24 Sep 2002 21:01:07 +0000 (21:01 +0000)
diff --git a/Doc/lib/libfuncs.tex b/Doc/lib/libfuncs.tex
index e47e01e8def8a5e1390c139ba2c1d684d47707ad..fc00dd4197d604092d2980db7684336e96109f9c 100644
--- a/Doc/lib/libfuncs.tex
+++ b/Doc/lib/libfuncs.tex
@@ -526,10 +526,6 @@ def my_import(name):
  \begin{funcdesc}{len}{s}
    Return the length (the number of items) of an object.  The argument
    may be a sequence (string, tuple or list) or a mapping (dictionary).
-  In the case of Unicode strings, \function{len()} returns the number
-  of storage units, not abstract characters.  In particular, when a
-  surrogate pair is encountered, each component of the pair is counted
-  as a separate character.
  \end{funcdesc}
  
  \begin{funcdesc}{list}{\optional{sequence}}
diff --git a/Doc/ref/ref2.tex b/Doc/ref/ref2.tex
index 00152aeb0f801109bacff311fafc0932c04ff451..6c8267b40d000a1b19bc7ac7c456cdae575582ef 100644
--- a/Doc/ref/ref2.tex
+++ b/Doc/ref/ref2.tex
@@ -376,29 +376,48 @@ those used by Standard C.  The recognized escape sequences are:
  \index{Standard C}
  \index{C}
  
-\begin{tableii}{l|l}{code}{Escape Sequence}{Meaning}
-\lineii{\e\var{newline}} {Ignored}
-\lineii{\e\e}  {Backslash (\code{\e})}
-\lineii{\e'}   {Single quote (\code{'})}
-\lineii{\e"}   {Double quote (\code{"})}
-\lineii{\e a}  {\ASCII{} Bell (BEL)}
-\lineii{\e b}  {\ASCII{} Backspace (BS)}
-\lineii{\e f}  {\ASCII{} Formfeed (FF)}
-\lineii{\e n}  {\ASCII{} Linefeed (LF)}
-\lineii{\e N\{\var{name}\}}
-       {Character named \var{name} in the Unicode database (Unicode only)}
-\lineii{\e r}  {\ASCII{} Carriage Return (CR)}
-\lineii{\e t}  {\ASCII{} Horizontal Tab (TAB)}
-\lineii{\e u\var{xxxx}}    {Character with 16-bit hex value \var{xxxx} (Unicode only)}
-\lineii{\e U\var{xxxxxxxx}}{Character with 32-bit hex value \var{xxxxxxxx} (Unicode only)}
-\lineii{\e v}  {\ASCII{} Vertical Tab (VT)}
-\lineii{\e\var{ooo}} {\ASCII{} character with octal value \var{ooo}}
-\lineii{\e x\var{hh}} {\ASCII{} character with hex value \var{hh}}
-\end{tableii}
+\begin{tableiii}{l|l|c}{code}{Escape Sequence}{Meaning}{Notes}
+\lineiii{\e\var{newline}} {Ignored}{}
+\lineiii{\e\e} {Backslash (\code{\e})}{}
+\lineiii{\e'}  {Single quote (\code{'})}{}
+\lineiii{\e"}  {Double quote (\code{"})}{}
+\lineiii{\e a} {\ASCII{} Bell (BEL)}{}
+\lineiii{\e b} {\ASCII{} Backspace (BS)}{}
+\lineiii{\e f} {\ASCII{} Formfeed (FF)}{}
+\lineiii{\e n} {\ASCII{} Linefeed (LF)}{}
+\lineiii{\e N\{\var{name}\}}
+        {Character named \var{name} in the Unicode database (Unicode only)}{}
+\lineiii{\e r} {\ASCII{} Carriage Return (CR)}{}
+\lineiii{\e t} {\ASCII{} Horizontal Tab (TAB)}{}
+\lineiii{\e u\var{xxxx}}
+        {Character with 16-bit hex value \var{xxxx} (Unicode only)}{(1)}
+\lineiii{\e U\var{xxxxxxxx}}
+        {Character with 32-bit hex value \var{xxxxxxxx} (Unicode only)}{(2)}
+\lineiii{\e v} {\ASCII{} Vertical Tab (VT)}{}
+\lineiii{\e\var{ooo}} {\ASCII{} character with octal value \var{ooo}}{(3)}
+\lineiii{\e x\var{hh}} {\ASCII{} character with hex value \var{hh}}{(4)}
+\end{tableiii}
  \index{ASCII@\ASCII}
  
-As in Standard C, up to three octal digits are accepted.  However,
-exactly two hex digits are taken in hex escapes.
+\noindent
+Notes:
+
+\begin{itemize}
+\item[(1)]
+  Individual code units which form parts of a surrogate pair can be
+  encoded using this escape sequence.
+\item[(2)]
+  Any Unicode character can be encoded this way, but characters
+  outside the Basic Multilingual Plane (BMP) will be encoded using a
+  surrogate pair if Python is compiled to use 16-bit code units (the
+  default).  Individual code units which form parts of a surrogate
+  pair can be encoded using this escape sequence.
+\item[(3)]
+  As in Standard C, up to three octal digits are accepted.
+\item[(4)]
+  Unlike in Standard C, at most two hex digits are accepted.
+\end{itemize}
+
  
  Unlike Standard \index{unrecognized escape sequence}C,
  all unrecognized escape sequences are left in the string unchanged,
@@ -427,7 +446,7 @@ When an \character{r} or \character{R} prefix is used in conjunction
  with a \character{u} or \character{U} prefix, then the \code{\e uXXXX}
  escape sequence is processed while \emph{all other backslashes are
  left in the string}.  For example, the string literal
-\code{ur"\e u0062\e n"} consists of three Unicode characters:
+\code{ur"\e{}u0062\e n"} consists of three Unicode characters:
  `LATIN SMALL LETTER B', `REVERSE SOLIDUS', and `LATIN SMALL LETTER N'.
  Backslashes can be escaped with a preceding backslash; however, both
  remain in the string.  As a result, \code{\e uXXXX} escape sequences
diff --git a/Doc/ref/ref3.tex b/Doc/ref/ref3.tex
index bc98d41608cd148ac3036cd656b2f596f1d8565d..94db3e929c8e208ad2081a3bd3795f917f318a00 100644
--- a/Doc/ref/ref3.tex
+++ b/Doc/ref/ref3.tex
@@ -286,15 +286,19 @@ Or perhaps someone can propose a better rule?)
  \bifuncindex{ord}
  
  \item[Unicode]
-The items of a Unicode object are Unicode characters.  A Unicode
-character is represented by a Unicode object of one item and can hold
-a 16-bit value representing a Unicode ordinal.  The built-in functions
+The items of a Unicode object are Unicode code units.  A Unicode code
+unit is represented by a Unicode object of one item and can hold
+either a 16-bit or 32-bit value representing a Unicode ordinal (the
+maximum value for the ordinal is given in \code{sys.maxunicode}, and
+depends on how Python is configured at compile time).  Surrogate pairs
+may be present in the Unicode object, and will be reported as two
+separate items.  The built-in functions
  \function{unichr()}\bifuncindex{unichr} and
-\function{ord()}\bifuncindex{ord} convert between characters and
+\function{ord()}\bifuncindex{ord} convert between code units and
  nonnegative integers representing the Unicode ordinals as defined in
  the Unicode Standard 3.0. Conversion from and to other encodings are
  possible through the Unicode method \method{encode} and the built-in
-function \function{unicode()}\bifuncindex{unicode}.
+function \function{unicode()}.\bifuncindex{unicode}
  \obindex{unicode}
  \index{character}
  \index{integer}
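
Illustration (not part of the commit): the ref2.tex and ref3.tex hunks above say that a character outside the BMP is stored as a surrogate pair when Python uses 16-bit code units, and that each half is reported as a separate item. The sketch below shows that arithmetic. It assumes a modern Python 3 interpreter, where strings are indexed by code point, so the code-unit count is obtained by encoding to UTF-16. The helper names to_surrogate_pair and utf16_code_units are hypothetical and do not come from the patched documentation.

```python
import sys


def to_surrogate_pair(code_point):
    """Split a non-BMP code point into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is inside the BMP or out of range")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


def utf16_code_units(text):
    """Count the 16-bit code units needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the BMP
high, low = to_surrogate_pair(ord(clef))
print(hex(high), hex(low))      # the two code units a 16-bit build would store
print(utf16_code_units(clef))   # what len() would report on such a build
print(hex(sys.maxunicode))      # largest ordinal this interpreter supports
```

On an interpreter built with 16-bit code units, as the documentation text in this commit describes, len() of such a string would be the code-unit count rather than the number of abstract characters.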