Revise the Unicode section after getting comments from MAL, GvR, and others.

author Andrew M. Kuchling <amk@amk.ca>

Thu, 19 Jul 2001 14:59:53 +0000 (14:59 +0000)

committer Andrew M. Kuchling <amk@amk.ca>

Thu, 19 Jul 2001 14:59:53 +0000 (14:59 +0000)
diff --git a/Doc/whatsnew/whatsnew22.tex b/Doc/whatsnew/whatsnew22.tex

index 96b0972ae1304d9e65de8dc2ae6370ba024ff7df..431e269c4fb4d44b75f0332ca7b7267d8ddd721f 100644 (file)
--- a/Doc/whatsnew/whatsnew22.tex
+++ b/Doc/whatsnew/whatsnew22.tex
@@ -3,7 +3,7 @@
  % $Id$
  
  \title{What's New in Python 2.2}
-\release{0.03}
+\release{0.04}
  \author{A.M. Kuchling}
  \authoraddress{\email{akuchlin@mems-exchange.org}}
  \begin{document}
@@ -339,32 +339,46 @@ and Tim Peters, with other fixes from the Python Labs crew.}
  \section{Unicode Changes}
  
  Python's Unicode support has been enhanced a bit in 2.2.  Unicode
-strings are usually stored as UCS-2, as 16-bit unsigned integers.
+strings are usually stored as UTF-16, as 16-bit unsigned integers.
  Python 2.2 can also be compiled to use UCS-4, 32-bit unsigned
  integers, as its internal encoding by supplying
  \longprogramopt{enable-unicode=ucs4} to the configure script.  When
-built to use UCS-4, in theory Python could handle Unicode characters
-from U-00000000 to U-7FFFFFFF.  Being able to use UCS-4 internally is
-a necessary step to do that, but it's not the only step, and in Python
-2.2alpha1 the work isn't complete yet.  For example, the
-\function{unichr()} function still only accepts values from 0 to
-65535, and there's no \code{\e U} notation for embedding characters
-greater than 65535 in a Unicode string literal.  All this is the
-province of the still-unimplemented PEP 261, ``Support for `wide'
-Unicode characters''; consult it for further details, and please offer
-comments and suggestions on the proposal it describes.
-
-Another change is much simpler to explain.
-Since their introduction, Unicode strings have supported an
-\method{encode()} method to convert the string to a selected encoding
-such as UTF-8 or Latin-1.  A symmetric
-\method{decode(\optional{\var{encoding}})} method has been added to
-both 8-bit and Unicode strings in 2.2, which assumes that the string
-is in the specified encoding and decodes it. This means that
-\method{encode()} and \method{decode()} can be called on both types of
-strings, and can be used for tasks not directly related to Unicode.
-For example, codecs have been added for UUencoding, MIME's base-64
-encoding, and compression with the \module{zlib} module.
+built to use UCS-4 (a ``wide Python''), the interpreter can natively
+handle Unicode characters from U+000000 to U+110000.  The range of
+legal values for the \function{unichr()} function has been expanded;
+it used to only accept values up to 65535, but in 2.2 will accept
+values from 0 to 0x110000.  Using a ``narrow Python'', an interpreter
+compiled to use UTF-16, values greater than 65535 will result in
+\function{unichr()} returning a string of length 2:
+
+\begin{verbatim}
+>>> s = unichr(65536)
+>>> s
+u'\ud800\udc00'
+>>> len(s)
+2
+\end{verbatim}
+
+This possibly-confusing behaviour, breaking the intuitive invariant
+that \function{chr()} and \function{unichr()} always return strings of
+length 1, may be changed later in 2.2 depending on public reaction.
+
+All this is the province of the still-unimplemented PEP 261, ``Support
+for `wide' Unicode characters''; consult it for further details, and
+please offer comments and suggestions on the proposal it describes.
+
+Another change is much simpler to explain. Since their introduction,
+Unicode strings have supported an \method{encode()} method to convert
+the string to a selected encoding such as UTF-8 or Latin-1.  A
+symmetric \method{decode(\optional{\var{encoding}})} method has been
+added to 8-bit strings (though not to Unicode strings) in 2.2.
+\method{decode()} assumes that the string is in the specified encoding
+and decodes it, returning whatever is returned by the codec. 
+
+Using this new feature, codecs have been added for tasks not directly
+related to Unicode.  For example, codecs have been added for
+uu-encoding, MIME's base64 encoding, and compression with the
+\module{zlib} module:
  
  \begin{verbatim}
  >>> s = """Here is a lengthy piece of redundant, overly verbose,
@@ -610,6 +624,15 @@ changes are:
    been changed to use the new C-level interface.  (Contributed by Fred
    L. Drake, Jr.)
  
+  \item Another low-level API, primarily of interest to implementors
+  of Python debuggers and development tools, was added.
+  \cfunction{PyInterpreterState_Head()} and
+  \cfunction{PyInterpreterState_Next()} let a caller walk through all
+  the existing interpreter objects;
+  \cfunction{PyInterpreterState_ThreadHead()} and
+  \cfunction{PyThreadState_Next()} allow looping over all the thread
+  states for a given interpreter.  (Contributed by David Beazley.)
+
    % XXX is this explanation correct?  
    \item When presented with a Unicode filename on Windows, Python will
    now correctly convert it to a string using the MBCS encoding.
@@ -668,6 +691,7 @@ changes are:
  
  The author would like to thank the following people for offering
  suggestions and corrections to various drafts of this article: Fred
-Bremmer, Fred L. Drake, Jr., Tim Peters, Neil Schemenauer.  
+Bremmer, Fred L. Drake, Jr., Marc-Andr\'e Lemburg,
+Tim Peters, Neil Schemenauer, Guido van Rossum.  
  
  \end{document}
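
Note on the narrow-build example in the Unicode hunk above: the length-2 result u'\ud800\udc00' is the UTF-16 surrogate pair for code point 65536. The sketch below shows the standard UTF-16 arithmetic that produces such a pair. It is an illustration only, written for a current Python 3 interpreter rather than Python 2.2, and the helper name to_surrogate_pair is hypothetical; it is not part of the patch or of any Python API.

```python
def to_surrogate_pair(code_point):
    """Split a code point above 0xFFFF into its UTF-16 surrogates.

    Hypothetical helper for illustration; not taken from the patch.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is outside the supplementary range")
    offset = code_point - 0x10000       # 20-bit value to split in two
    high = 0xD800 + (offset >> 10)      # high surrogate carries the top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # low surrogate carries the bottom 10 bits
    return high, low


high, low = to_surrogate_pair(65536)
print("%04x %04x" % (high, low))        # prints "d800 dc00"
```

The two values match the \ud800 and \udc00 escapes in the verbatim session added by the patch, which is why len(s) reports 2 on a narrow build.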