Implement a string conversion interface for archive_entry and archive_mstring
for efficient string conversion. Some platforms have to do string conversion
through wide characters. In addition, Windows cannot use a UTF-8 locale,
so using wide characters is the only way to write an internationalized
program there.
Issue 106.
Check the result of FIEMAP because the result may consist of
adjacent extents even though the file is not sparse. So we should
clear the sparse data if it covers the whole file.
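The check described above can be sketched as follows. This is a minimal illustration in Python, not the libarchive code: the (offset, length) tuples stand in for the extents FIEMAP reports, and is_effectively_dense() is a hypothetical helper name.

```python
def is_effectively_dense(extents, file_size):
    """Return True when the extents leave no hole anywhere in the file.

    extents: list of (offset, length) tuples sorted by offset
             (a stand-in for what FIEMAP reports; an assumption here).
    """
    covered_to = 0
    for offset, length in extents:
        if offset > covered_to:
            return False  # a hole sits between covered_to and offset
        covered_to = max(covered_to, offset + length)
    # A hole at the end of the file also makes the file sparse.
    return covered_to >= file_size


# Two adjacent extents that together span the file: not sparse.
adjacent = [(0, 4096), (4096, 4096)]
# A gap between 4096 and 8192: genuinely sparse.
with_hole = [(0, 4096), (8192, 4096)]
```

When is_effectively_dense() returns True, the sparse data describes the whole file and can be cleared.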
If the character set of filenames in an archive is UTF-8, we should automatically
normalize them, for three reasons. First, it avoids the situation where two
filenames in one directory are different byte sequences but look identical
because one is NFD and the other is NFC. Second, iconv cannot correctly convert
NFD characters to other character sets, so we have to convert NFD to NFC before
iconv handles them unless iconv supports UTF-8-MAC. Third, it is needed for
matching filenames: if filenames in an archive are NFD and the platform is not
Mac OS, users cannot specify the filename they want to extract even though they
can see it in a listing. NFD can now be displayed on some platforms, but
creating an NFD string requires a character-set conversion utility; in
particular, entering an NFD string is hard on Windows.
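The NFD/NFC mismatch described above can be reproduced with a short Python sketch. It uses the standard unicodedata module rather than libarchive, and normalize_filename() is a hypothetical helper introduced only for illustration.

```python
import unicodedata

# "é" as one precomposed code point (NFC) and as "e" + combining acute (NFD).
nfc_name = "caf\u00e9.txt"
nfd_name = "cafe\u0301.txt"

# Both names render the same, yet their UTF-8 byte sequences differ.
bytes_differ = nfc_name.encode("utf-8") != nfd_name.encode("utf-8")


def normalize_filename(name):
    """Hypothetical helper: bring a filename to NFC before comparing it."""
    return unicodedata.normalize("NFC", name)


# After normalization the two names match, so a user who types the NFC
# form can select an entry that was stored in NFD.
same_after_nfc = normalize_filename(nfd_name) == normalize_filename(nfc_name)
```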
Tim Kientzle [Tue, 26 Apr 2011 06:02:43 +0000 (02:02 -0400)]
Refactor the read_open() routines into a collection of
single setters to set the callbacks, and a simple
archive_read_open1(a) that uses the callbacks already
registered.
In particular, this will make it easier to extend the
API in the future with new callbacks.
Remove hdrcharset option support from the xar reader. It is almost useless for xar
since the option would be used only when users need to convert filenames
from UTF-8-MAC to UTF-8, but, unfortunately, UTF-8-MAC is not usually supported
by iconv/libiconv (including package systems such as FreeBSD ports or NetBSD pkgsrc)
except on Mac OS, although a patch for libiconv has been available for many years.
That means users cannot use UTF-8-MAC unless they build a custom libiconv
with UTF-8-MAC support.
Use en_US.UTF-8 instead of de_DE.UTF-8 in test_entry and test_pax_filename_encoding
because en_US.UTF-8 seems to be more widely available by default than de_DE.UTF-8.
For example, Ubuntu installs the "C" and en_*.UTF-8 locales whichever locale is
selected as the default, but it does not install de_DE.UTF-8.
Improve Unicode handling.
1. The conversion now fails under the following conditions, since none of them is legal Unicode.
- The code point is larger than 0x10FFFF.
- The code point is encoded as an overlong sequence.
- There is a surrogate pair in a UTF-8 string. We currently use iconv
if available, and it does not allow a surrogate pair in UTF-8, so the behavior
of the string_append_from_utf8_to_utf16be() function should match it.
- There is an incomplete surrogate pair in a UTF-16 string.
2. Use a table to get the number of bytes in a UTF-8 sequence from its first byte.
This makes it easy to tell which byte is wrong and how many bytes follow.
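The UTF-8 rules and the first-byte table listed above can be sketched in Python as follows. This is an illustration under assumptions, not the libarchive implementation: UTF8_COUNT, decode_utf8_strict() and is_legal_utf8() are hypothetical names, and the table layout is one reasonable choice.

```python
# Sequence length indexed by the first byte; 0 marks a byte that can
# never start a legal sequence (continuation bytes, 0xC0/0xC1, 0xF5..0xFF).
UTF8_COUNT = ([1] * 0x80 + [0] * 0x40 + [0] * 2 + [2] * 0x1E
              + [3] * 0x10 + [4] * 5 + [0] * 11)

# Smallest code point that legitimately needs a sequence of each length.
MIN_CODE_POINT = {1: 0x00, 2: 0x80, 3: 0x800, 4: 0x10000}
# Payload bits carried by the first byte for each sequence length.
LEAD_MASK = {1: 0x7F, 2: 0x1F, 3: 0x0F, 4: 0x07}


def decode_utf8_strict(data):
    """Decode bytes to a list of code points, rejecting illegal input."""
    code_points = []
    i = 0
    while i < len(data):
        n = UTF8_COUNT[data[i]]
        if n == 0:
            raise ValueError("invalid first byte at offset %d" % i)
        if i + n > len(data):
            raise ValueError("truncated sequence at offset %d" % i)
        cp = data[i] & LEAD_MASK[n]
        for b in data[i + 1:i + n]:
            if b & 0xC0 != 0x80:
                raise ValueError("bad continuation byte after offset %d" % i)
            cp = (cp << 6) | (b & 0x3F)
        if cp < MIN_CODE_POINT[n]:
            raise ValueError("overlong sequence at offset %d" % i)
        if 0xD800 <= cp <= 0xDFFF:
            raise ValueError("surrogate in UTF-8 at offset %d" % i)
        if cp > 0x10FFFF:
            raise ValueError("code point above 0x10FFFF at offset %d" % i)
        code_points.append(cp)
        i += n
    return code_points


def is_legal_utf8(data):
    """Convenience wrapper: True when decode_utf8_strict() accepts data."""
    try:
        decode_utf8_strict(data)
        return True
    except ValueError:
        return False
```

Because the table gives the expected length up front, an error can name both the offending offset and how many bytes were expected to follow.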
Introduce the "tar:utf8type=libarchive2x" option for the incorrect UTF-8 strings
that libarchive 2.x produces because of a wrong assumption about wchar_t. The option works
only for pax format.
Currently libarchive 3 correctly translates UTF-8 strings from/to current-locale
strings, but we cannot properly handle the incorrect UTF-8 on
platforms whose wchar_t is not Unicode and where users are not using a UTF-8 locale.
So we should support translating such UTF-8 strings properly to current-locale
strings.
Add an ENABLE_ICONV option to the CMake build. At this time ENABLE_ICONV is OFF on Windows.
Tweak the initial values of CMake options so that they work correctly in the CMake GUI.
Improve test_read_format_zip_filename.
Add a check that the zip reader does not translate a filename that is stored in the UTF-8
charset with its UTF-8 Name flag (general purpose flag bit 11) set,
even when a hdrcharset option is specified.
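The UTF-8 Name flag mentioned above is bit 11 (0x0800) of the general purpose flags. The sketch below inspects it with Python's standard zipfile module, which is used here only as a convenient way to build and read a zip; it does not show libarchive's reader. names_marked_utf8() is a hypothetical helper.

```python
import io
import zipfile

UTF8_NAME_FLAG = 0x0800  # general purpose flag bit 11


def names_marked_utf8(zip_bytes):
    """Map each member name to whether its UTF-8 Name flag is set."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return {info.filename: bool(info.flag_bits & UTF8_NAME_FLAG)
                for info in zf.infolist()}


# Build a small archive in memory: one ASCII name, one non-ASCII name.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("plain.txt", b"a")
    zf.writestr("caf\u00e9.txt", b"b")

flags = names_marked_utf8(buf.getvalue())
```

A reader that finds the flag set already knows the name is UTF-8, so a user-supplied charset should not be applied to that entry.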
Use locale_charset() instead of nl_langinfo(CODESET) for GNU libiconv.
The charset name that nl_langinfo(CODESET) returns depends on
the platform, so GNU libiconv will not recognize the charset name
on some platforms. Using locale_charset() is equivalent to passing an empty name "" to iconv,
but that is a GNU libiconv specific feature, although FreeBSD iconv also allows
the empty name. I think locale_charset() is better than using "" because
it makes it easy to see what the current charset is when debugging.
Fix iconv detection in CMake.
- Properly find optional directories such as /usr/local/{include,lib}.
- Use FIND_PATH(ICONV_INCLUDE_DIR iconv.h) instead of
LA_CHECK_INCLUDE_FILE("iconv.h" HAVE_ICONV_H) because detecting libxml2
headers uses ICONV_INCLUDE_DIR.
Add a hdrcharset option test for pax format.
Whenever the hdrcharset option is specified, we still correctly read filenames
stored in the UTF-8 charset by default.
For pax format, users can specify a charset for BINARY filenames only.
On Windows, with the command line systemf("echo f | bsdcpio -pd copy >copy.out 2>copy.err"),
bsdcpio gets the wrong filename "f " (with a trailing space). Writing it as "echo f| bsdcpio ...." correctly passes
the intended filename to bsdcpio.
r3216 was insufficient. It needs further changes.
The Windows "\\.\" prefix path mechanism does not allow "../" or "./" components.
So if GetLastError() returns ERROR_INVALID_NAME, we retry the operation with the corrected name
obtained from __la_win_permissive_name_w(), which returns a canonical path.
Account for large i-node numbers in the tests that use the cpio newc format.
Some cpio tests on Cygwin 1.7.x always failed because of large i-node numbers.
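For context, each numeric field in a newc header, including the i-node number, is written as 8 hexadecimal characters, so values above 0xFFFFFFFF do not fit. The Python sketch below illustrates that limit; whether this is exactly what tripped the Cygwin tests is an assumption, and both helper names are hypothetical.

```python
NEWC_FIELD_WIDTH = 8  # each numeric newc header field is 8 hex characters
NEWC_FIELD_MAX = 16 ** NEWC_FIELD_WIDTH - 1  # 0xFFFFFFFF


def fits_newc_field(value):
    """True when value can be stored in one newc header field."""
    return 0 <= value <= NEWC_FIELD_MAX


def format_newc_field(value):
    """Render a value as a newc field, refusing values that do not fit."""
    if not fits_newc_field(value):
        raise ValueError("value does not fit in a newc header field")
    return "%08x" % value
```

A test that compares i-node numbers read back from a newc archive therefore has to allow for values the format cannot represent exactly.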
Apply the charset specified by the hdrcharset option in the PAX reader only when
the charset described in the PAX attribute is BINARY, because BINARY means
the character set of the metadata (filename/uname/gname) is unknown.
It is useful for users to be able to specify a charset for 'BINARY' metadata: they
may know the proper charset, or may want to try several to find one that converts correctly.
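The rule above amounts to a small decision, sketched here in Python. The function name and the use of None for "leave the bytes uninterpreted" are assumptions made for illustration; only the BINARY condition comes from the text above.

```python
PAX_BINARY = "BINARY"


def effective_charset(pax_hdrcharset, user_hdrcharset):
    """Pick the charset used to interpret pax metadata (hypothetical helper).

    pax_hdrcharset:  value of the hdrcharset attribute in the archive,
                     or None when the attribute is absent.
    user_hdrcharset: charset given through the hdrcharset option, or None.
    """
    if pax_hdrcharset == PAX_BINARY:
        # The archive says the charset is unknown, so the user's choice
        # (if any) is the only information available.
        return user_hdrcharset
    # Otherwise pax metadata is UTF-8 and the option is ignored.
    return "UTF-8"
```

For example, effective_charset("BINARY", "CP932") yields "CP932", while the same option has no effect on an archive whose metadata is already declared as UTF-8.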
Improve archive_write_disk_windows.c.
- Use WCS for pathnames. This has led to the following changes.
- Use FindFirstFileW and GetFileInformationByHandle instead of stat/lstat.
- Move __la_chmod and __la_ftruncate, which are used only in this file, from archive_windows.c
and change them to wide-character versions.
- Remove __la_mkdir and use CreateDirectoryW directly in the file.
- Remove __la_rmdir and use _wrmdir in the file.
- Remove __la_unlink and use _wunlink in the file.
- Remove __la_link and move la_CreateHardLinkW into the file to use it directly.
- Use _wopen instead of __la_open.
Unfortunately, at this time we cannot use the full pathname from __la_win_permissive_name_w()
everywhere in the file because __la_win_permissive_name_w() trims "../". For example, the path
"abc/a/../b/../c", which names multiple directories in one entry, will be converted to "<parent-dir>/abc/c",
so we could not create both the abc/a and abc/b directories if we applied __la_win_permissive_name_w()
to the path in _archive_write_disk_header().
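The loss of intermediate directories described above can be shown with a short Python sketch. It uses posixpath.normpath as a stand-in for canonicalization (an assumption; it is not __la_win_permissive_name_w()), takes a path of the shape discussed above written as abc/a/../b/../c, and dirs_named_by() is a hypothetical helper.

```python
import posixpath


def dirs_named_by(path):
    """List every directory prefix the raw path walks through, in order."""
    seen = []
    current = []
    for part in path.split("/"):
        if part in ("", "."):
            continue
        if part == "..":
            if current:
                current.pop()  # step back up one level
            continue
        current.append(part)
        joined = "/".join(current)
        if joined not in seen:
            seen.append(joined)
    return seen


raw = "abc/a/../b/../c"
canonical = posixpath.normpath(raw)  # "abc/c"
visited = dirs_named_by(raw)  # ["abc", "abc/a", "abc/b", "abc/c"]
```

The canonical form keeps only abc/c, while walking the raw path also names abc/a and abc/b, which is why canonicalizing too early would skip creating them.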
Make two versions of archive_write_disk.c by copying and renaming it:
one for POSIX platforms and the other for Windows.
The Windows version will be cleaned up, and parts of it will be
reimplemented with the Win32 API to reduce the overhead of simulating the POSIX API.