Add wrapping modes to facilitate wrapping non-English texts.

author Ben Schmidt <none@none>

Tue, 28 Feb 2012 13:46:35 +0000 (00:46 +1100)

committer Ben Schmidt <none@none>

Tue, 28 Feb 2012 13:46:35 +0000 (00:46 +1100)
diff --git a/ChangeLog b/ChangeLog
index 55e7434bc5caabe76d0f18e7cf7c6eaacc1ccad5..5984229f2cc68811a1cafc59cd9a48c670e72838 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,4 @@
+ o Add different wrapping modes to facilitate wrapping many languages
   o Fix backslash escaping mechanism so double backslash can't effectively
     recurse and form part of another escape sequence, other non-unicode escapes
     aren't ignored, and first lines of included files don't 'escape' escaping.
@@ -9,10 +10,9 @@
   o Make mlmmj-sub and +subscribe[-digest|-nomail] switch existing
     subscriptions.
   o Add a switch to bypass notifying the owner on subscribe/unsubscribe.
- o Introduce \<space> to indicate line-break positions to enable sensible
-   wrapping of Chinese and similar text.
- o Allow lines to be longer than the wrapping width if there are no spaces,
-   as generated email addresses (e.g. for moderation) won't work if split.
+ o Introduce \<space> to indicate non-breakable space, \= to mark other
+   locations where breaks should not occur, and \/ to mark locations where
+   breaks can occur
   o Add rejection of posts and obstruction of subscriptions.
   o Avoid bogus error messages when logging that the list address has been
     found in To: or CC: headers.
diff --git a/README.listtexts b/README.listtexts
index 84f872bf86e5b7d809b9a8de2013e4173eac9d91..f1e7b2e50aa712433ea65385e3f0756f2e811c66 100644
--- a/README.listtexts
+++ b/README.listtexts
@@ -15,8 +15,11 @@ This file documents the following aspects of list texts:
  - Supported list texts
  - Format
  - Conditionals
-- Formatting and formatted substitutions
+- Wrapping
+- Formatting and comments
+- Formatted substitutions
  - Unformatted substitutions
+- Escapes
  
  Naming scheme
  -------------
@@ -240,21 +243,75 @@ need to worry about it.
  Note that when multiple parameters can be given for the directives, these have
  'or' behaviour; to get 'and' behaviour, nest conditionals.
  
-Formatting and formatted substitutions
---------------------------------------
+Wrapping
+--------
  
-These formatting-related directives work with multiple lines, so are generally
-not appropriate for use in headers. They are:
+There are various directives available to assist with wrapping and formatting.
+Wrapping needs to be enabled for each paragraph with:
  
  - %wrap%
  - %wrap W%
-  lines until the next blank line are concatenated and are then rewrapped to a
-  width of W (or 76 if W is omitted); lines have whitespace trimmed before
-  being joined with a single space; lines are broken at spaces or at points
-  marked for breaking with \<space>; the width is reckoned including any text
+  concatenate and rewrap lines until the next blank line to a width of W (or 76
+  if W is omitted); second and later lines are preceded with as many spaces as
+  the width preceding the directive; the width is reckoned including any text
    preceding the directive and any indentation preserved from a file which
-  included the current one, so it is an absolute maximum width; it is measured
-  in bytes
+  included the current one, so it is an absolute maximum width
+
+To cater for various languages, there are a number of different wrapping modes
+that can be set. These can be set either before or after wrapping is specified,
+and can even be changed part way through a paragraph if desired. The following
+directives control them:
+
+- %wordwrap%
+- %ww%
+  use word-wrapping (this is the default; good for English, French, Greek and
+  other languages that use an alphabet and spaces between words); lines have
+  whitespace trimmed from both ends and are joined with a single space; lines
+  are broken at spaces or at points marked for breaking with \/, but not at
+  spaces escaped with a backslash
+
+- %charwrap%
+- %cw%
+  use character-wrapping (good for Chinese, Japanese and Korean which use
+  characters without spaces between words); lines have only leading whitespace
+  trimmed and are joined without inserting anything at the joint; lines are
+  broken at space or any non-ASCII character except where disallowed with \=
+
+- %userwrap%
+- %uw%
+  use user-wrapping (for more complex languages or wherever complete manual
+  control is desired); lines have only leading whitespace trimmed and are
+  joined without inserting anything at the joint; lines are broken only where
+  marked for breaking with \/
+
+If a line with any of the directives in this section, after processing,
+contains only whitespace, the line does not appear at all in the output (the
+newline and any whitespace is omitted).
+
+Formatting and comments
+-----------------------
+
+The following directives are available to assist with formatting and
+readability:
+
+- %^%
+  start the line here; anything preceding this directive is ignored (useful for
+  using indentation for readability without ruining the formatting of the text
+  when it is processed)
+
+- %comment%
+- %$%
+  end the line here; anything following this directive is ignored
+
+If a line with any of these directives, after processing, contains only
+whitespace, the line does not appear at all in the output (the newline and any
+whitespace is omitted).
+
+Formatted substitutions
+-----------------------
+
+These formatted substitutions work with multiple lines, so are generally not
+appropriate for use in headers. They are:
  
  - %text T%
    text from the file named T in the listdir/text directory; the name may only
@@ -303,27 +360,12 @@ not appropriate for use in headers. They are:
    the list of indexes of messages which may not have been received as they
    bounced
  
-- %^%
-  start the line here; anything preceding this directive is ignored (useful for
-  using indentation for readability without ruining the formatting of the text
-  when it is processed)
-
-- %comment%
-- %$%
-  end the line here; anything following this directive is ignored
-
-- %%
-  a single %
-
  Directives which include a list of items have the behaviour that each item is
  preceded and followed by the same text as preceded and followed the directive
-on its line. Only one such directive is supported per line.
-
-The %wrap% and %wrap W% directives, as well as those which include a block of
-text, have the behaviour that second and later lines are preceded with as many
-spaces as there were characters preceding the directive. Apart from the
-%wrap% and %wrap W% directives, any text following the directive on the same
-line is omitted.
+on its line; only one such directive is supported per line. Those which include
+a block of text have the behaviour that second and later lines are preceded
+with as many spaces as there were bytes preceding the directive; any text
+following such directives on the same line is omitted.
  
  If a line with any of these directives, after processing, contains only
  whitespace, the line does not appear at all in the output (the newline and any
@@ -332,6 +374,8 @@ whitespace is omitted).
  Unformatted substitutions
  -------------------------
  
+Unformatted substitutions that are available are:
+
  - $bouncenumbers$
    (available only in probe)
    the formatted list of indexes of messages which may not have been received as
@@ -494,18 +538,35 @@ Unformatted substitutions
    newline stripped; the name may only include letters, digits, underscore, dot
    and hyphen; note that there is a formatted version of this directive
  
+Escapes
+-------
+
+These allow you to avoid special meanings of characters used for other purposes
+in list texts, as well as control the construction of the texts at a fairly low
+level.
+
  - $$
    a single $
  
+- %%
+  a single %
+
+- \\
+  a single \
+
  - \uNNNN
-  (NNNN are hex digits)
+  (NNNN represents four hex digits)
    a Unicode character
    (this is not really appropriate for use in a header, except perhaps the
    Subject: header as Mlmmj does automatic quoting for that header as described
    above)
  
  - \<space>
+  a space, but don't allow the line to be broken here when wrapping
+
+- \/
    nothing, but allow the line to be broken here when wrapping
  
-- \\
-  a single \
+- \=
+  nothing, but don't allow the line to be broken here when wrapping
+
diff --git a/src/prepstdreply.c b/src/prepstdreply.c
index 9eeed32649552f709a5750bf42dd9af9c46879cd..53da1cf53311366acb466547e469695b5457d66b 100644
--- a/src/prepstdreply.c
+++ b/src/prepstdreply.c
@@ -98,6 +98,13 @@ enum conditional_target {
  };
  
  
+enum wrap_mode {
+       WRAP_WORD,
+       WRAP_CHAR,
+       WRAP_USER
+};
+
+
  struct text {
         char *action;
         char *reason;
@@ -108,6 +115,7 @@ struct text {
         formatted *fmts;
         int wrapindent;
         int wrapwidth;
+       enum wrap_mode wrapmode;
         conditional *cond;
         conditional *skip;
  };
@@ -458,6 +466,7 @@ text *open_text_file(const char *listdir, const char *filename)
         txt->fmts = NULL;
         txt->wrapindent = 0;
         txt->wrapwidth = 0;
+       txt->wrapmode = WRAP_WORD;
         txt->cond = NULL;
         txt->skip = NULL;
  
@@ -916,6 +925,20 @@ static int handle_directive(text *txt, char **line_p, char **pos_p,
                         *line_p = line;
                         return 0;
                 }
+       } else if(strcmp(token, "ww") == 0 ||
+                       strcmp(token, "wordwrap") == 0 ||
+                       strcmp(token, "cw") == 0 ||
+                       strcmp(token, "charwrap") == 0 ||
+                       strcmp(token, "uw") == 0 ||
+                       strcmp(token, "userwrap") == 0) {
+               if (*token == 'w') txt->wrapmode = WRAP_WORD;
+               if (*token == 'c') txt->wrapmode = WRAP_CHAR;
+               if (*token == 'u') txt->wrapmode = WRAP_USER;
+               line = concatstr(2, line, endpos + 1);
+               *pos_p = line + (*pos_p - *line_p);
+               myfree(*line_p);
+               *line_p = line;
+               return 0;
         } else if(strncmp(token, "control ", 8) == 0) {
                 token = filename_token(token + 8);
                 if (token != NULL) {
@@ -990,8 +1013,8 @@ char *get_processed_text_line(text *txt, int headers,
         char *tmp;
         char *prev = NULL;
         int len, i;
-       int directive;
         int incision, spc;
+       int directive, inhibitbreak;
         int peeking = 0; /* for a failed conditional without an else */
         int skipwhite; /* skip whitespace after a conditional directive */
         int swallow;
@@ -1047,8 +1070,11 @@ char *get_processed_text_line(text *txt, int headers,
                         /* Wrapping */
                         len = strlen(prev);
                         pos = prev + len - 1;
-                       while (pos > prev && (*pos == ' ' || *pos == '\t'))
-                                       pos--;
+                       if (txt->wrapmode == WRAP_WORD) {
+                               while (pos > prev &&
+                                               (*pos == ' ' || *pos == '\t'))
+                                               pos--;
+                       }
                         pos++;
                         *pos = '\0';
                         len = pos - prev;
@@ -1071,8 +1097,12 @@ char *get_processed_text_line(text *txt, int headers,
                                 if (*prev == '\0') {
                                         tmp = mystrdup(pos);
                                 } else {
-                                       tmp = concatstr(3, prev, " ", pos);
-                                       len++;
+                                       if (txt->wrapmode == WRAP_WORD) {
+                                           tmp = concatstr(3, prev, " ", pos);
+                                           len++;
+                                       } else {
+                                           tmp = concatstr(2, prev, pos);
+                                       }
                                 }
                                 myfree(line);
                                 line = tmp;
@@ -1096,9 +1126,13 @@ char *get_processed_text_line(text *txt, int headers,
                         incision = -1;
                 }
                 directive = 0;
+               inhibitbreak = 0;
                 while (*pos != '\0') {
                         if (txt->wrapwidth != 0 && len >= txt->wrapwidth &&
                                         !peeking && spc != -1) break;
+                       if ((unsigned char)*pos > 0xbf && txt->skip == NULL &&
+                                       txt->wrapmode == WRAP_CHAR &&
+                                       !inhibitbreak) spc = len - 1;
                         if (*pos == '\r') {
                                 *pos = '\0';
                                 pos++;
@@ -1113,23 +1147,35 @@ char *get_processed_text_line(text *txt, int headers,
                                 txt->src->upcoming = mystrdup(pos);
                                 break;
                         } else if (*pos == ' ') {
-                               if (txt->skip == NULL) {
-                                       spc = pos - line;
-                               }
+                               if (txt->skip == NULL &&
+                                               txt->wrapmode != WRAP_USER &&
+                                               !inhibitbreak) spc = len;
+                               inhibitbreak = 0;
                         } else if (*pos == '\t') {
                                 /* Avoid breaking due to peeking */
+                               inhibitbreak = 0;
                         } else if (txt->src->transparent) {
                                 /* Do nothing if the file is to be included
                                  * transparently */
                                 if (peeking && txt->skip == NULL) break;
+                               inhibitbreak = 0;
                         } else if (*pos == '\\' && txt->skip == NULL) {
                                 if (peeking) break;
-                               if (*(pos + 1) == ' ') {
+                               if (*(pos + 1) == '/') {
                                         spc = len - 1;
                                         tmp = pos + 2;
+                                       inhibitbreak = 0;
+                               } else if (*(pos + 1) == '=') {
+                                       tmp = pos + 2;
+                                       /* Ensure we don't wrap the next
+                                        * character */
+                                       inhibitbreak = 1;
                                 } else {
-                                       /* Includes backslash */
+                                       /* Includes space and backslash */
                                         tmp = pos + 1;
+                                       /* Ensure we don't wrap a space */
+                                       if (*(pos+1) == ' ') inhibitbreak = 1;
+                                       else inhibitbreak = 0;
                                 }
                                 *pos = '\0';
                                 tmp = concatstr(2, line, tmp);
@@ -1143,6 +1189,10 @@ char *get_processed_text_line(text *txt, int headers,
                                 substitute_one(&line, &pos, listaddr,
                                                 listdelim, listdir, txt);
                                 if (len != pos - line) {
+                                       /* Cancel any break inhibition if the
+                                        * length changed (which will be
+                                        * because of $$) */
+                                       inhibitbreak = 0;
                                         len = pos - line;
                                 }
                                 skipwhite = 0;
@@ -1175,6 +1225,11 @@ char *get_processed_text_line(text *txt, int headers,
                                         }
                                 }
                                 if (len != pos - line) {
+                                       /* Cancel any break inhibition if the
+                                        * length changed (which will be
+                                        * because of %% or %^% or an empty
+                                        * list) */
+                                       inhibitbreak = 0;
                                         len = pos - line;
                                 }
                                 /* handle_directive() sets up for the next
@@ -1217,7 +1272,8 @@ char *get_processed_text_line(text *txt, int headers,
                                 continue;
                         }
                         if (spc != -1) {
-                               if (line[spc] == ' ') line[spc] = '\0';
+                               if (txt->wrapmode == WRAP_WORD &&
+                                       line[spc] == ' ') line[spc] = '\0';
                                 spc++;
                                 if (line[spc] == '\0') spc = -1;
                         }
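
The break rules introduced by this patch can be sketched in isolation. What follows is a simplified illustration under stated assumptions, not the code in src/prepstdreply.c: the helper mark_breaks and its out and brk parameters are hypothetical names, and the sketch ignores the wrapping width, tabs, directives, substitutions and \uNNNN. It also marks a breakable space itself as the break point in both word and character modes, leaving out what happens to that space once the line is actually broken. Only the enum wrap_mode values and the 0xbf threshold (bytes above it begin a multi-byte UTF-8 sequence) are taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

enum wrap_mode { WRAP_WORD, WRAP_CHAR, WRAP_USER };

/* Hypothetical helper, not part of mlmmj. Copies src to out with the
 * wrapping escapes removed and sets brk[i] to 1 where a break is allowed
 * at output byte i (at a space, or just before any other byte).
 * out and brk must each hold at least strlen(src) + 1 entries. */
size_t mark_breaks(enum wrap_mode mode, const char *src,
                   char *out, unsigned char *brk)
{
    size_t n = 0;
    int inhibit = 0; /* \= seen: no break at the next byte */
    int pending = 0; /* \/ seen: break allowed before the next byte */

    assert(src != NULL && out != NULL && brk != NULL);
    while (*src != '\0') {
        unsigned char c;
        int escaped = 0, allowed;

        if (src[0] == '\\' && src[1] == '/') {
            pending = 1; inhibit = 0; src += 2; continue;
        }
        if (src[0] == '\\' && src[1] == '=') {
            inhibit = 1; src += 2; continue;
        }
        if (src[0] == '\\' && src[1] != '\0') {
            /* Any other escape: drop the backslash, keep the next byte */
            escaped = 1; src++;
        }
        c = (unsigned char)*src++;
        allowed = pending;
        if (!inhibit && !(escaped && c == ' ')) {
            /* Spaces break in word and character modes only */
            if (c == ' ' && mode != WRAP_USER) allowed = 1;
            /* UTF-8 lead bytes break in character mode only */
            if (c > 0xbf && mode == WRAP_CHAR) allowed = 1;
        }
        out[n] = (char)c;
        /* A break before the very first byte is meaningless */
        brk[n] = (unsigned char)(n > 0 && allowed);
        n++;
        inhibit = 0; pending = 0;
    }
    out[n] = '\0';
    return n;
}
```

With WRAP_WORD and the input `a b\ c\/d`, this sketch produces `a b cd` and allows breaks only at the first space and before `d`; the escaped space is kept but is not a break point. With WRAP_USER, plain spaces are never break points, so only \/ creates one.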