From: Ben Schmidt <none@none>
Date: Tue, 28 Feb 2012 13:46:35 +0000 (+1100)
Subject: Add wrapping modes to facilitate wrapping non-English texts.
X-Git-Tag: RELEASE_1_2_18a1~13
X-Git-Url: http://git.ipfire.org/?a=commitdiff_plain;h=dbd4145109593b4628800a8a227c7daa7714792d;p=thirdparty%2Fmlmmj.git

Add wrapping modes to facilitate wrapping non-English texts.

- Add %wordwrap%, %charwrap% and %userwrap% line-breaking modes.
- \<space> now means a non-breakable space, not a break opportunity.
- Introduce \/ to mark a break opportunity.
- Introduce \= to inhibit a break.
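
As a rough illustration of the rules above (a simplified sketch, not the code
in src/prepstdreply.c), the hypothetical helper mark_breaks() below resolves
the \\, \<space>, \/ and \= escapes and writes '|' wherever the chosen mode
would permit a break. It assumes UTF-8 input, treats \= as affecting only the
next byte, marks a break after a space rather than replacing the space, and
ignores widths, directives and substitutions entirely.

```c
#include <stddef.h>

enum wrap_mode { WRAP_WORD, WRAP_CHAR, WRAP_USER };

/* Hypothetical helper, not part of mlmmj: copy `in` to `out`, resolving the
 * \\, \<space>, \/ and \= escapes and writing '|' at each point where `mode`
 * would allow a line break. Returns 0 on success, -1 if `out` is too small. */
int mark_breaks(const char *in, enum wrap_mode mode, char *out, size_t outsz)
{
	size_t o = 0;
	int inhibit = 0; /* set by \=; assumed to cover only the next byte */

	if (outsz == 0) return -1;
	while (*in != '\0') {
		unsigned char c = (unsigned char)*in++;
		int escaped = 0, before, after;

		if (c == '\\' && *in == '/') {
			/* \/ emits nothing and allows a break in any mode,
			 * except at the very start of the text */
			in++;
			if (o > 0) {
				if (o + 1 >= outsz) return -1;
				out[o++] = '|';
			}
			inhibit = 0;
			continue;
		}
		if (c == '\\' && *in == '=') {
			/* \= emits nothing and inhibits the next break */
			in++;
			inhibit = 1;
			continue;
		}
		if (c == '\\' && (*in == ' ' || *in == '\\')) {
			/* \<space> and \\ yield the second byte literally */
			c = (unsigned char)*in++;
			escaped = 1;
		}
		/* charwrap: break before the lead byte of a non-ASCII
		 * character (bytes above 0xbf in UTF-8) */
		before = mode == WRAP_CHAR && c > 0xbf && !inhibit && o > 0;
		/* wordwrap and charwrap: break after an unescaped space */
		after = mode != WRAP_USER && c == ' ' && !escaped && !inhibit;
		if (o + before + 1 + after >= outsz) return -1;
		if (before) out[o++] = '|';
		out[o++] = (char)c;
		if (after) out[o++] = '|';
		inhibit = 0;
	}
	out[o] = '\0';
	return 0;
}
```

Under these assumptions, the input "a b\ c\/d" is marked as "a |b c|d" in
WRAP_WORD and as "a b c|d" in WRAP_USER.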
---

diff --git a/ChangeLog b/ChangeLog
index 55e7434b..5984229f 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,4 @@
+ o Add different wrapping modes to facilitate wrapping many languages.
  o Fix backslash escaping mechanism so double backslash can't effectively
    recurse and form part of another escape sequence, other non-unicode escapes
    aren't ignored, and first lines of included files don't 'escape' escaping.
@@ -9,10 +10,9 @@
  o Make mlmmj-sub and +subscribe[-digest|-nomail] switch existing
    subscriptions.
  o Add a switch to bypass notifying the owner on subscribe/unsubscribe.
- o Introduce \<space> to indicate line-break positions to enable sensible
-   wrapping of Chinese and similar text.
- o Allow lines to be longer than the wrapping width if there are no spaces,
-   as generated email addresses (e.g. for moderation) won't work if split.
+ o Introduce \<space> to indicate non-breakable space, \= to mark other
+   locations where breaks should not occur, and \/ to mark locations where
+   breaks can occur.
  o Add rejection of posts and obstruction of subscriptions.
  o Avoid bogus error messages when logging that the list address has been
    found in To: or CC: headers.
diff --git a/README.listtexts b/README.listtexts
index 84f872bf..f1e7b2e5 100644
--- a/README.listtexts
+++ b/README.listtexts
@@ -15,8 +15,11 @@ This file documents the following aspects of list texts:
 - Supported list texts
 - Format
 - Conditionals
-- Formatting and formatted substitutions
+- Wrapping
+- Formatting and comments
+- Formatted substitutions
 - Unformatted substitutions
+- Escapes
 
 Naming scheme
 -------------
@@ -240,21 +243,75 @@ need to worry about it.
 Note that when multiple parameters can be given for the directives, these have
 'or' behaviour; to get 'and' behaviour, nest conditionals.
 
-Formatting and formatted substitutions
---------------------------------------
+Wrapping
+--------
 
-These formatting-related directives work with multiple lines, so are generally
-not appropriate for use in headers. They are:
+There are various directives available to assist with wrapping and formatting.
+Wrapping needs to be enabled for each paragraph with:
 
 - %wrap%
 - %wrap W%
-  lines until the next blank line are concatenated and are then rewrapped to a
-  width of W (or 76 if W is omitted); lines have whitespace trimmed before
-  being joined with a single space; lines are broken at spaces or at points
-  marked for breaking with \<space>; the width is reckoned including any text
+  concatenate lines until the next blank line and rewrap them to a width of W
+  (or 76 if W is omitted); second and later lines are indented by the width of
+  the text preceding the directive; the width is reckoned including any text
   preceding the directive and any indentation preserved from a file which
-  included the current one, so it is an absolute maximum width; it is measured
-  in bytes
+  included the current one, so it is an absolute maximum width
+
+To cater for various languages, there are a number of different wrapping modes
+that can be set. These can be set either before or after wrapping is specified,
+and can even be changed part way through a paragraph if desired. The following
+directives control them:
+
+- %wordwrap%
+- %ww%
+  use word-wrapping (this is the default; good for English, French, Greek and
+  other languages that use an alphabet and spaces between words); lines have
+  whitespace trimmed from both ends and are joined with a single space; lines
+  are broken at spaces or at points marked for breaking with \/, but not at
+  spaces escaped with a backslash
+
+- %charwrap%
+- %cw%
+  use character-wrapping (good for Chinese, Japanese and Korean which use
+  characters without spaces between words); lines have only leading whitespace
+  trimmed and are joined without inserting anything at the joint; lines are
+  broken at spaces or any non-ASCII character except where disallowed with \=
+
+- %userwrap%
+- %uw%
+  use user-wrapping (for more complex languages or wherever complete manual
+  control is desired); lines have only leading whitespace trimmed and are
+  joined without inserting anything at the joint; lines are broken only where
+  marked for breaking with \/
+
+If a line with any of the directives in this section, after processing,
+contains only whitespace, the line does not appear at all in the output (the
+newline and any whitespace is omitted).
+
+Formatting and comments
+-----------------------
+
+The following directives are available to assist with formatting and
+readability:
+
+- %^%
+  start the line here; anything preceding this directive is ignored (useful for
+  using indentation for readability without ruining the formatting of the text
+  when it is processed)
+
+- %comment%
+- %$%
+  end the line here; anything following this directive is ignored
+
+If a line with any of these directives, after processing, contains only
+whitespace, the line does not appear at all in the output (the newline and any
+whitespace is omitted).
+
+Formatted substitutions
+-----------------------
+
+These formatted substitutions work with multiple lines, so are generally not
+appropriate for use in headers. They are:
 
 - %text T%
   text from the file named T in the listdir/text directory; the name may only
@@ -303,27 +360,12 @@ not appropriate for use in headers. They are:
   the list of indexes of messages which may not have been received as they
   bounced
 
-- %^%
-  start the line here; anything preceding this directive is ignored (useful for
-  using indentation for readability without ruining the formatting of the text
-  when it is processed)
-
-- %comment%
-- %$%
-  end the line here; anything following this directive is ignored
-
-- %%
-  a single %
-
 Directives which include a list of items have the behaviour that each item is
 preceded and followed by the same text as preceded and followed the directive
-on its line. Only one such directive is supported per line.
-
-The %wrap% and %wrap W% directives, as well as those which include a block of
-text, have the behaviour that second and later lines are preceded with as many
-spaces as there were characters preceding the directive. Apart from the
-%wrap% and %wrap W% directives, any text following the directive on the same
-line is omitted.
+on its line; only one such directive is supported per line. Those which include
+a block of text have the behaviour that second and later lines are preceded
+with as many spaces as there were bytes preceding the directive; any text
+following such directives on the same line is omitted.
 
 If a line with any of these directives, after processing, contains only
 whitespace, the line does not appear at all in the output (the newline and any
@@ -332,6 +374,8 @@ whitespace is omitted).
 Unformatted substitutions
 -------------------------
 
+The available unformatted substitutions are:
+
 - $bouncenumbers$
   (available only in probe)
   the formatted list of indexes of messages which may not have been received as
@@ -494,18 +538,35 @@ Unformatted substitutions
   newline stripped; the name may only include letters, digits, underscore, dot
   and hyphen; note that there is a formatted version of this directive
 
+Escapes
+-------
+
+These allow you to avoid special meanings of characters used for other purposes
+in list texts, as well as control the construction of the texts at a fairly low
+level.
+
 - $$
   a single $
 
+- %%
+  a single %
+
+- \\
+  a single \
+
 - \uNNNN
-  (NNNN are hex digits)
+  (NNNN represents four hex digits)
   a Unicode character
   (this is not really appropriate for use in a header, except perhaps the
   Subject: header as Mlmmj does automatic quoting for that header as described
   above)
 
 - \<space>
+  a space, but don't allow the line to be broken here when wrapping
+
+- \/
   nothing, but allow the line to be broken here when wrapping
 
-- \\
-  a single \
+- \=
+  nothing, but don't allow the line to be broken here when wrapping
+
diff --git a/src/prepstdreply.c b/src/prepstdreply.c
index 9eeed326..53da1cf5 100644
--- a/src/prepstdreply.c
+++ b/src/prepstdreply.c
@@ -98,6 +98,13 @@ enum conditional_target {
 };
 
 
+enum wrap_mode {
+	WRAP_WORD,
+	WRAP_CHAR,
+	WRAP_USER
+};
+
+
 struct text {
 	char *action;
 	char *reason;
@@ -108,6 +115,7 @@ struct text {
 	formatted *fmts;
 	int wrapindent;
 	int wrapwidth;
+	enum wrap_mode wrapmode;
 	conditional *cond;
 	conditional *skip;
 };
@@ -458,6 +466,7 @@ text *open_text_file(const char *listdir, const char *filename)
 	txt->fmts = NULL;
 	txt->wrapindent = 0;
 	txt->wrapwidth = 0;
+	txt->wrapmode = WRAP_WORD;
 	txt->cond = NULL;
 	txt->skip = NULL;
 
@@ -916,6 +925,20 @@ static int handle_directive(text *txt, char **line_p, char **pos_p,
 			*line_p = line;
 			return 0;
 		}
+	} else if(strcmp(token, "ww") == 0 ||
+			strcmp(token, "wordwrap") == 0 ||
+			strcmp(token, "cw") == 0 ||
+			strcmp(token, "charwrap") == 0 ||
+			strcmp(token, "uw") == 0 ||
+			strcmp(token, "userwrap") == 0) {
+		if (*token == 'w') txt->wrapmode = WRAP_WORD;
+		if (*token == 'c') txt->wrapmode = WRAP_CHAR;
+		if (*token == 'u') txt->wrapmode = WRAP_USER;
+		line = concatstr(2, line, endpos + 1);
+		*pos_p = line + (*pos_p - *line_p);
+		myfree(*line_p);
+		*line_p = line;
+		return 0;
 	} else if(strncmp(token, "control ", 8) == 0) {
 		token = filename_token(token + 8);
 		if (token != NULL) {
@@ -990,8 +1013,8 @@ char *get_processed_text_line(text *txt, int headers,
 	char *tmp;
 	char *prev = NULL;
 	int len, i;
-	int directive;
 	int incision, spc;
+	int directive, inhibitbreak;
 	int peeking = 0; /* for a failed conditional without an else */
 	int skipwhite; /* skip whitespace after a conditional directive */
 	int swallow;
@@ -1047,8 +1070,11 @@ char *get_processed_text_line(text *txt, int headers,
 			/* Wrapping */
 			len = strlen(prev);
 			pos = prev + len - 1;
-			while (pos > prev && (*pos == ' ' || *pos == '\t'))
-					pos--;
+			if (txt->wrapmode == WRAP_WORD) {
+				while (pos > prev &&
+						(*pos == ' ' || *pos == '\t'))
+						pos--;
+			}
 			pos++;
 			*pos = '\0';
 			len = pos - prev;
@@ -1071,8 +1097,12 @@ char *get_processed_text_line(text *txt, int headers,
 				if (*prev == '\0') {
 					tmp = mystrdup(pos);
 				} else {
-					tmp = concatstr(3, prev, " ", pos);
-					len++;
+					if (txt->wrapmode == WRAP_WORD) {
+					    tmp = concatstr(3, prev, " ", pos);
+					    len++;
+					} else {
+					    tmp = concatstr(2, prev, pos);
+					}
 				}
 				myfree(line);
 				line = tmp;
@@ -1096,9 +1126,13 @@ char *get_processed_text_line(text *txt, int headers,
 			incision = -1;
 		}
 		directive = 0;
+		inhibitbreak = 0;
 		while (*pos != '\0') {
 			if (txt->wrapwidth != 0 && len >= txt->wrapwidth &&
 					!peeking && spc != -1) break;
+			if ((unsigned char)*pos > 0xbf && txt->skip == NULL &&
+					txt->wrapmode == WRAP_CHAR &&
+					!inhibitbreak) spc = len - 1;
 			if (*pos == '\r') {
 				*pos = '\0';
 				pos++;
@@ -1113,23 +1147,35 @@ char *get_processed_text_line(text *txt, int headers,
 				txt->src->upcoming = mystrdup(pos);
 				break;
 			} else if (*pos == ' ') {
-				if (txt->skip == NULL) {
-					spc = pos - line;
-				}
+				if (txt->skip == NULL &&
+						txt->wrapmode != WRAP_USER &&
+						!inhibitbreak) spc = len;
+				inhibitbreak = 0;
 			} else if (*pos == '\t') {
 				/* Avoid breaking due to peeking */
+				inhibitbreak = 0;
 			} else if (txt->src->transparent) {
 				/* Do nothing if the file is to be included
 			 	 * transparently */
 				if (peeking && txt->skip == NULL) break;
+				inhibitbreak = 0;
 			} else if (*pos == '\\' && txt->skip == NULL) {
 				if (peeking) break;
-				if (*(pos + 1) == ' ') {
+				if (*(pos + 1) == '/') {
 					spc = len - 1;
 					tmp = pos + 2;
+					inhibitbreak = 0;
+				} else if (*(pos + 1) == '=') {
+					tmp = pos + 2;
+					/* Ensure we don't wrap the next
+					 * character */
+					inhibitbreak = 1;
 				} else {
-					/* Includes backslash */
+					/* Includes space and backslash */
 					tmp = pos + 1;
+					/* Ensure we don't wrap a space */
+					if (*(pos+1) == ' ') inhibitbreak = 1;
+					else inhibitbreak = 0;
 				}
 				*pos = '\0';
 				tmp = concatstr(2, line, tmp);
@@ -1143,6 +1189,10 @@ char *get_processed_text_line(text *txt, int headers,
 				substitute_one(&line, &pos, listaddr,
 						listdelim, listdir, txt);
 				if (len != pos - line) {
+					/* Cancel any break inhibition if the
+					 * length changed (which will be
+					 * because of $$) */
+					inhibitbreak = 0;
 					len = pos - line;
 				}
 				skipwhite = 0;
@@ -1175,6 +1225,11 @@ char *get_processed_text_line(text *txt, int headers,
 					}
 				}
 				if (len != pos - line) {
+					/* Cancel any break inhibition if the
+					 * length changed (which will be
+					 * because of %% or %^% or an empty
+					 * list) */
+					inhibitbreak = 0;
 					len = pos - line;
 				}
 				/* handle_directive() sets up for the next
@@ -1217,7 +1272,8 @@ char *get_processed_text_line(text *txt, int headers,
 				continue;
 			}
 			if (spc != -1) {
-				if (line[spc] == ' ') line[spc] = '\0';
+				if (txt->wrapmode == WRAP_WORD &&
+					line[spc] == ' ') line[spc] = '\0';
 				spc++;
 				if (line[spc] == '\0') spc = -1;
 			}