Improve how xgettext handles Perl syntax ambiguities.

author Guido Flohr <guido@imperia.net>

Thu, 1 Apr 2010 10:25:06 +0000 (12:25 +0200)

committer Bruno Haible <bruno@clisp.org>

Thu, 1 Apr 2010 10:25:28 +0000 (12:25 +0200)
author Guido Flohr <guido@imperia.net>
Thu, 1 Apr 2010 10:25:06 +0000 (12:25 +0200)
committer Bruno Haible <bruno@clisp.org>
Thu, 1 Apr 2010 10:25:28 +0000 (12:25 +0200)
diff --git a/gettext-tools/doc/ChangeLog b/gettext-tools/doc/ChangeLog

index 6105b937d18776614ea316d89fe1dd31799c2099..4afbf4f77324a47b4b8f71f5f6c9155231eac9bc 100644 (file)
--- a/gettext-tools/doc/ChangeLog
+++ b/gettext-tools/doc/ChangeLog
@@ -1,3 +1,9 @@
+2010-03-31  Guido Flohr  <guido@imperia.net>
+
+       More explanations about how xgettext handles Perl syntax ambiguities.
+       * gettext.texi (General Problems): Explain how xgettext disambiguates
+       conditional operator vs. regular expression.
+
  2010-03-13  Bruno Haible  <bruno@clisp.org>
  
         New options --color, --style for many programs.
diff --git a/gettext-tools/doc/gettext.texi b/gettext-tools/doc/gettext.texi

index 04b9ee3228bdacbbc9fe831df7263df46bf3219a..d342601f321c6b8356ebe7e8406cb282e2ba21c2 100644 (file)
--- a/gettext-tools/doc/gettext.texi
+++ b/gettext-tools/doc/gettext.texi
@@ -10997,6 +10997,48 @@ the semantic context.  If a slash is really a division sign but
  mis-interpreted as a pattern match, the rest of the input file is most
  probably parsed incorrectly.
  
+There are certain cases, where the ambiguity cannot be resolved at all:
+
+@example
+$x = wantarray ? 1 : 0;
+@end example
+
+The Perl built-in function @code{wantarray} does not accept any arguments.
+The Perl parser therefore knows that the question mark does not start
+a regular expression but is the ternary conditional operator.
+
+@example 
+sub wantarrays @{@}
+$x = wantarrays ? 1 : 0;
+@end example
+
+Now the situation is different.  The function @code{wantarrays} takes
+a variable number of arguments (like any non-prototyped Perl function).
+The question mark is now the delimiter of a pattern match, and hence
+the piece of code does not compile.
+
+@example
+sub wantarrays() @{@}
+$x = wantarrays ? 1 : 0;
+@end example
+
+Now the function is prototyped, Perl knows that it does not accept any
+arguments, and the question mark is therefore interpreted as the
+ternaray operator again.  But that unfortunately outsmarts @code{xgettext}.
+
+The Perl parser in @code{xgettext} cannot know whether a function has
+a prototype and what that prototype would look like.  It therefore makes
+an educated guess.  If a function is known to be a Perl built-in and
+this function does not accept any arguments, a following question mark
+or slash is treated as an operator, otherwise as the delimiter of a
+following regular expression.  The Perl built-ins that do not accept
+arguments are @code{wantarray}, @code{fork}, @code{time}, @code{times},
+@code{getlogin}, @code{getppid}, @code{getpwent}, @code{getgrent},
+@code{gethostent}, @code{getnetent}, @code{getprotoent}, @code{getservent},
+@code{setpwent}, @code{setgrent}, @code{endpwent}, @code{endgrent},
+@code{endhostent}, @code{endnetent}, @code{endprotoent}, and
+@code{endservent}.
+
  If you find that @code{xgettext} fails to extract strings from
  portions of your sources, you should therefore look out for slashes
  and/or question marks preceding these sections.  You may have come
@@ -11004,6 +11046,17 @@ across a bug in @code{xgettext}'s Perl parser (and of course you
  should report that bug).  In the meantime you should consider to
  reformulate your code in a manner less challenging to @code{xgettext}.
  
+In particular, if the parser is too dumb to see that a function
+does not accept arguments, use parentheses:
+
+@example
+$x = somefunc() ? 1 : 0;
+$y = (somefunc) ? 1 : 0;
+@end example
+
+In fact the Perl parser itself has similar problems and warns you
+about such constructs.
+
  @node Default Keywords, Special Keywords, General Problems, Perl
  @subsubsection Which keywords will xgettext look for?
  @cindex Perl default keywords
diff --git a/gettext-tools/src/ChangeLog b/gettext-tools/src/ChangeLog

index 47183dd53c997fbf76bbaa1a671d935b77428153..cb00dc6ba5da6fdc5e5243b8ff99c9032a5bdeee 100644 (file)
--- a/gettext-tools/src/ChangeLog
+++ b/gettext-tools/src/ChangeLog
@@ -1,5 +1,28 @@
  2010-03-31  Guido Flohr  <guido@imperia.net>
  
+       Improve how xgettext handles Perl syntax ambiguities.
+       * x-perl.c(enum token_type_ty): New enumeration items
+       token_type_number, token_type_object.
+       (struct token_ty): New field 'last_type'.
+       (token2string): Handle token_type_number, token_type_object.
+       (free_token): Likewise.
+       (prefer_division_over_regexp): Remove variable.
+       (extract_variable): Recognize token of type token_type_object.
+       (prefer_regexp_over_division): New function.
+       (last_token_type): Renamed from last_token.
+       (x_perl_prelex): Assign the token's last_type. Recognize token of type
+       token_type_number. Don't special-case "grep" and "split". Invoke
+       prefer_regexp_over_division for disambiguation.
+       (token_stack_dump): Handle token_type_number, token_type_object.
+       (x_perl_lex): Assign the token's last_type. Update last_token_type
+       intelligently.
+       (collect_message): Invoke prefer_regexp_over_division for
+       disambiguation.
+       (extract_balanced): Don't set last_token_type here. Handle
+       token_type_number, token_type_object.
+       (extract_perl): Initialize last_token_type here.
+       Reported by Guillem Jover <guillem@debian.org> via Santiago Vila.
+
         * x-perl.c (x_perl_prelex): Clarify interpolate_keywords arguments.
  
         * x-perl.c (eaten_here): Renamed from here_eaten.
diff --git a/gettext-tools/src/x-perl.c b/gettext-tools/src/x-perl.c

index c2e9fee96cf01285f2aa2ed7b6abb3e04f33bf8f..c7843c40539c0e542fe07bc38f3f2297b4e07100 100644 (file)
--- a/gettext-tools/src/x-perl.c
+++ b/gettext-tools/src/x-perl.c
@@ -1,7 +1,7 @@
  /* xgettext Perl backend.
     Copyright (C) 2002-2010 Free Software Foundation, Inc.
  
-   This file was written by Guido Flohr <guido@imperia.net>, 2002-2003.
+   This file was written by Guido Flohr <guido@imperia.net>, 2002-2010.
  
     This program is free software: you can redistribute it and/or modify
     it under the terms of the GNU General Public License as published by
@@ -234,7 +234,7 @@ phase1_getc ()
        ++line_number;
  
        /* Undosify.  This is important for catching the end of <<EOF and
-         <<'EOF'.  We could rely on stdio doing this for us but you
+         <<'EOF'.  We could rely on stdio doing this for us but
           it is not uncommon to to come across Perl scripts with CRLF
           newline conventions on systems that do not follow this
           convention.  */
@@ -506,15 +506,18 @@ enum token_type_ty
    token_type_rparen,            /* ) */
    token_type_comma,             /* , */
    token_type_fat_comma,         /* => */
-  token_type_dereference,       /* , */
+  token_type_dereference,       /* -> */
    token_type_semicolon,         /* ; */
    token_type_lbrace,            /* { */
    token_type_rbrace,            /* } */
    token_type_lbracket,          /* [ */
    token_type_rbracket,          /* ] */
    token_type_string,            /* quote-like */
+  token_type_number,            /* starting with a digit o dot */
    token_type_named_op,          /* if, unless, while, ... */
    token_type_variable,          /* $... */
+  token_type_object,            /* A dereferenced variable, maybe a blessed
+                                   object.  */
    token_type_symbol,            /* symbol, number */
    token_type_regex_op,          /* s, tr, y, m.  */
    token_type_dot,               /* . */
@@ -548,12 +551,14 @@ typedef struct token_ty token_ty;
  struct token_ty
  {
    token_type_ty type;
+  token_type_ty last_type;
    int sub_type;                 /* for token_type_string, token_type_symbol */
    char *string;                 /* for:                 in encoding:
                                     token_type_named_op  ASCII
                                     token_type_string    UTF-8
                                     token_type_symbol    ASCII
                                     token_type_variable  global_source_encoding
+                                   token_type_object    global_source_encoding
                                   */
    refcounted_string_list_ty *comment; /* for token_type_string */
    int line_number;
@@ -589,10 +594,14 @@ token2string (const token_ty *token)
        return "token_type_rbracket";
      case token_type_string:
        return "token_type_string";
+    case token_type_number:
+      return "token type number";
      case token_type_named_op:
        return "token_type_named_op";
      case token_type_variable:
        return "token_type_variable";
+    case token_type_object:
+      return "token_type_object";
      case token_type_symbol:
        return "token_type_symbol";
      case token_type_regex_op:
@@ -617,6 +626,7 @@ free_token (token_ty *tp)
      case token_type_string:
      case token_type_symbol:
      case token_type_variable:
+    case token_type_object:
        free (tp->string);
        break;
      default:
@@ -748,14 +758,6 @@ extract_quotelike_pass1_utf8 (int delim)
  /* ========= Reading of tokens and commands.  Extracting strings.  ========= */
  
  
-/* There is an ambiguity about '/': It can start a division operator ('/' or
-   '/=') or it can start a regular expression.  The distinction is important
-   because inside regular expressions, '#' loses its special meaning.
-   The distinction is possible depending on the parsing state: After a
-   variable or simple expression, it's a division operator; at the beginning
-   of an expression, it's a regexp.  */
-static bool prefer_division_over_regexp;
-
  /* Context lookup table.  */
  static flag_context_list_table_ty *flag_context_list_table;
  
@@ -1436,8 +1438,6 @@ extract_variable (message_list_ty *mlp, token_ty *tp, int first)
             real_file_name, line_number, tp->string);
  #endif
  
-  prefer_division_over_regexp = true;
-
    /*
     * 3) If the following looks strange to you, this is valid Perl syntax:
     *
@@ -1483,6 +1483,7 @@ extract_variable (message_list_ty *mlp, token_ty *tp, int first)
  
        if (maybe_hash_value && is_dereference)
          {
+          tp->type = token_type_object;
  #if DEBUG_PERL
            fprintf (stderr, "%s:%d: first keys preceded by \"->\"\n",
                     real_file_name, line_number);
@@ -2002,9 +2003,94 @@ interpolate_keywords (message_list_ty *mlp, const char *string, int lineno)
      }
  }
  
-/* The last token seen in the token stream.  This is important for the
-   interpretation of '?' and '/'.  */
-static token_type_ty last_token;
+/* There is an ambiguity about '/' and '?': They can start an operator
+   (division operator '/' or '/=' or the conditional operator '?'), or they can
+   start a regular expression.  The distinction is important because inside
+   regular expressions, '#' loses its special meaning.  This function helps
+   making the decision (a heuristic).  See the documentation for details.  */
+static bool
+prefer_regexp_over_division (token_type_ty type)
+{
+  bool retval = true;
+
+  switch (type)
+    {
+      case token_type_eof:
+        retval = true;
+        break;
+      case token_type_lparen:
+        retval = true;
+        break;
+      case token_type_rparen:
+        retval = false;
+        break;
+      case token_type_comma:
+        retval = true;
+        break;
+      case token_type_fat_comma:
+        retval = true;
+        break;
+      case token_type_dereference:
+        retval = true;
+        break;
+      case token_type_semicolon:
+        retval = true;
+        break;
+      case token_type_lbrace:
+        retval = true;
+        break;
+      case token_type_rbrace:
+        retval = false;
+        break;
+      case token_type_lbracket:
+        retval = true;
+        break;
+      case token_type_rbracket:
+        retval = false;
+        break;
+      case token_type_string:
+        retval = false;
+        break;
+      case token_type_number:
+        retval = false;
+        break;
+      case token_type_named_op:
+        retval = true;
+        break;
+      case token_type_variable:
+        retval = false;
+        break;
+      case token_type_object:
+        retval = false;
+        break;
+      case token_type_symbol:
+      case token_type_keyword_symbol:
+        retval = true;
+        break;
+      case token_type_regex_op:
+        retval = false;
+        break;
+      case token_type_dot:
+        retval = true;
+        break;
+      case token_type_other:
+        retval = true;
+        break;
+  }
+
+#if DEBUG_PERL
+  token_ty ty;
+  ty.type = type;
+  fprintf (stderr, "Prefer regexp over division after %s: %s\n",
+           token2string (&ty), retval ? "true" : "false");
+#endif
+
+  return retval;
+}
+
+/* Last token type seen in the stream.  Important for the interpretation
+   of slash and question mark.  */
+static token_type_ty last_token_type;
  
  /* Combine characters into tokens.  Discard whitespace.  */
  
@@ -2020,6 +2106,7 @@ x_perl_prelex (message_list_ty *mlp, token_ty *tp)
      {
        c = phase2_getc ();
        tp->line_number = line_number;
+      tp->last_type = last_token_type;
  
        switch (c)
          {
@@ -2043,7 +2130,6 @@ x_perl_prelex (message_list_ty *mlp, token_ty *tp)
            if (!extract_all)
              {
                extract_variable (mlp, tp, c);
-              prefer_division_over_regexp = true;
                return;
              }
            break;
@@ -2060,17 +2146,11 @@ x_perl_prelex (message_list_ty *mlp, token_ty *tp)
              if (c2 == '.')
                {
                  tp->type = token_type_other;
-                prefer_division_over_regexp = false;
                  return;
                }
-            else if (c2 >= '0' && c2 <= '9')
-              {
-                prefer_division_over_regexp = false;
-              }
-            else
+            else if (!(c2 >= '0' && c2 <= '9'))
                {
                  tp->type = token_type_dot;
-                prefer_division_over_regexp = true;
                  return;
                }
            }
@@ -2089,7 +2169,6 @@ x_perl_prelex (message_list_ty *mlp, token_ty *tp)
          case '0': case '1': case '2': case '3': case '4':
          case '5': case '6': case '7': case '8': case '9':
            /* Symbol, or part of a number.  */
-          prefer_division_over_regexp = true;
            bufpos = 0;
            for (;;)
              {
@@ -2154,7 +2233,6 @@ x_perl_prelex (message_list_ty *mlp, token_ty *tp)
              {
                tp->type = token_type_named_op;
                tp->string = xstrdup (buffer);
-              prefer_division_over_regexp = false;
                return;
              }
            else if (strcmp (buffer, "s") == 0
@@ -2180,7 +2258,6 @@ x_perl_prelex (message_list_ty *mlp, token_ty *tp)
                    tp->type = token_type_symbol;
                    tp->sub_type = symbol_type_none;
                    tp->string = xstrdup (buffer);
-                  prefer_division_over_regexp = true;
                    return;
                  }
                extract_triple_quotelike (mlp, tp, delim,
@@ -2214,7 +2291,6 @@ x_perl_prelex (message_list_ty *mlp, token_ty *tp)
                    tp->type = token_type_symbol;
                    tp->sub_type = symbol_type_none;
                    tp->string = xstrdup (buffer);
-                  prefer_division_over_regexp = true;
                    return;
                  }
                extract_quotelike (tp, delim);
@@ -2223,7 +2299,6 @@ x_perl_prelex (message_list_ty *mlp, token_ty *tp)
                free (tp->string);
                drop_reference (tp->comment);
                tp->type = token_type_regex_op;
-              prefer_division_over_regexp = true;
  
                /* Eat the following modifiers.  */
                do
@@ -2253,7 +2328,6 @@ x_perl_prelex (message_list_ty *mlp, token_ty *tp)
                    tp->type = token_type_eof;
                    return;
                  }
-              prefer_division_over_regexp = true;
  
                if ((delim >= '0' && delim <= '9')
                    || (delim >= 'A' && delim <= 'Z')
@@ -2264,7 +2338,6 @@ x_perl_prelex (message_list_ty *mlp, token_ty *tp)
                    tp->type = token_type_symbol;
                    tp->sub_type = symbol_type_none;
                    tp->string = xstrdup (buffer);
-                  prefer_division_over_regexp = true;
                    return;
                  }
  
@@ -2296,10 +2369,10 @@ x_perl_prelex (message_list_ty *mlp, token_ty *tp)
                  }
                return;
              }
-          else if (strcmp (buffer, "grep") == 0
-                   || strcmp (buffer, "split") == 0)
+          else if ((buffer[0] >= '0' && buffer[0] <= '9') || buffer[0] == '.')
              {
-              prefer_division_over_regexp = false;
+              tp->type = token_type_number;
+              return;
              }
            tp->type = token_type_symbol;
            tp->sub_type = (strcmp (buffer, "sub") == 0
@@ -2309,21 +2382,18 @@ x_perl_prelex (message_list_ty *mlp, token_ty *tp)
            return;
  
          case '"':
-          prefer_division_over_regexp = true;
            extract_quotelike (tp, c);
            tp->sub_type = string_type_qq;
            interpolate_keywords (mlp, tp->string, line_number);
            return;
  
          case '`':
-          prefer_division_over_regexp = true;
            extract_quotelike (tp, c);
            tp->sub_type = string_type_qq;
            interpolate_keywords (mlp, tp->string, line_number);
            return;
  
          case '\'':
-          prefer_division_over_regexp = true;
            extract_quotelike (tp, c);
            tp->sub_type = string_type_q;
            return;
@@ -2336,42 +2406,34 @@ x_perl_prelex (message_list_ty *mlp, token_ty *tp)
            else
              phase2_ungetc (c);
            tp->type = token_type_lparen;
-          prefer_division_over_regexp = false;
            return;
  
          case ')':
            tp->type = token_type_rparen;
-          prefer_division_over_regexp = true;
            return;
  
          case '{':
            tp->type = token_type_lbrace;
-          prefer_division_over_regexp = false;
            return;
  
          case '}':
            tp->type = token_type_rbrace;
-          prefer_division_over_regexp = false;
            return;
  
          case '[':
            tp->type = token_type_lbracket;
-          prefer_division_over_regexp = false;
            return;
  
          case ']':
            tp->type = token_type_rbracket;
-          prefer_division_over_regexp = false;
            return;
  
          case ';':
            tp->type = token_type_semicolon;
-          prefer_division_over_regexp = false;
            return;
  
          case ',':
            tp->type = token_type_comma;
-          prefer_division_over_regexp = false;
            return;
  
          case '=':
@@ -2383,8 +2445,8 @@ x_perl_prelex (message_list_ty *mlp, token_ty *tp)
                return;
              }
            else if (linepos == 2
-                   && (last_token == token_type_semicolon
-                       || last_token == token_type_rbrace)
+                   && (last_token_type == token_type_semicolon
+                       || last_token_type == token_type_rbrace)
                     && ((c >= 'A' && c <='Z')
                         || (c >= 'a' && c <= 'z')))
              {
@@ -2401,12 +2463,10 @@ x_perl_prelex (message_list_ty *mlp, token_ty *tp)
              }
            phase1_ungetc (c);
            tp->type = token_type_other;
-          prefer_division_over_regexp = false;
            return;
  
          case '<':
            /* Check for <<EOF and friends.  */
-          prefer_division_over_regexp = false;
            c = phase1_getc ();
            if (c == '<')
              {
@@ -2509,19 +2569,17 @@ x_perl_prelex (message_list_ty *mlp, token_ty *tp)
              }
            phase1_ungetc (c);
            tp->type = token_type_other;
-          prefer_division_over_regexp = false;
            return;
  
          case '/':
          case '?':
-          if (!prefer_division_over_regexp)
+          if (prefer_regexp_over_division (tp->last_type))
              {
                extract_quotelike (tp, c);
                interpolate_keywords (mlp, tp->string, line_number);
                free (tp->string);
                drop_reference (tp->comment);
-              tp->type = token_type_other;
-              prefer_division_over_regexp = true;
+              tp->type = token_type_regex_op;
                /* Eat the following modifiers.  */
                do
                  c = phase1_getc ();
@@ -2544,7 +2602,6 @@ x_perl_prelex (message_list_ty *mlp, token_ty *tp)
               as we only need to recognize gettext invocations.  Don't
               bother.  */
            tp->type = token_type_other;
-          prefer_division_over_regexp = false;
            return;
          }
      }
@@ -2583,6 +2640,10 @@ token_stack_dump (token_stack_ty *stack)
          case token_type_variable:
            fprintf (stderr, "    string: %s\n", token->string);
            break;
+        case token_type_object:
+          fprintf (stderr, "    string: %s->\n", token->string);
+        default:
+          break;
          }
      }
    fprintf (stderr, "END STACK DUMP\n");
@@ -2651,10 +2712,69 @@ x_perl_lex (message_list_ty *mlp)
      {
        tp = XMALLOC (token_ty);
        x_perl_prelex (mlp, tp);
+      tp->last_type = last_token_type;
+      last_token_type = tp->type;
+
  #if DEBUG_PERL
        fprintf (stderr, "%s:%d: x_perl_prelex returned %s\n",
                 real_file_name, line_number, token2string (tp));
  #endif
+
+      /* The interpretation of a slash or question mark after a function call
+         depends on the prototype of that function.  If the function expects
+         at least one argument, a regular expression is preferred, otherwise
+         an operator.  With our limited means, we can only guess here.  If
+         the function is a builtin that takes no arguments, we prefer an
+         operator by silently turning the last symbol into a variable instead
+         of a symbol.
+
+         Method calls without parentheses are not ambiguous.  After them, an
+         operator must follow.  Due to some ideosyncrasies in this parser
+         they are treated in two different manners.  If the call is
+         chained ($foo->bar->baz) the token left of the symbol is a
+         dereference operator.  If it is not chained ($foo->bar) the
+         dereference operator is consumed with the extracted variable.  The
+         latter case is handled below.  */
+      if (tp->type == token_type_symbol)
+        {
+          if (tp->last_type == token_type_dereference)
+            {
+              /* Class method call or chained method call (with at least
+                 two arrow operators).  */
+              last_token_type = token_type_variable;
+            }
+          else if (tp->last_type == token_type_object)
+            {
+              /* Instance method, not chained.  */
+              last_token_type = token_type_variable;
+            }
+          else if (strcmp (tp->string, "wantarray") == 0
+                   || strcmp (tp->string, "fork") == 0
+                   || strcmp (tp->string, "getlogin") == 0
+                   || strcmp (tp->string, "getppid") == 0
+                   || strcmp (tp->string, "getpwent") == 0
+                   || strcmp (tp->string, "getgrent") == 0
+                   || strcmp (tp->string, "gethostent") == 0
+                   || strcmp (tp->string, "getnetent") == 0
+                   || strcmp (tp->string, "getprotoent") == 0
+                   || strcmp (tp->string, "getservent") == 0
+                   || strcmp (tp->string, "setpwent") == 0
+                   || strcmp (tp->string, "setgrent") == 0
+                   || strcmp (tp->string, "endpwent") == 0
+                   || strcmp (tp->string, "endgrent") == 0
+                   || strcmp (tp->string, "endhostent") == 0
+                   || strcmp (tp->string, "endnetent") == 0
+                   || strcmp (tp->string, "endprotoent") == 0
+                   || strcmp (tp->string, "endservent") == 0
+                   || strcmp (tp->string, "time") == 0
+                   || strcmp (tp->string, "times") == 0
+                   || strcmp (tp->string, "wait") == 0
+                   || strcmp (tp->string, "wantarray") == 0)
+            {
+              /* A Perl built-in function that does not accept arguments.  */
+              last_token_type = token_type_variable;
+            }
+        }
      }
  #if DEBUG_PERL
    else
@@ -2789,7 +2909,8 @@ collect_message (message_list_ty *mlp, token_ty *tp, int error_level)
        phase2_ungetc (c);
  
        if (c == '"' || c == '\'' || c == '`'
-          || (!prefer_division_over_regexp && (c == '/' || c == '?'))
+          || ((c == '/' || c == '?')
+              && prefer_regexp_over_division (tp->last_type))
            || c == 'q')
          {
            token_ty *qstring = x_perl_lex (mlp);
@@ -2916,9 +3037,6 @@ extract_balanced (message_list_ty *mlp,
    ++nesting_level;
  #endif
  
-  last_token = token_type_semicolon;  /* Safe assumption.  */
-  prefer_division_over_regexp = false;
-
    for (;;)
      {
        /* The current token.  */
@@ -2926,8 +3044,6 @@ extract_balanced (message_list_ty *mlp,
  
        tp = x_perl_lex (mlp);
  
-      last_token = tp->type;
-
        if (delim == tp->type)
          {
            xgettext_current_source_encoding = po_charset_utf8;
@@ -3016,6 +3132,7 @@ extract_balanced (message_list_ty *mlp,
        switch (tp->type)
          {
          case token_type_symbol:
+        case token_type_keyword_symbol:
  #if DEBUG_PERL
            fprintf (stderr, "%s:%d: type symbol (%d) \"%s\"\n",
                     logical_file_name, tp->line_number, nesting_level,
@@ -3031,7 +3148,6 @@ extract_balanced (message_list_ty *mlp,
                  const struct callshapes *shapes =
                    (const struct callshapes *) keyword_value;
  
-                last_token = token_type_keyword_symbol;
                  next_shapes = shapes;
                  next_argparser = arglist_parser_alloc (mlp, shapes);
                }
@@ -3052,9 +3168,22 @@ extract_balanced (message_list_ty *mlp,
          case token_type_variable:
  #if DEBUG_PERL
            fprintf (stderr, "%s:%d: type variable (%d) \"%s\"\n",
-                   logical_file_name, tp->line_number, nesting_level, tp->string);
+                   logical_file_name, tp->line_number, nesting_level,
+                   tp->string);
+#endif
+          next_is_argument = false;
+          if (next_argparser != NULL)
+            free (next_argparser);
+          next_argparser = NULL;
+          next_context_iter = null_context_list_iterator;
+          break;
+
+        case token_type_object:
+#if DEBUG_PERL
+          fprintf (stderr, "%s:%d: type object (%d) \"%s->\"\n",
+                   logical_file_name, tp->line_number, nesting_level,
+                   tp->string);
  #endif
-          prefer_division_over_regexp = true;
            next_is_argument = false;
            if (next_argparser != NULL)
              free (next_argparser);
@@ -3217,6 +3346,18 @@ extract_balanced (message_list_ty *mlp,
            next_context_iter = null_context_list_iterator;
            break;
  
+        case token_type_number:
+#if DEBUG_PERL
+          fprintf (stderr, "%s:%d: type number (%d)\n",
+                   logical_file_name, tp->line_number, nesting_level);
+#endif
+          next_is_argument = false;
+          if (next_argparser != NULL)
+            free (next_argparser);
+          next_argparser = NULL;
+          next_context_iter = null_context_list_iterator;
+          break;
+
          case token_type_eof:
  #if DEBUG_PERL
            fprintf (stderr, "%s:%d: type EOF (%d)\n",
@@ -3429,6 +3570,9 @@ extract_perl (FILE *f, const char *real_filename, const char *logical_filename,
    eaten_here = 0;
    end_of_file = false;
  
+  /* Safe assumption.  */
+  last_token_type = token_type_semicolon;
+
    /* Eat tokens until eof is seen.  When extract_balanced returns
       due to an unbalanced closing brace, just restart it.  */
    while (!extract_balanced (mlp, token_type_rbrace, true, false,
@@ -3441,7 +3585,7 @@ extract_perl (FILE *f, const char *real_filename, const char *logical_filename,
    free (logical_file_name);
    logical_file_name = NULL;
    line_number = 0;
-  last_token = token_type_semicolon;
+  last_token_type = token_type_semicolon;
    token_stack_free (&token_stack);
    eaten_here = 0;
    end_of_file = true;
diff --git a/gettext-tools/tests/ChangeLog b/gettext-tools/tests/ChangeLog

index 4ca6a394fb08e5d3d7c77dc0ac1199e6265568bd..deb0123f466c729c7e8dd959cb2360c70bf615b2 100644 (file)
--- a/gettext-tools/tests/ChangeLog
+++ b/gettext-tools/tests/ChangeLog
@@ -1,3 +1,10 @@
+2010-03-31  Guido Flohr  <guido@imperia.net>
+
+       Improve how xgettext handles Perl syntax ambiguities.
+       * xgettext-perl-8: New file.
+       * Makefile.am (TESTS): Add it.
+       Reported by Guillem Jover <guillem@debian.org> via Santiago Vila.
+
  2009-12-26  Bruno Haible  <bruno@clisp.org>
  
         Enable the gettext-6 and gettext-7 tests also on MacOS X.
diff --git a/gettext-tools/tests/Makefile.am b/gettext-tools/tests/Makefile.am

index 74dc13f5849beb317b6ca50a6162cc31b0448839..5a89a6e021c2da76bdc767c8311eb9bc27bfe29f 100644 (file)
--- a/gettext-tools/tests/Makefile.am
+++ b/gettext-tools/tests/Makefile.am
@@ -1,5 +1,5 @@
  ## Makefile for the gettext-tools/tests subdirectory of GNU gettext
-## Copyright (C) 1995-1997, 2001-2009 Free Software Foundation, Inc.
+## Copyright (C) 1995-1997, 2001-2010 Free Software Foundation, Inc.
  ##
  ## This program is free software: you can redistribute it and/or modify
  ## it under the terms of the GNU General Public License as published by
@@ -86,7 +86,7 @@ TESTS = gettext-1 gettext-2 gettext-3 gettext-4 gettext-5 gettext-6 gettext-7 \
         xgettext-lisp-1 xgettext-lisp-2 \
         xgettext-objc-1 xgettext-objc-2 \
         xgettext-perl-1 xgettext-perl-2 xgettext-perl-3 xgettext-perl-4 \
-       xgettext-perl-5 xgettext-perl-6 xgettext-perl-7 \
+       xgettext-perl-5 xgettext-perl-6 xgettext-perl-7 xgettext-perl-8 \
         xgettext-php-1 xgettext-php-2 xgettext-php-3 xgettext-php-4 \
         xgettext-po-1 \
         xgettext-properties-1 \
diff --git a/gettext-tools/tests/xgettext-perl-8 b/gettext-tools/tests/xgettext-perl-8

new file mode 100755 (executable)

index 0000000..74247cd
--- /dev/null
+++ b/gettext-tools/tests/xgettext-perl-8
@@ -0,0 +1,133 @@
+#! /bin/sh
+
+# The slash (/) and the question mark (?) serve a double-purpose in Perl.
+# Depending on the context they can either be an operator (division
+# or ternary respectively) or they are regex delimiters for pattern
+# matches.  This test case checks the proper recognition.
+
+tmpfiles=""
+trap 'rm -fr $tmpfiles' 1 2 3 15
+
+tmpfiles="$tmpfiles xg-pl-8.pl"
+cat <<\EOF > xg-pl-8.pl
+info(__("using %s."), ($a->b() eq "auto" ? "" : ""));
+
+print __"Question mark after string is an operator!\n";
+# ?; Re-sync.
+
+@times = sort {$a - $b} split /,\s*/, $options
+    if (defined $options && $options);
+
+print __"First slash in a an argument to a function starts a pattern match.";
+# /; Re-sync.
+
+$0 =~ /xyz/ ? 'foo' : 'bar';
+
+print __"Question mark after a regular pattern match is an operator!";
+# ?; Re-sync.
+
+$0 =~ m{xyz} ? 'foo' : 'bar';
+
+print __"Question mark after a nesting pattern match is an operator!";
+# ?; Re-sync.
+
+$0 =~ m|xyz| ? 'foo' : 'bar';
+
+print __"Question mark after a non-nesting pattern match is an operator!";
+# ?; Re-sync.
+
+print __(<<EOS);
+Line number for here documents is not correct.
+EOS
+
+$foo = wantarray ? 1 : 0;
+
+print __"The function wantarray does not take arguments!";
+# ?; Re-sync.
+
+$foo = Something->method ? 1 : 0;
+
+print __"Class method calls without parentheses do not accept arguments!";
+# ?; Re-sync.
+
+$foo = $Something->method ? 1 : 0;
+
+print __"Instance method calls without parentheses do not accept arguments!";
+# ?; Re-sync.
+
+$foo = $Some->thing->method ? 1 : 0;
+
+print __"Chained method calls without parentheses do not accept arguments!";
+# ?; Re-sync.
+
+print __"Synching works.";
+EOF
+
+tmpfiles="$tmpfiles xg-pl-8.tmp.po xg-pl-8.po"
+: ${XGETTEXT=xgettext}
+${XGETTEXT} --omit-header -n \
+  -k__ \
+  -d xg-pl-8.tmp xg-pl-8.pl
+test $? = 0 || { rm -fr $tmpfiles; exit 1; }
+LC_ALL=C tr -d '\r' < xg-pl-8.tmp.po > xg-pl-8.po
+test $? = 0 || { rm -fr $tmpfiles; exit 1; }
+
+tmpfiles="$tmpfiles xg-pl-8.ok"
+cat <<\EOF > xg-pl-8.ok
+#: xg-pl-8.pl:1
+#, perl-format
+msgid "using %s."
+msgstr ""
+
+#: xg-pl-8.pl:3
+msgid "Question mark after string is an operator!\n"
+msgstr ""
+
+#: xg-pl-8.pl:9
+msgid "First slash in a an argument to a function starts a pattern match."
+msgstr ""
+
+#: xg-pl-8.pl:14
+msgid "Question mark after a regular pattern match is an operator!"
+msgstr ""
+
+#: xg-pl-8.pl:19
+msgid "Question mark after a nesting pattern match is an operator!"
+msgstr ""
+
+#: xg-pl-8.pl:24
+msgid "Question mark after a non-nesting pattern match is an operator!"
+msgstr ""
+
+#: xg-pl-8.pl:28
+msgid "Line number for here documents is not correct.\n"
+msgstr ""
+
+#: xg-pl-8.pl:33
+msgid "The function wantarray does not take arguments!"
+msgstr ""
+
+#: xg-pl-8.pl:38
+msgid "Class method calls without parentheses do not accept arguments!"
+msgstr ""
+
+#: xg-pl-8.pl:43
+msgid "Instance method calls without parentheses do not accept arguments!"
+msgstr ""
+
+#: xg-pl-8.pl:48
+msgid "Chained method calls without parentheses do not accept arguments!"
+msgstr ""
+
+#: xg-pl-8.pl:51
+msgid "Synching works."
+msgstr ""
+EOF
+
+: ${DIFF=diff}
+${DIFF} xg-pl-8.ok xg-pl-8.po
+result=$?
+
+rm -fr $tmpfiles
+
+exit $result
author	Guido Flohr <guido@imperia.net>
	Thu, 1 Apr 2010 10:25:06 +0000 (12:25 +0200)
committer	Bruno Haible <bruno@clisp.org>
	Thu, 1 Apr 2010 10:25:28 +0000 (12:25 +0200)
gettext-tools/doc/ChangeLog		patch \| blob \| blame \| history
gettext-tools/doc/gettext.texi		patch \| blob \| blame \| history
gettext-tools/src/ChangeLog		patch \| blob \| blame \| history
gettext-tools/src/x-perl.c		patch \| blob \| blame \| history
gettext-tools/tests/ChangeLog		patch \| blob \| blame \| history
gettext-tools/tests/Makefile.am		patch \| blob \| blame \| history
gettext-tools/tests/xgettext-perl-8	[new file with mode: 0755]	patch \| blob