Rationale

The Open Group Base Specifications Issue 6
IEEE Std 1003.1, 2004 Edition
Copyright © 2001-2004 The IEEE and The Open Group

A.9 Regular Expressions

Rather than repeating the description of REs for each utility supporting REs, the standard developers preferred a common, comprehensive description of regular expressions in one place. The most common behavior is described here, and exceptions or extensions to this are documented for the respective utilities, as appropriate.

The BRE corresponds to the ed or historical grep type, and the ERE corresponds to the historical egrep type (now grep -E).

The text is based on the ed description and substantially modified, primarily to aid developers and others in the understanding of the capabilities and limitations of REs. Much of this was influenced by internationalization requirements.

It should be noted that the definitions in this section do not cover the tr utility; the tr syntax does not employ REs.

The specification of REs is particularly important to internationalization because pattern matching operations are very basic operations in business and other operations. The syntax and rules of REs are intended to be as intuitive as possible to make them easy to understand and use. The historical rules and behavior do not provide that capability to non-English language users, and do not provide the necessary support for commonly used characters and language constructs. It was necessary to provide extensions to the historical RE syntax and rules to accommodate other languages.

As they are limited to bracket expressions, the rationale for these modifications is in the Base Definitions volume of IEEE Std 1003.1-2001, Section 9.3.5, RE Bracket Expression.

A.9.1 Regular Expression Definitions

It is possible to determine what strings correspond to subexpressions by recursively applying the leftmost longest rule to each subexpression, but only with the proviso that the overall match is leftmost longest. For example, matching "$ac*$c*d[ac]*\1" against acdacaaa matches acdacaaa (with \1=a); simply matching the longest match for "$ac*$" would yield \1=ac, but the overall match would be smaller (acdac). Conceptually, the implementation must examine every possible match and among those that yield the leftmost longest total matches, pick the one that does the longest match for the leftmost subexpression, and so on. Note that this means that matching by subexpressions is context-dependent: a subexpression within a larger RE may match a different string from the one it would match as an independent RE, and two instances of the same subexpression within the same larger RE may match different lengths even in similar sequences of characters. For example, in the ERE "(a.*b)(a.*b)", the two identical subexpressions would match four and six characters, respectively, of accbaccccb.

The definition of single character has been expanded to include also collating elements consisting of two or more characters; this expansion is applicable only when a bracket expression is included in the BRE or ERE. An example of such a collating element may be the Dutch ij, which collates as a 'y'. In some encodings, a ligature "i with j" exists as a character and would represent a single-character collating element. In another encoding, no such ligature exists, and the two-character sequence ij is defined as a multi-character collating element. Outside brackets, the ij is treated as a two-character RE and matches the same characters in a string. Historically, a bracket expression only matched a single character. The ISO POSIX-2:1993 standard required bracket expressions like "[^[:lower:]]" to match multi-character collating elements such as "ij". However, this requirement led to behavior that many users did not expect and that could not feasibly be mimicked in user code, and it was rarely if ever implemented correctly. The current standard leaves it unspecified whether a bracket expression matches a multi-character collating element, allowing both historical and ISO POSIX-2:1993 standard implementations to conform.

Also, in the current standard, it is unspecified whether character class expressions like "[:lower:]" can include multi-character collating elements like "ij" ; hence "[[:lower:]]" can match "ij", and "[^[:lower:]]" can fail to match "ij". Common practice is for a character class expression to match a collating element if it matches the collating element's first character.

A.9.2 Regular Expression General Requirements

The definition of which sequence is matched when several are possible is based on the leftmost-longest rule historically used by deterministic recognizers. This rule is easier to define and describe, and arguably more useful, than the first-match rule historically used by non-deterministic recognizers. It is thought that dependencies on the choice of rule are rare; carefully contrived examples are needed to demonstrate the difference.

A formal expression of the leftmost-longest rule is:

The search is performed as if all possible suffixes of the string were tested for a prefix matching the pattern; the longest suffix containing a matching prefix is chosen, and the longest possible matching prefix of the chosen suffix is identified as the matching sequence.

Historically, most RE implementations only match lines, not strings. However, that is more an effect of the usage than of an inherent feature of REs themselves. Consequently, IEEE Std 1003.1-2001 does not regard <newline>s as special; they are ordinary characters, and both a period and a non-matching list can match them. Those utilities (like grep) that do not allow <newline>s to match are responsible for eliminating any <newline> from strings before matching against the RE. The regcomp() function, however, can provide support for such processing without violating the rules of this section.

Some implementations of egrep have had very limited flexibility in handling complex EREs. IEEE Std 1003.1-2001 does not attempt to define the complexity of a BRE or ERE, but does place a lower limit on it-any RE must be handled, as long as it can be expressed in 256 bytes or less. (Of course, this does not place an upper limit on the implementation.) There are historical programs using a non-deterministic-recognizer implementation that should have no difficulty with this limit. It is possible that a good approach would be to attempt to use the faster, but more limited, deterministic recognizer for simple expressions and to fall back on the non-deterministic recognizer for those expressions requiring it. Non-deterministic implementations must be careful to observe the rules on which match is chosen; the longest match, not the first match, starting at a given character is used.

The term "invalid" highlights a difference between this section and some others: IEEE Std 1003.1-2001 frequently avoids mandating of errors for syntax violations because they can be used by implementors to trigger extensions. However, the authors of the internationalization features of REs wanted to mandate errors for certain conditions to identify usage problems or non-portable constructs. These are identified within this rationale as appropriate. The remaining syntax violations have been left implicitly or explicitly undefined. For example, the BRE construct "\{1,2,3\}" does not comply with the grammar. A conforming application cannot rely on it producing an error nor matching the literal characters "\{1,2,3\}".

The term "undefined" was used in favor of "unspecified" because many of the situations are considered errors on some implementations, and the standard developers considered that consistency throughout the section was preferable to mixing undefined and unspecified.

A.9.3 Basic Regular Expressions

There is no additional rationale provided for this section.

BREs Matching a Single Character or Collating Element

There is no additional rationale provided for this section.

BRE Ordinary Characters

There is no additional rationale provided for this section.

BRE Special Characters

There is no additional rationale provided for this section.

Periods in BREs

There is no additional rationale provided for this section.

RE Bracket Expression

Range expressions are, historically, an integral part of REs. However, the requirements of "natural language behavior" and portability do conflict. In the POSIX locale, ranges must be treated according to the collating sequence and include such characters that fall within the range based on that collating sequence, regardless of character values. In other locales, ranges have unspecified behavior.

Some historical implementations allow range expressions where the ending range point of one range is also the starting point of the next (for instance, "[a-m-o]" ). This behavior should not be permitted, but to avoid breaking historical implementations, it is now undefined whether it is a valid expression and how it should be interpreted.

Current practice in awk and lex is to accept escape sequences in bracket expressions as per the Base Definitions volume of IEEE Std 1003.1-2001, Table 5-1, Escape Sequences and Associated Actions, while the normal ERE behavior is to regard such a sequence as consisting of two characters. Allowing the awk/ lex behavior in EREs would change the normal behavior in an unacceptable way; it is expected that awk and lex will decode escape sequences in EREs before passing them to regcomp() or comparable routines. Each utility describes the escape sequences it accepts as an exception to the rules in this section; the list is not the same, for historical reasons.

As noted previously, the new syntax and rules have been added to accommodate other languages than English. The remainder of this section describes the rationale for these modifications.

In the POSIX locale, a regular expression that starts with a range expression matches a set of strings that are contiguously sorted, but this is not necessarily true in other locales. For example, a French locale might have the following behavior:

$ ls
alpha   Alpha   estimé   ESTIMé   été   eurêka
$ ls [a-e]*
alpha   Alpha   estimé  eurêka

Such disagreements between matching and contiguous sorting are unavoidable because POSIX sorting cannot be implemented in terms of a deterministic finite-state automaton (DFA), but range expressions by design are implementable in terms of DFAs.

Historical implementations used native character order to interpret range expressions. The ISO POSIX-2:1993 standard instead required collating element order (CEO): the order that collating elements were specified between the order_start and order_end keywords in the LC_COLLATE category of the current locale. CEO had some advantages in portability over the native character order, but it also had some disadvantages:

CEO could not feasibly be mimicked in user code, leading to inconsistencies between POSIX matchers and matchers in popular user programs like Emacs, ksh, and Perl.
CEO caused range expressions to match accented and capitalized letters contrary to many users' expectations. For example, "[a-e]" typically matched both 'E' and 'á' but neither 'A' nor 'é'.
CEO was not consistent across implementations. In practice, CEO was often less portable than native character order. For example, it was common for the CEOs of two implementation-supplied locales to disagree, even if both locales were named "da_DK".

Because of these problems, some implementations of regular expressions continued to use native character order. Others used the collation sequence, which is more consistent with sorting than either CEO or native order, but which departs further from the traditional POSIX semantics because it generally requires "[a-e]" to match either 'A' or 'E' but not both. As a result of this kind of implementation variation, programmers who wanted to write portable regular expressions could not rely on the ISO POSIX-2:1993 standard guarantees in practice.

While revising the standard, lengthy consideration was given to proposals to attack this problem by adding an API for querying the CEO to allow user-mode matchers, but none of these proposals had implementation experience and none achieved consensus. Leaving the standard alone was also considered, but rejected due to the problems described above.

The current standard leaves unspecified the behavior of a range expression outside the POSIX locale. This makes it clearer that conforming applications should avoid range expressions outside the POSIX locale, and it allows implementations and compatible user-mode matchers to interpret range expressions using native order, CEO, collation sequence, or other, more advanced techniques. The concerns which led to this change were raised in IEEE PASC interpretation 1003.2 #43 and others, and related to ambiguities in the specification of how multi-character collating elements should be handled in range expressions. These ambiguities had led to multiple interpretations of the specification, in conflicting ways, which led to varying implementations. As noted above, efforts were made to resolve the differences, but no solution has been found that would be specific enough to allow for portable software while not invalidating existing implementations.

The standard developers recognize that collating elements are important, such elements being common in several European languages; for example, 'ch' or 'll' in traditional Spanish; 'aa' in several Scandinavian languages. Existing internationalized implementations have processed, and continue to process, these elements in range expressions. Efforts are expected to continue in the future to find a way to define the behavior of these elements precisely and portably.

The ISO POSIX-2:1993 standard required "[b-a]" to be an invalid expression in the POSIX locale, but this requirement has been relaxed in this version of the standard so that "[b-a]" can instead be treated as a valid expression that does not match any string.

BREs Matching Multiple Characters

The limit of nine back-references to subexpressions in the RE is based on the use of a single-digit identifier; increasing this to multiple digits would break historical applications. This does not imply that only nine subexpressions are allowed in REs. The following is a valid BRE with ten subexpressions:

\(\(\(ab\)*c\)*d\)\(ef\)*\(gh\)\{2\}\(ij\)*\(kl\)*\(mn\)*\(op\)*\(qr\)*

The standard developers regarded the common historical behavior, which supported "\n*", but not "\n\{min,max\}", "$...$*", or "$...$\{min,max\}", as a non-intentional result of a specific implementation, and they supported both duplication and interval expressions following subexpressions and back-references.

The changes to the processing of the back-reference expression remove an unspecified or ambiguous behavior in the Shell and Utilities volume of IEEE Std 1003.1-2001, aligning it with the requirements specified for the regcomp() expression, and is the result of PASC Interpretation 1003.2-92 #43 submitted for the ISO POSIX-2:1993 standard.

BRE Precedence

There is no additional rationale provided for this section.

BRE Expression Anchoring

Often, the dollar sign is viewed as matching the ending <newline> in text files. This is not strictly true; the <newline> is typically eliminated from the strings to be matched, and the dollar sign matches the terminating null character.

The ability of '^', '$', and '*' to be non-special in certain circumstances may be confusing to some programmers, but this situation was changed only in a minor way from historical practice to avoid breaking many historical scripts. Some consideration was given to making the use of the anchoring characters undefined if not escaped and not at the beginning or end of strings. This would cause a number of historical BREs, such as "2^10", "$HOME", and "$1.35", that relied on the characters being treated literally, to become invalid.

However, one relatively uncommon case was changed to allow an extension used on some implementations. Historically, the BREs "^foo" and "$^foo$" did not match the same string, despite the general rule that subexpressions and entire BREs match the same strings. To increase consensus, IEEE Std 1003.1-2001 has allowed an extension on some implementations to treat these two cases in the same way by declaring that anchoring may occur at the beginning or end of a subexpression. Therefore, portable BREs that require a literal circumflex at the beginning or a dollar sign at the end of a subexpression must escape them. Note that a BRE such as "a$^bc$" will either match "a^bc" or nothing on different systems under the rules.

ERE anchoring has been different from BRE anchoring in all historical systems. An unescaped anchor character has never matched its literal counterpart outside a bracket expression. Some implementations treated "foo$bar" as a valid expression that never matched anything; others treated it as invalid. IEEE Std 1003.1-2001 mandates the former, valid unmatched behavior.

Some implementations have extended the BRE syntax to add alternation. For example, the subexpression "$foo$\|bar$" would match either "foo" at the end of the string or "bar" anywhere. The extension is triggered by the use of the undefined "\|" sequence. Because the BRE is undefined for portable scripts, the extending system is free to make other assumptions, such that the '$' represents the end-of-line anchor in the middle of a subexpression. If it were not for the extension, the '$' would match a literal dollar sign under the rules.

A.9.4 Extended Regular Expressions

As with BREs, the standard developers decided to make the interpretation of escaped ordinary characters undefined.

The right parenthesis is not listed as an ERE special character because it is only special in the context of a preceding left parenthesis. If found without a preceding left parenthesis, the right parenthesis has no special meaning.

The interval expression, "{m,n}", has been added to EREs. Historically, the interval expression has only been supported in some ERE implementations. The standard developers estimated that the addition of interval expressions to EREs would not decrease consensus and would also make BREs more of a subset of EREs than in many historical implementations.

It was suggested that, in addition to interval expressions, back-references ( '\n' ) should also be added to EREs. This was rejected by the standard developers as likely to decrease consensus.

In historical implementations, multiple duplication symbols are usually interpreted from left to right and treated as additive. As an example, "a+*b" matches zero or more instances of 'a' followed by a 'b'. In IEEE Std 1003.1-2001, multiple duplication symbols are undefined; that is, they cannot be relied upon for conforming applications. One reason for this is to provide some scope for future enhancements.

The precedence of operations differs between EREs and those in lex; in lex, for historical reasons, interval expressions have a lower precedence than concatenation.

EREs Matching a Single Character or Collating Element

There is no additional rationale provided for this section.

ERE Ordinary Characters

There is no additional rationale provided for this section.

ERE Special Characters

There is no additional rationale provided for this section.

Periods in EREs

There is no additional rationale provided for this section.

ERE Bracket Expression

There is no additional rationale provided for this section.

EREs Matching Multiple Characters

There is no additional rationale provided for this section.

ERE Alternation

There is no additional rationale provided for this section.

ERE Precedence

There is no additional rationale provided for this section.

ERE Expression Anchoring

There is no additional rationale provided for this section.

A.9.5 Regular Expression Grammar

The grammars are intended to represent the range of acceptable syntaxes available to conforming applications. There are instances in the text where undefined constructs are described; as explained previously, these allow implementation extensions. There is no intended requirement that an implementation extension must somehow fit into the grammars shown here.

The BRE grammar does not permit L_ANCHOR or R_ANCHOR inside "$" and "$" (which implies that '^' and '$' are ordinary characters). This reflects the semantic limits on the application, as noted in the Base Definitions volume of IEEE Std 1003.1-2001, Section 9.3.8, BRE Expression Anchoring. Implementations are permitted to extend the language to interpret '^' and '$' as anchors in these locations, and as such, conforming applications cannot use unescaped '^' and '$' in positions inside "$" and "$" that might be interpreted as anchors.

The ERE grammar does not permit several constructs that the Base Definitions volume of IEEE Std 1003.1-2001, Section 9.4.2, ERE Ordinary Characters and the Base Definitions volume of IEEE Std 1003.1-2001, Section 9.4.3, ERE Special Characters specify as having undefined results:

ORD_CHAR preceded by '\'
ERE_dupl_symbol(s) appearing first in an ERE, or immediately following '|', '^', or '('
'{' not part of a valid ERE_dupl_symbol
'|' appearing first or last in an ERE, or immediately following '|' or '(', or immediately preceding ')'

Implementations are permitted to extend the language to allow these. Conforming applications cannot use such constructs.

BRE/ERE Grammar Lexical Conventions

There is no additional rationale provided for this section.

RE and Bracket Expression Grammar

The removal of the Back_open_paren Back_close_paren option from the nondupl_RE specification is the result of PASC Interpretation 1003.2-92 #43 submitted for the ISO POSIX-2:1993 standard. Although the grammar required support for null subexpressions, this section does not describe the meaning of, and historical practice did not support, this construct.

ERE Grammar

There is no additional rationale provided for this section.

UNIX ® is a registered Trademark of The Open Group.
POSIX ® is a registered Trademark of The IEEE.
[ Main Index | XBD | XCU | XSH | XRAT ]