Augmented Backus-Naur Form (ABNF) is a popular formal syntax for describing Internet specifications. It is officially described in the Internet standards publications of the IETF, RFC 5234 and RFC 7405. A detailed explanation of ABNF is given here. However, if there any discrepancies between what is given here and the RFCs, the RFCs naturally prevail.
Additionally, in the development APG(an ABNF Parser Generator), its parsers and pattern-matching engines, it has been found convenient to enhance ABNF's features. Since the enhancement features add to ABNF without altering any of its existing features, it is referred to here as a Superset of ABNF or SABNF.
ABNF is a syntax to describe phrases, a phrase being any string of positive integers. Because the integers so often represent the ASCII character codes there are special ABNF features to accommodate an easy description of ASCII strings. Consequently, phrase integers are often referred to here as character codes. However, the meaning and range of the ABNF phrase integers are entirely up to the user. The complete ABNF syntax description of a phrase is called a grammar and the terms "grammar" and "ABNF syntax" will be used synonymously here.
Phrases are described with named rules. A rule name is alphanumeric with hyphens allowed after the first character. Rule names are case insensitive. A rule definition has the form:
where the equal sign, =
, separates the name from the phrase definition. Elements are made up of terminals, operators and other rule names, as described below. Each rule must end with a carriage return, line feed combination, CRLF. Each line must begin in the first column (see restrictions below). A rule definition may be continued with continuation lines, each of which begins with a space or tab.
Rules resolve into a string of terminal phrase integers. ABNF provides several means of representing terminal integers and strings of integers or character codes explicitly.
single characters
strings of characters
range of characters
literal strings of characters
Tab characters, 0x09
, are not allowed in literal strings.
prose values
When all else fails, ABNF provides a means for the grammar's author to simply provide a prose explanation of the phrase in the form of a spoken, as opposed to formal, language. The notation is informative and the parser generator will recognize it as valid ABNF. However, since there are no formal specifics, the generator will halt without generating a parser.
Tab characters, 0x09
, are not allowed in prose values.
concatenation
One or more spaces and/or tabs between elements in a rule definition represents a concatenation of the two elements. For example, consider the two rules,
The space between the two elements "a"
and "b"
acts as a concatenation operator. The effect in this case is that rule AB1
defines the same phrase as rule AB2
.
alternatives
The forward slash, /
, is the alternative operator. The rule
would match either the phrase a
or the phrase b
.
incremental alternatives
While not a new operation, incremental alternatives are a sometimes-convenient means of adding alternatives to a rule.
Rules alt1
, alt2
and alt3
have identical definitions. The incremental alternative, =/
, allows for adding additional alternatives to a rule at a later date. As seen in alt2
, the same affect can be achieved with line continuations. However, in some cases, it may be convenient or even essential to add additional alternatives later in the grammar. For example, if the grammar is broken into two or more files. In such a case, line continuations would not be possible and the incremental alternative becomes an essential syntactic addition.
Note that the rule name of an incremental alternative statement must have been previously defined in a rule definition or an error will occur.
repetitions
An element modifier of the general form n*m (0 <= n <= m)
can be used to indicate a repetition of the element a minimum of n
times and a maximum of m
times. For example, the grammar
would define a phrase that could be any number with 2 or 3 ASCII digits. There are a number of shorthand variations of the repetition operator.
Elements may be grouped with enclosing parentheses. Grouped elements are then treated as a single element within the full context of the defining rule. Consider,
phrase1
matches elem foo blat
or elem bar blat
, whereas phrase2
matches elem foo
or bar blat
. A word of caution here. Concatenation has presidence over (tighter binding than) alternation so that phrase2
is the same as phrase3
and not phrase1
. It can be confusing. Use parentheses liberally to keep the grammar meaning clear.
Another useful way to think of groups is as anonymous rules. That is, given
phrase1 and phrase2 are identical. Only phrase2 utilizes the explicit rule anon
for the parenthesized grouping. In phrase1, the parenthesized grouping anonymously defines the same rule as anon
.
Elements grouped with square brackets, []
, are optional groups. Consider,
Both phrases are identical and will match either elem foo bar blat
or bar blat
.
Comments begin with a semicolon, ;
, and continue to the end of the current line. For example, in the following rule definition, everything from the semicolon to CRLF is considered white space.
In this implementation empty lines and comment-only lines are accepted as white space, but any line beginning with one or more space/tab characters and having text not beginning with a semicolon will be rejected as an ABNF syntax error. Consider the lines,
Lines 1:
through 4:
are valid blank lines. Line 5:
would be regarded as a syntax error.
Here is an example of a complete ABNF grammar representing the general definition of a floating point number.
This APG implementation imposes a several restrictions and changes to the strict ABNF described above. These are minor changes except for the disambiguation rules.
Indentations
RFC 5234 specifies that a rule may begin in any column, so long as all rules begin in the same column. This implementation restricts the rules to the first column.
Line Endings
RFC 5234 specifies that a line ending must be the carriage return/line feed pair, CRLF. This implementation relaxes that and accepts CRLF, LF or CR as a valid line ending. However, the last line must have a line ending or a fatal error is generated. (Forgetting a line ending on the last line is a common and annoying error, but keeping the line ending requirement has been a conscious design decision.)
Case-Sensitive Strings
This implementation allows case-sensitive strings to be defined with single quotes.
All three of the above phrases defined the identical, case-sensitive string abc
. The single-quote notation for this was introduced in 2011 prior to publication of RFC 7405. The SABNF single-quote notation is kept for backward compatibility.
Empty Strings
Some rules may accept empty strings. That is, they match a string with 0 characters. To represent an empty string explicitly, two possibilities exist.
In this implementation only the literal string is allowed. Zero repetitions will halt the parser generator with a grammar error.
Disambiguation
The ALT operation allows the parser to follow multiple pathways through the parse tree. It can be and often is the case that more than one of these pathways will lead to a successful phrase match. The question of what to do with multiple matches was answered early in the development of APG with the simple rule of always trying the alternatives left to right as they appear in the grammar and then simply accepting the first to succeed, ignoring any remaining alternatives, if any. This "first success" disambiguation rule may break the context-free aspect of ABNF, but it not only solves the problem of what to do with multiple matches, at least on a personally subjective level, it actually makes the grammars easier to write. That is, easier to arrange the alternatives to achieve the desired phrase definitions.
Related to disambiguation is the question of how many repetitions to accept. Consider the grammar
A strictly context-free parser should accept any string an, n>0. But in general this requires some trial and error with back tracking. Instead, repetitions in APG always accept the longest match possible. That would mean that APG would fail to match the example above. However, a quick look shows that a simple rewrite would fix the problem.
Longest-match repetitions rarely lead to a serious problem. Again, knowing in advance exactly how the parser will handle repetitions allows for easy writing of a correct grammar.
In addition to the seven original node operations defined by ABNF, APG recognizes an addition eight operations. Since these do not alter the original seven operations in any way, taken together they constitute a super set of the original set. Hence the designation SABNF(Superset Augmented Backus-Naur Form).
The user-defined terminals and look ahead operations have been carried over from previous versions of APG. Look behind, anchors and back references have been developed to replicate the phrase-matching power of various flavors of regex
. However, the recursive mode of back referencing is, to my knowledge, a new APG development with no previous counterpart in other parsers or phrase-matching engines.
In addition to the ABNF terminals above, APG allows for User-Defined Terminals (UDT). These allow the user to write any phrase he or she chooses as a code snippet. The syntax is,
UDTs begin with u_
or e_
. The underscore is not used in the ABNF syntax, so the syntax parser can easily distinguish between UDT names and rule names. The difference between the two forms is that a UDT beginning with u_
may not return an empty phrase. If it does the parser will raise an exception. Only if the UDT name begins with e_
is an empty phrase return accepted. The difference has to do with fatal rule attributes such as left recursion and will not be discussed here further.
Note that even though UDTs are terminal phrases, they are also named phrases and share many named-phrase qualities with rules.
The look ahead operators are modifiers like repetitions. They are left of and adjacent to the phrase that they modify.
phrase1
uses the positive look ahead operator. If number
begins with a "+"
then &"+"
returns the empty phrase and parsing continues. Otherwise, &"+"
returns failure and phrase1
fails to find a match. That is, phrase1
accepts only numbers that begin with +
, e.g.+123
.
phrase2
uses the negative look ahead operator. It works just as described above except that it succeeds if "+"
is not found and fails if it is. That is, phrase2
accepts only numbers that begin with no sign or with a negative sign. e.g. -123
or 123
A good discussion of the origin of these operators can be found in this Wikipedia article.
The look behind operators are modifiers very similar to the look ahead operators, the difference, as the name implies, is that they operate on phrases behind the current string index instead of ahead of it.
phrase1
will succeed only if text
is preceded by a line-end
. phrase2
will succeed only if text
is not preceded by a line-end
.
Look behind was introduced specifically for the APG phrase-matching engine. It may have limited use outside of this application.
Back references are terminal strings similar to terminal literal and binary strings. The difference being that terminal literal and strings are predefined in the grammar syntax and back reference strings are defined with a previous rule name or UDT match.
The back reference, \A
will attempt a case-insensitive match to whatever phrase was matched by A. (The notation works equally for rule names and UDT names.) Therefore, phrase1
would match abcabc
or abcABC
, etc., but not abcxyz
. The i
and s
notation is used to indicate case-insensitive and case-sensitive matches, just as specified in RFC 7405 for literal strings. Therefore, phrase3
would match xYzxYz
but not xYzxyz
.
These back reference operations were introduced specifically to match the parsing power of various flavors of regex
pattern-matching engines. However, it was soon recognized that another mode of back referencing was possible. The particular problem to solve was, how to use back referencing to match tag names in the nested opening and closing tags of HTML and XML. This led to the development of a new type of back referencing, which to my knowledge, is unique to APG.
I'll refer to the original definition of back referencing above as "universal mode". The name "universal" being chosen to indicate that the back reference \%uA
matches the last previously occurring phrase A
universally. That is, regardless of where in the input source string or parse tree it occurs.
I'll refer to the new type of back referencing as "recursive mode". The name "recursive" being chosen to indicate that \%rA
matches the last occurrence of A
on a recursive sub-tree of the parse tree with the same recursive parent node. A more detailed explanation with diagrams is given elsewhere.
Case insensitive and universal mode are the defaults unless otherwise specified. The complete set of back references with modifiers is:
Again, to replicate the pattern matching power of regex
, SABNF includes two specific anchors, the beginning and ending of a string.
Anchors match a location, not a phrase. %^
returns an empty string match if the input string character index is zero and fails otherwise. Likewise, %$
returns an empty string match if the input string character index equals the string length and fails otherwise. The leading %
is taken from the RFC 7405 syntax for modifying literal strings, and the ^
and $
characters have been chosen to be similar to their familiar regex
counterparts.
In the examples above, phrase1
will match text
only if it starts at the beginning of the string. phrase2
will match text
only if it ends at the end of a string. phrase3
will match abc
only if it is the entire string. This may seem self evident in this context, but APG parsers allow parsing of sub-strings of the full input string. Therefore, when parsing sub-strings it may not always be known programmatically whether a phrase is at the beginning or end of a string.
operator | notation | form | description |
---|---|---|---|
TLS | "string" | ABNF | terminal literal string |
TBS | d65.66.67 | ABNF | terminal binary string |
TRG | d48-57 | ABNF | terminal range |
UDT | u_name or e_name | SABNF | User-Defined Terminal |
BKR | \name or \u_name | SABNF | back reference |
ABG | %$ | SABNF | begin of string anchor |
AEN | %^ | SABNF | end of string anchor |
operator | notation | form | description |
---|---|---|---|
ALT | / | ABNF | alternation |
CAT | white space | ABNF | concatenation |
REP | n*m | ABNF | repetition |
RNM | name | ABNF | rule name |
AND | & | SABNF | positive look ahead |
NOT | ! | SABNF | negative look ahead |
BKA | && | SABNF | positive look behind |
BKN | !! | SABNF | negative look behind |
RFC 5234 defines the ABNF syntax for the ABNF syntax. While this may seem paradoxical, it makes sense when you realize that a parser generator is a parser whose semantic phase generates a parser. In the case of APG, both the parser of the generator and the parsers it generates are defined with an SABNF syntax. Here is the ABNF syntax (no superset features required) used by APG to generate SABNF parsers.