The National Archives NDAD
 

Help

Regular expressions

 

Regular expressions are powerful tools which allow you to filter data more effectively than most of the other comparators you can use in queries. Their power does have some drawbacks - they can be difficult to understand and difficult to construct. However, these problems can be overcome by building up complex queries from simple examples. The other drawback of regular expressions is that they are potentially expensive in computing time to evaluate. It's quite possible to construct queries using regular expressions which may take many minutes or even hours to complete. The notes below give some guidance as to how to avoid these extremes. In general, however, you should only use regular expressions when you need them. If it's possible to achieve what you want using a simple comparison operator, it's nearly always better to do so.

Regular expressions are often referred to as REs in the literature which defines them, and this abbreviation will be used in the documentation which follows. They grew out of a branch of computing science which deals with the discovery and characterisation of patterns. An RE defines a pattern of interest to you; the query then shows you which records contain fields which match that pattern. Patterns are built up from a number of very basic types. These can then be combined to form more complex patterns, to an arbitrary degree of complexity.

If you are already familiar with the concept of regular expressions, you'll probably find it most useful to jump straight to the quick reference section, which gives specific details on NDAD's implementation. Otherwise, the examples following are the best place to start.

Examples

The simplest pattern is simply a string of ordinary characters, which stands for itself. Thus, the query

County MATCHES devon

has exactly the same effect as

County = devon

We can combine patterns using | which means that at least one of the component patterns must match. Hence

County MATCHES "(devon)|(cornwall)"
is the same as 
County = devon OR County = cornwall

Note that we had to put the RE in quotes. Almost all REs contain punctuation characters which mean that the quotes are necessary to tell the system where the RE begins and ends. For safety's sake, always put any RE in quote marks like this when using advanced queries. The quotes should not be used, though, in simple queries: the system automatically supplies quotes around all values when you use simple queries.

Parentheses are used to group parts of a pattern together, making one compound pattern. A simple example is:

Network MATCHES "(inter|intra)net"

which matches either of the words internet or intranet. The parentheses simply separate the word net from the preceding part of the pattern. This example also shows that you can combine patterns simply by putting them next to each other. The result is a pattern in which the first part matches the beginning of the field, the next part matches the next part of the field, and so on.
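For readers who want to experiment outside NDAD, these first patterns can be approximated in Python. This is only a sketch: the helper name field_matches is made up for illustration, and Python's re module is not the engine NDAD uses, although the two agree on the simple constructs shown here. re.fullmatch with re.IGNORECASE stands in for the whole-field, case-insensitive matching described later in these notes.

```python
import re

def field_matches(pattern: str, value: str) -> bool:
    """Hypothetical stand-in for: Field MATCHES "pattern".

    Assumes whole-field, case-insensitive matching.
    """
    return re.fullmatch(pattern, value, flags=re.IGNORECASE) is not None

# Alternation: at least one of the component patterns must match.
print(field_matches("(devon)|(cornwall)", "Cornwall"))   # True
print(field_matches("(devon)|(cornwall)", "dorset"))     # False

# Grouping plus concatenation: (inter|intra) followed by net.
print(field_matches("(inter|intra)net", "intranet"))     # True
print(field_matches("(inter|intra)net", "internets"))    # False
```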

Wildcards and repeats

Certain characters have a special meaning in patterns. It is these that give patterns their real power. The simplest is the dot (.), which stands for any single character. Thus,

Region MATCHES "N."

finds any record where the field Region contains exactly two characters, the first of which is the letter n. (It doesn't matter if N is upper or lowercase - they are treated as equivalent. In the rest of these examples, we'll assume this case-insensitivity is taken for granted.)

Another useful character is the asterisk, *. This tells the system to match an arbitrary number of occurrences of the pattern which precedes it. If it follows a single character, it matches any number of occurrences of that single character. "Any number" includes 0. The following two examples may make this clearer:

Region MATCHES "N*" 
Region MATCHES "N.*"

The first example matches any field which is composed entirely of the letter N or which is empty (since this is 0 occurrences of the letter N). It won't match any field which contains any characters other than n or N. The second example finds any field whose first letter is N or n. The first N stands for itself, and the .* stands for any number of occurrences of any character - hence, the whole pattern means "N followed by anything, including N followed by nothing."
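The difference between the dot and the asterisk is easy to check by running the three Region patterns against a handful of values. The sketch below uses Python's re module as a stand-in for NDAD's matcher (an assumption; the helper field_matches and the sample values are invented for illustration).

```python
import re

def field_matches(pattern, value):
    # Assumption: emulate whole-field, case-insensitive matching.
    return re.fullmatch(pattern, value, re.IGNORECASE) is not None

samples = ["", "N", "nn", "NE", "North"]
for pattern in ["N.", "N*", "N.*"]:
    matched = [s for s in samples if field_matches(pattern, s)]
    print(pattern, matched)
# N.  ['nn', 'NE']
# N*  ['', 'N', 'nn']
# N.* ['N', 'nn', 'NE', 'North']
```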

Two slight variants of this use the plus sign + and the question mark ?. A + following any pattern matches 1 or more occurrences of that pattern; a ? matches 0 or 1 occurrences. These examples demonstrate all of these in combination:

Region MATCHES "N(E|W)?" 
Region MATCHES "N+" 
Region MATCHES "N*E+"

The first pattern matches N, NE, or NW. That's because the N stands for itself; it's followed by a pattern in parentheses which matches either E or W. Because this pattern is followed by a ?, the pattern is only allowed to occur 0 or 1 times. The second pattern matches anything which consists only of the letter N. Unlike our earlier example with N*, there must be at least one N in the field - an empty field won't match. The final example matches any field which consists of any number of Ns (including none) followed by at least one E. Thus it matches "EEEE" and "Nnee" but not "NeNe" nor "NNN".
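The same three patterns can be tried against a list of sample values. As before, this is a sketch that assumes Python's re module behaves like NDAD's matcher for these simple constructs; field_matches and the samples are hypothetical.

```python
import re

def field_matches(pattern, value):
    # Assumption: whole-field, case-insensitive matching.
    return re.fullmatch(pattern, value, re.IGNORECASE) is not None

samples = ["", "N", "NE", "NW", "NEW", "NNN", "EEEE", "Nnee", "NeNe"]
for pattern in ["N(E|W)?", "N+", "N*E+"]:
    print(pattern, [s for s in samples if field_matches(pattern, s)])
```

Note that NE satisfies both the first and the third pattern, while the empty string and NeNe satisfy none of them.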

Special characters

These tools already allow you to construct quite complex patterns, but there's more available. Before we consider these more powerful tools, we need to take a brief digression to look at special characters. Certain characters, such as the * which we have already seen, have a special meaning. If you need to match a string which contains these characters, you need to quote them using a backslash. For example,

Region MATCHES "N\*"

matches a field which consists of two characters, an N followed by a *. The backslash character preceding the asterisk removes its special meaning. The full set of characters which require this special treatment is: ^ . [ $ ( ) | * + ? { \

If you cannot remember which are which, it's always safe to quote any character by preceding it with backslash in order to ensure that that character stands for itself rather than having a special meaning. Thus \} can be used to represent a closing brace even though the closing brace isn't special and doesn't actually need quoting.
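If the text to be matched comes from elsewhere (a spreadsheet cell, say), the backslashes can be added mechanically. The helper below is a hypothetical sketch, not part of NDAD: it quotes exactly the special characters listed above and then uses Python's re module, as an assumed stand-in for NDAD's matcher, to confirm that the result matches the literal text.

```python
import re

# The special characters listed in the notes above.
SPECIALS = set("^.[$()|*+?{\\")

def escape_literal(text: str) -> str:
    """Hypothetical helper: put a backslash before every special character."""
    return "".join("\\" + ch if ch in SPECIALS else ch for ch in text)

pattern = escape_literal("N*")
print(pattern)                                    # N\*
print(re.fullmatch(pattern, "N*") is not None)    # True
print(re.fullmatch(pattern, "NNN") is not None)   # False
```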

The other character which requires special treatment is the double quote character ". To use this within a regular expression, it must be doubled on every occurrence and the entire expression must itself be enclosed in quotes. For example,

Play MATCHES "(Shakespeare's|Bacon's) ""Hamlet"""

looks for the field values Shakespeare's "Hamlet" or Bacon's "Hamlet".
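The doubling rule is mechanical, so it can be applied by a small function when queries are assembled by a script. This is a minimal sketch; quote_re is an invented name and nothing here is part of NDAD itself.

```python
def quote_re(pattern: str) -> str:
    """Hypothetical helper: double every " in the RE, then wrap it in quotes."""
    return '"' + pattern.replace('"', '""') + '"'

raw = "(Shakespeare's|Bacon's) \"Hamlet\""
print("Play MATCHES " + quote_re(raw))
# Play MATCHES "(Shakespeare's|Bacon's) ""Hamlet"""
```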

Bracket expressions

That's all we need to know about special characters at present. The next useful concept is the range, also known as a bracket expression. Ranges are a shorthand for a set of related characters. They're denoted by sets of characters inside square brackets, as follows:

[aeiou] A simple list of characters such as this matches any one of the characters in the list. So, this expression matches any one vowel.
[a-z] Two characters separated by - match any character in that range. This example matches any alphabetic character.
[^aeiou] A list preceded by a caret ^ matches anything except what the list would match. This example matches any single character except a vowel.
[^a-z] You can combine negation and ranges. This example matches any single non-alphabetic character.

We can now construct some examples mixing ranges and the other concepts we've used so far:

Code MATCHES "([a-z][0-9]+)|([0-9][a-z]+)"

matches field values which consist either of one letter followed by a string of digits, or one digit followed by a string of letters.
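A quick way to see which values this Code pattern accepts is to run it over a few samples. The sketch assumes Python's re module as a stand-in for NDAD's matcher; the sample values are invented.

```python
import re

# Case-insensitive, whole-field matching is assumed, as elsewhere in these notes.
CODE = re.compile(r"([a-z][0-9]+)|([0-9][a-z]+)", re.IGNORECASE)

for value in ["A123", "7abc", "AB12", "12", "x"]:
    print(value, CODE.fullmatch(value) is not None)
# A123 True
# 7abc True
# AB12 False
# 12 False
# x False
```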

In fact, there are some slightly more mnemonic ways to specify that we are interested in matching letters, digits, or other collections of similar characters. Ranges also allow us to specify things called character classes, which are simply mnemonic names for characters which share a common property, such as letters or digits. The character class names and meanings are:

Character Class Name Character Class Definition
alpha All alphabetic characters
alnum All alphanumeric characters (i.e. letters and numbers)
blank A space or tab character
cntrl Any non-printing character
digit A decimal digit (that is, any of 0-9)
graph A character which can be printed, excluding space
print A character which can be printed, including space.
punct Any punctuation character; more formally, anything that can be printed but is not alphabetic, numeric or space.
space Any spacing character - space, tab, linefeed, formfeed, vertical tab
xdigit Any hexadecimal digit (the digits 0-9 plus the letters a-f)

To use a character class within a bracket expression, write [:classname:] where any particular character would appear inside the bracket expression. That means that, to use a character class on its own, the brackets appear twice. So, the bracket expression [[:alnum:]] matches any single alphanumeric character. The bracket expression [@[:digit:]] matches an @ sign or a digit. Using character classes we could write our previous example as:

Code MATCHES "([[:alpha:]][[:digit:]]+)|([[:digit:]][[:alpha:]]+)"

Whether you find this easier than the previous form is a matter of taste. However, the following example is definitely easier to type and to read using character classes:

Name MATCHES "[^[:punct:]]+"

This matches all values of the field Name which are not empty and contain no punctuation. We can improve our filtering of names. If we have a field called Surname and we want to find all cases where the surname is well-formed, we can use patterns to do it given some assumptions about surnames. Let's assume that surnames are basically alphabetic, but may contain spaces or hyphens, to allow for some forms of double-barrelled name and some ways of writing "Mac". The space or hyphen must occur in the middle of the name (i.e. must be preceded and followed by letters) and to simplify matters we will allow only one of them. We also want to allow for apostrophes in the second position for names such as O'Leary and D'eath. This pattern will match such names:

Surname MATCHES "[[:alpha:]]'?[[:alpha:]]+[- ]?[[:alpha:]]+"

Let's decompose that piece-by-piece to reassure ourselves that it does what we want. The initial [[:alpha:]] says that the first character must be alphabetic. It's followed in the pattern by '? which matches 0 or 1 occurrences of the apostrophe character, '. That's followed by [[:alpha:]]+ which matches one or more alphabetic characters. The next bracket expression contains two characters, hyphen and space, and matches one of them. (It also illustrates an important point about special characters in bracket expressions, explained further below.) Because it's followed by a question mark, the effect is to match 0 or 1 occurrences: [- ]? matches one space or one hyphen or nothing at all. Finally, another occurrence of [[:alpha:]]+ ensures we have another string of at least one letter following the possible hyphen or space in the name. Altogether, then, we have one letter, a possible apostrophe, some more letters, a possible hyphen or space, and some more letters.
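The surname and name patterns can be tested outside NDAD, with one caveat: Python's re module does not understand POSIX class names such as [:alpha:]. The sketch below therefore expands them into plain ASCII character lists first. The table POSIX_CLASSES and the function to_python are assumptions made for this illustration, cover only a few classes, and ignore accented letters.

```python
import re
import string

# ASCII-only stand-ins for some POSIX class names (an assumption of this sketch).
POSIX_CLASSES = {
    "alpha": "a-zA-Z",
    "digit": "0-9",
    "alnum": "a-zA-Z0-9",
    "xdigit": "0-9a-fA-F",
    "punct": re.escape(string.punctuation),
}

def to_python(pattern: str) -> str:
    """Hypothetical helper: expand [:name:] where it appears in a bracket expression."""
    for name, chars in POSIX_CLASSES.items():
        pattern = pattern.replace(f"[:{name}:]", chars)
    return pattern

SURNAME = re.compile(to_python("[[:alpha:]]'?[[:alpha:]]+[- ]?[[:alpha:]]+"))
NAME = re.compile(to_python("[^[:punct:]]+"))

for value in ["O'Leary", "Lloyd-George", "Mac Donald", "Smith-", "O''Leary", "Ng"]:
    print(value, SURNAME.fullmatch(value) is not None)
```

Running this shows one consequence of the pattern that is easy to overlook: because it contains three mandatory letter components, it requires at least three letters, so a two-letter surname such as Ng is rejected.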

The rules for special characters in bracket expressions are, unfortunately, different from the usual ones. Most special characters lose their significance. In particular, the backslash no longer has the effect of quoting the character it precedes. Within a bracket expression, ], ^ and - all need to be treated specially. To include a literal ] in a bracket expression, it must be the first character in the expression (although it can be preceded by ^.) Thus, the bracket expression [])}] matches a closing bracket, parenthesis or brace. To include a literal ^, ensure that it does not appear first in the bracket expression. To include a literal -, ensure that it appears first or last in the expression.
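The positional rules can be demonstrated with Python's re module, which happens to follow the same conventions for a leading ], a non-leading ^ and a trailing -. This is a sketch under that assumption. One known difference is that Python still treats backslash as a quoting character inside brackets, so the examples avoid it.

```python
import re

closers = re.compile(r"[])}]")   # ] comes first, so it is a literal
signs = re.compile(r"[+^-]")     # ^ is not first and - is last, so both are literals

print([c for c in "()[]{}" if closers.fullmatch(c)])   # [')', ']', '}']
print([c for c in "+-^*" if signs.fullmatch(c)])       # ['+', '-', '^']
```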

Bounds

The final tool in our kit for building patterns is the bound. Bounds allow us to specify that a particular component of a pattern appears a specified number of times. Bounds can appear wherever *, + or ? can appear to specify a given number of repeats of the immediately preceding pattern.

Bounds take three forms, all of which have a similar syntax. In the table below, low and high represent numbers between 0 and 255.

Syntax Meaning
{low} Matches exactly low occurrences of the preceding pattern
{low,} Matches at least low occurrences of the preceding pattern
{low,high} Matches at least low but no more than high occurrences of the preceding pattern

So, N{5,9} matches 5, 6, 7, 8 or 9 letter Ns. [[:alpha:]]{7,} matches any string of alphabetic characters containing at least 7 letters. ((yes)|(no)){1,4} matches Yes, No, YesYesYes, noyesno or any other sequence of from one to four words each of which is yes or no.
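These bound examples can be checked with a short script. It is a sketch that assumes Python's re module as a stand-in for NDAD's matcher; because Python lacks POSIX class names, [a-z] is used in place of [[:alpha:]], and field_matches is a hypothetical helper.

```python
import re

def field_matches(pattern, value):
    # Assumption: whole-field, case-insensitive matching.
    return re.fullmatch(pattern, value, re.IGNORECASE) is not None

print(field_matches("N{5,9}", "NNNNN"))                       # True  (five Ns)
print(field_matches("N{5,9}", "NNNN"))                        # False (only four)
print(field_matches("[a-z]{7,}", "archive"))                  # True  (seven letters)
print(field_matches("((yes)|(no)){1,4}", "noyesno"))          # True  (three words)
print(field_matches("((yes)|(no)){1,4}", "yesyesyesyesyes"))  # False (five words)
```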

Bounds can be useful but are also potentially slow to execute, especially if they are nested. A pattern such as (a{3,5}c{2,6}){3,7} (which looks for between 3 and 7 repeats of a sequence of 3 to 5 occurrences of a followed by 2 to 6 occurrences of c) will take noticeably longer than other patterns of similar length that don't use nested bounds. Very complex patterns using deeply-nested bounds will be refused by the system or will fail to execute at all.

Other notes

There are two special bracket expressions which can help when you want to match things in terms of words. [[:<:]] and [[:>:]] match the beginning and ending of a word respectively. A word, in this context, is a sequence of alphanumeric characters or underscores. Thus, the pattern ".*[[:<:]]green[[:>:]].*" matches any string which contains the word green but not other words of which green forms a part, such as evergreen or wintergreen.
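Python's re module has no [[:<:]] or [[:>:]], but its \b (word boundary) gives a close approximation, since it uses the same idea of a word as a run of alphanumerics or underscores. The sketch below is therefore an analogue of the green pattern and not the NDAD syntax itself; the sample values are invented.

```python
import re

# \b stands in for both [[:<:]] and [[:>:]] (an approximation).
GREEN = re.compile(r".*\bgreen\b.*", re.IGNORECASE)

for value in ["Green", "a green door", "evergreen forest", "wintergreen", "green_belt"]:
    print(value, GREEN.fullmatch(value) is not None)
# Green True
# a green door True
# evergreen forest False
# wintergreen False
# green_belt False
```

The last value fails because the underscore counts as part of the word, so green_belt is a single word.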

Some patterns may appear to be ambiguous in that their components may be matched against a string in more than one way. In general, each component is matched against the largest substring possible subject to the constraint that the whole pattern takes precedence over its component parts, each of which takes precedence over its parts, and so on. Thus, the pattern "[[:alpha:]]*able" matches any word ending in able even though the first part of the pattern - which specifies any sequence of letters of arbitrary length - could also match the whole word itself, leaving nothing for the trailing part of the pattern to match.
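Capturing groups make it possible to see how a string is divided between the two parts of such a pattern. The sketch uses Python's re module, which resolves ambiguity by backtracking and not by the POSIX rules described above (an assumption worth keeping in mind), although for this pattern the outcome is the same. [a-z] replaces [[:alpha:]] because Python lacks POSIX class names.

```python
import re

# Group 1 is the leading run of letters, group 2 the trailing "able".
match = re.fullmatch(r"([a-z]*)(able)", "Comfortable", re.IGNORECASE)
print(match.groups())   # ('Comfort', 'able')
```

The first component gives back the final four letters so that the pattern as a whole can succeed.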

Quick reference to RE components

NDAD's RE software is based on Henry Spencer's regex package, which is © 1992, 1993, 1994 Henry Spencer. This library implements the regular expression syntax defined by POSIX 1003.2 (section 2.8) and described by it as 'extended regular expressions'. All pattern matches are performed with case-insensitivity enabled (i.e. it's never possible to do a match based only on letter case.) All patterns that you enter are matched against the entire field contents; the effect is as if the pattern was enclosed in parentheses, preceded by a ^ anchor (start of field) and terminated with a $ anchor (end of field). To create a pattern which matches an arbitrary part of a field, precede and follow the pattern with .*
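The implicit whole-field matching can be spelled out in Python, where ^ marks the start of the string and $ the end. The function ndad_style is a hypothetical name for this sketch; it also shows why the enclosing parentheses matter when the pattern contains alternatives.

```python
import re

def ndad_style(pattern: str) -> str:
    """Hypothetical helper: make the implicit anchors and parentheses explicit."""
    return "^(" + pattern + ")$"

def search(pattern: str, value: str) -> bool:
    return re.search(pattern, value, re.IGNORECASE) is not None

print(search(ndad_style("devon"), "North Devon"))            # False: whole field must match
print(search(ndad_style(".*devon.*"), "North Devon coast"))  # True: .* allows surrounding text

# Without the parentheses, ^ would bind only to the first alternative.
print(search("^devon|cornwall$", "devonshire"))              # True
print(search(ndad_style("devon|cornwall"), "devonshire"))    # False
```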

Element or device Syntax and notes
Grouping (RE) stands for the RE within the parentheses
Alternatives RE1|RE2 matches either RE1 or RE2
Bracket expressions [abc] matches any of a,b,c
[a-d] matches any character between a and d
[^abc] matches anything except a,b or c
The above may be combined: [^abc0-9] matches anything except a,b,c or a digit
Within bracket expressions, [:class:] stands for any character in its class. Classes are alnum, alpha, blank, cntrl, digit, graph, lower, print, punct, space, upper, xdigit. Lower and upper have no meaning in NDAD queries due to case equivalence.
Within bracket expressions, [.xy.] stands for the collating sequence represented by its contents. No collating sequences other than single-character sequences are defined in the locale used by NDAD. This is primarily of use to quote characters that are otherwise special in bracket expressions, such as [.-.]
Within bracket expressions, [=x=] stands for the set of characters in the equivalence class of x.
Most special characters lose meaning in bracket expressions, including \. Literal ] must appear first (after a possible ^) or use a collating sequence. Literal - must appear first, last, as the final endpoint of a range, or as a collating sequence. Literal ^ must not appear first.
Repeats * after any pattern matches 0 or more instances of the pattern
+ after any pattern matches 1 or more instances of the pattern
? after any pattern matches 0 or 1 instances of the pattern
Bounds Where low and high are numbers between 0 and 255 inclusive:
{low} after any pattern matches 'low' instances of the pattern
{low,} after any pattern matches 'low' or more instances of the pattern
{low,high} after any pattern matches 'low' to 'high' (inclusive) instances of the pattern
Special characters . stands for any character.
Any other character stands for itself except for the special characters, $[.^()|*+?{\ when not in a bracket expression. To make a special character stand for itself, precede it with \
Special bracket expressions [[:<:]] and [[:>:]] match the null string at the beginning and end of a word respectively.

Formal syntactic definition

Pattern element Meaning or components
Pattern One or more non-empty branches separated by |. The pattern matches anything which is matched by at least one of the branches.
Branch One or more pieces, concatenated. A branch matches if each of its pieces successively match successive portions of the field.
Piece A piece is an atom, possibly followed by either *, + or ? or a bound
* When immediately following an atom, causes the piece to match 0 or more matches of the atom. 
+ When immediately following an atom, causes the piece to match 1 or more matches of the atom.
? When immediately following an atom, causes the piece to match 0 or 1 matches of the atom.
 
 

NDAD v3.0

 
 