Regular Expression Syntax

Regular expressions are a powerful (and sometimes complicated) tool, and a complete treatment of their capabilities and applications goes well beyond this primer. Instead, this primer covers some standard expressions that are typically used and the rules needed to read a regular expression, including:

  • Characters
  • Repetition modifiers
  • Groups and ranges
  • Escaping special characters
  • Anchors

Characters

The following parts of a regular expression let you match individual characters and character classes.

Matches

To specify that a certain character or character set can be matched, use the following options.

  • Literals: Any character matches itself.

Example: 'a' matches the character 'a'.

  • Dot / Period (.): Matches any single character (in most engines, any character except a line break).

Example: a..d

Matches: 'abcd', 'a12d', and 'aaad'

Does Not Match: 'abbb', 'abcde', or 'ba1d'

  • Character Sets: Uses brackets ( [ ] ) to match any one of a set of characters.

Example: [ae]

Matches: 'a' and 'e'

Does Not Match: 'b', 'c', or 'd'

Example: [12345]

Matches: '1', '2', '3', '4', and '5'

Does Not Match: '0', '6', '7', '8', or '9'

  • Ranges: Use a dash ( - ) to indicate ranges within a Character Set.

Example: [a-e]

Matches: 'a', 'b', 'c', 'd', and 'e'

Does Not Match: '1' or 'f'

Example: [1-5]

Matches: '1', '2', '3', '4', and '5'

Does Not Match: '0', '6', '7', '8', or '9'

  • Combinations

Example: [a-z][a-z0-9][a-z0-9]

Matches any lowercase letter followed by two characters that are each a lowercase letter or a digit

Matches: 'a11', 'bt9', and 'xyz'

Does Not Match: '1ab' or 'abc1'
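
The Matches and Does Not Match lists above treat the whole value as the thing being matched. A minimal sketch of that check follows, assuming Python's built-in re module (this primer does not name a specific engine) and a hypothetical helper named matching_values:

```python
import re

# Each pattern is paired with sample values taken from the examples above.
examples = {
    r"a..d": ["abcd", "a12d", "aaad", "abbb", "abcde", "ba1d"],
    r"[ae]": ["a", "e", "b"],
    r"[a-e]": ["c", "e", "1", "f"],
    r"[a-z][a-z0-9][a-z0-9]": ["a11", "bt9", "xyz", "1ab", "abc1"],
}


def matching_values(pattern, values):
    """Return the values that the pattern matches from start to end."""
    return [v for v in values if re.fullmatch(pattern, v)]


results = {p: matching_values(p, vals) for p, vals in examples.items()}
for pattern, matched in results.items():
    print(pattern, "->", matched)
```

re.fullmatch is used here because the lists above judge the entire value; a search-style call such as re.search would also accept 'abcde' for a..d, since 'abcd' appears inside it.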

Classes

To specify that the text must be of a certain class, use the following options.

  • Digit: Any number character 0-9.

\d matches a single digit character.

\D matches a single character that is not a digit.

Example: '\d' matches the number '5' in the value '5 = V'.

Example: '\D' matches the equal sign '=' in the value '5 = V' (as well as each space and the letter 'V').

  • Word: Any word character: a-z, A-Z, 0-9, or the underscore ( _ ).

\w matches a single word character.

\W matches a single character that is not a word character.

Example: '\w' matches the number '5' in the value '5 = V' (as well as the letter 'V').

Example: '\W' matches the equal sign '=' in the value '5 = V' (as well as each space).

  • White space: Any whitespace character, such as a space, tab, or line break.

\s matches a single whitespace character.

\S matches a single character that is not whitespace.

Example: '\s' matches the space between the two words in the value 'this is'.

Example: '\S' matches the character 't' in the value ' this is'.
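
To see every character a class picks out of a value, rather than only one, the examples above can be run as a short sketch. This assumes Python's built-in re module, where re.findall returns all matches and re.search returns the first:

```python
import re

value = "5 = V"

# findall lists every single-character match, in order of appearance.
digits = re.findall(r"\d", value)        # digit characters
non_digits = re.findall(r"\D", value)    # everything that is not a digit
word_chars = re.findall(r"\w", value)    # letters, digits, underscore
non_word = re.findall(r"\W", value)      # everything that is not a word character

# Whitespace examples from the text above.
spaces = re.findall(r"\s", "this is")
first_non_space = re.search(r"\S", " this is").group()

print(digits, non_digits, word_chars, non_word, spaces, first_non_space)
```

Note that each uppercase class is simply the complement of its lowercase counterpart, so \D and \W also match the spaces in '5 = V'.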

Repetition Modifiers

To specify that a certain character or character set can be matched more than once, you can add repetition modifiers after the character or character set. Some common repetition modifiers are described below.

  • +: One (1) or more; means 'there must be at least one of these'

Example: [a-z]+[0-9]

Matches anything starting with one or more lowercase letters, followed by a single digit

Matches: 'abc1', 'a1', and 'abcxyz9'

Does Not Match: '9a', 'a22', or 'a-4'

  • ?: Zero (0) or one (1); means 'this is optional'

Example: [a-z]+[0-9]?

Matches anything starting with one or more lowercase letters, optionally followed by a single digit

Matches: 'abc', 'abc2', 'x', 'y3', and 'abcdefgh8'

Does Not Match: '9', 'a99', or 'a-4'

  • *: Zero (0) or more; means 'have as many as you want of these, or none at all'

Example: .*

Any number (0 or more, due to the '*') of any character (due to the '.')

Will match any text

Matches: 'a1b2c3', 'aaaa', and '111'

  • {min,max}: Between the minimum and maximum occurrences; means 'have between X and Y of the indicated character / set / range'; a single number with no comma, {n}, means exactly n times, and {min,} with the max left out means at least min times

Example: [0-9]{5}-[0-9]{4}

Simple ZIP+4 code matcher

Matches: '12345-0123'

Does Not Match: '12345' or '6789012'
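
The repetition examples above can be checked with a short sketch. It assumes Python's built-in re module and a hypothetical helper named full that tests the whole value; the [0-9]{2,4} and [0-9]{2,} patterns are extra illustrations that do not appear in the examples above:

```python
import re


def full(pattern, value):
    """True when the pattern matches the entire value."""
    return re.fullmatch(pattern, value) is not None


# + : one or more lowercase letters, then exactly one digit.
plus_ok = [v for v in ["abc1", "a1", "abcxyz9", "9a", "a22", "a-4"]
           if full(r"[a-z]+[0-9]", v)]

# ? : the trailing digit becomes optional.
optional_ok = [v for v in ["abc", "abc2", "x", "y3", "9", "a99", "a-4"]
               if full(r"[a-z]+[0-9]?", v)]

# * : zero or more of any character, so even an empty value matches.
star_matches_empty = full(r".*", "")

# {n} : exactly n occurrences (ZIP+4 example).
zip_ok = [v for v in ["12345-0123", "12345", "6789012"]
          if full(r"[0-9]{5}-[0-9]{4}", v)]

# {min,max} and {min,} : bounded and open-ended ranges.
two_to_four = [v for v in ["1", "12", "1234", "12345"] if full(r"[0-9]{2,4}", v)]
two_or_more = [v for v in ["1", "12", "1234", "12345"] if full(r"[0-9]{2,}", v)]

print(plus_ok, optional_ok, star_matches_empty, zip_ok, two_to_four, two_or_more)
```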

Groups and Ranges

Grouping does not, on its own, change which text is matched; it stores certain matched sets of characters for later use, typically in a substitution.

  • Grouping: Done using parentheses ( ) to isolate groups for storage

Example: ([0-9]{5})-([0-9]{4})

Stores the first five digits in group 1

Stores the last four digits in group 2

The dash ( - ) character is not stored because it is not inside parentheses

  • Referencing Groups: Done using a dollar sign ( $ ) followed by the group number (some engines use a backslash instead, e.g., \1)

Example: $1

Use the content from group 1

With the expression '([0-9]{5})-([0-9]{4})' and value '12345-0123' the results are $1 = '12345' and $2 = '0123'

  • |: Or condition; means either alternative may match

Example: [a-z]|[0-9]

Matches a single lowercase letter OR a single numeric digit

Matches: 'c' and '8'

Does Not Match: 'X' or '22'
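
As a rough illustration of storing groups, reusing them in a substitution, and the or condition, the sketch below assumes Python's built-in re module. Python is one of the engines that writes group references in a replacement as \1 and \2 rather than $1 and $2; the reordered output format is a made-up example:

```python
import re

zip_pattern = r"([0-9]{5})-([0-9]{4})"

# Capture the two groups from the ZIP+4 value used above.
m = re.fullmatch(zip_pattern, "12345-0123")
group_1 = m.group(1)   # first five digits
group_2 = m.group(2)   # last four digits; the dash is not captured

# Reuse the stored groups in a substitution (hypothetical reordering).
swapped = re.sub(zip_pattern, r"\2-\1", "12345-0123")

# Alternation: one lowercase letter OR one digit, tested on the whole value.
either = r"[a-z]|[0-9]"
alternation = {v: re.fullmatch(either, v) is not None
               for v in ["c", "8", "X", "22"]}

print(group_1, group_2, swapped, alternation)
```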

Escaping Special Characters

There are several characters that have special meanings, such as ?, . (dot), *, +, (, ), {, }, and more. This raises the question: how can you match a literal instance of one of these characters? The key is to escape the special character by putting a backslash ( \ ) in front of it.

  • Escaping: Done using a backslash ( \ ) to indicate that the next character is to be taken literally; if you need to match a backslash, escape it in the same way: \\

Example: \.

Matches a literal dot ( . )

Example: .*\..*

Matches any number of characters, followed by a period, followed by any number of characters

Used to match file names with extensions (e.g., abc123.pdf)

Also matches any other text that contains a period (e.g., 'Hello there. This is matched'), but not text without one
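
The difference between an escaped and an unescaped dot can be seen in a small sketch. It assumes Python's built-in re module; the helper name full and the sample values 'README', 'abc', and 'a.c' are illustrative only:

```python
import re


def full(pattern, value):
    """True when the pattern matches the entire value."""
    return re.fullmatch(pattern, value) is not None


# .*\..* needs at least one literal period somewhere in the value.
with_dot = full(r".*\..*", "abc123.pdf")
without_dot = full(r".*\..*", "README")

# Unescaped, the dot stands for any character; escaped, only for a period.
unescaped_matches_b = full(r"a.c", "abc")
escaped_matches_b = full(r"a\.c", "abc")
escaped_matches_dot = full(r"a\.c", "a.c")

# A backslash is escaped the same way: the pattern \\ matches one backslash.
backslash = full(r"\\", "\\")

# re.escape adds the backslashes for you when the text should be literal.
literal_pattern = re.escape("a.c")
auto_escaped = full(literal_pattern, "a.c") and not full(literal_pattern, "abc")

print(with_dot, without_dot, unescaped_matches_b, escaped_matches_b,
      escaped_matches_dot, backslash, auto_escaped)
```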

Anchors

The following characters match a position, the beginning or end of the text, rather than a character.

  • ^: Start of a string, or start of a line in multi-line mode
  • $: End of a string, or end of a line in multi-line mode

Example: ^[1-5][0-9]$|^[0-9]$

Two alternative patterns are being considered for matching. The first, ^[1-5][0-9]$, allows a first digit from 1 - 5 followed by a second digit from 0 - 9 (that is, 10 through 59). The pipe ( | ) symbol indicates an 'or' condition, meaning either the first or the second pattern can be matched. The second, ^[0-9]$, allows a single digit from 0 - 9. Together, the expression matches whole numbers from 0 through 59.
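
A short sketch of the anchored example follows, assuming Python's built-in re module, where multi-line behavior is switched on with the re.MULTILINE flag. The sample values and the three-line text are made up for illustration:

```python
import re

pattern = r"^[1-5][0-9]$|^[0-9]$"

# Because of the anchors, re.search only succeeds when the whole value fits.
samples = ["0", "7", "10", "42", "59", "60", "05", "123", "a7"]
accepted = [s for s in samples if re.search(pattern, s)]

# Without multi-line mode, ^ and $ refer to the start and end of the string.
lines = "1\n22\n3"
default_mode = re.findall(r"^[0-9]$", lines)

# With multi-line mode, ^ and $ also match at the start and end of each line.
multi_line = re.findall(r"^[0-9]$", lines, flags=re.MULTILINE)

print(accepted, default_mode, multi_line)
```

Without the anchors, a search for [1-5][0-9]|[0-9] would also succeed on values such as '123' or 'a7', because a matching piece appears somewhere inside them.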