Regular Expressions
Regular expressions (commonly abbreviated to regex) are widely used throughout computing. A regex is a pattern that describes a set of strings; any string in that set is said to match the pattern.
You may be familiar with using a wildcard * to mean "anything can be here". A regex is a much more powerful extension of that idea: rather than "anything", we can specify which values are allowed and where in the string they may appear.
In part one of the rewrite tutorial, we used an article script as an example. Assume we have configured our script to output URLs in the format /articles/24-article-title-here.html but we need that rewritten to /articles.php?id=24. Before we can do anything, we need a regex pattern that matches this string:
That solves that problem, but only when we view article number 24. It is still a regex pattern: most characters are treated as literals, meaning they match only themselves. An "a" in a regex will match an "a", a "b" in a regex will match a "b", and so on. However, that alone would be fairly useless, and this is where our special characters come in. These are known as metacharacters and have a special meaning, as described below.
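As a rough illustration of literal matching, here is a minimal Python sketch. The pattern STATIC_PATTERN and the helper matches_static are hypothetical names chosen for this example, not the exact pattern from the tutorial; the ".html" suffix is left out on purpose because the period is a metacharacter.

```python
import re

# Hypothetical literal pattern: every character here matches only itself.
STATIC_PATTERN = "articles/24-article-title-here"

def matches_static(url: str) -> bool:
    """Return True if the literal pattern occurs anywhere in url."""
    return re.search(STATIC_PATTERN, url) is not None

print(matches_static("/articles/24-article-title-here.html"))  # True
print(matches_static("/articles/25-another-title.html"))       # False
```

The second URL fails because a literal "24" can never match "25", which is exactly the limitation described above.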
Syntax
Values:
- . the period can match any single character. "h." will match "ha", "hb", etc. This matches all characters, not only alphanumeric.
- [] square brackets match any one character from a set or range - "[a-z]" matches any lowercase letter, "[0-9]" matches any single digit and "[hi2u]" matches "h", "i", "2" or "u". You can also use any combination, such as "[a-g0-9p]", which matches any digit, the lowercase letters a through g, and p.
- [^] this matches any one character not contained within the brackets. This is similar to the previous syntax, except now "[^hi2u]" matches every character that is not an "h", an "i", a "2" or a "u".
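The value syntax above can be tried out with a short Python sketch. The helper full_match is a hypothetical name used only here; it checks whether the whole string matches the pattern. One caveat from Python's re module: by default the period does not match a newline character.

```python
import re

def full_match(pattern: str, text: str) -> bool:
    """Return True if the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# The period matches any single character, not only alphanumerics.
print(full_match("h.", "ha"))  # True
print(full_match("h.", "h!"))  # True
print(full_match("h.", "h"))   # False: the period needs one character

# Square brackets match one character from the listed set or ranges.
print(full_match("[a-g0-9p]", "p"))  # True
print(full_match("[a-g0-9p]", "z"))  # False

# A leading ^ inside the brackets negates the set.
print(full_match("[^hi2u]", "x"))  # True
print(full_match("[^hi2u]", "u"))  # False
```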
Anchors:
- ^ denotes the string must start at this point.
- $ denotes the string must end at this point.
"pie" alone will match "I like pie", "pie is good" or "Chicken pie no thanks".
"^pie" will match "pie is good" and any other string starting with "pie".
"pie$" will match "I like pie" and any other string ending with "pie".
"^pie$" will match only the string "pie".
Quantifiers:
Up to now, each expression has matched only one character at a time. Quantifiers let us repeat the previous expression.
- * means zero or more of the previous expression. "go*gle" matches "ggle", "gogle", "google", "gooogle", and so on.
- + means one or more of the previous expression. "go+gle" matches "gogle", "google", "gooogle", and so on, but not "ggle".
- ? means zero or one of the previous expression. "pin?e" will match both "pine" and "pie".
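The three quantifiers can be compared side by side with a short Python sketch. The list words and the helper full_matches are hypothetical names for this example.

```python
import re

words = ["ggle", "gogle", "google", "gooogle"]

def full_matches(pattern: str, candidates: list) -> list:
    """Return the candidates that match the pattern in full."""
    return [w for w in candidates if re.fullmatch(pattern, w)]

print(full_matches("go*gle", words))  # zero or more "o": all four words
print(full_matches("go+gle", words))  # one or more "o": ['gogle', 'google', 'gooogle']
print(full_matches("go?gle", words))  # zero or one "o": ['ggle', 'gogle']
print(full_matches("pin?e", ["pie", "pine", "pinne"]))  # ['pie', 'pine']
```

Note that each quantifier applies only to the expression immediately before it, here the single letter "o" or "n".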
The rest:
The pipe symbol | means "OR". "gray|grey" matches "gray" or "grey".
Parentheses (round brackets) can be used to group expressions. The above example could be shortened to "gr(e|a)y".
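A brief Python sketch of alternation and grouping follows. The helper full_matches and the list spellings are hypothetical names for this example. It also shows that a group remembers what it matched, which is what makes backreferences possible.

```python
import re

def full_matches(pattern: str, candidates: list) -> list:
    """Return the candidates that match the pattern in full."""
    return [w for w in candidates if re.fullmatch(pattern, w)]

spellings = ["gray", "grey", "groy"]

print(full_matches("gray|grey", spellings))  # ['gray', 'grey']
print(full_matches("gr(e|a)y", spellings))   # ['gray', 'grey']

# The parentheses also capture the text they matched.
captured = re.fullmatch("gr(e|a)y", "grey").group(1)
print(captured)  # e
```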
Escape Special Meaning
Since the characters . + * ? ^ $ | [ ] ( ) and \ all have a special meaning in a regex pattern, you must escape them with a backslash when you want to match them literally, for example a period. ".html" matches any character followed by "html". "\.html" matches only ".html".
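The difference between an escaped and an unescaped period can be seen in this Python sketch. The helper found is a hypothetical name for this example; re.escape is a standard library function that escapes metacharacters for you.

```python
import re

def found(pattern: str, text: str) -> bool:
    """Return True if the pattern occurs anywhere in text."""
    return re.search(pattern, text) is not None

# Unescaped: the period matches any character, so "xhtml" qualifies.
print(found(".html", "xhtml"))       # True

# Escaped: only a literal period followed by "html" matches.
print(found(r"\.html", "xhtml"))     # False
print(found(r"\.html", "page.html")) # True

# re.escape adds the backslash automatically.
escaped = re.escape(".html")
print(escaped)  # \.html
```

A raw string (the r prefix) is used so that Python passes the backslash through to the regex engine unchanged.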
Example
Going back to our example, we can now create a much more useful pattern that matches any possible ID number and any possible title in the URL.
That may look a little harder than our original static pattern but if we go through it bit by bit, it makes perfect sense.
First, we make sure the request starts with articles/. Then we match the number, using the range 0-9 repeated one or more times, which allows us to match any possible numeric value there. Note that this is in parentheses so that we can use it as a backreference (explained in the rewrite tutorial, but basically we want to reuse the value matched there). We then separate with a dash, match any set of characters in the title and end with a .html extension.
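The walkthrough above can be sketched in Python. The pattern below is an assumption reconstructed from that description, not necessarily the exact pattern used in the rewrite tutorial, and ARTICLE_PATTERN and rewrite are hypothetical names. The path is given without a leading slash, which is also an assumption about how the request is presented to the pattern.

```python
import re

# Assumed pattern: "articles/", one or more digits (captured),
# a dash, any characters for the title, then a literal ".html".
ARTICLE_PATTERN = re.compile(r"^articles/([0-9]+)-.*\.html$")

def rewrite(path: str) -> str:
    """Rewrite a matching article path; return other paths unchanged."""
    # \1 is the backreference to the digits captured by the parentheses.
    return ARTICLE_PATTERN.sub(r"/articles.php?id=\1", path)

print(rewrite("articles/24-article-title-here.html"))  # /articles.php?id=24
print(rewrite("articles/1057-another-title.html"))     # /articles.php?id=1057
print(rewrite("about.html"))                           # about.html
```

Because the pattern is anchored at both ends, the whole path is replaced and only the captured ID survives into the new URL.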
Easy?