Regular Expression Tutorial

January 26, 2006
This is a short tutorial on understanding regular expressions.

What is a Regular Expression?


A regular expression is a sequence of characters that describes a pattern in text. It combines ordinary characters with symbols that carry a special meaning when used together. The term "Regular Expression" is usually abbreviated to regex or regexp.
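
As a first taste, here is a minimal sketch in Python using the standard re module. The sample text and the variable names are made up for illustration; the pattern [0-9]+ simply means "one or more digits in a row".

```python
import re

# Made-up sample text for illustration.
text = "Order 66 was placed on 2006-01-26."

# The pattern "[0-9]+" describes "one or more digits in a row".
pattern = re.compile(r"[0-9]+")

# findall returns every non-overlapping piece of text that fits the pattern.
numbers = pattern.findall(text)
print(numbers)  # ['66', '2006', '01', '26']
```

The same pattern works on any text, which is the point: the regex describes the shape of what you are looking for, not one particular value.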

Regular Expression Syntax


A regular expression has 3 important parts: anchors, character sets, and modifiers (the special characters that control repetition, alternation, and escaping).

Anchors


Anchors specify the position of the pattern in relation to a line of text. They match a position rather than a character. The 2 basic anchors are ^ and $.

^ - A caret matches the start of the string.

$ - A dollar matches the end of the string.
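
The effect of the two anchors can be sketched in Python as follows. The sample line is invented for illustration, and the behaviour shown is that of Python's re module with default flags.

```python
import re

# Made-up sample line for illustration.
line = "regex tutorial for regex users"

# Without an anchor, "regex" is found wherever it occurs.
unanchored = re.findall(r"regex", line)
print(unanchored)          # ['regex', 'regex']

# ^ pins the match to the start of the string.
start = re.search(r"^regex", line)
print(start.span())        # (0, 5)

# $ pins the match to the end of the string.
end = re.search(r"users$", line)
print(end.span())          # (25, 30)

# "regex" is not the last thing in the string, so nothing is found.
not_at_end = re.search(r"regex$", line)
print(not_at_end)          # None
```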

Special Characters


Apart from the two anchors listed above, the following characters have special meaning in regular expressions.

? - A question mark indicates that the preceding character occurs at most once, that is, 0 or 1 times.

* - An asterisk indicates the preceding character is repeated any number of times (0 or more times).

+ - A plus repeats the previous item 1 or more times.

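. - A dot matches any single character in this position except line break characters. Which characters count as line breaks depends on the regex flavor: some exclude both '\r' and '\n', while Python's re module excludes only '\n' by default.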

\ - Use a backslash to represent a special character literally, i.e. it suppresses the special meaning of the character that follows it.

| - A pipe (alternation) matches either the part on its left side or the part on its right side.

{n} - Matches the preceding character, or character range, n times exactly.

{n,m} - Matches the preceding character at least n times but not more than m times.

You can combine the special characters listed above to create regular expressions.
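
Each of the special characters above can be tried out with a short Python sketch. The helper name matches is hypothetical, introduced only for this example; it uses re.fullmatch, so a pattern counts as matching only when it fits the entire text.

```python
import re

def matches(pattern, text):
    """Return True if the whole of text fits the pattern."""
    return re.fullmatch(pattern, text) is not None

# ? : the preceding character appears 0 or 1 times.
assert matches(r"colou?r", "color") and matches(r"colou?r", "colour")

# * : 0 or more times; + : 1 or more times.
assert matches(r"ab*c", "ac") and matches(r"ab*c", "abbbc")
assert matches(r"ab+c", "abc") and not matches(r"ab+c", "ac")

# . : any single character except a line break.
assert matches(r"c.t", "cat") and not matches(r"c.t", "c\nt")

# \ : take the next special character literally.
assert matches(r"3\.14", "3.14") and not matches(r"3\.14", "3x14")

# | : either the left side or the right side.
assert matches(r"cat|dog", "dog") and not matches(r"cat|dog", "cow")

# {n} : exactly n times; {n,m} : between n and m times.
assert matches(r"[0-9]{4}", "2006") and not matches(r"[0-9]{4}", "206")
assert matches(r"a{2,3}", "aaa") and not matches(r"a{2,3}", "aaaa")

print("all special-character checks passed")
```

If any line's claim were wrong, the corresponding assert would raise an AssertionError instead of reaching the final print.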


Character Set


[ ] - A character set (character class) is written between an opening [ and a closing ] bracket. It can contain letters, digits, and other characters, and it matches any one of the characters listed.

A ^ as the first character inside a character class indicates negation.


The following are a few examples of character class usage.

[0] - Matches 0

[0-9] - Matches any single digit from 0 to 9.

[^0-9] - Matches any character which is not a digit.

[a-z] - Matches any lowercase letter.

A hyphen - between two characters inside a character class indicates a range.
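
The character class examples above can be checked with a small Python sketch. The sample string and variable names are invented for illustration.

```python
import re

# Made-up sample string for illustration.
sample = "Room 101, Floor 7b"

zeros = re.findall(r"[0]", sample)
print(zeros)        # ['0']

digits = re.findall(r"[0-9]", sample)
print(digits)       # ['1', '0', '1', '7']

lowercase = re.findall(r"[a-z]", sample)
print(lowercase)    # ['o', 'o', 'm', 'l', 'o', 'o', 'r', 'b']

# Negated class: keep everything that is not a digit.
non_digits = "".join(re.findall(r"[^0-9]", sample))
print(non_digits)   # Room , Floor b

# A ^ that is not the first character in the class is an ordinary caret.
with_caret = re.findall(r"[0-9^]", "2^3")
print(with_caret)   # ['2', '^', '3']
```

Note that [a-z] skips the capital R and F; to cover both cases you would write [a-zA-Z].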


POSIX Character Sets


POSIX defines named character classes, which are a more portable way to specify character sets. They are used inside a bracket expression, for example [[:digit:]]. The classes are -

[:alnum:] - Alphanumeric. Same as writing [a-zA-Z0-9]

[:cntrl:] - Control character.

[:lower:] - Lower case character.

[:space:] - Whitespace, including spaces, tabs, and line breaks.

[:alpha:] - Alphabetic. Same as writing [a-zA-Z]

[:digit:] - Digits. Same as [0-9]

[:print:] - Printable character

[:upper:] - Upper case character

[:blank:] - Space and tab only.

[:graph:] - Printable and visible characters

[:punct:] - Punctuation

[:xdigit:] - Hexadecimal digit. Same as [0-9A-Fa-f]
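
Tools such as grep understand these class names directly, but Python's built-in re module does not. The sketch below works around that by expanding a few class names into plain ASCII sets before compiling. The table POSIX_ASCII and the helper expand_posix are hypothetical names made up for this example, and the expansions assume the C/POSIX locale (ASCII only).

```python
import re

# Hypothetical lookup table: POSIX class name -> the ASCII set it
# denotes in the C/POSIX locale, written in Python's re syntax.
POSIX_ASCII = {
    "alnum": "a-zA-Z0-9",
    "alpha": "a-zA-Z",
    "digit": "0-9",
    "lower": "a-z",
    "upper": "A-Z",
    "xdigit": "0-9A-Fa-f",
    "blank": " \\t",
    "space": " \\t\\n\\r\\f\\v",
}

def expand_posix(pattern):
    """Replace each [:name:] in pattern with its ASCII equivalent.

    Raises KeyError for a class name that is not in the table.
    """
    def substitute(match):
        return POSIX_ASCII[match.group(1)]
    return re.sub(r"\[:([a-z]+):\]", substitute, pattern)

# [[:alpha:][:digit:]_] becomes an ordinary character class.
identifier = expand_posix(r"[[:alpha:][:digit:]_]+")
print(identifier)   # [a-zA-Z0-9_]+

# A string made up entirely of hexadecimal digits.
hex_word = re.compile(expand_posix(r"^[[:xdigit:]]+$"))
print(hex_word.search("CAFE2006") is not None)   # True
print(hex_word.search("xyz") is not None)        # False
```

This is only a convenience for experimenting; it does not handle locales or non-ASCII text the way a real POSIX implementation would.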

Regular Expressions Tester


These are specialized programs that let you test your regular expressions against sample text. Two such programs available on Linux are Kodos and Kiki.
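
The basic idea behind such a tester can be sketched in a few lines of Python: run the pattern over the sample text and report what matched and where. This is not how Kodos or Kiki are implemented; the function try_regex and the sample text are hypothetical, made up for illustration.

```python
import re

def try_regex(pattern, text):
    """Return one report entry per match of pattern in text."""
    report = []
    for match in re.finditer(pattern, text):
        report.append({
            "match": match.group(0),    # the text that matched
            "span": match.span(),       # (start, end) offsets in text
            "groups": match.groups(),   # contents of each ( ) group
        })
    return report

# Made-up sample text for illustration.
results = try_regex(r"([a-z]+)@([a-z]+)\.com",
                    "mail bob@example.com or amy@test.com")
for entry in results:
    print(entry)
```

Each entry shows the matched text, its position, and the pieces captured by the parentheses, which is the kind of feedback that makes debugging a pattern much quicker than guessing.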

Kodos


Kodos is a Python GUI utility for creating, testing and debugging regular expressions for the Python programming language. It aims to help developers write regular expressions efficiently and with little effort.

Kodos regular expressions tester

Homepage - kodos.sourceforge.net

Kiki


Kiki is a GUI tool used for regular expression testing. It allows you to write regexes and test them against your sample text, providing extensive output about the results.

Kiki regular expressions tester

Both Kodos and Kiki are available in the repositories of Ubuntu and Fedora and, most likely, in other Linux distributions as well.
