regular expression (how-to) by mcd

Let's start with a relatively simple set of values. With a set this short, you would normally just go through and do the calculations yourself or with a calculator. Of course, that changes when you're dealing with thousands of rows of data from a database or, god forbid, a spreadsheet. I have traveled both roads. I have had to sample-test 10,000 records until I was sure I had every variation. So while this may look like a simple list, it is deceptively so. Look at the regex it takes to accomplish the task. Before we get too serious, how about a nice little web-comic.

Stand back! I know regular expressions!

OK, first let's start with a string holding a variety of numbers, one per line. Nothing fancy, right?

Since we are only dealing with numbers (a luxury you rarely get with real data) we get a break. Let's use some pattern recognition to figure out the types of numbers we are up against. I was able to sort them into three types, each needing its own pattern: 1. whole numbers, 2. fractions, the most varied of the bunch, and 3. decimal numbers.

Pattern 1

The Whole Numbers

The pattern for whole numbers is simple. Remember to use parentheses to capture our values so we can save them for our calculations later.

(\d+) // match a run of digits; the plus sign means one or more
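To try the whole-number pattern on its own, here is a minimal Python sketch. The sample string is made up for illustration and is not the list this writeup was built around.

```python
import re

# Hypothetical sample input, one value per line.
sample = "7\n42\n1000"

# \d+ matches a run of one or more digits; the parentheses capture it.
whole_numbers = re.findall(r"(\d+)", sample)
print(whole_numbers)  # ['7', '42', '1000']
```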

Pattern 2

The Fractional Numbers

Probably the most daunting-looking pattern to match at first, but in reality it's not too bad: any number of digits, then a forward slash (which has to be escaped), then more digits. Remember that you may be getting simple fractions like 2/3 or a combination of whole and fractional numbers like 3 2/3, so your pattern must account for both. Again, parentheses to capture our values.

(\d* *\d+\/\d+) /*match zero or more digits (the asterisk means zero or more), zero or more spaces, one or more digits,
a forward-slash (which is not special to regex itself but is the delimiter in the /regex/ notation we use below, so
we escape it with a backslash. Backslash is used all over the place to escape characters so you better mind-meld with
it or something), and then one or more digits.*/
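Here is the fraction pattern as a small Python sketch, with made-up sample values. Python patterns are plain strings with no /regex/ delimiters, so the slash is left unescaped in this version.

```python
import re

# Hypothetical sample input: a simple fraction and a mixed number.
sample = "2/3\n3 2/3"

# Optional whole part, optional spaces, numerator, slash, denominator.
fraction = re.compile(r"(\d* *\d+/\d+)")
fractions_found = fraction.findall(sample)
print(fractions_found)  # ['2/3', '3 2/3']
```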

Pattern 3

The Decimal Numbers

The period is a little tricky because it means "any character" in a regex pattern, so it has to be escaped if you are actually trying to match a period or decimal point. Things get a little more complicated with these because we can have digits on both sides of the decimal point, or no digits on the left side if our recorder refuses to use a zero placeholder. Also, the numbers can be as big or as small as you like.

(\d{0,}\.\d+) // match zero or more digits ({0,} is the long way to write *), then a decimal point, then one or more digits

Pretty straightforward as long as you remember that an unescaped "." matches any character (in most flavors, anything except a newline). Depending on whether you are using POSIX or PCRE or whatever, "any character" can include Unicode, and that is something way too big to get into in this writeup.
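A quick Python sketch shows why the escape matters. The sample values are invented; the third one has an "x" where a decimal point would be.

```python
import re

# Hypothetical sample input; "10x25" is deliberately not a decimal number.
sample = "3.14\n.5\n10x25"

# Escaped dot: only a literal decimal point qualifies.
escaped = re.findall(r"(\d{0,}\.\d+)", sample)
print(escaped)    # ['3.14', '.5']

# Unescaped dot: any character other than a newline stands in for the point.
unescaped = re.findall(r"(\d{0,}.\d+)", sample)
print(unescaped)  # ['3.14', '.5', '10x25']
```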

Putting our RegEx Together

Of course you know as a regex guru that the character | (the pipe, not a capital I) means or. So that's how we're going to link our patterns together. But the order we join them in matters: the engine tries the alternatives left to right and takes the first one that matches. If the whole number pattern came first, it would grab the leading digits of every value, and something like 3.14 would be split apart. So we go with our complicated patterns first and work our way down to the simple ones. Thus, our pattern becomes:

(\d{0,}\.\d+)|(\d* *\d+\/\d+)|(\d+)

But that's missing the all-important /regex/ delimiters that really make it a Regular Expression. More on that in a second.

We need a function to do something with these values we captured. This function just happens to be in PHP; don't worry, the hate will die down as we get into other languages.
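To see the effect of ordering for yourself, here is a small Python comparison. The sample values and the helper full_matches are hypothetical, and the slash is unescaped because Python uses no delimiters.

```python
import re

# Hypothetical sample input: one decimal, one mixed number, one whole number.
sample = "3.14\n3 2/3\n42"

complex_first = r"(\d{0,}\.\d+)|(\d* *\d+/\d+)|(\d+)"
simple_first = r"(\d+)|(\d{0,}\.\d+)|(\d* *\d+/\d+)"

def full_matches(pattern, text):
    """Return the full text of every match, in order of appearance."""
    return [m.group(0) for m in re.finditer(pattern, text)]

good = full_matches(complex_first, sample)
bad = full_matches(simple_first, sample)
print(good)  # ['3.14', '3 2/3', '42']
print(bad)   # ['3', '.14', '3', ' 2/3', '42']
```

With the simple pattern first, the decimal and the mixed number are both chopped into pieces.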

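// $value is the text to search; $pattern is the regex without its delimiters
function matchValues($value, $pattern){
	// wrap the pattern in / delimiters and collect every match in $value
	preg_match_all("/$pattern/", $value, $matches, PREG_PATTERN_ORDER);
	return $matches;
}

// single quotes keep PHP from interpreting the backslashes
$pattern = '(\d{0,}\.\d+)|(\d* *\d+\/\d+)|(\d+)';
// $v holds the string of numbers, one per line
$m = matchValues($v,$pattern);
print_r($m);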

Bear in mind that this isn't a real-world example, just some code to show you how to grab all the various values, and I came up with as many different kinds as I could think of. I also chose the flag PREG_PATTERN_ORDER because it puts every full match in the first array, $m[0]; the arrays after it hold what each pair of parentheses captured. You may want to read up on preg_match_all. And I just added print_r at the end in case you want to run it and see how it grabs the values. Once you have the values you can do what you want with them. Convert them to similar types, add, multiply, take your pick. I just wanted to show how even the simplest of values can have tricky regular expressions.
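As one possible way to "convert them to similar types", here is a Python sketch that turns every match into an exact fraction and adds them up. The helper to_fraction and the sample values are assumptions for illustration, not part of the PHP above.

```python
import re
from fractions import Fraction

# Same alternation as above, written without delimiters for Python.
PATTERN = re.compile(r"(\d{0,}\.\d+)|(\d* *\d+/\d+)|(\d+)")

def to_fraction(text):
    """Convert '42', '.5', '2/3' or '3 2/3' into an exact Fraction."""
    text = text.strip()
    if " " in text:                 # mixed number such as '3 2/3'
        whole, part = text.split()
        return Fraction(whole) + Fraction(part)
    return Fraction(text)           # handles '42', '.5' and '2/3'

# Hypothetical sample input, one value per line.
sample = "1 1/2\n.5\n2"
values = [to_fraction(m.group(0)) for m in PATTERN.finditer(sample)]
total = sum(values)
print(total)  # 4
```

Fraction keeps the arithmetic exact, which avoids floating-point rounding when fractions and decimals are mixed.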

ADVANCED: proceed with caution

In a language like Perl or Javascript you don't treat a regular expression like a string that just happens to have forward slashes at the beginning and the end. Unix tools like sed, Perl, Javascript and a number of other languages use forward slashes to delimit a regular expression as part of the language itself. Javascript also has a whole class, RegExp, for putting together a regex, which seems like overkill to me, but constructors and Javascript are so easy they almost create themselves.

When searching for data you are almost always going to run into grep. Me, I prefer a Perl program called ack, which I find way faster and more convenient, but that is not for this writeup. Often with grep, you just want to type a quick command. To search all PHP files for include (for example), you would type

grep include *.php

But let's say you have a file of arbitrary phone-number type data. There was no restriction on how it was entered so it's up to you to find all the phone numbers and format them into real numbers. I'll show you how to find the numbers. First, a short list of phone numbers in a file, phonenumber.

559-456-4214
526-699-9993
1-1234567893
(559)-456-4563

So then we use regex to find all our values. Of course our real data set is going to be much bigger; this is just for ease of example.

grep '[1-]*(*[0-9]\{3\})*-*[0-9]\{3\}-*[0-9]\{4\}' phonenumber
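If you want to check the idea away from the command line, here is a rough Python sketch of the same kind of pattern, split into commented pieces. Python's re syntax differs from grep's basic syntax: literal parentheses are written \( and \), and the repetition braces are written bare. The list simply repeats the four sample numbers above.

```python
import re

# The four sample lines from the phonenumber file.
lines = ["559-456-4214", "526-699-9993", "1-1234567893", "(559)-456-4563"]

phone = re.compile(
    r"[1-]*"            # optional leading 1 and/or dash
    r"\(*[0-9]{3}\)*"   # area code, optionally wrapped in parentheses
    r"-*[0-9]{3}"       # optional dash, then three digits
    r"-*[0-9]{4}"       # optional dash, then four digits
)

found = []
for line in lines:
    match = phone.search(line)
    found.append(match.group(0) if match else None)
    print(line, "->", found[-1])
```

Each of the four lines is matched in full, which is a quick sanity check that the pattern covers every variation in the sample.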

"Ohmigod!" You say, "What the hell is that?" Yes, regex can seem like cryptic voodoo, but it is incredibly powerful and I have built entire clean databases off of nothing but many, many regular expressions.

But let's get into Perl and do a little search and replace which is incredibly easy to do in Perl even if it looks a little cryptic at first.

s/foo/bar/gi;

Me, I have always thought of that s as a "search" abbreviation (officially it stands for substitute). What it will do is find all "foo" and replace it with "bar". Also, I want to introduce you to flags. What are those letters at the end of the sequence of symbols? What is the g and the i? Well, g stands for global, meaning that it will replace all instances it finds instead of just the first, which is the default behavior. i means make our search case-insensitive. Very handy, that. In fact, we're going to use both to switch every foo, whatever its case, to BAR, and then switch only the lowercase bar to foo. Since we have marked our special BAR with uppercase, we just turn it lowercase again at the end. There are a ton of ways to do that, but let's stick with regex for now. Here is the whole Perl file. (I almost forgot to mention the use of =~ to bind a string to a pattern match m// or a replace s///.) Once you get used to Perl's crazy syntax it becomes so much easier to type in s/search/replace/ than preg_match_all(pattern, subject, matches, FLAG).

$foo = "Bigfoot is the coolest monster and if you ever meet him at a bar buy him a foot-tall pint, and he is sure to thank you with a teeth-baring smile";
$foo =~ s/foo/BAR/gi;
$foo =~ s/bar/foo/g;
$foo =~ s/BAR/bar/g;
print $foo;

Now let's see what that print-out is on the last line.

Bigbart is the coolest monster and if you ever meet him at a foo buy him a bart-tall pint, and he is sure to thank you with a teeth-fooing smile
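For comparison, the same three substitutions can be sketched in Python with re.sub, which replaces every occurrence by default (so no g flag is needed) and takes re.IGNORECASE in place of i.

```python
import re

text = ("Bigfoot is the coolest monster and if you ever meet him at a bar "
        "buy him a foot-tall pint, and he is sure to thank you with a "
        "teeth-baring smile")

text = re.sub(r"foo", "BAR", text, flags=re.IGNORECASE)  # like s/foo/BAR/gi
text = re.sub(r"bar", "foo", text)                       # like s/bar/foo/g
text = re.sub(r"BAR", "bar", text)                       # like s/BAR/bar/g
print(text)
```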

That's all for today, folks! I would read this after reading all the other writeups at the top, and then you're on your way to a good understanding and use of regular expressions.
