Regexes in Perl

In the last tutorial, we went over the three regex operators.  We also learned what regexes are and why we’d use them in our everyday scripts.  This tutorial goes deeper into the many ways we can use them to manipulate our data.

Regular expressions are a huge topic all on their own.  There is no possible way to go over all of them here, so you may wish to purchase a book on the topic (yes, complete books are dedicated to them) for future reading and understanding.

Before we continue, we’ll start with the same basic regex example we used originally.

while (<STDIN>)
{
    if (m/exit/)
    {
        exit;
    }
}


Special Characters

Before we begin, there are many special characters we need to be aware of.  Please note that the backslash \ preceding the character(s) is not optional, and everything is case-sensitive: \d and \D, for example, mean opposite things.

\077  :  octal character
\a    :  alarm (bell) character
\c[   :  control character
\D    :  matches a non-digit character
\d    :  matches a digit character
\E    :  ends a case modifier (\L or \U)
\e    :  escape character
\f    :  form feed character
\L    :  lowercases all characters until the end case modifier \E is found
\l    :  lowercases the next character
\n    :  newline; acts much as a <br> tag does in HTML
\r    :  carriage return
\S    :  matches a non-white space character
\s    :  matches a white space character
\t    :  tab character
\U    :  uppercases all characters until the end case modifier \E is found
\u    :  uppercases the next character
\W    :  matches a non-word character
\w    :  matches a word character (a letter, digit or underscore)
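To get a feel for a few of these, here is a small sketch that counts what \d, \s and \w match in a string and tries out the \U, \E and \u case modifiers.  The sample text and the variable names are made up for illustration only.

```perl
use strict;
use warnings;

# Hypothetical sample data, chosen only for illustration.
my $text = "Order 66 shipped\ton 4 May";

# \d matches a digit; /g in list context returns every match,
# and the "= () =" idiom turns that list into a count.
my $digits = () = $text =~ /\d/g;

# \s matches white space (the spaces and the tab both count).
my $spaces = () = $text =~ /\s/g;

# \w matches a word character (letter, digit or underscore).
my $word_chars = () = $text =~ /\w/g;

# Case modifiers work inside double-quoted strings.
my $name    = "perl";
my $shouted = "\U$name\E rocks";   # uppercase everything until \E
my $capped  = "\u$name rocks";     # uppercase only the next character

print "digits: $digits\n";
print "white space: $spaces\n";
print "word characters: $word_chars\n";
print "$shouted\n";
print "$capped\n";
```

Notice that \u only touches the one character that follows it, while \U keeps going until it reaches \E.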


Match (nearly) any characters

One of the most useful special characters in regexes, in my opinion of course, is the . (the dot).  It matches any single character other than the newline \n.  This means any letter a-z, any number 0-9, a dash - and any weird @$()! character will be matched, but not the newline.

my $sentence = "this is a sentence, woohoo!";
$sentence =~ s/./T/;
print $sentence;
This is a sentence, woohoo!

The . (dot) matches any non-newline character, and since a substitution without the /g modifier replaces only the first match, the first character it comes across is replaced with “T”.  The “t” was replaced by “T”.

If . is a metacharacter, what if we really wanted to match a period in our regex?  This comes up very frequently, especially with the period, and lucky for us there is a very quick solution.  If you place a backslash in front of the metacharacter, it acts as a normal character instead of being special.

my $sentence = "We have ourselves a .";
$sentence =~ s/\./period/;
print $sentence;
We have ourselves a period

The only change between the first example and this one is the \. in the regex.  So instead of matching any character whatsoever, we’re literally matching a period and nothing more.
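Forgetting the backslash rarely causes an error; the regex simply matches more than we meant it to.  Here is a quick sketch of the difference, using a made-up $version string that contains no period at all.

```perl
use strict;
use warnings;

my $version = "3x14";   # hypothetical input with no period in it

# Unescaped dot: "." stands for any character, so the "x" satisfies it.
my $loose = ($version =~ /3.14/) ? "match" : "no match";

# Escaped dot: only a literal period will do, so this one fails.
my $exact = ($version =~ /3\.14/) ? "match" : "no match";

print "unescaped: $loose\n";
print "escaped:   $exact\n";
```

The unescaped version reports a match even though there is no period anywhere in the string.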

It is easy for us to match all characters at one time as well.  We already know the dot matches nearly all characters, so we use that along with the global modifier /g to substitute any and all matches.

my $sentence = "This is a sentence, woohoo!";
$sentence =~ s/./-/g;
print $sentence;

---------------------------

Character Classes

Instead of matching entire words or phrases, we can also match individual characters.  You can define a character class within square brackets [], such as [abc123], and you can set up a range of characters such as [a-zA-Z].  The latter matches any single letter in either case.

The range operator can work on parts of the alphabet such as c-f, or on digits such as 0-9 or 2-5.  Remember that everything is case-sensitive; that’s why we used [a-zA-Z] to match both cases of the letters.

my $string = "We're off to see the wizard, the most wonderful wizard of all!";
if ($string =~ m/!/)
{
    print "hey hey now, there's no reason to shout!";
}

In our above example, we are testing to see if we can match the exclamation point, which we can see at the end of the line, so it indeed matches.  Now let’s do a range test that fails.  We’ll rewrite $string so it’s all lowercase and see if it contains any uppercase characters.

my $string = "we're off to see the wizard, the most wonderful wizard of all!";
if ($string =~ m/[A-Z]/)
{
    print "I wonder if this will print..";
}

The above example will not print because the string contains nothing in our character range.  Remember, you can check for any letter or number in a range, or for any of the characters listed inside square brackets [].
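Here is a small sketch that pulls a few different classes out of one string so you can see exactly what each one picks up.  The sample text is made up; matching with /g in list context simply returns every match.

```perl
use strict;
use warnings;

my $string = "Room 47b, Floor 3";   # hypothetical sample text

# [A-Z] matches a single uppercase letter.
my @upper = $string =~ /[A-Z]/g;

# [0-9] is a range of digits; [2-5] is a narrower slice of it.
my @digits      = $string =~ /[0-9]/g;
my @two_to_five = $string =~ /[2-5]/g;

# A class can also list individual characters: here just a, b or c.
my @abc = $string =~ /[abc]/g;

print "uppercase: @upper\n";
print "digits: @digits\n";
print "2 to 5: @two_to_five\n";
print "a, b or c: @abc\n";
```

The 7 shows up under [0-9] but not under [2-5], since it falls outside that narrower range.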


Multiple chances with matching

You can do multiple tests to see if a variable or string contains this, that or another thing.  If you wanted to see if your string had the word “blue” or the word “red”, you don’t have to write two separate regexes to see if it matches.  We use the alternative match pattern instead.

Alternative matching matches one or the other, or in some cases yet another 🙂 It does not test whether both cases are found; it stops at the first match it finds, whether that is the first alternative or the sixteenth.

while (<STDIN>)
{
    if (m/red|blue/)
    {
        print "these are my fav colors";
        exit;
    }
}

This will loop until it finds a line that contains either “red” or “blue” literally.  “Red” and “bLue” will not match, and neither will any other capitalization of these words.

You separate each possible match with a pipe |, and you can use a single word, a single character, a sentence, a number or anything else you want to stick inside.  This idea is pretty straightforward, so we’ll assume the one example will suffice.  Remember not to just follow the examples written in these tutorials, but to make your own and TEST, TEST, TEST!
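To see the “stops at the first match” behavior for yourself, here is a small sketch.  It wraps the alternatives in capturing parentheses, which this tutorial has not covered, purely so we can print which alternative matched via $1; the sample line is made up.

```perl
use strict;
use warnings;

my $line = "I like blue more than red";   # hypothetical input

# The parentheses capture whichever alternative matched into $1.
my $first = "";
if ($line =~ /(red|blue)/) {
    $first = $1;
}

# Both colors are present, but the match stops at the first one it
# reaches while scanning left to right, which is "blue" here.
print "first color found: $first\n";

# Alternation is still case-sensitive, so "Red" does not match.
my $caps = ("Red" =~ /red|blue/) ? "match" : "no match";
print "Red: $caps\n";
```

Even though “red” is written first in the pattern, “blue” wins because it appears earlier in the string.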


Quantifiers

Along with simple matching to see if one thing exists or not, we can check to see how many times it exists. Or rather, make sure it matches exactly the number of times we want.

We do this using the quantifiers from the list below:

*      :  matches zero or more times
+      :  matches one or more times
?      :  matches either one or zero times
{#}    :  matches a precise number of times
{#,}   :  matches at least a certain number of times
{#,#}  :  matches between the first number and the second number of times

In our first example, we will be using the + quantifier to match one or more instances of our test.  If it does, we’ll do a simple substitution.

my $quantifier = "The frog goes rrribbit, rrribbit";
$quantifier =~ s/r+/r/g;

print $quantifier;

The frog goes ribbit, ribbit

Since the + quantifier matches characters or words that occur at least once, maybe more, we replaced every run of consecutive “r” characters with a single “r”.  This would work if you had one “r” or 100,000 of them.  As long as there is at least one, Perl is happy.
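To compare all of the quantifiers side by side, here is a sketch that runs each one against a few made-up words that differ only in how many “e” characters they contain.  It leans on the ^ and $ anchors, covered later in this tutorial, so that each pattern has to account for the whole word.

```perl
use strict;
use warnings;

# Hypothetical test words with zero, one, two and three "e" characters.
my @words = ("bt", "bet", "beet", "beeet");

my %result;
for my $word (@words) {
    my @hits;
    push @hits, 'e*'     if $word =~ /^be*t$/;      # zero or more
    push @hits, 'e+'     if $word =~ /^be+t$/;      # one or more
    push @hits, 'e?'     if $word =~ /^be?t$/;      # zero or one
    push @hits, 'e{2}'   if $word =~ /^be{2}t$/;    # exactly two
    push @hits, 'e{2,}'  if $word =~ /^be{2,}t$/;   # two or more
    push @hits, 'e{1,2}' if $word =~ /^be{1,2}t$/;  # one or two
    $result{$word} = join(" ", @hits);
    print "$word => $result{$word}\n";
}
```

Each line of output lists the quantifiers that accepted that word, which makes the overlap between them easy to spot.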

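In this next example, we are checking to see if the user typed in between 10 and 30 characters.  The ^ and $ anchors (covered later in this tutorial) pin the match to the whole line.  Without them, .{10,30} would also succeed on a 50-character line, because any 10 characters inside it are enough to match.

while (<STDIN>)
{
    if (m/^.{10,30}$/)
    {
        print "good for you!\n";
    }
}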
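A quick way to see how a length check like this behaves is to feed it lines of known lengths, with and without the ^ and $ anchors.  This is only a sketch; the three test strings are made up with Perl’s x repetition operator.

```perl
use strict;
use warnings;

my $short = "x" x 5;    # 5 characters
my $ok    = "x" x 20;   # 20 characters
my $long  = "x" x 50;   # 50 characters

for my $line ($short, $ok, $long) {
    # Unanchored: succeeds if ANY 10 to 30 characters can be found.
    my $unanchored = ($line =~ /.{10,30}/) ? "yes" : "no";

    # Anchored: the whole line must be 10 to 30 characters long.
    my $anchored = ($line =~ /^.{10,30}$/) ? "yes" : "no";

    printf "%2d chars: unanchored=%s anchored=%s\n",
        length($line), $unanchored, $anchored;
}
```

Only the anchored pattern rejects the 50-character line.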

Quantifiers are greedy by nature.  This means they’ll slurp up the biggest match they can that still lets the rest of the pattern match.  So instead of stopping at the first applicable match, a quantifier will take the biggest one possible (in terms of character count), occasionally providing unwanted results.

my $quote = "to be or not to be, that is the question";
$quote =~ s/.*be/To/;
print $quote;
To, that is the question

What we tried to do was change the first word from “to” to “To” with a capital “T” by substituting everything up to “be”.  Remember, .* matches any run of characters, and since we placed the characters “be” after it, we expected it to match just the first few characters of our sentence.

What really happened was that it found a bigger match.  Instead of stopping at the first “be”, it matched everything up to the last “be”, the word just before the comma, and substituted all of it.
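One common way to rein this in, not covered elsewhere in this tutorial, is to add a ? after the quantifier, which makes it non-greedy so it takes the smallest match instead.  The sketch below compares the greedy version, the non-greedy version, and a more direct substitution that gets what we were after in the first place.  The variable names are just for illustration.

```perl
use strict;
use warnings;

my $quote = "to be or not to be, that is the question";

# Greedy: .* runs to the LAST "be" it can reach.
(my $greedy = $quote) =~ s/.*be/To/;

# Non-greedy: .*? stops at the FIRST "be", so only "to be" is replaced.
(my $lazy = $quote) =~ s/.*?be/To/;

# If the goal is just to capitalize the first word, matching "to" at
# the start of the string is more direct than either of the above.
(my $direct = $quote) =~ s/^to/To/;

print "$greedy\n";
print "$lazy\n";
print "$direct\n";
```

Note that even the non-greedy version still swallows the first “be”, so the last substitution is the one that leaves the rest of the sentence alone.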


Regex Anchors (assertions)

Often we need total control over what to match and where.  Rather than taking the first match or taking our chances with a quantifier, we can tell our regex to match only under specific conditions.  These conditions include matching at the beginning or end of a string, at word or non-word boundaries, etc.

Here is a chart of the majority of the anchors we can use that will give us the power we need for accurate results and matching.

^   :  matches the beginning of the line
$   :  matches the end of the line
\A  :  matches the beginning of the string
\B  :  matches a non-word boundary
\b  :  matches a word boundary
\Z  :  matches the end of the string, or just before a newline at the end
\z  :  matches only at the very end of the string

The \A and \Z are pretty much the same as ^ and $.  The main difference shows up with the /m modifier: ^ and $ can then match multiple times, at the internal line boundaries after and before each newline inside the string, while \A and \Z still match once and once only, at the beginning or the end of the whole string.
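Here is a small sketch of how these anchors behave on a string that contains several lines.  The text is made up, and the parentheses are only there to capture the first word at each place the anchor matches.

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\nthird line\n";   # hypothetical input

# Without /m, ^ matches only at the start of the whole string.
my @plain = $text =~ /^(\w+)/g;

# With /m, ^ also matches right after every embedded newline.
my @multi = $text =~ /^(\w+)/mg;

# \A ignores /m and still matches only at the very start.
my @start = $text =~ /\A(\w+)/mg;

# \Z tolerates one trailing newline; \z insists on the very end.
my $upper_z = ("exit\n" =~ /exit\Z/) ? "match" : "no match";
my $lower_z = ("exit\n" =~ /exit\z/) ? "match" : "no match";

print "^ without /m: @plain\n";
print "^ with /m:    @multi\n";
print "\\A with /m:   @start\n";
print "\\Z: $upper_z\n";
print "\\z: $lower_z\n";
```

The difference between \Z and \z matters most when reading input, since lines from STDIN normally still have their newline attached.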

Let’s use an example we came across above.  We’re trying to see if the user typed exit.  This time, we’re making sure exit is the only thing the user typed.  This will not match if the user typed “I need an exit” or “exit is that way”.

while (<STDIN>)
{
    if (m/^exit$/)
    {
        exit;
    }
}

We just used two of the anchors we learned from the list we read earlier. ^exit is explicitly telling it to match exit at the beginning of the string. exit$ is telling it to match at the end of the string.  In short, by using both the ^ and $ anchors, you are asking it to match if it’s the only word or set of characters (or even a single character) in the string.

If we just wanted to match the beginning of the string, we would have just used the ^ caret.  Likewise with the $ if all we wanted to do was match the end of the string.

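A word boundary is the position between a word character (\w) and a non-word character (\W), or between a word character and the start or end of the string.  White space creates word boundaries, just as in our everyday text, but so does punctuation.  Matching \b does not consume any characters; it only asserts that a word starts or ends at that spot.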
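As a sketch of how this plays out, the example below uses \b on both sides of exit so that it only matches as a whole word.  The inputs are made up to cover a few cases.

```perl
use strict;
use warnings;

# Hypothetical inputs: only some contain "exit" as a whole word.
my @inputs = ("exit", "please exit, now", "exits", "Brexit");

my %whole_word;
for my $input (@inputs) {
    # \b sits between a word character and a non-word character (or the
    # start/end of the string), so the comma counts as a boundary too.
    $whole_word{$input} = ($input =~ /\bexit\b/) ? "yes" : "no";
    print "$input => $whole_word{$input}\n";
}
```

The first two inputs match, while “exits” and “Brexit” do not, because a word character sits directly against one side of “exit”.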

Author: Syperder Co