Regular Expressions

Introduction

Many LinkScan features and options support powerful Perl Regular Expressions. These let you match patterns within text strings which, in the context of LinkScan, usually means URLs.

Regular Expressions may appear somewhat intimidating if you have not encountered them before. However, they are a well-established standard for pattern matching, refined by computer scientists over more than 20 years.

We present a short and simplified tutorial below. Many other references are available on-line and in print. For example:

http://perldoc.perl.org/perlre.html

We also recommend the book Mastering Regular Expressions (a.k.a. the Owl Book) by Jeffrey E.F. Friedl, published by O'Reilly [ISBN: 1-56592-257-3].

Our First Example

One of the most common uses for Regular Expressions within LinkScan is to establish an exclusion rule. For example, if one were to scan the website http://www.example.com/, one might wish to exclude from the scan some old archival content residing under:

http://www.example.com/papers/archived/

The following LinkScan rule will accomplish that:

Exclude papers/archived/

In fact, this simple example introduces a couple of important LinkScan conventions:

Our expression papers/archived/ is aligned with the portion of the URL immediately following the root URL (http://www.example.com/).
Our expression includes an implied wildcard at the end of the expression. In other words, this rule will exclude everything under papers/archived/.
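
The two conventions above can be sketched in Python. This is only an illustration, not LinkScan's implementation: the helper name is_excluded is hypothetical, and we assume that a rule behaves like a pattern anchored at the start of the path following the root URL and left open at the end, which is what Python's re.match does.

```python
import re

ROOT = "http://www.example.com/"  # root URL of the scan, as in the example above

def is_excluded(rule: str, url: str, root: str = ROOT) -> bool:
    """Hypothetical helper: emulate an Exclude rule under the stated assumptions."""
    if not url.startswith(root):
        return False
    path = url[len(root):]  # portion of the URL immediately following the root
    # re.match anchors at the start of the path but not at the end,
    # which mimics the implied wildcard at the end of the expression.
    return re.match(rule, path) is not None

archived = is_excluded("papers/archived/", "http://www.example.com/papers/archived/1998/a.html")
current = is_excluded("papers/archived/", "http://www.example.com/papers/current/a.html")
print(archived)  # True
print(current)   # False
```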

Some Simple Variations

Now let us suppose that we wish to exclude multiple archival areas such as:

http://www.example.com/papers/archived/
http://www.example.com/technotes/archived/
http://www.example.com/newsletters/archived/

We can add a simple wildcard to the rule and specify:

Exclude .*/archived/

Note that the wildcard is dot-asterisk, not a lone asterisk. The dot means any single character, and the asterisk means zero or more of whatever precedes it.
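
As a rough check, the wildcard rule can be tried against the three archival areas in Python. Python's re module is used here only as a stand-in for Perl-style expressions, and the sample paths (written relative to the root URL) are made up for illustration.

```python
import re

rule = r".*/archived/"

# Paths relative to the root URL; the file names are illustrative only.
paths = [
    "papers/archived/index.html",
    "technotes/archived/tn01.html",
    "newsletters/archived/",
    "papers/current/index.html",
]

excluded = [p for p in paths if re.match(rule, p)]
print(excluded)  # the three archived paths, but not papers/current/

# A lone asterisk has nothing to repeat, so Python rejects it as a pattern.
try:
    re.compile(r"*/archived/")
    lone_asterisk_ok = True
except re.error:
    lone_asterisk_ok = False
print(lone_asterisk_ok)  # False
```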

Next we'll assume that we wish to exclude all files with the .old extension from our scan. We might guess that the rule would be:

Exclude .*.old

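This might achieve the desired effect, in part, but it is incorrect in two important respects. First, as we have already seen, the dot has a special meaning (any character), so the rule would also match a name such as threshold.html, where an arbitrary character precedes old. Second, because of the implied wildcard at the end of the expression, the rule would also match any URL that merely contains .old, such as report.older.html. The correct way to exclude all of the .old files and eliminate the possibility of false matches is: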

Exclude .*\.old$

We have used the backslash to escape our dot character, so it now matches a literal dot rather than any character. We have also appended the dollar sign, which forces the match to end at the end of the URL and so cancels the implied wildcard.
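
The difference between the two rules can be seen in a small Python sketch. The file names are invented for illustration, and re.match is assumed to approximate how a rule is applied (anchored at the start, open at the end).

```python
import re

naive = r".*.old"      # unescaped dot, no end anchor
correct = r".*\.old$"  # literal dot, anchored at the end

# Illustrative paths: only the first is a genuine .old file.
paths = ["docs/report.old", "docs/threshold.html", "docs/report.older.html"]

naive_hits = [p for p in paths if re.match(naive, p)]
correct_hits = [p for p in paths if re.match(correct, p)]

print(naive_hits)    # all three paths: two of them are false matches
print(correct_hits)  # ['docs/report.old']
```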

Case Sensitivity

Regular Expressions are case sensitive, and hence our earlier examples would fail to exclude directories called Archived/ or files such as Foo.Old. We can easily address that as follows:

Exclude (?i).*/archived/
Exclude (?i).*\.old$

Prepending the (?i) is a bit verbose, but there are simply not enough characters on the keyboard...
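
A quick Python check shows the effect of the (?i) prefix. The paths are illustrative, and Python's re module is only a stand-in for the Perl-style matching described here.

```python
import re

# Without (?i) the match is case sensitive and misses Archived/.
sensitive = re.match(r".*/archived/", "papers/Archived/a.html") is not None

# With (?i) at the start, the whole expression ignores case.
insensitive_dir = re.match(r"(?i).*/archived/", "papers/Archived/a.html") is not None
insensitive_ext = re.match(r"(?i).*\.old$", "docs/Foo.Old") is not None

print(sensitive)        # False
print(insensitive_dir)  # True
print(insensitive_ext)  # True
```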

Special Characters

Certain characters, such as the dot, have special meanings. Hence we must precede them with a backslash when we wish to refer to the literal character. The ones that occur most commonly in URLs are:

. * + ?

And here is a more complete list:

. * + ? \ ( ) [ ] { } | $ ^

Hence if you need to refer to a literal backslash character, simply write:

\\
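
Escaping by hand is easy to get wrong, so the rule can be applied mechanically. The Python sketch below uses a hypothetical helper, escape_literal, that puts a backslash before each character in the list above; the sample file name is invented for illustration.

```python
import re

# The special characters from the list above.
SPECIAL = set(r".*+?\()[]{}|$^")

def escape_literal(text: str) -> str:
    """Hypothetical helper: prefix each special character with a backslash."""
    return "".join("\\" + ch if ch in SPECIAL else ch for ch in text)

name = "search.cgi?q=a+b"
escaped = escape_literal(name)

print(escaped)                              # search\.cgi\?q=a\+b
print(re.match(escaped, name) is not None)  # True: the escaped form matches the literal text
print(re.match(name, name) is not None)     # False: unescaped, ? and + change the meaning
print(escape_literal("\\"))                 # \\
```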

Finally, we need to address the literal space character. Of course, spaces should never appear in a URL except as %20. However, LinkScan generally requires that no space characters appear in an expression, and we may need to match spaces from time to time. You may write, for example:

Not\sFound

But it's often more useful to use:

Not\s+Found

In this case, the \s+ will match one or more space and/or tab characters.
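
The difference between \s and \s+ can be checked with a few sample strings in Python. The strings are illustrative, and note that in Python \s also matches other whitespace such as newlines.

```python
import re

single = r"Not\sFound"     # exactly one whitespace character
flexible = r"Not\s+Found"  # one or more whitespace characters

samples = ["Not Found", "Not  Found", "Not\tFound", "NotFound"]

single_hits = [s for s in samples if re.search(single, s)]
flexible_hits = [s for s in samples if re.search(flexible, s)]

print(single_hits)    # ['Not Found', 'Not\tFound']
print(flexible_hits)  # ['Not Found', 'Not  Found', 'Not\tFound']
```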

Summary

.*             Wildcard
$              Force the match at the end of the URL/string
\s*            Zero or more space characters
\s+            One or more space characters
\d+            One or more numeric characters
[0-9]+         One or more numeric characters
[^0-9]+        One or more NON-numeric characters
<[^>]+>        HTML tag. [^>]+ == One or more characters, none of them a >
(abc|def)      abc OR def
(?i)           Make this expression case insensitive

Characters that require an "escape":

. * + ? \ ( ) [ ] { } | $ ^
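
Several of the summary entries can be tried out in Python, used here only as a stand-in for Perl-style expressions. The sample strings are invented for illustration.

```python
import re

text = "page12/item345"
digits = re.findall(r"\d+", text)          # runs of numeric characters
non_digits = re.findall(r"[^0-9]+", text)  # runs of NON-numeric characters

html = "<p>Hi <b>there</b></p>"
tags = re.findall(r"<[^>]+>", html)        # each HTML tag

either = re.search(r"(abc|def)", "xyzdef") is not None  # abc OR def

print(digits)      # ['12', '345']
print(non_digits)  # ['page', '/item']
print(tags)        # ['<p>', '<b>', '</b>', '</p>']
print(either)      # True
```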
