LinkScan

LinkScan for Unix. Reference Manual

 

LinkScan Reference Manual. Table of Contents

    Part I. LinkScan Core Capabilities

  1. Introduction to LinkScan
  2. Essential LinkScan Concepts
  3. New LinkScan Installations
  4. Upgrading Existing LinkScan Installations
  5. Basic Scanning
  6. Examining the Results
  7. LinkScan Status and Error Codes
  8. Scheduling LinkScan
  9. File System Scanning and Orphaned Files
  10. Import Scanning
  11. Advanced and Custom Scanning
  12. Advanced, Custom and Command Line Results
  13. LinkScan Enterprise/Unlimited Extensions
  14. LinkScan Support
  15. Known Problems and Limitations

    Part II. Companion Programs

  16. LinkScan Dispatch
  17. LinkScan Excel
  18. LinkScan Profiler
  19. LinkScan QuickCheck
  20. LinkScan Recorder
  21. LinkScan TapMap
  22. LinkScan WebServer
  23. LinkScan Pinger
  24. Weblint Man Page

    Part III. Appendixes

  25. Glossary of Terms
  26. LinkScan Quick Reference Card
  27. LinkScan and Various Web Servers
  28. LinkScan File Formats
  29. LinkScan Application Notes
  30. LinkScan Revision History
  31. LinkScan License Agreement

Note: This Reference Manual is divided into multiple documents for ease and speed of navigation. However, the contents are also available as a single document suitable for searching and/or printing as the Single Document LinkScan Reference Manual.

LinkScan for Unix. Reference Manual. Section 1

Introduction to LinkScan

LinkScan™ is an industrial-strength link checking and website management tool. It saves time and money by automating the quality assurance testing of virtually any website or web-based application.

LinkScan is built around applicable open systems standards. Hence it integrates easily with many other content development, management and testing applications as well as general purpose computer tools. It operates on all Microsoft Windows and Unix/Linux platforms and is professionally supported.

LinkScan users include Fortune 1000 companies such as Hewlett Packard, government agencies like NASA, as well as many smaller businesses.

New users will find LinkScan simple to install, configure and use, while more experienced users will appreciate the wide range of customization features built into the system. Together, these attributes make LinkScan ideal for:

Five LinkScan Editions

LinkScan is available in five different editions all based upon the same core technology:

The above descriptions are neither complete nor comprehensive. You must read the LinkScan License Agreement for a complete definition of the products and your other rights and obligations.

Using LinkScan

The steps involved in using LinkScan include:

  1. Installing and Configuring LinkScan for your environment
  2. Planning the specific test scenario(s) that you wish to execute
  3. Scanning the website to create a LinkScan Database
  4. Examining the results from the LinkScan Database

Each of these steps is described in this Reference Manual. However, we recommend that new users get a fast start by jumping to one of the following pages:

LinkScan for Unix. Reference Manual. Section 2

Essential LinkScan Concepts

This section introduces some important concepts and terms that are used throughout the remainder of this Reference Manual. These are:

  1. LinkScan Projects
  2. LinkScan Owners
  3. LinkScan Usernames
  4. Scanning Methods
  5. Documents and Links
  6. LinkScan Directory and File Structure
  7. LinkScan Configuration Files
  8. Perl Regular Expressions
  9. relative-path and relative-path-expression

2.1 LinkScan Projects

LinkScan is able to scan multiple websites. You may also scan the same website multiple times with different configuration options. In each case, LinkScan creates a unique and corresponding LinkScan Database containing the results of the analysis. Together, the configuration files and database constitute a LinkScan Project.

When multiple Projects are defined, users and administrators must select a Project before scanning. Users must also select a Project when viewing the results.

Each LinkScan Project is stored within a subdirectory of the main LinkScan installation directory.

For additional information concerning Projects, how to create them and how to scan them, see Basic Scanning.

2.2 LinkScan Owners

Within each Project, you may also configure multiple LinkScan Owners. Collections of HTML documents and other files are assigned between Owners in a variety of ways:

The LinkScan Owner concept enables individual content developers or workgroups to view results that pertain to their documents or areas of responsibility. LinkScan Owners are defined via the LinkScan Configuration Files, discussed below. By default, LinkScan will create and assign Owners as follows:

This enables users to browse the results selectively so that the reports are smaller and more relevant to their needs. They're also produced more rapidly.

2.3 LinkScan Usernames

LinkScan incorporates access controls that may be used to limit user access to LinkScan databases and results. These controls are not enabled by default.

When activated, users may be required to log in to the LinkScan system using a predefined LinkScan Username and associated password. The Username defines the Projects and Owners that an individual user is permitted to access.

Those wishing to enable these access control features should see LinkScan Access Controls.

2.4 Scanning Methods

LinkScan supports three different scanning methods:

Network HTTP scanning is generally the best method for sites with a large amount of dynamic content (.jsp files, .asp files, etc.). The File System Scanning method enables tracking of "orphaned" files (files to which nothing currently links) and is more appropriate for sites with limited dynamic content.

2.5 Documents and Links

The LinkScan software, and this document, both maintain a strong distinction between Documents and Links.

Hence an HTML file is a Document containing Links. Dynamically generated web pages, PDF and Flash Files as well as Import Files may also be considered Documents since LinkScan can examine those files for the presence of Links. Images (such as .gif and .jpg files) are not considered documents.

References to sites other than the one being scanned (External Links) are not documents either, since LinkScan does not examine the content of those files for the presence of Links.

2.6 LinkScan Directory and File Structure

The LinkScan system is made up of a number of different file types:

In a basic LinkScan installation these files are organized within the following directory structure:

2.7 LinkScan Configuration Files

LinkScan's operation is controlled by a number of different configuration files. When running LinkScan via the Windows Graphical User Interface, these files are somewhat invisible. However, they still control the execution of the program and you may find it useful to view the raw configuration files from time to time. On Unix systems, these files represent the primary method of configuring LinkScan. All of the files are formatted in plain ASCII text and may be viewed and modified using the editor of your choice (e.g. Windows Notepad, Unix vi, emacs, pico, nedit, et al).

The most important configuration files are:

This approach provides considerable flexibility. Global Settings in the Global Configuration File apply to all Projects. Project-specific commands in the Project Configuration File(s) may then override single-valued settings or supplement multi-valued settings.

Some additional configuration/control files are discussed elsewhere in this manual. They are used by LinkScan (i.e. do not delete them!) but it is rarely necessary for users to examine or modify them.

All of the configuration files include extensive comments. Comments are signified by the pound sign like this:


# This line contains only a comment

Realcommand = 1   # This comment could describe Realcommand
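The override and supplement rules described above can be sketched in a few lines of Python. This is an illustration only, not LinkScan's own parser. It assumes that lines of the form "Name = value" are single-valued, that lines of the form "Name value" (no equals sign) are multi-valued, and that everything from a pound sign to the end of the line is a comment. The function names and the sample values are hypothetical.

```python
import re

# Assumed line shapes: "Name = value" (single-valued), "Name value" (multi-valued)
SINGLE = re.compile(r"^(\w+)\s*=\s*(.*)$")
MULTI = re.compile(r"^(\w+)\s+(.+)$")

def parse_config(text):
    """Return (single, multi) dictionaries parsed from config text."""
    single, multi = {}, {}
    for raw in text.splitlines():
        # Assumption: '#' always starts a comment, even inside a value
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        m = SINGLE.match(line)
        if m:
            single[m.group(1)] = m.group(2).strip()
            continue
        m = MULTI.match(line)
        if m:
            multi.setdefault(m.group(1), []).append(m.group(2).strip())
    return single, multi

def effective_settings(global_text, project_text):
    """Project values override single settings and extend multi settings."""
    g_single, g_multi = parse_config(global_text)
    p_single, p_multi = parse_config(project_text)
    single = {**g_single, **p_single}
    multi = {name: list(values) for name, values in g_multi.items()}
    for name, values in p_multi.items():
        multi.setdefault(name, []).extend(values)
    return single, multi

global_cfg = r"""
# Global settings (hypothetical values)
Maxdirlevels = 10      # applies to every Project
Execute cgi-bin/
"""
project_cfg = r"""
Maxdirlevels = 5       # overrides the global value
Execute (?i).*\.(cgi|asp)$
"""

single, multi = effective_settings(global_cfg, project_cfg)
print(single)
print(multi)
```

In this sketch the Project value of Maxdirlevels replaces the global one, while the Project's Execute line is added to the global Execute list.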

2.8 Perl Regular Expressions

LinkScan incorporates a vast array of customization features many of which exploit the power of Perl Regular Expressions. For a description of Perl Regular Expressions on Unix systems, see man perlre. HTML versions are available at many locations including:

http://www.perl.com/doc/manual/html/pod/perlre.html

We also recommend the book Mastering Regular Expressions (a.k.a. the Owl Book) by Jeffrey E.F. Friedl, published by O'Reilly [ISBN: 1-56592-257-3].

2.9 relative-path and relative-path-expression

We make extensive reference to these terms in the customization sections of this manual and they are introduced here for your convenience.

Let us assume that we are scanning the website:

http://www.example.com/

An individual document within that website might be:

http://www.example.com/products/widget.html

LinkScan will refer to that page using its relative-path, which in this case, is:

products/widget.html

A relative-path-expression is a Perl Regular Expression that matches a relative-path. For example, all of the following will match our widget page:


products/widget.html      # Also matches products/widgetXhtml
products/widget\.html$    # Does not match anything else
(|.*/)widget\.html$       # Matches widget.html in any directory
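The three expressions above can be checked with a short script. This sketch uses Python rather than Perl; the two regular expression syntaxes agree for these patterns. It assumes that the expression is anchored at the start of the relative-path, which the leading (|.*/) in the third example suggests but which is not stated here. The helper name and the extra test paths are illustrative only.

```python
import re

def matches(expression: str, relative_path: str) -> bool:
    """Return True if the expression matches the relative-path.

    Assumption: matching is anchored at the start of the path,
    so re.match (not re.search) is used.
    """
    return re.match(expression, relative_path) is not None

expressions = [
    r"products/widget.html",
    r"products/widget\.html$",
    r"(|.*/)widget\.html$",
]
paths = [
    "products/widget.html",
    "products/widgetXhtml",
    "archive/old/widget.html",
]

for expression in expressions:
    hits = [p for p in paths if matches(expression, p)]
    print(f"{expression!r:28} matches {hits}")
```

Under that anchoring assumption, only the first expression accepts products/widgetXhtml, and only the third accepts widget.html in another directory.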

LinkScan for Unix. Reference Manual. Section 3

New LinkScan Installations

This section describes the pre-requisites for LinkScan and leads into step-by-step instructions for performing a new installation.

  1. Hardware Requirements
  2. Prerequisites
  3. Installation Step-by-Step

3.1 Hardware Requirements

LinkScan is supported on a wide variety of platforms including:

We do not recommend Windows 95/98/ME for scanning large websites of more than 5000 documents. Although LinkScan has been tested on websites of significantly greater size, performance and stability will be much improved when running under operating systems with a true multi-processing implementation such as Windows NT/2000/XP/Vista or Linux/Unix.

Disk and memory requirements depend almost exclusively on the size and nature of the website(s) to be analyzed. However, the following guidelines are intended to assist users with their capacity planning needs:

3.2 Prerequisites

To successfully install and configure LinkScan on your computer you must have:

  1. An appropriate version of Perl Version 5 installed on your computer. You may download a version suitable for your system via:

    http://www.elsop.com/perl/

  2. A copy of the LinkScan software and a LinkScan License Key. Both are available from:

    http://www.elsop.com/linkscan/dleval.cgi

3.3 Installation Step-by-Step

We recommend that new users get a fast start by jumping to one of the following pages:

LinkScan for Unix. Reference Manual. Section 4

Upgrading Existing LinkScan Installations

This section describes how to upgrade an existing LinkScan installation to LinkScan Version 12.0.

LinkScan for Unix. Reference Manual. Section 5

Basic Scanning with the Command Line Interface

This section describes how to create, configure and scan a LinkScan Project using the command line interface.

Before executing the LinkScan programs you must set the current working directory:

web:/> cd /usr/www/htdocs/linkscan/
web:/usr/www/htdocs/linkscan>

Creating a New Project

To create a new Project, simply execute the main LinkScan program (linkscan.pl) with the -newproject command line option:

web:/usr/www/htdocs/linkscan> perl linkscan.pl -newproject newproj

[...]

This Will Create the New LinkScan Project: newproj

The answers to the following questions are accepted verbatim without
validation. Please type carefully. <Control-C> to abort and start again.


Enter Homedir: 
Enter Home URL: http://www.example.com/index.html
Enter Organization: My Department
Enter Project Description: My First Test
** Status: Project newproj Created Successfully
web:/usr/www/htdocs/linkscan>

Configuring a Project

To configure a Project, simply edit the appropriate Project configuration file using your editor of choice:

web:/usr/www/htdocs/linkscan> vi ./newproj/linkscan.cfg

Note that lines starting with a pound sign (#) are comments.

In the simple case of scanning a website using the normal Network (HTTP) Scanning Method, you would only need to configure Homeurl with the URL to the root of the website, and Homefile with the filename (relative to server root) of the starting page. Be sure to leave Homedir blank since this will force LinkScan to use Network (HTTP) Scanning.

[...]
Homedir = 
Homeurl = http://www.example.com/
Mirrorurl = 
Homefile = index.html
Projectdesc = My First Test
Organization = My Department
[...]

This will scan the entire site www.example.com from its starting page, index.html. The Homeurl parameter should always be the "root" URL of the site being scanned. To start a scan in a sub-level area, add the path to the Homefile parameter. For example, using the same Homeurl as above, and setting:


Homefile = recommendations/external/index.html

would start the scan at:

http://www.example.com/recommendations/external/index.html
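The way Homeurl and Homefile combine into a starting URL can be reproduced with Python's standard library. This is only a sketch of the behaviour described above, not LinkScan's internal code, and the function name start_url is hypothetical.

```python
from urllib.parse import urljoin

def start_url(homeurl: str, homefile: str) -> str:
    """Join the site root (Homeurl) with the starting page (Homefile)."""
    return urljoin(homeurl, homefile)

print(start_url("http://www.example.com/", "index.html"))
print(start_url("http://www.example.com/", "recommendations/external/index.html"))
```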

Scanning a Project

To scan a Project, simply execute the main LinkScan program. You may specify the Project on the command line as shown below. Otherwise LinkScan will prompt you to select from the list of valid Projects.

web:/usr/www/htdocs/linkscan> perl linkscan.pl -project newproj

LinkScan Enterprise Version 12.0 Unix.

[...]

** Status: LinkScan is Starting Processes...
** Status: Started 3 Processes...
** Status: LinkScan is Scanning Internal Links...
Processing  URL: 
Processing  URL: about.html
Processing  URL: linkscan/
Processing  URL: linkscan/dleval.cgi
Processing  URL: linkscan/order.cgi
Processing  URL: linkscan/support.html
[...]

You have now completed a scan of the website and LinkScan has created a Database for that Project. Next you will want to examine the findings by following the steps described in Viewing the Results.

Command Line Options

Run the main LinkScan program with the -help option to see a short listing of the available command line switches:

web:/usr/www/htdocs/linkscan> perl linkscan.pl -help
LinkScan Version 12.0 Unix
Copyright 1997-2008 Electronic Software Publishing Corporation

USAGE: linkscan.pl  {-help} {-alllinks} {-fast} {-home pathname} {-http}
       {-newproject name} {-noexternal} {-noorphans} {-project name}
       {-quiet} {-remote URL} {-retest}

-help            Displays this message
-alllinks        Check all external links [Override: Maxgoodhours etc]
-fast            Use larger number of processes to speed testing
-home pathname   Specify starting page [Override: Homefile in linkscan.cfg]
-http            Use HTTP navigation [Equiv: Execute .* and -noorphans]
-newproject name Create a new LinkScan Project
-noexternal      Test internal links only [Default: Internal and External]
-noorphans       Disable checking for orphaned files
-project name    Select a LinkScan Project
-quiet           Reduce verbosity of progress/status messages
-remote URL      Specify Remote Site [Equiv: -http; Override: Homeurl/Homefile]
-retest          Repeat last test, rechecking only those links that failed
Detailed Help [Y/N]:n

LinkScan for Unix. Reference Manual. Section 6

Examining the Results

Once a Project has been scanned and a database created, a wide range of different reports are available.

This document describes those reports and how to view them interactively using a simple web browser-based interface. Note that a batch command-line interface is also available. See Section 12 of this manual.

To view the reports interactively:

Users will need to point a web browser at the LinkScan Main Menu which typically resides at:

http://your.server.name/linkscan/linkscan.cgi
or
http://your.server.name/cgi-bin/linkscan.cgi

The first time you access the results, you will be presented with the LinkScan Login and Preferences Menu. Simply click Login Now. No username is required unless you later decide to enable various LinkScan security features.

Once you have logged in, you will be presented with the LinkScan Main Menu.

Report Selection

You must select one of the individual Reports and submit the form by pressing Select Report.

A help page is available for each type of LinkScan Report. You may view the appropriate help page at any time by using the Help option on the context-sensitive LinkScan Toolbar. You may also use the [?] links on the LinkScan Main Menu, or the links provided in the summary table below.

The most frequently used reports have been organized in the left hand column; we suggest new users start there. Also, many of the reports incorporate hyperlinks to other reports. This means you can use a drill-down paradigm to view more detail associated with a specific problem or document. For example, some users may never explicitly select a LinkScan/QuickCheck Report. But they will likely view reports of that type by following the [Src] links from other reports.

Summary of Available Reports

  Report                            Description
  Project Summary Report            Summary statistics for the current project
  Summary of All Projects Report    Summary statistics for all configured projects
  Problem Documents Report          List documents containing potential problems
  Selected Status Codes Report      List errors of specific types
  Document Detail Report            List all/selected documents
  All Pages Linking To ... Report   Find pages that link to...
  Critical Errors Report            List most critical errors
  Orphaned Files Report             List orphaned files
  Detailed Errors Report            List all/selected errors
  External History Report           View history of an external link
  Changed Documents Report          Compare two scans of the current project
  Redirections Report               List a summary of redirections
  Search Documents Report           Ad hoc searching: document-centric
  System Configuration Report       Display current LinkScan configuration settings
  Search Links Report               Ad hoc searching: link-centric
  LinkScan/QuickCheck               View source code and detailed analysis of a document
  SiteMap Report                    Display LinkScan SiteMap
  LinkScan/TapMap                   Display LinkScan TapMap

Owner Selection

The LinkScan Main Menu may include an Owner Selection Box. If enabled, this option will allow you to select a sub-set of the website to which subsequent reports will apply.

In a default configuration, the Owner Selection Box will include entries for each top-level directory scanned, in addition to the special entry "All". This will be the default selection and subsequent reports will apply to the entire website scanned.

Note however, that the LinkScan Administrator may configure and customize the manner in which Owners are created. Hence your installation may appear and behave somewhat differently from that described herein.

SubMenu Selection

In many cases, when you submit the form by pressing Select Report you will be presented with a second menu of options. Initially, we suggest you accept the default options which have been carefully designed to produce excellent results in the vast majority of situations. However, to learn more, you may use the context-sensitive Help button on the LinkScan Toolbar at any time.

LinkScan Toolbar

Each of the LinkScan Menus and Reports includes a common LinkScan Toolbar. It contains a number of links:

 Main Menu   Preferences   Advanced   Help   Reference   HowTo   Card 

The Main Menu link will always return you to the LinkScan Main Menu.

The Preferences link will always take you to the LinkScan Login and Preferences Menu.

The Advanced link appears when appropriate and it will cause the current menu to be redrawn with additional options.

The Help link will display an appropriate section of the LinkScan Documentation depending upon the current context.

The Reference link will display the table of contents for the LinkScan Reference Manual.

The HowTo link will display a brief How To Guide with instructions for completing certain Common Tasks.

The Card link will display the LinkScan Quick Reference Card.

LinkScan for Unix. Reference Manual. Section 7

LinkScan Status and Error Codes

The following section describes each of the LinkScan Error and Status Codes. Each Status Code is assigned to one of six Severities:

  Symbol  Code  Severity        Explanation
  *       0     Unknown         LinkScan has not tested or was unable to test this link
  *       1     Error           LinkScan found a hard error on this link
  *       2     Possible Error  There may be a problem with this link. It should be retested at a later time
  *       3     Warning         LinkScan found something unusual about this link. Manual inspection highly recommended
  *       4     Advisory        This link is probably ok, but manual inspection recommended
  *       5     No Error        This is a good link

The Severity associated with any specific Error or Status Code may be customized by the LinkScan Administrator through the use of the Statuscode option.

Status codes in the range 0-99 are generated exclusively by LinkScan and generally refer to the status of local links (HTML files, Non-HTML files, etc.).

Status codes in the range 100-699 are defined exclusively by the HyperText Transfer Protocol.

Status codes in the range 800-3099 are generated exclusively by LinkScan and generally refer to Networking Problems (Failed DNS lookups, failure to connect to a remote server or timeouts) as well as some other LinkScan detected warning or advisory messages.
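The three ranges above lend themselves to a small lookup when post-processing results. The following Python sketch simply encodes the ranges as stated; the function name and the returned labels are hypothetical, and codes outside the documented ranges (for example 700-799) are reported as undocumented rather than guessed at.

```python
def code_origin(code: int) -> str:
    """Classify a status code by the range it falls in."""
    if 0 <= code <= 99:
        return "LinkScan: local link status"
    if 100 <= code <= 699:
        return "HTTP"
    if 800 <= code <= 3099:
        return "LinkScan: network or other"
    return "undocumented range"

for code in (7, 404, 903, 3001, 750):
    print(code, code_origin(code))
```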

* No Status (0)

* HTML File (1)

* Error: Bad HTML File (2)

* Non-HTML File (3)

* Error: Bad non-HTML File (4)

* Anchor (5)

* Error: Bad Anchor (6)

* Warning: Orphaned HTML File (7)

* Warning: Orphaned non-HTML File (8)

* Imagemap File (9)

* Error: Bad Imagemap File (10)

* Valid Mailto Link (11)

* Possible Error: Invalid Mailto Link (12)

* Warning: Missing / (13)

* Warning: Unprocessed SSI (14)

* PDF File (15)

* Error: Bad PDF File (16)

* Warning: No Closing /a (17)

* Error: Invalid Scheme (18)

* Advisory: No Alt/Height/Width (20)

* Flash File (21)

* Error: Bad Flash File (22)

* Text File (23)

* Error: Bad Text File (24)

* Javascript File (25)

* Error: Bad Javascript File (26)

* XML File (27)

* Error: Bad XML File (28)

* Error: HTML Syntax (99)

* Continue (100)

* Switching Protocols (101)

* Good URL (200, 201, 202, 203, 205, 206)

* Error: No Content (204)

* Error: Multiple Choices (300)

* Error: Moved Permanently (301)

* Advisory: Moved Temporarily (302)

* Error: Network/Server Error (303, 304)

* Error: Use Proxy (305)

* Error: Unused (306)

* Warning: Temporary Redirect (307)

* Error: Network/Server Error (400)

* Warning: Unauthorized (401)

* Warning: Payment Required (402)

* Error: Forbidden (403)

* Error: Not Found (404)

* Error: Method Not Allowed (405)

* Error: Not Acceptable (406)

* Error: Proxy Authentication Required (407)

* Possible Error: Request Timed Out (408)

* Error: Conflict (409)

* Error: Gone (410)

* Error: Length Required (411)

* Error: Precondition Failed (412)

* Error: Request Entity Too Large (413)

* Error: Request URI Too Large (414)

* Error: Unsupported Media Type (415)

* Possible Error: Server Error (500)

* Possible Error: Not Implemented (501)

* Possible Error: Bad Gateway (502)

* Possible Error: Service Unavailable (503)

* Possible Error: Gateway Timed Out (504)

* Possible Error: HTTP Version Not Supported (505)

* Possible Error: Network/Server Error (600, 601, 602, 603)

* Advisory: Skipped - Recently Tested (800)

* Possible Error: Skipped - Bad Server (801)

* Advisory: Skipped - FTP Limit (802)

* Advisory: Skipped - CGI Limit (803)

* Possible Error: No DNS Entry (900)

* Possible Error: DNS Timeout (901)

* Possible Error: Connect Error (902)

* Possible Error: Connect Timeout (903)

* Warning: Missing / (904)

* Warning: Probably OK (905)

* Warning: Contains an IP Address (906)

* Error: Multiple Redirections (907)

* Warning: Missing / (908)

* Error: Disconnected (909)

* Warning: Location Not Absolute (910)

* Error: Unsafe Character (911)

* Advisory: SSL Server Path Not Checked (912)

* Advisory: Simulated Redirect (913)

* Warning: Meta Redirect (914)

* Warning: Meta Loc not Absolute (915)

* Advisory: LDAP Server Query Not Checked (916)

* Error: No Headers Seen (917)

* Possible Error: Timeout Header (930)

* Possible Error: Timeout Body (931)

* Possible Error: Timeout Unknown (932)

* Warning: Body Truncated (933)

* Error: Error Creating Socket (990)

* Error: SSL Error (991)

* Error: Post Data Not Found (992)

* Error: Unknown (999)

* Error: FTP Error (1000)

* Error: Bad Syntax (2000)

* Error: SMTP No Such User (2001)

* Warning: SMTP Mailbox Full (2002)

* Possible Error: SMTP Failure (2003)

* Error: Errordoc Match (3000)

* Error: Errorbody Match (3001)

* Error: Profiler Match (3002)

LinkScan for Unix. Reference Manual. Section 8

Scheduling LinkScan on Unix Systems

The following example is provided to assist those users who wish to run LinkScan as a cron job. The crontab system is a standard Unix utility that enables jobs to be executed automatically according to some regular schedule. On most Unix systems, see man crontab or man 5 crontab for help.

  1. Save any existing configured cron jobs to a file (for example, cron.job) using the following shell command:

    crontab -l > cron.job
    
  2. Edit the file cron.job and append an additional entry for LinkScan containing something like:

    40 8 * * 0,1,2,3,4,5,6 /usr/linkscan/linkscan.cron
    

    This will execute /usr/linkscan/linkscan.cron at 08:40 each day. Adjust the pathname to linkscan.cron accordingly.

  3. Submit this to the crontab system with the following shell command:

    crontab cron.job
    

    You can check that it's been scheduled with:

    crontab -l
    
  4. Edit the linkscan.cron file -- the following example file is automatically installed in the LinkScan directory:

    #!/bin/sh
    
    # Set current working directory
    cd /usr/linkscan/
    
    # Execute LinkScan
    /usr/local/bin/perl linkscan.pl -project proja
    /usr/local/bin/perl linkscan.pl -project projb
    
    # Execute LinkScan/Dispatch (if required)
    /usr/local/bin/perl dispatch.pl -project proja -options
    
    # Execute command line reports (if required)
    # Must set environment variable for these
    # setenv linkscan linkscan
    export linkscan=linkscan
    /usr/local/bin/perl linkscan.cgi -project proja -options
    

    See the following for a summary of the available command line switches/options:

    Please note the following points:

LinkScan for Unix. Reference Manual. Section 9

File System Scanning and Orphaned Files

LinkScan incorporates the ability to examine the files on your local hard drive and interpret them in a manner very similar to a web server. This capability has two major applications:

Configuration is inherently more complex than for normal HTTP Scanning. In particular, you must configure the following items:

If you do not configure the File System Pathnames, LinkScan will automatically use HTTP Scanning. It will also disable the Orphaned File checking.

To combine Orphaned File checking with HTTP Scanning, configure the File System Pathnames (which enables orphan checking) and then set Http = 1.

This is best illustrated by example:

# Map the server root
# http://www.example.com/index.html  <==> /usr/www/htdocs/index.html

Homeurl = http://www.example.com/
Homedir = /usr/www/htdocs/
Homefile = index.html

# http://www.example.com/cgi-bin/    <==> /usr/www/cgi-bin/
# http://www.example.com/~username/  <==> /home/username/public_html/

Alias cgi-bin/ /usr/www/cgi-bin/
Alias ~([^/]+)/ /home/$1/public_html/

# Hide hidden files and directories from the Orphans Report

Noorphans (\.|.*/\.)

# The following are significant (but default) settings

Execute cgi-bin/             # Test cgi-bin/ via HTTP
Execute (?i).*\.(cgi|asp)$   # Test .cgi and .asp files via HTTP

Htmlfiles = html, shtml, htm
Mapfiles = map
Pdffiles = 
Flashfiles = swf
Defaultpages = index.html, index.shtml, index.htm, home.html, home.shtml, home.htm

Indexoptions = 0             # Disallow directory listings
Expandssi = 1                # Expand Server Side Includes
Autohttp = 0                 # Disable automatic HTTP retry
Maxdirlevels = 10            # Don't explore file system beyond 10 levels
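To make the URL-to-pathname mapping in this example concrete, the following Python sketch applies the Homedir and Alias settings shown above to a few relative-paths. It is an illustration, not LinkScan's implementation. It assumes that Alias expressions are anchored at the start of the relative-path and that the first matching Alias wins. The user name alice and the file names are hypothetical.

```python
import re

HOMEDIR = "/usr/www/htdocs/"
ALIASES = [
    (r"cgi-bin/", "/usr/www/cgi-bin/"),
    (r"~([^/]+)/", "/home/$1/public_html/"),
]

def to_file_path(relative_path: str) -> str:
    """Map a relative-path to a file system pathname."""
    for pattern, target in ALIASES:
        match = re.match(pattern, relative_path)
        if match:
            # Translate Perl-style $1 into Python's \g<1> before expanding
            template = re.sub(r"\$(\d+)", r"\\g<\1>", target)
            return match.expand(template) + relative_path[match.end():]
    # No Alias applies: the path lives under the server root
    return HOMEDIR + relative_path

for path in ("index.html", "cgi-bin/search.cgi", "~alice/notes/todo.html"):
    print(path, "->", to_file_path(path))
```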

On Unix systems only, the Alias directive supports the special !HOME expression:

Alias ~([^/]+)(/|$) !HOME/public_html/

A reference to ~someuser/ will be Aliased to !HOME/public_html/. Then, !HOME will be replaced by someuser's Home Directory, which is determined by a lookup in /etc/passwd.

Remote File Systems

In some cases, the file system directories containing the web site may reside on a physically different computer from LinkScan. In these cases, LinkScan will support Network File System pathnames (subject to any locally imposed security controls).

In other cases, the file system of the remote system may not be visible via the network, quite possibly for security reasons. LinkScan then cannot scan the remote computer using the File System Scanning Method, and you must use HTTP Scanning.

However, it is still possible to enable Orphaned File checking. In summary, you will need to execute a small, self-contained Perl program on the remote computer. It will assemble a "picture" of the file system and save it as a simple ASCII file. That file may be transferred to the LinkScan computer using FTP (or any other more secure technique) and used to perform the orphan analysis in lieu of direct access to the remote server.

  1. Fully configure the selected Project as if you were using File System Scanning on your local machine. However, when setting the pathname to the root of the target webserver (and any associated Aliases), use the pathname conventions applicable to the remote server.

  2. In the Project configuration file, force LinkScan to use normal HTTP Scanning by setting:

    
    Http = 1
    
  3. Set the Orphanfile setting in the Project configuration file to the full pathname of a file on your local computer. For example:

    
    Orphanfile = /usr/linkscan/someproject/orphans.list
    
  4. Transfer the following files to the remote server:

    
    /usr/linkscan/lsfind.pl
    /usr/linkscan/someproject/linkscan.cfg
    
  5. On the remote server, execute the lsfind.pl program:

    
    perl lsfind.pl orphans.list
    
  6. Transfer the orphans.list file back to the LinkScan machine.

  7. Initiate a scan of the target website in the normal manner. LinkScan will use the orphans.list file from the remote server in lieu of scanning the file system on the local server.
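The idea behind the orphan analysis in the steps above can be sketched as a set difference: files that exist on disk, minus files reached by following links, minus anything hidden by a Noorphans expression. The Python sketch below illustrates only that idea. It does not reproduce lsfind.pl or the format of orphans.list, neither of which is described here, and all function and file names are hypothetical.

```python
import os
import re
import tempfile

def list_files(root: str) -> set:
    """Collect every file under root as a path relative to root."""
    found = set()
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            found.add(os.path.relpath(full, root).replace(os.sep, "/"))
    return found

def find_orphans(on_disk, linked, noorphans=r"(\.|.*/\.)"):
    """Return files on disk that were never linked to and are not hidden."""
    hidden = re.compile(noorphans)
    return sorted(p for p in set(on_disk) - set(linked) if not hidden.match(p))

# Demonstration on a throwaway directory tree
with tempfile.TemporaryDirectory() as root:
    for rel in ("index.html", "about.html", "old/unused.html", ".htaccess"):
        path = os.path.join(root, rel)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()
    linked = {"index.html", "about.html"}   # paths reached by the scan
    orphans = find_orphans(list_files(root), linked)

print(orphans)
```

The default noorphans pattern is the one from the example configuration in this section, so the hidden file .htaccess is left out of the result.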

LinkScan for Unix. Reference Manual. Section 10

Import Scanning

The LinkScan Import function may be used to:

When processing a list of Links each URL is checked in turn and its status stored in the LinkScan database. When processing a list of Documents, each document and every link within that document is checked and its status stored.

The import function offers enormous flexibility. To use this feature, carry out the following steps:

  1. Prepare the Import File

    LinkScan will import a simple ASCII file of the following format:

    URL ... one or more tab characters ... URL-Description

    URLs may be absolute, or relative to the Home URL for the current server. The URL-Description is imported and carried through to the LinkScan Reports for identification purposes. You may use any ASCII string, for example a database record number.

    Import files may also include URLs using the extended LinkScan conventions for form submissions (GET, POST and Multi-Part POST). See How to Submit Forms.

    An alternative field separator may be specified by including a special command as the first line of the file:

    ## \s+

    The command starts with '##' in column one followed by a Perl expression that specifies the field delimiter. In the example above, '\s+' means one or more whitespace characters (tab or space).

    Lines with a '#' in column one, and blank lines, are ignored as comments.

    To use the Import Function, open the linkscan.cfg file for the appropriate Project, and edit the Importfile setting. Supply the full pathname to the prepared ASCII import file. For example:

    
    Importfile = /usr/home/linkscan/importfiles/test.txt
    

    Then select the import mode by changing the Import setting. Valid values are:

    Import = 0 Import mode disabled
    Import = 1 Import a list of links
    Import = 2 Import a list of documents
    Import = 3 Import a list of documents with caching disabled

    When importing a list of documents, LinkScan will by default check each document listed in the Import file and the links within it, but it will not follow those links to scan the rest of the site. Optionally, you may set Maxclicks to force LinkScan to execute a deeper scan. For example, with Maxclicks = 3, LinkScan will check the Import File, the documents listed in the Import File, and the children (but not the grandchildren) of those documents.

  2. Special Considerations

    LinkScan de-duplicates the list of links within an Import Document list. This means that LinkScan will validate each unique URL within the list only one time.

    However, you may force LinkScan to process an Import Sequence so that the same URL or document is checked more than once. This may be achieved by adjusting the URL's to make them appear unique. Note that this also provides a means by which to differentiate the test results for each step. Simply edit the URL's to make them unique by adding dummy name-value pairs to the query string of the URL's:

    http://www.example.com/cookie_sensitive?dummyseq=1
    [...]
    http://www.example.com/set_cookie
    [...]
    http://www.example.com/cookie_sensitive?dummyseq=2

    If the URL's already include a query string, simply append the additional parameter to the existing query and change:

    http://www.example.com/foo?name=value

    to:

    http://www.example.com/foo?name=value&dummyseq=1

    Normally, LinkScan maintains the status of each link in a cache while it scans a site. This dramatically improves performance since LinkScan does not need to re-check commonly used images and other components over and over. However, caching may be undesirable with some stateful sequences, for example when the same URL produces a completely different result before and after a cookie is set.

    In those situations, you may use a special option (Import = 3) which will force LinkScan to flush its cache after each imported document has been validated.
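
The Import file format described in step 1 can be sketched in Python as follows. This is an illustration only, not LinkScan's own code: the function name parse_import_file is hypothetical, and the sketch assumes that the '##' command is honoured only on the first line and that a line with no separator carries an empty description.

```python
import re

def parse_import_file(text):
    """Return (url, description) pairs from Import file text (illustrative only)."""
    lines = text.splitlines()
    delimiter = re.compile(r"\t+")      # default: one or more tab characters
    if lines and lines[0].startswith("##"):
        # A first line of the form '## expression' replaces the field separator.
        delimiter = re.compile(lines[0][2:].strip())
        lines = lines[1:]
    entries = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue                    # blank lines and '#' comment lines are skipped
        fields = delimiter.split(line.strip(), maxsplit=1)
        description = fields[1] if len(fields) > 1 else ""
        entries.append((fields[0], description))
    return entries

sample = (
    "# nightly checks\n"
    "http://www.example.com/\tHome page\n"
    "\n"
    "products/index.html\t\tRecord 42\n"
)
for url, description in parse_import_file(sample):
    print(url, "|", description)
```

Splitting on the first separator only keeps any later tabs or spaces inside the URL-Description, which suits free-text descriptions such as database record labels.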

LinkScan for Unix. Reference Manual. Section 11

Advanced and Custom Scanning

LinkScan incorporates many powerful customization features described below.

  1. How to control the scope of a scan
  2. How to handle authentication schemes
  3. How to scan additional pages and submit forms
  4. How to validate JavaScript and drop-down lists
  5. How to handle special Error documents
  6. How to manipulate URLs on-the-fly
  7. How to emulate different browser types
  8. How to remap different hosts
  9. How to assign documents to Owners
  10. How to process additional per-document data
  11. How to control the testing of external links
  12. Other miscellaneous customizations

Hint: We strongly recommend that you read Essential LinkScan Concepts before studying this section of the Reference Manual.

11.1 How to control the scope of a scan

You may use any combination of the following commands to include or exclude specific areas of the target website.


Exclude relative-path-expression
Exclude absolute-url-expression
Nofollow relative-path-expression
Onlyfollow relative-path-expression
Onlyinclude relative-path-expression
Maxlevels depth
Maxclicks depth

Exclude: The Exclude command may be used to completely ignore specific links. You may supply a relative-path-expression to exclude Internal Links, or an absolute-url-expression to exclude External Links.

Nofollow: The Nofollow command may be used to provide even finer control over LinkScan's behavior. When LinkScan encounters a link matching a Nofollow command, it will validate the link (and check for any <a name = ... > tags if appropriate). However, it will not test any links that lead from the target document.

For greater flexibility and completeness, the Onlyinclude and Onlyfollow commands are also supported.

Onlyinclude: is logically equivalent to "Exclude everything except".

Onlyfollow: is logically equivalent to "Nofollow everything except".

Maxlevels: A command such as Maxlevels = 3 will limit the depth of the scan to three directory levels under the server root.

Maxclicks: A command such as Maxclicks = 3 will limit the depth of the scan based on the number of clicks from the start of the scan. In order to more closely model the real user experience, LinkScan does not include clicks that result from following framesets or redirections.

The following rules of precedence apply when using multiple commands in combination:


Example 1:

Exclude http://www.domain.com/
Exclude test/

All links to "http://www.domain.com/" and all files in the local "test/" subdirectory will be ignored by LinkScan.


Example 2:

Nofollow user2/

LinkScan will check the links to files in the "user2/" directory, but it will not examine the content of any documents within the "user2/" directory or test any of the links contained within them.


Example 3:

Onlyfollow user1/

LinkScan will check the documents in the local "user1/" subdirectory and test the links to files in other local directories. However, LinkScan will not examine the content of any documents that lie outside of the local "user1/" directory or test any of the links contained within them.
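
The difference between Exclude and Nofollow in Examples 1 and 2 can be sketched in Python. This is a simplified model under stated assumptions: the function name classify_link is hypothetical, the expressions are treated as regular expressions matched from the start of the relative path, and Exclude is checked before Nofollow (an ordering chosen for the sketch, not taken from this manual).

```python
import re

def classify_link(path, exclude=(), nofollow=()):
    """Classify a relative path as 'ignore', 'validate-only' or 'validate-and-follow'."""
    if any(re.match(pattern, path) for pattern in exclude):
        return "ignore"                 # Exclude: the link is not tested at all
    if any(re.match(pattern, path) for pattern in nofollow):
        return "validate-only"          # Nofollow: the link is checked, its target is not parsed
    return "validate-and-follow"        # default: check the link and scan the target document

rules = {"exclude": ["test/"], "nofollow": ["user2/"]}
for path in ["test/draft.html", "user2/index.html", "user1/index.html"]:
    print(path, "->", classify_link(path, **rules))
```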

Dynamic content

On websites that incorporate a high proportion of dynamic content it may not be productive to test any or all scripts with a large number of query parameters or other variations. The following commands limit such testing.

Maxcgi: The maximum number of times any single URL should be probed with different query parameters. This prevents LinkScan from trying to validate a CGI script or dynamic page with a potentially infinite number of query parameters.
[Default: Maxcgi = 100 ]

Taglimit: The Taglimit command may be used to provide even finer control over the number of times clusters of URL's are probed. Syntax and example:


Syntax:

Taglimit relative-path-expression maxnumber

Example:

Taglimit scripts/DatabaseLookup.asp 20

LinkScan will only attempt to parse 20 documents matching the pattern "scripts/DatabaseLookup.asp". Any further links matching the specified pattern will be completely ignored.
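
The counting behavior described for Taglimit can be modelled with a small Python sketch. The class name TagLimiter is hypothetical, and the sketch assumes that the expression is matched from the start of the relative path and that the first matching rule decides the outcome.

```python
import re
from collections import Counter

class TagLimiter:
    """Count accepted links per pattern and reject those over the limit (illustrative)."""

    def __init__(self, limits):
        self.limits = [(re.compile(pattern), maximum) for pattern, maximum in limits]
        self.seen = Counter()

    def allow(self, path):
        for pattern, maximum in self.limits:
            if pattern.match(path):
                if self.seen[pattern.pattern] >= maximum:
                    return False        # limit reached: the link is ignored
                self.seen[pattern.pattern] += 1
                return True
        return True                     # no Taglimit rule applies to this link

limiter = TagLimiter([(r"scripts/DatabaseLookup\.asp", 20)])
results = [limiter.allow(f"scripts/DatabaseLookup.asp?id={i}") for i in range(25)]
print(results.count(True), "accepted,", results.count(False), "ignored")
```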

11.2 How to handle authentication schemes

Many websites include some form of access control or user authentication. The schemes described in this section are HTTP Authentication, NTLM Authentication and cookie-based authentication.

In the case of HTTP or NTLM Authentication, when a user attempts to access a protected area, their browser will present a challenge in the form of a pop-up dialog box that requires a username and password to be entered. In the case of cookie-based arrangements, the user is normally required to login by filling out an HTML form and submitting it.

HTTP Authentication

For sites that require HTTP Authentication, you must configure LinkScan with an appropriate Auth command:


Syntax:

Auth server-name "realm-name" username password

Examples:

Auth www.example.com "" guestuser xxxxxx
Auth app.example.com "Controlled Access" guestuser xxxxxx

You must include a realm-name (enclosed in double-quotes) but it may be empty. In that case, LinkScan will use the configured username and password for any realm on the target server. This is the recommended approach unless your server uses multiple realms with different access control rules for different portions of the website.

NTLM Authentication

Some Intranet websites utilize the proprietary and undocumented Microsoft NTLM protocol to authenticate users. LinkScan (on Windows systems only) may be configured to scan such sites.

Note: This may result in other minor artifacts in the results of the scan since LinkScan will use the Microsoft Windows implementation of the HTTP protocol versus the (stricter) native LinkScan implementation.

Cookie-based Authentication

HTTP access to some sites is controlled via authentication schemes requiring Cookies. For more information regarding Cookies see the Netscape Cookie Specification at http://wp.netscape.com/newsref/std/cookie_spec.html.

LinkScan will automatically accept and return all valid cookies received during the course of a scan. However, to gain access to the site, you may need to configure LinkScan to ensure that the appropriate cookies are set. This may be achieved by one of two techniques:

The submissions of a login form may be configured using the Extrahome command (described in the next section). However, you may optionally initialize LinkScan's collection of stored cookies (aka Cookie Jar) with one or more permanent Cookies by using the Cookie command:


Syntax:

Cookie server-name cookiename=cookievalue

Example:

Cookie www.elsop.com LinkScan=cookie_value;

Note: Do not enter space characters around the '=' character.

The server-name is the name of the server to be tested. For security reasons and in compliance with the applicable standards, LinkScan will only send the cookie when the specified server-name exactly matches the hostname portion of the requested URL. In this context, a server name and its corresponding IP address are considered to be different (consistent with all major browsers). The cookie names and values must be reverse engineered from your server code or "discovered" via your browser, either by enabling its "Prompt before accepting cookies" option or by examining the cookies it has stored on disk.
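
The exact-match rule for the server-name can be illustrated with a short Python sketch. The function name cookie_header_for is hypothetical, the IP address is a documentation placeholder, and the sketch assumes that hostnames are compared without regard to case.

```python
from urllib.parse import urlsplit

def cookie_header_for(url, cookie_rules):
    """Return a Cookie header value for url, or None if no server-name matches exactly."""
    host = urlsplit(url).hostname or ""        # hostname is returned in lower case
    pairs = [pair.rstrip(";") for server, pair in cookie_rules if server.lower() == host]
    return "; ".join(pairs) if pairs else None

rules = [("www.elsop.com", "LinkScan=cookie_value;")]
print(cookie_header_for("http://www.elsop.com/products/", rules))   # name matches
print(cookie_header_for("http://elsop.com/products/", rules))       # different hostname
print(cookie_header_for("http://192.0.2.10/products/", rules))      # IP address, not the name
```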

Hint 1: Sites with especially complex schemes (multiple levels of access control, subscription expirations etc.) might consider configuring their server and/or scripts to recognize a "super-user-cookie" specifically for testing purposes. This approach may also be used to trigger test points within server-based scripts and greatly improve the meaningful testability of complex dynamic content.

Hint 2: HTTP Authentication and Cookie related transactions are logged by LinkScan during the course of the scan. You may examine the following file to view the log: .../LinkScan/Projectname/data/linkscan.red

11.3 How to scan additional pages and submit forms

You may configure LinkScan to examine additional documents that would not normally be found during the scan and might otherwise be reported as orphaned files. The same technique may be used to submit forms on your website with specific data values for testing purposes. This is achieved with the Extrahome command:


Syntax:

Extrahome relative-path-expression

Examples:

Extrahome somedir/staticdoc.html
Extrahome cgi-bin/getscript.cgi?Var1=aaa&Var2=bbb

The second example above includes a query string and is therefore equivalent to a FORM submission using the GET method. In addition, LinkScan includes support for special conventions that allow users to specify FORM submission operations using the POST method, including the Multi-Part POST, frequently used to upload files from a client to the server.


Examples:

Extrahome cgi-bin/postscript.cgi??Name=Malcolm%20Hoar&Password=secret

Extrahome upload.cgi???(postedfile;C:\LinkScan10\post\test.jpg;image/jpeg)

Extrahome upload.cgi???Name1=Val1&(postedfile;/usr/home/test/test.jpg;image/jpeg)&Name2=Val2

Hint 1: Use the LinkScan Recorder to automatically capture the correctly constructed URL's.

Hint 2: When using the Extrahome command to submit a login form to provide access to a site, you may also need to configure LinkScan so that it doesn't immediately "click" any LOGOUT button which would invalidate the newly created session.
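
The Extrahome examples above can be read as follows; this Python sketch makes that reading explicit. It is an interpretation inferred from the examples, not a specification: it assumes that '?' introduces a GET query, '??' a POST body and '???' a Multi-Part POST body, and that a parenthesized part has the form (field-name;local-file;content-type). The function name parse_extrahome is hypothetical.

```python
def parse_extrahome(value):
    """Split an Extrahome value into (method, path, fields) (illustrative only)."""
    for marker, method in (("???", "MULTIPART-POST"), ("??", "POST"), ("?", "GET")):
        if marker in value:
            path, _, data = value.partition(marker)
            break
    else:
        return ("GET", value, [])               # plain document, nothing to submit
    fields = []
    for part in data.split("&"):
        if part.startswith("(") and part.endswith(")"):
            # Assumed layout of a file part: (field-name;local-file;content-type)
            name, filename, content_type = part[1:-1].split(";")
            fields.append(("file", name, filename, content_type))
        elif part:
            name, _, field_value = part.partition("=")
            fields.append(("field", name, field_value))
    return (method, path, fields)

print(parse_extrahome("cgi-bin/postscript.cgi??Name=Malcolm%20Hoar&Password=secret"))
print(parse_extrahome("upload.cgi???Name1=Val1&(postedfile;/usr/home/test/test.jpg;image/jpeg)&Name2=Val2"))
```

In practice the LinkScan Recorder remains the safer way to obtain correctly constructed values; the sketch is only meant to make the structure of the examples easier to read.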

11.4 How to validate JavaScript and drop-down lists

LinkScan may be configured to interpret the contents of drop-down lists as links to other pages. The HTML specification does not define a standard method for indicating that a drop-down list contains hyperlinks (as opposed to regular data). Hence LinkScan needs some other "cue" and may be triggered by pattern matching of attributes within the SELECT tag. Consider, for example, the following:


<select name="URLLIST">
<option value="/products/" Selected> Relative URL to Products
<option value="http://www.mydomain.com/services/"> Absolute URL to Services
</select>

To instruct LinkScan to treat the contents of the drop-down list as URL's, use the following command:


Selecturl URLLIST

LinkScan will examine all SELECT tags and look for a Regular Expression match on the NAME attribute. If the match is successful (URLLIST in this example) LinkScan will treat each OPTION tag within the list as a hyperlink and validate it accordingly.
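
The Selecturl behavior described above can be approximated in Python with the standard html.parser module. This is a sketch of the idea rather than LinkScan's implementation; the class name SelectUrlExtractor is hypothetical, and the sketch assumes the expression may match anywhere within the NAME attribute.

```python
import re
from html.parser import HTMLParser

class SelectUrlExtractor(HTMLParser):
    """Collect OPTION values from SELECT tags whose NAME matches a pattern (illustrative)."""

    def __init__(self, name_pattern):
        super().__init__()
        self.name_pattern = re.compile(name_pattern)
        self.in_matching_select = False
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)                # tag and attribute names arrive in lower case
        if tag == "select":
            name = attributes.get("name") or ""
            self.in_matching_select = bool(self.name_pattern.search(name))
        elif tag == "option" and self.in_matching_select and attributes.get("value"):
            self.urls.append(attributes["value"])

    def handle_endtag(self, tag):
        if tag == "select":
            self.in_matching_select = False

page = '''<select name="URLLIST">
<option value="/products/" Selected> Relative URL to Products
<option value="http://www.mydomain.com/services/"> Absolute URL to Services
</select>
<select name="COUNTRY"><option value="US"> United States</select>'''

extractor = SelectUrlExtractor("URLLIST")
extractor.feed(page)
print(extractor.urls)
```

The second SELECT tag in the sample (COUNTRY) is not matched, so its OPTION values are treated as regular data and not as links.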

LinkScan includes the ability to validate links contained within JavaScript code. A relatively simple pattern matching technique is used -- LinkScan does not contain a full JavaScript interpreter. This means that LinkScan may "miss" some links or find "false positive errors" especially if the code creates the hyperlink references dynamically at run-time. The following Scriptmatch and Scriptnomatch commands give excellent results in most cases. However, you can customize the matching rules by changing these expressions and/or adding new ones.


Scriptmatch = (\w+://\S+|\S+/$|\S+\?\S+|\S+\.([a-z]{2,3}|[js]?html?|Z)$)
Scriptnomatch = .*([\(\)\[\]\{\}\']|document\.\S+|\.(src|com)$)

Some JavaScript constructs may still produce false errors. You may force LinkScan to ignore complete script blocks that match a specified pattern. For example:


Scriptexclude function\s+ZoomWindow

The above command will force LinkScan to ignore script blocks that contain a definition for the ZoomWindow function.

11.5 How to handle special Error documents

Many websites are constructed with special user-friendly error pages, sometimes known as "custom-404 documents". Some servers will deliver the error document directly whereas others may force a redirection to a specific error document. In either case, an issue arises if your server delivers the error document with a 200 OK response code. LinkScan (or any other link checker) would not be able to detect the error condition.

A similar issue arises with some dynamically generated documents. For example, a Java applet may encounter a run-time error condition after it has already sent a 200 OK response code to the client.

Hence LinkScan supports two special commands that may be used to detect such conditions and force a 404 Not Found error, regardless of the HTTP response code produced by the server/application. The first, Errordoc, is intended for servers that force a redirection; it operates by pattern matching on the HTTP Location: header. The second, Errorbody, operates by pattern matching on the document body.


Syntax:

Errordoc pattern
Errorbody pattern

Examples:

Errordoc special/notfound\.html
Errorbody (?i).*runtime\serror

In the Errordoc example, LinkScan will report as 404 Not Found any URL that is redirected to http://your.server/special/notfound.html. In the Errorbody example, LinkScan will report as 404 Not Found any document that contains the string runtime error in the document body. Note that (?i) makes the pattern match case-insensitive.

Hint: The Errorbody pattern match is carried out on the entire document, including comments. Developers might consider including a standard error string within comment tags that may be used to trigger the Errorbody match.
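
The effect of the two example commands can be modelled in Python as a function that overrides the server's status code. The function name effective_status is hypothetical, and the sketch assumes that both patterns may match anywhere in the Location: header or document body.

```python
import re

ERRORDOC = re.compile(r"special/notfound\.html")
ERRORBODY = re.compile(r"(?i).*runtime\serror")

def effective_status(status, location=None, body=""):
    """Return 404 when a redirect target or a document body matches an error pattern."""
    if location and ERRORDOC.search(location):
        return 404                      # redirected to the custom error document
    if ERRORBODY.search(body):
        return 404                      # error text found in a '200 OK' document
    return status

print(effective_status(302, location="http://your.server/special/notfound.html"))
print(effective_status(200, body="<html><!-- Runtime Error in report module --></html>"))
print(effective_status(200, body="<html>Welcome</html>"))
```

The second call shows the technique suggested in the Hint: the error string sits inside an HTML comment and still triggers the match.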

11.6 How to manipulate URLs on-the-fly

One of the most powerful (and complex) customization features of LinkScan concerns the real-time manipulation of links during the course of the scan. This is typically used to control the testing of sites with complex dynamic content. The basic commands available are:


Sessionmatch expression
Substitute relative-path-expression expression
Substituteraw relative-path-expression expression
Substitutescript relative-path-expression expression

The Sessionmatch command is used to manipulate Session numbers. The Substitute command is used to perform transformations on resolved links. The Substituteraw command is used to perform transformations on unresolved links (i.e. the raw contents of a tag or tag attribute). The Substitutescript command is used to perform transformations on blocks of JavaScript code.

We shall consider a number of examples which may be adapted according to your specific needs.

Example 1

Consider a site that produces links such as:


http://www.example.com/page1.asp
http://www.example.com/page1.asp?Print

It is entirely possible that page1.asp has been designed in such a manner that it delivers the same basic content with minor variations in formatting depending upon the presence or absence of the Print query string. One might configure LinkScan with:


Substitute (.*\.asp)\?Print $1

Whenever LinkScan encounters a link matching the specified pattern it will make the substitution indicated before it tries to validate or follow that link. In this example, a link to:

http://www.example.com/page1.asp?Print

will immediately be transformed to:

http://www.example.com/page1.asp

Note, however, this is not the same as Excluding links which contain the Print query string; that would cause LinkScan to simply ignore the link. In this case, LinkScan will process the link but transform it on-the-fly during the scan.
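
The transformation in Example 1 can be reproduced with Python's re module, which is close to Perl for expressions of this kind. The helper apply_substitute is hypothetical; it translates the $1 notation into Python's group syntax and applies the rule once, without anchoring. Whether LinkScan anchors the expression or applies it repeatedly is not stated here, so treat those choices as assumptions.

```python
import re

def apply_substitute(pattern, replacement, url):
    """Apply one Substitute-style rule to a link (illustrative only)."""
    # Translate Perl-style $1, $2, ... into Python group references.
    python_replacement = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
    return re.sub(pattern, python_replacement, url, count=1)

rule = (r"(.*\.asp)\?Print", "$1")
for link in ["http://www.example.com/page1.asp?Print", "http://www.example.com/page1.asp"]:
    print(link, "->", apply_substitute(*rule, link))
```

Both links end up as the same URL, so the page is validated once rather than being ignored.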

Example 2

Next we will consider a significantly more complex scenario.


Sessionmatch .*&token=([^&]+)
Substitute (.*&token=)[^&]*(.*)$ $1!S$2

In this case, we use the special Sessionmatch command to capture and save the first value of the query parameter token that LinkScan sees. This is most likely some kind of session number assigned by the target server immediately following the submission of a login form. The Substitute command then instructs LinkScan to replace all subsequent values of token with the saved value (represented by the special parameter !S).

In this scenario, LinkScan ensures that the value of token can never change during the course of the scan from the originally assigned value.
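
The interaction between Sessionmatch and the !S parameter in Example 2 can be sketched in Python. The class name SessionPinner and the sample URLs are hypothetical; the two expressions are those from the example, and the sketch assumes they are matched from the start of the link.

```python
import re

class SessionPinner:
    """Remember the first token value seen and force it onto later links (illustrative)."""

    SESSIONMATCH = re.compile(r".*&token=([^&]+)")
    SUBSTITUTE = re.compile(r"(.*&token=)[^&]*(.*)$")

    def __init__(self):
        self.session = None             # plays the role of the saved value, !S

    def rewrite(self, url):
        if self.session is None:
            found = self.SESSIONMATCH.match(url)
            if found:
                self.session = found.group(1)
        if self.session is None:
            return url                  # no session captured yet: leave the link alone
        return self.SUBSTITUTE.sub(
            lambda m: m.group(1) + self.session + m.group(2), url, count=1)

pinner = SessionPinner()
print(pinner.rewrite("page.asp?id=1&token=abc123&view=full"))
print(pinner.rewrite("page.asp?id=2&token=zzz999"))
```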

Example 3

Next we'll consider a JSP site that produces URL's with the following structure:


http://www.example.com/content.jsp?A=123&B=456&C=789&D=XYZ

It may not be productive or efficient for LinkScan to scan all of the pages using every combination and permutation of values for the parameters A, B, C, D, etc. We can control that by manipulating the individual name-value pairs during the scan. For example:


Substitute (content\.jsp\?.*)&B=[^&]*(.*) $1&B=456$2
Substitute (content\.jsp\?.*)&C=[^&]*(.*) $1$2
Taglimit content\.jsp\?.*&D= 20

The first command fixes the value of B=456. Whatever value the parameter B takes on during the scan, LinkScan will force the value back to 456. The second command deletes any references to the C parameter from every link that it finds. We have also included the third Taglimit command; this will cause LinkScan to completely ignore the twenty-first and subsequent links that include a D parameter. In other words, in this case, we only want to test a representative sample (20) of links that include a D parameter.

Example 4

For our next example, we shall consider a site that generates pages containing some links with the following structure:


http://www.example.com/cgi-bin/GenerateFrame?Referer=abc&Link=http%3A%2F%2Fwww.yahoo.com%2F

Rather than linking directly to Yahoo!, this page links to a script that generates a frameset that includes the referenced page. In a default configuration, LinkScan will happily follow the link, validating the frameset and the ultimate link to Yahoo!. However, it may not be productive to do that for potentially thousands of links. Furthermore, in the (extremely unlikely) event that the link to http://www.yahoo.com/ was broken, the error would appear in one of the GenerateFrame documents and not the original referring document. In order to repair that link, one would have to backtrack through the frameset to locate the original source of the trouble.

Hence we can apply more Substitute magic:


Substitute cgi-bin/GenerateFrame.*&Link=([^&]+).* !U$1

This command will extract the value of the Link= parameter, and the special !U token instructs LinkScan that the string needs to be un-encoded. So the original link:

http://www.example.com/cgi-bin/GenerateFrame?Referer=abc&Link=http%3A%2F%2Fwww.yahoo.com%2F

is transformed on-the-fly to:

http%3A%2F%2Fwww.yahoo.com%2F

and then decoded to:

http://www.yahoo.com/

And this means LinkScan can validate the link to Yahoo! directly without checking the GenerateFrame script many, many times. Furthermore, any errors will be flagged against the original document (and not one or more steps removed).
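
The extract-and-decode step of Example 4 can be reproduced in Python. The function name unwrap_frame_link is hypothetical; following the walk-through above, the sketch assumes the whole link is replaced by the decoded value of the Link= parameter.

```python
import re
from urllib.parse import unquote

RULE = re.compile(r"cgi-bin/GenerateFrame.*&Link=([^&]+).*")

def unwrap_frame_link(url):
    """Replace a GenerateFrame link by the un-encoded value of its Link= parameter."""
    found = RULE.search(url)
    if not found:
        return url                      # not a GenerateFrame link: leave it unchanged
    return unquote(found.group(1))      # the role played by the !U token

link = "http://www.example.com/cgi-bin/GenerateFrame?Referer=abc&Link=http%3A%2F%2Fwww.yahoo.com%2F"
print(unwrap_frame_link(link))
```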

Example 5

For our next example, we include for illustration the complete configuration for a large and very complex real-world dynamic site:


# Set the CGI limit to be very large
# Include all file types on the Map

Maxcgi = 10000
Mapinclude .*

# Force &A=B and insert it immediately after the '?'

Substitute (cgi-bin.*[&\?])A=[^&=]*&*(.*) $1$2
Substitute (cgi-bin.*\?)(.*) $1A=B&$2

# Discard null and undefined values

Substitute (cgi-bin.*)&B=(null|undefined)(.*) $1$3
Substitute (cgi-bin.*)&C=(null|undefined)(.*) $1$3
Substitute (cgi-bin.*)&D=(null|undefined)(.*) $1$3
Substitute (cgi-bin.*)&R=(null|undefined)(.*) $1$3

# For 'category', take the &C= if present, otherwise the &B=

Substitute (cgi-bin/bv/scripts/category.*\?A=B).*?(&C=[^&=]*).* $1$2
Substitute (cgi-bin/bv/scripts/category.*\?A=B).*?(&B=[^&=]*).* $1$2

# For 'content', take the &D= or &R= if present (call it &D=). Otherwise take the &B=

Substitute (cgi-bin/bv/scripts/content.*\?A=B).*?&[DR]=([^&=]*).* $1&D=$2
Substitute (cgi-bin/bv/scripts/content.*\?A=B).*?(&B=[^&=]*).* $1$2

# For 'frame', take the &D= or &R= if present (call it &D=). Otherwise take the &B=

Substitute (cgi-bin/bv/scripts/frame.*\?A=B).*?&[DR]=([^&=]*).* $1&D=$2
Substitute (cgi-bin/bv/scripts/frame.*\?A=B).*?(&B=[^&=]*).* $1$2

# For 'mailing...', take the &R=

Substitute (cgi-bin/bv/scripts/mailing.*\?A=B).*?(&R=[^&=]*).* $1$2

# For 'contact', take the &B=, &C= and &Comments

Substitute (cgi-bin/bv/scripts/contact.*\?A=B).*?(&B=[^&=]*).*?(&C=[^&=]*).*?(&Comments=[^&=]*).* $1$2$3$4

# Mark redirects to Error page as 404
# Mark documents containing 'Error Code:' as 404

Errordoc cgi-bin/bv/scripts/error.jsp
Errorbody Error\s+Code:[^\n<]*

# Hide some frequently arising errors

Noforms = 1
Exclude images/arrow.gif

Example 6

Next we will consider a reference to a JavaScript function:


<a href="javascript:MyFunction(4,5,6);">

The following Substitutescript command:


Substitutescript .*:MyFunction\((\d+),(\d+),(\d+)\) '/somepage.jsp?Par1=$1&Par2=$2&Par3=$3'

will transform the function call into the following link which will then be validated/processed by LinkScan.


/somepage.jsp?Par1=4&Par2=5&Par3=6

Synthesizing Additional Links

The Substitute commands may be used to modify existing links on-the-fly. However, a variation of this, the Insertlink command, may be used to insert additional links into specified documents in order to achieve a specific test coverage. Again, it is best illustrated by example:


Insertlink .*complex\.jsp\?.*SPVAR= -
Insertlink (.*complex\.jsp\?.*) /$1&ALTMODE=1 +

As each document is scanned, LinkScan will process all Insertlink commands (in the order specified). The URL of the scanned document is matched against the first parameter of each Insertlink command. In the case of the first example above, a link to:

complex.jsp?VAR=1&SPVAR=2

will match the expression and LinkScan will abort all Insertlink processing for this document (signified by the minus character).

However, a link to:

complex.jsp?VAR=1

does not match the expression. Processing will continue to the second command. This does match the expression and LinkScan will insert a link into this document (signified by the plus character). Hence, when LinkScan processes:

complex.jsp?VAR=1

It will insert into that document, the following link:

complex.jsp?VAR=1&ALTMODE=1
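
The order-dependent processing of the two Insertlink commands can be modelled in Python. The function name inserted_links is hypothetical; the sketch assumes each expression is matched from the start of the document URL, and it keeps the leading '/' from the second command, so the inserted link is printed in root-relative form.

```python
import re

def inserted_links(document_url, rules):
    """Return the links that Insertlink-style rules would add to a document (illustrative)."""
    links = []
    for pattern, template, action in rules:
        found = re.match(pattern, document_url)
        if not found:
            continue                    # no match: try the next command
        if action == "-":
            break                       # minus: abort Insertlink processing for this document
        # plus: build the new link, expanding $1, $2, ... from the match
        links.append(re.sub(r"\$(\d+)", lambda m: found.group(int(m.group(1))), template))
    return links

rules = [
    (r".*complex\.jsp\?.*SPVAR=", None, "-"),
    (r"(.*complex\.jsp\?.*)", "/$1&ALTMODE=1", "+"),
]
print(inserted_links("complex.jsp?VAR=1&SPVAR=2", rules))
print(inserted_links("complex.jsp?VAR=1", rules))
```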

Hint: Clearly, the Substitute commands require a good working knowledge of Perl Regular Expressions. If you need assistance, the LinkScan engineers will be happy to help. Please write to linkscan@elsop.com describing, in as much detail as possible, the transformations you are seeking to achieve.

11.7 How to emulate different browser types

Most web browsers advertise their identity by including a User-Agent header with every request that they make. LinkScan also sends a User-Agent header. For example, the versions of Netscape Navigator, Microsoft Internet Explorer and LinkScan installed on the writer's computer send, respectively:


User-Agent: Mozilla/4.08 [en] (WinNT; I ;Nav)
User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)
User-Agent: LinkScan Enterprise/12.0 Windows

Some websites are constructed in a manner that is browser sensitive. They may, for example, deliver customized pages depending on the user's browser type. Hence LinkScan may be customized to emulate different browser types using the Extraheader command:


Syntax:

Extraheader literal-header-string

Example:

Extraheader User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)

In this example, LinkScan will advertise itself as Microsoft Internet Explorer version 5.5 running under Windows 2000.

In fact, the Extraheader command may be used to add any arbitrary HTTP headers to every request that LinkScan sends. A common application involves those servers which look for a language preference in the HTTP headers in order to deliver pages in the appropriate language. For example, the following command instructs LinkScan to include an English Language preference header with each request:


Extraheader Accept-Language: en
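
To see what such headers look like from the client side, the following Python sketch attaches the two Extraheader strings above to a request object using the standard urllib.request module. No request is actually sent, and the helper name build_request is hypothetical.

```python
from urllib.request import Request

def build_request(url, extraheaders):
    """Build a request carrying each literal 'Name: value' header string (illustrative)."""
    request = Request(url)
    for line in extraheaders:
        name, _, value = line.partition(":")
        request.add_header(name.strip(), value.strip())
    return request

request = build_request("http://www.example.com/", [
    "User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)",
    "Accept-Language: en",
])
for name, value in request.header_items():
    print(f"{name}: {value}")
```

Note that urllib stores header names with only the first letter capitalized (for example User-agent); HTTP header names are case-insensitive, so this does not affect the server.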

11.8 How to remap different hosts

Sometimes a single website may contain links such as:


http://www.example.com/
http://www2.example.com/

where www.example.com and www2.example.com resolve to the same host IP address. However, LinkScan would consider www2.example.com to be an External Link and not part of the www.example.com Project. The Hostalias command may therefore be used to assign more than one name to the current server. Syntax and example:


Syntax:

Hostalias from-server-url to-server-url

Example:

Hostalias http://www2.example.com/  http://www.example.com/

A similar issue arises when scanning development or staging servers. For example, you may wish to scan the site:


http://staging.example.com/

but the site may contain one or more absolute links to http://www.example.com/. In this case, you can use the Mirrorurl command.


Syntax:

Mirrorurl absolute-url

Example:

Homeurl = http://www.example.com/
Mirrorurl = http://staging.example.com/

In this case, LinkScan will resolve all links as if it were scanning http://www.example.com/. However, all actual HTTP requests will be directed to http://staging.example.com/. This provides a convenient mechanism for scanning development and staging copies of a production website.
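
The two-step behavior of Mirrorurl (resolve against the production address, then send the request to the staging address) can be sketched in Python. The function names resolve and request_url and the sample paths are hypothetical.

```python
from urllib.parse import urljoin

HOMEURL = "http://www.example.com/"
MIRRORURL = "http://staging.example.com/"

def resolve(base_document, link):
    """Resolve a link against the logical (Homeurl) address of the containing document."""
    return urljoin(base_document, link)

def request_url(logical_url):
    """Map a logical URL under Homeurl to the mirror that actually receives the request."""
    if logical_url.startswith(HOMEURL):
        return MIRRORURL + logical_url[len(HOMEURL):]
    return logical_url                  # links outside Homeurl are requested unchanged

logical = resolve("http://www.example.com/docs/index.html", "../images/logo.gif")
print(logical, "is fetched from", request_url(logical))
```

Under this model an absolute link to http://www.example.com/ found on the staging copy counts as part of the site being scanned, and the request for it still goes to the staging server.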

11.9 How to assign documents to Owners

You may define the ownership of any given document or file in one of several ways. Ownership directives are evaluated in the order specified with the last match taking precedence. Note that the file ownership attribute is case sensitive.

  1. By the Unix File System ownership attribute. Note: this is not supported on Windows systems

  2. By the Defaultowner command. The syntax for the Defaultowner command is:

    Defaultowner owner-name

  3. By pattern matching with one or more Owner commands. The syntax for the Owner command is:

    Owner relative-path-expression owner-name
    OR
    Ownerq relative-path-expression owner-name

    The Owner command operates on the pathname portion of the URL and does not process any query string (following a "?" character). The Ownerq command operates on the entire URL including any query string.

    LinkScan also supports a special variation of the Owner command. This will automatically assign every file an owner-name based on the name of the directory in which it resides. The syntax is:

    Owner *integer

    The default setting (Owner *1) will assign each document to an Owner based on the top-level directory name (i.e. under "www root"). A setting of Owner *2 will cause LinkScan to assign Ownership based on the first two directory names. For example:

    http://www.example.com/first/second/third/index.html

    Will be assigned to the Owner first_second.

  4. By using preexisting META tags in your HTML documents. For example, if your existing documents already contain tags of the form:

    <META NAME="CONTENT_OWNER" CONTENT="Malcolm Hoar">

    You may set the Owner to 'Malcolm Hoar' by configuring a suitable pattern. e.g.:

    Ownertags = ^meta\s+name\s*=\s*"content_owner"\s+content\s*=\s*"([^"]+)

  5. Finally, once an Owner has been assigned to the file or document, you may manipulate the Owner string with a simple pattern substitution:

    Owneralias .*?([a-zA-Z0-9]+)[\s\.\)]*$ \L$1

    This example would take the string 'Malcolm Hoar' and convert the ownership to 'hoar'. This technique may be used to deal with synonyms such as 'M. Hoar.', 'Malcolm C Hoar '.
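
The directory-based assignment (Owner *integer) and the Owneralias substitution can be illustrated in Python. The function names are hypothetical, and the fallback value used for files that sit directly under the server root is an assumption of the sketch. The alias expression is the one shown in item 5, with Perl's \L modifier replaced by a call to lower().

```python
import re
from urllib.parse import urlsplit

def owner_from_directories(url, depth=1, default="unknown"):
    """Build an owner name from the first `depth` directory names of a URL (illustrative)."""
    parts = urlsplit(url).path.split("/")[:-1]      # drop the file name
    directories = [part for part in parts if part][:depth]
    # `default` stands in for whatever owner applies to files in the root directory.
    return "_".join(directories) if directories else default

def owner_alias(owner):
    """Reduce an owner string to its last word in lower case, as in the Owneralias example."""
    found = re.match(r".*?([a-zA-Z0-9]+)[\s\.\)]*$", owner)
    return found.group(1).lower() if found else owner

print(owner_from_directories("http://www.example.com/first/second/third/index.html", depth=2))
for name in ["Malcolm Hoar", "M. Hoar.", "Malcolm C Hoar "]:
    print(repr(name), "->", owner_alias(name))
```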


Example:

Defaultowner elsop         # Set default
Owner *1                   # Assign Owner based on top level dir ...
Owner wrc/humor/ humor     # But, make this subdir look like top-level
Owner .*\.cgi$ webmaster   # And give all *.cgi files to webmaster

When using LinkScan Dispatch to create reports for delivery by Electronic mail, you may define associations between Owners and Addresses with the Mailalias command. The syntax is:

Mailalias expression list-of-addresses

list-of-addresses may be a comma separated list of addressees if you wish to distribute the report to multiple recipients. Use Mailalias owner-name null to skip a specific Owner.


Example:

Defaultowner elsop         # Set default
Owner *1                   # Assign Owner based on top level dir ...
Owner wrc/humor/ humor     # But, make this subdir look like top-level
Owner .*\.cgi$ webmaster   # And give all *.cgi files to webmaster

Mailalias elsop            malch@elsop.com, ken@elsop.com
Mailalias links            ken@elsop.com
Mailalias linkscan         malch@elsop.com
Mailalias wrc              ken@elsop.com
Mailalias humor            ken@elsop.com
Mailalias test             null

If no Mailaliases are defined, Dispatch will address the reports to Ownername @ Mailhost

11.10 How to process additional per-document data

Facilities are provided to extract additional data from each document scanned, store those data in the LinkScan database and create various reports. The additional data are typically taken from the META tags in each HTML document.

Supported commands are provided for data extraction, substitution/manipulation and formatting:


# Userdata [123] match-expression expression
# Userdatafmt [123] [DHLTX] integer[LRC] caption
# D=date; H=hot links; L=link; T=truncate to format; X=normal
# Userdatasub [123] expression expression

The following example illustrates the use of these commands to extract and process an employee badge number from document META tags:


Userdata 1 (?i)<meta\s[^>]*employee\s*=\s*"\s*(#?\d+)\s*" $1
Userdatasub 1 #?(\d+) $1
Userdatafmt 1 X 6R Badge-Number

In the above example, we use the fir