Planning a LinkScan Project

Introduction

All basic LinkScan operations are carried out on Projects. The essential steps are:

Create and Plan a Project: This step provides LinkScan with the basic specification and definition of the test you wish to perform. You will always need to supply the URL of the website you are seeking to test. Often you will want to define the test conditions more precisely by selecting from the large number of available options and customizations.
Scan a Project: During this step, LinkScan actually executes the test scenario defined by the Project Plan.
Examine the Results of a Project: Finally you will wish to examine and analyze the results of the test.

These basic steps correspond to the Plan, Scan, and Exam buttons on the main LinkScan window.

In addition, New and Remove buttons are provided to create a new Project or permanently delete an old project that is no longer needed.

When creating a new Project, you may either create a brand new (empty) Project, or create a new Project based upon (cloned from) an existing Project. The latter technique provides a simple method to define multiple test scenarios with minor variations between them.

The remainder of this document describes the options available on the Project Planning property sheet dialog. You may press the property sheet Help button at any time to display the applicable section of this document.

Basic Settings Tab

Project Description/Organization: You may enter simple text descriptions for this project and they will be displayed on the associated reports.
URL of website to scan: Enter the URL of the website you wish to scan.
When you enter a URL such as:
```
http://www.example.com/Products/index.html
```
LinkScan will automatically enter Products/ in the text box below. You may use the associated Radio Buttons to control the scope of the scan.
- Select Full Site (default) to scan the entire site.
- Select Onlyfollow if you wish to completely scan the Products/ directory. LinkScan will validate all of the links leading to other directories within the site. However, it will not follow them and scan those other areas of the website.
- Select Onlyinclude if you wish to scan the Products/ directory without following, or even checking, those links that lead to other areas of the website.
Scanning Method: We recommend that you use the Normal HTTP Scanning method, at least initially. This method is frequently the most appropriate and is also the simplest to configure.
Server Uses Case Sensitive Pathnames: This tells LinkScan whether to treat index.html and INDEX.HTML, for example, as a single file or two different files. In general, this box should be checked when scanning websites hosted on Unix servers and unchecked when scanning websites hosted on Windows servers.
Processes: You may configure the number of processes that LinkScan will use for scanning. These options control the maximum number of simultaneous network requests LinkScan will initiate. Using more processes will cause the scan to complete more rapidly but also uses more CPU, memory and network resources. In general, we recommend you use the default values. Users with large numbers of external links and a fast network connection (i.e. not dial-up) may increase the number of External Processes to accelerate the completion of the scan.
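
As an illustration only, the following Python sketch shows one way the directory prefix (Products/ in the example above) could be derived from a start URL, and how the case-sensitivity setting changes whether two pathnames count as the same file. The function names `directory_prefix` and `path_key` are hypothetical and are not part of LinkScan.

```python
from urllib.parse import urlsplit


def directory_prefix(url: str) -> str:
    """Return the directory part of the URL path, relative to the site root."""
    path = urlsplit(url).path            # e.g. "/Products/index.html"
    directory = path.rsplit("/", 1)[0]   # e.g. "/Products"
    if not directory.strip("/"):
        return ""                        # the document lives at the site root
    return directory.lstrip("/") + "/"


def path_key(path: str, case_sensitive: bool) -> str:
    """Key used to decide whether two pathnames refer to the same file."""
    # Assumption: a case-insensitive server is modelled by lowercasing.
    return path if case_sensitive else path.lower()
```

With case sensitivity turned off, `path_key("INDEX.HTML", False)` and `path_key("index.html", False)` produce the same key, so the two names would be treated as a single file.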

Scope Tab

The settings on this tab are used to control the scope of the scan -- these rules are applied in addition to any Onlyinclude/Onlyfollow rule you may define on the Basic Settings Tab.

Dir Levels: Limit the depth of the scan based on the number of directory levels (tracks the number of '/' characters in the path portion of the URL).
Clicks: Limit the depth of the scan based on the number of user clicks starting from the home page.
Max Queries per Script: The maximum number of times a single page will be tested with different combinations of query strings (portion of URL following the '?' character). For example:

index.jsp?n=1
index.jsp?n=2
index.jsp?n=3
Check Form Actions: Validate links in <FORM ACTION=...> tags. Since LinkScan does not automatically know what user data to submit with the form, it tests the NULL case. Disable when scanning servers with poor data validation that produce 500 or similar errors.
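
The Dir Levels and Max Queries per Script limits can be pictured with a short Python sketch. It is a rough model under stated assumptions, not LinkScan's implementation: `dir_levels` counts every '/' in the path (including the leading one), and the hypothetical `QueryLimiter` class counts distinct query strings per script.

```python
from urllib.parse import urlsplit


def dir_levels(url: str) -> int:
    """Count the '/' characters in the path portion of the URL."""
    return urlsplit(url).path.count("/")


class QueryLimiter:
    """Allow at most max_queries distinct query strings per script."""

    def __init__(self, max_queries: int) -> None:
        self.max_queries = max_queries
        self.seen = {}  # (host, path) -> set of query strings already tested

    def allow(self, url: str) -> bool:
        parts = urlsplit(url)
        if not parts.query:
            return True                    # no query string, nothing to limit
        queries = self.seen.setdefault((parts.netloc, parts.path), set())
        if parts.query in queries:
            return True                    # this combination was already counted
        if len(queries) >= self.max_queries:
            return False                   # limit reached for this script
        queries.add(parts.query)
        return True
```

With a limit of 2, the first two of the three index.jsp URLs shown above would be tested and the third would be skipped.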

In addition, you may specify an unlimited number of inclusion/exclusion rules based on pattern matching within the URL of the links.
Exclude: Completely ignore links that match the specified pattern.
Nofollow: Check links that match the specified pattern but do not follow those links.
Onlyinclude: Equivalent to Exclude everything that does not match the specified pattern.
Onlyfollow: Equivalent to Nofollow everything that does not match the specified pattern.

See Regular Expressions for a discussion of the pattern matching rules and their syntax together with some common examples.
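
The four rule types can be summarized as a decision about two questions for each link: is it checked, and is it followed? The Python sketch below is one possible reading of the definitions above. The `classify` function is hypothetical, it uses Python's `re` syntax rather than LinkScan's own pattern rules, and it assumes that every rule in the list is applied in turn.

```python
import re
from typing import List, Tuple

Rule = Tuple[str, str]  # (rule type, regular expression)


def classify(url: str, rules: List[Rule]) -> Tuple[bool, bool]:
    """Return (check, follow) for a link after applying every rule."""
    check, follow = True, True
    for kind, pattern in rules:
        matched = re.search(pattern, url) is not None
        if kind == "Exclude" and matched:
            check = follow = False      # ignore the link completely
        elif kind == "Nofollow" and matched:
            follow = False              # validate the link but go no further
        elif kind == "Onlyinclude" and not matched:
            check = follow = False      # Exclude everything that does not match
        elif kind == "Onlyfollow" and not matched:
            follow = False              # Nofollow everything that does not match
    return check, follow
```

For example, with the rules `[("Onlyfollow", r"/Products/"), ("Exclude", r"\.zip$")]`, a link into /Support/ is checked but not followed, while a .zip file is ignored altogether.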

Root Tab

The Root Tab is enabled when using File System Scanning.

Select the folder on your hard drive that corresponds to the root of the website you wish to scan (i.e. containing the website documents). This may be a folder on a Network Drive.

In some cases, the file system directories containing the web site may reside on a physically different computer from LinkScan. In these cases, LinkScan will support Network File System pathnames (subject to any locally imposed security controls).
In other cases, the file system of the remote system may not be visible via the network, quite possibly for security reasons. LinkScan will be unable to scan the remote computer using the File System Scanning Method. You must use Network (HTTP) Scanning.

However, it is still possible to enable Orphaned File checking. In summary, you will need to execute a small, self-contained Perl program on the remote computer. It will assemble a "picture" of the file system and save it as a simple ASCII file. That file may be transferred to the LinkScan computer using FTP (or any other more secure technique) and used to perform the orphan analysis in lieu of direct access to the remote server.
1. Fully configure the selected Project as if you were using File System Scanning on your local machine. However, when setting the pathname to the root of the target webserver (and any associated Aliases), use the pathname conventions applicable to the remote server.
2. Transfer the following files to the remote server:
```
C:/LinkScan/lsfind.pl
C:/LinkScan/someproject/linkscan.cfg
```
3. On the remote server, execute the lsfind.pl program:
```
perl lsfind.pl orphans.list
```
4. Transfer the orphans.list file back to the LinkScan machine.
5. Initiate a scan of the target website in the normal manner. LinkScan will use the orphans.list file from the remote server in lieu of scanning the file system on the local server.
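
Conceptually, orphan analysis compares the files present on disk with the files reached through links. The Python sketch below illustrates that idea only. It does not reproduce lsfind.pl or the format of orphans.list, and the names `list_files` and `find_orphans` are hypothetical.

```python
import os
from typing import Iterable, Set


def list_files(root: str) -> Set[str]:
    """Return every file under root as a '/'-separated path relative to root."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            relative = os.path.relpath(os.path.join(dirpath, name), root)
            found.add(relative.replace(os.sep, "/"))
    return found


def find_orphans(files_on_disk: Iterable[str], linked_paths: Iterable[str]) -> Set[str]:
    """Files that exist on disk but were never reached through a link."""
    return set(files_on_disk) - set(linked_paths)
```

The first function corresponds to the "picture" of the file system taken on the remote server; the second is the comparison that can then be made on the scanning machine.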

Alias Tab

The Alias Tab is enabled when using File System Scanning.

You must configure the mapping between the Root URL of the website and the File System on the Root Tab. You may optionally configure additional mappings via the Alias Tab.

Files Tab

The Files Tab is only enabled when using File System Scanning. It routes different file types (as defined by their file extensions) to the appropriate parser/processor. Hence files with a .htm or .html extension are routed to the HTML parser.

Note: when using HTTP Scanning, the Internet Standards dictate that files are routed according to their MIME or Content-Type and not based on their file extension.

The following mappings are established by default:

.htm, .html and .shtml files are processed as HTML documents
.swf files are processed as Shockwave/Flash files
.doc and .txt files are processed as Text files

You may wish to establish additional mappings; the following are commonly used:

.pdf files processed as PDF documents
.xls and .ppt files processed as Text documents

Note that the LinkScan Text parser is an extremely generic implementation and it attempts to extract hyperlinks from any file type that is routed to it. In particular, it may be used to extract links from various Microsoft Office file types (e.g. .doc, .ppt, .xls etc.) as well as .url files as used within the Internet Explorer Favorites folder.

The lower half of the Files Tab is used to define which files to look for when a reference to a directory (without any explicit filename) is found, typically index.html.

A checkbox controls whether or not to permit a directory listing to be created on-the-fly when a link to a directory is found but no index.html (or similar) file is present.
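
The routing described in this section amounts to two lookups: one from file extension to parser, and one from a directory to its index file. A minimal Python sketch follows, assuming the default mappings listed above. The names used here are hypothetical, and lowercasing the extension is an assumption.

```python
import posixpath
from typing import List, Optional, Sequence

# Default mappings as listed above; the parser labels are illustrative.
PARSER_BY_EXTENSION = {
    ".htm": "HTML", ".html": "HTML", ".shtml": "HTML",
    ".swf": "Flash",
    ".doc": "Text", ".txt": "Text",
}


def parser_for(filename: str) -> Optional[str]:
    """Pick a parser from the file extension, or None if it is unmapped."""
    extension = posixpath.splitext(filename)[1].lower()
    return PARSER_BY_EXTENSION.get(extension)


def resolve_directory(files_in_dir: List[str],
                      index_names: Sequence[str] = ("index.html",)) -> Optional[str]:
    """Return the first index file present in the directory, if any."""
    for name in index_names:
        if name in files_in_dir:
            return name
    return None  # no index file; a directory listing may be generated instead
```

Adding a mapping such as `PARSER_BY_EXTENSION[".pdf"] = "PDF"` corresponds to establishing an additional entry on the Files Tab.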

Mimes Tab

The Mimes Tab routes different file types (as defined by their MIME or Content-Type header) to the appropriate parser/processor.

The following mappings are established by default:

text/html is processed as an HTML document
text/plain is processed as a Text file
application/msword is processed as a Text file
application/x-shockwave-flash is processed as a Shockwave/Flash file

You may wish to establish additional mappings; the following are commonly used:

application/pdf processed as a PDF document

Enabling the PDF option may incur significant performance overheads in view of the large size of many PDF documents and the time required to download them.
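
When scanning over HTTP, the routing key is the Content-Type header rather than the file extension. The header may carry parameters such as a charset, so a lookup needs to isolate the media type first. The Python sketch below shows this under the default mappings listed above; the function name and parser labels are hypothetical.

```python
from typing import Dict, Optional

# Default mappings as listed above; the parser labels are illustrative.
DEFAULT_PARSER_BY_MIME = {
    "text/html": "HTML",
    "text/plain": "Text",
    "application/msword": "Text",
    "application/x-shockwave-flash": "Flash",
}


def parser_for_content_type(header: str,
                            mapping: Optional[Dict[str, str]] = None) -> Optional[str]:
    """Pick a parser from a Content-Type header value."""
    table = DEFAULT_PARSER_BY_MIME if mapping is None else mapping
    # Drop parameters such as "; charset=UTF-8" and normalize the case.
    media_type = header.split(";", 1)[0].strip().lower()
    return table.get(media_type)
```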

Import Tab

The LinkScan Import function may be used to:

Validate a list of Links exported from some arbitrary data source (e.g. a database management system).
Validate a list of Documents (e.g. an arbitrary sub-set of pages from a web site) and all the links contained within them. This might include the most critical/popular pages perhaps extracted from an HTTP logfile analysis program. This could also represent an arbitrary user session including a sequence of form submissions with specific data values. Such sequences may be easily captured with the LinkScan Recorder.

When processing a list of Links each URL is checked in turn and its status stored in the LinkScan database. When processing a list of Documents, each document and every link within that document is checked and its status stored.

The import function offers enormous flexibility. To use this feature, carry out the following steps:

Prepare the Import File

LinkScan will import a simple ASCII file of the following format:

URL ... one or more tab characters ... URL-Description

URLs may be absolute, or relative to the Home URL for the current server. The URL-Description is imported and carried through to the LinkScan Reports for identification purposes. You may use any ASCII string, for example a database record number.

An alternative field separator may be specified by including a special command as the first line of the file:

## \s+

The command starts with '##' in column one followed by a Perl expression that specifies the field delimiter. In the example above, '\s+' means one or more whitespace characters (tab or space).

Lines with a '#' in column one, and blank lines, are ignored as comments.
Then select the import mode by changing the Import setting. Valid selections are:

Import links
Import documents
Import documents with caching disabled
Supply the pathname to the ASCII Import File.
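
To check an import file before handing it to LinkScan, the format described above can be read with a few lines of Python. This is a sketch under stated assumptions: the function name `parse_import_lines` is hypothetical, the delimiter expression is compiled with Python's `re` module (which accepts simple Perl expressions such as `\s+` but is not fully Perl-compatible), and a line without a delimiter is taken to have an empty description.

```python
import re
from typing import Iterable, List, Tuple


def parse_import_lines(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (url, description) pairs from the lines of an import file."""
    delimiter = re.compile(r"\t+")           # default: one or more tab characters
    entries = []
    for number, raw in enumerate(lines):
        line = raw.rstrip("\r\n")
        if number == 0 and line.startswith("##"):
            # Special first line: "##" followed by the delimiter expression.
            delimiter = re.compile(line[2:].strip())
            continue
        if not line.strip() or line.startswith("#"):
            continue                          # blank lines and comments
        fields = delimiter.split(line, maxsplit=1)
        description = fields[1] if len(fields) > 1 else ""
        entries.append((fields[0], description))
    return entries
```

A typical use would be `parse_import_lines(open(path))`, where `path` names the ASCII import file.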

Special Considerations

LinkScan de-duplicates the list of links within an Import Document list. This means that LinkScan will validate each unique URL within the list only once.

However, you may force LinkScan to process an Import Sequence so that the same URL or document is checked more than once. This may be achieved by adjusting the URLs to make them appear unique, which also provides a means of differentiating the test results for each step. Simply add dummy name-value pairs to the query string of each URL:

http://www.example.com/cookie_sensitive?dummyseq=1
[...]
http://www.example.com/set_cookie
[...]
http://www.example.com/cookie_sensitive?dummyseq=2

If the URLs already include a query string, simply append the additional parameter to the existing query and change:

http://www.example.com/foo?name=value

to:

http://www.example.com/foo?name=value&dummyseq=1
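
When an import sequence is generated by a script, the dummy parameter can be appended automatically. The following Python sketch handles both cases shown above (with and without an existing query string). The function name `add_dummy_seq` is hypothetical; the parameter name dummyseq is taken from the examples.

```python
from urllib.parse import urlsplit, urlunsplit


def add_dummy_seq(url: str, seq: int, name: str = "dummyseq") -> str:
    """Append name=seq to the query string so the URL appears unique."""
    parts = urlsplit(url)
    extra = f"{name}={seq}"
    # Use "&" when a query string already exists, otherwise start one.
    query = f"{parts.query}&{extra}" if parts.query else extra
    return urlunsplit(parts._replace(query=query))
```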

Normally, LinkScan maintains the status of each link in a cache while it scans a site. This dramatically improves performance since LinkScan does not need to re-check commonly used images and other components over and over. However, caching may be undesirable with some stateful sequences, for example when the same URL produces a completely different result before and after a cookie is set.

In those situations, you may use a special option (Import Nocache) which will force LinkScan to flush its cache after each imported document has been validated.

Login Tab

HTTP access to some sites is controlled via authentication schemes requiring Cookies.

LinkScan will automatically accept and return all valid cookies received during the course of a scan. However, to gain access to the site, you may need to configure LinkScan to ensure that the appropriate cookies are set. This may be achieved by one of two techniques:

Instructing LinkScan to submit a login form with a valid username and password, or...
Pre-loading LinkScan with the necessary cookies prior to initiating the scan

Login URL: Specify a URL to submit at the start of a scan. Typically, this will involve the submission of a login form such as:
login.jsp?Username=John&Password=secret
Avoid Logout URL: Typically, we must tell LinkScan to avoid pressing any logout button so that a successful login is not immediately invalidated.
Use Recorded Links: Instructs LinkScan to replay any links captured with the LinkScan Recorder and saved with the current project.
Use Recorded Cookies: Instructs LinkScan to use any cookies captured with the LinkScan Recorder and saved with the current project.
Set Persistent Cookies: Additional cookies may be loaded into the internal LinkScan cookie jar prior to the start of a scan.

Auth Tab

The Auth Tab may be used to specify HTTP authentication credentials. Servers that require HTTP Authentication cause web browsers to challenge the user with a popup dialog demanding a username and password.

Note that this is a completely separate mechanism from cookie based schemes that require users to enter their credentials on an HTML based form (see the Login Tab for details).

Host: The name of the host to which the credentials apply. Enter, for example, www.example.com.
Realm: HTTP Authentication allows webmasters to configure different authentication rules for different sub-sets of the site, formally known as Realms. You may specify the name of the Realm to configure Realm-specific credentials. However, in most cases, you may leave the Realm blank and LinkScan will use the supplied username/password for any and all Realms on that host.
Username: Enter the username.
Password: Enter the password.
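
The Host and Realm fields follow the usual model for HTTP authentication clients. As an analogy only (this is Python's standard library, not LinkScan), `urllib.request.HTTPPasswordMgrWithDefaultRealm` behaves in a comparable way: credentials registered with a realm of `None` serve as the fallback for any realm on the host, while a named realm takes precedence. The realm name "Admin Area" and the second set of credentials below are made up for illustration.

```python
import urllib.request

manager = urllib.request.HTTPPasswordMgrWithDefaultRealm()

# Realm left blank (None): used for any realm on this host.
manager.add_password(None, "http://www.example.com/", "john", "secret")

# Realm-specific credentials (hypothetical realm name and account).
manager.add_password("Admin Area", "http://www.example.com/", "admin", "topsecret")

# An opener built with this handler answers authentication challenges
# using the credentials registered above.
opener = urllib.request.build_opener(urllib.request.HTTPBasicAuthHandler(manager))
```

Looking up credentials for "Admin Area" returns the realm-specific pair; any other realm on www.example.com falls back to the default pair.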

Owners Tab

By default, LinkScan assigns each document scanned to an Owner based on the top-level directory name. You may use the Spin Button to create Owner names based on the topmost 0-5 directory levels.

Use the Owners tab to specify additional ownership rules. For example:

# The default (automatic) rules would assign all documents under
# products/ and services/ directory to their respective owners.
Owner products/ products
Owner services/ services
# This can be enhanced by adding additional rules such as:
Owner products/consumer/ btoc
Owner products/business/ btob
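
The effect of such rules can be modelled in a few lines of Python. This sketch assumes that the most specific (longest) matching prefix wins, which the text above does not state, and that paths are site-relative and end in a file name. The function names are hypothetical.

```python
from typing import Dict


def default_owner(path: str, levels: int = 1) -> str:
    """Owner derived from the topmost `levels` directory names of the path."""
    directories = path.strip("/").split("/")[:-1]   # drop the file name
    return "/".join(directories[:levels])


def assign_owner(path: str, rules: Dict[str, str], levels: int = 1) -> str:
    """Apply explicit Owner rules first, then fall back to the default."""
    matches = [prefix for prefix in rules if path.startswith(prefix)]
    if matches:
        # Assumption: the longest matching prefix takes precedence.
        return rules[max(matches, key=len)]
    return default_owner(path, levels)
```

With the rules from the example, a document under products/consumer/ would be assigned to btoc, while other documents under products/ would stay with products.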

Advanced Tab

This tab may be used to enter or modify some rarely used commands not otherwise available via the graphical interface. See the LinkScan Quick Reference Card for a complete list.

Notes Tab

Unlike old-fashioned configuration files that normally accept comment lines, graphical user interfaces do not generally allow you to make notes (for example, the reasons for a change to the configuration, when it was made and by whom). We miss that nice feature!

Hence you may use the Notes tab to annotate a Project with your own comments. These are saved as part of the Project Plan.

Other Options Tab

User Agent: Select the User-Agent header that LinkScan will send with each and every HTTP request.
Disable JavaScript: Disable recognition of links embedded within JavaScript. Normally used to help identify areas of a site that may not be reachable with browsers that do not support JavaScript, or have it disabled.
Retry External: When selected, LinkScan will track all External links that appear to fail due to network related errors (e.g. DNS, connect and timeout errors). These links will be retested at the end of the scan. This tends to reduce the number of transient errors reported but the scan may require a little more time to complete.
Report External Redir: Select this option when you want LinkScan to warn/report on redirections and store the status of the final (redirected) link.
Require </A>: Report errors if <A HREF=...> tags are not closed with </A> tags.
Check ALT, WIDTH and HEIGHT: Report errors if <IMG SRC=...> tags do not have ALT, WIDTH and/or HEIGHT attributes.
Collect META tags: Save all document <META> tags. These data are stored in, for example,
```
C:\LinkScan10\projectname\data\linkscan.met
```
Diagnostic Trace: Log all HTTP request and response headers to disk (typically used to examine and debug complex authentication issues). The data are stored in, for example:
C:\LinkScan10\projectname\data\linkscan.red
Report Error if Redirect Matches Expression: Forces LinkScan to report an error if a page is redirected to a URL matching the specified expression (typically custom pages that display user friendly error messages which would otherwise appear as 200 OK):
.*error\.html
Report Error if Internal Document Body Matches Expression: Forces LinkScan to report an error if a page contains the specified expression, even if the page has a 200 OK status code. For example:
(?i)(404\s+not\s+found|500\s+server\s+error)
Report Error if External Document Body Matches Expression: Forces LinkScan to report an error if an external page contains the specified expression, even if the page has a 200 OK status code. Note: since LinkScan must fetch and process the document body when validating external links, this may have a significant performance impact. Example to trigger on External links to documents containing META REFRESH tags:
(?i)<meta\s+http[^>]+refresh[^>]*>
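
Before entering an expression, it can be useful to try it against sample input. The Python sketch below exercises the three example expressions shown above. It uses Python's `re` module, whose syntax agrees with Perl for these particular patterns, and it assumes the expression may match anywhere in the URL or body; the helper names are hypothetical.

```python
import re

# The three example expressions from the text above.
REDIRECT_ERROR = re.compile(r".*error\.html")
BODY_ERROR = re.compile(r"(?i)(404\s+not\s+found|500\s+server\s+error)")
META_REFRESH = re.compile(r"(?i)<meta\s+http[^>]+refresh[^>]*>")


def redirect_is_error(url: str) -> bool:
    """True if a redirect target looks like a friendly error page."""
    return REDIRECT_ERROR.search(url) is not None


def body_is_error(body: str) -> bool:
    """True if a 200 OK page body contains an error message."""
    return BODY_ERROR.search(body) is not None


def has_meta_refresh(body: str) -> bool:
    """True if the document contains a META REFRESH tag."""
    return META_REFRESH.search(body) is not None
```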

LinkScan Version 12.3
© Copyright 1997-2012 Electronic Software Publishing Corporation (Elsop)
LinkScan™ and Elsop™ are Trademarks of Electronic Software Publishing Corporation
