LinkScan

LinkScan for Unix. Reference Manual.

Section 11


Advanced and Custom Scanning

LinkScan incorporates many powerful customization features described below.

  1. How to control the scope of a scan
  2. How to handle authentication schemes
  3. How to scan additional pages and submit forms
  4. How to validate JavaScript and drop-down lists
  5. How to handle special Error documents
  6. How to manipulate URLs on-the-fly
  7. How to emulate different browser types
  8. How to remap different hosts
  9. How to assign documents to Owners
  10. How to process additional per-document data
  11. How to control the testing of external links
  12. Other miscellaneous customizations

Hint: We strongly recommend that you read Essential LinkScan Concepts before studying this section of the Reference Manual.

11.1 How to control the scope of a scan

You may use any combination of the following commands to include or exclude specific areas of the target website.


Exclude relative-path-expression
Exclude absolute-url-expression
Nofollow relative-path-expression
Onlyfollow relative-path-expression
Onlyinclude relative-path-expression
Maxlevels depth
Maxclicks depth

Exclude: The Exclude command may be used to completely ignore specific links. You may supply a relative-path-expression to exclude Internal Links, or an absolute-url-expression to exclude External Links.

Nofollow: The Nofollow command may be used to provide even finer control over LinkScan's behavior. When LinkScan encounters a link matching a Nofollow command, it will validate the link (and check for any <a name = ... > tags if appropriate). However, it will not test any links that lead from the target document.

For greater flexibility and completeness, the Onlyinclude and Onlyfollow commands are also supported.

Onlyinclude: is logically equivalent to "Exclude everything except".

Onlyfollow: is logically equivalent to "Nofollow everything except".

Maxlevels: A command such as Maxlevels = 3 will limit the depth of the scan to three directory levels under the server root.

Maxclicks: A command such as Maxclicks = 3 will limit the depth of the scan based on the number of clicks from the start of the scan. In order to more closely model the real user experience, LinkScan does not include clicks that result from following framesets or redirections.

The following rules of precedence apply when using multiple commands in combination:


Example 1:

Exclude http://www.domain.com/
Exclude test/

All links to "http://www.domain.com/" and all files in the local "test/" subdirectory will be ignored by LinkScan.


Example 2:

Nofollow user2/

LinkScan will check the links to files in the "user2/" directory, but it will not examine the content of any documents within the "user2/" directory or test any of the links contained within them.


Example 3:

Onlyfollow user1/

LinkScan will check the documents in the local "user1/" subdirectory and test the links to files in other local directories. However, LinkScan will not examine the content of any documents that lie outside of the local "user1/" directory or test any of the links contained within them.
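The three examples above can be mimicked with a short Python sketch. It is an approximation for illustration, not LinkScan's implementation: the function name classify, the choice to check Exclude before Nofollow, and the use of a start-anchored match against the path relative to the server root are all assumptions.

```python
import re

def classify(path, exclude=(), nofollow=(), onlyfollow=()):
    """Return 'ignore', 'validate-only' or 'follow' for a relative path.

    Assumptions: patterns are matched from the start of the path, and
    Exclude is consulted before the Nofollow/Onlyfollow rules.
    """
    if any(re.match(p, path) for p in exclude):
        return "ignore"            # Exclude: the link is not tested at all
    if any(re.match(p, path) for p in nofollow):
        return "validate-only"     # Nofollow: test the link, skip its contents
    if onlyfollow and not any(re.match(p, path) for p in onlyfollow):
        return "validate-only"     # Onlyfollow: "Nofollow everything except"
    return "follow"

print(classify("test/page.html", exclude=["test/"]))
print(classify("user2/index.html", nofollow=["user2/"]))
print(classify("user3/index.html", onlyfollow=["user1/"]))
print(classify("user1/index.html", onlyfollow=["user1/"]))
```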

Dynamic content

On websites that incorporate a high proportion of dynamic content, it may not be productive to test every script with a large number of query parameters or other variations. The following controls are provided.

Maxcgi: The maximum number of times any single URL should be probed with different query parameters. This prevents LinkScan from trying to validate a CGI script or dynamic page with a potentially infinite number of query parameters.
[Default: Maxcgi = 100 ]

Taglimit: The Taglimit command may be used to provide even finer control over the number of times clusters of URLs are probed. Syntax and example:


Syntax:

Taglimit relative-path-expression maxnumber

Example:

Taglimit scripts/DatabaseLookup.asp 20

LinkScan will only attempt to parse 20 documents matching the pattern "scripts/DatabaseLookup.asp". Any further links matching the specified pattern will be completely ignored.
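The counting behavior described above can be sketched in Python. The class name TagLimiter and the start-anchored match are assumptions made for illustration; only the "first maxnumber matches are processed, later ones are ignored" rule comes from the text.

```python
import re
from collections import Counter

class TagLimiter:
    def __init__(self, rules):
        # rules: list of (pattern, maxnumber) pairs, as in "Taglimit pattern maxnumber"
        self.rules = [(re.compile(p), n) for p, n in rules]
        self.seen = Counter()

    def allow(self, path):
        """Return True if the link should be processed, False if ignored."""
        for regex, limit in self.rules:
            if regex.match(path):
                self.seen[regex.pattern] += 1
                return self.seen[regex.pattern] <= limit
        return True  # no Taglimit rule applies to this link

limiter = TagLimiter([(r"scripts/DatabaseLookup.asp", 20)])
results = [limiter.allow(f"scripts/DatabaseLookup.asp?id={i}") for i in range(25)]
print(results.count(True))
```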

11.2 How to handle authentication schemes

Many websites include some form of access control or user authentication features. The schemes discussed here are HTTP Authentication, NTLM Authentication and cookie-based authentication.

In the case of HTTP or NTLM Authentication, when a user attempts to access a protected area, their browser will present a challenge in the form of a pop-up dialog box that requires a username and password to be entered. In the case of cookie-based arrangements, the user is normally required to login by filling out an HTML form and submitting it.

HTTP Authentication

For sites that require HTTP Authentication, you must configure LinkScan with an appropriate Auth command:


Syntax:

Auth server-name "realm-name" username password

Examples:

Auth www.example.com "" guestuser xxxxxx
Auth app.example.com "Controlled Access" guestuser xxxxxx

You must include a realm-name (enclosed in double-quotes) but it may be empty. In that case, LinkScan will use the configured username and password for any realm on the target server. This is the recommended approach unless your server uses multiple realms with different access control rules for different portions of the website.
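The realm rule can be illustrated with a small Python lookup. This is a sketch only: the function name find_credentials and the tuple layout are hypothetical, and the "first matching rule wins" order is an assumption.

```python
def find_credentials(auth_rules, server, realm):
    """auth_rules: list of (server, realm, username, password) tuples."""
    for rule_server, rule_realm, user, password in auth_rules:
        if rule_server != server:
            continue
        # An empty realm in the rule applies to any realm on that server.
        if rule_realm == "" or rule_realm == realm:
            return user, password
    return None

rules = [
    ("www.example.com", "", "guestuser", "xxxxxx"),
    ("app.example.com", "Controlled Access", "guestuser", "xxxxxx"),
]

print(find_credentials(rules, "www.example.com", "Any Realm"))
print(find_credentials(rules, "app.example.com", "Other Realm"))
```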

NTLM Authentication

Some Intranet websites utilize the proprietary and undocumented Microsoft NTLM protocol to authenticate users. LinkScan (on Windows systems only) may be configured to scan such sites.

Note: This may result in other minor artifacts in the results of the scan since LinkScan will use the Microsoft Windows implementation of the HTTP protocol versus the (stricter) native LinkScan implementation.

Cookie-based Authentication

HTTP access to some sites is controlled via authentication schemes requiring Cookies.

LinkScan will automatically accept and return all valid cookies received during the course of a scan. However, to gain access to the site, you may need to configure LinkScan to ensure that the appropriate cookies are set. This may be achieved by one of two techniques:

The submission of a login form may be configured using the Extrahome command (described in the next section). Alternatively, you may initialize LinkScan's collection of stored cookies (aka Cookie Jar) with one or more permanent Cookies by using the Cookie command:


Syntax:

Cookie server-name cookiename=cookievalue

Example:

Cookie www.elsop.com LinkScan=cookie_value;

Note: Do not enter space characters around the '=' character

The server-name is the name of the server to be tested. For security reasons and in compliance with the applicable standards, LinkScan will only send the cookie when the specified server-name exactly matches the hostname portion of the requested URL. In this context, server names and their corresponding IP addresses are considered to be different (consistent with all major browsers). The cookie names and values must be reverse engineered from your server code or "discovered" via your browser, either by enabling the "Prompt before accepting cookies" option or by examining the stored cookies on disk.
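The exact-hostname rule can be sketched in Python as follows. The function name cookie_header and the dictionary layout are hypothetical; the IP address is a documentation placeholder.

```python
from urllib.parse import urlsplit

def cookie_header(cookie_jar, url):
    """cookie_jar maps a server-name to a list of 'name=value' strings."""
    host = urlsplit(url).hostname
    cookies = cookie_jar.get(host, [])   # exact hostname match only
    return "; ".join(cookies) if cookies else None

jar = {"www.elsop.com": ["LinkScan=cookie_value"]}
print(cookie_header(jar, "http://www.elsop.com/index.html"))
print(cookie_header(jar, "http://elsop.com/index.html"))
print(cookie_header(jar, "http://192.0.2.10/index.html"))
```

Only the first request carries the cookie; the bare domain and the IP address are treated as different servers.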

Hint 1: Sites with especially complex schemes (multiple levels of access control, subscription expirations etc.) might consider configuring their server and/or scripts to recognize a "super-user-cookie" specifically for testing purposes. This approach may also be used to trigger test points within server-based scripts and greatly improve the meaningful testability of complex dynamic content.

Hint 2: HTTP Authentication and Cookie related transactions are logged by LinkScan during the course of the scan. You may examine the following file to view the log: .../LinkScan/Projectname/data/linkscan.red

11.3 How to scan additional pages and submit forms

You may configure LinkScan to examine additional documents that would not normally be found during the scan and might otherwise be reported as orphaned files. The same technique may be used to submit forms on your website with specific data values for testing purposes. This is achieved with the Extrahome command:


Syntax:

Extrahome relative-path-expression

Examples:

Extrahome somedir/staticdoc.html
Extrahome cgi-bin/getscript.cgi?Var1=aaa&Var2=bbb

The second example above includes a query string and is therefore equivalent to a FORM submission using the GET method. In addition, LinkScan includes support for special conventions that allow users to specify FORM submission operations using the POST method, including the Multi-Part POST, frequently used to upload files from a client to the server.


Examples:

Extrahome cgi-bin/postscript.cgi??Name=Malcolm%20Hoar&Password=secret

Extrahome upload.cgi???(postedfile;C:\LinkScan10\post\test.jpg;image/jpeg)

Extrahome upload.cgi???Name1=Val1&(postedfile;/usr/home/test/test.jpg;image/jpeg)&Name2=Val2
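The conventions visible in these examples (one "?" for GET, "??" for POST, "???" for Multi-Part POST with file parts written as (fieldname;file-path;mime-type)) can be unpacked with a short Python sketch. This is a reading of the examples only, not LinkScan's parser; the function name parse_extrahome and the labels "POST-multipart", "field" and "file" are hypothetical.

```python
def parse_extrahome(spec):
    """Split an Extrahome argument into (method, path, fields)."""
    if "???" in spec:
        path, _, data = spec.partition("???")
        method = "POST-multipart"
    elif "??" in spec:
        path, _, data = spec.partition("??")
        method = "POST"
    elif "?" in spec:
        path, _, data = spec.partition("?")
        method = "GET"
    else:
        return "GET", spec, []

    fields = []
    for part in data.split("&"):
        if part.startswith("(") and part.endswith(")"):
            # File part: (fieldname;local-file-path;mime-type)
            name, filename, mime = part[1:-1].split(";")
            fields.append(("file", name, filename, mime))
        else:
            name, _, value = part.partition("=")
            fields.append(("field", name, value))
    return method, path, fields

print(parse_extrahome("cgi-bin/postscript.cgi??Name=Malcolm%20Hoar&Password=secret"))
print(parse_extrahome("upload.cgi???Name1=Val1&(postedfile;/usr/home/test/test.jpg;image/jpeg)&Name2=Val2"))
```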

Hint 1: Use the LinkScan Recorder to automatically capture the correctly constructed URLs.

Hint 2: When using the Extrahome command to submit a login form to provide access to a site, you may also need to configure LinkScan so that it doesn't immediately "click" any LOGOUT button which would invalidate the newly created session.

11.4 How to validate JavaScript and drop-down lists

LinkScan may be configured to interpret the contents of drop-down lists as links to other pages. The HTML specification does not define a standard method for indicating that a drop-down list contains hyperlinks (as opposed to regular data). Hence LinkScan needs some other "cue" and may be triggered by pattern matching of attributes within the SELECT tag. Consider, for example, the following:


<select name="URLLIST">
<option value="/products/" Selected> Relative URL to Products
<option value="http://www.mydomain.com/services/"> Absolute URL to Services
</select>

To instruct LinkScan to treat the contents of the drop-down list as URLs, use the following command:


Selecturl URLLIST

LinkScan will examine all SELECT tags and look for a Regular Expression match on the NAME attribute. If the match is successful (URLLIST in this example) LinkScan will treat each OPTION tag within the list as a hyperlink and validate it accordingly.
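The idea can be sketched with Python's standard html.parser module. This is an illustration of the matching rule only; the class name SelectUrlParser is hypothetical and LinkScan's own parser may differ in detail.

```python
import re
from html.parser import HTMLParser

class SelectUrlParser(HTMLParser):
    """Collect OPTION values from SELECT tags whose NAME matches a pattern."""

    def __init__(self, name_pattern):
        super().__init__()
        self.name_regex = re.compile(name_pattern)
        self.in_matching_select = False
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "select":
            name = attrs.get("name") or ""
            self.in_matching_select = bool(self.name_regex.search(name))
        elif tag == "option" and self.in_matching_select and attrs.get("value"):
            self.urls.append(attrs["value"])

    def handle_endtag(self, tag):
        if tag == "select":
            self.in_matching_select = False

html = '''<select name="URLLIST">
<option value="/products/" Selected> Relative URL to Products
<option value="http://www.mydomain.com/services/"> Absolute URL to Services
</select>'''

parser = SelectUrlParser("URLLIST")
parser.feed(html)
parser.close()
print(parser.urls)
```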

LinkScan includes the ability to validate links contained within JavaScript code. A relatively simple pattern matching technique is used -- LinkScan does not contain a full JavaScript interpreter. This means that LinkScan may "miss" some links or find "false positive errors" especially if the code creates the hyperlink references dynamically at run-time. The following Scriptmatch and Scriptnomatch commands give excellent results in most cases. However, you can customize the matching rules by changing these expressions and/or adding new ones.


Scriptmatch = (\w+://\S+|\S+/$|\S+\?\S+|\S+\.([a-z]{2,3}|[js]?html?|Z)$)
Scriptnomatch = .*([\(\)\[\]\{\}\']|document\.\S+|\.(src|com)$)
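To see how these two expressions interact, the following Python sketch tests a few candidate strings against them. It assumes that a string is treated as a link when it matches Scriptmatch and does not match Scriptnomatch, and that both are matched from the start of the string; the function name looks_like_link is hypothetical.

```python
import re

SCRIPTMATCH = re.compile(r"(\w+://\S+|\S+/$|\S+\?\S+|\S+\.([a-z]{2,3}|[js]?html?|Z)$)")
SCRIPTNOMATCH = re.compile(r".*([\(\)\[\]\{\}\']|document\.\S+|\.(src|com)$)")

def looks_like_link(candidate):
    """True if the string matches Scriptmatch and not Scriptnomatch."""
    return bool(SCRIPTMATCH.match(candidate)) and not SCRIPTNOMATCH.match(candidate)

for text in ["http://www.example.com/page.html", "products/index.shtml",
             "search.asp?q=1", "image.src", "open('x.html?a=1')"]:
    print(text, looks_like_link(text))
```

The first three strings are accepted; "image.src" is rejected by the \.(src|com)$ alternative and the last string is rejected because it contains parentheses and quotes.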

Some JavaScript constructs may still produce false errors. You may force LinkScan to ignore complete script blocks that match a specified pattern. For example:


Scriptexclude function\s+ZoomWindow

The above command will force LinkScan to ignore script blocks that contain a definition for the ZoomWindow function.

11.5 How to handle special Error documents

Many websites are constructed with special user-friendly error pages, sometimes known as "custom-404 documents". Some servers will deliver the error document directly whereas others may force a redirection to a specific error document. In either case, an issue arises if your server delivers the error document with a 200 OK response code. LinkScan (or any other link checker) would not be able to detect the error condition.

A similar issue arises with some dynamically generated documents. For example, a Java applet may encounter a run-time error condition after it has already sent a 200 OK response code to the client.

Hence LinkScan supports two special commands that may be used to detect such conditions and force a 404 Not Found error, regardless of the HTTP response code produced by the server/application. The first is used with servers that force a redirection by pattern matching on the HTTP Location: header. The second operates by pattern matches on the document bodies.


Syntax:

Errordoc pattern
Errorbody pattern

Examples:

Errordoc special/notfound\.html
Errorbody (?i).*runtime\serror

In the Errordoc example, LinkScan will report as 404 Not Found any URL that is redirected to http://your.server/special/notfound.html. In the Errorbody example, LinkScan will report as 404 Not Found any document that contains the string runtime error in the document body. Note that (?i) makes the pattern match case-insensitive.

Hint: The Errorbody pattern match is carried out on the entire document, including comments. Developers might consider including a standard error string within comment tags that may be used to trigger the Errorbody match.
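The effect of the two commands can be sketched in Python. The function name effective_status is hypothetical, and the sketch assumes the patterns may match anywhere in the Location header or document body.

```python
import re

ERRORDOC = re.compile(r"special/notfound\.html")
ERRORBODY = re.compile(r"(?i).*runtime\serror")

def effective_status(status, location=None, body=""):
    """Return the status to report for one response."""
    if location and ERRORDOC.search(location):
        return 404      # redirected to the custom error document
    if ERRORBODY.search(body):
        return 404      # error text found, whatever the response code
    return status

print(effective_status(302, location="http://your.server/special/notfound.html"))
print(effective_status(200, body="<!-- Runtime Error in module X -->"))
print(effective_status(200, body="<p>Welcome</p>"))
```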

11.6 How to manipulate URLs on-the-fly

One of the most powerful (and complex) customization features of LinkScan concerns the real-time manipulation of links during the course of the scan. This is typically used to control the testing of sites with complex dynamic content. The basic commands available are:


Sessionmatch expression
Substitute relative-path-expression expression
Substituteraw relative-path-expression expression
Substitutescript relative-path-expression expression

The Sessionmatch command is used to manipulate Session numbers. The Substitute command is used to perform transformations on resolved links. The Substituteraw command is used to perform transformations on unresolved links (i.e. the raw contents of a tag or tag attribute). The Substitutescript command is used to perform transformations on blocks of JavaScript code.

We shall consider a number of examples which may be adapted according to your specific needs.

Example 1

Consider a site that produces links such as:


http://www.example.com/page1.asp
http://www.example.com/page1.asp?Print

It is entirely possible that page1.asp has been designed in such a manner that it delivers the same basic content with minor variations in formatting depending upon the presence or absence of the Print query string. One might configure LinkScan with:


Substitute (.*\.asp)\?Print $1

Whenever LinkScan encounters a link matching the specified pattern it will make the substitution indicated before it tries to validate or follow that link. In this example, a link to:

http://www.example.com/page1.asp?Print

will immediately be transformed to:

http://www.example.com/page1.asp

Note, however, this is not the same as Excluding links which contain the Print query string; that would cause LinkScan to simply ignore the link. In this case, LinkScan will process the link but transform it on-the-fly during the scan.
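The same transformation can be tried out in Python before committing it to a configuration. The helper substitute below is hypothetical; it only translates the $1-style group references into Python's notation and is not a model of LinkScan internals.

```python
import re

def substitute(pattern, replacement, link):
    """Apply one Substitute-style rule; $1, $2 ... refer to captured groups."""
    py_replacement = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
    return re.sub(pattern, py_replacement, link, count=1)

print(substitute(r"(.*\.asp)\?Print", "$1", "http://www.example.com/page1.asp?Print"))
print(substitute(r"(.*\.asp)\?Print", "$1", "http://www.example.com/page1.asp"))
```

Both calls yield http://www.example.com/page1.asp, so the two links collapse into a single document to be tested.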

Example 2

Next we will consider a significantly more complex scenario.


Sessionmatch .*&token=([^&]+)
Substitute (.*&token=)[^&]*(.*)$ $1!S$2

In this case, we use the special Sessionmatch command to capture and save the first value of the query parameter token that LinkScan sees. This is most likely some kind of session number assigned by the target server immediately following the submission of a login form. The Substitute command then instructs LinkScan to replace all subsequent values of token with the saved value (represented by the special parameter !S).

In this scenario, LinkScan ensures that the value of token can never change during the course of the scan from the originally assigned value.
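A minimal Python sketch of this behavior follows. The class name SessionPinner and the sample links are hypothetical; the attribute saved plays the role of the !S parameter.

```python
import re

SESSIONMATCH = re.compile(r".*&token=([^&]+)")
SUBSTITUTE = re.compile(r"(.*&token=)[^&]*(.*)$")

class SessionPinner:
    def __init__(self):
        self.saved = None          # plays the role of !S

    def rewrite(self, link):
        if self.saved is None:
            m = SESSIONMATCH.match(link)
            if m:
                self.saved = m.group(1)   # remember the first token seen
        if self.saved is None:
            return link
        # Replace whatever token value the link carries with the saved one.
        return SUBSTITUTE.sub(
            lambda m: m.group(1) + self.saved + m.group(2), link, count=1)

pinner = SessionPinner()
first = pinner.rewrite("a.asp?x=1&token=abc123")
second = pinner.rewrite("b.asp?x=2&token=zzz999&y=3")
print(first)
print(second)
```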

Example 3

Next we'll consider a JSP site that produces URLs with the following structure:


http://www.example.com/content.jsp?A=123&B=456&C=789&D=XYZ

It may not be productive or efficient for LinkScan to scan all of the pages using every combination and permutation of values for the parameters A, B, C, D, etc. We can control that by manipulating the individual name-value pairs during the scan. For example:


Substitute (content\.jsp\?.*)&B=[^&]*(.*) $1&B=456$2
Substitute (content\.jsp\?.*)&C=[^&]*(.*) $1$2
Taglimit content\.jsp\?.*&D= 20

The first command fixes the value of B=456. Whatever value the parameter B takes on during the scan, LinkScan will force the value back to 456. The second command deletes any references to the C parameter from every link that it finds. We have also included the third Taglimit command; this will cause LinkScan to completely ignore the twenty-first and subsequent links that include a D parameter. In other words, in this case, we only want to test a representative sample (20) of links that include a D parameter.
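The two Substitute transformations can be checked with a short Python sketch. The helper substitute is hypothetical and the sample link is invented for illustration. The sketch writes the value pattern as [^&]* so that a parameter value of any length is consumed.

```python
import re

def substitute(pattern, replacement, link):
    """Apply one Substitute-style rule; $1, $2 ... refer to captured groups."""
    py_replacement = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
    return re.sub(pattern, py_replacement, link, count=1)

link = "content.jsp?A=123&B=999&C=789&D=XYZ"
step1 = substitute(r"(content\.jsp\?.*)&B=[^&]*(.*)", "$1&B=456$2", link)
step2 = substitute(r"(content\.jsp\?.*)&C=[^&]*(.*)", "$1$2", step1)
print(step1)
print(step2)
```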

Example 4

For our next example, we shall consider a site that generates pages containing some links with the following structure:


http://www.example.com/cgi-bin/GenerateFrame?Referer=abc&Link=http%3A%2F%2Fwww.yahoo.com%2F

Rather than linking directly to Yahoo!, this page links to a script that generates a frameset that includes the referenced page. In a default configuration, LinkScan will happily follow the link, validating the frameset and the ultimate link to Yahoo!. However, it may not be productive to do that for potentially thousands of links. Furthermore, in the (extremely unlikely) event that the link to http://www.yahoo.com/ was broken, the error would appear in one of the GenerateFrame documents and not the original referring document. In order to repair that link, one would have to backtrack through the frameset to locate the original source of the trouble.

Hence we can apply more Substitute magic:


Substitute cgi-bin/GenerateFrame.*&Link=([^&]+).* !U$1

This command will extract the value of the Link= parameter, and the special !U token instructs LinkScan that the string needs to be un-encoded. So the original link:

http://www.example.com/cgi-bin/GenerateFrame?Referer=abc&Link=http%3A%2F%2Fwww.yahoo.com%2F

is transformed on-the-fly to:

http%3A%2F%2Fwww.yahoo.com%2F

and then decoded to:

http://www.yahoo.com/

And this means LinkScan can validate the link to Yahoo! directly without checking the GenerateFrame script many, many times. Furthermore, any errors will be flagged against the original document (and not one or more steps removed).
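The extract-and-decode step can be reproduced in Python with urllib.parse.unquote. The function name unwrap is hypothetical, and the sketch simply returns the decoded capture rather than modelling how LinkScan applies the pattern to the link.

```python
import re
from urllib.parse import unquote

PATTERN = re.compile(r"cgi-bin/GenerateFrame.*&Link=([^&]+).*")

def unwrap(link):
    """Return the decoded Link= target, or the link unchanged if no match."""
    m = PATTERN.search(link)
    if not m:
        return link
    return unquote(m.group(1))     # the !U step: undo the %XX encoding

print(unwrap("http://www.example.com/cgi-bin/GenerateFrame?Referer=abc&Link=http%3A%2F%2Fwww.yahoo.com%2F"))
```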

Example 5

Next, for illustration, we include the complete configuration for a real-world, large and very complex dynamic site:


# Set the CGI limit to be very large
# Include all file types on the Map

Maxcgi = 10000
Mapinclude .*

# Force &A=B and insert it immediately after the '?'

Substitute (cgi-bin.*[&\?])A=[^&=]*&*(.*) $1$2
Substitute (cgi-bin.*\?)(.*) $1A=B&$2

# Discard null and undefined values

Substitute (cgi-bin.*)&B=(null|undefined)(.*) $1$3
Substitute (cgi-bin.*)&C=(null|undefined)(.*) $1$3
Substitute (cgi-bin.*)&D=(null|undefined)(.*) $1$3
Substitute (cgi-bin.*)&R=(null|undefined)(.*) $1$3

# For 'category', take the &C= if present, otherwise the &B=

Substitute (cgi-bin/bv/scripts/category.*\?A=B).*?(&C=[^&=]*).* $1$2
Substitute (cgi-bin/bv/scripts/category.*\?A=B).*?(&B=[^&=]*).* $1$2

# For 'content', take the &D= or &R= if present (call it &D=). Otherwise take the &B=

Substitute (cgi-bin/bv/scripts/content.*\?A=B).*?&[DR]=([^&=]*).* $1&D=$2
Substitute (cgi-bin/bv/scripts/content.*\?A=B).*?(&B=[^&=]*).* $1$2

# For 'frame', take the &D= or &R= if present (call it &D=). Otherwise take the &B=

Substitute (cgi-bin/bv/scripts/frame.*\?A=B).*?&[DR]=([^&=]*).* $1&D=$2
Substitute (cgi-bin/bv/scripts/frame.*\?A=B).*?(&B=[^&=]*).* $1$2

# For 'mailing...', take the &R=

Substitute (cgi-bin/bv/scripts/mailing.*\?A=B).*?(&R=[^&=]*).* $1$2

# For 'contact', take the &B=, &C= and &Comments

Substitute (cgi-bin/bv/scripts/contact.*\?A=B).*?(&B=[^&=]*).*?(&C=[^&=]*).*?(&Comments=[^&=]*).* $1$2$3$4

# Mark redirects to Error page as 404
# Mark documents containing 'Error Code:' as 404

Errordoc cgi-bin/bv/scripts/error.jsp
Errorbody Error\s+Code:[^\n<]*

# Hide some frequently arising errors

Noforms = 1
Exclude images/arrow.gif

Example 6

Finally, we will consider a reference to a JavaScript function:


<a href="javascript:MyFunction(4,5,6);">

The following Substitutescript command:


Substitutescript .*:MyFunction\((\d+),(\d+),(\d+)\) '/somepage.jsp?Par1=$1&Par2=$2&Par3=$3'

will transform the function call into the following link which will then be validated/processed by LinkScan.


/somepage.jsp?Par1=4&Par2=5&Par3=6
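The pattern can be exercised in Python as shown below. The function name script_to_link is hypothetical. How LinkScan treats text after the match (here the trailing semicolon) is not stated above, so the sketch builds the link from the captured groups only.

```python
import re

PATTERN = re.compile(r".*:MyFunction\((\d+),(\d+),(\d+)\)")
TEMPLATE = r"/somepage.jsp?Par1=\1&Par2=\2&Par3=\3"

def script_to_link(href):
    """Turn a matching javascript: call into a link, else return None."""
    m = PATTERN.match(href)
    return m.expand(TEMPLATE) if m else None

print(script_to_link("javascript:MyFunction(4,5,6);"))
```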

Synthesizing Additional Links

The Substitute commands may be used to modify existing links on-the-fly. However, a variation of this, the Insertlink command, may be used to insert additional links into specified documents in order to achieve a specific test coverage. Again, it is best illustrated by example:


Insertlink .*complex\.jsp\?.*SPVAR= -
Insertlink (.*complex\.jsp\?.*) /$1&ALTMODE=1 +

As each document is scanned, LinkScan will process all Insertlink commands (in the order specified). The URL of the scanned document is matched against the first parameter of each Insertlink command. In the case of the first example above, a link to:

complex.jsp?VAR=1&SPVAR=2

will match the expression and LinkScan will abort all Insertlink processing for this document (signified by the minus character).

However, a link to:

complex.jsp?VAR=1

does not match the expression. Processing will continue to the second command. This does match the expression and LinkScan will insert a link into this document (signified by the plus character). Hence, when LinkScan processes:

complex.jsp?VAR=1

It will insert into that document, the following link:

complex.jsp?VAR=1&ALTMODE=1
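The ordered processing of the two commands can be sketched in Python. The function name inserted_links is hypothetical; the sketch assumes processing continues after a plus rule and keeps the leading slash of the template exactly as written in the command.

```python
import re

RULES = [
    (r".*complex\.jsp\?.*SPVAR=", None, "-"),
    (r"(.*complex\.jsp\?.*)", r"/\1&ALTMODE=1", "+"),
]

def inserted_links(document_url):
    """Return the links that would be synthesized for one scanned document."""
    links = []
    for pattern, template, action in RULES:
        m = re.match(pattern, document_url)
        if not m:
            continue
        if action == "-":
            break                          # minus: stop all Insertlink processing
        links.append(m.expand(template))   # plus: add a synthesized link
    return links

print(inserted_links("complex.jsp?VAR=1&SPVAR=2"))
print(inserted_links("complex.jsp?VAR=1"))
```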

Hint: Clearly, the Substitute command requires a good working knowledge of Perl Regular Expressions. If you need assistance, the LinkScan engineers will be happy to help. Please write to mailto:[email protected] describing in as much detail as possible, the transformations you are seeking to achieve.

11.7 How to emulate different browser types

Most web browsers advertise their identity by including a User-Agent header with every request that they make. LinkScan also sends a User-Agent header. For example, the versions of Netscape Navigator, Microsoft Internet Explorer and LinkScan installed on the writer's computer send, respectively:


User-Agent: Mozilla/4.08 [en] (WinNT; I ;Nav)
User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)
User-Agent: LinkScan Enterprise/12.3 Windows

Some websites are constructed in a manner that is browser sensitive. They may, for example, deliver customized pages depending on the user's browser type. Hence LinkScan may be customized to emulate different browser types using the Extraheader command:


Syntax:

Extraheader literal-header-string

Example:

Extraheader User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)

In this example, LinkScan will advertise itself as Microsoft Internet Explorer version 5.5 running under Windows 2000.

In fact, the Extraheader command may be used to add any arbitrary HTTP headers to every request that LinkScan sends. A common application involves those servers which look for a language preference in the HTTP headers in order to deliver pages in the appropriate language. For example, the following command instructs LinkScan to include an English Language preference header with each request:


Extraheader Accept-Language: en
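For comparison, the following Python sketch attaches the same literal header strings to a request object from the standard urllib.request module. Nothing is sent over the network; the function name build_request is hypothetical and the sketch only shows what "adding a header to every request" amounts to.

```python
from urllib.request import Request

def build_request(url, extra_headers):
    """Create a request carrying each 'Name: value' header string."""
    req = Request(url)
    for line in extra_headers:
        name, _, value = line.partition(":")
        req.add_header(name.strip(), value.strip())
    return req

req = build_request("http://www.example.com/", [
    "User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)",
    "Accept-Language: en",
])
print(req.header_items())
```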

11.8 How to remap different hosts

Sometimes a single website may contain links such as:


http://www.example.com/
http://www2.example.com/

Here www.example.com and www2.example.com resolve to the same host IP address. However, LinkScan would consider www2.example.com to be an External Link and not part of the www.example.com Project. Hence the Hostalias command may be used to assign more than one name to the current server. Syntax and example:


Syntax:

Hostalias from-server-url to-server-url

Example:

Hostalias http://www2.example.com/  http://www.example.com/

A similar issue arises when scanning development or staging servers. For example, you may wish to scan the site:


http://staging.example.com/

but the site may contain one or more absolute links to http://www.example.com/. In this case, you can use the Mirrorurl command.


Syntax:

Mirrorurl absolute-url

Example:

Homeurl = http://www.example.com/
Mirrorurl = http://staging.example.com/

In this case, LinkScan will resolve all links as if it were scanning http://www.example.com/. However, all actual HTTP requests will be directed to http://staging.example.com/. This provides a convenient mechanism for scanning development and staging copies of a production website.
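The split between "resolved as" and "requested from" can be sketched in Python. The function name request_url is hypothetical; the two constants mirror the Homeurl and Mirrorurl values in the example.

```python
from urllib.parse import urljoin

HOMEURL = "http://www.example.com/"
MIRRORURL = "http://staging.example.com/"

def request_url(base_document, href):
    """Return (resolved link, URL actually requested)."""
    resolved = urljoin(base_document, href)       # resolve as if on Homeurl
    if resolved.startswith(HOMEURL):
        return resolved, MIRRORURL + resolved[len(HOMEURL):]
    return resolved, resolved                     # external link: unchanged

print(request_url("http://www.example.com/docs/index.html", "../about.html"))
print(request_url("http://www.example.com/docs/index.html", "http://www.example.com/a.html"))
```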

11.9 How to assign documents to Owners

You may define the ownership of any given document or file in one of several ways. Ownership directives are evaluated in the order specified with the last match taking precedence. Note that the file ownership attribute is case sensitive.

  1. By the Unix File System ownership attribute. Note: this is not supported on Windows systems

  2. By the Defaultowner command. The syntax for the Defaultowner command is:

    Defaultowner owner-name

  3. By pattern matching with one or more Owner commands. The syntax for the Owner command is:

    Owner relative-path-expression owner-name
    OR
    Ownerq relative-path-expression owner-name

    The Owner command operates on the pathname portion of the URL and does not process any query string (following a "?" character). The Ownerq command operates on the entire URL including any query string.

    LinkScan also supports a special variation of the Owner command. This will automatically assign every file an owner-name based on the name of the directory in which it resides. The syntax is:

    Owner *integer

    The default setting (Owner *1) will assign each document to an Owner based on the top-level directory name (i.e. under "www root"). A setting of Owner *2 will cause LinkScan to assign Ownership based on the first two directory names. For example:

    http://www.example.com/first/second/third/index.html

    Will be assigned to the Owner first_second.

  4. By using preexisting META tags in your HTML documents. For example, if your existing documents already contain tags of the form:

    <META NAME="CONTENT_OWNER" CONTENT="Malcolm Hoar">

    You may set the Owner to 'Malcolm Hoar' by configuring a suitable pattern. e.g.:

    Ownertags = ^meta\s+name\s*=\s*"content_owner"\s+content\s*=\s*"([^"]+)

  5. Finally, once an Owner has been assigned to the file or document, you may manipulate the Owner string with a simple pattern substitution:

    Owneralias .*?([a-zA-Z0-9]+)[\s\.\)]*$ \L$1

    This example would take the string 'Malcolm Hoar' and convert the ownership to 'hoar'. This technique may be used to deal with synonyms such as 'M. Hoar.', 'Malcolm C Hoar '.
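The directory-based assignment (Owner *integer) and the Owneralias substitution can be tried out with the following Python sketch. The function names are hypothetical, and owner_from_dirs assumes the URL ends in a file name.

```python
import re
from urllib.parse import urlsplit

def owner_from_dirs(url, levels):
    """Owner *N: join the first N directory names with underscores."""
    dirs = urlsplit(url).path.strip("/").split("/")[:-1]   # drop the file name
    return "_".join(dirs[:levels])

def owner_alias(owner):
    """Owneralias example: keep the last word and lower-case it (\\L$1)."""
    m = re.match(r".*?([a-zA-Z0-9]+)[\s\.\)]*$", owner)
    return m.group(1).lower() if m else owner

print(owner_from_dirs("http://www.example.com/first/second/third/index.html", 2))
for name in ["Malcolm Hoar", "M. Hoar.", "Malcolm C Hoar "]:
    print(owner_alias(name))
```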


Example:

Defaultowner elsop         # Set default
Owner *1                   # Assign Owner based on top level dir ...
Owner wrc/humor/ humor     # But, make this subdir look like top-level
Owner .*\.cgi$ webmaster   # And give all *.cgi files to webmaster

When using LinkScan Dispatch to create reports for delivery by Electronic mail, you may define associations between Owners and Addresses with the Mailalias command. The syntax is:

Mailalias expression list-of-addresses

list-of-addresses may be a comma-separated list of addresses if you wish to distribute the report to multiple recipients. Use Mailalias owner-name null to skip a specific Owner.


Example:

Defaultowner elsop         # Set default
Owner *1                   # Assign Owner based on top level dir ...
Owner wrc/humor/ humor     # But, make this subdir look like top-level
Owner .*\.cgi$ webmaster   # And give all *.cgi files to webmaster

Mailalias elsop            [email protected], [email protected]
Mailalias links            [email protected]
Mailalias linkscan         [email protected]
Mailalias wrc              [email protected]
Mailalias humor            [email protected]
Mailalias test             null

If no Mailaliases are defined, Dispatch will address the reports to Ownername @ Mailhost.

11.10 How to process additional per-document data

Facilities are provided to extract additional data from each document scanned, store those data in the LinkScan database and create various reports. The additional data collected are typically collected from the META tags in each HTML document.

Supported commands are provided for data extraction, substitution/manipulation and formatting:


# Userdata [123] match-expression expression
# Userdatafmt [123] [DHLTX] integer[LRC] caption
# D=date; H=hot links; L=link; T=truncate to format; X=normal
# Userdatasub [123] expression expression

The following example illustrates the use of these commands to extract and process an employee badge number from document META tags:


Userdata 1 (?i)<meta\s[^>]*employee\s*=\s*"\s*(#?\d+)\s*" $1
Userdatasub 1 #?(\d+) $1
Userdatafmt 1 X 6R Badge-Number

In the above example, we use the first of the three available userdata fields. The first command extracts the badge number from the document META tag. The second command performs a substitution on the matched data to remove an optional pound symbol from the badge number. The third command defines the formatting attributes; X defines a simple text field; 6R specifies a six-character, right-adjusted layout and Badge-Number defines a simple caption.
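The three steps can be followed in Python on a sample tag. The META tag below is hypothetical and chosen only so that the extraction pattern matches; rjust stands in for the 6R (six-character, right-adjusted) layout.

```python
import re

html = '<META NAME="author" EMPLOYEE="#12345">'   # hypothetical sample tag

# Userdata 1: extract the badge number, including any leading '#'
m = re.search(r'(?i)<meta\s[^>]*employee\s*=\s*"\s*(#?\d+)\s*"', html)
raw = m.group(1) if m else ""

# Userdatasub 1: strip the optional pound symbol
badge = re.sub(r"#?(\d+)", r"\1", raw)

# Userdatafmt 1 X 6R: six characters wide, right-adjusted
cell = badge.rjust(6)

print(repr(raw), repr(badge), repr(cell))
```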

During the course of the scan, the employee badge numbers are extracted from each document and stored in the LinkScan database. In fact, the userdata fields are stored in a separate file:


PATH-TO-LINKSCAN/Project-name/data/linkscan.usr

This means that it is relatively simple to post-process the data before creating reports. For example, in this case, one might translate the badge numbers to employee names via a lookup on an employee database. The linkscan.usr file is a simple ASCII file with <Control-G> field delimiters.

The final data may be searched/viewed using the Search Documents Report and/or Changed Document Report.

11.11 How to control the testing of external links

LinkScan includes the capability to maintain a History File containing the date/time tested and status of all external links. This feature may be enabled and controlled via various settings in linkscan.sys.

A Site History Report, available from the main LinkScan Reports Menu, may be used to examine the historic behavior of doubtful links.

Once enabled, the LinkScan History file may be used to avoid testing links to remote servers with an excessive frequency. Appropriate use of the following controls will help ensure that you do not impose unnecessary loads on the network or the remote servers your links access. This feature enables you to be a responsible user of the network. But equally important, it can significantly speed up the testing of large projects. Note: The Site History Feature must be enabled (Maxhist > 0) for these settings to be effective:

Masterhist: Normally, LinkScan will maintain a History file on a per-Project basis. Enabling this feature will force LinkScan to maintain a single History file (in the LinkScan directory) for all Projects. Concurrency control is provided to ensure that the file is not damaged when scanning two or more Projects simultaneously.
[Default: Masterhist = 0 (Disabled) ]

Maxhist: The maximum number of entries maintained in the History File for each external link.
[Default: Maxhist = 0 (Disabled) ]

Maxgoodhours: The maximum number of hours between attempts to retest good external links. The scanning of URLs that have been checked within the specified period is skipped and the LinkScan Reports display the Status Code from the prior test.
[Default: Maxgoodhours = 0 (Disabled) ]

Maxbadhours: The maximum number of hours between attempts to retest bad external links. The scanning of URLs that have been checked within the specified period is skipped and the LinkScan Reports display the Status Code from the prior test.
[Default: Maxbadhours = 0 (Disabled) ]
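
The retest decision implied by Maxgoodhours and Maxbadhours can be sketched in Python. The function name should_retest and its arguments are hypothetical; the sketch assumes a value of 0 means the link is always retested.

```python
from datetime import datetime, timedelta

def should_retest(last_tested, was_good, now, maxgoodhours=0, maxbadhours=0):
    """Decide whether an external link is due for another test."""
    limit = maxgoodhours if was_good else maxbadhours
    if limit <= 0 or last_tested is None:
        return True                         # disabled, or never tested before
    return now - last_tested >= timedelta(hours=limit)

now = datetime(2012, 1, 2, 12, 0)
print(should_retest(now - timedelta(hours=5), True, now, maxgoodhours=24))
print(should_retest(now - timedelta(hours=30), True, now, maxgoodhours=24))
```

With Maxgoodhours = 24, a good link tested five hours ago is skipped and the prior Status Code is reported, while one tested thirty hours ago is tested again.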

In addition, the following options are available via linkscan.cfg

Noexternal: Disable the checking of all External links.
[Default: Noexternal = 0 (Disabled) ]

Fetchext: Fetch the document bodies when checking External links. Enabling this option incurs a significant performance and bandwidth overhead. Typically, it is only used in conjunction with the LinkScan Profiler which will enable Fetchext automatically when required.
[Default: Fetchext = 0 (Disabled) ]

Followext: Follow all HTTP redirections.
[Default: Followext = 1 (Enabled) ]

Maxdns: Limit the total number of failed DNS lookups performed on a given hostname. After more than Maxdns failed lookups on the same host, all subsequent links to that host are assumed to be bad. This avoids an excessive number of timeouts while trying to resolve the same hostname.
[Default: Maxdns = 3 ]

Retryext: When enabled, LinkScan will track all External links that appear to fail due to network related errors (e.g. DNS, connect and timeout errors). These links will be retested at the end of the scan. This tends to reduce the number of transient errors reported but the scan may require a little more time to complete.
[Default: Retryext = 0 (Disabled) ]

Showredirext: Enable this option when you want LinkScan to warn/report on redirections and store the status of the final (redirected) link.
[Default: Showredirext = 0 (Disabled) ]

How to control the hits on any one server

You may also control the number of hits per server with the following commands in linkscan.sys.

Maxservertries: The maximum number of links that should be tested on any given server when that server is apparently "dead". Once this limit is exceeded, all other links to that server are skipped and assigned a URL Skipped - Bad Server (801) Status Code.
[Default: Maxservertries = 25 ]

Maxftp: The maximum number of links to any single FTP server that should be validated. Once this limit is exceeded, all other FTP links to that server are skipped and assigned a URL Skipped - FTP Limit (802) Status Code.
[Default: Maxftp = 25 ]

FTPUser and FTPPass: Define the username and password that LinkScan will use when validating links to FTP sites.
[Default: FTPUser = anonymous; FTPPass = [email protected] ]

Active Validation of mailto: Links

In a default configuration, LinkScan performs a simple syntax check on mailto: links. Active checking of mailto: links may be configured -- LinkScan uses our Mailvet™ technology to contact the mail servers associated with the specified address and attempts to establish the validity of the address without actually sending a message. To enable this feature:

  1. Ensure the Perl Module Net::DNS is installed on your computer. The Net::DNS Module is available from http://www.net-dns.org/
  2. Configure the Hostname setting in linkscan.sys. This value is used for the SMTP HELO message and, for maximum accuracy, should match the Reverse DNS hostname of your computer. If your computer does not have a Reverse DNS entry, some mail servers configured with anti-SPAM measures may produce false errors.
  3. Configure the Mailfrom setting in linkscan.sys. This value is used for the SMTP MAIL FROM message and, for maximum accuracy, should be a valid (deliverable) return address.
  4. Set Checkmailto = 1 in linkscan.cfg.

On some systems, Net::DNS may not correctly identify the default name servers from your operating system configuration. If you encounter difficulties, please run the following test script:

perl ./utils/dns.pl

You may also configure DNS name server addresses in linkscan.sys by adding an entry such as:


Nameservers = 10.10.10.10, 10.10.10.20

11.12 Other miscellaneous customizations

This section deals with a few other miscellaneous commands:

LinkScan for Unix. Reference Manual. Section 11. Advanced and Custom Scanning
LinkScan Version 12.3
© Copyright 1997-2012 Electronic Software Publishing Corporation (Elsop)
LinkScan™ and Elsop™ are Trademarks of Electronic Software Publishing Corporation
