LinkScan for Unix. Reference Manual. | Section 25 |
This section defines some LinkScan constructs and related terminology with reference to various standards, where appropriate:
LinkScan is able to scan multiple websites. It can also scan the same website multiple times with different configuration options. In each case, LinkScan creates a unique and corresponding LinkScan Database containing the results of the analysis. Together, the configuration files and database constitute a LinkScan Project.
Each LinkScan Project is stored within a subdirectory of the main LinkScan installation directory.
Hence users must always select a Project when scanning a website, and they must also select a Project when viewing the results.
Within each Project, you may also configure multiple LinkScan Owners. Collections of HTML documents and other files are assigned between Owners in a variety of ways:
The LinkScan Owner concept enables individual content developers or workgroups to view results that pertain to their documents or areas of responsibility.
LinkScan incorporates access controls that may be used to limit user access to LinkScan databases and results. These controls are not enabled by default.
When these controls are activated, users may be required to log in to the LinkScan system using a pre-defined LinkScan Username and associated password. The Username defines the Projects and Owners that an individual user is permitted to access.
A Virtual Host is the Fully Qualified Domain Name (or IP address) of a network host configured on your server. Many servers are configured for a single Virtual Host but others are configured to support multiple Virtual Hosts. You must define at least one LinkScan Project for each Virtual Host that you wish to test.
Pathnames are used to refer to directory structures. They may be Relative or Absolute. Note also that Pathnames are used in the URL context and the File System context. For example:
/usr/www/htdocs/products/widget.html          # Absolute pathname, file system context
C:/www/products/widget.html                   # Absolute pathname, file system context
http://www.example.com/products/widget.html   # Absolute URL
../products/widget.html                       # Relative link, URL or file system context
LinkScan makes extensive use of a normalized representation such that the documents referred to above would be referenced as:
products/widget.html
This offers the advantages of brevity and consistency, since products/widget.html may typically be used to refer to both:
C:/www/products/widget.html and
http://www.example.com/products/widget.html
The normalized format is referred to in this document as relative-path.
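The normalization described above can be sketched in Python. This is only an illustration, not LinkScan's implementation: the function name `to_relative_path` and the two root constants are hypothetical, and the sketch assumes the URL root and file system root are already known.

```python
# Hypothetical roots for illustration; these are not LinkScan settings.
URL_ROOT = "http://www.example.com/"
FILE_ROOT = "C:/www/"


def to_relative_path(reference: str) -> str:
    """Strip a known URL or file system root, leaving a relative-path."""
    for root in (URL_ROOT, FILE_ROOT):
        if reference.startswith(root):
            return reference[len(root):]
    raise ValueError(f"{reference!r} is not under a known root")


# Both references reduce to the same normalized form.
print(to_relative_path("C:/www/products/widget.html"))
print(to_relative_path("http://www.example.com/products/widget.html"))
```

Both calls yield products/widget.html, which is why a single relative-path can stand for the file and for its URL.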
Many LinkScan customization features refer to relative-path-expression. That is a Perl Regular Expression matching a relative-path.
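As a rough illustration of a relative-path-expression, the sketch below filters a list of relative-paths with a regular expression. It uses Python's `re` module, whose syntax overlaps with Perl for simple patterns like this one but is not identical to it. The pattern and the sample paths are made up for the example.

```python
import re

# Hypothetical relative-path-expression: any HTML document under products/.
expression = re.compile(r"^products/.*\.html$")

# Sample relative-paths, invented for this example.
paths = [
    "products/widget.html",
    "products/images/widget.gif",
    "support/index.html",
]

# Keep only the relative-paths that the expression matches.
matching = [path for path in paths if expression.search(path)]
print(matching)
```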
The directory on your server that is considered to be the root directory of your HTTP server. Sometimes known as www root.
The directory on your computer where LinkScan is installed.
A subdirectory of the LinkScan Directory containing the configuration and data files associated with a specific Project.
The various Uniform Resource Locator formats are defined in RFC 2396.
Internal Links are defined as links to the current Project.
Examples:
<a href="filename.html">This is an Internal Link</a>
<a href="http://www.elsop.com/index.html">This is an Internal Link if the current Project is http://www.elsop.com/</a>
External Links are defined as links specified using an Absolute URL to any Project other than the current Project.
Example: <a href="http://www.otherdomain.com/">This is an External Link</a>
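The distinction between Internal and External Links can be sketched as follows. This is a simplified illustration that assumes a Project is identified by its host name alone. The function `classify_link` is hypothetical and is not part of LinkScan.

```python
from urllib.parse import urljoin, urlsplit


def classify_link(href: str, page_url: str, project_url: str) -> str:
    """Label a link as internal or external (assumes one host per Project)."""
    # Resolve relative links against the page that contains them.
    absolute = urljoin(page_url, href)
    link_host = urlsplit(absolute).netloc.lower()
    project_host = urlsplit(project_url).netloc.lower()
    return "internal" if link_host == project_host else "external"


page = "http://www.elsop.com/index.html"
project = "http://www.elsop.com/"
print(classify_link("filename.html", page, project))
print(classify_link("http://www.otherdomain.com/", page, project))
```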
Orphaned Files are defined as files present in the Home Directory (or any subdirectory thereof) which cannot be reached via one or more internal links from the Home Page.
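The definition above amounts to a reachability test: collect every file that can be reached from the Home Page by following internal links, then subtract that set from the files on disk. The sketch below shows the idea on a small made-up link graph. The function `find_orphans` and the sample data are hypothetical, and LinkScan's own method may differ.

```python
def find_orphans(all_files, links, home_page):
    """Return the files that are not reachable from home_page via links."""
    reachable = set()
    pending = [home_page]
    while pending:
        page = pending.pop()
        if page in reachable:
            continue
        reachable.add(page)
        # Queue every document this page links to.
        pending.extend(links.get(page, ()))
    return sorted(set(all_files) - reachable)


# Files found under the Home Directory (relative-paths, invented here).
files = ["index.html", "products/widget.html", "old/draft.html"]
# Internal links found in each document.
links = {
    "index.html": ["products/widget.html"],
    "products/widget.html": ["index.html"],
}
print(find_orphans(files, links, "index.html"))
```

Here old/draft.html is reported because no document links to it.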
The HyperText Markup Language (HTML 3.2) lies at the heart of the World Wide Web.
LinkScan attempts to parse the HTML source code according to the published standards. However, as with all web browsers, the results can be unpredictable when the HTML source code deviates from the specifications. Experience with LinkScan indicates that the following points are worthy of note.
The HyperText Transfer Protocol (HTTP 1.0) has been used for World Wide Web communications since 1990. In January 1997, the first specifications for HTTP 1.1 were published. LinkScan exploits many HTTP features to establish the status of the external links.
In most cases LinkScan is able to definitively establish the status of any given link. However, at any moment in time a small proportion of links (typically around 5%) are temporarily unavailable. In such cases, LinkScan will make two attempts to reach the site before flagging those URLs as "Possible Errors" to be retested at a later time (automatically or manually).
An even smaller percentage of sites are accessible via a web browser but fail to return message headers in accordance with the HTTP specifications. In many cases, LinkScan is still able to establish the status, but a few sites are so grossly non-compliant that LinkScan will return an "Unknown Error" to flag them for manual testing. In tests, only one or two sites per thousand fell into this category.
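The two-attempt behavior described above can be sketched as a small retry loop. This is an assumption-laden illustration and not LinkScan's code. The `probe` callable is a hypothetical stand-in for whatever performs the HTTP request, and the "OK" and "Error" labels are invented. Only "Possible Error" comes from the text.

```python
def check_link(url, probe, attempts=2):
    """Try a link up to `attempts` times before flagging it for retest.

    `probe(url)` is assumed to return an HTTP status code, or to raise
    OSError when the site cannot be reached at all.
    """
    for _ in range(attempts):
        try:
            status = probe(url)
        except OSError:
            continue  # site unreachable on this attempt; try again
        return "OK" if status < 400 else "Error"
    return "Possible Error"


def always_down(url):
    """Fake probe that simulates a temporarily unavailable site."""
    raise OSError("connection refused")


print(check_link("http://www.example.com/", always_down))
```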
The File Transfer Protocol (FTP) is a relatively old standard, compared to HTTP. See RFC 959.
Typically, LinkScan accesses the scanned website via the Network and HTTP. This is an appropriate method in most cases.
Optionally, LinkScan may be configured to access part or all of the scanned website by direct access to the website files on your computer's file system. This offers several advantages and disadvantages:
File System Scanning is extremely fast when you need to scan very large numbers of static HTML documents.
File System Scanning enables the identification of Orphaned Files.
File System Scanning is generally inappropriate for dynamically generated pages.
File System Scanning involves a more complex configuration than HTTP Scanning.
Note that LinkScan may also be configured to scan a site using a combination of both the HTTP and File System methods. This powerful capability may be used, for example, to enable HTTP Scanning of website content and the comparison of the results with those from File System Scanning to reconcile the Orphaned Files.
In addition to HTTP Scanning and File System Scanning, LinkScan supports a third mode of operation: Import Scanning. This is used to validate lists of Documents or Links that are imported from simple text files. The Import Lists may be prepared manually but it is more common for them to be exported from a database management system or other application.
LinkScan incorporates a vast array of customization features, many of which exploit the power of Perl Regular Expressions. For a description of Perl Regular Expressions on Unix systems, see man perlre. HTML versions are available at many locations including:
http://perldoc.perl.org/perlre.html
We also recommend the book Mastering Regular Expressions (a.k.a. the Owl Book) by Jeffrey E.F. Friedl, and published by O'Reilly [ISBN: 1-56592-257-3].
When files are served via the Hypertext Transfer Protocol (HTTP) the normal conventions with respect to file extensions do not apply. The content of the file is defined by an HTTP Content-Type header (a.k.a. MIME type). Common examples include:
Content-Type: text/html
Content-Type: image/gif
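To illustrate why the header matters more than the file extension, the sketch below extracts the media type from a Content-Type value and ignores any parameters such as charset. The helper `media_type` and the sample header values are hypothetical.

```python
def media_type(header_value: str) -> str:
    """Return the bare media type from a Content-Type header value."""
    # Drop parameters such as "; charset=..." and normalize case.
    return header_value.split(";", 1)[0].strip().lower()


# A URL ending in .cgi may still be served as an HTML document.
print(media_type("text/html; charset=ISO-8859-1"))
print(media_type("image/gif"))
```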
LinkScan always attempts to store a date/time stamp with each document to indicate when the file was last modified. When scanning via the File System, LinkScan is able to capture this data directly from the operating system. However, when LinkScan does not have direct access to the server File System, it looks for an HTTP Last-Modified header. Most web servers supply this header when serving static HTML documents (without Server Side Includes). It is typically not supplied when serving dynamic pages, so the data may not be available. Note however, that LinkScan does have the ability to extract information of this type from META tags when available -- see How to process additional per-document data.
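The Last-Modified lookup can be sketched with Python's standard library. The function `last_modified` and the sample headers are hypothetical. The sketch only shows how such a header value can be parsed, and that a missing header leaves no date to record.

```python
from email.utils import parsedate_to_datetime


def last_modified(headers: dict):
    """Parse the Last-Modified header, or return None when it is absent."""
    value = headers.get("Last-Modified")
    if value is None:
        return None  # typical for dynamically generated pages
    return parsedate_to_datetime(value)


static_page = {"Last-Modified": "Wed, 21 Oct 2015 07:28:00 GMT"}
dynamic_page = {"Content-Type": "text/html"}

print(last_modified(static_page))
print(last_modified(dynamic_page))
```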
LinkScan calculates the total weight of each document. This calculation is based on the total in-line byte count and takes account of:
LinkScan tracks and stores the depth of each document during the course of the scan. The depth reflects the number of hyperlinks the user must click to reach the target starting from the initial URL. Note that LinkScan uses a breadth-first algorithm to scan a site. In general, the click-count is not incremented when following:
LinkScan for Unix. Reference Manual. Section 25. Glossary of Terms
LinkScan Version 12.3
© Copyright 1997-2012
Electronic Software Publishing Corporation (Elsop)
LinkScan and Elsop are Trademarks of Electronic Software Publishing Corporation