LinkScan for Unix. Reference Manual. Section 25. Glossary of Terms

LinkScan for Unix. Reference Manual.

Section 25

Glossary of Terms

This section defines some LinkScan constructs and related terminology, with reference to various standards where appropriate:

1. Projects	2. Owners	3. Usernames
4. Virtual Hosts	5. Pathnames	6. Pathname Expressions
7. Home Directory	8. LinkScan Directory	9. Project Directory
10. Uniform Resource Locators (URL's)	11. Internal Links	12. External Links
13. Orphaned Files	14. HyperText Markup Language (HTML)	15. HyperText Transfer Protocol (HTTP)
16. File Transfer Protocol (FTP)	17. HTTP Scanning	18. File System Scanning
19. Import Scanning	20. Perl Regular Expressions	21. Content-Type/MIME
22. Date and Time Last-Modified	23. Document Weight	24. Click Depth

25.1 Projects

LinkScan is able to scan multiple websites. It can also scan the same website multiple times with different configuration options. In each case, LinkScan creates a unique and corresponding LinkScan Database containing the results of the analysis. Together, the configuration files and database constitute a LinkScan Project.

Each LinkScan Project is stored within a subdirectory of the main LinkScan installation directory.

Hence users must always select a Project when scanning a website, and they must select a Project again when viewing the results.

25.2 Owners

Within each Project, you may also configure multiple LinkScan Owners. Collections of HTML documents and other files are assigned to Owners in a variety of ways:

By the Unix File System ownership attribute
By subdirectories within the website
By pattern matching on directory and file names
By Meta Tags inserted in individual documents

The LinkScan Owner concept enables individual content developers or workgroups to view results that pertain to their documents or areas of responsibility.

25.3 Usernames

LinkScan incorporates access controls that may be used to limit user access to LinkScan databases and results. These controls are not enabled by default.

When activated, users may be required to log in to the LinkScan system using a pre-defined LinkScan Username and associated password. The Username defines the Projects and Owners that an individual user is permitted to access.

25.4 Virtual Hosts

A Virtual Host is the Fully Qualified Domain Name (or IP address) of a network host configured on your server. Many servers are configured for a single Virtual Host but others are configured to support multiple Virtual Hosts. You must define at least one LinkScan Project for each Virtual Host that you wish to test.

25.5 Pathnames

Pathnames are used to refer to directory structures. They may be Relative or Absolute. Note also that Pathnames are used in the URL context and the File System context. For example:

/usr/www/htdocs/products/widget.html          # Absolute pathname, file system context
C:/www/products/widget.html                   # Absolute pathname, file system context
http://www.example.com/products/widget.html   # Absolute URL
../products/widget.html                       # Relative link, URL or file system context

LinkScan makes extensive use of a normalized representation such that the documents referred to above would be referenced as:

products/widget.html

This offers the advantages of brevity and consistency, since products/widget.html may typically be used to refer to both:

C:/www/products/widget.html and
http://www.example.com/products/widget.html

The normalized format is referred to in this document as relative-path.
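
As an illustration only, the normalization of the two absolute forms above can be sketched in Python. The function name and its parameters (`home_directory`, `base_url`) are assumptions made for this example and are not LinkScan settings; LinkScan's own normalization may differ.

```python
def to_relative_path(reference, home_directory, base_url):
    """Reduce an absolute file pathname or absolute URL to a relative-path.

    home_directory and base_url are hypothetical inputs for this sketch.
    """
    for prefix in (home_directory, base_url):
        # Compare with exactly one trailing slash so "/www" does not match "/wwwroot".
        prefix = prefix.rstrip("/") + "/"
        if reference.startswith(prefix):
            return reference[len(prefix):]
    raise ValueError(f"{reference!r} is outside the home directory and base URL")


# Both absolute forms reduce to the same relative-path.
print(to_relative_path("C:/www/products/widget.html",
                       "C:/www", "http://www.example.com/"))
print(to_relative_path("http://www.example.com/products/widget.html",
                       "C:/www", "http://www.example.com/"))
```

Relative links such as ../products/widget.html are not handled by this sketch, because resolving them requires the location of the referring document.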

25.6 Pathname Expressions

Many LinkScan customization features refer to a relative-path-expression, that is, a Perl Regular Expression that matches a relative-path.
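
The idea can be sketched as follows. LinkScan itself takes Perl Regular Expressions; Python's `re` module is used here only for illustration, and the pattern shown (a hypothetical expression selecting HTML documents under products/) uses syntax that is common to both.

```python
import re

# Hypothetical relative-path-expression: any .html document under products/.
EXPRESSION = re.compile(r"^products/.*\.html$")


def matches(relative_path):
    """Return True when the relative-path matches the expression."""
    return EXPRESSION.search(relative_path) is not None


for path in ("products/widget.html", "products/widget.gif", "support/faq.html"):
    print(path, matches(path))
```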

25.7 Home Directory

The directory on your server that is considered to be the root directory of your HTTP server. Sometimes known as www root.

25.8 LinkScan Directory

The directory on your computer where LinkScan is installed.

25.9 Project Directory

A subdirectory of the LinkScan Directory containing the configuration and data files associated with a specific Project.

25.10 Uniform Resource Locators (URL's)

The various Uniform Resource Locator formats are defined in RFC 2396.

25.11 Internal Links

Internal Links are defined as links, relative or absolute, that point to documents within the current Project.


Examples:

<a href="filename.html">This is an Internal Link</a>

<a href="http://www.elsop.com/index.html">This is an Internal
Link if the current Project is http://www.elsop.com/</a>

25.12 External Links

External Links are defined as links specified using an Absolute URL to any Project other than the current Project.


Example:

<a href="http://www.otherdomain.com/">This is an External Link</a>
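
The distinction between Internal and External Links can be sketched in Python as follows. This is a minimal illustration, assuming that a Project is identified by the scheme, host and path prefix of its starting URL; the function name `classify_link` is hypothetical and the rule is a simplification of whatever LinkScan actually applies.

```python
from urllib.parse import urljoin, urlsplit


def classify_link(href, page_url, project_url):
    """Return "internal" or "external" for an href found on page_url."""
    absolute = urljoin(page_url, href)  # resolves relative links against the page
    target = urlsplit(absolute)
    project = urlsplit(project_url)
    same_host = (target.scheme, target.netloc.lower()) == (
        project.scheme, project.netloc.lower())
    if same_host and target.path.startswith(project.path):
        return "internal"
    return "external"


page = "http://www.elsop.com/index.html"
project = "http://www.elsop.com/"
for href in ("filename.html", "http://www.elsop.com/index.html",
             "http://www.otherdomain.com/"):
    print(href, classify_link(href, page, project))
```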

25.13 Orphaned Files

Orphaned Files are defined as files present in the Home Directory (or any subdirectory thereof) that cannot be reached by following one or more internal links from the Home Page.
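
The definition amounts to a set difference between the files found on disk and the files reachable from the Home Page. A minimal Python sketch, assuming the link structure is already available as a dictionary of relative-paths (the names `reachable_from`, `find_orphans` and the sample data are hypothetical):

```python
def reachable_from(home_page, links):
    """Collect every document reachable from home_page via internal links."""
    seen = {home_page}
    pending = [home_page]
    while pending:
        page = pending.pop()
        for target in links.get(page, ()):
            if target not in seen:
                seen.add(target)
                pending.append(target)
    return seen


def find_orphans(files_on_disk, home_page, links):
    """Files present on disk that no chain of internal links reaches."""
    return sorted(set(files_on_disk) - reachable_from(home_page, links))


files = ["index.html", "products/widget.html", "old/draft.html"]
links = {
    "index.html": ["products/widget.html"],
    "old/draft.html": ["index.html"],  # links out, but nothing links in
}
print(find_orphans(files, "index.html", links))
```

Note that old/draft.html is reported even though it links to the Home Page, because no link leads to it.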

25.14 HyperText Markup Language (HTML)

The HyperText Markup Language (HTML 3.2) lies at the heart of the World Wide Web.

LinkScan attempts to parse the HTML source code according to the published standards. However, as with all web browsers, the results can be unpredictable when the HTML source code deviates from the specifications. Experience with LinkScan indicates that the following points are worthy of note.

Non-conformant HTML code almost always causes variable and unpredictable results with web browsers and other software
Unmatched <>'s invariably cause problems
Unmatched quotes invariably cause problems
Unquoted strings are often a cause of problems. Strictly speaking, many HTML parameters must be quoted if they contain characters other than letters, numbers, periods or hyphens
It is especially important to quote strings containing embedded spaces.

25.15 HyperText Transfer Protocol (HTTP)

The HyperText Transfer Protocol (HTTP 1.0) has been used for World Wide Web communications since 1990. In January 1997, the first specifications for HTTP 1.1 were published. LinkScan exploits many HTTP features to establish the status of the external links.

In most cases LinkScan is able to definitively establish the status of any given link. However, at any moment in time a small proportion of links (typically around 5%) are temporarily unavailable. In such cases, LinkScan will make two attempts to reach the site before flagging those URL's as "Possible Errors" to be retested at a later time (automatically or manually).

An even smaller percentage of sites are accessible via a web browser but fail to return message headers in accordance with the HTTP specifications. In many cases, LinkScan is still able to establish the status, but a few sites are so grossly non-compliant that LinkScan will return an "Unknown Error" to flag them for manual testing. In tests, only one or two sites per thousand fell into this category.
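
The retry behavior described above can be sketched in Python. This is not LinkScan's implementation: `fetch` is a hypothetical caller-supplied function, the "OK" and "Error" labels are assumptions (only "Possible Error" and "Unknown Error" come from the text), and treating every status below 400 as good is a simplification.

```python
def check_link(url, fetch, attempts=2):
    """Classify a link as "OK", "Error", "Possible Error" or "Unknown Error".

    fetch(url) is a hypothetical function that returns an integer HTTP status
    code, raises TimeoutError or ConnectionError when the site cannot be
    reached, and raises ValueError when the response cannot be parsed.
    """
    for _ in range(attempts):
        try:
            status = fetch(url)
        except (TimeoutError, ConnectionError):
            continue  # temporarily unreachable: make another attempt
        except ValueError:
            return "Unknown Error"  # response does not follow the HTTP specification
        return "OK" if 200 <= status < 400 else "Error"
    return "Possible Error"  # every attempt failed: retest at a later time


def always_down(url):
    raise ConnectionError("host unreachable")


print(check_link("http://www.example.com/", lambda url: 200))
print(check_link("http://www.example.com/", always_down))
```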

25.16 File Transfer Protocol (FTP)

The File Transfer Protocol (FTP) is a relatively old standard, compared to HTTP. See RFC 959.

25.17 HTTP Scanning

Typically, LinkScan accesses the scanned website via the Network and HTTP. This is an appropriate method in most cases.

25.18 File System Scanning

Optionally, LinkScan may be configured to access part or all of the scanned website by reading the website files directly from your computer's file system. This offers several advantages and disadvantages:

File System Scanning is extremely fast when you need to scan very large numbers of static HTML documents.
File System Scanning enables the identification of Orphaned Files.
File System Scanning is generally inappropriate for dynamically generated pages.
File System Scanning involves a more complex configuration than HTTP Scanning.

Note that LinkScan may also be configured to scan a site using a combination of both the HTTP and File System methods. This powerful capability may be used, for example, to enable HTTP Scanning of website content and the comparison of the results with those from File System Scanning to reconcile the Orphaned Files.

25.19 Import Scanning

In addition to HTTP Scanning and File System Scanning, LinkScan supports a third mode of operation: Import Scanning. This is used to validate lists of Documents or Links that are imported from simple text files. The Import Lists may be prepared manually, but it is more common for them to be exported from a database management system or other application.

25.20 Perl Regular Expressions

LinkScan incorporates a vast array of customization features, many of which exploit the power of Perl Regular Expressions. For a description of Perl Regular Expressions on Unix systems, see man perlre. HTML versions are available at many locations including:

http://perldoc.perl.org/perlre.html

We also recommend the book Mastering Regular Expressions (a.k.a. the Owl Book) by Jeffrey E.F. Friedl, and published by O'Reilly [ISBN: 1-56592-257-3].

25.21 Content-Type/MIME

When files are served via the Hypertext Transfer Protocol (HTTP) the normal conventions with respect to file extensions do not apply. The content of the file is defined by an HTTP Content-Type header (a.k.a. MIME type). Common examples include:

Content-Type: text/html
Content-Type: image/gif
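
A Content-Type header value may carry parameters such as a character set after the MIME type. As a minimal sketch, Python's standard `email.message` module can separate the two; the helper name `media_type` is hypothetical and this is not how LinkScan itself parses headers.

```python
from email.message import Message


def media_type(content_type_value):
    """Return the bare MIME type from a Content-Type header value."""
    message = Message()
    message["Content-Type"] = content_type_value
    return message.get_content_type()  # lowercased "maintype/subtype"


print(media_type("text/html; charset=ISO-8859-1"))
print(media_type("image/gif"))
```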

25.22 Date and Time Last-Modified

LinkScan always attempts to store a date/time stamp with each document to indicate when the file was last modified. When scanning via the File System, LinkScan is able to capture this data directly from the operating system. However, when LinkScan does not have direct access to the server File System, it looks for an HTTP Last-Modified header. Most web servers supply this when serving static HTML documents (without Server Side Includes). However, it is typically not supplied when serving dynamic pages, so the data may not be available. Note, however, that LinkScan does have the ability to extract information of this type from META tags when available -- see How to process additional per-document data.
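
The two sources of the time stamp can be sketched in Python as follows. This is an illustration under stated assumptions, not LinkScan's code: the function name `last_modified` is hypothetical, `headers` is assumed to be a plain dictionary keyed by the exact header name, and the META tag fallback is omitted.

```python
import os
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def last_modified(headers=None, file_path=None):
    """Return the last-modified time as a datetime, or None when unknown."""
    if file_path is not None:
        # File System Scanning: ask the operating system directly.
        mtime = os.stat(file_path).st_mtime
        return datetime.fromtimestamp(mtime, tz=timezone.utc)
    # HTTP Scanning: rely on the Last-Modified header, if the server sent one.
    value = (headers or {}).get("Last-Modified")
    return parsedate_to_datetime(value) if value else None


print(last_modified(headers={"Last-Modified": "Tue, 15 Nov 1994 12:45:26 GMT"}))
print(last_modified(headers={"Content-Type": "text/html"}))  # dynamic page: None
```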

25.23 Document Weight

LinkScan calculates the total weight of each document. This calculation is based on the total in-line byte count and takes account of:

The size of the HTML document
The size of each in-line image. Only the first occurrence of any one image is considered, to simulate browser caching
Factors for HTTP headers and network latency
Compression factors for HTML document bodies (not images) at dial-up modem speeds only
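
The first three factors can be sketched in Python. The function name and the `header_overhead` value of 300 bytes are hypothetical placeholders, since the manual does not state the factors LinkScan uses; network latency and modem compression are left out of this sketch.

```python
def document_weight(html_bytes, images, header_overhead=300):
    """Estimate the total in-line byte count of a document.

    images is a list of (image_url, byte_count) pairs in document order.
    header_overhead is a hypothetical allowance for HTTP headers per request.
    """
    total = html_bytes + header_overhead
    seen = set()
    for url, size in images:
        if url in seen:
            continue  # repeated image: assumed to come from the browser cache
        seen.add(url)
        total += size + header_overhead
    return total


images = [("logo.gif", 2000), ("logo.gif", 2000), ("photo.jpg", 5000)]
print(document_weight(10000, images))
```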

25.24 Click Depth

LinkScan tracks and stores the depth of each document during the course of the scan. The depth reflects the number of hyperlinks the user must click to reach the target, starting from the initial URL. Note that LinkScan uses a deepest-first algorithm to scan a site. In general, the click-count is not incremented when following:

HTTP 301/302 redirects
META Refresh redirects
FRAME SRC links
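
The counting rule can be sketched in Python. This example does not reproduce LinkScan's scan order; it assumes that the depth of interest is the smallest number of clicks, and it treats the three link types above as zero-cost hops. The link type labels and the `links` structure are hypothetical.

```python
from collections import deque

# Link types that do not count as a click (labels are hypothetical).
FREE_LINK_TYPES = {"redirect", "meta-refresh", "frame"}


def click_depths(start_url, links):
    """Return the minimum click count to each URL reachable from start_url.

    links maps a URL to a list of (target_url, link_type) pairs.
    """
    depth = {start_url: 0}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        for target, link_type in links.get(url, ()):
            cost = 0 if link_type in FREE_LINK_TYPES else 1
            new_depth = depth[url] + cost
            if target not in depth or new_depth < depth[target]:
                depth[target] = new_depth
                # Zero-cost hops go to the front so shallower pages expand first.
                if cost == 0:
                    queue.appendleft(target)
                else:
                    queue.append(target)
    return depth


links = {
    "/": [("/old", "link"), ("/frameset", "link")],
    "/old": [("/new", "redirect")],
    "/new": [("/deep", "link")],
    "/frameset": [("/nav", "frame")],
}
print(click_depths("/", links))
```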

LinkScan Version 12.3
© Copyright 1997-2012 Electronic Software Publishing Corporation (Elsop)
LinkScan™ and Elsop™ are Trademarks of Electronic Software Publishing Corporation
