LinkScan for Unix. Reference Manual.
Section 28
LinkScan File Formats
The following notes describe the format of many of
the LinkScan database files stored in:
...LinkScan/ProjectName/data/
...LinkScan/ProjectName/hist/
Each file is created in (mainly) ASCII format, with one
Record per line. Each Record contains a number of Fields,
delimited by <Control-G> characters (octal 007), except
where a file is noted as TAB delimited. The Fields for
each Record type are listed below, numbered from 0.
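The record layout described above can be read with a few lines of Python. This is a minimal sketch, not part of LinkScan: the function names are hypothetical, and the latin-1 encoding is an assumption chosen only so that any non-ASCII byte in a "mainly ASCII" file decodes without error.

```python
FIELD_SEP = "\x07"  # <Control-G>, octal 007


def split_record(line):
    """Split one Record (one line of text) into its list of Fields."""
    return line.rstrip("\r\n").split(FIELD_SEP)


def read_records(path):
    """Yield the Field list for every Record in a Control-G delimited file."""
    # Assumption: latin-1 maps every byte to a character, so decoding
    # cannot fail on files that are only "mainly" ASCII.
    with open(path, encoding="latin-1", newline="") as handle:
        for line in handle:
            yield split_record(line)
```

Fields are returned as strings in file order, so Field 0 of a Record is `fields[0]`.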
idx.dat
=======
Establishes the mapping between an "idx" number and each
unique Document/Link/URL examined by LinkScan.
0 = idx
1 = URL
2 = Document Title
doc.dat
=======
Contains the attributes and characteristics for each unique
Document/Link/URL examined by LinkScan.
0 = idx (see idx.dat)
1 = URL
2 = Owner Code (see linkscan.own)
3 = Clicks
4 = Link Type (see below)
5 = Content-Type (MIME)
6 = Link Status Code (see codes.txt)
7 = Extended Status (normally blank)
8 = Location for Redirect (see idx.dat)
9 = Original Status Code (pre-redirect)
10 = Content-Length (size in bytes)
11 = Last-Modified (date/time)
12 = Reserved
13 = File System Pathname
14 = Document Title
15 = In-line bytes (page weight)
16 = Number of Errors in this document
17 = Number of Warnings in this document
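As an illustration of the doc.dat layout, the sketch below maps the 18 Fields listed above onto names. The Python names in DOC_FIELDS and the function name are hypothetical labels chosen for this example; only the field order comes from the list above. All values are kept as strings.

```python
# Hypothetical names for doc.dat Fields 0-17, in the order listed above.
DOC_FIELDS = [
    "idx", "url", "owner_code", "clicks", "link_type", "content_type",
    "status_code", "extended_status", "redirect_location",
    "original_status_code", "content_length", "last_modified", "reserved",
    "pathname", "title", "inline_bytes", "errors", "warnings",
]


def parse_doc_record(line):
    """Return one doc.dat Record as a dict keyed by the names above."""
    values = line.rstrip("\r\n").split("\x07")
    # zip() stops at the shorter sequence, so a short Record simply
    # yields a dict without the missing trailing Fields.
    return dict(zip(DOC_FIELDS, values))
```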
orp.dat
=======
Contains information concerning all Orphaned Files.
0 = URL
1 = File System Pathname
2 = Symlink (0=No; 1=Followed symlink; 2=Is symlink)
3 = File Size
4 = Date/Time last modified
5 = Owner Code (see linkscan.own)
6 = Link Type (see below)
7 = Link Status Code (see codes.txt)
mad.dat and map.dat
===================
Contain the LinkScan SiteMap Data
mad.dat -- directory order
map.dat -- link order
0 = Level in Map
1 = Dot-Decimal Notation
2 = Document URL
3 = Document Title
4 = Owner Code (see linkscan.own)
5 = Content-Length (size in bytes)
6 = Last-Modified (date/time)
7 = Total # of child documents for this node
lnk.dat
=======
Contains the attributes of every link considered by LinkScan.
0 = Owner Code (see linkscan.own)
1 = From URL (see idx.dat)
2 = Line Number (times 10)
3 = To URL (see idx.dat)
4 = Link Type Code (see below)
5 = Link Status Code (see codes.txt)
6 = Extended Status (normally blank)
7 = cnt
8 = Link Caption/Description
9 = File Size (in-line images only)
10 = Redirect location (see idx.dat)
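The cross-references between lnk.dat and idx.dat can be followed as sketched below. Two assumptions are made here and are not confirmed by the list above: that Fields marked "(see idx.dat)" hold idx numbers rather than literal URLs, and that dividing Field 2 by 10 recovers the source line number. The function names are hypothetical.

```python
def build_idx_map(idx_lines):
    """Map idx number (Field 0 of idx.dat) to URL (Field 1 of idx.dat)."""
    idx_to_url = {}
    for line in idx_lines:
        fields = line.rstrip("\r\n").split("\x07")
        if len(fields) >= 2:
            idx_to_url[fields[0]] = fields[1]
    return idx_to_url


def describe_link(lnk_line, idx_to_url):
    """Resolve the From/To references of one lnk.dat Record."""
    fields = lnk_line.rstrip("\r\n").split("\x07")
    return {
        # Assumption: Fields 1 and 3 are idx numbers. If a value is not
        # found in the map it is passed through unchanged.
        "from_url": idx_to_url.get(fields[1], fields[1]),
        "to_url": idx_to_url.get(fields[3], fields[3]),
        # Assumption: the stored value is the line number times 10.
        "line": int(fields[2]) / 10,
        "status_code": fields[5],
    }
```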
err.dat
=======
Subset of the lnk.dat file, excluding all records that
relate to good links.
linkscan.own
============
Establishes the mapping between the Owner Code and Owner Name.
0 = Owner Name
1 = Owner Code
linkscan.sum
============
Summary Statistics Data (Note this file is TAB delimited)
0 = Version
1 = Date and time of scan
2 = Total Documents
3 = Missing Documents
4 = Documents Containing Errors
5 = Total Other Files
6 = Missing Other Files
7 = Total Anchors
8 = Missing Anchors
9 = Total External Links
10 = External Links Tested This Scan
11 = External Links with Errors
12 = External Links with Possible Errors
13 = External Links with Warnings
14 = Total Orphans
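Because linkscan.sum is TAB delimited rather than Control-G delimited, it needs a different separator. The sketch below parses one summary line into named values; the names in SUM_FIELDS and the function name are hypothetical, and the example assumes nothing about how many lines the file contains.

```python
# Hypothetical names for linkscan.sum Fields 0-14, in the order listed above.
SUM_FIELDS = [
    "version", "scan_datetime", "total_documents", "missing_documents",
    "documents_with_errors", "total_other_files", "missing_other_files",
    "total_anchors", "missing_anchors", "total_external_links",
    "external_links_tested", "external_links_with_errors",
    "external_links_with_possible_errors", "external_links_with_warnings",
    "total_orphans",
]


def parse_summary_line(line):
    """Return one linkscan.sum line as a dict of strings."""
    values = line.rstrip("\r\n").split("\t")
    return dict(zip(SUM_FIELDS, values))
```

Values are left as strings; a caller that needs arithmetic can convert individual counts with int().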
linkscan.tim
============
HTTP Transaction Times (Note this file is TAB delimited)
0 = URL fetched
1 = HTTP status code (200, 404 etc)
2 = Document size (bytes)
3 = Document Body flag (0=not available; 1=available but not fetched;
2=available and fetched)
4 = Transaction time (milliseconds)
5 = Redirect location
Notes:
* Transaction time includes the time taken to follow any redirects.
* Transaction time includes the time to fetch the document
body for HTML and similar MIME types only.
* For other file types (images, for example) the transaction
time does NOT include the body download. It does measure
the network/server latency for the exchange of full
request and response headers. The additional download
time can be estimated from the file size and the
available connection bandwidth. Such an estimate is
likely to be quite accurate, because the HTTP server
only has to push the data from an already located file
down an already open socket to the client. Since most
image file formats incorporate compression, you are
unlikely to see any further savings even if the
connection type supported such a scheme.
* Timing is affected by the number of processes used for
the scan and, to some extent, by the relative
performance of the target server and the LinkScan
machine.
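The estimate mentioned in the notes, adding a body download time computed from file size and bandwidth, can be sketched as follows. The function name and the bandwidth figure are hypothetical; the calculation is simple arithmetic and ignores protocol overhead and competing traffic.

```python
def estimated_total_ms(transaction_ms, size_bytes, bandwidth_bits_per_sec):
    """Add an estimated body download time to a header-only transaction time."""
    # bytes -> bits (x 8), seconds -> milliseconds (x 1000)
    body_ms = size_bytes * 8 * 1000 / bandwidth_bits_per_sec
    return transaction_ms + body_ms
```

For example, a 50,000 byte image with a recorded transaction time of 120 ms, over an assumed 1,000,000 bit/s connection, gives estimated_total_ms(120, 50000, 1000000) = 520.0, that is, 120 ms for the header exchange plus 400 ms for the body.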
hist/xxxxxx/dat
===============
History Data -- New File Created for Each Scan
0 = Document URL
1 = Owner Name
2 = Document Type Code (see below)
3 = Clicks
4 = Content-Type (MIME)
5 = Document Status Code (see codes.txt)
6 = Content-Length (size in bytes)
7 = Last-Modified (date/time)
8 = Document Title
Document Type Codes
===================
H = HTML Document
D = PDF Document
J = JavaScript Document
M = Image Map
S = Flash Document
T = Text Document
Y = Reserved
Z = Import Document
F = Other File Type
I = In-line image
N = Document with Nofollow rule
O = Orphaned Document
P = Orphaned File
A = Anchor
R = Redirection (internal)
U = External link
V = Redirection (external)
X = Reserved (typically mailto: or invalid characters)
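When processing these files in a script, the codes above can be held in a lookup table. The following is a sketch only: the dictionary simply repeats the list above, and the function name and the text returned for an unlisted code are hypothetical.

```python
DOCUMENT_TYPE_CODES = {
    "H": "HTML Document",
    "D": "PDF Document",
    "J": "JavaScript Document",
    "M": "Image Map",
    "S": "Flash Document",
    "T": "Text Document",
    "Y": "Reserved",
    "Z": "Import Document",
    "F": "Other File Type",
    "I": "In-line image",
    "N": "Document with Nofollow rule",
    "O": "Orphaned Document",
    "P": "Orphaned File",
    "A": "Anchor",
    "R": "Redirection (internal)",
    "U": "External link",
    "V": "Redirection (external)",
    "X": "Reserved (typically mailto: or invalid characters)",
}


def describe_type(code):
    """Return the description for a type code, or a marker if unlisted."""
    return DOCUMENT_TYPE_CODES.get(code, "Unknown code")
```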
LinkScan Version 12.3
© Copyright 1997-2012
Electronic Software Publishing Corporation (Elsop)
LinkScan and Elsop are Trademarks of Electronic Software Publishing Corporation