Datastead Software
THttpScan
Version 4.5 - April 21, 2005
USA:
http://www.datastead.com
http://www.delphicity.com
Europe:
http://www.datastead.net
Contact:
contact@datastead.com
Support:
support@datastead.com
Note to C++Builder users:
Add a #pragma link "inet.lib" statement at the top of your main cpp file.
1. If a previous THttpScan package is already installed, remove it first:
- select Component | Install Packages,
- click on "Datastead THttpScan",
- click "Remove",
- click "Yes",
- click "Ok",
- search for "HttpScan.*" and "THtScan.*" files in your Borland directories and delete them, so that no old units remain in the search paths (they would cause errors later).
2. Install the current package:
- unzip the archive in a folder of your choice,
- according to your Delphi or C++Builder version, copy all the Delphi\*.* or CBuilder\*.* archive files to your Borland\Delphi\Imports or \Borland\CBuilder\Imports directory,
- run Delphi or C++Builder,
- select Component | Install packages,
- press the "Add" button,
- locate the THtScan.bpl file in the Imports directory and select it,
- select Open,
- select Ok,
- check the Datastead tab at the right of the component palette. The THttpScan component should have been added.
Note:
If you get a "THtscanc.??? file not found" error when compiling or linking your project:
- go to Tools | Environment Options | Library, and check that you have ;$(DELPHI)\Imports (Delphi) or ;$(BCB)\Imports (C++Builder) in the Library path, otherwise add it at the end of the edit field.
- go to Project | Options | Packages, check "Build with runtime package", go to the end of the packages edit field, remove ";THtScan", and then uncheck "Build with runtime package".
function Start: Boolean (1st syntax)
Starts downloading and processing the URL set in the StartingUrl
property, which must have been set beforehand.
function Start (StartingUrl_: string): boolean (2nd syntax)
Starts downloading and processing the URL passed in the StartingUrl_
parameter.
procedure Stop
Stops the HttpScan process currently running.
Agent: string = ' '
OBSOLETE.
AllowRedirect: boolean = true
If enabled, THttpScan follows redirected URLs. If disabled, redirected URLs
are ignored.
ConcurrentDownloads: integer = 6
Number of HTML page downloads running simultaneously. A value between 4 and 20
is a good range, depending on your ISP speed and your processor.
DepthSearchLevel: integer = 3
Depth of the tree of followed pages, starting from the first URL. In other
words: "each time I find a link, I click on this link, n times". If the scan
is kept on the host of the starting URL with StayOnSite set to true, a high
value allows you to grab an entire web site.
This is the most important parameter, together with StayOnSite.
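To make the "click n times" idea concrete, here is a minimal C++ sketch of a depth-limited scan over a small in-memory link graph. It is not THttpScan's implementation: the LinkGraph type and the pagesReached function are hypothetical names, and counting the starting page as level 0 is an assumption.

```cpp
#include <cstddef>
#include <map>
#include <queue>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical in-memory "web": each page maps to the links it contains.
typedef std::map<std::string, std::vector<std::string> > LinkGraph;

// Returns every page reached when starting at `start` and following links
// at most `depthSearchLevel` times (assumption: the start page is level 0).
std::set<std::string> pagesReached(const LinkGraph& web,
                                   const std::string& start,
                                   int depthSearchLevel) {
    std::set<std::string> seen;
    std::queue<std::pair<std::string, int> > pending;  // (url, depth)
    seen.insert(start);
    pending.push(std::make_pair(start, 0));
    while (!pending.empty()) {
        std::pair<std::string, int> current = pending.front();
        pending.pop();
        if (current.second >= depthSearchLevel) continue;  // do not follow further
        LinkGraph::const_iterator page = web.find(current.first);
        if (page == web.end()) continue;  // no links known for this page
        for (std::size_t i = 0; i < page->second.size(); ++i) {
            const std::string& link = page->second[i];
            if (seen.insert(link).second)  // true only for a URL not seen yet
                pending.push(std::make_pair(link, current.second + 1));
        }
    }
    return seen;
}
```

With a chain of pages a -> b -> c -> d, a depth of 2 reaches a, b and c but never d, which is why a high value is needed to cover a whole site.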
HttpPort: integer = 80
HTTP port of the starting URL.
FileOfResults: string = ' '
Complete path of the file in which to store the results of the processing.
Leave this property blank if you do not want THttpScan to save the results to
a file.
KeywordsFilter: string of keywords separated by char(13)+char(10), not visible
in the object properties.
Set of keywords used to filter out URLs. One keyword per line. Very short
keywords eliminate a lot of URLs (e.g. a keyword like "th" eliminates all the
URLs containing "th"). Activated by KeywordsFilterEnabled = true.
KeywordsFilterEnabled: boolean = false
If set to true, any URL that contains one of the keywords of the
KeywordsFilter string list is ignored.
KeywordsLimiter: string of keywords separated by char(13)+char(10), not
visible in the object properties.
Set of keywords that URLs MUST CONTAIN. One keyword per line. Very short
keywords report a lot of URLs (e.g. a keyword like "th" allows all the URLs
containing "th"). Activated by KeywordsLimiterEnabled = true.
KeywordsLimiterEnabled: boolean = false
If set to true, only the URLs that contain one of the keywords of the
KeywordsLimiter string list are reported.
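The effect of the two keyword lists can be sketched in C++ as follows. This only illustrates the rules described above and is not the component's code: containsAnyKeyword and passesKeywordRules are hypothetical names, and case-sensitive substring matching is an assumption.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// True if `url` contains at least one of `keywords` as a plain substring
// (assumption: the comparison is case-sensitive; empty keywords are skipped).
bool containsAnyKeyword(const std::string& url,
                        const std::vector<std::string>& keywords) {
    for (std::size_t i = 0; i < keywords.size(); ++i)
        if (!keywords[i].empty() && url.find(keywords[i]) != std::string::npos)
            return true;
    return false;
}

// Hypothetical combination of both lists: a URL is rejected when the filter
// is enabled and matches, or when the limiter is enabled and does not match.
bool passesKeywordRules(const std::string& url,
                        bool filterEnabled,
                        const std::vector<std::string>& filter,
                        bool limiterEnabled,
                        const std::vector<std::string>& limiter) {
    if (filterEnabled && containsAnyKeyword(url, filter)) return false;
    if (limiterEnabled && !containsAnyKeyword(url, limiter)) return false;
    return true;
}
```

With the keyword "th", a URL such as http://www.oursite.com/thumbs/a.jpg is dropped by the filter but accepted by the limiter, which shows why very short keywords match far more URLs than intended.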
LeavesFirst: boolean = false
If we think of the pages scanned (starting from the initial URL) as a tree
with branches and leaves, setting this property to true makes THttpScan scan
the leaves before the branches.
LinkScan: TLinkScan = (scanAllLinks, scanInitialSite, scanInitialPath)
Sets the global way to surf through links.
scanAllLinks: for each HTML page found, all the links are downloaded and
scanned, and so on...
scanInitialSite: scans only links owned by the site of the starting URL.
scanInitialPath: scans only links with the same sub path as the starting URL
(links of the same tree level and below).
LinkReport: TLinkReport = (reportAllLinks, reportCurrentSiteLinks, reportCurrentPathLinks)
Sets the global way links are reported.
reportAllLinks: reports all links found in the current HTML page.
reportCurrentSiteLinks: reports only links owned by the same site as the
current HTML page.
reportCurrentPathLinks: reports only links with the same sub path as the
current HTML page (links of the same tree level and below).
To explain by an example, let's say we have the following page:
http://www.oursite.com/info/mainpage.htm
In this page we have the following links:
1. http://www.anothersite.com/externalimage.gif
2. http://www.oursite.com/siteimage.gif
3. http://www.oursite.com/info/siteimage2.gif
- if you select reportAllLinks, links #1, #2 and #3 are reported,
- if you select reportCurrentSiteLinks, only links #2 and #3 are reported, because they are owned by
www.oursite.com, which is the site of mainpage.htm,
- if you select reportCurrentPathLinks, only link #3 is reported, because it is
the only link under the path of mainpage.htm (under http://www.oursite.com/info/).
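The example above can be reproduced with a few lines of C++. This is a simplified sketch, not the component's own logic: siteOf, directoryOf, startsWith and isReported are hypothetical helpers, and the comparison is a plain prefix test that ignores ports, letter case and query strings.

```cpp
#include <string>

enum LinkReportMode { reportAllLinks, reportCurrentSiteLinks, reportCurrentPathLinks };

// Index of the first character after "scheme://", or 0 if there is no scheme.
std::string::size_type hostStartOf(const std::string& url) {
    std::string::size_type schemeEnd = url.find("://");
    return (schemeEnd == std::string::npos) ? 0 : schemeEnd + 3;
}

// Scheme and host with a trailing slash, e.g. "http://host/".
std::string siteOf(const std::string& url) {
    std::string::size_type firstSlash = url.find('/', hostStartOf(url));
    if (firstSlash == std::string::npos) return url + "/";
    return url.substr(0, firstSlash + 1);
}

// Everything up to the last '/' of the path, e.g. "http://host/info/".
std::string directoryOf(const std::string& url) {
    std::string::size_type lastSlash = url.rfind('/');
    if (lastSlash == std::string::npos || lastSlash < hostStartOf(url))
        return url + "/";  // no path at all: the directory is the site root
    return url.substr(0, lastSlash + 1);
}

bool startsWith(const std::string& text, const std::string& prefix) {
    return text.compare(0, prefix.size(), prefix) == 0;
}

// True if `link`, found in the page `pageUrl`, is reported under `mode`.
bool isReported(LinkReportMode mode, const std::string& pageUrl,
                const std::string& link) {
    switch (mode) {
        case reportCurrentSiteLinks: return startsWith(link, siteOf(pageUrl));
        case reportCurrentPathLinks: return startsWith(link, directoryOf(pageUrl));
        default: return true;  // reportAllLinks
    }
}
```

For the page http://www.oursite.com/info/mainpage.htm, siteOf yields http://www.oursite.com/ and directoryOf yields http://www.oursite.com/info/, so the three links are accepted or rejected exactly as listed above.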
MaxQueueSize: integer = 5000
Maximum size of the HTML pages queue. The queue grows faster than pages are
analyzed: after a few minutes, there can be 50 pages analyzed and 10000 pages
in queue. This size limit helps to avoid memory problems with huge queues.
New links found are ignored if adding them would make the queue larger than
MaxQueueSize.
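The queue limit can be pictured with the following C++ sketch of a bounded queue that silently drops new links once it is full. The PageQueue class is a hypothetical illustration, not the data structure THttpScan uses internally.

```cpp
#include <cstddef>
#include <deque>
#include <string>

// Hypothetical bounded queue of pages waiting for download.
class PageQueue {
public:
    explicit PageQueue(std::size_t maxQueueSize)
        : maxSize_(maxQueueSize), ignored_(0) {}

    // Adds `url` unless the queue is full; returns false when the link is
    // ignored because adding it would exceed the maximum size.
    bool add(const std::string& url) {
        if (urls_.size() >= maxSize_) {
            ++ignored_;
            return false;
        }
        urls_.push_back(url);
        return true;
    }

    std::size_t size() const { return urls_.size(); }
    std::size_t ignored() const { return ignored_; }

private:
    std::size_t maxSize_;
    std::size_t ignored_;
    std::deque<std::string> urls_;
};
```

The practical consequence is that links discovered while the queue is full are lost, so a scan with a small limit may miss parts of a large site.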
Password: string = ' '
Needed if the starting URL is username/password protected.
ProxyAddress: string = ''
IP address of the proxy server.
ProxyPassword: string = ''
Password to authenticate to the proxy server.
ProxyPort: integer
Port of the proxy server.
ProxyType: tProxyType = (PROXY_DIRECT, PROXY_USE_PROXY, PROXY_DEFAULT)
PROXY_DIRECT: direct connection to the Internet, all the Proxy... parameters
are ignored.
PROXY_USE_PROXY: the Proxy... parameters are used to authenticate to the proxy
server.
PROXY_DEFAULT: the control panel parameters are used.
ProxyUser: string
Username to authenticate to the proxy server.
Referrer: string = ' '
OBSOLETE.
Retries: integer = 3
Number of download retries when a connect or GET error occurs.
SeekRobotsTxt: boolean = false
If set to true, THttpScan searches for robots.txt files at the root of the
sites (http://www.hostname.foo/robots.txt). If the file is found, the body
content is returned by the OnPageReceived event.
StartingUrl: string = ' '
The URL from which the scanning will be performed.
Must be set before calling the Start function if it is called without the URL
parameter.
TimeOut: integer = 300
Time (in seconds) allowed for the HTTP thread to connect to a URL before
aborting. The thread tries to connect Retries times before the OnError event
occurs.
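The relationship between Retries and the OnError event can be sketched as a simple loop. This is an illustration only: ConnectResult and connectWithRetries are hypothetical names, and tryConnect stands in for one connection attempt that gives up after TimeOut seconds.

```cpp
#include <functional>
#include <string>

// Outcome of the retry loop: whether a connection was made, and after how
// many attempts.
struct ConnectResult {
    bool connected;
    int attempts;
};

// Calls `tryConnect` (a hypothetical stand-in for one timed connection
// attempt) until it succeeds or `retries` attempts have been made.
ConnectResult connectWithRetries(
        const std::function<bool(const std::string&)>& tryConnect,
        const std::string& url, int retries) {
    ConnectResult result = { false, 0 };
    while (result.attempts < retries) {
        ++result.attempts;
        if (tryConnect(url)) {
            result.connected = true;
            break;
        }
    }
    // connected == false corresponds to the point where an error is reported.
    return result;
}
```

Under these assumptions an unreachable host can take up to Retries x TimeOut seconds before the error is reported, which is worth keeping in mind with the default values of 3 and 300.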
TypeFilter: string of file types separated by char(13)+char(10), not visible
in the object properties.
Set of file types: only the URLs of these types are reported. One file type
per line (e.g. jpg, gif, mp3). Lowercase only. For jpeg use "jpg" and for
mpeg use "mpg" (THttpScan converts jpeg to jpg and mpeg to mpg).
Activated by TypeFilterEnabled = true.
TypeFilterEnabled: boolean = false
If set to true, only the URLs whose file type is found in the TypeFilter list
are reported.
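The jpeg/mpeg conversion explains why the list must contain "jpg" and "mpg". A minimal C++ sketch of such a type check follows; fileTypeOf and matchesTypeFilter are hypothetical names, and lowercasing the extension of the link before comparing is an assumption.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Returns the normalized file type of `fileName`: the lowercase extension,
// with "jpeg" mapped to "jpg" and "mpeg" to "mpg"; "" if there is none.
std::string fileTypeOf(const std::string& fileName) {
    std::string::size_type dot = fileName.rfind('.');
    if (dot == std::string::npos) return "";
    std::string type = fileName.substr(dot + 1);
    for (std::size_t i = 0; i < type.size(); ++i)
        type[i] = static_cast<char>(
            std::tolower(static_cast<unsigned char>(type[i])));
    if (type == "jpeg") return "jpg";
    if (type == "mpeg") return "mpg";
    return type;
}

// True if the type of `fileName` is one of the entries of `typeFilter`.
bool matchesTypeFilter(const std::string& fileName,
                       const std::vector<std::string>& typeFilter) {
    return std::find(typeFilter.begin(), typeFilter.end(),
                     fileTypeOf(fileName)) != typeFilter.end();
}
```

With a list containing jpg and mp3, a file named photo.jpeg matches because it is normalized to jpg, whereas an entry written as "jpeg" in the list would never match anything.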
UserName: string = ' '
Needed if the starting URL is username/password protected.
Working: boolean = false
Read only, not visible in the object properties.
Indicates the state of HttpScan: "waiting" or "working". Can be tested before
closing the form to know whether downloads are currently running.
See also the OnWorking event.
OnError (Sender: TObject; Url: String; ErrorCode: Cardinal; ErrorMsg: String);
Occurs when a "GET" request fails. Returns the URL that failed, with the error
code and the error message if available.
OnHttpAuthenticate (Sender: TObject; HostName, Url: string; var UserName, Password: String; RetryCount: Integer; var Cancel: Boolean);
Occurs when an HTTP page requires authentication. You can set the related
UserName and Password here, or assign False to Cancel to ignore the HTTP
authentication.
If a wrong UserName or Password is set, the event occurs again and RetryCount
increases.
OnLinkFound (Sender: TObject; UrlFound, TypeLink, FromUrl, HostName, UrlPath, UrlPathWithFile, FileName, ExtraInfos: String; Port: Integer; var WriteToFile: String; HrefOrSrc: Char; CountArea: Integer; var FollowIfHtmlLink: Boolean);
This event occurs each time a link is found and returns the following
parameters:
UrlFound: the full address of the link found
TypeLink: type of link (htm, jpg, mpg, cgi, php, etc...)
FromUrl: the referring URL (.htm) from which the link comes
HostName: the host name of the UrlFound address
UrlPath: the URL path (without host name and without file name)
UrlPathWithFile: the URL path (without host name but with file name)
FileName: the file name extracted from the URL path
ExtraInfos: the extra info passed to the URL (e.g. ?param1=v)
Port: the port number used in the HTTP request (usually 80)
WriteToFile: the line to be written to FileOfResults. See the comments about
WriteToFile further down in this document.
HrefOrSrc: returns 'S' if the link is an object loaded on the page (a
thumbnail for example) and 'H' if the link is the destination URL.
CountArea: each area found receives a sequential number. When an Href or Src
link is found, it receives the number of its area, so that the matching
Href / Src links can be associated.
FollowIfHtmlLink: if you set FollowIfHtmlLink to false in this event,
THttpScan stops searching in the direction of the current link.
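To show how the URL-related parameters relate to each other, here is a C++ sketch that splits a URL into similar parts. It is an approximation written for illustration: LinkParts and splitUrl are hypothetical names, and details such as the leading slash of the paths or the "?" kept in the extra info are assumptions, not the exact strings the component passes to the event.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical container mirroring some of the OnLinkFound parameters.
struct LinkParts {
    std::string hostName;
    std::string urlPath;          // path without host name and file name
    std::string urlPathWithFile;  // path without host name, with file name
    std::string fileName;
    std::string extraInfos;       // query string, including the '?'
    int port;
};

LinkParts splitUrl(const std::string& url) {
    LinkParts parts;
    parts.port = 80;  // default HTTP port when none is given

    // Drop the scheme ("http://") if present.
    std::string rest = url;
    std::string::size_type scheme = rest.find("://");
    if (scheme != std::string::npos) rest = rest.substr(scheme + 3);

    // Separate the query string.
    std::string::size_type query = rest.find('?');
    if (query != std::string::npos) {
        parts.extraInfos = rest.substr(query);
        rest = rest.substr(0, query);
    }

    // Separate "host[:port]" from the path.
    std::string::size_type slash = rest.find('/');
    std::string authority = rest.substr(0, slash);
    parts.urlPathWithFile =
        (slash == std::string::npos) ? "/" : rest.substr(slash);

    std::string::size_type colon = authority.find(':');
    parts.hostName = authority.substr(0, colon);
    if (colon != std::string::npos)
        parts.port = std::atoi(authority.c_str() + colon + 1);

    // Split the path at its last '/' into directory and file name.
    std::string::size_type lastSlash = parts.urlPathWithFile.rfind('/');
    parts.urlPath = parts.urlPathWithFile.substr(0, lastSlash + 1);
    parts.fileName = parts.urlPathWithFile.substr(lastSlash + 1);
    return parts;
}
```

For http://www.oursite.com:8080/info/mainpage.htm?param1=v this gives the host www.oursite.com, port 8080, path /info/, file name mainpage.htm and extra info ?param1=v.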
OnLog (Sender: TObject; LogMessage: string);
Returns a string that explains the internal process (for debugging purposes).
OnMetaTag (Sender: TObject; Url, ReferringUrl, TagType, Tag1stAttrib, Tag1stValue, Tag2ndAttrib, Tag2ndValue, Tag3rdAttrib, Tag3rdValue: String);
Returns the tag type and attributes of the current HTML page. If there are 5
tags on a page, the event occurs 5 times for this page. The number of
attributes differs according to the tag type, so the attribute parameters are
called "1st", "2nd" and "3rd".
Url: the URL from which the meta tag is returned
ReferringUrl: the parent URL
TagType: TITLE, META, LINK, BASE, etc...
Tag1stAttrib: tag attribute, according to the TagType. E.g. if TagType =
META, returns "NAME", "HTTP-EQUIV", etc...
Tag1stValue: value of the Tag1stAttrib, e.g. if Tag1stAttrib = "NAME",
returns "keywords", "description", etc...
Tag2ndAttrib: e.g. if Tag1stAttrib = "NAME" and Tag1stValue = "keywords",
returns "CONTENT".
Tag2ndValue: e.g. if Tag1stAttrib = "NAME", Tag1stValue = "keywords" and
Tag2ndAttrib = "CONTENT", returns the content string.
Tag3rdAttrib: e.g. if TagType = "LINK", Tag1stAttrib = "REL", Tag1stValue =
"STYLESHEET", Tag2ndAttrib = "HREF", Tag2ndValue = "/style/??.css", returns
"TYPE".
Tag3rdValue: e.g. "text/css" for the sample above.
If this seems complicated, take a look at the demo: in practice it is quite
simple.
OnPageReceived (Sender: TObject; Hostname, Url, Head, Body: string);
This event occurs each time an HTML page is downloaded and returns the
following parameters:
Url: URL of the text page received
Hostname: host name of the page received
Head: head of the HTTP query request for the page received
Body: body of the text of the page received
OnProxyAuthenticate (Sender: TObject; var UserName, Password: String; RetryCount: Integer; var Cancel: Boolean);
Occurs when a proxy authentication is required. You can set the related
UserName and Password here, or assign False to Cancel to ignore the proxy
authentication.
If a wrong UserName or Password is set, the event occurs again and RetryCount
increases.
OnUpdatedStats (Sender: TObject; InQueue, Downloading, ToAnalyze, Done, Retries, Errors: Integer);
Occurs each time something changes in the HttpScan state. Returns the number
of pages in queue (waiting for download), the number of pages currently
downloading, the number of pages waiting to be analyzed, the number of pages
analyzed (done), the number of retries, and the number of page downloads in
error.
OnWorking (Sender: TObject; working_: boolean);
Occurs when HttpScan passes from the "waiting" state to the "working" state
and vice versa. Can be used to detect when HttpScan has finished its job. You
can also use the Working property.
ABOUT WRITETOFILE
Comments about the WriteToFile parameter used in the OnLinkFound event:
WriteToFile contains the string (for the last link found) that will be written to the FileOfResults file. If you leave it untouched, a line like this is written to the file for each link found: "TypeLink";"UrlFound";"HostName".
WriteToFile is useful to write only some kinds of links (e.g. "jpg") to the FileOfResults file, or to choose the information written to the file. For example:
If you want to write your own data to the file, e.g. TypeLink, UrlFound and FromUrl, add the following line in the event:
WriteToFile:= '"' + TypeLink + '";"' + UrlFound + '";"' +
FromUrl + '"';
If you want to skip the current link and write nothing to the file for it,
simply add the following lines in the event:
if ...=... then begin
WriteToFile:= '';
end;
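The default line is a list of quoted fields separated by semicolons (link type, URL found, host name). For readers who post-process the results file from C++, the following sketch builds and parses such a line. The function names are hypothetical, and it assumes that fields never contain a double quote, since no escaping rule is documented.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Wraps a field in double quotes (assumption: no embedded quotes).
std::string quoted(const std::string& field) {
    return "\"" + field + "\"";
}

// Builds a line in the default layout: "type";"url";"host".
std::string resultLine(const std::string& typeLink,
                       const std::string& urlFound,
                       const std::string& hostName) {
    return quoted(typeLink) + ";" + quoted(urlFound) + ";" + quoted(hostName);
}

// Splits a line back into its fields. Semicolons inside quotes are kept as
// part of the field; the quotes themselves are removed.
std::vector<std::string> parseResultLine(const std::string& line) {
    std::vector<std::string> fields(1);
    bool inQuotes = false;
    for (std::size_t i = 0; i < line.size(); ++i) {
        char c = line[i];
        if (c == '"')
            inQuotes = !inQuotes;
        else if (c == ';' && !inQuotes)
            fields.push_back(std::string());
        else
            fields.back() += c;
    }
    return fields;
}
```

Quoting each field is what makes the file safe to split on semicolons, because a URL may itself contain a semicolon.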
Copyright
All Datastead Software components and applications are copyrighted by Michel
FORNENGO (hereafter "author"), and shall remain the exclusive property of the
author.
General agreement
By installing this software you agree with:
Licensed version
This software and any accompanying documentation
are protected by International Copyright laws and Treaty provisions.
Distribution Rights
You are granted a non-exclusive, royalty-free
right to produce and distribute executable binary files (executables, DLLs,
etc.) that are built with the licensed version of the software unless specifically stated otherwise.
The origin of this software must not be misrepresented, you must not claim that you wrote the original software. If you use this software in a product, an acknowledgment in the product documentation would be appreciated but is not required.
Without the express prior written consent of the
author, you may not distribute any of the author's commercial source code,
compiled units or documentation by any means whatsoever. You may not transfer,
lease, lend, copy, modify, translate, sublicense, time-share, or
electronically transmit or receive the software or documentation.
Upgrade
The upgrade version of the software constitutes a
single product of the author's software that you upgraded. For example, the
upgrade and the software that you upgraded cannot both be available for use by
two different people at the same time, without written permission from the
author.
Limited warranty
Datastead warrants that for a period of ninety (90) days from the date of
shipment from Datastead: (i) the media on which the Software is furnished will
be free of defects in materials and workmanship under normal use; and (ii) the
Software substantially conforms to its published specifications. Except for the
foregoing, the Software is provided AS IS. This limited warranty extends only to
Customer as the original licensee. Customer's exclusive remedy and the entire
liability of Datastead and its suppliers under this limited warranty will be the
refund of the Software. In no event does Datastead warrant that the Software is
error free or that Customer will be able to operate the Software without
problems or interruptions.
This warranty does not apply if the software (a) has been altered, except by
Datastead, (b) has not been installed, operated, repaired, or maintained in
accordance with instructions supplied by Datastead, (c) has been subjected to
abnormal physical or electrical stress, misuse, negligence, or accident, or (d)
is used in ultra hazardous activities.
Disclaimer
The Author cannot and does not warrant that any
functions contained in the Software will meet your requirements, or that its
operations will be error free. The entire risk as to the Software performance or
quality, or both, is solely with the user and not the Author. You assume
responsibility for the selection of the component to achieve your intended
results, and for the installation, use, and results obtained from the Software.
ALL DATASTEAD SOFTWARE IS NOT DESIGNED, MANUFACTURED, OR INTENDED FOR USE OR
RESALE AS ON-LINE CONTROL EQUIPMENT IN HAZARDOUS ENVIRONMENTS REQUIRING
FAIL-SAFE PERFORMANCE SUCH AS IN THE OPERATION OF NUCLEAR FACILITIES, AIRCRAFT
NAVIGATION OR COMMUNICATION SYSTEMS, AIR TRAFFIC CONTROL, DIRECT LIFE SUPPORT
MACHINES, OR WEAPONS SYSTEMS, IN WHICH THE FAILURE OF THE SOFTWARE COULD LEAD
DIRECTLY OR INDIRECTLY TO DEATH, PERSONAL INJURY, OR SEVERE PHYSICAL OR ENVIRONMENTAL DAMAGE.
General
PLEASE READ THIS LICENSE AGREEMENT ("AGREEMENT") CAREFULLY:
BY INSTALLING, COPYING OR OTHERWISE USING THIS SOFTWARE AND ANY RELATED PRINTED MATERIALS ("SOFTWARE"), YOU ARE ACCEPTING AND AGREEING TO THE TERMS OF THIS AGREEMENT.
IF YOU DO NOT AGREE WITH THE TERMS OF THIS AGREEMENT, DO NOT USE THE SOFTWARE.
- You may not use the source code or binaries in this package to create competitive software
product,
- You may not manipulate any binary files included in this package or generated by
Delphi or C++Builder from the package,
- You may not distribute any file included in this package (source code or
binaries) to non licensed people.
Any use of this software in violation of copyright law or the terms of this
agreement will be prosecuted to the best of the author's ability.
You are hereby authorized to make archival copies of this software for the sole
purpose of back-up and protecting your investment from loss.
Under no circumstances may you copy this software or documentation for the
purposes of distribution to others. Under no conditions may you remove the
copyright notices made part of the software or documentation.
Restrictions
The Author makes no warranty, either implied or expressed, including without
limitation any warranty with respect to this Software documented here, its
quality, performance, or fitness for a particular purpose. In no event shall the
Author be liable to you for damages, whether direct or indirect, incidental,
special, or consequential arising out the use of or any defect in the Software,
even if the Author has been advised of the possibility of such damages, or for
any claim by any other party.
All other warranties of any kind, either express or implied, including but not
limited to the implied warranties of merchantability and fitness for a
particular purpose, are expressly excluded.
THttpScan quickly and recursively analyzes HTML pages and reports all the links it finds to a text file: html, mail, jpg, mpeg, mp3, etc...
THttpScan extracts links from HTML pages in the neighborhood of the initial
URL. The HTML links found are added to a download queue. Then THttpScan
downloads each related page, extracts the links found, and so on...
- by using the LinkScan property you can limit the scanning to the initial site
or the initial URL path,
- by using the LinkReport property you can report only links owned by the current site, or
the links under the subdirectories of the initial link.
The DepthSearchLevel property allows you to limit the level of pages scanned, starting from the initial page, especially when not limiting the scanning to a site.
By using the LinkScan and LinkReport properties combined with a high
DepthSearchLevel value, you can easily scan a whole site or only a
subdirectory of a web site.
Events are generated for each link found and each page read, returning
URL, meta tags, document type, referrer, host name...
Depending on your line speed, you can extract thousands of links from a starting URL in a few minutes.
THttpScan saves you from having to deal with HTML parsing.
Most common parameters can be set from the Object Inspector. The component
can be placed on any form; it is visible only at design time.
System requirements
Q: when I paste a long URL in the edit field, no link is reported.
If you paste the URL in a TEdit field, the TEdit field truncates the URL to 255 characters.
The workaround is to use a TMemo field instead (be sure to disable the WordWrap property, otherwise the URL could be broken by line feeds).
Q: when I build my project using C++Builder, I get the following
errors:
Unresolved external 'InternetCloseHandle' referenced
from BORLAND\CBUILDER\IMPORTS\THTSCAN.LIB|HttpScan
Unresolved external 'InternetCrackUrlA' referenced from
BORLAND\CBUILDER\IMPORTS\THTSCAN.LIB|HttpScan
Unresolved external 'InternetCombineUrlA' referenced from
BORLAND\CBUILDER\IMPORTS\THTSCAN.LIB|HttpScan
- Go to Project | Add to Project
- in the "file type" listbox at the bottom of the tab, choose
".lib" files,
- navigate to your CBuilder\Lib directory, and choose either inet.lib
(CBuilder4) or wininet.lib (CBuilder5 and CBuilder6).
- Build and save your project.
Q: when I build my project using C++Builder, I get the following
error:
Unable to find package import: THtScan.bpi
Go to Project | Options | Packages. Go to the "Runtime packages" groupbox, at the bottom of the tab. Go to the end of the packages list edit box, and remove THtScan.bpi.