1. If an old THttpScan package is already installed, FIRST REMOVE IT:
- select Component | Install Packages
- click on "Michel Fornengo's Components"
- click "Remove"
- click "Yes"
- click "Ok".
2. Install the current package:
- unzip the archive into a folder of your choice (e.g. c:\httpscan)
- copy the following files to delphi5\Imports (or delphi4\imports):
httpscan.bpl
httpscan.dcp
httpscan.dcu
MF*.dcu (all the .dcu files beginning with MF)
- run Delphi
- select Component | Install Packages
- press the "Add" button
- locate the httpscan.bpl file in the Imports directory and select it
- select Open
- select Ok
- check the mfornengo tab on the right of the component palette. The httpscan object should have been added.
3. If you have created a project with a previous release of THttpScan:
To make Delphi recognize the new parameters in events, proceed as follows:
- cut and save the code of the THttpScan events with new parameters (OnLinkFound and OnPageReceived)
- put a new THttpScan component on your project
- go to the THttpScan object's properties inspector
- double-click on the events with the same parameters. They will find their existing code.
- double-click on the events with new parameters to create the empty procedures, then paste the saved code.
function Start : boolean (1st syntax)
function Start (StartingUrl_ : string) : boolean (2nd syntax)
procedure Stop : kills all HttpScan processes currently running. It must be called before closing the Form. The Form can be closed once the OnWorking event has occurred with false, or once the Working property returns false.
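One way to apply this rule is to refuse the close request while a scan is still running. The fragment below is a minimal sketch, assuming a form class named TForm1 that holds a THttpScan component named HttpScan1 (both names are illustrative). It relies only on the Stop method and the Working property described on this page, plus the standard VCL OnCloseQuery event of the form.

```pascal
// Sketch of an OnCloseQuery handler for the form that owns the component.
// TForm1 and HttpScan1 are assumed names, not part of THttpScan itself.
procedure TForm1.FormCloseQuery(Sender: TObject; var CanClose: Boolean);
begin
  if HttpScan1.Working then
  begin
    HttpScan1.Stop;      // ask the running downloads to terminate
    CanClose := False;   // refuse this close request while the scan is still working
  end;
end;
```

The form can then be closed again once the state has returned to "waiting", for example after OnWorking has been received with false.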
Referer : string = ' '
the address (URL) of the document from which the URL in the request was
obtained. If this parameter is left blank, no "referrer" is sent.
Retries : integer = 3
number of download retries when an http error occurs.
ReUseCache : boolean = false
if set to true, the local cache file is read before downloading pages.
SeekRobotsTxt : boolean = false
if set to true, THttpScan searches for robots.txt files at the root of the sites
(http://www.hostname.foo/robots.txt). If the file is found, the body content is
returned by the OnPageReceived event.
StartingUrl : string = ' '
the url from which the scanning will be performed.
Must be set before calling the Start function if it is called without a Url parameter.
TimeOut : integer = 300
time allowed for the http thread to connect to a URL (in seconds) before the attempt is aborted. The thread tries to connect Retries times before the OnError event occurs.
UserName : string = ' '
needed if the starting url is username/password protected.
Working : boolean = false (read only)
indicates the state of httpscan: "waiting" or "working".
Can be tested before closing the Form to prevent error messages if downloads are currently running. Use it together with the Stop method. You can also use the OnWorking event.
OnError (Url: String; ErrorCode: Cardinal; ErrorMsg: String);
occurs when a "GET" request fails. Returns the url which failed, with the error code and the error message if available.
OnLinkFound (UrlFound, TypeLink, FromUrl, HostName, UrlPath, UrlPathWithFile, ExtraInfos: String; var WriteToFile: String);
This event occurs each time a link is found:
UrlFound : the full address of the link found
TypeLink : the type of link (htm, jpg, mpg, cgi, php, etc.)
FromUrl : the referring url (.htm) from which the link comes
HostName : the host name of the UrlFound address
UrlPath : the url path (without host name and without filename)
UrlPathWithFile : the url path (without host name but with filename)
ExtraInfos : the extra information passed to the URL (e.g. ?param1=v)
WriteToFile : the line to be written to FileOfResult. See the comments on the WriteToFile parameter.
HrefOrSrc : returns 'S' if the link is an object loaded on the page (a thumbnail, for example) and 'H' if the link is the destination URL.
CountArea : each area found receives a sequential number. When a Href or Src link is found, it receives the number of its area, so that Href / Src link pairs can be associated.
FollowIfHtmlLink : if false, THttpScan does not continue searching in the direction of the current link.
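To make the relationship between UrlFound, HostName, UrlPath, UrlPathWithFile and ExtraInfos concrete, here is a small stand-alone console sketch that splits a sample address along the lines described above. SplitUrl is a hypothetical helper written for this illustration, not a THttpScan routine, and the exact form of the strings THttpScan delivers (leading slashes, the "?" in ExtraInfos) is an assumption.

```pascal
program UrlPartsDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// Hypothetical helper, not part of THttpScan: splits a URL into pieces
// that mirror the OnLinkFound parameter descriptions.
procedure SplitUrl(const Url: string;
  var HostName, UrlPath, UrlPathWithFile, ExtraInfos: string);
var
  Rest: string;
  P: Integer;
begin
  Rest := Url;

  // Drop the scheme ("http://") if present.
  P := Pos('://', Rest);
  if P > 0 then
    Delete(Rest, 1, P + 2);

  // Everything from the first "?" on is treated as the extra information.
  P := Pos('?', Rest);
  if P > 0 then
  begin
    ExtraInfos := Copy(Rest, P, Length(Rest));
    Delete(Rest, P, Length(Rest));
  end
  else
    ExtraInfos := '';

  // The host name runs up to the first "/".
  P := Pos('/', Rest);
  if P > 0 then
  begin
    HostName := Copy(Rest, 1, P - 1);
    UrlPathWithFile := Copy(Rest, P, Length(Rest));
  end
  else
  begin
    HostName := Rest;
    UrlPathWithFile := '/';
  end;

  // The path without filename ends at the last "/".
  P := Length(UrlPathWithFile);
  while (P > 0) and (UrlPathWithFile[P] <> '/') do
    Dec(P);
  UrlPath := Copy(UrlPathWithFile, 1, P);
end;

var
  Host, Path, PathWithFile, Extra: string;
begin
  SplitUrl('http://www.hostname.foo/pics/thumbs/a.jpg?param1=v',
    Host, Path, PathWithFile, Extra);
  WriteLn('HostName        : ', Host);
  WriteLn('UrlPath         : ', Path);
  WriteLn('UrlPathWithFile : ', PathWithFile);
  WriteLn('ExtraInfos      : ', Extra);
end.
```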
Onlog (LogMessage : string);
returns a string which describes the internal analysis (for debugging purposes).
OnPageReceived (Hostname, Url, Head, Body : string);
occurs each time an html page is downloaded. Returns the headers and the body text of the page.
Url : url of the page received
Hostname : host name of the page received
Head : the http headers for the page received
Body : body text of the page received.
OnUpdatedStats (InQueue, Downloading, ToAnalyze, Done, Retries, Errors: Integer);
occurs each time something changes in the httpscan state. Returns the number of
pages in queue (waiting for download), the number of pages currently
downloading, the number of pages waiting to be analyzed, the number of pages
analyzed (done), the number of retries, and the number of page downloads in error.
OnWorking (working_ : boolean);
occurs when httpscan passes from the "waiting" state to the "working" state and vice versa. Can be used to detect when HttpScan has finished its job. You can also use the Working property.
Comments on the WriteToFile parameter used in the OnLinkFound event:
WriteToFile contains the string to be written to the FileOfResult file. If you leave it untouched, a line like this is written to the file for each link found: "TypeLink";"UrlFound";"HostName".
WriteToFile is useful to write only some kinds of links (e.g. "jpg") to the FileOfResult file, or to choose the information written to the file. For example:
If you want to write your own data to the file, e.g. TypeLink, UrlFound and FromUrl, then add the following line in the event:
WriteToFile := '"' + TypeLink + '";"' + UrlFound + '";"' + FromUrl + '"';
If you want to skip the current link and write nothing to the file for it, simply add the following lines in the event:
if ...=... then begin
WriteToFile := '';
end;
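The two cases above can be combined into a single decision: build a custom line for the links you want and return an empty string for the rest. The stand-alone console sketch below illustrates this with a hypothetical helper, ResultLine, whose return value is what an OnLinkFound handler would assign to WriteToFile. The parameter names follow the OnLinkFound signature; the helper itself, the "jpg only" filter and the sample addresses are assumptions made for the example.

```pascal
program WriteToFileDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper, not part of THttpScan: returns the text an
// OnLinkFound handler could assign to WriteToFile.
// Only "jpg" links are kept; any other link yields an empty string,
// which means nothing is written to the file for that link.
function ResultLine(const TypeLink, UrlFound, FromUrl: string): string;
begin
  if LowerCase(TypeLink) = 'jpg' then
    Result := '"' + TypeLink + '";"' + UrlFound + '";"' + FromUrl + '"'
  else
    Result := '';
end;

begin
  // A jpg link: a quoted, semicolon-separated line is produced.
  WriteLn(ResultLine('jpg', 'http://www.hostname.foo/pics/a.jpg',
    'http://www.hostname.foo/index.htm'));
  // An htm link: the brackets show that the result is empty.
  WriteLn('[' + ResultLine('htm', 'http://www.hostname.foo/page2.htm',
    'http://www.hostname.foo/index.htm') + ']');
end.
```

Inside a real handler the equivalent would be a single assignment such as WriteToFile := ResultLine(TypeLink, UrlFound, FromUrl);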
With THTTPSCAN you access web sites as a collection of links to files and
data, instead of as graphics and text.
THTTPSCAN recursively analyses HTML pages and extracts all the links found, with
detailed information (document type, referer, host name, ...). Links are
followed through HTML pages in the neighborhood of the initial URL.
Events are generated for each link found and each page read. The "depth
search level" and "stay on site" parameters allow powerful
searches and a full view of a site's files.
THTTPSCAN saves you from having to tangle with the Microsoft Wininet API functions
and with parsing internet address syntax. The most common parameters can be
set directly from the Object Inspector. The component can be placed on any window
and is only visible at design time.
THTTPSCAN is the basic tool to create (among other things):
- custom search engines: search without a browser. THTTPSCAN finds the pages and returns the contents and the list of linked files.
- multimedia finders: list the mp3, jpg and mpg files linked to a site or in the neighborhood of a site
- download managers: THTTPSCAN gives you the whole list of the links.
- site change monitoring: create an automated tool to monitor when new links are added to a site, or when the content of the pages has changed.
- simple-to-use integration into a Delphi application
- single registration per developer, no ongoing licence fee
- asynchronous, non-blocking transactions
- ability to keep searching on the initial site (stay on site)
- depth search level from 1 (same page) to n
- event generated on each link found (no polling necessary), with the parameters listed under OnLinkFound
- extract links from frames
- event generated on each HTML page read
- possibility to search for robots.txt files
- proxy support with username/password through the Control Panel | Internet Options parameters
- Windows 95/98 / Windows NT 3.5 or 4 / Windows 2000
- Delphi Version 4 or 5
The author of this program accepts no responsibility for damages resulting from the use of this product and makes no warranty or representation, either expressed or implied, including but not limited to, any implied warranty of merchantability or fitness for a particular purpose.
These software packages are provided here "AS IS", and you, the user, assume all risks when using them.
Register THTTPSCAN and you will get the full source code. Registration costs:
- US $27
- EURO 29
- French Francs 190
You may register online at http://www.getsoftware.com.
THttpscan home page: http://www.mfornengo.ath.cx.