Submitted 24 February 2003, Updated 20 September 2006
This article describes how to extract a list of all web resources (URLs) from a web site using the Clever Internet Suite components.
A newer version of the Clever Internet Suite, 6.2, has since been released. Please see the notes at the end of this article to update the URL Extractor for version 6.2.
There is a big demand for applications that walk through a web server (URL) and collect a list of all web resources available from it. The best-known programs with such functionality are probably Teleport Pro and Flash Get.
The general idea, however, is very simple: download a web page, parse it, and extract all links and URLs.
In this article we discuss a very simple and generic algorithm based on recursively downloading and parsing web pages in asynchronous mode.
As mentioned above, the main steps of the algorithm are:
- Download a web page / URL
- Parse the downloaded page and extract all URLs (you can use any method you like to parse pages)
- Save the extracted URLs into the URL list
- Take the next URL from the URL list
- Repeat from the first step until the end of the URL list is reached
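The steps above can be sketched without any networking at all. The console program below is only an illustration under stated assumptions: FetchLinks is a hypothetical stand-in for "download and parse" that serves a tiny in-memory site, and the URLs are made up. It shows how one list can act as both the result and the work queue.

```pascal
program CrawlSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ELSE}{$APPTYPE CONSOLE}{$ENDIF}

uses
  SysUtils, Classes;

// Hypothetical stand-in for "download and parse": returns the links of a
// tiny in-memory site instead of touching the network.
procedure FetchLinks(const AURL: string; ALinks: TStrings);
begin
  ALinks.Clear;
  if AURL = 'http://site/index.htm' then
  begin
    ALinks.Add('http://site/a.htm');
    ALinks.Add('http://site/b.htm');
  end
  else if AURL = 'http://site/a.htm' then
  begin
    ALinks.Add('http://site/index.htm'); // cross link back to the root
    ALinks.Add('http://site/b.htm');
  end;
end;

// Collects every URL reachable from ARootURL in AURLList.
procedure Crawl(const ARootURL: string; AURLList: TStrings);
var
  Index, i: Integer;
  Links: TStrings;
begin
  AURLList.Clear;
  AURLList.Add(ARootURL);
  Links := TStringList.Create;
  try
    Index := 0;
    // The list grows while it is walked, so Count is re-read on every pass.
    while Index < AURLList.Count do
    begin
      FetchLinks(AURLList[Index], Links);   // download and parse
      for i := 0 to Links.Count - 1 do
        if AURLList.IndexOf(Links[i]) < 0 then
          AURLList.Add(Links[i]);           // save new URLs only
      Inc(Index);                           // take the next URL
    end;
  finally
    Links.Free;
  end;
end;

var
  URLs: TStringList;
begin
  URLs := TStringList.Create;
  try
    Crawl('http://site/index.htm', URLs);
    WriteLn(URLs.Text);
  finally
    URLs.Free;
  end;
end.
```

This version is synchronous. The article's implementation follows the same list-driven loop, but each pass is triggered by a component event.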
The first step can be implemented with the Clever Downloader component. In addition to the base functionality provided by other popular libraries (such as Indy, IPWorks and so on), the TclDownloader component allows you to download a web page / URL in asynchronous mode without interfering with the main application process. When the download completes, the component's OnProcessCompleted event occurs. To implement recursive downloading we also need the OnIsBusyChanged event. Starting the next download from OnIsBusyChanged, once the component is no longer busy, protects us from a stack overflow while crawling through the server's URLs.
Here is the code for the first step of our algorithm:
    clDownLoader: TclDownLoader;
    memURLList: TMemo;
...
    procedure TForm1.btnStartClick(Sender: TObject);
    begin
      if clDownLoader.IsBusy then Exit;
      // Start a new crawl with the root URL as the only list entry.
      memURLList.Lines.Clear();
      memURLList.Lines.Add(edtRootURL.Text);
      FCurrentURLIndex := -1;
      ProcessNextURL();
    end;

    procedure TForm1.ProcessNextURL();
    begin
      // Advance to the next list entry that looks like an HTML page.
      repeat
        Inc(FCurrentURLIndex);
      until (FCurrentURLIndex >= memURLList.Lines.Count)
        or (Pos('.asp', memURLList.Lines[FCurrentURLIndex]) > 0)
        or (Pos('.htm', memURLList.Lines[FCurrentURLIndex]) > 0);

      if (FCurrentURLIndex < memURLList.Lines.Count) then
      begin
        clDownLoader.URL := memURLList.Lines[FCurrentURLIndex];
        clDownLoader.Start();
      end else
      begin
        ShowMessage('Process Completed');
      end;
    end;
The main loop in the ProcessNextURL method searches for the next URL in the URL list. The important point is that the URL should look like an HTML page (in this example we just check for the page extension). We have simplified this method, but if you need more advanced analysis you can easily modify it according to your needs.
When the download completes, the OnIsBusyChanged event occurs. Here is the code for this event:
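One possible refinement of the extension check is sketched below. The Pos test is case-sensitive and matches anywhere in the URL, so it misses INDEX.HTM and accepts a host name such as www.htmlsite.com. LooksLikeHtmlPage is a hypothetical helper, not part of the Clever Internet Suite, and the list of extensions is an assumption you would adjust for your site.

```pascal
program PageFilterSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ELSE}{$APPTYPE CONSOLE}{$ENDIF}

uses
  SysUtils;

// Hypothetical helper: True when the last path segment of the URL ends in
// one of a few page-like extensions (the extension list is an assumption).
function LooksLikeHtmlPage(const AURL: string): Boolean;
var
  Path, Ext: string;
  P, i: Integer;
begin
  Path := AURL;
  // Drop the query string and the fragment before looking at the extension.
  P := Pos('?', Path);
  if P > 0 then
    Path := Copy(Path, 1, P - 1);
  P := Pos('#', Path);
  if P > 0 then
    Path := Copy(Path, 1, P - 1);

  // Take the text from the last '.' that follows the last '/'.
  Ext := '';
  for i := Length(Path) downto 1 do
  begin
    if Path[i] = '/' then
      Break;
    if Path[i] = '.' then
    begin
      Ext := LowerCase(Copy(Path, i, MaxInt));
      Break;
    end;
  end;

  Result := (Ext = '.htm') or (Ext = '.html') or
            (Ext = '.asp') or (Ext = '.aspx');
end;

begin
  WriteLn(LooksLikeHtmlPage('http://www.site.com/INDEX.HTM?id=1'));
  WriteLn(LooksLikeHtmlPage('http://www.htmlsite.com/logo.gif'));
end.
```

Like the original check, this sketch still skips URLs without an extension, such as a bare site root, so the root URL you start from should name a page.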
    procedure TForm1.clDownLoaderIsBusyChanged(Sender: TObject);
    var
      List: TStrings;
    begin
      // The event fires on every state change; react only when the download has finished.
      if clDownLoader.IsBusy then Exit;

      if FileExists(clDownLoader.LocalFile) then
      begin
        List := TStringList.Create();
        try
          List.LoadFromFile(clDownLoader.LocalFile);
          ExtractURLS(List, memURLList.Lines);
        finally
          List.Free();
        end;
        // The downloaded page is no longer needed once its URLs are extracted.
        DeleteFile(clDownLoader.LocalFile);
      end;
      ProcessNextURL();
    end;

You can extract URLs from the downloaded web page with whichever method you prefer. In our example we used simple page parsing. The full source code can be downloaded at urlextractor.zip.
    procedure TForm1.ExtractURLS(APage, AURLList: TStrings);
    var
      i: Integer;
      List: TStrings;
    begin
      List := TStringList.Create();
      try
        ParsePage(APage, List);
        for i := 0 to List.Count - 1 do
        begin
          if (AURLList.IndexOf(List[i]) < 0) then
          begin
            AURLList.Add(List[i]);
          end;
        end;
      finally
        List.Free();
      end;
    end;
When parsing URL content, please pay attention to the following issues:
- Check for duplicate URL entries before adding a URL to the URL list. Most web pages contain cross-linked references within the web site.
- Check for links to foreign web sites. This prevents jumping to another web server while crawling through the specified web site.
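For the first point, a linear IndexOf over the visible URL list works but slows down as the list grows, and it treats page.htm and page.htm#top as different pages. The following is a minimal sketch of one alternative, assuming a separate sorted "seen" list next to the ordered queue. AddIfNew and both list names are hypothetical and not part of the article's sample.

```pascal
program DedupSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ELSE}{$APPTYPE CONSOLE}{$ENDIF}

uses
  SysUtils, Classes;

// Hypothetical helper: appends AURL to AQueue only if it was not seen before.
// ASeen is expected to be a sorted TStringList so IndexOf is a binary search.
function AddIfNew(const AURL: string; ASeen: TStringList;
  AQueue: TStrings): Boolean;
var
  Key: string;
  P: Integer;
begin
  Key := AURL;
  // page.htm#top and page.htm refer to the same page: drop the fragment.
  P := Pos('#', Key);
  if P > 0 then
    Key := Copy(Key, 1, P - 1);

  // Note: TStringList compares case-insensitively by default.
  Result := (Key <> '') and (ASeen.IndexOf(Key) < 0);
  if Result then
  begin
    ASeen.Add(Key);   // kept sorted for fast lookups
    AQueue.Add(Key);  // keeps discovery order for crawling
  end;
end;

var
  Seen, Queue: TStringList;
begin
  Seen := TStringList.Create;
  Queue := TStringList.Create;
  try
    Seen.Sorted := True;
    AddIfNew('http://site/a.htm', Seen, Queue);
    AddIfNew('http://site/a.htm#top', Seen, Queue); // duplicate, ignored
    AddIfNew('http://site/b.htm', Seen, Queue);
    WriteLn(Queue.Text);
  finally
    Queue.Free;
    Seen.Free;
  end;
end.
```

The queue keeps the order in which URLs were discovered, which matters because the crawler walks it by index. The sorted list exists only to answer "have we seen this?" quickly.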
To check whether a link belongs to the web site on which the page is hosted, you can simply compare the host parts of both URLs (for example, http://www.site.com and http://www.site.com/index.asp). The Clever Internet Suite has a TclURLParser class specially designed for this purpose.
    function TForm1.IsURLNative(const AURL, ABaseURL: string): Boolean;
    var
      URLParser, BaseURLParser: TclURLParser;
    begin
      Result := False;
      URLParser := TclURLParser.Create();
      BaseURLParser := TclURLParser.Create();
      try
        BaseURLParser.Parse(ABaseURL);
        if (URLParser.Parse(AURL) <> '') then
        begin
          Result := (URLParser.Host = BaseURLParser.Host);
        end;
      finally
        BaseURLParser.Free();
        URLParser.Free();
      end;
    end;
The given example is far from perfect and might require additional alterations. You can always enhance this code to get the desired functionality.
In the newer version 6.2 we have improved the URL parsing algorithm and made some code modifications. Please check the following before compiling our samples:
- We have moved the TclURLParser class into the clUriUtils unit.
- The spelling of this class has also changed: TclURLParser -> TclUrlParser
- A new Html Parser component has been added. It allows you to extract all URLs, images and other HTML tags and access them as simple collections.
Please add the clUriUtils unit to your uses list and recompile your program. You can get all class and method declarations for the Clever Internet Suite library at inetsuitehdr.zip.
With the Html Parser component you can replace your code as follows:
    procedure TForm1.ParsePage(APage, AList: TStrings);
    var
      i: Integer;
      url, fullurl: string;
    begin
      clHtmlParser1.Parse(APage);
      for i := 0 to clHtmlParser1.Links.Count - 1 do
      begin
        url := clHtmlParser1.Links[i].Href;
        if GetFullURL(url, fullurl) and (AList.IndexOf(fullurl) < 0) then
          AList.Add(fullurl);
      end;
    end;
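ParsePage relies on GetFullURL, whose body is not shown in this article. The sketch below is not that method. ResolveURL is a hypothetical stand-in that illustrates the general idea of turning an href into an absolute URL against the page it was found on. It is deliberately limited: it does not collapse ".." segments, it skips protocol-relative links, and it assumes the base URL has no "/" inside a query string.

```pascal
program ResolveSketch;
{$IFDEF FPC}{$MODE DELPHI}{$ELSE}{$APPTYPE CONSOLE}{$ENDIF}

uses
  SysUtils;

// Hypothetical stand-in for a URL resolver. Returns False for links that a
// crawler should not follow; otherwise stores the absolute URL in AFullURL.
function ResolveURL(const ABaseURL, AHref: string;
  out AFullURL: string): Boolean;
var
  SchemeEnd, PathStart, LastSlash, i: Integer;
  Lower: string;
begin
  Result := False;
  AFullURL := '';
  Lower := LowerCase(AHref);

  // Skip empty links, in-page anchors, protocol-relative links and
  // schemes that do not lead to a downloadable page.
  if (AHref = '') or (AHref[1] = '#') or (Copy(AHref, 1, 2) = '//') or
     (Pos('mailto:', Lower) = 1) or (Pos('javascript:', Lower) = 1) then
    Exit;

  // Already absolute: use it unchanged.
  if (Pos('http://', Lower) = 1) or (Pos('https://', Lower) = 1) then
  begin
    AFullURL := AHref;
    Result := True;
    Exit;
  end;

  SchemeEnd := Pos('://', ABaseURL);
  if SchemeEnd = 0 then
    Exit; // the base must be an absolute URL

  // Locate the first and the last '/' after "scheme://host".
  PathStart := 0;
  LastSlash := 0;
  for i := SchemeEnd + 3 to Length(ABaseURL) do
    if ABaseURL[i] = '/' then
    begin
      if PathStart = 0 then
        PathStart := i;
      LastSlash := i;
    end;

  if PathStart = 0 then
  begin
    // The base is just "scheme://host".
    if AHref[1] = '/' then
      AFullURL := ABaseURL + AHref
    else
      AFullURL := ABaseURL + '/' + AHref;
  end
  else if AHref[1] = '/' then
    AFullURL := Copy(ABaseURL, 1, PathStart - 1) + AHref // root-relative
  else
    AFullURL := Copy(ABaseURL, 1, LastSlash) + AHref;    // same directory

  Result := True;
end;

var
  Full: string;
begin
  if ResolveURL('http://www.site.com/docs/index.htm', 'page2.htm', Full) then
    WriteLn(Full);
  if ResolveURL('http://www.site.com/docs/index.htm', '/img/logo.gif', Full) then
    WriteLn(Full);
end.
```

For production use, a full resolver that follows the URI reference resolution rules is a safer choice than string slicing like this.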
Best regards,
Sergey Shirokov
Clever Components team

Please feel free to contact me at info@clevercomponents.com