July 5, 2008

webspider source code

What is a WebSpider
A WebSpider or crawler is an automated program that follows links on websites and calls a WebRobot to handle the contents of each link.

What is a WebRobot
A WebRobot is a program that processes the content found through a link. It can be used to index a page or to extract useful information based on a predefined query. Common examples are link checkers, e-mail address extractors, multimedia extractors and update watchers.


Background
I recently had a contract to build a web page link checker. The component had to check links stored in a database as well as links on a website, both through the local file system and over the internet.

This article explains the WebRobot, the WebSpider and how to enhance the WebRobot through specialized content handlers. The code shown has had some superfluous parts removed, such as try blocks, variable initialization and minor methods.

Class overview
Two classes make up the WebRobot: WebPageState, which represents a URI and its current state in the process chain, and an implementation of IWebPageProcessor, which performs the actual reading of the URI, calls content handlers and deals with page errors.

The WebSpider has only one class, WebSpider. It maintains a list of pending and processed URIs, held as WebPageState objects, and runs a WebPageProcessor against each WebPageState to extract links to other pages and to test whether the URIs are valid.
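The pending/processed bookkeeping described above can be sketched roughly as follows. This is not the article's WebSpider: the in-memory Site table, the static Process method, the Valid and Links members and the string-typed Uri are all assumptions made so the example runs without network access, and it assumes a reasonably recent C# compiler (local functions, index initializers).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, cut-down stand-in for the article's WebPageState.
public class WebPageState
{
    public WebPageState(string uri) { Uri = uri; }
    public string Uri { get; }
    public bool Valid { get; set; }
    public List<string> Links { get; } = new List<string>();
}

public static class SpiderSketch
{
    // Assumed in-memory "site": page -> links found on that page.
    static readonly Dictionary<string, string[]> Site = new Dictionary<string, string[]>
    {
        ["/index"] = new[] { "/a", "/b" },
        ["/a"] = new[] { "/b", "/index" },
        ["/b"] = new[] { "/missing" },
    };

    // Stand-in for the page processor: returns false for unknown pages.
    static bool Process(WebPageState state)
    {
        if (!Site.TryGetValue(state.Uri, out var links)) return false;
        state.Links.AddRange(links);
        return true;
    }

    public static List<WebPageState> Crawl(string start)
    {
        var all = new List<WebPageState>();      // every URI seen so far
        var seen = new HashSet<string>();        // guards against revisiting
        var pending = new Queue<WebPageState>(); // URIs still to process

        void Enqueue(string uri)
        {
            if (!seen.Add(uri)) return;          // already pending or processed
            var state = new WebPageState(uri);
            all.Add(state);
            pending.Enqueue(state);
        }

        Enqueue(start);
        while (pending.Count > 0)
        {
            var state = pending.Dequeue();
            state.Valid = Process(state);
            foreach (var link in state.Links) Enqueue(link);
        }
        return all;
    }

    public static void Main()
    {
        foreach (var page in Crawl("/index"))
            Console.WriteLine(page.Uri + " valid=" + page.Valid);
    }
}
```

The set of seen URIs is what stops the spider from looping when pages link back to each other, as /a and /index do here.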

Using the code - WebRobot
Web page processing is handled by an object that implements IWebPageProcessor. The Process method expects to receive a WebPageState, which is updated during page processing; if all is successful, the method returns true. Any number of content handlers can also be called after the page has been read, by assigning WebPageContentDelegate delegates to the processor.

public delegate void WebPageContentDelegate( WebPageState state );

public interface IWebPageProcessor
{
   bool Process( WebPageState state );

   WebPageContentDelegate ContentHandler { get; set; }
}
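
As a rough illustration of how several content handlers might be attached, the sketch below relies on the fact that C# delegates are multicast, so handlers can be combined with +=. StubPageProcessor and the Content property are hypothetical; the stub fills in canned content instead of reading the URI, and the WebPageState shown is a minimal stand-in, not the article's class.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical minimal stand-in for the article's WebPageState.
public class WebPageState
{
    public WebPageState(Uri uri) { Uri = uri; }
    public Uri Uri { get; }
    public string Content { get; set; } = "";
}

public delegate void WebPageContentDelegate(WebPageState state);

public interface IWebPageProcessor
{
    bool Process(WebPageState state);
    WebPageContentDelegate ContentHandler { get; set; }
}

// Hypothetical processor: skips the network read and uses canned content.
public class StubPageProcessor : IWebPageProcessor
{
    public WebPageContentDelegate ContentHandler { get; set; }

    public bool Process(WebPageState state)
    {
        // A real processor would read state.Uri here and handle errors.
        state.Content = "<a href=\"http://example.com/a\">a</a>";
        // Invokes every attached handler in the order it was added.
        ContentHandler?.Invoke(state);
        return true;
    }
}

public static class Demo
{
    public static void Main()
    {
        var log = new List<string>();
        IWebPageProcessor processor = new StubPageProcessor();

        // Two independent handlers combined into one multicast delegate.
        processor.ContentHandler += s => log.Add("visited " + s.Uri);
        processor.ContentHandler += s => log.Add("length " + s.Content.Length);

        bool ok = processor.Process(new WebPageState(new Uri("http://example.com/")));
        Console.WriteLine(ok);
        foreach (var line in log) Console.WriteLine(line);
    }
}
```

Because each handler only sees the WebPageState, a link extractor and an e-mail extractor could be added or removed without touching the processor itself.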

 

Reposted from: SEO基地

Article link: http://www.11zhuce.com/seo/788.html
