July 5, 2008
WebSpider source code
What is a WebSpider
A WebSpider or crawler is an automated program that follows links on websites and calls a WebRobot to handle the contents of each link.
What is a WebRobot
A WebRobot is a program that processes the content found through a link. It can be used to index a page or to extract useful information based on a predefined query. Common examples are link checkers, e-mail address extractors, multimedia extractors and update watchers.
Background
I recently had a contract to build a web page link checker. The component had to check links stored in a database as well as links on a website, both through the local file system and over the internet.
This article explains the WebRobot, the WebSpider, and how to enhance the WebRobot through specialized content handlers. The code shown has some superfluous parts removed, such as try blocks, variable initialization and minor methods.
Class overview
Two classes make up the WebRobot: WebPageState, which represents a URI and its current state in the process chain, and an implementation of IWebPageProcessor, which performs the actual reading of the URI, calls the content handlers and deals with page errors.
The WebSpider has only one class, WebSpider. It maintains the pending and processed URIs as a list of WebPageState objects and runs a WebPageProcessor against each WebPageState to extract links to other pages and to test whether the URIs are valid.
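The pending/processed bookkeeping described above can be illustrated with a minimal sketch. It is not the article's WebSpider class: SpiderSketch, the Processed, Succeeded and Links members, and the maxPages limit are assumptions made for this example, and a plain Func stands in for the page processor.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, pared-down stand-in for the article's WebPageState.
public class WebPageState
{
    public WebPageState(Uri uri) { Uri = uri; }

    public Uri Uri { get; }
    public bool Processed { get; set; }                 // assumed flag
    public bool Succeeded { get; set; }                 // assumed flag
    public List<Uri> Links { get; } = new List<Uri>();  // assumed: links found on the page
}

// Illustrative crawl loop; not the article's WebSpider class.
public class SpiderSketch
{
    // Every URI ever queued, so the same page is never processed twice.
    private readonly Dictionary<Uri, WebPageState> known = new Dictionary<Uri, WebPageState>();
    private readonly Queue<WebPageState> pending = new Queue<WebPageState>();
    private readonly Func<WebPageState, bool> process;

    public SpiderSketch(Func<WebPageState, bool> process)
    {
        this.process = process;
    }

    public IEnumerable<WebPageState> Pages { get { return known.Values; } }

    public void Execute(Uri start, int maxPages)
    {
        Enqueue(start);
        int handled = 0;
        while (pending.Count > 0 && handled < maxPages)
        {
            WebPageState state = pending.Dequeue();
            state.Succeeded = process(state);  // stand-in for IWebPageProcessor.Process
            state.Processed = true;
            handled++;

            // Queue every link the processor recorded on this page.
            foreach (Uri link in state.Links)
            {
                Enqueue(link);
            }
        }
    }

    private void Enqueue(Uri uri)
    {
        if (known.ContainsKey(uri)) return;  // already pending or processed
        var state = new WebPageState(uri);
        known.Add(uri, state);
        pending.Enqueue(state);
    }
}
```

Keying the known pages by Uri means a link back to an already visited page is ignored, so a cycle between two pages does not loop forever.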
Using the code - WebRobot
Web page processing is handled by an object that implements IWebPageProcessor. The Process method expects to receive a WebPageState, which is updated during page processing; if all is successful, the method returns true. Any number of content handlers can also be called after the page has been read, by assigning WebPageContentDelegate delegates to the processor.
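// Signature of a content handler, called after a page has been read.
public delegate void WebPageContentDelegate( WebPageState state );

public interface IWebPageProcessor
{
    // Reads the URI held by state, updates state, and returns true on success.
    bool Process( WebPageState state );

    // Content handlers to run once the page has been read.
    WebPageContentDelegate ContentHandler { get; set; }
}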
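As a rough illustration of how handlers could be attached, the sketch below repeats the delegate and interface and adds a toy processor. InMemoryPageProcessor, the Content property and the Demo class are assumptions made for this example only; the article's real processor reads the URI from the file system or the network.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical minimal WebPageState; the article's class carries more state.
public class WebPageState
{
    public WebPageState(Uri uri) { Uri = uri; }

    public Uri Uri { get; }
    public string Content { get; set; }  // assumed: text of the page
}

public delegate void WebPageContentDelegate(WebPageState state);

public interface IWebPageProcessor
{
    bool Process(WebPageState state);
    WebPageContentDelegate ContentHandler { get; set; }
}

// Stand-in processor that "reads" pages from memory instead of the network.
public class InMemoryPageProcessor : IWebPageProcessor
{
    private readonly IDictionary<Uri, string> pages;

    public InMemoryPageProcessor(IDictionary<Uri, string> pages)
    {
        this.pages = pages;
    }

    public WebPageContentDelegate ContentHandler { get; set; }

    public bool Process(WebPageState state)
    {
        string content;
        if (!pages.TryGetValue(state.Uri, out content))
        {
            return false;  // unknown page: report failure
        }

        state.Content = content;
        if (ContentHandler != null)
        {
            ContentHandler(state);  // invokes every attached handler in order
        }
        return true;
    }
}

public static class Demo
{
    public static List<string> Run()
    {
        var log = new List<string>();
        var home = new Uri("http://example.com/");
        var pages = new Dictionary<Uri, string>
        {
            { home, "<html><a href=\"mailto:me@example.com\">mail</a></html>" }
        };

        IWebPageProcessor processor = new InMemoryPageProcessor(pages);

        // Delegates are multicast, so += chains any number of handlers.
        processor.ContentHandler += state => log.Add("length " + state.Content.Length);
        processor.ContentHandler += state => log.Add("mailto " + state.Content.Contains("mailto:"));

        bool ok = processor.Process(new WebPageState(home));
        log.Add("ok " + ok);
        return log;
    }
}
```

Calling Demo.Run() from a Main method returns the log entries written by the two handlers, followed by the result of Process.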
Reposted from: SEO基地
Article link: http://www.11zhuce.com/seo/788.html