Saturday, 8 March 2014

What is a Web Crawler?

A Web Crawler is a software program, also known as a bot and sometimes referred to as a Web Spider. Its task is to crawl from web page to web page by following the links found in each page's anchor tags. Web Crawlers are mostly used by Search Engines to index web pages.

Internet as a Graph.

The Internet is a network of networks, and a Web Crawler treats it as a graph (the data structure). A crawler traverses the Internet the same way we traverse a graph, i.e. by using algorithms such as Breadth First Search or Depth First Search. Every web page is like a node in the graph, and the nodes are connected to each other by edges. A web page contains HTML anchor tags that link to other pages, either on the same domain or on another domain, and each such link acts as an edge between two web pages.

Since the Internet is similar to a graph, the traversal technique is the same. Just as a graph traversal needs a starting point, so does a Web Crawler: we have to feed it some URLs (known as seeds) as input. These give the crawler a starting point from which to begin its work.
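To make the graph view concrete, here is a minimal Python sketch of a Breadth First Search over a tiny, made-up web. The `WEB` dictionary, its page names and the `bfs_order` function are all assumptions for illustration; the page passed as `start` plays the role of a seed.

```python
from collections import deque

# Hypothetical toy web: each page maps to the pages its anchor tags link to.
WEB = {
    "a.example/": ["a.example/about", "b.example/"],
    "a.example/about": ["a.example/"],
    "b.example/": ["c.example/", "a.example/about"],
    "c.example/": [],
}

def bfs_order(graph, start):
    """Return the pages in the order a breadth-first traversal reaches them."""
    order = [start]
    seen = {start}            # nodes already discovered
    queue = deque([start])    # nodes waiting to be expanded
    while queue:
        page = queue.popleft()
        for linked in graph.get(page, []):   # follow each edge (link)
            if linked not in seen:
                seen.add(linked)
                order.append(linked)
                queue.append(linked)
    return order

print(bfs_order(WEB, "a.example/"))
# ['a.example/', 'a.example/about', 'b.example/', 'c.example/']
```

Swapping `queue.popleft()` for `queue.pop()` would expand the most recently discovered page first, which gives a depth-first style of traversal instead.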

How Does it Work?

The concept used by a Web Crawler is quite simple:

  1. Select a URL from the set of URLs.
  2. Download the web page associated with that URL.
  3. Extract the links from the web page.
  4. Filter the links by removing those that were previously traversed.
  5. Store the non-traversed links in the set of URLs.
  6. Repeat the same procedure from Point 1 with the next URL in the set.
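The steps above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: `crawl`, `extract_links`, `LinkExtractor` and the `PAGES` dictionary are hypothetical names, and the download step is passed in as a `fetch` function so that the example runs without touching the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin

class LinkExtractor(HTMLParser):
    """Collects the href value of every anchor tag in a page."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def extract_links(base_url, html):
    parser = LinkExtractor()
    parser.feed(html)
    # Resolve relative links and drop #fragments so each page has one URL.
    return [urldefrag(urljoin(base_url, h)).url for h in parser.hrefs]

def crawl(seeds, fetch, max_pages=100):
    frontier = deque(seeds)   # the set of URLs still to be visited
    seen = set(seeds)         # every URL ever added to the frontier
    crawled = []
    while frontier and len(crawled) < max_pages:
        url = frontier.popleft()                 # step 1: select a URL
        html = fetch(url)                        # step 2: download the page
        if html is None:                         # assumed: fetch returns None on failure
            continue
        crawled.append(url)
        for link in extract_links(url, html):    # step 3: extract the links
            if link not in seen:                 # step 4: filter traversed links
                seen.add(link)
                frontier.append(link)            # step 5: store the new links
    return crawled                               # step 6 is the loop itself

# Hypothetical in-memory site standing in for real downloads.
PAGES = {
    "http://site.test/": '<a href="/about">About</a> <a href="/news#top">News</a>',
    "http://site.test/about": '<a href="/">Home</a>',
    "http://site.test/news": '<a href="/about">About</a>',
}
print(crawl(["http://site.test/"], PAGES.get))
# ['http://site.test/', 'http://site.test/about', 'http://site.test/news']
```

A real crawler would replace `PAGES.get` with a function that downloads the page over HTTP and would also need error handling, timeouts and the policies described later in this post.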

Developing a simple Web Crawler like this is very easy. Building a crawler of the kind used by Search Engines is a big challenge. To keep a Search Engine up to date with the latest web pages, a crawler has to operate at very large scale, crawling thousands of pages every second. If we try to do this naively, there is a high chance that a web server will be overloaded and crash. One way to handle this is to distribute the downloads across multiple computers, which reduces the load.
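One possible way to split the downloads across machines is to assign each URL to a worker based on a hash of its host name, so that all pages of one site go to the same worker. This scheme is an assumption chosen for illustration, not something the post prescribes, and `worker_for` and `partition` are hypothetical names.

```python
import hashlib
from urllib.parse import urlsplit

def worker_for(url, num_workers):
    """Map a URL to a worker index using a stable hash of its host name."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers

def partition(urls, num_workers):
    """Group URLs into one bucket per worker."""
    buckets = [[] for _ in range(num_workers)]
    for url in urls:
        buckets[worker_for(url, num_workers)].append(url)
    return buckets

urls = [
    "http://a.test/1", "http://a.test/2",
    "http://b.test/1", "http://c.test/1",
]
for index, bucket in enumerate(partition(urls, 3)):
    print(index, bucket)
```

Keeping a whole host on one worker makes it easier for that worker to limit how often it hits the same server.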

Crawling Policies.

Every Web Crawler has to follow some policies, which help it work smoothly. Below are the four policies that a crawler has to follow.

  • Selection Policy:
    A Crawler has to select which pages it needs to crawl. Every crawler has to obey the Robots Exclusion Protocol, i.e. the site's robots.txt file.
  • Re-Visit Policy:
    A Crawler has to revisit web pages to pick up changes in their content.
  • Politeness Policy:
    A Crawler must not disrupt the performance of the website or server.
  • Parallelization Policy:
    A Crawler must run multiple processes to increase the download rate.
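As a rough illustration of the selection and politeness policies, the sketch below uses Python's standard `urllib.robotparser` module to check a robots.txt file. The robots.txt content, the `ExampleBot` user agent and the helper names `build_rules` and `seconds_to_wait` are assumptions made up for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

def build_rules(robots_txt):
    """Parse robots.txt text into a rule set that can be queried per URL."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules

def seconds_to_wait(last_fetch, now, delay):
    """Politeness: how long to wait before hitting the same host again."""
    return max(0.0, delay - (now - last_fetch))

rules = build_rules(ROBOTS_TXT)
print(rules.can_fetch("ExampleBot", "http://site.test/index.html"))         # True
print(rules.can_fetch("ExampleBot", "http://site.test/private/data.html"))  # False
print(rules.crawl_delay("ExampleBot"))                                      # 2
print(seconds_to_wait(last_fetch=10.0, now=10.5, delay=2))                  # 1.5
```

A crawler would skip any URL for which `can_fetch` returns `False`, and sleep for the value returned by `seconds_to_wait` before requesting another page from the same host.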

Existing Web Crawlers.

As mentioned earlier, Web Crawlers are mostly used by Search Engines. The following is a list of the crawlers used by the most popular Search Engines:

  • Google: Googlebot
  • Yahoo: Slurp
  • Bing: Bingbot