Web Spiders

Have you ever wondered how search engines know what content is on your web page, which pages link to it, or how popular it is? This is all done through the use of Web spiders. A Web spider, also referred to as a Web crawler, Web robot, bot or worm, is a program or automated script that browses the World Wide Web in a methodical, automated manner.

Web crawlers are primarily used to create a copy of every page they visit for later processing by a search engine, which indexes the downloaded pages so that searches are fast. A crawler finds new pages by following the links on pages it has already downloaded, which lets search engines cover a wide range of web pages quickly. It also means that a web site that is completely isolated, with no links pointing to it, is difficult for a crawler to discover.
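
The copy-and-follow-links process described above can be sketched roughly as follows. This is a minimal illustration rather than how any particular search engine works: the names LinkExtractor, crawl and FAKE_WEB are made up for this example, the URLs are placeholders, and the fetch argument is assumed to be any function that returns a page's HTML (here a dictionary lookup stands in for real HTTP requests).

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl from seed; returns {url: html} for visited pages.

    fetch is a hypothetical callable: url -> HTML string, or None on failure.
    """
    store = {}             # the crawler's copy of each visited page
    queue = deque([seed])  # pages discovered but not yet downloaded
    seen = {seed}          # guards against queueing the same URL twice
    while queue and len(store) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        store[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment part.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return store


# A tiny in-memory "web" (made-up URLs) standing in for real HTTP requests.
FAKE_WEB = {
    "http://example.com/": '<a href="/a.html">A</a> <a href="b.html">B</a>',
    "http://example.com/a.html": '<a href="/">home</a>',
    "http://example.com/b.html": "<p>no links here</p>",
    "http://example.com/isolated.html": "<p>nothing links to this page</p>",
}

pages = crawl("http://example.com/", FAKE_WEB.get)
print(sorted(pages))
```

The page isolated.html exists in the fake web but never appears in the result, because no visited page links to it. A search engine would pass the stored pages on to its indexer.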

Web Crawlers can also be used for automating maintenance tasks on a Web site, such as checking links or validating HTML code.
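
As an illustration of the link-checking task, the sketch below collects the links on one page and reports those whose targets cannot be reached. The names HrefCollector, is_reachable and broken_links are invented for this example. The reachability test assumes a plain HTTP HEAD request is good enough, which is not true of every server, so the check is passed in as a parameter and can be swapped out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import Request, urlopen


class HrefCollector(HTMLParser):
    """Gathers the href attribute of each <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def is_reachable(url, timeout=5):
    """Return True if a HEAD request to url succeeds."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as response:
            return response.status < 400
    except (OSError, ValueError):
        # HTTP errors, network failures and timeouts are OSError subclasses;
        # ValueError covers malformed URLs.
        return False


def broken_links(html, base_url, check=is_reachable):
    """Return the absolute URLs linked from html for which check() fails."""
    collector = HrefCollector()
    collector.feed(html)
    targets = [urljoin(base_url, href) for href in collector.hrefs]
    return [url for url in targets if not check(url)]


# Offline demonstration: a set of "live" URLs replaces real network requests.
page = '<a href="/ok.html">ok</a> <a href="missing.html">gone</a>'
live = {"http://example.com/ok.html"}
report = broken_links(page, "http://example.com/docs/", check=lambda u: u in live)
print(report)
```

Calling broken_links without the check argument uses the real HEAD request instead of the stand-in set.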

The use of web crawlers gives rise to a number of issues:

Updating Web Pages

The World Wide Web is constantly changing, and web sites are updated very frequently. For a web crawler to serve a useful purpose, the copies of the sites it holds must be up to date. As such, web crawlers need to be able to revisit sites on a highly regular basis.
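
One simple way a crawler might decide how often to revisit a page is to compare a fingerprint of the page between visits and adjust the revisit interval accordingly. The sketch below is only an assumed heuristic for illustration; the function names, the halve-or-double rule and the limits of 1 and 168 hours are not taken from any real crawler.

```python
import hashlib


def fingerprint(html):
    """Return a SHA-256 digest that changes whenever the page text changes."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def next_interval(old_fp, new_fp, interval, shortest=1.0, longest=168.0):
    """Pick the next revisit interval in hours (assumed heuristic).

    A page that changed since the last visit is revisited twice as soon;
    an unchanged page is left alone for twice as long, within the limits.
    """
    if old_fp != new_fp:
        return max(shortest, interval / 2)
    return min(longest, interval * 2)


before = fingerprint("<p>Opening hours: 9 to 5</p>")
after = fingerprint("<p>Opening hours: 9 to 6</p>")

print(next_interval(before, after, 24.0))   # page changed: 12.0
print(next_interval(before, before, 24.0))  # page unchanged: 48.0
```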

Restricted Access

Some web pages were never intended to be viewed by everyone, and web spiders cannot access sites that restrict access, for example behind a login. Furthermore, web spiders can use a large amount of the bandwidth available to a web site. As such, their operation may eat into a site's bandwidth quota and slow the site down for other visitors. To mitigate this problem, a voluntary “gentleman’s agreement” has emerged: site owners list in /robots.txt the parts of the site that spiders may and may not access, and well-behaved spiders read that file and respect it.
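
A well-behaved spider checks /robots.txt before requesting a page. The sketch below uses Python's standard urllib.robotparser module on a made-up robots.txt; the spider name ExampleSpider, the /private/ rule and the example.com URLs are assumptions for illustration. Crawl-delay is a widely used but non-standard directive that asks spiders to pause between requests, which addresses the bandwidth concern.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt; a real spider would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())

# Ask whether the (hypothetical) spider may fetch each URL.
allowed_public = rules.can_fetch("ExampleSpider", "http://example.com/index.html")
allowed_private = rules.can_fetch("ExampleSpider", "http://example.com/private/report.html")
delay = rules.crawl_delay("ExampleSpider")

print(allowed_public)   # True
print(allowed_private)  # False
print(delay)            # 10 (seconds to wait between requests)
```

Nothing forces a spider to run this check. The file only works because most crawler authors choose to honour it, which is why it offers no protection against spiders written for abuse.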

Spam

Web crawlers can also be used to gather specific types of information from web pages, such as e-mail addresses. Harvested addresses are usually used to send spam, and this misuse of web crawlers is an increasingly serious problem.

Examples

Some of the best-known web crawlers are those operated by the following search engines:

Google

AltaVista

Lycos

A more detailed list can be found at the Federal Court website here.

 
web_spiders.txt · Last modified: 2006/10/28 15:41 by sachinsuch01
 