Search Engine Robots or Web Crawlers
Most users rely on the available search engines to find the piece of information they need. But how do search engines provide this information, and where do they collect it from? Most search engines maintain their own database of information. These databases catalogue the sites available on the web and, ultimately, detailed information about each site's pages. Search engines do this background work by using robots to collect information and maintain the database. They build a catalog of the gathered information and then present it publicly, or at times for private use. In this article we will discuss these entities that loiter in the global Internet environment: the web crawlers that move around in netspace.
We will learn:
· What these entities are all about and what purpose they serve.
· The pros and cons of using them.
· How we can keep our pages away from crawlers.
· The differences between common crawlers and robots.

In the following portion we will divide the whole discussion into two sections:
I. Search Engine Spider : Robots.txt
II. Search Engine Robots : Meta-tags Explained

I. Search Engine Spider : Robots.txt

What is a robots.txt file?

A web robot is a program or search engine software that visits sites regularly and automatically, crawling through the web's hypertext structure by fetching a document and recursively retrieving all the documents it references. Sometimes site owners do not want all of their pages to be crawled by web robots, so they can exclude certain pages from crawling by using a standard mechanism. Most robots abide by the 'Robots Exclusion Standard', a set of constraints that restricts robot behavior. The 'Robots Exclusion Standard' is a protocol used by the site administrator to control the movement of robots.
When a search engine robot comes to a site, it searches for a file named robots.txt in the root of the domain (http://www.anydomain.com/robots.txt). This is a plain text file that implements the 'Robots Exclusion Protocol' by allowing or disallowing specific files within the site's directories. The site administrator can, for instance, disallow access to cgi, temporary or private directories for specific robot user-agent names. The format of the robots.txt file is very simple: each record consists of two kinds of fields, a User-agent field and one or more Disallow fields.
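Before looking at the file format in detail, it may help to see how a well-behaved robot consults these rules in practice. The sketch below uses Python's standard urllib.robotparser module; the domain and the rules themselves are made-up examples, not taken from any real site.

```python
# Sketch: checking whether a crawler may fetch a URL, using Python's
# standard urllib.robotparser. The rules and domain are hypothetical.
from urllib import robotparser

# A small robots.txt, supplied inline instead of fetched over HTTP.
rules = """
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
""".strip().splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

# A polite robot asks before fetching each page.
print(rp.can_fetch("googlebot", "http://www.anydomain.com/index.html"))      # allowed
print(rp.can_fetch("googlebot", "http://www.anydomain.com/cgi-bin/form.cgi"))  # disallowed
```

In a real crawler one would normally call rp.set_url("http://www.anydomain.com/robots.txt") followed by rp.read() to fetch the live file; parse() is used here so the example runs without network access.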
What is User-agent?

This field names the specific search engine robot that a record in the robots.txt file applies to. For example:

User-agent: googlebot

We can also use the wildcard character "*" to address all robots:

User-agent: *

This means the record applies to all robots that come to visit.

What is Disallow?

The second field of a robots.txt record is Disallow. These lines tell the robots which files or directories must not be crawled. For example, to prevent the file email.htm from being crawled, the syntax is:

Disallow: /email.htm

To prevent crawling of a whole directory, the syntax is:

Disallow: /cgi-bin/

White space and comments:

Any line in the robots.txt file beginning with # is treated as a comment only. A comment at the top of the file, as in the following example, is commonly used to annotate the rules that tell robots which URLs may be crawled.
# robots.txt for www.anydomain.com

Example entries for robots.txt:

1) User-agent: *
Disallow:

The asterisk (*) in the User-agent field denotes that "all robots" are invited. As nothing is disallowed, all robots are free to crawl everything.

2) User-agent: *
Disallow: /cgi-bin/
Disallow: /temp/
Disallow: /private/

All robots are allowed to crawl all files except those in the cgi-bin, temp and private directories.

3) User-agent: dangerbot
Disallow: /

dangerbot is not allowed to crawl any of the directories; "/" stands for all directories.
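Entry 3 can be verified with the same standard urllib.robotparser module used earlier ("dangerbot" is the article's made-up robot name): a record for a specific user-agent restricts only that robot, while robots not matched by any record remain allowed by default.

```python
# Sketch: entry 3 above, checked with Python's standard urllib.robotparser.
# "dangerbot" is a hypothetical robot name; "otherbot" stands for any robot
# not named in the file.
from urllib import robotparser

rules = """
User-agent: dangerbot
Disallow: /
""".strip().splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("dangerbot", "http://www.anydomain.com/index.html"))  # blocked everywhere
print(rp.can_fetch("otherbot", "http://www.anydomain.com/index.html"))   # no matching record, allowed
```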