Welcome to DNF.com™ - Domain Sales, Domain Forum, Domain Appraisals, Domain Registrars


  1. #1
    DNF Addict

    Join Date
    Jul 2004
    Location
    Chandigarh, India
    Posts
    2,674

    A Beginners Guide to Robots.txt

    Search engines use robots to crawl, or spider, pages on the web. These robots or crawlers are simply programs written to read web page content, including text, links, graphics and headings. Well-behaved crawlers follow a special instruction file known as the robots.txt file. For example, if a search robot visits the site http://www.seopages.com, it first looks for the robots file at http://www.seopages.com/robots.txt. If the file is found, the robot follows its instructions about which parts of the site to read and index and which to skip. The robots specification was developed in 1994, came to be known as ‘The Robots Exclusion Standard’, and still remains the standard for directing robots, with almost all search engines following it. How to define and place a robots file is covered later in this article.
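    As a rough illustration of the lookup described above, here is a minimal Python sketch that derives the robots.txt address from any page address on a site. The function name robots_url and the page path /articles/index.html are made up for the example; only the seopages.com host comes from the article.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host: the file is always looked up at the root.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# The page path below is a hypothetical example, not a real page.
print(robots_url("http://www.seopages.com/articles/index.html"))
# http://www.seopages.com/robots.txt
```

    Whatever page the robot starts from, the path, query string and fragment are dropped, which is why the file only works from the root of the site.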

    As the file extension implies, robots.txt is just a plain text file with no scripting or programming code in it. It can be created in a simple text editor such as Notepad and consists of simple text directives. Word processors should not be used, because the formatting they add can corrupt the file and cause robots to misread it. Almost every website has pages containing information that is not intended for general users; search engines can be told not to read those pages with the robots file. A robots.txt file can be customized to allow only specific search robots to spider the site, and to disallow reading of specific directories or files. Let us create a simple robots.txt file. Open a simple text editor such as Notepad, write the following lines and save the file as robots.txt:

    #this is a typical example of robots file
    #comments are placed after hash.

    User-agent: *
    Disallow: /cgi-bin/

    In this typical robots.txt file, the User-agent directive names the robot or spider that the following instructions apply to. For example, “User-agent: googlebot” names Google's robot, and the lines below it apply to that robot only. A value of “User-agent: *” means all robots on the web. Next comes the “Disallow” directive, which specifies a file or folder that the named robot is not allowed to read. The Disallow field can also be left blank, which means that all pages may be spidered. Take care to declare each file to be disallowed on a new line; in other words, multiple files should not be listed in a single Disallow directive. For example, to disallow multiple files we define robots.txt as:

    User-agent: Googlebot
    Disallow: /information.html
    Disallow: /private.html
    Disallow: /shipping.html

    User-agent: Architext
    Disallow: /

    In this example Googlebot is told not to crawl three pages, and Architext, the spider of Excite, is told not to crawl any page of the site. Any spider can be instructed in the same way if you know its name; otherwise use ‘*’. If the file to be protected resides in a folder other than the root folder ( / ), then the complete path of the file must be specified.

    Where should robots.txt be placed on a website? The answer is the root directory ( / ), where the index file is placed. Remember that there should be just one robots.txt file on a website. URL paths are case-sensitive, and the file name "robots.txt" must be all lower-case and spelled exactly that way. Blank lines are not permitted within a single record in the file, and each record must start with at least one “User-agent” field. If the robots file is placed in the wrong folder it loses its function: spiders do not find it there and ignore it.
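    The Googlebot and Architext example can be checked with Python's standard urllib.robotparser module. This is only a sketch: the rules are written with a leading slash on each path, the helper name allowed and the robot name SomeOtherBot are made up for illustration, and real search engines may interpret a file differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The two records from the example, with each path starting at the root.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /information.html
Disallow: /private.html
Disallow: /shipping.html

User-agent: Architext
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent: str, path: str) -> bool:
    """Hypothetical helper: may this robot read this path?"""
    return parser.can_fetch(agent, path)


print(allowed("Googlebot", "/private.html"))     # False: listed under Googlebot
print(allowed("Googlebot", "/index.html"))       # True: not listed
print(allowed("Architext", "/index.html"))       # False: whole site disallowed
print(allowed("SomeOtherBot", "/private.html"))  # True: no record names this robot
```

    The last line shows why a ‘*’ record is needed when the rules should apply to every spider: a robot that no record names is not restricted at all.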

    Advantages of having a Robots.txt

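    It helps keep pages you do not want listed out of search engines by disallowing spiders from indexing them. Keep in mind that the file is publicly readable and only advisory, so truly confidential pages still need real access control.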

    It helps in search engine specific optimization of a website (making web pages for particular search engines).

    This file should be written very carefully, according to the format specified, before it is uploaded to a website, because a simple mistake can result in a complete website being removed from search engine indexes. Don't make too many copies of web pages optimized for every search engine there is; instead be reasonable with the number and target the major five or seven engines. So now you know what a robots.txt file is, how to define it, how to use it and where to place it.
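    Since a simple mistake can be costly, a quick check before uploading can help. The following is a rough Python sketch, not a full validator: the function name lint_robots and the list of known fields are assumptions made for this example, and it only looks for the mistakes discussed in this thread (a missing colon, an unknown field, a Disallow line outside a record, and a path without a leading slash).

```python
# Fields that appear in this thread; treating anything else as unknown
# is an assumption of this sketch, not part of the standard.
KNOWN_FIELDS = {"user-agent", "disallow", "crawl-delay"}


def lint_robots(text: str) -> list[str]:
    """Return a list of human-readable problems found in a robots.txt body."""
    problems = []
    in_record = False  # True once the current record has a User-agent line
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            if not raw.strip():  # a truly blank line ends the current record
                in_record = False
            continue
        if ":" not in line:
            problems.append(f"line {number}: missing ':' separator")
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field not in KNOWN_FIELDS:
            problems.append(f"line {number}: unknown field '{field}'")
        elif field == "user-agent":
            in_record = True
        elif field == "disallow":
            if not in_record:
                problems.append(f"line {number}: Disallow before any User-agent")
            if value and not value.startswith("/"):
                problems.append(f"line {number}: path '{value}' should start with '/'")
    return problems


print(lint_robots("User-agent: *\nDisallow: /cgi-bin/\n"))
print(lint_robots("User-agent: Googlebot\nDisallow: private.html\n"))
```

    The first call reports no problems; the second flags the path that does not start at the root.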

    Enjoy!
    NICE Domains for sale - Huge Collection

    PM me for details

  2. #2
    Platinum Lifetime Member
    kokopelli
    Join Date
    Jul 2004
    Location
    USA
    Posts
    1,062

    Re: A Beginners Guide to Robots.txt

    Good article! Here's an example of a typical robots.txt file I may use:

    User-agent: Mediapartners-Google*
    Disallow:

    User-agent: Googlebot
    Disallow: /*.doc$
    Disallow: /*.PDF$
    Disallow: /*.jpeg$
    Disallow: /*.jpg$
    Disallow: /*.png$
    Disallow: /*.gif$
    Disallow: /*.exe$
    Disallow: /*.mp3$
    Disallow: /*.mid$
    Disallow: /*.wav$

    User-agent: msnbot
    Crawl-delay: 60
    Disallow: /*.doc$
    Disallow: /*.PDF$
    Disallow: /*.jpeg$
    Disallow: /*.jpg$
    Disallow: /*.png$
    Disallow: /*.gif$
    Disallow: /*.exe$
    Disallow: /*.mp3$
    Disallow: /*.mid$
    Disallow: /*.wav$

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /images/
    Disallow: /guardian/
    Disallow: /axs/
    Disallow: /admin/

    User-agent: Slurp
    Crawl-delay: 60

    This robots.txt file tells Google and MSN not to index certain files (e.g. images) and limits how often the spiders Slurp and msnbot hit the site (otherwise they can eat up bandwidth). The * and $ pattern characters and the Crawl-delay directive are extensions to the original standard, so not every spider honours them.

    This is just an example. Each website is different.
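    To make the pattern rules in the example concrete, here is a small Python sketch of how a ‘*’ wildcard and a trailing ‘$’ are commonly interpreted by the spiders that support them. The function name rule_matches and the sample paths are made up for illustration, and each search engine's real matching may differ in detail. The last lines read a Crawl-delay value with the standard urllib.robotparser module.

```python
import re
from urllib.robotparser import RobotFileParser


def rule_matches(pattern: str, path: str) -> bool:
    """Check a path against a Disallow pattern that may use '*' and a trailing '$'."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape every piece literally, then let each '*' stand for any characters.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        regex += "$"  # the rule must match up to the end of the path
    # Without '$' a rule is a prefix match, so matching from the start is enough.
    return re.match(regex, path) is not None


print(rule_matches("/*.jpg$", "/photos/cat.jpg"))        # True
print(rule_matches("/*.jpg$", "/photos/cat.jpg.html"))   # False: does not end in .jpg
print(rule_matches("/*.PDF$", "/docs/report.pdf"))       # False: matching is case-sensitive
print(rule_matches("/cgi-bin/", "/cgi-bin/search.cgi"))  # True: plain prefix rule

# Reading a crawl delay for one robot from a two-line record.
delays = RobotFileParser()
delays.parse(["User-agent: Slurp", "Crawl-delay: 60"])
print(delays.crawl_delay("Slurp"))  # 60
```

    The third call is worth noting: a rule written for .PDF does not cover files named .pdf, so both spellings need their own line if both exist on the site.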

    Here's one robots.txt file validator you can use: http://www.searchengineworld.com/cgi-bin/robotcheck.cgi
    Last edited by kokopelli; 07-23-2005 at 05:48 PM. Reason: Added link
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    My Current Websites for SALE

  3. #3
    DNF Addict

    Join Date
    Jul 2004
    Location
    Chandigarh, India
    Posts
    2,674

    Re: A Beginners Guide to Robots.txt

    Thanks for the comments and for the example, kokopelli!
