Incorporating robots.txt & meta tags

by Tyler Downer
3/12/08

Blocking Search Engines

You most likely know that search engines use robots, or spiders, to gather sites into their index. You probably also know that they follow every link they can find, gathering more and more content. They have probably already visited your site. But what if you want to keep them away from a portion of your site? Maybe you have a page that, for some reason, should not be spidered. Or you have a private section of your site, and you want to keep it that way. I will show you techniques for keeping spiders and unwanted visitors away. The first layer of defense is the robots.txt file.

robots.txt

To block a page on your site from a bot, you can use one of two methods. The first is a text file stored in your root folder. It must be named robots.txt and be reachable by typing www.yoururl.com/robots.txt into your web browser.
Most spiders know that the robots file is stored in this place and access it before they visit any of your pages. They then stay away from any pages that you mark. Below is a simple robots.txt file.

User-agent: *
Disallow: /private.html

In this file, User-agent means robot, and the asterisk (*) is a wildcard meaning all. Together they say, "All robots, listen to what follows." Next comes Disallow: /private.html, which tells the robots not to visit private.html.
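To see how a well-behaved robot might read these two lines, here is a minimal sketch using Python's standard urllib.robotparser module. The host www.yoururl.com is the placeholder used above; the bot name ExampleBot and the page index.html are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The simple robots.txt file from the article.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private.html
"""

def allowed(robots_txt, agent, url):
    """Return True if the given robot may fetch the URL under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

# "ExampleBot" is a hypothetical robot name; any name matches the * wildcard.
print(allowed(ROBOTS_TXT, "ExampleBot", "http://www.yoururl.com/private.html"))
print(allowed(ROBOTS_TXT, "ExampleBot", "http://www.yoururl.com/index.html"))
```

The first check should come back False and the second True. Keep in mind this only models a robot that chooses to obey the file; nothing forces a spider to do so.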

You can also tell specific bots to stay away by using their names. For example, if you wanted only Google to stay away from private.html, you would type googlebot after User-agent: instead of the asterisk.

If you want to make a more elaborate robots.txt file, look at the one below.

User-agent: googlebot
Disallow: /private.html
Disallow: /phonenumbers.html

User-agent: *
Disallow: /addresses.html

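This file tells Google to stay away from private.html and phonenumbers.html, and tells all other robots to stay away from addresses.html. Note that a robot with its own named group follows only that group and ignores the * group, so if you want Google to skip addresses.html too, you must list it under googlebot as well.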
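How a parser resolves the two groups can be checked with Python's standard urllib.robotparser module. This is only a sketch: the file is written as two separate records, ExampleBot is a made-up name standing in for "any other robot", and real spiders may differ in the details.

```python
from urllib.robotparser import RobotFileParser

# Two records: one for googlebot, one for everybody else.
ROBOTS_TXT = """\
User-agent: googlebot
Disallow: /private.html
Disallow: /phonenumbers.html

User-agent: *
Disallow: /addresses.html
"""

PAGES = ["/private.html", "/phonenumbers.html", "/addresses.html"]

def blocked_pages(robots_txt, agent, pages):
    """Return the pages that the named robot is asked to stay away from."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [page for page in pages if not parser.can_fetch(agent, page)]

print(blocked_pages(ROBOTS_TXT, "googlebot", PAGES))
print(blocked_pages(ROBOTS_TXT, "ExampleBot", PAGES))  # hypothetical robot
```

With this parser, googlebot is blocked from private.html and phonenumbers.html only, while ExampleBot is blocked from addresses.html only, because a robot that matches a named group does not also apply the wildcard group.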

Handy as the robots.txt file is, it has its problems. First, it requires access to the root folder of your web site. If the file is not in your root, a spider will not find it. Second, anyone can view it. You can go to almost any site, enter /robots.txt after the root URL, and see all the pages they don't want a spider to see! Fortunately there is an alternative: the robots meta tag. It hides the page from search engines without leaving a trail for people to follow.
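Because the file always sits at the same path under the site root, its address can be worked out from any page URL. The sketch below shows one way to do that with Python's standard urllib.parse module; the page path and query string are invented for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url):
    """Build the robots.txt address for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

# The page path here is a made-up example.
print(robots_url("http://www.yoururl.com/staff/private.html?id=3"))
# prints http://www.yoururl.com/robots.txt
```

This is exactly what a curious visitor does by hand in the address bar, which is why listing a secret page in robots.txt advertises it.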

Robots meta tag

This option is the best for a private section of your site. If you have a secret page for employees only and don't want the outside world to find it, you should put a meta tag on it. That way visitors can't open your robots.txt file and find the page from there. A robots meta tag simply looks like this:

<meta name="robots" content="noindex, nofollow" />

This does much the same job as the robots.txt file. The name attribute addresses all robots, and the content attribute tells them not to index anything on the page or follow any of its links (noindex, nofollow). While it is less convenient than the robots.txt file for addressing individual spiders, it is great for secret pages, or when you don't have access to the root of your web site. It can be tedious to add to many files, so for a large set of pages you may want to consider another method of protection, such as passwords.
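As a rough illustration of what a spider does when it meets this tag, the sketch below uses Python's standard html.parser module to pull the directives out of a page. The class name, function name, and sample page are all made up for this example, and real spiders may parse pages differently.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect the directives from any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            # Split "noindex, nofollow" into individual lowercase directives.
            for part in content.split(","):
                if part.strip():
                    self.directives.add(part.strip().lower())

def robots_directives(html):
    """Return the set of robots directives found in the page source."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives

# A made-up employees-only page carrying the tag from the article.
PAGE = ('<html><head><meta name="robots" content="noindex, nofollow" />'
        '</head><body>Staff only</body></html>')

found = robots_directives(PAGE)
print("noindex" in found, "nofollow" in found)
```

A spider that honors the tag would skip indexing the page when noindex is present and ignore its links when nofollow is present. Note that the spider has to fetch the page to read the tag, so the tag guides indexing but does not stop anyone from visiting.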

