
WordPress Duplicate Content

One of the problems with WordPress is that it creates a lot of duplicate content, and there has been a lot of discussion about whether search engines, especially Google, penalize you for duplicate content. It appears to be not a penalty but a filter.

If you create a standard WordPress site, the same post will be displayed under the specific post URL, the Archives, the Categories, the Feeds and the Trackbacks. And the Archives and Categories can create several duplicates by themselves, depending on the site settings.
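
For example, a single post might be reachable at all of the following addresses (these permalinks are hypothetical; your structure will differ):

example.com/2010/05/sample-post/
example.com/2010/05/
example.com/category/news/
example.com/2010/05/sample-post/feed/
example.com/2010/05/sample-post/trackback/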

When a crawler visits the site it must decide which of these pages is most relevant, and the majority of the time it does not pick the one you would. I have tested this on several sites and found that the Category pages usually get listed ahead of, or in place of, the actual post.

There are two ways to cure this. One is with a robots.txt file and the other is with an IF statement in the code. The downside to using only a robots.txt solution is that it is a blanket thrown over a specific problem, and sometimes you can trap a good page by accident.

A robots.txt file, placed in the site root, should be used to block out some of the core files and duplicate paths in WordPress, as follows.

User-agent: *
Disallow: /category/
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-
Disallow: /about/trackback/
Disallow: /wp-register.php
Disallow: /wp-login.php
Disallow: /trackback/
Disallow: /feed/

Now, to stop all the other duplicate content, place the following statement in the theme's header.php file, right before the first meta tag.

<?php
// Index the home page, single posts and static pages; noindex everything else but still follow links.
if(is_home() || is_single() || is_page()){
echo '<meta name="robots" content="index,follow">';
} else {
echo '<meta name="robots" content="noindex,follow">';
}?>

This tells the robots NOT to index any page that is not the home page, a post page or a static page, but to follow all the links on it.
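
To check that it is working, view the rendered source of a few pages. A category or archive page should contain

<meta name="robots" content="noindex,follow">

while the home page, single posts and static pages should contain

<meta name="robots" content="index,follow">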

With these changes the site should get a very clean and accurate index listing.