Although search engines keep improving their ability to deal with duplicate content, it remains one of the main SEO issues facing news and content sites. Even when the engines do a reasonable job of filtering duplicates out of their results, sites are essentially shooting themselves in the foot by splitting internal and inbound links to a piece of content across multiple URLs. It is therefore critical for publishers to diagnose and eliminate (or at least mitigate) the duplicate content issues on their sites.
Here are the most common causes of duplicate content on news media sites:
- Tracking codes. Appending tracking codes to URLs (e.g. ?xid=rss or ?cid=top-stories) results in the same piece of content existing on multiple URLs – in some cases quite a large number of URLs. The canonical URL tag is a good way to mitigate this issue.
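As a sketch (the URLs here are hypothetical), each tracked variant of an article would carry a canonical tag pointing back to the clean, permanent URL:

```
<!-- Served on example.com/news/story?xid=rss (and any other tracked variant) -->
<link rel="canonical" href="http://example.com/news/story" />
```

With this in place, the engines can consolidate link equity from all the parameterized variants onto the one canonical URL.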
- Publishing the same content in multiple sections. An article can be linked from as many sections and locations on the site as desired, but it should only exist on one unique, permanent URL.
- Repurposing content in new packages. Media sites often pull existing content into new features/packages, typically to create attractive options for advertisers. For example, a selection of movie reviews (that also exist in the film section) will be duplicated on a different template in a “What’s Hot This Summer” feature. Since the pages are not exactly the same, the canonical URL tag is not the ideal solution, and publishers typically resist consolidating through permanent 301 redirects because they want the content to remain in its original location. From an SEO perspective, the best approach is to avoid this practice altogether.
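Where consolidation is acceptable after a package has run its course, a permanent redirect collapses the duplicate back onto the original URL. A minimal sketch in Apache .htaccess syntax, using hypothetical paths:

```
# Hypothetical paths: send the repackaged review (and its links)
# back to the original film-section URL with a permanent 301
Redirect 301 /whats-hot-this-summer/some-movie-review /film/reviews/some-movie-review
```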
- Syndication. Syndicating content is a common practice and an important revenue stream for publishers. But when the search engines encounter the same article on multiple sites it is likely that one version will be given prominence, and it may not always be the original. My post on syndication best practices covers ways to reduce the risk of being outranked for your own content.
- CMS issues. Although content management systems have become more SEO-friendly over the years, most still cause a number of SEO problems, including duplicate content. The most common culprit is printer-friendly pages. Photo galleries are another: the first slide may appear on a different URL when you return to it via the “previous” button. Conduct a comprehensive SEO site audit to identify CMS and site architecture issues (as well as editorial and marketing issues).
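For CMS-generated duplicates like printer-friendly templates, one common fix (a sketch, not the only option) is to keep the duplicate template out of the index entirely with a robots meta tag, while still letting its links pass value:

```
<!-- On the printer-friendly template: exclude it from the index,
     but allow crawlers to follow its links -->
<meta name="robots" content="noindex, follow" />
```

A canonical tag pointing at the main article URL is an alternative when the two pages carry the same content.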
A few other notes on dealing with duplicate content:
- This week Google specifically recommended against using robots.txt to block duplicates, which is a change from previous recommendations.
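In other words, a robots.txt pattern like the hypothetical one below is now discouraged for handling duplicates, because blocked URLs cannot be crawled, so the engines never see the duplication and cannot consolidate the link equity:

```
# Discouraged for duplicate handling: blocking tracked URLs outright
# (wildcard patterns like these are a Google extension to robots.txt)
User-agent: *
Disallow: /*?xid=
Disallow: /*?cid=
```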
- In the duplicate content session at SMX East, Google also announced that the canonical URL tag will work across domains by the end of the year (currently it only works with URLs on the same domain). Just as interesting, Yahoo and Bing admitted that they are still not supporting the current version of the tag but hope to by the end of the year.
- In September Google added a function to Webmaster Tools called Parameter Handling that allows sites to specify certain URL parameters that can be ignored during crawling. There is a good writeup on the duplicate content implications of this on Search Engine Land: Google Lets You Tell Them Which URL Parameters To Ignore