How And Where Search Engines See Duplicate Content

by : Danny Wirken


Search engines have become the gateway to information on the Internet. They are so important that websites need to rank well in search engine results pages (SERPs) in order to get noticed. With numerous websites vying for the coveted top 30 positions in the SERPs, more and more website owners are turning to search engine optimization (SEO) techniques to improve their rankings. People who practice SEO know that certain factors can affect a site's ranking positively and, of course, negatively. Of the negative factors, one of the best known is duplicate content.

Search engines are biased against duplicate content. As a matter of fact, some sites do not get listed in SERPs at all because of this factor. This happens when crawlers decline to index sites that they have previously determined to be duplicates of other sites. The crawlers skip the duplicate site to be more efficient and save time. Crawlers also do this for another reason: to avoid listing duplicate pages in SERPs and thus pointing users to different sites that contain exactly the same information. Search engines do not want that to happen because it would be irritating for users, who expect to see different sites behind the different links they click. For similar sites, search engines also usually list just one of the sites and relegate the others under a link that says "See related pages." For sites that do manage to be listed in the SERPs, the page rank is still usually affected, which in turn hurts the site's standing.

Where Search Engines See Duplicate Content

So where do crawlers see this duplicate content? And what kinds of content would they interpret as duplicate? According to an article by William Slawski on Duplicate Content Issues and Search Engines, search engines see duplicate content on the following kinds of web pages:

1. Product descriptions from manufacturers, publishers, and producers reproduced by a number of different distributors in large ecommerce sites.

2. Alternative print pages - This happens when user-friendly website owners offer copies of the same document in different formats to allow for varied printing options. Although helpful to users, these copies might actually be indexed by crawlers as duplicate pages.

3. Pages that reproduce syndicated RSS feeds through a server side script.

4. Canonicalization issues, where a search engine may see the same page as different pages with different URLs.

5. Pages that serve session IDs to search engines, so that they try to crawl and index the same page under different URLs.

6. Pages that serve multiple data variables through URLs, so that they crawl and index the same page under different URLs.

7. Pages that share too many common elements, or whose elements are very similar from one page to another, including titles, meta descriptions, headings, navigation, and text that is shared globally. - This is common for company websites that insist on putting their logo, description, and so on on every page of the site.

8. Copyright infringement - Plagiarism is, of course, a good reason for a page not to be indexed. The problem is that crawlers cannot distinguish the original from the duplicate and might mistakenly filter out the original instead.

9. Use of the same or very similar pages on different subdomains or different country top level domains (TLDs).

10. Article syndication - Some writers allow their articles to be published on other websites as long as they are given credit for their work. The problem arises when the crawler sees the original article as the duplicate and opts to index the copy instead, or at least gives it a higher ranking.

11. Mirrored sites - Mirror sites are used to handle the traffic of a very popular site. They have a good chance of being ignored by web crawlers and so may not be indexed.
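Several of the items above (canonicalization problems, session IDs, and extra data variables in URLs) boil down to the same underlying issue: many distinct URLs resolving to one page. The sketch below shows, in broad strokes, how a crawler might normalize URLs before comparing them. It is a minimal illustration, not any engine's actual method, and the session-parameter names it strips (`sid`, `sessionid`, etc.) are illustrative assumptions:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Query parameters commonly used for session tracking --
# an illustrative list, not an exhaustive one.
SESSION_PARAMS = {"sessionid", "sid", "phpsessid", "jsessionid"}

def canonicalize(url):
    """Reduce URL variants that point at the same page to one canonical form."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    netloc = netloc.lower()
    # Treat "www.example.com" and "example.com" as the same host.
    if netloc.startswith("www."):
        netloc = netloc[4:]
    # Drop session-tracking parameters and sort the rest, so that
    # parameter order alone cannot create a "new" URL.
    params = [(k, v) for k, v in parse_qsl(query)
              if k.lower() not in SESSION_PARAMS]
    params.sort()
    # "/index.html" and "/" usually serve the same document.
    if path in ("", "/index.html"):
        path = "/"
    return urlunsplit((scheme.lower(), netloc, path, urlencode(params), ""))

a = canonicalize("http://www.example.com/index.html?sid=abc123")
b = canonicalize("http://example.com/?")
print(a == b)  # both collapse to http://example.com/ -> True
```

A real crawler would apply many more rules (trailing slashes, percent-encoding, default ports), but even this much collapses a large family of duplicate-looking URLs into one entry in the index.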

How Search Engines See Duplicate Content

Different search engines employ many methods to determine which pages have duplicate content. The methods vary in many ways, from their concepts to their algorithms and, of course, their effectiveness. Search engines are, however, all finding new ways to improve their methods for detecting duplicate content, as seen in the patents filed by search engine companies like AltaVista, Microsoft Corporation, and Google, and by other organizations such as Digital Equipment Corporation and even the Regents of the University of California.

The different patents include methods for detecting query-specific duplicate documents; detecting duplicate and near-duplicate files; clustering closely resembling data objects; identifying near-duplicate pages in a hyperlinked database; indexing duplicate database records using a full-record fingerprint; indexing duplicate records of information of a database; utilizing information redundancy to improve text searches; detecting and summarizing document similarity within large document sets; and finding mirrored hosts by analyzing URLs.

Each method is unique and interesting in its approach. The methods vary greatly, from generating fingerprints for records to using query-relevant information to limit the portions of the documents to be compared. Discussing each method would shed light on how different search engines approach the problem. The methods are all innovative, and if some of them were used in concert, they would likely improve a search engine's ability to detect duplicate documents. However, since the patent holders are competing companies, collaboration between them is unlikely.
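While the patented algorithms differ, many near-duplicate detectors rest on the same basic idea: reduce each document to a set of overlapping word n-grams ("shingles") and measure how much the sets overlap. The following is a toy sketch of that general idea, not a reconstruction of any particular patented method:

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping n-grams) in a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def resemblance(a, b, k=3):
    """Jaccard similarity of the two documents' shingle sets: 0.0 to 1.0."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

original = "search engines are biased against duplicate content on the web"
copy = "search engines are biased against duplicate content on the web"
rewrite = "crawlers tend to skip pages they have already seen elsewhere"

print(resemblance(original, copy))     # identical text -> 1.0
print(resemblance(original, rewrite))  # no shared shingles -> 0.0
```

Production systems hash the shingles into compact fingerprints (as in the full-record fingerprint and document-similarity patents mentioned above) so that billions of pages can be compared without storing the raw text, but the resemblance score being approximated is essentially this one.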


As search engines further refine their methods for detecting duplicate content, it will become harder for plagiarists to get away with what they do. However, web pages containing duplicate content for a good reason could suffer as well. Furthermore, since none of the published patents tackles the issue of differentiating original content from duplicates, refinements in search engines' methods might mean further trouble for the owners of original content. Because of this, search engines ought to find ways and invent new methods to distinguish original content from duplicates, and to recognize legitimate duplicate content.