Method and system for detecting duplicate documents in web crawls

Number of patents in Portfolio can not be more than 2000

United States of America Patent

PATENT NO 6547829
SERIAL NO

09343511

Stats

ATTORNEY / AGENT: (SPONSORED)

Importance

Loading Importance Indicators... loading....

Abstract

See full text

A Web crawler application takes advantage of a document store's ability to provide a content identifier (CID) having a value that is a unique function of the physical storage location of a data object or document, such as a Web page. In operation, the crawler first tries to fetch the CID for a document. If the CID attribute is not supported by the document store, the crawler fetches the document, filters it to obtain a hash function, and commits the document to an index if the hash function is not present in a history table. If the CID is available from the document store, the CID is fetched from the document store. The crawler then determines whether the CID is present in the history table, which indicates whether an identical copy of the document in question has already been indexed under a different URL. If the CID is present, indicating that the document has already been indexed, the new URL is placed in the history file but the document itself is not retrieved from the document store, nor is it filtered again to obtain a CID. If the CID is not present in the history table, the full document is retrieved and indexed. The CID data structure is an extension of a known globally unique ID (GUID). Whereas the GUID is a 16-byte number, the CID comprises a 16-byte GUID plus an additional 6-byte number.

Loading the Abstract Image... loading....

First Claim

See full text

Family

Loading Family data... loading....

Patent Owner(s)

Patent OwnerAddress
MICROSOFT TECHNOLOGY LICENSING LLCONE MICROSOFT WAY REDMOND WA 98052

International Classification(s)

  • [Classification Symbol]
  • [Patents Count]

Inventor(s)

Inventor Name Address # of filed Patents Total Citations
Meyerzon, Dmitriy Bellevue, WA 113 3766
Norin, Scott Newcastle, WA 9 1618
Shoroff, Srikanth Issaquah, WA 30 1497
Terek, F Soner Bellevue, WA 16 899

Cited Art Landscape

Load Citation

Patent Citation Ranking

Forward Cite Landscape

Load Citation