Systems and methods for inferring uniform resource locator (URL) normalization rules

Number of patents in Portfolio can not be more than 2000

United States of America Patent

PATENT NO 7680785
APP PUB NO 20060218143A1
SERIAL NO

11089988

Stats

ATTORNEY / AGENT: (SPONSORED)

Importance

Loading Importance Indicators... loading....

Abstract

See full text

Different URLs that actually reference the same web page or other web resource are detected and that information is used to only download one instance of a web page or web resource from a web site. All web pages or web resources downloaded from a web server are compared to identify which are substantially identical. Once identical web pages or web resources with different URLs are found, the different URLs are then analyzed to identify what portions of the URL are essential for identifying a particular web page or web resource, and what portions are irrelevant. Once this has been done for each set of substantially identical web pages or web resources (also referred to as an “equivalence class” herein), these per-equivalence-class rules are generalized to trans-equivalence-class rules. There are two rule-learning steps: step (1), where it is learned for each equivalence class what portions of the URLs in that class are relevant for selecting the page and what portions are not; and step (2), where the per-equivalence-class rules constructed during step (1) are generalized to rules that cover many equivalence classes. Once a rule is determined, it is applied to the class of web pages or web resources to identify errors. If there are no errors, the rule is activated and is then used by the web crawler for future crawling to avoid the download of duplicative web pages or web resources.

Loading the Abstract Image... loading....

First Claim

See full text

Family

Loading Family data... loading....

Patent Owner(s)

  • MICROSOFT TECHNOLOGY LICENSING, LLC

International Classification(s)

  • [Classification Symbol]
  • [Patents Count]

Inventor(s)

Inventor Name Address # of filed Patents Total Citations
Najork, Marc Alexander Palo Alto, US 26 1004

Cited Art Landscape

Load Citation

Patent Citation Ranking

Forward Cite Landscape

Load Citation