发明授权
US08768926B2 Techniques for categorizing web pages 有权
技术分类网页

Techniques for categorizing web pages
摘要:
Web pages are efficiently categorized in a data processor without analyzing the content of the web pages. According to at least one embodiment, data is maintained that represents sample URLs grouped into a plurality of clusters. The sample URLs of a cluster are used to produce a URL regular expression pattern (“URL-regex”) that differentiates the sample URLs of the cluster from the sample URLs of other clusters and that covers at least a specified percentage of the sample URLs in the cluster. The process of producing a URL-regex is repeated for each of the clusters producing a URL-regex for each cluster. Web pages are then categorized into one of the clusters by determining which of the URL-regex patterns produced for the clusters match URLs that refer to the web pages. Thus, a web page may be categorized based on a URL that refers to the web page without having to obtain and analyze the content of the web page.
公开/授权文献
信息查询
0/0