Granted invention patent
- Patent title: Techniques for categorizing web pages
- Patent title (Chinese): 技术分类网页
- Application number: US12652624
- Filing date: 2010-01-05
- Publication number: US08768926B2
- Publication date: 2014-07-01
- Inventors: Ashwin Tengli, Rajeev Rastogi, Jeyashankher Ramamirtham, Srinivasan H Sengamedu, Sandeepkumar Bhuramal Satpal
- Applicants: Ashwin Tengli, Rajeev Rastogi, Jeyashankher Ramamirtham, Srinivasan H Sengamedu, Sandeepkumar Bhuramal Satpal
- Applicant address: US CA Sunnyvale
- Assignee: Yahoo! Inc.
- Current assignee: Yahoo! Inc.
- Current assignee address: US CA Sunnyvale
- Agent: Hickman Palermo Truong Becker Bingham Wong LLP
- Main classification: G06F7/00
- IPC classification: G06F7/00; G06F17/30
Abstract:
Web pages are efficiently categorized in a data processor without analyzing the content of the web pages. According to at least one embodiment, data is maintained that represents sample URLs grouped into a plurality of clusters. The sample URLs of a cluster are used to produce a URL regular expression pattern (“URL-regex”) that differentiates the sample URLs of the cluster from the sample URLs of other clusters and that covers at least a specified percentage of the sample URLs in the cluster. This process is repeated for each of the clusters, producing one URL-regex per cluster. Each web page is then categorized into one of the clusters by determining which of the URL-regex patterns produced for the clusters matches a URL that refers to the web page. Thus, a web page may be categorized based on a URL that refers to it, without having to obtain and analyze the content of the web page.
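The flow described in the abstract can be illustrated with a minimal Python sketch. It is not the patented method: the abstract does not say how a URL-regex is derived, so the sketch assumes a simple strategy of turning each sample URL into a candidate pattern by replacing digit runs with `\d+`. The names `generalize`, `learn_url_regexes`, `categorize`, the `min_coverage` parameter, and the `shop.example.com` URLs are all hypothetical.

```python
import re


def generalize(url):
    """Build a candidate pattern from one sample URL (assumed strategy):
    escape the literal text and replace every run of digits with \\d+."""
    parts = re.split(r"(\d+)", url)
    return "".join(r"\d+" if p.isdigit() else re.escape(p) for p in parts)


def learn_url_regexes(clusters, min_coverage=0.8):
    """For each cluster, pick a pattern that matches no sample URL of any
    other cluster and covers at least min_coverage of its own samples."""
    regexes = {}
    for name, samples in clusters.items():
        others = [u for other, urls in clusters.items()
                  if other != name for u in urls]
        best, best_cov = None, 0.0
        # Sorted so the choice is deterministic when coverage ties.
        for pattern in sorted({generalize(u) for u in samples}):
            rx = re.compile(pattern)
            if any(rx.fullmatch(u) for u in others):
                continue  # does not differentiate this cluster
            cov = sum(1 for u in samples if rx.fullmatch(u)) / len(samples)
            if cov >= min_coverage and cov > best_cov:
                best, best_cov = rx, cov
        if best is not None:
            regexes[name] = best
    return regexes


def categorize(url, regexes):
    """Return the first cluster whose URL-regex matches the URL, or None.
    The page content is never fetched."""
    for name, rx in regexes.items():
        if rx.fullmatch(url):
            return name
    return None


# Hypothetical sample data for illustration only.
clusters = {
    "product": [
        "http://shop.example.com/item/123",
        "http://shop.example.com/item/4567",
        "http://shop.example.com/item/89",
    ],
    "review": [
        "http://shop.example.com/review/12",
        "http://shop.example.com/review/345",
    ],
}
regexes = learn_url_regexes(clusters)
print(categorize("http://shop.example.com/item/999", regexes))
print(categorize("http://shop.example.com/about", regexes))
```

A cluster for which no candidate meets both conditions simply gets no pattern in this sketch, and a URL that matches no pattern is left uncategorized.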
Publication/grant documents
- US20110167063A1 TECHNIQUES FOR CATEGORIZING WEB PAGES Publication/grant date: 2011-07-07