Patent application
US20090063538A1 METHOD FOR NORMALIZING DYNAMIC URLS OF WEB PAGES THROUGH HIERARCHICAL ORGANIZATION OF URLS FROM A WEB SITE
Under examination - published
- Title: METHOD FOR NORMALIZING DYNAMIC URLS OF WEB PAGES THROUGH HIERARCHICAL ORGANIZATION OF URLS FROM A WEB SITE
- Application number: US11847989
- Filing date: 2007-08-30
- Publication number: US20090063538A1
- Publication date: 2009-03-05
- Inventors: Krishna Prasad CHITRAPURA, Anandsudhakar Kesari, Alok Kirpal, Mahesh Tiyyagura
- Applicants: Krishna Prasad CHITRAPURA, Anandsudhakar Kesari, Alok Kirpal, Mahesh Tiyyagura
- Main classification: G06F17/30
- IPC classification: G06F17/30
Abstract:
Techniques are described for normalizing dynamic URLs using a hierarchical organization of a web site. Given web pages associated with a web site, an information extraction method is used to generate data structures that represent the content or structure of each of the web pages. These data structures are appended to the corresponding dynamic URLs. The modified URLs with the data structures are tokenized with the resulting tokens clustered to create a hierarchical organization. Nodes of the hierarchical organization may be merged based upon occurrence or patterns of content and structure. The merged hierarchical organization may then be pruned to remove irrelevant information and to reduce the memory footprint of the hierarchical organization. When a new dynamic URL is received, the new dynamic URL is matched to the hierarchical organization. Important parameters are taken into account and irrelevant information may be removed. Based upon the matching to the hierarchical organization, a normalized URL is returned.
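The core idea of the abstract, learning which parts of a dynamic URL do not affect page content and dropping them from new URLs, can be illustrated with a heavily simplified Python sketch. It is not the claimed method: the hierarchy is reduced to two levels (host and path, then parameter name), the clustering, merging and pruning steps are replaced by a plain grouping test, and the "data structure" extracted from each page is assumed to be a precomputed content signature string. The function names, the sample URLs on example.com and the signatures "A" and "B" are all hypothetical.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def tokenize(url):
    """Return ((host, path), sorted list of (name, value) query pairs)."""
    parts = urlsplit(url)
    params = sorted(parse_qsl(parts.query, keep_blank_values=True))
    return (parts.netloc.lower(), parts.path), params


def learn_irrelevant_params(samples):
    """Find query parameters whose value never changed the page content.

    samples: iterable of (url, content_signature) pairs, where the signature
    stands in for the content/structure data extracted from each page.
    Returns {(host, path): set of parameter names considered irrelevant}.
    """
    # groups[(key, name)][other params] -> {value: set of signatures seen}
    groups = defaultdict(lambda: defaultdict(lambda: defaultdict(set)))
    for url, signature in samples:
        key, params = tokenize(url)
        for i, (name, value) in enumerate(params):
            rest = tuple(params[:i] + params[i + 1:])
            groups[(key, name)][rest][value].add(signature)

    irrelevant = defaultdict(set)
    for (key, name), by_rest in groups.items():
        varied = False           # did we ever see two values, all else equal?
        changes_content = False  # did any such variation change the page?
        for by_value in by_rest.values():
            if len(by_value) > 1:
                varied = True
                if len(set().union(*by_value.values())) > 1:
                    changes_content = True
        if varied and not changes_content:
            irrelevant[key].add(name)
    return dict(irrelevant)


def normalize(url, irrelevant):
    """Drop learned-irrelevant parameters and sort the remaining ones."""
    parts = urlsplit(url)
    key, params = tokenize(url)
    drop = irrelevant.get(key, set())
    kept = [(name, value) for name, value in params if name not in drop]
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, urlencode(kept), "")
    )


# Hypothetical training data: "sid" is a session id, "id" selects the page.
SAMPLES = [
    ("http://example.com/item?id=1&sid=aaa", "A"),
    ("http://example.com/item?id=1&sid=bbb", "A"),
    ("http://example.com/item?id=2&sid=ccc", "B"),
    ("http://example.com/item?id=2&sid=ddd", "B"),
]
IRRELEVANT = learn_irrelevant_params(SAMPLES)
print(normalize("http://example.com/item?sid=zzz&id=2", IRRELEVANT))
```

In this sketch a parameter is dropped only when it was observed with at least two values while all other parameters stayed fixed and the signature never changed. Parameters that were never varied in isolation, such as `id` above, are conservatively kept.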