发明申请
US20090063538A1 METHOD FOR NORMALIZING DYNAMIC URLS OF WEB PAGES THROUGH HIERARCHICAL ORGANIZATION OF URLS FROM A WEB SITE 审中-公开
通过WEB站点URL的分层组织来正常化网页动态URL的方法

METHOD FOR NORMALIZING DYNAMIC URLS OF WEB PAGES THROUGH HIERARCHICAL ORGANIZATION OF URLS FROM A WEB SITE
摘要:
Techniques are described for normalizing dynamic URLs using a hierarchical organization of a web site. Given web pages associated with a web site, an information extraction method is used to generate data structures that represent the content or structure of each of the web pages. These data structures are appended to the corresponding dynamic URLs. The modified URLs with the data structures are tokenized with the resulting tokens clustered to create a hierarchical organization. Nodes of the hierarchical organization may be merged based upon occurrence or patterns of content and structure. The merged hierarchical organization may then be pruned to remove irrelevant information and to reduce the memory footprint of the hierarchical organization. When a new dynamic URL is received, the new dynamic URL is matched to the hierarchical organization. Important parameters are taken into account and irrelevant information may be removed. Based upon the matching to the hierarchical organization, a normalized URL is returned.
信息查询
0/0