- Patent title: Clustering repetitive structure of asynchronous web application content
- Application number: US14499348
- Filing date: 2014-09-29
- Publication number: US09734147B2
- Publication date: 2017-08-15
- Inventors: Mohammadreza Barouni Ebrahimi, Obidul Islam, Iosif V. Onut
- Applicant: International Business Machines Corporation
- Applicant address: US NY Armonk
- Patentee: International Business Machines Corporation
- Current patentee: International Business Machines Corporation
- Current patentee address: US NY Armonk
- Agent: Daniel R. Simek
- Main classification: G06F17/30
- IPC classification: G06F17/30
Abstract:
A processor determines whether a DOM includes a repetitive pattern of a combination formed by a tag of a leaf node and a tag of a parent node of the leaf node. Upon determining that the DOM includes the repetitive pattern, the processor identifies a first inner cluster by collapsing multiple instances of the repetitive pattern into a single instance. The processor generates an LSH signature for the single instance of the repetitive pattern. The processor determines an outer cluster, based on grouping one or more inner clusters, as part of a section rooted at a source node of the DOM, in which the source node is a parent node of the one or more inner clusters. Upon determining that a pair of outer clusters are near repetitive, the processor limits web content exploration to one of the pair of outer clusters.
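
The steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the `Node` class, the function names, the use of MinHash as the LSH scheme, the 16-hash signature length, and the 0.8 similarity threshold are all assumptions chosen for illustration, and the abstract does not specify any of them.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical, simplified DOM node: a tag name plus child nodes."""
    tag: str
    children: list = field(default_factory=list)


def collapse(seq):
    """Collapse a sequence made of k >= 2 repeats of a unit into that unit."""
    n = len(seq)
    for p in range(1, n // 2 + 1):
        if n % p == 0 and seq == seq[:p] * (n // p):
            return seq[:p]
    return seq  # no repetition found; keep the sequence unchanged


def inner_cluster(node):
    """Collapsed parent/leaf tag combinations for the leaf children of node."""
    combos = [f"{node.tag}/{c.tag}" for c in node.children if not c.children]
    return collapse(combos)


def outer_tokens(source):
    """Group the inner clusters rooted at the children of a source node."""
    tokens = set()
    for child in source.children:
        tokens.update(inner_cluster(child))
    return tokens


def minhash(tokens, num_hashes=16):
    """MinHash signature (one assumed LSH scheme) of a non-empty token set."""
    if not tokens:
        raise ValueError("tokens must be non-empty")
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{t}".encode(), digest_size=8).digest(),
                "big",
            )
            for t in tokens
        ))
    return tuple(sig)


def similarity(sig_a, sig_b):
    """Fraction of matching signature positions (estimates Jaccard similarity)."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def sections_to_explore(sections, threshold=0.8):
    """Keep one representative from each group of near-repetitive sections."""
    kept = []
    for section in sections:
        sig = minhash(outer_tokens(section))
        if all(similarity(sig, other) < threshold for _, other in kept):
            kept.append((section, sig))
    return [section for section, _ in kept]


# Two sections that differ only in how many <li> items they hold, plus one
# structurally different section.
a = Node("div", [Node("ul", [Node("li") for _ in range(3)])])
b = Node("div", [Node("ul", [Node("li") for _ in range(5)])])
c = Node("div", [Node("ol", [Node("li"), Node("li")])])
explored = sections_to_explore([a, b, c])
```

In this sketch, sections `a` and `b` collapse to the same single `ul/li` instance and therefore receive identical signatures, so only one of them is kept for exploration, while `c` is kept because its `ol/li` combination differs.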