Invention Grant
- Patent Title: Method and device for deduplicating web page
- Application No.: US14581464
- Application Date: 2014-12-23
- Publication No.: US10346257B2
- Publication Date: 2019-07-09
- Inventor: Nan Jiang, Hui Zhang, Jia Wan
- Applicant: Huawei Technologies Co., Ltd.
- Applicant Address: CN Shenzhen
- Assignee: Huawei Technologies Co., Ltd.
- Current Assignee: Huawei Technologies Co., Ltd.
- Current Assignee Address: CN Shenzhen
- Priority: CN201210223009 20120630
- Main IPC: G06F11/14
- IPC: G06F11/14; G06F16/958

Abstract:
A method and a device are described for de-duplicating a web page. The method includes: extracting at least one core sentence from a target web page; mapping each core sentence to a unique numeric value to form a first numeric value set; determining the intersection set of the first numeric value set with each second numeric value set, counting the numeric values in each intersection set, and determining the maximum of those counts; and, when the ratio of that maximum to the total number of numeric values in the first numeric value set is greater than a set threshold, processing the target web page as a duplicate web page. In embodiments of the present invention, web page de-duplication processing can achieve improved accuracy, an enhanced anti-noise capability, and a reduced calculation scale.
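The steps in the abstract can be sketched in Python as follows. This is an illustrative sketch only, not the patented implementation: the abstract does not define how core sentences are selected or how they are mapped to numbers, so the length-based filter, the SHA-256-derived value, the default threshold of 0.8, and all function names here are assumptions.

```python
import hashlib
import re


def extract_core_sentences(text, min_chars=20):
    # Assumption: a "core sentence" is any sentence of at least min_chars
    # characters; shorter fragments (menus, ads) are treated as noise.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s.strip() for s in sentences if len(s.strip()) >= min_chars]


def sentence_to_value(sentence):
    # Assumption: 64 bits of a SHA-256 digest stand in for the
    # "unique numeric value"; collisions are possible but very unlikely.
    normalized = " ".join(sentence.lower().split())
    digest = hashlib.sha256(normalized.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def value_set(text):
    # Numeric value set of a page: one value per core sentence.
    return {sentence_to_value(s) for s in extract_core_sentences(text)}


def is_duplicate(target_text, second_value_sets, threshold=0.8):
    # First numeric value set, built from the target page.
    first = value_set(target_text)
    if not first or not second_value_sets:
        return False
    # Largest intersection size over all previously stored value sets.
    max_overlap = max(len(first & second) for second in second_value_sets)
    # Duplicate when the ratio exceeds the set threshold.
    return max_overlap / len(first) > threshold


stored_page = (
    "The quick brown fox jumps over the lazy dog. "
    "A second sentence that is long enough here. Short."
)
noisy_copy = (
    "Click here! The quick brown fox jumps over the lazy dog. "
    "A second sentence that is long enough here."
)
stored_sets = [value_set(stored_page)]
copy_is_duplicate = is_duplicate(noisy_copy, stored_sets)
```

Because only sentence-level values are compared, short noise fragments added to a copied page do not change the ratio in this sketch, which is one plausible reading of the anti-noise claim.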
Public/Granted literature
- US20150142760A1 METHOD AND DEVICE FOR DEDUPLICATING WEB PAGE Public/Granted day: 2015-05-21