Granted invention patent
US09058378B2 System and method for identification of near duplicate user-generated content (Status: in force)

System and method for identification of near duplicate user-generated content
Abstract:
A computer-implemented system and method for identifying near duplicate content are described. An example embodiment includes a data receiver to receive a first instance of user-generated content and a tokenizer to tokenize the first instance into a set of words, create a set of portions from the tokenized first instance, and assign a weight to each portion of the set. The example embodiment also includes a magnitude calculator to calculate a magnitude for the first instance based on the weight of each portion, and a resemblance score calculator to search a data store for a second instance that has at least one portion in common with the first instance and to calculate a resemblance score between the two instances.
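The pipeline in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how portions are formed, how weights are assigned, or which formula produces the resemblance score, so everything specific here is an assumption made for illustration: portions are taken to be overlapping three-word sequences, each weight is the portion's occurrence count, the magnitude is the Euclidean norm of the weights, the data store is an in-memory inverted index, and the resemblance score is cosine similarity. The names `tokenize`, `make_portions`, `magnitude`, and `NearDuplicateStore` are hypothetical and do not come from the patent.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def make_portions(words, size=3):
    """Create weighted portions from a word list.

    Assumption: a portion is an overlapping run of `size` words and its
    weight is the number of times it occurs in the instance.
    """
    if len(words) < size:
        grams = [tuple(words)] if words else []
    else:
        grams = [tuple(words[i:i + size]) for i in range(len(words) - size + 1)]
    return Counter(grams)


def magnitude(portions):
    """Assumption: magnitude is the Euclidean norm of the portion weights."""
    return math.sqrt(sum(w * w for w in portions.values()))


class NearDuplicateStore:
    """Hypothetical in-memory data store backed by an inverted index."""

    def __init__(self, size=3):
        self.size = size
        self.docs = {}                  # instance id -> (portions, magnitude)
        self.index = defaultdict(set)   # portion -> ids of instances containing it

    def add(self, doc_id, text):
        """Tokenize, weight, and index one instance of content."""
        portions = make_portions(tokenize(text), self.size)
        self.docs[doc_id] = (portions, magnitude(portions))
        for portion in portions:
            self.index[portion].add(doc_id)

    def resemblance_scores(self, text):
        """Score stored instances that share at least one portion with `text`.

        Assumption: the score is cosine similarity, i.e. the dot product of
        shared portion weights divided by the product of the two magnitudes.
        """
        portions = make_portions(tokenize(text), self.size)
        mag = magnitude(portions)
        if mag == 0:
            return {}
        dots = defaultdict(float)
        for portion, weight in portions.items():
            for doc_id in self.index.get(portion, ()):
                dots[doc_id] += weight * self.docs[doc_id][0][portion]
        return {d: dot / (mag * self.docs[d][1]) for d, dot in dots.items()}


if __name__ == "__main__":
    store = NearDuplicateStore()
    store.add("review-1", "the quick brown fox jumps over the lazy dog")
    print(store.resemblance_scores("the quick brown fox jumps over a lazy cat"))
```

Looking up candidates through the inverted index means only stored instances that share a portion with the new one are ever scored, which matches the abstract's "at least one portion in common" condition. A caller would then compare each score against a threshold of its own choosing to decide whether two instances count as near duplicates.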