-
Publication number: US07958136B1
Publication date: 2011-06-07
Application number: US12050626
Application date: 2008-03-18
Applicant: Taylor Curtis, Kenneth Heafield
Inventor: Taylor Curtis, Kenneth Heafield
IPC: G06F17/30
CPC classification number: G06F17/30616
Abstract: The present invention provides systems and methods for identifying similar documents. In an embodiment, the present invention identifies similar documents by (1) receiving document text for a current document that includes at least one word; (2) calculating a prominence score and a descriptiveness score for each word and each pair of consecutive words; (3) calculating a comparison metric for the current document; (4) finding at least one potential document, where document text for each potential document includes at least one of the words; and (5) analyzing each potential document to identify at least one similar document.
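The five steps in the abstract above can be sketched roughly as follows. This is a minimal illustration, not the patented method: the abstract does not define the prominence score, the descriptiveness score, or the comparison metric, so the formulas here (relative term frequency, a smoothed inverse document frequency, and cosine similarity) are assumptions, and the names `terms`, `term_scores`, `cosine`, and `similar_documents` are hypothetical.

```python
import math
from collections import Counter


def terms(text):
    """Return the words and the pairs of consecutive words in a text."""
    words = text.lower().split()
    pairs = [" ".join(p) for p in zip(words, words[1:])]
    return words + pairs


def term_scores(text, corpus_texts):
    """Weight each term by prominence * descriptiveness (both formulas are assumed)."""
    counts = Counter(terms(text))
    total = sum(counts.values())
    corpus_terms = [set(terms(t)) for t in corpus_texts]
    scores = {}
    for term, count in counts.items():
        # Assumption: prominence is the term's relative frequency in the document.
        prominence = count / total
        containing = sum(1 for s in corpus_terms if term in s)
        # Assumption: descriptiveness is a smoothed inverse document frequency.
        descriptiveness = math.log((1 + len(corpus_texts)) / (1 + containing)) + 1.0
        scores[term] = prominence * descriptiveness
    return scores


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similar_documents(current, corpus, threshold=0.2):
    """Return (name, score) pairs for corpus documents similar to `current`."""
    texts = list(corpus.values())
    current_vec = term_scores(current, texts)   # steps 2 and 3
    current_words = set(current.lower().split())
    results = []
    for name, text in corpus.items():
        # Step 4: a potential document must share at least one word.
        if not current_words & set(text.lower().split()):
            continue
        # Step 5: analyze the candidate against the current document.
        score = cosine(current_vec, term_scores(text, texts))
        if score >= threshold:
            results.append((name, score))
    return sorted(results, key=lambda r: r[1], reverse=True)
```

Calling `similar_documents("quick brown fox", corpus)` on a small dictionary of named texts returns the candidates that share at least one word with the query, ranked by the assumed metric. The `threshold` value is arbitrary and would need tuning on real data.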
-
Publication number: US08713034B1
Publication date: 2014-04-29
Application number: US13153319
Application date: 2011-06-03
Applicant: Taylor Curtis, Kenneth Heafield
Inventor: Taylor Curtis, Kenneth Heafield
IPC: G06F17/30
CPC classification number: G06F17/30616
Abstract: The present invention provides systems and methods for identifying similar documents. In an embodiment, the present invention identifies similar documents by (1) receiving document text for a current document that includes at least one word; (2) calculating a prominence score and a descriptiveness score for each word and each pair of consecutive words; (3) calculating a comparison metric for the current document; (4) finding at least one potential document, where document text for each potential document includes at least one of the words; and (5) analyzing each potential document to identify at least one similar document.
-
Publication number: US20090254884A1
Publication date: 2009-10-08
Application number: US12212534
Application date: 2008-09-17
Applicant: Girish Maskeri Rama, Kenneth Heafield, Santonu Sarkar
Inventor: Girish Maskeri Rama, Kenneth Heafield, Santonu Sarkar
IPC: G06F9/44
CPC classification number: G06F8/75
Abstract: Topics in source code can be identified using Latent Dirichlet Allocation (LDA) by receiving source code, identifying domain specific keywords from the source code, generating a keyword matrix, processing the keyword matrix and the source code using LDA, and outputting a list of topics. The list of topics is output as collections of domain specific keywords. Probabilities of domain specific keywords belonging to their respective topics can also be output. The keyword matrix comprises weighted sums of occurrences of domain specific keywords in the source code.
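The keyword matrix described in the abstract above, a weighted sum of keyword occurrences per source file, could be built along the following lines. This is only a sketch under stated assumptions: the abstract gives neither the weights nor the rule for deciding which keywords are domain specific, so `LOCATION_WEIGHTS`, `LANGUAGE_KEYWORDS`, `split_identifier`, and `keyword_matrix` are hypothetical, and the example treats Python source with `#` comments.

```python
import re
from collections import defaultdict

# Hypothetical weights per location type; the abstract does not give values.
LOCATION_WEIGHTS = {"identifier": 2.0, "comment": 1.0}
# Assumed stop list of language keywords that carry no domain meaning.
LANGUAGE_KEYWORDS = {"def", "return", "class", "if", "else", "for", "in", "import", "self"}


def split_identifier(name):
    """Split snake_case and camelCase names into lowercase words."""
    parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", name)
    return [p.lower() for p in parts]


def keyword_matrix(files):
    """Map each file name to {keyword: weighted occurrence sum}."""
    matrix = {}
    for name, source in files.items():
        row = defaultdict(float)
        for line in source.splitlines():
            # Naive split: a '#' inside a string literal is treated as a comment.
            code, _, comment = line.partition("#")
            for location, text in (("identifier", code), ("comment", comment)):
                for token in re.findall(r"[A-Za-z_]+", text):
                    for word in split_identifier(token):
                        if word not in LANGUAGE_KEYWORDS:
                            row[word] += LOCATION_WEIGHTS[location]
        matrix[name] = dict(row)
    return matrix
```

The resulting rows could then be converted to a dense file-by-keyword array and passed to an off-the-shelf LDA implementation, for example scikit-learn's `LatentDirichletAllocation`, to obtain topics as collections of keywords with per-topic probabilities. That step is omitted here to keep the sketch dependency-free.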
-
Publication number: US08209665B2
Publication date: 2012-06-26
Application number: US12212534
Application date: 2008-09-17
Applicant: Girish Maskeri Rama, Kenneth Heafield, Santonu Sarkar
Inventor: Girish Maskeri Rama, Kenneth Heafield, Santonu Sarkar
IPC: G06F9/44
CPC classification number: G06F8/75
Abstract: Topics in source code can be identified using Latent Dirichlet Allocation (LDA) by receiving source code, identifying domain specific keywords from the source code, generating a keyword matrix, processing the keyword matrix and the source code using LDA, and outputting a list of topics. The list of topics is output as collections of domain specific keywords. Probabilities of domain specific keywords belonging to their respective topics can also be output. The keyword matrix comprises weighted sums of occurrences of domain specific keywords in the source code.