Granted invention patent
- Patent title: Software and method for recognizing similarity of documents written in different languages based on a quantitative measure of similarity
- Patent title (Chinese): 基于相似度的量化方法识别用不同语言编写的文档的相似性的软件和方法
- Application number: US09588250
- Filing date: 2000-06-06
- Publication number: US06519557B1
- Publication date: 2003-02-11
- Inventors: Michael L. Emens, Reiner Kraft, Peter Chi-Shing Yim
- Applicants: Michael L. Emens, Reiner Kraft, Peter Chi-Shing Yim
- Main classification: G06F17/20
- IPC classification: G06F17/20
Abstract:
A system for identifying different-language versions of the same structured-format document (e.g., an HTML web page) detects the language of each of two candidate documents and, if necessary, translates one or both into a preferred language. It then parses the two documents and builds a hierarchical data structure for each. The data structures are used to compare the hierarchical structure of the two documents and to access text portions in congruent positions in the two documents. A fuzzy measure of similarity is then obtained for a set of text portions occupying congruent positions, which induces a measure of the similarity of the two documents; this measure is compared to a fuzzy threshold.
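The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes language detection and translation have already been done, so both inputs are in the same language. The names `Node`, `TreeBuilder`, `compare`, `document_similarity`, and the `THRESHOLD` value are hypothetical. `difflib.SequenceMatcher` stands in for the fuzzy text-similarity measure, and taking the minimum of the structural and textual scores (a common fuzzy AND) is likewise an assumption.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # tags with no closing tag
THRESHOLD = 0.8  # hypothetical cut-off, not a value from the patent


class Node:
    """One element of the hierarchical data structure."""
    def __init__(self, tag):
        self.tag = tag
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Parses HTML into a tree of Node objects."""
    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            node = self.stack[-1]
            node.text = (node.text + " " + text).strip()


def build_tree(html):
    parser = TreeBuilder()
    parser.feed(html)
    parser.close()
    return parser.root


def compare(a, b):
    """Walk both trees in lockstep; return (matched, total, text_pairs)."""
    matched, total, pairs = 0, 0, []
    stack = [(a, b)]
    while stack:
        x, y = stack.pop()
        total += 1
        if x is None or y is None or x.tag != y.tag:
            continue  # structural mismatch at this position
        matched += 1
        if x.text or y.text:
            pairs.append((x.text, y.text))  # text in congruent positions
        for i in range(max(len(x.children), len(y.children))):
            cx = x.children[i] if i < len(x.children) else None
            cy = y.children[i] if i < len(y.children) else None
            stack.append((cx, cy))
    return matched, total, pairs


def document_similarity(html_a, html_b):
    """Combine structural and textual similarity into one score in [0, 1]."""
    matched, total, pairs = compare(build_tree(html_a), build_tree(html_b))
    structure = matched / total
    if pairs:
        text = sum(SequenceMatcher(None, p, q).ratio() for p, q in pairs) / len(pairs)
    else:
        text = 1.0
    return min(structure, text)  # assumed fuzzy AND of the two scores


doc_a = "<html><body><h1>Hello</h1><p>World</p></body></html>"
doc_b = "<html><body><ul><li>x</li></ul></body></html>"
same_document = document_similarity(doc_a, doc_a) >= THRESHOLD
different_document = document_similarity(doc_a, doc_b) >= THRESHOLD
```

In this sketch two documents are treated as versions of the same page when the combined score reaches `THRESHOLD`. A real system would need a translation step before the text comparison and a more tolerant tree alignment than strict position-by-position matching.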