LEARNING THEMATIC SIMILARITY METRIC FROM ARTICLE TEXT UNITS

    Publication No.: US20200125673A1

    Publication Date: 2020-04-23

    Application No.: US16167552

    Filing Date: 2018-10-23

    Abstract: A method of estimating the thematic similarity of sentences, comprising: receiving a corpus of a plurality of documents describing a plurality of topics, where each document comprises a plurality of sentences arranged in a plurality of sections; constructing sentence triplets for at least some of the sentences, each sentence triplet comprising a respective sentence, a respective positive sentence selected randomly from the section comprising the respective sentence, and a respective negative sentence selected randomly from another section; training a first neural network with the sentence triplets to identify sentence-sentence vectors mapping each sentence with a shorter distance to its respective positive sentence than to its respective negative sentence; and outputting the first neural network for estimating thematic similarity between a pair of sentences by computing a distance between the sentence-sentence vectors produced for each sentence of the pair by the first neural network.
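
The triplet construction and the distance constraint described in the abstract can be illustrated with a minimal sketch. The sketch rests on several assumptions that the abstract does not state: a document is represented as a list of sections, each a list of sentence strings; the negative sentence is drawn from another section of the same document; the distance is Euclidean; and the training objective is a hinge-style triplet loss with a hypothetical `margin` parameter. The names `build_triplets`, `euclidean`, and `triplet_loss` are illustrative only, and the neural network that produces the sentence vectors is not shown.

```python
import math
import random


def build_triplets(document, rng):
    """Build (anchor, positive, negative) sentence triplets.

    `document` is assumed to be a list of sections, each a list of
    sentence strings. The positive is drawn at random from the anchor's
    own section; the negative is drawn at random from another section
    of the same document (an assumption of this sketch).
    """
    triplets = []
    for i, section in enumerate(document):
        # Non-empty sections other than the current one.
        others = [s for j, s in enumerate(document) if j != i and s]
        if len(section) < 2 or not others:
            continue  # no positive or no negative can be drawn
        for k, anchor in enumerate(section):
            # Exclude the anchor itself (by position) when picking the positive.
            positive = rng.choice(section[:k] + section[k + 1:])
            negative = rng.choice(rng.choice(others))
            triplets.append((anchor, positive, negative))
    return triplets


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor_vec, positive_vec, negative_vec, margin=1.0):
    """Hinge-style triplet loss (assumed objective, hypothetical margin).

    The loss is zero once the anchor is closer to the positive than to
    the negative by at least `margin`.
    """
    d_pos = euclidean(anchor_vec, positive_vec)
    d_neg = euclidean(anchor_vec, negative_vec)
    return max(0.0, d_pos - d_neg + margin)
```

In this sketch, a trained network would map each sentence of a triplet to a vector, and `triplet_loss` would be minimized over all triplets. At inference time, the thematic similarity of two sentences would be estimated by applying `euclidean` to their two vectors, with a smaller distance indicating greater similarity.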