KEYWORD EXTRACTION METHOD, APPARATUS AND MEDIUM

    Publication No.: EP3835993A3

    Publication date: 2021-08-04

    Application No.: EP20166998.3

    Application date: 2020-03-31

    Abstract: The present invention discloses a keyword extraction method, a keyword extraction apparatus and a medium, belonging to the field of data processing. The method comprises operations of: receiving an original document (S10); extracting candidate words from the original document, the extracted candidate words forming a first word set (S11); acquiring a first association degree between each first word in the first word set and the original document (S12), and determining a second word set according to the first association degree, the second word set being a subset of the first word set (S13); for each second word in the second word set, querying, in a word association topology, at least one node word satisfying a condition of association with the second word, the at least one node word forming a third word set, the word association topology indicating an association relation among multiple node words in a predetermined field (S14); and determining a union set of the second word set and the third word set (S15), acquiring a second association degree between each candidate keyword in the union set and the original document (S16), and selecting, according to the second association degree, at least one candidate keyword from the union set to form a keyword set of the original document (S17). In accordance with the present invention, the calculation complexity can be reduced and the calculation speed can be improved; the problem of preferentially selecting high-frequency words in the existing methods is solved; and the expression of keywords is effectively enriched.
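
    Illustrative sketch (not part of the patent text): the steps S11 to S17 above can be read as a two-pass ranking with a graph lookup in between. The Python below is a minimal sketch under stated assumptions. The names `extract_keywords`, `score`, `topology`, `top_n`, `min_edge` and `top_k` are hypothetical; the abstract does not say how the association degrees are computed, so the scorer is passed in by the caller, and the "condition of association" is assumed here to be a minimum edge weight.

```python
from typing import Callable, Dict, List, Set

# Hypothetical types: the abstract does not define these structures.
Scorer = Callable[[str, str], float]      # (word, document) -> association degree
Topology = Dict[str, Dict[str, float]]    # node word -> {neighbour word: edge weight}


def extract_keywords(
    document: str,
    candidates: List[str],
    score: Scorer,
    topology: Topology,
    top_n: int = 2,
    min_edge: float = 0.5,
    top_k: int = 3,
) -> List[str]:
    # S11: the candidate words form the first word set.
    first_set: Set[str] = set(candidates)

    # S12-S13: rank by the first association degree and keep a subset.
    ranked = sorted(first_set, key=lambda w: (-score(w, document), w))
    second_set = set(ranked[:top_n])

    # S14: look up node words linked to each second word in the topology.
    # Assumption: "condition of association" means edge weight >= min_edge.
    third_set = {
        node
        for word in second_set
        for node, weight in topology.get(word, {}).items()
        if weight >= min_edge
    }

    # S15: union of the second and third word sets.
    union = second_set | third_set

    # S16-S17: rank the union by the second association degree, keep top_k.
    final = sorted(union, key=lambda w: (-score(w, document), w))
    return final[:top_k]


# Toy inputs, invented purely for illustration.
toy_scores = {"camera": 0.9, "lens": 0.7, "phone": 0.4,
              "photography": 0.8, "aperture": 0.3}


def toy_score(word: str, document: str) -> float:
    return toy_scores.get(word, 0.0)


toy_topology = {
    "camera": {"photography": 0.9, "aperture": 0.6, "tripod": 0.2},
    "lens": {"aperture": 0.7},
}

keywords = extract_keywords(
    "a review of a camera and its lens on a phone",
    ["camera", "lens", "phone"],
    toy_score,
    toy_topology,
)
print(keywords)  # ['camera', 'photography', 'lens']
```

    In this toy run "photography" never occurs in the document, yet it reaches the final set through the topology lookup, which is one way to read the abstract's claim that the expression of keywords is enriched.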

    KEYWORD EXTRACTION METHOD, APPARATUS AND MEDIUM

    Publication No.: EP3835993A2

    Publication date: 2021-06-16

    Application No.: EP20166998.3

    Application date: 2020-03-31

    Abstract: The present invention discloses a keyword extraction method, a keyword extraction apparatus and a medium, belonging to the field of data processing. The method comprises operations of: receiving an original document (S10); extracting candidate words from the original document, the extracted candidate words forming a first word set (S11); acquiring a first association degree between each first word in the first word set and the original document (S12), and determining a second word set according to the first association degree, the second word set being a subset of the first word set (S13); for each second word in the second word set, querying, in a word association topology, at least one node word satisfying a condition of association with the second word, the at least one node word forming a third word set, the word association topology indicating an association relation among multiple node words in a predetermined field (S14); and determining a union set of the second word set and the third word set (S15), acquiring a second association degree between each candidate keyword in the union set and the original document (S16), and selecting, according to the second association degree, at least one candidate keyword from the union set to form a keyword set of the original document (S17). In accordance with the present invention, the calculation complexity can be reduced and the calculation speed can be improved; the problem of preferentially selecting high-frequency words in the existing methods is solved; and the expression of keywords is effectively enriched.

    TEXT SEQUENCE SEGMENTATION METHOD AND DEVICE, AND STORAGE MEDIUM THEREOF

    Publication No.: EP3819808A1

    Publication date: 2021-05-12

    Application No.: EP20177416.3

    Application date: 2020-05-29

    IPC classification: G06F40/284

    Abstract: The present disclosure, belonging to the technical field of natural language processing, provides a text sequence segmentation method. The method includes: acquiring n segmentation sub-results of a text sequence, the n segmentation sub-results being acquired by segmenting the text sequence with n segmentation models; processing the n segmentation sub-results by a probability determination model branch in a result combination model to acquire a segmentation probability of each segmentation position; and processing the segmentation probability of each segmentation position by an activation function in the result combination model to acquire a segmentation result of the text sequence.
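
    Illustrative sketch (not part of the patent text): one way to picture the combination step is as voting over the gaps between characters. The Python below is a minimal sketch under stated assumptions. The abstract does not specify the probability determination branch or the activation function, so a weighted average of the n models' votes and a step function with threshold 0.5 are assumed here; `boundaries`, `combine`, `weights` and `threshold` are hypothetical names.

```python
from typing import List, Optional


def boundaries(tokens: List[str]) -> List[int]:
    """Return one 0/1 flag per gap between characters: 1 where a token ends."""
    flags = [0] * (sum(len(t) for t in tokens) - 1)
    pos = 0
    for token in tokens[:-1]:
        pos += len(token)
        flags[pos - 1] = 1
    return flags


def combine(
    text: str,
    sub_results: List[List[str]],
    weights: Optional[List[float]] = None,
    threshold: float = 0.5,
) -> List[str]:
    n = len(sub_results)
    # Assumption: equal weights unless the caller supplies learned ones.
    weights = weights or [1.0 / n] * n
    votes = [boundaries(result) for result in sub_results]

    # Assumed "probability determination branch": weighted average of votes.
    probs = [
        sum(w * v[i] for w, v in zip(weights, votes))
        for i in range(len(text) - 1)
    ]

    # Assumed "activation function": a step function at the threshold.
    cuts = [1 if p >= threshold else 0 for p in probs]

    # Rebuild tokens from the chosen cut positions.
    tokens, start = [], 0
    for i, cut in enumerate(cuts):
        if cut:
            tokens.append(text[start:i + 1])
            start = i + 1
    tokens.append(text[start:])
    return tokens


# Three hypothetical segmentation models disagree on "icecreamshop".
sub_results = [
    ["ice", "cream", "shop"],
    ["icecream", "shop"],
    ["ice", "creams", "hop"],
]
segmented = combine("icecreamshop", sub_results)
print(segmented)  # ['ice', 'cream', 'shop']
```

    Each cut that survives is supported by two of the three models, while the cut after "creams" is proposed by only one and falls below the threshold.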

    METHOD AND DEVICE FOR OPTIMIZING TRAINING SET FOR TEXT CLASSIFICATION

    Publication No.: EP3792811A1

    Publication date: 2021-03-17

    Application No.: EP19214350.1

    Application date: 2019-12-09

    IPC classification: G06F40/20

    Abstract: The present disclosure relates to a method and device for optimizing a training set for text classification. The method includes: acquiring the training set for text classification; selecting some of the samples from the training set as a first initial training subset, and correcting any incorrectly tagged sample in the first initial training subset to obtain a second initial training subset; training a text classification model according to the second initial training subset; predicting the samples in the training set with the trained text classification model to obtain a prediction result; generating an incorrectly tagged sample set according to the prediction result; selecting a key incorrectly tagged sample from the incorrectly tagged sample set, and correcting the tag of the key incorrectly tagged sample to generate a corresponding correctly tagged sample; and updating the training set by using the correctly tagged sample.
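
    Illustrative sketch (not part of the patent text): the loop described above can be sketched with a deliberately tiny classifier. The Python below is a minimal sketch under stated assumptions. The word-vote model, the `oracle` callback standing in for manual tag correction, the choice of the first `subset_size` samples as the initial subset, and the use of the vote margin to pick "key" samples are all hypothetical; the abstract does not specify any of them.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, List, Tuple

Sample = Tuple[str, str]  # (text, tag)


def train(samples: List[Sample]) -> Dict[str, Counter]:
    """Toy model: count how often each word appears under each tag."""
    model: Dict[str, Counter] = defaultdict(Counter)
    for text, tag in samples:
        for word in text.split():
            model[word][tag] += 1
    return model


def predict(model: Dict[str, Counter], text: str) -> Tuple[str, int]:
    """Return (predicted tag, vote margin); ("", 0) if no word is known."""
    votes: Counter = Counter()
    for word in text.split():
        votes.update(model.get(word, {}))
    if not votes:
        return "", 0
    ranked = votes.most_common(2)
    runner_up = ranked[1][1] if len(ranked) > 1 else 0
    return ranked[0][0], ranked[0][1] - runner_up


def optimize(
    training_set: List[Sample],
    oracle: Callable[[str], str],
    subset_size: int,
    budget: int,
) -> List[Sample]:
    # Select a subset and correct its tags (the oracle stands in for a human).
    subset = [(text, oracle(text)) for text, _ in training_set[:subset_size]]
    model = train(subset)

    # Predict every sample; disagreements form the incorrectly tagged set.
    suspects = []
    for idx, (text, tag) in enumerate(training_set):
        predicted, margin = predict(model, text)
        if predicted and predicted != tag:
            suspects.append((margin, idx))

    # Assumption: "key" samples are the suspects with the largest margin.
    key = sorted(suspects, key=lambda s: (-s[0], s[1]))[:budget]

    # Correct the key samples and update the training set.
    updated = list(training_set)
    for _, idx in key:
        text, _ = updated[idx]
        updated[idx] = (text, oracle(text))
    return updated


# Toy data: the third sample carries a wrong tag.
truth = {"great phone": "pos", "bad battery": "neg",
         "great screen": "pos", "bad camera": "neg"}
noisy = [("great phone", "pos"), ("bad battery", "neg"),
         ("great screen", "neg"), ("bad camera", "neg")]

cleaned = optimize(noisy, truth.__getitem__, subset_size=2, budget=1)
print(cleaned[2])  # ('great screen', 'pos')
```

    Only the samples the model disagrees with are sent back for correction, so the manual effort is bounded by `budget` rather than by the size of the training set.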

    KEYWORD EXTRACTION METHOD, KEYWORD EXTRACTION DEVICE AND COMPUTER-READABLE STORAGE MEDIUM

    Publication No.: EP3835973A1

    Publication date: 2021-06-16

    Application No.: EP20178727.2

    Application date: 2020-06-08

    IPC classification: G06F16/34 G06F40/274

    Abstract: A keyword extraction method comprises: receiving an original document (S11); extracting candidate words from the original document to form a first word set (S12); acquiring a first correlation degree between each candidate word in the first word set and the original document, and determining a second word set according to the first correlation degree (S13); generating predicted words through a prediction model based on the original document, the obtained predicted words forming a third word set (S14); determining a union set of the second and third word sets (S15); acquiring a second correlation degree between each candidate keyword in the union set and the original document (S16); acquiring a divergence of each candidate keyword in the union set (S17); and selecting at least one candidate keyword from the union set as a keyword based on the second correlation degree and the divergence (S18). Keyword redundancy can be avoided through the divergence of keywords. The final keywords are not affected by the frequency of candidate words, and the expression of keywords can be enriched.
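
    Illustrative sketch (not part of the patent text): steps S17 and S18 combine a correlation score with a divergence term to avoid near-duplicate keywords. The Python below is a minimal sketch under stated assumptions. The abstract does not define divergence, so a greedy selection in the style of Maximal Marginal Relevance is swapped in, with divergence taken as one minus the highest character-bigram Jaccard similarity to any keyword already chosen. The names `select_keywords`, `similarity` and `trade_off` are hypothetical, and the prediction model of S14 is not covered.

```python
from typing import Dict, List, Set


def bigrams(word: str) -> Set[str]:
    """Character bigrams of a word; a one-character word is its own set."""
    return {word[i:i + 2] for i in range(len(word) - 1)} or {word}


def similarity(a: str, b: str) -> float:
    """Jaccard similarity of character bigrams (an assumed stand-in)."""
    set_a, set_b = bigrams(a), bigrams(b)
    return len(set_a & set_b) / len(set_a | set_b)


def select_keywords(
    correlation: Dict[str, float],
    k: int,
    trade_off: float = 0.5,
) -> List[str]:
    """Greedily pick k keywords, balancing correlation against divergence."""
    selected: List[str] = []
    remaining = dict(correlation)

    while remaining and len(selected) < k:

        def gain(word: str) -> float:
            # Assumed divergence: distance from the closest selected keyword.
            closest = max((similarity(word, s) for s in selected), default=0.0)
            divergence = 1.0 - closest
            return trade_off * remaining[word] + (1.0 - trade_off) * divergence

        # Sorting first makes tie-breaking deterministic.
        best = max(sorted(remaining), key=gain)
        selected.append(best)
        del remaining[best]

    return selected


# Toy second correlation degrees for the union set, invented for illustration.
union_scores = {"camera": 0.9, "cameras": 0.85, "battery": 0.6}
chosen = select_keywords(union_scores, k=2)
print(chosen)  # ['camera', 'battery']
```

    A ranking by correlation alone would return "camera" and "cameras"; the divergence term pushes the near-duplicate down so that "battery" is chosen instead.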