Systems and methods for tokenizing user-annotated names

    Publication No.: US10552462B1

    Publication Date: 2020-02-04

    Application No.: US14525864

    Filing Date: 2014-10-28

    Inventor: Michael Hart

    IPC Classification: G06F16/31 G06F16/35

    Abstract: A disclosed computer-implemented method for tokenizing user-annotated names may include (1) identifying an example set of user-annotated names, (2) creating a custom dictionary that includes known keywords by (a) extracting a set of known keywords from the example set of user-annotated names and (b) assigning a frequency score to each known keyword in the set based on that keyword's frequency within the example set, and (3) enabling the computing device to tokenize an additional user-annotated name of arbitrary structure by performing a semantic analysis that includes (a) assigning, using the custom dictionary, a frequency score to a substring of the additional user-annotated name based on the substring matching a known keyword and (b) splitting the additional user-annotated name into tokens according to the permutation of substrings that received the top combined frequency score. Various other methods, systems, and computer-readable media are also disclosed.
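
    The three steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation, and several choices are assumptions made only for illustration: the function names `build_dictionary` and `tokenize` are hypothetical, example names are assumed to be split on non-alphanumeric delimiters, the frequency score is taken to be relative frequency, the combined score is taken to be the sum of per-substring scores (unknown substrings score zero), and ties are broken in favor of fewer tokens.

```python
import re
from collections import Counter


def build_dictionary(example_names):
    """Map each known keyword to its relative frequency in the example set."""
    counts = Counter()
    for name in example_names:
        # Assumption: example names separate keywords with non-alphanumeric
        # delimiters such as "_", "-", or spaces.
        counts.update(t for t in re.split(r"[^a-z0-9]+", name.lower()) if t)
    total = sum(counts.values())
    return {keyword: n / total for keyword, n in counts.items()}


def tokenize(name, dictionary):
    """Split `name` into the segmentation with the top combined score."""
    text = name.lower()
    n = len(text)
    # best[i] holds (combined score, -token count, tokens) for text[:i].
    best = [(0.0, 0, [])] + [None] * n
    for end in range(1, n + 1):
        for start in range(end):
            prev_score, prev_neg_count, prev_tokens = best[start]
            piece = text[start:end]
            # Assumption: substrings absent from the dictionary score 0.
            candidate = (
                prev_score + dictionary.get(piece, 0.0),
                prev_neg_count - 1,
                prev_tokens + [piece],
            )
            # Higher score wins; on equal score, fewer tokens wins.
            if best[end] is None or candidate[:2] > best[end][:2]:
                best[end] = candidate
    return best[n][2]


examples = ["quarterly_report_final", "annual-report-draft", "report final v2"]
dictionary = build_dictionary(examples)
print(tokenize("annualreportfinal", dictionary))  # ['annual', 'report', 'final']
print(tokenize("draftxyzreport", dictionary))     # ['draft', 'xyz', 'report']
```

    The dynamic program considers every way of cutting the name into contiguous substrings, so it works on names with no delimiters at all. The tie-break on token count keeps an unknown run such as `xyz` together as one token instead of splitting it into single characters.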