Integration of knowledge graph embedding into topic modeling with hierarchical Dirichlet process

    Publication No.: US11636355B2

    Publication Date: 2023-04-25

    Application No.: US16427225

    Filing Date: 2019-05-30

    Applicant: Baidu USA, LLC

    Abstract: Leveraging domain knowledge is an effective strategy for enhancing the quality of inferred low-dimensional representations of documents by topic models. Presented herein are embodiments of a Bayesian nonparametric model that employ knowledge graph (KG) embedding in the context of topic modeling for extracting more coherent topics; embodiments of the model may be referred to as topic modeling with knowledge graph embedding (TMKGE). TMKGE embodiments are hierarchical Dirichlet process (HDP)-based models that flexibly borrow information from a KG to improve the interpretability of topics. Also, embodiments of a new, efficient online variational inference method based on a stick-breaking construction of HDP were developed for TMKGE models, making TMKGE suitable for large document corpora and KGs. Experiments on datasets illustrate the superior performance of TMKGE in terms of topic coherence and document classification accuracy, compared to state-of-the-art topic modeling methods.
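    The stick-breaking construction underlying the HDP-based inference can be sketched with a generic truncated stick-breaking sampler for a Dirichlet process; the truncation level and concentration parameter below are illustrative choices, not values taken from the patent:

```python
import numpy as np

def stick_breaking_weights(alpha: float, truncation: int, rng=None) -> np.ndarray:
    """Sample truncated stick-breaking weights for a Dirichlet process.

    Each weight is pi_k = beta_k * prod_{j<k} (1 - beta_j) with
    beta_k ~ Beta(1, alpha); the truncation level caps the number of
    components, as in truncated variational inference for HDP models.
    """
    rng = np.random.default_rng(rng)
    betas = rng.beta(1.0, alpha, size=truncation)
    betas[-1] = 1.0  # assign all remaining stick mass to the last component
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas[:-1])))
    return betas * remaining

weights = stick_breaking_weights(alpha=2.0, truncation=50, rng=0)
print(len(weights), float(weights.sum()))  # 50 weights summing to ~1.0
```

    Larger `alpha` spreads the mass over more components; online variational inference optimizes variational parameters of these sticks rather than sampling them.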

    Cross-lingual language models and pretraining of cross-lingual language models

    Publication No.: US11886446B2

    Publication Date: 2024-01-30

    Application No.: US17575650

    Filing Date: 2022-01-14

    Applicant: Baidu USA, LLC

    Abstract: Existing research on cross-lingual retrieval cannot take good advantage of large-scale pretrained language models, such as multilingual BERT and XLM. The absence of cross-lingual passage-level relevance data for finetuning and the lack of query-document style pretraining are among the key factors behind this issue. Accordingly, embodiments of two novel retrieval-oriented pretraining tasks are presented herein to further pretrain cross-lingual language models for downstream retrieval tasks, such as cross-lingual ad-hoc retrieval (CLIR) and cross-lingual question answering (CLQA). In one or more embodiments, distant supervision data was constructed from multilingual texts using section alignment to support retrieval-oriented language model pretraining. In one or more embodiments, directly finetuning language models on part of an evaluation collection was performed by making Transformers capable of accepting longer sequences. Experiments show that model embodiments significantly improve upon general multilingual language models in at least the cross-lingual retrieval setting and the cross-lingual transfer setting.

    Coreference-aware representation learning for neural named entity recognition

    Publication No.: US11354506B2

    Publication Date: 2022-06-07

    Application No.: US16526614

    Filing Date: 2019-07-30

    Applicant: Baidu USA, LLC

    Abstract: Previous neural network models that perform named entity recognition (NER) typically treat the input sentences as a linear sequence of words but ignore rich structural information, such as the coreference relations among non-adjacent words, phrases, or entities. Presented herein are novel approaches to learn coreference-aware word representations for the NER task. In one or more embodiments, a "CNN-BiLSTM-CRF" neural architecture is modified to include a coreference layer component on top of the BiLSTM layer to incorporate coreferential relations. Also, in one or more embodiments, a coreference regularization is added during training to ensure that the coreferential entities share similar representations and consistent predictions within the same coreference cluster. A model embodiment achieved new state-of-the-art performance when tested.
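    The coreference-regularization idea, pulling representations of coreferential mentions toward one another, can be illustrated with a minimal centroid-based penalty; the patent's exact regularizer is not reproduced here, so the formula below is an assumption:

```python
import numpy as np

def coreference_regularizer(reps: np.ndarray, clusters: list) -> float:
    """Illustrative coreference regularization term (not the patented formula).

    For each coreference cluster, penalize the squared distance of every
    member's representation from the cluster centroid, encouraging
    coreferential mentions to share similar representations.
    """
    penalty = 0.0
    for cluster in clusters:
        members = reps[cluster]          # (m, d) representations in one cluster
        centroid = members.mean(axis=0)  # shared target for the cluster
        penalty += float(((members - centroid) ** 2).sum())
    return penalty

# Token representations for a 5-token sentence, d = 3.
reps = np.array([[1.0, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 0]])
# Tokens 0 and 2 corefer and already match, so they add no penalty.
print(coreference_regularizer(reps, [[0, 2]]))  # 0.0
```

    In training, a weighted version of such a term would be added to the CRF loss so that gradient updates draw cluster members together.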

    Fast neural ranking on bipartite graph indices

    Publication No.: US12056133B2

    Publication Date: 2024-08-06

    Application No.: US17555316

    Filing Date: 2021-12-17

    Applicant: Baidu USA, LLC

    IPC Classes: G06F16/2457 G06F16/901

    CPC Classes: G06F16/24578 G06F16/9024

    Abstract: Presented are systems and methods that construct BipartitE Graph INdices (BEGIN) embodiments for fast neural ranking. BEGIN embodiments comprise two types of nodes: sampled queries and base (searching) objects. In one or more embodiments, edges connecting these nodes are constructed by using a neural network ranking measure. These embodiments extend traditional search-on-graph methods and lend themselves to fast neural ranking. Experimental results demonstrate the effectiveness and efficiency of such embodiments.
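    How a bipartite index of sampled queries and base objects might be wired can be sketched as follows; `score_fn` stands in for the neural ranking measure, and the top-k edge rule is an illustrative assumption rather than the patented construction:

```python
import numpy as np

def build_bipartite_index(queries, objects, score_fn, k=2):
    """Sketch of a bipartite index in the spirit of BEGIN (details assumed).

    Nodes are sampled queries and base objects; each query node is linked
    to the k base objects the ranking measure scores highest, so a later
    graph walk can hop query -> object toward high-scoring items.
    """
    edges = {}
    for qi, q in enumerate(queries):
        scores = np.array([score_fn(q, x) for x in objects])
        edges[qi] = [int(i) for i in np.argsort(-scores)[:k]]  # top-k by score
    return edges

# A stand-in "neural" ranking measure: here simply the inner product.
score = lambda q, x: float(np.dot(q, x))
queries = np.array([[1.0, 0.0], [0.0, 1.0]])
objects = np.array([[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]])
print(build_bipartite_index(queries, objects, score, k=2))  # {0: [0, 2], 1: [1, 2]}
```

    At query time, a greedy search over such a graph evaluates the ranking network only along the traversed edges instead of against every base object.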

    Proximity graph maintenance for fast online nearest neighbor search

    Publication No.: US12050646B2

    Publication Date: 2024-07-30

    Application No.: US17408146

    Filing Date: 2021-08-20

    Applicant: Baidu USA, LLC

    IPC Classes: G06F16/901 G06F16/22

    CPC Classes: G06F16/9024 G06F16/2272

    Abstract: Incremental proximity graph maintenance (IPGM) systems and methods for online approximate nearest neighbor (ANN) search support both online deletion and insertion of vertices on proximity graphs. In various embodiments, updating a proximity graph comprises receiving a workload that represents a set of vertices in the proximity graph, each vertex being associated with a type of operation, such as a query, insertion, or deletion. For a query or an insertion, a search may be executed on the graph to obtain a set of top-K vertices for each vertex. In the case of a deletion, a vertex may be deleted from the proximity graph, and a local or global reconnection update method may be used to reconstruct at least a portion of the proximity graph.
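    The workload-processing loop described above can be sketched in toy form; the brute-force distance scan stands in for the actual graph search, and the local reconnection rule (linking a deleted vertex's former neighbors to each other) is an illustrative assumption:

```python
import numpy as np

def process_workload(graph, points, workload, K=2):
    """Toy incremental-maintenance loop in the spirit of IPGM (API assumed).

    graph: adjacency dict {vertex_id: set(neighbor_ids)};
    points: {vertex_id: vector}. Queries and insertions first search for
    the top-K nearest vertices (brute force here, standing in for a graph
    search); an insertion then links the new vertex to those neighbors,
    and a deletion removes the vertex and locally reconnects its former
    neighbors to one another.
    """
    for op, vid, vec in workload:
        if op in ("query", "insert"):
            dists = {u: float(np.linalg.norm(points[u] - vec)) for u in graph}
            topk = sorted(dists, key=dists.get)[:K]  # query result / link targets
            if op == "insert":
                points[vid] = vec
                graph[vid] = set(topk)
                for u in topk:  # keep edges bidirectional
                    graph[u].add(vid)
        else:  # "delete": local reconnection of the orphaned neighborhood
            neighbors = graph.pop(vid)
            points.pop(vid)
            for u in neighbors:
                graph[u].discard(vid)
                graph[u] |= (neighbors - {u})
    return graph

pts = {0: np.array([0.0, 0.0]), 1: np.array([1.0, 0.0]), 2: np.array([0.0, 1.0])}
g = {0: {1, 2}, 1: {0}, 2: {0}}
g = process_workload(g, pts, [("insert", 3, np.array([1.0, 1.0])),
                              ("delete", 0, None)])
print(sorted(g))  # remaining vertices: [1, 2, 3]
```

    The local reconnection keeps the neighborhood navigable without rebuilding the whole index; a global update, by contrast, would re-run searches for the affected vertices.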

    Transformation for fast inner product search on graph

    Publication No.: US11989233B2

    Publication Date: 2024-05-21

    Application No.: US17033791

    Filing Date: 2020-09-27

    Applicant: Baidu USA, LLC

    Abstract: Presented herein are embodiments of a fast search-on-graph methodology for Maximum Inner Product Search (MIPS). This optimization problem is challenging since traditional Approximate Nearest Neighbor (ANN) search methods may not perform efficiently with the nonmetric similarity measure. Embodiments herein are based on the property that a Möbius/Möbius-like transformation introduces an isomorphism between a subgraph of the ℓ2-Delaunay graph and the Delaunay graph for inner product. Based on this observation, embodiments of a novel graph indexing and searching methodology are presented to find the optimal solution with the largest inner product with the query. Experiments show significant improvements compared to existing methods.
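    The transformation at the heart of this approach maps each base point x to x / ||x||², after which a standard metric graph index built on the transformed points can serve MIPS queries. A minimal sketch, with a brute-force MIPS check on toy data for reference (the graph-index construction itself is omitted):

```python
import numpy as np

def mobius_transform(X: np.ndarray) -> np.ndarray:
    """Möbius-like transform y_i = x_i / ||x_i||^2 that links MIPS to
    metric (Euclidean) graph search; zero vectors are assumed absent."""
    norms_sq = (X ** 2).sum(axis=1, keepdims=True)
    return X / norms_sq

def mips_bruteforce(X: np.ndarray, q: np.ndarray) -> int:
    """Exact maximum inner product, for checking the idea on toy data."""
    return int(np.argmax(X @ q))

X = np.array([[0.5, 0.0], [3.0, 4.0], [0.0, 2.0]])
q = np.array([1.0, 1.0])
Y = mobius_transform(X)  # small-norm points map far out, large-norm points inward
print(mips_bruteforce(X, q))  # index of the largest inner product: 1
```

    Note how the transform inverts norms: the point [0.5, 0] maps to [2, 0], while [3, 4] maps to [0.12, 0.16], which is what lets an ℓ2-based graph over the transformed points favor large-inner-product answers.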