Integration of knowledge graph embedding into topic modeling with hierarchical Dirichlet process

    Publication No.: US11636355B2

    Publication date: 2023-04-25

    Application No.: US16427225

    Filing date: 2019-05-30

    Applicant: Baidu USA, LLC

    Abstract: Leveraging domain knowledge is an effective strategy for enhancing the quality of inferred low-dimensional representations of documents by topic models. Presented herein are embodiments of a Bayesian nonparametric model that employs knowledge graph (KG) embedding in the context of topic modeling to extract more coherent topics; embodiments of the model may be referred to as topic modeling with knowledge graph embedding (TMKGE). TMKGE embodiments are hierarchical Dirichlet process (HDP)-based models that flexibly borrow information from a KG to improve the interpretability of topics. Also, embodiments of a new, efficient online variational inference method based on a stick-breaking construction of the HDP were developed for TMKGE models, making TMKGE suitable for large document corpora and KGs. Experiments on datasets illustrate the superior performance of TMKGE in terms of topic coherence and document classification accuracy, compared to state-of-the-art topic modeling methods.
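
    The online variational inference mentioned in the abstract builds on the stick-breaking construction of the HDP. Below is a minimal Python sketch of the generic truncated stick-breaking construction of Dirichlet-process weights only; it is not the patented TMKGE inference procedure, and the concentration parameter, truncation level, and function name are illustrative assumptions.

```python
# Generic truncated stick-breaking construction of Dirichlet-process weights
# (a sketch of the building block, not the patented TMKGE inference).
import numpy as np

def stick_breaking_weights(alpha, truncation, seed=None):
    """Sethuraman construction: beta_k = v_k * prod_{j<k} (1 - v_j), v_k ~ Beta(1, alpha)."""
    rng = np.random.default_rng(seed)
    v = rng.beta(1.0, alpha, size=truncation)                       # stick proportions
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))   # mass left before stick k
    weights = v * remaining
    weights[-1] = 1.0 - weights[:-1].sum()                          # fold leftover mass into the last stick
    return weights

if __name__ == "__main__":
    w = stick_breaking_weights(alpha=1.0, truncation=20, seed=0)
    print(w.round(4), w.sum())                                      # non-negative weights summing to 1
```

    Truncating at a fixed number of sticks and folding the leftover mass into the last weight is the standard device that makes the infinite mixture tractable for variational inference.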

    Learning latent structural relations with segmentation variational autoencoders

    Publication No.: US11816533B2

    Publication date: 2023-11-14

    Application No.: US16951158

    Filing date: 2020-11-18

    Applicant: Baidu USA, LLC

    IPC classification: G06N5/04 G06N3/088 G06N3/045

    CPC classification: G06N5/042 G06N3/045 G06N3/088

    Abstract: Learning disentangled representations is an important topic in machine learning for a wide range of applications. Disentangled latent variables represent interpretable semantic information and reflect separate factors of variation in data. Although generative models may learn latent representations and generate data samples, existing models may ignore the structural information among latent representations. Described in the present disclosure are embodiments that learn disentangled latent structural representations from data using decomposable variational autoencoders, which simultaneously learn component representations and encode component relationships. Embodiments of a novel structural prior for latent representations are disclosed to capture interactions among different data components. Embodiments are applied to data segmentation and to latent relation discovery among different data components. Experiments on several datasets demonstrate the utility of the present model embodiments.
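
    As a rough illustration of the decomposable encoding described above, the following sketch encodes each data component with its own encoder and decodes the codes jointly. The component count, layer sizes, and the standard-normal prior standing in for the disclosed structural prior are all illustrative assumptions, not the patented architecture.

```python
# Sketch of a decomposable VAE: one encoder per data component, a joint decoder.
# The standard-normal prior here stands in for the disclosed structural prior.
import torch
import torch.nn as nn

class ComponentVAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=8, n_components=2):
        super().__init__()
        self.encoders = nn.ModuleList(
            nn.Sequential(nn.Linear(x_dim, 128), nn.ReLU(), nn.Linear(128, 2 * z_dim))
            for _ in range(n_components))
        self.decoder = nn.Sequential(
            nn.Linear(n_components * z_dim, 128), nn.ReLU(),
            nn.Linear(128, n_components * x_dim))

    def forward(self, components):
        codes, kl = [], 0.0
        for enc, x in zip(self.encoders, components):
            mu, logvar = enc(x).chunk(2, dim=-1)
            z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()          # reparameterization trick
            kl = kl + 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(-1).mean()
            codes.append(z)
        recon = self.decoder(torch.cat(codes, dim=-1))                    # decode all components jointly
        return recon, kl

if __name__ == "__main__":
    model = ComponentVAE()
    parts = [torch.rand(4, 784), torch.rand(4, 784)]                      # two toy data components
    recon, kl = model(parts)
    loss = (recon - torch.cat(parts, dim=-1)).pow(2).mean() + 1e-3 * kl   # ELBO-style objective
    print(loss.item())
```

    Because each component keeps its own latent code while reconstruction is joint, relations among components can be read off the concatenated latent space.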

    Representation learning for input classification via topic sparse autoencoder and entity embedding

    Publication No.: US11615311B2

    Publication date: 2023-03-28

    Application No.: US16691554

    Filing date: 2019-11-21

    Applicant: Baidu USA, LLC

    Abstract: Described herein are embodiments of a unified neural network framework that integrates Topic modeling, Word embedding, and Entity Embedding (TWEE) for representation learning of inputs. In one or more embodiments, a novel topic sparse autoencoder is introduced to incorporate discriminative topics into the representation learning of the input. Topic distributions of inputs are generated from a global viewpoint and are utilized to enable the autoencoder to learn topical representations. A sparsity constraint may be added to ensure that the most discriminative representations are related to topics. In addition, both word- and entity-related information may be embedded into the network to help learn a more comprehensive input representation. Extensive empirical experiments show that embodiments of the TWEE framework outperform state-of-the-art methods on different datasets.
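
    A minimal sketch of the topic-sparsity idea follows: an autoencoder whose hidden code is pulled toward a document's topic distribution by a KL penalty. It omits the word and entity embedding channels of TWEE, and the layer sizes, loss weights, and randomly generated stand-in topic proportions are illustrative assumptions rather than the patented network.

```python
# Sketch of a topic-sparsity penalty on an autoencoder code (TWEE's word and
# entity embedding channels are omitted; sizes and weights are illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopicSparseAutoencoder(nn.Module):
    def __init__(self, x_dim=2000, n_topics=20):
        super().__init__()
        self.encoder = nn.Linear(x_dim, n_topics)
        self.decoder = nn.Linear(n_topics, x_dim)

    def forward(self, x):
        h = torch.softmax(self.encoder(x), dim=-1)   # topic-shaped hidden code
        return self.decoder(h), h

def topic_sparse_loss(x, recon, h, topic_dist, sparsity_weight=1.0):
    """Reconstruction error plus KL(topic_dist || h), pulling the code toward the topics."""
    kl = (topic_dist * (topic_dist.clamp_min(1e-8).log() - h.clamp_min(1e-8).log())).sum(-1).mean()
    return F.mse_loss(recon, x) + sparsity_weight * kl

if __name__ == "__main__":
    model = TopicSparseAutoencoder()
    x = torch.rand(8, 2000)                              # toy bag-of-words inputs
    topics = torch.softmax(torch.randn(8, 20), dim=-1)   # stand-in for externally inferred topic proportions
    recon, h = model(x)
    print(topic_sparse_loss(x, recon, h, topics).item())
```

    The KL penalty is what keeps the most active code dimensions aligned with discriminative topics instead of arbitrary reconstruction features.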

    Systems and methods for mutual learning for topic discovery and word embedding

    Publication No.: US11568266B2

    Publication date: 2023-01-31

    Application No.: US16355622

    Filing date: 2019-03-15

    Applicant: Baidu USA, LLC

    Abstract: Described herein are embodiments of systems and methods for mutual machine learning with global topic discovery and local word embedding. Both topic modeling and word embedding map documents onto a low-dimensional space, with the former clustering words into a global topic space and the latter mapping words into a local continuous embedding space. Embodiments of the Topic Modeling and Sparse Autoencoder (TMSA) framework unify these two complementary patterns by constructing a mutual learning mechanism between word co-occurrence-based topic modeling and an autoencoder. In embodiments, word topics generated with topic modeling are passed into the autoencoder to impose topic sparsity, so the autoencoder learns topic-relevant word representations. In return, word embeddings learned by the autoencoder are sent back to the topic modeling component to improve the quality of topic generation. Performance evaluation on various datasets demonstrates the effectiveness of the disclosed TMSA framework in discovering topics and embedding words.
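
    The sketch below only illustrates the flavor of such a mutual learning loop: an NMF-style factorization of the word co-occurrence matrix and a small autoencoder take turns, with topics regularizing the autoencoder's code and the encoder weights blended back into the topic factors. The specific coupling terms, blending weights, and hyperparameters are illustrative assumptions and do not reproduce the patented TMSA updates.

```python
# Sketch of an alternating topic-model / autoencoder loop (the coupling terms
# and hyperparameters are illustrative, not the patented TMSA updates).
import torch

torch.manual_seed(0)
V, K, N = 300, 10, 100                       # vocabulary size, topics, documents
X = torch.rand(N, V)                         # toy document-word matrix
C = X.T @ X                                  # word co-occurrence matrix

W = torch.rand(V, K) + 1e-3                  # word-topic factors (NMF-style topic model)
H = torch.rand(K, V) + 1e-3
enc = torch.nn.Linear(V, K)                  # autoencoder encoder (columns of weight act as word embeddings)
dec = torch.nn.Linear(K, V)                  # autoencoder decoder
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

for step in range(200):
    # Topic side: one multiplicative NMF update on the co-occurrence matrix,
    # then blend in the non-negative part of the current encoder weights.
    with torch.no_grad():
        W *= (C @ H.T) / (W @ H @ H.T + 1e-9)
        H *= (W.T @ C) / (W.T @ W @ H + 1e-9)
        W = 0.95 * W + 0.05 * enc.weight.T.clamp_min(1e-3)

    # Autoencoder side: reconstruction plus a pull of the code toward the
    # document-topic mixture implied by the current topics (topic-sparsity proxy).
    code = torch.softmax(enc(X), dim=-1)
    doc_topics = torch.softmax(X @ W, dim=-1)
    loss = (dec(code) - X).pow(2).mean() + 0.1 * (code - doc_topics).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print("final loss:", loss.item())
```

    The alternation is the point: each side's latest estimate enters the other side's next update, which is the mutual learning pattern the abstract describes.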