Representation learning for input classification via topic sparse autoencoder and entity embedding

    公开(公告)号:US11615311B2

    公开(公告)日:2023-03-28

    申请号:US16691554

    申请日:2019-11-21

    申请人: Baidu USA, LLC

    摘要: Described herein are embodiments of a unified neural network framework to integrate Topic modeling, Word embedding and Entity Embedding (TWEE) for representation learning of inputs. In one or more embodiments, a novel topic sparse autoencoder is introduced to incorporate discriminative topics into the representation learning of the input. Topic distributions of inputs are generated from a global viewpoint and are utilized to enable autoencoder to learn topical representations. A sparsity constraint may be added to ensure that the most discriminative representations are related to topics. In addition, both words and entity related information may be embedded into the network to help learn a more comprehensive input representation. Extensive empirical experiments show that embodiments of the TWEE framework outperform the state-of-the-art methods on different datasets.

    Systems and methods for mutual learning for topic discovery and word embedding

    公开(公告)号:US11568266B2

    公开(公告)日:2023-01-31

    申请号:US16355622

    申请日:2019-03-15

    申请人: Baidu USA, LLC

    摘要: Described herein are embodiments for systems and methods for mutual machine learning with global topic discovery and local word embedding. Both topic modeling and word embedding map documents onto a low-dimensional space, with the former clustering words into a global topic space and the latter mapping word into a local continuous embedding space. Embodiments of Topic Modeling and Sparse Autoencoder (TMSA) framework unify these two complementary patterns by constructing a mutual learning mechanism between word co-occurrence based topic modeling and autoencoder. In embodiments, word topics generated with topic modeling are passed into auto-encoder to impose topic sparsity for the autoencoder to learn topic-relevant word representations. In return, word embedding learned by autoencoder is sent back to topic modeling to improve the quality of topic generations. Performance evaluation on various datasets demonstrates the effectiveness of the disclosed TMSA framework in discovering topics and embedding words.

    Integration of knowledge graph embedding into topic modeling with hierarchical Dirichlet process

    公开(公告)号:US11636355B2

    公开(公告)日:2023-04-25

    申请号:US16427225

    申请日:2019-05-30

    申请人: Baidu USA, LLC

    摘要: Leveraging domain knowledge is an effective strategy for enhancing the quality of inferred low-dimensional representations of documents by topic models. Presented herein are embodiments of a Bayesian nonparametric model that employ knowledge graph (KG) embedding in the context of topic modeling for extracting more coherent topics; embodiments of the model may be referred to as topic modeling with knowledge graph embedding (TMKGE). TMKGE embodiments are hierarchical Dirichlet process (HDP)-based models that flexibly borrow information from a KG to improve the interpretability of topics. Also, embodiments of a new, efficient online variational inference method based on a stick-breaking construction of HDP were developed for TMKGE models, making TMKGE suitable for large document corpora and KGs. Experiments on datasets illustrate the superior performance of TMKGE in terms of topic coherence and document classification accuracy, compared to state-of-the-art topic modeling methods.