Dense video captioning
    Granted Patent

    Publication No.: US10958925B2

    Publication Date: 2021-03-23

    Application No.: US16687405

    Filing Date: 2019-11-18

    Abstract: Systems and methods for dense captioning of a video include a multi-layer encoder stack configured to receive information extracted from a plurality of video frames, a proposal decoder coupled to the encoder stack and configured to receive one or more outputs from the encoder stack, a masking unit configured to mask the one or more outputs from the encoder stack according to one or more outputs from the proposal decoder, and a decoder stack coupled to the masking unit and configured to receive the masked one or more outputs from the encoder stack. The dense captioning is generated based on one or more outputs of the decoder stack. In some embodiments, the one or more outputs from the proposal decoder include a differentiable mask. In some embodiments, during training, error in the dense captioning is back propagated to the decoder stack, the encoder stack, and the proposal decoder.
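The key idea in the abstract is the differentiable mask: because the proposal decoder's output gates the encoder outputs smoothly rather than with a hard cut, caption error can back-propagate through the mask into the proposal decoder. A minimal sketch of that gating, with hypothetical function names and a sigmoid-window mask standing in for the patented formulation:

```python
import numpy as np

def differentiable_mask(center, length, num_frames, sharpness=10.0):
    """Smooth, fully differentiable gate over frame positions: ~1 inside the
    proposed event window, ~0 outside. A stand-in for the proposal decoder's
    differentiable mask."""
    t = np.arange(num_frames)
    left = 1.0 / (1.0 + np.exp(-sharpness * (t - (center - length / 2.0))))
    right = 1.0 / (1.0 + np.exp(-sharpness * ((center + length / 2.0) - t)))
    return left * right

def mask_encoder_outputs(frame_features, center, length):
    """Apply the proposal mask to the encoder stack's outputs before they
    reach the caption decoder stack."""
    num_frames, _ = frame_features.shape
    mask = differentiable_mask(center, length, num_frames)
    return frame_features * mask[:, None]

# Toy encoder outputs: 16 frames, 8-dim features; event proposed at frames ~6-10.
feats = np.random.default_rng(0).normal(size=(16, 8))
masked = mask_encoder_outputs(feats, center=8.0, length=4.0)
```

Because every operation above is smooth, a gradient with respect to `center` and `length` exists everywhere, which is what lets the proposal decoder be trained end-to-end from captioning loss alone.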

    END-TO-END SPEECH RECOGNITION WITH POLICY LEARNING

    Publication No.: US20190130897A1

    Publication Date: 2019-05-02

    Application No.: US15878113

    Filing Date: 2018-01-23

    Abstract: The disclosed technology teaches a deep end-to-end speech recognition model, including using multi-objective learning criteria to train a deep end-to-end speech recognition model on training data comprising speech samples temporally labeled with ground truth transcriptions. The multi-objective learning criteria updates model parameters of the model over one thousand to millions of backpropagation iterations by combining, at each iteration, a maximum likelihood objective function that modifies the model parameters to maximize a probability of outputting a correct transcription and a policy gradient function that modifies the model parameters to maximize a positive reward defined based on a non-differentiable performance metric which penalizes incorrect transcriptions in accordance with their conformity to corresponding ground truth transcriptions; and upon convergence after a final backpropagation iteration, persisting the modified model parameters learned by using the multi-objective learning criteria with the model to be applied to further end-to-end speech recognition.

    SYSTEMS AND METHODS FOR KNOWLEDGE BASE QUESTION ANSWERING USING GENERATION AUGMENTED RANKING

    Publication No.: US20230059870A1

    Publication Date: 2023-02-23

    Application No.: US17565305

    Filing Date: 2021-12-29

    Abstract: Embodiments described herein provide a question answering approach that answers a question by generating an executable logical form. First, a ranking model is used to select a set of good logical forms from a pool of logical forms obtained by searching over a knowledge graph. The selected logical forms are good in the sense that they are close to (or exactly match, in some cases) the intents in the question and final desired logical form. Next, a generation model is adopted conditioned on the question as well as the selected logical forms to generate the target logical form and execute it to obtain the final answer. For example, at inference stage, when a question is received, a matching logical form is identified from the question, based on which the final answer can be generated based on the node that is associated with the matching logical form in the knowledge base.
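The rank-then-generate pipeline above first scores candidate logical forms against the question, then conditions a generator on the top candidates. A toy sketch of the ranking stage, using token overlap as a hypothetical stand-in for the learned ranking model (all names and the example question are illustrative):

```python
def tokens(text):
    """Crude tokenizer shared by questions and logical forms."""
    return set(text.lower().replace("(", " ").replace(")", " ").replace("_", " ").split())

def rank_logical_forms(question, candidates, top_k=2):
    """Rank candidate logical forms (from a knowledge-graph search) by
    Jaccard overlap with the question; a placeholder for a trained ranker."""
    def score(lf):
        q, c = tokens(question), tokens(lf)
        return len(q & c) / len(q | c) if q | c else 0.0
    return sorted(candidates, key=score, reverse=True)[:top_k]

question = "who directed the film inception"
pool = [
    "(directed_by film inception)",
    "(release_year film inception)",
    "(directed_by film interstellar)",
]
top = rank_logical_forms(question, pool)
```

In the described system the generation model would then take `question` plus `top` as input and emit the final executable logical form; the ranker's job is only to put forms close to the question's intent in front of the generator.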

    SYSTEMS AND METHODS FOR PARTIALLY SUPERVISED ONLINE ACTION DETECTION IN UNTRIMMED VIDEOS

    Publication No.: US20210357687A1

    Publication Date: 2021-11-18

    Application No.: US16931228

    Filing Date: 2020-07-16

    Abstract: Embodiments described herein provide systems and methods for a partially supervised training model for online action detection. Specifically, the online action detection framework may include two modules that are trained jointly—a Temporal Proposal Generator (TPG) and an Online Action Recognizer (OAR). In the training phase, OAR performs both online per-frame action recognition and start point detection. At the same time, TPG generates class-wise temporal action proposals serving as noisy supervisions for OAR. TPG is then optimized with the video-level annotations. In this way, the online action detection framework can be trained with video-category labels only without pre-annotated segment-level boundary labels.
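The partial supervision hinges on converting TPG's class-wise temporal proposals into per-frame pseudo-labels (and start-point indicators) that the OAR can train on, so that only video-level category labels are ever annotated by hand. A minimal sketch of that conversion, with hypothetical names and a (class, start, end) proposal format assumed for illustration:

```python
def proposals_to_frame_labels(proposals, num_frames, background=0):
    """Turn TPG-style temporal action proposals into noisy per-frame
    supervision for OAR: an action label per frame plus a binary
    start-point indicator per frame.

    proposals: list of (action_class, start_frame, end_frame_exclusive).
    """
    labels = [background] * num_frames
    starts = [0] * num_frames
    for cls, start, end in proposals:
        for t in range(max(start, 0), min(end, num_frames)):
            labels[t] = cls  # per-frame action recognition target
        if 0 <= start < num_frames:
            starts[start] = 1  # start-point detection target
    return labels, starts

# Two proposals in a 10-frame clip: class 3 over frames [2, 5), class 1 over [7, 9).
labels, starts = proposals_to_frame_labels([(3, 2, 5), (1, 7, 9)], num_frames=10)
```

These pseudo-labels are "noisy" because the proposals themselves are learned from video-level labels only; jointly training TPG and OAR lets the two modules refine each other without segment-level boundary annotation.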
