-
Publication number: US11895343B2
Publication date: 2024-02-06
Application number: US17852310
Application date: 2022-06-28
Applicant: Microsoft Technology Licensing, LLC
Inventor: Gaurav Mittal, Ye Yu, Mei Chen, Junwen Chen
IPC: H04N21/23, H04N21/234, G06V20/40, G06T7/246
CPC classification number: H04N21/23418, G06T7/246, G06V20/46, G06T2207/10021
Abstract: Example solutions for video frame action detection use a gated history and include: receiving a video stream comprising a plurality of video frames; grouping the plurality of video frames into a set of present video frames and a set of historical video frames, the set of present video frames comprising a current video frame; determining a set of attention weights for the set of historical video frames, the set of attention weights indicating how informative a video frame is for predicting action in the current video frame; weighting the set of historical video frames with the set of attention weights to produce a set of weighted historical video frames; and based on at least the set of weighted historical video frames and the set of present video frames, generating an action prediction for the current video frame.
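The abstract above outlines attention-based gating over historical frames. Below is a minimal sketch of that idea, assuming frame features have already been extracted as fixed-length vectors; the function name `gated_history_attention` and all shapes are illustrative assumptions, not the patented implementation.

```python
import numpy as np

def gated_history_attention(history_feats, current_feat):
    """Score each historical frame against the current frame and return
    softmax-normalized attention weights (illustrative only)."""
    # Scaled dot-product relevance of each historical frame to the current frame.
    scores = history_feats @ current_feat / np.sqrt(current_feat.shape[-1])
    weights = np.exp(scores - scores.max())
    return weights / weights.sum()

# Toy example: 8 historical frames and 1 current frame, 16-dim features.
rng = np.random.default_rng(0)
history = rng.normal(size=(8, 16))
current = rng.normal(size=(16,))

w = gated_history_attention(history, current)   # one weight per historical frame
weighted_history = history * w[:, None]         # weighted historical frames
context = weighted_history.sum(axis=0)          # pooled history context
fused = np.concatenate([context, current])      # joint input to an action head
print(w.round(3), fused.shape)
```

Historical frames whose weights are near zero are effectively gated out, which is the intuition behind weighting the historical set before combining it with the present frames for the action prediction.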
-
Publication number: US12223412B2
Publication date: 2025-02-11
Application number: US17123697
Application date: 2020-12-16
Applicant: Microsoft Technology Licensing, LLC
Inventor: Yinpeng Chen, Xiyang Dai, Mengchen Liu, Dongdong Chen, Lu Yuan, Zicheng Liu, Ye Yu, Mei Chen, Yunsheng Li
Abstract: A computer device for automatic feature detection comprises a processor, a communication device, and a memory configured to hold instructions executable by the processor to instantiate a dynamic convolution neural network, receive input data via the communication device, and execute the dynamic convolution neural network to automatically detect features in the input data. The dynamic convolution neural network compresses the input data from an input space having a dimensionality equal to a predetermined number of channels into an intermediate space having a dimensionality less than the number of channels. The dynamic convolution neural network dynamically fuses the channels into an intermediate representation within the intermediate space and expands the intermediate representation from the intermediate space to an expanded representation in an output space having a higher dimensionality than the dimensionality of the intermediate space. The features in the input data are automatically detected based on the expanded representation.
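As a rough illustration of the compress-fuse-expand pipeline described above, the sketch below squeezes the input channels with a 1x1 convolution, predicts an input-dependent fusion matrix in the low-dimensional space, and expands back out. The module and parameter names (`DynamicChannelFusion`, `latent_channels`) are assumptions for the example, not the claimed architecture.

```python
import torch
import torch.nn as nn

class DynamicChannelFusion(nn.Module):
    """Illustrative sketch: compress channels, fuse them with an
    input-dependent matrix, then expand to a higher-dimensional space."""
    def __init__(self, in_channels: int, latent_channels: int, out_channels: int):
        super().__init__()
        self.squeeze = nn.Conv2d(in_channels, latent_channels, kernel_size=1)
        # Predict a (latent x latent) fusion matrix from globally pooled input.
        self.fusion_predictor = nn.Linear(in_channels, latent_channels * latent_channels)
        self.expand = nn.Conv2d(latent_channels, out_channels, kernel_size=1)
        self.latent_channels = latent_channels

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b = x.shape[0]
        pooled = x.mean(dim=(2, 3))                                # (B, C_in)
        fusion = self.fusion_predictor(pooled)                     # (B, L*L)
        fusion = fusion.view(b, self.latent_channels, self.latent_channels)
        z = self.squeeze(x)                                        # (B, L, H, W)
        # Dynamically fuse the compressed channels per sample.
        z = torch.einsum("blhw,bkl->bkhw", z, fusion)
        return self.expand(z)                                      # (B, C_out, H, W)

feats = DynamicChannelFusion(64, 8, 128)(torch.randn(2, 64, 32, 32))
print(feats.shape)  # torch.Size([2, 128, 32, 32])
```

Because the fusion matrix lives in the compressed space, the per-sample dynamic computation stays small relative to applying a dynamic kernel over all input channels.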
-
Publication number: US12192543B2
Publication date: 2025-01-07
Application number: US18393664
Application date: 2023-12-21
Applicant: Microsoft Technology Licensing, LLC
Inventor: Gaurav Mittal, Ye Yu, Mei Chen, Junwen Chen
IPC: H04N21/23, G06T7/246, G06V20/40, H04N21/234
Abstract: Example solutions for video frame action detection use a gated history and include: receiving a video stream comprising a plurality of video frames; grouping the plurality of video frames into a set of present video frames and a set of historical video frames, the set of present video frames comprising a current video frame; determining a set of attention weights for the set of historical video frames, the set of attention weights indicating how informative a video frame is for predicting action in the current video frame; weighting the set of historical video frames with the set of attention weights to produce a set of weighted historical video frames; and based on at least the set of weighted historical video frames and the set of present video frames, generating an action prediction for the current video frame.
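This publication shares its abstract with US11895343B2 above. As a companion to the attention sketch there, the snippet below illustrates only the initial grouping step, splitting a frame stream into historical and present sets around the current frame; the helper name and window size are illustrative assumptions.

```python
import numpy as np

def split_present_and_history(frames, present_window):
    """Illustrative split: the last `present_window` frames (ending at the
    current frame) form the present set; everything earlier is history."""
    history, present = frames[:-present_window], frames[-present_window:]
    return history, present

# Toy stream of 32 frames, each a 16-dim feature vector; the last frame is "current".
frames = np.random.default_rng(1).normal(size=(32, 16))
history, present = split_present_and_history(frames, present_window=4)
current = present[-1]
print(history.shape, present.shape, current.shape)  # (28, 16) (4, 16) (16,)
```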
-
Publication number: US20220210098A1
Publication date: 2022-06-30
Application number: US17606857
Application date: 2020-04-02
Applicant: Microsoft Technology Licensing, LLC
Inventor: Jie Zhang, Jianyong Wang, Peng Chen, Zeyu Shang, Ye Yu
IPC: H04L51/02, G06F16/332, G06F16/335, G06F16/31
Abstract: The present disclosure provides a method and an apparatus for providing responses in an event-related session. The event is associated with a predefined domain, and the session comprises an electronic conversational agent and at least one participant. At least one message from the at least one participant may be detected. A set of candidate responses may be retrieved, from an index set that is based on the domain, according to the at least one message. The set of candidate responses may be optimized by filtering it according to predetermined criteria. A response to the at least one message may be selected from the filtered set of candidate responses. The selected response may be provided in the session.
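The retrieval and filtering steps described above can be illustrated with a deliberately simple sketch. The keyword-overlap scoring, the `Candidate` type, and the filtering thresholds are assumptions for demonstration, not the disclosed ranking method.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str
    score: float  # retrieval relevance of this response to the incoming message

def retrieve_candidates(message: str, index: list[str]) -> list[Candidate]:
    """Illustrative retrieval: score each indexed response by keyword overlap
    with the message and return candidates sorted by relevance."""
    words = set(message.lower().split())
    scored = [Candidate(t, float(len(words & set(t.lower().split())))) for t in index]
    return sorted(scored, key=lambda c: c.score, reverse=True)

def filter_candidates(cands: list[Candidate], min_score: float, max_len: int) -> list[Candidate]:
    """Illustrative predetermined criteria: drop weak matches and overly long replies."""
    return [c for c in cands if c.score >= min_score and len(c.text) <= max_len]

# Toy domain-specific index for an event-related session.
index = [
    "The keynote starts at 9 AM in Hall B.",
    "Lunch is served from noon to 1 PM.",
    "Registration opens at 8 AM near the main entrance.",
]
candidates = retrieve_candidates("What time does the keynote start?", index)
best = filter_candidates(candidates, min_score=2, max_len=80)
print(best[0].text if best else "fallback response")
```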
-
Publication number: US12101280B2
Publication date: 2024-09-24
Application number: US17606857
Application date: 2020-04-02
Applicant: Microsoft Technology Licensing, LLC
Inventor: Jie Zhang, Jianyong Wang, Peng Chen, Zeyu Shang, Ye Yu
IPC: H04L51/02, G06F16/31, G06F16/332, G06F16/335
CPC classification number: H04L51/02, G06F16/313, G06F16/3326, G06F16/3329, G06F16/335
Abstract: The present disclosure provides a method and an apparatus for providing responses in an event-related session. The event is associated with a predefined domain, and the session comprises an electronic conversational agent and at least one participant. At least one message from the at least one participant may be detected. A set of candidate responses may be retrieved, from an index set that is based on the domain, according to the at least one message. The set of candidate responses may be optimized by filtering it according to predetermined criteria. A response to the at least one message may be selected from the filtered set of candidate responses. The selected response may be provided in the session.
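This granted publication shares its abstract with the application above. To complement the retrieval/filtering sketch there, the snippet below illustrates only the final selection step, choosing one response from an already-filtered candidate set with a fallback; the ranking heuristic and fallback text are assumptions, not the disclosed selection criteria.

```python
def select_response(filtered_candidates: list[str], message: str) -> str:
    """Illustrative selection: rank already-filtered candidates by word overlap
    with the message and return the best one, or a fallback if none remain."""
    words = set(message.lower().split())
    ranked = sorted(
        filtered_candidates,
        key=lambda c: len(words & set(c.lower().split())),
        reverse=True,
    )
    return ranked[0] if ranked else "Sorry, I do not have an answer for that yet."

filtered = [
    "The keynote starts at 9 AM in Hall B.",
    "Doors open 30 minutes before each session.",
]
print(select_response(filtered, "When does the keynote begin?"))
```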
-
Publication number: US12087043B2
Publication date: 2024-09-10
Application number: US17535517
Application date: 2021-11-24
Applicant: Microsoft Technology Licensing, LLC
Inventor: Gaurav Mittal, Ye Yu, Mei Chen, Jay Sanjay Patravali
IPC: G06K9/00, G06F16/73, G06F16/75, G06N20/00, G06V10/764, G06V10/774
CPC classification number: G06V10/7753, G06F16/73, G06F16/75, G06N20/00, G06V10/764, G06V10/7747
Abstract: The disclosure herein describes preparing and using a cross-attention model for action recognition using pre-trained encoders and novel class fine-tuning. Training video data is transformed into augmented training video segments, which are used to train an appearance encoder and an action encoder. The appearance encoder is trained to encode video segments based on spatial semantics and the action encoder is trained to encode video segments based on spatio-temporal semantics. A set of hard-mined training episodes is generated using the trained encoders. The cross-attention module is then trained for action-appearance aligned classification using the hard-mined training episodes. Then, support video segments are obtained, wherein each support video segment is associated with a video class. The cross-attention module is fine-tuned using the obtained support video segments and the associated video classes. A query video segment is obtained and classified as a video class using the fine-tuned cross-attention module.
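The classification step, attending from a query video segment to the support segments and scoring classes by where the attention mass lands, can be sketched roughly as below. The feature shapes, pre-extracted embeddings, and function name are illustrative assumptions; the patented method additionally trains dedicated appearance and action encoders and a learned cross-attention module on hard-mined episodes.

```python
import torch
import torch.nn.functional as F

def cross_attention_classify(query_feat, support_feats, support_labels, num_classes):
    """Illustrative few-shot step: attend from the query segment to the support
    segments, then score classes by the attention mass on each class's supports."""
    # Scaled dot-product attention from one query to all support features.
    scores = support_feats @ query_feat / query_feat.shape[-1] ** 0.5   # (S,)
    attn = torch.softmax(scores, dim=0)
    # Aggregate attention per class to obtain class logits.
    return torch.zeros(num_classes).scatter_add_(0, support_labels, attn)

support = F.normalize(torch.randn(10, 32), dim=-1)   # 10 support segments, 5 per class
labels = torch.tensor([0] * 5 + [1] * 5)              # class label of each support segment
query = F.normalize(torch.randn(32), dim=-1)          # query segment embedding
print(cross_attention_classify(query, support, labels, num_classes=2))
```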