Abstract:
A method of tracking an object across a sequence of video frames using a natural language query includes receiving the natural language query and identifying an initial target in an initial frame of the sequence of video frames based on the natural language query. The method also includes adjusting the natural language query, for a subsequent frame, based on content of the subsequent frame and/or a likelihood of a semantic property of the initial target appearing in the subsequent frame. The method further includes identifying a text-driven target and a visual-driven target in the subsequent frame. The method still further includes combining the visual-driven target with the text-driven target to obtain a final target in the subsequent frame.
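A minimal sketch of how the final-target combination could look, assuming candidate boxes that have already been scored separately against the adjusted language query and against a visual appearance model; the scoring inputs and the fusion weight alpha are illustrative assumptions rather than the claimed method.

import numpy as np

def fuse_targets(candidate_boxes, text_scores, visual_scores, alpha=0.5):
    # candidate_boxes: (N, 4) array of [x1, y1, x2, y2] proposals in the subsequent frame
    # text_scores / visual_scores: (N,) match scores from the adjusted language
    # query and from the visual appearance model, respectively (assumed inputs)
    text_scores = text_scores / (text_scores.sum() + 1e-8)        # normalize each cue
    visual_scores = visual_scores / (visual_scores.sum() + 1e-8)
    combined = alpha * text_scores + (1.0 - alpha) * visual_scores
    return candidate_boxes[int(np.argmax(combined))]               # final target box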
Abstract:
A method of processing data within a convolutional attention recurrent neural network (RNN) includes generating a current multi-dimensional attention map. The current multi-dimensional attention map indicates areas of interest in a first frame from a sequence of spatio-temporal data. The method further includes receiving a multi-dimensional feature map. The method also includes convolving the current multi-dimensional attention map and the multi-dimensional feature map to obtain a multi-dimensional hidden state and a next multi-dimensional attention map. The method additionally includes identifying a class of interest in the first frame based on the multi-dimensional hidden state and training data.
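A ConvLSTM-style sketch of such a cell is given below, assuming a single gating convolution and a 1x1 convolution that produces the next attention map; the kernel size, channel counts, and gating structure are illustrative assumptions, not details taken from the abstract.

import torch
import torch.nn as nn

class ConvAttentionCell(nn.Module):
    def __init__(self, feat_ch, hidden_ch, k=3):
        super().__init__()
        # attention-weighted features plus previous hidden state drive the gates
        self.gates = nn.Conv2d(feat_ch + hidden_ch, 4 * hidden_ch, k, padding=k // 2)
        self.to_attn = nn.Conv2d(hidden_ch, 1, 1)        # produces the next attention map

    def forward(self, feat_map, attn_map, h, c):
        x = feat_map * attn_map                           # apply current attention map
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)              # multi-dimensional hidden state
        next_attn = torch.softmax(self.to_attn(h).flatten(2), dim=-1).view_as(attn_map)
        return h, c, next_attn

The hidden state h would then feed a classifier trained to identify the class of interest in the frame.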
Abstract:
A method of predicting action labels for a video stream includes receiving the video stream and calculating an optical flow of consecutive frames of the video stream. The method also includes generating an attention map from a current frame of the video stream and the calculated optical flow. The method further includes predicting an action label for the current frame based on the optical flow, a previous hidden state, and the attention map.
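The per-frame prediction step might be sketched as follows, assuming hypothetical appearance and motion encoders, attention-weighted pooling of the flow features, and a GRU cell that carries the hidden state; none of these specifics come from the abstract itself.

import torch
import torch.nn as nn

class ActionPredictor(nn.Module):
    def __init__(self, feat_ch=64, hidden=128, num_classes=10):
        super().__init__()
        self.appearance = nn.Conv2d(3, feat_ch, 3, padding=1)    # encodes the current frame
        self.motion = nn.Conv2d(2, feat_ch, 3, padding=1)        # encodes the optical flow (dx, dy)
        self.attn = nn.Conv2d(2 * feat_ch, 1, 1)                 # attention map from both cues
        self.rnn = nn.GRUCell(feat_ch, hidden)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, frame, flow, prev_hidden):
        a, m = self.appearance(frame), self.motion(flow)
        attn = torch.softmax(self.attn(torch.cat([a, m], 1)).flatten(2), -1)
        pooled = (m.flatten(2) * attn).sum(-1)                   # attention-weighted motion features
        hidden = self.rnn(pooled, prev_hidden)                   # update from previous hidden state
        return self.classifier(hidden), hidden                   # action logits, new hidden state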
Abstract:
A method generates bounding-boxes within frames of a sequence of frames. The bounding-boxes may be generated via a recurrent neural network (RNN) such as a long short-term memory (LSTM) network. The method includes receiving the sequence of frames and generating an attention feature map for each frame of the sequence of frames. Each attention feature map indicates at least one potential moving object. The method also includes up-sampling each attention feature map to determine an attention saliency for pixels in each frame of the sequence of frames. The method further includes generating a bounding-box within each frame based on the attention saliency and temporally smoothing multiple bounding-boxes along the sequence of frames to obtain a smooth sequence of bounding-boxes. The method still further includes localizing an action location within each frame based on the smooth sequence of bounding-boxes.
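A sketch of the post-processing stage described above, assuming bilinear up-sampling of each attention feature map to pixel resolution, a relative threshold to turn the saliency into a box, and a moving-average filter for temporal smoothing; the threshold and window size are illustrative assumptions.

import torch
import torch.nn.functional as F

def boxes_from_attention(attn_maps, frame_hw, thresh=0.5, window=5):
    # attn_maps: (T, 1, h, w) attention feature maps, one per frame in the sequence
    H, W = frame_hw
    saliency = F.interpolate(attn_maps, size=(H, W), mode='bilinear',
                             align_corners=False)                 # per-pixel attention saliency
    boxes = []
    for sal in saliency[:, 0]:                                    # one frame at a time
        ys, xs = torch.nonzero(sal > thresh * sal.max(), as_tuple=True)
        boxes.append(torch.stack([xs.min(), ys.min(), xs.max(), ys.max()]).float())
    boxes = torch.stack(boxes)                                    # (T, 4) raw bounding-boxes
    kernel = torch.ones(1, 1, window) / window                    # moving-average filter
    smoothed = F.conv1d(boxes.t().unsqueeze(1), kernel, padding=window // 2)
    return smoothed.squeeze(1).t()                                # smooth (T, 4) bounding-boxes

The smoothed box for each frame then serves as the localized action region within that frame.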