Abstract:
A computing device includes a communication interface, a memory, and processing circuitry. The processing circuitry is coupled to the communication interface and to the memory and is configured to execute operational instructions stored in the memory to perform various functions. The computing device is configured to process a video frame of a video segment on a per-frame basis, based on joint human-object interactive activity (HOIA), to generate a per-frame pairwise human-object interactive (HOI) feature based on a plurality of candidate HOI pairs. The computing device is also configured to process the per-frame pairwise HOI feature to identify a valid HOI pair among the plurality of candidate HOI pairs and to track the valid HOI pair through subsequent frames of the video segment, generating a contextual spatial-temporal feature for the valid HOI pair to be used in activity detection.
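The abstract stays at the claim level, but the per-frame pairing and tracking flow it describes can be outlined in code. The following Python sketch is illustrative only: the `Detection` container, the IoU spatial cue, and the `score_fn` validity classifier are assumptions for the sketch, not the patented implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Detection:
    label: str                               # "human" or an object class
    box: Tuple[float, float, float, float]   # x1, y1, x2, y2
    feature: List[float]                     # appearance feature

def iou(a, b):
    """Intersection over union of two boxes, used as a crude spatial cue."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def candidate_hoi_pairs(detections):
    """Every (human, object) pairing in one frame is a candidate HOI pair."""
    humans = [d for d in detections if d.label == "human"]
    objects = [d for d in detections if d.label != "human"]
    return [(h, o) for h in humans for o in objects]

def pairwise_hoi_feature(h, o):
    """Per-frame pairwise HOI feature: both appearances plus a spatial relation."""
    return h.feature + o.feature + [iou(h.box, o.box)]

def spatial_temporal_feature(frames, score_fn, threshold=0.5):
    """Keep candidate pairs that score_fn marks valid and stack their
    per-frame features over time into a contextual spatial-temporal feature."""
    track = []
    for detections in frames:
        for h, o in candidate_hoi_pairs(detections):
            f = pairwise_hoi_feature(h, o)
            if score_fn(f) > threshold:
                track.append(f)
    return track
```

Here `score_fn` stands in for whatever learned classifier separates valid HOI pairs from spurious ones; a real tracker would also link each valid pair to the same pair in subsequent frames rather than re-scoring every candidate from scratch.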
Abstract:
In an embodiment, the disclosure includes an object detecting device. The object detecting device is configured to execute instructions to: obtain a first picture comprising a first object at a first time instant; determine a first feature pattern of the first object based on the first picture; generate a first feature map of the first object based on the first feature pattern; generate a first feature vector of the first object based on the first feature map; and send the first feature vector to a server. In this embodiment, the object detecting device generates the first feature vector from the first feature map rather than starting another process to regenerate the first feature map directly from the first picture. Reusing the feature map in this way may improve the speed and reduce the computing resource cost of generating the feature vector.
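The point of the embodiment is reuse: the feature vector is derived from an already-computed feature map instead of reprocessing the raw picture. A minimal numpy sketch of that pipeline follows; the pattern extractor, normalization, and pooling steps are stand-ins chosen for illustration, not the disclosed operations.

```python
import numpy as np

class ObjectDetector:
    """Sketch of the picture -> pattern -> map -> vector pipeline."""

    def feature_pattern(self, picture: np.ndarray) -> np.ndarray:
        # stand-in for a learned pattern extractor (here, vertical gradients)
        return np.abs(np.diff(picture.astype(float), axis=0))

    def feature_map(self, pattern: np.ndarray) -> np.ndarray:
        # stand-in for feature-map generation (here, simple normalization)
        return pattern / (pattern.max() + 1e-8)

    def feature_vector(self, fmap: np.ndarray) -> np.ndarray:
        # pooling the cached map is cheap compared with a second
        # pass over the raw picture
        return fmap.mean(axis=0)

picture = np.random.rand(64, 64)
detector = ObjectDetector()
fmap = detector.feature_map(detector.feature_pattern(picture))
vector = detector.feature_vector(fmap)   # reuses fmap; no second pass
# `vector` is what would then be sent to the server
```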
Abstract:
An apparatus is configured to perform a method of parallax-tolerant video stitching. The method includes determining a plurality of video sequences to be stitched together; performing a spatial-temporal localized warping computation process on the video sequences to determine a plurality of target warping maps; warping a plurality of frames among the video sequences into a plurality of target virtual frames using the target warping maps; performing a spatial-temporal content-based seam finding process on the target virtual frames to determine a plurality of target seam maps; and stitching the video sequences together using the target seam maps.
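The four stages (warping maps, virtual frames, seam maps, stitching) can be laid out as a toy pipeline. In the sketch below the warping computation is replaced by an identity warp and the seam finder by a greedy per-row cut; both are placeholders for the spatial-temporal optimizations the abstract names, not the patented algorithms.

```python
import numpy as np

def spatial_temporal_warp_maps(sequences):
    """Placeholder for the spatial-temporal localized warping computation;
    here every frame gets an identity (zero-displacement) warp map."""
    return [[np.zeros(frame.shape + (2,)) for frame in seq] for seq in sequences]

def warp_frames(sequences, warp_maps):
    """Apply the target warping maps (identity here) to get target virtual frames."""
    return [[frame.copy() for frame in seq] for seq in sequences]

def seam_map(a, b):
    """Greedy content-based seam: for each row, cut where the two
    overlapping virtual frames disagree the least."""
    return np.abs(a.astype(float) - b.astype(float)).argmin(axis=1)

def stitch_pair(a, b, seam):
    """Compose two frames: pixels from `a` left of the seam, `b` right of it."""
    out = a.copy()
    for row, col in enumerate(seam):
        out[row, col:] = b[row, col:]
    return out

# two toy single-frame "sequences" of the same size
seq_a, seq_b = [np.random.rand(4, 6)], [np.random.rand(4, 6)]
maps = warp_frames([seq_a, seq_b], spatial_temporal_warp_maps([seq_a, seq_b]))
va, vb = maps
print(stitch_pair(va[0], vb[0], seam_map(va[0], vb[0])))
```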
Abstract:
The disclosure relates to technology for generating a compressed neural network. A weight tensor is received from a neural network to be compressed and is reordered to have an inner two-dimensional (2D) shape and a 2D sparse bitmap. A layered structure representing the reordered weight tensor is generated, and the reordered weight tensor is divided into groups of coefficients (GOCs). An encoding mode is selected to generate a quantized reordered weight tensor using either a codebook or direct quantization, and a column-swapped quantized reordered weight tensor is generated. A compressed neural network is formed by encoding, and the compressed representation of the neural network is transmitted to a target system for decompression.
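A few of these steps are concrete enough to sketch: reordering to an inner 2D shape, deriving the sparse bitmap, and direct quantization. The numpy sketch below omits the layered structure, GOC partitioning, codebook mode, and column swapping, and the step size is an arbitrary assumption.

```python
import numpy as np

def reorder_to_2d(weights):
    """Reorder an N-D weight tensor to an inner 2D shape (out_channels, rest)."""
    return weights.reshape(weights.shape[0], -1)

def sparse_bitmap(w2d, tol=1e-6):
    """2D bitmap marking which reordered coefficients are non-zero."""
    return (np.abs(w2d) > tol).astype(np.uint8)

def quantize_direct(w2d, step=0.05):
    """Direct (uniform) quantization; the codebook mode is omitted here."""
    return np.round(w2d / step).astype(np.int8), step

weights = np.random.randn(8, 3, 3, 3) * 0.1   # e.g. a small conv layer
w2d = reorder_to_2d(weights)
bitmap = sparse_bitmap(w2d)
q, step = quantize_direct(w2d)
# the decompression side would reconstruct approximately:
#   w_hat = q.astype(float) * step * bitmap
```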
Abstract:
A computer-implemented method verifies an image-based authentication via one or more processors performing operations including receiving raw image data corresponding to a face identified by a facial recognition system, processing the raw image data via a deep neural network, trained on data that includes images of both verified and fake faces, to perform a temporal facial analysis, and generating a verification signal in response to the temporal facial analysis to indicate whether the raw image data is fake.
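One plausible shape for such a temporal analysis is a recurrent head over per-frame face embeddings. The PyTorch sketch below is an assumption for illustration: the GRU architecture, feature dimension, and decision threshold are not stated in the abstract.

```python
import torch
import torch.nn as nn

class TemporalFaceVerifier(nn.Module):
    """Sketch: a recurrent head over per-frame face embeddings that
    emits a verification signal (probability the face is genuine)."""

    def __init__(self, feat_dim=128, hidden=64):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, frame_feats):              # (batch, time, feat_dim)
        _, h = self.gru(frame_feats)             # final hidden state
        return torch.sigmoid(self.head(h[-1]))   # in (0, 1)

verifier = TemporalFaceVerifier()
clip = torch.randn(1, 16, 128)                   # 16 frames of embeddings
signal = verifier(clip)
print("verified" if signal.item() > 0.5 else "fake")
```

Training such a head on clips of both verified and fake faces is what lets it pick up temporal cues (blinks, micro-motion) that a single-frame classifier would miss.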
Abstract:
A system and method of tracking an object and navigating an object tracking robot includes receiving tracking sensor input representing the object and an environment at multiple times and, responsive to the tracking sensor input, calculating positions of the robot and the object at the multiple times. A computer-implemented deep reinforcement learning (DRL) network, trained as a function of tracking quality rewards and robot navigation path quality rewards, responds to the calculated positions of the robot and the object at the multiple times to determine possible actions specifying movement of the object tracking robot from a current position of the robot and target, determine quality values (Q-values) for the possible actions, and select an action as a function of the Q-values. A method of training the DRL network is also included.
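The action-selection loop can be sketched compactly. Below, `q_net` is an assumed callable standing in for the trained DRL network, the action set is invented for illustration, and epsilon-greedy selection is a common choice rather than the disclosed policy; the combined reward mirrors the abstract's two reward terms.

```python
import random

ACTIONS = ["forward", "back", "left", "right", "stay"]

def q_values(state, q_net):
    """Score every candidate movement of the robot with the DRL network."""
    return {a: q_net(state, a) for a in ACTIONS}

def select_action(state, q_net, epsilon=0.1):
    """Epsilon-greedy selection over Q-values (a common choice, assumed here)."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    qs = q_values(state, q_net)
    return max(qs, key=qs.get)

def reward(tracking_quality, path_quality, w=0.5):
    """Training signal combining tracking quality and navigation path quality."""
    return w * tracking_quality + (1.0 - w) * path_quality

# toy usage with a stub Q-network; a real q_net would consume the
# calculated robot/object positions at the multiple times
fake_q_net = lambda state, action: (hash((state, action)) % 100) / 100.0
print(select_action("positions-at-t0..tN", fake_q_net))
```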
Abstract:
A computer-implemented method for a three-dimensional (3D) reconstruction of a dynamic scene includes receiving a plurality of color image sequences from a plurality of color imaging sensors, and at least one depth image sequence from at least one depth imaging sensor, where a color imaging sensor quantity is larger than a depth imaging sensor quantity. A plurality of calibrated color image sequences and at least one calibrated depth image sequence are generated based on the plurality of color image sequences and the at least one depth image sequence. A plurality of initial 3D patches is constructed using the plurality of calibrated color image sequences and the at least one calibrated depth image sequence. A 3D patch cloud is generated by expanding the plurality of initial 3D patches.
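The depth-seeded part of the pipeline can be sketched with standard pinhole geometry. The numpy sketch below backprojects a calibrated depth image to 3D, seeds initial patches on a sparse grid, and expands them by random jitter; the photometric verification of new patches against the calibrated color sequences, which a real system would perform, is only noted in a comment, and all parameter values are invented.

```python
import numpy as np

def backproject(depth, fx, fy, cx, cy):
    """Lift a calibrated depth image to 3D points in the camera frame."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)

def initial_patches(points, stride=8):
    """Seed one small patch per sparse grid cell of the depth image."""
    return points[::stride, ::stride].reshape(-1, 3)

def expand_patch_cloud(seeds, radius=0.05, per_seed=4):
    """Grow the patch cloud around each seed; a real system would verify
    each new patch photometrically against the calibrated color sequences."""
    offsets = np.random.uniform(-radius, radius, size=(len(seeds), per_seed, 3))
    return (seeds[:, None, :] + offsets).reshape(-1, 3)

depth = np.random.uniform(0.5, 3.0, size=(48, 64))
points = backproject(depth, fx=60.0, fy=60.0, cx=32.0, cy=24.0)
cloud = expand_patch_cloud(initial_patches(points))
print(cloud.shape)   # (num_seeds * per_seed, 3)
```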
Abstract:
Methods and apparatus are described that enable augmented or virtual reality based on a light field. A geometric proxy of a mobile device such as a smart phone is used during the process of inserting a virtual object from the light field into the real world images being acquired. For example, a mobile device includes a processor and a camera coupled to the processor. The processor is configured to define a view-dependent geometric proxy, record images with the camera to produce recorded frames and, based on the view-dependent geometric proxy, render the recorded frames with an inserted light field virtual object.
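The role of the view-dependent geometric proxy can be illustrated with a toy renderer: the proxy fixes where the virtual object sits relative to the device, and the current view direction selects which stored light-field view to composite. Everything in this sketch (the plane proxy, the dictionary layout of the light field, nearest-view sampling, mask compositing) is a hypothetical simplification, not the disclosed rendering method.

```python
import numpy as np

def view_dependent_proxy(device_pose, base_depth=1.0):
    """A hypothetical proxy: a plane placed base_depth metres along the
    device's current viewing direction (pose is a 4x4 camera-to-world matrix)."""
    forward = device_pose[:3, 2]
    center = device_pose[:3, 3] + base_depth * forward
    return center, forward            # plane point and normal

def sample_light_field(light_field, view_dir):
    """Pick the stored light-field view closest to the current view direction."""
    dots = [float(np.dot(v["dir"], view_dir)) for v in light_field]
    return light_field[int(np.argmax(dots))]["image"]

def render(frame, light_field, device_pose, mask):
    """Composite the view-dependent light-field object into a recorded frame."""
    _, view_dir = view_dependent_proxy(device_pose)
    obj = sample_light_field(light_field, view_dir)
    out = frame.copy()
    out[mask] = obj[mask]             # insert the virtual object where masked
    return out

pose = np.eye(4)                      # device at origin, looking along +z
lf = [{"dir": np.array([0.0, 0.0, 1.0]), "image": np.ones((4, 4))}]
frame, mask = np.zeros((4, 4)), np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
print(render(frame, lf, pose, mask))
```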
Abstract:
A method for fine-grained object recognition in a robotic system is disclosed that includes obtaining an image of an object from an imaging device. Based on the image, a deep category-level detection neural network is used to detect pre-defined categories of objects. A feature map is generated for each pre-defined category of object detected by the deep category-level detection neural network. Embedded features are generated, based on the feature map, using a deep instance-level detection neural network corresponding to the pre-defined category of the object, wherein each pre-defined category of object has its own corresponding instance-level detection neural network. An instance level of the object is determined based on classification of the embedded features.
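The two-stage structure (one shared category-level detector, one instance-level embedder per category) can be sketched as follows. The stub networks, the gallery of known instances, and the cosine-similarity matching used as the final classification step are all assumptions for the sketch; the abstract does not specify how the embedded features are classified.

```python
import numpy as np

def detect_categories(image, category_net):
    """Stage 1: category-level detection; returns (category, feature_map) pairs."""
    return category_net(image)

def instance_of(category, feature_map, instance_nets, gallery):
    """Stage 2: the category-specific embedder maps the feature map to
    embedded features, matched against known instances of that category."""
    emb = instance_nets[category](feature_map)
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    names, feats = zip(*gallery[category].items())
    return names[int(np.argmax([cos(emb, f) for f in feats]))]

# toy usage with stub networks and a two-instance gallery
category_net = lambda img: [("cup", np.asarray(img).mean(axis=0))]
instance_nets = {"cup": lambda fmap: fmap[:3]}
gallery = {"cup": {"alice_mug": np.array([1.0, 0.0, 0.0]),
                   "bob_mug":   np.array([0.0, 1.0, 0.0])}}
image = np.random.rand(8, 3)
cat, fmap = detect_categories(image, category_net)[0]
print(instance_of(cat, fmap, instance_nets, gallery))
```

Keeping a separate instance-level network per category is what makes the recognition fine-grained: each embedder only has to separate instances within one category, not across all of them.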