摘要:
A semantic object tracking method tracks general semantic objects with multiple non-rigid motion, disconnected components and multiple colors throughout a vector image sequence. The method accurately tracks these general semantic objects by spatially segmenting image regions from a current frame and then classifying these regions as to which semantic object they originated from in the previous frame. To classify each region, the method perform a region based motion estimation between each spatially segmented region and the previous frame to computed the position of a predicted region in the previous frame. The method then classifies each region in the current frame as being part of a semantic object based on which semantic object in the previous frame contains the most overlapping points of the predicted region. Using this method, each region in the current image is tracked to one semantic object from the previous frame, with no gaps or overlaps. The method propagates few or no errors because it projects regions into a frame where the semantic object boundaries are previously computed rather than trying to project and adjust a boundary in a frame where the object's boundary is unknown.
摘要:
A semantic object tracking method tracks general semantic objects with multiple non-rigid motion, disconnected components and multiple colors throughout a vector image sequence. The method accurately tracks these general semantic objects by spatially segmenting image regions from a current frame and then classifying these regions as to which semantic object they originated from in the previous frame. To classify each region, the method performs a region based motion estimation between each spatially segmented region and the previous frame to compute the position of a predicted region in the previous frame. The method then classifies each region in the current frame as being part of a semantic object based on which semantic object in the previous frame contains the most overlapping points of the predicted region. Using this method, each region in the current image is tracked to one semantic object from the previous frame, with no gaps or overlaps. The method propagates few or no errors because it projects regions into a frame where the semantic object boundaries are previously computed rather than trying to project and adjust a boundary in a frame where the object's boundary is unknown.
摘要:
A semantic video object extraction system using mathematical morphology and perspective motion modeling. A user indicates a rough outline around an image feature of interest for a first frame in a video sequence. Without further user assistance, the rough outline is processed by a morphological segmentation tool to snap the rough outline into a precise boundary surrounding the image feature. Motion modeling is performed on the image feature to track its movement into a subsequent video frame. The motion model is applied to the precise boundary to warp the precise outline into a new rough outline for the image feature in the subsequent video frame. This new rough outline is then snapped to locate a new precise boundary. Automatic processing is repeated for subsequent video frames.
摘要:
A semantic object tracking method tracks general semantic objects with multiple non-rigid motion, disconnected components and multiple colors throughout a vector image sequence. The method accurately tracks these general semantic objects by spatially segmenting image regions from a current frame and then classifying these regions as to which semantic object they originated from in the previous frame. To classify each region, the method perform a region based motion estimation between each spatially segmented region and the previous frame to computed the position of a predicted region in the previous frame. The method then classifies each region in the current frame as being part of a semantic object based on which semantic object in the previous frame contains the most overlapping points of the predicted region. Using this method, each region in the current image is tracked to one semantic object from the previous frame, with no gaps or overlaps. The method propagates few or no errors because it projects regions into a frame where the semantic object boundaries are previously computed rather than trying to project and adjust a boundary in a frame where the object's boundary is unknown.
摘要:
A semantic object tracking method tracks general semantic objects with multiple non-rigid motion, disconnected components and multiple colors throughout a vector image sequence. The method accurately tracks these general semantic objects by spatially segmenting image regions from a current frame and then classifying these regions as to which semantic object they originated from in the previous frame. To classify each region, the method performs a region based motion estimation between each spatially segmented region and the previous frame to compute the position of a predicted region in the previous frame. The method then classifies each region in the current frame as being part of a semantic object based on which semantic object in the previous frame contains the most overlapping points of the predicted region. Using this method, each region in the current image is tracked to one semantic object from the previous frame, with no gaps or overlaps. The method propagates few or no errors because it projects regions into a frame where the semantic object boundaries are previously computed rather than trying to project and adjust a boundary in a frame where the object's boundary is unknown.
摘要:
A human speech detection method detects pure-speech signals in an audio signal containing a mixture of pure-speech and non-speech or mixed-speech signals. The method accurately detects the pure-speech signals by computing a novel Valley Percentage feature from the audio signal and then classifying the audio signals into pure-speech and non-speech (or mixed-speech) classifications. The Valley Percentage is a measurement of the low energy parts of the audio signal (the valley) in comparison to the high energy parts of the audio signal (the mountain). To classify the audio signal, the method performs a threshold decision on the value of the Valley Percentage. Using a binary mask, a high Valley Percentage is classified as pure-speech and a low Valley Percentage is classified as non-speech (or mixed-speech). The method further employs morphological filters to improve the accuracy of human speech detection. Before detection, a morphological closing filter may be employed to eliminate unwanted noise from the audio signal. After detection, a combination of morphological closing and opening filters may be employed to remove aberrant pure-speech and non-speech classifications from the binary mask resulting from impulsive audio signals in order to more accurately detect the boundaries between the pure-speech and non-speech portions of the audio signal. A number of parameters may be employed by the method to further improve the accuracy of human speech detection. For implementation in supervised digital audio signal applications, these parameters may be optimized by training the application a priori. For implementation in an unsupervised environment, adaptive determination of these parameters is also possible.
摘要:
A sprite generation method used in video coding generates a sprite from the video objects in the frames of a video sequence. The method estimates the motion between a video object in a current frame and a sprite constructed from video objects for previous frames. Specifically, the method computes motion coefficients of a 2D transform that minimizes the intensity errors between pixels in the video object and corresponding pixels inside the sprite. The method uses the motion coefficients from the previous frame as a starting point to minimizing the intensity errors. After estimating the motion parameters for an object in the current frame, the method transforms the video object to the coordinate system of the sprite. The method blends the warped pixels of the video object with the pixels at corresponding positions in the sprite using rounding average such that each video object in the video sequence provides substantially the same contribution to the sprite.
摘要:
A video compression error signal in a video compression scheme is affected by random and high frequency impulse noise. An error signal suppressor containing two filters is applied to the video compression error signal. The first filter reduces or eliminates random noise. The second filter eliminates high frequency impulse noise. Random and high frequency noise is reduced or eliminated from frequencies that are unimportant to human visual perception. The error signal suppressor reduces the overall video compression bitrate by up to between 10% and 20% which provides corresponding increases in video compression and transmission efficiency. The error signal suppressor is used in video compression encoding schemes such as MPEG to reduce random and high frequency noise.
摘要:
Various embodiments related to the generation and provision of media metadata are disclosed. For example, one disclosed embodiment provides a computing device having a logic subsystem configured to execute instructions, and a data holding subsystem comprising instructions stored thereon that are executable by the processor to receive an input of a video and/or audio content item, and to compare the content item to one or more object descriptors each representing an object for locating within the content item to locate instances of one or more of the objects in the content item. The instructions are further executable to generate metadata for each object located in the video content item, and to receive a validating user input related to whether the metadata generated for a selected object is correct.
摘要:
Image processing in mobile devices is optimized by combining at least two of the color conversion, rotation, and scaling operations. Received images, such as still images or frames of video stream, are subjected to a combined transformation after decoding, where each pixel is color converted (e.g. from YUV to RGB), rotated, and scaled as needed. By combining two or three of the processes into one, read/write operations consuming significant processing and memory resources are reduced enabling processing of higher resolution images and/or power and processing resource savings.