Abstract:
Frames containing audio data may be received, the audio data having been derived from a microphone array, at least some of the frames containing residual acoustic echo after having acoustic echo partially removed therefrom. Probability distribution functions are determined from the frames of audio data. A probability distribution function comprises likelihoods that respective directions are directions of sources of sounds. An active speaker may be identified in frames of video data based on the video data and based on audio information derived from the audio data, where use of the audio information as a basis for identifying the active speaker is controlled by determining whether the probability distribution functions indicate that corresponding audio data includes residual acoustic echo.
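The gating step can be illustrated with a minimal sketch. The abstract does not say how a probability distribution function reveals residual echo, so the test below is a hypothetical heuristic: a frame is treated as echo-contaminated when most of its probability mass lies near an assumed, known loudspeaker direction. The function names, the `speaker_bin` parameter, and the thresholds are all illustrative assumptions.

```python
def normalize(likelihoods):
    """Scale non-negative per-direction likelihoods so they sum to 1."""
    total = sum(likelihoods)
    if total <= 0:
        raise ValueError("likelihoods must have a positive sum")
    return [v / total for v in likelihoods]


def looks_like_residual_echo(pdf, speaker_bin, width=1, mass_threshold=0.6):
    """Hypothetical test: is most probability mass near the loudspeaker direction?

    pdf holds one probability per direction bin around the array (circular).
    speaker_bin is the assumed bin index of the loudspeaker.
    """
    n = len(pdf)
    mass = sum(pdf[(speaker_bin + k) % n] for k in range(-width, width + 1))
    return mass >= mass_threshold


def audio_weight(pdf, speaker_bin):
    """Weight given to audio cues when fusing with video cues (0 = ignore audio)."""
    return 0.0 if looks_like_residual_echo(pdf, speaker_bin) else 1.0
```

In a fuller system the weight would scale the audio term of whatever audio-visual score selects the active speaker, so that frames dominated by echo fall back to video cues alone.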
Abstract:
Systems and methods for detecting people or speakers in an automated fashion are disclosed. A pool of features including more than one type of input (such as audio input and video input) may be identified and used with a learning algorithm to generate a classifier that identifies people or speakers. The resulting classifier may be evaluated to detect people or speakers.
Abstract:
An object activity modeling method which can efficiently model complex objects such as a human body is provided. The object activity modeling method includes the steps of (a) obtaining an optical flow vector from a video sequence; (b) obtaining the probability distribution of the feature vector for a plurality of video frames, using the optical flow vector; (c) modeling states, using the probability distribution of the feature vector; and (d) expressing the activity of the object in the video sequence based on state transition. According to the modeling method, in the field of video indexing and recognition, complex activities such as human activities can be efficiently modeled and recognized without segmenting objects.
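Steps (b) through (d) can be sketched in simplified form. The sketch assumes that the optical flow vectors of step (a) are already available as `(dx, dy)` pairs per frame, uses the mean flow vector as the feature, models each state as a diagonal Gaussian, and labels every frame independently rather than with a full state-transition model. All names are hypothetical, and none of these choices is confirmed by the abstract.

```python
import math


def frame_feature(flow_vectors):
    """Step (b), simplified: mean optical-flow vector of one frame."""
    n = len(flow_vectors)
    return (sum(dx for dx, _ in flow_vectors) / n,
            sum(dy for _, dy in flow_vectors) / n)


def fit_state(features, min_var=1e-3):
    """Step (c): diagonal Gaussian (mean, variance) fitted to example features."""
    n = len(features)
    mean = tuple(sum(f[i] for f in features) / n for i in range(2))
    var = tuple(max(sum((f[i] - mean[i]) ** 2 for f in features) / n, min_var)
                for i in range(2))
    return mean, var


def log_likelihood(feature, state):
    """Log density of a feature under a diagonal Gaussian state."""
    mean, var = state
    return sum(-0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
               for x, m, v in zip(feature, mean, var))


def state_sequence(features, states):
    """Step (d): most likely state per frame (no temporal smoothing)."""
    return [max(states, key=lambda name: log_likelihood(f, states[name]))
            for f in features]


def transition_counts(sequence):
    """Count transitions between consecutive states in the sequence."""
    counts = {}
    for a, b in zip(sequence, sequence[1:]):
        counts[(a, b)] = counts.get((a, b), 0) + 1
    return counts
```

The transition counts give a compact description of the activity. No object segmentation is needed at any point, because the features come from the flow field of the whole frame.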
Abstract:
This disclosure describes techniques of automatically identifying a direction of a speech source relative to an array of directional microphones using audio streams from some or all of the directional microphones. Whether the direction of the speech source is identified using audio streams from some of the directional microphones or from all of the directional microphones depends on whether using audio streams from a subgroup of the directional microphones or using audio streams from all of the directional microphones is more likely to correctly identify the direction of the speech source. Switching between using audio streams from some of the directional microphones and using audio streams from all of the directional microphones may occur automatically to best identify the direction of the speech source. A display screen at a remote venue may then display images having angles of view that are centered generally in the direction of the speech source.
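The switching decision can be illustrated with a minimal sketch. The abstract does not state how the system judges which estimate is more likely to be correct, so the confidence measure here (the share of total likelihood held by the peak direction) is a hypothetical stand-in, and the function names are illustrative.

```python
def peak_confidence(likelihoods):
    """Hypothetical confidence score: share of total likelihood held by the peak."""
    total = sum(likelihoods)
    return max(likelihoods) / total if total > 0 else 0.0


def pick_direction(subgroup_likelihoods, all_mics_likelihoods):
    """Return (source, direction_index) taken from the more confident estimate.

    Each argument holds one likelihood per candidate direction, computed
    elsewhere from a subgroup of the microphones or from all of them.
    """
    candidates = {"subgroup": subgroup_likelihoods, "all": all_mics_likelihoods}
    source = max(candidates, key=lambda k: peak_confidence(candidates[k]))
    scores = candidates[source]
    return source, scores.index(max(scores))
```

Calling `pick_direction` once per audio frame would let the choice between the subgroup and the full array change automatically as conditions change.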
Abstract:
A digital video processing method and apparatus are provided. The method, which processes digital images received in the form of compressed video streams, comprises the step of determining a region intensity histogram (RIH) based on motion compensation information of inter frames. The RIH information is obtained from the motion compensation values of inter frames and is a good indicator of the motion in a video scene. Also, since the RIH information is a good indicator of the intensity of the video scene, video streams having similar intensities can be searched effectively by searching for similar video scenes based on the RIH information obtained by the digital video processing method.
Abstract:
A method describes activity in a video sequence. The method measures intensity, direction, spatial, and temporal attributes in the video sequence, and the measured attributes are combined in a digital descriptor of the activity of the video sequence.