Abstract:
The invention relates to the coding of audio signals that may include both speech-like and non-speech-like signal components. It describes methods and apparatus for code excited linear prediction (CELP) audio encoding and decoding that employ linear predictive coding (LPC) synthesis filters controlled by LPC parameters; a plurality of codebooks, each having codevectors, with at least one codebook providing an excitation better suited to non-speech-like signals and at least one providing an excitation better suited to speech-like signals; and a plurality of gain factors, each associated with a codebook. The encoding methods and apparatus select codevectors and/or associated gain factors from the codebooks by minimizing a measure of the difference between the audio signal and a reconstruction of the audio signal derived from the codebook excitations. The decoding methods and apparatus generate a reconstructed output signal from the LPC parameters, codevectors, and gain factors.
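A minimal sketch of the codebook search the abstract describes, assuming a least-squares error measure: for each candidate codevector, the optimal gain has a closed form, and the codevector/gain pair with the smallest residual error is selected. The function name, the toy codebook, and the omission of the LPC synthesis filtering step are all illustrative assumptions, not details from the patent.

```python
# Hypothetical CELP-style codebook search: pick the codevector and gain
# minimizing the squared error between the target and the scaled codevector.
# (In a real CELP coder the codevectors would first pass through the LPC
# synthesis filter; that step is omitted here for brevity.)

def best_codevector(target, codebook):
    """Return (index, gain, error) minimizing sum((target - g*cv)^2).

    For a fixed codevector cv, the least-squares optimal gain is
    g = <target, cv> / <cv, cv>.
    """
    best = None
    for i, cv in enumerate(codebook):
        energy = sum(c * c for c in cv)
        if energy == 0.0:
            continue  # skip all-zero codevectors
        gain = sum(t * c for t, c in zip(target, cv)) / energy
        err = sum((t - gain * c) ** 2 for t, c in zip(target, cv))
        if best is None or err < best[2]:
            best = (i, gain, err)
    return best

target = [1.0, 2.0, 3.0, 4.0]
codebook = [[1.0, 1.0, 1.0, 1.0],   # flat, noise-like vector
            [0.5, 1.0, 1.5, 2.0]]   # shape matching the target
idx, gain, err = best_codevector(target, codebook)
# the second codevector matches the target exactly at gain 2.0
```

In a multi-codebook coder as described, this search would run per codebook, and the selected excitations would be summed before synthesis.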
Abstract:
Metadata comprising a set of gain values for creating a dominance effect is automatically generated. Automatically generating the metadata includes receiving multiple audio streams and a dominance criterion for at least one of the audio streams. A set of gains is computed for one or more of the audio streams based on the dominance criterion for the at least one audio stream, and metadata is generated with the set of gains.
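A sketch of one way such gain metadata could be computed, assuming the dominance criterion is expressed as a level margin in dB by which the dominant stream should exceed the others. The function name and the dB-margin formulation are assumptions for illustration, not from the abstract.

```python
# Illustrative gain-metadata generation: the dominant stream keeps unity
# gain and every other stream is attenuated by the requested margin.

def dominance_gains(num_streams, dominant_index, margin_db):
    """Return one linear gain per stream creating a dominance effect."""
    attenuation = 10.0 ** (-margin_db / 20.0)  # dB margin -> linear factor
    return [1.0 if i == dominant_index else attenuation
            for i in range(num_streams)]

gains = dominance_gains(3, dominant_index=0, margin_db=6.0)
# stream 0 keeps gain 1.0; the others are scaled to ~0.5 (about -6 dB)
```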
Abstract:
Techniques for scene change detection around seed points in media data are provided. Media features of many different types may be extracted from the media data. One or more statistical patterns of media features in a plurality of time-wise intervals around a plurality of seed time points of the media data may be determined using one or more types of features extractable from the media data. At least one of the one or more types of features comprises a type of features that captures structural properties, tonality including harmony and melody, timbre, rhythm, loudness, stereo mix, or a quantity of sound sources as related to the media data. A plurality of beginning scene change points and a plurality of ending scene change points in the media data may be detected, based on the one or more statistical patterns, for the plurality of seed time points in the media data.
Abstract:
Attributes are identified in media content. A classification value of the media content is computed based on the identified attributes. Thereafter, a fingerprint derived from the media content is stored or searched for based on the classification value of the media content.
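A minimal sketch of the storage scheme the abstract describes: fingerprints are partitioned by the classification value, so a search only examines the partition matching the query's class. The class names, fingerprint values, and `FingerprintStore` API are all illustrative assumptions.

```python
# Hypothetical class-partitioned fingerprint store: storing and searching
# are both keyed first by the media's classification value.

from collections import defaultdict

class FingerprintStore:
    def __init__(self):
        self.buckets = defaultdict(dict)  # class value -> {fingerprint: id}

    def store(self, class_value, fingerprint, media_id):
        self.buckets[class_value][fingerprint] = media_id

    def search(self, class_value, fingerprint):
        return self.buckets[class_value].get(fingerprint)

store = FingerprintStore()
store.store("music", 0xBEEF, "track-42")
hit = store.search("music", 0xBEEF)    # found in the "music" partition
miss = store.search("speech", 0xBEEF)  # other partitions are not searched
```

Partitioning by class shrinks the search space: a query never touches fingerprints whose media was classified differently.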
Abstract:
Multiple candidate feature components of media content and candidate projection matrices (or other hash functions, e.g., non-linear projections) are identified. Each of the candidate projection matrices (or other hash functions) includes an array of coefficients that relate to the candidate features. A subgroup of the candidate features or projection matrices (or other hash functions) is selected based at least partially on an optimized combination of at least two characteristics of the candidate features or projection matrices (or other hash functions). Media fingerprints that uniquely identify the media content are derived from the selected, optimized subgroup. Optimal projection matrices (or other hash functions) may be designed. Performance or sensitivity characteristics of the fingerprints (e.g., search time) are thus balanced against their robustness characteristics.
Abstract:
Content identification and quality monitoring are provided. The method involves obtaining a first fingerprint derived from a first media content, processing the first media content to generate a second media content, obtaining a second fingerprint derived from the second media content, and comparing the two fingerprints to determine one or more of: a similarity between the fingerprints that indicates the second media content was generated from the first media content, or a difference between the fingerprints that identifies a quality degradation between the first media content and the second media content.
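A sketch of the two determinations the abstract names, assuming bit-string fingerprints compared by Hamming distance: a small distance suggests the second content was derived from the first, and the distance itself can serve as a crude degradation score. The threshold and fingerprint values are illustrative assumptions.

```python
# Hypothetical fingerprint comparison: Hamming distance drives both the
# derivation decision and the quality-degradation measure.

def compare_fingerprints(fp_a, fp_b, match_threshold=8):
    distance = bin(fp_a ^ fp_b).count("1")  # Hamming distance in bits
    return {"derived": distance <= match_threshold,  # similar enough?
            "degradation": distance}                 # larger = more damage

original = 0b1011_0110_1100_0011
transcoded = original ^ 0b0000_0100_0001_0000  # processing flipped 2 bits
result = compare_fingerprints(original, transcoded)
```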
Abstract:
Derivation of a fingerprint includes generating feature matrices based on one or more training images, generating projection matrices based on the feature matrices in a training process, and deriving a fingerprint for one or more images by, at least in part, projecting a feature matrix based on the one or more images onto the projection matrices generated in the training process.
Abstract:
A method classifies segments of a video using an audio signal of the video and a set of classes. Selected classes of the set are combined into a subset of important classes, the subset being important for a specific highlighting task; the remaining classes of the set are combined into a subset of other classes. A task-specific classifier is trained on the subset of important classes and the subset of other classes using training audio data. The audio signal can then be classified by the task-specific classifier as either important or other, to identify highlights in the video corresponding to the specific highlighting task. The classified audio signal can be used to segment and summarize the video.
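A sketch of collapsing the class set into the binary "important vs. other" task before training, as the abstract describes. The class names and the stand-in nearest-centroid classifier are illustrative assumptions; any binary classifier could take its place.

```python
# Remap multi-class labels to "important"/"other" for a specific
# highlighting task, then train a toy nearest-centroid classifier.

IMPORTANT = {"applause", "cheering"}  # classes tied to the highlight task

def to_binary(label):
    return "important" if label in IMPORTANT else "other"

def train_centroids(samples):
    """samples: list of (feature_vector, original_class_label)."""
    sums, counts = {}, {}
    for x, label in samples:
        b = to_binary(label)
        acc = sums.setdefault(b, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[b] = counts.get(b, 0) + 1
    return {b: [v / counts[b] for v in acc] for b, acc in sums.items()}

def classify(centroids, x):
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda b: dist(centroids[b]))

data = [([1.0, 0.0], "applause"), ([0.9, 0.1], "cheering"),
        ([0.0, 1.0], "music"), ([0.1, 0.9], "speech")]
model = train_centroids(data)
```

Merging classes before training lets one classifier serve the specific task rather than discriminating among all original classes.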
Abstract:
A method identifies highlight segments in a video comprising a sequence of frames. Audio objects are detected to identify frames associated with audio events in the video, and visual objects are detected to identify frames associated with visual events. A selected visual object is paired with an associated audio object to form an audio-visual object only if the two match; each audio-visual object identifies a candidate highlight segment. The candidate highlight segments are then refined, using low-level features, to eliminate false highlight segments.
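The audio-visual pairing step can be sketched as below, assuming events are frame ranges and that "matching" means temporal overlap. The overlap rule and one-to-one pairing are assumptions; the abstract does not specify the matching criterion.

```python
# Illustrative audio-visual matching: pair audio and visual events whose
# frame ranges overlap; each pair becomes a candidate highlight segment.

def overlaps(a, b):
    """True if frame ranges a and b (start, end) intersect."""
    return a[0] <= b[1] and b[0] <= a[1]

def match_events(audio_events, visual_events):
    candidates = []
    for v in visual_events:
        for a in audio_events:
            if overlaps(a, v):
                # candidate highlight spans the union of both events
                candidates.append((min(a[0], v[0]), max(a[1], v[1])))
                break  # a visual object pairs with at most one audio object
    return candidates

audio = [(100, 150), (400, 420)]
visual = [(140, 200), (300, 320)]  # the second visual event has no audio match
cands = match_events(audio, visual)
# only the overlapping pair survives as a candidate highlight
```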