Abstract:
The technology described in this document can be embodied in a computer-implemented method that includes receiving, at one or more processing devices, a portion of an input signal representing noisy speech, and extracting, from the portion of the input signal, one or more frequency domain features of the noisy speech. The method also includes generating a set of projected features by projecting each of the one or more frequency domain features on a manifold that represents a model of frequency domain features for clean speech. The method further includes using the set of projected features for at least one of: a) generating synthesized speech that represents a noise-reduced version of the noisy speech, b) performing speaker recognition, or c) performing speech recognition.
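The projection step can be illustrated with a minimal sketch. The abstract does not say how the manifold is represented, so this example assumes it is approximated by a finite set of clean-speech feature vectors and that projection means snapping each noisy feature to its nearest clean sample. The function name `project_onto_manifold` and the toy two-dimensional vectors are hypothetical.

```python
import math

def project_onto_manifold(feature, clean_samples):
    """Return the clean-speech sample closest to `feature`.

    Assumption: the manifold is approximated by a finite set of sample
    points, and projection is nearest-neighbour lookup in Euclidean distance.
    """
    return min(clean_samples, key=lambda sample: math.dist(feature, sample))

# Hypothetical clean-speech feature vectors standing in for the manifold.
clean_samples = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]

# Hypothetical frequency domain features extracted from noisy speech.
noisy_features = [(0.9, 1.2), (1.8, -0.1)]

# The set of projected features that later stages would consume.
projected = [project_onto_manifold(f, clean_samples) for f in noisy_features]
```

A real system would likely use a learned, continuous model of clean-speech features; the nearest-sample lookup only shows where the projection sits between feature extraction and downstream use.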
Abstract:
An adaptive voice authentication system is provided. The adaptive voice authentication system includes an adaptive module configured to compare a feature quality index of the plurality of authentication features and the plurality of enrolment features and dynamically replace and store one or more enrolment features with one or more authentication features to form a plurality of updated enrolment features. The adaptive module is configured to generate an updated enrolment voice print model from the plurality of the updated enrolment features. The adaptive module is further configured to compare the updated enrolment voice print model with the previously stored enrolment voice print model and dynamically update the previously stored enrolment voice print model with the updated enrolment voice print model based on a model quality index.
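The two-stage update described above can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: the `Feature` type, the numeric `quality` field, the rule of replacing the weakest enrolment feature, and the use of mean feature quality as a stand-in model quality index are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Feature:
    name: str       # hypothetical label for a stored feature
    quality: float  # hypothetical feature quality index, higher is better

def update_enrolment(enrolment, authentication):
    """Replace the weakest enrolment features with better authentication features."""
    updated = list(enrolment)
    if not updated:
        return updated
    # Try the strongest authentication features first.
    for candidate in sorted(authentication, key=lambda f: f.quality, reverse=True):
        worst = min(range(len(updated)), key=lambda i: updated[i].quality)
        if candidate.quality > updated[worst].quality:
            updated[worst] = candidate
    return updated

def model_quality(features):
    """Assumed model quality index: the mean feature quality."""
    return sum(f.quality for f in features) / len(features)

def maybe_update_model(stored, authentication):
    """Keep the updated feature set only if its model quality index improves."""
    candidate = update_enrolment(stored, authentication)
    return candidate if model_quality(candidate) > model_quality(stored) else stored

stored = [Feature("e1", 0.9), Feature("e2", 0.4)]
attempt = [Feature("a1", 0.7), Feature("a2", 0.3)]
new_model = maybe_update_model(stored, attempt)
```

Here the feature list itself stands in for the voice print model; a real system would train a model from the updated features and compare the two models directly.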
Abstract:
A system and method to assess vocal function of a subject. The system includes an accelerometer configured to acquire surface acceleration data associated with vocal functionality of the subject and a computer system configured to analyze the surface acceleration data and to estimate glottal airflow waveforms produced by the subject based on the surface acceleration data. The computer system performs the analysis and estimation by applying an inverse filter to the surface acceleration data based on a calibrated transmission line model and generates an indication of vocal functionality of the subject based on the estimated glottal airflow waveforms.
Abstract:
A device may receive data indicative of a plurality of speech sounds associated with first voice characteristics of a first voice. The device may receive an input indicative of speech associated with second voice characteristics of a second voice. The device may map at least one portion of the speech of the second voice to one or more speech sounds of the plurality of speech sounds of the first voice. The device may compare the first voice characteristics with the second voice characteristics based on the map. The comparison may include vocal tract characteristics, nasal cavity characteristics, and voicing characteristics. The device may determine a given representation configured to associate the first voice characteristics with the second voice characteristics. The device may provide an output indicative of pronunciations of the one or more speech sounds of the first voice according to the second voice characteristics based on the given representation.
Abstract:
An apparatus comprising: an audio source determiner configured to determine at least one audio source; a visualizer configured to generate a visual representation associated with the at least one audio source; and a controller configured to process an audio signal associated with the at least one audio source dependent on interaction with the visual representation.
Abstract:
A video generation method includes obtaining audio data and initial image data of a virtual object, extracting an audio feature from the audio data, and performing predictive encoding on the audio data to obtain an encoded feature representing vocal channel characteristics of the audio data. The method further includes fusing the audio feature and the encoded feature to obtain a fused audio feature, and generating updated image data of the virtual object according to the fused audio feature and the initial image data. The method further includes generating video data including the updated image data and the audio data.
Abstract:
An emotion estimation apparatus (1) includes: a generation unit (2) configured to generate acoustic characteristic information indicating an acoustic characteristic using a first acoustic signal output to the ear canal and a second acoustic signal produced by the first acoustic signal echoing inside the body; and an estimation unit (3) configured to estimate emotion using the acoustic characteristic information.
Abstract:
A computer-implemented method for correcting muffled speech caused by facial coverings is disclosed. The computer-implemented method includes monitoring a user's speech for speech distortion. The computer-implemented method further includes determining that the user's speech is distorted. The computer-implemented method further includes determining that a cause of the user's speech distortion is based, at least in part, on a presence of a particular type of facial covering. The computer-implemented method further includes automatically correcting the speech distortion of the user based, at least in part, on the particular type of facial covering causing the speech distortion.
Abstract:
An adaptive voice authentication system is provided. The adaptive voice authentication system includes an adaptive module configured to compare a feature quality index of the plurality of authentication features and the plurality of enrollment features and dynamically replace and store one or more enrollment features with one or more authentication features to form a plurality of updated enrollment features. The adaptive module is configured to generate an updated enrollment voice print model from the plurality of the updated enrollment features. The adaptive module is further configured to compare the updated enrollment voice print model with the previously stored enrollment voice print model and dynamically update the previously stored enrollment voice print model with the updated enrollment voice print model based on a model quality index.
Abstract:
A device may receive a speech signal. The device may determine acoustic feature parameters for the speech signal. The acoustic feature parameters may include phase data. The device may determine circular space representations for the phase data based on an alignment of the phase data with given axes of the circular space representations. The device may map the phase data to linguistic features based on the circular space representations. The linguistic features may be associated with linguistic content that includes phonemic content or text content. The device may provide a synthetic audio pronunciation of the linguistic content based on the mapping.
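A short sketch can show why a circular representation helps with phase data. The abstract does not define the representation, so this example assumes one common choice: each phase angle is stored as its cosine and sine, which are its coordinates along two fixed axes of the unit circle. The function names are hypothetical.

```python
import math

def phase_to_circular(phase):
    """Map a phase angle in radians to a (cos, sin) point on the unit circle."""
    return (math.cos(phase), math.sin(phase))

def circular_to_phase(point):
    """Recover the phase angle from a (cos, sin) pair using atan2."""
    x, y = point
    return math.atan2(y, x)

def circular_distance(a, b):
    """Euclidean distance between two phases in the circular representation."""
    ax, ay = phase_to_circular(a)
    bx, by = phase_to_circular(b)
    return math.hypot(ax - bx, ay - by)

# Two phases on either side of the wrap point: far apart as raw numbers,
# but close together on the circle.
near_pos_pi = math.pi - 0.01
near_neg_pi = -math.pi + 0.01
raw_gap = abs(near_pos_pi - near_neg_pi)
circ_gap = circular_distance(near_pos_pi, near_neg_pi)
```

The raw difference between the two angles is close to 2π, while their circular distance is small, so a model mapping phase to linguistic features would not see a false discontinuity at the wrap point.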