-
1.
Publication No.: US20240355351A1
Publication Date: 2024-10-24
Application No.: US18302079
Filing Date: 2023-04-18
Inventors: Moshe Tzur, Elior Hadad
Abstract: The single-channel, Speech Features-Based Voice Activity Detection (SFVAD) system is a robust, low-latency system that generates per-frame speech and noise indications, along with calculating a pair of speech and noise time-frequency masks. The SFVAD system controls an adaptation mechanism for a Beam-Forming system control module and improves the speech quality and noise reduction capabilities of Automatic Speech Recognition applications, such as Virtual Assistance (VA) and Hands-Free (HF) calls, by robustly handling transient noises. The system extracts speech-like patterns from an input audio signal and is invariant to the power level of the input audio signal. Noise calculation is controlled by a pair of speech features-based detectors (voiced and unvoiced). A Cepstral-based pitch detector and a Centrum calculation method are used to prevent contamination of the calculated noise by speech content. The SFVAD system robustly handles instant changes of background noise level and has dramatically lower false detection rates.
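The cepstral pitch detection the abstract relies on can be sketched in a few lines. This is a minimal illustration of the standard technique (the cepstrum as the inverse FFT of the log magnitude spectrum, with a peak at the pitch period for voiced frames), not the patented SFVAD implementation; all function names and parameters are our assumptions.

```python
import numpy as np

def cepstral_pitch(frame, sr=16000, fmin=60.0, fmax=400.0):
    """Estimate (pitch_hz, peak_strength) for one audio frame via the cepstrum.

    A voiced frame's log magnitude spectrum ripples with a period equal to
    the fundamental frequency, so the cepstrum shows a peak at the
    quefrency (in samples) equal to the pitch period.
    """
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    cepstrum = np.fft.irfft(np.log(spectrum + 1e-12))
    # Search only quefrencies corresponding to plausible pitch frequencies.
    qmin, qmax = int(sr / fmax), int(sr / fmin)
    peak = qmin + int(np.argmax(cepstrum[qmin:qmax]))
    return sr / peak, float(cepstrum[peak])
```

A per-frame voiced/unvoiced decision could then threshold `peak_strength`, which is the kind of signal a speech-features-based detector can use to gate noise estimation.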
-
2.
Publication No.: US20240321267A1
Publication Date: 2024-09-26
Application No.: US18573542
Filing Date: 2022-06-23
Inventors: Christopher VOIGT, Alam Aseequl KHAN, Kai Samuel David Erik KARREN, Michael SCHMITS, Mohammad Fayadan HOSSAIN
CPC Classes: G10L15/1815, G10L15/22, G10L25/93
Abstract: A system and method are described for dialogue management by determining a transition from one domain or use case to another and initiating a switch suggestion in a human-computer conversation for rendering services to a user based on the updated domain or use case for a dialogue. The method comprises determining a first use-case in the human-computer conversation and assigning a first use-case score to the first use-case. The method then comprises determining a second use-case in the human-computer conversation and assigning a second use-case score to the second use-case. Further, the method comprises determining whether to make the use-case switch suggestion for the second use-case based on the first use-case score and the second use-case score. The invention thereby enables the system to extract the user's intended information, making it more efficient, approachable, and user-friendly while saving time and cost.
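The score-based switch decision can be illustrated with a small sketch. The margin-based hysteresis (requiring the candidate to beat the current use case by a fixed margin, so noisy scores do not cause oscillation) is our assumption, not a detail stated in the abstract:

```python
from dataclasses import dataclass

@dataclass
class UseCaseScore:
    name: str
    score: float

def should_suggest_switch(current: UseCaseScore, candidate: UseCaseScore,
                          margin: float = 0.2) -> bool:
    """Suggest switching use cases only when the candidate's score exceeds
    the current use case's score by a margin; otherwise stay in the
    current domain."""
    return candidate.score > current.score + margin
```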
-
3.
Publication No.: US12046255B2
Publication Date: 2024-07-23
Application No.: US17572100
Filing Date: 2022-01-10
Inventors: Fu-En Tsai, Feng Wen Hung, Chao-I Li
IPC Classes: G10L25/93, G10L25/78, H04N7/15, H04N23/695
CPC Classes: G10L25/93, G10L25/78, H04N7/15, H04N23/695, G10L2025/783
Abstract: A sound source tracking method adapted to an ongoing video conference comprises: obtaining a streaming signal of the video conference from the internet; performing a video conference procedure to obtain an audio signal from the streaming signal and send the audio signal to a speaker; performing an audio tracking procedure to obtain the audio signal outputted from the video conference procedure to the communication device and send the audio signal to a sound source tracking camera; playing the audio signal to generate a far-end sound; recording a field sound comprising at least one of the far-end sound and a local-end sound; and performing a comparing procedure to determine a shooting direction of the sound source tracking camera, wherein the shooting direction is adjusted so as not to shoot the speaker when a similarity of the far-end sound and the audio signal is greater than a threshold.
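The comparing procedure — deciding whether the recorded field sound is dominated by the known far-end audio, so the camera should not track the loudspeaker — can be approximated with a normalized cross-correlation. The similarity measure, threshold value, and function names below are our assumptions, not the patent's actual comparing procedure:

```python
import numpy as np

def normalized_xcorr_peak(recorded, farend):
    """Peak of the normalized cross-correlation between the recorded field
    sound and the known far-end signal; a value near 1 means the microphone
    is mostly hearing the loudspeaker playing far-end audio."""
    r = (recorded - recorded.mean()) / (recorded.std() + 1e-12)
    f = (farend - farend.mean()) / (farend.std() + 1e-12)
    corr = np.correlate(r, f, mode="full") / len(f)
    return float(np.abs(corr).max())

def camera_may_track(recorded, farend, threshold=0.7):
    """Return True when the field sound does NOT match the far-end audio,
    i.e. a local talker is likely present and the sound source tracking
    camera may steer toward the sound."""
    return normalized_xcorr_peak(recorded, farend) < threshold
```

In practice an acoustic-echo-aware similarity measure would be used instead of a raw cross-correlation, since the loudspeaker-to-microphone path filters the far-end signal.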
-
4.
Publication No.: US20240087591A1
Publication Date: 2024-03-14
Application No.: US18451736
Filing Date: 2023-08-17
Applicant: SomniQ, Inc.
Inventors: Rikko Sakaguchi, Hidenori Ishikawa
Abstract: Methods, systems, and apparatuses for computer-generated visualization of speech are described herein. An example method of computer-generated visualization of speech including at least one segment includes: generating a graphical representation of an object corresponding to a segment of the speech; and displaying the graphical representation of the object on a screen of a computing device. Generating the graphical representation includes: representing a duration of the respective segment by a length of the object and representing an intensity of the respective segment by a width of the object; and placing, in the graphical representation, a space between adjacent objects.
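The duration-to-length and intensity-to-width mapping can be sketched as a simple layout routine. The pixel scales and gap value are illustrative assumptions, not values from the patent:

```python
def layout_segments(segments, gap=4.0, px_per_sec=100.0, px_per_db=2.0):
    """Map (duration_s, intensity_db) speech segments to drawable objects:
    duration -> object length, intensity -> object width, with a fixed
    space (gap) placed between adjacent objects."""
    x = 0.0
    objects = []
    for duration, intensity in segments:
        length = duration * px_per_sec
        width = intensity * px_per_db
        objects.append({"x": x, "length": length, "width": width})
        x += length + gap  # leave a space before the next object
    return objects
```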
-
5.
Publication No.: US11922953B2
Publication Date: 2024-03-05
Application No.: US17414194
Filing Date: 2018-12-18
Inventor: Hideo Omura
Abstract: A voice analyzer analyzes whether a voice signal input into a voice input unit includes a specific characteristic component. A voice recognizer recognizes a voice represented by the voice signal input into the voice input unit. A response instruction unit instructs a response to a response operation unit that operates in response to the voice recognized by the voice recognizer. When the voice analyzer determines that the voice signal includes the specific characteristic component, a controller controls the voice recognizer not to execute voice recognition processing, or controls the response instruction unit not to instruct the response operation unit about an instruction content based on the recognized voice.
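The control flow the abstract describes — suppress recognition or the response when the analyzer flags the specific characteristic component — reduces to a small gate. The callables below are hypothetical stand-ins for the patent's analyzer, recognizer, and response units:

```python
def handle_voice(signal, has_characteristic, recognize, respond):
    """Gate the recognize/respond pipeline: when the analyzer reports that
    the signal contains the specific characteristic component, neither
    recognition nor the response instruction is executed."""
    if has_characteristic(signal):
        return None  # controller suppresses recognition and response
    text = recognize(signal)
    respond(text)  # response instruction unit drives the operation unit
    return text
```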
-
6.
Publication No.: US20230419967A1
Publication Date: 2023-12-28
Application No.: US17889110
Filing Date: 2022-08-16
Applicant: Apple Inc.
Inventors: Erik D. HORNBERGER, James A. FORREST, Christopher M. GARRIDO, Patrick MIAUTON, Bradley F. PATTERSON, Karthick SANTHANAM, Luciano M. VERGER
Abstract: Systems and processes for providing textual representations for a communication session are provided. For example, at least one audio input is received at an electronic device, wherein each audio input is associated with a respective priority level. A priority level of an audio input detected at a microphone of the electronic device is determined, and the highest priority level among the determined priority level and each received priority level corresponding to the at least one audio input is identified. A textual representation of the audio input corresponding to the identified highest priority level is obtained, and the obtained textual representation is displayed on a display of the electronic device.
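The selection step — transcribe the audio input with the highest priority among the local microphone and each received stream — is a straightforward maximum. The source identifiers and numeric priority scale are illustrative assumptions:

```python
def select_transcription_source(mic_priority, received):
    """Pick the audio source whose textual representation should be shown.

    received: list of (source_id, priority) pairs for incoming audio
    inputs; the local microphone competes with them at mic_priority.
    """
    best_id, best_priority = "local_mic", mic_priority
    for source_id, priority in received:
        if priority > best_priority:
            best_id, best_priority = source_id, priority
    return best_id
```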
-
7.
Publication No.: US20230386458A1
Publication Date: 2023-11-30
Application No.: US17804544
Filing Date: 2022-05-27
Applicant: SoundHound, Inc.
Inventors: Karl STAHL, Bernard MONT-REYNAUD
CPC Classes: G10L15/22, G10L15/08, G10L25/93, G10L2015/088
Abstract: Methods and systems for pre-wakeword speech processing are disclosed. Speech audio, comprising command speech spoken before a wakeword, may be stored in a buffer in oldest-to-newest order. Upon detection of the wakeword, reverse acoustic models and language models, such as reverse automatic speech recognition (R-ASR), can be applied to the buffered audio in newest-to-oldest order, starting from before the wakeword. The speech is converted into a sequence of words. Natural language grammar models, such as natural language understanding (NLU), can be applied to match the sequence of words to a complete command, the complete command being associated with invoking a computer operation.
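The buffering side of this scheme can be sketched with a bounded ring buffer that hands frames to a reverse recognizer in newest-to-oldest order, starting just before the wakeword. The class and method names are our assumptions; the reverse ASR models themselves are not modeled here:

```python
from collections import deque

class PreWakewordBuffer:
    """Ring buffer of audio frames kept in oldest-to-newest order; old
    frames are evicted once max_frames is reached."""

    def __init__(self, max_frames):
        self.frames = deque(maxlen=max_frames)

    def push(self, frame):
        self.frames.append(frame)

    def frames_before_wakeword(self, wakeword_len):
        """Frames preceding the wakeword, returned newest first, which is
        the order a reverse (R-ASR) recognizer would consume them in."""
        history = list(self.frames)[:-wakeword_len]
        return list(reversed(history))
```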
-
8.
Publication No.: US20230352042A1
Publication Date: 2023-11-02
Application No.: US17806565
Filing Date: 2022-06-13
Abstract: Systems and methods are provided for a transcription system with voice activity detection (VAD). The system includes a VAD module to receive incoming audio and generate an audio segment, and a speech decoder with a split predictor to perform, in a first pass, a decode operation to transcribe text from the audio segment into a message. In the first pass, if the message is determined not to contain a split point based on a content-based analysis performed by the split predictor, the speech decoder forwards the message for display; if the message is determined, based on the content-based analysis, to contain the split point, the speech decoder performs, in a second pass, a re-decode operation to transcribe text from the audio segment based on the split point, wherein the split point is configured within an audio domain of the audio segment by the split predictor, and forwards the message for display.
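The two-pass flow — decode once, consult the split predictor, and re-decode each side only when a split point is found — can be sketched as follows. `decode` and `find_split` are hypothetical stand-ins for the speech decoder and the content-based split predictor:

```python
def transcribe_with_split(segment, decode, find_split):
    """First pass: decode the whole audio segment into a message. If the
    split predictor locates a split point (an index into the audio),
    perform a second-pass re-decode of each side; otherwise the
    first-pass message is forwarded for display as-is."""
    message = decode(segment)
    split = find_split(message, segment)
    if split is None:
        return [message]
    return [decode(segment[:split]), decode(segment[split:])]
```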
-
9.
Publication No.: US20230306975A1
Publication Date: 2023-09-28
Application No.: US18160894
Filing Date: 2023-01-27
Inventors: Guillaume FUCHS, Archit TAMARAPU, Andrea EICHENSEER, Srikanth KORSE, Stefan DOEHLA, Markus MULTRUS
IPC Classes: G10L19/032, G10L19/012, G10L25/78, G10L25/93
CPC Classes: G10L19/032, G10L19/012, G10L25/78, G10L25/93
Abstract: Disclosed are an apparatus for generating an encoded audio scene, an apparatus for decoding and/or processing an encoded audio scene, related methods, and non-transitory storage units storing instructions which, when executed by a processor, cause the processor to perform a related method. The encoded audio scene may include, in a first frame, a first soundfield parameter representation and an encoded audio signal, wherein a second frame is an inactive frame. An apparatus for processing the encoded audio scene may include: an activity detector for detecting that the second frame is the inactive frame; a synthetic signal synthesizer for synthesizing a synthetic audio signal for the second frame using the parametric description for the second frame; an audio decoder for decoding the encoded audio signal for the first frame; and either a spatial renderer for spatially rendering the audio signal for the first frame using the first soundfield parameter representation and using the synthetic audio signal for the second frame, or a transcoder for generating a metadata-assisted output format including the audio signal for the first frame, the first soundfield parameter representation for the first frame, the synthetic audio signal for the second frame, and a second soundfield parameter representation for the second frame.
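The per-frame dispatch the abstract describes — decode active frames, synthesize a substitute signal for inactive frames from their parametric description — can be sketched as a loop. The callables are hypothetical stand-ins for the activity detector, audio decoder, and synthetic signal synthesizer:

```python
def process_scene_frames(frames, is_active, decode, synthesize):
    """For each frame of the encoded audio scene: active frames are fully
    decoded; inactive frames are replaced by a synthetic audio signal
    built from the frame's parametric description (comfort-noise style),
    so the renderer downstream always receives an audio signal."""
    out = []
    for frame in frames:
        if is_active(frame):
            out.append(decode(frame))
        else:
            out.append(synthesize(frame))
    return out
```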
-
10.
Publication No.: US11763099B1
Publication Date: 2023-09-19
Application No.: US18126341
Filing Date: 2023-03-24
Applicant: VoyagerX, Inc.
Inventors: Hyeonsoo Oh, Sedong Nam
Abstract: The present disclosure relates to systems and methods for providing subtitles for a video. The video's audio is transcribed to obtain caption text for the video. A first machine-trained model identifies sentences in the caption text. A second model identifies intra-sentence breaks within the sentences identified by the first machine-trained model. Based on the identified sentences and intra-sentence breaks, one or more words in the caption text are grouped into a clip caption to be displayed for a corresponding clip of the video.
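The grouping step can be sketched once the two models' outputs are given as word indices. The rule that intra-sentence breaks are used only when a caption would otherwise grow too long is our assumption; the abstract only says both kinds of break inform the grouping:

```python
def group_caption_clips(words, sentence_ends, intra_breaks, max_words=6):
    """Group caption words into per-clip captions: always break after the
    last word of a sentence; break at an intra-sentence break only when
    the running caption has already reached max_words. sentence_ends and
    intra_breaks are sets of word indices marking break positions."""
    clips, current = [], []
    for i, word in enumerate(words):
        current.append(word)
        if i in sentence_ends or (i in intra_breaks and len(current) >= max_words):
            clips.append(" ".join(current))
            current = []
    if current:  # flush any trailing words as a final clip caption
        clips.append(" ".join(current))
    return clips
```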