1.
Publication Number: US12125496B1
Publication Date: 2024-10-22
Application Number: US18644959
Application Date: 2024-04-24
Applicant: Sanas.ai Inc.
Inventor: Shawn Zhang, Lukas Pfeifenberger, Jason Wu, Piotr Dura, David Braude, Bajibabu Bollepalli, Alvaro Escudero, Gokce Keskin, Ankita Jha, Maxim Serebryakov
CPC classification number: G10L21/0232, G10L15/02, G10L15/063, G10L25/30, G10L15/16, G10L15/22
Abstract: The disclosed technology relates to methods, voice enhancement systems, and non-transitory computer readable media for real-time voice enhancement. In some examples, input audio data including foreground speech content, non-content elements, and speech characteristics is fragmented into input speech frames. The input speech frames are converted to low-dimensional representations of the input speech frames. One or more of the fragmentation or the conversion is based on an application of a first trained neural network to the input audio data. The low-dimensional representations of the input speech frames omit one or more of the non-content elements. A second trained neural network is applied to the low-dimensional representations of the input speech frames to generate target speech frames. The target speech frames are combined to generate output audio data. The output audio data further includes one or more portions of the foreground speech content and one or more of the speech characteristics.
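Below is a minimal sketch in Python of the two-stage pipeline this abstract describes: fragment audio into frames, encode the frames into low-dimensional representations with a first network, generate target frames with a second network, and overlap-add the result. Every concrete detail here (PyTorch, the Encoder/Generator modules, the 512/256 frame and hop sizes, the linear projections) is an illustrative assumption; the patent does not disclose the actual architectures or parameters.

```python
# Hypothetical sketch of the two-stage enhancement pipeline from the abstract.
# Module names, dimensions, and framing parameters are assumptions.
import torch
import torch.nn as nn

FRAME_SIZE, HOP = 512, 256  # assumed frame and hop sizes

class Encoder(nn.Module):
    """Stands in for the first trained network: frames -> low-dim embeddings."""
    def __init__(self, dim=64):
        super().__init__()
        self.proj = nn.Linear(FRAME_SIZE, dim)

    def forward(self, frames):           # (num_frames, FRAME_SIZE)
        return self.proj(frames)         # (num_frames, dim)

class Generator(nn.Module):
    """Stands in for the second trained network: embeddings -> target frames."""
    def __init__(self, dim=64):
        super().__init__()
        self.proj = nn.Linear(dim, FRAME_SIZE)

    def forward(self, embeddings):
        return self.proj(embeddings)     # (num_frames, FRAME_SIZE)

def enhance(audio: torch.Tensor, encoder: Encoder, generator: Generator) -> torch.Tensor:
    # 1. Fragment the input audio into overlapping input speech frames.
    frames = audio.unfold(0, FRAME_SIZE, HOP)             # (num_frames, FRAME_SIZE)
    # 2. Convert frames to low-dimensional representations; in the patented
    #    system it is training, not this projection alone, that makes the
    #    embeddings omit non-content elements such as background noise.
    embeddings = encoder(frames)
    # 3. Generate target speech frames from the embeddings.
    target = generator(embeddings)
    # 4. Combine (overlap-add) the target frames into the output audio data.
    out = torch.zeros(HOP * (target.shape[0] - 1) + FRAME_SIZE)
    window = torch.hann_window(FRAME_SIZE)
    for i, frame in enumerate(target):
        out[i * HOP : i * HOP + FRAME_SIZE] += frame * window
    return out
```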
-
2.
Publication Number: US12131745B1
Publication Date: 2024-10-29
Application Number: US18754280
Application Date: 2024-06-26
Applicant: Sanas.ai Inc.
Inventor: Lukas Pfeifenberger, Shawn Zhang
IPC: G10L21/007, G06F3/16, G10L13/00, G10L13/033, G10L15/02, G10L15/06, G10L15/16, G10L15/26, G10L21/003, G10L21/01, G10L21/013
CPC classification number: G10L21/007, G06F3/162, G10L13/00, G10L13/033, G10L15/02, G10L15/063, G10L15/16, G10L15/26, G10L21/003, G10L21/013, G10L21/01, G10L2021/0135
Abstract: The disclosed technology relates to methods, accent conversion systems, and non-transitory computer readable media for real-time accent conversion. In some examples, a set of phonetic embedding vectors is obtained for phonetic content that represents a source accent and is derived from input audio data. A trained machine learning model is applied to the set of phonetic embedding vectors to generate a set of transformed phonetic embedding vectors corresponding to phonetic characteristics of speech data in a target accent. An alignment is determined by maximizing the cosine similarity between the set of phonetic embedding vectors and the set of transformed phonetic embedding vectors. The speech data is then aligned to the phonetic content based on the determined alignment to generate output audio data representing the target accent. The disclosed technology transforms phonetic characteristics of a source accent to match the target accent more closely for efficient and seamless accent conversion in real-time applications.
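Below is a minimal sketch in Python of the alignment step this abstract describes: score each pair of source and transformed embeddings by cosine similarity and pair each source frame with its best match. The NumPy implementation, the greedy argmax pairing, and all shapes are illustrative assumptions rather than the patented method.

```python
# Hypothetical sketch of cosine-similarity alignment between source phonetic
# embeddings and transformed (target-accent) embeddings. Shapes and the
# greedy pairing strategy are assumptions.
import numpy as np

def cosine_similarity_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between rows of a (m, d) and b (n, d)."""
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_norm @ b_norm.T                  # (m, n)

def align(source_emb: np.ndarray, transformed_emb: np.ndarray) -> np.ndarray:
    """For each source frame, the index of the best-matching transformed frame."""
    sim = cosine_similarity_matrix(source_emb, transformed_emb)
    return sim.argmax(axis=1)                 # maximize similarity per frame

# Usage with toy data: 10 source frames, 12 target-accent frames, 64-dim.
rng = np.random.default_rng(0)
src = rng.standard_normal((10, 64))
tgt = rng.standard_normal((12, 64))
alignment = align(src, tgt)   # alignment[i] gives the target-accent frame
                              # matched to source frame i; the speech data
                              # would then be time-aligned accordingly.
```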
-