Patent search: ap:("Microsoft Technology Licensing, LLC") AND inv:"Tianyan ZHOU" (page 1)

1.

Invention application
CONVOLUTIONAL NEURAL NETWORK WITH PHONETIC ATTENTION FOR SPEAKER VERIFICATION (in force)

Publication No.: US20210082438A1

Publication date: 2021-03-18

Application No.: US16682921

Filing date: 2019-11-13

Applicant: Microsoft Technology Licensing, LLC

Inventors: Yong ZHAO, Tianyan ZHOU, Jinyu LI, Yifan GONG, Jian WU, Zhuo CHEN

IPC classification: G10L17/18, G10L17/02, G06N3/08

Abstract: Embodiments may include reception of a plurality of speech frames, determination of a multi-dimensional acoustic feature associated with each of the plurality of speech frames, determination of a plurality of multi-dimensional phonetic features, each of the plurality of multi-dimensional phonetic features determined based on a respective one of the plurality of speech frames, generation of a plurality of two-dimensional feature maps based on the phonetic features, input of the feature maps and the plurality of acoustic features to a convolutional neural network, the convolutional neural network to generate a plurality of speaker embeddings based on the plurality of feature maps and the plurality of acoustic features, aggregation of the plurality of speaker embeddings into a first speaker embedding based on respective weights determined for each of the plurality of speaker embeddings, and determination of a speaker associated with the plurality of speech frames based on the first speaker embedding.

2.

Invention application
CONVOLUTIONAL NEURAL NETWORK WITH PHONETIC ATTENTION FOR SPEAKER VERIFICATION (in force)

Publication No.: US20220157324A1

Publication date: 2022-05-19

Application No.: US17665862

Filing date: 2022-02-07

Applicant: Microsoft Technology Licensing, LLC

Inventors: Yong ZHAO, Tianyan ZHOU, Jinyu LI, Yifan GONG, Jian WU, Zhuo CHEN

IPC classification: G10L17/18, G06N3/08, G10L17/02

Abstract: Embodiments may include determination, for each of a plurality of speech frames associated with an acoustic feature, of a phonetic feature based on the associated acoustic feature, generation of one or more two-dimensional feature maps based on the plurality of phonetic features, input of the one or more two-dimensional feature maps to a trained neural network to generate a plurality of speaker embeddings, and aggregation of the plurality of speaker embeddings into a speaker embedding based on respective weights determined for each of the plurality of speaker embeddings, wherein the speaker embedding is associated with an identity of the speaker.
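
Note: both abstracts end with an aggregation step in which several speaker embeddings are combined into one using a weight per embedding. The following is a minimal sketch of that kind of weighted pooling, not the method claimed in the patents. The abstracts do not say how the weights are determined, so the softmax normalization, the function names (`softmax`, `aggregate_embeddings`), and the sample values are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Turn raw scores into positive weights that sum to 1 (assumed weighting)."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_embeddings(
    embeddings: Sequence[Sequence[float]],
    scores: Sequence[float],
) -> List[float]:
    """Weighted sum of several embeddings into a single embedding."""
    if not embeddings or len(embeddings) != len(scores):
        raise ValueError("need exactly one score per embedding")
    weights = softmax(scores)
    dim = len(embeddings[0])
    pooled = [0.0] * dim
    for weight, emb in zip(weights, embeddings):
        for i in range(dim):
            pooled[i] += weight * emb[i]
    return pooled


# Hypothetical values: three 2-dimensional embeddings with equal scores,
# so each embedding contributes with weight 1/3.
frame_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
frame_scores = [0.0, 0.0, 0.0]
utterance_embedding = aggregate_embeddings(frame_embeddings, frame_scores)
```

In a verification setting, the pooled embedding would then typically be compared with an enrolled speaker's embedding, for example by cosine similarity; that comparison is likewise an assumption and is not stated in the abstracts.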