Automated data cleanup by substitution of words of the same pronunciation and different spelling in speech recognition

Granted invention patent

US09460708B2 Automated data cleanup by substitution of words of the same pronunciation and different spelling in speech recognition (in force)

Patent title: Automated data cleanup by substitution of words of the same pronunciation and different spelling in speech recognition
Application number: US12561521

Filing date: 2009-09-17
Publication number: US09460708B2

Publication date: 2016-10-04
Inventors: Geoffrey Zweig, Yun-Cheng Ju
Applicants: Geoffrey Zweig, Yun-Cheng Ju
Applicant address: Redmond, WA, US
Patentee: Microsoft Technology Licensing, LLC
Current patentee: Microsoft Technology Licensing, LLC
Current patentee address: Redmond, WA, US
Agents: Alin Corie; Sandy Swain; Micky Minhas
Main classification: G06F17/20
IPC classification: G06F17/20; G06F17/27; G10L15/06; G10L15/187

Abstract:

The described implementations relate to automated data cleanup. One system includes a language model generated from language model seed text and a dictionary of possible data substitutions. This system also includes a transducer configured to cleanse a corpus utilizing the language model and the dictionary. The transducer can process speech recognition data in some cases by substituting a second word for a first word which shares pronunciation with the first word but is spelled differently. In some cases, this can be accomplished by establishing corresponding probabilities of the first word and second word based on a third word that appears in sequence with the first word.
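The substitution idea in the abstract can be illustrated with a small sketch. This is not the patented implementation: it replaces the transducer with a greedy left-to-right pass, uses an add-one-smoothed bigram model as the language model, and treats the preceding word as the "third word" that conditions the choice. The function names, the seed text, and the homophone dictionary below are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def train_bigrams(seed_text):
    """Count unigrams and bigrams in whitespace-tokenized seed text."""
    tokens = seed_text.lower().split()
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigram_prob(prev, word, unigrams, bigrams):
    """Add-one-smoothed estimate of P(word | prev)."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)


def cleanse(sentence, homophones, unigrams, bigrams):
    """Greedy pass: for each word with same-sounding alternatives, keep the
    spelling that is most probable given the previously emitted word."""
    out = []
    for word in sentence.lower().split():
        candidates = [word] + homophones.get(word, [])
        if out and len(candidates) > 1:
            prev = out[-1]
            # On ties, max() keeps the first candidate, i.e. the original word.
            word = max(
                candidates,
                key=lambda c: bigram_prob(prev, c, unigrams, bigrams),
            )
        out.append(word)
    return " ".join(out)


# Hypothetical seed text and substitution dictionary (assumptions, not from the patent).
SEED = (
    "the store is over there . they left their keys over there . "
    "their keys are here ."
)
HOMOPHONES = {"there": ["their"], "their": ["there"]}

unigrams, bigrams = train_bigrams(SEED)
cleaned = cleanse("they left there keys over their", HOMOPHONES, unigrams, bigrams)
print(cleaned)
```

In this toy setup the word after "left" is respelled as "their" and the word after "over" as "there", because those bigrams occur in the seed text while the alternatives do not. A real system would presumably score whole word sequences rather than deciding one word at a time.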

Publication/grant documents

US20100076752A1 Automated Data Cleanup (publication/grant date: 2010-03-25)
