Automated Data Cleanup
    3.
    Patent application (status: active)

    Publication No.: US20100076752A1

    Publication date: 2010-03-25

    Application No.: US12561521

    Filing date: 2009-09-17

    IPC classification: G06F17/21 G10L15/26

    Abstract: The described implementations relate to automated data cleanup. One system includes a language model generated from language model seed text and a dictionary of possible data substitutions. This system also includes a transducer configured to cleanse a corpus utilizing the language model and the dictionary.
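
As a rough illustration of the pipeline in this abstract, the sketch below trains a tiny bigram language model on seed text and uses it to choose among dictionary substitutions for each token. The function names, the add-one smoothing, and the greedy left-to-right search are all assumptions made for illustration; the abstract does not specify them, and a real transducer would typically search over whole sequences rather than decide token by token.

```python
from collections import Counter
import math


def train_bigram_model(seed_text):
    """Build a smoothed bigram scorer from seed text (hypothetical helper)."""
    tokens = ["<s>"] + seed_text.lower().split()
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    vocab_size = len(unigrams)

    def log_prob(prev, word):
        # Add-one smoothing so unseen pairs keep a small nonzero probability.
        return math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))

    return log_prob


def cleanse(text, log_prob, substitutions):
    """Greedily replace each token with its best-scoring candidate.

    Assumption: the original token always competes with its substitutions,
    so a token is only rewritten when the model prefers the alternative.
    """
    prev = "<s>"
    cleaned = []
    for token in text.lower().split():
        candidates = [token] + substitutions.get(token, [])
        best = max(candidates, key=lambda c: log_prob(prev, c))
        cleaned.append(best)
        prev = best
    return " ".join(cleaned)


seed = "see you at 2 pm . i want to go home . see you later"
subs = {"u": ["you"], "2": ["to", "too"], "c": ["see"]}
model = train_bigram_model(seed)

print(cleanse("c u later", model, subs))    # see you later
print(cleanse("i want 2 go", model, subs))  # i want to go
print(cleanse("at 2 pm", model, subs))      # at 2 pm
```

The last call shows why the language model matters: "2" is left alone after "at" but rewritten to "to" after "want", because the seed text supports different continuations in each context.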

    Structured models of repetition for speech recognition
    4.
    Granted patent (status: active)

    Publication No.: US08965765B2

    Publication date: 2015-02-24

    Application No.: US12233826

    Filing date: 2008-09-19

    IPC classification: G10L15/00 G10L15/18

    CPC classification: G10L15/1822

    Abstract: Described is a technology by which a structured model of repetition is used to determine the words spoken by a user, and/or a corresponding database entry, based in part on a prior utterance. For a repeated utterance, a joint probability analysis is performed on (at least some of) the corresponding word sequences (as recognized by one or more recognizers) and associated acoustic data. For example, a generative probabilistic model or a maximum entropy model may be used in the analysis. The second utterance may be a repetition of the first utterance using the exact words, or another structural transformation thereof relative to the first utterance, such as an extension that adds one or more words, a truncation that removes one or more words, or a whole or partial spelling of one or more words.
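
To make the idea of a joint analysis over two utterances concrete, the sketch below rescores the n-best lists of a first and a repeated utterance together. Everything in it is an assumption for illustration: the recognizer scores are taken as given log-probabilities, the structural relation is limited to exact repetition, prefix extension, and prefix truncation, and the relation probabilities are made-up constants. The abstract does not state how the actual model is parameterized.

```python
import math


def relation_log_prob(first, second):
    """Log-probability of the structural relation between two hypotheses.

    The constants are illustrative placeholders, not values from the patent.
    """
    a, b = first.split(), second.split()
    if a == b:
        return math.log(0.6)    # exact repetition
    if len(b) > len(a) and b[:len(a)] == a:
        return math.log(0.2)    # extension: words added at the end
    if len(b) < len(a) and a[:len(b)] == b:
        return math.log(0.15)   # truncation: words removed from the end
    return math.log(0.05)       # no recognized relation


def best_joint_hypothesis(nbest_first, nbest_second):
    """Pick the hypothesis pair with the highest joint score.

    Each n-best list holds (text, log_score) tuples from a recognizer.
    """
    scored = (
        (h1, h2, s1 + s2 + relation_log_prob(h1, h2))
        for h1, s1 in nbest_first
        for h2, s2 in nbest_second
    )
    h1, h2, _ = max(scored, key=lambda item: item[2])
    return h1, h2


first = [("pizza hot", math.log(0.5)), ("pizza hut", math.log(0.4))]
second = [("pizza hut seattle", math.log(0.6)),
          ("pete's a hut seattle", math.log(0.3))]

print(best_joint_hypothesis(first, second))
# ('pizza hut', 'pizza hut seattle')
```

In this toy case the first utterance's top hypothesis is "pizza hot", but the repetition makes "pizza hut" the better joint explanation, since it is a prefix of the best hypothesis for the second utterance.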

    DETERMINING SYNONYM-ANTONYM POLARITY IN TERM VECTORS
    6.
    Patent application (status: pending, published)

    Publication No.: US20140067368A1

    Publication date: 2014-03-06

    Application No.: US13597277

    Filing date: 2012-08-29

    IPC classification: G06F17/27

    Abstract: A document-term matrix may be generated based on a corpus. A term representation matrix may be generated based on modifying a plurality of elements of the document-term matrix based on antonym information included in the corpus. Similarities may be determined based on a plurality of elements of the term representation matrix.
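
One way to picture the polarity idea is to treat each thesaurus entry as a "document" and to flip the sign of the matrix elements that belong to antonyms. The sketch below does this with plain lists and cosine similarity. The entry format, the +1/-1 weights, and the function names are assumptions for illustration; the abstract only says that elements of the document-term matrix are modified based on antonym information.

```python
import math


def build_term_vectors(entries):
    """Return one vector per term, with one component per thesaurus entry.

    entries: list of (synonyms, antonyms) pairs. Synonyms are weighted +1
    and antonyms -1 (an assumed weighting, chosen for simplicity).
    """
    terms = sorted({t for syns, ants in entries for t in syns + ants})
    vectors = {t: [0.0] * len(entries) for t in terms}
    for row, (syns, ants) in enumerate(entries):
        for term in syns:
            vectors[term][row] = 1.0
        for term in ants:
            vectors[term][row] = -1.0
    return vectors


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


entries = [
    (["happy", "glad", "joyful"], ["sad", "unhappy"]),
    (["sad", "unhappy", "gloomy"], ["happy", "joyful"]),
]
vectors = build_term_vectors(entries)

print(round(cosine(vectors["happy"], vectors["joyful"]), 3))  # 1.0
print(round(cosine(vectors["happy"], vectors["sad"]), 3))     # -1.0
print(round(cosine(vectors["glad"], vectors["gloomy"]), 3))   # 0.0
```

With the sign flip in place, synonyms end up with similarity near +1 and antonyms near -1, whereas an unsigned document-term matrix would place both pairs close together because they co-occur in the same entries.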

    Method for clustering closely resembling data objects
    7.
    Granted patent (status: active)

    Publication No.: US06349296B1

    Publication date: 2002-02-19

    Application No.: US09642017

    Filing date: 2000-08-21

    IPC classification: G06F17/30

    Abstract: A computer-implemented method determines the resemblance of data objects such as Web pages. Each data object is partitioned into a sequence of tokens. The tokens are grouped into overlapping sets of the tokens to form shingles. Each shingle is represented by a unique identification element encoded as a fingerprint. A minimum element from each of the images of the set of fingerprints associated with a document under each of a plurality of pseudo-random permutations of the set of all fingerprints is selected to generate a sketch of each data object. The sketches characterize the resemblance of the data objects. The sketches can be further partitioned into a plurality of groups. Each group is fingerprinted to form a feature. Data objects that share more than a certain number of features are estimated to be nearly identical.
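
The steps in this abstract (tokens, shingles, fingerprints, minima under permutations, grouped features) can be sketched in a few lines. The version below is a minimal illustration under stated assumptions: fingerprints are taken from truncated SHA-1 digests, and each pseudo-random permutation is approximated by a linear map modulo a prime. Neither choice comes from the patent, and all function names are hypothetical.

```python
import hashlib
import random

PRIME = (1 << 61) - 1  # a Mersenne prime; fingerprints live in [0, PRIME)


def shingles(text, w=3):
    """Overlapping runs of w consecutive tokens."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def fingerprint(item):
    """Deterministic integer fingerprint of a string (assumed: SHA-1 based)."""
    digest = hashlib.sha1(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % PRIME


def make_permutations(k, seed=0):
    """k maps x -> (a*x + b) mod PRIME, each a bijection on [0, PRIME)."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(k)]


def sketch(text, permutations, w=3):
    """Minimum permuted fingerprint per permutation.

    Requires at least w tokens; otherwise there are no shingles to hash.
    """
    prints = [fingerprint(s) for s in shingles(text, w)]
    return [min((a * x + b) % PRIME for x in prints) for a, b in permutations]


def estimated_resemblance(sketch_a, sketch_b):
    """Fraction of positions where the two sketches agree."""
    matches = sum(x == y for x, y in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


def features(sk, group_size=4):
    """Fingerprint each group of sketch elements to form a feature."""
    groups = [tuple(sk[i:i + group_size]) for i in range(0, len(sk), group_size)]
    return [fingerprint(repr(group)) for group in groups]


perms = make_permutations(128, seed=1)
doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the river shore"
sketch_a, sketch_b = sketch(doc_a, perms), sketch(doc_b, perms)

print(estimated_resemblance(sketch_a, sketch_b))
shared = sum(x == y for x, y in zip(features(sketch_a), features(sketch_b)))
print(shared, "of", len(features(sketch_a)), "features shared")
```

The two sample documents share 10 of 12 distinct shingles, so the estimate should land near 0.83, with some sampling noise from using only 128 permutations. Counting shared features, as in the last two lines, is a stricter test, because a feature matches only when every sketch element in its group matches.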

    THREE-DIMENSIONAL OBJECT BROWSING IN DOCUMENTS
    8.
    Patent application (status: active)

    Publication No.: US20140037218A1

    Publication date: 2014-02-06

    Application No.: US13567105

    Filing date: 2012-08-06

    IPC classification: G06K9/68

    CPC classification: G06F17/30268

    Abstract: A document that includes a representation of a two-dimensional (2-D) image may be obtained. A selection indicator indicating a selection of at least a portion of the 2-D image may be obtained. A match correspondence may be determined between the selected portion of the 2-D image and a three-dimensional (3-D) image object stored in an object database, the match correspondence based on a web crawler analysis result. A 3-D rendering of the 3-D image object that corresponds to the selected portion of the 2-D image may be initiated.

    STRUCTURED MODELS OF REPITITION FOR SPEECH RECOGNITION
    9.
    Patent application (status: active)

    Publication No.: US20100076765A1

    Publication date: 2010-03-25

    Application No.: US12233826

    Filing date: 2008-09-19

    IPC classification: G10L15/00

    CPC classification: G10L15/1822

    Abstract: Described is a technology by which a structured model of repetition is used to determine the words spoken by a user, and/or a corresponding database entry, based in part on a prior utterance. For a repeated utterance, a joint probability analysis is performed on (at least some of) the corresponding word sequences (as recognized by one or more recognizers) and associated acoustic data. For example, a generative probabilistic model or a maximum entropy model may be used in the analysis. The second utterance may be a repetition of the first utterance using the exact words, or another structural transformation thereof relative to the first utterance, such as an extension that adds one or more words, a truncation that removes one or more words, or a whole or partial spelling of one or more words.

    Automatic construction of unique signatures and confusable sets for database access
    10.
    Granted patent (status: active)

    Publication No.: US07251599B2

    Publication date: 2007-07-31

    Application No.: US10315411

    Filing date: 2002-12-10

    IPC classification: G10L15/02

    CPC classification: G10L15/18

    Abstract: Methods and arrangements for facilitating database access in speech recognition. A plurality of possible subsequences corresponding to a database entry are ascertained, a record of such subsequences and their correspondence to database entries is created, and either or both of the following are carried out: unique signatures are ascertained via determining whether a subsequence corresponding to a given database entry does not also correspond to at least one other database entry; and/or multiple occurrences of a given subsequence are found, with corresponding database entries being grouped into a confusion set.
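
The record of subsequences described in this abstract maps naturally onto an inverted index. The sketch below builds one and derives both outputs from it. It assumes that "subsequences" means contiguous word runs and that matching is case-insensitive; the abstract does not say either, and the function names are hypothetical.

```python
from collections import defaultdict


def subsequences(entry):
    """All contiguous word runs of a database entry (assumed definition)."""
    words = entry.lower().split()
    return {
        " ".join(words[i:j])
        for i in range(len(words))
        for j in range(i + 1, len(words) + 1)
    }


def index_entries(entries):
    """Record which database entries each subsequence corresponds to."""
    index = defaultdict(set)
    for entry in entries:
        for sub in subsequences(entry):
            index[sub].add(entry)
    return index


def unique_signatures(index):
    """Subsequences that correspond to exactly one entry, grouped by entry."""
    signatures = defaultdict(set)
    for sub, owners in index.items():
        if len(owners) == 1:
            (owner,) = owners
            signatures[owner].add(sub)
    return dict(signatures)


def confusion_sets(index):
    """Subsequences shared by several entries, with the entries they confuse."""
    return {sub: owners for sub, owners in index.items() if len(owners) > 1}


index = index_entries(["Main Street Cafe", "Main Street Bakery", "Harbor Cafe"])

print(sorted(unique_signatures(index)["Main Street Cafe"]))
# ['main street cafe', 'street cafe']
print(sorted(confusion_sets(index)["cafe"]))
# ['Harbor Cafe', 'Main Street Cafe']
```

A speaker who says only "street cafe" can be resolved to a single entry, while "cafe" alone would need a follow-up question to choose between the two entries in its confusion set.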