SYSTEM FOR MONITORING GLOBAL ONLINE OPINIONS VIA SEMANTIC EXTRACTION
    1.
    Invention application
    SYSTEM FOR MONITORING GLOBAL ONLINE OPINIONS VIA SEMANTIC EXTRACTION (in force)

    Publication No.: US20100223226A1

    Publication date: 2010-09-02

    Application No.: US12394646

    Filing date: 2009-02-27

    IPC classification: G06N5/04

    CPC classification: G06Q30/02

    Abstract: A system for transforming domain specific unstructured data into structured data including an intake platform controlled by feedback from a control platform. The intake platform includes an intake acquisition module for acquiring data building baseline data related to a domain and problem of interest, an intake pre-processing module, an intake language module, an intake application descriptors module, and an intake adjudication module. The control platform includes a control data acquisition module, a control data consistency collator, a control auditor, a control event definition and policy repository, an error resolver, and an output that outputs results of the workflow into structured data enabled to be used in data analytics.


    HOLISTIC DISAMBIGUATION FOR ENTITY NAME SPOTTING
    2.
    Invention application
    HOLISTIC DISAMBIGUATION FOR ENTITY NAME SPOTTING (in force)

    Publication No.: US20100223292A1

    Publication date: 2010-09-02

    Application No.: US12394078

    Filing date: 2009-02-27

    IPC classification: G06F17/30 G06F17/27

    CPC classification: G06F17/278

    Abstract: A method resolves ambiguous spotted entity names in a data corpus by determining an activation level value for each of a plurality of nodes corresponding to a single ambiguous entity name. The activation levels for each of the nodes may be modified by inputting outside domain knowledge corresponding to the nodes to increase the activation value of the nodes, spotting entity names corresponding to the nodes to increase the activation value of the nodes, searching the data corpus to spot newly posted entity names to increase the activation value of the nodes, and searching the data corpus to reduce or deactivate the activation value of the nodes by eliminating false positives. The ambiguous entity name is assigned to the node determined to have the highest activation level and is then outputted to a user.

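
The activation-level idea in this abstract can be illustrated with a minimal Python sketch. It is not the patented method: the function name `resolve_entity`, the additive update rule, the zero floor for deactivated nodes, and the sample evidence values are all assumptions made for illustration.

```python
def resolve_entity(candidates, evidence):
    """Pick the candidate node with the highest activation level.

    candidates: candidate node names for one ambiguous entity name.
    evidence: (node, delta) pairs. Positive deltas stand for supporting
    signals (domain knowledge, related spotted names); negative deltas
    stand for false-positive eliminations. (Hypothetical encoding.)
    """
    activation = {node: 0.0 for node in candidates}
    for node, delta in evidence:
        if node in activation:
            # Assumed rule: add the delta, never dropping below zero.
            activation[node] = max(0.0, activation[node] + delta)
    best = max(activation, key=activation.get)
    return best, activation


candidates = ["Paris (city)", "Paris Hilton"]
evidence = [
    ("Paris (city)", 1.0),   # outside domain knowledge
    ("Paris (city)", 0.5),   # related entity name spotted
    ("Paris Hilton", 0.5),   # related entity name spotted
    ("Paris Hilton", -0.5),  # false positive eliminated
]
best, levels = resolve_entity(candidates, evidence)
print(best, levels)
```

In this toy run the ambiguous name is assigned to the node that accumulated the most supporting evidence.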

    System for monitoring global online opinions via semantic extraction
    3.
    Granted invention patent
    System for monitoring global online opinions via semantic extraction (in force)

    Publication No.: US08352412B2

    Publication date: 2013-01-08

    Application No.: US12394646

    Filing date: 2009-02-27

    IPC classification: G06F17/00 G06N7/00 G06N7/08

    CPC classification: G06Q30/02

    Abstract: A system for transforming domain specific unstructured data into structured data including an intake platform controlled by feedback from a control platform. The intake platform includes an intake acquisition module for acquiring data building baseline data related to a domain and problem of interest, an intake pre-processing module, an intake language module, an intake application descriptors module, and an intake adjudication module. The control platform includes a control data acquisition module, a control data consistency collator, a control auditor, a control event definition and policy repository, an error resolver, and an output that outputs results of the workflow into structured data enabled to be used in data analytics.


    DATA INGEST OPTIMIZATION
    4.
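    Invention application
    DATA INGEST OPTIMIZATION (under examination, published)

    Publication No.: US20120330972A1

    Publication date: 2012-12-27

    Application No.: US13604096

    Filing date: 2012-09-05

    IPC classification: G06F17/30

    CPC classification: G06F17/30864 G06F17/30899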

    Abstract: Methods and systems for optimizing the retrieval of data from multiple sources are described. A slot map including slots for the storage of data elements can be obtained. The data elements associated with the slots can be prioritized by weighting values with costs of retrieving the data elements from respective data sources. Each value can be associated with a different data element and can indicate a respective degree of importance of the associated data element. Further, the systems and methods can direct the retrieval of data elements from the respective data sources in an order in accordance with the priority of the data elements to optimize the quality of data obtainable within a critical time constraint. In addition, the retrieved data elements can be stored in corresponding slots on a storage medium.

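
As a rough illustration of prioritizing slots by importance and retrieval cost under a time constraint, consider the Python sketch below. The abstract does not say how values are weighted with costs; the importance-to-cost ratio, the greedy skip rule, the function name `plan_retrieval`, and the sample slot names are assumptions.

```python
def plan_retrieval(elements, time_budget):
    """Fill slots in priority order until the time budget is spent.

    elements: list of (slot_name, importance, cost_seconds).
    Priority is assumed to be importance divided by retrieval cost.
    """
    ranked = sorted(elements, key=lambda e: e[1] / e[2], reverse=True)
    slots, elapsed = {}, 0.0
    for name, importance, cost in ranked:
        if elapsed + cost > time_budget:
            continue  # this element no longer fits in the remaining time
        elapsed += cost
        slots[name] = f"<data for {name}>"  # placeholder for a real fetch
    return slots, elapsed


elements = [
    ("name", 10, 2.0),             # ratio 5.0
    ("address", 8, 1.0),           # ratio 8.0
    ("purchase_history", 6, 6.0),  # ratio 1.0
    ("photo", 2, 0.5),             # ratio 4.0
]
slots, elapsed = plan_retrieval(elements, time_budget=4.0)
print(list(slots), elapsed)
```

With a 4-second budget the expensive, low-ratio `purchase_history` element is skipped, while the three cheaper elements are stored in their slots.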

    Holistic disambiguation for entity name spotting
    5.
    Granted invention patent
    Holistic disambiguation for entity name spotting (in force)

    Publication No.: US08856119B2

    Publication date: 2014-10-07

    Application No.: US12394078

    Filing date: 2009-02-27

    IPC classification: G06F17/30 G06F7/00 G06F17/27

    CPC classification: G06F17/278

    Abstract: A method resolves ambiguous spotted entity names in a data corpus by determining an activation level value for each of a plurality of nodes corresponding to a single ambiguous entity name. The activation levels for each of the nodes may be modified by inputting outside domain knowledge corresponding to the nodes to increase the activation value of the nodes, spotting entity names corresponding to the nodes to increase the activation value of the nodes, searching the data corpus to spot newly posted entity names to increase the activation value of the nodes, and searching the data corpus to reduce or deactivate the activation value of the nodes by eliminating false positives. The ambiguous entity name is assigned to the node determined to have the highest activation level and is then outputted to a user.


    Data ingest optimization
    6.
    Granted invention patent
    Data ingest optimization (in force)

    Publication No.: US09589065B2

    Publication date: 2017-03-07

    Application No.: US13604096

    Filing date: 2012-09-05

    IPC classification: G06F17/30

    CPC classification: G06F17/30864 G06F17/30899

    Abstract: Methods and systems for optimizing the retrieval of data from multiple sources are described. A slot map including slots for the storage of data elements can be obtained. The data elements associated with the slots can be prioritized by weighting values with costs of retrieving the data elements from respective data sources. Each value can be associated with a different data element and can indicate a respective degree of importance of the associated data element. Further, the systems and methods can direct the retrieval of data elements from the respective data sources in an order in accordance with the priority of the data elements to optimize the quality of data obtainable within a critical time constraint. In addition, the retrieved data elements can be stored in corresponding slots on a storage medium.


    SYSTEM AND METHOD FOR ADAPTIVE CONTENT PROCESSING AND CLASSIFICATION IN A HIGH-AVAILABILITY ENVIRONMENT
    7.
    Invention application
    SYSTEM AND METHOD FOR ADAPTIVE CONTENT PROCESSING AND CLASSIFICATION IN A HIGH-AVAILABILITY ENVIRONMENT (lapsed)

    Publication No.: US20080208893A1

    Publication date: 2008-08-28

    Application No.: US11678075

    Filing date: 2007-02-23

    IPC classification: G06F7/00

    CPC classification: G06F17/30067

    Abstract: The embodiments of the invention provide systems, methods, etc. for adaptive content processing and classification in a high-availability environment. More specifically, a system is provided having a plurality of processing engines and at least one server that classifies data objects on the computer system. The classification includes analyzing the data objects for the presence of a type of content. This can include assigning a score corresponding to the amount of the type of content in each of the data objects. Moreover, the server can remove a data object from the computer system based on the results of the analyzing. The results of the analyzing are stored and the computer system is updated with feedback information. This can include allowing a user to review the results of the analyzing and aggregating reviews of the user into the feedback information.

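
The score-then-remove loop described in this abstract might look roughly like the following Python sketch. The word-fraction score, the removal threshold, the boolean review encoding, and all names (`score`, `classify`, `aggregate_reviews`) are illustrative assumptions, not details taken from the patent.

```python
def score(text, terms):
    """Fraction of words in text that belong to the flagged term set."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w in terms for w in words) / len(words)


def classify(objects, terms, threshold):
    """Score every data object and keep only those below the threshold."""
    results = {oid: score(text, terms) for oid, text in objects.items()}
    kept = {oid: text for oid, text in objects.items()
            if results[oid] < threshold}
    return kept, results


def aggregate_reviews(reviews):
    """Fraction of user reviews that agreed with the stored results."""
    return sum(reviews) / len(reviews) if reviews else None


objects = {
    "a": "buy cheap pills now",
    "b": "meeting notes for monday",
    "c": "cheap cheap pills",
}
kept, results = classify(objects, terms={"cheap", "pills"}, threshold=0.5)
feedback = aggregate_reviews([True, True, False, True])
print(kept, results, feedback)
```

Here the stored `results` play the role of the analysis results, and `feedback` is a single aggregated number that a real system could use to adjust the threshold or the term set.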

    System and method for adaptive content processing and classification in a high-availability environment
    8.
    Granted invention patent
    System and method for adaptive content processing and classification in a high-availability environment (lapsed)

    Publication No.: US07966270B2

    Publication date: 2011-06-21

    Application No.: US11678075

    Filing date: 2007-02-23

    IPC classification: G06F15/18 G06E3/00

    CPC classification: G06F17/30067

    Abstract: The embodiments of the invention provide systems, methods, etc. for adaptive content processing and classification in a high-availability environment. More specifically, a system is provided having a plurality of processing engines and at least one server that classifies data objects on the computer system. The classification includes analyzing the data objects for the presence of a type of content. This can include assigning a score corresponding to the amount of the type of content in each of the data objects. Moreover, the server can remove a data object from the computer system based on the results of the analyzing. The results of the analyzing are stored and the computer system is updated with feedback information. This can include allowing a user to review the results of the analyzing and aggregating reviews of the user into the feedback information.


    VALIDATION OF INGESTED DATA
    9.
    Invention application

    Publication No.: US20120330901A1

    Publication date: 2012-12-27

    Application No.: US13604157

    Filing date: 2012-09-05

    IPC classification: G06F17/30

    CPC classification: G06Q50/22

    Abstract: Methods and systems for validating ingested data are disclosed. In accordance with the methods and systems, data elements can be received for storage in slots of an individual descriptor in a storage medium. In addition, at least one validation test can be selected based on a weighting of the data elements that indicates a respective degree of importance of the data elements. The selected validation test or tests can be applied to the data elements stored in the slots to generate respective validation results. Further, a validation score indicating a sufficiency of the stored data elements can be generated based on the validation results.
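
A minimal Python sketch of weight-driven test selection and scoring follows. The abstract does not specify how tests are selected or how the score is computed; the `min_weight` cutoff, the weighted pass ratio, the function name `validate`, and the sample descriptor are assumptions.

```python
def validate(descriptor, weights, tests, min_weight):
    """Run tests on sufficiently important slots and compute a score.

    descriptor: slot name -> stored data element.
    weights: slot name -> importance of that data element.
    tests: slot name -> predicate applied to the stored element.
    Only slots with weight >= min_weight are tested (assumed rule).
    """
    results = {}
    for slot, weight in weights.items():
        if weight >= min_weight and slot in tests:
            results[slot] = tests[slot](descriptor.get(slot))
    total = sum(weights[s] for s in results)
    passed = sum(weights[s] for s, ok in results.items() if ok)
    score = passed / total if total else 0.0
    return results, score


descriptor = {"name": "Ada Lovelace", "birth_year": 1815, "nickname": None}
weights = {"name": 5, "birth_year": 3, "nickname": 1}
tests = {
    "name": lambda v: bool(v),
    "birth_year": lambda v: isinstance(v, int) and 1900 <= v <= 2100,
    "nickname": lambda v: v is not None,
}
results, score = validate(descriptor, weights, tests, min_weight=2)
print(results, score)
```

The low-weight `nickname` slot is never tested, and the failed `birth_year` range check lowers the validation score in proportion to that slot's weight.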

    Method for duplicate detection on web-scale data in supercomputing environments
    10.
    Granted invention patent
    Method for duplicate detection on web-scale data in supercomputing environments (lapsed)

    Publication No.: US07363329B1

    Publication date: 2008-04-22

    Application No.: US11939378

    Filing date: 2007-11-13

    IPC classification: G06F17/30 G06F9/46

    Abstract: A method for duplicate detection on web-scale data in a supercomputing environment includes computing a hash of at least one document in a computer system to generate data packets from the at least one document and to generate a fixed size tuple of information from the at least one document, distributing the data packets to each node of the plurality of nodes, applying localized detection techniques to data packets on each node of the plurality of nodes to remove data packet duplicates, redistributing the data packets to each node of the plurality of nodes based on the document fingerprint, reapplying the localized detection techniques on each node to the redistributed packets to remove exact data packet duplicates, and performing a global merge of results of the localized detection techniques in a distributed fashion.

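
The hash, distribute, deduplicate, redistribute, and merge steps in this abstract can be simulated in a single Python process, as sketched below. This is only an illustration: the lists stand in for compute nodes, SHA-1 is an assumed fingerprint function, round-robin is an assumed initial distribution, and every function name is hypothetical.

```python
import hashlib


def fingerprint(text):
    """Fixed-size fingerprint of a document (SHA-1 is an assumption)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def local_dedup(packets):
    """Keep the first packet seen for each fingerprint on one node."""
    seen, unique = set(), []
    for doc_id, fp in packets:
        if fp not in seen:
            seen.add(fp)
            unique.append((doc_id, fp))
    return unique


def detect_duplicates(docs, num_nodes):
    """Return the sorted ids of documents that survive deduplication."""
    packets = [(doc_id, fingerprint(text)) for doc_id, text in docs]
    # Initial distribution: round-robin across the simulated nodes.
    nodes = [local_dedup(packets[i::num_nodes]) for i in range(num_nodes)]
    # Redistribution: equal fingerprints always land on the same node.
    shuffled = [[] for _ in range(num_nodes)]
    for node in nodes:
        for doc_id, fp in node:
            shuffled[int(fp, 16) % num_nodes].append((doc_id, fp))
    shuffled = [local_dedup(p) for p in shuffled]
    # Global merge of the per-node results.
    return sorted(doc_id for node in shuffled for doc_id, _ in node)


docs = [(1, "alpha"), (2, "beta"), (3, "alpha"), (4, "gamma"), (5, "beta")]
survivors = detect_duplicates(docs, num_nodes=2)
print(survivors)
```

The first local pass removes duplicates that happen to share a node; the fingerprint-based redistribution guarantees that any remaining exact duplicates meet on one node, so the second local pass can remove them without a global comparison.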