Systems and methods for using anchor text as parallel corpora for cross-language information retrieval
    1.
    Granted invention patent
    Systems and methods for using anchor text as parallel corpora for cross-language information retrieval (in force)

    Publication No.: US08190608B1

    Publication date: 2012-05-29

    Application No.: US13174209

    Filing date: 2011-06-30

    IPC classification: G06F17/30

    Abstract: A system performs cross-language query translations. The system receives a search query that includes terms in a first language and determines possible translations of the terms of the search query into a second language. The system also locates documents for use as parallel corpora to aid in the translation by: (1) locating documents in the first language that contain references that match the terms of the search query and identify documents in the second language; (2) locating documents in the first language that contain references that match the terms of the query and refer to other documents in the first language and identify documents in the second language that contain references to the other documents; or (3) locating documents in the first language that match the terms of the query and identify documents in the second language that contain references to the documents in the first language. The system may use the second language documents as parallel corpora to disambiguate among the possible translations of the terms of the search query and identify one of the possible translations as a likely translation of the search query into the second language.
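
The disambiguation step described in this abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name `pick_translation`, the toy dictionary, and the scoring rule (count the second-language anchor texts that contain every word of a candidate translation) are all assumptions made for illustration.

```python
from itertools import product


def pick_translation(query_terms, dictionary, anchor_texts):
    """Pick the candidate translation best supported by second-language anchor text."""
    # Every combination of per-term dictionary translations is a candidate.
    candidates = product(*(dictionary[term] for term in query_terms))
    tokenized = [set(text.lower().split()) for text in anchor_texts]

    def support(candidate):
        # Assumed scoring rule: number of anchors containing every candidate word.
        return sum(all(word in words for word in candidate) for words in tokenized)

    return max(candidates, key=support)


# Hypothetical toy data: a Spanish query and English anchor texts.
dictionary = {"banco": ["bank", "bench"], "central": ["central", "main"]}
anchors = ["European Central Bank", "central bank rates", "park bench photos"]
best = pick_translation(["banco", "central"], dictionary, anchors)
print(best)
```

Here the anchor texts act as the parallel corpus: "bank" and "central" co-occur in two anchors, while no other combination co-occurs at all, so the ambiguous term "banco" resolves to "bank".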

    Systems and methods for using anchor text as parallel corpora for cross-language information retrieval
    2.
    Granted invention patent
    Systems and methods for using anchor text as parallel corpora for cross-language information retrieval (in force)

    Publication No.: US07996402B1

    Publication date: 2011-08-09

    Application No.: US12872755

    Filing date: 2010-08-31

    IPC classification: G06F17/30

    Abstract: A system performs cross-language query translations. The system receives a search query that includes terms in a first language and determines possible translations of the terms of the search query into a second language. The system also locates documents for use as parallel corpora to aid in the translation by: (1) locating documents in the first language that contain references that match the terms of the search query and identify documents in the second language; (2) locating documents in the first language that contain references that match the terms of the query and refer to other documents in the first language and identify documents in the second language that contain references to the other documents; or (3) locating documents in the first language that match the terms of the query and identify documents in the second language that contain references to the documents in the first language. The system may use the second language documents as parallel corpora to disambiguate among the possible translations of the terms of the search query and identify one of the possible translations as a likely translation of the search query into the second language.

    String predicate selectivity estimation
    3.
    Granted invention patent
    String predicate selectivity estimation (lapsed)

    Publication No.: US07149735B2

    Publication date: 2006-12-12

    Application No.: US10603035

    Filing date: 2003-06-24

    IPC classification: G06F17/30

    Abstract: A method of estimating selectivity of a given string predicate in a database query. In the method, selectivities of substrings of various lengths are estimated. For example, the selectivity of substrings between length 1 (or some constant q) and the length of the given string predicate may be estimated. The method then selects a candidate substring for each substring length based on the estimated selectivities of the substrings. The estimated selectivities of the candidate substrings are combined, and the combined estimate is returned as the estimated selectivity of the given string predicate.
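
The per-length candidate selection in this abstract can be sketched in Python as follows. Several parts are assumptions rather than details from the abstract: the per-substring estimator is replaced by an exact scan of the rows, the candidate for each length is taken to be the least frequent substring, and the combination step defaults to `min` only as a placeholder.

```python
def substrings(s, length):
    """All distinct substrings of `s` with the given length."""
    return {s[i:i + length] for i in range(len(s) - length + 1)}


def estimate_selectivity(pattern, rows, q=1, combine=min):
    """Estimate the fraction of rows that contain `pattern`."""

    def selectivity(sub):
        # Stand-in for a real estimator: exact fraction of rows containing `sub`.
        return sum(sub in row for row in rows) / len(rows)

    candidates = []
    for length in range(q, len(pattern) + 1):
        # Assumed rule: the candidate for each length is its most selective substring.
        candidates.append(min(selectivity(sub) for sub in substrings(pattern, length)))
    # `combine` is a placeholder for whatever combination the method actually uses.
    return combine(candidates)


rows = ["seattle", "seaside", "settle", "battle"]
estimate = estimate_selectivity("sea", rows)
print(estimate)
```

With exact substring statistics, the minimum over candidates is an upper bound on the true selectivity, because a row that contains the full pattern must contain each of its substrings.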

    Systems and methods for using anchor text as parallel corpora for cross-language information retrieval
    4.
    Granted invention patent
    Systems and methods for using anchor text as parallel corpora for cross-language information retrieval (in force)

    Publication No.: US07814103B1

    Publication date: 2010-10-12

    Application No.: US11468674

    Filing date: 2006-08-30

    IPC classification: G06F17/30

    Abstract: A system performs cross-language query translations. The system receives a search query that includes terms in a first language and determines possible translations of the terms of the search query into a second language. The system also locates documents for use as parallel corpora to aid in the translation by: (1) locating documents in the first language that contain references that match the terms of the search query and identify documents in the second language; (2) locating documents in the first language that contain references that match the terms of the query and refer to other documents in the first language and identify documents in the second language that contain references to the other documents; or (3) locating documents in the first language that match the terms of the query and identify documents in the second language that contain references to the documents in the first language. The system may use the second language documents as parallel corpora to disambiguate among the possible translations of the terms of the search query and identify one of the possible translations as a likely translation of the search query into the second language.

    Method of building multidimensional workload-aware histograms
    5.
    Granted invention patent
    Method of building multidimensional workload-aware histograms (lapsed)

    Publication No.: US07007039B2

    Publication date: 2006-02-28

    Application No.: US09881500

    Filing date: 2001-06-14

    IPC classification: G06F17/30

    Abstract: In a database system, a method of maintaining a self-tuning histogram having a plurality of existing rectangular buckets arranged in a hierarchical manner and defined by at least two bucket boundaries, a bucket volume, and a bucket frequency. At least one new bucket is created in response to a query on the database. Each new bucket is contained within at least one existing bucket; the new bucket becomes a child bucket and the existing bucket containing it becomes a parent bucket. The boundaries of each new bucket correspond to a region of the database accessed by the query, and the frequency of the new bucket is the number of data records returned by the query. Buckets may be merged based on a merge criterion, such as similar bucket density, when the total number of buckets exceeds a predetermined budget.
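
The parent/child bucket nesting described in this abstract can be sketched in Python as below. The `Bucket` class and its method names are hypothetical, and the sketch assumes each new bucket lies entirely inside an existing one and does not partially overlap its siblings. Adjusting parent frequencies and merging buckets under a budget are left out.

```python
from dataclasses import dataclass, field
from math import prod


@dataclass
class Bucket:
    lo: tuple    # lower corner, one coordinate per dimension
    hi: tuple    # upper corner
    freq: float  # number of records attributed to the bucket
    children: list = field(default_factory=list)

    def volume(self):
        return prod(h - l for l, h in zip(self.lo, self.hi))

    def density(self):
        # Density is the kind of quantity a merge criterion could compare.
        return self.freq / self.volume()

    def contains(self, other):
        return all(sl <= ol and oh <= sh
                   for sl, sh, ol, oh in zip(self.lo, self.hi, other.lo, other.hi))

    def insert(self, new):
        """Nest `new` under the deepest bucket that contains it; return its parent."""
        for child in self.children:
            if child.contains(new):
                return child.insert(new)
        # Existing children that fall inside the new bucket become its children.
        new.children = [c for c in self.children if new.contains(c)]
        self.children = [c for c in self.children if not new.contains(c)] + [new]
        return self


# Hypothetical query feedback: each query region becomes a bucket whose
# frequency is the number of records that query returned.
root = Bucket((0, 0), (100, 100), freq=1000)
root.insert(Bucket((10, 10), (50, 50), freq=300))
parent = root.insert(Bucket((20, 20), (30, 30), freq=120))
print(parent.lo, parent.hi)
```

The second query region lies inside the first, so it is nested one level deeper instead of being attached directly to the root.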

    Method for cost-based optimization over multimedia repositories
    6.
    Granted invention patent
    Method for cost-based optimization over multimedia repositories (lapsed)

    Publication No.: US5806061A

    Publication date: 1998-09-08

    Application No.: US859556

    Filing date: 1997-05-20

    IPC classification: G06F17/30 G06F17/00

    Abstract: A method for optimizing the cost of searches through a multimedia repository is disclosed where the repository contains a plurality of objects having at least two different attributes such as color in a newspaper photograph and text in the subtitle. The method comprises selecting a ranking expression, translating the ranking expression into resulting filter conditions and then optimizing the resulting filter conditions to perform the search. A database look-up step is included which determines the cost of performing searches over the various subconditions of the filter condition. The least costly subcondition is searched first to retrieve objects from the multimedia repository. The remaining subconditions are then evaluated on the retrieved objects using either a search step or probe step depending upon the determined cost to perform each. A further database look-up step predicts a grade of match necessary in the translated ranking expression to retrieve at least the number of objects requested in the search.
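
The search-versus-probe choice described in this abstract can be sketched in Python as follows. The cost model is an assumption: each subcondition has a hypothetical fixed search cost and a per-object probe cost, and a probe is preferred when probing every retrieved object costs no more than a fresh search. The function name `plan` and the numbers are illustrative only.

```python
def plan(subconditions, expected_hits):
    """Order the evaluation of filter subconditions.

    subconditions maps a name to (search_cost, probe_cost_per_object);
    expected_hits is the number of objects the first search is expected to return.
    """
    # The least costly subcondition is searched first.
    first = min(subconditions, key=lambda name: subconditions[name][0])
    steps = [(first, "search")]
    for name, (search_cost, probe_cost) in subconditions.items():
        if name == first:
            continue
        # Assumed model: probing costs probe_cost for each retrieved object.
        probe_total = probe_cost * expected_hits
        steps.append((name, "probe" if probe_total <= search_cost else "search"))
    return steps


# Hypothetical costs for a color condition and a text condition.
costs = {"color": (50.0, 2.0), "text": (10.0, 0.1)}
steps = plan(costs, expected_hits=20)
print(steps)
```

With these numbers the text condition is searched first, and the color condition is then probed on the 20 retrieved objects because 20 probes cost 40, less than a full color search at 50.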

    Systems and methods for using anchor text as parallel corpora for cross-language information retrieval
    7.
    Granted invention patent
    Systems and methods for using anchor text as parallel corpora for cross-language information retrieval (in force)

    Publication No.: US08631010B1

    Publication date: 2014-01-14

    Application No.: US13474957

    Filing date: 2012-05-18

    IPC classification: G06F17/30

    Abstract: A method may include obtaining, based on a content of a search query, one or more documents in a first language; identifying one or more documents in a second language that contain an anchor that links to the one or more documents in the first language, the second language being different than the first language; and translating one or more terms of the search query into the second language using content included in the one or more documents in the second language.

    Systems and methods for using anchor text as parallel corpora for cross-language information retrieval
    8.
    Granted invention patent
    Systems and methods for using anchor text as parallel corpora for cross-language information retrieval (in force)

    Publication No.: US07146358B1

    Publication date: 2006-12-05

    Application No.: US09939661

    Filing date: 2001-08-28

    IPC classification: G06F17/30 G06F7/00

    Abstract: A system performs cross-language query translations. The system receives a search query that includes terms in a first language and determines possible translations of the terms of the search query into a second language. The system also locates documents for use as parallel corpora to aid in the translation by: (1) locating documents in the first language that contain references that match the terms of the search query and identify documents in the second language; (2) locating documents in the first language that contain references that match the terms of the query and refer to other documents in the first language and identify documents in the second language that contain references to the other documents; or (3) locating documents in the first language that match the terms of the query and identify documents in the second language that contain references to the documents in the first language. The system may use the second language documents as parallel corpora to disambiguate among the possible translations of the terms of the search query and identify one of the possible translations as a likely translation of the search query into the second language.

    Text joins for data cleansing and integration in a relational database management system
    9.
    Invention application
    Text joins for data cleansing and integration in a relational database management system (under examination, published)

    Publication No.: US20050027717A1

    Publication date: 2005-02-03

    Application No.: US10828819

    Filing date: 2004-04-21

    IPC classification: G06F7/02 G06F17/30

    Abstract: An organization's data records are often noisy because of transcription errors, incomplete information, and lack of standard formats for textual data. A fundamental task during data cleansing and integration is matching strings, perhaps across multiple relations, that refer to the same entity (e.g., organization name or address). Furthermore, it is desirable to perform this matching within an RDBMS, which is where the data is likely to reside. In this paper, we adapt the widely used and established cosine similarity metric from the information retrieval field to the relational database context in order to identify potential string matches across relations. We then use this similarity metric to characterize this key aspect of data cleansing and integration as a join between relations on textual attributes, where the similarity of matches exceeds a specified threshold. Computing an exact answer to the text join can be expensive. For query processing efficiency, we propose an approximate, sampling-based approach to the join problem that can be easily and efficiently executed in a standard, unmodified RDBMS. Therefore, the present invention includes a system for string matching across multiple relations in a relational database management system comprising generating a set of strings from a set of characters, decomposing each string into a subset of tokens, establishing at least two relations within the strings, establishing a similarity threshold for the relations, sampling the at least two relations, correlating the relations for the similarity threshold and returning all of the tokens which meet the criteria of the similarity threshold.
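
The text join described in this abstract can be sketched in plain Python as an exact (not sampled) join. The choice of character 3-grams as tokens, the particular tf-idf weighting, and the function names are assumptions for illustration. The sampling-based approximation and the in-RDBMS execution that the abstract proposes are not shown.

```python
import math
from collections import Counter


def qgrams(s, q=3):
    """Lower-cased character q-grams; a short string yields itself as one token."""
    s = s.lower()
    return [s[i:i + q] for i in range(max(len(s) - q + 1, 1))]


def tfidf_vectors(strings, q=3):
    """Unit-length tf-idf vectors over q-gram tokens (assumed weighting scheme)."""
    docs = [Counter(qgrams(s, q)) for s in strings]
    df = Counter(tok for doc in docs for tok in doc)  # document frequency per token
    n = len(docs)
    vectors = []
    for doc in docs:
        vec = {tok: tf * math.log(1 + n / df[tok]) for tok, tf in doc.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({tok: w / norm for tok, w in vec.items()})
    return vectors


def text_join(left, right, threshold=0.5, q=3):
    """Return pairs whose cosine similarity exceeds the threshold."""
    vectors = tfidf_vectors(left + right, q)
    lvecs, rvecs = vectors[:len(left)], vectors[len(left):]
    matches = []
    for i, a in enumerate(lvecs):
        for j, b in enumerate(rvecs):
            # The vectors are normalized, so the dot product is the cosine similarity.
            sim = sum(w * b.get(tok, 0.0) for tok, w in a.items())
            if sim > threshold:
                matches.append((left[i], right[j], sim))
    return matches


# Hypothetical relations holding organization names.
matches = text_join(["Acme Corp"], ["ACME Corp.", "Zenith Ltd"])
print(matches)
```

The nested loop compares every pair, which is the expensive exact computation the abstract mentions; the approximate approach it proposes exists to avoid that cost.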

    Method of packet routing in torus networks with two buffers per edge
    10.
    Granted invention patent
    Method of packet routing in torus networks with two buffers per edge (lapsed)

    Publication No.: US5444701A

    Publication date: 1995-08-22

    Application No.: US969650

    Filing date: 1992-10-29

    IPC classification: H04L12/56 H04L12/42

    CPC classification: H04L45/06

    Abstract: A method is provided for routing packets in parallel computers with torus interconnection networks of arbitrary size and dimension having a plurality of nodes, each of which contains at least two buffers per edge incident to the node. For each packet which is being routed or which is being injected into the communication network, a waiting set is specified which consists of those buffers to which the packet can be transferred. The packet can be transferred to any buffer in its waiting set which has enough storage available to hold the packet. This waiting set is specified by first defining a set of nodes to which the packet is allowed to move and then defining a candidate set of buffers within the defined set of nodes. An ordering of the nodes across the network, from smallest to largest, is then defined. The buffers in each node are then classified into four classes. After the buffers in each node have been classified, a set of rules is provided for placing into the waiting set those classes of candidate buffers to which the packet can move, such that the routing method is free of deadlock, livelock, and starvation.
