System, method, and service for using a focused random walk to produce samples on a topic from a collection of hyper-linked pages
    1.
    Type: Granted invention patent
    Status: Lapsed

    Publication number: US07640488B2

    Publication date: 2009-12-29

    Application number: US11004412

    Filing date: 2004-12-04

    IPC classification: G06F17/00 G06F17/20

    CPC classification: G06F17/30864

    Abstract: A focused random walk system produces samples of on-topic pages from a collection of hyper-linked pages such as Web pages. The focused random walk system utilizes a focused random walk to produce a focused sample, which is a random sample of Web pages focused on a topic. The focused random walk system uniformly samples pages iteratively, where each iteration follows a random link from a union of the in-links and out-links of a page. The system then classifies this randomly selected link to determine whether the page is on-topic. The random walk sampling process could comprise a hard-focus method that selects only on-topic pages at each step of the focused random walk, or a soft-focus method that allows limited divergence to off-topic pages.

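
The hard-focus and soft-focus variants described in the abstract can be illustrated with a minimal Python sketch. It is not the patented procedure: the toy graph, the `is_on_topic` classifier, the `max_off_topic` limit, and the function name `focused_walk` are all assumptions made for illustration, and the sketch omits whatever correction the system uses to make the resulting sample uniform.

```python
import random


def focused_walk(neighbors, is_on_topic, start, steps,
                 mode="hard", max_off_topic=1, seed=0):
    """Collect on-topic pages along a random walk.

    neighbors:     dict mapping a page to the union of its in-links and out-links
    is_on_topic:   hypothetical classifier, page -> bool
    mode:          "hard" never moves onto an off-topic page;
                   "soft" tolerates up to max_off_topic consecutive off-topic pages
    """
    rng = random.Random(seed)
    current = start
    off_run = 0          # consecutive off-topic pages visited so far
    samples = []
    for _ in range(steps):
        candidates = neighbors.get(current, [])
        if not candidates:
            break
        nxt = rng.choice(candidates)      # follow a random link
        if is_on_topic(nxt):              # classify the selected page
            samples.append(nxt)
            current = nxt
            off_run = 0
        elif mode == "soft" and off_run < max_off_topic:
            current = nxt                 # limited divergence off-topic
            off_run += 1
        # otherwise the step is rejected and the walk stays where it is
    return samples


# Toy link graph (assumed data); pages "a", "b", "c" are on-topic.
links = {
    "a": ["b", "x"],
    "b": ["a", "c"],
    "c": ["b", "y"],
    "x": ["a", "y"],
    "y": ["x", "c"],
}
on_topic = {"a", "b", "c"}

hard = focused_walk(links, on_topic.__contains__, "a", 50, mode="hard")
soft = focused_walk(links, on_topic.__contains__, "a", 50, mode="soft")
```

In both modes only pages the classifier accepts are added to the sample; the modes differ only in whether the walk may pass through off-topic pages on its way to other on-topic ones.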

    System and method for detecting matches of small edit distance
    3.
    Type: Invention application
    Status: Under examination (published)

    Publication number: US20070085716A1

    Publication date: 2007-04-19

    Application number: US11241468

    Filing date: 2005-09-30

    IPC classification: H03M7/30

    CPC classification: G06F16/90344

    Abstract: A system and method of approximating edit distance for a set of character strings in a database includes producing a representative sketch for each of the character strings; and approximating an edit distance between two selected character strings based only on the representative sketch for each of the selected character strings. The character strings may comprise text, wherein the method further comprises encoding positions of substrings in the text using anchors, wherein the anchors comprise identical substrings occurring in two input character strings at a nearby position. A set of anchors may be used in a correlated manner, wherein character strings with a sufficiently small edit distance are likely to use a same sequence of anchors. The character strings may be substantially non-repetitive. The representative sketch of a first character string is preferably constructed absent knowledge of a second character string. A size of the representative sketch may be constant.

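
For reference, the quantity that the sketches approximate is the exact edit distance between two strings. The Python sketch below computes it with the standard dynamic-programming recurrence (unit-cost insertions, deletions, and substitutions). It is a baseline only and does not implement the anchor-based sketching described in the abstract; the function name is an assumption.

```python
def edit_distance(a: str, b: str) -> int:
    """Exact Levenshtein distance between a and b, using O(len(b)) memory."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]
```

The exact computation needs both full strings and time proportional to the product of their lengths, which is the cost a fixed-size sketch per string is meant to avoid.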

    Identifying central entities
    5.
    Type: Granted invention patent
    Status: In force

    Publication number: US09009192B1

    Publication date: 2015-04-14

    Application number: US13153352

    Filing date: 2011-06-03

    IPC classification: G06F17/30 G06F7/00

    CPC classification: G06F17/30958

    Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for identifying central entities. In one aspect, a method includes obtaining candidate entities for a first resource; filtering a first entity graph whose nodes represent different entities found in a plurality of resources to remove nodes that do not correspond to a candidate entity, wherein pairs of nodes in the filtered first entity graph that are connected by an edge correspond to pairs of candidate entities that are associated with the same resource; generating a second entity graph for the first resource from the filtered first entity graph, wherein the second entity graph does not include nodes from the filtered first entity graph that are not connected to other nodes in the filtered first graph; and identifying candidate entities that are represented by nodes in the second entity graph as being central entities for the first resource.

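
The two graph steps in the abstract (filter the entity graph down to candidate entities, then discard nodes left without neighbours) can be sketched in Python as follows. The adjacency-dict representation, the function name `central_entities`, and the example data are assumptions for illustration only.

```python
def central_entities(entity_graph, candidates):
    """Return candidates that stay connected after filtering.

    entity_graph: dict mapping an entity to the set of entities it shares
                  an edge with (assumed undirected co-occurrence graph)
    candidates:   iterable of candidate entities for the resource
    """
    candidates = set(candidates)
    # Step 1: keep only candidate nodes and the edges between them.
    filtered = {
        node: {n for n in nbrs if n in candidates and n != node}
        for node, nbrs in entity_graph.items()
        if node in candidates
    }
    # Step 2: drop nodes with no remaining neighbours; the rest are central.
    return {node for node, nbrs in filtered.items() if nbrs}


# Assumed example data.
graph = {
    "Paris": {"France", "Eiffel Tower"},
    "France": {"Paris"},
    "Eiffel Tower": {"Paris"},
    "Jazz": {"Miles Davis"},
    "Miles Davis": {"Jazz"},
}
result = central_entities(graph, ["Paris", "France", "Jazz"])
```

Here "Jazz" is a candidate but its only neighbour is not, so it ends up isolated and is excluded, leaving "Paris" and "France".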

    Methods and Apparatus for Assessing Web Page Decay

    Publication number: US20080097988A1

    Publication date: 2008-04-24

    Application number: US11955458

    Filing date: 2007-12-13

    IPC classification: G06F17/30

    CPC classification: G06F17/3089

    Abstract: Systems and methods are herein disclosed for assessing the staleness of a web page. In particular, in one method of the present invention, the staleness of a web page is assessed by examining internal date references within the web page. In another method of the present invention, the staleness of a web page is assessed by examining the meta-data associated with the web page. In a further method of the present invention, the staleness of a hyperlinked web page is determined by examining the link status of the hyperlinks. If the web page has a relatively large number of dead links, it is assessed as being a stale web page. In a still further method of the present invention, the link status of web pages in the neighborhood of the web page being assessed is likewise examined.
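
The dead-link criterion can be illustrated with a short Python sketch. The abstract says only that a "relatively large number" of dead links marks a page as stale, so the 0.5 threshold, the function names, and the example URLs are assumptions; link statuses are passed in as data rather than fetched over the network.

```python
def dead_link_fraction(link_status):
    """Fraction of a page's hyperlinks that are dead.

    link_status: dict mapping each outgoing URL to True if it still
                 resolves and False if it is dead.
    """
    if not link_status:
        return 0.0
    dead = sum(1 for alive in link_status.values() if not alive)
    return dead / len(link_status)


def is_stale(link_status, threshold=0.5):
    """Assess a page as stale when its dead-link fraction reaches threshold.

    The threshold value is an assumption, not taken from the abstract.
    """
    return dead_link_fraction(link_status) >= threshold


# Assumed example: three of four links are dead.
status = {
    "http://example.com/a": True,
    "http://example.com/b": False,
    "http://example.com/c": False,
    "http://example.com/d": False,
}
fraction = dead_link_fraction(status)
stale = is_stale(status)
```

The neighbourhood variant mentioned in the abstract could reuse the same measure on the pages a page links to, but how those scores would be combined is not specified in the text above.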

    Identifying topical entities
    7.
    Type: Granted invention patent

    Publication number: US10068022B2

    Publication date: 2018-09-04

    Application number: US13153365

    Filing date: 2011-06-03

    IPC classification: G06F17/30

    Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for identifying topical entities. In one aspect, a method includes obtaining a plurality of entities that are associated with a first resource; for one or more of the identified entities, receiving search results for a search query derived from the entity; determining that search results for a search query including a particular entity include a specific type of search results; and determining that the particular entity is a topical entity of the first resource based at least in part on the particular entity appearing in a title or a resource locator of the first resource, wherein the topical entity of the first resource represents a predominant topic of the first resource.

    ENRICHING SEARCH RESULTS
    9.
    Type: Invention application
    Status: In force

    Publication number: US20120109941A1

    Publication date: 2012-05-03

    Application number: US13118026

    Filing date: 2011-05-27

    IPC classification: G06F17/30

    Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for enhancing search results. In one aspect, a method includes identifying a plurality of registered publishers for enriched search results and, for each registered publisher, obtaining enrichment information from the registered publisher and associating the enrichment information with a resource provided by the publisher. A query is received. A plurality of responsive resources that are responsive to the query are identified. A first responsive resource is determined to be associated with enrichment information. An enriched search result is provided, the enriched search result identifying the first responsive resource and including the first responsive resource's associated enrichment information.

    GENERATING ADDITIONAL CONTENT
    10.
    Type: Invention application
    Status: Under examination (published)

    Publication number: US20160026727A1

    Publication date: 2016-01-28

    Application number: US13153379

    Filing date: 2011-06-03

    IPC classification: G06F17/30

    CPC classification: G06F16/9535 G06F16/335

    Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for generating additional content. In one aspect, a method includes identifying one or more central entities, wherein each central entity represents a topic of a first resource being presented in a user interface; generating one or more search queries, each of the one or more search queries being derived from one or more of the central entities; obtaining search results for the one or more search queries from a search engine; selecting resources relevant to the first resource from resources referenced by the obtained search results; generating additional content for presentation in a user interface element of the user interface based on the selected resources; and categorizing the generated additional content into a plurality of categories, wherein each category of additional content is displayed in a separate portion of the user interface element.
