-
Publication No.: US20100211533A1
Publication date: 2010-08-19
Application No.: US12388517
Filing date: 2009-02-18
Applicant: Jiangming Yang, Rui Cai, Lei Zhang, Wei-Ying Ma
Inventor: Jiangming Yang, Rui Cai, Lei Zhang, Wei-Ying Ma
CPC classification: G06N20/00, G06F16/958
Abstract: The web forum data extraction technique is designed for the structured data extraction of data on web forums using both page-level information and site-level knowledge. To do this, the technique finds the kinds of page objects a forum site has, which object a page belongs to, and how different page objects are connected with each other. This information can be obtained by re-constructing the sitemap of the target forum which is based on a Data Object Model of the target forum. The web forum data extraction technique collects three kinds of evidence for data extraction: 1) inner-page features which cover both semantic and layout information on an individual page; 2) inter-vertex features which describe linkage-related observations; and 3) inner-vertex features which characterize interrelationships among pages in one vertex. The technique employs Markov Logic Networks to combine the types of evidence statistically for inference and thereby can extract the desired structures.
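For readers unfamiliar with Markov Logic Networks, the standard textbook definition of the joint distribution is reproduced here as background. It is not taken from the patent text: the symbols $w_i$ (weight of formula $i$), $n_i(x)$ (number of true groundings of formula $i$ in world $x$), and $Z$ (normalizing constant) are the usual notation, and mapping the three kinds of evidence to formulas is only an assumption about how such a model would be set up.

```latex
P(X = x) \;=\; \frac{1}{Z}\,\exp\!\Big(\sum_{i} w_i\, n_i(x)\Big),
\qquad
Z \;=\; \sum_{x'} \exp\!\Big(\sum_{i} w_i\, n_i(x')\Big)
```

Under this reading, inner-page, inter-vertex, and inner-vertex features would each contribute weighted formulas to the sum, and inference would select the labeling $x$ with the highest probability.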
-
Publication No.: US08099408B2
Publication date: 2012-01-17
Application No.: US12163895
Filing date: 2008-06-27
Applicant: Lei Zhang, Wei-Ying Ma, Wei Lai, Jiangming Yang, Rui Cai
Inventor: Lei Zhang, Wei-Ying Ma, Wei Lai, Jiangming Yang, Rui Cai
CPC classification: G06F17/30864
Abstract: A method and system for identifying informative links of a web site for use in crawling the web site is provided. A forum crawler analyzes sample web pages of a web forum to identify informative links and then crawls the web forum by following links determined to be informative and not following other links. The forum crawler system determines whether links are informative based on whether they are part of the overall structure of the web site or are used to select sequential information that has been split onto multiple web pages.
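The crawl step in the abstract above can be illustrated with a minimal sketch. It assumes the analysis step has already labeled each link; the labels `structure`, `page_flip`, and `other`, the toy `site` graph, and the function name `crawl` are all hypothetical and do not come from the patent.

```python
from collections import deque

# Hypothetical labels: links that are part of the site structure, or that
# flip between pages of one split-up thread, count as informative.
INFORMATIVE_KINDS = {"structure", "page_flip"}

def crawl(site, start):
    """Breadth-first crawl that follows only links labeled informative."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        url = queue.popleft()
        order.append(url)
        for target, kind in site.get(url, []):
            if kind in INFORMATIVE_KINDS and target not in seen:
                seen.add(target)
                queue.append(target)
    return order

# Toy in-memory forum: page URL -> list of (link target, assumed label).
site = {
    "/index": [("/board/1", "structure"), ("/login", "other")],
    "/board/1": [("/thread/7", "structure"), ("/board/1?sort=date", "other")],
    "/thread/7": [("/thread/7?page=2", "page_flip"), ("/profile/42", "other")],
}
visited = crawl(site, "/index")
print(visited)
```

In this toy graph the login, re-sorted board, and profile links are never fetched, which is the saving the abstract describes.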
-
Publication No.: US08051083B2
Publication date: 2011-11-01
Application No.: US12103712
Filing date: 2008-04-16
Applicant: Wei Lai, Rui Cai, Jiangming Yang, Lei Zhang, Wei-Ying Ma
Inventor: Wei Lai, Rui Cai, Jiangming Yang, Lei Zhang, Wei-Ying Ma
CPC classification: G06Q10/10
Abstract: Described is a technology by which forum web pages are processed into clusters for classification purposes, including by determining repetitive regions between pages and associating pages that have similar repetitive regions into a common cluster. Patterns corresponding to the regions are determined, and a feature set based at least in part on those patterns (e.g., pattern frequency) is extracted from the page. The feature set of a page is compared against the feature set of another page to determine similarity therewith, e.g., via a feature space distance computation that is evaluated against a threshold distance.
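The threshold comparison in the abstract above might look like the following sketch. The abstract does not name a distance measure or a clustering procedure, so Euclidean distance and a greedy single-pass assignment are assumptions, as are the pattern names and frequency values in the sample data.

```python
import math

def distance(a, b):
    """Euclidean distance between two sparse pattern-frequency vectors."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))

def cluster_pages(features, threshold):
    """Greedy single pass: a page joins the first cluster whose seed page
    lies within the threshold distance, otherwise it starts a new cluster."""
    clusters = []  # each cluster is a list of page ids; the first id is the seed
    for page_id, vec in features.items():
        for members in clusters:
            if distance(vec, features[members[0]]) <= threshold:
                members.append(page_id)
                break
        else:
            clusters.append([page_id])
    return clusters

# Hypothetical pattern frequencies for three forum pages.
pages = {
    "list_1": {"tr.thread-row": 0.9, "div.post": 0.0},
    "list_2": {"tr.thread-row": 0.8, "div.post": 0.1},
    "post_1": {"tr.thread-row": 0.0, "div.post": 0.95},
}
clusters = cluster_pages(pages, threshold=0.3)
print(clusters)
```

The two thread-list pages share a repetitive region and fall in one cluster, while the post page starts its own.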
-
Publication No.: US20100205168A1
Publication date: 2010-08-12
Application No.: US12368768
Filing date: 2009-02-10
Applicant: Jiangming Yang, Rui Cai, Lei Zhang, Wei-Ying Ma
Inventor: Jiangming Yang, Rui Cai, Lei Zhang, Wei-Ying Ma
IPC classification: G06F17/30
CPC classification: G06F16/951
Abstract: The incremental web forum crawling technique described herein is a web forum crawling technique that employs a thread-wise strategy that takes into account thread-level statistics, for example, the number of replies and the frequency of replies, to estimate the activity trend of each thread. To extract such statistical information, the technique employs a simple yet very robust approach to extract the timestamp of each post in a discussion thread. It also employs a regression model to predict the time of the next post for each thread.
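As an illustration of predicting the next post time, the sketch below fits an ordinary least-squares line through (post index, timestamp) pairs and extrapolates one step. The abstract does not specify the form of the regression model, so the linear fit, the function name, and the sample timestamps (in hours) are assumptions.

```python
def predict_next_post_time(timestamps):
    """Fit timestamp = intercept + slope * index by least squares and
    return the fitted value for the next index."""
    n = len(timestamps)
    if n < 2:
        raise ValueError("need at least two posts to fit a line")
    mean_x = (n - 1) / 2
    mean_t = sum(timestamps) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxt = sum((x - mean_x) * (t - mean_t) for x, t in enumerate(timestamps))
    slope = sxt / sxx
    intercept = mean_t - slope * mean_x
    return intercept + slope * n

# Hypothetical thread with a post every two hours.
next_time = predict_next_post_time([0.0, 2.0, 4.0, 6.0])
print(next_time)
```

A crawler could use such an estimate to schedule the next visit to each thread, revisiting active threads sooner than dormant ones.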
-
Publication No.: US08700600B2
Publication date: 2014-04-15
Application No.: US13351952
Filing date: 2012-01-17
Applicant: Lei Zhang, Wei-Ying Ma, Wei Lai, Jiangming Yang, Rui Cai
Inventor: Lei Zhang, Wei-Ying Ma, Wei Lai, Jiangming Yang, Rui Cai
CPC classification: G06F17/30864
Abstract: A method and system for identifying informative links of a web site for use in crawling the web site is provided. A forum crawler analyzes sample web pages of a web forum to identify informative links and then crawls the web forum by following links determined to be informative and not following other links. The forum crawler system determines whether links are informative based on whether they are part of the overall structure of the web site or are used to select sequential information that has been split onto multiple web pages.
-
Publication No.: US20090327237A1
Publication date: 2009-12-31
Application No.: US12163895
Filing date: 2008-06-27
Applicant: Lei Zhang, Wei-Ying Ma, Wei Lai, Jiangming Yang, Rui Cai
Inventor: Lei Zhang, Wei-Ying Ma, Wei Lai, Jiangming Yang, Rui Cai
CPC classification: G06F17/30864
Abstract: A method and system for identifying informative links of a web site for use in crawling the web site is provided. A forum crawler analyzes sample web pages of a web forum to identify informative links and then crawls the web forum by following links determined to be informative and not following other links. The forum crawler system determines whether links are informative based on whether they are part of the overall structure of the web site or are used to select sequential information that has been split onto multiple web pages.
-
Publication No.: US20120117052A1
Publication date: 2012-05-10
Application No.: US13351952
Filing date: 2012-01-17
Applicant: Lei Zhang, Wei-Ying Ma, Wei Lai, Jiangming Yang, Rui Cai
Inventor: Lei Zhang, Wei-Ying Ma, Wei Lai, Jiangming Yang, Rui Cai
IPC classification: G06F17/30
CPC classification: G06F17/30864
Abstract: A method and system for identifying informative links of a web site for use in crawling the web site is provided. A forum crawler analyzes sample web pages of a web forum to identify informative links and then crawls the web forum by following links determined to be informative and not following other links. The forum crawler system determines whether links are informative based on whether they are part of the overall structure of the web site or are used to select sequential information that has been split onto multiple web pages.
-
Publication No.: US20090265363A1
Publication date: 2009-10-22
Application No.: US12103712
Filing date: 2008-04-16
Applicant: Wei Lai, Rui Cai, Jiangming Yang, Lei Zhang, Wei-Ying Ma
Inventor: Wei Lai, Rui Cai, Jiangming Yang, Lei Zhang, Wei-Ying Ma
IPC classification: G06F17/30
CPC classification: G06Q10/10
Abstract: Described is a technology by which forum web pages are processed into clusters for classification purposes, including by determining repetitive regions between pages and associating pages that have similar repetitive regions into a common cluster. Patterns corresponding to the regions are determined, and a feature set based at least in part on those patterns (e.g., pattern frequency) is extracted from the page. The feature set of a page is compared against the feature set of another page to determine similarity therewith, e.g., via a feature space distance computation that is evaluated against a threshold distance.
-
Publication No.: US08370119B2
Publication date: 2013-02-05
Application No.: US12389368
Filing date: 2009-02-19
Applicant: Rui Cai, Jiang-Ming Yang, Lei Zhang, Wei-Ying Ma
Inventor: Rui Cai, Jiang-Ming Yang, Lei Zhang, Wei-Ying Ma
IPC classification: G06G7/48
CPC classification: G06F17/218, G06F8/75, G06F17/27
Abstract: Website design pattern modeling technique embodiments are presented that model a website's design patterns. This can be based on the website's layout elements, its URL tokens, or both. When based on both, the design patterns can be modeled separately using first the layout elements and then the URL tokens, or vice versa. Alternately, the modeling can be based on coupled layout and URL token patterns. In operation, the modeling involves first identifying layout elements and/or URL tokens found on at least some of the pages of the website. The website design patterns are then modeled based on the occurrences of the identified layout elements and/or URL tokens in pages of the website. In cases where a coupled modeling scheme is employed, a modeling technique that exploits the correlations between the layout elements and URL tokens is used.
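The URL-token side of the abstract above can be sketched as follows. The abstract does not define how tokens are obtained, so this tokenization (path segments plus query-parameter names, with digit runs replaced by `<num>`) is an assumption, and the example URLs on `forum.example.com` are made up for illustration.

```python
import re
from collections import Counter
from urllib.parse import urlsplit, parse_qsl

def url_tokens(url):
    """Return a tuple of path segments and query-parameter names,
    with every run of digits generalized to the placeholder <num>."""
    parts = urlsplit(url)
    tokens = [seg for seg in parts.path.split("/") if seg]
    tokens += [key for key, _ in parse_qsl(parts.query, keep_blank_values=True)]
    return tuple(re.sub(r"\d+", "<num>", tok) for tok in tokens)

# Hypothetical forum URLs.
urls = [
    "http://forum.example.com/showthread.php?t=101&page=2",
    "http://forum.example.com/showthread.php?t=205&page=1",
    "http://forum.example.com/forumdisplay.php?f=3",
    "http://forum.example.com/board/12/thread/340",
]
pattern_counts = Counter(url_tokens(u) for u in urls)
print(pattern_counts)
```

Counting how often each token pattern occurs across pages gives the kind of occurrence statistics from which a design pattern model could be built; coupling these counts with layout elements is outside the scope of this sketch.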
-
Publication No.: US20100211927A1
Publication date: 2010-08-19
Application No.: US12389368
Filing date: 2009-02-19
Applicant: Rui Cai, Jiang-Ming Yang, Lei Zhang, Wei-Ying Ma
Inventor: Rui Cai, Jiang-Ming Yang, Lei Zhang, Wei-Ying Ma
IPC classification: G06F9/44
CPC classification: G06F17/218, G06F8/75, G06F17/27
Abstract: Website design pattern modeling technique embodiments are presented that model a website's design patterns. This can be based on the website's layout elements, its URL tokens, or both. When based on both, the design patterns can be modeled separately using first the layout elements and then the URL tokens, or vice versa. Alternately, the modeling can be based on coupled layout and URL token patterns. In operation, the modeling involves first identifying layout elements and/or URL tokens found on at least some of the pages of the website. The website design patterns are then modeled based on the occurrences of the identified layout elements and/or URL tokens in pages of the website. In cases where a coupled modeling scheme is employed, a modeling technique that exploits the correlations between the layout elements and URL tokens is used.