FOCUSED WEB CRAWLING SYSTEM AND METHOD THEREOF

    Publication No.: US20220318320A1

    Publication Date: 2022-10-06

    Application No.: US17356619

    Filing Date: 2021-06-24

    Abstract: The present invention relates to a system for focused web crawling comprising a crawler, a distiller, a queuing unit and a classifying module, arranged to carry out a method for focused web crawling. The method inputs a seed address into a subsequently formed address queue, iteratively extracts a primary address from the address queue, and iteratively checks whether the primary address is present in an address store. It then follows a series of steps to conduct a relevancy check of the addresses via a naive Bayes protocol: it simultaneously calculates a primary conditional probability of a set of predefined webpage(s) using the protocol, sequentially calculates a plurality of secondary conditional probabilities pertaining to the webpage(s) of the iteratively extracted primary addresses, and classifies the webpage(s) as relevant or irrelevant. Finally, it transfers the addresses of the relevant webpage(s) and the relevant set of addresses into the address queue, and otherwise into the address store.
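The loop described in the abstract can be illustrated with a minimal sketch. This is not the claimed implementation: the names `NaiveBayes`, `focused_crawl`, `fetch`, `TRAINING` and `TOY_WEB` are hypothetical, the web is replaced by an in-memory dictionary, and the classifier is an ordinary multinomial naive Bayes model with Laplace smoothing trained on a few labeled example pages. As a simplification, every extracted address is recorded in the address store so it is never processed twice, and only the links found on pages classified as relevant are fed back into the address queue.

```python
import math
import re
from collections import Counter, deque


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with Laplace smoothing (illustrative only)."""

    LABELS = ("relevant", "irrelevant")

    def __init__(self, labeled_pages):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = Counter()
        for text, label in labeled_pages:
            self.doc_counts[label] += 1
            self.word_counts[label].update(tokenize(text))
        self.vocab = set().union(*self.word_counts.values())

    def log_score(self, text, label):
        """Log prior plus summed log likelihoods of the known words."""
        counts = self.word_counts[label]
        denom = sum(counts.values()) + len(self.vocab)
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for word in tokenize(text):
            if word in self.vocab:  # words never seen in training are ignored
                score += math.log((counts[word] + 1) / denom)
        return score

    def is_relevant(self, text):
        return self.log_score(text, "relevant") > self.log_score(text, "irrelevant")


def focused_crawl(seed, fetch, classifier, max_pages=100):
    """Return the addresses of pages classified as relevant, in visit order."""
    queue = deque([seed])  # address queue, seeded with the start address
    store = set()          # address store: addresses already processed
    relevant = []
    while queue and len(store) < max_pages:
        address = queue.popleft()
        if address in store:  # presence check against the address store
            continue
        store.add(address)
        page = fetch(address)
        if page is None:      # unknown or unreachable address
            continue
        text, links = page
        if classifier.is_relevant(text):
            relevant.append(address)
            # Only links from relevant pages are queued for crawling.
            queue.extend(link for link in links if link not in store)
    return relevant


# Hypothetical training pages and a toy in-memory "web" for demonstration.
TRAINING = [
    ("python crawler queue parser", "relevant"),
    ("web crawler index spider", "relevant"),
    ("football match goal score", "irrelevant"),
    ("recipe pasta sauce dinner", "irrelevant"),
]
TOY_WEB = {
    "a": ("web crawler queue design", ["b", "c"]),
    "b": ("pasta recipe for dinner", ["d"]),
    "c": ("spider index parser", ["a", "d"]),
    "d": ("football goal highlights", []),
}

classifier = NaiveBayes(TRAINING)
result = focused_crawl("a", TOY_WEB.get, classifier)
```

Passing `TOY_WEB.get` as `fetch` keeps the sketch free of network access; a real crawler would substitute a function that downloads a page and extracts its text and outgoing links.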
