Publication No.: US20220318320A1
Publication Date: 2022-10-06
Application No.: US17356619
Filing Date: 2021-06-24
Applicant: Rajesh Kumar Bhatia, Manish Kumar, Kashish Bhatia
Inventor: Rajesh Kumar Bhatia, Manish Kumar, Sanjeev Sofat, Sanjay Batish, Pardeep Kumar, Kapil Madan, Kashish Bhatia
IPC: G06F16/951
Abstract: The present invention relates to a system for focused web crawling comprising a crawler, a distiller, a queuing unit, and a classifying module. The system carries out a method for focused web crawling that inputs a seed address into an address queue formed for that purpose, iteratively extracts a primary address from the address queue, and iteratively checks whether the primary address is already present in an address store. The method then follows a series of steps to check the relevancy of the addresses via a naive Bayes protocol: it simultaneously calculates a primary conditional probability for a set of predefined webpage(s) using the protocol, sequentially calculates a plurality of secondary conditional probabilities for the webpage(s) of the iteratively extracted primary addresses, and classifies the webpage(s) as relevant or irrelevant. Finally, it transfers the addresses of the relevant webpage(s) and the relevant set of addresses into the address queue, and otherwise into the address store.
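
The crawl loop outlined in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the names `NaiveBayes`, `focused_crawl`, `WEB`, and `TRAINING` are hypothetical, the "web" is an in-memory dictionary instead of live HTTP fetching, the classifier is an ordinary multinomial naive Bayes with Laplace smoothing, and the address store is assumed to be a set of addresses already handled, which the abstract does not spell out.

```python
import math
from collections import Counter, deque


def tokenize(text):
    """Split page text into lowercase word tokens."""
    return text.lower().split()


class NaiveBayes:
    """Two-class multinomial naive Bayes (assumed stand-in for the 'protocol')."""

    LABELS = ("relevant", "irrelevant")

    def __init__(self, labeled_pages):
        # Learn word statistics from a set of predefined, labeled pages.
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = Counter()
        for text, label in labeled_pages:
            self.doc_counts[label] += 1
            self.word_counts[label].update(tokenize(text))
        self.vocab = set().union(*self.word_counts.values())

    def log_score(self, text, label):
        # log P(label) + sum of log P(word | label), with add-one smoothing.
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        total_words = sum(self.word_counts[label].values())
        denominator = total_words + len(self.vocab)
        for word in tokenize(text):
            score += math.log((self.word_counts[label][word] + 1) / denominator)
        return score

    def is_relevant(self, text):
        return self.log_score(text, "relevant") > self.log_score(text, "irrelevant")


def focused_crawl(seed, web, classifier, max_pages=100):
    """Crawl from `seed`, following out-links only from relevant pages."""
    queue = deque([seed])   # address queue
    store = set()           # address store: addresses already handled (assumption)
    relevant = []
    while queue and len(store) < max_pages:
        address = queue.popleft()
        if address in store:        # skip addresses already in the store
            continue
        store.add(address)
        page = web.get(address)     # stand-in for fetching the page
        if page is None:
            continue
        text, links = page
        if classifier.is_relevant(text):
            relevant.append(address)
            queue.extend(link for link in links if link not in store)
    return relevant, store


# Hypothetical toy data: address -> (page text, out-links).
WEB = {
    "a": ("python crawler tutorial web crawling", ["b", "c"]),
    "b": ("focused crawler classifier web pages", ["d"]),
    "c": ("chocolate cake recipe baking", ["e"]),
    "d": ("web crawling queue python", ["a"]),
    "e": ("focused web crawler hidden behind recipe", []),
}
TRAINING = [
    ("web crawler python crawling pages", "relevant"),
    ("focused crawler queue classifier", "relevant"),
    ("cake recipe baking chocolate", "irrelevant"),
    ("football match score", "irrelevant"),
]

if __name__ == "__main__":
    found, seen = focused_crawl("a", WEB, NaiveBayes(TRAINING))
    print("relevant:", found)
    print("store:", sorted(seen))
```

In this toy run, page "c" is classified as irrelevant, so its out-link "e" is never enqueued even though "e" itself is on topic. That illustrates the trade-off of pruning links at irrelevant pages; whether the patented method behaves this way in detail is not stated in the abstract.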