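Clearly, the links within a site form a graph, not a tree. You should have a Page object identified by its URL, and a Link object that points from one page to another. Page A can link to page B while page B links back to page A, which is what makes the structure a graph rather than a tree.

Pseudocode for the scanning algorithm: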
    process_page(current_page):
        for each link on the current_page:
            target_page = the page the link points to
            if target_page is not already in your graph:
                create a Page object to represent target_page
                add it to to_be_scanned set
            add a link from current_page to target_page

    scan_website(start_page):
        create Page object for start_page
        to_be_scanned = set containing start_page
        while to_be_scanned is not empty:
            current_page = to_be_scanned.pop()
            process_page(current_page)
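
The pseudocode above could be sketched in Python roughly as follows. This is only an illustration under some assumptions: the `get_links` callback is hypothetical and stands in for whatever fetches a page and extracts its links, the `SITE` dictionary is made-up test data, and links are stored as a set of target URLs on each `Page` rather than as separate Link objects.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Set


@dataclass
class Page:
    """A node in the site graph, identified by its URL."""
    url: str
    links_to: Set[str] = field(default_factory=set)  # URLs of target pages


def scan_website(start_url: str,
                 get_links: Callable[[str], Iterable[str]]) -> Dict[str, Page]:
    """Build the link graph reachable from start_url.

    get_links is a hypothetical callback returning the URLs linked from
    a given page; a real crawler would fetch and parse HTML here.
    """
    graph: Dict[str, Page] = {start_url: Page(start_url)}
    to_be_scanned: Set[str] = {start_url}

    while to_be_scanned:
        current_url = to_be_scanned.pop()
        for target_url in get_links(current_url):
            if target_url not in graph:
                # First time this page is seen: create it and queue it.
                graph[target_url] = Page(target_url)
                to_be_scanned.add(target_url)
            # Record the edge even if the target was already known,
            # which is what allows cycles such as A -> B -> A.
            graph[current_url].links_to.add(target_url)
    return graph


# A tiny in-memory "site" with a cycle between /a and /b (made-up data).
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/b"],
    "/b": ["/a", "/"],
}

graph = scan_website("/", lambda url: SITE.get(url, []))
```

Because a page is queued only the first time it is seen, the loop terminates even when pages link to each other in a cycle.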