Crawling and Indexing the Web
This section covers how search engines crawl the Web and index the collected documents.
Crawling the Web
• To make Web search efficient, search engines collect Web documents and index them by the words (terms) they contain.
• For the purposes of indexing, Web pages are first collected and stored in a local repository.
• Web crawlers (also called spiders or robots) are programs that systematically and exhaustively browse the Web and store all visited pages.
• Crawlers follow the hyperlinks in Web documents, implementing graph-search strategies such as depth-first and breadth-first search (a breadth-first sketch follows the figure caption below).
Figure: breadth-first Web crawling limited to depth 3.
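The breadth-first strategy pictured above can be written as a queue-driven traversal. The following is a minimal sketch, not part of the original notes: it assumes the third-party requests and beautifulsoup4 packages, and the seed URL and depth limit of 3 are illustrative parameters.

from collections import deque
from urllib.parse import urljoin, urldefrag

import requests
from bs4 import BeautifulSoup

MAX_DEPTH = 3  # crawl no deeper than 3 links away from the seed page

def crawl(seed_url):
    visited = set()
    repository = {}                  # local page repository: URL -> HTML
    queue = deque([(seed_url, 0)])   # a FIFO queue gives breadth-first order

    while queue:
        url, depth = queue.popleft()
        if url in visited or depth > MAX_DEPTH:
            continue
        visited.add(url)
        try:
            response = requests.get(url, timeout=5)
        except requests.RequestException:
            continue
        repository[url] = response.text

        # Follow hyperlinks: resolve each <a href> against the current URL
        soup = BeautifulSoup(response.text, "html.parser")
        for anchor in soup.find_all("a", href=True):
            link, _ = urldefrag(urljoin(url, anchor["href"]))
            if link.startswith("http") and link not in visited:
                queue.append((link, depth + 1))
    return repository

# Example use: pages = crawl("https://example.com")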
Issues in Web Crawling:
• Network latency (multithreading)
• Address resolution (DNS caching)
• Extracting URLs (use canonical form; a canonicalization sketch follows this list)
• Managing a huge web page repository
• Updating indices
• Responding to constantly changing Web
• Interaction with Web page developers
• Advanced crawling by guided (informed) search (using web page ranks)
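One concrete treatment of the "canonical form" point above is to normalize every extracted URL before it is queued or compared against the set of visited pages. The sketch below uses only the Python standard library; the specific rules shown (lower-case scheme and host, default-port removal, fragment removal) are common conventions assumed for illustration, not an exhaustive specification.

from urllib.parse import urlparse, urlunparse

def canonicalize(url):
    parts = urlparse(url)
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()
    # Drop default ports so http://host:80/ and http://host/ compare equal
    if (scheme == "http" and netloc.endswith(":80")) or \
       (scheme == "https" and netloc.endswith(":443")):
        netloc = netloc.rsplit(":", 1)[0]
    path = parts.path or "/"
    # Fragments never reach the server, so they are discarded
    return urlunparse((scheme, netloc, path, parts.params, parts.query, ""))

print(canonicalize("HTTP://Example.COM:80/index.html#top"))
# -> http://example.com/index.html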
Indexing and Keyword Search:
We need efficient content-based access to Web documents.
• Document representation:
– Term-document matrix (inverted index)
• Relevance ranking:
– Vector space model (an indexing and ranking sketch follows this list)
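A minimal sketch combining the two ideas above: build an inverted index (the sparse form of the term-document matrix) over a small in-memory collection, then rank documents against a query by cosine similarity of TF-IDF vectors. The sample documents and the whitespace tokenizer are illustrative assumptions.

import math
from collections import Counter, defaultdict

docs = {
    "d1": "web crawlers browse the web",
    "d2": "search engines index web pages",
    "d3": "the vector space model ranks pages",
}

# Inverted index: term -> {document id: term frequency}
index = defaultdict(dict)
for doc_id, text in docs.items():
    for term, tf in Counter(text.lower().split()).items():
        index[term][doc_id] = tf

N = len(docs)

def idf(term):
    df = len(index.get(term, {}))          # document frequency
    return math.log(N / df) if df else 0.0

def tfidf_norm(text):
    # Euclidean length of a document's TF-IDF vector
    return math.sqrt(sum((tf * idf(t)) ** 2
                         for t, tf in Counter(text.lower().split()).items()))

def search(query):
    # Accumulate dot products via the inverted index, then normalise by
    # document length; the query length is a constant factor and is skipped.
    scores = defaultdict(float)
    for term, q_tf in Counter(query.lower().split()).items():
        for doc_id, d_tf in index.get(term, {}).items():
            scores[doc_id] += (q_tf * idf(term)) * (d_tf * idf(term))
    ranked = [(score / tfidf_norm(docs[doc_id]), doc_id)
              for doc_id, score in scores.items() if tfidf_norm(docs[doc_id]) > 0]
    return sorted(ranked, reverse=True)

print(search("web pages"))   # prints (score, document id) pairs, best match first

For this toy collection the call prints (score, document id) pairs with the best match first; real search engines use the same inverted-index traversal, but with compressed postings lists and more refined term weighting.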