
Web Crawler



In this project, you will develop a web crawler that navigates through web pages within a single domain. The crawler will start from a specified URL (https://www.famnit.upr.si), extract all the links on that page, and recursively visit those links, discovering new pages to crawl. The crawler must stay within the same domain. As it crawls, the program should log every visited page and record both working and non-working links.
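Staying within the starting domain comes down to comparing hostnames after resolving each link against the page it was found on. A minimal sketch using Python's standard `urllib.parse` module (the helper name `same_domain` is illustrative, not part of the assignment):

```python
from urllib.parse import urljoin, urlparse

def same_domain(link: str, base_url: str) -> bool:
    """Return True if `link` resolves to the same host as `base_url`.

    urljoin handles both absolute and relative links, so "/en/about"
    found on the start page resolves to a full URL before comparing.
    """
    resolved = urljoin(base_url, link)
    return urlparse(resolved).hostname == urlparse(base_url).hostname
```

This keeps relative links (which belong to the domain by construction) and filters out links to external sites.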

Requirements

  • Begin crawling from a user-specified URL within the target domain.
  • For each page visited, extract all hyperlinks (anchor tags with href attributes).
  • Only follow links that point to pages within the same domain as the starting URL.
  • Attempt to access each extracted link to determine if it is working (e.g., returns a status code of 200) or non-working (e.g., broken links, 404 errors).
  • Record the status code or error message associated with each link.
  • Generate a report or output file that lists all working and non-working links. Include details such as the status code, the page from which each link was found, and any error messages.
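The extraction and status-checking steps above can be sketched with the standard library alone: `html.parser` collects the `href` attributes of anchor tags, and `urllib.request` attempts each link, distinguishing HTTP errors (e.g. 404) from connection failures. This is one possible approach, not the required implementation; libraries such as `requests` and `BeautifulSoup` would work equally well.

```python
from html.parser import HTMLParser
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html: str) -> list:
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links

def check_link(url: str, timeout: float = 10.0):
    """Return (url, status_code, error_message) for the report.

    A 200 status means the link is working; an HTTPError carries the
    failing status code (e.g. 404); a URLError means the host could
    not be reached at all, so no status code is available.
    """
    try:
        with urlopen(url, timeout=timeout) as resp:
            return (url, resp.status, None)
    except HTTPError as e:
        return (url, e.code, str(e))
    except URLError as e:
        return (url, None, str(e.reason))
```

Writing each `(url, status, error)` tuple together with the page it was found on gives exactly the report described above.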

Testing

Test the crawler on the https://www.famnit.upr.si domain and measure the time taken to crawl a certain number of pages. Verify that the crawler correctly handles redirects and relative URLs. Ensure that the crawler does not follow external links or revisit the same page.
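A visited set and a breadth-first queue cover the "no revisits" requirement, and `urljoin` takes care of relative URLs; redirects are followed automatically by most HTTP clients, including `urllib.request`. A sketch of the crawl loop follows, with the page fetcher and link extractor passed in as functions so the traversal logic can be tested without network access (all names here are illustrative):

```python
import time
from collections import deque
from urllib.parse import urljoin, urlparse

def crawl(start_url, fetch, extract_links, max_pages=100):
    """Breadth-first crawl restricted to start_url's host.

    fetch(url)          -> page HTML (raises on a broken link)
    extract_links(html) -> list of href strings
    Returns the set of visited URLs and the elapsed time in seconds.
    """
    host = urlparse(start_url).hostname
    visited, queue = set(), deque([start_url])
    t0 = time.perf_counter()
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue                      # never revisit a page
        visited.add(url)
        try:
            html = fetch(url)
        except Exception:
            continue                      # log as non-working in a full version
        for link in extract_links(html):
            absolute = urljoin(url, link)  # resolve relative URLs
            if urlparse(absolute).hostname == host and absolute not in visited:
                queue.append(absolute)
    return visited, time.perf_counter() - t0
```

Running `crawl` against the live domain with increasing `max_pages` values gives the timing measurements asked for above; substituting a dictionary-backed fake `fetch` verifies offline that external links are skipped and no page is visited twice.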