Web Crawler

In this project, you will develop a web crawler that navigates through the web pages of a single domain. Starting from a specified URL (https://www.famnit.upr.si), the crawler extracts all links on each page and visits them in turn, adding any newly discovered links to the set of pages still to visit. The crawler must stay within the starting domain and log both working and non-working links as it goes.
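
One possible shape for the traversal described above is sketched here. It is only an illustration under assumptions: the names `crawl`, `fetch_links`, `max_pages`, and `fake_site` are hypothetical, and an explicit queue (breadth-first order) is used in place of literal recursion so that a large site cannot exhaust Python's recursion limit. The page fetcher is passed in as a function, so the loop can be tried against an in-memory stand-in before touching the network.

```python
from collections import deque
from typing import Callable, Iterable
from urllib.parse import urlparse


def crawl(start_url: str,
          fetch_links: Callable[[str], Iterable[str]],
          max_pages: int = 100) -> list[str]:
    """Visit pages breadth-first without leaving the start URL's host."""
    domain = urlparse(start_url).netloc
    queue = deque([start_url])  # pages waiting to be visited
    seen = {start_url}          # pages already queued, to avoid revisits
    visited = []                # pages actually visited, in order

    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in fetch_links(url):
            # Skip links to other hosts and links that are already queued.
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical in-memory "site" standing in for real HTTP requests.
fake_site = {
    "https://www.famnit.upr.si/": [
        "https://www.famnit.upr.si/a",
        "https://example.com/",
    ],
    "https://www.famnit.upr.si/a": [
        "https://www.famnit.upr.si/",
        "https://www.famnit.upr.si/b",
    ],
}
pages = crawl("https://www.famnit.upr.si/", lambda u: fake_site.get(u, []))
```

With this stand-in, the external link and the link back to the start page are both skipped, so only three pages are visited. In a real run, `fetch_links` would download the page and return its absolute links.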

Running the Program

  1. Initialization:
     • Begin crawling from a user-specified URL within the target domain (https://www.famnit.upr.si).

  2. Link Extraction:
     • For each visited page, extract all hyperlinks (anchor tags with href attributes).

  3. Domain Restriction:
     • Only follow links that point to pages within the same domain as the starting URL.

  4. Link Validation:
     • Attempt to access each extracted link to determine if it is working (e.g., returns a status code of 200) or non-working (e.g., broken links, 404 errors).
     • Record the status code or error message associated with each link.

  5. Output Generation:
     • Generate a report or output file listing all working and non-working links. Include details such as:
       • Status code or error message
       • Page from which each link was found
       • Any additional error details
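
The extraction, domain restriction, validation, and output steps above could be sketched with the Python standard library roughly as follows. This is a minimal sketch, not a prescribed design: the names `LinkParser`, `extract_links`, `same_domain`, `LinkResult`, `check_link`, and `write_report` are hypothetical, the CSV layout and the `User-Agent` string are assumptions, and "same domain" is taken to mean an identical host name. The example page content and the report row at the end are hand-written for illustration.

```python
import csv
import io
from dataclasses import dataclass
from html.parser import HTMLParser
from urllib.error import HTTPError
from urllib.parse import urljoin, urlparse
from urllib.request import Request, urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html: str, page_url: str) -> list[str]:
    """Return every link on the page as an absolute URL."""
    parser = LinkParser()
    parser.feed(html)
    return [urljoin(page_url, href) for href in parser.hrefs]


def same_domain(url: str, start_url: str) -> bool:
    # Assumption: "same domain" means the host part matches exactly.
    return urlparse(url).netloc == urlparse(start_url).netloc


@dataclass
class LinkResult:
    url: str
    found_on: str  # page on which the link was found
    status: str    # HTTP status code or error message
    ok: bool       # True only for a 200 response


def check_link(url: str, found_on: str, timeout: float = 10.0) -> LinkResult:
    """Request the URL (real network access) and record the outcome."""
    request = Request(url, headers={"User-Agent": "crawler-sketch"})
    try:
        with urlopen(request, timeout=timeout) as resp:
            return LinkResult(url, found_on, str(resp.status), resp.status == 200)
    except HTTPError as err:            # e.g. 404 or 500
        return LinkResult(url, found_on, str(err.code), False)
    except (OSError, ValueError) as err:  # DNS failure, timeout, bad URL
        return LinkResult(url, found_on, str(err), False)


def write_report(results, out) -> None:
    """Write one CSV row per checked link."""
    writer = csv.writer(out)
    writer.writerow(["url", "found_on", "status", "ok"])
    for r in results:
        writer.writerow([r.url, r.found_on, r.status, r.ok])


# Offline demonstration with hand-written HTML (no request is made).
page = "https://www.famnit.upr.si/"
html = ('<a href="/page">In</a> <a href="https://example.com/x">Out</a> '
        '<a name="top">No href</a>')
links = extract_links(html, page)
internal = [u for u in links if same_domain(u, page)]

buf = io.StringIO()
# Hand-written example row, not a real measurement.
write_report([LinkResult(internal[0], page, "200", True)], buf)
```

`check_link` is the only function that touches the network. Whether external links should also be validated (but not followed) is a design decision the list above leaves open.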

Testing

  • Domain Testing: Test the crawler on the https://www.famnit.upr.si domain, measuring the time taken to crawl a specified number of pages.
  • Redirect and Relative URL Handling: Verify that the crawler correctly handles redirects and processes relative URLs.
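
For the redirect and relative URL checks, a small set of known inputs makes the expected behaviour easy to verify. The sketch below assumes the standard `urllib` modules. The helper names `normalize` and `final_url` and the example paths are hypothetical. `urlopen` follows HTTP redirects by default, and the response's `url` attribute reports the address that was finally served.

```python
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


def normalize(href: str, base_url: str) -> str:
    """Resolve href against the page it was found on and drop any #fragment."""
    absolute = urljoin(base_url, href)
    return urldefrag(absolute).url


def final_url(url: str, timeout: float = 10.0) -> str:
    """Return the URL served after any redirects (makes a real request)."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.url


# Hypothetical page location used as the base for relative links.
base = "https://www.famnit.upr.si/a/b/index.html"

resolved = {
    href: normalize(href, base)
    for href in ["c.html", "../d.html", "/e", "#top", "//www.famnit.upr.si/f"]
}
# resolved["c.html"]    -> "https://www.famnit.upr.si/a/b/c.html"
# resolved["../d.html"] -> "https://www.famnit.upr.si/a/d.html"
# resolved["/e"]        -> "https://www.famnit.upr.si/e"
# resolved["#top"]      -> "https://www.famnit.upr.si/a/b/index.html"
```

Comparing `final_url(u)` with `u` is one way to detect that a redirect happened. If the final address is on a different host, the domain restriction should be applied to it as well.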

Present results with charts/figures where possible, and explain any challenges or observations based on numeric and visual data collected during testing.
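
The numeric data for the timing test could be gathered with a helper along these lines. This is only a sketch: `time_crawl`, `visit`, and `page_limit` are hypothetical names, and the demonstration uses a do-nothing `visit` function with made-up URLs, so it produces no real measurements.

```python
import time
from typing import Callable, Iterable


def time_crawl(urls: Iterable[str],
               visit: Callable[[str], object],
               page_limit: int) -> tuple[int, float]:
    """Visit up to page_limit URLs; return (pages_visited, elapsed_seconds)."""
    start = time.perf_counter()
    count = 0
    for url in urls:
        if count >= page_limit:
            break
        visit(url)
        count += 1
    return count, time.perf_counter() - start


# Placeholder URLs and a no-op visit function, for illustration only.
demo_urls = [f"https://www.famnit.upr.si/p{i}" for i in range(10)]
count, elapsed = time_crawl(demo_urls, lambda u: None, page_limit=5)
```

Repeating the measurement for several page limits yields (pages, seconds) pairs that can be plotted directly.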