u-crawler is a web scraper that uses BeautifulSoup, requests, and Selenium to collect data about courses and programs from the University of New South Wales.
It adheres to robots.txt and writes its results into a neatly organized directory of JSON files.
I created this for the Anansi YSWS and I learnt a lot about web scraping :D
I added Linux ARM support to the prebuilt PyInstaller binaries so u-crawler can run on my Raspberry Pi :D
I also had to pin the GitHub Actions runner to Debian Bookworm instead of the latest Debian version. Otherwise the binary wouldn't run on older Debian-based distros (see the attached image), because they ship an older version of glibc (the C library that PyInstaller-built binaries link against).
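Pinning the build to Bookworm can be done by running the job's steps inside a container. This is a hypothetical sketch, not the actual workflow: the job name, runner label, spec filename, and install commands are all assumptions.

```yaml
# Illustrative excerpt: build inside a debian:bookworm container so the
# binary links against Bookworm's (older) glibc and runs on older distros.
jobs:
  build-linux-arm:
    runs-on: ubuntu-24.04-arm        # GitHub-hosted ARM runner
    container: debian:bookworm       # pin glibc via the container image
    steps:
      - uses: actions/checkout@v4
      - run: |
          apt-get update && apt-get install -y python3-pip
          pip3 install pyinstaller --break-system-packages
          pyinstaller u-crawler.spec   # spec filename is an assumption
```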
I fixed some errors that might have caused issues on other devices. For example, I switched the Selenium headless flag back to the older one that still works, dropped the --disable-gpu flag, and remade the spec file to be more efficient so it only includes the modules I actually need. The attached screenshot shows it working on my device, and I also tested it on two other laptops, so it should work now.
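The headless fallback can be sketched like this. The flag names are real Chromium switches, but the helper function and the exact flag set are my own illustration, not u-crawler's actual code.

```python
def build_chrome_flags():
    """Chrome flags for headless scraping; a hypothetical helper."""
    return [
        "--headless",        # classic headless mode; works on older Chrome builds
        # "--headless=new",  # newer mode; dropped here for compatibility
        "--no-sandbox",
    ]

try:
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    opts = Options()
    for flag in build_chrome_flags():
        opts.add_argument(flag)
    # driver = webdriver.Chrome(options=opts)  # launched when actually scraping
except ImportError:
    pass  # selenium not installed; the flag list above still documents the setup
```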
I added PyInstaller support, which builds u-crawler into an exe that is easy to run: no need to install dependencies or even have Python!
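A typical PyInstaller build looks roughly like this. The entry-point and spec filenames are assumptions for illustration, not necessarily what u-crawler uses.

```shell
pip install pyinstaller
# First build: --onefile bundles everything into a single executable
# and also generates u-crawler.spec for later customization.
pyinstaller --onefile --name u-crawler main.py
# Later builds reuse the tuned spec file (e.g. with trimmed module lists).
pyinstaller u-crawler.spec
```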
Now my scraper uses the results of categories.json to crawl every program inside each category. Instead of simply using requests and BeautifulSoup like I did for the category scraper, I had to use Selenium to launch a headless browser, because the data for the programs is rendered with JavaScript. In the screenshot below, you can see the format of the results, with categories.json and the programs in each category as their own files, along with some of the code on the right.
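The crawl loop can be sketched like this. The JSON layout, file-naming scheme, and function names are hypothetical stand-ins for the actual implementation; only the Selenium pattern (load a URL, let JS render, then read the page) is the point.

```python
import json
import pathlib

def output_path(category_name, outdir="results"):
    # Hypothetical naming scheme: one JSON file per category.
    safe = category_name.lower().replace(" ", "_")
    return pathlib.Path(outdir) / f"{safe}.json"

def crawl_category(driver, category):
    """Visit each program page with a headless browser and collect data.

    `driver` is a selenium webdriver; `category` is assumed to look like
    {"name": "...", "programs": ["https://...", ...]} from categories.json.
    """
    programs = []
    for url in category["programs"]:
        driver.get(url)  # JS-rendered page, so requests alone won't see the data
        programs.append({"url": url, "title": driver.title})
    return programs

def crawl_all(driver, categories_file="categories.json"):
    with open(categories_file) as f:
        categories = json.load(f)
    for category in categories:
        results = crawl_category(driver, category)
        path = output_path(category["name"])
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(results, indent=2))
```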
I finished the code that gets all of the areas of interest on the home page and puts them in a JSON file
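That first step was plain requests + BeautifulSoup. A minimal sketch of the parsing side, using made-up markup: the real UNSW page's class names and structure will differ, and the sample HTML here is purely illustrative.

```python
import json
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Hypothetical markup standing in for the fetched home page.
SAMPLE_HTML = """
<ul class="areas-of-interest">
  <li><a href="/study/engineering">Engineering</a></li>
  <li><a href="/study/law">Law</a></li>
</ul>
"""

def extract_areas(html):
    """Parse area-of-interest links out of the home page HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        {"name": a.get_text(strip=True), "url": a["href"]}
        for a in soup.select(".areas-of-interest a")
    ]

areas = extract_areas(SAMPLE_HTML)
with open("categories.json", "w") as f:
    json.dump(areas, f, indent=2)
```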