u-crawler

10 devlogs
8h 21m
Ship certified
Created by obob

u-crawler is a web scraper that uses BeautifulSoup, requests, and Selenium to collect data about courses and programs from the University of New South Wales.
It adheres to robots.txt and outputs its results into a neatly formatted directory of JSON files.
I created this for the Anansi YSWS and I learnt a lot about web scraping :D

Timeline

Ship 2

1 payout of 55.0 shells

obob

about 2 months ago

Covers 4 devlogs and 2h 24m

I added Linux ARM support to the prebuilt PyInstaller binaries so u-crawler can run on my Raspberry Pi :D

I also had to set the GitHub Actions runner to build on bookworm instead of the latest Debian version. Otherwise the binaries wouldn't run on older Debian-based distros (see the attached image), because those ship an older version of glibc (the C library that PyInstaller-built binaries link against), and a binary built against a newer glibc won't run on an older one.
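The relevant part of the workflow ends up looking roughly like this (a minimal sketch, not the real workflow file; the runner label, requirements.txt, and the spec file name are my assumptions):

```yaml
name: build-linux-arm
on: [push]

jobs:
  build-linux-arm:
    runs-on: ubuntu-24.04-arm     # hosted ARM runner (label is an assumption)
    container: debian:bookworm    # build against bookworm's older glibc
    steps:
      - uses: actions/checkout@v4
      - run: |
          apt-get update && apt-get install -y python3 python3-pip binutils
          pip3 install --break-system-packages pyinstaller -r requirements.txt
          pyinstaller u-crawler.spec
```

Building inside the bookworm container means the binary links against bookworm's glibc, so it runs on any distro shipping that glibc version or newer.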


I fixed some errors that might have caused issues on other devices. For example, I changed the Selenium headless flag to the older one that still works, dropped the --disable-gpu flag, and remade the spec file to be more efficient so it only includes the modules I actually need. The attached screenshot shows it working on my device, and I also tested it on two other laptops, so it should work now.
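Concretely, the browser setup now looks something like this (a simplified sketch assuming Chrome; the real code may differ):

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
# The older headless flag; the newer "--headless=new" mode was the
# one causing trouble on some devices.
options.add_argument("--headless")
# Note: no "--disable-gpu" any more; that flag was dropped.

driver = webdriver.Chrome(options=options)
driver.get("https://www.unsw.edu.au/")
print(driver.title)
driver.quit()
```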


I added GitHub Actions support so I can build with PyInstaller for macOS, Windows and Linux!
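The cross-platform part is just a build matrix, something like this sketch (file names are placeholders, not the real workflow):

```yaml
name: build
on: [push]

jobs:
  build:
    strategy:
      matrix:
        os: [ubuntu-latest, windows-latest, macos-latest]
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install pyinstaller -r requirements.txt
      - run: pyinstaller u-crawler.spec
      - uses: actions/upload-artifact@v4
        with:
          name: u-crawler-${{ matrix.os }}
          path: dist/
```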


I added PyInstaller support, which builds u-crawler into an exe that is easy to run, with no need to install dependencies or even have Python!
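A stripped-down spec file for this looks roughly as follows (a sketch, not the project's real spec; the entry-point name is a placeholder). PyInstaller runs it with pyinstaller u-crawler.spec:

```python
# u-crawler.spec -- PyInstaller injects Analysis/PYZ/EXE into spec
# files at build time, so no imports are needed here.
a = Analysis(
    ["main.py"],  # entry point (placeholder name)
)
pyz = PYZ(a.pure)
exe = EXE(
    pyz,
    a.scripts,
    a.binaries,
    a.datas,
    [],
    name="u-crawler",
    console=True,  # keep the terminal so the scraper's output is visible
)
```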


Ship 1

1 payout of 67.0 shells

obob

3 months ago

Covers 6 devlogs and 5h 57m

Finished the readme


I added error catching and logging in programs.py. Now I just need to write a readme
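The pattern is roughly this (a simplified sketch, not the actual code from programs.py):

```python
import logging

import requests

logging.basicConfig(
    filename="u-crawler.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)

def fetch_page(url: str) -> str | None:
    """Fetch a page, logging failures instead of crashing the whole crawl."""
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        return response.text
    except requests.RequestException as exc:
        logging.error("failed to fetch %s: %s", url, exc)
        return None
```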


I implemented error catching and logging in categories.py


I implemented a robots.txt checker to make sure that it doesn't scrape forbidden pages
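Python's standard library covers this; the check works roughly like so (a sketch, with the tested path made up for illustration):

```python
from urllib import robotparser

parser = robotparser.RobotFileParser("https://www.unsw.edu.au/robots.txt")
parser.read()

url = "https://www.unsw.edu.au/study"  # illustrative URL
if parser.can_fetch("*", url):
    print("allowed:", url)
else:
    print("forbidden by robots.txt:", url)
```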

obob

2h 18m
3 months ago

Now my scraper uses the results in categories.json to crawl every program inside each category. Instead of simply using requests and BeautifulSoup like I did for the category scraper, I had to use Selenium to launch a headless browser, because the data for the programs is rendered with JavaScript. In the screenshot below, you can see the format of the results, with categories.json and the programs in each category as their own file, and some of the code on the right.
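In outline, the crawl looks something like this sketch (the selector and file layout are illustrative, not the real ones):

```python
import json
from pathlib import Path

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless")  # program pages are JS-rendered
driver = webdriver.Chrome(options=options)

categories = json.loads(Path("results/categories.json").read_text())
for category in categories:
    driver.get(category["url"])
    # "a.program-link" is a made-up selector for illustration.
    programs = [
        {"name": el.text, "url": el.get_attribute("href")}
        for el in driver.find_elements(By.CSS_SELECTOR, "a.program-link")
    ]
    Path("results", f"{category['name']}.json").write_text(
        json.dumps(programs, indent=2)
    )

driver.quit()
```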

obob

1h 11m
3 months ago

I finished the code that gets all of the areas of interest on the home page and puts them in a JSON file
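The gist of it (a sketch; the real selector and output path differ):

```python
import json

import requests
from bs4 import BeautifulSoup

html = requests.get("https://www.unsw.edu.au/", timeout=10).text
soup = BeautifulSoup(html, "html.parser")

# "a.area-of-interest" is a placeholder selector, not the real one.
areas = [a.get_text(strip=True) for a in soup.select("a.area-of-interest")]

with open("categories.json", "w") as f:
    json.dump(areas, f, indent=2)
```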
