u-crawler

10 devlogs
8h 21m
Ship certified
Created by obob

u-crawler is a web scraper that uses BeautifulSoup, requests, and Selenium to collect data about courses and programs from the University of New South Wales.
It adheres to robots.txt and outputs its results into a neatly formatted directory of JSON files.
I created this for the Anansi YSWS and I learnt a lot about web scraping :D

Timeline

Ship 2

1 payout of 55.0 shells

obob

about 2 months ago

Covers 4 devlogs and 2h 24m

I added Linux ARM support to the prebuilt PyInstaller binaries so u-crawler can run on my Raspberry Pi :D

I also had to set the GitHub Actions runner to build on bookworm instead of the latest Debian version. Otherwise the binaries wouldn't run on older Debian-based distros (see the attached image), because those ship an older version of glibc (the C library that PyInstaller-built binaries link against), and a binary built against a newer glibc won't run on an older one.
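The relevant part of the workflow ends up looking roughly like this (a minimal sketch, not the real workflow file; the runner label, requirements.txt, and the spec file name are my assumptions):

```yaml
name: build-linux-arm
on: [push]

jobs:
  build-linux-arm:
    runs-on: ubuntu-24.04-arm     # hosted ARM runner (label is an assumption)
    container: debian:bookworm    # build against bookworm's older glibc
    steps:
      - uses: actions/checkout@v4
      - run: |
          apt-get update && apt-get install -y python3 python3-pip binutils
          pip3 install --break-system-packages pyinstaller -r requirements.txt
          pyinstaller u-crawler.spec
```

Building inside the bookworm container means the binary links against bookworm's glibc, so it runs on any distro shipping that glibc version or newer.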


I fixed some errors that might have caused issues on other devices. For example, I changed the Selenium headless flag to the older one that still works, dropped the --disable-gpu flag, and remade the spec file to be more efficient so it only includes the modules I actually need. The attached screenshot shows it working on my device, and I also tested it on two other laptops, so it should work now.
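Concretely, the browser setup now looks something like this (a simplified sketch assuming Chrome; the real code may differ):

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
# The older headless flag; the newer "--headless=new" mode was the
# one causing trouble on some devices.
options.add_argument("--headless")
# Note: no "--disable-gpu" any more; that flag was dropped.

driver = webdriver.Chrome(options=options)
driver.get("https://www.unsw.edu.au/")
print(driver.title)
driver.quit()
```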


I added GitHub Actions support so I can build with PyInstaller for macOS, Windows and Linux!
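The cross-platform part is just a build matrix, something like this sketch (file names are placeholders, not the real workflow):

```yaml
name: build
on: [push]

jobs:
  build:
    strategy:
      matrix:
        os: [ubuntu-latest, windows-latest, macos-latest]
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install pyinstaller -r requirements.txt
      - run: pyinstaller u-crawler.spec
      - uses: actions/upload-artifact@v4
        with:
          name: u-crawler-${{ matrix.os }}
          path: dist/
```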


I added PyInstaller support, which builds u-crawler into an exe that is easy to run, with no need to install dependencies or even have Python!
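A stripped-down spec file for this looks roughly as follows (a sketch, not the project's real spec; the entry-point name is a placeholder). PyInstaller runs it with pyinstaller u-crawler.spec:

```python
# u-crawler.spec -- PyInstaller injects Analysis/PYZ/EXE into spec
# files at build time, so no imports are needed here.
a = Analysis(
    ["main.py"],  # entry point (placeholder name)
)
pyz = PYZ(a.pure)
exe = EXE(
    pyz,
    a.scripts,
    a.binaries,
    a.datas,
    [],
    name="u-crawler",
    console=True,  # keep the terminal so the scraper's output is visible
)
```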


Ship 1

1 payout of 67.0 shells

obob

3 months ago

Covers 6 devlogs and 5h 57m

Finished the readme


I added error catching and logging in programs.py. Now I just need to write a readme
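The pattern is roughly this (a simplified sketch, not the actual code from programs.py):

```python
import logging

import requests

logging.basicConfig(
    filename="u-crawler.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)

def fetch_page(url: str) -> str | None:
    """Fetch a page, logging failures instead of crashing the whole crawl."""
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        return response.text
    except requests.RequestException as exc:
        logging.error("failed to fetch %s: %s", url, exc)
        return None
```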


I implemented error catching and logging in categories.py


I implemented a robots.txt checker to make sure that it doesn't scrape forbidden pages
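Python's standard library covers this; the check works roughly like so (a sketch, with the tested path made up for illustration):

```python
from urllib import robotparser

parser = robotparser.RobotFileParser("https://www.unsw.edu.au/robots.txt")
parser.read()

url = "https://www.unsw.edu.au/study"  # illustrative URL
if parser.can_fetch("*", url):
    print("allowed:", url)
else:
    print("forbidden by robots.txt:", url)
```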

obob

2h 18m
3 months ago

Now my scraper uses the results in categories.json to crawl every program inside each category. Instead of simply using requests and BeautifulSoup like I did for the category scraper, I had to use Selenium to launch a headless browser, because the data for the programs is rendered with JavaScript. In the screenshot below, you can see the format of the results, with categories.json and the programs in each category as their own file, and some of the code on the right.
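In outline, the crawl looks something like this sketch (the selector and file layout are illustrative, not the real ones):

```python
import json
from pathlib import Path

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless")  # program pages are JS-rendered
driver = webdriver.Chrome(options=options)

categories = json.loads(Path("results/categories.json").read_text())
for category in categories:
    driver.get(category["url"])
    # "a.program-link" is a made-up selector for illustration.
    programs = [
        {"name": el.text, "url": el.get_attribute("href")}
        for el in driver.find_elements(By.CSS_SELECTOR, "a.program-link")
    ]
    Path("results", f"{category['name']}.json").write_text(
        json.dumps(programs, indent=2)
    )

driver.quit()
```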

obob

1h 11m
3 months ago

I finished the code that gets all of the areas of interest on the home page and puts them in a JSON file
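The gist of it (a sketch; the real selector and output path differ):

```python
import json

import requests
from bs4 import BeautifulSoup

html = requests.get("https://www.unsw.edu.au/", timeout=10).text
soup = BeautifulSoup(html, "html.parser")

# "a.area-of-interest" is a placeholder selector, not the real one.
areas = [a.get_text(strip=True) for a in soup.select("a.area-of-interest")]

with open("categories.json", "w") as f:
    json.dump(areas, f, indent=2)
```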
