Rehatbir Singh
Finally fixed the bug!!
Basically, I got a panic (a fatal error) because I tried to move a value that I had previously put into a shared reference (a pointer in Rust that gives multiple places immutable access to some data) back out of it, and I did that in par_run() (the recursive multithreaded helper function) instead of in the actual Crawler::run() function...
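A minimal sketch of how that kind of panic can happen, assuming the shared reference was an Arc and the move-out went through Arc::try_unwrap (the actual code may differ):

```rust
use std::sync::Arc;

fn main() {
    let value = Arc::new(String::from("crawler context"));
    // A clone held elsewhere, e.g. by a recursive helper call:
    let held_elsewhere = Arc::clone(&value);

    // Moving the value back out only works if this is the last Arc;
    // otherwise it fails at runtime (a panic), not at compile time.
    let owned = Arc::try_unwrap(value).expect("value still shared"); // panics!
    println!("{owned}");
    drop(held_elsewhere);
}
```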
So now I will - again - test and benchmark and then ship eventually.
Still doing the context thingy; installing rustowl because of problems with the Arc the context is wrapped in. Will probably upload tomorrow
Doing stuff with the type system to expose Context in an ergonomic way...
Next big refactor incoming...
[Example code below]
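For illustration, one type-system route to expose a Context ergonomically is making the crawler generic over a user-supplied context behind an Arc; this is just a sketch with hypothetical names, not necessarily the design used here:

```rust
use std::path::Path;
use std::sync::Arc;

// Hypothetical sketch: the crawler is generic over a user context C,
// so the per-file closure gets typed access without manual Arc juggling.
struct Crawler<C> {
    context: Arc<C>,
}

impl<C: Send + Sync + 'static> Crawler<C> {
    fn with_context(context: C) -> Self {
        Self { context: Arc::new(context) }
    }

    // Each invocation hands out its own Arc clone plus the file path.
    fn run(&self, files: &[&Path], action: impl Fn(Arc<C>, &Path) + Sync) {
        for file in files {
            action(Arc::clone(&self.context), file);
        }
    }
}

fn main() {
    let crawler = Crawler::with_context(Vec::<String>::new());
    crawler.run(&[Path::new("a.txt")], |ctx, path| {
        // ctx is the shared context, typed as Arc<Vec<String>> here
        let _ = (ctx.len(), path);
    });
}
```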
Basically done I'd say!
Everything is working: a crawler builder pattern (which is also lazy until .run(...), so the crawler can be stored as a config) is available for both versions, Crawler::new() and Crawler::newasync(), along with configs like file/folder regex and search depth.
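A hedged sketch of what such a lazy builder can look like; apart from Crawler::new(), all names here (fields, setters) are made up for illustration:

```rust
use regex::Regex;
use std::fs;
use std::path::Path;

pub struct Crawler {
    max_depth: usize,
    file_filter: Option<Regex>,
}

impl Crawler {
    pub fn new() -> Self {
        Self { max_depth: usize::MAX, file_filter: None }
    }

    // Setters only store configuration; nothing touches the disk yet,
    // which is why a Crawler can be kept around as a reusable config.
    pub fn max_depth(mut self, depth: usize) -> Self {
        self.max_depth = depth;
        self
    }

    pub fn file_filter(mut self, re: Regex) -> Self {
        self.file_filter = Some(re);
        self
    }

    // Only .run(...) actually walks the tree.
    pub fn run(&self, root: &Path, action: &dyn Fn(&Path)) {
        self.walk(root, 0, action);
    }

    fn walk(&self, dir: &Path, depth: usize, action: &dyn Fn(&Path)) {
        if depth > self.max_depth {
            return;
        }
        let Ok(entries) = fs::read_dir(dir) else { return };
        for entry in entries.flatten() {
            let path = entry.path();
            if path.is_dir() {
                self.walk(&path, depth + 1, action);
            } else if self
                .file_filter
                .as_ref()
                .map_or(true, |re| re.is_match(&path.to_string_lossy()))
            {
                action(&path);
            }
        }
    }
}
```

Usage would then look something like Crawler::new().max_depth(3).file_filter(Regex::new(r"\.rs$").unwrap()).run(Path::new("."), &|p| println!("{}", p.display())).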
I also ran a small benchmark where the async (via tokio) and multi-threaded (via rayon) versions took the same time, while the earlier single-threaded recursive version was 2x slower (you can still access it via [crate name]::legacy::singlethreadedrecursive::foreach_file(...)).
Tomorrow I'll maybe add simple synchronisation primitives for some context or maybe something else, idk
(Also, repo is up to date)
[Committed the code]
Worked a lot on the async version to finally get it working. During this period it behaved very weirdly (dbg! statements affected termination, and sometimes it didn't terminate at all), which is solved now. So once I have a non-async multi-threaded version, I'll conduct some benchmarks.
(Spent way too much time 😭)
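A sketch of what such a non-async multi-threaded walk can look like with rayon; the par_run() name echoes the helper mentioned in the bugfix entry above, but the actual implementation may differ:

```rust
use rayon::prelude::*;
use std::fs;
use std::path::Path;

// Recursive parallel walk: every directory level is fanned out onto
// rayon's work-stealing thread pool.
fn par_run(dir: &Path, action: &(dyn Fn(&Path) + Sync)) {
    let Ok(entries) = fs::read_dir(dir) else { return };
    let paths: Vec<_> = entries.flatten().map(|e| e.path()).collect();
    paths.par_iter().for_each(|path| {
        if path.is_dir() {
            par_run(path, action); // nested parallelism is fine in rayon
        } else {
            action(path);
        }
    });
}
```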
Refactoring the async stuff to use a second internal function for the recursion (or not?), also thinking about design choices for passing the data in the async recursion thingy (Crawler vs Config struct)...
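One common reason for that second internal function: a recursive async fn in Rust would have an infinitely-sized future, so the recursion typically goes through a helper that returns a boxed future. A sketch with hypothetical names:

```rust
use std::future::Future;
use std::path::PathBuf;
use std::pin::Pin;

// The boxing breaks the infinite type: each recursive call lives
// behind a pointer instead of being inlined into the parent future.
fn walk_async(dir: PathBuf) -> Pin<Box<dyn Future<Output = ()> + Send>> {
    Box::pin(async move {
        let mut entries = match tokio::fs::read_dir(&dir).await {
            Ok(entries) => entries,
            Err(_) => return,
        };
        while let Ok(Some(entry)) = entries.next_entry().await {
            let path = entry.path();
            if path.is_dir() {
                walk_async(path).await; // recursion via the boxed future
            }
        }
    })
}
```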
Will upload today/tomorrow
PLEASE FIX THE FORMATTING
Making a builder pattern now. Added parameters (max search depth, file/folder filter regex; I used manual filters previously) and I'm getting the async version over to the builder pattern. I also think I now know why it was slower: I just .await-ed every task right after spawning it, which basically means it behaves like the single-threaded recursive version - but async...
So what I'm doing now is using a task pool, so I can still wait for all the tasks (the user-defined async actions for every specified file) to finish before terminating.
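One way to get that with tokio is a JoinSet: spawn everything up front, then drain the set, instead of awaiting each task right after spawning it. A hypothetical sketch, not necessarily the crate's actual code:

```rust
use std::path::PathBuf;
use tokio::task::JoinSet;

async fn run_all(paths: Vec<PathBuf>) {
    let mut tasks = JoinSet::new();
    for path in paths {
        // All tasks start immediately and run concurrently...
        tasks.spawn(async move {
            // ...each one standing in for a user-defined async action.
            let _ = tokio::fs::metadata(&path).await;
        });
    }
    // ...and we only block here, waiting for every task to finish.
    while let Some(result) = tasks.join_next().await {
        result.expect("task panicked");
    }
}
```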
Will commit now!
Finished the recursive async version; it isn't really faster than the recursive single-threaded one... I will also need to work on the ergonomics, since I used Box::leak :sob:
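For context on the Box::leak part: tokio::spawn wants 'static data, and leaking a Box is the quick-and-dirty way to get that. A hypothetical sketch:

```rust
struct Config {
    max_depth: usize,
}

// tokio::spawn requires captured data to be 'static; Box::leak turns a
// heap allocation into a &'static reference, at the cost of never
// freeing it. (Arc<Config> would be the cleaner alternative.)
async fn crawl(cfg: Config) {
    let cfg: &'static Config = Box::leak(Box::new(cfg));
    let handle = tokio::spawn(async move {
        let _ = cfg.max_depth; // freely usable from any spawned task
    });
    handle.await.unwrap();
}
```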
Since async isn't faster, I'll try a non-recursive version and also add more options and a file crawler builder!
Tried out different use cases (writing file paths to a file, counting files). Found that you need a mutex anyway if you want to get references etc. into the closure that's applied to every file in the specified directory.
Since the single-threaded recursive version is basically trivial, I will work on the multithreaded version and add some ergonomics later so ppl don't need to mess with mutexes.
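To make the mutex point concrete, a minimal counting sketch: the shared counter sits behind Arc<Mutex<_>> so a multithreaded crawl could bump it from any thread:

```rust
use std::path::Path;
use std::sync::{Arc, Mutex};

fn main() {
    // Shared state for the per-file closure; without the Mutex the
    // closure couldn't mutate anything it captures across threads.
    let count = Arc::new(Mutex::new(0u64));

    let per_file = {
        let count = Arc::clone(&count);
        move |path: &Path| {
            let _ = path; // a real action would inspect the file here
            *count.lock().unwrap() += 1;
        }
    };

    // Stand-in for the crawler calling the closure for every file:
    per_file(Path::new("some/file.txt"));
    println!("files seen: {}", count.lock().unwrap());
}
```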
I don't like OneDrive; I want my files to be local instead of in the cloud. So I built a simple recursive file crawler as a simple script, but directly tried a multithreaded approach... it didn't really work, and took some hours. After that I made a simple recursive version which at least works... This gave me an idea: why not make a customisable multithreaded (or async later?) file crawler that does whatever you want! Mine just opened and directly closed every file it encountered, making OneDrive download it, but what about modifications, counting size, whatever you want.
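A minimal sketch of that first working version: recurse through the tree and briefly open every file so OneDrive hydrates it (hypothetical code, but close to what's described above):

```rust
use std::fs::{self, File};
use std::path::Path;

// Open every file once so the sync client downloads ("hydrates") it;
// the handle goes out of scope immediately, closing the file again.
fn hydrate_all(dir: &Path) -> std::io::Result<()> {
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.is_dir() {
            hydrate_all(&path)?;
        } else {
            let _opened = File::open(&path)?; // dropped at end of this block
        }
    }
    Ok(())
}
```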