Using AI to learn scikit-learn and more about ML training with Python.
Note: it might not be accurate, since the datasets are pretty small (1k each) due to system constraints and AI rate limits!
slop_detector1020 is an AI code detector made in Python.
End of project! I want to move on to something new lol. Also, there were 20-something commits, but I switched from Mint to Arch and was fed up with HEAD ref issues, so I did a force rebase and push, which deleted all the git commits on GitHub too. So please, reviewers, don't flag me. Now there are 4 commits.
Finally added the Rust Python function. Also, after switching from Mint to Arch, I had to force push and delete my git history :( so the repo only says 2 commits. The Flask app is also throwing an SSL error, hope it's just provisioning!
Tried to add Rust support, and this was by far the hardest thing to add, since I've never really programmed in Rust and don't know the syntax that well. I had to skim through tons of AI-generated Rust code, GitHub samples of Rust code, and the Rust Book. I haven't read the whole thing, but I have a basic idea of how it works now. Also asked the HC Slack and got some tips there too! I have a model ready and it's getting 95% accuracy!
Adding a Rust dataset, and also using nvim as my editor now! Got the human dataset ready, just need to get the AI code.
Adding TS support. Gathered the dataset yesterday, and today I added the feature extraction. I also needed to understand a bit of TypeScript to know what features to extract. After extracting "num_l", "num_b", "comment_r", "avg_l_len", "indent_var", "num_funcs", "arrow_r", "avg_ind_len", "num_interfaces", "num_types", "num_enums", "num_classes", "num_imports", "num_exports", "type_annotations", "generics", and "access_mods", I'm getting 98% accuracy!
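A rough sketch of how a few of these TS features could be pulled out with regex in Python. The exact patterns and definitions here are my guesses, not the project's actual code, and a real extractor would cover the full feature list:

```python
import re

def extract_ts_features(code: str) -> dict:
    """Toy regex-based extraction of a subset of the TS features (illustrative only)."""
    lines = code.splitlines()
    num_funcs = len(re.findall(r'\bfunction\b', code))
    arrows = len(re.findall(r'=>', code))
    return {
        "num_l": len(lines),
        "num_b": sum(1 for ln in lines if not ln.strip()),
        "num_funcs": num_funcs,
        "arrow_r": arrows / (num_funcs + 1),  # +1 avoids division by zero
        "num_interfaces": len(re.findall(r'\binterface\b', code)),
        "num_enums": len(re.findall(r'\benum\b', code)),
    }

sample = "interface A { x: number }\nconst f = (x: number) => x * 2;\n"
print(extract_ts_features(sample))
```

Keyword regexes like these will miscount identifiers that merely contain the keyword inside strings or comments, which is part of why detectors like this stay fuzzy.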
Ready to ship! Caddy is finally done provisioning it. Also needed to install scikit-learn on Nest. Pretty solid MVP imo; I plan on adding more languages, like TS support!
Was ready to ship the app, so I tried deploying. Vercel didn't work, and Render failed too, so I decided to use Nest and got it set up, but Caddy is still provisioning the SSL certificate.
Made a very simple Flask front end. Right now only JS code works, but you can still choose any other file type; need to fix that.
Started working on the Flask app. Created files to keep the terminal prediction and the Flask app separate. Might make a Python package, or not, idk yet.
Added JS support! Got 408 files each of human code (from GitHub) and AI code (Gemini, GPT-4.1, Claude), then copy-pasted the feature extraction and model training; getting 96% accuracy! Changed the feature extraction to get stuff like the arrow-to-function ratio now!
Added a percentage instead of a blunt "AI" or "human" verdict. Also added a few GPT-5 samples and it seems to be pretty accurate. I plan on adding emoji extraction: GPT code does tend to use emojis, so code with emojis is more likely to be AI.
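Going from a hard label to a percentage usually just means calling the classifier's predict_proba instead of predict. A minimal sketch with a tiny made-up dataset (the real model and features are different):

```python
from sklearn.ensemble import RandomForestClassifier

# Tiny fake dataset: two features per file, label 0 = human, 1 = AI.
X = [[0.1, 3], [0.2, 4], [0.8, 12], [0.9, 11]]
y = [0, 0, 1, 1]

clf = RandomForestClassifier(random_state=0).fit(X, y)

# predict_proba returns [P(human), P(AI)] for each input row
probs = clf.predict_proba([[0.85, 10]])[0]
print(f"{probs[1] * 100:.1f}% likely AI")
```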
HTML support is added. Scraped 400 human samples from GitHub; getting human samples isn't a problem, it's seamless. AI samples are the problem: got 200 from Gemini, and the rest were from ChatGPT 4.1.
Needed to use AI to get a better result, since with the right prompt you can easily pass the detector. Even after using AI (and tons of fixing of AI code), it's still pretty dodgy and not that reliable. I guess it's because of the lack of proper data, and I currently can't solve that without paying for an AI service, so HTML detection will be marked as a rough estimate.
Also, I fixed the scraper by using GitHub's Search API and then decoding the base64 file contents!
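GitHub's REST API returns file bodies base64-encoded (wrapped with newlines) in the "content" field, so the decoding step looks roughly like this. The actual scraper's request flow (search query, auth, pagination) isn't shown; the fake payload below just simulates the shape of a real contents response:

```python
import base64

def decode_github_content(payload: dict) -> str:
    """Strip the newline wrapping GitHub adds to base64 content, then decode."""
    raw = payload["content"].replace("\n", "")
    return base64.b64decode(raw).decode("utf-8")

# Simulated API response (real ones also carry name, path, sha, size, etc.)
fake = {"content": base64.b64encode(b"print('hi')").decode(), "encoding": "base64"}
print(decode_github_content(fake))
```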
Increased the dataset to 400 files each for AI and human. Scraped GitHub for human code, and used Gemini 1.5 Flash, Claude, and ai.hackclub to generate AI code. Used TF-IDF to get a better result; ended up getting 97% accuracy!
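A TF-IDF text model over raw code can be sketched in a few lines of scikit-learn. The corpus below is a toy stand-in for the 400-per-class dataset, and the choice of char n-grams and logistic regression is my assumption, not necessarily what the project used:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in corpus; 0 = human, 1 = AI
human = ["def add(a, b):\n    return a + b", "x = input()\nprint(x)"]
ai = ["# This function adds two numbers\ndef add_numbers(a, b):\n    return a + b",
      "# Prompt the user for input\nvalue = input()\nprint(value)"]
X = human + ai
y = [0, 0, 1, 1]

model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),  # char n-grams suit code
    LogisticRegression(),
)
model.fit(X, y)
pred = model.predict(["# Add two numbers\ndef add(a, b): return a + b"])
print(pred)
```

Bundling the vectorizer and classifier in one pipeline keeps the TF-IDF vocabulary fitted on training data only, so the same object can be saved and reused at prediction time.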
Also made a simple prediction script. It uses the same extraction code from features.py, renames the dictionary's keys to suit the model's requirements, and then loads the trained model to get the prediction. Currently it only works for Python code, but I plan on adding HTML, CSS, and JS.
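The key-renaming step might look something like this. The mapping, the feature names, and the use of joblib for model persistence are all my assumptions for illustration, not the project's actual code:

```python
import joblib  # assumed persistence library; scikit-learn models are often saved this way

# Hypothetical mapping from features.py's extraction keys to the
# column names the model was trained with.
KEY_MAP = {"lines": "num_l", "blanks": "num_b", "comment ratio": "comment_r"}

def rename_keys(features: dict) -> dict:
    """Rename extracted-feature keys to match the trained model's expectations."""
    return {KEY_MAP.get(k, k): v for k, v in features.items()}

def predict(features: dict, model_path: str = "model.joblib"):
    model = joblib.load(model_path)
    # Order the values to match the feature order used at training time
    row = [rename_keys(features)[name] for name in ("num_l", "num_b", "comment_r")]
    return model.predict([row])[0]
```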
First devlog! Today I got 20 samples each: AI samples via ai.hackclub using a simple AI code-gen Python script I created, and human samples gathered manually. Also built feature extraction and got it to output a .csv in the format: filename, label, lines, blanks, comment ratio, line length, indent variation, functions.
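A sketch of what that first feature extractor and CSV row could look like. The exact feature definitions are my guesses (indent variation is omitted here), and the stdlib csv module stands in for however the project actually writes the file:

```python
import csv
import io

def extract(code: str) -> dict:
    """Rough guesses at the first-devlog features for a Python file."""
    lines = code.splitlines()
    n = max(len(lines), 1)
    return {
        "lines": len(lines),
        "blanks": sum(1 for ln in lines if not ln.strip()),
        "comment_ratio": sum(1 for ln in lines if ln.strip().startswith("#")) / n,
        "avg_line_length": sum(len(ln) for ln in lines) / n,
        "functions": sum(1 for ln in lines if ln.lstrip().startswith("def ")),
    }

# Write one row in the devlog's filename,label,features... format
buf = io.StringIO()
writer = csv.writer(buf)
row = extract("# demo\ndef f():\n    return 1\n")
writer.writerow(["demo.py", "human"] + list(row.values()))
print(buf.getvalue().strip())
```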
I also need to gather larger datasets, perhaps by web scraping GitHub and PyPI.