Using AI to learn scikit-learn and more about ML training with Python.
slop_detector1020 is an AI code detector in its early stages. It's made in Python and only works for Python code right now.
Added a percentage score instead of a blunt AI-or-human verdict. Also added a few GPT-5 samples, and it seems to be pretty accurate. I plan on adding emoji extraction next: GPT code tends to use emojis, so code that contains them is more likely to be AI.
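The emoji idea could look something like this. It's only a rough sketch of a feature that isn't built yet: `emoji_features`, the two feature names, and the Unicode ranges are all assumptions, and the ranges are not a complete emoji list.

```python
import re

# Rough emoji ranges (an assumption, not an exhaustive Unicode emoji list).
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, newer symbols
    "\u2600-\u27BF"          # misc symbols and dingbats, e.g. check marks
    "]"
)


def emoji_features(source: str) -> dict:
    """Count emoji characters in a source file (hypothetical feature names)."""
    emojis = EMOJI_PATTERN.findall(source)
    lines = source.splitlines() or [""]
    return {
        "emoji_count": len(emojis),
        "emoji_per_line": len(emojis) / len(lines),
    }


if __name__ == "__main__":
    sample = 'print("done \u2705")\n# \U0001F680 launch\n'
    print(emoji_features(sample))
```

Both values could be added as extra columns next to the existing features, so the model decides how much weight emojis deserve.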
HTML support is added. Scraped 400 human samples from GitHub. Getting human samples isn't a problem, it's seamless; AI samples are the problem. Got 200 from Gemini, and the rest were from ChatGPT 4.1.
I needed to use AI to get a better result, since with the right prompt you can easily get past the detector. Even after using AI (and tons of fixing AI code), it's still pretty dodgy and isn't that reliable. I guess it's because of the lack of proper data, and I currently won't be able to solve that without paying for an AI service, so HTML detection will be marked as a rough estimate.
Also, I fixed the scraper by using GitHub's search API and then decoding the base64-encoded file contents!
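For anyone curious, the search-then-decode flow can be sketched roughly like this. It is not the actual scraper: `search_python_files`, `get_json`, and `decode_contents` are made-up names, and it assumes you have a GitHub token, since code search needs authentication.

```python
import base64
import json
import urllib.parse
import urllib.request

API = "https://api.github.com"


def decode_contents(payload: dict) -> str:
    """Decode the base64 'content' field of a GitHub contents API response."""
    if payload.get("encoding") != "base64":
        raise ValueError("expected base64-encoded content")
    # GitHub wraps the base64 text in newlines; b64decode ignores them by default.
    return base64.b64decode(payload["content"]).decode("utf-8", errors="replace")


def get_json(url: str, token: str) -> dict:
    """GET a GitHub API URL and parse the JSON body (hypothetical helper)."""
    request = urllib.request.Request(url, headers={
        "Accept": "application/vnd.github+json",
        "Authorization": f"Bearer {token}",
    })
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def search_python_files(query: str, token: str, per_page: int = 10):
    """Yield (path, source) pairs for code search hits (needs network access)."""
    params = urllib.parse.urlencode(
        {"q": f"{query} language:python", "per_page": per_page}
    )
    results = get_json(f"{API}/search/code?{params}", token)
    for item in results["items"]:
        # Each hit links to the contents API, which returns the file as base64.
        yield item["path"], decode_contents(get_json(item["url"], token))
```

Only `decode_contents` runs without network access; the other two functions call the live API and are subject to its rate limits.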
Increased the dataset to 400 files each for AI and human: scraped GitHub for human code, and used Gemini 1.5 Flash, Claude and ai.hackclub to generate AI code. Used TF-IDF to get a better result and ended up getting 97% accuracy!
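A minimal sketch of what a TF-IDF setup can look like in scikit-learn. The classifier (`LogisticRegression`), the character n-gram settings, and the tiny made-up samples are all assumptions for illustration. The real model was trained on the 400-file datasets, and this toy version says nothing about the 97% figure.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up samples, only to show the shape of the training data.
human_samples = [
    "def f(x):\n  return x*2\n",
    "import sys\nfor l in sys.stdin: print(l)\n",
]
ai_samples = [
    '"""Utility module."""\n\n\ndef double(value: int) -> int:\n'
    '    """Return twice the value."""\n    return value * 2\n',
    "# Step 1: Read input\n# Step 2: Print each line\n"
    "def main() -> None:\n    pass\n",
]

texts = human_samples + ai_samples
labels = [0] * len(human_samples) + [1] * len(ai_samples)  # 0 = human, 1 = AI

# Character n-grams pick up spacing and naming habits, not just keywords.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)


def ai_probability(source: str) -> float:
    """Return the model's estimated probability that the source is AI-written."""
    return float(model.predict_proba([source])[0][1])


if __name__ == "__main__":
    print(f"{ai_probability('x = 1') * 100:.1f}% AI")
```

Using `predict_proba` instead of `predict` is one way to report a percentage rather than a hard AI-or-human label.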
Also made a simple prediction script. It uses the same extraction code from features.py, renames the dictionary's keys to suit the model's requirements, and then loads the trained model to get the prediction.
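The key-renaming step can be sketched like this. `KEY_MAP`, `to_model_row`, and the underscore-style column names are hypothetical; the real names depend on what the model was trained with.

```python
# Hypothetical mapping from extractor keys to the names used at training time.
KEY_MAP = {
    "comment ratio": "comment_ratio",
    "line length": "line_length",
    "indent variations": "indent_variations",
}


def to_model_row(features: dict, columns: list) -> list:
    """Rename feature keys and return values in the model's column order."""
    renamed = {KEY_MAP.get(key, key): value for key, value in features.items()}
    missing = [column for column in columns if column not in renamed]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return [renamed[column] for column in columns]


if __name__ == "__main__":
    extracted = {"lines": 10, "blanks": 2, "comment ratio": 0.2}
    print(to_model_row(extracted, ["lines", "blanks", "comment_ratio"]))
```

The resulting row can then be wrapped in a list and passed to whatever model was loaded from disk, for example one restored with `joblib.load`.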
Currently it only works for Python code, but I plan on adding HTML, CSS and JS.
First devlog! Today I got 20 samples each: AI samples from ai.hackclub, using a simple AI code-gen Python script I created, and human samples gathered manually. Also built feature extraction and got it to output a .csv in the format: filename,label,lines,blanks,comment ratio,line length,indent variations,functions
Also, I need to gather larger datasets, perhaps by web scraping GitHub and PyPI.
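Here is a rough sketch of feature extraction that produces the CSV columns listed above. The column names come from the devlog, but how each value is computed (for example, treating "line length" as the average length of non-blank lines) is my assumption, and `extract_features` and `to_csv` are illustrative names rather than the code in features.py.

```python
import csv
import io
import re

FIELDS = ["filename", "label", "lines", "blanks", "comment ratio",
          "line length", "indent variations", "functions"]


def extract_features(filename: str, label: str, source: str) -> dict:
    """Compute simple style features for one Python file (assumed definitions)."""
    lines = source.splitlines()
    code_lines = [line for line in lines if line.strip()]
    comments = sum(1 for line in lines if line.lstrip().startswith("#"))
    # Number of distinct leading-space widths among non-blank lines.
    indents = {len(line) - len(line.lstrip(" ")) for line in code_lines}
    functions = sum(
        1 for line in lines if re.match(r"\s*(async\s+)?def\s+\w+", line)
    )
    avg_length = (
        sum(len(line) for line in code_lines) / len(code_lines)
        if code_lines else 0.0
    )
    return {
        "filename": filename,
        "label": label,
        "lines": len(lines),
        "blanks": len(lines) - len(code_lines),
        "comment ratio": round(comments / len(lines), 3) if lines else 0.0,
        "line length": round(avg_length, 2),
        "indent variations": len(indents),
        "functions": functions,
    }


def to_csv(rows: list) -> str:
    """Serialize feature rows to CSV text with the header from FIELDS."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    code = "# add\ndef add(a, b):\n    return a + b\n\nprint(add(1, 2))\n"
    print(to_csv([extract_features("add.py", "human", code)]))
```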