LLM-Synth 🎵

LLM-Synth 🎵 Used AI

11 devlogs
35h 56m
Created by Nedimark

An All-in-One-Solution for creating Synthetic Datasets

Timeline

FINALLY it works! And I have also done all the tools i wanted implemented. So for llm side tools you have a away to parse to the alapaca style isntruct or the Chat-ML Template for conversations the same if you want to derive them from context! plus a clean function that classufies and cleans your data! Now the more complex ones are the code tools: merge: merge two datasets of same length together great for adding messages in conversations! bind: if you have per row multiple dialogues for example they will now get each their own entry segregate: the completion part for clean here you get the cleaned dataset seperated according to the classification; expand want to a pply a single line of information to multiple rows of data USE EXPAND ! With select you can specifically select one of the segregated datasets and get acces to it only; count well counts the entries and with percantage you can for example see how much data fell through the classification! Pheeeeeew that was A LOT. Now this train is transforming into a jet cause its flying straight to UI-city!!!1!

Update attachment

I think I have a sickness ... and that is to rewrite the whole codebase ... it's now the third time... But now you have more flexivility to use tools on your data. For one you have llm tools like clean, parse and derive where the data is uploaded via batches and prewritten templates and code tools which are python functions that allow for example to connect to data_files finalize your dataset and more... but for this flexibility and ability to easily add more tools was really hard like look at the time. So today is just a touchpad drawn image illustrating my pain knowing I still have to debug......😿

Update attachment

As developers we all believe in the sunk cost fallacy... It's in our nature to believe in our self written code, it may be ugly, but it works!1!2! So today i have decided not to be one in a krillion🐠 and rewrote all my code using a node based system. Now not only should workflow creation be simplified but even more complex ones are possible. (I hope it was worth the pain 😿😿😿)

Update attachment

Finally after such a looooong night 😴😴😴 I have managed to implement JSON Parsing as a processing step after generating the seed (raw) data. Now the next step would be finalizing the JSON formated Data into a pure JSON only Dataset. that would logically then be the ... (looks up the life stages of a tree) ... sapling stage?

Update attachment

Before moving along my To-Do List I have decided to make the UX more ✨nice✨.
Now next stop!!! 🚂 (this time for real(I hope)) JSON Parsing!

Update attachment

After figthing the gods of pyhton for a whole day... I finally have gotten my first generated bit of synthetic data! While its still not pretty formated in JSON, it is a start! Therefore next I'll be heading to the sacred Valleys of JSON Parsing. ⛰️🏔️🚞

Update attachment

Before Ending the Day I also created a template for creating your own synthetic seeds. Feedback is welcome and wanted 😸😸😼

Update attachment

Hello! Today was a loooooot of brainstorming on what would be the best way to combine a lot of different prompts for an LLM to generate a great amount of Synthetic conversations. The Result this JSON Template filled out with an prompt, that took quite a while to get right ... I really might make an custom GPT or similiar to just generate these in a conversation... If u have any ideas dont hesitate to comment !😸😸😸

Update attachment

Today I decided to pick this project up again. And decided to switch to the OpenAI API and did a lil experimenting (cuz who doesnt love some good experiments), because i got a grant from the toolsmith YSWS ... With that in Mind i created a plan how to build the Application (see Attachment)

Update attachment

Sometimes you just have days when the code aint coding and you may have written thousands of lines, but in the end u push none.. 😿😿😿

Update attachment

Just Playing around with the Clairifai API... I present the KITTY DETECTOR😼... (Work Not Final)

Update attachment