Comment by jlpk
Of course! These are one-time migrations, so at most we have 3-4 projects happening concurrently, and we don't need to worry about backwards compatibility either. We can just import a specific version or branch of the library, or at worst, copy and paste the function and make the change in the notebook. But the majority of the functions we've put work into in the library are the ones consistently applied across any incoming data file - API scripts, de-duplication, assigning IDs... honestly, changes like these are usually pretty easy to make.
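To give a rough sense of the kind of reusable function I mean - a minimal sketch, assuming pandas and hypothetical column/ID names (the real library functions are more involved):

    import pandas as pd

    def dedupe_and_assign_ids(df: pd.DataFrame, key_cols: list[str],
                              id_col: str = "record_id") -> pd.DataFrame:
        # Drop exact duplicates on the key columns, keeping the first occurrence,
        # then assign sequential IDs so every record has a stable identifier.
        deduped = df.drop_duplicates(subset=key_cols).reset_index(drop=True)
        deduped[id_col] = range(1, len(deduped) + 1)
        return deduped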
The tests are always consistent across files and primarily check the validity of the upload - the right type, for example - or its logic (start times have end times, etc.). Every test should work for every file ever, since they're based on platform and upload constraints rather than partner data.
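As a hedged illustration of what those constraint checks look like in practice (column names are hypothetical, assuming pandas):

    import pandas as pd

    def check_upload_constraints(df: pd.DataFrame) -> None:
        # Type check: the platform requires integer IDs, whatever the partner sent.
        assert pd.api.types.is_integer_dtype(df["record_id"]), "record_id must be an integer column"
        # Logic check: every record with a start time must also have an end time.
        has_start = df["start_time"].notna()
        assert df.loc[has_start, "end_time"].notna().all(), "some start times are missing end times"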
Short answer is they don't. For onboarding interns into the process, I write a skeleton notebook that imports the internal library and walks them through cleaning a file. But we hire interns who have a background in coding and data cleaning. Starting out, rather than change an existing function, they might add a line of code in the notebook that transforms the data so the existing function works as-is, for example. There are cases where business-specific logic needs to be coded into a function, but we just write those ad hoc. This isn't an upload that needs to happen automatically or very quickly, so that hasn't been a problem.
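For example, a notebook cell like this (cleaning_lib and the column names are hypothetical stand-ins for our internal library) lets an intern handle a partner quirk without touching the library:

    # Normalize the partner's nonstandard column name first, so the existing
    # library function works unchanged instead of being edited for one file.
    df = df.rename(columns={"Event Start (local)": "start_time"})
    df = cleaning_lib.parse_dates(df)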
The only reason other team members contribute to this is to shape the business logic - which fields should be uploaded where, how they should be formatted, etc. But the data review that sometimes needs to happen is very tricky, e.g., checking that the transformed results are what they want. It mostly happens in Excel - we did build a POC UI where they could upload a CSV, click through each record, and review and suggest changes in a cleaner way.
For LLMs, we don't use them for mapping, though we've tested it and it works fine. LLM mapping doesn't really save us a ton of time compared to just looking through the file and assigning columns in our mapping file (which has about 120 columns). If we wanted to deploy it and let a partner upload a sheet and have the LLM suggest a mapping, that could work. (The mapping step in the demo video on your site looks great!) Instead, we use them to format and extract unstructured data - for example, we might need to pull tags out of descriptive text, or extract event hours from a string describing the event. The LLM does this really well now with structured JSON outputs, but reviewing it to make sure it's correct is still a manual process.
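A minimal sketch of that extraction step, assuming the OpenAI Python SDK with structured JSON outputs (the model choice and schema fields here are placeholders, not our actual setup):

    import json
    from openai import OpenAI

    client = OpenAI()
    description = "Family movie night this Friday! Doors at 6pm, film ends around 8:30pm."

    schema = {
        "name": "event_extraction",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "tags": {"type": "array", "items": {"type": "string"}},
                "hours": {"type": "string"},  # e.g. "18:00-20:30"
            },
            "required": ["tags", "hours"],
            "additionalProperties": False,
        },
    }

    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Extract tags and event hours from:\n{description}"}],
        response_format={"type": "json_schema", "json_schema": schema},
    )
    extracted = json.loads(resp.choices[0].message.content)  # still reviewed manually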
Appreciate the detailed follow-up and all this information - it's been super insightful to get a more real-world perspective on this. Your processes all sound very well thought out and address most of the shortcomings/pain points I've experienced personally. There's definitely a lot of valuable info in there for anyone in this space. And thank you for taking the time to check out the site - feedback is limited at this stage, so hearing that something resonated is great. Hoping it all makes sense.