Comment by dataflowmapper

Comment by dataflowmapper 7 months ago

Wow, thanks for the detailed and insightful breakdown of your process. This really resonates a lot of the complexities I've been exploring and encountered as an implementation consultant.

A few things particularly stood out to me:

  - Your point about semantic mapping ("understanding the structure... what it means and where it should go") being more irritating than the actual coding of transformations is spot on. Domain expertise on these things can make a big difference.

  - The version tracking issues with Excel and the need to trace back decisions or errors are a big reason Excel isn't scalable. I totally agree with you and think Excel is bad for this kind of stuff and repeatable transformations/processes are important.
  - Your desire for an easier way to flag/share errors with partners for correction and then effectively store those changes as part of a repeatable process (like 'code' or a versioned configuration) is a fascinating challenge in collaborative data onboarding. It seems crucial for those fixes to not be one-offs especially when dealing with partners/clients which are notoriously slow. I've used Flatfile in the past, but its limiting in that its all about one-offs and transformations aren't repeated.

It sounds like you've built a robust internal Python library to manage a lot of this, which is impressive. When you use Jupyter notebooks for each incoming file:

  - How do you manage the reusability of common logic or tests across those notebooks, especially if a core cleaning function needs an update that should apply to many different partner files?
  - And how do less technical team members (like the interns you mentioned) interact with or contribute to that Python/notebook-based process, particularly if they need to understand or slightly modify a transformation rule?

And for your usage of LLM, is that mostly for mapping source field to target fields?

Thanks again for sharing such a comprehensive view, it's incredibly helpful for understanding the friction points and the need for traceable, repeatable solutions.

jlpk 7 months ago

Course! These are one-time migrations, so at most we have 3-4 projects happening concurrently and don't need to worry about backwards compatibility either. We can just import a specific version or branch of the library, or at worst, we just copy and paste the function and make the change in the notebook. But a majority of the functions we've put work into in the library are really the consistently applied across any incoming data files - API scripts, de-duplication, assigning IDs... honestly any changes like this are usually pretty easy to make.

The tests are always consistent across files and primarily check for the validity of the upload (right type, for example, or logic (start times have end times, etc.). Every test should work for every file ever, since those are based on the platform and upload constraints rather than partner data.

Short answer is they don't. For onboarding interns into the process, I write a skeleton notebook that imports the internal library and walks them through cleaning a file. But we would hire for interns that have background in coding and data cleaning. Starting out, rather than change a existing function, they might add a line of code in the notebook that changes the data in a way that the existing function would now work, for example. There are cases where specific-business logic needs to be coded into a function, but we would just write those ad-hoc. This isn't an upload that needs to be done automatically or very quickly, so that hasn't been a problem.

The only reason other team members would contribute to this would be around shaping the business logic of which fields should be uploaded where and how they should be formatted, etc. But that data review part that sometimes needs to be done is very tricky, e.g., checking that the transformed results are what they want. It mostly happens in Excel - we did build a POC UI where they would upload a CSV, and they could click through each record and review and suggest changes in a cleaner way.

For LLMs, we don't use them for mapping, but we've tested it and it works fine. The LLM mapping doesn't really save us a ton of time compared to us just looking through the file and assigning columns in our mapping file (which is composed of about 120 columns). If we wanted to deploy it and allow the partner to upload a sheet and the LLM to suggest a mapping, that could work. (The mapping step in the demo video on your site looks great!) Instead we use them to format and extract unstructured data -> for example, we might need to pull tags out of descriptive text, or extract event hours out of a string of text describing the event. The LLM can do this really well now with structured JSON outputs, but reviewing it to ensure it is correct is still a manual process.

Reply View 1 reply

dataflowmapper 7 months ago

Appreciate the detailed follow up and all this information, its been super insightful to get more real world perspective on this. Your processes all sound very well thought out and address most of the shortcomings/pain points I've experienced personally. There's definitely a lot of valuable info in there for anyone in this space. And thank you for taking the time to check out the site, feedback is limited at this stage so hearing that something resonated is great, hoping it all makes sense.

Reply View | 0 replies