Comment by clejack 2 days ago

2 replies

The main issues for problems like this fall into three categories:

- Things that prevent you from starting the job: org silos, security, and permissions.

- Things that prevent you from doing the job. This is primarily data cleaning.

- Things that make the job more difficult. This involves poor tooling, and you'll struggle to break the stranglehold that SQL and Python/pandas have in this area. I'll also add plotting libraries to this. Many of them suck in a seemingly unavoidable way.

On the second and third points, LLMs will most likely own these soon enough, though maybe there's room to build something small and local that's more efficient if the scope of the agent is reduced?

The first point is generally organizational, and it's very difficult to solve outside of integrating your system into an existing environment, which is the strategy pursued by companies like Snowflake and Databricks.

robz75 2 days ago

What are the pain points you are facing with data cleaning? How do you handle it for now?

  • dapperdrake 2 days ago

    Data cleaning depends on the problem domain.

    Compare eliminating outliers from the output of a spectrometer (or spectrograph) vs. eliminating outliers from an almost linear process. One will wreck your data and the other is the only correct thing to do.

             *         
    
    **** ****
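
The contrast described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample values are made up, the median-absolute-deviation (MAD) rule and its threshold of 5 are one arbitrary choice of outlier filter, and `mad_outlier_mask` and `line_residuals` are hypothetical helper names. The same filter flags the single spike in both datasets; whether dropping it is correct depends entirely on the domain.

```python
import statistics

def mad_outlier_mask(values, threshold=5.0):
    """Flag values more than `threshold` MADs away from the median."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    scale = mad if mad > 0 else 1e-12  # guard against a zero MAD
    return [abs(v - med) / scale > threshold for v in values]

def line_residuals(xs, ys):
    """Residuals of ys around an ordinary least-squares line through (xs, ys)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]

# Made-up spectrum: flat baseline with one emission peak at index 5.
# The peak IS the measurement, so filtering it out destroys the data.
spectrum = [1.0, 1.1, 0.9, 1.0, 1.2, 9.5, 1.0, 0.9, 1.1, 1.0]
spectrum_mask = mad_outlier_mask(spectrum)
spectrum_cleaned = [v for v, bad in zip(spectrum, spectrum_mask) if not bad]

# Made-up near-linear process (roughly y = 2x + 1) with one glitch at x = 5.
# Here the spike is a fault, so filtering on the fit residuals is appropriate.
xs = list(range(10))
process = [1.0, 3.1, 4.9, 7.0, 9.1, 30.0, 13.0, 14.9, 17.1, 19.0]
process_mask = mad_outlier_mask(line_residuals(xs, process))
process_cleaned = [(x, y) for x, y, bad in zip(xs, process, process_mask) if not bad]

print("spectrum points flagged:", [i for i, bad in enumerate(spectrum_mask) if bad])
print("process points flagged: ", [i for i, bad in enumerate(process_mask) if bad])
```

In both cases the filter behaves identically and flags only the spike. Nothing in the numbers says whether that spike is signal or noise; that judgment has to come from knowing what produced the data.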