Comment by hyperman1 3 days ago

Some of my experiences:

Step 1 is always storing a pristine, unmodified copy. Try to build a chain of steps that can always be redone starting from that copy.
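
A minimal sketch of that idea in Python (the directory layout and the trivial cleanup step are made up for illustration; a real pipeline would have many such steps):

    import hashlib
    import shutil
    from pathlib import Path

    RAW_DIR = Path("data/raw")          # pristine copies, never modified
    DERIVED_DIR = Path("data/derived")  # everything here can be deleted and rebuilt

    def ingest(source_file: Path) -> Path:
        """Store an untouched copy of the incoming file and record its checksum."""
        RAW_DIR.mkdir(parents=True, exist_ok=True)
        target = RAW_DIR / source_file.name
        shutil.copy2(source_file, target)
        checksum = hashlib.sha256(target.read_bytes()).hexdigest()
        target.with_suffix(target.suffix + ".sha256").write_text(checksum + "\n")
        return target

    def rebuild(pristine: Path) -> Path:
        """Derived artifacts are produced only from the pristine copy,
        so the whole chain can be redone from scratch at any time."""
        DERIVED_DIR.mkdir(parents=True, exist_ok=True)
        cleaned = DERIVED_DIR / (pristine.stem + ".cleaned.txt")
        with pristine.open(encoding="utf-8", errors="replace") as src, cleaned.open("w") as dst:
            for line in src:
                dst.write(line.strip() + "\n")  # stand-in for the real cleanup steps
        return cleaned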

If you have any control over what comes in, try to set some baseline requirements: a stable identifier for each record, a parseable number format with clarity about US vs. rest-of-world conventions (comma vs. period as the decimal separator), and a stable date format like ISO 8601. Also, a real structured format like JSON, XML, ... works better than CSV, which in turn is better than XLS(X).
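
For illustration, a record that meets those baseline requirements could be checked like this in Python (the field names and the three-field layout are assumptions, not anything the data will necessarily have):

    from datetime import date
    from decimal import Decimal, InvalidOperation

    def validate(record: dict) -> dict:
        """Check one incoming record: stable id, unambiguous number, ISO 8601 date."""
        if not record.get("id"):
            raise ValueError("record has no stable identifier")
        try:
            # Require a period as the decimal separator; "1.234,56"-style values are rejected.
            amount = Decimal(record["amount"])
        except (KeyError, InvalidOperation) as exc:
            raise ValueError(f"unparseable amount: {record.get('amount')!r}") from exc
        # ISO 8601 calendar date, e.g. 2024-05-31.
        when = date.fromisoformat(record["date"])
        return {"id": record["id"], "amount": amount, "date": when}

    # A record that passes:
    validate({"id": "ORD-0001", "amount": "1234.56", "date": "2024-05-31"})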

From there, it depends.

If there is a lot of data and you need a quick estimate, the POSIX toolset is a good choice: cut, grep, sed, jq, ...

If there is a lot of data and it is reasonably structured, load it minimally parsed into an SQL table as (line number, source id, line of text) and parse from there. The database will parallelize the work automatically.
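
A sketch of that minimally parsed staging step in Python, using the built-in sqlite3 just to keep the example self-contained (a server database like PostgreSQL is what actually parallelizes the later parsing queries; table and column names are made up):

    import sqlite3

    conn = sqlite3.connect("staging.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS raw_lines (
            line_no   INTEGER,
            source_id TEXT,
            line      TEXT
        )
    """)

    def load(source_id: str, path: str) -> None:
        """Store each line verbatim; all real parsing happens later, in SQL."""
        with open(path, encoding="utf-8", errors="replace") as f:
            rows = ((i, source_id, line.rstrip("\n")) for i, line in enumerate(f, start=1))
            conn.executemany("INSERT INTO raw_lines VALUES (?, ?, ?)", rows)
        conn.commit()

    # Parsing then stays in SQL over the staged text, e.g. pulling the first
    # semicolon-delimited field out of each line:
    #   SELECT source_id, line_no, substr(line, 1, instr(line, ';') - 1)
    #   FROM raw_lines;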

If the data is hard to parse, grab a real programming language. You probably want something that can deal with character data quickly, e.g. Go, Java, Rust, C#, ... Python or R work if the amount of data is small or you can lean heavily on things like numpy. PHP, VBA, ... tend to be slow and error-prone.
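
To make the shape of that work concrete, here is a minimal line-by-line parsing loop; it is written in Python for brevity, and the same structure carries over to Go, Java, Rust, or C# when throughput matters (the delimiter and three-field layout are invented for the example):

    import csv
    import sys

    def parse(path: str):
        """Stream a messy delimited file row by row, keeping rejects for
        inspection instead of crashing on the first bad line."""
        good, bad = [], []
        with open(path, encoding="utf-8", errors="replace", newline="") as f:
            for line_no, row in enumerate(csv.reader(f, delimiter=";"), start=1):
                if len(row) != 3:
                    bad.append((line_no, row))
                    continue
                record_id, raw_amount, raw_date = (field.strip() for field in row)
                good.append((record_id, raw_amount, raw_date))
        return good, bad

    if __name__ == "__main__":
        ok, rejected = parse(sys.argv[1])
        print(f"{len(ok)} parsed, {len(rejected)} rejected")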

My experience with ETL tooling is that it's eternally 90% there. In theory it should solve all of this, but after a while I always seem to end up fighting the tools and falling back on a programming language.