Comment by didgetmaster 18 hours ago
I am building a new data management system that can also handle relational data well. It is really good at finding anomalies in data to help clean it up after the fact, but I wanted a way to prevent errors from polluting the data set in the first place.
My solution was to let the user create a 'dictionary' of valid values for each column in the table. In your case, you would create a list of all valid country names for that particular column. When inserting a new row, it checks that the country name matches one of those values. I thought this might slow things down significantly, but testing shows I can insert millions of rows with just a minor performance hit.
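A minimal sketch of how such a per-column dictionary check might look, assuming a simple in-memory table; `COLUMN_DICTIONARIES`, `validate_row`, and `insert_row` are hypothetical names for illustration, not the actual API:

```python
# Per-column dictionary validation on insert. Sets give O(1)
# membership tests, which is why the per-row overhead stays small
# even across millions of inserts.

COLUMN_DICTIONARIES = {
    "country": {"United States", "Canada", "Mexico", "Germany", "Japan"},
    # ... one set of valid values per constrained column
}

def validate_row(row: dict) -> list[str]:
    """Return a list of error messages; an empty list means the row is valid."""
    errors = []
    for column, valid_values in COLUMN_DICTIONARIES.items():
        value = row.get(column)
        if value is not None and value not in valid_values:
            errors.append(f"{column}: {value!r} is not in the dictionary")
    return errors

def insert_row(table: list, row: dict) -> bool:
    """Insert only if every constrained column passes its dictionary check."""
    if validate_row(row):
        return False  # reject instead of letting the bad value pollute the table
    table.append(row)
    return True

table: list[dict] = []
print(insert_row(table, {"country": "Canada"}))  # True  -> inserted
print(insert_row(table, {"country": "Cnada"}))   # False -> rejected at the door
```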
The next step is to 'auto-correct' erroneous values to the closest matching entry instead of just rejecting them.
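For that closest-match step, one common approach is fuzzy string matching; for example, Python's standard-library `difflib` can propose the nearest dictionary entry (the `cutoff` threshold here is an illustrative choice, not anything from the comment):

```python
import difflib

def auto_correct(value: str, valid_values: set[str], cutoff: float = 0.8) -> str | None:
    """Return the closest dictionary entry, or None if nothing is close enough."""
    matches = difflib.get_close_matches(value, valid_values, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(auto_correct("Cnada", {"Canada", "Germany", "Japan"}))     # -> 'Canada'
print(auto_correct("Atlantis", {"Canada", "Germany", "Japan"}))  # -> None
```

A fairly high cutoff is worth considering here, since silently 'correcting' to the wrong value is worse than rejecting it; anything below the threshold can still be flagged for human review.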
This isn't wildly different from what I've done, but the problem is the sheer volume of crap scattered through all the different fields. The countries are the least of my problems; there are other fields where I'm faced with tens of millions of distinct combinations, which makes the countries a relatively trivial problem in comparison.