Comment by didgetmaster 18 hours ago

I am building a new data management system that can also handle relational data well. It is really good at finding anomalies in data to help clean it up after the fact, but I wanted to find a way to prevent the errors from polluting the data set in the first place.

My solution was to enable the user to create a 'dictionary' of valid values for each column in the table. In your case you would create a list of all valid country names for that particular column. When inserting a new row, it checks to make sure the country name matches one of the values. I thought this might slow things down significantly, but testing shows I can insert millions of rows with just a minor performance hit.
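A minimal Python sketch of the idea (the table, names, and API here are hypothetical illustrations, not the system's actual interface):

```python
# Hypothetical sketch of per-column dictionary validation; the real
# system's API and storage layer are not described in this thread.
VALID_COUNTRIES = {"France", "Germany", "Japan", "United States"}

COLUMN_DICTIONARIES = {
    "country": VALID_COUNTRIES,  # one dictionary of valid values per column
}

def insert_row(table, row):
    """Append the row only if every dictionary-constrained column holds a listed value."""
    for column, valid_values in COLUMN_DICTIONARIES.items():
        value = row.get(column)
        if value is not None and value not in valid_values:
            raise ValueError(f"invalid value {value!r} for column {column!r}")
    table.append(row)

table = []
insert_row(table, {"name": "Alice", "country": "France"})   # accepted
# insert_row(table, {"name": "Bob", "country": "Frnace"})   # raises ValueError
```

A set-membership test is O(1) on average, so a check like this stays cheap even across millions of inserts, which is consistent with the minor performance hit described above.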

The next step is to 'auto-correct' erroneous values to the closest matching entry instead of just rejecting them.
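One plausible way to do that closest-match step, sketched with Python's standard difflib (the cutoff value and function names are illustrative assumptions, not the system's actual matching algorithm):

```python
import difflib

VALID_COUNTRIES = {"France", "Germany", "Japan", "United States"}

def autocorrect(value, valid_values, cutoff=0.8):
    """Return the closest valid entry if it is similar enough, else None (reject)."""
    matches = difflib.get_close_matches(value, valid_values, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(autocorrect("Frnace", VALID_COUNTRIES))    # -> 'France'
print(autocorrect("Atlantis", VALID_COUNTRIES))  # -> None (too dissimilar)
```

The cutoff trades false corrections against rejections: set it too low and genuinely bad values get silently rewritten to the nearest valid one.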

axegon_ 18 hours ago

This isn't wildly different from what I've done, but the problem is the sheer volume of crap scattered through all the different fields. The countries are the least of my problems: there are other fields where I'm faced with tens of millions of different combinations, which makes the countries relatively trivial in comparison.

  • didgetmaster 14 hours ago

    My solution will catch a lot of trivial errors like simple misspellings. You could have a relational table with dozens of columns, each with its own dictionary, but that won't catch invalid combinations across columns.

    For example, there is a Paris, Texas, but I doubt there is a London, Texas. Dictionaries of state names and city names would not catch someone pairing the wrong city with the wrong state when entering an address.
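    One hedged sketch of the cross-column idea is a composite dictionary keyed on (city, state) pairs, so the combination itself is validated rather than each column alone (the pair list here is a made-up illustration):

    ```python
    # Illustrative composite dictionary of valid (city, state) pairs.
    VALID_CITY_STATE = {
        ("Paris", "Texas"),
        ("Paris", "Tennessee"),
        ("Austin", "Texas"),
    }

    def check_address(city, state):
        """Reject combinations that per-column dictionaries alone would miss."""
        if (city, state) not in VALID_CITY_STATE:
            raise ValueError(f"unknown city/state combination: {city}, {state}")

    check_address("Paris", "Texas")     # accepted
    # check_address("London", "Texas")  # raises ValueError here, even though
    #                                   # city and state dictionaries might each
    #                                   # accept the values individually
    ```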

    Is this the kind of error you encounter often?