Comment by axegon_ 19 hours ago

Not a data engineer, but my work revolves around processing a ton of data (let's call it partial data engineering). Much of the data I get is entered by humans from different sources, platforms, and countries. My biggest pain in a nutshell: the human factor. Believe it or not, people have managed to misspell "Austria" over 11,000 times (accents, spaces, different encodings, alphabets, languages, null characters, and so on). Multiply that by 250-something countries, then by around 90-100 other fields which suffer from similar issues, then by 2.something billion rows, and you get the picture.
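A minimal sketch of the kind of normalization this problem usually calls for, assuming Python and invented function names (not axegon_'s actual pipeline): strip control characters, decompose accents, and collapse whitespace before comparing values.

```python
import unicodedata

# Hypothetical sketch: canonicalize messy country-name input of the kind
# described above (accents, null characters, stray spaces, mixed case).
def canonicalize(raw: str) -> str:
    # Drop control characters such as embedded NULs (Unicode category "C*").
    cleaned = "".join(ch for ch in raw if unicodedata.category(ch)[0] != "C")
    # Decompose accented characters and discard the combining marks,
    # then collapse whitespace and lowercase.
    decomposed = unicodedata.normalize("NFKD", cleaned)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return " ".join(ascii_only.split()).lower()

print(canonicalize("  Äustria\x00 "))  # -> "austria"
```

This only folds the superficial variants (accents, spacing, encoding junk); genuinely different spellings still need fuzzy matching on top.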

didgetmaster 19 hours ago

I am building a new data management system that can also handle relational data well. It is really good at finding anomalies in data to help clean it up after the fact, but I wanted to find a way to prevent the errors from polluting the data set in the first place.

My solution was to enable the user to create a 'dictionary' of valid values for each column in the table. In your case you would create a list of all valid country names for that particular column. When inserting a new row, it checks to make sure the country name matches one of the values. I thought this might slow things down significantly, but testing shows I can insert millions of rows with just a minor performance hit.
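The approach described above can be sketched roughly like this, assuming a simple dict-of-rows model (the actual system isn't public, so every name here is illustrative). A set membership test is O(1) per column, which is consistent with the small per-row overhead observed.

```python
# Illustrative sketch of per-column dictionary validation on insert.
VALID_COUNTRIES = {"Austria", "Australia", "Belgium"}  # ...and so on

def insert_row(table: list, row: dict, dictionaries: dict) -> bool:
    # Reject the row if any constrained column holds a value that is
    # not in that column's dictionary of valid entries.
    for column, valid_values in dictionaries.items():
        if row.get(column) not in valid_values:
            return False
    table.append(row)
    return True

table = []
insert_row(table, {"country": "Austria"}, {"country": VALID_COUNTRIES})  # accepted
insert_row(table, {"country": "Austira"}, {"country": VALID_COUNTRIES})  # rejected
```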

The next step is to 'auto correct' error values to the closest matching one instead of just rejecting it.
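One plausible way to implement that auto-correct step (a sketch, not the author's actual design) is ratio-based fuzzy matching against the dictionary; Python's `difflib` does this out of the box, and the cutoff below is just a guess that would need tuning.

```python
import difflib

VALID_COUNTRIES = ["Austria", "Australia", "Belgium"]

def autocorrect(value: str, valid: list, cutoff: float = 0.8):
    # Return the closest dictionary entry above the similarity cutoff,
    # or None if nothing is close enough to trust.
    matches = difflib.get_close_matches(value, valid, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(autocorrect("Austira", VALID_COUNTRIES))  # -> "Austria"
print(autocorrect("zzz", VALID_COUNTRIES))      # -> None
```

The cutoff matters: set it too low and "Austria" and "Australia" start absorbing each other's typos, so ambiguous values are safer rejected for manual review than silently corrected.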

  • axegon_ 18 hours ago

This isn't wildly different from what I've done, but it's the sheer volume of crap scattered across all the different fields. The countries are the least of my problems; there are other fields where I'm faced with tens of millions of different combinations, so countries are relatively trivial in comparison.

    • didgetmaster 15 hours ago

My solution will catch a lot of trivial errors like simple misspellings. You could have a relational table with dozens of columns, each with its own dictionary; but that won't catch wrong combinations between columns.

      For example, there is a Paris, Texas; but I doubt there is a London, Texas. Dictionaries of state names and city names would not catch someone's error of matching the wrong city with a state when entering an address.
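The point above can be sketched as validating *pairs* rather than independent columns (all data below is invented for illustration; the real valid-pair set would come from a gazetteer):

```python
# Sketch of cross-column validation: a set of valid (city, state)
# combinations instead of one dictionary per column.
VALID_CITY_STATE = {
    ("Paris", "Texas"),
    ("Austin", "Texas"),
    ("London", "England"),
}

def valid_address(city: str, state: str) -> bool:
    # A per-column dictionary would accept ("London", "Texas") because
    # both values are individually valid; the pair check rejects it.
    return (city, state) in VALID_CITY_STATE

print(valid_address("Paris", "Texas"))   # -> True
print(valid_address("London", "Texas"))  # -> False
```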

      Is this the kind of error you encounter often?