Comment by axegon_
Not a data engineer, but my work revolves around processing a ton of data (let's call it partial data engineering). Much of the data I get is entered by humans from different sources, platforms, and countries. My biggest pain in a nutshell: the human factor. Believe it or not, people have managed to misspell "Austria" over 11,000 times (accents, spaces, different encodings, alphabets, languages, null characters and so on). Multiply that by 250-something countries, then by around 90-100 other fields that suffer from similar issues, then by 2.something billion rows, and you get the picture.
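Just to make those failure modes concrete, here is a minimal sketch (not from the comment itself, purely for illustration) of the kind of normalization pass that catches some of the variants mentioned: stripping null and other control characters, dropping accents, and collapsing whitespace. Alphabet and language variants like "Österreich" would still need an explicit mapping table.

```python
import unicodedata

def normalize_name(raw):
    """Canonicalize a human-entered name: drop control chars, accents, extra spaces."""
    # Remove null characters and other control characters.
    cleaned = "".join(ch for ch in raw if unicodedata.category(ch) != "Cc")
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", cleaned)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Collapse whitespace runs and lowercase for comparison.
    return " ".join(stripped.split()).lower()

print(normalize_name("  Aust\u0000ria "))  # -> "austria"
print(normalize_name("Östérreich"))        # -> "osterreich" (still not "austria" without a mapping)
```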
I am building a new data management system that can also handle relational data well. It is really good at finding anomalies in data to help clean it up after the fact, but I wanted to find a way to prevent the errors from polluting the data set in the first place.
My solution was to enable the user to create a 'dictionary' of valid values for each column in the table. In your case you would create a list of all valid country names for that particular column. When inserting a new row, it checks to make sure the country name matches one of the values. I thought this might slow things down significantly, but testing shows I can insert millions of rows with just a minor performance hit.
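A minimal sketch of that check, assuming a simple in-memory table rather than the system's actual storage engine (the class and method names here are hypothetical): each column can carry a 'dictionary' of allowed values, and an insert is rejected when a value falls outside it. Since the membership test is a hash lookup, the per-row overhead stays small even across millions of inserts, which lines up with the performance observation above.

```python
class ValidatedTable:
    def __init__(self, columns, dictionaries=None):
        self.columns = columns
        # Column name -> set of allowed values; columns without a dictionary are unconstrained.
        self.dictionaries = {col: set(vals) for col, vals in (dictionaries or {}).items()}
        self.rows = []

    def insert(self, row):
        # Reject the row if any constrained column holds a value outside its dictionary.
        for col, allowed in self.dictionaries.items():
            if row.get(col) not in allowed:
                raise ValueError(f"invalid value {row.get(col)!r} for column {col!r}")
        self.rows.append(row)

# Usage: constrain the 'country' column to a list of valid names.
table = ValidatedTable(
    columns=["name", "country"],
    dictionaries={"country": ["Austria", "Germany", "Switzerland"]},
)
table.insert({"name": "Alice", "country": "Austria"})  # accepted
table.insert({"name": "Bob", "country": "Austira"})    # raises ValueError
```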
The next step is to 'auto correct' error values to the closest matching one instead of just rejecting them.
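A rough sketch of how that closest-match step could work, using difflib from the Python standard library as a stand-in for whatever similarity metric the system actually adopts: map a rejected value to the nearest dictionary entry, and give up below a similarity cutoff so wildly wrong values are still rejected rather than silently 'corrected'.

```python
from difflib import get_close_matches

def autocorrect(value, allowed, cutoff=0.8):
    """Return the closest allowed value, or None if nothing is close enough."""
    matches = get_close_matches(value, allowed, n=1, cutoff=cutoff)
    return matches[0] if matches else None

countries = ["Austria", "Australia", "Germany"]
print(autocorrect("Austira", countries))   # -> "Austria"
print(autocorrect("Astria", countries))    # -> "Austria"
print(autocorrect("Atlantis", countries))  # -> None (no sufficiently close match)
```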