To start this blog post, I’d like to assign you a task. Your task is to sort the words below into a list and eliminate duplicates in your final answer. Ready? Here is your list:
The List Is
What is your answer? There are at least three (3) possible correct answers. Answer 1 is just Psychology. You may have determined that this was the word I was trying to spell, and I spelled it wrong several times. If I had spelled it correctly, there would be just one word in the sorted list. Answer 2 is Physicology, Psicology, Psychologie, Psychology, and Pyschology. In this case, you sorted all the words in the list assuming that misspellings were intentional. There were no duplicates to remove. Answer 3 is Is, List, Physicology, Psicology, Psychologie, Psychology, Pyschology, and The. In this case, you assumed that after the instructions any words that appeared were to be sorted and de-duplicated.
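To make the three readings concrete, here is a minimal Python sketch of each interpretation. It is only an illustration: the word list itself is not reproduced above, so the `WORDS` input below is a hypothetical reconstruction from the answers just described, and the function names, the `canonical` word, and the `cutoff` value are all assumptions chosen for the example. The spelling comparison uses `difflib.SequenceMatcher` from the Python standard library.

```python
from difflib import SequenceMatcher

# Hypothetical input: the exact original list is an assumption, rebuilt from
# the spellings named in the three answers.
HEADER = ["The", "List", "Is"]
WORDS = ["Psychology", "Pyschology", "Psicology", "Physicology", "Psychologie"]


def sort_everything(tokens):
    """Answer 3: treat every token, header words included, as data."""
    return sorted(set(tokens))


def sort_after_header(tokens, header):
    """Answer 2: drop the header words and keep each spelling as its own word."""
    return sorted(set(tokens) - set(header))


def collapse_misspellings(tokens, header, canonical="Psychology", cutoff=0.6):
    """Answer 1: treat any spelling close to the canonical word as that word.

    The cutoff is an arbitrary illustrative threshold, not a recommended value.
    """
    result = set()
    for word in set(tokens) - set(header):
        score = SequenceMatcher(None, canonical.lower(), word.lower()).ratio()
        result.add(canonical if score >= cutoff else word)
    return sorted(result)


tokens = HEADER + WORDS
answer_1 = collapse_misspellings(tokens, HEADER)
answer_2 = sort_after_header(tokens, HEADER)
answer_3 = sort_everything(tokens)

for label, answer in (("Answer 1", answer_1), ("Answer 2", answer_2), ("Answer 3", answer_3)):
    print(label, answer)
```

All three functions sort and de-duplicate the same tokens. They differ only in what they assume about the data, and none of those assumptions can be read from the data itself.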
But which is the right answer?
The reality is that there isn’t a clear answer from the data. If I were to clean the data to remove “The List Is”, I could reduce my possible answers to two (2). If I then said that misspellings were to be removed, you might reduce the possible answers to one (1). And if I gave you each word in a sentence where, within the context of that sentence, I clearly intended the meaning to be Psychology, you could reduce the list to one (1) word and definitively give me an answer. But to get to this final state, you had to do a lot.
Would machine learning (ML) have been able to do any better?
The reality is that ML is just an adaptable algorithm that adjusts its flow to the data it is presented with and the guidance (training) that it is given. Then, it applies that algorithm repeatably at great speed, without error, and without tiring of the job.
To go further, let’s explain what is meant by “without error.” The algorithm follows a precise path. If you give it the same data over and over again, it will return the same answer (assuming the algorithm has not changed due to learning). Repeatability, without error, is not a feature of human behavior. Humans will introduce error into processes simply due to human flaws like lack of attention or lack of memory.
In the end, the algorithm, while producing a repeatable answer quickly, is going to be no better at producing a correct or definitive answer than a human. In fact, the ML algorithm might not even flag this data as problematic. If nothing else, a human might say, “I need more context,” or “I need someone else’s help with this.” ML is going to give an answer, and how well that answer is distinguished as ambiguous is an indicator of capabilities and/or limitations of the ML itself.
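As a rough illustration of what "flagging the data as problematic" could look like, the sketch below refuses to return a sorted list when two distinct words are suspiciously similar, and asks for context instead. This is an assumed, simplified approach rather than a description of any particular ML product: the function names, the `cutoff` of 0.8, and the shape of the returned dictionary are all hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def near_duplicate_pairs(words, cutoff=0.8):
    """Return pairs of distinct words whose spellings are suspiciously similar."""
    unique = sorted(set(words))
    pairs = []
    for a, b in combinations(unique, 2):
        score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if score >= cutoff:
            pairs.append((a, b))
    return pairs


def sort_or_ask(words, cutoff=0.8):
    """Sort and de-duplicate, unless the data looks ambiguous."""
    suspects = near_duplicate_pairs(words, cutoff)
    if suspects:
        # The equivalent of a human saying, "I need more context."
        return {"status": "needs_context", "suspects": suspects}
    return {"status": "ok", "answer": sorted(set(words))}


ambiguous = sort_or_ask(["Psychology", "Psychologie"])
clean = sort_or_ask(["Banana", "Apple", "Apple"])
print(ambiguous)
print(clean)
```

A check like this does not decide which spelling is correct. It only surfaces the ambiguity so that a person, or better data, can resolve it.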
Data is the foundation of good results. It is the foundation of good learning. It is the non-negotiable input to every ML algorithm in existence or ever to be in existence.
So, when Seal Learning Services recommends that you spend time, in class or online, learning about the processes and procedures for good data review protocols, we are not doing it simply as a means of selling you more classes. Understanding your data and ensuring its consistency and accuracy is key to everything that Seal does. We want you to succeed, and this is why we do it.
For more information about how Seal Learning Services can assist you, visit https://www.seal-software.com/learning-services.