The Human Side of AI: Why Machine Learning Still Needs People
Authored by: Olga Kokhan
A few years ago, we had a client who was confident their computer vision model was production-ready. The benchmark accuracy looked solid – over 93%. They’d done everything right: clean architecture, a reasonable compute budget, and a proper train/test split. What they didn’t do was test it on the variety of inputs it would see in the real world.
In the first week of annotation review, our team identified over 200 edge cases that the model had never encountered during training. Objects were partially obscured, the lighting was poor, and labels had been applied inconsistently across two annotation batches because the original guidelines were ambiguous. The model was often right, but it was also confidently wrong in ways you could only see if humans looked closely at what it was actually doing.
Since then, we have seen the same pattern in dozens of projects: a capable model, invisible failure modes, and a human catch.
Edge cases don’t announce themselves
The problem with edge cases is that the model does not know they are edge cases. It treats them with the same confidence as it does inputs that it handles well. There isn’t an internal flag that indicates “unusual situation, proceed with caution.” The output appears like any other output.
Here, human reviewers offer something that is difficult to duplicate. When working with data from autonomous vehicles or medical imaging, for example, a skilled annotator does not just label what they see but also recognizes when something doesn’t seem right. They identify the infrequent presentation, the unclear boundary, and the situation that technically falls under one category but probably shouldn’t. They add domain intuition to the model’s training signal, intuition that hasn’t been encoded and frequently can’t be.
One project that I think about a lot was for a financial services client, and it involved document processing. The model worked well on the standard document formats. But a subset of older scanned documents had formatting conventions from a different era – column structures, abbreviations, and layout patterns – that were completely outside the training distribution. The model handled them without error messages. It simply produced systematically wrong extractions. A team member who had previously worked with that specific document type recognized the pattern on the second day. Automated QA would have kept passing those outputs through.
When annotators improve the model itself
But there’s a second, less-talked-about advantage to keeping humans embedded in the annotation workflow: they create the feedback signal that actually makes models better over time.
If annotators are consistently flagging the same edge case category, it’s not noise; it’s a training opportunity. It tells you where the model’s current understanding breaks down and which additional labeled examples you need to fill that gap. The annotation team is not just tagging data; they are diagnosing model behavior in production.
We have intentionally designed our workflows around this loop. Annotators don’t just tag errors; they log them in a way that feeds directly into the next training iteration. In one computer vision project, this produced a meaningful accuracy improvement over three months, not from architectural changes but from systematically feeding the model the edge cases it kept failing on, as identified by the human team.
This is what human-in-the-loop means in reality. Not a safety net to catch the errors after the fact, but an active feedback mechanism to bridge the gap between what the model learned and what the real world keeps producing.
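As a rough illustration of the flag-logging loop described above, here is a minimal Python sketch. The `Flag` record, the category names, and the `min_count` threshold are all hypothetical, chosen only for this example; it is not a description of any particular team's tooling. It tallies annotator flags by category and surfaces the categories that recur often enough to justify collecting more labeled examples.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Flag:
    """One annotator flag. All field names here are illustrative assumptions."""
    item_id: str
    category: str  # hypothetical edge-case label chosen by the annotator
    note: str = ""


def prioritize_categories(flags, min_count=3):
    """Return (category, count) pairs flagged at least min_count times,
    ordered from most to least frequent."""
    counts = Counter(f.category for f in flags)
    return [(cat, n) for cat, n in counts.most_common() if n >= min_count]


# Hypothetical flags logged during one review cycle.
flags = [
    Flag("img_001", "occlusion"),
    Flag("img_007", "low_light"),
    Flag("img_012", "occlusion"),
    Flag("img_020", "occlusion"),
    Flag("img_031", "low_light"),
    Flag("img_044", "ambiguous_guideline"),
]

# Categories that recur are candidates for targeted data collection.
print(prioritize_categories(flags, min_count=2))
```

A recurring category is a signal about where the model needs more examples; a one-off flag may still matter, so the threshold is a judgment call rather than a rule.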
The annotation quality problem most teams overlook
One uncomfortable truth about the AI development process is that teams consistently underinvest in annotation quality and then overspend on fixing the fallout in production. The economics look good on paper – treat labeling as a commodity, keep costs down, and move faster. In practice, inconsistent labels result in inconsistent models, and the debugging cost downstream is much higher than the savings upstream.
Clear guidelines, domain-appropriate reviewers, and structured processes for resolving annotator disagreement are not overhead; they are the foundation the model is built on. Research consistently points to annotation quality as one of the primary drivers of real-world model performance gaps. The real leverage is in the human work that creates the training data.
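One common way to make annotator disagreement visible is an agreement statistic such as Cohen's kappa, which compares how often two annotators agree with how often they would agree by chance. The sketch below is a minimal pure-Python version under simple assumptions (two annotators, one label per item); the label values are made up for illustration, and it is only one possible starting point for a disagreement-resolution process.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def disagreements(labels_a, labels_b):
    """Indices of items the two annotators labeled differently, for manual review."""
    return [i for i, (a, b) in enumerate(zip(labels_a, labels_b)) if a != b]


# Hypothetical labels from two annotators on the same six items.
annotator_1 = ["car", "car", "truck", "car", "truck", "car"]
annotator_2 = ["car", "truck", "truck", "car", "truck", "car"]

print(round(cohen_kappa(annotator_1, annotator_2), 3))
print(disagreements(annotator_1, annotator_2))
```

A low kappa on a batch suggests the guidelines are ambiguous rather than that any one annotator is careless, and the list of disagreeing items gives reviewers a concrete place to start.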
Automation scales what already works
I’m not against automation. Machine learning is about doing at scale what we can’t do as humans. But automation amplifies whatever is already there – including mistakes. The teams I’ve seen succeed with AI deployments are the ones that see human judgment not as a cost to minimize but as the engine that makes the automation trustworthy.
Machines learn patterns. Humans understand context. Both have a role, and pretending otherwise tends to show up in production in ways that nobody enjoys.
Author Bio:
Olga Kokhan – CEO of Tinkogroup, an AI data enablement partner delivering expert annotation, labeling, and validation services that power reliable machine learning models and intelligent systems – fully GDPR-compliant for secure collaboration