Open-End Text Analysis: How to measure quality & compare AIs?

A version of this article was published previously on the Caplena blog.

Humans are not perfect at categorizing text. They make errors. They’re biased. And they develop inconsistencies. Artificial intelligence can help fill in the gaps left by human fallibility, but AI can differ in quality, too. As a result, anyone looking to use artificial intelligence to help analyze their open-ended text should first assess what kind of AI will produce the best results.

In this article, we’ll take a look at a) why humans produce better results when partnering with AI (than they would on their own), and b) how to assess different types of artificial intelligence to get the richest, most actionable insight from your open-ended feedback.

Analyzing Open-Ended Questions

Open-ended questions yield immensely valuable feedback for market research or insight into customer behavior.

For example, let’s say that a company wants to get more insight into how it’s perceived versus how its competitors are perceived. To gain this sort of insight, respondents are asked to score this company and the competition using multiple NPS surveys. To capture the full breadth of areas in which competitors might differentiate themselves, and to keep the feedback as unbiased as possible, each rating is followed up by an open-ended question: “Can you explain your rating?”

The resulting answers allow the company to gain a granular and consistent analysis of its feedback and, ultimately, to uncover the true drivers of customer satisfaction.

The Challenge

The challenge here is, of course, assessing the open-ended feedback. Let’s say this company surveyed 2,000 respondents and asked about 12 different competitors. That yields 2,000 × 12 = 24,000 open-ended answers.

What would taking a manual, human-centered approach look like in this context? Every answer must be read, interpreted, and tagged with relevant topics and sub-topics (which aren’t determined before coding begins).

What’s the problem with this kind of approach?

Human-centered analysis is:

  1. time-consuming
  2. tedious and demanding
  3. expensive

Finally, and most importantly, it’s not the most accurate approach.

The Illusion of the “Perfect” Human Coder

Typically, humans excel at interpreting text. However, they are susceptible to making significant errors for the following reasons:

  • Coders may have different understandings of the topic at hand. If more than one person is working on a project, they may have different opinions about the meaning of an open-ended text. Or, a single coder may misunderstand a topic altogether, or operate with a bias.
  • Coders are prone to error when under stress. When working under stress or extreme time constraints, people are less likely to interpret texts meaningfully and consistently.
  • Coders get tired and may change interpretations over time. The quality of coding suffers when humans get tired. The longer the survey or the more open-ends there are to analyze, the higher the risk for fatigue and changing interpretation.
  • Coders learn more over time. During the analysis and coding process, humans tend to learn about the topics from seeing more and more cases. But these learnings are usually not applied to the answers analyzed early on, which can lead to inconsistent quality of analysis within the same survey.

So, what’s the solution? How can we efficiently analyze open-ended texts with accuracy and natural human insight?

Artificial Intelligence: How to Evaluate

By partnering artificial intelligence with human intelligence, you reduce human error in analysis while still getting rich human insight.

As mentioned above, however, AIs differ in quality. Ideally, they should be compared and assessed to choose a text-analysis tool that will produce the best results.

Don’t rely on accuracy alone.

To compare different tools, testers can feed a sample of responses, together with their human-assigned codes, to each candidate AI in a supervised classification setting. The AI then learns from this sample and applies the coding to a held-out dataset, allowing the tester to verify each tool’s accuracy.
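
To make the setup concrete, here is a minimal sketch in Python using scikit-learn. The answers, codes, and the simple TF-IDF-plus-logistic-regression model are invented stand-ins for whatever coded sample and AI tool you are actually testing; they are not a reference to any specific product.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.preprocessing import MultiLabelBinarizer

# Hand-coded open-ends used to teach the tool (each answer may carry several codes).
train_answers = [
    "The support team responded within minutes",
    "Far too expensive for what you get",
    "Great service but the price is steep",
    "Friendly staff and helpful advice",
    "Good value for the money",
    "Delivery was quick",
]
train_codes = [["service"], ["price"], ["service", "price"], ["service"], ["price"], []]

# Held-out, already-coded answers used to verify the tool's output.
test_answers = ["The staff were rude on the phone", "The price went up again this year"]
test_codes = [["service"], ["price"]]

mlb = MultiLabelBinarizer()              # one binary column per code
y_train = mlb.fit_transform(train_codes)
y_test = mlb.transform(test_codes)

vec = TfidfVectorizer()
X_train = vec.fit_transform(train_answers)
X_test = vec.transform(test_answers)

# One binary classifier per code, mimicking a supervised coding engine.
models = [LogisticRegression().fit(X_train, y_train[:, j]) for j in range(y_train.shape[1])]
pred = np.column_stack([m.predict(X_test) for m in models])

# Per-code accuracy counts every (answer, code) pair as one sample.
print("accuracy:", (pred == y_test).mean())
print("micro F1:", f1_score(y_test, pred, average="micro", zero_division=0))
```

With a real evaluation you would use thousands of coded answers instead of this toy sample, but the procedure is the same: train on part of the coded data, predict the rest, and compare against the human coding.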

However, accuracy is not a sufficient measure of quality.

Accuracy is simply the number of correct samples as a percentage of all samples. But because a text can carry one or more codes, most software counts every code that is correctly assigned or correctly left unassigned as a sample.

For example: with a codebook of four codes where each response should receive exactly one of them, an AI that assigns no code at all is still counted as 75% correct, because three of the four codes have correctly not been assigned.

Datasets are nearly always unbalanced when tagging texts: any given code is absent far more often than it is present. The result? A ridiculously high (and misleading) accuracy score, usually above 98%.
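
A tiny numeric sketch (with invented data) shows how this plays out: an “AI” that assigns no codes at all still reaches roughly 98% per-code accuracy on a 50-code codebook.

```python
import numpy as np

n_responses, n_codes = 1_000, 50
rng = np.random.default_rng(0)

# Ground truth: every response carries exactly one of the 50 codes.
y_true = np.zeros((n_responses, n_codes), dtype=int)
y_true[np.arange(n_responses), rng.integers(0, n_codes, n_responses)] = 1

# A useless model that never assigns any code.
y_pred = np.zeros_like(y_true)

# Per-code accuracy counts every (response, code) pair as one sample.
accuracy = (y_true == y_pred).mean()
print(f"accuracy: {accuracy:.0%}")  # 98% despite zero correct code assignments

# The four-code example above works the same way: three of the four codes
# are "correctly" left unassigned, so the empty prediction scores 75%.
```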

The better method: F1.

To find a text-analysis AI that produces consistent results, use the F1 score instead. F1 is the harmonic mean of precision and recall, two metrics that each capture a different aspect of a classification engine’s quality.

Here’s how it works:

Precision answers the question “How many of the assigned codes are relevant?” It is calculated by dividing the number of correctly assigned codes by the total number of assigned codes.

Recall answers the question “Of the codes that should have been assigned, how many did the AI select?” It is calculated by dividing the number of correctly assigned codes by the total number of codes that should have been assigned.
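
As a worked example with invented numbers: suppose the AI assigned 120 codes in total, 90 of them correctly, while the human reference coding contains 150 codes.

```python
correctly_assigned = 90
total_assigned = 120   # everything the AI tagged
total_expected = 150   # everything that should have been tagged

precision = correctly_assigned / total_assigned    # 0.75 – how many assigned codes are relevant
recall = correctly_assigned / total_expected       # 0.60 – how many relevant codes were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean ≈ 0.67

print(f"precision={precision:.0%}  recall={recall:.0%}  F1={f1:.1%}")
```

Because F1 is a harmonic mean, it stays low unless precision and recall are both high: an engine that tags everything (high recall, low precision) or barely tags anything (high precision, low recall) is penalized.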

What’s a good score for either standard?

Theoretically, precision and recall can each reach 100%, but in practice this never happens.

Instead, you’ll want to consider the following benchmarks:

  • On average, when the same person (a coding professional) annotates the same question twice, the two iterations overlap by around 90%.
  • When two different people code the same survey, they usually achieve an overlap score ranging from the low 50s to around 80%, depending mostly on the complexity of the texts, the number of codes, and the quality of the codebook (see here for a guide on how to create a good codebook).

In the end, the F1 score is stricter and more unforgiving than accuracy alone. As a result, it is a sharper tool for comparing different AI text-analysis tools, as well as for comparing AI against human coding.

Artificial Intelligence & Symbiosis

What have field tests with clients shown?

– AI solutions make coding faster
– AI solutions make coding less expensive
– And finally, AI solutions produce higher-quality, more consistent results

The F1 score has also proven to be a very good indicator of the quality of the codebook being used. Automated open-end coding can help build more meaningful codebooks by pointing out ambiguous and imprecise codes. Low F1 scores often reveal codebooks that are too large, with so many codes that they become difficult to distinguish (for machines and humans alike).

The right text-analysis tools not only increase cost-effectiveness but also ensure consistently high-quality coding. Combining advanced AI with human skills creates a scalable system for open-end coding and produces a more consistent, accurate analysis with less effort.

Finally, an automated text-analysis tool gives the human coder more time to fine-tune the codebook, create highly relevant codes, and derive insights.
