Demystifying Digital Twins: What They Are and How to Get Them Right

By Fairgen

  • article
  • AI
  • Artificial Intelligence
  • Synthetic Data
  • Digital Twins
  • AI Agents
  • Survey Research


The term “digital twin” is everywhere in market research right now. The problem is that it means very different things depending on who you ask. There are numerous ways people in the industry are approaching this, and the differences between them matter a great deal. Some approaches produce results you can act on. Others do not. I want to map out where things stand and explain what I think makes a genuine digital twin work.

Learn more by watching or listening to Samuel on the Founders and Leaders Series podcast here:


Starting With a Framework

Before getting into twins specifically, it helps to understand the broader space they sit in. I think about synthetic data approaches across two dimensions: methodology and stakes.

On the methodology side, you have a spectrum that runs from purely exploratory and directional research at one end to foundational research at the other. Foundational research includes areas such as segmentations and brand trackers. Directional research covers faster, iterative work where you are trying to get a read on something quickly rather than build a definitive picture.

On the stakes side, you have decisions that are relatively low-risk at one end, and very high-stakes at the other. Should we invest ten million dollars to expand into a new market? That is a high-stakes decision. Testing a rough concept in an early innovation sprint is not.

Different methodologies are appropriate for different combinations of stakes and research type. Digital twins, in my view, belong at the directional end of the spectrum, where they have significant potential to help teams test ideas, iterate quickly and move forward without waiting six weeks for results.
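
To make the framework concrete, here is a minimal sketch in Python of how the two dimensions could be expressed. The enum values and the mapping are illustrative only; they reflect my reading of the grid, including the boosting approach I come back to later, not a formal taxonomy.

```python
from enum import Enum

class Methodology(Enum):
    DIRECTIONAL = "directional"    # fast, iterative, exploratory reads
    FOUNDATIONAL = "foundational"  # segmentations, brand trackers

class Stakes(Enum):
    LOW = "low"    # e.g. testing a rough concept in an innovation sprint
    HIGH = "high"  # e.g. a ten-million-dollar market-entry decision

def suggested_approach(methodology: Methodology, stakes: Stakes) -> str:
    """Illustrative placement of approaches on the methodology-by-stakes grid."""
    if methodology is Methodology.DIRECTIONAL:
        return "digital twins: fast, iterative, directional reads"
    if stakes is Stakes.HIGH:
        return "boosting: augment real survey results at the segment level"
    return "conventional fieldwork or boosting, depending on timelines"

print(suggested_approach(Methodology.DIRECTIONAL, Stakes.LOW))
```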

Why Fully Synthetic Panels Do Not Work

At one end of the spectrum, you have what some people call fully synthetic panels. The basic idea: create a panel of a thousand demographically profiled respondents, and then prompt a large language model to answer survey questions as if it were each of those people.

This does not work, and the reason is fairly simple. Large language models like ChatGPT or Claude are trained on the whole internet. They are, in a very literal sense, averaging machines. The variance in their answers at the individual level is essentially zero. You ask them a question, and you get back an average answer, because that is what they are optimised to produce. If you are trying to understand how different consumer segments actually think and feel, then averaging is exactly the problem you need to solve, not a feature.
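
To make the failure mode concrete, here is a minimal sketch of the fully synthetic panel pattern. The llm_complete function is a hypothetical stand-in for whatever chat-completion call a vendor might use, not a real API, and the personas shown are placeholders.

```python
def llm_complete(prompt: str) -> float:
    """Hypothetical stand-in for an LLM call returning, say, a 1-10 purchase-intent score."""
    raise NotImplementedError("placeholder for a real chat-completion API")

personas = [
    {"age": 24, "gender": "female", "region": "London"},
    {"age": 58, "gender": "male", "region": "Manchester"},
    # ...and 998 more demographically profiled "respondents"
]

def fully_synthetic_panel(question: str) -> list[float]:
    scores = []
    for p in personas:
        prompt = (
            f"Answer as a {p['age']}-year-old {p['gender']} from {p['region']}. "
            f"{question}"
        )
        scores.append(llm_complete(prompt))
    return scores

# The failure mode: because the model is optimised to produce the average answer,
# the spread of scores across personas is close to zero. Every "segment" ends up
# looking the same, which is exactly what makes the output unusable.
```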

What a Genuine Digital Twin Requires

At the other end of the spectrum is what I believe is the right approach: to anchor every twin to a real person.

A twin should be one-to-one with a real individual, built on data that has actually been collected from that person. Not just age, gender, and region, but meaningful behavioural and attitudinal data specific to the category you are researching. If you want to test a new soft drink concept, you need twins whose underlying data covers how those real people think and behave within the soft drink category. A 30-question general profiling survey will not tell you whether someone prefers Pepsi or Coke, how much they spend on soft drinks, or how they make decisions in that category.

So you have to create audiences that are built at the category level, sometimes even at the subcategory level. You collect what you need from real people, at the depth that category demands, and then you build your twins on that foundation.
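
As a rough illustration of what "anchored to a real person at the category level" might look like as data, here is a sketch. The field names are my own, chosen for illustration; they do not describe any particular platform's schema.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class CategoryProfile:
    category: str                        # e.g. "soft drinks"
    brand_preferences: dict[str, float]  # e.g. {"Pepsi": 0.6, "Coke": 0.4}
    monthly_spend: float
    purchase_drivers: list[str]          # how this person decides in the category
    collected_on: date                   # when the primary data was gathered

@dataclass
class DigitalTwin:
    respondent_id: str                   # one-to-one link to a real individual
    demographics: dict[str, str]         # age, gender, region, ...
    category_profiles: dict[str, CategoryProfile] = field(default_factory=dict)

    def can_answer_about(self, category: str) -> bool:
        # A twin should only be asked about categories where its underlying
        # primary data actually covers that person's behaviour and attitudes.
        return category in self.category_profiles
```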

The Data Challenge

This approach is obviously more expensive and more complex than prompting an LLM with demographic variables. It requires serious data collection, data orchestration, and processing. It also requires strong partnerships with data companies that can provide the coverage you need across categories and markets.

And there is a maintenance question that always comes up when I explain this approach: how often do you need to refresh the data?

The honest answer is: often. Consumer attitudes and behaviours shift. A twin built on data from 12 months ago may not reflect where that person or that category is today. My recommendation is to refresh primary data quarterly. On top of that, you can incorporate secondary data sources such as clickstream data, transactional data, and live news data to keep the model as current as possible between primary collection cycles. But you still need the primary data at the core.
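
Here is a minimal sketch of that refresh rule, assuming a 90-day interval as a stand-in for "quarterly"; the function name and threshold are illustrative.

```python
from datetime import date

REFRESH_INTERVAL_DAYS = 90  # roughly quarterly, per the recommendation above

def primary_data_is_stale(collected_on: date, today: date | None = None) -> bool:
    """Flag a twin whose underlying primary data is due for re-collection."""
    today = today or date.today()
    return (today - collected_on).days > REFRESH_INTERVAL_DAYS

# Secondary sources (clickstream, transactional, live news) can be layered in
# between collection cycles, but they supplement the primary data rather than
# replace it.
print(primary_data_is_stale(date(2025, 1, 1), today=date(2025, 6, 1)))  # True
```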

Where Digital Twins Fit in the Research Workflow

One of the most useful things about digital twins, done properly, is the speed they offer for directional and exploratory work. In innovation research in particular, it is genuinely painful to run one round of fieldwork, wait weeks for results, and then have to go back and do it again. Twins allow you to test ideas and iterate at a pace much closer to what product and marketing teams actually need.

That said, I want to be clear about what twins are not suited for. If you are running foundational research at high stakes, such as segmentations, brand trackers, or work that will inform a major investment decision, twins are not the right primary tool on their own. Boosting, which uses synthetic data to augment real survey results at the segment level, is a better fit for that kind of foundational work. The two approaches belong in different parts of the grid.

The Infrastructure Behind Category-Level Twins

Building genuine category-level twins at scale requires access to large volumes of high-quality primary data across many different categories. Most organisations cannot produce that data alone. This points to a marketplace model, in which data vendors contribute category-specific datasets that can be used to construct twin audiences.

The way I think about it is that there is a public layer, where vendors can upload their data and make category-specific twin audiences available to researchers who need them, and a private layer, where enterprise teams can bring in their proprietary data and build twins within a controlled environment. Both layers use the same underlying approach. The difference is who owns and controls the data.
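
As a sketch of how the two layers share one approach and differ only in who can access the data, here is one way the distinction could be modelled. The class and field names are hypothetical, not a description of an actual marketplace API.

```python
from dataclasses import dataclass
from enum import Enum

class Layer(Enum):
    PUBLIC = "public"    # vendor-contributed, category-specific twin audiences
    PRIVATE = "private"  # proprietary enterprise data in a controlled environment

@dataclass
class TwinAudience:
    category: str
    owner: str    # the vendor or enterprise that owns and controls the data
    layer: Layer

    def accessible_to(self, organisation: str) -> bool:
        # Same underlying approach in both layers; only access differs.
        # Public audiences are open to any researcher, private ones stay
        # with the organisation that owns the underlying data.
        return self.layer is Layer.PUBLIC or self.owner == organisation
```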

This kind of infrastructure is what makes rigorous twin-based research repeatable and scalable. Without it, every twin programme becomes a bespoke project built from scratch, which is slow and expensive. With it, the barrier comes down significantly, and the methodology becomes accessible to a much wider range of organisations.


Where Things Are Heading

There is a lot of noise in this space right now, and that creates a lot of confusion for researchers trying to evaluate their options. The core question to ask of any twin-based solution is simple: is each twin anchored to a real person, and is that person’s data collected at the category level? If the answer is no, the approach will not produce the variance needed to make the results useful.

The potential is real. Digital twins, built properly, can help teams move faster, test more ideas, and reach decisions with greater confidence. But the methodology only works if the data foundation is right.
