
Why Synthetic Personas Work Best When Built on Proprietary Data
By Verve
There is a growing body of opinion in the research industry about synthetic personas – some positive, some deeply sceptical. Having spent the past three and a half years building and testing them with over 30 clients, I want to offer a more grounded perspective: synthetic personas can work well, but only when built on the right data. The nature of that data matters more than almost anything else.
Learn more by watching or listening to Andrew on the Founders and Leaders Series podcast here:
Episode 9: Andrew Cooper, Founder & CEO, Verve
The Data Question is Not About Volume
One of the most common misconceptions I encounter is that synthetic simulations require large datasets to function reliably. That is not my experience. What they require is high-quality data – depth rather than breadth. A rich, well-structured dataset of qualitative interviews will take you further than a thin dataset of thousands of survey responses. This is actually quite refreshing in a world that has long equated scale with rigour.
When we build a simulation, we are careful about what we put in. The process takes four to six weeks, and that is deliberate. During that time, we are not simply loading data – we are validating, checking against hold-out data that what the simulation produces reflects what real human beings actually say when asked the same questions. We target a correlation of 0.9 between simulation outputs and real human responses, and we have now completed over a hundred tests at that level. That validation process takes time, and it should.
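As a rough illustration of what that kind of hold-out check involves (the function and the example scores below are hypothetical, not Verve's actual pipeline), the core step is simply correlating simulated answers with real answers on questions held back from the build:

```python
import numpy as np

def holdout_correlation(simulated, observed):
    """Pearson correlation between simulated and real responses,
    paired by question, on hold-out data (illustrative sketch)."""
    sim = np.asarray(simulated, dtype=float)
    obs = np.asarray(observed, dtype=float)
    return np.corrcoef(sim, obs)[0, 1]

# Hypothetical hold-out data: mean agreement score per question
simulated = [4.1, 3.2, 2.8, 4.5, 3.9, 2.1]
observed  = [4.0, 3.4, 2.6, 4.4, 3.7, 2.3]

r = holdout_correlation(simulated, observed)
print(f"hold-out correlation: {r:.2f}")
if r < 0.9:
    print("below the 0.9 target - keep refining before release")
```

The real validation work is in designing the hold-out questions and sample, not the arithmetic; the point of the sketch is only that the check is quantitative and repeatable.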
Why Proprietary Data Changes the Outcome
The most important factor in building a simulation that produces genuinely useful insight is whether the underlying data is proprietary to the client. If you build a simulation using data sources that any competitor could also access, you will learn roughly what they learn. The simulation becomes a replication of widely available knowledge rather than a reflection of your specific customers.
Proprietary client data – whether from a customer community, CRM systems, qualitative interview archives or other primary research – provides a starting point that is already trusted, relevant to the client’s real customers, and differentiated. That is where competitive advantage in insight comes from.
This does not mean community panels are the only route in. Many agencies hold years of primary research data from client programmes. Qualitative interviews conducted over multiple years, even if the original brief was narrow, contain a great deal of relevant information that can seed a simulation. The insight that was left on the cutting-room floor because it was off-topic at the time – the things respondents mentioned but that were not in scope for the report – can still feed the simulation. Nothing is wasted.
Curated Sources and the Role of Augmentation
Once a strong proprietary seed dataset is in place, it can be augmented with curated external sources. This is where a significant part of the technical work happens – determining which external data is relevant, how it should be weighted, and how it interacts with the proprietary data to produce a simulation that is both rounded and accurate.
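One minimal way to picture that weighting step (a sketch under my own assumptions, not a description of Verve's actual method) is a sampler that assembles context for the simulation while favouring proprietary records over curated external ones:

```python
import random

def sample_context(proprietary, external, k=4, external_weight=0.25, seed=None):
    """Draw k records for simulation context, weighting proprietary
    data more heavily than curated external data (hypothetical scheme)."""
    rng = random.Random(seed)
    pool = [(rec, 1.0) for rec in proprietary] + \
           [(rec, external_weight) for rec in external]
    records = [rec for rec, _ in pool]
    weights = [w for _, w in pool]
    return rng.choices(records, weights=weights, k=k)

# Hypothetical record labels
proprietary = ["community verbatim A", "CRM note B", "interview excerpt C"]
external = ["trend report snippet X", "public survey finding Y"]

context = sample_context(proprietary, external, k=4, seed=7)
```

In practice the weighting would be learned or tuned per source during validation rather than fixed by hand; the sketch only shows where the "curated, not indiscriminate" principle bites.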
The key word here is curated. Not all external data is useful, and indiscriminate augmentation is one of the ways simulations go wrong. Pointing AI at the internet and asking it to produce a persona yields something that may feel directionally plausible but cannot be validated or trusted for decisions of consequence. It may be adequate for low-stakes marketing decisions. It is not adequate for board-level choices.
The Importance of Being Auditable
One legitimate concern about synthetic data in general is the black-box problem. You receive an output, but you cannot trace how it was produced, what inputs drove it, or where it might be wrong. This is a reasonable concern, and I think the research industry is right to take it seriously.
Our approach is to be auditable at every stage. We know what data went in, how the simulation was validated, what it was trained for, and where the boundaries of its reliable output lie. That auditability is what allows a simulation to be used for decisions of consequence – the kind of decisions that reach the boardroom – rather than just directional marketing guidance.
There is also a maintenance dimension. Simulations need to be fed with new primary data as client relationships evolve. A simulation built on research conducted five years ago will drift from the current reality of customers’ views and behaviours unless it is updated. The process of refreshing and revalidating is ongoing, and clients need to understand that from the outset.
A Note on What Simulations are Not
It is worth being clear about the limits. Simulations are not a replacement for all research. They complement it, handling tasks that are onerous, expensive, or logistically difficult to perform with real human respondents. The real human being remains essential – for depth, validation, and qualitative exploration that cannot be replicated artificially. The simulation remains in the room after the respondent leaves. It handles the forty-five-minute conjoint that no cloud-computing expert would sit through for a twenty-dollar Amazon voucher. It handles the tactical work so the real person can do the meaningful work.
What simulations require to do that job well is the same thing any research method requires: a high-quality, relevant, trusted dataset as its foundation. Without that, the outputs will reflect the limitations of the inputs. With it, they can produce insight that is reliable, auditable and genuinely useful for decisions.
The conversation in our industry about synthetic data tends to get stuck on whether it works at all. The more useful question is: under what conditions does it work, and what does it need to work well? The answer, in my experience, starts with the data.