
How to Use Synthetic Data Without Fooling Yourself or Your Clients

Trust, Governance, and Disclosure 

Nobody cares where a number came from until the number is wrong.  

That is the trust problem with synthetic data. If it is not labeled, not explained, and not governed, teams will treat it like any other number and act on it. Then, the moment it turns out to be just one version of the truth, trust drops fast. Not because synthetic is automatically wrong, but because synthetic can be easy to over-trust, easy to overclaim, and easy to hide in plain sight if teams do not put clear rules around how it is generated, validated, and reported. 

That is why the real conversation is not only about whether synthetic data can be useful. It is about what “responsible synthetic” looks like in day-to-day research operations.  

In a conversation with James “JT” Turner, Founder and CEO at Delineate, the emphasis kept returning to that operational reality. Synthetic data can be a good augment to natural data, but it cannot replace it. And the moment synthetic starts pretending to be natural, it becomes dangerous. 

What Clients Actually Want

Most senior stakeholders are not asking for more complexity. They want usable answers. If synthetic data is part of the workflow, they want to know what it is doing, what it informs, what it does not inform, and what checks sit behind it.

What becomes hard to recover from is when synthetic data is not disclosed properly, or when it is presented in a way that makes it look like natural data.  

“How you’ve collected the data, the methodology should always be available to clients and should always be communicated in the delivery of the data,” says JT. 

If synthetic data is being used, that use should be stated outright, not implied or left for the reader to assume.  

“We need to be explicit,” he says. “We need to say where and if synthetic data has been used and why.” 

That is a trust issue more than a technical issue. A stakeholder can handle complexity when it is explained, and they can even accept uncertainty when the boundaries are clear. What they cannot accept is surprise. 

One useful way to make that land is to call it what it is. It is modeled data. That wording draws attention to the fact that it should not be trusted in the same way as natural data, and it pushes the right follow-up questions. 

JT also points to three red flags to watch for. 

The first red flag is when synthetic data is not marked as such in reporting. If someone cannot tell whether they are looking at natural data, natural plus synthetic, or synthetic data only, then the work is already on the wrong side of disclosure. 

The second red flag is structural: there is no validation or governance process for testing synthetic outputs against real data. This is the difference between a model being a tool and a model being a gamble. If there is no process to compare synthetic outputs to natural data, you do not have synthetic “insight.” You have synthetic output. 

The third red flag is one that can quietly destroy a system over time. Models are trained on their own outputs. This is how you end up with models that agree with themselves, drift away from reality, and still look good in internal testing. 

Those three red flags map directly to the most common ways synthetic causes damage: 

  • Undisclosed augmentation 
  • Unvalidated confidence 
  • Self-reinforcing loops 
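To make the second of those concrete, here is a minimal sketch of what a validation gate can look like before synthetic output gets reported as insight. It assumes survey results arrive as pandas DataFrames; the column handling, the Kolmogorov-Smirnov test, and the threshold are illustrative choices of ours, not a description of any vendor's actual process.

```python
# Minimal validation gate: compare synthetic output to a natural baseline
# before it is reported. Columns, test, and threshold are illustrative.
import pandas as pd
from scipy.stats import ks_2samp

def validate_against_natural(natural: pd.DataFrame,
                             synthetic: pd.DataFrame,
                             columns: list[str],
                             alpha: float = 0.05) -> dict:
    """Flag columns where the synthetic distribution drifts from natural."""
    report = {}
    for col in columns:
        stat, p_value = ks_2samp(natural[col].dropna(),
                                 synthetic[col].dropna())
        report[col] = {
            "ks_statistic": round(float(stat), 3),
            "p_value": round(float(p_value), 3),
            # A low p-value means the distributions differ: investigate
            # before shipping, rather than reporting the column as-is.
            "comparable": p_value >= alpha,
        }
    return report
```

A gate like this does not prove the synthetic data is right; it only catches the cases where it is visibly wrong, which is exactly the difference between a tool and a gamble.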

When the Model Collapses

Sometimes the biggest problems begin with something that sounds harmless. Is that the case with feedback loops? 

“Yes, we should be worried about model collapse and feedback loops,” says JT. 

If synthetic outputs feed back into training data without clear provenance and filters, you risk what JT calls “distributional drift and overconfidence in the data, like an echo chamber where everyone’s patting themselves on the back, and it all looks good.”  

The output looks tidy, but the system can start converging to overly smooth trends that underrepresent the natural volatility or change in the data. Relationships can weaken, correlations can “dilute or wash out the signal,” and self-referential validation creeps in: models agreeing with themselves instead of their outputs being treated as fresh data to test and control. 

In other words, the system can start to optimize for agreement, not accuracy. It can look stable, and stability can feel reassuring, but it can be the opposite of what you need when markets are moving. 

Synthetic data cannot be allowed to become a closed loop. You need provenance, filters, and a way to stop synthetic data from quietly becoming its own training set.
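One way to enforce that in code is to tag every record with its provenance at creation time and filter training inputs on that tag. A minimal sketch, assuming each record is a dict carrying a source field; the field name and its values are illustrative assumptions, not Delineate's schema.

```python
# Provenance guard: rows without natural provenance never re-enter training.
# The "source" field and its values are illustrative assumptions.
from typing import Iterable

ALLOWED_TRAINING_SOURCES = {"natural"}  # never "synthetic" or "augmented"

def training_rows(records: Iterable[dict]) -> list[dict]:
    """Keep only records whose provenance permits use in model training."""
    kept, excluded = [], 0
    for record in records:
        if record.get("source") in ALLOWED_TRAINING_SOURCES:
            kept.append(record)
        else:
            excluded += 1  # synthetic or unknown provenance stays out
    if excluded:
        print(f"Excluded {excluded} records without natural provenance")
    return kept
```

Note that records with missing provenance are excluded too: if you cannot prove a row is natural, the safe default is to keep it out of the training loop.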

Why Research Culture Matters More Than the Interface

JT draws a distinction between research companies that have grown with technology, embracing AI and synthetic data as an augment to their existing capabilities, and model or technology businesses that are looking for a market research use case. 

That distinction matters because market research is not just data generation. It is methodology, standards, production discipline, and a habit of being critical about what the data says. JT points out that the industry is “perfectly set up” for understanding, testing, control, feedback loops, and being critical. His worry is that clients will be wowed by “pretty front ends and a claim of accuracy” that is not as validated as it appears when you dig into it. 

This is where governance becomes the real product, not a flashy feature set. 

As an MRS company partner, JT points to the guidelines of the Market Research Society and the international standard ISO 20252, which translate governance into delivery requirements. Methodology should always be available to clients and communicated in delivery, and synthetic data usage should be explicit, answering questions like these (one way to capture the answers is sketched after the list): 

  • Where did the data come from? What is natural, what is synthetic, and what is augmented or modeled? 
  • How was the synthetic data generated? What assumptions are embedded? 
  • What has been used to validate it, and how often is it updated? 
  • What decisions does it inform? 
  • What are the risks, benefits, and limitations? 
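As mentioned above, one way to force those answers into every delivery is a disclosure record that travels with the dataset. A minimal sketch; the field names are our own mapping of the checklist, not an MRS or ISO 20252 artifact, and the example values are invented.

```python
# A disclosure record that travels with a deliverable, answering the
# checklist above. Field names and example values are illustrative.
from dataclasses import dataclass

@dataclass
class SyntheticDisclosure:
    provenance: str          # "natural", "synthetic", or "augmented"
    generation_method: str   # how synthetic rows were produced
    assumptions: list[str]   # what the generation takes for granted
    validated_against: str   # the natural baseline used for checks
    refresh_cadence: str     # how often validation is rerun
    informs: list[str]       # decisions this data may support
    limitations: list[str]   # known risks and limitations

disclosure = SyntheticDisclosure(
    provenance="augmented",
    generation_method="model-based augmentation of a low-incidence cell",
    assumptions=["similar segments respond similarly"],
    validated_against="natural fieldwork, most recent wave",
    refresh_cadence="every new natural wave",
    informs=["directional sizing", "early reads"],
    limitations=["not for precise incidence estimates"],
)
```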

That is the operational definition of “responsible synthetic,” and a practical way to protect trust. It also forces internal clarity. If a team cannot explain what synthetic data is doing, it probably does not fully understand it. 

What Good Governance Looks Like in Practice

If synthetic data is being used responsibly, it should show up in the reporting itself, not in a private note somewhere. 

JT describes exactly how this can look in practice. In a survey context, reporting might be in a BI tool, dashboard, or PowerPoint, with footnotes or measures named so you can see whether it is natural data, natural plus synthetic, or just synthetic. That labeling is the first layer of trust, preventing silent substitution. 

The second layer is interpretation support. As you select the data, if possible, the pros and cons should be shown. That is a subtle but important point. The goal is not to disclose and walk away, but to help the client understand what changes when synthetic data is involved. 

The third layer is decision hygiene. If an analysis has been enabled through augmentation of natural data, it should be described as augmented, and further analysis may be required. That tells the client what they can safely do with this output, and what they should not overclaim. 

Then there is a final layer that is rarely discussed but highly practical when working with small audiences (low-incidence groups). When you do not yet have enough real respondents, you can augment the dataset with synthetic data so the analysis can run now, then return once enough natural responses have been collected and rerun the analysis on natural data alone. That way, you can check whether you would have made the same decision if you had waited for real data. 

This is what JT calls a “synthetic footnote”: treating synthetic data as a temporary support in the early stage to move faster when you must, then coming back with natural data to confirm. 
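As a sketch of that workflow: run the analysis on the augmented set now, rerun it on natural data alone once fieldwork catches up, and check whether the decision holds. Everything here, from the function names to the purchase-intent threshold, is an illustrative assumption, not Delineate's implementation.

```python
# The "synthetic footnote" as a workflow: decide early on augmented data,
# then confirm on natural data alone. Names and thresholds illustrative.
import pandas as pd

def run_analysis(df: pd.DataFrame) -> str:
    """Stand-in for the real analysis; returns the decision it implies."""
    return "launch" if df["purchase_intent"].mean() > 0.6 else "hold"

def synthetic_footnote(natural: pd.DataFrame,
                       synthetic: pd.DataFrame) -> dict:
    early = run_analysis(pd.concat([natural, synthetic], ignore_index=True))
    confirmed = run_analysis(natural)  # rerun once natural n is sufficient
    return {
        "early_decision": early,          # label as "augmented" in reporting
        "confirmed_decision": confirmed,  # natural data only
        "decision_stable": early == confirmed,
    }
```

The comparison at the end is the footnote's payoff: if the early and confirmed decisions diverge, that divergence is itself a finding worth reporting.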

Where AI Starts to Get Interesting

Synthetic data is often marketed as a privacy solution.  

More broadly, yes, synthetic data can help with privacy issues, but market research has long been built around anonymization of participants and has had privacy controls since its beginning.  

Where synthetic may help more inside the research ecosystem is further up the supply chain, where panelist data includes sensitive attributes beyond basic demographics.  

If a client is told “synthetic solves privacy,” the right follow-up is “in which part of the workflow, exactly?” The answer is rarely universal. 

Why Disciplined Workflows Are Easier When the Data Is Ready

A big reason synthetic data gets used poorly is that organizations struggle to operationalize validation, updating, and labeling. When data is stuck in silos, formats are legacy, and analytics happens in disconnected environments, governance becomes a promise rather than a practice. 

JT describes Delineate Proximity® as an end-to-end platform that covers survey generation, deployment, collection, processing, and storage. From the beginning, it was designed with a data lake, keeping survey data in modern formats accessible to modern tools, including data science applications. 

It becomes easier to process and augment survey data for use cases like synthetic, analytics, models, and segmentations because the data is pre-prepped, ready to go, available via API, and in a workspace that can run models. 

That matters because the mechanics of responsible synthetic data depend on the mechanics of data access: 

  • You cannot validate well if you cannot connect synthetic outputs to natural baselines.  
  • You cannot update regularly if model training is a heavy lift every time.  
  • You cannot clearly disclose if your reporting layer cannot distinguish between natural and augmented. 

This is also where the AI Sandbox fits. JT describes it as a way to experiment on client-specific projects and collaborate with clients to get additional value from a dataset. That can include looking for new segments, running sophisticated analysis, creating bespoke synthetic data, or combining data with another dataset. 

Again, the point is process. A sandbox is useful when it is a controlled environment for testing, validation, and collaboration, not a black box that produces answers. 

The Line That Should Not Move

All of this comes back to a simple boundary. Synthetic data is an augment to natural data and should always be treated as such. 

“Delineate will never use synthetic data exclusively,” says JT. “We will never stop testing, controlling, and learning from it, but we will also never fail to disclose.” 

That is what a responsible synthetic posture looks like in a sentence. At the end of the day, the point is not to pretend uncertainty does not exist. It is to make the uncertainty visible and manageable, so synthetic data stays useful and trust stays intact. 

Related reading 

Synthetic Data in Market Research: An Expert View on Why Natural Data First Still Wins 

Where Synthetic Data Breaks First: Time, Novelty, and Bias in Market Research 
