Sharing Ethically Fabricated Data
Last Updated 14 January 2026
DESCRIPTION
In many instances, sharing an original raw or processed dataset may be impossible, for example because participants must be protected in contexts where reidentification carries heightened risks (Burgard et al., 2017, 235; Markham, 2012, 342; Yim & Schwartz-Shea, 2022, 234). While anonymisation is one response, it may not be possible or practicable in all cases: anonymised data can be reidentified, especially where it is linked with one or more public datasets (Liu et al., 2025, 1055; Quintana, 2019, 1), and anonymisation can limit the usefulness of a dataset, as when 'removing demographic information directly results in the inability to study these variables' (Wang et al., 2025, 774). In the context of qualitative research, anonymisation may fail to protect internal confidentiality where participants are known to one another (Yim & Schwartz-Shea, 2022, 230). These risks are heightened in an era of increasingly networked data proliferation, with the result that, as Saunders et al. state, 'it is no longer necessary to know people's names to assemble a rich profile of their lives and thereby identify their names via jigsaw identification' (2015, 125).
In such contexts, ethically fabricated data may be one solution, enabling a dataset that reflects the features of the original data to be shared openly for onward use while minimising the risk of disclosure. We derive the term 'ethically fabricated data' from Markham (2012), for whom fabrication is 'a sensible and ethically grounded solution for protecting privacy in arenas of shifting public/private contexts' (341), and consider it as a means to facilitate data exploration within the boundaries of limitations on sharing. We focus here on two types of such data: synthetic and composite.
Synthetic data is defined by Quintana as algorithmically generated artificial data that 'mimics an original dataset by preserving its statistical properties and relationships between variables' (2019, 1). An antecedent to the practice of sharing synthetic research data is the use of synthetic data to make sensitive administrative data more widely available for research, with one example being the synthesising of otherwise highly restricted datasets used in UK Longitudinal Studies (Nowok et al., 2017). Within and around the social sciences, the sharing of synthetic data has been discussed in relation to educational data (Liu et al., 2025) and organisational science (Wang et al., 2025), as well as specific data types (see, for example, Kamelski and Olivos's (2025) discussion of the use of AI replicas in image-based research).
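To make the idea of preserving 'statistical properties and relationships between variables' concrete, the following is a minimal sketch rather than a recommended workflow. It assumes a dataset with just two numeric variables, fits a bivariate normal distribution to it, and draws artificial records from that fit. The function names (`fit_bivariate_normal`, `synthesise`) and the toy "hours" and "scores" values are invented for illustration. This simple parametric approach offers no formal privacy guarantee, and real projects would more likely use a dedicated tool such as the synthpop package described by Nowok et al. (2017).

```python
import math
import random
import statistics


def fit_bivariate_normal(xs, ys):
    """Estimate means, standard deviations and the Pearson correlation."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sd_x, sd_y = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return {"mean_x": mean_x, "mean_y": mean_y,
            "sd_x": sd_x, "sd_y": sd_y, "rho": cov / (sd_x * sd_y)}


def synthesise(params, n, rng):
    """Draw n artificial (x, y) records from the fitted distribution."""
    rho = params["rho"]
    records = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = params["mean_x"] + params["sd_x"] * z1
        # Mixing z1 into y reproduces the fitted correlation rho.
        y = params["mean_y"] + params["sd_y"] * (rho * z1 + math.sqrt(1 - rho ** 2) * z2)
        records.append((x, y))
    return records


rng = random.Random(42)

# Toy stand-in for a confidential dataset (invented values, not real data).
hours = [rng.gauss(10, 2) for _ in range(200)]
scores = [50 + 3 * h + rng.gauss(0, 4) for h in hours]

original = fit_bivariate_normal(hours, scores)
synthetic = synthesise(original, 5000, rng)
resynth = fit_bivariate_normal([x for x, _ in synthetic], [y for _, y in synthetic])

print(f"original rho:  {original['rho']:.2f}")
print(f"synthetic rho: {resynth['rho']:.2f}")
```

No synthetic record corresponds to any original record, yet the means, spreads and correlation are approximately retained. Anything the fitted model does not capture, such as non-linear relationships or skewed distributions, is lost, which is one reason the choice of generation method depends on the intended use.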
Synthetic data may be created by a variety of methods. These include statistical methods such as Bayesian networks and deep learning methods such as Generative Adversarial Networks (GANs). When combined with differential privacy (DP) mechanisms, which achieve 'strong privacy guarantees by adding noise' (Liu et al., 2025, 1055), synthetic datasets can provide high privacy levels (De Cristofaro, 2023, 6), albeit with some degree of trade-off between utility and privacy in some instances (Liu et al., 2025, 1069) and potential underrepresentation of minorities due to DP mechanisms' necessary suppression of outliers (Jordon et al., 2022, 25). In general, Wang et al. emphasise that 'for synthetic data to serve as a reasonable substitute for the original data, researchers need to determine their intended use, evaluate the efficacy and effectiveness of synthetic data generation techniques closely, and [...] uphold ethical, privacy, and proprietary standards' (2025, 772).
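The trade-off between privacy and utility can be illustrated with the Laplace mechanism, a standard textbook DP technique. This is a sketch chosen for illustration only: the sources cited above do not state that they use this particular mechanism, and in DP synthetic data generation the noise is usually applied within model fitting rather than to a single released count. The helper names (`laplace_noise`, `noisy_count`, `mean_abs_error`), the toy ages and the epsilon values are all assumptions.

```python
import random
import statistics


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noisy_count(values, predicate, epsilon, rng):
    """Release a count with Laplace noise of scale sensitivity / epsilon."""
    true_count = sum(1 for v in values if predicate(v))
    sensitivity = 1  # adding or removing one person changes a count by at most 1
    return true_count + laplace_noise(sensitivity / epsilon, rng)


def mean_abs_error(values, predicate, epsilon, rng, trials=2000):
    """Average distance between the noisy and the true count over many releases."""
    true_count = sum(1 for v in values if predicate(v))
    return statistics.fmean(
        abs(noisy_count(values, predicate, epsilon, rng) - true_count)
        for _ in range(trials)
    )


def is_over_65(age):
    return age > 65


rng = random.Random(7)
ages = [rng.randint(18, 80) for _ in range(500)]  # invented toy data

# Smaller epsilon means stronger privacy and therefore more noise.
strict = mean_abs_error(ages, is_over_65, epsilon=0.1, rng=rng)
loose = mean_abs_error(ages, is_over_65, epsilon=10.0, rng=rng)
print(f"epsilon=0.1  -> mean absolute error {strict:.2f}")
print(f"epsilon=10.0 -> mean absolute error {loose:.2f}")
```

The stricter privacy setting yields a markedly less accurate count. Noise of a fixed scale also distorts small counts proportionally more than large ones, which gives some intuition for why small subgroups can be poorly represented. This floating-point sketch is not suitable for production use.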
Creating composite data means 'remix[ing] data from multiple study participants to tell a coherent story of a character that could have, but did not, exist in the real world' with the result that '[a] third-party reader may recognize an individual research participant [...] but that does not reveal anything new about the individual: the other actions attributed to that character could have been the deeds of others' (Arjomand, 2024, 437, 438). This practice is sometimes used in ethnographic research, but can also be employed more generally in the context of interview data. Different accounts of the process have been shared, including Arjomand's Sequence-Based Composites (SBC) method (2024) and Willis's four guiding practices, which include that '[a]ll quotations come directly from [...] interview transcripts' and '[o]ther details, such as where the interview took place; how the conversation evolved; and any paraphrasing of discussions, are taken directly from one of the source interviews' for each composite account (2019, 475). Yim and Schwartz-Shea (2022, 237) offer additional guidance on the stage of the research process at which composite construction should take place. In general, researchers engaged in composite construction are advised to develop, document and transparently share a robust protocol for constructing their composites.
Composite accounts allow the presentation of rich, embodied data despite the limitations imposed by privacy, acknowledging the complexity of participants' lived realities (Willis, 2019, 476). The creation of composite accounts can be understood not merely as the development of a shareable form of data but as both an act of data analysis in itself (Arjomand, 2024, 437) and an acknowledgement and foregrounding of the extent to which, as Markham states, '[a]ll data are narrowed, altered, and abstracted through various filters before they are analyzed' (2012, 341). Finally, Burles and Bally (2018) suggest that the '[c]reation of composite cases from thematic categories can [...] produce comprehensive findings that are accessible to diverse audiences' (7). The documented risks of composite accounts include excessive simplification or caricature of the data (Willis, 2019, 478) and the negative reception of the practice among research communities with differing epistemological approaches; for a detailed discussion of the latter, see Markham (2012).
Depending on its volume and form, ethically fabricated data may be shared within a research publication (as is most typical for composite data) and/or via an appropriate data repository; in either case, it should be accompanied by detailed documentation tracing the steps in its creation, including (in the case of synthetic data) the generative model or process used and the specific calibration employed.
References
Arjomand, N.A. (2024). 'Empirical Fiction: Composite Character Narratives in Analytical Sociology', The American Sociologist, 55(4), 436–472. https://doi.org/10.1007/s12108-022-09546-z
Burgard, J.P. et al. (2017). 'Synthetic Data for Open and Reproducible Methodological Research in Social Sciences and Official Statistics', Wirtschafts- und Sozialstatistisches Archiv, 11(3–4), 233–244. https://doi.org/10.1007/s11943-017-0214-8
Burles, M.C. and Bally, J.M.G. (2018). 'Ethical, Practical, and Methodological Considerations for Unobtrusive Qualitative Research About Personal Narratives Shared on the Internet', International Journal of Qualitative Methods, 17(1). https://doi.org/10.1177/1609406918788203
De Cristofaro, E. (2023). 'Synthetic Data: Methods, Use Cases, and Risks'. Preprint. ArXiv. https://doi.org/10.48550/arxiv.2303.01230
Jordon, J. et al. (2022). 'Synthetic Data—What, Why and How?' Report. https://royalsociety.org/-/media/policy/projects/privacy-enhancing-technologies/Synthetic_Data_Survey-24.pdf [accessed 02/10/25]
Kamelski, T. and Olivos, F. (2025). 'AI-Replicas as Ethical Practice: Introducing an Alternative to Traditional Anonymisation Techniques in Image-Based Research', Qualitative Research: QR. https://doi.org/10.1177/14687941241308705
Liu, Q. et al. (2025). 'Ensuring Privacy through Synthetic Data Generation in Education', British Journal of Educational Technology, 56(3), 1053–1073. https://doi.org/10.1111/bjet.13576
Markham, A. (2012). 'Fabrication as Ethical Practice: Qualitative Inquiry in Ambiguous Internet Contexts', Information, Communication & Society, 15(3), 334–353. https://doi.org/10.1080/1369118X.2011.641993
Nowok, B. et al. (2017). 'Providing Bespoke Synthetic Data for the UK Longitudinal Studies and Other Sensitive Data with the Synthpop Package for R', Statistical Journal of the IAOS, 33(3), 785–796. https://doi.org/10.3233/SJI-150153
Quintana, D. (2019). 'Synthetic Datasets: A Non-Technical Primer for the Biobehavioral Sciences'. Preprint. PsyArXiv. https://doi.org/10.31234/osf.io/dmfb3
Saunders, B. et al. (2015). 'Participant Anonymity in the Internet Age: From Theory to Practice', Qualitative Research in Psychology, 12(2), 125–137. https://doi.org/10.1080/14780887.2014.948697
Wang, P. et al. (2025). 'Advancing Organizational Science Through Synthetic Data: A Path to Enhanced Data Sharing and Collaboration', Journal of Business and Psychology, 40(4), 771–797. https://doi.org/10.1007/s10869-024-09997-w
Willis, R. (2019). 'The Use of Composite Narratives to Present Interview Findings', Qualitative Research: QR, 19(4), 471–480. https://doi.org/10.1177/1468794118787711
Yim, J.M. and Schwartz-Shea, P. (2022). 'Composite Actors as Participant Protection: Methodological Opportunities for Ethnographers', Journal of Organizational Ethnography, 11(3), 228–242. https://doi.org/10.1108/JOE-02-2021-0009