A method for machine learning generation of realistic synthetic datasets for validating healthcare applications
Arvanitis, Theodoros N., White, Sean, Harrison, Stuart, Chaplin, Rupert and Despotou, George (2022) A method for machine learning generation of realistic synthetic datasets for validating healthcare applications. Health Informatics Journal, 28 (2). 146045822210770. doi:10.1177/14604582221077000 ISSN 1460-4582.
PDF: WRAP-method-machine-learning-generation-realistic-sythetic-datasets-healthcare-applications-2022.pdf - Published Version (1247Kb). Available under License Creative Commons Attribution 4.0.
Official URL: http://dx.doi.org/10.1177/14604582221077000
Abstract
Digital health applications can improve the quality and effectiveness of healthcare by offering users a number of new tools, which are often considered medical devices. Assuring their safe operation requires, among other things, clinical validation, which in turn needs large datasets to test the applications in realistic clinical scenarios. Access to such datasets is challenging because of patient privacy concerns, and the development of synthetic datasets is seen as a potential alternative. The objective of the paper is to develop a method for generating realistic synthetic datasets that are statistically equivalent to real clinical datasets, and to demonstrate that a Generative Adversarial Network (GAN) based approach is fit for purpose. A GAN was implemented and trained, in a series of six experiments, using numerical and categorical variables, including ICD-9 and laboratory codes, from three clinically relevant datasets. A number of contextual steps provided the success criteria for the synthetic dataset. The method generated a synthetic dataset that exhibits statistical characteristics very similar to those of the real dataset: the pairwise association of variables is very similar, and a high degree of Jaccard similarity and a successful Kolmogorov-Smirnov (K-S) test further support this. The proof of concept of generating realistic synthetic datasets was successful, with the approach showing promise for further work.
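The two similarity checks named in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the abstract does not say how the K-S test or the Jaccard similarity were applied, so the sketch assumes a two-sample K-S statistic on a single numerical column and a Jaccard similarity between the sets of codes present in each dataset. The function names `ks_statistic` and `jaccard_similarity`, the sample values, and the code strings are all hypothetical.

```python
import random
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample K-S statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    # The gap can only peak at an observed value, so checking those is enough.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def jaccard_similarity(codes_a, codes_b):
    """Jaccard similarity |A & B| / |A | B| between two collections of codes."""
    set_a, set_b = set(codes_a), set(codes_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


# Hypothetical stand-ins for one numerical column (assumption, not the paper's data).
rng = random.Random(0)
real_values = [rng.gauss(5.0, 1.0) for _ in range(500)]
synthetic_values = [rng.gauss(5.0, 1.0) for _ in range(500)]
shifted_values = [rng.gauss(8.0, 1.0) for _ in range(500)]

# Hypothetical categorical codes, used only for illustration.
real_codes = ["250.00", "401.9", "428.0", "584.9"]
synthetic_codes = ["250.00", "401.9", "428.0", "272.4"]

print("K-S, similar distributions:", ks_statistic(real_values, synthetic_values))
print("K-S, shifted distribution:", ks_statistic(real_values, shifted_values))
print("Jaccard of code sets:", jaccard_similarity(real_codes, synthetic_codes))
```

A small K-S statistic suggests the synthetic column follows a distribution close to the real one, while a Jaccard value near 1 suggests the two datasets cover largely the same codes. A full test would also convert the statistic into a p-value, which this sketch omits.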
Item Type: | Journal Article |
---|---|
Subjects: | Q Science > QA Mathematics > QA76 Electronic computers. Computer science. Computer software; R Medicine > R Medicine (General) |
Divisions: | Faculty of Science, Engineering and Medicine > Engineering > WMG (Formerly the Warwick Manufacturing Group) |
Library of Congress Subject Headings (LCSH): | Neural networks (Computer science), Machine learning, Medical Informatics, Medicine -- Data processing |
Journal or Publication Title: | Health Informatics Journal |
Publisher: | Sage Publications Ltd. |
ISSN: | 1460-4582 |
Official Date: | 1 January 2022 |
Volume: | 28 |
Number: | 2 |
Article Number: | 146045822210770 |
DOI: | 10.1177/14604582221077000 |
Status: | Peer Reviewed |
Publication Status: | Published |
Access rights to Published version: | Open Access (Creative Commons) |
Date of first compliant deposit: | 13 April 2022 |
Date of first compliant Open Access: | 14 April 2022 |