Techniques to produce and evaluate realistic multivariate synthetic data
Blog Article
Abstract: Data modeling requires a sufficient sample size for reproducibility. A small sample size can inhibit model evaluation. A synthetic data generation technique addressing this small sample size problem is evaluated: from the space of arbitrarily distributed samples, a subgroup (class) has a latent multivariate normal characteristic; synthetic data can be generated from this class with univariate kernel density estimation (KDE); and synthetic samples are statistically similar to their respective samples. Three samples (n = 667) were investigated with 10 input variables (X). KDE was used to augment the sample size in X.
Maps produced univariate normal variables in Y. Principal component analysis in Y produced uncorrelated variables in T, where the probability density functions were approximated as normal and characterized; synthetic data were generated with normally distributed univariate random variables in T. Reversing each step produced synthetic data in Y and X. All samples were approximately multivariate normal in Y, permitting the generation of synthetic data. Probability density function and covariance comparisons showed similarity between samples and synthetic samples.
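The X to Y to T chain and its reversal can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not say how the maps to normal variables were built, so the sketch assumes a rank-based map (empirical CDF followed by the inverse standard normal CDF) and inverts it by interpolating the sample quantiles. The KDE augmentation step is left out for brevity, and the function names `to_normal`, `from_normal`, and `synthesize` are hypothetical.

```python
import numpy as np
from statistics import NormalDist

# Inverse standard normal CDF, applied elementwise to arrays.
_probit = np.vectorize(NormalDist().inv_cdf)


def to_normal(X):
    """Map each column of X to approximately standard normal scores (Y).

    Assumed map: rank -> empirical CDF value -> inverse normal CDF.
    """
    n = X.shape[0]
    ranks = X.argsort(axis=0).argsort(axis=0)  # ranks 0..n-1 within each column
    return _probit((ranks + 0.5) / n)


def from_normal(Y_new, X):
    """Reverse the map: interpolate normal scores back onto the quantiles of X.

    np.interp clamps at the ends, so synthetic values stay inside the
    observed range of each column.
    """
    n, d = X.shape
    grid = _probit((np.arange(n) + 0.5) / n)  # normal scores of the sorted sample
    X_sorted = np.sort(X, axis=0)
    cols = [np.interp(Y_new[:, j], grid, X_sorted[:, j]) for j in range(d)]
    return np.column_stack(cols)


def synthesize(X, m, seed=0):
    """Generate m synthetic rows that mimic the sample X (shape n x d)."""
    rng = np.random.default_rng(seed)
    Y = to_normal(X)                          # X -> Y (univariate normal columns)
    mu = Y.mean(axis=0)
    cov = np.cov(Y, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)      # PCA: cov = V diag(eigval) V^T
    # In T = (Y - mu) V the columns are uncorrelated with variances eigval,
    # so independent univariate normal draws are enough there.
    scale = np.sqrt(np.clip(eigval, 0.0, None))
    T_new = rng.normal(size=(m, X.shape[1])) * scale
    Y_new = T_new @ eigvec.T + mu             # T -> Y (reverse the PCA rotation)
    return from_normal(Y_new, X)              # Y -> X (reverse the normal map)
```

Calling `synthesize(X, m)` on an `n x d` array returns an `m x d` synthetic array. Comparing `np.cov(X, rowvar=False)` with the covariance of the output is one simple version of the covariance comparison mentioned above. The approach only makes sense if the sample is close to multivariate normal in Y, which is the latent characteristic the abstract refers to.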
A class of samples has a latent normal characteristic. For such samples, this approach offers a solution to the small sample size problem. Further studies are required to understand this latent class.