Handling Informative Missingness of Data in Genetic Association Studies
Some proportion of the predictor data collected for genetic association studies will be sporadically missing. As more sources of data are integrated into association models, more complex patterns of missingness will arise. Genotype imputation, the current solution for handling missing genotypes and untyped variants in GWAS datasets, generates probabilistic genotypes. Analogously, variants called from sequencing data are probabilistic values that are thresholded to produce either calls or missing values. The patterns of missingness in variant datasets are frequently correlated with the underlying values that are missing, so the missingness is informative and cannot justifiably be ignored. The statistical methods currently employed in association studies must be adapted, in a scalable way, to handle missingness more rigorously; this requires a fundamental restructuring of most statistical tools.
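As a minimal sketch of the thresholding step described above, the following Python example converts per-sample genotype probabilities into hard calls and marks low-confidence samples as missing. The function names `hard_call` and `dosage`, the 0.9 threshold, and the example probabilities are all hypothetical and chosen only for illustration; they do not correspond to any particular imputation or variant-calling tool.

```python
import numpy as np

MISSING = -1  # hypothetical sentinel for a missing genotype call


def hard_call(probs, threshold=0.9):
    """Threshold genotype probabilities into hard calls.

    probs: array of shape (n_samples, 3) holding the probabilities of
    carrying 0, 1, or 2 copies of the alternate allele.
    Returns the most probable genotype per sample, or MISSING when the
    largest probability falls below the (assumed) threshold.
    """
    probs = np.asarray(probs, dtype=float)
    best = probs.argmax(axis=1)
    confident = probs.max(axis=1) >= threshold
    return np.where(confident, best, MISSING)


def dosage(probs):
    """Expected alternate-allele count, which keeps the uncertainty."""
    probs = np.asarray(probs, dtype=float)
    return probs @ np.array([0.0, 1.0, 2.0])


# Illustrative probabilities only, not real data.
probs = np.array([
    [0.98, 0.02, 0.00],  # confident homozygous reference
    [0.05, 0.94, 0.01],  # confident heterozygous
    [0.40, 0.55, 0.05],  # ambiguous: becomes missing after thresholding
    [0.00, 0.30, 0.70],  # ambiguous: becomes missing after thresholding
])

calls = hard_call(probs)
expected_counts = dosage(probs)
print(calls)
print(expected_counts)
```

In this toy input, the samples that become missing are those whose probability mass is spread across genotypes, so whether a call is missing depends on the underlying genotype distribution. That is the sense in which thresholded missingness can be informative rather than ignorable.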