A fast and scalable framework for large-scale and ultrahigh-dimensional sparse regression with application to the UK Biobank
Autoři:
Junyang Qian aff001; Yosuke Tanigawa aff002; Wenfei Du aff001; Matthew Aguirre aff002; Chris Chang aff003; Robert Tibshirani aff001; Manuel A. Rivas aff002; Trevor Hastie aff001
Působiště autorů:
Department of Statistics, Stanford University, Stanford, CA, United States of America
aff001; Department of Biomedical Data Science, Stanford University, Stanford, CA, United States of America
aff002; Grail, Inc., Menlo Park, CA, United States of America
aff003
Vyšlo v časopise:
A fast and scalable framework for large-scale and ultrahigh-dimensional sparse regression with application to the UK Biobank. PLoS Genet 16(10): e32767. doi:10.1371/journal.pgen.1009141
Kategorie:
Research Article
doi:
https://doi.org/10.1371/journal.pgen.1009141
Souhrn
The UK Biobank is a very large, prospective population-based cohort study across the United Kingdom. It provides unprecedented opportunities for researchers to investigate the relationship between genotypic information and phenotypes of interest. Multiple regression methods, compared with genome-wide association studies (GWAS), have already been showed to greatly improve the prediction performance for a variety of phenotypes. In the high-dimensional settings, the lasso, since its first proposal in statistics, has been proved to be an effective method for simultaneous variable selection and estimation. However, the large-scale and ultrahigh dimension seen in the UK Biobank pose new challenges for applying the lasso method, as many existing algorithms and their implementations are not scalable to large applications. In this paper, we propose a computational framework called batch screening iterative lasso (BASIL) that can take advantage of any existing lasso solver and easily build a scalable solution for very large data, including those that are larger than the memory size. We introduce snpnet, an R package that implements the proposed algorithm on top of glmnet and optimizes for single nucleotide polymorphism (SNP) datasets. It currently supports ℓ1-penalized linear model, logistic regression, Cox model, and also extends to the elastic net with ℓ1/ℓ2 penalty. We demonstrate results on the UK Biobank dataset, where we achieve competitive predictive performance for all four phenotypes considered (height, body mass index, asthma, high cholesterol) using only a small fraction of the variants compared with other established polygenic risk score methods.
Klíčová slova:
Algorithms – Asthma – Body Mass Index – Genetics – Genome-wide association studies – Heredity – Hypercholesterolemia – Single nucleotide polymorphisms
Zdroje
1. Friedman J, Hastie T, Tibshirani R. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Edition. Springer series in statistics. Springer-Verlag; 2009.
2. Efron B, Hastie T. Computer Age Statistical Inference: Algorithms, Evidence, and Data Science. vol. 5. Cambridge University Press; 2016.
3. Dean J, Ghemawat S. MapReduce: Simplified Data Processing on Large Clusters. Commun ACM. 2008;51(1):107–113. doi: 10.1145/1327452.1327492
4. Zaharia M, Chowdhury M, Franklin MJ, Shenker S, Stoica I. Spark: Cluster Computing with Working Sets. In: Proceedings of the 2Nd USENIX Conference on Hot Topics in Cloud Computing. HotCloud’10. Berkeley, CA, USA: USENIX Association; 2010. p. 10–10. Available from: http://dl.acm.org/citation.cfm?id=1863103.1863113.
5. Abadi M, Barham P, Chen J, Chen Z, Davis A, Dean J, et al. TensorFlow: A System for Large-scale Machine Learning. In: Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation. OSDI’16. Berkeley, CA, USA: USENIX Association; 2016. p. 265–283. Available from: http://dl.acm.org/citation.cfm?id=3026877.3026899.
6. Tibshirani R. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society Series B (Methodological). 1996;58(1):267–288. doi: 10.1111/j.2517-6161.1996.tb02080.x
7. R Core Team. R: A Language and Environment for Statistical Computing; 2017. Available from: https://www.R-project.org/.
8. Friedman J, Hastie T, Tibshirani R. Regularization Paths for Generalized Linear Models via Coordinate Descent Journal of Statistical Software; 2010. http://dx.doi.org/10.18637/jss.v033.i01 20808728
9. Breheny P, Huang J. Coordinate Descent Algorithms for Nonconvex Penalized Regression, with Applications to Biological Feature Selection. The Annals of Applied Statistics. 2011;5(1):232–253. doi: 10.1214/10-AOAS388
10. Hastie T. Statistical Learning with Big Data; 2015. Presentation at Data Science at Stanford Seminar. Available from: https://web.stanford.edu/~hastie/TALKS/SLBD_new.pdf.
11. Bycroft C, Freeman C, Petkova D, Band G, Elliott LT, Sharp K, et al. The UK Biobank Resource with Deep Phenotyping and Genomic Data. Nature. 2018;562(7726):203–209. doi: 10.1038/s41586-018-0579-z 30305743
12. Visscher PM, Wray NR, Zhang Q, Sklar P, McCarthy MI, Brown MA, et al. 10 Years of GWAS Discovery: Biology, Function, and Translation. The American Journal of Human Genetics. 2017;101(1):5–22. doi: 10.1016/j.ajhg.2017.06.005 28686856
13. Chang CC, Chow CC, Tellier LC, Vattikuti S, Purcell SM, Lee JJ. Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience. 2015;4(1). doi: 10.1186/s13742-015-0047-8 25722852
14. Purcell S, Chang C. PLINK 1.9; 2015. Available from: www.cog-genomics.org/plink/1.9/.
15. Tibshirani R, Bien J, Friedman J, Hastie T, Simon N, Taylor J, et al. Strong Rules for Discarding Predictors in Lasso-Type Problems. Journal of the Royal Statistical Society Series B (Statistical Methodology). 2012;74(2):245–266. doi: 10.1111/j.1467-9868.2011.01004.x 25506256
16. Boyd S, Vandenberghe L. Convex Optimization. Cambridge university press; 2004.
17. Friedman J, Hastie T, Tibshirani R. Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software, Articles. 2010;33(1):1–22. 20808728
18. Cox DR. Regression Models and Life-Tables. Journal of the Royal Statistical Society Series B (Methodological). 1972;34(2):187–220. doi: 10.1111/j.2517-6161.1972.tb00899.x
19. Li R, Chang C, Justesen JM, Tanigawa Y, Qian J, Hastie T, et al. Fast Lasso method for Large-scale and Ultrahigh-dimensional Cox Model with applications to UK Biobank. Biostatistics, kxaa038 2020 doi: 10.1093/biostatistics/kxaa038 32989444
20. Zou H, Hastie T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2005;67(2):301–320. doi: 10.1111/j.1467-9868.2005.00503.x
21. Lello L, Avery SG, Tellier L, Vazquez AI, de los Campos G, Hsu SDH. Accurate Genomic Prediction of Human Height. Genetics. 2018;210(2):477–497. doi: 10.1534/genetics.118.301267 30150289
22. DeBoever C, Tanigawa Y, Lindholm ME, McInnes G, Lavertu A, Ingelsson E, et al. Medical Relevance of Protein-Truncating Variants across 337,205 Individuals in the UK Biobank Study. Nature Communications. 2018;9(1):1612. doi: 10.1038/s41467-018-03910-9 29691392
23. Wold H. Soft Modelling by Latent Variables: The Non-Linear Iterative Partial Least Squares (NIPALS) Approach. Journal of Applied Probability. 1975;12(S1):117–142. doi: 10.1017/S0021900200047604
24. Meinshausen N. Relaxed Lasso. Computational Statistics & Data Analysis. 2007;52(1):374–393. doi: 10.1016/j.csda.2006.12.019
25. Tanigawa Y, Li J, Justesen JM, Horn H, Aguirre M, DeBoever C, et al. Components of genetic associations across 2,138 phenotypes in the UK Biobank highlight adipocyte biology. Nature communications. 2019;10(1):4064. doi: 10.1038/s41467-019-11953-9 31492854
26. Ge T, Chen CY, Ni Y, Feng YCA, Smoller JW. Polygenic prediction via Bayesian regression and continuous shrinkage priors. Nature Communications. 2019;10(1):1776 doi: 10.1038/s41467-019-09718-5 30992449
27. Lloyd-Jones LR, Zeng J, Sidorenko J, Yengo L, Moser G, Kemper KE, et al. Improved polygenic prediction by Bayesian multiple regression on summary statistics. Nature Communications. 2019;10(1):1776. doi: 10.1038/s41467-019-12653-0 30992449
28. Purcell S, Chang C. PLINK 2.0; 2020. Available from: www.cog-genomics.org/plink/2.0/.
29. Zeng J, De Vlaming R, Wu Y, Robinson MR, Lloyd-Jones LR, Yengo L, et al. Signatures of negative selection in the genetic architecture of human complex traits. Nature Genetics. 2018;50(5):746–753. doi: 10.1038/s41588-018-0101-4 29662166
30. Silventoinen K, Sammalisto S, Perola M, Boomsma DI, Cornes BK, Davis C, et al. Heritability of Adult Body Height: A Comparative Study of Twin Cohorts in Eight Countries. Twin Research. 2003;6(5):399–408. doi: 10.1375/136905203770326402 14624724
31. Visscher PM, Medland SE, Ferreira MAR, Morley KI, Zhu G, Cornes BK, et al. Assumption-Free Estimation of Heritability from Genome-Wide Identity-by-Descent Sharing between Full Siblings. PLOS Genetics. 2006;2(3):e41. doi: 10.1371/journal.pgen.0020041 16565746
32. Visscher PM, McEvoy B, Yang J. From Galton to GWAS: Quantitative Genetics of Human Height. Genetics Research. 2010;92(5-6):371–379. doi: 10.1017/S0016672310000571 21429269
33. Zaitlen N, Kraft P, Patterson N, Pasaniuc B, Bhatia G, Pollack S, et al. Using Extended Genealogy to Estimate Components of Heritability for 23 Quantitative and Dichotomous Traits. PLOS Genetics. 2013;9(5):e1003520. doi: 10.1371/journal.pgen.1003520 23737753
34. Hemani G, Yang J, Vinkhuyzen A, Powell J, Willemsen G, Hottenga JJ, et al. Inference of the Genetic Architecture Underlying BMI and Height with the Use of 20,240 Sibling Pairs. The American Journal of Human Genetics. 2013;93(5):865–875. doi: 10.1016/j.ajhg.2013.10.005 24183453
35. Yang J, Benyamin B, McEvoy BP, Gordon S, Henders AK, Nyholt DR, et al. Common SNPs Explain a Large Proportion of the Heritability for Human Height. Nature Genetics. 2010;42:565. doi: 10.1038/ng.608 20562875
36. Yang J, Bakshi A, Zhu Z, Hemani G, Vinkhuyzen AAE, Lee SH, et al. Genetic Variance Estimation with Imputed Variants Finds Negligible Missing Heritability for Human Height and Body Mass Index. Nature Genetics. 2015;47:1114. doi: 10.1038/ng.3390 26323059
37. Lango Allen H, Estrada K, Lettre G, Berndt SI, Weedon MN, Rivadeneira F, et al. Hundreds of Variants Clustered in Genomic Loci and Biological Pathways Affect Human Height. Nature. 2010;467:832. doi: 10.1038/nature09410 20881960
38. Wood AR, Esko T, Yang J, Vedantam S, Pers TH, Gustafsson S, et al. Defining the Role of Common Variation in the Genomic and Biological Architecture of Adult Human Height. Nature Genetics. 2014;46:1173. doi: 10.1038/ng.3097 25282103
39. Marouli E, Graff M, Medina-Gomez C, Lo KS, Wood AR, Kjaer TR, et al. Rare and Low-Frequency Coding Variants Alter Human Adult Height. Nature. 2017;542:186. doi: 10.1038/nature21039 28146470
40. Parikh N, Boyd S. Proximal Algorithms. Foundations and Trends in Optimization. 2014;1(3):127–239. doi: 10.1561/2400000003
41. Xiao L. Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization. Journal of Machine Learning Research. 2010;11(88):2543–2596.
42. Duchi JC, Agarwal A, Wainwright MJ. Dual Averaging for Distributed Optimization: Convergence Analysis and Network Scaling. IEEE Transactions on Automatic Control. 2012;57(3):592–606. doi: 10.1109/TAC.2011.2161027
43. Bickel PJ, Ritov Y, Tsybakov AB. Simultaneous analysis of Lasso and Dantzig selector. Ann Statist. 2009;37(4):1705–1732. doi: 10.1214/08-AOS620
44. Zhao P, Yu B. On model selection consistency of Lasso. Journal of Machine learning research. 2006;7(90):2541–2563.
45. DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the Areas under Two or More Correlated Receiver Operating Characteristic Curves: A Nonparametric Approach. Biometrics. 1988;44(3):837–845. doi: 10.2307/2531595
46. Cortes C, Mohri M. Confidence intervals for the area under the ROC curve. In: Advances in Neural Information Processing Systems; 2005. p. 305–312.
47. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D. Principal Components Analysis Corrects for Stratification in Genome-Wide Association Studies. Nature Genetics. 2006;38:904. doi: 10.1038/ng1847
48. Patterson N, Price AL, Reich D. Population Structure and Eigenanalysis. PLOS Genetics. 2006;2(12):1–20. doi: 10.1038/ng1847 16862161
49. Kane MJ, Emerson J, Weston S. Scalable Strategies for Computing with Massive Data. Journal of Statistical Software. 2013;55(14):1–19. doi: 10.18637/jss.v055.i14
50. Sobel E, Lange K, Wu TT, Hastie T, Chen YF. Genome-Wide Association Analysis by Lasso Penalized Logistic Regression. Bioinformatics. 2009;25(6):714–721. doi: 10.1093/bioinformatics/btp041 19176549
51. El Ghaoui L, Viallon V, Rabbani T. Safe Feature Elimination for the Lasso and Sparse Supervised Learning Problems. arXiv preprint arXiv:10094219. 2010;.
52. Fan J, Lv J. Sure Independence Screening for Ultrahigh Dimensional Feature Space. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2008;70(5):849–911. doi: 10.1111/j.1467-9868.2008.00674.x
53. Wang J, Wonka P, Ye J. Lasso Screening Rules via Dual Polytope Projection. Journal of Machine Learning Research. 2015;16:1063–1101.
54. Zeng Y, Breheny P. The biglasso Package: A Memory-and Computation-Efficient Solver for Lasso Model Fitting with Big Data in R. arXiv preprint arXiv:170105936. 2017;.
55. Privé F, Blum MGB, Aschard H, Ziyatdinov A. Efficient Analysis of Large-Scale Genome-Wide Data with Two R packages: bigstatsr and bigsnpr. Bioinformatics. 2018;34(16):2781–2787. doi: 10.1093/bioinformatics/bty185 29617937
56. Huling JD, Qian PZ. Fast Penalized Regression and Cross Validation for Tall Data with the oem Package. arXiv preprint arXiv:180109661. 2018;.
57. Speliotes EK, Willer CJ, Berndt SI, Monda KL, Thorleifsson G, Jackson AU, et al. Association Analyses of 249,796 Individuals Reveal 18 New Loci Associated with Body Mass Index. Nature Genetics. 2010;42:937. doi: 10.1038/ng.686 20935630
58. Locke AE, Kahali B, Berndt SI, Justice AE, Pers TH, Day FR, et al. Genetic Studies of Body Mass Index Yield New Insights for Obesity Biology. Nature. 2015;518:197. doi: 10.1038/nature14177 25673413
59. Turner SD. qqman: An R Package for Visualizing GWAS Results Using Q-Q and Manhattan Plots. Journal of Open Source Software. 2018;3(25):731. doi: 10.21105/joss.00731
Článek vyšel v časopise
PLOS Genetics
2020 Číslo 10
- Antibiotika na nachlazení nezabírají! Jak můžeme zpomalit šíření rezistence?
- FDA varuje před selfmonitoringem cukru pomocí chytrých hodinek. Jak je to v Česku?
- Prof. Jan Škrha: Metformin je bezpečný, ale je třeba jej bezpečně užívat a léčbu kontrolovat
- Ibuprofen jako alternativa antibiotik při léčbě infekcí močových cest
- Jak a kdy u celiakie začíná reakce na lepek? Možnou odpověď poodkryla čerstvá kanadská studie
Nejčtenější v tomto čísle
- Evaluation of both exonic and intronic variants for effects on RNA splicing allows for accurate assessment of the effectiveness of precision therapies
- RNA-directed DNA Methylation
- The DNA methylome of human sperm is distinct from blood with little evidence for tissue-consistent obesity associations
- Correction: Molecular predictors of brain metastasis-related microRNAs in lung adenocarcinoma