Multiple lines of evidence indicate that different diseases share pathological pathways. From an epidemiological perspective, certain diseases frequently co-occur within the same individuals. From a genetic perspective, pleiotropy is widespread. In this work, we introduce, test and apply statistical methodology that aims to bridge these two perspectives and to better define how genetic risk factors influence the broad spectrum of common human diseases. Our approach applies topic modelling to routine healthcare data, specifically hospital records encoded using the International Classification of Diseases, 10th Revision (ICD10) ontology. Because the data are sparse, we introduce statistical methodology that uses the hierarchical structure of the ontology to create a prior on feature distributions. We combine this with Bayesian non-negative matrix factorization to develop our own methodology, named "treeLFA", which models the multi-morbidity patterns of common diseases as disease "topics". Individuals' estimated topic weights then provide derived phenotypes for association analysis. We first introduce the methodology and the computational methods used to estimate parameters. Using simulation, we then demonstrate that treeLFA outperforms other commonly used topic models in situations where the training data set is small or individuals frequently have multiple topics. We also assess treeLFA's performance under various forms of mis-specification. We then apply the methodology to data from UK Biobank (UKB), finding that inferred disease topics align well with current medical understanding and provide additional power for genetic discovery (69 new loci identified for common diseases in the UKB). However, by comparing patterns of genetic association of topics and single disease codes, we find that the majority of genetic effects (about two thirds of topic-associated loci) are readily identified at the single code level. 
Nevertheless, we also show that, for about two thirds of diseases, genetic risk prediction can be improved by leveraging the GWAS results for topics. We also explore modelling multi-morbidity at different resolutions (i.e., with different numbers of topics) and find that inference results and most associations are stable across resolutions. In summary, treeLFA provides a new data-driven approach to modelling the hidden structure of the high-dimensional and sparse phenome data in biobanks. Its inference results also offer insights for genetic and epidemiological studies, in terms of both risk prediction and biological understanding.